Search

From AlfrescoWiki

[edit] The Search API

Searches are defined using the org.alfresco.service.cmr.search.SearchParameters object. They are executed using the public SearchService bean available from the RepositoryServices helper bean. A search returns an org.alfresco.service.cmr.search.ResultSet, which is itself made up of org.alfresco.service.cmr.search.ResultSetRow objects. Each row in a result set refers to a node in the repository. The rows returned in the result sets from the SearchService are filtered to contain only the nodes to which the user executing the search has read access.


[edit] Search Parameters

A SearchParameters object allows the specification of:

  • the search language;
  • the search string;
  • the properties to be returned by the query;
  • query parameters;
  • sorting;
  • a limit to the size of the result set;
  • a limit to the number of permission evaluations carried out;
  • which data to search; and
  • the transactional isolation level to use for the search.


Currently, the supported search languages are:

  • Lucene; and
  • XPath.


The specification of the language is case insensitive.


The search string depends on the query language. Examples for all supported query languages are given below.


The data to search is specified by selecting the store in which to search. At present this is restricted to a single store. There is, in principle, no reason why we cannot search across multiple stores.


The properties returned by the query perform two roles: the first is to limit the number of properties returned, and the second is to recover related data using a simple relative path. For example, this allows for access to an attribute from the parent node. Note that this is not currently implemented.


Query parameters are used to define and substitute values into queries. The definition of query parameters is dependent on the query language; examples are given for each supported query language.


The results from a search can be sorted using any property that can be recovered directly from the nodes that are found, but not properties of ancestors or descendants. There are special keys to sort by index order and relevance/score. By default, search results are returned with the most relevant (i.e., highest score) first.


The maximum size of the result set can be set. The returned result set will be no bigger than this size.


The maximum number of permission evaluations may also be set. Only this number of nodes will be considered for inclusion in the result set. If you set this to 10,000 and you do not have read permission for the first 10,000 nodes, you will see no results, even though there may be a 10,001st node which you can see.
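The difference between the two limits can be illustrated with a small, self-contained sketch. It is not the Alfresco implementation: the class and method names (PermissionLimitSketch, filter) and the canRead predicate are hypothetical, and it simply assumes that candidates are checked in index order until either limit is reached.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class PermissionLimitSketch {
    /**
     * Returns the readable candidates, stopping once either the number of
     * permission checks or the number of accepted results hits its limit.
     */
    static <T> List<T> filter(List<T> candidates, Predicate<T> canRead,
                              int maxPermissionChecks, int maxResults) {
        List<T> visible = new ArrayList<>();
        int checks = 0;
        for (T candidate : candidates) {
            if (checks >= maxPermissionChecks || visible.size() >= maxResults) {
                break; // a limit was reached; later candidates are never examined
            }
            checks++;
            if (canRead.test(candidate)) {
                visible.add(candidate);
            }
        }
        return visible;
    }
}
```

With five candidates of which only the fifth is readable, a limit of four permission checks yields an empty list, while a limit of five yields the fifth candidate.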


The transactional isolation determines whether the search sees changes made in the current transaction. It will always see data committed by other transactions. In database terms, the default behaviour is READ_COMMITTED. If changes made in the current transaction are excluded from the search, then only data from committed transactions will be found.

[edit] Example API Call

In this example, we use lucene to find all nodes of the content type (using the first lucene query example). It assumes you have access to the ServiceRegistry bean via Spring injection, or some other means.

...
        SearchParameters sp = new SearchParameters();
        sp.addStore(getStoreRef());
        sp.setLanguage(SearchService.LANGUAGE_LUCENE);
        sp.setQuery("TYPE:\"{http://www.alfresco.org/model/content/1.0}content\"");
        ResultSet results = null;
        try
        {
            results = serviceRegistry.getSearchService().query(sp);
            for(ResultSetRow row : results)
            {
                NodeRef currentNodeRef = row.getNodeRef();
                ...
            }
        }
        finally
        {
            if(results != null)
            {
                results.close();
            }
        } 
...


It is important that the result set is closed after use, as some search implementations need this call to release resources. For example, the lucene implementation releases IO resources. The try/finally pattern above is recommended.

[edit] Lucene

The lucene query API is built on top of the standard lucene query parser. The basics of the query language can be found on the Lucene Web Site.


[edit] Understanding tokenisation

If an attribute is tokenised it goes into the index for searching as the tokens it generates. For example, "The quick Brown fox", by default, will be tokenised in the index as the lower case tokens "quick", "brown" and "fox". Some words, such as "the", are excluded as tokens. These are known as stop words. So if you try to search for the tokens "The" or even "the" you will find nothing.

Tokenisers are language specific and so are the stop words and actual tokens they generate. The tokens may not always be what you expect, particularly if the tokeniser uses stemming. Currently, tokenisation is set on the data dictionary type. This is picked up from a definition file that matches the locale of the server.

Again, using the example, "The quick Brown fox".

In lucene if you do a search

TEXT:"The quick Brown fox"

It will match, because text in quotes is tokenised for you before it is matched.


If you do not use quotes you do an exact match against the token: nothing is tokenised for you. It is assumed you know the tokens.


The following will work, as they all go through tokenisation.


TEXT:"quick" 

TEXT:"Brown" 

TEXT:"BROWN"

TEXT:"The quick Brown fox"

TEXT:"The Brown fox"

TEXT:"FOX"


Exact token matches will work.


TEXT:quick

TEXT:brown

TEXT:fox


These will not work: either the case is left as is while the tokens in the index are lower cased, or the word (like "the") is a stop word and is not indexed.


TEXT:"The"

TEXT:the

TEXT:Brown

TEXT:FOX


In summary, text in quotes is always tokenised for you and will work.

Text with no quotes needs a deeper understanding, so that the tokens used match those produced by the analyzer. This type of search should be avoided unless you know what you are doing.

Wild cards (* and ?) are not currently supported in phrases. The intention is to remove this restriction by supporting "Qui*" correctly. This will also integrate with stemming tokenisers.

At the moment "Qui*" matches the token "qui*" and not everything starting with qui as you may expect.
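The behaviour described in this section can be mimicked with a toy tokeniser. This is only a sketch under stated assumptions: the class TokeniserSketch, its tiny stop-word list and its matching rules are hypothetical stand-ins for a real, locale-specific lucene analyser, and stemming is not modelled at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokeniserSketch {
    // Hypothetical stop words; real analysers use longer, locale-specific lists.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    /** Lower-cases the text, splits on non-alphanumerics and drops stop words. */
    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    /** Quoted search: the query text is tokenised before matching. */
    static boolean matchesQuoted(String indexedText, String queryText) {
        List<String> queryTokens = tokenise(queryText);
        if (queryTokens.isEmpty()) {
            return false; // only stop words were given, e.g. "The"
        }
        return Collections.indexOfSubList(tokenise(indexedText), queryTokens) >= 0;
    }

    /** Unquoted search: the raw term must equal an indexed token exactly. */
    static boolean matchesUnquoted(String indexedText, String rawTerm) {
        return tokenise(indexedText).contains(rawTerm);
    }
}
```

Against the indexed text "The quick Brown fox", the quoted forms "BROWN" and "The Brown fox" match, while the unquoted terms Brown, FOX and the do not.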

[edit] Simple Queries

[edit] The query parser

The query parser is a minor modification of the lucene query parser. The only modifications are to support wildcards at the start of term queries and to support additional fields, as described in this document.

Note that queries consisting of a single "NOT" entry or a single term query preceded by "-" are not supported.


-TYPE:"cm:object"
NOT TYPE:"cm:object"

[edit] Finding Nodes By Type

To find all nodes of type cm:content, including all subtypes of cm:content.

 TYPE:"{http://www.alfresco.org/model/content/1.0}content"

The TYPE field does not support well known prefixes at the moment. It will soon support queries of the form

 TYPE:"cm:content"

Note that local names containing invalid XML attribute characters should be encoded according to ISO 9075.

[edit] Finding nodes that have a particular aspect

To find all nodes with the cm:titled aspect, including all derived aspects from cm:titled.

 ASPECT:"{http://www.alfresco.org/model/content/1.0}titled"

The ASPECT field does not support well known prefixes at the moment. It will soon support queries of the form

 ASPECT:"cm:titled"

[edit] Finding nodes by text property values

To find all nodes with the cm:name property containing the word banana.

 @cm\:name:"banana"

Note that lucene requires the : to be escaped using the \ character. You will have to escape the escape character in Java like "@cm\\:name:\"banana\""


You can use the full {namespace}localName version of the QName to identify the property, but you will have a bit more escaping to do.
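The two levels of escaping (one for lucene, one for Java string literals) are easy to get wrong by hand. A minimal sketch of a helper that builds the term from a prefixed property name is shown here; PropertyTermSketch and propertyTerm are hypothetical names, not part of the Alfresco API, and only the prefixed form of the property name is handled.

```java
final class PropertyTermSketch {
    /**
     * Builds a quoted property term such as @cm\:name:"banana".
     * The ':' in the prefixed name is escaped for lucene, and any double
     * quote inside the phrase is escaped so it cannot end the phrase early.
     */
    static String propertyTerm(String prefixedName, String phrase) {
        String field = "@" + prefixedName.replace(":", "\\:");
        String quoted = "\"" + phrase.replace("\"", "\\\"") + "\"";
        return field + ":" + quoted;
    }
}
```

Calling propertyTerm("cm:name", "green banana") produces the same text as the phrase example further down, without any hand-written backslashes in the calling code.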


To find all nodes with the cm:name property containing words starting with "ban".

 @cm\:name:ban*


To find all nodes with the cm:name property containing words ending with "ana".


 @cm\:name:*ana


Note: the standard lucene query parser does not allow wild cards at the start for performance reasons. Use this with caution.


To find all nodes with the cm:name property containing words containing "anan".


 @cm\:name:*anan*


To find all nodes with the cm:name property containing phrase "green banana".


 @cm\:name:"green banana"

[edit] Finding nodes by integer or long property values

To find all nodes with the integer property test:int set to 12.

 @test\:int:12

Note that leading zeros are ignored, so the following will also work.

 @test\:int:00012

Long properties can be queried in a similar manner.


[edit] Finding nodes by float and double property values

To find all nodes with the property test:float equal to 3.4.

 @test\:float:"3.4"


The search would be identical for double property values.


[edit] Finding nodes by content

Full text searches can be done in two ways. If the attribute cm:abstract is of type d:content then its full text content will be indexed and searchable in exactly the same way as the d:text attributes shown above. You will be able to search against the text of the content provided there is a transformation from its mime type to text/plain.


So if cm:abstract contains the plain text "The quick brown fox jumped over the lazy dog" you could find the node holding the property using:

 @cm\:abstract:"brown fox"

You can also use the TEXT field - this accumulates a broad full text search over all d:content type properties of the node. So to find nodes that have the word "lazy" in any content property.

 TEXT:"lazy"


Content may not be indexed correctly for several reasons. Currently these failures are recorded by inserting special tokens. These failures are not retried.

"nint"
- the content was not indexed as no appropriate transformation to text/plain was available.
"nitf"
- the content was not indexed as the transformation to text/plain failed with an exception.
"nicm"
- the content was not indexed as no content was found in the content store.

You can search for these tokens to find documents that had indexing problems. These tokens will also work in the UI search.

 TEXT:"nint"

As of version 2.0 all uploaded content is tokenised according to the user's locale. (It is not yet possible to specify the locale on upload.) At search time, the user's locale is used for tokenisation. For content, this locale is not currently picked up from the search parameters, nor can it be specified in the query.

A search is performed in the user's locale, except for MLText attributes, which can be found in other locales as specified on the search parameters. Content will also be found in the locales specified on the search parameters if the tokens generated match those in other languages. The content part of the query only generates tokens in one language, but those tokens are looked for in all the languages specified on the search parameters.

[edit] Finding nodes by content mimetype

All content type properties have an extended property to allow search by mimetype. This will be extended to include the size of the content in the future.

To find all cm:abstract content properties of mime type "text/plain".

 @cm\:abstract.mimetype:"text/plain"


Note: this extended attribute is not available for TEXT.mimetype as it may be made up of many different content types.


[edit] Finding nodes by date and time property values

At the moment the index contains only the date part of a date or datetime value. It expects the date in ISO 8601 datetime format "yyyy-MM-dd'T'HH:mm:ss.sssZ".

So to find date and datetime properties you would use something like:

@cm\:modified:"2006-07-20T00:00:00.000Z"
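A sketch of producing such a term from a Java date value follows. It assumes UTC and writes the trailing "Z" as a literal, with milliseconds formatted using java.time; DateTermSketch and dateTerm are hypothetical names and not part of the Alfresco API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class DateTermSketch {
    // ISO 8601 style timestamp in UTC, e.g. 2006-07-20T00:00:00.000Z
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                             .withZone(ZoneOffset.UTC);

    /** Builds a quoted date term for a prefixed property name. */
    static String dateTerm(String prefixedName, Instant instant) {
        String field = "@" + prefixedName.replace(":", "\\:");
        return field + ":\"" + FORMAT.format(instant) + "\"";
    }
}
```

For example, dateTerm("cm:modified", Instant.parse("2006-07-20T00:00:00Z")) gives the query text shown above.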

[edit] Finding nodes by QName

Each node is named within its parent. This name can be used as the basis of a search.


To find all the nodes with QName "user" in the name space with the standard prefix "usr"

 QNAME:"usr:user"

Note: this search type is related to path searches and only accepts name space prefixes and not full namespace entries. This is equivalent to the search PATH:"//usr:user"

Note: the local name of a QName is ISO9075 encoded. So to search for a QName of local name "space separated" and name space "example" you would use:

 QNAME:"example:space_x0020_separated"


[edit] Searching Multilingual Text Fields

All multi-lingual fields are indexed and tokenised according to the locale. The tokens are prefixed in the index with locale information.

When searching, the locale(s) to use can be specified on the SearchParameters. If they are not specified then the locale defaults to the user's login locale (the locale selected when the user logged in).

The search is restricted to the specific strings in just those locales. By default, the locale "fr" will only match "fr" and not "fr_CA". How locales expand can be configured in the search parameters. "fr" can match "fr" only, or "fr" and all countries and all variants. "fr_CA_SomeVariant" can match only "fr_CA_SomeVariant", or "fr_CA_SomeVariant", "fr_CA", and "fr".

In the default configuration MLText is only identified by language - country and variants are ignored. So searching in French will find all locales starting with fr.
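The expansion of a specific locale to its less specific forms can be pictured with a short sketch. This only illustrates the "fr_CA_SomeVariant" to "fr_CA" to "fr" case described above; LocaleExpansionSketch and expand are hypothetical names, and the way Alfresco actually configures and applies the expansion is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

final class LocaleExpansionSketch {
    /**
     * Returns the locale followed by each less specific form, obtained by
     * repeatedly dropping the last "_"-separated part.
     */
    static List<String> expand(String locale) {
        List<String> result = new ArrayList<>();
        String current = locale;
        while (true) {
            result.add(current);
            int cut = current.lastIndexOf('_');
            if (cut < 0) {
                break; // only the language is left
            }
            current = current.substring(0, cut);
        }
        return result;
    }
}
```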

If cm:mltitle were a ML string, it could be queried in Lucene using

@cm\:mltitle:"banana"

The locales specified on the search parameters or the user's default locale would govern what locales were matched.

It is intended that, at some point, locales can be specified in the Lucene query itself to support cross-language searches, by extending attribute names with locale information. This is not implemented yet. It is obviously useful where the token you wish to use is different in each language, e.g.

@cm\:mltitle_en:"banana" @cm\:mltitle_ja:"バナナ"

At the moment you should specify the two locales in the search parameters and use

@cm\:mltitle:"banana" @cm\:mltitle:"バナナ"


The tokenisation for each locale is picked up as defined by the data dictionary localisation. By default, the locales are: default(en), cn, cs, da, de, el, en, es, fr, it, ja, ko, nl, no, pt_BR, pt, ru, and sv. Some locales have alternatives.

[edit] Path Queries

The path to a node is the trail of QNames of the child relationships to get to the node.

If the root node contains only one child called "one" in namespace "example", and this node has a child called "two" in namespace "example" then the nodes in the repository can be identified by:

  • "/"
  • "/example:one" (a node specified by a child association "example:one" from the root node)
  • "/example:one/example:two" (a node specified by a child association "example:one" from the root node, and then "example:two" from this node.)

This is very similar to attribute names and how they are specified in XPath. There is a special PATH field available to support queries against P

Path queries support a subset of XPATH with the following axes:

  • child;
  • descendant-or-self; and
  • self.

It supports the following node tests:

  • name "name"
  • namespace qualified name "prefix:name"
  • the "*" character; and
  • namespace pattern "prefix:*".

You cannot find all nodes regardless of namespace; "*:woof" is invalid.

It supports the standard abbreviations

  • . (self::node())
  • // (descendant-or-self::node())

The default axis, if omitted, is "child::".

Predicates are not supported.

To find all nodes directly beneath the root node:

 PATH:"/*"

To find all nodes directly beneath the root node in the "sys" namespace:

 PATH:"/sys:*"

To find all nodes directly beneath the root node in the "sys" namespace and with local name "user":

 PATH:"/sys:user"

To find all nodes directly below "/sys:user"

 PATH:"/sys:user/*"

To find all nodes at any depth below "/sys:user"

 PATH:"/sys:user//*"

To find all nodes at any depth below "/sys:user" including the node "/sys:user"

 PATH:"/sys:user//."

To find all nodes with QName "sys:user" anywhere in the repository:

 PATH:"//sys:user"

To find all nodes at any depth below all the nodes with QName "sys:user" anywhere in the repository:

 PATH:"//sys:user//*"

[edit] Category Queries

Categories are treated as special PATHs to nodes.

There are no true child relationships between category type nodes and the things they categorize. However, these links can be searched using the special "member" QName. (If you try to follow these relationships via the node service this will not work.)

Categories themselves can be identified by a path starting with the QName of the aspect derived from "cm:classifiable" that defines them.

The following examples use the bootstrap categories. These are all categories in the "cm:generalclassifiable" classification.

To find all root categories in the classification:

 PATH:"/cm:generalclassifiable/*"

If you know there is a "cm:Software Document Classification" category but you do not know at what level it exists, and you want its direct members, use the query below. If there are two categories with the same association QName then both will be found.

PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification/member"

If you know there is a "cm:Software Document Classification" category but you do not know at what level it exists, and you want all of its members at any depth, use:

PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification//member"

Finding the direct subclassifications of "cm:Software Document Classification" is more complex, as you need to find any children that are not members of the category (i.e. things that are children but have not been categorised).

 +PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification/*"
 -PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification/member"

This finds any child that is not a "member" child.

Finding all subclassifications of "cm:Software Document Classification" at any depth works in the same way: you need to find any descendants that are not members of the category.

 +PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification//*"
 -PATH:"/cm:generalclassifiable//cm:Software_x0020_Document_x0020_Classification//member"

If you know "cm:Software Document Classification" is a root category you can omit the first //. For example, to find the direct members of the top level category "cm:Software Document Classification":

PATH:"/cm:generalclassifiable/cm:Software_x0020_Document_x0020_Classification/member"

If you have a category "cm:Alfresco Versions" as a root category and you want to check for a particular version, say 1.4, the query below will find the direct members of that version's category beneath the top level category "cm:Alfresco Versions". This query can then be combined with your other Lucene search queries.

query="PATH:\"/cm:generalclassifiable/cm:Alfresco_x0020_Versions/cm:\"1.4\"/member\""

[edit] Combined Queries

Any of the above queries can be combined using the standard lucene methods.

The prefixes:

  • - must not match
  • + must match
  • (no prefix) may match - if there are only unprefixed clauses one must match

You may also use AND, OR and NOT.

To match two attributes.

 +@test\:one:"mustmatch" +@test\:two:"mustalsomatch"
 @test\:one:"mustmatch" AND @test\:two:"mustalsomatch"

To match one or other attribute

 @test\:one:"maymatch" @test\:two:"maymatch"
 @test\:one:"maymatch" OR @test\:two:"maymatch"


To match one attribute and not another

 +@test\:one:"mustmatch" -@test\:two:"mustnotmatch"
 @test\:one:"mustmatch" AND NOT @test\:two:"mustnotmatch"

Any of the simple searches above may be combined in these ways.

For example, to restrict a search by location in the hierarchy, and category, and full text search against title.

 +PATH:"/cm:generalclassifiable/cm:Software_x0020_Document_x0020_Classification/member"
 +@cm\:title:"banana"
 +PATH:"/sys:user//*"
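When queries are assembled in code, the prefix form is straightforward to generate. The sketch below joins "must match" and "must not match" clauses, and rejects the case noted earlier where a query consists only of negative clauses. CombinedQuerySketch and combine are hypothetical names, not part of the Alfresco API, and the individual clauses are assumed to be already escaped.

```java
import java.util.ArrayList;
import java.util.List;

final class CombinedQuerySketch {
    /**
     * Prefixes each required clause with '+' and each excluded clause with
     * '-', then joins them with spaces.
     */
    static String combine(List<String> must, List<String> mustNot) {
        if (must.isEmpty()) {
            // Queries made up only of negative clauses are not supported.
            throw new IllegalArgumentException("At least one required clause is needed");
        }
        List<String> parts = new ArrayList<>();
        for (String clause : must) {
            parts.add("+" + clause);
        }
        for (String clause : mustNot) {
            parts.add("-" + clause);
        }
        return String.join(" ", parts);
    }
}
```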

[edit] Parameterised Queries

Strings of the form ${namespaceprefix:variablename} are used to parameterise queries.

So to parameterise a full text search using the parameter "sys:text":

        ...
        QueryParameterDefImpl paramDef = new QueryParameterDefImpl(QName.createQName("sys:text", namespacePrefixResolver), (DataTypeDefinition) null, true, "fox");
        ResultSet results = null;
        try
        {
           results = serviceRegistry.getSearchService().query(getStoreRef(), "lucene", "TEXT:\"${sys:text}\"", null,
                new QueryParameterDefinition[] { paramDef });
           ...
        }
        finally
        {
           if(results != null)
           {
              results.close();
           }
        }
        ...

This will search for nodes that contain "fox" in the TEXT. This does a straight replacement of pattern "${sys:text}" with "fox" in the query definition before executing the Lucene query.
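The straight replacement described here can be pictured as plain string substitution. The following is a sketch of the idea only, not the SearchService code: ParameterSubstitutionSketch and substitute are hypothetical names, and parameter values are inserted verbatim with no escaping.

```java
import java.util.Map;

final class ParameterSubstitutionSketch {
    /**
     * Replaces every occurrence of ${name} in the query with the value
     * registered for that name.
     */
    static String substitute(String query, Map<String, String> parameters) {
        String result = query;
        for (Map.Entry<String, String> entry : parameters.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }
}
```

Because the value is inserted as-is, a value containing a double quote would change the structure of the resulting query, so callers of a scheme like this need to validate or escape values themselves.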

[edit] Queries That Sort

        ...
        SearchParameters sp = new SearchParameters();
        sp.addStore(getStoreRef());
        sp.setLanguage(SearchService.LANGUAGE_LUCENE);
        sp.setQuery("PATH:\"//.\"");
        sp.addSort("ID", true);
        ResultSet results = null;
        try
        {
            results = serviceRegistry.getSearchService().query(sp);
            for(ResultSetRow row : results)
            {
                NodeRef currentNodeRef = row.getNodeRef();
                ...
            }
        }
        finally
        {
            if(results != null)
            {
                results.close();
            }
        } 
        ...


[edit] Range Queries

Range queries follow the lucene default query parser standard, with support for date, integer, long, float and double types.

To search for integer values between 0 and 10 inclusive for the attribute "test:integer"

 @test\:integer:[0 TO 10]


To search for integer values between 0 and 10 exclusive for the attribute "test:integer"

 @test\:integer:{0 TO 10}

The constants 0 and 10 are tokenised according to the property type. So the above search could also be used for long, float and double types.

The following could be used for float and double ranges.

 @test\:integer:[0.3 TO 10.5]


Date ranges can be specified as


 @test\:date:[2003\-12\-16T00:00:00 TO 2003\-12\-17T00:00:00]

Currently the time element in date time searches is ignored.
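A small helper can keep the bracket styles straight when ranges are built in code. This is a sketch only: RangeQuerySketch and range are hypothetical names, and the bounds are passed as already formatted (and, for dates, already escaped) strings.

```java
final class RangeQuerySketch {
    /**
     * Builds a range term for a prefixed property name. Square brackets
     * give an inclusive range, curly brackets an exclusive one.
     */
    static String range(String prefixedName, String from, String to, boolean inclusive) {
        String open = inclusive ? "[" : "{";
        String close = inclusive ? "]" : "}";
        String field = "@" + prefixedName.replace(":", "\\:");
        return field + ":" + open + from + " TO " + to + close;
    }
}
```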


[edit] Fields in the index and how they are exposed for queries

The lucene index is split into two types of data:

  • properties and other key information held about nodes; and
  • additional information for nodes that contain other nodes.

So a file would have one entry in the index, for all of its properties and key information. A folder will have at least two entries in the index: one for all of its properties and key information; and one entry for each of the paths to the folder, a container entry. The container entries are used to support hierarchical queries.


[edit] Fields present on all entries

ID
The full noderef for the node to which the entry applies (e.g. ID:"workspace://SpacesStore/19f238df-7b5a-11dc-8388-991c49e2eac8").

[edit] Fields present on node entries

@{uri}localname
Fields of this form are created in the index for each property that is indexed and/or stored. The uri is the full name space uri of the property (not the prefix) and the local name the name of the property. For example, "@{http://www.alfresco.org/model/content/1.0}content". When performing a search against the repository, well known prefixes defined by the models loaded by the data dictionary are available to identify attributes. So there are virtual fields of the form "@cm:content". The prefix form is converted into the previous full form to execute the query. The QueryParser will use the appropriate tokenisation method based on the type of the property as defined in the data dictionary.
Content has some additional information about the mimetype and size which can also be used in queries. These are of the form @cm:content.mimetype. (In future this will also support @cm:content.size for the size of the content in bytes and @cm:content.url for the internal content url.)
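The conversion from the prefix form to the full form can be sketched as a lookup in a prefix table. The class PrefixExpansionSketch and its single-entry table are hypothetical; in the repository the prefixes come from the models loaded by the data dictionary, which is not modelled here.

```java
import java.util.HashMap;
import java.util.Map;

final class PrefixExpansionSketch {
    // Hypothetical prefix table holding only the "cm" content model prefix.
    private static final Map<String, String> PREFIXES = new HashMap<>();
    static {
        PREFIXES.put("cm", "http://www.alfresco.org/model/content/1.0");
    }

    /** Converts "@prefix:localName" into "@{uri}localName". */
    static String expandField(String field) {
        if (!field.startsWith("@") || field.startsWith("@{")) {
            return field; // not a property field, or already in full form
        }
        int colon = field.indexOf(':');
        if (colon < 0) {
            return field; // no prefix present
        }
        String prefix = field.substring(1, colon);
        String uri = PREFIXES.get(prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Unknown prefix: " + prefix);
        }
        return "@{" + uri + "}" + field.substring(colon + 1);
    }
}
```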


ASPECT
The aspects that have been applied to the node as the full QName string, e.g. {uri}localname.


ASSOCTYPEQNAME
The type of the association from a parent.


FTSSTATUS
The status of the node index entry. This can be in three states: Clean, New and Dirty. New means the index contains a new entry but some properties have yet to be indexed in the background. Dirty means the node values have changed and some properties have yet to be indexed in the background. Clean means the index entry is up to date. Note that background indexing has to reindex all properties, as lucene does not support updates, only add and delete.


ISNODE
Set to T. This identifies a node index entry.


ISROOT
Set to T only for the root node, and F for all other node entries.


LINKASPECT
For parent child relationships this contains ""; for category entries it contains the QName of the aspect that identifies the classification.


PARENT
Contains the parent ids for the node


PRIMARYASSOCIATIONTYPEQNAME
The type of the primary association to the node


PRIMARYPARENT
The primary parent of the node.


QNAME
The QNames of the node in each of the parents and categories that contain it.


TEXT
The accumulated full text search index.


TX
The id of the transaction in which the index was updated.


TYPE
The primary type of the node.


If a node has more than one parent or is categorised then there will be multiple values for ASSOCTYPEQNAME, LINKASPECT, PARENT and QNAME. These entries are multi-valued and ordered with respect to each other: the first QNAME will be the name of the association in the first PARENT. The number of entries will be equal to the sum of the number of parents for the node and the number of categories into which the node has been placed.

[edit] Fields present on container entries to support hierarchical queries

ANCESTOR
This field stores the parent trail of node refs to the container


PATH
A series of ordered QNames representing a path to the container.


ISCATEGORY
T if this node is a category


ISCONTAINER
T as this is a container


[edit] XPath

XPath queries support the constructs of Jaxen version 1.1, as XPath is implemented as a document navigator against the NodeService.

It has some JCR 170 specific extensions for functions, to support multiple parents (which XML does not have), and to support sorting as required for JCR 170.


[edit] Example Queries

[edit] Example Parameterised Queries

[edit] Example Queries That Sort

[edit] ISO 9075 encoding

XML attribute names can only start with and contain certain characters.


Invalid characters are encoded in the form "_xhhhh_" where the four digits represent the hex encoding of the character. For example, the space character in "a b" would be encoded to give "a_x0020_b". If the initial string contains a pattern that matches "_xhhhh_" then the first "_" is encoded. For example "a_xabcd_b" would be encoded as "a_x005f_xabcd_b".
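The two rules above can be sketched in a few lines. This is a simplified illustration, not Alfresco's encoder: Iso9075Sketch is a hypothetical class, and its idea of a valid character (letters and "_" anywhere; digits, "-" and "." after the first position) is an assumption that only approximates the XML name rules.

```java
import java.util.regex.Pattern;

final class Iso9075Sketch {
    // Matches text that already looks like an encoded character, e.g. _x0020_
    private static final Pattern ESCAPE = Pattern.compile("_x[0-9a-fA-F]{4}_");

    /** Encodes invalid characters, and any '_' that starts an _xhhhh_ pattern. */
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean startsEscape = c == '_'
                    && ESCAPE.matcher(name).region(i, name.length()).lookingAt();
            if (startsEscape || !isAllowed(c, i == 0)) {
                out.append(String.format("_x%04x_", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Simplified validity check (assumption, see the text above the code).
    private static boolean isAllowed(char c, boolean first) {
        if (Character.isLetter(c) || c == '_') {
            return true;
        }
        return !first && (Character.isDigit(c) || c == '-' || c == '.');
    }
}
```

For example, encode("space separated") gives the local name used in the QNAME example earlier, "space_x0020_separated".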

[edit] APIs

Note there is no index service exposed to the outside world. There may be some exposure at the admin level through JMX.


[edit] Component Diagram

Image:Index-and-Search-Class-Diagram.png

[edit] Component APIs

[edit] Implementation

[edit] Lucene 2.0

As of version 2.0 we have moved to lucene version 2.0.0, but with FSDirectory patched to avoid issues with stale file handles. The standard lucene 2.0.0 jar is used: the changes are overlaid, as our version is found first in the deployment structure. The FSDirectory implementation can be specified as a Java property if required.

[edit] Requirement

A mechanism is required to search against the property, full text, content and semi-structured data in the hub. The structural data is in two forms: the parent - child relationship between nodes; and the location within hierarchies used for categorisation.

The persistence of the data may be separate from the index used to search and locate data. For example, indexing external content, or separating the storage of content from other information.


[edit] File Handles and lucene

[edit] Why does this happen?

Lucene version 1.4.3 has an issue where one of the operations we use does not create the compact file format (one file) but creates and leaves the "old" file format with many more files. This is not a major issue but can mean you just hit the limit for file handles on unix.


This issue is fixed in Alfresco release 1.4.3, and also in Alfresco 2.x, which uses lucene 2.0, where the issue is resolved.


It is safe to work around this issue by increasing the number of file handles available on unix. The issue is not going to recur. This is demonstrated by experience on the forum.


[edit] What to check first

On linux, the global setting for the maximum number of file handles is usually set in

/proc/sys/fs/file-max

Check that this is a large number.


As the user used to run alfresco, run:

ulimit -n

This will tell you how many file handles the current user can have open at any one time. This should be around 4096, which has worked well for us.


You should also check that the pam configuration enables the limits module, which it most likely will. In /etc/pam.d/system-auth you should see:

session     required      /lib/security/$ISA/pam_limits.so
session     required      /lib/security/$ISA/pam_unix.so


[edit] How to increase the number for file handles available?

Assuming you are on linux and running alfresco as the user "alfresco", then, as the root user, edit /etc/security/limits.conf and add:

alfresco soft nofile 4096
alfresco hard nofile 65536

This will set the normal number of file handles available to the alfresco user to 4096 (the soft limit). If this proves to be too few, which is unlikely, the alfresco user can then give themselves more file handles, up to the hard limit, using:

ulimit -n 8192