Full-Text Search Configuration

From AlfrescoWiki

The repository.properties file defines a number of properties that influence how all indexes behave. The main index and deltas all use the same configuration at the moment.

The data dictionary settings for properties determine how individual properties are indexed.

repository.properties

dir.indexes
The directory that contains all Lucene indexes and the deltas against those indexes.
dir.indexes.lock
The directory that contains the locks for the Lucene indexes.
Max Clauses (Lucene standard parameter)
Lucene limits the number of clauses in a boolean query to this value. Some queries are expanded under the covers into a boolean query with many clauses. For example, searching for luc* expands to a boolean query containing an "OR" clause for every token in the index that matches luc*.
Batch size (Alfresco indexing parameter)
The indexer keeps a list of what it has to do as changes are made through the node service API. Typically, many events would cause a node to be re-indexed. Keeping an event list means the actions can be optimised: the algorithm limits reindexes to one per batch, will not index if a delete is pending, and so on. When the list of events reaches this size, the whole event list is processed and the documents are added to the delta index.
Min Merge Docs (Lucene standard parameter)
This determines the size of the in-memory Lucene index used for each delta index. Higher values trade memory for less IO when writing to the index delta. The in-memory information is flushed and written to disk at the start of the next batch of index events, because processing the event list requires reads against the delta index. This does not affect the way information is stored on disk, only how it is buffered before it gets there.
Merge Factor (Lucene standard parameter - modified use)
This determines the number of index segments that are created on disk. When there are more than this number of segments, some segments are combined.
Max Merge Docs (Lucene standard parameter)
The maximum number of documents that can be stored in an index segment. When a segment reaches this value it will not grow any larger. As a result, there may be more segments than expected from looking at the merge factor.
Max Field Length (Lucene standard parameter)
The maximum number of tokens used to generate the index entry for a property. For full-text indexing, only this number of tokens is considered.
If the word muppet appears at the end of a 100,000 word document and this parameter is set to 4, the token will not be indexed and will not be found by searches.
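
The effect of Max Clauses and Max Field Length can be illustrated with a small toy model. The sketch below is not Lucene code: the function names, the exception class and the sample terms are made up for illustration, and a plain Python set stands in for the index.

```python
# Toy model only: real Lucene works on an inverted index, not Python sets.

class TooManyClauses(Exception):
    """Raised when an expanded query exceeds the clause limit (illustrative)."""


def expand_prefix(prefix, indexed_terms, max_clauses):
    """Expand a prefix wildcard such as luc* into one OR clause per matching term."""
    clauses = sorted(t for t in indexed_terms if t.startswith(prefix))
    if len(clauses) > max_clauses:
        raise TooManyClauses(f"{len(clauses)} clauses exceed the limit of {max_clauses}")
    return clauses


def tokens_to_index(text, max_field_length):
    """Keep only the first max_field_length tokens; later tokens are never indexed."""
    return text.lower().split()[:max_field_length]


terms = {"lucene", "lucid", "luck", "index", "delta"}
print(expand_prefix("luc", terms, max_clauses=10))   # ['lucene', 'lucid', 'luck']

doc = "the quick brown fox met a muppet"
print("muppet" in tokens_to_index(doc, max_field_length=4))      # False
print("muppet" in tokens_to_index(doc, max_field_length=10000))  # True
```

With a clause limit lower than the number of matching terms, the same wildcard search fails instead of returning results, which is the situation a larger Max Clauses value is meant to avoid.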

Lucene in Action, by Otis Gospodnetic and Erik Hatcher, describes the Lucene-specific terms in more detail and gives examples of their use and impact. The default values shown below are a reasonable trade-off.

dir.indexes=${dir.root}/lucene-indexes
dir.indexes.lock=${dir.indexes}/locks
# #################### #
# Lucene configuration #
# #################### #
#
# The maximum number of clauses that are allowed in a lucene query 
#
lucene.query.maxClauses=10000
#
# The size of the queue of nodes waiting for index
# Events are generated as nodes are changed, this is the maximum size of the queue used to coalesce event
# When this size is reached the lists of nodes will be indexed
#
lucene.indexer.batchSize=1000
#
# Lucene index min merge docs - the in memory size of the index 
#
lucene.indexer.minMergeDocs=1000
#
# When lucene index files are merged together - it will try to keep this number of segments/files in  
#
lucene.indexer.mergeFactor=10
#
# Roughly the maximum number of nodes indexed in one file/segment 
#
lucene.indexer.maxMergeDocs=100000
#
# The number of terms from a document that will be indexed
#
lucene.indexer.maxFieldLength=10000 
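
The directory settings above reference other properties with the ${...} syntax. As a rough illustration of how such references chain together, the following sketch resolves them in Python. It is not how Alfresco loads its configuration; the helper names and the dir.root value are assumptions chosen for the example.

```python
import re


def load_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def resolve(props, key):
    """Substitute ${name} references recursively; unknown names are left as-is.

    Cyclic references are not detected in this sketch.
    """
    def substitute(match):
        name = match.group(1)
        return resolve(props, name) if name in props else match.group(0)
    return _PLACEHOLDER.sub(substitute, props[key])


# dir.root is a hypothetical value; the other lines come from the listing above.
sample = """
dir.root=/srv/alf_data
dir.indexes=${dir.root}/lucene-indexes
dir.indexes.lock=${dir.indexes}/locks
lucene.indexer.batchSize=1000
"""

props = load_properties(sample)
print(resolve(props, "dir.indexes.lock"))  # /srv/alf_data/lucene-indexes/locks
```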

Data dictionary options

The indexing behaviour of each property can be set in the content model. By default, a property is indexed atomically, its value is not stored in the index, and it is tokenised when it is indexed.

The example below shows how indexing can be controlled.

Enabled="false"
If this is false, there will be no entry for this property in the index.
Atomic="true"
If this is true, the property is indexed in the transaction; if not, the property is indexed in the background.
Stored="true"
If true, the property value is stored in the index and may be obtained via the Lucene low level query API.
Tokenised="true"
If true, the string value of the property is tokenised before indexing; if false, it is indexed "as is" as a single string.

Content, if it is indexed at all, is always tokenised and never stored in the index.

The tokeniser is determined by the property type in the data dictionary. The choice is locale sensitive, as supported by the data dictionary, so you could switch to tokenising all your content in German. At the moment you cannot mix German and English tokenisation.

     <type name="cm:content">
        <title>Content</title>
        <parent>cm:cmobject</parent>
        <properties>
           <property name="cm:content">
              <type>d:content</type>
              <mandatory>false</mandatory>
              <index enabled="true">
                 <atomic>true</atomic>
                 <stored>false</stored>
                 <tokenised>true</tokenised>
              </index>
           </property>
        </properties>
     </type>
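
To check what a model fragment like the one above actually configures, the index settings can be read back with a few lines of Python. This is a minimal sketch using the standard xml.etree.ElementTree module; the index_settings helper is hypothetical, and the fragment is parsed as shown here, without the XML namespace that a complete model file would normally declare.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fragment above, with no namespace declaration.
FRAGMENT = """
<type name="cm:content">
   <properties>
      <property name="cm:content">
         <type>d:content</type>
         <index enabled="true">
            <atomic>true</atomic>
            <stored>false</stored>
            <tokenised>true</tokenised>
         </index>
      </property>
   </properties>
</type>
"""


def index_settings(type_xml):
    """Return {property name: index flags} for properties with an <index> element."""
    root = ET.fromstring(type_xml)
    settings = {}
    for prop in root.iter("property"):
        index = prop.find("index")
        if index is None:
            continue  # no explicit settings: the defaults apply
        settings[prop.get("name")] = {
            "enabled": index.get("enabled") == "true",
            "atomic": index.findtext("atomic") == "true",
            "stored": index.findtext("stored") == "true",
            "tokenised": index.findtext("tokenised") == "true",
        }
    return settings


print(index_settings(FRAGMENT))
# {'cm:content': {'enabled': True, 'atomic': True, 'stored': False, 'tokenised': True}}
```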
