Index Merging Performance
From alfrescowiki
This page documents the relationship between the distribution of documents across the index segments and merge performance.
For each Alfresco Store, there is a corresponding IndexInfo live file in /path/to/repository/lucene-indexes/${protocol}/${identifier}, for example /opt/alfresco/repos/3.3.3/lucene-indexes/workspace/SpacesStore/IndexInfo.
The class org.alfresco.repo.search.impl.lucene.index.IndexInfo contains a main method to read those files. The index info file contains, as its name implies, summary information about the status of the various underlying Lucene segments (committed, merge, merge target, ...) and the number of Lucene documents contained in each segment. The IndexInfoBackup file is a fallback, and is used if the IndexInfo file is missing or corrupted.
Note that the IndexInfo file does not need to be read on the production server. It can be copied to another server and read by passing its full path as a command line parameter. The class can be run with this very simple script, which runs, for example, off the Alfresco SDK:
#!/bin/bash
# $1 : path of the IndexInfo file to read
SDK_ROOT=/opt/alfresco/sdks/3.3.3 # replace as appropriate
MAIN_CLASS=org.alfresco.repo.search.impl.lucene.index.IndexInfo
# Build colon-separated classpath fragments from the SDK jars
THIRD_PARTY_JARS=$(find "$SDK_ROOT/lib/server/dependencies" -name '*.jar' | xargs | sed -e 's/ /:/g')
ALF_JARS=$(ls -1 "$SDK_ROOT"/lib/server/alfresco*.jar | xargs | sed -e 's/ /:/g')
# Quote "$@" so that a path containing spaces is passed as a single argument
java -classpath "$THIRD_PARTY_JARS:$ALF_JARS" "$MAIN_CLASS" "$@"
Note: A repository, database or Tomcat installation is NOT needed on the machine where the script is run.
Let's analyze some sample outputs from test repositories:
"Good" Sample output :
Entry List for /opt/alfresco/repos/3.3.0/lucene-indexes/workspace/SpacesStore
Size = 10
0 Name=8e6be575-e1d3-489f-b4c7-20337b700515 Type=INDEX Status=COMMITTED Docs=8527 Deletions=0
1 Name=752eac1c-7f29-47fe-992b-6ab9e4c850ac Type=INDEX Status=COMMITTED Docs=3024 Deletions=0
2 Name=b2c507b9-f4a3-4974-800c-60def99a4aa0 Type=INDEX Status=COMMITTED Docs=1512 Deletions=0
3 Name=0b80e759-8e15-4e8b-a786-f89210681d53 Type=INDEX Status=COMMITTED Docs=523 Deletions=0
4 Name=a6b147fe-f39b-402f-8479-61dabf879d65 Type=INDEX Status=COMMITTED Docs=92 Deletions=0
5 Name=724cb73a-e455-4da1-be76-44c6222d1d93 Type=DELTA Status=COMMITTED Docs=1 Deletions=1
6 Name=dc1a47b7-2243-407e-9fe4-c782f62746e1 Type=DELTA Status=COMMITTED Docs=1 Deletions=1
7 Name=df930230-089e-4b9c-bf68-3e500f13e679 Type=DELTA Status=COMMITTED Docs=1 Deletions=1
8 Name=82487ce7-654a-48d6-899f-d5887dda76d9 Type=DELTA Status=COMMITTED Docs=1 Deletions=1
9 Name=fbc386a6-9cd8-401c-9330-f2b79f851059 Type=DELTA Status=COMMITTED Docs=1 Deletions=1
...
"Bad" Sample Output :
Entry List for /opt/alfresco/repos/3.3.0-bad/lucene-indexes/workspace/SpacesStore
Size = 9
0 Name=ca612da7-fa5b-4612-aa1e-14e47ff97eb6 Type=INDEX Status=COMMITTED Docs=271702 Deletions=0
1 Name=9c75c493-4588-47cf-b9dd-4b5a67cfc0dc Type=INDEX Status=COMMITTED Docs=239888 Deletions=0
2 Name=3d185854-a40d-4de6-bab6-61d2888ec67c Type=INDEX Status=COMMITTED Docs=162640 Deletions=0
3 Name=fde91947-fb1e-406a-a411-ad103b0deccb Type=INDEX Status=COMMITTED Docs=154118 Deletions=0
4 Name=019578d2-cf7a-4067-91a0-16f2ecb82ee3 Type=INDEX Status=COMMITTED Docs=81467 Deletions=0
5 Name=79cd965e-6cbe-4559-98b7-0ae337b315c3 Type=DELTA Status=COMMITTED Docs=1 Deletions=2
6 Name=5c67efc8-7454-4c7a-9694-ebc7d0e53b55 Type=DELTA Status=COMMITTED Docs=0 Deletions=2
7 Name=be3bf0ba-e0dc-450c-a864-01f08024fcf5 Type=DELTA Status=COMMITTED Docs=0 Deletions=2
8 Name=71ebb8b3-17e0-4606-92df-d4e5a3c471a8 Type=DELTA Status=COMMITTED Docs=0 Deletions=2
...
It is usually good practice for the highest-numbered INDEX entry (the one containing the fewest documents, number 4 in the examples above) to contain no more than a few hundred documents. If that is not the case, merging operations can put a massive amount of IO pressure on the index directories. This only applies to COMMITTED index segments. If you have a lot of MERGE segments, it may make more sense to look at the file when the indexes are not currently being heavily merged (i.e. when most statuses are COMMITTED), as the numbers may change a lot in a short amount of time.
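The check described above can be sketched as a small shell function that reads saved output from the IndexInfo main method. This is only an illustration: the function name smallest_index_docs and the default threshold of 500 (standing in for "a few hundred") are assumptions, not part of Alfresco.

```shell
#!/bin/bash
# Sketch: report the doc count of the highest-numbered COMMITTED INDEX entry
# in saved IndexInfo output, and flag it when it exceeds a threshold.
# The function name and the default threshold (500) are illustrative assumptions.
smallest_index_docs() {
  # $1: file containing the IndexInfo output; $2: optional threshold
  awk -v limit="${2:-500}" '
    {
      type = status = docs = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "Type=INDEX") type = "INDEX"
        else if ($i == "Status=COMMITTED") status = "COMMITTED"
        else if ($i ~ /^Docs=[0-9]+$/) docs = substr($i, 6)
      }
      # Entries are listed in ascending order, so the last match wins
      if (type != "" && status != "" && docs != "") { last_entry = $1; last_docs = docs }
    }
    END {
      if (last_entry == "") { print "no committed INDEX entry found"; exit 2 }
      verdict = (last_docs + 0 > limit + 0) ? "HIGH" : "OK"
      printf "entry=%s docs=%s %s\n", last_entry, last_docs, verdict
    }
  ' "$1"
}
```

For example, after redirecting the output of the script above to a file, smallest_index_docs /tmp/indexinfo.txt prints the entry number, its doc count and a verdict; an optional second argument overrides the threshold.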
The number of on-disk segments is controlled by lucene.indexer.mergerTargetIndexCount for Alfresco versions >= 3.3.3 and by lucene.indexer.mergerMergeFactor for Alfresco versions < 3.3.3. The default for both properties is 5.
WARNING : this setting is only for advanced administrators. It is not recommended to experiment with these settings on a production server without a proper understanding of the merging behaviours, and/or under the guidance of Alfresco Support.
To achieve better performance for the "bad" sample above, the number of segments needs to be increased so that the docs are spread better, and the highest-numbered INDEX entry therefore holds fewer docs. Thus, the value can be increased from 5 to, for example, 8, so that the docs are spread over a larger number of segments. The value should not be raised too much, otherwise the distribution will become imbalanced, with the first entry containing most of the docs and the others very few.
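Before changing the value, it can help to see which of the two properties is already set explicitly. The following is a minimal sketch; the function name show_merge_settings is hypothetical, and the properties file to inspect depends on your installation, so its path is passed as an argument.

```shell
#!/bin/bash
# Sketch: list the merge-related properties that are set (not commented out)
# in a given properties file. The function name is a hypothetical helper.
show_merge_settings() {
  # $1: path to the properties file to inspect
  grep -E '^[[:space:]]*lucene\.indexer\.(mergerTargetIndexCount|mergerMergeFactor)[[:space:]]*=' "$1"
}
```

If the file contains the line lucene.indexer.mergerTargetIndexCount=8, the function prints that line unchanged. If it prints nothing, neither property is set in that file, and the default of 5 presumably applies unless the value is overridden elsewhere.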
The setting can take effect by either:
- Restarting and then triggering Lucene activity, for example by adding a few documents through the UI. The reorganization of the segments will not be immediate and will still take time; it can potentially be followed through the index web scripts, the FTSSTATUS queries, server activity, etc.
- Running a FULL reindex, if it's operationally possible.
After the reorganization is complete, you should see a better doc distribution in the segments, lower number of documents in the highest numbered INDEX entry, and lower IO pressure from the merging process.
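To compare the distribution before and after the reorganization, the share of documents held by each committed INDEX entry can be computed from saved IndexInfo output. The sketch below assumes the output format shown in the samples above; the function name index_distribution is hypothetical.

```shell
#!/bin/bash
# Sketch: print "<entry> <docs> <share of total>" for every COMMITTED INDEX
# entry found in saved IndexInfo output. The function name is hypothetical.
index_distribution() {
  # $1: file containing the IndexInfo output
  awk '
    {
      is_index = 0; is_committed = 0; docs = ""
      for (i = 1; i <= NF; i++) {
        if ($i == "Type=INDEX") is_index = 1
        else if ($i == "Status=COMMITTED") is_committed = 1
        else if ($i ~ /^Docs=[0-9]+$/) docs = substr($i, 6)
      }
      if (is_index && is_committed && docs != "") {
        n++; id[n] = $1; count[n] = docs; total += docs
      }
    }
    END {
      for (k = 1; k <= n; k++)
        printf "%s %d %.1f%%\n", id[k], count[k], (total > 0 ? 100 * count[k] / total : 0)
    }
  ' "$1"
}
```

Running it on output captured before and after the change shows whether the documents are now spread more evenly, and how many remain in the highest-numbered INDEX entry.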
Note: the Name entries correspond to filesystem directories in their respective stores. They are named either with a generated GUID or with their transaction ID / delta ID (also a GUID). (see org/alfresco/repo/search/impl/lucene/AbstractLuceneIndexerAndSearcherFactory#getIndexer(StoreRef) )
Note (prior to Alfresco 3.2): Due to http://issues.alfresco.com/jira/browse/ETHREEOH-2843, your mergeFactor or targetOverlayCount parameters are not applied if added in *.properties files, since they are not injected in the Spring XML configuration.