Index Version 2
This page explains the design of Alfresco's Lucene indexes and how they relate to each other. Version 2 has been introduced in Alfresco Version 1.4.
The index structure is made up indexes and deltas.
- Indexes are simply Lucene indexes that are concatenated in order.
- Deltas Indexes are lucene indexes plus a list of nodes that should be deleted from previously indexed information.
- IndexInfo file - the type, transactional status and order of these elements is stored in the indexInfo file along with other auxiliary information.
All the information from a transaction is stored as an index delta. The deletions should be applied to any previous indexes or deltas and then the index concatenated to the result.
For example, if the first transaction to an index committed nodes 1, 2, 3 and 4 it would contain a deletion list for 1, 2, 3 and 4 and then the index containing the information for nodes 1, 2, 3 and 4
Index Delta 1: Deletions: 1 2 3 4 Index: 1 2 3 4
The overall index is 1, 2, 3 and 4;
If the next transaction deletes node 1, updates node 2 and adds node 5.
Index Delta 1: Deletions: 1 2 3 4 Index: 1 2 3 4 Delta 2: Deletions: 1 2 5 Index: 2 5
The overall Index is 3, 4, 2 and 5
Lets say the next transaction updates node 4 deletes node 5 and adds node 6. This creates Delta Index 3.
Index Delta 1: Deletions: 1 2 3 4 Index: 1 2 3 4 Delta 2: Deletions: 1 2 5 Index: 2 5 Delta 3: Deletions: 4 5 6 Index: 4 6
The overall Index is 3, 2, 4 and 6
Merging the Delta Indexes
This index could be simplified in two ways - applying the deletions - merging into a bigger index
The first stage would be to apply the deletion defined in Delta 1 to give:
Index Index 1: Index: 1 2 3 4 Delta 2: Deletions: 1 2 5 Index: 2 5 Delta 3: Deletions: 4 5 6 Index: 4 6
Then to apply the deletions from Delta 2 to give:
Index Index 1': Index: 3 4 Index 2: Index: 2 5 Delta 3: Deletions: 4 5 6 Index: 4 6
Then to apply the deletions from Delta 3 to give :
Index Index 1: Index: 3 Index 2': Index: 2 Index 3: Index: 4 6
These indexes could then be merged to produce
Index Index 4: Index 3 2 4 6
The order of additions to the index being preserved.
The application of deletes and merging of indexes can be done in the background.
In this design, prepare has to position a delta in the list. Once done, its position is fixed. Creating the index delta and storing the deletes is done by the transaction thread. Commit is then just a status change which needs to be persisted to the IndexInfo file.
For recovery a duplicate index info file is created to allow for failure while writing one or the other.
A Lucene reader is required that applies deletes to one reader and appends another.
The components of the index remain unchanged unless they have deletions applied or are merged together. IndexReaders are valid until these actions occur. They can be cached and reused until they go out of date. This is managed via reference counting and an out of date flag.
After merges unused index information is cleaned up in the background.
The merge and deletion algorithm aims for a fixed number of index files waiting to be merged or deltas waiting to have deletions applied.
NIO is used for file locking, RandomAccessFile buffering and flushing to disk. This is sufficient to provide locking for sharing indexes between mutiple repositories. Additional work is required to verify and update index state as altered by other readers and to remove pending operations that hang because the JVM dies or looses communication.
- Shared Indexer
- recovery when one indexer is lost
- reconnecting (may have been rolled back by another indexer ....commit will just fail)
- Reporting and warnings
- Blocked merging of index deltas
- Split out keys for delete etc
- Contribute to lucene
- Fix against unit tests
- Index backup
- Don't stop indexing write/reads
- Tidy up unknown directories
- Expose lock for external actions?
- Federated Search
- overlays for federated seraches?
- In memory merges for small merge operations.
- No need to force the delta index to disk to do the search?
Why version 2?
Version 1 has several issues which are addressed by version 2
- IPC index lock and holding of index locks
- Version 2 users nio locking and disables lucene locking. This gives correct IPC and does not leave outstanding locks
- Commit performance
- V1 does index merging at commit time
- V2 commits a delta and deals with bulk merging etc in the background. Commit is very fast. Improves concurrency.
- Recovery, state and potential support for XA
- V2 holds index and transaction state allowing recovery and potential support for XA
- Shared indexer
- V1 two repositories can not use the same index
- well it is possible but there are lock issues and no way to clear confused locks automatically
- V2 Supports shared indexing and uses nio to do this.
- V1 two repositories can not use the same index
- V1 mutipe merges on commits
- V2 background merges can merge more - which is more efficient
- Use of reference counting searches
- V1 re-reads index info too often
- V2 manages what can be reused and supports reference counting for auto closing when unused and out of date
- Improved search performance
- V2 specifically supports the idea of transactions as overlays.
- There is no reason why we can not have long running overlays.