Content Transformation and Metadata Extraction with Apache Tika
From alfrescowiki
This document assumes knowledge of how to extend the repository configuration.
Introduction
From Project Swift onwards, Alfresco makes use of Apache Tika for both metadata extraction and content transformation. For metadata extraction, it allows easy extraction of the metadata of documents and their translation into your content model. For content transformation, it allows the production of plain text, HTML and XML (XHTML) versions of content.
The exact list of supported formats varies with the version of Tika in use. Project Swift uses Tika 0.8; the list of formats it supports will shortly be available at http://tika.apache.org/0.8/formats.html . For now, the formats supported by the previous release are listed at http://tika.apache.org/0.7/formats.html .
All the parsers which ship as standard with Tika are available under the Apache License (or similar), as are their dependencies. However, a few third-party parsers are available under different licenses (usually GPL); details are available from the Tika Wiki. See the sections below for how to enable these plugins if required.
Tika and Metadata Extraction
A number of Metadata Extractors are powered by Apache Tika. Many of the existing extractors in Alfresco have been converted to use Tika.
Auto Detect
The Auto-Detect parser allows the extraction of metadata from any files which are supported by Tika, but where no dedicated metadata extractor exists. It provides a common set of mappings from Tika metadata to the Alfresco content model, which will be used across all files that are handled by the auto-detect parser fall-back.
The auto-detect parser is provided by org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter, and as such the properties mapping is handled by /org/alfresco/repo/content/metadata/TikaAutoMetadataExtracter.properties. To add extra mappings, follow the Configuring an Extractor guide, applying it to the extracter.TikaAuto bean.
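As a rough sketch of what such an extra mapping could look like, the fragment below redefines the extracter.TikaAuto bean in an extension context file, reusing the inheritDefaultMapping and mappingProperties properties shown in the examples further down this page. The Tika key myTikaKey is hypothetical, and the class and parent attributes are assumed to match the definition shipped in your version; if the shipped bean in content-services-context.xml differs, start from that definition instead.

```xml
<!-- Sketch only: assumes the shipped extracter.TikaAuto bean uses this class and parent -->
<bean id="extracter.TikaAuto"
      class="org.alfresco.repo.content.metadata.TikaAutoMetadataExtracter"
      parent="baseMetadataExtracter">
   <!-- Keep the mappings from TikaAutoMetadataExtracter.properties -->
   <property name="inheritDefaultMapping">
      <value>true</value>
   </property>
   <!-- Extra mappings from Tika metadata keys to the content model -->
   <property name="mappingProperties">
      <props>
         <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
         <!-- "myTikaKey" is a hypothetical Tika metadata key -->
         <prop key="myTikaKey">cm:description</prop>
      </props>
   </property>
</bean>
```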
The auto-detect parser can be disabled just like any other Alfresco-supplied metadata extractor. Simply comment out the bean definition for extracter.TikaAuto inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart the repository. For more details, see the main Metadata Extractors page.
New Tika Parsers
Whilst Tika ships with a large number of file format parsers, it won't always cover all formats out of the box. All the parsers that ship with Tika depend on Apache Licensed (or compatible) libraries, which means that some parsers (typically those depending on GPL or proprietary libraries) cannot be shipped.
If you have an additional Tika parser that you wish to use within Alfresco, a small amount of coding and configuration is required. However, this is generally much less work than adding in a whole new metadata extractor from scratch.
For full control
First, create a new Metadata Extractor class that extends org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter. Your class should register the mimetypes it handles via the constructor of the superclass, and override the getParser method to return the appropriate Tika Parser object for your file type. If needed, you can also override the extractSpecific method to control the mapping. For an example of how quick and simple this can be, see org.alfresco.repo.content.metadata.DWGMetadataExtracter.
Once you have written your extractor class, you need to register it with the repository and configure the mappings. More details on how to do this are provided on the Configuring an Extractor page. The quick answer is to install the class files for your TikaPoweredMetadataExtracter subclass, the new Tika Parser and its dependent libraries into the repository, then register a new bean in an extension context file with a definition something like:
...
<bean id="extracter.MyCustomTika"
      class="com.example.mycompany.MyCustomTikaMetadataExtractor"
      parent="baseMetadataExtracter" >
   <property name="inheritDefaultMapping">
      <value>true</value>
   </property>
   <!-- Mappings from Tika metadata keys to your content model -->
   <property name="mappingProperties">
      <props>
         <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
         <prop key="user1">cm:description</prop>
      </props>
   </property>
</bean>
...
Letting spring handle it
Note that this isn't available in 3.4.a, but will be in 3.4 and 3.4.b.
If your Tika parser doesn't need any special work to process the output, you can simply wire the parser into the metadata extractor service with a Spring bean definition.
...
<bean id="extracter.MyCustomTika"
      class="org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter"
      parent="baseMetadataExtracter" >
   <!-- Specify either the class name, or a spring created parser bean -->
   <property name="tikaParserName">
      <value>example.HelloWorldParser</value>
   </property>
   <!--
   <property name="tikaParser">
      <ref bean="mySpringCreatedParser" />
   </property>
   -->
   <property name="inheritDefaultMapping">
      <value>true</value>
   </property>
   <!-- Mappings from Tika metadata keys to your content model -->
   <property name="mappingProperties">
      <props>
         <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
         <prop key="newTikaKey1">cm:description</prop>
         <prop key="newTikaKey2">cm:title</prop>
      </props>
   </property>
</bean>
...
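The commented-out tikaParser alternative above could be sketched as follows, with the parser itself declared as a Spring bean and passed in by reference. This assumes example.HelloWorldParser (the placeholder class used above) has a public no-argument constructor; the mapping keys are the same illustrative ones as before.

```xml
<!-- The parser instance is created by Spring rather than looked up by class name -->
<bean id="mySpringCreatedParser" class="example.HelloWorldParser" />

<bean id="extracter.MyCustomTika"
      class="org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter"
      parent="baseMetadataExtracter">
   <!-- Use tikaParser instead of tikaParserName; set one or the other -->
   <property name="tikaParser">
      <ref bean="mySpringCreatedParser" />
   </property>
   <property name="inheritDefaultMapping">
      <value>true</value>
   </property>
   <property name="mappingProperties">
      <props>
         <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
         <prop key="newTikaKey1">cm:description</prop>
      </props>
   </property>
</bean>
```

Declaring the parser as its own bean is mainly useful when the parser needs properties of its own set through Spring.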
Let Tika Handle It
Some third-party Tika plugins include the services file needed for them to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file, then it is probably correctly configured, and will be used by the Auto-Detect parser if you don't define your own Spring bean for it.
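For reference, such a services file follows the standard Java service provider format: a plain text file, named after the interface, listing one fully qualified implementation class per line. A minimal sketch, using the placeholder class example.HelloWorldParser from the bean definitions above:

```text
# Contents of META-INF/services/org.apache.tika.parser.Parser inside the parser JAR
# One fully qualified parser class name per line; lines starting with # are comments
example.HelloWorldParser
```

You can check whether a third-party JAR is set up this way by listing its contents and looking for this file.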
Tika and Content Transformation
A number of Content Transformers are powered by Apache Tika. Several of the existing to-plain-text transformers in Alfresco have been converted to use Tika.
Auto Detect
The Auto-Detect parser allows conversion to plain text, HTML or XML/XHTML for any files which are supported by Tika, but where no dedicated content transformer exists. It generally provides a transformed version containing most of the text, but which is light on formatting and layout. This is normally good enough for indexing and simple preview, but is not normally of the quality seen with transformers such as Open Office to PDF. However, it does easily and quickly allow some transformation for a wide range of formats that previously were not handled by Alfresco.
The auto-detect transformer is provided by org.alfresco.repo.content.transform.TikaAutoContentTransformer.
The auto-detect transformer can be disabled just like any other Alfresco supplied content transformer. Simply comment out the bean definition for transformer.TikaAuto inside <WEB-INF>/classes/alfresco/content-services-context.xml and restart the repository. For more details, see the main Content Transformer page.
New Tika Parsers
Whilst Tika ships with a large number of file format parsers, it won't always cover all formats out of the box. All the parsers that ship with Tika depend on Apache Licensed (or compatible) libraries, which means that some parsers (typically those depending on GPL or proprietary libraries) cannot be shipped.
If you have an additional Tika parser that you wish to use within Alfresco, a small amount of coding and configuration is required. However, this is generally much less work than adding in a whole new content transformer from scratch.
For full control
First, create a new Content Transformer class that extends org.alfresco.repo.content.transform.TikaPoweredContentTransformer. Your class should register the mimetypes it handles via the constructor of the superclass, and override the getParser method to return the appropriate Tika Parser object for your file type. For an example of how quick and simple this can be, see org.alfresco.repo.content.transform.PdfBoxContentTransformer.
Once you have written your transformer class, you simply need to install the classes and register the transformer with the repository. Full details on how to add a new transformer are provided on the Content Transformations page. The quick answer is to install the class files for your TikaPoweredContentTransformer subclass, the new Tika Parser and its dependent libraries into the repository, then register a new bean in an extension context file with a definition something like:
...
<bean id="transformer.MyCustomTika"
      class="com.example.mycompany.MyCustomTikaContentTransformer"
      parent="baseContentTransformer" />
...
Letting spring handle it
Note that this isn't available in 3.4.a, but will be in 3.4 and 3.4.b.
If your Tika parser doesn't need any special work to process the output, you can simply wire the parser into the content transformer service with a Spring bean definition.
...
<bean id="transformer.MyCustomTika"
      class="org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer"
      parent="baseContentTransformer">
   <!-- Specify either the class name, or a spring created parser bean -->
   <property name="tikaParserName">
      <value>example.HelloWorldParser</value>
   </property>
   <!--
   <property name="tikaParser">
      <ref bean="mySpringCreatedParser" />
   </property>
   -->
</bean>
...
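The tikaParser alternative for the transformer could be sketched in the same way, with the parser declared as a separate Spring bean. As before, this assumes the placeholder class example.HelloWorldParser has a public no-argument constructor.

```xml
<!-- The parser instance is created by Spring rather than looked up by class name -->
<bean id="mySpringCreatedParser" class="example.HelloWorldParser" />

<bean id="transformer.MyCustomTika"
      class="org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer"
      parent="baseContentTransformer">
   <!-- Use tikaParser instead of tikaParserName; set one or the other -->
   <property name="tikaParser">
      <ref bean="mySpringCreatedParser" />
   </property>
</bean>
```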
Let Tika Handle It
Some third-party Tika plugins include the services file needed for them to be detected and used by the Tika Auto-Detect parser. If the parser JAR includes a META-INF/services/org.apache.tika.parser.Parser file, then it is probably correctly configured, and will be used by the Auto-Detect parser if you don't define your own Spring bean for it.