Content Transformations
From AlfrescoWiki
This document assumes knowledge of how to extend the repository configuration.
Introduction
Alfresco has a built-in ability to transform documents between formats. A user can trigger a transformation manually or by invoking a rule on a space. A transformation converts content from one mimetype to another. To perform these transformations, Alfresco uses third-party libraries such as PDFBox and applications such as OpenOffice running on the server.
Configuration
The default content transformers are declared and initialized in <configRoot>/alfresco/content-services-context.xml, but extensions should be added to an extension XML configuration file, such as <extension>/alfresco/extension/my-transformers-context.xml.
The default transformer classes live in the package org.alfresco.repo.content.transform. Use the javadocs to see the effects of the different settings.
The configuration of the RuntimeExecutableContentTransformer and the ComplexContentTransformer is outlined below.
RuntimeExecutableContentTransformer
This transformer executes system executables. One example of such a program is the tidy utility, which can transform HTML documents into XHTML documents. Before running the command, the transformation mechanism substitutes the variables ${source} and ${target} with the full file paths of the source and target files for the transformation.
tidy -asxhtml -o "${target}" "${source}"
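The substitution step can be pictured with a minimal sketch. CommandTemplate and its expand method are hypothetical names used only for illustration; they are not part of the Alfresco API, and the real mechanism may differ in detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the Alfresco implementation: it only illustrates
// how ${source} and ${target} placeholders could be expanded in a command.
final class CommandTemplate {
    private CommandTemplate() {}

    static String expand(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace performs literal (non-regex) replacement, so "$"
            // and backslashes in Windows paths need no escaping.
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("source", "/tmp/in.html");
        values.put("target", "/tmp/out.xhtml");
        System.out.println(expand("tidy -asxhtml -o \"${target}\" \"${source}\"", values));
    }
}
```

Quoting the placeholders in the template, as the tidy command does, keeps paths that contain spaces intact.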
The transformer comes with an optional feature: checkCommand. This command, which cannot take any parameters, is executed by the init method. If an error occurs while it runs, the transformer is flagged as not available. While the transformer is unavailable, the getReliability method always returns 0.0; otherwise it is assumed that the transformation command will be successful. The transformation registry uses the reliability to select the most appropriate transformer for a given transformation. Even so, the transformer remains directly usable, e.g. if it is directly selected as an action to perform.
External utilities follow only a rough convention regarding return codes. In the case of tidy, an error condition results in a return code of 2. Use the errorCodes property to supply a comma-separated list of values that indicate failure. The default is "1, 2".
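As a rough illustration of how a comma-separated errorCodes value could be interpreted, consider the sketch below. The ErrorCodes class is a hypothetical stand-in written for this page, not the class Alfresco uses.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in: parses a value such as "1, 2" and reports whether
// a process exit code should be treated as a failure.
final class ErrorCodes {
    private final Set<Integer> failureCodes = new HashSet<>();

    ErrorCodes(String commaSeparated) {
        for (String token : commaSeparated.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                failureCodes.add(Integer.parseInt(trimmed));
            }
        }
    }

    boolean isFailure(int exitCode) {
        return failureCodes.contains(exitCode);
    }
}
```

With the value "2", as in the tidy configuration, an exit code of 1 would not count as a failure under this reading.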
<bean id="transformer.Tidy.XHTML" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
   <property name="checkCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
         <property name="commandMap">
            <map>
               <entry key=".*">
                  <value>tidy -help</value>
               </entry>
            </map>
         </property>
         <property name="errorCodes">
            <value>2</value>
         </property>
      </bean>
   </property>
   <property name="transformCommand">
      <bean class="org.alfresco.util.exec.RuntimeExec">
         <property name="commandMap">
            <map>
               <entry key="Linux">
                  <value>tidy -asxhtml -o '${target}' '${source}'</value>
               </entry>
               <entry key="Windows.*">
                  <value>tidy -asxhtml -o "${target}" "${source}"</value>
               </entry>
            </map>
         </property>
         <property name="errorCodes">
            <value>2</value>
         </property>
      </bean>
   </property>
   <property name="explicitTransformations">
      <list>
         <bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey">
            <constructor-arg><value>text/html</value></constructor-arg>
            <constructor-arg><value>application/xhtml+xml</value></constructor-arg>
         </bean>
      </list>
   </property>
</bean>
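The commandMap keys in this configuration (".*", "Linux", "Windows.*") look like patterns for the operating system name. The sketch below assumes they are regular expressions matched against a value such as the os.name system property, and that the first matching entry wins. Both points are assumptions made for illustration, and CommandSelector is a hypothetical class rather than Alfresco code.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration of choosing a command by operating system name.
final class CommandSelector {
    private CommandSelector() {}

    // Returns the command of the first key whose regex fully matches osName.
    // Iteration order is whatever the supplied map provides.
    static Optional<String> select(Map<String, String> commandMap, String osName) {
        for (Map.Entry<String, String> entry : commandMap.entrySet()) {
            if (osName.matches(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

Under this reading, a server whose OS name matches neither "Linux" nor "Windows.*" would find no transformCommand, while the ".*" key on checkCommand matches every platform.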
The following logging categories are useful to see what happens when the server starts up:
log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG
log4j.logger.org.alfresco.repo.content.transform.ContentTransformerRegistry=DEBUG
The parent bean, which performs initialization of the usual base class AbstractContentTransformer, will auto-register the transformer. The explicitTransformations property is required here because the transformer itself is not specific to any particular conversion. The transformer is now available for transformations from regular HTML to XHTML.
To make the transformation visible in the Alfresco Web Client, you also need to register the Transformation_Mimetype.
ComplexContentTransformer
If a transformation from mimetype X->Z is requested but only X->Y and Y->Z are available, the transformation registry will not automatically assume that chaining the transformations together is a useful thing to do. Linked transformations can be declared manually using the ComplexContentTransformer.
Extracting meaningful text from a Microsoft PowerPoint document is not directly possible. However, OpenOffice can export a PowerPoint presentation as a PDF document, and the PDFBox project provides a very efficient converter from PDF to text. The following transformer can convert .ppt files to plain text and is used extensively during full text indexing.
<bean id="transformer.complex.OpenOffice.PdfBox"
      class="org.alfresco.repo.content.transform.ComplexContentTransformer"
      parent="baseContentTransformer" >
   <property name="transformers">
      <list>
         <ref bean="transformer.OpenOffice" />
         <ref bean="transformer.PdfBox" />
      </list>
   </property>
   <property name="intermediateMimetypes">
      <list>
         <value>application/pdf</value>
      </list>
   </property>
</bean>
The intermediateMimetypes property explicitly lists the mimetypes along the desired transformation route.
A consequence of adding this particular converter to the registry is that PowerPoint documents become searchable through full text indexing. In fact, any document that OpenOffice can export to PDF becomes available for full text indexing.
When the registry asks the ComplexContentTransformer about a particular transformation, the reliability value returned is the product of the reliabilities of the individual transformations along the chain. It is therefore not necessary to specify the explicit transformations supported, but this means that you can't reuse the runtime executable transformers as part of the chain.
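Reading "the value returned" as the reliability score, the chain calculation can be sketched as a simple product. ChainReliability is a hypothetical helper and the figures used with it are invented for illustration only.

```java
// Hypothetical helper: multiplies the reliabilities of each step in a chain.
final class ChainReliability {
    private ChainReliability() {}

    static double combined(double... stepReliabilities) {
        double product = 1.0;
        for (double reliability : stepReliabilities) {
            product *= reliability;
        }
        return product;
    }
}
```

A single step that reports 0.0 drives the whole product to 0.0, so a chain is only as usable as its weakest link.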
Developing New Transformations
Let us assume that a third-party or in-house library exists for performing a transformation between two streams or files. Wiring that library into the Alfresco transformation registry involves writing a thin wrapper, with most of the support already provided by an abstract class and surrounding support classes. We'll use the org.alfresco.repo.content.transform.PdfBoxContentTransformer as an example.
Two methods need to be implemented when deriving from AbstractContentTransformer:
- getReliability
- transformInternal
getReliability Method
This method gives a best guess as to how accurate and reliable the transformation is going to be. If the transformer cannot possibly work, then 0.0 must be returned. The PDFBox transformer returns 0.0 for any transformation other than PDF to plain text.
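A minimal standalone sketch of that rule follows. The class name, the two-string method signature and the value 1.0 for the supported case are assumptions made for illustration; check the javadocs of AbstractContentTransformer for the actual method to override.

```java
// Hypothetical standalone version of a PDF-to-text reliability check.
final class PdfToTextReliability {
    static final String MIMETYPE_PDF = "application/pdf";
    static final String MIMETYPE_TEXT_PLAIN = "text/plain";

    private PdfToTextReliability() {}

    static double getReliability(String sourceMimetype, String targetMimetype) {
        if (!MIMETYPE_PDF.equals(sourceMimetype)
                || !MIMETYPE_TEXT_PLAIN.equals(targetMimetype)) {
            // The transformer cannot possibly work for this pair.
            return 0.0;
        }
        // Illustrative value only; a real transformer would pick its own score.
        return 1.0;
    }
}
```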
If two transformers perform the same transformation, the most reliable one will always be chosen. If two or more transformers exist with the same reliability, then they will be cycled until the fastest one is determined. The timing code is provided automatically by the AbstractContentTransformer.
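The selection rule can be approximated as "highest reliability first, then lowest average time". The sketch below is a simplification: it assumes timings are already known instead of cycling through candidates to measure them, and TransformerChoice is a hypothetical type, not a registry class.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical value type describing one candidate transformer.
final class TransformerChoice {
    final String name;
    final double reliability;
    final long averageTimeMs;

    TransformerChoice(String name, double reliability, long averageTimeMs) {
        this.name = name;
        this.reliability = reliability;
        this.averageTimeMs = averageTimeMs;
    }

    // Picks the most reliable candidate; ties go to the lowest average time.
    // Candidates reporting 0.0 are treated as unusable and skipped.
    static Optional<TransformerChoice> best(List<TransformerChoice> candidates) {
        Comparator<TransformerChoice> byReliabilityThenSpeed =
                Comparator.comparingDouble((TransformerChoice c) -> c.reliability)
                          .reversed()
                          .thenComparingLong(c -> c.averageTimeMs);
        return candidates.stream()
                .filter(c -> c.reliability > 0.0)
                .min(byReliabilityThenSpeed);
    }
}
```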
transformInternal Method
The ContentReader supplied allows direct access to the source content to transform. In our example, a string is extracted from the source input stream and then written directly to the ContentWriter. If it is necessary to work against physical files during the transformation, use org.alfresco.util.TempFileProvider#createTempFile to ensure that all temporary files get cleaned up appropriately. Otherwise, temporary files will not be cleaned up while the system is running. Do not use deleteOnExit: the Alfresco repository is designed to run under load indefinitely, i.e. until the next upgrade. Give your temp files meaningful prefixes, as this helps during debugging.
File tempFromFile = TempFileProvider.createTempFile("UnoContentTransformer", "." + getMimetypeService().getExtension(sourceMimetype));
File tempToFile = TempFileProvider.createTempFile("UnoContentTransformer", "." + getMimetypeService().getExtension(targetMimetype));
Be careful when accessing the content streams available from the ContentReader and ContentWriter directly. Failure to close the streams, regardless of success or failure, will result in an error message being written to the log output, and the stream will be held open indefinitely.
There is no need to handle any exceptions generated during the transformation, whether runtime exceptions or otherwise. The abstract base class will handle and report these as required.
protected void transformInternal(
        ContentReader reader,
        ContentWriter writer,
        Map<String, Object> options) throws Exception
{
    PDDocument pdf = null;
    InputStream is = null;
    try
    {
        is = reader.getContentInputStream();
        // stream the document in
        pdf = PDDocument.load(is);
        // strip the text out
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdf);
        // dump it all to the writer
        writer.putContent(text);
    }
    finally
    {
        // always release the document and the stream, even on failure
        if (pdf != null)
        {
            try { pdf.close(); } catch (Throwable e) { e.printStackTrace(); }
        }
        if (is != null)
        {
            try { is.close(); } catch (Throwable e) { e.printStackTrace(); }
        }
    }
}
Full Text Indexing
When content is uploaded by whatever means, the indexer component attempts to perform a transformation to plain text. First, a ContentTransformer is requested from the ContentTransformerRegistry. If one is available, then the transformation to plain text is performed.
There are three primary modes of transformation failure that are handled internally by the indexer. The indexer indexes some cryptic strings in place of the actual content (which is unavailable) in order to allow searching for content that experienced problems. The following can be searched for:
- nint - content not indexed due to no transformation to text being available
- nitf - content not indexed due to transformation failure
- nicm - content not indexed due to missing content
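The three placeholder tokens can be modelled as a small enum. The token strings come from the list above; the enum, its constant names and the order of the checks in classify are assumptions for illustration and do not describe the indexer's actual code.

```java
import java.util.Optional;

// Hypothetical model of the placeholder strings indexed on failure.
enum IndexPlaceholder {
    NO_TRANSFORMATION_AVAILABLE("nint"),
    TRANSFORMATION_FAILED("nitf"),
    CONTENT_MISSING("nicm");

    private final String token;

    IndexPlaceholder(String token) {
        this.token = token;
    }

    String token() {
        return token;
    }

    // Returns the placeholder to index, or empty when the text was extracted.
    // The order of the checks is an assumption made for this sketch.
    static Optional<IndexPlaceholder> classify(
            boolean contentPresent,
            boolean transformerAvailable,
            boolean transformationSucceeded) {
        if (!contentPresent) {
            return Optional.of(CONTENT_MISSING);
        }
        if (!transformerAvailable) {
            return Optional.of(NO_TRANSFORMATION_AVAILABLE);
        }
        if (!transformationSucceeded) {
            return Optional.of(TRANSFORMATION_FAILED);
        }
        return Optional.empty();
    }
}
```

Searching the index for one of these tokens then lists the documents that hit the corresponding problem.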
Note: Currently, the POI Excel to Text library may dump an exception to the console if the transformation failed. We have no control over this output.
Related Articles
- Tutorial Transforming - How transformations are set up via the UI
- Custom Actions - How the current transformation actions were implemented
- Custom Action UI - How the current transformation UI functionality was introduced
- Tiger OCR integration - Simple example implementation of a TIFF to PDF OCR transformation

