Content Transformations

This document assumes knowledge of how to extend the repository configuration.
Also see Content_Transformation_Debug


Introduction

Alfresco can transform content between different MIME types. A user can trigger a transformation manually or have it run automatically through a rule on a space. Transformations are carried out by third-party libraries, such as PDFBox, and by applications such as OpenOffice running server-side.

Configuration

The default Content transformers are declared and initialized in <configRoot>/alfresco/content-services-context.xml, but extensions should be added to an extension xml configuration file, such as <extension>/alfresco/extension/my-transformers-context.xml.

Default transformers are declared in the package org.alfresco.repo.content.transform.

  • Use the Javadocs to see the effects of the different settings.

The configuration of the RuntimeExecutableContentTransformer and the ComplexContentTransformer is outlined below.

RuntimeExecutableContentTransformerWorker (Alfresco >=3.2)

In Alfresco 3.2 the RuntimeExecutableContentTransformer was renamed to RuntimeExecutableContentTransformerWorker (see Upgrading_to_Alfresco_Community_Edition_3.2). Below is an example of a converter that transforms DWG drawings into PDF files (the utility is not freely available; the configuration serves as an example only).

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
	<bean id="transformer.worker.dwg2pdf" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
		<property name="mimetypeService">
			<ref bean="mimetypeService" />
		</property>
		<property name="checkCommand">
			<bean class="org.alfresco.util.exec.RuntimeExec">
				<property name="commandsAndArguments">
					<map>
						<entry key=".*">
							<list>
								<value>ls</value>
								<value>/usr/local/bin/dwg2pdf</value>
							</list>
						</entry>
					</map>
				</property>
			</bean>
		</property>

		<property name="transformCommand">
			<bean class="org.alfresco.util.exec.RuntimeExec">
				<property name="commandsAndArguments">
					<map>
						<entry key=".*">
							<list>
								<value>/usr/local/bin/dwg2pdf</value>
								<value>${source}</value>
								<value>${target}</value>
							</list>
						</entry>
					</map>
				</property>
				<property name="errorCodes">
					<value>1,2</value>
				</property>
			</bean>
		</property>

		<property name="explicitTransformations">
			<list>
				<bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
					<property name="sourceMimetype"><value>image/x-dwg</value></property>
					<property name="targetMimetype"><value>application/pdf</value></property>
				</bean>
			</list>
		</property>
	</bean>

	<bean id="transformer.dwg2pdf" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer">
		<property name="worker">
			<ref bean="transformer.worker.dwg2pdf" />
		</property>
	</bean>
</beans>

You can save this file in your extension directory with a name ending in -context.xml. This file was named dwg2pdf-transform-context.xml for convenience.

RuntimeExecutableContentTransformer

This transformer is able to execute system executables. An example of such a transformation utility program is the tidy utility program, which can transform HTML documents into XHTML documents. The transformation mechanism performs substitutions of the variables ${source} and ${target}, which are the full file paths of the source and target files for the transformation.

   tidy -asxhtml -o "${target}" "${source}"
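The substitution step can be pictured with a small standalone sketch. The class CommandTemplate and its expand method are hypothetical names invented for this illustration and are not part of Alfresco; the sketch only assumes that each ${name} placeholder is replaced literally by the matching file path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not an Alfresco class: expands ${name} placeholders in command arguments. */
final class CommandTemplate
{
    static List<String> expand(List<String> arguments, Map<String, String> variables)
    {
        List<String> expanded = new ArrayList<>();
        for (String argument : arguments)
        {
            String value = argument;
            for (Map.Entry<String, String> variable : variables.entrySet())
            {
                // String.replace substitutes literally, so "$" and "{" need no regex escaping
                value = value.replace("${" + variable.getKey() + "}", variable.getValue());
            }
            expanded.add(value);
        }
        return expanded;
    }

    public static void main(String[] args)
    {
        List<String> template = List.of("tidy", "-asxhtml", "-o", "${target}", "${source}");
        // Illustrative paths only; the real paths are supplied by the repository at transformation time
        Map<String, String> paths = Map.of("source", "/tmp/in.html", "target", "/tmp/out.xhtml");
        System.out.println(expand(template, paths));
    }
}
```

Keeping each argument as a separate list element, as the commandsAndArguments configuration does, avoids quoting problems when a path contains spaces.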

The transformer comes with an optional feature: checkCommand. This is executed by the init method. If an error occurs during execution of this command, which cannot take any parameters, then the transformer is flagged as not available. When not available, the getReliability method will always return 0.0, otherwise it is assumed that the transformation command will be successful. The reliability of the transformation is used by the transformation registry to select the most appropriate transformer for a given transformation. The transformer remains directly usable, e.g. if directly selected as an action to perform.

External utilities stick to a rough convention regarding the return codes. In the case of tidy an error condition results in a return code of 2. Use the errorCodes property to supply a comma separated list of values indicating failure. The default is "1, 2".
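As a rough illustration of how a comma-separated errorCodes value might be interpreted, the following standalone sketch parses the list and checks an exit code against it. ErrorCodes, parse and isFailure are hypothetical names chosen for this example; the actual parsing inside RuntimeExec may differ.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not an Alfresco class: interprets a comma-separated list of failure codes. */
final class ErrorCodes
{
    static Set<Integer> parse(String commaSeparated)
    {
        Set<Integer> codes = new TreeSet<>();
        for (String token : commaSeparated.split(","))
        {
            String trimmed = token.trim();
            if (!trimmed.isEmpty())
            {
                // tolerate spaces around the commas, as in "1, 2"
                codes.add(Integer.parseInt(trimmed));
            }
        }
        return codes;
    }

    static boolean isFailure(int exitCode, Set<Integer> errorCodes)
    {
        return errorCodes.contains(exitCode);
    }

    public static void main(String[] args)
    {
        Set<Integer> tidyCodes = parse("2");
        System.out.println("exit 1 is failure: " + isFailure(1, tidyCodes));
        System.out.println("exit 2 is failure: " + isFailure(2, tidyCodes));
    }
}
```

Restricting the list to 2 suits tidy under the assumption that its exit code 1 only signals warnings, so only code 2 is treated as a failed transformation.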

   <bean id="transformer.Tidy.XHTML" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformer" parent="baseContentTransformer">
      <property name="checkCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key=".*">
                        <value>tidy -help</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>2</value>
            </property>
         </bean>
      </property>
      <property name="transformCommand">
         <bean class="org.alfresco.util.exec.RuntimeExec">
            <property name="commandMap">
                <map>
                    <entry key="Linux">
                        <value>tidy -asxhtml -o '${target}' '${source}'</value>
                    </entry>
                    <entry key="Windows.*">
                        <value>tidy -asxhtml -o "${target}" "${source}"</value>
                    </entry>
                </map>
            </property>
            <property name="errorCodes">
               <value>2</value>
            </property>
         </bean>
      </property>
      <property name="explicitTransformations">
         <list>
            <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails" >
                <property name="sourceMimetype"><value>text/html</value></property>
                <property name="targetMimetype"><value>application/xhtml+xml</value></property>
            </bean>
         </list>
      </property>
   </bean>
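The entry keys in the command maps above (".*", "Linux", "Windows.*") read as regular expressions matched against the operating system name. The sketch below shows one way such a lookup could work; it is an assumption about the behaviour, not the RuntimeExec implementation, and OsCommandSelector is a hypothetical class. In particular, the first-match-wins rule used here is a simplification.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not an Alfresco class: picks a command by matching OS-name patterns. */
final class OsCommandSelector
{
    static Optional<String> select(Map<String, String> commandsByOsPattern, String osName)
    {
        for (Map.Entry<String, String> entry : commandsByOsPattern.entrySet())
        {
            // String.matches requires the whole OS name to match the pattern
            if (osName.matches(entry.getKey()))
            {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        // LinkedHashMap keeps declaration order, so earlier entries take precedence in this sketch
        Map<String, String> commands = new LinkedHashMap<>();
        commands.put("Linux", "tidy -asxhtml -o '${target}' '${source}'");
        commands.put("Windows.*", "tidy -asxhtml -o \"${target}\" \"${source}\"");
        System.out.println(select(commands, System.getProperty("os.name")));
    }
}
```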

The following logging categories are useful to see what happens when the server starts up:

log4j.logger.org.alfresco.util.exec.RuntimeExec=DEBUG
log4j.logger.org.alfresco.repo.content.transform.ContentTransformerRegistry=DEBUG

The parent bean, which performs initialization of the usual base class AbstractContentTransformer, will auto-register the transformer. The explicitTransformations property is required here because the transformer itself is not specific to any one conversion. The transformer is now available for transformations from regular HTML to XHTML.

Note: Prior to 3.x, the explicitTransformations were configured as follows:

<bean class="org.alfresco.repo.content.transform.ContentTransformerRegistry$TransformationKey" >
   <constructor-arg><value>text/html</value></constructor-arg>	
   <constructor-arg><value>application/xhtml+xml</value></constructor-arg>
</bean>

If you are upgrading to 3.x, take note of this change. The HTML to XHTML example above uses the new configuration, which sets properties on the ExplictTransformationDetails class (spelled as in the bean definitions above).

To make the transformation visible in the Alfresco Web Client, you also need to register the Transformation_Mimetype.

ComplexContentTransformer

If a transformation from mimetype X->Z is requested but only X->Y and Y->Z are available, the transformation registry will not automatically assume that chaining the transformations together is useful. Linked transformations can be declared manually using the ComplexContentTransformer.

Extracting text from PowerPoint documents

Extracting meaningful text from a Microsoft PowerPoint document is not directly possible. However, OpenOffice can export a PowerPoint presentation as a PDF document, and the PDFBox project provides a very efficient converter from PDF to text. The following transformer can convert .ppt files to .txt and is used extensively during full text indexing.

   <bean id="transformer.complex.OpenOffice.PdfBox"
        class="org.alfresco.repo.content.transform.ComplexContentTransformer"
        parent="baseContentTransformer" >
      <property name="transformers">
         <list>
            <ref bean="transformer.OpenOffice" />
            <ref bean="transformer.PdfBox" />
         </list>
      </property>
      <property name="intermediateMimetypes">
         <list>
            <value>application/pdf</value>
         </list>
      </property>
   </bean>

The intermediateMimetypes list explicitly lists the desired transformation route.
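To make the route concrete: a chain of N transformers needs N-1 intermediate mimetypes, and each transformer handles one hop. The following standalone sketch lists the hops for a given route. TransformationRoute is a hypothetical class written for this page, and the mimetype strings in main are illustrative values rather than constants taken from the repository.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not an Alfresco class: lists the hops of a chained transformation. */
final class TransformationRoute
{
    static List<String> hops(String source, List<String> intermediates, String target)
    {
        // The full route is: source, every intermediate in order, then the target
        List<String> stops = new ArrayList<>();
        stops.add(source);
        stops.addAll(intermediates);
        stops.add(target);

        // Each adjacent pair of stops is one hop, handled by one transformer in the chain
        List<String> hops = new ArrayList<>();
        for (int i = 0; i + 1 < stops.size(); i++)
        {
            hops.add(stops.get(i) + " -> " + stops.get(i + 1));
        }
        return hops;
    }

    public static void main(String[] args)
    {
        // Illustrative mimetypes for the PowerPoint example: two transformers, one intermediate
        hops("application/vnd.ms-powerpoint", List.of("application/pdf"), "text/plain")
                .forEach(System.out::println);
    }
}
```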

A consequence of adding this particular converter to the registry is that PowerPoint documents become searchable through full text indexing. In fact, any document that OpenOffice can export to PDF becomes available for full text indexing.

When the registry asks the ComplexContentTransformer about a particular transformation, the reliability value returned is the product of the reliabilities of the individual transformations along the chain. It is therefore not necessary to specify the explicit transformations supported, but this also means that you cannot reuse the runtime executable transformers as part of the chain.
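Reading the returned value as a reliability score between 0.0 and 1.0 (an interpretation based on the getReliability method described further down this page), the product rule can be sketched as follows. ChainReliability is a hypothetical class used only for illustration.

```java
/** Hypothetical helper, not an Alfresco class: combines per-hop reliabilities of a chain. */
final class ChainReliability
{
    static double combined(double... hopReliabilities)
    {
        double product = 1.0;
        for (double reliability : hopReliabilities)
        {
            // A single unavailable hop (0.0) makes the whole chain unavailable
            product *= reliability;
        }
        return product;
    }

    public static void main(String[] args)
    {
        System.out.println(combined(1.0, 0.5));
        System.out.println(combined(0.8, 0.0));
    }
}
```

One consequence of multiplying is that a chain can never score higher than its weakest hop, so a direct transformer of equal quality is always preferred over a chained one.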

Converting an AutoCAD drawing to SWF for the Alfresco Share preview

Using the converter shown above as the example for the RuntimeExecutableContentTransformerWorker, it is possible to convert DWG drawings to PDF files. Alfresco already has the ability to convert PDF files to SWF files using the SWFTools pdf2swf application. We can create a ComplexContentTransformer that uses both of these to create SWF previews from DWG files using the following configuration:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>

<beans>
    <bean id="transformer.complex.dwg2swf" class="org.alfresco.repo.content.transform.ComplexContentTransformer" parent="baseContentTransformer">
        <property name="transformers">
            <list>
                <ref bean="transformer.dwg2pdf" />
                <ref bean="transformer.Pdf2swf" />
            </list>
        </property>
        <property name="intermediateMimetypes">
            <list>
                <value>application/pdf</value>
            </list>
        </property>
    </bean>
</beans>

You can put this in extension/dwg2swf-transform-context.xml to have Alfresco pick it up. The file name must end in -context.xml.

Alternatives

There are also several alternative routes for the same DWG preview use case, chaining different tools through the same intermediate mimetypes approach as shown above. The command line options passed to each of the tools can also be adjusted depending on the project's particular needs.

Developing New Transformations

Let us assume that a 3rd party or in-house library exists for performing a transformation between two streams or files. Wiring that library into the Alfresco transformation registry involves writing a thin wrapper, with most of the support already provided by an abstract class and surrounding support classes. We'll use the org.alfresco.repo.content.transform.PdfBoxContentTransformer as an example.

Two methods need to be implemented when deriving from AbstractContentTransformer:

  • getReliability
  • transformInternal

getReliability Method

This method gives a best guess as to how accurate and reliable the transformation is going to be. If the transformer cannot possibly work, then 0.0 must be returned. In the case of PDFBox, 0.0 is returned for any transformation other than PDF to plain text.

If two transformers perform the same transformation, the most reliable one will always be chosen. If two or more transformers exist with the same reliability, they are cycled until the fastest one is determined. The timing code is provided automatically by the AbstractContentTransformer.
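The selection rule can be sketched in isolation. TransformerSelection is a hypothetical class and the transformer names in main are made up; the sketch covers only the "highest reliability wins" part and does not model the timing-based cycling between equally reliable transformers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper, not an Alfresco class: picks the transformer with the highest reliability. */
final class TransformerSelection
{
    static Optional<String> mostReliable(Map<String, Double> reliabilityByName)
    {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Double> candidate : reliabilityByName.entrySet())
        {
            // A reliability of 0.0 means "cannot perform this transformation" and is never selected
            if (candidate.getValue() > bestScore)
            {
                best = candidate.getKey();
                bestScore = candidate.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args)
    {
        Map<String, Double> candidates = new LinkedHashMap<>();
        candidates.put("transformer.A", 0.5);   // made-up names and scores
        candidates.put("transformer.B", 1.0);
        candidates.put("transformer.C", 0.0);
        System.out.println(mostReliable(candidates));
    }
}
```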

transformInternal Method

The ContentReader supplied allows direct access to the source content to transform. In our example, a string is extracted from the source input stream. This string is then written directly to the ContentWriter. If it is necessary to work against physical files during the transformation, use org.alfresco.util.TempFileProvider#createTempFile to ensure that all temporary files get cleaned up appropriately. Failure to do this will mean that temporary files are not cleaned up while the system is running. Do not use deleteOnExit: the Alfresco repository is designed to run under load indefinitely, i.e. until the next upgrade. Give your temp files meaningful prefixes, as this helps during debugging.

        File tempFromFile = TempFileProvider.createTempFile("UnoContentTransformer", "." + getMimetypeService().getExtension(sourceMimetype));
        File tempToFile = TempFileProvider.createTempFile("UnoContentTransformer", "." + getMimetypeService().getExtension(targetMimetype));

Be careful when accessing the content streams available from the ContentReader and ContentWriter directly. Failure to close the streams, regardless of success or failure, will result in an error message being generated to the log output and the stream will be held open indefinitely.

There is no need to handle any exceptions generated during the transformation, either Runtime or otherwise. The abstract base class will handle and report these as required.

    protected void transformInternal(
            ContentReader reader,
            ContentWriter writer,
            Map<String, Object> options) throws Exception
    {
        PDDocument pdf = null;
        InputStream is = null;
        try
        {
            is = reader.getContentInputStream();
            // stream the document in
            pdf = PDDocument.load(is);
            // strip the text out
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(pdf);
            
            // dump it all to the writer
            writer.putContent(text);
        }
        finally
        {
            if (pdf != null)
            {
                try { pdf.close(); } catch (Throwable e) {e.printStackTrace(); }
            }
            if (is != null)
            {
                try { is.close(); } catch (Throwable e) {e.printStackTrace(); }
            }
        }
    }

Full Text Indexing

When content is uploaded by whatever means, the indexer component attempts to perform a transformation to plain text. First, a ContentTransformer is requested from the ContentTransformerRegistry. If one is available, the transformation to plain text is performed.

There are three primary modes of transformation failure that are handled internally by the indexer. The indexer indexes some cryptic strings in place of the actual content (which is unavailable) in order to allow searching for content that experienced problems. The following can be searched for:

  • nint - content not indexed due to no transformation to text being available
  • nitf - content not indexed due to transformation failure
  • nicm - content not indexed due to missing content
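The three failure modes can be summarised as a small decision function. This is a sketch of the behaviour described above, not the indexer's code: IndexFallback and its method are hypothetical, the order in which the conditions are checked is an assumption, and a null result stands in for a failed transformation.

```java
/** Hypothetical helper, not an Alfresco class: chooses what text the indexer would store. */
final class IndexFallback
{
    static final String NO_TRANSFORMATION = "nint";
    static final String TRANSFORMATION_FAILED = "nitf";
    static final String CONTENT_MISSING = "nicm";

    static String textToIndex(boolean contentExists, boolean transformerAvailable, String transformedText)
    {
        if (!contentExists)
        {
            return CONTENT_MISSING;
        }
        if (!transformerAvailable)
        {
            return NO_TRANSFORMATION;
        }
        if (transformedText == null)
        {
            // null models a transformation that was attempted but failed
            return TRANSFORMATION_FAILED;
        }
        return transformedText;
    }

    public static void main(String[] args)
    {
        System.out.println(textToIndex(true, false, null));
        System.out.println(textToIndex(true, true, "extracted text"));
    }
}
```

Searching the index for one of the three tokens then returns the documents that hit the corresponding problem.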

Note: Currently, the POI Excel to Text library may dump an exception to the console if the transformation failed. We have no control over this output.

Checking Registered Transformers (Alfresco >=3.4)

To find out which transformers are currently registered and active within Alfresco, you can use an admin webscript that was introduced in 3.4. It is available at /alfresco/service/mimetypes (e.g. http://localhost:8080/alfresco/service/mimetypes).

This lists all the currently registered mimetypes and provides a details link for each one. Selecting the details link shows which transformations are currently supported both to and from that mimetype, and by which transformer. If a transformer becomes unavailable (e.g. the OpenOffice connection dies), refreshing the list will show the updated transformations.

Usage Examples

JavaScript

The content transformer is not available as a root service in JavaScript, but it is available as an action.

var action = actions.create("transform"); 
// Store the transformed version in the same folder as the source
action.parameters["destination-folder"] = document.parent; 
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains"; 
action.parameters["assoc-name"] = document.name + "transformed"; 
// If it's already plain text, make html, otherwise make plain text
if(document.mimetype == "text/plain") { 
   action.parameters["mime-type"] = "text/html"; 
} else {
   action.parameters["mime-type"] = "text/plain"; 
}
// Execute
action.execute(document); 

"The preview could not be loaded from the server" (3.4.d bug)

In 3.4.d, the pdf2swf conversion used by default for previews is broken, at least in the Linux package. The cause is a corrupted library link (libstdc++.so.5), which can be fixed as follows:

  
$ cd $YOUR_ALFRESCO_DIRECTORY/common/lib  
$ rm libstdc++.so.5  
$ ln -s libstdc++.so.5.0.3 libstdc++.so.5  

Now it should work fine.

© 2014 Alfresco Software, Inc. All Rights Reserved.