Information | Content Transformers |
---|---|
Support Status | Full Support |
Architecture Information | Platform Architecture |
Description | Content transformers transform one type of content into another, such as a HTML
file into a PDF file. They are used to enable indexing, thumbnails, and preview of
content. If the target type cannot be achieved with one transformation several
transformations can be chained together, such as JSON to HTML to PDF. To implement these
transformations third party tools such as PDFBox, OpenOffice, ImageMagic, and Apache Tika
are used. There are a number of transformations supported out of the box that you should familiarize yourself with before implementing a custom transformer, such as:
These are just a few of the supported transformations and they can also be combined to form so called transformation pipelines when it is not possible to go directly from source mimetype to target mimetype. Renditions are related to Transformations and they will be covered at the end of this article. To find out what transformers are currently registered and active within an SkyVault Content Services installation, you can use an admin Web Script. This is available at http://localhost:8080/alfresco/service/mimetypes. This will list all the currently registered mimetypes, and provide a details link for each one. Selecting the details link will then show which transformations are currently supported both to and from that mimetype, and by what transformer. If a transformer becomes unavailable (for example if the Open Office connection fails), then refreshing the list will show the updated transformations. When working with transformations and renditions it is important to make sure that the involved mimetypes are known to SkyVault Content Services. So when accessing the "mimetypes" Web Script make sure the mimetypes that will be used in transformations and renditions are included there, if not you would have to register them with SkyVault Content Services, see the Mimetypes extention point for more information about that. The Spring bean definitions for the transformer implementations can be found in the content-services-context.xml file. This file is contained in the repository JAR and can be found as follows in an installation: $ find . -name "*.jar" | xargs grep "content-services-context.xml" Binary file ./tomcat/webapps/alfresco/WEB-INF/lib/alfresco-repository-5.1.d-EA.jar matches Inside this XML file are the bean definitions for the transformer implementations, such as: <bean id="transformer.PdfBox" class="org.alfresco.repo.content.transform.PdfBoxContentTransformer" parent="baseContentTransformer" > <property name="documentSelector" ref="pdfBoxEmbededDocumentSelector" /> </bean> The transformer bean definition does not contain the source to target mimetype transformations it supports, rather this is contained in a properties file for easier management by for example a System Administrator (and these properties can be re-defined in SkyVault-global.properties). This properties file is also located in the repository JAR (/alfresco/subsystems/Transformers/default) and is called transformers.properties. The configuration for the PdfBoxContentTransformer is as follows: content.transformer.PdfBox.priority=110 content.transformer.PdfBox.extensions.pdf.txt.priority=50 The properties have names that follow a certain convention: content.<transformer bean id>.<property name> (Some custom property for the transformer) content.<transformer bean id>.priority=<number> (Default priority for this transformer) content.<transformer bean id>.extensions.<source mimetype>.<target mimetype>.priority=<number> (Priority for this transformation) content.<transformer bean id>.extensions.<source mimetype>.<target mimetype>.pipeline=transformer 1 | intermediate mimetype A)| transformer 2 content.<transformer bean id>.extensions.<source mimetype>.<target mimetype>.failover=transformer 1 (mimetype A)| transformer 2 (mimetype A) content.<transformer bean id>.extensions.<source mimetype>.<target mimetype>.supported=[true|false] The concept of 'explicit' transformations does not exist but instead the priority and supported properties are used to determine what transformations that are used. As the default priority is 100, setting the priority to 50 normally results in the transformer being used. Other compatible transformers will be tried in priority order if the one with highest priority fails for some reason. If multiple transformations are needed to get from source to target mimetype a pipeline transformation can be set up. It is also possible to control exactly which transformers are used in case of a failure by using the failover property. If you are running an Enterprise edition these properties may be changed via JMX while SkyVault Content Services is running (note, any changes via JMX and database takes precedence over any property file settings). You can create custom content transformers to transform one type of content into another, where that transformation is not already supported, such as when you have a custom input content type, or custom output content type. Take the JSON mimetype for example, if you upload a JSON file to the repository it will not have a thumbnail or a preview, and it will not be indexed and searchable. The following will show up in the logs when debugging is turned on (turn it on via log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=DEBUG and log4j.logger.org.alfresco.repo.content.transform=DEBUG): 2016-01-05 07:50:15,111 DEBUG [content.transform.ContentTransformerRegistry] [http-bio-8443-exec-9] Searched for transformer: source mimetype: application/json target mimetype: image/jpeg transformers: [] 2016-01-05 07:50:15,116 DEBUG [content.transform.ContentTransformerRegistry] [http-bio-8443-exec-9] Searched for transformer: source mimetype: application/json target mimetype: text/plain transformers: [] 2016-01-05 07:50:15,117 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-9] 10 json txt testnode.json 968 bytes -- index -- SolrIndexer NO transformers 2016-01-05 07:50:15,117 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-9] 10 workspace://SpacesStore/68b0e43d-d972-406d-b211-52ce647ef41a 2016-01-05 07:50:15,117 DEBUG [content.transform.TransformerLog] [http-bio-8443-exec-9] 10 json txt INFO testnode.json 968 bytes 7 ms No transformers 2016-01-05 07:50:15,117 DEBUG [content.transform.TransformerDebug] [http-bio-8443-exec-9] 10 Finished in 7 ms Transformer NOT called You can also see that there is no transformations available for the JSON mimetype on the http://localhost:8080/alfresco/service/mimetypes?mimetype=application/json#application/json admin page: application/json - json No extractors Transformable To: Cannot be transformed into anything else Transformable From: Cannot be generated from anything else So if you wanted to change that you could implement a JSON to HTML transformation to start with. That would give you a lot of functionality as HTML is already fully supported with thumbnail, preview, indexing and search. To do this you will need a tool that can convert from JSON to HTML. One such tool is json2html, which is written in Python and can easily be invoked from the command line, and also then from a custom transformer. It produces a HTML table with the JSON data. To create this transformer you do not need to do any Java coding, just some Spring bean definitions and Python script coding. Starting with the Spring beans for the custom transformer you would use the RuntimeExecutableContentTransformerWorker class as a bean implementation. It is able to execute any command line transformation that accepts an input and an output file on the command line. Basically, if you have a command line utility or a script that takes an input file, called the source, and an output file, called the target, then you can invoke it via this class. This is a technique used for a lot of custom transformation implementations. Here is the Spring bean definition for what is referred to as the transformer worker: <beans> <bean id="transformer.worker.json2html" class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker" > <property name="mimetypeService"> <ref bean="mimetypeService" /> </property> <property name="checkCommand"> <bean class="org.alfresco.util.exec.RuntimeExec"> <property name="commandsAndArguments"> <map> <entry key=".*"> <list> <value>ls</value> <value>/home/martin/Downloads/temp/transformation/convertJson2html.py</value> </list> </entry> </map> </property> </bean> </property> <property name="transformCommand"> <bean class="org.alfresco.util.exec.RuntimeExec"> <property name="commandsAndArguments"> <map> <entry key=".*"> <list> <value>/home/martin/Downloads/temp/transformation/convertJson2html.py</value> <value>${source}</value> <value>${target}</value> </list> </entry> </map> </property> <property name="errorCodes"> <value>1,2</value> </property> </bean> </property> </bean> The transformer worker does the actual job of executing the transformation. It has two important properties that need to be set, the checkCommand property, which is used to verify that the command line tool/script that is to be used for the transformation is actually available. The other property is called transformCommand and should contain the script/tool path plus the source and target variables, which will resolve to temporary files that will be used during the transformation. If you want to run one script on Linux and another one on Windows you can provide multiple entries in the command line arguments map as in the following example: <property name="transformCommand"> <bean class="org.alfresco.util.exec.RuntimeExec"> <property name="commandsAndArguments"> <map> <entry key="Windows.*"> <list> <value>cmd</value> <value>/C</value> <value>${ffmpeg.exe} ${opts} ${infile_opts} -i "${source}" ${outfile_opts} "${target}" 2> NUL</value> </list> </entry> <entry key="Linux"> <list> <value>sh</value> <value>-c</value> <value>${ffmpeg.exe} ${opts} ${infile_opts} -i '${source}' ${outfile_opts} '${target}' 2> /dev/null</value> </list> </entry> <entry key="Mac OS X"> <list> <value>sh</value> <value>-c</value> <value>${ffmpeg.exe} ${opts} ${infile_opts} -i '${source}' ${outfile_opts} '${target}' 2> /dev/null</value> </list> </entry> </map> </property> <property name="waitForCompletion"> <value>true</value> </property> <property name="defaultProperties"> <props> <prop key="opts">-y</prop> <prop key="infile_opts"/> <prop key="outfile_opts">-f flv</prop> </props> </property> </bean> </property> When the transformation worker bean is defined you can refer to it from the transformation bean definition: <bean id="transformer.json2html" class="org.alfresco.repo.content.transform.ProxyContentTransformer" parent="baseContentTransformer"> <property name="worker"> <ref bean="transformer.worker.json2html" /> </property> </bean> The transformer bean needs to specify baseContentTransformer as the parent, as it handles registering this new transformer with the SkyVault Content Services system. The transformer implementation class that you use in this case is called ProxyContentTransformer and it is delegating the actual transformation to the worker. The last thing you need to do for this transformer to be active is to add some properties to SkyVault-global.properties: content.transformer.json2html.priority=30 content.transformer.json2html.extensions.json.html.supported=true content.transformer.json2html.extensions.json.html.priority=30 See above for more information about these properties. Currently the transformer would halt on the Python script call as that has not yet been implemented. Download the json2html python module as follows: $ sudo pip install json2html Now create a script called convertJson2html.py: #!/usr/bin/python import os, sys lib_path = os.path.abspath(os.path.join('..')) sys.path.append(lib_path) from json2html import * # Get the source and target file print 'Number of arguments:', len(sys.argv), 'arguments.' print 'Argument List:', str(sys.argv) sourceTempFile = sys.argv[1] targetTempFile = sys.argv[2] # Open and read the JSON source file with open(sourceTempFile, 'r') as jsonF: jsondata = jsonF.read() print "Read json is : ", jsondata # Run conversion to HTML jsonAsHTML = json2html.convert(json = jsondata) # Write the resulting HTML representation of the JSON to target file with open(targetTempFile, 'w+') as htmlF: htmlF.write(jsonAsHTML) You can test this script from command line before trying the transformation: $ python convertJson2html.py testnode.json test.html This assumes that the testnode.json file is in the same directory as the script is run from. You are now ready to test this new JSON to HTML transformer. The only thing you need to do is create a rule that runs a script that does the following: document.transformDocument("text/html"); Uploading a JSON file to a folder with this rule will result in a new HTML file being generated as a result of the transformation, you will see something like this:
You can see that the newly generated HTML file has a thumbnail (refresh the page to see it). If you navigate to the details page a preview will also be available, and you can search for the content in the JSON file and you would see the HTML version of it. You could expand the JavaScript file to move the generated HTML file somewhere else in the repository if you needed to. The next step is to create a pipeline transformation JSON to HTML to TEXT to make the JSON searchable and to avoid having the extra HTML file generated. This can be achieved with some extra properties in the configuration: content.transformer.json2text.pipeline=*|html|* content.transformer.json2text.extensions.json.txt.priority=30 content.transformer.json2text.extensions.json.txt.supported=true This defines a transformation pipeline (also referred to as a complex transformer). With the "*|html|*" expression means convert any supported extension (that is only json to txt) by using any transformer that can convert JSON to first the intermediate format HTML, which is what the transformer we just implemented above does, then use any other transformer to convert HTML to TXT, which is supported out-of-the-box. After this change you will see that any uploaded JSON file is now searchable (remember to disable the rule). Note that there is no Spring bean defined for the transformer.json2text. To get the thumbnail you need to define another pipeline for JSON to PNG: content.transformer.json2png.pipeline=*|html|* content.transformer.json2png.extensions.json.png.priority=30 content.transformer.json2png.extensions.json.png.supported=true Running with this pipeline configuration will first initiate a transform to HTML and then to PNG with any out-of-the-box transformer supporting HTML to PNG. Finally you can get the preview working with the following JSON to PDF pipeline configuration: content.transformer.json2pdf.pipeline=*|html|* content.transformer.json2pdf.extensions.json.pdf.priority=30 content.transformer.json2pdf.extensions.json.pdf.supported=true SkyVault Content Services supports PDF previewing so as long as you can transform your content into a PDF it will be available for preview in the Document Details page. When implementing a transformer it is possible to associate it with an Edition, such as in the following example: content.transformer.complex.JodConverter.PdfBox.edition=Enterprise This sets the transformer.complex.JodConverter to be available only for SkyVault Content Services installations. It is also possible to associate a transformer with a specific AMP: content.transformer.json2html.amp=custom-content-transformer-repo This would make your custom JSON content transformer available only if an AMP with module id custom-content-transformer-repo has been applied. Related to transformations are renditions, and the purpose of them is to provide support for rendering a specific content item into other forms, known as renditions. The rendition items are derived from their source item and as such can be updated automatically when their source item's content (or other properties) are changed. Examples of renditions (and rendition engines) include:
Renditions can be performed synchronously or asynchronously and can be created at a specified location within the repository. By default they are created as primary children of their source item but it is possible to have them created at other places specified explicitly or as templated paths. The following describes an example with the reformat rendering engine and the custom transformation you have previously implemented (that is json to html). You can invoke the rendering engine and create an HTML rendition for a JSON document via JavaScript executed from a folder rule: var renderingEngineName = 'reformat'; var renditionDefinitionName = 'cm:htmlRenditionDef'; var renditionDef = renditionService.createRenditionDefinition(renditionDefinitionName, renderingEngineName); renditionDef.parameters['mime-type'] = 'text/html'; var htmlRendition= renditionService.render(document, renditionDef); The rendition definition name is a QName that we have to come up with, use a known namespace such as cm. The mime-type parameter tells the reformat rendering engine that the new rendition should be a HTML file. By default this rendition will be store as a hidden child to the uploaded JSON document. If you used the Node Browser to inspect the JSON content node you should see three hidden renditions as follows: Children Child Name Child Type Child Reference Primary Association Type Index cm:htmlRenditionDef cm:content workspace://SpacesStore/78448da2-fbab-4ef1-b451-6bbaa569b8c4 true rn:rendition 0 cm:doclib cm:thumbnail workspace://SpacesStore/f7f354a7-010f-4462-82a1-7cec7f36fa1d true rn:rendition 1 cm:pdf cm:thumbnail workspace://SpacesStore/8b3ec283-cd2a-4378-af2d-e72e30127210 true rn:rendition 2 If the solution being implemented is very transformation intensive a remote transformation server can be used. It would be totally dedicated to performing transformations, and can be scaled out separately from the rest of the SkyVault Content Services system, depending on transformation load. It is possible for an AMP (or a properties file) to set the initial 'average transform times' for the standard transformers, rather than having to override them in order to ensure they are not called. If an AMP that includes a new transformer provides the following global properties, this will cause the new transformer to be given priority over the OpenOffice and JOD transformers. If a new transformer returns false from its isTransformable method when their transformer is not available, transformations will fall back to the OpenOffice and JOD transformers. transformer.OpenOffice=3600000 transformer.complex.OpenOffice.Image.time=3600000 transformer.complex.OpenOffice.Pdf2swf.time=3600000 transformer.complex.OpenOffice.PdfBox.time=3600000 transformer.JodConverter.time=3600000 transformer.complex.JodConverter.Image.time=3600000 transformer.complex.JodConverter.Pdf2swf.time=3600000 transformer.complex.JodConverter.PdfBox.time=3600000 If an initial 'XXX.time' global property is supplied for a transformer, the number of transformations performed may also be supplied with a '.count' value. The default being 10,000. This avoids the average time reducing too fast if transformations are requested (because the new transformer is not available). A higher number ensures it reduces at a slower rate. For example, 1,000,000: transformer.OpenOffice.count=1000000 transformer.complex.OpenOffice.Image.count=1000000 transformer.complex.OpenOffice.Pdf2swf.count=1000000 transformer.complex.OpenOffice.PdfBox.count=1000000 transformer.JodConverter.count=1000000 transformer.complex.JodConverter.Image.count=1000000 transformer.complex.JodConverter.Pdf2swf.count=1000000 transformer.complex.JodConverter.PdfBox.count=1000000 |
Deployment - App Server |
|
Deployment - SDK Project |
|
More Information | |
Sample Code | |
Tutorials |
|
Developer Blogs |
You are here
Content Transformers (and Renditions)
SkyVault Content Services provides many
different types of content transformations out-of-the-box. Custom transformations can also be
implemented and configured.
© 2017 TBS-LLC. All Rights Reserved. Follow @twitter