How to publish your pipeline on GATE Cloud
1. Introduction
The GATE Cloud platform is based on the GATE framework, and GATE Cloud pipelines are standard GATE pipelines with a small amount of additional metadata: if your pipeline runs in GATE Developer, it can run on GATE Cloud. GATE Cloud supports pipelines running under either version 8.4.1 or 8.6.1 of GATE Embedded; let us know which version you require when submitting your pipeline.
This document explains the packaging format that is used for GATE Cloud pipelines. There is also a worked example of how to build such a package for the "TwitIE" Tweet processing pipeline provided with GATE Developer.
2. The basics
GATE Cloud uses the established "xgapp" format for saved GATE pipelines, which stores the state of a loaded pipeline in GATE: the set of plugins it requires, and the names, types and parameters of all the processing resources (PRs) in the pipeline. An xgapp file is essentially a "script" telling GATE Embedded how to re-construct the pipeline in the same state it was in when it was saved. By default, when saving a GATE pipeline as an xgapp, the saved state simply refers to plugins and resource files (gazetteer lists, etc.) by relative paths that point to the files on disk on the source machine, which is neither portable nor self-contained. Therefore, GATE Cloud imposes the following requirements on pipelines:
- The pipeline must be supplied as a single ZIP format archive, containing the saved application state along with all the plugins and other resources it depends on. For GATE 8.6.x pipelines all Maven plugins used by the app must be cached in maven-cache.gate at the top level of the zip file.
- The saved application state must be named application.xgapp and must be located at the 'root directory' of the zip file (i.e. when the ZIP is unpacked it must leave a file named application.xgapp in the directory where the ZIP is unpacked and not in a sub-directory).
- All URL paths used by the application should be relative paths that do not contain any .. path components, so they will point to files in the same directory as application.xgapp or a sub-directory under this location. In particular the xgapp file must not use any $gatehome$ placeholders - the environment in which the pipeline will run does not include a full GATE distribution so all plugins you depend on must be inside the ZIP.
The easiest way to produce such a bundle is to use the export function in GATE Developer (or the underlying Ant task in more complex cases).
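The layout rules above can be checked locally before a bundle is submitted. The following is a minimal Python sketch under stated assumptions, not GATE Cloud's own validation: the function name check_bundle is hypothetical, and the search for $gatehome$ and ".." is a plain text scan of application.xgapp rather than a real parse of the saved state, so it may miss or over-report some cases.

```python
import re
import zipfile


def check_bundle(zip_file):
    """Return a list of problems found in a pipeline ZIP (empty if none found)."""
    problems = []
    with zipfile.ZipFile(zip_file) as bundle:
        # application.xgapp must sit at the root of the archive.
        if "application.xgapp" not in bundle.namelist():
            problems.append("application.xgapp is not at the root of the ZIP")
            return problems
        xgapp = bundle.read("application.xgapp").decode("utf-8", errors="replace")
    # $gatehome$ refers to a GATE installation that will not exist at runtime.
    if "$gatehome$" in xgapp:
        problems.append("application.xgapp uses a $gatehome$ placeholder")
    # Rough heuristic (an assumption of this sketch): a ".." path component
    # suggests a path that escapes the directory holding application.xgapp.
    if re.search(r"(^|[/$>])\.\.(/|<|$)", xgapp, flags=re.MULTILINE):
        problems.append("application.xgapp contains a '..' path component")
    return problems
```

Calling check_bundle("my-pipeline.zip") on a bundle that follows the rules should return an empty list; the file name here is only a placeholder.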
3. Pipeline metadata
In addition to the pipeline software, GATE Cloud requires certain metadata describing the pipeline. This is provided as additional files within the pipeline ZIP, under a directory named “metadata”. Basic metadata about the pipeline is placed in a file named metadata.xml, with the following structure:
<metadata xmlns="http://gate.ac.uk/ns/pipeline-metadata">
  <pipeline-name>...</pipeline-name>
  <slug>...</slug>
  <ordering>...</ordering>
  <annotation-selectors>...</annotation-selectors>
  <additional-annotations>...</additional-annotations>
  <tags>
    <tag>...</tag>
  </tags>
</metadata>
All elements in the file must be in the http://gate.ac.uk/ns/pipeline-metadata namespace, and the elements under the root metadata element are as follows:
- pipeline-name: the name of the pipeline as it should appear in the list of pipelines on the GATE Cloud services list (e.g. "ANNIE Named Entity Recognizer")
- slug: a short URL-friendly name for the pipeline (e.g. "annie-named-entity-recognizer"). This is effectively the pipeline's unique identifier, so if you want to deploy a new version of an existing pipeline you must use the same slug value.
- ordering: determines where the pipeline will appear in the services list. We can advise on a suitable ordering value for each pipeline; if in doubt, set it to a default of 100 and we can choose a final value later.
- annotation-selectors: selector expressions defining the "significant" annotations that the pipeline will create. This is a comma-separated list of selector expressions, each taking the form annotationSetName:annotationType, as described in the API documentation. For example, a value of ":Person, :Location, :Date" denotes the three annotation types Person, Location and Date, all in the default annotation set. These selectors determine the annotations that will be returned by default by the online annotation service.
- additional-annotations: selector expressions defining additional annotations that the pipeline can produce but which are not returned by default. This uses the same format as the annotation-selectors element.
- tags: free text tags that should be applied to the pipeline to allow users to filter the list of pipelines to find those that are potentially useful for their task. Typical tags include:
  - "functional" tags describing the pipeline’s purpose, such as "Named Entity"
  - language tags denoting the (human) languages the pipeline can process, such as "English" or "Russian"
  - classification tags, for example where several pipelines are provided that expose different functionality of the same underlying library they may all be assigned a common tag, such as "OpenNLP"
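The metadata.xml structure and the selector syntax described above can be read back with a few lines of Python, which is a convenient way to check a file before packaging it. This is only an illustrative sketch: the function names parse_selectors and read_metadata are hypothetical, and splitting each selector at its first colon is an assumption of this sketch (the API documentation remains the reference for the selector format).

```python
import xml.etree.ElementTree as ET

NS = "http://gate.ac.uk/ns/pipeline-metadata"


def parse_selectors(text):
    """Split 'set:type, set:type' into (annotation_set, annotation_type) pairs."""
    pairs = []
    for item in (text or "").split(","):
        item = item.strip()
        if not item:
            continue
        # Assumption: the part before the first colon is the set name;
        # an empty set name means the default annotation set.
        set_name, _, ann_type = item.partition(":")
        pairs.append((set_name, ann_type))
    return pairs


def read_metadata(xml_text):
    """Extract the fields described above from the text of a metadata.xml file."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}metadata" % NS:
        raise ValueError("root element must be 'metadata' in the %s namespace" % NS)

    def text_of(name):
        # Missing elements are reported as empty strings.
        return (root.findtext("{%s}%s" % (NS, name)) or "").strip()

    return {
        "pipeline-name": text_of("pipeline-name"),
        "slug": text_of("slug"),
        "ordering": text_of("ordering"),
        "annotation-selectors": parse_selectors(text_of("annotation-selectors")),
        "additional-annotations": parse_selectors(text_of("additional-annotations")),
        "tags": [t.text.strip() for t in root.iter("{%s}tag" % NS) if t.text],
    }
```

With the example value from the text, parse_selectors(":Person, :Location, :Date") returns [("", "Person"), ("", "Location"), ("", "Date")], where the empty string stands for the default annotation set.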
Other metadata is provided in separate files in the metadata directory:
- icon.png: a 64x64 pixel icon, displayed within the pipeline’s tile on the main shop page and also at the top of the pipeline’s detail page. It must be in PNG format and should have a transparent background.
- short-desc.html: a short description in HTML format, suitable for display within an individual tile on the main shop page. This will be the first point of contact with a particular pipeline for GATE Cloud users, so it should be carefully designed to provide the most important information in a single paragraph, encouraging users to click through to the full detail page.
- long-desc.html: a longer description which will appear on the detail page when the user clicks through to this specific pipeline from the shop front. This file is in XHTML format and may embed images by using relative paths to other files within the metadata directory. When rendered on annomarket.com you will have the Bootstrap CSS/JS framework (version 2.3.2) available, so you can use Bootstrap features such as glyphicons, collapsible panels, etc. via data- attributes. The shop pages use the fluid grid system, so your HTML must not use fixed grids.
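The icon requirement (PNG format, 64x64 pixels) can be verified without an image library, because a PNG file stores its dimensions at a fixed position in its header. The sketch below is an illustration only: the function names png_size and icon_problems are hypothetical, and it does not check for a transparent background.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from the IHDR chunk of PNG data, or raise ValueError."""
    # A PNG starts with an 8-byte signature, followed by the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", 4-byte width, 4-byte height (big-endian).
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


def icon_problems(data):
    """List reasons why the icon data does not meet the 64x64 PNG requirement."""
    try:
        size = png_size(data)
    except ValueError as error:
        return [str(error)]
    return [] if size == (64, 64) else ["icon is %dx%d, expected 64x64" % size]
```

Only the first 24 bytes are inspected, so it is enough to pass the start of metadata/icon.png, for example the result of open(path, "rb").read(24).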
As an example, the pipeline metadata for the ANNIE named entity annotation pipeline can be found in the ANNIE plugin on GitHub.
The following shows how the different pieces of metadata contribute to the various GATE Cloud shop pages. The name, icon and short description contribute to the pipeline's tile on the main shop page.
The long description appears on the detail page, along with the icon and name again, and the annotation selectors determine which annotations are returned by the "test this pipeline" function.
4. For advanced users - the GATE Cloud runtime environment
For many GATE applications that just use the standard pure-Java ANNIE components, the basic information above is all you need to know to run your application on GATE Cloud. But for more advanced applications that involve custom PRs, platform-specific native helpers (such as an external tagger), or other components that need to know the path where they are installed, you will need to know a little more about the environment in which your application will be running.
4.1. Hardware and software
GATE Cloud annotation jobs are executed on virtual 64-bit (x86_64) Linux servers in the cloud, specifically Ubuntu 14.04 (Trusty Tahr). The GATE application is run on Oracle Java 8. The current offering uses the Amazon EC2 cloud and runs jobs on 'm3.xlarge' machines, which provide 4 virtual CPU cores and 15GB of memory, of which 13GB is available to the GCP process. Pipelines should assume a maximum heap memory size of 2GiB (-Xmx2G). For the on-line API, each pipeline is isolated in its own Java Virtual Machine with 2G of heap by default. For batch jobs, the pipeline is run with 12G using the open-source GCP tool, but this is split between 6 parallel threads, which again works out at about 2G per thread.
GATE Cloud pipelines are run using GATE Embedded version 8.4.1 or 8.6.1, and any plugins used by your application must be compatible with whichever of these versions you have requested. Certain plugins from the GATE Embedded distribution are pre-loaded, principally those that provide additional document format parsers.
The pipeline runner Java process is configured for 'headless' operation (-Djava.awt.headless=true), and your code should not assume that a GUI display is available.
Your pipeline must not assume it has permission to read or write files to disk outside of the directory into which it is unpacked, and there are no guarantees about what the path to this directory will be — if your pipeline needs to know the full path to something unpacked from its zip file it should use a CREOLE parameter of type java.net.URL, which will be resolved automatically by GATE in the usual way. The pipeline will additionally have access to the usual temporary storage identified by the java.io.tmpdir system property.
4.2. Native code components
Many PRs are simply wrappers around non-Java tools, for example third-party taggers of various kinds. If your application requires the use of any non-Java components you must ensure that the version you include in your ZIP package is the one that will run on Linux x86_64, and in particular on Ubuntu 14.04.
As your code will be running in a Linux environment, remember that any native executable or script that your application needs to call must be marked with execute permission on the filesystem. GATE Cloud uses the standard Info-ZIP "unzip" tool to unpack the application ZIP package, which respects permission settings specified in the ZIP file, so if you build your package using the corresponding "zip" tool the permissions will be preserved. However, many ZIP file creation tools (including GATE Developer's export mechanism) do not preserve permissions in this way. Therefore GATE Cloud also supports an alternative mechanism to mark files as executable.
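To find out whether a given ZIP actually records execute permissions, the per-entry attributes can be inspected with Python's standard zipfile module. This is a small sketch with a hypothetical function name (executable_entries). It relies on the usual convention that entries written on a Unix system store the file mode in the upper 16 bits of the external attributes field, and it only reports what is stored in the archive, not how any particular unpacking tool will treat it.

```python
import stat
import zipfile


def executable_entries(zip_file):
    """Return the names of ZIP entries that carry a Unix execute permission bit."""
    found = []
    with zipfile.ZipFile(zip_file) as bundle:
        for info in bundle.infolist():
            # Unix permissions live in the top 16 bits of external_attr,
            # and only for entries written with create_system == 3 (Unix).
            mode = info.external_attr >> 16
            if info.create_system == 3 and not info.is_dir() and mode & stat.S_IXUSR:
                found.append(info.filename)
    return found
```

If a script your pipeline calls is missing from the returned list, the archive does not record it as executable, and the .executables mechanism described next is the safer route.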
Once the application ZIP has been unpacked, we look through the resulting directory tree for files named .executables. If any such file is found, we treat each line in the file as a relative path, and set the execute flag on the corresponding file in the file system. For example, imagine the following structure:
- application.xgapp
- plugins
  - MyTagger
    - resources
      - tagger.sh
      - postprocessor.pl
- resources
- MyTagger
Here, tagger.sh and postprocessor.pl are scripts that need to be marked as executable, so we could create a file plugins/MyTagger/.executables containing the two lines:
resources/tagger.sh
resources/postprocessor.pl
or equivalently, create plugins/MyTagger/resources/.executables containing
tagger.sh
postprocessor.pl
Either way, the effect would be to make the GATE Cloud processing machine mark the relevant files as executable before running your application.
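To test a package locally, the behaviour described above can be approximated in Python. This is a sketch of the documented behaviour, not GATE Cloud's actual implementation: the function name apply_executables is hypothetical, and skipping blank lines and silently ignoring listed files that do not exist are assumptions made here.

```python
import os
import stat


def apply_executables(root_dir):
    """Set the execute flag on every file listed in a .executables file under root_dir."""
    marked = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        if ".executables" not in filenames:
            continue
        with open(os.path.join(dirpath, ".executables"), encoding="utf-8") as listing:
            for line in listing:
                relative = line.strip()
                if not relative:
                    continue  # assumption: blank lines are ignored
                # Each line is a path relative to the directory holding .executables.
                target = os.path.join(dirpath, relative)
                if os.path.isfile(target):
                    mode = os.stat(target).st_mode
                    os.chmod(target, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
                    marked.append(os.path.relpath(target, root_dir))
    return sorted(marked)
```

Running apply_executables on the directory where the ZIP was unpacked returns the paths it marked, relative to that directory, which makes it easy to confirm that every script your PRs call is covered.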