Preparing your GATE application to run on GATE Cloud
1. Overview
GATE Cloud provides a way for you to run almost any GATE application on the cloud. You develop your application in the usual way using GATE Developer and then save it as a single self-contained ZIP file, typically using the "Export for GATE Cloud" option. This document explains what you need to know to ensure that your application will run on GATE Cloud. More detailed documentation is available on our page about packaging applications.
2. Basics
You supply your GATE application to GATE Cloud as a single ZIP file, which is expected to contain a saved application state in the usual ".xgapp" format, along with all the GATE plugins, JAPE grammars and other resources that the application requires. The saved application state must be named application.xgapp and must be located in the root directory of the ZIP file (i.e. when the ZIP is unpacked, application.xgapp must end up in the directory where the ZIP was unpacked, not in a sub-directory). All URL paths used by the application should be relative paths that do not contain any '..' components, so that they point to files in the same directory as application.xgapp or in a sub-directory beneath it.
The easiest way to build such a package is simply to save your application in GATE Developer using the "Export for GATE Cloud" option, which produces a ZIP file containing an application.xgapp and all its required resources in one click. (This is explained in more detail here.)
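If you build the package by hand, a quick local check of the two rules above can catch mistakes before you upload. The following is a minimal sketch, not part of GATE or GATE Cloud: the function name check_package is hypothetical, and the scan for '..' is only a heuristic that looks at text between XML tags in application.xgapp (it assumes relative URLs appear there as plain text, for example with a "$relpath$" prefix).

```python
import re
import zipfile

# Heuristic: a ".." path component, e.g. "$relpath$../shared/x" or "a/../b".
PARENT_COMPONENT = re.compile(r"(^|[/$])\.\.(/|$)")


def check_package(zip_path):
    """Return a list of problems found in an application ZIP (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as zf:
        # application.xgapp must sit at the root of the archive.
        if "application.xgapp" not in zf.namelist():
            problems.append("application.xgapp is not at the root of the ZIP")
            return problems
        text = zf.read("application.xgapp").decode("utf-8", errors="replace")

    for number, line in enumerate(text.splitlines(), start=1):
        # Inspect only the text between XML tags on this line.
        for value in re.findall(r">([^<>]+)<", line):
            value = value.strip()
            if PARENT_COMPONENT.search(value):
                problems.append(f"line {number}: path contains '..': {value}")
    return problems
```

Calling check_package("myapp.zip") (a hypothetical file name) returns an empty list when neither problem is detected. An empty list does not prove that the application will run, only that these two checks passed.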
3. The GATE Cloud environment
For many GATE applications that just use the standard pure-Java ANNIE components, the basic information above is all you need to know to run your application on GATE Cloud. But for more advanced applications that involve custom PRs, platform-specific native helpers (such as an external tagger), or other components that need to know the path where they are installed, you will need to know a little more about the environment in which your application will be running.
3.1. Hardware and software
GATE Cloud annotation jobs are executed on virtual 64-bit (x86_64) Linux servers in the cloud, specifically Ubuntu 14.04 (Trusty Tahr). The GATE application is run using the open-source GCP tool on Oracle Java 8 (1.8.0_71). The current offering uses the Amazon EC2 cloud, and runs jobs on their 'm3.xlarge' machines which provide 4 virtual CPU cores and 15GB of memory, of which 13GB is available to the GCP process.
The jobs are run using GATE Embedded version 8.3, and your application must use GATE plugins that are compatible with this version. The following plugins are pre-loaded by default in order to support additional input data formats: Format_Twitter, Format_MediaWiki, Format_PubMed, Format_FastInfoset and Format_DataSift.
The GCP process is configured for 'headless' operation (-Djava.awt.headless=true), and your code should not assume that a GUI display is available.
GCP loads one copy of your application.xgapp in the usual way using the PersistenceManager. It then uses the GATE duplication mechanism to make a further 5 independent copies of the loaded application, and runs 6 parallel threads to process your documents. For most PRs this duplication process is essentially equivalent to loading the original application.xgapp 6 times, but if you are writing a custom PR you may wish to consider implementing a custom duplication strategy.
3.2. Directories
The application ZIP file will always be unpacked in a directory named /gatecloud/application on the cloud server. Thus the application file will always be /gatecloud/application/application.xgapp and if any of your components need to know the absolute path to their resource files you can work this out by prepending /gatecloud/application/ to the path of the entry inside your ZIP package. The user account that runs the GCP process has full read and write access in the /gatecloud/application directory, so if any of your components need to create temporary files then this is a good place to put them. Any files created under /gatecloud/application will be lost when the current batch of documents has been processed.
This path can also be used if your documents need to refer to external files. For example, if your input files are XML and need a particular DTD, you could package that DTD inside your application ZIP file and then refer to file:/gatecloud/application/my.dtd in your XML files.
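The path arithmetic described above can be sketched as follows. This is an illustration only: the helper name absolute_resource_path is hypothetical, and rejecting absolute or '..' paths is an assumption made here to match the packaging rules in section 2.

```python
import posixpath

# Directory where the application ZIP is unpacked on the cloud server.
APP_ROOT = "/gatecloud/application"


def absolute_resource_path(zip_entry):
    """Map a path inside the application ZIP to its absolute path on the node."""
    normalized = posixpath.normpath(zip_entry)
    # Refuse anything that is not a relative path staying inside the package.
    if (
        normalized.startswith("/")
        or normalized == ".."
        or normalized.startswith("../")
    ):
        raise ValueError(f"not a relative path inside the package: {zip_entry!r}")
    if normalized == ".":
        return APP_ROOT
    return posixpath.join(APP_ROOT, normalized)


if __name__ == "__main__":
    # Absolute location of a packaged script, and a file: URL for a packaged DTD.
    print(absolute_resource_path("plugins/MyTagger/resources/tagger.sh"))
    print("file:" + absolute_resource_path("my.dtd"))
```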
The directory /gatecloud/batch/output is where GCP will write any output files specified by the output definitions you supply when running an annotation job. All files created under this directory will be packaged up into ZIP files when the batch of documents has been processed and made available for download when the job has completed. Thus, any additional output files that your application creates and that need to be returned to the user should be placed under /gatecloud/batch/output.
Your code should not assume it has permission to read and write any files outside these two locations.
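If your own helper scripts build file paths dynamically, a small guard can help keep them inside the two locations named above. This is a sketch under assumptions: the function name is_permitted_location is hypothetical, the check is purely textual (it does not resolve symbolic links), and it says nothing about how GATE Cloud itself enforces permissions.

```python
import posixpath

# The two locations this document says your code may read and write.
PERMITTED_ROOTS = ("/gatecloud/application", "/gatecloud/batch/output")


def is_permitted_location(path):
    """Return True if path is one of the permitted roots or lies beneath one."""
    # normpath collapses "." and ".." components; symlinks are NOT resolved.
    normalized = posixpath.normpath(path)
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in PERMITTED_ROOTS
    )
```

For example, is_permitted_location("/gatecloud/batch/output/stats.csv") is true, while a path such as "/gatecloud/application/../etc" normalizes to a location outside both roots and is rejected.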
3.3. Native code components
Many PRs are simply wrappers around non-Java tools, for example third-party taggers of various kinds. If your application requires the use of any non-Java components you must ensure that the version you include in your ZIP package is the one that will run on Linux x86_64, and in particular on Ubuntu 14.04 (the release described in section 3.1). The cloud processing servers have a reasonable set of packages installed by default, including a basic install of Perl and Python, sed, awk and bash. To request additional packages please contact us with your requirements. If you want to be sure your code will work on GATE Cloud then the best approach is to sign up for your own account at Amazon Web Services, run your own instance of the same machine image that GATE Cloud uses and test the software yourself. As Amazon charges by the hour with no up-front fees this should cost you less than a pound.
As your code will be running in a Linux environment, remember that any native executable or script that your application needs to call must be marked with execute permission on the filesystem. GATE Cloud uses the standard Info-ZIP "unzip" tool to unpack the application ZIP package, which respects permission settings specified in the ZIP file, so if you build your package using the corresponding "zip" tool the permissions will be preserved. However, many ZIP file creation tools (including GATE's "Export for GATE Cloud") do not preserve permissions in this way. Therefore GATE Cloud also supports an alternative mechanism to mark files as executable.
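To find out whether a given ZIP file carries execute permissions at all, you can inspect the Unix mode stored with each entry. The sketch below relies on the common convention that archives created on Unix keep the file mode in the upper 16 bits of each entry's external attributes. The function name executable_entries is hypothetical.

```python
import stat
import zipfile

ANY_EXECUTE_BIT = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def executable_entries(zip_path):
    """Return the names of ZIP entries whose stored Unix mode has an execute bit."""
    result = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # create_system == 3 means the entry was written on a Unix system;
            # only then do the high 16 bits of external_attr hold a Unix mode.
            mode = info.external_attr >> 16
            if info.create_system == 3 and mode & ANY_EXECUTE_BIT:
                result.append(info.filename)
    return result
```

If a script you expect to be executable is missing from the returned list, the permission was not recorded in the ZIP and the alternative mechanism is needed.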
Once the application ZIP has been unpacked, we look through the resulting directory tree for files named .executables. If any such file is found, we treat each line in the file as a relative path, and set the execute flag on the corresponding file in the file system. For example, imagine the following structure:
- application.xgapp
- plugins
  - MyTagger
    - resources
      - tagger.sh
      - postprocessor.pl
- resources
  - MyTagger
Here, tagger.sh and postprocessor.pl are scripts that need to be marked as executable, so we could create a file plugins/MyTagger/.executables containing the two lines:
resources/tagger.sh
resources/postprocessor.pl
or equivalently, create plugins/MyTagger/resources/.executables containing
tagger.sh
postprocessor.pl
Either way, the effect would be to make the GATE Cloud processing machine mark the relevant files as executable before running your application.
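To test your .executables files locally before uploading, you can emulate the behaviour described above on an unpacked copy of your package. This is a sketch of the described behaviour, not GATE Cloud's actual implementation. The function name apply_executables is hypothetical, and skipping blank lines and ignoring paths that do not name an existing file are assumptions.

```python
import os
import stat

EXECUTE_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def apply_executables(root):
    """Set the execute flag on each file listed in any .executables file under root.

    Each line is treated as a path relative to the directory that contains
    the .executables file. Returns the list of files that were marked.
    """
    marked = []
    for directory, _subdirs, files in os.walk(root):
        if ".executables" not in files:
            continue
        listing_path = os.path.join(directory, ".executables")
        with open(listing_path, encoding="utf-8") as listing:
            for line in listing:
                relative = line.strip()
                if not relative:
                    continue  # assumption: blank lines are skipped
                target = os.path.join(directory, relative)
                if os.path.isfile(target):
                    mode = os.stat(target).st_mode
                    os.chmod(target, mode | EXECUTE_BITS)
                    marked.append(target)
    return marked
```

Running apply_executables on the directory where you unpacked your ZIP and then checking the returned list shows which files your .executables entries actually resolve to.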
4. Security and privacy
GATE Cloud does not run a separate machine for each annotation job. Instead it splits each annotation job up into manageable pieces (referred to as "tasks"), puts these tasks into a queue, and runs a collection of processing machines (referred to as "nodes") that simply take the next task from the queue whenever they have finished processing their previous task. While a task is running it has exclusive use of that particular node (we never run more than one task on the same node at the same time), but once the task is complete the same node will then run another task, which may or may not be part of the same annotation job.
To ensure the security and privacy of your code and data, the node takes the following precautions:
- All GCP processes are run under an unprivileged user account which only has write permission in a restricted area of the filesystem (see section 3.2). The /gatecloud directory is mounted with the nosuid option, so jobs cannot run set-UID programs to gain elevated privileges.
- At the end of every task, all processes running under that user ID are forcibly terminated (so there's no risk of a stray or malicious background process started by a previous task being able to read your data).
- The /gatecloud/application and /gatecloud/batch directories are completely deleted at the end of every task (whether the task completed successfully or failed) so your data will not be left for the following task to see.