Job Management REST API

1. Introduction

This document describes the REST API provided by GATE Cloud to configure and control annotation jobs. General information about GATE Cloud REST APIs can be found on this page. The available API calls break down into several groups:

  • General information
    • List all jobs
    • Basic job details (name, pipeline details and pricing)
  • Input sources
    • List configured input sources
    • Add/delete input definitions
  • Output configuration
    • List configured output definitions
    • Add/delete output definitions
  • Job control
    • Start job execution
    • Monitor execution progress
    • Abort a misconfigured job
    • Resume a suspended job
  • Results
    • Download job results and execution reports

These APIs are explained in detail in the following sections.

2. List all jobs

GET https://cloud.gate.ac.uk/api/job

Request a summary of all jobs owned by the authenticating user.

Query string parameters

  • state (optional): show only jobs in the given state or states (RESERVED, READY, ACTIVE, COMPLETED, DELETED, SUSPENDED). The parameter may be repeated to select several states. If omitted, the default is to list RESERVED, READY, ACTIVE and COMPLETED jobs.

Example: https://cloud.gate.ac.uk/api/job?state=ACTIVE&state=SUSPENDED would list ACTIVE and SUSPENDED jobs only.
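The repeated state parameter can be generated rather than assembled by hand. The following is a minimal sketch using only the Python standard library; the helper name job_list_url and the API_ROOT constant are illustrative and not part of the API.

```python
from urllib.parse import urlencode

API_ROOT = "https://cloud.gate.ac.uk/api"


def job_list_url(states=()):
    """Return the job-listing URL, optionally filtered by one or more states."""
    if not states:
        return f"{API_ROOT}/job"
    # doseq=True expands the list into one state=... pair per value.
    return f"{API_ROOT}/job?" + urlencode({"state": list(states)}, doseq=True)


print(job_list_url(["ACTIVE", "SUSPENDED"]))
# prints https://cloud.gate.ac.uk/api/job?state=ACTIVE&state=SUSPENDED
```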

Response format

XML, followed by the equivalent JSON:
<jobs>
  <job>
    <url>https://cloud.gate.ac.uk/api/job/132</url>
    <state>ACTIVE</state>
  </job>
  ...
</jobs>
[
  {
    "url":"https://cloud.gate.ac.uk/api/job/132",
    "state":"ACTIVE"
  },
  ...
]

The url can be queried to get more details about a specific job.
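As an illustration of consuming the JSON form of this listing, the sketch below groups the detail URLs by job state. The function name is hypothetical and the sample data is made up (job 140 does not come from this document); fetching the listing and authenticating are left out.

```python
import json


def job_urls_by_state(listing_json):
    """Group job detail URLs by state from a JSON job listing."""
    grouped = {}
    for job in json.loads(listing_json):
        grouped.setdefault(job["state"], []).append(job["url"])
    return grouped


# Made-up sample in the documented response shape.
sample = """[
  {"url": "https://cloud.gate.ac.uk/api/job/132", "state": "ACTIVE"},
  {"url": "https://cloud.gate.ac.uk/api/job/140", "state": "COMPLETED"}
]"""

for state, urls in job_urls_by_state(sample).items():
    print(state, urls)
```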

3. Job details

3.1. Query job details

GET https://cloud.gate.ac.uk/api/job/{jobID}

Get basic details for a job. This does not include the job's application, input or output specifications, which are available through separate calls.

Response format

XML, followed by the equivalent JSON:
<job>
  <key>value</key>
  ...
  <price setup="0" hour="0.8" mib="0" />
</job>
{
  "key":"value",
  ...
  "price":{
    "setup":0, "hour":0.8, "mib":0
  }
}
Key Value
id The job identifier
name The job name
uuid Randomly-generated UUID used as part of the path when storing job files in S3
state The job state (ACTIVE, COMPLETED, etc., see above)
price The one-off setup price and price per processing hour/MiB of data for this job
dateCreated Date and time when the job was first created
dateCompleted, resultsAvailableUntil If the job state is COMPLETED, the date and time when it finished and the date until which the results will be available for download
resultBundle If the job state is COMPLETED, an API URL that can be used to access the data bundle containing the job's results
timeUsed, timeCharged If the job has been started, how much processing time has so far been used/charged for
bytesUsed, bytesCharged If the job has been started, how much data has so far been processed/charged for
applicationFixed, inputsFixed true if this job has a fixed application location or input corpus, omitted otherwise
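Since resultBundle is only present for completed jobs, a client would typically check the state before using it. The sketch below shows one way to do that with the JSON response; the function name and the example.invalid URL are placeholders, not values returned by the service.

```python
import json


def result_bundle_url(job_details_json):
    """Return the resultBundle URL for a COMPLETED job, otherwise None."""
    job = json.loads(job_details_json)
    if job.get("state") != "COMPLETED":
        return None
    return job.get("resultBundle")


# Placeholder responses; real ones carry the other documented keys too.
running = '{"id": 132, "state": "ACTIVE"}'
finished = '{"id": 132, "state": "COMPLETED", "resultBundle": "https://example.invalid/bundle"}'

print(result_bundle_url(running))
print(result_bundle_url(finished))
```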

3.2. Update job details

POST https://cloud.gate.ac.uk/api/job/{jobID}

Modify the job details. Currently the only modifiable "detail" is the job name.

Request format

XML, followed by the equivalent JSON:
<job>
  <name>New name</name>
</job>
{
  "name":"New name"
}

Response

Exactly as for the GET case above.
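A rename request in the JSON form might be prepared as in the sketch below. It only builds the request object and does not send it. The Content-Type header value is an assumption (this document does not say how the XML and JSON variants are distinguished), and authentication is omitted because it is covered by the general REST API information.

```python
import json
from urllib.request import Request


def rename_job_request(job_id, new_name):
    """Build (but do not send) a POST request that renames a job."""
    body = json.dumps({"name": new_name}).encode("utf-8")
    return Request(
        f"https://cloud.gate.ac.uk/api/job/{job_id}",
        data=body,
        # Assumption: JSON bodies are announced with this content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = rename_job_request(132, "New name")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen, once credentials have been added, would submit it.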

4. The application to be run

GATE Cloud annotation jobs work by running a GATE application over your documents on the cloud. The application is provided as a ZIP package, such as would be created by the "Export for GATE Cloud" option in GATE Developer. Custom annotation jobs allow you to provide your own application, whereas marketplace jobs allow you to run an application provided by a third party.

For custom jobs, you can either upload your application package to storage space managed by GATE Cloud or, if your package is already stored in Amazon S3, simply provide its location (bucket name and key). If the S3 bucket containing the package allows anonymous access, the location alone is sufficient. If it does not, you will also need to provide an access key ID and secret key pair that has permission to download the package from that location. This would typically belong to an IAM user with limited permissions; do not use your master credentials.

4.1. Querying the current application location

GET https://cloud.gate.ac.uk/api/job/{jobID}/application

Response format

XML, followed by the equivalent JSON:
<application>
  <location>s3://bucketName/key</location>
  <!-- one of the following three -->
  <fixed>true</fixed>
  <putUrl>...</putUrl>
  <accessKey>...</accessKey>
</application>
{
  "location":"s3://bucketName/key",
  /* one of the following three */
  "fixed":true,
  "putUrl":"...",
  "accessKey":"..."
}
Key Value
location The location in Amazon S3 of the application ZIP package. This may be in a GATE Cloud bucket or in your own bucket
fixed true if this is a marketplace job with a fixed application location
putUrl If your application is to be stored in a GATE Cloud-owned location, this is a URL that will allow you to upload a replacement application ZIP package via an HTTP(S) PUT request
accessKey If your application package is in your own S3 bucket, an Amazon access key ID that has the rights to download it (typically an IAM user).

Note the putUrl will only work for a limited time, so you should upload your file to this URL immediately. A new URL is generated every time you request the application details, so if you find your PUT URL has expired simply access this operation again to generate a fresh one. When uploading a file to the putUrl the PUT request must have a Content-Type of "application/octet-stream" and the correct Content-Length, and must not have a Content-MD5 header.
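The header requirements for the upload can be made explicit in code. The sketch below builds a PUT request with the required Content-Type and Content-Length and no Content-MD5 header; the function name and the example.invalid URL are placeholders, and a real putUrl must be obtained from the application details call.

```python
from urllib.request import Request


def application_upload_request(put_url, package_bytes):
    """Build (but do not send) the PUT request for an application ZIP."""
    return Request(
        put_url,
        data=package_bytes,
        headers={
            "Content-Type": "application/octet-stream",
            "Content-Length": str(len(package_bytes)),
            # Deliberately no Content-MD5 header.
        },
        method="PUT",
    )


# Placeholder URL and payload, for illustration only.
req = application_upload_request("https://example.invalid/upload", b"PK\x03\x04")
print(req.get_method(), req.get_header("Content-type"), req.get_header("Content-length"))
```

In real use the bytes would come from reading the ZIP file in binary mode, and the request would be sent with urllib.request.urlopen soon after the URL was issued.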

4.2. Changing the application for a custom job

POST https://cloud.gate.ac.uk/api/job/{jobID}/application

Request format - upload

If you wish to upload your package to a GATE Cloud bucket:

XML, followed by the equivalent JSON:
<application>
  <fileName>applicationName.zip</fileName>
</application>
{
  "fileName":"applicationName.zip"
}

Request format - own bucket

If your application is in your own S3 bucket:

XML, followed by the equivalent JSON:
<application>
  <s3Location>s3://myBucket/path/app.zip</s3Location>
  <accessKey>AKIA......</accessKey>
  <secretKey>...</secretKey>
</application>
{
  "s3Location":"s3://myBucket/path/app.zip",
  "accessKey":"AKIA......",
  "secretKey":"..."
}

Response

Exactly as for the GET case above.

5. Job results storage

By default, job output files are stored in a GATE Cloud-managed storage location and transferred to a data bundle when the job completes. If you have your own Amazon Web Services account you can instead choose to have the output files written directly to your own S3 bucket.

5.1. Querying the current output directory

GET https://cloud.gate.ac.uk/api/job/{jobID}/outputDirectory

Response format

XML, followed by the equivalent JSON:
<outputDirectory>
  <!-- either -->
  <defaultLocation>true</defaultLocation>
  <!-- or -->
  <s3Location>s3://bucketName/keyPrefix/</s3Location>
  <accessKey>...</accessKey>
</outputDirectory>
{
  /* either */
  "defaultLocation":true
  /* or */
  "s3Location":"s3://bucketName/keyPrefix/",
  "accessKey":"..."
}
Key Value
defaultLocation If true, indicates that the job uses the default GATE Cloud-managed storage location
s3Location If the job uses your own bucket, this is the location to which output files will be written. It is a string of the form s3://bucketName/keyPrefix — the files will be written to the given bucket and their keys (i.e. file names) will have the given "directory name" prefix prepended.
accessKey If the job uses your own S3 bucket, an Amazon access key ID that has the rights to upload objects to the given location (typically an IAM user).

5.2. Changing the output directory

POST https://cloud.gate.ac.uk/api/job/{jobID}/outputDirectory

Request format - default location

To (re)set the job to use the default GATE Cloud-managed output storage:

XML, followed by the equivalent JSON:
<outputDirectory>
  <defaultLocation>true</defaultLocation>
</outputDirectory>
{
  "defaultLocation":true
}

Request format - own bucket

If you want to send output to your own S3 bucket:

XML, followed by the equivalent JSON:
<outputDirectory>
  <s3Location>s3://myBucket/keyPrefix/</s3Location>
  <accessKey>AKIA......</accessKey>
  <secretKey>...</secretKey>
</outputDirectory>
{
  "s3Location":"s3://myBucket/keyPrefix/",
  "accessKey":"AKIA......",
  "secretKey":"..."
}

Response

Exactly as for the GET case above.
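The two request variants can be produced by one small helper, as in the sketch below. The function name is hypothetical, the credentials shown are placeholders, and the s3:// prefix check is a client-side sanity check of my own rather than documented server behaviour.

```python
import json


def output_directory_body(s3_location=None, access_key=None, secret_key=None):
    """Return the JSON body for the default location or for your own bucket."""
    if s3_location is None:
        return json.dumps({"defaultLocation": True})
    if not s3_location.startswith("s3://"):
        raise ValueError("s3_location must look like s3://bucketName/keyPrefix/")
    return json.dumps(
        {"s3Location": s3_location, "accessKey": access_key, "secretKey": secret_key}
    )


print(output_directory_body())
# Placeholder credentials, for illustration only.
print(output_directory_body("s3://myBucket/keyPrefix/", "AKIAEXAMPLE", "not-a-real-secret"))
```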

6. The documents to process

The documents to be processed by a GATE Cloud annotation job are provided to the system in data bundles (which can be either uploaded directly or created automatically as the result of another annotation job). See the data API documentation for more details.

6.1. Listing a job's input specifications

GET https://cloud.gate.ac.uk/api/job/{jobID}/input

Response format

XML, followed by the equivalent JSON:
<inputs>
  <input>
    <url>{detail URL}</url>
    <type>{input type}</type>
    <sourceBundle>{bundle URL}</sourceBundle>
  </input>
  ...
</inputs>
[
  {
    "url":"detail URL",
    "type":"input type",
    "sourceBundle":"bundle URL"
  },
  ...
]

The url responds to GET, POST and DELETE requests as detailed below.

6.2. Adding a new input specification

POST https://cloud.gate.ac.uk/api/job/{jobID}/input

Request format

XML, followed by the equivalent JSON:
<input>
  <sourceBundle>{bundle ID}</sourceBundle>
</input>
{
  "sourceBundle":bundleID
}

The bundle ID is the identifier of one of your data bundles, which can be found via your web dashboard (remove the leading "D-") or the data API.

Response

A 303 redirect to the URL of the newly created input specification.
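The step of removing the leading "D-" from a dashboard identifier can be sketched as follows. The function name is illustrative, and treating the remaining identifier as an integer is an assumption based on the unquoted bundleID in the JSON request format above.

```python
import json


def input_spec_body(dashboard_id):
    """Build the JSON body for a new input from a dashboard bundle ID."""
    bundle_id = dashboard_id
    if bundle_id.startswith("D-"):
        bundle_id = bundle_id[2:]  # drop the dashboard prefix
    # Assumption: bundle IDs are numeric.
    return json.dumps({"sourceBundle": int(bundle_id)})


print(input_spec_body("D-42"))  # {"sourceBundle": 42}
```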

6.3. Querying details of an input specification

GET <inputSpecURL>

To query the current configuration of a particular input specification, make a GET request to the "details URL" returned when listing a job's input specifications.

Response format

XML, followed by the equivalent JSON:
<input>
  <key>value</key>
  ...
</input>
{
  "key":"value",
  ...
}
Key Value
url The detail URL for this input specification
type The type of the input specification: ZIP, TAR or ARC for archives, TWITTER_SEARCH for the JSON result of a Twitter search API call, TWITTER_STREAM for Tweets in Twitter's streaming format, or DATASIFT_STREAM for social media posts represented in the "interaction" format produced by DataSift.
sourceBundle For inputs derived from a data bundle, the API endpoint which can be queried for information about the bundle

encoding The character encoding to use when reading text documents out of the archive, e.g. UTF-8, ISO-8859-1, etc. If absent, entries from within ZIP and TAR archives will be assumed to be UTF-8, and entries from ARC files will be read using the encoding specified in the record header. This should always be left blank for TWITTER_SEARCH, TWITTER_STREAM and DATASIFT_STREAM inputs.
mimeTypeOverride The MIME type that should be used to read all entries from this archive. If absent, entries from ZIP and TAR files will have their MIME type guessed based on their file name, and entries from ARC files will be read using the MIME type specified in the record header. This should always be left blank for TWITTER_SEARCH, TWITTER_STREAM and DATASIFT_STREAM inputs.
fileExtensions (ZIP and TAR inputs only) Comma-separated list of file extensions identifying the entries from the archive that should be processed, e.g. html, xml, txt. If absent, all files in the archive will be processed
mimeTypes (ARC inputs only) Comma-separated list of MIME type prefixes identifying the entries from the archive that should be processed, e.g. text/plain, text/html. If absent, all entries are processed (including, for example, those with image/* MIME types)

6.4. Deleting an input specification

DELETE <inputSpecURL>

To delete a particular input specification, make a DELETE request to the "details URL" returned when listing a job's input specifications. The request takes no further parameters, and if successful it returns a 204 (no content) status code.

Note that deleting the specification does not delete the underlying bundle. You must delete this yourself once it is no longer required, and you will continue to incur monthly storage charges until you do.

7. The required outputs

A GATE Cloud annotation job requires one or more output specifications, defining what to do with the documents once they have been processed. The documents can be saved to files in one of four formats (GATE standoff XML, inline XML tags, JSON, or the XCES xcesAna standoff format), or sent to a Mímir indexing server.

7.1. Listing a job's output specifications

GET https://cloud.gate.ac.uk/api/job/{jobID}/output

Response format

XML, followed by the equivalent JSON:
<outputs>
  <output>
    <key>value</key>
  </output>
  ...
</outputs>
[
  {
    "key":"value",
    ...
  },
  ...
]

The response is a list of items, one per output specification, in the same format as is returned when querying the details of a single output (described below). In particular, each output includes a url which can be used to manage that specific output with GET, POST and DELETE requests as detailed below.

7.2. Adding a new output specification

POST https://cloud.gate.ac.uk/api/job/{jobID}/output

Request format

XML, followed by the equivalent JSON:
<output>
  <key>value</key>
  ...
</output>
{
  "key":"value",
  ...
}
Key Value
type The type of the output specification: GATE_XML, INLINE_XML, XCES, JSON or MIMIR (default GATE_XML)

For GATE_XML, INLINE_XML, XCES and JSON, the following keys are also accepted:

Key Value
fileExtension The extension for the output file, e.g. .GATE.xml. This will be appended to the name of the original input file to produce the output file name
annotationSelectors A comma-separated list of annotation selector expressions defining which annotations from the document should be saved

Annotation selector expressions are strings of the form annotationSetName:annotationType. If the annotation set name is omitted it is taken to mean the default (unnamed) annotation set, and if the annotation type is omitted then all annotations in the selected set are output. For example, ":Person" would output all Person annotations from the default set, and "Results:" would output all annotations (regardless of type) in the set named "Results". In particular, ":" is a valid selector expression, representing all annotations in the default annotation set.

Example:

<annotationSelectors>:Person, :Organisation, combined:Entity</annotationSelectors>

Many pipelines provide a default set of annotation selectors appropriate to the annotations they produce, so in many cases you can omit the annotationSelectors parameter. If you have specific requirements (e.g. you want to write some of the annotations to one output file and others to another), you can override the defaults. Note, however, that selectors are not validated: if you specify an annotation selector that is inconsistent with the annotation sets and types that the pipeline produces, you will simply receive an empty result. You may wish to test your selectors with a few documents against the on-line API before submitting a large batch job.
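To make the selector syntax concrete, the sketch below splits an annotationSelectors value into (set name, annotation type) pairs, using None for the default set and for "all types". It is a client-side illustration only: the function name is hypothetical, and trimming whitespace around commas and splitting at the first colon are assumptions about how the service reads the value.

```python
def parse_selectors(value):
    """Split 'set:type, set:type' into (set_name, annotation_type) pairs.

    None stands for the default annotation set or for all annotation types.
    """
    selectors = []
    for expr in value.split(","):
        expr = expr.strip()
        if not expr:
            continue
        set_name, sep, ann_type = expr.partition(":")
        if not sep:
            raise ValueError(f"selector {expr!r} has no ':' separator")
        selectors.append((set_name or None, ann_type or None))
    return selectors


print(parse_selectors(":Person, :Organisation, combined:Entity"))
# prints [(None, 'Person'), (None, 'Organisation'), ('combined', 'Entity')]
```

Because selectors are not validated by the service, a local check of this kind can catch a missing colon before a job is submitted, although it cannot tell whether the pipeline actually produces the named sets and types.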

For MIMIR, the following keys are accepted:

Key Value
indexUrl The URL of the Mímir index. For a Mímir server running in GATE Cloud this should be the private URL from the index information page
username, password If the Mímir server is configured to require HTTP basic authentication for indexing, the relevant username (typically "manager") and password

The complete document is always sent to the Mímir server, so annotation selectors are ignored.

The JSON output format is based on the format used by Twitter to represent entities in Tweet data. For each document, it consists of a JSON object with two properties, "text" containing the text of the document and "entities" containing the annotations as follows:

{
  "text":"The text of the document",
  "entities":{
    "Determiner":[
      {
        "indices":[0,3],
        "feature1":"value1",
        "feature2":"value2"
      },
      {
        "indices":[12,15],
        "feature3":"value3"
      }
    ]
  }
}

The "entities" value is a map from annotation type to array of annotation objects, with the annotation's position within the text represented as "indices":[start,end] (zero-based character offsets, start inclusive, end exclusive) and the annotation's features represented as the other JSON properties.
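The sketch below reads one document in this format and recovers the text covered by each annotation from its indices. The function name is illustrative. It assumes the offsets line up with Python string indices, which holds for the sample here but may need care for text containing characters outside the Basic Multilingual Plane.

```python
import json


def annotated_spans(document_json):
    """Return (annotation_type, covered_text, features) for every annotation."""
    doc = json.loads(document_json)
    text = doc["text"]
    spans = []
    for ann_type, annotations in doc["entities"].items():
        for ann in annotations:
            start, end = ann["indices"]  # start inclusive, end exclusive
            features = {k: v for k, v in ann.items() if k != "indices"}
            spans.append((ann_type, text[start:end], features))
    return spans


sample = """{
  "text": "The text of the document",
  "entities": {
    "Determiner": [
      {"indices": [0, 3], "feature1": "value1", "feature2": "value2"},
      {"indices": [12, 15], "feature3": "value3"}
    ]
  }
}"""

for ann_type, covered, features in annotated_spans(sample):
    print(ann_type, repr(covered), features)
```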

If the original document was Twitter JSON (ZIP entries with a .json file extension or the text/x-json-twitter MIME type, or inputs of one of the TWITTER_* types) then the output JSON will attempt to preserve as far as possible the JSON structure of the original Tweet. If the original Tweet contained "entities" then the output annotations will be merged with those from the original JSON.

The JSON output format respects annotation selectors, but will group all annotations of the same annotation type together regardless of which annotation set they came from. If you need to be able to distinguish annotations of the same type that came from different annotation sets, you should define separate outputs for each one, giving you separate JSON output files.

Response

A 303 redirect to the URL of the newly created output specification.

7.3. Querying details of an output specification

GET <outputSpecURL>

To query the current configuration of a particular output specification, make a GET request to the "details URL" returned when listing a job's output specifications.

Response format

XML, followed by the equivalent JSON:
<output>
  <key>value</key>
  ...
</output>
{
  "key":"value",
  ...
}

Returned keys are type, fileExtension, annotationSelectors, indexUrl and username (but not password), as described above.

7.4. Modifying details of an output specification

POST <outputSpecURL>

To modify the configuration of a particular output specification, make a POST request to the "details URL" returned when listing a job's output specifications.

The request format is identical to that used to add a new output specification, but modifies an existing specification rather than adding a new one. To un-set a value that was previously set, specify the key as null in JSON (e.g. { "username":null }) or xsi:nil in XML (e.g. <username xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true" />).
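In Python, a JSON null is produced by serialising None, so a body that un-sets keys can be built as in this small sketch (the helper name is illustrative):

```python
import json


def unset_keys_body(*keys):
    """Return a JSON body that sets each given key to null."""
    return json.dumps({key: None for key in keys})


print(unset_keys_body("username", "password"))
# prints {"username": null, "password": null}
```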

Response

Exactly as for a GET request, with values reflecting the changes made by this request.

7.5. Deleting an output specification

DELETE <outputSpecURL>

To delete a particular output specification, make a DELETE request to the "details URL" returned when listing a job's output specifications. The request takes no further parameters, and if successful it returns a 204 (no content) status code.

8. Controlling a job

POST https://cloud.gate.ac.uk/api/job/{jobID}/control

This URL is the interface through which you can start a fully-configured job, stop a running job, resume a job that has been suspended by the system, or reset a completed job so it can be re-run.

Request format

XML, followed by the equivalent JSON:
<control>
  <action>{value}</action>
</control>
{
  "action":"{value}"
}

Supported actions

Action Valid state Effect
start READY Start a job running
resume SUSPENDED Resume a job that was suspended by the system due to insufficient credit
stop ACTIVE Abort a running job. Any running tasks will be aborted and pending tasks will be cancelled. Results from completed tasks will be available to download.
reset COMPLETED Reset a completed job to allow it to be run again. If the job has an associated setup price, this will be charged again. Any supplementary files from the previous run of the job will be deleted by the reset and will no longer be available to download, but the job's result data bundle will not be deleted - you must delete it yourself when it is no longer required, and will be charged a monthly storage fee until it is deleted.

If the request is successful, a 204 (no content) HTTP response is returned.
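A control request could be prepared as in the sketch below, which rejects action names other than the four listed above before anything is sent. The function name is hypothetical, the Content-Type value is an assumption, and authentication is omitted. Whether the job is in a valid state for the action is left for the server to decide.

```python
import json
from urllib.request import Request

ACTIONS = ("start", "resume", "stop", "reset")


def control_request(job_id, action):
    """Build (but do not send) a POST request to the job's control URL."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action {action!r}; expected one of {ACTIONS}")
    return Request(
        f"https://cloud.gate.ac.uk/api/job/{job_id}/control",
        data=json.dumps({"action": action}).encode("utf-8"),
        # Assumption: JSON bodies are announced with this content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = control_request(132, "start")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```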

9. Monitoring a running job

GET https://cloud.gate.ac.uk/api/job/{jobID}/log

Fetch log messages generated by the given job.

Query string parameters

Log messages are timestamped, and you can use the from and to query parameters to restrict the date/time range of messages retrieved. These parameters are ISO 8601 date and time strings, assumed to be UTC if no time zone is included. For example:

https://cloud.gate.ac.uk/api/job/15/log?from=2012-09-01T10:00:00&to=2012-09-01T12:00:00

Note that as + in URL encoding denotes a space, you will need to use %2B to encode the plus sign used to signify time zones east of UTC, for example:

https://cloud.gate.ac.uk/api/job/15/log?from=2012-09-01T10:00:00%2B02:00
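Letting a library do the encoding avoids the plus-sign pitfall. The sketch below builds the log URL from Python datetime objects; the function name is illustrative, and safe=":" is used only so that the output matches the style of the examples above (encoding the colons as %3A would be equally valid).

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def log_url(job_id, start=None, end=None):
    """Return the log URL, optionally restricted by from/to timestamps."""
    params = {}
    if start is not None:
        params["from"] = start.isoformat()
    if end is not None:
        params["to"] = end.isoformat()
    url = f"https://cloud.gate.ac.uk/api/job/{job_id}/log"
    if params:
        # "+" is not in the safe set, so it is encoded as %2B.
        url += "?" + urlencode(params, safe=":")
    return url


start = datetime(2012, 9, 1, 10, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(log_url(15, start))
# prints https://cloud.gate.ac.uk/api/job/15/log?from=2012-09-01T10:00:00%2B02:00
```

A datetime without tzinfo is serialised without an offset, which the service treats as UTC.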

Response format

XML, followed by the equivalent JSON:
<log>
  <entry date='{timestamp}'>Log message</entry>
  ...
</log>
[
  {
    "date":"{timestamp}",
    "message":"Log message"
  },
  ...
]

Log entries are returned in reverse chronological order, with the most recent message first.

10. Downloading results

Once a job is complete you will need to download its output files. Unless you have configured the job to use your own S3 bucket for output, the job's results will be converted into a data bundle. To download them you should use the data management API via the URL given by the resultBundle property returned when you query the job's details after it is complete. Note that if your GATE Cloud account is overdrawn beyond your credit limit you will not be able to download your results until you have made a payment.

Additional files such as an overall summary report may be available to download separately:

GET https://cloud.gate.ac.uk/api/job/{jobID}/reports

List the additional report files available for this job.

Response format

XML, followed by the equivalent JSON:
<reports>
  <report>{URL}</report>
  ...
</reports>
["{URL}", ...]

To download individual supplementary files, simply make a GET request to the URL returned in the listing. You must follow any HTTP redirects resulting from the initial request.