Job Management REST API
Contents
- 1. Introduction
- 2. List all jobs
- 3. Job details
- 3.1. Query job details
- 3.2. Update job details
- 4. The application to be run
- 4.1. Querying the current application location
- 4.2. Changing the application for a custom job
- 5. Job results storage
- 5.1. Querying the current output directory
- 5.2. Changing the output directory
- 6. The documents to process
- 6.1. Listing a job's input specifications
- 6.2. Adding a new input specification
- 6.3. Querying details of an input specification
- 6.4. Deleting an input specification
- 7. The required outputs
- 7.1. Listing a job's output specifications
- 7.2. Adding a new output specification
- 7.3. Querying details of an output specification
- 7.4. Modifying details of an output specification
- 7.5. Deleting an output specification
- 8. Controlling a job
- 9. Monitoring a running job
- 10. Downloading results
1. Introduction
This document describes the REST API provided by GATE Cloud to configure and control annotation jobs. General information about GATE Cloud REST APIs can be found on this page. The available API calls break down into several groups:
- General information
  - List all jobs
  - Basic job details (name, pipeline details and pricing)
- Input sources
  - List configured input sources
  - Add/delete input definitions
- Output configuration
  - List configured output definitions
  - Add/delete output definitions
- Job control
  - Start job execution
  - Monitor execution progress
  - Abort a misconfigured job
  - Resume a suspended job
- Results
  - Download job results and execution reports
These APIs are explained in detail in the following sections.
2. List all jobs
GET https://cloud.gate.ac.uk/api/job
Request a summary of all jobs owned by the authenticating user.
Query string parameters
- state (optional) show only jobs in a particular state or states (RESERVED, READY, ACTIVE, COMPLETED, DELETED, SUSPENDED). If omitted, default is to list RESERVED, READY, ACTIVE and COMPLETED jobs.
Example: https://cloud.gate.ac.uk/api/job?state=ACTIVE&state=SUSPENDED would list ACTIVE and SUSPENDED jobs only.
Response format
XML | JSON |
<jobs> <job> <url>https://cloud.gate.ac.uk/api/job/132</url> <state>ACTIVE</state> </job> ... </jobs> | [ { "url":"https://cloud.gate.ac.uk/api/job/132", "state":"ACTIVE" }, ... ] |
The url can be queried to get more details about a specific job.
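As a minimal sketch of the call above, the following Python builds the listing URL with a repeated state parameter and picks job URLs out of an already-parsed JSON response. The helper names are illustrative only, and authentication and the actual HTTP request are left out because they are covered by the general REST API information rather than this page.

```python
from urllib.parse import urlencode

API_ROOT = "https://cloud.gate.ac.uk/api/job"


def job_list_url(states=()):
    """Return the job-listing URL, repeating state= once per requested state."""
    if not states:
        # No filter: the server applies its default set of states.
        return API_ROOT
    # A list of (key, value) pairs produces state=A&state=B.
    return API_ROOT + "?" + urlencode([("state", s) for s in states])


def urls_in_state(jobs, state):
    """Pick the detail URLs of jobs in one state from a parsed JSON listing."""
    return [job["url"] for job in jobs if job["state"] == state]


print(job_list_url(["ACTIVE", "SUSPENDED"]))
```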
3. Job details
3.1. Query job details
GET https://cloud.gate.ac.uk/api/job/{jobID}
Get basic details for a job. This does not include the job's application, input or output specifications, which are available through separate calls.
Response format
XML | JSON |
<job> <key>value</key> ... <price setup="0" hour="0.8" mib="0" /> </job> | { "key":"value", ... "price":{ "setup":0, "hour":0.8, "mib":0 } } |
Key | Value |
id | The job identifier |
name | The job name |
uuid | Randomly-generated UUID used as part of the path when storing job files in S3 |
state | The job state (ACTIVE, COMPLETED, etc., see above) |
price | The one-off setup price and price per processing hour/MiB of data for this job |
dateCreated | Date and time when the job was first created |
dateCompleted, resultsAvailableUntil | If the job state is COMPLETED, the date and time when it finished and the date until which the results will be available for download |
resultBundle | If the job state is COMPLETED, an API URL that can be used to access the data bundle containing the job's results |
timeUsed, timeCharged | If the job has been started, how much processing time has so far been used/charged for |
bytesUsed, bytesCharged | If the job has been started, how much data has so far been processed/charged for |
applicationFixed, inputsFixed | true if this job has a fixed application location or input corpus, omitted otherwise |
3.2. Update job details
POST https://cloud.gate.ac.uk/api/job/{jobID}
Modify the job details. Currently the only modifiable "detail" is the job name.
Request format
XML | JSON |
<job> <name>New name</name> </job> | { "name":"New name" } |
Response
Exactly as for the GET case above.
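A rename request could be prepared along the following lines. This is a sketch only: the function name is hypothetical, and using a Content-Type of application/json to select the JSON format is an assumption, since format negotiation and authentication are described in the general REST API information rather than here.

```python
import json
from urllib.request import Request


def rename_job_request(job_id, new_name):
    """Build (but do not send) a POST request that changes the job name."""
    body = json.dumps({"name": new_name}).encode("utf-8")
    return Request(
        f"https://cloud.gate.ac.uk/api/job/{job_id}",
        data=body,
        method="POST",
        # Assumption: the JSON request format is signalled this way.
        headers={"Content-Type": "application/json"},
    )


req = rename_job_request(132, "New name")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```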
4. The application to be run
GATE Cloud annotation jobs work by running a GATE application over your documents on the cloud. The application is provided as a ZIP package, such as would be created by the "Export for GATE Cloud" option in GATE Developer. Custom annotation jobs allow you to provide your own application, while marketplace jobs allow you to run an application provided by a third party.
For custom jobs, you can either upload your application package to storage space managed by GATE Cloud or, if your package is already stored in Amazon S3, simply provide its location (bucket name and key). If the S3 bucket containing the package allows anonymous access, this is sufficient. If it does not, you will also need to provide an access key ID/secret key pair that has permission to download the package from that location. This would typically be an IAM user with limited permissions; do not use your master credentials.
4.1. Querying the current application location
GET https://cloud.gate.ac.uk/api/job/{jobID}/application
Response format
XML | JSON |
<application> <location>s3://bucketName/key</location> <!-- one of the following three --> <fixed>true</fixed> <putUrl>...</putUrl> <accessKey>...</accessKey> </application> | { "location":"s3://bucketName/key", /* one of the following three */ "fixed":true, "putUrl":"...", "accessKey":"..." } |
Key | Value |
location | The location in Amazon S3 of the application ZIP package. This may be in a GATE Cloud bucket or in your own bucket |
fixed | true if this is a marketplace job with a fixed application location |
putUrl | If your application is to be stored in a GATE Cloud-owned location, this is a URL that will allow you to upload a replacement application ZIP package via an HTTP(S) PUT request |
accessKey | If your application package is in your own S3 bucket, an Amazon access key ID that has the rights to download it (typically an IAM user). |
Note the putUrl will only work for a limited time, so you should upload your file to this URL immediately. A new URL is generated every time you request the application details, so if you find your PUT URL has expired simply access this operation again to generate a fresh one. When uploading a file to the putUrl the PUT request must have a Content-Type of "application/octet-stream" and the correct Content-Length, and must not have a Content-MD5 header.
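The header requirements for the upload can be illustrated with a short Python sketch. The function name and the placeholder URL are hypothetical; the real putUrl comes from the application details response.

```python
from urllib.request import Request


def upload_request(put_url, package_bytes):
    """Build (but do not send) the PUT request for an application ZIP package."""
    return Request(
        put_url,
        data=package_bytes,
        method="PUT",
        headers={
            # Required content type for the upload.
            "Content-Type": "application/octet-stream",
            # Must match the size of the body exactly.
            "Content-Length": str(len(package_bytes)),
            # Deliberately no Content-MD5 header.
        },
    )


# Placeholder URL and bytes, for illustration only.
req = upload_request("https://example.invalid/upload", b"PK\x03\x04")
print(req.get_method(), req.header_items())
```

Passing the resulting object to urllib.request.urlopen would perform the upload; since the putUrl expires, it makes sense to fetch the application details immediately beforehand.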
4.2. Changing the application for a custom job
POST https://cloud.gate.ac.uk/api/job/{jobID}/application
Request format - upload
If you wish to upload your package to a GATE Cloud bucket:
XML | JSON |
<application> <fileName>applicationName.zip</fileName> </application> | { "fileName":"applicationName.zip" } |
Request format - own bucket
If your application is in your own S3 bucket:
XML | JSON |
<application> <s3Location>s3://myBucket/path/app.zip</s3Location> <accessKey>AKIA......</accessKey> <secretKey>...</secretKey> </application> | { "s3Location":"s3://myBucket/path/app.zip", "accessKey":"AKIA......", "secretKey":"..." } |
Response
Exactly as for the GET case above.
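The two request bodies can be sketched in Python as follows. The function names are hypothetical, and the s3:// prefix check is a convenience added for the example rather than documented server behaviour.

```python
import json


def upload_payload(file_name):
    """Body asking GATE Cloud to host the package (upload variant)."""
    return {"fileName": file_name}


def own_bucket_payload(s3_location, access_key=None, secret_key=None):
    """Body pointing at a package that already lives in your own S3 bucket."""
    if not s3_location.startswith("s3://"):
        raise ValueError("expected a location of the form s3://bucket/key")
    payload = {"s3Location": s3_location}
    # Credentials are only needed when the bucket is not anonymously readable.
    if access_key is not None and secret_key is not None:
        payload["accessKey"] = access_key
        payload["secretKey"] = secret_key
    return payload


print(json.dumps(upload_payload("applicationName.zip")))
print(json.dumps(own_bucket_payload("s3://myBucket/path/app.zip")))
```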
5. Job results storage
By default, job output files are stored in a GATE Cloud-managed storage location and transferred to a data bundle when the job completes. If you have your own Amazon web services account you can instead choose to have the output files written directly to your own S3 bucket.
5.1. Querying the current output directory
GET https://cloud.gate.ac.uk/api/job/{jobID}/outputDirectory
Response format
XML | JSON |
<outputDirectory> <!-- either --> <defaultLocation>true</defaultLocation> <!-- or --> <s3Location>s3://bucketName/keyPrefix/</s3Location> <accessKey>...</accessKey> </outputDirectory> | { /* either */ "defaultLocation":true /* or */ "s3Location":"s3://bucketName/keyPrefix/", "accessKey":"..." } |
Key | Value |
defaultLocation | If true, indicates that the job uses the default GATE Cloud-managed storage location |
s3Location | If the job uses your own bucket, this is the location to which output files will be written. It is a string of the form s3://bucketName/keyPrefix — the files will be written to the given bucket and their keys (i.e. file names) will have the given "directory name" prefix prepended. |
accessKey | If the job uses your own S3 bucket, an Amazon access key ID that has the rights to upload objects to the given location (typically an IAM user). |
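To make the s3Location form concrete, the sketch below splits a location into bucket and key prefix and shows where a given output file would land. It assumes the final key is the plain concatenation of prefix and file name, which is how the description above reads; the function names and the sample file name are illustrative.

```python
def split_s3_location(location):
    """Split s3://bucketName/keyPrefix into (bucketName, keyPrefix)."""
    scheme = "s3://"
    if not location.startswith(scheme):
        raise ValueError("expected a location of the form s3://bucket/prefix")
    bucket, _, key_prefix = location[len(scheme):].partition("/")
    return bucket, key_prefix


def output_key(location, file_name):
    """Return (bucket, key) for an output file, assuming prefix + file name."""
    bucket, key_prefix = split_s3_location(location)
    return bucket, key_prefix + file_name


print(output_key("s3://myBucket/results/", "doc1.txt.GATE.xml"))
```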
5.2. Changing the output directory
POST https://cloud.gate.ac.uk/api/job/{jobID}/outputDirectory
Request format - default location
To (re)set the job to use the default GATE Cloud-managed output storage:
XML | JSON |
<outputDirectory> <defaultLocation>true</defaultLocation> </outputDirectory> | { "defaultLocation":true } |
Request format - own bucket
If you want to send output to your own S3 bucket:
XML | JSON |
<outputDirectory> <s3Location>s3://myBucket/keyPrefix/</s3Location> <accessKey>AKIA......</accessKey> <secretKey>...</secretKey> </outputDirectory> | { "s3Location":"s3://myBucket/keyPrefix/", "accessKey":"AKIA......", "secretKey":"..." } |
Response
Exactly as for the GET case above.
6. The documents to process
The documents to be processed by a GATE Cloud annotation job are provided to the system in data bundles (which can be either uploaded directly or created automatically as the result of another annotation job). See the data API documentation for more details.
6.1. Listing a job's input specifications
GET https://cloud.gate.ac.uk/api/job/{jobID}/input
Response format
XML | JSON |
<inputs> <input> <url>{detail URL}</url> <type>{input type}</type> <sourceBundle>{bundle URL}</sourceBundle> </input> ... </inputs> | [ { "url":"detail URL", "type":"input type", "sourceBundle":"bundle URL" }, ... ] |
The url responds to GET, POST and DELETE requests as detailed below.
6.2. Adding a new input specification
POST https://cloud.gate.ac.uk/api/job/{jobID}/input
Request format
XML | JSON |
<input> <sourceBundle>{bundle ID}</sourceBundle> </input> | { "sourceBundle":bundleID } |
The bundle ID is the identifier of one of your data bundles, which can be found via your web dashboard (remove the leading "D-") or the data API.
Response
A 303 redirect to the URL of the newly created input specification.
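A small sketch of preparing the request body from a dashboard label follows. It assumes bundle identifiers are integers, which the unquoted bundleID in the JSON example suggests but does not state outright; the function names are hypothetical.

```python
import json


def bundle_id_from_dashboard(label):
    """Turn a dashboard label such as 'D-42' into the bare bundle ID."""
    text = label.strip()
    if text.upper().startswith("D-"):
        text = text[2:]  # drop the leading "D-"
    # Assumption: the remaining identifier is numeric.
    return int(text)


def add_input_body(dashboard_label):
    """JSON body for adding an input specification from a data bundle."""
    return json.dumps({"sourceBundle": bundle_id_from_dashboard(dashboard_label)})


print(add_input_body("D-42"))
```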
6.3. Querying details of an input specification
GET <inputSpecURL>
To query the current configuration of a particular input specification, make a GET request to the "details URL" returned when listing a job's input specifications.
Response format
XML | JSON |
<input> <key>value</key> ... </input> | { "key":"value", ... } |
Key | Value |
url | The detail URL for this input specification |
type | The type of the input specification: ZIP, TAR or ARC for archives, TWITTER_SEARCH for the JSON result of a Twitter search API call, TWITTER_STREAM for Tweets in Twitter's streaming format, or DATASIFT_STREAM for social media posts represented in the "interaction" format produced by DataSift. |
sourceBundle | For inputs derived from a data bundle, the API endpoint which can be queried for information about the bundle |
encoding | The character encoding to use when reading text documents out of the archive, e.g. UTF-8, ISO-8859-1, etc. If absent, entries from within ZIP and TAR archives will be assumed to be UTF-8, and entries from ARC files will be read using the encoding specified in the record header. This should always be left blank for TWITTER_SEARCH, TWITTER_STREAM and DATASIFT_STREAM inputs. |
mimeTypeOverride | The MIME type that should be used to read all entries from this archive. If absent, entries from ZIP and TAR files will have their mime type guessed based on their file name, and entries from ARC files will be read using the MIME type specified in the record header. This should always be left blank for TWITTER_SEARCH, TWITTER_STREAM and DATASIFT_STREAM inputs. |
fileExtensions | (ZIP and TAR inputs only) Comma-separated list of file extensions identifying the entries from the archive that should be processed, e.g. html, xml, txt. If absent, all files in the archive will be processed |
mimeTypes | (ARC inputs only) Comma-separated list of MIME type prefixes identifying the entries from the archive that should be processed, e.g. text/plain, text/html. If absent, all entries are processed (including, for example, those with image/* MIME types) |
6.4. Deleting an input specification
DELETE <inputSpecURL>
To delete a particular input specification, make a DELETE request to the "details URL" returned when listing a job's input specifications. The request takes no further parameters, and if successful it returns a 204 (no content) status code.
Note that deleting the specification does not delete the underlying bundle. You must delete this yourself once it is no longer required, and you will continue to incur monthly storage charges until you do.
7. The required outputs
A GATE Cloud annotation job requires one or more output specifications, defining what to do with the documents once they have been processed. The documents can be saved to files in one of four formats (GATE standoff XML, inline XML tags, JSON, or the XCES xcesAna standoff format), or sent to a Mímir indexing server.
7.1. Listing a job's output specifications
GET https://cloud.gate.ac.uk/api/job/{jobID}/output
Response format
XML | JSON |
<outputs> <output> <key>value</key> </output> ... </outputs> | [ { "key":"value", ... }, ... ] |
The response is a list of items, one per output specification, in the same format as is returned when querying the details of a single output (described below). In particular, each output includes a url which can be used to manage that specific output with GET, POST and DELETE requests as detailed below.
7.2. Adding a new output specification
POST https://cloud.gate.ac.uk/api/job/{jobID}/output
Request format
XML | JSON |
<output> <key>value</key> ... </output> | { "key":"value", ... } |
Key | Value |
type | The type of the output specification: GATE_XML, INLINE_XML, XCES, JSON or MIMIR (default GATE_XML) |
For GATE_XML, INLINE_XML, XCES and JSON, the following keys are also accepted:
Key | Value |
fileExtension | The extension for the output file, e.g. .GATE.xml. This will be appended to the name of the original input file to produce the output file name |
annotationSelectors | A comma-separated list of annotation selector expressions defining which annotations from the document should be saved |
Annotation selector expressions are strings of the form annotationSetName:annotationType. If the annotation set name is omitted it is taken to mean the default (unnamed) annotation set, and if the annotation type is omitted then all annotations in the selected set are output. For example, ":Person" would output all Person annotations from the default set, and "Results:" would output all annotations (regardless of type) in the set named "Results". In particular, ":" is a valid selector expression, representing all annotations in the default annotation set.
Example:
<annotationSelectors>:Person, :Organisation, combined:Entity</annotationSelectors>
Many pipelines provide a default set of annotation selectors appropriate to the annotations they produce, so in many cases you can omit the annotationSelectors parameter. If you have specific requirements (e.g. you want to write some of the annotations to one output file and others to another) you can override the defaults. Note, however, that there is no validity check: if you specify an annotation selector that is inconsistent with the annotation sets and types that the pipeline produces, you will simply receive an empty result. You may wish to test your selectors with a few documents against the on-line API before submitting a large batch job.
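Because the server does not validate selectors, it can help to check locally how an annotationSelectors string will be read. The sketch below parses the syntax described above into (set name, type) pairs, with None standing for "default set" or "all types". It is an illustration, not the server's parser, and it assumes set names and types contain no commas or colons and that whitespace around commas is ignored.

```python
def parse_selectors(expression):
    """Parse 'set:Type, set:Type' into a list of (set_name, type) tuples."""
    selectors = []
    for item in expression.split(","):
        item = item.strip()
        if not item:
            continue
        if ":" not in item:
            raise ValueError(f"selector {item!r} is missing the ':' separator")
        set_name, _, ann_type = item.partition(":")
        # Empty set name means the default set; empty type means all types.
        selectors.append((set_name or None, ann_type or None))
    return selectors


print(parse_selectors(":Person, :Organisation, combined:Entity"))
```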
For MIMIR, the following keys are accepted:
Key | Value |
indexUrl | The URL of the Mímir index. For a Mímir server running on GATE Cloud this should be the private URL from the index information page |
username, password | If the Mímir server is configured to require HTTP basic authentication for indexing, the relevant username (typically "manager") and password |
The complete document is always sent to the Mímir server, so annotation selectors are ignored.
The JSON output format is based on the format used by Twitter to represent entities in Tweet data. For each document, it consists of a JSON object with two properties, "text" containing the text of the document and "entities" containing the annotations as follows:
{ "text":"The text of the document", "entities":{ "Determiner":[ { "indices":[0,3], "feature1":"value1", "feature2":"value2" }, { "indices":[12,15], "feature3":"value3" } ] } }
The "entities" value is a map from annotation type to array of annotation objects, with the annotation's position within the text represented as "indices":[start,end] (zero-based character offsets, start inclusive, end exclusive) and the annotation's features represented as the other JSON properties.
If the original document was Twitter JSON (ZIP entries with a .json file extension or the text/x-json-twitter MIME type, or inputs of one of the TWITTER_* types) then the output JSON will attempt to preserve as far as possible the JSON structure of the original Tweet. If the original Tweet contained "entities" then the output annotations will be merged with those from the original JSON.
The JSON output format respects annotation selectors, but will group all annotations of the same annotation type together regardless of which annotation set they came from. If you need to be able to distinguish annotations of the same type that came from different annotation sets, you should define separate outputs for each one, giving you separate JSON output files.
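As an illustration of reading this format, the following Python walks the sample document shown above and recovers the text covered by each annotation. The function name is hypothetical. Note one assumption: the slice treats indices as offsets into a Python string, which matches the sample but may differ for text containing characters outside the Basic Multilingual Plane if the offsets count UTF-16 code units.

```python
import json

sample = """
{ "text":"The text of the document",
  "entities":{ "Determiner":[
    { "indices":[0,3], "feature1":"value1", "feature2":"value2" },
    { "indices":[12,15], "feature3":"value3" } ] } }
"""


def annotations(doc):
    """Yield (type, covered text, features) for each annotation in a document."""
    text = doc["text"]
    for ann_type, items in doc.get("entities", {}).items():
        for item in items:
            start, end = item["indices"]  # start inclusive, end exclusive
            features = {k: v for k, v in item.items() if k != "indices"}
            yield ann_type, text[start:end], features


results = list(annotations(json.loads(sample)))
for ann_type, covered, features in results:
    print(ann_type, repr(covered), features)
```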
Response
A 303 redirect to the URL of the newly created output specification.
7.3. Querying details of an output specification
GET <outputSpecURL>
To query the current configuration of a particular output specification, make a GET request to the "details URL" returned when listing a job's output specifications.
Response format
XML | JSON |
<output> <key>value</key> ... </output> | { "key":"value", ... } |
Returned keys are type, fileExtension, annotationSelectors, indexUrl and username (but not password), as described above.
7.4. Modifying details of an output specification
POST <outputSpecURL>
To modify the configuration of a particular output specification, make a POST request to the "details URL" returned when listing a job's output specifications.
The request format is identical to that used to add a new output specification, but modifies an existing specification rather than adding a new one. To un-set a value that was previously set, specify the key as null in JSON (e.g. { "username":null }) or xsi:nil in XML (e.g. <username xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true" />).
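In Python, the JSON null used to un-set a value corresponds to None, so an update body can be produced with the standard json module. The helper name below is illustrative only.

```python
import json


def output_update_body(**changes):
    """Serialise changes to an output specification; None becomes JSON null."""
    return json.dumps(changes)


# Clear the username and change the file extension in one request body.
print(output_update_body(username=None, fileExtension=".xml"))
```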
Response
Exactly as for a GET request, with values reflecting the changes made by this request.
7.5. Deleting an output specification
DELETE <outputSpecURL>
To delete a particular output specification, make a DELETE request to the "details URL" returned when listing a job's output specifications. The request takes no further parameters, and if successful it returns a 204 (no content) status code.
8. Controlling a job
POST https://cloud.gate.ac.uk/api/job/{jobID}/control
This URL is the interface through which you can start a fully-configured job, stop a running job, resume a job that has been suspended by the system, or reset a completed job so it can be re-run.
Request format
XML | JSON |
<control> <action>{value}</action> </control> | { "action":"{value}" } |
Supported actions
Action | Valid state | Effect |
start | READY | Start a job running |
resume | SUSPENDED | Resume a job that was suspended by the system due to insufficient credit |
stop | ACTIVE | Abort a running job. Any running tasks will be aborted and pending tasks will be cancelled. Results from completed tasks will be available to download. |
reset | COMPLETED | Reset a completed job to allow it to be run again. If the job has an associated setup price, this will be charged again. Any supplementary files from the previous run of the job will be deleted by the reset and will no longer be available to download, but the job's result data bundle will not be deleted - you must delete it yourself when it is no longer required, and will be charged a monthly storage fee until it is deleted. |
If the request is successful, a 204 (no content) HTTP response is returned.
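A control request might be prepared as in the sketch below, which rejects unknown action names before anything is sent. The function name is hypothetical, the JSON Content-Type header is an assumption, and authentication is omitted; whether the action is valid for the job's current state is left for the server to decide.

```python
import json
from urllib.request import Request

ACTIONS = ("start", "resume", "stop", "reset")


def control_request(job_id, action):
    """Build (but do not send) a POST request to the job's control URL."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action {action!r}; expected one of {ACTIONS}")
    return Request(
        f"https://cloud.gate.ac.uk/api/job/{job_id}/control",
        data=json.dumps({"action": action}).encode("utf-8"),
        method="POST",
        # Assumption: the JSON request format is signalled this way.
        headers={"Content-Type": "application/json"},
    )


req = control_request(132, "start")
print(req.full_url, req.data.decode("utf-8"))
```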
9. Monitoring a running job
GET https://cloud.gate.ac.uk/api/job/{jobID}/log
Fetch log messages generated by the given job.
Query string parameters
Log messages are timestamped, and you can use the from and to query parameters to restrict the date/time range of messages retrieved. These parameters are ISO8601-format date and time strings, assumed to be UTC if no timezone is included. For example:
https://cloud.gate.ac.uk/api/job/15/log?from=2012-09-01T10:00:00&to=2012-09-01T12:00:00
Note that as + in URL encoding denotes a space, you will need to use %2B to encode the plus sign used to signify time zones east of UTC, for example:
https://cloud.gate.ac.uk/api/job/15/log?from=2012-09-01T10:00:00%2B02:00
Response format
XML | JSON |
<log> <entry date='{timestamp}'>Log message</entry> ... </log> | [ { "date":"{timestamp}", "message":"Log message" }, ... ] |
Log entries are returned in reverse chronological order, with the most recent message first.
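The encoding pitfall with the plus sign can be avoided by letting a library build the query string. The sketch below uses Python's urlencode, which percent-encodes "+" as %2B (and also encodes the colons as %3A, which decodes to the same value). The function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def log_url(job_id, start=None, end=None):
    """Return the log URL, optionally restricted by from/to timestamps."""
    params = {}
    if start is not None:
        params["from"] = start.isoformat()
    if end is not None:
        params["to"] = end.isoformat()
    url = f"https://cloud.gate.ac.uk/api/job/{job_id}/log"
    return url + ("?" + urlencode(params) if params else "")


# 10:00 in a zone two hours east of UTC.
start = datetime(2012, 9, 1, 10, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(log_url(15, start))
```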
10. Downloading results
Once a job is complete you will need to download its output files. Unless you have configured the job to use your own S3 bucket for output, the job's results will be converted into a data bundle. To download them you should use the data management API via the URL given by the resultBundle property returned when you query the job's details after it is complete. Note that if your GATE Cloud account is overdrawn beyond your credit limit you will not be able to download your results until you have made a payment.
Additional files such as an overall summary report may be available to download separately:
GET https://cloud.gate.ac.uk/api/job/{jobID}/reports
List the additional report files available for this job.
Response format
XML | JSON |
<reports> <report>{URL}</report> ... </reports> | ["{URL}", ...] |
To download individual supplementary files, simply make a GET request to the URL returned in the listing. You must follow any HTTP redirects resulting from the initial request.