GATE Cloud on-line API

GATE Cloud provides an "on-line" API to process individual documents and return the annotated results in real time. As with all GATE Cloud APIs, you will need to generate an API key on the website and use it for HTTP basic authentication of all requests.

The online service accepts the data to be annotated as a standard HTTP POST request, and all configuration is done through HTTP protocol headers and URL query parameters. It can accept input in any of the formats supported by GATE Embedded, and returns the annotated result as either GATE XML, FastInfoset or JSON.

The endpoint URL for a particular pipeline can be found on the pipeline's description page linked from the GATE Cloud shop; to process a document you must make a POST request to this URL. (For example, the English Named Entity Recognizer has the endpoint https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer.) The POST body should be the data to be annotated, and the following headers are supported:

Request HTTP headers

  • Authorization - HTTP basic authentication with your API key credentials
  • Content-Type - the MIME type of the request. This can be any of the MIME types supported by GATE, including:
    • text/plain - plain text
    • text/xml or application/xml - XML (including GATE's own standoff XML format)
    • text/html - HTML
    • application/pdf, application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/vnd.oasis.opendocument.text - PDF, Word and OpenDocument binary formats.
    • text/x-json-twitter - Twitter format JSON, with the text as "text" and any pre-existing standoff annotations as "entities"
    • text/x-json-datasift - DataSift format JSON
    • application/fastinfoset - FastInfoset representation of GATE XML format
    • text/x-pubmed and text/x-cochrane - documents in the textual formats used by PubMed and the Cochrane Library
  • Accept - MIME type of the desired response type - see below for more details.
  • Content-Encoding - the API can accept GZIP-compressed request bodies; if the body is compressed, set Content-Encoding: gzip to inform the server.
  • Accept-Encoding - likewise, the API can return compressed responses; if your client can handle them, set Accept-Encoding: gzip.
  • Expect: 100-continue - the API supports the standard HTTP "100 Continue" mechanism to allow it to verify the acceptability of the supplied authentication credentials and check the rate limits as soon as the request headers have been received. This way, if there is a problem with the request the client can be informed before sending the request body. If your client supports 100-continue you are encouraged to include the Expect header.
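
As an illustration of how these headers fit together, the following is a minimal Python sketch that builds (but does not send) a plain-text request using only the standard library. The function name build_request, the key_id/password pair and the choice of UTF-8 for the body are assumptions made for this example; check your API key details on the website for the exact credentials to use.

```python
import base64
import gzip
import urllib.request


def build_request(endpoint, key_id, password, text, compress=False):
    """Build a POST request for the online API (hypothetical helper)."""
    # Assumption: the plain-text body is sent as UTF-8.
    body = text.encode("utf-8")
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{key_id}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "text/plain",
        "Accept": "application/gate+json",
    }
    if compress:
        # Tell the server that the body is GZIP compressed.
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


req = build_request(
    "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer",
    "my-key-id",      # placeholder credentials, not real values
    "my-password",
    "Hello world",
    compress=True,
)
```

The resulting object can be passed to urllib.request.urlopen(req) to perform the call. The sketch does not set Accept-Encoding, because urllib does not decompress responses automatically.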

URL query parameters

The following parameters can be passed to the service using the URL query string:

  https://..../endpoint?key=value&key=value
  • annotations - annotation selector expressions specifying the annotation types you would like the service to return. You can find each pipeline's available selectors under "Annotation details" on the pipeline's detail page. If omitted, the pipeline's defaults will be used. Multiple selectors can be specified either by repeating the parameter (?annotations=:Person&annotations=:Location) or by providing a comma-separated list in a single parameter (?annotations=:Person,:Location).
  • nextAnnotationId - all GATE annotations have a numeric identifier, which can be used for things like linking annotations into co-reference chains. The GATE XML and FastInfoset response formats (see below) include the annotation IDs in their responses. If the resulting annotations are to be inserted into an existing GATE document then it is important that the IDs returned by the service do not conflict with IDs of any existing annotations in the document. This parameter allows you to specify a minimum ID number (typically one greater than the highest ID in the source document), and it is guaranteed that the service will not return any new annotations with IDs less than this number.
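
A query string of this form could be assembled as in the sketch below. The helper name build_url is hypothetical, and the sketch uses the repeated-parameter style for multiple selectors. Note that urlencode percent-encodes the leading colon of each selector (":Person" becomes "%3APerson"), which is ordinary URL encoding.

```python
from urllib.parse import urlencode


def build_url(endpoint, annotations=(), next_annotation_id=None):
    """Append the optional query parameters to an endpoint URL (hypothetical helper)."""
    # One "annotations" parameter per selector.
    params = [("annotations", selector) for selector in annotations]
    if next_annotation_id is not None:
        params.append(("nextAnnotationId", str(next_annotation_id)))
    if not params:
        return endpoint
    return endpoint + "?" + urlencode(params)


url = build_url(
    "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer",
    annotations=[":Person", ":Location"],
    next_annotation_id=1000,
)
```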

Response format

The GATE Cloud Online Processing service supports several output formats, selected by the Accept header of the request.

GATE JSON format

GATE JSON is a JSON format based on the format used by Twitter to represent entities in Tweet data. For each document, it consists of a JSON object with two properties, "text" containing the text of the document and "entities" containing the annotations as follows:

{
  "text":"The text of the document",
  "entities":{
    "SampleAnnotationType1":[
      {
        "indices":[0,3],
        "feature1":"value1",
        "feature2":"value2"
      }
    ],
    "SampleAnnotationType2":[
      {
        "indices":[12,15],
        "feature3":"value3"
      }
    ]
  }
}

The "entities" value is a map from annotation type to an array of annotation objects of that type. Each object gives the annotation's position within the text as "indices":[start,end] (zero-based character offsets, start inclusive, end exclusive); the remaining JSON properties of the object are the annotation's features.
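
To show how such a response might be consumed, here is a small Python sketch that reads the sample document above and recovers the text covered by each annotation. The function name annotation_spans is hypothetical, and the sketch assumes the offsets line up with Python string indexes, which is worth verifying for text containing characters outside the Basic Multilingual Plane.

```python
import json

SAMPLE = """
{
  "text": "The text of the document",
  "entities": {
    "SampleAnnotationType1": [
      {"indices": [0, 3], "feature1": "value1", "feature2": "value2"}
    ],
    "SampleAnnotationType2": [
      {"indices": [12, 15], "feature3": "value3"}
    ]
  }
}
"""


def annotation_spans(gate_json):
    """Return (type, covered text, features) for every annotation."""
    doc = json.loads(gate_json)
    text = doc["text"]
    spans = []
    for ann_type, annotations in doc.get("entities", {}).items():
        for ann in annotations:
            start, end = ann["indices"]  # start inclusive, end exclusive
            # Every property other than "indices" is a feature.
            features = {k: v for k, v in ann.items() if k != "indices"}
            spans.append((ann_type, text[start:end], features))
    return spans


spans = annotation_spans(SAMPLE)
```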

If the original document was Twitter JSON (i.e. it was sent with text/x-json-twitter MIME type), then the output JSON will attempt to preserve as far as possible the JSON structure of the original Tweet. If the original Tweet contained "entities" then the output annotations will be merged with those from the original JSON.

To request GATE JSON output, send an Accept header of application/gate+json (or application/json).

GATE Standoff XML

This is the XML-based format used by GATE Developer and described in the GATE user guide.

To request GATE XML output, send an Accept header of application/gate+xml (or application/xml). The API can also return the same XML data using the binary FastInfoset serialization mechanism, which can be requested with an Accept header of application/fastinfoset.

By default, the GATE XML response formats encode the full GATE document, including the text, document features, and the selected annotations. The annotation offsets are indexes into the text as given in the response, which may differ from the data sent in the request: for HTML and XML requests GATE strips out the markup tags, leaving just the plain text, and for binary formats like PDF there is no real concept of "character offsets" in the original data. When you are sending plain text to be annotated, however, you can opt to omit the text and receive only the annotations in the response by adding a parameter to the Accept header:

  Accept: application/gate+xml; includeText=no

  Accept: application/fastinfoset; includeText=no

In this case the returned XML will still have the <GateDocument> root element, but this will contain only the <AnnotationSet> elements from the GATE XML, without the <GateDocumentFeatures> or the <TextWithNodes>.
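
The following Python sketch shows one way an annotations-only response might be read with the standard library. The sample XML is hand-written for illustration and is not real service output; the attribute names it relies on (Name, Id, Type, StartNode, EndNode) are assumptions based on the GATE XML format and should be checked against the GATE user guide. The function name read_annotations is hypothetical.

```python
import xml.etree.ElementTree as ET

# Accept header value requesting GATE XML without the document text.
ACCEPT_ANNOTATIONS_ONLY = "application/gate+xml; includeText=no"

# Hand-written illustrative fragment (assumed structure, not real output).
SAMPLE = """<GateDocument>
  <AnnotationSet>
    <Annotation Id="1000" Type="Person" StartNode="0" EndNode="10"/>
    <Annotation Id="1001" Type="Location" StartNode="20" EndNode="26"/>
  </AnnotationSet>
</GateDocument>"""


def read_annotations(xml_text):
    """Return (set name, type, start, end, id) for each annotation."""
    root = ET.fromstring(xml_text)
    result = []
    for ann_set in root.iter("AnnotationSet"):
        set_name = ann_set.get("Name")  # None when the set has no Name attribute
        for ann in ann_set.iter("Annotation"):
            result.append((
                set_name,
                ann.get("Type"),
                int(ann.get("StartNode")),
                int(ann.get("EndNode")),
                int(ann.get("Id")),
            ))
    return result


annotations = read_annotations(SAMPLE)
```

Because the text is omitted from the response, the start and end values are interpreted against the plain text that was originally sent.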

Error messages

When a problem prevents the correct execution of the request, an error response is returned with an HTTP status code specifying the type of the error, as described below, together with a human-readable error message.

Status Code   Description
40x           Problems with the user input
50x           Errors during the execution of the request
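
A client could map these status codes to a handling strategy along the following lines. This is only a sketch: the function name describe_status is hypothetical, and it reads "40x" and "50x" as the whole 4xx and 5xx ranges, which is an assumption.

```python
def describe_status(status):
    """Classify an HTTP status code returned by the service."""
    if 200 <= status < 300:
        return "success"
    if 400 <= status < 500:
        # The request itself needs fixing, so retrying unchanged will not help.
        return "problem with the user input"
    if 500 <= status < 600:
        return "error during the execution of the request"
    return "unexpected status"


outcomes = [describe_status(code) for code in (200, 401, 500)]
```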