Data Management REST API

1. Introduction

This document describes the REST API that GATE Cloud provides for creating, managing, uploading and downloading persistent data sets on the GATE Cloud platform. These data sets are referred to as data bundles. General information about GATE Cloud REST APIs can be found on this page.

2. List existing bundles

GET https://cloud.gate.ac.uk/api/data/bundle

Request a summary of all data bundles owned by the authenticating user.

Response format

XML:
<bundles>
  <bundle
    id="NN"
    name="Bundle name"
    url="https://cloud.gate.ac.uk/api/data/bundle/NN"
    downloadable="true"
    closed="true"
  />
  ...
</bundles>

JSON:
[
  {
    "id":NN,
    "name":"Bundle name",
    "url":"https://cloud.gate.ac.uk/api/data/bundle/NN",
    "downloadable":true,
    "closed":true
  },
  ...
]

The url can be queried to get more details about a specific bundle.
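
As a minimal sketch, the listing call might be made with Python's standard library as shown here. The authentication scheme (HTTP Basic with an API key ID and password) and the use of an Accept header to select JSON are assumptions that this section does not state, and the helper names are hypothetical; check the general API page for the actual mechanism.

```python
import base64
import json
import urllib.request

BASE_URL = "https://cloud.gate.ac.uk/api/data/bundle"


def build_list_request(api_key_id, api_password):
    """Build (but do not send) the GET request that lists bundles."""
    # Assumption: HTTP Basic auth with an API key ID and password.
    token = base64.b64encode(f"{api_key_id}:{api_password}".encode()).decode()
    return urllib.request.Request(
        BASE_URL,
        method="GET",
        headers={
            "Authorization": f"Basic {token}",
            # Assumption: the JSON representation is chosen via Accept.
            "Accept": "application/json",
        },
    )


def parse_bundle_list(body):
    """Map each bundle id to its details URL from a JSON response body."""
    return {bundle["id"]: bundle["url"] for bundle in json.loads(body)}


def list_bundles(api_key_id, api_password):
    """Send the request and return {id: url}; needs network access."""
    request = build_list_request(api_key_id, api_password)
    with urllib.request.urlopen(request) as response:
        return parse_bundle_list(response.read().decode("utf-8"))
```

Keeping the request builder separate from the network call makes it possible to inspect the URL and headers without contacting the service.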

3. Details of a specific bundle

3.1. Query bundle details

GET https://cloud.gate.ac.uk/api/data/bundle/{bundleID}

Get the full details of a single data bundle.

Response format

XML:
<bundle>
  <key>value</key>
  ...
  <files>
    <file>{downloadURL}</file>
    ...
  </files>
</bundle>

JSON:
{
  "key":"value",
  ...
  "files":["downloadURL", ...]
}
Key | Value
id | The bundle's identifier
name | The bundle's display name
url | The API URL from which the bundle's details can be fetched
downloadable | (boolean) Does this bundle permit its contents to be downloaded directly? If false, the bundle can only be used as input to an annotation job
closed | (boolean) Is this bundle complete and ready for use (true) or is it still open for further files to be uploaded (false)?
dateCreated | Date when this bundle was first created
totalSize | The total amount of data in this bundle, in bytes
monthlyPrice | The total monthly storage cost for this bundle. May be zero for bundles that point to your own S3 bucket (i.e. you pay your own storage charges directly to Amazon)
type | The type of files stored in this bundle, for bundles that are usable as input to an annotation job. The valid types are described in the job management API documentation
encoding | The character encoding that an annotation job should use to read text documents out of this data bundle
mimeTypeOverride | The MIME type that an annotation job should assume when reading documents from this bundle
fileExtensions | (ZIP and TAR bundles only) Comma-separated list of file extensions identifying the entries within this bundle's archives that should be processed by an annotation job
mimeTypeFilters | (ARC and WARC bundles only) Comma-separated list of MIME type prefixes that identify the entries in the archive that should be processed by an annotation job

For full details of the type, encoding, mimeTypeOverride, fileExtensions and mimeTypeFilters options, see the job management API documentation.

Downloadable bundles also provide a list of URLs under the files property that can be used to download the bundle's contents. When retrieving these URLs you must follow all 3xx redirects.
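
The download step can be sketched as follows, assuming the details response has already been parsed from JSON into a dictionary. The function names are hypothetical. Whether the download URLs require credentials is not stated here, so the sketch sends none.

```python
import shutil
import urllib.request


def downloadable_files(details):
    """Return the download URLs from a parsed bundle-details response."""
    # Only downloadable bundles carry a "files" list.
    if not details.get("downloadable"):
        return []
    return list(details.get("files", []))


def download(url, dest_path):
    """Fetch one file to dest_path; needs network access.

    urllib.request.urlopen follows the standard redirect codes
    (301, 302, 303, 307) on GET requests by default.
    Assumption: no extra credentials are needed; add headers if the
    download URLs turn out to require them.
    """
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        # Stream to disk rather than holding the whole file in memory.
        shutil.copyfileobj(response, out)
```

If you use a different HTTP client, check that redirect following is enabled, since some clients leave it off by default.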

3.2. Update bundle details

POST https://cloud.gate.ac.uk/api/data/bundle/{bundleID}

Modify the bundle details. Currently the only modifiable "detail" is the bundle's display name.

Request format

XML:
<bundle>
  <name>New name</name>
</bundle>

JSON:
{
  "name":"New name"
}

Response

Exactly as for the GET case above.
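
A rename request might be built like this. It is a sketch only: the Content-Type and Accept values are assumptions about how the service selects the JSON format, and auth_header is a hypothetical parameter standing in for whatever Authorization value the API expects.

```python
import json
import urllib.request


def build_rename_request(bundle_id, new_name, auth_header):
    """Build (but do not send) the POST that changes a bundle's name."""
    body = json.dumps({"name": new_name}).encode("utf-8")
    return urllib.request.Request(
        f"https://cloud.gate.ac.uk/api/data/bundle/{bundle_id}",
        data=body,
        method="POST",
        headers={
            "Authorization": auth_header,
            # Assumption: JSON bodies are declared with this media type.
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Passing the result to urllib.request.urlopen would send it; the response body should then have the same shape as the GET response.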

4. Creating a new bundle

There are two options to create a new data bundle:

  • Upload a set of files from your local machine
  • Point to a set of files that are already stored in your own bucket on Amazon S3.

4.1. Uploading files to a bundle

Creating a bundle from uploaded files is a three-step process: first create the empty bundle, then add files one by one, and finally close the bundle. Note that all files uploaded to a bundle must be of the same kind (all ZIP files, all WARC files, etc.) and must share the same additional settings, such as file extension filters.

POST https://cloud.gate.ac.uk/api/data/bundle

Request format

XML:
<bundle>
  <key>value</key>
  ...
</bundle>

JSON:
{
  "key":"value",
  ...
}
Key | Value
name | The name for the new bundle
type | The type of files stored in this bundle, for bundles that are usable as input to an annotation job. The valid types are described in the job management API documentation
encoding | The character encoding that an annotation job should use to read text documents out of this data bundle
mimeTypeOverride | The MIME type that an annotation job should assume when reading documents from this bundle
fileExtensions | (ZIP and TAR bundles only) Comma-separated list of file extensions identifying the entries within this bundle's archives that should be processed by an annotation job
mimeTypeFilters | (ARC and WARC bundles only) Comma-separated list of MIME type prefixes that identify the entries in the archive that should be processed by an annotation job

For full details of the type, encoding, mimeTypeOverride, fileExtensions and mimeTypeFilters options, see the job management API documentation.

Response format

As for GET bundle details above, the details of the newly created bundle.

Once the bundle has been created you can upload files:

POST https://cloud.gate.ac.uk/api/data/bundle/{bundleID}/add

Request format

XML:
<add>
  <!-- file name extension should match
       the bundle type -->
  <fileName>archiveName.zip</fileName>
</add>

JSON:
{
  /* file name extension should match
   * the bundle type */
  "fileName":"archiveName.zip"
}

Response format

XML:
<putUrl>https://....</putUrl>

JSON:
{
  "putUrl":"https://...."
}

The putUrl is a URL to which you can upload the file using an HTTP PUT request. It will only work for a limited time, so you should upload your file to this URL immediately. A new URL is generated every time you post to the .../add URL, so if you find your PUT URL has expired simply POST again to generate a fresh one. When uploading a file to the putUrl the PUT request must have a Content-Type of "application/octet-stream" and the correct Content-Length, and must not have a Content-MD5 header.
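
The header requirements above can be illustrated with a small sketch that builds the PUT request using Python's standard library. The function name is hypothetical, and the sketch holds the whole file in memory, which may not suit very large archives.

```python
import urllib.request


def build_put_request(put_url, data):
    """Build (but do not send) the PUT that uploads one file's bytes."""
    return urllib.request.Request(
        put_url,
        data=data,
        method="PUT",
        headers={
            "Content-Type": "application/octet-stream",
            "Content-Length": str(len(data)),
            # No Content-MD5 header is set, as the service requires.
        },
    )


def upload(put_url, path):
    """Read a local file and PUT it to the URL; needs network access."""
    with open(path, "rb") as source:
        request = build_put_request(put_url, source.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

Because the URL expires quickly, it is safest to request it and call upload back to back rather than collecting several URLs first.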

Once all files have been successfully uploaded to their generated PUT URLs the bundle must be closed:

POST https://cloud.gate.ac.uk/api/data/bundle/{bundleID}/close

Request format

XML:
<action>close</action>

JSON:
{
  "action":"close"
}

Response format

As for GET bundle details above, the details of the newly created bundle.
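
Putting the three steps together, the whole upload sequence might look like the sketch below. To keep it independent of any particular HTTP client or authentication scheme, the network calls are passed in as two hypothetical callables: send(method, url, payload), which performs an authenticated JSON API call and returns the parsed response, and put_file(put_url, data), which performs the HTTP PUT.

```python
BASE_URL = "https://cloud.gate.ac.uk/api/data/bundle"


def upload_bundle(send, put_file, settings, files):
    """Create a bundle, upload each file, then close the bundle.

    send and put_file are caller-supplied callables (hypothetical).
    settings is the dict of bundle keys (name, type, ...).
    files maps each file name to its content as bytes.
    Returns the details of the closed bundle.
    """
    # Step 1: create the empty bundle and note its identifier.
    bundle = send("POST", BASE_URL, settings)
    bundle_url = f"{BASE_URL}/{bundle['id']}"

    # Step 2: request a fresh, short-lived PUT URL for every file
    # and upload to it straight away.
    for file_name, data in files.items():
        added = send("POST", f"{bundle_url}/add", {"fileName": file_name})
        put_file(added["putUrl"], data)

    # Step 3: close the bundle so that it becomes ready for use.
    return send("POST", f"{bundle_url}/close", {"action": "close"})
```

A real put_file would also want to retry by posting to the add URL again if the PUT URL has expired; that is left out here for brevity.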

4.2. Creating a bundle to reference existing files on Amazon S3

Creating a bundle that references existing files in your own Amazon S3 bucket is a single-step operation. Note that all files referenced by a bundle must be of the same kind (all ZIP files, all WARC files, etc.), must be downloadable using the same credentials, and must share the same additional settings such as file extension filters.

POST https://cloud.gate.ac.uk/api/data/bundle

Request format

XML:
<bundle>
  <key>value</key>
  ...
  <accessKey>AKIA......</accessKey>
  <secretKey>...</secretKey>
  <locations>
    <location>s3://bucketName/key</location>
    ...
  </locations>
</bundle>

JSON:
{
  "key":"value",
  ...
  "accessKey":"AKIA......",
  "secretKey":"...",
  "locations":[
    "s3://bucketName/key",
    ...
  ]
}
Key | Value
name | The name for the new bundle
type | The type of files stored in this bundle, for bundles that are usable as input to an annotation job. The valid types are described in the job management API documentation
encoding | The character encoding that an annotation job should use to read text documents out of this data bundle
mimeTypeOverride | The MIME type that an annotation job should assume when reading documents from this bundle
fileExtensions | (ZIP and TAR bundles only) Comma-separated list of file extensions identifying the entries within this bundle's archives that should be processed by an annotation job
mimeTypeFilters | (ARC and WARC bundles only) Comma-separated list of MIME type prefixes that identify the entries in the archive that should be processed by an annotation job

For full details of the type, encoding, mimeTypeOverride, fileExtensions and mimeTypeFilters options, see the job management API documentation.

The S3 locations are specified as pseudo-URLs of the form s3://bucketName/key. The accessKey and secretKey parameters specify the AWS credentials that GATE Cloud components should use when they need to fetch the files in the bundle. For security reasons you should not provide your AWS master credentials here; instead, create an IAM user whose rights are restricted to GET requests on the objects in the bundle.

Response format

As for GET bundle details above, the details of the newly created bundle.
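
As an illustration, the JSON request body for an S3-backed bundle could be assembled as follows. The function name is hypothetical, and the check on the location format is a convenience added by this sketch, not something the API is documented to require of clients.

```python
import json


def s3_bundle_payload(name, access_key, secret_key, locations, **settings):
    """Return the JSON body for a bundle that references S3 objects.

    settings holds any optional keys such as encoding or fileExtensions.
    """
    for location in locations:
        # Split "s3://bucketName/key" into its bucket and key parts.
        bucket, _, key = location[len("s3://"):].partition("/")
        if not location.startswith("s3://") or not bucket or not key:
            raise ValueError(
                f"not of the form s3://bucketName/key: {location!r}"
            )
    payload = {"name": name, **settings}
    payload.update(
        accessKey=access_key,
        secretKey=secret_key,
        locations=list(locations),
    )
    return json.dumps(payload)
```

The credentials passed here should belong to the restricted IAM user described above, never to the account's master credentials.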

5. Deleting a data bundle

DELETE https://cloud.gate.ac.uk/api/data/bundle/{bundleID}

When you no longer require the data in a bundle you should delete it. For bundles that are stored in GATE Cloud managed storage you will be charged a monthly fee for each bundle based on its size, and you must delete the bundle to stop incurring charges.

For a bundle that points to objects in your own S3 bucket there are no monthly storage charges, and deleting the bundle will not delete the target objects.
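
A deletion request might be built as in this short sketch. As in any client for this API, the Authorization value is an assumption: auth_header is a hypothetical parameter standing in for whatever credentials the service expects.

```python
import urllib.request


def build_delete_request(bundle_id, auth_header):
    """Build (but do not send) the DELETE request for one bundle."""
    return urllib.request.Request(
        f"https://cloud.gate.ac.uk/api/data/bundle/{bundle_id}",
        method="DELETE",
        headers={"Authorization": auth_header},
    )
```

Sending the request with urllib.request.urlopen would carry out the deletion, so it is worth double-checking the bundle identifier first.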