Multilingual News Similarity Service

Note that this service is currently under active development and may change without warning. This note will be removed once the service stabilises.

This service takes two news article texts and attempts to determine whether the two articles refer to the same underlying news story. The service is based on a model developed as the GateNLP-USHEF submission to the multilingual news article similarity task at SemEval-2022. The service can handle documents in a variety of languages including English, German, French, Spanish, Italian, Polish, Turkish, Russian, Chinese and Arabic. The two articles being compared can be in the same language or in different languages: the model maps both texts into a multilingual embedding space and compares their embedding vectors.

The service expects input as JSON, sent in a POST request to the service endpoint (see the API base URL under "Use this service" below). Calls can be authenticated using a GATE Cloud API Key via Basic Authentication.

In the simplest case you can provide the two articles as plain text:

{
  "text1": "Text of the first article",
  "text2": "Text of the second article"
}

and the service will return a similarity score between 0 and 1:

{"score": 0.7762627005577087}
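
The simplest call can be sketched in Python using only the standard library. This is a minimal sketch under two assumptions: that the similarity endpoint is the API base URL quoted on this page, and that the API key is supplied as a key ID and password pair through Basic Authentication. The names build_request and similarity are illustrative only and are not part of any GATE client library.

```python
import base64
import json
import urllib.request

# Assumption: the similarity endpoint is the API base URL quoted on this page.
ENDPOINT = "https://cloud-api.gate.ac.uk/process/news-similarity"


def build_request(text1, text2, key_id=None, password=None):
    """Build (but do not send) a POST request comparing two plain-text articles."""
    body = json.dumps({"text1": text1, "text2": text2}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if key_id is not None and password is not None:
        # Assumption: the API key is a key ID / password pair for Basic auth.
        raw = f"{key_id}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(ENDPOINT, data=body, headers=headers, method="POST")


def similarity(text1, text2, key_id=None, password=None):
    """Send the request and return the "score" field of the JSON response."""
    request = build_request(text1, text2, key_id, password)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["score"]
```

Calling similarity("Text of the first article", "Text of the second article") would then return the score as a float. Separating build_request from the network call makes the payload easy to inspect before anything is sent.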

However, if the two texts have been annotated with named entities, the model can use these to inform the similarity calculation. To supply named entities, your request should look like this:

{
  "text1": {
    "text": "Text of the first article",
    "entities": {
      "PERSON": [...],
      "GPE": [...]
    }
  },
  "text2": {
    "text": "Text of the second article",
    "entities": {
      "PERSON": [...],
      "GPE": [...]
    }
  }
}

The "entities" object has a list of items for each entity type, and each item in the list can be:

A string giving the entity label or identifier, e.g. "GPE":["UK", "Sheffield"]
An object with one of the properties "inst", "uri", "lemma" or "string" giving the entity label, e.g. "PERSON":[{"string":"John Smith"}]. The properties are checked in that order and the first one found is used.
An object without an explicit label property, which should instead have a property "indices" giving the start and end offsets of the entity within the corresponding text, e.g. "LOC":[{"indices":[5,11]}]
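
The three item forms can be illustrated with a small client-side sketch. It mirrors the lookup order described above but is not the service's own code; the function names are hypothetical, and it assumes that "indices" are [start, end) character offsets, which this page does not state explicitly.

```python
LABEL_KEYS = ("inst", "uri", "lemma", "string")  # checked in this order


def entity_label(item, text):
    """Return the label for one entity item, covering the three forms above."""
    if isinstance(item, str):
        return item  # form 1: the item is already the label
    for key in LABEL_KEYS:
        if key in item:
            return item[key]  # form 2: first label property found wins
    # Form 3. Assumption: offsets are [start, end) character positions.
    start, end = item["indices"]
    return text[start:end]


def make_article(text, entities=None):
    """Build the value of "text1" or "text2": plain text, or text plus entities."""
    if not entities:
        return text
    return {"text": text, "entities": entities}


text = "From Sheffield with love"
article = make_article(text, {"GPE": ["UK", {"indices": [5, 14]}]})
labels = [entity_label(item, text) for item in article["entities"]["GPE"]]
```

Under the offset assumption, labels here holds "UK" and "Sheffield".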

This format was designed so that the request is as simple as possible for direct REST usage, while also allowing for "text1" and "text2" to be the JSON returned by another GATE Cloud named entity annotation service such as spaCy or ANNIE. The similarity service can make use of entity annotations of the following types (matching those used by spaCy and ANNIE), aggregated into four groups:

"location" types: GPE, LOC, Location
"date and time" types: DATE, TIME, Date
types representing numbers and measurements: ORDINAL, CARDINAL, QUANTITY, Number, Measurement
other kinds of named entity: PERSON, ORG, FAC, EVENT, NORP, PRODUCT, WORK_OF_ART, Person, Organization, Event
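
As an illustration, the four groups can be written down as a lookup table, for example to check which of your annotation types the service would recognise before sending a request. The group keys below ("location", "date_time", "number", "other") are hypothetical names chosen for this sketch; only the type lists come from this page.

```python
# Type lists copied from the groups above; the group keys are hypothetical.
ENTITY_GROUPS = {
    "location": ["GPE", "LOC", "Location"],
    "date_time": ["DATE", "TIME", "Date"],
    "number": ["ORDINAL", "CARDINAL", "QUANTITY", "Number", "Measurement"],
    "other": ["PERSON", "ORG", "FAC", "EVENT", "NORP", "PRODUCT",
              "WORK_OF_ART", "Person", "Organization", "Event"],
}

# Reverse index: entity type -> group key (type names are case-sensitive).
TYPE_TO_GROUP = {
    entity_type: group
    for group, types in ENTITY_GROUPS.items()
    for entity_type in types
}


def group_entities(entities):
    """Regroup a {type: [items]} mapping by group.

    Returns (grouped, unrecognised), where unrecognised lists the types
    that do not appear in any of the four groups.
    """
    grouped = {}
    unrecognised = []
    for entity_type, items in entities.items():
        group = TYPE_TO_GROUP.get(entity_type)
        if group is None:
            unrecognised.append(entity_type)
        else:
            grouped.setdefault(group, []).extend(items)
    return grouped, unrecognised
```

For example, an input with "GPE", "Location" and "MONEY" entries would place the first two under "location" and report "MONEY" as unrecognised. How the service itself treats unlisted types is not stated here.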

Note: the named entities must be "compatible" across both texts. For example, if the documents are in different languages, simply using the surface strings of the entities in each document will not work. Instead, you must use entities that have been disambiguated to a common representation across languages, for example URIs in a knowledge base such as DBpedia or Wikidata. This service will work with any set of entity labels as long as the same entity uses the same label in both texts.
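
A rough client-side sanity check for this compatibility requirement is to see whether the two entity sets share any labels at all. The sketch below is only that: the helper names are hypothetical, it ignores items given only by "indices", and label overlap is not how the service computes similarity.

```python
def collect_labels(entities):
    """Collect the set of explicit labels from a {type: [items]} mapping."""
    found = set()
    for items in entities.values():
        for item in items:
            if isinstance(item, str):
                found.add(item)
                continue
            for key in ("inst", "uri", "lemma", "string"):
                if key in item:
                    found.add(item[key])
                    break  # only the first label property is used
    return found


def shared_labels(entities1, entities2):
    """Labels that occur in both articles' entity annotations."""
    return collect_labels(entities1) & collect_labels(entities2)


# Surface strings from a Spanish and a German article do not line up...
spanish_surface = {"GPE": ["Alemania"]}
german_surface = {"GPE": ["Deutschland"]}

# ...but the same entity linked to a common DBpedia URI does.
spanish_linked = {"GPE": [{"uri": "http://dbpedia.org/resource/Germany"}]}
german_linked = {"GPE": [{"uri": "http://dbpedia.org/resource/Germany"}]}
```

Here shared_labels(spanish_surface, german_surface) is empty, while shared_labels(spanish_linked, german_linked) contains the shared URI.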

Test this service

Examples

Example 1: two articles about Elon Musk (article 1 from Yahoo News, article 2 from CNBC)
Example 2: the same articles, pre-processed using the spaCy NER service
Example 3: cross-language example, with Spanish and German articles annotated with references to DBpedia

Accessing the embedding vectors

As well as the similarity calculation, this service provides a subsidiary endpoint /embeddings which accepts a POST of a single article with Content-Type: text/plain and returns the embedding vector that would be used in the similarity calculation. The response is JSON:

{
  "embedding": [ ... ],
  "dimensions": 768
}

where "embedding" is a list of floating point numbers representing the vector and "dimensions" gives the number of dimensions in the embedding vector space.
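
A minimal Python sketch of working with this endpoint follows. It assumes that /embeddings is appended to the API base URL quoted on this page and that the article text is sent as UTF-8; authentication is omitted for brevity. The function names are hypothetical, and the cosine helper shows one common way to compare two vectors, not necessarily the calculation the service itself performs.

```python
import json
import math
import urllib.request

# Assumption: /embeddings is appended to the API base URL quoted on this page.
EMBEDDINGS_URL = "https://cloud-api.gate.ac.uk/process/news-similarity/embeddings"


def build_embedding_request(text):
    """Build (but do not send) a text/plain POST for a single article."""
    return urllib.request.Request(
        EMBEDDINGS_URL,
        data=text.encode("utf-8"),  # assumption: UTF-8 request body
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def parse_embedding(response_body):
    """Extract the vector from the JSON response, checking its length."""
    data = json.loads(response_body)
    vector = data["embedding"]
    if len(vector) != data["dimensions"]:
        raise ValueError("embedding length does not match 'dimensions'")
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms
```

Fetching the vectors once and comparing them locally can be useful when many articles need to be compared pairwise, since each article then costs a single API call.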

800 free requests / day

Batch processing not available

Use this service

You can make up to 800 API calls per day free of charge, at an average rate of 2 calls/sec. Higher quotas are available for research users by arrangement; contact us for details.
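
A client can stay within these limits with a simple throttle, sketched below. The Throttle class is hypothetical and deliberately conservative: it spaces every call at least 0.5 seconds apart (stricter than an average of 2 calls/sec), and its call counter only covers calls made through one object and never resets, so it is not a true per-day quota tracker.

```python
import time


class Throttle:
    """Space calls evenly and stop once a fixed number of calls has been made."""

    def __init__(self, calls_per_second=2.0, daily_limit=800,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self.daily_limit = daily_limit
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self.calls = 0
        self.last_call = None

    def wait(self):
        """Block until the next call is allowed; raise if the cap is reached."""
        if self.calls >= self.daily_limit:
            raise RuntimeError("call limit reached for this Throttle")
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
        self.calls += 1
```

Typical use is to create one Throttle and call its wait() method immediately before each API request.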

The API base URL for this service is:

https://cloud-api.gate.ac.uk/process/news-similarity

See the API documentation for more details.
