Multilingual News Similarity Service
Note that this service is currently under active development and may change without warning. This note will be removed once the service stabilises.
This service takes two news article texts and attempts to determine whether the two articles refer to the same underlying news story. The service is based on a model developed as the GateNLP-USHEF submission to the multilingual news article similarity task at SemEval-2022. The service can handle documents in a variety of languages including English, German, French, Spanish, Italian, Polish, Turkish, Russian, Chinese and Arabic. The two articles being compared can be in the same language or in different languages, because the model maps both texts into a multilingual embedding space and compares their embedding vectors.
The service expects input as JSON, sent in a POST request to the following endpoint (calls can be authenticated using a GATE Cloud API Key via Basic Authentication):
In the simplest case you can provide the two articles as plain text:
{
"text1": "Text of the first article",
"text2": "Text of the second article"
}
and the service will return a similarity score between 0 and 1:
{"score": 0.7762627005577087}
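As a rough illustration of the simplest case, the following Python sketch builds the POST request and parses the response using only the standard library. The `ENDPOINT` value is a placeholder (substitute the real endpoint for this service), the parameter names `key_id` and `key_password` are hypothetical, and sending `Content-Type: application/json` is an assumption rather than a documented requirement.

```python
import base64
import json
import urllib.request

# Placeholder only: replace with the real endpoint URL for this service.
ENDPOINT = "https://example.invalid/news-similarity"


def build_similarity_request(text1, text2, key_id=None, key_password=None, url=ENDPOINT):
    """Build (but do not send) a POST request carrying the two articles as JSON."""
    body = json.dumps({"text1": text1, "text2": text2}).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    # Assumption: the service accepts the standard JSON content type.
    request.add_header("Content-Type", "application/json")
    if key_id is not None and key_password is not None:
        # HTTP Basic Authentication: base64 of "id:password".
        token = base64.b64encode(f"{key_id}:{key_password}".encode("utf-8")).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


def parse_score(response_body):
    """Extract the similarity score from a JSON response body (str or bytes)."""
    return float(json.loads(response_body)["score"])


def send(request):
    """Send a prepared request and return the similarity score."""
    with urllib.request.urlopen(request) as response:
        return parse_score(response.read())
```

With a real endpoint and credentials in place, `send(build_similarity_request(article1, article2, key_id, key_password))` would return the score as a float.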
However, if the two texts have been annotated with named entities, the model can use these to inform the similarity calculation. To supply named entities, your request should look like this:
{
"text1": {
"text": "Text of the first article",
"entities": {
"PERSON": [...],
"GPE": [...]
}
},
"text2": {
"text": "Text of the second article",
"entities": {
"PERSON": [...],
"GPE": [...]
}
}
}
The "entities" object has a list of items for each entity type, and each item in the list can be:
- A string giving the entity label or identifier, e.g.
"GPE":["UK", "Sheffield"]
- An object with one of the properties "inst", "uri", "lemma" or "string" giving the entity label, e.g.
"PERSON":[{"string":"John Smith"}]
The properties are checked in that order and the first one found is used.
- If the object does not have an explicit label as a property, then it should have a property "indices" giving the start and end offsets of the entity within the corresponding text, e.g.
"LOC":[{"indices":[5,11]}]
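The three item forms above can be sketched as a small client-side helper that works out which label an item would contribute. This is only an illustration of the rules as described, not the service's own code; the function name `resolve_label` is hypothetical, and treating the end offset as exclusive is an assumption.

```python
# Label properties in the order they are checked.
LABEL_KEYS = ("inst", "uri", "lemma", "string")


def resolve_label(item, text):
    """Return the label for one entity item, or None if none can be found."""
    # Form 1: a plain string is the label itself.
    if isinstance(item, str):
        return item
    # Form 2: an object with an explicit label property; first match wins.
    for key in LABEL_KEYS:
        if key in item:
            return item[key]
    # Form 3: fall back to the span of text given by "indices".
    if "indices" in item:
        start, end = item["indices"]
        # Assumption: the end offset is exclusive, as in Python slicing.
        return text[start:end]
    return None
```

For instance, `resolve_label({"indices": [5, 11]}, "Near London today")` picks out the substring between offsets 5 and 11 of the supplied text.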
This format was designed so that the request is as simple as possible for direct REST usage, while also allowing for "text1" and "text2" to be the JSON returned by another GATE Cloud named entity annotation service such as spaCy or ANNIE. The similarity service can make use of entity annotations of the following types (matching those used by spaCy and ANNIE), aggregated into four groups:
- "location" types: GPE, LOC, Location
- "date and time" types: DATE, TIME, Date
- types representing numbers and measurements: ORDINAL, CARDINAL, QUANTITY, Number, Measurement
- other kinds of named entity: PERSON, ORG, FAC, EVENT, NORP, PRODUCT, WORK_OF_ART, Person, Organization, Event
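If you are preparing requests in code, the four groups can be captured in a lookup table. The sketch below is a client-side convenience only: the group keys (`"location"`, `"date_time"`, `"number"`, `"other"`) and the helper `keep_supported` are illustrative names, and how the service treats entity types outside this list is not stated here.

```python
# Entity types the service can make use of, by group (group keys are illustrative).
ENTITY_GROUPS = {
    "location": {"GPE", "LOC", "Location"},
    "date_time": {"DATE", "TIME", "Date"},
    "number": {"ORDINAL", "CARDINAL", "QUANTITY", "Number", "Measurement"},
    "other": {
        "PERSON", "ORG", "FAC", "EVENT", "NORP", "PRODUCT", "WORK_OF_ART",
        "Person", "Organization", "Event",
    },
}

# Reverse lookup: entity type -> group key.
TYPE_TO_GROUP = {
    entity_type: group
    for group, types in ENTITY_GROUPS.items()
    for entity_type in types
}


def keep_supported(entities):
    """Drop entity types that are not in any of the four groups."""
    return {t: items for t, items in entities.items() if t in TYPE_TO_GROUP}
```

Filtering with `keep_supported` before sending keeps the request small when an upstream annotator produces types outside the list above.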
Note: the named entities must be "compatible" across both texts. For example, if the documents are in different languages, then simply using the surface strings of the entities in each document will not work. Instead, you must use entities that have been disambiguated to a common representation across languages, for example URIs in a knowledge base such as DBpedia or Wikidata. This service will work with any set of entity labels as long as the same entity uses the same label in both texts.
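One way to sanity-check compatibility before calling the service is to see whether the two entity sets share any labels at all. The sketch below is a hypothetical client-side check, not part of the service; it assumes every entity item is a plain string label (such as a URI) and ignores entity types.

```python
def label_set(entities):
    """Collect every label in an "entities" object, ignoring the entity types.

    Assumption: each item is a plain string label, not an object.
    """
    return {label for items in entities.values() for label in items}


def shared_labels(entities1, entities2):
    """Return the labels that appear in both entity objects."""
    return label_set(entities1) & label_set(entities2)


# Surface strings from a Spanish and a German text do not match each other...
spanish = {"GPE": ["Alemania"]}
german = {"GPE": ["Deutschland"]}
no_overlap = shared_labels(spanish, german)

# ...whereas labels disambiguated to the same knowledge-base URI do.
uri = "http://dbpedia.org/resource/Germany"
overlap = shared_labels({"GPE": [uri]}, {"GPE": [uri]})
```

An empty result for two articles that you expect to mention the same entities suggests that the labels are not in a common representation.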
Test this service
Examples
- Example 1: two articles about Elon Musk (article 1 from Yahoo News, article 2 from CNBC)
- Example 2: the same articles, pre-processed using the spaCy NER service
- Example 3: cross-language example, with Spanish and German articles annotated with references to DBpedia
Accessing the embedding vectors
As well as the similarity calculation, this service provides a subsidiary endpoint /embeddings which accepts a POST of a single article with Content-Type: text/plain and returns the embedding vector that would be used in the similarity calculation. The response is JSON:
{
"embedding": [ ... ],
"dimensions": 768
}
where "embedding" is a list of floating point numbers representing the vector and "dimensions" gives the number of dimensions in the embedding vector space.
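A minimal Python sketch of calling the embeddings endpoint might look like the following. `BASE_URL` is a placeholder, and both appending `/embeddings` directly to the service URL and encoding the article as UTF-8 are assumptions; the function names are illustrative.

```python
import json
import urllib.request

# Placeholder only: replace with the real URL for this service.
BASE_URL = "https://example.invalid/news-similarity"


def build_embedding_request(article_text, base_url=BASE_URL):
    """Build (but do not send) a plain-text POST to the /embeddings endpoint."""
    request = urllib.request.Request(
        base_url + "/embeddings",
        data=article_text.encode("utf-8"),  # assumption: UTF-8 encoded body
        method="POST",
    )
    request.add_header("Content-Type", "text/plain")
    return request


def parse_embedding(response_body):
    """Return the embedding as a list of floats, checking it against "dimensions"."""
    payload = json.loads(response_body)
    vector = [float(x) for x in payload["embedding"]]
    if len(vector) != payload["dimensions"]:
        raise ValueError(
            f"expected {payload['dimensions']} dimensions, got {len(vector)}"
        )
    return vector
```

The length check is a defensive measure on the client side; it simply confirms that the two fields of the response agree with each other.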
Use this service
You can make up to 800 API calls per day free of charge, at an average rate of 2 calls/sec. Higher quotas are available for research users by arrangement; contact us for details.
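To stay within the rate limit when comparing many article pairs, a client can space out its calls. The sketch below is one simple, conservative approach (a fixed minimum gap between calls) rather than anything prescribed by the service. The class name `Throttle` is hypothetical, and it does not track the daily quota.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls, e.g. 0.5 s for 2 calls/sec."""

    def __init__(self, calls_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self.clock = clock  # injectable for testing
        self.sleep = sleep  # injectable for testing
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling `throttle.wait()` immediately before each request keeps successive requests at least half a second apart with the default setting.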
The API base URL for this service is:
See the API documentation for more details.