As part of our newly released TrustServista API call, we introduced a unique functionality: determining the semantic similarity between two articles.
Two online news articles can be compared in different ways, the most common being syntactic and semantic similarity. Syntactic similarity looks at how similar two articles, paragraphs or strings are in terms of structure and grammar. A typical application for digital news would be detecting whether one article is “re-edited” from another, containing identical or slightly modified sentences.
Semantic similarity deals with the meaning of the compared articles. The articles could be written in totally different styles, have different lengths and so on, yet still carry the same meaning. TrustServista uses a proprietary semantic similarity algorithm to determine whether two or more articles are about the same topic.
This functionality, used in the TrustServista web user interface in creating the Trending Stories and determining Patient Zero, is now available in the API as a simple POST call that works for both URLs and raw text. It supports English, German and Spanish content.
https://trust.servista.eu/api/rest/v2/similar
Examples
One straightforward example of using this API call is to compare two URLs:
http://www.timesofisrael.com/new-find-shakes-up-theory-on-dinosaur-evolution/
https://www.yahoo.com/tech/scientists-solve-mystery-frankenstein-dinosaur-200520731.html
The cURL API call would be:
curl -X POST \
  -H "X-TRUS-API-Key: ${API_KEY}" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Cache-Control: no-cache" \
  -d '{"contentUri1": "http://www.timesofisrael.com/new-find-shakes-up-theory-on-dinosaur-evolution/", "contentUri2": "https://www.yahoo.com/tech/scientists-solve-mystery-frankenstein-dinosaur-200520731.html"}' \
  "${API_SERVICE_URL}/rest/v2/similar"
The result of this call is “similarity:95”, which is interpreted as “the compared URLs are 95% similar from a semantic perspective”.
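For readers who would rather call the endpoint from a script, here is a minimal Python sketch of the same URL comparison using only the standard library. The header names and the contentUri1/contentUri2 fields come from the cURL call above. The function names, the timeout value and the assumption that the response body is JSON are ours for illustration, and since the exact response layout is not documented here, the decoded body is returned unchanged.

```python
import json
import urllib.request

# Endpoint as given in this article.
SIMILAR_URL = "https://trust.servista.eu/api/rest/v2/similar"


def build_similarity_request(api_key, uri1, uri2, endpoint=SIMILAR_URL):
    """Build (but do not send) the POST request comparing two article URLs."""
    # json.dumps takes care of quoting and escaping the URL strings.
    body = json.dumps({"contentUri1": uri1, "contentUri2": uri2}).encode("utf-8")
    headers = {
        "X-TRUS-API-Key": api_key,
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Cache-Control": "no-cache",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def compare_urls(api_key, uri1, uri2, timeout=30):
    """Send the request and return the decoded response body as-is.

    Assumption: the service answers with a JSON document, as requested
    through the Accept header. Its exact fields are not assumed here.
    """
    request = build_similarity_request(api_key, uri1, uri2)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling compare_urls with a valid API key and the two URLs above should return the same result as the cURL call. Keeping the request construction separate from the network call makes it possible to inspect the headers and body without contacting the service.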
An even better understanding of the semantic similarity algorithm can be achieved by comparing two pieces of raw text:
“Dinosaurs were the monarchs of Earth for 160 million years until a space rock collided with the planet 65.5 million years ago and wiped out those confined to land.”
“Dinosaurs disappeared from the Earth million of years ago, after and unknown cataclysm that occurred 65.5 million years ago. Their demise is most likely to have been caused by a meteorite, a meteor or a comet.”
The API call for comparing the two strings is:
curl -X POST \
  -H "X-TRUS-API-Key: ${API_KEY}" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Cache-Control: no-cache" \
  -d '{"content1": "Dinosaurs were the monarchs of Earth for 160 million years until a space rock collided with the planet 65.5 million years ago and wiped out those confined to land.", "content2": "Dinosaurs disappeared from the Earth million of years ago, after and unknown cataclysm that occurred 65.5 million years ago. Their demise is most likely to have been caused by a meteorite, a meteor or a comet."}' \
  "${API_SERVICE_URL}/rest/v2/similar"
The result is a semantic similarity of 72%. Although the compared paragraphs are very different from a syntactic perspective (less than 40% similar), their semantic similarity is obvious. The difference in meaning comes mostly from the second paragraph not mentioning how long (“160 million years”) the dinosaurs dominated the Earth before going extinct.
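To get a feel for what a purely syntactic comparison measures, the short Python sketch below scores the same two paragraphs by their shared runs of words, using difflib from the standard library. This is only an illustration of surface-level matching: it is not the method TrustServista uses, the function name is ours, and the number it prints should not be expected to match the "less than 40%" figure quoted above.

```python
from difflib import SequenceMatcher


def syntactic_similarity(text1, text2):
    """Return a 0-100 surface similarity score based on matching word runs."""
    # Compare lowercase word sequences, so only wording and word order count.
    words1 = text1.lower().split()
    words2 = text2.lower().split()
    return round(100 * SequenceMatcher(None, words1, words2).ratio())


first = (
    "Dinosaurs were the monarchs of Earth for 160 million years until a space "
    "rock collided with the planet 65.5 million years ago and wiped out those "
    "confined to land."
)
second = (
    "Dinosaurs disappeared from the Earth million of years ago, after and "
    "unknown cataclysm that occurred 65.5 million years ago. Their demise is "
    "most likely to have been caused by a meteorite, a meteor or a comet."
)

print(syntactic_similarity(first, second))
```

A word-level matcher like this rewards only identical wording in the same order, so "space rock" and "meteorite" contribute nothing to the score. That gap is exactly what a semantic comparison is meant to close.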