TrustServista explained #2: Finding similarities and links in news articles

In today’s blog post we are going to explain TrustServista’s core functionality: finding similar news articles and identifying visible and hidden links between them.

When performing an investigation on digital news it is crucial to determine the origin of every piece of information that is used to build up a story or create an article. This way, a journalist can easily verify if a news item is trustworthy or not and can subsequently create high-quality, trusted content.

In today’s high-velocity digital media environment, editors use multiple source types when creating content: on-the-field journalists, news agency feeds, press releases, other publications’ articles and even social media posts from officials or politicians. Information can travel very fast; it can be referenced, altered and built upon until the original source is no longer visible or is even lost. Uncovering how articles use and reference information is TrustServista’s key functionality, the goal being to discover “patient zero”, the origin of the information. For example, more often than not nowadays, politics-related articles in the US media are created using the President’s Twitter messages as an information source.

Finding and tracing the information source within a story requires discovering the following:

  • Similarity between articles
  • Explicit (URL) links between articles
  • Implicit (referenced) links between articles

Article Similarity

In TrustServista, articles are part of the same story (or topic) if they are similar enough. Similarity is determined using proprietary algorithms that rely on text analytics. Articles are transformed into vectors of keywords using entity extraction NLP techniques. These vectors are then used to create article clusters, within which articles can be closer to or farther from each other. The “closer” two articles are, the more likely they are to refer to the same topic and to have been written using the same information.
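To make the idea of “closeness” between keyword vectors concrete, here is a minimal sketch. It is not TrustServista’s proprietary algorithm: cosine similarity is swapped in as a common stand-in measure, and the keyword lists, the function names and the 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def keyword_vector(keywords):
    """Turn a list of extracted keywords into a sparse count vector."""
    return Counter(k.upper() for k in keywords)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse keyword vectors (0.0 to 1.0)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical keyword lists, as if produced by an entity extraction step.
article_a = keyword_vector(["Donald Trump", "President", "United States", "Twitter"])
article_b = keyword_vector(["Donald Trump", "President", "United States", "Congress"])
article_c = keyword_vector(["Champions League", "Real Madrid", "Final"])

SAME_STORY_THRESHOLD = 0.5  # assumed cut-off, chosen only for this example

sim_ab = cosine_similarity(article_a, article_b)
sim_ac = cosine_similarity(article_a, article_c)

print("a/b same story:", sim_ab >= SAME_STORY_THRESHOLD)
print("a/c same story:", sim_ac >= SAME_STORY_THRESHOLD)
```

Articles that share most of their keywords score close to 1 and would fall into the same cluster, while articles with no keywords in common score 0.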

In the TrustServista main screen, you can see how similar articles form Trending Topics. Each topic contains “related articles” (61 in the screenshot below) and is named after its three most frequent keywords (“DONALD TRUMP, PRESIDENT, UNITED STATES” in the screenshot below):

Article Linking

Similar to a search engine, TrustServista uncovers links between articles, in order to trace the path to the origin of information (“patient zero”):

Explicit Links

Each article is automatically processed and its URLs (web links) are extracted. These URLs usually point to other articles from the same publisher or from other newspapers; sometimes they link to Twitter or other social media content. Such links can be found within a Trending Topic, where they add even more weight to the similarity algorithm described earlier (they are represented as arrows connecting articles in the Topic Relationships card):

Examples of explicit links in actual article content can be found below:

  • A Russia Today (rt.com) article referencing Twitter content (the inline URL highlighted with yellow):
  • An Independent article that directly embeds Twitter content (notice the tweet from Matthew Butcher):
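The explicit link extraction described in this section can be sketched with Python’s standard library. This is only an illustration of the general technique, not TrustServista’s implementation; the `LinkExtractor` class, the `classify_links` helper, the `SOCIAL_DOMAINS` set and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

SOCIAL_DOMAINS = {"twitter.com", "facebook.com"}  # assumed, not exhaustive


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)  # resolve relative links
        if urlparse(url).scheme in ("http", "https"):
            self.links.append(url)


def host_of(url):
    """Lower-cased host name without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_links(html, base_url):
    """Split explicit links into same-publisher, social media and other."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    result = {"same_publisher": [], "social": [], "other": []}
    for url in parser.links:
        host = host_of(url)
        if host == host_of(base_url):
            result["same_publisher"].append(url)
        elif host in SOCIAL_DOMAINS:
            result["social"].append(url)
        else:
            result["other"].append(url)
    return result


# Hypothetical article body and URLs, for illustration only.
html = (
    '<p>See <a href="/news/related-story">our earlier report</a>, '
    '<a href="https://twitter.com/example/status/1">this tweet</a> and '
    '<a href="https://www.other-paper.example/story">another paper</a>. '
    '<a href="mailto:desk@publisher.example">Contact us</a>.</p>'
)
links = classify_links(html, "https://www.publisher.example/news/article-1")
print(links)
```

Separating same-publisher, social media and third-party links is a design choice of this sketch; it mirrors the three kinds of targets mentioned above.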

Implicit Links

Sometimes editors do not use URLs to link to and reference other articles. Instead they use implicit links or mentions, as in this article excerpt, which references information originating from the “Szazadveg Foundation” without providing a URL to the institution’s actual webpage:

In this type of situation, TrustServista relies on its similarity algorithm to find the article on the referenced source’s website that is “closest” to the current article, and links the two. Thus, an article can be linked to several other sources, which are in turn linked to other sources, and so on.
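A rough sketch of this “closest article” lookup follows. It assumes the mentioned source has already been recognised and its recent articles reduced to keyword sets. Jaccard overlap is used as a simple stand-in for the real similarity algorithm, and the function names, URLs, keywords and the 0.2 minimum score are hypothetical.

```python
def jaccard(a, b):
    """Share of keywords two articles have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def resolve_implicit_link(article_keywords, candidates, min_similarity=0.2):
    """Return (url, score) of the closest candidate, or None if none is close enough.

    `candidates` maps article URLs on the referenced source's website
    to their keyword sets.
    """
    best_url, best_score = None, 0.0
    for url, keywords in candidates.items():
        score = jaccard(article_keywords, keywords)
        if score > best_score:
            best_url, best_score = url, score
    if best_url is None or best_score < min_similarity:
        return None
    return best_url, best_score


# Hypothetical data: an article that mentions a foundation without linking to it.
current_article = {"HUNGARY", "SURVEY", "MIGRATION", "EXAMPLE FOUNDATION"}
foundation_articles = {
    "https://foundation.example/survey-results": {"HUNGARY", "SURVEY", "MIGRATION"},
    "https://foundation.example/annual-report": {"HUNGARY", "BUDGET", "REPORT"},
}

match = resolve_implicit_link(current_article, foundation_articles)
print(match)
```

The minimum score keeps the function from linking to an unrelated page when the source’s website has nothing sufficiently similar.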

This almost never-ending chain of links between articles is limited using the similarity functionality. Articles can reference other articles that are not necessarily part of the story; eliminating this “noise” is important for finding “patient zero”, which needs to be the oldest referenced article within a story but also “close” (similar) enough to the current article to truly be considered the origin of the information.
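The two conditions for “patient zero” (oldest, yet similar enough) can be expressed in a few lines. The sketch below assumes each linked article already carries a publication time and a similarity score relative to the article under investigation; the `LinkedArticle` type, the `find_patient_zero` function, the 0.5 threshold and the sample data are assumptions, not TrustServista’s actual data model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class LinkedArticle:
    url: str
    published: datetime
    similarity: float  # similarity to the article under investigation


def find_patient_zero(
    linked: List[LinkedArticle], min_similarity: float = 0.5
) -> Optional[LinkedArticle]:
    """Return the oldest linked article that is still similar enough."""
    # Drop "noise": referenced articles that are not really part of the story.
    relevant = [a for a in linked if a.similarity >= min_similarity]
    if not relevant:
        return None
    return min(relevant, key=lambda a: a.published)


# Hypothetical linked articles, for illustration only.
linked = [
    LinkedArticle("https://site-a.example/old-unrelated", datetime(2017, 1, 2), 0.10),
    LinkedArticle("https://site-b.example/first-report", datetime(2017, 3, 1), 0.80),
    LinkedArticle("https://site-c.example/follow-up", datetime(2017, 3, 2), 0.90),
]

origin = find_patient_zero(linked)
print(origin.url if origin else "no origin found")
```

In this example the oldest article is discarded because its similarity is too low, so the first sufficiently similar report is selected instead.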

Also read: TrustServista explained #1: The text analytics approach