TrustServista explained #1: The text analytics approach

In this blog series we will dig into TrustServista's functionality and explain the inner workings of the platform. One of the key pieces of feedback we received from our beta testers was that the text analytics concepts used throughout the software, from the implementation of our algorithms to the visualization of information, were somewhat difficult to grasp.

Dealing with human-generated text (in our case news articles, blog posts or social media messages) that needs to be “understood” by a software program with no human intervention falls under the Artificial Intelligence discipline known as “text analytics”, also known as text mining. This approach consists of a set of linguistic, statistical, and machine learning techniques that can automatically extract high-quality information from digital text. For example, automatically categorizing documents, calculating the frequency of certain words or determining the sentiment of an article are all done with text analytics.
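
To make this concrete, here is a minimal sketch of one such operation, counting word frequencies in an article, in plain Python. The helper name and sample text are ours, for illustration only, and are not part of TrustServista:

```python
from collections import Counter
import re

def word_frequencies(text, top_n=3):
    """Count how often each word appears in a text, a basic
    text analytics operation. (Illustrative helper, not part
    of TrustServista's API.)"""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

article = "The vote passed. The vote was close, observers said."
print(word_frequencies(article))  # [('the', 2), ('vote', 2), ('passed', 1)]
```

Frequency counts like these are one of the simplest signals a text analytics pipeline can compute, and they feed into more elaborate steps such as the sentence weighting described later in this post.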

Going even further in complexity, human-machine interaction relies on Natural Language Processing (NLP) techniques that can extract even deeper understanding from a text: extraction of named entities (names, locations, products), automatic summarization, morphological analysis, word segmentation and so on.

TrustServista makes use of a mix of text analytics and NLP in order to determine the origin and trustworthiness of news, as described below. We use our own algorithms and also rely on the Rosette API from Basis Technology.

Generating Trending Topics

These topics, or stories, are basically clusters of articles grouped together by similarity. All articles grouped within a topic talk about the same things within a relatively short (one week) time span. TrustServista generates these topics fully automatically, with no human intervention.
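
As a rough illustration of similarity-based grouping, here is a toy clustering sketch using word-overlap (Jaccard) similarity. The threshold, the greedy single-pass strategy and the sample articles are our own assumptions for illustration, not TrustServista's actual algorithm:

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    return len(a & b) / len(a | b)

def cluster_articles(articles, threshold=0.3):
    """Greedy single-pass clustering: put each article into the
    first cluster whose seed article it sufficiently resembles,
    otherwise start a new cluster. (Illustrative only; threshold
    and method are assumptions.)"""
    clusters = []  # each cluster: list of (title, token_set)
    for title, text in articles:
        tokens = set(text.lower().split())
        for cluster in clusters:
            if jaccard(tokens, cluster[0][1]) >= threshold:
                cluster.append((title, tokens))
                break
        else:
            clusters.append([(title, tokens)])
    return [[title for title, _ in c] for c in clusters]

articles = [
    ("A", "parliament votes on new budget law"),
    ("B", "new budget law passes parliament vote"),
    ("C", "local team wins championship final"),
]
print(cluster_articles(articles))  # [['A', 'B'], ['C']]
```

Articles A and B share most of their vocabulary and land in one topic; article C shares none and starts its own.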

We will detail the trending topic generation and the text analytics algorithms we use in a future blog post.

Determining article sentiment

Whether an article is “positive”, “neutral” or “negative” is also determined automatically, using sentiment analysis algorithms. Document-level sentiment analysis is applied to each article in order to determine its polarity and ultimately its subjectivity, which is used in TrustServista’s TrustLevel calculation.
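
As a simplified illustration of document-level polarity, here is a toy lexicon-based sketch. The word lists and scoring rule are our own assumptions for illustration, not the sentiment algorithms TrustServista actually uses:

```python
# Toy sentiment lexicons (assumed for this example only).
POSITIVE = {"good", "great", "success", "win", "improve"}
NEGATIVE = {"bad", "crisis", "fail", "loss", "worse"}

def document_sentiment(text):
    """Classify a whole document as positive / neutral / negative
    by counting lexicon hits. (Illustrative lexicon approach.)"""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(document_sentiment("The reforms were a great success"))  # positive
```

Real document-level sentiment systems handle negation, intensifiers and context, which a bare lexicon count cannot.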

We will detail the sentiment analysis algorithms in a future blog post.

Extracting information from articles

Understanding articles without even reading them is possible with the NLP technique of entity extraction. Extracting entities means automatically identifying keywords in an article that represent people, places, organizations, email addresses, products, dates and times. TrustServista processes all articles this way, using machine learning statistical models.
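
As a heavily simplified illustration, here is a gazetteer-based extractor that looks up known names by type. The names and types below are invented for the example; real systems, including the statistical models mentioned above, recognize unseen entities from context, which a lookup table cannot do:

```python
import re

# Tiny example gazetteer (assumed for illustration only).
GAZETTEER = {
    "PERSON": {"Angela Merkel", "John Smith"},
    "LOCATION": {"Berlin", "Paris"},
    "ORGANIZATION": {"United Nations"},
}

def extract_entities(text):
    """Return (entity, type) pairs found in the text by
    exact gazetteer lookup."""
    found = []
    for etype, names in sorted(GAZETTEER.items()):
        for name in sorted(names):
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.append((name, etype))
    return found

text = "Angela Merkel met United Nations officials in Berlin."
print(extract_entities(text))
```

Even this crude pass yields the kind of typed keyword list that can drive color-coded visualizations and article linking.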

Entity extraction is not only used for visualization purposes, but also to perform the trending topic clustering, to find links between articles and to determine the TrustLevel. In the Top Entities card below, extracted entities are color-coded depending on their type: people, places, organizations, dates and so on.

We will detail the entity extraction algorithms in a future blog post.

Automatic summarization of articles

Not having to read an entire article in order to understand the story is also useful, and TrustServista achieves this using NLP algorithms. Producing a readable summary of an entire article is done by automatically identifying the most relevant sentences: those containing the most frequently used (“heaviest”) entities throughout the article’s content.
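
As a rough illustration of this extractive approach, here is a sketch that scores each sentence by the total frequency of the words it contains and keeps the top-scoring sentences in their original order. A real implementation would weight extracted entities rather than raw words, as described above:

```python
import re
from collections import Counter

def summarize(text, num_sentences=1):
    """Extractive summarization sketch: score sentences by the
    summed document-wide frequency ('weight') of their words and
    keep the highest-scoring ones. (Illustrative only.)"""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    weights = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(weights[w] for w in re.findall(r"\w+", sentences[i].lower())),
    )
    keep = sorted(scored[:num_sentences])
    return " ".join(sentences[i] for i in keep)

text = "The budget law passed. Critics were loud. The budget law changes taxes."
print(summarize(text))
```

In this toy example the sentence repeating the article's heaviest words (“budget law”) wins, which mirrors the intuition behind entity-weighted sentence selection.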