By: Zeyad Deeb on October 3rd, 2018

# In the (AI)sle™ - Part II - Utilizing Natural Language Processing with Product Data

You've heard it before: color has an impact on our mood and our appetite. Consumers have adjusted to years of evolving marketing tactics designed to draw shopper attention, build credibility, and communicate value. The problem is, not everyone can agree on what certain words mean. Take, for instance, "all natural flavor," "made with real cheese," or, my favorite, "made with love." While the legal definition of a "natural" flavor could mean a flavor essence or an actual juice derivative, a discerning parent just wants to know it's organic. Well, ok, but is it certified organic? We are here to help!

According to C&R Research, 47% of consumers rely on external signals like seals of approval or FDA statements, while 1 in 6 will walk away if the label doesn’t telegraph health and goodness. This has interesting implications for manufacturers, who may need to jump on the clean label trend to maintain brand loyalty. New products launch daily, and as brands race to build trust with consumers, the result is an overload of information.

Being a human who requires nourishment, you’ve likely been to the grocery store more than a few times. Browsing through the aisles, you’ll quickly realize that brands love to talk about their products. In fact, some brands will even write you essays on the back of their packaging. Take this product, for example:

Anyone can look at this product and tell you the brand name: Benefit. They can also tell you that the product contains raisins and seems to be a “healthy” product. But rarely does anyone read all the details on the package. We just don’t have time for that. Luckily, we train robots to read stuff for you. At Label Insight, we take data seriously, and because this unstructured information is, well, messy, and different for every product on the market, we examine all of the tiny details under a microscope to provide you with insights that no other company can.

### Part II - A rose by any other name? Utilizing Natural Language Processing at Label Insight

While you read this blog post, your brain is using shortcuts to help you complete many tasks at once: understand the sentence structure and the tone of the writing, make references to other parts of the text, and keep multiple thoughts in short-term memory. What comes naturally to you might be a painstaking multi-step process for a non-native speaker, much less an Artificial Intelligence model. If you’ve ever gotten mad at Alexa for not knowing something obvious, you’ll understand when we say that robots aren’t naturally talented. We at Label Insight use massive datasets to train AI models to recognize patterns in language and images the way our brains do, going beyond the usual techniques such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Naive Bayes (NB), straight to word embeddings.

### Word Embeddings and Recurrent Neural Networks (RNNs)

The statistics of word occurrences on a product package, or across a whole category of products, are one of the primary sources of information available to unsupervised machine learning models for estimating probabilities and numerical representations of words. Although many such methods now exist, the question remains how meaning is generated from these numbers, and how the resulting word vectors represent that meaning. Two of the more common models, GloVe and Word2Vec, provide numerical representations that mathematical models can operate on far more easily than raw text.
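To make the idea concrete, here is a minimal, pure-Python sketch of how occurrence statistics turn into word vectors. The corpus and the co-occurrence counting scheme are invented for illustration; real models like GloVe and Word2Vec learn dense vectors from far larger corpora, but the intuition is the same: words that appear in similar contexts end up with similar vectors.

```python
from collections import Counter
from math import sqrt

# Toy corpus of package copy; in practice this would be millions of labels.
corpus = [
    "healthy whole grain cereal with raisins",
    "natural whole grain cereal low fat",
    "healthy natural snack with raisins",
    "low fat natural granola with raisins",
]

tokens = [doc.split() for doc in corpus]
vocab = sorted({w for doc in tokens for w in doc})

def cooccurrence_vector(word):
    """Count how often `word` appears alongside every other vocabulary word."""
    counts = Counter()
    for doc in tokens:
        if word in doc:
            counts.update(w for w in doc if w != word)
    return [counts[v] for v in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# "healthy" and "natural" share contexts, so their vectors are closer
# than "healthy" and "fat".
sim_healthy_natural = cosine(cooccurrence_vector("healthy"), cooccurrence_vector("natural"))
sim_healthy_fat = cosine(cooccurrence_vector("healthy"), cooccurrence_vector("fat"))
print(sim_healthy_natural > sim_healthy_fat)
```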

To find the underlying relationships in data at scale, we utilize neural networks and AI models; a well-known architecture for unstructured text is the Recurrent Neural Network (RNN).

(Source: Oxford University)
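The recurrence at the heart of an RNN can be sketched in a few lines. The weights below are random and the sizes are tiny, purely for illustration; a real model (built in a framework such as PyTorch or TensorFlow) learns these weights from data. The point is the update rule: each step mixes the current word vector with the hidden state carried over from the previous words.

```python
import math
import random

random.seed(0)

INPUT_SIZE, HIDDEN_SIZE = 4, 3

# Randomly initialized weights stand in for trained parameters.
W_xh = [[random.uniform(-0.5, 0.5) for _ in range(INPUT_SIZE)] for _ in range(HIDDEN_SIZE)]
W_hh = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN_SIZE)] for _ in range(HIDDEN_SIZE)]
b_h = [0.0] * HIDDEN_SIZE

def rnn_step(x, h_prev):
    """One recurrence: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)."""
    return [
        math.tanh(
            sum(W_xh[i][j] * x[j] for j in range(INPUT_SIZE))
            + sum(W_hh[i][k] * h_prev[k] for k in range(HIDDEN_SIZE))
            + b_h[i]
        )
        for i in range(HIDDEN_SIZE)
    ]

# Feed a sequence of toy one-hot word vectors; the hidden state carries
# context forward so the final state summarizes the whole sequence.
sequence = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
h = [0.0] * HIDDEN_SIZE
for x in sequence:
    h = rnn_step(x, h)
print(h)
```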

We use probability models to predict the usage of a specific phrase $w = (w_1, w_2, ..., w_{\ell})$ on a product package, given the occurrence of that phrase in the same product category $x$:

$$p(w \mid x) = \prod_{n = 1}^{\ell} p(w_n \mid x, w_1, w_2, ..., w_{n-1})$$

This technique is also known as finding the conditional context. For example, the occurrence of the word "healthy" in the product category is not uncommon, so the probability of encountering it on any given package is high. Conversely, the usage of "high fat" has a low probability.
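A first-order (bigram) approximation of this chain rule is easy to sketch. The phrase list below is invented to stand in for a product category's label text; each factor $p(w_n \mid w_{n-1})$ is estimated from simple counts, so common category phrases get high probability and unseen ones get zero.

```python
from collections import Counter

# Toy phrases observed in a product category; real counts would come from
# the full product catalog.
category_phrases = [
    "made with real cheese",
    "made with whole grains",
    "made with real fruit",
    "low fat snack",
]

# Estimate p(w_n | w_{n-1}) from bigram counts, with "<s>" marking the start.
bigrams = Counter()
unigrams = Counter()
for phrase in category_phrases:
    words = ["<s>"] + phrase.split()
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

def phrase_probability(phrase):
    """Chain-rule product of bigram conditional probabilities."""
    prob = 1.0
    words = ["<s>"] + phrase.split()
    for prev, cur in zip(words[:-1], words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

common = phrase_probability("made with real cheese")
rare = phrase_probability("high fat")
print(common)  # a frequent phrase in this category
print(rare)    # unseen in this category, so probability 0
```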

### How We Use Word Embeddings: Coreference Resolution, Dependency Parsing, and Textual Entailment

Let’s analyze the first sentence on the package featured above. This is known as the “premise.”

“Benefit with raisins. It’s for your benefit! This product contains low-fat contents and is made from natural ingredients.”

First, we need to find all of the ways in which the product name presents itself. Here we see that “Benefit,” “It’s,” and “This product” all refer to the same thing.
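Here is a deliberately simple, rule-based sketch of that coreference step: link pronouns and generic phrases like “this product” back to the most recently seen product name. Production coreference resolution uses trained neural models; the hard-coded name and anaphor lists below are stand-ins for what those models infer.

```python
import re

PRODUCT_NAMES = {"Benefit"}                       # assumed known for this sketch
ANAPHORS = {"it", "its", "it's", "this product"}  # phrases that can refer back

def resolve_coreferences(text):
    """Pair each mention with the product name it refers to, in reading order."""
    mentions = []
    last_product = None
    for match in re.finditer(r"this product|[A-Za-z']+", text, re.IGNORECASE):
        token = match.group(0)
        if token in PRODUCT_NAMES:
            last_product = token
            mentions.append((token, token))
        elif token.lower() in ANAPHORS and last_product:
            mentions.append((token, last_product))
    return mentions

premise = ("Benefit with raisins. It's for your benefit! "
           "This product contains low-fat contents.")
result = resolve_coreferences(premise)
print(result)
```

Note that the lowercase “benefit” in “for your benefit!” is correctly left alone: case distinguishes the brand name from the common noun here, which is one of the cues a trained model also exploits.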

Beyond simply knowing the product name, we also need to understand how the product name and its surrounding descriptors are related. Take, for example, clausal complements. In the sentence, “This product contains low fat,” “contains” indicates the presence of a clausal complement which helps describe the product: it claims to be low in fat.

Other descriptors include prepositional modifiers, such as “Made with whole grains.” “With” leads into a useful description of the product: it has/includes/is made with whole grains.
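The two descriptor patterns above can be sketched with a lightweight extractor. A real pipeline would use a dependency parser to identify clausal complements and prepositional modifiers; the regular expressions here only approximate those two relations, and the relation labels are our own naming for this illustration.

```python
import re

# Approximations of two dependency patterns:
#   "contains X"          -> clausal-complement style descriptor
#   "made with/from X"    -> prepositional-modifier style descriptor
PATTERNS = [
    (r"\bcontains ([a-z\- ]+?)(?:[.,!]| and |$)", "contains"),
    (r"\bmade (?:with|from) ([a-z\- ]+?)(?:[.,!]| and |$)", "made_with"),
]

def extract_claims(text):
    """Return (relation, descriptor) pairs found in the package text."""
    text = text.lower()
    claims = []
    for pattern, relation in PATTERNS:
        for match in re.finditer(pattern, text):
            claims.append((relation, match.group(1).strip()))
    return claims

premise = ("This product contains low-fat contents and is "
           "made from natural ingredients.")
claims = extract_claims(premise)
print(claims)
```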

Now that the model knows more about the individual components (nouns, adjectives, phrases, etc.) of sentences, we can dive deeper into the meaning of the text as a whole. As humans, from reading the most prominent text on the product package, we can use heuristics to jump to the conclusion that Benefit is, in fact, a healthy product. But when analyzing the product in comparison to other cereal offerings, a trained AI model indicates that, although Benefit claims to be “healthy,” “low-fat” and “natural,” it does not appear to be any better than the average product in its class. Therefore, the model finds that Benefit’s claim is “neutral,” meaning neither true nor false.
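The category comparison behind that “neutral” verdict can be illustrated numerically. The nutrition values and the threshold below are invented for this sketch; a production system derives such numbers from parsed label data and uses a trained textual entailment model rather than a fixed margin, but the three-way outcome is the same.

```python
# Hypothetical cereal-category averages; real values come from catalog data.
CATEGORY_AVERAGES = {"fat_g": 2.5, "sugar_g": 12.0}

def judge_claim(product, claim_field, margin=0.15):
    """Label a 'low X' claim against the category average.

    entailment:    meaningfully below the category average (claim supported)
    contradiction: meaningfully above it (claim refuted)
    neutral:       about average, so the claim is neither true nor false
    """
    avg = CATEGORY_AVERAGES[claim_field]
    value = product[claim_field]
    if value < avg * (1 - margin):
        return "entailment"
    if value > avg * (1 + margin):
        return "contradiction"
    return "neutral"

# Invented values for the example product: roughly average fat content,
# so the "low-fat" claim comes out neutral.
benefit = {"fat_g": 2.4, "sugar_g": 11.5}
verdict = judge_claim(benefit, "fat_g")
print(verdict)
```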

We want to give special thanks to the researchers at the Allen Institute for Artificial Intelligence for making their research available and accessible.