When working with a dataset containing both text and numerical/categorical data, there are different connections that you might want to draw between the text and non-text data.

If you have a dataset of e-mails containing each e-mail’s text and a classification as spam/not spam, you could build a model that predicts the likelihood that an e-mail is spam given the e-mail’s text. By contrast, if you have a dataset of Yelp reviews containing the text of the review and the user’s 1-5 star rating of the business, you likely don’t care about predicting the star rating; after all, while e-mails don’t arrive with “spam” labels, every Yelp review is linked to a user’s 1-5 star rating. For datasets like Yelp reviews, rather than using the text to predict the user’s rating, you probably want to use the text to describe what’s driving particular ratings, discovering nuances that business owners could act on.

This post outlines how to use multinomial Naive Bayes, an algorithm commonly used for predicting classifications given text data, to answer the descriptive “What prompts users to give 5-star reviews to Greek restaurants?”-style questions you could ask of a dataset. We’ll use sklearn’s implementation of multinomial Naive Bayes to first link words to our outcome of interest (e.g., getting a 5-star review) and then look at phrases to discover more niche correlations.

The Data

For this exercise, we’ll use data from customer complaints about financial institutions provided by the Consumer Financial Protection Bureau. The outcome/classification we’ll care about isn’t a user’s 1-5 star rating but instead whether the company provided either monetary or non-monetary compensation (technically known as “relief”) to the complaining customer.

We’ll look at the 5,826 observations where customers complained about credit card issues, assuming that we’ve already processed the dataset to look like:

   Relief  Complaint
0  True    Home Depot habitually credits my account a wee…
1  False   Dear representatives of The Consumer Financial…
2  False   I applied for a credit card with USAA and I wa…
3  False   I believe what happened was a system error. I …
4  True    I filed a Ch. XXXX Bankruptcy in the XXXX XXXX…

(Want to see the full details? Check out this post’s Jupyter Notebook.)

The purpose of our analysis is to answer the question: Qualitatively, what types of complaints are most likely to receive relief? While we’ll keep this post focused on that question, this is actually a case where a predictive algorithm would also be useful: you could use it to give immediate feedback to users whose draft complaints are predicted to have a low likelihood of receiving relief.

As we mentioned above, multinomial Naive Bayes is typically used for predicting classifications given text data; it’s a classic introductory algorithm in data science courses. As part of its predictive machinery, for each term (i.e., a word or phrase) in the text, multinomial Naive Bayes estimates $P(term|class)$, the probability of seeing a term given a particular class (e.g., customer received relief, customer did not receive relief). With its most basic parameter values, multinomial Naive Bayes estimates $P(term|class)$ as:

$\frac{\text{Number of times term occurs in all observations of a given class}}{\text{Total number of terms in all observations of a given class}}$

When searching over all words in a text, this fraction is essentially, for a given class, the number of times the word appeared divided by the total number of opportunities for the word to appear.

Given our motivating question, we’re not particularly interested in the words that are most likely to occur for a given class, though. The set of words with the largest values for $P(term|class)$ contains words like “bank” and “credit” that are prevalent in all complaints, irrespective of whether they received relief or not. Instead, what we care about is the words with the highest and lowest relative prevalence, which we’ll define as:

$RP(term,class)\equiv P(term|\text{in class})-P(term|\text{not in class})$

In our specific case, we care about the words with the highest and lowest value for $P(term|\text{complaint received relief})-P(term|\text{complaint did not receive relief})$. The words with the highest values for $RP$ can suggest reasons why customers received relief; words with the lowest values for $RP$, by contrast, can help explain why customers did not receive relief. Two asides here:

1. We could have chosen to look for words that maximized $P(term|\text{in class})/P(term|\text{not in class})$, but that would cause rare words to have very high scores, preventing us from finding words that are more informative about what drives our outcome of interest.
2. For a setting with more than two outcomes–e.g., 1-5 star reviews for Yelp–this approach works once you have created dummy variables for your outcome of interest.

Given our Pandas DataFrame with the Complaint and Relief columns we showed earlier, here’s the code to start finding the words with the highest and lowest values of $RP$:

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# df is the preprocessed DataFrame shown above
complaints = df['Complaint'].values

vect = CountVectorizer(ngram_range=(1,1), stop_words='english')
X = vect.fit_transform(complaints)
words = vect.get_feature_names_out()  # get_feature_names() in older sklearn

y = [int(x) for x in df['Relief'].values]

clf = MultinomialNB(alpha=0)
clf.fit(X, y)

likelihood_df = pd.DataFrame(clf.feature_log_prob_.transpose(),
                             columns=['NoRelief', 'Relief'], index=words)


Breaking down that code:

• We import pandas for working with tabular data; sklearn’s implementation of multinomial Naive Bayes; and CountVectorizer, which converts a list of text strings into a sparse matrix whose rows correspond to observations/documents, whose columns correspond to terms in the text, and whose values are the count of each term in each document.
• For CountVectorizer, setting ngram_range=(1,1) extracts words (i.e., 1-grams) from our text, and setting stop_words='english' drops common words (e.g., a, an, the). By default, CountVectorizer also strips punctuation and ignores differences between lower- and uppercase words.
• For MultinomialNB, we set alpha=0 to override the default behavior of additive/Laplace smoothing, which, while a good idea for predictive applications, can bias our results here (particularly when working with imbalanced datasets). (Note that some versions of sklearn warn about and clip alpha=0; in those versions, also passing force_alpha=True preserves it.)
• The feature_log_prob_ attribute of our fitted classifier contains the log of the probabilities we’re interested in, $P(term|class)$. (Logs are commonly used in situations like this where probabilities can get so low that “underflow” can occur.)
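As a quick sanity check on what feature_log_prob_ holds, here’s a toy fit on an invented count matrix (the counts below are made up, not from the CFPB data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented term-count matrix: 2 documents x 3 terms
X = np.array([[2, 1, 0],
              [0, 1, 3]])
y = [0, 1]  # 0 = no relief, 1 = relief

clf = MultinomialNB(alpha=0).fit(X, y)

# One row per class (ordered as clf.classes_), one column per term
print(clf.feature_log_prob_.shape)  # (2, 3)

# Inverting the log recovers P(term|class); each row sums to 1
print(np.exp(clf.feature_log_prob_).sum(axis=1))
```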

To find the words with the highest and lowest values of $RP$, we just need to invert the log operation, take the difference of the probabilities, and extract the top and bottom 10.

likelihood_df['Relative Prevalence for Relief'] = likelihood_df.eval('exp(Relief) - exp(NoRelief)')

top_10 = likelihood_df['Relative Prevalence for Relief'].sort_values(ascending=False).iloc[:10]

# Double-sorting here so that the graph will look nicer
bottom_10 = likelihood_df['Relative Prevalence for Relief'].sort_values().iloc[:10].sort_values(ascending=False)

top_and_bottom_10 = pd.concat([top_10, bottom_10])


Graphing the result yields:

Even though the values for relative prevalence are quite small, the terms at the top and bottom of this ranking have sizable impacts on the likelihood of receiving relief, as we’ll see below for the case of “xxxx xxxx.”
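For reference, a minimal matplotlib sketch of that kind of chart; the values below are invented stand-ins (in practice you’d pass the real top_and_bottom_10), and the original post’s styling may differ:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in values for illustration only
top_and_bottom_10 = pd.Series(
    {"refund": 0.004, "resolved": 0.003, "fee": -0.003, "denied": -0.004})

ax = top_and_bottom_10.plot(kind="barh", figsize=(8, 6))
ax.set_xlabel("Relative Prevalence for Relief")
ax.axvline(0, color="black", linewidth=0.8)  # divide positive from negative
plt.tight_layout()
plt.savefig("relative_prevalence.png")
```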

Searching for two-word phrases (2-grams) instead of words only requires changing one line of code. Rather than:

vect = CountVectorizer(ngram_range=(1,1), stop_words='english')


We simply write:

vect = CountVectorizer(ngram_range=(2,2), stop_words='english')


Which yields:

Since relative prevalence scores for 2-grams are generally lower than those for 1-grams–2-grams are simply rarer–I typically run the 1-gram and 2-gram analyses separately.

Gotcha: The Importance of Preventing Smoothing

Does setting alpha=0 really matter? Let’s re-run our 2-gram analysis, changing our classifier to the default (i.e., clf = MultinomialNB()). Here’s the result:

With additive smoothing turned on, “xxxx xxxx”–a signifier for two pieces of scrubbed-out personal information–goes from having the highest relative prevalence for relief among 2-grams to the absolute lowest. What’s more, the interpretation you’d draw from this graph–that complaints with more instances of “xxxx xxxx” were less likely to receive relief–is demonstrably incorrect:

What went wrong? The issue is with how additive smoothing changes the estimated $P(term|class)$ values: it adds a small constant to each term’s count in the numerator and a much larger constant (that constant times the vocabulary size) to the denominator. Inflating the denominator decreases larger fractions by more than it decreases smaller fractions. (Consider adding 4 to the denominators of 5/6 and 3/6: the larger fraction drops by .33, from .83 to .5, while the smaller fraction only drops by .2, from .5 to .3.)

This means that the bias hits words with high values for $RP$ hardest, as the high value for $P(\text{xxxx xxxx}|relief)$ gets deflated by much more than the correspondingly lower value for $P(\text{xxxx xxxx}|\text{no relief})$. As we saw above, this bias can be so strong that it flips the sign of the relative prevalence, making words that are strong indicators of an outcome (e.g., “xxxx xxxx”) appear as the exact opposite. For this reason, it’s critical to avoid additive smoothing when using this approach to link terms with outcomes of interest.
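The arithmetic can be checked directly. Using invented counts for a term that’s common in a small relief class, and assuming a vocabulary of 1,000 terms (with alpha=1, MultinomialNB adds 1 to each count and the vocabulary size to each denominator), smoothing flips the sign of $RP$:

```python
vocab_size = 1000
c_relief, n_relief = 5, 10    # term count / total terms, relief class
c_none, n_none = 20, 100      # term count / total terms, no-relief class

# Unsmoothed RP: 0.5 - 0.2 > 0
raw_rp = c_relief / n_relief - c_none / n_none

# Laplace-smoothed (alpha=1) RP: the small relief denominator is
# inflated proportionally far more, so the sign flips
smoothed_rp = (c_relief + 1) / (n_relief + vocab_size) \
            - (c_none + 1) / (n_none + vocab_size)

print(raw_rp > 0, smoothed_rp > 0)  # True False
```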

The Gist

If you want to use text data to help describe what’s driving outcomes that you care about, simply fit an instance of multinomial Naive Bayes and look at the terms with the highest and lowest relative prevalence in observations with the outcome of interest. Make sure to prevent the multinomial Naive Bayes instance from doing additive smoothing by setting alpha=0.

Stray Observations

• Looking at larger n-grams–e.g., 3-grams, or three-word phrases–has the advantage of yielding more interpretable results. It’s easier to know what “bonus award night” is referring to compared to “charge,” after all. Larger n-grams have the disadvantage of being quite rare, though, as the likelihood of people using the same string of words drops off significantly as you go from 2-grams to 3- or 4-grams.
• To see the full analysis and the code to generate the graphs above, check out the Jupyter Notebook.
• Photo of typewriter letters c/o Derek Gavey