Natural Language Processing: N-gram Extraction for World Cup

In last month’s blog, I talked about sentiment analysis of social media, an application of computational linguistics, the field that takes human language and translates it so it can be processed by a computer. One of the techniques used by Shultz and team was n-gram extraction, which I’ll talk a little bit more about today using an example.

As always, let’s start with a problem we’re trying to solve using data science. As I wrote in a 2017 mini series on the characteristics of a data scientist, a great data scientist starts with the problem and then uses analytics to solve it. Suppose we are a sports brand company that wants to measure the effectiveness of the advertising campaign for our newest shoe, released in stores just before the start of the 2019 Women’s World Cup in France. We’ve spent 5 million euros on the campaign and want to measure the short-term and long-term return on investment for that advertising money. An audience of 1 billion people was predicted to watch the Women’s World Cup, giving advertisers plenty of potential customers to catch.

Let’s suppose that one of the metrics the sports brand company uses to measure advertising ROI is the set of Tweets, Instagram posts and Facebook posts written by people of different genders, demographics, and cities while each World Cup match was being played. Social media analysis has increasingly become another dimension companies use to take the pulse of customer sentiment. We have about 30,000 Tweets, 200 Instagram captions and 5,000 Facebook posts to analyze. There are photos too, but for now, let’s just focus on the text. The text from those social media channels is in multiple languages and contains all sorts of nuances, such as emojis, misspellings and abbreviations, that we can talk about in a future blog post. Let’s assume that after getting the data, we have already pre-processed it and translated it into English.

In the field of computational linguistics, an N-gram is a contiguous sequence of N terms within a sentence. As Z. Ye and team describe in their paper SparkText: Biomedical Text Mining on Big Data Framework, “the N-gram model can be likened to putting a small window over a sentence in which only N words are detectable at a time.” So for unigrams, that window over the sentence is one word wide; for bigrams it is two words, for trigrams three words, and so on.
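To make the window idea concrete, here is a minimal Python sketch. The function name `ngrams` and the sample sentence are my own choices for illustration and do not come from the SparkText paper; the sketch also assumes the sentence has already been split into words.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, moving the window one token at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last valid window starts at index len(tokens) - n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative sentence, already split on whitespace.
words = "the window slides over the sentence".split()

print(ngrams(words, 1))  # unigrams: one word per window
print(ngrams(words, 2))  # bigrams: two neighbouring words per window
print(ngrams(words, 3))  # trigrams: three neighbouring words per window
```

A sentence of six words yields six unigrams, five bigrams and four trigrams, because each step moves the window along by a single word. If the window is wider than the sentence, the function simply returns an empty list.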

In the social media example, the original text is “LMAO – outrageous attempt at footwear…was it created by someone who didn’t want @mpinoe to get a hattrick? #ESPvsUSA”

For now, let’s just consider the features to be English-language words and not include any hashtags or abbreviations.
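One way this filtering step could look in Python is sketched below. The `ABBREVIATIONS` set and the `COMPOUND_FIXES` table are hypothetical lookups I made up to fit this single post (a real pipeline would need far larger ones), and keeping the player’s handle while dropping the `@` is likewise an assumption.

```python
import re

# Hypothetical lookup tables, chosen only to fit this one example post.
ABBREVIATIONS = {"lmao", "lol", "omg"}
COMPOUND_FIXES = {"hattrick": ["hat", "trick"]}


def tokenize(post):
    """Lowercase a post and return its plain English word tokens."""
    text = post.lower().replace("’", "'")   # normalise curly apostrophes
    text = re.sub(r"#\w+", " ", text)       # drop hashtags entirely
    text = text.replace("@", "")            # keep the handle, drop the @ sign
    # A word is a run of letters, optionally followed by an apostrophe and more letters.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    tokens = []
    for word in words:
        if word in ABBREVIATIONS:
            continue                        # skip abbreviations such as "lmao"
        tokens.extend(COMPOUND_FIXES.get(word, [word]))
    return tokens


post = ("LMAO – outrageous attempt at footwear…was it created by someone "
        "who didn’t want @mpinoe to get a hattrick? #ESPvsUSA")
print(tokenize(post))
```

The result is a list of 18 lowercase words, starting with “outrageous” and ending with “hat”, “trick”. Punctuation such as the dash and the ellipsis never matches the word pattern, so it acts only as a separator.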

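Grouping those words with windows of one, two, three and four words gives the lists below. The groups shown here do not overlap, and a few short words are skipped in the trigrams and four-grams; sliding the window one word at a time would also produce overlapping N-grams such as “attempt at” and “footwear was”.

Unigrams: “outrageous” “attempt” “at” “footwear” “was” “it” “created” “by” “someone” “who” “didn’t” “want” “mpinoe” “to” “get” “a” “hat” “trick”

Bigrams: “outrageous attempt” “at footwear” “was it” “created by” “someone who” “didn’t want” “mpinoe to” “get a” “hat trick”

Trigrams: “outrageous attempt at” “footwear created by” “someone who didn’t” “want mpinoe to” “get hat trick”

Four-grams: “outrageous attempt at footwear” “created by someone who” “didn’t want mpinoe to” “get a hat trick”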
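The groups listed above mostly sit side by side without overlapping, whereas the window described earlier moves one word at a time. The sketch below contrasts the two readings on the same 18 words. Both helper names, `sliding_ngrams` and `chunked_ngrams`, are hypothetical, and neither is meant as the definitive way to extract N-grams.

```python
# The 18 word tokens from the example post, after removing the hashtag and "LMAO".
tokens = ("outrageous attempt at footwear was it created by someone "
          "who didn't want mpinoe to get a hat trick").split()


def sliding_ngrams(tokens, n):
    """Overlapping N-grams: the window advances one token per step."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chunked_ngrams(tokens, n):
    """Non-overlapping groups of n tokens; a short leftover group is dropped."""
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, n)]


for n in range(1, 5):
    print(n, len(sliding_ngrams(tokens, n)), len(chunked_ngrams(tokens, n)))
```

For these 18 words, the sliding window gives 18, 17, 16 and 15 N-grams for N = 1 to 4, against 18, 9, 6 and 4 non-overlapping groups. The sliding version keeps every neighbouring pair, such as “attempt at”, which the side-by-side grouping never sees.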

I hope you’ve enjoyed today’s lesson on N-gram extraction in text mining. I’ll be taking a break from blogging for the month of August and will evaluate whether to continue this series after the break.

Thanks for reading!
