NLP Tokenization
Image from aio-tv

Today I want to continue looking at machine learning case studies for beginners and in particular, the use of tokenization in natural language processing. A standard definition of natural language processing, or NLP, is translating the words and meaning spoken or written by humans so computers can understand it. Tokenization is the first step in NLP that breaks an object down into sub-objects. NLP is used a lot in the field of social media mining, which takes unstructured data from social media (Facebook, Instagram, Reddit, Twitter, etc) and gets new insights from it. Natural language processing is key in understanding text data. And there’s no shortage of social media data. As this infographic below shows, Twitter alone produced over 300,000 Tweets each minute in 2015.


Image from:

In the fall of 2016, I discussed a social media mining project that used NLP to predict the origin of Tweets at the neighborhood level within a city. I talked in broad terms about the project goals and results but wanted to dive into more technical details today. Social media mining can be used to give insight into how citizens express themselves and can be the most reliable source of information in societies with free speech limits. Almost 37,000 Spanish Tweets that had a latitude and longitude from the city of Caracas, Venezuela were used to observe reactions to the food shortages within each of the city’s five municipalities from December 2014 to October 2016. I wanted to test the hypothesis of whether certain Tweets are particular to a municipality location.

I used the NLTK library within the Python programming language was used to analyze the text from these Tweets. NLTK is a great beginner library and includes common computational linguistic techniques. There are many great blogs out there that will give you code snippets if you want to delve straight in.

Let’s take a look at how the Tweet text written by a human is translated so a computer can understand and process it. I searched for Tweets with a date range of December 2014-October 2016, tagged with a latitude and longitude of Caracas, Venezuela, from the Baruta municipality of Caracas, with the search term “#AnaquelesVaciosEnVenezuela”. There were 2835 Tweets in my filtered list.

This is Tweet # 24 out of 2835 results. It was written at 18:42 on January 4, 2015 in the Baruta municipality. For privacy considerations the author of the Tweet is not shown.

Using a word tokenizer from the Python NLTK library breaks down each word in a sentence and returns each word in list format. The result after applying the word tokenizer would be:


As you can see this is step 1 and the tokenized list has errors from a human point of view since it considers punctuation marks like ‘,’ and ‘. ‘ and ‘:’ words in a sentence. Next time we’ll take a look at the next step in natural language processing – stemming.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.