NLP Punctuation, Lower-Case and StopWords Pre-Processing

In my March blog, I explained how to use the stemming technique in Natural Language Processing (NLP) to predict whether a particular Tweet could be geolocated to a particular neighborhood in the city of Caracas, Venezuela. Almost 37,000 Spanish-language Tweets with a latitude and longitude inside Caracas were used to observe reactions to the food shortages within each of the city’s five municipalities from December 2014 to October 2016.

Today I want to conclude this NLP mini-series by looking at other pre-processing steps: removing punctuation, converting upper-case words to lower case, and removing stopwords. As a reminder, there are many great blogs out there that will give you code snippets if you want to dive straight in.

This is the original Tweet, #24 out of 2,835 filtered results. It was written at 18:42 on January 4, 2015 in the Baruta municipality of Caracas, Venezuela. For privacy reasons, the author of the Tweet is not shown.

Here’s some slightly different code that tokenizes the Tweet (splits the sentence into separate words) and then removes all of the punctuation. The result, “tokenized_sent”, is a list of tokenized words without punctuation marks.
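As a rough illustration of this step, here is a minimal sketch that uses only Python’s built-in re module rather than NLTK. The sample sentence is a hypothetical stand-in, not the actual Tweet, and the one-step regular expression is my assumption about how the tokenizing and punctuation removal could be combined.

```python
import re

# Hypothetical stand-in text; this is NOT the actual Tweet from the dataset.
sample_text = "No hay jabón en el Café, Venezuela! #AnaquelesVaciosEnVenezuela"

# \w+ matches runs of letters, digits and underscores. In Python 3 it is
# Unicode-aware, so accented letters are kept while punctuation such as
# ',', '!' and '#' is dropped.
tokenized_sent = re.findall(r"\w+", sample_text)

print(tokenized_sent)
```

NLTK’s RegexpTokenizer accepts the same kind of pattern, so a similar result should be achievable there if you prefer to stay inside the library.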


Now that the words have been tokenized and the punctuation removed, we want to convert all the upper-case words to lower case. Looking at the first 9 words in the last line of code tells us we’re on the right track. (It’s interesting to note that the tokenizer, unlike the Snowball Stemmer we discussed last month, leaves in Spanish accent marks such as the “é” in the word café.)
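The lower-casing step could look something like the sketch below. The token list is a hypothetical example rather than the real Tweet, and the variable name “lower_tokens” is mine.

```python
# Hypothetical token list standing in for the tokenized Tweet.
tokenized_sent = ["No", "hay", "jabón", "en", "el", "Café", "Venezuela"]

# str.lower() is Unicode-aware, so accented characters such as "é" survive.
lower_tokens = [word.lower() for word in tokenized_sent]

# Peek at the first few tokens as a quick sanity check.
print(lower_tokens[:9])
```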


The last step for our Tweet text is to remove any stopwords. These are words such as “and”, “from” and “to” in English, or the equivalent “y”, “de” and “a” in Spanish. Stopwords are very common and carry little meaning on their own, so removing them saves processing space.
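To make the idea concrete, here is a small sketch of stopword filtering. The stopword set below is a tiny hand-picked subset that I am assuming for illustration; NLTK ships a much longer Spanish list in its stopwords corpus, which needs to be downloaded before use. The tokens are again hypothetical.

```python
# Tiny illustrative subset of Spanish stopwords (an assumption for this sketch;
# a real list, such as the one in NLTK's stopwords corpus, is much longer).
spanish_stopwords = {"y", "de", "a", "en", "el"}

# Hypothetical lower-cased tokens standing in for the Tweet.
lower_tokens = ["no", "hay", "jabón", "en", "el", "café", "de", "venezuela"]

# Keep only the tokens that are not in the stopword set.
filtered_tokens = [word for word in lower_tokens if word not in spanish_stopwords]

print(filtered_tokens)
```

Using a set rather than a list for the stopwords keeps each membership check fast, which matters once you loop over tens of thousands of Tweets.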


Even after all of these pre-processing steps with the NLTK Python library, our processed text still has two items that we need to correct before feeding the data into a machine learning algorithm: ‘h’ and ‘anaquelesvaciosenvenezuela’. In order to replace these terms, I first need to convert the list into a string.
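Joining a list of tokens back into a single string is a one-liner in Python. The sketch below uses a hypothetical token list that merely contains the two problem items mentioned above; it is not the real processed Tweet.

```python
# Hypothetical filtered tokens, including the two items that need correcting.
filtered_tokens = ["no", "h", "jabón", "anaquelesvaciosenvenezuela", "venezuela"]

# Join the tokens with single spaces to get one string.
text = " ".join(filtered_tokens)

print(text)
```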


Given our understanding of the words used in this domain, and the context of another part of the Tweet that said ‘no hay jabón’ (there is no soap), I will replace the ‘h’ with the word ‘hay’ (there is). Given my understanding of the semantic meaning, I will also replace the term ‘anaquelesvaciosenvenezuela’ (emptyshelvesinvenezuela), which was part of a hashtag, with the term ‘anaquelesvacios’. The term ‘venezuela’ appears elsewhere in the text, and the term ‘en’ is a Spanish stopword. The final filtered text is shown below.
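One way these two replacements might be written is sketched here. The input string is a hypothetical stand-in for the processed Tweet, and the use of a word-boundary regular expression for ‘h’ is my own choice: a plain string replace on ‘h’ would also change every other word that contains that letter.

```python
import re

# Hypothetical processed text; NOT the actual Tweet.
text = "no h jabón anaquelesvaciosenvenezuela venezuela"

# \b marks a word boundary, so only a standalone "h" is replaced and
# words that merely contain the letter "h" are left untouched.
text = re.sub(r"\bh\b", "hay", text)

# The hashtag term is long and unique, so a plain string replace is enough.
final_text = text.replace("anaquelesvaciosenvenezuela", "anaquelesvacios")

print(final_text)  # no hay jabón anaquelesvacios venezuela
```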

I’d love to hear your input about this Natural Language Processing blog mini-series. Next time we’ll talk about Artificial Neural Networks in more detail.
