In this machine learning tutorial, we will learn how to do effective preprocessing of text data for sentiment analysis.
What is Sentiment Analysis?
In a business, when we take feedback from our customers and measure their satisfaction or dissatisfaction with our product or service, it is called sentiment analysis. This task uses pieces of text, such as sentences, to determine the customers' view.
What is preprocessing?
Preprocessing means transforming the data before we feed it to machine learning algorithms. It is a data mining technique that cleans raw data and turns it into an understandable, readable format.
Why do we need preprocessing of the data?
When data is collected from different sources, it is usually unstructured and we can't use it for analysis directly. Preprocessing the data is necessary for achieving better results with our machine learning models.
Why do we need sentiment analysis?
In modern times customers have a vast range of options when buying a product or choosing a service, so it becomes necessary for a company to take care of its customers' satisfaction in order to compete with its business rivals. Sentiment analysis used to be done manually, but working through a large amount of feedback without bias and error became harder. With the advancement of technology and machine learning algorithms, it has become easier to find customers' attitudes towards a product. Using sentiment analysis we can also find the most important issues in an organization.
The dataset can be downloaded from here. You can learn how to import a dataset in Python here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
%matplotlib inline
import xml.etree.ElementTree as ET

xml_path = 'C:/Users/ndpl/Downloads/archive/ABSA15_RestaurantsTrain/ABSA-15_Restaurants_Train_Final.xml'

def parse_data_2015(xml_path):
    container = []
    reviews = ET.parse(xml_path).getroot()
    for review in reviews:
        # review[0] is the <sentences> element (Element.getchildren() was removed
        # in Python 3.9, so we use indexing and iteration instead)
        for sentence in review[0]:
            sentence_text = sentence[0].text      # the <text> element
            try:
                opinions = sentence[1]            # the <Opinions> element, if present
                for opinion in opinions:
                    polarity = opinion.attrib["polarity"]
                    target = opinion.attrib["target"]
                    row = {"sentence": sentence_text, "sentiment": polarity}
                    container.append(row)
            except IndexError:
                # sentences without an <Opinions> element have no sentiment label
                row = {"sentence": sentence_text}
                container.append(row)
    return pd.DataFrame(container)

my_df = parse_data_2015(xml_path)
my_df.head()
As we can see above, the dataset has been imported. The first column shows the text reviews and the other column shows the polarity of the review. Now we'll look at the datatypes and other information about the attributes in the dataset.
my_df.info()
Now let's look at the counts of the different sentiment labels.
my_df.sentiment.value_counts()
As we can see above, there are 1198 positive reviews and 403 negative reviews. The number of sentiment labels and the number of reviews are not equal, which means there are some null values in the sentiment column. Now let's check for the null values and drop them, along with any duplicates.
my_df.isnull().sum()
print("Original:", my_df.shape)

my_dd = my_df.drop_duplicates()
dd = my_dd.reset_index(drop=True)
print("Drop Duplicates:", dd.shape)

dd_dn = dd.dropna()
df = dd_dn.reset_index(drop=True)
print("Drop Nulls:", df.shape)
Preprocessing: Tokenization
We all know that text is the most unstructured type of data, and we need to do a lot of cleaning before running it through ML models. This also helps us get more accurate predictions.
For example, we'll tokenize one review here.
df.sentence[1]
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(df.sentence[1])
print(tokens)
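As a small aside, the same punkt models also support sentence-level splitting should you need it (this is optional and not used later in the tutorial):
from nltk.tokenize import sent_tokenize

# Split the same review into sentences rather than words
print(sent_tokenize(df.sentence[1]))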
Now let us define our own analyzer function for the text reviews.
import string
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # WordNetLemmatizer needs the WordNet data
lm = WordNetLemmatizer()

def own_analyser(phrase):
    # Split on whitespace, drop standalone punctuation, lemmatize and lower-case the rest
    phrase = phrase.split()
    for i in range(0, len(phrase)):
        k = phrase.pop(0)
        if k not in string.punctuation:
            phrase.append(lm.lemmatize(k).lower())
    return phrase
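A quick sanity check on a made-up phrase (not from the dataset) shows what the analyzer returns:
# Tokens are lemmatized and lower-cased; standalone punctuation is dropped
print(own_analyser("The dishes were amazing , the waiters were not !"))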
Stopwords
Stopwords are common words that occur very frequently in sentences. To learn why stopwords need to be removed, read this.
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')
print([i for i in tokens if i not in stop_words])
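If you want to keep the filtered tokens for later steps, you can assign the result; lower-casing before the comparison is a small tweak worth noting, since NLTK's stopword list is all lower-case:
# Keep the filtered tokens; compare in lower case so "The" is also caught
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]
print(filtered_tokens)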
Later we will select the inputs and outputs for our ML models.
Preprocessing: Normalisation
In reviews there are words which look different but have the same meaning. These kinds of words need to be processed, and the normalization process ensures they are treated as the same.
The following normalization steps are applied.
1. Casing the characters
Converting all characters to the same case so that they are treated the same.
df.sentence[2]
lower_case = df.sentence[2].lower()
lower_case
2. Negation Handling
In natural language, negations are often contracted with an apostrophe, e.g. aren't for are not. To make the computer treat both forms the same, we preprocess the words using a lexicon of contractions.
appos = {
    "aren't": "are not", "can't": "cannot", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "i'd": "I would", "i'll": "I will", "i'm": "I am", "isn't": "is not",
    "it's": "it is", "it'll": "it will", "i've": "I have", "let's": "let us",
    "mightn't": "might not", "mustn't": "must not", "shan't": "shall not",
    "she'd": "she would", "she'll": "she will", "she's": "she is",
    "shouldn't": "should not", "that's": "that is", "there's": "there is",
    "they'd": "they would", "they'll": "they will", "they're": "they are",
    "they've": "they have", "we'd": "we would", "we're": "we are", "weren't": "were not",
    "we've": "we have", "what'll": "what will", "what're": "what are", "what's": "what is",
    "what've": "what have", "where's": "where is", "who'd": "who would",
    "who'll": "who will", "who're": "who are", "who's": "who is", "who've": "who have",
    "won't": "will not", "wouldn't": "would not", "you'd": "you would",
    "you'll": "you will", "you're": "you are", "you've": "you have",
    "'re": " are", "wasn't": "was not", "we'll": "we will"
}
words = lower_case.split()
reformed = [appos[word] if word in appos else word for word in words]
reformed = " ".join(reformed)
reformed
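Since the same replacement is reused later in the cleaning function, it can be handy to wrap it in a small helper (the function name here is our own):
def expand_contractions(text, lexicon=appos):
    # Lower-case, then replace contractions such as "aren't" with "are not"
    words = text.lower().split()
    reformed = [lexicon[word] if word in lexicon else word for word in words]
    return " ".join(reformed)

expand_contractions("The staff weren't friendly")   # -> 'the staff were not friendly'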
3. Removing punctuation and numbers
In this step we will remove punctuation, special characters and numerical characters, as they don't contribute to the sentiment.
tokens
words = [word for word in tokens if word.isalpha()]
words
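An alternative that is not used in the rest of this tutorial is to strip unwanted characters from the raw string with a regular expression before tokenizing, for example:
import re

# Keep only letters, apostrophes and whitespace, then collapse repeated spaces
cleaned = re.sub(r"[^a-zA-Z'\s]", " ", lower_case)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
cleaned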
4. Lemmatization
This process maps different forms of a word to a single base form with the same meaning, e.g. is, am, are, being -> be.
The function used below relies on the English lemmatizer from the pattern library to extract lemmas. Words in a text are identified through word-category disambiguation, where both the definition and the context of a word are taken into account to assign its specific POS tag.
Let's lemmatize a couple of example reviews.
my_df.sentence[24]
# gensim's lemmatize wraps the external 'pattern' package (it was removed in gensim 4.x)
from gensim.utils import lemmatize

lemm = lemmatize(my_df.sentence[24])
lemm
my_df.sentence[17]
lemmatize(my_df.sentence[17])
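Note that gensim.utils.lemmatize depends on the external pattern package and was removed in gensim 4.x. If it is not available in your environment, a rough equivalent can be built with NLTK's POS tagger and WordNetLemmatizer; this is a hedged sketch that only loosely mimics gensim's word/TAG output, not an exact replacement:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()

def wordnet_lemmas(text):
    # Return 'lemma/TAG' strings, loosely mimicking gensim's lemmatize output
    tag_map = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}
    pairs = []
    for word, tag in pos_tag(word_tokenize(text.lower())):
        if word.isalpha():                        # keep only alphabetic tokens
            wn_pos = tag_map.get(tag[0], wordnet.NOUN)
            pairs.append(lm.lemmatize(word, wn_pos) + '/' + tag[:2])
    return pairs

wordnet_lemmas(my_df.sentence[24])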
Preprocessing: Substitution
Here we remove noise from the raw text. For example, it may contain HTML or XML tags, since it was extracted from the web. These can be removed with regular expressions, as shown below.
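For example, a simple regular expression can strip anything that looks like a tag (a minimal sketch; heavily nested or malformed HTML is better handled by a dedicated parser):
import re

def strip_tags(text):
    # Remove anything that looks like an HTML/XML tag, e.g. <br/> or <b>...</b>
    return re.sub(r"<[^>]+>", " ", text)

strip_tags("Great pizza!<br/>Terrible service.")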
Decoding
UTF-8 Character Bytes
- 1 byte: standard ASCII
- 2 bytes: Arabic, Hebrew, and most European scripts
- 3 bytes: the rest of the Basic Multilingual Plane (BMP)
- 4 bytes: all remaining Unicode characters (supplementary planes)
# Note: .decode() assumes a byte string (Python 2); in Python 3 this column already holds str values
my_df.sentence[24].decode("utf-8-sig")
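Because .decode() only exists on byte strings, in Python 3 decoding is normally done once when the raw file is read. A minimal sketch, reusing the xml_path defined earlier:
# In Python 3, decode the raw bytes once when reading the file
with open(xml_path, 'rb') as f:
    raw_bytes = f.read()
text = raw_bytes.decode('utf-8-sig')   # 'utf-8-sig' also strips a UTF-8 BOM if present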
import time
from tqdm import tqdm

def cleaning_function(tips):
    all_ = []
    for tip in tqdm(tips):
        time.sleep(0.0001)
        # Decoding function (assumes byte strings; see the note above)
        decode = tip.decode("utf-8-sig")
        # Lowercasing before negation
        lower_case = decode.lower()
        # Replace apostrophes with words
        words = lower_case.split()
        split = [appos[word] if word in appos else word for word in words]
        reformed = " ".join(split)
        # Lemmatization (on the text with contractions expanded)
        lemm = lemmatize(reformed)
        all_.append(lemm)
    return all_

def separate_word_tag(df_lem_test):
    words = []
    types = []
    df_out = pd.DataFrame()
    for row in df_lem_test:
        sent = []
        type_ = []
        for word in row:
            split = word.split('/')
            sent.append(split[0])
            type_.append(split[1])
        words.append(' '.join(word for word in sent))
        types.append(' '.join(word for word in type_))
    df_out['lem_words'] = words
    df_out['lem_tag'] = types
    return df_out
Clean the training data.
word_tag = cleaning_function(df.sentence)
lemm_df = separate_word_tag(word_tag)

# concat cleaned text with original
my_df_training = pd.concat([my_df, lemm_df], axis=1)
my_df_training['word_tags'] = word_tag
my_df_training.head()
Now let us check for null and empty values.
my_df_training = my_df_training.reset_index(drop=True)

# check null values
my_df_training.isnull().sum()
# empty values
my_df_training[my_df_training['lem_words'] == '']
# drop these rows
print(my_df_training.shape)
my_df_training = my_df_training.drop([475, 648, 720])
my_df_training = my_df_training.reset_index(drop=True)
my_df_training.shape
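At this point it can also be convenient to save the cleaned training set, mirroring the CSV export we do for the prediction data below (the file name here is just an example):
# Save the cleaned training data for later modelling steps (example path)
my_df_training.to_csv('./restaurants_train_clean.csv', header=True, index=False, encoding='UTF8')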
Clean the prediction data.
# load the data
fs = pd.read_csv('./foursquare/foursquare_csv/londonvenues.csv')

# use cleaning functions on the tips
word_tag_fs = cleaning_function(fs.tips)
lemm_fs = separate_word_tag(word_tag_fs)

# concat cleaned text with original
df_fs_predict = pd.concat([fs, lemm_fs], axis=1)
df_fs_predict['word_tags'] = word_tag_fs

# separate the long lat
lng = []
lat = []
for ll in df_fs_predict['ll']:
    lnglat = ll.split(',')
    lng.append(lnglat[0])
    lat.append(lnglat[1])
df_fs_predict['lng'] = lng
df_fs_predict['lat'] = lat

# drop the ll column
df_fs_predict = df_fs_predict.drop(['ll'], axis=1)
df_fs_predict.head()
# save clean foursquare to csv
df_fs_predict.to_csv('./foursquare/foursquare_csv/foursquare_clean.csv', header=True, index=False, encoding='UTF8')