In this machine learning tutorial, we will learn how to do effective preprocessing of text data for sentiment analysis.
What is Sentiment Analysis?
In a business, when we take feedback from our customers and measure their satisfaction or dissatisfaction with our product or service, it is called sentiment analysis. This task uses pieces of text, such as sentences, to determine the customers' view.
What is preprocessing?
Preprocessing means transforming the data before we feed it to machine learning algorithms. It is a data mining technique that cleans raw data and turns it into an understandable, readable format.
Why do we need preprocessing of the data?
When data is collected from different sources, it is usually unstructured and we can't use it for analysis directly. Preprocessing the data is necessary for achieving better results with our machine learning models.
Why do we need sentiment analysis?
In modern times customers have a vast range of options when buying a product or choosing a service, so it becomes necessary for a company to take care of its customers' satisfaction in order to compete with its business rivals. Sentiment analysis used to be done manually, but working through a large amount of feedback without bias and error became harder. With the advancement of technology and machine learning algorithms, it has become easier to find customers' attitudes towards a product. Using sentiment analysis we can also find the most important issues in an organization.
The dataset can be downloaded from here. You can learn how to import a dataset in Python here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
%matplotlib inline
import xml.etree.ElementTree as ET

xml_path = 'C:/Users/ndpl/Downloads/archive/ABSA15_RestaurantsTrain/ABSA-15_Restaurants_Train_Final.xml'

def parse_data_2015(xml_path):
    container = []
    reviews = ET.parse(xml_path).getroot()
    for review in reviews:
        # review[0] is the <sentences> element (Element.getchildren() was removed
        # in Python 3.9, so we use indexing and iteration instead)
        for sentence in review[0]:
            sentence_text = sentence[0].text      # the <text> element
            try:
                opinions = sentence[1]            # the <Opinions> element, if present
                for opinion in opinions:
                    polarity = opinion.attrib["polarity"]
                    target = opinion.attrib["target"]
                    row = {"sentence": sentence_text, "sentiment": polarity}
                    container.append(row)
            except IndexError:
                # sentences without an <Opinions> element have no sentiment label
                row = {"sentence": sentence_text}
                container.append(row)
    return pd.DataFrame(container)

my_df = parse_data_2015(xml_path)
my_df.head()
As we can see above, the dataset has been imported. The first column shows the text reviews and the other column shows the polarity of the review. Now we'll look at the datatypes and other information about the attributes in the dataset.
my_df.info()
Now let's look at the counts of the different sentiment labels.
my_df.sentiment.value_counts()
As we can see above, there are 1198 positive reviews and 403 negative reviews. The number of sentiment labels and the number of reviews are not equal, which means there are some null values in the sentiment column. Now let's check for the null values and drop them, along with any duplicates.
my_df.isnull().sum()
print("Original:", my_df.shape)

my_dd = my_df.drop_duplicates()
dd = my_dd.reset_index(drop=True)
print("Drop Duplicates:", dd.shape)

dd_dn = dd.dropna()
df = dd_dn.reset_index(drop=True)
print("Drop Nulls:", df.shape)
Preprocessing: Tokenization
We all know that text is the most unstructured type of data, and we need to do a lot of cleaning before running it through ML models. This also helps us get more accurate predictions.
For example, we'll tokenize one review here.
df.sentence[1]
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(df.sentence[1])
print(tokens)
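As a small aside, the same punkt models also support sentence-level splitting should you need it (this is optional and not used later in the tutorial):
from nltk.tokenize import sent_tokenize

# Split the same review into sentences rather than words
print(sent_tokenize(df.sentence[1]))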
Now let us define our own analyzer function for the text reviews.
import string
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # WordNetLemmatizer needs the WordNet data
lm = WordNetLemmatizer()

def own_analyser(phrase):
    # Split on whitespace, drop standalone punctuation, lemmatize and lower-case the rest
    phrase = phrase.split()
    for i in range(0, len(phrase)):
        k = phrase.pop(0)
        if k not in string.punctuation:
            phrase.append(lm.lemmatize(k).lower())
    return phrase
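A quick sanity check on a made-up phrase (not from the dataset) shows what the analyzer returns:
# Tokens are lemmatized and lower-cased; standalone punctuation is dropped
print(own_analyser("The dishes were amazing , the waiters were not !"))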
Stopwords
Stopwords are common words that occur very frequently in sentences. To learn why stopwords need to be removed, read this.
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = stopwords.words('english')
print([i for i in tokens if i not in stop_words])
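If you want to keep the filtered tokens for later steps, you can assign the result; lower-casing before the comparison is a small tweak worth noting, since NLTK's stopword list is all lower-case:
# Keep the filtered tokens; compare in lower case so "The" is also caught
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]
print(filtered_tokens)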
Later we will select the inputs and outputs for our ML models.
Preprocessing: Normalisation
In reviews there are words which look different but have the same meaning. These kinds of words need to be processed, and the normalization process ensures they are treated as the same.
The following normalization steps are applied.
1. Casing the characters
Converting all characters to the same case so that they are treated the same.
df.sentence[2]
lower_case = df.sentence[2].lower()
lower_case
2. Negation Handling
In natural language, negations are often contracted with an apostrophe, e.g. aren't for are not. To make the computer treat both forms the same, we preprocess the words using a lexicon of contractions.
appos = {
    "aren't": "are not", "can't": "cannot", "couldn't": "could not", "didn't": "did not",
    "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not",
    "haven't": "have not", "he'd": "he would", "he'll": "he will", "he's": "he is",
    "i'd": "I would", "i'll": "I will", "i'm": "I am", "isn't": "is not",
    "it's": "it is", "it'll": "it will", "i've": "I have", "let's": "let us",
    "mightn't": "might not", "mustn't": "must not", "shan't": "shall not",
    "she'd": "she would", "she'll": "she will", "she's": "she is",
    "shouldn't": "should not", "that's": "that is", "there's": "there is",
    "they'd": "they would", "they'll": "they will", "they're": "they are",
    "they've": "they have", "we'd": "we would", "we're": "we are", "weren't": "were not",
    "we've": "we have", "what'll": "what will", "what're": "what are", "what's": "what is",
    "what've": "what have", "where's": "where is", "who'd": "who would",
    "who'll": "who will", "who're": "who are", "who's": "who is", "who've": "who have",
    "won't": "will not", "wouldn't": "would not", "you'd": "you would",
    "you'll": "you will", "you're": "you are", "you've": "you have",
    "'re": " are", "wasn't": "was not", "we'll": "we will"
}
words = lower_case.split()
reformed = [appos[word] if word in appos else word for word in words]
reformed = " ".join(reformed)
reformed
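Since the same replacement is reused later in the cleaning function, it can be handy to wrap it in a small helper (the function name here is our own):
def expand_contractions(text, lexicon=appos):
    # Lower-case, then replace contractions such as "aren't" with "are not"
    words = text.lower().split()
    reformed = [lexicon[word] if word in lexicon else word for word in words]
    return " ".join(reformed)

expand_contractions("The staff weren't friendly")   # -> 'the staff were not friendly'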
3. Removing punctuation and numbers
In this step we will remove punctuation, special characters and numerical characters, as they don't contribute to the sentiment.
tokens
words = [word for word in tokens if word.isalpha()]
words
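An alternative that is not used in the rest of this tutorial is to strip unwanted characters from the raw string with a regular expression before tokenizing, for example:
import re

# Keep only letters, apostrophes and whitespace, then collapse repeated spaces
cleaned = re.sub(r"[^a-zA-Z'\s]", " ", lower_case)
cleaned = re.sub(r"\s+", " ", cleaned).strip()
cleaned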
4. Lemmatization
This process maps different forms of a word to a single base form with the same meaning, e.g. is, am, are, being -> be.
The function used below relies on the English lemmatizer from the pattern library to extract lemmas. Words in a text are identified through word-category disambiguation, where both the definition and the context of a word are taken into account to assign its specific POS tag.
Let's lemmatize a couple of example reviews.
my_df.sentence[24]
# gensim's lemmatize wraps the external 'pattern' package (it was removed in gensim 4.x)
from gensim.utils import lemmatize

lemm = lemmatize(my_df.sentence[24])
lemm
my_df.sentence[17]
lemmatize(my_df.sentence[17])
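Note that gensim.utils.lemmatize depends on the external pattern package and was removed in gensim 4.x. If it is not available in your environment, a rough equivalent can be built with NLTK's POS tagger and WordNetLemmatizer; this is a hedged sketch that only loosely mimics gensim's word/TAG output, not an exact replacement:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()

def wordnet_lemmas(text):
    # Return 'lemma/TAG' strings, loosely mimicking gensim's lemmatize output
    tag_map = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}
    pairs = []
    for word, tag in pos_tag(word_tokenize(text.lower())):
        if word.isalpha():                        # keep only alphabetic tokens
            wn_pos = tag_map.get(tag[0], wordnet.NOUN)
            pairs.append(lm.lemmatize(word, wn_pos) + '/' + tag[:2])
    return pairs

wordnet_lemmas(my_df.sentence[24])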
Preprocessing: Substitution
Here we remove noise from the raw text. For example, it may contain HTML or XML tags, since it was extracted from the web. These can be removed with regular expressions, as shown below.
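For example, a simple regular expression can strip anything that looks like a tag (a minimal sketch; heavily nested or malformed HTML is better handled by a dedicated parser):
import re

def strip_tags(text):
    # Remove anything that looks like an HTML/XML tag, e.g. <br/> or <b>...</b>
    return re.sub(r"<[^>]+>", " ", text)

strip_tags("Great pizza!<br/>Terrible service.")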
Decoding
UTF-8 Character Bytes
- 1 byte: standard ASCII
- 2 bytes: Arabic, Hebrew, and most European scripts
- 3 bytes: the rest of the Basic Multilingual Plane (BMP)
- 4 bytes: all remaining Unicode characters (supplementary planes)
# Note: .decode() assumes a byte string (Python 2); in Python 3 this column already holds str values
my_df.sentence[24].decode("utf-8-sig")
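Because .decode() only exists on byte strings, in Python 3 decoding is normally done once when the raw file is read. A minimal sketch, reusing the xml_path defined earlier:
# In Python 3, decode the raw bytes once when reading the file
with open(xml_path, 'rb') as f:
    raw_bytes = f.read()
text = raw_bytes.decode('utf-8-sig')   # 'utf-8-sig' also strips a UTF-8 BOM if present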
import time
from tqdm import tqdm

def cleaning_function(tips):
    all_ = []
    for tip in tqdm(tips):
        time.sleep(0.0001)
        # Decoding function (assumes byte strings; see the note above)
        decode = tip.decode("utf-8-sig")
        # Lowercasing before negation
        lower_case = decode.lower()
        # Replace apostrophes with words
        words = lower_case.split()
        split = [appos[word] if word in appos else word for word in words]
        reformed = " ".join(split)
        # Lemmatization (on the text with contractions expanded)
        lemm = lemmatize(reformed)
        all_.append(lemm)
    return all_

def separate_word_tag(df_lem_test):
    words = []
    types = []
    df_out = pd.DataFrame()
    for row in df_lem_test:
        sent = []
        type_ = []
        for word in row:
            split = word.split('/')
            sent.append(split[0])
            type_.append(split[1])
        words.append(' '.join(word for word in sent))
        types.append(' '.join(word for word in type_))
    df_out['lem_words'] = words
    df_out['lem_tag'] = types
    return df_out
Clean the training data.
word_tag = cleaning_function(df.sentence)
lemm_df = separate_word_tag(word_tag)

# concat cleaned text with original
my_df_training = pd.concat([my_df, lemm_df], axis=1)
my_df_training['word_tags'] = word_tag
my_df_training.head()
Now let us check for null and empty values.
my_df_training = my_df_training.reset_index(drop=True)

# check null values
my_df_training.isnull().sum()
# empty values
my_df_training[my_df_training['lem_words'] == '']
# drop these rows
print(my_df_training.shape)
my_df_training = my_df_training.drop([475, 648, 720])
my_df_training = my_df_training.reset_index(drop=True)
my_df_training.shape
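At this point it can also be convenient to save the cleaned training set, mirroring the CSV export we do for the prediction data below (the file name here is just an example):
# Save the cleaned training data for later modelling steps (example path)
my_df_training.to_csv('./restaurants_train_clean.csv', header=True, index=False, encoding='UTF8')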
Clean the prediction data.
# load the data
fs = pd.read_csv('./foursquare/foursquare_csv/londonvenues.csv')

# use cleaning functions on the tips
word_tag_fs = cleaning_function(fs.tips)
lemm_fs = separate_word_tag(word_tag_fs)

# concat cleaned text with original
df_fs_predict = pd.concat([fs, lemm_fs], axis=1)
df_fs_predict['word_tags'] = word_tag_fs

# separate the long lat
lng = []
lat = []
for ll in df_fs_predict['ll']:
    lnglat = ll.split(',')
    lng.append(lnglat[0])
    lat.append(lnglat[1])
df_fs_predict['lng'] = lng
df_fs_predict['lat'] = lat

# drop the ll column
df_fs_predict = df_fs_predict.drop(['ll'], axis=1)
df_fs_predict.head()
# save clean foursquare to csv
df_fs_predict.to_csv('./foursquare/foursquare_csv/foursquare_clean.csv', header=True, index=False, encoding='UTF8')