Introduction
Most machine learning algorithms, including popular ones like linear regression, decision trees, and neural networks, require numeric input features. These algorithms perform mathematical operations on the data, such as addition, multiplication, and comparison, which can only be applied to numeric values. Therefore, text data must be converted into numeric form to use these algorithms effectively.
In this post, we look at how to convert a number written out in words into an actual numeric value.
For example, suppose we are given a DataFrame with a column in which each number is written as text, such as “zero point five nine three four six zero two four”, which represents 0.59346024. To use this column as a numerical feature, we need to convert each string into a number.
We can do this with Python and the pandas library.
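To make the problem concrete, here is a minimal sketch of what such a column looks like before any conversion. The column name and values are made up for illustration. It shows that pandas does not treat the column as numeric, and that asking pandas to parse the words directly only produces missing values.

```python
import pandas as pd

# Hypothetical example data: numbers spelled out word by word
df = pd.DataFrame({
    'text_column': [
        'zero point five nine three four six zero two four',
        'two point three',
    ]
})

# The column holds strings, so pandas does not consider it numeric
print(pd.api.types.is_numeric_dtype(df['text_column']))  # False

# Parsing the words directly fails; errors='coerce' turns each failure into NaN
direct = pd.to_numeric(df['text_column'], errors='coerce')
print(direct.isna().all())  # True
```

This is why the words have to be mapped to digits first.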
Create a dictionary to map text to numeric representations
# Define a dictionary to map text representations to numeric values
text_to_numeric = {
'zero': 0,
'one': 1,
'two': 2,
'three': 3,
'four': 4,
'five': 5,
'six': 6,
'seven': 7,
'eight': 8,
'nine': 9,
'ten': 10,
'eleven': 11,
'twelve': 12,
'thirteen': 13,
'fourteen': 14,
'fifteen': 15,
'sixteen': 16,
'seventeen': 17,
'eighteen': 18,
'nineteen': 19,
'twenty': 20,
'thirty': 30,
'forty': 40,
'fifty': 50,
'sixty': 60,
'seventy': 70,
'eighty': 80,
'ninety': 90,
'hundred': 100,
'thousand': 1000,
'million': 1000000,
'point': '.',  # decimal separator, so "zero point five" becomes "0.5"
}
# Use a lambda function to apply the conversion to each cell in the specified column
df[column_name] = df[column_name].apply(lambda x: ''.join(str(text_to_numeric[word]) if word in text_to_numeric else word for word in x.split()))
# Convert the column to a numeric type
df[column_name] = pd.to_numeric(df[column_name], errors='coerce')
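To see what the lambda does to a single cell, the steps can be traced on one string. This is only a sketch: the name word_map is made up for the illustration, and it assumes the word "point" is mapped to a decimal point so that the joined result is a valid number.

```python
# Assumed mapping for this sketch: single digits plus 'point' as the decimal separator
word_map = {
    'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
    'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
    'point': '.',
}

text = 'zero point five nine three four six zero two four'

words = text.split()                           # split the cell into words
pieces = [word_map.get(w, w) for w in words]   # replace known words, keep the rest
joined = ''.join(pieces)                       # glue the pieces back together
value = float(joined)                          # parse the resulting string

print(words)
print(pieces)
print(joined)   # 0.59346024
print(value)    # 0.59346024
```

Any word that is not in the mapping stays in the string as text, which is why an unmapped word makes the final parse fail.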
Now let's wrap the code above in a reusable function:
import pandas as pd
def convert_text_to_numeric(df, column_name):
    # Define a dictionary to map text representations to numeric values
    text_to_numeric = {
        'zero': 0,
        'one': 1,
        'two': 2,
        'three': 3,
        'four': 4,
        'five': 5,
        'six': 6,
        'seven': 7,
        'eight': 8,
        'nine': 9,
        'ten': 10,
        'eleven': 11,
        'twelve': 12,
        'thirteen': 13,
        'fourteen': 14,
        'fifteen': 15,
        'sixteen': 16,
        'seventeen': 17,
        'eighteen': 18,
        'nineteen': 19,
        'twenty': 20,
        'thirty': 30,
        'forty': 40,
        'fifty': 50,
        'sixty': 60,
        'seventy': 70,
        'eighty': 80,
        'ninety': 90,
        'hundred': 100,
        'thousand': 1000,
        'million': 1000000,
        'point': '.',  # decimal separator, so "zero point five" becomes "0.5"
    }
    # Use a lambda function to apply the conversion to each cell in the specified column
    df[column_name] = df[column_name].apply(lambda x: ''.join(str(text_to_numeric[word]) if word in text_to_numeric else word for word in x.split()))
    # Convert the column to a numeric type (values that cannot be parsed become NaN)
    df[column_name] = pd.to_numeric(df[column_name], errors='coerce')
    return df
# Example usage:
data = {'text_column': ["zero point five nine three four six zero two four", "two point three", "four point zero"]}
df = pd.DataFrame(data)
# Convert the specified column in the DataFrame
df = convert_text_to_numeric(df, 'text_column')
# Print the resulting DataFrame
print(df)
This function first defines a dictionary (text_to_numeric) to map words to their numeric equivalents, then uses a lambda function with apply to replace the words in the specified column, and finally converts the column to a numeric type using pd.to_numeric. The result is a DataFrame with the converted values in the specified column.
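One thing to keep in mind is that the replaced values are concatenated as strings, so the approach suits numbers spelled out digit by digit. A compound such as "twenty one" would come out as 201 rather than 21. If a column might contain such values, one possible extension is a small parser that accumulates the words instead of concatenating them. The helper below, parse_number_words, is a hypothetical sketch and not part of the function above. It assumes lowercase English number words separated by spaces, with digits after "point" read one at a time.

```python
# Hypothetical helper: accumulates number words instead of concatenating them
SMALL = {
    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10,
    'eleven': 11, 'twelve': 12, 'thirteen': 13, 'fourteen': 14,
    'fifteen': 15, 'sixteen': 16, 'seventeen': 17, 'eighteen': 18,
    'nineteen': 19, 'twenty': 20, 'thirty': 30, 'forty': 40,
    'fifty': 50, 'sixty': 60, 'seventy': 70, 'eighty': 80, 'ninety': 90,
}
SCALES = {'thousand': 1000, 'million': 1000000}


def parse_number_words(text):
    """Parse text such as 'two hundred forty five point three' into a float."""
    words = text.lower().split()
    if not words:
        raise ValueError('empty input')

    # Separate the integer part from the digits after 'point'
    if 'point' in words:
        split_at = words.index('point')
        int_words, frac_words = words[:split_at], words[split_at + 1:]
    else:
        int_words, frac_words = words, []

    total = 0    # completed thousand/million groups
    current = 0  # group currently being accumulated
    for word in int_words:
        if word in SMALL:
            current += SMALL[word]
        elif word == 'hundred':
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f'unrecognised number word: {word!r}')

    # Digits after 'point' are read one at a time
    frac_digits = ''
    for word in frac_words:
        if SMALL.get(word, 10) > 9:
            raise ValueError(f'expected a single digit after "point": {word!r}')
        frac_digits += str(SMALL[word])

    return float(f'{total + current}.{frac_digits or 0}')


print(parse_number_words('twenty one'))                          # 21.0
print(parse_number_words('two hundred forty five point three'))  # 245.3
```

It could be applied to a column with df[column_name].apply(parse_number_words). Unlike errors='coerce', this sketch raises a ValueError on words it does not recognise, so unexpected input is surfaced instead of silently becoming NaN.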