How to evaluate readability using Python?

Introduction

How can you evaluate “readability”?

Textstat is an easy-to-use Python library for calculating statistics from text. It helps determine readability, complexity, and grade level.

For example, the Flesch Reading Ease score indicates how difficult an English passage is to understand. The score bands are described below.

Score	Difficulty
90-100	Very Easy
80-89	Easy
70-79	Fairly Easy
60-69	Standard
50-59	Fairly Difficult
30-49	Difficult
0-29	Very Confusing

To get that score for a text, call textstat:

textstat.flesch_reading_ease(text)

A returned value of 45.8 falls in the Difficult band, and 77.03 falls in the Fairly Easy band.
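The lookup in the table can be written as a small helper. This is only a sketch: the function name `flesch_difficulty` is made up for illustration and is not part of textstat. It also assumes that scores above 100 count as Very Easy, scores below 0 count as Very Confusing, and fractional scores between two bands (such as 89.5) belong to the lower band.

```python
def flesch_difficulty(score):
    """Map a Flesch Reading Ease score to the difficulty label in the table.

    Hypothetical helper, not a textstat function.
    """
    # (lower bound of the band, label), checked from easiest to hardest
    bands = [
        (90, "Very Easy"),
        (80, "Easy"),
        (70, "Fairly Easy"),
        (60, "Standard"),
        (50, "Fairly Difficult"),
        (30, "Difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    # Anything below 30, including negative scores
    return "Very Confusing"


print(flesch_difficulty(45.8))   # Difficult
print(flesch_difficulty(77.03))  # Fairly Easy
```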

In addition, textstat can calculate other readability scores:

The Flesch Reading Ease formula
The Flesch-Kincaid Grade Level
The Fog Scale (Gunning FOG Formula)
The SMOG Index
Automated Readability Index
The Coleman-Liau Index
Linsear Write Formula
Dale-Chall Readability Score
Readability Consensus based upon all the above tests

You can also use textstat to count the number of:

Syllables
Words (lexicon count)
Sentences

Load data

import numpy as np
import pandas as pd

train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
sub = pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')
train

id	url_legal	license	excerpt	target	standard_error
0	c12129c31	NaN	NaN	When the young people returned to the ballroom…	-0.340259	0.464009
1	85aa80a4c	NaN	NaN	All through dinner time, Mrs. Fayre was somewh…	-0.315372	0.480805
2	b69ac6792	NaN	NaN	As Roger had predicted, the snow departed as q…	-0.580118	0.476676
3	dd1000b26	NaN	NaN	And outside before the palace a great garden w…	-1.054013	0.450007
4	37c1b32fb	NaN	NaN	Once upon a time there were Three Bears who li…	0.247197	0.510845
…	…	…	…	…	…	…
2829	25ca8f498	https://sites.ehe.osu.edu/beyondpenguins/files…	CC BY-SA 3.0	When you think of dinosaurs and where they liv…	1.711390	0.646900
2830	2c26db523	https://en.wikibooks.org/wiki/Wikijunior:The_E…	CC BY-SA 3.0	So what is a solid? Solids are usually hard be…	0.189476	0.535648
2831	cd19e2350	https://en.wikibooks.org/wiki/Wikijunior:The_E…	CC BY-SA 3.0	The second state of matter we will discuss is …	0.255209	0.483866
2832	15e2e9e7a	https://en.wikibooks.org/wiki/Geometry_for_Ele…	CC BY-SA 3.0	Solids are shapes that you can actually touch….	-0.215279	0.514128
2833	5b990ba77	https://en.wikibooks.org/wiki/Wikijunior:Biolo…	CC BY-SA 3.0	Animals are made of many cells. They eat thing…	0.300779	0.512379

2834 rows × 6 columns

# The row with the min target value
train_min = train.loc[train['target'].idxmin()]
excerpt_min = train_min['excerpt']
print(train_min)
print()
print(excerpt_min)

# The row with the max target value
train_max = train.loc[train['target'].idxmax()]
excerpt_max = train_max['excerpt']
print(train_max)
print()
print(excerpt_max)

Textstat

Textstat is an easy-to-use library for calculating statistics from text.
It helps determine readability, complexity, and grade level.

We will use the following functions:

textstat.flesch_reading_ease(test_data)
textstat.smog_index(test_data)
textstat.flesch_kincaid_grade(test_data)
textstat.coleman_liau_index(test_data)
textstat.automated_readability_index(test_data)
textstat.dale_chall_readability_score(test_data)
textstat.difficult_words(test_data)
textstat.linsear_write_formula(test_data)
textstat.gunning_fog(test_data)
textstat.text_standard(test_data)

The following functions are designed specifically for Spanish.
They can be used on non-Spanish texts, although that use is not recommended.

textstat.fernandez_huerta(test_data)
textstat.szigriszt_pazos(test_data)
textstat.gutierrez_polini(test_data)
textstat.crawford(test_data)

!pip install textstat

import textstat

List of Functions

syllable_count

Returns the number of syllables present in the given text.
Uses the Python module Pyphen for syllable calculation.

print(textstat.syllable_count(excerpt_min))
print(textstat.syllable_count(excerpt_max))

285
201

lexicon_count

Calculates the number of words present in the text. The optional removepunct parameter specifies whether punctuation is stripped before counting. The default value is True, which removes punctuation before counting lexicon items.

print(textstat.lexicon_count(excerpt_min, removepunct=True))
print(textstat.lexicon_count(excerpt_max, removepunct=True))

177
145

Sentence Count

Returns the number of sentences present in the given text.

print(textstat.sentence_count(excerpt_min))
print(textstat.sentence_count(excerpt_max))

7
13

Flesch Reading Ease

Returns the Flesch Reading Ease Score.
The following table can be helpful to assess the ease of readability of a document.
The bands in the table are indicative only. While the maximum score is 121.22, there is no lower limit, so a negative score is valid.

Score	Difficulty
90-100	Very Easy
80-89	Easy
70-79	Fairly Easy
60-69	Standard
50-59	Fairly Difficult
30-49	Difficult
0-29	Very Confusing

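Formula:

$\text{FRE} = 206.835 - 1.015 \left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6 \left(\frac{\text{total syllables}}{\text{total words}}\right)$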

print(textstat.flesch_reading_ease(excerpt_min))
print(textstat.flesch_reading_ease(excerpt_max))

45.8
77.03
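As a cross-check, the standard Flesch Reading Ease formula can be applied by hand to the syllable, word, and sentence counts printed earlier. The sketch below uses a made-up function name, `flesch_reading_ease_from_counts`, and a hypothetical `round_averages` switch. Rounding the two averages to one decimal before applying the formula reproduces the 45.8 and 77.03 shown above; that rounding is inferred from the numbers and is an assumption about the textstat version used, not something this post documents.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables, round_averages=False):
    """Flesch Reading Ease: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).

    Hypothetical helper, not a textstat function.
    """
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    if round_averages:
        # Assumption: round both averages to one decimal before applying the formula
        avg_sentence_length = round(avg_sentence_length, 1)
        avg_syllables_per_word = round(avg_syllables_per_word, 1)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Counts printed by lexicon_count, sentence_count and syllable_count above
excerpt_min_counts = {"words": 177, "sentences": 7, "syllables": 285}
excerpt_max_counts = {"words": 145, "sentences": 13, "syllables": 201}

for counts in (excerpt_min_counts, excerpt_max_counts):
    exact = flesch_reading_ease_from_counts(**counts)
    rounded = flesch_reading_ease_from_counts(**counts, round_averages=True)
    print(round(exact, 2), round(rounded, 2))
```

Without the intermediate rounding the scores come out about one point away from the textstat output, which is a reminder that small implementation details shift readability scores slightly.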

The Flesch-Kincaid Grade Level

Returns the Flesch-Kincaid Grade of the given text.
This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.

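Formula:

$\text{FKGL} = 0.39 \left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8 \left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$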

print(textstat.flesch_kincaid_grade(excerpt_min))
print(textstat.flesch_kincaid_grade(excerpt_max))

13.2
5.3
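The Flesch-Kincaid grade uses the same two averages as Flesch Reading Ease, so it can also be checked against the counts printed earlier. This is a sketch with a made-up function name, `flesch_kincaid_grade_from_counts`. As an assumption inferred from the outputs above, both averages are rounded to one decimal before the standard coefficients are applied.

```python
def flesch_kincaid_grade_from_counts(words, sentences, syllables):
    """Flesch-Kincaid grade: 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.

    Hypothetical helper, not a textstat function.
    """
    # Assumption: averages rounded to one decimal, which matches the outputs in this post
    avg_sentence_length = round(words / sentences, 1)
    avg_syllables_per_word = round(syllables / words, 1)
    return 0.39 * avg_sentence_length + 11.8 * avg_syllables_per_word - 15.59


# Counts printed earlier for the hardest and easiest excerpts
print(round(flesch_kincaid_grade_from_counts(177, 7, 285), 1))   # 13.2
print(round(flesch_kincaid_grade_from_counts(145, 13, 201), 1))  # 5.3
```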

Gunning fog index

Returns the FOG index of the given text. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.

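Formula:

$\text{Fog} = 0.4 \left[\left(\frac{\text{words}}{\text{sentences}}\right) + 100 \left(\frac{\text{complex words}}{\text{words}}\right)\right]$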

# gunning_fog by textstat
print(textstat.gunning_fog(excerpt_min))
print(textstat.gunning_fog(excerpt_max))

14.64
6.69
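The standard Gunning fog formula is 0.4 times the sum of the average sentence length and the percentage of complex words. The sketch below applies it to hypothetical counts, because the number of complex words in the two excerpts is not shown in this post. The function name `gunning_fog_from_counts` is invented for illustration, and textstat may decide which words are "complex" differently from a hand count.

```python
def gunning_fog_from_counts(words, sentences, complex_words):
    """Gunning fog: 0.4 * ((words / sentences) + 100 * (complex_words / words)).

    Hypothetical helper, not a textstat function.
    """
    avg_sentence_length = words / sentences
    percent_complex = 100 * complex_words / words
    return 0.4 * (avg_sentence_length + percent_complex)


# Hypothetical sample: 100 words, 5 sentences, 10 complex words
print(gunning_fog_from_counts(words=100, sentences=5, complex_words=10))
```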

SMOG

Returns the SMOG index of the given text.
This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
Texts of fewer than 30 sentences are statistically invalid because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result.

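To calculate the SMOG grade:

Count a number of sentences (at least 30).
In those sentences, count the polysyllables (words of 3 or more syllables).
Calculate $\text{SMOG} = 1.0430 \sqrt{\text{polysyllables} \times \frac{30}{\text{sentences}}} + 3.1291$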

# smog_index by textstat
print(textstat.smog_index(excerpt_min))
print(textstat.smog_index(excerpt_max))

14.6
8.4
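The SMOG steps can be sketched in a few lines using the standard formula, 1.0430 times the square root of (polysyllables × 30 / sentences), plus 3.1291. The function name `smog_from_counts` and the sample counts are hypothetical, since the polysyllable counts of the excerpts are not shown in this post.

```python
import math


def smog_from_counts(polysyllables, sentences):
    """SMOG grade: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291.

    Hypothetical helper, not a textstat function.
    """
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


# Hypothetical sample: 30 sentences containing 30 polysyllabic words
print(round(smog_from_counts(polysyllables=30, sentences=30), 2))
```

The factor 30 / sentences scales the polysyllable count to the 30-sentence sample the formula was normed on, which is why shorter texts give less reliable results.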

Automated readability index

Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text.

For example, if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade.

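Formula:

$\text{ARI} = 4.71 \left(\frac{\text{characters}}{\text{words}}\right) + 0.5 \left(\frac{\text{words}}{\text{sentences}}\right) - 21.43$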

print(textstat.automated_readability_index(excerpt_min))
print(textstat.automated_readability_index(excerpt_max))

15.0
8.8
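A short sketch of the standard ARI formula, 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, together with the "6.5 means 6th to 7th grade" reading described above. Both function names and the sample counts are hypothetical and are not taken from textstat or from the excerpts.

```python
import math


def ari_from_counts(characters, words, sentences):
    """ARI: 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.

    Hypothetical helper, not a textstat function.
    """
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def ari_grade_range(score):
    """Return the (lower, upper) grade levels that bracket an ARI score."""
    return math.floor(score), math.ceil(score)


# Hypothetical sample: 450 characters, 100 words, 5 sentences
print(round(ari_from_counts(characters=450, words=100, sentences=5), 3))
print(ari_grade_range(6.5))  # (6, 7)
```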

The Coleman-Liau Index

Returns the grade level of the text using the Coleman-Liau Formula. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.

$\text{CLI} = 0.0588L - 0.296S - 15.8$

$L$: the average number of letters per 100 words
$S$: the average number of sentences per 100 words

print(textstat.coleman_liau_index(excerpt_min))
print(textstat.coleman_liau_index(excerpt_max))

11.44
10.88
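The Coleman-Liau formula above needs only letter, word, and sentence counts. The sketch below converts raw counts to the per-100-word averages L and S and applies the formula. The function name `coleman_liau_from_counts` and the sample counts are hypothetical, because the letter counts of the excerpts are not shown in this post.

```python
def coleman_liau_from_counts(letters, words, sentences):
    """Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8.

    L is letters per 100 words, S is sentences per 100 words.
    Hypothetical helper, not a textstat function.
    """
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Hypothetical sample: 450 letters, 100 words, 5 sentences
print(round(coleman_liau_from_counts(letters=450, words=100, sentences=5), 2))
```

Because it counts letters rather than syllables, this index avoids syllable estimation entirely, which may explain why it can disagree with the syllable-based scores.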

Linsear Write Formula

Returns the grade level using the Linsear Write Formula. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.

Formula:
The standard Linsear Write metric $Lw$ runs on a 100-word sample:

For each “easy word”, defined as words with 2 syllables or less, add 1 point.
For each “hard word”, defined as words with 3 syllables or more, add 3 points.
Divide the points by the number of sentences in the 100-word sample.
Adjust the provisional result r:

If $r > 20$, $Lw = r/2$
If $r \le 20$, $Lw = r/2 - 1$

The result is a “grade level” measure, reflecting the estimated years of education needed to read the text fluently.

print(textstat.linsear_write_formula(excerpt_min))
print(textstat.linsear_write_formula(excerpt_max))

16.75
5.222222222222222
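The Linsear Write steps listed above translate directly into code. The sketch below follows those steps exactly as written in this post. The function name `linsear_write_from_counts` and the sample counts are hypothetical, and textstat's own implementation may differ in detail.

```python
def linsear_write_from_counts(easy_words, hard_words, sentences):
    """Linsear Write grade for a 100-word sample, following the steps in this post.

    easy_words: words with 2 syllables or fewer (1 point each)
    hard_words: words with 3 syllables or more (3 points each)
    Hypothetical helper, not a textstat function.
    """
    points = easy_words * 1 + hard_words * 3
    provisional = points / sentences  # the provisional result r
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1


# Hypothetical 100-word samples
print(linsear_write_from_counts(easy_words=90, hard_words=10, sentences=5))   # 12.0
print(linsear_write_from_counts(easy_words=95, hard_words=5, sentences=10))   # 4.5
```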

Dale–Chall readability formula

Unlike the other tests, this one uses a lookup table of the 3,000 most commonly used English words. It returns the grade level using the New Dale-Chall Formula.

Score	Understood by
4.9 or lower	average 4th-grade student or lower
5.0–5.9	average 5th or 6th-grade student
6.0–6.9	average 7th or 8th-grade student
7.0–7.9	average 9th or 10th-grade student
8.0–8.9	average 11th or 12th-grade student
9.0–9.9	average 13th to 15th-grade (college) student

print(textstat.dale_chall_readability_score(excerpt_min))
print(textstat.dale_chall_readability_score(excerpt_max))

8.19
6.37
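The score table above can be turned into a lookup so that raw scores are reported with their audience. This is a sketch with a made-up function name, `dale_chall_audience`. It assumes scores between two rows (such as 4.95) belong to the lower band, and it returns None for scores of 10 or more because the table in this post stops at 9.9.

```python
def dale_chall_audience(score):
    """Map a Dale-Chall score to the audience in the table from this post.

    Hypothetical helper, not a textstat function.
    """
    # (exclusive upper bound of the band, audience)
    bands = [
        (5.0, "average 4th-grade student or lower"),
        (6.0, "average 5th or 6th-grade student"),
        (7.0, "average 7th or 8th-grade student"),
        (8.0, "average 9th or 10th-grade student"),
        (9.0, "average 11th or 12th-grade student"),
        (10.0, "average 13th to 15th-grade (college) student"),
    ]
    for upper_bound, audience in bands:
        if score < upper_bound:
            return audience
    return None  # outside the range covered by the table


print(dale_chall_audience(8.19))  # average 11th or 12th-grade student
print(dale_chall_audience(6.37))  # average 7th or 8th-grade student
```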

# difficult_words
print(textstat.difficult_words(excerpt_min))
print(textstat.difficult_words(excerpt_max))

37
20

Readability Consensus based upon all the above tests

Based upon all the above tests, returns the estimated school grade level required to understand the text. The optional float_output parameter returns the score as a float instead of a string; it defaults to False.

print(textstat.text_standard(excerpt_min))
print(textstat.text_standard(excerpt_max))

14th and 15th grade
6th and 7th grade
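One simple way to form such a consensus is to let each grade-level score vote for its rounded and rounded-up grade, then take the most common vote. The sketch below is an illustration of that idea only and is not presented as textstat's internal logic. The function name `consensus_grade` is made up, Flesch Reading Ease is left out because it is not on a grade scale, and ties are resolved by whichever grade was voted for first.

```python
import math
from collections import Counter


def consensus_grade(grade_scores):
    """Return the most frequently voted grade among the given scores.

    Each score votes twice: once for round(score) and once for ceil(score).
    Hypothetical helper, not a textstat function.
    """
    votes = []
    for score in grade_scores:
        votes.append(round(score))
        votes.append(math.ceil(score))
    # most_common(1) returns the first-seen grade when counts are tied
    return Counter(votes).most_common(1)[0][0]


# Grade-style scores printed above for the hardest excerpt (excerpt_min)
scores_min = [13.2, 14.64, 14.6, 15.0, 11.44, 16.75, 8.19]
print(consensus_grade(scores_min))  # 15
```

For the hardest excerpt this vote lands on grade 15, the upper end of the "14th and 15th grade" range reported above.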

This notebook was provided on Kaggle by Shoku Pan.
