Introduction
How can you evaluate “readability”?
Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.
For example, the Flesch Reading Ease score indicates how difficult an English passage is to understand. The score ranges are described below.
Score | Difficulty |
---|---|
90-100 | Very Easy |
80-89 | Easy |
70-79 | Fairly Easy |
60-69 | Standard |
50-59 | Fairly Difficult |
30-49 | Difficult |
0-29 | Very Confusing |
You can calculate this score with textstat:
textstat.flesch_reading_ease(text)
A returned value of 45.8 means the text is Difficult, and a value of 77.03 means it is Fairly Easy.
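To make the table lookup concrete, here is a minimal sketch that maps a score to its difficulty label. The function name `flesch_difficulty` is hypothetical and not part of textstat. Treating scores above 100 as "Very Easy" and negative scores as "Very Confusing" is an assumption, since the table does not cover those values.

```python
def flesch_difficulty(score: float) -> str:
    """Map a Flesch Reading Ease score to the label from the table above.

    Hypothetical helper, not a textstat function. Scores above 100 fall
    into "Very Easy" and scores below 0 into "Very Confusing" (assumption).
    """
    bands = [
        (90, "Very Easy"),
        (80, "Easy"),
        (70, "Fairly Easy"),
        (60, "Standard"),
        (50, "Fairly Difficult"),
        (30, "Difficult"),
    ]
    # Bands are sorted from highest to lowest lower bound.
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Very Confusing"


print(flesch_difficulty(45.8))   # Difficult
print(flesch_difficulty(77.03))  # Fairly Easy
```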
Textstat can calculate the following readability scores:
- The Flesch Reading Ease formula
- The Flesch-Kincaid Grade Level
- The Fog Scale (Gunning FOG Formula)
- The SMOG Index
- Automated Readability Index
- The Coleman-Liau Index
- Linsear Write Formula
- Dale-Chall Readability Score
- Readability Consensus based upon all the above tests
With textstat, you can also count the number of:
- Syllables
- Lexicon items (words)
- Sentences
Load data
import numpy as np
import pandas as pd
train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
sub = pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')
train
 | id | url_legal | license | excerpt | target | standard_error |
---|---|---|---|---|---|---|
0 | c12129c31 | NaN | NaN | When the young people returned to the ballroom… | -0.340259 | 0.464009 |
1 | 85aa80a4c | NaN | NaN | All through dinner time, Mrs. Fayre was somewh… | -0.315372 | 0.480805 |
2 | b69ac6792 | NaN | NaN | As Roger had predicted, the snow departed as q… | -0.580118 | 0.476676 |
3 | dd1000b26 | NaN | NaN | And outside before the palace a great garden w… | -1.054013 | 0.450007 |
4 | 37c1b32fb | NaN | NaN | Once upon a time there were Three Bears who li… | 0.247197 | 0.510845 |
… | … | … | … | … | … | … |
2829 | 25ca8f498 | https://sites.ehe.osu.edu/beyondpenguins/files… | CC BY-SA 3.0 | When you think of dinosaurs and where they liv… | 1.711390 | 0.646900 |
2830 | 2c26db523 | https://en.wikibooks.org/wiki/Wikijunior:The_E… | CC BY-SA 3.0 | So what is a solid? Solids are usually hard be… | 0.189476 | 0.535648 |
2831 | cd19e2350 | https://en.wikibooks.org/wiki/Wikijunior:The_E… | CC BY-SA 3.0 | The second state of matter we will discuss is … | 0.255209 | 0.483866 |
2832 | 15e2e9e7a | https://en.wikibooks.org/wiki/Geometry_for_Ele… | CC BY-SA 3.0 | Solids are shapes that you can actually touch…. | -0.215279 | 0.514128 |
2833 | 5b990ba77 | https://en.wikibooks.org/wiki/Wikijunior:Biolo… | CC BY-SA 3.0 | Animals are made of many cells. They eat thing… | 0.300779 | 0.512379 |
2834 rows × 6 columns
# The row with the min target value
train_min = train.loc[train['target'].idxmin()]
excerpt_min = train_min['excerpt']
print(train_min)
print()
print(excerpt_min)

# The row with the max target value
train_max = train.loc[train['target'].idxmax()]
excerpt_max = train_max['excerpt']
print(train_max)
print()
print(excerpt_max)

Textstat
We will use the following functions:
- textstat.flesch_reading_ease(test_data)
- textstat.smog_index(test_data)
- textstat.flesch_kincaid_grade(test_data)
- textstat.coleman_liau_index(test_data)
- textstat.automated_readability_index(test_data)
- textstat.dale_chall_readability_score(test_data)
- textstat.difficult_words(test_data)
- textstat.linsear_write_formula(test_data)
- textstat.gunning_fog(test_data)
- textstat.text_standard(test_data)
The following functions are designed specifically for Spanish.
They can be used on non-Spanish texts, but that use case is not recommended.
- textstat.fernandez_huerta(test_data)
- textstat.szigriszt_pazos(test_data)
- textstat.gutierrez_polini(test_data)
- textstat.crawford(test_data)
!pip install textstat

import textstat
List of Functions
syllable_count
Returns the number of syllables present in the given text.
Uses the Python module Pyphen for syllable calculation.
print(textstat.syllable_count(excerpt_min))
print(textstat.syllable_count(excerpt_max))
285 201
lexicon_count
Calculates the number of words present in the text. The optional removepunct parameter specifies whether punctuation is removed before counting. The default value is True, which removes punctuation before counting lexicon items.
print(textstat.lexicon_count(excerpt_min, removepunct=True))
print(textstat.lexicon_count(excerpt_max, removepunct=True))
177 145
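To illustrate what removing punctuation changes, here is a rough pure-Python approximation of a word count. It is a sketch under stated assumptions, not textstat's implementation: the helper name `rough_lexicon_count` is hypothetical, and it only strips ASCII punctuation before splitting on whitespace.

```python
import string


def rough_lexicon_count(text: str, removepunct: bool = True) -> int:
    """Approximate word count (hypothetical helper, not textstat's code)."""
    if removepunct:
        # Delete every ASCII punctuation character before splitting.
        text = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace-separated tokens are counted as words.
    return len(text.split())


sample = "Wait - what ? No !"
print(rough_lexicon_count(sample, removepunct=True))   # 3
print(rough_lexicon_count(sample, removepunct=False))  # 6
```

The counts differ only when punctuation stands alone as a token, which is why the two settings often give the same result on ordinary prose.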
sentence_count
Returns the number of sentences present in the given text.
print(textstat.sentence_count(excerpt_min))
print(textstat.sentence_count(excerpt_max))
7 13
Flesch Reading Ease
Returns the Flesch Reading Ease Score.
The following table helps interpret the score.
The ranges are only a guide: the maximum score is 121.22, but there is no lower limit, so a negative score is valid.
Score | Difficulty |
---|---|
90-100 | Very Easy |
80-89 | Easy |
70-79 | Fairly Easy |
60-69 | Standard |
50-59 | Fairly Difficult |
30-49 | Difficult |
0-29 | Very Confusing |
Formula:
FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)
print(textstat.flesch_reading_ease(excerpt_min))
print(textstat.flesch_reading_ease(excerpt_max))
45.8 77.03
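As a sanity check on these numbers, the standard Flesch Reading Ease formula can be sketched as a function of raw counts. The function name `flesch_reading_ease_from_counts` is hypothetical. The inputs are the word, sentence, and syllable counts printed earlier in this article. The results land near textstat's output but are not identical, presumably because textstat applies its own counting and rounding rules.

```python
def flesch_reading_ease_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula (hypothetical helper)."""
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Counts taken from the lexicon, sentence, and syllable outputs above.
print(round(flesch_reading_ease_from_counts(177, 7, 285), 2))   # about 44.95
print(round(flesch_reading_ease_from_counts(145, 13, 201), 2))  # about 78.24

# A one-word, one-syllable sentence gives the maximum score of 121.22.
print(round(flesch_reading_ease_from_counts(1, 1, 1), 2))
```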
The Flesch-Kincaid Grade Level
Returns the Flesch-Kincaid Grade of the given text.
This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
Formula:
Grade = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59
print(textstat.flesch_kincaid_grade(excerpt_min))
print(textstat.flesch_kincaid_grade(excerpt_max))
13.2 5.3
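The standard Flesch-Kincaid Grade Level formula can be sketched the same way from raw counts. The function name `flesch_kincaid_from_counts` is hypothetical, and the inputs are the counts printed earlier in this article. Expect values close to, but not exactly equal to, textstat's output.

```python
def flesch_kincaid_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (hypothetical helper)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Counts taken from the lexicon, sentence, and syllable outputs above.
print(round(flesch_kincaid_from_counts(177, 7, 285), 2))   # about 13.27
print(round(flesch_kincaid_from_counts(145, 13, 201), 2))  # about 5.12
```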
Gunning fog index
Returns the FOG index of the given text. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
Formula:
Fog index = 0.4 × [(total words / total sentences) + 100 × (complex words / total words)]
# gunning_fog by textstat
print(textstat.gunning_fog(excerpt_min))
print(textstat.gunning_fog(excerpt_max))
14.64 6.69
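A minimal sketch of the standard Gunning fog formula follows. The function name `gunning_fog_from_counts` and the sample counts are hypothetical. "Complex words" here means words of three or more syllables, which is the usual definition; textstat's own rule for picking complex words may differ.

```python
def gunning_fog_from_counts(words: int, sentences: int, complex_words: int) -> float:
    """Standard Gunning fog formula (hypothetical helper)."""
    avg_sentence_length = words / sentences
    percent_complex = 100 * complex_words / words
    return 0.4 * (avg_sentence_length + percent_complex)


# Made-up sample: 100 words, 5 sentences, 10 complex words.
print(gunning_fog_from_counts(100, 5, 10))
```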
SMOG
Returns the SMOG index of the given text.
This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
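Texts of fewer than 30 sentences are statistically invalid because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result. Both excerpts used here have fewer than 30 sentences, so treat the values below as rough estimates.

- Count a number of sentences (at least 30).
- In those sentences, count the polysyllables (words of 3 or more syllables).
- Calculate: grade = 1.0430 × √(number of polysyllables × 30 / number of sentences) + 3.1291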
# smog_index by textstat
print(textstat.smog_index(excerpt_min))
print(textstat.smog_index(excerpt_max))
14.6 8.4
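The SMOG steps can be sketched as a small function using the standard formula. The function name `smog_from_counts` and the sample counts are hypothetical, and the sketch does not enforce the 30-sentence recommendation.

```python
import math


def smog_from_counts(polysyllables: int, sentences: int) -> float:
    """Standard SMOG grade formula (hypothetical helper)."""
    # Scale the polysyllable count to a 30-sentence sample.
    scaled = polysyllables * (30 / sentences)
    return 1.0430 * math.sqrt(scaled) + 3.1291


# Made-up sample: 30 polysyllabic words found in 30 sentences.
print(round(smog_from_counts(30, 30), 2))
```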
Automated readability index
Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text.
For example, if the ARI is 6.5, then the grade level to comprehend the text is 6th to 7th grade.
Formula:
ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43
print(textstat.automated_readability_index(excerpt_min))
print(textstat.automated_readability_index(excerpt_max))
15.0 8.8
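Here is a minimal sketch of the standard ARI formula. The function name `ari_from_counts` and the sample counts are hypothetical, and the character count is assumed to include letters and digits only.

```python
def ari_from_counts(characters: int, words: int, sentences: int) -> float:
    """Standard Automated Readability Index formula (hypothetical helper)."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Made-up sample: 450 characters, 100 words, 5 sentences.
print(round(ari_from_counts(450, 100, 5), 3))
```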
The Coleman-Liau Index
Returns the grade level of the text using the Coleman-Liau Formula. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
CLI = 0.0588 × L − 0.296 × S − 15.8
- L: the average number of letters per 100 words
- S: the average number of sentences per 100 words
print(textstat.coleman_liau_index(excerpt_min))
print(textstat.coleman_liau_index(excerpt_max))
11.44 10.88
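The Coleman-Liau formula can be sketched from raw counts by first converting them to per-100-word averages. The function name `coleman_liau_from_counts` and the sample counts are hypothetical.

```python
def coleman_liau_from_counts(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index from raw counts (hypothetical helper)."""
    letters_per_100_words = letters / words * 100      # L in the formula
    sentences_per_100_words = sentences / words * 100  # S in the formula
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Made-up sample: 450 letters, 100 words, 5 sentences.
print(round(coleman_liau_from_counts(450, 100, 5), 2))
```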
Linsear Write Formula
Returns the grade level using the Linsear Write Formula. This is a grading formula in that a score of 9.3 means that a ninth-grader would be able to read the document.
Formula:
The standard Linsear Write metric Lw runs on a 100-word sample:
- For each “easy word”, defined as a word with 2 syllables or fewer, add 1 point.
- For each “hard word”, defined as a word with 3 syllables or more, add 3 points.
- Divide the points by the number of sentences in the 100-word sample.
- Adjust the provisional result r:
  - If r > 20, Lw = r / 2
  - If r ≤ 20, Lw = r / 2 − 1
The result is a “grade level” measure, reflecting the estimated years of education needed to read the text fluently.
print(textstat.linsear_write_formula(excerpt_min))
print(textstat.linsear_write_formula(excerpt_max))
16.75 5.222222222222222
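The Linsear Write steps can be sketched directly in Python. The function name `linsear_write` and the sample inputs are hypothetical. The sketch takes a list with one syllable count per word, so syllable counting itself is left out.

```python
from typing import Sequence


def linsear_write(syllables_per_word: Sequence[int], sentence_count: int) -> float:
    """Linsear Write grade for a sample (hypothetical helper).

    syllables_per_word holds one syllable count per word in the sample.
    """
    # Easy words (2 syllables or fewer) score 1 point, hard words score 3.
    points = sum(1 if syllables <= 2 else 3 for syllables in syllables_per_word)
    provisional = points / sentence_count
    # Adjust the provisional result r as described in the steps.
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1


# Made-up 100-word samples.
sample_a = [1] * 90 + [3] * 10   # 90 easy words, 10 hard words
sample_b = [2] * 80 + [4] * 20   # 80 easy words, 20 hard words
print(linsear_write(sample_a, sentence_count=5))   # 12.0
print(linsear_write(sample_b, sentence_count=10))  # 6.0
```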
Dale–Chall readability formula
Unlike the other tests, this one uses a lookup table of the 3,000 most commonly used English words and returns the grade level using the New Dale-Chall Formula.
Score | Understood by |
---|---|
4.9 or lower | average 4th-grade student or lower |
5.0–5.9 | average 5th or 6th-grade student |
6.0–6.9 | average 7th or 8th-grade student |
7.0–7.9 | average 9th or 10th-grade student |
8.0–8.9 | average 11th or 12th-grade student |
9.0–9.9 | average 13th to 15th-grade (college) student |

print(textstat.dale_chall_readability_score(excerpt_min))
print(textstat.dale_chall_readability_score(excerpt_max))
8.19 6.37
# difficult_words
print(textstat.difficult_words(excerpt_min))
print(textstat.difficult_words(excerpt_max))
37 20
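To show how the difficult-word count feeds into the score, here is a sketch of the standard New Dale-Chall formula computed from raw counts. The function name `dale_chall_from_counts` is hypothetical. With the word, sentence, and difficult-word counts printed in this article, the sketch reproduces the scores shown above, although textstat's internal counting rules may differ for other texts.

```python
def dale_chall_from_counts(words: int, sentences: int, difficult_words: int) -> float:
    """New Dale-Chall formula from raw counts (hypothetical helper)."""
    percent_difficult = 100 * difficult_words / words
    score = 0.1579 * percent_difficult + 0.0496 * (words / sentences)
    # The formula adds a constant when more than 5% of the words are difficult.
    if percent_difficult > 5:
        score += 3.6365
    return score


# Counts taken from the lexicon, sentence, and difficult-word outputs above.
print(round(dale_chall_from_counts(177, 7, 37), 2))   # 8.19
print(round(dale_chall_from_counts(145, 13, 20), 2))  # 6.37
```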
Readability Consensus based upon all the above tests
Based upon all the above tests, returns the estimated school grade level required to understand the text. Optionally, float_output allows the score to be returned as a float. It defaults to False.
print(textstat.text_standard(excerpt_min))
print(textstat.text_standard(excerpt_max))
14th and 15th grade 6th and 7th grade
Download the dataset used in this blog here. This notebook was provided on Kaggle by Shoku Pan.