Introduction

How can you evaluate “Readability”?

Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.

For example, the Flesch Reading Ease score indicates how difficult an English passage is to understand. The score is interpreted as follows:

Score     Difficulty
90-100    Very Easy
80-89     Easy
70-79     Fairly Easy
60-69     Standard
50-59     Fairly Difficult
30-49     Difficult
0-29      Very Confusing

If you want to know that score, you can calculate it with textstat:

textstat.flesch_reading_ease(text)

If the returned value is 45.8, the text is Difficult; if it is 77.03, the text is Fairly Easy.
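
The table lookup can be sketched as a small helper. The function name `flesch_difficulty` is hypothetical (it is not part of textstat), and treating every score below 30, including negative ones, as "Very Confusing" is an assumption made for illustration.

```python
def flesch_difficulty(score: float) -> str:
    """Map a Flesch Reading Ease score to the difficulty band in the table."""
    # (lower bound, label) pairs, checked from the easiest band down.
    bands = [
        (90, "Very Easy"),
        (80, "Easy"),
        (70, "Fairly Easy"),
        (60, "Standard"),
        (50, "Fairly Difficult"),
        (30, "Difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    # Assumption: anything below 30, including negative scores, lands here.
    return "Very Confusing"


print(flesch_difficulty(45.8))   # Difficult
print(flesch_difficulty(77.03))  # Fairly Easy
```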

Textstat supports the following readability scores:

  • The Flesch Reading Ease formula
  • The Flesch-Kincaid Grade Level
  • The Fog Scale (Gunning FOG Formula)
  • The SMOG Index
  • Automated Readability Index
  • The Coleman-Liau Index
  • Linsear Write Formula
  • Dale-Chall Readability Score
  • Readability Consensus based upon all the above tests

You can also use textstat to count the number of:

  • Syllables
  • Words (lexicon items)
  • Sentences

Load data

import numpy as np
import pandas as pd
train = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('/kaggle/input/commonlitreadabilityprize/test.csv')
sub = pd.read_csv('/kaggle/input/commonlitreadabilityprize/sample_submission.csv')
train

      id         url_legal                                        license       excerpt                                            target     standard_error
0     c12129c31  NaN                                              NaN           When the young people returned to the ballroom…    -0.340259  0.464009
1     85aa80a4c  NaN                                              NaN           All through dinner time, Mrs. Fayre was somewh…    -0.315372  0.480805
2     b69ac6792  NaN                                              NaN           As Roger had predicted, the snow departed as q…    -0.580118  0.476676
3     dd1000b26  NaN                                              NaN           And outside before the palace a great garden w…    -1.054013  0.450007
4     37c1b32fb  NaN                                              NaN           Once upon a time there were Three Bears who li…     0.247197  0.510845
...
2829  25ca8f498  https://sites.ehe.osu.edu/beyondpenguins/files…  CC BY-SA 3.0  When you think of dinosaurs and where they liv…     1.711390  0.646900
2830  2c26db523  https://en.wikibooks.org/wiki/Wikijunior:The_E…  CC BY-SA 3.0  So what is a solid? Solids are usually hard be…     0.189476  0.535648
2831  cd19e2350  https://en.wikibooks.org/wiki/Wikijunior:The_E…  CC BY-SA 3.0  The second state of matter we will discuss is …     0.255209  0.483866
2832  15e2e9e7a  https://en.wikibooks.org/wiki/Geometry_for_Ele…  CC BY-SA 3.0  Solids are shapes that you can actually touch….    -0.215279  0.514128
2833  5b990ba77  https://en.wikibooks.org/wiki/Wikijunior:Biolo…  CC BY-SA 3.0  Animals are made of many cells. They eat thing…     0.300779  0.512379

2834 rows × 6 columns

# The row with the min target value
train_min = train.loc[train['target'].idxmin()]
excerpt_min = train_min['excerpt']
print(train_min)
print()
print(excerpt_min)
# The row with the max target value
train_max = train.loc[train['target'].idxmax()]
excerpt_max = train_max['excerpt']
print(train_max)
print()
print(excerpt_max)

Textstat

Textstat is an easy-to-use library for calculating statistics from text.
It helps determine readability, complexity, and grade level.

We will use the following functions:

  • textstat.flesch_reading_ease(test_data)
  • textstat.smog_index(test_data)
  • textstat.flesch_kincaid_grade(test_data)
  • textstat.coleman_liau_index(test_data)
  • textstat.automated_readability_index(test_data)
  • textstat.dale_chall_readability_score(test_data)
  • textstat.difficult_words(test_data)
  • textstat.linsear_write_formula(test_data)
  • textstat.gunning_fog(test_data)
  • textstat.text_standard(test_data)

The following functions are designed specifically for Spanish.
They can be used on non-Spanish texts, but that use case is not recommended.

  • textstat.fernandez_huerta(test_data)
  • textstat.szigriszt_pazos(test_data)
  • textstat.gutierrez_polini(test_data)
  • textstat.crawford(test_data)

!pip install textstat
import textstat

List of Functions

syllable_count

Returns the number of syllables present in the given text.
Uses the Python module Pyphen for syllable calculation.

print(textstat.syllable_count(excerpt_min))
print(textstat.syllable_count(excerpt_max))
285
201

lexicon_count

Calculates the number of words present in the text. The optional removepunct argument specifies whether punctuation is stripped before counting. The default value is True, which removes punctuation before counting lexicon items.

print(textstat.lexicon_count(excerpt_min, removepunct=True))
print(textstat.lexicon_count(excerpt_max, removepunct=True))
177
145
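
To make the effect of removepunct concrete, here is a rough pure-Python sketch of the idea. `count_words` is a hypothetical helper, not textstat's implementation; it assumes words are separated by whitespace and only strips ASCII punctuation.

```python
import string


def count_words(text: str, remove_punct: bool = True) -> int:
    """Count whitespace-separated tokens, optionally stripping punctuation first."""
    if remove_punct:
        # Drop ASCII punctuation so stand-alone tokens such as "-" or "!" vanish.
        text = "".join(ch for ch in text if ch not in string.punctuation)
    return len(text.split())


sample = "Wait - what ? Really !"
print(count_words(sample, remove_punct=True))   # 3
print(count_words(sample, remove_punct=False))  # 6
```

With punctuation kept, the stand-alone symbols are counted as tokens, which is why the second count is higher.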

sentence_count

Returns the number of sentences present in the given text.

print(textstat.sentence_count(excerpt_min))
print(textstat.sentence_count(excerpt_max))
7
13

Flesch Reading Ease

Returns the Flesch Reading Ease Score.
The following table can help you assess how easy a document is to read.
The bands are indicative only: the maximum score is 121.22, but there is no lower limit, so a negative score is valid.

Score     Difficulty
90-100    Very Easy
80-89     Easy
70-79     Fairly Easy
60-69     Standard
50-59     Fairly Difficult
30-49     Difficult
0-29      Very Confusing

Formula:

FRE = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

print(textstat.flesch_reading_ease(excerpt_min))
print(textstat.flesch_reading_ease(excerpt_max))
45.8
77.03
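
As a rough cross-check, the standard Flesch Reading Ease formula can be applied to the counts printed earlier for excerpt_min (177 words, 7 sentences, 285 syllables). The helper name is hypothetical, and the result may differ somewhat from textstat's output, since textstat applies its own tokenization and rounding.

```python
def flesch_reading_ease_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula, computed from raw counts."""
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Counts for excerpt_min, taken from the outputs printed above.
print(round(flesch_reading_ease_from_counts(words=177, sentences=7, syllables=285), 2))

# A single one-syllable word forming one sentence gives the maximum score.
print(round(flesch_reading_ease_from_counts(words=1, sentences=1, syllables=1), 2))  # 121.22
```

The second call shows where the 121.22 ceiling comes from: both ratios are at their minimum of 1.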

The Flesch-Kincaid Grade Level

Returns the Flesch-Kincaid Grade of the given text.
This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:

FKGL = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

print(textstat.flesch_kincaid_grade(excerpt_min))
print(textstat.flesch_kincaid_grade(excerpt_max))
13.2
5.3
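
The standard Flesch-Kincaid Grade Level formula can be checked by hand with the counts printed earlier for excerpt_min. This is a minimal sketch with a hypothetical helper name; textstat's own result (13.2 above) can differ slightly because of its internal tokenization and rounding.

```python
def flesch_kincaid_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula, computed from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Counts for excerpt_min, taken from the outputs printed above.
grade = flesch_kincaid_grade_from_counts(words=177, sentences=7, syllables=285)
print(round(grade, 2))
```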

Gunning fog index

Returns the FOG index of the given text. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

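Formula:

Fog = 0.4 × [(words / sentences) + 100 × (complex words / words)]

Complex words are words with three or more syllables.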

# gunning_fog by textstat
print(textstat.gunning_fog(excerpt_min))
print(textstat.gunning_fog(excerpt_max))
14.64
6.69
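
A minimal sketch of the standard Gunning fog formula, computed from raw counts. The helper name and the sample counts are hypothetical; "complex words" here means words of three or more syllables, and how textstat identifies them is not covered by this sketch.

```python
def gunning_fog_from_counts(words: int, sentences: int, complex_words: int) -> float:
    """Standard Gunning fog index: 0.4 * (avg sentence length + % complex words)."""
    avg_sentence_length = words / sentences
    percent_complex = 100 * complex_words / words
    return 0.4 * (avg_sentence_length + percent_complex)


# Hypothetical sample: 100 words, 5 sentences, 10 complex words.
print(round(gunning_fog_from_counts(words=100, sentences=5, complex_words=10), 2))  # 12.0
```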

SMOG

Returns the SMOG index of the given text.
This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.
Texts of fewer than 30 sentences are statistically invalid, because the SMOG formula was normed on 30-sentence samples. textstat requires at least 3 sentences for a result.

  • Count a number of sentences (at least 30).
  • In those sentences, count the polysyllables (words of 3 or more syllables).
  • Compute grade = 1.0430 × √(polysyllables × 30 / number of sentences) + 3.1291.

# smog_index by textstat
print(textstat.smog_index(excerpt_min))
print(textstat.smog_index(excerpt_max))
14.6
8.4
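
The counting steps feed into the standard SMOG grade formula, sketched below. The helper name and the sample counts are hypothetical; the sketch takes the polysyllable count as given and does not reproduce how textstat counts syllables.

```python
import math


def smog_grade(polysyllables: int, sentences: int) -> float:
    """Standard SMOG formula; the count is scaled to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical sample: 16 polysyllabic words found in 30 sentences.
print(round(smog_grade(polysyllables=16, sentences=30), 4))  # 7.3011
```

The `30 / sentences` factor is what normalizes shorter or longer samples to the 30-sentence basis the formula was normed on.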

Automated readability index

Returns the ARI (Automated Readability Index) which outputs a number that approximates the grade level needed to comprehend the text.

For example, if the ARI is 6.5, then the grade level needed to comprehend the text is 6th to 7th grade.

Formula:

ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43

print(textstat.automated_readability_index(excerpt_min))
print(textstat.automated_readability_index(excerpt_max))
15.0
8.8
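
A minimal sketch of the standard ARI formula, computed from raw counts. The helper name and the sample counts are hypothetical, and which characters textstat includes in its character count is not covered here.

```python
def ari_from_counts(characters: int, words: int, sentences: int) -> float:
    """Standard Automated Readability Index, computed from raw counts."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical sample: 400 characters, 100 words, 5 sentences.
print(round(ari_from_counts(characters=400, words=100, sentences=5), 2))  # 7.41
```

Unlike the Flesch formulas, ARI relies on characters per word rather than syllables, so it needs no syllable counter.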

The Coleman-Liau Index

Returns the grade level of the text using the Coleman-Liau Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

CLI = 0.0588 × L − 0.296 × S − 15.8

  • L: the average number of letters per 100 words
  • S: the average number of sentences per 100 words

print(textstat.coleman_liau_index(excerpt_min))
print(textstat.coleman_liau_index(excerpt_max))
11.44
10.88
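
The formula above can be sketched directly from raw counts. The helper name and the sample counts are hypothetical; the sketch simply scales letters and sentences to a per-100-words basis before applying the coefficients.

```python
def coleman_liau_from_counts(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: CLI = 0.0588 * L - 0.296 * S - 15.8."""
    letters_per_100_words = letters / words * 100      # L
    sentences_per_100_words = sentences / words * 100  # S
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Hypothetical sample: 450 letters, 100 words, 5 sentences.
print(round(coleman_liau_from_counts(letters=450, words=100, sentences=5), 2))  # 9.18
```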

Linsear Write Formula

Returns the grade level using the Linsear Write Formula. This is a grade formula in that a score of 9.3 means that a ninth grader would be able to read the document.

Formula:
The standard Linsear Write metric Lw runs on a 100-word sample:

  1. For each “easy word”, defined as words with 2 syllables or less, add 1 point.
  2. For each “hard word”, defined as words with 3 syllables or more, add 3 points.
  3. Divide the points by the number of sentences in the 100-word sample.
  4. Adjust the provisional result r:
  • If r > 20, Lw = r / 2
  • If r ≤ 20, Lw = r / 2 − 1

The result is a “grade level” measure, reflecting the estimated years of education needed to read the text fluently.

print(textstat.linsear_write_formula(excerpt_min))
print(textstat.linsear_write_formula(excerpt_max))
16.75
5.222222222222222
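
The four steps above can be sketched as follows. The function name and the sample data are hypothetical, and the per-word syllable counts are assumed to be supplied by the caller; this is not textstat's implementation.

```python
from typing import Sequence


def linsear_write(syllables_per_word: Sequence[int], sentences: int) -> float:
    """Apply the Linsear Write steps to a sample of per-word syllable counts."""
    # Steps 1 and 2: easy words (<= 2 syllables) score 1, hard words score 3.
    points = sum(1 if count <= 2 else 3 for count in syllables_per_word)
    # Step 3: divide by the number of sentences in the sample.
    r = points / sentences
    # Step 4: adjust the provisional result.
    return r / 2 if r > 20 else r / 2 - 1


# Hypothetical 100-word sample: 90 easy words and 10 hard words.
sample = [1] * 90 + [3] * 10
print(linsear_write(sample, sentences=5))   # 12.0
print(linsear_write(sample, sentences=10))  # 5.0
```

The same words spread over more sentences give a lower grade, because the score is driven by points per sentence.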

Dale–Chall readability formula

Unlike the other tests, this one uses a lookup table of the 3,000 most commonly used English words; any word not in that list counts as difficult. It returns the grade level using the New Dale-Chall Formula.

Score           Understood by
4.9 or lower    average 4th-grade student or lower
5.0–5.9         average 5th or 6th-grade student
6.0–6.9         average 7th or 8th-grade student
7.0–7.9         average 9th or 10th-grade student
8.0–8.9         average 11th or 12th-grade student
9.0–9.9         average 13th to 15th-grade (college) student

print(textstat.dale_chall_readability_score(excerpt_min))
print(textstat.dale_chall_readability_score(excerpt_max))
8.19
6.37
# difficult_words
print(textstat.difficult_words(excerpt_min))
print(textstat.difficult_words(excerpt_max))
37
20
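
For reference, the New Dale-Chall formula can be sketched from raw counts. The helper name is hypothetical; the inputs are the word, sentence, and difficult-word counts printed earlier in this post. With those counts the sketch gives values that agree with the scores printed above, though this is not guaranteed to match textstat for every text or version.

```python
def dale_chall_from_counts(words: int, sentences: int, difficult_words: int) -> float:
    """New Dale-Chall formula, computed from raw counts."""
    percent_difficult = difficult_words / words * 100
    score = 0.1579 * percent_difficult + 0.0496 * (words / sentences)
    # The adjustment applies only when more than 5% of the words are difficult.
    if percent_difficult > 5:
        score += 3.6365
    return score


# excerpt_min: 177 words, 7 sentences, 37 difficult words.
print(round(dale_chall_from_counts(177, 7, 37), 2))   # 8.19
# excerpt_max: 145 words, 13 sentences, 20 difficult words.
print(round(dale_chall_from_counts(145, 13, 20), 2))  # 6.37
```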

Readability Consensus based upon all the above tests

Based upon all the above tests, returns the estimated school grade level required to understand the text. Optional float_output allows the score to be returned as a float. Defaults to False.

print(textstat.text_standard(excerpt_min))
print(textstat.text_standard(excerpt_max))
14th and 15th grade
6th and 7th grade

This notebook was provided on Kaggle by Shoku Pan.
