Language

Lexical Complexity of Dante's Inferno

June 12, 2017

#nltk #language #d3

A corpus linguistics analysis of the original Italian text of the Divina Commedia — measuring vocabulary coverage thresholds and frequency distribution to quantify the reading challenge it presents to a learner of Italian.

Background

Dante Alighieri (1265–1321) was a Florentine poet whose Divina Commedia, written between approximately 1308 and 1320 (the year before his death), is considered a cornerstone of world literature and a foundational text of the Italian language. The work comprises three canticles, Inferno, Purgatorio, and Paradiso, and traces an allegorical journey through the afterlife. The Roman poet Virgil guides Dante through Inferno and most of Purgatorio.

Analysis of Harry Potter in German

May 9, 2017

#nltk #language #d3

A frequency analysis of Harry Potter und der Stein der Weisen using Python and NLTK, exploring what corpus linguistics reveals about the vocabulary threshold for reading comprehension in a second language.

Methodology

The source text was processed using NLTK (Natural Language Toolkit) in Python. The pipeline: tokenize the raw text with nltk.word_tokenize, lowercase and strip punctuation, then build a frequency distribution with nltk.FreqDist. Stopwords were deliberately not removed — function words like articles, pronouns, and conjunctions are exactly what a language learner needs to acquire, and stripping them would distort the comprehension model.
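The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the exact code behind the analysis. The function names (tokenize, normalize, frequency_distribution) and the sample sentence are made up for the example. The regex tokenizer and collections.Counter are assumed stand-ins that only take over when NLTK or its Punkt tokenizer data is not available.

```python
import re
import string
from collections import Counter

try:
    import nltk
except ImportError:  # NLTK not installed; the fallbacks below are used instead
    nltk = None


def tokenize(text, language="german"):
    """Split raw text into word and punctuation tokens."""
    if nltk is not None:
        try:
            return nltk.word_tokenize(text, language=language)
        except LookupError:
            pass  # Punkt tokenizer data has not been downloaded
    # Assumed stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def normalize(tokens):
    """Lowercase tokens and drop those left with no letter or digit."""
    cleaned = []
    for token in tokens:
        token = token.lower().strip(string.punctuation)
        if any(ch.isalnum() for ch in token):
            cleaned.append(token)
    return cleaned


def frequency_distribution(text, language="german"):
    """Count word frequencies; stopwords are deliberately kept."""
    words = normalize(tokenize(text, language))
    return nltk.FreqDist(words) if nltk is not None else Counter(words)


# Made-up sample sentence, not taken from the book.
sample = "Der Hund sieht den Hund, und der Hund bellt."
freq = frequency_distribution(sample)
print(freq.most_common(2))  # [('hund', 3), ('der', 2)]
```

Because nltk.FreqDist is a subclass of collections.Counter, most_common and dictionary-style lookups work the same way on either return type. Note that the article "der" stays in the counts, in line with the decision not to remove stopwords.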