Text simplification data sets

If you use this data, please send me (David Kauchak) an e-mail and let me know what project you're working on.

Wikipedia Data Sets

Two different versions of the data set now exist. Both were generated by aligning Simple English Wikipedia and English Wikipedia. A complete description of the extraction process can be found in "Simple English Wikipedia: A New Simplification Task", William Coster and David Kauchak (2011). In Proceedings of ACL (short paper).

For questions regarding either version of the data set, contact David Kauchak at Middlebury College.

Version 1.0

The original version of the data set was created from Wikipedia pages downloaded in May 2010. The data set contains 137K aligned sentence pairs, limited to those pairs with a similarity above 0.50. Higher-precision alignments may be obtained by raising the TF-IDF similarity threshold.

Version 1.0 data (split into train, tune and test)
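
The thresholding mentioned above can be sketched as follows. This is a minimal illustration, not the original extraction code: it assumes each aligned pair is available together with its similarity score, and the AlignedPair record, the filter_by_similarity function, the example sentences and the scores are all hypothetical. The layout of the released files may differ.

```python
from typing import Iterable, List, NamedTuple


class AlignedPair(NamedTuple):
    """Hypothetical record; the released files may be laid out differently."""
    normal: str        # sentence from English Wikipedia
    simple: str        # sentence from Simple English Wikipedia
    similarity: float  # alignment similarity score (assumed to be available)


def filter_by_similarity(pairs: Iterable[AlignedPair], threshold: float) -> List[AlignedPair]:
    """Keep only the pairs whose similarity is strictly above the threshold."""
    return [pair for pair in pairs if pair.similarity > threshold]


# Made-up pairs with made-up scores, for illustration only.
pairs = [
    AlignedPair("The feline rested upon the mat.", "The cat sat on the mat.", 0.55),
    AlignedPair("He purchased an automobile.", "He bought a car.", 0.72),
    AlignedPair("Water is composed of hydrogen and oxygen.",
                "Water is made of hydrogen and oxygen.", 0.91),
]

# A higher threshold trades data set size for alignment precision.
high_precision = filter_by_similarity(pairs, threshold=0.70)
```

With these made-up scores, a threshold of 0.70 keeps the second and third pairs and drops the first.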

Version 2.0

An updated version of the data set, created from Wikipedia pages downloaded in May 2011. This data set was used in "Improving Text Simplification Language Modeling Using Unsimplified Text Data", David Kauchak (2013). In Proceedings of ACL. The versions below are slightly different from (and larger than) the data used in the paper. For the data used in the paper, please contact me.

This data set contains two parts: a sentence-aligned part and a document-aligned part.

Sentence-aligned: An update of the version 1.0 sentence-aligned data set, built from newer Wikipedia data with improved text processing. The data set contains 167K aligned sentence pairs.
Version 2.0 sentence-aligned data
Document-aligned: The text from 60K English Wikipedia and Simple English Wikipedia articles paired based on the titles. The data set contains all Simple English Wikipedia articles that also have a corresponding article in English Wikipedia.
Version 2.0 document-aligned data
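
The title-based pairing described for the document-aligned part can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the data set: it assumes articles are held in dictionaries keyed by title and that titles are matched exactly. The pair_by_title function and the example articles are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_by_title(english: Dict[str, str], simple: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (title, english_text, simple_text) for every Simple English
    article whose title also appears in English Wikipedia.

    Exact title matching is an assumption made for this sketch.
    """
    return [
        (title, english[title], simple[title])
        for title in sorted(simple)
        if title in english
    ]


# Made-up articles, for illustration only.
english_articles = {
    "Cat": "The cat is a small carnivorous mammal.",
    "Dog": "The dog is a domesticated descendant of the wolf.",
    "Quark": "A quark is a type of elementary particle.",
}
simple_articles = {
    "Cat": "Cats are small animals that eat meat.",
    "Dog": "Dogs are animals that people keep as pets.",
    "Tree house": "A tree house is a small house built in a tree.",
}

document_pairs = pair_by_title(english_articles, simple_articles)
```

Here only "Cat" and "Dog" are paired, since "Tree house" has no English counterpart in the example and "Quark" has no Simple English one.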

Mechanical Turk Lexical Simplification Data Set

Available here.

For more information and to cite, see:
Colby Horn, Katie Manduca and David Kauchak (2014). Learning a Lexical Simplifier Using Wikipedia. In Proceedings of ACL (short paper).

Human Simplification with Sentence Fusion Data Set

Available here.

For more information and to cite, see:
Max Schwarzer, Teerapaun Tanprasert and David Kauchak (2021). Improving Human Text Simplification with Sentence Fusion. In Proceedings of TextGraphs Workshop.

Mechanical Turk Word Substitution Filtering Data Set

Available here.

For more information and to cite, see:
David Kauchak, Jorge Apricio and Gondy Leroy (2022). Improving the Quality of Suggestions for Medical Text Simplification Tools. In Proceedings of AMIA Informatics Summit.