ELI-Data-Mining-Group/PELIC-spelling

PELIC-spelling

Version 1.0
Authors: Ben Naismith, John Starr, Eva Bacas
Contact: bnaismith@pitt.edu

This repo provides information and code about applying spelling correction to the PELIC dataset.


Table of contents

  1. Overview
  2. Repository contents
  3. SCOWL wordlist
  4. PELIC spelling
  5. Licenses

1. Overview

This README.md file introduces the PELIC-spelling repository, which provides information and code about applying spelling correction to the PELIC dataset. To download the PELIC dataset and find out more about it, see the PELIC-dataset repository. For information about publications and presentations based on PELIC data, and about the people and parties responsible for the corpus, please visit the Pitt ELI Corpus web page.

Spelling correction is an important element to consider in any corpus study involving learner data. The decision whether to correct texts or not will invariably impact results: in some instances it may be preferable to use the raw text, maintaining its integrity and avoiding an additional layer of processing. However, for other projects, corrected text may provide a more accurate representation of the language features being investigated.

There are three main components to the spelling correction process, presented in three Jupyter notebooks:

  1. SCOWL_wordlist: In this notebook we decide on a list of what we consider to be real words, using an edited version of the SCOWL wordlists.
  2. PELIC_spelling: In this notebook we create a dataframe of misspellings, apply an automated spelling correction process, and re-incorporate the corrected text into our corpus.
  3. PELIC_spelling_validation: In this notebook we detail a validation of the spell checker. A sample of PELIC is checked for spelling manually and then compared to the output of the automated spell checker. The results indicate that the spell checker is highly accurate in terms of the total tokens in PELIC, but conservative, resulting in lower precision. For details, please see the Jupyter notebook.
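
As an illustration of how a comparison between manual and automated checking can be scored, the sketch below computes accuracy, precision, and recall from two lists of per-token flags. It is not taken from the validation notebook: the function name `validation_metrics` and the toy flags are invented for this example, and the notebook may define its measures differently.

```python
def validation_metrics(manual, automated):
    """Compare manual and automated misspelling flags, token by token.

    Both arguments are equal-length lists of booleans, where True means
    "this token was flagged as a misspelling".
    """
    if len(manual) != len(automated):
        raise ValueError("manual and automated must have the same length")
    pairs = list(zip(manual, automated))
    tp = sum(m and a for m, a in pairs)          # flagged by both
    fp = sum(a and not m for m, a in pairs)      # flagged only by the checker
    fn = sum(m and not a for m, a in pairs)      # missed by the checker
    tn = sum(not m and not a for m, a in pairs)  # flagged by neither
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented toy flags for eight tokens (hypothetical, not PELIC data).
manual_flags = [True, False, False, True, False, False, True, False]
automated_flags = [True, False, True, False, False, False, True, False]
metrics = validation_metrics(manual_flags, automated_flags)
```

Because most tokens in a text are correctly spelled, accuracy over all tokens can be high even when precision or recall on the flagged tokens is noticeably lower, which is why the measures are worth reporting separately.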

2. Repository contents

The PELIC-spelling repository contains 14 main files:

| File | File type | Description |
| --- | --- | --- |
| all_names.txt | text | list of over 90,000 names (first and last) from the 1990 US census data. Names collected by the names random name generator project |
| contractions.txt | text | short list of contractions approved as legitimate tokens (not misspellings) |
| frequency_bigramdictionary_en_243_342.txt | text | bigram frequency dictionary supplied by SymSpell spell correction module |
| frequency_dictionary_en_82_765.txt | text | frequency dictionary supplied by SymSpell spell correction module |
| hyphens.txt | text | list of hyphenated words which appear in PELIC and have been approved as legitimate tokens (not misspellings) |
| PELIC_compiled_spellcorrected.csv | csv | final output of updated PELIC_compiled.csv with spelling correction |
| PELIC_spelling.ipynb | Jupyter notebook | notebook demonstrating how spelling correction is applied to PELIC texts |
| PELIC-SCOWL.txt | text | a combination of the SCOWL_condensed.txt, contractions.txt, and hyphens.txt lists |
| README.md | markdown | this file describing the repository |
| SCOWL_condensed.txt | text | final compiled word list based on SCOWL word lists |
| SCOWL_supp.txt | text | short list of words manually approved as being legitimate words, e.g. proper names not found in SCOWL |
| SCOWL_wordlist.ipynb | Jupyter notebook | notebook demonstrating how the SCOWL_condensed word list is created |
| SCOWL_wordlist.txt | text | the full SCOWL wordlist before condensing |
| PELIC_spelling_validation.ipynb | Jupyter notebook | manual validation of the spell checker |

3. SCOWL wordlist

This notebook produces a definitive list of 'real' words to use when deciding what to consider a word/non-word. The final output is the SCOWL_condensed.txt file. The primary wordlists are from the SCOWL set of word lists, freely available at http://wordlist.aspell.net/.

The notebook is divided into two main sections:

  • Exploratory Data Analysis : Here, we examine the various SCOWL dictionaries, which include different language varieties, proper nouns, slang, abbreviations, etc. From this exploration, we opt to include all available dictionaries except the abbreviation dictionaries, because they contain a high number of short strings of letters which may match learner errors. It is possible, however, to include these dictionaries if desired.

  • Compiling and condensing dictionaries : In the second part of the notebook, SCOWL_condensed is created by combining the various SCOWL dictionaries and then removing duplicates, blanks, and possessives. The final wordlist is slightly less than 500k words.
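
The condensing step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's code: the function name `condense_wordlists` and the two toy lists are invented, possessives are assumed to be entries ending in `'s`, and case is left unchanged.

```python
def condense_wordlists(wordlists):
    """Merge several word lists into one sorted list of unique entries.

    Blank entries and possessive forms (ending in 's) are dropped.
    """
    condensed = set()
    for words in wordlists:
        for word in words:
            word = word.strip()
            if not word:             # skip blanks
                continue
            if word.endswith("'s"):  # skip possessives (assumed form)
                continue
            condensed.add(word)      # the set removes duplicates
    return sorted(condensed)


# Invented stand-ins for two SCOWL dictionaries (hypothetical data).
american = ["color", "dog", "dog's", ""]
british = ["colour", "dog", "  ", "cat's"]
condensed = condense_wordlists([american, british])
```

With real files, each inner list would typically come from reading one SCOWL dictionary line by line.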


4. PELIC spelling

This notebook adds further processing to PELIC_compiled.csv in the PELIC-dataset repo by creating a column of spelling-corrected tokens and their parts of speech.

The notebook is divided into four main sections:

  • Building a non_words dataframe : We first collect all of the non-words from the PELIC dataset (in PELIC_compiled.csv) by extracting all words which are not found in SCOWL_condensed:
>>> non_words.head()
tok_lem_POS sentence answer_id
0 ('beacause', 'beacause', 'NN') i organized the instructions by time, beacause to make tea people who want to make tea have to follow the instructions step by step. 8
1 ('wallmart', 'wallmart', 'NN') next, you need to buy a box of tea in wallmart or giant eagle. 11
2 ('dovn', 'dovn', 'NN') first, you should take some hot water, you can use dovn, mircowave or other ways. 13
3 ('mircowave', 'mircowave', 'VBP') first, you should take some hot water, you can use dovn, mircowave or other ways. 13
4 ('paragragh', 'paragragh', 'NN') every paragragh's instructions depend on a main idea. 16
  • Building a dataframe of misspellings and their frequencies : In the non-words dataframe above, each row is one occurrence of a misspelling (i.e. a token). We then create a dataframe where each row is a misspelling type with frequency information attached:
>>> misspell_df.sample(5)
Index misspelling tok_lem_POS freq
9164 spel ('spel', 'spel', 'VB') 1
5495 invesigate ('invesigate', 'invesigate', 'VB') 1
3645 estmatied ('estmatied', 'estmatied', 'JJ') 1
9313 straigten ('straigten', 'straigten', 'VB') 1
8455 hobbys ('hobbys', 'hobbys', 'NN') 2
  • Applying spelling correction : Having collected and organized the misspellings, we then correct these occurrences using SymSpell. SymSpell does not consider complete sentence context, only bigrams and frequencies. Though this is not ideal, other well-known spellcheckers (hunspell, pyspell, etc.) use the same strategy: frequency-based criteria for suggestions, without considering co-text beyond bigrams. As such, the accuracy of corrected tokens will not be 100%, and this must be taken into consideration when using the corrected text.
>>> print(non_words2[['answer_id','misspelling','sentence','final_correction_POS']].sample(5))
# Sample of 5 rows and key columns
answer_id misspelling sentence final_correction_POS
11487 ('celemony', 'celemony', 'NN') Third, the ANON_NAME_0-Ju international movie celemony is opened in my hometown. ('ceremony', 'ceremony', 'NN')
13444 ('miliion', 'miliion', 'NN') 200 miliion people ('million', 'million', 'NN')
17707 ('korian', 'korian', 'JJ') Korian pizza is healthier than American pizza. ('korean', 'korean', 'JJ')
35162 ('grammer', 'grammer', 'NN') Although my grammer was not impeccable, they could usually understand what I meant. ('grammar', 'grammar', 'NN')
10839 ('comunity', 'comunity', 'NN') Second, truth make our comunity be truthable sociaty. ('community', 'community', 'NN')
  • Incorporating corrections into pelic_df : Finally, these corrected tokens are incorporated back into pelic_df, creating a new tok_lem_POS column for easy comparison to the original texts. Below is an example of an original and corrected text:
>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,11]) #uncorrected
[(('My', 'my', 'PRP$'), ('friend', 'friend', 'NN'), ('is', 'be', 'VBZ'), ('realy', 'realy', 'JJ'), ('nise', 'nise', 'RB'), ('guy', 'guy', 'NN'), ('.', '.', '.'), ('I', 'i', 'PRP'), ('like', 'like', 'VBP'), ('hem', 'hem', 'JJ'), ('becuase', 'becuase', 'NN'), ('he', 'he', 'PRP'), ('is', 'be', 'VBZ'), ('friendlly', 'friendlly', 'RB'), ('and', 'and', 'CC'), ('lovliy', 'lovliy', 'NN'), ('.', '.', '.'))]

>>> print(pelic_df.loc[pelic_df.text.str.contains('becuase')].iloc[1,12]) #corrected
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('real', 'JJ'), ('nice', 'RB'), ('guy', 'NN'), ('.', '.'), ('I', 'PRP'), ('like', 'VBP'), ('hem', 'JJ'), ('because', 'NN'), ('he', 'PRP'), ('is', 'VBZ'), ('friendly', 'RB'), ('and', 'CC'), ('lovely', 'NN'), ('.', '.')]

We can see here that many appropriate corrections have been made, including becuase -> because, nise -> nice, friendlly -> friendly, and lovliy -> lovely. Importantly, incorrect spellings that are actual words, e.g. hem (should be him in this case), are not corrected. In addition, as limited context is considered, there will be some inaccuracies, e.g. realy -> real rather than really (real nice is a frequent bigram).
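
To make this behaviour concrete, here is a small frequency-based corrector written with the standard library only. It is not SymSpell and not the notebook's code: it is a simplified edit-distance-1 sketch, with an invented toy frequency dictionary `FREQ`, that shows why non-words get corrected while a real word such as hem is left alone.

```python
import string

# Invented toy frequency dictionary (word -> count); not the SymSpell data.
FREQ = {"because": 500, "nice": 300, "hem": 5, "him": 900, "friendly": 120}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, freq=FREQ):
    """Return word unchanged if known, else the most frequent known neighbour."""
    if word in freq:
        return word  # real words are never touched, even if wrong in context
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word


corrected = [correct(w) for w in ["becuase", "nise", "hem", "friendlly", "xqzv"]]
```

Since hem is itself in the dictionary, it is returned as is, even though him is far more frequent; only sentence-level context could catch that error.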

Overall, the application of spelling correction is an important resource as it allows for more accurate tracking of what learners may have been intending to write. For example, learners may know a word in every sense, except for its spelling. However, as with any automated text manipulation, the added layer of processing will allow for errors to enter the data, and as such, must be considered carefully when drawing conclusions from the data.
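
The four steps of this section can be sketched end to end on toy data. The sketch below uses plain Python structures instead of the notebook's dataframes, and everything in it is a stand-in: the word list, the two tokenised answers, and the hand-written `corrections` mapping (which takes the place of the SymSpell output) are all invented. It also assumes that POS tags are carried over unchanged.

```python
from collections import Counter

# Invented stand-ins for the word list and for two tokenised answers.
# Tokens follow the (token, lemma, POS) shape shown in the examples above.
wordlist = {"i", "like", "my", "grammar", "was", "good", "is"}
answers = {
    1: [("I", "i", "PRP"), ("like", "like", "VBP"), ("grammer", "grammer", "NN")],
    2: [("My", "my", "PRP$"), ("grammer", "grammer", "NN"), ("was", "be", "VBD"),
        ("godo", "godo", "JJ")],
}

# Step 1: collect every token occurrence that is not in the word list.
non_words = [
    (answer_id, tok)
    for answer_id, tokens in answers.items()
    for tok in tokens
    if tok[0].lower() not in wordlist
]

# Step 2: collapse occurrences into misspelling types with frequencies.
misspell_freq = Counter(tok[0].lower() for _, tok in non_words)

# Step 3 (stand-in): a hand-written mapping in place of the SymSpell output.
corrections = {"grammer": "grammar", "godo": "good"}

# Step 4: build corrected (token, POS) pairs alongside the original tokens.
corrected_answers = {
    answer_id: [(corrections.get(tok.lower(), tok), pos) for tok, _, pos in tokens]
    for answer_id, tokens in answers.items()
}
```

Keeping the original `answers` untouched and writing the corrections to a separate structure mirrors the idea of a new column for easy comparison with the original texts.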


5. Licenses

PELIC license: Creative Commons License
PELIC dataset by Alan Juffs, Na-Rae Han, Ben Naismith is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Based on a work at https://github.com/ELI-Data-Mining-Group/PELIC-dataset.

SCOWL license: SCOWL Copyright and License Agreement

Spell Checking Oriented Word Lists (SCOWL) (http://wordlist.sourceforge.net/scowl-readme) The collective work is Copyright 2000-2011 by Kevin Atkinson as well as any of the copyrights mentioned below:

Copyright 2000-2011 by Kevin Atkinson Permission to use, copy, modify, distribute and sell these word lists, the associated scripts, the output created from the scripts, and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Kevin Atkinson makes no representations about the suitability of this array for any purpose. It is provided "as is" without express or implied warranty.

