Wordless is an integrated corpus tool with multi-language support for the study of language, literature and translation, designed and developed by Ye Lei (叶磊), an MA student in interpreting studies at Shanghai International Studies University (上海外国语大学).
Copyright (C) 2018-2019 Ye Lei (叶磊)
This project is licensed under GNU GPLv3.
For details, see: https://github.com/BLKSerene/Wordless/blob/master/LICENSE.txt
All other rights reserved.
If you publish work that uses Wordless, please cite as follows.
MLA (8th Edition):
Ye, Lei. Wordless, version 1.1.0, 2019, https://github.com/BLKSerene/Wordless.
APA (6th Edition):
Ye, L. (2019). Wordless (Version 1.1.0) [Computer software]. Retrieved from https://github.com/BLKSerene/Wordless
GB (GB/T 7714—2015):
叶磊. Wordless version 1.1.0[CP]. (2019). https://github.com/BLKSerene/Wordless.
The latest version of Wordless supports Windows 7/8.1/10, macOS 10.12 and later, and Ubuntu 16.04 and later, all 64-bit only.
Download the latest version for Windows (unzip the file and double-click Wordless/Wordless.exe to run)
Download the latest version for macOS (unzip the file and double-click Wordless.app to run)
Download the latest version for Linux (unzip the file and double-click Wordless/Wordless to run)
Chinese users with slow connections to Github can download from Baidu Netdisk (password: k3ny).
- Main Window
- File Area
- Overview
- Concordancer
- Wordlist
- N-grams
- Collocation
- Colligation
- Keywords
- Supported Languages
- Supported Text Types
- Supported File Types
- Supported File Encodings
- Supported Measures
- Works Cited
If you encounter a problem, find a bug or need further information, first search the existing issues on GitHub; if you cannot find an answer there, feel free to ask questions, submit bug reports or provide feedback by creating a new issue.
If you need to share sample texts or other information that cannot or should not be posted publicly, you may send me an email.
Home Page: https://github.com/BLKSerene/Wordless
Documentation: https://github.com/BLKSerene/Wordless#documentation
Email: blkserene@gmail.com
WeChat Official Account: Wordless
Important Note: I CANNOT GUARANTEE that all emails will always be checked or replied to in time. I WILL NOT REPLY to irrelevant emails, and I reserve the right to BLOCK AND/OR REPORT people who send me spam emails.
If you would like to help with the development of Wordless, you may contribute bug fixes, enhancements or new features by creating a pull request on GitHub.
You may also contribute by submitting enhancement proposals or feature requests, writing tutorials or GitHub wiki pages for Wordless, or helping translate Wordless and its documentation into other languages.
If you would like to support the development of Wordless, you may donate via PayPal, Alipay or WeChat.
Important Note: I WILL NOT PROVIDE refund services, private email/phone support, information concerning my social media, guarantees on bug fixes, enhancements, new features or new releases of Wordless, invoices, receipts or detailed weekly/monthly/yearly spending reports for donations.
Wordless stands on the shoulders of giants. Thus, I would like to extend my thanks to the following open-source projects:
- jieba by Sun Junyi
- nagisa by Taishi Ikeda (池田大志)
- NLTK by Steven Bird, Liling Tan
- pybo by Hélios Drupchen Hildt
- pymorphy2 by Mikhail Korobov
- PyThaiNLP by Wannaphong Phatthiyaphaibun (วรรณพงษ์ ภัททิยไพบูลย์)
- SacreMoses by Liling Tan
- spaCy by Matthew Honnibal, Ines Montani
- Underthesea by Vu Anh
- Matplotlib by Matplotlib Development Team
- wordcloud by Andreas Christian Mueller
- Beautiful Soup by Leonard Richardson
- cChardet by Yoshihiro Misawa
- chardet by Daniel Blanchard
- langdetect by Michal Mimino Danilak
- langid.py by Marco Lui
- lxml by Stefan Behnel
- NumPy by NumPy Developers
- openpyxl by Eric Gazoni, Charlie Clark
- PyInstaller by Hartmut Goebel
- python-docx by Steve Canny
- requests by Kenneth Reitz
- SciPy by SciPy Developers
- xlrd by Stephen John Machin
- grk-stoplist by Annette von Stockhausen
- lemmalist-greek by Michael Stenskjær Christensen
- Lemmatization Lists by Michal Boleslav Měchura
- Stopwords ISO by Gene Diaz
Main Window [Back to Contents]
The main window of Wordless is divided into several sections:
- Menu Bar
- Work Area: The Work Area is further divided into the Results Area on the left side and the Settings Area on the right side. You can click the tabs at the top to switch between different panels.
- File Area: The File Area is further divided into the File Table on the left side and the Settings Area on the right side.
- Status Bar: You can show/hide the Status Bar by checking/unchecking Menu → Preferences → Show Status Bar.
File Area [Back to Contents]
In most cases, the first thing to do in Wordless is to open and select the files to be processed, either via Menu → File or by clicking the buttons below the File Table.
Files are selected by default after being added to the File Table. Only selected files are processed by Wordless. You can drag and drop files around the File Table to change their order, which will be reflected in the results produced by Wordless.
By default, Wordless tries to detect the language, text type and encoding of each file. You should check and make sure that the settings of each and every file are correct. If you do not want Wordless to detect the settings for you and prefer to set them manually, you can change this behavior under Auto-detection Settings in the Settings Area.
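By way of illustration, below is a minimal sketch of this kind of auto-detection using the chardet and langdetect libraries credited in the acknowledgments above; the function detect_file_settings is an illustrative name, not part of Wordless's API.

```python
import chardet
from langdetect import detect

def detect_file_settings(path):
    # Detect the encoding from the raw bytes, then the language from the text
    with open(path, 'rb') as f:
        raw = f.read()

    encoding = chardet.detect(raw)['encoding'] or 'utf-8'
    text = raw.decode(encoding, errors = 'replace')
    language = detect(text)  # ISO 639-1 code, e.g. 'en'

    return encoding, language

print(detect_file_settings('sample.txt'))
```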
- Add File(s): Add a single file or multiple files to the File Table. You can use the Ctrl key (Command key on macOS) and/or the Shift key to select multiple files.
- Add Folder: Add all files in a folder to the File Table. By default, all files in subfolders (and in subfolders of subfolders, and so on) are also added. If you do not want to add files in subfolders to the File Table, uncheck Folder Settings → Subfolders in the Settings Area.
- Reopen Closed File(s): Add the most recently closed file(s) back to the File Table. The history of all closed files is erased when Wordless exits.
- Select All: Select all files in the File Table.
- Invert Selection: Select all files that are not currently selected and deselect all currently selected files in the File Table.
- Deselect All: Deselect all files in the File Table.
- Close Selected: Remove all currently selected files from the File Table.
- Close All: Remove all files from the File Table.
Overview [Back to Contents]
In Overview, you can check/compare the language features of different files.
- Count of Paragraphs: The number of paragraphs in each file. Each line in the file is counted as one paragraph. Blank lines and lines containing only spaces, tabs and other invisible characters are ignored.
- Count of Sentences: The number of sentences in each file. Wordless automatically applies the built-in sentence tokenizer according to the language of each file. You can change the sentence tokenizer settings via Menu → Preferences → Settings → Sentence Tokenization → Sentence Tokenizer Settings.
- Count of Tokens: The number of tokens in each file. Wordless automatically applies the built-in word tokenizer according to the language of each file. You can change the word tokenizer settings via Menu → Preferences → Settings → Word Tokenization → Word Tokenizer Settings. You can specify what counts as a "token" via Token Settings in the Settings Area.
- Count of Types: The number of token types in each file.
- Count of Characters: The number of characters in each file. Spaces, tabs and all other invisible characters are ignored.
- Type-Token Ratio: The number of token types divided by the number of tokens.
- Type-Token Ratio (Standardized): The standardized type-token ratio. Each file is divided into sub-sections of 1,000 tokens each by default, the type-token ratio is calculated for each sub-section, and the results are averaged over all sub-sections (see the sketch after this list). You can change the number of tokens in each sub-section via Generation Settings → Base of standardized type-token ratio. The last sub-section is discarded if it contains fewer tokens than the base of the standardized type-token ratio, in order to prevent the result from being affected by outliers (extreme values).
- Average Paragraph Length (in Sentence): The number of sentences divided by the number of paragraphs.
- Average Paragraph Length (in Token): The number of tokens divided by the number of paragraphs.
- Average Sentence Length (in Token): The number of tokens divided by the number of sentences.
- Average Token Length (in Character): The number of characters divided by the number of tokens.
- Count of n-length Tokens: The number of tokens consisting of n characters, where n = 1, 2, 3, etc.
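Below is a minimal sketch of the standardized type-token ratio described above, assuming the file has already been tokenized; the function name sttr and the sample tokens are illustrative.

```python
def sttr(tokens, base = 1000):
    ttrs = []

    for i in range(0, len(tokens), base):
        section = tokens[i : i + base]

        # Discard the last sub-section if it is shorter than the base
        if len(section) < base:
            break

        ttrs.append(len(set(section)) / len(section))

    return sum(ttrs) / len(ttrs) if ttrs else 0

tokens = 'the quick brown fox jumps over the lazy dog'.split() * 500
print(sttr(tokens))  # 0.008 (8 types per 1,000-token sub-section)
```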
Concordancer [Back to Contents]
In Concordancer, you can search for any token in different files and generate concordance lines. You can adjust the settings for the generated data via Generation Settings.
After the concordance lines are generated and displayed in the table, you can sort the results by clicking Sort Results or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.
In addition, you can generate concordance plots for any search term. You can modify the settings for the generated figure via Figure Settings. By default, the data in concordance plots are sorted by file. You can sort the data by search term instead via Figure Settings → Sort Results by.
- Left: The context before each search term, displaying 10 tokens to the left of the Node by default. You can change this behavior via Generation Settings (see the sketch after this list).
- Node: The search term specified in Search Settings → Search Term.
- Right: The context after each search term, displaying 10 tokens to the right of the Node by default. You can change this behavior via Generation Settings.
- Token No.: The position of the first token of the Node in each file.
- Sentence No.: The position of the sentence in which the Node is found in each file.
- Paragraph No.: The position of the paragraph in which the Node is found in each file.
- File: The file in which the Node is found.
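Below is a minimal sketch of how concordance lines (a KWIC view) can be generated, assuming a single-token search term and pre-tokenized text; the function name and the context width parameter are illustrative, not Wordless's internals.

```python
def concordance(tokens, node, width = 10):
    lines = []

    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = ' '.join(tokens[max(0, i - width) : i])
            right = ' '.join(tokens[i + 1 : i + 1 + width])

            lines.append((left, token, right, i + 1))  # i + 1 is the 1-based Token No.

    return lines

tokens = 'the cat sat on the mat and the cat slept'.split()

for left, node, right, pos in concordance(tokens, 'cat', width = 3):
    print(f'{left:>15} | {node} | {right}')
```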
Wordlist [Back to Contents]
In Wordlist, you can generate wordlists for different files and calculate the raw frequency, relative frequency, dispersion and adjusted frequency for each token.
In addition, you can generate line charts or word clouds for wordlists using any statistics. You can modify the settings for the generated figure via Figure Settings.
Lastly, you can further filter the results as you see fit by clicking Filter Results, or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.

- Rank: The rank of the token, sorted by its frequency in the first file in descending order (by default). You can re-sort the results by clicking the column headers.
- Tokens: You can specify what counts as a "token" via Token Settings.
- Frequency: The number of occurrences of the token in each file (see the sketch after this list).
- Dispersion: The dispersion of the token in each file. You can change the measure of dispersion via Generation Settings → Measure of Dispersion. See Measures of Dispersion & Adjusted Frequency for more details.
- Adjusted Frequency: The adjusted frequency of the token in each file. You can change the measure of adjusted frequency via Generation Settings → Measure of Adjusted Frequency. See Measures of Dispersion & Adjusted Frequency for more details.
- Number of Files Found: The number of files in which the token appears at least once.
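Below is a minimal sketch of the raw and relative frequencies reported in Wordlist, using Python's Counter; Wordless's token settings (case, lemmatization, stop words, etc.) are not modeled here.

```python
from collections import Counter

tokens = 'the cat sat on the mat and the cat slept'.split()
freqs = Counter(tokens)

# Raw frequency and relative frequency (share of all tokens)
for token, freq in freqs.most_common(3):
    print(f'{token}\t{freq}\t{freq / len(tokens):.2%}')
```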
N-grams [Back to Contents]
In N-grams, you can search for n-grams (consecutive tokens) or skip-grams (non-consecutive tokens) in different files, count and compute the raw frequency and relative frequency of each n-gram/skip-gram, and calculate the dispersion and adjusted frequency for each n-gram/skip-gram using different measures. You can adjust the settings for the generated data via Generation Settings. To allow skip-grams in the results, check Generation Settings → Allow skipped tokens and modify the settings. You can also set constraints on the position of search terms in all n-grams via Search Settings → Search Term Position.
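Below is a minimal sketch of the distinction between n-grams and skip-grams, assuming pre-tokenized text; the parameter max_skipped is illustrative and does not mirror Wordless's exact settings.

```python
from itertools import combinations

def ngrams(tokens, n):
    # Consecutive n-token sequences
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, max_skipped):
    # Pick n tokens in order from each window of n + max_skipped tokens
    grams = set()

    for window in ngrams(tokens, n + max_skipped):
        grams.update(combinations(window, n))

    return grams

tokens = 'the quick brown fox'.split()
print(ngrams(tokens, 2))        # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(skipgrams(tokens, 2, 1))  # additionally includes ('the', 'brown') and ('quick', 'fox')
```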
It is possible to disable searching altogether and generate an exhaustive list of n-grams/skip-grams by unchecking Search Settings, but doing so is not recommended, since processing might be too slow.
In addition, you can generate line charts or word clouds for n-grams using any statistics. You can modify the settings for the generated figure via Figure Settings.
Lastly, you can further filter the results as you see fit by clicking Filter Results, or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.

- Rank: The rank of the n-gram, sorted by its frequency in the first file in descending order (by default). You can re-sort the results by clicking the column headers.
- N-grams: You can specify what counts as an "n-gram" via Token Settings.
- Frequency: The number of occurrences of the n-gram in each file.
- Dispersion: The dispersion of the n-gram in each file. You can change the measure of dispersion via Generation Settings → Measure of Dispersion. See Measures of Dispersion & Adjusted Frequency for more details.
- Adjusted Frequency: The adjusted frequency of the n-gram in each file. You can change the measure of adjusted frequency via Generation Settings → Measure of Adjusted Frequency. See Measures of Dispersion & Adjusted Frequency for more details.
- Number of Files Found: The number of files in which the n-gram appears at least once.
Collocation [Back to Contents]
In Collocation, you can search for patterns of collocation (tokens that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of tokens and calculate the effect size for each pair using different measures. You can adjust the settings for the generated data via Generation Settings.
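Below is a minimal sketch of counting co-occurrences within such a collocational window (the same idea underlies Colligation, with parts of speech in place of tokens); the function name and window size are illustrative.

```python
from collections import Counter

def collocates(tokens, node, window = 5):
    counts = Counter()

    for i, token in enumerate(tokens):
        if token == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    # Record each collocate together with its position, e.g. L1, R2
                    position = f'L{i - j}' if j < i else f'R{j - i}'
                    counts[(tokens[j], position)] += 1

    return counts

tokens = 'the cat sat on the mat near the cat door'.split()
print(collocates(tokens, 'cat', window = 2).most_common(3))
```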
It is possible to disable searching altogether and generate an exhaustive list of patterns of collocation by unchecking Search Settings, but doing so is not recommended, since processing might be too slow.
In addition, you can generate line charts or word clouds for patterns of collocation using any statistics. You can modify the settings for the generated figure via Figure Settings.
Lastly, you can further filter the results as you see fit by clicking Filter Results, or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.

- Rank: The rank of the collocating token, sorted by the p-value of the significance test conducted on the node and the collocating token in the first file in ascending order (by default). You can re-sort the results by clicking the column headers.
- Nodes: The search term. You can specify what counts as a "token" via Token Settings.
- Collocates: The collocating token. You can specify what counts as a "token" via Token Settings.
- Ln, ... , L3, L2, L1, R1, R2, R3, ... , Rn: The number of co-occurrences of the node and the collocating token with the collocating token at the given position in each file.
- Frequency: The total number of co-occurrences of the node and the collocating token at all possible positions in each file.
- Test Statistic: The test statistic of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details. Note that the test statistic is not available for some tests of statistical significance.
- p-value: The p-value of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Bayes Factor: The Bayes factor of the significance test conducted on the node and the collocating token in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details. Note that the Bayes factor is not available for some tests of statistical significance.
- Effect Size: The effect size of the node and the collocating token in each file. You can change the measure of effect size via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Number of Files Found: The number of files in which the node and the collocating token co-occur at least once.
Colligation [Back to Contents]
In Colligation, you can search for patterns of colligation (parts of speech that co-occur more often than would be expected by chance) within a given collocational window (from 5 words to the left to 5 words to the right by default), conduct different tests of statistical significance on each pair of parts of speech and calculate the effect size for each pair using different measures. You can adjust the settings for the generated data via Generation Settings.
Wordless automatically applies its built-in POS tagger, according to the language of each file, to every file that has not already been POS-tagged. If POS tagging is not supported for a given language, you should provide a file that has already been POS-tagged and make sure that the correct Text Type is set on it.
It is possible to disable searching altogether and generate an exhaustive list of patterns of colligation by unchecking Search Settings, but doing so is not recommended, since processing might be too slow.
In addition, you can generate line charts or word clouds for patterns of colligation using any statistics. You can modify the settings for the generated figure via Figure Settings.
Lastly, you can further filter the results as you see fit by clicking Filter Results, or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.

- Rank: The rank of the collocating part of speech, sorted by the p-value of the significance test conducted on the node and the collocating part of speech in the first file in ascending order (by default). You can re-sort the results by clicking the column headers.
- Nodes: The search term. You can specify what counts as a "token" via Token Settings.
- Collocates: The collocating part of speech. You can specify what counts as a "token" via Token Settings.
- Ln, ... , L3, L2, L1, R1, R2, R3, ... , Rn: The number of co-occurrences of the node and the collocating part of speech with the collocating part of speech at the given position in each file.
- Frequency: The total number of co-occurrences of the node and the collocating part of speech at all possible positions in each file.
- Test Statistic: The test statistic of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details. Note that the test statistic is not available for some tests of statistical significance.
- p-value: The p-value of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Bayes Factor: The Bayes factor of the significance test conducted on the node and the collocating part of speech in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details. Note that the Bayes factor is not available for some tests of statistical significance.
- Effect Size: The effect size of the node and the collocating part of speech in each file. You can change the measure of effect size via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Number of Files Found: The number of files in which the node and the collocating part of speech co-occur at least once.
Keywords [Back to Contents]
In Keywords, you can search for candidates of potential keywords (tokens that occur far more or far less frequently in the observed files than in the reference file) given a reference corpus, conduct different tests of statistical significance on each keyword and calculate the effect size for each keyword using different measures. You can adjust the settings for the generated data via Generation Settings.
In addition, you can generate line charts or word clouds for keywords using any statistics. You can modify the settings for the generated figure via Figure Settings.
Lastly, you can further filter the results as you see fit by clicking Filter Results, or search within them by clicking Search in Results; both buttons reside at the right corner of the Results Area.

- Rank: The rank of the keyword, sorted by the p-value of the significance test conducted on the keyword in the first file in ascending order (by default). You can re-sort the results by clicking the column headers.
- Keywords: The candidates of potential keywords. You can specify what counts as a "token" via Token Settings.
- Frequency (in Reference File): The number of occurrences of the keyword in the reference file.
- Frequency (in Observed Files): The number of occurrences of the keyword in each observed file.
- Test Statistic: The test statistic of the significance test conducted on the keyword in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.
- p-value: The p-value of the significance test conducted on the keyword in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Bayes Factor: The Bayes factor of the significance test conducted on the keyword in each file. You can change the test of statistical significance via Generation Settings → Test of Statistical Significance. See Tests of Statistical Significance & Measures of Effect Size for more details. Note that the Bayes factor is not available for some tests of statistical significance.
- Effect Size: The effect size of the keyword in each file. You can change the measure of effect size via Generation Settings → Measure of Effect Size. See Tests of Statistical Significance & Measures of Effect Size for more details.
- Number of Files Found: The number of files in which the keyword appears at least once.
Supported Languages [Back to Contents]
Languages | Sentence Tokenization | Word Tokenization | Word Detokenization | POS Tagging | Lemmatization | Stop Words |
---|---|---|---|---|---|---|
Afrikaans | ⭕️ | ✔ | ⭕️ | ✖️ | ✖️ | ✔️ |
Albanian | ⭕️ | ✔ | ⭕️ | ✖️ | ✖️ | ✔️ |
Arabic | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Armenian | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Asturian | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
Azerbaijani | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Basque | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Bengali | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✖️ |
Breton | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Bulgarian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
Catalan | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Chinese (Simplified) | ✔ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
Chinese (Traditional) | ✔ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
Croatian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Czech | ✔ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Danish | ✔ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Dutch | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
English | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Esperanto | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Estonian | ✔ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
Finnish | ✔ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
French | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Galician | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
German | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Greek (Ancient) | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✔️ |
Greek (Modern) | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Hausa | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Hebrew | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Hindi | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Hungarian | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Icelandic | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
Indonesian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Irish | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
Italian | ✔ | ⭕️ | ⭕️ | ✔️ | ✔️ | ✔️ |
Japanese | ✔ | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ |
Kannada | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Kazakh | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Korean | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Kurdish | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Latin | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Latvian | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
Lithuanian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Malay | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Manx | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
Marathi | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Nepali | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Norwegian Bokmål | ✔ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Norwegian Nynorsk | ✔ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Persian | ⭕️ | ✔️ | ⭕️ | ✖️ | ✔️ | ✔️ |
Polish | ✔ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
Portuguese | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Romanian | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Russian | ⭕️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Scottish Gaelic | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
Sinhala | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Slovak | ⭕️ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Slovenian | ✔ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Sotho (Southern) | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Spanish | ✔ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Swahili | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Swedish | ✔ | ✔️ | ✔️ | ✖️ | ✔️ | ✔️ |
Tagalog | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Tajik | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✖️ |
Tamil | ⭕️ | ✔️ | ✔️ | ✖️ | ✖️ | ✔️ |
Tatar | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Telugu | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Thai | ✔ | ✔️ | ✔️ | ✔️ | ✖️ | ✔️ |
Tibetan | ⭕️ | ✔️ | ✔️ | ✔️ | ✔️ | ✖️ |
Turkish | ✔ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Ukrainian | ⭕️ | ✔️ | ⭕️ | ✔️ | ✔️ | ✔️ |
Urdu | ⭕️ | ✔️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Vietnamese | ✔ | ✔️ | ⭕️ | ✔️ | ✖️ | ✔️ |
Welsh | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✔️ | ✖️ |
Yoruba | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Zulu | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✔️ |
Other Languages | ⭕️ | ⭕️ | ⭕️ | ✖️ | ✖️ | ✖️ |
✔: Supported
⭕️: Supported but falls back to the default English tokenizer
✖️: Not supported
Supported Text Types [Back to Contents]
You can specify your custom POS/Non-POS tags via Menu → Preferences → Settings → Tags.
Text Types | Auto-detection |
---|---|
Untokenized / Untagged | ✔ |
Untokenized / Tagged (Non-POS) | ✔ |
Tokenized / Untagged | ✖ |
Tokenized / Tagged (POS) | ✔ |
Tokenized / Tagged (Non-POS) | ✖ |
Tokenized / Tagged (Both) | ✔ |
Supported File Types [Back to Contents]
File Types | File Extensions |
---|---|
Text Files | *.txt |
Microsoft Word Documents | *.docx |
Microsoft Excel Workbooks | *.xls, *.xlsx |
CSV Files | *.csv |
HTML Pages | *.htm, *.html |
Translation Memory Files | *.tmx |
Lyrics Files | *.lrc |
* Microsoft 97-03 Word documents (*.doc) are not supported.
* Non-text files are converted to text files before being added to the File Table. You can find the converted files in the Import folder at the installation location of Wordless (on macOS, right-click Wordless.app, select Show Package Contents and navigate to Contents/MacOS/Import/). You can change this location via Menu → Preferences → Settings → Import → Temporary Files → Default Path.
Supported File Encodings [Back to Contents]
Languages | File Encodings | Auto-detection |
---|---|---|
All Languages | UTF-8 Without BOM | ✔ |
All Languages | UTF-8 with BOM | ✔ |
All Languages | UTF-16 with BOM | ✔ |
All Languages | UTF-16 Big Endian Without BOM | ✖ |
All Languages | UTF-16 Little Endian Without BOM | ✖ |
All Languages | UTF-32 with BOM | ✖ |
All Languages | UTF-32 Big Endian Without BOM | ✖ |
All Languages | UTF-32 Little Endian Without BOM | ✖ |
All Languages | UTF-7 | ✖ |
All Languages | CP65001 | ✖ |
Arabic | CP720 | ✖ |
Arabic | CP864 | ✖ |
Arabic | ISO-8859-6 | ✔ |
Arabic | Mac OS Arabic | ✖ |
Arabic | Windows-1256 | ✔ |
Baltic Languages | CP775 | ✖ |
Baltic Languages | ISO-8859-13 | ✖ |
Baltic Languages | Windows-1257 | ✖ |
Celtic Languages | ISO-8859-14 | ✖ |
Central European | CP852 | ✔ |
Central European | ISO-8859-2 | ✔ |
Central European | Mac OS Central European | ✔ |
Central European | Windows-1250 | ✔ |
Chinese | GB18030 | ✔ |
Chinese | GBK | ✖ |
Chinese (Simplified) | GB2312 | ✖ |
Chinese (Simplified) | HZ | ✔ |
Chinese (Traditional) | Big-5 | ✔ |
Chinese (Traditional) | Big5-HKSCS | ✖ |
Chinese (Traditional) | CP950 | ✖ |
Croatian | Mac OS Croatian | ✖ |
Cyrillic | CP855 | ✔ |
Cyrillic | CP866 | ✔ |
Cyrillic | ISO-8859-5 | ✔ |
Cyrillic | Mac OS Cyrillic | ✔ |
Cyrillic | Windows-1251 | ✔ |
English | ASCII | ✔ |
English | EBCDIC 037 | ✖ |
English | CP437 | ✖ |
Esperanto/Maltese | ISO-8859-3 | ✔ |
European | HP Roman-8 | ✖ |
French | CP863 | ✖ |
German | EBCDIC 273 | ✖ |
Greek | CP737 | ✖ |
Greek | CP869 | ✖ |
Greek | CP875 | ✖ |
Greek | ISO-8859-7 | ✔ |
Greek | Mac OS Greek | ✖ |
Greek | Windows-1253 | ✔ |
Hebrew | CP856 | ✖ |
Hebrew | CP862 | ✖ |
Hebrew | EBCDIC 424 | ✖ |
Hebrew | ISO-8859-8 | ✔ |
Hebrew | Windows-1255 | ✔ |
Icelandic | CP861 | ✖ |
Icelandic | Mac OS Icelandic | ✖ |
Japanese | CP932 | ✔ |
Japanese | EUC-JP | ✔ |
Japanese | EUC-JIS-2004 | ✖ |
Japanese | EUC-JISx0213 | ✖ |
Japanese | ISO-2022-JP | ✔ |
Japanese | ISO-2022-JP-1 | ✖ |
Japanese | ISO-2022-JP-2 | ✖ |
Japanese | ISO-2022-JP-2004 | ✖ |
Japanese | ISO-2022-JP-3 | ✖ |
Japanese | ISO-2022-JP-EXT | ✖ |
Japanese | Shift_JIS | ✔ |
Japanese | Shift_JIS-2004 | ✖ |
Japanese | Shift_JISx0213 | ✖ |
Kazakh | KZ-1048 | ✖ |
Kazakh | PTCP154 | ✖ |
Korean | EUC-KR | ✖ |
Korean | ISO-2022-KR | ✔ |
Korean | JOHAB | ✖ |
Korean | UHC | ✔ |
Nordic Languages | CP865 | ✖ |
Nordic Languages | ISO-8859-10 | ✔ |
North European | ISO-8859-4 | ✔ |
Persian/Urdu | Mac OS Farsi | ✖ |
Portuguese | CP860 | ✖ |
Romanian | Mac OS Romanian | ✖ |
Russian | KOI8-R | ✔ |
South-Eastern European | ISO-8859-16 | ✔ |
Tajik | KOI8-T | ✖ |
Thai | CP874 | ✖ |
Thai | ISO-8859-11 | ✖ |
Thai | TIS-620 | ✔ |
Turkish | CP857 | ✖ |
Turkish | EBCDIC 1026 | ✖ |
Turkish | ISO-8859-9 | ✔ |
Turkish | Mac OS Turkish | ✖ |
Turkish | Windows-1254 | ✖ |
Ukrainian | CP1125 | ✖ |
Ukrainian | KOI8-U | ✖ |
Urdu | CP1006 | ✖ |
Vietnamese | CP1258 | ✖ |
Western European | EBCDIC 500 | ✖ |
Western European | CP850 | ✖ |
Western European | CP858 | ✖ |
Western European | CP1140 | ✖ |
Western European | ISO-8859-1 | ✔ |
Western European | ISO-8859-15 | ✔ |
Western European | Mac OS Roman | ✖ |
Western European | Windows-1252 | ✔ |
Supported Measures [Back to Contents]
The dispersion and adjusted frequency of a word in each file are calculated by first dividing the file into n sub-sections (5 by default) and counting the frequency of the word in each sub-section, denoted by F₁, F₂, F₃ ... Fₙ. The total frequency of the word in the file is denoted by F, and the mean of the frequencies over all sub-sections is denoted by F̄.
Then, the dispersion and adjusted frequency of the word are calculated as follows (see the sketch after the two tables below):
Measures of Dispersion | Formulas |
---|---|
Juilland's D [1] | $D = 1 - \frac{V}{\sqrt{n - 1}}$, where $V$ is the coefficient of variation of $F_1, F_2, F_3 \ldots F_n$ |
Carroll's D₂ [2] | $D_2 = \frac{H}{\log_2 n}$, where $H = -\sum_{i=1}^{n} \frac{F_i}{F} \log_2 \frac{F_i}{F}$ |
Lyne's D₃ [3] | $D_3 = 1 - \frac{\chi^2}{4F}$, where $\chi^2 = \sum_{i=1}^{n} \frac{(F_i - F/n)^2}{F/n}$ |
Rosengren's S [4] | $S = \frac{\left(\sum_{i=1}^{n} \sqrt{F_i}\right)^2}{nF}$ |
Zhang's Distributional Consistency [5] | $DC = \frac{\left(\sum_{i=1}^{n} \sqrt{F_i} / n\right)^2}{F / n}$ |
Gries's DP [6] | $DP = \frac{1}{2} \sum_{i=1}^{n} \left\lvert \frac{F_i}{F} - \frac{1}{n} \right\rvert$ |
Gries's DPnorm [6][7] | $DP_{norm} = \frac{DP}{1 - 1/n}$ |
Measures of Adjusted Frequency | Formulas |
---|---|
Juilland's U [1] | $U = D \cdot F$ |
Carroll's Um [2] | $U_m = F \cdot D_2 + (1 - D_2) \cdot \frac{F}{n}$ |
Rosengren's KF [4] | $KF = \frac{\left(\sum_{i=1}^{n} \sqrt{F_i}\right)^2}{n}$ |
Engwall's FM [8] | $FM = \frac{F \cdot R}{n}$, where R is the number of sub-sections in which the word appears at least once |
Kromer's UR [9] | $UR = \sum_{i=1}^{n} \left(\psi(F_i + 1) + C\right)$, where ψ is the digamma function and C is the Euler–Mascheroni constant |
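Below is a minimal sketch of Juilland's D and U under the sub-section scheme described above; the formulas follow the table as reconstructed here, rather than Wordless's exact code.

```python
import statistics

def juilland_d(freqs):
    # freqs holds the sub-section frequencies F1 ... Fn of one word
    n = len(freqs)
    mean = statistics.mean(freqs)

    if mean == 0:
        return 0

    # V is the coefficient of variation of the sub-section frequencies
    v = statistics.pstdev(freqs) / mean

    return 1 - v / (n - 1) ** 0.5

def juilland_u(freqs):
    return juilland_d(freqs) * sum(freqs)

freqs = [4, 2, 3, 0, 1]  # F1 ... F5
print(juilland_d(freqs), juilland_u(freqs))  # 0.6464..., 6.4644...
```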
To calculate the statistical significance, Bayes factor and effect size (except for Student's t-test (two-sample) and the Mann-Whitney U test) for two words in the same file (collocates) or for one specific word in two different files (keywords), two contingency tables must first be constructed: one for observed values and one for expected values (a worked example follows the keyword tables below).
As for collocates (in Collocation and Colligation):
Observed Values | Word 1 | Not Word 1 | Row Total |
---|---|---|---|
Word 2 | $O_{11}$ | $O_{21}$ | $O_{11} + O_{21}$ |
Not Word 2 | $O_{12}$ | $O_{22}$ | $O_{12} + O_{22}$ |
Column Total | $O_{11} + O_{12}$ | $O_{21} + O_{22}$ | $N$ |

Expected Values | Word 1 | Not Word 1 |
---|---|---|
Word 2 | $E_{11} = \frac{(O_{11} + O_{12})(O_{11} + O_{21})}{N}$ | $E_{21} = \frac{(O_{21} + O_{22})(O_{11} + O_{21})}{N}$ |
Not Word 2 | $E_{12} = \frac{(O_{11} + O_{12})(O_{12} + O_{22})}{N}$ | $E_{22} = \frac{(O_{21} + O_{22})(O_{12} + O_{22})}{N}$ |

- $O_{11}$: Number of occurrences of Word 1 followed by Word 2
- $O_{12}$: Number of occurrences of Word 1 followed by any word except Word 2
- $O_{21}$: Number of occurrences of any word except Word 1 followed by Word 2
- $O_{22}$: Number of occurrences of any word except Word 1 followed by any word except Word 2
As for keywords (in Keywords):
Observed Values | Observed File | Reference File | Row Total |
---|---|---|---|
Word w | $O_{11}$ | $O_{12}$ | $O_{11} + O_{12}$ |
Not Word w | $O_{21}$ | $O_{22}$ | $O_{21} + O_{22}$ |
Column Total | $O_{11} + O_{21}$ | $O_{12} + O_{22}$ | $N$ |

Expected Values | Observed File | Reference File |
---|---|---|
Word w | $E_{11} = \frac{(O_{11} + O_{21})(O_{11} + O_{12})}{N}$ | $E_{12} = \frac{(O_{12} + O_{22})(O_{11} + O_{12})}{N}$ |
Not Word w | $E_{21} = \frac{(O_{11} + O_{21})(O_{21} + O_{22})}{N}$ | $E_{22} = \frac{(O_{12} + O_{22})(O_{21} + O_{22})}{N}$ |

- $O_{11}$: Number of occurrences of Word w in the observed file
- $O_{12}$: Number of occurrences of Word w in the reference file
- $O_{21}$: Number of occurrences of all words except Word w in the observed file
- $O_{22}$: Number of occurrences of all words except Word w in the reference file
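Below is a minimal sketch of computing the log-likelihood ratio G² for a keyword from such a 2×2 contingency table; the counts are hypothetical and the code illustrates the formulas as reconstructed here, not Wordless's exact implementation.

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    g2 = 0

    for o, row_total, col_total in [
        (o11, o11 + o12, o11 + o21),
        (o12, o11 + o12, o12 + o22),
        (o21, o21 + o22, o11 + o21),
        (o22, o21 + o22, o12 + o22),
    ]:
        e = row_total * col_total / n  # Expected value of the cell

        if o > 0:
            g2 += 2 * o * math.log(o / e)

    return g2

# Word w occurs 150 times in a 10,000-token observed file
# and 100 times in a 20,000-token reference file
print(log_likelihood_ratio(150, 100, 9850, 19900))
```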
To conduct Student's t-test (two-sample) or the Mann-Whitney U test on a specific word, the observed file and the reference file are each first divided into n sub-sections (5 by default). The frequencies of the word in each sub-section of the observed file and the reference file are counted and denoted by FO₁, FO₂, FO₃ ... FOₙ and FR₁, FR₂, FR₃ ... FRₙ respectively. The total frequencies of the word in the observed file and the reference file are denoted by FO and FR respectively, and the means of the frequencies over all sub-sections are denoted by F̄O and F̄R respectively (see the sketch after the table of tests below).
Then the statistical significance, Bayes factor and effect size are calculated as follows:
Tests of Statistical Significance | Formulas |
---|---|
z-score [10][11] | $z = \frac{O_{11} - E_{11}}{\sqrt{E_{11} \left(1 - \frac{E_{11}}{N}\right)}}$ |
Student's t-test (One-sample) [12] | $t = \frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$ |
Student's t-test (Two-sample) [13] | $t = \frac{\bar{F}_O - \bar{F}_R}{\sqrt{\frac{s_O^2}{n} + \frac{s_R^2}{n}}}$, where $s_O^2$ and $s_R^2$ are the variances of the sub-section frequencies |
Pearson's Chi-squared Test [14][15] | $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ |
Log-likelihood Ratio [16] | $G^2 = 2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}$ |
Fisher's Exact Test [17] | See: Fisher's exact test - Wikipedia |
Mann-Whitney U Test [18] | See: Mann–Whitney U test - Wikipedia |
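Below is a minimal sketch of the two sub-section-based tests using SciPy; the sub-section frequencies are hypothetical.

```python
from scipy import stats

freqs_observed = [12, 15, 9, 14, 10]  # FO1 ... FO5
freqs_reference = [6, 8, 7, 5, 9]     # FR1 ... FR5

t, p_t = stats.ttest_ind(freqs_observed, freqs_reference)
u, p_u = stats.mannwhitneyu(freqs_observed, freqs_reference, alternative = 'two-sided')

print(f't = {t:.3f}, p = {p_t:.3f}')
print(f'U = {u:.1f}, p = {p_u:.3f}')
```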
Measures of Bayes Factor | Formulas |
---|---|
Student's t-test (Two-sample) [19] | $BIC = t^2 - \ln n$, where $2 \ln BF \approx BIC$ |
Log-likelihood Ratio [19] | $BIC = G^2 - \ln N$, where $2 \ln BF \approx BIC$ |
Measures of Effect Size | Formulas |
---|---|
Pointwise Mutual Information [20] | $PMI = \log_2 \frac{O_{11}}{E_{11}}$ |
Mutual Dependency [21] | $MD = \log_2 \frac{O_{11}^2}{E_{11} N}$ |
Log-Frequency Biased MD [21] | $LFMD = MD + \log_2 \frac{O_{11}}{N}$ |
Cubic Association Ratio [22] | $MI3 = \log_2 \frac{O_{11}^3}{E_{11} N^2}$ |
MI.log-f [23][24] | $PMI \times \ln(O_{11} + 1)$ |
Mutual Information [25] | $MI = \sum_{i,j} \frac{O_{ij}}{N} \log_2 \frac{O_{ij}}{E_{ij}}$ |
Squared Phi Coefficient [26] | $\phi^2 = \frac{(O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{21} + O_{22})(O_{11} + O_{21})(O_{12} + O_{22})}$ |
Dice's Coefficient [27] | $Dice = \frac{2 O_{11}}{(O_{11} + O_{12}) + (O_{11} + O_{21})}$ |
logDice [28] | $14 + \log_2 Dice$ |
Mutual Expectation [29] | $ME = \frac{O_{11}}{N} \times Dice$ |
Jaccard Index [25] | $\frac{O_{11}}{O_{11} + O_{12} + O_{21}}$ |
Minimum Sensitivity [30] | $\min\left(\frac{O_{11}}{O_{11} + O_{12}}, \frac{O_{11}}{O_{11} + O_{21}}\right)$ |
Poisson Collocation Measure [31] | $\frac{O_{11} (\ln O_{11} - \ln E_{11} - 1)}{\ln N}$ |
Kilgarriff's Ratio [32] | $\frac{10^6 \cdot \frac{O_{11}}{O_{11} + O_{21}} + \alpha}{10^6 \cdot \frac{O_{12}}{O_{12} + O_{22}} + \alpha}$, where α is the smoothing parameter, which is 1 by default. You can change the value of α via Menu → Preferences → Settings → Measures → Effect Size → Kilgarriff's Ratio → Smoothing Parameter. |
Odds Ratio [33] | $OR = \frac{O_{11} O_{22}}{O_{12} O_{21}}$ |
Log Ratio [34] | $\log_2 \frac{O_{11} / (O_{11} + O_{21})}{O_{12} / (O_{12} + O_{22})}$ |
Difference Coefficient [14][35] | $\frac{\frac{O_{11}}{O_{11} + O_{21}} - \frac{O_{12}}{O_{12} + O_{22}}}{\frac{O_{11}}{O_{11} + O_{21}} + \frac{O_{12}}{O_{12} + O_{22}}}$ |
%DIFF [36] | $\frac{\left(\frac{O_{11}}{O_{11} + O_{21}} - \frac{O_{12}}{O_{12} + O_{22}}\right) \times 100}{\frac{O_{12}}{O_{12} + O_{22}}}$ |
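Below is a minimal sketch of a few of the measures above, computed from the 2×2 contingency table in its collocation orientation; the counts are hypothetical and the formulas are as reconstructed here, not Wordless's exact code.

```python
import math

def effect_sizes(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    e11 = (o11 + o12) * (o11 + o21) / n

    pmi = math.log2(o11 / e11)
    dice = 2 * o11 / ((o11 + o12) + (o11 + o21))
    log_dice = 14 + math.log2(dice)

    return pmi, dice, log_dice

# Word 1 and Word 2 co-occur 30 times; Word 1 occurs 200 times and Word 2
# 150 times in a 100,000-token corpus
print(effect_sizes(30, 170, 120, 99680))  # (6.64..., 0.171..., 11.45...)
```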
Works Cited [Back to Contents]
[1] Juilland, Alphonse and Eugenio Chang-Rodriguez. Frequency Dictionary of Spanish Words, Mouton, 1964.
[2] Carroll, John B. "An alternative to Juilland’s usage coefficient for lexical frequencies and a proposal for a standard frequency index." Computer Studies in the Humanities and Verbal Behaviour, vol.3, no. 2, 1970, pp. 61-65.
[3] Lyne, A. A. "Dispersion." The Vocabulary of French Business Correspondence. Slatkine-Champion, 1985, pp. 101-24.
[4] Rosengren, Inger. "The quantitative concept of language and its relation to the structure of frequency dictionaries." Études de linguistique appliquée, no. 1, 1971, pp. 103-27.
[5] Zhang Huarui, et al. "Distributional Consistency: As a General Method for Defining a Core Lexicon." Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, 26-28 May 2004.
[6] Gries, Stefan Th. "Dispersions and Adjusted Frequencies in Corpora." International Journal of Corpus Linguistics, vol. 13, no. 4, 2008, pp. 403-37.
[7] Lijffijt, Jefrey and Stefan Th. Gries. "Correction to Stefan Th. Gries' 'Dispersions and Adjusted Frequencies in Corpora'." International Journal of Corpus Linguistics, vol. 17, no. 1, 2012, pp. 147-49.
[8] Engwall, Gunnel. "Fréquence Et Distribution Du Vocabulaire Dans Un Choix De Romans Français." Dissertation, Stockholm University, 1974.
[9] Kromer, Victor. "A Usage Measure Based on Psychophysical Relations." Journal of Quantitative Linguistics, vol. 10, no. 2, 2003, pp. 177-186.
[10] Dennis, S. F. "The Construction of a Thesaurus Automatically from a Sample of Text." Proceedings of the Symposium on Statistical Association Methods for Mechanized Documentation, Washington, D.C., 17 March 1964, edited by Stevens, M. E., et al., National Bureau of Standards, 1965, pp. 61-148.
[11] Berry-Rogghe, Godelieve L. M. "The Computation of Collocations and their Relevance in Lexical Studies." The Computer and Literary Studies, edited by Aitken, A. J., Edinburgh UP, 1973, pp. 103-112.
[12] Church, Kenneth Ward, et al. "Using Statistics in Lexical Analysis." Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, edited by Uri Zernik, Psychology Press, 1991, pp. 115-64.
[13] Paquot, Magali and Yves Bestgen. "Distinctive Words in Academic Writing: A Comparison of Three Statistical Tests for Keyword Extraction." Language and Computers, vol. 68, 2009, pp. 247-269.
[14] Hofland, Knut and Stig Johansson. Word Frequencies in British and American English. Norwegian Computing Centre for the Humanities, 1982.
[15] Oakes, Michael P. Statistics for Corpus Linguistics. Edinburgh UP, 1998.
[16] Dunning, Ted Emerson. "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics, vol. 19, no. 1, Mar. 1993, pp. 61-74.
[17] Pedersen, Ted. "Fishing for Exactness." Proceedings of the South-Central SAS Users Group Conference, 27-29 Oct. 1996, Austin.
[18] Kilgarriff, Adam. "Comparing Corpora." International Journal of Corpus Linguistics, vol. 6, no. 1, Nov. 2001, pp. 232-263.
[19] Wilson, Andrew. "Embracing Bayes Factors for Key Item Analysis in Corpus Linguistics." New Approaches to the Study of Linguistic Variability, edited by Markus Bieswanger and Amei Koll-Stobbe, Peter Lang, 2013, pp. 3-11.
[20] Church, Kenneth Ward and Patrick Hanks. "Word Association Norms, Mutual Information, and Lexicography." Computational Linguistics, vol. 16, no. 1, Mar. 1990, pp. 22-29.
[21] Thanopoulos, Aristomenis, et al. "Comparative Evaluation of Collocation Extraction Metrics." Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, 29-31 May 2002, edited by Rodríguez, Manuel González Rodríguez and Carmen Paz Suarez Araujo, European Language Resources Association, May 2002, pp. 620-25.
[22] Daille, Béatrice. "Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering." UCREL Technical Papers, vol. 5, University of Lancaster, 1995.
[23] Kilgarriff, Adam and David Tugwell. "Word Sketch: Extraction and Display of Significant Collocations for Lexicography." Proceedings of the ACL 2001 Collocations Workshop, Toulouse, 2001, pp. 32–38.
[24] "Statistics used in Sketch Engine." Sketch Engine, https://www.sketchengine.eu/documentation/statistics-used-in-sketch-engine/. Accessed 26 Nov 2018.
[25] Dunning, Ted Emerson. "Finding Structure in Text, Genome and Other Symbolic Sequences." Dissertation, U of Sheffield, 1998. arXiv, arxiv.org/pdf/1207.1847.pdf.
[26] Church, Kenneth Ward and William A. Gale. "Concordances for Parallel Text." Using Corpora: Seventh Annual Conference of the UW Centre for the New OED and Text Research, St. Catherine's College, 29 Sept - 1 Oct 1991, UW Centre for the New OED and Text Research, 1991.
[27] Smadja, Frank, et al. "Translating Collocations for Bilingual Lexicons: A Statistical Approach." Computational Linguistics, vol. 22, no. 1, 1996, pp. 1-38.
[28] Rychlý, Pavel. "A Lexicographer-Friendly Association Score." Proceedings of the Second Workshop on Recent Advances in Slavonic Natural Language Processing, Karlova Studanka, 5-7 Dec. 2008, edited by Sojka, P. and A. Horák, Masaryk U, 2008, pp. 6-9.
[29] Dias, Gaël. "Language Independent Automatic Acquisition of Rigid Multiword Units from Unrestricted Text Corpora." Proceedings of Conférence Traitement Automatique des Langues Naturelles, Cargèse, 12-17 July 1999, edited by Mitkov, Ruslan and Jong C. Park, 1999, pp. 333-39.
[30] Pedersen, Ted. "Dependent Bigram Identification." Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, 26-30 July 1998, American Association for Artificial Intelligence, 1998, p. 1197.
[31] Quasthoff, Uwe and Christian Wolff. "The Poisson Collocation Measure and Its Applications." Proceedings of 2nd International Workshop on Computational Approaches to Collocations, Wien, Austria, 2002.
[32] Kilgarriff, Adam. "Simple Maths for Keywords." Proceedings of Corpus Linguistics Conference, Liverpool, 20-23 July 2009, edited by Mahlberg, M., et al., U of Liverpool, July 2009.
[33] Pojanapunya, Punjaporn and Richard Watson Todd. "Log-likelihood and Odds Ratio Keyness Statistics for Different Purposes of Keyword Analysis." Corpus Linguistics and Linguistic Theory, vol. 15, no. 1, Jan. 2016, pp. 133-67.
[34] Hardie, Andrew. "Log Ratio: An Informal Introduction." The Centre for Corpus Approaches to Social Science, http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/.
[35] Gabrielatos, Costas. "Keyness Analysis: Nature, Metrics and Techniques." Corpus Approaches to Discourse: A Critical Review, edited by Taylor, Charlotte and Anna Marchi, Routledge, 2018.
[36] Gabrielatos, Costas and Anna Marchi. "Keyness: Appropriate Metrics and Practical Issues." Proceedings of CADS International Conference, U of Bologna, 13-14 Sept. 2012.