sttr

Calculate STTR (standardized type-token ratio) on tokenized text with metadata using Python.

Requirements

Tested using Python 3.7.

Pandas

pip install pandas

Or using Pipenv:

pipenv install
pipenv shell

Usage

Run the run_sttr.py script, specifying one or more corpus directories (datadirs) and any optional parameters.

You may also use the pre-defined sttr run command when using pipenv (i.e. replace all python run_sttr.py incantations with pipenv run sttr).

Example

python run_sttr.py /path/to/corpus/dir

The above command will look under /path/to/corpus/dir for all directories that have a groups.csv or metadata.csv file and try to extract the specified filenames from the Tokenized, Lemmatized, POS, POS_Tri, UniversalPOS, and UniversalPOS_Tri directories (if present). For each corpus and folder (Tokenized/Lemmatized/...) combination, a results_CORPUSNAME_TYPE.tsv will be generated containing calculated measures.
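The discovery step described above can be sketched roughly as follows. This is an illustration only, not the code in run_sttr.py: the function name find_corpora and the recursive search are assumptions, and only the file and folder names are taken from this README.

```python
from pathlib import Path

# File and folder names as listed in this README.
METADATA_FILES = ("groups.csv", "metadata.csv")
DEFAULT_TYPES = (
    "Tokenized", "Lemmatized", "POS", "POS_Tri",
    "UniversalPOS", "UniversalPOS_Tri",
)


def find_corpora(root):
    """Yield (corpus_dir, present_type_folders) for each corpus under root.

    Hypothetical helper: a directory counts as a corpus if it contains
    groups.csv or metadata.csv. Whether the real script searches
    recursively is an assumption.
    """
    root = Path(root)
    candidates = [root] + sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in candidates:
        if any((directory / name).is_file() for name in METADATA_FILES):
            # Keep only the type folders that actually exist.
            present = [t for t in DEFAULT_TYPES if (directory / t).is_dir()]
            yield directory, present
```

Each yielded pair corresponds to one corpus and the folder types for which a results file could be produced.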

Finally, a merged_results_CORPUSNAME1+CORPUSNAME2+...tsv file will be generated containing the merged results from all corpora.
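As a rough sketch of what merging per-corpus results can look like with Pandas: the helper name merge_results, the added corpus column, and the column names in the usage example are hypothetical and not taken from run_sttr.py.

```python
import pandas as pd


def merge_results(frames_by_corpus):
    """Stack per-corpus result tables into one DataFrame.

    Hypothetical helper: tags every row with its corpus name. Columns
    that are missing in one corpus end up as NaN in the merged table.
    """
    frames = [
        df.assign(corpus=name) for name, df in frames_by_corpus.items()
    ]
    return pd.concat(frames, ignore_index=True, sort=False)
```

A possible usage, writing tab-separated output with missing values rendered as NA:

```python
a = pd.DataFrame({"filename": ["x.txt"], "sttr": [0.5], "genre": ["novel"]})
b = pd.DataFrame({"filename": ["y.txt"], "sttr": [0.4]})
merged = merge_results({"A": a, "B": b})
print(merged.to_csv(sep="\t", index=False, na_rep="NA"))
```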

An example run on the whole project, with extended metadata:

python run_sttr.py --meta 'author,genre,brow,narrative_perspective,year' ~/Dropbox/Complexity/Corpora/*

This calculates Yule's K, STTR, and associated length measures for every corpus directory under ~/Dropbox/Complexity/Corpora. The author, genre, brow, narrative_perspective, and year metadata fields are also extracted from the groups.csv file and merged into the merged_results_....tsv file at the end. Missing metadata is output as NA.
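For readers unfamiliar with the two measures, here is a minimal sketch using their textbook definitions. It is not the implementation in run_sttr.py: the function names are hypothetical, and the 1000-token window is only a common convention, since this README does not state which segment size the script uses.

```python
from collections import Counter


def sttr(tokens, window=1000):
    """Standardized type-token ratio: mean TTR over full-size segments.

    Assumption: a trailing segment shorter than `window` is discarded.
    Returns None if the text is shorter than one window.
    """
    chunks = [
        tokens[i:i + window]
        for i in range(0, len(tokens) - window + 1, window)
    ]
    if not chunks:
        return None
    return sum(len(set(chunk)) / window for chunk in chunks) / len(chunks)


def yules_k(tokens):
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2.

    N is the number of tokens and V_m the number of types that occur
    exactly m times.
    """
    n = len(tokens)
    if n == 0:
        return None
    type_freqs = Counter(tokens)             # type -> frequency
    spectrum = Counter(type_freqs.values())  # m -> V_m
    s2 = sum(m * m * v_m for m, v_m in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)
```

Because STTR averages over equally sized segments, it is less sensitive to text length than a plain type-token ratio, which is why it suits comparisons across texts of different lengths.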

Advanced usage

The full command-line usage, as printed by python run_sttr.py -h:

usage: run_sttr.py [-h] [--check-only] [--meta META_FIELDS] [-t TYPES] [-p]
                   [-f FIELD]
                   datadirs [datadirs ...]

calculates sttr

positional arguments:
  datadirs            directory with data in csv files

optional arguments:
  -h, --help          show this help message and exit
  --check-only        do a pass through all specified corpus directories to
                      make sure they conform to project standards
  --meta META_FIELDS  specify metadata fields in CSV to use as categorical
                      features, optional, (default='Brow'); Format: specify as
                      CSV string
  -t TYPES            specify folders to use (Tokenized or POS etc.),
                      optional, (default='Tokenized,Lemmatized,POS,POS_Tri,Uni
                      versalPOS,UniversalPOS_Tri')
  -p                  remove punctuation, optional, (default='False')
  -f FIELD            use delimited field number to extract chosen unit
                      (token/POS/lemma/...), optional, (default='0' (the first
                      field))
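To make the -f and -p options more concrete, the sketch below shows one way such preprocessing could work. Everything here is an assumption for illustration: the tab delimiter, the function names, and the definition of punctuation (ASCII punctuation only) are not taken from run_sttr.py.

```python
import string


def extract_unit(item, field=0, delimiter="\t"):
    """Return the chosen field of a delimited item (assumed delimiter)."""
    return item.split(delimiter)[field]


def is_punctuation(unit):
    """True if the unit consists solely of ASCII punctuation characters."""
    return bool(unit) and all(ch in string.punctuation for ch in unit)


def prepare(items, field=0, remove_punct=False, delimiter="\t"):
    """Pick one field per item (like -f) and optionally drop punctuation (like -p)."""
    units = [extract_unit(item, field, delimiter) for item in items]
    if remove_punct:
        units = [u for u in units if not is_punctuation(u)]
    return units
```

With items such as "dog\tNOUN", field 0 would select the token and field 1 the tag; the resulting list is what the measures would then be computed on.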

Note that you may specify multiple corpora on the command line like below:

python run_sttr.py /path/to/corpus/dirs/* /path/to/other/corpus
