All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Handled the duration mark (e.g., [# 0.4]) in utterance cleaning.
- Added support for Python 3.12.
- Handled pre-clitics and post-clitics from %mor tiers and honored their distinction in the parsed utterance.
- Added support for Python 3.11.
- Updated the test data from Brown's Eve from the upstream CHILDES.
- Dropped support for Python 3.7.
- Added the exclude_switch option for MLU (mlu(), mlum(), and mluw()), so that words with @s for switching language may be excluded.
- Fixed MLU computation (mlu(), mlum(), and mluw()):
  - If xxx, yyy, or www appears in an utterance, the whole utterance is ignored.
  - If there are no MLU-relevant words/morphemes in an utterance, the whole utterance is ignored.
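The two rules above, together with the exclude_switch option, can be illustrated with a minimal word-based sketch. This is not the package's implementation: the function name mluw_sketch, the punctuation set used to decide which tokens count, and the simple "@s" substring test are all assumptions made for illustration.

```python
from typing import Iterable, List

# Markers for unintelligible or untranscribed material, as named in the changelog.
UNINTELLIGIBLE = {"xxx", "yyy", "www"}

# Assumption for this sketch: punctuation tokens do not count toward MLU.
PUNCTUATION = {".", "?", "!", ","}


def mluw_sketch(utterances: Iterable[List[str]], exclude_switch: bool = False) -> float:
    """Mean length of utterance in words (hypothetical helper, not the library API)."""
    lengths = []
    for utterance in utterances:
        # Rule 1: any xxx / yyy / www makes the whole utterance ineligible.
        if any(word in UNINTELLIGIBLE for word in utterance):
            continue
        words = [word for word in utterance if word not in PUNCTUATION]
        if exclude_switch:
            # Assumption: a language-switched word carries an "@s" marker.
            words = [word for word in words if "@s" not in word]
        # Rule 2: an utterance with nothing left to count is ignored.
        if not words:
            continue
        lengths.append(len(words))
    return sum(lengths) / len(lengths) if lengths else 0.0


sample = [
    ["I", "want", "xxx", "."],  # ignored: contains xxx
    ["more", "cookie", "."],    # 2 words
    ["."],                      # ignored: nothing to count
    ["hola@s", "there", "."],   # 2 words, or 1 with exclude_switch=True
]
print(mluw_sketch(sample))
print(mluw_sketch(sample, exclude_switch=True))
```

Ignored utterances drop out of the denominator as well as the numerator, so they do not pull the mean toward zero.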
- Moved the download_and_extract_brown test function to under the pylangacq package namespace, as tests from BaseTestCHATReader require downloaded CHAT data files.
- Restructured the repository to use top-level src/ and tests/ directories.
- Removed setup.py.
- Moved BaseTestCHATReader back under the pylangacq package namespace so that downstream packages can import BaseTestCHATReader for testing.
- Reader objects can now be concatenated by the addition operator +.
- Implemented the head, tail, and info methods at Reader.
- Added support for Python 3.10.
- Turned on Windows testing on CircleCI.
- Added pyproject.toml. Related to prioritizing setup.cfg for specifying build metadata and options.
- The to_strs and to_chat methods of a Reader object return tabulated outputs by default.
- Prioritized to_chat for the single file output use case.
- Unzipping CHAT data now uses less memory.
- Switched to setup.cfg to fully specify build metadata and options, while keeping a minimal setup.py for backward compatibility. Related to the new pyproject.toml.
- Switched the Sphinx docs theme from sphinx-rtd-theme to furo.
- Dropped support for Python 3.6.
- Turned on safety and bandit checks at CircleCI builds.
- Reader.from_zip (also read_chat) now keeps the downloaded ZIP archive in a non-temporary directory for possible re-use.
  - Added the kwarg use_cached in Reader.from_zip, so that we use the cached data by default for the same input URL, and that we can force re-downloading by setting use_cached to False.
  - Added the kwarg session in Reader.from_zip, in case using a customized requests.Session instance is desired. session also makes it possible to write tests for the new kwarg use_cached.
  - Added the helper functions cached_data_info and remove_cached_data.
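The caching behavior described for use_cached can be pictured with a small standard-library sketch. It is only an illustration under stated assumptions: cached_fetch, its fetch callable, and the hash-based file naming are hypothetical and do not reflect how the package stores or looks up its cache.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(
    url: str,
    fetch: Callable[[str], bytes],
    cache_dir: str,
    use_cached: bool = True,
) -> bytes:
    """Return the bytes for `url`, re-using a local copy when allowed.

    Hypothetical helper: `fetch` stands in for the real download step, so
    that the caching logic can be exercised without network access.
    """
    # Assumption: one cache file per URL, named by a hash of the URL.
    cache_path = Path(cache_dir) / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if use_cached and cache_path.exists():
        return cache_path.read_bytes()
    # Either nothing is cached yet or the caller forced a re-download.
    data = fetch(url)
    cache_path.write_bytes(data)
    return data
```

Passing the download step in as an argument plays the same role that the session kwarg is described as playing above: a test can substitute a fake and count how often a download actually happens.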
- Reader has the new to_strs method that yields CHAT data strings.
- Reader has the new to_chat method that exports data to local files.
- CHAT parsing for the header information is now more robust for varying whitespace characters between the head and its associated value.
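As an illustration of whitespace-tolerant header parsing, the sketch below splits a header line at its first colon and accepts any run of whitespace (tab or spaces) before the value. The function name parse_header_line is hypothetical and the regular expression is an assumption for this example, not the parser used by the package.

```python
import re
from typing import Optional, Tuple

# "@Name:" followed by any amount of whitespace, then the value.
_HEADER_LINE = re.compile(r"@([^:]+):\s*(.*)")


def parse_header_line(line: str) -> Optional[Tuple[str, str]]:
    """Split a CHAT header line into (name, value); None if it has no value part."""
    match = _HEADER_LINE.match(line.strip())
    if match is None:
        return None
    name, value = match.groups()
    return name.strip(), value.strip()


print(parse_header_line("@Languages:\teng"))
print(parse_header_line("@Languages:   eng , fra "))
print(parse_header_line("@Begin"))
```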
- Dropped kwarg allow_remote in Reader.from_zip. This kwarg wouldn't make any sense anymore, or at least would be confusing with the introduction of use_cached.
- The header/metadata has a more reasonable representation for emptiness when input data is empty.
- Added the parallel optional argument to the Reader methods {from_zip, from_dir, from_files, from_strs} so that parallelization can be turned off if desired.
- Added the filter method to Reader for filtering data by file paths.
- The methods append, append_left, extend, and extend_left now work with a subclass of Reader, not just Reader itself.
- Fixed utterance cleaning so that it is now compatible with all CHILDES datasets.
- Fixed a CHAT parsing issue when correction and repetition are combined.
API-breaking changes:
The Reader class has been completely rewritten.
A couple of methods have been removed, while others have been renamed.
For methods that remain (renamed or not),
their output data structures and allowed arguments have changed.
The details follow.
- New classmethods of Reader for reader instantiation:
  - from_zip
  - from_dir
- New classes to better structure CHAT data:
  - Utterance
  - Token
  - Gra
- New Reader methods:
  - append_left, extend, extend_left, pop, pop_left
  - tokens (which gives Token objects, essentially the "tagged words" from before)
- In the header dictionary, each participant's info has the new key "dob" for date of birth (if the info is available in the CHAT header). The corresponding value is a datetime.date object. (The same info was previously exposed as the Reader method date_of_birth, now removed.)
- The test suite now covers code snippets in both the docstrings and .rst doc files.
- CHAT parsing in Reader instantiation has been completely rewritten. The previous private class _SingleReader has been removed. This private class duplicated a lot of the Reader code, which made it hard to make changes.
- The Reader rewrite has also greatly sped up the reading and parsing of CHAT data.
- The by_files argument, which many Reader methods have, now gives you a simpler list of results for each data file, no longer the previous output of a dict that mapped a file path to the file's result.
- The participant argument, which many Reader methods have for specifying which participants' data to include in the output, has been renamed as participants to avoid confusion. There is no change to its behavior of handling either a single string (e.g., "CHI") or a collection of strings (e.g., {"CHI", "MOT"}).
- The following Reader methods have been renamed as indicated, some for stylistic or Pythonic reasons, others for reasons as given:
  - age -> ages
  - number_of_utterances -> n_utterances
  - number_of_files -> n_files
  - filenames -> file_paths
  - MLU -> mlu
  - MLUm -> mlum
  - MLUw -> mluw
  - TTR -> ttr
  - IPSyn -> ipsyn
  - word_frequency -> word_frequencies
  - from_chat_str -> from_strs
  - from_chat_files -> from_files
  - add -> append. Since the data files in a Reader have a natural ordering (by time of recording sessions, and therefore commonly by file paths as well), a reader is list-like rather than an unordered set of data files, which add would suggest.
  - participant_codes -> participants. Before this version, the methods participant_codes (for CHI, MOT, etc) and participants (for, say, Eve, Mother, Investigator, etc) co-existed, but in practice we mostly only care about CHI, MOT, etc. So the method participants for Eve etc has been removed, and participant_codes has been renamed as participants.
- Each participant's info in a header dictionary has these keys renamed:
  - participant_name -> name
  - participant_role -> role
  - SES -> ses (socioeconomic status)
- The class DependencyGraph has been made private (i.e., now _DependencyGraph with a leading underscore). Its functionality hasn't really changed (it's used in the computation of IPSyn). It may be made more visible again in the future if more functionality related to grammatical relations is developed in the package.
- Switched to sphinx-rtd-theme as the documentation theme.
- Switched to CircleCI orbs; updated dev requirements' versions.
- The following Reader methods have been deprecated:
  - tagged_sents (use tokens with by_utterances=True instead)
  - tagged_words (use tokens with by_utterances=False instead)
  - sents (use words with by_utterances=True instead)
- The following methods of the Reader class have been removed:
  - abspath. Use file_paths instead.
  - index_to_tiers. All the unparsed tiers are now available from utterances.
  - participant_codes. It's been renamed as participants, another method now removed; see "Changed" above.
  - part_of_speech_tags
  - update and remove. A reader is a list-like collection of CHAT data files, not a set (which update and remove would suggest).
  - search and concordance. To search, use one of the words, tokens, and utterances methods to walk through a reader's CHAT data and keep track of elements of interest.
  - date_of_birth. The info is now available under headers, in each participant's "dob" key.
- Handled [/-] in cleaning utterances.
- [x <number>] means a repetition of the previous word/item, not repetition of the entire utterance.
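To make the [x <number>] reading concrete, here is a minimal sketch that expands the marker by repeating only the item immediately before it, where that item is either a single word or an angle-bracketed group. The function name expand_repetitions and the regular expression are assumptions for illustration; the package's own utterance cleaning handles many more annotations than this.

```python
import re

# The repeated item is an angle-bracketed group or a single word,
# followed by a marker such as "[x 3]".
_REPETITION = re.compile(r"(<[^>]+>|\S+) \[x (\d+)\]")


def expand_repetitions(utterance: str) -> str:
    """Repeat the item preceding each [x N] marker N times (hypothetical helper)."""

    def _expand(match: "re.Match[str]") -> str:
        item, count = match.group(1), int(match.group(2))
        if item.startswith("<") and item.endswith(">"):
            item = item[1:-1]  # drop the grouping brackets
        return " ".join([item] * count)

    return _REPETITION.sub(_expand, utterance)


print(expand_repetitions("no [x 3] more ."))
print(expand_repetitions("<I want> [x 2] juice ."))
```

Only the word or group directly before the marker is repeated; the rest of the utterance is left untouched.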
- Added support for Python 3.9.
- Enabled black to enforce styling consistency.
- Started testing Python 3.7 and 3.8 on continuous integration. (#9)
- Added time marker support (available at _SingleReader), originally contributed at #3 by @hellolzc. (#8)
- Switched from Travis CI to CircleCI for autobuilds. (#9)
- Switched README from reStructuredText to Markdown. (#9)
- Removed conversational quotes in utterance processing; updated test CHAT file to match the latest CHILDES data. (#7)
- Dropped support for Python 2.7, 3.4, and 3.5. All code related to Python 2+3 cross compatibility was removed. (#9)
- Fixed unicode handling across Python 2 and 3
- Renamed method find_filename of Reader as abspath.
- Fixed bug in Reader method decorators
- Handled multiple dates of recording in one CHAT file. The method dates_of_recording of a Reader instance now returns a list of dates.
- Implemented the exclude parameter in various Reader methods for excluding specific participants.
- Fixed bug in IPSyn.
- Python 2 and 3 cross compatibility
- Renamed the grammar.py module as dependency.py
  - Rewrote the class DependencyGraph; it no longer subclasses networkx's DiGraph (and networkx has been removed as a dependency of this library)
- Removed multiprocessing in reading data files. Datasets are usually small enough that the performance gain, if any, wouldn't be worth the potential issues w.r.t. spawning multiple processes.
- Developed capabilities for handling the %pho and %mod tiers in PhonBank data
- Improved clean_utterance()
- Added parameter encoding in read_chat()
- Added get_lemma_from_mor()
- Added date_of_recording() and date_of_birth(); removed date()
- Added clean_word()
- Restricted get_IPSyn() to only the first 100 utterances
- Added tests
- Library now compatible only with Python 3.4 or above
- For class Reader:
  - Defined read_chat() for initializing a Reader object
  - Added parameter by_files to various methods; removed the "all_" methods
  - Added reader manipulation methods: update(), add(), remove(), clear()
  - Added parameter sorted_by_age in filenames()
  - Added parameter month in age()
  - Added word_ngrams()
  - Added find_filename()
  - Added language development measures: MLUm(), MLUw(), TTR(), IPSyn()
  - Added search() and concordance()
  - Allowed regular expression matching for parameter participant
  - Added output formats for dependency graphs: to_tikz() and to_conll()
  - Distinguished participant_name and participant_role in metadata
  - The @Languages header contents are now treated as a list rather than a set, to preserve ordering in bi/multilingualism
  - Undid collapses in transcriptions such as [x 4]
  - Various bug fixes
- Added part_of_speech_tags() in SingleReader
- Added "all X" methods in Reader
- Bug fixes: clean_utterance(), DependencyGraph
- cha_lines optimized
- Methods added: tagged_words(), words(), tagged_sents(), sents()
- Tier detection revamped. tier_sniffer() method removed, with self.tier_markers in SingleReader now being a set of %-tier markers.
- len() for SingleReader added
- word_frequency() for SingleReader added
- Module grammar added, with class DependencyGraph being set up
- Static methods in classes pulled out
- New utterances() method for extracting utterances from transcripts
- _clean_utterance method developed for filtering CHAT annotations away in utterances
- Standardizing terminology: use "participant(s)" consistently instead of "speaker(s)"
- New number_of_utterances() method for both Reader and SingleReader
- To avoid confusion, metadata() method is removed.
- Extraction of utterances and tiers with dict index_to_tiers
- Class Reader can read multiple .cha files. The methods associated with Reader are mostly a dict mapping from an absolute-path filename to something. Reader depends on the class SingleReader for a single CHAT file.
- Following the conventional CHILDES and CHAT terminology, the metadata() method in Reader is renamed headers() (though a "new" metadata() method is defined and points to headers() for convenience).
- new methods for class Reader: languages(), date(), participants(), participant_codes()
- first commit; set up the chat submodule
- class Reader defined for reading CHAT files, with methods cha_lines(), metadata(), and age()