This project began as my term project for a computational linguistics course at Pitt. It later developed into a research project, and I am working on publishing the work.
Man Ho Wong (m.wong@pitt.edu), University of Pittsburgh.
April 24, 2022
This project investigates the relationship between early vocabulary development in children from different socio-economic backgrounds and their mothers' child-directed speech (CDS). Lexical semantic networks for child speech (CS) and CDS were constructed from individual files in a dataset collected from CHILDES (see Data sources).
For the original project plan, please see project_plan.md. progress_report.md documents the development of this project, and progress_presentation.pdf summarizes the progress at the end of the Spring 2022 semester.
The final report submitted to the course LING 1340/2340 is available as final_report.md.
The guestbook for the project can be found here.
./
|---code/ # code for data processing/analysis
| |---etc/
| | |---PyLangAcq_notes.ipynb
| | |---pittchat.py
| |
| |---data_curation.ipynb
| |---data_preprocessing.ipynb
| |---exploratory_analysis.ipynb
| |---pylangacq_license.txt
| |---vocabulary_analysis.ipynb
|
|---data/ # processed and unprocessed data
| |---data_samples/ # data samples
|
|---reports/ # reports and presentation
| |---images/ # images used in the final report
| |---final_report.md
| |---progress_report.md
| |---progress_presentation.pdf
|
|---.gitignore
|---LICENSE.md
|---project_plan.md
|---README.md # YOU ARE HERE
The following scripts form the pipeline for data processing and analysis. Each generates the data required by the next script. They should be executed in the same sequence as listed:
- data_curation.ipynb (nbviewer) curates the datasets from CHILDES needed for this project.
- data_preprocessing.ipynb (nbviewer) integrates the curated datasets and cleans the data before analysis.
- exploratory_analysis.ipynb (nbviewer) explores what kinds of linguistic analysis can be done with the curated data.
- vocabulary_analysis.ipynb (nbviewer) examines the characteristics of semantic networks in children of different SES groups.
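The actual network construction lives in the notebooks above. As a simplified, hypothetical illustration of the general idea behind a lexical network (not the exact method used in vocabulary_analysis.ipynb), a co-occurrence network can be sketched in plain Python, where nodes are words and an edge links two words that appear in the same utterance:

```python
from itertools import combinations
from collections import Counter

def build_cooccurrence_network(utterances):
    """Build a toy lexical network as a Counter of word-pair edges.
    An edge (w1, w2) is counted once per utterance in which both words occur."""
    edges = Counter()
    for utterance in utterances:
        # Deduplicate and sort so each pair has a canonical (alphabetical) order
        words = sorted(set(utterance.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges

# Toy child-directed speech sample (hypothetical data, not from CHILDES)
cds = [
    "look at the dog",
    "the dog is big",
    "big dog",
]
network = build_cooccurrence_network(cds)
print(network[("big", "dog")])  # "big" and "dog" co-occur in 2 utterances
```

A Counter of pairs is enough for a sketch; in practice a graph library such as NetworkX (used in this project) makes it easy to compute network measures like degree or clustering on the same edge list.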
The code is written in Python 3.9.7. For easy sharing, scripts are organized into Jupyter notebooks (see above).
Viewing: You can view the notebooks either on GitHub or on nbviewer.org.
Running: To run the code, you will need a Jupyter Notebook interface. You can also run the code on Google Colab.
Below is a list of required libraries and packages that are not included in the Python Standard Library, along with the versions tested in this project:
- Gensim (4.1.2)
- Matplotlib (3.4.3)
- NumPy (1.20.3)
- Pandas (1.3.4)
- PyLangAcq (0.16.0)
- NLTK (3.6.5)
- NetworkX (2.6.3)
- scikit-learn (0.24.2)
- Tqdm (4.62.3) (optional; shows a progress bar during processing)
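Assuming a standard pip setup, the dependencies can be installed in one step (the package names below are the PyPI names, pinned to the versions tested above):

```shell
pip install gensim==4.1.2 matplotlib==3.4.3 numpy==1.20.3 pandas==1.3.4 \
    pylangacq==0.16.0 nltk==3.6.5 networkx==2.6.3 scikit-learn==0.24.2 \
    tqdm==4.62.3
```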
The corpus data used in this project was downloaded from the CHILDES database:
MacWhinney, B. (2000). The CHILDES Project: Tools for analyzing talk. Third Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
See this page for more information.
This project also used data containing semantic vectors from ConceptNet Numberbatch 19.08, by Luminoso Technologies, Inc. You may redistribute or modify the data under a compatible Share-Alike license.
The following Python package was used in this project for processing CHAT files:
Lee, Jackson L., Ross Burkholder, Gallagher B. Flinn, and Emily R. Coppess. 2016. Working with CHAT transcripts in Python. Technical report TR-2016-02, Department of Computer Science, University of Chicago.
Github repo: https://github.com/jacksonllee/pylangacq
The package is licensed under the MIT License. See pylangacq_license.txt for more information.
The non-code parts of the project are licensed under Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0). See LICENSE-non_code.md for more information.
The rest of the project is licensed under the GNU General Public License Version 3 (GPLv3). See LICENSE.md for more information.
I would like to thank my instructors and fellow students of the course Data Science for Linguists for their help and valuable input. I would also like to express my special thanks to Prof. Na-Rae Han for helping me review the course Introduction to Computational Linguistics, which I missed last semester due to other commitments. Both courses helped me develop better computational thinking for working with large linguistic data.