>>> import nltk
>>> nltk.download()
...
Identifier> treebank
- Generate ground-truth boundary labels from the Penn Treebank under treebank/:
python convert_boundary.py --path TARGET_PATH --threshold MIN_TOKENS
- End-to-end training, testing, and evaluation on NYU HPC clusters:
sbatch ptb_pipe.sbt
- Tuning configurations: modify hierarchical-rnn/config.yml
- Relax, wait, and collect pickled output(s)
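As a rough illustration of what the boundary-label step produces, the sketch below turns one bracketed parse into per-token 1/0 indicators. It is not the code in convert_boundary.py: the helper names `parse_bracketed` and `boundary_labels` are hypothetical, labels are assumed to be per token, and the rule "mark the last token of every constituent spanning at least `min_tokens` tokens" is only a guess at what `--threshold MIN_TOKENS` controls.

```python
def parse_bracketed(text):
    """Parse one PTB-style bracketed tree into nested (label, children) tuples.

    Assumes every bracket carries a label, e.g. "(NP (DT the) (NN cat))".
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def boundary_labels(tree, min_tokens=2):
    """Return (words, labels); labels[i] is 1 when a constituent covering
    at least min_tokens words ends at word i (assumed rule), else 0."""
    words, labels = [], []

    def visit(node):
        if isinstance(node, str):
            words.append(node)
            labels.append(0)
            return 1
        size = sum(visit(child) for child in node[1])
        if size >= min_tokens:
            labels[-1] = 1            # the constituent ends at the latest word
        return size

    visit(tree)
    return words, labels


sentence = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
words, labels = boundary_labels(parse_bracketed(sentence), min_tokens=2)
print(list(zip(words, labels)))
```

Raising `min_tokens` drops the boundaries of short constituents, so the label sequence becomes sparser.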
- F1 score of HM-RNN boundary detection:
- (finished) Convert PTB parses to 1/0 boundary indicators and use them as ground-truth boundaries
- (finished) Test trained HM-LSTM models on PTB, and store layer-wise indicators
- (finished) Calculate F1 scores of the HM-LSTM boundary indicators for selected layers (TODO: plot fancy figures)
- (finished) Calculate BPC (bits per character, the LM evaluation metric) for these HM-LSTM models on PTB
- Train more models; compare how F1 and BPC correlate and trend
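The two metrics in this list can be sketched as follows. This is a minimal illustration under assumptions, not the repository's evaluation code: `boundary_f1` and `bits_per_character` are hypothetical names, boundaries are assumed to be aligned 1/0 sequences of equal length, and BPC is taken as the mean negative base-2 log-probability the model assigns to each true character.

```python
import math


def boundary_f1(gold, pred):
    """F1 of predicted boundaries (1s) against gold boundaries (1s)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def bits_per_character(char_probs):
    """Mean negative log2 probability of the true characters."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


gold = [0, 1, 0, 0, 0, 1]       # e.g. labels derived from a PTB parse
pred = [0, 1, 1, 0, 0, 0]       # e.g. one layer's thresholded indicators
print(boundary_f1(gold, pred))
print(bits_per_character([0.5, 0.25]))
```

Computing F1 per layer keeps the layers comparable, since each layer's indicators are scored against the same gold sequence.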
- Statistically analyze with PCFG from PTB:
- (finished) Compute PCFGs from PTB
- Pick the model whose HM-LSTM boundary indicators are most syntactically meaningful (highest F1 score)
- Find out whether, and which, constituents detected by HM-LSTM boundaries coincide with those in the PCFGs
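For reference, a PCFG can be estimated from parsed sentences by relative frequency: count each production and divide by the total count of its left-hand side. The sketch below shows that idea on two toy trees written as nested `(label, children)` tuples. The function names are hypothetical, and this is not necessarily how the repository computes its PCFGs.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) productions from a nested (label, children) tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def estimate_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for tree in trees for p in productions(tree))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {prod: n / lhs_totals[prod[0]] for prod, n in counts.items()}


# Toy trees for illustration only, not taken from PTB.
t1 = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]), ("VP", [("VBD", ["sat"])])])
t2 = ("S", [("NP", [("NN", ["dogs"])]), ("VP", [("VBD", ["ran"])])])
pcfg = estimate_pcfg([t1, t2])
for (lhs, rhs), prob in sorted(pcfg.items()):
    print(lhs, "->", " ".join(rhs), prob)
```

High-probability productions give a list of frequent constituent types against which detected boundary spans could be compared.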
- QA on the children's book dataset:
- (finished) Set up data preprocessing and the pipeline into the HM-LSTM model
- (finished) Tune to improve test precision
- Replace the self-trained embedding nets with pre-trained GloVe word embeddings
- Beat the baseline performance of a vanilla LSTM
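For the GloVe item, the sketch below shows one way pre-trained vectors could be read and lined up with a vocabulary. It assumes the plain-text GloVe layout (one word per line followed by space-separated floats). `load_glove` and `embedding_matrix` are hypothetical helpers, and mapping out-of-vocabulary words to zero vectors is an arbitrary choice made here for illustration.

```python
import io


def load_glove(lines):
    """Read 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue                  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embedding_matrix(vocab, vectors, dim):
    """One row per vocabulary word; unknown words get a zero vector (assumption)."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]


# A two-line stand-in for a real GloVe file; in practice pass an open file handle.
fake_file = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\n")
vectors = load_glove(fake_file)
matrix = embedding_matrix(["cat", "unicorn"], vectors, dim=2)
print(matrix)
```

The resulting rows could then initialize the model's embedding layer, either frozen or fine-tuned during training.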