
Capstone Design (CSI4101), 2020 Spring Semester

"Effective Ways to Select Dataset from Large Corpus"

Team: 'input.txt' (E-mail: input.txt.2020@gmail.com)

  1. Joochan Kim (Leader, Code, Idea)
  2. Dobreva Iva (Code)
  3. Yujin Kim (Documentation, Presentation)

The idea originated from Prof. Jinyeong Yeo @ Convei Lab, Yonsei Univ.

Assistance was provided by Gayeon Lee @ Convei Lab, Yonsei Univ.

Most NLP datasets are so large that developers spend a great deal of time and money training models. To reduce this burden, we propose an approach that selects a smaller subset of the data, lessening the time and cost required while improving performance.


Model

- CEDR: Contextualized Embeddings for Document Ranking

- TIM_PLUS: Two-phase Influence Maximization (the subset-selection idea behind it is sketched below)
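
To give a feel for how influence maximization can select a training subset, here is a minimal sketch using plain Monte Carlo greedy selection over a document-similarity graph. This is a naive baseline, not TIM+'s reverse-reachable-set sampling, and every name in it is illustrative rather than the repository's actual API:

```python
import random
import networkx as nx

def simulate_spread(graph, seeds, prob=0.1, trials=100):
    """Monte Carlo estimate of expected spread under the independent cascade model."""
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            node = frontier.pop()
            for neighbor in graph.neighbors(node):
                # Each inactive neighbor activates with probability `prob`.
                if neighbor not in active and random.random() < prob:
                    active.add(neighbor)
                    frontier.append(neighbor)
        total += len(active)
    return total / trials

def greedy_seed_selection(graph, k):
    """Greedily pick k seed documents with the largest marginal influence gain."""
    seeds = []
    for _ in range(k):
        base = simulate_spread(graph, seeds)
        best_node, best_gain = None, float("-inf")
        for node in graph.nodes():
            if node in seeds:
                continue
            gain = simulate_spread(graph, seeds + [node]) - base
            if gain > best_gain:
                best_node, best_gain = node, gain
        seeds.append(best_node)
    return seeds

# Toy document-similarity graph over 50 document IDs.
doc_graph = nx.erdos_renyi_graph(50, 0.1)
print(greedy_seed_selection(doc_graph, 5))
```

The seed documents found this way are the ones whose "influence" covers the most of the corpus, which is the intuition behind keeping them in the reduced training set.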

Dataset

- Robust04: the TREC 2004 Robust track document collection, used for the retrieval task


How to Run?

  1. Download the models and dataset, then unzip them.
  2. Run graph/graph-generator.py to build the graph (change the data location at line 105 to /filename.pkl; see README.md).
  3. Run TIM_PLUS on step 2's output (see its README).
  4. Copy step 3's output into seed.txt.
  5. Run /graph/create-set.py (requires the data .pkl and seed.txt; see README.md).
  6. Run /Robust-Ranker-Master/main.py (see its README).
  7. Compare MAP! ^-^ (A driver-script sketch for these steps appears below.)
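
As referenced in the list above, here is a hypothetical driver sketch for steps 2-6. The script paths come from this README, but the exact arguments and the TIM_PLUS invocation are assumptions; consult each component's own README for the real options:

```python
import subprocess

# Step 4 (copying the seed nodes into seed.txt) is a manual edit
# and is not automated here.
steps = [
    ["python", "graph/graph-generator.py"],      # step 2: build the document graph
    ["./TIM_PLUS"],                              # step 3: two-phase influence maximization
    ["python", "graph/create-set.py"],           # step 5: build the reduced training set
    ["python", "Robust-Ranker-Master/main.py"],  # step 6: train the ranker and report MAP
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # abort the pipeline if any stage fails
```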

Result

Dataset size: 110,000 -> 50,000 examples

Training time: 11 h -> 6 h
