This repository is a collection of all the chatbot datasets I have used. I will continue to upload my preprocessed datasets to this repo.
The sources of these datasets are listed below (sizes are given in W, i.e. units of 10,000):
- OpenSubtitles (~100-180W): 2009 version, source
- Counted OpenSubtitles: OpenSubtitles sorted by one-to-many input-output pairs.
- Cornell Movie-Dialogs (~22W): source
- Chinese_Movie (~240W): source
- PTT_Gossiping (~30W): [source](https://github.com/zake7749/Gossiping-Chinese-Corpus)
Below are my self-defined synthetic tasks for conditional sequence learning. The two versions use different splits of the train, dev, and test sets (a minimal generation sketch follows the list).
- counting: details
- sequence:
  - definition: continue the input sequence with a random length.
  - e.g., given input <1,2,3>, the set of possible outputs is {<4, 5, ..., N> | N >= 4}
- addition:
  - definition: randomly split the input sequence into two segments; then add the two segments.
  - e.g., given input <1,2,3>:
    - if we split it into 1 and 23, the output is 24
    - if we split it into 12 and 3, the output is 15
    - both answers are in the set of possible outputs, {<2,4>, <1,5>}
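
For concreteness, here is a minimal Python sketch of how examples for the sequence and addition tasks could be generated. The function names, the `max_value` bound, and the digit-level output encoding are illustrative assumptions, not the repository's actual generation code.

```python
import random

def make_sequence_example(seq, max_value=20):
    """Sequence task: continue `seq` with a random number of consecutive integers."""
    start = seq[-1] + 1
    # Random continuation length; max_value is an assumed upper bound.
    end = random.randint(start, max(start, max_value))
    return seq, list(range(start, end + 1))

def make_addition_example(seq):
    """Addition task: randomly split `seq` into two segments and add them."""
    cut = random.randint(1, len(seq) - 1)            # split point, e.g. <1,2,3> -> 1 | 23
    left = int("".join(map(str, seq[:cut])))
    right = int("".join(map(str, seq[cut:])))
    return seq, [int(d) for d in str(left + right)]  # output is the digit sequence of the sum

if __name__ == "__main__":
    print(make_sequence_example([1, 2, 3]))   # e.g. ([1, 2, 3], [4, 5, 6, 7])
    print(make_addition_example([1, 2, 3]))   # either ([1, 2, 3], [2, 4]) or ([1, 2, 3], [1, 5])
```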