Is PeQa a knight? A samurai? A robot? No! It is a MASSIVE Persian Question Answering dataset! :)
The PeQa dataset is built from 14 million Persian tweets collected from Twitter, meticulously processed into a rich collection of roughly 420,000 question-answer pairs, which makes it a valuable resource for chatbots and other question-answering projects. Although the full collection spans over 14 million tweets, about 400,000 cleaned pairs are currently published in this repository in .CSV format for researchers' use.
| | Total | Unique Questions | Unique Answers |
|---|---|---|---|
| Data Count | 436,072 | 127,752 | 308,320 |
The complete collection of 14 million tweets is also available for free.
Each data pair consists of a tweet and its relevant reply, which are treated as a question and its answer. Because PeQa is built from tweets and replies rather than explicit questions and answers, it contains somewhat different linguistic patterns than a typical question-answering dataset; in particular, a question may be answered by another question.
Question (Tweet) | Answer (Reply) |
---|---|
اتفاقا تازه سوپ خوردم (Actually, I just had soup) | نوش جان ، چه سوپی ؟ (Enjoy! What kind of soup?) |
دانشگاهها هم تعطیل شدند (The universities were closed too) | نشدند ، تکذیب شد (They weren't; it was denied) |
استعفا داده ظریف ؟ (Has Zarif resigned?) | قبلا استعفا داده بود وقتی اقای بشار اسد اومده بود ایران (He had already resigned once, when Mr. Bashar al-Assad came to Iran) |
چرا پنج تا ؟ (Why five?) | پ چندتا ؟ (Then how many?) |
پادکست گوش نمیدین ؟ (Don't you listen to podcasts?) | تا الان این کارو نکردم (I haven't so far) |
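For a quick look at the published pairs, they can be loaded with pandas. Below is a minimal sketch; the file name and column layout are assumptions, so adjust them to the actual .CSV in this repository.

```python
import pandas as pd

# NOTE: "peqa.csv" is a placeholder file name for illustration;
# use the actual .CSV file and column headers shipped in this repository.
df = pd.read_csv("peqa.csv")

print(len(df))      # total number of question-answer pairs
print(df.head())    # inspect the first few rows (e.g. question / answer columns)
```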
- The raw data is too large to host in this repository; you can download it from here.
- We have trained a baseline model based on transformers to validate the dataset. You can also try it out on Google Colab.
- PeQa paper coming soon!
- PeQa Blog Post
The testbench requires the PyTorch framework. There are multiple ways to install it:
- Pip:
pip install torch==[version]
- Anaconda:
conda install python=3.6 pytorch torchvision -c pytorch
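A quick sanity check that PyTorch is installed correctly (and whether a GPU is visible):

```python
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA-capable GPU can be used
```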
Encoder Computation Graph:
1. Convert word indexes to embeddings.
2. Pack padded batch of sequences for the RNN module.
3. Forward pass through GRU.
4. Unpack padding.
5. Sum bidirectional GRU outputs.
6. Return output and final hidden state.
Inputs:
- `input_seq`: batch of input sentences; shape = (max_length, batch_size)
- `input_lengths`: list of sentence lengths corresponding to each sentence in the batch; shape = (batch_size)
- `hidden`: hidden state; shape = (n_layers x num_directions, batch_size, hidden_size)
Outputs:
- `outputs`: output features from the last hidden layer of the GRU (sum of bidirectional outputs); shape = (max_length, batch_size, hidden_size)
- `hidden`: updated hidden state from the GRU; shape = (n_layers x num_directions, batch_size, hidden_size)
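The steps and shapes above map onto a small PyTorch module. The sketch below is a minimal illustration of a bidirectional GRU encoder following the listed computation graph; the class and argument names are illustrative and may differ from the notebook's actual code.

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """Bidirectional GRU encoder (illustrative sketch)."""
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        # Embedding dimension is assumed to equal hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout),
                          bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # 1. Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # 2. Pack padded batch of sequences for the RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # 3. Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # 4. Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # 5. Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # 6. Return output and final hidden state
        return outputs, hidden
```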
Decoder Computation Graph:
1. Get embedding of the current input word.
2. Forward through the unidirectional GRU.
3. Calculate attention weights from the current GRU output from (2).
4. Multiply the attention weights by the encoder outputs to get a new "weighted sum" context vector.
5. Concatenate the weighted context vector and the GRU output.
6. Predict the next word (without softmax).
7. Return output and final hidden state.
Inputs:
- `input_step`: one time step (one word) of the input sequence batch; shape = (1, batch_size)
- `last_hidden`: final hidden layer of the GRU; shape = (n_layers x num_directions, batch_size, hidden_size)
- `encoder_outputs`: the encoder model's output; shape = (max_length, batch_size, hidden_size)
Outputs:
- `output`: softmax-normalized tensor giving the probability of each word being the correct next word in the decoded sequence; shape = (batch_size, voc.num_words)
- `hidden`: final hidden state of the GRU; shape = (n_layers x num_directions, batch_size, hidden_size)
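These decoder steps can likewise be sketched as an attention module plus a one-step GRU decoder. The sketch below assumes dot-product (Luong-style) attention and applies a final softmax to match the output description above; the notebook's actual scoring function and class names may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attn(nn.Module):
    """Dot-product attention over encoder outputs (illustrative sketch)."""
    def forward(self, hidden, encoder_outputs):
        # Score each encoder time step against the current decoder output
        attn_energies = torch.sum(hidden * encoder_outputs, dim=2)   # (max_length, batch_size)
        attn_energies = attn_energies.t()                            # (batch_size, max_length)
        # Normalize scores into attention weights
        return F.softmax(attn_energies, dim=1).unsqueeze(1)          # (batch_size, 1, max_length)

class AttnDecoderRNN(nn.Module):
    """Unidirectional GRU decoder with attention; runs one time step per call."""
    def __init__(self, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.attn = Attn()

    def forward(self, input_step, last_hidden, encoder_outputs):
        # 1. Get embedding of the current input word
        embedded = self.embedding_dropout(self.embedding(input_step))
        # 2. Forward through the unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # 3. Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # 4. Multiply attention weights by encoder outputs to get the context vector
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # 5. Concatenate the weighted context vector and the GRU output
        concat_input = torch.cat((rnn_output.squeeze(0), context.squeeze(1)), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # 6. Predict the next word; softmax yields the probabilities described above
        output = F.softmax(self.out(concat_output), dim=1)
        # 7. Return output and final hidden state
        return output, hidden
```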
Run the `Configure training/optimization` block if you want to train the model. First we set the training parameters, then we initialize our optimizers, and finally we call the `trainIters` function to run the training iterations.
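As a rough sketch of what that block does, the snippet below sets some hyperparameters and initializes optimizers for the encoder and decoder classes sketched earlier. The hyperparameter values, the choice of Adam, and the `trainIters` call shown in the final comment are assumptions and may differ from the notebook.

```python
import torch.nn as nn
import torch.optim as optim

# Illustrative hyperparameters (placeholders, not necessarily the notebook's values)
voc_num_words = 20000        # vocabulary size
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
learning_rate = 0.0001
decoder_learning_ratio = 5.0

# Shared embedding plus the encoder/decoder sketched above
embedding = nn.Embedding(voc_num_words, hidden_size)
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = AttnDecoderRNN(embedding, hidden_size, voc_num_words, decoder_n_layers, dropout)

# Put modules in training mode and initialize the optimizers
encoder.train()
decoder.train()
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)

# Finally, call the notebook's trainIters function; its exact signature is not
# reproduced here, so this call is indicative only:
# trainIters(voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, n_iteration)
```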
Contributions, issues, and feature requests are welcome!
Feel free to check the issues page.
Give a ⭐️ if you like this project!
We would like to acknowledge both MUT DeepLearning Lab and MUT NLP lab for their financial support.
This project is MIT licensed.