Question-Answer Dataset Format? #38

szh-max · 2023-06-05T10:26:07Z

Hello, are you using the model for question answering, and if so, what format should the dataset be in?

I am using the retro model, and the dataset I created is:

question: Where is the Alberta Basin located?
answer: It is located in western Canada, between latitudes 49° to 60°.

question: What is the area of the Alberta Basin?
answer: The area is approximately 748,889 square kilometers.

question: What type of basin is the Alberta Basin?
answer: It is a foreland basin.

However, generation with the following code seems to go wrong:
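import torch

# `tokenizer` and `wrapper` are assumed to be defined earlier (setup not shown)
prefix = 'Where is the Alberta Basin'
# Encode the prefix without special tokens and add a batch dimension
prompt = torch.LongTensor(tokenizer.encode(prefix, add_special_tokens=False)).unsqueeze(0)
sampled = wrapper.generate(prompt, filter_thres = 0.1, temperature = 0.1) # (1, <2049) terminates early if all
print(sampled)
print('#######')
# Decode the sampled token ids back to text
print(tokenizer.decode(sampled.squeeze(), skip_special_tokens=True))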

The output is garbled: (Where is the Alberta Basin: 。 question : 。 question : ？ answer : : 。 question : 。 question : 。 question : : 。 question : 。 question : : 。 question : 。 question : : 。 question : 。 question : : 。 question :stion : 。 question : 。 question :ion : : : 。 question : 。 question : : 。 question : : 的？ answer : 。 question : : 。 question : 。 question : 。 question : 。 question : 。 question : : : 。 question : 。 question :stion : 。 question : 。 question : 。）
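For reference, here is a minimal sketch of the dataset layout described above, with a prompt built using the same `question:` / `answer:` labels. The helpers `parse_qa` and `build_prompt` are hypothetical names used only for illustration and are not part of the retro library. Whether matching the prompt to this layout changes the generated output has not been confirmed.

```python
# Hypothetical helpers for illustration only; not part of the retro library.
SAMPLE = """\
question: Where is the Alberta Basin located?
answer: It is located in western Canada, between latitudes 49° to 60°.

question: What is the area of the Alberta Basin?
answer: The area is approximately 748,889 square kilometers.
"""


def parse_qa(text):
    """Parse 'question:' / 'answer:' line pairs into (question, answer) tuples."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("question:"):
            question = line[len("question:"):].strip()
        elif line.startswith("answer:") and question is not None:
            pairs.append((question, line[len("answer:"):].strip()))
            question = None
    return pairs


def build_prompt(question):
    """Format a prompt with the same field labels as the dataset text."""
    return f"question: {question}\nanswer:"


pairs = parse_qa(SAMPLE)
prompt_text = build_prompt(pairs[0][0])
print(prompt_text)
```

The resulting `prompt_text` string could be passed to `tokenizer.encode` in place of the bare prefix used above.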
Thank you!
