Some questions on the processed dataset in LongBench #72

Closed
jiqimaoke opened this issue Aug 21, 2024 · 1 comment

Comments

@jiqimaoke

In the original HotpotQA dataset, each passage is roughly 100-150 tokens long. In LongBench, however, some samples in hotpotqa.jsonl have a combined passage length of over 10,000 tokens, averaging around 1,000 tokens per passage. How were passages this long produced?
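
A rough way to check per-passage lengths like these is sketched below. It assumes that passages inside a sample's context string are introduced by markers such as "Passage 1:" on their own line, and it counts whitespace-separated words as a stand-in for tokens; both are assumptions for illustration, and `passage_lengths` is a hypothetical helper, not part of LongBench.

```python
import re


def passage_lengths(context: str) -> list[int]:
    """Return the whitespace-token count of each passage in a context string."""
    # Assumption: every passage starts with a marker such as "Passage 1:\n".
    parts = re.split(r"Passage \d+:\n", context)
    # Drop empty pieces, e.g. the empty string before the first marker.
    return [len(part.split()) for part in parts if part.strip()]


# Tiny made-up context, only to show the call.
context = (
    "Passage 1:\nArcade is a supervillain.\n"
    "Passage 2:\nJohn Byrne is an artist."
)
print(passage_lengths(context))  # [4, 5]
```

Word counts will differ from real tokenizer counts, so this only gives a rough picture of how long each passage is.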

For example, take the second-to-last sample in hotpotqa.jsonl (the question is "Which artist is known for his work on Marvel Team-Up and Batman: Son of the Demon?"). Passage 1 reads:

Arcade (Marvel Comics)\nArcade is a supervillain appearing in American comic books published by Marvel Comics. He first appeared in 1978's Marvel Team-Up #65, the creation of writer Chris Claremont and writer/artist John Byrne. The character is a combination of an evil genius and a hitman who carries out his assassinations via various elaborate traps, often referred to as Murderworld.\nArcade's first intended victims were Spider-Man and Captain Britain but since Arcade's Murderworld games always leave the outcome up to chance, the duo defeated Arcade and escaped with their lives. Over the years, Arcade has targeted a multitude of Marvel heroes, often focusing on the X-Men and associated members of X-Factor, X-Force, and Excalibur. In what is considered the "game changer" for Arcade, Avengers Arena, he managed to kidnap 16 superpowered teens and forced them to kill each other for survival in his latest version of Murderworld; unlike most Murderworld schemes, this endeavor yielded several casualties.\nArcade has appeared in a number of other Marvel properties outside of comic books, in X-Men: Evolution voiced by Gabe Khouth, and in the Ultimate Spider-Man animated series voiced by Eric Bauza. He has also appeared as one of the main villains in a number of video games......

However, in the original dataset (hotpot_dev_distractor_v1.json, hotpot_dev_fullwiki_v1.json), the content is:

["Arcade (Marvel Comics)",["Arcade is a fictional supervillain appearing in American comic books published by Marvel Comics."," He first appeared in 1978's \"Marvel Team-Up\" (vol."," 1) #65, the creation of writer Chris Claremont and writer/artist John Byrne."," The character is a combination of evil genius and hitman who carries out his assassinations via various elaborate traps, often referred to as his \"Murderworld\"."]]

Where does the content after "often referred to as Murderworld." in the example come from?

@bys0318
Member

bys0318 commented Sep 6, 2024

Because the original HotpotQA inputs are not long enough, we additionally select irrelevant documents and concatenate them to the original input sample.
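
To illustrate the idea, here is a minimal sketch of padding a sample with irrelevant documents until it reaches a target length. It is not the actual LongBench preprocessing code: the function name `pad_with_distractors`, the whitespace-based token count, and the choice to append distractors in order are all assumptions made for this example.

```python
def pad_with_distractors(
    passages: list[str],
    distractors: list[str],
    target_tokens: int,
) -> list[str]:
    """Append irrelevant documents until the sample reaches target_tokens.

    Hypothetical helper: token counts are approximated by whitespace splitting.
    """
    padded = list(passages)  # copy, so the caller's list is left unchanged
    total = sum(len(p.split()) for p in padded)
    for doc in distractors:
        if total >= target_tokens:
            break  # the sample is already long enough
        padded.append(doc)
        total += len(doc.split())
    return padded


# Made-up data, only to show the call.
original = ["Arcade is a supervillain."]
pool = ["Unrelated text one.", "More unrelated text here.", "Never reached."]
print(pad_with_distractors(original, pool, target_tokens=10))
```

A real pipeline might also shuffle the passages or sample the distractors at random; the sketch leaves that open.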
