-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning Instruct-Prompt Version For OpenAI EVAL #470
Conversation
This is a instruct form of logiqa test data
LogiQA 2.0 dataset section that is specifically collected after 2021, the tests are released in 2022 and later. GPT3.5 turbo did them horribly, only achieves 38% accuracy. Manual tests on GPT4 yield 54 correct answers out of 112 questions (48.21%). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this eval! We've approved the PR and will merge. We’ll also add GPT-4 API access to your account.
…gical Reasoning Instruct-Prompt Version For OpenAI EVAL (openai#470) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨 __PLEASE READ THIS__: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples. ## Eval details 📑 ### Eval name [LogiQA] ### Eval description [LogiQA is a multi-choice machine reading dataset released in 2020, and in 2023 we updated the dataset to version 2. The dataset is collected from Logical reasoning exams with excellent annotation quality. Our experiments showed that it is challenging even for Large Language Models like GPT-3. Fine-tuning methods perform even worse. Up to now, the SOTA score on the dataset is 55% accuracy. The dataset is used by many researchers in their probing tasks and is an important benchmark for logical reasoning. For this OpeanAI Eval, we transformed 1572 instances from the test set into an instruct-prompt scheme.] ### What makes this a useful eval? [Logical reasoning ability is a fascinating topic in the NLP community. We hope to see if GPT4 sheds more light on this topic.] ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100) If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgement We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted. ### Submit eval - [x] I have filled out all required fields in the evals PR form - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Sigatoka disease drastically reduces the yield of banana trees and is epidemic throughout the areas of the world where bananas are grown. The fungus that causes the disease can be controlled with fungicides, but the fungicides can pose a health hazard to people living nearby. The fungicides are thus unsuitable for small banana groves in populated areas. Fortunately, most large banana plantations are in locations so isolated that fungicides can be used safely there. Therefore, most of the world' s banana crop is not seriously threatened by Sigatoka disease."}, {"role": "system", "content": "Question: Which one of the following is an assumption on which the argument depends?"}, {"role": "user", "content": "Option 1: Sigatoka disease is the only disease that threatens bananas on a worldwide scale."}, {"role": "user", "content": "Option 2: Most of the banana trees that have not been exposed to the Sigatoka fungus grow in small banana groves."}, {"role": "user", "content": "Option 3: Large plantations produce most or all of the world's bananas."}, {"role": "user", "content": "Option 4: Sigatoka disease is the only disease that threatens bananas on a worldwide scale."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: At present, there are many books such as Ten Keys to Success in the book market. Publishers marketed these books as books that would actually help readers achieve great success. In fact, almost everyone knows that great success is destined to belong to a minority, and people cannot all become one of the minority through books. In this regard, the exaggerated and even false claims made by publishers cannot be considered unethical. To say the least, even if one believes the publisher's false claims, it is not immoral to make such claims as long as reading such books does more good than harm to one's success."}, {"role": "system", "content": "Question: Which of the following conclusions best fits the above argument?"}, {"role": "user", "content": "Option 1: Deliberately making false propaganda is immoral only when it has no positive effect"}, {"role": "user", "content": "Option 2: Deliberate propaganda of this kind is only immoral if people are deceived and suffer from it"}, {"role": "user", "content": "Option 3: If the deliberate disinformation is made to profit at the expense of the deceived, then the deliberate disinformation is immoral"}, {"role": "user", "content": "Option 4: Deliberately making false propaganda is immoral only when it has no positive effect"}], "ideal": "2"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Attorney for Ziegler: My client continued to do consulting work between the time of his arrest for attempted murder and the start of this trial. But I contend that Ziegler was insane at the time that he fired the shot. This is the only reasonable conclusion to draw from the fact that the accusers have submitted no evidence that he was sane at the time he pulled the trigger, only that he was sane some time after he did so."}, {"role": "system", "content": "Question: Which one of the following most accurately describes a flaw in the reasoning of Ziegler's attorney?"}, {"role": "user", "content": "Option 1: It presumes that being a well-educated professional is relevant to being guilty or innocent."}, {"role": "user", "content": "Option 2: It fails to consider that Ziegler might have been insane when he worked as a consultant."}, {"role": "user", "content": "Option 3: It fails to consider the possibility that Ziegler's being sane after the shooting is an indication that he was sane at the time of the shooting."}, {"role": "user", "content": "Option 4: It presumes that being a well-educated professional is relevant to being guilty or innocent."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: It is proposed to allow the sale, without prescription, of a medication that physicians currently prescribe to treat the common ear inflammation called swimmer' s ear. The principal objection is that most people lack the expertise for proper self-diagnosis and might not seek medical help for more serious conditions in the mistaken belief that they have swimmer' s ear. Yet in a recent study, of 1, 000 people who suspected that they had swimmer' s ear, 84 percent had made a correct diagnosis -- a slightly better accuracy rate than physicians have in diagnosing swimmer' s ear. Thus, clearly, most people can diagnose swimmer' s ear in themselves without ever having to consult a physician."}, {"role": "system", "content": "Question: Which one of the following, if true, most undermines the conclusion?"}, {"role": "user", "content": "Option 1: Cases in which swimmer's ear progresses to more serious infections are very rare."}, {"role": "user", "content": "Option 2: For many people who develop swimmer's ear, the condition disappears without medical or pharmaceutical intervention."}, {"role": "user", "content": "Option 3: Physicians who specialize in ear diseases are generally able to provide more accurate diagnoses than those provided by general practitioners."}, {"role": "user", "content": "Option 4: Cases in which swimmer's ear progresses to more serious infections are very rare."}], "ideal": "4"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: All any reporter knows about the accident is what the press agent has said. Ttherefore, if the press agent told every reporter everything about the accident, then no reporter knows any more about it than any other reporter. If no reporter knows any more about the accident than any other reporter, then no reporter can scoop all of the other reporters. However, the press agent did not tell every reporter everything about the accident. It follows that some reporter can scoop all of the other reporters."}, {"role": "system", "content": "Question: The argument's reasoning is flawed because the argument fails to recognize that which one of the following is consistent with the facts the argument presents?"}, {"role": "user", "content": "Option 1: The press agent may not know any more about the accident than the most knowledgeable reporter."}, {"role": "user", "content": "Option 2: No reporter knows any more about the accident than any other reporter."}, {"role": "user", "content": "Option 3: Even if some reporter knows more about the accident than all of the other reporters, that reporter need not scoop any other reporter."}, {"role": "user", "content": "Option 4: The press agent may not know any more about the accident than the most knowledgeable reporter."}], "ideal": "2"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Crowdsourcing refers to the practice of a company or organization to delegate tasks traditionally performed by employees to the general public."}, {"role": "system", "content": "Question: Which of the following is not crowdsourcing?"}, {"role": "user", "content": "Option 1: A toy company has been encouraging and sponsoring users to participate in its design work. From robotic control systems to building block kits, the company has had fairly good results."}, {"role": "user", "content": "Option 2: A detergent company often posts its own R & D projects on major websites, soliciting solutions, and promises to give certain rewards for solutions."}, {"role": "user", "content": "Option 3: In the past three years, a real estate company has handed over all the daily maintenance of computers, networks and peripherals to a computer company."}, {"role": "user", "content": "Option 4: A toy company has been encouraging and sponsoring users to participate in its design work. From robotic control systems to building block kits, the company has had fairly good results."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Social risk refers to the risk of loss of social production and people's life due to the actions of individuals or groups."}, {"role": "system", "content": "Question: Which of the following is not a social risk?"}, {"role": "user", "content": "Option 1: Larceny."}, {"role": "user", "content": "Option 2: Robbery."}, {"role": "user", "content": "Option 3: Frost disaster."}, {"role": "user", "content": "Option 4: Larceny."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: A manager is hoping to reach a certain target for camera sales in his store, which sells between 10 and 20 cameras a week. Typically, most cameras sold in any week are the less expensive economy models, and his store has sold relatively fewer of the more expensive, high-end cameras. The manager realizes that if, on average, three more cameras sold each week were high-end instead of economy models, the store would reach its target in sales. The manager prepares a detailed information sheet for the sales associates, outlining the numerous advantages of the high-end cameras over the economy cameras, and provides each sales associate with a portfolio of contrasting photos of the same images, showing the clearly superior image quality of the high-end cameras."}, {"role": "system", "content": "Question: Which of the following, if true, would provide most support for the prediction that the detailed information sheet and photo portfolio given to sales associates will have its intended effect of allowing the store to reach its target in sales?"}, {"role": "user", "content": "Option 1: Camera stores that are part of the same national franchise in major metropolitan locations, like New York or Los Angeles, sell comparatively large numbers of the high end cameras."}, {"role": "user", "content": "Option 2: The sales associates are already well informed about the capabilities of all the cameras, and often know detailed technical information about their circuitry."}, {"role": "user", "content": "Option 3: The high end cameras can generate photographs of profession quality, such as those a portrait photographer might produce"}, {"role": "user", "content": "Option 4: Camera stores that are part of the same national franchise in major metropolitan locations, like New York or Los Angeles, sell comparatively large numbers of the high end cameras."}], "ideal": "4"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: In people's impression, bio-fuel is a renewable green energy. The latest research results overturn people's traditional impression. Researchers found that bio-fuel may be converted into acetaldehyde due to incomplete combustion, which will pollute the air. This pollution will lead to 1400 early deaths in country M every year. Therefore, some medical institution personnel in country M believe that the promotion of bio-fuels should be suspended and its use should be limited at this stage."}, {"role": "system", "content": "Question: Which of the following, if true, would most effectively question the views of medical institution personnel?"}, {"role": "user", "content": "Option 1: At present, the country's scientists have developed a new technology to fully burn biofuels."}, {"role": "user", "content": "Option 2: Pollution from other fuels currently being used in the country causes more than 3,000 premature deaths a year."}, {"role": "user", "content": "Option 3: Conventional fuels such as oil have been technologically improved to reduce pollution from combustion."}, {"role": "user", "content": "Option 4: At present, the country's scientists have developed a new technology to fully burn biofuels."}], "ideal": "1"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Road traffic accident refers to the event of personal injury or property loss caused by vehicle fault or accident on the road. Among them, road refers to roads, urban roads and places where social motor vehicles are allowed to pass although within the jurisdiction of the unit, including squares, public parking lots and other places used for public passage. Vehicle refers to motor vehicles and non motor vehicles. Non motor vehicles, It refers to the means of transport driven by human or animal power and running on the road, as well as the motor wheelchair, electric bicycle and other means of transport for the disabled whose design maximum speed, empty vehicle quality and overall dimensions meet the relevant national standards although driven by power devices."}, {"role": "system", "content": "Question: According to the above definition, which of the followings doesn't belong to road traffic accident:"}, {"role": "user", "content": "Option 1: Xiao Wang accidentally knocked down an old man when reversing in the closed management community"}, {"role": "user", "content": "Option 2: When Miss Zhou crossed the road with her pet dog, the stray pet dog unfortunately died under the ring"}, {"role": "user", "content": "Option 3: Xiao Zhao parked his car in the parking lot near the shopping mall. When he picked up the car, he found that the rear of the car was hit and the accident vehicle had escaped"}, {"role": "user", "content": "Option 4: Xiao Wang accidentally knocked down an old man when reversing in the closed management community"}], "ideal": "1"} ``` </details>
Thank you so much! With the access of GPT-4 for my openai account with this
gmail, I can test more benchmarks for logical reasoning. I'd like to know
when could I use gpt4 api, I noticed that I haven't get the access yet.
Andrew Kondrich ***@***.***> 于 2023年4月12日周三 上午2:45写道:
… ***@***.**** approved this pull request.
Thanks for submitting this eval! We've approved the PR and will merge.
We’ll also add GPT-4 API access to your account.
—
Reply to this email directly, view it on GitHub
<#470 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AO5H2FEJRARHZIXWODYOD7TXAWRFPANCNFSM6AAAAAAWJM3RDI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Hi, unfortunately I couldn't find your email details in the commits, please email me at kondrich@openai.com from your OpenAI-affiliated email! |
…gical Reasoning Instruct-Prompt Version For OpenAI EVAL (openai#470) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨 __PLEASE READ THIS__: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples. ## Eval details 📑 ### Eval name [LogiQA] ### Eval description [LogiQA is a multi-choice machine reading dataset released in 2020, and in 2023 we updated the dataset to version 2. The dataset is collected from Logical reasoning exams with excellent annotation quality. Our experiments showed that it is challenging even for Large Language Models like GPT-3. Fine-tuning methods perform even worse. Up to now, the SOTA score on the dataset is 55% accuracy. The dataset is used by many researchers in their probing tasks and is an important benchmark for logical reasoning. For this OpeanAI Eval, we transformed 1572 instances from the test set into an instruct-prompt scheme.] ### What makes this a useful eval? [Logical reasoning ability is a fascinating topic in the NLP community. We hope to see if GPT4 sheds more light on this topic.] ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] Include at least 100 high quality examples (it is okay to only contribute 5-10 meaningful examples and have us test them with GPT-4 before adding all 100) If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgement We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted. ### Submit eval - [x] I have filled out all required fields in the evals PR form - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ```jsonl {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Sigatoka disease drastically reduces the yield of banana trees and is epidemic throughout the areas of the world where bananas are grown. The fungus that causes the disease can be controlled with fungicides, but the fungicides can pose a health hazard to people living nearby. The fungicides are thus unsuitable for small banana groves in populated areas. Fortunately, most large banana plantations are in locations so isolated that fungicides can be used safely there. Therefore, most of the world' s banana crop is not seriously threatened by Sigatoka disease."}, {"role": "system", "content": "Question: Which one of the following is an assumption on which the argument depends?"}, {"role": "user", "content": "Option 1: Sigatoka disease is the only disease that threatens bananas on a worldwide scale."}, {"role": "user", "content": "Option 2: Most of the banana trees that have not been exposed to the Sigatoka fungus grow in small banana groves."}, {"role": "user", "content": "Option 3: Large plantations produce most or all of the world's bananas."}, {"role": "user", "content": "Option 4: Sigatoka disease is the only disease that threatens bananas on a worldwide scale."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: At present, there are many books such as Ten Keys to Success in the book market. Publishers marketed these books as books that would actually help readers achieve great success. In fact, almost everyone knows that great success is destined to belong to a minority, and people cannot all become one of the minority through books. In this regard, the exaggerated and even false claims made by publishers cannot be considered unethical. To say the least, even if one believes the publisher's false claims, it is not immoral to make such claims as long as reading such books does more good than harm to one's success."}, {"role": "system", "content": "Question: Which of the following conclusions best fits the above argument?"}, {"role": "user", "content": "Option 1: Deliberately making false propaganda is immoral only when it has no positive effect"}, {"role": "user", "content": "Option 2: Deliberate propaganda of this kind is only immoral if people are deceived and suffer from it"}, {"role": "user", "content": "Option 3: If the deliberate disinformation is made to profit at the expense of the deceived, then the deliberate disinformation is immoral"}, {"role": "user", "content": "Option 4: Deliberately making false propaganda is immoral only when it has no positive effect"}], "ideal": "2"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Attorney for Ziegler: My client continued to do consulting work between the time of his arrest for attempted murder and the start of this trial. But I contend that Ziegler was insane at the time that he fired the shot. This is the only reasonable conclusion to draw from the fact that the accusers have submitted no evidence that he was sane at the time he pulled the trigger, only that he was sane some time after he did so."}, {"role": "system", "content": "Question: Which one of the following most accurately describes a flaw in the reasoning of Ziegler's attorney?"}, {"role": "user", "content": "Option 1: It presumes that being a well-educated professional is relevant to being guilty or innocent."}, {"role": "user", "content": "Option 2: It fails to consider that Ziegler might have been insane when he worked as a consultant."}, {"role": "user", "content": "Option 3: It fails to consider the possibility that Ziegler's being sane after the shooting is an indication that he was sane at the time of the shooting."}, {"role": "user", "content": "Option 4: It presumes that being a well-educated professional is relevant to being guilty or innocent."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: It is proposed to allow the sale, without prescription, of a medication that physicians currently prescribe to treat the common ear inflammation called swimmer' s ear. The principal objection is that most people lack the expertise for proper self-diagnosis and might not seek medical help for more serious conditions in the mistaken belief that they have swimmer' s ear. Yet in a recent study, of 1, 000 people who suspected that they had swimmer' s ear, 84 percent had made a correct diagnosis -- a slightly better accuracy rate than physicians have in diagnosing swimmer' s ear. Thus, clearly, most people can diagnose swimmer' s ear in themselves without ever having to consult a physician."}, {"role": "system", "content": "Question: Which one of the following, if true, most undermines the conclusion?"}, {"role": "user", "content": "Option 1: Cases in which swimmer's ear progresses to more serious infections are very rare."}, {"role": "user", "content": "Option 2: For many people who develop swimmer's ear, the condition disappears without medical or pharmaceutical intervention."}, {"role": "user", "content": "Option 3: Physicians who specialize in ear diseases are generally able to provide more accurate diagnoses than those provided by general practitioners."}, {"role": "user", "content": "Option 4: Cases in which swimmer's ear progresses to more serious infections are very rare."}], "ideal": "4"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: All any reporter knows about the accident is what the press agent has said. Ttherefore, if the press agent told every reporter everything about the accident, then no reporter knows any more about it than any other reporter. If no reporter knows any more about the accident than any other reporter, then no reporter can scoop all of the other reporters. However, the press agent did not tell every reporter everything about the accident. It follows that some reporter can scoop all of the other reporters."}, {"role": "system", "content": "Question: The argument's reasoning is flawed because the argument fails to recognize that which one of the following is consistent with the facts the argument presents?"}, {"role": "user", "content": "Option 1: The press agent may not know any more about the accident than the most knowledgeable reporter."}, {"role": "user", "content": "Option 2: No reporter knows any more about the accident than any other reporter."}, {"role": "user", "content": "Option 3: Even if some reporter knows more about the accident than all of the other reporters, that reporter need not scoop any other reporter."}, {"role": "user", "content": "Option 4: The press agent may not know any more about the accident than the most knowledgeable reporter."}], "ideal": "2"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Crowdsourcing refers to the practice of a company or organization to delegate tasks traditionally performed by employees to the general public."}, {"role": "system", "content": "Question: Which of the following is not crowdsourcing?"}, {"role": "user", "content": "Option 1: A toy company has been encouraging and sponsoring users to participate in its design work. From robotic control systems to building block kits, the company has had fairly good results."}, {"role": "user", "content": "Option 2: A detergent company often posts its own R & D projects on major websites, soliciting solutions, and promises to give certain rewards for solutions."}, {"role": "user", "content": "Option 3: In the past three years, a real estate company has handed over all the daily maintenance of computers, networks and peripherals to a computer company."}, {"role": "user", "content": "Option 4: A toy company has been encouraging and sponsoring users to participate in its design work. From robotic control systems to building block kits, the company has had fairly good results."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Social risk refers to the risk of loss of social production and people's life due to the actions of individuals or groups."}, {"role": "system", "content": "Question: Which of the following is not a social risk?"}, {"role": "user", "content": "Option 1: Larceny."}, {"role": "user", "content": "Option 2: Robbery."}, {"role": "user", "content": "Option 3: Frost disaster."}, {"role": "user", "content": "Option 4: Larceny."}], "ideal": "3"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: A manager is hoping to reach a certain target for camera sales in his store, which sells between 10 and 20 cameras a week. Typically, most cameras sold in any week are the less expensive economy models, and his store has sold relatively fewer of the more expensive, high-end cameras. The manager realizes that if, on average, three more cameras sold each week were high-end instead of economy models, the store would reach its target in sales. The manager prepares a detailed information sheet for the sales associates, outlining the numerous advantages of the high-end cameras over the economy cameras, and provides each sales associate with a portfolio of contrasting photos of the same images, showing the clearly superior image quality of the high-end cameras."}, {"role": "system", "content": "Question: Which of the following, if true, would provide most support for the prediction that the detailed information sheet and photo portfolio given to sales associates will have its intended effect of allowing the store to reach its target in sales?"}, {"role": "user", "content": "Option 1: Camera stores that are part of the same national franchise in major metropolitan locations, like New York or Los Angeles, sell comparatively large numbers of the high end cameras."}, {"role": "user", "content": "Option 2: The sales associates are already well informed about the capabilities of all the cameras, and often know detailed technical information about their circuitry."}, {"role": "user", "content": "Option 3: The high end cameras can generate photographs of profession quality, such as those a portrait photographer might produce"}, {"role": "user", "content": "Option 4: Camera stores that are part of the same national franchise in major metropolitan locations, like New York or Los Angeles, sell comparatively large numbers of the high end cameras."}], "ideal": "4"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: In people's impression, bio-fuel is a renewable green energy. The latest research results overturn people's traditional impression. Researchers found that bio-fuel may be converted into acetaldehyde due to incomplete combustion, which will pollute the air. This pollution will lead to 1400 early deaths in country M every year. Therefore, some medical institution personnel in country M believe that the promotion of bio-fuels should be suspended and its use should be limited at this stage."}, {"role": "system", "content": "Question: Which of the following, if true, would most effectively question the views of medical institution personnel?"}, {"role": "user", "content": "Option 1: At present, the country's scientists have developed a new technology to fully burn biofuels."}, {"role": "user", "content": "Option 2: Pollution from other fuels currently being used in the country causes more than 3,000 premature deaths a year."}, {"role": "user", "content": "Option 3: Conventional fuels such as oil have been technologically improved to reduce pollution from combustion."}, {"role": "user", "content": "Option 4: At present, the country's scientists have developed a new technology to fully burn biofuels."}], "ideal": "1"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage, and a question about that passage. There are four options to be choosed from, you need to choose the only correct option answering that question. If the first option is right, you generate the answer '1', if the second option is right, you generate the answer '2', if the third option is right, you generate the answer '3', if the fourth option is right, you generate the answer '4'. Read the question and options thoroughly and select the correct answer from the four answer labels. Read the passage thoroughly to ensure you know what the passage entails."}, {"role": "system", "content": "Passage: Road traffic accident refers to the event of personal injury or property loss caused by vehicle fault or accident on the road. Among them, road refers to roads, urban roads and places where social motor vehicles are allowed to pass although within the jurisdiction of the unit, including squares, public parking lots and other places used for public passage. Vehicle refers to motor vehicles and non motor vehicles. Non motor vehicles, It refers to the means of transport driven by human or animal power and running on the road, as well as the motor wheelchair, electric bicycle and other means of transport for the disabled whose design maximum speed, empty vehicle quality and overall dimensions meet the relevant national standards although driven by power devices."}, {"role": "system", "content": "Question: According to the above definition, which of the followings doesn't belong to road traffic accident:"}, {"role": "user", "content": "Option 1: Xiao Wang accidentally knocked down an old man when reversing in the closed management community"}, {"role": "user", "content": "Option 2: When Miss Zhou crossed the road with her pet dog, the stray pet dog unfortunately died under the ring"}, {"role": "user", "content": "Option 3: Xiao Zhao parked his car in the parking lot near the shopping mall. When he picked up the car, he found that the rear of the car was hit and the accident vehicle had escaped"}, {"role": "user", "content": "Option 4: Xiao Wang accidentally knocked down an old man when reversing in the closed management community"}], "ideal": "1"} ``` </details>
Thank you for contributing an eval!♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PR's with ~5-10 example that we can then run the evals on and share the results with you so you know how your eval does with GPT-4 before writing all 100 examples.
Eval details 📑
Eval name
[LogiQA]
Eval description
[LogiQA is a multi-choice machine reading dataset released in 2020, and in 2023 we updated the dataset to version 2. The dataset is collected from Logical reasoning exams with excellent annotation quality. Our experiments showed that it is challenging even for Large Language Models like GPT-3. Fine-tuning methods perform even worse. Up to now, the SOTA score on the dataset is 55% accuracy. The dataset is used by many researchers in their probing tasks and is an important benchmark for logical reasoning. For this OpeanAI Eval, we transformed 1572 instances from the test set into an instruct-prompt scheme.]
What makes this a useful eval?
[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see if GPT4 sheds more light on this topic.]
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
Basic
evals or theFact
Model-graded eval, or an exhaustive rubric for evaluating answers for theCriteria
Model-graded eval.If there is anything else that makes your eval worth including, please document it below.
Unique eval value
Eval structure 🏗️
Your eval should
evals/registry/data/{name}
evals/registry/evals/{name}.yaml
(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.
Limited availability acknowledgement
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
Submit eval
pip install pre-commit; pre-commit install
and have verified thatblack
,isort
, andautoflake
are running when I commit and pushFailure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval