Fix few issues with the dataset #32

gururise · 2023-03-17T01:33:09Z

Because the training dataset was generated with GPT-3, it contains a number of issues that I noticed when going through it. I have manually fixed the following:

Resolve empty outputs
Added a few CoT examples
Fixed a few empty code examples
Removed instructions asking to generate images
Resolve N/A outputs
Make empty inputs consistent (some used N/A, others used None)
Fixed a few wrong answers.

Hoping this slightly curated dataset will help produce better training results.
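
A minimal sketch of how checks like the ones listed above could be automated, assuming the record layout visible in this PR's diffs (`instruction`, `input`, `output`). The placeholder set and the helper names `is_placeholder`, `normalize_input`, and `find_blank_outputs` are illustrative assumptions, not code from this PR; flagged outputs would still need a human-written answer.

```python
# Placeholder spellings treated as "empty". This set is an assumption based
# on the values mentioned in this thread; it is not an exhaustive list.
EMPTY_MARKERS = {"", "n/a", "none", "no input", "<no input>", "no output", "<no output>"}


def is_placeholder(text: str) -> bool:
    """Return True if text is blank or one of the known placeholder strings."""
    return text.strip().lower() in EMPTY_MARKERS


def normalize_input(record: dict) -> dict:
    """Return a copy of record whose placeholder input is replaced by ""."""
    fixed = dict(record)
    if is_placeholder(fixed.get("input", "")):
        fixed["input"] = ""
    return fixed


def find_blank_outputs(records: list[dict]) -> list[int]:
    """Return the indices of records whose output is blank or a placeholder."""
    return [i for i, r in enumerate(records) if is_placeholder(r.get("output", ""))]
```

Returning indices instead of deleting records keeps the review step manual, which matches how the fixes in this PR were made.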

tloen · 2023-03-17T02:13:12Z

Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

tloen · 2023-03-17T03:38:30Z

alpaca_data.json

@@ -121612,7 +121567,7 @@
    {
        "instruction": "Calculate the area of the following shape in square centimeters.",
        "input": "rectangle of size 4 cm x 5 cm",
-        "output": ""
+        "output": "The area the rectangle is 20 square centimeters."


note to self: typo

tloen · 2023-03-17T03:39:56Z

alpaca_data.json

        "input": "",
-        "output": ""
+        "output": "Over the last 5 years, there have been advancements in deep learning, increased used of AI in the industry, a greater emphasis on explainability, development of new hardware and increased importance placed on ethical and social considerations of AI."


note to self: typo (used)

tloen · 2023-03-17T03:40:51Z

alpaca_data.json

-    {
-        "instruction": "Create a graphic or logo that visually represents the word \"courage\".",
-        "input": "",
-        "output": "<No Output>"


we should probably standardize an output for these instead of removing all of them completely. i liked the one above:

As a large language model, I am unable to generate visual data.

tloen · 2023-03-17T03:42:51Z

alpaca_data.json

-        "instruction": "Given a list of materials, come up with a plan to build a table.",
-        "input": "Materials: Wood, paint, glue, nails, screws",
-        "output": "No Output"
+        "instruction": "Come up with a motto for a robot company.",
+        "input": "",
+        "output": "Machines that make a difference."


Might be worth running this one through GPT again.

tloen · 2023-03-17T03:43:29Z

alpaca_data.json

        "instruction": "Is the following statement true or false? People in China make the most of their allowances.",
-        "input": "None",
+        "input": "",
        "output": "False"


lol. yeah, there are a few of these floating around. Seem to make no sense.

gururise · 2023-03-17T05:02:57Z

Fixed a few more issues.
Put the "visualization" tasks back with a standard response: "As a large language model, I am unable to generate visual data."
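
As a rough illustration of the standard-response approach, the sketch below rewrites the output of any record whose instruction looks like a request for visual data. The response string is the one quoted in this thread; the keyword pattern and the function name `standardize_visual_task` are assumptions, and a heuristic like this will miss or over-match some instructions, so matches should be reviewed by hand.

```python
import re

STANDARD_VISUAL_RESPONSE = "As a large language model, I am unable to generate visual data."

# Heuristic: a "make something" verb followed later by a visual noun.
# The word lists are assumptions chosen for illustration.
VISUAL_TASK = re.compile(
    r"\b(create|design|draw|generate|make|sketch)\b.*"
    r"\b(graphic|image|logo|picture|drawing|illustration)\b",
    re.IGNORECASE,
)


def standardize_visual_task(record: dict) -> dict:
    """Return a copy of record with the standard response if it asks for visual data."""
    fixed = dict(record)
    if VISUAL_TASK.search(fixed["instruction"]):
        fixed["output"] = STANDARD_VISUAL_RESPONSE
    return fixed
```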

Noticed there are several tasks that expect the LLM to use data from URLs, many of which don't even exist. I've substituted equivalent data when available.

gururise · 2023-03-17T05:14:33Z

Very interesting — I hadn't realized there were so many holes in the data. Fixing them could improve the model quality significantly. Out of curiosity, how many examples did you view and was there any method to your approach?

I only gave it a cursory look and fixed the very obvious issues (i.e., inconsistent empty inputs, obviously wrong answers, blank outputs, etc.). I probably went through a few hundred examples manually.

I think I got most of the low-hanging fruit via searching for empty inputs and blank outputs. I did notice there are many instructions asking the LLM to reference online data to answer a question. These should probably be addressed in some manner.
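
One way to surface the instructions that depend on online data is a plain URL search, sketched below. The pattern and the function name `find_url_dependent` are assumptions for illustration; the search only catches literal URLs, so instructions that refer to "the article" or "the website" without a link would need a separate pass.

```python
import re

# Matches http(s) links and bare "www." links; deliberately simple.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def find_url_dependent(records: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (index, urls) pairs for records whose instruction or input contains a URL."""
    flagged = []
    for i, record in enumerate(records):
        text = f"{record.get('instruction', '')} {record.get('input', '')}"
        # Drop sentence punctuation that the greedy \S+ picked up at the end.
        urls = [u.rstrip(".,;:)") for u in URL_PATTERN.findall(text)]
        if urls:
            flagged.append((i, urls))
    return flagged
```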

niclimcy · 2023-03-17T08:09:32Z

I’m not sure if this is the right place to ask, but I was thinking of crowdsourcing updates to each response in the training dataset, with functions to approve and review each line.

chris-aeviator · 2023-03-17T08:18:41Z

Could contribute a simple system to accept/decline/upsert the entries

(Imagine each card in this kanban board being one instruction -> answer pair.)

Instead of category it would be a free form text field with the data from the original dataset that a reviewer can edit

zkenda · 2023-03-17T11:11:18Z

Instead of providing generic answers like "As a large language model, I am unable to..." we could introduce a standardized set of tools, that could potentially improve the accuracy of certain types of responses, such as calculations, image generation, or code compilation. The model should propose tools and use their output instead of relying solely on the language model's internal capabilities (which could be a big limitation considering the model size).

One can still detect the tool usage and replace it with generic answer if necessary.
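
A minimal sketch of that detect-and-replace idea. Nothing in this thread specifies how a tool call would be written, so the `<tool:NAME>ARGUMENT</tool>` marker, the toy `calculator`, the `TOOLS` registry, and `resolve_tool_calls` are all hypothetical and exist only to illustrate the control flow.

```python
import re

# Hypothetical marker format: <tool:NAME>ARGUMENT</tool>
TOOL_CALL = re.compile(r"<tool:(\w+)>(.*?)</tool>", re.DOTALL)


def calculator(expression: str) -> str:
    """Toy tool: evaluate 'A OP B' where OP is one of + - * /."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    results = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("nan")}
    return f"{results[op]:g}"


# Registry of tools that are actually available at inference time.
TOOLS = {"calculator": calculator}


def resolve_tool_calls(model_output: str, fallback: str) -> str:
    """Replace each tool marker with the tool's result.

    If any marker names a tool that is not registered, return the generic
    fallback answer instead of a partially resolved string.
    """
    calls = TOOL_CALL.findall(model_output)
    if any(name not in TOOLS for name, _ in calls):
        return fallback
    return TOOL_CALL.sub(lambda m: TOOLS[m.group(1)](m.group(2)), model_output)
```

With this layout the dataset can keep the tool markers in its outputs, and a deployment without tools can still fall back to a fixed answer.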

AndriyMulyar · 2023-03-17T18:58:22Z

To assist with this, I made an embedding space explorer (running the data through a transformer) for visualizing the instructions and outputs.

Training Data Instructions Latent Space: https://atlas.nomic.ai/map/alpaca_instructions
Training Data Outputs: https://atlas.nomic.ai/map/alpaca_outputs

For example, here is a link to a bunch of bad data points in the outputs: https://atlas.nomic.ai/map/d2139cc3-bc1c-441c-8d6f-3e6ffbbc2eda/838019ff-8fe2-42ba-809a-d86d2b98cd50/-18.11668742841587/-11.348087116836096/-20.88850316347706/-17.680468640801223/774455612

gururise · 2023-03-17T19:13:07Z

The original Stanford dataset is full of mistakes and holes. Another large issue I found was that many of the instructions hallucinated references to article URLs.

I made my best effort first pass through the dataset to clean it up:

Resolve empty outputs
Resolve empty inputs (no input, , n/a, etc.) for consistency
Added several CoT examples (from Google's FLAN paper)
Fixed a few empty code examples
Instructions to generate audio or images now default to a message stating that, as an LLM, the model can't do this.
Resolve N/A outputs
Fixed a few wrong answers.
Did my best to either insert actual text for URLs referring to articles, or replace them with an alternate instruction.
Removed several instructions asking the LLM to pull data from the internet.
Removed extraneous escape/ctrl characters in some answers
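
For the last item, a small sketch of one way to strip control characters from an answer. Which characters were actually removed in this PR is not stated, so the rule used here (drop every Unicode `Cc` character except newline and tab) and the name `strip_control_chars` are assumptions.

```python
import unicodedata

# Whitespace control characters that are legitimate in answers, e.g. in code.
KEEP = {"\n", "\t"}


def strip_control_chars(text: str) -> str:
    """Remove Unicode control characters (category Cc) except newline and tab."""
    return "".join(
        ch for ch in text if ch in KEEP or unicodedata.category(ch) != "Cc"
    )
```

Note that carriage returns are also in category `Cc`, so `\r\n` line endings become `\n` under this rule.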

The patched dataset is much more consistent and no longer assumes the LLM can access the internet or view/generate visual data. It also now has a few CoT training examples.
Would be interested to see how training goes on this updated dataset.

tloen · 2023-03-17T20:30:02Z

I spent some time thinking about how to crowdsource dataset cleaning with minimal tooling. One way to do this is to create a separate repo with the following structure:

stanford_dataset.jsonl: a copy of the Alpaca dataset augmented with an id field for identification across versions
reviews: a folder of human-submitted data reviews
clean.py: a script or web interface that randomly samples unreviewed data points from stanford_dataset.jsonl for reviewing, then writes the edited or approved example to a new jsonl file in reviews
combine.py: a script that applies all the changes in reviews to the original dataset, and outputs a new cleaned_dataset.jsonl.

I suppose the utility of such an approach would depend on how many bad data points remain. In the meantime, I'll review the changes made so far and save a new "cleaned" dataset alongside the existing one.
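
A minimal sketch of what the `combine.py` step described above might look like, assuming every record and every review carries the `id` field and that a review is simply the full edited (or approved, unchanged) record. The function names and the "last file name wins" conflict rule are assumptions, not an agreed design.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def combine(dataset_path: Path, reviews_dir: Path, output_path: Path) -> int:
    """Apply all reviews to the dataset and write the cleaned copy.

    Returns the number of distinct records that a review changed.
    """
    records = {r["id"]: r for r in load_jsonl(dataset_path)}
    changed_ids = set()
    # Sorted so the outcome is deterministic when two reviews touch one id.
    for review_file in sorted(reviews_dir.glob("*.jsonl")):
        for review in load_jsonl(review_file):
            original = records.get(review["id"])
            if original is None:
                continue  # review refers to an id missing from this version
            if review != original:
                records[review["id"]] = review
                changed_ids.add(review["id"])
    with output_path.open("w", encoding="utf-8") as f:
        for record in records.values():
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(changed_ids)
```

It could be called as `combine(Path("stanford_dataset.jsonl"), Path("reviews"), Path("cleaned_dataset.jsonl"))`. Keeping the original file untouched means any review can be re-applied or dropped later.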

teknium1 · 2023-03-17T20:33:07Z

Would the dataset benefit from multiple prompt:response chains rather than just single prompt>response? i.e. Question:Answer:FollowupQ:FollowupA

tloen · 2023-03-17T20:36:09Z

Would the dataset benefit from multiple prompt:response chains rather than just single prompt>response? i.e. Question:Answer:FollowupQ:FollowupA

That's a lot of work to build. I'd hold out for that 22k dataset that LAION used to train SFT-1.

tloen · 2023-03-17T22:05:29Z

Folded into f704404. Thanks for your work!

spAnser · 2023-03-17T22:09:52Z

Looks like this just closed as I was typing, but there is a typo not too far into the file, and I'm not sure whether it is intentional.

construciton instead of construction

https://github.com/tloen/alpaca-lora/blob/main/alpaca_data_cleaned.json#L23

tloen · 2023-03-17T22:12:05Z

8aecde8

tloen · 2023-03-17T22:12:38Z

Although honestly we might want to leave typos in the instructions.

spAnser · 2023-03-17T22:13:02Z

Yeah it might be worth it idk.

teknium1 · 2023-03-17T23:45:26Z

for prompts it seems a good idea to keep typos

underlines · 2023-03-19T18:11:16Z

People should really support LAION's open-assistant.io project, because every person helping there will improve a fully curated, crowdsourced, open-source instruction fine-tuning dataset, which in turn can be used for Alpaca fine-tuning.

gururise · 2023-03-22T18:07:45Z

FYI, the dataset cleaning is on-going. Latest cleaned dataset can be accessed here.

wassname · 2023-03-22T23:54:59Z

Instead of providing generic answers like "As a large language model, I am unable to..." we could introduce a standardized set of tools

Good idea. Meta is already working on it with Toolformer, and there are a few other efforts too, for example getting a model to control a web browser. They help, but not as much as you would expect at the moment (red is baseline, blue is with a calculator). Since it's a WIP, I would guess it's outside the scope of this repo for now.

gururise added 7 commits March 16, 2023 17:52

curate dataset

e88bb40

changes for 13B fine-tuning

ad5fcca

update readme to include virtual environment

3f8399e

13b changes for checkpointing

5430363

13b generate changes

241b377

13b update

5862817

Fix few issues with the dataset

e93ca85

T-Atlas mentioned this pull request Mar 17, 2023

Would it be possible to use a special token for separating segments? #24

Open

tloen reviewed Mar 17, 2023

More cleanup. Added the visualization tasks back, with standard response.

08fe313

gururise added 5 commits March 16, 2023 22:20

found a few more typos and wrong answers

d2f161f

Added a few CoT examples, update url instructions

c5d6031

minor fix.

34f467c

address a few more internet instructions

74a5db0

address a few more internet instructions

f1fc6f4

gururise added 3 commits March 17, 2023 09:06

replace additional instructions requiring access to internet

f8a1ea4

remove extraneous control characters.

a1fdf92

Merge branch 'tloen:main' into curated

62f17e2

gururise mentioned this pull request Mar 17, 2023

Anyone try fine-tuning 13B model? #28

Open

gururise added 2 commits March 17, 2023 12:05

lots of fixes

c652b65

Merge branch 'curated' of https://github.com/gururise/alpaca-lora into curated

ececcc0

gururise added 3 commits March 17, 2023 12:57

Merge branch 'rssc' into curated

8f4b5ba

some more fixes.

20a97f4

Revert "Merge branch 'rssc' into curated"

f78ddbb

This reverts commit 8f4b5ba, reversing changes made to ececcc0.

few more fixes

992a3be

tloen closed this Mar 17, 2023

samching mentioned this pull request Mar 20, 2023

Bad dataset #65

Open

wassname mentioned this pull request Apr 22, 2023

Plans? EleutherAI/eleutherai-instruct-dataset#1

Open
