Initial plugin design #1

Closed
simonw opened this issue Aug 15, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@simonw
Collaborator

simonw commented Aug 15, 2023

The goal of this plugin is to provide a UI for extracting structured data from unstructured text, using the trick described in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction

Datasette is all about tables, so a plugin which makes it as easy as possible to turn unstructured data into table data makes a ton of sense.

@simonw simonw added the enhancement New feature or request label Aug 15, 2023
@simonw
Collaborator Author

simonw commented Aug 15, 2023

Assorted ideas:

  • You can select an existing table to write data to, or you can define a new table
  • GPT-assisted schema creation could be available too, including giving it the example data and having it suggest a schema that could make sense
  • Sources of data could include:
  • Extracted data can be previewed? Though now we have https://datasette.io/plugins/datasette-write-ui that might not be so important any more

@simonw
Collaborator Author

simonw commented Aug 15, 2023

Most basic version: you select an existing table (hence avoiding the need to implement a schema editing tool) and paste text into a textarea. I'll build that first.
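
A rough sketch of what the storage half of that basic version could look like, using only the standard library `sqlite3` module. This is not the plugin's code: `insert_extracted_rows` and the `people` table are hypothetical names, and it assumes the extracted rows arrive as a list of dicts. The column list is read from the existing table, so nothing here edits a schema.

```python
import sqlite3


def insert_extracted_rows(conn, table, rows):
    """Insert dicts into an existing table, ignoring keys that are not columns."""
    # PRAGMA statements cannot take bound parameters, so quote the name by hand.
    quoted_table = '"' + table.replace('"', '""') + '"'
    # Column names come from the existing table (index 1 of each table_info row).
    columns = [info[1] for info in conn.execute(f"PRAGMA table_info({quoted_table})")]
    if not columns:
        raise ValueError(f"No such table: {table}")
    column_list = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {quoted_table} ({column_list}) VALUES ({placeholders})"
    # Missing keys become NULL; extra keys are dropped.
    values = [[row.get(c) for c in columns] for row in rows]
    with conn:  # commit on success, roll back on error
        conn.executemany(sql, values)
    return len(values)


# Hypothetical example table and extracted rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
extracted = [
    {"name": "Ada", "age": 36, "unexpected": "dropped"},
    {"name": "Grace"},
]
inserted = insert_extracted_rows(conn, "people", extracted)
stored = conn.execute("SELECT name, age FROM people ORDER BY id").fetchall()
```

Passing `None` for the `INTEGER PRIMARY KEY` column lets SQLite assign the row ID itself, which is why the extracted dicts do not need an `id`.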

@simonw
Collaborator Author

simonw commented Aug 16, 2023

It's going to need a description for each column - it can guess in some cases, but the option to give it clues will help a lot.
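
One way the per-column descriptions could feed into the extraction call, as a minimal sketch: build an OpenAI-style function definition (a name, a description, and JSON Schema `parameters`) from the table's columns, attaching a `description` only where a clue was supplied. `build_function_schema`, the `extract_` prefix and the example columns are all made up for illustration. Wrapping the rows in an `"items"` array is an assumption, chosen to line up with the `items.item` prefix used in the ijson snippet later in this thread.

```python
def build_function_schema(table, columns, descriptions=None):
    """Build an OpenAI-style function definition describing rows for one table.

    columns maps column name -> JSON Schema type ("string", "integer", ...).
    descriptions maps column name -> optional hint for the model.
    """
    descriptions = descriptions or {}
    properties = {}
    for name, json_type in columns.items():
        prop = {"type": json_type}
        hint = descriptions.get(name)
        if hint:
            # Only include a description when a clue was actually supplied.
            prop["description"] = hint
        properties[name] = prop
    return {
        "name": f"extract_{table}",
        "description": f"Extract rows for the {table} table from the supplied text",
        "parameters": {
            "type": "object",
            "properties": {
                # Assumption: rows are returned as {"items": [{...}, {...}]}
                "items": {
                    "type": "array",
                    "items": {"type": "object", "properties": properties},
                }
            },
            "required": ["items"],
        },
    }


# Hypothetical columns: one the model can guess, one that needs a clue
schema = build_function_schema(
    "people",
    {"name": "string", "age": "integer"},
    {"age": "Age in whole years at the time of writing"},
)
row_properties = schema["parameters"]["properties"]["items"]["items"]["properties"]
```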

@simonw
Collaborator Author

simonw commented Aug 16, 2023

I got this working, but it was really slow... because the OpenAI APIs take a while to stream back all of that JSON.

I had a note about that in https://til.simonwillison.net/gpt3/openai-python-functions-data-extraction where I mentioned that ijson might help.

So I spent some time and figured out the ijson recipe for it, described in a new TIL: https://til.simonwillison.net/json/ijson-stream

Short version:

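import json
import time

import ijson

# chunks is assumed to be an iterable of str fragments of the streamed JSON,
# for a document shaped like {"items": [{...}, {...}]}
events = ijson.sendable_list()
# Each object under "items" is appended to events once it has fully arrived
coro = ijson.items_coro(events, "items.item")

seen_events = set()

for chunk in chunks:
    coro.send(chunk.encode("utf-8"))
    if events:
        # events is never cleared, so it holds every item parsed so far.
        # Any we have not seen yet?
        unseen_events = [e for e in events if json.dumps(e) not in seen_events]
        if unseen_events:
            for event in unseen_events:
                seen_events.add(json.dumps(event))
                print(json.dumps(event))
                time.sleep(1)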
