
Run import tasks in the background and show people a progress indicator while they run #4

Closed
simonw opened this issue Feb 26, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

simonw commented Feb 26, 2024

Currently blocks for several seconds while the job runs, which isn't ideal.

simonw added the enhancement label on Feb 26, 2024
simonw commented Feb 26, 2024

Can't easily show a progress indicator in terms of 0-100%, but showing parsed results as they are extracted would be neat.

simonw commented Feb 26, 2024

Two ways to do this:

  • A database table with a record for each job, which can be polled for status
  • An in-memory mechanism for storing progress

I like the table method, since that makes things easier to debug in the future. It could also be used to help power future features like remembering the last used column descriptions.
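To make the table approach concrete, here's a minimal sketch of a job row that a UI can poll for status. The table and function names (`_jobs`, `start_job`, `poll_job`) are illustrative, not the plugin's actual API:

```python
import sqlite3

def create_jobs_table(db):
    db.execute(
        """
        create table if not exists _jobs (
            id integer primary key,
            status text,        -- pending, running, finished
            done_count integer  -- rows processed so far
        )
        """
    )

def start_job(db):
    return db.execute(
        "insert into _jobs (status, done_count) values ('pending', 0)"
    ).lastrowid

def poll_job(db, job_id):
    status, done = db.execute(
        "select status, done_count from _jobs where id = ?", (job_id,)
    ).fetchone()
    return {"status": status, "done": done}

db = sqlite3.connect(":memory:")
create_jobs_table(db)
job_id = start_job(db)
# The background worker updates the row as it makes progress:
db.execute("update _jobs set status = 'running', done_count = 5 where id = ?", (job_id,))
print(poll_job(db, job_id))  # {'status': 'running', 'done': 5}
```

Because the state lives in a table, it survives restarts and can be inspected with ordinary SQL when debugging.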

simonw commented Feb 26, 2024

Database table will be similar to the one used for Datasette enrichments: https://github.com/datasette/datasette-enrichments/blob/eeded4173fff0614e2efe559022a9023e5ff8640/datasette_enrichments/__init__.py#L35-L54

create table if not exists _enrichment_jobs (
    id integer primary key,
    status text, -- pending, running, cancelled, finished
    enrichment text, -- slug of enrichment
    database_name text,
    table_name text,
    filter_querystring text, -- querystring used to filter rows
    config text, -- JSON dictionary of config
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    cancel_reason text, -- null or reason for cancellation
    next_cursor text, -- next cursor to fetch
    row_count integer, -- number of rows to enrich at start
    error_count integer, -- number of rows with errors encountered
    done_count integer, -- number of rows processed
    actor_id text, -- optional ID of actor who created the job
    cost_100ths_cent integer -- cost of job so far in 1/100ths of a cent
)

simonw commented Feb 26, 2024

I'm torn on whether I should persist the full input text to the model. I think yes, for debuggability.

simonw commented Feb 26, 2024

Maybe this as a starting point:

create table if not exists _extract_jobs (
    id integer primary key,
    database_name text,
    table_name text,
    input text,
    custom_prompt text, -- for custom system prompt
    columns text, -- JSON function definition
    started_at text, -- ISO8601 when added
    finished_at text, -- ISO8601 when completed or cancelled
    row_count integer, -- starts at 0
    error text
);
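As a rough sketch, creating that table and recording a new job might look like this (the `start_extract_job` helper is hypothetical, shown only to illustrate the schema in use):

```python
import json
import sqlite3
from datetime import datetime, timezone

EXTRACT_JOBS_SQL = """
create table if not exists _extract_jobs (
    id integer primary key,
    database_name text,
    table_name text,
    input text,
    custom_prompt text, -- for custom system prompt
    columns text,       -- JSON function definition
    started_at text,    -- ISO8601 when added
    finished_at text,   -- ISO8601 when completed or cancelled
    row_count integer,  -- starts at 0
    error text
)
"""

def start_extract_job(db, database_name, table_name, input_text, columns):
    db.execute(EXTRACT_JOBS_SQL)
    return db.execute(
        "insert into _extract_jobs"
        " (database_name, table_name, input, columns, started_at, row_count)"
        " values (?, ?, ?, ?, ?, 0)",
        (
            database_name,
            table_name,
            input_text,
            json.dumps(columns),  # persist column definitions as JSON text
            datetime.now(timezone.utc).isoformat(),
        ),
    ).lastrowid

db = sqlite3.connect(":memory:")
job_id = start_extract_job(
    db, "data", "people", "some pasted text", {"name": "str", "age": "int"}
)
```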

simonw commented Feb 27, 2024

Options for polling for extracted records to display to the user:

  1. Poll the database table that is being populated (though that's tricky as we don't want to show any previously existing records - could use rowid for inserted ones but that won't help for replace-by-pk if we implement that)
  2. Poll against an in-memory data structure that exists just for the duration of the task
  3. Append the JSON to either a column in _extract_jobs or a separate table and poll that
  4. Instead of polling use WebSockets or SSE

I'm going to try option 2 first because I think it will be easiest to get working - can very easily change how that works later.
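Option 2 could be as simple as a module-level dict keyed by job ID, with a poll function that takes a cursor so each request only returns rows the client hasn't seen yet. This is a sketch with made-up names (`progress`, `record_row`, `poll`), not the plugin's actual implementation:

```python
# job_id -> {"done": bool, "rows": [...]}; lives only in the server process,
# so it disappears when the task (or server) ends - that's the trade-off.
progress = {}

def start(job_id):
    progress[job_id] = {"done": False, "rows": []}

def record_row(job_id, row):
    # Called by the background task each time a record is extracted
    progress[job_id]["rows"].append(row)

def finish(job_id):
    progress[job_id]["done"] = True

def poll(job_id, since=0):
    state = progress[job_id]
    return {
        "rows": state["rows"][since:],  # only rows the client hasn't seen
        "next": len(state["rows"]),     # cursor to pass on the next poll
        "done": state["done"],
    }
```

The browser polls in a loop, passing back `next` each time, until `done` is true.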

simonw commented Feb 27, 2024

I'm going to display the extracted data to the user as a raw JSON array. I considered a more human-friendly format like a table or definition list, but I think JSON is a better representation of what's actually happening, plus it gives them the chance to copy and paste it into another tool if they are so inclined.

simonw commented Feb 27, 2024

Should the rows be inserted one at a time as they are extracted, or should they all be inserted at once at the end of the operation?

  • One at a time means a server crash won't lose work-so-far.
  • At the end provides a chance to let the user preview what has been extracted and approve/deny the operation.

I'll try one at a time first and see how that feels.
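The one-at-a-time approach might look like this sketch, which commits after each record so a crash mid-job keeps everything processed so far (the helper name is illustrative; a real implementation would also need to validate the table and column names before interpolating them into SQL):

```python
import sqlite3

def insert_rows_as_extracted(db, table, extracted):
    # Write each record as soon as the model emits it, committing per row,
    # so partial progress survives a server crash.
    for record in extracted:
        columns = ", ".join(record)
        placeholders = ", ".join("?" for _ in record)
        db.execute(
            f"insert into [{table}] ({columns}) values ({placeholders})",
            tuple(record.values()),
        )
        db.commit()

db = sqlite3.connect(":memory:")
db.execute("create table people (name text, age integer)")
insert_rows_as_extracted(db, "people", [{"name": "Cleo", "age": 5}])
```

The preview-and-approve alternative would instead accumulate records in memory (or in the jobs table) and run a single batched insert once the user confirms.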

simonw commented Feb 27, 2024

I generated a new loading indicator using https://loading.io/ (unrestricted public domain license).

simonw closed this as completed in 4bcd623 on Feb 27, 2024
simonw commented Feb 27, 2024

Demo:

[animated GIF: projects-demo]

simonw added a commit that referenced this issue Mar 7, 2024
Refs #1, #2, #3, #4, #5, #6, #7, #12, #13