Add an openai based parser for the saami pdf files #13

Open
wants to merge 5 commits into
base: develop

johanoskarsson

I made an attempt at parsing this report: https://saami.org/wp-content/uploads/2023/11/ANSI-SAAMI-Z299.4-CFR-Approved-2015-12-14-Posting-Copy.pdf

It contains centerfire rifle cartridges. Since the PDFs have drawings in them and differ a bit from each other, I figured this was a good opportunity to learn some OpenAI.

The current version will do the following:

  • It downloads the main cartridge PDF from the SAAMI website.
  • It splits the file into one PDF file per cartridge.
  • It sends these files individually to the OpenAI API to be parsed.
  • The returned JSON is not perfect, so we massage it a bit.
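
The "massage" step in the last bullet could look roughly like the sketch below. This is not the code from the PR: the function names (`extract_json`, `to_number`, `massage`) and the exact key names are assumptions made for illustration, and the sample reply in the usage note is made up. It only uses the standard library.

```python
import json
import re


def extract_json(reply: str) -> dict:
    """Pull the outermost JSON object out of a model reply.

    Assumes the reply contains exactly one object, possibly wrapped
    in a Markdown fence or surrounded by prose.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


def to_number(value):
    """Coerce values like '2.810 in' or '.308' to float, else None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = re.search(r"\d*\.?\d+", str(value))
    return float(match.group()) if match else None


def massage(reply: str) -> dict:
    """Normalize one parsed cartridge to the three fields being extracted."""
    raw = extract_json(reply)
    # Key casing in the reply is not guaranteed, so compare lower-cased keys.
    record = {str(key).strip().lower(): value for key, value in raw.items()}
    return {
        "name": str(record.get("name", "")).strip(),
        "caliber": to_number(record.get("caliber")),
        "coal": to_number(record.get("coal")),
    }
```

For example, a reply wrapped in a code fence with inconsistent key casing and a unit suffix (`"COAL": "2.810 in"`) would come back as a plain dict with float values. Note that this only cleans up the shape of the data; it cannot tell whether the numbers themselves are right.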

See the final output in saami.json. It's not too bad considering the input data. I have not done an in-depth review against the source material, however. I'm also only extracting the main data: name, caliber and COAL (cartridge overall length). I figure that's a good start.

@erichiggins
Contributor

Thanks for creating this, good time to have the discussion.

I'm personally a fan of leveraging AI models to create simple scripts and programs, and it's possible that this might be a good fit. The trouble of course is that if SAAMI decides to change their PDF format, then this would no longer work. Also, we humans are still responsible for reviewing the results of the script each time for correctness.

So naturally I wonder if it's easier to simply spend the human time/energy on manually adding the details from SAAMI when they make changes (roughly once per year). Is that more or less time than running the script, reviewing all of the results for correctness, and possibly having to change the script?

@johanoskarsson
Author

Yeah, there are definitely some tradeoffs here. I think this code should be fairly agnostic to the exact PDF layout. Some of the cartridge pages in the document already differ from each other, and the AI seems to be pretty good at "reading" them.

That said, there are a couple of hard-coded bits in the PR, like where the pages for cartridges can be found in the PDF. That could probably be improved, but I didn't want to spend too much time on it up front. It would also not be too much work for a human to provide these as arguments if the PDF is only updated once a year.
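
Passing the page range as an argument, as suggested above, could be sketched like this. The flag name `--pages`, the positional `pdf` argument and the helper names are hypothetical, not taken from the PR.

```python
import argparse


def parse_page_range(text: str) -> range:
    """Turn '12-15' (or a single page like '7') into a 1-based page range."""
    first, sep, last = text.partition("-")
    if not sep:
        last = first
    # int() raises ValueError on bad input, which argparse reports as a usage error.
    start, end = int(first), int(last)
    if start < 1 or end < start:
        raise argparse.ArgumentTypeError(f"invalid page range: {text!r}")
    return range(start, end + 1)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Parse cartridge pages from a SAAMI PDF (illustrative CLI)."
    )
    parser.add_argument("pdf", help="path to the downloaded PDF")
    parser.add_argument(
        "--pages",
        type=parse_page_range,
        required=True,
        help="inclusive page range containing the cartridge drawings, e.g. 12-15",
    )
    return parser
```

With this, `build_parser().parse_args(["report.pdf", "--pages", "12-15"])` yields `args.pages` covering pages 12 through 15, so whoever runs the script after a yearly update only needs to look up the new page numbers.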

My general attitude is that since there are very few details in the JSON files in the repo right now, providing more would be an improvement that can be built on. I can set aside some time to manually review the JSON output in this PR against the PDF to make sure the values are reasonable. I would assume that changes to existing published cartridge specs are rare (if they happen at all?), so then we have a baseline to add to when the PDF is updated with new cartridges.
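
One way to treat the reviewed JSON as a baseline is sketched below, assuming each entry is a dict with a `name` key (the function name and the entry shape are assumptions, not the repo's actual schema). Reviewed entries are never overwritten; new cartridges are appended, and entries that disagree with the baseline are returned separately for a human to look at.

```python
def merge_into_baseline(baseline: list[dict], parsed: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (merged, conflicts).

    merged: the baseline plus any parsed entries whose name is new.
    conflicts: parsed entries whose name exists but whose data differs.
    """
    by_name = {entry["name"].casefold(): entry for entry in baseline}
    merged = list(baseline)
    conflicts = []
    for entry in parsed:
        key = entry["name"].casefold()
        existing = by_name.get(key)
        if existing is None:
            # New cartridge: add it, still subject to manual review.
            merged.append(entry)
            by_name[key] = entry
        elif existing != entry:
            # Existing spec differs: keep the reviewed value, flag the new one.
            conflicts.append(entry)
    return merged, conflicts
```

If published specs really do change rarely, the `conflicts` list should usually be empty or consist of parsing errors, which keeps the yearly review focused on the newly added cartridges.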

@johanoskarsson
Author

I finally had a moment to review the output json. I've updated the file to match the pdf. The numbers should now be correct and I adjusted the names to be a bit less screamy as well.

The openai output was probably less than 50% correct. It has a tendency to round the numbers (and it seems to pick up the diameter from the cartridge name instead of from the drawing).
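
The two failure modes described here (rounded numbers, and a diameter that looks copied from the cartridge name) could be flagged automatically before the manual review. The sketch below is a heuristic under stated assumptions: the function name is hypothetical, the value is assumed to arrive as a plain decimal string such as "0.308", and the three-decimal threshold is a guess rather than something taken from the SAAMI document. A flag only means "check this one against the drawing"; some cartridges legitimately have a diameter that matches their name.

```python
import math
import re


def review_flags(name: str, caliber_text: str, min_decimals: int = 3) -> list[str]:
    """Return reasons why a parsed diameter deserves a closer look."""
    flags = []
    text = caliber_text.strip()
    value = float(text)

    # Fewer decimals than expected suggests the model rounded the value.
    _, _, decimals = text.partition(".")
    if len(decimals) < min_decimals:
        flags.append("possibly rounded")

    # Read each number in the name as a decimal fraction ("30" -> 0.30)
    # and flag the value if it matches one of them.
    for token in re.findall(r"\d+", name):
        if math.isclose(value, float("0." + token)):
            flags.append("matches number in name")
            break
    return flags
```

For instance, `review_flags("30-06 Springfield", "0.30")` returns both flags, while `review_flags("30-06 Springfield", "0.308")` returns an empty list.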


2 participants