Add an openai based parser for the saami pdf files #13

johanoskarsson · 2024-07-18T06:27:17Z

I made an attempt at parsing this report: https://saami.org/wp-content/uploads/2023/11/ANSI-SAAMI-Z299.4-CFR-Approved-2015-12-14-Posting-Copy.pdf

It contains centerfire rifle cartridges. Since the PDFs have drawings in them and the pages differ a bit, I figured this was a good opportunity to learn some OpenAI.

The current version will do the following:

1. It'll download the main cartridge PDF from the SAAMI website.
2. The file is split into one PDF file per cartridge.
3. These files are sent individually to the OpenAI API to be parsed.
4. The returned JSON is not perfect, so we massage it a bit.
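As a rough illustration of the last step, here is a minimal sketch of what "massaging" a model reply could look like. It is not the code from this PR: the reply shape, the `massage` function, and the sample values are all assumptions made for the example. Only the three field names (name, caliber, coal) come from the description.

```python
import json

# Field names taken from the PR description; everything else is assumed.
FIELDS = ("name", "caliber", "coal")


def massage(raw_response: str) -> dict:
    """Pull the first JSON object out of a model reply and normalise it."""
    # The reply may carry extra prose around the JSON, so cut from the
    # first "{" to the last "}" before parsing.
    start = raw_response.find("{")
    end = raw_response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(raw_response[start:end + 1])

    missing = [field for field in FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    return {
        # Collapse runs of whitespace in the name.
        "name": " ".join(str(data["name"]).split()),
        # Numbers sometimes arrive as strings; coerce them to floats.
        "caliber": float(data["caliber"]),
        "coal": float(data["coal"]),
    }


# Made-up reply, not real SAAMI data.
reply = 'Here is the data: {"name": "  EXAMPLE   CARTRIDGE ", "caliber": "0.25", "coal": 2.5}'
print(massage(reply))
```

Raising on missing fields rather than filling in defaults keeps bad records visible for the manual review.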

See the final output in saami.json. It's not too bad considering the input data. However, I have not done an in-depth review to verify it against the source material. I'm also only extracting the main data: name, caliber and coal (cartridge overall length). I figure that's a good start.

erichiggins · 2024-07-24T15:59:08Z

Thanks for creating this; it's a good time to have the discussion.

I'm personally a fan of leveraging AI models to create simple scripts and programs, and it's possible that this might be a good fit. The trouble of course is that if SAAMI decides to change their PDF format, then this would no longer work. Also, we humans are still responsible for reviewing the results of the script each time for correctness.

So naturally I wonder if it's easier to simply spend the human time/energy on manually adding the details from SAAMI when they make changes (roughly once per year). Is that more or less time spent than running the script, and reviewing all of the results for correctness, and possibly having to make changes to the script?

johanoskarsson · 2024-07-25T03:16:52Z

Yeah, there are definitely some tradeoffs here. I think this code should be fairly agnostic to the exact PDF layout. Some of the cartridge pages in the document already differ from each other and the AI seems to be pretty good at "reading" them.

That said, there are a couple of hard-coded bits in the PR, like where the pages for cartridges can be found in the PDF. That could probably be improved, but I didn't want to spend too much time on it up front. It would also not be too much work for a human to provide these as arguments if the PDF is only updated once a year.
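For what it's worth, passing the page range as arguments could look something like the sketch below. The flag names (`--first-page`, `--last-page`), the `page_range` helper, and the page numbers in the usage line are hypothetical and not taken from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Select the cartridge pages of a SAAMI PDF"
    )
    parser.add_argument("pdf", help="path to the downloaded PDF")
    # Hypothetical flags replacing the hard-coded page numbers.
    parser.add_argument("--first-page", type=int, required=True,
                        help="first cartridge page (1-based, inclusive)")
    parser.add_argument("--last-page", type=int, required=True,
                        help="last cartridge page (1-based, inclusive)")
    return parser


def page_range(args: argparse.Namespace) -> range:
    """Convert the 1-based inclusive page numbers to zero-based indices."""
    if args.first_page < 1 or args.last_page < args.first_page:
        raise ValueError("invalid page range")
    return range(args.first_page - 1, args.last_page)


# Placeholder page numbers, only to show the call.
args = build_parser().parse_args(
    ["saami.pdf", "--first-page", "12", "--last-page", "15"]
)
print(list(page_range(args)))  # [11, 12, 13, 14]
```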

My general attitude is that since there are very few details in the JSON files in the repo right now, providing more would be an improvement that can be built on. I can set aside some time to manually review the JSON output in this PR against the PDF to make sure that the values are reasonable. I would assume that changes to existing published cartridge specs are rare (if ever?) so then we have a baseline to add to when the PDF is updated with new cartridges.
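To make the baseline idea concrete: once a reviewed file exists, a later run would only need review for entries that are not already in it. A minimal sketch, assuming each entry is a dict with a `name` key and that names are a good enough identifier (both are assumptions; `new_cartridges` and the sample entries are made up):

```python
def new_cartridges(baseline: list[dict], parsed: list[dict]) -> list[dict]:
    """Return parsed entries whose name is not in the reviewed baseline."""
    # Compare case-insensitively so "EXAMPLE A" and "Example A" match.
    known = {entry["name"].casefold() for entry in baseline}
    return [entry for entry in parsed if entry["name"].casefold() not in known]


# Made-up entries, not real SAAMI data.
reviewed = [{"name": "Example A"}, {"name": "Example B"}]
fresh = [{"name": "EXAMPLE A"}, {"name": "Example C"}]
print(new_cartridges(reviewed, fresh))  # [{'name': 'Example C'}]
```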

johanoskarsson · 2024-08-07T18:08:11Z

I finally had a moment to review the output json. I've updated the file to match the pdf. The numbers should now be correct and I adjusted the names to be a bit less screamy as well.

The openai output was probably less than 50% correct. It has a tendency to round the numbers (and it seems to pick up the diameter from the cartridge name instead of from the drawing).
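Both failure modes could be flagged automatically to focus a manual review. The sketch below is an assumption-laden illustration, not part of the PR: it expects the numeric fields to be kept as the strings the model returned (so trailing precision is visible), and the three-decimal threshold, the `review_flags` name, and the sample records are made up.

```python
import re


def review_flags(record: dict, min_decimals: int = 3) -> list[str]:
    """List reasons a parsed record deserves a closer look."""
    flags = []
    for field in ("caliber", "coal"):
        # Count digits after the decimal point in the raw string.
        decimals = len(str(record[field]).partition(".")[2])
        if decimals < min_decimals:
            flags.append(f"{field} has only {decimals} decimal place(s)")

    # A caliber equal to a number in the name may have been copied
    # from the name rather than read from the drawing.
    name_numbers = {float(m) for m in re.findall(r"\d*\.\d+", record["name"])}
    if float(record["caliber"]) in name_numbers:
        flags.append("caliber matches a number in the cartridge name")
    return flags


# Made-up records, not real SAAMI data.
suspect = {"name": "Example .300 Cartridge", "caliber": "0.3", "coal": "2.8"}
plausible = {"name": "Example .300 Cartridge", "caliber": "0.3085", "coal": "2.8100"}
print(review_flags(suspect))
print(review_flags(plausible))
```

A flag here is only a hint for the reviewer; a value can be legitimately round or legitimately match the name.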

Add an openai based parser for the saami pdf files

7a87a6d

johanoskarsson added 3 commits August 7, 2024 19:59

Reviewed and updated per pdf

0a2d392

Add note to readme

ba63106

Adjust names

e1e4e37

Add the new cartridges from the SAAMI site

6e9651c
