The result of dump_data_fields() does not give me all the information I need from the PDF file. For instance (using the US IRS Form 941 as an example), here is my code and a section of the output:
But if I run PDFtk from the shell, it shows another "FieldStateOption" for each of these checkboxes. Here's the corresponding output of pdftk f941-2019.pdf dump_data_fields:
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 2
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 3
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 4
FieldStateOption: Off
I understand why this happens: pypdftk puts the output of PDFtk into a list of dictionaries, so later values for a given key naturally overwrite earlier ones. But the fact is that data is lost, and in this case it is precisely the data I need. (The "FieldStateOption" that isn't "Off" is the one I have to use to "check" the checkbox. Note that it is different for each field, which is why I want my program to read it. In this case it comes first; apparently it doesn't always. See this StackExchange discussion.)
My suggestion would be to process the PDFtk output a little more carefully, so that if a key is repeated, its value in the resulting dictionary is a list. For the first item in the example above, "FieldStateOption" would then hold both values, "1" and "Off", instead of only the last one.
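A minimal sketch of that kind of processing, for illustration only. It is not pypdftk's actual code: the function names `parse_dump_data_fields` and `on_state` are hypothetical, and it assumes the PDFtk output uses `---` as the record separator and `Key: value` lines, as in the shell output above.

```python
def parse_dump_data_fields(text):
    """Parse `pdftk ... dump_data_fields` output into a list of dicts.

    A key that occurs once keeps a plain string value; a key that is
    repeated within one record has its values collected into a list.
    """
    fields = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "---":
            # Record separator: store the record collected so far.
            if current:
                fields.append(current)
            current = {}
            continue
        # Split on the first colon only, so colons in the value survive.
        key, _, value = line.partition(":")
        value = value.strip()
        if key in current:
            existing = current[key]
            if isinstance(existing, list):
                existing.append(value)
            else:
                current[key] = [existing, value]
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


def on_state(field):
    """Return the first FieldStateOption that is not "Off", or None."""
    options = field.get("FieldStateOption", [])
    if isinstance(options, str):
        options = [options]
    return next((opt for opt in options if opt != "Off"), None)


# First record from the Form 941 output shown above.
SAMPLE = """\
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
"""

fields = parse_dump_data_fields(SAMPLE)
print(fields[0]["FieldStateOption"])  # ['1', 'Off']
print(on_state(fields[0]))            # 1
```

Because `on_state` searches the whole list, it does not depend on the non-"Off" option coming first. The trade-off in this sketch is that a key maps to either a string or a list, so callers have to handle both shapes.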
…tiple values. (#41)
This resolves GitHub issue #32.
The test files form.json and form-filled.json
were also updated to match the new output.
Co-authored-by: Adam Lehigh <adam@dominantseventh.net>