Missing values in dump_data_fields() results #32

Open
aslehigh opened this issue Jul 19, 2019 · 1 comment
@aslehigh
Contributor

The output of dump_data_fields() does not give me all the information I need from the PDF file. For instance (using US IRS Form 941 as an example), here is my code and a section of the output:

In [27]: fieldlist = pypdftk.dump_data_fields("f941-2019.pdf")

In [28]: fieldlist[18:22]
Out[28]:
[{'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'},
 {'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]',
  'FieldStateOption': 'Off',
  'FieldType': 'Button',
  'FieldValue': 'Off'}]

But if I run PDFtk from the shell, it shows another "FieldStateOption" for each of these checkboxes. Here's the corresponding output of pdftk f941-2019.pdf dump_data_fields:

---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 2
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[2]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 3
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[3]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 4
FieldStateOption: Off

I understand why this happens: pypdftk puts the output of PDFtk into a list of dictionaries, so later values for a given key naturally overwrite earlier ones. But the result is that data is lost, and in this case it is precisely the data I need. (The "FieldStateOption" that isn't "Off" is the one I have to use to "check" the checkbox. Note that it is different for each field, which is why I want my program to read it. In this case it comes first; apparently it doesn't always. See this StackExchange discussion.)
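To illustrate the point about the non-"Off" option, here is a minimal sketch of how a program might pick the value that checks a box, assuming all the state options for one field are available as a list of strings. The helper name on_state is hypothetical and is not part of pypdftk; returning the first non-"Off" entry is also an assumption, since a field could in principle report more than one.

```python
def on_state(state_options):
    """Return the first state option that is not "Off", or None.

    `state_options` may be a single string or a list of strings.
    This is a hypothetical helper, not part of pypdftk.
    """
    if isinstance(state_options, str):
        state_options = [state_options]
    # The order of the options is not guaranteed, so filter rather than index.
    candidates = [opt for opt in state_options if opt != "Off"]
    return candidates[0] if candidates else None


print(on_state(["1", "Off"]))
print(on_state(["Off", "2"]))
```

Because the helper filters instead of taking a fixed position, it does not matter whether "Off" comes first or last.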

My suggestion would be to process the PDFtk output a little more carefully, so that if a key is repeated within a field, its value in the resulting dictionary is a list. The result in Python would then look like this in my case, taking the first item in the example above:

[{'FieldFlags': '0',
  'FieldJustification': 'Left',
  'FieldName': 'topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]',
  'FieldStateOption': ['1', 'Off'],
  'FieldType': 'Button',
  'FieldValue': 'Off'}]
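
A rough sketch of the kind of processing I mean, working on the text that pdftk dump_data_fields prints. This is only an illustration under my own assumptions, not pypdftk's actual code: the function name parse_dump_data_fields is made up, and it assumes records are separated by "---" lines and that each line has the form "Key: value".

```python
def parse_dump_data_fields(text):
    """Parse `pdftk ... dump_data_fields` text into a list of dicts.

    A key that appears more than once within a record gets a list of
    all its values, in order of appearance. Hypothetical sketch only.
    """
    fields = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---":
            # Record separator: store the record collected so far.
            if current:
                fields.append(current)
            current = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or anything that is not "Key: value"
        key, value = key.strip(), value.strip()
        if key in current:
            # Repeated key: promote the existing value to a list.
            if not isinstance(current[key], list):
                current[key] = [current[key]]
            current[key].append(value)
        else:
            current[key] = value
    if current:
        fields.append(current)
    return fields


sample = """---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[0]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 1
FieldStateOption: Off
---
FieldType: Button
FieldName: topmostSubform[0].Page1[0].Header[0].ReportForQuarter[0].c1_1[1]
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: 2
FieldStateOption: Off
"""

parsed = parse_dump_data_fields(sample)
print(parsed[0]["FieldStateOption"])
```

With this approach, keys that appear only once stay plain strings, so only fields with repeated keys would look different from the current output.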
@revolunet
Owner

Thanks for reporting. We should definitely return a list for FieldStateOption.

@revolunet revolunet added the bug label Sep 22, 2019
aslehigh pushed a commit to aslehigh/pypdftk that referenced this issue Jun 2, 2020
…values.

This resolves Github issue revolunet#32.
The test files form.json and form-filled.json
were also updated to match the new output.
revolunet pushed a commit that referenced this issue Apr 11, 2021
…tiple values. (#41)

This resolves Github issue #32.
The test files form.json and form-filled.json
were also updated to match the new output.

Co-authored-by: Adam Lehigh <adam@dominantseventh.net>