
Added expected result of lexical analysis of CompactedPDFSyntaxTest.pdf as JSON #6 (Open)

frankrem wants to merge 1 commit into main

Conversation

@frankrem commented Mar 11, 2022

File CompactedPDFSyntaxTest.pdf.json provides a JSON representation of the PDF test file and can be used to assert correctness of the PDF lexical analyzer.

CompactedSyntax/README.md (resolved)
@@ -0,0 +1,1658 @@
[
Member:

This JSON appears not to document the trailer - is that correct?

Member:

And "Null" should really be "null"

@frankrem (Author), Mar 14, 2022:

> And "Null" should really be "null"

Resolved

Author:

> This JSON appears not to document the trailer - is that correct?

This becomes a matter of defining the scope of the lexical analyser (e.g. does it include decryption and object-stream parsing?). Once the indirect objects have been parsed and identified by object number, the trailer no longer has any meaning.

Member:

The trailer in the test PDF does have a very specific parsing test (comment after name without whitespace) that is not duplicated elsewhere which is why I asked.

And no matter how you look at or name things, something somewhere has to lex and parse the startxref, trailer(s), xref(s), incremental updates, etc. before you can know anything about potential decryption (which itself requires lexing/parsing certain key data structures), which in turn comes before lexing/parsing the majority of main-body objects in a PDF. The PDF spec is also not prescriptive about things such as lazy resolution, or when and in what order you lex, parse, or validate certain syntactic constructs - arguably this is just one reason "shadow attacks" are successful.
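As a rough illustration of that startup sequence, here is a minimal Python sketch (function name and tail-buffer size are assumptions, not from the PR) of the very first lexing step a reader must perform: locating the offset that follows the last startxref keyword, before any trailer, xref table, or body object can be reached.

```python
def find_startxref_offset(data: bytes) -> int:
    """Return the cross-reference offset following the last 'startxref'
    keyword. A PDF reader starts here: the file ends with 'startxref',
    the xref offset, then '%%EOF', and everything else (trailer, xref,
    decryption keys, body objects) is only reachable afterwards.
    """
    tail = data[-2048:]  # startxref/%%EOF sit near the end of the file
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("startxref keyword not found")
    # The next whitespace-delimited token is the decimal byte offset.
    token = tail[idx + len(b"startxref"):].split()[0]
    return int(token)
```

Incremental updates complicate this further: each update appends a new startxref/trailer pair, so a real reader then follows /Prev links back through earlier xref sections.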

I've no problem accepting your JSON contribution (which I think is really useful as a human-understandable version) so long as such caveats and assumptions are clearly documented, since not all lexers/parsers make the same assumptions; hence my question about the originating source.

Maybe the simpler solution is just to put this documentation into the README.md?

Author:

Thank you. I agree. I will update the README.

Author:

> And "Null" should really be "null"

To be consistent with the casing of null, I have pushed a new version that formats JSON properties and PDF object types lower-case.

"Operands": [
{
"Type": "Real",
"Value": 0.0
Member:

This should be ".0" not "0.0"

Author:

Can you help me understand why you want to distinguish between .0 and 0.0 after lexical analysis?

Member:

Mainly because not all lexers and tokenizers are equal.
And there are nasty things like signed zeros that can and do occur (not for this specific line, but my other comments elsewhere).
And parsing JSON is also parser-dependent (see https://seriot.ch/projects/parsing_json.html and https://labs.bishopfox.com/tech-blog/an-exploration-of-json-interoperability-vulnerabilities).
So if the value of "Value" were a string rather than a post-processed number, it would avoid all of these assumptions while still keeping the nicely decomposed structure from the PDF as a friendlier JSON (thank you!). (And, yes, slightly more work for each user.)
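To make the parser-dependence concrete, here is a small Python sketch (standard-library json only; the key name "Value" mirrors the PR's JSON) showing that a JSON number of -0.0 is post-processed into a signed-zero float, whereas the string form passes through untouched and defers all numeric interpretation to the consumer:

```python
import json
import math

# A JSON number is post-processed by the parser: -0.0 becomes a
# signed-zero float, which compares equal to 0.0 yet carries a sign.
num = json.loads('{"Value": -0.0}')["Value"]
print(num == 0.0)               # signed zero compares equal to 0.0
print(math.copysign(1.0, num))  # but the sign bit survives: -1.0

# A string value is passed through verbatim; every consumer sees the
# exact source spelling and applies its own numeric interpretation.
s = json.loads('{"Value": "-0.0"}')["Value"]
print(s)
```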

Author:

Exactly because not all lexers and tokenisers are equal (both in implementation details and technology), a neutral reference is helpful. Representing a real or integer as a string would leave verification dependent on each lexer's own number parsing. In my view 0.0 is the most neutral representation of all zero real numbers such as -.0, -0.0, +.0, +0.00000, -.00000, etc.

https://json-schema.org/understanding-json-schema/reference/numeric.html explicitly discourages representing a number as a string.
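The normalisation rule argued for above can be sketched as follows (the function name is a hypothetical, not code from the PR; note that Python's float() also accepts exponent forms like '1e5', which PDF real syntax does not allow, so a strict lexer would reject those before this step):

```python
def canonical_real(token: str) -> float:
    """Map a PDF real-number token to one canonical float, so that
    '-.0', '-0.0', '+.0', '+0.00000' and '0.0' all compare equal.
    """
    value = float(token)     # accepts '.0', '-.00000', '+0.0', etc.
    if value == 0.0:         # also matches -0.0, since -0.0 == 0.0
        value = 0.0          # drop the sign: every textual zero -> 0.0
    return value
```

Under this rule the JSON reference stores one number per equivalence class of source spellings, which is the neutrality the author is arguing for.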
