Attachments in PDF failing to hyperlink #900

opoudjis · 2024-07-26T02:12:30Z

In #898 I have had to do some debugging of attachments, to make it possible to compile an Asciidoctor document with attachments outside of the working directory.

This has worked for HTML, which now finds the attachments. But the PDF has stopped linking to attachments.

What is perplexing is that:

- the commit where links worked for Ronald and the commit where they did not generate an identical XML representation of the attachment;
- I am compiling from the same commit, and the PDF I generate does not hyperlink.

Which makes me suspect this is not a matter of my code, but of processing constraints on the PDF.

I am sending the 200 MB Presentation XML on Skype for you to look at. @ronaldtse will be able to send you different iterations of the document in question.

Intelligent2013 · 2024-07-26T06:54:43Z

I've generated the PDF, and only one attachment is present in it: READY-20230316-no-toc-iso-10303-49.pdf.

This attachment is encoded in the Presentation XML as:

	<metanorma-extension>
...
		<attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBERi
...

    <p id="_bed0f9b3-394f-9910-dab9-8f46f0cb958b">Trial PDF document: <link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">10303-49/READY-20230316-no-toc-iso-10303-49.pdf</link>
...
	<bibliography>
		<references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
			<title depth="1">Bibliography</title>
			<bibitem id="attachment-10303-49-trial" hidden="true">
				<formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
				<uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
				<uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
				<docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
			</bibitem>
		</references>
	</bibliography>

There are also link elements pointing to files that should be attached to the PDF as well:

<p id="_3c1b569d-6058-5228-5c17-0c06c39a7da7">PDF document comparison report: <link target="10303-49-comparison-report.pdf"/>
...
<p id="_a9f03ffe-d062-d97b-a425-e9e45692f302">Annotated EXPRESS schema: <link target="10303-49/method_definition_schema/method_definition_schema.exp"/>

I need to update the XSLT for this case. To distinguish these from links to external entities such as <link target="https://github.com/metanorma/iso-10303-detached-docs/issues/187"/>, I'll add a rule: if link/@target doesn't start with https, http, www or ftp, then @target points to a file that should be attached to the PDF.
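The proposed rule can be sketched outside the XSLT as follows. This is only an illustration in Ruby, not the stylesheet code: classify_link_target and EXTERNAL_PREFIXES are hypothetical names, and the prefix list is taken directly from the comment above.

```ruby
# Hypothetical helper mirroring the rule described above; not the actual XSLT.
EXTERNAL_PREFIXES = %w[https http www ftp].freeze

# Returns :external for web-style targets and :attachment for everything else.
def classify_link_target(target)
  target.start_with?(*EXTERNAL_PREFIXES) ? :external : :attachment
end

classify_link_target("https://github.com/metanorma/iso-10303-detached-docs/issues/187") # => :external
classify_link_target("10303-49-comparison-report.pdf")                                   # => :attachment
```

One consequence of a pure prefix test is that a local file whose name happens to begin with one of these strings (for example www-notes.pdf) would be treated as external.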

There are also xref elements with the attachment- prefix:

<p id="_be27e7cc-b2c2-f0d7-8ccb-e2d32357c97f">Trial PDF document: <xref target="attachment-10303-50-trial">[attachment-10303-50-trial]</xref>

@opoudjis how should such an xref be processed? How can I determine that an xref points to a file rather than to an internal id? By checking that @target starts with attachment-?

ronaldtse · 2024-07-26T08:01:29Z

It is correct to have only 1 attachment. I can provide another file for you in which I have linked the attachments but they are not attached.

There are two types of links.

A link to an attachment. This is a link that will open an attachment in the PDF. In HTML, it will open an external file.
An external link to whatever file, could be PDF, HTML, or any other format. In PDF it is only a path that will open a file in the file system.

opoudjis · 2024-07-26T08:02:56Z

I think part of the problem is that not all the attachments that were supposed to be there were, so the links weren't properly generated. (That might even be the case in the large file I also sent.)

Since I am addressing both HTML and DOC, should link/target be the same as attachment/name, so that you know which attachment is which? Or is the current arrangement workable?

If you see an xref, it simply is not an attachment, because the attachment has not been loaded in: attachments are loaded in via the bibliography. If the attachment had been loaded in, it would be showing up as an eref => link. You can ignore xref as an error in the underlying markup.

Intelligent2013 · 2024-07-26T08:26:15Z

> There are two types of links.
>
> A link to an attachment. This is a link that will open an attachment in the PDF. In HTML, it will open an external file.

It's working in the PDF.

> An external link to whatever file, could be PDF, HTML, or any other format. In PDF it is only a path that will open a file in the file system.

It's also working in the PDF.

> I can provide another file for you that I have linked the attachments but they are not attached.

@ronaldtse yes, it would be helpful.

Intelligent2013 · 2024-07-26T13:25:57Z

In my PDF, the link points to the embedded object.

In the PDF generated by @ronaldtse, the link points to the external file.

I'll investigate it.

Intelligent2013 · 2024-07-26T14:20:23Z

How the attachment mechanism currently works in the XSLT:

The Presentation XML contains:

attachment with name READY-20230316-no-toc-iso-10303-49.pdf:

      <metanorma-extension>...
		<attachment name="READY-20230316-no-toc-iso-10303-49.pdf">data:application/pdf;base64,JVBER...

the link with reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf:

       <link target="_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf">

bibitem with uri[@type="attachment"] = _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf

...
	<bibliography>
		<references id="_bibliography" normative="false" obligation="informative" hidden="true" displayorder="9">
			<title depth="1">Bibliography</title>
			<bibitem id="attachment-10303-49-trial" hidden="true">
				<formattedref format="application/x-isodoc+xml">[NO INFORMATION AVAILABLE]</formattedref>
				<uri type="attachment">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
				<uri type="citation">_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf</uri>
				<docidentifier type="metanorma">[10303-49/READY-20230316-no-toc-iso-10303-49.pdf]</docidentifier>
			</bibitem>
		</references>
	</bibliography>

I.e. there is no explicit relationship between the attachment READY-20230316-no-toc-iso-10303-49.pdf and the link reference _document_attachments/READY-20230316-no-toc-iso-10303-49.pdf.

Therefore, the XSLT performs the following steps:

1. Get the input XML name without the .presentation.xml or .xml suffix, for instance document.
2. Add _ at the start and _attachments at the end: _document_attachments.
3. If link/@target starts with _document_attachments/, take the string after _document_attachments/, i.e. READY-20230316-no-toc-iso-10303-49.pdf.
4. Add a link to the PDF embedded file READY-20230316-no-toc-iso-10303-49.pdf.

The code:

	<xsl:template match="*[local-name()='link']" name="link">
			...
				<xsl:when test="contains(@target, concat('_', $inputxml_filename_prefix, '_attachments'))">
					<!-- link to the PDF attachment -->
					<xsl:variable name="target_" select="translate(@target, '\', '/')"/>
					<xsl:variable name="target__" select="substring-after($target_, concat('_', $inputxml_filename_prefix, '_attachments', '/'))"/>
					<xsl:value-of select="concat('url(embedded-file:', $target__, ')')"/>
				</xsl:when>

But if the input XML filename isn't document.presentation.xml or document.xml, this mechanism doesn't work, and link/@target will point to the external file.
So it looks like the input XML isn't named document.presentation.xml.
I have to change the XSLT, but I don't yet clearly understand how.
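To make the failure mode concrete, here is a rough Ruby rendering of the four steps and of what happens when the input filename differs. It is a sketch under assumptions, not the stylesheet: attachments_dir and embedded_file_url are hypothetical names, and it uses a starts-with test as in the step description, whereas the XSLT excerpt uses contains().

```ruby
# Hypothetical helpers; the logic follows the steps described in this comment.

# Strip the .presentation.xml or .xml suffix, then wrap the name as _NAME_attachments.
def attachments_dir(input_xml_filename)
  base = File.basename(input_xml_filename).sub(/(\.presentation)?\.xml\z/, "")
  "_#{base}_attachments"
end

# Return the embedded-file URL, or nil when the target is not under the
# attachments directory derived from the input filename.
def embedded_file_url(link_target, input_xml_filename)
  prefix = "#{attachments_dir(input_xml_filename)}/"
  target = link_target.gsub("\\", "/") # normalise Windows separators, as the XSLT does
  return nil unless target.start_with?(prefix)

  "url(embedded-file:#{target.delete_prefix(prefix)})"
end

target = "_document_attachments/READY-20230316-no-toc-iso-10303-49.pdf"
puts embedded_file_url(target, "document.presentation.xml")
puts embedded_file_url(target, "other.presentation.xml").inspect
```

With document.presentation.xml as input the target resolves to an embedded file; with any other input name the derived prefix no longer matches, so the sketch returns nil, which corresponds to the link falling through to the external-file case.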

@opoudjis a question: is _document_attachments/ a fixed prefix for attached files, or does it depend on the input adoc? I.e. for test.adoc, will the prefix in link/@target in the Presentation XML be _test_attachments/ or _document_attachments/?

I've found a second issue with links. If there is a comment note on the page, then none of the references work, i.e. they are shown as blue text without links (the mouse pointer doesn't change on mouse over).

This issue isn't related to the XSLT. Something is wrong in the PDFBox post-processing for notes.

opoudjis · 2024-07-26T14:59:59Z

Can I get back to this query on Monday? I'm going out of town for the weekend. The prefix is indeed _{document-name}_attachments/{attachment-name}, which is why I suggested above that I make the name attribute in the attachment the same as the target attribute in the link, so that you do know they are the same. Looks like that is the right thing to do.

Intelligent2013 · 2024-07-26T15:28:23Z

@opoudjis ok.

Intelligent2013 · 2024-07-26T20:06:37Z

> I've found second issue with links. If there is a comment note on the page, then all references are not working,

Fixed in mn2pdf (https://github.com/metanorma/mn2pdf/releases/tag/v1.96.)


Intelligent2013 · 2024-07-27T18:08:32Z

I've updated common.xsl to process PDF attachments correctly when attachment/@name and link/@target are not equal. @opoudjis so no need to fix it urgently.

I've found another bug. The attachments

READY-20230316-no-toc-iso-10303-50.pdf
READY-20230316-no-toc-iso-10303-104.pdf

are broken. Adobe Acrobat shows an error when attempting to open them.

The content of both PDFs is truncated (it doesn't end with %%EOF).

The reason: the text content of the element
<attachment name="READY-20230316-no-toc-iso-10303-50.pdf">data:application/pdf;base64,... is exactly 10000000 bytes. It looks like there is a 10 MB limit somewhere in the XML API. Ping @opoudjis.
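A truncation like this can be checked mechanically by decoding the element text and looking for the end-of-file marker. The following is a minimal Ruby sketch, assuming the element text is a data: URI of the form shown above; truncated_pdf? is a hypothetical helper, and the sample PDF bytes are a tiny stand-in rather than a real document.

```ruby
# Hypothetical helper: decodes a data: URI body and checks for the PDF end-of-file marker.
def truncated_pdf?(data_uri)
  base64 = data_uri.sub(%r{\Adata:application/pdf;base64,}, "")
  bytes = base64.unpack1("m") # lenient Base64 decoding; tolerates line breaks
  # A complete PDF ends with %%EOF, optionally followed by whitespace.
  !bytes.rstrip.end_with?("%%EOF")
end

# Tiny stand-in for a PDF, just enough to exercise the check.
sample = "%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n"
uri = "data:application/pdf;base64," + [sample].pack("m0")

puts truncated_pdf?(uri)          # => false
puts truncated_pdf?(uri[0...-12]) # => true (Base64 text cut short)
```

The check only looks at the tail of the decoded bytes, so it says nothing about whether the rest of the file is valid.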

opoudjis · 2024-07-29T09:21:00Z

Hm.

I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.

The 10 MB limit is a surprise to me, and I don't think it's my doing. I have recently imposed a 10 MB limit on images, but that should result in crashes, not truncation. Will investigate.

opoudjis · 2024-07-29T10:33:03Z

The 10 MB limit is indeed in Nokogiri, and it still applies when I changed the code to append the string as a child. I am going to have to introduce linebreaks.

Odd that Nokogiri does not have this issue with XML attributes...

opoudjis · 2024-07-29T11:34:46Z

Nokogiri::XML(file, &:huge) might take care of it; I don't use it in standoc (to my surprise), though I do in metanorma collections. But having a 10 MB long line is asking for trouble anyway, so I will break it up into lines of 60 characters, per the older Base64 spec.
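The line-breaking idea can be sketched in plain Ruby as below. This is an illustration only, not the code committed to standoc or isodoc: wrapped_base64 is a hypothetical helper, and the width of 60 is taken from the comment above. For reference, Ruby's own Base64.encode64 also emits 60-character lines, while MIME (RFC 2045) allows up to 76 and PEM uses 64.

```ruby
# Hypothetical helper: encodes bytes as Base64 and breaks the text into fixed-width lines.
def wrapped_base64(bytes, width = 60)
  encoded = [bytes].pack("m0") # "m0" yields Base64 with no line breaks
  encoded.scan(/.{1,#{width}}/).join("\n")
end

payload = "x" * 100
wrapped = wrapped_base64(payload)

puts wrapped.lines.map { |line| line.chomp.length }.inspect # => [60, 60, 16]
# Lenient decoding skips the inserted newlines, so the round trip is lossless.
puts wrapped.unpack1("m") == payload                        # => true
```

Any consumer of the wrapped text has to tolerate whitespace inside the Base64 payload when decoding.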

opoudjis · 2024-07-29T12:06:12Z

... Still didn't work... Having to add it one line at a time in Nokogiri.

opoudjis · 2024-07-29T12:29:32Z

Works. Will generate entire document and pass it to you.

Intelligent2013 · 2024-07-29T13:04:14Z

Very strange: Adobe Reader shows only 1 (first) page for the 86 MB document.pdf. I'll investigate it.

Intelligent2013 · 2024-07-29T13:18:46Z

mn2pdf ends with an error on my machine:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

or

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The Presentation XML size is 141 MB.
I'll try to increase the maximum memory just for PDF generation.


Intelligent2013 · 2024-07-29T15:58:00Z

> I'm going to fix the attachment link anyway, though it may make life more complicated for HTML.

common.xsl has been updated to process the explicit link from xref/@target to attachment/@name.

> Very strange, Adobe Reader shows only 1 (first) page for 86Mb document.pdf. I'll investigate it.

I don't understand why the PDF generated by @opoudjis contains only 1 page.

> Works. Will generate entire document and pass it to you.

I've generated the PDF with the Java heap space increased to 5 GB, and can confirm that the PDF correctly contains all PDF attachments.

> mn2pdf ends with the error on my machine:
>
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> or
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The error occurs with the 141 MB Presentation XML, but the old 193 MB Presentation XML is processed correctly.

So, currently there is only one issue, with Java heap space.


Intelligent2013 · 2024-07-29T19:10:14Z

> I don't understand why the PDF generated by @opoudjis contains only 1 page:

After a few attempts I've generated a PDF (86 MB) with 1 page. The log contains java.lang.OutOfMemoryError: Java heap space:

file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.OutOfMemoryError: Java heap space
file:/D:/Work/Metanorma/XML/ISO/ISO198_Hyperlinks/iso.international-standard.xsl; Line #16976; Column #203; java.lang.NullPointerException
...
Rendered page #1.
Bookmarks: Unresolved ID reference "_conclusion_3" found.
Bookmarks: Unresolved ID reference "_conclusion_4" found.
Bookmarks: Unresolved ID reference "_assembly_constraint_schema_schema" found.
...
Can't highlight the text ''.
Can't highlight the text ''.
Can't highlight the text ''.
...
Error parsing annotation information [null]. Annotation ignored
java.io.IOException: Error: wrong amount of numbers in attribute 'rect'
        at org.apache.pdfbox.pdmodel.fdf.FDFAnnotation.<init>(FDFAnnotation.java:205)
        at org.apache.pdfbox.pdmodel.fdf.FDFAnnotationText.<init>(FDFAnnotationText.java:67)
        at org.apache.pdfbox.pdmodel.fdf.FDFDictionary.<init>(FDFDictionary.java:155)
        at org.apache.pdfbox.pdmodel.fdf.FDFCatalog.<init>(FDFCatalog.java:63)
        at org.apache.pdfbox.pdmodel.fdf.FDFDocument.<init>(FDFDocument.java:90)
        at org.apache.pdfbox.pdmodel.fdf.FDFDocument.loadXFDF(FDFDocument.java:241)
        at org.metanorma.fop.annotations.Annotation.process(Annotation.java:260)
        at org.metanorma.fop.PDFGenerator.runFOP(PDFGenerator.java:700)
        at org.metanorma.fop.PDFGenerator.convertmn2pdf(PDFGenerator.java:493)
        at org.metanorma.fop.PDFGenerator.process(PDFGenerator.java:311)
        at org.metanorma.fop.mn2pdf.main(mn2pdf.java:350)
...

but the process didn't end abnormally, and the PDF was generated with 1 page. So this is exactly the Java heap space error.

> So, currently there is only one issue with Java heap space.

common.xsl has been optimized, and the PDF is now generated successfully.

opoudjis · 2024-08-15T10:07:44Z

FYI @Intelligent2013 it has just run out of heap space on my side again, but IMO 100MB of PDF attachments are unreasonable to compile into a PDF to begin with...

Intelligent2013 · 2024-08-15T13:01:29Z

> FYI @Intelligent2013 it has just run out of heap space on my side again, but IMO 100MB of PDF attachments are unreasonable to compile into a PDF to begin with...

@opoudjis could you share the Presentation XML via Dropbox or similar? Thanks!

Intelligent2013 · 2024-08-15T18:24:34Z

@opoudjis thank you! I also get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space with the 144 MB Presentation XML. But the PDF for the previous version (141 MB) generates OK. I'll investigate it.

Intelligent2013 · 2024-08-16T18:27:27Z

@opoudjis the issue Exception in thread "main" java.lang.OutOfMemoryError: Java heap space is fixed in the XSLT.

opoudjis added the bug label Jul 26, 2024

opoudjis assigned Intelligent2013 Jul 26, 2024

opoudjis added this to Metanorma Jul 26, 2024

github-project-automation bot moved this to 🆕 New in Metanorma Jul 26, 2024

Intelligent2013 mentioned this issue Jul 26, 2024

Links disappear after comments adding metanorma/mn2pdf#259

Closed

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 27, 2024

common.xsl updated for PDF attachments, metanorma/metanorma-standoc#900

0d1af74

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 27, 2024

Merge pull request #725 from metanorma/fix/attachments

978534b

common.xsl updated for PDF attachments, metanorma/metanorma-standoc#900

opoudjis added a commit to metanorma/isodoc that referenced this issue Jul 29, 2024

match eref and attachment name: metanorma/metanorma-standoc#900

0ce20f2

opoudjis added a commit to metanorma/isodoc that referenced this issue Jul 29, 2024

allow linebreaks in attachment Bin64: metanorma/metanorma-standoc#900

39d63b1

opoudjis added a commit that referenced this issue Jul 29, 2024

match eref and attachment name: #900

de8a0c2

opoudjis added a commit that referenced this issue Jul 29, 2024

line breaks in attachment Base64 encoding: #900

7e4a5a4

opoudjis added a commit that referenced this issue Jul 29, 2024

line breaks in attachment Base64 encoding: #900

bb57d07

opoudjis added a commit to metanorma/isodoc that referenced this issue Jul 29, 2024

allow linebreaks in attachment Bin64: metanorma/metanorma-standoc#900

eadbfae

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 29, 2024

common.xsl updated for PDF attachments, metanorma/metanorma-standoc#900

857cee7

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 29, 2024

Merge pull request #726 from metanorma/fix/attachments

e445b04

common.xsl updated for PDF attachments, metanorma/metanorma-standoc#900

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 29, 2024

common.xsl optimized for PDF attachments, metanorma/metanorma-standoc#900

a27a6dc

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 29, 2024

common.xsl, error fix, metanorma/metanorma-standoc#900

d881f0c

Intelligent2013 added a commit to metanorma/mn-native-pdf that referenced this issue Jul 29, 2024

Merge pull request #728 from metanorma/fix/attachments

d89756d

common.xsl, error fix, metanorma/metanorma-standoc#900

Intelligent2013 closed this as completed Jul 30, 2024

github-project-automation bot moved this from 🆕 New to ✅ Done in Metanorma Jul 30, 2024

Intelligent2013 mentioned this issue Aug 15, 2024

(URGENT) An attachment doesn’t work in a browser. metanorma/metanorma-iso#1203

Closed

Intelligent2013 mentioned this issue Aug 15, 2024

PDF ISO: java.lang.OutOfMemoryError metanorma/mn-native-pdf#730

Closed
