How to Decode Content Streams (for Text/Paragraph Parsing)? #296
Comments
@cshenks Can you please provide an example document I can run this script on? |
@Hopding here's the document I've been using. |
The content streams are not legible because they are all encoded, so you'll need to decode them before you can process their contents. Fortunately, pdf-lib already has a function for this: see pdf-lib/src/core/streams/decode.ts, lines 48 to 69 in 9535e35.
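To make "decoding" concrete: most content streams are compressed with the FlateDecode filter, which is zlib compression. The following is only a minimal sketch of that one case, using Node's built-in zlib module rather than pdf-lib. The helper name `inflateToString` is hypothetical, and the sketch assumes a stream with a single /FlateDecode filter and no predictor parameters.

```typescript
import { deflateSync, inflateSync } from 'zlib';

// Hypothetical helper (not part of pdf-lib): inflates the raw bytes of a
// stream whose /Filter is /FlateDecode and returns them as a string with
// one character per byte.
const inflateToString = (encoded: Uint8Array): string =>
  inflateSync(encoded).toString('latin1');

// Round trip with a made-up content stream, just to show the effect.
const original = 'BT /F1 12 Tf (Hello) Tj ET';
const encoded = deflateSync(original); // what would be stored in the PDF
console.log(inflateToString(encoded)); // the readable operators again
```

A real decoder also has to handle other filters, filter chains, and decode parameters, which this sketch ignores.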
However, pdf-lib's decoding function is not exported, as it has only been used internally up to now. So if you're using the UMD modules, you can't really access it. But if you're using the NPM package, you can pull it out of the package's internal modules, as the import in the example below shows. I modified your example to use it:
import fs from 'fs';
import {
  arrayAsString,
  PDFArray,
  PDFDict,
  PDFDocument,
  PDFName,
  PDFNumber,
  PDFPageLeaf,
  PDFRawStream,
  PDFRef,
} from 'pdf-lib';

// Note that this little guy isn't really accessible in the UMD modules, as he
// is not exported to the root, as of `pdf-lib@1.3.0`. But perhaps this will
// change in the next release.
import { decodePDFRawStream } from 'pdf-lib/cjs/core/streams/decode';

// Matches the marked-content sequence (`<<... /MCID n ...>> BDC ... EMC`) for
// the given MCID and captures everything between `BDC` and `EMC`.
const markedContentRegex = (mcid: number) =>
  new RegExp(`<<[^]*\\/MCID[\\0\\t\\n\\f\\r\\ ]*${mcid}[^]*>>[^]*BDC([^]*)EMC`);

const extractMarkedContent = (mcid: number, contentStream: string) => {
  const regex = markedContentRegex(mcid);
  const res = contentStream.match(regex);
  return res?.[1];
};

const traverseStructTree = (root: PDFDict) => {
  const kidsRef = root.get(PDFName.of('K'));
  const structElementType = root.get(PDFName.of('S'));
  const paragraphType = PDFName.of('P');

  if (structElementType === paragraphType) {
    // TODO: What if this isn't a `PDFPageLeaf`?
    const page = root.lookup(PDFName.of('Pg')) as PDFPageLeaf;

    // TODO: What if this isn't a `PDFRawStream`?
    const contents = page.Contents() as PDFRawStream;

    // TODO: What if this isn't a `PDFNumber`?
    const markedContentIdentifier = kidsRef! as PDFNumber;
    const mcid = markedContentIdentifier.value();

    console.log(`------- Marked Content (id=${mcid}) --------`);
    const decodedBytes = decodePDFRawStream(contents).decode();
    const decodedString = arrayAsString(decodedBytes);
    const content = extractMarkedContent(mcid, decodedString);
    console.log(content);
    console.log(`-------- End (id=${mcid}) ---------`);
    console.log();
  }

  let node;
  if (!kidsRef || kidsRef instanceof PDFNumber) return;
  if (kidsRef instanceof PDFRef) {
    node = root.context.lookup(kidsRef, PDFDict);
    traverseStructTree(node);
  } else if (kidsRef instanceof PDFArray) {
    for (let idx = 0, len = kidsRef.size(); idx < len; idx++) {
      const nodeRef = kidsRef.get(idx);
      node = root.context.lookup(nodeRef);
      // Skip kids that are not dictionaries (e.g. bare MCID numbers) instead
      // of abandoning the remaining siblings.
      if (!(node instanceof PDFDict)) continue;
      traverseStructTree(node);
    }
  }
};

(async () => {
  const pdfDoc = await PDFDocument.load(fs.readFileSync('f1099msc.pdf'));
  const structTreeRoot = pdfDoc.catalog.lookup(
    PDFName.of('StructTreeRoot'),
    PDFDict,
  );
  traverseStructTree(structTreeRoot);
})();
Running this prints the contents of each paragraph's marked-content sequence between "Marked Content (id=N)" and "End (id=N)" banner lines.
To obtain sentences/paragraphs of text, you'll need to parse and process the graphics operators in the marked content streams.
The above example is written in TypeScript. I also created a working NPM script you can use: extract-marked-content.zip
There are a couple of important things to note about this script/example:
I hope this helps. Please let me know if you have any additional questions! |
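As a rough illustration of what "parsing the graphics operators" can look like, here is a minimal sketch that pulls the string operands of the `Tj` and `TJ` text-showing operators out of an already decoded content stream (or out of one marked-content section of it). The names `extractShownText` and `unescapeLiteral` are hypothetical and not part of pdf-lib. The sketch assumes literal `(...)` strings in a simple single-byte encoding; hex strings, nested unescaped parentheses, the `'` and `"` operators, and font-specific encodings (for example, fonts that need a ToUnicode map) are not handled.

```typescript
// One PDF literal string such as (Hello \(world\)). Unescaped nested
// parentheses are not handled; that is a simplification of this sketch.
const LITERAL = String.raw`\((?:\\[\s\S]|[^\\()])*\)`;

const ESCAPES: Record<string, string> = {
  n: '\n', r: '\r', t: '\t', b: '\b', f: '\f',
};

// Strips the enclosing parentheses and resolves backslash escapes.
const unescapeLiteral = (literal: string): string =>
  literal
    .slice(1, -1)
    .replace(/\\([0-7]{1,3}|\r\n|[\s\S])/g, (_match: string, esc: string) => {
      // Octal character code, e.g. \051 for ")".
      if (/^[0-7]/.test(esc)) return String.fromCharCode(parseInt(esc, 8) & 0xff);
      // Backslash + end of line is a line continuation.
      if (esc === '\r\n' || esc === '\r' || esc === '\n') return '';
      // \( \) \\ and unknown escapes yield the character itself.
      return ESCAPES[esc] ?? esc;
    });

// Returns one entry per `(string) Tj` or `[ (str) -120 (ing) ] TJ` operation.
const extractShownText = (content: string): string[] => {
  const showText = new RegExp(
    String.raw`(${LITERAL})\s*Tj|\[((?:${LITERAL}|[^\]()])*)\]\s*TJ`,
    'g',
  );
  const runs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = showText.exec(content)) !== null) {
    if (match[1] !== undefined) {
      runs.push(unescapeLiteral(match[1]));
    } else {
      // TJ arrays mix strings with kerning numbers; keep only the strings.
      const parts = (match[2] ?? '').match(new RegExp(LITERAL, 'g')) ?? [];
      runs.push(parts.map(unescapeLiteral).join(''));
    }
  }
  return runs;
};

// Made-up marked-content section, for illustration only.
const sample = [
  '/P <</MCID 0>> BDC',
  'BT /F1 12 Tf 72 720 Td',
  '(Hello, ) Tj',
  '[(wor) -20 (ld) 15 (\\(PDF\\))] TJ',
  'ET',
  'EMC',
].join('\n');

console.log(extractShownText(sample)); // [ 'Hello, ', 'world(PDF)' ]
```

Turning these runs into sentences or paragraphs still requires tracking the text-positioning operators (such as `Td` and `Tm`) to decide where spaces and line breaks belong.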
Do you have any plans to add a simple function to get page content objects? |
@Hopding |
Following up on #137, I would also like to use pdf-lib to extract and modify the text content of PDFs. I've been looking into traversing the structure tree to identify paragraphs. I've been able to accomplish this, but when I reach a structure element dictionary whose kids array contains references to portions of a page content stream, I've been unable to figure out how to convert that portion of the content stream into readable text. Is this doable?