Scraping a PDF #51

Open
psychemedia opened this issue Jun 8, 2016 · 5 comments

Comments

@psychemedia

psychemedia commented Jun 8, 2016

How do I scrape a local PDF?

I'm running:

  • norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb
  • ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

and using one of your test files trying:

norma  -i /contentmineself/trialsjournal_15_1_511.pdf -o /contentmineself/test_ct/

but all it seems to do is copy the pdf and rename it fulltext.pdf?

If I add the switch --transform pdf2html, as per #38, I get:

java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more
0    [main] DEBUG org.xmlcml.cmine.args.DefaultArgProcessor  - option in exception  or --transform; (1,2147483647); parseTransform; STRING: null / []; pdf2html; [pdf2html]
java.lang.RuntimeException: invoke runTransform fails
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1052)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runMethodsOfType(DefaultArgProcessor.java:946)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runRunMethodsOnChosenArgOptions(DefaultArgProcessor.java:927)
    at org.xmlcml.cmine.args.DefaultArgProcessor.runAndOutput(DefaultArgProcessor.java:1111)
    at org.xmlcml.norma.Norma.run(Norma.java:23)
    at org.xmlcml.norma.Norma.main(Norma.java:18)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.xmlcml.cmine.args.DefaultArgProcessor.instantiateAndRunMethod(DefaultArgProcessor.java:1049)
    ... 5 more
Caused by: java.lang.RuntimeException: Input must be reserved file; found: /contentmineself/trialsjournal_15_1_511.pdf
    at org.xmlcml.norma.NormaArgProcessor.checkAndGetInputFile(NormaArgProcessor.java:282)
    at org.xmlcml.norma.NormaTransformer.transform(NormaTransformer.java:114)
    at org.xmlcml.norma.NormaArgProcessor.runTransform(NormaArgProcessor.java:202)
    ... 10 more

My complete install is:

RUN apt-get clean -y && apt-get -y update && apt-get -y upgrade && \
  apt-get -y update && apt-get install -y wget ant unzip openjdk-7-jdk  && \
    apt-get clean -y

RUN wget --no-check-certificate https://github.com/ContentMine/norma/releases/download/v0.2.26/norma_0.1.SNAPSHOT_all.deb

RUN wget --no-check-certificate https://github.com/ContentMine/ami/releases/download/v0.2.24/ami2_0.1.SNAPSHOT_all.deb

RUN dpkg -i norma_0.1.SNAPSHOT_all.deb
RUN dpkg -i ami2_0.1.SNAPSHOT_all.deb

RUN npm install --global getpapers

in a basic linux environment with node installed (Dockerhub image node:4.3.2).

Hmm - is this the issue maybe? #21 (comment)

@petermr
Member

petermr commented Jun 8, 2016

Thanks!

> How do I scrape a local PDF?

Wrong terminology: you have already scraped it. The question is really "How do I transform a PDF to Foo?"

> norma  -i /contentmineself/trialsjournal_15_1_511.pdf -o /contentmineself/test_ct/
>
> but all it seems to do is copy the pdf and rename it fulltext.pdf?

Yes, because no --transform is given, so the only thing it can reasonably do is normalize the name. Did it create a CTree folder for the fulltext.pdf?

> If I add the switch --transform pdf2html, as per #38, I get:

If the first command has set up a CTree with fulltext.pdf it should work on that:

norma  --ctree /contentmineself/trialsjournal_15_1_511_pdf -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt

I would use pdf2txt first as it's more self-contained and less experimental.
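
The two steps described above can be sketched as a small Python helper that only assembles the argument lists. This is an illustration, not part of norma: the flags are the ones quoted in this thread, while the function names are hypothetical and the CTree directory is passed in explicitly because the name norma gives it is not confirmed here.

```python
from typing import List


def normalize_command(pdf_path: str, output_dir: str) -> List[str]:
    """Step 1: with no --transform, norma only copies the PDF in as fulltext.pdf."""
    return ["norma", "-i", pdf_path, "-o", output_dir]


def transform_command(ctree_dir: str, transform: str = "pdf2txt") -> List[str]:
    """Step 2: run a transform on the reserved fulltext.pdf inside an existing CTree."""
    # Derive the output suffix from the transform name (pdf2txt -> txt, pdf2html -> html),
    # matching the output names used in this thread.
    _, _, suffix = transform.partition("2")
    if not suffix:
        raise ValueError(f"unexpected transform name: {transform!r}")
    return [
        "norma",
        "--ctree", ctree_dir,
        "-i", "fulltext.pdf",
        "-o", f"fulltext.pdf.{suffix}",
        "--transform", transform,
    ]


if __name__ == "__main__":
    # Paths are placeholders; substitute the CTree directory that step 1 actually created.
    print(" ".join(normalize_command("/path/to/paper.pdf", "/path/to/output/")))
    print(" ".join(transform_command("/path/to/output/ctree_dir")))
```

Each list could be handed to `subprocess.run(cmd, check=True)` once norma is on the PATH.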

@psychemedia
Author

psychemedia commented Jun 8, 2016

re: terminology - I disagree. For me, scraping is the extraction of content in a structured form from a document where the content is not structured in a form useful for processing as data. So I can scrape a table from an HTML document. In the HTML doc, the table is structured as a table, but not in a form I can usefully process. Under your terms, I guess that's just a transformation of the HTML. But in the vernacular, it's table scraping?
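
To make the table example concrete, here is a minimal sketch of "table scraping" in that sense, using only the Python standard library. The class and function names are hypothetical, the sample HTML is made up for illustration, and nested tables or malformed markup are not handled.

```python
from html.parser import HTMLParser
from typing import List


class TableScraper(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []            # start a fresh row
        elif tag in ("td", "th"):
            self._in_cell = True      # start collecting text for one cell
            self._cell = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(self._row)


def scrape_table(html: str) -> List[List[str]]:
    """Turn table markup into rows of plain strings that can be processed as data."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    sample = "<table><tr><th>Arm</th><th>N</th></tr><tr><td>Control</td><td>12</td></tr></table>"
    print(scrape_table(sample))
```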

Re: the ctree commands - thanks; I'm still not clear on what the pipeline is, what components are available, how to wire them together, and what the intermediate data structures are. Is there something I should read?

@petermr
Member

petermr commented Jun 9, 2016

"scraping" - I've looked at https://en.wikipedia.org/wiki/Data_scraping and agree that "data scraping" could be aligned with our "extraction". I don't think there is a consistent world view. However - rightly or wrongly - we use "scraping" to mean "web scraping" and "extraction" to mean "information extraction" . https://en.wikipedia.org/wiki/Information_extraction .

In CM we have the phases:

  • crawl
  • scrape
  • transform
  • extract / index

Generally, transformation means transforming the document per se rather than extracting bits, though the boundary is woolly: some transformations remove cruft, and some extract tables.

Anyway the more targeted answer is that it should be relatively easy to run PDF2TXT, whereas PDF2SVG2XML is more involved and less predictable.

As always the question is "what do you want to achieve"?

@psychemedia
Author

What do I want to achieve?

  • get enough of a clue about how to call the different contentmine tools in an appropriate order so I can get a feel for what they do, how they work together and how I might be able to start appropriating them;
  • task wise: one is to see how easy it is to then add "filters" for scraping new classes of regular PDFs (eg PDFs from a particular journal, or published in a particular style, eg Parliamentary Library briefing docs, perhaps?); my feeling is that I should be able to take this quite far?
  • the other is more general: explore whether those tools help speed up getting data out of a random collection of arbitrarily and independently styled pdf docs, such as reports from across government or the NHS; because of the arbitrary/independent nature of the doc formats, I don't expect this to result in a fully automated pipeline, but I'm interested to see what bits I might be able to usefully do; eg trying to parse data out of charts, other than just running OCR over them, or trying to extract their captions in the report text to act as image metadata for an image gallery generated from the doc, would be a start.
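
As a rough sketch of the caption idea, assuming the text has already been extracted (for example with pdf2txt) and that captions begin a line with "Figure N" or "Fig. N". That layout is an assumption and will not hold for every document style. The function name is hypothetical, and a body sentence that happens to start a line with "Figure 1 shows ..." would be picked up by mistake.

```python
import re
from typing import Dict

# A line that starts with "Figure 3:", "Fig. 3", "figure 3." and so on, followed by caption text.
CAPTION_RE = re.compile(r"^\s*Fig(?:ure|\.)?\s*(\d+)[.:]?\s+(.*\S)\s*$", re.IGNORECASE)


def extract_captions(text: str) -> Dict[int, str]:
    """Map figure number to the first caption-like line found for that number."""
    captions: Dict[int, str] = {}
    for line in text.splitlines():
        match = CAPTION_RE.match(line)
        if match:
            # Keep the first hit per figure number; later hits are ignored.
            captions.setdefault(int(match.group(1)), match.group(2))
    return captions
```

The resulting dictionary could then be joined against the extracted image files to label a gallery, though matching figure numbers to image files is a separate problem not covered here.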

@psychemedia
Author

psychemedia commented Jun 9, 2016

Re: the pathway. Running:

norma --project /contentmineself/test -i /contentmineself/test/trialsjournal_15_1_511.pdf -o /contentmineself/test/
norma --project /contentmineself/test --ctree /contentmineself/test/trialsjournal_15_1_511 -i fulltext.pdf -o fulltext.pdf.html --transform pdf2html

I get a verbose output log:

.0    [main] DEBUG org.xmlcml.svg2xml.pdf.PDFAnalyzer  - running /contentmineself/test/trialsjournal_15_1_511/fulltext.pdf to target/svg/fulltext
1 = 3275 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: null
3275 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
3289 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: i
6268 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: BDC
7123 [main] INFO  org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled operation: EMC
7368 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
2 = 7620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
7620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
9726 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
3 = 9991 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
9991 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
10924 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
4 = 10973 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
10973 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
11691 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
5 = 11730 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
11730 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
12396 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
6 = 12406 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
12406 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
12620 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
7 = 12634 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
12634 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13011 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
8 = 13025 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
13025 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13301 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream
9 = 13316 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - pageSize: java.awt.Dimension[width=595,height=793]
13316 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - startStream
13538 [main] DEBUG org.xmlcml.pdf2svg.PDFPage2SVGConverter  - endStream

13568 [main] DEBUG org.xmlcml.svg2xml.pdf.PDFAnalyzer  - target/svg/fulltext files: 9
0~21759 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page1.svg
1~27272 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page2.svg
2~30831 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page3.svg
3~34574 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page4.svg
4~38269 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page5.svg
5~42727 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page6.svg
6~46832 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page7.svg
7~56291 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page8.svg
8~57855 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - Path: /target/svg/fulltext/page9.svg
57880 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.0.svg
58905 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.0.svg
59626 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.3.svg
60783 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.1.3.svg
<1><2>62996 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.2.svg
63044 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.2.svg
63188 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.12.svg
63228 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.3.12.svg
<3><4>64570 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.5.2.svg
64832 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.5.2.svg
<5><6><7><8>69238 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.9.3.svg
69485 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - generated filename target/svg/fulltext/image.g.9.3.svg
<9>69738 [main] DEBUG org.xmlcml.svg2xml.page.PageIO  - writing to target/output/fulltext/TEXT.0.html
.

which looks as if it created some output? But running eg:

find / -name 'TEXT.0.html' 2>/dev/null

to try to find out where the file was placed returns nothing? So where did the output files go? Or do I have a write permissions issue somewhere?

(Running with --transform pdf2txt worked fine, and I could see the extracted text file...)
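
One way to narrow this down is to pull the paths out of the log and test them against a candidate directory. The paths in the log are relative (`target/output/fulltext/TEXT.0.html`), which suggests they would be resolved against the directory norma was started from, but that is an assumption, not something confirmed in this thread. The function names below are hypothetical.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches the "- writing to <path>" and "- generated filename <path>" messages seen in the log above.
WRITE_RE = re.compile(r"-\s+(?:writing to|generated filename)\s+(\S+)")


def logged_output_paths(log_text: str) -> List[str]:
    """Return the unique paths the log claims to have written, in order of appearance."""
    seen: List[str] = []
    for line in log_text.splitlines():
        match = WRITE_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def check_outputs(log_text: str, working_dir: str) -> Dict[str, bool]:
    """Report, for each logged path, whether it exists relative to working_dir."""
    base = Path(working_dir)
    # An absolute logged path overrides base when joined, so it is checked as-is.
    return {p: (base / p).exists() for p in logged_output_paths(log_text)}
```

If the log is saved to a file, `check_outputs(log_text, "/some/dir")` can be tried against each plausible working directory. If every entry comes back `False`, the files were probably never written or were cleaned up afterwards.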
