missing data on pdf's with different page sizes #31

Open
jas1 opened this issue Apr 3, 2018 · 1 comment


jas1 commented Apr 3, 2018

After calling pdf_text I get the text back, but some pages are clipped and some data is missing.

It's similar to the landscape problem, but not the same: in this PDF not all pages are the same size. Some data is also missing.

I'm calling the function directly on the file, with no other configuration.

reference issue: #7

Script

# After downloading the file and saving it as 0003_PDF198_206_articulo.pdf
library(pdftools)  # pdf_text()
library(stringr)   # str_replace(), fixed()

current_pdf <- '0003_PDF198_206_articulo.pdf'
pdf_ejemplo <- current_pdf
texto_extraido <- pdf_text(pdf_ejemplo)  # one string per page
# fixed() keeps "." from being treated as a regex wildcard
pdf_output_file_name <- str_replace(current_pdf, fixed(".pdf"), ".txt")
pdf_output_file <- pdf_output_file_name
write.table(x = texto_extraido, file = pdf_output_file, row.names = FALSE, col.names = FALSE, quote = FALSE, fileEncoding = 'UTF-8')
pdf_output_file_name
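
Since the report hinges on pages having different sizes, a quick check of the per-page dimensions may help narrow things down. The sketch below is only an illustration: flag_odd_pages is a hypothetical helper (not part of pdftools), and it assumes a data frame with numeric width and height columns, one row per page, which is the shape pdftools::pdf_pagesize() is assumed to return. The sizes used here are synthetic, not measured from the PDF in this issue.

```r
# Hypothetical helper: flag pages whose size differs from the most common size.
# `sizes` is assumed to have numeric `width` and `height` columns, one row per page.
flag_odd_pages <- function(sizes) {
  # Build a "WIDTHxHEIGHT" key per page, rounded to whole points
  key <- paste(round(sizes$width), round(sizes$height), sep = "x")
  # The most frequent key is treated as the "normal" page size
  most_common <- names(which.max(table(key)))
  data.frame(
    page = seq_len(nrow(sizes)),
    width = sizes$width,
    height = sizes$height,
    differs = key != most_common
  )
}

# Synthetic sizes in points, for illustration only
example_sizes <- data.frame(
  width  = c(595, 595, 842, 595),
  height = c(842, 842, 595, 842)
)
result <- flag_odd_pages(example_sizes)
print(result)

# With pdftools installed, the real sizes could presumably be checked with:
# flag_odd_pages(pdftools::pdf_pagesize("0003_PDF198_206_articulo.pdf"))
```

If the pages flagged as differing line up with the pages where text is clipped or missing, that would support the page-size explanation; if not, the cause is probably elsewhere.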

Data

The example PDF: https://revistas.unlp.edu.ar/raab/article/view/198/206
The output of pdf_text: 0003_PDF198_206_articulo.txt

Some lines are clipped:

  • page 2 (numbered 6): has clipped lines (line 36 of the txt is line 7 of page 2 in the PDF)
  • page 4 (numbered 8): has clipped lines (2nd paragraph, 2nd line)

Some lines are missing:

  • page 2 (numbered 6): has missing lines (1st paragraph, and some words in the 2nd paragraph)
  • page 4 (numbered 8): has missing lines (1st paragraph)
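
To make it easier to go from a line number in the extracted text back to a page, something like the sketch below could be used. locate_line is a hypothetical helper, and it counts lines in the in-memory page strings (one string per page, as pdf_text returns them). The .txt written by write.table may be offset by a blank line per page if the page strings end in a newline, so treat the result as approximate for the file on disk.

```r
# Hypothetical helper: map a line number in the concatenated page text
# to the page it falls on and the line number within that page.
locate_line <- function(pages, n) {
  # Number of lines in each page string
  lines_per_page <- lengths(strsplit(pages, "\n", fixed = TRUE))
  # Last global line number of each page
  ends <- cumsum(lines_per_page)
  page <- which(n <= ends)[1]
  if (is.na(page)) stop("line number is beyond the end of the text")
  start <- if (page == 1) 0 else ends[page - 1]
  c(page = page, line = n - start)
}

# Synthetic two-page example: page 1 has 3 lines, page 2 has 2 lines
pages <- c("a\nb\nc\n", "d\ne\n")
loc <- locate_line(pages, 4)
print(loc)
```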

Thanks in advance! Also, great work with pdftools, love it :D!

jeroen (Member) commented Apr 3, 2018

Copy of the pdf file: document.pdf
