Implement more pdf importers #7947

Merged
merged 51 commits into from
Aug 18, 2021
Changes from 47 commits
Commits
8a7b80f
GrobidPdfMetadataImporter implemented
btut Jul 20, 2021
6fe2a23
Fixed class when accessing resources
btut Jul 20, 2021
f99bc52
Use FileHelper method to get extension
btut Jul 28, 2021
1d64d80
Use jsoup to issue POST request
btut Jul 28, 2021
f591bfc
Removed unnecessary field
btut Jul 28, 2021
b2bd365
Reverted URLDownload
btut Jul 28, 2021
e458c77
Changelog entry
btut Jul 30, 2021
5478585
Add pdf link to imported entry
btut Jul 30, 2021
d0cc663
Remove citationkey from Grobid
btut Jul 30, 2021
2cd78fc
FirstPageImporter
btut Jul 30, 2021
eb22157
Fixed grammar mistake in CHANGELOG.md
btut Jul 30, 2021
3ac0094
Fixed Grobid tests
btut Jul 30, 2021
c87ed4e
Fixed Grobid URL
btut Jul 30, 2021
3d8c4da
Checkstyle
btut Jul 30, 2021
168b866
Fixed doc
btut Jul 30, 2021
42adea9
Checkstyle
btut Jul 30, 2021
73dc505
Use JSoup for plaintext citations as well
btut Aug 1, 2021
7ce7105
Renamed FirstPageImporter to PdfVerbatimBibTextImporter
btut Aug 4, 2021
53d8e9a
Fixed getName (no importer)
btut Aug 4, 2021
9080f14
Renamed Grobid importer to match convention
btut Aug 4, 2021
2757be6
PdfEmbeddedBibTeXImporter
btut Aug 5, 2021
8a05c3e
Renamed PdfEmbeddedBibTeXImporter to PdfEmbeddedBibFileImporter
btut Aug 5, 2021
0c488ec
Checkstyle
btut Aug 5, 2021
02057f0
Remove debug output
btut Aug 5, 2021
3d66855
Checkstyle
btut Aug 5, 2021
fd8918b
PdfMergeMetadataImporter
btut Aug 5, 2021
56868f5
Add DOI and ISBN fetching in PdfMergeMetadataImporter
btut Aug 5, 2021
479a0bc
Fixed concurrent list access
btut Aug 5, 2021
cb6a910
Adapted tests to contain fetchable ID's
btut Aug 5, 2021
e18eabd
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 10, 2021
1bf6409
Derive XMP preferences from importFormatPreferences
btut Aug 10, 2021
787e040
Localization
btut Aug 10, 2021
a3cdff9
Use Importers in JabRef
btut Aug 10, 2021
564988a
Remove unnecessary test documents
btut Aug 10, 2021
e3d279a
Checkstyle
btut Aug 11, 2021
04eecaf
Grobid Timeout
btut Aug 14, 2021
b7e5b62
Null-check
btut Aug 14, 2021
5cbf919
Use MergeImporter as WebFetcher
btut Aug 14, 2021
1cb4dfc
Only force BibTeX import if everything else fails
btut Aug 16, 2021
3ab8ebb
Prioritize non-bruteforce importers that
btut Aug 16, 2021
7ba8b40
Checkstyle
btut Aug 16, 2021
18dbb67
Fixed WebFetchersTest
btut Aug 16, 2021
3d46df4
Grobid does not need localization
btut Aug 16, 2021
40b2759
Followup on removed Grobid localization
btut Aug 16, 2021
6324cf2
Fixed tests
btut Aug 16, 2021
5cf2af7
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 16, 2021
b555ada
Checkstyle
btut Aug 16, 2021
a7604b6
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 17, 2021
4cff87c
Grobid Fetcher and Tests adapted to updated Grobid
btut Aug 18, 2021
4ac2002
Adapted GrobidServiceTest to updated Grobid
btut Aug 18, 2021
43e22b7
Merge branch 'main' of github.com:JabRef/jabref into improvement/more…
btut Aug 18, 2021
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@ Note that this project **does not** adhere to [Semantic Versioning](http://semve

- We added the option to copy the DOI of an entry directly from the context menu copy submenu. [#7826](https://github.com/JabRef/jabref/issues/7826)
- We added a fulltext search feature. [#2838](https://github.com/JabRef/jabref/pull/2838)
- We improved the deduction of bib-entries from imported fulltext pdfs. [#7947](https://github.com/JabRef/jabref/pull/7947)
- We added unprotect_terms to the list of bracketed pattern modifiers [#7826](https://github.com/JabRef/jabref/pull/7960)

### Changed
2 changes: 1 addition & 1 deletion src/main/java/org/jabref/gui/entryeditor/EntryEditor.java
@@ -355,7 +355,7 @@ private void setupToolBar() {

// Add menu for fetching bibliographic information
ContextMenu fetcherMenu = new ContextMenu();
for (EntryBasedFetcher fetcher : WebFetchers.getEntryBasedFetchers(preferencesService.getImportFormatPreferences())) {
for (EntryBasedFetcher fetcher : WebFetchers.getEntryBasedFetchers(preferencesService.getImportFormatPreferences(), preferencesService.getFilePreferences(), databaseContext, preferencesService.getDefaultEncoding())) {
MenuItem fetcherMenuItem = new MenuItem(fetcher.getName());
fetcherMenuItem.setOnAction(event -> fetchAndMerge(fetcher));
fetcherMenu.getItems().add(fetcherMenuItem);
@@ -7,7 +7,7 @@
import org.jabref.logic.importer.ImportFormatPreferences;
import org.jabref.logic.importer.OpenDatabase;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.importer.fileformat.PdfContentImporter;
import org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter;
import org.jabref.logic.importer.fileformat.PdfXmpImporter;
import org.jabref.logic.preferences.TimestampPreferences;
import org.jabref.model.util.FileUpdateMonitor;
@@ -23,7 +23,11 @@ public ExternalFilesContentImporter(ImportFormatPreferences importFormatPreferen
}

public ParserResult importPDFContent(Path file) {
return new PdfContentImporter(importFormatPreferences).importDatabase(file, StandardCharsets.UTF_8);
try {
return new PdfMergeMetadataImporter(importFormatPreferences).importDatabase(file, StandardCharsets.UTF_8);
} catch (IOException e) {
return ParserResult.fromError(e);
}
}

public ParserResult importXMPContent(Path file) {
53 changes: 34 additions & 19 deletions src/main/java/org/jabref/logic/importer/ImportFormatReader.java
@@ -2,12 +2,14 @@

import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.Optional;
import java.util.SortedSet;
import java.util.TreeSet;

import org.jabref.logic.importer.fetcher.GrobidCitationFetcher;
import org.jabref.logic.importer.fileformat.BibTeXMLImporter;
import org.jabref.logic.importer.fileformat.BiblioscapeImporter;
import org.jabref.logic.importer.fileformat.BibtexImporter;
@@ -22,6 +24,10 @@
import org.jabref.logic.importer.fileformat.MsBibImporter;
import org.jabref.logic.importer.fileformat.OvidImporter;
import org.jabref.logic.importer.fileformat.PdfContentImporter;
import org.jabref.logic.importer.fileformat.PdfEmbeddedBibFileImporter;
import org.jabref.logic.importer.fileformat.PdfGrobidImporter;
import org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter;
import org.jabref.logic.importer.fileformat.PdfVerbatimBibTextImporter;
import org.jabref.logic.importer.fileformat.PdfXmpImporter;
import org.jabref.logic.importer.fileformat.RepecNepImporter;
import org.jabref.logic.importer.fileformat.RisImporter;
@@ -42,7 +48,7 @@ public class ImportFormatReader {
* All import formats.
* Sorted accordingly to {@link Importer#compareTo}, which defaults to alphabetically by the name
*/
private final SortedSet<Importer> formats = new TreeSet<>();
private final List<Importer> formats = new ArrayList<>();

private ImportFormatPreferences importFormatPreferences;

@@ -51,8 +57,6 @@ public void resetImportFormats(ImportFormatPreferences newImportFormatPreference

formats.clear();

formats.add(new BiblioscapeImporter());
formats.add(new BibtexImporter(importFormatPreferences, fileMonitor));
formats.add(new BibTeXMLImporter());
formats.add(new CopacImporter());
formats.add(new EndnoteImporter(importFormatPreferences));
@@ -64,11 +68,17 @@ public void resetImportFormats(ImportFormatPreferences newImportFormatPreference
formats.add(new ModsImporter(importFormatPreferences));
formats.add(new MsBibImporter());
formats.add(new OvidImporter());
formats.add(new PdfMergeMetadataImporter(importFormatPreferences));
formats.add(new PdfVerbatimBibTextImporter(importFormatPreferences));
formats.add(new PdfContentImporter(importFormatPreferences));
formats.add(new PdfEmbeddedBibFileImporter(importFormatPreferences));
formats.add(new PdfGrobidImporter(GrobidCitationFetcher.GROBID_URL, importFormatPreferences));
formats.add(new PdfXmpImporter(xmpPreferences));
formats.add(new RepecNepImporter(importFormatPreferences));
formats.add(new RisImporter());
formats.add(new SilverPlatterImporter());
formats.add(new BiblioscapeImporter());
formats.add(new BibtexImporter(importFormatPreferences, fileMonitor));

// Get custom import formats
formats.addAll(importFormatPreferences.getCustomImportList());
@@ -110,26 +120,26 @@ public ParserResult importFromFile(String format, Path file) throws ImportExcept
* All importers.
* <p>
* <p>
* Elements are in default order.
* Elements are sorted by name.
* </p>
*
* @return all custom importers, elements are of type InputFormat
*/
public SortedSet<Importer> getImportFormats() {
return this.formats;
return new TreeSet<>(this.formats);
}
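The change to `formats` separates two concerns: the list keeps registration order, which the format detection loop relies on as a priority order, while `getImportFormats()` hands out a sorted copy for display. A minimal sketch of that pattern, using plain strings instead of `Importer` objects (the class `ImporterRegistrySketch` and its method names are hypothetical, not part of JabRef):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for an importer registry; not JabRef's ImportFormatReader.
class ImporterRegistrySketch {

    // Registration order doubles as the priority order for format detection.
    private final List<String> formats = new ArrayList<>();

    void register(String name) {
        formats.add(name);
    }

    // Detection code iterates in the order the formats were registered.
    List<String> inPriorityOrder() {
        return List.copyOf(formats);
    }

    // Display code receives a fresh, alphabetically sorted copy,
    // so sorting never disturbs the priority order.
    SortedSet<String> sortedByName() {
        return new TreeSet<>(formats);
    }
}
```

Registering `"PDFembeddedbibfile"`, `"Ovid"`, `"BibTeX"` in that order keeps that sequence in `inPriorityOrder()`, while `sortedByName()` yields `BibTeX`, `Ovid`, `PDFembeddedbibfile`.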

/**
* Human readable list of all known import formats (name and CLI Id).
* <p>
* <p>List is in default-order.</p>
* <p>List is sorted by importer name.</p>
*
* @return human readable list of all known import formats
*/
public String getImportFormatList() {
StringBuilder sb = new StringBuilder();

for (Importer imFo : formats) {
for (Importer imFo : getImportFormats()) {
int pad = Math.max(0, 14 - imFo.getName().length());
sb.append(" ");
sb.append(imFo.getName());
@@ -166,20 +176,25 @@ public UnknownFormatImport(String format, ParserResult parserResult) {
public UnknownFormatImport importUnknownFormat(Path filePath, TimestampPreferences timestampPreferences, FileUpdateMonitor fileMonitor) throws ImportException {
Objects.requireNonNull(filePath);

// First, see if it is a BibTeX file:
try {
ParserResult parserResult = OpenDatabase.loadDatabase(filePath, importFormatPreferences, timestampPreferences, fileMonitor);
if (parserResult.getDatabase().hasEntries() || !parserResult.getDatabase().hasNoStrings()) {
parserResult.setFile(filePath.toFile());
return new UnknownFormatImport(ImportFormatReader.BIBTEX_FORMAT, parserResult);
UnknownFormatImport unknownFormatImport = importUnknownFormat(importer -> importer.importDatabase(filePath, importFormatPreferences.getEncoding()), importer -> importer.isRecognizedFormat(filePath, importFormatPreferences.getEncoding()));
unknownFormatImport.parserResult.setFile(filePath.toFile());
return unknownFormatImport;
} catch (ImportException e) {
// If all importers fail, try to read the file as BibTeX
try {
ParserResult parserResult = OpenDatabase.loadDatabase(filePath, importFormatPreferences, timestampPreferences, fileMonitor);
if (parserResult.getDatabase().hasEntries() || !parserResult.getDatabase().hasNoStrings()) {
parserResult.setFile(filePath.toFile());
return new UnknownFormatImport(ImportFormatReader.BIBTEX_FORMAT, parserResult);
} else {
throw new ImportException(Localization.lang("Could not find a suitable import format."));
}
} catch (IOException ignore) {
// Ignored
throw new ImportException(Localization.lang("Could not find a suitable import format."));
}
} catch (IOException ignore) {
// Ignored
}

UnknownFormatImport unknownFormatImport = importUnknownFormat(importer -> importer.importDatabase(filePath, importFormatPreferences.getEncoding()), importer -> importer.isRecognizedFormat(filePath, importFormatPreferences.getEncoding()));
unknownFormatImport.parserResult.setFile(filePath.toFile());
return unknownFormatImport;
}
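The reordering in this method means the specialised importers run first and BibTeX parsing is only attempted once all of them have failed. The control flow can be sketched roughly as follows. This is a simplified model under stated assumptions: `SimpleImporter`, `importer(...)` and `detectFormat(...)` are hypothetical names, and the sketch returns the first importer that succeeds, whereas the real loop tracks a best result.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical, simplified model of "specialised importers first, BibTeX last".
class FallbackImportSketch {

    // Minimal importer abstraction (an assumption, not JabRef's Importer class).
    interface SimpleImporter {
        String name();

        Optional<String> tryImport(String input);
    }

    static SimpleImporter importer(String name, Function<String, Optional<String>> parser) {
        return new SimpleImporter() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public Optional<String> tryImport(String input) {
                return parser.apply(input);
            }
        };
    }

    // Returns the name of the importer that handled the input.
    static String detectFormat(String input, List<SimpleImporter> prioritized, SimpleImporter lastResort) {
        for (SimpleImporter candidate : prioritized) {
            if (candidate.tryImport(input).isPresent()) {
                return candidate.name();
            }
        }
        // Only when every prioritized importer has failed is the fallback tried.
        if (lastResort.tryImport(input).isPresent()) {
            return lastResort.name();
        }
        throw new IllegalArgumentException("Could not find a suitable import format.");
    }
}
```

With this ordering, an input that both a PDF importer and the permissive fallback could parse is attributed to the PDF importer, which appears to be the intent of the commit "Only force BibTeX import if everything else fails".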

/**
@@ -198,7 +213,7 @@ private UnknownFormatImport importUnknownFormat(CheckedFunction<Importer, Parser
String bestFormatName = null;

// Cycle through all importers:
for (Importer imFo : getImportFormats()) {
for (Importer imFo : formats) {
try {
if (!isRecognizedFormat.apply(imFo)) {
continue;
7 changes: 6 additions & 1 deletion src/main/java/org/jabref/logic/importer/WebFetchers.java
@@ -1,5 +1,6 @@
package org.jabref.logic.importer;

import java.nio.charset.Charset;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Optional;
@@ -37,10 +38,13 @@
import org.jabref.logic.importer.fetcher.SpringerLink;
import org.jabref.logic.importer.fetcher.TitleFetcher;
import org.jabref.logic.importer.fetcher.ZbMATH;
import org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter;
import org.jabref.model.database.BibDatabaseContext;
import org.jabref.model.entry.field.Field;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.identifier.DOI;
import org.jabref.model.entry.identifier.Identifier;
import org.jabref.preferences.FilePreferences;

import static org.jabref.model.entry.field.StandardField.EPRINT;
import static org.jabref.model.entry.field.StandardField.ISBN;
@@ -133,14 +137,15 @@ public static SortedSet<IdBasedFetcher> getIdBasedFetchers(ImportFormatPreferenc
/**
* @return sorted set containing entry based fetchers
*/
public static SortedSet<EntryBasedFetcher> getEntryBasedFetchers(ImportFormatPreferences importFormatPreferences) {
public static SortedSet<EntryBasedFetcher> getEntryBasedFetchers(ImportFormatPreferences importFormatPreferences, FilePreferences filePreferences, BibDatabaseContext databaseContext, Charset defaultEncoding) {
SortedSet<EntryBasedFetcher> set = new TreeSet<>(Comparator.comparing(WebFetcher::getName));
set.add(new AstrophysicsDataSystem(importFormatPreferences));
set.add(new DoiFetcher(importFormatPreferences));
set.add(new IsbnFetcher(importFormatPreferences));
set.add(new MathSciNet(importFormatPreferences));
set.add(new CrossRef());
set.add(new ZbMATH(importFormatPreferences));
set.add(new PdfMergeMetadataImporter.EntryBasedFetcherWrapper(importFormatPreferences, filePreferences, databaseContext, defaultEncoding));
return set;
}
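One property of this construction matters when adding a new fetcher: a `TreeSet` built with `Comparator.comparing(WebFetcher::getName)` treats two elements with the same name as duplicates and silently keeps only the first. A small sketch of that behaviour, with a hypothetical `Fetcher` class standing in for the real fetcher types:

```java
import java.util.Comparator;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; Fetcher is not a JabRef class.
class FetcherSetSketch {

    static final class Fetcher {
        private final String name;

        Fetcher(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Builds a name-ordered set; elements with equal names collapse into one.
    static SortedSet<Fetcher> byName(Fetcher... fetchers) {
        SortedSet<Fetcher> set = new TreeSet<>(Comparator.comparing(Fetcher::getName));
        for (Fetcher fetcher : fetchers) {
            set.add(fetcher);
        }
        return set;
    }
}
```

The wrapper added in this hunk therefore needs a name that no other entry-based fetcher uses, otherwise it would not appear in the menu.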

@@ -23,9 +23,10 @@

public class GrobidCitationFetcher implements SearchBasedFetcher {

public static final String GROBID_URL = "http://grobid.jabref.org:8070";

private static final Logger LOGGER = LoggerFactory.getLogger(GrobidCitationFetcher.class);

private static final String GROBID_URL = "http://grobid.jabref.org:8070";
private ImportFormatPreferences importFormatPreferences;
private GrobidService grobidService;

@@ -0,0 +1,166 @@
package org.jabref.logic.importer.fileformat;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

import org.jabref.logic.importer.ImportFormatPreferences;
import org.jabref.logic.importer.Importer;
import org.jabref.logic.importer.ParseException;
import org.jabref.logic.importer.ParserResult;
import org.jabref.logic.l10n.Localization;
import org.jabref.logic.util.StandardFileType;
import org.jabref.logic.util.io.FileUtil;
import org.jabref.logic.xmp.EncryptedPdfsNotSupportedException;
import org.jabref.logic.xmp.XmpUtilReader;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.util.DummyFileUpdateMonitor;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDNameTreeNode;
import org.apache.pdfbox.pdmodel.common.filespecification.PDComplexFileSpecification;
import org.apache.pdfbox.pdmodel.common.filespecification.PDEmbeddedFile;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment;

/**
* PdfEmbeddedBibFileImporter imports an embedded Bib-File from the PDF.
*/
public class PdfEmbeddedBibFileImporter extends Importer {

private final ImportFormatPreferences importFormatPreferences;
private final BibtexParser bibtexParser;

public PdfEmbeddedBibFileImporter(ImportFormatPreferences importFormatPreferences) {
this.importFormatPreferences = importFormatPreferences;
bibtexParser = new BibtexParser(importFormatPreferences, new DummyFileUpdateMonitor());
}

@Override
public boolean isRecognizedFormat(BufferedReader input) throws IOException {
return input.readLine().startsWith("%PDF");
}
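`BufferedReader.readLine()` returns `null` at end of stream, so the check above would throw a `NullPointerException` for an empty file. A null-safe variant might look like the following sketch; the class `PdfHeaderCheckSketch` and the `String` overload are hypothetical helpers added only to make the example easy to run.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; not part of the PR.
class PdfHeaderCheckSketch {

    // True only if a first line exists and starts with the PDF magic marker.
    static boolean looksLikePdf(BufferedReader input) throws IOException {
        String firstLine = input.readLine(); // null when the input is empty
        return firstLine != null && firstLine.startsWith("%PDF");
    }

    // Convenience overload for in-memory content.
    static boolean looksLikePdf(String content) {
        try (BufferedReader reader = new BufferedReader(new StringReader(content))) {
            return looksLikePdf(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```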

@Override
public ParserResult importDatabase(BufferedReader reader) throws IOException {
Objects.requireNonNull(reader);
throw new UnsupportedOperationException("PdfEmbeddedBibFileImporter does not support importDatabase(BufferedReader reader). "
+ "Instead use importDatabase(Path filePath, Charset defaultEncoding).");
}

@Override
public ParserResult importDatabase(String data) throws IOException {
Objects.requireNonNull(data);
throw new UnsupportedOperationException("PdfEmbeddedBibFileImporter does not support importDatabase(String data). "
+ "Instead use importDatabase(Path filePath, Charset defaultEncoding).");
}

@Override
public ParserResult importDatabase(Path filePath, Charset defaultEncoding) {
try (PDDocument document = XmpUtilReader.loadWithAutomaticDecryption(filePath)) {
return new ParserResult(getEmbeddedBibFileEntries(document));
} catch (EncryptedPdfsNotSupportedException e) {
return ParserResult.fromErrorMessage(Localization.lang("Decryption not supported."));
} catch (IOException | ParseException e) {
return ParserResult.fromError(e);
}
}

/**
* Extraction of embedded files in PDFs, adapted from
* https://svn.apache.org/repos/asf/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java
*/

private List<BibEntry> getEmbeddedBibFileEntries(PDDocument document) throws IOException, ParseException {
List<BibEntry> allParsedEntries = new ArrayList<>();
PDDocumentNameDictionary nameDictionary = document.getDocumentCatalog().getNames();
if (nameDictionary != null) {
PDEmbeddedFilesNameTreeNode efTree = nameDictionary.getEmbeddedFiles();
if (efTree != null) {
Map<String, PDComplexFileSpecification> names = efTree.getNames();
if (names != null) {
allParsedEntries.addAll(extractAndParseFiles(names));
} else {
List<PDNameTreeNode<PDComplexFileSpecification>> kids = efTree.getKids();
for (PDNameTreeNode<PDComplexFileSpecification> node : kids) {
names = node.getNames();
allParsedEntries.addAll(extractAndParseFiles(names));
}
}
}
}
// extract files from annotations
for (PDPage page : document.getPages()) {
for (PDAnnotation annotation : page.getAnnotations()) {
if (annotation instanceof PDAnnotationFileAttachment) {
PDAnnotationFileAttachment annotationFileAttachment = (PDAnnotationFileAttachment) annotation;
PDComplexFileSpecification fileSpec = (PDComplexFileSpecification) annotationFileAttachment.getFile();
allParsedEntries.addAll(extractAndParseFile(getEmbeddedFile(fileSpec)));
}
}
}
return allParsedEntries;
}

private List<BibEntry> extractAndParseFiles(Map<String, PDComplexFileSpecification> names) throws IOException, ParseException {
List<BibEntry> allParsedEntries = new ArrayList<>();
for (Map.Entry<String, PDComplexFileSpecification> entry : names.entrySet()) {
String filename = entry.getKey();
FileUtil.getFileExtension(filename);
if (FileUtil.isBibFile(Path.of(filename))) {
PDComplexFileSpecification fileSpec = entry.getValue();
allParsedEntries.addAll(extractAndParseFile(getEmbeddedFile(fileSpec)));
}
}
return allParsedEntries;
}

private List<BibEntry> extractAndParseFile(PDEmbeddedFile embeddedFile) throws IOException, ParseException {
return bibtexParser.parseEntries(embeddedFile.createInputStream());
}

private static PDEmbeddedFile getEmbeddedFile(PDComplexFileSpecification fileSpec) {
// search for the first available alternative of the embedded file
PDEmbeddedFile embeddedFile = null;
if (fileSpec != null) {
embeddedFile = fileSpec.getEmbeddedFileUnicode();
if (embeddedFile == null) {
embeddedFile = fileSpec.getEmbeddedFileDos();
}
if (embeddedFile == null) {
embeddedFile = fileSpec.getEmbeddedFileMac();
}
if (embeddedFile == null) {
embeddedFile = fileSpec.getEmbeddedFileUnix();
}
if (embeddedFile == null) {
embeddedFile = fileSpec.getEmbeddedFile();
}
}
return embeddedFile;
}
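`getEmbeddedFile` is a "first non-null alternative" chain. The same idea can be expressed generically with lazily evaluated suppliers, as in this sketch (`FirstAvailableSketch` and `firstAvailable` are hypothetical names, and the sketch uses only the standard library, not PDFBox):

```java
import java.util.Objects;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical generic helper; not part of the PR.
class FirstAvailableSketch {

    // Evaluates the candidates in order and returns the first non-null value.
    // The stream is lazy, so suppliers after the first hit are never invoked.
    @SafeVarargs
    static <T> Optional<T> firstAvailable(Supplier<T>... candidates) {
        return Stream.of(candidates)
                     .map(Supplier::get)
                     .filter(Objects::nonNull)
                     .findFirst();
    }
}
```

Applied to the method above, the candidates would be the `getEmbeddedFileUnicode`, `getEmbeddedFileDos`, `getEmbeddedFileMac`, `getEmbeddedFileUnix` and `getEmbeddedFile` accessors in that order. The explicit `if` chain in the PR is equally valid and arguably easier to step through in a debugger.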

@Override
public String getName() {
return "PDFembeddedbibfile";
}

@Override
public StandardFileType getFileType() {
return StandardFileType.PDF;
}

@Override
public String getDescription() {
return "PdfEmbeddedBibFileImporter imports an embedded Bib-File from the PDF.";
}

}