Added some new options and parallelization for ContentExtractor #47
Conversation
Hi @Uchman21 and thanks for contributing to CERMINE! I looked at your code and there are several issues that need to be resolved before we can merge this. Here are the most important ones:
For now I've listed the biggest issues; we can continue with smaller ones once these are corrected.
Great! Observations duly noted. I will modify the code and update.
I have now changed the code and implementation to make use of thread pools. I agree with you on that point; that was just a poor design choice on my part. The reason for adding the -start and -stop options is to serve users who want to run the extraction over a very large number of files in a directory. It may seem unintuitive when you are extracting information from a small number of files (e.g. 1000 PDFs). However, I decided to extend CERMINE with parallelization so that people like me, working on close to a million PDFs or maybe more, might benefit from it. When working with that many PDFs you might not want to extract all of them at once; you might want to extract only the first portion of them. I will change the options accordingly and make a pull request as soon as you confirm or reject the proposal.
From CERMINE's point of view, all options should have a clear and (hopefully) intuitive, obvious meaning. In my opinion, the -start and -stop options do not have such a meaning. About your case: first of all, if you have a million files in a directory, you most likely have some subdirectory structure there - otherwise even very simple operations, like listing the contents of such a flat directory, would take a long time. If this is the case, you could simply run CERMINE on a smaller subdirectory first. Another obvious solution would be to run CERMINE on everything and simply interrupt the command when you have enough processed files. That being said, I've come to the conclusion that it would be ok to add an option, say -limit, giving the maximum number of files to process.
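For illustration, a minimal sketch of how such a limit option could look with Apache Commons CLI (the option name, default value, and parser class here are placeholders, not necessarily what the final code should use):

```java
import org.apache.commons.cli.*;

// Hypothetical sketch of a "limit" option; DefaultParser stands in
// for whichever commons-cli parser the project actually uses.
public class LimitOptionSketch {
    public static void main(String[] args) throws ParseException {
        Options options = new Options();
        options.addOption("limit", true, "maximum number of files to process");

        CommandLine commandLine = new DefaultParser().parse(options, args);

        int limit = Integer.MAX_VALUE; // default: process all files
        if (commandLine.hasOption("limit")) {
            limit = Integer.parseInt(commandLine.getOptionValue("limit"));
        }
        System.out.println("limit = " + limit);
    }
}
```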
Ok.
Improved performance
So, I have updated the files. The additions, including the performance-related ones, are summarized below.
Feel free to modify the code or ask me for more clarification if needed.
In general:
- make sure the unit tests pass
- make sure all variants of the input (PDFs directly in the input directory, PDFs in subdirectories, other types of files/directories present) work as expected
- the console output should be as close as possible to the output of the original code
@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
One of the unit tests throws an error because of this class. Also, you should update the tests in pl.edu.icm.cermine.CommandLineOptionsParserTest, so that they check your new options as well.
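A sketch of what such a test could look like, assuming JUnit 4 and a parse(String[]) method; the exact CommandLineOptionsParser API may differ:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;
import pl.edu.icm.cermine.CommandLineOptionsParser;

public class CommandLineOptionsParserLimitTest {
    // Hypothetical test; parse(String[]) and getLimit() are assumed
    // to behave as in this PR's diff.
    @Test
    public void limitOptionTest() throws Exception {
        CommandLineOptionsParser parser = new CommandLineOptionsParser();
        parser.parse(new String[]{"-path", "/some/dir", "-limit", "100"});
        assertEquals(Long.valueOf(100L), parser.getLimit());
    }
}
```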
@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
    options = new Options();
    options.addOption("path", true, "file or directory path");
    options.addOption("limit", true, "number of pdfs to parse starting from the top");
I would change the help message to "maximum number of files to process".
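That is, something like:

```java
options.addOption("limit", true, "maximum number of files to process");
```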
}
}

public Long getLimit() {
Why are you using Long instead of int for this option? The same question applies to the number of workers.
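For comparison, a plain int would be parsed just as easily; a sketch using the names from this diff:

```java
// Sketch: parsing the option as an int instead of a Long.
int limit = Integer.parseInt(commandLine.getOptionValue("limit"));
```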
} else {
    Long value = Long.parseLong(commandLine.getOptionValue("limit"));
    if (value < 0) {
        throw new RuntimeException("The 'start' value given as a "
Wrong name of the parameter.
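That is, the message should reference the option actually being validated; a sketch with placeholder wording for the truncated tail of the message:

```java
if (value < 0) {
    // "limit" is the option checked here; the message tail is a placeholder.
    throw new RuntimeException("The 'limit' value cannot be negative");
}
```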
import java.nio.file.Files;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.FileVisitResult;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Collection;
This import is not needed anymore.
+ "Tool for extracting metadata and content from PDF files.\n\n" | ||
+ "Arguments:\n" | ||
+ " -path <path> path to a directory containing PDF files\n" | ||
+ " -limit <int> (optional) the number of PDF files to parse starting from the top default: \"all\"\n" |
These lines should be kept under 80 characters long, so that the help prints well in a terminal.
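For example, the long line could be split across concatenated string literals (wrapping shown for illustration only):

```java
+ "  -limit <int> (optional) the number of PDF files to parse\n"
+ "               starting from the top; default: \"all\"\n"
```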
final Map<String, String> extensions = parser.getTypesAndExtensions(); | ||
|
||
Long fileLength = Files.list(Paths.get(path)).count(); | ||
Long getlimit = parser.getLimit(); |
This variable should be called "limit".
final String outpath = parser.getOutPath() + '/';
final Map<String, String> extensions = parser.getTypesAndExtensions();

Long fileLength = Files.list(Paths.get(path)).count();
Here you should be calculating the number of PDF files in the whole directory tree, but you are calculating the number of entries directly in the input directory instead. So two problems: this does not go recursively into subdirectories, and it counts everything, not only PDF files.
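A minimal self-contained sketch of a recursive PDF count with java.nio (the lowercase ".pdf" suffix check is an illustration; match PDF files however the rest of the code does):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PdfCountSketch {
    // Counts regular files ending in ".pdf" under the given root,
    // descending into subdirectories.
    static long countPdfFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase().endsWith(".pdf"))
                    .count();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countPdfFiles(Paths.get(args[0])));
    }
}
```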
ExtractionConfigRegister.set(builder.buildConfiguration());
final ExecutorService executor = Executors.newFixedThreadPool(workers.intValue());

final Path startFilePath = Paths.get(new URI("file://" + path));
I think Paths.get(path) would suffice here, as you did in line 903.
long end = System.currentTimeMillis();
elapsed = (end - start) / 1000F;
System.out.println("Total extraction time: " + Math.round(elapsed) + "s");
The original code was printing the extraction time for every file. This could be added to the ParallelTask#run() method. Also, the "File processed:" message from the original code is missing.
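A rough sketch of how the per-file timing could look inside run(); the "file" field and the extraction call are assumptions based on this PR, and the message format only approximates the original output:

```java
@Override
public void run() {
    long start = System.currentTimeMillis();
    // ... extract metadata and content from "file" here ...
    long end = System.currentTimeMillis();
    float elapsed = (end - start) / 1000F;
    System.out.println("File processed: " + file);
    System.out.println("Extraction time: " + Math.round(elapsed) + "s");
}
```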
Will make the changes as soon as I get the time.
Addressed all the issues raised on the previous versions. All unit tests pass. Fixed some bugs. Should be better now.
Added tests for the new options.
I modified the code to run several parsers in parallel, in order to speed up the extraction when there is a large number of files to process; a rough sketch of the pattern follows below. The new options control the maximum number of files to process and the number of parallel workers.
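For reference, the general shape of the thread-pool pattern I used (a minimal self-contained sketch, not the actual PR code; the per-file work is a placeholder):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelExtractionSketch {
    public static void main(String[] args) throws InterruptedException {
        int workers = 4; // e.g. taken from a command-line option
        ExecutorService executor = Executors.newFixedThreadPool(workers);

        for (String file : args) {
            executor.submit(() -> {
                // placeholder for the real per-file extraction
                System.out.println("File processed: " + file);
            });
        }

        executor.shutdown();                          // no new tasks
        executor.awaitTermination(1, TimeUnit.HOURS); // wait for completion
    }
}
```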
P.S.: I made the default output option just "jats", as some people might not be interested in extracting the images.
P.P.S.: Let me know if you need any more clarification. Thanks.