
Added some new options and parallelization for ContentExtractor #47

Open
wants to merge 6 commits into base: master

Conversation

Uchman21

I modified the code to enable running several parsers in parallel, in order to speed up extraction when there is a large number of files to process. The new options are:

  1. -start and -stop: extract only a range of files. By default, all files in the folder are included, as in the original algorithm.
  2. -outpath: the output directory to which the resulting files and folders are written. By default, this is the same path as the input path, as in the original algorithm.
  3. -workers: the number of content extraction workers to run in parallel. The files are divided equally among the workers. By default, this is "1", as in the original algorithm. The parallelism is implemented using Java threads.

P.S.: I made the default output option just "jats", as some people might not be interested in extracting the images.

P.P.S.: Let me know if you need any more clarification. Thanks!

@dtkaczyk
Collaborator

dtkaczyk commented Apr 6, 2017

Hi @Uchman21 and thanks for contributing to CERMINE!

I looked at your code and there are several issues that need to be resolved before we can merge this. Here are the most important ones:

  1. If I understand correctly, you are assigning input files to the available threads in advance, before any extraction starts. Since we do not know in advance how time-consuming each file is, this strategy might result in a very uneven split, and we might end up waiting for one thread to process several difficult files (just because they happened to be in the same "chunk") while the other threads have finished their work and are idle. A better idea would be to use a pool of threads in the following way: a single task processes one file, and we send PDFs to the pool one by one; if the threads are all busy, we wait for the first one to become free. Check for example this code, which uses CompletionService (a minimal sketch follows at the end of this comment).
  2. The -start and -stop parameters seem very unintuitive to me. If you have a directory with 1000 PDFs, some of them buried in subdirectories, it is not trivial to figure out the correct start and end point of the subset you want. What is more, right now the files are not sorted, so the processing order is not deterministic. I think it would be better to pass a list of file paths, or even to remove this functionality altogether.
  3. Please consistently use camel case for names, and single spaces after "for", before "{", and around operators such as "=" throughout your code.

For now I've listed the biggest issues; we can continue with smaller ones once these are corrected.
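
For reference, a minimal sketch of the CompletionService pattern described in point 1 (PoolSketch and extractOne are placeholders, not CERMINE classes):

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSketch {

        public static void main(String[] args) throws Exception {
            List<File> pdfs = Arrays.asList(new File("a.pdf"), new File("b.pdf")); // placeholder input
            int workers = 4;

            ExecutorService pool = Executors.newFixedThreadPool(workers);
            CompletionService<String> completion = new ExecutorCompletionService<>(pool);

            // One task per file; the pool hands each task to the first free thread,
            // so no thread is stuck with a pre-assigned "chunk" of difficult files.
            for (final File pdf : pdfs) {
                completion.submit(() -> extractOne(pdf));
            }

            // Collect results in completion order, not submission order.
            for (int i = 0; i < pdfs.size(); i++) {
                System.out.println(completion.take().get());
            }
            pool.shutdown();
        }

        // Placeholder for the actual per-file extraction.
        private static String extractOne(File pdf) {
            return "File processed: " + pdf.getName();
        }
    }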

@Uchman21
Author

Great! Observations duly noted. I will modify the code and update.

@Uchman21
Author

Uchman21 commented Apr 12, 2017

I have now modified the code and implementation to make use of thread pools. I agree with you on that point; it was just a poor design choice on my part.

The reason for adding the -start and -stop options is for users who want to run this tool on a very large number of files in a directory. It might sound unintuitive when you are extracting information from a small number of files (e.g. 1000 PDFs). However, I decided to extend CERMINE with parallelization so that people like me, working with close to a million PDFs or more, might benefit from it. When working with such a large number of PDFs, you might not want to extract them all at once. You might want to extract the first n PDFs in your folder, or the last n, so that you can start developing your algorithm while the rest are being extracted (sorting plays no role here). It is just like having millions of lines in a file and using Linux's head or tail to extract part of it. Sure, an option would be to copy the files of interest to a new directory and pass that directory as input to CERMINE, but that would be more tedious (I know I wouldn't want to do that). This would simply be another option for those who need it.

I will change -start and -stop to just -head n, where n is the number of files from the head of the directory to parse. If you are still against it, I can remove it entirely from the code I send for merging.

I will make a pull request as soon as you confirm or reject the -head n option.

@dtkaczyk
Collaborator

From CERMINE's point of view, all options should have a clear and (hopefully) intuitive, obvious meaning. In my opinion, all of those options (start, end, or head) imply some kind of order of the files. If the order/sorting is not deterministic, it simply makes no sense to talk about the head. Note that in the case of lines in a file there is a natural order. So when you ask "which line is the first line in a file?", there is an obvious, intuitive answer, whereas when you ask "which file is the first file in this directory?", at least to me there is no single correct answer.

About your use case: first of all, if you have a million files in a directory, you most likely have some subdirectory structure there; otherwise even very simple operations like listing the contents of such a flat directory would take a long time. If this is the case, you could simply run CERMINE on a smaller subdirectory first. Another obvious solution would be to run CERMINE on everything and simply interrupt the command when you have enough processed files.

That being said, I've come to the conclusion that it would be OK to add an option, say limit, that would stop the processing after some n files have been processed. I guess it is exactly what you want, but the name does not suggest any order, or which particular files will be processed. It is much more like the LIMIT clause in SQL than the head or tail tools. What do you think?

@Uchman21
Author

OK, limit is fine. I will update the pull request as soon as it is completed.

@Uchman21
Author

Uchman21 commented Apr 13, 2017

So, I have updated the files. The additions are as follows:
For the interface and code:

  1. Added the optional limit option
  2. Modified the code structure and naming

For performance:

  1. Made use of thread pools as suggested; for this I used the ExecutorService.
  2. Since listing the contents of a flat directory with a large number of files would take a long time, a better solution is to just walk the directory, extracting files and displaying the progress as we walk, instead of listing the contents first. In my tests this proved to be more memory- and time-efficient. For this I used Java's Files.walkFileTree (a sketch follows this list).
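
A rough illustration of this walking approach with a limit (not the actual PR code; processPdf is a placeholder):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    public class WalkSketch {

        public static void walkAndProcess(Path root, final long limit) throws IOException {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                private long processed = 0;

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // Handle PDFs as we encounter them instead of listing everything first.
                    if (file.toString().toLowerCase().endsWith(".pdf")) {
                        processPdf(file);
                        processed++;
                        System.out.println("Progress: " + processed + " file(s) processed");
                        if (processed >= limit) {
                            return FileVisitResult.TERMINATE; // stop once the limit is reached
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }

        // Placeholder for handing the file to the extraction pool.
        private static void processPdf(Path pdf) {
            System.out.println("File processed: " + pdf);
        }

        public static void main(String[] args) throws IOException {
            walkAndProcess(Paths.get(args[0]), Long.MAX_VALUE);
        }
    }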

Feel free to modify the code or ask me for more clarification if needed.

@dtkaczyk dtkaczyk self-assigned this Apr 14, 2017
Collaborator

@dtkaczyk dtkaczyk left a comment

In general:

  • make sure the unit tests pass
  • make sure all variants of the input (PDFs directly in the input directory, PDFs in subdirectories, other types of files/directories present) work as expected
  • the console output should be as close as possible to the output of the original code

@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
Collaborator

One of the unit tests throws an error because of this class. Also, you should update the tests in pl.edu.icm.cermine.CommandLineOptionsParserTest so that they check your new options as well.
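
A rough sketch of such a test (assuming the parser exposes a parse(String[]) method alongside the getLimit() accessor from this diff; adjust to the fixture actually used in CommandLineOptionsParserTest):

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;
    import pl.edu.icm.cermine.CommandLineOptionsParser;

    public class CommandLineOptionsParserLimitTest {

        @Test
        public void limitOptionIsParsed() throws Exception {
            CommandLineOptionsParser parser = new CommandLineOptionsParser();
            // Hypothetical invocation; use whatever parse method the real test class calls.
            parser.parse(new String[]{"-path", "/tmp/pdfs", "-limit", "100"});
            assertEquals(Long.valueOf(100L), parser.getLimit());
        }
    }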

@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
options = new Options();
options.addOption("path", true, "file or directory path");
options.addOption("limit", true, "number of pdfs to parse starting from the top");
Collaborator

I would change the help message to "maximum number of files to process".
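
That is, something along the lines of:

    options.addOption("limit", true, "maximum number of files to process");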

}
}

public Long getLimit() {
Collaborator

Why are you using Long instead of int for this option? The same question applies to the number of workers.
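
If Long is not needed, plain int parsing would read more naturally, e.g. (sketch, using the commons-cli default-value overload):

    int limit = Integer.parseInt(commandLine.getOptionValue("limit"));
    int workers = Integer.parseInt(commandLine.getOptionValue("workers", "1"));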

} else {
Long value = Long.parseLong(commandLine.getOptionValue("limit"));
if (value < 0) {
throw new RuntimeException("The 'start' value given as a "
Collaborator

Wrong name of the parameter.
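
That is, the message should reference 'limit', for example (the rest of the message text is illustrative):

    throw new RuntimeException("The 'limit' value given as a "
            + "parameter cannot be negative."); // message tail is illustrative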

import java.nio.file.Files;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.FileVisitResult;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Collection;
Collaborator

This import is not needed anymore.

+ "Tool for extracting metadata and content from PDF files.\n\n"
+ "Arguments:\n"
+ " -path <path> path to a directory containing PDF files\n"
+ " -limit <int> (optional) the number of PDF files to parse starting from the top default: \"all\"\n"
Collaborator

These lines should be kept under 80 characters long, so that the help prints well in a terminal.
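
For example, the limit line could be wrapped so that each string literal stays under 80 characters (wording illustrative):

    + " -limit <int>    (optional) maximum number of files to process,\n"
    + "                 default: \"all\"\n"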

final Map<String, String> extensions = parser.getTypesAndExtensions();

Long fileLength = Files.list(Paths.get(path)).count();
Long getlimit = parser.getLimit();
Collaborator

This variable should be called "limit".

final String outpath = parser.getOutPath()+'/';
final Map<String, String> extensions = parser.getTypesAndExtensions();

Long fileLength = Files.list(Paths.get(path)).count();
Collaborator

Here you should be calculating the number of PDF files in the directory, but instead you are calculating the number of entries directly in the input directory. So there are two problems: this does not go recursively into subdirectories, and it counts everything, not only PDF files.
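
A possible fix, as a sketch (walks recursively and counts only regular files with a .pdf extension):

    long fileCount;
    try (java.util.stream.Stream<java.nio.file.Path> paths =
            Files.walk(Paths.get(path))) {
        fileCount = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().toLowerCase().endsWith(".pdf"))
                .count();
    }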

ExtractionConfigRegister.set(builder.buildConfiguration());
final ExecutorService executor = Executors.newFixedThreadPool(workers.intValue());

final Path startFilePath = Paths.get(new URI("file://"+path));
Collaborator

I think Paths.get(path) would suffice here, similarly to what you did in line 903.


long end = System.currentTimeMillis();
elapsed = (end - start) / 1000F;
System.out.println("Total extraction time: " + Math.round(elapsed) + "s");
Collaborator

The original code was printing the extraction time for every file. This could be added to the ParallelTask#run() method. Also, the "File processed:" message from the original code is missing.
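
Something along these lines inside the task would restore the per-file output (sketch; the ParallelTask internals shown here are assumed, not taken from the PR):

    @Override
    public void run() {
        long fileStart = System.currentTimeMillis();
        try {
            extract(pdfFile); // placeholder for the actual per-file extraction call
            System.out.println("File processed: " + pdfFile);
        } catch (Exception e) {
            System.err.println("Extraction failed for " + pdfFile + ": " + e.getMessage());
        }
        float seconds = (System.currentTimeMillis() - fileStart) / 1000F;
        System.out.println("Extraction time: " + Math.round(seconds) + "s");
    }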

@dtkaczyk dtkaczyk assigned Uchman21 and unassigned dtkaczyk Apr 22, 2017
@Uchman21
Copy link
Author

Will make the changes as soon as I get the time.

Uchman21 added 2 commits June 20, 2017 19:21
Addressed all the issues raised in the previous versions. Passed all unit tests. Fixed some bugs. Should be better now.
Added tests for the new options.