
Added some new options and parallelization for ContentExtractor #47

Open
wants to merge 6 commits into base: master

Conversation

Uchman21

I modified the code to enable running several parsers in parallel, in order to speed up extraction when there is a large number of files to process. The new options are:

  1. -start and -stop: extract only a range of files. By default, all files in the folder are included, as in the original algorithm.
  2. -outpath: the output directory to which the resulting files and folders are written. By default, this is the same path as the input path, as in the original algorithm.
  3. -workers: the number of content extraction workers to run in parallel. The files are divided equally among the workers. By default, this is "1", as in the original algorithm. The parallelism is implemented using Java threads.

P.S.: I made the default output option just "jats", as some people might not be interested in extracting the images.

P.P.S.: Let me know if you need any more clarification. Thanks!

@dtkaczyk
Collaborator

dtkaczyk commented Apr 6, 2017

Hi @Uchman21 and thanks for contributing to CERMINE!

I looked at your code and there are several issues that need to be resolved before we can merge this. Here are the most important ones:

  1. If I understand correctly, you are assigning input files to the available threads in advance, before any extraction starts. Since we do not know in advance how time-consuming each file is, this strategy might result in a very uneven split, and we might end up waiting for one thread to process several difficult files (just because they happened to be in the same "chunk") while the other threads have finished their work and are idle. A better idea would be to use a pool of threads in the following way: a single task processes one file, and we send PDFs to the pool one by one; if the threads are all busy, we wait for the first one to become free. Check for example this code, which uses CompletionService (a minimal sketch follows at the end of this comment).
  2. The -start and -stop parameters seem very unintuitive to me. If you have a directory with 1000 PDFs, some of them buried in subdirectories, it is not trivial to figure out the correct start and end point of the subset you want. What is more, right now the files are not sorted, so the processing order is not deterministic. I think it would be better to pass a list of file paths, or even to remove this functionality altogether.
  3. Please consistently use camel case for names, and single spaces after "for", before "{", and around operators such as "=" throughout your code.

For now I've listed the biggest issues; we can continue with smaller ones once these are corrected.
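
For reference, a minimal sketch of the CompletionService pattern described in point 1 (PoolSketch and extractOne are placeholders, not CERMINE classes):

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.CompletionService;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class PoolSketch {

        public static void main(String[] args) throws Exception {
            List<File> pdfs = Arrays.asList(new File("a.pdf"), new File("b.pdf")); // placeholder input
            int workers = 4;

            ExecutorService pool = Executors.newFixedThreadPool(workers);
            CompletionService<String> completion = new ExecutorCompletionService<>(pool);

            // One task per file; the pool hands each task to the first free thread,
            // so no thread is stuck with a pre-assigned "chunk" of difficult files.
            for (final File pdf : pdfs) {
                completion.submit(() -> extractOne(pdf));
            }

            // Collect results in completion order, not submission order.
            for (int i = 0; i < pdfs.size(); i++) {
                System.out.println(completion.take().get());
            }
            pool.shutdown();
        }

        // Placeholder for the actual per-file extraction.
        private static String extractOne(File pdf) {
            return "File processed: " + pdf.getName();
        }
    }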

@Uchman21
Author

Great! Observations duly noted. I will modify the code and update.

@Uchman21
Author

Uchman21 commented Apr 12, 2017

I have now modified the code and implementation to make use of thread pools. I agree with you on that point; it was just a poor design choice on my part.

The reason for adding the -start and -stop options is for users who want to run this tool on a very large number of files in a directory. It might sound unintuitive when you are extracting information from a small number of files (e.g. 1000 PDFs). However, I decided to extend CERMINE with parallelization so that people like me, working with close to a million PDFs or more, might benefit from it. When working with such a large number of PDFs, you might not want to extract them all at once. You might want to extract the first n PDFs in your folder, or the last n, so that you can start developing your algorithm while the rest are being extracted (sorting plays no role here). It is just like having millions of lines in a file and using Linux's head or tail to extract part of it. Sure, an option would be to copy the files of interest to a new directory and pass that directory as input to CERMINE, but that would be more tedious (I know I wouldn't want to do that). This would simply be another option for those who need it.

I will change -start and -stop to just -head n, where n is the number of files from the head of the directory to parse. If you are still against it, I can remove it entirely from the code I send for merging.

I will make a pull request as soon as you confirm or reject the -head n option.

@dtkaczyk
Collaborator

From CERMINE's point of view, all options should have a clear and (hopefully) intuitive, obvious meaning. In my opinion, all of those options (start, end, or head) imply some kind of order of the files. If the order/sorting is not deterministic, it simply makes no sense to talk about the head. Note that in the case of lines in a file there is a natural order. So when you ask "which line is the first line in a file?", there is an obvious, intuitive answer, whereas when you ask "which file is the first file in this directory?", at least to me there is no single correct answer.

About your use case: first of all, if you have a million files in a directory, you most likely have some subdirectory structure there; otherwise even very simple operations like listing the contents of such a flat directory would take a long time. If this is the case, you could simply run CERMINE on a smaller subdirectory first. Another obvious solution would be to run CERMINE on everything and simply interrupt the command when you have enough processed files.

That being said, I've come to the conclusion that it would be OK to add an option, say limit, that would stop the processing after some n files have been processed. I guess it is exactly what you want, but the name does not suggest any order, or which particular files will be processed. It is much more like the LIMIT clause in SQL than the head or tail tools. What do you think?

@Uchman21
Author

OK, limit is fine. I will update the pull request as soon as it is completed.

@Uchman21
Author

Uchman21 commented Apr 13, 2017

So, I have updated the files. The additions are as follows:
For the interface and code:

  1. Added the optional limit option
  2. Modified the code structure and naming

For performance:

  1. Made use of thread pools as suggested; for this I used the ExecutorService.
  2. Since listing the contents of a flat directory with a large number of files would take a long time, a better solution is to just walk the directory, extracting files and displaying the progress as we walk, instead of listing the contents first. In my tests this proved to be more memory- and time-efficient. For this I used Java's Files.walkFileTree (a sketch follows this list).
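
A rough illustration of this walking approach with a limit (not the actual PR code; processPdf is a placeholder):

    import java.io.IOException;
    import java.nio.file.FileVisitResult;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.SimpleFileVisitor;
    import java.nio.file.attribute.BasicFileAttributes;

    public class WalkSketch {

        public static void walkAndProcess(Path root, final long limit) throws IOException {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                private long processed = 0;

                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    // Handle PDFs as we encounter them instead of listing everything first.
                    if (file.toString().toLowerCase().endsWith(".pdf")) {
                        processPdf(file);
                        processed++;
                        System.out.println("Progress: " + processed + " file(s) processed");
                        if (processed >= limit) {
                            return FileVisitResult.TERMINATE; // stop once the limit is reached
                        }
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }

        // Placeholder for handing the file to the extraction pool.
        private static void processPdf(Path pdf) {
            System.out.println("File processed: " + pdf);
        }

        public static void main(String[] args) throws IOException {
            walkAndProcess(Paths.get(args[0]), Long.MAX_VALUE);
        }
    }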

Feel free to modify the code or ask me for more clarification if needed.

@dtkaczyk dtkaczyk self-assigned this Apr 14, 2017
Collaborator

@dtkaczyk dtkaczyk left a comment

In general:

  • make sure the unit tests pass
  • make sure all variants of the input (PDFs directly in the input directory, PDFs in subdirectories, other types of files/directories present) work as expected
  • the console output should be as close as possible to the output of the original code

@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
Collaborator

One of the unit tests throws an error because of this class. Also, you should update the tests in pl.edu.icm.cermine.CommandLineOptionsParserTest so that they check your new options as well.
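
A rough sketch of such a test (assuming the parser exposes a parse(String[]) method alongside the getLimit() accessor from this diff; adjust to the fixture actually used in CommandLineOptionsParserTest):

    import static org.junit.Assert.assertEquals;

    import org.junit.Test;
    import pl.edu.icm.cermine.CommandLineOptionsParser;

    public class CommandLineOptionsParserLimitTest {

        @Test
        public void limitOptionIsParsed() throws Exception {
            CommandLineOptionsParser parser = new CommandLineOptionsParser();
            // Hypothetical invocation; use whatever parse method the real test class calls.
            parser.parse(new String[]{"-path", "/tmp/pdfs", "-limit", "100"});
            assertEquals(Long.valueOf(100L), parser.getLimit());
        }
    }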

@@ -38,6 +38,9 @@
public CommandLineOptionsParser() {
options = new Options();
options.addOption("path", true, "file or directory path");
options.addOption("limit", true, "number of pdfs to parse starting from the top");
Collaborator

I would change the help message to "maximum number of files to process".
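
That is, something along the lines of:

    options.addOption("limit", true, "maximum number of files to process");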

}
}

public Long getLimit() {
Collaborator

Why are you using Long instead of int for this option? The same question applies to the number of workers.
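
If Long is not needed, plain int parsing would read more naturally, e.g. (sketch, using the commons-cli default-value overload):

    int limit = Integer.parseInt(commandLine.getOptionValue("limit"));
    int workers = Integer.parseInt(commandLine.getOptionValue("workers", "1"));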

} else {
Long value = Long.parseLong(commandLine.getOptionValue("limit"));
if (value < 0) {
throw new RuntimeException("The 'start' value given as a "
Collaborator

Wrong name of the parameter.
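
That is, the message should reference 'limit', for example (the rest of the message text is illustrative):

    throw new RuntimeException("The 'limit' value given as a "
            + "parameter cannot be negative."); // message tail is illustrative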

import java.nio.file.Files;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.FileVisitResult;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Collection;
Collaborator

This import is not needed anymore.

+ "Tool for extracting metadata and content from PDF files.\n\n"
+ "Arguments:\n"
+ " -path <path> path to a directory containing PDF files\n"
+ " -limit <int> (optional) the number of PDF files to parse starting from the top default: \"all\"\n"
Collaborator

These lines should be kept under 80 characters long, so that the help prints well in a terminal.
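
For example, the limit line could be wrapped so that each string literal stays under 80 characters (wording illustrative):

    + " -limit <int>    (optional) maximum number of files to process,\n"
    + "                 default: \"all\"\n"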

final Map<String, String> extensions = parser.getTypesAndExtensions();

Long fileLength = Files.list(Paths.get(path)).count();
Long getlimit = parser.getLimit();
Collaborator

This variable should be called "limit".

final String outpath = parser.getOutPath()+'/';
final Map<String, String> extensions = parser.getTypesAndExtensions();

Long fileLength = Files.list(Paths.get(path)).count();
Collaborator

Here you should be calculating the number of PDF files in the directory, but instead you are calculating the number of entries directly in the input directory. So there are two problems: this does not go recursively into subdirectories, and it counts everything, not only PDF files.
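
A possible fix, as a sketch (walks recursively and counts only regular files with a .pdf extension):

    long fileCount;
    try (java.util.stream.Stream<java.nio.file.Path> paths =
            Files.walk(Paths.get(path))) {
        fileCount = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().toLowerCase().endsWith(".pdf"))
                .count();
    }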

ExtractionConfigRegister.set(builder.buildConfiguration());
final ExecutorService executor = Executors.newFixedThreadPool(workers.intValue());

final Path startFilePath = Paths.get(new URI("file://"+path));
Collaborator

I think Paths.get(path) would suffice here, similarly to what you did in line 903.


long end = System.currentTimeMillis();
elapsed = (end - start) / 1000F;
System.out.println("Total extraction time: " + Math.round(elapsed) + "s");
Collaborator

The original code was printing the extraction time for every file. This could be added to the ParallelTask#run() method. Also, the "File processed:" message from the original code is missing.
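
Something along these lines inside the task would restore the per-file output (sketch; the ParallelTask internals shown here are assumed, not taken from the PR):

    @Override
    public void run() {
        long fileStart = System.currentTimeMillis();
        try {
            extract(pdfFile); // placeholder for the actual per-file extraction call
            System.out.println("File processed: " + pdfFile);
        } catch (Exception e) {
            System.err.println("Extraction failed for " + pdfFile + ": " + e.getMessage());
        }
        float seconds = (System.currentTimeMillis() - fileStart) / 1000F;
        System.out.println("Extraction time: " + Math.round(seconds) + "s");
    }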

@dtkaczyk dtkaczyk assigned Uchman21 and unassigned dtkaczyk Apr 22, 2017
@Uchman21
Copy link
Author

Will make the changes as soon as I get the time.

Uchman21 added 2 commits June 20, 2017 19:21
Addressed all the issues raised in the previous versions. Passed all unit tests. Fixed some bugs. Should be better now.
Added tests for the new options.