
Consider reducing the number of files produced by the PDF metadata extraction phase #1442

Open
marekhorst opened this issue Jan 2, 2024 · 1 comment


@marekhorst
Member

We should consider reducing the number of files produced by the metadataextraction job at the cost of extending the execution time of a single metadataextraction task attempt.

One of the reasons for the IIS failure that happened just before Christmas, described here:

https://support.openaire.eu/issues/9336#note-16

was an extremely large volume of PDFs streamed for metadata extraction (2.2Mi of new PDFs), which spawned over 32k metadataextraction mapper tasks and in turn produced a huge number of output files: meta and fault (32k of each). 32k is also the limit on the number of open files on the datanodes.

This led the transformer processing the faults to fail with a “SocketException: Too many open files” error.

Rerunning IIS made the problem disappear, because the metadataextraction outcome was already available in the cache and the aforementioned transformation was no longer required, but the failure may still happen in the future if another large batch of contents is streamed to IIS.

We might want to alter the split size that drives the number of splits over the DocumentContentUrl files (well, sequence files with avro records compatible with this schema) in order to reduce the number of files produced by the metadataextraction module.
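For illustration, here is a minimal sketch of how such a split size bound is applied to a Hadoop job reading sequence files. This is not the actual IIS importer code (the class name and input path are hypothetical); it only shows the mechanism by which the split max size controls the mapper count, and hence the output file count:

```java
// Minimal sketch, not the actual IIS importer code: shows the two equivalent
// ways of bounding the input split size of a Hadoop job that reads sequence
// files. Each resulting split becomes one mapper, and every metadataextraction
// mapper writes its own meta and fault output file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Property-based form, as the value would arrive from a workflow parameter:
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 100_000L);

        Job job = Job.getInstance(conf, "metadataextraction-sketch");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // API-based form; raising this value yields fewer, larger splits,
        // hence fewer mappers and fewer meta/fault output files.
        FileInputFormat.setMaxInputSplitSize(job, 100_000L);
    }
}
```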

@marekhorst marekhorst self-assigned this Jan 2, 2024
@marekhorst marekhorst changed the title Consider reducing the number of files produced by the metadataextraction Consider reducing the number of files produced by the PDF metadata extraction phase Jan 2, 2024
@marekhorst
Member Author

marekhorst commented Jan 2, 2024

One significant drawback of reducing the number of tasks running CERMINE (and thus the number of output files) is the increased amount of time taken by each task.

This may block cluster resources for a longer amount of time, and whenever preemption is triggered (killing CERMINE tasks), more PDFs would need to be processed again on the reattempt.

It comes down to assessing the probability of a similar failure happening again, i.e. processing 2Mi of new PDFs during a single provisioning round. The former failure was caused by:

introducing a large amount of arXiv contents imported in bulk

having quite a long interval between syncs of the payload tables, incrementally built on a dedicated Impala cluster, with the tables provided at IIS input

My assessment is that this is pretty unlikely to happen again, and once we know it is about to happen (e.g. after importing a large number of PDFs in bulk), we will be able to reduce the number of metadataextraction tasks for that particular run by modifying the runtime parameter controlling them: mapred_max_split_size of the importer_content wf.

Currently it is set to 100000 (the default value of mapreduce.input.fileinputformat.split.maxsize is 256000000, but it was reduced significantly for DocumentContentUrl records because of their small size and the pretty long processing time of every single record). If we decide we want to double the amount of data processed by each metadataextraction task, we should double this value.

A minor problem is that this property name is currently shared with the importer_plaintext wf (covering WoS and html file types), so providing it at runtime will affect both workflows. It is not much of a problem, because both jobs accept the same datastore type (DocumentContentUrl) at input, and adjusting the value will simply reduce the number of tasks for each mime type (PDF and text); for text, the number of tasks doesn’t make much difference.
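As a back-of-the-envelope check of the doubling logic, the sketch below estimates the mapper count from the total input size and the split max size. Only the 100000 split size comes from the workflow configuration above; the total input size is a purely hypothetical figure, chosen to reproduce the ~32k task count from the failed run:

```java
// Back-of-the-envelope sketch; inputBytes is a hypothetical figure, only the
// 100000 split size comes from the importer_content configuration above.
public class TaskCountEstimate {
    public static void main(String[] args) {
        long inputBytes = 3_200_000_000L; // assumed total size of the DocumentContentUrl input
        long splitMaxSize = 100_000L;     // current mapred_max_split_size of importer_content

        long tasks = (inputBytes + splitMaxSize - 1) / splitMaxSize;
        long tasksAfterDoubling = (inputBytes + 2 * splitMaxSize - 1) / (2 * splitMaxSize);

        // Each mapper emits one meta and one fault file, so halving the task
        // count also halves the number of files kept open downstream.
        System.out.printf("tasks: %d -> %d after doubling the split size%n",
                tasks, tasksAfterDoubling);
    }
}
```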
