
Process gets killed if several large files are input #68

Closed
pirolen opened this issue Apr 14, 2023 · 8 comments

@pirolen

pirolen commented Apr 14, 2023

Hi, when processing several large files, the FoLiA-txt tool in the containerized foliautils gets killed.
I get:

/data # FoLiA-txt  --remove-end-hyphens yes -O . *.txt
start processing of 22 files 
Processed: 02_feb_car.txt into ./02_feb_car.folia.xml still 21 files to go.
Killed

It is not a big problem, since one can call the tool separately per file, but I thought I'd let you know.

Maybe it is better to call the tool per file in a shell script in the container; I have not tried that. A minimal sketch is below.
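
Something like this might work (untested), reusing the flags from the invocation above:

for f in *.txt; do
    FoLiA-txt --remove-end-hyphens yes -O . "$f"
done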

@proycon
Member

proycon commented Apr 15, 2023

I wonder if it's due to the system's OOM killer: were you running out of memory? (Though that would imply there's a memory leak, if all the individual files do work.)
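
On most Linux systems an OOM kill leaves a trace in the kernel log, so something like this should show it:

dmesg | grep -i "out of memory"
journalctl -k | grep -i oom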

@pirolen
Author

pirolen commented Apr 16, 2023

I was trying to search the logs to track down the cause, but did not find a way to identify what happened. Grepping for kill did not return anything in /var/log/dmesg, /var/log/kern.log, or /var/log/syslog.
I have Ubuntu 20; could you advise where to look? Thanks!

@kosloot
Contributor

kosloot commented Apr 17, 2023

As far as I can see, there is no significant memory leak in FoLiA-txt.
But maybe there is some strange oddity in the file at hand; I don't know.
It seems that the first file is processed OK, but the second isn't.

I assume there is NO problem when that file is processed on its own?

@pirolen
Author

pirolen commented Apr 17, 2023

The files process fine if I call the converter on them one by one. I experienced the same thing with other files too, when calling the converter on directories of large files -- there can be nearly one million tokens per file.

Typically, the process is killed after the first file has been converted.

@kosloot
Contributor

kosloot commented Apr 17, 2023

Well, I just ran tests on some fairly small files, and there seems to be some random effect that makes the run fail, but not always.
It is currently using 23.6 GB of memory, and I will kill it myself, but agreed, there is something rotten.
This needs some investigation.

@kosloot
Contributor

kosloot commented Apr 17, 2023

OK, I guess it is some multithreading problem: a deadlock occurs, and FoLiA-txt seems to 'stall' when running on multiple threads.
You could try the -t1 or --threads=1 option (which slows things down, of course).
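
For example, reusing the flags from the original report:

FoLiA-txt --threads=1 --remove-end-hyphens yes -O . *.txt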

Best is to upgrade to the newest Git version, which reports how many threads you are actually running on.
Good luck!
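
Building from the Git master follows the usual autotools sequence for the LanguageMachines repositories; roughly (check the repository README for the exact steps and dependencies):

git clone https://github.com/LanguageMachines/foliautils.git
cd foliautils
bash bootstrap.sh
./configure
make
sudo make install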

kosloot added a commit that referenced this issue Apr 17, 2023
@kosloot
Contributor

kosloot commented Apr 17, 2023

@pirolen the git master has a fix now, which hopefully resolves the deadlock.

@kosloot
Contributor

kosloot commented Jun 6, 2023

Closing; considering this fixed.

kosloot closed this as completed on Jun 6, 2023.