Delete invalid .traineddata files in cache #753
Comments
Summary of this change
TL;DR
Setting
Explanation
By default, Tesseract.js caches Prior to Due to this bug, many developers using Tesseract.js started bypassing the caching feature entirely by setting Starting in
I'm wondering why Tesseract.js handles the caching and downloading of training data at all. I would much prefer to have full control over this rather than rely on a built-in solution that may or may not work (for me, as of 5.0.4, it doesn't) and that isn't really related to the core feature of Tesseract. That way you could focus on developing what's unique about Tesseract.js. Anybody can download and cache files, and the right solution often differs depending on the application. For example, someone may want a fully offline solution and bundle the training data with the app, or check for updates at a regular interval, etc.
@laurent22 The purpose of Tesseract.js is to provide a high-level, user friendly interface for running OCR. The vast majority of users do not want to manage training data. Therefore, managing language data is within the scope of this project. That being said, if you have some application that would benefit from having more control over language data than Tesseract.js currently provides, you can open a new Git Issue with a feature request. For example, it would not be particularly difficult to allow for providing language data directly as an
One of the most common error messages reported is `Error opening data file ./eng.traineddata` (or the equivalent for other languages). This is due to our current caching behavior.

When a `.traineddata` file is downloaded, any fetch response reported as `ok` (which corresponds to a status of 200-299) is cached.

tesseract.js/src/worker-script/index.js, lines 108 to 111 in 7a087ca
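To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not the code at the referenced lines: `cacheIfOk`, the `Map`-based cache, and the plain-object response are hypothetical stand-ins, and the logic is kept synchronous for brevity.

```javascript
// Hypothetical sketch of "cache anything reported as ok".
// `cache` is a plain Map standing in for whatever store is really used, and
// `response` is a plain object shaped like { ok, status, body }.
function cacheIfOk(cache, key, response) {
  if (!response.ok) {
    // Non-2xx responses are rejected and never reach the cache.
    throw new Error(`Network error while fetching ${key} (status ${response.status})`);
  }
  // The payload is stored without any check that it is a real .traineddata
  // file, so an HTML error page served with status 200 is cached as well.
  cache.set(key, response.body);
  return response.body;
}
```

Under this logic, a server that answers an invalid path with a 200 status and an HTML page leaves that page in the cache under the `.traineddata` key, and every later run reads it back.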
The cached file is then used until the user manually deletes it, even if the file is invalid. The assumption this code makes is that an `ok` response indicates that some `.traineddata` file was successfully downloaded, and if that file is somehow corrupted, that is because the developer uploaded a corrupted `.traineddata` file.

This does not appear to be the case. Some server configurations appear to return `200` responses even if the `langPath` value is invalid (see #714). Furthermore, given user reports, this may even happen when the default `langPath` value is used (see #521), although the mechanism for this is unclear.

We should edit the code so that tesseract.js deletes the saved `.traineddata` file when it detects that it is invalid. With this change, the next time the code is run it will again try to download the `.traineddata` file from `langPath`, rather than re-using the cached data that has already been determined to be invalid.
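The proposed behavior could look roughly like the sketch below. All names (`loadLang`, `looksInvalid`, the `Map`-based cache, the `download` callback) are hypothetical, not Tesseract.js APIs. The validity check is also an assumption made purely for illustration: it treats an empty payload or one starting with `<` (an HTML page) as invalid, whereas the real signal would more likely be Tesseract failing to load the data.

```javascript
// Hypothetical validity check: assumes invalid payloads are empty or are
// HTML pages (first byte '<', 0x3c). A stand-in for real detection.
function looksInvalid(bytes) {
  return bytes.length === 0 || bytes[0] === 0x3c;
}

// Returns the language data for `key`, downloading it on a cache miss.
// If the data turns out to be invalid, the cached copy is deleted so that
// the next call downloads again instead of re-using known-bad data.
function loadLang(cache, key, download) {
  let data = cache.get(key);
  if (data === undefined) {
    data = download(key);
    cache.set(key, data);
  }
  if (looksInvalid(data)) {
    cache.delete(key);
    throw new Error(`Invalid .traineddata for ${key}; cached copy deleted`);
  }
  return data;
}
```

The first call still fails when the server returns a bad payload, but the failure no longer persists: once the server or `langPath` is fixed, the next call fetches a fresh copy without the user having to clear the cache manually.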