Update to Tesseract.js Version 4 #691

Merged: 22 commits, Nov 25, 2022
Commits
3678dbc: Added image preprocessing functions (rotate + save images) (Sep 15, 2022)
0277db2: Updated createWorker to be async (Sep 15, 2022)
b87afe9: Reworked createWorker to be async and throw errors per #654 (Sep 17, 2022)
ca99c35: Reworked createWorker to be async and throw errors per #654 (Sep 17, 2022)
8fb04f4: Edited detect to return null when detection fails rather than throwin… (Sep 17, 2022)
4d3dee5: Updated types per #606 and #580 (#663) (#664) (Balearica, Sep 18, 2022)
689a150: Removed unused files (Sep 18, 2022)
622c841: Added savePDF option to recognize per #488; cleaned up code for linter (Sep 18, 2022)
81bf04c: Updated download-pdf example for node to use new savePDF option (Sep 18, 2022)
c407aeb: Added OutputFormats option/interface for setting output (Sep 18, 2022)
fae195f: Allowed for Tesseract parameters to be set through recognition option… (Sep 19, 2022)
c41619f: Updated docs (Sep 20, 2022)
c0298ff: Edited loadLanguage to no longer overwrite cache with data from cache… (Sep 20, 2022)
b952488: Added interface for setting 'init only' options per #613 (Sep 24, 2022)
2a6a133: Wrapped caching in try block per #609 (Sep 25, 2022)
32c5c14: Fixed unit tests (Sep 25, 2022)
9e12163: Updated setImage to resolve memory leak per #678 (Oct 10, 2022)
c7c2d73: Added debug output option per #681 (Oct 14, 2022)
d5c1f78: Fixed bug with saving images per #588 (Oct 14, 2022)
10b1da7: Updated examples (Nov 25, 2022)
cc6de2a: Updated readme and Tesseract.js-core version (Nov 25, 2022)
034697a: Resolved merge conflicts; minor changes for linter (Nov 25, 2022)
5 changes: 4 additions & 1 deletion .eslintrc
@@ -12,6 +12,9 @@
"no-console": 0,
"global-require": 0,
"camelcase": 0,
"no-control-regex": 0
"no-control-regex": 0,
// Airbnb disallows ForOfStatement based on the bizarre belief that loops are not readable
// https://github.com/airbnb/javascript/issues/1271
"no-restricted-syntax": ["error", "ForInStatement", "LabeledStatement", "WithStatement"]
}
}
13 changes: 11 additions & 2 deletions README.md
@@ -46,12 +46,11 @@ Or more imperative
```javascript
import { createWorker } from 'tesseract.js';

const worker = createWorker({
const worker = await createWorker({
logger: m => console.log(m)
});

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -62,6 +61,16 @@ const worker = createWorker({

[Check out the docs](#documentation) for a full explanation of the API.

## Major changes in v4
Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.

- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
- `createWorker` is now async
- `getPDF` function replaced by `pdf` recognize option
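
To make the first breaking change concrete, here is a minimal sketch of the v4 call sequence. It takes `createWorker` as a parameter so it can be exercised without the library installed; in real use you would pass the function exported by `tesseract.js`. The helper name `recognizeText` is hypothetical. Because `createWorker` now returns a promise, it has to be awaited inside an async function (or at the top level of an ES module), and `worker.load()` is no longer called.

```javascript
// Sketch of the v4 flow: createWorker is awaited and worker.load() is gone.
// `recognizeText` is a hypothetical helper name, not part of Tesseract.js.
async function recognizeText(createWorker, image, lang = 'eng') {
  // v4: createWorker returns a promise, so await it before using the worker.
  const worker = await createWorker();
  try {
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    // Always release the worker, even if recognition throws.
    await worker.terminate();
  }
}

// Usage (assumes tesseract.js v4 is installed):
// const { createWorker } = require('tesseract.js');
// recognizeText(createWorker, 'https://tesseract.projectnaptha.com/img/eng_bw.png')
//   .then(console.log);
```

Creating the worker inside the async function keeps the example valid in CommonJS files, where `await` is not allowed at the top level.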

## Major changes in v3
- Significantly faster performance
- Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
44 changes: 10 additions & 34 deletions docs/api.md
@@ -1,7 +1,6 @@
# API

- [createWorker()](#create-worker)
- [Worker.load](#worker-load)
- [Worker.writeText](#worker-writeText)
- [Worker.readText](#worker-readText)
- [Worker.removeFile](#worker-removeFile)
@@ -53,7 +52,7 @@ createWorker is a factory function that creates a tesseract worker, a worker is

```javascript
const { createWorker } = Tesseract;
const worker = createWorker({
const worker = await createWorker({
langPath: '...',
logger: m => console.log(m),
});
@@ -63,7 +62,6 @@ const worker = createWorker({

A Worker helps you to do the OCR related tasks, it takes few steps to setup Worker before it is fully functional. The full flow is:

- load
- FS functions // optional
- loadLanguage
- initialize
@@ -82,23 +80,6 @@ Each function is async, so using async/await or Promise is required. When it is

jobId is generated by Tesseract.js, but you can put your own when calling any of the function above.

<a name="worker-load"></a>
### Worker.load(jobId): Promise

Worker.load() loads tesseract.js-core scripts (download from remote if not presented), it makes Web Worker/Child Process ready for next action.

**Arguments:**

- `jobId` Please see details above

**Examples:**

```javascript
(async () => {
await worker.load();
})();
```

<a name="worker-writeText"></a>
### Worker.writeText(path, text, jobId): Promise

@@ -225,7 +206,7 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
- `params` an object with key and value of the parameters
- `jobId` Please see details above

**Supported Parameters:**
**Useful Parameters:**

| name | type | default value | description |
| --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
@@ -234,11 +215,8 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
| tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful the content in image is limited |
| preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
| user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |

This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)

**Examples:**

@@ -262,8 +240,9 @@ Figures out what words are in `image`, where the words are in `image`, etc.
**Arguments:**

- `image` see [Image Format](./image-format.md) for more details.
- `options` a object of customized options
- `options` an object of customized options
- `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
- `jobId` Please see details above
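
As a rough illustration of the `output` argument, the sketch below requests PDF output alongside text. The position of `output` as the third argument to `recognize`, the `pdf` and `text` keys, and the `data.pdf` result field are assumptions inferred from the description above and should be checked against the released API. The helper name `recognizeWithPdf` is hypothetical.

```javascript
// Sketch only: the argument position of `output` and the `pdf` key are
// assumptions based on the argument list above, not confirmed API details.
// `recognizeWithPdf` is a hypothetical helper name.
async function recognizeWithPdf(worker, image) {
  const options = {};                       // e.g. { rectangle } would go here
  const output = { text: true, pdf: true }; // request text plus PDF output
  const { data } = await worker.recognize(image, options, output);
  return { text: data.text, pdf: data.pdf };
}
```

Passing the worker in as a parameter keeps the helper independent of how the worker was created and initialized.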

**Output:**
@@ -273,8 +252,7 @@ Figures out what words are in `image`, where the words are in `image`, etc.
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = createWorker();
await worker.load();
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image);
@@ -287,8 +265,7 @@ With rectangle
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = createWorker();
await worker.load();
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize(image, {
@@ -313,8 +290,7 @@ Worker.detect() does OSD (Orientation and Script Detection) to the image instead
```javascript
const { createWorker } = Tesseract;
(async () => {
const worker = createWorker();
await worker.load();
const worker = await createWorker();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data } = await worker.detect(image);
@@ -361,7 +337,7 @@ Scheduler.addWorker() adds a worker into the worker pool inside scheduler, it is
```javascript
const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = createWorker();
const worker = await createWorker();
scheduler.addWorker(worker);
```

41 changes: 15 additions & 26 deletions docs/examples.md
@@ -7,10 +7,9 @@ You can also check [examples](../examples) folder.
```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -24,12 +23,11 @@ const worker = createWorker();
```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker({
const worker = await createWorker({
logger: m => console.log(m), // Add logger here
});

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -43,26 +41,24 @@ const worker = createWorker({
```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();

(async () => {
await worker.load();
await worker.loadLanguage('eng+chi_tra');
await worker.initialize('eng+chi_tra');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
await worker.terminate();
})();
```
### with whitelist char (^2.0.0-beta.1)
### with whitelist char

```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
@@ -74,17 +70,16 @@ const worker = createWorker();
})();
```

### with different pageseg mode (^2.0.0-beta.1)
### with different pageseg mode

Check here for more details of pageseg mode: https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163

```javascript
const { createWorker, PSM } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
@@ -96,7 +91,7 @@ const worker = createWorker();
})();
```

### with pdf output (^2.0.0-beta.1)
### with pdf output

Please check **examples** folder for details.

@@ -110,11 +105,10 @@ Node: [download-pdf.js](../examples/node/download-pdf.js)
```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();
const rectangle = { left: 0, top: 0, width: 500, height: 250 };

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle });
@@ -128,7 +122,7 @@ const rectangle = { left: 0, top: 0, width: 500, height: 250 };
```javascript
const { createWorker } = require('tesseract.js');

const worker = createWorker();
const worker = await createWorker();
const rectangles = [
{
left: 0,
@@ -145,7 +139,6 @@ const rectangles = [
];

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
const values = [];
@@ -164,8 +157,8 @@ const rectangles = [
const { createWorker, createScheduler } = require('tesseract.js');

const scheduler = createScheduler();
const worker1 = createWorker();
const worker2 = createWorker();
const worker1 = await createWorker();
const worker2 = await createWorker();
const rectangles = [
{
left: 0,
@@ -182,8 +175,6 @@ const rectangles = [
];

(async () => {
await worker1.load();
await worker2.load();
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
@@ -198,18 +189,16 @@ const rectangles = [
})();
```

### with multiple workers to speed up (^2.0.0-beta.1)
### with multiple workers to speed up

```javascript
const { createWorker, createScheduler } = require('tesseract.js');

const scheduler = createScheduler();
const worker1 = createWorker();
const worker2 = createWorker();
const worker1 = await createWorker();
const worker2 = await createWorker();

(async () => {
await worker1.load();
await worker2.load();
await worker1.loadLanguage('eng');
await worker2.loadLanguage('eng');
await worker1.initialize('eng');
40 changes: 9 additions & 31 deletions docs/faq.md
@@ -1,6 +1,13 @@
FAQ
===

# Project
## What is the scope of this project?
Tesseract.js is the JavaScript/Webassembly port of the Tesseract OCR engine. We do not edit the underlying Tesseract recognition engine in any way. Therefore, if you encounter bugs caused by the Tesseract engine you may open an issue here for the purposes of raising awareness to other users, but fixing is outside the scope of this repository.

If you encounter a Tesseract bug you would like to see fixed you should confirm the behavior is the same in the [main (CLI) version](https://github.com/tesseract-ocr/tesseract) of Tesseract and then open a Git Issue in that repository.

# Trained Data
## How does tesseract.js download and keep \*.traineddata?

The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
@@ -9,34 +16,5 @@ During the downloading of language model, Tesseract.js will first check if \*.tr

## How can I train my own \*.traineddata?

For tesseract.js v2, check [TrainingTesseract 4.00](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00)

For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05)

## How can I get HOCR, TSV, Box, UNLV, OSD?

Starting from 2.0.0-beta.1, you can get all these information in the final result.

```javascript
import { createWorker } from 'tesseract.js';
const worker = createWorker({
logger: m => console.log(m)
});

(async () => {
await worker.load();
await worker.loadLanguage('eng');
await worker.initialize('eng');
await worker.setParameters({
tessedit_create_box: '1',
tessedit_create_unlv: '1',
tessedit_create_osd: '1',
});
const { data: { text, hocr, tsv, box, unlv } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
console.log(text);
console.log(hocr);
console.log(tsv);
console.log(box);
console.log(unlv);
})();
```
See the documentation from the main [Tesseract project](https://tesseract-ocr.github.io/tessdoc/) for training instructions.

4 changes: 2 additions & 2 deletions docs/local-installation.md
@@ -19,7 +19,7 @@ Tesseract.recognize(image, langs, {
Or

```javascript
const worker = createWorker({
const worker = await createWorker({
workerPath: 'https://unpkg.com/tesseract.js@v2.0.0/dist/worker.min.js',
langPath: 'https://tessdata.projectnaptha.com/4.0.0',
corePath: 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js',
@@ -33,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
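
As a worked illustration of the formula above, the sketch below builds a language file URL. The helper name `languageFileUrl` is hypothetical, and the sketch assumes a `/` separator is needed between `langPath` and the language code, since the default `langPath` has no trailing slash and the formula as written does not show one.

```javascript
// Hypothetical helper illustrating the URL formula; not part of Tesseract.js.
function languageFileUrl(langPath, langCode) {
  // Assumption: strip any trailing slash, then join with a single '/'.
  const base = langPath.replace(/\/+$/, '');
  return `${base}/${langCode}.traineddata.gz`;
}

console.log(languageFileUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng'));
// prints "https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz"
```

When self-hosting language data, the same shape suggests placing files such as `eng.traineddata.gz` directly under the directory given as `langPath`.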

### corePath
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js'.

Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm'. But it fails to fetch at this moment.