Development snapshot
Dmitry Shirokov committed Aug 7, 2024
1 parent 7629d7c commit fa3b09f
Showing 1 changed file with 8 additions and 7 deletions.
15 changes: 8 additions & 7 deletions README.md
@@ -59,24 +59,25 @@ Sometimes, when data set is huge and you want to optimize performance (with a tr
you can sample only the first N bytes of the buffer:

```javascript
-const encoding = await chardet
-  .detectFile('/path/to/file', { sampleSize: 32 });
+const encoding = await chardet.detectFile('/path/to/file', { sampleSize: 32 });
```

You can also specify where to begin reading from in the buffer:

```javascript
-const encoding = await chardet
-  .detectFile('/path/to/file', { sampleSize: 32, offset: 128 });
+const encoding = await chardet.detectFile('/path/to/file', {
+  sampleSize: 32,
+  offset: 128,
+});
```
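
To make the two options concrete, here is a minimal sketch of the byte window they describe, assuming `offset` is the starting position and `sampleSize` is the number of bytes read from there. The `sampleWindow` helper and the sample data are hypothetical and only illustrate the arithmetic; they are not part of chardet.

```javascript
// Hypothetical helper: returns the bytes an { offset, sampleSize } pair would cover.
function sampleWindow(bytes, { sampleSize, offset = 0 }) {
  // subarray clamps to the end of the buffer, so short inputs yield fewer bytes.
  return bytes.subarray(offset, offset + sampleSize);
}

// 256 bytes holding the values 0..255, standing in for a file's contents.
const fileBytes = Uint8Array.from({ length: 256 }, (_, i) => i);

const sample = sampleWindow(fileBytes, { sampleSize: 32, offset: 128 });
console.log(sample.length, sample[0], sample[31]); // 32 128 159
```

A larger sample generally gives the detector more evidence to work with, so the window size is a trade-off between speed and accuracy.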

## Working with strings

-In both Node.js and browsers, all strings in memory are represented in UTF-16 encoding. This is a fundamental aspect of the JavaScript language specification. What it means is you cannot use plain strings directly as input for `chardet.analyse()` or `chardet.detect()`. You need original string data as Buffer/Uint8Array.
+In both Node.js and browsers, all strings in memory are represented in UTF-16 encoding. This is a fundamental aspect of the JavaScript language specification. Therefore, you cannot use plain strings directly as input for `chardet.analyse()` or `chardet.detect()`. Instead, you need the original string data in the form of a Buffer or Uint8Array.

-In other words, if you receive a piece of data over the network and want to detect its encoding, use the original data payload, not the string representation of it. By the time you get a string it'll be in UTF-16 encoding.
+In other words, if you receive a piece of data over the network and want to detect its encoding, use the original data payload, not its string representation. By the time you convert data to a string, it will be in UTF-16 encoding.
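
As a rough sketch of keeping the original payload, the snippet below reads a response body as raw bytes instead of text. It assumes a runtime with the global `Response` class (browsers, Node.js 18 or later); the `readPayloadBytes` helper and the fake response are hypothetical stand-ins for a real network call.

```javascript
// Hypothetical helper: returns the untouched payload bytes of a fetch Response.
async function readPayloadBytes(response) {
  // arrayBuffer() exposes the bytes as received; text() would decode them to a string first.
  return new Uint8Array(await response.arrayBuffer());
}

// Stand-in for a network response whose body is the Latin-1 byte for "é".
const fakeResponse = new Response(Uint8Array.from([0xe9]));

readPayloadBytes(fakeResponse).then((bytes) => {
  console.log(Array.from(bytes)); // [ 233 ]
  // These bytes, not a decoded string, are the kind of input chardet.detect() expects.
});
```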

-Note on [TextEncoder](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder/TextEncoder): by default it'll return UTF-8 encoded buffer, i.e. not in the original encoding of the string.
+Note on [TextEncoder](https://developer.mozilla.org/en-US/docs/Web/API/TextEncoder/TextEncoder): By default, it returns a UTF-8 encoded buffer, which means the buffer will not be in the original encoding of the string.
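
The TextEncoder caveat can be seen in a short Node.js sketch. The one-byte Latin-1 payload is a hypothetical example; once it has been decoded to a string, re-encoding with `TextEncoder` yields different bytes from the ones that were received.

```javascript
// Hypothetical payload: the single byte 0xE9 is "é" in ISO-8859-1 (Latin-1).
const originalBytes = Uint8Array.from([0xe9]);

// Decoding produces a JavaScript string; the original byte layout is gone.
const text = Buffer.from(originalBytes).toString('latin1');

// TextEncoder always emits UTF-8, so "é" comes back as two bytes.
const reEncoded = new TextEncoder().encode(text);

console.log(Array.from(originalBytes)); // [ 233 ]
console.log(Array.from(reEncoded)); // [ 195, 169 ]
```

Running detection on `reEncoded` would therefore describe the UTF-8 re-encoding, not the encoding of the data as it arrived.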

## Supported Encodings:
