Switch to base65536 #5
Comments
That won't work well... URLs don't allow Unicode characters, so the characters would need to be URL encoded, which would defeat the purpose: the encoded data would almost certainly end up longer than the original data. You can get the shortest URLs only by using characters that do not need URL encoding. You'd have more luck using an actual compression algorithm that produces URL-friendly output; for example, the REPL on the Babel site uses LZString.
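For illustration of the percent-encoding point (a quick sketch, not part of the original thread): one non-ASCII code point stays a single character in a JS string, but balloons once it has to be percent-encoded for a URL.

const packed = String.fromCodePoint(0x5607);      // arbitrary BMP code point standing in for base65536 output
console.log(packed.length);                       // 1 character as JS sees it
console.log(encodeURIComponent(packed));          // "%E5%98%87"
console.log(encodeURIComponent(packed).length);   // 9 characters once percent-encoded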
That may have been correct in the past, but I don't think it still holds up today. See:
Ah, good catch, I didn't realise that Punycode wasn't required for URLs any more. In any case, UTF-8 characters can be up to four bytes. That base65536 encoding isn't actually compressing the data, just encoding it differently: it still takes up the same amount of storage space, just in fewer characters, which has arguably minimal benefit (you're not saving any bandwidth, for example). It seems useful for cramming lots of data into a tweet by exploiting the way that Twitter counts the length of tweets, but not for anything else :P You'll still see better results from an actual compression algorithm, as then the data will actually be compressed and take up less space.
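As a rough illustration of the characters-versus-bytes point (an assumed example, not from the thread): packing two single-byte characters into one code point shrinks the character count but not the UTF-8 byte count.

const pair = 'ab';                                  // 2 bytes as UTF-8
const packed = String.fromCodePoint(
  (pair.charCodeAt(0) << 8) | pair.charCodeAt(1));  // U+6162
console.log(packed.length);                         // 1 character
console.log(Buffer.byteLength(packed, 'utf8'));     // 3 bytes, more than the original 2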
I did a couple of tests with pretty large pages I had in CodePen; the actual reduction was only from 9 kB to 8 kB in the largest (and not noticeable in the smallest). There's probably a use case where this does more to actually compress a string, but it seems to perform similarly to btoa().
Yeah, it should obviously work, because that's the whole idea of base65k. Take a look here: I've created my own implementation of base65k. It's very naïve, but I didn't want to dig through the package's code if it didn't work for you. The result is a 2.2x reduction:
const fs = require('fs').promises;

function btoa(text) {
  // Node version of the browser btoa() function.
  return Buffer.from(text).toString('base64');
}
function b65k(text) {
  // Check that every char fits in a single byte (charCodeAt() returns UTF-16
  // code units, so anything above 255 won't pack into one byte).
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 255) {
      console.log('non-ASCII');
      return text;
    }
  }
  // Every char fits in one byte:
  // combine two chars into one code point.
  // Max value won't exceed 65k: 0xff * 256 + 0xff = 65535.
  let compressed = '';
  for (let i = 0; i < text.length; i += 2) {
    const fst = text.codePointAt(i).toString(16).padStart(2, '0');
    // Guard against reading past the end when the text has odd length.
    const snd = ((i + 1) < text.length)
      ? text.codePointAt(i + 1).toString(16).padStart(2, '0')
      : '';
    compressed += String.fromCodePoint(parseInt(fst + snd, 16));
  }
  return compressed;
}
function deb65k(compressed) {
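  // Note: the round trip assumes b65k's input had an even number of characters;
  // an odd-length input decodes with a spurious NUL before its final character.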
  let text = '';
  for (let i = 0; i < compressed.length; i++) {
    const pair = compressed.codePointAt(i).toString(16).padStart(4, '0');
    const fst = parseInt(pair[0] + pair[1], 16);
    text += String.fromCodePoint(fst);
    const snd = parseInt(pair[2] + pair[3], 16);
    text += String.fromCodePoint(snd);
  }
  return text;
}
(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');
  console.log('file size', text.length);
  console.log('btoa size', btoa(text).length);
  console.log('65k size', b65k(text).length);
  // Round trip is identical:
  // console.log(deb65k(b65k(text)));
})();
@paulmillr - Changing your script to output the number of bytes in the string:
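The snippet itself isn't preserved here; a plausible sketch of the change (assuming Node's Buffer.byteLength for UTF-8 byte counts, slotted into the script above):

// Count UTF-8 bytes rather than string length.
const bytes = (s) => Buffer.byteLength(s, 'utf8');
console.log('file bytes', bytes(text));
console.log('btoa bytes', bytes(btoa(text)));
console.log('65k bytes', bytes(b65k(text)));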
Shows b65k has a clear disadvantage:
And LZString actually helps:
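The exact code isn't preserved here either; a sketch using the lz-string package's URL-safe output (assumed usage, not the commenter's original snippet):

const LZString = require('lz-string');
// Compress to a string that needs no further URL encoding.
const lz = LZString.compressToEncodedURIComponent(text);
console.log('lz-string bytes', Buffer.byteLength(lz, 'utf8'));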
results in:
No, the whole idea of base65k is to take advantage of Twitter counting each Unicode code point as one character, rather than counting the number of raw bytes. It's not really useful as an actual compression or transport strategy otherwise.
I'm talking here about browser limitations for URL parsing and URL parsing performance. I'm not talking about whether the result is fewer bytes.
One use case of the project is having long URLs that can be stored by link shorteners with upper limits on URL size. With that use case in mind, what matters most is whether the shorteners care about bytes or characters when measuring URL size, and how they handle Unicode URLs. My guess is that it varies, but it still merits implementing.

Tangential note: I'm currently looking into integrating Brotli for real URL compression, since it was designed for exactly this type of data and will hopefully get URL lengths considerably shorter.
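For reference, a minimal sketch of what Brotli-to-URL could look like with Node's built-in zlib (an illustration only, not the project's actual implementation; the 'base64url' encoding needs Node 15.7+):

const zlib = require('zlib');

// Compress UTF-8 text with Brotli and emit a URL-safe base64url string.
function toUrlFragment(text) {
  return zlib.brotliCompressSync(Buffer.from(text, 'utf8')).toString('base64url');
}

// Reverse the transformation.
function fromUrlFragment(fragment) {
  return zlib.brotliDecompressSync(Buffer.from(fragment, 'base64url')).toString('utf8');
}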
@jstrieb - Shorteners usually care about bytes, given they need to store the data as bytes in their database. Regardless of whether you have four characters that take one byte each, or one character that takes four bytes, that's still going to consume four bytes in a database. I ran a URL shortener for many years.
I'd be interested in hearing how Brotli compares to LZString for this data! Looking forward to seeing the results of that 😃
Hey, great idea you have there!
I suggest compressing the data further by using base65536: https://www.npmjs.com/package/base65536