Switch to base65536 #5

Closed
paulmillr opened this issue Jun 30, 2019 · 9 comments

@paulmillr

Hey — Great idea you have there!

I suggest compressing the data further by using base65536: https://www.npmjs.com/package/base65536

@Daniel15

Daniel15 commented Jun 30, 2019

That won't work well... URLs don't allow Unicode characters, so the characters would need to be URL encoded, which would defeat the purpose as the encoded data would almost certainly end up longer than the original data. You can get the shortest URLs only by using characters that do not need URL encoding.

You'd have more luck using an actual compression algorithm that produces URL-friendly output; for example, the REPL on the Babel site uses lz-string: https://github.com/babel/website/blob/21ca7a26f47dbfe1cae8a2300d3a60b2c39f5c8e/js/repl/UriUtils.js#L31-L42
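
A rough sketch of that approach (assuming the lz-string npm package is installed; the HTML snippet is just an example), using its compressToEncodedURIComponent helper, which is designed to produce URI-safe output:

const LZString = require('lz-string');

const source = '<html><body><h1>Hello, URL Pages!</h1></body></html>';

// The compressed output uses a URI-safe alphabet, so it can be dropped
// straight into a URL fragment without further percent-encoding.
const packed = LZString.compressToEncodedURIComponent(source);
const restored = LZString.decompressFromEncodedURIComponent(packed);

console.log('original chars:', source.length);
console.log('packed chars:', packed.length);
console.log('round-trips:', restored === source);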

@paulmillr
Author

URLs don't allow Unicode characters

That may have been correct in the past, but I don't think it still holds up today. See:

  1. https://ko.wikipedia.org/wiki/위키백과:대문 — copy-paste this, no encoding is done
  2. https://url.spec.whatwg.org/

@Daniel15

Daniel15 commented Jun 30, 2019

Ah, good catch, I didn't realise that Punycode wasn't required for URLs any more.

In any case, UTF-8 characters can be up to four bytes. That base65536 encoding isn't actually compressing the data, just encoding it differently. It still takes up the same amount of storage space, just with fewer characters, which has arguably minimal benefit (you're not saving any bandwidth, for example). It seems useful for cramming lots of data into a Tweet by exploiting the way that Twitter counts the length of tweets, but not for anything else :P

You'll still see better results from an actual compression algorithm, as then the data will actually be compressed and take up less space.
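
As a quick standalone illustration of that character-versus-bytes distinction (a Node sketch, not part of the base65536 package):

// Two ASCII bytes packed into one BMP code point (U+6162).
const packed = String.fromCodePoint((0x61 << 8) | 0x62);
console.log(packed.length);                       // 1 character
console.log(Buffer.byteLength('ab', 'utf-8'));    // 2 bytes
console.log(Buffer.byteLength(packed, 'utf-8'));  // 3 bytes: fewer characters, but no byte savings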

@kidGodzilla

I did a couple of tests with pretty large pages I had in CodePen; the actual reduction was only from 9 kB to 8 kB in the largest (and not noticeable in the smallest). There's probably a use case where this does more to actually compress a string, but it seems to perform similarly to atob().

@paulmillr
Author

paulmillr commented Jul 1, 2019

Yeah, it should obviously work, because that's like the whole idea of base65k. Take a look here: I've created my own implementation of base65k. It's very naïve, but I didn't want to dig through the package's code in case it didn't work for you. Create b65k.js in the urlpages root directory and run node b65k.js.

The result is a 2.2x reduction:

file size 506
btoa size 676
65k size 253

const fs = require('fs').promises;

function btoa(text) {
  // node version of the browser function.
  return Buffer.from(text).toString('base64');
}

function b65k(text) {
  // Bail out if any char code is above 255; such text can't be packed bytewise.
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 255) {
      console.log('non-ASCII');
      return text;
    }
  }

  // Every char code fits in one byte.
  // Pack each pair of byte values into a single code point.
  // Max value is 0xFFFF = 65535 (two hex bytes concatenated).
  let compressed = '';
  for (let i = 0; i < text.length; i += 2) {
    const fst = text.codePointAt(i).toString(16).padStart(2, '0');
    // Guard the final character of odd-length input: there may be no second byte.
    const snd = (i + 1) < text.length
      ? text.codePointAt(i + 1).toString(16).padStart(2, '0')
      : '';
    compressed += String.fromCodePoint(parseInt(fst + snd, 16));
  }
  return compressed;
}

function deb65k(compressed) {
  let text = '';
  for (let i = 0; i < compressed.length; i++) {
    const value = compressed.codePointAt(i);
    if (value > 0xff) {
      // Two packed bytes: unpack the high byte first, then the low byte.
      const pair = value.toString(16).padStart(4, '0');
      text += String.fromCodePoint(parseInt(pair.slice(0, 2), 16));
      text += String.fromCodePoint(parseInt(pair.slice(2), 16));
    } else {
      // A lone trailing byte (odd-length input) was stored as-is.
      text += String.fromCodePoint(value);
    }
  }
  return text;
}

(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');
  console.log('file size', text.length);
  console.log('btoa size', btoa(text).length);
  console.log('65k size', b65k(text).length);
  // Identical
  // console.log(deb65k(b65k(text)));
})();

@Daniel15

Daniel15 commented Jul 1, 2019

@paulmillr - String.prototype.length measures the number of UTF-16 code units (characters), not the byte length of the string. '문'.length returns 1 as it's just one character, even though it actually takes 3 bytes in UTF-8. The space taken is what we care about here.

Changing your script to output the number of bytes in the string:

console.log('file size', Buffer.byteLength(text, 'utf-8'));
console.log('btoa size', Buffer.byteLength(btoa(text), 'utf-8'));
console.log('65k size', Buffer.byteLength(b65k(text), 'utf-8'));

Shows b65k has a clear disadvantage:

λ node c:\temp\b65k.js
(node:6268) ExperimentalWarning: The fs.promises API is experimental
file size 506
btoa size 676
65k size 759

And LZString actually helps:

const LZString = require('lz-string');
...
console.log(
  'LZString size',
  Buffer.byteLength(LZString.compressToBase64(text), "utf-8")
);

results in:

LZString size 440

because that's like the whole idea of base65k.

No, the whole idea of base65k is to take advantage of Twitter counting tweet length in code points rather than in raw bytes. It's not really useful as an actual compression or transport strategy otherwise.

@paulmillr
Author

I'm talking here about browser limitations on URL parsing and URL parsing performance. I'm not talking about whether the result is fewer bytes; the result definitely has fewer characters, so it should be twice as fast to parse.

@jstrieb
Owner

jstrieb commented Jul 1, 2019

One use case of the project is having long URLs that can be stored by link shorteners with upper limits on URL size. With that use case in mind, what matters most is whether the shorteners care about bytes or characters when measuring URL size, and how they handle Unicode URLs. My guess is that it varies, but it still merits implementing a base65536 encoding feature.

Tangential note: I'm currently looking into integrating Brotli for real URL compression since it was designed for exactly this type of data and will hopefully be able to get URL lengths considerably shorter.
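
For reference, a rough sketch of what that could look like with Node's built-in Brotli bindings (zlib.brotliCompressSync), reusing the editor/main.css file from earlier in this thread:

const fs = require('fs').promises;
const zlib = require('zlib');

(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');
  const brotli = zlib.brotliCompressSync(Buffer.from(text, 'utf-8'));
  // base64 keeps the output mostly URL-friendly, though '+', '/' and '='
  // would still need escaping or a URL-safe alphabet.
  const encoded = brotli.toString('base64');
  console.log('raw bytes:', Buffer.byteLength(text, 'utf-8'));
  console.log('brotli+base64 bytes:', Buffer.byteLength(encoded, 'utf-8'));
})();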

@Daniel15

Daniel15 commented Jul 1, 2019

With that use case in mind, what matters most is whether the shorteners care about bytes or characters when measuring URL size

@jstrieb - Shorteners usually care about bytes, given they need to store the data as bytes in their database. Regardless of whether you have four characters that take one byte each, or one character that takes four bytes, that's still going to consume four bytes in a database. I ran a URL shortener for many years.

Tangential note: I'm currently looking into integrating Brotli for real URL compression since it was designed for exactly this type of data

I'd be interested in hearing how Brotli compares to LZString for this data! Looking forward to seeing the results of that 😃
