Simple-kNN-Gzip

A simple approach to sentiment analysis using gzip Normalized Compression Distance (NCD) with k-nearest neighbors, with both a linear and a multiprocessed implementation.

Original work that this concept is based on: https://aclanthology.org/2023.findings-acl.426.pdf

The paper's authors also have implementation code here: https://github.com/bazingagin/npc_gzip

This is not a fork of their work. I wrote this one myself, based on what I read in the paper, just to see if I actually understood what was going on. They achieve a higher accuracy than I found on a separate dataset, but there appears to be something interesting and useful about this methodology.
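For readers who want the idea in a few lines, here is a minimal sketch of gzip NCD with a k-nearest-neighbors vote. It is not the notebook's code: the function names (`clen`, `ncd`, `knn_predict`), the toy sentences, and the choice to join the two strings with a space are all assumptions made for illustration. The distance used is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two strings."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)  # joining with a space is an assumption of this sketch
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_predict(query, train_texts, train_labels, k=3):
    """Majority vote among the k training texts with the smallest NCD to query."""
    distances = [ncd(query, t) for t in train_texts]
    nearest = sorted(range(len(train_texts)), key=lambda i: distances[i])[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data, made up for this example only.
    texts = [
        "the movie was wonderful and I loved every minute of it",
        "this film was terrible and a complete waste of my evening",
    ]
    labels = ["pos", "neg"]
    print(knn_predict("I loved this wonderful movie", texts, labels, k=1))
```

Every prediction compares the query against every training text, so the cost grows with the product of test and training set sizes. That is presumably why a multiprocessed version is worth having.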

Ken Schutte also has a couple of writeups explaining at least two of the major issues with the original paper:

k=2 tiebreaker issue: https://kenschutte.com/gzip-knn-paper/
Dataset test leakage (test data was also in the training data): https://kenschutte.com/gzip-knn-paper2/

I don't think either of those issues invalidates my findings here, though notably my accuracy is far below their reported accuracy anyway.
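Both issues can be guarded against with small checks. The sketch below is illustrative only and is not taken from the notebook; `vote_with_tiebreak` and `leaked_examples` are hypothetical helper names. The first returns a single label even when the vote is split (as it can be with k=2), letting the closest neighbor decide. The second lists test texts that also appear verbatim in the training data.

```python
from collections import Counter


def vote_with_tiebreak(neighbor_labels):
    """Pick one label from the k nearest neighbors' labels (closest first).

    The majority label wins. On a tie, the tied label belonging to the
    closest neighbor wins, so exactly one label is always returned.
    Expects a non-empty list.
    """
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    tied = {label for label, count in counts.items() if count == top}
    for label in neighbor_labels:  # iterate closest first
        if label in tied:
            return label


def leaked_examples(train_texts, test_texts):
    """Return the test texts that also appear verbatim in the training texts."""
    train_set = set(train_texts)
    return [text for text in test_texts if text in train_set]
```

Exact string matching only catches verbatim duplicates. Near-duplicates (different whitespace or casing, for example) would need normalization first, which this sketch does not attempt.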

Future work here:

I wonder about further "feature extraction" that uses this sort of compression length as a feature. For example, rather than NCDs, the compression ratio from original string size to compressed size might be even more useful, since (I believe) the reason this works is the statistical similarity in words/phrases and their syntactic uses, which gzip exploits for compression.
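As a starting point for that experiment, a compression-ratio feature could be computed as below. This is a sketch under assumptions rather than anything already in the repo: `compression_ratio` is a hypothetical name, and the ratio is defined here as compressed bytes divided by original bytes (the inverse would work just as well).

```python
import gzip


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size, both in bytes (UTF-8)."""
    data = text.encode("utf-8")
    if not data:
        raise ValueError("cannot compute a ratio for an empty string")
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    # Made-up inputs: a highly repetitive string and an ordinary sentence.
    print(compression_ratio("ha" * 500))
    print(compression_ratio("The plot was thin but the acting was superb."))
```

One caveat: gzip adds a fixed header and trailer, so very short texts can come out with a ratio above 1. A per-text ratio therefore depends heavily on text length, which would need to be controlled for before using it as a feature.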

I am also curious whether it's remotely possible to use a compressor like gzip as a compressor, and potentially a tokenizer, for transformers. Surely this isn't a new idea and there's a great reason why it won't work, but I am tempted to try that next.

IDK. This just shouldn't work at all IMO :D

alberto-solano/Simple-kNN-Gzip

Folders and files

Latest commit

History

Repository files navigation

Simple-kNN-Gzip

Future work here:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages