Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary self organizing map #53

Open
fsicre opened this issue Apr 22, 2020 · 6 comments
Open

Binary self organizing map #53

fsicre opened this issue Apr 22, 2020 · 6 comments

Comments

@fsicre
Copy link

fsicre commented Apr 22, 2020

Hi,

First I want to thank you for this awesome library.

I'm experimenting on binary NLP technics and I was wondering if you could publish a bitmagic version of SOM (self organizing map) using about ~ 10 millions of extremely sparse (< 5 / 100 000) binary vectors of same size ~ 1Mbits. The map could be for instance 128 x 128 locations.

It could be a huge step froward for me to see a correct use of bitmagic on this particular subject.

Thanks for your help.

@tlk00
Copy link
Owner

tlk00 commented May 13, 2020

Are you asking for an example on how to use BM?
Could you provide more info on the case, I am not sure I understand the 128x128 locations problem really well.

Thank you!

@fsieduc
Copy link

fsieduc commented May 21, 2020

Are you asking for an example on how to use BM?

yes, exactly, on the use case of a Self Organizing Map of binary vectors.

Usually, SOM are used on integers or float vectors as you can see on these to pages :
https://en.wikipedia.org/wiki/Self-organizing_map
https://www.google.com/search?q=self+organizing+map&tbm=isch

In my case I need to do this kind of classification on sparse binary vectors.

BitMagic seems to be a very very good candidate for that.

The naïve SOM algorithm is very simple :

  • build a 2D matrix of empty or random binary vectors (preferably with same sparsity than the vectors to classify)
  • iterate while the map is not stabilized
    • pick a vector V to classify (for e.g. randomly in the ~ 10 millions)
    • find the best location on the map i.e. the location which has the most similar vector to V, call it the Best Matching Unit (BMU)
    • hack the BMU in order to make it even more similar to the vector V
    • hack the BMU's neighbourhood's vectors in order to "move" them in the same direction of the BMU but with less effect (the farther away of the BMU, the lesser you "move" the neighbour)

The map will progressively evolved and finally each vector to classify will have a best location identified and all similar vectors in the ~ 10 millions will also be associated to locations nearby on the map.

Now imaging you have ~ 10 millions of extremely sparse binary vectors (same size ~ 1Mbits) and you want to place them on a 2D map of 128 x 128 locations.
Hamming Distance can be used.
The map will be a 2D matrix of binary vectors of ~ 1Mbits.
The algorithm on the picture below will be used in order to "hack" the locations' vectors.

As you can see, a lot of vectors will be loaded in RAM, a lot of BV comparisons will be performed, a lot of hamming distance will be calculated, a lot of bit modification will be done, so the question is

how to correctly use BitMagic for best / maximum speed ? (32/64/128Go RAM available)

https://www.semanticscholar.org/paper/An-alternative-approach-for-binary-and-categorical-Santana-Morais/f1f079640218b72822911ef3dc7394765703d62a/figure/0
https://d3i71xaburhd42.cloudfront.net/f1f079640218b72822911ef3dc7394765703d62a/3-Figure1-1.png

@tlk00
Copy link
Owner

tlk00 commented May 22, 2020

Yes, i understand the problem.
The closest example publicly available now is:
https://github.com/tlk00/BitMagic/tree/master/samples/xsample07a

It creates binary fingerprints and implements primitive (but relatively fast) clusterization using similarity measure (COUNT_AND). Not quite well documented yet, but the code should be instructive as a use case for how to compute binary distances.

I am planning to create a separate example on how to do a more efficient distance computations for Hamming, Jacquard distances using cache locking and SIMD. Current distance compute is fast but can be even faster actually to facilitate binary SOMs.

@fsieduc
Copy link

fsieduc commented May 22, 2020

Awesome, many thanks for the hint.

@fsieduc
Copy link

fsieduc commented May 24, 2020

xsample07a is very interesting, although it's difficult to understand exactly why you've choosen some technics. never mind.

this project is a side project for me.

once I have written something (in few weeks) would you mind my showing to you ? in order to have your advices concrete bitmagic usage.

@tlk00
Copy link
Owner

tlk00 commented May 26, 2020

Surest thing. I would certainly appreciate your example, please share it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants