[WIP]Hopscotch concurrent map #106

prototype2904 · 2017-12-27T12:08:45Z

By @LeoSko, @Dron642 and me
Leonid Skorospelov 2303
Chulanov Andrey 2381
Stetskevich Roman 2381

prototype2904 · 2017-12-27T12:10:27Z

http://people.csail.mit.edu/shanir/publications/disc2008_submission_98.pdf
[2008] Herlihy,Shavit,Tzafrir Hopscotch Hashing.pdf

khizmax · 2017-12-27T15:06:41Z

Что бросается в глаза при беглом просмотре:

Почему используются pthread-примитивы синхронизации (pthread_mutex, pthread_condvar)? В C++11 это все уже стандартизовано
Зачем обилие volatile?.. Volatile предназначен для обращения к устройствам (memory port mapping). Для обычного кода нужно использовать либо std::atomic, либо примитивы синхронизации (mutex, condvar и пр.).
Реализация не общая, а только если KEY -один из int-типов. Почему?.. Что мешает использовать хеш-функции?..
Прогнать тест для 1000 потоков, каждый из которых выполняет 1 операцию, - это не тест для конкурентных структур. Нужно несколько миллионов разных операций для разумного (4-10) числа потоков, см. stress-тесты.

DaKellyFella · 2017-12-27T19:21:00Z

You should change the authors to whom you copied this code from. This code bears a striking resemblance replete with bugs and deficiencies to this and this. The header clearly states that this code is free to use so long as due acknowledgement is given.

You even have the same comments, the same dependencies to pthread, same variable names, other oddities like taking the values by pointer, and you also never use the value in the key/value hashmap.

Correct me if I'm wrong but this is similar to the level of extreme suspicion.

Edit: Grammar.

LeoSko · 2017-12-28T05:26:58Z

@DaKellyFella we apoligise for this misunderstandings and surely will change it soon.
@khizmax В субботу будет поправлено.

DaKellyFella · 2017-12-28T13:38:53Z

Another issue is that the paper you link starts off by describing Hopscotch Hashing with each bucket having a fixed size neighbourhood (unsigned int volatile _hop_info;), which is a bit-mask indicating which bucket contains relevant information, however ultimately their algorithm is changed to allow for infinite size neighbourhoods by storing direct offsets in each bucket. The details are described in the paper you linked in the appendix.

For my research I use the implementation given here, which was written by the paper's original authors. The code can be quite hard to understand but it would be great if you modify it in accordance with what @khizmax stated and change your underlying algorithm so that is uses offsets rather than bit-masks.

Edit: I added that the code I linked is written by the original authors.

LeoSko · 2017-12-28T15:22:02Z

@DaKellyFella
however ultimately their algorithm is changed to allow for infinite size neighbourhoods by storing direct offsets in each bucket. The details are described in the paper you linked in the appendix.

We don't see what paper you exactly want us to consider looking at.

But, we agree that looking at source you provided link to is worth it. It would also be great if we could find some kind of comparison why we should use infinite size neighbourhoods.

DaKellyFella · 2017-12-29T13:18:54Z

@LeoSko
We don't see what paper you exactly want us to consider looking at.

Sorry, I didn't explain it very well. Here's a more thorough explanation:

There are two versions of Hopscotch.

Each bucket has a bit-mask indicating the presence of an item in buckets at or ahead of the current bucket. The size of this neighbourhood is referred to as H.
Each bucket contains two offsets, one of them indicating which bucket to check next in the probe chain when searching for items.

The code you wrote/found is the first version. The second version is the code I linked. The second version is superior as it cannot suffer from bucket saturation where too many items get added to a single bucket, overflowing the bit-mask. If more than H items are added to a single bucket then the algorithm breaks.

If you check the appendix of their paper you can see each bucket has two short members as opposed to the bit-mask unsigned int volatile _hop_info in the current code.

LeoSko · 2017-12-29T14:46:08Z

@DaKellyFella Thank you for the last link, that is what we were looking for, we will look into it soon.

As fas as I understand for the first kind of realisation ("first version"), if it happened so that 32 items were added into the same bucket, then we just resize the table, rearrange items with new table size and try again. It can happen that it won't help, but it is almost impossible, isn't it?

DaKellyFella · 2017-12-29T15:05:20Z

@LeoSko

As fas as I understand for the first kind of realisation ("first version"), if it happened so that 32 items were added into the same bucket, then we just resize the table, rearrange items with new table size and try again. It can happen that it won't help, but it is almost impossible, isn't it?

Yes, you're right. It should be noted the authors make the same argument but their final code is implemented with version 2. Perhaps there were other concerns relating to performance that they found and as such stuck with that version.

You could measure the performance of both implementations and see for yourself.

Roma

LeoSko · 2018-01-04T19:22:17Z

@khizmax, объясните пожалуйста как должным образом добавить тестирование в проект. По образу и подобию не можем сообразить - куда смотреть вообще?

khizmax · 2018-01-04T21:31:12Z

Начнем с того, что ваш set на самом деле является map'ом, раз у него в явном виде задаются Key и Value через template-параметры. У set данные не разделены на key и value. Но это в данном случае не принципиально.

Есть два типа тестов:

unit-тесты: однопоточные, проверяют интерфейс реализации контейнера. То есть тесты должны покрывать все вызовы public-методов + наибольшее число различных сочетаний traits.
stress-тесты: многопоточные. Это самые главные тесты для конкурентных контейнеров. В libcds разработаны несколько stress-тестов для map, в которые с помощью template-магии вставляются те или иные реализации map'ов.

Для добавления unit-тестов следует просто добавить новый файл hopschotch_map.cpp сюда: libcds\test\unit\striped-map\ и добавить этот файл в солюшн и в CMakeList.txt. По большому счету Вы вольны написать в hopschotch_map.cpp все что угодно, используя google test framework.
В качестве примера можно посмотреть на libcds\test\unit\striped-map\cuckoo_map.cpp. В libcds предполагается, что сходные контейнеры предоставляют один и тот же интерфейс, поэтому тест для cuckoo_map использует уже готовый код тестов из libcds\test\unit\striped-map\test_map.h, подставляя в него себя с разными traits.

Для добавления stress-тестов критерии пожестче: интерфейс вашего map'а должен соответствовать принятому в libcds интерфейсу map'а. Здесь libcds\test\stress\map\ находятся реализации stress-тестов для map. Вам необходимо создать файл map_type_hopscotch.h, в котором объявляются все интересные специализации Вашего map'а (см. в качестве примера map_type_cuckoo.h). Далее для каждого stress-теста (это подкаталоги libcds\test\stress\map) создается файл с вашей реализацией, в котором просто прописывается тот define, который вы объявили в map_type_hopscotch.h. См, например, libcds\test\stress\map\del3\map_del3_cuckoo.cpp).
Note: не все stress-тесты подходят для любого контейнера. Например, minmax-тест подразумевает упорядоченный контейнер, что в случае hash-based контейнеров неверно. Но тесты del3, delodd, find_string, insdel_func, insdel_item_int, insdel_string, insdelfind должны быть приемлемы для каждой реализации map'а в libcds.

Libcds test framework разработан так, чтобы при добавлении новых реализаций контейнеров не нужно было изменять сам framework - файл main.cpp, файлы из libcds\test\stress\framework\ и c:\Works\libcds\test\include\cds_test.

khizmax · 2018-01-08T16:03:54Z