Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to create multiple lsh indices each one in its own keyspace #171

Closed
pavelnemirovsky opened this issue Jan 6, 2022 · 6 comments
Closed
Labels

Comments

@pavelnemirovsky
Copy link

pavelnemirovsky commented Jan 6, 2022

First of all, thank you for great work @ekzhu!
Here is a reproducible test that shows that my expectation is to create 1 keyspace per each LSH index unfortunately all LSH tables are being created in the scope of the same Cassandra keyspace.

def create_lsh_index(index):
    print("(Create index: {}) - ".format(index), end='')
    threshold = 0.75
    num_perm = 128
    doc1 = MinHash(num_perm)
    lsh = MinHashLSH(
        threshold=threshold, num_perm=num_perm, storage_config={
            'type': 'cassandra',
            'basename': index.encode('ascii'),
            'cassandra': {
                'seeds': cassandra_seeds,
                'keyspace': index,
                'replication': {
                    'class': 'SimpleStrategy',
                    'replication_factor': '3',
                },
                'drop_keyspace': True,
                'drop_tables': True,
            }
        }
    )
    lsh.insert("a", doc1)
    lsh.insert("b", doc1)
    counts = lsh.get_counts()
    # second instance
    assert len(counts) == 11


def test_cassandra_multi_index():
    create_lsh_index('idx1')
    create_lsh_index('idx2')
    create_lsh_index('idx3')

The produced result inside of Cassandra DB, please see below:

cqlsh> DESCRIBE keyspaces;

system_schema  system_traces  **idx1**      system_distributed_everywhere
system_auth    same_index     system  system_distributed 

cqlsh> use idx1;

cqlsh:idx1> desc tables;

lsh_idx3_keys         lsh_idx2_bucket_0003  lsh_idx3_bucket_0006
lsh_idx2_bucket_000a  lsh_idx2_bucket_0002  lsh_idx3_bucket_0007
lsh_idx1_keys         lsh_idx2_bucket_0009  lsh_idx1_bucket_0008
lsh_idx2_keys         lsh_idx2_bucket_0008  lsh_idx1_bucket_0009
lsh_idx1_bucket_000a  lsh_idx3_bucket_0008  lsh_idx1_bucket_0006
lsh_idx3_bucket_000a  lsh_idx3_bucket_0009  lsh_idx1_bucket_0007
lsh_idx2_bucket_0005  lsh_idx3_bucket_0000  lsh_idx1_bucket_0004
lsh_idx2_bucket_0004  lsh_idx3_bucket_0001  lsh_idx1_bucket_0005
lsh_idx2_bucket_0007  lsh_idx3_bucket_0002  lsh_idx1_bucket_0002
lsh_idx2_bucket_0006  lsh_idx3_bucket_0003  lsh_idx1_bucket_0003
lsh_idx2_bucket_0001  lsh_idx3_bucket_0004  lsh_idx1_bucket_0000
lsh_idx2_bucket_0000  lsh_idx3_bucket_0005  lsh_idx1_bucket_0001

Looking forward to hear from you what we are doing wrong since we don't have any practical experience with datasketch yet.

@ekzhu
Copy link
Owner

ekzhu commented Jan 7, 2022

Thanks for the issue. It looks like it is a bug in Cassandra storage. Only the first keyspace you specified will be created due to the client session being reused by all subsequent calls to the storage. The offending code is here: https://github.com/ekzhu/datasketch/blob/master/datasketch/storage.py#L284

If we have multiple shared sessions grouped by keyspace should solve this problem. I am not very familiar with Cassandra, would you consider submit a pull request to fix this bug?

@ekzhu ekzhu added the bug label Jan 7, 2022
@pavelnemirovsky
Copy link
Author

@ekzhu sure, will take a look and we will submit PR for that.

@ronassa
Copy link
Contributor

ronassa commented Jan 16, 2022

hi @ekzhu, add a pull request that fixes this issue by creating\switching to different keyspace when needed. #172

@pavelnemirovsky
Copy link
Author

@ekzhu Hi, where you are going to merge and release the version?

@ekzhu
Copy link
Owner

ekzhu commented Jan 24, 2022

@ekzhu Hi, where you are going to merge and release the version?

I am going to merge and release a new version soon. But just need a few more clarification about the pull request #172

ekzhu pushed a commit that referenced this issue Feb 4, 2022
… (#172)

Co-authored-by: Ron Assa <ron@analytika.io>
@ekzhu
Copy link
Owner

ekzhu commented Feb 4, 2022

Merged an released.

@ekzhu ekzhu closed this as completed Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants