HnswDensevector SafeTensor Generator #2515

Closed · wants to merge 48 commits

Conversation

@Panizghi (Contributor) commented Jun 2, 2024

Linked issue: castorini/ura-projects#31 (comment)
@17Melissa will provide the workflow commands below :)

@17Melissa (Contributor)

Setup for NFCorpus Indexing with Safetensors

To efficiently perform NFCorpus indexing using Safetensors, follow this setup workflow:

  1. Download and Unzip Collections
    • Begin by downloading the necessary collection and unpacking it. For instance:
      wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.tar -P collections/
      tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.tar -C collections/
  2. Prepare the Environment
    • Navigate to the Safetensors directory within the Anserini project:
      cd /anserini/src/main/python/safetensors
    • Create and activate a virtual environment:
      python3 -m venv venv
      source venv/bin/activate
    • Install the required Python packages:
      pip install -r requirements.txt
  3. Convert JSON to Safetensors Format
    • Use the provided script to convert the JSONL files to Safetensors format (a sketch of the conversion logic follows after these steps):
      python3 -m json_to_bin
    • The script will create the following files in the target directory:
      • Saved vectors to ../../../../target/safetensors/vectors/part00_vectors.safetensors
      • Saved docids to ../../../../target/safetensors/docids/part00_docids.safetensors
      • Saved docid_to_idx mapping to ../../../../target/safetensors/docid_to_idx/part00_docid_to_idx.json
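
For orientation, the conversion the script performs amounts to reading each JSONL record's docid and vector, packing the vectors into one tensor, encoding the docids, and writing the results with the safetensors library. Below is a minimal sketch, assuming the JSONL schema used in this PR ({"docid": ..., "vector": [...]}) and illustrative tensor names; the actual json_to_bin.py may differ in its details.

# Minimal sketch of the JSONL -> safetensors conversion (assumed reconstruction,
# not the actual json_to_bin.py). Each input line holds {"docid": ..., "vector": [...]}.
import gzip
import json
import numpy as np
from safetensors.numpy import save_file

def convert(jsonl_path, vectors_out, docids_out, mapping_out=None):
    docids, vectors = [], []
    opener = gzip.open if jsonl_path.endswith(".gz") else open
    with opener(jsonl_path, "rt") as f:
        for line in f:
            record = json.loads(line)
            docids.append(record["docid"])
            vectors.append(record["vector"])

    # All vectors become a single (num_docs, dim) float32 tensor.
    save_file({"vectors": np.asarray(vectors, dtype=np.float32)}, vectors_out)

    # Docids are strings; one simple encoding is fixed-width ASCII code points.
    width = max(len(d) for d in docids)
    codes = np.zeros((len(docids), width), dtype=np.int32)
    for i, d in enumerate(docids):
        codes[i, :len(d)] = [ord(c) for c in d]
    save_file({"docids": codes}, docids_out)

    if mapping_out is not None:
        # docid -> row index mapping, written as JSON.
        with open(mapping_out, "w") as f:
            json.dump({d: i for i, d in enumerate(docids)}, f)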

Indexing Procedure

To build HNSW Safetensors indexes, use the following sample command:

bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw \
  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

Ensure all paths and parameters are adjusted according to your setup and directory structure.

@lintool (Member) commented Jun 2, 2024

Can you make the safetensors collection go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/, alongside the original? So all files should go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/.

We also shouldn't need a new indexer. The indexing command should be similar to https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus-bge-base-en-v1.5-hnsw.md

e.g.,

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input /path/to/beir-v1.0.0-bge-base-en-v1.5 \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus-bge-base-en-v1.5/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

With the only exception being a different -generator.

@17Melissa (Contributor)

Updated Workflow for Safetensors Conversion and Indexing Process

  1. Create Directory: Create the safetensors folder collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
  2. Run Conversion Script: Execute the Python script json_to_bin.py from the root directory using the command:
    python src/main/python/safetensors/json_to_bin.py
  3. Execute Indexing Command: Run the indexing command below after the conversion script completes:
bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

@Panizghi (Contributor, Author) commented Jul 9, 2024

Updates

  • Removed hardcoded paths.
  • Removed indexer arguments and updated the path hierarchy.
  • Internal mapping of docids to vectors.
  • Updated arguments for the Python script's input and output.

Updated commands

Python

python src/main/python/safetensors/json_to_bin.py \
  --input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
  --output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Java

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

@Panizghi Panizghi reopened this Jul 9, 2024
@Panizghi (Contributor, Author)

Updated command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection SafeTensorsDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
  -generator SafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

@lintool (Member) commented Aug 23, 2024

@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case.

For example, for robust04:

$ ls robust04/
vectors.part00.jsonl.gz  vectors.part01.jsonl.gz  vectors.part02.jsonl.gz  vectors.part03.jsonl.gz  vectors.part04.jsonl.gz  vectors.part05.jsonl.gz

@lintool (Member) commented Aug 23, 2024

@Panizghi on your branch, running:

$ python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Works fine. However, I would like some progress indication... e.g., using tqdm?

Also, what do I do if there is more than one vector part?


However, the output is more compact, as expected, which is good.

$ du -h collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
22M	collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/
$ du -h collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/
84M	collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/

@lintool (Member) commented Aug 23, 2024

Running indexing command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Something's not right... get an exception:

2024-08-23 07:40:33,960 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:205) - Setting log level to INFO
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:208) - ============ Loading Index Configuration ============
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:209) - AbstractIndexer settings:
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:210) -  + DocumentCollection path: collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:211) -  + CollectionClass: SafeTensorsDenseVectorCollection
2024-08-23 07:40:33,963 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:212) -  + Index path: indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/
2024-08-23 07:40:33,964 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:213) -  + Threads: 16
2024-08-23 07:40:33,964 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:214) -  + Optimize (merge segments)? false
Aug 23, 2024 7:40:34 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
2024-08-23 07:40:34,217 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:149) - HnswIndexer settings:
2024-08-23 07:40:34,217 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:150) -  + Generator: SafeTensorsDenseVectorDocumentGenerator
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:151) -  + M: 16
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:152) -  + efC: 100
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:153) -  + Store document vectors? false
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:154) -  + Int8 quantization? false
2024-08-23 07:40:34,218 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:155) -  + Codec: Lucene99
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:156) -  + MemoryBuffer: 65536
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:157) -  + MaxThreadMemoryBeforeFlush: 2047
2024-08-23 07:40:34,219 INFO  [main] index.IndexHnswDenseVectors (IndexHnswDenseVectors.java:160) -  + MergePolicy: NoMerge
2024-08-23 07:40:34,219 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:238) - ============ Indexing Collection ============
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:247) - Thread pool with 16 threads initialized.
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:248) - 2 files found in collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
2024-08-23 07:40:34,222 INFO  [main] index.AbstractIndexer (AbstractIndexer.java:249) - Starting to index...
2024-08-23 07:40:34,225 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:48) - Processing document ID: MED-10 with thread: pool-2-thread-1
2024-08-23 07:40:34,225 WARN  [pool-2-thread-2] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:43) - Document ID: MED-10 is already being processed by another thread.
java.lang.NullPointerException
	at java.base/java.util.Objects.requireNonNull(Objects.java:233)
	at java.base/java.util.ImmutableCollections$List12.<init>(ImmutableCollections.java:563)
	at java.base/java.util.List.of(List.java:937)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1476)
	at io.anserini.index.AbstractIndexer$IndexerThread.run(AbstractIndexer.java:135)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
2024-08-23 07:40:34,229 ERROR [pool-2-thread-2] index.AbstractIndexer$IndexerThread (AbstractIndexer.java:179) - pool-2-thread-2: Unexpected Exception:
2024-08-23 07:40:34,235 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:56) - Vector length: 768 for document ID: MED-10
Aug 23, 2024 7:40:34 AM org.apache.lucene.internal.vectorization.PanamaVectorizationProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=256; FMA enabled
2024-08-23 07:40:34,277 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:64) - Document created for ID: MED-10
20

@Panizghi (Contributor, Author)

@Panizghi if I'm reading your code correctly, you're assuming that there's only one vector file per directory, right? This is not necessarily the case.

For example, for robust04:

$ ls robust04/
vectors.part00.jsonl.gz  vectors.part01.jsonl.gz  vectors.part02.jsonl.gz  vectors.part03.jsonl.gz  vectors.part04.jsonl.gz  vectors.part05.jsonl.gz

Yes, that is correct. In the early discussion we kept it to nfcorpus only, with a single file; I will update the code to handle multiple files.
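
One way the multi-part handling could look, sketched purely for illustration (function and variable names here are not from the PR), is to glob the part files in the input directory and run the single-file conversion on each, mirroring the per-part output names that appear later in this thread:

# Hypothetical sketch: handle a directory holding several vectors.partNN.jsonl(.gz)
# files (e.g., robust04), producing one safetensors pair per part.
import glob
import os

def convert_directory(input_dir, output_dir, convert_one):
    # convert_one(jsonl_path, vectors_out, docids_out) is a single-file
    # converter, e.g. the convert() function sketched earlier in this thread.
    parts = sorted(glob.glob(os.path.join(input_dir, "vectors.part*.jsonl*")))
    os.makedirs(output_dir, exist_ok=True)
    for part in parts:
        # e.g., vectors.part03.jsonl.gz -> vectors.part03
        stem = os.path.basename(part).split(".jsonl")[0]
        vectors_out = os.path.join(output_dir, f"{stem}_vectors.safetensors")
        docids_out = os.path.join(output_dir, f"{stem}_docids.safetensors")
        convert_one(part, vectors_out, docids_out)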

@Panizghi (Contributor, Author)

Running indexing command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Something's not right... get an exception:

2024-08-23 07:40:34,225 INFO  [pool-2-thread-1] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:48) - Processing document ID: MED-10 with thread: pool-2-thread-1
2024-08-23 07:40:34,225 WARN  [pool-2-thread-2] generator.SafeTensorsDenseVectorDocumentGenerator (SafeTensorsDenseVectorDocumentGenerator.java:43) - Document ID: MED-10 is already being processed by another thread.
java.lang.NullPointerException
	at java.base/java.util.Objects.requireNonNull(Objects.java:233)
	at java.base/java.util.ImmutableCollections$List12.<init>(ImmutableCollections.java:563)
	at java.base/java.util.List.of(List.java:937)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1837)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1476)
	at io.anserini.index.AbstractIndexer$IndexerThread.run(AbstractIndexer.java:135)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)

This should be fixed now and should work with the same command.

@Panizghi (Contributor, Author)

Works fine. However, I would like some progress indication... e.g., using tqdm?

Also, what do I do if there is more than one vector part?

tqdm is added. There is also an --overwrite argument, which you can use if the output files already exist.

command:

python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite

Sample output:

Processing lines: 100%|█████████████████████████████████████████████████| 3633/3633 [00:01<00:00, 3347.56it/s]
2024-08-25 00:53:02,642 - INFO - Saved vectors to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,643 - INFO - Saved docids to collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
2024-08-25 00:53:02,643 - INFO - Loaded vectors from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_vectors.safetensors
2024-08-25 00:53:02,644 - INFO - Loaded document IDs (ASCII) from collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/vectors.part00_docids.safetensors
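
For context, the progress bar and the overwrite behaviour described here typically amount to something like the following. This is an illustrative sketch: only the --input/--output/--overwrite flag names and the "Processing lines" bar come from this thread; everything else is assumed rather than taken from json_to_bin.py.

# Illustrative sketch of the tqdm progress bar and --overwrite guard.
import argparse
import gzip
import os
from tqdm import tqdm

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--output", required=True)
parser.add_argument("--overwrite", action="store_true")
args = parser.parse_args()

os.makedirs(args.output, exist_ok=True)
# e.g., vectors.part00.jsonl -> vectors.part00_vectors.safetensors
stem = os.path.basename(args.input).split(".jsonl")[0]
vectors_out = os.path.join(args.output, f"{stem}_vectors.safetensors")
if os.path.exists(vectors_out) and not args.overwrite:
    raise SystemExit(f"{vectors_out} already exists; pass --overwrite to regenerate it")

opener = gzip.open if args.input.endswith(".gz") else open
with opener(args.input, "rt") as f:
    lines = f.readlines()

# Wrapping the loop in tqdm produces the "Processing lines: 100%|...|" bar shown above.
for line in tqdm(lines, desc="Processing lines"):
    pass  # parse docid/vector and accumulate, as in the earlier conversion sketch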

For vector parts, are we considering a case like this?

{
  "docid": "MED-10",
  "vector_1": [0.00344, 0.00231, ...],
  "vector_2": [0.00112, 0.00456, ...]
}

@lintool (Member) left a comment

Please update the tools submodule in your branch to bring it up to date with master.

@lintool (Member) commented Aug 25, 2024

Okay, I can now run these commands:

python src/main/python/safetensors/json_to_bin.py \
--input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl.gz \
--output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus --overwrite

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

After I build the index, I should be able to switch to retrieval, here: https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus.bge-base-en-v1.5.hnsw.onnx.md

The retrieval command is this:

bin/run.sh io.anserini.search.SearchHnswDenseVectors \
  -index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
  -topics tools/topics-and-qrels/topics.beir-v1.0.0-nfcorpus.test.tsv.gz \
  -topicReader TsvString \
  -output runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt \
  -generator VectorQueryGenerator -topicField title -removeQuery -threads 16 -hits 1000 -efSearch 1000 -encoder BgeBaseEn15

But the eval command generates errors:

$ bin/trec_eval -c -m ndcg_cut.10 tools/topics-and-qrels/qrels.beir-v1.0.0-nfcorpus.test.txt runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
WARNING: Using incubator modules: jdk.incubator.vector
trec_eval.form_res_qrels: duplicate docs MED-1000trec_eval: Can't calculate measure 'ndcg_cut'

From here:

$ head runs/run.beir-v1.0.0-nfcorpus.bge-base-en-v1.5.bge-hnsw-onnx.topics.beir-v1.0.0-nfcorpus.test.txt
PLAIN-1008 Q0 MED-2036 1 0.776562 Anserini
PLAIN-1008 Q0 MED-2036 2 0.776562 Anserini
PLAIN-1008 Q0 MED-5135 3 0.775252 Anserini
PLAIN-1008 Q0 MED-5135 4 0.775252 Anserini
PLAIN-1008 Q0 MED-4694 5 0.774549 Anserini
PLAIN-1008 Q0 MED-4694 6 0.774549 Anserini
PLAIN-1008 Q0 MED-3865 7 0.773869 Anserini
PLAIN-1008 Q0 MED-3865 8 0.773869 Anserini
PLAIN-1008 Q0 MED-3316 9 0.771660 Anserini
PLAIN-1008 Q0 MED-3316 10 0.771660 Anserini

I appear to be getting duplicates of docs, e.g., MED-2036. Are you somehow indexing everything twice?

@Panizghi (Contributor, Author)

I appear to be getting duplicates of docs, e.g., MED-2036. Are you somehow indexing everything twice?

That was initially the reason I swapped to a single thread and added a critical section; I am testing the fix right now.

@Panizghi (Contributor, Author) commented Sep 1, 2024

Updated commands:

python src/main/python/safetensors/json_to_bin.py \
  --input collections/robust04 \
  --output collections/robust04.safetensors/ --overwrite

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
-collection SafeTensorsDenseVectorCollection \
-input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
-generator SafeTensorsDenseVectorDocumentGenerator \
-index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus.bge-base-en-v1.5/ \
-threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge

Indexing Performance:

  • 8 threads: Indexed 337,860 documents in 00:01:25
  • 1 thread: Indexed 337,860 documents in 00:05:50

File Sizes:

  • Original JSONL Files (Total: 3.6 GB):

    • vectors.part00.jsonl.gz: 683 MB
    • vectors.part01.jsonl.gz: 683 MB
    • vectors.part02.jsonl.gz: 682 MB
    • vectors.part03.jsonl.gz: 683 MB
    • vectors.part04.jsonl.gz: 683 MB
    • vectors.part05.jsonl.gz: 192 MB
  • Converted Safetensor Files (Total: 3.1 GB):

    • Vector Files:
      • vectors.part00_vectors.safetensors: 586 MB
      • vectors.part01_vectors.safetensors: 586 MB
      • vectors.part02_vectors.safetensors: 586 MB
      • vectors.part03_vectors.safetensors: 586 MB
      • vectors.part04_vectors.safetensors: 586 MB
      • vectors.part05_vectors.safetensors: 165 MB
    • DocID Files:
      • vectors.part00_docids.safetensors: 13 MB
      • vectors.part01_docids.safetensors: 10 MB
      • vectors.part02_docids.safetensors: 10 MB
      • vectors.part03_docids.safetensors: 13 MB
      • vectors.part04_docids.safetensors: 10 MB
      • vectors.part05_docids.safetensors: 2.8 MB

@lintool (Member) commented Sep 10, 2024

Superseded by #2582

@lintool lintool closed this Sep 10, 2024