
FAISS returns negative ids (not -1) #2135

Open

abdullahbas opened this issue Dec 1, 2021 · 13 comments

Comments

@abdullahbas

Summary

FAISS returns really big negative ids (not -1), like -663323444433213679, and big positive ids, like 77711200039921993321, even though we do not have those indices.

Faiss version: 1.7.1

Installed from: pypi

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

The code is quite complicated, so I couldn't share a full snippet, but the search returns something like:

[-7882858908496526721, 148477514, 7318772159358522531, 131445014, -8263696823219615651, 123521031, -5807271324421810311, 124208452, 38032875, 146904364, 139624482, 125867015, 139643914, 125254479, 18606842, 147101967, -8246501689735019874, 119442532, 141874179, 138070620, 130286272, 129548931, 131521583, 107358047, -8528840699497380558, 148568457, 127924406, 60198081, 23002488, 134854969, 38924547, 134703770, 33097768, 146073936, 69678871, 145498691, -6661535923009526919, 145471504, 137858014, 142931410, 137858015, 140687014, 140038207, 74294394]

@mdouze
Contributor

mdouze commented Dec 1, 2021

This would be a bug. What type of index? Code to repro?

@abdullahbas
Author

abdullahbas commented Dec 1, 2021

We use 'PCAR64,IVF4096(IVF512,PQ32x4fs,RFlat),SQ8' as the index type to index BERT embeddings. We apply L2 normalization not only before training but also before searching.

This is not the exact code flow; I am showing it only to illustrate the key parts and how they fit together.

index.train(doc_embeddings[tr_indices])                     # train the index
index.add_with_ids(doc_embeddings[i0:i1], ids_list[i0:i1])  # add embeddings and their corresponding ids
faiss.write_index(index, tmpdir + "main.index")
index = faiss.read_index(tmpdir + "main.index", faiss.IO_FLAG_ONDISK_SAME_DIR)  # read the merged index from disk
D, I = index.search(sentence_embedding, n_result)           # search

The I returned by the search contains the same values shown in the summary above, including the huge negative ids.

The rest of the code is quite long and object-oriented, so it would be hard to follow; hence I have extracted only the essential parts. If you still need the whole code, I can attach the notebook files or scripts.
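
For completeness, here is a minimal self-contained sketch of the same flow, with random vectors standing in for our BERT embeddings (the dimension, sizes and file name are illustrative, not our production values):

import numpy as np
import faiss

d = 256                                            # embedding dimension (illustrative)
xb = np.random.rand(100000, d).astype('float32')   # stand-in for doc_embeddings
xq = np.random.rand(5, d).astype('float32')        # stand-in for sentence_embedding
ids = np.arange(100000).astype('int64')            # stand-in for ids_list

faiss.normalize_L2(xb)                             # L2 norm before training/adding
faiss.normalize_L2(xq)                             # ... and before searching

index = faiss.index_factory(d, 'PCAR64,IVF4096(IVF512,PQ32x4fs,RFlat),SQ8')
index.train(xb)
index.add_with_ids(xb, ids)
faiss.write_index(index, "main.index")

index = faiss.read_index("main.index", faiss.IO_FLAG_ONDISK_SAME_DIR)
D, I = index.search(xq, 10)                        # every id in I should be one we added, or -1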

We used the merge_on_disk.py example to index all of our 127 million vectors on disk.

For the merging operation, after creating several block.index files, we use:
merge_ondisk(index, block_fnames, tmpdir + "merged_index_test.ivfdata")
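
For context, the merge step roughly follows that example (the "trained.index" file name and n_blocks are illustrative; merge_ondisk comes from faiss.contrib):

import faiss
from faiss.contrib.ondisk import merge_ondisk

# each block_i.index was built by adding one slice of the data (with its ids)
# to a fresh copy of the empty trained index
index = faiss.read_index(tmpdir + "trained.index")   # empty but trained index
block_fnames = [tmpdir + "block_%d.index" % i for i in range(n_blocks)]
merge_ondisk(index, block_fnames, tmpdir + "merged_index_test.ivfdata")
faiss.write_index(index, tmpdir + "main.index")      # this .index file refers to the .ivfdata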

By the way, for updating the current index we use:

pseudo_index = faiss.read_index(self.config["mount_path"] + self.config["save_path"] + "pseudo.index", faiss.IO_FLAG_ONDISK_SAME_DIR)
pseudo_index.add_with_ids(embeddings, np.array(ids))
faiss.write_index(pseudo_index, save_path + "main.index")

@abdullahbas
Author

abdullahbas commented Dec 5, 2021

I think the cause of the problem is the add step. When I add new entries to the existing index, it just breaks the pipeline, and I don't know why. The main.index and pseudo.index that I showed in the previous post were working, but after adding new data they got corrupted. They all use the same merged .ivfdata file; maybe that is the cause, I don't know.

@abdullahbas
Author

This is the id list from our index. As you can see, some of the ids are not correct.

[screenshot: id list from the index, showing some corrupted id values]

Should we train our index again after using add_with_ids?

@abdullahbas
Author

abdullahbas commented Dec 6, 2021

I tried to track down the source of the error and found that the ids are corrupted after calling

index.add_with_ids(embeddings, np.array(ids))

Our index is on disk. Is the problem on our side? Do we have to add new entries by creating new blocks and then merging again? Is add_with_ids supported for on-disk indexes? @mdouze

@fonspa

fonspa commented Feb 22, 2022

Hi @abdullahbas ,
did you manage to make it work? I'm also trying to merge incremental blocks into an empty trained index, 'IVFx,PQYx4fs,Refine(SQfp16)', but ran into multiple problems with add_with_ids.
It also seems that the invlists.code_size value can end up arbitrary and wrong; I encountered arbitrary values for this field that prevented the merge.
If you succeeded in merging the blocks, can you explain the steps you took? Thanks.

@abdullahbas
Author

@fonspa Actually we solved it by using two alternating indexes. We build the new index and delete the old one asynchronously: after completing everything on the new index, we load it as the main index and then delete the old one. You have to manage the blocking carefully. We only update the blocks affected by new messages, so the whole update takes only 2-3 minutes.
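
Roughly, the double-buffering looks like this (class name and structure are illustrative, not our actual code):

import faiss

class DoubleBufferedIndex:
    # serve queries from the active index while the inactive one is rebuilt
    def __init__(self, path_a, path_b):
        self.paths = [path_a, path_b]
        self.active = 0
        self.index = faiss.read_index(self.paths[self.active], faiss.IO_FLAG_ONDISK_SAME_DIR)

    def swap(self):
        # call this only after the rebuild of the inactive index has fully completed;
        # the files of the now-inactive index can then be deleted or rebuilt in turn
        self.active = 1 - self.active
        self.index = faiss.read_index(self.paths[self.active], faiss.IO_FLAG_ONDISK_SAME_DIR)

    def search(self, xq, k):
        return self.index.search(xq, k)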

@mdouze
Contributor

mdouze commented Mar 11, 2022

Sorry for the late answer.
Adding to an on-disk index is not recommended: it is slow, and support for it will be removed in a subsequent Faiss version.
Merging "fast scan" index variants (IVFx,PQYx4fs) is not supported. This should throw an error; support may eventually be implemented.

@zhangchuheng123

BTW, I have encountered the problem of FAISS returning some -1 indices, but I cannot find any explanation of what -1 means. I have checked that index.ntotal = 1027120; the method is IVFFlat with nlist=1024 and nprobe=4. I searched with xq.shape=(1024, 488) for k=50 neighbors. It returns the following:

inds[273]
array([ 453005,  470124,  521481,  538600,  589957,  607076,  675552,
        658433,  726909,  744028,  795385,  812504,  880980,  863861,
        949456,  932337, 1017932, 1000813,   59285,   42167,  127757,
        110639,  196229,  179111,  264701,  247583,  453006,  435887,
        521482,  504363,  333173,  316055,  589958,  572839,  658434,
        641315,  384530,  367411,  863862,  846743,  726910,  709791,
        795386,  778267,      -1,      -1,      -1,      -1,      -1,
            -1])

I have also checked the X that I used to build the index: all the elements are finite and < 10. Does anyone know the reason?

@mdouze
Contributor

mdouze commented Jan 10, 2023

see https://github.com/facebookresearch/faiss/wiki/FAQ#what-does-it-mean-when-a-search-returns--1-ids
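
In short: the -1 entries are padding that appears when fewer than k results are found in the probed lists; with IVF indexes, increasing nprobe makes this less likely. A simple way to drop them on the caller side (a sketch using the variables from the question above):

D, I = index.search(xq, k)
for q in range(xq.shape[0]):
    valid = I[q] != -1                     # -1 marks "fewer than k results found"
    ids_q, dists_q = I[q][valid], D[q][valid]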

@pablocael

@fonspa Actually we solved it by using two alternating indexes. We build the new index and delete the old one asynchronously: after completing everything on the new index, we load it as the main index and then delete the old one. You have to manage the blocking carefully. We only update the blocks affected by new messages, so the whole update takes only 2-3 minutes.

Hi, can you explain in more detail what you did to make it work? We have the same issue now: any operation on an on-disk index corrupts it. I have tried:

  • merge_from
  • merge_into
  • add_with_ids

All of them corrupt the index and cause search to return wrong ids.

@pablocael

pablocael commented Jun 6, 2024

@mdouze Sorry, I have the same issue of invalid ids after trying to add data in any way to the on-disk index.
I have tried:

  • merge_from
  • merge_into
  • add_with_ids

All of them produce invalid ids.

My questions are:

  • Is there any way to add more data to an on-disk index?
  • If not, what is the purpose of the on-disk index? Is it only for creating an index from shards and merging them on disk? Is there no way to add more data to Faiss indices without using fully in-RAM indices?

Thank you in advance

@pablocael

@mdouze I have now opened an issue with simple code to reproduce the problem. The issue is not only with add_with_ids: any merge operation (merge_from, merge_into) will also corrupt about 13% of the ids within the index.
#3498
