
FAISS returns negative ids (not -1) #2135

Open

abdullahbas opened this issue Dec 1, 2021 · 13 comments

Comments

@abdullahbas

Summary

FAISS returns really big negative ids (not -1), like -663323444433213679, and big positive ids, like 77711200039921993321, even though we do not have those indices.

Faiss version: 1.7.1

Installed from: pypi

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

The code is quite complicated, so I couldn't share a full snippet, but the search returns something like:

[-7882858908496526721, 148477514, 7318772159358522531, 131445014, -8263696823219615651, 123521031, -5807271324421810311, 124208452, 38032875, 146904364, 139624482, 125867015, 139643914, 125254479, 18606842, 147101967, -8246501689735019874, 119442532, 141874179, 138070620, 130286272, 129548931, 131521583, 107358047, -8528840699497380558, 148568457, 127924406, 60198081, 23002488, 134854969, 38924547, 134703770, 33097768, 146073936, 69678871, 145498691, -6661535923009526919, 145471504, 137858014, 142931410, 137858015, 140687014, 140038207, 74294394]

@mdouze
Contributor

mdouze commented Dec 1, 2021

This would be a bug. What type of index? Code to repro?

@abdullahbas
Author

abdullahbas commented Dec 1, 2021

We use 'PCAR64,IVF4096(IVF512,PQ32x4fs,RFlat),SQ8' as the index type to index BERT embeddings. We apply L2 normalization not only before training but also before searching.

This is not the exact code flow; I am showing it only to illustrate the key parts and how they fit together.

index.train(doc_embeddings[tr_indices])                     # train the index
index.add_with_ids(doc_embeddings[i0:i1], ids_list[i0:i1])  # add embeddings and their corresponding ids
faiss.write_index(index, tmpdir + "main.index")
index = faiss.read_index(tmpdir + "main.index", faiss.IO_FLAG_ONDISK_SAME_DIR)  # read the merged index from disk
D, I = index.search(sentence_embedding, n_result)           # search

The I returned by the search contains the same values shown in the summary above, including the huge negative ids.

The rest of the code is quite long and object-oriented, so it would be hard to follow; hence I have extracted only the essential parts. If you still need the whole code, I can attach the notebook files or scripts.
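
For completeness, here is a minimal self-contained sketch of the same flow, with random vectors standing in for our BERT embeddings (the dimension, sizes and file name are illustrative, not our production values):

import numpy as np
import faiss

d = 256                                            # embedding dimension (illustrative)
xb = np.random.rand(100000, d).astype('float32')   # stand-in for doc_embeddings
xq = np.random.rand(5, d).astype('float32')        # stand-in for sentence_embedding
ids = np.arange(100000).astype('int64')            # stand-in for ids_list

faiss.normalize_L2(xb)                             # L2 norm before training/adding
faiss.normalize_L2(xq)                             # ... and before searching

index = faiss.index_factory(d, 'PCAR64,IVF4096(IVF512,PQ32x4fs,RFlat),SQ8')
index.train(xb)
index.add_with_ids(xb, ids)
faiss.write_index(index, "main.index")

index = faiss.read_index("main.index", faiss.IO_FLAG_ONDISK_SAME_DIR)
D, I = index.search(xq, 10)                        # every id in I should be one we added, or -1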

We used the merge_on_disk.py example to index all of our 127 million vectors on disk.

For the merging operation, after creating several block.index files, we use:
merge_ondisk(index, block_fnames, tmpdir + "merged_index_test.ivfdata")
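
For context, the merge step roughly follows that example (the "trained.index" file name and n_blocks are illustrative; merge_ondisk comes from faiss.contrib):

import faiss
from faiss.contrib.ondisk import merge_ondisk

# each block_i.index was built by adding one slice of the data (with its ids)
# to a fresh copy of the empty trained index
index = faiss.read_index(tmpdir + "trained.index")   # empty but trained index
block_fnames = [tmpdir + "block_%d.index" % i for i in range(n_blocks)]
merge_ondisk(index, block_fnames, tmpdir + "merged_index_test.ivfdata")
faiss.write_index(index, tmpdir + "main.index")      # this .index file refers to the .ivfdata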

By the way, for updating the current index we use:

pseudo_index = faiss.read_index(self.config["mount_path"] + self.config["save_path"] + "pseudo.index", faiss.IO_FLAG_ONDISK_SAME_DIR)
pseudo_index.add_with_ids(embeddings, np.array(ids))
faiss.write_index(pseudo_index, save_path + "main.index")

@abdullahbas
Author

abdullahbas commented Dec 5, 2021

I think the cause of the problem is the add step. When I add new entries to the existing index, it just breaks the pipeline, and I don't know why. The main.index and pseudo.index that I showed in the previous post were working, but after adding new data they got corrupted. They all use the same merged .ivfdata file; maybe that is the cause, I don't know.

@abdullahbas
Author

This is the id list from our index. As you can see, some of the ids are not correct.

[screenshot: id list from the index, showing some corrupted id values]

Should we train our index again after using add_with_ids?

@abdullahbas
Author

abdullahbas commented Dec 6, 2021

I tried to track down the source of the error and found that the ids are corrupted after calling

index.add_with_ids(embeddings, np.array(ids))

Our index is on disk. Is the problem on our side? Do we have to add new entries by creating new blocks and then merging again? Is add_with_ids supported for on-disk indexes? @mdouze

@fonspa

fonspa commented Feb 22, 2022

Hi @abdullahbas ,
did you manage to make it work? I'm also trying to merge incremental blocks into an empty trained index, 'IVFx,PQYx4fs,Refine(SQfp16)', but ran into multiple problems with add_with_ids.
It also seems that the invlists.code_size value can end up arbitrary and wrong; I encountered arbitrary values for this field that prevented the merge.
If you succeeded in merging the blocks, can you explain the steps you took? Thanks.

@abdullahbas
Author

@fonspa Actually we solved it by using two alternating indexes. We build the new index and delete the old one asynchronously: after completing everything on the new index, we load it as the main index and then delete the old one. You have to manage the blocking carefully. We only update the blocks affected by new messages, so the whole update takes only 2-3 minutes.
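
Roughly, the double-buffering looks like this (class name and structure are illustrative, not our actual code):

import faiss

class DoubleBufferedIndex:
    # serve queries from the active index while the inactive one is rebuilt
    def __init__(self, path_a, path_b):
        self.paths = [path_a, path_b]
        self.active = 0
        self.index = faiss.read_index(self.paths[self.active], faiss.IO_FLAG_ONDISK_SAME_DIR)

    def swap(self):
        # call this only after the rebuild of the inactive index has fully completed;
        # the files of the now-inactive index can then be deleted or rebuilt in turn
        self.active = 1 - self.active
        self.index = faiss.read_index(self.paths[self.active], faiss.IO_FLAG_ONDISK_SAME_DIR)

    def search(self, xq, k):
        return self.index.search(xq, k)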

@mdouze
Contributor

mdouze commented Mar 11, 2022

Sorry for the late answer.
Adding to an on-disk index is not recommended: it is slow, and support for it will be removed in a subsequent Faiss version.
Merging "fast scan" index variants (IVFx,PQYx4fs) is not supported. This should throw an error; support may eventually be implemented.

@zhangchuheng123

BTW, I have encountered the problem of FAISS returning some -1 indices, but I cannot find any explanation of what -1 means. I have checked that index.ntotal = 1027120; the method is IVFFlat with nlist=1024 and nprobe=4. I searched with xq.shape=(1024, 488) for k=50 neighbors. It returns the following:

inds[273]
array([ 453005,  470124,  521481,  538600,  589957,  607076,  675552,
        658433,  726909,  744028,  795385,  812504,  880980,  863861,
        949456,  932337, 1017932, 1000813,   59285,   42167,  127757,
        110639,  196229,  179111,  264701,  247583,  453006,  435887,
        521482,  504363,  333173,  316055,  589958,  572839,  658434,
        641315,  384530,  367411,  863862,  846743,  726910,  709791,
        795386,  778267,      -1,      -1,      -1,      -1,      -1,
            -1])

I have also checked the X that I used to build the index: all the elements are finite and < 10. Does anyone know the reason?

@mdouze
Contributor

mdouze commented Jan 10, 2023

see https://github.com/facebookresearch/faiss/wiki/FAQ#what-does-it-mean-when-a-search-returns--1-ids
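
In short: the -1 entries are padding that appears when fewer than k results are found in the probed lists; with IVF indexes, increasing nprobe makes this less likely. A simple way to drop them on the caller side (a sketch using the variables from the question above):

D, I = index.search(xq, k)
for q in range(xq.shape[0]):
    valid = I[q] != -1                     # -1 marks "fewer than k results found"
    ids_q, dists_q = I[q][valid], D[q][valid]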

@pablocael

@fonspa Actually we solved it by using two alternating indexes. We build the new index and delete the old one asynchronously: after completing everything on the new index, we load it as the main index and then delete the old one. You have to manage the blocking carefully. We only update the blocks affected by new messages, so the whole update takes only 2-3 minutes.

Hi, can you explain in more detail what you did to make it work? We have the same issue now: any operation on an on-disk index corrupts it. I have tried:

  • merge_from
  • merge_into
  • add_with_ids

All of them corrupt the index and cause search to return wrong ids.

@pablocael

pablocael commented Jun 6, 2024

@mdouze Sorry, I have the same issue of invalid ids after trying to add data in any way to the on-disk index.
I have tried:

  • merge_from
  • merge_into
  • add_with_ids

All of them produce invalid ids.

My questions are:

  • Is there any way to add more data to an on-disk index?
  • If not, what is the purpose of the on-disk index? Is it only for creating an index from shards and merging them on disk? Is there no way to add more data to Faiss indices without using fully in-RAM indices?

Thank you in advance

@pablocael

@mdouze I have now opened an issue with simple code to reproduce the problem. The issue is not only with add_with_ids: any merge operation (merge_from, merge_into) will also corrupt about 13% of the ids within the index.
#3498
