Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thread local context (#359) causes test issue on Linux: 1: [4] fid:0 channel 2, to submit:64, submitted:Operation not permitted #375

Open
pabs3 opened this issue Apr 2, 2023 · 21 comments

Comments

@pabs3
Copy link

pabs3 commented Apr 2, 2023

On Linux, #359 by @PhilipBAdams causes the SSDServingTest TestSearchSSDIndexFloatL2BKTDEFAULTTXT test to continously print this message over and over again:

 1: [4] fid:0 channel 2, to submit:64, submitted:Operation not permitted

This message comes from the BatchReadFileAsync function in AnnService/src/Helper/AsyncFileReader.cpp and is printed when the io_submit syscall fails:

int s = syscall(__NR_io_submit, handler->GetIOCP(channel), iocbs[i].size() - submitted[i], iocbs[i].data() + submitted[i]);
if (s > 0) {
    submitted[i] += s;
    totalSubmitted += s;
}
else {
    SPTAGLIB_LOG(Helper::LogLevel::LL_Error, "fid:%d channel %d, to submit:%d, submitted:%s\n", i, channel, iocbs[i].size() - submitted[i], strerror(-s));
}

According to the io_submit manual page, this indicates incorrectly setting the RT IO priority class:

EPERM  The aio_reqprio field is set with the class IOPRIO_CLASS_RT, but the submitting context does not have the CAP_SYS_ADMIN capability.

However, the code does not set the IO priority class anywhere.

My conclusion is that maybe the iocb structure being passed to io_submit gets corrupted somehow, but I am not sure.

Thoughts and ideas for further debugging of this are welcome.

I also tried commenting out the contents of the Helper::SetThreadAffinity function but that didn't work.

@pabs3
Copy link
Author

pabs3 commented Apr 2, 2023

Enabling use of libnuma doesn't fix this issue.

@pabs3
Copy link
Author

pabs3 commented Apr 2, 2023

Reverting just the changes to AnnService/inc/SSDServing/SSDIndex.h in commit 5ffeabe didn't work either.

@PhilipBAdams
Copy link
Contributor

Can you log the values of the iocbs at the time you get the error message? The value of aio_reqprio should be 0 from the memset, which maps to IOPRIO_CLASS_NONE.

What's the compiler/platform? In our linux unit test run (Ubuntu 20.04, GCC 9.4.0) we don't encounter this log.

@pabs3
Copy link
Author

pabs3 commented Apr 8, 2023 via email

@pabs3
Copy link
Author

pabs3 commented Apr 8, 2023 via email

@pabs3
Copy link
Author

pabs3 commented May 3, 2023

After modifying the code to log all io_submit requests, this is what the output looks like:

./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
[Thread 0x7ffff5f9a6c0 (LWP 1940611) exited]
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile = 
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value 
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value 
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value 
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value 
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value 
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[New Thread 0x7ffff5f9a6c0 (LWP 1940614)]
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 138, total element count 2232.
[1] Total Element Count: 2666
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 63
[1] Page Count Dist: 2 138
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[New Thread 0x7ffff2f976c0 (LWP 1940615)]
[1] Sent 0.00%...
[New Thread 0x7ffff0f906c0 (LWP 1940616)]
[4] io_submit ctx_id=140737353871360 nr=64 data=140737091778224 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140737090953216 nbytes=8192 offset=368640 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted

@pabs3
Copy link
Author

pabs3 commented May 10, 2023

I noticed that when I enable only the AlgoTest/SPANNTest and SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT tests then the SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT test fails:

BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest:SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT ../SPTAGTest

I also noticed that enabling all tests except the AlgoTest/SPANNTest test means all the tests pass, including SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT:

BOOST_TEST_RUN_FILTERS='!AlgoTest/SPANNTest' ../SPTAGTest

So I think that AlgoTest/SPANNTest is somehow interfering with SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT.

The successful SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT test output looks like this:

./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile = 
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value 
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value 
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value 
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value 
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value 
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 134, total element count 2223.
[1] Total Element Count: 2676
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 67
[1] Page Count Dist: 2 134
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0 
[1] Finish sending in 0.104 seconds, actuallQPS is 96.15, query count 10.
[1] 
Finish warmup...
[1] Start loading QuerySet...
[1] Load Vector(10,100)
[1] Start ANN Search...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0 
[1] Finish sending in 0.014 seconds, actuallQPS is 714.29, query count 10.
[1] 
Finish ANN Search...
[1] Load Vector(1000,100)
[1] 
Load VectorSet(1000,100).
[1] Start loading TruthFile...
[1] Recall32@32: 0.984375 MRR@32: 1.000000
[1] 
Ex Elements Count:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 974.900	965.000	1028.000	1028.000	1028.000	1028.000	1028.000
[1] 
Head Latency Distribution:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 0.420	0.203	1.259	1.259	1.259	1.259	1.259
[1] 
Ex Latency Distribution:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 2.474	2.388	5.651	5.651	5.651	5.651	5.651
[1] 
Total Latency Distribution:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 2.894	2.715	5.835	5.835	5.835	5.835	5.835
[1] 
Total Disk Page Access Distribution:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 113.700	 113	 119	 119	 119	 119	 119
[1] 
Total Disk IO Distribution:
[1] Avg	50tiles	90tiles	95tiles	99tiles	99.9tiles	Max
[1] 63.900	  64	  64	  64	  64	  64	  64
[1] 
[1] Start output to sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Recall@32: 0.984375 MRR@32: 1.000000
[1] 
Test case SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT did not check any assertions
./Test/src/SSDServingTest.cpp(621): Leaving test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"; testing time: 183969us

@pabs3
Copy link
Author

pabs3 commented May 21, 2023

I noticed that there are a lot of tests that fail, not just one.

All of the failing tests use the TestSearchSSDIndex function, which calls the SPTAG::SSDServing::BootProgram function.

This is the list of failing tests when AlgoTest/SPANNTest is enabled:

SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatL2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineKDTXVECXVEC
SSDServingTest/RUN_FROM_MAP

@pabs3
Copy link
Author

pabs3 commented Jul 5, 2023

I note that when I rebuild SPTAG and its tests with the -fsanitize=thread option, then all of these tests have ThreadSanitizer data races reported:

$ grep ThreadSanitizer *.log  | sed -E 's/^([^:]*):(.*)/\2 \1/' | sort | uniq -c | sort -n | grep pid=
      1 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=781640) PerfTest.cpp.log
      2 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=780260) FilterTest.cpp.log
      3 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=779821) ConcurrentTest.cpp.log
     12 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=775481) AlgoTest.cpp.log
    122 WARNING: ThreadSanitizer: data race (pid=780260) FilterTest.cpp.log
    155 WARNING: ThreadSanitizer: data race (pid=779821) ConcurrentTest.cpp.log
    303 WARNING: ThreadSanitizer: data race (pid=971350) SSDServingTest.cpp.log
    324 WARNING: ThreadSanitizer: data race (pid=781640) PerfTest.cpp.log
    470 WARNING: ThreadSanitizer: data race (pid=782661) ReconstructIndexSimilarityTest.cpp.log
    872 WARNING: ThreadSanitizer: data race (pid=775481) AlgoTest.cpp.log

@pabs3
Copy link
Author

pabs3 commented Jul 5, 2023

According to ThreadSanitizer these are the most common locations for data races in SPTAG:

$ grep ThreadSanitizer *.log  | sed -E 's/^([^:]*):(.*)/\2/' | sed 's/<[^>]*>/<>/' | sort | uniq -c | sort -n | grep -v /usr/ | grep -v /libsanitizer/ | grep -v pid= | tail
     28 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:467 in void SPTAG::COMMON::NeighborhoodGraph::RefineGraph<>(SPTAG::VectorIndex*, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
     36 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:85 in SPTAG::COMMON::NeighborhoodGraph::GraphAccuracyEstimation(SPTAG::VectorIndex*, int, std::unordered_map<>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
     46 SUMMARY: ThreadSanitizer: data race AnnService/inc/Helper/ThreadPool.h:29 in SPTAG::Helper::ThreadPool::Abort::SetAbort(bool)
     54 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:247 in float SPTAG::COMMON::KmeansAssign<>(SPTAG::COMMON::Dataset<float> const&, std::vector<int, std::allocator<int> >&, int, int, SPTAG::COMMON::KmeansArgs<float>&, bool, float) [clone ._omp_fn.0]
     55 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/Dataset.h:75 in SPTAG::COMMON::Dataset<>::R() const
     68 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:92 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
     69 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:93 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
     74 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/KDTree.h:335 in void SPTAG::COMMON::KDTree::ChooseDivision<>(SPTAG::COMMON::Dataset<signed char> const&, SPTAG::COMMON::KDTNode&, std::vector<int, std::allocator<int> > const&, int, int)
     75 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:102 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)
     87 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:103 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)

@ZhenNan2016
Copy link

@pabs3
I'm sorry to bother you.
I would like to ask a few questions as follows:
#403
But I don't know what platform to contact you through.
Looking forward to your reply.
Thanks very much.

@pabs3
Copy link
Author

pabs3 commented Nov 19, 2023 via email

@ZhenNan2016
Copy link

Sorry, I am not involved in the project, I am just packaging it for my employer. So I have no idea about the answers to your questions. You will need to contact the Microsoft employees who wrote it. Some of their email addresses are available in the git log. Clone the project locally and run the git log command and you will see the addresses.

okay,thanks.

@Warmchay
Copy link

Hi, I have encountered the same question, have you found the root cause of this problem?

@pabs3
Copy link
Author

pabs3 commented Jul 12, 2024

I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.

This workaround enables all the tests to be run successfully, first run the tests excluding SPANNTest by using this environment variable BOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest then run the tests again but include only SPANNTest with this environment variable BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest.

@Warmchay
Copy link

I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.

This workaround enables all the tests to be run successfully, first run the tests excluding SPANNTest by using this environment variable BOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest then run the tests again but include only SPANNTest with this environment variable BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest.

Really thanks! It works, but it's weird, I would like to follow your testing way to check it again, I suspect that maybe the executed order influence the indices in SSD, like threads competition or other causes. Anyway, thanks your help again!

@pabs3
Copy link
Author

pabs3 commented Jul 12, 2024

If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:

Fixes: https://github.com/microsoft/SPTAG/issues/375

@Warmchay
Copy link

Okay, I hope I could~

If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:

Fixes: https://github.com/microsoft/SPTAG/issues/375

Okay, I hope I could ~

@suppersam1
Copy link

It may be related to the number of CPU cores. My CPU has 8 cores, and when I use 7 threads to access the search interface, there are no issues. However, when I use more than 8 threads, errors start to occur, specifically from channel 8 onwards. For example, if I use 15 threads, channels 8 to 15 will report errors.

@suppersam1
Copy link

I have reviewed a lot of information, but it seems that there is no apparent relationship between the io_submit call and the number of CPU cores. Therefore, I am still at a loss regarding this bug.

@suppersam1
Copy link

I tested it again, and it is not related to the number of CPU cores; it is related to the parameter NumberOfThreads. When I set this parameter to 45, there were no issues with the query. Previously, I had this parameter set to 8.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants