
Scanner limit not always correct #72

Closed
daz-li opened this issue Nov 17, 2014 · 4 comments
@daz-li

daz-li commented Nov 17, 2014

There is a bug (and a simple fix as well) in how table.py deals with limit:

# Avoid round-trip when exhausted
if len(items) < how_many: 
    break 

To provide some context: items stores the rows of the current batch returned from the HBase server (via a scan iterator), and how_many is the number of rows requested for the batch. The idea of the statements above is to stop fetching once the number of rows returned in items is less than how_many.

However, note that how_many is only a suggestion to the server, not a requirement: the batch returned by the server may well contain fewer rows than the suggested number (how_many) even though there are still more rows to return. This causes the scan to exit prematurely. The fix is easy:

# if len(items) < how_many:
#     break
if len(items) == 0:
    break
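The difference between the two exit conditions can be sketched with a simulated server that returns variable-sized batches, as reported in this issue. Here `fetch_batch` is a hypothetical stand-in for the Thrift scannerGetList() call; only an empty batch signals exhaustion:

```python
def scan_rows(fetch_batch, how_many):
    """Yield rows from a scanner, tolerating short (non-empty) batches.

    fetch_batch(how_many) simulates the Thrift scanner call: it may
    return fewer than how_many rows even when more rows remain.
    """
    while True:
        items = fetch_batch(how_many)
        if len(items) == 0:  # only an empty batch means exhaustion
            break
        for item in items:
            yield item

# Simulated server: batch sizes vary even though more rows remain.
batches = iter([[1] * 100, [2] * 48, [3] * 100, [4] * 8])

def fake_fetch(how_many):
    return next(batches, [])

rows = list(scan_rows(fake_fetch, 100))
assert len(rows) == 256  # the old `< how_many` break would stop at 148
```

With the original condition, the scan would break after the 48-row batch and silently drop the remaining 108 rows.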
@wbolster
Member

Are you sure the how_many flag (or whatever the name is in the Thrift API) is only a hint? Where did you get that information?

@daz-li
Author

daz-li commented Nov 20, 2014

I used the following very simple code to debug table.py. Basically, I just print the number of items returned by inserting the following into table.py:

   print '===:', len(items)

(Of course, I first applied the aforementioned fix to avoid the premature exit.) Then I used the following to test against my table:

    import happybase
    conn = happybase.Connection(thrift)
    tb = conn.table(tbname)
    scanner = tb.scan(row_prefix='com.google', limit=1000, batch_size=100)
    for (k, v) in scanner:
        pass

The output is:

===: 100
===: 100
===: 100
===: 100
===: 100
===: 100 
===: 48
===: 100
===: 8
===: 100
===: 7
===: 100
===: 10
===: 27

Despite batch_size=100, some batches returned during the scan contain fewer than 100 items (48, 8, 7, 10 in the output above). I have not looked into the HBase code yet, so I cannot confirm whether this is intended or accidental on the HBase side. But I can understand it if the cache size used in a scan is only a suggested number.

BTW, this is on MapR's table implementation of BigTable.
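A consequence of the variable batch sizes is that the client must enforce limit itself by counting rows as they are yielded, rather than inferring anything from batch lengths. A minimal sketch (again using a hypothetical fetch_batch in place of the Thrift scanner call):

```python
def scan_with_limit(fetch_batch, how_many, limit=None):
    """Yield at most `limit` rows, counting on the client side.

    Works regardless of how the server sizes its batches: it stops
    only on an empty batch or once `limit` rows have been yielded.
    """
    n_returned = 0
    while True:
        if limit is not None:
            # never ask for more rows than we are still allowed to yield
            how_many = min(how_many, limit - n_returned)
        items = fetch_batch(how_many)
        if len(items) == 0:
            break
        for item in items:
            yield item
            n_returned += 1
            if limit is not None and n_returned >= limit:
                return

# Simulated server with short batches, as observed in this issue.
batches = iter([[0] * 100, [1] * 7, [2] * 100, [3] * 100])
rows = list(scan_with_limit(lambda n: next(batches, []), 100, limit=250))
assert len(rows) == 250
```

Counting yielded rows (instead of trusting batch sizes) is what makes the limit exact even when the server returns 7 rows for a 100-row request.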

@wbolster
Member

Okay, it seems HBase behaves differently than its docs led me to expect.

@wbolster wbolster added the bug label Nov 20, 2014
@wbolster wbolster self-assigned this Nov 20, 2014
@wbolster wbolster added this to the 0.9 milestone Nov 20, 2014
@wbolster wbolster changed the title A bug in dealing with limit. Scanner limit not always correct Nov 24, 2014
@wbolster
Member

FYI, I've released HappyBase 0.9.
