
Scanner limit not always correct #72

Closed
daz-li opened this issue Nov 17, 2014 · 4 comments
@daz-li

daz-li commented Nov 17, 2014

There is a bug (and a simple fix as well) in how table.py deals with limit:

# Avoid round-trip when exhausted
if len(items) < how_many: 
    break 

To provide some context: items stores the rows of the current batch returned from the HBase server (via a scan iterator), and how_many is the number of rows requested for the batch. The idea of the statements above is to stop fetching once the number of rows returned in items is less than how_many.

However, note that how_many is only a suggestion to the server, not a requirement: the batch returned by the server may well contain fewer rows than the suggested number (how_many) even though there are still more rows to return. This causes the scan to exit prematurely. The fix is easy:

# if len(items) < how_many:
#     break
if len(items) == 0:
    break
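The difference between the two exit conditions can be sketched with a simulated server that returns variable-sized batches, as reported in this issue. Here `fetch_batch` is a hypothetical stand-in for the Thrift scannerGetList() call; only an empty batch signals exhaustion:

```python
def scan_rows(fetch_batch, how_many):
    """Yield rows from a scanner, tolerating short (non-empty) batches.

    fetch_batch(how_many) simulates the Thrift scanner call: it may
    return fewer than how_many rows even when more rows remain.
    """
    while True:
        items = fetch_batch(how_many)
        if len(items) == 0:  # only an empty batch means exhaustion
            break
        for item in items:
            yield item

# Simulated server: batch sizes vary even though more rows remain.
batches = iter([[1] * 100, [2] * 48, [3] * 100, [4] * 8])

def fake_fetch(how_many):
    return next(batches, [])

rows = list(scan_rows(fake_fetch, 100))
assert len(rows) == 256  # the old `< how_many` break would stop at 148
```

With the original condition, the scan would break after the 48-row batch and silently drop the remaining 108 rows.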
@wbolster
Member

Are you sure the how_many flag (or whatever the name is in the Thrift API) is only a hint? Where did you get that information?

@daz-li
Author

daz-li commented Nov 20, 2014

I used the following very simple code to debug table.py. Basically, I just print the number of items returned by inserting the following into table.py:

   print '===:', len(items)

(Of course, I first applied the aforementioned fix to avoid the premature exit.) Then I used the following to test against my table:

    import happybase
    conn = happybase.Connection(thrift)
    tb = conn.table(tbname)
    scanner = tb.scan(row_prefix='com.google', limit=1000, batch_size=100)
    for (k, v) in scanner:
        pass

The output is:

===: 100
===: 100
===: 100
===: 100
===: 100
===: 100 
===: 48
===: 100
===: 8
===: 100
===: 7
===: 100
===: 10
===: 27

Despite batch_size=100, some batches returned during the scan contain fewer than 100 items (48, 8, 7, 10 in the output above). I have not looked into the HBase code yet, so I cannot confirm whether this is intended or accidental on the HBase side. But I can understand it if the cache size used in a scan is only a suggested number.

BTW, this is on MapR's table implementation of BigTable.
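A consequence of the variable batch sizes is that the client must enforce limit itself by counting rows as they are yielded, rather than inferring anything from batch lengths. A minimal sketch (again using a hypothetical fetch_batch in place of the Thrift scanner call):

```python
def scan_with_limit(fetch_batch, how_many, limit=None):
    """Yield at most `limit` rows, counting on the client side.

    Works regardless of how the server sizes its batches: it stops
    only on an empty batch or once `limit` rows have been yielded.
    """
    n_returned = 0
    while True:
        if limit is not None:
            # never ask for more rows than we are still allowed to yield
            how_many = min(how_many, limit - n_returned)
        items = fetch_batch(how_many)
        if len(items) == 0:
            break
        for item in items:
            yield item
            n_returned += 1
            if limit is not None and n_returned >= limit:
                return

# Simulated server with short batches, as observed in this issue.
batches = iter([[0] * 100, [1] * 7, [2] * 100, [3] * 100])
rows = list(scan_with_limit(lambda n: next(batches, []), 100, limit=250))
assert len(rows) == 250
```

Counting yielded rows (instead of trusting batch sizes) is what makes the limit exact even when the server returns 7 rows for a 100-row request.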

@wbolster
Member

Okay, it seems HBase behaves differently than its docs led me to expect.

@wbolster wbolster added the bug label Nov 20, 2014
@wbolster wbolster self-assigned this Nov 20, 2014
@wbolster wbolster added this to the 0.9 milestone Nov 20, 2014
@wbolster wbolster changed the title A bug in dealing with limit. Scanner limit not always correct Nov 24, 2014
@wbolster
Member

FYI, I've released HappyBase 0.9.
