Too Many Requests /_search/scroll when using helpers.scan #2426

Closed

hben-align opened this issue Jan 31, 2024 · 1 comment

hben-align commented Jan 31, 2024

Elasticsearch version: 7.9.1 (AWS)
elasticsearch-py version: same (7.9.1)

When running a scan on an index with 80M documents, I get an error after about 500K consecutive documents:
elasticsearch.exceptions.TransportError: TransportError(429, '429 Too Many Requests /_search/scroll')
The scan rate I get is about 10K documents per second.

The domain has 3 nodes of type r6g.large.search.

  1. This is my code below; what am I doing wrong? Or how can I handle this error without having to restart the process every time?
  2. If I wrap the call in try/except with a sleep(30) on exception, what happens to the iteration that raised the error? Will I lose the data from that iteration, or will the same data be requested again in the next iteration?

```python
import os

import pandas as pd
from elasticsearch import helpers
from tqdm import tqdm

# es_client, index_name, chunksize and target_folder are defined earlier.

QUERY = {
    "size": 1000,
    "query": {
        "match_all": {}
    }
}

scanner = helpers.scan(es_client, index=index_name, query=QUERY, scroll='1m')
number_of_documents = int(es_client.cat.count(index=index_name, params={"format": "json"})[0]['count'])
final_result = list()   # holds the data exported to a csv file; cleared after every chunk
first_number = 0
count = 0
exported = False        # used for exporting the last chunk if it does not reach chunksize
with tqdm(total=number_of_documents) as pbar:
    for each_result in scanner:
        exported = False
        final_result.append(each_result)
        count += 1
        if count % 1000 == 0:
            pbar.update(1000)  # update the progress bar
        if count % chunksize == 0:
            handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
            filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
            handler.to_csv(filename, encoding='utf8', index=False)
            first_number = count + 1  # set the first number for the next chunk
            final_result = list()
            exported = True
    if not exported:  # export the last, partial chunk
        pbar.update(count - first_number)
        handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
        filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
        handler.to_csv(filename, encoding='utf8', index=False)
```

pquentin (Member) commented Feb 2, 2024

Hello! When Elasticsearch returns a 429 error, it means that the client should retry later, because the cluster is overwhelmed. Unfortunately, the scroll API does not support individual retries. The point-in-time API introduced in Elasticsearch 7.10 addresses this limitation and others, and should be used instead.
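
Roughly, point-in-time pagination with search_after can be retried page by page, because nothing advances on the server when a request fails. Here is a minimal sketch, written against the 7.x Python client (7.10 or later; the 8.x client renames some of these calls and exceptions). The connection, index name, loop body and 30-second backoff are placeholders, not your actual setup:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es_client = Elasticsearch("http://localhost:9200")  # placeholder connection
index_name = "my-index"                             # placeholder index name

# Open a point in time so pagination sees a consistent snapshot of the index.
pit_id = es_client.open_point_in_time(index=index_name, keep_alive="1m")["id"]
search_after = None

try:
    while True:
        try:
            resp = es_client.search(
                body={
                    "size": 1000,
                    "query": {"match_all": {}},
                    "pit": {"id": pit_id, "keep_alive": "1m"},
                    # _shard_doc is the built-in tiebreaker sort for PIT searches.
                    "sort": [{"_shard_doc": "asc"}],
                    **({"search_after": search_after} if search_after else {}),
                }
            )
        except TransportError as exc:
            if exc.status_code == 429:
                # Unlike a scroll, this page can simply be requested again:
                # search_after has not advanced, so no documents are skipped or lost.
                time.sleep(30)
                continue
            raise

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            ...  # process hit["_source"] / hit["_id"] here

        # The PIT id can change between responses; always carry the latest one forward.
        pit_id = resp.get("pit_id", pit_id)
        search_after = hits[-1]["sort"]
finally:
    es_client.close_point_in_time(body={"id": pit_id})
```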

In other words, your options include:

  • Upgrade to Elasticsearch 7.10 (or better, 7.17 or 8.12) and use the point-in-time API, retrying as needed, along the lines of the sketch above (there is no helper for it currently, though)
  • Figure out why your cluster is overwhelmed. Maybe something else taxing is running at the same time?
  • Try splitting your scan into multiple smaller parts, if that's possible given your data; a sliced-scroll sketch follows below this list
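
For the last option, sliced scroll lets you run several independent, smaller scans instead of one big one, so a failed slice can be restarted on its own rather than redoing the whole export. A minimal sketch; the connection, index name and the slice count of 4 are placeholders:

```python
from elasticsearch import Elasticsearch, helpers

es_client = Elasticsearch("http://localhost:9200")  # placeholder connection
index_name = "my-index"                             # placeholder index name
MAX_SLICES = 4  # placeholder: often chosen close to the number of primary shards

for slice_id in range(MAX_SLICES):
    sliced_query = {
        # Sliced scroll: this scan only returns the documents belonging to this slice.
        "slice": {"id": slice_id, "max": MAX_SLICES},
        "query": {"match_all": {}},
    }
    for hit in helpers.scan(
        es_client, index=index_name, query=sliced_query, scroll="5m", size=1000
    ):
        ...  # process hit["_source"] / hit["_id"] here
```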

Sorry about that. Closing for now, but I'll be happy to reopen if you have other questions.
