Too Many Requests /_search/scroll when using helpers.scan #2426

Closed

hben-align opened this issue Jan 31, 2024 · 1 comment

hben-align commented Jan 31, 2024

Elasticsearch version: 7.9.1 (AWS)
elasticsearch-py version: same (7.9.1)

When running a scan on an index with 80M documents, I get an error after about 500K consecutive documents:
elasticsearch.exceptions.TransportError: TransportError(429, '429 Too Many Requests /_search/scroll')
The scan rate I get is about 10K documents per second.

The domain has 3 nodes of type r6g.large.search.

  1. This is my code below; what am I doing wrong? Or how can I handle this error without having to restart the process every time?
  2. If I wrap the call in try/except with a sleep(30) on exception, what happens to the iteration that raised the error? Will I lose the data from that iteration, or will the same data be requested again in the next iteration?

```python
import os

import pandas as pd
from elasticsearch import helpers
from tqdm import tqdm

# es_client, index_name, chunksize and target_folder are defined earlier.

QUERY = {
    "size": 1000,
    "query": {
        "match_all": {}
    }
}

scanner = helpers.scan(es_client, index=index_name, query=QUERY, scroll='1m')
number_of_documents = int(es_client.cat.count(index=index_name, params={"format": "json"})[0]['count'])
final_result = list()   # holds the data exported to a csv file; cleared after every chunk
first_number = 0
count = 0
exported = False        # used for exporting the last chunk if it does not reach chunksize
with tqdm(total=number_of_documents) as pbar:
    for each_result in scanner:
        exported = False
        final_result.append(each_result)
        count += 1
        if count % 1000 == 0:
            pbar.update(1000)  # update the progress bar
        if count % chunksize == 0:
            handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
            filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
            handler.to_csv(filename, encoding='utf8', index=False)
            first_number = count + 1  # set the first number for the next chunk
            final_result = list()
            exported = True
    if not exported:  # export the last, partial chunk
        pbar.update(count - first_number)
        handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
        filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
        handler.to_csv(filename, encoding='utf8', index=False)
```

pquentin (Member) commented Feb 2, 2024

Hello! When Elasticsearch returns a 429 error, it means that the client should retry later, because the cluster is overwhelmed. Unfortunately, the scroll API does not support individual retries. The point-in-time API introduced in Elasticsearch 7.10 addresses this limitation and others, and should be used instead.
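
Roughly, point-in-time pagination with search_after can be retried page by page, because nothing advances on the server when a request fails. Here is a minimal sketch, written against the 7.x Python client (7.10 or later; the 8.x client renames some of these calls and exceptions). The connection, index name, loop body and 30-second backoff are placeholders, not your actual setup:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es_client = Elasticsearch("http://localhost:9200")  # placeholder connection
index_name = "my-index"                             # placeholder index name

# Open a point in time so pagination sees a consistent snapshot of the index.
pit_id = es_client.open_point_in_time(index=index_name, keep_alive="1m")["id"]
search_after = None

try:
    while True:
        try:
            resp = es_client.search(
                body={
                    "size": 1000,
                    "query": {"match_all": {}},
                    "pit": {"id": pit_id, "keep_alive": "1m"},
                    # _shard_doc is the built-in tiebreaker sort for PIT searches.
                    "sort": [{"_shard_doc": "asc"}],
                    **({"search_after": search_after} if search_after else {}),
                }
            )
        except TransportError as exc:
            if exc.status_code == 429:
                # Unlike a scroll, this page can simply be requested again:
                # search_after has not advanced, so no documents are skipped or lost.
                time.sleep(30)
                continue
            raise

        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            ...  # process hit["_source"] / hit["_id"] here

        # The PIT id can change between responses; always carry the latest one forward.
        pit_id = resp.get("pit_id", pit_id)
        search_after = hits[-1]["sort"]
finally:
    es_client.close_point_in_time(body={"id": pit_id})
```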

In other words, your options include:

  • Upgrade to Elasticsearch 7.10 (or better, 7.17 or 8.12) and use the point-in-time API, retrying as needed, along the lines of the sketch above (there is no helper for it currently, though)
  • Figure out why your cluster is overwhelmed. Maybe something else taxing is running at the same time?
  • Try splitting your scan into multiple smaller parts, if that's possible given your data; a sliced-scroll sketch follows below this list
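
For the last option, sliced scroll lets you run several independent, smaller scans instead of one big one, so a failed slice can be restarted on its own rather than redoing the whole export. A minimal sketch; the connection, index name and the slice count of 4 are placeholders:

```python
from elasticsearch import Elasticsearch, helpers

es_client = Elasticsearch("http://localhost:9200")  # placeholder connection
index_name = "my-index"                             # placeholder index name
MAX_SLICES = 4  # placeholder: often chosen close to the number of primary shards

for slice_id in range(MAX_SLICES):
    sliced_query = {
        # Sliced scroll: this scan only returns the documents belonging to this slice.
        "slice": {"id": slice_id, "max": MAX_SLICES},
        "query": {"match_all": {}},
    }
    for hit in helpers.scan(
        es_client, index=index_name, query=sliced_query, scroll="5m", size=1000
    ):
        ...  # process hit["_source"] / hit["_id"] here
```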

Sorry about that. Closing for now, but I'll be happy to reopen if you have other questions.
