When running a scan on an index with about 80M documents, I get an error after roughly 500K consecutive documents:
elasticsearch.exceptions.TransportError: TransportError(429, '429 Too Many Requests /_search/scroll')
The rate I get is about 10K documents per second.
The domain has 3 nodes of type r6g.large.search.
This is my code; what am I doing wrong? Or how can I handle this error without having to restart the process every time?
If I wrap the iteration in try/except with a sleep(30) on the exception, what happens to the iteration that raised the error? Will I lose the data from that iteration, or will the same data be requested again on the next iteration?
```python
# es_client, index_name, QUERY, chunksize and target_folder are defined earlier in the script.
import os

import pandas as pd
from elasticsearch import helpers
from tqdm import tqdm

scanner = helpers.scan(es_client, index=index_name, query=QUERY, scroll='1m')
number_of_documents = int(es_client.cat.count(index_name, params={"format": "json"})[0]['count'])

final_result = list()  # holds the data that will be exported to a CSV file; cleared after every chunk
first_number = 0
count = 0
exported = False  # used to export the last chunk if it does not reach chunksize

with tqdm(total=int(number_of_documents)) as pbar:
    for each_result in scanner:
        exported = False
        final_result.append(each_result)
        count += 1
        if count % 1000 == 0:
            pbar.update(1000)  # update the progress bar
        if count % chunksize == 0:
            handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
            filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
            handler.to_csv(filename, encoding='utf8', index=False)
            first_number = count + 1  # set the first number for the next chunk
            final_result = list()
            exported = True
    if not exported:
        pbar.update(count - first_number)
        handler = pd.DataFrame([i['_source'] | {'_id': i['_id']} for i in final_result])
        filename = os.path.join(target_folder, f"csv_{index_name}_{first_number}_to_{count}.csv")
        handler.to_csv(filename, encoding='utf8', index=False)
```
Hello! When Elasticsearch returns a 429 error, it means that the client should retry later, because the cluster is overwhelmed. Unfortunately, the scroll API does not support individual retries. The point-in-time API introduced in Elasticsearch 7.10 addresses this limitation and others, and should be used instead.
In other words, your options include:
- Upgrade to Elasticsearch 7.10 (or better, 7.17 or 8.12) and use the point-in-time API, retrying as needed; there is no helper for it currently, but see the sketch after this list
- Figure out why your cluster is overwhelmed. Maybe something else taxing is running at the same time?
- Try splitting your scan into multiple smaller parts, if that's possible given your data
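To illustrate the first option, here is a minimal sketch of a point-in-time search with `search_after` and a retry on 429. It assumes Elasticsearch 7.12 or later (for the `_shard_doc` sort), uses the raw client calls since there is no helper, and treats `index_name`, the page size, and the sleep interval as placeholders to adapt:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

es = Elasticsearch()  # configure hosts/auth for your domain

# Open a point in time so that paging stays consistent and any page can be retried.
pit_id = es.open_point_in_time(index=index_name, keep_alive="1m")["id"]
search_after = None

try:
    while True:
        body = {
            "size": 1000,
            "query": {"match_all": {}},
            "pit": {"id": pit_id, "keep_alive": "1m"},
            "sort": [{"_shard_doc": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        try:
            resp = es.search(body=body)
        except TransportError as e:
            if e.status_code == 429:
                time.sleep(30)  # back off, then retry the same page
                continue
            raise
        hits = resp["hits"]["hits"]
        if not hits:
            break
        # ... process hits, e.g. append them to the current CSV chunk ...
        search_after = hits[-1]["sort"]
        pit_id = resp.get("pit_id", pit_id)  # the PIT id can change between pages
finally:
    es.close_point_in_time(body={"id": pit_id})
```

Because the cursor (`search_after`) only advances after a page succeeds, a 429 just means re-sending the same page; nothing that was already fetched is lost.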
Sorry about that. Closing for now, but I'll be happy to reopen if you have other questions.
version: 7.9.1 (AWS)
Same (7.9.1)
```python
QUERY = {
    "size": 1000,
    "query": {
        "match_all": {}
    }
}
```