Increase lock duration and paginate in getBsdsIdentifiers #3246

Merged: 3 commits, Apr 11, 2024

@benoitguigal commented Apr 10, 2024

According to https://github.com/OptimalBits/bull?tab=readme-ov-file#important-notes:

The queue aims for an "at least once" working strategy. This means that in some situations, a job could be processed more than once. This mostly happens when a worker fails to keep a lock for a given job during the total duration of the processing.

When a worker is processing a job it will keep the job "locked" so other workers can't process it.

It's important to understand how locking works to prevent your jobs from losing their lock - becoming stalled - and being restarted as a result. Locking is implemented internally by creating a lock for lockDuration on interval lockRenewTime (which is usually half lockDuration). If lockDuration elapses before the lock can be renewed, the job will be considered stalled and is automatically restarted; it will be double processed. This can happen when:

1 - The Node process running your job processor unexpectedly terminates.
2 - Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see OptimalBits/bull#488 for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the lockDuration setting (with the tradeoff being that it will take longer to recognize a real stalled job).
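As an illustration of the "smaller parts" advice in point 2, here is a minimal sketch that processes a list in batches and yields to the event loop between batches, so that timers such as a lock renewal get a chance to run. The helper name `processInChunks` and its parameters are hypothetical; this is not code from Bull or from this PR.

```typescript
// Hypothetical helper: handle items in small batches and yield to the
// event loop after each batch so pending timers and I/O callbacks can run.
async function processInChunks<T>(
  items: readonly T[],
  chunkSize: number,
  handle: (item: T) => void
): Promise<number> {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  let yields = 0;
  for (let start = 0; start < items.length; start += chunkSize) {
    // Synchronous work is bounded by chunkSize items.
    for (const item of items.slice(start, start + chunkSize)) {
      handle(item);
    }
    // Yield: the loop resumes on a later event-loop turn.
    await new Promise<void>((resolve) => setImmediate(resolve));
    yields += 1;
  }
  return yields; // number of times control was handed back to the event loop
}
```

The batch size is a tradeoff: smaller batches keep each blocking stretch short but add scheduling overhead.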

As such, you should always listen for the stalled event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.
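A sketch of such a listener follows. Bull queues emit a `stalled` event with the job as argument; to keep the example runnable without Redis, it is written against a minimal local interface and exercised with a plain Node `EventEmitter` as a stand-in. `logStalledJobs`, `StalledSource` and the `report` callback are illustrative names, not part of Bull.

```typescript
import { EventEmitter } from "node:events";

// Minimal local shapes (assumptions for this sketch, not Bull's own types).
interface StalledJob {
  id: string | number;
}
interface StalledSource {
  on(event: "stalled", listener: (job: StalledJob) => void): unknown;
}

// Forward every stalled job to a reporting function (logger, error monitor...).
function logStalledJobs(
  queue: StalledSource,
  report: (message: string) => void
): void {
  queue.on("stalled", (job) => {
    report(`Job ${job.id} stalled and may be processed more than once`);
  });
}

// Stand-in for a real queue so the sketch runs on its own.
const fakeQueue = new EventEmitter();
const messages: string[] = [];
logStalledJobs(fakeQueue, (m) => messages.push(m));
fakeQueue.emit("stalled", { id: 42 });
```

With a real queue, `report` would typically point at the project's logger or error monitoring client.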

As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor always crashes its Node process), jobs will be recovered from a stalled state a maximum of maxStalledCount times (default: 1).

In last night's indexing run, we can see that the event loop was blocked several times while the identifiers were being fetched.

[Screenshot: 2024-04-10 at 15:33:04]

This blocking almost certainly prevented the lock from being renewed, so the master job was marked as stalled. The default behavior in that case is to retry the job (maxStalledCount=1), which is why the global reindexing ran twice.
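For reference, the settings at play are passed to Bull under the `settings` key of the queue options. The sketch below contrasts the defaults mentioned in this description (30 s lock, one restart) with the values this PR moves to (2 minutes, no restart). `LockSettings` and `withRenewTime` are local, hypothetical names, and deriving `lockRenewTime` as half of `lockDuration` is an assumption based on the "usually half" wording quoted above.

```typescript
// Local mirror of the setting names discussed here; not Bull's own type.
interface LockSettings {
  lockDuration: number; // ms before an unrenewed lock expires
  lockRenewTime: number; // ms between lock renewals
  maxStalledCount: number; // how many times a stalled job is restarted
}

// Assumption: when lockRenewTime is not given, use half of lockDuration.
function withRenewTime(s: {
  lockDuration: number;
  maxStalledCount: number;
  lockRenewTime?: number;
}): LockSettings {
  return { ...s, lockRenewTime: s.lockRenewTime ?? s.lockDuration / 2 };
}

// Defaults as described in this PR description.
const defaultSettings = withRenewTime({ lockDuration: 30_000, maxStalledCount: 1 });

// Values described by this PR: 2 minute lock, stalled jobs are not restarted.
const proposedSettings = withRenewTime({ lockDuration: 120_000, maxStalledCount: 0 });

// Shape in which such values would be handed to a Bull queue.
const queueOptions = { settings: proposedSettings };
```

The longer lock tolerates longer event loop stalls, at the cost noted in the README excerpt: a job that has really died takes longer to be recognized as stalled.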

This PR includes several fixes to prevent this from happening again:

  • Slightly increase lockDuration (from 30 s to 2 minutes).
  • Add cursor-based pagination in getBsdIdentifiers.
  • Set maxStalledCount=0 so that the job is not run twice if it still ends up stalled for one reason or another.
  • Log an event when a job is stalled.
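The cursor-based pagination item can be sketched as follows. This is a generic illustration under stated assumptions, not the actual getBsdIdentifiers code: `FetchPage`, `collectIds` and the in-memory `table` are hypothetical stand-ins for a database query ordered by id.

```typescript
// Hypothetical page fetcher: returns up to `take` ids strictly after `cursor`.
type FetchPage = (cursor: string | null, take: number) => Promise<string[]>;

// Collect all ids page by page; each await bounds the work done in one go.
async function collectIds(fetchPage: FetchPage, take: number): Promise<string[]> {
  if (take <= 0) throw new RangeError("take must be positive");
  const ids: string[] = [];
  let cursor: string | null = null;
  while (true) {
    const page: string[] = await fetchPage(cursor, take);
    ids.push(...page);
    const last = page[page.length - 1];
    // A short (or empty) page means there is nothing left to fetch.
    if (page.length < take || last === undefined) break;
    cursor = last; // the next page starts after the last id seen
  }
  return ids;
}

// In-memory stand-in for a table ordered by id.
const table = ["a", "b", "c", "d", "e"];
const fetchFromTable: FetchPage = async (cursor, take) => {
  const start = cursor === null ? 0 : table.indexOf(cursor) + 1;
  return table.slice(start, start + take);
};
```

Calling `collectIds(fetchFromTable, 2)` walks the table in pages of two. Against a real database, the page size would be chosen so that a single query result stays small enough not to hold the event loop for long.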

  • Update the documentation
  • Update the change log
  • Document the steps to perform when deploying to production (on the Favro release ticket)
  • Make sure the numbering of new migrations is consistent
  • Inform the data engineer of any DB schema change

@benoitguigal changed the title from "increase lockDuration" to "Increase lock duration" on Apr 10, 2024
@benoitguigal changed the title from "Increase lock duration" to "Increase lock duration and paginate in getBsdsIdentifiers" on Apr 10, 2024

Quality Gate passed

Issues: 0 new issues, 0 accepted issues
Measures: 0 security hotspots, no data about coverage, 0.0% duplication on new code

See analysis details on SonarCloud

@benoitguigal merged commit d57a9ff into dev on Apr 11, 2024 (17 checks passed)
@benoitguigal deleted the bull-lock-duration branch on April 11, 2024 07:16