Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[54398] Increase delay between 686c and 674 form submissions. #11998

Merged

Conversation

data-doge
Copy link
Contributor

@data-doge data-doge commented Mar 7, 2023

Summary

Increase delay between 686c and 674 form submissions

Long ago, in #7811, a two-minute delay was set between BGS::SubmitForm686cJob and BGS::SubmitForm674Job, in an attempt to avoid a race condition that produced null dependency claims. The race condition is still occuring, so as a short-term fix, this PR increases the delay to five minutes. Here's some more detail on that issue, and the short-term fix implemented here:

  • When a vet submits their form on VA.gov, we sometimes enqueue a job to submit a 686c form to BGS and another job to submit a 674 form to BGS.
  • Both of these jobs, among other things, first fetch an incremental from BGS via a findBenefitClaimTypeIncrement endpoint, and then submit this incremental along with the vet's form data to BGS via a insertBenefitClaim endpoint. The incremental is akin to an incrementing ID or serial number. It starts at "130" for each veteran, and every time a veteran makes a successful request to the insertBenefitClaim endpoint, the increment increases by 1. So the next time a veteran consumes the findBenefitClaimTypeIncrement endpoint, they will get an incremental of "131."
  • BGS requires every veterans' requests to the insertBenefitClaim endpoint to have a unique incremental. In other words, a veteran cannot make two requests to insertBenefitClaim with the same incremental. If they do, a null dependency claim is generated and a "Duplicate Benefit Claims found on the Corporate Database" error is thrown. This is why VA.gov's 686c and 674 submission jobs are both written to fetch the next incremental from the findBenefitClaimTypeIncrement before consuming the insertBenefitClaim endpoint.
  • But here's what VA.gov gets wrong: sometimes we run the 674 job before the 686c submission is complete. Meaning, we ask for 674's incremental before it has been updated by 686c's submission request. Consequently, both submission requests use the same incremental, and the latter request fails, thereby producing a null claim.
  • In 29448 686c/674 EP code increment issue #7811, previous devs tried to avoid this issue by scheduling the 674 job to run two minutes after enqueueing the 686c job. But, given the large volume of null dependency claims we have been seeing, two minutes is not enough.
  • A short term fix for this issue is to increase that two minute delay to a five minute delay. That's what I've done here. This should dramatically reduce the number of null dependency claims, but there will still be a remote possibility that null claims will be created.
  • A better, long-term fix will be to re-write the code to force the two jobs to run consecutively. I recommend doing that some time in the near future.

I work for the benefits decision reviews team. This work has been live on VA.gov for awhile. I've only recently come onboard to resolve the issue described in this PR.

Related issue(s)

Testing done

  • Because this is an edge case race condition, to know whether this fix has worked, I will monitor the Sentry logs to see whether there is indeed a dramatic drop in null dependency claim errors after this fix is deployed.

Screenshots

Note: Optional

What areas of the site does it impact?

686c form.

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog or Grafana (if applicable)
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • I added a screenshot of the developed feature

Requested Feedback

(OPTIONAL)What should the reviewers know in addition to the above. Is there anything specific you wish the reviewer to assist with. Do you have any concerns with this PR, why?

@data-doge data-doge self-assigned this Mar 7, 2023
@va-vfs-bot va-vfs-bot temporarily deployed to 54398-increase-delay-between-686c-and-674-form-submissions/main/main March 7, 2023 19:01 Inactive
@data-doge data-doge marked this pull request as ready for review March 7, 2023 19:24
@data-doge data-doge requested review from a team as code owners March 7, 2023 19:24
@data-doge data-doge requested a review from kylesoskin March 7, 2023 19:24
Copy link
Contributor

@kylesoskin kylesoskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I agree with what you stated in the description that a better long term fix would be to have them run synchronously either via code rewrite of some sidekiq batching logic. But either way, this looks fine.

@kylesoskin
Copy link
Contributor

The queuing of the second job could just be moved into the first job when/if you refactor to do it synchronously

@data-doge data-doge merged commit 00a98fb into master Mar 7, 2023
@data-doge data-doge deleted the 54398-increase-delay-between-686c-and-674-form-submissions branch March 7, 2023 21:11
data-doge added a commit that referenced this pull request Mar 22, 2023
Force SubmitForm674Job to run after SubmitForm686cJob.

In #11998, as a temporary measure to avoid a race condition, I increased a delay between the these two jobs from two minutes to five minutes. Unfortunately, the 686c job can indeed run more than five minutes after the 674 job is set to run, e.g. when the 686c job is stalled due to timeouts from other endpoints. This PR accommodates for situations like this by eliminating the use of delay, and instead enqueuing the 674 job only after the 686c job has successfully submitted a 686c claim to BGS. In the 686c job, I'm still running the 674 job asynchronously, as was done before, because `send_confirmation_email` was never dependent on the 674 job's success.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants