Refactor the GraduationService for improved performance #2121
The current GraduationService takes over 30 minutes to run against
production data. Analysis indicated that the current query for
candidate ETD works takes over 10 minutes to process and returns
over 6000 ETDs that have already been processed by the service and
are already in a 'published' state.
This change queries only for ETDs that are 'approved', the only
group eligible for graduation and publication. Querying in
this way is much faster and also eliminates the need for additional
error handling code, thereby simplifying the overall service.
The change also removes dead code for the GraduationdateService.
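As a rough illustration of the idea (not the service's actual code), the sketch below selects only 'approved' works from an in-memory list. The `Etd` struct, the sample IDs, and the `pending_review` state name are hypothetical stand-ins; the real change applies this filter in the repository query itself rather than in Ruby.

```ruby
# Hypothetical stand-in for an ETD record; the real service works with
# repository objects, not plain structs.
Etd = Struct.new(:id, :workflow_state)

# Keep only works in the 'approved' state. Because nothing else can be
# returned, no later check for an unexpected workflow state is needed.
def eligible_works(etds)
  etds.select { |etd| etd.workflow_state == 'approved' }
end

# Made-up sample data; 'pending_review' is an assumed state name.
SAMPLE_ETDS = [
  Etd.new('etd-1', 'published'),
  Etd.new('etd-2', 'approved'),
  Etd.new('etd-3', 'pending_review'),
  Etd.new('etd-4', 'approved')
].freeze

puts eligible_works(SAMPLE_ETDS).map(&:id).inspect
```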
TIMING ANALYSIS
I compared three different methods for determining `eligible_works`: just processing all ETDs, the current search method, and querying only for "approved" ETDs. The current search and conversion to an array takes over 10 minutes to process (614 seconds real time). Creating a collection of all ETDs or searching for only "approved" ETDs both take just a fraction of a second in real time.

In addition to the initial query time, we also need to account for the processing time for the result set. The cost of processing all ETDs will grow over time, and currently it would cause us to process over 9000 already published ETDs. The current query returns over 6000 results that are already published but would be unnecessarily reprocessed by the service. Querying for only ETDs in the "approved" workflow state returns the smallest possible set of eligible graduation candidates. In the sample data below, the number of approved works is very low (only 7) because the GraduationService has already been run during testing. In real life, we would expect 20 to 150 ETDs to be in the "approved" state awaiting graduation and publication. This is still a much lower number than the 6000+ results returned by the current code.
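For anyone wanting to repeat this kind of comparison, here is a minimal sketch of timing a candidate-selection block with Ruby's standard `Benchmark` module. The `timed` helper and the placeholder workload are assumptions for illustration only; in practice each strategy for building `eligible_works` would go inside the block.

```ruby
require 'benchmark'

# Hypothetical helper: runs the block and returns [result, elapsed_seconds],
# where elapsed_seconds is wall-clock ("real") time.
def timed
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  [result, elapsed]
end

# Placeholder workload standing in for one of the three strategies.
works, seconds = timed { (1..9000).select { |n| (n % 1000).zero? } }
puts format('%d candidates in %.3f s', works.size, seconds)
```

Reporting the size of the result set alongside the elapsed time matters here, since the cost of the service depends on both the query and the number of works it then has to process.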
EDGE CASE ANALYSIS
Since we're removing error-checking code (the new query can't return results with an invalid workflow state), it also seemed worth checking production data to see whether we needed to move this check elsewhere in the code. On production data, there is no evidence that any works are missing a valid workflow state; that is, a facet search of all ETDs shows that all are in an expected workflow state and none (0) have an undefined (null) workflow state: