Refactor the GraduationService for improved performance #2121

mark-dce · 2021-03-13T19:27:51Z

The current GraduationService takes over 30 minutes to run against
production data. Analysis indicated that the current query for
candidate ETD works takes over 10 minutes to process and returns
over 6000 ETDs that have already been processed by the service and
are already in a 'published' state.

This change queries only for ETDs that are 'approved', which is the only
group eligible for graduation and publication. Querying in
this way is much faster and also eliminates the need for additional
error handling code, thereby simplifying the overall service.

The change also removes dead code for the GraduationdateService.
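
As a rough illustration of why the narrower query removes the need for a per-record state check, here is a minimal sketch using an in-memory stand-in. `EtdRecord` and the sample records are hypothetical and exist only for this example; the real service works with `Etd` objects and a Solr query.

```ruby
# Hypothetical stand-in for an ETD; not the application's Etd model.
EtdRecord = Struct.new(:id, :degree_awarded, :workflow_state)

records = [
  EtdRecord.new('etd-1', nil, 'published'),
  EtdRecord.new('etd-2', nil, 'approved'),
  EtdRecord.new('etd-3', '2020-05-11', 'published'),
  EtdRecord.new('etd-4', nil, 'pending_review'),
  EtdRecord.new('etd-5', nil, 'approved')
]

# Old shape: collect everything without a degree date, then check each
# record's workflow state before processing it.
old_candidates = records.select { |r| r.degree_awarded.nil? }
old_eligible   = old_candidates.select { |r| r.workflow_state == 'approved' }

# New shape: select on workflow state directly, so every returned record
# is already known to be in a valid, eligible state.
new_eligible = records.select { |r| r.workflow_state == 'approved' }

puts "old candidates: #{old_candidates.size}, new candidates: #{new_eligible.size}"
puts "same eligible set: #{old_eligible == new_eligible}"
```

Both shapes end up with the same eligible records in this sketch; the difference is how many candidates have to be fetched and inspected along the way.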

TIMING ANALYSIS

I compared three different methods for determining eligible_works: processing all ETDs, the current search method, and querying only for "approved" ETDs. The current search and conversion to an array takes over 10 minutes to process (614 seconds real time). Creating a collection of all ETDs or searching for only "approved" ETDs both take just a fraction of a second in real time.

In addition to the initial query time, we also need to account for the processing time for the result set. Processing all ETDs will grow over time and currently would cause us to process over 9000 already published ETDs. The current query returns over 6000 results that are already published but would be unnecessarily reprocessed by the service. Querying for only ETDs in the "approved" workflow state returns the smallest possible set of eligible graduation candidates. In the sample data below, the number of approved works is very low (only 7) because the GraduationService has already been run during testing. In real life, we would expect 20 to 150 ETDs to be in the "approved" state awaiting graduation and publication. This is still a much lower number than the 6000+ results returned by the current code.

irb(main):027:1* Benchmark.benchmark(CAPTION, 10, FORMAT) do |test|
irb(main):028:1*   test.report('all:') { all_etds = Etd.all }
irb(main):029:1*   test.report('awarded:') { no_degree_yet = Etd.where(degree_awarded: nil).to_a }
irb(main):030:1*   test.report('approved') { approved_etds = Etd.search_with_conditions({workflow_state_name_ssim: 'approved' }, rows:1000 ) }
irb(main):031:0> end
                 user     system      total        real
all:         0.004000   0.000000   0.004000 (  0.000062)
awarded:   402.852000   1.772000 404.624000 (614.374976)
approved     0.004000   0.000000   0.004000 (  0.006355)
=> [#<Benchmark::Tms:0x0000000062ff6740 @label="all:", @real=6.188200495671481e-05, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.004000000000019099, @total=0.004000000000019099>, #<Benchmark::Tms:0x00000000286cf070 @label="awarded:", @real=614.3749762410007, @cstime=0.0, @cutime=0.0, @stime=1.7720000000000002, @utime=402.852, @total=404.62399999999997>, #<Benchmark::Tms:0x00000000287b7708 @label="approved", @real=0.0063549729966325685, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.004000000000019099, @total=0.004000000000019099>]

irb(main):033:0> all_etds.count
=> 9094
irb(main):034:0> no_degree_yet.count
=> 6486
irb(main):035:0> approved_etds.count
=> 7

EDGE CASE ANALYSIS

Since we're removing error checking code (the new query can't return results with an invalid workflow state), it also seemed worth checking production data to see if we needed to move this check elsewhere in the code. On production data, there is no evidence that any works are missing a valid workflow state, i.e. a facet search of all ETDs shows that all are in an expected workflow state and none (0) have an undefined (null) workflow state:

http://localhost:8993/solr/laevigata/select?facet.field=workflow_state_name_ssim&facet.missing=true&facet=on&indent=on&q=has_model_ssim:Etd&rows=0&wt=json
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"has_model_ssim:Etd",
      "facet.field":"workflow_state_name_ssim",
      "indent":"on",
      "facet.missing":"true",
      "rows":"0",
      "facet":"on",
      "wt":"json",
      "_":"1615570675648"}},
  "response":{"numFound":9094,"start":0,"maxScore":3.4187903,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "workflow_state_name_ssim":[
        "published",9032,
        "pending_approval",39,
        "pending_review",11,
        "approved",7,
        "changes_required",5,
        null,0]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}},
  "spellcheck":{
    "suggestions":[],
    "correctlySpelled":true}}
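
Solr returns the facet counts as a flat array that alternates value and count, with the `facet.missing` bucket reported under `null`. A minimal sketch of turning that into a hash and checking the two claims above (no missing states, and the counts adding up to `numFound`) might look like this. The JSON is a trimmed copy of the response above; the variable names are illustrative only.

```ruby
require 'json'

# Trimmed copy of the facet response shown above.
response = JSON.parse(<<~BODY)
  {
    "response": { "numFound": 9094 },
    "facet_counts": {
      "facet_fields": {
        "workflow_state_name_ssim": [
          "published", 9032,
          "pending_approval", 39,
          "pending_review", 11,
          "approved", 7,
          "changes_required", 5,
          null, 0
        ]
      }
    }
  }
BODY

flat = response.dig('facet_counts', 'facet_fields', 'workflow_state_name_ssim')

# Pair up [value, count, value, count, ...] into { value => count }.
# The facet.missing bucket arrives as JSON null, which becomes a nil key.
counts = flat.each_slice(2).to_h

missing   = counts.fetch(nil, 0)
total     = counts.values.sum
num_found = response.dig('response', 'numFound')

puts "ETDs with no workflow state: #{missing}"
puts "facet total matches numFound: #{total == num_found}"
```

If the missing bucket were ever non-zero, that would be the signal to keep a workflow state check somewhere in the service.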

fnibbit approved these changes Mar 15, 2021

maxkadel approved these changes Mar 15, 2021

maxkadel merged commit f4a01dc into main Mar 15, 2021

maxkadel deleted the graduate-faster branch March 15, 2021 18:13
