Refactor the GraduationService for improved performance #2121

Merged
merged 1 commit into from
Mar 15, 2021


mark-dce
Contributor

The current GraduationService takes over 30 minutes to run against
production data. Analysis indicated that the current query for
candidate ETD works takes over 10 minutes to process and returns
over 6000 ETDs that have already been processed by the service and
are already in a 'published' state.

This change queries only for ETDs that are 'approved', the only
group eligible for graduation and publication. Querying in
this way is much faster and also eliminates the need for additional
error handling code, thereby simplifying the overall service.
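
To illustrate the difference in candidate sets, here is a minimal sketch that uses an in-memory stand-in rather than the real Solr query. `EtdRecord`, `candidates_without_degree`, and `approved_candidates` are hypothetical names made up for this example; they are not the service's actual methods.

```ruby
# Hypothetical stand-in for an indexed ETD; the real service queries Solr.
EtdRecord = Struct.new(:id, :workflow_state, :degree_awarded)

# Old approach, as described above: every work without a recorded degree.
# Each result still needs a workflow-state check before it can be graduated.
def candidates_without_degree(records)
  records.select { |r| r.degree_awarded.nil? }
end

# New approach: only works already in the 'approved' workflow state.
def approved_candidates(records)
  records.select { |r| r.workflow_state == 'approved' }
end

# Made-up sample records, for illustration only.
records = [
  EtdRecord.new('etd-1', 'published', nil),
  EtdRecord.new('etd-2', 'published', '2020-05-11'),
  EtdRecord.new('etd-3', 'approved', nil),
  EtdRecord.new('etd-4', 'pending_review', nil)
]

puts candidates_without_degree(records).map(&:id).inspect
puts approved_candidates(records).map(&:id).inspect
```

The first selection includes published and in-review works that must be filtered out afterwards; the second cannot contain a work in any other state, which is why the extra error handling becomes unnecessary.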

The change also removes dead code for the GraduationdateService.

TIMING ANALYSIS

I compared three different methods for determining eligible_works: processing all ETDs, the current search method, and querying only for "approved" ETDs. The current search and conversion to an array takes over 10 minutes to process (614 seconds real time). Creating a collection of all ETDs or searching for only "approved" ETDs both take just a fraction of a second in real time.

In addition to the initial query time, we also need to account for the processing time for the result set. The set of all ETDs will grow over time, and processing it would currently mean reprocessing over 9000 already published ETDs. The current query returns over 6000 results that are already published but would be unnecessarily reprocessed by the service. Querying for only ETDs in the "approved" workflow state returns the smallest possible set of eligible graduation candidates. In the sample data below, the number of approved works is very low (only 7) because the GraduationService has already been run during testing. In production, we would expect 20 to 150 ETDs to be in the "approved" state awaiting graduation and publication. This is still a much lower number than the 6000+ results returned by the current code.

irb(main):027:1* Benchmark.benchmark(CAPTION, 10, FORMAT) do |test|
irb(main):028:1*   test.report('all:') { all_etds = Etd.all }
irb(main):029:1*   test.report('awarded:') { no_degree_yet = Etd.where(degree_awarded: nil).to_a }
irb(main):030:1*   test.report('approved') { approved_etds = Etd.search_with_conditions({workflow_state_name_ssim: 'approved' }, rows:1000 ) }
irb(main):031:0> end
                 user     system      total        real
all:         0.004000   0.000000   0.004000 (  0.000062)
awarded:   402.852000   1.772000 404.624000 (614.374976)
approved     0.004000   0.000000   0.004000 (  0.006355)
=> [#<Benchmark::Tms:0x0000000062ff6740 @label="all:", @real=6.188200495671481e-05, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.004000000000019099, @total=0.004000000000019099>, #<Benchmark::Tms:0x00000000286cf070 @label="awarded:", @real=614.3749762410007, @cstime=0.0, @cutime=0.0, @stime=1.7720000000000002, @utime=402.852, @total=404.62399999999997>, #<Benchmark::Tms:0x00000000287b7708 @label="approved", @real=0.0063549729966325685, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.004000000000019099, @total=0.004000000000019099>]

irb(main):033:0> all_etds.count
=> 9094
irb(main):034:0> no_degree_yet.count
=> 6486
irb(main):035:0> approved_etds.count
=> 7
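
For anyone wanting to repeat this kind of comparison outside a Rails console, here is a self-contained sketch of the same `Benchmark.benchmark` pattern. The data is synthetic and the labels are made up, so the timings it prints say nothing about production. It also shows one caveat when reading such numbers: a block that only builds a lazy collection is timed without loading any records. Whether that applies to the `all:` row above is an assumption, not something the transcript confirms.

```ruby
require 'benchmark'

# Synthetic stand-in data; counts and timings are not production figures.
records = Array.new(50_000) do |i|
  { id: i, state: (i % 1000).zero? ? 'approved' : 'published' }
end

# Benchmark::CAPTION and Benchmark::FORMAT are the constants referred to as
# CAPTION and FORMAT in the transcript (available unqualified after
# `include Benchmark`).
results = Benchmark.benchmark(Benchmark::CAPTION, 10, Benchmark::FORMAT) do |test|
  # Only builds a lazy enumerator; no records are filtered yet.
  test.report('lazy:')   { records.lazy.select { |r| r[:state] == 'approved' } }
  # Filters every record and materializes the result as an array.
  test.report('loaded:') { records.select { |r| r[:state] == 'approved' }.to_a }
end

puts results.map(&:label).inspect
```

The return value is an array of `Benchmark::Tms` objects, one per `report` call, matching the shape of the result printed in the transcript.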

EDGECASE ANALYSIS

Since we're removing error checking code (the new query can't return results with an invalid workflow state), it also seemed worth checking production data to see if we needed to move this check elsewhere in the code. On production data, there is no evidence that any works are missing a valid workflow state, i.e. a facet search of all ETDs shows that all are in an expected workflow state and none (0) have an undefined (null) workflow state:

http://localhost:8993/solr/laevigata/select?facet.field=workflow_state_name_ssim&facet.missing=true&facet=on&indent=on&q=has_model_ssim:Etd&rows=0&wt=json
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "q":"has_model_ssim:Etd",
      "facet.field":"workflow_state_name_ssim",
      "indent":"on",
      "facet.missing":"true",
      "rows":"0",
      "facet":"on",
      "wt":"json",
      "_":"1615570675648"}},
  "response":{"numFound":9094,"start":0,"maxScore":3.4187903,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "workflow_state_name_ssim":[
        "published",9032,
        "pending_approval",39,
        "pending_review",11,
        "approved",7,
        "changes_required",5,
        null,0]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}},
  "spellcheck":{
    "suggestions":[],
    "correctlySpelled":true}}
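
The same check can be done programmatically. The sketch below parses a trimmed copy of the facet section of the response above using only Ruby's standard library; it does not contact Solr, and the variable names are illustrative.

```ruby
require 'json'

# Trimmed copy of the facet section of the response shown above.
body = <<~JSON
  {"facet_counts":{"facet_fields":{"workflow_state_name_ssim":
    ["published",9032,"pending_approval",39,"pending_review",11,
     "approved",7,"changes_required",5,null,0]}}}
JSON

flat = JSON.parse(body).dig('facet_counts', 'facet_fields', 'workflow_state_name_ssim')

# Solr returns facets as a flat [value, count, value, count, ...] array.
# The nil key is the facet.missing bucket (documents with no value).
counts = flat.each_slice(2).to_h

missing = counts.fetch(nil, 0)
total = counts.values.sum

puts "ETDs without a workflow state: #{missing}"
puts "ETDs counted across all states: #{total}"
```

The per-state counts sum to 9094, the same as `numFound`, which is consistent with every ETD being in exactly one known workflow state.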

@maxkadel maxkadel merged commit f4a01dc into main Mar 15, 2021
@maxkadel maxkadel deleted the graduate-faster branch March 15, 2021 18:13