Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/dim_student_script_level_activity #170

Merged
merged 12 commits into from
Sep 10, 2024

Conversation

jordan-springer
Copy link
Collaborator

@jordan-springer jordan-springer commented Aug 29, 2024

Description

This pull request seeks to evaluate the efficiency gains of re-writing various models that produce our current dim_student_script_level_activity model.

In developing this model, I have also removed dim_student_projects to redeploy later after we optimize.

Future Work

Goals:

i. Reduce overall runtime

  • Last successful Production Run: ~1hr 48min

  • Current runtime of dim_student_script_level_activity

image
  • CI job runtime of model:
    image

Design

Essentially, there are two major groups of data that need to be intertwined. So how do we do that as quickly as possible given all of the other models that will have begun before this one.
Such as:

  • schools
  • teacher info
  • user_level activity (granularity: date)
  • course_structure

In some way shape or form, I think this will need an incremental loading.
The current feature has been running for and has returned (so far) 10.5B rows

Links

Jira ticket(s): dataops-922

Testing story

I tested the model primarily using timestamps and running some checkpoints (see below*)

  • Does your change include appropriate tests on key columns?
    eg.
    - not_null
    - unique
    - `dbt_utils.unique_combination_of_columns: , ["value","value","value"...]

Note: when submitting a new model for review please make sure the following have been tested:

  1. The model compiles (dbt build -m 'your_model')
    or: has the dbt Cloud job succeeded?
  2. The model runs (dbt run -m 'your_model')
  3. The model produces accessible data in the DW (select 1 from 'your_model')

Privacy

  • 1. Does this change involve the collection, use, or sharing of new Personal Data?
  • 2. Do these data exist in the appropriate schema(s)?
  • 3. Does this change involve a new or changed use or sharing of existing Personal Data?
  • 4. Consider: will this data be visible on Tableau? will this data be surfaced in a report exported from Trevor?
  • 5. [Optional] If yes to any of the above, please list the models, columns, and justification below:
    i.
    ii.
    iii.

PR Checklist:

--> Note: if these are not all checked, the PR will be sent back.

  • Tests provide adequate coverage
  • Privacy and Security impacts have been assessed
  • Code adheres to style guide👀 and is DRY
  • Code is well-commented (**please do not leave extraneous commentary in model code, if it is for the purpose of documentation, please relocate accordingly)
  • Appropriate documentation has been provided (see .yml., did dbt docs generate succeed?)
  • New features are translatable or updates will not break up/downstream models
  • Relevant documentation has been added or updated (i.e. dbt docs has been updated successfully on Github Pages
  • Pull Request is labeled appropriately (eg. chore/, feature/, fix/)
  • Follow-up work items (including potential tech debt) are tracked and linked (if applicable)

*Redshift testing file:

-- dim_student_course_activity 
-- version: 2.0 (JS)


with 
user_levels as (
/*
    1. User Course Activity
    
    This is the mostly costly part of the model, so we will
    build it as easily as possible and earlier in the process...
    
    This first combination allows us to pull together all the user level activity and course data we need to build the rest of the model. 
    
    We can also roll through our aggregates and other calculations so as to not do them each time in some larger downstream query...
*/
    select 
        user_id as student_id, 
        -- If it's a student model, let's maintain prior conventions
        
        level_id,
        script_id,
        
        -- dates 
        date_trunc('day', created_at)    as activity_date,
        extract('month' from created_at) as activity_month,
        
        -- aggs
        max(attempts)       as total_attempts,
        max(best_result)    as best_result,
        sum(time_spent)     as time_spent_minutes
    -- from {{ ref('stg_dashboard__user_levels') }}
    -- {{ dbt_utils.group_by(5) }}
    from analytics.stg_dashboard__user_levels 
    group by 1,2,3,4,5
), 

course_structure as (
    select
        course_name,
        course_id,
        script_id,
        level_id,
        level_name,
        level_type,
        unit        as unit_name,
        stage_name  as lesson_name
    -- from {{ ref('dim_course_structure') }}
    from analytics.dim_course_structure
    where participant_audience = 'student' 
    -- Note: filter out data we don't want or need as early as possible. If we keep it around, it will be continuously processed as it is referenced in other queries.
),

school_years as (
    select * 
    -- from {{ ref('int_school_years') }}
    from analytics.int_school_years
),

student_activity as (
    select 
        ul.* ,
        sy.school_year,

        -- calc for qtr 
        case 
            when ul.activity_month in (7,8,9)       then 'Q1'
            when ul.activity_month in (10,11,12)    then 'Q2'
            when ul.activity_month in (1,2,3)       then 'Q3'
            when ul.activity_month in (4,5,6)       then 'Q4'
        end as activity_quarter, 

        cs.course_name,
        cs.level_name,
        cs.level_type,
        cs.unit_name,
        cs.lesson_name

    from user_levels as ul 
    left join course_structure as cs
        on ul.level_id = cs.level_id
        and ul.script_id = cs.script_id
    join school_years as sy
        on ul.activity_date 
            between sy.started_at 
                and sy.ended_at 
),

/*  Checkpoint:     
    Runtime: 3.8 minutes
    select * from student_activity 
*/

/* 
    2. Sections and Students
    
    We now have all that taken care of, so our join will be simpler later on (and less data in memory)

    Next, we need to map each student to their section(s). This is another big one but only bc our student_activity isn't unique to student_id... so we can pre-process out that work as well.
*/

section_mapping as (
    select * 
    -- from {{ ref('int_section_mapping') }}
    from analytics.int_section_mapping 
    where student_id in (select student_id from student_activity)
    -- Note: again, filter out anything we don't need. We need to keep the ship as light as possible.
),

section_size as (
    select 
        section_id,
        count(distinct student_id) as section_size 
    from section_mapping
    -- {{ dbt_utils.group_by(1) }}
    group by section_id
),

sections as (
    select 
        scm.* ,
        -- Note: I don't actually need student_activity data here, so I won't bother to load it 
    
        scz.section_size
    from section_mapping    as scm 
    join section_size       as scz 
        on scm.section_id = scz.section_id 
    join student_activity   as sta 
        on scm.student_id = sta.student_id 
        and scm.school_year = sta.school_year
),

    -- Checkpoint:     
    -- select * from sections ;
    -- Runtime: 23.4 minutes

/*
    So now that we have our student_activity and section mapping built, we can smash them together to get our full sections and status data 

    3. Teachers and Schools
        a. Use student_id to connect to sections
        b. Use teacher, school_id for statuses
        c. select * that shit
        d. almost forgot user_geo info 
*/

schools as (
    select * 
    -- from {{ ref('dim_schools') }}
    from analytics.dim_schools
),

users as (
    select user_id, is_international, country 
    -- from {{ ref('dim_users') }}
    from analytics.dim_users
),

school_status as (
    select school_id, school_year, status as school_status
    -- from {{ ref('dim_school_status') }}
    from analytics.dim_school_status

),

teacher_status as (
    select teacher_id, school_year, status as teacher_status  
    -- from {{ ref('dim_teacher_status') }}
    from analytics.dim_teacher_status
),

combined as (
    select 
        sec.school_year, 
        
        -- teachers
        sec.teacher_id, 
        tes.teacher_status,
        
        -- students
        sec.student_id,
        usr.is_international,
        usr.country,

        -- schools
        sec.school_id,
        sst.school_status,
        sch.school_name,
        sch.school_district_id,
        sch.school_district_name,
        sch.state                   as school_state,
        sch.school_type,
        sch.is_stage_el             as school_is_stage_el,
        sch.is_stage_mi             as school_is_stage_mi,
        sch.is_stage_hi             as school_is_stage_hi,
        sch.is_high_needs           as school_is_high_needs,
        sch.is_rural                as school_is_rural

    from sections   as sec 
    
    join teacher_status as tes 
        on tes.teacher_id   = sec.teacher_id 
        and tes.school_year = sec.school_year
    
    join school_status as sst 
        on  sec.school_id   = sst.school_id 
        and sec.school_year = sst.school_year 
    
    join schools    as sch
        on sch.school_id = sec.school_id 
    join users      as usr 
        on usr.user_id = sec.student_id
),

-- Checkpoint:
-- select * from combined ;
-- Runtime: 20.8 min

final as (
    select * 
    from combined as comb 
    left join student_activity as sta 
        on comb.student_id = sta.student_id )

-- Checkpoint: 
select *
from final 
-- Full runtime: 28 min
-- Delta: ~8 min (for final)
;

@jordan-springer jordan-springer self-assigned this Aug 29, 2024
@jordan-springer jordan-springer marked this pull request as ready for review August 30, 2024 21:46
@jordan-springer jordan-springer changed the title feature/dim_student_course_activity feature/dim_student_script_level_activity Sep 9, 2024
Copy link
Collaborator

@allison-code-dot-org allison-code-dot-org left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small change to add script_name, but otherwise good to go!

@@ -182,7 +182,7 @@ models:
config:
tags: ['released']

- name: dim_student_script_level_activity
- name: dim_student_course_activity
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think naming as course_activity could be confusing since it's not by course, but rather by level. Are you thinking dim_student_script_level_activity is too long of a name and we need to go simpler?

@jordan-springer jordan-springer merged commit 1064164 into main Sep 10, 2024
1 check passed
@jordan-springer jordan-springer deleted the feature/dim_student_course_activity branch September 10, 2024 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants