Skip to content

Latest commit

 

History

History
694 lines (562 loc) · 37.9 KB

11_session.md

File metadata and controls

694 lines (562 loc) · 37.9 KB
layout title output
lesson
Session 11
markdown_document

Topics

  • Developing questions and finding data
  • Getting familiar with and focusing on data
  • Working with spreadsheets
  • Joining data frames together

Developing questions and finding data

In the previous lessons I've given you data without much explanation for how I found the data. Of course, for many, you will be generating data through a survey you deploy or with an experiment you are conducting. Reading a paper, news story, or listening to a podcast never satisfies my curiosity. I want to play with the data to answer my own variation on a question. My general strategy of finding data, frankly, is to use google with words like "download" or "raw data". For example, "temperature raw data" will get you pretty close to the site that I got our Ann Arbor data from.

I have been working on a project that is asking how well women are represented in scientific publishing and as gate keepers throughout peer review. This analysis got me to thinking about how well women and other groups are represented at various levels of training. I also wondered how that representation varies by discipline. Thinking about these types of questions, my ears perked up when I heard of a service on my campus that quantifies what the demographic pool of applicants for faculty positions should look like based on the representation of various groups amongst its trainees.

To answer these types of questions, for the next few sessions we are going to look at data collected by the National Center for Education Statistics at the US Department of Education. They maintain the Integrated Postsecondary Education Data System (IPEDS), which houses a number of surveys on higher education. This is an amazingly rich dataset that we'll use to explore some questions. According to their website, "Institutions submit data through 12 interrelated survey components about general higher education topics for 3 reporting periods." I have provided you a subset of the data as "csv" files in the ipeds directory along with the data dictionaries. These dictionaries explain the column names and codes used in the two csv files. If you go to their "Use the Data" page, on the right side of the page there is a heading "Survey Data" with a drop down menu. In the drop down menu select, "Complete Data Files". In the next page you can leave the default choices of "All years" and "All surveys" and then press "Continue". On this page you'll see the 12 surveys for each year going back to 1980. As I've found, the surveys change over time as do how the data are reported and recorded. To avoid wrangling data across years, we'll work with data that was reported for 2018.

Getting familiar with and focusing on data

Again, there are many questions we could try to answer with these data. I'm going to focus on degrees granted in 2018 by field across institutions. By looking through the files, I decided that I was most interested in "Institutional Characteristics Directory information" and "Completions Awards/degrees conferred by program (6-digit CIP code), award level, race/ethnicity, and gender: July 1, 2017 to June 30, 2018". These files are in the ipeds directory as c2018.csv/c2018.xlsx and hd2018.csv/hd2018.xlsx, respectively. The csv files contain the data and the xlsx files contain the data dictionary for each csv file. We'll start by loading the tidyverse package and reading in the c2018_a csv file as the degrees data frame.

library(tidyverse)
degrees <- read_csv("ipeds/c2018_a.csv")
degrees
## # A tibble: 290,119 x 64
##    UNITID CIPCODE MAJORNUM AWLEVEL XCTOTALT CTOTALT XCTOTALM CTOTALM XCTOTALW
##     <dbl> <chr>      <dbl> <chr>   <chr>      <dbl> <chr>      <dbl> <chr>   
##  1 100654 01.0999        1 05      R              5 Z              0 R       
##  2 100654 01.1001        1 05      R              9 R              2 R       
##  3 100654 01.1001        1 07      R              5 R              1 R       
##  4 100654 01.1001        1 17      R              1 R              1 Z       
##  5 100654 01.9999        1 05      R              6 R              1 R       
##  6 100654 01.9999        1 07      R              7 R              2 R       
##  7 100654 01.9999        1 17      R              1 Z              0 R       
##  8 100654 03.0599        1 05      R             12 R             11 R       
##  9 100654 04.0301        1 05      R              2 Z              0 R       
## 10 100654 04.0301        1 07      R             11 R              7 R       
## # … with 290,109 more rows, and 55 more variables: CTOTALW <dbl>,
## #   XCAIANT <chr>, CAIANT <dbl>, XCAIANM <chr>, CAIANM <dbl>, XCAIANW <chr>,
## #   CAIANW <dbl>, XCASIAT <chr>, CASIAT <dbl>, XCASIAM <chr>, CASIAM <dbl>,
## #   XCASIAW <chr>, CASIAW <dbl>, XCBKAAT <chr>, CBKAAT <dbl>, XCBKAAM <chr>,
## #   CBKAAM <dbl>, XCBKAAW <chr>, CBKAAW <dbl>, XCHISPT <chr>, CHISPT <dbl>,
## #   XCHISPM <chr>, CHISPM <dbl>, XCHISPW <chr>, CHISPW <dbl>, XCNHPIT <chr>,
## #   CNHPIT <dbl>, XCNHPIM <chr>, CNHPIM <dbl>, XCNHPIW <chr>, CNHPIW <dbl>,
## #   XCWHITT <chr>, CWHITT <dbl>, XCWHITM <chr>, CWHITM <dbl>, XCWHITW <chr>,
## #   CWHITW <dbl>, XC2MORT <chr>, C2MORT <dbl>, XC2MORM <chr>, C2MORM <dbl>,
## #   XC2MORW <chr>, C2MORW <dbl>, XCUNKNT <chr>, CUNKNT <dbl>, XCUNKNM <chr>,
## #   CUNKNM <dbl>, XCUNKNW <chr>, CUNKNW <dbl>, XCNRALT <chr>, CNRALT <dbl>,
## #   XCNRALM <chr>, CNRALM <dbl>, XCNRALW <chr>, CNRALW <dbl>

Ugh, those column names!

ipeds/c2018_a.xlsx contains the data dictionary. Open it up in Excel, it has different pages that tell you what the column names mean and what the various values are. For example, the "varlist" and "Description" tabs tell us that "UNITID" tells us the "Unique identification number of the institution". It also tells us that "CIPCODE" is the "CIP Code - 2010 Classification" and that "AWLEVEL" is the "Award Level code". If we want to know the values that correspond to each CIP or Award Level codes, we need to look at the "Frequencies" tab. We see that the first 1426 rows of the "Frequencies" tab are different CIP codes. These correspond to different "majors". Below the CIP codes, we see entries for "MAJORNUM" and "AWLEVEL". Cool - I'd like to look at the 2018 awards data for the University of Michigan. Which "UNITID" corresponds to the University of Michigan or any other university? That "decoder" isn't in this workbook.

If you do some digging on the page where we saw the "C2018_A" data, at the top of the page you'll find "HD2018". The description for this survey is "Institutional Characteristics, Directory information". Those data and the data dictionary are provided for you in ipeds/. Opening ipeds/hd2018.xlsx in Microsoft Excel and clicking on the "varlist" or "Description" tab show us that this survey will give us the mapping between "UNITID" and the institution name or "INSTNM". Now, let's open ipeds/hd2018.csv in Excel. You can do this by doing "File -> Open..." and browsing to ipeds/hd2018.csv. We can then click on the "Edit" menu and click on "Find". In the dialog window that opens, enter "University of Michigan" and press the return key. You'll see that the "University of Michigan-Ann Arbor"'s "UNITID" is "170976".

Now we want to pull all of the rows from the degrees data frame that correspond to that UNITID value. Remember how to do that?

degrees %>%
	filter(UNITID == 170976)
## # A tibble: 570 x 64
##    UNITID CIPCODE MAJORNUM AWLEVEL XCTOTALT CTOTALT XCTOTALM CTOTALM XCTOTALW
##     <dbl> <chr>      <dbl> <chr>   <chr>      <dbl> <chr>      <dbl> <chr>   
##  1 170976 03.0101        1 05      R              0 R              0 Z       
##  2 170976 03.0101        1 07      R            107 R             40 R       
##  3 170976 03.0101        1 17      R              2 R              0 R       
##  4 170976 03.0103        1 05      R             93 R             37 R       
##  5 170976 03.0103        1 06      R              0 R              0 Z       
##  6 170976 03.0103        1 08      R              0 R              0 Z       
##  7 170976 03.0103        2 05      R             20 R              9 R       
##  8 170976 03.0201        1 05      R              0 R              0 Z       
##  9 170976 03.0201        1 06      R              0 R              0 Z       
## 10 170976 03.0201        1 07      R              0 R              0 Z       
## # … with 560 more rows, and 55 more variables: CTOTALW <dbl>, XCAIANT <chr>,
## #   CAIANT <dbl>, XCAIANM <chr>, CAIANM <dbl>, XCAIANW <chr>, CAIANW <dbl>,
## #   XCASIAT <chr>, CASIAT <dbl>, XCASIAM <chr>, CASIAM <dbl>, XCASIAW <chr>,
## #   CASIAW <dbl>, XCBKAAT <chr>, CBKAAT <dbl>, XCBKAAM <chr>, CBKAAM <dbl>,
## #   XCBKAAW <chr>, CBKAAW <dbl>, XCHISPT <chr>, CHISPT <dbl>, XCHISPM <chr>,
## #   CHISPM <dbl>, XCHISPW <chr>, CHISPW <dbl>, XCNHPIT <chr>, CNHPIT <dbl>,
## #   XCNHPIM <chr>, CNHPIM <dbl>, XCNHPIW <chr>, CNHPIW <dbl>, XCWHITT <chr>,
## #   CWHITT <dbl>, XCWHITM <chr>, CWHITM <dbl>, XCWHITW <chr>, CWHITW <dbl>,
## #   XC2MORT <chr>, C2MORT <dbl>, XC2MORM <chr>, C2MORM <dbl>, XC2MORW <chr>,
## #   C2MORW <dbl>, XCUNKNT <chr>, CUNKNT <dbl>, XCUNKNM <chr>, CUNKNM <dbl>,
## #   XCUNKNW <chr>, CUNKNW <dbl>, XCNRALT <chr>, CNRALT <dbl>, XCNRALM <chr>,
## #   CNRALM <dbl>, XCNRALW <chr>, CNRALW <dbl>

From that filter, we have winnowed our data frame down to the 570 rows that correspond to the University of Michigan. Each row of this data frame corresponds to a different CIPCODE/MAJORNUM/AWLEVEL combination. To simplify things, let's only look at people's first major. Also, the first CIPCODE - "99" - is a grand total so we don't want to include those data.

degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & CIPCODE != 99)
## # A tibble: 503 x 64
##    UNITID CIPCODE MAJORNUM AWLEVEL XCTOTALT CTOTALT XCTOTALM CTOTALM XCTOTALW
##     <dbl> <chr>      <dbl> <chr>   <chr>      <dbl> <chr>      <dbl> <chr>   
##  1 170976 03.0101        1 05      R              0 R              0 Z       
##  2 170976 03.0101        1 07      R            107 R             40 R       
##  3 170976 03.0101        1 17      R              2 R              0 R       
##  4 170976 03.0103        1 05      R             93 R             37 R       
##  5 170976 03.0103        1 06      R              0 R              0 Z       
##  6 170976 03.0103        1 08      R              0 R              0 Z       
##  7 170976 03.0201        1 05      R              0 R              0 Z       
##  8 170976 03.0201        1 06      R              0 R              0 Z       
##  9 170976 03.0201        1 07      R              0 R              0 Z       
## 10 170976 03.0201        1 08      R              1 R              1 Z       
## # … with 493 more rows, and 55 more variables: CTOTALW <dbl>, XCAIANT <chr>,
## #   CAIANT <dbl>, XCAIANM <chr>, CAIANM <dbl>, XCAIANW <chr>, CAIANW <dbl>,
## #   XCASIAT <chr>, CASIAT <dbl>, XCASIAM <chr>, CASIAM <dbl>, XCASIAW <chr>,
## #   CASIAW <dbl>, XCBKAAT <chr>, CBKAAT <dbl>, XCBKAAM <chr>, CBKAAM <dbl>,
## #   XCBKAAW <chr>, CBKAAW <dbl>, XCHISPT <chr>, CHISPT <dbl>, XCHISPM <chr>,
## #   CHISPM <dbl>, XCHISPW <chr>, CHISPW <dbl>, XCNHPIT <chr>, CNHPIT <dbl>,
## #   XCNHPIM <chr>, CNHPIM <dbl>, XCNHPIW <chr>, CNHPIW <dbl>, XCWHITT <chr>,
## #   CWHITT <dbl>, XCWHITM <chr>, CWHITM <dbl>, XCWHITW <chr>, CWHITW <dbl>,
## #   XC2MORT <chr>, C2MORT <dbl>, XC2MORM <chr>, C2MORM <dbl>, XC2MORW <chr>,
## #   C2MORW <dbl>, XCUNKNT <chr>, CUNKNT <dbl>, XCUNKNM <chr>, CUNKNM <dbl>,
## #   XCUNKNW <chr>, CUNKNW <dbl>, XCNRALT <chr>, CNRALT <dbl>, XCNRALM <chr>,
## #   CNRALM <dbl>, XCNRALW <chr>, CNRALW <dbl>

I'd like to know how many award levels are represented at the University of Michigan and how many majors correspond to each award level. Do you recall how to count the number of times a value appears in a column? We can use the count function.

degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & CIPCODE != 99) %>%
	count(AWLEVEL)
## # A tibble: 6 x 2
##   AWLEVEL     n
##   <chr>   <int>
## 1 05        152
## 2 06         39
## 3 07        153
## 4 08         48
## 5 17        104
## 6 18          7

We see that the most common award levels are codes "05", "07", and "17". Reopening ipeds/c2018_a.xlsx in Excel, we can turn to the "Frequencies" tab and scroll to the bottom where we see the translation of the "AWLEVEL" codes. These three codes correspond to "Bachelor's degree", "Master's degree", and "Doctor's degree - research/scholarship", respectively. The university also awards "Postbaccalaureate certificate" (06), "Post-master's certificate" (08), and "Doctor's degree - professional practice" (18). Note that the values in the n column of this new data frame are the number of majors that awarded one of these "degree" types in 2018. I'd like to know how many people graduated with these degrees, across all marjors. Hopefully, this is ringing bells in your head and you're thinking about how you can use group_by and summarize. Before we can use those functions, we need to figure out which column in degrees contains the number of students to receive an award. From the data dictionary, we see that "CTOTALT" contains the "Grand total". Let's try that and save it as a variable.

um_degrees <- degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & CIPCODE != 99) %>%
	group_by(AWLEVEL) %>%
	summarize(n_graduates = sum(CTOTALT))

The University of Michigan awarded 7,450 Bachelor's degrees, 4,677 Master's degrees, 874 research-based doctorates, and 682 professional doctorates. What we've seen so far in this lesson is a bit of review from what we have done in previous lessons. I don't expect that you could have done these steps on your own, but I hope these steps jogged some memories. Even if the syntax is still hard to master, I hope your mind is swimming with questions to ask with these datasets! How do these numbers break down by gender? Race or ethnicity? Do they vary over year? Which program has the most students? Whoa! We'll get to these but let's see if we can improve upon what we've already done.

Working with spreadsheets

The first thing I would like to do is automate a few steps. To figure out things like "UNITID" and "AWLEVEL" we have had to flip back and forth to Excel. That's a little cumbersome for a single institution, but if we wanted to compare multiple schools. I'm also struggling to remember the codes for the different awards. I'd like the last data frame we generated showing the number of awards given in 2018 by award type to list the actual award rather than the code. To do this we need to learn two new things: how to read Excel files (those that end in xlsx) and how to join two data frames together.

We've seen that the "AWLEVEL" codes are provided in the "Frequencies" tab of ipeds/c2018_a.xlsx. We can read in an Excel file using the read_excel function from the readxl package, which is part of tidyverse.

library(readxl)

read_excel("ipeds/c2018_a.xlsx")
## # A tibble: 14 x 2
##    `File Documentation for the Completi… ...2                                   
##    <chr>                                 <chr>                                  
##  1 (Provisional release)                 <NA>                                   
##  2 <NA>                                  <NA>                                   
##  3 Filename                              C2018_A                                
##  4 <NA>                                  <NA>                                   
##  5 Overview                              This file contains the number of award…
##  6 Note:                                 Preliminary release data have been edi…
##  7 <NA>                                  <NA>                                   
##  8 Contents of spreadsheet               <NA>                                   
##  9 <NA>                                  <NA>                                   
## 10 Varlist                               Variable list: Lists all variables in …
## 11 Description                           Long description or glossary definitio…
## 12 Frequencies                           This worksheet contains the code value…
## 13 Statistics                            This worksheet lists the sum, mean, mi…
## 14 Imputation code values                This worksheet lists the code values f…

From the output, you should notice that it read in the first page of the workbook - "Introduction". We'd like the "Frequencies" tab. Can you use ?read_excel to figure out how to get the "Frequencies" tab?

read_excel("ipeds/c2018_a.xlsx", sheet="Frequencies")
## # A tibble: 1,438 x 6
##    varnumber varname codevalue valuelabel                      frequency percent
##        <dbl> <chr>   <chr>     <chr>                               <dbl>   <dbl>
##  1     32000 CIPCODE 99        Grand total                         20385    7.03
##  2     32000 CIPCODE 01.0000   Agriculture, General                  249    0.09
##  3     32000 CIPCODE 01.0101   Agricultural Business and Mana…       185    0.06
##  4     32000 CIPCODE 01.0102   Agribusiness/Agricultural Busi…       203    0.07
##  5     32000 CIPCODE 01.0103   Agricultural Economics                121    0.04
##  6     32000 CIPCODE 01.0104   Farm/Farm and Ranch Management         84    0.03
##  7     32000 CIPCODE 01.0105   Agricultural/Farm Supplies Ret…        30    0.01
##  8     32000 CIPCODE 01.0106   Agricultural Business Technolo…        16    0.01
##  9     32000 CIPCODE 01.0199   Agricultural Business and Mana…        24    0.01
## 10     32000 CIPCODE 01.0201   Agricultural Mechanization, Ge…        45    0.02
## # … with 1,428 more rows

Cool, eh? There is nothing inherently wrong with using Excel! Many of us find it very useful for recording data and sharing it with others. As you grow in your comfort with R, you may find it is easier to work with the data in R rather than Excel. I'm not really sure how I would do the commands in this lesson in Excel, but they will grow to become straightforward for you in R. An added motivation, is that instead of having to remember your mouse clicks and communicating those to someone else, you can record your R commands in a script to share with a friend or to apply on the 2019 data when they are released.

Let's focus on the rows for the "AWLEVEL", get rid of the extraneous columns, and save it as a new data frame

award_code <- read_excel("ipeds/c2018_a.xlsx", sheet="Frequencies") %>%
	filter(varname == "AWLEVEL") %>%
	select(codevalue, valuelabel)

Joining data frames together

Great! Now we want to join our two data frames. We can do this using a dplyr function called inner_join. This function expects two data frames that will be merged using either a column name that is in both data frames or that is specified by us. If a value is missing from either data frame is is not retained in the resulting data frame (see ?left_join, ?right_join, ?full_join for how to get other behaviors). For example, assume we have two data frames:

animal_sounds <- tibble(animal = c("cat", "dog", "fish", "pig"),
		sound = c("meow", "bark", "blub", "oink"))
animal_legs <- tibble(animal = c("cat", "chicken", "pig", "fish"),
		n_legs = c(4, 2, 0, 4))

inner_join(animal_sounds, animal_legs)
## # A tibble: 3 x 3
##   animal sound n_legs
##   <chr>  <chr>  <dbl>
## 1 cat    meow       4
## 2 fish   blub       4
## 3 pig    oink       0

In this example, because there is an animal column in both data frames, it is used to join the data frames. Here, inner_join returns a new data frame with three rows and three columns. You'll see that there's no row for "dog" or "chicken" because those animals were not found in both data frames. You'll also see that we have all three columns represented by the two data frames - animal, sound, n_legs. To see how we join data frames, like ours, where the two data frames don't have a common column between them let's add a third data frame

pats_farm <- tibble(type=c("cat", "dog", "chicken", "pig", "fish"),
		he_has=c(TRUE, TRUE, TRUE, TRUE, FALSE))

The pats_farm doesn't have an animal column, it has a type column. We could use rename to "fix" pats_farm. But we can also leave it is by providing an extra argument to inner_join

inner_join(animal_sounds, pats_farm, by=c("animal"="type"))
## # A tibble: 4 x 3
##   animal sound he_has
##   <chr>  <chr> <lgl> 
## 1 cat    meow  TRUE  
## 2 dog    bark  TRUE  
## 3 fish   blub  FALSE 
## 4 pig    oink  TRUE

Note that even in the case where we have a column in common between the data frames, we can tell inner_join which we want to join on. This can be helpful if there are multiple columns in common between the data frames and we want to exclude one of them. See if you can see the subtle difference between setting the by argument and not in this case

inner_join(animal_sounds, animal_legs, by="animal")
## # A tibble: 3 x 3
##   animal sound n_legs
##   <chr>  <chr>  <dbl>
## 1 cat    meow       4
## 2 fish   blub       4
## 3 pig    oink       0

Finally, we can chain together our joins if we have multiple data frames to join

inner_join(animal_sounds, animal_legs, by="animal") %>%
	inner_join(., pats_farm, by=c("animal" = "type"))
## # A tibble: 3 x 4
##   animal sound n_legs he_has
##   <chr>  <chr>  <dbl> <lgl> 
## 1 cat    meow       4 TRUE  
## 2 fish   blub       4 FALSE 
## 3 pig    oink       0 TRUE

To put inner_join into a pipeline like we did here, you need to put a . where data from the pipeline goes. Also, notice that for our by argument, the order of the columns needs to match their representation in the inner_join. For example, if we did inner_join(pats_farm, ., by=c("animal" = "type")) we will get an error, but if we do inner_join(pats_farm, ., by=c("type" = "animal")), we won't.

Let's try this with our um_degrees and award_code data frames!

inner_join(um_degrees, award_code, by=c("AWLEVEL"="codevalue"))
## # A tibble: 2 x 3
##   AWLEVEL n_graduates valuelabel                             
##   <chr>         <dbl> <chr>                                  
## 1 17              874 Doctor's degree - research/scholarship 
## 2 18              682 Doctor's degree - professional practice

Fantastic! Oops, maybe not. We only have our rows corresponding to the doctorate degrees. You might notice a subtle difference between the values in codevalue from award_code and AWLEVEL in um_degrees. The single digit values in AWLEVEL have a "0" on the left side to make the code a two digit string; the values in codevalue don't have the padding. We can add pad the numbers using the str_pad function, which is part of the strignr package that was loaded when we did library(tidyverse). We'll see more of functions from stringr in the next few lessons. It's a powerful package for working with string data. Look at ?str_pad to see if you can figure out how we might apply it for our case.

str_pad takes a string to be padded, the width to pad to, the side to pad (left is the default), and what to use as the pad (a space is the default). Say we have a set of numbers (i.e. a vector) from 1 to 20, we can define that as 1:20

x <- 1:20

# can you figure out why these two versions of `str_pad` do the same thing?
str_pad(string=x, width=2, side="left", pad="0")

# or

str_pad(x, 2, pad="0")
##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20"
##  [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20"

Do you remember how we can modify an existing column? mutate! Let's modify the codevalue column

award_code <- read_excel("ipeds/c2018_a.xlsx", sheet="Frequencies") %>%
	filter(varname == "AWLEVEL") %>%
	select(codevalue, valuelabel) %>%
	mutate(codevalue = str_pad(codevalue, 2, pad="0"))

Now that the formatting of codevalue is fixed, let's try that join again

inner_join(um_degrees, award_code, by=c("AWLEVEL"="codevalue"))
## # A tibble: 6 x 3
##   AWLEVEL n_graduates valuelabel                             
##   <chr>         <dbl> <chr>                                  
## 1 05             7450 Bachelor's degree                      
## 2 06               43 Postbaccalaureate certificate          
## 3 07             4677 Master's degree                        
## 4 08              185 Post-master's certificate              
## 5 17              874 Doctor's degree - research/scholarship 
## 6 18              682 Doctor's degree - professional practice

We can clean up the output a little using select and rename

inner_join(um_degrees, award_code, by=c("AWLEVEL"="codevalue")) %>%
	select(valuelabel, n_graduates) %>%
	rename("degree" = "valuelabel")
## # A tibble: 6 x 2
##   degree                                  n_graduates
##   <chr>                                         <dbl>
## 1 Bachelor's degree                              7450
## 2 Postbaccalaureate certificate                    43
## 3 Master's degree                                4677
## 4 Post-master's certificate                       185
## 5 Doctor's degree - research/scholarship          874
## 6 Doctor's degree - professional practice         682

Exercises

1. Use the left_join, right_join, and full_join functions to merge animal_sounds and animal_legs. What happens with these functions (and inner_join) if you switch the order of animal_sounds and animal_legs in the arguments to these functions?

# only information for shared animals is outputted
inner_join(animal_sounds, animal_legs, by="animal")
## # A tibble: 3 x 3
##   animal sound n_legs
##   <chr>  <chr>  <dbl>
## 1 cat    meow       4
## 2 fish   blub       4
## 3 pig    oink       0
# only information for animals in animal_sounds is outputted; missing get NA
left_join(animal_sounds, animal_legs, by="animal")
## # A tibble: 4 x 3
##   animal sound n_legs
##   <chr>  <chr>  <dbl>
## 1 cat    meow       4
## 2 dog    bark      NA
## 3 fish   blub       4
## 4 pig    oink       0
# only information for animals in animal_legs is outputted; missing get NA
right_join(animal_sounds, animal_legs, by="animal")
## # A tibble: 4 x 3
##   animal  sound n_legs
##   <chr>   <chr>  <dbl>
## 1 cat     meow       4
## 2 fish    blub       4
## 3 pig     oink       0
## 4 chicken <NA>       2
# information from both data frames is outputted; missing get NA
full_join(animal_sounds, animal_legs, by="animal")
## # A tibble: 5 x 3
##   animal  sound n_legs
##   <chr>   <chr>  <dbl>
## 1 cat     meow       4
## 2 dog     bark      NA
## 3 fish    blub       4
## 4 pig     oink       0
## 5 chicken <NA>       2
# the order of the columns is determined by the order of the data frames
inner_join(animal_legs, animal_sounds, by="animal")
## # A tibble: 3 x 3
##   animal n_legs sound
##   <chr>   <dbl> <chr>
## 1 cat         4 meow 
## 2 pig         0 oink 
## 3 fish        4 blub

2. Which programs at the University of Michigan awarded a Postbaccalaureate certificate (AWLEVEL: 06) in 2018? What about a Post-master's certificate (AWLEVEL: 08)?

cip_codes <- read_excel("ipeds/c2018_a.xlsx", sheet="Frequencies") %>%
	filter(varname == "CIPCODE" & codevalue != 99) %>%
	rename("major" = "valuelabel" ) %>%
	select(codevalue, major)

degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & AWLEVEL == "06" & CIPCODE != 99 & CTOTALT != 0) %>%
	select(CIPCODE, CTOTALT) %>%
	inner_join(., cip_codes, by=c("CIPCODE"="codevalue"))

degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & AWLEVEL == "08" & CIPCODE != 99 & CTOTALT != 0) %>%
	select(CIPCODE, CTOTALT) %>%
	inner_join(., cip_codes, by=c("CIPCODE"="codevalue")) %>%
	arrange(desc(CTOTALT)) %>%
	print(n=Inf)
## # A tibble: 1 x 3
##   CIPCODE CTOTALT major               
##   <chr>     <dbl> <chr>               
## 1 51.2299      43 Public Health, Other
## # A tibble: 30 x 3
##    CIPCODE CTOTALT major                                                        
##    <chr>     <dbl> <chr>                                                        
##  1 11.0802      41 Data Modeling/Warehousing and Database Administration        
##  2 30.9999      28 Multi-/Interdisciplinary Studies, Other                      
##  3 44.0701      19 Social Work                                                  
##  4 11.9999      13 Computer and Information Sciences and Support Services, Other
##  5 51.2706      10 Medical Informatics                                          
##  6 44.0501       8 Public Policy Analysis, General                              
##  7 50.9999       7 Visual and Performing Arts, Other                            
##  8 52.1501       7 Real Estate                                                  
##  9 26.0806       6 Human/Medical Genetics                                       
## 10 30.1501       6 Science, Technology and Society                              
## 11 05.0201       4 African-American/Black Studies                               
## 12 30.0601       4 Systems Science and Theory                                   
## 13 30.1401       4 Museology/Museum Studies                                     
## 14 05.0101       3 African Studies                                              
## 15 05.0207       3 Women's Studies                                              
## 16 27.0501       3 Statistics, General                                          
## 17 14.1001       2 Electrical and Electronics Engineering                       
## 18 26.0102       2 Biomedical Sciences, General                                 
## 19 38.0206       2 Jewish/Judaic Studies                                        
## 20 50.0904       2 Music Theory and Composition                                 
## 21 51.2205       2 Health/Medical Physics                                       
## 22 03.0201       1 Natural Resources Management and Policy                      
## 23 05.0106       1 European Studies/Civilization                                
## 24 05.0203       1 Hispanic-American, Puerto Rican, and Mexican-American/Chican…
## 25 26.1201       1 Biotechnology                                                
## 26 26.1301       1 Ecology                                                      
## 27 30.1301       1 Medieval and Renaissance Studies                             
## 28 50.0601       1 Film/Cinema/Video Studies                                    
## 29 51.9999       1 Health Professions and Related Clinical Sciences, Other      
## 30 52.1399       1 Management Sciences and Quantitative Methods, Other

3. Which major at the University of Michigan awarded the most Bachelor's degrees? Master's? Doctorates (Research)? Doctorates (Professional)? Can you join a data frame with the CIPCODES data?

cip_codes <- read_excel("ipeds/c2018_a.xlsx", sheet="Frequencies") %>%
	filter(varname == "CIPCODE" & codevalue != 99) %>%
	rename("major" = "valuelabel" ) %>%
	select(codevalue, major)


um_degrees_majors <- degrees %>%
	filter(UNITID == 170976 & MAJORNUM == 1 & CIPCODE != 99) %>%
	group_by(AWLEVEL, CIPCODE) %>%
	summarize(n_graduates = sum(CTOTALT)) %>%
	ungroup() %>%
	inner_join(., cip_codes, by=c("CIPCODE" = "codevalue")) %>%
	inner_join(., award_code, by=c("AWLEVEL" = "codevalue")) %>%
	arrange(desc(n_graduates)) %>%
	rename("degree" = "valuelabel") %>%
	select(degree, AWLEVEL, major, n_graduates) %>%
	arrange(desc(n_graduates))


um_degrees_majors %>% filter(AWLEVEL == "05")
um_degrees_majors %>% filter(AWLEVEL == "07")
um_degrees_majors %>% filter(AWLEVEL == "17")
um_degrees_majors %>% filter(AWLEVEL == "18")
## # A tibble: 152 x 4
##    degree           AWLEVEL major                                    n_graduates
##    <chr>            <chr>   <chr>                                          <dbl>
##  1 Bachelor's degr… 05      Computer and Information Sciences, Gene…         624
##  2 Bachelor's degr… 05      Business Administration and Management,…         544
##  3 Bachelor's degr… 05      Economics, General                               440
##  4 Bachelor's degr… 05      Experimental Psychology                          351
##  5 Bachelor's degr… 05      Physiological Psychology/Psychobiology           328
##  6 Bachelor's degr… 05      Political Science and Government, Gener…         295
##  7 Bachelor's degr… 05      Mechanical Engineering                           279
##  8 Bachelor's degr… 05      Speech Communication and Rhetoric                247
##  9 Bachelor's degr… 05      Industrial Engineering                           195
## 10 Bachelor's degr… 05      Neuroscience                                     195
## # … with 142 more rows
## # A tibble: 153 x 4
##    degree         AWLEVEL major                                      n_graduates
##    <chr>          <chr>   <chr>                                            <dbl>
##  1 Master's degr… 07      Business Administration and Management, G…         678
##  2 Master's degr… 07      Social Work                                        329
##  3 Master's degr… 07      Electrical and Electronics Engineering             307
##  4 Master's degr… 07      Mechanical Engineering                             186
##  5 Master's degr… 07      Information Science/Studies                        178
##  6 Master's degr… 07      Architectural and Building Sciences/Techn…         154
##  7 Master's degr… 07      Systems Engineering                                146
##  8 Master's degr… 07      Education, General                                 123
##  9 Master's degr… 07      Health/Health Care Administration/Managem…         114
## 10 Master's degr… 07      Natural Resources/Conservation, General            107
## # … with 143 more rows
## # A tibble: 104 x 4
##    degree                  AWLEVEL major                             n_graduates
##    <chr>                   <chr>   <chr>                                   <dbl>
##  1 Doctor's degree - rese… 17      Electrical and Electronics Engin…          55
##  2 Doctor's degree - rese… 17      Chemistry, General                         46
##  3 Doctor's degree - rese… 17      Mechanical Engineering                     39
##  4 Doctor's degree - rese… 17      Physics, General                           34
##  5 Doctor's degree - rese… 17      Aerospace, Aeronautical and Astr…          33
##  6 Doctor's degree - rese… 17      Computer Engineering, General              29
##  7 Doctor's degree - rese… 17      Economics, General                         29
##  8 Doctor's degree - rese… 17      Education, General                         23
##  9 Doctor's degree - rese… 17      Experimental Psychology                    23
## 10 Doctor's degree - rese… 17      Chemical Engineering                       21
## # … with 94 more rows
## # A tibble: 7 x 4
##   degree                         AWLEVEL major                       n_graduates
##   <chr>                          <chr>   <chr>                             <dbl>
## 1 Doctor's degree - professiona… 18      Law                                 293
## 2 Doctor's degree - professiona… 18      Medicine                            165
## 3 Doctor's degree - professiona… 18      Dentistry                           128
## 4 Doctor's degree - professiona… 18      Pharmacy                             82
## 5 Doctor's degree - professiona… 18      Registered Nursing/Registe…           9
## 6 Doctor's degree - professiona… 18      Public Health/Community Nu…           4
## 7 Doctor's degree - professiona… 18      Geriatric Nurse/Nursing               1

4. Without looking up numbers in Excel, can you compare the total number of different degrees granted in 2018 at the "University of Michigan-Ann Arbor" and "Michigan State University"?

institution_code <- read_csv("ipeds/hd2018.csv") %>%
	filter(INSTNM == "University of Michigan-Ann Arbor" | INSTNM == "Michigan State University") %>%
	select(UNITID, INSTNM)

inner_join(degrees, institution_code, by="UNITID") %>%
	filter(MAJORNUM == 1) %>%
	group_by(AWLEVEL, INSTNM) %>%
	summarize(n_graduates = sum(CTOTALT)) %>%
	ungroup() %>%
	inner_join(., award_code, by=c("AWLEVEL"="codevalue")) %>%
	select(INSTNM, valuelabel, n_graduates) %>%
	rename("institution"="INSTNM", "degree_type"="valuelabel")
## # A tibble: 13 x 3
##    institution                degree_type                            n_graduates
##    <chr>                      <chr>                                        <dbl>
##  1 Michigan State University  Award of at least 1 but less than 2 a…         342
##  2 Michigan State University  Bachelor's degree                            18240
##  3 University of Michigan-An… Bachelor's degree                            14900
##  4 Michigan State University  Postbaccalaureate certificate                  348
##  5 University of Michigan-An… Postbaccalaureate certificate                   86
##  6 Michigan State University  Master's degree                               4480
##  7 University of Michigan-An… Master's degree                               9354
##  8 Michigan State University  Post-master's certificate                        4
##  9 University of Michigan-An… Post-master's certificate                      370
## 10 Michigan State University  Doctor's degree - research/scholarship        1106
## 11 University of Michigan-An… Doctor's degree - research/scholarship        1748
## 12 Michigan State University  Doctor's degree - professional practi…        1166
## 13 University of Michigan-An… Doctor's degree - professional practi…        1364