---
title: "Data Cleaning Guidelines"
author: ""
date: "10/08/2019"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The purpose of this document is to provide you with guidelines for investigating and cleaning data from secondary sources (e.g., National Center for Education Statistics datasets). Data cleaning refers to the process of ensuring that your data is correct, consistent, and usable for analyses by identifying any errors or missing values in the data and either correcting or deleting them.
# 1. Criteria for Data Quality:
This document will provide a step-by-step process for performing quality checks. But what are these steps trying to accomplish? What is data quality and how do we measure it? High-quality data need to pass the following set of quality criteria: validity, accuracy, completeness, consistency, and uniformity.
## 1.1 Validity:
The extent to which the data conform to defined rules or constraints. Several of the constraints below are illustrated in the code sketch that follows them.
_Data-type constraints_: values in a particular column must be of a particular data type (e.g., numeric, character, factor).
_Range constraints_: numbers or dates should typically fall within a certain range; that is, they have minimum and/or maximum permissible values (e.g., age cannot be negative).
_Set-membership constraints_: values of a column come from a set of discrete values (e.g., a factor variable for compass directions has the predefined values North, South, East, West).
_Regular-expression patterns_: text fields must match a certain pattern (e.g., phone number).
_Mandatory constraints_: certain columns cannot be empty (e.g., student identification number).
_Unique constraints_: a field, or a combination of fields, must be unique across a dataset (e.g., social security number).
_Cross-field validation_: certain conditions that span multiple fields must hold (e.g., a student's graduation date cannot be earlier than their enrollment date).
_Foreign-key constraints_: as in relational databases, a foreign-key column cannot have a value that does not exist in the referenced primary key. For example, in our recruiting events database, the "state" column must belong to one of the defined US states or territories, and the set of permissible states/territories is recorded in a separate States table.
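A few of these constraints can be checked directly in R. Below is a minimal sketch using a small hypothetical `students` data frame (all names and values are illustrative, not from any dataset used later in this document):
```{r, eval=FALSE}
library(dplyr); library(stringr)
#hypothetical data for illustration only
students <- tibble(
  id    = c("001", "002", "002"),
  age   = c(17, -1, 15),
  state = c("CA", "WA", "ZZ"),
  phone = c("310-555-0101", "555-0102", "206-555-0103")
)
#range constraint: flag negative ages
students %>% filter(age < 0)
#set-membership constraint: flag states outside the permissible set
students %>% filter(!state %in% state.abb)
#regular-expression pattern: flag phone numbers not matching ###-###-####
students %>% filter(!str_detect(phone, "^\\d{3}-\\d{3}-\\d{4}$"))
#unique constraint: flag duplicated ids
students %>% count(id) %>% filter(n > 1)
```
Each `filter()` call returns the rows that violate the constraint, so an empty result means the check passed.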
## 1.2 Accuracy:
The degree to which the data are close to the true values of what we are attempting to measure. While defining all valid values allows us to easily identify invalid ones, it does not guarantee that valid values are accurate. For example, a student's grade level may be a valid value (12th grade) but still inaccurate (the student is actually in 10th grade).
Sometimes, achieving accuracy requires accessing an external source containing the true value. For example, the recruiting database queries the Google maps API to ensure the web-scraped street addresses of recruiting events actually exist and/or to extract more accurate addresses.
## 1.3 Completeness:
The degree to which all required data are known. Missing data happen for many reasons. "Fixing" missing data will nearly always require you to return to the data documentation from the original source.
## 1.4 Consistency:
The degree to which the data are consistent, within the same dataset or across multiple datasets. Inconsistency occurs when two values contradict each other (e.g., a student's valid age may not match their grade level: age 5 and 12th grade). Fixing inconsistencies requires deciding which value is more recent or reliable.
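For instance, a sketch of flagging the age/grade contradiction above, assuming a hypothetical `students` data frame with `age` and `grade_level` columns:
```{r, eval=FALSE}
#flag contradictory observations: a 12th grader younger than, say, 16
#(the age threshold is illustrative)
students %>% filter(grade_level == 12 & age < 16)
```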
## 1.5 Uniformity:
The degree to which the data are specified using the same unit of measurement (e.g., distance in miles or kilometers).
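For example, if some distances were recorded in kilometers and others in miles, a sketch of converting everything to one unit (assuming a hypothetical `events` data frame with `distance` and `unit` columns):
```{r, eval=FALSE}
#convert all distances to miles (1 km is about 0.621371 miles)
events <- events %>%
  mutate(distance_miles = ifelse(unit == "km", distance * 0.621371, distance))
```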
# 2. Workflow for Producing High-Quality Data
Below we will review an iterative sequence of steps to produce high-quality data that incorporate the above criteria. Section 2.1 begins these steps by providing several different options for importing data into R.
## 2.1 Loading libraries and importing data:
_Load libraries_
```{r, eval=FALSE}
library(tidyverse); library(tidyr); library(readr); library(lubridate)
library(stringr); library(kableExtra); library(labelled); library(haven)
library(stargazer); library(Hmisc); library(sjlabelled)
```
_Import data example_: load local file via relative filepath
```{r, eval=FALSE}
#Use relative file path to data
getwd() # view working directory
load("data/recruiting/recruit_ps1_small.Rdata") #load data
```
_Import data example_: download, unzip, and read CSV data
```{r, eval=FALSE}
# Download file from the IPEDS website
download.file("https://nces.ed.gov/ipeds/datacenter/data/HD2017.zip",
              destfile = "hd2017.zip", mode = "wb")
# Unzip the file, keeping the original file names
unzip(zipfile = "hd2017.zip")
# Read the CSV file
# read.csv() does not always assign the correct column data types; use colClasses to specify them
# na.strings tells R to treat both "NA" and empty strings in character columns as missing
hd <- read.csv("hd2017.csv",
               colClasses = c(ZIP = "character", UNITID = "factor"),
               na.strings = c("", "NA"), encoding = "UTF-8")
```
_Import data example_: "readxl" package reads in excel spreadsheets
```{r,eval=FALSE}
library(readxl)
# Specify sheet by its name
excel_example <- read_excel("data/fsa/peps300.xlsx", sheet = "300")
# Specify sheet by its index
excel_example <- read_excel("data/fsa/peps300.xlsx", sheet = 1)
```
_Import data example_: "haven" package reads in SPSS (read_sav), SAS (read_sas), Stata (read_dta)
```{r,eval=FALSE}
# Import HSLS data from website
hsls <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/hsls/hsls_stu_small.dta")
# Import NLS72 data from website
nls_crs <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/nls72/nls72petscrs_v2.dta", encoding=NULL)
```
## 2.2 Investigate Data
Investigating data is time-consuming and requires using many different methods to detect unexpected, incorrect, and inconsistent data. Investigating the data before any manipulation provides you with a general idea about the quality of the data. While this is not a linear process, here is an outline of general steps to take when first investigating data. We provide syntax examples of each step below.
1. Investigate the dataframe
2. Sort, select, and order data to make it easier to view the dataframe
3. Check codebook/documentation
4. Preliminary variable investigations:
- check for missing observations across all variables
- identify the "unit of analysis" and which combination of variables uniquely identifies observations (a quick check is sketched after this list)
- investigate different variable types: numeric, factor, character, labelled, etc.
- run some summary statistics. For _numerical_ variables: mean, sd, range, quantiles, distribution, outliers, errors. For _categorical_ variables: labels, frequencies.
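For the unit-of-analysis check, a quick sketch is to count observations by the candidate combination of identifying variables and look for duplicates (here using the NLS72 course-level variables from the example below):
```{r, eval=FALSE}
#if this returns zero rows, the combination of id, transnum, termnum, and
#crsename uniquely identifies each observation
nls_crs %>%
  count(id, transnum, termnum, crsename) %>%
  filter(n > 1)
```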
### Example of Investigating Data Using NLS72
- Investigate the data frame
```{r,eval=FALSE}
str(nls_crs) # structure of the data frame
nrow(nls_crs) #num of rows (obs)
ncol(nls_crs) #num of columns
dim(nls_crs) #rows by columns
names(nls_crs) # variable names
glimpse(nls_crs) # prints all columns + type
nls_crs %>% var_label() # variable labels
head(nls_crs) # prints first 6 obs
```
- Sort, select, and order
```{r,eval=FALSE}
# Sort observations so transcript number, term number, and course name are in ascending order
nls_crs_v2 <- nls_crs %>% arrange(id, transnum, termnum, crsename)
# Select and order variables
nls_crs_v2 <- nls_crs_v2 %>% select(id, transnum, termnum, crsecred, gradtype, crsgrada, crsgradb, crsecip, crsename, cname)
```
- Preliminary investigation of variables
1. continuous variable
```{r,eval=FALSE}
#Investigate variable course credits
class(nls_crs_v2$crsecred) # variable class (numeric)
attributes(nls_crs_v2$crsecred)
nls_crs_v2%>% select(crsecred) %>% var_label() #variable label
nls_crs_v2%>% count(crsecred) # frequency count
#Run some descriptive stats: range, mean, sd
nls_crs_v2 %>%
  summarise_at(
    .vars = vars(crsecred),
    .funs = funs(min, max, mean, sd, .args=list(na.rm=TRUE))
  )
#investigate high values of crsecred
nls_crs_v2%>% filter(crsecred>=100) %>% count(crsecred) # frequency table of crsecred greater than or equal to 100
nls_crs_v2%>% filter(crsecred==999) # printing some observations for specific values of crsecred
nls_crs_v2%>% filter(crsecred>=999) %>% count(crsecred) # frequency table of crsecred greater than or equal to 999
```
2. categorical- nominal variable
```{r,eval=FALSE}
#Investigate variable type of grade
class(nls_crs_v2$gradtype) # variable class (labelled)
nls_crs_v2%>% select(gradtype) %>% var_label() #variable label
attributes(nls_crs_v2$gradtype)
nls_crs_v2 %>% select(gradtype) %>% val_labels() #value labels
nls_crs_v2 %>% count(gradtype) #freq count of values
nls_crs_v2 %>% count(gradtype) %>% as_factor() #freq count with value labels
nls_crs_v2%>% filter(gradtype %in% c(9)) #printing some observations for specific values
```
3. categorical- ordinal variable
```{r,eval=FALSE}
#Investigate variable transcript number
class(nls_crs_v2$transnum) # variable class (numeric)
attributes(nls_crs_v2$transnum) # variable attributes
nls_crs_v2%>% select(transnum) %>% var_label() #variable labels
nls_crs_v2%>% count(transnum) # frequency
#Check that the counts across values of transnum sum to the number of rows in the dataset
nls_crs_v2 %>%
  group_by(transnum) %>% #group by transnum
  summarise(count_transnum = n()) %>% #count for each value of transnum
  ungroup() %>% #ungroup
  mutate(total_obs = sum(count_transnum)) #sum of counts should equal the number of observations in the dataset
```
- Check Codebook
- [NLS-72 Postsecondary Education Transcript Study Data File User's Manual](https://github.com/ozanj/rclass/blob/master/data/nls72/NLS72_codebook.pdf)
- [NLS-72 Postsecondary Education Transcript Files Supplementary Addendum](https://github.com/ozanj/rclass/blob/master/data/nls72/NLS72_suppaddendum.pdf)
## 2.3 Cleaning Data and Creating New Variables:
Data cleaning involves removing, correcting, or imputing anomalies. This process usually involves the creation of new variables (i.e., analysis variables). To create new variables, you should:
1. Investigate input variable(s)
- keep survey questionnaire and codebooks on hand to understand how input variables were created
- explore one-way investigations of input variables
- explore relationships between input variables
2. Keep input variable(s) the same
3. Create a new analysis variable from input variable(s)
4. Add variable labels, definitions, value labels
5. Run summary statistics for input and new variable to compare and verify new variables were created correctly
In following these steps, the goal is to identify and correct:
_Irrelevant data_: data (column-wise or row-wise) that are not actually needed and do not fit the context of the problem we are trying to solve.
_Duplicates_: data points that are repeated in your data due to a combination of different data sources, multiple submissions, etc.
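A minimal sketch of finding and dropping exact duplicate rows with dplyr (`df` is a hypothetical data frame):
```{r, eval=FALSE}
sum(duplicated(df)) #number of fully duplicated rows
df_deduped <- df %>% distinct() #keep only one copy of each row
```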
_Type conversion_: data that needs to be converted into a different type/class in order to be stored correctly. For example, a date variable should be stored as a date object or a categorical variable should be converted from a string type to a factor class.
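A sketch of common conversions (the column names are illustrative):
```{r, eval=FALSE}
df <- df %>%
  mutate(
    start_date = as.Date(start_date, format = "%m/%d/%Y"), #string to date
    region     = as.factor(region),                        #string to factor
    enrollment = as.numeric(enrollment)                    #string to numeric
  )
```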
_Syntax errors_: character data that need some string manipulation in order to be correct or usable in analyses (a sketch follows this list), such as:
- Removing white spaces: Extra white spaces at the beginning or the end of a string should be removed.
- Padding strings: Strings can be padded with spaces or other characters to a certain width. For example, some numerical codes are often represented with leading or trailing zeros to ensure they always have the same number of digits (e.g., ID variables).
- Fixing typos: the same value can be entered in many different ways (e.g., fem, female, f), which often leads to errors. These inconsistencies can be fixed by mapping each variant to the single expected value via pattern matching (e.g., regular expressions) or fuzzy matching.
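A sketch of these string fixes using stringr (the `name`, `zip`, and `gender` columns are hypothetical):
```{r, eval=FALSE}
library(stringr)
df <- df %>%
  mutate(
    name   = str_trim(name), #remove leading/trailing white space
    zip    = str_pad(zip, width = 5, side = "left", pad = "0"), #pad with leading zeros to width 5
    #map inconsistent entries (fem, Female, f, ...) to one expected value
    gender = ifelse(str_detect(str_to_lower(gender), "^f"), "female", gender)
  )
```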
_Standardization_: data points that are equivalent but not in the same standardized format. For example, strings should be consistently lower- or upper-case, and numerical values should share a measurement unit.
_Scaling / Transformation_: data that need to be transformed to fit a specific scale. For example, variables indicating a percentage should range from 0 to 100 (rather than the proportional range 0 to 1), grade point average should range from 0 to 4.0 (if using a 4.0 scale), and log, square root, or inverse transformations can reduce skewness.
_Normalization_: data that need to be rescaled to a common range or distribution (e.g., 0 to 1, or mean 0 and standard deviation 1) so that variables are comparable. For example, test scores can be standardized based on the mean and standard deviation.
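A sketch of both transformations (assuming hypothetical `prop_admitted` and `test_score` columns):
```{r, eval=FALSE}
df <- df %>%
  mutate(
    pct_admitted = prop_admitted * 100, #scale proportion (0-1) to percentage (0-100)
    test_score_z = (test_score - mean(test_score, na.rm = TRUE)) /
      sd(test_score, na.rm = TRUE) #standardize to mean 0, sd 1
  )
```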
_Missing values_: data that have missing observations. Missing data are a common problem across data sources. Ignoring missing data, or treating them incorrectly because you do not understand the rationale(s) behind their missingness, is a fatal error! Missing data are generally treated in one of three ways (sketched in code after this list):
- Drop missing observations if occurring at random: If missing values for a variable rarely occur and occur at random, sometimes it's easiest to drop missing observations. While this reduces the total number of observations in your dataset, it is sometimes preferable when conducting statistical analyses where "filling" missing observations could bias your results.
- Flag to avoid losing information: sometimes missing data do not occur at random. If so, we usually do not want to drop these observations because the missingness is informative in itself (for example, on a campus climate survey question, all missing values come from respondents of the same racial/ethnic group declining to answer). Missing observations for numeric data can be converted to `NA`, which can be ignored when calculating statistics or plotting distributions. For categorical data, you can create a separate "Missing" category and assign this category to all missing observations.
- Impute missing data: replace missing data by calculating values based on other observations. For example, you can replace all missing observations with statistical values (e.g., mean, median), generate random values within 1 or 2 standard deviations of the mean if the data are normally distributed and missing at random, or use linear regression to predict the value of the missing observation based on relationships with other variables. Keep in mind these imputations may yield unexpected or biased results.
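A sketch of the three treatments (the `income` and `race` columns are hypothetical; which treatment is appropriate depends on why the data are missing):
```{r, eval=FALSE}
#1. drop observations with missing values (only if missing at random)
df_drop <- df %>% filter(!is.na(income))
#2. flag: create a separate "Missing" category for categorical data
df <- df %>% mutate(race = ifelse(is.na(race), "Missing", race))
#3. impute: replace missing values with the mean (may bias results)
df <- df %>%
  mutate(income = ifelse(is.na(income), mean(income, na.rm = TRUE), income))
```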
_Outliers_: observations that deviate substantially from all other observations. For example, the 2017 average annual household income in Seattle was \$100,630, but Bill Gates, an outlier, earned \$14.7 billion. "True values until proven otherwise" is a good rule of thumb when investigating outliers; outliers should only be removed with good reason.
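A sketch of flagging (not dropping) potential outliers with the common rule of 1.5 times the interquartile range (IQR), assuming a hypothetical numeric `income` column:
```{r, eval=FALSE}
q <- quantile(df$income, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
#rows beyond 1.5 * IQR from the quartiles are candidates for investigation
df %>% filter(income < q[1] - 1.5 * iqr | income > q[2] + 1.5 * iqr)
```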
_In-record & cross-datasets errors_: these are data observations that contradict values between columns or across different datasets. For example a student's grade level value is 1st grade but marital status is divorced.
### Example of Cleaning Data and Creating Variables Using NLS72
Task: Create a numeric grade variable based on input variables `crsgrada`, `crsgradb`, `gradtype`, and `crsecredv2`.
__Investigate input variable(s): One Way Investigations__
```{r, eval=FALSE}
# crsgrada
#investigate type, class, and attributes
typeof(nls_crs_v2$crsgrada) #character variable
class(nls_crs_v2$crsgrada)
attributes(nls_crs_v2$crsgrada)
#one-way investigation
nls_crs_v2%>% count(crsgrada) #frequency count
nls_crs_v2%>% filter(crsgrada %in% c("99", "AU", "CR", "I", "NO", "P", "S", "U", "W", "WP")) #printing some obs for specific values of crsgrada
#missing values
nls_crs_v2%>% count(crsgrada) #per supp addendum doc p.16, value 99 is "unknown" or "missing"
# crsgradb
#investigate type, class, and attributes
typeof(nls_crs_v2$crsgradb) #numeric double variable
class(nls_crs_v2$crsgradb)
attributes(nls_crs_v2$crsgradb)
#one-way investigation
nls_crs_v2 %>% #run some descriptive stats
  summarise_at(
    .vars = vars(crsgradb),
    .funs = funs(min, max, mean, .args=list(na.rm=TRUE))
  )
#missing values
nls_crs_v2 %>% filter(crsgradb>4) %>% count(crsgradb) #per supp addendum doc p.17, values 999 and 999.999 are "none", largest true value is 4
# gradtype
#investigate type, class, and attributes
typeof(nls_crs_v2$gradtype)
class(nls_crs_v2$gradtype)
attributes(nls_crs_v2$gradtype)
#one-way investigation
nls_crs_v2%>% count(gradtype) #frequency count
nls_crs_v2%>% filter(gradtype %in% c(1, 2)) #printing some obs for specific values
#missing values
nls_crs_v2%>% filter(gradtype %in% c(9)) #per codebook, pg.58 definition, pg. 523 true categories, value 9 is "missing"
# crsecred
#investigate type, class, and attributes
typeof(nls_crs_v2$crsecred)
class(nls_crs_v2$crsecred)
attributes(nls_crs_v2$crsecred)
#one-way investigation
nls_crs_v2 %>% #run some descriptive stats
  summarise_at(
    .vars = vars(crsecred),
    .funs = funs(min, max, mean, .args=list(na.rm=TRUE))
  )
#missing values
nls_crs_v2 %>% filter(crsecred>=999) %>% count(crsecred) #values 999 and 999.999 indicate missing course credits
#per codebook pg. 49: 70% of numeric grades were reported on a scale of 0 to 4
# the remaining were recorded on different scales (e.g., 0-100)
# no attempts were made to place numeric grades on a common metric
```
__Investigate input variable(s): Two Way Investigations__
```{r, eval=FALSE}
options(tibble.print_min=50)
#Investigate gradtype, crsgrada, crsgradb
#some tabulations for different values of gradtype and crsgrada
nls_crs_v2 %>% group_by(gradtype) %>% count(crsgrada) # cross tab of vars gradtype & crsgrada
nls_crs_v2 %>% group_by(gradtype) %>% count(crsgrada) %>% as_factor() #cross tab this time show value labels
nls_crs_v2%>% filter(gradtype==1) %>% count(crsgrada) # letter grade
nls_crs_v2%>% filter(gradtype==2) %>% count(crsgrada) # numeric grade
nls_crs_v2%>% filter(gradtype==9) %>% count(crsgrada) # missing
#some tabulations for different values of gradtype and crsgradb
nls_crs_v2 %>% group_by(gradtype) %>% count(crsgradb) # cross tab of vars gradtype & crsgradb
nls_crs_v2 %>% group_by(gradtype) %>% count(crsgradb) %>% as_factor() #cross tab this time show value labels
nls_crs_v2 %>% group_by(gradtype) %>%
  summarise_at(.vars = vars(crsgradb),
               .funs = funs(min, max, .args = list(na.rm = TRUE))) %>%
  as_factor()
nls_crs_v2%>% filter(gradtype==1) %>% count(crsgradb) # letter grade
nls_crs_v2%>% filter(gradtype==2) %>% count(crsgradb) # numeric grade
nls_crs_v2%>% filter(gradtype==9) %>% count(crsgradb) # missing
#some tabulations for different values of crsgrada and crsgradb
nls_crs_v2 %>% group_by(crsgrada) %>% filter(crsgrada %in% c("A+", "A", "A-")) %>% count(crsgradb) %>% as_factor() #cross tab of vars crsgrada & crsgradb for some specific values of crsgrada
nls_crs_v2 %>% group_by(crsgrada) %>% filter(crsgrada %in% c("A+", "A", "A-")) %>%
  summarise_at(.vars = vars(crsgradb),
               .funs = funs(min, max, .args = list(na.rm = TRUE))) %>%
  as_factor()
```
__Keep Input Variable(s), Create Analysis Variables, Treat Missing Values as NA__
```{r, eval=FALSE}
#Create measure of course credits attempted that replaces 999 and 999.999 with missing
#investigate
nls_crs_v2%>% count(crsecred)
nls_crs_v2%>% filter(crsecred==999) # printing some observations for specific values of crsecred
nls_crs_v2%>% filter(crsecred==1000) # printing some observations for specific values of crsecred
nls_crs_v2%>% filter(crsecred>=999) %>% count(crsecred) # frequency table of crsecred values 999 and above
#create new variable
nls_crs_v2 <- nls_crs_v2 %>%
  mutate(crsecredv2 = ifelse(crsecred >= 900, NA, crsecred))
#check
nls_crs_v2%>% filter(crsecred==999) %>% select(crsecred,crsecredv2)
nls_crs_v2%>% filter(crsecred>999) %>% select(crsecred,crsecredv2)
nls_crs_v2%>% filter(crsecred>900) %>% count(crsecredv2) # one-way frequency table
nls_crs_v2%>% filter(crsecred>900) %>% group_by(crsecred) %>% count(crsecredv2) # two-way frequency table
```
```{r, eval=FALSE}
#Create measure of grade scale that replaces 9 with missing
#investigate
nls_crs_v2%>% count(gradtype) %>% as_factor()
nls_crs_v2%>% filter(gradtype==9) # printing some observations for specific values of gradtype
#create new variable
nls_crs_v2 <- nls_crs_v2 %>%
  mutate(gradtypev2 = ifelse(gradtype == 9, NA, gradtype))
#add variable label
nls_crs_v2 <- nls_crs_v2 %>%
  set_variable_labels(
    gradtypev2 = "new type of grade"
  )
#add value label
nls_crs_v2 <- nls_crs_v2 %>%
  set_value_labels(gradtypev2 = c("letter grade" = 1,
                                  "numeric grade" = 2))
#check
nls_crs_v2 %>% filter(gradtype==9) %>% select(gradtype,gradtypev2)
nls_crs_v2 %>% filter(gradtype==9) %>% count(gradtypev2) # one-way frequency table
nls_crs_v2 %>% filter(gradtype==9) %>% group_by(gradtype) %>% count(gradtypev2) # two-way frequency table
```
```{r, eval=FALSE}
# Create new numeric grade variable when type of grade is numeric & course credit not missing
#investigate values of the numeric grade when gradtype==2 and course credit not missing
typeof(nls_crs_v2$crsgradb)
class(nls_crs_v2$crsgradb)
options(tibble.print_min=300)
nls_crs_v2%>% filter(gradtype==2, (!is.na(crsecredv2))) %>% count(crsgradb)
#create new variable
nls_crs_v2 <- nls_crs_v2 %>%
  mutate(crsgradbv2 = ifelse(crsgradb <= 4 & gradtype == 2 & !is.na(crsecredv2),
                             crsgradb, NA))
#check
nls_crs_v2%>% filter(is.na(crsecredv2)) %>% count(crsgradbv2) # course credit is missing
nls_crs_v2%>% filter(gradtype %in% c(1,9)) %>% count(crsgradbv2) # grade type is letter or missing
nls_crs_v2%>% filter(crsgradb>4) %>% count(crsgradbv2) # numeric grade out of range
nls_crs_v2%>% filter(crsgradb<=4, gradtype==2, (!is.na(crsecredv2))) %>% count(crsgradbv2) # tabulation of analysis variable
nls_crs_v2%>% filter(crsgradb<=4, gradtype==2, (!is.na(crsecredv2))) %>% group_by(crsgradb) %>% count(crsgradbv2) # tabulation of analysis variable
nls_crs_v2%>% filter(crsgradb<=4, gradtype==2, (!is.na(crsecredv2))) %>% mutate(assert=crsgradb==crsgradbv2) %>% count(assert)
```
```{r, eval=FALSE}
# Create new numeric grade variable based on letter grade (when type of grade is letter) & numeric grade (when type of grade is numeric)
# investigate
nls_crs_v2%>% count(gradtype)
nls_crs_v2%>% count(gradtype) %>% haven::as_factor()
nls_crs_v2%>% filter(gradtype==1) %>% count(crsgrada)
#options(tibble.print_min=200)
#create new variable
nls_crs_v2 <- nls_crs_v2 %>%
  mutate(
    numgrade = case_when(
      crsgrada %in% c("A+","A") & gradtype==1 & !is.na(crsecredv2) ~ 4,
      crsgrada=="A-" & gradtype==1 & !is.na(crsecredv2) ~ 3.7,
      crsgrada=="B+" & gradtype==1 & !is.na(crsecredv2) ~ 3.3,
      crsgrada=="B"  & gradtype==1 & !is.na(crsecredv2) ~ 3,
      crsgrada=="B-" & gradtype==1 & !is.na(crsecredv2) ~ 2.7,
      crsgrada=="C+" & gradtype==1 & !is.na(crsecredv2) ~ 2.3,
      crsgrada=="C"  & gradtype==1 & !is.na(crsecredv2) ~ 2,
      crsgrada=="C-" & gradtype==1 & !is.na(crsecredv2) ~ 1.7,
      crsgrada=="D+" & gradtype==1 & !is.na(crsecredv2) ~ 1.3,
      crsgrada=="D"  & gradtype==1 & !is.na(crsecredv2) ~ 1,
      crsgrada=="D-" & gradtype==1 & !is.na(crsecredv2) ~ 0.7,
      crsgrada %in% c("F","E","WF") & gradtype==1 & !is.na(crsecredv2) ~ 0,
      crsgradb<=4 & gradtype==2 & !is.na(crsecredv2) ~ crsgradb # use values of numeric var crsgradb
    )
  )
#add variable label
nls_crs_v2 <- nls_crs_v2 %>%
  set_variable_labels(
    numgrade = "new numeric grade"
  )
#check
nls_crs_v2%>% count(numgrade)
nls_crs_v2%>% filter(is.na(crsecredv2)) %>% count(numgrade) # missing when crsecredv2==NA
nls_crs_v2%>% filter(gradtype==9) %>% count(numgrade) # missing when grade type is missing
nls_crs_v2%>% filter(gradtype==1, (!is.na(crsecredv2))) %>% count(numgrade) # when grade-type=letter and course credit is not missing
nls_crs_v2%>% filter(gradtype==1, (!is.na(crsecredv2))) %>% group_by(crsgrada) %>% count(numgrade) #when grade-type=letter and course credit is not missing
#check against values of crsgradb
nls_crs_v2%>% filter(crsgradb>4, gradtype==2, !is.na(crsecredv2)) %>% group_by(crsgradb) %>% count(numgrade) # missing when numeric grade is out of range
```
## 2.4 Reshape and/or merge data:
**reshape:**
- from wide to long: `gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)`
```{r, eval=FALSE}
table208_30_tidy <- table208_30 %>%
  gather(`2000`, `2005`, `2009`, `2010`, `2011`, key = year, value = total_teachers)
```
- from long to wide: `spread(data, key, value, fill = NA, convert = FALSE)`
```{r, eval=FALSE}
all_obs_tidy <- all_obs_temp %>%
  spread(key = age_lev, value = efage09)
```
**merge:**
```{r, eval=FALSE}
load("jkcf_opeid5_list.RData")
jkcf_opeid5_mrc3 <- jkcf_opeid5_list %>%
left_join(mrc3_by_cohort, by = "opeid5")
anti_test <- jkcf_opeid5_list %>%
anti_join(mrc3_by_cohort, by = "opeid5")
```
## 2.5 Export:
Save R data objects (.rds / .RData)
```{r, eval=FALSE}
saveRDS(hsls, "hsls.rds") # saveRDS() can save only one R object
save(nls_crs_v2, nls_crs_trans, file = "./dataset.RData") # save() can save multiple R objects
```
Save as Stata data file
```{r, eval=FALSE}
write_dta(hsls, path = "../../data/hsls/hsls_sch_small.dta")
```
Export to CSV
```{r, eval=FALSE}
write_csv(hsls, path = "../../data/hsls/hsls_sch_small.csv")
```