Fast load and query HILDA Survey in R. Updated to work for a later version of the HILDA survey

raisin-toast/hildar

 
 

hildar

The HILDA Survey is a large panel survey with 20 waves (2001-2020) and counting! Some waves have more than 5000 variables, which makes reading them into R a little challenging (personally I think it is way too slow).

The goal of this package is to provide a quick and easy way to query HILDA data in R. It does this by converting each wave of HILDA from its STATA file (.dta), one of the three formats HILDA provides, to fst format. fst is a binary format that R can read much faster than .dta.
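To illustrate why a columnar binary format helps with wide waves, here is a minimal sketch using the fst package directly. It is not hildar's internal code. The toy data frame and its values are made up for illustration; only the column names are borrowed from HILDA. The point is that read_fst() can load just the columns you ask for, so a query never has to parse thousands of unused variables.

```r
library(fst)

# Toy stand-in for one wave of HILDA (values are invented for illustration).
toy_wave <- data.frame(
  xwaveid = c("0100001", "0100002", "0100003"),
  hgage = c(34, 51, 15),
  hgsex = c("Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

# Write the data to a temporary .fst file.
fst_path <- tempfile(fileext = ".fst")
write_fst(toy_wave, fst_path)

# Read back only two of the three columns; the rest is never loaded.
selected <- read_fst(fst_path, columns = c("xwaveid", "hgage"))
names(selected)
```

With a real wave of several thousand variables, selecting a handful of columns this way is where most of the time saving would come from.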

| Function name | Description |
|---|---|
| hil_setup() | Sets up HILDA fst files for hil_fetch() to use. |
| hil_fetch() | Fetches HILDA records based on query options. |
| hil_dict() | Shows the HILDA data glossary and the waves each variable is available in. This provides a convenient way to select multiple variables based on their description and pass them to hil_fetch(). |
| hil_vars() | Returns all variables whose names match a regular expression. |
| hil_labs() | Returns all variables whose labels match a regular expression. |
| hil_browse() | Opens the HILDA data dictionary page in your default web browser. |
| hil_crosswave_info() | Takes a variable name and searches for its cross-wave information. |
| hil_var_details() | Similar to hil_crosswave_info(), but searches for a variable's details. |

Installation

Install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("asiripanich/hildar")

Setup

1) Store HILDA as .fst files

Use hil_setup() to read HILDA STATA (.dta) files and save them as .fst files. .fst is a binary data format that can be read very quickly, a lot faster than .dta. An additional benefit of hil_setup() is that it creates a HILDA dictionary file that you can later call using hil_dict(). Having a functional hil_dict() allows the user to use hil_vars() and hil_labs() for searching variable names using a regular expression.

hil_setup(
  read_dir = "/path/to/your/hilda-stata-files", 
  save_dir = "/path/to/save/hilda-fst-files"
)

This lets you quickly query HILDA data from all waves using hil_fetch().

2) Tell hildar where the HILDA .fst files are stored

hil_fetch() needs to know where the HILDA fst files generated in the previous step are stored. You can set this location, HILDA_FST, either as a global option or as an R environment variable. Making the setting persistent across all your R sessions makes hil_fetch() more convenient to use. Alternatively, you can set it manually in each call using the hilda_fst_dir argument of hil_fetch().
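As a rough sketch of the environment-variable route, assuming the variable is literally named HILDA_FST as the text above suggests (the directory below is a temporary stand-in, not a real HILDA location):

```r
# Stand-in directory; replace with the save_dir you passed to hil_setup().
fst_dir <- file.path(tempdir(), "hilda-fst")

# Set the variable for the current R session only.
# Assumption: hildar looks up an environment variable named HILDA_FST.
Sys.setenv(HILDA_FST = fst_dir)

# Confirm what the session now sees.
Sys.getenv("HILDA_FST")

# Per-call alternative, using the hilda_fst_dir argument mentioned above:
# hil_fetch(years = 2001, hilda_fst_dir = fst_dir)
```

To make the setting persistent, a line of the form `HILDA_FST=/path/to/save/hilda-fst-files` can go in your `~/.Renviron` file, which R reads at startup.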

Example

Once the setup is completed, you can now start fetching HILDA data with hildar!

library(hildar)

# hil_fetch() removes the HILDA wave prefix from all the selected variables
# (e.g. axxx = 2001, bxxx = 2002).
hil_fetch(years = 2001:2003, add_geography = TRUE) %>%
  summary()
#>    xwaveid              hhid                wave          hgage           hgsex              mrcurr             hhrih              hhsgcc              hhwth             hhwtrp       
#>  Length:55899       Length:55899       Min.   :1.00   Min.   :  0.00   Length:55899       Length:55899       Length:55899       Length:55899       Min.   :    0.0   Min.   :  -10.0  
#>  Class :character   Class :character   1st Qu.:1.00   1st Qu.: 15.00   Class :character   Class :character   Class :character   Class :character   1st Qu.:  732.7   1st Qu.:  -10.0  
#>  Mode  :character   Mode  :character   Median :2.00   Median : 34.00   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :  909.4   Median :  821.9  
#>                                        Mean   :1.96   Mean   : 34.82                                                                               Mean   : 1021.9   Mean   :  809.4  
#>                                        3rd Qu.:3.00   3rd Qu.: 51.00                                                                               3rd Qu.: 1148.8   3rd Qu.: 1122.0  
#>                                        Max.   :3.00   Max.   :100.00                                                                               Max.   :14094.0   Max.   :16000.0
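The comment in the example above mentions that wave-specific variables carry a letter prefix (a = 2001, b = 2002, and so on) which is dropped from the fetched data. The sketch below shows that naming convention in base R. The helpers `wave_prefix()` and `strip_wave_prefix()` are hypothetical names written for this illustration and are not part of hildar; how hildar strips the prefix internally may differ.

```r
# Hypothetical helper: map a survey year to its wave letter (2001 -> "a").
wave_prefix <- function(year) {
  letters[year - 2000L]
}

# Hypothetical helper: drop the wave letter from the start of variable names.
# Names that do not start with that letter (e.g. "xwaveid") are left untouched.
strip_wave_prefix <- function(var_names, year) {
  sub(paste0("^", wave_prefix(year)), "", var_names)
}

wave_prefix(2001:2003)
#> [1] "a" "b" "c"

strip_wave_prefix(c("xwaveid", "ahgage", "ahgsex"), year = 2001)
#> [1] "xwaveid" "hgage"   "hgsex"
```

Dropping the prefix is what allows the same variable from different waves to be stacked into a single column, with the wave column recording where each row came from.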

The add_basic_vars argument adds basic demographic variables to the data; it is TRUE by default.

hil_fetch(years = 2001, add_basic_vars = TRUE) %>%
  names()
#> [1] "xwaveid" "hhid"    "wave"    "hgage"   "hgsex"   "mrcurr"  "hhrih"   "hhwth"   "hhwtrp"

How about doing a quick search to find the variables that you want? Use hil_dict(), which returns a data.table that you can search to view HILDA variables without going to their documentation webpage.

hilda_dictionary <- hil_dict()
head(hilda_dictionary)
#>        var            wave                    label
#> 1: xwaveid              NA         XW Cross wave ID
#> 2:    hhid 1,2,3,4,5,6,...          HF Household ID
#> 3:   hhpno 1,2,3,4,5,6,...         HF Person number
#> 4:   hhpid 1,2,3,4,5,6,...             HF Person ID
#> 5:  hhrpid 1,2,3,4,5,6,... DV: Randomised person id
#> 6: hhstate 1,2,3,4,5,6,...                 HF State
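To give a feel for what a label search over such a dictionary involves, here is a small base R sketch. It rebuilds the six rows printed above as a plain data frame and filters them with a regular expression. The helper `vars_by_label()` is a hypothetical name for this illustration, and the case-insensitive matching is an assumption; hil_labs() may behave differently.

```r
# The six dictionary rows shown above, as a plain data frame.
toy_dict <- data.frame(
  var = c("xwaveid", "hhid", "hhpno", "hhpid", "hhrpid", "hhstate"),
  label = c(
    "XW Cross wave ID", "HF Household ID", "HF Person number",
    "HF Person ID", "DV: Randomised person id", "HF State"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return variable names whose label matches a pattern.
# Assumption: matching ignores case.
vars_by_label <- function(dict, pattern) {
  dict$var[grepl(pattern, dict$label, ignore.case = TRUE)]
}

person_vars <- vars_by_label(toy_dict, "person")
person_vars
#> [1] "hhpno"  "hhpid"  "hhrpid"
```

The result is a character vector of variable names, which is the kind of value the vars argument of hil_fetch() is given in the examples in this README.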

Let's say we want to select all variables that are related to ‘employment’. Here is how we can easily use the selected employment variables in hil_fetch().

hilda_data <- hil_fetch(years = 2001:2003, vars = hil_labs("employment"))
#> ! These variables: `cnpu_fd`, `cnpu_np`, `cnpu_o1`, `cnpu_o2`, `cnpu_na`, `cnph_o1`, `cnph_o2`, `cnpc_ps`, `cnpc_fd`, `cnpc_o1`, `cnpc_o2`, `cnsu_ps`, `cnsu_fd`, `cnsu_kp`, `cnsu_np`, `cnsu_o1`, `cnsu_o2`, `cnsu_na`, `cnsh_bs`, `cnsh_ru`, `cnsh_re`, `cnsh_ps`, `cnsh_fd`, `cnsh_kp`, `cnsh_o1`, `cnsh_o2`, `cnsc_bs`, `cnsc_ru`, `cnsc_ps`, `cnsc_fd`, `cnsc_kp`, `cnsc_o1`, `cnsc_o2`, `chkb12`, `pjothru`, `pjothra`, `pjotcnt`, `fmfempn`, `fmmempn`, `lshremp`, `lsmnemp`, `lsmncom`, `fisemr`, `lsemp`, `lscom`, `jbtremp`, `ujtros`, `ncesop`, `rcesop`, `rtgwage`, `cnsh_au`, `hepuwrk`, `herjob`, `herhour`, `hechjob`, `hetowrk`, `heonas`, `hespeq`, `heothed`, `nsu1_fd`, `nsu1_o1`, `nsu1_o2`, `nsu1_na`, `nsu1_np`, `nsu2_fd`, `nsu2_o1`, `nsu2_o2`, `nsu2_na`, `nsu2_np`, `nsu3_fd`, `nsu3_o1`, `nsu3_o2`, `nsu3_na`, `nsu3_np`, `nsu4_fd`, `nsu4_o1`, `nsu4_o2`, `nsu4_na`, `nsu4_np`, `nsu5_fd`, `nsu5_o1`, `nsu5_o2`, `nsu5_na`, `nsu5_np`, `nsu6_fd`, `nsu6_o1`, `nsu6_o2`, `nsu6_na`, `nsu6_np`, `nsh1_ps`, `nsh2_ps`, `nsh3_ps`, `nsh4_ps`, `nsh5_ps`, `nsh6_ps`, `nsh1_fd`, `nsh2_fd`, `nsh3_fd`, `nsh4_fd`, `nsh5_fd`, ... don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_a200u.fst'.
#> ! These variables: `fmfempo`, `fmmempo`, `jbempst`, `loimpew`, `jbtremp`, `ujtros`, `ncesop`, `rcesop`, `rtgwage`, `cnsh_au`, `hepuwrk`, `herjob`, `herhour`, `hechjob`, `hetowrk`, `heonas`, `hespeq`, `heothed`, `nsu1_fd`, `nsu1_o1`, `nsu1_o2`, `nsu1_na`, `nsu1_np`, `nsu2_fd`, `nsu2_o1`, `nsu2_o2`, `nsu2_na`, `nsu2_np`, `nsu3_fd`, `nsu3_o1`, `nsu3_o2`, `nsu3_na`, `nsu3_np`, `nsu4_fd`, `nsu4_o1`, `nsu4_o2`, `nsu4_na`, `nsu4_np`, `nsu5_fd`, `nsu5_o1`, `nsu5_o2`, `nsu5_na`, `nsu5_np`, `nsu6_fd`, `nsu6_o1`, `nsu6_o2`, `nsu6_na`, `nsu6_np`, `nsh1_ps`, `nsh2_ps`, `nsh3_ps`, `nsh4_ps`, `nsh5_ps`, `nsh6_ps`, `nsh1_fd`, `nsh2_fd`, `nsh3_fd`, `nsh4_fd`, `nsh5_fd`, `nsh6_fd`, `nsh1_o1`, `nsh2_o1`, `nsh3_o1`, `nsh4_o1`, `nsh5_o1`, `nsh6_o1`, `nsh1_o2`, `nsh2_o2`, `nsh3_o2`, `nsh4_o2`, `nsh5_o2`, `nsh6_o2`, `npu1_o1`, `npu1_o2`, `npu1_na`, `npu1_np`, `npu2_o1`, `npu2_o2`, `npu2_na`, `npu2_np`, `npu3_o1`, `npu3_o2`, `npu3_na`, `npu3_np`, `npu4_o1`, `npu4_o2`, `npu4_na`, `npu4_np`, `npu5_o1`, `npu5_o2`, `npu5_na`, `npu5_np`, `npu6_o1`, `npu6_o2`, `npu6_na`, `npu6_np`, `nph1_fd`, `nph2_fd`, `nph3_fd`, `nph4_fd`, ... don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_b200u.fst'.
#> ! These variables: `fmfempo`, `fmmempo`, `jbempst`, `loimpew`, `fisemr`, `cnsh_au`, `hepuwrk`, `herjob`, `herhour`, `hechjob`, `hetowrk`, `heonas`, `hespeq`, `heothed`, `nsu1_fd`, `nsu1_o1`, `nsu1_o2`, `nsu1_na`, `nsu1_np`, `nsu2_fd`, `nsu2_o1`, `nsu2_o2`, `nsu2_na`, `nsu2_np`, `nsu3_fd`, `nsu3_o1`, `nsu3_o2`, `nsu3_na`, `nsu3_np`, `nsu4_fd`, `nsu4_o1`, `nsu4_o2`, `nsu4_na`, `nsu4_np`, `nsu5_fd`, `nsu5_o1`, `nsu5_o2`, `nsu5_na`, `nsu5_np`, `nsu6_fd`, `nsu6_o1`, `nsu6_o2`, `nsu6_na`, `nsu6_np`, `nsh1_ps`, `nsh2_ps`, `nsh3_ps`, `nsh4_ps`, `nsh5_ps`, `nsh6_ps`, `nsh1_fd`, `nsh2_fd`, `nsh3_fd`, `nsh4_fd`, `nsh5_fd`, `nsh6_fd`, `nsh1_o1`, `nsh2_o1`, `nsh3_o1`, `nsh4_o1`, `nsh5_o1`, `nsh6_o1`, `nsh1_o2`, `nsh2_o2`, `nsh3_o2`, `nsh4_o2`, `nsh5_o2`, `nsh6_o2`, `npu1_o1`, `npu1_o2`, `npu1_na`, `npu1_np`, `npu2_o1`, `npu2_o2`, `npu2_na`, `npu2_np`, `npu3_o1`, `npu3_o2`, `npu3_na`, `npu3_np`, `npu4_o1`, `npu4_o2`, `npu4_na`, `npu4_np`, `npu5_o1`, `npu5_o2`, `npu5_na`, `npu5_np`, `npu6_o1`, `npu6_o2`, `npu6_na`, `npu6_np`, `nph1_fd`, `nph2_fd`, `nph3_fd`, `nph4_fd`, `nph5_fd`, `nph6_fd`, `nph1_o1`, `nph2_o1`, ... don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_c200u.fst'.
dim(hilda_data)
#> [1] 55899    79

Or, if you know the prefix of a subject area that you would like to query, you can use hil_vars(pattern) to get all variable names that match the pattern. For example, hil_vars("^ff") will get all variables in the subject area ‘Health’ and nested area ‘Health - diet’.

hilda_data <- hil_fetch(years = 2001:2003, vars = hil_vars("^ff"))
#> ! These variables: `ffmilk`, `ffveg`, `ffvegs`, `fffrt`, `fffrts`, `ffbf`, `ffsalt`, `ffbrfr`, `fflunr`, `ffdinr`, `ffcdiet`, `ffdietf`, `ffsrw`, `ffscw`, `ffleg`, `ffcake`, `ffpasta`, `ffsnack`, `ffcerl`, `ffconf`, `ffbread`, `ffspud`, `ffrmeat`, `ffprocm`, `ffpoult`, and `fffish` don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_a200u.fst'.
#> ! These variables: `ffmilk`, `ffveg`, `ffvegs`, `fffrt`, `fffrts`, `ffbf`, `ffsalt`, `ffbrfr`, `fflunr`, `ffdinr`, `ffcdiet`, `ffdietf`, `ffsrw`, `ffscw`, `ffleg`, `ffcake`, `ffpasta`, `ffsnack`, `ffcerl`, `ffconf`, `ffbread`, `ffspud`, `ffrmeat`, `ffprocm`, `ffpoult`, and `fffish` don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_b200u.fst'.
#> ! These variables: `ffmilk`, `ffveg`, `ffvegs`, `fffrt`, `fffrts`, `ffbf`, `ffsalt`, `ffbrfr`, `fflunr`, `ffdinr`, `ffcdiet`, `ffdietf`, `ffsrw`, `ffscw`, `ffleg`, `ffcake`, `ffpasta`, `ffsnack`, `ffcerl`, `ffconf`, `ffbread`, `ffspud`, `ffrmeat`, `ffprocm`, `ffpoult`, and `fffish` don't exist in 'C:\Users\amarin\OneDrive - UNSW\data\HILDA20-fst/Combined_c200u.fst'.
dim(hilda_data)
#> [1] 55899     9
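The messages above appear because the requested variables are not collected in waves 1 to 3, so only the default variables come back. One way to check availability before fetching is to filter on the dictionary's wave column. The sketch below uses a toy dictionary in base R and assumes the wave column is a list of integer vectors. The wave values given for `ffmilk` are invented for illustration, and `vars_available_in()` is a hypothetical helper, not a hildar function.

```r
# Toy dictionary; the wave lists are illustrative, not real HILDA metadata.
toy_dict <- data.frame(
  var = c("hhid", "hhstate", "ffmilk"),
  stringsAsFactors = FALSE
)
toy_dict$wave <- list(1:6, 1:6, c(7L, 13L))

# Hypothetical helper: keep variables collected in at least one requested wave.
vars_available_in <- function(dict, waves) {
  keep <- vapply(dict$wave, function(w) any(w %in% waves), logical(1))
  dict$var[keep]
}

vars_available_in(toy_dict, waves = 1:3)
#> [1] "hhid"    "hhstate"

vars_available_in(toy_dict, waves = 7)
#> [1] "ffmilk"
```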

Here is a summary of the dimensions of our HILDA data files.

# count the number of rows in each wave, then join this onto
# the number of variables available in each wave
nrows_by_wave <-
  hil_fetch(years = 2001:2020, add_basic_vars = FALSE) %>%
  .[, .(number_of_rows = .N), by = wave]

hilda_dictionary[, unlist(wave), by = .(var, label)] %>%
  data.table::setnames("V1", "wave") %>%
  .[!is.na(wave), .(number_of_variables = .N), by = wave] %>%
  merge(nrows_by_wave, by = "wave")
#>     wave number_of_variables number_of_rows
#>  1:    1                4289          19914
#>  2:    2                5220          18295
#>  3:    3                5214          17690
#>  4:    4                5081          17209
#>  5:    5                5893          17467
#>  6:    6                6140          17453
#>  7:    7                6044          17280
#>  8:    8                6186          17144
#>  9:    9                6202          17632
#> 10:   10                6488          17855
#> 11:   11                6691          23415
#> 12:   12                6530          23182
#> 13:   13                6455          23299
#> 14:   14                6719          23114
#> 15:   15                6800          23305
#> 16:   16                6433          23507
#> 17:   17                6697          23442
#> 18:   18                7093          23267
#> 19:   19                7278          23256
#> 20:   20                6864          22932
