Tools for cleaning, analyzing, and understanding the Honduras data.
This is a private repository, so you will need GitHub credentials with access to this repository in order to install the package. You will need these credentials every time you reinstall or update the package.
- GraphTools.jl (public repository)
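
For example, installation might look like the following. This is a minimal sketch only; the repository URLs are placeholders, not the actual ones, and it assumes your GitHub credentials are already configured (e.g., an SSH key or a git credential helper).

```julia
using Pkg

# Private repository: requires GitHub credentials with access
Pkg.add(url="https://github.com/<org>/<this-package>.jl")

# Public dependency
Pkg.add(url="https://github.com/<org>/GraphTools.jl")
```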
This data is too large, and too restricted, to be tracked on GitHub. Consequently, it must be generated locally.
- You should create a new file based on the existing `make_data.jl` template file, with the directories and files appropriate to your specific data request. E.g., the types of raw data files and the filenames will be specific to your request.
- From your top directory (e.g., for a Quarto project):

  ```shell
  julia --threads=20 make_data.jl
  ```
This yields one CSV at each level of the data, with all (selected) waves concatenated together. Additionally, the data has been filtered by `alter_source` (connections data) and by `data_source` (respondent-level data).

Cf. the `codebook` directory for a codebook that matches the data in this format.
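
For instance, the generated files might be loaded as follows. This is a minimal sketch; the filenames are assumptions and will depend on how your `make_data.jl` is configured.

```julia
using CSV, DataFrames

# Filenames are hypothetical; substitute whatever your make_data.jl writes.
resp = CSV.read("respondent.csv", DataFrame)   # respondent level, waves concatenated
hh   = CSV.read("household.csv", DataFrame)    # household level
con  = CSV.read("connections.csv", DataFrame)  # connections (edgelist)

# Spot-check the filtering described above
all(==(1), skipmissing(con.alter_source))   # connections: alter_source
all(==(1), skipmissing(resp.data_source))   # respondent: data_source
```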
(more explanations to be written)
- Will this processing code work with the data that I have requested?
  - Not clear; the code needs to be checked for flexibility (e.g., whether operations are conditional on a variable being present before cleaning it or relying on it).
- How many datasets and files are there?
  - There are 3 (eventually 4) waves of data, for the respondent-level, household-level, and connections data.
  - Plus there is the microbiome data at W3', and the CSS data at W4'.
  - Each type and wave has a separate file.
- One major purpose of this repository is to clean the data such that there is one data table for each type of data, covering every wave (the microbiome data is currently separate). The general strategy is to take all possible variables and leave entries missing where they were not collected. So, check whether a given variable really ought to exist at that wave (in the reformatted codebook, which also combines across waves).
- How do we extract the appropriate set of isolates? (given that the edgelist doesn't have them?)
- What is `household_id`? How do I index households?
  - It has been deprecated in favor of `building_id`.
- Which are the microbiome villages?
  - Cf. `codebook/microbiome_villages.csv`.
  - Other villages appear in the data because the team surveyed people from other villages who happened to be present in a village while it was being surveyed.
- How do I interpret the `survey` variable in the codebooks?
  - For each row of the reformatted codebook, there should be an entry corresponding to each wave of the data (in which the data is present; compare to `wave`).
  - "baseline" => the survey was only done if a prior measurement did not exist
  - "wx" (where x is a wave number) => the standard survey given to everyone in that wave
  - "all" => both "baseline" and "wx" were carried out -- [THIS DOESN'T MAKE SENSE? Why would the surveys overlap?]
  - The set of questions in "baseline" varies across the waves; hence, it is denoted "baseline wx".
  - [EXPLAIN: how is this connected to the respondent variable indicating a new subject at wave x?]
- What do we do about variables that were only collected at, say, W1?
  - Case-by-case; but N.B. whether the variable is plausibly static.
- What about people who are in a different village from W3 to W3'?
  - (where W3' is when the microbiome data was collected)
  - There are around 13 cases. It is currently not clear whether these were permanent moves or not (lives in village and works in village are both `missing` in each of the 13 cases).
- The `survey` variable is not clear.
- At least in the microbiome dataset, there are some variables not yet referenced.
  - e.g., `other_resp`, `data_source`
- Some variables are not coded consistently.
  - E.g., gender is coded as "male" vs. "female", or sometimes "man" vs. "woman".
  - It is worth checking for consistency (the processing code here should resolve this for gender, though not in the underlying data); see the sketch after this list.
- (A few other things were cleaned up and fixed from the stated versions, so check the version history; e.g., variable `p1600`.)
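
For illustration, a consistency check and recode for gender might look like the following. This is a sketch only, not the repository's actual cleaning code; it assumes the respondent-level data has been loaded into a `DataFrame` called `resp` with a `gender` column, and the target labels ("man"/"woman") are chosen arbitrarily here.

```julia
using DataFrames

# Inspect the codings actually present (e.g., "male"/"female" vs. "man"/"woman")
unique(skipmissing(resp.gender))

# Harmonize to a single coding; unrecognized labels become missing
function recode_gender(x)
    ismissing(x) && return missing
    x in ("male", "man") && return "man"
    x in ("female", "woman") && return "woman"
    return missing
end

resp.gender = recode_gender.(resp.gender)
```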
See `make_data.jl` for a template to process the raw data into clean CSV files; a rough code sketch of the steps below follows the list.
- N.B. the relative paths.
- The processing functions should be agnostic to the specific variables requested, meaning that they should work whether you requested some, many, or all variables.
- take requested connections data (W3 data, as the closest to the W3' MB data)
  - filter to `alter_source = 1`, `same_village = 1`
  - drop any rows with missing entries
- take the requested respondent data (W3 data)
  - filter any rows with missing entries for [`village_code`, `gender`, `date_of_birth`, `building_id`]
  - filter to `data_source = 1`
- take requested HH data
  - drop all rows with missing village codes, building ids (manual)
  - prefix overlapping variables with `hh_`
- take the requested MB data
  - drop all rows with missing village codes (manual)
  - drop all rows s.t. `data_source != 1` (manual)
  - prefix overlapping variables with `mb_`
- left-join the individual-level data to the MB data on resp. id and village code (manual)
  - Depending on how we want to handle people who have a different village code for W3 and W3', we may want to adjust this.
- Depending on the analysis (read: most of the time), we want to remove everyone who is not in one of the 19 microbiome villages (`codebook/microbiome_villages.csv`).
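
Putting the steps above together, the flow might look roughly like the following. This is a sketch under stated assumptions, not the package's actual implementation: the input filenames, the respondent id column (`respondent_master_id`), the `village_code` column in `codebook/microbiome_villages.csv`, and the choice to leave join keys unprefixed are all assumptions.

```julia
using CSV, DataFrames

# Connections (W3, closest to the W3' MB data); filename is hypothetical
con = CSV.read("connections_w3.csv", DataFrame)
con = subset(con, :alter_source => ByRow(==(1)), :same_village => ByRow(==(1)); skipmissing=true)
con = dropmissing(con)

# Respondent data (W3); filename is hypothetical
resp = CSV.read("respondent_w3.csv", DataFrame)
resp = dropmissing(resp, [:village_code, :gender, :date_of_birth, :building_id])
resp = subset(resp, :data_source => ByRow(==(1)); skipmissing=true)

# Household data: drop missing keys, prefix overlapping variables with hh_
hh = CSV.read("household_w3.csv", DataFrame)
hh = dropmissing(hh, [:village_code, :building_id])
hh_overlap = setdiff(intersect(names(hh), names(resp)), ["village_code", "building_id"])  # keys left unprefixed (assumption)
rename!(c -> c in hh_overlap ? "hh_" * c : c, hh)

# Microbiome data: drop missing villages, keep data_source == 1, prefix overlaps with mb_
mb = CSV.read("microbiome_w3prime.csv", DataFrame)
mb = dropmissing(mb, :village_code)
mb = subset(mb, :data_source => ByRow(==(1)); skipmissing=true)
mb_overlap = setdiff(intersect(names(mb), names(resp)), ["respondent_master_id", "village_code"])
rename!(c -> c in mb_overlap ? "mb_" * c : c, mb)

# Left-join the individual-level data to the MB data on respondent id and village code
joined = leftjoin(mb, resp; on = [:respondent_master_id, :village_code])

# Restrict to the 19 microbiome villages
mbvill = CSV.read("codebook/microbiome_villages.csv", DataFrame)
joined = subset(joined, :village_code => ByRow(in(Set(mbvill.village_code))); skipmissing=true)
```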