Brendan Knapp
Environment Prep
# load packages
library(tidyverse)
library(readxl)
library(pander)
library(scales)
# set options so tables render nicely
pander::panderOptions('table.split.table', Inf)
Data were retrieved from:
- World Bank Data
- Improved water source (% of population with access)
- Life expectancy at birth, total (years)
- Urban population growth (annual %)
- UNDP Human Development Reports
- Adult Literacy Rate (% Ages 15 and older)
- Population with at least some secondary education (% ages 25 and older)
- Mean years of schooling (years)
- Inequality in education (%)
- GNI per capita (2011 PPP$)
We read the excel file containing our raw data and store it in a data_frame
variable named combo
.
combo <- read_excel("~/Development-Economics/data/combo.xlsx")
pander(combo)
Country | Improved water source (% of population with access) | Life expectancy at birth, total (years) | Urban population % | Adult Literacy Rate (% Ages 15 and older) | Population with at least some secondary education (% ages 25 and older) | Mean years of schooling (years) | GNI per capita (2011 PPP$) | Press Freedom Score (0 worst - 100 best) |
---|---|---|---|---|---|---|---|---|
Cuba | 94.6 | 79.39 | 77 | 99.7 | 84.8 | 11.8 | 7280 | 90 |
Ghana | 87.6 | 62.11 | 55 | 76.6 | 59.8 | 6.9 | 3472 | 28 |
India | 94.1 | 68.05 | 33 | 72.1 | 48.7 | 6.1 | 5329 | 39 |
Mexico | 96.1 | 76.7 | 80 | 94.4 | 57.4 | 8.4 | 16154 | 61 |
Russia | 96.9 | 70.74 | 74 | 99.7 | 94.6 | 12 | 24067 | 81 |
South Africa | 92.8 | 60.95 | 65 | 94.3 | 74.9 | 10.3 | 12113 | 33 |
Syria | 90.1 | 70.16 | 58 | 86.4 | 38.9 | 5.6 | 2905 | 89 |
Thailand | 97.8 | 74.86 | 52 | 96.7 | 43.3 | 7.9 | 14169 | 64 |
In order to normalize our values to a 0
to 1
scale, we use the following formula:
Divide (cardinal number – minimum) by (maximum – minimum)
We create a function called normalize
to achieve this result.
normalize <- function(x){
(x - min(x)) / (max(x) - min(x))
}
We take our normalize()
function, apply it to our combo
data, and store the results in a new data_frame
called combo_norm
. Since Country
is a non-numeric column, mutate_if
allows us to only apply our functions to columns where is.numeric()
returns TRUE
. We also rename our variables to simplify our table and drop any scale references.
combo_norm <- combo %>%
mutate_if(is.numeric, funs(normalize)) %>%
rename(`Press Freedom` = `Press Freedom Score (0 worst - 100 best)`,
`Water Access` = `Improved water source (% of population with access)`,
`Life Expectancy` = `Life expectancy at birth, total (years)`,
`Urban Population` = `Urban population %`,
`Adult Literacy` = `Adult Literacy Rate (% Ages 15 and older)`,
`GNI per Capita` = `GNI per capita (2011 PPP$)`
)
pander(combo_norm)
Country | Water Access | Life Expectancy | Urban Population | Adult Literacy | Population with at least some secondary education (% ages 25 and older) | Mean years of schooling (years) | GNI per Capita | Press Freedom |
---|---|---|---|---|---|---|---|---|
Cuba | 0.6863 | 1 | 0.9362 | 1 | 0.8241 | 0.9688 | 0.2067 | 1 |
Ghana | 0 | 0.06287 | 0.4681 | 0.163 | 0.3752 | 0.2031 | 0.02679 | 0 |
India | 0.6373 | 0.3849 | 0 | 0 | 0.1759 | 0.07812 | 0.1145 | 0.1774 |
Mexico | 0.8333 | 0.8541 | 1 | 0.808 | 0.3321 | 0.4375 | 0.6261 | 0.5323 |
Russia | 0.9118 | 0.531 | 0.8723 | 1 | 1 | 1 | 1 | 0.8548 |
South Africa | 0.5098 | 0 | 0.6809 | 0.8043 | 0.6463 | 0.7344 | 0.4351 | 0.08065 |
Syria | 0.2451 | 0.4995 | 0.5319 | 0.5181 | 0 | 0 | 0 | 0.9839 |
Thailand | 1 | 0.7545 | 0.4043 | 0.8913 | 0.07899 | 0.3594 | 0.5323 | 0.5806 |
We decided to combine our education variables by taking their mean
and assigning it to a new variable called Combined Education
.
combo_norm_comb_ed <- combo_norm %>%
group_by(Country) %>%
mutate(`Combined Education` = mean(c(`Population with at least some secondary education (% ages 25 and older)`, `Mean years of schooling (years)`, `Adult Literacy`))) %>%
select(-`Population with at least some secondary education (% ages 25 and older)`,
-`Mean years of schooling (years)`,
-`Adult Literacy`)
pander(combo_norm_comb_ed)
Country | Water Access | Life Expectancy | Urban Population | GNI per Capita | Press Freedom | Combined Education |
---|---|---|---|---|---|---|
Cuba | 0.6863 | 1 | 0.9362 | 0.2067 | 1 | 0.9309 |
Ghana | 0 | 0.06287 | 0.4681 | 0.02679 | 0 | 0.2471 |
India | 0.6373 | 0.3849 | 0 | 0.1145 | 0.1774 | 0.08469 |
Mexico | 0.8333 | 0.8541 | 1 | 0.6261 | 0.5323 | 0.5259 |
Russia | 0.9118 | 0.531 | 0.8723 | 1 | 0.8548 | 1 |
South Africa | 0.5098 | 0 | 0.6809 | 0.4351 | 0.08065 | 0.7283 |
Syria | 0.2451 | 0.4995 | 0.5319 | 0 | 0.9839 | 0.1727 |
Thailand | 1 | 0.7545 | 0.4043 | 0.5323 | 0.5806 | 0.4432 |
We decided on the following weighting scheme for our data: * Water: 0.30 * Education: 0.25 * GNI per Capita: 0.18 * Urban Population: 0.12 * Press Freedom: 0.09 * Life Expectancy: 0.06
We assign our weights to individual variables before using them. To ensure that we're weighting correctly, we also check that our weight sum
to 1
.
water_wt <- 0.30
education_wt <- 0.25
GNI_wt <- 0.18
urban_wt <- 0.12
press_freedom_wt <- 0.09
life_expect_wt <- 0.06
weights <- c(water_wt, education_wt, GNI_wt, urban_wt, press_freedom_wt, life_expect_wt)
sum(weights) == 1
## [1] TRUE
In order to weight our data, we multiply each column by the appropriate weight variable. At this point, we also round
our data to 2
digits following the decimal point.
combo_weighted <- combo_norm_comb_ed %>%
mutate(`Water Access` = `Water Access` * water_wt,
`Life Expectancy` = `Life Expectancy` * life_expect_wt,
`Urban Population` = `Urban Population` * urban_wt,
`GNI per Capita` = `GNI per Capita` * GNI_wt,
`Press Freedom` = `Press Freedom` * press_freedom_wt,
`Combined Education` = `Combined Education` * education_wt) %>%
mutate_if(is.numeric, funs(round(., digits = 2)))
pander(combo_weighted)
Country | Water Access | Life Expectancy | Urban Population | GNI per Capita | Press Freedom | Combined Education |
---|---|---|---|---|---|---|
Cuba | 0.21 | 0.06 | 0.11 | 0.04 | 0.09 | 0.23 |
Ghana | 0 | 0 | 0.06 | 0 | 0 | 0.06 |
India | 0.19 | 0.02 | 0 | 0.02 | 0.02 | 0.02 |
Mexico | 0.25 | 0.05 | 0.12 | 0.11 | 0.05 | 0.13 |
Russia | 0.27 | 0.03 | 0.1 | 0.18 | 0.08 | 0.25 |
South Africa | 0.15 | 0 | 0.08 | 0.08 | 0.01 | 0.18 |
Syria | 0.07 | 0.03 | 0.06 | 0 | 0.09 | 0.04 |
Thailand | 0.3 | 0.05 | 0.05 | 0.1 | 0.05 | 0.11 |
By sum
ming our normalized and weighted variables, we create our Score
data. We also round
these data to 2
digits following the decimal point.
Score <- combo_weighted %>%
group_by(Country) %>%
summarise(Score = sum(`Water Access`,
`Life Expectancy`,
`Urban Population`,
`GNI per Capita`,
`Press Freedom`,
`Combined Education`))
pander(Score)
Country | Score |
---|---|
Cuba | 0.74 |
Ghana | 0.12 |
India | 0.27 |
Mexico | 0.71 |
Russia | 0.91 |
South Africa | 0.5 |
Syria | 0.29 |
Thailand | 0.66 |
With our Score
data, we can then create our Rank
data.
Score_Rank <- Score %>%
mutate(Rank = dense_rank(desc(Score)))
pander(Score_Rank)
Country | Score | Rank |
---|---|---|
Cuba | 0.74 | 2 |
Ghana | 0.12 | 8 |
India | 0.27 | 7 |
Mexico | 0.71 | 3 |
Russia | 0.91 | 1 |
South Africa | 0.5 | 5 |
Syria | 0.29 | 6 |
Thailand | 0.66 | 4 |
The example format that we have seen looks like so:
To match this we need to convert our data_frame
to a matrix
, transpose the data, and store it in a variable that we'll call Score
.
Score_Rank <- Score_Rank %>%
as.matrix() %>%
t()
pander(Score_Rank)
Country | Cuba | Ghana | India | Mexico | Russia | South Africa | Syria | Thailand |
Score | 0.74 | 0.12 | 0.27 | 0.71 | 0.91 | 0.50 | 0.29 | 0.66 |
Rank | 2 | 8 | 7 | 3 | 1 | 5 | 6 | 4 |
We're then going to detach the Country
row so that we can bind the Score
and Rank
rows to our summary table.
Score_Rank <- Score_Rank[-1, ]
pander(Score_Rank)
Score | 0.74 | 0.12 | 0.27 | 0.71 | 0.91 | 0.50 | 0.29 | 0.66 |
Rank | 2 | 8 | 7 | 3 | 1 | 5 | 6 | 4 |
To produce the desired table, we convert our combo_norm
variable to a matrix, which we then transpose and store in a variable called combo_norm_matr_trans
.
combo_norm_matr_trans <- combo_weighted %>%
as.matrix() %>%
t()
pander(combo_norm_matr_trans)
Country | Cuba | Ghana | India | Mexico | Russia | South Africa | Syria | Thailand |
Water Access | 0.21 | 0.00 | 0.19 | 0.25 | 0.27 | 0.15 | 0.07 | 0.30 |
Life Expectancy | 0.06 | 0.00 | 0.02 | 0.05 | 0.03 | 0.00 | 0.03 | 0.05 |
Urban Population | 0.11 | 0.06 | 0.00 | 0.12 | 0.10 | 0.08 | 0.06 | 0.05 |
GNI per Capita | 0.04 | 0.00 | 0.02 | 0.11 | 0.18 | 0.08 | 0.00 | 0.10 |
Press Freedom | 0.09 | 0.00 | 0.02 | 0.05 | 0.08 | 0.01 | 0.09 | 0.05 |
Combined Education | 0.23 | 0.06 | 0.02 | 0.13 | 0.25 | 0.18 | 0.04 | 0.11 |
Next, we bind Score_Rank
to combo_norm_matr_trans
by row and store the result in a variable called combo_matrix
.
combo_matrix <- rbind(combo_norm_matr_trans, Score_Rank)
Before viewing the result, we bold our Score
and Rank
rows and our Syria
column.
emphasize.strong.rows(c(1, nrow(combo_matrix), nrow(combo_matrix) - 1))
emphasize.strong.cols(7)
pander(combo_matrix)
Country | Cuba | Ghana | India | Mexico | Russia | South Africa | Syria | Thailand |
Water Access | 0.21 | 0.00 | 0.19 | 0.25 | 0.27 | 0.15 | 0.07 | 0.30 |
Life Expectancy | 0.06 | 0.00 | 0.02 | 0.05 | 0.03 | 0.00 | 0.03 | 0.05 |
Urban Population | 0.11 | 0.06 | 0.00 | 0.12 | 0.10 | 0.08 | 0.06 | 0.05 |
GNI per Capita | 0.04 | 0.00 | 0.02 | 0.11 | 0.18 | 0.08 | 0.00 | 0.10 |
Press Freedom | 0.09 | 0.00 | 0.02 | 0.05 | 0.08 | 0.01 | 0.09 | 0.05 |
Combined Education | 0.23 | 0.06 | 0.02 | 0.13 | 0.25 | 0.18 | 0.04 | 0.11 |
Score | 0.74 | 0.12 | 0.27 | 0.71 | 0.91 | 0.50 | 0.29 | 0.66 |
Rank | 2 | 8 | 7 | 3 | 1 | 5 | 6 | 4 |
To faciliate understanding the data, we can visualize it with a map.
map_world <- map_data(map = "world")
index_countries <- Score %>%
inner_join(map_world, by = c("Country" = "region"))
nonindex_countries <- map_world %>%
anti_join(combo_norm, by = c("region" = "Country"))
index_map <- ggplot() +
geom_polygon(data = index_countries,
aes(x = long, y = lat,
group = group,
fill = Score), size = 0.01) +
geom_polygon(data = nonindex_countries,
aes(x = long, y = lat, group = group),
fill = "white",
size = 0.1) +
coord_map() +
scale_fill_distiller(type = "seq",
palette = "RdBu",
breaks = pretty_breaks(n=10),
direction = 1) +
coord_fixed(ratio = 1.5) +
guides(fill = guide_legend(keywidth = 1, keyheight = 3, reverse = TRUE)) +
ggtitle("Development Index") +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)
index_map