bblocks is a Python package with tools to download and analyse development data. These tools are meant to be the building blocks of further analysis.
We have built bblocks to support our work at ONE, but we hope that it will be useful to others working with development data. We welcome feedback, feature requests, and collaboration.
bblocks is organised around the following main features:
- Import tools to help with importing data from:
  - The World Bank (building on the wbgapi package)
  - The IMF World Economic Outlook (building on the weo package)
  - The IMF data on Special Drawing Rights
  - The World Food Programme (WFP) data on food security and inflation
  - The FAO (notably the price index)
  - The UNDP Human Development Report data
  - UNAIDS
  - The WHO Government Health Expenditure data
- Cleaning tools to help with:
  - Cleaning numbers/numeric series
  - Transforming country identifiers (ISO2, ISO3, WB, UN, etc., building on the country_converter package)
  - Transforming text to datetime objects, and datetime objects to text
  - Formatting numbers as text (percentages, millions, billions, etc.)
- Analysis tools to help with:
  - Calculating period averages
  - Calculating the change from one period to another
- DataFrame tools to help with:
  - Adding a population column to a DataFrame
  - Adding a "share of population" / "per capita" column to a DataFrame
  - Adding a population density column to a DataFrame
  - Adding a GDP column to a DataFrame
  - Adding a "share of GDP" column to a DataFrame
  - Adding a poverty ratio column to a DataFrame
  - Adding a government expenditure column to a DataFrame
  - Adding a "share of government expenditure" column to a DataFrame
  - Adding a "World Bank income level" column to a DataFrame
  - Adding a column with short country names to a DataFrame
  - Adding a column with ISO3 codes to a DataFrame
  - Adding the median observation of a group
  - Adding a column with geojson geometries to a DataFrame
- Other tools like:
  - Dictionaries mapping ISO3 codes (and vice-versa) to:
    - OECD DAC codes
    - WB income groups
    - geojson geometries
    - G7, EU27, G20 countries
    - Income levels
    - Life expectancy
    - Population
More information is available:
- Documentation: https://bblocks.readthedocs.io/
- GitHub: https://github.com/ONECampaign/bblocks
- PyPI: https://pypi.org/project/bblocks/
bblocks can be installed from PyPI using pip:
pip install bblocks --upgrade
The package is compatible with Python 3.10 and above.
To get started, import the package. It is strongly recommended that you specify the path to the folder where you want to store the data.
You only have to do this once per file/notebook.
from bblocks import set_bblocks_data_path
# Set to the folder you want
set_bblocks_data_path("path/to/data/folder")
All the examples below assume that you have done this.
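The folder you pass to set_bblocks_data_path needs to be one you can write to. The following is a minimal sketch, assuming you want to create the folder yourself before pointing bblocks at it. `prepare_data_folder` is a hypothetical helper written for this example and is not part of bblocks; the folder name is also just an illustration.

```python
from pathlib import Path
import tempfile


def prepare_data_folder(folder: str) -> Path:
    """Create the folder (and any parents) if needed and return its absolute path.

    Hypothetical helper for illustration; it is not part of bblocks.
    """
    path = Path(folder).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate with a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    data_folder = prepare_data_folder(str(Path(tmp) / "bblocks_data"))
    folder_was_created = data_folder.is_dir()
    # With bblocks installed, the resulting path could then be passed on:
    # set_bblocks_data_path(str(data_folder))
```

In a real project you would use a permanent location rather than a temporary directory, so that downloaded data is reused between sessions.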
from bblocks import WorldBankData
# create a WorldBankData object. This object will allow you
# to download indicators from the World Bank and get them as DataFrames
wb = WorldBankData()
# For example to get "primary completion rate" (SE.PRM.CMPT.ZS) from 2010 to 2020.
# If the data is not already in your data folder, it will be downloaded
wb.load_data(
    indicator="SE.PRM.CMPT.ZS",
    start_year=2010,
    end_year=2020
)
# Get the data as a DataFrame
df = wb.get_data()
# Print a sample of 10 rows
print(df.sample(10))
The above would return a DataFrame like this:
date | iso_code | indicator_code | value |
---|---|---|---|
2010-01-01 | LMC | SE.PRM.CMPT.ZS | 87.753189 |
2012-01-01 | SWZ | SE.PRM.CMPT.ZS | 84.697472 |
2013-01-01 | NAM | SE.PRM.CMPT.ZS | 93.020042 |
2012-01-01 | PAK | SE.PRM.CMPT.ZS | 63.486210 |
2015-01-01 | LIC | SE.PRM.CMPT.ZS | 63.463470 |
2016-01-01 | BGD | SE.PRM.CMPT.ZS | NaN |
2019-01-01 | SYR | SE.PRM.CMPT.ZS | NaN |
2013-01-01 | NAC | SE.PRM.CMPT.ZS | 99.025703 |
2011-01-01 | AND | SE.PRM.CMPT.ZS | NaN |
2013-01-01 | GRL | SE.PRM.CMPT.ZS | NaN |
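The sample above mixes individual countries with World Bank aggregates (LMC, LIC, NAC) and includes rows without a value. Below is a plain-Python sketch of separating them, using a few rows copied from the table. The `AGGREGATES` set is an assumption that lists only the aggregate codes visible in this sample, not a complete list. On an actual DataFrame, pandas' `dropna` and boolean indexing would do the same job.

```python
import math

# A few rows copied from the sample output above: (date, iso_code, value).
rows = [
    ("2010-01-01", "LMC", 87.753189),
    ("2012-01-01", "SWZ", 84.697472),
    ("2012-01-01", "PAK", 63.486210),
    ("2016-01-01", "BGD", math.nan),
    ("2013-01-01", "NAC", 99.025703),
    ("2011-01-01", "AND", math.nan),
]

# Assumption: only the aggregate codes that appear in the sample above.
AGGREGATES = {"LMC", "LIC", "NAC"}

# Keep rows that have a value and refer to a single country.
countries_with_data = [
    row for row in rows
    if not math.isnan(row[2]) and row[1] not in AGGREGATES
]

for date, iso_code, value in countries_with_data:
    print(date, iso_code, round(value, 1))
```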
You can also get the latest data (most recent non-empty observation) for one or more indicators:
from bblocks import WorldBankData
# create a WorldBankData object.
wb_data = WorldBankData()
# Load the indicators. If they are not already downloaded, they will be downloaded.
wb_data.load_data(
    indicator=["SH.XPD.CHEX.PC.CD", "SH.XPD.CHEX.GD.ZS"],
    most_recent_only=True
)
# Get the data as a DataFrame
df = wb_data.get_data(indicators="all")
# Print a sample of the data
print(df.sample(10))
This would return a DataFrame like this:
date | iso_code | indicator_code | value |
---|---|---|---|
2019-01-01 | HRV | SH.XPD.CHEX.PC.CD | 1040.085693 |
2019-01-01 | ERI | SH.XPD.CHEX.GD.ZS | 4.458767 |
2019-01-01 | JAM | SH.XPD.CHEX.PC.CD | 327.403534 |
2019-01-01 | MYS | SH.XPD.CHEX.PC.CD | 436.612030 |
2019-01-01 | BHS | SH.XPD.CHEX.GD.ZS | 5.749775 |
2015-01-01 | YEM | SH.XPD.CHEX.PC.CD | 73.176743 |
2019-01-01 | PER | SH.XPD.CHEX.PC.CD | 370.109955 |
2019-01-01 | IDA | SH.XPD.CHEX.PC.CD | 52.076285 |
2019-01-01 | ERI | SH.XPD.CHEX.PC.CD | 25.267935 |
2019-01-01 | WLD | SH.XPD.CHEX.PC.CD | 1115.008730 |
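"Most recent non-empty observation" means the latest date is chosen separately for each country and indicator, which would explain why one row above is dated 2015 while most are from 2019. The sketch below shows that selection logic in plain Python. It is not bblocks' actual implementation, and the country codes and values are hypothetical, made up for illustration.

```python
# Hypothetical observations: (iso_code, year, value); None marks a missing value.
observations = [
    ("AAA", 2017, 10.0),
    ("AAA", 2018, 12.0),
    ("AAA", 2019, None),  # latest year is empty, so 2018 should be kept
    ("BBB", 2018, 5.0),
    ("BBB", 2019, 6.0),
]


def most_recent_non_empty(rows):
    """Return {iso_code: (year, value)} with the latest non-missing value per code."""
    latest = {}
    for iso_code, year, value in rows:
        if value is None:
            continue  # skip empty observations
        if iso_code not in latest or year > latest[iso_code][0]:
            latest[iso_code] = (year, value)
    return latest


latest = most_recent_non_empty(observations)
print(latest)
```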
In all cases, if you have already downloaded the data and want to update it, call `.update_data()` after loading the data to refresh it:
wb_data.update_data(reload_data=True)
from bblocks import Aids
# create an Aids object. This object will allow you
# to download indicators from UNAIDS and get them as DataFrames
aids = Aids()
# To view all the indicators that can be downloaded using this tool
# you can use the `.available_indicators` property
aids.available_indicators
Here are the first 10 indicators, but over 50 are available:
| | indicator | category |
---|---|---|
0 | Trend of new HIV infections | Epidemic transition metrics |
1 | Trend of AIDS-related deaths | Epidemic transition metrics |
2 | Incidence:prevalence ratio | Epidemic transition metrics |
3 | Incidence:mortality ratio | Epidemic transition metrics |
4 | People living with HIV - All ages | People living with HIV |
5 | People living with HIV - Children (0-14) | People living with HIV |
6 | People living with HIV - Adolescents (10-19) | People living with HIV |
7 | People living with HIV - Young people (15-24) | People living with HIV |
8 | People living with HIV - Adults (15+) | People living with HIV |
9 | People living with HIV - Adults (15-49) | People living with HIV |
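With more than 50 indicators, it can help to browse them by category. The sketch below groups a few rows copied from the table above using plain tuples. It makes no assumption about the exact object that `.available_indicators` returns, so adapt the input to whatever you get back.

```python
from collections import defaultdict

# The first rows of the table above: (indicator, category).
indicators = [
    ("Trend of new HIV infections", "Epidemic transition metrics"),
    ("Trend of AIDS-related deaths", "Epidemic transition metrics"),
    ("Incidence:prevalence ratio", "Epidemic transition metrics"),
    ("Incidence:mortality ratio", "Epidemic transition metrics"),
    ("People living with HIV - All ages", "People living with HIV"),
    ("People living with HIV - Children (0-14)", "People living with HIV"),
]

# Collect the indicator names under their category.
by_category = defaultdict(list)
for indicator, category in indicators:
    by_category[category].append(indicator)

for category, names in by_category.items():
    print(f"{category}: {len(names)} indicators")
```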
# to load/download indicators, you can use the `.load_data` method
# you can also specify whether to download "country", "region", or "all"
aids.load_data(
    indicator="Trend of AIDS-related deaths",
    area_grouping="region"
)
# get the data as a DataFrame
df = aids.get_data()
# print a sample of 10 rows
print(df.sample(10))
area_name | area_id | year | indicator | dimension | value |
---|---|---|---|---|---|
Global | 03M49WLD | 2013 | Trend of AIDS-related deaths | All ages estimate | 1.061395e+06 |
Latin America | UNALA | 2021 | Trend of AIDS-related deaths | All ages estimate | 2.916500e+04 |
Middle East and North Africa | UNAMENA | 2018 | Trend of AIDS-related deaths | All ages lower estimate | 4.089657e+03 |
Western & Central Europe and North America | UNAWCENA | 2019 | Trend of AIDS-related deaths | All ages estimate | 1.305140e+04 |
Caribbean | UNACAR | 2021 | Trend of AIDS-related deaths | All ages lower estimate | 4.213485e+03 |
Middle East and North Africa | UNAMENA | 2021 | Trend of AIDS-related deaths | All ages upper estimate | 6.867407e+03 |
Western & Central Europe and North America | UNAWCENA | 2016 | Trend of AIDS-related deaths | All ages upper estimate | 1.771698e+04 |
Western & Central Europe and North America | UNAWCENA | 2020 | Trend of AIDS-related deaths | All ages upper estimate | 1.632782e+04 |
Eastern Europe and Central Asia | UNAEECA | 2017 | Trend of AIDS-related deaths | All ages upper estimate | 4.553729e+04 |
Latin America | UNALA | 2020 | Trend of AIDS-related deaths | All ages upper estimate | 4.577862e+04 |
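The `dimension` column in the output above distinguishes the central estimate from its lower and upper bounds. Here is a plain-Python sketch of keeping only the central estimate, using rows copied from the sample. On a DataFrame the same filter would be a boolean selection on the `dimension` column.

```python
# Rows copied from the sample above: (area_name, year, dimension, value).
rows = [
    ("Global", 2013, "All ages estimate", 1.061395e06),
    ("Latin America", 2021, "All ages estimate", 2.916500e04),
    ("Middle East and North Africa", 2018, "All ages lower estimate", 4.089657e03),
    ("Caribbean", 2021, "All ages lower estimate", 4.213485e03),
    ("Latin America", 2020, "All ages upper estimate", 4.577862e04),
]

# Keep only the central estimate, dropping the lower and upper bounds.
central = [row for row in rows if row[2] == "All ages estimate"]

for area_name, year, _, value in central:
    print(f"{area_name} ({year}): {value:,.0f}")
```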
As with other bblocks tools, you can also get multiple indicators at once (see the World Bank example).
In all cases, if you have already downloaded the data and want to update it, call `.update_data()` after loading the data to refresh it:
aids.update_data(reload_data=True)
# Import the SDR object from the sdr module of "import_tools"
from bblocks.import_tools.sdr import SDR
# Create an SDR object
sdr = SDR()
# To view the latest date for which data is available,
# call the `.latest_date()` method
sdr.latest_date()
# To download the latest data
sdr.load_data(date="latest")
# To get the data as a DataFrame. You can specify getting a
# specific indicator by using 'indicator'. In this case,
# we'll get holdings (allocations are also available)
df = sdr.get_data(indicator="holdings")
# Print a sample of 10 rows
print(df.sample(10))
entity | indicator | value | date |
---|---|---|---|
Samoa | holdings | 1.584296e+07 | 2023-01-31 |
Iraq | holdings | 3.301367e+07 | 2023-01-31 |
Lao People's Democratic Republic | holdings | 5.870183e+07 | 2023-01-31 |
Haiti | holdings | 9.169516e+07 | 2023-01-31 |
Bahamas, The | holdings | 1.245326e+08 | 2023-01-31 |
Total | holdings | 6.606989e+11 | 2023-01-31 |
Libya | holdings | 3.187335e+09 | 2023-01-31 |
Namibia | holdings | 1.783556e+08 | 2023-01-31 |
Tajikistan, Republic of | holdings | 1.891507e+08 | 2023-01-31 |
Malta | holdings | 2.499760e+08 | 2023-01-31 |
In all cases, if you have already downloaded the data and want to update it, call `.update_data()` after loading the data to refresh it:
sdr.update_data(reload_data=True)
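Note that the holdings sample above contains a "Total" row alongside the individual entities, so it needs to be excluded before summing or ranking. The sketch below uses a few values copied from that sample to express each entity's holdings as a percentage of the total. It is plain Python for illustration rather than a bblocks feature.

```python
# Values copied from the sample above: entity -> SDR holdings on 2023-01-31.
holdings = {
    "Samoa": 1.584296e07,
    "Haiti": 9.169516e07,
    "Libya": 3.187335e09,
    "Total": 6.606989e11,
}

# The "Total" row is an aggregate, so separate it from the individual entities.
total = holdings["Total"]
shares = {
    entity: round(100 * value / total, 3)
    for entity, value in holdings.items()
    if entity != "Total"
}

for entity, share in shares.items():
    print(f"{entity}: {share}% of total holdings")
```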
For this example, we will continue using the SDR data as above.
from bblocks import add_income_level_column
# We can add the column by passing the dataframe to the function
df = add_income_level_column(
    df=df,
    id_column="entity",
    id_type="regex",  # so the text can be matched to the right country
)
Which adds the income level column:
entity | indicator | value | date | income_level |
---|---|---|---|---|
Montenegro, Republic of | holdings | 7.404593e+07 | 2023-01-31 | Upper middle income |
Gambia, The | holdings | 5.857020e+07 | 2023-01-31 | Low income |
Suriname | holdings | 1.211070e+08 | 2023-01-31 | Upper middle income |
Syrian Arab Republic | holdings | 5.636629e+08 | 2023-01-31 | Low income |
Iran, Islamic Republic of | holdings | 4.976198e+09 | 2023-01-31 | Lower middle income |
Uruguay | holdings | 6.330507e+08 | 2023-01-31 | High income |
South Africa | holdings | 4.424154e+09 | 2023-01-31 | Upper middle income |
Nigeria | holdings | 3.755370e+09 | 2023-01-31 | Lower middle income |
Dominican Republic | holdings | 4.498683e+08 | 2023-01-31 | Upper middle income |
Trinidad and Tobago | holdings | 7.722810e+08 | 2023-01-31 | High income |
An optional argument can be passed to the function to redownload the income classification data from the World Bank.
df = add_income_level_column(
    df=df,
    id_column="entity",
    id_type="regex",
    update_data=True,
)
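The `id_type="regex"` setting matters because names such as "Gambia, The" or "Iran, Islamic Republic of" do not match a country list exactly. The toy sketch below illustrates the idea of pattern-based matching. The `PATTERNS` table and `match_iso3` function are hypothetical and deliberately tiny; the real matching, which bblocks delegates to the country_converter package, covers far more names and edge cases.

```python
import re

# Hypothetical, deliberately tiny pattern table: ISO3 code -> regex.
PATTERNS = {
    "GMB": re.compile(r"gambia", re.IGNORECASE),
    "IRN": re.compile(r"\biran\b", re.IGNORECASE),
    "SYR": re.compile(r"syria", re.IGNORECASE),
}


def match_iso3(name):
    """Return the ISO3 code whose pattern matches the name, or None if no match."""
    for iso3, pattern in PATTERNS.items():
        if pattern.search(name):
            return iso3
    return None


for entity in ["Gambia, The", "Iran, Islamic Republic of", "Syrian Arab Republic"]:
    print(entity, "->", match_iso3(entity))
```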
For this example, we will work with data on military expenditure downloaded using the World Bank tool.
# First import the function from the `add` module of `dataframe_tools`
from bblocks.dataframe_tools.add import add_gdp_share_column
from bblocks import WorldBankData
# this data is in local currency units
df = WorldBankData().load_data(indicator="MS.MIL.XPND.CN", most_recent_only=True).get_data()
date | iso_code | indicator_code | value |
---|---|---|---|
2021-01-01 | BDI | MS.MIL.XPND.CN | 1.351000e+11 |
2014-01-01 | YEM | MS.MIL.XPND.CN | 3.685000e+11 |
2021-01-01 | AFG | MS.MIL.XPND.CN | 2.304000e+10 |
2021-01-01 | PER | MS.MIL.XPND.CN | 9.086000e+09 |
2021-01-01 | AUS | MS.MIL.XPND.CN | 4.229595e+10 |
# Then call the function, passing the DataFrame and the column name
df = add_gdp_share_column(
    df=df,
    id_column="iso_code",
    id_type="ISO3",
    date_column="date",  # to match the gdp values with the year of the data
    value_column="value",
    decimals=1,
    usd=False,  # since the data is in local currency units
    include_estimates=True,  # to include official data and IMF estimates for GDP
)
print(df.sample(10))
Which returns a dataframe with an extra column "gdp_share".
date | iso_code | indicator_code | value | gdp_share |
---|---|---|---|---|
2021-01-01 | GIN | MS.MIL.XPND.CN | 2.406750e+12 | 1.5 |
2014-01-01 | ARE | MS.MIL.XPND.CN | 8.356800e+10 | 5.6 |
2021-01-01 | NGA | MS.MIL.XPND.CN | 1.783120e+12 | 1.0 |
2021-01-01 | GNQ | MS.MIL.XPND.CN | 9.439700e+10 | 1.4 |
2021-01-01 | ISL | MS.MIL.XPND.CN | 0.000000e+00 | 0.0 |
2021-01-01 | ESP | MS.MIL.XPND.CN | 1.652680e+10 | 1.4 |
2021-01-01 | BHR | MS.MIL.XPND.CN | 5.194000e+08 | 3.6 |
2021-01-01 | GEO | MS.MIL.XPND.CN | 9.723000e+08 | 1.6 |
2021-01-01 | MDA | MS.MIL.XPND.CN | 9.144000e+08 | 0.4 |
2013-01-01 | LAO | MS.MIL.XPND.CN | 1.782500e+11 | 0.2 |
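The arithmetic behind a share-of-GDP column can be sketched as follows, assuming the share is the value divided by GDP for the same year and currency, expressed as a percentage. `share_of_gdp` is a hypothetical function written for this example and the figures are made up; it is not how bblocks itself is implemented.

```python
def share_of_gdp(value, gdp, decimals=1):
    """Express value as a percentage of gdp, rounded to `decimals` places.

    Both inputs must be in the same currency and refer to the same year.
    """
    return round(100 * value / gdp, decimals)


# Hypothetical figures in local currency units, made up for illustration.
military_spending = 50_000_000
gdp = 2_000_000_000

print(share_of_gdp(military_spending, gdp))
```

This is also why the `usd=False` argument matters above: dividing a local-currency value by a GDP figure in US dollars would give a meaningless ratio.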
Sometimes DataFrames contain columns with numbers stored as messy text. For example:
| | iso_code | value |
---|---|---|
0 | USA | 10% |
1 | GBR | +12% |
2 | FRA | 13.4% |
3 | DEU | %14.3 |
4 | ITA | 15.3 % |
5 | ESP | 16% |
6 | CAN | 17% |
7 | JPN | 18% |
8 | AUS | 19% |
9 | CHN | 20% |
bblocks can help clean that data.
from bblocks import clean_numeric_series
df['value'] = clean_numeric_series(
    data=df['value'],
    to=float  # or if dealing with integers, use to=int
)
This returns a clean version of the data:
| | iso_code | value |
---|---|---|
0 | USA | 10.0 |
1 | GBR | 12.0 |
2 | FRA | 13.4 |
3 | DEU | 14.3 |
4 | ITA | 15.3 |
5 | ESP | 16.0 |
6 | CAN | 17.0 |
7 | JPN | 18.0 |
8 | AUS | 19.0 |
9 | CHN | 20.0 |
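To give a sense of what this kind of cleaning involves, here is a rough plain-Python sketch that reproduces the result above for the messy values shown earlier. `clean_number` is a hypothetical function, not the implementation of `clean_numeric_series`. It simply drops every character that is not a digit, a decimal point or a minus sign, so it would not cope with formats such as decimal commas.

```python
import re


def clean_number(text):
    """Strip everything except digits, the decimal point and a minus sign, then convert."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)


# Messy values copied from the first table above.
raw_values = ["10%", "+12%", "13.4%", "%14.3", "15.3 %"]
clean_values = [clean_number(value) for value in raw_values]
print(clean_values)
```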
Interested in contributing to the package? Please reach out.