This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

11 fetch preprocess gdp #16

Merged
merged 32 commits into dev from 11_fetch_preprocess_gdp
Dec 3, 2021

Conversation


@Juan-Mateos Juan-Mateos commented Nov 23, 2021

Closes #7
Closes #12
Closes #11

@Juan-Mateos Juan-Mateos marked this pull request as draft November 23, 2021 09:44
@Juan-Mateos Juan-Mateos marked this pull request as ready for review November 23, 2021 12:26
@Juan-Mateos Juan-Mateos changed the base branch from dev to 0_setup_cookiecutter November 23, 2021 13:06
Base automatically changed from 0_setup_cookiecutter to dev November 29, 2021 15:47

georgerichardson commented Nov 30, 2021

I got the following error log when I cloned the repo and ran make install, from the pip install step for xlrd (note the requirement is spelled xldr in the log):

ERROR: Could not find a version that satisfies the requirement xldr (from industrial-taxonomy[dev]) (from versions: none)
ERROR: No matching distribution found for xldr


@georgerichardson georgerichardson left a comment


General comments

  • Is there a need to have population data gathered by the GDP and population flows separately? Is this to get GDP per capita per year later?
  • Minor: inconsistencies with whitespace between lines. Have noted a few
  • Should we document the shapes of the data outputs from the flows either in the docstring or in the readme?

What I did

  • cloned the repository and ran make install on this branch from scratch
  • ran python industrial_taxonomy/pipeline/sic/flow.py --production run
  • ran python industrial_taxonomy/pipeline/official/gdp/flow.py run
  • Did not finish. See xlrd error in industrial_taxonomy/pipeline/official/utils.py below
  • ran python industrial_taxonomy/pipeline/official/nomis/flow.py run
  • Finished running
  • Example output for nomis_dict
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': 'Economic activity rate - aged 16-64',
 'value': 74.2,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': 'Employment rate - aged 16-64',
 'value': 71.4,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with NVQ4+ - aged 16-64',
 'value': 60.5,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with other qualifications (NVQ) - aged 16-64',
 'value': 8.3,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with no qualifications (NVQ) - aged 16-64',
 'value': 9.7,
 'source': 'aps'},
  • ran python industrial_taxonomy/pipeline/official/population/flow.py run
  • Finished running
  • Example output for pop_est
...
 'N09000006': 117337,
'N09000007': 146452,
'N09000008': 139443,
'N09000009': 148953,
'N09000010': 181669}

return response.content


def excel_to_df(content: bytes, sheets: int, skiprows: int):

This throws `ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl`, I think because pandas now defaults to using openpyxl to open .xlsx files. I think you either need to specify the engine or add openpyxl to requirements.txt

Separately, I'm not sure I see the value in this function as it's just a very thin wrapper around pd.read_excel that removes access to some parameters.
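If the wrapper is kept, making the engine explicit documents the dependency (openpyxl still has to be installed either way). A minimal sketch, assuming the wrapper keeps roughly this signature:

```python
import io

import pandas as pd


def excel_to_df(content: bytes, sheet: int, skiprows: int) -> pd.DataFrame:
    """Read one sheet of an in-memory Excel workbook into a DataFrame."""
    # engine="openpyxl" makes the .xlsx dependency explicit rather than
    # relying on pandas' default engine selection at runtime.
    return pd.read_excel(
        io.BytesIO(content), sheet_name=sheet, skiprows=skiprows, engine="openpyxl"
    )
```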

@@ -12,8 +12,6 @@

## Features

### Pipeline

Read instructions to reproduce the pipeline in `industrial_taxonomy/pipeline/README.md`.

Add equivalents for nomis, gdp and population?


# Create dfs for the sheets with relevant information (population and GVA)
self._pop_raw, self._gva_raw = [
excel_to_df(gdp_table, sheets=sh, skiprows=1) for sh in [7, 8]

More of a general question as to whether these kinds of hard coded variables should be kept in a config file?

Comment on lines 11 to 12
_APS_URL = "https://www.nomisweb.co.uk/api/v01/dataset/NM_17_5.data.csv?geography=1811939329...1811939332,1811939334...1811939336,1811939338...1811939497,1811939499...1811939501,1811939503,1811939505...1811939507,1811939509...1811939517,1811939519,1811939520,1811939524...1811939570,1811939575...1811939599,1811939601...1811939628,1811939630...1811939634,1811939636...1811939647,1811939649,1811939655...1811939664,1811939667...1811939680,1811939682,1811939683,1811939685,1811939687...1811939704,1811939707,1811939708,1811939710,1811939712...1811939717,1811939719,1811939720,1811939722...1811939730&date=2019-12&variable=18,45,290,335,344&measures=20599,21001,21002,21003"
_ASHE_URL = "https://www.nomisweb.co.uk/api/v01/dataset/NM_30_1.data.csv?geography=1811939329...1811939332,1811939334...1811939336,1811939338...1811939497,1811939499...1811939501,1811939503,1811939505...1811939507,1811939509...1811939517,1811939519,1811939520,1811939524...1811939570,1811939575...1811939599,1811939601...1811939628,1811939630...1811939634,1811939636...1811939647,1811939649,1811939655...1811939664,1811939667...1811939680,1811939682,1811939683,1811939685,1811939687...1811939704,1811939707,1811939708,1811939710,1811939712...1811939717,1811939719,1811939720,1811939722...1811939730&date=latest&sex=8&item=2&pay=7&measures=20100,20701"

Minor: could split into multiline

@project(name="industrial_taxonomy")
class NomisTables(FlowSpec):
"""Flow to collect APS / ASHE data from NOMIS
Attributes:

Suggested change: add a blank line before `Attributes:`.

value_column: value column
source: data source
indicator_column: column that contains the indicator
Returns:

Suggested change: add a blank line before `Returns:`.

from metaflow import FlowSpec, project, step
from metaflow.decorators import StepDecorator

URL = (

Maybe a more specific name?

Comment on lines 7 to 9
"https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/",
"datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland/",
"mid2020/ukpopestimatesmid2020on2021geography.xls",

If the comma separation from these is removed then no need for self.url = "".join(URL) below
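To illustrate the difference with a toy URL (not the real ONS path): with trailing commas the parentheses build a tuple that has to be joined later; without them, Python concatenates the literals into one string at compile time.

```python
# With commas: this is a tuple, so the flow needs "".join(...) later.
WITH_COMMAS = (
    "https://example.org/",
    "datasets/",
    "popestimates.xls",
)

# Without commas: adjacent literals concatenate into a single str,
# so no join is needed.
WITHOUT_COMMAS = (
    "https://example.org/"
    "datasets/"
    "popestimates.xls"
)
```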

Comment on lines 9 to 11
"https://www.ons.gov.uk/file?uri=/economy/grossdomesticproductgdp/datasets/",
"regionalgrossdomesticproductlocalauthorities/1998to2019/",
"regionalgrossdomesticproductlocalauthorities.xlsx",

If the comma separation from these is removed then no need for self.url = "".join(URL) below

.rename(columns={x: str(x) for x in range(1998, 2020)})
)

_table.columns = [re.sub(" ", "_", col.lower()) for col in _table.columns]

I think in general it's preferable to use Python string functions (in this case .replace()) over regex to avoid added complexity and for speed, although in this case it hardly makes a difference :D
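For a literal, fixed substitution the two are interchangeable; str.replace just skips the pattern machinery:

```python
import re

col = "LA name"
# str.replace needs no pattern compilation and cannot be tripped up by
# regex metacharacters in the search term; re.sub is only needed when
# the search term is a genuine regular expression.
snake_plain = col.lower().replace(" ", "_")
snake_regex = re.sub(" ", "_", col.lower())
```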

@Juan-Mateos

Thanks for the review! I have:

  • Added openpyxl to requirements
  • Added instructions to README
  • Put hardcoded variables (excel sheets / year ranges) at the top of the script
  • Made URL names more informative, split into multiline, removed commas
  • Inserted whitespace before Attributes, Arguments and Returns
  • Replaced regex with .replace()

@Juan-Mateos

And then I added the getters.


@bishax bishax left a comment


There's a lot of minor comments here but this is much easier to read, review, and use than on previous projects 😄

One difficulty I ran into was in running the getters: it looks like the last time you ran LocalGdpData and NomisTables you ran them with --production and --datastore local, which meant that when I came to load the data it failed, because it was trying to load a local path from your machine.
Don't combine --datastore local and --production. I've just added an issue to metaflow_extensions to explore whether it's possible to make metaflow raise a warning when this happens.
I've re-run the flows with --production and the S3 datastore so that the getters work for everyone again.

@@ -16,7 +16,6 @@ def get_yaml_config(file_path: Path) -> Optional[dict]:
if file_path.exists():
with open(file_path, "rt") as f:
return yaml.load(f.read(), Loader=yaml.FullLoader)
return None
I think this somehow got removed when merging in dev

Still need to add this back in

Comment on lines 12 to 23
@lru_cache()
def get_run(flow: str):
"""Last successful run executed with `--production`.

Arguments:
flow: the official data flow we want to get
"""
runs = Flow(flow).runs("project_branch:prod")
try:
return next(filter(lambda run: run.successful, runs))
except StopIteration as exc:
raise MetaflowNotFound("Matching run not found") from exc

Note: #25 adds this as a general util and refactors existing uses; as this PR will probably merge first, I can refactor in #25.

https://github.com/nestauk/industrial_taxonomy/pull/25/files#diff-00b2db5bbabdce7be9d5602c61e3c8d66ac8628e428e6149234027c25230320d

@@ -0,0 +1,109 @@
# Data getters for official data

Don't forget to run flake8; it will flag up a lot of minor things to fix...

❯ flake8 industrial_taxonomy 
industrial_taxonomy/getters/official.py:4:1: F401 'typing.Dict' imported but unused
industrial_taxonomy/getters/official.py:4:1: F401 'typing.Optional' imported but unused
industrial_taxonomy/getters/official.py:6:1: F401 'metaflow.Run' imported but unused
industrial_taxonomy/getters/official.py:15:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:15:1: DAR401 Missing exception(s) in Raises section: -r MetaflowNotFound
industrial_taxonomy/getters/official.py:27:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:43:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:59:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:80:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/pipeline/official/utils.py:5:1: F401 'pandas as pd' imported but unused
industrial_taxonomy/pipeline/official/population/utils.py:2:1: F401 'pandas as df' imported but unused
industrial_taxonomy/pipeline/official/population/flow.py:4:1: F401 'metaflow.decorators.StepDecorator' imported but unused
industrial_taxonomy/pipeline/official/population/flow.py:7:112: B950 line too long (111 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/utils.py:14:1: DAR101 Missing parameter(s) in Docstring: - **kwargs
industrial_taxonomy/pipeline/official/nomis/flow.py:6:1: F401 'metaflow.decorators.StepDecorator' imported but unused
industrial_taxonomy/pipeline/official/nomis/flow.py:12:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:13:101: B950 line too long (100 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:14:101: B950 line too long (100 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:15:102: B950 line too long (101 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:16:109: B950 line too long (108 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:17:113: B950 line too long (112 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:18:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:22:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:23:114: B950 line too long (113 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:24:112: B950 line too long (111 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:25:114: B950 line too long (113 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:26:110: B950 line too long (109 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:27:111: B950 line too long (110 > 80 characters)
industrial_taxonomy/pipeline/official/gdp/utils.py:4:1: F401 're' imported but unused


@lru_cache()
def get_run(flow: str):
"""Last successful run executed with `--production`.

Here flake8 raises the following:

industrial_taxonomy/getters/official.py:15:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:15:1: DAR401 Missing exception(s) in Raises section: -r MetaflowNotFound

This might seem pedantic, but it is useful to have this info for complex functions that may be used across modules.

For simpler functions with:

  • Obvious function name
  • A small number of obvious parameter names like this (suggest that flow_name is better than flow as it's less likely someone using it erroneously thinks they need to pass in a metaflow.Flow object)
  • Type annotations (suggest adding -> Run for the return type)

our flake8 config allows you to write a one-line docstring and not get nagged for Args/Returns/Raises
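Putting those suggestions together, a sketch of the signature only (a string annotation is used here so the sketch doesn't import metaflow; the real module would `from metaflow import Run` and keep the body from the PR):

```python
from functools import lru_cache


@lru_cache()
def get_run(flow_name: str) -> "Run":  # noqa: F821 - Run comes from metaflow
    """Last successful run of `flow_name` executed with `--production`."""
    ...  # body omitted in this sketch; unchanged from the PR
```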

@@ -0,0 +1,109 @@
# Data getters for official data

For a top-level description, use doc-string style rather than a # comment.
Anyone building API documentation with tools like Sphinx will then get a module-level docstring.

Comment on lines 11 to 21
_table = (
table.dropna(
axis=0, subset=["LA code"] # We are dropping bottom rows without a LA code
)
.rename(columns={"2019\n[note 3]": "2019"})
.rename(columns={x: str(x) for x in YEAR_RANGE})
)

_table.columns = [col.lower().replace(" ", "_") for col in _table.columns]

return _table

pandas-fu tip: rename can take a function too...

Suggested change
_table = (
table.dropna(
axis=0, subset=["LA code"] # We are dropping bottom rows without a LA code
)
.rename(columns={"2019\n[note 3]": "2019"})
.rename(columns={x: str(x) for x in YEAR_RANGE})
)
_table.columns = [col.lower().replace(" ", "_") for col in _table.columns]
return _table
return (
table.dropna(axis=0, subset=["LA code"])
.rename(columns={"2019\n[note 3]": "2019"})
.rename(columns={x: str(x) for x in YEAR_RANGE})
.rename(columns=lambda s: s.lower().replace(" ", "_"))
)

value_column: str,
source: str,
indicator_column: str = "MEASURES_NAME",
**kwargs,

kwargs isn't used

Suggested change: remove the unused `**kwargs,` parameter.

.rename(columns=str.lower)
.assign(
date=lambda df: [ # We parse APS dates returned in format "y-m"
d if type(d) == int else int(d.split("-")[0]) for d in df["date"]

[Not requesting the change but] it may have been a better choice to separate things out a little more if the logic between processing APS and ASHE differs rather than having conditional logic embedded deep within the processing
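A hedged sketch of what that separation might look like (hypothetical helper names, not the repo's actual functions):

```python
def parse_aps_date(raw) -> int:
    """APS returns dates like '2019-12'; keep only the year."""
    return int(str(raw).split("-")[0])


def parse_ashe_date(raw) -> int:
    """ASHE dates already arrive as plain years."""
    return int(raw)


def parse_date(raw, source: str) -> int:
    """Dispatch on the data source instead of type-checking inside the pipeline."""
    return parse_aps_date(raw) if source == "aps" else parse_ashe_date(raw)
```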

def transform(self):
"""Transform the population table"""

from utils import clean_popest

Inconsistent with line 28

)


def nomis():

Q: from a user perspective - would it be better to have a getter for each indicator? i.e. will they be used all together or not?

@Juan-Mateos

Juan-Mateos commented Dec 3, 2021

I did the following:

  • Ran flake8 and fixed problems
  • Simplified doc-string for get_run and changed argument name
  • Changed top-level descriptions to doc-string
  • Added Returns to getters
  • Added run parameter to getters
  • Renamed itl1_region as nuts1_name
  • Fixed doc-string about columns being returned from gva_pc_lad
  • Made imports consistent
  • Changed get function so it returns a response object instead of content
  • Made process_gdp_table more informative
  • Chained rename in process_gdp_table
  • Removed unused kwargs argument from process_nomis
  • Re-ran all flows without datastore=local
  • Tested all getters

I didn't:

  • Separate aps from ashe processing in the process_nomis util
  • Create separate getters for nomis variables (they are used together so this isn't necessary here)

Note that:

  • I had to keep the pandas import outside of the flows. It was required for the typing hints at the beginning of the flows. When I tried to import pandas as pd when declaring the flow I got the following error: Cannot pickle dump artifact named "pd". I suppose I could:
    • Remove the pd.DataFrame typing hints
    • Declare the typing hints somewhere else in the script (in the last step?)

@Juan-Mateos Juan-Mateos requested a review from bishax December 3, 2021 08:55

@bishax bishax left a comment


For flake8 I get the following (I think you added the run arg after running flake8, so missed these):

industrial_taxonomy/getters/official.py:23:1: DAR101 Missing parameter(s) in Docstring: - run
industrial_taxonomy/getters/official.py:44:1: DAR101 Missing parameter(s) in Docstring: - run
industrial_taxonomy/getters/official.py:89:1: DAR101 Missing parameter(s) in Docstring: - run


There's then one comment about the fact that a few itl1_region columns are still around.


On the pandas imports, I had to work around the same issue in #25 as follows (follow link)...

try: # Hack for type-hints on attributes

There is an import within the flow steps, as well as a top-level import inside a try-except block that ignores the ImportError that may happen when executing the flow in a remote environment like Batch before it has installed pandas (Metaflow has to run the file to understand the structure of the flow and the dependencies it needs).
Flow-level type annotations are then written as strings, as below, so that static analysis tools like flake8 and mypy can still resolve the imports, but at runtime we don't get an error on Batch when the DataFrame object doesn't exist (because the ImportError above meant DataFrame was never defined):

class MyFlow(FlowSpec):
  some_data_artifact: "DataFrame"
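A fuller sketch of the pattern (a plain class stands in for the real FlowSpec subclass, and pandas is assumed available locally):

```python
try:  # Hack for type-hints on attributes: the import resolves locally,
    # but a remote environment (e.g. Batch) may parse this file before
    # pandas has been installed, so swallow the ImportError.
    from pandas import DataFrame  # noqa: F401
except ImportError:
    pass


class MyFlow:  # the real flow subclasses metaflow.FlowSpec
    # String annotation: flake8/mypy can resolve it, but it is never
    # evaluated at runtime, so a missing DataFrame name cannot raise.
    some_data_artifact: "DataFrame"
```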

After the above, then:

  • Re-run the flows with --production
  • Get latest changes from dev

Then all good from me.

Comment on lines 27 to 85
Name: itl1_region, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
Name: la_code, dtype: str, local authority code
Name: la_name, dtype: str, local authority name
Name: year, dtype: int, year (ranges between 1998 and 2019)
Name: gva, dtype: float, £M Gross value added

"""

if run is None:
run = get_run("LocalGdpData")

return pd.DataFrame(run.data.gva).melt(
id_vars=["itl1_region", "la_code", "la_name"], var_name="year", value_name="gva"
)


def population_lad(run: Optional[Run] = None):
"""get the population in a local authority

Returns:
Columns:
Name: itl1_region, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
Name: la_code, dtype: str, local authority code
Name: la_name, dtype: str, local authority name
Name: year, dtype: int, year (ranges between 1998 and 2019)
Name: pop, dtype: float, population

"""

if run is None:
run = get_run("LocalGdpData")

return pd.DataFrame(run.data.pop).melt(
id_vars=["itl1_region", "la_code", "la_name"], var_name="year", value_name="pop"
)


def gva_pc_lad():
"""Get the GVA per capita in a local authority

Returns:
Columns:
Name: nuts1_name, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
Name: la_code, dtype: str, local authority code
Name: la_name, dtype: str, local authority name
Name: year, dtype: int, year (ranges between 1998 and 2019)
Name: gva_pc, dtype: float, GVA per capita

"""

gva = gva_lad()
pop = population_lad()

return (
gva.merge(pop, on=["itl1_region", "la_code", "la_name", "year"])
.assign(gva_pc=lambda df: (1e6 * df["gva"] / df["pop"]).round(2))
.rename(columns={"itl1_region": "nuts1_name"})
.drop(axis=1, labels=["gva", "pop"])
)

itl1_region is renamed to nuts1_name for only 1 out of 3 getters

@@ -16,7 +16,6 @@ def get_yaml_config(file_path: Path) -> Optional[dict]:
if file_path.exists():
with open(file_path, "rt") as f:
return yaml.load(f.read(), Loader=yaml.FullLoader)
return None

Still need to add this back in

@Juan-Mateos

Juan-Mateos commented Dec 3, 2021

I did the following:

  • Fixed the new flake8 problems
  • Included the typing import 'hack' in the gdp and nomis flows
  • Changed the itl1_region names in all getters
  • Added None back in industrial_taxonomy/__init__.py
  • Re-ran all the flows
  • Tested the getters

@Juan-Mateos Juan-Mateos merged commit c4890b5 into dev Dec 3, 2021
@Juan-Mateos Juan-Mateos deleted the 11_fetch_preprocess_gdp branch December 3, 2021 12:45