I got the following error log when I cloned the repo and ran
General comments
- Is there a need to have population data gathered from the GDP and population flows separately? Is this to get GDP per capita per year later?
- Minor: inconsistencies with whitespace between lines. Have noted a few.
- Should we document the shapes of the data outputs from the flows either in the docstring or in the readme?
What I did
- Cloned the repository and ran `make install` on this branch from scratch
- Ran `python industrial_taxonomy/pipeline/sic/flow.py --production run`
- Ran `python industrial_taxonomy/pipeline/official/gdp/flow.py run`. Did not finish: see the xlrd error in `industrial_taxonomy/pipeline/official/utils.py` below
- Ran `python industrial_taxonomy/pipeline/official/nomis/flow.py run`. Finished running. Example output for `nomis_dict`:

```python
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': 'Economic activity rate - aged 16-64',
 'value': 74.2,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': 'Employment rate - aged 16-64',
 'value': 71.4,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with NVQ4+ - aged 16-64',
 'value': 60.5,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with other qualifications (NVQ) - aged 16-64',
 'value': 8.3,
 'source': 'aps'},
{'date': 2019,
 'geography_name': 'Tower Hamlets',
 'geography_code': 'E09000030',
 'variable': '% with no qualifications (NVQ) - aged 16-64',
 'value': 9.7,
 'source': 'aps'},
```
- Ran `python industrial_taxonomy/pipeline/official/population/flow.py run`. Finished running. Example output for `pop_est`:

```python
...
'N09000006': 117337,
'N09000007': 146452,
'N09000008': 139443,
'N09000009': 148953,
'N09000010': 181669}
```
```python
return response.content
```

```python
def excel_to_df(content: bytes, sheets: int, skiprows: int):
```
This throws `ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl`. I think this is because pandas now defaults to using openpyxl to open .xlsx files; you either need to specify the engine or add openpyxl to requirements.txt.

Separately, I'm not sure I see the value in this function, as it's just a very thin wrapper around `pd.read_excel` that removes access to some of its parameters.
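To make the engine point concrete, here is a minimal sketch of pinning it explicitly. The `excel_to_df` signature is taken from the diff above; wrapping the bytes in `io.BytesIO` is an assumption about how the function is implemented, not the author's code.

```python
import io

import pandas as pd


def excel_to_df(content: bytes, sheets: int, skiprows: int) -> pd.DataFrame:
    """Read one sheet of an .xlsx payload, pinning the engine explicitly."""
    return pd.read_excel(
        io.BytesIO(content),
        sheet_name=sheets,
        skiprows=skiprows,
        engine="openpyxl",  # don't rely on pandas' default engine choice
    )
```

With the engine pinned, the failure mode becomes an explicit missing-dependency error at call time rather than a surprise change when pandas' defaults move.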
```
@@ -12,8 +12,6 @@
## Features
### Pipeline
Read instructions to reproduce the pipeline in `industrial_taxonomy/pipeline/README.md`.
```
Add equivalents for nomis, gdp and population?
```python
# Create dfs for the sheets with relevant information (population and GVA)
self._pop_raw, self._gva_raw = [
    excel_to_df(gdp_table, sheets=sh, skiprows=1) for sh in [7, 8]
]
```
More of a general question: should these kinds of hard-coded variables be kept in a config file?
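Not the author's setup, but one hedged sketch of what that could look like, reusing the YAML loading the repo already has (`get_yaml_config` in `industrial_taxonomy/__init__.py`). The config file name and key layout here are hypothetical; the values (sheets 7 and 8, `skiprows=1`) come from the snippet above.

```python
import yaml  # PyYAML, already a dependency of get_yaml_config

# Hypothetical contents of e.g. config/pipeline.yaml, shown inline so the
# sketch is self-contained.
CONFIG_TEXT = """
gdp:
  sheets:
    population: 7
    gva: 8
  skiprows: 1
"""

config = yaml.safe_load(CONFIG_TEXT)
sheets = [config["gdp"]["sheets"]["population"], config["gdp"]["sheets"]["gva"]]
skiprows = config["gdp"]["skiprows"]
```

The trade-off is discoverability (one place to see all magic numbers) against indirection when reading the flow code.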
```python
_APS_URL = "https://www.nomisweb.co.uk/api/v01/dataset/NM_17_5.data.csv?geography=1811939329...1811939332,1811939334...1811939336,1811939338...1811939497,1811939499...1811939501,1811939503,1811939505...1811939507,1811939509...1811939517,1811939519,1811939520,1811939524...1811939570,1811939575...1811939599,1811939601...1811939628,1811939630...1811939634,1811939636...1811939647,1811939649,1811939655...1811939664,1811939667...1811939680,1811939682,1811939683,1811939685,1811939687...1811939704,1811939707,1811939708,1811939710,1811939712...1811939717,1811939719,1811939720,1811939722...1811939730&date=2019-12&variable=18,45,290,335,344&measures=20599,21001,21002,21003"
_ASHE_URL = "https://www.nomisweb.co.uk/api/v01/dataset/NM_30_1.data.csv?geography=1811939329...1811939332,1811939334...1811939336,1811939338...1811939497,1811939499...1811939501,1811939503,1811939505...1811939507,1811939509...1811939517,1811939519,1811939520,1811939524...1811939570,1811939575...1811939599,1811939601...1811939628,1811939630...1811939634,1811939636...1811939647,1811939649,1811939655...1811939664,1811939667...1811939680,1811939682,1811939683,1811939685,1811939687...1811939704,1811939707,1811939708,1811939710,1811939712...1811939717,1811939719,1811939720,1811939722...1811939730&date=latest&sex=8&item=2&pay=7&measures=20100,20701"
```
Minor: could split into multiline
```python
@project(name="industrial_taxonomy")
class NomisTables(FlowSpec):
    """Flow to collect APS / ASHE data from NOMIS

    Attributes:
```
Suggested change: whitespace-only fix on the `Attributes:` line.
```python
        value_column: value column
        source: data source
        indicator_column: column that contains the indicator

    Returns:
```
Suggested change: whitespace-only fix on the `Returns:` line.
```python
from metaflow import FlowSpec, project, step
from metaflow.decorators import StepDecorator

URL = (
```
Maybe a more specific name?
```python
    "https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/",
    "datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland/",
    "mid2020/ukpopestimatesmid2020on2021geography.xls",
```
If the comma separation is removed from these then there's no need for the `self.url = "".join(URL)` below.
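A quick illustration of the point (the URL pieces here are made up): without commas, Python joins adjacent string literals at compile time, so no `"".join(URL)` is needed and `URL` is already a single string.

```python
# Hypothetical URL pieces, for illustration only.
URL = (
    "https://example.test/part-one/"
    "part-two/"
    "file.xls"
)
```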
```python
    "https://www.ons.gov.uk/file?uri=/economy/grossdomesticproductgdp/datasets/",
    "regionalgrossdomesticproductlocalauthorities/1998to2019/",
    "regionalgrossdomesticproductlocalauthorities.xlsx",
```
If the comma separation is removed from these then there's no need for the `self.url = "".join(URL)` below.
```python
    .rename(columns={x: str(x) for x in range(1998, 2020)})
)

_table.columns = [re.sub(" ", "_", col.lower()) for col in _table.columns]
```
I think it's generally preferable to use Python string methods (in this case `.replace()`) over regex, to avoid added complexity and for speed, although in this case it hardly makes a difference :D
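For a plain one-character substitution the two spellings are equivalent; a tiny check using one of the column names from the output above:

```python
import re

col = "Economic activity rate - aged 16-64"

# String method vs regex: identical result for a literal substitution.
assert col.lower().replace(" ", "_") == re.sub(" ", "_", col.lower())
```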
Thanks for the review! I have:

And then I added the
There's a lot of minor comments here but this is much easier to read, review, and use than on previous projects 😄

One difficulty I ran into was in running the getters... it looks like the last time you ran `LocalGdpData` and `NomisTables` you ran them with `--production` and `--datastore local`, which means that when it came for me to load the data, it failed because it was trying to load a local path from your machine. Don't combine `--datastore local` and `--production` - I've just added an issue to metaflow_extensions to explore whether it's possible to make metaflow raise a warning when this happens. I've rerun the flows with `--production` and the S3 datastore so that the getters work for everyone again.
industrial_taxonomy/__init__.py (Outdated)

```python
@@ -16,7 +16,6 @@ def get_yaml_config(file_path: Path) -> Optional[dict]:
    if file_path.exists():
        with open(file_path, "rt") as f:
            return yaml.load(f.read(), Loader=yaml.FullLoader)
    return None
```

I think this somehow got removed when merging in dev
Still need to add this back in
```python
@lru_cache()
def get_run(flow: str):
    """Last successful run executed with `--production`.

    Arguments:
        flow: the official data flow we want to get
    """
    runs = Flow(flow).runs("project_branch:prod")
    try:
        return next(filter(lambda run: run.successful, runs))
    except StopIteration as exc:
        raise MetaflowNotFound("Matching run not found") from exc
```
```python
@@ -0,0 +1,109 @@
# Data getters for official data
```
Don't forget to run `flake8`; it will flag up a lot of minor things to fix...
```
❯ flake8 industrial_taxonomy
industrial_taxonomy/getters/official.py:4:1: F401 'typing.Dict' imported but unused
industrial_taxonomy/getters/official.py:4:1: F401 'typing.Optional' imported but unused
industrial_taxonomy/getters/official.py:6:1: F401 'metaflow.Run' imported but unused
industrial_taxonomy/getters/official.py:15:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:15:1: DAR401 Missing exception(s) in Raises section: -r MetaflowNotFound
industrial_taxonomy/getters/official.py:27:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:43:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:59:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:80:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/pipeline/official/utils.py:5:1: F401 'pandas as pd' imported but unused
industrial_taxonomy/pipeline/official/population/utils.py:2:1: F401 'pandas as df' imported but unused
industrial_taxonomy/pipeline/official/population/flow.py:4:1: F401 'metaflow.decorators.StepDecorator' imported but unused
industrial_taxonomy/pipeline/official/population/flow.py:7:112: B950 line too long (111 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/utils.py:14:1: DAR101 Missing parameter(s) in Docstring: - **kwargs
industrial_taxonomy/pipeline/official/nomis/flow.py:6:1: F401 'metaflow.decorators.StepDecorator' imported but unused
industrial_taxonomy/pipeline/official/nomis/flow.py:12:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:13:101: B950 line too long (100 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:14:101: B950 line too long (100 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:15:102: B950 line too long (101 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:16:109: B950 line too long (108 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:17:113: B950 line too long (112 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:18:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:22:90: B950 line too long (89 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:23:114: B950 line too long (113 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:24:112: B950 line too long (111 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:25:114: B950 line too long (113 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:26:110: B950 line too long (109 > 80 characters)
industrial_taxonomy/pipeline/official/nomis/flow.py:27:111: B950 line too long (110 > 80 characters)
industrial_taxonomy/pipeline/official/gdp/utils.py:4:1: F401 're' imported but unused
```
```python
@lru_cache()
def get_run(flow: str):
    """Last successful run executed with `--production`.
```
Here `flake8` raises the following:

```
industrial_taxonomy/getters/official.py:15:1: DAR201 Missing "Returns" in Docstring: - return
industrial_taxonomy/getters/official.py:15:1: DAR401 Missing exception(s) in Raises section: -r MetaflowNotFound
```

This might seem a bit anal but it is useful to have this info for complex functions that may be used across modules. For simpler functions with:

- An obvious function name
- A small number of obvious parameter names like this (suggest that `flow_name` is better than `flow`, as it's less likely someone using it erroneously thinks they need to pass in a `metaflow.Flow` object)
- Type annotations (suggest adding `-> Run` for the return type)

our flake8 config allows you to write a one-line docstring and not get nagged for Args/Returns/Raises.
```python
@@ -0,0 +1,109 @@
# Data getters for official data
```
For a top-level description, use docstring style rather than a `#` comment. Anyone building API documentation with tools like Sphinx will then get a module-level docstring.
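A small demonstration of the difference, with the module source shown as strings so the snippet is self-contained: a leading triple-quoted string is compiled into the module's `__doc__`, while a `#` comment is simply discarded.

```python
docstring_src = '"""Data getters for official data."""\n'
comment_src = "# Data getters for official data\n"

ns_doc, ns_comment = {}, {}
# Executing module-level code stores a leading string literal into __doc__.
exec(compile(docstring_src, "<official>", "exec"), ns_doc)
# A comment produces no bytecode at all, so nothing is stored.
exec(compile(comment_src, "<official>", "exec"), ns_comment)
```

Tools like `help()` and Sphinx read exactly this `__doc__` attribute.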
```python
_table = (
    table.dropna(
        axis=0, subset=["LA code"]  # We are dropping bottom rows without a LA code
    )
    .rename(columns={"2019\n[note 3]": "2019"})
    .rename(columns={x: str(x) for x in YEAR_RANGE})
)

_table.columns = [col.lower().replace(" ", "_") for col in _table.columns]

return _table
```

pandas-fu tip: `rename` can take a function too...

Suggested change:

```python
return (
    table.dropna(axis=0, subset=["LA code"])  # drop bottom rows without a LA code
    .rename(columns={"2019\n[note 3]": "2019"})
    .rename(columns={x: str(x) for x in YEAR_RANGE})
    .rename(columns=lambda s: s.lower().replace(" ", "_"))
)
```
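A minimal check of that `rename(columns=callable)` behaviour, on stand-in column names (the data here is invented):

```python
import pandas as pd

df = pd.DataFrame({"LA code": ["E09000030"], "Region name": ["London"]})

# rename applies the callable to every column label and returns a new frame.
out = df.rename(columns=lambda s: s.lower().replace(" ", "_"))
```

Because `rename` returns a copy, the original frame's columns are untouched, which is tidier than mutating `.columns` in place.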
```python
    value_column: str,
    source: str,
    indicator_column: str = "MEASURES_NAME",
    **kwargs,
```

`kwargs` isn't used. Suggested change: remove the `**kwargs,` line.
```python
.rename(columns=str.lower)
.assign(
    date=lambda df: [  # We parse APS dates returned in format "y-m"
        d if type(d) == int else int(d.split("-")[0]) for d in df["date"]
```
[Not requesting the change but] it may have been a better choice to separate things out a little more if the logic for processing APS and ASHE differs, rather than having conditional logic embedded deep within the processing.
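One hedged way to lift that branch out (function names invented here): give each source its own date parser and have each flow pick the right one up front, instead of type-checking per element inside the comprehension.

```python
def parse_aps_date(d: str) -> int:
    """APS dates arrive as 'y-m' strings, e.g. '2019-12'."""
    return int(d.split("-")[0])


def parse_ashe_date(d: int) -> int:
    """ASHE dates are already integer years."""
    return d
```

The per-source transforms then stay flat, and a new source with yet another date format just adds one more small function.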
```python
def transform(self):
    """Transform the population table"""

    from utils import clean_popest
```
Inconsistent from L28
```python
)


def nomis():
```
Q: from a user perspective - would it be better to have a getter for each indicator? i.e. will they be used all together or not?
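If per-indicator getters were wanted, a sketch of what one could look like. The getter name and the `variable` values follow the `nomis_dict` example output above, but the function itself is hypothetical, not part of the PR.

```python
import pandas as pd

# Toy stand-in for the combined NOMIS table, using variables from the
# example output above.
nomis_table = pd.DataFrame(
    [
        {"variable": "Employment rate - aged 16-64", "value": 71.4},
        {"variable": "% with NVQ4+ - aged 16-64", "value": 60.5},
    ]
)


def employment_rate(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the employment-rate rows of the combined NOMIS table."""
    return df[df["variable"] == "Employment rate - aged 16-64"]
```

Whether this is worth it depends on the question asked: if indicators are always analysed together, one combined getter is simpler.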
I did the following:

I didn't:

Note that:
For flake8 I get (I think you added the `run` arg after running flake8, so missed these):

```
industrial_taxonomy/getters/official.py:23:1: DAR101 Missing parameter(s) in Docstring: - run
industrial_taxonomy/getters/official.py:44:1: DAR101 Missing parameter(s) in Docstring: - run
industrial_taxonomy/getters/official.py:89:1: DAR101 Missing parameter(s) in Docstring: - run
```
There's then one comment about the fact that there are a few `itl1_region`s still around.

On the `pandas` imports, I had to work around the same issue in #25 as follows (follow link)...
```python
try:  # Hack for type-hints on attributes
```

There is an import within flow steps, as well as a top-level import within a try-except block that ignores `ImportError`s that may happen when executing the flow in a remote environment like Batch before it has installed pandas (it has to run the file to understand the structure of the flow and the dependencies it needs).

Flow-level type annotations are of the form:

```python
class MyFlow(FlowSpec):
    some_data_artifact: "DataFrame"
```

so that static analysis tools like `flake8` and `mypy` can still resolve the imports, but at runtime we don't get an error on Batch because the object `DataFrame` doesn't exist (we got an `ImportError` above, so `DataFrame` isn't defined).
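Putting the two halves of that pattern together in a self-contained sketch (`FlowSpec` is replaced by a plain base class here so the snippet runs without metaflow installed):

```python
try:  # pandas may be absent when the flow file is parsed on a fresh remote host
    from pandas import DataFrame
except ImportError:
    pass


class MyFlow:  # stand-in for metaflow.FlowSpec in this sketch
    # String annotation: type checkers resolve it, but it is never evaluated
    # at runtime, so class creation succeeds even if the import above failed.
    some_data_artifact: "DataFrame"
```

The key detail is that annotations written as strings are stored verbatim in `__annotations__` and never looked up unless something explicitly resolves them.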
After the above, then:
- Re-run the flows with `--production`
- Get latest changes from `dev`

Then all good from me.
```python
            Name: itl1_region, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
            Name: la_code, dtype: str, local authority code
            Name: la_name, dtype: str, local authority name
            Name: year, dtype: int, year (ranges between 1998 and 2019)
            Name: gva, dtype: float, £M Gross value added

    """

    if run is None:
        run = get_run("LocalGdpData")

    return pd.DataFrame(run.data.gva).melt(
        id_vars=["itl1_region", "la_code", "la_name"], var_name="year", value_name="gva"
    )


def population_lad(run: Optional[Run] = None):
    """Get the population in a local authority.

    Returns:
        Columns:
            Name: itl1_region, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
            Name: la_code, dtype: str, local authority code
            Name: la_name, dtype: str, local authority name
            Name: year, dtype: int, year (ranges between 1998 and 2019)
            Name: pop, dtype: float, population

    """

    if run is None:
        run = get_run("LocalGdpData")

    return pd.DataFrame(run.data.pop).melt(
        id_vars=["itl1_region", "la_code", "la_name"], var_name="year", value_name="pop"
    )


def gva_pc_lad():
    """Get the GVA per capita in a local authority.

    Returns:
        Columns:
            Name: nuts1_name, dtype: str, NUTS1 region (e.g. Scotland, South East etc)
            Name: la_code, dtype: str, local authority code
            Name: la_name, dtype: str, local authority name
            Name: year, dtype: int, year (ranges between 1998 and 2019)
            Name: gva_pc, dtype: float, GDP per capita

    """

    gva = gva_lad()
    pop = population_lad()

    return (
        gva.merge(pop, on=["itl1_region", "la_code", "la_name", "year"])
        .assign(gva_pc=lambda df: (1e6 * df["gva"] / df["pop"]).round(2))
        .rename(columns={"itl1_region": "nuts1_name"})
        .drop(axis=1, labels=["gva", "pop"])
    )
```
`itl1_region` is renamed to `nuts1_name` for only 1 out of 3 getters.
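To see the inconsistency concretely, here is the `gva_pc_lad` chain run on one invented toy row: the merged result ends up with `nuts1_name`, while `gva_lad` and `population_lad` still return `itl1_region`.

```python
import pandas as pd

# One invented row in the shape the getters return.
gva = pd.DataFrame(
    {"itl1_region": ["London"], "la_code": ["E09000030"],
     "la_name": ["Tower Hamlets"], "year": ["2019"], "gva": [10000.0]}
)
pop = gva.drop(columns="gva").assign(pop=[200000.0])

# The same chain as gva_pc_lad above.
gva_pc = (
    gva.merge(pop, on=["itl1_region", "la_code", "la_name", "year"])
    .assign(gva_pc=lambda df: (1e6 * df["gva"] / df["pop"]).round(2))
    .rename(columns={"itl1_region": "nuts1_name"})
    .drop(axis=1, labels=["gva", "pop"])
)
```

Dropping the final `rename` (or applying it in all three getters) would make the column names consistent.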
I did the following:
Closes #7
Closes #12
Closes #11