Restructure intro.rst and other pages for data warehouse #2912
…info. Add three components of PUDL description
Thank you @aesharpe! I propose we:
- Use the README changes on this branch
- Move The Data Warehouse Design and Data Validation sections on the create-naming-convention-docs branch to the Data and ETL Design Guidelines page
- Use the Naming Convention section changes from this branch
What do you think?
- **Raw Data Archives**

  - We `archive <https://github.com/catalyst-cooperative/pudl-archiver>`__ all the raw
    data inputs on `Zenodo <https://zenodo.org/communities/catalyst-cooperative/?page=1&size=20>`__
    to ensure permanent, versioned access to the data. In the event that an agency
    changes how they publish data or deletes old files, the ETL will still have access
    to the original inputs. Each of the data inputs may have several different versions
    archived, and all are assigned a unique DOI and made available through Zenodo's REST API.

- **ETL Pipeline**

  - The ETL pipeline (this repo) ingests the raw archives, cleans them, integrates
    them, and outputs them to a series of tables stored in SQLite databases, Parquet
    files, and pickle files (the Data Warehouse). Each release of the PUDL Python
    package embeds a set of DOIs that indicate which version of the raw
    inputs it is meant to process. This helps ensure that the ETL and its
    outputs are replicable.

- **Data Warehouse**

  - The outputs from the ETL, sometimes called "PUDL outputs", are stored in a data
    warehouse so that users can access the data without having to run any code. The
    majority of the outputs are stored in ``pudl.sqlite``; however, CEMS data are stored
    in separate Parquet files due to their large size. The warehouse also contains
    pickled interim assets from the ETL process, should users want to access the data
    at various stages of the cleaning process, and SQLite databases for the raw FERC
    inputs.

For more information about each of the components, read our
`documentation <https://catalystcoop-pudl--2874.org.readthedocs.build/en/2874/intro.html>`__.
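To make the "access the data without having to run any code" claim in the Data Warehouse item concrete, here is a minimal sketch of reading one table from a SQLite file using only Python's standard library. It is an illustration under stated assumptions, not PUDL's own access method: the table `demo_plants`, its columns, and the helpers `build_demo_db` and `read_table` are all hypothetical, and a throwaway database stands in for `pudl.sqlite`.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def build_demo_db(db_path):
    """Create a tiny stand-in database (hypothetical table, not a PUDL table)."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE demo_plants (plant_id INTEGER PRIMARY KEY, plant_name TEXT)"
        )
        conn.executemany(
            "INSERT INTO demo_plants VALUES (?, ?)",
            [(1, "Plant A"), (2, "Plant B")],
        )
        conn.commit()


def read_table(db_path, table_name):
    """Return every row of one table as a list of dicts."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        # Table names cannot be bound as SQL parameters, so validate the
        # name against the database's own catalog before interpolating it.
        known = {
            row["name"]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        }
        if table_name not in known:
            raise KeyError(f"no such table: {table_name}")
        rows = conn.execute(f'SELECT * FROM "{table_name}"').fetchall()
    return [dict(row) for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "demo.sqlite"
        build_demo_db(db_path)
        for record in read_table(db_path, "demo_plants"):
            print(record)
```

Against a real download, only the path and table name passed to `read_table` would change; the Parquet and pickle outputs mentioned above would need different readers.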
I think including this early in the README forces users to scroll through more text to get to the data access section, which I'm assuming is what they care about.
I think this type of architecture information is more important for contributors, who will be reading through the Development section of the docs.
That's a fair point. I don't want to assume users know what they want yet, though, and this gives them the opportunity to understand what happens to the data before they use it. Maybe we could run this by some other people to see what they think.
We could also just copy what's in the intro and use that instead:
- **Raw Data Archives** (raw, versioned inputs)
- **ETL Pipeline** (code to process, clean, and organize the raw inputs)
- **Data Warehouse** (location where ETL outputs, both interim and final, are stored)
@@ -74,13 +46,43 @@ needed and organize them in a local :doc:`datastore <dev/datastore>`.
.. _etl-process:

---------------------------------------------------------------------------------------
The Data Warehouse Design
The ETL Pipeline
I'm tempted to move this information to the development section of the docs. Do users actually care about the raw data archives, data warehouse, and data validation?
I'm thinking we could move The Data Warehouse Design and Data Validation sections to the Data and ETL Design Guidelines page?
I'm a little hesitant to have an ETL and Data Warehouse section because they cover similar topics. I think it's easier to think about our data processing just in terms of the raw, core, and output layers as opposed to ETL steps.
Ok! As long as the concept of the Data Warehouse doesn't get lost in the code processing description, I think that's fine. My concern in wanting to pull out the Data Warehouse section was similar to your comment above: people are primarily concerned with Data Access, so being able to jump straight to a Data Warehouse page / section might be nice. Depending on how we structure the rest of the docs, though, this might not be an issue.
If we do this, I would wonder what the purpose of this introduction page is. Maybe we don't need it? Idk... I do feel like some brief description of what's going on would be nice, as I don't think only developers would want to know this type of information. A lot of users might be curious what is actually happening to the data they are using in between raw and final. The Data and ETL Design Guidelines page feels a little bit hidden. Maybe we could take the mini paragraph descriptions for each section from the README page and put them in the intro instead of having longer descriptions there.
…sions_ferc1 table
I think you're right I shouldn't assume users don't care about how the data is processed. In that case, what if we just keep the data warehouse / processing language from
Or we can move the data warehouse design language to the ETL Guidelines page and just link to it in the intro page. I think we're starting to bump up against larger unanswered questions about our docs that are out of the scope of the renaming docs. To keep things simple, what if we:
Changes in the branch were incorporated into #2874.
…cols (#2818)

* Rename static tables
* Rename Census DP1 assets
* Test doc fix
* Update core table names for EIA 860, 923, harvested tables, FERC1, code
* Fix integration tests
* Fix alembic
* Rename 714, 861, epacems
* update tests and rest of assets
* Fix validation tests
* Rename ferc output assets
* Rename denorm_cash_flow_ferc1 and remove leading underscore from cross refs in pudl_db docs
* Rename a missing ferc output table and add migration
* Rename EIA denorm assets
* Recreate ferc rename migration
* Add docs cross ref fix for intermediate assets
* Resolve small denorm EIA rename issues
* Clean up notebooks
* Apply naming convention to allocate generation fuel assets
* Fix a missing gen fuel asset name in PudlTabl
* Update migrations post ferc1 output rename merge
* Update contributor facing documentation with new asset naming conventions
* Add new naming convention to user facing documentation
* Correct allocate-get-fuel down revision
* Apply new naming convention to ferc714 respondents, hourly demand and eia861 service territories
* Fix refs to renamed tables in release notes
* Rename ferc714 and eia861 output tables in integration tests
* Add missing balance authority fk migration
* Rename out_ferc714__fipsified_respondents to out_ferc714__respondents_with_fips
* Respond to first round of Austen's comments
* Update rename-core-assets and clarify raw asset sentence
* Restrict astroid version to avoid random autoapi error
* Reset migrations and fix old table refs in docs
* Fix names of inputs to exploded tables and xbrl calculation fixes
* Rename mcoe and ppl assets
* Fix small ppl migration issue
* Format and sort intermediate resource name cross refs in data dictionary
* Add upstream mcoe assets back to metadata
* Update stragler PudlTabl method name
* Add frequency to ppl asset name and some clean up
* rename six of the non-contreversial FERC1 tables (core + out)
* initial rename of the FERC1 core and out tables
* add db migration
* rename the ferc1 transformer classes in line with new table names
* Incorporate some docs changes from #2912
* FINAL FINAL rename of ferc assets
* ooooops remove the eia860m extraction edit bc that was not supposed to be in here ooop
* Remove README.rst from index.rst and move intro content to index
* Add deprecation warnings to PudlTabl and add minor naming docs updates
* Rename heat_rate_mmbtu_mwh -> heat_rate_mmbtu_mwh_by_unit
* Rename heat rate mmbtu mwh to follow existing naming convention
* Remove PudlTabl removal data and make assn table name sources alphabetical
* Explain why CEMS is stored as parquet
* Rename heat_rate_mmbtu_mwh_eia/ferc1 columns to unit_heat_rate_mmbtu_per_mwh_eia/ferc1
* Remove unused ppe_cols_to_grab variable
* Make association asset names more consistent
* Add association assset naming convention to docs
* Resolve migration issues with unit heat rate column
* Update conda-lock.yml and rendered conda environment files.
* Recreate heat rate migration revision
* Use pudl_sqlite_io_manager for fuel_cost_by_generator assets
* Update conda-lock.yml and rendered conda environment files.
* Checkout lock files from dev
* Update conda-lock.yml and rendered conda environment files.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Remove intro.rst and update ferc s3 urls again
* Update conda-lock.yml and rendered conda environment files.
* Remove some old table names from metaddata
* Update conda-lock.yml and rendered conda environment files.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  For more information, see https://pre-commit.ci
* Remove ref to non existant doc page, remove files no longer in dev

---------

Co-authored-by: bendnorman <bdn29@cornell.edu>
Co-authored-by: Bennett Norman <bennett.norman@catalyst.coop>
Co-authored-by: Christina Gosnell <cgosnell@catalyst.coop>
Co-authored-by: bendnorman <bendnorman@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Still WIP.

Need more input on the ETL section of the intro.rst page! I think you can probably just go ahead and work off this branch to add it. @bendnorman what do you think?