Skip to content

Track metadata when using pandas via JSON with custom DataFrame hooks

License

Notifications You must be signed in to change notification settings

Liam-Deacon/metapandas

 
 

Repository files navigation

metapandas

Track metadata when using pandas via JSON, utilising custom DataFrame hooks.

This both extends the pandas DataFrame with a MetaDataFrame class and can decorate commonly used pandas methods for retrieving/storing data to include metadata by default.

>>> import numpy as np
>>> import metapandas as mpd
>>> data = np.arange(9).reshape(3, 3)
>>> mdf = mpd.MetaDataFrame(data, columns=list('abc'), metadata={})
>>> from pprint import pprint
>>> pprint(mdf.metadata)
{'constructor': {'args': (array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]]),),
                 'class': <class 'metapandas.metadataframe.MetaDataFrame'>,
                 'kwargs': {'columns': ['a', 'b', 'c']}}}

# metadata is preserved when copied
>>> mdf.metadata['test'] = True
>>> mdf.copy().metadata.get('test')
True

# metadata is stored in a JSON when saving the dataframe to disk
>>> mdf.to_csv('test.csv', index=False)
>>> from pathlib import Path
>>> list(map(str, Path('.').glob('test.csv*')))
['test.csv', 'test.csv.meta.json']

# metadata is automatically loaded when pandas hooks are installed
# this is useful if you have existing pandas code that you want to augment with metadta
>>> from metapandas.hooks.pandas import PandasMetaDataHooks
>>> from contextlib import redirect_stdout, redirect_stderr
>>> from io import StringIO
>>> str_io = StringIO()
>>> with redirect_stderr(str_io), redirect_stdout(str_io):
...     PandasMetaDataHooks.install_metadata_hooks()
>>> print('\n'.join(str_io.getvalue().strip().split('\n')[-1:]))
Installed PandasMetaDataHooks hooks
>>> import pandas as pd
>>> new_mdf = pd.read_csv('test.csv')
>>> metadata = new_mdf.metadata
>>> pprint(metadata['storage'])
{'args': [],
 'data_filepath': 'test.csv',
 'metadata_filepath': 'test.csv.meta.json',
 'method': <function NDFrame.to_csv at ...>,
 'varargs': 'args'} 

# remove pandas decorators when no longer needed
>>> PandasMetaDataHooks.uninstall_metadata_hooks()
Uninstalled PandasMetaDataHooks hooks

# alternatively just use metapandas.read_* functions without installing hooks
>>> pprint(mpd.read_csv('test.csv').metadata['storage'])
{'args': [],
 'data_filepath': 'test.csv',
 'metadata_filepath': 'test.csv.meta.json',
 'method': <function NDFrame.to_csv at ...>,
 'varargs': 'args'} 

Pandas modification can be performed by importing the auto module as follows:

>>> import metapandas.auto
Applied hook for metapandas.metadataframe.MetaDataFrame.to_csv
Applied hook for metapandas.metadataframe.MetaDataFrame.to_excel
Applied hook for metapandas.metadataframe.MetaDataFrame.to_feather
Applied hook for metapandas.metadataframe.MetaDataFrame.to_hdf
Applied hook for metapandas.metadataframe.MetaDataFrame.to_json
Applied hook for metapandas.metadataframe.MetaDataFrame.to_parquet
Applied hook for metapandas.metadataframe.MetaDataFrame.to_pickle
Applied hook for pandas.read_csv
Applied hook for pandas.read_excel
Applied hook for pandas.read_feather
Applied hook for pandas.read_hdf
Applied hook for pandas.read_json
Applied hook for pandas.read_parquet
Applied hook for pandas.read_pickle
Applied hook for pandas.read_sql
Applied hook for pandas.read_sql_table
Applied hook for pandas.read_sql_query
Applied hook for pandas.core.frame.DataFrame.to_csv
Applied hook for pandas.core.frame.DataFrame.to_excel
Applied hook for pandas.core.frame.DataFrame.to_feather
Applied hook for pandas.core.frame.DataFrame.to_hdf
Applied hook for pandas.core.frame.DataFrame.to_json
Applied hook for pandas.core.frame.DataFrame.to_parquet
Applied hook for pandas.core.frame.DataFrame.to_pickle
Installed PandasMetaDataHooks hooks

Installation

MetaPandas itself is a pure python package, but depends on pandas and the SciPy stack. Note: It optionally uses geopandas as well, which is often difficult to install without conda.

To install, simply try:

pip install metapandas

Development

To set up a development environment, first create either a new virtual or conda environment before activating it and then run the following:

git clone https://github.com/lightbytes/metapandas
cd metapandas
pip install -r requirements-dev.txt requirements-test.txt -r requirements.txt
pip install -e .

This will install the package in development mode. Note that is you have forked the repo then change the URL as appropriate.

Documentation

Documentation can be found within the docs/ directory. This project uses sphinx to autogenerate API documentation by scraping python docstrings.

To generate the HTML documentation, simply do the following:

cd docs
make html

PDF generation

PDF documentation is currently only supported on Ubuntu systems, but needs additional packages to run. These can be installed by:

cd docs
chmod +x setup.sh
./setup.sh

PDFs can then be created with make pdf from within the docs/ directory.

Contribution Guidelines

Contributions are extremely welcome and highly encouraged. To help with consistency please can the following areas be considered before submitting a PR for review:

  • Use autopep8 -a -a -i -r . to run over any modified files to ensure basic pep8 conformance, allowing the code to be read in a style expected for most python projects.
  • New or changed functionality should be tested, running pytest should
  • Try to document any new or changed functionality. Note: this project uses numpydoc for it's docstring documentation style.

License

Released under the MIT license.

TODO

This package is mostly a proof of concept and as such there are a number of areas to add to, fix and improve. Of these, the following are considered to be of highest importance:

  1. Track pandas operations such as merge, groupby, etc. within metadata (BIG TASK)
  2. Add user friendly documentation
  3. Automated semantic versioning
  4. Automated master branch update release to PyPI
  5. More extensive testing
  6. Improve code coverage to > 90% (stretch: > 95%)

About

Track metadata when using pandas via JSON with custom DataFrame hooks

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Sponsor this project

Packages

No packages published

Languages

  • Python 98.5%
  • Dockerfile 1.5%