Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary output formats #507

Closed
evansd opened this issue May 27, 2022 · 7 comments
Closed

Binary output formats #507

evansd opened this issue May 27, 2022 · 7 comments

Comments

@evansd
Copy link
Contributor

evansd commented May 27, 2022

We want to default to using a compressed binary output format instead of CSV (and avoid repeating the mistake of cohortextractor). Feather v2` seems like the optimal choice here.

(Aside: because databuilder requires an explicit output file name the question of a default format is really about documentation more than anything else.)

We should avoid Pandas as far as possible for this and use the PyArrow libraries directly because:
(a) we want to stream batches to disk rather than buffer the entire result set in memory;
(b) we want to be able to use the 32 bit date type rather than storing everything as 64-bit nanosecond precision timestamps;
(c) all the other reason for avoiding Pandas.

There's an issue tracking this in cohortextractor and Simon has already made a proof-of-concept for streaming output:

We may be forced to use Pandas to support .dta output for Stata, though we should see if it's using some other library under the hood which we can talk to directly. In any case, we should bear in the various issues encounted in cohortextractor and their fixes:

@remlapmot
Copy link

One suggestion for writing the .dta files is that you could possibly do that with an R resuable action which converts the .feather file to .dta, instead of using pandas (because as far as I know there isn't yet a Stata package to read in .feather files).

r-docker has the R arrow and haven packages installed, so you could read in the .feather file with arrow::read_feather() and write it out as a .dta file using haven::write_dta().

The R haven package is mainly a wrapper around the ReadStat command line tool and C library https://github.com/WizardMac/ReadStat

@inglesp
Copy link
Contributor

inglesp commented Jun 1, 2022

We may be forced to use Pandas to support .dta output for Stata, though we should see if it's using some other library under the hood which we can talk to directly.

There is no library under the hood. See pandas/io/stata.py.

@inglesp
Copy link
Contributor

inglesp commented Jun 1, 2022

One suggestion for writing the .dta files is that you could possibly do that with an R resuable action which converts the .feather file to .dta, instead of using pandas (because as far as I know there isn't yet a Stata package to read in .feather files).

Thanks @remlapmot, this is a neat idea. I think I'd prefer though it if we could produce .dta files directly, if at all possible.

@remlapmot
Copy link

I had a look at pyreadstat https://github.com/Roche/pyreadstat (the Python wrapper of ReadStat) but it's function to write dta files operates on pandas DataFrames (docs here), so I guess that's not really helpful.

Also, unfortunately the command line version of ReadStat doesn't take a feather file as its input file, docs here

So I put an R action here
https://github.com/opensafely/feather-to-dta
(just move to opensafely-actions org to use, no worries if no-one uses it)

@evansd
Copy link
Contributor Author

evansd commented Jun 6, 2022

Thanks @remlapmot, this is really helpful. I agree with Peter that we really want to be able to generate this directly in databuilder (and ideally without having to write to an intermediate format first) but having a working proof-of-concept is very useful nevertheless.

One option would be for us to work with ReadStat directly. It looks like someone has created Python bindings for it here:
https://github.com/Roche/pyreadstat

Although it does come with this disclaimer which feels particularly targeted at us, so more investigation will be needed :)

Pyreadstat is not a validated package. The results may have inaccuracies deriving from the fact most of the data formats are not open. Do not use it for critical tasks such as reporting to the authorities.

I don't think it would be totally mad for us to write our own bindings, as the API surface we'd need to support would be fairly limited.


EDIT: Ah, just seen your message about pyreadstat. That's disappointing, but it might be we can bypass the Pandas code and just use the bindings directly.

@iaindillingham
Copy link
Member

As mentioned, @remlapmot has created https://github.com/opensafely/feather-to-dta. As Data Builder is still being developed, and as this issue is still being discussed, I'm not going to move it from opensafely to opensafely-actions.

@inglesp inglesp added CIPHA Work needed for the CIPHA booster effectiveness study and removed dave-notes labels Jul 20, 2022
@inglesp inglesp removed the CIPHA Work needed for the CIPHA booster effectiveness study label Aug 31, 2022
@evansd
Copy link
Contributor Author

evansd commented Oct 20, 2022

Closing in favour of the more specific #794

@evansd evansd closed this as completed Oct 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants