Binary output formats #507

evansd · 2022-05-27T13:52:54Z

We want to default to using a compressed binary output format instead of CSV (and avoid repeating the mistake of cohortextractor). Feather v2 seems like the optimal choice here.

(Aside: because databuilder requires an explicit output file name the question of a default format is really about documentation more than anything else.)

We should avoid Pandas as far as possible for this and use the PyArrow libraries directly because:
(a) we want to stream batches to disk rather than buffer the entire result set in memory;
(b) we want to be able to use the 32-bit date type rather than storing everything as 64-bit nanosecond-precision timestamps;
(c) all the other reasons for avoiding Pandas.

There's an issue tracking this in cohortextractor and Simon has already made a proof-of-concept for streaming output:

Stream feather outputs to disk cohort-extractor#763

We may be forced to use Pandas to support .dta output for Stata, though we should see if it's using some other library under the hood which we can talk to directly. In any case, we should bear in mind the various issues encountered in cohortextractor and their fixes:


remlapmot · 2022-05-30T09:14:31Z

One suggestion for writing the .dta files is that you could possibly do that with an R reusable action which converts the .feather file to .dta, instead of using pandas (because as far as I know there isn't yet a Stata package to read in .feather files).

r-docker has the R arrow and haven packages installed, so you could read in the .feather file with arrow::read_feather() and write it out as a .dta file using haven::write_dta().

The R haven package is mainly a wrapper around the ReadStat command line tool and C library https://github.com/WizardMac/ReadStat

inglesp · 2022-06-01T10:51:19Z

We may be forced to use Pandas to support .dta output for Stata, though we should see if it's using some other library under the hood which we can talk to directly.

There is no library under the hood. See pandas/io/stata.py.

inglesp · 2022-06-01T10:52:32Z

One suggestion for writing the .dta files is that you could possibly do that with an R reusable action which converts the .feather file to .dta, instead of using pandas (because as far as I know there isn't yet a Stata package to read in .feather files).

Thanks @remlapmot, this is a neat idea. I think I'd prefer it, though, if we could produce .dta files directly, if at all possible.

remlapmot · 2022-06-06T10:40:37Z

I had a look at pyreadstat https://github.com/Roche/pyreadstat (the Python wrapper of ReadStat) but its function to write dta files operates on pandas DataFrames (docs here), so I guess that's not really helpful.

Also, unfortunately the command line version of ReadStat doesn't take a feather file as its input file, docs here

So I put an R action here
https://github.com/opensafely/feather-to-dta
(just move to opensafely-actions org to use, no worries if no-one uses it)

evansd · 2022-06-06T10:56:46Z

Thanks @remlapmot, this is really helpful. I agree with Peter that we really want to be able to generate this directly in databuilder (and ideally without having to write to an intermediate format first) but having a working proof-of-concept is very useful nevertheless.

One option would be for us to work with ReadStat directly. It looks like someone has created Python bindings for it here:
https://github.com/Roche/pyreadstat

Although it does come with this disclaimer which feels particularly targeted at us, so more investigation will be needed :)

Pyreadstat is not a validated package. The results may have inaccuracies deriving from the fact most of the data formats are not open. Do not use it for critical tasks such as reporting to the authorities.

I don't think it would be totally mad for us to write our own bindings, as the API surface we'd need to support would be fairly limited.

EDIT: Ah, just seen your message about pyreadstat. That's disappointing, but it might be we can bypass the Pandas code and just use the bindings directly.

iaindillingham · 2022-06-09T11:07:55Z

As mentioned, @remlapmot has created https://github.com/opensafely/feather-to-dta. As Data Builder is still being developed, and as this issue is still being discussed, I'm not going to move it from opensafely to opensafely-actions.

evansd · 2022-10-20T10:47:50Z

Closing in favour of the more specific #794

evansd added the dave-notes label May 27, 2022

evansd mentioned this issue Jun 22, 2022

Critical path to running CIPHA booster effectiveness study in Graphnet #565

Closed

inglesp mentioned this issue Jun 30, 2022

Support Feather v2 as an output format #588

Closed

evansd mentioned this issue Jul 15, 2022

Handle categoricals in binary output formats #631

Closed

inglesp added CIPHA Work needed for the CIPHA booster effectiveness study and removed dave-notes labels Jul 20, 2022

inglesp removed the CIPHA Work needed for the CIPHA booster effectiveness study label Aug 31, 2022

evansd mentioned this issue Oct 20, 2022

Support Stata output format (dta/dta.gz) #794

Closed

evansd closed this as completed Oct 20, 2022
