-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Binary output formats #507
Comments
One suggestion for writing the r-docker has the R arrow and haven packages installed, so you could read in the The R haven package is mainly a wrapper around the ReadStat command line tool and C library https://github.com/WizardMac/ReadStat |
There is no library under the hood. See pandas/io/stata.py. |
Thanks @remlapmot, this is a neat idea. I think I'd prefer though it if we could produce |
I had a look at pyreadstat https://github.com/Roche/pyreadstat (the Python wrapper of ReadStat) but it's function to write dta files operates on pandas Also, unfortunately the command line version of ReadStat doesn't take a feather file as its input file, docs here So I put an R action here |
Thanks @remlapmot, this is really helpful. I agree with Peter that we really want to be able to generate this directly in databuilder (and ideally without having to write to an intermediate format first) but having a working proof-of-concept is very useful nevertheless. One option would be for us to work with ReadStat directly. It looks like someone has created Python bindings for it here: Although it does come with this disclaimer which feels particularly targeted at us, so more investigation will be needed :)
I don't think it would be totally mad for us to write our own bindings, as the API surface we'd need to support would be fairly limited. EDIT: Ah, just seen your message about pyreadstat. That's disappointing, but it might be we can bypass the Pandas code and just use the bindings directly. |
As mentioned, @remlapmot has created https://github.com/opensafely/feather-to-dta. As Data Builder is still being developed, and as this issue is still being discussed, I'm not going to move it from opensafely to opensafely-actions. |
Closing in favour of the more specific #794 |
We want to default to using a compressed binary output format instead of CSV (and avoid repeating the mistake of cohortextractor). Feather v2` seems like the optimal choice here.
(Aside: because databuilder requires an explicit output file name the question of a default format is really about documentation more than anything else.)
We should avoid Pandas as far as possible for this and use the PyArrow libraries directly because:
(a) we want to stream batches to disk rather than buffer the entire result set in memory;
(b) we want to be able to use the 32 bit date type rather than storing everything as 64-bit nanosecond precision timestamps;
(c) all the other reason for avoiding Pandas.
There's an issue tracking this in cohortextractor and Simon has already made a proof-of-concept for streaming output:
We may be forced to use Pandas to support
.dta
output for Stata, though we should see if it's using some other library under the hood which we can talk to directly. In any case, we should bear in the various issues encounted in cohortextractor and their fixes:.dta
output format cohort-extractor#534.dta
binary format does not preserve date types cohort-extractor#533The text was updated successfully, but these errors were encountered: