Reading .dta with value labels #74

Open
pdeffebach opened this issue Oct 30, 2020 · 2 comments

@pdeffebach

As you know, Stata basically stores value-labeled data as a vector of integers or doubles, not necessarily an ordered sequence starting at 1, and a Dict going from Int => String.

Accessing the string values, which we generally care the most about, is hard with ReadStat. You have to

  1. Use ReadStat, not StatFiles, to access the internal fields of the Stata file
  2. Construct the DataFrame from the data and header fields
  3. Use the value_label_dict field to perform the replacement
  4. Use get on the DataValue elements of the array
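
The replacement step can be sketched in plain Julia. This is a minimal illustration, not ReadStat's actual API: `codes` and `value_labels` are hypothetical stand-ins for a column and its label dictionary, and `missing` is used in place of `DataValue` so the example needs no packages.

```julia
# Hypothetical stand-ins for what a .dta reader might hand back:
# raw integer codes (one of them missing) and a code => label mapping.
codes = Union{Int,Missing}[1, 5, 9, missing, 5]
value_labels = Dict(1 => "Employed", 5 => "Unemployed", 9 => "Not in labor force")

# Look up each code. Missing stays missing, and a code without a
# defined label falls back to its string form.
labels = [ismissing(c) ? missing : get(value_labels, c, string(c)) for c in codes]
```

Note that the codes need not be contiguous or start at 1, which is why a `Dict` lookup is used here rather than indexing into a vector of labels.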

This is not the most user friendly thing.

There isn't a great solution for this in Julia, as we don't have a CategoricalArray equivalent where the underlying dict maps arbitrary types to strings. So converting to a categorical array will drop the underlying integers, which are useful to keep for interoperability.

haven in R recently changed how this is handled, with the <dbl+lbl> vector type, though working with it is a bit of a pain; see here.

I can email a dataset to someone with an MWE for more information.

@doriantsolak

I would like to work on this, as I have to deal with .dta files quite regularly and I know the pain of handling Stata labels (in R or in general). I have also read the issues on adding metadata to dataframes and the discussion regarding metadata in DataAPI. As I believe I come from a similar context (lots of household survey data), I agree with a lot of the points @pdeffebach made there, especially about persistent metadata (like in Stata) being super useful. However, as there does not seem to be a great solution on the horizon, what would be the general idea for implementing a solution that allows a better workflow with .dta files?

Is the idea to create a global dict which allows for swapping integers with string labels through some mapping based on column name? Should I look into Metadata.jl as a possible dependency for that? I have not worked with Metadata.jl before, but as far as I understand it, it seems to use the approach of a global dict.
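
To make the question concrete, here is a rough sketch of what such a global dict could look like. All names (`VALUE_LABELS`, `register_labels!`, `label_for`) are made up for illustration; this is not Metadata.jl's API, just the bare idea in Base Julia.

```julia
# Hypothetical global registry: column name => (code => label) mapping.
const VALUE_LABELS = Dict{Symbol,Dict{Int,String}}()

# Store the mapping for one column, replacing any earlier one.
register_labels!(col::Symbol, mapping::Dict{Int,String}) = (VALUE_LABELS[col] = mapping)

# Return the label for a code in the given column. If the column or the
# code has no registered label, fall back to the code's string form.
function label_for(col::Symbol, code::Integer)
    mapping = get(VALUE_LABELS, col, nothing)
    mapping === nothing && return string(code)
    return get(mapping, code, string(code))
end

register_labels!(:employment, Dict(1 => "Employed", 5 => "Unemployed"))
label_for(:employment, 5)   # "Unemployed"
```

One drawback of this approach is that the labels are tied to the column name rather than to the data, so renaming or copying a column silently loses the link.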

I might need a lot of guidance, as this is my first open-source contribution; sorry in advance.

@pdeffebach
Author

I think a custom array type would handle this pretty easily. Something based off of CategoricalArrays.jl. But that might be a big task for someone doing their first open source contribution.
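
A very rough sketch of the idea, assuming integer codes and no missing values. `LabeledVector`, `valuelabel`, and `valuelabels` are hypothetical names, and this is far simpler than what CategoricalArrays.jl does internally; it only shows that the codes and the labels can travel together.

```julia
# Hypothetical type: keeps the raw codes and the code => label mapping together.
struct LabeledVector{T<:Integer} <: AbstractVector{T}
    codes::Vector{T}
    labels::Dict{T,String}
end

# Minimal AbstractArray interface: the vector behaves like its codes,
# so the underlying integers are preserved.
Base.size(v::LabeledVector) = size(v.codes)
Base.getindex(v::LabeledVector, i::Int) = v.codes[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

# Label lookup, falling back to the code's string form when unlabeled.
valuelabel(v::LabeledVector, i::Int) = get(v.labels, v.codes[i], string(v.codes[i]))
valuelabels(v::LabeledVector) = [valuelabel(v, i) for i in eachindex(v)]

v = LabeledVector([1, 5, 9, 5], Dict(1 => "Employed", 5 => "Unemployed"))
```

With this, `v[2]` still returns the integer `5`, while `valuelabels(v)` gives the strings. A real implementation would also need to handle missing values, `setindex!`, and non-integer codes.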
