Reading `.dta` with value labels #74

pdeffebach · 2020-10-30T14:56:59Z

As you know, Stata stores value-labeled data as a vector of integers or doubles (not necessarily an ordered sequence starting at 1) together with a Dict mapping Int => String.

Accessing the string values, which we generally care about most, is hard with ReadStat. You have to:

1. Use ReadStat, not StatFiles, to access the internal fields of the Stata file
2. Construct the DataFrame from the data and header fields
3. Use the value_label_dict field to perform the replacement
4. Use get on the DataValue elements of the array
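
The replacement in steps 3 and 4 can be sketched in plain Julia. This is only an illustration on stand-in data: `codes`, `labels`, and `apply_labels` are hypothetical names, `missing` stands in for a missing DataValue, and nothing here calls ReadStat itself.

```julia
# Stand-in for one value-labeled Stata column: raw codes, not necessarily 1, 2, 3, ...
codes = Union{Int,Missing}[1, 5, 9, missing, 5]

# Stand-in for that column's value labels (code => label).
labels = Dict(1 => "Never", 5 => "Sometimes", 9 => "Always")

# Replace each code by its label. Missing values stay missing, and a code
# without a label falls back to the code printed as a string.
function apply_labels(codes, labels)
    return map(codes) do c
        ismissing(c) ? missing : get(labels, c, string(c))
    end
end

labelled = apply_labels(codes, labels)
```

Note that `labelled` no longer carries the integer codes, so the original column has to be kept around separately if they are needed later.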

This is not the most user-friendly workflow.

There isn't a great solution for this in Julia, as we don't have a CategoricalArray equivalent where the base dict maps arbitrary types to strings. So converting to a categorical array will drop the underlying integers, which are useful to keep for interoperability.

haven in R recently changed how this is handled, using the <dbl+lbl> vector type, though working with it is a bit of a pain; see here.

I can email a dataset with an MWE to anyone who wants more information.


doriantsolak · 2021-11-30T16:03:42Z

I would like to work on this, as I deal with .dta files quite regularly and know the pain of handling Stata labels (in R or in general). I have also read the issues on adding metadata to dataframes and the discussion regarding metadata in DataAPI. As I believe I come from a similar context (lots of household survey data), I agree with a lot of the points @pdeffebach made there, especially about persistent metadata (like in Stata) being super useful. However, as there does not seem to be a great solution on the horizon, what would be the general approach to a solution that allows for a better workflow with .dta files?

Is the idea to create a global dict that allows swapping integers for string labels through some mapping based on column name? Should I look into Metadata.jl as a possible dependency for that? I have not worked with Metadata.jl before, but as far as I understand, it seems to use the approach of a global dict.
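
To make the question concrete, here is a rough sketch of what I mean by a global dict keyed by column name. All names (`VALUE_LABELS`, `setlabels!`, `labelof`) are made up for illustration; this is not the Metadata.jl API, which I have not used.

```julia
# Hypothetical registry: column name => (code => label).
const VALUE_LABELS = Dict{Symbol,Dict{Int,String}}()

# Register the value labels for one column.
function setlabels!(col::Symbol, labels::Dict{Int,String})
    VALUE_LABELS[col] = labels
    return nothing
end

# Look up the label for a code in a given column. Falls back to the code
# printed as a string when the column or the code has no label registered.
function labelof(col::Symbol, code::Integer)
    haskey(VALUE_LABELS, col) || return string(code)
    return get(VALUE_LABELS[col], code, string(code))
end

setlabels!(:employment, Dict(1 => "Employed", 2 => "Unemployed", 8 => "Refused"))
labelof(:employment, 8)  # "Refused"
labelof(:employment, 3)  # "3", since no label is registered for this code
```

One drawback I can already see: the registry is only tied to the column by its name, so renaming a column or copying it into another table would silently lose the labels.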

Might be that I need a lot of guidance as this is my first open-source contribution, sorry in advance.

pdeffebach · 2021-11-30T16:14:12Z

I think a custom array type would handle this pretty easily. Something based off of CategoricalArrays.jl. But that might be a big task for someone doing their first open source contribution.
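
A very rough sketch of the idea, to show the shape of it. `LabeledVector`, `labelat`, and `labelsof` are hypothetical names, not anything that exists in CategoricalArrays.jl or ReadStat, and missing values are left out entirely.

```julia
# Hypothetical type: keeps the raw Stata codes and the code => label mapping together.
struct LabeledVector{T} <: AbstractVector{T}
    codes::Vector{T}
    labels::Dict{T,String}
end

# Minimal AbstractArray interface: indexing returns the raw code, so the
# vector still behaves like the underlying integers.
Base.size(v::LabeledVector) = size(v.codes)
Base.getindex(v::LabeledVector, i::Int) = v.codes[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

# Label of the i-th element, falling back to the code printed as a string.
labelat(v::LabeledVector, i::Int) = get(v.labels, v.codes[i], string(v.codes[i]))

# All labels as a plain Vector{String}.
labelsof(v::LabeledVector) = [labelat(v, i) for i in eachindex(v.codes)]

v = LabeledVector([1, 5, 9, 5], Dict(1 => "Never", 5 => "Sometimes", 9 => "Always"))
v[2]         # 5, the raw code is preserved
labelat(v, 2) # "Sometimes"
```

The point is that the codes are never thrown away: the labels sit next to them and are only looked up on request. A real implementation would also need `setindex!`, missing support, and sensible behavior under `copy`, `filter`, joins, and so on, which is where most of the work is.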

davidanthoff added the enhancement label Feb 5, 2021
