Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add interlacer support via read_interlaced_resource() #213

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

khusmann
Copy link
Contributor

@khusmann khusmann commented Jun 27, 2024

Just put together a quick sketch of what interlacer support might look like for frictionless-r

Key features that interlacer brings:

Example csv in tests/testthat/data/type_interlaced.csv:

na_fct,na_int,na_cfct,na_none,na_default,cfct_chr,cfct_int
1.1,1.1,1.1,1,1,a,10
2.2,2.2,2.2,2,2,b,20
OMITTED,-99,-99,3,OMITTED,a,10
REFUSED,-98,-98,3,3,b,20
5.5,5.5,5.5,4,4,a,10

Example run:

pkg <- read_package("tests/testthat/data/types.json")
read_interlaced_resource(pkg, "interlaced")
# # A tibble: 5 × 7
#      na_fct    na_int    na_cfct na_none na_default cfct_chr cfct_int
#   <dbl,fct> <dbl,int> <dbl,cfct>   <dbl>  <dbl,fct> <cfct>   <cfct>  
# 1       1.1       1.1        1.1       1          1 APPLE    APPLE   
# 2       2.2       2.2        2.2       2          2 BANANA   BANANA  
# 3 <OMITTED>     <-99>  <OMITTED>       3  <OMITTED> APPLE    APPLE   
# 4 <REFUSED>     <-98>  <REFUSED>       3          3 BANANA   BANANA  
# 5       5.5       5.5        5.5       4          4 APPLE    APPLE 

Notes / thoughts

  1. The complexity of read_resource is getting pretty out of hand (as you note in Make read_resource() more modular #210). I would argue that it's not just that we need to make it more modular; what really hurts us is that we're writing a complex parser here in a language that doesn't support static typing. So we can't just validate the JSON via an object schema (as you'd do with pydantic or serde), then rely on the type signatures thereafter to guarantee valid types; instead type validation is mixed into all of our transformation logic (look at all the null checks that are necessary, for example). Without static typing, it becomes very difficult to manage the input possibilities and decouple input validation from transformation. As I mentioned in the last frictionless call, in the long term I want to write a library in rust to help address / standardize / streamline some of this, but in the short term I agree that your ideas in Make read_resource() more modular #210 will go a long way.

  2. While working on this I noticed some more (easily fixable) performance issues with read_interlaced_*():

Because interlacer is still in its infancy, I don't think it should be the default reader. So presently I have an interlaced argument in read_resource, as well as an alias function read_interlaced_resource() that runs with interlaced = TRUE.

When interlaced = FALSE, it uses read_delim() instead of read_interlaced_delim(). This has the following effects:

  • cfactors just become whatever their underlying code type was (a cfactor with character codes just turns into a character, for example)
  • if there is any custom field-level missingness information, it is lost (and the user is warned).

The reason for this, is that even after I fix the performance issues above, we're still looking at a speed compromise if we want to load non-vroom types (interlaced columns , cfactors, list, etc.) with frictionless-r. This is the case because if we touch a column in R after loading it with vroom, the column gets loaded into memory, thereby defeating the lazy ALTREP loading of the column and slowing things down. In the long-term I want to work with ALTREP so it lazy loads everything just like vroom; but in the short term I'm planning to focus on the larger low hanging fruit (like the issues i link above)

  1. Minor cosmetic thing -- I've named all my classes "interlacer_*" to sort of namespace them (similar to "vectrs_vctr"). After working with the lib from the perspective of another package though, I'm not sure I like it. Would you prefer to see "interlaced" and "cfactor" instead of "interlacer_interlaced" and "interlacer_cfactor"?

Key features:
- field-level missingValues
- coded factor columns (value labels for factors)
- tagged missingness (interlaced columns)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant