support overlapping fwf columns #534

gergness · 2016-10-17T19:40:01Z

I sometimes receive fixed with files with specifications that have overlapping columns. readr's current behavior made it difficult to figure out what was going wrong.

I wish that I was able to read in the data with the original specifications and get the 2 variables as normal, but if that's not possible, perhaps fwf_positions could check for overlaps and throw a better error?

library(readr)
read_fwf("2015a\n2016b", fwf_positions(c(1, 3, 5), c(4, 4, 5)))
#> Warning: 4 parsing failures.
#> row col  expected    actual
#>   1  X2 2 chars   1        
#>   1  -- 3 columns 2 columns
#>   2  X2 2 chars   1        
#>   2  -- 3 columns 2 columns
#> 
#> # A tibble: 2 × 3
#>      X1    X2    X3
#>   <int> <chr> <chr>
#> 1  2015     a  <NA>
#> 2  2016     b  <NA>

lucasmation · 2016-10-23T05:08:05Z

You are right. Bump!

I am having an embarrassing problem in a package (microdadosBrasil) because of the error reading overlapping columns in the original fwf data source.

I am posting a new example just to be even more clear

Not only are the overlaping columns wrong, but also all the following columns have the initial and end positions shifted right, thus making all the data wrong.

Let's say I have this import dictionary

# name: initial position to final position
# X1: 1 to 2
# X2: 3 to 4
# X3: 3 to 7 (overlaps first two caracters with X2)
# X4: 8 to 9
# X5: 10 to 14

And this dataset

123456789abcdefg
123456789abcdefg

We can import this using


'123456789abcdefg\n123456789abcdefg' %>% 
  read_fwf(fwf_positions(start=c(1,3,3,8,10 ), 
                         end=  c(2,4,7,9,13 ) )
  )

This leads the data being imported with errors. In particular X3 and every column after that are wrong:

# A tibble: 2 × 5
     X1    X2    X3    X4    X5
  <int> <int> <int> <chr> <chr>
1    12    34 56789    ab  cdef
2    12    34 56789    ab  cdef

gergness · 2016-10-24T14:35:44Z

In case it is helpful @lucasmation, here is my work around. It depends on dplyr, stringr and tidyr.

There are 2 limitations that are okay for my use, but might not for all (but you should be able to adapt the code if needed):

It requires that all columns be read in as character.
It requires that the overlapping pattern be such that there are no variables that are subsets of more than one variable.

read_fwf_overlap <- function(file, col_positions, col_types = NULL, ...) {
  if (!is.null(col_types)) {
    warning("read_fwf_overlap does not support col_types, columns will be read in as character", 
         call. = FALSE)
  }
  fixed_specs <- fix_for_fwf_overlap(col_positions)

  out <- readr::read_fwf(file, fixed_specs$vinfo, col_types = cols(.default = col_character()), ...)
  out <- dplyr::mutate_(out, .dots = fixed_specs$mdots)
  out <- out[, col_positions$col_names] # Rearranges to original order
  out
}


# readr doesn't support overlapping fixed with columns. This
# function takes a fwf specificaiton with overlaps, and returns
# a non-overlapping set and another text object that can
# be passed to a mutate_ call on the loaded dataset to
# create the overlapping variables.
# Only works if the overlapping columns fit within exactly
# one other column.
fix_for_fwf_overlap <- function(fwf_positions) {
  vinfo <- dplyr::as_data_frame(fwf_positions)

  # Check to see if there are overlapping columns. The arrange
  # sorts the data so that the widest column with a given start
  # is first, this way the other columns can be created
  # from it.
  vinfo <- dplyr::mutate_(vinfo, width = ~end - begin + 1)
  vinfo <- dplyr::arrange_(vinfo, ~begin, ~desc(width))
  vinfo <- dplyr::mutate_(vinfo, maxend = ~cummax(dplyr::lag(end, default = 0)),
                          is_overlapping = ~begin < maxend,
                          is_overlapper = ~end > dplyr::lead(begin, default = Inf))

  # Get dataframe of non-overlapping columns
  nonoverlaps <- vinfo
  nonoverlaps <- dplyr::filter_(nonoverlaps, ~!is_overlapping)
  nonoverlaps <- dplyr::select_(nonoverlaps, ~col_names, ~begin, ~end)

  # Make mdots to pass into a mutate_ function that will take substrings of
  # The widest columns to make the sub-columns.
  ## Start by filling some of the overlapper information down to the overlapping
  overlaps <- vinfo
  overlaps <- dplyr::filter_(overlaps, ~is_overlapper | is_overlapping)
  if (nrow(overlaps) > 1) {
    overlaps <- dplyr::mutate_(overlaps, overlapper_name = ~ifelse(is_overlapper, col_names, NA),
                               overlapper_begin = ~ifelse(is_overlapper, begin, NA),
                               overlapper_end = ~ifelse(is_overlapper, end, NA))
    overlaps <- tidyr::fill_(overlaps, c("overlapper_name", "overlapper_begin", "overlapper_end"))

    ## Check if there are variables that are partially overlapped, if so, then just throw
    ## an error.
    if (any(overlaps$end > overlaps$overlapper_end)) stop("Columns overlap with more than one othe variable", 
                                                          call. = FALSE)

    ## Make a text representation of the substring calls we'll want to make
    ## on the dataframe after it is loaded.
    overlaps <- dplyr::mutate_(overlaps, rel_begin = ~begin - overlapper_begin + 1,
                               rel_end = ~end - overlapper_begin + 1)
    overlaps <- dplyr::filter_(overlaps, ~is_overlapping)

    mdots <- sprintf("stringr::str_sub(%s, %s, %s)", overlaps$overlapper_name, 
                     overlaps$rel_begin, overlaps$rel_end)

    names(mdots) <- overlaps$col_names
  } else {
    # If no overlapping, don't worry about mdots.
    mdots <- NULL
  }
  # Return list of the nonoverlaps, plus the dots to be passed into mdots
  list(vinfo = nonoverlaps, mdots = mdots)
}

'123456789abcdefg\n123456789abcdefg' %>% 
  read_fwf_overlap(fwf_positions(start=c(1,3,3,8,10 ), 
                                 end=  c(2,4,7,9,13 ) )
  ) 

#> # A tibble: 2 × 5
#>      X1    X2    X3    X4    X5
#>   <chr> <chr> <chr> <chr> <chr>
#> 1    12   345 34567    89  abcd
#> 2    12   345 34567    89  abcd

lucasmation · 2016-10-25T21:09:11Z

that is nice. I would only change it slightly so as to not lose the col_type information. I think the function should:

a) receive the import dictionary parameters (start, end, col_type, names, etc)
b) check for overlaps. If overlaps are not nested throw an error message. else remove the shortest (nested) overlapping collumn(s).
c) save in a different object the information on the removed columns:

original information: (start, end, col_type, names, etc)
the column (maintained in the dataset) which it overlaps
new start and end postions relative to the column overlaps

d) import the data, based on the "cleaned" dictionary (without overlaps)
e) recreate the additional variables based on "c)"

lucasmation · 2016-10-25T21:10:05Z

In any case, I hope the developer fix this ASAP

hadley · 2016-12-22T19:20:37Z

This seems like an extremely esoteric use of fwf, but this should throw an informative error.

lucasmation · 2016-12-23T03:34:40Z

Tks Hadley. Maybe it is more common than you think. In Brazil every microdata from IBGE, national statistics office, contains overlaping variables. It is always the municipal code (7digits), which contains the state code (2 digits) in its first two digits.

hadley · 2016-12-23T15:45:08Z

You can always extract that after the fact.

lucasmation · 2016-12-25T15:48:41Z

I had ommited the redundant variables in the import functions of IBGE data in the microdadosBrasil package. The problem is that the import functions in other software (SAS, Stata) do support overlapping columns, so users expect that to find those variables. I can, off course recreate, the variables afterwards, but that bloats the code with variable definitions, which ideally should be separate (I am very inspired by your cupcake recipes generalisation here). And more broadly, as long as import dictionary makes sense and there is a minimum amount of data using the nested definition, I shouldn't be up to the import function to restrict such cases.

hadley · 2016-12-28T15:05:35Z

I'd consider a PR to implement this, but we're unlikely to do it otherwise.

lucasmation · 2017-01-13T17:01:32Z

@nicolassoarespinto, @gutorc92: lets try to do this (see above)

commit 217cc7751ec8cfdec89e4fe21717c1a8aff421ad Author: Jim Hester <james.f.hester@gmail.com> Date: Mon Feb 27 10:53:49 2017 -0500 Fix failing tests, reformatting commit 3545e47df758687a0ad2676bfbd47f7efb146a7d Author: gfellis <gfellis@umn.edu> Date: Thu Jan 26 15:05:24 2017 -0600 Update news. commit 7b2e2f20bf92a593b6e79aa895124829ea41cf1b Author: gfellis <gfellis@umn.edu> Date: Tue Jan 24 10:59:23 2017 -0600 Fix off by one error commit 41763088413c47b5035163792564427db9ae9d0f Author: gfellis <gfellis@umn.edu> Date: Tue Jan 24 10:27:37 2017 -0600 Add error for overlapping fwf specification (#534).

lucasmation · 2017-03-15T17:44:10Z

@hadley,
we have been working on PRs to fix the two limitations of read_fwf

Can't deal with decimal places (issue read_fwf : dealing with non-integers in fwf data #399). Proposed solution: add a new argument decimal_positions to fwf_positions, modifying the main read_fwf function (pull request Decimal positions #632, implemented in Rcpp, tks to @gutorc92)
Can't deal with overlapping columns (current issue). Proposed solution: parse the import dictionary into a list of dictionaries with no overlapping variables. Then import the data based on each dictionary and cbind the results (pull request Read files with overlap #634, implemented in R, tks to @daniellima123). We implemented this as a separate function read_fwf2 (in the same spirit of how read.cvs2 is related to read.cvs), to avoid recursive calls. We could make the changes directly to read_fwf, defining it recursively. What do you think?

Please let us know if these suggestions are acceptable. We have been using them in our forked repos and they seem to work fine.

hadley · 2017-03-15T18:56:11Z

Please don't discuss on closed issues as it's I'm extremely likely I won't remember when we're next working on readr. It's much better to open a new issue

lucasmation mentioned this issue Oct 23, 2016

PNAD domicílios import dictionary is wrong lucasmation/microdadosBrasil#88

Closed

hadley added bug an unexpected problem or unintended behavior read 📖 labels Dec 22, 2016

hadley mentioned this issue Dec 22, 2016

empty column positions produce segfault in read_fwf() #511

Closed

hadley closed this as completed Dec 28, 2016

gergness added a commit to gergness/readr that referenced this issue Jan 24, 2017

Add error for overlapping fwf specification (tidyverse#534).

7d8ccf5

gergness mentioned this issue Jan 24, 2017

Add error for overlapping fwf specification (#534). #585

Closed

This was referenced Feb 1, 2017

Inconsistent result in sum weights Censo lucasmation/microdadosBrasil#121

Closed

create function read_fwf2 lucasmation/microdadosBrasil#122

Closed

lucasmation mentioned this issue Mar 15, 2017

accept PRs to enhance read_fwf to deal with decimal places and overlaping columns #637

Closed

jadermcs mentioned this issue Sep 21, 2018

error trying to read 5th interview from 2016 cran/PNADcIBGE#1

Closed

lock bot locked and limited conversation to collaborators Sep 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support overlapping fwf columns #534

support overlapping fwf columns #534

gergness commented Oct 17, 2016

lucasmation commented Oct 23, 2016 •

edited

Loading

gergness commented Oct 24, 2016

lucasmation commented Oct 25, 2016 •

edited

Loading

lucasmation commented Oct 25, 2016

hadley commented Dec 22, 2016

lucasmation commented Dec 23, 2016

hadley commented Dec 23, 2016

lucasmation commented Dec 25, 2016 •

edited

Loading

hadley commented Dec 28, 2016

lucasmation commented Jan 13, 2017

lucasmation commented Mar 15, 2017 •

edited

Loading

hadley commented Mar 15, 2017

support overlapping fwf columns #534

support overlapping fwf columns #534

Comments

gergness commented Oct 17, 2016

lucasmation commented Oct 23, 2016 • edited Loading

gergness commented Oct 24, 2016

lucasmation commented Oct 25, 2016 • edited Loading

lucasmation commented Oct 25, 2016

hadley commented Dec 22, 2016

lucasmation commented Dec 23, 2016

hadley commented Dec 23, 2016

lucasmation commented Dec 25, 2016 • edited Loading

hadley commented Dec 28, 2016

lucasmation commented Jan 13, 2017

lucasmation commented Mar 15, 2017 • edited Loading

hadley commented Mar 15, 2017

lucasmation commented Oct 23, 2016 •

edited

Loading

lucasmation commented Oct 25, 2016 •

edited

Loading

lucasmation commented Dec 25, 2016 •

edited

Loading

lucasmation commented Mar 15, 2017 •

edited

Loading