support overlapping fwf columns #534
Comments
You are right. Bump! I am having an embarrassing problem in a package (microdadosBrasil) because of the error reading overlapping columns in the original fwf data source. I am posting a new example just to be even more clear. Not only are the overlapping columns wrong, but all the following columns also have their start and end positions shifted right, making all the data wrong. Let's say I have this import dictionary
And this dataset
We can import this using
This leads to the data being imported with errors. In particular, X3 and every column after it are wrong:
|
In case it is helpful @lucasmation, here is my workaround. It depends on dplyr, stringr and tidyr. There are 2 limitations that are okay for my use but might not be for everyone (you should be able to adapt the code if needed):
read_fwf_overlap <- function(file, col_positions, col_types = NULL, ...) {
  if (!is.null(col_types)) {
    warning("read_fwf_overlap does not support col_types, columns will be read in as character",
            call. = FALSE)
  }
  fixed_specs <- fix_for_fwf_overlap(col_positions)
  # Namespace the readr helpers so this works without attaching readr
  out <- readr::read_fwf(file, fixed_specs$vinfo,
                         col_types = readr::cols(.default = readr::col_character()), ...)
  out <- dplyr::mutate_(out, .dots = fixed_specs$mdots)
  out <- out[, col_positions$col_names] # Rearranges to original order
  out
}

# readr doesn't support overlapping fixed-width columns. This
# function takes a fwf specification with overlaps, and returns
# a non-overlapping set and another text object that can
# be passed to a mutate_ call on the loaded dataset to
# create the overlapping variables.
# Only works if the overlapping columns fit within exactly
# one other column.
fix_for_fwf_overlap <- function(fwf_positions) {
  vinfo <- dplyr::as_data_frame(fwf_positions)
  # Check to see if there are overlapping columns. The arrange
  # sorts the data so that the widest column with a given start
  # is first, this way the other columns can be created
  # from it.
  vinfo <- dplyr::mutate_(vinfo, width = ~end - begin + 1)
  vinfo <- dplyr::arrange_(vinfo, ~begin, ~desc(width))
  vinfo <- dplyr::mutate_(vinfo, maxend = ~cummax(dplyr::lag(end, default = 0)),
                          is_overlapping = ~begin < maxend,
                          is_overlapper = ~end > dplyr::lead(begin, default = Inf))
  # Get dataframe of non-overlapping columns
  nonoverlaps <- vinfo
  nonoverlaps <- dplyr::filter_(nonoverlaps, ~!is_overlapping)
  nonoverlaps <- dplyr::select_(nonoverlaps, ~col_names, ~begin, ~end)
  # Make mdots to pass into a mutate_ function that will take substrings of
  # the widest columns to make the sub-columns.
  ## Start by filling some of the overlapper information down to the overlapping
  overlaps <- vinfo
  overlaps <- dplyr::filter_(overlaps, ~is_overlapper | is_overlapping)
  if (nrow(overlaps) > 1) {
    overlaps <- dplyr::mutate_(overlaps, overlapper_name = ~ifelse(is_overlapper, col_names, NA),
                               overlapper_begin = ~ifelse(is_overlapper, begin, NA),
                               overlapper_end = ~ifelse(is_overlapper, end, NA))
    overlaps <- tidyr::fill_(overlaps, c("overlapper_name", "overlapper_begin", "overlapper_end"))
    ## Check if there are variables that are partially overlapped, if so, then just throw
    ## an error.
    if (any(overlaps$end > overlaps$overlapper_end)) {
      stop("Columns overlap with more than one other variable", call. = FALSE)
    }
    ## Make a text representation of the substring calls we'll want to make
    ## on the dataframe after it is loaded.
    overlaps <- dplyr::mutate_(overlaps, rel_begin = ~begin - overlapper_begin + 1,
                               rel_end = ~end - overlapper_begin + 1)
    overlaps <- dplyr::filter_(overlaps, ~is_overlapping)
    mdots <- sprintf("stringr::str_sub(%s, %s, %s)", overlaps$overlapper_name,
                     overlaps$rel_begin, overlaps$rel_end)
    names(mdots) <- overlaps$col_names
  } else {
    # If no overlapping, don't worry about mdots.
    mdots <- NULL
  }
  # Return list of the nonoverlaps, plus the dots to be passed into mdots
  list(vinfo = nonoverlaps, mdots = mdots)
}

'123456789abcdefg\n123456789abcdefg' %>%
  read_fwf_overlap(fwf_positions(start = c(1, 3, 3, 8, 10),
                                 end   = c(2, 4, 7, 9, 13)))
#> # A tibble: 2 × 5
#> X1 X2 X3 X4 X5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 12 345 34567 89 abcd
#> 2 12 345 34567 89 abcd
|
That is nice. I would only change it slightly so as to not lose the col_type information. I think the function should:
a) receive the import dictionary parameters (start, end, col_type, names, etc)
d) import the data, based on the "cleaned" dictionary (without overlaps) |
In any case, I hope the developers fix this ASAP |
This seems like an extremely esoteric use of fwf, but this should throw an informative error. |
Thanks Hadley. Maybe it is more common than you think. In Brazil, every microdata set from IBGE, the national statistics office, contains overlapping variables. It is always the municipal code (7 digits), which contains the state code (2 digits) in its first two digits. |
You can always extract that after the fact. |
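A minimal sketch of what extracting after the fact could look like in base R, using the municipal/state code case described above. The data frame, its column names and the code values are made up for illustration and do not come from any real import.

```r
# Hypothetical data standing in for an imported IBGE file; the column
# names and values are assumptions made for this example.
dat <- data.frame(munic_code = c("3550308", "3304557"),
                  stringsAsFactors = FALSE)

# The state code is the first two digits of the 7-digit municipal code.
dat$state_code <- substr(dat$munic_code, 1, 2)
```

This keeps the import itself free of overlaps, at the cost of one extra line per derived variable.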
I had omitted the redundant variables in the import functions of IBGE data in the microdadosBrasil package. The problem is that the import functions in other software (SAS, Stata) do support overlapping columns, so users expect to find those variables. I can, of course, recreate the variables afterwards, but that bloats the code with variable definitions, which ideally should be separate (I am very inspired by your cupcake recipes generalisation here). And more broadly, as long as the import dictionary makes sense and there is a minimum amount of data using the nested definition, it shouldn't be up to the import function to restrict such cases. |
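For illustration only, here is a rough base R sketch of a dictionary-driven reader that tolerates nested columns by slicing each line independently. It is not readr's implementation; `read_fwf_nested`, the dictionary layout and the sample records are all hypothetical.

```r
# Hypothetical import dictionary with 1-based, inclusive positions.
# state_code (1-2) is nested inside munic_code (1-7).
dict <- data.frame(
  col_names = c("state_code", "munic_code", "income"),
  start     = c(1, 1, 8),
  end       = c(2, 7, 12),
  stringsAsFactors = FALSE
)

# Made-up records, one per line.
lines <- c("355030801200", "330455700950")

# Slice every column straight from the raw line, so one column's
# position never depends on where another column ended.
read_fwf_nested <- function(lines, dict) {
  cols <- lapply(seq_len(nrow(dict)), function(i) {
    substring(lines, dict$start[i], dict$end[i])
  })
  names(cols) <- dict$col_names
  as.data.frame(cols, stringsAsFactors = FALSE)
}

dat <- read_fwf_nested(lines, dict)
```

Every column comes back as character here; type conversion would have to be layered on top, and this approach will be much slower than readr on large files.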
I'd consider a PR to implement this, but we're unlikely to do it otherwise. |
@nicolassoarespinto, @gutorc92: let's try to do this (see above)
commit 217cc7751ec8cfdec89e4fe21717c1a8aff421ad
Author: Jim Hester <james.f.hester@gmail.com>
Date:   Mon Feb 27 10:53:49 2017 -0500

    Fix failing tests, reformatting

commit 3545e47df758687a0ad2676bfbd47f7efb146a7d
Author: gfellis <gfellis@umn.edu>
Date:   Thu Jan 26 15:05:24 2017 -0600

    Update news.

commit 7b2e2f20bf92a593b6e79aa895124829ea41cf1b
Author: gfellis <gfellis@umn.edu>
Date:   Tue Jan 24 10:59:23 2017 -0600

    Fix off by one error

commit 41763088413c47b5035163792564427db9ae9d0f
Author: gfellis <gfellis@umn.edu>
Date:   Tue Jan 24 10:27:37 2017 -0600

    Add error for overlapping fwf specification (#534).
@hadley,
Please let us know if these suggestions are acceptable. We have been using them in our forked repos and they seem to work fine. |
Please don't discuss on closed issues, as it's extremely likely I won't remember when we're next working on readr. It's much better to open a new issue. |
I sometimes receive fixed-width files with specifications that have overlapping columns. readr's current behavior made it difficult to figure out what was going wrong.
I wish that I was able to read in the data with the original specifications and get the 2 variables as normal, but if that's not possible, perhaps fwf_positions could check for overlaps and throw a better error?
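A rough sketch of the kind of check being asked for, written in base R. `check_fwf_overlap` is a hypothetical helper, not a readr function, and it assumes 1-based, inclusive start and end positions like the arguments passed to fwf_positions.

```r
# Hypothetical helper, not part of readr. Positions are assumed to be
# 1-based and inclusive.
check_fwf_overlap <- function(start, end,
                              col_names = paste0("X", seq_along(start))) {
  # Sort columns by start position (ties broken by end position).
  ord <- order(start, end)
  col_names <- col_names[ord]
  start <- start[ord]
  end <- end[ord]
  n <- length(start)
  if (n < 2) return(invisible(TRUE))

  # For each column after the first, the largest end position seen so far.
  prev_max_end <- cummax(end)[-n]
  bad <- which(start[-1] <= prev_max_end) + 1

  if (length(bad) > 0) {
    stop("Overlapping fixed-width columns. ",
         "Columns starting before an earlier column ends: ",
         paste(col_names[bad], collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# A non-overlapping specification passes silently.
check_fwf_overlap(start = c(1, 3, 8), end = c(2, 7, 9))

# An overlapping specification raises an error naming the offending column.
try(check_fwf_overlap(start = c(1, 3, 3, 8, 10), end = c(2, 4, 7, 9, 13)))
```

Running a check like this before the import would surface the problem in the specification itself rather than as silently shifted data.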