PNAD domicílios import dictionary is wrong #88

Closed
lucasmation opened this issue Oct 19, 2016 · 21 comments

@lucasmation
Owner

lucasmation commented Oct 19, 2016

download_sourceData("PNAD", 2002, unzip = T)
dom <- read_PNAD("domicilios", 2002) %>% data.table

Running the above, the positions of the variables beyond UF are wrong when compared to the import dictionary and to a visual inspection of the txt file.

This is confirmed when I inspect the import dictionary for PNAD-2002-dom:

get_import_dictionary('PNAD',2002,'domicilios')

  • the first variable (ano) is not included
  • the variable UF is OK, but all other variables start and end in the wrong position.
  • this is probably because this import dictionary is characterized by start position + length: for instance, UF and V0102 both start at position 5 (see the sketch at the end of this comment).

All of this may have been caused by us updating the source files to a newer version from IBGE but forgetting to update the import dictionaries. @nicolassoarespinto can you re-import all the PNAD import dictionaries into R from the SAS import dictionaries?
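For reference, a minimal sketch (not package code; the data.frame name dic_head is just for illustration, with values copied from the PNAD 2002 domicílios dictionary) of how the start-plus-length encoding determines the end positions:

# Sketch: with a start + length dictionary, the end position is
# fin_pos = int_pos + length - 1, so UF (start 5, length 2) and
# V0102 (start 5, length 8) overlap: UF occupies the first two digits of V0102.
dic_head <- data.frame(
  var_name = c("UF", "V0102", "V0103"),
  int_pos  = c(5, 5, 13),
  length   = c(2, 8, 3)
)
dic_head$fin_pos <- dic_head$int_pos + dic_head$length - 1
dic_head
#   var_name int_pos length fin_pos
# 1       UF       5      2       6
# 2    V0102       5      8      12
# 3    V0103      13      3      15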

@nicolassoarespinto
Collaborator

@lucasmation what dictionary are you using to compare? Here it seems all right, except for V0101, which is indeed missing and should be included.

@lucasmation
Owner Author

get_import_dictionary('PNAD',2002,'domicilios')

I think the 3rd one is wrong

@lucasmation
Owner Author

V0102 or V0103

@lucasmation
Owner Author

run old_dic <- get_import_dictionary('PNAD',2002,'domicilios')

@lucasmation
Owner Author

Then import from the SAS dictionary again and compare.

@lucasmation
Owner Author

Just download the data and check the Excel dictionary; you will see.

@lucasmation
Owner Author

Please also check for similar problems with the pessoas data

@nicolassoarespinto
Collaborator

@lucasmation this is the dictionary stored in microdadosBrasil:

   int_pos var_name   x label length decimal_places fin_pos col_type  CHAR
1        5       UF $2.            2              0       6        c  TRUE
2        5    V0102 $8.            8              0      12        c  TRUE
3       13    V0103 $3.            3              0      15        c  TRUE
4       16    V0104 $2.            2              0      17        c  TRUE
5       18    V0105  2.            2              0      19        i FALSE
6       20    V0106  2.            2              0      21        i FALSE
7       22    V0201 $1.            1              0      22        c  TRUE
8       23    V0202 $1.            1              0      23        c  TRUE
9       24    V0203 $1.            1              0      24        c  TRUE
10      25    V0204 $1.            1              0      25        c  TRUE
11      26    V0205  2.            2              0      27        i FALSE
12      28    V0206  2.            2              0      29        i FALSE
13      30    V2006 $1.            1              0      30        c  TRUE
14      31    V0207 $1.            1              0      31        c  TRUE
15      32    V0208 12.           12              0      43        d FALSE
16      44    V2081 $1.            1              0      44        c  TRUE
17      45    V0209 12.           12              0      56        d FALSE
18      57    V2091 $1.            1              0      57        c  TRUE
19      58    V0210 $1.            1              0      58        c  TRUE
20      59    V2010 $2.            2              0      60        c  TRUE

Here is the IBGE Excel file downloaded today:

[screenshot: IBGE Excel import dictionary for the PNAD 2002 domicílios file]

It seems to be correct, aside from the missing variables. Anyway, I am downloading all the data and will run our script again to get dictionaries for all years and see if there is anything strange or if I am missing something.

@lucasmation
Owner Author

This should be a reproducible example; this is what I ran:

install.packages("devtools")
library(devtools)
remove.packages('microdadosBrasil')
detach("package:microdadosBrasil", unload=TRUE)
devtools::install_github("lucasmation/microdadosBrasil")
library('microdadosBrasil')

library(tidyverse)
library(data.table)

get_import_dictionary('PNAD',2002,'domicilios')


[screenshot: get_import_dictionary('PNAD', 2002, 'domicilios') output]

So the dictionary seems to be correct (except for not having the ANO variable)

However, once I import the data, column V0102 seems to start at the wrong position:

download_sourceData("PNAD", 2002, unzip = T)
dom  <- read_PNAD("domicilios", 2002) %>% data.table
pes <- read_PNAD("pessoas", 2002) %>% data.table

In the first observation:
V0102 should be "11000015"
V0103 should be "001"

But look at how the data appears after being imported:

View(dom)
[screenshot: View(dom) showing V0102 starting at the wrong position]

@nicolassoarespinto
Collaborator

@lucasmation the problem seems to be within readr::read_fwf (or in the way we use it). I tested without UF (so that no two variables start at the same position) and it works fine, as below:


> dic<- get_import_dictionary("PNAD", 2002, "domicilios")
> PNAD_file = "PNAD_reponderado_2002/2002/Dados/DOM2002.txt"



> fwf_positions(start = dic$int_pos, end = dic$fin_pos, col_names = dic$var_name)
$begin
 [1]   0   4   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71
[32]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165
[63] 168 175

$end
 [1]   4   6  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72
[32]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168
[63] 175 182

$col_names
 [1] "V0101" "UF"    "V0102" "V0103" "V0104" "V0105" "V0106" "V0201" "V0202" "V0203" "V0204" "V0205" "V0206" "V2006" "V0207"
[16] "V0208" "V2081" "V0209" "V2091" "V0210" "V2010" "V2210" "V0211" "V0212" "V0213" "V0214" "V0215" "V0216" "V2016" "V0217"
[31] "V0218" "V0219" "V0220" "V2020" "V0221" "V0222" "V0223" "V0224" "V0225" "V0226" "V0227" "V0228" "V0229" "V0230" "V0231"
[46] "V0232" "V4105" "V4106" "V4107" "V4600" "V4601" "V4602" "V4604" "V4605" "V4606" "V4607" "V4608" "V4609" "V4610" "V4611"
[61] "V4614" "UPA"   "V4617" "V4618"
> fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
$begin
 [1]   0   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72
[32]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168
[63] 175

$end
 [1]   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72  73
[32]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168 175
[63] 182

$col_names
 [1] "V0101" "V0102" "V0103" "V0104" "V0105" "V0106" "V0201" "V0202" "V0203" "V0204" "V0205" "V0206" "V2006" "V0207" "V0208"
[16] "V2081" "V0209" "V2091" "V0210" "V2010" "V2210" "V0211" "V0212" "V0213" "V0214" "V0215" "V0216" "V2016" "V0217" "V0218"
[31] "V0219" "V0220" "V2020" "V0221" "V0222" "V0223" "V0224" "V0225" "V0226" "V0227" "V0228" "V0229" "V0230" "V0231" "V0232"
[46] "V4105" "V4106" "V4107" "V4600" "V4601" "V4602" "V4604" "V4605" "V4606" "V4607" "V4608" "V4609" "V4610" "V4611" "V4614"
[61] "UPA"   "V4617" "V4618"

d<- read_fwf(PNAD_file, fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2]) )

> head(d)
# A tibble: 6 × 63
  V0101    V0102 V0103 V0104 V0105 V0106 V0201 V0202 V0203 V0204 V0205 V0206 V2006 V0207 V0208 V2081 V0209 V2091 V0210 V2010
  <int>    <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <int> <int> <chr> <int> <chr> <int> <int> <chr>
1  2002 11000015   001    01    05    05     1     2     2     1    06    03     2     1  <NA>    NA  <NA>    NA     2    04
2  2002 11000015   002    01    04    04     1     2     2     1    06    02     2     1  <NA>    NA  <NA>    NA     2    03
3  2002 11000015   003    01    04    02     1     2     2     1    05    02     4     1  <NA>    NA  <NA>    NA     2    02
4  2002 11000015   004    01    02    02     1     2     2     1    07    01     4     1  <NA>    NA  <NA>    NA     2    03
5  2002 11000015   005    05  <NA>  <NA>    NA    NA    NA    NA  <NA>  <NA>    NA    NA  <NA>    NA  <NA>    NA    NA  <NA>
6  2002 11000015   006    05  <NA>  <NA>    NA    NA    NA    NA  <NA>  <NA>    NA    NA  <NA>    NA  <NA>    NA    NA  <NA>
# ... with 43 more variables: V2210 <chr>, V0211 <int>, V0212 <int>, V0213 <int>, V0214 <int>, V0215 <int>, V0216 <int>,
#   V2016 <chr>, V0217 <int>, V0218 <int>, V0219 <int>, V0220 <int>, V2020 <int>, V0221 <int>, V0222 <int>, V0223 <int>,
#   V0224 <int>, V0225 <int>, V0226 <int>, V0227 <int>, V0228 <int>, V0229 <int>, V0230 <int>, V0231 <int>, V0232 <int>,
#   V4105 <int>, V4106 <int>, V4107 <int>, V4600 <int>, V4601 <chr>, V4602 <int>, V4604 <int>, V4605 <dbl>, V4606 <int>,
#   V4607 <dbl>, V4608 <dbl>, V4609 <chr>, V4610 <int>, V4611 <chr>, V4614 <chr>, UPA <int>, V4617 <chr>, V4618 <chr>


@lucasmation
Owner Author

You are correct about the error being in read_fwf.

tidyverse/readr#534

(see the comment at the end)

@lucasmation
Owner Author

lucasmation commented Oct 23, 2016

Let's see if the readr developers fix that. For now, let's just delete UF from the PNAD import dictionaries (the data.frames for each dataset in microdadosBrasil).

@lucasmation
Owner Author

@nicolassoarespinto, rather than degrading the dictionaries, please just change the wrapper function read_PNAD() to something like:

dic <- get_import_dictionary("PNAD", 2002, "domicilios")
PNAD_file <- "PNAD_reponderado_2002/2002/Dados/DOM2002.txt"
# drop row 2 (UF), which overlaps V0102, then rebuild UF from V0102 below
p <- fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
dom <- read_fwf(PNAD_file, p)
dom <- dom %>% mutate(UF = substring(V0102, 1, 2))

(please adapt to the variable names there)

@lucasmation
Owner Author

lucasmation commented Oct 24, 2016

@nicolassoarespinto, also please check whether the same problem (overlapping columns) happens in any other dataset. Something like (pseudo-code):


for each dataset
 for each filetype
   for each year
      for each variable
           check that int_pos + length of line n-1 is not greater than int_pos of line n (i.e. no overlap)


If there are overlapping columns in any other dataset-filetype, we will need to make similar adjustments to the wrapper functions.
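A minimal sketch of that per-dictionary check (the helper name has_overlap is just for illustration; it assumes the int_pos and fin_pos columns returned by get_import_dictionary):

# Sketch: a dictionary has overlapping columns when some variable starts
# at or before the position where the previous variable ends.
has_overlap <- function(dic) {
  any(dic$int_pos[-1] <= dic$fin_pos[-nrow(dic)])
}

dic <- get_import_dictionary("PNAD", 2002, "domicilios")
has_overlap(dic)  # TRUE: UF and V0102 both start at position 5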

@lucasmation
Owner Author

Are you in the master or RA branch?


@nicolassoarespinto
Collaborator

nicolassoarespinto commented Oct 25, 2016

@lucasmation I am in the master branch. We cannot adjust this in the wrapper functions because, since #65, we import dictionaries inside read_data() with get_import_dictionary().

@nicolassoarespinto
Collaborator

@lucasmation working right now on a temporary fix via changes to the dictionaries

@lucasmation
Owner Author

An option would be to do something like:

read_PNAD <- function(ft, i, ...){

  # leave old code commented
  # old code

  dic <- get_import_dictionary("PNAD", i, ft)
  file <- ""  # TODO: get the file name for filetype ft from the metadata file
  p <- fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
  d <- read_fwf(file, p)
  d %>% mutate(UF = substring(V0102, 1, 2))

}


I mean, don't use read_data(); use read_fwf() directly.

@lucasmation
Owner Author

If you have changed the dictionaries already, I am fine with that.

@nicolassoarespinto
Collaborator

nicolassoarespinto commented Nov 4, 2016

Found the same error in other dictionaries with the test you proposed:

                dataset    i      ft overlapping
1 CensoEducacaoSuperior 1995     ies        TRUE
2 CensoEducacaoSuperior 1996     ies        TRUE
3 CensoEducacaoSuperior 1997     ies        TRUE
4                  PNAD 2009 pessoas        TRUE

Complete code for replicability:

library(dplyr)
library(microdadosBrasil)

check_overlapping <- function(dic){

  # a variable overlaps its predecessor if it starts before the predecessor ends
  check <- any(dic$int_pos[-1] <= dic$fin_pos[-nrow(dic)])

  return(check)
}

get_available_datasets <- function(){

  datasets_list <- data(package = "microdadosBrasil")$results[, "Item"] %>%
    gsub(pattern = "_dics", replacement = "")

  return(datasets_list)
}

get_available_periods <- function(dataset, fwfonly = FALSE){

  md <- is.data.frame(dataset)

  if(!md){
    dataset <- read_metadata(dataset)
  }
  if(!"period" %in% names(dataset)){
    warning("data.frame in wrong format")
    return(NULL)
  }
  if(fwfonly){
    dataset <- dataset %>% filter(format == "fwf")
  }

  periods <- dataset$period
  return(periods)
}

get_available_filetypes <- function(dataset, period){

  md <- is.data.frame(dataset)

  if(!md){
    dataset <- read_metadata(dataset)
  }
  if(all(!grepl(pattern = "^ft_", names(dataset)))){
    warning("metadata in wrong format")
    return(NULL)
  }

  filetypes <- dataset[dataset$period == period, ] %>%
    subset(select = !is.na(.)[1, ]) %>% names() %>%
    subset(grepl(pattern = "^ft_", x = .)) %>%
    gsub(pattern = "^ft_", replacement = "", x = .)
  return(filetypes)
}

# loop over every fwf dataset / period / filetype and flag dictionaries
# with overlapping columns
c <- data.frame()

for(dataset in get_available_datasets()){
  for(i in get_available_periods(dataset, fwfonly = TRUE)){
    for(ft in get_available_filetypes(dataset, i)){

      print(i)
      print(ft)
      d <- data.frame(dataset = dataset,
                      i = i,
                      ft = ft,
                      overlapping = check_overlapping(get_import_dictionary(dataset, i, ft)),
                      stringsAsFactors = FALSE)

      d[, ] <- lapply(d[, ], as.character)

      c <- bind_rows(c, d)

    }
  }
}

c$overlapping %>% as.logical() %>% as.numeric() %>% sum()
filter(c, overlapping == "TRUE")

@nicolassoarespinto
Collaborator

Censo da Educação Superior

We have overlapping columns in the ies dictionary from 1996 to 1999.

Here is the head of the dictionary; MASCARA should end at position 8 instead of 9:

> get_import_dictionary("CensoEducacaoSuperior", 1996, "ies")
   int_pos                   var_name    x label length decimal_places fin_pos col_type  CHAR
1        1                    MASCARA <NA>  <NA>      8              0       9        c FALSE
2        9                     NR_ANO   5.            5              0      13        i FALSE
3       17                NO_DISTRITO $50.           50              0      66        c  TRUE
4       67                   CODMUNIC $12.           12              0      78        c  TRUE
5       79               NO_MUNICIPIO $50.           50              0     128        c  TRUE
6      129                 EH_CAPITAL  $1.            1              0     129        c  TRUE
7      130                      SG_UF  $2.            2              0     131        c  TRUE

This doesn't seem to be a problem for the other variables, as shown in the example below:

# A tibble: 6 × 37
  MASCARA NR_ANO           NO_DISTRITO     CODMUNIC          NO_MUNICIPIO EH_CAPITAL SG_UF     NO_UF    NO_REGIAO
    <chr>  <int>                 <chr>        <chr>                 <chr>      <chr> <chr>     <chr>        <chr>
1     872   1996 SAO JOSE DO RIO PRETO 350100449805 SAO JOSE DO RIO PRETO          N    SP Sao Paulo      SUDESTE
2     872   1996              DOURADOS 250100203704            CAJAZEIRAS          N    PB   Paraiba     NORDESTE
3     872   1996                MACEIO 270301104302                MACEIO          S    AL   Alagoas     NORDESTE
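If the fix goes through the dictionaries, one option, as a sketch only (assuming fin_pos should always equal int_pos + length - 1 for this dictionary), would be to recompute fin_pos:

# Sketch: recompute fin_pos from int_pos and length, so that MASCARA
# (int_pos 1, length 8) ends at 8 instead of 9.
dic <- get_import_dictionary("CensoEducacaoSuperior", 1996, "ies")
dic$fin_pos <- dic$int_pos + dic$length - 1
head(dic[, c("var_name", "int_pos", "length", "fin_pos")])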
