PNAD domicílios import dictionary is wrong #88

Closed
lucasmation opened this issue Oct 19, 2016 · 21 comments

@lucasmation
Owner

lucasmation commented Oct 19, 2016

download_sourceData("PNAD", 2002, unzip = T)
dom <- read_PNAD("domicilios", 2002) %>% data.table

Running the above, the positions of the variables beyond UF are wrong when compared to the import dictionary and to a visual inspection of the txt file.

This is confirmed when I inspect the import dictionary for PNAD-2002-dom:

get_import_dictionary('PNAD',2002,'domicilios')

  • the first variable (ano) is not included
  • the variable UF is OK, but all other variables start and end in the wrong position.
  • this is probably because this import dictionary is characterized by start position + length: for instance, UF and V0102 both start at position 5 (see the sketch at the end of this comment).

All of this may have been caused by us updating the source files to a newer version from IBGE but forgetting to update the import dictionaries. @nicolassoarespinto can you re-import all the PNAD import dictionaries into R from the SAS import dictionaries?
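For reference, a minimal sketch (not package code; the data.frame name dic_head is just for illustration, with values copied from the PNAD 2002 domicílios dictionary) of how the start-plus-length encoding determines the end positions:

# Sketch: with a start + length dictionary, the end position is
# fin_pos = int_pos + length - 1, so UF (start 5, length 2) and
# V0102 (start 5, length 8) overlap: UF occupies the first two digits of V0102.
dic_head <- data.frame(
  var_name = c("UF", "V0102", "V0103"),
  int_pos  = c(5, 5, 13),
  length   = c(2, 8, 3)
)
dic_head$fin_pos <- dic_head$int_pos + dic_head$length - 1
dic_head
#   var_name int_pos length fin_pos
# 1       UF       5      2       6
# 2    V0102       5      8      12
# 3    V0103      13      3      15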

@nicolassoarespinto
Collaborator

@lucasmation what dictionary are you using to compare? Here it seems all right, except for V0101, which is indeed missing and should be included.

@lucasmation
Owner Author

get_import_dictionary('PNAD',2002,'domicilios')

I think the 3rd one is wrong

@lucasmation
Owner Author

V0102 or V0103

@lucasmation
Owner Author

run old_dic <- get_import_dictionary('PNAD',2002,'domicilios')

@lucasmation
Owner Author

Then import from the SAS dictionary again and compare.

@lucasmation
Owner Author

Just download the data and check the Excel dictionary; you will see.

@lucasmation
Owner Author

Please also check for similar problems with the pessoas data

@nicolassoarespinto
Collaborator

@lucasmation this is the dictionary stored in microdadosBrasil:

   int_pos var_name   x label length decimal_places fin_pos col_type  CHAR
1        5       UF $2.            2              0       6        c  TRUE
2        5    V0102 $8.            8              0      12        c  TRUE
3       13    V0103 $3.            3              0      15        c  TRUE
4       16    V0104 $2.            2              0      17        c  TRUE
5       18    V0105  2.            2              0      19        i FALSE
6       20    V0106  2.            2              0      21        i FALSE
7       22    V0201 $1.            1              0      22        c  TRUE
8       23    V0202 $1.            1              0      23        c  TRUE
9       24    V0203 $1.            1              0      24        c  TRUE
10      25    V0204 $1.            1              0      25        c  TRUE
11      26    V0205  2.            2              0      27        i FALSE
12      28    V0206  2.            2              0      29        i FALSE
13      30    V2006 $1.            1              0      30        c  TRUE
14      31    V0207 $1.            1              0      31        c  TRUE
15      32    V0208 12.           12              0      43        d FALSE
16      44    V2081 $1.            1              0      44        c  TRUE
17      45    V0209 12.           12              0      56        d FALSE
18      57    V2091 $1.            1              0      57        c  TRUE
19      58    V0210 $1.            1              0      58        c  TRUE
20      59    V2010 $2.            2              0      60        c  TRUE

Here is the IBGE Excel file downloaded today:

[screenshot: IBGE Excel import dictionary for the PNAD 2002 domicílios file]

It seems to be correct, aside from the missing variables. Anyway, I am downloading all the data and will run our script again to get dictionaries for all years and see if there is anything strange or if I am missing something.

@lucasmation
Owner Author

This should be a reproducible example; this is what I ran:

install.packages("devtools")
library(devtools)
remove.packages('microdadosBrasil')
detach("package:microdadosBrasil", unload=TRUE)
devtools::install_github("lucasmation/microdadosBrasil")
library('microdadosBrasil')

library(tidyverse)
library(data.table)

get_import_dictionary('PNAD',2002,'domicilios')


[screenshot: get_import_dictionary('PNAD', 2002, 'domicilios') output]

So the dictionary seems to be correct (except for not having the ANO variable)

However, once I import the data, column V0102 seems to start at the wrong position:

download_sourceData("PNAD", 2002, unzip = T)
dom  <- read_PNAD("domicilios", 2002) %>% data.table
pes <- read_PNAD("pessoas", 2002) %>% data.table

In the first observation:
V0102 should be "11000015"
V0103 should be "001"

But look at how the data appears after being imported:

View(dom)
[screenshot: View(dom) showing V0102 starting at the wrong position]

@nicolassoarespinto
Collaborator

@lucasmation the problem seems to be within readr::read_fwf (or in the way we use it). I tested without UF (so that no two variables start at the same position) and it works fine, as below:


> dic<- get_import_dictionary("PNAD", 2002, "domicilios")
> PNAD_file = "PNAD_reponderado_2002/2002/Dados/DOM2002.txt"



> fwf_positions(start = dic$int_pos, end = dic$fin_pos, col_names = dic$var_name)
$begin
 [1]   0   4   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71
[32]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165
[63] 168 175

$end
 [1]   4   6  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72
[32]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168
[63] 175 182

$col_names
 [1] "V0101" "UF"    "V0102" "V0103" "V0104" "V0105" "V0106" "V0201" "V0202" "V0203" "V0204" "V0205" "V0206" "V2006" "V0207"
[16] "V0208" "V2081" "V0209" "V2091" "V0210" "V2010" "V2210" "V0211" "V0212" "V0213" "V0214" "V0215" "V0216" "V2016" "V0217"
[31] "V0218" "V0219" "V0220" "V2020" "V0221" "V0222" "V0223" "V0224" "V0225" "V0226" "V0227" "V0228" "V0229" "V0230" "V0231"
[46] "V0232" "V4105" "V4106" "V4107" "V4600" "V4601" "V4602" "V4604" "V4605" "V4606" "V4607" "V4608" "V4609" "V4610" "V4611"
[61] "V4614" "UPA"   "V4617" "V4618"
> fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
$begin
 [1]   0   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72
[32]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168
[63] 175

$end
 [1]   4  12  15  17  19  21  22  23  24  25  27  29  30  31  43  44  56  57  58  60  62  63  64  65  66  67  68  70  71  72  73
[32]  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  92  94  98 100 112 115 127 136 145 148 153 165 168 175
[63] 182

$col_names
 [1] "V0101" "V0102" "V0103" "V0104" "V0105" "V0106" "V0201" "V0202" "V0203" "V0204" "V0205" "V0206" "V2006" "V0207" "V0208"
[16] "V2081" "V0209" "V2091" "V0210" "V2010" "V2210" "V0211" "V0212" "V0213" "V0214" "V0215" "V0216" "V2016" "V0217" "V0218"
[31] "V0219" "V0220" "V2020" "V0221" "V0222" "V0223" "V0224" "V0225" "V0226" "V0227" "V0228" "V0229" "V0230" "V0231" "V0232"
[46] "V4105" "V4106" "V4107" "V4600" "V4601" "V4602" "V4604" "V4605" "V4606" "V4607" "V4608" "V4609" "V4610" "V4611" "V4614"
[61] "UPA"   "V4617" "V4618"

d<- read_fwf(PNAD_file, fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2]) )

> head(d)
# A tibble: 6 × 63
  V0101    V0102 V0103 V0104 V0105 V0106 V0201 V0202 V0203 V0204 V0205 V0206 V2006 V0207 V0208 V2081 V0209 V2091 V0210 V2010
  <int>    <int> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <chr> <chr> <int> <int> <chr> <int> <chr> <int> <int> <chr>
1  2002 11000015   001    01    05    05     1     2     2     1    06    03     2     1  <NA>    NA  <NA>    NA     2    04
2  2002 11000015   002    01    04    04     1     2     2     1    06    02     2     1  <NA>    NA  <NA>    NA     2    03
3  2002 11000015   003    01    04    02     1     2     2     1    05    02     4     1  <NA>    NA  <NA>    NA     2    02
4  2002 11000015   004    01    02    02     1     2     2     1    07    01     4     1  <NA>    NA  <NA>    NA     2    03
5  2002 11000015   005    05  <NA>  <NA>    NA    NA    NA    NA  <NA>  <NA>    NA    NA  <NA>    NA  <NA>    NA    NA  <NA>
6  2002 11000015   006    05  <NA>  <NA>    NA    NA    NA    NA  <NA>  <NA>    NA    NA  <NA>    NA  <NA>    NA    NA  <NA>
# ... with 43 more variables: V2210 <chr>, V0211 <int>, V0212 <int>, V0213 <int>, V0214 <int>, V0215 <int>, V0216 <int>,
#   V2016 <chr>, V0217 <int>, V0218 <int>, V0219 <int>, V0220 <int>, V2020 <int>, V0221 <int>, V0222 <int>, V0223 <int>,
#   V0224 <int>, V0225 <int>, V0226 <int>, V0227 <int>, V0228 <int>, V0229 <int>, V0230 <int>, V0231 <int>, V0232 <int>,
#   V4105 <int>, V4106 <int>, V4107 <int>, V4600 <int>, V4601 <chr>, V4602 <int>, V4604 <int>, V4605 <dbl>, V4606 <int>,
#   V4607 <dbl>, V4608 <dbl>, V4609 <chr>, V4610 <int>, V4611 <chr>, V4614 <chr>, UPA <int>, V4617 <chr>, V4618 <chr>


@lucasmation
Owner Author

You are correct about the error being in read_fwf.

tidyverse/readr#534

(see the comment at the end)

@lucasmation
Owner Author

lucasmation commented Oct 23, 2016

Let's see if the readr developers fix that. For now, let's just delete UF from the PNAD import dictionaries (the data.frames for each dataset in microdadosBrasil).

@lucasmation
Owner Author

@nicolassoarespinto, rather than degrading the dictionaries, please just change the wrapper function read_PNAD() to something like:

dic <- get_import_dictionary("PNAD", 2002, "domicilios")
PNAD_file <- "PNAD_reponderado_2002/2002/Dados/DOM2002.txt"
# drop row 2 (UF), which overlaps V0102, then rebuild UF from V0102 below
p <- fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
dom <- read_fwf(PNAD_file, p)
dom <- dom %>% mutate(UF = substring(V0102, 1, 2))

(please adapt to the variable names there)

@lucasmation
Owner Author

lucasmation commented Oct 24, 2016

@nicolassoarespinto, also please check whether the same problem (overlapping columns) happens in any other dataset. Something like (pseudo-code):


for each dataset
 for each filetype
   for each year
      for each variable
           check that int_pos + length of line n-1 is not greater than int_pos of line n (i.e. no overlap)


If there are overlapping columns in any other dataset-filetype, we will need to make similar adjustments to the wrapper functions.
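A minimal sketch of that per-dictionary check (the helper name has_overlap is just for illustration; it assumes the int_pos and fin_pos columns returned by get_import_dictionary):

# Sketch: a dictionary has overlapping columns when some variable starts
# at or before the position where the previous variable ends.
has_overlap <- function(dic) {
  any(dic$int_pos[-1] <= dic$fin_pos[-nrow(dic)])
}

dic <- get_import_dictionary("PNAD", 2002, "domicilios")
has_overlap(dic)  # TRUE: UF and V0102 both start at position 5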

@lucasmation
Owner Author

Are you in the master or RA branch?


@nicolassoarespinto
Collaborator

nicolassoarespinto commented Oct 25, 2016

@lucasmation I am in the master branch. We cannot adjust this in the wrapper functions because, since #65, we import dictionaries inside read_data() with get_import_dictionary().

@nicolassoarespinto
Collaborator

@lucasmation working right now on a temporary fix via changes to the dictionaries

@lucasmation
Owner Author

An option would be to do something like:

read_PNAD <- function(ft, i, ...){

  # leave old code commented
  # old code

  dic <- get_import_dictionary("PNAD", i, ft)
  file <- ""  # TODO: get the file name for filetype ft from the metadata file
  p <- fwf_positions(start = dic$int_pos[-2], end = dic$fin_pos[-2], col_names = dic$var_name[-2])
  d <- read_fwf(file, p)
  d %>% mutate(UF = substring(V0102, 1, 2))

}


I mean, don't use read_data(); use read_fwf() directly.

@lucasmation
Owner Author

If you have changed the dictionaries already, I am fine with that.

@nicolassoarespinto
Collaborator

nicolassoarespinto commented Nov 4, 2016

Found the same error in other dictionaries with the test you proposed:

                dataset    i      ft overlapping
1 CensoEducacaoSuperior 1995     ies        TRUE
2 CensoEducacaoSuperior 1996     ies        TRUE
3 CensoEducacaoSuperior 1997     ies        TRUE
4                  PNAD 2009 pessoas        TRUE

Complete code for replicability:

library(dplyr)
library(microdadosBrasil)

check_overlapping <- function(dic){

  # a variable overlaps its predecessor if it starts before the predecessor ends
  check <- any(dic$int_pos[-1] <= dic$fin_pos[-nrow(dic)])

  return(check)
}

get_available_datasets <- function(){

  datasets_list <- data(package = "microdadosBrasil")$results[, "Item"] %>%
    gsub(pattern = "_dics", replacement = "")

  return(datasets_list)
}

get_available_periods <- function(dataset, fwfonly = FALSE){

  md <- is.data.frame(dataset)

  if(!md){
    dataset <- read_metadata(dataset)
  }
  if(!"period" %in% names(dataset)){
    warning("data.frame in wrong format")
    return(NULL)
  }
  if(fwfonly){
    dataset <- dataset %>% filter(format == "fwf")
  }

  periods <- dataset$period
  return(periods)
}

get_available_filetypes <- function(dataset, period){

  md <- is.data.frame(dataset)

  if(!md){
    dataset <- read_metadata(dataset)
  }
  if(all(!grepl(pattern = "^ft_", names(dataset)))){
    warning("metadata in wrong format")
    return(NULL)
  }

  filetypes <- dataset[dataset$period == period, ] %>%
    subset(select = !is.na(.)[1, ]) %>% names() %>%
    subset(grepl(pattern = "^ft_", x = .)) %>%
    gsub(pattern = "^ft_", replacement = "", x = .)
  return(filetypes)
}

# loop over every fwf dataset / period / filetype and flag dictionaries
# with overlapping columns
c <- data.frame()

for(dataset in get_available_datasets()){
  for(i in get_available_periods(dataset, fwfonly = TRUE)){
    for(ft in get_available_filetypes(dataset, i)){

      print(i)
      print(ft)
      d <- data.frame(dataset = dataset,
                      i = i,
                      ft = ft,
                      overlapping = check_overlapping(get_import_dictionary(dataset, i, ft)),
                      stringsAsFactors = FALSE)

      d[, ] <- lapply(d[, ], as.character)

      c <- bind_rows(c, d)

    }
  }
}

c$overlapping %>% as.logical() %>% as.numeric() %>% sum()
filter(c, overlapping == "TRUE")

@nicolassoarespinto
Collaborator

Censo da Educação Superior

We have overlapping columns in the ies dictionary from 1996 to 1999.

Here is the head of the dictionary; MASCARA should end at position 8 instead of 9:

> get_import_dictionary("CensoEducacaoSuperior", 1996, "ies")
   int_pos                   var_name    x label length decimal_places fin_pos col_type  CHAR
1        1                    MASCARA <NA>  <NA>      8              0       9        c FALSE
2        9                     NR_ANO   5.            5              0      13        i FALSE
3       17                NO_DISTRITO $50.           50              0      66        c  TRUE
4       67                   CODMUNIC $12.           12              0      78        c  TRUE
5       79               NO_MUNICIPIO $50.           50              0     128        c  TRUE
6      129                 EH_CAPITAL  $1.            1              0     129        c  TRUE
7      130                      SG_UF  $2.            2              0     131        c  TRUE

This doesn't seem to be a problem for the other variables, as shown in the example below:

# A tibble: 6 × 37
  MASCARA NR_ANO           NO_DISTRITO     CODMUNIC          NO_MUNICIPIO EH_CAPITAL SG_UF     NO_UF    NO_REGIAO
    <chr>  <int>                 <chr>        <chr>                 <chr>      <chr> <chr>     <chr>        <chr>
1     872   1996 SAO JOSE DO RIO PRETO 350100449805 SAO JOSE DO RIO PRETO          N    SP Sao Paulo      SUDESTE
2     872   1996              DOURADOS 250100203704            CAJAZEIRAS          N    PB   Paraiba     NORDESTE
3     872   1996                MACEIO 270301104302                MACEIO          S    AL   Alagoas     NORDESTE
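If the fix goes through the dictionaries, one option, as a sketch only (assuming fin_pos should always equal int_pos + length - 1 for this dictionary), would be to recompute fin_pos:

# Sketch: recompute fin_pos from int_pos and length, so that MASCARA
# (int_pos 1, length 8) ends at 8 instead of 9.
dic <- get_import_dictionary("CensoEducacaoSuperior", 1996, "ies")
dic$fin_pos <- dic$int_pos + dic$length - 1
head(dic[, c("var_name", "int_pos", "length", "fin_pos")])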
