Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel losing first few rows of .xls files. #611

Closed
ddekadt opened this issue Apr 17, 2020 · 8 comments
Closed

read_excel losing first few rows of .xls files. #611

ddekadt opened this issue Apr 17, 2020 · 8 comments
Labels
bug an unexpected problem or unintended behavior libxls xls 👵

Comments

@ddekadt
Copy link

ddekadt commented Apr 17, 2020

Hello,

I've been using readxl::read_excel for a project. Great package and function -- thank you so much!

I have one issue that I would love help with -- I don't think it has been address before (I have done my due diligince, I believe).

I include two sample .xls spreadsheets here. For one (success.xls) the function successfully reads in all the data, for the other (failure.xls) it loses the first two rows. This lost information is critical for me.

Code that reproduces this problem for these two files is as follows:

indir = "drive here"
files = list.files(indir)
onesheet1 = read_excel(paste0(indir,"//",files[1]), col_names=F)
onesheet2 = read_excel(paste0(indir,"//",files[2]), col_names=F)

I use this (roughly) code on 1500 .xls files. For some large % of the .xls files, read_excel fails to retrieve the first two rows of data. I imagine there is some issue related to the fact that these are .xls not .xlsx files -- fwiw, these are automatically produced by Abbyy Finereader.

Any assistance would be appreciated,

Daniel

@jennybc
Copy link
Member

jennybc commented May 11, 2020

I will have to see what happens with these files when I use libxls on them w/o R or readxl. But the chances are high that this is a libxls matter. If/when I can prove that, I'll report it upstream.

@jennybc
Copy link
Member

jennybc commented May 11, 2020

I've read these files now and am even more suspicious that it is a libxls issue. Related to encoding.

Content that is not being read out of cell A1 of failure.xls:

ELECTORAL DIVISION OF SOMERSET EAST ; FIELD-CÓRNËTCY No. 5, ZWAGERSHÓÊN.

@jennybc jennybc added the libxls label May 11, 2020
@evanmiller
Copy link

This value appears to be stored in an RSTRING (rich-text string) record, which is not currently supported by libxls.

evanmiller added a commit to libxls/libxls that referenced this issue Jun 6, 2020
Record type 0xD6 was used in BIFF5 to represent rich-text strings. The
rich-text information is stored after the text itself, so we can treat
it like a LABEL record.

See tidyverse/readxl#611
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jul 14, 2020
Changes:
 - Allow truncated XLS files #55 #60 #76 tidyverse/readxl#619
 - Fix long-standing "extra column" bug #73
 - Support for RSTRING records (rich-text cells in older BIFF5 files)
   tidyverse/readxl#611

Minimum version updated in bl3 due to header changes.
@jennybc
Copy link
Member

jennybc commented Jul 29, 2021

Hi @ddekadt I've just incorporated the current version of libxls into the dev version of readxl. I have high hopes that resolves this issue.

Your sample files are no longer on DropBox. If you're still listening, can you share again so I can verify this is fixed?

@ddekadt
Copy link
Author

ddekadt commented Jul 29, 2021

Hi @jennybc -- here you go: https://www.dropbox.com/s/x81wwr35po4r1l6/sample_data.zip?dl=0

Thank you!

@jennybc
Copy link
Member

jennybc commented Jul 30, 2021

Yes, as I had hoped, the updated libxls has fixed this. This is only available in the dev version of readxl at this point, but it's good to know this is fixed.

library(readxl)

dat <- read_xls("investigations/sample_data/success.xls")
#> New names:
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * `` -> ...6
#> * ...
dat
#> # A tibble: 20 x 8
#>    `ELECTORAL DIVISION … ...2    ...3     ...4     ...5  ...6       ...7   ...8 
#>    <chr>                 <chr>   <chr>    <chr>    <chr> <chr>      <chr>  <chr>
#>  1 <NA>                  <NA>    <NA>     <NA>     <NA>  <NA>       <NA>   <NA> 
#>  2 Vo.                   1 I Su… Christi… Profess… Qual… Situation… <NA>   Pace…
#>  3 50                    Abson   William  Proprie… Hous… Hartman-r… Tenant E    
#>  4 51                    Achard… Gamiet   Boatman  Sala… Rudolph-s… Fishe… M    
#>  5 52                    Achard… Mahomet  Clerk    55    55         <NA>   M    
#>  6 Ó3                    Acker   Henry    <NA>     Hous… Sherlock-… Tenant E    
#>  7 54                    Ackerd… Isaac    Mason    55    Armstrong… <NA>   M    
#>  8 55                    Ackerl… F ranci… Clerk    Sala… Britannia… AV. R… E    
#>  9 56                    Ackman  j J ames Sub-gan… 55    North End  Cape … E    
#> 10 57                    Acurrow Francis  Guard    55    Palmersto… Tramw… E    
#> 11 58                    Adam    John     Coffee … 55    Strangers… Foste… 0    
#> 12 59                    Adams   'Louis   Labourer 55    Kirkwood-… Harbo… M    
#> 13 60                    (Adami  Mickie   Merchant Hous… Evatt-str… Tenant E    
#> 14 61                    Adami   Nanbry   55       55    55         55     E    
#> 15 62                    Adami   'Darwick 55       55    <NA>       <NA>   E    
#> 16 63                    Adams   lAchmat  Boatman  Sala… W aimer-r… Fishe… M    
#> 17 64                    (Adams  Aden     <NA>     <NA>  <NA>       <NA>   M    
#> 18 65                    Adams   (Charles <NA>     <NA>  Coode-str… W. Jo… 0    
#> 19 66                    (Adams  Charles… Clerk    55    Donkin-st… Cape … E    
#> 20 67                    ?Adams  Ernest   Boatman  55    W aimer-r… Harbo… 0

dat <- read_xls("investigations/sample_data/failure.xls")
#> New names:
#> * `` -> ...7
#> * `` -> ...8
dat
#> # A tibble: 13 x 8
#>      No. `Surname in full… `Christian Names … `Profession, Trade… Qualification.
#>    <dbl> <chr>             <chr>              <chr>               <chr>         
#>  1  1788 McNally           Henry              Cape Police         Salary        
#>  2  1789 Nel               Jacobus Petrus     Farmer              Householder   
#>  3  1790 Nel               Louis Johannes     <NA>                <NA>          
#>  4  1791 Paddock           Charles            <NA>                <NA>          
#>  5  1792 Paddock           Ernest Harding     <NA>                J,            
#>  6  1793 Parkman           Charles Stewart    <NA>                M             
#>  7  1794 Peover            John               <NA>                M             
#>  8  1795 Read              George             Cape Police         Salary        
#>  9  1796 Schoeman, jun,    Christoff el Andr… Farmer              Householder   
#> 10  1797 Schoeman          Johannes Cornells  <NA>                <NA>          
#> 11  1798 Strydom           Jan Abram          <NA>                0.5           
#> 12  1799 Strydom           Johannes Hendrick  <NA>                <NA>          
#> 13  1800 Snyman            Jacobus            <NA>                Occupier      
#> # … with 3 more variables:
#> #   Situation of Property in respect of which qualification exists, or pi esent address if qualification be not in respect of Property. <chr>,
#> #   ...7 <chr>, ...8 <chr>

Created on 2021-07-29 by the reprex package (v2.0.0.9000)

@jennybc jennybc closed this as completed Jul 30, 2021
@jennybc jennybc reopened this Jul 30, 2021
@jennybc
Copy link
Member

jennybc commented Jul 30, 2021

Hmmm, maybe I spoke too soon. It reads without error, but it still seems to be missing data.

@jennybc
Copy link
Member

jennybc commented Jul 30, 2021

Reading these files with xls2csv appears to work, which suggests the problem now lies in readxl itself.

jennys-MacBook-Pro:readxl jenny$ ~/rrr/libxls/xls2csv investigations/sample_data/success.xls 
FILE: investigations/sample_data/success.xls
"ELECTORAL DIVISION OF PORT ELIZABETH : F.C. No. I, PORT ELIZABETH.";"";"";"";"";"";"";""
"";"";"";"";"";"";"";""
"Vo.";"1 I Surname in full.";"Christian Names (if any) in full.";"Profession, Trade, or other description.";"Qualification.";"Situation of Property in les-pect of which qualification exists, or present address if qualification be not in respect of Property.";"(a) Name and Address of Employer in case of Salary or Wages qualification. (J) When qualification is in respect of Occupation state whether occupancy is as owner, tenant, or otherwise ; and if as tenant, whether as sole or joint tenant.";"Pace distinction."
50;"Abson";"William";"Proprietor";"Householder";"Hartman-road";"Tenant";"E"
51;"Achardien";"Gamiet";"Boatman";"Salary";"Rudolph-strcet";"Fisheries";"M"
52;"Achardien";"Mahomet";"Clerk";55;55;"";"M"
"Ó3";"Acker";"Henry";"";"Householder";"Sherlock-street";"Tenant";"E"
54;"Ackerdiene";"Isaac";"Mason";55;"Armstrong-street";"";"M"
55;"Ackerley";"F rancis Benjamin";"Clerk";"Salary";"Britannia-street";"AV. R. Jecks & Co.";"E"
56;"Ackman";"j J ames";"Sub-ganger";55;"North End";"Cape Government Railways";"E"
57;"Acurrow";"Francis";"Guard";55;"Palmerston-street";"Tramway Co.";"E"
58;"Adam";"John";"Coffee roastr.";55;"Strangers Location";"Foster Bros.";0
59;"Adams";"'Louis";"Labourer";55;"Kirkwood-street";"Harbour Board";"M"
60;"(Adami";"Mickie";"Merchant";"Householder";"Evatt-street";"Tenant";"E"
61;"Adami";"Nanbry";55;55;55;55;"E"
62;"Adami";"'Darwick";55;55;"";"";"E"
63;"Adams";"lAchmat";"Boatman";"Salary";"W aimer-road";"Fisheries";"M"
64;"(Adams";"Aden";"5* Shoemaker";"";"";"";"M"
65;"Adams";"(Charles";"";"";"Coode-street";"W. Jones & Co.";0
66;"(Adams";"Charles Howell";"Clerk";55;"Donkin-street";"Cape Government Railways";"E"
67;"?Adams";"Ernest";"Boatman";55;"W aimer-road";"Harbour Board";0

jennys-MacBjennys-MacBook-Pro:readxl jenny$ ~/rrr/libxls/xls2csv investigations/sample_data/failure.xls 
FILE: investigations/sample_data/failure.xls
"ELECTORAL DIVISION OF SOMERSET EAST ; FIELD-CÓRNËTCY No. 5, ZWAGERSHÓÊN.";"";"";"";"";"";"";""
"";"";"";"";"";"";"";""
"No.";"Surname in full.";"Christian Names (if any) in full.";"Profession, Trade, or other description.";"Qualification.";"Situation of Property in respect of which qualification exists, or pi esent address if qualification be not in respect of Property.";"(a) Name and Address of Employer in owe of Salary or Waqet qualification. (l>') When qualification is in respect of Occupation state whether occupancy is as owner, t< nant, or otherwise ; and if as tenant, whether as sole or joint tenant.";"ri ? s g •rM S ’"
1788;"McNally";"Henry";"Cape Police";"Salary";"Grootvlakte";"Colonial Government";"E"
1789;"Nel";"Jacobus Petrus";"Farmer";"Householder";"Sampsonskloof";"Joint Tenant";"E"
1790;"Nel";"Louis Johannes";"";"";"Weltevreden";"Sole Tenant";"E"
1791;"Paddock";"Charles";"";"";"Grootolakte";"Owner";"E ;"
1792;"Paddock";"Ernest Harding";"";"J,";"";"Occupier otherwise";"E"
1793;"Parkman";"Charles Stewart";"";"M";"Fitzkraal";"Sole Tenant";"E"
1794;"Peover";"John";"";"M";"Luípésklip";"Owner";"E"
1795;"Read";"George";"Cape Police";"Salary";"Grootvlakte";"Colonial Government";"E"
1796;"Schoeman, jun,";"Christoff el Andries";"Farmer";"Householder";"Luipesklip";"Joint Tenant";"E"
1797;"Schoeman";"Johannes Cornells";"";"";"";"";"E"
1798;"Strydom";"Jan Abram";"";0.5;"Weltevreden";"Sole Tenant";"E"
1799;"Strydom";"Johannes Hendrick";"";"";"Waterval";"Owner";"E"
1800;"Snyman";"Jacobus";"";"Occupier";"Weltevreden";"Sole Tenant";"E"

@jennybc jennybc added the bug an unexpected problem or unintended behavior label Jul 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior libxls xls 👵
Projects
None yet
Development

No branches or pull requests

3 participants