Parsing issues with nomer tsv records? #5
@cboettig thanks for sharing. Your comments highlight several separate issues. I'll attempt to address each of them separately in the following comments. I am planning to release a new GloBI Taxon Graph version v0.3.2 with the corrections applied in this thread.
First, in line 98457 in taxonCache.tsv v0.3.1, I found (please note that the header was added for convenience):
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
EOL:224784 Neoniphon sammara Species "Kolvin-soldaat @af | Deek @ar | 鐵甲 @cnm | Eichhörnchenfisch @de | Sammara squirrelfish @en | Candil samara @es | Corocoro @fj | Marignan tacheté @fr | \"Ala'ihi @hw | Ukeguchi-ittoudai @ja | 무늬얼게돔 @ko | Jerra @mh | Kolithaduva @ml | Kinolu @ms | Esquilo samara @pt | Malau-tui @sm | Baga-baga @tl | Araoe @ty | Cá Son dá dài @vi | 条纹长颏鳂 @zh | 莎姆新東洋金鱗魚 @zh-Hant |" Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784 kingdom | phylum | class | order | family | genus | species http://eol.org/pages/224784 http://media.eol.org/content/2009/05/19/16/85885_98_68.jpg
Note that the commonNames value is (incorrectly) enclosed by double quotes and contains an escaped \"Ala'ihi @hw . On closer inspection, the commonNames value was enclosed by quotes when csv was still used to store taxonCache. This also explains the escaped double quote.
Also, it appears that the Hawaiian name for Neoniphon sammara is not transcribed properly in EOL http://eol.org/pages/224784/names/common_names . Instead of @jhammock any chance you can update the common name? From sources like http://www.wpcouncil.org/managed-fishery-ecosystems/hawaii-archipelago/regulations-and-enforcement-hawaii/ it appears that the common name is used to describe various different species, not just Neoniphon sammara .
To correct for this, the enclosing double quotes are removed and the escaped double quote has been replaced with the original string reported by EOL, including the double quotes. Note that TSV does not need escaping of quotes (https://www.iana.org/assignments/media-types/text/tab-separated-values) . |
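The correction described above can be sketched in base R. This is only an illustration of the idea, not the code used to produce v0.3.2: the helper name strip_csv_quoting is hypothetical, and it assumes the leftover CSV quoting always takes the form of a value wrapped in double quotes with inner quotes escaped as \" .

```r
# Hypothetical helper: undo CSV-style quoting that leaked into a TSV field.
# Assumption: affected values start and end with a double quote and use \"
# for embedded double quotes; all other values are left untouched.
strip_csv_quoting <- function(x) {
  wrapped <- !is.na(x) & grepl('^".*"$', x)
  # drop the enclosing quotes
  inner <- substr(x[wrapped], 2, nchar(x[wrapped]) - 1)
  # turn the escaped quote \" back into a plain double quote
  x[wrapped] <- gsub('\\"', '"', inner, fixed = TRUE)
  x
}

# Shortened example values modelled on the commonNames field quoted above
common_names <- c("\"Deek @ar | \\\"Ala'ihi @hw |\"", "Corocoro @fj", NA)
cleaned <- strip_csv_quoting(common_names)
print(cleaned)
```

Values that are not wrapped in quotes, and missing values, pass through unchanged, so the helper could be applied to a whole column.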
A second issue was reported on line 119858:
A similar pattern is observed here: csv-style escaping/quoting used because of the usage of double quotes in the text. @jhammock any idea why To correct, double quotes are removed as well as the escaped double quotes. |
A third issue was reported on line 425504:
Same double quoting issues here. Integration tests confirm that iNaturalist explicitly reports |
A fourth issue was found, where entries in taxonCache were found without a taxonId column. This was a transformation mistake and entries with missing taxonId columns will be removed. Note that the entries without an id actually had valid counterparts in the taxonCache file. |
Also, please note that the first three issues are definitely data errors, but not TSV parsing errors. TSV, according to IANA https://www.iana.org/assignments/media-types/text/tab-separated-values , does not have any string quoting. Please see tidyverse/readr#844 . If an empty quote parameter is used, no problems are encountered when reading taxonCache.tsv :
taxonCache <- readr::read_tsv('taxonCache.tsv', quote='')
Parsed with column specification:
cols(
id = col_character(),
name = col_character(),
rank = col_character(),
commonNames = col_character(),
path = col_character(),
pathIds = col_character(),
pathNames = col_character(),
externalUrl = col_character(),
thumbnailUrl = col_character()
)
|=================================================================| 100% 904 MB
> library(readr)
> problems(taxonCache)
# tibble [0 × 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>
@cboettig curious to hear your thoughts on all this. |
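For readers without readr at hand, the same point can be illustrated with base R on a toy input. This is a sketch under stated assumptions: the first record is a shortened version of the EOL:224784 row discussed earlier, and the second record (EX:2) is made up for the example.

```r
# Toy TSV with a stray double quote inside a field. The first record is a
# shortened copy of the row discussed in this thread; EX:2 is hypothetical.
tsv_text <- paste0(
  "id\tname\tcommonNames\n",
  "EOL:224784\tNeoniphon sammara\t\"Ala'ihi @hw | Sammara squirrelfish @en\n",
  "EX:2\tExample taxon\tplain name @en\n"
)

# quote = "" disables quote handling, so the double quote is kept verbatim
# and cannot swallow the tab and newline characters that follow it.
toy <- read.delim(text = tsv_text, quote = "", stringsAsFactors = FALSE)
print(nrow(toy))
print(toy$commonNames[1])
```

With quote handling switched off, both records are read and the double quote stays part of the commonNames value, which is the behaviour the IANA definition of TSV implies.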
I've prepared a pre-release of taxonCache with applied changes, please see https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Please let me know if this pre-release solves this issue. If not, or if you find new issue, please do share. |
Thanks, will do! Good point on the Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table? I'm playing a bit with parsing the pipe strings right now; I see their utility but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings. Will let you know if that surfaces any other parsing issues for me. |
I did some spot checks, and duplicates seem to exist. I removed the entries with path values that include the unexpected |
I agree that zipping (combining) path / pathIds / pathNames is not convenient. It seems that most biologists are comfortable with tabular formats, so I am trying to figure out ways to mold data into that shape to lower the barrier to edit / use / share without losing too much flexibility. I am open to suggestions and am in favor of exposing the same knowledge in different formats rather than taking a one-size-fits-all approach. |
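One way to make the relationship between rank, name, and id explicit is to expand the three pipe-delimited strings into one row per rank. The following base R sketch is only an illustration: the function names split_pipes and expand_path and the output column names rankName and rankId are hypothetical, and it assumes the three strings have the same number of elements.

```r
# Hypothetical helper: split a pipe-delimited string and trim whitespace.
split_pipes <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Hypothetical helper: turn one taxonCache record into a long table with
# one row per rank. Stops if the three strings are not aligned.
expand_path <- function(id, path, pathIds, pathNames) {
  taxon_names <- split_pipes(path)
  taxon_ids <- split_pipes(pathIds)
  ranks <- split_pipes(pathNames)
  stopifnot(length(taxon_names) == length(taxon_ids),
            length(taxon_names) == length(ranks))
  data.frame(id = id, rank = ranks, rankName = taxon_names,
             rankId = taxon_ids, stringsAsFactors = FALSE)
}

# Values taken from the EOL:224784 record quoted earlier in this thread
long <- expand_path(
  id = "EOL:224784",
  path = "Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara",
  pathIds = "EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784",
  pathNames = "kingdom | phylum | class | order | family | genus | species"
)
print(long)
```

The long form stays tabular, so it could sit next to the existing file as an alternative view rather than replace it.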
@jhpoelen I think I'm still seeing a whole bunch of entries with alignment issues?
library(tidyverse)
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache %>% filter(!grepl("(:|-|_)", id))
shows a bunch of rows that are getting parsed that appear to have no id and so still have everything misaligned. |
@cboettig confirmed. I've uploaded a second pass at the taxonCache.tsv.gz file, overwriting https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Thanks for sharing, please check and let me know if you see more issues. |
@jhpoelen I seem to be getting a 403 access denied error at that URL now(?) |
Thanks for letting me know . I've updated the access privileges and the file should be public now. Please try again - https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . |
@jhpoelen Thanks! Getting there! Looks like a possible data issue now: e.g. row
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache[243356,]$path
I see a total of 954 records where it looks to me that the number of pipes differs between
pattern <- "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>%
  map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>%
  map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
which( !(path_pipes == pathName_pipes))
|
Thanks again for your patience and feedback. I went through the entries with mismatching path / path names. I found that most of the issues were due to a historic bug that didn't include empty ranks when ingesting path names. I removed the entries, after spot checking that duplicate entries existed in the taxonCache with aligned path/ids/names. A single item, The remaining issues are terms related to non-taxa like environmental terms (e.g., wood) or functional groups (e.g., plankton). These do not have path/rank names. I've included the remaining issues below. I've uploaded an updated copy of taxonCache for your review at https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . This cleanup of taxonCache.tsv makes me re-realize the importance of data mobility, archiving, versioning, automated quality control, peer review and the effort this all takes...
|
@jhpoelen Found some more rows with alignment / missing-id issues: look for cases with whitespace in the id:
taxonCache %>% filter(grepl("\\s", id))
(Missed this one before because previously my pattern looked for identifiers with
Another possible issue I noticed in pathNames:
taxonCache %>% filter(grepl(":", pathNames))
This gets the above misaligned ones too, but looks like it is mostly getting pathNames given by identifiers, maybe mostly from Wikidata. I see why Wikidata does that so technically these aren't errors, but from a practical point of view it would be much better to have path names we can match to other path names. e.g. instead of |
Thanks!
I've prepared https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz for your review. If you are ok with this version, I'll prepare another zenodo publication. Otherwise, please detail your concerns. |
Sounds like a plan. Nice to have ALA taxon addressed. I'm still seeing 57 rows that don't have a
> taxonCache %>% filter(!grepl(":", id))
# A tibble: 57 x 9
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 4701dc84-660a-4c51-bd16-593997f2370b Coelo… spec… NA Fung… urn:ls… kingdom … NA NA
2 ALA_Cladia_muelleri Cladi… unkn… NA | Cl… | ALA_… | unknown NA NA
3 ALA_Delia_hirticrura Delia… unkn… NA | De… | ALA_… | unknown NA NA
4 ALA_Oxycetonia_jucunda Oxyce… unkn… NA | Ox… | ALA_… | unknown NA NA
5 NZOR-3-100527 Proci… genus NA | Pr… | NZOR… | genus NA NA
6 NZOR-3-109825 Marie… genus NA | Ma… | NZOR… | genus NA NA
7 NZOR-3-33834 Misce… unkn… NA | Mi… | NZOR… | unknown NA NA
8 NZOR-3-40069 Proka… unkn… NA | Pr… | NZOR… | unknown NA NA
9 NZOR-3-41136 Urtic… genus NA | Ur… | NZOR… | genus NA NA
10 NZOR-3-54695 Oreoc… genus NA | Or… | NZOR… | genus NA NA
# ... with 47 more rows
Maybe that is intentional? It isn't clear if these identifiers can be resolved; notably they have no externalUrl entry, though There's a larger set of things with no externalUrl, some of which seem to have prefixes that aren't defined in the prefix table (
> taxonCache %>% filter(is.na(externalUrl))
# A tibble: 2,770 x 9
id name rank commonNames path pathIds pathNames externalUrl thumbnailUrl
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 4701dc8… Coelom… spec… NA Fungi | Chy… urn:lsid:indexfungorum.… kingdom | p… NA NA
2 ALA_Cla… Cladia… unkn… NA | Cladia mu… | ALA_Cladia_muelleri | unknown NA NA
3 ALA_Del… Delia … unkn… NA | Delia hir… | ALA_Delia_hirticrura | unknown NA NA
4 ALA_Oxy… Oxycet… unkn… NA | Oxycetoni… | ALA_Oxycetonia_jucunda | unknown NA NA
5 CAAB:0c… Halica… spec… NA Halicarcinu… CAAB:0cd18290:475549ca:… species NA NA
6 CAAB:23… Taloch… spec… NA | Talochlam… | CAAB:23270067 | species NA NA
7 CAAB:28… Crab z… unkn… NA | Crab zoea | CAAB:28850902 | unknown NA NA
8 CAAB:53… Mastog… spec… NA Mastogloiac… CAAB:53210000 | CAAB:53… family | ge… NA NA
9 CAAB:80… Microa… unkn… microalgae… | Microalgae | CAAB:80200000 | unknown NA NA
10 CoL:254… Pseudo… spec… NA Pseudoparre… CoL:25759155 | CoL:2549… genus | spe… NA NA
# ... with 2,760 more rows
Again, I think this all just shows what an amazing resource this is to have all of this compiled in a nice file like Running a few experiments on the pipe paths but I think that all relates to next steps in #7 rather than possible issues in |
Looks like there might be a few cases where path, pathNames, and pathIds do not all have the same length (not counting cases where any one of these is NA). e.g. row with id = |
In case it's at all helpful, here's the crummy R code I'm using to identify the ~1000 rows that appear to have issues.
## Expect same number of pipes in each entry:
pattern <- "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
pathIds_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathIds, pattern)[[1]]))
na_path <- is.na(taxonCache$path)
na_pathNames <- is.na(taxonCache$pathNames)
na_pathIds <- is.na(taxonCache$pathIds)
## Compare pathIds against path, skipping rows where either is NA
trouble <- which( !(pathIds_pipes == path_pipes) & !na_path & !na_pathIds)
## Here's the ~1000 rows that appear mismatched to me
taxonCache[trouble,] |
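The same kind of check can be written without a row-by-row transpose by counting pipe characters on whole columns. This is a sketch, not a replacement for the code above: the function names count_pipes and misaligned_paths are hypothetical, it compares all three columns at once, it only flags rows where all three are present, and the three toy rows (including the id-like value X:1) are made up.

```r
# Hypothetical helper: number of pipe characters per element (NA stays NA).
count_pipes <- function(x) nchar(x) - nchar(gsub("|", "", x, fixed = TRUE))

# Hypothetical helper: indices of rows whose path, pathIds and pathNames
# are all present but do not contain the same number of pipes.
misaligned_paths <- function(df) {
  n_path <- count_pipes(df$path)
  n_ids <- count_pipes(df$pathIds)
  n_ranks <- count_pipes(df$pathNames)
  complete <- !is.na(n_path) & !is.na(n_ids) & !is.na(n_ranks)
  which(complete & (n_path != n_ids | n_path != n_ranks))
}

# Made-up rows: row 1 is aligned, row 2 lacks a rank name, row 3 has no path
toy <- data.frame(
  path = c("Animalia | Chordata", "Animalia | Chordata", NA),
  pathIds = c("EOL:1 | EOL:694", "EOL:1 | EOL:694", "X:1"),
  pathNames = c("kingdom | phylum", "kingdom", "species"),
  stringsAsFactors = FALSE
)
bad_rows <- misaligned_paths(toy)
print(bad_rows)
```

Counting delimiters rather than split pieces avoids edge cases around leading or trailing pipes, at the cost of not telling which element is missing.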
Very helpful indeed, thanks for being thorough. I am working on an input / output validation framework to more easily detect these inconsistencies: #8 . Curious to hear your thoughts on that. |
@cboettig just published http://doi.org/10.5281/zenodo.1250572 . In this version, consistency terms and links were checked using nomer's |
@jhpoelen Maybe I'm not understanding something here, but it seems there are ~ 500,000 rows in taxonCache involving duplicate ids? I think this should be reproducible R code:
library(tidyverse)
taxonCache <- read_tsv("https://zenodo.org/record/1250572/files/taxonCache.tsv.gz", quote="")
dup_id <-
  taxonCache %>% select(id) %>% group_by(id) %>%
  summarise(n_id = length(id)) %>% filter(n_id > 1)
trouble <- taxonCache %>% semi_join(select(dup_id, id))
# a data frame with the subset of taxonCache having duplicate ids
trouble
This prevents me from establishing a unique path / pathId / pathNames for an ID; it's not clear how to resolve the conflicts. I think this is related to (or the cause of) the issue I just added to #7 |
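To separate harmless exact duplicates from ids that really carry more than one path, a small base R sketch along these lines might help. Everything here is illustrative: the ids A:1, B:2 and C:3 and their paths are made up, and only the path column is compared.

```r
# Made-up rows: A:1 is an exact duplicate, B:2 has two different paths
toy <- data.frame(
  id = c("A:1", "A:1", "B:2", "B:2", "C:3"),
  path = c("x | y", "x | y", "x | y", "x | z", "q"),
  stringsAsFactors = FALSE
)

# ids that occur more than once, whatever their content
dup_ids <- unique(toy$id[duplicated(toy$id)])

# number of distinct path values per id; more than one means a conflict
n_variants <- tapply(toy$path, toy$id, function(p) length(unique(p)))
conflicting_ids <- names(n_variants)[n_variants > 1]

print(dup_ids)
print(conflicting_ids)
```

Exact duplicates could be dropped with unique(), which would leave only the conflicting ids to discuss.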
@cboettig thanks for sharing. See #7 (comment) . I think this warrants a further discussion. . . |
Also, please note #9 - would having the name source / retrieval date provide more information on which taxon id to select? Currently, GloBI itself uses a pretty blunt method - just use all that match to populate the taxon search index / graph. |
Here's an example of a taxon id with slight changes in name hierarchies as provided by the name source. Note that http://id.biodiversity.org.au/node/apni/50587232 and https://id.biodiversity.org.au/taxon/apni/51337710 are both outdated identifiers for Plantae. So, this is an example of multiple interpretations of taxon ids. I am leaving this issue open because it exposes some interesting effects associated with taxon ids. |
Hi @jhpoelen ,
I'm running into some issues parsing the taxonCache file in the Zenodo-archived data http://doi.org/10.5281/zenodo.1213465 (which looks super nice otherwise btw). For instance, the readr package in R shows a few parsing errors, mostly due to what might be extraneous quote characters: shows these errors
Those are pretty minor though, looks like only 3 rows are having issues. More troublesome is that somehow readr parsing of the file is getting some rows misaligned, e.g. if you then do: you get a whole sequence of rows where the path column has pathId values. A quick inspection of these rows shows they are all shifted over by one column, as they are all missing the first column (an id). (The same problem can be reproduced with the base R read.delim, which is much slower than the readr implementation.) Is there something that can be done so that those rows that don't have an id still begin with a proper delimiter, such that they get an NA for id instead of causing this misalignment?