Parsing issues with nomer tsv records? #5

cboettig · 2018-05-09T03:46:37Z

I'm running into some issues parsing the taxonCache file in the Zenodo-archived data at http://doi.org/10.5281/zenodo.1213465 (which looks super nice otherwise, btw).

For instance, the readr package in R shows a few parsing errors, mostly due to what might be extraneous quote characters:

taxonCache <- readr::read_tsv("https://zenodo.org/record/1213465/files/taxonCache.tsv.gz")
problems(taxonCache)

shows these errors

      row col         expected           actual     file                    
    <int> <chr>       <chr>              <chr>      <chr>                   
 1  98457 commonNames delimiter or quote A          'data/taxonCache.tsv.gz'
 2 119858 commonNames delimiter or quote m          'data/taxonCache.tsv.gz'
 3 119858 commonNames delimiter or quote " "        'data/taxonCache.tsv.gz'
 4 425504 path        delimiter or quote c          'data/taxonCache.tsv.gz'
 5 425504 path        delimiter or quote S          'data/taxonCache.tsv.gz'
 6 425504 path        delimiter or quote m          'data/taxonCache.tsv.gz'
 7 425504 path        delimiter or quote A          'data/taxonCache.tsv.gz'
 8 425504 path        delimiter or quote m          'data/taxonCache.tsv.gz'
 9 425504 path        delimiter or quote a          'data/taxonCache.tsv.gz'
10 425504 path        delimiter or quote A          'data/taxonCache.tsv.gz'
11 425504 path        delimiter or quote " "        'data/taxonCache.tsv.gz'
12 425504 NA          9 columns          10 columns 'data/taxonCache.tsv.gz'

Those are pretty minor though; it looks like only 3 rows are having issues. More troublesome is that readr's parsing of the file somehow gets some rows misaligned, e.g. if you then do:

library(dplyr)
taxonCache %>% filter(grepl(":", path))

you get a whole sequence of rows where the path column has pathIds values. A quick inspection of these rows shows they are all shifted over by one column, as they are all missing the first column (an id). (The same problem can be reproduced with base R's read.delim, which is much slower than the readr implementation.) Is there something that can be done so that rows without an id still begin with a proper delimiter, such that they get an NA for id instead of causing this misalignment?

jhpoelen · 2018-05-09T17:36:36Z

@cboettig thanks for sharing. Your comments highlight various separate issues. I'll attempt to address each of them separately in the following comments. I am planning to release a new GloBI Taxon Graph version v0.3.2 with corrections applied in this thread.

First, on line 98457 of taxonCache.tsv v0.3.1, I found the following (please note that the header was added for convenience):

id      name    rank    commonNames     path    pathIds pathNames       externalUrl     thumbnailUrl
EOL:224784      Neoniphon sammara       Species "Kolvin-soldaat @af | Deek @ar | 鐵甲 @cnm | Eichhörnchenfisch @de | Sammara squirrelfish @en | Candil samara @es | Corocoro @fj | Marignan tacheté @fr | \"Ala'ihi @hw | Ukeguchi-ittoudai @ja | 무늬얼게돔 @ko | Jerra @mh | Kolithaduva @ml | Kinolu @ms | Esquilo samara @pt | Malau-tui @sm | Baga-baga @tl | Araoe @ty | Cá Son dá dài @vi | 条纹长颏鳂 @zh | 莎姆新東洋金鱗魚 @zh-Hant |"        Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara     EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784       kingdom | phylum | class | order | family | genus | species     http://eol.org/pages/224784     http://media.eol.org/content/2009/05/19/16/85885_98_68.jpg

Note that the commonNames value is (incorrectly) enclosed by double quotes and contains an escaped double quote in \"Ala'ihi @hw.

On closer inspection, the commonNames value was enclosed by quotes when csv was still used to store taxonCache. This also explains the escaped double quote. Also, it appears that the Hawaiian name for Neoniphon sammara is not transcribed properly in EOL http://eol.org/pages/224784/names/common_names . Instead of "Ala'ihi, I suspect the name should be 'ala'ihi, replacing the double quotes with a single quote.

@jhammock any chance you can update the common name? From sources like http://www.wpcouncil.org/managed-fishery-ecosystems/hawaii-archipelago/regulations-and-enforcement-hawaii/ it appears that the common name is used to describe various different species, not just Neoniphon sammara.

To correct for this, the enclosing double quotes are removed and the escaped double quote has been replaced with the original string reported by EOL, including the double quote. Note that TSV does not need escaping of quotes (https://www.iana.org/assignments/media-types/text/tab-separated-values).
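The correction described above can be sketched in R. This is only an illustration under assumptions, not the code used to patch taxonCache: `strip_csv_quoting` is a hypothetical helper name, and the sample field is a shortened version of the commonNames value shown earlier.

```r
# Hypothetical helper: undo CSV-style quoting that leaked into a TSV field.
strip_csv_quoting <- function(x) {
  # only touch values that are fully enclosed in double quotes
  quoted <- !is.na(x) & grepl('^".*"$', x)
  # drop the enclosing double quotes
  x[quoted] <- substr(x[quoted], 2, nchar(x[quoted]) - 1)
  # turn escaped quotes (backslash + quote) back into literal quotes
  x[quoted] <- gsub('\\"', '"', x[quoted], fixed = TRUE)
  x
}

# Shortened sample of the commonNames value from line 98457
field <- '"Sammara squirrelfish @en | \\"Ala\'ihi @hw | Ukeguchi-ittoudai @ja |"'
cleaned <- strip_csv_quoting(field)
cat(cleaned, "\n")
```

Values that are not wrapped in double quotes pass through unchanged, so the helper could be applied to a whole column.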

jhpoelen · 2018-05-09T17:44:27Z

A second issue was reported on line 119858:

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
EOL:392765	Handroanthus chrysanthus	Species	"roble amarillo @en | \"makulis\" @es | เหลืองอินเดีย @th |"	Plantae | Tracheophyta | Magnoliopsida | Lamiales | Bignoniaceae | Handroanthus | Handroanthus chrysanthus	EOL:281 | EOL:4077 | EOL:283 | EOL:4300 | EOL:4421 | EOL:27931337 | EOL:392765	kingdom | phylum | class | order | family | genus | species	http://eol.org/pages/392765	http://media.eol.org/content/2015/02/26/03/48029_98_68.jpg

A similar pattern is observed here: CSV-style escaping/quoting was used because of the double quotes in the text.

@jhammock any idea why makulis, the Spanish common name on http://eol.org/pages/392765/names/common_names , is surrounded by double quotes?

To correct, the enclosing double quotes are removed, as well as the escape characters before the inner double quotes.

jhpoelen · 2018-05-09T17:51:49Z

A third issue was reported on line 425504:

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
INAT_TAXON:379688	candidatus phytoplasma	genus		"Bacteria | Firmicutes | Mollicutes | \"candidatus phytoplasma\""	INAT_TAXON:67333 | INAT_TAXON:151853 | INAT_TAXON:151986 | INAT_TAXON:379688	kingdom | phylum | class | genus	http://inaturalist.org/taxa/379688

Same double quoting issues here. Integration tests confirm that iNaturalist explicitly reports "candidatus phytoplasma" for the genus.
To correct, the enclosing double quotes are removed, as well as the escape characters.

jhpoelen · 2018-05-09T17:58:16Z

A fourth issue was found, where entries in taxonCache were found without a taxonId column. This was a transformation mistake, and entries with missing taxonId columns will be removed. Note that the entries without an id actually had valid counterparts in the taxonCache file.
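One way to spot such entries before they reach a parser is to compare each line's field count with the header's. The sketch below is an assumption-laden illustration: `find_misaligned` is a hypothetical helper and the three-column sample stands in for the real nine-column file.

```r
# Hypothetical helper: flag lines whose tab-separated field count
# differs from that of the header (first line).
find_misaligned <- function(lines) {
  # number of fields = number of tab characters + 1
  n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
  n_fields <- n_tabs + 1
  which(n_fields != n_fields[1])
}

# Toy sample: the third line lacks the id and its delimiter.
lines <- c(
  "id\tname\trank",
  "EOL:1\tAnimalia\tkingdom",
  "Animalia\tkingdom"
)
misaligned <- find_misaligned(lines)
print(misaligned)
```

For the real file, the lines could come from `readLines(gzfile("taxonCache.tsv.gz"))`, at the cost of holding the whole file in memory.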

jhpoelen · 2018-05-09T18:09:19Z

Also, please note that the first three issues are definitely data errors, but not TSV parsing errors. TSV, according to IANA (https://www.iana.org/assignments/media-types/text/tab-separated-values), does not have any string quoting. Please see tidyverse/readr#844.

If an empty quote parameter is used, no problems are encountered when reading taxonCache.tsv:

taxonCache <- readr::read_tsv('taxonCache.tsv', quote='')
Parsed with column specification:
cols(
  id = col_character(),
  name = col_character(),
  rank = col_character(),
  commonNames = col_character(),
  path = col_character(),
  pathIds = col_character(),
  pathNames = col_character(),
  externalUrl = col_character(),
  thumbnailUrl = col_character()
)
|=================================================================| 100%  904 MB
> library(readr)
> problems(taxonCache)
# tibble [0 × 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

@cboettig curious to hear your thoughts on all this.

jhpoelen · 2018-05-09T20:03:27Z

I've prepared a pre-release of taxonCache with the changes applied; please see https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Please let me know if this pre-release solves this issue. If not, or if you find new issues, please do share.

cboettig · 2018-05-09T20:13:11Z

Thanks, will do! Good point on the tsv by the way; makes total sense. The whole escaped quoting thing in csv files always bugged me, so tsv is a pretty clever solution I never properly appreciated (since it's harder to imagine needing a literal \t in a text file, but easy to see why you need a literal ,)

Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?

I'm playing a bit with parsing the pipe strings right now; I see their utility but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings. Will let you know if that surfaces any other parsing issues for me.

jhpoelen · 2018-05-09T20:25:06Z

> Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?

I did some spot checks, and duplicates seem to exist. I removed the entries with path values that include the unexpected : delimited values.

jhpoelen · 2018-05-09T20:32:18Z

> I see their utility but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings.

I agree that zipping (combining) path / pathIds / pathNames is not convenient. It seems that most biologists are comfortable with tabular formats, so I am trying to figure out ways to mold data into that shape to lower the barrier to edit / use / share without losing too much flexibility. Am open to suggestions and am in favor of exposing the same knowledge in different formats rather than taking a one-size-fits-all approach.
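As one possible alternative view, the three pipe strings of a record can be unzipped into one row per rank. The following is a minimal sketch under assumptions: `split_pipes`, `unzip_path` and the `nameId` column are hypothetical names, and the sample values come from the candidatus phytoplasma record quoted earlier in this thread.

```r
# Split one pipe-delimited string into trimmed parts.
split_pipes <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Hypothetical helper: one row per rank for a single taxonCache record.
# data.frame() raises an error if the three strings differ in length,
# which doubles as a basic consistency check.
unzip_path <- function(id, path, pathIds, pathNames) {
  data.frame(
    id = id,
    rank = split_pipes(pathNames),
    name = split_pipes(path),
    nameId = split_pipes(pathIds),
    stringsAsFactors = FALSE
  )
}

long <- unzip_path(
  id = "INAT_TAXON:379688",
  path = "Bacteria | Firmicutes | Mollicutes | candidatus phytoplasma",
  pathIds = "INAT_TAXON:67333 | INAT_TAXON:151853 | INAT_TAXON:151986 | INAT_TAXON:379688",
  pathNames = "kingdom | phylum | class | genus"
)
print(long)
```

The long form makes the rank / name / id relationship explicit, at the cost of repeating the record id on every row.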

cboettig · 2018-05-09T21:01:05Z

@jhpoelen I think I'm still seeing a whole bunch of entries with alignment issues?

library(tidyverse)
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")

taxonCache %>% filter(!grepl("(:|-|_)", id))

shows a bunch of rows that appear to have no id and so still have everything misaligned.

jhpoelen · 2018-05-09T23:06:52Z

@cboettig confirmed. I've uploaded a second pass at the taxonCache.tsv.gz file, overwriting https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Thanks for sharing; please check and let me know if you see more issues.

cboettig · 2018-05-10T03:35:41Z

@jhpoelen I seem to be getting a 403 access denied error at that URL now(?)

jhpoelen · 2018-05-10T14:15:40Z

Thanks for letting me know. I've updated the access privileges and the file should be public now. Please try again: https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .

cboettig · 2018-05-10T18:28:14Z

@jhpoelen Thanks! Getting there! Looks like a possible data issue now:

e.g. row 243356 has a single entry in the path pipe-string but two entries in the pathNames pipe string.

taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache[243356,]$path

[1] "Gnaphalium purpureum"
> taxonCache[243356,]$pathNames
[1] "kingdom | species"

I see a total of 954 records where it looks to me that the number of pipes differs between path and pathNames (though I guess some of these might be NA for one or the other, which I guess is okay, but some clearly aren't, like the example above).

pattern <-  "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% 
  map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% 
  map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))


which( !(path_pipes == pathName_pipes))

jhpoelen · 2018-05-10T22:23:18Z

Thanks again for your patience and feedback.

I went through the entries with mismatching path / path names. I found that most of the issues were due to a historic bug that didn't include empty ranks when ingesting path names. I removed the entries, after spot checking that duplicate entries existed in the taxonCache with aligned path/ids/names.

A single item, EOL:211953 Cetengraulis edentulus, appears to have a \t embedded in the common name Anchoveta raboamaril\t3. It appears that this common name was included in the taxonCache prior to the implementation of tab replacements on writing to tsv.
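A tab replacement of this kind can be sketched as follows. This is not the actual implementation: `sanitize_tsv_value` is a hypothetical helper, and replacing with a single space is an assumption, since the thread does not say which replacement character is used.

```r
# Hypothetical helper: replace tabs and line breaks inside a value so that
# the value cannot split a TSV row into extra fields or extra lines.
# Assumption: a single space is an acceptable replacement.
sanitize_tsv_value <- function(x) gsub("[\t\r\n]+", " ", x)

clean_name <- sanitize_tsv_value("Anchoveta raboamaril\t3")
print(clean_name)
```

Applying the helper to every character column just before writing keeps the file parseable with quote = "".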

The remaining issues are terms related to non-taxa like environmental terms (e.g., wood) or functional groups (e.g., plankton). These do not have path/rank names. I've included the remaining issues below.

I've uploaded an updated copy of taxonCache for your review at https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .

This cleanup of taxonCache.tsv makes me re-realize the importance of data mobility, archiving, versioning, automated quality control, peer review and the effort this all takes...

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
ENVO:00000339	Stones	NA	NA	environmental feature | mesoscopic physical object | abiotic mesoscopic physical object | piece of rock	ENVO:00002297 | ENVO:00002004 | ENVO:01000010 | ENVO:00000339	NA	http://purl.obolibrary.org/obo/ENVO_00000339	NA
ENVO:00001998	soil	NA	NA	environmental material | soil	ENVO:00010483 | ENVO:00001998	NA	http://purl.obolibrary.org/obo/ENVO_00001998	NA
ENVO:00002003	bovine or equine dung	NA	NA	environmental material | organic material | bodily fluid | excreta | feces	ENVO:00010483 | ENVO:01000155 | ENVO:02000019 | ENVO:02000022 | ENVO:00002003	NA	http://purl.obolibrary.org/obo/ENVO_00002003	NA
ENVO:00002007	Sediment	NA	NA	environmental material | sediment	ENVO:00010483 | ENVO:00002007	NA	http://purl.obolibrary.org/obo/ENVO_00002007	NA
ENVO:00002040	Wood	NA	NA	environmental material | organic material | wood	ENVO:00010483 | ENVO:01000155 | ENVO:00002040	NA	http://purl.obolibrary.org/obo/ENVO_00002040	NA
ENVO:01000155	Detritus	NA	NA	environmental material | organic material	ENVO:00010483 | ENVO:01000155	NA	http://purl.obolibrary.org/obo/ENVO_01000155	NA
ENVO:01000404	plastic	NA	NA	environmental material | anthropogenic environmental material	ENVO:00010483 | ENVO:0010001	NA	http://purl.obolibrary.org/obo/ENVO_01000404	NA
EOL:19662459	Zooplankton	NA	NA	plankton | zooplankton	NA	NA	http://eol.org/pages/19662459	NA
EOL:19662463	Phytoplankton	NA	NA	plankton | phytoplankton	NA	NA	http://eol.org/pages/19662463	NA
W:Bacterioplankton	bacterioplankton	NA	NA	plankton | bacterioplankton	NA	NA	http://wikipedia.org/wiki/Bacterioplankton	NA
W:Macroalgae	Macroalgae	NA	NA	algae | macroalgae	NA	NA	http://wikipedia.org/wiki/Macroalgae	NA

cboettig · 2018-05-11T16:31:50Z

@jhpoelen Found some more rows with alignment / missing-id issues.

Look for cases with whitespace in the id:

taxonCache %>% filter(grepl("\\s", id))

(Missed this one before because previously my pattern looked for identifiers with "(:|-|_)", and some species names have these in them). I think it would actually be preferable if ids were all URIs -- would that be possible? e.g. there's what looks like some UUID strings in there but they don't have the urn:uuid: prefix, and some that seem to use _ as a prefix?

Another possible issue I noticed in pathNames:

taxonCache %>% filter(grepl(":", pathNames))

This gets the above misaligned ones too, but it looks like it is mostly getting pathNames given by identifiers, maybe mostly from Wikidata. I see why Wikidata does that, so technically these aren't errors, but from a practical point of view it would be much better to have path names we can match to other path names, e.g. instead of WD:Q35409 | ... just have family | ... (as in https://www.wikidata.org/wiki/Q35409). Or maybe that's an issue for a separate thread since it's not really about a parsing problem?

jhpoelen · 2018-05-12T01:15:18Z

Thanks!

> taxonCache %>% filter(grepl("\\s", id))

Nice! This removed the 41 remaining entries with misaligned columns. The accompanying entries with ids were also present in the taxonCache.

> I think it would actually be preferable if ids were all URIs -- would that be possible?

That would be possible, and can already be done using a prefix mapping like https://api.globalbioticinteractions.org/prefixes . You might have noticed that externalUrl expands the id to a resolvable id when possible.
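A prefix-based expansion along these lines might look like the sketch below. The prefix table here is an assumption inferred only from id / externalUrl pairs that appear in this thread; it is not the content of the prefixes endpoint, and `expand_id` is a hypothetical helper.

```r
# Assumed prefix table, inferred from id / externalUrl pairs in this thread.
prefixes <- c(
  "EOL:" = "http://eol.org/pages/",
  "INAT_TAXON:" = "http://inaturalist.org/taxa/",
  "ENVO:" = "http://purl.obolibrary.org/obo/ENVO_"
)

# Hypothetical helper: expand a single prefixed id to a URL,
# or return NA when the id is missing or its prefix is unknown.
expand_id <- function(id) {
  if (is.na(id)) return(NA_character_)
  for (p in names(prefixes)) {
    if (startsWith(id, p)) {
      return(paste0(prefixes[[p]], substring(id, nchar(p) + 1)))
    }
  }
  NA_character_
}

urls <- vapply(
  c("EOL:224784", "ENVO:00002040", "NZOR-3-100527"),
  expand_id, character(1), USE.NAMES = FALSE
)
print(urls)
```

Ids without a known prefix stay unresolved (NA) rather than being guessed.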

> e.g. there's what looks like some UUID strings in there but they don't have the urn:uuid: prefix, and some that seem to use _ as a prefix?

Good point. Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.

> taxonCache %>% filter(grepl(":", pathNames))

This additional validator only selected the wikidata path names. As you noticed, abbreviated wikidata identifiers were used to capture the rank information. This was done for pragmatic reasons. It should be relatively easy to map the rank name ids to associated labels. In the future, we might want to introduce a normalized term rank by introducing rankName and rankId, in addition to pathNames and pathNameIds. Related to #7 .
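Such a mapping could be sketched with a small lookup table. Only WD:Q35409 = family is taken from this thread; the other two entries are assumptions that should be checked against Wikidata, and `label_ranks` is a hypothetical helper.

```r
# Lookup from abbreviated Wikidata ids to rank labels.
# WD:Q35409 = family comes from this thread; the other two entries are
# assumptions to be verified against Wikidata before use.
rank_labels <- c(
  "WD:Q35409" = "family",
  "WD:Q34740" = "genus",
  "WD:Q7432" = "species"
)

# Hypothetical helper: replace known rank ids in a pathNames string by
# their labels, leaving unknown entries untouched.
label_ranks <- function(pathNames) {
  parts <- trimws(strsplit(pathNames, "|", fixed = TRUE)[[1]])
  known <- parts %in% names(rank_labels)
  parts[known] <- unname(rank_labels[parts[known]])
  paste(parts, collapse = " | ")
}

labelled <- label_ranks("WD:Q35409 | WD:Q34740 | WD:Q7432")
print(labelled)
```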

I've prepared https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz for your review. If you are ok with this version, I'll prepare another zenodo publication. Otherwise, please detail your concerns.

cboettig · 2018-05-12T16:47:46Z

> Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.

Sounds like a plan. Nice to have ALA taxon addressed. I'm still seeing 57 rows that don't have a : in the id, e.g.

> taxonCache %>% filter(!grepl(":", id))
# A tibble: 57 x 9
   id                                   name   rank  commonNames path  pathIds pathNames externalUrl thumbnailUrl
   <chr>                                <chr>  <chr> <chr>       <chr> <chr>   <chr>     <chr>       <chr>       
 1 4701dc84-660a-4c51-bd16-593997f2370b Coelo… spec… NA          Fung… urn:ls… kingdom … NA          NA          
 2 ALA_Cladia_muelleri                  Cladi… unkn… NA          | Cl… | ALA_… | unknown NA          NA          
 3 ALA_Delia_hirticrura                 Delia… unkn… NA          | De… | ALA_… | unknown NA          NA          
 4 ALA_Oxycetonia_jucunda               Oxyce… unkn… NA          | Ox… | ALA_… | unknown NA          NA          
 5 NZOR-3-100527                        Proci… genus NA          | Pr… | NZOR… | genus   NA          NA          
 6 NZOR-3-109825                        Marie… genus NA          | Ma… | NZOR… | genus   NA          NA          
 7 NZOR-3-33834                         Misce… unkn… NA          | Mi… | NZOR… | unknown NA          NA          
 8 NZOR-3-40069                         Proka… unkn… NA          | Pr… | NZOR… | unknown NA          NA          
 9 NZOR-3-41136                         Urtic… genus NA          | Ur… | NZOR… | genus   NA          NA          
10 NZOR-3-54695                         Oreoc… genus NA          | Or… | NZOR… | genus   NA          NA          
# ... with 47 more rows

Maybe that is intentional? Isn't clear if these identifiers can be resolved, notably they have no externalUrl entry, though ALA and NZOR look like they want to be prefixes to something(?)

There's a larger set of things with no externalUrl, some which seem to have prefixes that aren't defined in the prefix table (CoL, CAAB, ...), e.g.:

> taxonCache %>% filter(is.na(externalUrl))
# A tibble: 2,770 x 9
   id       name    rank  commonNames path         pathIds                  pathNames    externalUrl thumbnailUrl
   <chr>    <chr>   <chr> <chr>       <chr>        <chr>                    <chr>        <chr>       <chr>       
 1 4701dc8… Coelom… spec… NA          Fungi | Chy… urn:lsid:indexfungorum.… kingdom | p… NA          NA          
 2 ALA_Cla… Cladia… unkn… NA          | Cladia mu… | ALA_Cladia_muelleri    | unknown    NA          NA          
 3 ALA_Del… Delia … unkn… NA          | Delia hir… | ALA_Delia_hirticrura   | unknown    NA          NA          
 4 ALA_Oxy… Oxycet… unkn… NA          | Oxycetoni… | ALA_Oxycetonia_jucunda | unknown    NA          NA          
 5 CAAB:0c… Halica… spec… NA          Halicarcinu… CAAB:0cd18290:475549ca:… species      NA          NA          
 6 CAAB:23… Taloch… spec… NA          | Talochlam… | CAAB:23270067          | species    NA          NA          
 7 CAAB:28… Crab z… unkn… NA          | Crab zoea  | CAAB:28850902          | unknown    NA          NA          
 8 CAAB:53… Mastog… spec… NA          Mastogloiac… CAAB:53210000 | CAAB:53… family | ge… NA          NA          
 9 CAAB:80… Microa… unkn… microalgae… | Microalgae | CAAB:80200000          | unknown    NA          NA          
10 CoL:254… Pseudo… spec… NA          Pseudoparre… CoL:25759155 | CoL:2549… genus | spe… NA          NA          
# ... with 2,760 more rows

Again, I think this all just shows what an amazing resource this is to have all of this compiled in a nice file like taxonCache.tsv.gz, as synthesizing all these resources in a single table like that is far from trivial!

Running a few experiments on the pipe paths but I think that all relates to next steps in #7 rather than possible issues in taxonCache. Lemme know what you think about the above concerns with some of the ids, but otherwise this is looking ready for release to me.

cboettig · 2018-05-14T04:22:54Z

Looks like there might be a few cases where path, pathNames, and pathIds do not all have the same length (not counting cases where any one of these is NA), e.g. the row with id = ITIS:10824. Could be indicative of an issue?

cboettig · 2018-05-15T00:34:48Z

in case it's at all helpful, here's the crummy R code I'm using to identify the ~1000 rows that appear to have issues.

## Expect same number of pipes in each entry:
pattern = "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
pathIds_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathIds, pattern)[[1]]))
na_path <- is.na(taxonCache$path)
na_pathNames <- is.na(taxonCache$pathNames)
na_pathIds  <- is.na(taxonCache$pathIds)

trouble <- which( !(pathIds_pipes == path_pipes) & !na_path & !na_pathIds)

## Here are the ~1000 rows that appear mismatched to me
taxonCache[trouble,]

jhpoelen · 2018-05-15T01:07:23Z

Very helpful indeed, thanks for being thorough. I am working on an input / output validation framework to more easily detect these inconsistencies (#8). Curious to hear your thoughts on that.

jhpoelen · 2018-05-22T03:27:55Z

@cboettig just published http://doi.org/10.5281/zenodo.1250572 . In this version, the consistency of terms and links was checked using nomer's validate-term and validate-term-link. Also, various fixes were included to help make the ids and their hierarchies a bit more well-behaved.

cboettig · 2018-06-09T18:19:42Z

@jhpoelen Maybe I'm not understanding something here, but it seems there's ~ 500,000 rows in taxonCache involving duplicate ids?

I think this should be reproducible R code:

library(tidyverse)
taxonCache <- read_tsv("https://zenodo.org/record/1250572/files/taxonCache.tsv.gz", quote="")

dup_id <- 
  taxonCache %>% select(id) %>% group_by(id) %>% 
  summarise(n_id = length(id)) %>% filter(n_id > 1) 

trouble <- taxonCache %>% semi_join(select(dup_id, id))

# a data frame with the subset of taxonCache having duplicate ids
trouble

This prevents me from establishing a unique path / pathIds / pathNames for an id; it's not clear how to resolve the conflicts. I think this is related to (or the cause of) the issue I just added to #7.
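To separate harmless exact duplicates from ids that carry conflicting hierarchies, something like the following sketch may help. The toy data frame and its ids are invented for illustration and merely stand in for taxonCache.

```r
# Toy stand-in for taxonCache; the ids and paths are invented.
tc <- data.frame(
  id = c("A:1", "A:1", "B:2", "B:2", "C:3"),
  pathIds = c("A:0 | A:1", "A:0 | A:1", "B:0 | B:2", "B:9 | B:2", "C:3"),
  stringsAsFactors = FALSE
)

# Number of distinct pathIds values seen for each id.
n_variants <- tapply(tc$pathIds, tc$id, function(p) length(unique(p)))

# Ids with more than one distinct hierarchy need a resolution rule;
# ids with repeated identical rows only need de-duplication.
conflicting_ids <- names(n_variants)[n_variants > 1]

# Dropping exact duplicate rows leaves only the conflicting ids duplicated.
deduped <- unique(tc)
print(conflicting_ids)
print(nrow(deduped))
```

On the real table the same idea applies to the combination of path, pathIds and pathNames rather than pathIds alone.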

jhpoelen · 2018-06-09T19:37:19Z

@cboettig thanks for sharing. See #7 (comment) . I think this warrants a further discussion. . .

jhpoelen · 2018-06-18T23:15:57Z

Also, please note #9 - would having the name source / retrieval date provide more information on which taxon id to select?

Currently, GloBI itself uses a pretty blunt method - just use all that match to populate taxon search index/ graph.

jhpoelen · 2020-03-12T01:10:04Z

Here's an example of a taxon id with slight changes in name hierarchies as provided by the name source. Note that http://id.biodiversity.org.au/node/apni/50587232 and https://id.biodiversity.org.au/taxon/apni/51337710 are both outdated identifiers for Plantae. So, this is an example of multiple interpretations of taxon ids.

Am leaving this issue open because it exposes some interesting effects associated with taxon ids.

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
ALATaxon:NZOR-6-102447	Eurya	genus		Plantae | Charophyta | Equisetopsida | Magnoliidae | Ericales | Pentaphylacaceae | Eurya	ALATaxon:http://id.biodiversity.org.au/node/apni/50587232 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587231 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587230 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587229 | ALATaxon:http://id.biodiversity.org.au/node/apni/8790835 | ALATaxon:http://id.biodiversity.org.au/node/apni/8305023 | ALATaxon:NZOR-6-102447	kingdom | phylum | class | subclass | order | family | genus	https://bie.ala.org.au/species/NZOR-6-102447
ALATaxon:NZOR-6-102447	Eurya	genus		Plantae | Charophyta | Equisetopsida | Magnoliidae | Ericales | Pentaphylacaceae | Eurya	ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337710 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337706 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337705 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337515 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51311074 | ALATaxon:https://id.biodiversity.org.au/node/apni/8305023 | ALATaxon:NZOR-6-102447	kingdom | phylum | class | subclass | order | family | genus	https://bie.ala.org.au/species/NZOR-6-102447

jhpoelen pushed a commit that referenced this issue May 10, 2018

reproduce quoted candidatus from iNaturalist. related to #5

d083bad

jhpoelen pushed a commit to globalbioticinteractions/globalbioticinteractions that referenced this issue May 10, 2018

reproduce suspicious double quotes in taxonCache.tsv. related to glob…

4a3913b

…albioticinteractions/nomer#5

jhpoelen mentioned this issue May 11, 2018

missing (legacy) AFD uris AtlasOfLivingAustralia/bie-index#187

Closed

jhpoelen mentioned this issue May 15, 2018

taxon map / cache validation #8

Closed

jhpoelen mentioned this issue Mar 12, 2020

Alternative table layout / views from taxonCache #7

Open

jhpoelen added the discussion issue that contains design discussion label Mar 12, 2020
