Parsing issues with nomer tsv records? #5

Open
cboettig opened this issue May 9, 2018 · 26 comments
Labels
discussion issue that contains design discussion

Comments

@cboettig

cboettig commented May 9, 2018

Hi @jhpoelen ,

I'm running into some issues parsing the taxonCache file in the Zenodo-archived data http://doi.org/10.5281/zenodo.1213465, (which looks super nice otherwise btw).

For instance, the readr package in R shows a few parsing errors, mostly due to what might be extraneous quote characters:

taxonCache <- readr::read_tsv("https://zenodo.org/record/1213465/files/taxonCache.tsv.gz")
readr::problems(taxonCache)

shows these errors

      row col         expected           actual     file                    
    <int> <chr>       <chr>              <chr>      <chr>                   
 1  98457 commonNames delimiter or quote A          'data/taxonCache.tsv.gz'
 2 119858 commonNames delimiter or quote m          'data/taxonCache.tsv.gz'
 3 119858 commonNames delimiter or quote " "        'data/taxonCache.tsv.gz'
 4 425504 path        delimiter or quote c          'data/taxonCache.tsv.gz'
 5 425504 path        delimiter or quote S          'data/taxonCache.tsv.gz'
 6 425504 path        delimiter or quote m          'data/taxonCache.tsv.gz'
 7 425504 path        delimiter or quote A          'data/taxonCache.tsv.gz'
 8 425504 path        delimiter or quote m          'data/taxonCache.tsv.gz'
 9 425504 path        delimiter or quote a          'data/taxonCache.tsv.gz'
10 425504 path        delimiter or quote A          'data/taxonCache.tsv.gz'
11 425504 path        delimiter or quote " "        'data/taxonCache.tsv.gz'
12 425504 NA          9 columns          10 columns 'data/taxonCache.tsv.gz'

Those are pretty minor though; it looks like only 3 rows are having issues. More troublesome is that readr's parsing of the file somehow leaves some rows misaligned, e.g. if you then do:

library(dplyr)
taxonCache %>% filter(grepl(":", path))

you get a whole sequence of rows where the path column has pathIds values. A quick inspection of these rows shows they are all shifted over by one column, as they are all missing the first column (an id). (The same problem can be reproduced with base R's read.delim, which is much slower than the readr implementation.) Is there something that can be done so that rows without an id still begin with a proper delimiter, such that they get an NA for id instead of causing this misalignment?
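A rough way to spot such rows before parsing is to count fields on the raw lines: a record that lost its id (and the delimiter after it) has one field fewer than the header. The sketch below uses a made-up four-line TSV, not the real taxonCache, and counts tabs directly so that trailing empty fields are still counted.

```r
# Toy TSV lines (illustrative only). The third record lacks its leading id
# field and the tab that should follow it.
lines <- c(
  "id\tname\tpath\tthumbnailUrl",
  "EOL:1\tAnimalia\tAnimalia\t",
  "Chordata\tAnimalia | Chordata\t",
  "EOL:694\tChordata\tAnimalia | Chordata\t"
)

# Number of fields = number of tab characters + 1 (TSV has no quoting).
tab_hits <- gregexpr("\t", lines, fixed = TRUE)
n_fields <- lengths(regmatches(lines, tab_hits)) + 1L

# Line numbers (1 = header) whose field count differs from the header's.
expected <- n_fields[1]
short_lines <- which(n_fields != expected)
short_lines
```

For the real file, the same count could presumably be applied to the output of readLines(gzfile("taxonCache.tsv.gz")), memory permitting.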

@jhpoelen
Member

jhpoelen commented May 9, 2018

@cboettig thanks for sharing. Your comments highlight various separate issues. I'll attempt to address each of them separately in the following comments. I am planning to release a new GloBI Taxon Graph version v0.3.2 with corrections applied in this thread.

First, in line 98457 in taxonCache.tsv v0.3.1, I found (please note that header was added for convenience)

id      name    rank    commonNames     path    pathIds pathNames       externalUrl     thumbnailUrl
EOL:224784      Neoniphon sammara       Species "Kolvin-soldaat @af | Deek @ar | 鐵甲 @cnm | Eichhörnchenfisch @de | Sammara squirrelfish @en | Candil samara @es | Corocoro @fj | Marignan tacheté @fr | \"Ala'ihi @hw | Ukeguchi-ittoudai @ja | 무늬얼게돔 @ko | Jerra @mh | Kolithaduva @ml | Kinolu @ms | Esquilo samara @pt | Malau-tui @sm | Baga-baga @tl | Araoe @ty | Cá Son dá dài @vi | 条纹长颏鳂 @zh | 莎姆新東洋金鱗魚 @zh-Hant |"        Animalia | Chordata | Actinopterygii | Beryciformes | Holocentridae | Neoniphon | Neoniphon sammara     EOL:1 | EOL:694 | EOL:1905 | EOL:8234 | EOL:8237 | EOL:24504 | EOL:224784       kingdom | phylum | class | order | family | genus | species     http://eol.org/pages/224784     http://media.eol.org/content/2009/05/19/16/85885_98_68.jpg

Note that the commonNames value is (incorrectly) enclosed in double quotes and contains an escaped double quote in \"Ala'ihi @hw.

On closer inspection, the commonNames value was enclosed by quotes when csv was still used to store taxonCache. This also explains the escaped double quote. Also, it appears that the Hawaiian name for Neoniphon sammara is not transcribed properly in EOL http://eol.org/pages/224784/names/common_names . Instead of "Ala'ihi, I suspect the name should be 'ala'ihi, replacing the double quotes with a single quote.

@jhammock any chance you can update the common name? From sources like http://www.wpcouncil.org/managed-fishery-ecosystems/hawaii-archipelago/regulations-and-enforcement-hawaii/ it appears that the common name is used to describe various different species, not just Neoniphon sammara .

To correct for this, the enclosing double quotes are removed and the escaped double quote has been replaced with the original string reported by EOL, including the double quotes. Note that TSV does not need escaping of quotes (https://www.iana.org/assignments/media-types/text/tab-separated-values) .
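The correction described here can be sketched in base R. The helper name strip_csv_quoting is hypothetical (it is not part of nomer or GloBI), and it assumes the leftover CSV-style quoting always takes the form of enclosing double quotes plus backslash-escaped inner quotes, as in the records quoted in this thread.

```r
# Hypothetical helper: undo CSV-style quoting left over in a TSV field.
strip_csv_quoting <- function(x) {
  # Only touch values that are wrapped in a pair of double quotes.
  quoted <- grepl('^".*"$', x)
  # Drop the enclosing double quotes ...
  x[quoted] <- substr(x[quoted], 2, nchar(x[quoted]) - 1)
  # ... then turn each escaped \" back into a literal ".
  x[quoted] <- gsub('\\"', '"', x[quoted], fixed = TRUE)
  x
}

fields <- c(
  '"roble amarillo @en | \\"makulis\\" @es |"',  # CSV-style quoted value
  "Sammara squirrelfish @en |"                    # already plain TSV
)
cleaned <- strip_csv_quoting(fields)
cleaned
```

Whether the inner quotes should be kept (as here) or dropped entirely is a data decision; this sketch keeps them because TSV does not treat them specially.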

@jhpoelen
Member

jhpoelen commented May 9, 2018

A second issue was reported on line 119858:

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
EOL:392765	Handroanthus chrysanthus	Species	"roble amarillo @en | \"makulis\" @es | เหลืองอินเดีย @th |"	Plantae | Tracheophyta | Magnoliopsida | Lamiales | Bignoniaceae | Handroanthus | Handroanthus chrysanthus	EOL:281 | EOL:4077 | EOL:283 | EOL:4300 | EOL:4421 | EOL:27931337 | EOL:392765	kingdom | phylum | class | order | family | genus | species	http://eol.org/pages/392765	http://media.eol.org/content/2015/02/26/03/48029_98_68.jpg

Similar pattern is observed here: csv-style escaping/quoting used because of the usage of double quotes in the text.

@jhammock any idea why makulis for spanish common name on http://eol.org/pages/392765/names/common_names is surrounded by double quotes?

To correct, the enclosing double quotes are removed, as well as the escaped double quotes.

@jhpoelen
Member

jhpoelen commented May 9, 2018

A third issue was reported on line 425504:

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
INAT_TAXON:379688	candidatus phytoplasma	genus		"Bacteria | Firmicutes | Mollicutes | \"candidatus phytoplasma\""	INAT_TAXON:67333 | INAT_TAXON:151853 | INAT_TAXON:151986 | INAT_TAXON:379688	kingdom | phylum | class | genus	http://inaturalist.org/taxa/379688	

Same double-quoting issue here. Integration tests confirm that iNaturalist explicitly reports "candidatus phytoplasma" for the genus.
To correct, the enclosing double quotes are removed, as well as the escape characters.

@jhpoelen
Member

jhpoelen commented May 9, 2018

A fourth issue was found: some entries in taxonCache lacked a taxonId column. This was a transformation mistake, and entries with missing taxonId columns will be removed. Note that the entries without an id actually had valid counterparts in the taxonCache file.

@jhpoelen
Member

jhpoelen commented May 9, 2018

Also, please note that the first three issues are definitely data errors, but not tsv parsing errors. TSV, according to IANA https://www.iana.org/assignments/media-types/text/tab-separated-values , does not have any string quoting . Please see tidyverse/readr#844 .

If empty quote parameter is used, no problems are encountered when reading the taxonCache.tsv :

taxonCache <- readr::read_tsv('taxonCache.tsv', quote='')
Parsed with column specification:
cols(
  id = col_character(),
  name = col_character(),
  rank = col_character(),
  commonNames = col_character(),
  path = col_character(),
  pathIds = col_character(),
  pathNames = col_character(),
  externalUrl = col_character(),
  thumbnailUrl = col_character()
)
|=================================================================| 100%  904 MB
> library(readr)
> problems(taxonCache)
# tibble [0 × 4]
# ... with 4 variables: row <int>, col <int>, expected <chr>, actual <chr>

@cboettig curious to hear your thoughts on all this.
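The same idea applies to base R's read.delim, whose quote argument defaults to the double quote character. A minimal sketch with a toy two-record table (values borrowed from the records quoted in this thread, trimmed for brevity) shows that with quote = "" a stray leading double quote is kept as ordinary data:

```r
# Toy TSV text: the first record's commonNames value starts with a literal
# double quote that is data, not a quoting character.
tsv <- paste(
  "id\tname\tcommonNames",
  "EOL:224784\tNeoniphon sammara\t\"Ala'ihi @hw",
  "EOL:392765\tHandroanthus chrysanthus\troble amarillo @en",
  sep = "\n"
)

# quote = "" disables quote handling, in line with the IANA description of TSV.
taxa <- read.delim(text = tsv, quote = "", stringsAsFactors = FALSE)

nrow(taxa)
taxa$commonNames[1]
```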

@jhpoelen
Member

jhpoelen commented May 9, 2018

I've prepared a pre-release of taxonCache with applied changes, please see https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Please let me know if this pre-release solves this issue. If not, or if you find new issues, please do share.

@cboettig
Author

cboettig commented May 9, 2018

Thanks, will do! Good point on the tsv by the way; makes total sense. The whole escaped quoting thing in csv files always bugged me, so tsv is a pretty clever solution I never properly appreciated (since it's harder to imagine needing a literal \t in a text file, but easy to see why you need a literal ,)

Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?

I'm playing a bit with parsing the pipe strings right now; I see their utility, but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings. Will let you know if that surfaces any other parsing issues for me.

@jhpoelen
Member

jhpoelen commented May 9, 2018

Quick clarification on the entities that didn't have valid ids and were thus creating the alignment problems: so those rows were duplicates of rows already elsewhere in taxonCache? Are those rows now dropped from the table?

I did some spot checks, and duplicates seem to exist. I removed the entries with path values that include the unexpected : delimited values.

@jhpoelen
Member

jhpoelen commented May 9, 2018

I see their utility, but I think it would often be convenient to have a more explicit relationship between rank, value, and id in those strings.

I agree that zipping (combining) path / pathIds / pathNames is not convenient. It seems that most biologists are comfortable with tabular formats, so I am trying to figure out ways to mold data into that shape to lower the barrier to edit / use / share without losing too much flexibility. Am open to suggestions and am in favor of exposing the same knowledge in different formats rather than taking a one-size-fits-all approach.
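One way to expose the same knowledge in a more explicit shape is to "unzip" the three pipe strings of a record into one row per ancestor. The function below is only a sketch under the assumption that the three strings have the same number of entries; unzip_path and its output column names are made up for illustration and are not part of nomer.

```r
# Split a pipe-delimited string into trimmed entries.
split_pipes <- function(x) trimws(strsplit(x, "|", fixed = TRUE)[[1]])

# Hypothetical helper: one row per (rank, name, id) triple of a single record.
unzip_path <- function(id, path, pathIds, pathNames) {
  taxon_names <- split_pipes(path)
  taxon_ids <- split_pipes(pathIds)
  ranks <- split_pipes(pathNames)
  # Refuse to zip strings of unequal length rather than recycle silently.
  stopifnot(length(taxon_names) == length(taxon_ids),
            length(taxon_ids) == length(ranks))
  data.frame(id = id, rank = ranks, name = taxon_names, taxonId = taxon_ids,
             stringsAsFactors = FALSE)
}

# Values taken from the INAT_TAXON:379688 record quoted earlier in the thread.
lineage <- unzip_path(
  id = "INAT_TAXON:379688",
  path = "Bacteria | Firmicutes | Mollicutes | candidatus phytoplasma",
  pathIds = "INAT_TAXON:67333 | INAT_TAXON:151853 | INAT_TAXON:151986 | INAT_TAXON:379688",
  pathNames = "kingdom | phylum | class | genus"
)
lineage
```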

@cboettig
Author

cboettig commented May 9, 2018

@jhpoelen I think I'm still seeing a whole bunch of entries with alignment issues?

library(tidyverse)
taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")

taxonCache %>% filter(!grepl("(:|-|_)", id)) 

shows a bunch of rows that are getting parsed but appear to have no id, and so still have everything misaligned.

@jhpoelen
Member

jhpoelen commented May 9, 2018

@cboettig confirmed. I've uploaded a second pass at the taxonCache.tsv.gz file, overwriting https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz . Thanks for sharing, please check and let me know if you see more issues.

@cboettig
Author

@jhpoelen I seem to be getting a 403 access denied error at that URL now(?)

@jhpoelen
Member

Thanks for letting me know . I've updated the access privileges and the file should be public now. Please try again - https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .

jhpoelen pushed a commit to globalbioticinteractions/globalbioticinteractions that referenced this issue May 10, 2018
@cboettig
Author

@jhpoelen Thanks! Getting there! Looks like a possible data issue now:

e.g. row 243356 has a single entry in the path pipe-string but two entries in the pathNames pipe string.

taxonCache <- read_tsv("https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz", quote="")
taxonCache[243356,]$path
[1] "Gnaphalium purpureum"
> taxonCache[243356,]$pathNames
[1] "kingdom | species"

I see a total of 954 records where it looks to me that the number of pipes differs between path and pathNames (though I guess some of these might be NA for one or the other, which I guess is okay, but some clearly aren't, like the example above).

pattern <-  "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% 
  map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% 
  map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))


which( !(path_pipes == pathName_pipes))
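The same comparison can be sketched without purrr by relying on the fact that strsplit is vectorised. The toy data frame below is illustrative (its second row mirrors the Gnaphalium purpureum example above), and count_entries is a hypothetical helper name.

```r
# Count pipe-delimited entries per value; NA stays NA instead of counting as 1.
count_entries <- function(x) {
  n <- lengths(strsplit(x, "|", fixed = TRUE))
  n[is.na(x)] <- NA_integer_
  n
}

taxa <- data.frame(
  path = c("Animalia | Chordata", "Gnaphalium purpureum", NA),
  pathNames = c("kingdom | phylum", "kingdom | species", "kingdom"),
  stringsAsFactors = FALSE
)

# which() drops NA comparisons, so rows with a missing path are not flagged.
mismatch <- which(count_entries(taxa$path) != count_entries(taxa$pathNames))
mismatch
```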

@jhpoelen
Member

Thanks again for your patience and feedback.

I went through the entries with mismatching path / path names. I found that most of the issues were due to a historic bug that didn't include empty ranks when ingesting path names. I removed the entries, after spot checking that duplicate entries existed in the taxonCache with aligned path/ids/names.

A single item, EOL:211953 Cetengraulis edentulus, appears to have a \t embedded in the common name Anchoveta raboamaril\t3. It appears that this common name was included in the taxonCache prior to the implementation of tab replacements on writing to tsv.
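A tab replacement on write could look roughly like the following. This is a sketch of the general technique only, not GloBI's actual implementation: sanitize_field is a hypothetical name, and replacing tabs and line breaks with a single space is an assumed policy.

```r
# Hypothetical helper: replace characters that would break a TSV record
# (tabs and line breaks) with a single space before writing.
sanitize_field <- function(x) gsub("[\t\r\n]+", " ", x)

record <- c(
  id = "EOL:211953",
  name = "Cetengraulis edentulus",
  commonNames = "Anchoveta raboamaril\t3"
)

clean <- sanitize_field(record)
tsv_line <- paste(clean, collapse = "\t")

# The written line now has exactly one field per input value.
n_fields <- lengths(strsplit(tsv_line, "\t", fixed = TRUE))
n_fields
```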

The remaining issues are terms related to non-taxa like environmental terms (e.g., wood) or functional groups (e.g., plankton). These do not have path/rank names. I've included the remaining issues below.

I've uploaded an updated copy of taxonCache for your review at https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz .

This cleanup of taxonCache.tsv makes me re-realize the importance of data mobility, archiving, versioning, automated quality control, peer review and the effort this all takes...

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
ENVO:00000339	Stones	NA	NA	environmental feature | mesoscopic physical object | abiotic mesoscopic physical object | piece of rock	ENVO:00002297 | ENVO:00002004 | ENVO:01000010 | ENVO:00000339	NA	http://purl.obolibrary.org/obo/ENVO_00000339	NA
ENVO:00001998	soil	NA	NA	environmental material | soil	ENVO:00010483 | ENVO:00001998	NA	http://purl.obolibrary.org/obo/ENVO_00001998	NA
ENVO:00002003	bovine or equine dung	NA	NA	environmental material | organic material | bodily fluid | excreta | feces	ENVO:00010483 | ENVO:01000155 | ENVO:02000019 | ENVO:02000022 | ENVO:00002003	NA	http://purl.obolibrary.org/obo/ENVO_00002003	NA
ENVO:00002007	Sediment	NA	NA	environmental material | sediment	ENVO:00010483 | ENVO:00002007	NA	http://purl.obolibrary.org/obo/ENVO_00002007	NA
ENVO:00002040	Wood	NA	NA	environmental material | organic material | wood	ENVO:00010483 | ENVO:01000155 | ENVO:00002040	NA	http://purl.obolibrary.org/obo/ENVO_00002040	NA
ENVO:01000155	Detritus	NA	NA	environmental material | organic material	ENVO:00010483 | ENVO:01000155	NA	http://purl.obolibrary.org/obo/ENVO_01000155	NA
ENVO:01000404	plastic	NA	NA	environmental material | anthropogenic environmental material	ENVO:00010483 | ENVO:0010001	NA	http://purl.obolibrary.org/obo/ENVO_01000404	NA
EOL:19662459	Zooplankton	NA	NA	plankton | zooplankton	NA	NA	http://eol.org/pages/19662459	NA
EOL:19662463	Phytoplankton	NA	NA	plankton | phytoplankton	NA	NA	http://eol.org/pages/19662463	NA
W:Bacterioplankton	bacterioplankton	NA	NA	plankton | bacterioplankton	NA	NA	http://wikipedia.org/wiki/Bacterioplankton	NA
W:Macroalgae	Macroalgae	NA	NA	algae | macroalgae	NA	NA	http://wikipedia.org/wiki/Macroalgae	NA

@cboettig
Author

@jhpoelen Found some more rows with alignment / missing-id issue:

look for cases with whitespace in the id:

taxonCache %>% filter(grepl("\\s", id))

(Missed this one before because previously my pattern looked for identifiers with "(:|-|_)", and some species names have these in them). I think it would actually be preferable if ids were all URIs -- would that be possible? e.g. there's what looks like some UUID strings in there but they don't have the urn:uuid: prefix, and some that seem to use _ as a prefix?

Another possible issue I noticed in pathNames:

taxonCache %>% filter(grepl(":", pathNames))

This gets the above misaligned ones too, but looks like it is mostly getting pathNames given by identifiers, maybe mostly from Wikidata. I see why wikidata does that so technically these aren't errors, but from a practical point of view it would be much better to have path names we can match to other path names. e.g. instead of WD:Q35409 | ... just have family | ... (as https://www.wikidata.org/wiki/Q35409). Or maybe that's an issue for a separate thread since it's not really about a parsing problem?
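The two id checks discussed in this comment can be sketched in base R on a handful of values. The prefix pattern below (letters or underscores, a colon, then a non-empty local part without whitespace) is an assumption about what a well-formed id looks like, not a rule stated by GloBI; the third value stands in for a species name that slipped into the id column.

```r
ids <- c(
  "EOL:224784",
  "INAT_TAXON:379688",
  "Gnaphalium purpureum",                  # a name sitting in the id column
  "4701dc84-660a-4c51-bd16-593997f2370b",  # UUID-like, no prefix
  "NZOR-3-100527"                          # no colon-delimited prefix
)

# Check 1: whitespace in an id suggests a shifted row.
has_whitespace <- grepl("\\s", ids)

# Check 2 (assumed convention): ids should look like PREFIX:local-part.
lacks_prefix <- !grepl("^[A-Za-z_]+:[^[:space:]]+$", ids)

data.frame(id = ids, has_whitespace, lacks_prefix)
```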

@jhpoelen
Member

Thanks!

taxonCache %>% filter(grepl("\\s", id))
Nice! This removed the 41 remaining entries with misaligned columns. The accompanying entries with ids were also present in the taxonCache.

I think it would actually be preferable if ids were all URIs -- would that be possible?
That would be possible, and can already be done using a prefix mapping like: https://api.globalbioticinteractions.org/prefixes . You might have noticed that externalUrl expands the id to a resolvable id when possible.
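As a rough illustration of such a prefix mapping, the sketch below expands prefixed ids into URLs. The two-entry table is a hand-written assumption inferred from the externalUrl values quoted earlier in this thread; it is not the content of the prefixes endpoint, and expand_id is a hypothetical helper.

```r
# Illustrative prefix table (assumed, not fetched from the GloBI API).
prefixes <- c(
  EOL = "http://eol.org/pages/",
  INAT_TAXON = "http://inaturalist.org/taxa/"
)

# Hypothetical helper: expand PREFIX:local ids, NA when the prefix is unknown.
expand_id <- function(id, prefixes) {
  prefix <- sub(":.*$", "", id)
  local_part <- sub("^[^:]*:", "", id)
  known <- grepl(":", id, fixed = TRUE) & prefix %in% names(prefixes)
  uri <- rep(NA_character_, length(id))
  uri[known] <- paste0(prefixes[prefix[known]], local_part[known])
  uri
}

expanded <- expand_id(c("EOL:224784", "INAT_TAXON:379688", "NZOR-3-100527"), prefixes)
expanded
```

Ids without a known prefix come back as NA, which matches the rows in the table that have no externalUrl.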

e.g. there's what looks like some UUID strings in there but they don't have the urn:uuid: prefix, and some that seem to use _ as a prefix?
Good point. Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.

taxonCache %>% filter(grepl(":", pathNames))
This additional validator only selected the wikidata path names. As you noticed, abbreviated wikidata identifiers were used to capture the rank information. This was done for pragmatic reasons. It should be relatively easy to map the rank name ids to associated labels. In the future, we might want to introduce a normalized term rank by introducing rankName and rankId, in addition to pathNames and pathNameIds. Related to #7 .

I've prepared https://depot.globalbioticinteractions.org/tmp/taxon-0.3.2/taxonCache.tsv.gz for your review. If you are ok with this version, I'll prepare another zenodo publication. Otherwise, please detail your concerns.

@cboettig
Author

Please note that #6 describes the origin of the prefix-less ids. I am hoping to incorporate these changes in the next major release of GloBI's taxon graph (should I rename to globi term graph instead?). I've started making manual patches using a development version and nomer, only to realize that my time is probably better spent on thinking more about how to automatically validate, and report on, term mappings in addition to making the term graph more modular (e.g., splitting up term vertices and mapping edges into more manageable chunks similar to modular development of software libraries). If this is a big concern, please let me know.

Sounds like a plan. Nice to have ALA taxon addressed. I'm still seeing 57 rows that don't have a : in the id, e.g.

> taxonCache %>% filter(!grepl(":", id))
# A tibble: 57 x 9
   id                                   name   rank  commonNames path  pathIds pathNames externalUrl thumbnailUrl
   <chr>                                <chr>  <chr> <chr>       <chr> <chr>   <chr>     <chr>       <chr>       
 1 4701dc84-660a-4c51-bd16-593997f2370b CoelospecNA          Fungurn:lskingdomNA          NA          
 2 ALA_Cladia_muelleri                  CladiunknNA          | Cl| ALA_| unknown NA          NA          
 3 ALA_Delia_hirticrura                 DeliaunknNA          | De| ALA_| unknown NA          NA          
 4 ALA_Oxycetonia_jucunda               OxyceunknNA          | Ox| ALA_| unknown NA          NA          
 5 NZOR-3-100527                        Procigenus NA          | Pr| NZOR| genus   NA          NA          
 6 NZOR-3-109825                        Mariegenus NA          | Ma| NZOR| genus   NA          NA          
 7 NZOR-3-33834                         MisceunknNA          | Mi| NZOR| unknown NA          NA          
 8 NZOR-3-40069                         ProkaunknNA          | Pr| NZOR| unknown NA          NA          
 9 NZOR-3-41136                         Urticgenus NA          | Ur| NZOR| genus   NA          NA          
10 NZOR-3-54695                         Oreocgenus NA          | Or| NZOR| genus   NA          NA          
# ... with 47 more rows

Maybe that is intentional? Isn't clear if these identifiers can be resolved, notably they have no externalUrl entry, though ALA and NZOR look like they want to be prefixes to something(?)

There's a larger set of things with no externalUrl, some which seem to have prefixes that aren't defined in the prefix table (CoL, CAAB, ...), e.g.:

> taxonCache %>% filter(is.na(externalUrl))
# A tibble: 2,770 x 9
   id       name    rank  commonNames path         pathIds                  pathNames    externalUrl thumbnailUrl
   <chr>    <chr>   <chr> <chr>       <chr>        <chr>                    <chr>        <chr>       <chr>       
 1 4701dc8CoelomspecNA          Fungi | Chyurn:lsid:indexfungorum.kingdom | pNA          NA          
 2 ALA_ClaCladiaunknNA          | Cladia mu| ALA_Cladia_muelleri    | unknown    NA          NA          
 3 ALA_DelDeliaunknNA          | Delia hir| ALA_Delia_hirticrura   | unknown    NA          NA          
 4 ALA_OxyOxycetunknNA          | Oxycetoni| ALA_Oxycetonia_jucunda | unknown    NA          NA          
 5 CAAB:0cHalicaspecNA          HalicarcinuCAAB:0cd18290:475549ca:species      NA          NA          
 6 CAAB:23TalochspecNA          | Talochlam| CAAB:23270067          | species    NA          NA          
 7 CAAB:28Crab zunknNA          | Crab zoea  | CAAB:28850902          | unknown    NA          NA          
 8 CAAB:53MastogspecNA          MastogloiacCAAB:53210000 | CAAB:53family | geNA          NA          
 9 CAAB:80Microaunknmicroalgae| Microalgae | CAAB:80200000          | unknown    NA          NA          
10 CoL:254PseudospecNA          PseudoparreCoL:25759155 | CoL:2549genus | speNA          NA          
# ... with 2,760 more rows

Again, I think this all just shows what an amazing resource this is to have all of this compiled in a nice file like taxonCache.tsv.gz, as synthesizing all these resources in a single table like that is far from trivial!

Running a few experiments on the pipe paths but I think that all relates to next steps in #7 rather than possible issues in taxonCache. Lemme know what you think about the above concerns with some of the ids, but otherwise this is looking ready for release to me.

@cboettig
Author

Looks like there might be a few cases where path, pathNames, and pathIds do not all have the same length (not counting cases where any one of these is NA), e.g. the row with id = ITIS:10824. Could be indicative of an issue?

@cboettig
Author

in case it's at all helpful, here's the crummy R code I'm using to identify the ~1000 rows that appear to have issues.

## Expect same number of pipes in each entry:
pattern = "\\s*\\|\\s*"
path_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$path, pattern)[[1]]))
pathName_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathNames, pattern)[[1]]))
pathIds_pipes <- taxonCache %>% purrr::transpose() %>% map_int( ~length(str_split(.x$pathIds, pattern)[[1]]))
na_path <- is.na(taxonCache$path)
na_pathNames <- is.na(taxonCache$pathNames)
na_pathIds  <- is.na(taxonCache$pathIds)

trouble <- which( !(pathIds_pipes == path_pipes) & !na_path & !na_pathIds)

## Here's the ~1000 rows that appear mismatched to me
taxonCache[trouble,]

@jhpoelen
Member

Very helpful indeed, thanks for being thorough. I am working on an input / output validation framework to more easily detect these inconsistencies: #8 . Curious to hear your thoughts on that.

@jhpoelen
Member

@cboettig just published http://doi.org/10.5281/zenodo.1250572 . In this version, the consistency of terms and links was checked using nomer's validate-term and validate-term-link. Also, various fixes were included to help make the ids and their hierarchies a bit more well-behaved.

@cboettig
Author

cboettig commented Jun 9, 2018

@jhpoelen Maybe I'm not understanding something here, but it seems there's ~ 500,000 rows in taxonCache involving duplicate ids?

I think this should be reproducible R code:

library(tidyverse)
taxonCache <- read_tsv("https://zenodo.org/record/1250572/files/taxonCache.tsv.gz", quote="")

dup_id <- 
  taxonCache %>% select(id) %>% group_by(id) %>% 
  summarise(n_id = length(id)) %>% filter(n_id > 1) 

trouble <- taxonCache %>% semi_join(select(dup_id, id))

# a data frame with the subset of taxonCache having duplicate ids
trouble

This prevents me from establishing a unique path / pathIds / pathNames for an id; it's not clear how to resolve the conflicts. I think this is related to (or the cause of) the issue I just added to #7.
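To separate harmless exact duplicates from ids that carry conflicting hierarchies, a base-R sketch along these lines may help. The toy table is illustrative: the ALATaxon values come from a record discussed in this thread, the repeated EOL:224784 row is made up to show an exact duplicate, and kingdomId is a hypothetical stand-in for the full pathIds string.

```r
taxa <- data.frame(
  id = c("ALATaxon:NZOR-6-102447", "ALATaxon:NZOR-6-102447",
         "EOL:224784", "EOL:224784", "EOL:392765"),
  kingdomId = c("ALATaxon:http://id.biodiversity.org.au/node/apni/50587232",
                "ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337710",
                "EOL:1", "EOL:1", "EOL:281"),
  stringsAsFactors = FALSE
)

# All rows whose id occurs more than once (both first and later occurrences).
is_dup <- duplicated(taxa$id) | duplicated(taxa$id, fromLast = TRUE)
trouble <- taxa[is_dup, ]

# Collapse exact duplicate rows first; ids that still repeat have
# genuinely different hierarchies.
variants <- unique(taxa)
conflicting_ids <- unique(variants$id[duplicated(variants$id)])
conflicting_ids
```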

@jhpoelen
Member

jhpoelen commented Jun 9, 2018

@cboettig thanks for sharing. See #7 (comment) . I think this warrants a further discussion. . .

@jhpoelen
Member

Also, please note #9 - would having the name source / retrieval date provide more information on which taxon id to select?

Currently, GloBI itself uses a pretty blunt method - just use all that match to populate taxon search index/ graph.

@jhpoelen
Member

Here's an example of a taxon id with slight changes in name hierarchies as provided by the name source. Note that http://id.biodiversity.org.au/node/apni/50587232 and https://id.biodiversity.org.au/taxon/apni/51337710 are both outdated identifiers for Plantae. So, this is an example of multiple interpretations of taxon ids.

Am leaving this issue open because it exposes some interesting effects associated with taxon ids.

id	name	rank	commonNames	path	pathIds	pathNames	externalUrl	thumbnailUrl
ALATaxon:NZOR-6-102447	Eurya	genus		Plantae | Charophyta | Equisetopsida | Magnoliidae | Ericales | Pentaphylacaceae | Eurya	ALATaxon:http://id.biodiversity.org.au/node/apni/50587232 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587231 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587230 | ALATaxon:http://id.biodiversity.org.au/node/apni/50587229 | ALATaxon:http://id.biodiversity.org.au/node/apni/8790835 | ALATaxon:http://id.biodiversity.org.au/node/apni/8305023 | ALATaxon:NZOR-6-102447	kingdom | phylum | class | subclass | order | family | genus	https://bie.ala.org.au/species/NZOR-6-102447	
ALATaxon:NZOR-6-102447	Eurya	genus		Plantae | Charophyta | Equisetopsida | Magnoliidae | Ericales | Pentaphylacaceae | Eurya	ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337710 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337706 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337705 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51337515 | ALATaxon:https://id.biodiversity.org.au/taxon/apni/51311074 | ALATaxon:https://id.biodiversity.org.au/node/apni/8305023 | ALATaxon:NZOR-6-102447	kingdom | phylum | class | subclass | order | family | genus	https://bie.ala.org.au/species/NZOR-6-102447	

jhpoelen added the "discussion" label (issue that contains design discussion) Mar 12, 2020