Implement changes to "normative document" discussed at DwC MG meeting #264

Merged
merged 32 commits into from
Aug 13, 2020


baskaufs

This branch includes a number of changes that were discussed during and after a recent DwC Maintenance Group meeting. It needs to be checked to see what effects the changes have on the Quick Reference Guide (QRG). The most obvious change is that the labels were reverted from camelCase to English prose. This will definitely affect the QRG unless the build script is changed to use the term local names, so the branch should not be merged until that particular issue is resolved.

The changes involving dcterms:language and dcterms:type should also be looked at carefully. The intent is that dc:language should replace dcterms:language in the record-level terms section, dc:type should replace dcterms:type in the record-level terms section, dcterms:language should move into the UseWithIRI section, and dcterms:type should disappear from the QRG (since the DwC RDF Guide says that rdf:type should be used instead of it in the case where one desires to indicate type by an IRI).

If these changes are appropriate, I'll use it as the standard to check whether build script 3 will faithfully generate the file necessary to build the QRG. That can be done before a merge as long as I know that this is ultimately the form we want the "normative document" to take.

This pull request is related to the following issues:

Steve Baskauf added 6 commits July 25, 2020 13:16
In the record-level terms section, replace dcterms:language with dc:language. Use the former dcterms: examples for string values with dc:language. Move dcterms:language to the UseWithIRI section and change its recommended value to an IRI from the LOC ISO 639-2 scheme (to maintain consistency with Audubon Core usage). NOTE: the value of organized_in for dc:language is now different from the other DCMI terms in the record-level section. Will that break the QRG build script?
Move dcterms:type below the recommended terms and change status to deprecated. Change its notes to indicate that dc:type should be used for strings and that the RDF guide recommends rdf:type instead of dcterms:type when providing an IRI value. Replace dcterms:type in the record-level terms with dc:type and carry over the examples that were previously in dcterms:type to it. NOTE: the terms below the recommended terms are still not in alphabetical order.
Note: I did not change the obsolete terms, which have always had camelCase labels. Also, the IRI terms that have non-IRI analogs have " (IRI)" appended to their labels to distinguish them.
Added the missing legacy superproperty dwc:accordingTo and made the status deprecated.
In the previous term_versions.csv file, the value in the replaces column indicated what DCMI said the term replaced. However, in the complete history table, we use this column to indicate what this term version replaces with respect to previous TDWG use. For example, in an Executive Decision, dcterms:license was to be used in lieu of dcterms:rights. I've changed the table to reflect this replacement, rather than recording that DCMI had http://dublincore.org/usage/terms/history/#rightsHolder-002 replace http://dublincore.org/usage/terms/history/#rightsHolder-001 . There are several other examples of replacements caused by TDWG actions, and these have replaced the record of DCMI actions. If there was no TDWG-sanctioned replacement, the value in the column was cleared.
@baskaufs
Author

I forgot several technical changes related to the replaces column for DCMI borrowed terms. See the commit notes for details.

Steve Baskauf added 3 commits July 26, 2020 06:57
dcterms:rights was originally in Darwin Core, but was replaced by dcterms:license in Executive Decision http://rs.tdwg.org/dwc/terms/history/decisions/#Decision-2014-11-06_17 . It has been added as part of the historical record. It is listed as the "replaces" value for the dcterms:license term and needs to be there.
@baskaufs
Author

Just to let you know where I am on this: I've run the 2017 and 2018 changes through the build script and am now comparing the term_versions.csv file in this pull request with the generated one. For each difference, I'm determining whether it is something that I missed in this branch's version or a problem with the rs.tdwg.org source data that the build script is using. There are a bunch of camelCase/prose differences I need to reconcile, but I think I'm close to getting it right with the build script.
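
The comparison described above can be sketched as a cell-level diff of two CSV files keyed on a shared column. This is a minimal illustration, not the build script itself: the function names, the choice of `iri` as the key column, and the sample rows are all assumptions made for the example.

```python
import csv
import io


def load_by_key(csv_text, key="iri"):
    """Index the rows of a CSV document by the value in the key column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def cell_diffs(draft_text, generated_text, key="iri"):
    """List (key, column, draft_value, generated_value) for every differing cell.

    A row present in only one file is reported with column None and a
    True/False flag for each side showing where the row exists.
    """
    draft = load_by_key(draft_text, key)
    generated = load_by_key(generated_text, key)
    diffs = []
    for k in sorted(draft.keys() | generated.keys()):
        d, g = draft.get(k), generated.get(k)
        if d is None or g is None:
            diffs.append((k, None, d is not None, g is not None))
            continue
        for column in sorted(d.keys() | g.keys()):
            if d.get(column) != g.get(column):
                diffs.append((k, column, d.get(column), g.get(column)))
    return diffs


# Hypothetical sample data showing a camelCase/prose label difference.
draft_csv = "iri,label\na,Institution ID\nb,Modified\n"
generated_csv = "iri,label\na,institutionID\nb,Modified\n"
differences = cell_diffs(draft_csv, generated_csv)
```

An empty result would correspond to the "no diffs" state at which the generated file matches the hand-edited draft.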

Steve Baskauf added 4 commits July 27, 2020 05:34
The DCMI History page http://dublincore.org/usage/terms/history/ uses a version model that differs from what we are using in TDWG. They use the term "issued" for what we call "current terms", whereas we only use "issued" for term versions. Thus their issued date is what we call the created date, and their modified date for the current term is what we call the issued date of the version.

These dates have been changed to match the TDWG model. Also, version 003 of dcterms:modified has "Date Modified" as the label, rather than "Modified".
The comment "Access Rights may include information regarding access or restrictions based on privacy, security, or other policies." is found in the comments for dcterms:accessRights, not as part of the definition.
Added a replaces value showing that the deprecated dwctype:Location term was replaced by dcterms:Location
@peterdesmet
Member

@baskaufs indeed, the QRG build script now spits out the English labels (like Institution ID). I'm assuming you have a term_localName in your source CSV? And you're not adding this to term_versions.csv because it wouldn't allow you to do git diff checks?

Can you also generate a term_versions_localName.csv with that term_localName added? I suggest placing term_localName as the second column, right after iri.

@baskaufs
Author

@peterdesmet Yes, that is correct. I can easily add the term_localName column as the second column right after iri. And yes, I'm going through every diff that I see and changing either the source rs.tdwg.org data or the draft term_versions.csv, whichever I think is wrong. I will keep going until there are no diffs and then consider the build script to be fully working.

So you can operate on the assumption that there will be a term_localName column. I will add it as the last thing.
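
Adding the column in that position can be sketched as below. This is only an illustration under stated assumptions: the helper names are hypothetical, and the rule for deriving the local name (the text after the last `/` or `#` of a term IRI) is assumed for the example. If the `iri` column holds version IRIs rather than term IRIs, a different rule would be needed.

```python
import csv
import io


def local_name(term_iri):
    """Return the text after the last '/' or '#' of an IRI (assumed rule)."""
    return term_iri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def add_local_name_column(csv_text, iri_column="iri", new_column="term_localName"):
    """Return CSV text with new_column inserted immediately after iri_column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    position = header.index(iri_column)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header[:position + 1] + [new_column] + header[position + 1:])
    for row in reader:
        name = local_name(row[position])
        writer.writerow(row[:position + 1] + [name] + row[position + 1:])
    return out.getvalue()


# Hypothetical one-row sample; the real file has many more columns.
sample = "iri,label\nhttp://rs.tdwg.org/dwc/terms/institutionID,Institution ID\n"
with_local_name = add_local_name_column(sample)
```

Keeping the prose label and the local name in separate columns lets the QRG build script choose either one without changing the normative data.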

@baskaufs
Author

@tucotuco One of the major diffs that I have is that the abcdEquivalence values have been removed from all of the obsolete terms (curatorial, digr.org terms, etc.) in term_versions.csv . I still have those values in rs.tdwg.org. I could remove them from the rs.tdwg.org source files, but that seems like a bad idea since it would be throwing away historical information. I'm not sure why it hurts to have that information in the term_versions.csv file, since probably nobody will ever care about or look at it and it has no effect on building the QRG. Let me know what you want me to do. I'm going to have to stop working on this for a while, so I'll just wait until you give me an answer on this before proceeding.

@tucotuco
Member

tucotuco commented Jul 27, 2020 via email

@tucotuco
Member

tucotuco commented Aug 6, 2020

This all looks good to me. Is it OK to merge?

@baskaufs
Author

baskaufs commented Aug 6, 2020

When @tucotuco agrees that the series of commits are OK, the first box can be ticked and I will go ahead with merging tdwg/rs.tdwg.org#38, which I think is the only thing blocking #256 .

The next steps are for @peterdesmet to decide what changes need to be made in the QRG build script: first, whether to make it work without the tdwgutility:UseWithIRI row now or save that for later (I think that is the last thing blocking #252); and second, whether to make the script use the new term_localName column instead of the label column, or to leave the label column as it currently is (with the localNames instead of English prose) for now. That's one of the remaining checkboxes blocking this issue.

@baskaufs
Author

baskaufs commented Aug 6, 2020

@tucotuco I think there are still some boxes to tick before merging

@baskaufs
Author

baskaufs commented Aug 6, 2020

The script that builds term_versions.csv from the data in the rs.tdwg.org repo is now in the build directory. It currently outputs the file into that directory, but could be changed to put it in a different path. 665ca08#diff-19d84221e53d6c43dde26fcaa8d1e180

@peterdesmet
Member

@baskaufs, you can:

  • Rename build/generate_normative_csv.py to build/generate_term_versions.py (a suggestion)
  • Rename generated_normative_document.csv to term_versions.csv (effectively replacing it) in your script as well as the output.

Once that is done, I will revert the build script to work from that file again and we can merge.

peterdesmet and others added 2 commits August 7, 2020 21:46
Changed the path of output so that the current term_versions.csv file is replaced by the output. Changed the name of the script itself to generate_term_versions.py . Added the list of terms used to indicate the order of terms in the Quick Reference Guide.
@baskaufs
Author

baskaufs commented Aug 7, 2020

@peterdesmet I have renamed the script as you suggested. I also added a source file that I had forgotten (qrg-list.csv) that is used by the script to determine the order of the terms in the Quick Reference Guide. I deleted generated_normative_document.csv and ran the script to generate the replacement term_versions.csv (now with the term_localName column). The diff (1d52ee8#diff-7af473e1d078337b36d73b11b0780a48) seems to indicate that the generated file is as designed. But you might try building the QRG from it one more time to make sure that all is well.

I think that when you can verify that the built QRG is OK we would be ready to merge.

@baskaufs
Author

Checking off the box for "modify the QRG build script to use the extra column in the generated_normative_document.csv file so that it makes the labels appear on the QRG as desired. Once that is done, it can replace the term_versions.csv file and this branch can be merged with the master." because it was done in 5aa48c8

@baskaufs
Author

Checking off the box for "move the script that builds the term_versions.csv file from https://github.com/baskaufs/msc/blob/master/dwc_workflow/generate_normative_csv.ipynb to some appropriate place in the DwC repo. There are a few path changes that will have to be made in the script Jupyter notebook. Once I merge the changes I've made to the rs.tdwg.org repo, the value of github_baseUri will need to be changed to the master branch. At the end of the script, the save file path will need to be changed from generated_normative_document.csv to term_versions.csv with an appropriate relative path so that the file will end up in the right place relative to where the script is stashed." because it was done in 665ca08

@baskaufs
Author

baskaufs commented Aug 12, 2020

In b16d893 I have moved the Jupyter notebook that generates the list of terms document to the build directory along with the header and footer templates that it uses. I also modified the header template to match the other DwC documents like the RDF guide, etc. so hopefully it will render the same way. I also added a few notes to the README.md about this in 0057820

Todo:

  • Check that the page renders appropriately using the TDWG template (especially the backticks in the example fields).
  • Following the patterns of the other documents like the guides, the URL should end in /list/. If this is not correct, then the directory structure should be changed appropriately and the build script save location changed.
  • Once the URL is settled, a link needs to be added to the list of terms from the QRG, with appropriate text explaining what it is.

@baskaufs
Author

@tucotuco @peterdesmet The link to the updated list of terms document is here. I believe that it now includes all of the changes @tucotuco made and those terms are versioned using today's date.

@peterdesmet
Member

peterdesmet commented Aug 13, 2020

Hi @baskaufs, the build.py script now uses term_versions.csv again and works fine.

The docs/list/index.md renders ok for the most part:

  • Is it correct that dcterms:Location is the only class not starting with dwc:?
  • Is the capitalization on dwc_DwCType correct?
  • All term links point to URLs such as .../list/#dwc_organismQuantityType
  • There are trailing </tr> tags before every vocabulary element. I think this is caused by a missing <tr> before modified.

Screenshot 2020-08-13 at 10 20 06

  • The backticks are not rendered properly, because markdown in HTML (here: a table) is just displayed as plain text. You need to convert these to HTML yourself in the notebook. We also had to do this in the QRG build script. For now, I'd say: leave as is or copy/paste from this function:

    dwc/build/build.py

    Lines 156 to 161 in 67e30c9

    def convert_code(text_with_backticks):
        """Takes all back-quoted sections in a text field and converts it to
        the html tagged version of code blocks <code>...</code>
        """
        return re.sub(r'`([^`]*)`', r'<code>\1</code>', text_with_backticks)
    The same applies to http links:

    dwc/build/build.py

    Lines 163 to 174 in 67e30c9

    def convert_link(text_with_urls):
        """Takes all links in a text field and converts it to the html tagged
        version of the link
        """
        def _handle_matched(inputstring):
            """quick hack version of url handling on the current prime versions data"""
            url = inputstring.group()
            return "<a href=\"{}\">{}</a>".format(url, url)
        regx = "(http[s]?://[\w\d:#@%/;$()~_?\+-;=\\\.&]*)(?<![\)\.])"
        return re.sub(regx, _handle_matched, text_with_urls)
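
As a rough usage illustration, the two conversions quoted above could be applied to text fields as follows. This is a self-contained adaptation rather than the file's exact contents: the URL pattern is the same but written as a raw string, the nested helper is replaced by a lambda, and the sample field values are hypothetical.

```python
import re


def convert_code(text_with_backticks):
    """Wrap each back-quoted span in <code>...</code>."""
    return re.sub(r'`([^`]*)`', r'<code>\1</code>', text_with_backticks)


def convert_link(text_with_urls):
    """Wrap each http(s) URL in an <a> tag, leaving a trailing ')' or '.' outside."""
    # Same pattern as the quoted function, written as a raw string (an adaptation).
    regx = r"(http[s]?://[\w\d:#@%/;$()~_?\+-;=\\\.&]*)(?<![\)\.])"
    return re.sub(
        regx,
        lambda match: '<a href="{0}">{0}</a>'.format(match.group()),
        text_with_urls,
    )


# Hypothetical field values, for illustration only.
examples_html = convert_code("`PreservedSpecimen`, `FossilSpecimen`")
notes_html = convert_link("See http://rs.tdwg.org/dwc/terms/.")
```

The negative lookbehind at the end of the pattern keeps sentence-ending punctuation out of the generated link target.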

@baskaufs
Author

@peterdesmet

checkbox 1: Yes, only dcterms:Location.

checkbox 2: John might know the history of dwc:DwCType. As far as I know, it is a historical anomaly that is only included here because it was in the dwc: namespace in the past and was part of the original dwchistory RDF/XML file that was the source of all of the historical information. It is the only term that has the type dcam:VocabularyEncodingScheme and I have no idea what the conventions are for capitalization of that.

checkbox 3: Not sure what the problem is here. Are you talking about the links in the index? The page source shows them as pointing to local fragment identifiers:

and they link to other places in the document as designed:

Am I misunderstanding something?

checkbox 4: I'm not seeing this problem either. In the page source, the <tr> in front of <td>Modified</td> is indented in a "non-pretty" way and lacks a hard return after it, but it is there.

	<tbody>
		<tr>
			<td>Term IRI</td>
			<td><a href="http://rs.tdwg.org/dwc/terms/acceptedNameUsage" rel="nofollow">http://rs.tdwg.org/dwc/terms/acceptedNameUsage</a></td>
		</tr>
			<tr><td>Modified</td>
			<td>2017-10-06</td>
		</tr>
		<tr>
			<td>Term version IRI</td>
			<td><a href="http://rs.tdwg.org/dwc/terms/version/acceptedNameUsage-2017-10-06" rel="nofollow">http://rs.tdwg.org/dwc/terms/version/acceptedNameUsage-2017-10-06</a></td>
		</tr>
...
	</tbody>

I tried putting the HTML in Oxygen to validate it, but there are a lot of problems caused by the backticks mixed with Jekyll's automatic hyperlinking. So it's hard to identify problems with the HTML until that is fixed.
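
Independently of a full validator, the specific question of stray or unclosed row tags could be checked with a small script. The sketch below uses Python's standard `html.parser` module; the class and function names are hypothetical, and it assumes tables are not nested, so it only counts `<tr>` and `</tr>` tags and ignores everything else.

```python
from html.parser import HTMLParser


class RowBalanceChecker(HTMLParser):
    """Track <tr> and </tr> tags and record closing tags with no open row."""

    def __init__(self):
        super().__init__()
        self.open_rows = 0
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.open_rows += 1

    def handle_endtag(self, tag):
        if tag == "tr":
            if self.open_rows == 0:
                # getpos() gives the (line, column) of the offending tag.
                self.problems.append(("stray </tr>", self.getpos()))
            else:
                self.open_rows -= 1


def check_rows(html_text):
    """Return a list of (problem, position) tuples; an empty list means balanced."""
    checker = RowBalanceChecker()
    checker.feed(html_text)
    checker.close()
    if checker.open_rows:
        checker.problems.append(("unclosed <tr>", None))
    return checker.problems


# Hypothetical fragments: one balanced, one missing its opening <tr>.
balanced = check_rows("<table><tr><td>Modified</td></tr></table>")
unbalanced = check_rows("<table><td>Modified</td></tr></table>")
```

Because the parser reports positions in the source text, a check like this could be run on the generated page before Jekyll's automatic hyperlinking is involved.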

checkbox 5:

I would like to fix this, but I'm running out of time to work on this project now. So if it's "good enough" for now, I guess I would be OK with fixing it later. I'll create an issue, assign it to me and try to get to it before too much time goes by.

@peterdesmet
Member

Checkbox 3: no misunderstanding, links seem ok
Checkbox 4: odd, let's fix this later: #270
Checkbox 5: minor thing, having separate issue is ok

@baskaufs baskaufs merged commit 5bd258e into master Aug 13, 2020
@baskaufs baskaufs deleted the dcmi-term-changes branch August 13, 2020 15:05
@tucotuco
Member

Just an FYI. dwc:DwCType is correct. It was a term parallel to dc:type, but for the basisOfRecord vocabulary. All history.
