These GitHub pages include a page describing a software and a page describing a dataset that was created as an output from the software at hand.
- Name: Tutorial on adding Bioschemas markup to GitHub pages
- Description: This tutorial shows how to add Bioschemas markup to GitHub pages. It uses a simple GitHub page hosted in the gh-pages branch to create a sample project page, i.e., as learners could do with their own GitHub projects. As an example, it uses the software and data corresponding to the released project TREC-doc-2-doc-relevance, a web-based interface to add document-to-document relevance assessments to pairs of documents retrieved from TREC 2005 Genomics Track.
- Keywords: Bioschemas, schema.org, structured markup, GitHub pages
- How can I add Bioschemas markup to GitHub pages?
- Can I also use schema.org types not included in Bioschemas?
- How can I use Bioschemas profiles? In particular ComputationalTool and Dataset (bonus, we also use TrainingMaterial to describe the tutorial itself)
- Describe how schema.org and Bioschemas markup can be embedded to GitHub pages
- Use schema.org and Bioschemas profiles on GitHub pages
- Use schema and Bioschemas validators
- Understanding of what is Bioschemas and what are Bioschemas types and profiles
- Familiarity on how to use GitHub
- Basic knowledge on how to use GitHub pages. More information at GitHub Pages
- Familiarity with JSON-LD
- Basic knowledge on Markdown
- Knowledge of develop tools on a browser
Latest modification 2024-02-07
License CC-By 4.0
Identifier DOI:10.5281/zenodo.10629453
Citation Castro, LJ. (2024, February 7). Adding Bioschemas Dataset and ComputationalTool markup to GitHub pages. Zenodo. https://doi.org/10.5281/zenodo.10629453
In this tutorial we will cover:
- Creating this GitHub page
- Adding schema.org and Bioschemas markup
- Visualizing the structured markup
- Validating the pages against the schema Validator
- Validating against the Rich Results Test
- Validating the pages against FAIR-Checker/Bioschemas validator
- Creating a sitemap
- Try it out
Let's start by forking this repository for your own purposes. Once forked, go to settings
You will need to enable "Pages" on your forked repository, and select "Deploy from a branch" using the branch "gh-pages". Save the changes.
As this repo does have a gh-pages branch, it will use it. If such branch would not exist, GitHub would ask you to use the main branch to start the gh-pages one
In a matter of minutes, your site will be live. The pages corresponding to the examples used in this tutorial are available at https://zbmed-semtec.github.io/bioschemas-ghpages-markup-tutorial/
Do not forget to get a local copy of your fork so you can make changes.
We have already added schema.org and Bioschemas markup corresponding to our example, code and data for our repo TREC-doc-2-doc-relevance. With GitHub pages, same a with regular HTML (static or generated), the trick is adding the JSON-LD describing those elements that you want. To do so, you need to add a block <script type="application/ld+json">
with the Bioschemas markup as a JSON-LD inside the block.
The Bioschemas markup will start with the context "@context": "https://schema.org"
, i.e., where all types and profiles are defined.
Then you add the type and id of what you are describing, for instace a bioschemas:ComputationalTool
. The "http://purl.org/dc/terms/conformsTo"
is the bit that indicates that you are using Bioschemas.
"@type": "SoftwareApplication",
"@id": "https://doi.org/10.5281/zenodo.7341391",
"http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/profiles/ComputationalTool/1.0-RELEASE",
For the rest of the markup, you need to see what the Bioschemas profile recommends (keep reading, more information below).
Have a look to the pages that we have created, one for the code and corresponding release, another for a dataset created from data collected by the software.
The [software page] has markup for two elements, schema:SoftareSourceCode
and bioschemas:ComputationalTool
.
The markup corresponding to schema:SoftwareSourceCode
is shown below. Note that the source code is linked to the release via the property schema:targetProduct
. In the code below, we are using the release @id
to refer to it.
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareSourceCode",
"@id": "https://github.com/zbmed-semtec/TREC-doc-2-doc-relevance",
"citation": "Talha M, Geist L, Fellerhof T, Ravinder R, Giraldo O, Rebholz-Schuhmann D, et al. TREC-doc-2-doc-relevance [Software source code]. GitHub; 2022.",
"name": "TREC-doc-2-doc-relevance",
"description": "This is the software source code facilitating the creation of a doc-2-doc relevance assessment on PMIDs used in the TREC 2005 Genomics track along with its metadata.",
"url": "https://github.com/zbmed-semtec/TREC-doc-2-doc-relevance#",
"targetProduct": {
"@id": "https://doi.org/10.5281/zenodo.7341391"
},
"license": {
"@type": "CreativeWork",
"@id": "http://spdx.org/licenses/MIT",
"name": "MIT License",
"url": "https://opensource.org/license/mit/"
},
"author": [
{"@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha"},
{"@id": "https://orcid.org/0000-0002-2910-7982"},
{"@id": "https://orcid.org/0000-0002-8725-1317"},
{"@id": "https://orcid.org/0009-0004-4484-6283"},
{"@id": "https://orcid.org/0000-0003-2978-8922"},
{"@id": "https://orcid.org/0000-0002-1018-0370"},
{"@id": "https://orcid.org/0000-0003-3986-0510"}
]
}
<7script>
Note that we use a trick for the @id
and the schema:url
in our markup. Identifiers should be unique, global and permanent, like e.g., DOIs, and clearly identify the object they describe (in this case a source code). But, we do not have such a PID for our source code, we only have the GitHub URL for our repo. Now, if the @id
is meant to identify a source code, we cannot use the exact same URL for the source code and for a "regular" url. We add #
to the "regular" URL. If we do not, validators will see that the exact same URL is used to identify a source code and a "regular" URL but source codes and URLs are not the same thing so the validators might get confused. Using the exact same URL would be like saying that an animal is a cat at the same time that it is a dog, it cannot be. The best way to go would be getting an actual PID for your repo, e.g., via w3id.org
And here you have the markup corresponding to the bioschemas:ComputationalTool
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareApplication",
"@id": "https://doi.org/10.5281/zenodo.7341391",
"http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/profiles/ComputationalTool/1.0-RELEASE",
"identifier": "DOI:10.5281/zenodo.7341391",
"citation": "Talha M, Geist L, Fellerhof T, Ravinder R, Giraldo O, Rebholz-Schuhmann D, et al. TREC-doc-2-doc-relevance assessment interface. Zenodo; 2022. doi:10.5281/zenodo.7341391",
"name": "TREC-doc-2-doc-relevance assessment interface",
"description": "The code, data and docs at this release aim at facilitating the creation of a doc-2-doc relevance assessment on PMIDs used in the TREC 2005 Genomics track. A doc-2-doc relevance assessment takes one document as reference and assess a second document regarding its relevance to the reference one. This doc-2-doc collection will be used to evaluate the doc-2-doc recommendations approaches that we are working on.",
"url": "https://zenodo.org/records/7341391",
"softwareVersion": "1.0.0",
"datePublished": "2022-11-21",
"license": {
"@type": "CreativeWork",
"@id": "http://spdx.org/licenses/MIT",
"name": "MIT License",
"url": "https://opensource.org/license/mit/"
},
"author": [
{"@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha"},
{"@id": "https://orcid.org/0000-0002-2910-7982"},
{"@id": "https://orcid.org/0000-0002-8725-1317"},
{"@id": "https://orcid.org/0009-0004-4484-6283"},
{"@id": "https://orcid.org/0000-0003-2978-8922"},
{"@id": "https://orcid.org/0000-0002-1018-0370"},
{"@id": "https://orcid.org/0000-0003-3986-0510"}
]
}
</script>
The dataset page has markup only for one element, a dataset.
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Dataset",
"@id": "https://doi.org/10.5281/zenodo.7338056",
"conformsTo": "https://bioschemas.org/profiles/Dataset/1.1-DRAFT",
"identifier": "DOI:10.5281/zenodo.7338056",
"citation": "Giraldo O, Solanki D, Rebholz-Schuhmann D, Castro LJ. Fleiss kappa for doc-2-doc relevance assessment. Zenodo; 2022. doi:10.5281/zenodo.7338056",
"name": "Fleiss kappa for doc-2-doc relevance assessment",
"description": "Fleiss' kappa measuring inter-annotator agreement on a document-to-document relevance assessment task. The table contains 7 columns, the first one presents the topics, 8 in total. The second column shows the \"reference articles\", represented by their PubMed-ID and organized by topic. The third column shows the Fleiss’ Kappa results. The fourth column shows the interpretation of the Fleiss' Kappa results being: i) \"Poor\" results <0.20, ii) \"Fair\" results within 0.21 - 0.40, and iii) \"Moderate\" results within 0.41 - 0.60. The fifth column shows the PubMed-IDs of evaluation articles rated by the four annotators as \"Relevant\" regarding its corresponding \"reference article\". The sixth column shows the PubMed-IDs of evaluation articles rated by the four annotators as \"Partially relevant\" regarding its corresponding \"reference article\". The seventh column shows the PubMed-IDs of evaluation articles rated by the four annotators as \"Non-relevant\" regarding its corresponding \"reference article\".",
"keywords": "Fleiss' Kappa, Inter-annoator agreement, TREC Genomics Track 2005, relevance assessment",
"license": {
"@type": "CreativeWork",
"@id": "http://spdx.org/licenses/CC-BY-4.0",
"name": "Creative Commons Attribution 4.0 International",
"alternateName": "CC BY 4.0",
"url": "https://creativecommons.org/licenses/by/4.0/"
},
"url": "https://zenodo.org/record/7338056",
"datePublished": "2022-11-19",
"author": [
{"@id": "https://orcid.org/0000-0003-2978-8922"},
{"@id": "https://orcid.org/0009-0004-1529-0095"},
{"@id": "https://orcid.org/0000-0002-1018-0370"},
{"@id": "https://orcid.org/0000-0003-3986-0510"}
]
}
</script>
If you use a schema.org type that has a corresponding Bioschemas profile, we recommend using the Bioschemas profile as it will help you focus on the most common and useful properties. However, if there is a type in schema.org that does not have a Bioschemas profile, we suggest finding examples on how others have used them for a case similar to yours.
In this case we have used schema:SoftwareSourceCode
with some properties that are also present for bioschemas:ComputationalTool
and that will provide some minimum information to anyone willing to reuse our source code, including citation information.
Whenever you use a Bioschemas profile, go to the corresponding page, we have used:
The minimum information that you should provide are the "minimum" properties, and, whenever possible, it is a good idea to also provide the "recommended" ones. Optional properties are up to you. Another important piece of information is the cardinality, whether or not you could use a single object or an array.
You can find examples and other useful links for each Bioschemas profile
And also live deploys using the profile
If you use Bioschemas markup on your website, you can also add it to the "live deploy" list. TODO
Once the markup is in place and the page is live, you can use the developers tools in a browser to visualize the markup. You can use the inspect
tool in e.g., Chrome, Firefox or Edge.
Go to the software page and use the inspect
tool for any element. You should see the HTML.
With CTRL+F
(or command key + F
on Macs) to search and find context
. The first one is added by GitHub pages, and describes the page as a schema:WebPage
.
The second occurence will be the markup corresponding to the schema:SoftwareSourceCode
and bioschemas:ComputationalTool
manually added to the block <script type="application/ld+json">
.
If you look closely, you will notice that we have the elementt bioschemas:ComputationalTool
fully inside the element schema:SoftwareSourceCode
rather than a reference via the "@id": "https://doi.org/10.5281/zenodo.7341391"
as we did above. Both ways are valid.
This is how we have embedded the Bioschemas markup on the page
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "SoftwareSourceCode",
"@id": "https://github.com/zbmed-semtec/TREC-doc-2-doc-relevance",
"citation": "Talha M, Geist L, Fellerhof T, Ravinder R, Giraldo O, Rebholz-Schuhmann D, et al. TREC-doc-2-doc-relevance [Software source code]. GitHub; 2022.",
"name": "TREC-doc-2-doc-relevance",
"description": "This is the software source code facilitating the creation of a doc-2-doc relevance assessment on PMIDs used in the TREC 2005 Genomics track along with its metadata.",
"url": "https://github.com/zbmed-semtec/TREC-doc-2-doc-relevance#",
"targetProduct": {
"@type": "SoftwareApplication",
"@id": "https://doi.org/10.5281/zenodo.7341391",
"http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/profiles/ComputationalTool/1.0-RELEASE",
"identifier": "DOI:10.5281/zenodo.7341391",
"citation": "Talha M, Geist L, Fellerhof T, Ravinder R, Giraldo O, Rebholz-Schuhmann D, et al. TREC-doc-2-doc-relevance assessment interface. Zenodo; 2022. doi:10.5281/zenodo.7341391",
"name": "TREC-doc-2-doc-relevance assessment interface",
"description": "The code, data and docs at this release aim at facilitating the creation of a doc-2-doc relevance assessment on PMIDs used in the TREC 2005 Genomics track. A doc-2-doc relevance assessment takes one document as reference and assess a second document regarding its relevance to the reference one. This doc-2-doc collection will be used to evaluate the doc-2-doc recommendations approaches that we are working on.",
"url": "https://zenodo.org/records/7341391",
"softwareVersion": "1.0.0",
"datePublished": "2022-11-21",
"license": {
"@type": "CreativeWork",
"@id": "http://spdx.org/licenses/MIT",
"name": "MIT License",
"url": "https://opensource.org/license/mit/"
},
"author": [
{"@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha"},
{"@id": "https://orcid.org/0000-0002-2910-7982"},
{"@id": "https://orcid.org/0000-0002-8725-1317"},
{"@id": "https://orcid.org/0009-0004-4484-6283"},
{"@id": "https://orcid.org/0000-0003-2978-8922"},
{"@id": "https://orcid.org/0000-0002-1018-0370"},
{"@id": "https://orcid.org/0000-0003-3986-0510"}
]
},
"license": {
"@type": "CreativeWork",
"@id": "http://spdx.org/licenses/MIT",
"name": "MIT License",
"url": "https://opensource.org/license/mit/"
},
"author": [
{"@id": "https://zbmed-semtec.github.io/previous_members/#muhammad-talha"},
{"@id": "https://orcid.org/0000-0002-2910-7982"},
{"@id": "https://orcid.org/0000-0002-8725-1317"},
{"@id": "https://orcid.org/0009-0004-4484-6283"},
{"@id": "https://orcid.org/0000-0003-2978-8922"},
{"@id": "https://orcid.org/0000-0002-1018-0370"},
{"@id": "https://orcid.org/0000-0003-3986-0510"}
]
}
</script>
Go to Schema validator. Add the URL for either the software page or the dataset page and click on "Run Test"
In the case of the software page, it identifies and validates the two upper objects corresponding to schema:WebPage
and schema:SoftwareSourceCode
but it does not find bioschemas:ComputationalTool
(which corresponds to a schema:SoftwareApplication
)
You can run the test again with a "Code Snippet" by copying and pasting the code embedded on the page, the results will be the same. However, if you try and validate a code snippet containing only one element, the inner element bioschemas:ComputationalTool
, the Schema Validator will identify and validate it (tested on 2024-02-05).
There is another schema.org validator, currently the one recommended by Google. Go to Rich Results Test and use the same URL as before, and click "Test URL".
If you try the software page, it will identify some errors with schema:SoftwareApplication
. Yes, it does identify the inner element! As there is an error, only the error will be shown, nothing regarding the other elements. It looks like our bioschemas:ComputationalTool
is missing some important elements (for Google, and you will see also for Bioschemas).
If you try the dataset page, one correct element will be identified.
Did you notice that nor the Schema Validator neither the Rich Results Test pay attention to the schema:WebPage
element?
Let's try now a validator specific for Bioschemas. Go to FAIR-Checker and click on the blue button "Inspect".
On Step 1, add the URL for either the software page or the dataset page and click on "Build Knowledge Graph"
Once you can see that the Knowledge Graph is ready in Step 2
Jump directly to step 3, select "Bioschemas" and click "Check Bioschemas"
Only Bioschemas profiles will be checked so, in this case bioschemas:ComputationalTool
and bioschemas:Dataset
but not schema:SoftwareSourceCode
.
We are missing some of the recommended properties and the validator indicates this as improvements.
If you want to make it easier for machines to crawl your website (and this could be a good idea when you have markup and you want to enable easy integration into aggregators), consider adding a sitemap.xml to your website.
We have used XML-Sitemaps to create the one for the GitHub pages used in this tutorial.
Our sitemap looks like
<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->
<url>
<loc>https://zbmed-semtec.github.io/bioschemas-ghpages-markup-tutorial/</loc>
<lastmod>2024-02-05T14:31:07+00:00</lastmod>
<priority>1.00</priority>
</url>
<url>
<loc>https://zbmed-semtec.github.io/bioschemas-ghpages-markup-tutorial/software.html</loc>
<lastmod>2024-02-05T14:31:07+00:00</lastmod>
<priority>0.80</priority>
</url>
<url>
<loc>https://zbmed-semtec.github.io/bioschemas-ghpages-markup-tutorial/dataset.html</loc>
<lastmod>2024-02-05T14:31:07+00:00</lastmod>
<priority>0.80</priority>
</url>
</urlset>
Try and modify the markup to describe a source code of your own. If you have not created a release yet, use only schema:SoftwareSourceCode
. If you have a release or if you prefer, you can omit the markup for the source code and use only bioschemas:ComputationalTool
for your release. Or you can provide some markup describing a dataset you have created for this you would use bioschemas:Dataset
.
Are you interested in using Bioschemas markup for real on your web pages? Join the community, you can send questions to the mailing list or more informally on the Slack channel.
We covered a manual addition of schema.org and Bioschemas markup to GitHub pages. However, websites many times are created by a production pipeline. In those cases, markup creation should be integrated to it.
We also briefly mentioned sitemaps with much of a context. We invite to get more information on what they are and why they are useful.
This tutorial is based on bioschemas-github-markup-example and Adding schema.org to a GitHub Pages site.
LJC has received fundings from the German Research Foundation (DFG) via the grant for NFDI4DataScience No. 460234259
We use free SVG icons from Font Awesone