Using version control

stuckyb edited this page Oct 31, 2017 · 5 revisions

Using OntoPilot with version control software

  1. Introduction
  2. Using git (and other version control systems) with OntoPilot
  3. Using git with binary spreadsheet files
    1. Setting up a diff driver for tables2txt
    2. Telling git when to use the diff driver
    3. Putting it all together
  4. Improving performance

Introduction

When developing ontologies, it is often very useful to keep track of how an ontology changes over time. For example, you might want to know when a particular feature was added, or you might even want to revert to an earlier version of the ontology. If you are working with a team of developers, you might also want to know which developer was responsible for a given change to the ontology.

These are common problems in software development in general, and the usual solution is to manage a project with version control software. There are many version control systems available, but at their most basic, they all:

  • Keep track of how files change over time, including who makes the changes.
  • Allow developers to compare different versions of a file to see how they differ (that is, to see what changed from an older version to a newer version).
  • Make it possible to recover earlier versions of a file.
  • Provide mechanisms for multiple developers to collaborate on a project without stepping on each other's toes.

The remainder of this page is geared toward OntoPilot users who are already familiar with the basics of version control and would like to use version control for their OntoPilot ontology projects. If you are not familiar with version control software, I hope the very brief description above piqued your interest enough to learn more. There are lots of resources available online for learning version control basics, but I'd recommend starting with the documentation for git, which includes plenty of nicely presented material for beginners. Git is currently one of the most popular version control systems, is completely free, and runs on all major operating systems, so learning about git is a great place to start. Once you've gained some familiarity with git, return to this page to learn more about using git with your OntoPilot projects.

If you are already familiar with version control and want to use it for your OntoPilot ontology projects, read on. Note, though, that the solutions presented on this page are all intended for git, which is probably the most popular open-source version control system in use today (thanks, in large part, to the success of web-based collaboration sites like GitHub and GitLab). If you use other version control systems, such as Mercurial, Subversion, or a proprietary system, you should be able to adapt the ideas presented here.

Using git (and other version control systems) with OntoPilot

Using git with an OntoPilot project is not fundamentally different from using git with any other software development project. Once you turn your OntoPilot project directory into a git repository (e.g., with the git init command) and add your source files to the repository (with the git add and git commit commands), git will automatically keep track of changes to your source files.
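
As a minimal sketch of that workflow, the commands below create a scratch directory, turn it into a git repository, and commit one source file. The directory (made with mktemp), the file contents, and the user name and e-mail are placeholders chosen for illustration; in a real project you would run git init inside your existing OntoPilot project directory.

```shell
set -eu
# Scratch directory standing in for an OntoPilot project (placeholder).
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Repository-local identity so the sketch does not depend on global settings.
git config user.name "Example User"
git config user.email "user@example.org"
# Placeholder source file.
printf '%s\n' 'Type,ID,Label,Text definition' > example_classes.csv
git add example_classes.csv
git commit -q -m "Add example source file"
# One line per commit; at this point there is a single entry.
git log --oneline
```

From here on, git status and git log will report any further changes to the committed file.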

The chief difficulty you are likely to encounter is that some source file types do not automatically work as nicely as we might like with git or other version control systems. The problem is that the routines git uses to compare two versions of a file assume that the file is in a plain text format for which line-by-line comparisons make sense. This is almost always the case for programming language source code, but it is quite often not the case for spreadsheet documents. The default spreadsheet file formats used by LibreOffice, OpenOffice, and Microsoft Excel all result in so-called binary files that git doesn't know how to interpret.

For example, suppose you have a source file in Excel format called example_classes.xlsx with the following contents:

Type | ID | Label | Text definition
Class | EO:0000001 | example class | A meaningless example class.

Now, suppose you commit the file to a git repository and then add another row to the spreadsheet:

Type | ID | Label | Text definition
Class | EO:0000001 | example class | A meaningless example class.
Class | EO:0000002 | example class 2 | A second meaningless example class.

Then, you could save the spreadsheet file and ask git to show you the differences between the current version of the file and how it looked at the last commit by running the command

$ git diff example_classes.xlsx

You would get output that looks something like this:

diff --git a/example_classes.xlsx b/example_classes.xlsx
index e711964..d87ec0c 100644
Binary files a/example_classes.xlsx and b/example_classes.xlsx differ

In other words, git is telling you that the two versions are different (see the last line of the output), but it cannot show you any details about how they differ. It is important to emphasize that this does not affect git's ability to manage the spreadsheet file. You can still use git to retrieve earlier versions of the spreadsheet and see who committed changes to the spreadsheet, but you can't have git tell you anything about how the contents of the spreadsheet have changed over time.

One solution to this problem is to use a plain-text spreadsheet format, such as comma-separated values (CSV), for your OntoPilot source files. If you repeat the example above but use CSV format instead of Excel format, the output of git diff example_classes.csv will look something like this:

diff --git a/example_classes.csv b/example_classes.csv
index c073a13..ea7f1d4 100644
--- a/example_classes.csv
+++ b/example_classes.csv
@@ -1,2 +1,3 @@
 Type,ID,Label,Text definition
 Class,EO:0000001,example class,A meaningless example class.
+Class,EO:0000002,example class 2,A second meaningless example class.

This is exactly what we want. Git is showing us that the most recent version of the file has one new line (i.e., one new table row, the last line marked with a + at the beginning), and git is also showing us the contents of the new row and where it is located in the file. CSV files are supported by all spreadsheet programs, and for many projects, this solution might be all you need.
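
If you would like to try the CSV example yourself, the following sketch reproduces it in a throwaway repository. The scratch directory, commit message, and user identity are placeholders; the index hashes in your output will differ from the ones shown above.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.org"
# First version of the table: header plus one class.
printf '%s\n' \
  'Type,ID,Label,Text definition' \
  'Class,EO:0000001,example class,A meaningless example class.' \
  > example_classes.csv
git add example_classes.csv
git commit -q -m "Add first example class"
# Append the second row, then compare the working copy with the last commit.
printf '%s\n' \
  'Class,EO:0000002,example class 2,A second meaningless example class.' \
  >> example_classes.csv
git diff example_classes.csv
```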

CSV files do have some drawbacks, however. For ontology development, the two most serious drawbacks are that they do not support any sort of text or cell formatting, and they do not support formulas. Formatting can be extremely useful for making source spreadsheets easier for humans to read, and formulas are useful for implementing ontology design patterns, where all rows in a source spreadsheet follow a basic template and spreadsheet formulas are used to "fill in the blanks". For these features, you need to use full-fledged binary spreadsheet formats, which, as we've already seen, do not work so nicely with git. This is not the end of the story, though, and the next section discusses how to improve this situation.

Using git with binary spreadsheet files

Although git doesn't know how to interpret binary spreadsheet formats by default, we can tell git how to use an external program to read the spreadsheet files in a plain-text format that git can compare. OntoPilot includes a command-line utility called tables2txt that serves just this purpose. Once you install OntoPilot and add OntoPilot's bin directory to your path, you will automatically have tables2txt available, too. All that remains is to tell git how to use tables2txt to read spreadsheet files. This requires two steps: 1) setting up a diff driver for tables2txt; and 2) telling git when to use the driver. These steps are easy and take only a minute or two.

Setting up a diff driver for tables2txt

The first step is to set up a so-called diff "driver" for git to use that points to the tables2txt program. We will call the driver "tables2txt", so the command to run is:

$ git config diff.tables2txt.textconv tables2txt

The above command assumes that tables2txt is in your path. If not, you will need to provide the full path to the tables2txt executable, and the command would look something like this:

$ git config diff.tables2txt.textconv /path/to/ontopilot/bin/tables2txt

Note that if you want the new diff driver to be available to all git repositories on your machine, you need to add the --global option to the command, like this:

$ git config --global diff.tables2txt.textconv tables2txt
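
To confirm that the driver was registered, you can read the setting back. The sketch below does this in a scratch repository (a placeholder; in practice you would run the two checks inside your own repository). It also checks whether tables2txt is reachable through your path, which is the assumption the short form of the command relies on.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository, for illustration only
cd "$repo"
git init -q
git config diff.tables2txt.textconv tables2txt
# Read the setting back; --show-origin also prints the config file that defines it.
git config --show-origin --get diff.tables2txt.textconv
# Warn if the converter cannot be found through the path.
command -v tables2txt >/dev/null \
  || echo "tables2txt is not on your path; use the full-path form of the command"
```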

Telling git when to use the diff driver

Now that we've created a new git diff driver for tables2txt, we need to tell git when to use the driver. This is done by creating a file called .gitattributes that associates particular file types with the new driver. Specifically, we want to tell git that comparisons of binary spreadsheet files should always be done by reading the files using tables2txt.

To do this, simply create a new text file in the root of your git repository, name the file .gitattributes, and paste the following contents into the file:

*.ods diff=tables2txt
*.xls diff=tables2txt
*.xlsx diff=tables2txt
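
If you prefer to create the file from the command line, a sketch like the following writes the same three lines and then asks git which diff driver it would use for a given file name. The scratch repository and the file names passed to git check-attr are placeholders; the files do not need to exist for the check to work.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository, for illustration only
cd "$repo"
git init -q
# Write the attribute rules shown above.
printf '%s\n' \
  '*.ods diff=tables2txt' \
  '*.xls diff=tables2txt' \
  '*.xlsx diff=tables2txt' \
  > .gitattributes
# Ask git which diff driver applies to each path.
git check-attr diff -- example_classes.xlsx example_classes.csv
```

The spreadsheet file should be reported with the value tables2txt, while the CSV file should be reported as unspecified, meaning git will keep using its built-in text comparison for it.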

Putting it all together

Now, you can run git diff on your binary spreadsheet files and get back useful results just as for CSV files. Continuing with the example developed earlier, recall that before, if we ran

$ git diff example_classes.xlsx

git simply told us that the "binary files differ" but couldn't tell us anything about how they were different. After setting up the diff driver and .gitattributes file, we can run the same command again and git will now produce output that looks something like this:

diff --git a/example_classes.xlsx b/example_classes.xlsx
index e711964..d87ec0c 100644
--- a/example_classes.xlsx
+++ b/example_classes.xlsx
@@ -1,3 +1,4 @@
 Table: Example classes
 type,id,label,text definition
 Class,EO:0000001,example class,A meaningless example class.
+Class,EO:0000002,example class 2,A second meaningless example class.^M

Just as with files in CSV format, we can now see exactly how the spreadsheet file changed from one version to the next.

This solution is not perfect, of course. The biggest drawback is that it only compares the values of spreadsheet cells, which means that changes to formats or formulas will not be reported. Note that if the cell formats or formulas in a file change, git will still detect that the file changed and manage it as expected. You simply will not be able to see these changes when you run the git diff command. In practice, this limitation is rarely a problem. Most changes to source files that you are likely to care about will involve changes of cell values, and for those cases, the solution presented here works reasonably well.
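
The following sketch illustrates this behavior without needing OntoPilot or a real spreadsheet. It is only an analogy: the driver name firstline, the converter head -n 1, and the file sheet.txt are all stand-ins invented for this example. Because the stand-in converter shows only the first line of a file, a change to the second line plays the role of a formatting-only change that tables2txt would not report.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.org"
# Stand-in converter (an assumption, not tables2txt): it outputs only line 1,
# so edits to later lines are invisible to it.
git config diff.firstline.textconv "head -n 1"
printf '%s\n' '*.txt diff=firstline' > .gitattributes
printf '%s\n' 'cell values' 'format: plain' > sheet.txt
git add .gitattributes sheet.txt
git commit -q -m "Add stand-in sheet"
# Change only the part that the converter ignores.
printf '%s\n' 'cell values' 'format: bold' > sheet.txt
git status --short sheet.txt       # the file is still reported as modified
git diff sheet.txt                 # converted text is identical, so no changed lines
git diff --no-textconv sheet.txt   # bypasses the converter and shows the raw change
```

With a real binary spreadsheet, the --no-textconv form falls back to the "Binary files ... differ" message shown earlier, which at least confirms that something in the file changed.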

Improving performance

One minor downside of using the tables2txt diff driver is that generating diffs of spreadsheet files is relatively slow because of script start-up times. For occasional change comparisons, this isn't a big problem, but if you want to look at a large number of comparisons (e.g., by running git log -p), it can be annoying. You can improve performance by telling git to cache the results of running the textconv filter. To do this, run

$ git config diff.tables2txt.cachetextconv true

If you want to enable caching globally, add the --global flag:

$ git config --global diff.tables2txt.cachetextconv true

Now, the next time you run git diff on a binary spreadsheet file, git will cache the results of reading the file using tables2txt. Thereafter, if git needs to access the tables2txt output for the file, the output will be read from the cache and available almost instantly. In my testing, caching was able, in some cases, to reduce git diff run times from more than 10 seconds to less than a hundredth of a second.

The cached output is stored as part of a project's git repository in the .git folder. If for any reason you'd like to clear the cache, you can run:

$ git update-ref -d refs/notes/textconv/tables2txt
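
To see the cache appear and disappear, you can experiment in a scratch repository. In the sketch below, cat is used as a stand-in converter and a CSV file as a stand-in spreadsheet so that the commands run without OntoPilot installed; both substitutions, along with the scratch directory and user identity, are assumptions made for illustration. The cache reference name is the same one used in the command above.

```shell
set -eu
repo="$(mktemp -d)"   # scratch repository, for illustration only
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.org"
# Stand-in converter: cat replaces tables2txt so the sketch runs anywhere.
git config diff.tables2txt.textconv cat
git config diff.tables2txt.cachetextconv true
printf '%s\n' '*.csv diff=tables2txt' > .gitattributes
printf '%s\n' 'Type,ID,Label,Text definition' > example_classes.csv
git add .gitattributes example_classes.csv
git commit -q -m "Add example table"
printf '%s\n' 'Class,EO:0000001,example class,A meaningless example class.' >> example_classes.csv
git commit -q -a -m "Add first example class"
# Diffing committed versions runs the converter and fills the cache.
git log -p -- example_classes.csv > /dev/null
ref=refs/notes/textconv/tables2txt
if git show-ref --verify --quiet "$ref"; then cached_before=yes; else cached_before=no; fi
# Clear the cache, then check again.
git update-ref -d "$ref"
if git show-ref --verify --quiet "$ref"; then cached_after=yes; else cached_after=no; fi
echo "cache present before clearing: $cached_before, after clearing: $cached_after"
```

Clearing the cache is harmless: git simply rebuilds it the next time it needs the converted output.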