Being Tidy With Github
Using Github properly is a bit of an art. It's great that data science is standardizing around Git the same way it's standardizing around Python. However, like Python, Git can be used well or badly, and it's important to impose discipline in pursuit of the former.
There are a variety of Github branch management strategies, but they all have one thing in common - branches are "born to die". A fairly sophisticated branching strategy is visualized here. You can see that there are only two perpetual branches - develop and master. Other branches have a limited lifespan, and are killed when they are merged into develop (release and hotfix branches are killed when they are merged into both develop and master simultaneously). Develop will occasionally merge into master, but it is never deleted. A nearly identical two-perpetual-branch protocol (differing largely in semantics) is described here.
Are two perpetual branches needed? No. A simpler one-perpetual-branch strategy is documented here. The author there describes this as GitHub Flow, with the prior example being GitFlow. I have worked with teams that followed both two-perpetual-branch and one-perpetual-branch protocols. There are strengths and weaknesses to both approaches. I would say the two-branch protocol makes more sense as the team gets bigger, and as the run time of the full unit test suite gets longer. For the one-branch strategy, it is important that the unit tests always pass on the perpetual branch. For two branches, this burden is only strictly observed for the master branch. It is a deep cultural failure of the team to make a release that doesn't pass the unit tests, and it is assumed that the unit tests will be of high quality.
It is also very important that all releases be tagged, and that no two tags share the same __version__ for the package. It doesn't really matter when or how the __version__ is incremented, so long as the __version__ of a given tag is consistent with the name of the tag itself. Of course, we are not barbarians... we will never deploy code to a client that isn't tagged for release. Unfortunately, it is not uncommon to encounter deployment tools (aka app building platforms) that encourage (or even require) an app developer to edit the app within the tool itself. Since such editing is not reflected in the tagged version of the source code, such functionality will in fact do active harm to your ability to troubleshoot client experiences.
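For example, here is a minimal sketch of a pre-release sanity check that the git tag being released agrees with the package's __version__ (the tidy_engine package name and the v-prefixed tag convention are hypothetical assumptions, not part of this guide):

```python
# A minimal sketch, assuming a hypothetical package named tidy_engine and tags
# named like "1.2.3" or "v1.2.3". Run from a checkout of the commit being released.
import subprocess

import tidy_engine  # hypothetical package under release

# "git describe --tags --exact-match" fails unless HEAD is itself tagged.
tag = subprocess.check_output(
    ["git", "describe", "--tags", "--exact-match"], text=True
).strip()

assert tag.lstrip("v") == tidy_engine.__version__, (
    f"tag {tag!r} does not match __version__ {tidy_engine.__version__!r}"
)
print(f"OK: tag {tag} matches __version__ {tidy_engine.__version__}")
```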
The two perpetual branch protocol I have used the most is nearly identical to the one described above, with the primary difference being the use of hotfix branches. Under this protocol, hotfix branches are used rarely, and only for highly targeted fixes to truly dangerous bugs. As before, the hotfix branch is forked from the most recent tagged release. The developer, however, is required to make similar (usually identical) commits to both the hotfix and develop branches, in order to ensure the bug is fixed in both places. (Remember, if the changes are significant, then just don't use a hotfix, and instead require users to wait for the next release in order to see the bug fix). Once a hotfix branch has addressed the bug, its __version__ attribute is altered in an unusual way (usually by appending the final commit ID) and the result is deployed to users directly. In this case we are not tagging the release, but the appended SHA will identify it for us. The users are also informed that they are being accommodated with a stopgap measure and that they will be required to migrate to the next proper release in order to maintain support. The hotfix branch can simply be destroyed subsequent to the next tagged release, since this protocol requires the developer to fix the bug in both the hotfix and develop branches.
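As a concrete illustration, here is a minimal sketch of stamping a hotfix build with the final commit's short SHA before deploying it (the tidy_engine package name and file layout are hypothetical assumptions):

```python
# A minimal sketch, assuming a hypothetical package directory named tidy_engine
# whose __init__.py defines __version__. Run from the root of the hotfix branch.
import re
import subprocess
from pathlib import Path

# Short SHA of the final hotfix commit, e.g. "a1b2c3d".
sha = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], text=True
).strip()

init_file = Path("tidy_engine") / "__init__.py"
text = init_file.read_text()

# Turn __version__ = "1.2.3" into __version__ = "1.2.3+a1b2c3d" so the deployed
# stopgap build is identifiable even though it is not a tagged release.
text = re.sub(
    r'__version__\s*=\s*"([^"]+)"',
    lambda m: f'__version__ = "{m.group(1)}+{sha}"',
    text,
)
init_file.write_text(text)
```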
This guide strikes me as close to canonical. Here are some notes on how you might tweak this structure.
- I prefer markdown files (.md) to .rst files.
- I think a docs section is not strictly needed, so long as the public components of your Python package have good docstrings, and you provide an examples directory with notebooks or short glue scripts that exercise your package. Any such example file should assume your package is findable via the Python path.
- Moreover, the tests section will contain a great many sample data sets, so you might be fine with neither a docs nor an examples directory. Bear in mind, a ticdat powered solve function contains quite a bit of programmatic self-documentation via the companion input_schema and solution_schema objects (see the sketch after this list).
- The tests section might be included inside the package directory itself (so long as doing so doesn't bloat the package noticeably). This is a matter of taste; there is nothing wrong with making it a sibling to the package directory.
- If the engine is itself simple enough to be a single file, then it isn't strictly necessary to even have a package directory. In this case, the engine.py file would be at the top level of the repo directory, and a sibling to the tests directory.
- We will of course have a package directory if the engine code is too big for one file. We will of course distribute all of the production code as part of the package in this case.
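Here is the sketch referenced above - a minimal ticdat powered solve function, showing how the companion input_schema and solution_schema objects document the engine's inputs and outputs. The table and field names (and the trivial "solve" logic) are hypothetical placeholders, not taken from any particular model:

```python
# A minimal sketch of a ticdat powered solve function. The table and field
# names are hypothetical placeholders.
from ticdat import TicDatFactory

# input_schema documents the tables, primary key fields, and data fields the
# engine expects as input.
input_schema = TicDatFactory(
    facilities=[["Facility Name"], ["Capacity"]],
    sites=[["Site Name"], ["Demand"]],
)

# solution_schema documents the tables the engine returns.
solution_schema = TicDatFactory(
    assignments=[["Site Name"], ["Assigned Facility"]],
)

def solve(dat):
    """Placeholder solve: assigns every site to the first facility."""
    assert input_schema.good_tic_dat_object(dat), "invalid input data"
    sln = solution_schema.TicDat()
    first_facility = next(iter(dat.facilities), None)
    for site_name in dat.sites:
        sln.assignments[site_name] = {"Assigned Facility": first_facility}
    return sln
```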
In my experience, it is helpful to make and manage a single .pth file when developing code with a tidy Github repo. Your single, editable .pth file will be placed in the standard site-packages directory, alongside other .pth files which you will never edit. When you wish to extend (or shrink) the number of Github-based codes you are developing, you simply add, remove, or comment out entries in the list of repository directories in this .pth file. You are very unlikely to damage your Python installation by creating and editing just one .pth file. I find this technique to be a safe and convenient way to change the packages that can be universally imported, but I only use it for packages that I am myself working on.
This method is not the only way to skin this particular cat - the important thing is to make your repo directory part of the Python path. This allows the package you are developing to be importable from anywhere in your file system.
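As a rough sketch of this technique (the repository paths and the .pth file name below are hypothetical), the .pth file is just a list of repo directories, one per line, dropped into site-packages:

```python
# A minimal sketch, assuming hypothetical repo checkouts and a hypothetical
# .pth file name. A .pth file in site-packages simply lists directories that
# Python adds to sys.path at startup (lines starting with "#" are ignored).
import site
from pathlib import Path

# Hypothetical Github repos currently under development.
repo_dirs = [
    "/Users/me/repos/tidy_engine",
    "/Users/me/repos/another_engine",
]

# site.getsitepackages() returns the standard site-packages directories; writing
# there may require permissions, so a per-user or virtual environment is often
# the friendlier target.
pth_file = Path(site.getsitepackages()[0]) / "my_dev_repos.pth"
pth_file.write_text("\n".join(repo_dirs) + "\n")
print(f"Wrote {pth_file}")
```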
A .whl file is a very easy thing to build from a tidy Github repo. A very simple setup.py file is all that's needed to make a useful .whl file. The .whl file (which is not checked into the Github repo) is then used to distribute your Python packages to other developers who are pure consumers of your code. People collaborating with you on package development will of course clone the Github repo and use their own .pth file (or equivalent).
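For reference, a very simple setup.py might look like the following sketch (the package name, version, and dependency list are hypothetical placeholders):

```python
# A minimal setup.py sketch. The name, version, and dependencies below are
# hypothetical; the version should agree with the tagged __version__.
from setuptools import setup, find_packages

setup(
    name="tidy_engine",
    version="1.2.3",
    packages=find_packages(),
    install_requires=["ticdat"],
)
```

With a file like this in place, something along the lines of python setup.py bdist_wheel (or pip wheel .) will produce the .whl file.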