CDU data science team blog: GitHub Standard Operating Procedure

Chris Beeley

Introduction

We’ve been using GitHub in the CDU data science team since 8th May 2020 (I think- that’s the oldest commit I could find). Team members past and present have used git and GitHub to varying degrees, often solo. Data science is a team sport, without question, but like any team sport it requires proper coordination between participants to be effective. Git and GitHub are very powerful tools, but they are hard to use well even when you’re collaborating just with yourself, and working well with others is another set of skills in itself. We’ve all made a lot of mistakes and learned a lot in the last two years, and this post is designed to codify some of this learning into a Standard Operating Procedure (SOP) for the benefit of team members past and present. And of course, we share it for the general benefit of others who might learn something from our experiences, and also in the hope that those with more/ different skills and knowledge might be able to help us refine it further (get in touch through the usual channels ☺️).

Standard operating procedure

Branches

We adopted gitflow fairly early on as the team expanded and it worked very well for our purposes. I now see that there is a health warning on the linked post and gitflow is deprecated elsewhere in favour of something like GitHub flow. I’m not going to go into the differences here because it is not the focus of this post but there are some important differences and I can feel another blog post coming on about different models of using git and GitHub once I’ve discussed the matter with team members.

The essential feature of gitflow as we’re using it is that there are three types of branches -

main branch, which is the working code,
development, which is a kind of “staging” branch where code can be worked on and
tested and feature branches which either fix bugs or add new features

The main idea is that feature code is merged into the development branch and the development branch is periodically merged into the main branch. Each time you merge to main you do a point release with semantic versioning.

We’ve all found that it works pretty well but it is overkill for simpler repos. The development branch is not necessary, and you can merge straight to main from a feature branch for simpler things, especially where it is mainly one person writing, maintaining, and using the code. For important stuff that we all use though, particularly {nottshcData} the development branch is really useful because it means that we can all send PRs to a branch without worrying about untested code getting out into the wild.

With all that said how have we found it best to use it in practice? A few points:

Merge to development and main as frequently as possible. Especially from my point of view, as the team manager, I spend a lot of time on everyone’s repos, and if everything is tucked away in branches I can’t see it. * Documentation should go straight to main, there is no need to use branches at all, or if a branch is used it can be continuously merged to main (in case anyone else is writing documentation)
Branches should be deleted as soon as possible, for the same reason, especially if they include commits that are not on main. If they are ahead of main that implies that they could have useful stuff on, or be a failed experiment that needs binning. If you’re off sick and we’re running your code we don’t want to be wondering which it is

Collaboration

There should be one person in charge of each repo. That person is responsible for making sure that the main branch is clean, bug free, and properly documented. If a team member spots something on someone else’s repo that either doesn’t work or doesn’t look right they should either file an issue (if it’s something that might need discussing) or make a pull request (PR) - if it’s a simple fix to formatting or documentation - I send lots of PRs like this. It is the responsibility of the lead for the repo to check the PR and to merge it in in a timely fashion, request changes to it, or reject it.

Having a PR accepted is a privilege, not a right, so team members who want to improve a repo need to send a clean PR with a decent explanatory note in it. We might discuss over Slack or in person but in my opinion it’s better if that discussion happens on GitHub where possible because it means everyone, all team members and the open source community at large, can see it. I’m subscribed to all of the repos on our GitHub and I quite often spot areas for improvement or niggles by reading these discussions and it’s my job to understand the workflow and to suggest improvements.

Issues

Bug reports and feature requests should all be filed as issues, again to increase transparency of ongoing work (within and without the team). Issues should be clearly worded with enough detail and should wherever possible include a reprex. We were bad with issues in the early days, jotting notes to each other that made a lot of sense at the time but within a month or so they were incomprehensible even to the author. Like code commenting and documentation a little work now can save a lot of confusion in the long run.

Reprexes

I really can’t say enough about the reprex. The beauty of a reprex for me is that at least half the time making the reprex solves the problem. Boiling the problem down to its essential elements improves your understanding of the problem to such a degree that it either disappears completely or you can PR the bug fix yourself. And when it doesn’t it’s so valuable to have it as the person receiving the issue. If you don’t include a reprex, basically, don’t expect a fix, unless there’s a really good reason why you can’t include one.

Data security

Data security is obviously really important when you’re using GitHub, even if it’s a private repo. The simplest way to keep data security is don’t keep data in a git controlled folder. If you don’t do that, you will never accidentally push data to a repo. It’s impossible. A lot of our code just gets stuff straight from the server and processes it all in place, so there really is no need a lot of the time to save data anywhere.

If you absolutely must save data, for example if you have long running or complex data operations to complete, just make sure that you use .gitignore appropriately. My usual strategy is to have a folder called secret, and I add secret/ to .gitignore. Do that and check what you’re committing and pushing.

If the worst happens, and I should say the worst has never happened to us and I don’t think it ever will, you need to destroy all the commits with the data in.

Data security is a very important area and in my very inexpert opinion using GitHub doesn’t make data any less secure. Using GitHub means that you are always thinking about what’s in your data and it reduces sloppy practices like storing credentials in code (which you should NEVER do, even if you’re not using git- store them in environment variables. I read about so many information breaches from the NHS and they seem to be largely people emailing stuff about without concentrating. It’s complacency, they’re so used to everything being secure they forget what they’re doing.

A note on packages

This is not strictly about GitHub, but it fits in the general area of collaboration and so I wanted to include it. We have adopted R packages almost wholesale in the team (another blog post that needs to be written), but there is one minor problem that they introduce that I wanted to flag up here.

People get into the habit of building the package as they go, and then they use the built package for their analysis. This is fine as far as it goes, but you should never deploy or share anything based on code that you built yourself. Before you share or deploy you need to merge to main, switch to a new session and install the code from GitHub. This ensures that the code that you are running is the same code we all have access to. We’ve all spent far too long debugging code that only exists on one person’s laptop, and if you’re off sick and we redeploy your stuff you want to make sure we deploy what you actually want, because we can’t access the built stuff on your laptop.

Edit

To add to this there is a specific post on package workflow for development which has much of what is written here and acts as a reminder of all the steps to cover.

GitHub Standard Operating Procedure