What can the CDU data science team do to verify its outputs, disseminate learning, and support individual development in team sessions?
We’ve been a proper data science team for a year now. There are six of us at the moment, which is GREAT, and we’re starting to review some of the things we’ve been doing to make sure they’re serving a purpose. I’m writing this partly for our own benefit, as a way of recording what the plan is (and to discuss the plan via pull request 😎), but we work in the open so why not talk about this in the open too.
Fairly near the beginning of becoming A Proper Data Science Team we kicked off “Code review” sessions, which took place fortnightly, with team members taking it in turns to bring something. I hardly think I can improve on Google’s guide to code review but I’ll summarise the main points here for those who don’t click through. Code review is the process in which one or more people who did not write the code in question read it and check that it’s okay before it is merged into the codebase. Code reviewers are considering the design, functionality, complexity, tests, naming, comments, style, and documentation of the code.
In the linked document Google emphasise that you should read every line, not scan over things and assume they’re okay. It’s worth remembering as well that we are doing data science, so we need to be reviewing statistical and ML methods too. People (including me) fall into the trap of thinking that data science is just programming. They stop asking themselves hard questions about the methods they are using, and that is very dangerous.
I put code review in quotes advisedly because, although we went into the sessions thinking we would be doing code review, over the weeks and months we ended up doing something totally different. Sometimes we would pretty much do code review. Sometimes we would spend a long time discussing code style. Other times we would start off doing code review and end up doing an ad hoc lesson in a particular method that the person who brought the code had used. Sometimes we would have such a good time doing one of these things that we couldn’t fit all the excitement into an hour and would call extraordinary code review meetings so we could carry on discussing it and not have to wait two weeks.
It was pretty anarchic but we were all learning and having fun in a supportive environment so I thought we may as well just see where we ended up. After a year I thought it was time to review what we were doing around assuring our code so I kicked off a discussion about it which this blog post is part of.
We have had a pretty wide-ranging discussion about it. The first point of interest is that everybody wanted to keep the fortnightly sessions, but everybody agreed that, however awesome they were, they weren’t really “code review”. We boiled down what we feel we need to four distinct activities: pair programming, code review, journal club, and team time. For the sake of brevity I will summarise them here. I could probably do a blog post on each, and if anybody would like to hear more then find me on Twitter (have a look at the about page).
Pair programming is what it sounds like- programming in pairs. It’s a great way of teaching and helping each other, and solving problems together, but it’s also a way of reviewing code too. The maxim we came up with today is simple.
Every line of code should have been read by two people
You can do that in a pair, live, or you can do it by review (or you can do both). Interestingly enough the team just spontaneously started doing pair programming. Nobody mentioned the word, nobody asked them to. It just makes sense to do that so they just started doing it, which I absolutely love.
Code review is the other review methodology that came out of the discussion. I already talked about what Google think this should be so there’s not much to add. The one thing we agreed needs to change is doing it fortnightly as a group. Proper code review can only be done by someone who understands the code. We have a LOT of skills in the team (SQL, R, Python, Shiny, statistics, ML) and nobody understands it all (especially me). We decided that code review would happen when people feel they’re ready and would take place with one or more designated people who understand the code, and not every fortnight with everybody.
One interesting thing that came out of this discussion was my realisation that we need to deepen and broaden the skills of the team. Several of us are writing code that nobody has the expertise to review. I’ve long been obsessed with truck factor but I hadn’t considered it from the position of validating code before. It’s entirely my failing as a manager and it’s something else to think about when we’re recruiting next (it’s also relevant for training and development, but that’s another post).
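One rough way to see where review expertise is thin is to write down who knows what and look for skills that only one person has. The sketch below is only an illustration: the people and the skills matrix are made up (they are not our actual team), and the function names and the threshold of two people per skill are my own assumptions, borrowed from the “two people” maxim above.

```python
from collections import defaultdict

# Hypothetical skills matrix: the names and entries are invented for illustration.
team_skills = {
    "analyst_a": {"SQL", "R", "Shiny"},
    "analyst_b": {"R", "statistics"},
    "analyst_c": {"Python", "ML"},
    "analyst_d": {"SQL", "Python"},
}


def reviewers_per_skill(skills_by_person):
    """Map each skill to the sorted list of people who have it."""
    people_by_skill = defaultdict(list)
    for person, skills in skills_by_person.items():
        for skill in skills:
            people_by_skill[skill].append(person)
    return {skill: sorted(people) for skill, people in people_by_skill.items()}


def unreviewable_skills(skills_by_person, minimum=2):
    """Return skills held by fewer than `minimum` people.

    With minimum=2, these are skills where the author has nobody else
    who could review their code (an assumed, deliberately crude rule).
    """
    coverage = reviewers_per_skill(skills_by_person)
    return sorted(skill for skill, people in coverage.items() if len(people) < minimum)


print(unreviewable_skills(team_skills))
# ['ML', 'Shiny', 'statistics']
```

Anything that shows up in that list is both a truck factor problem and a code review problem, which is a reasonable starting point for deciding what to train or recruit for.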
Something else that we spontaneously did as part of the code reviews that weren’t really code reviews was what we’re loosely calling “journal club”- basically one of “I know something that you need to know, I’m going to teach you” or “we all need to get better at x- let’s learn together”. We’re going to start with me: “Everything your data scientists wanted to know about managing servers and deploying in the cloud- but were afraid to ask” which will end up on GitHub somewhere.
The last one is team time. Not sure about the name for this one! It was the last thing that everybody seemed to want, and it’s something else we were already doing. It’s basically the ability to bring anything you like and talk about it. We’ve had all sorts of things. “This code is fine, it’s just horribly slow”, “I can’t figure out what is going to give the best experience for our users”, “I’ve done this analysis but it’s kind of shallow- what else can I do?”. I don’t think this is anything particular, it doesn’t have a proper name, it isn’t particularly for one purpose, it’s more just the team’s way of saying “We help and support each other and we carve out one hour every two weeks so team members can bring a problem and talk it through with us”.
That’s where we’re at now. Once we’ve agreed this as the way forward (or agreed something else, obviously) we’ll give it a go. As is probably clear, our Friday sessions and code review are just one part of trying to have a team that works well together, and I’m totally new at this, so if anybody out there has anything to add I’d be super grateful to hear it.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Beeley (2021, July 30). CDU data science team blog: Pair programming, code review, journal club, and team time. Retrieved from https://cdu-data-science-team.github.io/team-blog/posts/2021-07-30-pair-programming-code-review-journal-club-and-team-time/
BibTeX citation
@misc{beeley2021pair,
  author = {Beeley, Chris},
  title = {CDU data science team blog: Pair programming, code review, journal club, and team time},
  url = {https://cdu-data-science-team.github.io/team-blog/posts/2021-07-30-pair-programming-code-review-journal-club-and-team-time/},
  year = {2021}
}