Development of open tools for analysing healthcare data in R

Open source Teamwork Packages

TLTR: (Too Long To Read) Our goal was to make it easier to work with healthcare data in a reproducible and collaborative way. We wrote lots of R functions for recurring data manipulations and analytical tasks that magically translate into SQL code and communicate with large databases. All our functions are grouped into R packages because this made it easier for us to: (i) write good documentation of our code and analytical tasks, (ii) easily distribute updates across all team members, (iii) formally test our code, and (iv) integrate common data manipulations (or analyses) into interactive dashboards in a modular way.

Milan Wiedemann

Image of the words nottshcverse with curly brackets around it

Our team works with routinely collected NHS patient data. Currently we focus on understanding better how patients are using the service, changes in clinical outcome measures, and analysis of patient experiences. The main questions that guide our work are ‘What works for whom and how does it work?’, ‘What doesn’t work?’, and ‘How can we integrate patient experiences into our analyses?’. We developed a set of different tools, the {nottshcverse}, to help us look at these questions by automating recurring and time-consuming tasks so that we can spend more time thinking the clinical questions.

Challenges and solutions

Real data is messy but we should do our best to tidy the mess, where possible in an automated way. There are many challenges when working with healthcare data, here is only a small selection of those that I think underlies most analyses. Figure 1 shows our solutions to these challenges.

Main goals that guided the development of the {nottshcverse} packages.

Figure 1: Main goals that guided the development of the {nottshcverse} packages.

  1. 🤔 It’s really hard to get the clinical data that is needed in a reproducible way that is consistent across different people who work on the same (or related) analysis. This is particularly true because most of the data is stored on SQL servers and only starts to make sense after joining multiple different datasets. 🤓 We wrote functions that made it very easy and secure to (i) connect to and (ii) query data from different databases that we are working with. This way we could get the data we needed for our analyses using very few lines of code in R.
# First, create connection to databases
conn_s1 <- connect_sql(server = "DB-one")
conn_iapt <- connect_sql(server = "DB-two")

# Now we can use the connection to get contacts data from SystmOne ...
db_contacts_s1 <- get_s1_contacts(from = "2020-01-01", 
                                  to = "2020-12-31", 
                                  conn = conn_s1)

# ... and IAPTus databases
db_contacts_iapt <- get_iapt_contacts(from = "2020-01-01", 
                                      to = "2020-12-31", 
                                      conn = conn_iapt)

# Note that the objects 'contacts_s1' and 'contacts_iapt' are just pointing to 
# the databases (I like to use the prefix db) and not actually downloaded to the
# environment on your computer.
  1. 🤔 Most of the time, the data is not in the format that is needed for further analyses. There may be specific data manipulations that are needed to make sense of the data and analyse it properly. Also, different databases might be set up in ways that it is hard to merge data. 🤓 We wrote function that tidy the raw data from the databases so that it is more consistent across databases and easier to analyse. All of this still happens within the server so that our computers don’t have to do all of this work.
# Each get_*_data() function comes with a tidy_*_data(), here tidy_s1_contacts()
# Here I use the connection to the raw (messy) contacts  data that I created above
# and tidy it using the tidy_s1_contacts() function
db_contacts_s1 <- db_contacts_s1 %>% 

db_contacts_iapt <- db_contacts_iapt %>% 
  1. 🤔 The methods of data analyses and visualisations should be understandable, reproducible, and available to other people. Unfortunately this is not always the case yet because the software tools that are used are not script based and often shared in private emails. 🤓 We wrote R functions for common analytical tasks and visualisations.

  2. 🤔 Code should be really well documented so that it is easy to understand what’s going on. This includes the current version of the code as well as all previous versions and changes. This can be done using tools like Git and GitHub, but unfortunately most code is currently shared undocumented in private emails. 🤓 We created detailed documentations that are easily accessible to everyone who uses our R packages. Also, because develop our tools on GitHub, every change to our code is documented.

  3. 🤔 Mistakes happen! Sometimes things that you have no control over can change and break your code that previously worked fine (e.g., the format of the raw data or a functions that someone else wrote). Therefore, it’s important to continuously test whether the code is still working the way it’s supposed to work. 🤓 R packages (or similar solutions in other statistical programming languages) are relatively easy to test, for example using the {testthat} package. We started to implement tests into our work so that we can check if changes that we make to our code don’t break anything.

Overview of the {nottshcverse}

Overview of some R packages developed by the Clinical Development Unit Data Science Team⭐and the NHS-R Community ❤️

Figure 2: Overview of some R packages developed by the Clinical Development Unit Data Science Team⭐and the NHS-R Community ❤️

Simplified working example

To illustrate how R can be used to work with databases I’ll use the following example. Imagine we’re working with a database called SystmTwo (S2) and need to use two different tables for our analysis:

Create example data

# Set up example contacts table
contacts_s2 <- tibble(client_id = c(1, 1, 1, 2), 
                      contact_id = c(123, 124, 125, 156), 
                      referral_id = c(456, 459, 500, 501), 
                      referral_date = c("2018-04-19", "2019-05-23", 
                                        "2020-06-01", "2018-12-11"),
                      contact_date = c("2018-05-19", "2019-06-05", 
                                       "2020-07-08", "2019-01-15"),
                      team_id = c("tm1", "tm2", "tm1", "tm1"), 
                      hcp_id = c("hcp1", "hcp2", "hcp1", "hcp1"), 
                      contact_type = c("phone", "f2f", "video", "phone"),
                      assessment_id = c(321, 322, 344, NA))

# Set up example demographics table with 2 patients
demographics_s2 <- tibble(client_id = c(1, 2), 
                          dob = c("1988-01-01", "1965-01-01"),
                          dod = c(NA, NA),
                          sex = c("f", "m"))

Create SQLite connection

# Create connection (conn) to "local" database called SystmTwo (s2)
conn_s2 <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy local data frame to conn_s2 database
db_s2_contacts <- copy_to(conn_s2, contacts_s2)
db_s2_demographics <- copy_to(conn_s2, demographics_s2)

Example analysis

Here we join the contacts with the demographics information to calculate the age at the time a patient has their contact (age_at_contact).

# Calculate age at time of contact
db_age_at_contacts <- db_s2_contacts %>% 
  left_join(db_s2_demographics) %>% 
  mutate(age_at_contact = as.Date(contact_date) - as.Date(dob))

Writing a function

We can also write our own functions and use them in a modular way whenever we need them. Here’s a simple example to demonstrate how we can do the same calculation as shown above using our own function. In this example the function arguments take the variable names for date of birth (dob) and the contact date (contact_date).

calc_age_at_contact <- function(data, var_dob, var_contact_date) {
  # Add code here to check that arguments are specified correctly
  data %>% 
    dplyr::mutate(age_at_contact = as.Date({{var_contact_date}}) - as.Date({{var_dob}}))

Example SQL code

As mentioned above (Challenges and solutions, Point 1), the object that we work with most of the time are just SQL queries and not real data stored in your R environment. We can look at the underlying SQL code using the dplyr::show_query() function. I don’t really know SQL very well myself, but some people who do have created a great package that translates R code into SQL code (see the dbplyr package for more).

# Use dplyr::show_query() function to see underlying SQL code
SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `LHS`.`client_id` AS `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`
FROM `contacts_s2` AS `LHS`
LEFT JOIN `demographics_s2` AS `RHS`
ON (`LHS`.`client_id` = `RHS`.`client_id`)

Note that we can also see the SQL code from our own functions.

db_age_at_contacts %>% 
  calc_age_at_contact(var_dob = dob, 
                      var_contact_date = contact_date) %>% 
SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `LHS`.`client_id` AS `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`
FROM `contacts_s2` AS `LHS`
LEFT JOIN `demographics_s2` AS `RHS`
ON (`LHS`.`client_id` = `RHS`.`client_id`)

Example results

# Look at results from SQL query shown above
db_age_at_contacts %>% 
  select(client_id, contact_date, dob, age_at_contact)
# Source:   lazy query [?? x 4]
# Database: sqlite 3.38.0 [:memory:]
  client_id contact_date dob        age_at_contact
      <dbl> <chr>        <chr>               <int>
1         1 2018-05-19   1988-01-01             30
2         1 2019-06-05   1988-01-01             31
3         1 2020-07-08   1988-01-01             32
4         2 2019-01-15   1965-01-01             54

Of course this is a VERY simple example. This can get way more complex, think BIG and solve BIG problems. There are many other examples out there showing how to work with databases in RStudio. I added some links that I found useful at the end of this post.

Watch out now

So what’s coming next and where can we take this? Is this perfect? I don’t know exactly what’s coming next and this is definitely far from perfect. But it’s the best approach my colleagues and I could come up with in the time that we spent working on this. Maybe I’ll improve this one day, maybe someone else will? Until then let’s share ideas and work together to improve healthcare analytics in the NHS. Ohhh, some people are already working like this 👀 it’s time others join them. I hope those who make decisions about the direction of healthcare analytics in the NHS will start to understand the problems and opportunities and act soon. If not now, when then? We need to move towards a more open and modern way of healthcare analytics!


Here’s a short summary in BOLD AND ALL CAPS:

Of course this doesn’t always work. There will always be some messy data, inconsistent variable names, undocumented code, and … blah blah blah.

Related work and resources


If you see mistakes or want to suggest changes, please create an issue on the source repository.


For attribution, please cite this work as

Wiedemann (2021, Sept. 29). CDU data science team blog: Development of open tools for analysing healthcare data in R. Retrieved from

BibTeX citation

  author = {Wiedemann, Milan},
  title = {CDU data science team blog: Development of open tools for analysing healthcare data in R},
  url = {},
  year = {2021}