TLTR: (Too Long To Read) Our goal was to make it easier to work with healthcare data in a reproducible and collaborative way. We wrote lots of R functions for recurring data manipulations and analytical tasks that magically translate into SQL code and communicate with large databases. All our functions are grouped into R packages because this made it easier for us to: (i) write good documentation of our code and analytical tasks, (ii) easily distribute updates across all team members, (iii) formally test our code, and (iv) integrate common data manipulations (or analyses) into interactive dashboards in a modular way.
Our team works with routinely collected NHS patient data. Currently
we focus on understanding better how patients are using the service,
changes in clinical outcome measures, and analysis of patient
experiences. The main questions that guide our work are ‘What works
for whom and how does it work?’, ‘What doesn’t work?’, and
‘How can we integrate patient experiences into our analyses?’.
We developed a set of different tools, the
{nottshcverse}
, to help us look at these
questions by automating recurring and time-consuming tasks so that we
can spend more time thinking the clinical questions.
Real data is messy but we should do our best to tidy the mess, where possible in an automated way. There are many challenges when working with healthcare data, here is only a small selection of those that I think underlies most analyses. Figure 1 shows our solutions to these challenges.
# First, create connection to databases
conn_s1 <- connect_sql(server = "DB-one")
conn_iapt <- connect_sql(server = "DB-two")
# Now we can use the connection to get contacts data from SystmOne ...
db_contacts_s1 <- get_s1_contacts(from = "2020-01-01",
to = "2020-12-31",
conn = conn_s1)
# ... and IAPTus databases
db_contacts_iapt <- get_iapt_contacts(from = "2020-01-01",
to = "2020-12-31",
conn = conn_iapt)
# Note that the objects 'contacts_s1' and 'contacts_iapt' are just pointing to
# the databases (I like to use the prefix db) and not actually downloaded to the
# environment on your computer.
# Each get_*_data() function comes with a tidy_*_data(), here tidy_s1_contacts()
# Here I use the connection to the raw (messy) contacts data that I created above
# and tidy it using the tidy_s1_contacts() function
db_contacts_s1 <- db_contacts_s1 %>%
tidy_s1_contacts()
db_contacts_iapt <- db_contacts_iapt %>%
tidy_iapt_contacts()
🤔 The methods of data analyses and visualisations should be understandable, reproducible, and available to other people. Unfortunately this is not always the case yet because the software tools that are used are not script based and often shared in private emails. 🤓 We wrote R functions for common analytical tasks and visualisations.
🤔 Code should be really well documented so that it is easy to understand what’s going on. This includes the current version of the code as well as all previous versions and changes. This can be done using tools like Git and GitHub, but unfortunately most code is currently shared undocumented in private emails. 🤓 We created detailed documentations that are easily accessible to everyone who uses our R packages. Also, because develop our tools on GitHub, every change to our code is documented.
🤔 Mistakes happen! Sometimes things that you have no control over can change and break your code that previously worked fine (e.g., the format of the raw data or a functions that someone else wrote). Therefore, it’s important to continuously test whether the code is still working the way it’s supposed to work. 🤓 R packages (or similar solutions in other statistical programming languages) are relatively easy to test, for example using the {testthat} package. We started to implement tests into our work so that we can check if changes that we make to our code don’t break anything.
{nottshcData}
: Unified framework to query, transform,
and aggregate data from different databases{nottshcMethods}
: Tools for performing common
analytical tasks (e.g., grouping continuous age into groups){honos}
, {LSOApop}
: Packages designed in
generic way to help use and others {nottshcData}
work with
specific questionnaires (e.g., Health of the Nation Outcome Scales,
HoNOS) or open data sets (e.g. LSOA population estimates){outcomesdashboard}
: Our dashboards use all the
packages mentioned above + special packages developed specifically to
support the dashboards with helper functionsTo illustrate how R can be used to work with databases I’ll use the
following example. Imagine we’re working with a database called
SystmTwo
(S2)
and need to use two different
tables for our analysis:
[S2].[contacts]
: Information about contacts with
clinical teams[S2].[demographics]
: Some demographic information# Set up example contacts table
contacts_s2 <- tibble(client_id = c(1, 1, 1, 2),
contact_id = c(123, 124, 125, 156),
referral_id = c(456, 459, 500, 501),
referral_date = c("2018-04-19", "2019-05-23",
"2020-06-01", "2018-12-11"),
contact_date = c("2018-05-19", "2019-06-05",
"2020-07-08", "2019-01-15"),
team_id = c("tm1", "tm2", "tm1", "tm1"),
hcp_id = c("hcp1", "hcp2", "hcp1", "hcp1"),
contact_type = c("phone", "f2f", "video", "phone"),
assessment_id = c(321, 322, 344, NA))
# Set up example demographics table with 2 patients
demographics_s2 <- tibble(client_id = c(1, 2),
dob = c("1988-01-01", "1965-01-01"),
dod = c(NA, NA),
sex = c("f", "m"))
SQLite
connectionHere we join the contacts with the demographics information to
calculate the age at the time a patient has their contact
(age_at_contact
).
We can also write our own functions and use them in a modular way
whenever we need them. Here’s a simple example to demonstrate how we can
do the same calculation as shown above using our own function. In this
example the function arguments take the variable names for date of birth
(dob
) and the contact date (contact_date
).
SQL
codeAs mentioned above (Challenges and solutions, Point 1), the object
that we work with most of the time are just SQL queries and not real
data stored in your R environment. We can look at the underlying SQL
code using the dplyr::show_query()
function. I don’t really
know SQL very well myself, but some people who do have created a great
package that translates R code into SQL code (see the dbplyr
package for more).
# Use dplyr::show_query() function to see underlying SQL code
show_query(db_age_at_contacts)
<SQL>
SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `LHS`.`client_id` AS `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`
FROM `contacts_s2` AS `LHS`
LEFT JOIN `demographics_s2` AS `RHS`
ON (`LHS`.`client_id` = `RHS`.`client_id`)
)
Note that we can also see the SQL code from our own functions.
db_age_at_contacts %>%
calc_age_at_contact(var_dob = dob,
var_contact_date = contact_date) %>%
show_query()
<SQL>
SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`, CAST(`contact_date` AS DATE) - CAST(`dob` AS DATE) AS `age_at_contact`
FROM (SELECT `LHS`.`client_id` AS `client_id`, `contact_id`, `referral_id`, `referral_date`, `contact_date`, `team_id`, `hcp_id`, `contact_type`, `assessment_id`, `dob`, `dod`, `sex`
FROM `contacts_s2` AS `LHS`
LEFT JOIN `demographics_s2` AS `RHS`
ON (`LHS`.`client_id` = `RHS`.`client_id`)
))
# Look at results from SQL query shown above
db_age_at_contacts %>%
select(client_id, contact_date, dob, age_at_contact)
# Source: lazy query [?? x 4]
# Database: sqlite 3.38.0 [:memory:]
client_id contact_date dob age_at_contact
<dbl> <chr> <chr> <int>
1 1 2018-05-19 1988-01-01 30
2 1 2019-06-05 1988-01-01 31
3 1 2020-07-08 1988-01-01 32
4 2 2019-01-15 1965-01-01 54
Of course this is a VERY simple example. This can get way more complex, think BIG and solve BIG problems. There are many other examples out there showing how to work with databases in RStudio. I added some links that I found useful at the end of this post.
So what’s coming next and where can we take this? Is this perfect? I don’t know exactly what’s coming next and this is definitely far from perfect. But it’s the best approach my colleagues and I could come up with in the time that we spent working on this. Maybe I’ll improve this one day, maybe someone else will? Until then let’s share ideas and work together to improve healthcare analytics in the NHS. Ohhh, some people are already working like this 👀 it’s time others join them. I hope those who make decisions about the direction of healthcare analytics in the NHS will start to understand the problems and opportunities and act soon. If not now, when then? We need to move towards a more open and modern way of healthcare analytics!
Here’s a short summary in BOLD AND ALL CAPS:
Of course this doesn’t always work. There will always be some messy data, inconsistent variable names, undocumented code, and … blah blah blah.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Wiedemann (2021, Sept. 29). CDU data science team blog: Development of open tools for analysing healthcare data in R. Retrieved from https://cdu-data-science-team.github.io/team-blog/posts/2021-08-06-nottshcverse/
BibTeX citation
@misc{wiedemann2021development, author = {Wiedemann, Milan}, title = {CDU data science team blog: Development of open tools for analysing healthcare data in R}, url = {https://cdu-data-science-team.github.io/team-blog/posts/2021-08-06-nottshcverse/}, year = {2021} }