Overview

What is dbplyr?

dbplyr is an R package from the tidyverse family of packages. It provides a “tidyverse-friendly” way of interacting with SQL databases, such as the SMRA database.

Why use dbplyr with the SMRA database?

SQL (Structured Query Language) is the language used internally by many databases. In SPSS code, a snippet of SQL is often used to extract a basefile of SMR episodes from the SMRA database, typically specifying start and end dates and some fields of interest. This basefile is then saved locally before being filtered and analysed by the rest of the SPSS code.

This pattern is easy to replicate in R, as below:

library(tidyverse)
smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

sql_query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, sql_query, n=-1))
write_rds(df_basefile, "basefile.rds")

# all episodes with at least one "F34" procedure
df <- df_basefile %>%  
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))

On the other hand, dbplyr allows a user to write code using the standard tidyverse commands they may already be familiar with, e.g. filter(), mutate(), group_by(), and summarise(). dbplyr automatically translates these into an SQL query, which is then used to query the database.

This means we can easily write code that queries the database, and returns data that has already been filtered and summarised in the way we want – see the example below.

library(tidyverse)
smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()

Here we use only tidyverse functions. dbplyr translates them into an SQL query: when the collect() function is called, the SQL query is submitted to the database and the resultant data retrieved.

In this example, the first method transfers some 14,084,977 rows from the database, while the dbplyr example transfers only the 41,862 rows we need. This means less large-scale filtering has to be done on desktop/laptop computers, disk space is saved, strain on network bandwidth is reduced and, most importantly, time is saved. See the Worked Benchmark Example section below for more detailed benchmarking of this example.

At any point we can use the show_query() function to reveal the SQL code that dbplyr has generated. Note that although no data is stored locally until the collect() command is used, we can still take a peek at objects before retrieving them. (The SMR01 object used below is the table reference to SMR01_PI created in the Setup section that follows.)

df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-02-01','YYYY-MM-DD')) %>%
  select(URI, DISCHARGE_DATE, LINK_NO, CIS_MARKER)
head(df)
URI DISCHARGE_DATE LINK_NO CIS_MARKER
76243260 2018-01-01 44022065 16
76214632 2018-01-01 45525431 2
76188549 2018-01-01 7613088 14
76143288 2018-01-01 30134943 7
76211276 2018-01-01 10838432 19
76221135 2018-01-01 6005427 37
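
For example, calling show_query() on the object defined above prints the SQL that dbplyr has built up so far. The exact quoting and layout of the generated SQL may differ between dbplyr versions and database drivers, but it will look roughly like this.

show_query(df)
<SQL>
SELECT "URI", "DISCHARGE_DATE", "LINK_NO", "CIS_MARKER"
FROM "SMR01_PI"
WHERE ("DISCHARGE_DATE" >= TO_DATE('2018-01-01', 'YYYY-MM-DD')) AND ("DISCHARGE_DATE" < TO_DATE('2018-02-01', 'YYYY-MM-DD'))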

Useful Examples

Some typical use cases are shown below. Note that the R code is essentially identical to what you would write when not using dbplyr, with the exception of using the tbl() function in the setup and collect() when retrieving data.

Setup

A typical dbplyr session would start as follows.

library(tidyverse)
smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

SMR01 <- tbl(smra, "SMR01_PI")

The dbplyr package does not usually need to be loaded explicitly with library(dbplyr): although it is not one of the core packages attached by library(tidyverse), dplyr loads it automatically in the background whenever tbl() is used with a database connection.

Date Selection

df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-01-02','YYYY-MM-DD')) %>%
  select(DISCHARGE_DATE, LOCATION)

df <- collect(df)
DISCHARGE_DATE LOCATION
2018-01-01 A210H
2018-01-01 F704H
2018-01-01 A111H
2018-01-01 A111H
2018-01-01 C121H
2018-01-01 H214H

Snippets of SQL can be passed to the database from within the usual tidyverse commands by using the sql() function. The following code returns data filtered according to a new “year” variable.

df <- SMR01 %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2019) %>%
  select(DISCHARGE_DATE, year, LOCATION)

df <- collect(df)
DISCHARGE_DATE year LOCATION
2019-02-01 2019 A101H
2019-02-09 2019 A101H
2019-01-16 2019 A101H
2019-01-06 2019 A101H
2019-01-08 2019 A101H
2019-03-05 2019 A101H

Select and Filter Multiple Fields

Here data is selected for patients over the age of 65 at the Royal Alexandra Hospital (RAH).

df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2019-01-01','YYYY-MM-DD'),
         AGE_IN_YEARS > 65,
         LOCATION == "C418H") %>%
  select(LOCATION, AGE_IN_YEARS)

df <- collect(df)
LOCATION AGE_IN_YEARS
C418H 85
C418H 76
C418H 90
C418H 74
C418H 83
C418H 82

Note that the order in which filter() and select() are applied matters. In the above example, if the select() statement appeared before the filter() statement, the filter() step would fail as it would be unable to find the field DISCHARGE_DATE, as sketched below.
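
For illustration, the reordered version of the pipeline sketched below fails for exactly this reason (depending on the dbplyr version, the error surfaces either at the filter() step or when the generated SQL is run against the database).

# For illustration only: the example above reordered so that select() comes first.
# This fails, because DISCHARGE_DATE is no longer available to filter().
df <- SMR01 %>%
  select(LOCATION, AGE_IN_YEARS) %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2019-01-01','YYYY-MM-DD'),
         AGE_IN_YEARS > 65,
         LOCATION == "C418H")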

Select Multiple Values

Here we filter for episodes based on whether a field has one of a number of values, e.g. whether the location is RAH or QEUH. R’s %in% operator can be used as follows to match the values.

hospitals_of_interest <- c("C418H", "G405H")

df <- SMR01 %>%
  filter(LOCATION %in% hospitals_of_interest)

df <- collect(df)
LOCATION
C418H
C418H
C418H
G405H
G405H
C418H

Excluding Missing Values

If we wanted to exclude records with no recorded DATE_OF_MAIN_OPERATION, we could do the following.

df <- SMR01 %>%
  filter(!is.na(DATE_OF_MAIN_OPERATION)) %>%
  select(LOCATION, DATE_OF_MAIN_OPERATION)

df <- collect(df)
LOCATION DATE_OF_MAIN_OPERATION
L302H 1998-02-24
L302H 1997-11-27
L302H 1998-03-25
L302H 1998-03-24
L304H 1997-07-31
L304H 1997-05-30

Here, R’s is.na() function is used to identify episodes where the DATE_OF_MAIN_OPERATION field is empty. The exclamation mark represents negation, so the net effect of !is.na() is to exclude episodes where DATE_OF_MAIN_OPERATION is missing.

Filtering for a Substring

Often we want to identify records that have a particular set of diagnosis or procedure codes. Here we can use the str_sub function from the tidyverse stringr package, which works similarly to the CHAR.SUBSTR function from SPSS.

conditions <- c("C51", "C52")

df <- SMR01 %>%
  filter(str_sub(MAIN_CONDITION, 1, 3) %in% conditions) %>%
  select(MAIN_CONDITION)

df <- collect(df)
MAIN_CONDITION
C52X
C52X
C52X
C519
C519
C52X

Sorting Cases

This is accomplished using the arrange function. The following code arranges episodes in the order typically used when combining episodes into Continuous Inpatient Stays.

df <- SMR01 %>%
  arrange(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI) %>%
  select(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI)

df <- collect(df)
LINK_NO CIS_MARKER ADMISSION_DATE DISCHARGE_DATE ADMISSION DISCHARGE URI
4 15 1997-09-07 1997-09-08 0 1 30024196
4 16 1999-03-30 1999-04-02 0 0 32381190
4 16 1999-04-02 1999-04-05 2 1 32381191
4 17 1999-05-05 1999-05-07 0 1 32381192
4 18 1999-05-17 1999-05-21 0 1 32381193
4 19 2001-07-22 2001-07-24 0 1 37590590

Combining Text Fields

We can use R’s paste function to combine text fields.

df <- SMR01 %>%
  mutate(boards = paste("HBtreat:", HBTREAT_CURRENTDATE, "HBres:", HBRES_CURRENTDATE)) %>%
  filter(HBTREAT_CURRENTDATE != HBRES_CURRENTDATE) %>% # show those with difference
  select(boards)

df <- collect(df)
boards
HBtreat: S08000021 HBres: S08000015
HBtreat: S08000021 HBres: S08000015
HBtreat: S08000021 HBres: S08000022
HBtreat: S08000022 HBres: S08200001
HBtreat: S08000021 HBres: S08000023
HBtreat: S08000021 HBres: S08000023

Aggregating

The group_by() and summarise() functions are typically used to aggregate data. Functions such as n(), sum() and mean() can be used within summarise(), analogous to aggregation in SPSS.

df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(year)

df <- collect(df)
year total
1997 6158
1998 7739
1999 6446
2000 5329
2001 3179
2002 4442
2003 4360
2004 4347
2005 4654
2006 4363
2007 4492
2008 4397
2009 4494
2010 4569
2011 4322
2012 4594
2013 4782
2014 4508
2015 4665
2016 4665
2017 4203
2018 4384
2019 1517

Distinct Values

The distinct() function can be used to drop duplicate cases. To generate a list of all the distinct hospitals that performed an “F34” procedure as the MAIN_OPERATION in 2014, you could use the following code.

df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2014) %>%
  select(LOCATION) %>%
  distinct()

df <- collect(df)
LOCATION
G516H
G306H
V213V
H202H
W107H
N121H
C418H
S225H
A111H
F704H
V217H
G207H
Z102H
N101H
G412V
B120H
Y104H
S308H
L106H
G405H
G513H
T101H
R101H
G502V
S124V
N411H

Worked Benchmark Example

In this example we’ll see how the dbplyr methodology can shorten the time taken to run an analysis, as well as reduce the memory and disk space required. The use case is to return all the SMR01 episodes discharged between 1 January 2010 and 1 January 2019 where at least one “F34” procedure was performed. Two methods will be compared:

  • transfer all: where all the episodes in the time period are transferred (using a snippet of SQL) and saved as a “basefile”, then filtered locally to find all the episodes with an “F34” procedure.

  • dbplyr: where tidyverse commands are used to generate an SQL query, which is then submitted to the server and the desired episodes collected.

Code

transfer all

query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, query, n=-1))

df <- df_basefile %>%  
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))

dbplyr

df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()

Time Comparison

The code for both methods was run twenty times each in three different user environments: R desktop on the local NSS network, R desktop off-site (via VPN), and the NSS RStudioServer. The bench package was used to collate the results.
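
For reference, a minimal sketch of how such a timing comparison could be set up with bench::mark() is shown below. The two wrapper functions are assumed to contain the “transfer all” and “dbplyr” code blocks above; this is an illustrative sketch rather than the exact benchmarking code used.

library(bench)

# Assumed wrappers around the two code blocks shown above, so that each method
# can be run repeatedly by bench::mark().
run_transfer_all <- function() {
  df_basefile <- as_tibble(odbc::dbGetQuery(smra, query, n = -1))
  df_basefile %>%
    filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))
}

run_dbplyr <- function() {
  tbl(smra, "SMR01_PI") %>%
    filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
           DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
    select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
           MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
    filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
    collect()
}

timings <- bench::mark(
  transfer_all = nrow(run_transfer_all()),
  dbplyr       = nrow(run_dbplyr()),
  iterations = 20,   # each method run twenty times
  check = FALSE      # timings only; don't compare the returned values
)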

As the figure below shows, the dbplyr method gives shorter and more consistent run times in all three environments.

Typically the dbplyr code returns results in roughly 50 seconds, whereas the transfer all method takes 130 to 1,150 seconds depending on the environment and network performance.

Memory and Disk Usage Comparison

It is also worth considering the memory allocated and the disk space used by each method.

The figure below compares the file size of the saved results, the number of rows returned, the respective object sizes in memory, and the local memory allocated while running each code block.
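
A minimal sketch of how these quantities could be gathered is shown below. It assumes the objects from the earlier code blocks (df_basefile, df and timings) still exist, along with the basefile.rds file saved in the first example, and is illustrative rather than the exact code behind the figure.

file.size("basefile.rds") / 1024^2                # disk space used by the saved basefile, in MB
nrow(df_basefile)                                 # rows returned by the "transfer all" method
nrow(df)                                          # rows returned by the dbplyr method
format(object.size(df_basefile), units = "MB")    # in-memory size of the basefile
format(object.size(df), units = "MB")             # in-memory size of the dbplyr result
timings$mem_alloc                                 # memory allocated per method, from bench::mark()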