# dbplyr

## What is `dbplyr`?

`dbplyr` is an R package within the tidyverse set of packages. It provides a “tidyverse friendly” way of interacting with SQL databases, such as the SMRA database.
## Why use `dbplyr` with the SMRA database?

SQL (Structured Query Language) is the language used internally by many databases. In SPSS code, a snippet of SQL is often used to extract a basefile of SMR episodes from the SMRA database, typically specifying start and end dates and some fields of interest. This basefile is then saved locally before being filtered and analysed by the rest of the SPSS code.
This pattern is easy to replicate in R, as below:
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

sql_query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, sql_query, n = -1))

write_rds(df_basefile, "basefile.rds")

# all episodes with at least one "F34" procedure
df <- df_basefile %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))
```
On the other hand, `dbplyr` allows a user to write code using the standard tidyverse commands they may already be familiar with, e.g. `filter()`, `mutate()`, `group_by()` and `summarise()`. `dbplyr` automatically translates these into an SQL query, which is then used to query the database.

This means we can easily write code that queries the database and returns data that has already been filtered and summarised in the way we want – see the example below.
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()
```
Here we use only tidyverse functions. `dbplyr` translates them into an SQL query: when the `collect()` function is called, the SQL query is submitted to the database and the resultant data retrieved.

In this example, the first method transfers some 14,084,977 rows from the database, while the `dbplyr` example transfers only the 41,862 rows we need. This means that desktop/laptop computers have to perform less large-scale filtering, disk space is saved, strain on network bandwidth is reduced and, most importantly, time is saved. See the benchmarking section below for more detailed benchmarking of the above example.
At any point we can use the `show_query()` function to reveal the SQL code that `dbplyr` has generated. Note that although no data is stored locally until the `collect()` command is used, we can still take a peek at objects before retrieving them.
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-02-01','YYYY-MM-DD')) %>%
  select(URI, DISCHARGE_DATE, LINK_NO, CIS_MARKER)

head(df)
```
URI | DISCHARGE_DATE | LINK_NO | CIS_MARKER |
---|---|---|---|
76243260 | 2018-01-01 | 44022065 | 16 |
76214632 | 2018-01-01 | 45525431 | 2 |
76188549 | 2018-01-01 | 7613088 | 14 |
76143288 | 2018-01-01 | 30134943 | 7 |
76211276 | 2018-01-01 | 10838432 | 19 |
76221135 | 2018-01-01 | 6005427 | 37 |
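The generated SQL can also be previewed without any database connection, using `dbplyr`'s `lazy_frame()` and `simulate_oracle()` helpers. The sketch below uses illustrative column names and a simulated backend, not the real SMRA table:

```r
library(tidyverse)

# lazy_frame() builds a lazy table against a simulated backend, so the SQL
# that dbplyr would generate can be inspected offline. The column names and
# values here are purely illustrative.
lf <- dbplyr::lazy_frame(DISCHARGE_DATE = as.Date("2018-01-01"),
                         CIS_MARKER     = 1,
                         con            = dbplyr::simulate_oracle())

lf %>%
  filter(CIS_MARKER == 1) %>%
  select(DISCHARGE_DATE) %>%
  show_query()   # prints the generated SELECT ... WHERE ... statement
```

The exact formatting of the printed SQL may differ between `dbplyr` versions.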
Some typical use cases are shown below. Note that the R code is essentially identical to that you might write when not using `dbplyr`, with the exception of using the `tbl()` function in the setup, and `collect()` when retrieving data.

A typical `dbplyr` session would start as follows.
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

SMR01 <- tbl(smra, "SMR01_PI")
```
The `dbplyr` package can be loaded explicitly with `library(dbplyr)`; it is also installed as part of the tidyverse and is loaded automatically when dplyr code is used on a database connection.
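If you want to experiment without a connection to SMRA, `dbplyr`'s `memdb_frame()` helper copies a small data frame into a temporary in-memory SQLite database (this requires the RSQLite package to be installed; the data below is invented for illustration):

```r
library(tidyverse)

# memdb_frame() copies a local data frame into a throwaway in-memory SQLite
# database and returns a tbl, so the usual dbplyr workflow can be practised
# offline. Requires the RSQLite package. The values here are made up.
demo <- dbplyr::memdb_frame(LOCATION     = c("C418H", "G405H", "A111H"),
                            AGE_IN_YEARS = c(70, 62, 81))

df <- demo %>%
  filter(AGE_IN_YEARS > 65) %>%   # translated to SQL and run by SQLite
  collect()
```

As with SMRA, nothing is transferred locally until `collect()` is called.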
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-01-02','YYYY-MM-DD')) %>%
  select(DISCHARGE_DATE, LOCATION)

df <- collect(df)
```
DISCHARGE_DATE | LOCATION |
---|---|
2018-01-01 | A210H |
2018-01-01 | F704H |
2018-01-01 | A111H |
2018-01-01 | A111H |
2018-01-01 | C121H |
2018-01-01 | H214H |
SQL commands can be passed to the database within the typical tidyverse commands by using the `sql()` function. The following code will return data filtered according to a new “year” variable.
```r
df <- SMR01 %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2019) %>%
  select(DISCHARGE_DATE, year, LOCATION)

df <- collect(df)
```
DISCHARGE_DATE | year | LOCATION |
---|---|---|
2019-02-01 | 2019 | A101H |
2019-02-09 | 2019 | A101H |
2019-01-16 | 2019 | A101H |
2019-01-06 | 2019 | A101H |
2019-01-08 | 2019 | A101H |
2019-03-05 | 2019 | A101H |
Here data is selected for patients over the age of 65 at the Royal Alexandra Hospital (RAH).
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2019-01-01','YYYY-MM-DD'),
         AGE_IN_YEARS > 65,
         LOCATION == "C418H") %>%
  select(LOCATION, AGE_IN_YEARS)

df <- collect(df)
```
LOCATION | AGE_IN_YEARS |
---|---|
C418H | 85 |
C418H | 76 |
C418H | 90 |
C418H | 74 |
C418H | 83 |
C418H | 82 |
Note that the order in which `filter()` and `select()` are applied does matter. In the above example, if the `select()` statement appeared before the `filter()` statement, the `filter()` step would fail, as it would be unable to find the field `DISCHARGE_DATE`.
Here we filter for episodes based on whether a field has one of a number of values, e.g. whether the location is the RAH or the QEUH. R’s `%in%` operator can be used as follows to match the values.
```r
hospitals_of_interest <- c("C418H", "G405H")

df <- SMR01 %>%
  filter(LOCATION %in% hospitals_of_interest)

df <- collect(df)
```
LOCATION |
---|
C418H |
C418H |
C418H |
G405H |
G405H |
C418H |
If we wanted to exclude records with no recorded `DATE_OF_MAIN_OPERATION` from the above query, we could do the following.
```r
df <- SMR01 %>%
  filter(!is.na(DATE_OF_MAIN_OPERATION)) %>%
  select(LOCATION, DATE_OF_MAIN_OPERATION)

df <- collect(df)
```
LOCATION | DATE_OF_MAIN_OPERATION |
---|---|
L302H | 1998-02-24 |
L302H | 1997-11-27 |
L302H | 1998-03-25 |
L302H | 1998-03-24 |
L304H | 1997-07-31 |
L304H | 1997-05-30 |
Here, R’s `is.na()` function is used to identify episodes where the `DATE_OF_MAIN_OPERATION` field is empty. The exclamation mark represents negation, so the net effect of `!is.na()` is to exclude episodes where `DATE_OF_MAIN_OPERATION` is missing.
Often we want to identify records that have a particular set of diagnosis or procedure codes. Here we can use the `str_sub()` function from the tidyverse `stringr` package, which works similarly to the `CHAR.SUBSTR` function from SPSS.
```r
conditions <- c("C51", "C52")

df <- SMR01 %>%
  filter(str_sub(MAIN_CONDITION, 1, 3) %in% conditions) %>%
  select(MAIN_CONDITION)

df <- collect(df)
```
MAIN_CONDITION |
---|
C52X |
C52X |
C52X |
C519 |
C519 |
C52X |
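To see how an individual R expression is translated, `dbplyr::translate_sql()` can be used with a simulated Oracle backend. This is a sketch that needs no database connection; the exact output varies between `dbplyr` versions, but `str_sub()` should map onto Oracle's `SUBSTR`:

```r
library(tidyverse)

# translate_sql() shows the SQL generated for a single R expression;
# simulate_oracle() mimics the Oracle backend without a connection.
dbplyr::translate_sql(str_sub(MAIN_CONDITION, 1, 3) %in% c("C51", "C52"),
                      con = dbplyr::simulate_oracle())
```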
Sorting is accomplished using the `arrange()` function. The following code arranges episodes in the order typically used when combining episodes into Continuous Inpatient Stays.
```r
df <- SMR01 %>%
  arrange(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI) %>%
  select(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI)

df <- collect(df)
```
LINK_NO | CIS_MARKER | ADMISSION_DATE | DISCHARGE_DATE | ADMISSION | DISCHARGE | URI |
---|---|---|---|---|---|---|
4 | 15 | 1997-09-07 | 1997-09-08 | 0 | 1 | 30024196 |
4 | 16 | 1999-03-30 | 1999-04-02 | 0 | 0 | 32381190 |
4 | 16 | 1999-04-02 | 1999-04-05 | 2 | 1 | 32381191 |
4 | 17 | 1999-05-05 | 1999-05-07 | 0 | 1 | 32381192 |
4 | 18 | 1999-05-17 | 1999-05-21 | 0 | 1 | 32381193 |
4 | 19 | 2001-07-22 | 2001-07-24 | 0 | 1 | 37590590 |
We can use R’s `paste()` function to combine text fields.
```r
df <- SMR01 %>%
  mutate(boards = paste("HBtreat:", HBTREAT_CURRENTDATE, "HBres:", HBRES_CURRENTDATE)) %>%
  filter(HBTREAT_CURRENTDATE != HBRES_CURRENTDATE) %>% # show those with difference
  select(boards)

df <- collect(df)
```
boards |
---|
HBtreat: S08000021 HBres: S08000015 |
HBtreat: S08000021 HBres: S08000015 |
HBtreat: S08000021 HBres: S08000022 |
HBtreat: S08000022 HBres: S08200001 |
HBtreat: S08000021 HBres: S08000023 |
HBtreat: S08000021 HBres: S08000023 |
The `group_by()` and `summarise()` functions are typically used to aggregate data. Functions like `n()`, `sum()` and `mean()` can be used within `summarise()`, analogous to the method used in SPSS.
```r
df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(year)

df <- collect(df)
```
year | total |
---|---|
1997 | 6158 |
1998 | 7739 |
1999 | 6446 |
2000 | 5329 |
2001 | 3179 |
2002 | 4442 |
2003 | 4360 |
2004 | 4347 |
2005 | 4654 |
2006 | 4363 |
2007 | 4492 |
2008 | 4397 |
2009 | 4494 |
2010 | 4569 |
2011 | 4322 |
2012 | 4594 |
2013 | 4782 |
2014 | 4508 |
2015 | 4665 |
2016 | 4665 |
2017 | 4203 |
2018 | 4384 |
2019 | 1517 |
The `distinct()` function can be used to drop duplicate cases. To generate a list of all the distinct hospitals that performed an “F34” procedure as the `MAIN_OPERATION` in 2014, you could use the following code.
```r
df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2014) %>%
  select(LOCATION) %>%
  distinct()

df <- collect(df)
```
LOCATION |
---|
G516H |
G306H |
V213V |
H202H |
W107H |
N121H |
C418H |
S225H |
A111H |
F704H |
V217H |
G207H |
Z102H |
N101H |
G412V |
B120H |
Y104H |
S308H |
L106H |
G405H |
G513H |
T101H |
R101H |
G502V |
S124V |
N411H |
In this example we’ll see how the `dbplyr` methodology can shorten the time taken to run an analysis, as well as reduce the memory and disk space required. The use case is to return all the SMR01 episodes from 2009 – 2019 where at least one “F34” procedure was performed. Two methods will be compared:

- **transfer all**: all the episodes in the time period are transferred (using a snippet of SQL) and saved as a “basefile”, then filtered locally to find all the episodes with an “F34” procedure.
- **dbplyr**: tidyverse commands are used to generate an SQL query, which is then submitted to the server and only the desired episodes collected.
**transfer all**

```r
query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, query, n = -1))

df <- df_basefile %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))
```
**dbplyr**

```r
df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()
```
The code for both methods was run twenty times each in three different user environments: R desktop (on the local NSS network), R desktop off-site (via VPN), and the NSS RStudio Server. The `bench` package was used to collate the results.

As can be seen below, the `dbplyr` method gives consistently shorter run times, as well as more consistent run times, in all three environments.
Typically the `dbplyr` code returns results in roughly 50 seconds, whereas the transfer all method takes 130 to 1,150 seconds depending on the environment and network performance.
It is also worth considering the memory allocated and disk space used when using each method.
The figure below compares the file size of saved results, number of rows returned, respective object sizes in memory and the local memory allocated while running the respective code blocks.