# dbplyr

## What is `dbplyr`?

`dbplyr` is an R package within the tidyverse set of packages. It provides a “tidyverse friendly” way of interacting with SQL databases, such as the SMRA database.
## Why use `dbplyr` with the SMRA database?

SQL (Structured Query Language) is the language used internally by many databases. In SPSS code, a snippet of SQL is often used to extract a basefile of SMR episodes from the SMRA database, typically specifying start and end dates and some fields of interest. This basefile is then saved locally before being filtered and analysed by the rest of the SPSS code.
This pattern is easy to replicate in R, as below:
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

sql_query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, sql_query, n = -1))

write_rds(df_basefile, "basefile.rds")

# all episodes with at least one "F34" procedure
df <- df_basefile %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))
```
On the other hand, `dbplyr` allows a user to write code using the standard tidyverse commands they may already be familiar with, e.g. `filter()`, `mutate()`, `group_by()` and `summarise()`. `dbplyr` automatically translates these into an SQL query, which is then used to query the database.

This means we can easily write code that queries the database and returns data that has already been filtered and summarised in the way we want – see the example below.
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()
```
Here we use only tidyverse functions. `dbplyr` translates them into an SQL query: when the `collect()` function is called, the SQL query is submitted to the database and the resultant data retrieved.

In this example, the first method transfers some 14,084,977 rows from the database, while the `dbplyr` example transfers only the 41,862 rows we need. This means that desktop/laptop computers have to perform less large-scale filtering, disk space is saved, strain on network bandwidth is reduced and, most importantly, time is saved. See the benchmarking section below for more detailed benchmarking of the above example.
At any point we can use the `show_query()` function to reveal the SQL code that `dbplyr` has generated. Note that although no data is stored locally until the `collect()` command is used, we can still take a peek at objects before retrieving them.
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-02-01','YYYY-MM-DD')) %>%
  select(URI, DISCHARGE_DATE, LINK_NO, CIS_MARKER)

head(df)
```
URI | DISCHARGE_DATE | LINK_NO | CIS_MARKER |
---|---|---|---|
76243260 | 2018-01-01 | 44022065 | 16 |
76214632 | 2018-01-01 | 45525431 | 2 |
76188549 | 2018-01-01 | 7613088 | 14 |
76143288 | 2018-01-01 | 30134943 | 7 |
76211276 | 2018-01-01 | 10838432 | 19 |
76221135 | 2018-01-01 | 6005427 | 37 |
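The generated SQL can also be previewed without any database connection, using `dbplyr`'s `lazy_frame()` and `simulate_oracle()` helpers. The sketch below uses illustrative column names and a simulated backend, not the real SMRA table:

```r
library(tidyverse)

# lazy_frame() builds a lazy table against a simulated backend, so the SQL
# that dbplyr would generate can be inspected offline. The column names and
# values here are purely illustrative.
lf <- dbplyr::lazy_frame(DISCHARGE_DATE = as.Date("2018-01-01"),
                         CIS_MARKER     = 1,
                         con            = dbplyr::simulate_oracle())

lf %>%
  filter(CIS_MARKER == 1) %>%
  select(DISCHARGE_DATE) %>%
  show_query()   # prints the generated SELECT ... WHERE ... statement
```

The exact formatting of the printed SQL may differ between `dbplyr` versions.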
Some typical use cases are shown below. Note that the R code is essentially identical to that you might write when not using `dbplyr`, with the exception of using the `tbl()` function in the setup, and `collect()` when retrieving data.

A typical `dbplyr` session would start as follows.
```r
library(tidyverse)

smra <- suppressWarnings(odbc::dbConnect(odbc::odbc(),
                                         dsn = "SMRA",
                                         uid = "YOUR_USERNAME",
                                         pwd = rstudioapi::askForPassword("Database password: ")))

SMR01 <- tbl(smra, "SMR01_PI")
```
The `dbplyr` package can be loaded explicitly with `library(dbplyr)`; it is also installed as part of the tidyverse and is loaded automatically when dplyr code is used on a database connection.
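If you want to experiment without a connection to SMRA, `dbplyr`'s `memdb_frame()` helper copies a small data frame into a temporary in-memory SQLite database (this requires the RSQLite package to be installed; the data below is invented for illustration):

```r
library(tidyverse)

# memdb_frame() copies a local data frame into a throwaway in-memory SQLite
# database and returns a tbl, so the usual dbplyr workflow can be practised
# offline. Requires the RSQLite package. The values here are made up.
demo <- dbplyr::memdb_frame(LOCATION     = c("C418H", "G405H", "A111H"),
                            AGE_IN_YEARS = c(70, 62, 81))

df <- demo %>%
  filter(AGE_IN_YEARS > 65) %>%   # translated to SQL and run by SQLite
  collect()
```

As with SMRA, nothing is transferred locally until `collect()` is called.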
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2018-01-02','YYYY-MM-DD')) %>%
  select(DISCHARGE_DATE, LOCATION)

df <- collect(df)
```
DISCHARGE_DATE | LOCATION |
---|---|
2018-01-01 | A210H |
2018-01-01 | F704H |
2018-01-01 | A111H |
2018-01-01 | A111H |
2018-01-01 | C121H |
2018-01-01 | H214H |
SQL commands can be passed to the database within the typical tidyverse commands by using the `sql()` function. The following code will return data filtered according to a new “year” variable.
```r
df <- SMR01 %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2019) %>%
  select(DISCHARGE_DATE, year, LOCATION)

df <- collect(df)
```
DISCHARGE_DATE | year | LOCATION |
---|---|---|
2019-02-01 | 2019 | A101H |
2019-02-09 | 2019 | A101H |
2019-01-16 | 2019 | A101H |
2019-01-06 | 2019 | A101H |
2019-01-08 | 2019 | A101H |
2019-03-05 | 2019 | A101H |
Here data is selected for patients over the age of 65 at the Royal Alexandra Hospital (RAH).
```r
df <- SMR01 %>%
  filter(DISCHARGE_DATE >= TO_DATE('2018-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE < TO_DATE('2019-01-01','YYYY-MM-DD'),
         AGE_IN_YEARS > 65,
         LOCATION == "C418H") %>%
  select(LOCATION, AGE_IN_YEARS)

df <- collect(df)
```
LOCATION | AGE_IN_YEARS |
---|---|
C418H | 85 |
C418H | 76 |
C418H | 90 |
C418H | 74 |
C418H | 83 |
C418H | 82 |
Note that the order in which `filter()` and `select()` are applied does matter. In the above example, if the `select()` statement appeared before the `filter()` statement, the `filter()` step would fail, as it would be unable to find the field `DISCHARGE_DATE`.
Here we filter for episodes based on whether a field has one of a number of values, e.g. whether the location is the RAH or the QEUH. R’s `%in%` operator can be used as follows to match the values.
```r
hospitals_of_interest <- c("C418H", "G405H")

df <- SMR01 %>%
  filter(LOCATION %in% hospitals_of_interest)

df <- collect(df)
```
LOCATION |
---|
C418H |
C418H |
C418H |
G405H |
G405H |
C418H |
If we wanted to exclude records with no recorded `DATE_OF_MAIN_OPERATION` from the above query, we could do the following.
```r
df <- SMR01 %>%
  filter(!is.na(DATE_OF_MAIN_OPERATION)) %>%
  select(LOCATION, DATE_OF_MAIN_OPERATION)

df <- collect(df)
```
LOCATION | DATE_OF_MAIN_OPERATION |
---|---|
L302H | 1998-02-24 |
L302H | 1997-11-27 |
L302H | 1998-03-25 |
L302H | 1998-03-24 |
L304H | 1997-07-31 |
L304H | 1997-05-30 |
Here, R’s `is.na()` function is used to identify episodes where the `DATE_OF_MAIN_OPERATION` field is empty. The exclamation mark represents negation, so the net effect of `!is.na()` is to exclude episodes where `DATE_OF_MAIN_OPERATION` is missing.
Often we want to identify records that have a particular set of diagnosis or procedure codes. Here we can use the `str_sub()` function from the tidyverse `stringr` package, which works similarly to the `CHAR.SUBSTR` function from SPSS.
```r
conditions <- c("C51", "C52")

df <- SMR01 %>%
  filter(str_sub(MAIN_CONDITION, 1, 3) %in% conditions) %>%
  select(MAIN_CONDITION)

df <- collect(df)
```
MAIN_CONDITION |
---|
C52X |
C52X |
C52X |
C519 |
C519 |
C52X |
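To see how an individual R expression is translated, `dbplyr::translate_sql()` can be used with a simulated Oracle backend. This is a sketch that needs no database connection; the exact output varies between `dbplyr` versions, but `str_sub()` should map onto Oracle's `SUBSTR`:

```r
library(tidyverse)

# translate_sql() shows the SQL generated for a single R expression;
# simulate_oracle() mimics the Oracle backend without a connection.
dbplyr::translate_sql(str_sub(MAIN_CONDITION, 1, 3) %in% c("C51", "C52"),
                      con = dbplyr::simulate_oracle())
```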
Sorting is accomplished using the `arrange()` function. The following code arranges episodes in the order typically used when combining episodes into Continuous Inpatient Stays.
```r
df <- SMR01 %>%
  arrange(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI) %>%
  select(LINK_NO, CIS_MARKER, ADMISSION_DATE, DISCHARGE_DATE, ADMISSION, DISCHARGE, URI)

df <- collect(df)
```
LINK_NO | CIS_MARKER | ADMISSION_DATE | DISCHARGE_DATE | ADMISSION | DISCHARGE | URI |
---|---|---|---|---|---|---|
4 | 15 | 1997-09-07 | 1997-09-08 | 0 | 1 | 30024196 |
4 | 16 | 1999-03-30 | 1999-04-02 | 0 | 0 | 32381190 |
4 | 16 | 1999-04-02 | 1999-04-05 | 2 | 1 | 32381191 |
4 | 17 | 1999-05-05 | 1999-05-07 | 0 | 1 | 32381192 |
4 | 18 | 1999-05-17 | 1999-05-21 | 0 | 1 | 32381193 |
4 | 19 | 2001-07-22 | 2001-07-24 | 0 | 1 | 37590590 |
We can use R’s `paste()` function to combine text fields.
```r
df <- SMR01 %>%
  mutate(boards = paste("HBtreat:", HBTREAT_CURRENTDATE, "HBres:", HBRES_CURRENTDATE)) %>%
  filter(HBTREAT_CURRENTDATE != HBRES_CURRENTDATE) %>% # show those with difference
  select(boards)

df <- collect(df)
```
boards |
---|
HBtreat: S08000021 HBres: S08000015 |
HBtreat: S08000021 HBres: S08000015 |
HBtreat: S08000021 HBres: S08000022 |
HBtreat: S08000022 HBres: S08200001 |
HBtreat: S08000021 HBres: S08000023 |
HBtreat: S08000021 HBres: S08000023 |
The `group_by()` and `summarise()` functions are typically used to aggregate data. Functions like `n()`, `sum()` and `mean()` can be used within `summarise()`, analogous to the method used in SPSS.
```r
df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(year)

df <- collect(df)
```
year | total |
---|---|
1997 | 6158 |
1998 | 7739 |
1999 | 6446 |
2000 | 5329 |
2001 | 3179 |
2002 | 4442 |
2003 | 4360 |
2004 | 4347 |
2005 | 4654 |
2006 | 4363 |
2007 | 4492 |
2008 | 4397 |
2009 | 4494 |
2010 | 4569 |
2011 | 4322 |
2012 | 4594 |
2013 | 4782 |
2014 | 4508 |
2015 | 4665 |
2016 | 4665 |
2017 | 4203 |
2018 | 4384 |
2019 | 1517 |
The `distinct()` function can be used to drop duplicate cases. To generate a list of all the distinct hospitals that performed an “F34” procedure as the `MAIN_OPERATION` in 2014, you could use the following code.
```r
df <- SMR01 %>%
  filter(str_sub(MAIN_OPERATION, 1, 3) == "F34") %>%
  mutate(year = sql('extract(year from DISCHARGE_DATE)')) %>%
  filter(year == 2014) %>%
  select(LOCATION) %>%
  distinct()

df <- collect(df)
```
LOCATION |
---|
G516H |
G306H |
V213V |
H202H |
W107H |
N121H |
C418H |
S225H |
A111H |
F704H |
V217H |
G207H |
Z102H |
N101H |
G412V |
B120H |
Y104H |
S308H |
L106H |
G405H |
G513H |
T101H |
R101H |
G502V |
S124V |
N411H |
In this example we’ll see how the `dbplyr` methodology can shorten the time taken to run an analysis, as well as reduce the memory and disk space required. The use case is to return all the SMR01 episodes from 2009 – 2019 where at least one “F34” procedure was performed. Two methods will be compared:

- **transfer all**: all the episodes in the time period are transferred (using a snippet of SQL) and saved as a “basefile”, then filtered locally to find all the episodes with an “F34” procedure.
- **dbplyr**: tidyverse commands are used to generate an SQL query, which is then submitted to the server and only the desired episodes collected.
**transfer all**

```r
query <- "select HBTREAT_CURRENTDATE, DISCHARGE_DATE, MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3 from SMR01_PI WHERE DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD') AND DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')"

df_basefile <- as_tibble(odbc::dbGetQuery(smra, query, n = -1))

df <- df_basefile %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34"))
```
**dbplyr**

```r
df <- tbl(smra, "SMR01_PI") %>%
  filter(DISCHARGE_DATE >= TO_DATE('2010-01-01','YYYY-MM-DD'),
         DISCHARGE_DATE <= TO_DATE('2019-01-01','YYYY-MM-DD')) %>%
  select(HBTREAT_CURRENTDATE, DISCHARGE_DATE,
         MAIN_OPERATION, OTHER_OPERATION_1, OTHER_OPERATION_2, OTHER_OPERATION_3) %>%
  filter_at(vars(contains("OPERATION")), any_vars(str_sub(., 1, 3) == "F34")) %>%
  collect()
```
The code for both methods was run twenty times each in three different user environments: R desktop (on the local NSS network), R desktop off-site (via VPN), and the NSS RStudio Server. The `bench` package was used to collate the results.

As can be seen below, the `dbplyr` method gives consistently shorter run times, as well as more consistent run times, in all three environments.
Typically the `dbplyr` code returns results in roughly 50 seconds, whereas the transfer all method takes 130 to 1,150 seconds depending on the environment and network performance.
It is also worth considering the memory allocated and disk space used when using each method.
The figure below compares the file size of saved results, number of rows returned, respective object sizes in memory and the local memory allocated while running the respective code blocks.