This vignette assumes a SQL server at localhost
(we
use PostgreSQL), with patient data in OMOP Common Data Model v5.4 format
in schema cdm_new_york3
. The data shown in this example is
synthetic, generated by Synthea(TM) Patient
Generator.
suppressPackageStartupMessages(library(dplyr))
library(phea)
# Connect to SQL server.
<- DBI::dbConnect(RPostgres::Postgres(),
dbcon host = fabcred$pg$host, port = fabcred$pg$port, dbname = fabcred$pg$database,
user = fabcred$pg$user, password = fabcred$pg$pass)
# Setup Phea
setup_phea(dbcon, 'cdm_new_york3')
How to find patients who had an acute myocardial infarction (AMI) when they were clinically obese (body mass index, BMI > 30)?
A patient can have multiple myocardial infarctions, and their weight can change over time. In this example we are only interested in cases where the AMI happened while the patient’s BMI was known to be > 30.
The problem is, we cannot go back in time and measure the BMI of the patient on the day the AMI happened. Maybe someone did measure the patient’s weight on that day, but maybe not. So, what do we mean when we say “known to be obese”?
Maybe the only weight measurement available for that person is from 3 years before, or after, the day of the AMI. Or maybe the AMI happened between two weight measurements, one giving BMI > 30 hence obesity, the other giving BMI < 30 hence no obesity. Which one do you pick?
Here, we will pick whichever body measurements are closest in time to each AMI event.
Say a patient:
…that means that on day 40, the closest weight record available was the one from day 1. That record said that the patient was obese. So, we count that as an AMI that happened in a patient known to be obese.
That would not be the case had the AMI been on day 90, for example. Similarly, if that same patient had a second AMI (god forbid) on day 120, or any day after 60, that second AMI would count in the non-obese group.
Let us first compute body mass index like in another vignette.
# Weight component
# Loinc 29463-7 Body weight, OMOP concept ID 3025315
<- sqlt(measurement) |>
weight_component filter(measurement_concept_id == 3025315) |>
make_component(
.ts = measurement_datetime,
.pid = person_id)
# Height component
# Loinc 8302-2 Body height, OMOP concept ID 3036277
<- sqlt(measurement) |>
height_component filter(measurement_concept_id == 3036277) |>
make_component(
.ts = measurement_datetime,
.pid = person_id)
To which we add a component for acute myocardial infarction (AMI),
from table CONDITION_OCCURRENCE
.
# Acute myocardial infarction (AMI) component
# SNOMED 22298006 Myocardial infarction, OMOP concept ID 4329847
<- sqlt(condition_occurrence) |>
ami_component filter(condition_concept_id == 4329847) |>
make_component(
.ts = condition_start_datetime,
.pid = person_id)
Spelling out the BMI formula is straightforward:
bmi = height/(weight * weight)
From which we can write a formula for “is obese”:
is_obese = bmi > 30
Notice that while the first formula is numerical (gives a number),
the second one is logical (gives a TRUE
or
FALSE
). Both are supported transparently, as long as
they’re not mixed, e.g. 3 / TRUE
.
However, doing the same for “has an AMI” may not be as clear. What is the formula for “patient has an AMI event”?
Since we have a column that gives the date when the AMI happened
(it’s called condition_start_datetime
), we can check
whether that column is empty. In each patient, for all phenotype
calculations with a timestamp past the timestamp of an AMI event, that
column will be populated by Phea with the date of that AMI event. For
timestamps prior to the patient’s first AMI, the column is empty. If the
patient has a new AMI, the date is updated.
To check if a column is empty in SQL, we use
is not null
:
has_ami = ami_condition_start_datetime is not null
Armed with this, we can write the final formula for “has AMI and is obese”:
obese_ami = is_obese AND has_ami
# AMI in obese patient
<- calculate_formula(
ami_obese components = list(
weight = weight_component,
height = height_component,
ami = ami_component),
fml = list(
height_in_meters = 'height_value_as_number / 100',
a = list( # the name 'a' doesn't matter
bmi = 'weight_value_as_number / (height_in_meters * height_in_meters)',
has_ami = 'ami_condition_start_datetime is not null'),
b = list( # the name 'b' doesn't matter
has_bmi = 'bmi is not null',
is_obese = 'bmi > 30'),
ami_obese = 'has_ami and is_obese'
)
)#> Called from: calculate_formula(components = list(weight = weight_component,
#> height = height_component, ami = ami_component), fml = list(height_in_meters = "height_value_as_number / 100",
#> a = list(bmi = "weight_value_as_number / (height_in_meters * height_in_meters)",
#> has_ami = "ami_condition_start_datetime is not null"),
#> b = list(has_bmi = "bmi is not null", is_obese = "bmi > 30"),
#> ami_obese = "has_ami and is_obese"))
#> debug: board <- dplyr::mutate(board, window = dplyr::sql(sql_ts_greatest) -
#> dplyr::sql(sql_ts_least), phea_ts_row = dplyr::sql(sql_txt))
#> debug: if (require_all || !is.null(dont_require)) {
#> required_components <- setdiff(names(components), dont_require)
#> if (length(required_components) > 0) {
#> sql_txt <- paste0(DBI::dbQuoteIdentifier(paste0(required_components,
#> "_ts"), conn = .pheaglobalenv$con), " is not null",
#> collapse = " and ")
#> if (has_content(window)) {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row &&
#> dplyr::sql(sql_txt) && window < local(window))
#> }
#> else {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row &&
#> dplyr::sql(sql_txt))
#> }
#> }
#> else {
#> if (has_content(window)) {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row &&
#> window < local(window))
#> }
#> else {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row)
#> }
#> }
#> } else {
#> if (has_content(window)) {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row &&
#> window < local(window))
#> }
#> else {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row)
#> }
#> }
#> debug: if (has_content(window)) {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row &&
#> window < local(window))
#> } else {
#> board <- dplyr::filter(board, phea_row_id == phea_ts_row)
#> }
#> debug: board <- dplyr::filter(board, phea_row_id == phea_ts_row)
#> debug: if (!is.null(filters) && any(!is.na(filters))) {
#> sql_txt <- paste0("(", paste0(filters[!is.na(filters)], collapse = ") AND ("),
#> ")")
#> board <- filter(board, sql(sql_txt))
#> }
#> debug: if (isFALSE(is.na(limit))) board <- head(board, n = lim)
#> debug: board <- dplyr::select(board, phea_row_id, pid, ts, window, !!!g_vars)
#> debug: res_vars <- NULL
#> debug: if (!is.null(fml)) {
#> if (cascaded) {
#> for (i in seq(fml)) {
#> cur_fml <- fml[[i]]
#> if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml),
#> cur_fml, ~rlang::exprs(`:=`(!!..1, dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> }
#> else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> }
#> }
#> else {
#> if (any(lapply(fml, class) == "list"))
#> stop("Nested formulas require cascaded = TRUE.")
#> res_vars <- c(res_vars, names(fml))
#> commands <- unlist(purrr::map2(names(fml), fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> }
#> }
#> debug: if (cascaded) {
#> for (i in seq(fml)) {
#> cur_fml <- fml[[i]]
#> if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml,
#> ~rlang::exprs(`:=`(!!..1, dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> }
#> else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> }
#> } else {
#> if (any(lapply(fml, class) == "list"))
#> stop("Nested formulas require cascaded = TRUE.")
#> res_vars <- c(res_vars, names(fml))
#> commands <- unlist(purrr::map2(names(fml), fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> }
#> debug: for (i in seq(fml)) {
#> cur_fml <- fml[[i]]
#> if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml,
#> ~rlang::exprs(`:=`(!!..1, dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> }
#> else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> }
#> debug: cur_fml <- fml[[i]]
#> debug: if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> } else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> debug: res_vars <- c(res_vars, names(fml)[i])
#> debug: board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> debug: cur_fml <- fml[[i]]
#> debug: if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> } else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> debug: if (any(lapply(cur_fml, class) == "list")) stop("Formulas cannot be nested deeper than 1 level.")
#> debug: res_vars <- c(res_vars, names(cur_fml))
#> debug: commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> debug: board <- dplyr::mutate(board, !!!commands)
#> debug: cur_fml <- fml[[i]]
#> debug: if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> } else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> debug: if (any(lapply(cur_fml, class) == "list")) stop("Formulas cannot be nested deeper than 1 level.")
#> debug: res_vars <- c(res_vars, names(cur_fml))
#> debug: commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> debug: board <- dplyr::mutate(board, !!!commands)
#> debug: cur_fml <- fml[[i]]
#> debug: if (class(cur_fml) == "list") {
#> if (any(lapply(cur_fml, class) == "list"))
#> stop("Formulas cannot be nested deeper than 1 level.")
#> res_vars <- c(res_vars, names(cur_fml))
#> commands <- unlist(purrr::map2(names(cur_fml), cur_fml, ~rlang::exprs(`:=`(!!..1,
#> dplyr::sql(!!..2)))))
#> board <- dplyr::mutate(board, !!!commands)
#> } else {
#> res_vars <- c(res_vars, names(fml)[i])
#> board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> }
#> debug: res_vars <- c(res_vars, names(fml)[i])
#> debug: board <- dplyr::mutate(board, `:=`(!!rlang::sym(names(fml)[i]),
#> dplyr::sql(cur_fml)))
#> debug: if (class(kco) == "logical") {
#> if (length(kco) > 1)
#> stop("If logical, kco must be of length 1.")
#> if (kco && !is.null(res_vars))
#> board <- keep_change_of(board, res_vars, partition = "pid",
#> order = "ts")
#> } else {
#> if (class(kco) != "character")
#> stop("kco must be logical or character vector.")
#> board <- keep_change_of(board, kco, partition = "pid", order = "ts")
#> }
#> debug: if (length(kco) > 1) stop("If logical, kco must be of length 1.")
#> debug: if (kco && !is.null(res_vars)) board <- keep_change_of(board,
#> res_vars, partition = "pid", order = "ts")
#> debug: board <- dplyr::collapse(board, cte = TRUE)
#> debug: attr(board, "phea") <- "phenotype"
#> debug: attr(board, "phea_res_vars") <- res_vars
#> debug: attr(board, "phea_out_vars") <- g_vars
#> debug: if (get_sql) {
#> return(dbplyr::sql_render(board))
#> } else {
#> return(board)
#> }
#> debug: return(board)
Phea will compute ami_obese
for all points in time in
each patient. If a patient has a single AMI event, but had his body
weight and height measured on 6 different dates, that is 7 computations
of ami_obese
, unless the date of the AMI coincides with one
of the dates of the body measurements.
Since we want to consider only the body measurements made closest in
time to the AMI event, we can filter the rows by the
window
.
Phea returns the window
alongside each phenotype
calculation. The window
is the time distance between the
oldest record, and the most recent one, that went into the phenotype
calculation.
Therefore, in each patient, for each AMI event, whichever row has the
smallest window
, that row is using the body measurements
that are closest possible, in time, to the AMI event.
# Pull the IDs of the patients who ever had an AMI
<- ami_obese |>
ami_patients filter(has_ami) |>
select(pid) |>
distinct() |>
pull()
# Keep only the smallest window for each AMI event:
<- ami_obese |>
ami_obese_f filter(has_ami & has_bmi) |>
pick_row_by('window', c('pid', 'ami_condition_start_datetime'))
head_shot(ami_obese_f) |>
kable()
phea_row_id | pid | ts | window | height_value_as_number | weight_value_as_number | ami_condition_start_datetime | height_in_meters | bmi | has_ami | has_bmi | is_obese | ami_obese |
---|---|---|---|---|---|---|---|---|---|---|---|---|
8098 | 196 | 2007-12-22 | 364 days | 186.0 | 94.4 | 2007-12-22 | 1.860 | 27.28639 | TRUE | TRUE | FALSE | FALSE |
8391 | 204 | 1975-04-20 | 14 days | 176.1 | 93.2 | 1975-04-20 | 1.761 | 30.05365 | TRUE | TRUE | TRUE | TRUE |
10861 | 272 | 1990-12-30 | 105 days | 182.8 | 91.2 | 1990-12-30 | 1.828 | 27.29245 | TRUE | TRUE | FALSE | FALSE |
11404 | 286 | 1987-11-04 | 189 days | 168.9 | 53.3 | 1987-11-04 | 1.689 | 18.68392 | TRUE | TRUE | FALSE | FALSE |
11777 | 295 | 2018-07-08 | 154 days | 181.0 | 91.8 | 2018-07-08 | 1.810 | 28.02112 | TRUE | TRUE | FALSE | FALSE |
12787 | 321 | 2009-01-28 | 112 days | 191.4 | 110.4 | 2009-01-28 | 1.914 | 30.13597 | TRUE | TRUE | TRUE | TRUE |
13092 | 328 | 1989-09-28 | 98 days | 162.4 | 55.3 | 1989-09-28 | 1.624 | 20.96781 | TRUE | TRUE | FALSE | FALSE |
15510 | 379 | 2005-09-09 | 42 days | 182.8 | 92.1 | 2005-09-09 | 1.828 | 27.56178 | TRUE | TRUE | FALSE | FALSE |
15914 | 389 | 2012-03-03 | 147 days | 169.1 | 79.8 | 2012-03-03 | 1.691 | 27.90716 | TRUE | TRUE | FALSE | FALSE |
18539 | 452 | 1932-09-15 | 160 days | 187.3 | 75.5 | 1932-09-15 | 1.873 | 21.52144 | TRUE | TRUE | FALSE | FALSE |
<- ami_obese |>
ami_obese_patients filter(ami_obese) |>
select(pid) |>
distinct() |>
pull()
<- sample(ami_obese_patients, 1)
random_patient message('Sampled patient: ', random_patient)
#> Sampled patient: 13001
|>
ami_obese select(
-has_bmi,
-height_value_as_number,
-ami_condition_start_datetime) |>
phea_plot(pid = random_patient)
#> Collecting lazy table, done. (turn this message off with `verbose` or `.verbose` in setup_phea())
To see the SQL query underlying the phenotype, use helper function
code_shot()
, or dbplyr::sql_render()
, or the
.clip_sql
option in calculate_formula()
.
code_shot(ami_obese)
SELECT *, has_ami and is_obese AS "ami_obese"
FROM (
SELECT *, bmi is not null AS "has_bmi", bmi > 30 AS "is_obese"
FROM (
SELECT
*,
/ (height_in_meters * height_in_meters) AS "bmi",
weight_value_as_number is not null AS "has_ami"
ami_condition_start_datetime FROM (
SELECT
"phea_row_id",
"pid",
"ts",
"window",
"height_value_as_number",
"weight_value_as_number",
"ami_condition_start_datetime",
/ 100 AS "height_in_meters"
height_value_as_number FROM (
SELECT
*,
greatest("weight_ts", "height_ts", "ami_ts") - least("weight_ts", "height_ts", "ami_ts") AS "window",
MAX("phea_row_id") OVER (PARTITION BY "pid", "ts") AS "phea_ts_row"
FROM (
SELECT
row_number() over (order by "pid", "ts") AS "phea_row_id",
"pid",
"ts",
case when "name" = 184746 then "value_as_number" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "weight_value_as_number",
phea_last_value_ignore_nulls(case when "name" = 184746 then "ts" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "weight_ts",
phea_last_value_ignore_nulls(case when "name" = 476286 then "value_as_number" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "height_value_as_number",
phea_last_value_ignore_nulls(case when "name" = 476286 then "ts" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "height_ts",
phea_last_value_ignore_nulls(case when "name" = 144573 then "condition_start_datetime" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "ami_condition_start_datetime",
phea_last_value_ignore_nulls(case when "name" = 144573 then "ts" else null end) OVER (PARTITION BY "pid" ORDER BY "ts" ROWS UNBOUNDED PRECEDING) AS "ami_ts"
phea_last_value_ignore_nulls(FROM (
(SELECT *, NULL AS "condition_start_datetime"
FROM (
(SELECT
184746 AS "name",
"person_id" AS "pid",
"measurement_datetime" AS "ts",
"value_as_number"
FROM "cdm_new_york3"."measurement"
WHERE ("measurement_concept_id" = 3025315.0)
)UNION ALL
(SELECT
476286 AS "name",
"person_id" AS "pid",
"measurement_datetime" AS "ts",
"value_as_number"
FROM "cdm_new_york3"."measurement"
WHERE ("measurement_concept_id" = 3036277.0)
)"q01"
)
)UNION ALL
(SELECT
"name",
"pid",
"ts",
NULL AS "value_as_number",
"condition_start_datetime"
FROM (
SELECT
144573 AS "name",
"person_id" AS "pid",
"condition_start_datetime" AS "ts",
"condition_start_datetime"
FROM "cdm_new_york3"."condition_occurrence"
WHERE ("condition_concept_id" = 4329847.0)
"q02"
)
)"q03"
) "q04"
) "q05"
) WHERE ("phea_row_id" = "phea_ts_row")
"q06"
) "q07"
) "q08" )