This vignette assumes a SQL server at localhost (we use PostgreSQL), with data in OMOP Common Data Model v5.4 format in schema cdm_new_york3. The patient records shown in this example are synthetic data from the Synthea(TM) Patient Generator.
library(phea)
library(dplyr)
# Connect to SQL server.
dbcon <- DBI::dbConnect(RPostgres::Postgres(),
  host = 'localhost', port = 8765, dbname = 'fort',
  user = cred$pg$user, password = cred$pg$pass)
# Call setup_phea so we can use sqlt() and sql0().
setup_phea(dbcon, 'cdm_new_york3')
How complex can a phenotype be in Phea? And how does performance change as you increase the complexity? Do the queries ever become too complex?
Specifically, we consider two dimensions of complexity:
A. How many components you can put into one formula.
B. How many consecutive times you can repeat this: compute a formula, then use its results as a component inside another formula.
In this vignette we will stress-test Phea in dimension A. We are calling this stress test A.
The phenotype we will compute here is not relevant. We just want to test Phea!
These are this vignette’s takeaways:
We will be using table MEASUREMENT from cdm_new_york3. For context, let us measure the size of that table.
size_stats <- sqlt(measurement) |>
  group_by() |>
  summarise(
    rows = n(),
    patients = n_distinct(person_id)) |>
  collect()
measurement_summary <- sql0('
  select measurement_concept_id as concept_id, concept_code, concept_name, vocabulary_id,
    count(*) as records, count(distinct person_id) as patients
  from cdm_new_york3.measurement a
  inner join cdm_new_york3.concept b
    on a.measurement_concept_id = b.concept_id
  group by measurement_concept_id, concept_code, concept_name, vocabulary_id
  order by patients desc, records desc
') |>
  collect()
MEASUREMENT has 9,668,839 rows and 23,065 patients. Let's see the most popular concepts in it.
head(measurement_summary, 10) |>
kable()
concept_id | concept_code | concept_name | vocabulary_id | records | patients |
---|---|---|---|---|---|
43055141 | 72514-3 | Pain severity - 0-10 verbal numeric rating [Score] - Reported | LOINC | 548651 | 23046 |
3000963 | 718-7 | Hemoglobin [Mass/volume] in Blood | LOINC | 98650 | 23044 |
3009744 | 786-4 | MCHC [Mass/volume] by Automated count | LOINC | 98245 | 23044 |
3020416 | 789-8 | Erythrocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3023599 | 787-2 | MCV [Entitic volume] by Automated count | LOINC | 98245 | 23044 |
3024929 | 777-3 | Platelets [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3000905 | 6690-2 | Leukocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
3012030 | 785-6 | MCH [Entitic mass] by Automated count | LOINC | 98245 | 23044 |
3025315 | 29463-7 | Body weight | LOINC | 470537 | 23042 |
3023314 | 4544-3 | Hematocrit [Volume Fraction] of Blood by Automated count | LOINC | 94026 | 23042 |
That is the data we will use to benchmark Phea. Each component will be exactly like this:
SELECT *
FROM cdm_new_york3.measurement
WHERE measurement_concept_id = [chosen concept]
In other words, each component uses a different measurement_concept_id, from the summary table, to SELECT from MEASUREMENT.
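To make that concrete, here is what one such component looks like when built by hand, using the same calls that the helper function further below will use. (The names hb_records and hb_component are illustrative; concept 3000963 is Hemoglobin from the summary table above.)

# Illustrative example of a single component; the helper function below
# builds N of these in a loop. Concept 3000963 is Hemoglobin [Mass/volume] in Blood.
hb_records <- sqlt(measurement) |>
  filter(measurement_concept_id == 3000963)

hb_component <- make_component(hb_records,
  .ts = measurement_datetime,
  .pid = person_id)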
We will be picking rows (concepts) from the summary table starting from the first, so, to reduce bias, let us shuffle them.
# Shuffle the rows
set.seed(42)
measurement_summary <- measurement_summary[sample(1:nrow(measurement_summary)), ]
This was the machine used to execute this test:
Processor Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz 2.59 GHz
Installed RAM 64.0 GB (63.8 GB usable)
System type 64-bit operating system, x64-based processor
For each run of test A, we will pick N concepts, build a component for each (C1 to Cn), and compute the following formula:
test_a = C1_value_as_number + C2_value_as_number + C3_value_as_number + … + CN_value_as_number
In other words, we will just add up the value_as_number of each component.
We do not want to put a NULL into that sum, because it would cause the entire result to permanently collapse to NULL. This is because any number plus NULL equals NULL. To circumvent that, we will use function coalesce(). If a component Cx is NULL, the number 0 will be used instead. So the actual formula we'll use is:
test_a = coalesce(C1_value_as_number, 0) + coalesce(C2_value_as_number, 0) + coalesce(C3_value_as_number, 0) + … + coalesce(CN_value_as_number, 0)
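To make that concrete, here is how the formula string would be assembled for N = 3, using the same paste0() call that run_test_a() uses further below (component_names is just an illustrative variable):

# Illustrative only: building the formula string for N = 3 components.
component_names <- c('c1', 'c2', 'c3')
paste0('coalesce(', component_names, '_value_as_number, 0)', collapse = ' + ')
#> [1] "coalesce(c1_value_as_number, 0) + coalesce(c2_value_as_number, 0) + coalesce(c3_value_as_number, 0)"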
From the finalized phenotype, we will just count the rows, unique patients, and sum the result across patients. This is just to force the SQL server to compute the phenotype entirely, for all patients, so we can measure how long it takes.
In each run, we use library tictoc to measure the time spent.
library(tictoc)
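For reference, this is the timing pattern used inside the test function below: tic() starts the timer, toc(quiet = TRUE) returns a list with $tic and $toc (in seconds), and their difference is the elapsed time. (Sys.sleep() here just stands in for the work being timed.)

tic()
Sys.sleep(1)
t <- toc(quiet = TRUE)
t$toc - t$tic  # elapsed time in seconds, roughly 1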
Function produce_N_components(N) produces N components using the first N rows of measurement_summary.
produce_N_components <- function(N) {
  make_test_component <- function(i) {
    concept_id <- measurement_summary$concept_id[i]
    records <- sqlt(measurement) |>
      filter(measurement_concept_id == local(concept_id))
    component <- make_component(records,
      .ts = measurement_datetime,
      .pid = person_id)
    return(component)
  }

  # Pick the first N lab tests
  components <- lapply(1:N, make_test_component)

  # Their names will be simply "Cn"
  names(components) <- paste0('c', 1:length(components))
  return(components)
}
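As a quick sanity check of what the helper returns: for N = 3 it yields a named list of three components, and those names become the column prefixes (c1_value_as_number, c2_value_as_number, …) that the formula will reference.

# For example:
names(produce_N_components(3))
#> [1] "c1" "c2" "c3"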
Function run_test_a(N) runs Test A using N components.
run_test_a <- function(N) {
  wrap(paste0('test_a_N', N), by_name = TRUE, pass_val = TRUE, {
    components <- produce_N_components(N)

    formula_string <- paste0('coalesce(', names(components), '_value_as_number, 0)', collapse = ' + ')

    tic()
    phen <- calculate_formula(
      components = components,
      fml = list(test_a = formula_string))
    phea_time <- toc(quiet = TRUE)
    phea_time <- phea_time$toc - phea_time$tic

    tic()
    result <- phen |>
      group_by() |>
      summarise(
        rows = n(),
        patients = n_distinct(pid),
        test_a_sum = sum(test_a/100000, na.rm = TRUE)) |>
      collect()
    query_time <- toc(quiet = TRUE)
    query_time <- query_time$toc - query_time$tic

    result_rows <- result$rows
    result_patients <- result$patients
    test_a_sum <- result$test_a_sum

    # Compute component stats
    input_rows <- lapply(components, \(x) summarise(x$rec_source$records, rows = n()) |> pull('rows')) |>
      purrr::reduce(`+`)

    input_patients <- lapply(components, \(x) select(x$rec_source$records, person_id)) |>
      purrr::reduce(union_all) |>
      summarise(patients = n_distinct(person_id)) |>
      pull('patients')

    phen_head <- head(phen, 5) |> collect()

    res <- tibble(
      N = N,
      input_rows = input_rows,
      result_rows = result_rows,
      test_a_sum = test_a_sum,
      input_patients = input_patients,
      result_patients = result_patients,
      phea_time = phea_time,
      query_time = query_time,
      pps_query = input_patients/query_time,
      rps_query = input_rows/query_time,
      phen_head = list(phen_head)
    )
    return(res)
  })
}
Function print_test_a_results() will also be used. For brevity it is not printed here; you can see it by downloading the markdown file stress_test_a.Rmd.
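For readers who just want to follow along, here is a rough, hypothetical sketch of what such a helper could look like, assuming it simply stacks the run results and prints them with kable(). This is not the actual function from stress_test_a.Rmd, which also abbreviates large counts (e.g. "12.9k") before printing.

# Hypothetical sketch only -- see stress_test_a.Rmd for the real print_test_a_results().
print_test_a_results_sketch <- function() {
  test_a_runs |>
    lapply(\(x) select(x, -phen_head)) |>
    bind_rows() |>
    kable()
}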
test_a_runs = list()
Let’s first run Test A with N = 1, that is, a single component.
test_a_runs[[length(test_a_runs)+1]] <- run_test_a(1)
Done! Now let us briefly peek at the results of the phenotype itself.
test_a_runs[[length(test_a_runs)]]$phen_head[[1]] |>
  kable()
row_id | pid | ts | window | c1_value_as_number | test_a |
---|---|---|---|---|---|
1 | 6 | 2011-07-13 | 00:00:00 | NA | 0 |
2 | 6 | 2014-07-30 | 00:00:00 | NA | 0 |
3 | 17 | 2021-12-26 | 00:00:00 | NA | 0 |
4 | 26 | 2017-11-05 | 00:00:00 | NA | 0 |
5 | 28 | 2017-04-28 | 00:00:00 | NA | 0 |
Column test_a is the result of the formula, i.e. the sum of the components. As we can see, there are many NAs in value_as_number in our table MEASUREMENT, but that is not an issue.
How was the performance? Let’s see.
print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0 | 5136 | 5136 | 0.5 | 0.8 | 6420 | 16.1k |
Here are the meanings of the columns:

- N: number of components in the formula.
- in. rows: total rows across the components' input record sources.
- res. rows: rows in the phenotype produced by calculate_formula().
- test_a_sum: sum of test_a (divided by 100,000) across all rows of the phenotype.
- in. patients: distinct patients across the components' input record sources.
- res. patients: distinct patients in the phenotype produced by calculate_formula().
- Phea time (s): time Phea took to assemble the query in calculate_formula(), in seconds.
- query time (s): time the SQL server took to compute (collect()) the phenotype (aggregated), in seconds.
- query pps: input patients per second of query time.
- query rps: input rows per second of query time.

With N = 1 (a single component), Phea took 0.5 seconds to assemble the phenotype, and the SQL server took 0.8 seconds to run the query. That is 16,086 input rows per second of query time.
test_a_runs[[length(test_a_runs)+1]] <- run_test_a(5)

print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |
With N = 5 (five components), Phea took 1.17 sec to assemble the phenotype, and the SQL server took 4.44 sec to run the query. That is 15,916 input rows per second of query time.
Now that we have a grip on how Test A works, let’s run it for N = 10 to 150, increasing 10 at a time.
N <- 10
N_max <- 150
while(N <= N_max) {
  test_a_runs[[length(test_a_runs)+1]] <- run_test_a(N)
  N <- N + 10
}
print_test_a_results()
N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
---|---|---|---|---|---|---|---|---|---|
1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |
10 | 205.9k | 161.8k | 4.4 | 17673 | 17673 | 1.9 | 10.1 | 1755.0 | 20.4k |
20 | 576.1k | 377.7k | 11.2 | 20867 | 20867 | 3.4 | 33.4 | 624.6 | 17.2k |
30 | 1118.5k | 439.9k | 975.0 | 23052 | 23052 | 5.3 | 121.9 | 189.1 | 9.2k |
40 | 1536.7k | 574.2k | 1743.8 | 23053 | 23053 | 7.4 | 234.7 | 98.2 | 6.5k |
50 | 1797.3k | 575k | 3383.9 | 23053 | 23053 | 8.2 | 348.4 | 66.2 | 5.2k |
60 | 2153.1k | 575.3k | 3766.4 | 23053 | 23053 | 9.9 | 612.9 | 37.6 | 3.5k |
70 | 3539.8k | 637.1k | 5422.1 | 23053 | 23053 | 11.5 | 1275.4 | 18.1 | 2.8k |
80 | 3568.5k | 637.4k | 5485.4 | 23053 | 23053 | 13.4 | 1570.2 | 14.7 | 2.3k |
90 | 3752.7k | 662.5k | 7690.8 | 23056 | 23056 | 17.6 | 2017.3 | 11.4 | 1.9k |
100 | 3809.5k | 662.5k | 7708.4 | 23056 | 23056 | 18.2 | 2383.0 | 9.7 | 1.6k |
110 | 4590.1k | 682.7k | 8057.5 | 23059 | 23059 | 20.7 | 3477.5 | 6.6 | 1.3k |
120 | 5279.3k | 709.7k | 10484.0 | 23059 | 23059 | 21.9 | 5483.3 | 4.2 | 1k |
130 | 5703.5k | 722.8k | 10721.4 | 23064 | 23064 | 30.5 | 7381.4 | 3.1 | 0.8k |
140 | 6536.1k | 722.9k | 11997.7 | 23064 | 23064 | 26.7 | 9472.8 | 2.4 | 0.7k |
150 | 6843.3k | 722.9k | 12752.9 | 23064 | 23064 | 29.6 | 11618.8 | 2.0 | 0.6k |
That is a lot of numbers. Let us do some plots. For brevity, the code that makes the plots is hidden. Please download stress_test_a.Rmd from GitHub if you'd like to see it.
library(plotly)
As we increase the number of components, the query time increases at a sub-exponential rate (on the plot's logarithmic scale, exponential growth would appear as a straight line).
Same as before: as we increase the number of input rows (and the number of components), the query time increases at a sub-exponential rate.
As we increase the number of components, fewer rows per second are computed, because each row takes longer. This is on top of the fact that the more components you add, the more input rows there are, because each component comes from a different SQL query (in Phea's parlance, a different record source). Nevertheless, these two factors are not enough to cause exponential growth of query time: the rate of growth in query time, with respect to the number of rows or components, tapers off as N increases.
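For reference, the first plot (query time versus N) could be reproduced roughly as follows. This is a hypothetical sketch, not the vignette's hidden plotting code; it assumes the runs are first combined into a single data frame (results is an illustrative name).

# Hypothetical sketch, not the vignette's actual (hidden) plotting code.
results <- bind_rows(lapply(test_a_runs, \(x) select(x, -phen_head)))

plot_ly(results, x = ~N, y = ~query_time,
        type = 'scatter', mode = 'lines+markers') |>
  layout(xaxis = list(title = 'number of components (N)'),
         yaxis = list(type = 'log', title = 'query time (s)'))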
Takeaways from the results: