Stress-testing Phea, test A

This vignette assumes a SQL server at localhost (we use PostgreSQL), with data in OMOP Common Data Model v5.4 format in schema cdm_new_york3. The patient records shown in this example are synthetic data from Synthea(TM) Patient Generator.

library(phea)
library(dplyr)

# Connect to SQL server.
dbcon <- DBI::dbConnect(RPostgres::Postgres(),
  host = 'localhost', port = 8765, dbname = 'fort',
  user = cred$pg$user, password = cred$pg$pass)

# Call setup_phea so we can use sqlt() and sql0().
setup_phea(dbcon, 'cdm_new_york3')
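For context, here is a quick illustration of those two helpers, based on how they are used later in this vignette: sqlt() takes a bare table name in the configured schema and returns a lazy dplyr table, while sql0() runs a raw SQL string. This is a sketch only, and it assumes the standard OMOP PERSON table is present in the schema.

# Illustration only: sqlt() gives a lazy tbl, sql0() runs raw SQL.
sqlt(person) |> head(3) |> collect()
sql0('select count(*) as n from cdm_new_york3.person') |> collect()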

How complex can a phenotype be in Phea? And how does performance change as you increase the complexity? Do the queries ever become too complex?

Specifically, we consider two dimensions of complexity:

A. How many components you can put into one formula.
B. How many consecutive times you can repeat this: compute a formula, then use its results as a component inside another formula.

In this vignette we will stress-test Phea in dimension A. We are calling this stress test A:

  1. Compute formulas with 1 to 150 components, each coming from a different SQL query.

The phenotype we will compute here is not relevant. We just want to test Phea!

TL;DR

These are this vignette’s takeaways:

The dataset

We will be using table MEASUREMENT from cdm_new_york3. For context, let us measure the size of that table.

size_stats <- sqlt(measurement) |>
  group_by() |>
  summarise(
    rows = n(),
    patients = n_distinct(person_id)) |>
  collect()

measurement_summary <- sql0('
    select measurement_concept_id as concept_id, concept_code, concept_name, vocabulary_id,
      count(*) as records, count(distinct person_id) as patients
    from cdm_new_york3.measurement a
    inner join cdm_new_york3.concept b
    on a.measurement_concept_id = b.concept_id
    group by measurement_concept_id, concept_code, concept_name, vocabulary_id
    order by patients desc, records desc
  ') |>
  collect()

MEASUREMENT has 9,668,839 rows and 23,065 patients. Let’s see the most popular concepts in it.

head(measurement_summary, 10) |>
  kable()
| concept_id | concept_code | concept_name | vocabulary_id | records | patients |
|---|---|---|---|---|---|
| 43055141 | 72514-3 | Pain severity - 0-10 verbal numeric rating [Score] - Reported | LOINC | 548651 | 23046 |
| 3000963 | 718-7 | Hemoglobin [Mass/volume] in Blood | LOINC | 98650 | 23044 |
| 3009744 | 786-4 | MCHC [Mass/volume] by Automated count | LOINC | 98245 | 23044 |
| 3020416 | 789-8 | Erythrocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
| 3023599 | 787-2 | MCV [Entitic volume] by Automated count | LOINC | 98245 | 23044 |
| 3024929 | 777-3 | Platelets [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
| 3000905 | 6690-2 | Leukocytes [#/volume] in Blood by Automated count | LOINC | 98245 | 23044 |
| 3012030 | 785-6 | MCH [Entitic mass] by Automated count | LOINC | 98245 | 23044 |
| 3025315 | 29463-7 | Body weight | LOINC | 470537 | 23042 |
| 3023314 | 4544-3 | Hematocrit [Volume Fraction] of Blood by Automated count | LOINC | 94026 | 23042 |

That is the data we will use to benchmark Phea. Each component will be exactly like this:

SELECT *
FROM cdm_new_york3.measurement
WHERE measurement_concept_id = [chosen concept]

In other words, each component uses a different measurement_concept_id, from the summary table, to SELECT from MEASUREMENT.
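In dplyr terms, one such record source and its component can be sketched as follows. This is a minimal illustration, not part of the test; concept 3025315 (Body weight) is just an example taken from the summary above.

# Minimal sketch of a single component (concept 3025315 = Body weight, chosen arbitrarily).
body_weight_records <- sqlt(measurement) |>
  filter(measurement_concept_id == 3025315)

body_weight_component <- make_component(body_weight_records,
  .ts = measurement_datetime,
  .pid = person_id)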

We will be picking rows (concepts) from the summary table starting from the first, so, to reduce bias, let us shuffle them.

# Shuffle the rows
set.seed(42)
measurement_summary <- measurement_summary[sample(1:nrow(measurement_summary)),]

Computer used to run this test

This was the machine used to execute this test:

Processor Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz 2.59 GHz
Installed RAM 64.0 GB (63.8 GB usable)
System type 64-bit operating system, x64-based processor

Formula used for testing

For each run of test A, we will pick N concepts, build a component for each (C1 to Cn), and compute the following formula:

test_a = C1_value_as_number + C2_value_as_number + C3_value_as_number + … + Cn_value_as_number

In other words, we will just add up the value_as_number of each component.

We do not want to put a NULL into that sum, because it would cause the entire result to collapse to NULL: any number plus NULL equals NULL. To circumvent that, we will use the function coalesce(). If a component Cx is NULL, the number 0 will be used instead. So the actual formula we'll use is:

test_a = coalesce(C1_value_as_number, 0) + coalesce(C2_value_as_number, 0) + coalesce(C3_value_as_number, 0) + … + coalesce(Cn_value_as_number, 0)
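As a concrete illustration, for a hypothetical N = 3 the formula string (assembled the same way as in the test function further below) would read:

# Illustration only: the formula string for a hypothetical N = 3.
component_names <- paste0('c', 1:3)
paste0('coalesce(', component_names, '_value_as_number, 0)', collapse = ' + ')
#> [1] "coalesce(c1_value_as_number, 0) + coalesce(c2_value_as_number, 0) + coalesce(c3_value_as_number, 0)"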

From the finalized phenotype, we will just count the rows, unique patients, and sum the result across patients. This is just to force the SQL server to compute the phenotype entirely, for all patients, so we can measure how long it takes.

In each run, we use library tictoc to measure the time spent.

library(tictoc)
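For clarity, this is the timing pattern used throughout: tic() starts the clock, toc(quiet = TRUE) returns the start and stop times, and their difference is the elapsed time in seconds. In this sketch, Sys.sleep() merely stands in for the work being timed.

# Timing pattern used in this vignette (Sys.sleep() stands in for the work being timed).
tic()
Sys.sleep(1)
t <- toc(quiet = TRUE)
t$toc - t$tic # elapsed time, in seconds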

Functions used for the test

Function produce_N_components(N) produces N components using the first N rows of measurement_summary.

produce_N_components <- function(N) {
  make_test_component <- function(i) {
    concept_id <- measurement_summary$concept_id[i]
    records <- sqlt(measurement) |>
      filter(measurement_concept_id == local(concept_id))
    component <- make_component(records,
      .ts = measurement_datetime,
      .pid = person_id)
    return(component)
  }
  # Pick the first N lab tests
  components <- lapply(1:N, make_test_component)
  
  # Their names will simply be "c1", "c2", ..., "cN"
  names(components) <- paste0('c', 1:length(components))
  
  return(components)
}
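For example, produce_N_components(3) returns a named list of three components, and those names are what the formula string refers to. A quick sanity check (not part of the benchmark):

# Quick sanity check: the component names drive the *_value_as_number columns.
cmp <- produce_N_components(3)
names(cmp)
#> [1] "c1" "c2" "c3"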

Function run_test_a(N) runs Test A using N components.

run_test_a <- function(N) {
  wrap(paste0('test_a_N', N), by_name = TRUE, pass_val = TRUE, {
    components <- produce_N_components(N)
    
    formula_string <- paste0('coalesce(', names(components), '_value_as_number, 0)', collapse = ' + ')
  
    tic()
    phen <- calculate_formula(
      components = components,
      fml = list(test_a = formula_string))
    phea_time <- toc(quiet = TRUE)
    phea_time <- phea_time$toc - phea_time$tic
    
    tic()
    result <- phen |>
      group_by() |>
      summarise(
        rows = n(),
        patients = n_distinct(pid),
        test_a_sum = sum(test_a/100000, na.rm = TRUE)) |>
      collect()
    query_time <- toc(quiet = TRUE)
    query_time <- query_time$toc - query_time$tic
    
    result_rows <- result$rows
    result_patients <- result$patients
    test_a_sum <- result$test_a_sum
    
    # Compute component stats
    input_rows <- lapply(components, \(x) summarise(x$rec_source$records, rows = n()) |> pull('rows')) |>
      purrr::reduce(`+`)
    
    input_patients <- lapply(components, \(x) select(x$rec_source$records, person_id)) |>
      purrr::reduce(union_all) |>
      summarise(patients = n_distinct(person_id)) |>
      pull('patients')
    
    phen_head <- head(phen, 5) |> collect()
    
    res <- tibble(
      N = N,
      
      input_rows = input_rows,
      result_rows = result_rows,
      test_a_sum = test_a_sum,
      
      input_patients = input_patients,
      result_patients = result_patients,
      
      phea_time = phea_time,
      query_time = query_time,
      
      pps_query = input_patients/query_time,
      rps_query = input_rows/query_time,
      
      phen_head = list(phen_head)
    )
    
    return(res)
  })
}

Function print_test_a_results() will also be used. For brevity it is not printed here. You can see it by downloading the markdown file stress_test_a.Rmd.
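For the curious, below is a minimal sketch of what such a function could look like. It is not the actual implementation from the .Rmd; it simply assumes test_a_runs is the list of one-row tibbles returned by run_test_a() and drops the phen_head list-column before printing.

# Minimal sketch only -- not the actual print_test_a_results() from stress_test_a.Rmd.
print_test_a_results_sketch <- function() {
  dplyr::bind_rows(test_a_runs) |>
    dplyr::select(-phen_head) |>
    knitr::kable()
}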

Run N = 1

test_a_runs = list()

Let’s first run Test A with N = 1, that is, a single component.

test_a_runs[[length(test_a_runs)+1]] <- run_test_a(1)

Done! Now let us briefly peek at the results of the phenotype itself.

test_a_runs[[length(test_a_runs)]]$phen_head[[1]] |>
  kable()
| row_id | pid | ts | window | c1_value_as_number | test_a |
|---|---|---|---|---|---|
| 1 | 6 | 2011-07-13 00:00:00 | | NA | 0 |
| 2 | 6 | 2014-07-30 00:00:00 | | NA | 0 |
| 3 | 17 | 2021-12-26 00:00:00 | | NA | 0 |
| 4 | 26 | 2017-11-05 00:00:00 | | NA | 0 |
| 5 | 28 | 2017-04-28 00:00:00 | | NA | 0 |

Column test_a is the result of the formula, i.e. the sum of the components. Apparently there are many NAs in value_as_number in our table MEASUREMENT, but that is not an issue.

How was the performance? Let’s see.

print_test_a_results()
| N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.9k | 12.9k | 0 | 5136 | 5136 | 0.5 | 0.8 | 6420 | 16.1k |

Here are the meanings of the columns:

  N: number of components in the formula.
  in. rows: total input rows across the components' record sources.
  res. rows: number of rows in the resulting phenotype.
  test_a_sum: sum of the formula's result over all rows, divided by 100,000.
  in. patients: number of distinct patients across the components' record sources.
  res. patients: number of distinct patients in the resulting phenotype.
  Phea time (s): seconds Phea took to assemble the phenotype (build the lazy query).
  query time (s): seconds the SQL server took to compute the phenotype.
  query pps: input patients per second of query time.
  query rps: input rows per second of query time.

With N = 1 (a single component), Phea took 0.5 seconds to assemble the phenotype, and the SQL server took 0.8 seconds to run the query. That is 16,086 input rows per second of query time.
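As a quick sanity check of those throughput columns, using the rounded figures from the table:

# Rounded sanity check of the throughput columns for N = 1.
12.9e3 / 0.8 # query rps: ~16.1k input rows per second of query time
5136 / 0.8   # query pps: 6420 input patients per second of query time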

Run N = 5

test_a_runs[[length(test_a_runs)+1]] <- run_test_a(5)

print_test_a_results()
| N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
| 5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |

With N = 5 (five components), Phea took 1.17 sec to assemble the phenotype, and the SQL server took 4.44 sec to run the query. That is 15,916 input rows per second of query time.

Runs N = 10 to N = 150

Now that we have a grip on how Test A works, let’s run it for N = 10 to 150, increasing 10 at a time.

N <- 10
N_max <- 150
while(N <= N_max) {
  test_a_runs[[length(test_a_runs)+1]] <- run_test_a(N)
  N <- N + 10
}

Test A results

print_test_a_results()
| N | in. rows | res. rows | test_a_sum | in. patients | res. patients | Phea time (s) | query time (s) | query pps | query rps |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 12.9k | 12.9k | 0.0 | 5136 | 5136 | 0.5 | 0.8 | 6420.0 | 16.1k |
| 5 | 70.7k | 65.5k | 3.5 | 9548 | 9548 | 1.2 | 4.4 | 2150.5 | 15.9k |
| 10 | 205.9k | 161.8k | 4.4 | 17673 | 17673 | 1.9 | 10.1 | 1755.0 | 20.4k |
| 20 | 576.1k | 377.7k | 11.2 | 20867 | 20867 | 3.4 | 33.4 | 624.6 | 17.2k |
| 30 | 1118.5k | 439.9k | 975.0 | 23052 | 23052 | 5.3 | 121.9 | 189.1 | 9.2k |
| 40 | 1536.7k | 574.2k | 1743.8 | 23053 | 23053 | 7.4 | 234.7 | 98.2 | 6.5k |
| 50 | 1797.3k | 575k | 3383.9 | 23053 | 23053 | 8.2 | 348.4 | 66.2 | 5.2k |
| 60 | 2153.1k | 575.3k | 3766.4 | 23053 | 23053 | 9.9 | 612.9 | 37.6 | 3.5k |
| 70 | 3539.8k | 637.1k | 5422.1 | 23053 | 23053 | 11.5 | 1275.4 | 18.1 | 2.8k |
| 80 | 3568.5k | 637.4k | 5485.4 | 23053 | 23053 | 13.4 | 1570.2 | 14.7 | 2.3k |
| 90 | 3752.7k | 662.5k | 7690.8 | 23056 | 23056 | 17.6 | 2017.3 | 11.4 | 1.9k |
| 100 | 3809.5k | 662.5k | 7708.4 | 23056 | 23056 | 18.2 | 2383.0 | 9.7 | 1.6k |
| 110 | 4590.1k | 682.7k | 8057.5 | 23059 | 23059 | 20.7 | 3477.5 | 6.6 | 1.3k |
| 120 | 5279.3k | 709.7k | 10484.0 | 23059 | 23059 | 21.9 | 5483.3 | 4.2 | 1k |
| 130 | 5703.5k | 722.8k | 10721.4 | 23064 | 23064 | 30.5 | 7381.4 | 3.1 | 0.8k |
| 140 | 6536.1k | 722.9k | 11997.7 | 23064 | 23064 | 26.7 | 9472.8 | 2.4 | 0.7k |
| 150 | 6843.3k | 722.9k | 12752.9 | 23064 | 23064 | 29.6 | 11618.8 | 2.0 | 0.6k |

That is a lot of numbers. Let us do some plots. For brevity, the code that makes the plots is hidden. Please download stress_test_a.Rmd from GitHub if you’d like to see it.

library(plotly)

As we increase the number of components, the query time increases at a sub-exponential rate (on a logarithmic scale, exponential growth would appear as a straight line).

Same as before: as we increase the number of input rows (and the number of components), the query time increases at a sub-exponential rate.

As we increase the number of components, fewer rows per second are computed, because each row takes longer. On top of that, the more components you add, the more input rows there are, because each component comes from a different SQL query (in Phea's parlance, a different record source). Nevertheless, these two factors are not enough to cause exponential growth of query time: the rate of growth in query time, with respect to the number of rows or components, tapers off as N increases.
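To make that tapering concrete, compare the ratio of query times between consecutive runs at the low and high ends of the results table:

# Ratio of query time between consecutive runs, taken from the results table above.
1275.4 / 612.9   # N = 60 -> 70: about 2.1x
11618.8 / 9472.8 # N = 140 -> 150: about 1.2x, i.e. the growth rate tapers off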

Takeaways from the results:

Author contact

Fabrício Kury
2022/Dec/16
You are always welcome to reach me at .