Detect event streaks based on specified thresholds
Here is a question we received via heatwaveR’s GitHub page regarding using our package’s functions to detect streaks of outcomes based on a threshold in some lab experiment. The question is as follows:
“I was looking for a function that could help me with data wrangling and I believe the
exceedance()
function is exactly what I’m looking for. I have a dataframe with person IDs, date of treatment, day since treatment, and a lab value. Essentially, I am hoping to find the longest streak of consecutive days where the lab value is below a certain threshold, and extending the streak if the lab value does not rise above the threshold for >3 days.
I wrangled my dataframe into a dataframe with the IDs, date, and lab value as follows …”
Prepare a dataframe with the IDs, date, and lab values
num_subjects <- 5
num_days <- 20
set.seed(13)
subject_ids <- rep(1:num_subjects, each = num_days)
dates <- rep(seq(as.Date("2023-01-01"), by = "1 day", length.out = num_days),
times = num_subjects)
lab_values <- runif(num_subjects * num_days, min = 0, max = 3000)
df <- data.frame(
SubjectID = factor(subject_ids),
Date = as.Date(dates),
Lab = lab_values
)
Calculate streaks lengths based on linear (over time) thresholds
It turns out that one of the other built-in functions can exactly do what you need. We simply need to make some adjustments to the dataframe fed to the detect_event()
function, which is the function that will count the streak lengths.
Calculate a mean value and a threshold; you can calculate an overall mean and threshold, or a mean and threshold for each group— this will depend on your experimental design and hypotheses.
I calculate a mean and threshold based on the pooled data (across A-C):
Even though we calculated the mean value, this is not used; only the threshold is used.
results
is a list of dataframes, one pair of dataframes for each subject. Let us look at the first list element, which is for SubjectID == 1
:
$climatology
SubjectID Date Lab seas thresh threshCriterion
1 1 2023-01-01 2130.96734 1484.172 500 FALSE
2 1 2023-01-02 738.41191 1484.172 500 FALSE
3 1 2023-01-03 1168.90333 1484.172 500 FALSE
4 1 2023-01-04 274.15102 1484.172 500 TRUE
5 1 2023-01-05 2886.19363 1484.172 500 FALSE
6 1 2023-01-06 32.79999 1484.172 500 TRUE
7 1 2023-01-07 1722.88553 1484.172 500 FALSE
8 1 2023-01-08 2293.19397 1484.172 500 FALSE
9 1 2023-01-09 2620.14693 1484.172 500 FALSE
10 1 2023-01-10 123.19006 1484.172 500 TRUE
11 1 2023-01-11 1983.36480 1484.172 500 FALSE
12 1 2023-01-12 2635.11255 1484.172 500 FALSE
13 1 2023-01-13 2671.67709 1484.172 500 FALSE
14 1 2023-01-14 1698.84140 1484.172 500 FALSE
15 1 2023-01-15 1780.64203 1484.172 500 FALSE
16 1 2023-01-16 1093.54355 1484.172 500 FALSE
17 1 2023-01-17 1072.23855 1484.172 500 FALSE
18 1 2023-01-18 1774.37116 1484.172 500 FALSE
19 1 2023-01-19 2596.35438 1484.172 500 FALSE
20 1 2023-01-20 2041.57098 1484.172 500 FALSE
durationCriterion event event_no
1 FALSE FALSE NA
2 FALSE FALSE NA
3 FALSE FALSE NA
4 TRUE TRUE 1
5 FALSE TRUE 1
6 TRUE TRUE 1
7 FALSE TRUE 1
8 FALSE TRUE 1
9 FALSE TRUE 1
10 TRUE TRUE 1
11 FALSE FALSE NA
12 FALSE FALSE NA
13 FALSE FALSE NA
14 FALSE FALSE NA
15 FALSE FALSE NA
16 FALSE FALSE NA
17 FALSE FALSE NA
18 FALSE FALSE NA
19 FALSE FALSE NA
20 FALSE FALSE NA
$event
event_no index_start index_peak index_end duration date_start date_peak
1 1 4 6 10 7 2023-01-04 2023-01-06
date_end intensity_mean intensity_max intensity_var intensity_cumulative
1 2023-01-10 -62.3774 -1451.372 1249.218 -436.6417
intensity_mean_relThresh intensity_max_relThresh intensity_var_relThresh
1 921.7944 -467.2 1249.218
intensity_cumulative_relThresh intensity_mean_abs intensity_max_abs
1 6452.561 1421.794 32.8
intensity_var_abs intensity_cumulative_abs rate_onset rate_decline
1 1249.218 9952.561 -275.4909 -226.7728
The first dataframe is called climatology
and the other is called events
. Don’t worry about the names as the function was initially written for climate events. The climatology
dataframe contains all the data for SubjectID
that were initially supplied in df2
and a few new columns, threshCriterion
, durationCriterion
, event
, and event_no
are added at the end. When the Lab
value dips below the thresh
, threshCriterion
will flag as TRUE
regardless of how long it remains below the threshold. durationCriterion
flags as TRUE
if the number of times threshCriterion
is equal to or greater than minDuration
. event
flags as TRUE
if threshCriterion
is TRUE
AND durationCriterion
is TRUE
. A unique identifier is given for each event in event_no
.
The events
dataframe contains the event_no
, the start and end dates of the event, and the event duration
(your ‘streaks’). Various other summary stats are also calculated, but these might not be relevant for your question. Or are they?
If you are only interested in the event
dataframe and want to combine all the streaks into one table with the results, do this—note the use of ddply()
rather than dlply()
:
results <- plyr::ddply(.data = df2, .variables = "SubjectID", function(sub_df) {
detect_event(sub_df,
x = Date,
y = Lab,
seasClim = seas,
threshClim = thresh,
minDuration = 1,
maxGap = 3,
coldSpell = TRUE,
protoEvents = FALSE)$event
})
results |>
select(SubjectID, event_no, duration)
SubjectID event_no duration
1 1 1 7
2 2 1 7
3 2 2 5
4 3 1 1
5 3 2 8
6 4 1 1
7 4 2 1
8 5 1 3
Reuse
Citation
@online{j._smit2023,
author = {J. Smit, Albertus},
title = {Detect Event Streaks Based on Specified Thresholds},
date = {2023-11-22},
url = {http://tangledbank.netlify.app/blog/2023-11-22-run-lengths/},
langid = {en}
}