This website has three main purposes:

Provides the code we used for
Loading the data into R;
- Exploring the UNOS data, (e.g., identifying the number of columns, observations, % data missing within each column);
- Summarizing and visualizing the data;
- Data pre-processing;
Obtaining the machine learning results; and
Applying isotonic regression to calibrate the survival probabilities
Provides the results that we have obtained for each stage of the analysis.
Allows for reusing our code and reproducing our results, which in our view is a major limitation in the majority of the published literature since they do not provide code and it is often unclear what are some of the decision made in terms of data pre-processing and modeling.

Data was provided from the UNOS registry by staff at the US, United Network for Organ Sharing. Per the UNOS regulations, we cannot share their data. For obtaining the data, the reader should click here. We also highly recommend that the reader gets familiarized with the data dictionary (see Data Definitions for details).

To navigate this site, please feel free to use the navigation bar at the left. The reader can show any code chunk by clicking on the code button. We chose to make the default for the code hidden since we: (a) wanted to improve the readability of this document; and (b) assumed that the readers will not be interested in reading every code chunk.

A list of the customized functions we created for this study can be found in the custom_functions.R.

1 Overview of Our Framework

The following figure provides an overivew of the application of our two-stage framework for obtaining monotonically decreasing survival probabilities. Our goals for Stage I are to select an appropriate model based on its predictive performance and examine how the selected variables importance change over time.

In Stage II, we calibrate the survival probabilities such that they are monotonically decreasing over time.

Adopting the proposed two-stage framework

2 State I: Survival Probability Prediction at Each Time Point

Stage I is comprised of the following steps:

raw data undergoing detailed data preparation
use subsampling techniques to handle the imbalance between graft survivals and failures at a give time point
use a structured variable selection technique to improve the performance of machine models
use of machine learning models to predict survival probabilities for the various time periods

2.1 Reading and Exploring the UNOS Dataset

The snippet below documents the list of R packages and functions that were used in this research. For convenience, we used the pacman package since it allows for installing/loading the needed packages in one step.

rm(list = ls()) # clear global environment
graphics.off() # close all graphics
if(require(pacman)==FALSE) install.packages("pacman") # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
pacman::p_load(AUC, Biocomb, car, caret, conflicted, DataExplorer, # important for exploratory data analysis 
               dataPreparation, data.table, DT, haven, magrittr, mltools, party, readxl, tidyverse, # important analytic packages
                snow, varImp
)

conflict_prefer("arrange", "dplyr")
conflict_prefer("recode", "dplyr")
conflict_prefer("mutate", "dplyr")
conflict_prefer("makeCluster", "snow")

source("https://raw.githubusercontent.com/Ying-Ju/heart_transplant.github.io/master/custom_functions.R") # Has our custom functions

2.1.1 Loading the UNOS Dataset

df <- read_sas("../Data/thoracic_data.sas7bdat")
df$ID <- row.names(df) # Index to be used in creating/tracking the training and test samples
orgTable <- table(df[, 'WL_ORG'])
df %<>% dplyr::filter(WL_ORG == "HR") # filtering to keep only heart transplants

Our original UNOS data consisted of 159318 observations, which contained: (a) 428 unspecified transplant types, (b) 3148 combined heart-lung transplants, (c) 103570 heart transplants, and (d) 52172 lung transplants. Given our focus on heart transplants, we have only retained the observations where only a heart transplant is performed. The resulting data is resaved into our data.frame titled df, which now consists of 103570 observations and 495 variables (which have been unchanged from our filtering operation).

2.1.2 Exploring the UNOS Dataset

In the following code chunk, we remove the time zone in the variable YR_ENTRY_US_TCR so it has the consistent date format with other variables related to dates before we study the distribution of missing values in the data.

# The variable YR_ENTRY_US_TCR was recorded with time zone, we removed the time zone.
df$YR_ENTRY_US_TCR <- as.Date(df$YR_ENTRY_US_TCR)

Fraction Data Missing

naVec <- is.na(df) %>% colSums() %>% 
  `/`(nrow(df)) # To make it percent missing per variable
names(naVec) <- colnames(df)
naVec.quantiles <- quantile(naVec) %>% round(2)

ggplot() + geom_histogram(aes(naVec), bins = 10) + 
  xlab("Fraction Data Missing Per Variable") + ylab("Number of Variables")+theme_minimal()

Based on the heart transplantation data, 238 variables out of the 103570 have at least one data point missing data. Furthermore, we have computed the percent missing data per variable; the values of the first, second, and third quantiles (in terms of percent of data missing per column) correspond to 0%, 0%, and 60%,respectively. Thus, it is imperative to examine how these variables can be preprocessed prior to any analysis.

Missing Data (Sample Columns)

In the plot below, we sample 30 columns at random from the UNOS dataset to show the actual percentage of the data that is missing for each variable. The colors are used to denote the data quality for that column using a traffic light scheme (where green is good and red is bad).

heart.na.plot <- df[,sample(colnames(df), 30)] %>% plot_missing()

Missing Data (Printout for All Variables)

In the chunk below, we report the percent missing per each variable. The printout is arranged in a descending order of missing values.

data.frame(variableName = names(naVec), percentMissing = round(naVec*100, 2), row.names = NULL) %>%
  arrange(desc(percentMissing)) %>% datatable()

Column Types

R initially divides the columns of different types. We summarize these in the table below.

types <- dataTypes(df) # see functions.R file
numeric.index <- which(types$classes.df=="numeric")
character.index <- which(types$classes.df=="character")
datatable(types) # printing the different types

The reader should note that R reads any variable that purely consists of numeric values to be numeric. However, from a data analysis perspective, a lot of these variables will need to be converted into factors since they represent factor levels rather than numeric values (e.g., donor/recipient ethinicity categories, allocation type, HLA mismatch level, education, etc.). Thus, the numeric variables should be examined more closely to ensure that they are properly coded by the software.

2.2 Data Preparation

In this section, we will detailedly document the data preperation precedures we have done in the study.

2.2.1 Technically Correct Variables

In this subsection, we will capitalize on the file titled = “STAR File Documentation.xls” provided by UNOS to automatically check and convert any variable that was assigned an incorrect type by the R software. The reader should note that the outcome of this data transformation step can be referred as technically correct data (De Jonge and Van Der Loo 2013). Note that we convert all factor columns to character to facilitate their manipulation in R.

varInfo <- read_excel("../Data/STAR File Documentation.xls", sheet = "THORACIC DATA", skip =1, col_types = c(rep("text",3), rep("date",2), rep("guess",7)) )

# fix an error in the variable information document, there were two start dates: 01-Oct-87, 01-Oct-90 for the variable: CMV_DON. We assign 01-Oct-90 for the start date.

varInfo$`VAR START DATE`[51] <- "1990-10-01"

# pulling names of variables that had type char OR having numeric categories (explained in SAF)
factorVars <- varInfo %>% 
  dplyr::filter(`DATA TYPE`=="CHAR" | (`DATA TYPE`=="NUM" & !is.na(`SAS ANALYSIS FORMAT`)) ) %>%
  pull(`VARIABLE NAME`)

# The following variables are qualitative based on the UNOS documentation but they were not identified. 
Other_categorical_vars <- c("BLOOD_INF_DON", "HBV_CORE_DON", "HBV_SUR_ANTIGEN",   "HCV_SEROSTATUS", "HYPERTENS_DUR_DON", "IABP_TRR", "LAST_INACT_REASON", "OTHER_INF_DON", "STERNOTOMY_TRR", "URINE_INF_DON")

factorVars <- c(factorVars, Other_categorical_vars)

df[, factorVars] %<>%  lapply(as.character) # converting factors to character to facilitate preprocessing

#saveRDS(factorVars, "../Results/factorVars.RDS")

Based on the analysis above, we have identified 322 categorical variables. Note the R has previously only identified character/categorical variables. The names of the 322 variables can be accessed in the factorVars.RDS file. Furthermore, we applied the as.factor() to ensure that R recognizes these variables as categorical.

2.2.2 Rowwise Operations: Additional Filters

In the code chunk below, we first filter to keep adult patients since our study focuses on only adult transplantations. Then we remove individuals whose weights were lower than .01% of adults’ weights in the data (less than 30lbs) or whose heights were less than .01% adults’ heights in the data (lower than 122 cms).

df <- df %>% subset(AGE>=18) %>% # filtering to keep only Adults
# we excluded too light or too short people in the following code
subset(WGT_KG_DON_CALC >= quantile(WGT_KG_DON_CALC, 0.0001, na.rm = TRUE)) %>% 
subset(WGT_KG_TCR >= quantile(WGT_KG_TCR, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_DON_CALC >= quantile(HGT_CM_DON_CALC, 0.0001, na.rm = TRUE)) %>% 
subset(HGT_CM_TCR >= quantile(HGT_CM_TCR, 0.0001, na.rm = TRUE))

2.2.3 Columnwise Operations: Dropping Variables

In the code chunk below, we fist filter/remove any observations with missing GSTATUS or GTIME. Our rationale for doing this prior to removing uniformative/irrelevant/unreliable variables was based on the fact that these variables would be used in computing our transplantation OUTCOME and hence, we did not want to lose them. After performing remove any variables that meet any of the following criteria:

Variables with a starting date on or later than Jan 01, 2000, except the variables that are corresponding to prior cardiac surgery (non-transplant) or lung surgery as these variables have been identified as important information in the literature (Vega, Schroder, and Nicoara (2017), Chen et al. (2018), Guven et al. (2018), Medved et al. (2018), Zhang et al. (2019)). These variables are:
PRIOR_CARD_SURG_TCR, PRIOR_CARD_SURG_TRR, PRIOR_CARD_SURG_TYPE_OSTXT_TCR, PRIOR_CARD_SURG_TYPE_OSTXT_TRR, PRIOR_CARD_SURG_TYPE_TCR, PRIOR_CARD_SURG_TYPE_TRR, and PRIOR_LUNG_SURG_TYPE_OSTXT_TRR.
Any variables having an end date for data collection since they would not be collected in any future transplant event.
Any post-transplant variables, with the exception of GTIME and GSTATUS (which we use later to record the transplantation outcome at 1-year post transplant)
Removing the following variables, capturing different types of antigen alleles:
DA1, DA2, RA1, RA2, DB1, DB2, RB1, RB2, RDR1, RDR2, DDR1, and DDR2.
Note that each of those variables had three possible values: 0 indicating no mismatch, 1 indicating one allele matching, and 2 indicating that both alleles mismatched.
Our rationale for removing those variables was based on the observation that the variables HLAMIS, AMIS, BMIS and DRMIS recorded summaries of matches in these antigen alleles. The interested reader is referred to Parham (2014) for a discussion of HLAMIS, and Weisdorf et al. (2008) for a detailed introduction on those four variables.
Removing the two variables DEATH_CIRCUM_DON and DEATH_MECH_DON since a cross-tabulation of their values indicated that they are not consistent, which reduces the interpretation/utility from using those variables.
Removing the variable TX_YEAR since we wanted to develop a predictive and not an explanatory model. For more details on the difference, the reader is referred to Shmueli and others (2010).

varsVec <- read_rds(gzcon(url("https://raw.githubusercontent.com/My-favor-research/ISR-Supplementary-Materials.github.io/master/dropVars.RDS"))) # a vector of 69 vars to be dropped
alleles = c("DA1","DA2","RA1","RA2","DB1","DB2","RB1","RB2","RDR1","RDR2","DDR1","DDR2")

VarstoBeDropped = varInfo %>% 
dplyr::filter(`VAR START DATE` >= '2000-01-01' | 
!is.na(`VAR END DATE`) | 
str_detect(FORM, "TRF/TRR|TRR/TRF-CALCULATED|TRR/TRF|TRF") |
str_detect(`FORM SECTION`, "POST TRANSPLANT CLINICAL INFORMATION") |
str_detect(DESCRIPTION, "IDENTIFIER| DATE") |
`VARIABLE NAME` %in% varsVec |
`VARIABLE NAME` %in% alleles |
`VARIABLE NAME` %in% c("DEATH_CIRCUM_DON","DEATH_MECH_DON", "TX_YEAR") ) %>% 
pull(`VARIABLE NAME`)

# include variables corresponding to prior cardiac surgery (non-transplant) or lung surgery
VarstoBeDropped <- setdiff(VarstoBeDropped, c("PRIOR_CARD_SURG_TCR", "PRIOR_CARD_SURG_TRR", "PRIOR_CARD_SURG_TYPE_OSTXT_TCR", "PRIOR_CARD_SURG_TYPE_OSTXT_TRR",
"PRIOR_CARD_SURG_TYPE_TCR", "PRIOR_CARD_SURG_TYPE_TRR", "PRIOR_LUNG_SURG_TYPE_OSTXT_TRR"))

df %<>% dplyr::filter(!is.na(GSTATUS) & !is.na(GTIME) ) %>% # remove obs missing graft status/time
select(!all_of(VarstoBeDropped)) # exclude variables to be dropped

2.2.4 Creating Variables based on the Literature and Our Exploration of the Data

In the code chunk below, we create the following variables:

ISCHTIME was recoded to be in minutes to make the interpretation consistent with Medved et al. (2018)
PVR, which calculates the recipent’s pulmonary vascular resistance
An ECMO index variable was created to indicate whether the ECMO machine was utilized at either registration or at transplant
CARD_SURG, which captures whether the recipent had prior cardiac surgeries (non-transplant)
BMI_CHNG, which captures the percent change in the recipient’s BMI from transplantation time to registration/listing
WGT_CHNG, which captures the percent change in the recipient’s weight from transplantation time to registration/listing
HGT_CHNG, which captures the percent change in the recipient’s height from transplantation time to registration/listing
AGE_MAT, which captures the absolute value of the difference between the recipient’s and deceased donor’s age
BMI_MAT, which captures the absolute value of the difference between the recipient’s and deceased donor’s BMI
ETH_MAT, which captures whether the recipent and deceased donor were of the same ethinicity
GENDER_MAT, which captures whether the recipent and deceased donor were of the same gender
ANCEF, which captures whether the deceased donor was prescribed Ancef within 24 hours or procurement
DOPAMINE, which captures whether the deceased donor was prescribed Dopamine within 24 hours or procurement
HEPARIN, which captures whether the deceased donor was prescribed Herapin within 24 hours or procurement
ZOSYN, which captures whether the deceased donor was prescribed Zosyn within 24 hours or procurement
OUTCOME, which captures whether the recipient has survived at one-year post transplant based on combining information from GTIME and GSTATUS. The censoring procedure used herein is identical to that applied in Dag et al. (2016); Dag et al. (2017).

After these variables were created, we dropped both GTIME and GSTATUS as well as the four variables capturing the medications prescribed/applied on the deceased donor (i.e. PT_OTH1_OSTXT_DON – PT_OTH4_OSTXT_DON). Furthemore, we converted all indicator columns into characters.

df %<>% mutate(ISCHTIME = ISCHTIME*60) %>%
mutate(PVR = (HEMO_PA_MN_TRR - HEMO_PCW_TRR)*79.72/HEMO_CO_TRR) %>%
mutate(ECMO = ifelse(ECMO_TCR + ECMO_TRR == 0, 0, 1)) %>% 
mutate(CARD_SURG = if_else(PRIOR_CARD_SURG_TCR=="Y"|PRIOR_CARD_SURG_TRR=="Y", true="Y", false=ifelse(PRIOR_CARD_SURG_TCR=="N"&PRIOR_CARD_SURG_TRR=="N", "N", NA))) %>%
mutate(BMI_CHNG = 100*(BMI_CALC- INIT_BMI_CALC)/INIT_BMI_CALC) %>% 
mutate(WGT_CHNG = 100*(WGT_KG_CALC - INIT_WGT_KG_CALC)/INIT_WGT_KG_CALC) %>% 
mutate(HGT_CHNG = 100*(HGT_CM_CALC - INIT_HGT_CM_CALC)/INIT_HGT_CM_CALC) %>% 
mutate(AGE_MAT = abs(AGE - AGE_DON)) %>% 
mutate(BMI_MAT = abs(BMI_CALC - BMI_DON_CALC)) %>% 
mutate(ETH_MAT = if_else(ETHCAT == ETHCAT_DON, true = "Y", false = "N")) %>% 
mutate(GENDER_MAT = if_else(GENDER == GENDER_DON, true = "Y", false = "N"))

# We changed the following medicines to "HEPARIN", "ANCEF", "DOPAMINE", "ZOSYN" since these are the most repeated medicines.

temp <- df[, c("PT_OTH1_OSTXT_DON", "PT_OTH2_OSTXT_DON","PT_OTH3_OSTXT_DON", "PT_OTH4_OSTXT_DON")]
new_vars <- data.frame(matrix(NA, ncol=4, nrow=nrow(df)))
colnames(new_vars) <- c("HEPARIN","ANCEF","DOPAMINE","ZOSYN")

temp <- as.data.frame(apply(temp, 2, function(x) gsub("^$| ^", NA, x)), stringsAsFactors=FALSE)

for (i in 1:ncol(new_vars)){
new_vars[,i] <- apply(temp, 1, function(x) detect_terms(x, colnames(new_vars)[i]))
}

df %<>% mutate(HEPARIN = new_vars$HEPARIN, ANCEF=new_vars$ANCEF,
               DOPAMINE= new_vars$DOPAMINE, ZOSYN=new_vars$ZOSYN) %>% rowwise() %>% mutate(OUTCOME = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 1)) %>% # see custom_functions.R
select(-c(starts_with("PT_OTH")))

df %<>% ungroup() %>% 
mutate_at(c('ECMO', 'CARD_SURG', 'ETH_MAT', 'GENDER_MAT', 'ANCEF', 'DOPAMINE', 'HEPARIN', 'ZOSYN', 'OUTCOME'), as.character)

2.2.5 Reducing Number of Levels for Categorical Columns

2.2.5.1 Number of Distinct Values within Each Categorical Variable

Despite the application of the nearZeroVar() from the caret package, several of the remaining character/factor variables still exhibit a large number of unique values. The purpose of the code chunk below is to quantify the variability in the remaining character/factor variables.

df %>% select(-ID) %>%  summarise_if(is.character, n_distinct, na.rm = TRUE) -> numLevels

If we were to exclude the ID column, the number of distinct values in each character variable (equivallently the number of non-NA levels if these variables were encoded as factors) has the following summary statistics:

Min. : 1.00
1st Qu.: 2.00
Median : 4.00
Mean : 37.17
3rd Qu.: 5.00
Max. :1294.00

Accordingly, we will have to examine how to reduce the number of levels within each variable to be able to use these variables in the machine learning stage; we do not want to generate a very large number of dummy variables and we will group some of those levels by capitalizing on the similarities in interpretting some of these levels.

2.2.5.2 Grouping Operations

While recipes package provides an excellent pipeline for data pre-processing, we have elected to manually go through the process of grouping variables since this would allow us to (a) utilize the information in the factor levels to combine similar categories (whether due to location, education status or medical diagnosis codes); and (b) the integration of the recipes package into the training of machine learning models would require to train the models repeatdly for each recipe per the documentation of Kuhn (2019). In the code chunk below, we provide the details for how we regrouped several values/levels of each categorical/character vector with the hope of improving both the efficiency and prediction performance of the different machine learning models. In addition, we have also used the recode() to rename some of the categories since R can produce errors when one-hot encoding is applied if the name of the category is numeric; therefore, some of the operations performed below merely changed the name of the category to avoid future errors when processing the data. After regrouping, we performed two additional steps of variable elimination: (a) Any variables where the number of missing observations exceeded 90%; (b) Any numeric variables where the number of missing observations exceeded 30%; (c) removing variables that have more than 90% of the non-missing observations in one level; (d) using the caret nearZeroVar() function to remove variables that the percentage of unique values in the variable is less than 0.01%.

df$ABO %<>% recode(A = 'A', A1 = 'A', A2 = 'A', B = 'B', O = 'O', AB = 'AB', A1B = 'AB', A2B = 'AB', .default = "OTHER", .missing = NA_character_) 
df$ABO_DON %<>% recode(A = 'A', A1 = 'A', A2 = 'A', B = 'B', O = 'O', AB = 'AB', A1B = 'AB', A2B = 'AB', .default = "OTHER", .missing = NA_character_)

df$ABO_MAT %<>% recode(`1` = "IDENTICAL", `2` = "NOT_IDENTICAL", `3` = "NOT_IDENTICAL", .default = "OTHER", .missing = NA_character_)

df$AMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED", .default = "OTHER", .missing = NA_character_)

df$BMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED", .default = "OTHER", .missing = NA_character_)

df$BRONCHO_LT_DON %<>% recode(`1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL", `4` = "ABNORMAL", `5` = "ABNORMAL",`6` = "ABNORMAL", `7` ="OTHER", `998` = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$BRONCHO_RT_DON %<>% recode(`1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL", `4` = "ABNORMAL", `5` = "ABNORMAL",`6` = "ABNORMAL", `7` ="OTHER", `998` = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$CHEST_XRAY_DON %<>% recode(`0` = "UNKNOWN", `1` = "NO", `2` = "NORMAL", `3` = "ABNORMAL_SINGLE", `4` = "ABNORMAL_SINGLE", `5` = "ABNORMAL_BOTH",
`998` = "UNKNOWN", `999` = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$COD_CAD_DON %<>% recode(`1` = 'ANOXIA', `2` = 'CEREBROVASCULAR_STROKE', `3` = 'HEAD_TRAUMA',`4` = 'OTHER', `999` = 'OTHER', Unknown = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$CORONARY_ANGIO %<>% recode(`1` = "NO", `2` = "YES_NORMAL", `3` = "YES_ABNORMAL", .default = "OTHER", .missing = NA_character_) 

df$CMV_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$DIAB %<>% recode(`1` = 'NO', `2` = 'ONE', `3` = 'TWO', `4` = 'OTHER', `5` = 'OTHER', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$DIAG %<>% recode(`1000` = 'DILATED_MYOPATHY_IDI', `1001` = 'DILATED_MYOPATHY_OTH', `1002` = 'DILATED_MYOPATHY_OTH', `1003` = 'DILATED_MYOPATHY_OTH',`1004` = 'DILATED_MYOPATHY_OTH', `1005` = 'DILATED_MYOPATHY_OTH',`1006` = 'DILATED_MYOPATHY_OTH', `1007` = 'DILATED_MYOPATHY_ISC', `1049` = 'DILATED_MYOPATHY_OTH',
`1200` = 'CORONARY', `999` = 'OTHER', `1999` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$TCR_DGN %<>% recode(`1000` = 'DILATED_MYOPATHY_IDI', `1001` = 'DILATED_MYOPATHY_OTH', `1002` = 'DILATED_MYOPATHY_OTH', `1003` = 'DILATED_MYOPATHY_OTH',`1004` = 'DILATED_MYOPATHY_OTH', `1005` = 'DILATED_MYOPATHY_OTH',`1006` = 'DILATED_MYOPATHY_OTH', `1007` = 'DILATED_MYOPATHY_ISC', `1049` = 'DILATED_MYOPATHY_OTH',
`1200` = 'CORONARY', `999` = 'OTHER', `1999` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$THORACIC_DGN %<>% recode(`1000` = 'DILATED_MYOPATHY_IDI', `1001` = 'DILATED_MYOPATHY_OTH', `1002` = 'DILATED_MYOPATHY_OTH', `1003` = 'DILATED_MYOPATHY_OTH',`1004` = 'DILATED_MYOPATHY_OTH', `1005` = 'DILATED_MYOPATHY_OTH',`1006` = 'DILATED_MYOPATHY_OTH', `1007` = 'DILATED_MYOPATHY_ISC', `1049` = 'DILATED_MYOPATHY_OTH',
`1200` = 'CORONARY', `999` = 'OTHER', `1999` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)


df$DRMIS %<>% recode(`0` = "NO_MISMATCH", `1` = "ONE_MISMATCHED", `2` = "TWO_MISMATCHED", .default = "OTHER", .missing = NA_character_)

df$EBV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$EDUCATION %<>% recode(`1` = 'OTHER', `2` = 'GRADE', `3` = 'HIGH', `4` = 'COLLEGE', `5` = 'UNIVERSITY', `6` = 'UNIVERSITY', `996` = 'OTHER', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$ETHCAT %<>% recode(`1` = 'WHITE', `2` = 'BLACK', `4` = 'HISPANIC', `5` = 'OTHER', `6` = 'OTHER',`7` = 'OTHER', `9` = 'OTHER', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$ETHCAT_DON %<>% recode(`1` = 'WHITE', `2` = 'BLACK', `4` = 'HISPANIC', `5` = 'OTHER', `6` = 'OTHER',`7` = 'OTHER', `9` = 'OTHER', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_)

df$FUNC_STAT_TCR %<>% recode(`1` = "ABLE", `2` = "ASSISTED", `3` = "DISABLE", `996` = "OTHER", `998` = "UNKNOWN", `2010` = "DISABLE", `2020` = "DISABLE", `2030` = "DISABLE", `2040` = "DISABLE", `2050` = "ASSISTED", `2060` = "ASSISTED" , `2070` = "ASSISTED", `2080` = "ABLE", `2090` = "ABLE", `2100` = "ABLE", .default = "OTHER", .missing = NA_character_)

df$FUNC_STAT_TRR %<>% recode(`1` = "ABLE", `2` = "ASSISTED", `3` = "DISABLE", `996` = "OTHER", `998` = "UNKNOWN", `2010` = "DISABLE", `2020` = "DISABLE", `2030` = "DISABLE", `2040` = "DISABLE",`2050` = "ASSISTED", `2060` = "ASSISTED" , `2070` = "ASSISTED", `2080` = "ABLE", `2090` = "ABLE", `2100` = "ABLE", .default = "OTHER", .missing = NA_character_)

df$HBV_CORE %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$HBV_CORE_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$HBV_SUR_ANTIGEN %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$HCV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$HEP_C_ANTI_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$HIST_DIABETES_DON %<>% recode(`1` = 'NO', `2` = 'YES', `3` = 'YES', `4` = 'YES' , `5` = 'YES', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_) 

df$HYPERTENS_DUR_DON %<>% recode(`1` = 'NO', `2` = 'YES', `3` = 'YES', `4` = 'YES' , `5` = 'YES', `998` = 'UNKNOWN', .default = "OTHER", .missing = NA_character_) 

df$HIV_SEROSTATUS %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

# We convert the blank cells to NA as these variables were well coded. 

df %<>% mutate_all(na_if, "") %>% mutate_all(na_if, " ") %>% mutate_all(na_if, "U")

df$HLAMIS %<>% recode(`0` = 'LOWEST', `1` = 'LOWEST', `2` = 'LOWEST', `3` = 'LOW', 
`4` = 'MEDIUM', `5` = 'HIGH', `6` = 'HIGHEST', .default = "OTHER", .missing = NA_character_)

df$HTLV2_OLD_DON %<>% recode(C = "UNKNOWN", I = "UNKNOWN", N = "NEGATIVE", ND = "UNKNOWN", P = "POSITIVE", PD = "UNKNOWN", U = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$INIT_STAT %<>%  recode(`1010` = 'ONE', `1020` = 'ONE', `1030` = 'TWO', `1090` ='ONE', `1999` = 'TEMPORARILY_INACTIVE', `2010` = 'ONE', `2020` = 'ONE', `2030` = 'TWO',`2090` = 'ONE', `2999` = 'TEMPORARILY_INACTIVE', `7010` = 'ACTIVE',
`7999` = 'TEMPORARILY_INACTIVE', .default = "OTHER", .missing = NA_character_)

df$END_STAT %<>%  recode(`1010` = 'ONE', `1020` = 'ONE', `1030` = 'TWO', `1090` ='ONE', `1999` = 'TEMPORARILY_INACTIVE', `2010` = 'ONE', `2020` = 'ONE', `2030` = 'TWO',`2090` = 'ONE', `2999` = 'TEMPORARILY_INACTIVE', `7010` = 'ACTIVE',
`7999` = 'TEMPORARILY_INACTIVE', .default = "OTHER", .missing = NA_character_)

df$LAST_INACT_REASON %<>% recode(`7`='HEALTH', `9`='HEALTH', `11`='HEALTH', .missing = NA_character_)

df$STERNOTOMY_TRR %<>% recode(`1`="ONE", `2`="MORE", `3`="MORE", `998`="UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$INOTROPES_TCR %<>% recode(`0` = "NO", `1` = "YES", .default = "OTHER", .missing = NA_character_) 

df$INOTROPES_TRR %<>% recode(`0` = "NO", `1` = "YES",  .default = "OTHER", .missing = NA_character_)

df$MED_COND_TRR %<>% recode(`1` = "ICU_HOSPITALIZED", `2` = "HOSPITALIZED", `3` = "NOT_HOSPITALIZED",  .default = "OTHER", .missing = NA_character_)

df$NUM_PREV_TX %<>% recode(`0` = 'ZERO', .default = 'NON_ZERO', .missing = NA_character_)

df$PRI_PAYMENT_TCR %<>% recode(`1` = "PRIVATE", `2` ="MEDICAID", `3` = "MEDICARE_FREE", `4` = "PUBLIC_OTHER",`5` = "PUBLIC_OTHER", `6` = "PUBLIC_OTHER", `7` = "PUBLIC_OTHER", `8` = "OTHER",`9` = "OTHER", `10` = "OTHER", `11` = "OTHER", `12` = "OTHER", `13` = "PUBLIC_OTHER",`14` = "OTHER", `15` = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$PRI_PAYMENT_TRR %<>% recode(`1` = "PRIVATE", `2` ="MEDICAID", `3` = "MEDICARE_FREE", `4` = "PUBLIC_OTHER",`5` = "PUBLIC_OTHER", `6` = "PUBLIC_OTHER", `7` = "PUBLIC_OTHER", `8` = "OTHER",`9` = "OTHER", `10` = "OTHER", `11` = "OTHER", `12` = "OTHER", `13` = "PUBLIC_OTHER", `14` = "OTHER", `15` = "UNKNOWN", .default = "OTHER", .missing = NA_character_)

df$PROC_TY_HR %<>% recode(`1` = "BICAVAL", `2` = "TRADITIONAL", .default = "OTHER", .missing = NA_character_)

df$PULM_INF_DON %<>% recode(`0` = "NO", `1` = "YES", .default = "OTHER", .missing = NA_character_) 

df$REGION %<>% recode(`1` = "NORTH_EAST", `2` = "NORTH_EAST", `3` = "SOUTH_EAST", `4` = "SOUTH_EAST", `5` = "WEST", `6` = "WEST", `7` = "MIDWEST", `8` = "MIDWEST", `9` = "NORTH_EAST", `10` = "MIDWEST",`11` = "SOUTH_EAST", .default = "OTHER", .missing = NA_character_)

df$SHARE_TY %<>% recode(`3` = "LOCAL", `4` = "REGIONAL", `5` = "NATIONAL", .default = "OTHER", .missing = NA_character_) 

df %<>% discard(~sum(is.na(.x))/length(.x)* 100 >= 90) %>% # remove vars with >= 90 % missing
dropNumericMissing(percent = 30) # drop numeric columns w/ >= 30% missing (see custom_functions.R)

df <- df %>% dropLowVar(percent=90)
nzvColLoc <- nearZeroVar(df, uniqueCut = 0.01)
df <- df[,-nzvColLoc]

# we need to save df as another object for later use: the application of the selected model for the multiple time periods

df_iso <- df
df_iso %<>% select(-c(OUTCOME))
df %<>% select(-c(GSTATUS, GTIME))

Now, we have 134 variables and 48347 cases in the data.

2.3 Deal with Missing Values & Encoding Strategy

In this section, we provide some options of dealing with missing values in either categorical or numeric variables and two approaches of encoding categorical variables.

Numerical Imputation: No imputation (Drop), Missing values are replaced with the median

Categorical Imputation: No imputation (Drop), Missing values are labeled with the existing “unknown” category.

We suggest two approaches of encoding categorical variables: Label Encoding where each label is mapped to an integer, and One-Hot Encoding where each label is mapped to a binary variable.

In the following code chunk, we create a list (df_all) of 8 sublists corresponding to the imputation strategies and encoding strategies mentioned above. For example, df_all[[“Median_Drop_Label”]] stores the data after imputing median for missing values in the numerical variables, dropping missing values in the categorical variables, and mapping each level in each categorical variable to an integer.

# first, we need to remove cases with missing values in OUTCOME as it is the response variable in our study.
df <- df[!is.na(df$OUTCOME),]

# find the indices for categorical and numerical variables in the data
index_char <- setdiff(which(unlist(lapply(df, class))=="character"), which(colnames(df)=="ID"|colnames(df)=="OUTCOME"))

index_num <- which(unlist(lapply(df,class))=="numeric")


# define three factors
imputation_num <- c("Drop", "Median")
imputation_cat <- c("Drop", "Unknown")
encoding <- c("Label", "OneHot")

# Create a list of 8 elements that will store the 8 data sets for all scenarios
df_all <- rep(list(NA), 8)
names(df_all) <- apply(as.data.frame(crossing(imputation_num, imputation_cat, encoding)),1,function(x) paste(x, collapse="_"))

# prepare for imputation cases
char.impute_unknown <- df[,index_char] %>% mutate_all(~ifelse(is.na(.x), "UNKNOWN", .x)) 
num.impute_median <- df[,index_num] %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)) 

# prepare for encoding with categorical imputations
unknown_Label <- encode_category(char.impute_unknown, "Label") 
unknown_Onehot <- encode_category(char.impute_unknown, "Onehot")

# prepare for no imputation case
drop_drop <- search_complete(df)  
index_Num_DD <- which(unlist(lapply(drop_drop, class))=="numeric")
index_Char_DD <- setdiff(which(unlist(lapply(drop_drop, class))=="character"), which(colnames(drop_drop)=="ID"|colnames(drop_drop)=="OUTCOME"))
drop_drop_Label <- encode_category(drop_drop[,index_Char_DD], "Label") 
drop_drop_Onehot <- encode_category(drop_drop[,index_Char_DD], "Onehot") 

# prepare for imputing median for numerical variables but dropping categorical missing values
median_notdrop <- bind_cols(df[,c("ID", "OUTCOME")], num.impute_median, df[,index_char])
median_drop <- search_complete(median_notdrop)
index_Num_MD <- which(unlist(lapply(median_drop, class))=="numeric")
index_Char_MD <- setdiff(which(unlist(lapply(median_drop, class))=="character"), which(colnames(median_drop)=="ID"|colnames(median_drop)=="OUTCOME"))
median_drop_Label <- encode_category(median_drop[,index_Char_MD], "Label") 
median_drop_Onehot <- encode_category(median_drop[,index_Char_MD], "Onehot") 

# obtain data for 8 scenarios
df_all$Drop_Drop_Label <- bind_cols(drop_drop[,c("ID", "OUTCOME")], drop_drop[,index_Num_DD], drop_drop_Label)
df_all$Drop_Drop_OneHot <- bind_cols(drop_drop[,c("ID", "OUTCOME")], drop_drop[,index_Num_DD], drop_drop_Onehot)
df_all$Drop_Unknown_Label <- bind_cols(df[,c("ID", "OUTCOME")], df[,index_num], unknown_Label) %>% drop_na()
df_all$Drop_Unknown_OneHot <- bind_cols(df[,c("ID", "OUTCOME")], df[,index_num], unknown_Onehot) %>% drop_na()
df_all$Median_Drop_Label <- bind_cols(median_drop[,c("ID", "OUTCOME")], median_drop[,index_Num_MD], median_drop_Label)
df_all$Median_Drop_OneHot <- bind_cols(median_drop[,c("ID", "OUTCOME")], median_drop[,index_Num_MD], median_drop_Onehot)
df_all$Median_Unknown_Label <- bind_cols(df[,c("ID", "OUTCOME")], num.impute_median, unknown_Label)
df_all$Median_Unknown_OneHot <- bind_cols(df[,c("ID", "OUTCOME")], num.impute_median, unknown_Onehot)

saveRDS(df_all, "../Results/all_data.RDS")

We found that if we simply dropped all missing values in the categorical variables, there were no observations in the data. Thus, we try to maximize remaining cells in the data through a heuristic algorithm: search_complete(), which drops rows and columns based on the percentage of missing values until just non-empty cells are remained. This function can be found in custom_functions.R.

The following table shows the dimensions for the data (including ID) we created for all scenarios.

D <- lapply(df_all, dim)
D <- do.call(rbind.data.frame, D)
colnames(D) <- c("Number_of_Observations", "Number_of_Variables")
D <- data.frame(Scenario=names(df_all), D)
datatable(D)

2.4 Full Factorial Design

Based on our Stage I description, we consider the following six model factors, with the following levels:

Categorical Imputation: Drop, Unknown
Numerical Imputation: Drop, Median
Encoding: Label, One-Hot
Subsampling: None, Down, Up, SMOTE, ROSE
Feature Selection: FFS, LASSO, RF
Algorithm: ANN, DT(CART), LDA, LR, SVM, RF, XGB

Furthermore, we will use 5-fold cross validation in model selection. To examine short, mid- and long-term survival, we would examine 11 prediction time-periods to generate the survival probability curve. Consequently, the enumeration of these combinations would result in 46,200 runs.

We decided to reduce the computational burden to 4,200 runs by comparing the predictive performance of those combinations based on the 1-year prediction period. We have chosen 1-year since it (i) provided us with the largest sample size (with the exception of the 1-month time period where the focus is on acute organ rejection prediction), and (ii) it is a commonly used period in the literature.

The selected modeling approach will then be trained for all other periods for the sake of demonstrating the need and utility of our isotonic regression calibration approach in Stage II.

Readers are encouraged to refer our paper for the descriptions of each factor.

2.5 Feature Selection

In the following code chunk, we show how the feature selection approaches: Fast Feature Selection, LASSO, and Random Forests are utilized. If there is an error occurs related to “JAVA_HOME cannot be determined from the registry”, the error is often resolved by installing a Java version (i.e. 64-bit Java or 32-bit Java) that fits to the type of R version that you are using (i.e. 64-bit R or 32-bit R). And set the JAVA_HOME path using one of the commends below. “jre1.8.0_251” needs to be adjusted depends on the folder’s name.

Sys.setenv(JAVA_HOME=‘C:/Program Files/Java/jre1.8.0_251’) for 64-bit version

Sys.setenv(JAVA_HOME=‘C:/Program Files (x86)/Java/jre1.8.0_251’) for 32-bit version

We need to create training and holdout data since the feature selection procedure should be applied to the training data only. We will use IDs to distinguish the training and holdout data. Since we would like to have 5 disjoint holdout datasets for the cross validation (corresponding to 5 training datasets), we sample cases by using ID without replacement and obtain each 20% of the sample cases for the holdout data.

set.seed <- 1234
all_data_ID <- lapply(df_all, function(x) sample(x$ID, nrow(x)))
holdout_index <- Get_holdout_index(all_data_ID)

Since there will be \(5\times 8=40\) training datasets, we will use a parallel algorithm to conduct the feature selection procedure for these datasets simultaneously. We apply each feature selection method to each dataset individually and the result is stored in a list with the following structure. The list contains 5 sub-lists which contain the results corresponding to the five fold cross validation training sets. Additionally, each of these 5 sub-lists contains 8 elements corresponding to the 8 data scenarios. Then we store 3 lists: features_FFS, features_LASSO, features_RF into a list: all_features.

For example, all_features[[“LASSO”]][[3]][[“Median_Drop_Label”]] stores the features after applying LASSO to the third training datasets from the data: imputing median for missing values in the numerical variables, dropping missing values in the categorical variables, and mapping each level in each categorical variable to an integer.

numCores <- 4 # this depends on the number of cores and the installed memory (RAM) available in the computer
# Fast Feature selection
features_FFS <- vector(mode="list", 5)
startTime <- proc.time()[3] # set up timer

for (i in 1:5){
  cl <- makeCluster(numCores, type="SOCK")
  features_FFS[[i]] <- parSapply(cl, 1:8, select_variables, df_all, "OUTCOME", "FFS", all_data_ID, holdout_index[[i]], 2090)
  stopCluster(cl)
}

runTime <- proc.time()[3] - startTime
print(paste("It took", round(runTime), "seconds to obtain all features used using FFS."))

# Lasso Feature selection for Binomial TARGETS
features_LASSO <- vector(mode="list", 5)
startTime <- proc.time()[3] # set up timer

for (i in 1:5){
  cl <- makeCluster(numCores, type="SOCK")
  features_LASSO[[i]] <- parSapply(cl, 1:8, select_variables, df_all, "OUTCOME", "LASSO", all_data_ID, holdout_index[[i]], 2090)
  stopCluster(cl)
}

runTime <- proc.time()[3] - startTime
print(paste("It took", round(runTime), "seconds to obtain all features used using LASSO."))

# Random Forest Feature Selection
features_RF <- vector(mode="list", 5)
startTime <- proc.time()[3] # set up timer

for (i in 1:5){
  cl <- makeCluster(numCores, type="SOCK")
  features_RF[[i]] <- parSapply(cl, 1:8, select_variables, df_all, "OUTCOME", "RF", all_data_ID, holdout_index[[i]], 2090)
  stopCluster(cl)
}

runTime <- proc.time()[3] - startTime
print(paste("It took", round(runTime), "seconds to obtain all features used using Random Forest."))

all_features <- list(features_FFS, features_LASSO, features_RF)
names(all_features) <- c("FFS", "LASSO", "RF")

For the reference, the approximate run time (in seconds) for each method to obtain the list using a Windows 10 machine with Intel(R) Core(TM) i7 Processor (4 cores, 8GB RAM) is reported below.

Time <- c(374, 2328, 5576)
names(Time) <- c("FFS", "LASSO", "RF")
data.table(t(Time))

2.6 Obtain Results for All Scenarios

In the following code chunk, a parallel algorithm will be used to obtain the prediction performance of each combination of factors and we save the result in the file: all_scenarios.csv. We report five common performance metrics that are suited for two class classification problems: AUC, accuracy, sensitivity, specificity and G-mean where G-mean \(= \sqrt{\mbox{sensitivity} \times \mbox{specificity}}\).

We use a function DOE_function() to conduct this study. This function can be found in the custom_functions.R. The output of this function could be either a vector or a list depends on if the predicted probability for Survival is returned.

fold <- 1:5
feature_selection <- c("FFS", "LASSO","RF")
subsampling <- c("none", "down", "Up", "smote", "rose")
algorithm <- c("ANN", "DT", "LDA", "LR", "RF", "SVM", "XGB")

Design <- crossing(fold, imputation_num, imputation_cat, encoding, feature_selection, subsampling, algorithm)

numCores <- 4 # this depends on the number of cores and the installed memory (RAM) available in the computer

cl <- makeCluster(numCores, type="SOCK")

all_scenarios <- parSapply(cl, 1:nrow(Design), DOE_function, Design, df_all, "OUTCOME", all_data_ID, holdout_index, all_features, return.prediction=FALSE, 2020)

stopCluster(cl)

all_scenarios <- t(all_scenarios)
colnames(all_scenarios) <- c("fold", "imputation_num", "imputation_cat", "encoding", "feature_selection", "subsampling", "algorithm", "auc", "sen", "spec", "accu", "gmean")
all_scenarios <- tbl_df(all_scenarios)
all_scenarios[,c("fold", "auc", "sen", "spec", "accu", "gmean")] %<>% mutate_all( as.numeric)

#write.csv(all_scenarios, "../Results/all_scenarios.csv", row.names = FALSE)

The following table presents an overview of the prediction performance of the seven machine learning algorithms with the best data preparation/pre-processing approach for a given algorithm.

best_DOE_average <- read.csv("https://raw.githubusercontent.com/Ying-Ju/heart_transplant.github.io/master/Data/DOE_result.csv")
datatable(best_DOE_average)

In our study, we use G-Mean as the primary metric for model selection since it

is suitable for unbalanced classification problems;
penalizes models where there is a large discrepancy between sensitivity and specificity.

Thus, we would select the combination of categorical imputation, numerical imputation, categorical variable encoding, sub-sampling and machine learning model with highest average G-Mean across the five cross-validation models. Note that, irrespective of our pre-algorithm preparation procedure, SVM did not converge.

Based on the table above, logistic regression (LR) presented the best G-Mean performance with

median imputation for missing numeric values,
imputing missing categories as “unknown”,
one-hot encoding for categorical variables,
LASSO for feature selection, and
UP sampling.

While the results are not statistically significant when compared to other approaches, we believe LR is the best approach since it is the quickest to run and most understandable by health-professionals.

2.7 Apply the Selected Model over Multiple Time Periods

In this section, the selected modeling approach (LR) in the previous section will be trained for all other periods.

There will be several steps in this section.

Create the training and holdout datasets for each time period
Apply LASSO feature selection approach to each training dataset
Apply the logistical regression model to the training dataset
Obtain the predicted survival probability at each time period

2.7.1 Create the Training and Holdout Datasets for Each Time Period

In this snippet, we create the training and holdout datasets for each time point of interest. In order to both validate models on the holdout datasets and use isotonic regression to calibrate survival probabilities, we make sure all patients selected in the 10th year holdout dataset were included in previous time points. This is not a necessary procedure for either prediction purpose or calibration of survival probabilities. However, we want to evaluate calibration with isotonic regression for the patients that their survival is predicted in all time stamps. Therefore, we need to use the patients IDs as an indicator for checking their data availability in all the time stamps.

We use a list object keep_NA to record the following IDs:

IDs for the dataset at each time point of interest
IDs for the holdout dataset at each time point of interest
IDs for the training dataset at each time point of interest

# find the indecies for categorical and numerical variables in the data
index_char <- setdiff(which(unlist(lapply(df_iso, class))=="character"), which(colnames(df_iso)=="ID"|colnames(df_iso)=="GTIME"|colnames(df_iso)=="GSTATUS"))

index_num <- setdiff(which(unlist(lapply(df_iso,class))=="numeric"),
 which(colnames(df_iso)=="ID"|colnames(df_iso)=="GTIME"|colnames(df_iso)=="GSTATUS"))                    

char.impute_unknown <- df_iso[,index_char] %>% mutate_all(~ifelse(is.na(.x), "Unknown", .x))

iso.char.impute_unknown <- df_iso[,index_char] %>% mutate_all(~ifelse(is.na(.x), "Unknown", .x)) 
iso.num.impute_median <- df_iso[,index_num] %>% mutate_all(~ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)) 
iso.unknown_Onehot <- encode_category(iso.char.impute_unknown, "Onehot")

df_iso <- bind_cols(df_iso[,c("ID","GSTATUS","GTIME")], iso.num.impute_median, iso.unknown_Onehot)


# here we made 11 consecutive dependent variables, month1, year1, year2, year3, ..., year10

df_iso %<>%
mutate(year0 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 1/12)) %>%
mutate(year1 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 1))    %>%
mutate(year2 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 2))    %>%
mutate(year3 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 3))    %>%
mutate(year4 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 4))    %>%
mutate(year5 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 5))    %>%
mutate(year6 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 6))    %>%
mutate(year7 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 7))    %>%
mutate(year8 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 8))    %>%
mutate(year9 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 9))    %>%
mutate(year10 = recodeGSTATUS(gStatus = GSTATUS, gTime = GTIME, targetYear = 10))  %>%
select(-c(GSTATUS, GTIME))
iso_vars=names(df_iso)

# We keep IDs of non NA observations in each time stamps
keep_NA <- rep(list(NA), 33)
keep_NA[1:11] <- lapply(1:11,function(x){pull(df_iso[!is.na(df_iso[paste0("year", (x-1))]),"ID"],ID)})
names(keep_NA)[1:11] <- paste0("ID",seq(0,10))

all_sizes <- unname(unlist(lapply(keep_NA[1:11], length)))
test_sizes <- round(all_sizes*0.2)
train_sizes <- all_sizes - test_sizes

# We select the holdout sets that contain the year 10 test data
# Then we exclude those IDs from each year to find IDs for other training sets
set.seed(2019)
keep_NA[[12]] <- sample(keep_NA$ID10,test_sizes[11])
names(keep_NA)[12] <- "ID_holdout10"

keep_NA[13:22] <- lapply(2:11,function(x){
  c(keep_NA$ID_holdout10, sample(setdiff(keep_NA[[(12-x)]], keep_NA$ID_holdout10), (test_sizes[(12-x)]-length(keep_NA$ID_holdout10))))
})
names(keep_NA)[13:22] <- paste0("ID_holdout", seq(9,0))


keep_NA[23:33] <- lapply(1:11,function(x){
  setdiff(keep_NA[[x]], keep_NA[[paste0("ID_holdout",(x-1))]])})
names(keep_NA)[23:33] <- paste0("ID_train", seq(0,10))

time_stamps <- c("year0","year1","year2","year3","year4","year5", "year6","year7","year8","year9","year10")


# making the data ready for running feature selection over super computer
server_data <- list()

server_data <- lapply(1:11,function(x){df_iso %>% dplyr::filter(ID %in% keep_NA[[paste0("ID_train", x-1)]])%>%
                      select(c(names(.)[!names(.) %in% c("ID",time_stamps)],paste0("year",x-1)))})

2.7.2 Apply LASSO Feature Selection Approach to Each Training Dataset

The following code chunk shows how the LASSO feature selection is applied to each training data.

cl <- makeCluster(min(4,parallel::detectCores()), type="SOCK")

time.begin <- proc.time()[3]

LASSO <- parSapply(cl, 1:11, select_variables_iso, data=server_data, method="LASSO",
                          folds=5,alpha=1,seed=110) 

time.window <- proc.time()[3] - time.begin

stopCluster(cl)

2.7.3 Apply Logistical Regression Model with Selected Approach

In the following code chunk, we apply the selected logistical regression model approach to the training datasets for each time period and obtain the performance measures for each case.

train_data <- list()
holdOut_data <- list()
exclud <- c(paste0("year",seq(0,10)), "ID")
exclud_h <- c(paste0("year",seq(0,10)))
for (i in 1:11){
  train_data[[i]] <- df_iso[df_iso$ID %in% keep_NA[[paste0("ID_train", i-1)]],
               !names(df_iso) %in% exclud[exclud!= paste0("year",i-1)] ]

  
  holdOut_data[[i]]<-df_iso[df_iso$ID %in% keep_NA[[paste0("ID_holdout", i-1)]],
                            !names(df_iso) %in% exclud_h[exclud_h!= paste0("year",i-1)]]

}


cl <- makeCluster(min(4,parallel::detectCores()), type="SOCK")
time.begin <- proc.time()[3]

pred_LR <- parSapply(cl, 1:11, train_iso ,t_data=train_data,h_data=holdOut_data,features=LASSO,
                          folds=5,resampl_meth="up",alg_used="glm",seed0=2019) 

time.window <- proc.time()[3] - time.begin

stopCluster(cl)

2.8 Performance Measures of the Selected Model over Multiple Time Periods

The following table shows the distribution of the features selected at each time period. “1” indicates the variable is selected and “0” indicates the variable is not selected in a given year.

LR_models <- readRDS(url("https://raw.githubusercontent.com/Ying-Ju/heart_transplant.github.io/master/Data/pred_LR.RDS"))
# here we load name of all the variables in the train set, they are loaded here to create a report them in each time stamps
iso_vars <-readRDS(url("https://raw.githubusercontent.com/Ying-Ju/heart_transplant.github.io/master/Data/iso_vars.RDS"))

exclud <- c(paste0("year",seq(0,10)), "ID")
vars <- iso_vars[!iso_vars %in% exclud]

varimp_years <- as.data.frame(matrix(0, ncol = 12, nrow = length(vars)))
colnames(varimp_years) <- c("vars",paste0("year",seq(0,10)))
varimp_years$vars <- vars

perf_years <- as.data.frame(matrix(0, ncol = 13, nrow = 5))
colnames(perf_years) <- c("measure",paste0("year",seq(0,10)),"all_mean")
perf_years$measure <- c("AUC","Specificity","Sensitivity","Accuracy","G-Mean")

for (i in 0:10){
  varimp_years[varimp_years$vars %in% LR_models[4,(i+1)]$var_imp$vars,paste0("year",i)] <- 1
  perf_years[paste0("year",i)] <- LR_models[1,(i+1)]$Performance
}

# prediction performance
perf_years$all_mean <- rowMeans(perf_years[paste0("year",seq(0,10))])
perf_years[,2:13] <- round(perf_years[,2:13],4)

# feature importance
varimp_years$no_years <- rowSums(varimp_years[paste0("year",seq(0,10))])
varimp_years <- varimp_years[which(varimp_years$no_years!=0),]

# importance of features
datatable(varimp_years)

The following table shows the holdout performance of the UP-LASSO-logistic regression implementation for the full, non-censored data.

datatable(perf_years)

2.9 ROC Curves

In this section, we present the ROC curves for all time periods. Our model yields an AUC value between 0.634-0.700 and 0.596-0.705 for the training and holdout sets.

Month 1

Year 1

Year 2

Year 3

Year 4

Year 5

Year 6

Year 7

Year 8

Year 9

Year 10

3 Stage II: Creation of a Realistic Probability Function Over Time

Consider the output of these machine learning models can be represented by \(\mathbf{p}=p_1, p_2, \ldots, p_t\), where \(p_t\) is the computed outcome probability at time \(t\) as obatined from the selected model.

In this section, we propose the use of isotonic regression to calibration and ensure the elements \(\mathbf{p}\) over time for a patient are monotonic. The following figure provides an overview of how isotonic regression can be used to calibrate the outcome probabilities obtained from the Stage 1. In the figure, we simulate the case where the researcher is interested in computing an organ’s failure probabilities over a 20 year time-period post transplant to illustrate the utility of our approach to a non-decreasing prediction problem.

However, we will only consider the opposite case where one would like to predict the survival probability post transplant in our study. This means for a given transplantation event, let \(p=p_0, p_1, p_2, \ldots, p_{10}\) correspond to the obtained survival probabilities of one recipient from Stage I, where 0 reflects the first month and 1-10 denote the number of years post-transplant. Isotonc regression is used to ensure that the elements within \(p\) are monotonic decreasing.

3.1 Application of Isotonic Regression

The following code chunk demonstrates the application of isotonic regression to the survival probabilities obtained the selected modeling approach.

predictions <- list()
# holdout ID of patients that their survival is known in year 10, should be existed in all previous time stamps but if their data is not known (e.g. year5) their data survival status is not known in the following years
#all_TARGETS: actual survival status of patients in each time stamp
#all_TARGETS: predicted survival status of the patients
#all_preds: predicted probability of survival in each time stamp

in_all_h <- LR_models[2,11]$Predicted$ID
all_preds <- as.data.frame(matrix(0, ncol = 11, nrow = length(in_all_h)))
all_TARGETS <- as.data.frame(matrix(0, ncol = 11, nrow = length(in_all_h)))
colnames(all_preds) <- c(paste0("year",seq(0,10)))
colnames(all_TARGETS) <- c(paste0("year",seq(0,10)))
all_TARGETS_pred <- all_TARGETS

for (x in 0:10){
  predictions[[x+1]] <- LR_models[2,x+1]$Predicted[which(LR_models[2,x+1]$Predicted$ID %in%
                                                          in_all_h),]
  if(sum(predictions[[x+1]]$ID==in_all_h)==length(in_all_h)){
    all_preds[paste0("year",x)] <-predictions[[x+1]]$Probability
    all_TARGETS[paste0("year",x)] <- predictions[[x+1]]$TARGET_raw[paste0("year",x)]
    all_TARGETS_pred[paste0("year",x)] <- predictions[[x+1]]$glm
  }
}


# here we apply isotonic regression 
# all_preds2: calibrated survival probability
# all_TARGETS_pred2: survival status after calibrating survival probability

# here we calibrate the survival probability

all_preds2 <- all_preds %>% rowwise() %>%
  mutate(output = list(isoreg(rev(as.numeric(c_across(year0:year10))))$yf)) %>%
  tidyr::unnest_wider(output)%>%
  select(., -starts_with("year")) %>% as_tibble(.)%>% rev(.) %>% 
  rename_(.dots = setNames(names(.), paste0("year", 0:10)))%>% 
  mutate(monotonic=all(.==cummin(.))) 


# here we identify survival status based on the calibrated probabilities
all_TARGETS_pred2 <- all_preds2 %>%
  select(paste0("year",0:10)) %>%
  mutate_all(funs(replace(., . >= 0.5, "Survival")))%>%
  mutate_all(funs(replace(., . < 0.5, "Death"))) %>% mutate_if(is.character,as.factor)

all_TARGETS_pred2 <- as.data.frame(all_TARGETS_pred2)

# here we calculate performance of prediction for the patients that their data is available in all time stamps before/after applying isotonic regression

perf_before_iso <- data.frame(matrix(NA, nrow=5, ncol=0))
perf_before_iso$Measures <- c("Accuracy","Specificity", "Sensitivity","AUC","G-Mean" )
perf_after_iso <- perf_before_iso

# calculating prediction performance before applying isotonic regression
perf_before_iso <- as.data.frame(lapply(0:10, function(x) {
  perf_before_iso %<>% mutate(!!paste0("year",x) := c(
    (as.data.frame(confusionMatrix(all_TARGETS_pred[,paste0("year",x)],
                                   as.factor(all_TARGETS[,paste0("year",x)]))$overall))[1,],
    caret::specificity(all_TARGETS_pred[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])), 
    caret::sensitivity(all_TARGETS_pred[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])),
    AUC::auc(roc(all_preds[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)]))),
    sqrt(caret::specificity(all_TARGETS_pred[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)]))*
           caret::sensitivity(all_TARGETS_pred[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])))
  ))
  return(perf_before_iso[paste0("year",x) ])
}))

perf_before_iso %<>% mutate(Mean = rowMeans(.[,colnames(perf_before_iso)])) %>% 
  add_rownames("Performance Measure")%>% mutate(`Performance Measure` = c("Accuracy","Specificity", "Sensitivity","AUC","G-Mean"))

# calculating prediction performance after applying isotonic regression

perf_after_iso <- as.data.frame(lapply(0:10, function(x) {
  perf_after_iso %<>% mutate(!!paste0("year",x) := c(
    (as.data.frame(confusionMatrix(all_TARGETS_pred2[,paste0("year",x)],
                                   all_TARGETS[,paste0("year",x)])$overall))[1,],
    caret::specificity(all_TARGETS_pred2[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])), 
    caret::sensitivity(all_TARGETS_pred2[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])),
    AUC::auc(roc(pull(all_preds2, paste0("year",x)),as.factor(all_TARGETS[,paste0("year",x)]))),   
    sqrt(caret::specificity(all_TARGETS_pred2[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)]))*
           caret::sensitivity(all_TARGETS_pred2[,paste0("year",x)],as.factor(all_TARGETS[,paste0("year",x)])))
    
  ))
  return(perf_after_iso[paste0("year",x) ])
}))

perf_after_iso %<>% mutate(Mean = rowMeans(.[,colnames(perf_after_iso)])) %>% 
  add_rownames("Performance Measure")%>% mutate(`Performance Measure` = c("Accuracy","Specificity", "Sensitivity","AUC","G-Mean"))

The following table shows the hold-out performance of the UP-LASSO-logistic regression implementation, which was limited to those observations where the recipients’ survival statuses are available for all time periods.

perf_before_iso[,2:13] <- round(perf_before_iso[,2:13],3)
datatable(perf_before_iso)

The following table shows the performance measures after isotonic regression is applied.

perf_after_iso[,2:13] <- round(perf_after_iso[,2:13],3)
datatable(perf_after_iso)

3.2 Survival Probabilites for Random Sample Recipients before and after Isotonic Regression

3.3 Observed vs Forcasted Survival Probabities Before and After Applying Isotonic Regression

In this section, we exam how good our models are by checking plots of the observed versus forecast survival probabilities before (Left) and after (Right) applying Isotonic Regression.

The light-blue line corresponds to the baseline case of a perfect match between both probabilities. The dark-blue line correspond to the results obtained from our machine learning model.

Month 1

Year 1

Year 2

Year 3

Year 4

Year 5

Year 6

Year 7

Year 8

Year 9

Year 10

4 Interactive App

We have created an interactive app that presents two modules for performing the analysis:

Manual Entry, where users can insert the values of predictor variables using several text boxes.
CSV Entry, where users can upload the values of predictor variables using a comma seperated variable (CSV) file.

Visit the app: here.

The following figure presents the workflow of the app. The user does not need to have a background in programming nor machine learning to operate the app.

The workflow of the app.

References

Chen, Ching Kit, Cedric Manlhiot, Seema Mital, Steven M Schwartz, Glen S Van Arsdell, Christopher Caldarone, Brian W McCrindle, and Anne I Dipchand. 2018. “Prelisting Predictions of Early Postoperative Survival in Infant Heart Transplantation Using Classification and Regression Tree Analysis.” Pediatric Transplantation 22 (2): e13105.

Dag, Ali, Asil Oztekin, Ahmet Yucel, Serkan Bulur, and Fadel M Megahed. 2017. “Predicting Heart Transplantation Outcomes Through Data Analytics.” Decision Support Systems 94: 42–52.

Dag, Ali, Kazim Topuz, Asil Oztekin, Serkan Bulur, and Fadel M Megahed. 2016. “A Probabilistic Data-Driven Framework for Scoring the Preoperative Recipient-Donor Heart Transplant Survival.” Decision Support Systems 86: 1–12.

De Jonge, Edwin, and Mark Van Der Loo. 2013. An Introduction to Data Cleaning with R. Statistics Netherlands, The Hague/Heerlen. https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf.

Guven, Goksel, Milos Brankovic, Alina A Constantinescu, Jasper J Brugts, Dennis A Hesselink, Sakir Akin, Ard Struijs, et al. 2018. “Preoperative Right Heart Hemodynamics Predict Postoperative Acute Kidney Injury After Heart Transplantation.” Intensive Care Medicine 44 (5): 588–97.

Kuhn, Max. 2019. “12 Using Recipes with Train | the Caret Package.” https://topepo.github.io/caret/using-recipes-with-train.html.

Medved, Dennis, Mattias Ohlsson, Peter Höglund, Bodil Andersson, Pierre Nugues, and Johan Nilsson. 2018. “Improving Prediction of Heart Transplantation Outcome Using Deep Learning Techniques.” Scientific Reports 8 (1): 1–9. https://doi.org/10.1038/s41598-018-21417-7.

Parham, Peter. 2014. The Immune System. Garland Science.

Shmueli, Galit, and others. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310.

Vega, Eleanor, Jacob Schroder, and Alina Nicoara. 2017. “Postoperative Management of Heart Transplantation Patients.” Best Practice & Research Clinical Anaesthesiology 31 (2): 201–13.

Weisdorf, Daniel, Stephen Spellman, Michael Haagenson, Mary Horowitz, Stephanie Lee, Claudio Anasetti, Michelle Setterholm, et al. 2008. “Classification of Hla-Matching for Retrospective Analysis of Unrelated Donor Transplantation: Revised Definitions to Predict Survival.” Biology of Blood and Marrow Transplantation 14 (7): 748–58.

Zhang, Kan, Richard Sheu, Nicole M Zimmerman, Andrej Alfirevic, Shiva Sale, A Marc Gillinov, and Andra E Duncan. 2019. “A Comparison of Global Longitudinal, Circumferential, and Radial Strain to Predict Outcomes After Cardiac Surgery.” Journal of Cardiothoracic and Vascular Anesthesia 33 (5): 1315–22.

Aubrun University, hamid@auburn.edu ↩︎
University of Dayton, ychen4@udayton.edu ↩︎
Aubrun University, cje0010@auburn.edu ↩︎
Aubrun University, azg0074@auburn.edu ↩︎
Miami University, fmegahed@miamioh.edu ↩︎

Supplementary Materials for A Two-Stage Machine Learning Framework to Predict Heart Transplantation Survival Probabilities over Time with a Monotonic Probability Constraint

Hamidreza Ahady Dolatsara1

Ying-Ju Tessa Chen2

Christy Evans3

Ashish Gupta4

Fadel M. Megahed5

04 August, 2020

1 Overview of Our Framework

2 State I: Survival Probability Prediction at Each Time Point

2.1 Reading and Exploring the UNOS Dataset

2.1.1 Loading the UNOS Dataset

2.1.2 Exploring the UNOS Dataset

Fraction Data Missing

Missing Data (Sample Columns)

Missing Data (Printout for All Variables)

Column Types

2.2 Data Preparation

2.2.1 Technically Correct Variables

2.2.2 Rowwise Operations: Additional Filters

2.2.3 Columnwise Operations: Dropping Variables

2.2.4 Creating Variables based on the Literature and Our Exploration of the Data

2.2.5 Reducing Number of Levels for Categorical Columns

2.2.5.1 Number of Distinct Values within Each Categorical Variable

2.2.5.2 Grouping Operations

2.3 Deal with Missing Values & Encoding Strategy

2.4 Full Factorial Design

2.5 Feature Selection

2.6 Obtain Results for All Scenarios

2.7 Apply the Selected Model over Multiple Time Periods

2.7.1 Create the Training and Holdout Datasets for Each Time Period

2.7.2 Apply LASSO Feature Selection Approach to Each Training Dataset

2.7.3 Apply Logistical Regression Model with Selected Approach

2.8 Performance Measures of the Selected Model over Multiple Time Periods

2.9 ROC Curves

Month 1

Year 1

Year 2

Year 3

Year 4

Year 5

Year 6

Year 7

Year 8

Year 9

Year 10

3 Stage II: Creation of a Realistic Probability Function Over Time

3.1 Application of Isotonic Regression

3.2 Survival Probabilites for Random Sample Recipients before and after Isotonic Regression

3.3 Observed vs Forcasted Survival Probabities Before and After Applying Isotonic Regression

Month 1

Year 1

Year 2

Year 3

Year 4

Year 5

Year 6

Year 7

Year 8

Year 9

Year 10

4 Interactive App

References

Hamidreza Ahady Dolatsara¹

Ying-Ju Tessa Chen²

Christy Evans³

Ashish Gupta⁴

Fadel M. Megahed⁵