The Stratified Cox Proportional Hazards Regression Model | by Sachin Date | Dec, 2020
[ad_1]
The data set we’ll use to illustrate the procedure of building a stratified Cox proportional hazards model is the US Veterans Administration Lung Cancer Trial data. It contains data about 137 patients with advanced, inoperable lung cancer who were treated with a standard and an experimental chemotherapy regimen. Their progress was tracked during the study until the patient died or exited the trial while still alive, or until the trial ended. In the later two situations, the data is considered to be right censored. The data set appears in the book The Statistical Analysis of Failure Time Data, Second Edition, by John D. Kalbfleisch and Ross L. Prentice.
Using Python and Pandas, let’s load the data set into a DataFrame:
import pandas as pddata_types = {'TREATMENT_TYPE':'int', 'CELL_TYPE':'category', 'SURVIVAL_IN_DAYS':'int', 'STATUS':'int', 'KARNOFSKY_SCORE':'int', 'MONTHS_FROM_DIAGNOSIS':'int', 'AGE':'int', 'PRIOR_THERAPY':'int'}df = pd.read_csv(filepath_or_buffer='va_lung_cancer_dataset.csv', dtype=data_types)df.head()
Here is the output:
Our regression variables X are going to be the following:
TREATMENT_TYPE: 1=Standard. 2=Experimental
CELL_TYPE: 1=Squamous, 2=Small cell, 3=Adeno, 4=large
KARNOFSKY_SCORE: A measure of general performance of the patient. 100=Best
MONTHS_FROM_DIAGNOSIS: The number of months after diagnosis of lung cancer that the patient entered the trial.
AGE: The age in years of the patient when they were inducted into the trial.
PRIOR_THERAPY: Whether the patient had received any kind of prior therapy for lung cancer before induction into the trial.
Our dependent variable y is going to be:
SURVIVAL_IN_DAYS: Indicating how many days the patient lived after being inducted into the trail.
The event variable is:
STATUS: 1=Dead. 0=Alive
Using Patsy, let’s break out the categorical variable CELL_TYPE into different category wise column variables. Don’t worry about the fact that SURVIVAL_IN_DAYS is on sides of the model expression. It’s just to make Patsy happy.
from patsy import dmatrices#Build the model expression in Patsy syntax.
model_expr = 'SURVIVAL_IN_DAYS ~ TREATMENT_TYPE + CELL_TYPE + KARNOFSKY_SCORE + MONTHS_FROM_DIAGNOSIS + AGE + PRIOR_THERAPY + SURVIVAL_IN_DAYS + STATUS'#Use the model expression to break out the CELL_TYPE categorical variable into 1-0 type columns
y, X = dmatrices(model_expr, df, return_type='dataframe')#Print out the first few rows
X.head()
Next, let’s build and train the regular (non-stratified) Cox Proportional Hazards model on this data using the Lifelines Survival Analysis library:
from lifelines import CoxPHFitter
#Create the Cox model
cph_model = CoxPHFitter()#Train the model on the data set
cph_model.fit(df=X, duration_col='SURVIVAL_IN_DAYS', event_col='STATUS')#Print the model summary
cph_model.print_summary()
We see the following model summary:
To test the proportional hazards assumptions on the trained model, we will use the proportional_hazard_test
method supplied by Lifelines on the CPHFitter
class:
CPHFitter.proportional_hazard_test(fitted_cox_model, training_df, time_transform, precomputed_residuals)
Let’s look at each parameter of this method:
fitted_cox_model
: This parameter references the fitted Cox model. In our example, fitted_cox_model=cph_model
training_df
: This is a reference to the training data set. In our example, training_df=X
time_transform
: This variable takes a list of strings: {‘all’, ‘km’, ‘rank’, ‘identity’, ‘log’}. Each string indicates the function to apply to the y (duration) variable of the Cox model so as to lessen the sensitivity of the test to outliers in the data i.e. extreme duration values. Recollect that in the VA data set the y variable is SURVIVAL_IN_DAYS. ‘km’ applies the transformation: (1-KaplanMeirFitter.fit(durations, event_observed). The ‘rank’ transform will map the sorted list of durations to the set of ordered natural numbers [1, 2, 3, …]. ‘Identity’ will keep the durations intact and ‘log’ will log-transform the duration values.
precomputed_residuals
: You get to supply the type of residual errors of your choice from the following types: Schoenfeld, score, delta_beta, deviance, martingale, and variance scaled Schoenfeld.
Let’s compute the variance scaled Schoenfeld residuals of the Cox model which we trained earlier:
scaled_schoenfeld = cph_model.compute_residuals(training_dataframe=X, kind='scaled_schoenfeld')
To know more the Schoenfeld residuals, you may want to refer to the following article:
Now let’s perform the proportional hazards test:
from lifelines.statistics import proportional_hazard_testproportional_hazard_test(fitted_cox_model=cph_model, training_df=X, time_transform='log', precomputed_residuals=scaled_schoenfeld)
We get the following output:
The test statistic obeys a Chi-square(1) distribution under the Null hypothesis that the variable follows the proportional hazards test. Under the Null hypothesis, the expected value of the test statistic is zero. Any deviations from zero can be judged to be statistically significant at some significance level of interest such as 0.01, 0.05 etc.
The genesis of this test statistic is itself a fascinating topic of study. For the interested reader, the following paper provides a good starting point:
Getting back to our little problem, I have highlighted in red the variables which have failed the Chi-square(1) test at a significance level of 0.05 (95% confidence level).
We will try to solve these issues by stratifying AGE, CELL_TYPE[T.4] and KARNOFSKY_SCORE.
Read More …
[ad_2]