## The Stratified Cox Proportional Hazards Regression Model | by Sachin Date | Dec, 2020

[ad_1]

The data set we’ll use to illustrate the procedure of building a stratified Cox proportional hazards model is the **US Veterans Administration Lung Cancer Trial data**. It contains data about 137 patients with advanced, inoperable lung cancer who were treated with a standard and an experimental chemotherapy regimen. Their progress was tracked during the study until the patient died or exited the trial while still alive, or until the trial ended. In the later two situations, the data is considered to be right censored. The data set appears in the book The Statistical Analysis of Failure Time Data, Second Edition, by John D. Kalbfleisch and Ross L. Prentice.

Using Python and Pandas, let’s load the data set into a DataFrame:

importpandasaspddata_types = {'TREATMENT_TYPE':'int', 'CELL_TYPE':'category', 'SURVIVAL_IN_DAYS':'int', 'STATUS':'int', 'KARNOFSKY_SCORE':'int', 'MONTHS_FROM_DIAGNOSIS':'int', 'AGE':'int', 'PRIOR_THERAPY':'int'}df = pd.read_csv(filepath_or_buffer='va_lung_cancer_dataset.csv',dtype=data_types)df.head()

Here is the output:

Our regression variables ** X** are going to be the following:

**TREATMENT_TYPE**: 1=Standard. 2=Experimental**CELL_TYPE**: 1=Squamous, 2=Small cell, 3=Adeno, 4=large**KARNOFSKY_SCORE**: A measure of general performance of the patient. 100=Best**MONTHS_FROM_DIAGNOSIS**: The number of months after diagnosis of lung cancer that the patient entered the trial.**AGE**: The age in years of the patient when they were inducted into the trial.**PRIOR_THERAPY**: Whether the patient had received any kind of prior therapy for lung cancer before induction into the trial.

Our dependent variable ** y** is going to be:

**SURVIVAL_IN_DAYS:**Indicating how many days the patient lived after being inducted into the trail.

The event variable is:

**STATUS:**1=Dead. 0=Alive

Using Patsy, let’s break out the categorical variable **CELL_TYPE **into different category wise column variables. Don’t worry about the fact that **SURVIVAL_IN_DAYS **is on sides of the model expression. It’s just to make Patsy happy.

frompatsyimportdmatrices#Build the model expression in Patsy syntax.model_expr = 'SURVIVAL_IN_DAYS ~ TREATMENT_TYPE + CELL_TYPE + KARNOFSKY_SCORE + MONTHS_FROM_DIAGNOSIS + AGE + PRIOR_THERAPY + SURVIVAL_IN_DAYS + STATUS'#Use the model expression to break out the CELL_TYPE categorical variable into 1-0 type columns

y, X =dmatrices(model_expr, df,return_type='dataframe')#Print out the first few rows

X.head()

Next, let’s build and train the regular (non-stratified) Cox Proportional Hazards model on this data using the **Lifelines**** **Survival Analysis library:

fromlifelinesimportCoxPHFitter#Create the Cox model

cph_model =CoxPHFitter()#Train the model on the data set

cph_model.fit(df=X,duration_col='SURVIVAL_IN_DAYS',event_col='STATUS')#Print the model summarycph_model.print_summary()

We see the following model summary:

To test the proportional hazards assumptions on the trained model, we will use the

method supplied by Lifelines on the **proportional_hazard_test**

class:**CPHFitter**

**CPHFitter**.**proportional_hazard_test**(**fitted_cox_model**, **training_df**, **time_transform**, **precomputed_residuals**)

Let’s look at each parameter of this method:

: This parameter references the fitted Cox model. In our example, **fitted_cox_model****fitted_cox_model**=cph_model

: This is a reference to the training data set. In our example, **training_df****training_df**=X

: This variable takes a list of strings: {‘all’, ‘km’, ‘rank’, ‘identity’, ‘log’}. Each string indicates the function to apply to the **time_transform**** y** (duration) variable of the Cox model so as to lessen the sensitivity of the test to outliers in the data i.e. extreme duration values. Recollect that in the VA data set the

**variable is**

*y***SURVIVAL_IN_DAYS**. ‘km’ applies the transformation:

*(1-KaplanMeirFitter.fit(durations, event_observed).*The ‘rank’ transform will map the sorted list of durations to the set of ordered natural numbers [1, 2, 3, …]. ‘Identity’ will keep the durations intact and ‘log’ will log-transform the duration values.

: You get to supply the type of residual errors of your choice from the following types: Schoenfeld, score, delta_beta, deviance, martingale, and variance scaled Schoenfeld.**precomputed_residuals**

Let’s compute the variance scaled Schoenfeld residuals of the Cox model which we trained earlier:

`scaled_schoenfeld = cph_model.`**compute_residuals**(**training_dataframe**=X, **kind**='scaled_schoenfeld')

To know more the Schoenfeld residuals, you may want to refer to the following article:

Now let’s perform the proportional hazards test:

fromlifelines.statisticsimportproportional_hazard_testproportional_hazard_test(fitted_cox_model=cph_model,training_df=X,time_transform='log',precomputed_residuals=scaled_schoenfeld)

We get the following output:

The test statistic obeys a Chi-square(1) distribution under the Null hypothesis that the variable follows the proportional hazards test. Under the Null hypothesis, the expected value of the test statistic is zero. Any deviations from zero can be judged to be statistically significant at some significance level of interest such as 0.01, 0.05 etc.

The genesis of this test statistic is itself a fascinating topic of study. For the interested reader, the following paper provides a good starting point:

Getting back to our little problem, I have highlighted in red the variables which have failed the Chi-square(1) test at a significance level of 0.05 (95% confidence level).

We will try to solve these issues by stratifying AGE, CELL_TYPE[T.4] and KARNOFSKY_SCORE.

Read More …

[ad_2]