27.5: Validación cruzada (Sección 26.6.1)

Última actualización
Guardar como PDF

Page ID: 150911

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

La validación cruzada es una técnica poderosa que nos permite estimar qué tan bien se generalizarán nuestros resultados a un nuevo conjunto de datos. Aquí construiremos nuestro propio código de validación cruzada para ver cómo funciona, continuando con el ejemplo de regresión logística de la sección anterior.

En la validación cruzada, queremos dividir los datos en varios subconjuntos y luego entrenar iterativamente el modelo dejando fuera cada subconjunto (que generalmente llamamos pliegues) y luego probar el modelo en ese pliegue repartido Vamos a escribir nuestro propio código para hacer esta división; una forma relativamente fácil de esto es crear un vector que contenga los números de pliegue y luego barajarlo aleatoriamente para crear las asignaciones de pliegue para cada punto de datos.

nfolds <- 4 # number of folds

# we use the kronecker() function to repeat the folds
fold <-  kronecker(seq(nfolds),rep(1,npatients/nfolds))
# randomly shuffle using the sample() function
fold <- sample(fold)

# add variable to store CV predictions
disease_df <- disease_df %>%
  mutate(CVpred=NA)

# now loop through folds and separate training and test data
for (f in seq(nfolds)){
  # get training and test data
  train_df <- disease_df[fold!=f,]
  test_df <- disease_df[fold==f,]
  # fit model to training data
  glm_result_cv <- glm(heartattack ~ biomarker, data=train_df,
                  family=binomial())
  # get probability of heart attack on test data
  pred <- predict(glm_result_cv,newdata = test_df)
  # convert to prediction and put into data frame
  disease_df$CVpred[fold==f] = (pred>0.5)

}

Ahora veamos el rendimiento del modelo:

# create table comparing predicted to actual outcomes
CrossTable(disease_df$CVpred,
           disease_df$heartattack,
           prop.t=FALSE,
           prop.r=FALSE,
           prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                   | disease_df$heartattack 
## disease_df$CVpred |     FALSE |      TRUE | Row Total | 
## ------------------|-----------|-----------|-----------|
##             FALSE |       416 |       269 |       685 | 
##                   |     0.832 |     0.538 |           | 
## ------------------|-----------|-----------|-----------|
##              TRUE |        84 |       231 |       315 | 
##                   |     0.168 |     0.462 |           | 
## ------------------|-----------|-----------|-----------|
##      Column Total |       500 |       500 |      1000 | 
##                   |     0.500 |     0.500 |           | 
## ------------------|-----------|-----------|-----------|
## 
##

Ahora vemos que el modelo sólo predice con precisión menos de la mitad de los ataques cardíacos que ocurrieron cuando se está prediciendo a una nueva muestra. Esto nos dice que este es el nivel de predicción que podríamos esperar si se aplicara el modelo a una nueva muestra de pacientes de la misma población.