MLZC25-21. Feature Elimination: Optimizing the Model by Removing Unnecessary Features



This content originally appeared on DEV Community and was authored by Jesus Oviedo Riquelme

🎯 Post Objective: You will learn the feature elimination technique to identify variables you can remove without hurting (or even while improving) model performance, producing simpler, more efficient models.

🤔 Why Eliminate Features?

Having more features is not always better. The principle of parsimony (Occam's razor):

“Between two models with similar performance, the simpler one is preferable”

Problems with Having Many Features

❌ Overfitting: the model learns noise instead of real patterns
❌ Training time: training is slower with more features
❌ Complexity: harder to understand and interpret
❌ Computational cost: more memory and resources
❌ Curse of dimensionality: with many dimensions, the data becomes “sparse”

Advantages of a Simple Model

✅ Generalizes better: less prone to overfitting
✅ Faster: trains and predicts more quickly
✅ More interpretable: easier to explain
✅ Fewer resources: lower computational cost
✅ More robust: less sensitive to noise

🔍 What Is Feature Elimination?

Feature Elimination is the systematic process of:

  1. Training a model with all the features
  2. Removing one feature at a time
  3. Evaluating the impact on performance
  4. Identifying features that contribute little or nothing

Methodology

Baseline accuracy (all features) = 0.85

Feature 1 removed → Accuracy = 0.84  (↓0.01)
Feature 2 removed → Accuracy = 0.85  (↔ 0.00)  ← Candidate for removal
Feature 3 removed → Accuracy = 0.82  (↓0.03)
Feature 4 removed → Accuracy = 0.85  (↔ 0.00)  ← Candidate for removal
...

💻 Step-by-Step Implementation

Step 1: Train the Baseline Model

First, we need our reference model with all of the features:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# Load and prepare the data (code from the previous post)
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
df = pd.read_csv(url)

categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

df_clean = df.copy()
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna('NA')
for col in numerical_cols:
    if col != 'converted':
        df_clean[col] = df_clean[col].fillna(0.0)

# Split the data
df_train_full, df_temp = train_test_split(df_clean, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

y_train = df_train_full['converted'].values
y_val = df_val['converted'].values

X_train_df = df_train_full.drop('converted', axis=1).reset_index(drop=True)
X_val_df = df_val.drop('converted', axis=1).reset_index(drop=True)

# One-hot encoding
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val = dv.transform(X_val_df.to_dict(orient='records'))

# Train the baseline model
print("BASELINE MODEL (ALL FEATURES)")
print("=" * 60)

model_baseline = LogisticRegression(
    solver='liblinear', 
    C=1.0, 
    max_iter=1000, 
    random_state=42
)
model_baseline.fit(X_train, y_train)

# Evaluate
y_pred_baseline = model_baseline.predict(X_val)
accuracy_baseline = accuracy_score(y_val, y_pred_baseline)

print(f"Accuracy baseline: {accuracy_baseline:.6f}")
print(f"Número de features: {X_train.shape[1]}")
print(f"\n✅ Modelo baseline entrenado")

Step 2: Feature Elimination – One Feature at a Time

Now we try removing each feature and see how it affects accuracy:

print("\nFEATURE ELIMINATION - ANÁLISIS INDIVIDUAL")
print("=" * 70)

# Features to test (as listed in the homework)
features_to_test = ['industry', 'employment_status', 'lead_score']

# Dictionary to store the results
feature_importance = {}

print(f"\n{'Feature':<25} {'Acc w/o feature':>15} {'Difference':>15} {'Impact'}")
print("-" * 70)

for feature in features_to_test:
    # Skip the feature if it is not present in the dataframe
    if feature not in X_train_df.columns:
        print(f"⚠ Feature '{feature}' not found")
        continue

    # Build dataframes WITHOUT the feature
    X_train_no_feat = X_train_df.drop(feature, axis=1)
    X_val_no_feat = X_val_df.drop(feature, axis=1)

    # One-hot encoding without the feature
    dv_no_feat = DictVectorizer(sparse=False)
    X_train_transformed = dv_no_feat.fit_transform(X_train_no_feat.to_dict(orient='records'))
    X_val_transformed = dv_no_feat.transform(X_val_no_feat.to_dict(orient='records'))

    # Train a model without the feature
    model_no_feat = LogisticRegression(
        solver='liblinear', 
        C=1.0, 
        max_iter=1000, 
        random_state=42
    )
    model_no_feat.fit(X_train_transformed, y_train)

    # Evaluate
    y_pred_no_feat = model_no_feat.predict(X_val_transformed)
    accuracy_no_feat = accuracy_score(y_val, y_pred_no_feat)

    # Compute the difference against the baseline
    diff = accuracy_baseline - accuracy_no_feat
    feature_importance[feature] = diff

    # Classify the impact
    if diff < 0:
        impact = "Improves ↑"
    elif diff < 0.01:
        impact = "Minimal →"
    elif diff < 0.03:
        impact = "Low ↓"
    elif diff < 0.05:
        impact = "Moderate ↓↓"
    else:
        impact = "High ↓↓↓"

    print(f"{feature:<25} {accuracy_no_feat:>15.6f} {diff:>15.6f}  {impact}")

print("-" * 70)

Expected output:

FEATURE ELIMINATION - PER-FEATURE ANALYSIS
======================================================================

Feature                   Acc w/o feature      Difference  Impact
----------------------------------------------------------------------
industry                        0.849315        0.000000  Minimal →
employment_status               0.852740       -0.003425  Improves ↑
lead_score                      0.835616        0.013699  Low ↓
----------------------------------------------------------------------
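
The homework only asks about three specific features, but the same loop generalizes to every column. A minimal sketch of a full sweep, reusing X_train_df, X_val_df, y_train, y_val and accuracy_baseline from Step 1 (the helper name evaluate_without is just illustrative), could look like this:

def evaluate_without(feature):
    # Train and evaluate a model that excludes a single feature
    X_tr = X_train_df.drop(feature, axis=1)
    X_va = X_val_df.drop(feature, axis=1)

    dv_tmp = DictVectorizer(sparse=False)
    X_tr_t = dv_tmp.fit_transform(X_tr.to_dict(orient='records'))
    X_va_t = dv_tmp.transform(X_va.to_dict(orient='records'))

    m = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    m.fit(X_tr_t, y_train)
    return accuracy_score(y_val, m.predict(X_va_t))

# Sweep every column, not just the three from the homework
all_diffs = {col: accuracy_baseline - evaluate_without(col) for col in X_train_df.columns}

# The smallest (or negative) differences mark the most expendable features
for col, diff in sorted(all_diffs.items(), key=lambda x: x[1]):
    print(f"{col:<25} {diff:>10.6f}")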

Step 3: Identify the Least Useful Feature (Question 5)

print("\nQUESTION 5: FEATURE CON MENOR IMPACTO")
print("=" * 70)

# Find the feature with the SMALLEST difference (it can be negative)
min_diff_feature = min(feature_importance, key=feature_importance.get)
min_diff = feature_importance[min_diff_feature]

print(f"\nRanking por impacto al remover:")
for idx, (feat, diff) in enumerate(sorted(feature_importance.items(), 
                                          key=lambda x: x[1]), 1):
    marker = " ← MENOR IMPACTO" if feat == min_diff_feature else ""
    direction = "↑ Mejora" if diff < 0 else "↓ Empeora" if diff > 0 else "→ Sin cambio"
    print(f"{idx}. {feat:<25} : {diff:>10.6f} {direction}{marker}")

print(f"\n🎯 Feature con MENOR diferencia (más prescindible):")
print(f"   Feature: '{min_diff_feature}'")
print(f"   Diferencia: {min_diff:.6f}")

if min_diff < 0:
    print(f"   ✅ Removerla MEJORA el accuracy en {abs(min_diff):.6f}")
elif min_diff < 0.01:
    print(f"   ✅ Removerla tiene impacto MÍNIMO")
else:
    print(f"   ⚠ Removerla empeora el accuracy en {min_diff:.6f}")

print(f"\n✅ RESPUESTA QUESTION 5: {min_diff_feature}")

Interpreting the Results

print("\nINTERPRETACIÓN DETALLADA")
print("=" * 70)

for feature, diff in feature_importance.items():
    print(f"\n{feature}:")
    print(f"  Diferencia: {diff:.6f}")

    if diff < 0:
        print(f"  ✅ Remover esta feature MEJORA el modelo")
        print(f"  Razón: Probablemente añadía ruido o correlación espuria")
    elif diff < 0.01:
        print(f"  ✅ Esta feature es PRESCINDIBLE")
        print(f"  Razón: No aporta información significativa")
    elif diff < 0.03:
        print(f"  ⚠ Esta feature aporta información LIMITADA")
        print(f"  Razón: Tiene algo de valor pero no crítica")
    else:
        print(f"  ❌ Esta feature es IMPORTANTE")
        print(f"  Razón: Removerla afecta significativamente el rendimiento")

📊 A Deeper Analysis

Comparing Multiple Metrics

Beyond accuracy, we can also look at other metrics:

from sklearn.metrics import precision_score, recall_score, f1_score

print("\nANÁLISIS MULTI-MÉTRICA")
print("=" * 80)

print(f"\n{'Feature':<20} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1-Score':>10}")
print("-" * 80)

# Baseline
y_pred_baseline = model_baseline.predict(X_val)
print(f"{'Baseline (todas)':<20} "
      f"{accuracy_score(y_val, y_pred_baseline):>10.4f} "
      f"{precision_score(y_val, y_pred_baseline):>10.4f} "
      f"{recall_score(y_val, y_pred_baseline):>10.4f} "
      f"{f1_score(y_val, y_pred_baseline):>10.4f}")

# Without each feature
for feature in features_to_test:
    X_train_no_feat = X_train_df.drop(feature, axis=1, errors='ignore')
    X_val_no_feat = X_val_df.drop(feature, axis=1, errors='ignore')

    dv_temp = DictVectorizer(sparse=False)
    X_train_temp = dv_temp.fit_transform(X_train_no_feat.to_dict(orient='records'))
    X_val_temp = dv_temp.transform(X_val_no_feat.to_dict(orient='records'))

    model_temp = LogisticRegression(solver='liblinear', C=1.0, 
                                     max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)
    y_pred_temp = model_temp.predict(X_val_temp)

    print(f"{'Sin ' + feature:<20} "
          f"{accuracy_score(y_val, y_pred_temp):>10.4f} "
          f"{precision_score(y_val, y_pred_temp):>10.4f} "
          f"{recall_score(y_val, y_pred_temp):>10.4f} "
          f"{f1_score(y_val, y_pred_temp):>10.4f}")

Visualizing the Impact

import matplotlib.pyplot as plt

# Prepare the data for plotting
features = list(feature_importance.keys())
differences = list(feature_importance.values())

# Create the chart
plt.figure(figsize=(10, 6))

colors = ['green' if d < 0 else 'red' if d > 0.01 else 'orange' 
          for d in differences]

plt.barh(features, differences, color=colors, alpha=0.7)
plt.xlabel('Accuracy Difference (Baseline - Without Feature)', 
           fontweight='bold')
plt.ylabel('Feature', fontweight='bold')
plt.title('Impact of Removing Each Feature', 
          fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(axis='x', alpha=0.3)

# Add value annotations
for i, (feat, diff) in enumerate(zip(features, differences)):
    x_pos = diff + 0.001 if diff > 0 else diff - 0.001
    ha = 'left' if diff > 0 else 'right'
    plt.text(x_pos, i, f'{diff:.4f}', va='center', ha=ha, fontweight='bold')

plt.tight_layout()
plt.show()

🔄 Recursive Feature Elimination

For a more thorough analysis, we could eliminate features recursively:

print("\nFEATURE ELIMINATION RECURSIVO (RFE)")
print("=" * 70)

from sklearn.feature_selection import RFE

# Use sklearn's RFE
model_rfe = LogisticRegression(solver='liblinear', C=1.0, 
                                max_iter=1000, random_state=42)

# Select the top N features
n_features_to_select = 30  # for example

rfe = RFE(estimator=model_rfe, n_features_to_select=n_features_to_select)
rfe.fit(X_train, y_train)

# Evaluate
y_pred_rfe = rfe.predict(X_val)
accuracy_rfe = accuracy_score(y_val, y_pred_rfe)

print(f"Accuracy with RFE ({n_features_to_select} features): {accuracy_rfe:.6f}")
print(f"Baseline accuracy ({X_train.shape[1]} features): {accuracy_baseline:.6f}")
print(f"Difference: {accuracy_rfe - accuracy_baseline:.6f}")

# See which features were selected
feature_names = dv.get_feature_names_out()
selected_features = [feat for feat, selected in zip(feature_names, rfe.support_) 
                    if selected]
print(f"\nSelected features: {len(selected_features)}")

💡 Best Practices in Feature Elimination

✅ DO

  1. Use a separate validation set
   # ✅ GOOD: evaluate on the validation set
   accuracy_val = model.score(X_val, y_val)
  2. Consider multiple metrics
   # Not just accuracy; also precision, recall, F1
  3. Try eliminating combinations of features (see the sketch right after this list)
   # Remove the 2-3 least important features together
  4. Validate on the test set
   # After selecting features, validate on test
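
Combining items 3 and 4, here is a minimal sketch that drops two candidate features together and validates the reduced model on the test set. It assumes the objects from Step 1 (df_test, X_train_df, X_val_df, y_train, y_val), and the two dropped features are purely illustrative, based on the per-feature analysis above:

# Candidate features to drop together (an illustrative choice based on the analysis above)
features_to_drop = ['employment_status', 'industry']

# Test labels (df_test comes from the split in Step 1)
df_test_clean = df_test.reset_index(drop=True)
y_test = df_test_clean['converted'].values

# Remove the candidates from train, validation and test
X_train_red = X_train_df.drop(columns=features_to_drop)
X_val_red = X_val_df.drop(columns=features_to_drop)
X_test_red = df_test_clean.drop('converted', axis=1).drop(columns=features_to_drop)

# Re-encode and retrain with the reduced feature set
dv_red = DictVectorizer(sparse=False)
X_train_r = dv_red.fit_transform(X_train_red.to_dict(orient='records'))
X_val_r = dv_red.transform(X_val_red.to_dict(orient='records'))
X_test_r = dv_red.transform(X_test_red.to_dict(orient='records'))

model_red = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_red.fit(X_train_r, y_train)

print(f"Validation accuracy (reduced): {accuracy_score(y_val, model_red.predict(X_val_r)):.6f}")
print(f"Test accuracy (reduced):       {accuracy_score(y_test, model_red.predict(X_test_r)):.6f}")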

❌ DON’T

  1. Eliminate without analyzing
   # ❌ BAD: removing features at random
  2. Evaluate only on train
   # ❌ BAD: accuracy_train (it can be misleading)
  3. Eliminate too many features
   # ❌ You risk losing valuable information
  4. Ignore the business context
   # ❌ Some features are important for business reasons

📝 Complete Code for Reference

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# 1. Prepare the data
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
df = pd.read_csv(url)

categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

df_clean = df.copy()
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna('NA')
for col in numerical_cols:
    if col != 'converted':
        df_clean[col] = df_clean[col].fillna(0.0)

# 2. Split the data
df_train_full, df_temp = train_test_split(df_clean, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

y_train = df_train_full['converted'].values
y_val = df_val['converted'].values

X_train_df = df_train_full.drop('converted', axis=1).reset_index(drop=True)
X_val_df = df_val.drop('converted', axis=1).reset_index(drop=True)

# 3. Baseline model
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val = dv.transform(X_val_df.to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
accuracy_baseline = accuracy_score(y_val, model.predict(X_val))

print(f"Baseline accuracy: {accuracy_baseline:.6f}")

# 4. Feature elimination
features_to_test = ['industry', 'employment_status', 'lead_score']
feature_importance = {}

for feature in features_to_test:
    X_train_no_feat = X_train_df.drop(feature, axis=1, errors='ignore')
    X_val_no_feat = X_val_df.drop(feature, axis=1, errors='ignore')

    dv_no_feat = DictVectorizer(sparse=False)
    X_train_temp = dv_no_feat.fit_transform(X_train_no_feat.to_dict(orient='records'))
    X_val_temp = dv_no_feat.transform(X_val_no_feat.to_dict(orient='records'))

    model_temp = LogisticRegression(solver='liblinear', C=1.0, 
                                     max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)
    accuracy_temp = accuracy_score(y_val, model_temp.predict(X_val_temp))

    diff = accuracy_baseline - accuracy_temp
    feature_importance[feature] = diff
    print(f"{feature}: diff = {diff:.6f}")

# 5. Identify the least useful feature
min_feat = min(feature_importance, key=feature_importance.get)
print(f"\nLeast useful: {min_feat}")

🎯 Conclusion

Feature elimination lets us:

  1. ✅ Simplify models without losing performance
  2. ✅ Identify expendable features
  3. ✅ Improve interpretability
  4. ✅ Reduce computational complexity

Key takeaways:

  • 🎯 Not all features are equally important
  • 📊 Removing features can even improve the model
  • 🔍 Always evaluate on the validation set
  • ⚖ Balance simplicity vs. performance

In the next and final post (MLZC25-22), we will explore regularization by tuning the C parameter to find the optimal model that generalizes best.

Which features did you find least useful? Did you try removing several features at once?

