This content originally appeared on DEV Community and was authored by Jesus Oviedo Riquelme
Goal of this post: you will learn the feature elimination technique to identify variables you can remove without hurting (or even while improving) model performance, producing simpler, more efficient models.
Why Eliminate Features?
Having more features is not always better. Principle of parsimony (Occam's razor):
“Between two models with similar performance, the simpler one is preferable”
Problems with Having Many Features
- Overfitting: the model learns noise instead of real patterns
- Training time: training is slower with more features
- Complexity: the model is harder to understand and interpret
- Computational cost: more memory and resources
- Curse of dimensionality: with many dimensions, the data becomes “sparse”
Advantages of a Simple Model
- Generalizes better: less prone to overfitting
- Faster: trains and predicts more quickly
- More interpretable: easier to explain
- Fewer resources: lower computational cost
- More robust: less sensitive to noise
What Is Feature Elimination?
Feature elimination is the systematic process of:
- Training a model with all the features
- Removing one feature at a time
- Evaluating the impact on performance
- Identifying the features that contribute little or nothing
Methodology
Baseline accuracy (all features) = 0.85
Feature 1 removed → Accuracy = 0.84 (↓0.01)
Feature 2 removed → Accuracy = 0.85 (↔ 0.00) ← Candidate for elimination
Feature 3 removed → Accuracy = 0.82 (↓0.03)
Feature 4 removed → Accuracy = 0.85 (↔ 0.00) ← Candidate for elimination
...
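Before the full step-by-step version below, here is a compact sketch of that loop. It is only illustrative: it assumes numeric-only features (no one-hot encoding yet), pre-split df_train/df_val DataFrames, and hypothetical helper names (elimination_report, fit_score).

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def elimination_report(df_train, df_val, target):
    """Return {feature: baseline_accuracy - accuracy_without_that_feature}."""
    features = [c for c in df_train.columns if c != target]

    def fit_score(cols):
        # Assumes all columns in `cols` are numeric; real data may need encoding
        model = LogisticRegression(max_iter=1000)
        model.fit(df_train[cols], df_train[target])
        return accuracy_score(df_val[target], model.predict(df_val[cols]))

    baseline = fit_score(features)
    # diff > 0: removing the feature hurts; diff <= 0: the feature is expendable
    return {f: baseline - fit_score([c for c in features if c != f])
            for f in features}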
Step-by-Step Implementation
Step 1: Train the Baseline Model
First, we need our reference model trained with all the features:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# Load and prepare the data (code from the previous post)
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
df = pd.read_csv(url)

categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

df_clean = df.copy()
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna('NA')
for col in numerical_cols:
    if col != 'converted':
        df_clean[col] = df_clean[col].fillna(0.0)

# Data split
df_train_full, df_temp = train_test_split(df_clean, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

y_train = df_train_full['converted'].values
y_val = df_val['converted'].values

X_train_df = df_train_full.drop('converted', axis=1).reset_index(drop=True)
X_val_df = df_val.drop('converted', axis=1).reset_index(drop=True)

# One-hot encoding
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val = dv.transform(X_val_df.to_dict(orient='records'))

# Train the baseline model
print("BASELINE MODEL (WITH ALL FEATURES)")
print("=" * 60)

model_baseline = LogisticRegression(
    solver='liblinear',
    C=1.0,
    max_iter=1000,
    random_state=42
)
model_baseline.fit(X_train, y_train)

# Evaluate
y_pred_baseline = model_baseline.predict(X_val)
accuracy_baseline = accuracy_score(y_val, y_pred_baseline)

print(f"Baseline accuracy: {accuracy_baseline:.6f}")
print(f"Number of features: {X_train.shape[1]}")
print(f"\n✅ Baseline model trained")
Step 2: Feature Elimination – Individual Test
Now we remove each feature in turn and see how it affects the accuracy:
print("\nFEATURE ELIMINATION - ANÁLISIS INDIVIDUAL")
print("=" * 70)
# Features a probar (según el homework)
features_to_test = ['industry', 'employment_status', 'lead_score']
# Diccionario para guardar resultados
feature_importance = {}
print(f"\n{'Feature':<25} {'Acc sin feature':>15} {'Diferencia':>15} {'Impacto'}")
print("-" * 70)
for feature in features_to_test:
# Crear dataframes SIN la feature
X_train_no_feat = X_train_df.drop(feature, axis=1, errors='ignore')
X_val_no_feat = X_val_df.drop(feature, axis=1, errors='ignore')
# Verificar si la feature existe
if feature not in X_train_df.columns:
print(f"⚠ Feature '{feature}' no encontrada")
continue
# One-hot encoding sin la feature
dv_no_feat = DictVectorizer(sparse=False)
X_train_transformed = dv_no_feat.fit_transform(X_train_no_feat.to_dict(orient='records'))
X_val_transformed = dv_no_feat.transform(X_val_no_feat.to_dict(orient='records'))
# Entrenar modelo sin la feature
model_no_feat = LogisticRegression(
solver='liblinear',
C=1.0,
max_iter=1000,
random_state=42
)
model_no_feat.fit(X_train_transformed, y_train)
# Evaluar
y_pred_no_feat = model_no_feat.predict(X_val_transformed)
accuracy_no_feat = accuracy_score(y_val, y_pred_no_feat)
# Calcular diferencia
diff = accuracy_baseline - accuracy_no_feat
feature_importance[feature] = diff
# Determinar impacto
if diff < 0:
impact = "Mejora ↑"
elif diff < 0.01:
impact = "Mínimo →"
elif diff < 0.03:
impact = "Bajo ↓"
elif diff < 0.05:
impact = "Moderado ↓↓"
else:
impact = "Alto ↓↓↓"
print(f"{feature:<25} {accuracy_no_feat:>15.6f} {diff:>15.6f} {impact}")
print("-" * 70)
Expected output:
FEATURE ELIMINATION - INDIVIDUAL ANALYSIS
======================================================================
Feature                   Acc w/o feature      Difference Impact
----------------------------------------------------------------------
industry                         0.849315        0.000000 Minimal →
employment_status                0.852740       -0.003425 Improves ↑
lead_score                       0.835616        0.013699 Low ↓
----------------------------------------------------------------------
Step 3: Identify the Least Useful Feature (Question 5)
print("\nQUESTION 5: FEATURE CON MENOR IMPACTO")
print("=" * 70)
# Encontrar la feature con MENOR diferencia (puede ser negativa)
min_diff_feature = min(feature_importance, key=feature_importance.get)
min_diff = feature_importance[min_diff_feature]
print(f"\nRanking por impacto al remover:")
for idx, (feat, diff) in enumerate(sorted(feature_importance.items(),
key=lambda x: x[1]), 1):
marker = " ← MENOR IMPACTO" if feat == min_diff_feature else ""
direction = "↑ Mejora" if diff < 0 else "↓ Empeora" if diff > 0 else "→ Sin cambio"
print(f"{idx}. {feat:<25} : {diff:>10.6f} {direction}{marker}")
print(f"\n🎯 Feature con MENOR diferencia (más prescindible):")
print(f" Feature: '{min_diff_feature}'")
print(f" Diferencia: {min_diff:.6f}")
if min_diff < 0:
print(f" ✅ Removerla MEJORA el accuracy en {abs(min_diff):.6f}")
elif min_diff < 0.01:
print(f" ✅ Removerla tiene impacto MÍNIMO")
else:
print(f" ⚠ Removerla empeora el accuracy en {min_diff:.6f}")
print(f"\n✅ RESPUESTA QUESTION 5: {min_diff_feature}")
Interpreting the Results
print("\nINTERPRETACIÓN DETALLADA")
print("=" * 70)
for feature, diff in feature_importance.items():
print(f"\n{feature}:")
print(f" Diferencia: {diff:.6f}")
if diff < 0:
print(f" ✅ Remover esta feature MEJORA el modelo")
print(f" Razón: Probablemente añadía ruido o correlación espuria")
elif diff < 0.01:
print(f" ✅ Esta feature es PRESCINDIBLE")
print(f" Razón: No aporta información significativa")
elif diff < 0.03:
print(f" ⚠ Esta feature aporta información LIMITADA")
print(f" Razón: Tiene algo de valor pero no crítica")
else:
print(f" ❌ Esta feature es IMPORTANTE")
print(f" Razón: Removerla afecta significativamente el rendimiento")
Deeper Analysis
Comparing Multiple Metrics
We should not look only at accuracy; other metrics matter too:
from sklearn.metrics import precision_score, recall_score, f1_score

print("\nMULTI-METRIC ANALYSIS")
print("=" * 80)
print(f"\n{'Feature':<20} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1-Score':>10}")
print("-" * 80)

# Baseline
y_pred_baseline = model_baseline.predict(X_val)
print(f"{'Baseline (all)':<20} "
      f"{accuracy_score(y_val, y_pred_baseline):>10.4f} "
      f"{precision_score(y_val, y_pred_baseline):>10.4f} "
      f"{recall_score(y_val, y_pred_baseline):>10.4f} "
      f"{f1_score(y_val, y_pred_baseline):>10.4f}")

# Without each feature
for feature in features_to_test:
    X_train_no_feat = X_train_df.drop(feature, axis=1, errors='ignore')
    X_val_no_feat = X_val_df.drop(feature, axis=1, errors='ignore')

    dv_temp = DictVectorizer(sparse=False)
    X_train_temp = dv_temp.fit_transform(X_train_no_feat.to_dict(orient='records'))
    X_val_temp = dv_temp.transform(X_val_no_feat.to_dict(orient='records'))

    model_temp = LogisticRegression(solver='liblinear', C=1.0,
                                    max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)
    y_pred_temp = model_temp.predict(X_val_temp)

    print(f"{'Without ' + feature:<20} "
          f"{accuracy_score(y_val, y_pred_temp):>10.4f} "
          f"{precision_score(y_val, y_pred_temp):>10.4f} "
          f"{recall_score(y_val, y_pred_temp):>10.4f} "
          f"{f1_score(y_val, y_pred_temp):>10.4f}")
Visualizing the Impact
import matplotlib.pyplot as plt

# Prepare the data for the plot
features = list(feature_importance.keys())
differences = list(feature_importance.values())

# Build the chart
plt.figure(figsize=(10, 6))
colors = ['green' if d < 0 else 'red' if d > 0.01 else 'orange'
          for d in differences]
plt.barh(features, differences, color=colors, alpha=0.7)
plt.xlabel('Difference in Accuracy (Baseline - Without Feature)',
           fontweight='bold')
plt.ylabel('Feature', fontweight='bold')
plt.title('Impact of Removing Each Feature',
          fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(axis='x', alpha=0.3)

# Add annotations
for i, (feat, diff) in enumerate(zip(features, differences)):
    x_pos = diff + 0.001 if diff > 0 else diff - 0.001
    ha = 'left' if diff > 0 else 'right'
    plt.text(x_pos, i, f'{diff:.4f}', va='center', ha=ha, fontweight='bold')

plt.tight_layout()
plt.show()
Recursive Feature Elimination
For a more thorough analysis, we could eliminate features recursively:
print("\nFEATURE ELIMINATION RECURSIVO (RFE)")
print("=" * 70)
from sklearn.feature_selection import RFE
# Usar RFE de sklearn
model_rfe = LogisticRegression(solver='liblinear', C=1.0,
max_iter=1000, random_state=42)
# Seleccionar top N features
n_features_to_select = 30 # Por ejemplo
rfe = RFE(estimator=model_rfe, n_features_to_select=n_features_to_select)
rfe.fit(X_train, y_train)
# Evaluar
y_pred_rfe = rfe.predict(X_val)
accuracy_rfe = accuracy_score(y_val, y_pred_rfe)
print(f"Accuracy con RFE ({n_features_to_select} features): {accuracy_rfe:.6f}")
print(f"Accuracy baseline ({X_train.shape[1]} features): {accuracy_baseline:.6f}")
print(f"Diferencia: {accuracy_rfe - accuracy_baseline:.6f}")
# Ver qué features fueron seleccionadas
feature_names = dv.get_feature_names_out()
selected_features = [feat for feat, selected in zip(feature_names, rfe.support_)
if selected]
print(f"\nFeatures seleccionadas: {len(selected_features)}")
Best Practices in Feature Elimination
DO
- Use a separate validation set
# ✅ GOOD: evaluate on the validation set
accuracy_val = model.score(X_val, y_val)
- Consider multiple metrics
# Not just accuracy: also precision, recall, F1
- Try combined elimination (see the sketch right after this list)
# Remove the 2-3 least important features together
- Validate on the test set
# After selecting features, validate on test
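A minimal sketch of combined elimination, reusing the objects built earlier (X_train_df, X_val_df, y_train, y_val, accuracy_baseline); the candidates_to_drop list below is only a hypothetical example and should come from your own individual analysis:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical candidates taken from the individual analysis above
candidates_to_drop = ['industry', 'employment_status']

dv_comb = DictVectorizer(sparse=False)
X_train_comb = dv_comb.fit_transform(
    X_train_df.drop(columns=candidates_to_drop).to_dict(orient='records'))
X_val_comb = dv_comb.transform(
    X_val_df.drop(columns=candidates_to_drop).to_dict(orient='records'))

model_comb = LogisticRegression(solver='liblinear', C=1.0,
                                max_iter=1000, random_state=42)
model_comb.fit(X_train_comb, y_train)

acc_comb = accuracy_score(y_val, model_comb.predict(X_val_comb))
print(f"Accuracy without {candidates_to_drop}: {acc_comb:.6f} "
      f"(baseline: {accuracy_baseline:.6f})")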
DON’T
- Eliminate without analyzing
# ❌ BAD: removing features at random
- Use only the training set to evaluate
# ❌ BAD: accuracy_train (it can be misleading)
- Eliminate too many features
# ❌ You may lose valuable information
- Ignore the business context
# ❌ Some features matter for business reasons
Complete Code for Reference
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score

# 1. Prepare the data
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"
df = pd.read_csv(url)

categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

df_clean = df.copy()
for col in categorical_cols:
    df_clean[col] = df_clean[col].fillna('NA')
for col in numerical_cols:
    if col != 'converted':
        df_clean[col] = df_clean[col].fillna(0.0)

# 2. Split
df_train_full, df_temp = train_test_split(df_clean, test_size=0.4, random_state=42)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

y_train = df_train_full['converted'].values
y_val = df_val['converted'].values
X_train_df = df_train_full.drop('converted', axis=1).reset_index(drop=True)
X_val_df = df_val.drop('converted', axis=1).reset_index(drop=True)

# 3. Baseline model
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_df.to_dict(orient='records'))
X_val = dv.transform(X_val_df.to_dict(orient='records'))

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)
accuracy_baseline = accuracy_score(y_val, model.predict(X_val))
print(f"Baseline accuracy: {accuracy_baseline:.6f}")

# 4. Feature elimination
features_to_test = ['industry', 'employment_status', 'lead_score']
feature_importance = {}

for feature in features_to_test:
    X_train_no_feat = X_train_df.drop(feature, axis=1, errors='ignore')
    X_val_no_feat = X_val_df.drop(feature, axis=1, errors='ignore')

    dv_no_feat = DictVectorizer(sparse=False)
    X_train_temp = dv_no_feat.fit_transform(X_train_no_feat.to_dict(orient='records'))
    X_val_temp = dv_no_feat.transform(X_val_no_feat.to_dict(orient='records'))

    model_temp = LogisticRegression(solver='liblinear', C=1.0,
                                    max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp, y_train)

    accuracy_temp = accuracy_score(y_val, model_temp.predict(X_val_temp))
    diff = accuracy_baseline - accuracy_temp
    feature_importance[feature] = diff
    print(f"{feature}: diff = {diff:.6f}")

# 5. Identify the least useful feature
min_feat = min(feature_importance, key=feature_importance.get)
print(f"\nLeast useful: {min_feat}")
Conclusion
Feature elimination lets us:
- Simplify models without losing performance
- Identify expendable features
- Improve interpretability
- Reduce computational complexity
Key takeaways:
- Not all features are equally important
- Removing features can even improve the model
- Always evaluate on the validation set
- Balance simplicity against performance
In the next and final post (MLZC25-22), we will explore regularization, tuning the C parameter to find the model that generalizes best.
Which features did you find to be least useful? Did you try removing several features at once?