Predicting Fuel Efficiency with Tree-Based Models: A Hands-On Machine Learning Walkthrough



This content originally appeared on DEV Community and was authored by Kenechukwu Anoliefo

Understanding how vehicle characteristics affect fuel efficiency is a classic regression problem — and an excellent way to explore tree-based models like Decision Trees, Random Forests, and XGBoost. In this project, I analyzed a dataset of cars and built models to predict fuel efficiency (MPG) with different configurations.

🧩 Step 1 — Data Preparation

The dataset contained various vehicle features, including:

  • vehicle_weight
  • engine_displacement
  • horsepower
  • acceleration
  • model_year
  • origin
  • fuel_type

To ensure data consistency, all missing values were filled with zeros.
Then I performed a train/validation/test split (60%/20%/20%), using random_state=1 for reproducibility.

Next, I used DictVectorizer(sparse=True) to convert categorical and numerical features into a format suitable for scikit-learn models.
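The split and vectorization steps above can be sketched as follows. The records and MPG values below are toy stand-ins (the real dataset isn't reproduced here), but the split proportions, random_state, and DictVectorizer usage match the post:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Toy records standing in for the car dataset (same column names as above)
records = [
    {"vehicle_weight": 2500.0, "horsepower": 90.0, "fuel_type": "gas"},
    {"vehicle_weight": 3200.0, "horsepower": 130.0, "fuel_type": "diesel"},
    {"vehicle_weight": 2800.0, "horsepower": 110.0, "fuel_type": "gas"},
    {"vehicle_weight": 3500.0, "horsepower": 150.0, "fuel_type": "diesel"},
    {"vehicle_weight": 2600.0, "horsepower": 95.0, "fuel_type": "gas"},
]
mpg = [30.0, 22.0, 27.0, 18.0, 29.0]

# 60/20/20 split: carve off 20% for test, then 25% of the rest for validation
rec_full, rec_test, y_full, y_test = train_test_split(
    records, mpg, test_size=0.2, random_state=1)
rec_train, rec_val, y_train, y_val = train_test_split(
    rec_full, y_full, test_size=0.25, random_state=1)

# Fit the vectorizer on training data only, then transform every split;
# categorical fields are one-hot encoded, numeric fields pass through
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(rec_train)
X_val = dv.transform(rec_val)
X_test = dv.transform(rec_test)
```

Fitting the vectorizer on the training split only (and reusing it for validation and test) keeps the feature columns consistent across all three matrices.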

🌳 Step 2 — Decision Tree Regressor

I began with a Decision Tree Regressor with max_depth=1.
This simple tree helps visualize which feature the model uses first to split the data — effectively revealing the most influential variable in predicting MPG.

Result:
The feature used for splitting was model_year, showing that newer vehicles tend to have different fuel efficiencies compared to older models.
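This root-split check can be reproduced on synthetic data. The numbers below are hypothetical, chosen so that model_year dominates; the point is that with max_depth=1 the tree makes exactly one split, and tree_.feature[0] tells you which column it chose:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Synthetic stand-in: MPG driven mostly by model_year, weakly by weight
model_year = rng.integers(70, 85, size=200)
weight = rng.normal(3000, 400, size=200)
X = np.column_stack([weight, model_year])
y = 0.8 * model_year - 0.002 * weight + rng.normal(0, 1, size=200)

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X, y)

# Node 0 is the root; its feature index names the most influential variable
feature_names = ["vehicle_weight", "model_year"]
root = feature_names[dt.tree_.feature[0]]
print(root)
```

A depth-1 tree ("decision stump") picks the single split that reduces variance the most, so inspecting it is a quick sanity check on which feature carries the strongest signal.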

🌲 Step 3 — Random Forest Model

Next, I trained a Random Forest Regressor with the parameters:

n_estimators=10  
random_state=1  
n_jobs=-1

Random forests aggregate multiple decision trees to reduce overfitting and improve accuracy.

Validation RMSE: 4.5

This confirmed the model could capture relationships between engine specs and fuel efficiency quite effectively.
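A minimal sketch of this step, using synthetic regression data in place of the vectorized car features (the hyperparameters match the post; the data and resulting RMSE do not):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the vectorized car features
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Same hyperparameters as in the post
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE = square root of the mean squared error on the validation split
rmse = mean_squared_error(y_val, rf.predict(X_val)) ** 0.5
print(f"Validation RMSE: {rmse:.2f}")
```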

⚙ Step 4 — Tuning n_estimators

To see how the number of trees affects performance, I trained models with n_estimators ranging from 10 to 200 (step = 10).
Tracking validation RMSE, I observed that the improvement plateaued after around 80 estimators, indicating that adding more trees didn't significantly enhance accuracy.
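The sweep can be written as a simple loop (again over stand-in synthetic data). A warm_start forest could avoid refitting from scratch at each step, but the straightforward version is:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Train with 10, 20, ..., 200 trees and record validation RMSE for each
scores = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, rf.predict(X_val)) ** 0.5
    scores.append((n, rmse))

# Plotting scores (or scanning it) shows where the curve flattens out
best_n, best_rmse = min(scores, key=lambda t: t[1])
```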

🌾 Step 5 — Tuning max_depth

I then compared four values of max_depth — [10, 15, 20, 25] — each with increasing n_estimators from 10 to 200.
The best mean RMSE occurred at max_depth = 20, which struck the right balance between bias and variance.
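The grid over max_depth and n_estimators, with the mean validation RMSE per depth, looks roughly like this (synthetic stand-in data again, so the winning depth here need not be 20):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.5, size=400)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# For each depth, average the validation RMSE across all tree counts
mean_rmse = {}
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        rmses.append(mean_squared_error(y_val, rf.predict(X_val)) ** 0.5)
    mean_rmse[depth] = float(np.mean(rmses))

best_depth = min(mean_rmse, key=mean_rmse.get)
```

Averaging RMSE across tree counts gives each depth a single comparable score instead of cherry-picking its best run.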

🔍 Step 6 — Feature Importance

Random Forests provide an excellent built-in mechanism for feature importance.
Training the model with:

n_estimators=10, max_depth=20, random_state=1

I found the most influential feature for predicting fuel efficiency to be engine_displacement, followed by vehicle_weight and horsepower.

This aligns well with domain knowledge — larger engines and heavier vehicles typically consume more fuel.
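Reading the importances off a fitted forest is a one-liner via feature_importances_. The sketch below uses hypothetical coefficients constructed so that engine_displacement dominates, mirroring the ranking reported above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = ["engine_displacement", "vehicle_weight", "horsepower", "acceleration"]
X = rng.normal(size=(500, 4))
# Hypothetical effects: displacement strongest, then weight, then horsepower
y = -3.0 * X[:, 0] - 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(0, 0.5, size=500)

# Same hyperparameters as in the post
rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X, y)

# Pair each importance with its feature name and sort, highest first
ranked = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name:22s} {imp:.3f}")
```

With a DictVectorizer pipeline, the names would come from dv.get_feature_names_out() instead of a hand-written list.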

⚡ Step 7 — XGBoost Experiments

Finally, I trained an XGBoost regressor, comparing two values of the eta (learning rate) parameter: 0.3 and 0.1.

xgb_params = {
    'eta': 0.3,  # 0.1 in the second run
    'max_depth': 6,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1
}

After 100 training rounds, the model with eta = 0.1 delivered slightly better RMSE on the validation set — confirming that a smaller learning rate can yield smoother, more generalized models.

🎯 Key Takeaways

  • model_year strongly influences fuel efficiency in modern cars.
  • Random Forests with n_estimators ≈ 80 and max_depth=20 gave the most balanced performance.
  • Engine displacement emerged as the most important predictor of MPG.
  • XGBoost with a lower learning rate (eta=0.1) achieved the best validation score.

💡 Final Thoughts

This project demonstrates how iterative experimentation with tree-based models reveals both predictive strength and interpretability.
From simple decision trees to tuned XGBoost models, each step provided insight into how vehicle characteristics drive fuel efficiency — and how model parameters affect performance.

If you’re learning machine learning, projects like this are perfect for mastering feature engineering, evaluation metrics, and model tuning.
