Cross-Validation and Hyperparameter Tuning in scikit-learn

You've built a model. It performs beautifully on your training data. You deploy it, and... it tanks. Welcome to overfitting, one of machine learning's most humbling lessons. The problem isn't your model, it's how you evaluated it.
This is where cross-validation saves your career. And once you've got that locked down, hyperparameter tuning becomes the lever that unlocks your model's true potential. Let's dig into both, because getting these right is the difference between a model that looks good and one that actually works.
If you've been following this series, you already know how to build models and evaluate them using metrics like accuracy, precision, recall, and AUC. But knowing your metric isn't enough, knowing whether your metric is trustworthy is the real game. Every production ML failure story I've seen has one common thread: the evaluation setup was flawed. The model scored high during development, the team shipped it with confidence, and reality delivered a brutal correction. In this article we're going to make sure that doesn't happen to you. We'll cover cross-validation strategies that give you honest performance estimates, hyperparameter search methods that range from brute force to intelligent optimization, and a practical end-to-end workflow you can drop into any project. Whether you're tuning a simple SVM or a complex ensemble, these techniques form the foundation of rigorous, production-ready machine learning.
Table of Contents
- Why Cross-Validation Matters
- Why Single Train-Test Splits Lie to You
- K-Fold Strategies
- Stratified K-Fold: When Classes Are Imbalanced
- GroupKFold: When Samples Aren't Independent
- Grid vs Random vs Bayesian Search
- GridSearchCV: Exhaustive Hyperparameter Search
- RandomizedSearchCV: When Your Grid Is Too Large
- Halving Search: Iterative Elimination
- Common Tuning Mistakes
- The Nested CV Trap: Avoiding Optimism Bias
- Data Leakage: Why Preprocessing Matters
- Bayesian Optimization with Optuna
- Practical Workflow: Putting It All Together
- Key Takeaways
Why Cross-Validation Matters
Before we touch a single line of code, we need to talk about why cross-validation exists and why you should care deeply about it. The fundamental problem in machine learning is generalization: we train on one set of data and hope our model works on data it has never seen. A single train-test split is a fragile bet on that hope.
Consider what happens when your test set is unrepresentative. If your 20% test set happened to capture easier examples by chance, your accuracy looks great, but only because you got lucky with the random split, not because your model is genuinely good. You have one data point about your model's behavior, and one data point is never enough to draw a confident conclusion. Cross-validation transforms that one data point into k data points by systematically rotating which portion of the data acts as the test set. The result is a distribution of performance scores rather than a single number, and that distribution tells you something incredibly valuable: how stable your model is. A model that scores 0.91, 0.92, 0.90, 0.93, 0.91 across five folds is telling you something very different from one that scores 0.75, 0.95, 0.88, 0.62, 0.97. The mean might be similar, but the second model is a liability. Cross-validation exposes that variance before you ship, when you still have time to do something about it.
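To make that concrete, here is a quick numerical sketch (the fold scores are made up for illustration, not from any real model) showing how the standard deviation separates a stable model from an erratic one with a similar mean:

```python
import numpy as np

# Hypothetical fold scores for two models with similar-looking means
stable = np.array([0.91, 0.92, 0.90, 0.93, 0.91])
erratic = np.array([0.75, 0.95, 0.88, 0.62, 0.97])

print(f"Stable:  mean={stable.mean():.3f}, std={stable.std():.3f}")
print(f"Erratic: mean={erratic.mean():.3f}, std={erratic.std():.3f}")
```

The erratic model's standard deviation is an order of magnitude larger, a red flag that no single train-test split would ever reveal.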
Why Single Train-Test Splits Lie to You
Picture this: you split your dataset 80/20, train on 80%, and evaluate on 20%. Your accuracy looks fantastic, 95%! But here's the uncomfortable truth: if your data is imbalanced, randomly distributed, or has hidden patterns, that 20% test set might be a statistical anomaly. You could've just gotten lucky.
A single train-test split gives you one data point about your model's performance. One. That's not enough information to make real claims. What if you split differently? What if the model performs wildly differently on different slices of data?
K-fold cross-validation fixes this. Instead of one split, you create k non-overlapping subsets (folds) of your data. You train k models, each time using k-1 folds for training and 1 fold for testing. This gives you k performance scores, not just one.
The following example is the simplest possible entry point into cross-validation. We load a well-known dataset, define a model, and let scikit-learn handle the entire fold-training-evaluation loop for us with a single function call. Notice how little code it takes to go from "one number" to "a real picture of performance."
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")

Output might look like:

Fold scores: [0.96 0.97 0.95 0.93 0.94]
Mean: 0.950, Std: 0.015
Now you have a real picture: your model consistently performs around 95%, with a standard deviation of 1.5%. That standard deviation is critical, it tells you whether your model is stable or lucky. When comparing two models with similar means, always favor the one with lower standard deviation. Consistency in production is worth more than a single high score in evaluation.
Why does this matter? Because that standard deviation represents robustness. A score of 0.95 ± 0.15 is way less trustworthy than 0.95 ± 0.01. One model is all over the place; the other is rock solid.
K-Fold Strategies
The basic k-fold gives you a starting point, but scikit-learn offers a family of cross-validation strategies tailored to different data situations. Choosing the right one is not about preference, it is about correctness. Using the wrong strategy can give you inflated or unstable scores just as surely as a bad train-test split.
The most common choice is 5-fold or 10-fold cross-validation. Five folds is the standard for most everyday tasks because it balances computational cost against estimate quality. Ten folds gives you a lower-variance estimate but costs twice as much compute. For very small datasets, a few hundred samples or fewer, you can go all the way to leave-one-out cross-validation, where each sample gets its own fold. Leave-one-out is nearly unbiased but has high variance and is computationally expensive for large datasets, so use it sparingly. For large datasets with tens of thousands of samples, three folds is often sufficient because each fold still gives you a massive and representative test set. The general rule: more folds for smaller datasets, fewer folds for larger ones, and always stratify when your classes are imbalanced.
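As a sketch of the small-data end of that spectrum (using LogisticRegression as a stand-in model, not one from the examples above), leave-one-out is just another value for the cv argument:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: one fold per sample, so 150 single-sample test sets
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOO: {loo_scores.mean():.3f} over {len(loo_scores)} folds")

# The same call with cv=5 gives the everyday 5-fold estimate
cv5_scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold: {cv5_scores.mean():.3f} over {len(cv5_scores)} folds")
```

On a dataset this small the two estimates usually agree closely, but the leave-one-out run costs 150 model fits instead of 5, which is exactly why it stops being practical as datasets grow.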
Stratified K-Fold: When Classes Are Imbalanced
Here's a common nightmare: you have a binary classification problem where 95% of samples are class 0 and 5% are class 1. A random k-fold split might put most of your rare class into one fold, leaving others with almost none. Your model trains on imbalanced data, and your folds test on wildly different distributions. Your CV scores bounce all over.
Stratified K-Fold fixes this by ensuring each fold has approximately the same percentage of samples for each target class. This is especially important when you are evaluating with metrics like F1, precision, or recall that are sensitive to class balance.
import numpy as np
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    _, train_counts = np.unique(y[train_idx], return_counts=True)
    _, test_counts = np.unique(y[test_idx], return_counts=True)
    print(f"Train fold classes: {train_counts}")
    print(f"Test fold classes: {test_counts}")
    break  # Just showing the first fold

Each fold maintains the original class distribution, which is especially crucial for imbalanced datasets where a "lucky" random split could make your scores meaningless. The shuffle=True argument is important here, without it, folds are created from contiguous blocks of your data, which can introduce ordering bias if your samples are sorted by class. Always shuffle unless you have a specific reason not to.
GroupKFold: When Samples Aren't Independent
Imagine you're predicting patient health outcomes, and you have multiple measurements per patient. A standard k-fold split might put patient A's measurements in both training and test sets. Now your model has seen that patient during training, and you're testing on the same patient. Your evaluation is contaminated.
GroupKFold ensures that all samples from the same group stay together in either train or test. This is the correct choice whenever your data has a natural clustering structure, patients, users, geographic regions, experimental runs, where leakage across groups would make your evaluation optimistic and misleading.
import numpy as np
from sklearn.model_selection import GroupKFold
# groups array: assign each sample to one of 4 groups (must match dataset size)
groups = np.array([i % 4 for i in range(len(X))]) # 4 groups across 150 samples
gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    print(f"Train groups: {set(groups[i] for i in train_idx)}")
    print(f"Test groups: {set(groups[i] for i in test_idx)}")

Output shows groups are cleanly separated:
Train groups: {0, 1, 2}
Test groups: {3}
Train groups: {0, 1, 3}
Test groups: {2}
...
Notice that no group number appears in both train and test within the same fold, that clean separation is exactly what prevents contamination. If you skip this step when your data has group structure, you will systematically overestimate your model's performance because it will have already seen related examples during training.
This is essential for medical imaging studies, multi-session user logs, or any scenario where samples are grouped by patient, location, time period, or experimental unit. (For strictly time-ordered data, where the goal is to predict the future from the past, prefer a time-aware splitter over random group assignment.)
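For time-ordered data specifically, scikit-learn provides TimeSeriesSplit, which guarantees that every training sample precedes every test sample. A quick sketch with a toy time-indexed array:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered samples (toy data)
X_ts = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

for train_idx, test_idx in tscv.split(X_ts):
    # Every training index comes strictly before every test index
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each successive fold trains on a longer prefix of the series and tests on the block that follows it, mimicking how the model would actually be used: trained on the past, evaluated on the future.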
Grid vs Random vs Bayesian Search
Once you have a solid cross-validation strategy, the next question is how to actually find good hyperparameters. There are three main approaches, and they differ dramatically in how they explore the search space. Understanding the tradeoffs helps you pick the right tool for your situation rather than defaulting to whatever you saw in a tutorial.
Grid search is exhaustive and systematic. You define a fixed set of values for each hyperparameter, and it tries every combination. This guarantees you will find the best combination within your defined grid, which sounds great, but the guarantee only holds for the values you specified. If the optimal value for a parameter lies between your grid points, grid search misses it. Grid search also scales exponentially: adding one more parameter or one more value to your grid multiplies the total number of combinations. It works well for small grids and gives you complete coverage of the space you define. Random search samples combinations randomly from distributions you specify rather than from a fixed grid. Research by Bergstra and Bengio showed that random search often finds better hyperparameters than grid search with the same number of trials, because when some hyperparameters matter more than others, random search allocates more effective coverage to the important ones. Bayesian optimization goes a step further, it builds a probabilistic model of the objective function and uses that model to decide where to sample next, actively focusing on regions of the space that are likely to yield improvements. For expensive-to-evaluate models or large search spaces, Bayesian methods are the most efficient choice.
GridSearchCV: Exhaustive Hyperparameter Search
You've got your cross-validation strategy locked down. Now comes the real problem: your model has hyperparameters. Random Forest has n_estimators, max_depth, min_samples_split. SVM has C and gamma. Neural networks have learning rates, dropout, layer sizes.
How do you find the best combination? Try them all? That's essentially what GridSearchCV does, it exhaustively searches a specified parameter grid using cross-validation to evaluate each combination. The key advantage is that every evaluation is fully reproducible and the search is guaranteed to find the best combination within the grid you define.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.001, 0.01],
    'kernel': ['linear', 'rbf']
}
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1  # Use all CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
# Evaluate on test set
test_score = grid_search.score(X_test, y_test)
print(f"Test score: {test_score:.3f}")

This tries 4 × 4 × 2 = 32 parameter combinations, each evaluated with 5-fold CV. That's 160 model trainings. It's computationally expensive, but you get the best parameters for your problem. The n_jobs=-1 argument is worth noting, it tells scikit-learn to use every available CPU core in parallel, which can dramatically speed up the search on multi-core machines.
The cv_results_ attribute gives you the full breakdown:
import pandas as pd
results_df = pd.DataFrame(grid_search.cv_results_)
print(results_df[['param_C', 'param_gamma', 'mean_test_score', 'std_test_score']].head(10))

You can see exactly how each combination performed and spot patterns, maybe higher C values are consistently better, or maybe a specific gamma value is crucial. This diagnostic view is genuinely useful: if the best score is clustered at the edge of your grid (e.g., the highest C value you tried is always best), that's a signal to expand your grid in that direction and re-run.
RandomizedSearchCV: When Your Grid Is Too Large
GridSearchCV is exhaustive, which is great when your parameter space is small. But what if you have 50 possible values for 5 different hyperparameters? That's 312.5 million combinations. You'll be waiting forever.
RandomizedSearchCV randomly samples combinations from your parameter space. You specify how many combinations you want to try, and it searches intelligently without checking every possibility. The crucial advantage over grid search is that you can specify continuous distributions for continuous parameters, which means you are not artificially discretizing the search space.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
param_dist = {
    'C': uniform(0.1, 100),         # Continuous: 0.1 to 100.1 (loc, scale)
    'gamma': ['scale', 'auto'],     # Discrete: fixed choices
    'degree': randint(2, 10),       # Integer: 2 to 9 (upper bound exclusive)
    'kernel': ['linear', 'rbf', 'poly']
}
random_search = RandomizedSearchCV(
    SVC(),
    param_dist,
    n_iter=20,  # Try 20 random combinations instead of all
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

The beauty here? You sample intelligently. uniform(0.1, 100) for C lets you explore a continuous range. randint(2, 10) explores integers (note that scipy's upper bound is exclusive). You're not locked into a predefined grid, you're exploring more flexibly. The random_state=42 ensures your search is reproducible, which matters when you need to report results or debug unexpected behavior.
For large hyperparameter spaces, RandomizedSearchCV is often more efficient than grid search and frequently finds comparable or better results.
Halving Search: Iterative Elimination
Let's say you're searching a massive space and you don't want to waste compute on bad parameter combinations. Halving search (available via HalvingGridSearchCV and HalvingRandomizedSearchCV) uses an iterative elimination strategy:
- Start with all parameter combinations, but evaluate each with minimal resources (by default, a small subset of the training samples)
- Keep only the best-performing fraction of combinations
- Increase resources for the survivors (more training samples per candidate)
- Repeat until the best combination remains
This is like a tournament bracket, you quickly eliminate bad options and invest more resources in promising ones.
# Halving search is still experimental and must be explicitly enabled
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(
    RandomForestClassifier(),
    param_grid={
        'n_estimators': [50, 100, 200, 500],
        'max_depth': [5, 10, 20, None],
        'min_samples_split': [2, 5, 10]
    },
    cv=3,
    factor=3,  # Keep only the top 1/3 of candidates each round
    n_jobs=-1
)
halving_search.fit(X_train, y_train)
print(f"Best parameters: {halving_search.best_params_}")

The factor=3 parameter controls how aggressively we eliminate candidates: only the top 1/factor survive each round. A higher factor eliminates more candidates per round, finishing faster but with more risk of discarding a good combination early. A lower factor is more conservative. For most tasks, the default factor of 3 strikes a good balance between speed and thoroughness.
This is dramatically more efficient than grid search for large spaces, especially when you have significant computational constraints.
Common Tuning Mistakes
Hyperparameter tuning looks straightforward on paper, but there are several mistakes that consistently trip up practitioners at every experience level. Knowing them in advance saves you from discovering them the hard way in production.
The first and most common mistake is tuning on the test set. This sounds obvious, but it happens subtly all the time: you run a search, look at the test score, decide to try a larger grid or a different model, run again, and repeat. Every time you use your test score to make a decision, even informally, you are leaking information from the test set into your model selection process. By the time you finalize your model, the "test" score is no longer unbiased. The fix is to keep your test set in a locked box until you have one final model and you evaluate it exactly once. Use cross-validation scores for all intermediate decisions.
The second mistake is tuning too many hyperparameters at once with too few trials. If you have seven hyperparameters and run 20 iterations of random search, you are sampling a seven-dimensional space very sparsely. Start by identifying the two or three hyperparameters that matter most for your model, for Random Forest that is usually n_estimators, max_depth, and min_samples_leaf, and tune those first with enough iterations to actually cover the space. Once you have a good baseline, you can do a second round of fine-tuning with a narrower range.
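A minimal sketch of that two-round approach (the parameter ranges here are illustrative, not recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Round 1: coarse search over the few parameters that matter most
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200),
     'max_depth': randint(2, 20),
     'min_samples_leaf': randint(1, 10)},
    n_iter=10, cv=3, random_state=42, n_jobs=-1
)
coarse.fit(X, y)
depth = coarse.best_params_['max_depth']

# Round 2: fine-tune in a narrow band around the round-1 winner
fine = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200),
     'max_depth': randint(max(2, depth - 3), depth + 4),
     'min_samples_leaf': randint(1, 10)},
    n_iter=10, cv=3, random_state=42, n_jobs=-1
)
fine.fit(X, y)
print(f"Refined parameters: {fine.best_params_}")
```

Twenty total trials spent in two focused rounds typically cover the space far better than twenty trials scattered across a wide seven-dimensional grid.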
The third mistake is ignoring scale when building parameter grids. For parameters that span orders of magnitude, learning rates, regularization coefficients, kernel bandwidth, you should sample on a log scale, not a linear scale. Sampling C uniformly from 1 to 100 means half of your samples land between 50 and 100, which gives you almost no coverage of the interesting region near 1. Use scipy.stats.loguniform (e.g., loguniform(0.001, 100)) for random search, or provide log-spaced grid points like [0.001, 0.01, 0.1, 1, 10, 100]. The fourth mistake is skipping the pipeline and preprocessing outside the CV loop, which introduces data leakage, we cover this in detail in the next section.
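To see the scale problem numerically, compare how many draws each sampling strategy puts in the low end of the range (a quick sketch using scipy distributions):

```python
from scipy.stats import loguniform, uniform

# Linear sampling over [1, 100]: barely any coverage below 10
linear = uniform(1, 99).rvs(size=10_000, random_state=0)
print(f"Linear draws below 10: {(linear < 10).mean():.1%}")

# Log-scale sampling over [0.001, 100]: each decade gets equal coverage
log = loguniform(0.001, 100).rvs(size=10_000, random_state=0)
print(f"Log draws below 0.1: {(log < 0.1).mean():.1%}")
```

Roughly 9% of linear draws fall below 10, while the log-scale sampler dedicates about 40% of its draws to values below 0.1, because two of its five decades lie there.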
The Nested CV Trap: Avoiding Optimism Bias
Here's where things get subtle, and where most practitioners stumble.
Let's say you use GridSearchCV with 5-fold CV. You find the best hyperparameters. Then you evaluate on a held-out test set. Sounds solid, right?
Wrong. Here's the problem: GridSearchCV looks at the CV scores to select hyperparameters. It indirectly overfits to the training set by choosing parameters that perform best on that data. When you then evaluate on the test set, your results are biased, they look better than they actually are.
This is optimism bias, and it's subtle because your test set is truly held out, yet your results are still inflated.
Nested cross-validation fixes this. You nest one cross-validation loop inside another:
- Outer loop: Splits data for final evaluation (5 folds)
- Inner loop: Hyperparameter tuning within each outer fold (5 folds)
For each outer fold:
- Use 4/5 of data for hyperparameter tuning (inner CV)
- Use 1/5 of data for unbiased evaluation
The code below implements this correctly. Notice that the inner GridSearchCV runs entirely within each outer fold, it never sees the outer test fold, which is exactly what eliminates the optimism bias.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto']
}

# Outer CV
outer_cv_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_train_outer, X_test_outer = X[train_idx], X[test_idx]
    y_train_outer, y_test_outer = y[train_idx], y[test_idx]

    # Inner CV (hyperparameter tuning)
    grid_search = GridSearchCV(
        SVC(),
        param_grid,
        cv=5,  # Inner folds
        scoring='accuracy'
    )
    grid_search.fit(X_train_outer, y_train_outer)

    # Evaluate on outer test fold (unbiased)
    score = grid_search.score(X_test_outer, y_test_outer)
    outer_cv_scores.append(score)
print(f"Nested CV scores: {outer_cv_scores}")
print(f"Mean: {np.mean(outer_cv_scores):.3f}, Std: {np.std(outer_cv_scores):.3f}")This gives you an unbiased estimate of your model's generalization performance. Different hyperparameters get selected in each outer fold (which is fine, you're showing robustness), and you get honest estimates of how your model performs. In practice, the performance gap between nested and non-nested CV is often small, a few percentage points, but that gap represents the difference between reporting an honest number and reporting a flattering one.
Simple CV vs. Nested CV:
- Simple: GridSearchCV on full dataset, evaluate on test set → Optimistically biased
- Nested: GridSearchCV within each CV fold, aggregate outer CV scores → Unbiased
The performance difference is often small, but the statistical validity difference is huge.
Data Leakage: Why Preprocessing Matters
Here's a subtle gotcha that trips up even experienced practitioners: if you apply preprocessing (scaling, encoding, feature selection) before cross-validation, you leak information from your test folds into training.
Example of wrong approach:
from sklearn.preprocessing import StandardScaler
# WRONG: Scale all data first, then CV
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Using statistics from entire dataset!
cross_val_score(model, X_scaled, y, cv=5)

The problem: the scaler computed mean and std from the entire dataset, including data that ends up in test folds. Your test set influenced the training transformation. This inflates your CV scores. In a real project this can easily give you 1-5% inflated accuracy on a classification task, and much larger distortions on regression tasks where features have very different scales.
Right approach: Use a Pipeline to ensure preprocessing happens within each fold:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])

# Pipeline automatically applies scaler within each CV fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

Now each fold's training data is scaled independently, with no information leakage. Your CV scores are honest. The pipeline approach has a secondary benefit: it makes your preprocessing-model combo a single object that you can pass to GridSearchCV, RandomizedSearchCV, or any other scikit-learn tool, keeping your code clean and your evaluation correct.
This applies to all preprocessing: scaling, encoding, PCA, feature selection, everything should be inside the pipeline or inside your CV loop.
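As a sketch, here is a pipeline where both scaling and univariate feature selection live inside the CV loop, so neither ever sees a fold's test data while fitting (SelectKBest with k=2 is an arbitrary illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling AND feature selection are both re-fit on each fold's
# training portion only, so no test-fold statistics leak in
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=2)),
    ('model', RandomForestClassifier(random_state=42))
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```

Feature selection is a classic leakage source: selecting features on the full dataset lets the test folds vote on which features survive, which is precisely what the pipeline prevents.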
Bayesian Optimization with Optuna
GridSearch and RandomSearch are powerful, but they're somewhat naive. They don't learn from previous trials to inform where to search next. Bayesian optimization does.
Optuna uses Bayesian optimization to intelligently sample the hyperparameter space, focusing resources on promising regions. Under the hood, it builds a probabilistic surrogate model, typically a Tree-structured Parzen Estimator (TPE), that predicts which hyperparameter configurations are likely to give good results based on what it has already evaluated. This is fundamentally smarter than random search, which treats every trial as independent.
import optuna
from optuna.samplers import TPESampler
def objective(trial):
    # Define hyperparameters to tune
    C = trial.suggest_float('C', 0.001, 100, log=True)
    gamma = trial.suggest_categorical('gamma', ['scale', 'auto'])
    model = SVC(C=C, gamma=gamma)
    # Evaluate with cross-validation
    scores = cross_val_score(model, X_train, y_train, cv=5)
    return scores.mean()

# Create and run study
sampler = TPESampler(seed=42)
study = optuna.create_study(sampler=sampler, direction='maximize')
study.optimize(objective, n_trials=30)
print(f"Best hyperparameters: {study.best_params}")
print(f"Best CV score: {study.best_value:.3f}")

Optuna tried 30 combinations, but it smartly focused on promising regions. It likely found better parameters with fewer trials than random search. The log=True argument in suggest_float tells Optuna to sample C on a log scale, which is exactly what we want for a regularization parameter that spans several orders of magnitude. Optuna also gives you visualization tools to inspect your search, trial history plots, parameter importance charts, and contour plots of the parameter landscape, which make it much easier to understand your model's sensitivity to different hyperparameters.
Note also that suggest_categorical handles discrete choices, so you can freely mix continuous, integer, and categorical parameters in a single search space.
Practical Workflow: Putting It All Together
Here's how to actually combine these concepts into a complete hyperparameter tuning workflow:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# 1. Split data: training + final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Define pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])

# 3. Define hyperparameter grid
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [5, 10, None],
    'model__min_samples_split': [2, 5, 10]
}

# 4. Hyperparameter tuning with CV
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
# 5. Evaluate on held-out test set
test_score = grid_search.score(X_test, y_test)
print(f"Test set score: {test_score:.3f}")
# 6. (Optional) Nested CV for unbiased estimate
from sklearn.model_selection import cross_val_score
nested_scores = cross_val_score(
    grid_search,
    X_train, y_train,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1_weighted'
)
print(f"Nested CV: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")

This workflow hits every best practice we have covered. The pipeline prevents leakage by keeping preprocessing inside the CV loop. The stratified k-fold ensures each fold is representative. The model__ prefix in the parameter grid tells GridSearchCV to look inside the pipeline for the RandomForestClassifier parameters. The final evaluation happens on a held-out test set that was never touched during training or tuning.
This workflow:
- Holds out a true test set for final evaluation
- Uses stratified k-fold to maintain class distributions
- Puts preprocessing in a pipeline to prevent leakage
- Tunes hyperparameters with grid search and CV
- Optionally validates with nested CV for unbiased results
- Evaluates honestly on the held-out test set
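One step the workflow leaves implicit: after the search finishes, best_estimator_ is already refit on the full training set (refit=True is the default), so you can persist it directly. A minimal self-contained sketch using joblib (the file name is arbitrary):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])
search = GridSearchCV(pipeline, {'model__max_depth': [3, None]}, cv=3)
search.fit(X, y)

# best_estimator_ is the tuned pipeline, already refit on all of X, y
joblib.dump(search.best_estimator_, 'model.joblib')
restored = joblib.load('model.joblib')
print(restored.predict(X[:3]))
```

Persisting the whole pipeline, scaler included, means the exact preprocessing travels with the model, so serving code cannot accidentally feed it unscaled features.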
Key Takeaways
- K-fold CV beats single splits because it gives you multiple performance estimates, revealing stability and robustness
- Stratified K-fold ensures class distributions are maintained across folds, essential for imbalanced data
- GroupKFold keeps related samples together, preventing contamination in grouped data
- GridSearchCV exhaustively searches hyperparameter space; RandomizedSearchCV samples intelligently when the space is huge
- Halving search eliminates poor options iteratively, saving compute while exploring large spaces
- Nested CV prevents optimism bias by separating hyperparameter tuning from final evaluation
- Pipelines prevent leakage by applying preprocessing within CV folds, not before
- Bayesian optimization (Optuna) intelligently samples, often finding better parameters faster
The difference between a model that looks good and one that actually generalizes usually comes down to rigor in cross-validation and hyperparameter tuning. Get these right, and your models will perform as well in production as they did in development. The patterns in this article, pipeline-wrapped preprocessing, stratified folds, nested evaluation, and intelligent search strategies, are not optional extras for advanced practitioners. They are the baseline for anyone who takes model reliability seriously. Apply them consistently, and you will build a well-earned reputation for models that actually work.