November 25, 2025
Python · Machine Learning · Classification · Scikit-Learn

Classification with Decision Trees, Random Forests, and SVM

Here's something that trips up a lot of people when they first get into machine learning: they assume logistic regression is the end of the story for classification. You fit a line (or a hyperplane), you threshold the output, done. But then you throw a real dataset at it, something messy, something with features that interact in complicated ways, and suddenly that nice linear boundary just doesn't cut it. The data doesn't cooperate. It clusters in weird blobs, it spirals, it folds back on itself. And your logistic regression sits there looking confused.

That's the moment you realize you need more tools. Not because logistic regression is bad, it's genuinely great for a lot of problems, but because no single algorithm wins every game. Different classifiers make fundamentally different assumptions about how the data is structured, and understanding those assumptions is the difference between a practitioner who reaches blindly for the same tool every time and one who actually knows what they're doing.

In this article we're going deep on three of the most powerful and widely used classifiers in the scikit-learn toolkit: decision trees, random forests, and support vector machines. We're going to cover not just how to call the API, but why each one works the way it does, what makes them fail, and how to know which one belongs in your pipeline. We'll look at the math just enough to build real intuition, no PhD required, and we'll back everything up with working code you can run right now. By the time we're done, you'll have a genuine mental model for how each of these algorithms thinks about data, and you'll know exactly when to reach for each one.

Let's start from scratch and build this up properly.

Table of Contents
  1. Decision Trees: The Interpretable Workhorse
  2. How Trees Grow: Splitting on Information
  3. How Decision Trees Think
  4. Key Hyperparameters to Control
  5. Random Forests: Wisdom of the Crowd
  6. Why Ensemble Methods Work
  7. Why Random Forests Fix Overfitting
  8. Feature Importance: Which Features Matter?
  9. Controlling the Forest
  10. Support Vector Machines: The Maximum Margin Classifier
  11. The Geometry
  12. SVM Intuition: Finding the Margin
  13. SVM in Action
  14. SVM Hyperparameters
  15. Visualizing Decision Boundaries
  16. Comparing on the Same Dataset
  17. Choosing Between These Models
  18. When to Use Each
  19. Key Takeaways

Decision Trees: The Interpretable Workhorse

A decision tree is exactly what it sounds like: a flowchart that learns to classify data by asking yes/no questions about features.

Start simple. Imagine you're trying to decide whether to approve a loan application. A human expert might say: "If income is below $30,000, deny. If income is above $30,000 and debt-to-income ratio is above 40%, deny. Otherwise, approve." That's a decision tree. It's a sequence of if/then rules, organized hierarchically, that routes each data point to a prediction. A machine-learned decision tree does the same thing, except instead of a human writing the rules, the algorithm figures out the best rules from the data automatically.

How Trees Grow: Splitting on Information

When a decision tree learns, it recursively splits the data. At each split, the algorithm chooses a feature and a threshold to maximize information gain, roughly, how much "purer" the resulting groups become.

Two common purity metrics:

Gini Impurity (what scikit-learn uses by default):

Gini = 1 - sum(p_i^2)

where p_i is the proportion of class i in a node.

Information Gain (entropy-based):

Information Gain = Entropy(parent) - sum(weight * Entropy(child))

Lower Gini or higher information gain = better split. The tree keeps splitting until it hits a stopping condition (like max depth or minimum samples required to split).
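To make the Gini formula concrete, here's a small sketch in plain Python computing the impurity of a parent node and the weighted impurity after a candidate split; the class counts are made up for illustration:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A parent node with 6 samples of class 0 and 4 of class 1
parent = [0] * 6 + [1] * 4
# A candidate split produces two children
left = [0] * 5 + [1] * 1
right = [0] * 1 + [1] * 3

print(f"Parent Gini: {gini(parent):.3f}")  # 1 - (0.6^2 + 0.4^2) = 0.480
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(f"Weighted child Gini: {weighted:.3f}")  # lower than the parent, so the split helps
```

The tree greedily picks whichever feature/threshold pair gives the biggest drop from parent impurity to weighted child impurity.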

Here's a basic tree in action:

python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
 
# Generate sample data
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                            n_redundant=0, random_state=42)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
 
# Train a decision tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
 
# Predict and evaluate
y_pred = dt.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

Run this and you'll see accuracy numbers in the ballpark of 0.85 to 0.92 depending on how the data splits out. Not bad for a few lines of code. But here's the thing you need to understand before you go further: those numbers on the training set will almost certainly look better, sometimes dramatically better, than on the test set. That's overfitting, and it's the central problem with decision trees that we're going to spend a lot of time on in this article.

How Decision Trees Think

Before we jump into hyperparameters and tuning, let's actually internalize the mental model here. Because when you understand how a decision tree carves up the feature space, a lot of other things, including why random forests work, fall into place naturally.

A decision tree draws axis-aligned lines. That's it. Every split in the tree corresponds to a horizontal or vertical cut through the feature space. When you have two features, the tree is asking questions like "Is Feature A greater than 3.7?" and "Is Feature B less than -1.2?" and routing data points down different branches based on the answers. The result, when you visualize the decision boundary, looks like a grid, rectangular regions, each labeled with a class.

This is both a strength and a weakness. The strength is interpretability: you can literally trace the path from root to leaf for any prediction and explain exactly why the model made the decision it did. The weakness is that real-world data rarely lives in rectangular blobs. If your classes are separated by a diagonal line, the tree has to approximate that diagonal with a staircase of horizontal and vertical cuts, which requires more splits, more complexity, and more overfitting.

The deeper the tree grows, the more precisely it fits the training data, and the more it starts memorizing noise rather than learning signal. A tree with no depth limit will eventually give each individual training sample its own leaf node, achieving 100% training accuracy while completely failing to generalize. That's why controlling tree depth is the single most important hyperparameter decision you'll make with this algorithm. You're trading off between fitting the training data well and staying simple enough to generalize.

Here's an intuition pump: think of a shallow tree as a set of coarse rules that capture the major patterns, and a deep tree as a set of hyper-specific rules that capture every quirk of the training set. For most real problems, the coarse rules generalize better. You're looking for the signal, not memorizing the noise.
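To put numbers on the coarse-versus-specific trade-off, here's a small sketch comparing an unbounded tree with a depth-limited one; the dataset settings (including the 10% label noise from flip_y) are just for illustration:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise so overfitting has something to chase
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)       # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print(f"Unbounded:   train={deep.score(X_tr, y_tr):.3f}, test={deep.score(X_te, y_te):.3f}")
print(f"max_depth=3: train={shallow.score(X_tr, y_tr):.3f}, test={shallow.score(X_te, y_te):.3f}")
```

The unbounded tree memorizes the training set (including the noisy labels), while the shallow tree gives up some training accuracy in exchange for rules that transfer.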

Key Hyperparameters to Control

Trees can grow unbounded, and they will, if you let them. That's overfitting. Here's how to rein them in:

Parameter           What It Does                      Example
max_depth           Limits tree height                5, 10, 15
min_samples_split   Minimum samples to split a node   2, 5, 10
min_samples_leaf    Minimum samples in a leaf node    1, 5, 10
max_features        Features considered per split     'sqrt', 'log2', None

Pro tip: Start with max_depth=5 and min_samples_split=5. If your tree is still overfitting, tighten these. If it's underfitting, loosen them.

python
# Controlled tree
dt_controlled = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
dt_controlled.fit(X_train, y_train)
print(f"Controlled Accuracy: {dt_controlled.score(X_test, y_test):.4f}")

This constrained version will almost always outperform an unconstrained tree on test data, even though it fits the training data less precisely. That's the regularization effect at work: by forcing the model to stay simple, you're preventing it from chasing noise.

Random Forests: Wisdom of the Crowd

A single tree is interpretable but prone to overfitting. A forest of trees fixes that.

Random Forest uses bootstrap aggregation (bagging): it trains many trees, each on a random sample of the data, then averages their predictions. Additionally, each split only considers a random subset of features; this de-correlates the trees, making the ensemble more robust.

Why Ensemble Methods Work

Imagine you ask 100 friends for advice, each having seen slightly different information. Averaging their opinions (majority vote for classification) often beats any individual expert. Trees are the same: trained on different samples of the data, they make different errors, so averaging them reduces variance.
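The arithmetic behind this is worth seeing once. If each voter is independently right 60% of the time, a majority of 101 voters is right far more often, per the binomial distribution (real trees are correlated, so the actual gain is smaller, but the direction holds):

```python
from math import comb

def majority_correct(n_voters, p):
    """P(majority vote is correct) for n independent voters, each correct with prob p."""
    return sum(comb(n_voters, k) * p**k * (1 - p)**(n_voters - k)
               for k in range(n_voters // 2 + 1, n_voters + 1))

print(f"1 voter:    {majority_correct(1, 0.6):.3f}")    # 0.600
print(f"101 voters: {majority_correct(101, 0.6):.3f}")  # well above 0.9
```

An odd number of voters avoids ties. The independence assumption is the key caveat, and it's exactly why random forests go to such lengths to de-correlate their trees.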

python
from sklearn.ensemble import RandomForestClassifier
 
# Train a random forest
rf = RandomForestClassifier(n_estimators=100, max_depth=10,
                             min_samples_split=5, random_state=42)
rf.fit(X_train, y_train)
 
y_pred_rf = rf.predict(X_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")

That's the basic call, 100 trees, max depth 10, and you'll typically see a meaningful accuracy jump over a single tree. The forest handles the variance problem that plagued the single tree. But why, exactly? Let's dig into that.

Why Random Forests Fix Overfitting

The overfitting problem in decision trees comes from variance: small changes in the training data lead to very different trees. One tree might learn "split on Feature A at 2.3 first," while another might learn "split on Feature B at -0.7 first," and both might be valid given slightly different subsets of the training data. Neither tree is reliably right; they're each capturing a different slice of the truth.

Random forests solve this with two key techniques that work together. The first is bootstrap sampling: each tree in the forest is trained on a slightly different random sample of the training data (drawn with replacement). This means each tree sees a different view of the dataset, learns slightly different patterns, and makes slightly different errors. The second technique is feature randomization: at each split in each tree, only a random subset of features is considered as candidates. This prevents all the trees from making the same first split on the most dominant feature and forces them to explore different aspects of the data.

Now here's the magic: when you average across 100 or 200 trees that each have different errors, those errors tend to cancel out. The signal, the true underlying pattern, is consistent across trees and survives the averaging. The noise, the idiosyncratic patterns each tree memorized, varies across trees and gets washed out. This is variance reduction in action, and it's why random forests so consistently outperform single decision trees on real-world data.
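Bootstrap sampling also gives you a free generalization estimate: each tree skips roughly a third of the training samples, and scoring every sample using only the trees that never saw it yields the out-of-bag (OOB) score. A quick sketch, with illustrative parameters:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# oob_score=True asks the forest to evaluate each sample with the trees
# that never saw it during bootstrap sampling
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy estimate: {rf.oob_score_:.3f}")
```

The OOB score usually tracks held-out test accuracy closely, which makes it a handy sanity check when data is too scarce for a separate validation set.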

There is a cost, though. You lose the interpretability of a single tree. You can't trace a simple path through the forest the way you can through one tree. What you get instead is feature importance scores, which give you a different kind of insight: instead of "here's exactly how the model made this decision," you get "here are the features that mattered most across all decisions." For many practical applications, that's actually more useful.

Feature Importance: Which Features Matter?

One huge advantage of trees: they tell you which features matter. After training, you can check:

python
import pandas as pd
 
# Get feature importances
importances = rf.feature_importances_
feature_names = ['Feature 0', 'Feature 1']
 
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values('Importance', ascending=False)
 
print(importance_df)

The importance scores sum to 1. Features with higher scores contributed more to the forest's splits. This is genuinely useful for exploratory analysis, you can feed in 50 features and quickly identify the 5 or 10 that actually matter, then focus your feature engineering efforts there.
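One caveat: impurity-based importances can be biased toward features with many possible split points. A useful cross-check is permutation importance, which measures how much held-out accuracy drops when a feature's values are shuffled. A sketch on synthetic data, where by construction only the first two features are informative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the informative features in the first columns
X, y = make_classification(n_samples=400, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the accuracy drop
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i, mean_drop in enumerate(result.importances_mean):
    print(f"Feature {i}: accuracy drop {mean_drop:.3f}")
```

The informative features should show a substantial accuracy drop when shuffled; the noise features should hover near zero.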

Controlling the Forest

Key parameters:

Parameter           What It Does
n_estimators        Number of trees (usually 50-500)
max_depth           Depth of each tree (same as single tree)
min_samples_split   Minimum samples to split (same as single tree)
max_features        Features sampled per split (usually 'sqrt' or 'log2')
bootstrap           Whether to sample with replacement (default: True)

Higher n_estimators generally improves performance (up to diminishing returns), but costs more compute.
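Here's a quick sketch of those diminishing returns, sweeping n_estimators on synthetic data (exact numbers will vary with the dataset):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

scores = {}
for n in [1, 10, 50, 200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_tr, y_tr)
    scores[n] = rf.score(X_te, y_te)
    print(f"{n:>3} trees: test accuracy {scores[n]:.3f}")
```

The big jump typically happens between 1 and 50 trees; beyond a couple hundred you're mostly buying compute, not accuracy.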

python
# Tuned forest
rf_tuned = RandomForestClassifier(
    n_estimators=200,
    max_depth=8,
    min_samples_split=5,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1  # Use all cores
)
rf_tuned.fit(X_train, y_train)
print(f"Tuned Forest Accuracy: {rf_tuned.score(X_test, y_test):.4f}")

The n_jobs=-1 flag is worth noting: it tells scikit-learn to use all available CPU cores when training the forest, since each tree can be trained independently. On a modern multi-core machine this can cut training time by 4x or more. Use it.

Support Vector Machines: The Maximum Margin Classifier

SVMs are different beasts. Instead of trees asking questions, SVMs find the hyperplane that maximizes the margin between classes. In 2D, it's a line. In higher dimensions, it's a hyperplane.

The Geometry

For linearly separable data, an SVM finds the line that best separates the two classes with the most breathing room (margin). Points closest to the line are "support vectors", the boundary cases that define the decision boundary.

For non-linear data, SVMs use the kernel trick: they implicitly map data to a higher-dimensional space without computing it explicitly. Common kernels:

  • Linear: K(x, x') = x · x' (for linearly separable data)
  • RBF (Radial Basis Function): K(x, x') = exp(-gamma * ||x - x'||^2) (for non-linear, curved boundaries)
  • Polynomial: K(x, x') = (gamma * (x · x') + r)^d (for polynomial-shaped boundaries)

SVM Intuition: Finding the Margin

Before we look at code, let's build proper intuition for what an SVM is actually doing. Because "finding the maximum margin hyperplane" sounds abstract until you see why the margin matters.

Imagine you're trying to draw a line between two clusters of points. There are infinitely many lines that correctly separate them, some pass close to one cluster, some pass close to the other, some split right down the middle. The SVM picks the line that is as far as possible from both clusters simultaneously. Why? Because a line that hugs one cluster is fragile: a new data point that's slightly outside what the training set showed you could easily end up on the wrong side. A line that's equidistant from both clusters has maximum tolerance for novel data points.

This is the fundamental insight: maximizing the margin is equivalent to maximizing the classifier's robustness to new data. It's a geometric form of regularization built directly into the algorithm's objective function. You're not just finding a separating boundary; you're finding the most confident separating boundary.

The "support vectors" are the specific training points that sit closest to the boundary, the ones that would cause you to redraw the line if they shifted. These are the critical, most ambiguous examples in your training set. Everything else is irrelevant to where the boundary ends up. This is a key insight: SVMs are one of the few algorithms that explicitly tell you which training examples matter most. In a dataset with 10,000 samples, maybe only 50 are support vectors. The rest could be removed and the model wouldn't change.
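You can verify this directly: after fitting, scikit-learn's SVC exposes the support vectors it selected. A sketch on two well-separated synthetic clusters:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated clusters
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
svm = SVC(kernel='linear', C=1.0).fit(X, y)

print(f"Training samples: {len(X)}")
print(f"Support vectors:  {len(svm.support_vectors_)}")  # only the boundary cases
print(f"Per class:        {svm.n_support_}")
```

With cleanly separated clusters, only a handful of points end up as support vectors; the rest of the training set could vanish without moving the boundary.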

The kernel trick is what makes SVMs so powerful for non-linear problems. The idea is that if your data isn't separable in its original feature space, you can transform it into a higher-dimensional space where it is separable. The RBF kernel, for instance, implicitly maps every point into an infinite-dimensional space where most classification problems become linearly separable. The trick is that you never actually compute this transformation explicitly, you compute it implicitly through the kernel function, which makes the math tractable even for very high-dimensional mappings.
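To demystify the kernel function itself: it's just a similarity score between two points, computable without ever visiting the higher-dimensional space. A sketch checking the RBF formula by hand against scikit-learn's implementation (the gamma value is chosen arbitrarily):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.0]])
gamma = 0.5

# By hand: exp(-gamma * ||a - b||^2)
manual = np.exp(-gamma * np.sum((a - b) ** 2))
# Via scikit-learn
library = rbf_kernel(a, b, gamma=gamma)[0, 0]

print(f"Manual:  {manual:.6f}")
print(f"sklearn: {library:.6f}")  # the two match
```

Identical points give a kernel value of 1, and the value decays toward 0 as points move apart, which is why RBF boundaries wrap smoothly around clusters.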

SVM in Action

python
from sklearn.svm import SVC
 
# Linear SVM
svm_linear = SVC(kernel='linear', C=1.0, random_state=42)
svm_linear.fit(X_train, y_train)
print(f"Linear SVM Accuracy: {svm_linear.score(X_test, y_test):.4f}")
 
# RBF SVM (non-linear)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf.fit(X_train, y_train)
print(f"RBF SVM Accuracy: {svm_rbf.score(X_test, y_test):.4f}")
 
# Polynomial SVM
svm_poly = SVC(kernel='poly', degree=3, C=1.0, random_state=42)
svm_poly.fit(X_train, y_train)
print(f"Polynomial SVM Accuracy: {svm_poly.score(X_test, y_test):.4f}")

The three kernel options give you three different assumptions about the shape of the decision boundary. Linear assumes a flat hyperplane. RBF allows arbitrary smooth curves. Polynomial allows polynomial-shaped boundaries. Start with RBF as your default when you suspect non-linearity, it's the most flexible and usually performs well across a wide range of problems.

SVM Hyperparameters

Parameter   What It Does                        Notes
kernel      Kernel type                         'linear', 'rbf', 'poly', 'sigmoid'
C           Regularization strength (inverse)   Lower C = more regularization, simpler model
gamma       Kernel coefficient (RBF/poly)       Lower gamma = smoother, simpler boundary
degree      Polynomial degree                   Only for 'poly' kernel

C controls the trade-off: high C tries to fit all training points (risk: overfitting). Low C allows some misclassification for a simpler boundary.

gamma (for RBF): high gamma = decision boundary close to training points (overfitting risk). Low gamma = smooth, far-reaching boundary.

python
# Tuned RBF SVM
svm_tuned = SVC(kernel='rbf', C=10.0, gamma=0.1, random_state=42)
svm_tuned.fit(X_train, y_train)
print(f"Tuned SVM Accuracy: {svm_tuned.score(X_test, y_test):.4f}")

One critical thing to always remember with SVMs: you must scale your features before training. SVMs are sensitive to feature magnitude because the margin calculation is based on distances in feature space. If one feature ranges from 0 to 1 and another ranges from 0 to 10,000, the large-scale feature will dominate the distance calculation and the model will effectively ignore the small-scale one. StandardScaler from scikit-learn is your friend here, always use it before fitting an SVM.
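Putting scaling and tuning together, here's a sketch using a Pipeline, so the scaler is fit only on the training folds during cross-validation, with a small grid over C and gamma (the grid values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# The pipeline guarantees the scaler never sees validation folds during CV
pipe = Pipeline([('scaler', StandardScaler()),
                 ('svm', SVC(kernel='rbf'))])
grid = GridSearchCV(pipe, {'svm__C': [0.1, 1, 10],
                           'svm__gamma': ['scale', 0.01, 0.1]}, cv=5)
grid.fit(X_tr, y_tr)

print(f"Best params: {grid.best_params_}")
print(f"Test accuracy: {grid.score(X_te, y_te):.3f}")
```

Scaling outside the pipeline and then cross-validating would leak validation-fold statistics into training, a subtle but common mistake.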

Visualizing Decision Boundaries

Now comes the fun part: seeing how each classifier thinks about the data.

python
import numpy as np
import matplotlib.pyplot as plt
 
# Create a mesh to plot decision boundaries
h = 0.02  # step size
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                      np.arange(y_min, y_max, h))
 
# Train classifiers
classifiers = [
    ('Decision Tree (depth=5)', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('Random Forest (100 trees)', RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)),
    ('Linear SVM', SVC(kernel='linear', C=1.0, random_state=42)),
    ('RBF SVM', SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)),
]
 
# Plot
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
 
for idx, (name, clf) in enumerate(classifiers):
    clf.fit(X_train, y_train)
 
    # Predict on mesh
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
 
    # Plot decision boundary and margins
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_test[y_test == 0, 0], X_test[y_test == 0, 1],
                      c='blue', marker='o', label='Class 0', edgecolors='k')
    axes[idx].scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
                      c='red', marker='s', label='Class 1', edgecolors='k')
 
    accuracy = clf.score(X_test, y_test)
    axes[idx].set_title(f'{name}\nAccuracy: {accuracy:.3f}')
    axes[idx].set_xlabel('Feature 0')
    axes[idx].set_ylabel('Feature 1')
    axes[idx].legend()
 
plt.tight_layout()
plt.savefig('decision_boundaries.png', dpi=150, bbox_inches='tight')
plt.show()

If you run this visualization, save it and spend time actually looking at it. The differences between the four panels are not subtle, they're telling you something fundamental about how each algorithm models reality. Don't just glance at the accuracy numbers; look at the shapes.

This visualization is gold. You can see:

  • Decision trees: Hard, rectangular boundaries (axis-aligned splits)
  • Random forests: Smoother rectangles (averaging many trees)
  • Linear SVM: Straight line separating classes
  • RBF SVM: Curved, non-linear boundary adapting to data clusters

Comparing on the Same Dataset

Let's put them all side-by-side with a more realistic example.

python
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score
 
# Load a real dataset
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
 
# Scale features (important for SVM!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
 
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)
 
# Train multiple classifiers
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'Linear SVM': SVC(kernel='linear', C=1.0, random_state=42),
    'RBF SVM': SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42),
}
 
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
 
    results[name] = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
    }
 
results_df = pd.DataFrame(results).T
print(results_df)

The breast cancer dataset has 30 features describing cell nucleus measurements, and the task is to classify tumors as malignant or benign. It's a realistic, high-stakes classification problem, exactly the kind of thing these algorithms were built for. Notice that all four classifiers perform well here, which is typical: on well-structured datasets with good features, multiple algorithms will converge to similar accuracy. The differences become more pronounced on messier, higher-dimensional, or noisier data.

All four are competitive; which one wins depends on your data. Notice the trade-offs:

  • Trees: Fast training, interpretable, but can overfit
  • Random Forests: More robust, feature importance, but less interpretable
  • Linear SVM: Great on many real datasets, needs feature scaling
  • RBF SVM: Powerful for non-linear patterns, risk of overfitting with wrong C/gamma

Choosing Between These Models

Now we get to the question you actually care about: you have a new dataset, you need to classify something, which algorithm do you pick? There's no universal answer, if there were, everyone would just use that one, but there are principles that will steer you right most of the time.

Start by thinking about what you know about your problem. Do you need to explain your predictions to a non-technical stakeholder? A doctor needs to understand why a model flagged a patient. A loan officer needs to explain a denial to a customer. A regulator might require auditability. In those cases, a decision tree with limited depth is often your best friend, it's the most interpretable of the three algorithms. You can literally print the tree and walk someone through it. Random forests and SVMs are much harder to explain in plain language.

If interpretability isn't the top concern and you want a reliable baseline, random forests are usually your starting point. They're robust across a wide range of dataset types, they handle mixed feature scales reasonably well (no need to standardize), they give you feature importance for free, and they very rarely fail catastrophically. When you don't know much about your data, a random forest with default parameters is a reasonable first move. It won't be the best possible model, but it'll rarely embarrass you.

SVMs tend to shine in specific situations: when your dataset is small to medium sized, when your features are well-engineered and meaningful, and when the number of features is comparable to or larger than the number of samples. Text classification is a classic SVM domain, you might have tens of thousands of features (one per word) but relatively few training documents, and SVMs handle that regime elegantly. If you're working on image classification or any domain where a lot of preliminary feature engineering has already been done, SVMs can be extremely competitive.

Dataset size matters too. SVMs have training time complexity that scales poorly, roughly O(n^2) to O(n^3) in the number of samples for the standard solver. On a dataset with 100,000 samples, an RBF SVM might take hours to train. Random forests train much faster at scale and can be parallelized trivially. If you have millions of samples, random forests or gradient boosting (which we'll cover later) are almost always the right move over SVMs.
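One practical escape hatch if you still want an SVM at scale: scikit-learn's LinearSVC uses the liblinear solver, which scales roughly linearly in the number of samples, at the cost of supporting only linear boundaries. A quick sketch on synthetic data (sample count and parameters are illustrative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=10,
                           n_clusters_per_class=1, random_state=0)
X = StandardScaler().fit_transform(X)  # scaling still matters

start = time.perf_counter()
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
print(f"Trained on {len(X)} samples in {time.perf_counter() - start:.2f}s")
print(f"Training accuracy: {clf.score(X, y):.3f}")
```

If you need non-linear boundaries at this scale, approximate kernel methods or tree ensembles are usually the better bet.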

Finally, think about whether you need probability estimates. By default, scikit-learn's SVC doesn't output calibrated probabilities, it outputs hard class predictions. You can get probabilities by setting probability=True, but it uses cross-validation internally and significantly slows down training. Decision trees and random forests give you probability estimates directly through the ratio of class samples at each leaf. If you need well-calibrated probabilities for downstream decisions, say, you want to rank predictions by confidence, that's worth factoring in.
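Here's a sketch of that difference, SVC with probability=True against the forest's native predict_proba (parameters are illustrative; note the SVM's probabilities come from internal cross-validated calibration, which slows training):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
svm = SVC(probability=True, random_state=0).fit(X_tr, y_tr)  # enables predict_proba

rf_probs = rf.predict_proba(X_te[:2])    # leaf-vote fractions, averaged over trees
svm_probs = svm.predict_proba(X_te[:2])  # Platt-style calibrated scores
print("Forest:", rf_probs.round(3))
print("SVM:   ", svm_probs.round(3))
```

Without probability=True, calling predict_proba on an SVC raises an error; you'd only have decision_function, which gives uncalibrated signed distances to the boundary.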

When to Use Each

Decision Trees: When you need interpretability (e.g., "explain why you denied this loan"). Small to medium datasets. Risk: overfitting.

Random Forests: Your default go-to for many tasks. Good accuracy, handles high-dimensional data, resistant to overfitting. Drawback: less interpretable than a single tree.

Linear SVM: When you suspect a linear boundary or after feature engineering makes data separable. Fast on large datasets. Need to scale features.

RBF/Polynomial SVM: When data has clear non-linear structure. Slower training but powerful. Hyperparameter tuning (C, gamma) is crucial.

Key Takeaways

  • Decision Trees split data recursively on Gini/entropy, trading interpretability for overfitting risk
  • Random Forests average many trees, reducing variance while still offering insight through feature importances
  • SVMs find maximum-margin hyperplanes, with kernels enabling non-linear decisions
  • Visualization reveals how each classifier carves up the decision space, crucial for intuition
  • Always scale your features for SVM; trees don't care
  • Tune hyperparameters based on validation performance, not training accuracy
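That last point deserves a concrete habit: compare hyperparameter settings with cross-validation, never with training accuracy. A minimal sketch (the depths chosen are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
for depth in [3, 10, None]:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    scores = cross_val_score(rf, X, y, cv=5)  # 5-fold CV accuracy
    print(f"max_depth={depth}: CV accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds tells you as much as the mean: a setting that wins on average but swings wildly between folds is a riskier choice than a slightly worse but stable one.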

The deeper lesson here isn't just about these three algorithms. It's about having a mental model for why different approaches work and when to apply them. Decision trees think in axis-aligned splits. Random forests reduce variance through ensembling. SVMs seek maximum margin through geometric optimization. Each assumption is a bet about the structure of your data, and knowing which bet to place is what separates someone who can apply machine learning from someone who truly understands it.

In the next article we're going to look at feature engineering and preprocessing: how to transform your raw data to make any of these classifiers perform dramatically better. The algorithms are only as good as the features you feed them, and there's a lot of leverage in that preprocessing step. We'll cover encoding, scaling, dimensionality reduction, and a handful of transformations that consistently make a difference. See you there.
