November 21, 2025
Python Machine Learning Scikit-Learn Regression

Supervised Learning: Linear and Logistic Regression with scikit-learn

Here's a reality check: most "AI" problems actually boil down to predicting something. Either you're predicting a continuous value (like house prices or temperature) or a category (like spam vs. not spam, disease vs. healthy). Those two problems sit at the heart of supervised learning, and they're solved elegantly with linear and logistic regression.

In this article, we're going to build both on the same dataset, so you'll see exactly how the two approaches differ, when to use which, and why these "simple" models remain the workhorses of practical machine learning.

Before we dive into code, let's talk about what "supervised" actually means and why it matters. Supervised learning is the paradigm where you train a model on labeled examples, you show it inputs paired with the correct outputs, and it learns the mapping between them. The "supervision" is the label: someone (or something) already told you the answer for each training example. Your job is to teach the model to generalize that answer to data it has never seen before. That core idea, learn from labeled examples, generalize to new ones, is what makes linear and logistic regression so powerful and so foundational.

What you'll walk away with after this article: a working intuition for how both algorithms learn from data, hands-on experience fitting and evaluating models in scikit-learn, an understanding of the metrics that actually matter for regression and classification, and a framework for deciding which model to reach for on any given problem. We'll be building everything from scratch in code you can run, modify, and break, because that's how the intuition really sticks. We'll also cover the practical pitfalls that catch beginners off guard: feature scaling, regularization choices, and the metrics that tell the real story when accuracy alone lies to you.

Table of Contents
  1. Why Start Here?
  2. Linear Regression: Predicting Continuous Values
  3. Ordinary Least Squares (OLS)
  4. Regression Metrics: How Good Is Your Fit?
  5. Coefficients Tell the Story
  6. Assumptions (And When They Break)
  7. Linear vs. Logistic Intuition
  8. Regularization: When Simple Models Overfit
  9. Ridge Regression (L2 Regularization)
  10. Lasso Regression (L1 Regularization)
  11. Regularization Explained
  12. Logistic Regression: Predicting Categories
  13. Fitting Logistic Regression
  14. The Decision Boundary
  15. Binary vs. Multiclass Classification
  16. Classification Metrics: Accuracy Isn't Everything
  17. Interpreting Logistic Regression Coefficients
  18. Side-by-Side: Regression vs. Classification on the Same Data
  19. Feature Scaling Importance
  20. When Linear Models Win
  21. Common Regression Mistakes
  22. Hyperparameter Tuning (The Light Touch)
  23. Assumptions and Limitations
  24. Wrapping Up: Why This Foundation Matters

Why Start Here?

Before you jump to deep neural networks or ensemble methods, understand this: linear and logistic regression are interpretable, fast, and serve as the baseline for nearly every ML project. They teach you core concepts (optimization, regularization, metrics, confidence) that transfer directly to complex models.

Plus, when you nail regression and classification with these fundamentals, you'll spot when fancier models are solving the wrong problem. Every seasoned ML engineer has a story about replacing a bloated neural network with a logistic regression that performed just as well and deployed in a fraction of the time. These algorithms are not training wheels. They are workhorses. Learn them deeply and you earn the judgment to know when to use them and when to reach for something heavier.

Linear Regression: Predicting Continuous Values

Linear regression answers a simple question: what's the relationship between your input features and a continuous output?

The math is straightforward. We're fitting a line (or hyperplane in multiple dimensions):

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Where b0 is the intercept and b1, b2, ..., bn are the coefficients. The goal is to find coefficients that minimize prediction error.

Ordinary Least Squares (OLS)

scikit-learn's LinearRegression uses OLS, a method that minimizes the sum of squared residuals (differences between predicted and actual values). It's the default because it's computationally efficient and has a closed-form solution.

Let's build a concrete example. We'll predict house prices from features. Notice that we import everything we need up front, split the data before touching the model (never peek at your test set during training), and then evaluate on held-out data. The split is what makes evaluation honest: if you trained and tested on the same data, your metrics would look great, but your model would be useless in the real world.

python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
 
# Generate synthetic housing data: 100 samples, 5 features
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)
 
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
 
# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)
 
# Make predictions
y_pred = model.predict(X_test)
 
# Evaluate on held-out data and inspect the fitted parameters
print(f"R-squared: {r2_score(y_test, y_pred):.4f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Notice we're not tweaking hyperparameters here; LinearRegression() just works. But understanding what it's doing is crucial. When you call .fit(), scikit-learn solves the OLS optimization internally using matrix algebra to find the exact coefficient values that minimize the total squared error across your training set. The result is deterministic (same data, same answer, every time), which is a property you lose with gradient-descent-based methods. For most tabular datasets with thousands to tens of thousands of rows, this is the right starting point before you consider anything more complex.
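To see that there's no magic in .fit(), here's a minimal sketch (reusing the same synthetic data as above) that solves the identical least-squares problem directly with NumPy and confirms it matches scikit-learn's answer:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Closed-form OLS: append a column of ones for the intercept, then solve
# the least-squares problem min ||X_aug @ beta - y||^2 directly
X_aug = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

model = LinearRegression().fit(X, y)
print(np.allclose(beta[0], model.intercept_))  # True: intercepts match
print(np.allclose(beta[1:], model.coef_))      # True: coefficients match
```

Same answer both ways; scikit-learn just wraps a numerically careful version of this linear algebra.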

Regression Metrics: How Good Is Your Fit?

You can't just squint at predictions and say "looks right." We measure regression accuracy with metrics. The three you need to know cold are MAE, RMSE, and R-squared. Each tells a different part of the story, and experienced practitioners report all three rather than cherry-picking the one that makes their model look best.

Mean Absolute Error (MAE): Average absolute difference between predicted and actual values. Interpreting it is easy: it's in the same units as your target.

python
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")  # "On average, we're off by $X"

MAE is relatively robust to outliers because it weights every error linearly rather than squaring it. If your target has extreme values, MAE gives you an honest picture of typical error without letting a few bad predictions dominate the score. Use it when you want to communicate model accuracy to stakeholders in plain language.

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE): Penalize larger errors more heavily. RMSE brings MSE back to your original units.

python
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")

Because MSE squares the errors before averaging, a single large mistake contributes disproportionately to the final score. That's a feature, not a bug: in many applications (predicting bridge load tolerances, medical dosages, financial risk), large errors are far worse than small ones and deserve heavier punishment. RMSE restores the units so you can interpret the number directly alongside MAE.

R-squared Score: Proportion of variance explained (0 to 1, higher is better). It answers: "What percentage of the target's variance does my model explain?"

python
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.4f}")  # 0.92 means 92% of variance explained

Why these three? MAE is interpretable for business stakeholders. RMSE punishes outliers. R-squared tells you whether your model beats a naive "always predict the mean" baseline. An R-squared of 0.0 means your model is no better than guessing the mean every time; an R-squared of 1.0 means perfect prediction. In practice, R-squared above 0.7 is often considered strong for social science data, while engineering applications might demand 0.95 or higher. Never report just one metric; together, they paint the full picture.
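The "always predict the mean" baseline is worth seeing concretely. In this sketch (toy numbers, purely for illustration), a predictor that does nothing but output the mean of the targets scores exactly 0.0:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 1.0, 4.0, 1.0, 5.0])    # toy targets
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
print(r2_score(y_true, baseline))               # 0.0: no variance explained
```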

Coefficients Tell the Story

Here's where interpretability shines. Each coefficient tells you how much the output changes for a one-unit change in that feature, holding others constant. This is one of the most powerful aspects of linear models: you can explain your model's predictions in plain English, trace every prediction back to its source, and validate that the model has learned something sensible before deploying it.

python
for i, coef in enumerate(model.coef_):
    print(f"Feature {i}: coefficient = {coef:.4f}")
    # "A 1-unit increase in Feature 0 increases y by {coef:.4f}, all else equal"

If a coefficient is negative, that feature pushes the prediction down. If positive, it pushes up. This beats the "black box" feeling you get from neural networks. When a coefficient's sign or magnitude doesn't match domain knowledge (say, a feature you know should positively affect the outcome has a negative coefficient), that's a signal to investigate. You might have multicollinearity, a data leak, or a preprocessing error. Interpretability isn't just a communication nicety; it's a debugging tool.

Assumptions (And When They Break)

OLS assumes:

  1. Linearity: The relationship between X and y is linear.
  2. Independence: Observations are independent (no time-series autocorrelation).
  3. Homoscedasticity: Error variance is constant across all X values.
  4. Normality: Residuals are approximately normally distributed.

Violate these and your coefficients become unreliable. Real data rarely satisfies all four perfectly, but awareness matters. If you suspect non-linearity, consider polynomial features or non-linear models.

Linear vs. Logistic Intuition

Before jumping into code for logistic regression, it helps to build a clear mental picture of how these two algorithms are related, and where they diverge. Think of linear regression as drawing the "best fit" straight line through your data points, minimizing the distance between the line and each point. The output is unbounded: it can be any real number, positive or negative, as large or as small as the data demands. That's fine when you're predicting something like house prices or temperature, where the answer lives on a continuous number line.

Now imagine you want to predict whether something is true or false, a binary outcome. If you plug that same linear regression output into a yes/no decision, you immediately hit a problem: linear regression can predict values like 1.7 or -0.3, which have no meaning as probabilities. Logistic regression fixes this by wrapping the linear combination in the sigmoid function, which squashes the output to a range between 0 and 1. Those values now behave like probabilities: 0.9 means "90% chance of class 1," and 0.1 means "10% chance." The decision boundary, the threshold where you flip from predicting class 0 to class 1, is still a straight line (or hyperplane) in feature space, which is why logistic regression is technically a linear classifier. The key insight is that both algorithms are doing the same linear combination of features under the hood; logistic regression just applies a nonlinear transformation to the output to make it useful for classification. That shared foundation is why understanding linear regression deeply makes logistic regression click immediately.
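To make the squashing concrete, here's a minimal sigmoid sketch showing how unbounded linear outputs (including awkward values like 1.7 and -0.3) become valid probabilities:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Unbounded linear outputs become valid probabilities
for z in [-5.0, -0.3, 0.0, 1.7, 5.0]:
    print(f"z = {z:5.1f} -> P = {sigmoid(z):.3f}")
# z = 0 maps to exactly 0.5, the default decision boundary
```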

Regularization: When Simple Models Overfit

Sometimes linear regression memorizes your training data and fails on new data. That's overfitting. Regularization adds a penalty term to the loss function to keep coefficients small.

Ridge Regression (L2 Regularization)

Ridge adds the squared magnitude of coefficients to the loss. The intuition: large coefficients imply the model is leaning heavily on specific features, which often reflects noise rather than signal. By penalizing large coefficients, Ridge encourages a smoother, more generalized fit. The alpha parameter controls how aggressive the penalty is: higher alpha means smaller coefficients and a simpler model.

python
from sklearn.linear_model import Ridge
 
# alpha controls regularization strength (higher = stronger penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
 
print(f"Ridge R-squared: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Ridge coefficients (smaller): {ridge.coef_}")

Ridge shrinks all coefficients toward zero but doesn't eliminate them. It's useful when you suspect multicollinearity (features are correlated). When two features are highly correlated, plain OLS becomes unstable: it can assign arbitrarily large positive and negative coefficients that cancel each other out, resulting in a technically correct but practically brittle fit. Ridge stabilizes this by distributing the weight across correlated features rather than concentrating it on one.
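A small synthetic demonstration of that instability (two near-duplicate features, invented for illustration): the sum of the two coefficients is pinned down by the data, but how OLS splits it between the twins is essentially arbitrary, while Ridge shares the weight far more evenly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)    # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)  # true signal uses x1 only

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # split between the twins is unstable
print("Ridge coefficients:", ridge.coef_)  # weight shared more evenly
# Either way, the coefficients sum to roughly 3 -- that part is well determined
```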

Lasso Regression (L1 Regularization)

Lasso is aggressive: it can shrink some coefficients to exactly zero, effectively doing feature selection. This is not just a mathematical curiosity; it has real practical value. When you're working with dozens or hundreds of features, Lasso gives you a principled, data-driven way to discover which ones actually matter without running separate feature selection steps.

python
from sklearn.linear_model import Lasso
 
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
 
print(f"Lasso R-squared: {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Lasso coefficients (some are zero): {lasso.coef_}")

If a Lasso coefficient is zero, that feature doesn't contribute. This is interpretability on steroids: you learn which features actually matter.

When to use Ridge vs. Lasso vs. plain Linear Regression?

  • Plain Linear: Few features, low multicollinearity, you have enough data.
  • Ridge: Many features, moderate multicollinearity, you want all features to contribute a little.
  • Lasso: Many features, you want automatic feature selection, you suspect most features are irrelevant.

Regularization Explained

It's worth slowing down on regularization because this concept trips up beginners more than almost anything else in ML. At its core, regularization is about managing the bias-variance tradeoff. A model with no regularization fits training data very well (low bias), but it often fits noise along with the signal, leading to poor generalization (high variance). Regularization deliberately introduces a small amount of bias to dramatically reduce variance.

Think of it this way: you have a dataset with 50 features but only 200 training samples. Plain linear regression has 51 parameters to fit (50 coefficients plus an intercept) against 200 data points. There's enough flexibility for the model to find spurious patterns that exist in your training data but not in reality. Regularization says: "before you assign large weights to any feature, you need to pay a penalty." This prevents the model from placing big bets on any single feature unless the evidence for that feature is overwhelming.

The alpha hyperparameter (or its inverse C in logistic regression) is the knob you turn. High alpha means strong regularization: a simple model, possibly underfitting. Low alpha means weak regularization: a complex model, possibly overfitting. Neither extreme is right in general; you find the sweet spot through cross-validation, which we'll cover in the hyperparameter tuning section. The practical takeaway: whenever you have more features than training examples, or whenever your plain linear regression shows a large gap between training and test performance, regularization is your first line of defense. Start with Ridge to stabilize, then try Lasso if you also want to trim irrelevant features.
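You can watch the knob work. In this sketch (synthetic data with more features than the earlier examples, chosen to make the effect visible), the overall size of the coefficient vector shrinks steadily as alpha grows:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=50, noise=20, random_state=42)

# Stronger penalty -> smaller coefficients, simpler model
for alpha in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: coefficient norm = {np.linalg.norm(ridge.coef_):.1f}")
```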

Logistic Regression: Predicting Categories

Now we flip the switch to classification. Logistic regression answers: what's the probability that this observation belongs to a particular class?

The trick is the sigmoid function, which squashes any input to a probability (0 to 1):

P(y=1|X) = 1 / (1 + e^(-z))

Where z is our familiar linear combination: b0 + b1*x1 + b2*x2 + ...

Fitting Logistic Regression

Let's use the same feature space but create a binary classification target. We take our continuous regression target and split it at the median: observations with values above the median become class 1, below become class 0. This is a clean way to create a balanced binary problem from continuous data, and it demonstrates that the same features can serve both regression and classification depending on how you frame the output. In practice, your classification targets come pre-labeled: spam/not-spam, fraud/legitimate, churn/retained.

python
from sklearn.linear_model import LogisticRegression
 
# Create binary classification: high price (1) vs. low price (0)
y_binary = (y > np.median(y)).astype(int)
 
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)
 
# Fit logistic regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
 
# Predict probabilities AND class labels
y_pred_proba = log_reg.predict_proba(X_test)  # [[prob_class_0, prob_class_1], ...]
y_pred = log_reg.predict(X_test)  # [0, 1, 1, 0, ...]
 
print(f"Coefficients: {log_reg.coef_}")
print(f"Intercept: {log_reg.intercept_}")

Notice we get both probabilities and hard class labels. Probabilities let you set your own decision threshold, useful when false positives are more costly than false negatives (or vice versa). This is one of logistic regression's most underappreciated advantages: it doesn't just answer "which class?" but "how confident are we?" That confidence score is invaluable for downstream decision-making, risk calibration, and building systems where humans review borderline cases rather than blindly trusting every model output.

The Decision Boundary

Logistic regression draws a line (or hyperplane) that separates classes. By default, it uses a 0.5 probability threshold:

Predicted class = 1 if P(y=1|X) >= 0.5
Predicted class = 0 if P(y=1|X) < 0.5

But that 0.5 isn't sacred. You can adjust it based on your problem:

python
y_pred_custom = (y_pred_proba[:, 1] >= 0.7).astype(int)
# "Predict class 1 only if we're at least 70% confident"

This flexibility is gold. In medical diagnosis, you might demand 95% confidence before saying "patient has disease." In email filtering, 60% might be fine. The threshold is a business decision as much as a technical one, and logistic regression lets you make that decision explicitly rather than having it baked into the model architecture. Adjusting the threshold changes the precision-recall tradeoff directly: a higher threshold means fewer but more confident positive predictions (higher precision, lower recall); a lower threshold means more positive predictions but more false alarms (lower precision, higher recall). Visualizing the ROC curve before choosing your threshold is the professional approach.
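Here's a sketch of that tradeoff on synthetic data (make_classification stands in for a real labeled dataset): sweeping the threshold upward trades recall away for precision:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Raising the threshold makes positive predictions rarer but more confident
for thresh in [0.3, 0.5, 0.7]:
    pred = (proba >= thresh).astype(int)
    print(f"threshold={thresh}: "
          f"precision={precision_score(y_te, pred):.2f}, "
          f"recall={recall_score(y_te, pred):.2f}")
```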

Binary vs. Multiclass Classification

Binary is straightforward: two classes, sigmoid function, done.

Multiclass needs strategy. scikit-learn uses two defaults:

One-vs-Rest (OvR): Train one classifier per class (class A vs. all others, class B vs. all others, etc.). Pick the class with highest confidence.

Multinomial (Softmax): Train a single model with the softmax function, which generalizes sigmoid to multiple classes:

P(y=k|X) = e^(z_k) / sum(e^(z_j)) for all classes j
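The formula is a one-liner in NumPy. This sketch (with made-up scores) shows the defining property: any vector of linear scores becomes a valid probability distribution:

```python
import numpy as np

def softmax(z):
    """Turn K linear scores into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # linear scores z_k for three classes
probs = softmax(scores)
print(probs)        # highest score -> highest probability
print(probs.sum())  # 1.0
```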

With scikit-learn's LogisticRegression, use:

python
# Multiclass on a 3-class problem (labels must line up with X_train's rows)
y_multiclass = np.random.randint(0, 3, size=len(X_train))  # 0, 1, or 2
 
# multinomial is the default behavior with lbfgs in modern scikit-learn;
# older versions accepted multi_class='multinomial' explicitly
log_reg_multi = LogisticRegression(solver='lbfgs', max_iter=200)
log_reg_multi.fit(X_train, y_multiclass)

For most multiclass problems with well-separated classes, the multinomial approach tends to outperform OvR because it trains a single, globally coherent model that considers all classes simultaneously rather than building independent binary classifiers. The practical difference is usually small on clean, balanced datasets, but multinomial is the more principled choice when you have three or more classes. The solver matters here too: lbfgs (scikit-learn's default) is a gradient-based optimizer that handles the multinomial objective directly, while the older liblinear solver only supports OvR. Recent scikit-learn releases deprecate the multi_class argument entirely because multinomial is already the default; whichever strategy you pick, check that your solver supports it.

Classification Metrics: Accuracy Isn't Everything

This is where classification gets interesting. Accuracy (percentage correct) is intuitive but misleading if classes are imbalanced.

Accuracy: Simple, how many did we get right?

python
from sklearn.metrics import accuracy_score
 
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")  # But be careful if classes are imbalanced

Here's the trap with accuracy on imbalanced data: if 95% of your emails are not spam, a model that predicts "not spam" for every single email achieves 95% accuracy without learning anything useful. That's why precision and recall exist, they force you to look at how the model handles each class separately.
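That trap is easy to reproduce. A do-nothing majority-class "model" on a made-up 95/5 split scores 95% accuracy while catching zero spam:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 "not spam" (0) vs. 5 spam (1) -- a typical imbalance
y_true = np.array([0] * 95 + [1] * 5)
y_majority = np.zeros(100, dtype=int)  # always predict "not spam"

print(accuracy_score(y_true, y_majority))  # 0.95 -- looks impressive
print(recall_score(y_true, y_majority))    # 0.0  -- catches no spam at all
```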

Precision and Recall: These dig deeper.

python
from sklearn.metrics import precision_score, recall_score
 
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
 
print(f"Precision: {precision:.4f}")  # Of positive predictions, how many are correct?
print(f"Recall: {recall:.4f}")  # Of actual positives, how many did we find?

Why both? Imagine a disease classifier: high recall means we catch most patients (few false negatives, which matter medically). High precision means we don't waste resources on false alarms.

F1 Score: The harmonic mean of precision and recall. Use when you want a single number and both matter equally.

python
from sklearn.metrics import f1_score
 
f1 = f1_score(y_test, y_pred)
print(f"F1: {f1:.4f}")

The harmonic mean is mathematically stricter than the arithmetic mean: a model can't achieve a high F1 by being very good on one metric and terrible on the other. If precision is 1.0 and recall is 0.1, the F1 is only 0.18. That harshness is the point: F1 rewards models that are genuinely balanced, not ones that game one metric at the expense of the other. For imbalanced datasets, also consider the macro-averaged F1, which computes F1 per class and averages, giving equal weight to each class regardless of frequency.

AUC (Area Under the ROC Curve): Evaluates how well your model ranks predictions across all thresholds. AUC = 1 is perfect, 0.5 is random guessing.

python
from sklearn.metrics import roc_auc_score
 
auc = roc_auc_score(y_test, y_pred_proba[:, 1])
print(f"AUC: {auc:.4f}")

AUC is threshold-agnostic: it rewards models that confidently separate classes, regardless of where you draw the line. This makes AUC uniquely useful during model development, before you've decided on a deployment threshold. An AUC of 0.85 tells you the model has real discriminating power even if you haven't yet decided whether to optimize for precision or recall in production.

Interpreting Logistic Regression Coefficients

Coefficients in logistic regression don't mean "change in y" the way they do in linear regression. Instead, they represent log-odds:

log(odds) = b0 + b1*x1 + b2*x2 + ...

To interpret: a coefficient of 0.5 for a feature means a one-unit increase in that feature multiplies the odds of class 1 by e^0.5 (approximately 1.65, or a 65% increase in odds).

python
import numpy as np
 
coef = log_reg.coef_[0, 0]  # First feature
odds_multiplier = np.exp(coef)
print(f"One-unit increase in feature 0 multiplies odds by {odds_multiplier:.2f}")

It's less intuitive than linear regression, but still interpretable, and far better than a neural network's hidden layers. Odds ratios are a well-established concept in statistics and medicine, so stakeholders with quantitative backgrounds will understand them. For non-technical audiences, you can simplify further: "every additional year of customer tenure roughly doubles the odds of retention" is graspable even for someone who has never heard the phrase "log-odds." The interpretability of logistic regression coefficients is one of the main reasons the algorithm remains dominant in regulated industries like finance and healthcare, where model explainability is a legal or compliance requirement.

Side-by-Side: Regression vs. Classification on the Same Data

Let's cement the differences with a complete example using both approaches. Running both models on the same underlying feature set makes the comparison concrete: the data preparation and splitting look identical, the fitting API is the same .fit() call, but the metrics you reach for afterward are completely different. Regression answers "how much?" Classification answers "which one?" Both questions are valid; your choice depends entirely on how you've framed the output.

python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score, roc_auc_score
import numpy as np
 
# Generate data
X, y_continuous = make_regression(n_samples=200, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train_cont, y_test_cont = train_test_split(
    X, y_continuous, test_size=0.2, random_state=42
)
 
# --- LINEAR REGRESSION ---
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train_cont)
y_pred_reg = lin_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test_cont, y_pred_reg))
r2 = r2_score(y_test_cont, y_pred_reg)
print(f"Linear Regression - RMSE: {rmse:.4f}, R-squared: {r2:.4f}")
 
# --- LOGISTIC REGRESSION (same features, binary target) ---
y_binary = (y_continuous > np.median(y_continuous)).astype(int)
X_train, X_test, y_train_bin, y_test_bin = train_test_split(
    X, y_binary, test_size=0.2, random_state=42
)
 
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train_bin)
y_pred_class = log_reg.predict(X_test)
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]
 
acc = accuracy_score(y_test_bin, y_pred_class)
f1 = f1_score(y_test_bin, y_pred_class)
auc = roc_auc_score(y_test_bin, y_pred_proba)
print(f"Logistic Regression - Accuracy: {acc:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}")

One dataset, two problems, two solutions. Regression predicts a value. Classification predicts a category (with confidence). Same features, different output spaces, different metrics. When you run this code yourself, pay attention to the AUC score for logistic regression: if it's above 0.85 or 0.9, that's a sign the features genuinely discriminate the two classes well, which in turn validates that the underlying linear relationship is strong. Strong regression performance and strong classification performance on the same features is a consistency check that builds confidence in the data quality.

Feature Scaling Importance

One of the most common mistakes beginners make with both linear and logistic regression is ignoring feature scaling. When your features live on wildly different scales, say, one feature ranges from 0 to 1 and another ranges from 0 to 1,000,000, the coefficients become hard to interpret and regularization becomes unreliable.

Here's why scaling matters specifically for regularization: Ridge and Lasso penalize the magnitude of coefficients. If Feature A is measured in millimeters and Feature B is measured in kilometers, their coefficient magnitudes will differ by a factor of a million even if they have equal predictive power. Regularization will inadvertently penalize Feature A's coefficient far more than Feature B's, not because Feature A matters less but because its unit is smaller. Standardizing features before applying regularization levels the playing field, so the penalty is applied equitably across all features.

For logistic regression, scaling also affects optimization convergence. The gradient descent solvers (like lbfgs or saga) converge faster when features are on similar scales because the loss landscape is more spherical and less elongated. Without scaling, training can be slow or even unstable on poorly conditioned data. Use scikit-learn's StandardScaler to center features at zero mean and scale to unit variance before fitting any regularized linear model. The pattern is always: fit the scaler on training data only, then transform both train and test using that fitted scaler. Fitting on test data would leak information about the test distribution into your preprocessing, subtly invalidating your evaluation.
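The cleanest way to enforce that fit-on-train-only discipline is a scikit-learn Pipeline; here's a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_train only, then reuses those statistics
# when transforming X_test -- no information leaks from the test set
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print(f"Test R-squared: {model.score(X_test, y_test):.4f}")
```

Because the scaler and the model travel together, there's no way to accidentally fit preprocessing on the test split.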

When Linear Models Win

You might be wondering: why not just use XGBoost or neural networks on everything?

Speed: Linear models train in milliseconds. Complex models take minutes or hours.

Interpretability: You explain coefficients to stakeholders. Neural networks? "It's a black box, but it works."

Data efficiency: Linear models learn from hundreds of examples. Deep learning needs thousands.

Overfitting resistance: Fewer parameters means less risk of memorizing noise.

Simplicity: Debug-friendly. When predictions surprise you, the math is transparent.

Where linear models lose: high-dimensional non-linear relationships, image/text data (without heavy feature engineering), ensemble strength.

Common Regression Mistakes

Even experienced practitioners fall into predictable traps with regression models. The first is data leakage: inadvertently including information in your features that would not be available at prediction time. A classic example: predicting customer churn using a feature that is only computed at the time of cancellation. Your training metrics look outstanding, you deploy, and the model is useless because that feature doesn't exist for active customers. Always trace every feature back to its real-world availability before deployment.

The second mistake is neglecting to check residuals after fitting. Residual plots (predicted vs. actual, or residuals vs. feature values) reveal patterns that your model missed. If you see a U-shaped pattern in your residuals, the relationship is non-linear and linear regression is the wrong tool or needs polynomial features. If the variance of residuals increases with the predicted value (heteroscedasticity), your standard errors are wrong and your model may need a log-transformation of the target.
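Checking residuals doesn't require a plot to get started. For an OLS fit with an intercept, the residuals are guaranteed to average zero and to be uncorrelated with the fitted values, so deviations you spot in a plot (curvature, fanning) are the real red flags. A minimal numeric check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=20, random_state=42)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Healthy OLS residuals: zero mean, no linear trend against predictions
print(f"Residual mean: {residuals.mean():.2e}")  # ~0 by construction
corr = np.corrcoef(model.predict(X), residuals)[0, 1]
print(f"Correlation with predictions: {corr:.2e}")  # ~0 by construction
```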

The third mistake is treating R-squared as the only metric. A model with R-squared = 0.95 can still have terrible RMSE if the target has high variance. Conversely, a model with R-squared = 0.5 might be highly valuable if the baseline (always predicting the mean) was completely useless to begin with. Report all three metrics (MAE, RMSE, and R-squared) and benchmark against a reasonable baseline.

Finally, beginners often tune on the test set. Every time you look at test set performance and adjust your model, you are effectively training on the test set. Use a validation set or cross-validation for model selection and hyperparameter tuning. Reserve the test set for a single, final evaluation after all decisions have been made.

Hyperparameter Tuning (The Light Touch)

Even simple models have tuning points. For regularized regression, the most important hyperparameter is alpha (regularization strength). The values you try should span multiple orders of magnitude: regularization effects are logarithmic, so the difference between alpha=0.01 and alpha=0.1 is often larger than the difference between alpha=0.1 and alpha=0.2.

python
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
 
# Tune Ridge alpha
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5)
ridge_cv.fit(X_train, y_train_cont)
print(f"Best alpha: {ridge_cv.best_params_}")
print(f"Best CV score: {ridge_cv.best_score_:.4f}")

For logistic regression, tune the regularization parameter C (inverse of alpha):

python
log_reg_cv = GridSearchCV(LogisticRegression(max_iter=1000),
                          {'C': [0.001, 0.01, 0.1, 1.0, 10.0]},
                          cv=5)
log_reg_cv.fit(X_train, y_train_bin)

Cross-validation protects you from overfitting to your test set. Use it. The cv=5 argument means scikit-learn splits your training data into 5 folds, trains on 4 and evaluates on 1, rotates through all 5 combinations, and reports the average score. This gives you a much more reliable estimate of model performance than a single train/validation split, especially on small datasets where any single split might be lucky or unlucky by chance.

Assumptions and Limitations

Both models assume a linear decision boundary. If your classes form a spiral or interleaving circles in feature space, linear models will struggle. Non-linear kernels or ensemble methods become necessary.

Also remember:

  • Scaling matters: Logistic regression with regularization benefits from standardized features (mean 0, std 1). Linear regression is more forgiving but still clearer with scaled data.
  • Missing data: Neither handles NaN gracefully. Impute or drop beforehand.
  • Categorical features: Encode as numbers or use one-hot encoding.

Wrapping Up: Why This Foundation Matters

Linear regression and logistic regression are the bedrock of supervised learning. They're fast, interpretable, and surprisingly effective. Linear regression predicts continuous values via OLS, with metrics like RMSE and R-squared guiding evaluation. Logistic regression flips the problem, sigmoid squashes predictions to probabilities, and you classify by thresholding.

Regularization (Ridge, Lasso) prevents overfitting and can highlight which features matter. Classification metrics (accuracy, precision, recall, F1, AUC) reveal model behavior from different angles. Coefficients tell stories in both, how much each feature nudges the output.

The concepts we covered here (the bias-variance tradeoff, regularization, proper train/test splitting, metric selection, and feature scaling) are not specific to linear models. They transfer verbatim to every algorithm you will ever use in machine learning. When you start working with gradient-boosted trees, neural networks, or support vector machines, you will be fighting the same battles: preventing overfitting, choosing meaningful metrics, scaling features appropriately, and tuning regularization. Master these fundamentals now and you'll have a head start on every algorithm that follows.

You've now seen the two fundamental supervised learning flavors. Next article, we'll tackle decision trees and random forests, models that go non-linear while staying interpretable. But don't skip this foundation. Every ML engineer worth their salt returns to regression and classification basics when hunting bugs or explaining results.

Ready to build? Load your data, fit a LinearRegression or LogisticRegression, and watch how far linear thinking takes you. When your first model underperforms, resist the temptation to immediately reach for something more complex; instead, diagnose the problem with the tools we covered here. More often than you expect, the answer is better features, better scaling, or the right regularization strength, not a fancier algorithm.
