Machine learning sounds complicated. But at its core, most machine learning tasks come down to one simple question: are you predicting a number or a category?

If you are predicting a number — like the price of a house or tomorrow's temperature — that is regression. If you are predicting a category — like whether an email is spam or not, or which animal is in a photo — that is classification.

Both are types of supervised learning, which is the most common branch of machine learning. This guide explains both from scratch, with clear everyday analogies and working Python code you can run yourself.

What Is Supervised Learning

Supervised learning means teaching a machine using examples that already have the right answer. You show the model thousands of examples — here is the input, here is the correct output — and the model learns the pattern that connects them.

Think of it like teaching a child to recognise apples. You show them 100 apples and say "this is an apple". You show them 100 oranges and say "this is not an apple". After enough examples, the child can look at a fruit they have never seen before and make a good guess.

Supervised learning works the same way. You give it labelled training data, it learns the pattern, and then it can make predictions on new data it has never seen.

ℹ️ Two types of supervised learning: Regression predicts a continuous number as the output. Classification predicts a discrete category or label as the output. The algorithm you choose depends entirely on what kind of answer you need.

Regression — Predicting a Number

Regression is used when the output you want to predict is a continuous number. The model learns to draw a line or curve through your data that best fits the relationship between your inputs and outputs.

The simplest mental model: imagine you have a scatter plot of house sizes on the x axis and house prices on the y axis. Regression draws the line that fits through all those points. When you give it a new house size it has never seen, it looks at where that size falls on the line and reads off the predicted price.
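To make that line-fitting idea concrete, here is a minimal sketch that computes the best-fit line by hand using the standard least-squares formulas. The (size, price) numbers are invented for illustration:

```python
import numpy as np

# Toy data: house sizes (sqft) and prices (thousands) — illustrative only
sizes = np.array([800, 1200, 1500, 2000], dtype=float)
prices = np.array([150, 250, 320, 450], dtype=float)

# Least-squares slope and intercept: the line that minimises squared error
slope = np.cov(sizes, prices, bias=True)[0, 1] / np.var(sizes)
intercept = prices.mean() - slope * sizes.mean()

# Predict the price of an unseen 1600 sqft house by reading off the line
predicted = slope * 1600 + intercept
print(f'Predicted price: ~${predicted:.0f}k')  # → Predicted price: ~$349k
```

This is exactly what LinearRegression does under the hood for one input feature; with several features it fits a plane instead of a line, but the idea is the same.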

Real World Examples of Regression

  • Predicting the price of a house based on its size, location and number of rooms
  • Predicting tomorrow's temperature based on today's weather data
  • Predicting how many sales a product will make based on its price and ad spend
  • Predicting a student's exam score based on hours studied
  • Predicting a patient's blood pressure based on their age and weight
  • Predicting the fuel efficiency of a car based on its engine size and weight

Regression in Python

Here is a working example using scikit-learn. We will predict house prices from a list of features:

Python — linear regression with scikit-learn
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: [size_sqft, num_bedrooms, age_of_house]
X = np.array([
    [1200, 2, 10],
    [1500, 3, 5],
    [2000, 4, 2],
    [800, 1, 20],
    [1800, 3, 8],
    [2200, 4, 1],
])

# Target: house prices in thousands
y = np.array([250, 320, 450, 150, 380, 500])

# Split into training and test sets (roughly 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the model and train it
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
predictions = model.predict(X_test)
print('Predicted prices:', predictions)

# Predict a brand new house
new_house = [[1600, 3, 4]]
price = model.predict(new_house)
print(f'Predicted price: ${price[0]:.0f}k')  # a single number on a continuous scale
```

How to Measure Regression Accuracy

For regression, you measure how far your predictions are from the actual values. Here are the three most common ways to do that:

Python — regression evaluation metrics
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = [250, 320, 450]
y_predicted = [240, 335, 460]

# MAE — average of how far off each prediction was
# Easy to understand: "on average, off by X units"
mae = mean_absolute_error(y_actual, y_predicted)
print(f'MAE: {mae:.2f}')  # MAE: 11.67 (off by $11.67k on average)

# RMSE — punishes big errors more than small ones
rmse = mean_squared_error(y_actual, y_predicted) ** 0.5
print(f'RMSE: {rmse:.2f}')  # RMSE: 11.90

# R² score — how well the model explains the data
# 1.0 = perfect, 0.0 = no better than guessing the average
r2 = r2_score(y_actual, y_predicted)
print(f'R² Score: {r2:.3f}')  # R² Score: 0.979 — a very good fit
```

Classification — Predicting a Category

Classification is used when the output you want to predict is a discrete label or category. Instead of drawing a line, the model learns to draw a boundary that separates one category from another.

Think of it like sorting your email. Every incoming message gets sorted into a box: "spam" or "not spam". The model has learned from thousands of past emails what distinguishes spam from legitimate mail, so it can put every new email into the right box.

There are two types of classification. Binary classification is where there are only two possible answers (yes or no, spam or not spam, cat or dog). Multi-class classification is where there are more than two categories (cat, dog, bird, fish).
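As a quick sketch of the multi-class case, scikit-learn's LogisticRegression handles more than two labels out of the box. The weight/height features and the animal data below are invented for illustration:

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy features: [weight_kg, height_cm] for three animal classes — made up numbers
X = np.array([
    [4, 25], [5, 28], [3, 22],        # cats
    [20, 55], [25, 60], [30, 65],     # dogs
    [0.5, 12], [0.3, 10], [0.4, 11],  # birds
])
y = np.array(['cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'bird', 'bird', 'bird'])

# Same .fit()/.predict() pattern as binary classification —
# the model just picks one of the three labels
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
print(model.predict([[22, 58]]))  # lands in the dog cluster for this toy data
print(model.classes_)             # all the labels the model knows about
```

Nothing about the code changes between binary and multi-class — only the number of distinct values in `y`.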

Real World Examples of Classification

  • Detecting whether an email is spam or not spam
  • Deciding if a bank transaction is fraudulent or genuine
  • Diagnosing whether a tumour is malignant or benign
  • Recognising which handwritten digit is in an image (0 to 9)
  • Predicting whether a customer will churn or stay
  • Identifying which language a sentence is written in
  • Classifying a photo as a cat, dog or bird

Classification in Python

Here is a working example. We will predict whether a customer will buy a product based on their age and salary:

Python — logistic regression classification with scikit-learn
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Features: [age, annual_salary_thousands]
X = np.array([
    [22, 25], [35, 60], [45, 80],
    [28, 35], [52, 95], [30, 45],
    [40, 70], [20, 20], [58, 110],
])

# Labels: 1 = will buy, 0 = will not buy
y = np.array([0, 1, 1, 0, 1, 0, 1, 0, 1])

# Split, train and predict
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predictions:', predictions)  # labels like [0 1] — categories, not continuous numbers

# Get the probability instead of just the label
probabilities = model.predict_proba(X_test)
print('Buy probability:', probabilities[:, 1])  # e.g. [0.23 0.87]

# Predict one new customer
new_customer = [[38, 65]]
label = model.predict(new_customer)
print('Will buy?', 'Yes' if label[0] == 1 else 'No')
```
ℹ️ Why is it called Logistic Regression if it is a classifier? Because it uses a mathematical function called the logistic function internally. The name is historical and confusing for beginners. Just remember: Logistic Regression outputs a category (0 or 1), so it is classification despite having "regression" in the name.
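To see what that logistic (sigmoid) function actually does, here is a hand-rolled sketch — not scikit-learn's internals, just the maths. It squashes any raw score into a probability between 0 and 1, which is then thresholded at 0.5 to pick a label:

```python
import numpy as np

def sigmoid(z):
    # The logistic function: maps any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

scores = np.array([-3.0, 0.0, 3.0])  # raw model scores (illustrative values)
probs = sigmoid(scores)
print(probs.round(3))                # ≈ 0.047, 0.5, 0.953

labels = (probs >= 0.5).astype(int)  # threshold at 0.5 to get 0/1 labels
print(labels)                        # → [0 1 1]
```

This is also where `predict_proba` comes from: `predict` is simply `predict_proba` plus the 0.5 threshold.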

How to Measure Classification Accuracy

For classification, the main question is how many predictions the model got right. But simply counting correct answers is not always enough. Here are the four most important metrics:

Python — classification evaluation metrics
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_actual = [1, 0, 1, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1]

# Accuracy — what % of all predictions were correct
print(f'Accuracy: {accuracy_score(y_actual, y_predicted):.2f}')    # Accuracy: 0.71

# Precision — of the ones you said YES, how many were actually yes
# Important when false positives are costly (e.g. spam filter)
print(f'Precision: {precision_score(y_actual, y_predicted):.2f}')  # Precision: 0.75

# Recall — of the actual YES cases, how many did you catch
# Important when missing a positive is costly (e.g. cancer detection)
print(f'Recall: {recall_score(y_actual, y_predicted):.2f}')        # Recall: 0.75

# F1 Score — the balance between precision and recall
# Use this when you need a single number that covers both
print(f'F1 Score: {f1_score(y_actual, y_predicted):.2f}')          # F1 Score: 0.75
```

Side by Side Comparison

Feature               Regression                        Classification
Output type           A continuous number               A discrete category or label
Example output        $347,000 · 22.5 degrees · 89.3%   Spam / Not spam · Cat / Dog / Bird
Key question          How much? How many?               Which one? Yes or no?
Main metrics          MAE, RMSE, R² score               Accuracy, Precision, Recall, F1
Simple algorithm      Linear Regression                 Logistic Regression
Tree algorithm        Decision Tree Regressor           Decision Tree Classifier
Ensemble algorithm    Random Forest Regressor           Random Forest Classifier

How to Choose Between Regression and Classification

The choice is almost always obvious once you ask one question: what kind of answer do I need?

Ask yourself: can the answer be any number on a scale, or does it have to be one of a fixed set of options?

  • If the answer is a number from a continuous range (price, temperature, score, duration) → use regression
  • If the answer is one of a fixed list of options (yes/no, which category, which label) → use classification
Plain English — quick decision guide
```
# Ask: what kind of answer do I need?

# REGRESSION if...
"How much will this house sell for?"        # → $347,000 (a number)
"What will the temperature be tomorrow?"    # → 24.5°C (a number)
"How long will this delivery take?"         # → 3.2 days (a number)
"How many units will we sell next month?"   # → 1,847 units (a number)

# CLASSIFICATION if...
"Is this email spam?"                       # → Yes or No (a label)
"What digit is in this image?"              # → 0-9 (a category)
"Will this customer cancel their account?"  # → Will churn / Will stay
"What language is this text written in?"    # → English, French, Hindi...
```
⚠️ The tricky case — age groups: predicting someone's exact age in years is regression. But predicting which age group they fall into (child, teenager, adult, senior) is classification. The same real-world concept can be either, depending on how you frame your output.
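A quick sketch of that reframing, using toy ages and illustrative cut-off points: the same values can serve as a regression target directly, or be binned into groups to become a classification target:

```python
import numpy as np

ages = np.array([8, 15, 34, 70, 42, 12])  # exact ages — a regression target

# Bin the same ages into groups — now it is a classification target
# Illustrative cut-offs: <13 child, 13-17 teenager, 18-64 adult, 65+ senior
groups = np.array(['child', 'teenager', 'adult', 'senior'])
labels = groups[np.digitize(ages, bins=[13, 18, 65])]
print(labels)  # → ['child' 'teenager' 'adult' 'senior' 'adult' 'child']
```

Nothing about the underlying data changed; only how you framed the output decided which family of algorithm applies.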

Common Algorithms for Each Type

Most algorithms in scikit-learn come in two versions — one for regression and one for classification. They work the same way internally but produce different types of output:

Python — same algorithm, two versions
```python
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVR, SVC
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

# REGRESSION versions — output is a number
# (X_train, y_train here are the house data from the regression example above)
reg_model = RandomForestRegressor(n_estimators=100)
reg_model.fit(X_train, y_train)
price = reg_model.predict([[1600, 3, 4]])
print(f'Price: ${price[0]:.0f}k')  # a number

# CLASSIFICATION versions — output is a category
# (X_train, y_train here are the customer data from the classification example above)
clf_model = RandomForestClassifier(n_estimators=100)
clf_model.fit(X_train, y_train)
label = clf_model.predict([[38, 65]])
print(f'Label: {label[0]}')  # a category, e.g. 1

# Both work the same way: .fit() then .predict()
# The only difference is what y_train contains
# Regression:     y = [250, 320, 450]  (numbers)
# Classification: y = [0, 1, 1, 0]     (labels)
```
The best thing about scikit-learn: the API is identical for both. You always call .fit() to train and .predict() to get results. Once you learn the pattern for one algorithm, you can switch to any other algorithm with just one line change.
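For instance, here is that one-line swap on a deliberately tiny, made-up dataset — only the constructor differs between the two algorithms:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import numpy as np

X = np.array([[1], [2], [3], [4]])  # toy feature: one column
y = np.array([10, 20, 30, 40])      # toy target: a number to predict

for Model in (LinearRegression, DecisionTreeRegressor):
    model = Model()   # the only line that differs between algorithms
    model.fit(X, y)   # identical training call
    print(Model.__name__, model.predict([[2.5]]))  # identical prediction call
```

Both models train and predict through exactly the same two calls; swapping in `RandomForestRegressor` or `KNeighborsRegressor` would work the same way.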

⚡ Key Takeaways
  • Supervised learning means training a model on labelled data — examples where you already know the correct answer.
  • Regression predicts a continuous number. Use it when the answer could be any value on a scale, like a price, temperature or score.
  • Classification predicts a discrete category or label. Use it when the answer must be one of a fixed set of options, like yes/no or cat/dog/bird.
  • To choose between them, ask one question: is the output a number on a continuous range, or one of a fixed set of categories?
  • Measure regression with MAE (average error), RMSE (punishes big errors) and R² score (how well the model explains the data).
  • Measure classification with accuracy (overall correct), precision (of your yes calls, how many were right), recall (of actual yes cases, how many did you catch) and F1 score (balance of both).
  • Most scikit-learn algorithms come in two versions ending in Regressor or Classifier. The API is identical for both. You call .fit() then .predict().
  • Logistic Regression is a classifier despite its name. Do not let the word "regression" confuse you.