Education Ecosystem Blog

Regression Versus Classification Machine Learning: What’s the Difference?

The difference between regression machine learning algorithms and classification machine learning algorithms sometimes confuse most data scientists, which make them to implement wrong methodologies in solving their prediction problems.

Andreybu, who is from Germany and has more than 5 years of machine learning experience, says that “understanding whether the machine learning task is a regression or classification problem is key for selecting the right algorithm to use.”

You can learn from one of his machine learning projects here.

Let’s start by talking about the similarities between the two techniques.

Supervised machine learning

Regression and classification are categorized under the same umbrella of supervised machine learning.

Both share the same concept of utilizing known datasets (referred to as training datasets) to make predictions.

In supervised learning, an algorithm is employed to learn the mapping function from the input variable (x) to the output variable (y); that is y = f(X).

The objective of such a problem is to approximate the mapping function (f) as accurately as possible such that whenever there is a new input data (x), the output variable (y) for the dataset can be predicted.

Here is a chart that shows the different groupings of machine learning:

Unfortunately, there is where the similarity between regression versus classification machine learning ends.

The main difference between them is that the output variable in regression is numerical (or continuous) while that for classification is categorical (or discrete).

Regression in machine learning

In machine learning, regression algorithms attempt to estimate the mapping function (f) from the input variables (x) to numerical or continuous output variables (y).

In this case, y is a real value, which can be an integer or a floating point value. Therefore, regression prediction problems are usually quantities or sizes.

For example, when provided with a dataset about houses, and you are asked to predict their prices, that is a regression task because price will be a continuous output.

Examples of the common regression algorithms include linear regression, Support Vector Regression (SVR), and regression trees.

Some algorithms, such as logistic regression, have the name “regression” in their names but they are not regression algorithms.

Here is an example of a linear regression problem in Python:

import numpy as np

import pandas as pd

# importing the model

from sklearn.linear_model import LinearRegression

from sklearn.cross_validation import train_test_split

# importing the module for calculating the performance metrics of the model

from sklearn import metrics

data_path = “http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv” # loading the advertising dataset

data = pd.read_csv(data_path, index_col=0)

array_items = [‘TV’, ‘radio’, ‘newspaper’] #creating an array list of the items

X = data[array_items] #choosing a subset of the dataset

y = data.sales #sales

# dividing X and y into training and testing units

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linearreg = LinearRegression() #applying the linear regression model

linearreg.fit(X_train, y_train) #fitting the model to the training data

y_predict = linearreg.predict(X_test) #making predictions based on the testing unit

print(np.sqrt(metrics.mean_squared_error(y_test, y_predict))) #calculating the RMSE number

#output gives the RMSE number as 1.4046514230328955

Classification in machine learning

On the other hand, classification algorithms attempt to estimate the mapping function (f) from the input variables (x) to discrete or categorical output variables (y).

In this case, y is a category that the mapping function predicts. If provided with a single or several input variables, a classification model will attempt to predict the value of a single or several conclusions.

For example, when provided with a dataset about houses, a classification algorithm can try to predict whether the prices for the houses “sell more or less than the recommended retail price.”

Here, the houses will be classified whether their prices fall into two discrete categories: above or below the said price.

Examples of the common classification algorithms include logistic regression, Naïve Bayes, decision trees, and K Nearest Neighbors.

Here is an example of a classification problem that differentiates between an orange and an apple:

from sklearn import tree

# Gathering training data

# features = [[155, “rough”], [180, “rough”], [135, “smooth”], [110, “smooth”]] # Input to classifier

features = [[155, 0], [180, 0], [135, 1], [110, 1]] # scikit-learn requires real-valued features

# labels = [“orange”, “orange”, “apple”, “apple”] # output values

labels = [1, 1, 0, 0]

# Training classifier

classifier = tree.DecisionTreeClassifier() # using decision tree classifier

classifier = classifier.fit(features, labels) # Find patterns in data

# Making predictions

print (classifier.predict([[120, 1]]))

# Output is 0 for apple

Wrapping up

Selecting the correct algorithm for your machine learning problem is critical for the realization of the results you need.

As a data scientist, you need to know how to differentiate between regression predictive models and classification predictive models so that you can choose the best one for your specific use case.

Do you have any comments or questions?

Please post them below.

Dr. Michael J. Garbade

I, Dr. Michael J. Garbade is the co-founder of the Education Ecosystem (aka LiveEdu), ex-Amazon, GE, Rebate Networks, Y-combinator. Python, Django, and DevOps Engineer. Serial Entrepreneur. Experienced in raising venture funding. I speak English and German as mother tongues. I have a Masters in Business Administration and Physics, and a Ph.D. in Venture Capital Financing. Currently, I am the Project Lead on the community project -Nationalcoronalvirus Hotline I write subject matter expert technical and business articles in leading blogs like Opensource.com, Dzone.com, Cybrary, Businessinsider, Entrepreneur.com, TechinAsia, Coindesk, and Cointelegraph. I am a frequent speaker and panelist at tech and blockchain conferences around the globe. I serve as a start-up mentor at Axel Springer Accelerator, NY Edtech Accelerator, Seedstars, and Learnlaunch Accelerator. I love hackathons and often serve as a technical judge on hackathon panels.