Logistic regression

What is logistic regression?

Logistic regression is a supervised machine learning algorithm designed to predict the probability of an event occurring based on a set of independent variables. It is the go-to method for binary classification problems, determining whether an instance belongs to one category or another—such as spam vs. not spam, disease vs. no disease, or customer purchase vs. no purchase.

At its core, logistic regression models the log odds of an event as a linear function of the independent variables, then maps that linear output through the sigmoid (inverse logit) function so that predicted probabilities remain bounded between 0 and 1. Unlike linear regression, which can output values beyond this range, logistic regression is therefore well suited to categorical outcomes.

How does logistic regression work?

Logistic regression is a statistical method used for binary classification, meaning it predicts whether an instance belongs to one of two categories. Unlike linear regression, which predicts continuous values, logistic regression predicts the probability of an event occurring and assigns a label based on a threshold (typically 0.5).

To understand how logistic regression works, let’s break it down step by step.

1. Converting the Dependent Variable Using Logit Transformation

In logistic regression, the dependent variable (output) is categorical, meaning it can only take discrete values (e.g., 0 or 1). However, to make it work in a regression-like setting, we need to transform this categorical output into a continuous range that can be modeled linearly.

To achieve this, logistic regression applies a logit transformation, which converts probabilities into log odds:

logit(p) = ln(p / (1 āˆ’ p)) = β₀ + β₁x₁ + β₂xā‚‚ + … + βₙxₙ

where p is the probability of the event and β₀, β₁, …, βₙ are the model coefficients.
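As a quick illustration (a minimal sketch using NumPy, not part of any particular library API), the logit transformation maps a probability to its log odds:

```python
import numpy as np

def logit(p):
    """Convert a probability p in (0, 1) to log odds."""
    return np.log(p / (1 - p))

# A probability of 0.5 corresponds to log odds of 0;
# probabilities above 0.5 give positive log odds, below give negative.
print(logit(0.5))   # 0.0
print(logit(0.9))   # ~2.197
print(logit(0.1))   # ~-2.197
```

Note that the log odds are unbounded in both directions, which is exactly what makes them suitable for linear modeling.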

2. Applying the Sigmoid (Logistic) Function

Since we need to predict a probability (a value between 0 and 1), we apply the sigmoid function (also called the logistic function) to transform the log odds back into a probability:

p = 1 / (1 + e^(āˆ’(β₀ + β₁x₁ + … + βₙxₙ)))
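The sigmoid itself is a one-line function; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued log odds z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(4))    # ~0.982
print(sigmoid(-4))   # ~0.018
```

Large positive log odds approach probability 1, large negative log odds approach 0, and the function never leaves the (0, 1) interval.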

3. Decision Rule for Classification

Once a probability p has been computed, the model assigns a class label by comparing it to a threshold:

If p ≄ 0.5, predict class 1; otherwise, predict class 0.

The 0.5 threshold is a convention, not a requirement: it can be raised or lowered depending on the relative cost of false positives and false negatives.
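Putting the three steps together, here is a hedged sketch of a single prediction (the coefficients β₀ = āˆ’1.0 and β₁ = 2.0 are arbitrary illustrative values, not fitted ones):

```python
import numpy as np

def predict(x, beta0=-1.0, beta1=2.0, threshold=0.5):
    """Linear score -> sigmoid -> thresholded class label.
    beta0 and beta1 are arbitrary illustrative coefficients."""
    z = beta0 + beta1 * x          # linear combination (log odds)
    p = 1 / (1 + np.exp(-z))       # sigmoid -> probability
    return int(p >= threshold), p

label, prob = predict(1.0)  # z = 1.0, p ~ 0.731 -> class 1
print(label, prob)
```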

Why is Logistic Regression Important?

Logistic regression is widely used in machine learning and statistics due to its simplicity, efficiency, and interpretability. It is particularly valuable for binary classification problems where the goal is to determine the probability of an event occurring based on given independent variables. Several key aspects make logistic regression an essential tool for classification and predictive modeling.

1. Probabilistic Predictions

Unlike strict classification models that provide only a categorical output, logistic regression offers a probability score that represents the likelihood of an event occurring. This probability can be used to assess confidence in a prediction and allows for more informed decision-making. By setting an appropriate threshold, the model can be adjusted to suit specific needs, such as minimizing false positives or false negatives. The ability to produce probability scores makes logistic regression highly flexible and applicable in various real-world scenarios.
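As a sketch (using scikit-learn's `LogisticRegression` on synthetic data purely for illustration), the probability scores from `predict_proba` can be compared against any threshold, not just 0.5:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]         # probability of the positive class
default_preds = (probs >= 0.5).astype(int)   # default threshold
strict_preds = (probs >= 0.8).astype(int)    # stricter threshold -> fewer positives flagged

print(strict_preds.sum() <= default_preds.sum())  # True
```

Raising the threshold trades recall for precision: fewer instances are flagged positive, which reduces false positives at the cost of more false negatives.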

2. Interpretable Model

One of the main advantages of logistic regression is its interpretability. The model assigns coefficients to independent variables, indicating their contribution to the likelihood of the dependent variable occurring. A positive coefficient suggests that an increase in the corresponding independent variable increases the probability of the event, while a negative coefficient suggests a decrease in probability. This transparency allows for a clear understanding of how each feature influences the outcome, making it easier to communicate results and justify predictions.
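As an illustration (a minimal sketch on synthetic data, with hypothetical feature names x1–x3), exponentiating each coefficient yields an odds ratio describing how a one-unit increase in that feature multiplies the odds of the positive class:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
model = LogisticRegression().fit(X, y)

for name, coef in zip(["x1", "x2", "x3"], model.coef_[0]):
    direction = "raises" if coef > 0 else "lowers"
    # exp(coef) is the multiplicative change in the odds per unit increase
    print(f"{name}: coef={coef:.3f}, odds ratio={np.exp(coef):.3f} ({direction} the probability)")
```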

3. Computationally Efficient

Logistic regression is computationally efficient, making it suitable for small to medium-sized datasets. It does not require extensive computational resources and converges relatively quickly compared to more complex models. The model’s efficiency allows for rapid training and evaluation, making it practical for applications where quick insights are needed. Additionally, logistic regression is scalable and can handle a large number of observations without significantly increasing computational complexity.

4. Versatile Applications

Logistic regression is widely applicable across multiple domains, including healthcare, finance, marketing, and security. It is commonly used for tasks such as predicting medical outcomes, assessing credit risk, analyzing customer behavior, and detecting fraudulent activities. The model’s ability to handle categorical outcomes and provide interpretable probabilities makes it a preferred choice in many industries where decision-making must be transparent and data-driven.

Types of Logistic Regression

1. Binary Logistic Regression

In this approach, the response or dependent variable is dichotomous in nature, meaning it has only two possible outcomes (e.g., 0 or 1). This is the most commonly used type of logistic regression and serves as a fundamental classifier in binary classification tasks.

Example Use Cases:

  • Predicting whether an email is spam or not spam.
  • Determining whether a tumor is malignant or not malignant.
  • Assessing whether a customer will purchase a product or not.

2. Multinomial Logistic Regression

This type of logistic regression is used when the dependent variable has three or more possible outcomes, but these categories do not have a specific order. Multinomial logistic regression helps in situations where classification involves multiple but unordered groups.

Example Use Cases:

  • Predicting the genre of a movie a person is likely to watch based on demographic factors such as age, gender, and relationship status.
  • Classifying types of customer feedback (e.g., complaint, suggestion, compliment).
  • Identifying different species of plants or animals based on certain characteristics.
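In scikit-learn, a multinomial model can be sketched as follows (using the built-in Iris dataset, which has three unordered classes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# With more than two classes, scikit-learn fits a multinomial
# (softmax) logistic regression by default.
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X[:1])  # one probability per class
print(probs.shape)                  # (1, 3)
```

The predicted probabilities for each instance sum to 1 across the three classes.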

3. Ordinal Logistic Regression

This logistic regression model is used when the dependent variable has three or more ordered categories, meaning the outcomes have a natural ranking or hierarchy. Unlike multinomial regression, the distances between the ordered categories may not be equal.

Example Use Cases:

  • Assigning a student’s grade (A, B, C, D, F) based on test scores.
  • Categorizing customer satisfaction levels (Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied).
  • Ranking a product review on a 1 to 5-star scale.

Evaluating a Logistic Regression Model

After implementing logistic regression, the next crucial step is evaluating its performance to ensure that it accurately classifies instances and generalizes well to unseen data. Evaluating the model helps assess its strengths and weaknesses, providing insights into how well it performs in real-world scenarios.

We can evaluate the logistic regression model using the following key metrics:

Accuracy

Accuracy measures the proportion of correctly classified instances in the dataset. It is a basic yet widely used metric for evaluating classification models.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • True Positives (TP): Cases correctly predicted as positive.
  • True Negatives (TN): Cases correctly predicted as negative.
  • False Positives (FP): Cases incorrectly predicted as positive.
  • False Negatives (FN): Cases incorrectly predicted as negative.

Note: Accuracy is useful when the dataset is balanced (i.e., both classes appear in similar proportions). However, it may be misleading when dealing with imbalanced datasets.
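Computed on a small hand-made example (a sketch, not tied to any particular model), accuracy can be derived directly from the four confusion-matrix counts or obtained from scikit-learn:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

manual = (tp + tn) / (tp + tn + fp + fn)
print(manual, accuracy_score(y_true, y_pred))  # both 0.75
```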

Precision (Positive Predictive Value)

Precision measures how many of the predicted positive cases were actually correct. It is important in situations where false positives are costly.

Precision = TP / (TP + FP)

  • Higher precision means fewer false positives.
  • Useful in applications like fraud detection, where falsely flagging a non-fraudulent transaction can have consequences.

Recall (Sensitivity or True Positive Rate)

Recall measures the proportion of actual positive cases that were correctly predicted. It is crucial when missing a positive case is costly.

Recall = TP / (TP + FN)

  • Higher recall means fewer false negatives.
  • Useful in scenarios like medical diagnosis, where missing a disease case can have serious implications.

F1 Score

The F1 Score is the harmonic mean of precision and recall, balancing both metrics into a single score. It is especially useful when dealing with imbalanced datasets.

F1 = 2 Ɨ (Precision Ɨ Recall) / (Precision + Recall)

  • High F1 Score indicates a balance between precision and recall.
  • Useful when both false positives and false negatives are equally important.
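On a small hand-made example (a sketch), precision, recall, and F1 can be computed with scikit-learn and cross-checked against their definitions (Precision = TP/(TP+FP), Recall = TP/(TP+FN)):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# TP = 3, FP = 1, FN = 1 for this toy example
p = precision_score(y_true, y_pred)  # 3 / (3 + 1) = 0.75
r = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
f1 = f1_score(y_true, y_pred)        # harmonic mean of 0.75 and 0.75 = 0.75
print(p, r, f1)
```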

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)

The ROC Curve plots the true positive rate (Recall) against the false positive rate at different classification thresholds.

  • AUC (Area Under Curve) measures how well the model distinguishes between positive and negative classes.
  • AUC Score Range:
      • 1.0 → Perfect classifier
      • 0.5 → Random guessing
      • < 0.5 → Worse than random

A higher AUC-ROC value indicates that the model performs well across different classification thresholds.
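As a small sketch with hand-picked scores, scikit-learn's `roc_auc_score` computes this directly from true labels and predicted probabilities:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a randomly chosen positive instance
# is scored higher than a randomly chosen negative instance.
print(roc_auc_score(y_true, y_scores))  # 0.75
```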

AUC-PR (Area Under the Precision-Recall Curve)

Similar to AUC-ROC, the Precision-Recall Curve plots precision against recall at different classification thresholds.

  • AUC-PR provides a performance summary focusing on positive class detection.
  • More useful than ROC when the dataset is highly imbalanced (e.g., fraud detection where positive cases are rare).
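A corresponding sketch for the precision-recall curve, using scikit-learn's `precision_recall_curve` and trapezoidal `auc`:

```python
from sklearn.metrics import precision_recall_curve, auc

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

precision, recall, _ = precision_recall_curve(y_true, y_scores)
pr_auc = auc(recall, precision)  # area under the precision-recall curve
print(pr_auc)
```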

Logistic Regression Use Cases

Logistic regression is a widely used classification algorithm in various industries due to its simplicity, interpretability, and efficiency. It is particularly useful for binary classification problems but can also handle multiclass classification using multinomial and ordinal logistic regression.

Here are some key use cases of logistic regression across different domains:

Healthcare & Medical Diagnosis

Logistic regression plays a crucial role in predicting medical conditions and diseases based on patient data.

Use Cases:

  • Disease Prediction: Predicting whether a patient has diabetes, heart disease, or cancer based on medical records.
  • Survival Analysis: Estimating the likelihood of patient survival after surgery or treatment.
  • Risk Assessment: Identifying patients at risk for stroke, hypertension, or COVID-19 severity.

Finance & Banking

Banks and financial institutions use logistic regression for risk assessment, fraud detection, and customer segmentation.

Use Cases:

  • Credit Scoring: Predicting whether a borrower will default on a loan.
  • Fraud Detection: Identifying fraudulent transactions based on spending patterns.
  • Customer Retention: Predicting whether a customer is likely to close an account or switch banks.
  • Insurance Claims: Assessing the probability of insurance claims being fraudulent.

Marketing & Customer Analytics

Logistic regression helps businesses understand customer behavior and optimize marketing strategies.

Use Cases:

  • Customer Churn Prediction: Identifying customers who are likely to unsubscribe or stop using a service.
  • Conversion Rate Optimization: Predicting whether a customer will click on an ad or make a purchase.
  • Personalized Marketing: Segmenting customers based on their likelihood of responding to a marketing campaign.

Human Resources (HR) & Employee Management

Companies use logistic regression to analyze employee behavior and hiring trends.

Use Cases:

  • Employee Retention: Predicting whether an employee will resign or stay in the company.
  • Hiring Decisions: Estimating the probability of a candidate succeeding in a job role.
  • Workplace Productivity: Analyzing factors that impact employee performance.

E-commerce & Retail

Retailers use logistic regression to optimize product recommendations, inventory management, and customer engagement.

Use Cases:

  • Predicting Purchase Behavior: Identifying whether a customer will buy a product.
  • Inventory Optimization: Forecasting demand for products based on historical sales data.
  • Product Return Prediction: Estimating the likelihood of a customer returning an item.

Cybersecurity & Fraud Detection

Logistic regression is used to detect anomalies and prevent cyber threats.

Use Cases:

  • Spam Detection: Identifying whether an email is spam or not spam.
  • Malware Classification: Detecting whether a file or software is malicious or safe.
  • Network Intrusion Detection: Predicting potential security breaches based on suspicious activities.

Political & Social Science Research

Researchers use logistic regression to analyze voting patterns and social behaviors.

Use Cases:

  • Voting Behavior Analysis: Predicting whether a person will vote in an election.
  • Survey Analysis: Understanding public opinion on social or political issues.
  • Crime Rate Prediction: Estimating the probability of crime occurrences in specific areas.

Manufacturing & Industrial Applications

Logistic regression is useful for quality control and predictive maintenance in industrial settings.

Use Cases:

  • Defect Detection: Predicting whether a manufactured product is defective or non-defective.
  • Machine Failure Prediction: Estimating the likelihood of machine breakdowns based on sensor data.
  • Supply Chain Optimization: Identifying factors affecting supply chain efficiency.

Python Code Implementation for the Logistic Regression Machine Learning Model

# šŸ“Œ Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, confusion_matrix
from sklearn.datasets import load_breast_cancer
# šŸ“Œ Load the Dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column
# šŸ“Œ Exploratory Data Analysis (EDA)
print(f"Dataset Shape: {df.shape}")  # Number of rows and columns
print("\nMissing Values:\n", df.isnull().sum())  # Check for missing values
print("\nDataset Summary:\n", df.describe())  # Summary statistics
# Plot Target Variable Distribution
plt.figure(figsize=(6, 4))
sns.countplot(x=df['target'], palette="Set2")
plt.title("Distribution of Target Variable")
plt.xlabel("Cancer Type (0: Malignant, 1: Benign)")
plt.ylabel("Count")
plt.show()
# šŸ“Œ Feature Scaling & Splitting Data
X = df.drop(columns=['target'])  # Independent variables
y = df['target']  # Dependent variable
# Split into Training (80%) and Testing (20%) Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Standardize the Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# šŸ“Œ Training Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
# šŸ“Œ Making Predictions
y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probabilities for the positive class (1)
# šŸ“Œ Evaluating the Model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
# Display Evaluation Metrics
print("\nšŸ“Š Model Evaluation Metrics:")
print(f"āœ… Accuracy      : {accuracy:.4f}")
print(f"āœ… Precision     : {precision:.4f}")
print(f"āœ… Recall        : {recall:.4f}")
print(f"āœ… F1 Score      : {f1:.4f}")
print(f"āœ… ROC-AUC Score : {auc:.4f}")
# šŸ“Œ Confusion Matrix Visualization
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d", xticklabels=["Malignant", "Benign"], yticklabels=["Malignant", "Benign"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
# šŸ“Œ ROC Curve Visualization
fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='blue', label=f'AUC = {auc:.4f}')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
# šŸ“Œ Final Summary
print("\nšŸ“¢ Logistic Regression Model Summary:")
print(f"šŸ”¹ Accuracy      : {accuracy:.4f}")
print(f"šŸ”¹ Precision     : {precision:.4f}")
print(f"šŸ”¹ Recall        : {recall:.4f}")
print(f"šŸ”¹ F1 Score      : {f1:.4f}")
print(f"šŸ”¹ ROC-AUC Score : {auc:.4f}")