A Gentle Introduction to Gaussian Discriminant Analysis: Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis
Understand Gaussian discriminant analysis in depth through its mathematics and a Python implementation from scratch
Recently, I came across an interesting observation: Gaussian Naive Bayes, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) can all be derived from almost the same Bayes formula [1].
These methods use the multivariate Gaussian distribution for the likelihood, and the only difference is the covariance matrix they assume for it.
In this article, I will introduce the mathematics of these methods in detail and walk through a Python implementation from scratch to give a concrete sense of how they work.
1. Preliminaries: Bayes’ theorem
Bayes’ theorem is a fundamental concept for understanding Gaussian discriminant analysis, so we cover it first. To understand it, let’s review three fundamental rules of probability: joint probability, conditional probability, and marginal probability.
Joint probability
A joint probability is the probability that two events occur together. In the following figure, the joint probability corresponds to the overlapping part of the two events.
Conditional probability
A conditional probability measures the probability of one event given that another event has occurred. Intuitively, it is the proportion of event A’s probability within the occurrence of event B. In the following example, you can check the conditional probability with a visualization.
Marginal probability
A marginal probability is the probability of a single event on its own. When an event can co-occur with several other events, we can decompose its probability into the joint probabilities with each of those events. The following visualization helps you understand it.
In the above example, we assume that event A can occur with four other events B1, B2, B3, and B4. If we sum the joint probabilities of event A with each event B, we recover the probability of event A.
Bayes’ theorem
Now, we can derive Bayes’ theorem using these concepts as follows.
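Written out in standard notation (my reconstruction, with the Bayesian names attached to each term):
$$
\underbrace{p(A \mid B)}_{\text{posterior}}
= \frac{p(A, B)}{p(B)}
= \frac{\overbrace{p(B \mid A)}^{\text{likelihood}}\;\overbrace{p(A)}^{\text{prior}}}{\underbrace{p(B)}_{\text{evidence}}}
$$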
Although Bayes’ theorem sounds intimidating, it is just a combination of basic probability concepts built on conditional probability. In the Bayesian context, the terms of the formula above are called the likelihood, the prior, and the evidence.
In the case of two conditioned events, we can represent Bayes’ theorem as follows.
We can derive Bayes’ theorem in the same way as in the single-conditioning-event case. We will utilize this formula later in this article.
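For concreteness, one common way to write the two-condition form is (standard identity; my notation may differ from the original figure):
$$
p(A \mid B, C) = \frac{p(B \mid A, C)\, p(A \mid C)}{p(B \mid C)}
$$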
2. Preliminaries: Multivariate Gaussian distribution
Mathematics of Multivariate Gaussian distribution
The multivariate Gaussian distribution is another concept needed to understand Gaussian discriminant analysis, so let’s quickly recap it. The multivariate Gaussian distribution generalizes the one-dimensional Gaussian distribution to two or more dimensions. It has the probability density function below.
We assume the following input vector and parameters: x is a D×1 input vector, μ is the mean vector with the same dimension as x, and Σ is the D×D covariance matrix.
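Written out with these symbols, the density has the familiar form (standard notation):
$$
\mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)
= \frac{1}{(2\pi)^{D/2}\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}
\exp\!\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right)
$$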
Basically, the multivariate Gaussian distribution is the multi-dimensional version of the one-dimensional Gaussian distribution. Compared to the one-dimensional case, it must account for the correlations among dimensions, so we use a covariance matrix rather than a single variance.
Parameter estimation: Maximum likelihood estimation (MLE)
In the previous section, we learned about the multivariate Gaussian distribution.
Here is a question: how can we estimate the parameters when we want to fit a multivariate Gaussian distribution to data?
One solution is to use Maximum Likelihood Estimation (MLE). What is MLE? MLE estimates the parameters of an assumed probability distribution given data by maximizing a likelihood function. The general formulas are presented below.
We assume that we have N data points. Equation (1) is called the likelihood function. It is the product of the conditional probabilities of the data given the distribution parameters. When applying MLE, we estimate the parameters that maximize this function, as in equation (2).
Intuitively, the higher the likelihood function value, the more plausible it is that the data were generated by the assumed probability distribution with those parameters.
Now, let’s look at a concrete example. When we use the multivariate Gaussian distribution as the assumed probability distribution, we can describe the likelihood function as shown below.
In practice, we apply natural logarithms to transform multiplication into summation (log-likelihood).
Then, we differentiate the log-likelihood function with respect to each parameter to estimate the parameters. The simple derivation is as shown below. We can ignore the terms that are not relevant to the parameters. (I will attach the detailed derivations in the appendix.)
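For reference, the well-known closed-form MLE solutions for the multivariate Gaussian are (these should match what the appendix derivation arrives at):
$$
\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}\right)^{\top}
$$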
More generally, whenever the log-likelihood is differentiable, we can estimate the parameters of other probability distributions in the same way, by differentiation.
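As a quick numerical sanity check of these closed-form solutions, here is a minimal NumPy sketch (the "true" parameters are arbitrary illustrative values):
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters for a 2-D Gaussian (illustrative values only)
mu_true = np.array([1.0, -2.0])
sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])

# Draw N samples and compute the MLE estimates
X = rng.multivariate_normal(mu_true, sigma_true, size=10_000)
mu_mle = X.mean(axis=0)                          # (1/N) * sum of x_n
sigma_mle = np.cov(X, rowvar=False, bias=True)   # divide by N, not N - 1

print(mu_mle)     # close to mu_true
print(sigma_mle)  # close to sigma_true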
3. Introduction to Gaussian Discriminant Analysis
Now, let’s explore one of the important applications of the multivariate Gaussian distribution: Gaussian discriminant analysis (GDA). GDA is a supervised classification algorithm. It solves a classification problem by modeling the data-generation process; thus, it is also called a generative classifier. Concretely, GDA assumes a Gaussian distribution for each class’s data and defines the probabilistic model below.
x refers to the data, c refers to a class, and θ denotes the Gaussian distribution parameters. Formula (3) represents each data point’s probability density, i.e., how likely it is that the data point was generated from class c’s distribution.
However, this formula is inconvenient when we want the probability of each class given a data point. We utilize Bayes’ theorem to define a new formula, as shown below.
Formula (4) is the probability that a data point is assigned to class c.
The prior p(y=c|θ) represents how likely class c is in the given data. In the GDA case, we usually use the proportion of each class in the training data for the prior 𝜋.
Using formula (4), we can calculate the probability of each class given a data point and classify the data point to the class with the highest probability. We can derive this decision rule as follows.
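In other words, the rule amounts to picking the class with the largest log posterior; a reconstruction in standard notation (the evidence term is dropped in the next step) is:
$$
\hat{y} = \arg\max_{c}\; p(y=c \mid \mathbf{x}, \boldsymbol{\theta})
= \arg\max_{c}\; \Big[ \log p(\mathbf{x} \mid y=c, \boldsymbol{\theta}_c) + \log \pi_c \Big]
$$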
Note that we can ignore the denominator because it is not related to the class. If we know the parameters, we can calculate the assigned class of a given data point using the above formula. However, we usually don’t know the parameters, so we need to estimate them from the data. In that case, we can utilize maximum likelihood estimation (MLE). MLE finds the parameters that maximize the joint probability of the data and the classes. Here is the joint probability we have to maximize.
We assume the dataset is i.i.d. and can write the log-likelihood function as follows.
𝒟 represents the dataset, which consists of the data x and the corresponding class labels y=c.
C refers to the number of classes, and the dataset has N points. Now, we can find the parameters that maximize this function using derivatives. When we use different types of covariance matrices, the resulting techniques are the following:
- When we assume a diagonal covariance matrix → Gaussian Naive Bayes
- When we assume a full covariance matrix but share it for all the classes → Linear Discriminant Analysis (LDA)
- When we assume a full covariance matrix → Quadratic Discriminant Analysis (QDA)
Why do we need such assumptions at all, instead of always using QDA? Although MLE is very efficient in speed and simplicity, it has one problem: it can badly overfit in high dimensions. In particular, when we estimate a full covariance matrix, the MLE result can be unstable because the number of parameters is large. Thus, we usually use Gaussian Naive Bayes or LDA when we have a small amount of data.
Now, let’s check each technique in detail. I will explain the mathematics first, then visualization, and finally, the implementation of each method. We will use a randomly generated dataset. It has two feature dimensions with one label as shown below.
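The exact toy dataset is not reproduced here, but the following sketch generates a comparable two-feature, three-class dataset (the class means, covariances, and sample sizes are my own illustrative choices):
import numpy as np

rng = np.random.default_rng(42)

# Illustrative per-class means and covariances for a 2-D, 3-class toy dataset
means = [np.array([-2.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, -3.0])]
covs = [np.array([[1.0, 0.3], [0.3, 0.5]]),
        np.array([[0.7, -0.4], [-0.4, 1.2]]),
        np.array([[1.5, 0.0], [0.0, 0.4]])]
n_per_class = 100

# Stack the per-class samples and build the matching label vector
X = np.vstack([rng.multivariate_normal(m, c, size=n_per_class)
               for m, c in zip(means, covs)])
y = np.repeat(np.arange(len(means)), n_per_class)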
We use this toy dataset to visualize each class’s estimated distribution’s shape and decision boundary. Let’s dive into them!
4. Introduction to Gaussian Naive Bayes
Mathematics of Gaussian Naive Bayes
When we assume a diagonal covariance matrix for the parameter in formula (3), we obtain Gaussian Naive Bayes. Gaussian Naive Bayes is a type of Naive Bayes method that assumes the features follow a Gaussian distribution and are conditionally independent of each other. The covariance matrix we assume looks like the following.
Note that each class has its own diagonal covariance matrix (c refers to each class). Since the features are conditionally independent given the class, we can rewrite the log-likelihood function as follows.
D refers to the number of feature dimensions. We can derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative.
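For each class c and feature dimension d, the resulting estimates are the familiar per-class sample statistics (a reconstruction of the closed-form results in standard notation):
$$
\hat{\pi}_c = \frac{N_c}{N}, \qquad
\hat{\mu}_{c,d} = \frac{1}{N_c}\sum_{n:\,y_n=c} x_{n,d}, \qquad
\hat{\sigma}^2_{c,d} = \frac{1}{N_c}\sum_{n:\,y_n=c} \left(x_{n,d}-\hat{\mu}_{c,d}\right)^2
$$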
As you can see, the parameter estimates are just the per-class mean and variance of each feature, computed from the data. Now, let’s look at the visualization.
The decision boundary looks non-linear. Since we do not model correlations among features, each class’s Gaussian contour is an axis-aligned ellipse rather than a tilted one.
Implementation of Gaussian Naive Bayes
In this section, we will implement Gaussian Naive Bayes from scratch and verify the result using scikit-learn. We implement it as a class. Based on the above derivation, we just calculate each class’s sample mean and variance. Thus, the estimation part looks as follows:
def fit(self, X, y):
    """
    Fit the Gaussian Naive Bayes model.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training vectors, where n_samples is the number of samples
        and n_features is the number of features.
    y : array-like, shape (n_samples,)
        Target values.
    """
    n_samples, n_features = X.shape
    self.classes = np.unique(y)
    n_classes = len(self.classes)
    # Calculate class priors, means, and variances
    self.class_priors = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.variances = np.zeros((n_classes, n_features))
    for i, c in enumerate(self.classes):
        X_c = X[y == c]
        self.class_priors[i] = X_c.shape[0] / float(n_samples)
        self.means[i, :] = X_c.mean(axis=0)
        self.variances[i, :] = X_c.var(axis=0)
As you can see, we calculate the sample mean and variance for each class inside the for-loop. The whole class looks like the following implementation.
import numpy as np


class GaussianNaiveBayes:
    def __init__(self):
        self.classes = None
        self.class_priors = None
        self.means = None
        self.variances = None

    def fit(self, X, y):
        """
        Fit the Gaussian Naive Bayes model.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape (n_samples,)
            Target values.
        """
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        # Calculate class priors, means, and variances
        self.class_priors = np.zeros(n_classes)
        self.means = np.zeros((n_classes, n_features))
        self.variances = np.zeros((n_classes, n_features))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.class_priors[i] = X_c.shape[0] / float(n_samples)
            self.means[i, :] = X_c.mean(axis=0)
            self.variances[i, :] = X_c.var(axis=0)

    def predict(self, X):
        """
        Perform classification on an array of test vectors X.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Test vectors.

        Returns
        -------
        C : array, shape (n_samples,)
            Predicted target values for X.
        """
        n_samples = X.shape[0]
        posteriors = np.zeros((n_samples, len(self.classes)))
        # Calculate posterior probability for each class
        for i, c in enumerate(self.classes):
            prior = np.log(self.class_priors[i])
            # Calculate likelihood using log of probability density function
            likelihood = np.sum(np.log(self._pdf(i, X)), axis=1)
            posteriors[:, i] = prior + likelihood
        # Return the class with the highest posterior probability
        return self.classes[np.argmax(posteriors, axis=1)]

    def _pdf(self, class_idx, x):
        """
        Calculate the probability density function for a Gaussian distribution.
        """
        mean = self.means[class_idx]
        var = self.variances[class_idx]
        numerator = np.exp(- (x - mean)**2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn. I will attach the whole code later in this blog.
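For reference, a minimal sketch of this kind of comparison might look like the following (the variables X and y are assumed to come from the toy dataset sketch above, and the exact script behind the figures may differ):
from sklearn.naive_bayes import GaussianNB

# From-scratch model
gnb_scratch = GaussianNaiveBayes()
gnb_scratch.fit(X, y)
pred_scratch = gnb_scratch.predict(X)

# scikit-learn reference
gnb_sklearn = GaussianNB()
gnb_sklearn.fit(X, y)
pred_sklearn = gnb_sklearn.predict(X)

# Fraction of identical predictions (expected to be close to 1.0)
print((pred_scratch == pred_sklearn).mean())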
5. Introduction to Linear Discriminant Analysis
Mathematics of Linear Discriminant Analysis
When we assume the same full covariance matrix for all classes as the parameter in formula (3), we obtain linear discriminant analysis (LDA). Why do we call it linear even though the Gaussian density is a non-linear function? The key lies in the decision boundary; you will see why this method is called LDA in the visualization section. In this case, the covariance matrix we assume looks like the following.
Note that we assume a covariance matrix with full parameters, and all classes share the same covariance matrix. Using this parameter, we can write the log-likelihood function as follows.
Now we derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative.
For the covariance matrix, we can derive it as follows:
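A reconstruction of the resulting estimates in standard notation (the covariance is pooled over all classes):
$$
\hat{\pi}_c = \frac{N_c}{N}, \qquad
\hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{c=1}^{C}\sum_{n:\,y_n=c}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)^{\top}
$$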
As you can see, we can derive almost the same formula as the MLE result for a multivariate Gaussian distribution. The covariance matrix estimation is a bit tricky: we must pool over all data points, not estimate a covariance per class as in Gaussian Naive Bayes. Now, let’s look at the visualization of the result.
The decision boundary shape is linear! In fact, LDA’s name originated from the shape of the decision boundary. But why does the decision boundary become linear? Let’s work through the mathematics.
The decision boundary is where two classes have the same posterior probability. To calculate the decision boundary between classes 1 and 2, we can write the following equation.
Using formula (4), we can derive the following equations.
We can substitute the likelihood function with the Gaussian distribution as follows:
Using the above formula, we can rewrite the decision boundary formula as follows:
If we take the logarithm of this equation, the formula becomes as follows:
Moreover, if we define:
then we can write:
Thus, the decision boundary between any two classes will be a straight line because of the above formula.
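For reference, here is a reconstruction of that linear form in standard notation (sign conventions may differ from the original figures):
$$
\log\frac{\pi_1}{\pi_2} - \frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_1\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_1\right) + \frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_2\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_2\right) = 0
$$
Because the covariance is shared, the quadratic terms \(\mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x}\) cancel, leaving
$$
\mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
\mathbf{w} = \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2\right), \qquad
b = -\frac{1}{2}\boldsymbol{\mu}_1^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \log\frac{\pi_1}{\pi_2}
$$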
Implementation of Linear Discriminant Analysis
In this section, we will implement LDA from scratch and verify the result using scikit-learn. We implement it as a class. Based on the above derivation, we calculate each class’s sample mean and the shared covariance matrix. Thus, the estimation part looks as follows:
def fit(self, X, y):
    n_features = X.shape[1]
    class_labels = np.unique(y)
    n_classes = len(class_labels)
    # Calculate MLEs for priors, means, and covariance matrix
    self.classes = class_labels
    self.pi = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.covariance = np.zeros((n_features, n_features))
    Sw = np.zeros((n_features, n_features))
    for i, c in enumerate(class_labels):
        X_c = X[y == c]
        n_c = X_c.shape[0]
        # MLE for prior
        self.pi[i] = n_c / X.shape[0]
        # MLE for class mean
        self.means[i, :] = np.mean(X_c, axis=0)
        # Accumulate the within-class scatter
        Sw += (X_c - self.means[i, :]).T.dot((X_c - self.means[i, :]))
    # MLE for the shared (pooled) covariance matrix: divide by N
    self.covariance = Sw / X.shape[0]
The mean estimation is exactly the same as in Gaussian Naive Bayes. For the covariance matrix, we sum each class’s squared deviations (the within-class scatter) over all classes and then divide by the number of data points. The whole class looks like the following implementation.
import numpy as np


class LDA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.linear_discriminants = None
        self.classes = None  # Store the class labels
        self.pi = None  # Store the prior probabilities
        self.means = None  # Store the class means
        self.covariance = None  # Store the common covariance matrix

    def fit(self, X, y):
        n_features = X.shape[1]
        class_labels = np.unique(y)
        n_classes = len(class_labels)
        # Calculate MLEs for priors, means, and covariance matrix
        self.classes = class_labels
        self.pi = np.zeros(n_classes)
        self.means = np.zeros((n_classes, n_features))
        self.covariance = np.zeros((n_features, n_features))
        Sw = np.zeros((n_features, n_features))
        for i, c in enumerate(class_labels):
            X_c = X[y == c]
            n_c = X_c.shape[0]
            # MLE for prior
            self.pi[i] = n_c / X.shape[0]
            # MLE for class mean
            self.means[i, :] = np.mean(X_c, axis=0)
            # Accumulate the within-class scatter
            Sw += (X_c - self.means[i, :]).T.dot((X_c - self.means[i, :]))
        # MLE for the shared (pooled) covariance matrix: divide by N
        self.covariance = Sw / X.shape[0]

    def predict(self, X):
        """
        Predict the class labels for the given data.
        """
        if self.means is None or self.covariance is None or self.pi is None:
            raise Exception("Fit the model before making predictions.")
        n_samples = X.shape[0]
        # Invert the shared covariance matrix once, outside the loops
        inv_covariance = np.linalg.inv(self.covariance)
        indices = np.zeros(n_samples, dtype=int)
        for i in range(n_samples):
            x = X[i]
            posteriors = []
            for k in range(len(self.pi)):
                # Discriminant function: log prior plus the Gaussian exponent;
                # class-independent terms (evidence, log-determinant) are dropped
                prior = np.log(self.pi[k])
                mean_diff = x - self.means[k]
                likelihood = -0.5 * mean_diff.T @ inv_covariance @ mean_diff
                posteriors.append(prior + likelihood)
            indices[i] = np.argmax(posteriors)
        # Map the argmax indices back to the original class labels
        return self.classes[indices]
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn.
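As with Gaussian Naive Bayes, a minimal sketch of such a comparison might look like this (again, X and y are assumed from the toy dataset sketch, and n_components is passed only because the constructor above requires it):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# From-scratch model
lda_scratch = LDA(n_components=2)
lda_scratch.fit(X, y)
pred_scratch = lda_scratch.predict(X)

# scikit-learn reference
lda_sklearn = LinearDiscriminantAnalysis()
lda_sklearn.fit(X, y)
pred_sklearn = lda_sklearn.predict(X)

# Fraction of identical predictions (expected to be close to 1.0)
print((pred_scratch == pred_sklearn).mean())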
6. Introduction to Quadratic Discriminant Analysis
Mathematics of Quadratic Discriminant Analysis
When we assume a different full covariance matrix for each class as the parameter in formula (3), we obtain quadratic discriminant analysis (QDA). QDA can capture more complex relationships between variables than the other techniques. On the other hand, it has more parameters than the others, so it easily overfits when we have a small amount of data. In this case, the covariance matrices we assume look like the following.
Using this parameter, we can write the log-likelihood function as follows.
Now we derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative. In fact, the mean parameter derivation is the same as in LDA.
For the covariance matrix, we can derive it as follows:
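A reconstruction of the per-class estimates in standard notation:
$$
\hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)^{\top}
$$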
As you can see, we can derive almost the same formula as the MLE result for a multivariate Gaussian distribution. The difference is that we compute the sample mean and covariance matrix separately for each class. Now, let’s look at the visualization.
The decision boundary shape is quadratic. QDA’s name also originated from the shape of its decision boundary. Now, let’s derive the decision boundary as we did for LDA. Again, to calculate the decision boundary between classes 1 and 2, we can write the following equation.
We will skip some derivations that we already did in the LDA section. Substituting the Gaussian distribution into the decision boundary formula gives the following:
If we take the logarithm of this equation, the formula becomes as follows:
If we define:
then we can write:
This decision boundary formula has the same form as a quadratic function of x. Thus, the decision boundary shape becomes quadratic.
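For reference, a reconstruction of that quadratic form in standard notation (sign conventions may differ from the original figures):
$$
\mathbf{x}^{\top}\mathbf{W}\mathbf{x} + \mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
\mathbf{W} = -\frac{1}{2}\left(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1}\right), \qquad
\mathbf{w} = \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2
$$
$$
b = -\frac{1}{2}\boldsymbol{\mu}_1^{\top}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\top}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2 - \frac{1}{2}\log\frac{\lvert\boldsymbol{\Sigma}_1\rvert}{\lvert\boldsymbol{\Sigma}_2\rvert} + \log\frac{\pi_1}{\pi_2}
$$
Unlike LDA, the quadratic term does not cancel because the two covariance matrices differ.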
Implementation of Quadratic Discriminant Analysis
In this section, we will implement QDA from scratch and verify the result using scikit-learn. We implement it as a class. We calculate each class’s sample mean and covariance matrix. Thus, the estimation part looks as follows:
def fit(self, X, y):
    n_samples, n_features = X.shape
    self.classes = np.unique(y)
    n_classes = len(self.classes)
    # Calculate MLEs for priors, means, and covariance matrices
    self.pi = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.covariances = np.zeros((n_classes, n_features, n_features))
    for i, c in enumerate(self.classes):
        X_c = X[y == c]
        n_c = X_c.shape[0]
        # MLE for prior
        self.pi[i] = n_c / n_samples
        # MLE for class mean
        self.means[i, :] = np.mean(X_c, axis=0)
        # MLE for class covariance matrix (bias=True divides by N_c, matching the MLE)
        self.covariances[i, :, :] = np.cov(X_c, rowvar=False, bias=True)
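Only the estimation step is shown here, so below is a minimal predict sketch under these estimates (my own addition, not the original code; unlike LDA, the per-class log-determinant term must be kept):
def predict(self, X):
    """
    Predict class labels using the per-class Gaussian estimates.
    """
    n_samples = X.shape[0]
    log_posteriors = np.zeros((n_samples, len(self.classes)))
    for i, c in enumerate(self.classes):
        inv_cov = np.linalg.inv(self.covariances[i])
        _, logdet = np.linalg.slogdet(self.covariances[i])
        diff = X - self.means[i]
        # log prior - 0.5*log|Sigma_c| - 0.5*(x - mu_c)^T Sigma_c^{-1} (x - mu_c)
        mahalanobis = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
        log_posteriors[:, i] = (np.log(self.pi[i])
                                - 0.5 * logdet
                                - 0.5 * mahalanobis)
    # Return the class with the highest (unnormalized) log posterior
    return self.classes[np.argmax(log_posteriors, axis=1)]
With this method added, the same kind of comparison as before can be run against sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.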
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn.
In this article, we have learned about Gaussian discriminant analysis. It can be derived from the same Bayes formula with a multivariate Gaussian likelihood. When we change the assumption on the covariance matrix, the resulting techniques are called Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. Here is the code used.
Appendix: MLE derivations
We will derive the MLE for the multivariate Gaussian distribution. Before diving into the derivation, let’s quickly recap the matrix algebra below.
In the equations, a and b are vectors, and A and B are matrices. The notation tr(A) refers to the trace of a matrix. The last equation is called the cyclic permutation property of the trace operator. Using it, we can derive the trace trick, which reorders the scalar inner product as:
Note that, in terms of dimensions, this scalar inner product is (1×D)×(D×D)×(D×1). These equations will be used in the proof, so please keep them in mind.
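A reconstruction of the trace trick in this setting (standard identity; the article’s own equation numbering (6)-(10) is not reproduced here):
$$
\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)
= \operatorname{tr}\!\left(\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right)
= \operatorname{tr}\!\left(\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\right)
$$
The first equality holds because a scalar equals its own trace, and the second follows from the cyclic permutation property.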
Now, let’s derive the MLE for the multivariate Gaussian distribution. The log-likelihood function below will be used to estimate the parameters.
Firstly, let’s derive the mean parameter of a multivariate Gaussian distribution. We can cancel out the term unrelated to the parameter mean.
Using the substitution:
We can rewrite the formula using the formula (6) and derive the mean parameter as follows:
Next, let’s derive the covariance matrix parameter of a multivariate Gaussian distribution.
Using formulas (7), (8), (9), and (10), we can derive it as follows:
Reference
[1] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
Citation
“A Gentle Introduction to Gaussian Discriminant Analysis: Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis” by Yuki Shizuya
http://jeetwincasinos.com/p/e54d54f89c5a#533c-2948d83babfc
The Quantastic Journal
Issue January 2025
ISSN 3035–8000