A Gentle Introduction to Gaussian Discriminant Analysis: Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis
Understand Gaussian discriminant analysis in depth through its mathematics and a Python implementation from scratch
Recently, I came across an interesting observation: Gaussian Naive Bayes, Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA) can all be derived from almost the same Bayes formula [1].
These methods use the multivariate Gaussian distribution for the likelihood, and the only difference is the covariance matrix they assume for it.
In this article, I will introduce the mathematics of these methods in detail and walk through a Python implementation from scratch to give a concrete sense of how they work.
1. Preliminaries: Bayes’ theorem
Bayes’ theorem is a fundamental concept for understanding Gaussian discriminant analysis, so we cover it first. To understand it, let’s review three fundamental rules of probability: joint probability, conditional probability, and marginal probability.
Joint probability
A joint probability is the probability that two events occur together. In the following figure, the joint probability corresponds to the overlapping part of the two events.
Conditional probability
A conditional probability measures the probability of one event given that another event has occurred. Intuitively, it is the proportion of event A’s probability within the occurrence of event B. In the following example, you can check the conditional probability with a visualization.
Marginal probability
A marginal probability is the probability of a single event on its own. When an event can co-occur with several other events, we can decompose its probability into the joint probabilities with each of those events. The following visualization helps you understand it.
In the above example, we assume that event A can occur with four other events B1, B2, B3, and B4. If we sum the joint probabilities of event A with each event B, we recover the probability of event A.
Bayes’ theorem
Now, we can derive Bayes’ theorem using these concepts as follows.
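Written out in standard notation (my reconstruction, with the Bayesian names attached to each term):
$$
\underbrace{p(A \mid B)}_{\text{posterior}}
= \frac{p(A, B)}{p(B)}
= \frac{\overbrace{p(B \mid A)}^{\text{likelihood}}\;\overbrace{p(A)}^{\text{prior}}}{\underbrace{p(B)}_{\text{evidence}}}
$$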
Although Bayes’ theorem sounds intimidating, it is just a combination of basic probability concepts built on conditional probability. In the Bayesian context, the terms of the formula above are called the likelihood, the prior, and the evidence.
In the case of two conditioned events, we can represent Bayes’ theorem as follows.
We can derive Bayes’ theorem in the same way as in the single-conditioning-event case. We will utilize this formula later in this article.
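For concreteness, one common way to write the two-condition form is (standard identity; my notation may differ from the original figure):
$$
p(A \mid B, C) = \frac{p(B \mid A, C)\, p(A \mid C)}{p(B \mid C)}
$$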
2. Preliminaries: Multivariate Gaussian distribution
Mathematics of Multivariate Gaussian distribution
The multivariate Gaussian distribution is another concept needed to understand Gaussian discriminant analysis, so let’s quickly recap it. The multivariate Gaussian distribution generalizes the one-dimensional Gaussian distribution to two or more dimensions. It has the probability density function below.
We assume the following input vector and parameters: x is a D×1 input vector, μ is the mean vector with the same dimension as x, and Σ is the D×D covariance matrix.
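Written out with these symbols, the density has the familiar form (standard notation):
$$
\mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}\right)
= \frac{1}{(2\pi)^{D/2}\,\lvert\boldsymbol{\Sigma}\rvert^{1/2}}
\exp\!\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right)
$$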
Basically, the multivariate Gaussian distribution is the multi-dimensional version of the one-dimensional Gaussian distribution. Compared to the one-dimensional case, it must account for the correlations among dimensions, so we use a covariance matrix rather than a single variance.
Parameter estimation: Maximum likelihood estimation (MLE)
In the previous section, we learned about the multivariate Gaussian distribution.
Here is a question: how can we estimate the parameters when we want to fit a multivariate Gaussian distribution to data?
One solution is to use Maximum Likelihood Estimation (MLE). What is MLE? MLE estimates the parameters of an assumed probability distribution given data by maximizing a likelihood function. The general formulas are presented below.
We assume that we have N data points. Equation (1) is called the likelihood function. It is the product of the conditional probabilities of the data given the distribution parameters. When applying MLE, we estimate the parameters that maximize this function, as in equation (2).
Intuitively, the higher the likelihood function value, the more plausible it is that the data were generated by the assumed probability distribution with those parameters.
Now, let’s look at a concrete example. When we use the multivariate Gaussian distribution as the assumed probability distribution, we can describe the likelihood function as shown below.
In practice, we apply natural logarithms to transform multiplication into summation (log-likelihood).
Then, we differentiate the log-likelihood function with respect to each parameter to estimate the parameters. The simple derivation is as shown below. We can ignore the terms that are not relevant to the parameters. (I will attach the detailed derivations in the appendix.)
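For reference, the well-known closed-form MLE solutions for the multivariate Gaussian are (these should match what the appendix derivation arrives at):
$$
\hat{\boldsymbol{\mu}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}\right)^{\top}
$$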
More generally, whenever the log-likelihood is differentiable, we can estimate the parameters of other probability distributions in the same way, by differentiation.
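As a quick numerical sanity check of these closed-form solutions, here is a minimal NumPy sketch (the "true" parameters are arbitrary illustrative values):
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters for a 2-D Gaussian (illustrative values only)
mu_true = np.array([1.0, -2.0])
sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])

# Draw N samples and compute the MLE estimates
X = rng.multivariate_normal(mu_true, sigma_true, size=10_000)
mu_mle = X.mean(axis=0)                          # (1/N) * sum of x_n
sigma_mle = np.cov(X, rowvar=False, bias=True)   # divide by N, not N - 1

print(mu_mle)     # close to mu_true
print(sigma_mle)  # close to sigma_true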
3. Introduction to Gaussian Discriminant Analysis
Now, let’s explore one of the important applications of the multivariate Gaussian distribution: Gaussian discriminant analysis (GDA). GDA is a supervised classification algorithm. It solves a classification problem by modeling the data-generation process; thus, it is also called a generative classifier. Concretely, GDA assumes a Gaussian distribution for each class’s data and defines the probabilistic model below.
x refers to the data, c refers to a class, and θ denotes the Gaussian distribution parameters. Formula (3) represents each data point’s probability density, i.e., how likely it is that the data point was generated from class c’s distribution.
However, this formula is inconvenient when we want the probability of each class given a data point. We utilize Bayes’ theorem to define a new formula, as shown below.
Formula (4) is the probability that a data point is assigned to class c.
The prior p(y=c|θ) represents how likely class c is in the given data. In the GDA case, we usually use the proportion of each class in the training data for the prior 𝜋.
Using formula (4), we can calculate the probability of each class given a data point and classify the data point to the class with the highest probability. We can derive this decision rule as follows.
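In other words, the rule amounts to picking the class with the largest log posterior; a reconstruction in standard notation (the evidence term is dropped in the next step) is:
$$
\hat{y} = \arg\max_{c}\; p(y=c \mid \mathbf{x}, \boldsymbol{\theta})
= \arg\max_{c}\; \Big[ \log p(\mathbf{x} \mid y=c, \boldsymbol{\theta}_c) + \log \pi_c \Big]
$$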
Note that we can ignore the denominator because it is not related to the class. If we know the parameters, we can calculate the assigned class of a given data point using the above formula. However, we usually don’t know the parameters, so we need to estimate them from the data. In that case, we can utilize maximum likelihood estimation (MLE). MLE finds the parameters that maximize the joint probability of the data and the classes. Here is the joint probability we have to maximize.
We assume the dataset is i.i.d. and can write the log-likelihood function as follows.
𝒟 represents the dataset, which consists of the data x and the corresponding class labels y=c.
C refers to the number of classes, and the dataset has N points. Now, we can find the parameters that maximize this function using derivatives. When we use different types of covariance matrices, the resulting techniques are the following:
- When we assume a diagonal covariance matrix → Gaussian Naive Bayes
- When we assume a full covariance matrix but share it for all the classes → Linear Discriminant Analysis (LDA)
- When we assume a full covariance matrix → Quadratic Discriminant Analysis (QDA)
Why do we need such assumptions at all, instead of always using QDA? Although MLE is very efficient in speed and simplicity, it has one problem: it can badly overfit in high dimensions. In particular, when we estimate a full covariance matrix, the MLE result can be unstable because the number of parameters is large. Thus, we usually use Gaussian Naive Bayes or LDA when we have a small amount of data.
Now, let’s check each technique in detail. I will explain the mathematics first, then visualization, and finally, the implementation of each method. We will use a randomly generated dataset. It has two feature dimensions with one label as shown below.
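The exact toy dataset is not reproduced here, but the following sketch generates a comparable two-feature, three-class dataset (the class means, covariances, and sample sizes are my own illustrative choices):
import numpy as np

rng = np.random.default_rng(42)

# Illustrative per-class means and covariances for a 2-D, 3-class toy dataset
means = [np.array([-2.0, 0.0]), np.array([2.0, 2.0]), np.array([0.0, -3.0])]
covs = [np.array([[1.0, 0.3], [0.3, 0.5]]),
        np.array([[0.7, -0.4], [-0.4, 1.2]]),
        np.array([[1.5, 0.0], [0.0, 0.4]])]
n_per_class = 100

# Stack the per-class samples and build the matching label vector
X = np.vstack([rng.multivariate_normal(m, c, size=n_per_class)
               for m, c in zip(means, covs)])
y = np.repeat(np.arange(len(means)), n_per_class)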
We use this toy dataset to visualize each class’s estimated distribution’s shape and decision boundary. Let’s dive into them!
4. Introduction to Gaussian Naive Bayes
Mathematics of Gaussian Naive Bayes
When we assume a diagonal covariance matrix for the parameter in formula (3), we obtain Gaussian Naive Bayes. Gaussian Naive Bayes is a type of Naive Bayes method that assumes the features follow a Gaussian distribution and are conditionally independent of each other. The covariance matrix we assume looks like the following.
Note that each class has its own diagonal covariance matrix (c refers to each class). Since the features are conditionally independent given the class, we can rewrite the log-likelihood function as follows.
D refers to the number of feature dimensions. We can derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative.
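For each class c and feature dimension d, the resulting estimates are the familiar per-class sample statistics (a reconstruction of the closed-form results in standard notation):
$$
\hat{\pi}_c = \frac{N_c}{N}, \qquad
\hat{\mu}_{c,d} = \frac{1}{N_c}\sum_{n:\,y_n=c} x_{n,d}, \qquad
\hat{\sigma}^2_{c,d} = \frac{1}{N_c}\sum_{n:\,y_n=c} \left(x_{n,d}-\hat{\mu}_{c,d}\right)^2
$$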
As you can see, the parameter estimates are just the per-class mean and variance of each feature, computed from the data. Now, let’s look at the visualization.
The decision boundary looks non-linear. Since we do not model correlations among features, each class’s Gaussian contour is an axis-aligned ellipse rather than a tilted one.
Implementation of Gaussian Naive Bayes
In this section, we will implement Gaussian Naive Bayes from scratch and verify the result using scikit-learn. We implement it as a class. Based on the above derivation, we just calculate each class’s sample mean and variance. Thus, the estimation part looks as follows:
def fit(self, X, y):
    """
    Fit the Gaussian Naive Bayes model.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training vectors, where n_samples is the number of samples
        and n_features is the number of features.
    y : array-like, shape (n_samples,)
        Target values.
    """
    n_samples, n_features = X.shape
    self.classes = np.unique(y)
    n_classes = len(self.classes)
    # Calculate class priors, means, and variances
    self.class_priors = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.variances = np.zeros((n_classes, n_features))
    for i, c in enumerate(self.classes):
        X_c = X[y == c]
        self.class_priors[i] = X_c.shape[0] / float(n_samples)
        self.means[i, :] = X_c.mean(axis=0)
        self.variances[i, :] = X_c.var(axis=0)
As you can see, we calculate the sample mean and variance for each class inside the for-loop. The whole class looks like the following implementation.
import numpy as np


class GaussianNaiveBayes:
    def __init__(self):
        self.classes = None
        self.class_priors = None
        self.means = None
        self.variances = None

    def fit(self, X, y):
        """
        Fit the Gaussian Naive Bayes model.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like, shape (n_samples,)
            Target values.
        """
        n_samples, n_features = X.shape
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        # Calculate class priors, means, and variances
        self.class_priors = np.zeros(n_classes)
        self.means = np.zeros((n_classes, n_features))
        self.variances = np.zeros((n_classes, n_features))
        for i, c in enumerate(self.classes):
            X_c = X[y == c]
            self.class_priors[i] = X_c.shape[0] / float(n_samples)
            self.means[i, :] = X_c.mean(axis=0)
            self.variances[i, :] = X_c.var(axis=0)

    def predict(self, X):
        """
        Perform classification on an array of test vectors X.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Test vectors.

        Returns
        -------
        C : array, shape (n_samples,)
            Predicted target values for X.
        """
        n_samples = X.shape[0]
        posteriors = np.zeros((n_samples, len(self.classes)))
        # Calculate posterior probability for each class
        for i, c in enumerate(self.classes):
            prior = np.log(self.class_priors[i])
            # Calculate likelihood using log of probability density function
            likelihood = np.sum(np.log(self._pdf(i, X)), axis=1)
            posteriors[:, i] = prior + likelihood
        # Return the class with the highest posterior probability
        return self.classes[np.argmax(posteriors, axis=1)]

    def _pdf(self, class_idx, x):
        """
        Calculate the probability density function for a Gaussian distribution.
        """
        mean = self.means[class_idx]
        var = self.variances[class_idx]
        numerator = np.exp(- (x - mean)**2 / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        return numerator / denominator
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn. I will attach the whole code later in this blog.
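For reference, a minimal sketch of this kind of comparison might look like the following (the variables X and y are assumed to come from the toy dataset sketch above, and the exact script behind the figures may differ):
from sklearn.naive_bayes import GaussianNB

# From-scratch model
gnb_scratch = GaussianNaiveBayes()
gnb_scratch.fit(X, y)
pred_scratch = gnb_scratch.predict(X)

# scikit-learn reference
gnb_sklearn = GaussianNB()
gnb_sklearn.fit(X, y)
pred_sklearn = gnb_sklearn.predict(X)

# Fraction of identical predictions (expected to be close to 1.0)
print((pred_scratch == pred_sklearn).mean())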
5. Introduction to Linear Discriminant Analysis
Mathematics of Linear Discriminant Analysis
When we assume the same full covariance matrix for all classes as the parameter in formula (3), we obtain linear discriminant analysis (LDA). Why do we call it linear even though the Gaussian density is a non-linear function? The key lies in the decision boundary; you will see why this method is called LDA in the visualization section. In this case, the covariance matrix we assume looks like the following.
Note that we assume a covariance matrix with full parameters, and all classes share the same covariance matrix. Using this parameter, we can write the log-likelihood function as follows.
Now we derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative.
For the covariance matrix, we can derive it as follows:
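A reconstruction of the resulting estimates in standard notation (the covariance is pooled over all classes):
$$
\hat{\pi}_c = \frac{N_c}{N}, \qquad
\hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N}\sum_{c=1}^{C}\sum_{n:\,y_n=c}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)^{\top}
$$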
As you can see, we can derive almost the same formula as the MLE result for a multivariate Gaussian distribution. The covariance matrix estimation is a bit tricky: we must pool over all data points, not estimate a covariance per class as in Gaussian Naive Bayes. Now, let’s look at the visualization of the result.
The decision boundary shape is linear! In fact, LDA’s name originated from the shape of the decision boundary. But why does the decision boundary become linear? Let’s work through the mathematics.
The decision boundary is where two classes have the same posterior probability. To calculate the decision boundary between classes 1 and 2, we can write the following equation.
Using formula (4), we can derive the following equations.
We can substitute the likelihood function with the Gaussian distribution as follows:
Using the above formula, we can rewrite the decision boundary formula as follows:
If we take the logarithm of this equation, the formula becomes as follows:
Moreover, if we define:
then we can write:
Thus, the decision boundary between any two classes will be a straight line because of the above formula.
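For reference, here is a reconstruction of that linear form in standard notation (sign conventions may differ from the original figures):
$$
\log\frac{\pi_1}{\pi_2} - \frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_1\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_1\right) + \frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_2\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_2\right) = 0
$$
Because the covariance is shared, the quadratic terms \(\mathbf{x}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{x}\) cancel, leaving
$$
\mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
\mathbf{w} = \boldsymbol{\Sigma}^{-1}\left(\boldsymbol{\mu}_1-\boldsymbol{\mu}_2\right), \qquad
b = -\frac{1}{2}\boldsymbol{\mu}_1^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \log\frac{\pi_1}{\pi_2}
$$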
Implementation of Linear Discriminant Analysis
In this section, we will implement LDA from scratch and verify the result using scikit-learn. We implement it as a class. Based on the above derivation, we calculate each class’s sample mean and the shared covariance matrix. Thus, the estimation part looks as follows:
def fit(self, X, y):
    n_features = X.shape[1]
    class_labels = np.unique(y)
    n_classes = len(class_labels)
    # Calculate MLEs for priors, means, and covariance matrix
    self.classes = class_labels
    self.pi = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.covariance = np.zeros((n_features, n_features))
    Sw = np.zeros((n_features, n_features))
    for i, c in enumerate(class_labels):
        X_c = X[y == c]
        n_c = X_c.shape[0]
        # MLE for prior
        self.pi[i] = n_c / X.shape[0]
        # MLE for class mean
        self.means[i, :] = np.mean(X_c, axis=0)
        # Accumulate the within-class scatter
        Sw += (X_c - self.means[i, :]).T.dot((X_c - self.means[i, :]))
    # MLE for the shared (pooled) covariance matrix: divide by N
    self.covariance = Sw / X.shape[0]
The mean estimation is exactly the same as in Gaussian Naive Bayes. For the covariance matrix, we sum each class’s squared deviations (the within-class scatter) over all classes and then divide by the number of data points. The whole class looks like the following implementation.
import numpy as np


class LDA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.linear_discriminants = None
        self.classes = None  # Store the class labels
        self.pi = None  # Store the prior probabilities
        self.means = None  # Store the class means
        self.covariance = None  # Store the common covariance matrix

    def fit(self, X, y):
        n_features = X.shape[1]
        class_labels = np.unique(y)
        n_classes = len(class_labels)
        # Calculate MLEs for priors, means, and covariance matrix
        self.classes = class_labels
        self.pi = np.zeros(n_classes)
        self.means = np.zeros((n_classes, n_features))
        self.covariance = np.zeros((n_features, n_features))
        Sw = np.zeros((n_features, n_features))
        for i, c in enumerate(class_labels):
            X_c = X[y == c]
            n_c = X_c.shape[0]
            # MLE for prior
            self.pi[i] = n_c / X.shape[0]
            # MLE for class mean
            self.means[i, :] = np.mean(X_c, axis=0)
            # Accumulate the within-class scatter
            Sw += (X_c - self.means[i, :]).T.dot((X_c - self.means[i, :]))
        # MLE for the shared (pooled) covariance matrix: divide by N
        self.covariance = Sw / X.shape[0]

    def predict(self, X):
        """
        Predict the class labels for the given data.
        """
        if self.means is None or self.covariance is None or self.pi is None:
            raise Exception("Fit the model before making predictions.")
        n_samples = X.shape[0]
        # Invert the shared covariance matrix once, outside the loops
        inv_covariance = np.linalg.inv(self.covariance)
        indices = np.zeros(n_samples, dtype=int)
        for i in range(n_samples):
            x = X[i]
            posteriors = []
            for k in range(len(self.pi)):
                # Discriminant function: log prior plus the Gaussian exponent;
                # class-independent terms (evidence, log-determinant) are dropped
                prior = np.log(self.pi[k])
                mean_diff = x - self.means[k]
                likelihood = -0.5 * mean_diff.T @ inv_covariance @ mean_diff
                posteriors.append(prior + likelihood)
            indices[i] = np.argmax(posteriors)
        # Map the argmax indices back to the original class labels
        return self.classes[indices]
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn.
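As with Gaussian Naive Bayes, a minimal sketch of such a comparison might look like this (again, X and y are assumed from the toy dataset sketch, and n_components is passed only because the constructor above requires it):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# From-scratch model
lda_scratch = LDA(n_components=2)
lda_scratch.fit(X, y)
pred_scratch = lda_scratch.predict(X)

# scikit-learn reference
lda_sklearn = LinearDiscriminantAnalysis()
lda_sklearn.fit(X, y)
pred_sklearn = lda_sklearn.predict(X)

# Fraction of identical predictions (expected to be close to 1.0)
print((pred_scratch == pred_sklearn).mean())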
6. Introduction to Quadratic Discriminant Analysis
Mathematics of Quadratic Discriminant Analysis
When we assume a different full covariance matrix for each class as the parameter in formula (3), we obtain quadratic discriminant analysis (QDA). QDA can capture more complex relationships between variables than the other techniques. On the other hand, it has more parameters than the others, so it easily overfits when we have a small amount of data. In this case, the covariance matrices we assume look like the following.
Using this parameter, we can write the log-likelihood function as follows.
Now we derive each parameter by taking a derivative. We can cancel the terms that are not relevant to the derivative. In fact, the mean parameter derivation is the same as in LDA.
For the covariance matrix, we can derive it as follows:
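A reconstruction of the per-class estimates in standard notation:
$$
\hat{\boldsymbol{\mu}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\mathbf{x}_n, \qquad
\hat{\boldsymbol{\Sigma}}_c = \frac{1}{N_c}\sum_{n:\,y_n=c}\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)\left(\mathbf{x}_n-\hat{\boldsymbol{\mu}}_c\right)^{\top}
$$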
As you can see, we can derive almost the same formula as the MLE result for a multivariate Gaussian distribution. The difference is that we compute the sample mean and covariance matrix separately for each class. Now, let’s look at the visualization.
The decision boundary shape is quadratic. QDA’s name also originated from the shape of its decision boundary. Now, let’s derive the decision boundary as we did for LDA. Again, to calculate the decision boundary between classes 1 and 2, we can write the following equation.
We will skip some derivations that we already did in the LDA section. Substituting the Gaussian distribution into the decision boundary formula gives the following:
If we take the logarithm of this equation, the formula becomes as follows:
If we define:
then we can write:
This decision boundary formula has the same form as a quadratic function of x. Thus, the decision boundary shape becomes quadratic.
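For reference, a reconstruction of that quadratic form in standard notation (sign conventions may differ from the original figures):
$$
\mathbf{x}^{\top}\mathbf{W}\mathbf{x} + \mathbf{w}^{\top}\mathbf{x} + b = 0, \qquad
\mathbf{W} = -\frac{1}{2}\left(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1}\right), \qquad
\mathbf{w} = \boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2
$$
$$
b = -\frac{1}{2}\boldsymbol{\mu}_1^{\top}\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^{\top}\boldsymbol{\Sigma}_2^{-1}\boldsymbol{\mu}_2 - \frac{1}{2}\log\frac{\lvert\boldsymbol{\Sigma}_1\rvert}{\lvert\boldsymbol{\Sigma}_2\rvert} + \log\frac{\pi_1}{\pi_2}
$$
Unlike LDA, the quadratic term does not cancel because the two covariance matrices differ.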
Implementation of Quadratic Discriminant Analysis
In this section, we will implement QDA from scratch and verify the result using scikit-learn. We implement it as a class. We calculate each class’s sample mean and covariance matrix. Thus, the estimation part looks as follows:
def fit(self, X, y):
    n_samples, n_features = X.shape
    self.classes = np.unique(y)
    n_classes = len(self.classes)
    # Calculate MLEs for priors, means, and covariance matrices
    self.pi = np.zeros(n_classes)
    self.means = np.zeros((n_classes, n_features))
    self.covariances = np.zeros((n_classes, n_features, n_features))
    for i, c in enumerate(self.classes):
        X_c = X[y == c]
        n_c = X_c.shape[0]
        # MLE for prior
        self.pi[i] = n_c / n_samples
        # MLE for class mean
        self.means[i, :] = np.mean(X_c, axis=0)
        # MLE for class covariance matrix (bias=True divides by N_c, matching the MLE)
        self.covariances[i, :, :] = np.cov(X_c, rowvar=False, bias=True)
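Only the estimation step is shown here, so below is a minimal predict sketch under these estimates (my own addition, not the original code; unlike LDA, the per-class log-determinant term must be kept):
def predict(self, X):
    """
    Predict class labels using the per-class Gaussian estimates.
    """
    n_samples = X.shape[0]
    log_posteriors = np.zeros((n_samples, len(self.classes)))
    for i, c in enumerate(self.classes):
        inv_cov = np.linalg.inv(self.covariances[i])
        _, logdet = np.linalg.slogdet(self.covariances[i])
        diff = X - self.means[i]
        # log prior - 0.5*log|Sigma_c| - 0.5*(x - mu_c)^T Sigma_c^{-1} (x - mu_c)
        mahalanobis = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
        log_posteriors[:, i] = (np.log(self.pi[i])
                                - 0.5 * logdet
                                - 0.5 * mahalanobis)
    # Return the class with the highest (unnormalized) log posterior
    return self.classes[np.argmax(log_posteriors, axis=1)]
With this method added, the same kind of comparison as before can be run against sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.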
The estimation results using the class are below.
The comparison results using the scikit-learn library are below.
The results are almost the same! We can verify that the scratch implementation is compatible with scikit-learn.
In this article, we have learned about Gaussian discriminant analysis. It can be derived from the same Bayes formula with a multivariate Gaussian likelihood. When we change the assumption on the covariance matrix, the resulting techniques are called Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. Here is the code used.
Appendix: MLE derivations
We will derive the MLE for the multivariate Gaussian distribution. Before diving into the derivation, let’s quickly recap the matrix algebra below.
In the equations, a and b are vectors, and A and B are matrices. The notation tr(A) refers to the trace of a matrix. The last equation is called the cyclic permutation property of the trace operator. Using it, we can derive the trace trick, which reorders the scalar inner product as:
Note that, in terms of dimensions, this scalar inner product is (1×D)×(D×D)×(D×1). These equations will be used in the proof, so please keep them in mind.
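A reconstruction of the trace trick in this setting (standard identity; the article’s own equation numbering (6)-(10) is not reproduced here):
$$
\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)
= \operatorname{tr}\!\left(\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\right)
= \operatorname{tr}\!\left(\boldsymbol{\Sigma}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}\right)\left(\mathbf{x}-\boldsymbol{\mu}\right)^{\top}\right)
$$
The first equality holds because a scalar equals its own trace, and the second follows from the cyclic permutation property.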
Now, let’s derive the MLE for the multivariate Gaussian distribution. The log-likelihood function below will be used to estimate the parameters.
Firstly, let’s derive the mean parameter of a multivariate Gaussian distribution. We can cancel out the term unrelated to the parameter mean.
Using the substitution:
We can rewrite the formula using the formula (6) and derive the mean parameter as follows:
Next, let’s derive the covariance matrix parameter of a multivariate Gaussian distribution.
Using formulas (7), (8), (9), and (10), we can derive it as follows:
Reference
[1] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
Citation
“A Gentle Introduction to Gaussian Discriminant Analysis: Gaussian Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis” by Yuki Shizuya
http://jeetwincasinos.com/p/e54d54f89c5a#533c-2948d83babfc
The Quantastic Journal
Issue January 2025
ISSN 3035–8000