Let’s say you want to determine whether a puppy like the one in the figure above (on the right) is a Siberian Husky or an Alaskan Malamute. Of course, this is a complete stranger’s pet, so you pull out your smartphone and search the internet for examples of each breed. After looking through a few *labeled examples*, like the ones illustrated in the two rows above, you hope to make the correct prediction. The most intuitive way to do this is to maximize some notion of "similarity" between the puppy and the two sets of labeled photos. However, it is unclear which notion of similarity is correct in this context.

For example, one could ask, "How similar is the picture of the *puppy* to each row?" Conversely, one could ask, "How similar is each *row* to the picture of the puppy?" Finally, "If the picture of the puppy were added to each group, which would be more internally consistent?"

Abstractly, these questions may seem equivalent, but formalizing each reveals that this is not quite true.

# Formal Problem

Formally, given a set of labeled training data $X^k = \{x^k_1, \dots, x^k_{n_k}\}$ for each category $k \in \{A, B\}$, there are at least three ways that one could compute the probability of a label for a test point $x^*$:

- using the posterior predictive of $x^*$ conditioned on $X^k$ for each $k$,

\begin{equation}
p\left(x^* \mid X^k\right), \label{ppTarget}
\end{equation}

- using the marginal likelihoods of each category's training data combined with the test point,

\begin{equation}
p\left(X^k \cup \{x^*\}\right), \label{ml}
\end{equation}

- and using the posterior predictive of the training data $X^k$ conditioned on $x^*$,

\begin{equation}
p\left(X^k \mid x^*\right), \label{ppTraining}
\end{equation}

where, in each case, the category probabilities are obtained by normalizing over the two categories. Equation \eqref{ppTarget} is intuitive, but what about Equations \eqref{ml} and \eqref{ppTraining}?

# TL;DR

Equation \eqref{ppTarget} is the most numerically stable, the most intuitive, and requires the fewest "hacks" to get the job done.

# Concrete Example: 2D Binary Classification

Suppose we have 2-dimensional data from two categories $k \in \{A, B\}$. We assume they are generated by the following conjugate Gaussian model:

\begin{equation}
\begin{aligned}
\mu_A, \mu_B &\sim \mathcal{N}\left(\mu_0, \Sigma_0\right), \\
x^A_1, \dots, x^A_{n_A} &\sim \mathcal{N}\left(\mu_A, I\right), \\
x^B_1, \dots, x^B_{n_B} &\sim \mathcal{N}\left(\mu_B, I\right).
\end{aligned}
\end{equation}

That is, the means $\mu_A$, $\mu_B$ of categories $A$, $B$ are modeled as draws from a Gaussian prior, whose mean is $\mu_0$ and covariance is $\Sigma_0$. Using these means, labeled samples $x^A_i$ and $x^B_j$ are assumed to be drawn from two Gaussian likelihoods with identity covariance and means $\mu_A$, $\mu_B$, respectively. $n_A$ and $n_B$ are simply the numbers of data sampled from the respective categories.

To make the marginal likelihood simpler to compute, we assume $\Sigma_0$ to be a diagonal matrix, meaning that each of the 2 dimensions is independent of the other. Samples from this model are illustrated in the scatter plot below (see also the generative model and plotting code).

**Fig. 1: Generative Model.** **(A)** Distribution of the training data. The 200 light blue triangles represent samples from category A and the 100 dark blue squares samples from category B, all of which serve as the training data for classification. The black disks represent the likelihood means sampled from the prior distribution, and the red disks the empirical means of categories A and B. **(B)** The greater of the two category probabilities for every point in the 1000x1000 grid, according to the likelihood Gaussians. **(C)** The natural logarithm of the probabilities in **B**. **(D)** Classification certainty map: the probabilities in **C**, normalized by the sum of the two category probabilities in log space (i.e., the true binary classification log probabilities).

# Results

**Fig. 2: Classification certainty maps obtained with between-category normalization only.** For every point in the 1000x1000 grid, **A**, **B**, **C** plot the greater of the two classification probabilities. **D**, **E**, **F** plot the natural logarithm of these probabilities, respectively. The pairs of classification probabilities for column **A**+**D** are computed with \eqref{ppTarget}, **B**+**E** with \eqref{ml}, and **C**+**F** with \eqref{ppTraining}.

Fig. 2B-C reveal that the scale of the probabilities is inversely related to the number of category training data when computing \eqref{ml} and \eqref{ppTraining} (i.e., the more training data, the smaller the scale of the probabilities). One way to address this issue is to perform within-category normalization before the between-category normalization. For example, to compute \eqref{ml}, one could obtain a within-category normalizing constant for a category $k$ by summing $p\left(X^k \cup \{x^*\}\right)$ over every test point $x^*$ on the grid (see the code computing the Fig. 3 classification probabilities). Performing this normalization tells us the relative probabilities of points on the grid for a specific category. Results using this approach are illustrated below.

**Fig. 3: Classification certainty maps obtained with within-category and between-category normalization.** For every point in the 1000x1000 grid, **A**, **B**, **C** plot the greater of the two classification probabilities, computed using within-category normalized probabilities. **D**, **E**, **F** plot the natural logarithm of the probabilities in **A**, **B**, **C**, respectively. The pairs of classification probabilities for column **A**+**D** are computed with \eqref{ppTarget}, **B**+**E** with \eqref{ml}, and **C**+**F** with \eqref{ppTraining}, *after* normalizing each term in the equations by its respective within-category normalizing constant.

## Posterior Predictive of the Test Datum

**The decision boundary is largely recovered.**
- To classify a test point, simply assign the label with the higher category probability.

**This approach is robust to differences in the numbers of category training data.**
- The above results are obtained with 200 training samples for category A and 100 for category B, yet the decision boundary remains linear and is shifted correctly.

**Probabilities must be computed in log space to remain numerically stable.**
- The black regions in Fig. 2A illustrate areas assigned 0 probability for *both* classes. However, we know this cannot be true, because the form of the Gaussian probability density function tells us there is probability density on the entire space.
- Examining the probabilities in log space (Fig. 2D and Fig. 3A) reveals that these regions do have a small amount of probability density.
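The underflow behind those black regions is easy to reproduce in isolation; a minimal sketch (not part of the original code), showing a perfectly valid Gaussian density becoming exactly 0 in linear space while remaining representable in log space:

```python
import numpy as np
from scipy.stats import multivariate_normal as gaussian

dist = gaussian(mean=np.zeros(2), cov=np.eye(2))
far_point = np.asarray([40., 40.])

# In linear space the density underflows 64-bit floats to exactly 0...
print(dist.pdf(far_point))     # 0.0
# ...but its logarithm (about -1601.8) is perfectly representable.
print(dist.logpdf(far_point))
```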

## Marginal Likelihoods of the Training Data + Test Datum

**Scales of the terms in \eqref{ml} are tied to the numbers of training data for the categories.**
- Each marginal likelihood term in \eqref{ml} is a joint density over $n_k + 1$ data points, where $n_k$ is the number of training data for category $k$.
- This means that the scale of each term (i.e., probability) is strongly tied to the number of corresponding category training data.
- Unless something is done to account for this issue, \eqref{ml} will assign almost all points to the category with the least training data (Fig. 2B and Fig. 2E).
- Performing within-category normalization is one such hacky solution (Fig. 3B and Fig. 3E).

**\eqref{ml} is equivalent to \eqref{ppTarget} under certain conditions.**
- For the assumed model, we can prove that \eqref{ppTarget} and \eqref{ml} are mathematically equivalent in the case of one test point and no training data. I will upload the proof when I get the chance.
- Moreover, if within-category normalization is performed before between-category normalization, \eqref{ppTarget} and \eqref{ml} are equivalent (Fig. 3A vs. Fig. 3B and Fig. 3D vs. Fig. 3E).
- If you require further convincing, run the following after computing the classification probabilities for Fig. 3.
`np.allclose(log_ps_eq1, log_ps_eq2, rtol=1e-100) # Returns True.`
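Why within-category normalization makes \eqref{ppTarget} and \eqref{ml} agree follows directly from the product rule; a short derivation sketch:

$$
p\left(X^k \cup \{x^*\}\right) = p\left(x^* \mid X^k\right) p\left(X^k\right),
$$

and $p\left(X^k\right)$ does not depend on $x^*$, so it multiplies every grid point's score by the same constant. Dividing by the within-category normalizing constant $\sum_{x'} p\left(X^k \cup \{x'\}\right) = p\left(X^k\right) \sum_{x'} p\left(x' \mid X^k\right)$ cancels $p\left(X^k\right)$, leaving exactly the within-category normalized version of \eqref{ppTarget}.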

## Posterior Predictive of the Training Data

**\eqref{ppTraining} is numerically unstable, even in log space.**
- That is, the probabilities are so small that extra care and computer memory are required to track them. Fig. 2C misleadingly shows that points are assigned 0 probability for *both* classes unless they are very close to the category means.
- Even in log space (Fig. 2F and Fig. 3C), the magnitudes of the values are in the 1000s, for a *small* set of training data ($n_A = 200$ and $n_B = 100$) of *low*-dimensional (2D) data.
- This renders the approach useless in most practical settings, where datasets are orders of magnitude larger in sample size and data dimension, causing the probabilities to become significantly smaller.

**Classification *un*certainty is coupled with the sample size ratio of the category training data.**
- Fig. 2F, Fig. 3C, and Fig. 3F show curvature of the decision boundary about category A. This means that \eqref{ppTraining} will be *more* likely to misclassify points from category A as samples from category B, even though there is more training data for category A!
- The curvature is slightly less pronounced with within-category normalization before between-category normalization (Fig. 3C), but the problem still remains.
- This is something to avoid: we want classification accuracy to increase or plateau as the number of training data increases, not decline.
- Of course, we could exclude some data so that the training sets for each category are equal in size, but this is clearly an inefficient use of data.
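The scale problem can be isolated with a toy computation (not part of the original code): the log probability under \eqref{ppTraining} is a sum of one log density per training point, so its magnitude grows linearly with the size of the training set:

```python
import numpy as np
from scipy.stats import multivariate_normal as gaussian

dist = gaussian(mean=np.zeros(2), cov=np.eye(2))
data = dist.rvs(size=200, random_state=np.random.RandomState(0))

# Each point contributes one (negative) log density, so the joint
# log probability drifts toward minus infinity as n grows.
for n in (50, 100, 200):
    print(n, dist.logpdf(data[:n]).sum())
```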

# Concluding Remarks

- The most important lesson here is quite general: even if you manage to estimate the marginal or predictive distributions "well enough", classification performance can still be poor if the probabilities aren’t used together properly.
- \eqref{ppTarget} is the safest and most intuitive approach.

# Implementation

Given the above model, \eqref{ppTarget} is implemented as

\begin{equation}
p\left(x^* \mid X^k\right) = \mathcal{N}\left(x^* \mid m_k,\ \Sigma_k + I\right),
\end{equation}

where $p\left(x^* \mid X^k\right)$ is the posterior predictive for the model, $\Sigma_k = \Sigma_0\left(n_k \Sigma_0 + I\right)^{-1}$, and $m_k = n_k \Sigma_k \overline{x}^k$.

- In words, $\Sigma_k$ is the posterior covariance on the mean and $\overline{x}^k$ is the empirical mean.
- Code

\eqref{ml} is implemented as

\begin{equation}
p\left(X^k \cup \{x^*\}\right) = \prod_{d=1}^{2} \frac{1}{(2\pi)^{n/2} \sqrt{n \sigma_{0,d}^2 + 1}} \exp\left(-\frac{1}{2}\sum_{i=1}^{n} x_{i,d}^2 + \frac{n^2 \sigma_{0,d}^2\, \overline{x}_d^2}{2\left(n \sigma_{0,d}^2 + 1\right)}\right),
\end{equation}

where $n = n_k + 1$ counts the category's training data plus the test datum, $\sigma_{0,d}^2$ is the $d$-th diagonal entry of $\Sigma_0$, $\overline{x}_d$ is the empirical mean along dimension $d$ (using the prior mean $\mu_0 = 0$), and $d$ indexes a dimension.

- Because the prior and likelihood covariances are diagonal, each dimension is independent, which allows us to compute the marginal likelihood for each dimension separately, and then take the product to obtain the overall marginal likelihood of the data.
- The derivation of the general form of the marginal likelihood in the univariate case can be found in Section 2.5 of Kevin Murphy's *Conjugate Bayesian analysis of the Gaussian distribution*.
- Code
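As a quick numerical check of the independence argument (a sketch, not part of the original post), the log density of a diagonal-covariance Gaussian is the sum of the per-dimension log densities:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

prior_cov = np.diag([4., 9.])  # the prior covariance used in this post
x = np.asarray([1., -2.])

log_joint = multivariate_normal(mean=np.zeros(2), cov=prior_cov).logpdf(x)
log_per_dim = sum(norm(loc=0., scale=np.sqrt(var)).logpdf(x_d)
                  for var, x_d in zip(prior_cov.diagonal(), x))
print(np.allclose(log_joint, log_per_dim))  # True
```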

\eqref{ppTraining} is implemented as

\begin{equation}
p\left(X^k \mid x^*\right) = \prod_{i=1}^{n_k} \mathcal{N}\left(x^k_i \mid \Sigma_* x^*,\ \Sigma_* + I\right),
\end{equation}

where $\Sigma_* = \Sigma_0\left(\Sigma_0 + I\right)^{-1}$ is the posterior covariance on the mean after conditioning on the single test datum.

- For a category, we condition on the test datum to compute the posterior predictive probabilities of all the training data in that category. Then, we take the product of these posterior predictives.
- Code

# Code

```
import string
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec
from scipy.stats import multivariate_normal as gaussian
from scipy.misc import logsumexp  # In SciPy >= 1.0, import from scipy.special.
# Python 3.6.3
# The string module is included in the Python Standard Library.
# In : np.__version__
# Out: '1.13.3'
# In : matplotlib.__version__
# Out: '2.1.0'
# In : scipy.__version__
# Out: '0.19.1'
plt.style.use('seaborn')
mpl.rcParams['xtick.labelsize'] = 10
mpl.rcParams['ytick.labelsize'] = 10
mpl.rcParams['font.size'] = 14
mpl.rcParams['axes.titlesize'] = 16
np.random.seed(0)
```

```
def sample_model(n_A, n_B, return_params=False):
    assert type(return_params) == bool
    n_dims = 2
    prior_mean = np.zeros(n_dims)
    prior_cov = np.diag(np.asarray([4, 9]))
    prior = gaussian(mean=prior_mean, cov=prior_cov)
    likelihood_means = prior.rvs(2)  # Sample 2 means from the prior.
    likelihood_cov = np.eye(n_dims)
    likelihood_A = gaussian(mean=likelihood_means[0], cov=likelihood_cov)
    likelihood_B = gaussian(mean=likelihood_means[1], cov=likelihood_cov)
    data_A = likelihood_A.rvs(n_A)
    data_B = likelihood_B.rvs(n_B)
    if return_params:
        return data_A, data_B, \
            likelihood_means, likelihood_cov, prior_mean, prior_cov
    else:
        return data_A, data_B
```

Posterior predictive of the test point, given the training data.

```
def calc_log_pp_test_datum(test_datum, conditioned_data, prior_cov):
    """ data is assumed to be row-wise: shape=(n, 2). """
    n = conditioned_data.shape[0]
    prior_cov_diag = prior_cov.diagonal()
    # Posterior covariance on the mean: Sigma_0 (n Sigma_0 + I)^-1.
    posterior_cov_diag = prior_cov_diag / (prior_cov_diag * n + 1)
    pp_mean = conditioned_data.sum(axis=0) * posterior_cov_diag
    # Posterior predictive covariance adds the identity likelihood covariance.
    pp_cov_diag = posterior_cov_diag + 1
    pp_cov = np.diag(pp_cov_diag)
    calc_logp = gaussian(mean=pp_mean, cov=pp_cov).logpdf
    return calc_logp(test_datum)
```

Marginal likelihood of the data.

```
def calc_log_ml(data, prior_cov):
    """ data is assumed to be row-wise: shape=(n, 2). """
    prior_cov_diag = prior_cov.diagonal()
    n = data.shape[0]
    # Normalizing constant per dimension: (2 pi)^(-n/2) (n sigma_0^2 + 1)^(-1/2).
    log_constants = -.5 * (n * np.log(2 * np.pi)
                           + np.log(n * prior_cov_diag + 1))
    exponent_1 = -.5 * (data ** 2).sum(axis=0)
    exponent_2 = .5 * n ** 2 * prior_cov_diag * data.mean(axis=0) ** 2
    exponent_2 = exponent_2 / (n * prior_cov_diag + 1)
    return (log_constants + exponent_1 + exponent_2).sum()
```

Posterior predictive of the training data, given the test point.

```
def calc_log_pp_training_data(training_data, conditioned_datum, prior_cov):
    """ data is assumed to be row-wise: shape=(n, 2). """
    prior_cov_diag = prior_cov.diagonal()
    # Posterior on the mean after conditioning on the single test datum.
    posterior_cov_diag = prior_cov_diag / (prior_cov_diag + 1)
    posterior_mean = conditioned_datum * posterior_cov_diag
    # Posterior predictive covariance adds the identity likelihood covariance.
    pp_cov = np.diag(posterior_cov_diag + 1)
    calc_log_pp_x = gaussian(mean=posterior_mean, cov=pp_cov).logpdf
    log_probs = calc_log_pp_x(training_data)
    return log_probs.sum()
```

```
def plot_probs(probs, title, x_bounds, y_bounds):
    assert len(probs.shape) == 2
    n_pxls_1, n_pxls_2 = probs.shape
    x_min, x_max = x_bounds
    y_min, y_max = y_bounds
    xticks = np.linspace(0, n_pxls_1, 5)
    yticks = np.linspace(0, n_pxls_2, 5)
    xticklabels = np.linspace(x_min, x_max, 5)
    yticklabels = np.linspace(y_max, y_min, 5)
    fig, ax = plt.subplots(1)
    plt.setp(ax,
             yticks=yticks,
             yticklabels=yticklabels,
             xticks=xticks,
             xticklabels=xticklabels)
    im = ax.imshow(probs, cmap='magma')
    fig.colorbar(im, ax=ax, fraction=.046, pad=.04)
    plt.title(title)
    plt.show()
```

Sample the generative model for training data.

```
# Sample 200 points for category A and 100 points for category B.
data_A, data_B, \
    likelihood_means, likelihood_cov, \
    prior_mean, prior_cov = sample_model(n_A=200, n_B=100, return_params=True)
x_mean_A = data_A.mean(axis=0)
x_mean_B = data_B.mean(axis=0)
empirical_means = np.asarray([x_mean_A, x_mean_B])
# With the seed set, the following should be your variable values.
#
# In : likelihood_means
# Out:
# array([[ 0.80031442, 5.29215704],
# [ 4.4817864 , 2.93621395]])
#
# In : likelihood_cov
# Out:
# array([[ 1., 0.],
# [ 0., 1.]])
#
# In : prior_mean
# Out: array([ 0., 0.])
#
# In : prior_cov
# Out:
# array([[4, 0],
# [0, 9]])
#
# In : empirical_means
# Out:
# array([[ 0.7154225 , 5.29006778],
# [ 4.3428008 , 2.84722654]])
#
```

Make a grid of test points and compute the ground truth probabilities.

```
# Bounds for plotting.
x_min = round(np.min(np.vstack([data_A, data_B])[:, 0])) - 1
x_max = round(np.max(np.vstack([data_A, data_B])[:, 0])) + 1
y_min = round(np.min(np.vstack([data_A, data_B])[:, 1])) - 1
y_max = round(np.max(np.vstack([data_A, data_B])[:, 1])) + 1
# Create a grid of test points according to bounds.
n_pxls = 100 # Actual figures are 1000x1000, not 100x100.
x = np.linspace(x_min, x_max, n_pxls)
y = np.linspace(y_max, y_min, n_pxls)
X, Y = np.meshgrid(x, y)
test_pts = np.asarray([X.flatten(), Y.flatten()]).T
# Calculate the ground truth probabilities for the test points.
distribution_A = gaussian(mean=likelihood_means[0], cov=likelihood_cov)
distribution_B = gaussian(mean=likelihood_means[1], cov=likelihood_cov)
log_ps_A = distribution_A.logpdf(test_pts)
log_ps_B = distribution_B.logpdf(test_pts)
map_log_ps = np.max([log_ps_A, log_ps_B], axis=0)
map_log_ps_normed = map_log_ps - logsumexp([log_ps_A, log_ps_B], axis=0)
# In : (x_min, x_max, y_min, y_max)
# Out: (-3.0, 8.0, -1.0, 9.0)
```

Figure 1A: plot the sampled data, sampled likelihood means, and prior mean.

```
# Figure 1A.
fig, ax = plt.subplots(1)
colors = plt.get_cmap('Blues')(np.arange(200))
ax.scatter(data_A[:, 0], data_A[:, 1], label='k = {}'.format('A'),
           marker='^', s=20, c=colors[149])
ax.scatter(data_B[:, 0], data_B[:, 1], label='k = {}'.format('B'),
           marker='s', s=20, c=colors[199])
ax.scatter(likelihood_means[:, 0],
           likelihood_means[:, 1],
           color='black', s=75, label=r'$\mathbf{\mu}_k$')
ax.scatter(*prior_mean, color='gray', s=50)
ax.scatter(empirical_means[:, 0],
           empirical_means[:, 1],
           color='red', s=25, label=r'$\mathbf{\overline{x}}^k$')
ax.set_xticks(np.linspace(x_min, x_max, 5))
ax.set_yticks(np.linspace(y_max, y_min, 5))
ax.set_xticklabels(np.linspace(x_min, x_max, 5))
ax.set_yticklabels(np.linspace(y_max, y_min, 5))
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.set_aspect('equal')
ax.legend(title='Category', loc='upper right', prop={'weight': 'bold'})
plt.title('Figure 1A')
plt.show()
```

Figures 1B-D.

```
# Figures 1B-D.
plot_probs(np.exp(map_log_ps).reshape(n_pxls, n_pxls),
           'Figure 1B', (x_min, x_max), (y_min, y_max))
plot_probs(map_log_ps.reshape(n_pxls, n_pxls),
           'Figure 1C', (x_min, x_max), (y_min, y_max))
plot_probs(map_log_ps_normed.reshape(n_pxls, n_pxls),
           'Figure 1D', (x_min, x_max), (y_min, y_max))
```

Calculate probabilities for Figure 2.

```
# Calculate probabilities according to Equations (1), (2), and (3).
# Looping for readability, but ought to be done with matrix operations.
log_ps_A_eq1 = []
log_ps_B_eq1 = []
log_ps_A_eq2 = []
log_ps_B_eq2 = []
log_ps_A_eq3 = []
log_ps_B_eq3 = []
for pt in test_pts:
    ps = calc_log_pp_test_datum(pt, data_A, prior_cov)
    log_ps_A_eq1.append(ps)
    ps = calc_log_pp_test_datum(pt, data_B, prior_cov)
    log_ps_B_eq1.append(ps)
    ps = calc_log_ml(np.vstack([data_A, pt]), prior_cov)
    log_ps_A_eq2.append(ps)
    ps = calc_log_ml(np.vstack([data_B, pt]), prior_cov)
    log_ps_B_eq2.append(ps)
    ps = calc_log_pp_training_data(data_A, pt, prior_cov)
    log_ps_A_eq3.append(ps)
    ps = calc_log_pp_training_data(data_B, pt, prior_cov)
    log_ps_B_eq3.append(ps)
log_ps_A_eq1 = np.asarray(log_ps_A_eq1)
log_ps_B_eq1 = np.asarray(log_ps_B_eq1)
log_ps_A_eq2 = np.asarray(log_ps_A_eq2)
log_ps_B_eq2 = np.asarray(log_ps_B_eq2)
log_ps_A_eq3 = np.asarray(log_ps_A_eq3)
log_ps_B_eq3 = np.asarray(log_ps_B_eq3)
# Compute between-category normalizing constants in natural log space.
log_norms_eq1 = logsumexp([log_ps_A_eq1, log_ps_B_eq1], axis=0)
log_norms_eq2 = logsumexp([log_ps_A_eq2, log_ps_B_eq2], axis=0)
log_norms_eq3 = logsumexp([log_ps_A_eq3, log_ps_B_eq3], axis=0)
# Get the max category probability for each test point.
log_ps_eq1 = np.amax([log_ps_A_eq1, log_ps_B_eq1], axis=0)
log_ps_eq2 = np.amax([log_ps_A_eq2, log_ps_B_eq2], axis=0)
log_ps_eq3 = np.amax([log_ps_A_eq3, log_ps_B_eq3], axis=0)
```

```
plot_probs(np.exp(log_ps_eq1).reshape(n_pxls, n_pxls),
           'Figure 2A', (x_min, x_max), (y_min, y_max))
plot_probs(np.exp(log_ps_eq2).reshape(n_pxls, n_pxls),
           'Figure 2B', (x_min, x_max), (y_min, y_max))
plot_probs(np.exp(log_ps_eq3).reshape(n_pxls, n_pxls),
           'Figure 2C', (x_min, x_max), (y_min, y_max))
plot_probs(log_ps_eq1.reshape(n_pxls, n_pxls),
           'Figure 2D', (x_min, x_max), (y_min, y_max))
plot_probs(log_ps_eq2.reshape(n_pxls, n_pxls),
           'Figure 2E', (x_min, x_max), (y_min, y_max))
plot_probs(log_ps_eq3.reshape(n_pxls, n_pxls),
           'Figure 2F', (x_min, x_max), (y_min, y_max))
```

Calculate probabilities for Figure 3.

```
# Within-category normalization.
log_ps_A_eq1 -= logsumexp(log_ps_A_eq1)
log_ps_B_eq1 -= logsumexp(log_ps_B_eq1)
log_ps_A_eq2 -= logsumexp(log_ps_A_eq2)
log_ps_B_eq2 -= logsumexp(log_ps_B_eq2)
log_ps_A_eq3 -= logsumexp(log_ps_A_eq3)
log_ps_B_eq3 -= logsumexp(log_ps_B_eq3)
# Compute between-category normalizing constants in natural log space.
log_norms_eq1 = logsumexp([log_ps_A_eq1, log_ps_B_eq1], axis=0)
log_norms_eq2 = logsumexp([log_ps_A_eq2, log_ps_B_eq2], axis=0)
log_norms_eq3 = logsumexp([log_ps_A_eq3, log_ps_B_eq3], axis=0)
# Get the max category probability for each test point.
log_ps_eq1 = np.amax([log_ps_A_eq1, log_ps_B_eq1], axis=0)
log_ps_eq2 = np.amax([log_ps_A_eq2, log_ps_B_eq2], axis=0)
log_ps_eq3 = np.amax([log_ps_A_eq3, log_ps_B_eq3], axis=0)
log_classification_ps_eq1 = log_ps_eq1 - log_norms_eq1
log_classification_ps_eq2 = log_ps_eq2 - log_norms_eq2
log_classification_ps_eq3 = log_ps_eq3 - log_norms_eq3
```

Figure 3. Note that the variables used below were overwritten by the within-category normalization in the code block above.

```
plot_probs(log_ps_eq1.reshape(n_pxls, n_pxls),
           'Figure 3A', (x_min, x_max), (y_min, y_max))
plot_probs(log_ps_eq2.reshape(n_pxls, n_pxls),
           'Figure 3B', (x_min, x_max), (y_min, y_max))
plot_probs(log_ps_eq3.reshape(n_pxls, n_pxls),
           'Figure 3C', (x_min, x_max), (y_min, y_max))
plot_probs(log_classification_ps_eq1.reshape(n_pxls, n_pxls),
           'Figure 3D', (x_min, x_max), (y_min, y_max))
plot_probs(log_classification_ps_eq2.reshape(n_pxls, n_pxls),
           'Figure 3E', (x_min, x_max), (y_min, y_max))
plot_probs(log_classification_ps_eq3.reshape(n_pxls, n_pxls),
           'Figure 3F', (x_min, x_max), (y_min, y_max))
```