The Dirichlet-Multinomial Model

The Dirichlet-Multinomial model is the generalization of the Beta-Binomial model to the multiple classes of a categorical or multinomial distribution. We use it to analyze data on the rates of multiple distinct outcomes.

The model has a wide range of applications. It can be used to analyze how much profit different traffic sources bring to a website, or how on-sale items in a store affect total profit. The Dirichlet-Multinomial also applies to document classification, where it can be used to model which topics a document covers or which corpus it belongs to.

Model Description

Suppose we are running a test with $K$ distinct categories of events that may be observed. The counts of observations in each category, denoted $c_1,\dots,c_K$, follow a multinomial distribution. The goal of the model is to estimate the distribution of the true proportions $p_1,\dots,p_K$ of the multinomial distribution.
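As a concrete illustration, such count data can be simulated with numpy. The true proportions below are hypothetical, chosen to roughly match the counts reused throughout this post:

import numpy as np

# Hypothetical true proportions over K = 3 categories
true_p = [0.76, 0.19, 0.05]

# One multinomial draw of the counts c_1, ..., c_K for 790 observations
counts = np.random.multinomial(790, true_p)
print(counts)  # e.g., [600 150  40]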

The conjugate prior for the multinomial distribution is the Dirichlet distribution, and applying it gives the Dirichlet-multinomial model. The model estimates the posterior distribution of $p_1,\dots,p_K$ given our current data and prior beliefs. Our prior beliefs are encoded in the model through the hyperparameter $\alpha$, a vector of pseudocounts representing what we believe the data should look like, typically set to all 1's for weak uniform beliefs. The following then describes the distribution of our probability estimates:

$$ p \ | \ \alpha, c \sim \text{Dir}(\alpha+c) $$
A good choice of $\alpha$ produces prior estimates that match the posterior we would have obtained from previously observed data. If we don't have such a baseline for our prior, then we should use the non-informative uniform prior, $\alpha_i = 1$ for all $i$.
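A minimal sketch of the conjugate update, assuming three categories, the hypothetical counts from above, and a uniform prior:

import numpy as np

counts = np.array([600., 150., 40.])  # observed counts c
alpha = np.ones_like(counts)          # uniform prior pseudocounts

# Conjugacy: the posterior is Dirichlet with parameters alpha + c
posterior_params = alpha + counts     # [601., 151., 41.]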

Further, since the marginals of the Dirichlet distribution are Betas, the distribution of each probability estimate is given by:

$$ p_i \ | \ \alpha, c \sim \text{Beta}\left(\beta_i, \sum_{k=1}^{K}{\beta_k} - \beta_i \right) $$

where $\beta_i = \alpha_i+c_i$.
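Using scipy, here is a sketch of how one might use these Beta marginals to compute equal-tailed 95% credible intervals for each $p_i$ (again with the hypothetical counts from above):

import numpy as np
from scipy import stats

beta_params = np.array([600., 150., 40.]) + 1.0  # beta_i = alpha_i + c_i
total = beta_params.sum()

for i, b in enumerate(beta_params):
    # Marginal of the Dirichlet: p_i ~ Beta(beta_i, sum(beta) - beta_i)
    lo, hi = stats.beta.interval(0.95, b, total - b)
    print(f"p_{i + 1}: 95% credible interval ({lo:.3f}, {hi:.3f})")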

We can calculate the expected value of each of our parameters as: $$ \text{E}[p_i \ | \ \alpha, c ] = \frac{\beta_i}{\sum_{k=1}^{K}{\beta_k}} $$

This is the posterior mean, the estimate that minimizes the mean squared error of our posterior. For most models it is the most useful estimate.
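For the hypothetical counts used above, the posterior means reduce to a one-liner:

import numpy as np

beta_params = np.array([600., 150., 40.]) + 1.0  # beta_i = alpha_i + c_i

# Posterior mean: beta_i / sum(beta)
print(beta_params / beta_params.sum())  # approximately [0.758, 0.190, 0.052]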

The posterior mode, or maximum a posteriori (MAP) estimate, of $\left(p_1,\dots,p_K \ | \ \alpha, c\right)$ is the vector $\left(x_1,\dots,x_K \right)$ where, provided every $\beta_i > 1$, each $x_i$ is defined as:

$$ x_i = \frac{\beta_i-1}{\sum_{k=1}^{K}{\beta_k} - K} $$

MAP estimates are best when a model requires us to be wrong the least often: the MAP is the single most probable value, so it minimizes zero-one loss. This is in contrast to the posterior mean, which minimizes mean squared error.
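A sketch of the corresponding MAP computation for the same hypothetical counts:

import numpy as np

beta_params = np.array([600., 150., 40.]) + 1.0  # beta_i = alpha_i + c_i
K = len(beta_params)

# Posterior mode: (beta_i - 1) / (sum(beta) - K); with a uniform prior this
# reduces to the observed proportions
print((beta_params - 1) / (beta_params.sum() - K))  # approximately [0.759, 0.190, 0.051]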

Weighted Distribution

A useful application of the model is to compare weighted sums over the categories, such as the distribution of expected profits from a categorical model.

Suppose we have a weighting function $\Omega$ with a component $\omega_i(p)$ for each category. If we want to estimate the distribution of weighted values, $W$, we can use a Monte Carlo simulation to approximate the distribution of expected weighted values:

$$ W(P)=\text{E}[\Omega(P)] \approx \sum_{i=1}^{K}{\omega_i(p_i) \cdot p_i} $$

where $P \sim \text{Dir}(\alpha+c)$.

In the case of linear weights we set $\omega_i(p)=w_i$ for a fixed weight $w_i$.

We can use Monte Carlo to estimate the means of our weighted distributions, as well as the differences between them; a sketch of such a comparison follows the Python implementation below.

Python Implementation

In Python this looks like the following:

import numpy

# Weighting function: linear weights (10, 50, 100) with a fixed offset of -14
def Omega(row):
    return row[0] * 10 + row[1] * 50 + row[2] * 100 - 14

observations = numpy.array([600., 150., 40.])
alphas = numpy.ones(observations.shape)  # uniform prior pseudocounts

# Draw 100,000 samples of p from the Dirichlet posterior
samples = numpy.random.dirichlet(observations + alphas, 100000)

# Apply the weighting function to each sampled draw
W_samples = numpy.apply_along_axis(Omega, 1, samples)

# Compute P(W > 10)
print((W_samples > 10).mean())
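To compare the weighted distributions of two variants, as mentioned above, we can draw from each posterior and look at the difference in $W$. A sketch, where the second set of counts is hypothetical:

import numpy

def W(samples):
    # Linear weights (10, 50, 100) with a fixed offset of -14
    return samples @ numpy.array([10., 50., 100.]) - 14

N = 100000
a_counts = numpy.array([600., 150., 40.])
b_counts = numpy.array([300., 100., 30.])  # hypothetical second variant

# The + 1 adds the uniform prior pseudocounts
a_post = numpy.random.dirichlet(a_counts + 1, N)
b_post = numpy.random.dirichlet(b_counts + 1, N)

diff = W(a_post) - W(b_post)
print("E[W_A - W_B] is approximately", diff.mean())
print("P(W_A > W_B) is approximately", (diff > 0).mean())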

Julia Implementation

In Julia it looks like this:

using Distributions: Dirichlet
using Statistics: mean

# Weighting function: linear weights (10, 50, 100) with a fixed offset of -14
Omega(A, B, C) = A * 10 + B * 50 + C * 100 - 14

observations = [600., 150., 40.]
alpha = ones(length(observations))  # uniform prior pseudocounts
posterior = Dirichlet(observations + alpha)

# Draw 100,000 samples of p; each column of `samples` is one draw
N = 100000
samples = rand(posterior, N)
W_samples = [Omega(samples[:, col]...) for col in 1:N]

# Compute P(W > 10)
mean(W_samples .> 10)