In this assignment, your task will be to implement and analyze inference in the Bayesian Latent Dirichlet Allocation (LDA) model as described in the corresponding slides from the BAYa class. You will accomplish this task by completing this Jupyter Notebook, which already comes with code for generating the training data and some plotting functions for presenting the results. If you do not have any experience with Jupyter Notebook, the easiest way to start is to install Anaconda3, run Jupyter Notebook, and open this notebook downloaded as BAYa_Assignment2022.ipynb. You can also find some inspiration and pieces of code to reuse (e.g. the KL divergence for the Dirichlet distribution) in the other Jupyter Notebooks provided for this class.
The Notebook is organized as follows:
Do not edit the code in the following cells for generating and presenting the training data!
# Run this code! But there is no need to pay much attention to this cell at the first pass through the notebook
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import scipy.stats as sps
# Pre-calculate some global data used by functions plot_simplex, plot_dirichlet, and plot_points_in_simplex
_corners = np.array([[0, 0], [1, 0], [0.5, 0.75**0.5]])
_invT = np.linalg.inv(_corners[:2]-_corners[2])
_triangle = tri.Triangulation(_corners[:, 0], _corners[:, 1])
_refiner = tri.UniformTriRefiner(_triangle)
_trimesh = _refiner.refine_triangulation(subdiv=7)
# Now convert _trimesh 2D cartesian coordinates to 3D barycentric coordinates (i.e. 3D point on the 2D simplex)
# as described in https://en.wikipedia.org/wiki/Barycentric_coordinate_system#Edge_approach
# We calculate only the first 2 barycentric coordinates as sps.dirichlet.pdf is happy without the last one (l3 = 1 - l1 - l2)
_tol=1.e-8
_l1l2 = (np.c_[_trimesh.x, _trimesh.y] -_corners[2]) @ _invT
_l1l2 = np.clip(_l1l2, 2*_tol, 1.0-_tol) - _tol # to make sure that none of the probabilities is exactly zero or one
def plot_simplex(class_labels=[""]*3):
'''Plot "axis" for 2-simplex. It simply plots a triangle into which we will be ploting points
representing Categorical distributions (for 3 categories/topics) or a Dirichlet distribution.
Arguments:
class_labels: list of 3 strings that are used as labels (i.e. category/topic names) for the
simplex (triangle) corners
'''
plt.triplot(_triangle, linewidth=1)
#plt.xlim(0, 1)
#plt.ylim(0, 0.75**0.5)
plt.axis('equal')
plt.axis('off')
plt.text(0, 0, class_labels[0], horizontalalignment='left', verticalalignment='top')
plt.text(1, 0, class_labels[1], horizontalalignment='right', verticalalignment='top')
plt.text(0.5, 0.75**0.5, class_labels[2], horizontalalignment='center', verticalalignment='bottom')
def plot_dirichlet(alpha, nlevels=128, **kwargs):
'''Plot Dirichlet pdf in an equilateral triangle (2-simplex).
Arguments:
alpha: Dirichlet distribution parameters.
nlevels (int): Number of contours (shades) to draw.
kwargs: Keyword args passed on to `plt.tricontourf`.
'''
plt.tricontourf(_trimesh, sps.dirichlet.pdf(_l1l2.T, alpha), nlevels, cmap='gray_r', **kwargs)
def plot_points_in_simplex(X, **kwargs):
'''Plots a set of points in the 2-simplex. Each point can represent
a categorical distribution with 3 categories/topics.
Arguments:
X: A Nx3 array in barycentric coordinates of points to plot.
kwargs: Keyword args passed on to `plt.plot`.
'''
plt.plot(*(X @ _corners).T, **kwargs)
def plot_topic_distributions(Phi, vocabulary, top_V=10):
'''Plot words and their probabilities for each topic.
Arguments:
Phi: A KxD matrix where rows are topic specific distributions
vocabulary: List of strings that are words corresponding to columns of Phi
top_V: Only top_V most likely words and their probabilities are shown.
'''
plt.figure(figsize=(20, 4))
for k, topic_dist in enumerate(Phi):
plt.subplot(1, len(Phi), k+1)
sort_ixs = np.argsort(topic_dist)[::-1]
top_V_words = [vocabulary[i] for i in sort_ixs[:top_V]]
plt.barh(np.arange(len(top_V_words)/2, 0, -0.5), topic_dist[sort_ixs[:top_V]], height=0.4)
plt.yticks(np.arange(len(top_V_words)/2, 0, -0.5), top_V_words, fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.title("Topic %2d" % (k+1,))
plt.xlabel("Probability")
The generative process assumed by the Bayesian Latent Dirichlet Allocation model is
\begin{align} {\boldsymbol\varphi}_k &\sim \operatorname{Dir}(\boldsymbol{\beta}_0), && \text{for } k=1, \dots, K\\ {\boldsymbol\theta}_d &\sim \operatorname{Dir}(\boldsymbol{\alpha}_0), && \text{for } d=1, \dots, D\\ z_{dn} &\sim \operatorname{Cat}(\boldsymbol\theta_d), &&\text{for } d=1, \dots, D,\quad n=1, \dots, N_d\\ w_{dn} &\sim \operatorname{Cat}(\boldsymbol\varphi_{z_{dn}}), &&\text{for } d=1, \dots, D,\quad n=1, \dots, N_d,\\ \end{align}where ${\boldsymbol\varphi}_k$ are topic-specific word distributions, ${\boldsymbol\theta}_d$ are document-specific topic distributions (topic mixture weights), $z_{dn}=k$ denotes that the $n^{th}$ word in document $d$ comes from topic $k$, and $w_{dn}=v$ denotes that the $n^{th}$ word in document $d$ is $v$. However, our training data will not be represented in terms of $w_{dn}$. It will be compactly represented by a $D\times V$ word count matrix $\boldsymbol{M}$ with elements $M_{dv}$ counting how many times document $d$ contains word $v$.
Let the $V\times K$ matrix of topic-specific word distributions be $$\boldsymbol\Phi = [{\boldsymbol\varphi}_1, {\boldsymbol\varphi}_2, \dots, {\boldsymbol\varphi}_K],$$ and let the $K\times D$ matrix of document-specific topic distributions (topic mixture weights) be $$\boldsymbol\Theta = [{\boldsymbol\theta}_1, {\boldsymbol\theta}_2, \dots, {\boldsymbol\theta}_D]$$
Mathematicians like to use column vectors and form matrices by stacking the vectors into the columns of the matrices. In python, it is more convenient to work with row vectors and stack them into matrix rows. Therefore, everything in the code will be transposed as compared to the equations. E.g., the $V\times K$ matrix $\boldsymbol\Phi$ will be represented by numpy.array named Phi with shape (K, V). Similarly, $\boldsymbol\Theta$ will be represented by numpy.array named Theta with shape (D, K).
First, we handcraft the matrix $\boldsymbol\Phi$ containing topic-specific word distributions. We handcraft $\boldsymbol\Phi$ directly rather than sampling it from the Dirichlet prior as the generative process for Bayesian LDA would suggest. The Dirichlet prior will still be used later for the LDA model training. We store the handcrafted matrix $\boldsymbol\Phi$ in the variable Phi_gt, where _gt stands for "ground truth". We will generate our training data using this matrix, and we hope to learn this matrix (or some matrix close to it) back during the LDA model training. We consider only a toy example with $K=3$ topics ("sports", "computers", "food") and a vocabulary of only $V=5$ distinct words (stored in the variable vocabulary).
vocabulary = [ 'tenis', 'surfing','software','apple', 'burger'] # TOPIC:
Phi_gt = np.array([[0.7, 0.3, 0.0, 0.0, 0.0], # sports
[0.0, 0.2, 0.5, 0.3, 0.0], # computers
[0.0, 0.0, 0.0, 0.4, 0.6]]) # food
plot_topic_distributions(Phi_gt, vocabulary)
Note that the word "surfing" can likely occur for both topics "sports" and "computers". Similarly, the word "apple" can likely occur for both topics "food" and "computers". Note also that we have given names to the topics in our example, but in real data, the topics in the training data and their number (the parameter $K$) are unknown.
We used the function plot_topic_distributions to plot the word distributions for each topic. Each plot shows the probabilities of words for one topic sorted by the word probability. The function shows (at most) only the top_V=10 most likely words, which will be useful when figuring out what the topics correspond to for real data with many words in the vocabulary.
Now, we will handcraft the matrix $\boldsymbol\Theta$ with the topic distributions (or topic mixture weights) ${\boldsymbol\theta}_d$, one for each training document $d$. We will pretend that there are 4 thematically focused document collections in our training data with documents about: sports, computers, food, and food for sportsmen.
Each collection will contain documents_per_collection=20 training documents. In the following code, for each document $d$, the topic mixture weights ${\boldsymbol\theta}_d$ are sampled from a collection-specific Dirichlet distribution (see the grayscale plots produced by the next cell). The parameters of these 4 Dirichlet distributions are given in the rows of the matrix alpha4collections in the code below.
# sports computers food # Collection of documents about ...
alpha4collections = np.array([[ 12, 1, 1], # sports
[ 2, 8, 1], # computers
[ 1, 1, 13], # food
[ 12, 2, 10]]) # food for sportsmen
documents_per_collection = 20
Theta_gt = []
plt.figure(figsize=(14, 3))
# For each collection
for i, a in enumerate(alpha4collections):
plt.subplot(1, len(alpha4collections), i+1)
plt.title(r'$\alpha$ = (%.3f, %.3f, %.3f)' % tuple(a))
# Plot 2-simplex (a triangle) and add topic labels to the corners of the triangle where those topics have probability 1
plot_simplex(['Sports', 'Computers', 'Food'])
# Plot collection-specific Dirichlet distribution (in grayscale)
plot_dirichlet(a)
# Sample topic mixture weights for all documents in one collection
thetas4collection = sps.dirichlet.rvs(a, documents_per_collection)
# Plot topic mixture weights as points in the simplex
plot_points_in_simplex(thetas4collection, c='b', ls='none', marker='+')
Theta_gt.append(thetas4collection)
# Concatenate all topic weights from all collections into a single matrix
Theta_gt = np.concatenate(Theta_gt)
# For each document, remember the label saying which collection it comes from
collection_label = np.repeat(range(len(alpha4collections)), documents_per_collection)
Each plot (triangle) produced by the code above is a 2-dimensional simplex corresponding to one of the collections. It shows the corresponding Dirichlet distribution (in grayscale). Each blue cross is a sample from that Dirichlet distribution representing the topic mixture weights ${\boldsymbol\theta}_d$ for one training document. We can see that, in the first collection (documents about sports), all documents have high weights for the topic "sports" and low weights for the other two topics. In contrast, in the last collection (documents about food for sportsmen), all documents have comparably high weights for the topics "sports" and "food" and a low weight for the topic "computers".
At the end of the code in the cell above, we concatenate all the document-specific mixture weights ${\boldsymbol\theta}_d$ from all the collections into a single matrix Theta_gt representing the "ground truth" matrix $\boldsymbol\Theta$. Just like in the case of Phi_gt, our task will be to recover Theta_gt (or something close to it) during the LDA model training.
We also remember the collection label for each training document in the variable collection_label. One of your tasks will be to cluster the documents into collections, and collection_label will be useful to see whether you manage to recover the original "ground truth" document collections.
In the code below, we plot all the mixture weights ${\boldsymbol\theta}_d$ for all training documents into a single simplex. We also stop pretending that we know the topic names. Instead, we use the most likely word from each topic-specific distribution as the representative topic labels (i.e. the labels in the corners of the simplex). We also indicate the "ground truth" collection label by the color and marker of each point. This is how you will later plot the learned mixture weights ${\boldsymbol\theta}_d$ and how you will be able to indicate your clustering of the documents.
plot_simplex([vocabulary[np.argmax(p)] for p in Phi_gt])
markers = ['+', 'x', '^', '*']
colors = ['r', 'b', 'g', 'm']
for i, t in enumerate(Theta_gt):
plot_points_in_simplex(t, c=colors[collection_label[i]], ls='none', marker=markers[collection_label[i]])
Once we have the matrices $\boldsymbol\Phi$ and $\boldsymbol\Theta$, we can obtain the word distribution specific to document $d$ as $\boldsymbol\Phi {\boldsymbol\theta}_d$, or, more efficiently, the matrix of distributions for all documents as $\boldsymbol\Phi\boldsymbol\Theta$. Finally, for document $d$, we can sample the vector of word counts (i.e. a row of the word count matrix $\boldsymbol{M}$) from the distribution $\operatorname{Multinomial}(\boldsymbol\Phi{\boldsymbol\theta}_d, N_d)$. For simplicity, in the code below, we assume that each training document contains the same number of $N_d=N=100$ words. The training word count matrix is stored in the variable M.
PhiTheta = Theta_gt @ Phi_gt
N=100
M =np.vstack([sps.multinomial.rvs(N, dd) for dd in PhiTheta])
print("The word count matrix for the first 10 training documents:\n", M[:10])
The word count matrix for the first 10 training documents:
 [[69 26  0  2  3]
 [69 25  0  2  4]
 [68 30  2  0  0]
 [60 31  2  3  4]
 [60 27  2  4  7]
 [56 25  4  7  8]
 [63 36  0  0  1]
 [58 18  4  9 11]
 [59 22  5  5  9]
 [70 23  4  2  1]]
Note that $C_{dv}^k$ can be stored as elements of a 3D matrix and $\sum_{d=1}^D C_{dv}^k$ can be easily evaluated for all $v$ and $k$ by summing the matrix over the $d$ dimension. Similarly, we can sum the matrix over the $v$ dimension to get all $\sum_{v=1}^V C_{dv}^k$.
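For illustration only (an assumption about how you might store the counts; D, V and K are as defined in the cells below), with the counts held in a numpy array C of shape (D, V, K), both marginal sums are single calls:
C = np.zeros((D, V, K))            # C[d, v, k] = number of tokens of word v in document d assigned to topic k
word_topic_counts = C.sum(axis=0)  # shape (V, K): sum over documents, i.e. sum_d C_dv^k for all v, k
doc_topic_counts = C.sum(axis=1)   # shape (D, K): sum over the vocabulary, i.e. sum_v C_dv^k for all d, k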
To monitor the progress, we can evaluate $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ in every iteration (See slide 8. "Joint probability using the counts" in the BAYa class slides).
1. Implement Gibbs Sampling inference for training the Bayesian Latent Dirichlet Allocation model described above.
Store $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ in every iteration to monitor how the Gibbs Sampling inference progresses.
The following cell comes with the definition of the variables that you will use in your implementation. Namely, alpha0 are the parameters of the prior $p({\boldsymbol\theta}_d) = \operatorname{Dir}(\boldsymbol{\alpha}_0)$ and beta0 are the parameters of the prior $p({\boldsymbol\varphi}_k) = \operatorname{Dir}(\boldsymbol{\beta}_0)$. For simplicity, we will use the flat non-informative prior (i.e. $\boldsymbol{\alpha}_0$ and $\boldsymbol{\beta}_0$ are vectors of ones). In the variables Theta and Phi, your code will store the samples of $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$. Start the GS iterations from the initial values of $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ provided in the code.
Note that, for the LDA model training, one needs to choose the hyperparameter $K$, which is the assumed number of topics. We choose $K=3$, which will allow us to plot each sampled $\boldsymbol{\theta}_d$ in the 2-dimensional simplex. However, we are somewhat cheating here, as we know that this is the ground-truth number of topics.
# Make use of the following variables
D, V = M.shape # number of training documents D and size of vocabulary V
K = 3
alpha0 = np.ones(K, dtype='float') # Parameters of prior for Theta
beta0 = np.ones(V, dtype='float') # Parameters of prior for Phi
Theta = np.ones((D,K))/K # Initial values for Theta; store new Theta sampled in each iteration to this variable
Phi = np.ones((K,V))/V # Initial values for Phi; store new Phi sampled in each iteration to this variable
#Your code for Gibbs sampling inference
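For reference, below is a minimal sketch of one possible implementation (an assumption, not the only correct approach): in each iteration it samples the topic assignment counts C[d, v, k] from their multinomial conditionals, then samples new rows of Theta and Phi from the corresponding Dirichlet conditionals, and finally evaluates the joint log-probability through the counts. The names n_iters, C, and log_joint are introduced here for illustration.
n_iters = 200
log_joint = []                     # ln P(W, Z, Theta, Phi) after each iteration
C = np.zeros((D, V, K))            # C[d, v, k] = number of tokens of word v in document d assigned to topic k
for it in range(n_iters):
    # Sample Z (represented by the counts C): all tokens of word v in document d share
    # the same conditional P(z = k), proportional to Theta[d, k] * Phi[k, v]
    for d in range(D):
        for v in range(V):
            p = Theta[d] * Phi[:, v]
            C[d, v] = np.random.multinomial(M[d, v], p / p.sum())
    # Sample Theta[d] from Dir(alpha0 + sum_v C[d, v, :]) and Phi[k] from Dir(beta0 + sum_d C[d, :, k])
    Theta = np.vstack([sps.dirichlet.rvs(alpha0 + C[d].sum(axis=0))[0] for d in range(D)])
    Phi = np.vstack([sps.dirichlet.rvs(beta0 + C[:, :, k].sum(axis=0))[0] for k in range(K)])
    # Joint log-probability ln P(W, Z, Theta, Phi) expressed through the counts
    lp = np.sum(C * (np.log(Theta)[:, None, :] + np.log(Phi).T[None, :, :]))
    lp += sum(sps.dirichlet.logpdf(Theta[d], alpha0) for d in range(D))
    lp += sum(sps.dirichlet.logpdf(Phi[k], beta0) for k in range(K))
    log_joint.append(lp)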
2. Plot the evolution of the joint probability $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ over the iterations to monitor how the Gibbs sampling inference progresses.
#Your code goes here
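A minimal plotting snippet, assuming the list log_joint collected in the sketch above:
plt.plot(log_joint)
plt.xlabel('Gibbs sampling iteration')
plt.ylabel(r'$\ln P(W, Z, \Theta, \Phi)$')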
3. Use the function plot_topic_distributions to plot the sampled topic-specific word distributions $\boldsymbol{\Phi}$. Also, plot the ground truth version of it.
#Your code goes here
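For example, assuming Phi holds the sample from the last GS iteration:
plot_topic_distributions(Phi, vocabulary)     # sampled topic-specific word distributions
plot_topic_distributions(Phi_gt, vocabulary)  # ground truth for comparison
Keep in mind that the learned topics may come out in a different order than in Phi_gt, since the topic labels are not identifiable.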
4. The final sampled topic mixture weights $\boldsymbol{\theta}_d$ can be seen as low-dimensional representations of the documents that carry information about their topics. Cluster the training documents into 4 clusters, where each cluster should ideally correspond to one of the ground-truth document collections.
#Your code for clustering
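One possible approach (a sketch, using scikit-learn's KMeans as an extra dependency not prescribed by the assignment): run K-means on the final sampled Theta and plot the points marked by the found cluster, analogously to the ground-truth plot above.
from sklearn.cluster import KMeans
cluster_label = KMeans(n_clusters=4, n_init=10).fit_predict(Theta)
plot_simplex([vocabulary[np.argmax(p)] for p in Phi])
markers = ['+', 'x', '^', '*']
colors = ['r', 'b', 'g', 'm']
for i, t in enumerate(Theta):
    plot_points_in_simplex(t, c=colors[cluster_label[i]], ls='none', marker=markers[cluster_label[i]])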
5. We now consider the case where we use the trained LDA model to extract the topic mixture weight ${\boldsymbol\theta}_d$ representations for some additional "test" documents. For this purpose, we fix the $\boldsymbol{\Phi}$ obtained from the last GS iteration and only iteratively re-estimate the $\boldsymbol{\theta}_d$ and the topic assignment counts $C_{dv}^k$ for each additional test document.
In the code below we handcraft the word count matrix M_test for 4 additional documents.
M_test = np.array([[ 6, 3, 1, 0, 0],
[16, 171, 110, 50, 160],
[ 6, 1, 0, 8, 5],
[159, 78, 14, 97, 121]])
print("Word count matrix for additional test documents\n", M_test)
#
D_test=len(M_test)
Theta_test=np.array([[1,0,0]]*D_test)
Word count matrix for additional test documents
 [[  6   3   1   0   0]
 [ 16 171 110  50 160]
 [  6   1   0   8   5]
 [159  78  14  97 121]]
#Your code goes here
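A minimal sketch of one possible solution: run the same count and Theta sampling as in the training loop, but only for the test documents and without re-sampling Phi. The names n_iters and C_test are introduced here for illustration.
n_iters = 200
for it in range(n_iters):
    C_test = np.zeros((D_test, V, K))
    for d in range(D_test):
        for v in range(V):
            p = Theta_test[d] * Phi[:, v]
            C_test[d, v] = np.random.multinomial(M_test[d, v], p / p.sum())
    Theta_test = np.vstack([sps.dirichlet.rvs(alpha0 + C_test[d].sum(axis=0))[0] for d in range(D_test)])
# Plot the final sampled topic mixture weights of the test documents in the simplex
plot_simplex([vocabulary[np.argmax(p)] for p in Phi])
plot_points_in_simplex(Theta_test, c='k', ls='none', marker='o')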
The VB inference attempts to obtain a good approximation of the posterior distribution
$$q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}) \approx P(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}| \boldsymbol{W}).$$For that, we use the mean-field approximation, where we assume the factorization
$$q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}) = q(\boldsymbol{Z})\, q(\boldsymbol{\Theta}, \boldsymbol{\Phi}),$$which, because of the induced factorizations, further factorizes as $$q(\boldsymbol{Z})\, q(\boldsymbol{\Theta}, \boldsymbol{\Phi}) = \prod_{d=1}^D \prod_{n=1}^{N_d} q(z_{dn}) \prod_{d=1}^D q(\boldsymbol{\theta}_d) \prod_{k=1}^K q(\boldsymbol{\varphi}_k).$$
Therefore, we will be interested in estimating the independent approximate posterior distributions $q(\boldsymbol{\theta}_d)$ and $q(\boldsymbol{\varphi}_k)$ for each document $d$ and topic $k$. The Variational Bayes updates dictate that these distributions are $\operatorname{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}_d^*)$ and $\operatorname{Dir}(\boldsymbol{\varphi}_k|\boldsymbol{\beta}_k^*)$. Our main task in the VB inference is to estimate the parameters $\boldsymbol{\alpha}_d^*=[\alpha_{d1}^*,\alpha_{d2}^*,\dots,\alpha_{dK}^*]^T$ and $\boldsymbol{\beta}_k^*=[\beta_{k1}^*,\beta_{k2}^*,\dots,\beta_{kV}^*]^T$.
The iterative algorithm for the VB inference goes as follows:
To monitor progress, we can evaluate the evidence lower bound (ELBO) $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ in every iteration. This is easiest to do right after step 2, as described on slide 31 ("Efficient ELBO calculation") in the BAYa class slides.
After the convergence, we can represent each document by the expected vector of topic mixture weights (i.e. expected value of variable $\boldsymbol{\theta}_d$)
$$\boldsymbol{\hat{\theta}}_d = \mathbb{E}_{q(\boldsymbol{\theta}_d)}[\boldsymbol{\theta}_d] = \int \boldsymbol{\theta}_d \operatorname{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}_d^*) d\boldsymbol{\theta}_d = \frac{\boldsymbol{\alpha}_d^*}{\sum_{k=1}^K \alpha_{dk}^*},$$which can be seen as the posterior predictive distribution over topics for document $d$. Similarly, we can obtain the expected topic-specific word distributions $$\boldsymbol{\hat{\varphi}}_k = \mathbb{E}_{q(\boldsymbol{\varphi}_k)}[\boldsymbol{\varphi}_k] = \frac{\boldsymbol{\beta}_k^*}{\sum_{v=1}^V \beta_{kv}^*}.$$
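In code, assuming the variables alpha (shape (D, K)) and beta (shape (K, V)) introduced in the cell below hold the converged parameters $\boldsymbol{\alpha}_d^*$ and $\boldsymbol{\beta}_k^*$ in their rows, these expectations are simply:
Theta_hat = alpha / alpha.sum(axis=1, keepdims=True)  # rows are the expected theta_d
Phi_hat = beta / beta.sum(axis=1, keepdims=True)      # rows are the expected varphi_k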
1. Implement Variational Bayes inference for training the Bayesian Latent Dirichlet Allocation model described above.
Store the ELBO $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ in every iteration to monitor the convergence.
The following cell comes with the definition of the variables that you will use in your implementation. As done before for the Gibbs Sampling inference, we provide the parameters of the priors alpha0 and beta0. The rows of the matrix alpha are the initial parameters of the individual approximate posteriors $q(\boldsymbol{\theta}_d)$ for all documents $d$. Similarly, the rows of beta are the initial parameters of the approximate posteriors $q(\boldsymbol{\varphi}_k)$. Please reuse the variables alpha and beta to store the parameters updated in each VB iteration. As before, we choose $K=3$.
# Make use of the following variables
D, V = M.shape # number of training documents D and size of vocabulary V
K = 3
alpha0 = np.ones(K, dtype='float') # Parameters of prior for Theta
beta0 = np.ones(V, dtype='float') # Parameters of prior for Phi
alpha=np.abs(np.random.randn(D,K)) # The rows are the initial parameters of the approximate posterior for each q(theta_d)
beta =np.abs(np.random.randn(K,V)) # The rows are the initial parameters of the approximate posterior for each q(varphi_k)
# Reuse the variables alpha and beta to store the parameters of the approximate posteriors updated in each VB iteration.
#Your code for Variational Bayes inference
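For reference, below is a minimal sketch of one possible mean-field VB loop (an assumption, not necessarily identical to the update scheme in the slides): in each iteration it computes the responsibilities q(z_dn = k) from the expected log-parameters, accumulates the expected counts, evaluates the ELBO, and then updates alpha and beta. The Dirichlet KL divergence helper uses the standard closed form (an equivalent function is available in the other class notebooks); the names n_iters, elbos, and dirichlet_KL are introduced here for illustration.
from scipy.special import digamma, gammaln, logsumexp

def dirichlet_KL(a, b):
    # KL(Dir(a) || Dir(b)) between two Dirichlet distributions (standard closed form)
    return (gammaln(a.sum()) - gammaln(a).sum() - gammaln(b.sum()) + gammaln(b).sum()
            + np.sum((a - b) * (digamma(a) - digamma(a.sum()))))

n_iters = 100
elbos = []
for it in range(n_iters):
    # Expected log-parameters under the current q(theta_d) and q(varphi_k)
    ElnTheta = digamma(alpha) - digamma(alpha.sum(axis=1, keepdims=True))   # (D, K)
    ElnPhi = digamma(beta) - digamma(beta.sum(axis=1, keepdims=True))       # (K, V)
    # Responsibilities q(z_dn = k) for each (document, word) pair and expected counts E[C_dv^k]
    log_rho = ElnTheta[:, None, :] + ElnPhi.T[None, :, :]                   # (D, V, K)
    log_norm = logsumexp(log_rho, axis=2, keepdims=True)
    r = np.exp(log_rho - log_norm)
    C_hat = M[:, :, None] * r
    # ELBO via the normalizers and the KL divergences from the priors (evaluated before the update)
    elbo = np.sum(M * log_norm[:, :, 0])
    elbo -= sum(dirichlet_KL(alpha[d], alpha0) for d in range(D))
    elbo -= sum(dirichlet_KL(beta[k], beta0) for k in range(K))
    elbos.append(elbo)
    # Update the parameters of q(theta_d) and q(varphi_k)
    alpha = alpha0 + C_hat.sum(axis=1)     # (D, K)
    beta = beta0 + C_hat.sum(axis=0).T     # (K, V)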
2. Plot the evolution of ELBO $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ over the iterations to monitor the convergence of the VB inference.
#Your code goes here
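Assuming the list elbos collected in the sketch above:
plt.plot(elbos)
plt.xlabel('VB iteration')
plt.ylabel(r'ELBO $\mathcal{L}(q)$')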
3. Use the function plot_topic_distributions to plot the expected topic-specific word distributions $\boldsymbol{\hat{\Phi}} = [\boldsymbol{\hat{\varphi}}_1, \boldsymbol{\hat{\varphi}}_2, \dots, \boldsymbol{\hat{\varphi}}_K]$. Also, plot the ground truth $\boldsymbol{\Phi}$. Is the learned $\boldsymbol{\hat{\Phi}}$ close to the ground truth?
#Your code goes here
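For example, using the expectations derived above (assuming alpha and beta hold the converged VB parameters):
Phi_hat = beta / beta.sum(axis=1, keepdims=True)   # expected topic-specific word distributions
plot_topic_distributions(Phi_hat, vocabulary)
plot_topic_distributions(Phi_gt, vocabulary)       # ground truth for comparison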
4. Evaluate expected topic mixture weights $\boldsymbol{\hat{\theta}}_d$ for each training document $d$ and cluster them into 4 clusters in a similar way as we did for the GS inference.
#Your code for clustering
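A sketch analogous to the GS case (again assuming scikit-learn's KMeans as an extra dependency):
from sklearn.cluster import KMeans
Theta_hat = alpha / alpha.sum(axis=1, keepdims=True)       # expected topic mixture weights
cluster_label_vb = KMeans(n_clusters=4, n_init=10).fit_predict(Theta_hat)
plot_simplex([vocabulary[np.argmax(p)] for p in Phi_hat])
markers = ['+', 'x', '^', '*']
colors = ['r', 'b', 'g', 'm']
for i, t in enumerate(Theta_hat):
    plot_points_in_simplex(t, c=colors[cluster_label_vb[i]], ls='none', marker=markers[cluster_label_vb[i]])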
5. As we did before for the GS inference, consider now the case where we use the LDA model trained with VB inference to extract the expected topic mixture weights $\boldsymbol{\hat{\theta}}_d$ for the 4 additional test documents represented by the matrix M_test (as defined before). For this purpose, run VB inference where all $\boldsymbol{\beta}_k^*$ (and therefore also $q(\boldsymbol{\varphi}_k)$) stay fixed.
#Your code goes here
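A minimal sketch of one possible solution: iterate only the q(theta_d) updates for the test documents, keeping beta (and therefore q(varphi_k) and its expected log-parameters) fixed at the values from training. The names alpha_test and Theta_hat_test are introduced here for illustration.
from scipy.special import digamma, logsumexp
D_test = len(M_test)
alpha_test = np.ones((D_test, K))                                  # initial q(theta_d) parameters for the test documents
ElnPhi = digamma(beta) - digamma(beta.sum(axis=1, keepdims=True))  # fixed, from the trained model
for it in range(50):
    ElnTheta_test = digamma(alpha_test) - digamma(alpha_test.sum(axis=1, keepdims=True))
    log_rho = ElnTheta_test[:, None, :] + ElnPhi.T[None, :, :]
    r = np.exp(log_rho - logsumexp(log_rho, axis=2, keepdims=True))
    alpha_test = alpha0 + (M_test[:, :, None] * r).sum(axis=1)
Theta_hat_test = alpha_test / alpha_test.sum(axis=1, keepdims=True)
# Plot the expected topic mixture weights of the test documents in the simplex
plot_simplex([vocabulary[np.argmax(p)] for p in Phi_hat])
plot_points_in_simplex(Theta_hat_test, c='k', ls='none', marker='o')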