BAYa class Assignment 2024¶
In this assignment, your task will be to implement and analyze inference in the Bayesian Latent Dirichlet Allocation (LDA) model as described in the corresponding slides from the BAYa class. You will accomplish this task by completing this Jupyter Notebook, which already comes with code for generating the training data and some plotting functions for presenting the results. If you do not have any experience with Jupyter Notebook, the easiest way to start is to install Anaconda3, run Jupyter Notebook, and open this notebook downloaded from BAYa_Assignment2024.ipynb. You can also find some inspiration and pieces of code to reuse (e.g. the KL divergence for the Dirichlet distribution) in the other Jupyter Notebooks provided for this class.
The Notebook is organized as follows:
- First comes a cell with the code of functions that will later be used for presenting the results and the learned models. Specifically, it contains code for plotting points in a 2-dimensional simplex, plotting a Dirichlet distribution, and plotting topic-specific distributions. You can skip this cell on a first pass, as the use of the functions will be demonstrated later.
- Next comes code that "handcrafts" some parameters of the LDA model and implements the generative process assumed by the LDA model. The code generates artificial training data that you will use for LDA model training. Please read this code and the comments around it carefully.
- At the end of this notebook, there are cells with instructions to fill in your implementation of the LDA model training using Gibbs Sampling (GS) and Variational Bayes (VB) inference. There are also cells with other tasks to accomplish and questions to answer.
Do not edit the code in the following cells for generating and presenting the training data!
# Run this code! But there is no need to pay much attention to this cell at the first pass through the notebook
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import scipy.stats as sps
# Pre-calculate some global data used by functions plot_simplex, plot_dirichlet, and plot_points_in_simplex
_corners = np.array([[0, 0], [1, 0], [0.5, 0.75**0.5]])
_invT = np.linalg.inv(_corners[:2]-_corners[2])
_triangle = tri.Triangulation(_corners[:, 0], _corners[:, 1])
_refiner = tri.UniformTriRefiner(_triangle)
_trimesh = _refiner.refine_triangulation(subdiv=7)
# Now convert _trimesh 2D cartesian coordinates to 3D barycentric coordinates (i.e. 3D point on the 2D simplex)
# as described in https://en.wikipedia.org/wiki/Barycentric_coordinate_system#Edge_approach
# We calculate only the first 2 barycentric coordinates as sps.dirichlet.pdf is happy without the last one (l3 = 1 - l1 - l2)
_tol=1.e-8
_l1l2 = (np.c_[_trimesh.x, _trimesh.y] -_corners[2]) @ _invT
_l1l2 = np.clip(_l1l2, 2*_tol, 1.0-_tol) - _tol # to make sure that none of the probabilities is exactly zero or one
def plot_simplex(class_labels=[""]*3):
'''Plot "axis" for 2-simplex. It simply plots a triangle into which we will be ploting points
representing Categorical distributions (for 3 categories/topics) or a Dirichlet distribution.
Arguments:
class_labels: list of 3 strings that are used as labels (i.e. category/topic names) for the
simplex (triangle) corners
'''
plt.triplot(_triangle, linewidth=1)
#plt.xlim(0, 1)
#plt.ylim(0, 0.75**0.5)
plt.axis('equal')
plt.axis('off')
plt.text(0, 0, class_labels[0], horizontalalignment='left', verticalalignment='top')
plt.text(1, 0, class_labels[1], horizontalalignment='right', verticalalignment='top')
plt.text(0.5, 0.75**0.5, class_labels[2], horizontalalignment='center', verticalalignment='bottom')
def plot_dirichlet(alpha, nlevels=128, **kwargs):
'''Plot Dirichlet pdf in an equilateral triangle (2-simplex).
Arguments:
alpha: Dirichlet distribution parameters.
nlevels (int): Number of contours (shades) to draw.
kwargs: Keyword args passed on to `plt.tricontourf`.
'''
plt.tricontourf(_trimesh, sps.dirichlet.pdf(_l1l2.T, alpha), nlevels, cmap='gray_r', **kwargs)
def plot_points_in_simplex(X, **kwargs):
'''Plots a set of points in the 2-simplex. Each point can represent
a categorical distribution with 3 categories/topics.
Arguments:
X: A Nx3 array in barycentric coordinates of points to plot.
kwargs: Keyword args passed on to `plt.plot`.
'''
plt.plot(*(X @ _corners).T, **kwargs)
def plot_topic_distributions(Phi, vocabulary, top_V=10):
'''Plot words and their probabilities for each topic.
Arguments:
Phi: A KxV matrix whose rows are topic-specific word distributions
vocabulary: List of strings that are words corresponding to columns of Phi
top_V: Only top_V most likely words and their probabilities are shown.
'''
plt.figure(figsize=(20, 4))
for k, topic_dist in enumerate(Phi):
plt.subplot(1, len(Phi), k+1)
sort_ixs = np.argsort(topic_dist)[::-1]
top_V_words = [vocabulary[i] for i in sort_ixs[:top_V]]
plt.barh(np.arange(len(top_V_words)/2, 0, -0.5), topic_dist[sort_ixs[:top_V]], height=0.4)
plt.yticks(np.arange(len(top_V_words)/2, 0, -0.5), top_V_words, fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.title("Topic %2d" % (k+1,))
plt.xlabel("Probability")
LDA generative process¶
The generative process assumed by the Bayesian Latent Dirichlet Allocation model is
\begin{align} {\boldsymbol\varphi}_k &\sim \operatorname{Dir}(\boldsymbol{\beta}_0), && \text{for } k=1, \dots, K\\ {\boldsymbol\theta}_d &\sim \operatorname{Dir}(\boldsymbol{\alpha}_0), && \text{for } d=1, \dots, D\\ z_{dn} &\sim \operatorname{Cat}(\boldsymbol\theta_d), &&\text{for } d=1, \dots, D,\quad n=1, \dots, N_d\\ w_{dn} &\sim \operatorname{Cat}(\boldsymbol\varphi_{z_{dn}}), &&\text{for } d=1, \dots, D,\quad n=1, \dots, N_d,\\ \end{align}where ${\boldsymbol\varphi}_k$ are topic-specific word distributions, ${\boldsymbol\theta}_d$ are document-specific topic distributions (topic mixture weights), $z_{dn}=k$ denotes that the $n^{th}$ word in document $d$ comes from topic $k$, and $w_{dn}=v$ denotes that the $n^{th}$ word in document $d$ is $v$. However, our training data will not be represented in terms of $w_{dn}$. It will be compactly represented by a $D\times V$ word count matrix $\boldsymbol{M}$ with elements $M_{dv}$ counting how many times document $d$ contains word $v$.
Let the $V\times K$ matrix of topic-specific word distributions be $$\boldsymbol\Phi = [{\boldsymbol\varphi}_1, {\boldsymbol\varphi}_2, \dots, {\boldsymbol\varphi}_K],$$ and let the $K\times D$ matrix of document-specific topic distributions (topic mixture weights) be $$\boldsymbol\Theta = [{\boldsymbol\theta}_1, {\boldsymbol\theta}_2, \dots, {\boldsymbol\theta}_D]$$
Mathematicians like to use column vectors and form matrices by stacking the vectors into the columns of the matrices. In python, it is more convenient to work with row vectors and stack them into matrix rows. Therefore, everything in the code will be transposed as compared to the equations. E.g., the $V\times K$ matrix $\boldsymbol\Phi$ will be represented by numpy.array named Phi with shape (K, V). Similarly, $\boldsymbol\Theta$ will be represented by numpy.array named Theta with shape (D, K).
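To make the generative process and the transposed numpy conventions concrete, here is a minimal illustrative sketch of how one could literally sample a corpus according to the process above. It is not part of the assignment code, the function name generate_corpus is only hypothetical, and (as explained below) the assignment actually handcrafts Phi and Theta rather than sampling them from their Dirichlet priors.
import numpy as np
import scipy.stats as sps

def generate_corpus(alpha0, beta0, D, N_d, rng=np.random.default_rng(0)):
    K, V = len(alpha0), len(beta0)
    Phi = sps.dirichlet.rvs(beta0, K, random_state=rng)    # rows are topic-specific word distributions varphi_k, shape (K, V)
    Theta = sps.dirichlet.rvs(alpha0, D, random_state=rng) # rows are document-specific topic distributions theta_d, shape (D, K)
    M = np.zeros((D, V), dtype=int)                        # D x V word count matrix
    for d in range(D):
        z = rng.choice(K, size=N_d, p=Theta[d])            # topic assignment z_dn for each word token
        for n in range(N_d):
            w = rng.choice(V, p=Phi[z[n]])                 # word w_dn sampled from its topic's word distribution
            M[d, w] += 1                                   # accumulate the count matrix M
    return Phi, Theta, M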
Handcrafting the LDA model¶
Handcrafting $\boldsymbol\Phi$¶
First, we handcraft the matrix $\boldsymbol\Phi$ containing topic-specific word distributions. We handcraft $\boldsymbol\Phi$ directly rather than sampling it from the Dirichlet prior as the generative process for Bayesian LDA would suggest. The Dirichlet prior will still be used later for LDA model training. We store the handcrafted matrix $\boldsymbol\Phi$ in the variable Phi_gt, where _gt stands for "ground truth". We will generate our training data using this matrix, and we hope to learn this matrix (or some matrix close to it) back during the LDA model training. We consider only a toy example with $K=3$ topics ("sports", "computers", "food") and a vocabulary of only $V=5$ distinct words (stored in the variable vocabulary).
vocabulary = [ 'tenis', 'surfing','software','apple', 'burger'] # TOPIC:
Phi_gt = np.array([[0.7, 0.3, 0.0, 0.0, 0.0], # sports
[0.0, 0.2, 0.5, 0.3, 0.0], # computers
[0.0, 0.0, 0.0, 0.4, 0.6]]) # food
plot_topic_distributions(Phi_gt, vocabulary)
Note that the word "surfing" can likely occur for both topics "sports" and "computers". Similarly, the word "apple" can likely occur for both topics "food" and "computers". Note also that we have given names to the topics in our example, but in real data, the topics in the training data and their number (the parameter $K$) are unknown.
We used the function plot_topic_distributions to plot the word distributions for each topic. Each plot shows the probabilities of words for one topic, sorted by the word probability. The function shows only the (at most) top_V=10 most likely words, which will be useful for figuring out what the topics correspond to in real data with many words in the vocabulary.
Handcrafting $\boldsymbol\Theta$¶
Now, we will handcraft the matrix $\boldsymbol\Theta$ with topic distributions (topic mixture weights) ${\boldsymbol\theta}_d$, one for each training document $d$. We will pretend that there are 4 thematically focused document collections in our training data with documents about:
- sports
- computers
- food
- food for sportsmen
Each collection will contain documents_per_collection=20 training documents. In the following code, for each document $d$, the topic mixture weights ${\boldsymbol\theta}_d$ are sampled from a collection-specific Dirichlet distribution (see the grayscale plots produced by the next cell). The parameters of these 4 Dirichlet distributions are given in the rows of the matrix alpha4collections in the code below.
# sports computers food # Collection of documents about ...
alpha4collections = np.array([[ 12, 1, 1], # sports
[ 2, 8, 1], # computers
[ 1, 1, 13], # food
[ 12, 2, 10]]) # food for sportsmen
documents_per_collection = 20
Theta_gt = []
plt.figure(figsize=(14, 3))
# For each collection
for i, a in enumerate(alpha4collections):
plt.subplot(1, len(alpha4collections), i+1)
plt.title(r'$\alpha$ = (%.3f, %.3f, %.3f)' % tuple(a))
# Plot 2-simplex (a triangle) and add topic labels to the corners of the triangle where those topics have probability 1
plot_simplex(['Sports', 'Computers', 'Food'])
# Plot collection-specific Dirichlet distribution (in grayscale)
plot_dirichlet(a)
# Sample topic mixture weights for all documents in one collection
thetas4collection = sps.dirichlet.rvs(a, documents_per_collection)
# Plot topic mixture weights as points in the simplex
plot_points_in_simplex(thetas4collection, c='b', ls='none', marker='+')
Theta_gt.append(thetas4collection)
# Concatenate all topic weights from all collections into a single matrix
Theta_gt = np.concatenate(Theta_gt)
# For each document, remember the label saying which collection it comes from
collection_label = np.repeat(range(len(alpha4collections)), documents_per_collection)
Each plot (triangle) produced by the code above is a 2-dimensional simplex corresponding to one of the collections. It shows its corresponding Dirichlet distribution (in grayscale). Each blue cross is a sample from that Dirichlet distribution representing the topic mixture weights ${\boldsymbol\theta}_d$ for one training document. We can see that, in the first collection (documents about sports), all documents have high weights for the topic "sports" and low weights for the other two topics. In contrast, in the last collection (documents about food for sportsmen), all documents have comparable high weights for the topics "sports" and "food" and low weight for the topic "computers".
At the end of the code in the cell above, we concatenate all the document-specific mixture weights ${\boldsymbol\theta}_d$ from all the collections into a single matrix Theta_gt representing the "ground truth" matrix $\boldsymbol\Theta$. Just like in the case of Phi_gt, our task will be to recover Theta_gt (or something close to it) during the LDA model training.
We also remember the collection label for each training document in the variable collection_label. One of your tasks will be to cluster the documents into collections, and collection_label will be useful to see whether you manage to recover the original "ground truth" document collections.
Plotting all handcrafted ${\boldsymbol\theta}_d$ into a single simplex¶
In the code below, we plot all the mixture weights ${\boldsymbol\theta}_d$ for all training documents into a single simplex. We also stop pretending that we know the topic names. Instead, we use the most likely word from each topic-specific distribution as the representative topic labels (i.e. the labels in the corners of the simplex). We also indicate the "ground truth" collection label by the color and marker of each point. This is how you will later plot the learned mixture weights ${\boldsymbol\theta}_d$ and how you will be able to indicate your clustering of the documents.
plot_simplex([vocabulary[np.argmax(p)] for p in Phi_gt])
markers = ['+', 'x', '^', '*']
colors = ['r', 'b', 'g', 'm']
for i, t in enumerate(Theta_gt):
plot_points_in_simplex(t, c=colors[collection_label[i]], ls='none', marker=markers[collection_label[i]])
Sampling training data¶
Once we have the matrices $\boldsymbol\Phi$ and $\boldsymbol\Theta$, we can obtain the word distribution specific to document $d$ as $\boldsymbol\Phi {\boldsymbol\theta}_d$. Or, more efficiently, we can obtain a matrix of the distributions for all the documents as $\boldsymbol\Phi\boldsymbol\Theta$. Finally, for document $d$, we can sample the vector of word counts (i.e. a row of the word count matrix $\boldsymbol{M}$) from the distribution $\operatorname{Multinomial}(\boldsymbol\Phi{\boldsymbol\theta}_d, N_d)$. For simplicity, in the code below, we assume that each training document contains the same number of $N_d=N=100$ words. The training word count matrix is stored in the variable M.
PhiTheta = Theta_gt @ Phi_gt
N=100
M =np.vstack([sps.multinomial.rvs(N, dd) for dd in PhiTheta])
print("The word count matrix for the first 10 training documents:\n", M[:10])
The word count matrix for the first 10 training documents:
 [[56 21  8 11  4]
 [66 22  3  6  3]
 [69 26  0  2  3]
 [62 31  0  3  4]
 [62 32  2  1  3]
 [52 23  7 10  8]
 [51 26  9  5  9]
 [68 22  2  3  5]
 [62 34  1  2  1]
 [61 26  5  4  4]]
Training the LDA model using Gibbs Sampling inference¶
Summary of the inference algorithm¶
1. Initialize matrices (distributions) $\boldsymbol\Phi$ and $\boldsymbol\Theta$ (e.g. randomly or to constant values).
2. For each $d=1, \dots, D$ and $v=1, \dots, V$, evaluate the probabilities of topics $\boldsymbol{\pi}_{dv}=[\pi_{dv1},\pi_{dv2},\dots,\pi_{dvK}]^T$ using current values of $\boldsymbol\Phi$ and $\boldsymbol\Theta$, where $$\pi_{dvk} = \frac{\theta_{dk}\,\varphi_{kv}}{\sum_{k'=1}^K \theta_{dk'}\,\varphi_{k'v}}$$
3. Sample vectors $$\boldsymbol{C}_{dv} = [C_{dv}^1, C_{dv}^2, \dots, C_{dv}^K]^T \sim \operatorname{Multinomial}(\boldsymbol{\pi}_{dv}, M_{dv}),$$ where the element $C_{dv}^k$ can be seen as the count of word $v$ sampled from topic $k$ in document $d$.
4. For each $d=1, \dots, D$, sample $$\boldsymbol{\theta}_d \sim \operatorname{Dir}(\boldsymbol{\alpha}_d),$$ where the elements of $\boldsymbol{\alpha}_d$ are $$\alpha_{dk} = \alpha_{0k} + \sum_{v=1}^V C_{dv}^k$$ and $\alpha_{0k}$ are the elements of the prior $\boldsymbol{\alpha}_0$.
5. For each $k=1, \dots, K$, sample $$\boldsymbol{\varphi}_k \sim \operatorname{Dir}(\boldsymbol{\beta}_k),$$ where the elements of $\boldsymbol{\beta}_k$ are $$\beta_{kv} = \beta_{0v} + \sum_{d=1}^D C_{dv}^k$$ and $\beta_{0v}$ are the elements of the prior $\boldsymbol{\beta}_0$.
6. Go to step 2 and repeat for the desired number of iterations.
Note that $C_{dv}^k$ can be stored as elements of a 3D matrix and $\sum_{d=1}^D C_{dv}^k$ can be easily evaluated for all $v$ and $k$ by summing the matrix over the $d$ dimension. Similarly, we can sum the matrix over the $v$ dimension to get all $\sum_{v=1}^V C_{dv}^k$.
To monitor the progress, we can evaluate $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ in every iteration (See slide 8. "Joint probability using the counts" in the BAYa class slides).
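For illustration only, below is one possible vectorized sketch of a single GS iteration. The function name gibbs_iteration is hypothetical, the counts are stored in a (D, K, V) array rather than indexed as $C_{dv}^k$, the evaluation of $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ is intentionally omitted, and the sketch assumes the variables M, alpha0, beta0, Theta and Phi defined in the cell further below. You do not have to structure your solution this way.
import numpy as np
import scipy.stats as sps

def gibbs_iteration(M, Theta, Phi, alpha0, beta0, rng=np.random.default_rng()):
    D, V = M.shape
    K = Theta.shape[1]
    # Step 2: pi[d, k, v] proportional to theta_dk * phi_kv, normalized over the topics k
    pi = Theta[:, :, None] * Phi[None, :, :]
    pi /= pi.sum(axis=1, keepdims=True)
    # Step 3: sample the topic-word counts C[d, k, v] ~ Multinomial(pi_dv, M_dv)
    C = np.zeros((D, K, V), dtype=int)
    for d in range(D):
        for v in range(V):
            C[d, :, v] = rng.multinomial(M[d, v], pi[d, :, v])
    # Step 4: resample the document-specific topic distributions theta_d
    Theta = np.vstack([sps.dirichlet.rvs(alpha0 + C[d].sum(axis=1))[0] for d in range(D)])
    # Step 5: resample the topic-specific word distributions varphi_k
    Phi = np.vstack([sps.dirichlet.rvs(beta0 + C.sum(axis=0)[k])[0] for k in range(K)])
    return Theta, Phi, C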
Tasks and questions:¶
- Implement Gibbs Sampling inference for training the Bayesian Latent Dirichlet Allocation model described above.
- Run a sufficient number of GS iterations to obtain matrices $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ that would be good samples from the posterior distribution $P(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}| \boldsymbol{W})$
- Store (correctly normalized) $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ in every iteration to monitor how the Gibbs Sampling inference progresses.
The following cell defines the variables that you will use in your implementation. Namely, alpha0 are the parameters of the prior $p({\boldsymbol\theta}_d) = \operatorname{Dir}(\boldsymbol{\alpha}_0)$ and beta0 are the parameters of the prior $p(\boldsymbol{\varphi}_k) = \operatorname{Dir}(\boldsymbol{\beta}_0)$. For simplicity, we will use the flat non-informative prior (i.e. $\boldsymbol{\alpha}_0$ and $\boldsymbol{\beta}_0$ are vectors of ones). In the variables Theta and Phi, your code will store the samples of $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$. Start the GS iterations from the initial values of $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$ provided in the code.
Note that, for the LDA model training, one needs to choose the hyperparameter $K$, which is the assumed number of topics. We choose $K=3$, which will allow us to plot each sampled $\boldsymbol{\theta}_d$ in the 2-dimensional simplex. However, we are somewhat cheating here, as we know that this is the ground truth number of topics.
# Make use of the following variables
D, V = M.shape # number of training documents D and size of vocabulary V
K = 3
alpha0 = np.ones(K, dtype='float') # Parameters of prior for Theta
beta0 = np.ones(V, dtype='float') # Parameters of prior for Phi
Theta = np.ones((D,K))/K # Initial values for Theta; store new Theta sampled in each iteration to this variable
Phi = np.ones((K,V))/V # Initial values for Phi; store new Phi sampled in each iteration to this variable
#Your code for Gibbs sampling inference
- Plot the evolution of the log joint probability $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ over the iterations to monitor how the Gibbs sampling inference progresses.
- Gibbs sampling should produce samples from the posterior distribution $P(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}|\boldsymbol{W})$. So, why are we looking at $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$? Why and how does this quantity tell us that the Gibbs sampling is doing the right thing?
- How can we say from this plot that the burn-in phase is over?
- How does $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ behave before and after the burn-in phase and why?
#Your code goes here
- Use the function plot_topic_distributions to plot the sampled topic-specific word distributions $\boldsymbol{\Phi}$. Also, plot the ground truth version of it.
- Is the $\boldsymbol{\Phi}$ sampled in the last iteration close to the ground truth? Would they get closer and closer with every iteration of the GS algorithm? Explain why.
- The $\boldsymbol{\Phi}$ from the last GS iteration is a sample from the posterior distribution. How could we get a better estimate of the matrix $\boldsymbol{\Phi}$ while still using the GS inference? Describe in detail what you would implement and how (or directly implement it). Hint: What if the sample from the last GS iteration is not a very likely one according to the posterior distribution? You can also find inspiration in what we do later for Variational Bayes inference.
#Your code goes here
- The final sampled topic mixture weights $\boldsymbol{\theta}_d$ can be seen as low dimensional representations of each document informing about their topic. Cluster the training documents into 4 clusters, where each of the clusters should ideally correspond to one of the ground truth document collections.
- You can use, for example, sklearn.cluster.AgglomerativeClustering for such a task (see the sketch after this list).
- Plot the documents into the simplex, as done in the section "Plotting all handcrafted ${\boldsymbol\theta}_d$ into a single simplex". For each document, indicate the label assigned by the clustering by the marker and the ground truth collection label by the color.
- Label the corners of the simplex with the most likely word from the corresponding learned topic-specific distribution $\boldsymbol{\varphi}_k$, as done in the mentioned reference plot.
- Comment on how well the obtained clusters match the ground truth collection labels.
- Here, we are (again) rather cheating because we know the ground truth number of clusters/collections. How would you choose the number of clusters if you didn't know it?
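Purely as an API illustration, and assuming your sampled mixture weights are stored in a (D, K) array named Theta, the clustering step itself can be as short as:
from sklearn.cluster import AgglomerativeClustering

# Theta: (D, K) array with one row of topic mixture weights per training document
cluster_label = AgglomerativeClustering(n_clusters=4).fit_predict(Theta)
# cluster_label[d] can then select the marker used when plotting document d in the simplex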
#Your code for clustering
- We now consider the case where we use the trained LDA model to extract topic mixture weight ${\boldsymbol\theta}_d$ representations for some additional "test" documents. For this purpose, we fix the $\boldsymbol{\Phi}$ obtained from the last GS iteration and only iteratively re-estimate the $\boldsymbol{\theta}_d$ and topic assignment counts $C_{dv}^k$ for each additional test document.
In the code below we handcraft the word count matrix M_test for 4 additional documents.
- For each document in M_test run 50 GS iterations and store $\boldsymbol{\theta}_d$ from each iteration.
- Start the GS iterations from the initial values of $\boldsymbol{\theta}_d$ defined by the matrix Theta_test in the code below
- You can try a different number of iterations to get a better intuition of what is going on.
- For each document, produce a simplex plot with the points that show the evolution of $\boldsymbol{\theta}_d$ over the GS iterations.
- Use a single call of the function plot_points_in_simplex to plot all the $\boldsymbol{\theta}_d$ (from all iterations for a single document) at once.
- Connect the points from consecutive GS iterations by a dotted line (replace the parameter ls='none' with ls=':' when calling the function plot_points_in_simplex)
- Label the corners of the simplex with the most likely word from the corresponding learned topic-specific distribution $\boldsymbol{\varphi}_k$.
- What topics are likely to be contained in each document? Which document collection do they seem to be coming from?
- What do the individual points in each simplex plot represent? What does their distribution in the plot represent?
- Looking at the evolution of $\boldsymbol{\theta}_d$, can you see the burn-in phase? Where?
- The points for some documents should be more concentrated than for others. Explain why.
M_test = np.array([[ 6, 3, 1, 0, 0],
[16, 171, 110, 50, 160],
[ 6, 1, 0, 8, 5],
[159, 78, 14, 97, 121]])
print("Word count matrix for additional test documents\n", M_test)
#
D_test=len(M_test)
Theta_test=np.array([[1,0,0]]*D_test)
Word count matrix for additional test documents
 [[  6   3   1   0   0]
 [ 16 171 110  50 160]
 [  6   1   0   8   5]
 [159  78  14  97 121]]
#Your code goes here
Training LDA model using Variational Bayes inference¶
Summary of the inference algorithm¶
The VB inference attempts to obtain a good approximation of the posterior distribution
$$q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}) \approx P(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}| \boldsymbol{W}).$$For that, we use the mean-field approximation, where we assume the factorization
$$q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}) = q(\boldsymbol{Z}) q(\boldsymbol{\Theta}, \boldsymbol{\Phi}),$$which, because of induced factorizations, further factorizes as $$q(\boldsymbol{Z}) q(\boldsymbol{\Theta}, \boldsymbol{\Phi}) = \prod_{d=1}^D \prod_{n=1}^{N_d} q(z_{dn}) \prod_{d=1}^D q(\boldsymbol{\theta}_d) \prod_{k=1}^K q(\boldsymbol{\varphi}_k).$$
Therefore, we will be interested in estimating the independent approximate posterior distributions $q(\boldsymbol{\theta}_d)$ and $q(\boldsymbol{\varphi}_k)$ for each document $d$ and topic $k$. The Variational Bayes updates dictate that these distributions are $\operatorname{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}_d^*)$ and $\operatorname{Dir}(\boldsymbol{\varphi}_k|\boldsymbol{\beta}_k^*)$. Our main task in the VB inference is to estimate the parameters $\boldsymbol{\alpha}_d^*=[\alpha_{d1}^*,\alpha_{d2}^*,\dots,\alpha_{dK}^*]^T$ and $\boldsymbol{\beta}_k^*=[\beta_{k1}^*,\beta_{k2}^*,\dots,\beta_{kV}^*]^T$.
The iterative algorithm for the VB inference goes as follows:
1. Initialize all $\alpha_{dk}^*$ and $\beta_{kv}^*$ (e.g. randomly).
2. For each $d=1, \dots, D$, $v=1, \dots, V$ and $k=1, \dots, K$, evaluate the expected topic-word counts $$\bar{C}_{dv}^k = M_{dv}\,\frac{\exp\left(\mathbb{E}[\ln\theta_{dk}] + \mathbb{E}[\ln\varphi_{kv}]\right)}{\sum_{k'=1}^K \exp\left(\mathbb{E}[\ln\theta_{dk'}] + \mathbb{E}[\ln\varphi_{k'v}]\right)},$$ where the expectations are taken with respect to the current $q(\boldsymbol{\theta}_d)$ and $q(\boldsymbol{\varphi}_k)$, i.e. $\mathbb{E}[\ln\theta_{dk}] = \psi(\alpha_{dk}^*) - \psi\left(\sum_{k'=1}^K \alpha_{dk'}^*\right)$ and $\mathbb{E}[\ln\varphi_{kv}] = \psi(\beta_{kv}^*) - \psi\left(\sum_{v'=1}^V \beta_{kv'}^*\right)$, with $\psi(\cdot)$ denoting the digamma function.
3. For each $d=1, \dots, D$ and $k=1, \dots, K$, $$\alpha_{dk}^* = \alpha_{0k} + \sum_{v=1}^V \bar{C}_{dv}^k,$$ where $\alpha_{0k}$ are the elements of the prior $\boldsymbol{\alpha}_0$.
4. For each $k=1, \dots, K$ and $v=1, \dots, V$, $$\beta_{kv}^* = \beta_{0v} + \sum_{d=1}^D \bar{C}_{dv}^k,$$ where $\beta_{0v}$ are the elements of the prior $\boldsymbol{\beta}_0$.
5. Go to step 2 and repeat for the desired number of iterations.
To monitor progress, we can evaluate the evidence lower bound (ELBO) $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ in every iteration. This is most easily done right after step 2, as described on slide 31, "Efficient ELBO calculation", in the BAYa class slides.
After convergence, we can represent each document by the expected vector of topic mixture weights (i.e. the expected value of the variable $\boldsymbol{\theta}_d$)
$$\boldsymbol{\hat{\theta}}_d = \mathbb{E}_{q(\boldsymbol{\theta}_d)}[\boldsymbol{\theta}_d] = \int \boldsymbol{\theta}_d \operatorname{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}_d^*) d\boldsymbol{\theta}_d = \frac{\boldsymbol{\alpha}_d^*}{\sum_{k=1}^K \alpha_{dk}^*},$$which can be seen as the posterior predictive distribution over topics for document $d$. Similarly, we can obtain the expected topic-specific word distributions $$\boldsymbol{\hat{\varphi}}_k = \mathbb{E}_{q(\boldsymbol{\varphi}_k)}[\boldsymbol{\varphi}_k] = \frac{\boldsymbol{\beta}_k^*}{\sum_{v=1}^V \beta_{kv}^*}.$$
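Again for illustration only, here is one possible vectorized sketch of a single VB iteration followed by the expected values. The function name vb_iteration is hypothetical, alpha has shape (D, K) and beta has shape (K, V) to match the variables defined in the cell further below, and the ELBO computation is omitted.
import numpy as np
from scipy.special import digamma

def vb_iteration(M, alpha, beta, alpha0, beta0):
    # Step 2: expected topic-word counts Cbar[d, k, v]
    ln_theta = digamma(alpha) - digamma(alpha.sum(axis=1, keepdims=True))  # E[ln theta_dk], shape (D, K)
    ln_phi = digamma(beta) - digamma(beta.sum(axis=1, keepdims=True))      # E[ln varphi_kv], shape (K, V)
    log_rho = ln_theta[:, :, None] + ln_phi[None, :, :]                    # shape (D, K, V)
    rho = np.exp(log_rho - log_rho.max(axis=1, keepdims=True))
    rho /= rho.sum(axis=1, keepdims=True)                                  # normalize over the topics k
    Cbar = M[:, None, :] * rho                                             # expected counts, shape (D, K, V)
    # Steps 3 and 4: update the Dirichlet parameters of q(theta_d) and q(varphi_k)
    alpha = alpha0 + Cbar.sum(axis=2)                                      # shape (D, K)
    beta = beta0 + Cbar.sum(axis=0)                                        # shape (K, V)
    return alpha, beta

# After convergence, the expected distributions are simply
# Theta_hat = alpha / alpha.sum(axis=1, keepdims=True)
# Phi_hat = beta / beta.sum(axis=1, keepdims=True)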
Tasks and questions:¶
- Implement Variational Bayes inference for training the Bayesian Latent Dirichlet Allocation model described above.
- Run a sufficient number of VB iterations to make the algorithm converge
- Store the ELBO $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ in every iteration to monitor the convergence.
The following cell defines the variables that you will use in your implementation. As done before for the Gibbs Sampling inference, we provide the parameters of the priors, alpha0 and beta0. The rows of the matrix alpha are the initial parameters of the individual approximate posteriors $q(\boldsymbol{\theta}_d)$ for all documents $d$. Similarly, each row of beta contains the initial parameters of the corresponding $q(\boldsymbol{\varphi}_k)$. Please reuse the variables alpha and beta to store the parameters updated in each VB iteration. As before, we choose $K=3$.
# Make use of the following variables
D, V = M.shape # number of training documents D and size of vocabulary V
K = 3
alpha0 = np.ones(K, dtype='float') # Parameters of prior for Theta
beta0 = np.ones(V, dtype='float') # Parameters of prior for Phi
alpha=np.abs(np.random.randn(D,K)) # The rows are the initial parameters of the approximate posterior for each q(theta_d)
beta =np.abs(np.random.randn(K,V)) # The rows are the initial parameters of the approximate posterior for each q(varphi_k)
# Reuse the variables alpha and beta to store the parameters of the approximate posteriors updated in each VB iteration.
#Your code for Variational Bayes inference
- Plot the evolution of ELBO $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ over the iterations to monitor the convergence of the VB inference.
- What can we see from the plot? In how many iterations does the VB inference converge?
- How does $\mathcal{L}(q(\boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi}))$ differ from $\ln P(\boldsymbol{W}, \boldsymbol{Z}, \boldsymbol{\Theta}, \boldsymbol{\Phi})$ that we plotted for monitoring convergence of the Gibbs Sampling inference? These two quantities are expected to behave differently over the GS and VB iterations. Point out and explain the difference in their behavior.
#Your code goes here
- Use the function plot_topic_distributions to plot the expected topic-specific word distributions $\boldsymbol{\hat{\Phi}} = [\boldsymbol{\hat{\varphi}}_1, \boldsymbol{\hat{\varphi}}_2, \dots, \boldsymbol{\hat{\varphi}}_K]$. Also, plot the ground truth $\boldsymbol{\Phi}$. Is the learned $\boldsymbol{\hat{\Phi}}$ close to the ground truth?
#Your code goes here
- Evaluate expected topic mixture weights $\boldsymbol{\hat{\theta}}_d$ for each training document $d$ and cluster them into 4 clusters in a similar way as we did for the GS inference.
- As for GS inference, plot all $\boldsymbol{\hat{\theta}}_d$ into one simplex. For each document, indicate the label assigned by the clustering by the marker and the ground truth collection label by the color.
- Again, label the corners of the simplex with the most likely word from the corresponding learned expected topic-specific distribution $\boldsymbol{\hat{\varphi}}_k$.
- Comment on how well the obtained clusters match the ground truth collection labels.
- Is there any difference compared to the clustering obtained in the GS inference case?
#Your code for clustering
- As we did before for the GS inference, consider now the case where we use the LDA model trained using VB inference to extract the expected topic mixture weights $\boldsymbol{\hat{\theta}}_d$ for the 4 additional test documents represented by the matrix M_test (as defined before). For this purpose, run VB inference where all $\boldsymbol{\beta}_k^*$ (and therefore also $q(\boldsymbol{\varphi}_k)$) stay fixed.
- For each of the 4 documents, produce a separate 2-dimensional simplex plot using the function plot_simplex.
- Label the corners of each simplex with the most likely word from the corresponding expected topic-specific distribution $\boldsymbol{\hat{\varphi}}_k$.
- In each simplex, plot the document-specific Dirichlet distribution $q(\boldsymbol{\theta}_d)$ using the function plot_dirichlet.
- What does this plot represent?
- How does it differ for the different test documents and why?
- How are these plots related to the plots for task 5 in the Gibbs Sampling section (i.e. the plots showing evolutions of $\boldsymbol{\theta}_d$ over the GS iterations)?
- In each simplex, plot the corresponding expected topic mixture weights $\boldsymbol{\hat{\theta}}_d$ as a single point using the function plot_points_in_simplex.
#Your code goes here