An Introduction to the Math of Variational Autoencoders (VAEs Part 2)

In an earlier post, I gave an intuitive (but informal) explanation of VAEs. In this post, I go through the math of VAEs to provide a deeper understanding.


1. The Objective

1.1 The Ideal Objective

In regular supervised learning, we attempt to optimize parameters \theta to maximize the likelihood of obtaining the data:

P(X, y | \theta)

where X is the data and y are the labels.

What then, is the objective in the case of VAEs? Recall that in the case of autoencoders, we want to learn a model that can generate reasonable samples when it receives codes sampled from the unit Gaussian  \mathcal{N}(0, I) . We will denote the codes as latent variable z . A “reasonable sample” will have a high probability of belonging to the same distribution as the data X . Therefore, our objective is to maximize P(X | z) for z that are sampled from P(z) = \mathcal{N}(0, I) .

Formally, we can write this objective as:

P(X | \theta) = \int P(X|z, \theta)P(z) dz

In practice, we maximize the log-likelihood of the data (also known as the evidence)

\log(P(X | \theta)) 

This is our ideal objective that we want to maximize. For the sake of brevity, I will leave the parameter \theta out of the equations from here on.

1.2 The Real Objective

An astute reader may be a bit confused here: in the previous post, I explained that the encoder produced samples of z according to the distribution

\mathcal{N}(\mu, \Sigma)

where \mu and \Sigma were dependent on X . Where is this distribution in the above equation?

Before explaining this, let me explain why directly optimizing $latex  \int P(X|z, \theta)P(z) dz  $ does not work in practice.

When we try to optimize $latex \int P(X|z, \theta)P(z) dz $, we are optimizing E_{z \sim P(z)}[P(X|z, \theta)] (these are equivalent equations). Since this is impossible to compute analytically, we optimize this is by sampling z from $ latex P(z) $ and computing $latex P(X|z, \theta) $, then performing gradient descent.

The problem is that for each data point X , a great majority of z \sim P(z) is unlikely to produce anything close to it. This makes learning extremely inefficient since we will need to sample a large number of z to obtain any meaningful output for X .

What if, instead of sampling from the prior P(z) , we sampled z from the posterior distribution P(z | X) for each data point X ? This would allow us to concentrate on samples that are likely to produce X and avoid wasting samples. We would need to slightly change the objective though.

Using Bayes Rule, we can prove that

\log P(X) = \log P(X|z) + \log P(z) - \log P(z|X) 

E_{z \sim P(z|X)}[ \log P(X) ] = \log P(X) = E_{z \sim P(z|X)}[ \log P(X|z) + \log P(z) - \log P(z|X) ] 

\log P(X) = E_{z \sim P(z|X)}[\log P(X|z)] - K_{DL}(P(z | X) ||  P(z)) 

Where K_{DL} is the KL-divergence.

From this equation, we see that the only modification we need to make is to add a KL-divergence term between the prior and the posterior, and we can optimize the likelihood of X by sampling from the posterior.

The only problem is, the posterior is impossible to calculate here. Remember, in the VAE, P(X|z) is an extremely complicated function computed by a neural network. We have no way of finding the true posterior.

This is where \mathcal{N}(\mu, \Sigma) comes in. Instead of sampling from P(z|X) , we sample from an approximate distribution Q(z|X)  = \mathcal{N}(\mu, \Sigma) , which is parameterized by a neural network.

Using a similar procedure as the above, we can prove that

\log P(X) - K_{DL}(Q(z | X) ||  P(z | X)) = E_{z \sim Q(z|X)}[\log P(X|z)] - K_{DL}(Q(z | X) ||  P(z)) 

This is almost entirely the same as the above objective, except for the KL-divergence term in the left-hand side.

This equation is the centerpiece of the mathematics behind VAEs. Let’s look into it a bit further.


2. Dissecting the Objective

Anyone who is familiar with variational methods will recognize this objective. It is a commonly occurring equation that underpins many variational bayesian methods.

\log P(X) - K_{DL}(Q(z | X) ||  P(z | X)) = E_{z \sim Q(z|X)}[\log P(X|z)] - K_{DL}(Q(z | X) ||  P(z)) 

Let’s look at the right-hand side. The right-hand side is composed of two terms. The first term

E_{z \sim Q(z|X)}[\log P(X|z)] 

measures how likely the sampled z are to generate the data. This basically pulls the model in the direction of explaining the data as well as it can.

The second term

K_{DL}(Q(z | X) ||  P(z))

measures how far Q(z | X) deviates from the prior. This essentially acts as a kind of regularization term, stopping the model from outputting extremely narrow distributions to easily explain the data.

The right-hand side is commonly known as the evidence lower bound (ELBO). This is because it sets a lower bound on the evidence (the log-likelihood of the data). To understand further, let’s look at the left-hand side.

The left-hand side of the equation is almost the log evidence. The only difference is the KL-term. Recall that the KL-divergence becomes zero if and only if the two distributions are exactly the same. This means that if Q(z|X) is sufficiently close to P(z|X) , then this term is exactly the log likelihood. Otherwise, the KL-divergence is always positive, so the right-hand side effectively acts as a lower bound on the log-evidence.

Therefore, if Q(z|X) is sufficiently flexible so that it can approximate P(z|X) , then we can optimize logP(X) by optimizing the right-hand side equation.

Thankfully, $ latex Q(z|X) $ is parametrized by a neural network so we can expect it to be sufficiently flexible. Since we can sample from Q(z | X) and the KL-divergence is analytically computable so the right-hand side can be optimized relatively easily in the case of VAEs.

This is the theoretical backbone of VAEs and why they work.

3. Conclusion

Hopefully, this post served to give you a better understanding of VAEs and why they work. Variational methods are applied in a wide variety of areas, so the above reasoning is worth understanding in depth.


An Intuitive Explanation of Variational Autoencoders (VAEs Part 1)

Variational Autoencoders (VAEs) are a popular and widely used method. Though there are many papers and tutorials on VAEs, many tend to be far too in-depth or mathematical to be accessible to those without a strong foundation in probability and machine learning.

In this post, I attempt to offer a gentler introduction to VAEs that omits equations as much as possible in favor of intuitive explanations. I do not mean to say that understanding the math is meaningless; instead, I will try to provide an intuitive understanding that will assist readers in actually tackling more advanced readings and papers.

1. What are VAEs for?

1.1 Autoencoders

Before diving into “variational” autoencoders, I want to familiarize the reader with the notion of an autoencoder (in case you are already familiar, please jump to section 1.2).

Basically, an autoencoder is a method of obtaining a low-dimensional representation of some higher level input, such as images. The idea is very simple: you train a neural network to take the image as input and return it as the output. This would be trivial if the model had enough capacity to store the entire image. Autoencoders teach the network to compress the information by forcing an “encoder network” to output a low dimensional representation z , which is then consumed by a “decoder network” to output the original image. This framework is illustrated below:


1.2 The motivation behind VAEs

Autoencoders are great, but wouldn’t it be great if we could feed encodings to the decoder that were not generated from the images in the source dataset? If we could, we could create new, reasonable images simply by sending a bunch of vectors to the decoder and observing the output.


What prevents us from using autoencoders in this manner? Simply put, we have no idea if the vectors we are passing to the decoder are valid. Chances are, if we give a random vector to the decoder, we will just get a bunch of random pixels back. This is because the decoder assumes that the encodings adhere to some unknown distribution, and does not account for encodings outside of it.

This is the problem that VAEs try to solve. Essentially, VAEs attempt to make sure that encodings that come from some known probability distribution can be decoded to produce reasonable outputs, even if they are not encodings of actual images.

2. Details of the VAE Model

The approach of VAEs is a seemingly simple extension of the autoencoder: instead of forcing the encoder to produce a single encoding, it forces the encoder to produce a probability distribution (in practice, a Gaussian distribution) over encodings. The decoder will then sample an encoding from that probability distribution, and try to reconstruct the original input.


This forces the decoder to produce reasonable outputs over a range of different encodings. Since a Gaussian distribution can be parametrized by its mean and covariance matrix, we have the encoder output a mean vector \mu and a covariance matrix \Sigma (restricted to a diagonal matrix for simplicity).

An astute reader would realize that this approach has a problem: there is nothing stopping the encoder from outputting an extremely narrow distribution, effectively a single value. Obviously, this is desirable from the perspective of the model, since the decoder would no longer have to care about covering a wide range of inputs. This defeats the purpose of the VAE thought.

In order to combat this, the VAE introduces a loss other than the reconstruction loss: the KL divergence between the distribution produced by the encoder and a unit Gaussian distribution (mean 0, covariance matrix is the identity matrix). The KL divergence is a way of measuring the distance between two distributions. Adding the KL divergence term to the loss forces the model to make a trade-off between ease of reproduction and covering the unit Gaussian. This allows us to safely create new images from encodings sampled from the unit Gaussian.

3. Optimization and the Reparameterization Trick

Autoencoders are simple to train since you simply have to backpropagate the reconstruction loss across the weights of the network.

VAEs are not as simple to optimize though. The key problem is that the sampling operation is not differentiable. This means we cannot propagate the gradients from the reconstruction error to the encoder.


Normally we would have to resort to more complicated optimization techniques like REINFORCE, but luckily we are able to resolve this problem through the reparameterization trick.

The idea behind this trick is to isolate the sampling from the parameter estimation (mean and variance). First, we sample \epsilon from a unit Gaussian distribution.  Thanks to the properties of the Gaussian distribution, we can make the sample to adhere to a Gaussian distribution with mean \mu and covariance matrix \Sigma by transforming it as follows:

z = \mu + \Sigma^{\frac{1}{2}} \epsilon


This removes the sampling operation from the flow of gradients, so we can now train the model through simple backpropagation.

This ease of training along with the generational model it produces are what make VAEs so powerful and popular.

4. Conclusion and Further Readings

Hopefully, this post served to foster an intuitive understanding of VAEs. Although I deliberately left equations out of this post, the math is actually quite interesting. I will be covering the mathematics behind VAEs in a separate, future post. For those who want more in-depth understanding now, I refer them to the following resources:

Tutorial on Variational Autoencoders

Auto-Encoding Variational Bayes

An Overview of Sentence Embedding Methods

Word embeddings/vectors are a powerful method that has greatly assisted neural network based NLP methods. As of now, word2vec and GloVe tend to be used as the standard method for obtaining word embeddings (although there are other methods out there).

Compared to this, the task of obtaining sentence embeddings remains more elusive. Although sentence embeddings can often be obtained in a supervised way by training a model on top of word embeddings, there are often circumstances where we want to obtain sentence embeddings in an unsupervised manner. For instance, we may want to conduct paraphrase identification or create a system for retrieving similar sentences efficiently, where we do not have explicit supervision data.

In this blog post, I aim to present an overview of some important unsupervised sentence embedding methods. This is by no means comprehensive; there are many more methods out there, and if there are any important ones that I have missed, it would be great if you could leave comments so that I can learn and improve this post 🙂


1. Topic Models

Although often overshadowed by neural network based methods, topic modeling is a surprisingly powerful and interpretable framework for embedding sentences.

Topic modeling basically aims to decompose a set of documents/sentences into several distinct topics based on the co-occurrence of words. For instance, in news corpuses, words corresponding to politics, science, and economics will tend to coexist within the same document. Topic modeling represents a document as a weighted sum of topics, which can be, in turn, used as a sentence embedding.

NMF and LDA are both popular methods for topic modeling. For those that are interested in the details, I have written a post on NMF here.


Topic modeling is easily interpretable and efficient to calculate. The legitimacy of the modeling can be examined by the obtained topics. Topic modeling is also easy to implement, with NMF being implemented in sklearn, and LDA being implemented in gensim.


Topic modeling does not take the order of the words in a sentence into account. It does not capture the meaning of the sentence either. For instance, the sentences

“I hate this movie because of its usage of CGI and acting”


“I love this movie because of its usage of CGI and acting”

would probably be represented as similar vectors, since their bag-of-words representations are extremely similar.


2. Paragraph Vectors (doc2vec)

Introduced in this paper, paragraph vectors (often referred to as doc2vec) are an extension of word2vec.  Word2vec attempts to predict a word(s) in a sentence from its surrounding words (or predict surrounding words from a single word, but this difference is not important in this post). Paragraph vectors extend this by training paragraph vectors as auxiliary information during this prediction task.

The details are shown in the figure below.


Each paragraph (or sentence/document) is associated with a vector. During training of the word2vec model, the paragraph vector is either averaged or concatenated with the context vector (composed of the word vectors of the surrounding words in the sentence), and used in the prediction.

Intuitively, paragraph vectors encode the additional information necessary to predict a word from its context that is not clear just from the context words themselves. For example, the following sentence

“I have a pet _____”

could contain a multitude of words, including “cat”, “dog”, and “snake”. The paragraph vector would assist prediction by encoding which word enters the blank.

As a word of caution, this is an explanation of the distributed-memory model. In the original paper, the distributed-BOW model is also introduced, but for the sake of brevity, I will omit it in this post.


In the original paper, paragraph vectors are explained to have the following advantages:

(1) They inherit the semantics of word vectors

For instance, the word “powerful” is closer to “strong” in the semantic space compared to “Paris”. Paragraph vectors can utilize these learned representations.

(2) They take word order into account

Though paragraph vectors do not use a sequential model like an RNN, they take the word order into account in a small context, similar to how an n-gram model with a large n would. This is important since n-gram models preserve a lot of information of the paragraph, including word order. Compared to n-gram models, paragraph vectors are significantly lower dimension and less prone to overfitting when used.

Paragraph vectors are implemented in gensim as well, making them very easy to use. They are also relatively quick to train as well.


What kind of information is actually captured in the paragraph vectors is unclear and difficult to interpret, since the prediction task encodes information both in word vectors and paragraph vectors. The quality of the vectors is also highly dependent on the quality of the word vectors.


3. Skip-Thought Vectors

Introduced in this paper, skip-thought vectors are – in a way – word2vec for sentences. Where word2vec attempts to predict surrounding words from certain words in a sentence, skip-thought vector extends this idea to sentences: it predicts surrounding sentences from a given sentence.

The model is summarized in the following figure:


Skip-thought vectors use the encoder-decoder model to first encode a sentence into a vector, then decode that representation into the surrounding sentences.
Despite being deceptively simple, skip-thought vectors have achieved outstanding results in a wide variety of tasks including sentiment analysis and paraphrase detection.


Skip-thought vectors not only take word order into account but also take the ordering of sentences into account. This allows it to encode rich information into the embedding. Unsurprisingly, skip-thought vectors often show the best performance among other unsupervised methods on a wide range of tasks.


Unlike the other methods, skip-thought vectors require the sentences to be ordered in a semantically meaningful way. This makes this method difficult to use for domains such as social media text, where each snippet of text exists in isolation. By virtue of it being a deep neural network model, it is also much slower to train than the other methods.


4. FastSent

Skip-thought vectors are slow to train. FastSent attempts to remedy this inefficiency while expanding on the core idea of skip-thought: that predicting surrounding sentences is a powerful way to obtain distributed representations.

Formally, FastSent represents sentences as the simple sum of its word embeddings, making training efficient. The word embeddings are learned so that the inner product between the sentence embedding and the word embeddings of surrounding sentences is maximized.


FastSent shares the advantages and disadvantages of skip-thought in its limitation of usage and the kind of information it encodes. It is vastly more efficient than skip-thought though, which is its main advantage.


FastSent sacrifices word order for the sake of efficiency, which can be a large disadvantage depending on the use-case.


5. Weighted Sum of Word Vectors

This paper showed that an “embarrassingly simple” baseline for sentence embedding can work well: a weighted sum of word vectors.

In this method, each word vector is weighted by the factor \frac{a}{a + p(w)} where a is a hyperparameter and p(w) is the (estimated) word frequency. This is similar to tf-idf weighting, where more frequent terms are weighted down.

The paper gives an interesting theoretical justification to this formulation, based on a model of sentences where each word in a sentence is generated by a “random-walk” of the discourse (sentence) vector.


Due to its simplicity, this method is very easy to implement by hand and fast to compute. It is also widely applicable and can be used for both short and long sentences from a wide variety of domains including social media.


Being a relatively new method, the weighted sum of word vectors has not been extensively used and tested compared to older methods like skip-thought, making its performance difficult to predict. Word order and surrounding sentences are ignored as well, limiting the information that is encoded.


6. Other Methods and Further Readings

There are too many sentence embedding methods to cover in this post, so I mainly focused on methods that were unique to natural language.

Similarly to images or any other domain, there are various sentence embedding methods that are universally applicable, such as denoising autoencoders and variational auto encoders.


For further information on methods explained or omitted above, I will refer the reader to the following resources:

CMU Neural Nets for NLP 2017 Lesson 6: Using/Evaluating Sentence Representations

Learning Distributed Representations of Sentences from Unlabeled Data

A Practical Introduction to NMF (nonnegative matrix factorization)

With the rise of complex models like deep learning, we often forget simpler, yet powerful machine learning methods that can be equally powerful. NMF (Nonnegative Matrix Factorization) is one effective machine learning technique that I feel does not receive enough attention. NMF has a wide range of uses, from topic modeling to signal processing.

This post aims to be a practical introduction to NMF. Instead of delving into the mathematical proofs, I will attempt to provide the minimal intuition and knowledge necessary to use NMF in practice and interpret the results. I’ll also show example code for topic modeling using NMF at the end of this post.

1. What is NMF, and what is the intuition behind it?

NMF (Nonnegative Matrix Factorization)  is a matrix factorization method where we constrain the matrices to be nonnegative. In order to understand NMF, we should clarify the underlying intuition between matrix factorization.

Suppose we factorize a matrix X into two matrices W and H so that X \approx WH

There is no guarantee that we can recover the original matrix, so we will approximate it as best as we can.

Now, suppose that X   is composed of m rows x_1, x_2, ... x_m W is composed of k rows w_1, w_2, ... w_k H is composed of m rows h_1, h_2, ... h_m . Each row in X can be considered a data point. For instance, in the case of decomposing images, each row in X   is a single image, and each column represents some feature.


Take the i-th row inX , x_i . If you think about the equation, you will find that x_i can be written as

The meaning of this equation becomes clearer when we visualize it:


Basically, we can interpret x_i to be a weighted sum of some components (or bases if you are more familiar with linear algebra), where each row in H is a component, and each row in W   contains the weights of each component.

Note that in this explanation, we treat each row of the input matrix X to be a single data point. In other explanations, each column might be considered a data point, in which case each column in W becomes a component and each column in H   becomes a set of weights.

In practice, we introduce various conditions on the components, so that they can be interpreted in a meaningful manner. In the case of NMF, we constrict the underlying components and weights to be non-negative. Essentially, NMF decomposes each data point into an overlay of certain components.

2. When do we use NMF?

NMF is suited for tasks where the underlying factors can be interpreted as non-negative. This is best demonstrated by examples, so let’s look at two examples below.

Example 1: Faces

A different, commonly used factorization method is PCA (Principle Components Analysis).

All you currently need to know about PCA is that it creates factors that can be both positive and negative.

Let’s see what happens when we try decomposing various faces. In the following image, we show the components (bases) we calculated with PCA on the left and the weights corresponding to a single image on the right. Red represents negative values.


If you look at the components, you see that they don’t make much sense. It’s also difficult to interpret what it means for a face to have a “negative” component.

Now, let’s see what happens when we use NMF.


Now, the components seem to have some meaning. Some of them look like parts of a nose or parts of an eye. We can consider each face to be an overlay of several components. This means we can interpret the decomposition of a face as having a certain weight of a certain nose type, a certain amount of some eye type, etc.


Example 2: Topic Modeling

Imagine if you wanted to decompose a term-document matrix, where each column represented a document, and each element in the document represented the weight of a certain word (the weight might be the raw count or the tf-idf weighted count or some other encoding scheme; those details are not important here).

What happens when we decompose this into two matrices? Imagine if the documents came from news articles. The word “eat” would be likely to appear in food-related articles, and therefore co-occur with words like “tasty” and “food”. Therefore, these words would probably be grouped together into a “food” component vector, and each article would have a certain weight of the “food” topic.

Therefore, an NMF decomposition of the term-document matrix would yield components that could be considered “topics”, and decompose each document into a weighted sum of topics. This is called topic modeling and is an important application of NMF.

Note that this interpretation would not be possible with other decomposition methods. We cannot interpret what it means to have a “negative” weight of the food topic. This is another example where the underlying components (topics) and their weights should be non-negative.

Another interesting property of NMF is that it naturally produces sparse representations. This makes sense in the case of topic modeling: documents generally do not contain a large number of topics.


3. How do we conduct NMF?

In order to conduct NMF we formalize an objective function and iteratively optimize it. NMF is an NP-hard problem in general, so we will aim for a good local minima.

Let’s look at the objective function.

Although there are some variants, a generally used measures of distance is the frobenius norm (the sum of element-wise squared errors). Formalizing this, we obtain the following objective:

minimize \, \| X - WH \|_{F}^{2} \,  w.r.t.  \, W, H \, s.t. \, W, H \geq 0

Next, we will introduce a couple of optimization methods.

I. Multiplicative Update

A commonly used method of optimization is the multiplicative update method.

In this method, W and H   are each updated iteratively according to the following rule:

H \leftarrow H \odot \frac{W^TX}{W^TWH}

W \leftarrow W \odot \frac{XH^T}{WHH^T}

where the matrix not being updated is kept constant.

The objective functions can be proved to be non-increasing under this rule. A less formal, but more intuitive explanation of this method is to interpret this as rescaled gradient descent.

Rescaled gradient descent can be written as follows:

H \leftarrow H - \eta \odot [W^TWH - W^TX] 

W \leftarrow W - \zeta \odot [WHH^T - XH^T] 

If we set \eta to be \frac{H}{W^TWH}  and \zeta   to be \frac{W}{WHH^T} , then we obtain the multiplicative update rule.

II. Hierarchical Alternating Least Squares

Though less commonly used, in this paper, Hierarchical Alternating Least Squares (HALS) is shown to be a faster method for computing NMF.

HALS updates one column of W at a time, using the property that the columns of W do not interact with each other. It computes the closed-form optimal solution for each column, keeping all other columns constant.

The rule can be expressed as follows:

W[:,i] \leftarrow max(0, \frac{Xh_i^T - \sum_{k \neq i}W[:,i]h_kh_i^T}{\| h_i \|^2})

where W[:,i]   is the i-th column of  W  and $latex  h_i $ is the i-th row of H .


4. Other commonly used tricks

I. Initialization

Any iterative optimization scheme is sensitive to its initialization. Though random initialization often performs sufficiently well, other methods of initialization also exist.

One such method uses SVD to compute a rough estimate of the matrices W   and H . If the non-negativity condition did not exist, taking the top k singular values and their corresponding vectors would construct the best rank k estimate, measured by the frobenius norm. Since W   and  H must be non-negative, we must slightly modify the vectors we use. Detailed discussion is beyond the scope of this blog post, so please refer to this paper for the details.

II. Regularization

We discussed the unregularized objective above, but we may want to impose stronger sparsity constraints or prevent the weights from becoming too large. To solve these problems, we can introduce l_1   and  l_2 regularization losses on the weights of the matrices.


5. Using NMF in practice

NMF is implemented in scikit-learn, making it very easy to use. If you understood the above explanation, you will see that it is not that difficult to implement by hand either.

The following is some example code of NMF for topic modeling on a term-document matrix, using scikit-learn. We use the built-in 20 newsgroups dataset and see what topics are generated.


import numpy as np

from sklearn.datasets import fetch_20newsgroups

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.decomposition import NMF

data= fetch_20newsgroups(remove=('headers', 'footers', 'quotes')).data

# convert the text to a tf-idf weighted term-document matrix

vectorizer = TfidfVectorizer(max_features=2000, min_df=10, stop_words='english')

X = vectorizer.fit_transform(data)

idx_to_word = np.array(vectorizer.get_feature_names())

# apply NMF

nmf = NMF(n_components=20, solver="mu")

W = nmf.fit_transform(X)

H = nmf.components_

# print the topics

for i, topic in enumerate(H):

    print("Topic {}: {}".format(i + 1, ",".join([str(x) for x in idx_to_word[topic.argsort()[-10:]]])))


Aside from the time for downloading the dataset, the code executes in less than 10 seconds on my Macbook pro.

Here are the results:

Topic 1: don,make,say,really,way,did,right,good,time,people
Topic 2: pc,appreciated,information,program,mail,looking,need,software,help,use
Topic 3: lord,church,christian,christians,believe,faith,christ,bible,jesus,god
Topic 4: nsa,algorithm,public,escrow,government,keys,clipper,encryption,chip,key
Topic 5: hd,cd,floppy,controller,ide,hard,disk,drives,scsi,drive
Topic 6: 50,20,price,condition,offer,shipping,10,new,sale,00
Topic 7: using,ms,program,problem,running,files,window,dos,file,windows
Topic 8: teams,win,hockey,play,season,players,year,games,team,game
Topic 9: ftp,pub,cc,university,cs,soon,banks,gordon,pitt,edu
Topic 10: new,oil,speed,good,miles,dealer,engine,bike,cars,car
Topic 11: ram,color,bus,driver,vga,cards,drivers,monitor,video,card
Topic 12: land,state,war,peace,arabs,jewish,arab,jews,israeli,israel
Topic 13: appreciate,address,windows,looking,info,tell,hi,mail,advance,thanks
Topic 14: understand,agree,moral,david,pretty,need,know,people,don,think
Topic 15: mean,ll,oh,looks,look,new,wondering,sounds,like,just
Topic 16: ftp,mean,info,file,read,let,anybody,don,does,know
Topic 17: sun,hp,address,email,bob,internet,article,dave,list,com
Topic 18: sci,earth,moon,station,gov,orbit,launch,shuttle,nasa,space
Topic 19: soviet,azerbaijan,said,genocide,turks,turkey,armenia,turkish,armenians,armenian
Topic 20: noticed,months,recently,list,tried,people,heard,seen,got,ve

The originally intended topics are:



sci.electronics talk.politics.misc







Seems relatively reasonable for such simple code.


6. Conclusion

I hope this blog post has informed you with the basics of NMF and has convinced you that NMF is a quick but surprisingly powerful machine learning method.


7. Further Readings

For more details about the mathematics involved and related methods, check out the readings below:


Scikit-learn user guide



Algorithms for Non-negative Matrix Factorization

The Why and How of Nonnegative Matrix Factorization

A Bird’s-Eye View of Design Patterns

Design patterns are a powerful set of tools that all engineers should understand. Unfortunately, they are also difficult to grasp by just reading about them, or learning about them in theory. Although design patterns are best learned through hands-on experience, it’s difficult to know when you can use design patterns without a bird’s-eye view of the subject.

While many books focus on individual patterns, in this post, I will discuss what design patterns are for, and when you should look for one.


  1. What are design patterns?

According to this site, a design pattern is

“a general repeatable solution to a commonly occurring problem in software design”.

This is a common definition, but it’s pretty vague. What are “commonly occurring problems” anyway? Do there have to have problems with the code for design patterns to have any meaning?

I think a more concrete definition will make design patterns a lot clearer. Although there are a few exceptions, most design patterns are used for the following purpose: “to maximize reuse of existing code“.

I’ll explain a bit more in detail later, but first, let’s clarify why this is important.

Firstly, and most obviously, reusing code means you have to write less code. Another important reason is that code reuse prevents modification of existing code. Generally, most code that is part of a large code base is tested and goes through all sorts of procedures to prevent bugs from occurring. Therefore, when you rewrite the code, you waste a large part of that continuous effort and create room for all sorts of new bugs. For example, other code might depend on the code you just modified, creating bugs with complex dependencies.

Design patterns are a tool to prevent this from happening.

  1. How do design patterns work?

Why is code reuse difficult anyway?
The primary reason is that requirements change. Suppose you are building an application that analyzes the sentiment of user input. At first, you may be content with detecting positive and negative sentiment. Afterwards, you might want to add new, more specific sentiments such as anger and sorrow. You might then want to expand the languages you take as input.

Writing code that is robust to these changes is not trivial. That’s why we use wisdom from the past: design patterns.

The basic idea of design patterns is to isolate factors that change for some reason from factors that do not change.

Let’s take the earlier example. The basic functionality that does not change is that you

  1. Take text from some language as input
  2. Process it
  3. Input the processed data into some model
  4. Extract the output
  5. Return the results to the user

Though this overall process does not change, the contents of each individual step do change. Processing would differ between languages; for example, English can be easily tokenized, but Chinese does not have spaces between words, so we need a tokenizer. The model that we use also differs when we are trying to detect different sentiments. In contrast, we might want to reuse the process of returning the results to the user.

Here’s where we should look for design patterns that match our use case. In this case, based on what changes and what does not change, a good design pattern to implement is the Template design pattern. The template design pattern basically creates a template for a sequence of algorithms, while leaving the implementation of the algorithms unspecified. This matches our demand perfectly.

Here’s what the basic code for a template in our example might look like:

import json

class SentimentClassifierTemplate:
    def clean_data(self, input):
        return input

    def tokenize(self, input):
        return [x.split() for x in input]

    def classify(self, words):
        # do something
        return words

    def return_results(self, results):
        return json.dumps({"results": results})

    def process(self, input):
        x = self.clean_data(input)
        x = self.tokenize(x)
        x = self.classify(x)
return self.return_results(x)

This is obviously not meant to be initialized and used (it should, therefore, actually be an abstract class, but since abstract classes are an entirely different subject I will leave this as a concrete class).

The crux of this code is the process method. This ties together the individual steps in the pipeline without specifying their contents. Therefore, it is easy to pick and choose implementations from each of the steps with minimal code rewriting. Engineers can simply subclass this class and rewrite the code that they need to change. For instance, if we wanted new input cleaning functionality (such as returning escaped characters to their original character), we simply create a subclass with a new clean_data method.

import html

class SentimentClassifierBasic(SentimentClassifierTemplate):
    def clean_data(self, input):
        return html.unescape(input)

Notice how we only have to write a small amount of code, and that we do not have to overwrite the original SentimentClassifierTemplate class at all! We can also reuse this clean_data method across languages by subclassing the SentimentClassifierBasic as well.

Let’s look back at what we’ve done. Why did this go well? Because design patterns gave us a way to isolate what changes and what does not. This is what most design patterns are supposed to do. They basically “factorize” the code so that we can reuse what is constant.

Of course, there are many other principles that are built into design patterns that make them effective (such as the principle of least knowledge or the Hollywood principle), but they all, in essence, come down to maximizing code reuse. Therefore, what you need to know the most when using design patterns is what changes in your code and what does not.

  1. How should design patterns be learned?

In the above section, I just presented the Template pattern without explaining how I found it. This should raise a question: how can we know which design pattern we can use in the first place?

I don’t think that engineers need to memorize all existing design patterns. However, I do think they need to have a vague idea of the options they have and be able to look up the appropriate design pattern when the need arises.

Therefore, I would recommend readers of this blog to read a book on design patterns, while noting what factors change and what factors are constant for each pattern. Then, when you build something, you can consult your notes and compare the factors that change in your application with your notes.

I would particularly recommend the book Head First Design Patterns.

The explanations are memorable and intuitive, despite its casual look.

  1. Conclusion

Design patterns are an important but relatively difficult-to-grasp notion. I hope this post has given the reader better insight into what they are for, as well as how to go along learning them.

How to write better code in Jupyter notebooks (and why you should)

Jupyter notebooks are becoming increasingly popular as a method of sharing data analysis. I really love Jupyter and think it is an awesome way of sharing insights. Unfortunately, Jupyter notebooks are often hard to understand. I suspect this is because many people who write notebooks are not from a software engineering background, and often are not educated about the importance of clean code.

This post is aimed to convince people in data analysis positions to invest in clean coding skills, and to introduce a few principles/techniques I find useful when writing notebooks myself.

What exactly does “clean code” mean?
Every programmer has his or her own opinion on what constitutes clean code. I don’t want to start a debate about this topic so I will give 2 features of “clean code” that I think many people could sympathize with:

1. It is easy to understand the intent and functionality
2. It can be easily modified

Take the following example (taken from the book “Clean Code” and translated to python ):

def func1(list1):
	list2= []
	for x in list1:
		if x[0] == 4:
	return list2

Compare it with the following code:

def get_flagged_cells(game_board):
	flagged_cells = []
	for cell in game_board:
		if cell.status_value == FLAGGED:
	return flagged_cells

These code snippets have the same functionality; they only differ in style. However, the second clearly shows it is acquiring “flagged cells” whereas the first is impossible to decipher.

Why is clean code important?
Clean code is important in production code, largely because it improves the maintainability of the code base. In analysis code, the reasons are slightly different, though similar in principle.

1. Communicating results
You rarely, if ever, only conduct analysis for yourself. Even if you do, you will often need to look at your own results weeks after you have actually done your analysis. If your code is clear, then it is easy for you to say, “just read my notebook” in order to share your findings with others. It is also easy for you to reuse past analysis results. On the contrary, if your code is full of temporary variables, it is very difficult for anyone (including your future self!) to understand the results. You will often end up doing the same analysis since you do not remember what your past code did.

2. Checking for correctness
Let’s face it: we all make mistakes. Some analysis you did might have a flawed join, resulting in missing data and a skewed sample. You might be using an incorrect train-test split. There are so many places where your analysis and experiments can go wrong. The worst thing about this is that incorrect analysis does not lead to clear bugs, and, in the worst case, can lead to misleading results.
The best way to prevent incorrect analysis, and to be able to check up on your analysis when the results seem wrong, is to write clear code.

3. Extensibility
Often, if we successfully run some analysis on our code, others will want to follow up by running similar code. For example, if you run a machine learning algorithm, others might want to change the parameters or the preprocessing of the input.
This is easy when the code is clear. When it is not, it will be far more difficult for others to collaborate with you when your code is a mess. This is a loss for you and others: you will lose a chance to have your results be further extended, and others will lose a chance to learn from more analysis.

How can we write clean code?
Before I give any tips, the best way to write clean code is to practice. Even if you are only writing code for yourself, make a habit of writing clear and good code.
That being said, here are some tips for writing clean code on Jupyter notebooks.

1. Give good names to variables
This is a staple of good code. Good names might be the most important feature of good code. Remember my example earlier?

def func1(list1):
	list2= []
	for x in list1:
		if x[0] == 4:
	return list2

def get_flagged_cells(game_board):
	flagged_cells = []
	for cell in game_board:
		if cell.status_value == FLAGGED:
	return flagged_cells

The only difference between these two functions is their variable names (not withholding the slightly different variable access methods). This is the power of naming.

2. Avoid duplication
Duplication signals that you can abstract something into a function. Generally, a well-named function communicates its intent more clearly than a few lines of code.

3. Write good comments
Writing comments is often seen as a necessary evil in production code. Incorrect comments are harmful since they are actively misleading. Comments are also hard to maintain when changes happen.
In Jupyter, you can be a bit more liberal in your use of comments. This is because Jupyter notebooks are rarely maintained. They are read and forked, but they do not change often. Therefore, comments are a great way to clarify your intent. In fact, I use comments extensively to organize my code.

4. Don’t reorder cells
When you are writing a notebook, and notice some code lying above that you could rerun to achieve your objective, it is tempting to move your cursor up and execute the cell, then do nothing.
Don’t fall into this trap. Either move the cell downwards or rewrite it. Otherwise, people will not be able to execute your notebook from the top to the bottom. It also makes the code really confusing when you reread it: you will have no idea of the flow of logic and will need to remember the change you made in the past in order to understand. In other words, if you leave this kind of edit alone, your notebook will probably rot.

5. Don’t leave “dead” cells lying around
Cells that do not have any purpose anymore should be deleted since they are actively harmful to the readability of the notebook. Just delete them. If you need them in the future, it is not that difficult to reproduce that work. If you are really worried, commit your notebook to a source code control system like git. Do NOT comment code out and leave it lying around. This will make your code rot.

6. One notebook, one purpose
It is ever so tempting to throw all your code into one notebook, especially if the pre-processing is shared. My advice: don’t. Abstract the shared preprocessing into a library if you must.
The reason for this is because you (and others) will get confused when you try to search for some processing you have done in the past. What if you need to check where you handled missing data? What if you want to run a different algorithm?
Each notebook should have one clear purpose, and only one. Don’t preprocess your data and run your machine learning algorithm on the same notebook.

Of course, all of these tips are simply guidelines. They can be broken if the need is clear. However, it is best to stick to them as much as possible.

Further Readings
If you want to learn more about clean code, I recommend you read the book “Clean Code” by Robert C. Martin. It is full of great tips and clear examples of how to make your code better.
Writing clean code requires a knowledge of general best practices, but also requires knowledge about software engineering in general.
Next time, I am planning to write about design patterns, which are a set of patterns for writing good code.

How to write a web crawler in Python (with examples)

The internet is a treasure trove of data, but in order to make use of it, you must be able to extract the information you want from the noisy data. This is why we use crawlers.
Python has a rich ecosystem of crawling related libraries. This post is not an introduction to those libraries. This post aims to inform the reader of how crawling works through implementing a simple crawler from scratch.
I will introduce an example of a custom crawler that I wrote myself to crawl twitter in a future article, so if you are interested on actual applications of self-implemented crawlers, please take a look 🙂
Without further ado, let’s start!
Table of Contents:
1. How do crawlers work?
2. Implementation
3. Improvements


1. How do crawlers work?
The workings of a crawler are very simple. Below is a step by step explanation of what kind of actions take place behind crawling.


1.1 Networks, requests, and the internet.
When you look at a page on the Internet through a browser like Firefox or Google Chrome, you are getting the contents of the page from a remote server (of course the results might be cached, and there are all sorts of small details that might differ, but bear with me). You get these pages by sending a “request” to the server. The server returns the corresponding results by sending a “response”.
There are various kinds of “requests”, but for the sake of simplicity I will only cover GET requests in this article.
A GET request is basically the kind of request that happens when you access a url through a browser. The way a remote server knows that the request being sent to them is directed at them, and what resource to send back, is by looking at the url of the request. For example, a url like
Is sent to ““, and requests the resource corresponding to “foo”. The server at thus returns the resource to the referrer (the sender of the request).
The most important takeaway from this section is that browsing through pages is nothing more than simply sending requests and receiving responses.  Anything that can be accessed on the Internet can be acquired (theoretically) through this method.
Having clarified this, now we can understand the workings of a crawler.


1.2 Extracting information
Basically, a crawler does roughly the same thing that a browser does: it sends a request, and acquires the response.
The difference between a crawler and a browser, is that a browser visualizes the response for the user, whereas a crawler extracts useful information from the response.
The question is, how exactly do you extract the necessary information from the response?
This actually differs from case to case, but generally, you will have to use a html parser.
Web pages are (mostly) written in html. A detailed explanation of html and parsing it is outside the scope of this blog post, but I will give a brief explanation that will suffice for the purposes of understanding the basics of crawling. Html, for those who are not familiar with it, stands for hyper text markup language, and is a language for expressing the contents of the page in a a structural manner. The structure of the page is expressed by enclosing information between tags, like below.
Hello World!
Tags can have several attributes, such as ids and classes. Tags can also be nested. For example:
represents an a tag within a div tag with the class “hoge”.
The pages you crawl will (hopefully) have some common underlying structure, and you will be exploiting that to extract the necessary information. The underlying structure will differ for each set of pages and the type of information. Therefore, before starting to crawl, you must investigate the structure of the pages you are trying to extract information from.
The way you can investigate is by using the web inspector on your browser. In Chrome, you can do this by opening the DevTools, or by right clicking on any element on the page and selecting “Inspect”. This will open up a tool that allows you to examine the html of the page at hand.


1.3 Getting the next url to crawl
Just extracting information from one page is not very useful. Most of the time, you will want to crawl multiple pages. However, it is often difficult or tedious to list up all the pages you want to crawl in advance. This is why crawlers will often extract the next url to crawl from the html of the page.
The next url you want to access will often be embedded in the response you get. For example, if you are crawling search results, the link to the next set of search results will often appear at the bottom of the page.
By dynamically extracting the next url to crawl, you can keep on crawling until you exhaust search results, without having to worry about terminating, how many search results there are, etc.


2. Implementation
Having the above explained, implementing the crawler should be, in principle, easy.
All you have to do is to manage the following 3 tasks:
     1. Get the response from a url in the list of urls to crawl
     2. Extract information from the url
     3. Update the list of urls to crawl


1 and 2 will require more specialized libraries. The libraries I would recommend are:


The following is a sample crawler that crawls Wikipedia and prints out the list of search results for a given url.
from bs4 import BeautifulSoup
import requests
from collections import deque

def parse_html(html):
    soup = BeautifulSoup(html)
    title_element = soup.find("div", attrs={"class": "mw-search-result-heading"}).find("a")
    title = title_element.attrs["title"]
    links = soup.find_all("a", attrs={"class": "mw-nextlink"})
    return title, [x.attrs["href"] for x in links if "href" in x.attrs]

def crawl(url_list):
    urls_to_crawl = deque(url_list)
    output_data = []
    while len(urls_to_crawl) > 0:
        url = urls_to_crawl.pop()
        url_contents = requests.get(url).text
        data, links = parse_html(url_contents)
    return output_data
The crawl function is where all the work takes place. The list of urls is managed as a collections.deque object. Urls are inserted and extracted from this object. The crawler uses the requests library to send a request to a url, fetches the response, and handles the response in the handle_response function.
The handle_response function extracts the result titles in the first for loop and the links to the following pages in the second for loop.
The way that the parse_html function extracts the titles is by using their common structure: a div tag with the class “mw-search-result-heading” that enclose an a tag with the title as the title attribute. The links to the following pages are extracted similarly: they are all a tags with the class “mw-nextlink” and the url of the next page as the href attribute.
All newly found links are pushed to the queue, and crawling continues.
The above is the basic structure of any crawler. If you want to user your crawler more extensively though, you might want to make a few improvements:
1. Error handling
When you crawl multiple pages, chances are, you are going to encounter some dysfunctional or nonexistent pages. You will want to make sure you handle errors such as connection errors or servers that never respond appropriately.
2. More detailed finish conditions
Often times, you only need to crawl N results, and any further results are unnecessary. You will want the option to terminate your crawler based on the number of items you have acquired.
3. Multi processing
This is more optional than the above two. However, when the number of urls to crawl is large, and the extraction process is long, multiprocessing can be necessary to obtain the results you want in a reasonable amount of time.
I hope you this post has made you see how simple crawling actually is, and how implementing your own crawler is relatively simple.
Thank you for reading this post, and happy crawling!