Notes on topic modeling

24 Aug 2018 · category: DL
#openai

I’ve talked a bit about topic modeling on this blog before, pre-Scholars program. I revisit the subject now as a potential automatic reward function for the LC-GAN I am using in my final project. My hypothesis is that topic modeling can distill the sentences in my music commentary data set down into distinct sentence types; this information can then be used to teach the LC-GAN what types of sentences to encourage the downstream language model to generate.

Fortunately, my experiments this week suggest that topic modeling can indeed make a good reward function.

Topic modeling is a set of techniques for discovering “topics” (clusters of words) that best represent collections of text. It is a form of dimensionality reduction, itself a set of techniques for transforming a high-dimensional data set (e.g., sentences) into a lower-dimensional one (e.g., topics) with as little information loss as possible¹. It can be thought of as a kind of summarization, distillation, or representation learning.

Topic modeling can be particularly useful for:

  1. Finding structure in, and an understanding of, collections of text
  2. Clustering similar texts and words
  3. Inferring the learned abstract topics for new, unseen texts

I explored a topic modeling technique called latent Dirichlet allocation (LDA) this week using the wonderful gensim and pyLDAvis packages for Python. LDA is a neat application of Bayesian inference: it learns probability distributions both over the words within each topic (so a topic can be represented as a mixture of words) and over the topics within each text (so a text can be represented as a mixture of topics).
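To ground this, here is a minimal sketch of fitting LDA with gensim; the two toy sentences stand in for my tokenized commentary data and are purely illustrative:

```python
# Minimal gensim LDA sketch; `sentences` stands in for tokenized
# commentary data (illustrative only).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

sentences = [
    ["driving", "drums", "bass", "songbird", "vocal", "harmonies"],
    ["tour", "dates", "announced", "tickets", "on", "sale", "friday"],
]

dictionary = Dictionary(sentences)                   # word <-> id mapping
corpus = [dictionary.doc2bow(s) for s in sentences]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())                            # top weighted words per topic
```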

Choosing n_topics

The single most important hyperparameter to choose when performing LDA is n_topics, the number of topics into which to cluster your collection of texts. I’ve found that this is a bit of a Goldilocks problem: too low, and topics become unfocused, with not-so-related texts lumped into the same topic; too high, and topics are both unfocused and sometimes fragmented, with what should be a single topic splitting across several.

I experimented with a range of n_topics values on my commentary data set. I care most about encouraging the generation of what I call song descriptions: descriptive, almost flowery language about the music itself (e.g., “beginning with driving drums and bass paired with his songbird vocal harmonies…”). I would therefore like a single topic to isolate this type of descriptive sentence. I would consider it a bonus if another topic could isolate junk types like tour dates, repetitive nonsense, and promotional writing, so that I can discourage those types simultaneously².
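For concreteness, here is a sketch of the kind of sweep I ran (reusing corpus and dictionary from the earlier snippet; num_words and the fixed seed are my own illustrative choices):

```python
# Sweep n_topics and eyeball the resulting topics, as summarized in the
# table below (reuses `corpus` and `dictionary` from the earlier sketch).
for n_topics in (2, 3, 4, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=n_topics, passes=10, random_state=0)
    print(f"--- n_topics={n_topics} ---")
    for topic_id, top_words in lda.print_topics(num_words=8):
        print(topic_id, top_words)
```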

| n_topics | topic descriptions |
| --- | --- |
| 2 | 1) repetitive nonsense + song description; 2) tour and release dates + expository language on artist |
| 3 | 1) tour dates + song description; 2) repetitive nonsense + personal-style writing; 3) expository language on artists |
| 4 | 1) song description; 2) repetitive nonsense + personal-style writing; 3) tour dates + repeated phrases across sentences; 4) “www”s + expository language on artist and releases |
| 5 | 1) promotional writing + “check it out”s + social media sharing; 2) personal-style writing; 3) common prefixes + expository language on artists and releases; 4) non-English language + song description + repetitive nonsense; 5) tour dates + expository language on artists |


Full observations and results (including pyLDAvis visualizations like the one below) are available in quilt.ipynb.
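The visualizations themselves take only a couple of lines of pyLDAvis. A sketch, assuming the model objects from the snippets above (the gensim helper module was named pyLDAvis.gensim at the time; newer releases call it pyLDAvis.gensim_models):

```python
# Render the interactive LDAvis view in a notebook.
import pyLDAvis
import pyLDAvis.gensim  # named pyLDAvis.gensim_models in newer releases

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis)
```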

n_topics=4 appears to be the ideal setting for achieving the separation I want. I would like to reiterate, though, how subjective and task-dependent this choice is: if my objective changed to, say, just discouraging promotional writing and social media sharing, then 5 topics might be more suitable.
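To make the reward-function idea concrete, here is one possible sketch (my own illustration, not settled project code) of scoring a sentence by the probability mass the fitted model assigns to the song-description topic; SONG_DESCRIPTION_TOPIC is a hypothetical index found by inspecting the topics:

```python
# Hypothetical reward: probability the LDA model assigns to the
# song-description topic for a tokenized sentence.
SONG_DESCRIPTION_TOPIC = 0  # hypothetical index, identified by inspection

def topic_reward(sentence_tokens):
    bow = dictionary.doc2bow(sentence_tokens)
    # (topic_id, probability) pairs for this sentence
    topics = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return topics.get(SONG_DESCRIPTION_TOPIC, 0.0)

print(topic_reward(["driving", "drums", "and", "swirling", "synths"]))
```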

A few notes on reading pyLDAvis visualizations:

  • Circle size is proportional to the topic’s overall prevalence in the corpus.
  • Saliency (blue bars) measures the overall frequency of a term in the corpus.
  • Relevance (red bars) is a weighted measure of each term’s frequency within a topic, meant to show how much information the term conveys about that topic. The smaller the weight \(\lambda\) (which you can adjust with the slider), the more preference is given to terms distinctive to that topic, making it more distinguishable from the others (see the formula after this list).
  • Hidden feature: if you hover over a word on the right, you can see all topics in which the term appears.
  • Helpful explainer video: http://stat-graphics.org/movies/ldavis.html
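
For reference, the relevance definition from Sievert & Shirley’s LDAvis paper (2014), where \(p(w \mid t)\) is the probability of term \(w\) within topic \(t\) and \(p(w)\) is its overall frequency in the corpus:

\[
\mathrm{relevance}(w, t \mid \lambda) = \lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}
\]

At \(\lambda = 1\), terms are ranked purely by their within-topic probability; lowering \(\lambda\) boosts terms that are frequent in the topic relative to the corpus as a whole.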

Follow my progress this summer with this blog’s #openai tag, or on GitHub.

Footnotes

  1. Autoencoders like VAEs do dimensionality reduction as well! 

  2. I originally had the idea to also use topic modeling to just remove “junk-ier” sentences from the data set entirely. This seemed like a reasonable method of cleaning the data set to me. However, my mentor Natasha persuaded me otherwise, arguing that it is more important to keep that data in order to improve the network’s understanding of language more generally. 

