# Notes on topic modeling

#### 24 Aug 2018 . category: DL . Comments #openai

I’ve talked a bit about topic modeling on this blog before, pre-Scholars program. I revisit the subject now as a potential automatic reward function for the LC-GAN I am using in my final project. My hypothesis is that topic modeling can distill the sentences in my music commentary data set down into distinct sentence types; this information can then be used to teach the LC-GAN what types of sentences to encourage the downstream language model to generate.

Fortunately, my experiments this week suggest that topic modeling can indeed make a good reward function.

Topic modeling is a set of techniques for discovering “topics” (clusters of words) that best represent collections of text. This is a form of dimensionality reduction, itself a set of techniques for transforming a high-dimensional data set (e.g., sentences) into a lower-dimensional one (e.g., topics) with as little information loss as possible. It can be thought of as a kind of summarization, distillation, or representation learning.

Topic modeling can be particularly useful for:

1. Finding structure in, and developing an understanding of, collections of text
2. Clustering similar texts and words
3. Inferring these abstract topics on new texts

I explored a topic modeling technique called latent Dirichlet allocation (LDA) this week using the wonderful gensim and pyLDAvis packages for Python. LDA is a neat application of Bayesian inference: it learns probability distributions both over words within a topic (so a topic can be represented as a mixture of words) and over topics within a text (so a text can be represented as a mixture of topics).

## Choosing n_topics

The single most important hyperparameter to choose when performing LDA is the number of topics, n_topics, with which to cluster your collection of texts. I’ve found that this is a bit of a Goldilocks problem: too low, and topics become unfocused, with not-so-related texts appearing in the same topic; too high, and topics are both unfocused and sometimes fragmented, with a single topic splitting across multiple topics.

I observed a range of n_topics settings on my commentary data set. I care most about encouraging the generation of what I call song descriptions: descriptive, almost flowery language about the music itself (e.g., “beginning with driving drums and bass paired with his songbird vocal harmonies…”). Therefore, I’d like a single topic to isolate this type of descriptive sentence. I would consider it a bonus if another topic could isolate junk types such as tour dates, repetitive nonsense, and promotional writing, so that I can discourage those types simultaneously.

| n_topics | topic descriptions |
| --- | --- |
| 2 | 1) repetitive nonsense + song description 2) tour and release dates + expository language on artist |
| 3 | 1) tour dates + song description 2) repetitive nonsense + personal-style writing 3) expository language on artists |
| 4 | 1) song description 2) repetitive nonsense + personal-style writing 3) tour dates + repeated phrases across sentences 4) wwws + expository language on artist and releases |
| 5 | 1) promotional writing + “check it out”s + social media sharing 2) personal-style writing 3) common prefixings + expository language on artists and releases 4) non-English language + song description + repetitive nonsense 5) tour dates + expository language on artists |

Full observations and results (including pyLDAvis visualizations like the one below) are available at quilt.ipynb.

n_topics=4 appears to be the ideal setting for achieving the separation I want. I would like to reiterate, though, how subjective and task-oriented this choice is: if my objective changed to, say, just discouraging promotional writing and social media sharing, then 5 topics might be more suitable.

A few notes on reading pyLDAvis visualizations:

• Circle size is proportional to the topic’s overall prevalence in the corpus.
• Saliency (blue bars) measures the overall frequency of a term in the corpus.
• Relevance (red bars) is a weighted measure of the frequency of each term in a topic. It is meant to show how much information a term conveys about a topic. The smaller the weight $$\lambda$$ (which you can slide to adjust), the more preference is given to terms distinctive to a topic, making that topic more distinguishable from the others.
• Hidden feature: if you hover over a word on the right, you can see all topics in which the term appears.