I’ve talked a bit about topic modeling on this blog before, pre-Scholars program. I revisit the subject now as a potential automatic reward function for the LC-GAN I am using in my final project. My hypothesis is that topic modeling can distill the sentences in my music commentary data set down into distinct sentence types; this information can then be used to teach the LC-GAN what types of sentences to encourage the downstream language model to generate.
Fortunately, my experiments this week suggest that topic modeling can indeed make a good reward function.
Topic modeling is a set of techniques for discovering “topics” (clusters of words) that best represent collections of text. This is a form of dimensionality reduction, itself a set of techniques for transforming a high-dimensional data set (i.e., sentences) into a lower dimension (i.e., topics), with as little information loss as possible.¹ It can be thought of as a kind of summarization, distillation, or learning of representation.
Topic modeling can be particularly useful for:
I explored a topic modeling technique called latent Dirichlet allocation (LDA) this week using the wonderful `gensim` and `pyLDAvis` packages for Python. LDA is a neat application of Bayesian inference: it learns a set of probability distributions for both words in a topic (where a topic can be represented by a mixture of words) and topics in a text (where a text can be represented by a mixture of topics).
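To make that concrete, here is a minimal sketch of fitting an LDA model with `gensim`. The toy `sentences` list and bare-bones preprocessing are placeholders for illustration, not my actual pipeline; note that `gensim` calls the topic-count parameter `num_topics`.

```python
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Placeholder sentences standing in for the commentary data set.
sentences = [
    "beginning with driving drums and bass paired with his songbird vocal harmonies",
    "the band kicks off their european tour this june",
    "check it out and share it with your friends",
]

# Tokenize and lowercase each sentence.
tokenized = [simple_preprocess(s) for s in sentences]

# Map tokens to integer ids, then convert each sentence to a bag-of-words vector.
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Fit LDA: each topic is a distribution over words,
# and each sentence is a distribution over topics.
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=4, passes=10, random_state=0)

# Inspect the top words per topic.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```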
The single most important hyperparameter to choose when performing LDA is the number of topics `n_topics` with which to cluster your collection of texts. I’ve found that this is a bit of a Goldilocks problem: too low and topics become unfocused, with not-so-related texts appearing in the same topic; too high and topics are both unfocused and sometimes fragmented, with a single topic splitting across multiple topics.
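As a rough sketch of that kind of sweep (reusing the `corpus` and `dictionary` from the snippet above; the specific topic counts and word counts are just illustrative):

```python
from gensim.models import LdaModel

# Refit the model at several topic counts and eyeball the top words per topic.
for n_topics in (2, 3, 4, 5):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=n_topics, passes=10, random_state=0)
    print(f"--- n_topics = {n_topics} ---")
    for topic_id, words in lda.print_topics(num_words=8):
        print(f"  topic {topic_id}: {words}")
```

`gensim` also provides a `CoherenceModel` for comparing topic counts more quantitatively, though as I discuss below, the choice here is ultimately subjective and task-oriented.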
I tried a range of `n_topics` values on my commentary data set. I care most about encouraging the generation of what I call song descriptions: descriptive, almost flowery language about the music itself (e.g., “beginning with driving drums and bass paired with his songbird vocal harmonies…”). Therefore, I’d like a single topic to isolate this type of descriptive sentence. I would consider it a bonus if another topic could isolate junk types like tour dates, repetitive nonsense, and promotional writing so that I can discourage these types simultaneously.²
| `n_topics` | topic descriptions |
| --- | --- |
| 2 | 1) repetitive nonsense + song description 2) tour and release dates + expository language on artist |
| 3 | 1) tour dates + song description 2) repetitive nonsense + personal-style writing 3) expository language on artists |
| 4 | 1) song description 2) repetitive nonsense + personal-style writing 3) tour dates + repeated phrases across sentences 4) wwws + expository language on artist and releases |
| 5 | 1) promotional writing + “check it out”s + social media sharing 2) personal-style writing 3) common prefixings + expository language on artists and releases 4) non-English language + song description + repetitive nonsense 5) tour dates + expository language on artists |
Full observations and results (including `pyLDAvis` visualizations like the one below) are available at `quilt.ipynb`.
`n_topics=4` appears to be the ideal setting for achieving the separation I want. I would like to reiterate, though, how subjective and task-oriented this choice is: if my objective changed to, say, just discouraging promotional writing and social media sharing, then 5 topics might be more suitable.
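To tie this back to the reward-function idea, here is a hypothetical sketch of how the fitted model could score a generated sentence by the probability mass its topic distribution places on the song-description topic. The topic index, threshold-free scoring, and helper name are illustrative assumptions, not my actual reward function.

```python
from gensim.utils import simple_preprocess

SONG_DESCRIPTION_TOPIC = 0  # whichever topic the inspection above identified

def topic_reward(sentence, lda, dictionary, topic_id=SONG_DESCRIPTION_TOPIC):
    """Return the probability mass LDA assigns to the target topic for this sentence."""
    bow = dictionary.doc2bow(simple_preprocess(sentence))
    # get_document_topics returns (topic_id, probability) pairs for the sentence.
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return dist.get(topic_id, 0.0)

reward = topic_reward("driving drums and bass paired with songbird vocal harmonies",
                      lda, dictionary)
```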
A few notes on reading `pyLDAvis` visualizations:
1. Autoencoders like VAEs do dimensionality reduction as well! ↩
2. I originally had the idea to also use topic modeling to just remove “junk-ier” sentences from the data set entirely. This seemed like a reasonable method of cleaning the data set to me. However, my mentor Natasha persuaded me otherwise, arguing that it is more important to keep that data in order to improve the network’s understanding of language more generally. ↩