OpenAI Scholar: Final Project

03 Aug 2018 . category: DL

This post is a replica of my OpenAI Scholar final project proposal, also available here.

UPDATE 8/31/18: “deephypebot: an overview” is a revamped, more comprehensive version of this post. Check it out!


tl;dr- auto-generating conditioned music commentary on Twitter.

The theme of my summer as an OpenAI Scholar has been explorations around music + text. I find the language around music - manifested by hundreds of “nice, small blogs” on the internet - to be a deep and unique well of creative writing.

As such, my final project will pay homage to these labors of love on the web and attempt to auto-generate consistently good, entertaining new writing about songs, based on a set of characteristics of each song and knowledge of past human music writing.

The project will culminate in a Twitter bot (@deephypebot) that will monitor other music feeds for songs and automatically generate thoughts/opinions/writing about the songs.

Project Architecture

Project architecture diagram

Training data

My training data consists of ~20,000 blog posts, each containing writing about an individual song. The count started at about 80K post links from five years of popular songs on the music blog aggregator Hype Machine; I then filtered for English-language, non-aggregated posts (i.e., excluding “round-up”-style posts about multiple songs) about songs that can be found on Spotify. There was some additional attrition because many post links no longer exist. Finally, I manually cleaned up symbols, markdown, and writing that I deemed non-commentary.

From there, I split the commentary into sentences, which are a good length for a variational autoencoder (VAE) model to encode.
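The post doesn't specify which splitting tool was used, so here is a minimal regex-based sentence-splitting sketch (a real pipeline might use nltk's `sent_tokenize` or spaCy instead):

```python
import re

def split_sentences(commentary):
    """Naively split blog commentary into sentences on ., !, ? boundaries.
    (A hypothetical stand-in for a proper sentence tokenizer.)"""
    parts = re.split(r"(?<=[.!?])\s+", commentary.strip())
    return [p for p in parts if p]
```

Each resulting sentence then becomes one training example for the VAE.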

Neural network

A language model (LM) estimates a probability distribution over sequences of linguistic units (characters, words, sentences) and can be used to generate text. This project centers on a sequence-to-sequence conditional variational autoencoder (seq2seq CVAE) that generates text conditioned on a thought vector z plus attributes of the referenced music v, simply concatenated together as cat(z, v). The conditioning input fed into the CVAE is provided by an additional latent constraints generative adversarial network (LC-GAN) that helps control attributes of the generated text.

The CVAE consists of an LSTM-based encoder and decoder, and once trained, the decoder can be used independently as a language model conditioned on latent space cat(z, v) (more on seq2seq VAEs here). The conditional input is fed into the decoder only.
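As a sketch of the conditioning mechanism (the dimensions below are made up for illustration), the decoder's latent input is simply the concatenation of the thought vector z and the song-attribute vector v:

```python
def make_conditional_input(z, v):
    """cat(z, v): concatenate the thought vector z (from the encoder, or
    from the LC-GAN generator at inference time) with song attributes v.
    The combined vector conditions the decoder."""
    return list(z) + list(v)

# Toy example: a 4-d thought vector plus 2 song attributes (e.g., tempo, valence)
z = [0.12, -0.55, 0.31, 0.08]
v = [0.9, 0.4]
cond = make_conditional_input(z, v)  # 6-d conditioning input for the decoder
```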

The LC-GAN is used to determine which conditional inputs cat(z, v) to this LM tend to generate samples with particular attributes (more on the LC-GAN here). This project uses LDA topic modeling as the automatic reward function, encouraging samples in a descriptive, almost flowery style (more on LDA topic modeling here). The generator is trained to fool the discriminator with “fake” samples (i.e., not from the training data) that ostensibly come from the desired topic set. Once trained, the generator can be used independently to provide conditional inputs to the CVAE at inference time.
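A minimal sketch of how an LDA topic distribution could serve as that reward signal (the indices of the "descriptive/flowery" topics here are hypothetical):

```python
def topic_reward(topic_dist, target_topics):
    """Score a generated sample by the probability mass its LDA topic
    distribution places on the desired descriptive/flowery topics."""
    return sum(p for i, p in enumerate(topic_dist) if i in target_topics)

# Toy example: a 3-topic distribution where topic 1 is the "flowery" one
reward = topic_reward([0.1, 0.6, 0.3], {1})  # 0.6
```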

Making inference requests to the network

Once the neural network is trained and deployed, this project will use it to generate new writing conditioned on either audio features or genre information pulled from the Spotify API (depending on which conditioning seems to work better).
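Spotify's Web API exposes per-track audio features at the `/v1/audio-features/{id}` endpoint; a minimal request-building sketch (OAuth token acquisition is not shown, and the track ID below is illustrative):

```python
from urllib.request import Request

def audio_features_request(track_id, token):
    """Build (but don't send) a GET request for Spotify's audio-features
    endpoint, with the required Bearer-token Authorization header."""
    return Request(
        f"https://api.spotify.com/v1/audio-features/{track_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

The JSON response (tempo, valence, danceability, etc.) — or the artist's genre tags — would then be mapped into the conditioning vector v.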

This will require detecting the song and artist discussed in tweets that show up on @deephypebot’s timeline and sending that information to Spotify; Spotify’s response will then be forwarded to the neural network.
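The exact parsing rules aren't settled (the timeline below calls it "regex foolishness"), but a sketch of extracting artist and title from a typical "Artist - Title" music-feed tweet might look like:

```python
import re

# Hypothetical pattern: many music-feed tweets look like "Artist - Title (...)".
TRACK_RE = re.compile(r"^(?P<artist>.+?)\s[-–]\s(?P<title>.+?)(?:\s*\(|$)")

def parse_track(tweet_text):
    """Best-effort extraction of (artist, title) from a tweet; returns None
    when the text doesn't match the expected 'Artist - Title' shape."""
    m = TRACK_RE.match(tweet_text.strip())
    return (m.group("artist"), m.group("title")) if m else None
```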

From samples to tweets

Text generation is a notoriously messy affair where “you will not get quality generated text 100% of the time, even with a heavily-trained neural network.” While much effort will be put into having as automated and clean a pipeline as possible, some human supervision is prudent.
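As one piece of that automated pipeline, here is a minimal sketch of the planned post-processing step — dropping UNK tokens and collapsing consecutive duplicate words (the token name `UNK` is illustrative):

```python
def clean_sample(text, unk_token="UNK"):
    """Drop UNK tokens and collapse consecutive duplicate words in a
    generated sample (case-insensitive duplicate check)."""
    out = []
    for word in text.split():
        if word == unk_token:
            continue  # drop unknown-word placeholders entirely
        if out and word.lower() == out[-1].lower():
            continue  # skip an immediate repeat of the previous word
        out.append(word)
    return " ".join(out)
```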

Once candidate generations for a new proposed tweet are available, an email will be sent to the human curator (me), who will select one and lightly edit it for grammar and such before releasing it to @deephypebot for tweeting.



Resources

  • “Starting an Open Source Project” by GitHub [guide] - #oss
  • “Rules of Machine Learning: Best Practices for ML Engineering” by Google [guide] - #eng
  • Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M. (2014). Machine Learning: The High-Interest Credit Card of Technical Debt. [paper] - #eng
  • “Build Your Own Twitter Bots!” [code] [video] - #twitterbot
  • “Web API Tutorial” by Spotify [guide] - #spotify
  • Sohn, K., Yan, X., Lee, H. (2015). Learning Structured Output Representation using Deep Conditional Generative Models. [CVAE paper] - #vae
  • Engel, J., Hoffman, M., Roberts, A. (2017). Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models. [LC-GAN paper] - #gan
  • “Deploying a Python Flask app on Heroku” by John Kagga [guide] - #eng
  • “The Flask Mega-Tutorial Part XIX: Deployment on Docker Containers” by Miguel Grinberg [guide] - #eng
  • Bernardo, F., Zbyszynski, M., Fiebrink, R., Grierson, M. (2016). Interactive Machine Learning for End-User Innovation. [paper] - #onlinelearning



Timeline

August 3: Project dev spec + preliminary tasks; LC-GAN experiments towards better/controllable samples

  • Preliminary project tasks
    • Developer spec
    • Data cleaning and versioning
    • Permanent train/test split
    • Genre collection from Spotify
    • Metric definitions; benchmarking/baselines
      • Perplexity
      • Using discriminators to measure accuracy? (real/fake, genre, etc.)
    • Chat with Natasha and Jesse about more sophisticated modeling for later weeks
  • LC-GAN experiments
    • Experiment with solo discriminator vs joint: e.g., realism vs realism + readability/grammaticality
    • Investigate differences in training discriminator on Gaussian random z’s vs. sample-based z’s
    • Experiment with maximizing a single quality (e.g., sentiment) of a sample
    • Do balanced class labels matter?

August 10: Twitter bot + production pipeline ready

  • What Twitter feeds to watch
  • How to watch Twitter feeds for songs
  • How to build a Twitter bot
    • Twitter API registration
  • How to retrieve song title and artist from tweets
    • Short answer: regex foolishness
  • How to request audio features and genres from an app
    • Spotify API registration
  • Hook up a dummy/heuristic model
  • Some automatic post-processing
    • Remove UNK tokens and consecutive duplicate words
  • Samples -> Google Sheets process
  • Google Sheets -> Tweets process
  • [Bonus] Likes -> Model process

August 17: More sophisticated modeling

  • Experiments on conditioning VAE vs. LC-GAN on topic models (LDA)
    • Would be cool to demonstrate Bayesian techniques and understanding through LDA
  • Experiments on conditioning VAE vs. LC-GAN on sentiment (deepmoji), audio features/genre…
  • Retrain VAE with focus on reconstruction error (lower KL constraint σ)
  • Time to get best samples possible
    • Fancier VAEs?

August 24: End-to-end integrations

  • How to deploy a model
    • Especially a 2GB+ one
  • Select and integrate final production model

August 31: Project due

Mentor support

Mentor: Natasha Jaques

  • Assistance in reasoning over neural net architecture decisions
  • Connection to LC-GAN paper author, Jesse Engel, for queries
  • Assistance in understanding how LDA works
  • Assistance in debugging model training
  • Suggestions for model enhancement

Follow my progress this summer with this blog’s #openai tag, or on GitHub.


Nadja does not particularly enjoy writing about herself.