AI Christmas Story song

This AI Transforms Your Photographs Into Really Creepy Karaoke

Imagine a future where an artificial intelligence is capable of constructing a pop song more perfect than anything in Max Martin’s catalogue. Trained on every accessible piece of music in history, it would face no limit to the styles it could inhabit, channeling the techniques and tones of the greatest artists to ever live.

How could this affect the music industry? Why would labels deal with (and pay) human artists when they could have a song produced exactly in the fashion they and their target audience want, when they want it, in just a couple of seconds? How many artists would lose their jobs? Who would ultimately have ownership over the AI? Would this centralize all control within music into one entity?

These were my first thoughts when I learned about “neural karaoke,” an offshoot of a project currently underway by three researchers in the University of Toronto’s computer science department. Quite obviously, this is the reaction of someone who has let his imagination run right into a nightmare scenario. There are plenty of “what if” gaps that would have to be addressed first, and right now this is simply a project taking its first steps.

Let’s set all fearmongering aside and dig into what’s really happening here. Along with Culture Trip editor Peter Ward, I jumped on Skype with the project’s three members: Hang Chu, a PhD student at Toronto’s computer science lab, and his two advisors, Raquel Urtasun and Sanja Fidler. They broke down how “neural karaoke” was created as a result of a broader project. The system is a neural network trained on 100 hours of online music. It breaks a song down into different layers, which include the melody, the lyrics, the percussion, and so on. The program’s tech then generates new algorithms for each of these layers. An image is fed through them and, after a few seconds, a new song is born.
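For readers who think in code, the flow described above (image in, lyric text out, then one generated part per musical layer) can be mocked up in a few lines. This is only a toy sketch and not the Toronto team's system: the function names, the scene tags, the C major note list, and the drum names are all invented for illustration, and a seeded random picker stands in for the trained neural network.

```python
import random

# Hypothetical palettes for the toy example; the real system learns these from data.
C_MAJOR = ["C4", "D4", "E4", "F4", "G4", "A4", "B4"]
DRUMS = ["kick", "snare", "hat"]


def describe_image(tags):
    """Stand-in for the image-to-text step: turn scene tags into one lyric line."""
    return "I see " + " and ".join(tags) + " tonight"


def generate_layer(layer, lyric, beats=8):
    """Stand-in for one per-layer generator.

    Seeding with the layer name and lyric makes the output repeatable,
    so the same picture always yields the same toy song.
    """
    rng = random.Random(f"{layer}:{lyric}")
    palette = DRUMS if layer == "percussion" else C_MAJOR
    return [rng.choice(palette) for _ in range(beats)]


def image_to_song(tags):
    """Chain the steps: image tags -> lyric text -> one part per layer."""
    lyric = describe_image(tags)
    song = {"lyrics": lyric}
    for layer in ("melody", "percussion"):
        song[layer] = generate_layer(layer, lyric)
    return song


song = image_to_song(["a tree", "a family"])
print(song["lyrics"])
print(song["melody"])
print(song["percussion"])
```

The point of the sketch is the shape of the pipeline: each layer gets its own generator, and all of them are conditioned on the text that came from the picture.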

“The goal here was really can algorithms automatically generate music,” Fidler said. “Typically that is something that is considered art, so can you make an algorithm that produces art?”

While similar applications already exist (Google Magenta and Sony Flow Machines are two examples), the team noticed that these were largely focused on classical music. So they decided to go down the path of pop, because the songs are catchier and their structure and variety of instruments make them more interesting from a technical perspective.

While the group plans to publish a demo online in the near future so that anyone can upload photos and “create their own karaoke,” we decided to get in the Christmas spirit by sending over stills from two of our favorite Christmas movies, A Christmas Story and Elf. The songs the AI conjured do not disappoint. Check them out below, along with an interview with Chu, Urtasun, and Fidler.

When the AI is given a picture, what exactly does it take from the picture to create the lyrics?

Fidler: From the neural network it takes an image that represents this kind of hierarchy of features which kind of understands the semantics of what’s in the picture. For example, it understands a beach and some people and things like that. Then this algorithm essentially translates these features into text, into something that talks about these layers of the image. Then Hang takes this text, which kind of talks about some emotion and what’s in the picture and translates it into some music that would go well with this kind of text. It’s basically a neural network that learns its mapping from image to text to song.

Chu: That’s related to the algorithm we are using and what it’s training on. We were generating a passage as a whole entity and a typical passage might have some descriptional object and some interaction between the main characters. So when we generate this passage, it automatically incorporates these kind of “great to meet you” and something like “I hear something down the hall.”

Fidler: The algorithms right now are pretty good at reading people’s expressions and emotions and interactions. You can tell a lot from a single picture, right? You’re able to capture a little bit of that, not perfectly yet, and obviously with video that would be way more explicit, but in this case we started with a picture.

What is the thinking behind turning images into songs? You don’t really think about musicians looking at photographs as a means of inspiration for writing.

Fidler: This goes back to some of our previous work. The idea there was can you take a photograph and write a story about this automatically. We basically showed that if we trained this algorithm on Taylor Swift lyrics you can actually caption this photograph with something that looks like Taylor Swift lyrics. And then Hang kind of took this idea further: “Well, can we turn these lyrics or this text into music?” But actually a photograph is a lot about mood and what it makes you feel, right? So it makes sense to connect it to music, which also has this layer.

Is this supposed to imitate human emotion?

Urtasun: Ideally, yes, but this is one of the things we are improving in the model now, to actually take emotion into account.

Fidler: These are the really early stages of this project and I think we’re going to take it further and hopefully it’s going to produce even better music.

Is the voice the hardest thing to do?

Urtasun: I think so. Even now for synthesizing regular sentences, the voices, you know, Siri, doesn’t sound really good. It still sounds very artificial. The good thing about songs is that you have notes and notes are a nice abstraction. So you can imagine being able to compose something, and then there’s the second level of being able to interpret and perform that something.

Do you think the AI will be able to reach the point where it can write songs indistinguishable from those of an actual human artist?

Fidler: That’s almost a philosophical question that you are asking. Let’s say we found a song that sounded really good, but it couldn’t be associated with a person or what they convey from their own experiences and personality, so I’m not sure that people would be able to see it in an algorithm even if it sounded good. But we are hoping with more data you are able to sound better at least.

How do you think this program, running in its most efficient form, will fit into the music industry?

Fidler: It would be nice if there was “The Voice” with machines. [laughs] We just thought it was a very nice application which would offer people to have their own little karaoke machine at home. We didn’t think as big as you are saying right now, but in the future we can push the limits of this.

Urtasun: It’s not to think about it being fully automatic, but you can imagine building a simple user interface for people like children to express their emotions and to create and compose their own songs without requiring them to know too much music theory.