We’ve all seen this scene, yet some crazy ones among us have wondered about possible implementations of such technology. At the time, there were no concepts of machine learning, nor, in my case, any understanding of basic neuron operations and how those lead to patterns which could in turn model voice and visual (re)creation.
10 Years Later
There I was in 2015, pondering what to do for my Masters research project. Not one to shy away from a noisy, ever-talkative ego, I wanted to explore machine learning. With new blogs and articles popping up everywhere, I felt that every second I stayed entangled in game engine work, the gap between me and my engineering brethren, fighting at the edges of knowledge, was growing.
The memory of the above video percolated up through the neuron tree lodged in my brain; it might sit close to memories of ice cream and sweets, but that’s another topic altogether (Curing Addictions Through Human-Machine UnLearning?).
The timeframe of the research was three to four months, so we prioritized our backlog, with the initial focus landing on the audio dimension. Since audio and video are both signals, the lessons we’d learn with audio could relatively easily carry over to visual inputs.
As with all things in our current universe (or simulation), there are primitives to a dimension. Maybe I can even stipulate that if you break down any space far enough, you’ll find primitives, or as the mathematicians like to say, basis vectors of said vector space. These primitives can be used to build the space. Think of the alphabet and how it’s used to build words, phrases, etc.
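To make the analogy concrete, here’s a tiny, hypothetical sketch (not from the original research): any point in the plane can be built from two basis vectors, the same way words are built from letters.

```python
# Toy illustration of "primitives build the space": any 2-D point is a
# combination of two basis vectors, just as words are built from letters.
# Purely illustrative example.

def combine(basis, coeffs):
    """Linear combination: sum of coeff * basis_vector, component-wise."""
    return tuple(
        sum(c * v[i] for c, v in zip(coeffs, basis))
        for i in range(len(basis[0]))
    )

# The standard basis -- the "alphabet" of 2-D space.
e1, e2 = (1.0, 0.0), (0.0, 1.0)

# The point (3, 5) is just 3*e1 + 5*e2: two primitives, two "letters".
point = combine([e1, e2], [3.0, 5.0])
print(point)  # (3.0, 5.0)
```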
In the world of sound, one would think these would be “vowels” and “consonants”… NOPE! They do make sense as a syntactic breakdown for humans, but here we’re focusing on “Sound”.
Thankfully, bright minds in linguistics have endeavored to break down the sounds of language. They group closely sounding speech-sounds in a given language into equivalence classes called phonemes. An example is the phoneme /h/ shared by the “h” in help, hot, and hello. Phonemes differ per language and dialect, but to keep the story succinct: they give us the “primitives” of a given language.
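As a quick sketch, a phoneme lexicon is essentially a lookup from words to primitive sequences. The ARPAbet-style transcriptions below are illustrative (stress markers omitted), not pulled from a real lexicon.

```python
# A tiny phoneme "dictionary" in ARPAbet-style notation.
# Transcriptions are illustrative, not from a real lexicon.
lexicon = {
    "help":  ["HH", "EH", "L", "P"],
    "hot":   ["HH", "AA", "T"],
    "hello": ["HH", "AH", "L", "OW"],
}

# All three words share the same first primitive: the /h/ phoneme.
first_phonemes = {word: phones[0] for word, phones in lexicon.items()}
print(first_phonemes)  # every word maps to "HH"
```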
Hunting for Phonemes
So the basic idea is to find the phonemes of a specific speaker in order to recreate their voice. Unfortunately, life isn’t that simple. Even if we find these phonemes and attach them to keys on a synthesizer, it will not sound right. As the help, hot, and hello example shows, the letter after the “h” defines the sound, or accent, of the h. Well, some smart minds would say, just attach the phonemes to notes on the musical scale and it would sound like the beginning of a Christmas song: “Ho Ho Ho He Ha Ho”. That still wouldn’t cut it, as we’d have a synthesizer-type voice. No, we want to go for the real thing… I mean, we’re trying to evolve our species; we aren’t messing around!
Cue Statistical Models (Preamble To Machine Learning)
Each speaker has a few different ways they’ll pronounce a phoneme. So the idea is to collect enough of their pronunciations and find a common pattern, which we average out as the eigenphoneme of that person. Note that a phoneme also has “accents” to differentiate between “ha”, “ho”, “he”, etc. To handle these permutations, we can use a probability table, or just have a neural network track slight changes in phoneme accents (more on that later).
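The averaging-plus-probability-table idea can be sketched in a few lines. The feature vectors below are made-up numbers standing in for real spectral features, so treat this as a toy model rather than the actual pipeline.

```python
# Minimal sketch of the "eigenphoneme" averaging idea: collect several
# feature vectors for one speaker's /h/ (hypothetical numbers standing
# in for real spectral features), average them component-wise, and keep
# a simple probability table over the observed accent variants.
from collections import Counter

recordings = [
    ("ha", [0.9, 0.1, 0.4]),
    ("ho", [0.8, 0.2, 0.5]),
    ("he", [1.0, 0.0, 0.3]),
]

# Average pattern across all pronunciations -> the speaker's /h/ primitive.
n = len(recordings)
eigenphoneme = [sum(vec[i] for _, vec in recordings) / n
                for i in range(len(recordings[0][1]))]

# Probability table over accent variants ("ha", "ho", "he").
counts = Counter(accent for accent, _ in recordings)
accent_probs = {accent: count / n for accent, count in counts.items()}

print(eigenphoneme)
print(accent_probs)
```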
Here’s a possible representation or model of a phoneme with different accents.
Notice how the inputs help, hello, and hot will encourage the creation of new nodes, which will organize the complexity that is a phoneme and its various flavors. Hint hint, we’ve started to approach neurons!
Next up! We’ll get more technical with how we do feature extraction, audio processing, and clustering! Cup of Joe or stimulant advised 😉