Coder and Observer
In 2021, I built a voice desktop assistant called Smartmiq. Similar to Spotlight search on the Mac, Smartmiq lets users perform tasks such as searching Google with just their voice.
Or you could use it to reply to Slack messages:
One of the unique aspects of Smartmiq is that I developed my own automatic speech recognition (ASR) system using PyTorch and the transducer architecture. This allowed me to achieve a high level of customization and control over the voice recognition capabilities of the assistant.
I also focused on creating a user interface for Smartmiq that closely resembled the look and feel of Spotlight, in order to make it familiar and intuitive for users. However, as I continued to use and develop Smartmiq, I began to realize that voice input on desktop and laptop computers has not gained as much traction as it has on mobile devices. This is likely due to the fact that typing is often faster and more convenient on these devices, especially in public settings.
Despite this realization, I received positive feedback from users who found Smartmiq to be a useful and convenient tool. Overall, building Smartmiq was a valuable learning experience that helped me to better understand the potential and limitations of voice input on desktop and laptop computers.
I trained a deep reinforcement learning model to play the classic game of Pong using the soft actor-critic algorithm. Soft actor-critic is a variant of the actor-critic algorithm that adds an entropy bonus to the training objective, which keeps the policy stochastic and encourages exploration during training.
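To make the "soft" part concrete, here is a minimal sketch of the entropy-regularized state value as it is usually written for discrete actions. The function names, the temperature `alpha`, and the example numbers are illustrative assumptions, not values from the actual training code.

```python
import math


def policy_entropy(action_probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)


def soft_state_value(action_probs, q_values, alpha=0.2):
    """Entropy-regularized value: sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)).

    alpha is a hypothetical temperature; larger values reward more random policies.
    """
    value = 0.0
    for p, q in zip(action_probs, q_values):
        if p > 0.0:  # actions with zero probability contribute nothing
            value += p * (q - alpha * math.log(p))
    return value


# Three hypothetical Pong actions (up, down, stay) with equal Q-values.
uniform = [1 / 3, 1 / 3, 1 / 3]
greedy = [1.0, 0.0, 0.0]
q = [1.0, 1.0, 1.0]
print(soft_state_value(uniform, q), soft_state_value(greedy, q))
```

With equal Q-values, the uniform policy scores higher than the greedy one because of its entropy bonus, which is the pressure toward exploration described above.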
To process the game's image frames and predict the appropriate actions, I implemented a series of convolutional neural networks (CNNs) using PyTorch. Similar to the AtariNet architecture, the CNNs consisted of four convolutional layers, followed by a fully connected layer with 128 units and two output layers: one for the action-value function (the critic) and one for the policy (the actor).
To train the model, I used the Adam optimizer and a mean squared error loss function for the critic, and a negative log likelihood loss function for the actor. I also employed techniques such as Monte Carlo roll-out, experience replay, and importance sampling to improve the model's performance.
Monte Carlo roll-out involves simulating a complete episode of the game, starting from the current state, and using the current policy to choose actions until the episode is completed. The rewards obtained during the episode are then used to update the policy. This technique helps to improve the model's ability to plan ahead and make more informed decisions.
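As a rough illustration of how the rewards from a completed episode can be turned into learning targets, the sketch below computes the discounted return at every step. The function name and the discount factor `gamma` are assumptions made for this example.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for each step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A hypothetical three-step rally that ends with a point scored (+1).
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

Earlier steps receive a smaller share of the final reward, so actions closer to the point being scored get more credit.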
Experience replay involves storing a dataset of past experiences (i.e., state-action-reward-next state tuples) and sampling from this dataset when updating the policy. This helps to decorrelate the experiences and stabilize the learning process.
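A replay buffer along these lines can be sketched in a few lines of plain Python. The class name, method names, and capacity below are hypothetical and are not taken from the project's code.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive frames.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1)
batch = buffer.sample(2)
```

After five insertions into a buffer of capacity three, only the three most recent transitions remain available for sampling.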
Importance sampling corrects for the fact that the data distribution changes over time as the policy is updated. Each stored experience is weighted by how likely its action is under the current policy relative to the policy that originally generated it, so experiences that are no longer representative of the current policy contribute less to the update.
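One common way to compute such weights is the ratio of action probabilities under the current policy and the policy that collected the data. The sketch below assumes those probabilities were stored with each experience; the function name and the clipping threshold are illustrative choices, not details from the project.

```python
def importance_weights(current_probs, behavior_probs, clip=10.0):
    """Weight each sample by pi_current(a|s) / pi_behavior(a|s), capped at `clip`."""
    weights = []
    for p_now, p_then in zip(current_probs, behavior_probs):
        # Guard against division by zero for actions that were almost never taken.
        ratio = p_now / max(p_then, 1e-8)
        # Clipping is an assumed safeguard that limits variance from large ratios.
        weights.append(min(ratio, clip))
    return weights


# The first action is now twice as likely as when it was recorded;
# the second has become much less likely.
print(importance_weights([0.5, 0.1], [0.25, 0.4]))
```

A weight above 1 means the current policy favors that action more than the old one did, and a weight below 1 means the sample is becoming stale.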
After several iterations, my model was able to learn how to play Pong and achieved reasonable performance. You can see a demonstration of the model in action in this YouTube video:
The PyTorch training code for this project can be found here.
Overall, training this deep reinforcement learning model was a challenging but rewarding experience that taught me about the capabilities and limitations of this type of algorithm.
In 2020, I trained a neural network based on the VoiceFilter architecture. The idea was to learn a mask that is applied to spectrograms in order to mask out non-speech noise.
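The masking step itself is an element-wise multiplication. The sketch below shows only that step, with a hand-written mask standing in for the network's output; the function name and the tiny spectrogram are made up for illustration.

```python
def apply_mask(magnitudes, mask):
    """Multiply a [time][frequency] magnitude spectrogram by a mask in [0, 1]."""
    if len(magnitudes) != len(mask):
        raise ValueError("spectrogram and mask must have the same number of frames")
    cleaned = []
    for frame, mask_frame in zip(magnitudes, mask):
        if len(frame) != len(mask_frame):
            raise ValueError("each frame and its mask must have the same number of bins")
        # Clamp mask values so a bin can be attenuated but never amplified.
        cleaned.append([m * min(max(w, 0.0), 1.0) for m, w in zip(frame, mask_frame)])
    return cleaned


# Two time frames, two frequency bins. In a real system a trained
# network would predict the mask from the noisy spectrogram.
noisy = [[1.0, 2.0], [3.0, 4.0]]
mask = [[1.0, 0.0], [0.5, 1.0]]
print(apply_mask(noisy, mask))
```

Bins where the mask is near 1 pass through as speech, and bins near 0 are suppressed as noise.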
Here's what it sounds like:
I was looking for an excuse to learn Rust, so I decided to build a music visualizer using Rust and a Generative Adversarial Network (GAN).
The general idea is to use the frequency components of the streaming music to explore the latent space of a pretrained art GAN.
For each timestep in the music:
In 2018, I developed a voice assistant for Zoom, Google Meet, and Webex meetings that allowed users to trigger actions using their voice. One of the main insights behind this project was that joining the meeting directly was one of the only ways to get audio sampled at 16kHz or higher before these meeting providers offered APIs for it. With this approach, we got better audio quality than we would have by using a service like Twilio to dial in, which only offers 8kHz audio.
I emulated the webcam and audio drivers in a Docker container with a headless Chrome browser. Our team at Workfit/Voicera/Voicea (we changed the name a few times...) eventually scaled it up on a Kubernetes cluster.
One of the most effective growth hacks we implemented was generating a virtual webcam and showcasing our bot in meetings. This allowed us to reach a wider audience and increase our adoption rate, as on average, 3+ people were exposed to our bot for every meeting a user invited it to.
We built our own wakeword detector and (eventually) our own speech recognition engines.
Here's a video of what it looked like in a Zoom meeting:
In 2015, I worked on the OS of a payment terminal. We built a remote management feature for pushing out OS and app updates. In order to test it out, we pushed a music video to all of our test devices simultaneously.
The following is what ensued: