In case my last post wasn’t a giveaway, I am currently working on an Open Source project called CodeTalker. Its aim is to extend Visual Studio Code into the realm of speech recognition, allowing users to code with their microphone. For now, CodeTalker is focused on creating HTML files through speech.
Getting something like this accomplished motivates the heck out of me. Oftentimes we take things such as typing for granted. There are great minds out there who cannot type, and therefore might not be able to code. So why should a computer restrict users to keyboards? Obviously this is a crazy question at the moment, considering keyboards are vital for computing, but we can certainly create applications that don’t necessarily need one.
Currently, I am collaborating with a few other dudes and dudettes through Slack, all of whom, including myself, have put in a lot of effort trying to figure out how to accomplish this. We’ve hit a lot of roadblocks when it comes to finding an API that can handle speech recognition. The most common one, webkitSpeechRecognition from Google, was dropped from use in other shell environments such as Electron. Google needs to push their own Cloud Speech API, so I can see why they’d want to restrict use of their WebKit version. In fact, most speech recognition APIs out there, such as Watson Speech to Text from IBM or Microsoft’s Bing Speech, require subscriptions to use (with a credit card…gasp). The most popular free platform would be CMUSphinx. While I do not want to discourage use of this amazing project, compared to the other APIs I have mentioned it lacks precision out of the box and can be tricky to set up for non-technical users.
One thing I sorta found interesting was that the Web Speech API just didn’t work in Electron, yet it’s still available, and it’s totally usable from Python 3 through the SpeechRecognition library. In fact, here’s my code using it:
import speech_recognition as sr

r = sr.Recognizer()
test = sr.AudioFile('output.wav')
with test as source:
    audio = r.record(source)

# Send the result of the speech recognition to stdout
print(r.recognize_google(audio))
The script just needs a .wav file (other SoX- or ALSA-friendly formats work as well) in order to parse it for words. Well, what if I could get VSCode to create that .wav file with a user’s microphone and somehow get this script to run, all inside of the VSCode environment? And on the plus side, having Python installed on your computer isn’t all that hard.
And this is the approach I decided to take when overcoming the initial hurdle.
With the Python file working without issue, it was time to a) figure out how to activate a microphone in Visual Studio Code, b) record a stream of audio, c) write it out to a file, and then d) somehow run the Python script on this file.
Because this is in its infancy, everything I’ve done has a lot of bugs, and I’m more than happy to point them out.
To start, I decided to go with the most popular NPM microphone module: mic. But right off the bat, this thing is buggy. I couldn’t get it to recognize my microphone on Linux. Oh well, at least they allow you to specify the device when creating the micInstance. So I did, and it was recording my voice no problem. But here’s the thing…it’s recording my voice with my microphone. I doubt this will work on someone else’s computer, unless of course they are running an onboard microphone controlled by ALSA on an Arch Linux machine.
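For reference, here’s roughly what that configuration looks like. The option names come from the mic module’s README, and the device string is just a placeholder for whatever ALSA reports on your machine, not a value that will work everywhere:

```javascript
// Sketch of an explicit mic configuration; 'plughw:1,0' is a placeholder
// ALSA device, not a universal default.
const micConfig = {
  rate: '16000',        // sample rate expected by most speech APIs
  channels: '1',        // mono is enough for voice
  fileType: 'wav',      // write a .wav container instead of raw PCM
  device: 'plughw:1,0'  // explicit ALSA device instead of the default
};

// Assumed usage (requires the mic module and a working microphone):
// const micInstance = mic(micConfig);
// micInstance.getAudioStream().pipe(fs.createWriteStream('output.wav'));
// micInstance.start();
```

Hard-coding a device like this is exactly why it only works on my machine; ideally the device would come from a user setting.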
But this can be fixed!
Now that I was saving an output.wav file successfully, I was able to just run the Python script on top of the file. And sure enough, it was a success. But I had to get VSCode to run this script automatically, or else, what’s the point? Thank goodness for a teacher of mine, who recommended I check out child processes. That is, the child_process API, which is a core module in Node.js. While I was initially looking at tapping into a microphone as a process, I also realized that you can execute shell commands…This is so convenient for what I need to get done.
Right now it’s as simple as:
.exec('python repos/CodeTalker/speech.py', (error, stdout, stderr) => { … })
which will find the speech.py file, run it, AND return a string value of the words from an audio file.
Two bugs I see here though:
- the path to this file is currently hard-coded, so it only works on my computer.
- speech.py relies on an absolute path to the audio file…It would be MUCH better to just pass the file in as an argument to get around this issue.
(Just some things I need to add to the GitHub Issues page, if my PR is accepted).
Pretty much though, I’ve accomplished a small milestone. Through VSCode I am able to use my microphone to talk, and have the debug console output the words that I said.
Pretty darn exciting to get all this up and running. The work for CodeTalker can be viewed here. And because this is all built and not really tested…in order to activate the functionality you still need to give VSCode the command:
Hello World. Laughable, but something that must be fixed in the next revision.