So SODA finally landed, sort of, and for a couple of weeks already apparently. I've been on the lookout for the Linux library, since that is my preferred environment and I was under the impression that development was taking place on that platform. But I was wrong, and the Windows and macOS libraries have been available since late November.

Since I'm much more capable on a Linux machine, I searched for (and found!) a way to use either one of those available libraries. In my last post I reported on quite a successful project with the Google TTS library, which resulted in a very lightweight client for it. Fortunately the same can be said for the SODA client, resulting in a very small code base with only the library as a dependency. This enabled me to work with wine, and have it pipe the data straight from whatever Linux application I wanted to use to the Windows DLL. Just issue the following command:

$ ecasound -f:16,1,16000 -i alsa -o:stdout | wine gasr.exe

and watch your conversations roll over the screen:

W1215 22:58:43.683654 44 soda_async_:390] Soda session starting (require_hotword:0, hotword_timeout_in_millis:0)
> hello
> hello from
> hello from
> hello from sod
> hello from soda
> hello from soda
> final: hello from soda

The SODA client I wrote is developed in a separate repository (gasr), as it will be mostly just a tool to do the full reverse engineering of the RNN and transducer. But having an actual working implementation will greatly improve my ability to figure out the inner workings of the models. Using wine as an intermediate is still far from ideal, but I guess that the Linux library will also pop up soon, considering ChromeOS would depend on it.

As pointed out, the Linux library is also out there already, so there is no need to go the wine way anymore. And as an added bonus, the GBoard models are working with these libraries as well! That opens up a whole world of experimentation, since there are already quite a few of those spotted in the wild! There is now also a python client in the repo, for easier integration with home automation and such. See also the ChromeVox Next offline TTS client, a sister project.

[Figure: Representation of an RNN-T, with the input audio samples, x, and the predicted symbols, y.]

The predicted symbols (outputs of the Softmax layer) are fed back into the model through the Prediction network, as y_(u-1), ensuring that the predictions are conditioned both on the audio samples so far and on past outputs. The Prediction and Encoder Networks are LSTM RNNs, the Joint model is a feedforward network (paper). The Prediction Network comprises 2 layers of 2048 units, with a 640-dimensional projection layer. The Encoder Network comprises 8 such layers.

The audio input is probably 80 log-Mel channels, as described in this paper. Gauging from the number of inputs to the first encoder (enc0), 3 frames should be stacked and provided to enc0. Then three more frames should be captured to run enc0 again to obtain a second output. Both those outputs should be fed to the second encoder (enc1) to provide it with a tensor of length 1280. The output of the second encoder is fed to the joint.

The decoder is fed with a tensor of zeros at t=0. In the next iteration the decoder is fed with the output of the softmax layer, which is of length 128 and represents the probabilities of the symbol heard in the audio. This way the current symbol depends on all the previous symbols in the sequence. The joint and softmax have the least amount of tweakable parameters: the two inputs of the joint are just the outputs of the decoder and encoder, and the softmax only turns this output into probabilities between 0 and 1.

I recently found a nice overview presentation of (almost) current research, with an interesting description starting on slide 81 explaining when to advance the encoder and retain the prediction network state.
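The dataflow just described can be sketched end to end. Everything below is a toy reconstruction: the dimensions (80 log-Mel channels, 3 stacked frames, 640-wide projections, two enc0 outputs concatenated to 1280 values, a 128-way softmax) come from the analysis above, but the weights are random placeholders, the feedforward `dense` helper is a stateless stand-in for the real LSTM layers, and names like `enc0` and `enc1` are mine. This only illustrates the wiring, not the actual SODA/GBoard model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions taken from the analysis; everything else is a placeholder.
N_MEL = 80        # log-Mel channels per audio frame (assumed)
STACK = 3         # frames stacked for the first encoder
ENC_DIM = 640     # projection width of each layer
N_SYMBOLS = 128   # size of the softmax output

def dense(in_dim, out_dim):
    """Random feedforward stand-in for a trained (LSTM) layer."""
    w = rng.standard_normal((in_dim, out_dim)) * 0.01
    return lambda x: np.tanh(x @ w)

enc0 = dense(STACK * N_MEL, ENC_DIM)   # first encoder: 3 stacked frames in
enc1 = dense(2 * ENC_DIM, ENC_DIM)     # second encoder: two enc0 outputs (1280) in
dec = dense(N_SYMBOLS, ENC_DIM)        # prediction network ("decoder")
joint = dense(2 * ENC_DIM, N_SYMBOLS)  # joint: encoder + decoder outputs in

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

y_prev = np.zeros(N_SYMBOLS)           # at t=0 the decoder is fed zeros
transcript = []

audio = rng.standard_normal((12, N_MEL))  # 12 fake log-Mel frames
for i in range(0, len(audio) - 2 * STACK + 1, 2 * STACK):
    # Two enc0 passes over 3 stacked frames each -> 1280 values for enc1.
    a = enc0(audio[i:i + STACK].reshape(-1))
    b = enc0(audio[i + STACK:i + 2 * STACK].reshape(-1))
    enc_out = enc1(np.concatenate([a, b]))
    dec_out = dec(y_prev)              # conditioned on the previous symbol
    probs = softmax(joint(np.concatenate([enc_out, dec_out])))
    y_prev = probs                     # softmax output is fed back next iteration
    transcript.append(int(probs.argmax()))

print(transcript)  # symbol ids; a real model would map these to word pieces
```

With random weights the symbol ids are of course meaningless; the point is that all tensor sizes line up exactly as described above (240 into enc0, 1280 into enc1 and the joint, 128 out of the softmax and back into the decoder).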
Finding the trained models was done by reverse engineering the GBoard app using apktool. Further analysis of the app is necessary to find the right parameters to the models, but the initial blog post also provides some useful info (summarized above). The status so far:

- Figure out how to import the model in TensorFlow (DONE).
- Figure out how to connect the different inputs and outputs to each other (in progress).
- Write lightweight application for dictation (DONE).
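The lightweight dictation application can also be driven from other programs, which is what the python client is for. Here is a minimal sketch of such a wrapper, assuming (as the ecasound one-liner earlier suggests) that gasr.exe reads 16 kHz mono 16-bit PCM on stdin and prints hypotheses like "> hello from soda" and "> final: hello from soda" on stdout; the helper names are mine.

```python
import subprocess

def final_text(line: str):
    """Return the transcript if `line` is a final hypothesis, else None."""
    line = line.strip().lstrip("> ")
    if line.startswith("final:"):
        return line[len("final:"):].strip()
    return None

def listen():
    # Capture 16 kHz mono 16-bit audio with ecasound and pipe it into the
    # SODA client, exactly like the shell one-liner above.
    capture = subprocess.Popen(
        ["ecasound", "-f:16,1,16000", "-i", "alsa", "-o:stdout"],
        stdout=subprocess.PIPE,
    )
    asr = subprocess.Popen(
        ["wine", "gasr.exe"],
        stdin=capture.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    for line in asr.stdout:
        text = final_text(line)
        if text is not None:
            yield text  # hand off to home-automation logic here

if __name__ == "__main__":
    for utterance in listen():
        print("heard:", utterance)
```

Ignoring the streaming partials and reacting only to final hypotheses keeps the downstream automation from firing on every intermediate "> hello from sod" line.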