Autogenerated Audio

TL;DR

This is a quick update to let you know that you can now listen to the articles as MP3s, either through the website or via the RSS feed. These are automatically generated with a reasonably good open source voice.

Why?

A few news sites have been introducing a feature where you can listen to their articles rather than having to read them, and I quite like this idea. For a while now I have also been toying with the idea of creating a podcast, but I have neither the voice for it, the time to do it, nor the content to put in it.

I then had the idea of having audio auto-generated from my website entries, like these news sites do, giving me a halfway point towards my goals. Obviously I want this entire process to be automated, as manually converting these articles would take quite some time.

Research

I first started looking into this perhaps six months ago, using the standard libespeak library. The default voice sounds awful, and the other regional voices don't sound too great either - they all have a really hard-to-understand robotic quality. Don't get me wrong - I am really happy this exists, but it's hardly cutting edge in voice quality.

I have a soft spot for libespeak as it's what we used to use for general debugging on our robots. They would complain to us about their various injuries in a robotic voice (“help, my left knee pitch joint is stuck”). Whilst it was understandable enough for debugging, it left a lot to be desired when it came to reading an entire article.

I reviewed some other libraries, but all the open source versions sound really robotic. I kept coming back to libespeak, but I really wasn't happy with the voice. Eventually I stumbled upon mbrola and managed to get it installed locally, settling on the us-3 voice they offer (kindly provided by “Mike Macon”).
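
For anyone wanting to try the same setup, a minimal sketch of asking espeak to speak through the mbrola voice might look like the following (wrapped in Python purely for scripting; the mb-us3 voice name, speaking rate and file names are assumptions based on a typical install, not my exact setup):

  import subprocess

  # Minimal sketch: have espeak synthesise through the mbrola us3 voice.
  # Assumes espeak, mbrola and the us3 voice data are installed; "article.txt"
  # and "article.wav" are illustrative file names.
  subprocess.run(
      ["espeak",
       "-v", "mb-us3",       # espeak's name for the mbrola us3 voice
       "-s", "140",          # speaking rate in words per minute
       "-f", "article.txt",  # plain-text input
       "-w", "article.wav"], # write the synthesised audio to a WAV file
      check=True)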

Implementation

Getting mbrola installed on the server, on the other hand, was slightly tougher. It's running Ubuntu 16 on armhf, so as you can imagine mbrola is not in the default package manager. Initially I tried various manual installation instructions, but ultimately you need the mbrola binary, and that only seems to be available from the official website, which is unfortunately down.

Luckily pkgs.org had an installable copy (I installed the Ubuntu 18 version). I was also able to find the voice I wanted (for the correct Ubuntu version) on their site.

The next thing to create was the build script, which is getting decidedly complicated these days. It now has to:

  1. Generate a link for the MP3 to be inserted into the web page.
  2. Generate the web page.
  3. Generate the text for the text-to-speech.
  4. Generate the WAV file using libespeak (driving the mbrola voice).
  5. Convert the WAV to a compressed MP3.
  6. Clean-up the directories and temporary files to save some disk space.

After it has done this, it then needs to drop these files into the RSS feed (which is generated last).
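
To give a rough idea of the shape of steps 3-6, here is a sketch in Python rather than the shell script actually used; the helper name, file paths and 64k bitrate are purely illustrative:

  import os
  import subprocess

  def build_article_audio(text_path, mp3_path):
      """Illustrative text -> WAV -> MP3 pipeline for one article."""
      wav_path = mp3_path + ".tmp.wav"
      # Synthesise speech from the plain-text version of the article.
      subprocess.run(["espeak", "-v", "mb-us3", "-f", text_path, "-w", wav_path],
                     check=True)
      # Compress the large WAV into an MP3 suitable for the website and RSS feed.
      subprocess.run(["ffmpeg", "-y", "-i", wav_path, "-b:a", "64k", mp3_path],
                     check=True)
      # Clean up the temporary WAV to save disk space.
      os.remove(wav_path)
      # The RSS generator can then reference the MP3 in an <enclosure> tag,
      # e.g. <enclosure url=".../article.mp3" length="..." type="audio/mpeg"/>.
      return os.path.getsize(mp3_path)

The returned size is the sort of value an RSS enclosure's length attribute wants (the file size in bytes), which is why the sketch bothers to report it.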

If anybody would like to know about this in more detail, please ask in the comments.

Problems

I really wish I had a nicer voice available. I feel like this is something a small neural network should be capable of these days - hell, even some semi-intelligent Markov model. Unless you really need to get your text-to-speech library onto a micro-controller or need extreme speed, you should be able to generate a much nicer voice.

One thing I have also considered is some sort of processing after the fact to make the voice nicer. This is also something you could imagine using some form of machine learning for. Ideally you need the person who created the voice to also generate sample data to help the system learn. I guess big tech companies can afford the time and effort to implement such systems, whereas we normal folk cannot.

Server load whilst converting

As you can imagine, converting articles from text to WAV to MP3 puts a significant load on the server. Unfortunately ffmpeg will only use one core, so it takes a minute or more to convert each article. I had to update the server script to stop deleting files until absolutely necessary in order to avoid noticeable downtime.
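
One way to keep that downtime to essentially zero (a sketch of the general idea, not exactly what my script does) is to encode into a temporary file and only swap it over the old MP3 once ffmpeg has finished, so the previous file keeps being served for the whole minute-plus encode:

  import os
  import subprocess

  def encode_mp3(wav_path, mp3_path):
      """Encode to a temporary file, then atomically swap it into place so the
      old MP3 remains available for the whole duration of the encode."""
      tmp_path = mp3_path + ".part"
      subprocess.run(["ffmpeg", "-y", "-i", wav_path, "-b:a", "64k", tmp_path],
                     check=True)
      os.replace(tmp_path, mp3_path)  # atomic rename on the same filesystem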

Other than downtime, you must also consider that it now takes longer to view drafts and review changes live. Something that took 0-15 minutes now takes 20+ minutes, which is significant once you consider that changes also start being pushed out to RSS feeds, etc. It doesn't give you a very nice window in which to catch mistakes.

You may suggest “generate the files locally”, but ultimately I want to be in a place where all I need is a dumb terminal to write articles, rather than some serious processing machine to generate files. (For example, when travelling, I may even only have a mobile phone available).

You may also suggest “generate the files only for changed articles”, but this requires some sort of intelligent system - something I don't want to write or rely on unless I have to. I can already imagine cases where it believes there are no changes when in fact there are. Even mature software struggles with this - the make build system, for example. Sometimes you do have to run a clean and rebuild in order to get the compilation process working correctly.
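
For what it's worth, the sort of check I mean would look roughly like the sketch below (hash the article text, skip synthesis when the hash is unchanged) - and it shows exactly the weakness described above, because changing the voice, the bitrate or the build script itself would never trigger a rebuild:

  import hashlib
  import os

  def needs_rebuild(text_path, mp3_path):
      """Return True if the article text has changed since the MP3 was built.
      Weakness: anything not captured by the hash (voice, bitrate, the build
      script itself) silently fails to trigger a rebuild."""
      digest = hashlib.sha256(open(text_path, "rb").read()).hexdigest()
      stamp_path = mp3_path + ".sha256"
      if os.path.exists(mp3_path) and os.path.exists(stamp_path):
          if open(stamp_path).read().strip() == digest:
              return False
      # In practice the stamp should only be written after a successful build.
      with open(stamp_path, "w") as f:
          f.write(digest)
      return True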