Coffee Space

ffmpeg Audio Mix

TL;DR

I don’t know what I’m doing, but I did something and maybe it could be useful to you too. I got text-to-speech (TTS) audio with background music played for a small intro. I couldn’t find any examples online, so hopefully this can save somebody some time.

I wrote a brief article explaining why here.

Markdown to Text

Here we use pandoc to take a markdown file ($filename) and convert it to plain text (a .plain file):

pandoc -t plain -o "${filename%.*}.plain" "$filename"

This should be really quick for reasonably sized documents.
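If the ${filename%.*} part looks like line noise: it is shell parameter expansion that strips the shortest trailing match of .* (the extension), so the output ends up next to the input with a new extension. A quick sketch, where the path is a made-up example and not one from my setup:

```shell
# Hypothetical input path, purely for illustration.
filename="posts/ffmpeg-audio-mix.md"

# ${filename%.*} removes the shortest suffix matching ".*", i.e. the extension.
base="${filename%.*}"
plain="${base}.plain"

printf '%s\n' "$plain"
```

The same expansion is reused further down to name the .wav and .mp3 files.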

Text to Audio

Next we want to convert the plain text (.plain file) to a WAV file (.wav file):

espeak -v us-mbrola-3 -s 150 -x -w "${filename%.*}.wav" \
  "$AUDIO_START $(cat "${filename%.*}.plain") $AUDIO_END"

A quick explanation of parameters:

-v us-mbrola-3 – A nicer than default voice for espeak from the mbrola package.
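-s 150 – Set the speaking speed in words per minute. Note that espeak's documented default is 175, so 150 is actually a little slower than the default; raise it if, like me, you don’t have that long to live.
-x – Write the phoneme mnemonics to stdout. It is the -w option that stops espeak from speaking out loud, so -x can be dropped if you don’t want the phoneme dump.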
-w "${filename%.*}.wav" – Output the WAV file to a given location.
$AUDIO_START – A variable containing a start-of-audio statement.
$(cat "${filename%.*}.plain") – The contents of the plain text file we want spoken.
$AUDIO_END – A variable containing an end-of-audio statement.
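To make the last three items concrete, here is a minimal sketch of how the spoken text gets stitched together. The two statements and the temporary file are placeholders I made up for the example; only the variable names come from the command above.

```shell
# Hypothetical start and end statements.
AUDIO_START="Welcome to Coffee Space."
AUDIO_END="Thanks for listening."

# Stand-in for the .plain file produced by pandoc.
plain_file="$(mktemp)"
printf 'Hello world.\n' > "$plain_file"

# Command substitution drops the trailing newline, so the three parts
# end up separated by single spaces.
speech="$AUDIO_START $(cat "$plain_file") $AUDIO_END"
printf '%s\n' "$speech"

rm -f "$plain_file"
```

The resulting string is what espeak receives as its single text argument.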

Mix Audio with Background Audio

And now finally, we want to overlay some background music on the intro and compress the audio to save on bandwidth:

ffmpeg -i "${filename%.*}.wav" \
  -i "$AUDIO_INTRO" \
  -filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
  -map "[a]" \
  -threads $(nproc) -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
  "${filename%.*}.mp3"

A quick explanation of parameters:

-i "${filename%.*}.wav" – The input WAV file (TTS conversion).
-i "$AUDIO_INTRO" – The background music file.
-map "[a]" – Use the filter output labelled [a] as the audio stream of the output file.
-threads $(nproc) – Number of threads to use based on the number of available cores (mostly irrelevant for audio processing).
-acodec libmp3lame – The audio encoder (LAME) used to produce the MP3.
-ac 2 – Use two audio channels.
-ar 22050 – The audio sampling rate (reduced for compression).
-ab 40k – The audio bit rate (reduced for compression).
-f mp3 – The output format is MP3.
"${filename%.*}.mp3" – The output MP3 file.
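As a rough idea of what -ab 40k buys you, the arithmetic below estimates the output size. It assumes a constant bit rate and ignores any header or tag overhead, and the 10 minute length is just an example figure.

```shell
bitrate=40000   # bits per second, from -ab 40k
seconds=600     # hypothetical 10 minute article

# bits per second * seconds gives bits; divide by 8 for bytes.
bytes=$(( bitrate * seconds / 8 ))
printf '%s bytes\n' "$bytes"
```

So ten minutes of speech comes out at roughly 3 MB under those assumptions.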

Lastly, there is this beast, where the magic is done:

-filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]"

Disclaimer: This is my interpretation of what is going on - I could be talking rubbish. You are warned!

Firstly, $AUDIO_ILEN is how long (in seconds) the background music plays at full volume before the fade begins, and $AUDIO_IFADE is how long (in seconds) the linear fade out takes.
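For illustration, this is what the filter string looks like once the shell has substituted the two variables. The values 5 and 3 are example numbers I picked, not necessarily the ones I use.

```shell
AUDIO_ILEN=5    # example: music at full volume for 5 seconds
AUDIO_IFADE=3   # example: then fade out over 3 seconds

filter="[1]afade=t=out:st=${AUDIO_ILEN}:d=${AUDIO_IFADE}[a1];[0:a][a1]amix=inputs=2:duration=first[a]"
printf '%s\n' "$filter"
```

With these values the music would be silent from the 8 second mark onwards.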

Breaking it down:

[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]

So essentially we are telling ffmpeg that we want a fade effect (afade), that the fade type (t) is a fade out (out), when the fade out should start (st) and how long it should take (d). The [1] at the front is the input label: it selects input number 1 (counting from 0), which is the background music given by the second -i. The [a1] at the end is a name for the faded result so that the next filter can refer to it.
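To put numbers on the fade, here is a small sketch of the volume multiplier applied to the music at a given time. It assumes afade's default linear curve, uses awk for the floating point maths, and fade_gain is a helper I made up for the example, not something ffmpeg provides.

```shell
# fade_gain TIME START DURATION
# Prints the linear fade out multiplier (1.00 = full volume, 0.00 = silent).
fade_gain() {
  awk -v t="$1" -v st="$2" -v d="$3" 'BEGIN {
    if (t <= st)          g = 1                 # before the fade starts
    else if (t >= st + d) g = 0                 # after the fade has finished
    else                  g = 1 - (t - st) / d  # part way through the fade
    printf "%.2f\n", g
  }'
}

fade_gain 4 5 3     # before the fade
fade_gain 6.5 5 3   # half way through the fade
fade_gain 9 5 3     # after the fade
```

With st=5 and d=3 that gives full volume at 4 seconds, half volume at 6.5 seconds and silence at 9 seconds.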

And for the second part:

[0:a][a1]amix=inputs=2:duration=first[a]

So here we say to mix the audio sources into one (amix), that there are two input audio sources (inputs=2) and that the result should last for the length of the first source (duration=first). The first source is the TTS audio, so the music can never make the file longer than the speech. [0:a] selects the audio from input 0 (the TTS) and [a1] is the faded background music from the previous step; these are the two inputs to amix. Finally, the result is labelled [a], which is what -map "[a]" picks up as the audio to encode.
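If you want to poke at the filter without any real audio to hand, a sketch like the one below may help. It swaps the TTS and the music for two synthetic tones from ffmpeg's lavfi sine source, uses example values of 5 and 3 seconds for the fade, and writes a WAV to a temporary directory. All of those are my choices for the example, and it quietly skips itself if ffmpeg is not installed.

```shell
status="skipped"
if command -v ffmpeg >/dev/null 2>&1; then
  workdir="$(mktemp -d)"
  # Input 0 stands in for the TTS (10 s tone), input 1 for the music (30 s tone).
  if ffmpeg -nostdin -v error -y \
      -f lavfi -i "sine=frequency=440:duration=10" \
      -f lavfi -i "sine=frequency=220:duration=30" \
      -filter_complex "[1]afade=t=out:st=5:d=3[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
      -map "[a]" "$workdir/mix.wav"; then
    status="mixed"
    printf 'wrote %s\n' "$workdir/mix.wav"
  else
    status="failed"
  fi
fi
printf '%s\n' "$status"
```

If it works, the result should be about 10 seconds long (the length of input 0), with the lower tone fading away between 5 and 8 seconds.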