Coffee Space


ffmpeg Audio Mix

TL;DR

I don’t know what I’m doing, but I did something and maybe it could be useful to you too. I got text-to-speech (TTS) audio mixed with background music that plays as a short intro. I couldn’t find any examples online, so hopefully this can save somebody some time.

I wrote a brief article explaining why here.

Markdown to Text

So here we are using pandoc to take a markdown file ($filename) and convert it to plain text (.plain file):

pandoc -t plain -o "${filename%.*}.plain" "$filename"

This should be really quick for reasonably sized files.
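The `${filename%.*}` part shows up in every command, so here is a minimal sketch of what it expands to. The file name is a made-up example, not one from the original setup.

```shell
# Hypothetical input file, purely for illustration
filename="posts/ffmpeg-audio-mix.md"

# %.* removes the shortest trailing match of ".*", i.e. the last extension
base="${filename%.*}"

# The three derived file names used by the pipeline
plain="$base.plain"
wav="$base.wav"
mp3="$base.mp3"

printf '%s\n' "$plain" "$wav" "$mp3"
# prints:
# posts/ffmpeg-audio-mix.plain
# posts/ffmpeg-audio-mix.wav
# posts/ffmpeg-audio-mix.mp3
```

Quoting the expansions, as in `"${filename%.*}.plain"`, keeps the commands working when the path contains spaces.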

Text to Audio

Next we want to convert the plain text (.plain file) to a WAV file (.wav file):

espeak -v us-mbrola-3 -s 150 -x -w "${filename%.*}.wav" \
  "$AUDIO_START $(cat "${filename%.*}.plain") $AUDIO_END"
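To see what text actually gets handed to espeak, here is a small sketch that builds the same string without running espeak at all. The values of `AUDIO_START` and `AUDIO_END` are hypothetical placeholders (the real ones aren't shown here), and the temporary file stands in for the .plain file.

```shell
# Hypothetical intro and outro phrases; the real values are not given
AUDIO_START="Welcome to the show."
AUDIO_END="Thanks for listening."

# Stand-in for the .plain file produced by pandoc
plain="$(mktemp)"
printf 'First line.\nSecond line.\n\n' > "$plain"

# Same shape as the espeak argument: start phrase, file contents, end phrase.
# Command substitution drops the trailing newlines of the file.
speech="$AUDIO_START $(cat "$plain") $AUDIO_END"

printf '%s\n' "$speech"
rm -f "$plain"
```

Everything ends up in a single argument, so a very long article could in principle run into the system's argument length limit.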

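A quick explanation of the espeak parameters: -v picks the voice (here an MBROLA US English one), -s sets the speaking speed in words per minute, -x prints the phoneme mnemonics to stdout, and -w writes the speech to a WAV file instead of playing it.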

Mix Audio with Background Audio

And now finally, we want to overlay background music onto the intro and compress the audio to save on bandwidth:

ffmpeg -i "${filename%.*}.wav" \
  -i "$AUDIO_INTRO" \
  -filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
  -map "[a]" \
  -threads "$(nproc)" -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
  "${filename%.*}.mp3"

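A quick explanation of the ffmpeg parameters: each -i names an input file (input 0 is the TTS WAV, input 1 is the background music), -map "[a]" sends the labelled output of the filter graph to the output file, -threads $(nproc) asks for one thread per available core, -acodec libmp3lame encodes with the LAME MP3 encoder, -ac 2 gives two channels (stereo), -ar 22050 sets the sample rate to 22050 Hz, -ab 40k sets the audio bitrate to 40 kbit/s, and -f mp3 forces the MP3 output format.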
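The -ab 40k option is where the bandwidth saving comes from. As a rough sketch of what that bitrate means for file size, assuming a constant bitrate and a hypothetical 10 minute recording:

```shell
bitrate=40000   # bits per second, from -ab 40k
minutes=10      # assumed length of the spoken article

seconds=$((minutes * 60))

# bits per second * seconds = total bits; divide by 8 for bytes
mp3_bytes=$((bitrate * seconds / 8))

printf 'roughly %s kB for %s minutes\n' "$((mp3_bytes / 1000))" "$minutes"
# prints: roughly 3000 kB for 10 minutes
```

This ignores container overhead and any tags, so treat it as a ballpark figure only.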

Lastly, there is this beast, where the magic is done:

-filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]"

Disclaimer: This is my interpretation of what is going on - I could be talking rubbish. You are warned!

Firstly, $AUDIO_ILEN is how long (in seconds) to play the intro music at full volume before the fade starts, and $AUDIO_IFADE is how long (in seconds) the linear fade out takes.
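To make the timing concrete, here is a small sketch of how loud the background music is at a given second, assuming a linear fade and the hypothetical values 5 and 4 for the two variables. It only models the fade itself, not whatever the mixing step does to the levels afterwards.

```shell
# Hypothetical values, in whole seconds
AUDIO_ILEN=5    # fade starts here
AUDIO_IFADE=4   # fade lasts this long

# The music is silent from this point on
fade_end=$((AUDIO_ILEN + AUDIO_IFADE))

# Print the music level (percent of full volume) at second $1
music_gain_percent() {
  t=$1
  if [ "$t" -le "$AUDIO_ILEN" ]; then
    echo 100
  elif [ "$t" -ge "$fade_end" ]; then
    echo 0
  else
    # Linear ramp from 100 down to 0 across the fade
    echo $((100 - 100 * (t - AUDIO_ILEN) / AUDIO_IFADE))
  fi
}

for t in 0 5 7 9 12; do
  printf 't=%ss music at %s%%\n' "$t" "$(music_gain_percent "$t")"
done
# prints:
# t=0s music at 100%
# t=5s music at 100%
# t=7s music at 50%
# t=9s music at 0%
# t=12s music at 0%
```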

Breaking it down:

[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]

So essentially we are telling ffmpeg that we want a fade effect (afade), that its type (t) is a fade out (out), when the fade out starts (st) and how long it lasts (d). The [1] at the front is the input label: it picks input 1, which is the second -i (the background music), since inputs are counted from 0. The [a1] at the end is the output label, a name for the faded stream so that the next filter can refer to it.

And for the second part:

[0:a][a1]amix=inputs=2:duration=first[a]

So here we say to mix the audio sources into one (amix), that there are two input audio sources (inputs=2) and that the result should last for the length of the first source (duration=first). The [0:a] picks the audio of input 0 (the TTS) as the first source, and [a1] is the faded background music as the second. Since the TTS is listed first, duration=first means the output ends when the speech does. Finally, the result is labelled [a], which is what -map "[a]" then sends to the output file.
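Putting the two parts back together, here is a sketch that builds the filter string from named pieces, which I find easier to read and tweak than one long line. The variable names `fade`, `mix` and `filter` are my own, and the timing values are hypothetical.

```shell
# Hypothetical timing values, in seconds
AUDIO_ILEN=5
AUDIO_IFADE=4

# Chain 1: take input 1 (the music), fade it out, call the result [a1]
fade="[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]"

# Chain 2: mix input 0's audio (the TTS) with [a1], call the result [a]
mix="[0:a][a1]amix=inputs=2:duration=first[a]"

# Chains in a filter graph are separated by a semicolon
filter="$fade;$mix"

printf '%s\n' "$filter"
# prints: [1]afade=t=out:st=5:d=4[a1];[0:a][a1]amix=inputs=2:duration=first[a]
```

The result can then be passed along as `-filter_complex "$filter"` in place of the long inline string.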