I don’t know what I’m doing, but I did something and maybe it could be useful to you too. I got text-to-speech (TTS) audio playing over background music for a small intro. I couldn’t find any examples online, so hopefully this can save somebody some time.
I wrote a brief article explaining why here.
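For reference, the snippets below assume a handful of variables have already been set. The values here are made-up examples purely for illustration, not what I actually use:

filename="intro.md"                    # the markdown file to convert
AUDIO_START="This is an audio intro."  # spoken before the article text
AUDIO_END="End of the audio intro."    # spoken after the article text
AUDIO_INTRO="background.wav"           # the background music file
AUDIO_ILEN=8                           # seconds of music before the fade-out starts
AUDIO_IFADE=3                          # seconds the fade-out takes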
So here we are using pandoc to take a markdown file
($filename) and convert it to a plain-text file
(.plain):
pandoc -t plain -o "${filename%.*}.plain" "$filename"
This should be really quick for a reasonably sized document.
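If you want to eyeball exactly what will end up being read out, just peek at the generated plain text:

head "${filename%.*}.plain"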
Next we want to convert the plain text (.plain file) to
a WAV file (.wav file):
espeak -v us-mbrola-3 -s 150 -x -w "${filename%.*}.wav" \
  "$AUDIO_START $(cat "${filename%.*}.plain") $AUDIO_END"
A quick explanation of parameters:
-v us-mbrola-3 – A nicer-than-default voice for espeak from the mbrola package.
-s 150 – Increase the speed of the voice; I don’t have that long to live.
-x – Write the phoneme mnemonics to stdout.
-w "${filename%.*}.wav" – Output the WAV file to the given location, rather than speaking the text aloud.
$AUDIO_START – A variable containing a start-of-audio statement.
$(cat "${filename%.*}.plain") – The contents of the plain-text file we want spoken.
$AUDIO_END – A variable containing an end-of-audio statement.
And now finally, we want to overlay some background music onto the intro and compress the audio to save on bandwidth:
ffmpeg -i "${filename%.*}.wav" \
  -i "$AUDIO_INTRO" \
  -filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
  -map "[a]" \
  -threads $(nproc) -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
  "${filename%.*}.mp3"
A quick explanation of parameters:
-i "${filename%.*}.wav" – The input WAV file (TTS
conversion).-i "$AUDIO_INTRO" – The background music file.-map "[a]" – Map the filter to the audio output.-threads $(nproc) – Number of threads to use based on
the number of available cores (most irrelevant for audio
processing).-acodec libmp3lame – The audio conversion codec for
converting MP3s.-ac 2 – Use two audio channels.-ar 22050 – The audio sampling rate (reduced for
compression).-ab 40k – The audio bit rate (reduced for
compression).-f mp3 – The output format is MP3."${filename%.*}.mp3" – The output MP3 file.Lastly, is this beast, where the magic is done:
-filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]"
Disclaimer: This is my interpretation of what is going on - I could be talking rubbish. You are warned!
Firstly, $AUDIO_ILEN is the length of time to play the
starting audio for, and $AUDIO_IFADE is how long the
linear fade-out should take.
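To make that concrete, take the example values of 8 seconds of music and a 3 second fade from earlier:

AUDIO_ILEN=8    # intro music plays at full volume for 8 seconds
AUDIO_IFADE=3   # then fades out over 3 seconds
# With those values the filter expands to:
#   [1]afade=t=out:st=8:d=3[a1];[0:a][a1]amix=inputs=2:duration=first[a]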
Breaking it down:
[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]
So essentially we are telling ffmpeg that we want a fade
effect (afade), that the type of fade (t) is a fade-out
(out), and giving it the start time of the fade-out (st)
and the duration of the fade-out (d). The [1] selects the
second input (input index 1, the background music) as the
source, and [a1] is a label attached to the output of this
filter so it can be referred to later in the filtergraph.
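If you want to hear what the fade sounds like on its own before mixing anything, you can run just that half of the filtergraph against the music. This is a throwaway sketch (faded-intro.wav is just a scratch output name), and note the music is now input 0:

ffmpeg -i "$AUDIO_INTRO" \
  -filter_complex "[0:a]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]" \
  -map "[a1]" faded-intro.wav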
And for the second part:
[0:a][a1]amix=inputs=2:duration=first[a]
So here we say to mix the audio sources into one source
(amix), that there are two input audio sources
(inputs=2), and that the result should last for the length
of the first source (duration=first). The [0:a] selects the
audio stream of input 0 (the TTS WAV), and [a1] is the faded
background music produced by the previous filter; these are
the two inputs fed to amix. The mixed result is labelled
[a], which is what -map "[a]" then selects for the output file.
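Putting it all together, the same command can also be written with the filtergraph built up in a shell variable, which makes the two halves a little easier to read. It is functionally identical to the one-liner above:

# Fade out input 1 (the background music) and label the result [a1].
FILTER="[1]afade=t=out:st=${AUDIO_ILEN}:d=${AUDIO_IFADE}[a1]"
# Mix the TTS audio ([0:a]) with the faded music ([a1]), ending when the TTS
# ends, and label the mix [a].
FILTER="${FILTER};[0:a][a1]amix=inputs=2:duration=first[a]"
ffmpeg -i "${filename%.*}.wav" -i "$AUDIO_INTRO" \
  -filter_complex "$FILTER" -map "[a]" \
  -threads $(nproc) -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
  "${filename%.*}.mp3"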