I don’t know what I’m doing, but I did something and maybe it could be useful to you too. I generated text-to-speech (TTS) audio with background music for a small intro. I couldn’t find any examples online, so hopefully this can save somebody some time.
I wrote a brief article explaining why here.
So here we are using `pandoc` to take a markdown file (`$filename`) and convert it to plain text (a `.plain` file):

```shell
pandoc -t plain -o ${filename%.*}.plain $filename
```
This should be really quick for reasonably sized documents.
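As an aside, the `${filename%.*}` expansion that appears in every step is plain POSIX parameter expansion: it strips the shortest trailing match of `.*`, i.e. the file extension. A quick illustration:

```shell
filename="my-post.md"

# "%.*" deletes the shortest suffix matching ".*" (the extension),
# so swapping extensions is just a matter of appending a new one.
echo "${filename%.*}"          # my-post
echo "${filename%.*}.plain"    # my-post.plain
echo "${filename%.*}.wav"      # my-post.wav
```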
Next we want to convert the plain text (the `.plain` file) to a WAV file (a `.wav` file):

```shell
espeak -v us-mbrola-3 -s 150 -x -w "${filename%.*}.wav" \
  "$AUDIO_START $(cat ${filename%.*}.plain) $AUDIO_END"
```
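For reference, `$AUDIO_START` and `$AUDIO_END` are just spoken bookends around the article text. Something like this would do (these values are made up; the exact wording is up to you):

```shell
# Example values only; any sentences will do.
AUDIO_START="This is an audio version of the following article."
AUDIO_END="That is the end of the article. Thanks for listening."
```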
A quick explanation of parameters:
- `-v us-mbrola-3` – A nicer-than-default voice for `espeak`, from the mbrola package.
- `-s 150` – Increase the speed of the voice; I don’t have that long to live.
- `-x` – Write the phoneme mnemonics to stdout rather than speaking them.
- `-w "${filename%.*}.wav"` – Output the WAV file to the given location.
- `$AUDIO_START` – A variable containing a start-of-audio statement.
- `$(cat ${filename%.*}.plain)` – The contents of the file we want spoken.
- `$AUDIO_END` – A variable containing an end-of-audio statement.

And now finally, we want to overlay background music onto the intro and compress the audio to save on bandwidth:
```shell
ffmpeg -i "${filename%.*}.wav" \
  -i "$AUDIO_INTRO" \
  -filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
  -map "[a]" \
  -threads $(nproc) -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
  "${filename%.*}.mp3"
```
A quick explanation of parameters:
- `-i "${filename%.*}.wav"` – The input WAV file (the TTS conversion).
- `-i "$AUDIO_INTRO"` – The background music file.
- `-map "[a]"` – Map the filter output to the audio of the output file.
- `-threads $(nproc)` – Number of threads to use, based on the number of available cores (mostly irrelevant for audio processing).
- `-acodec libmp3lame` – The audio codec to use for encoding the MP3.
- `-ac 2` – Use two audio channels.
- `-ar 22050` – The audio sampling rate (reduced for compression).
- `-ab 40k` – The audio bit rate (reduced for compression).
- `-f mp3` – The output format is MP3.
- `"${filename%.*}.mp3"` – The output MP3 file.

Lastly, there is this beast, where the magic is done:
```shell
-filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]"
```
Disclaimer: This is my interpretation of what is going on - I could be talking rubbish. You are warned!
Firstly, `$AUDIO_ILEN` is the length of time to play the starting audio for, and `$AUDIO_IFADE` is how long the linear fade-out should take.
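To make that concrete, with made-up values like these the music plays at full volume for 8 seconds and is silent by the 11-second mark:

```shell
# Made-up example values.
AUDIO_ILEN=8    # start fading the music 8 seconds in
AUDIO_IFADE=3   # take 3 seconds to fade to silence

# The fade therefore ends this many seconds in:
echo $((AUDIO_ILEN + AUDIO_IFADE))   # 11
```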
Breaking it down:
```shell
[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1]
```
So essentially we are telling `ffmpeg` that we want an audio fade effect (`afade`), that the fade type is a fade-out (`t=out`), the start time for the fade-out (`st`) and the duration of the fade-out (`d`). The `[1]` selects the audio of the second input file (inputs are numbered from zero, so that is the background music), and `[a1]` labels the result of this filter so it can be referred to later in the chain.
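If you want to see `afade` behave in isolation, a quick experiment is to fade out a generated test tone (this assumes your ffmpeg build has the `lavfi` `sine` source, which stock builds do):

```shell
# Generate a 5-second 440 Hz tone and fade it out from the
# 3-second mark over 2 seconds; skipped if ffmpeg is missing.
if command -v ffmpeg >/dev/null 2>&1; then
  ffmpeg -loglevel error -y -f lavfi -i "sine=frequency=440:duration=5" \
    -af "afade=t=out:st=3:d=2" fade_demo.wav
fi
```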
And for the second part:
```shell
[0:a][a1]amix=inputs=2:duration=first[a]
```
So here we say to mix the audio sources into one source (`amix`), that there are two input audio sources (`inputs=2`) and that the result should last as long as the first source (`duration=first`). `[0:a]` is the audio stream of input 0 (the TTS), `[a1]` is the faded background music from the previous filter, and `[a]` labels the mixed result, which is what `-map "[a]"` picks up for the output.
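Putting it all together, the whole pipeline can be sketched as one small function (assuming pandoc, espeak with the mbrola voices, and ffmpeg are installed, and that the `AUDIO_*` variables above are set; `markdown_to_mp3` is just a name I made up):

```shell
# Sketch of the full pipeline: markdown -> plain text -> TTS WAV -> MP3.
markdown_to_mp3() {
    filename=$1

    # Step 1: markdown to plain text.
    pandoc -t plain -o "${filename%.*}.plain" "$filename"

    # Step 2: plain text to speech, with spoken bookends.
    espeak -v us-mbrola-3 -s 150 -x -w "${filename%.*}.wav" \
        "$AUDIO_START $(cat "${filename%.*}.plain") $AUDIO_END"

    # Step 3: mix in the fading background music and compress to MP3.
    ffmpeg -i "${filename%.*}.wav" -i "$AUDIO_INTRO" \
        -filter_complex "[1]afade=t=out:st=$AUDIO_ILEN:d=$AUDIO_IFADE[a1];[0:a][a1]amix=inputs=2:duration=first[a]" \
        -map "[a]" \
        -threads "$(nproc)" -acodec libmp3lame -ac 2 -ar 22050 -ab 40k -f mp3 \
        "${filename%.*}.mp3"
}

# Usage: markdown_to_mp3 intro.md   # produces intro.mp3
```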