eSpeak is a compact, open source text-to-speech synthesiser for Windows and Linux, and a great piece of software for creating a talking Raspberry Pi. It can synthesise speech from text in English and other languages (including Afrikaans).
At the time of writing, eSpeak is newer than some other software text-to-speech synthesisers and is free to use. Apart from installation, it does not need an internet connection to operate, and it is fairly small in size. It is also very easy to get up and running with the default Raspbian settings, easy to use, and very customisable. eSpeak commands are run from the terminal using Bash.
On the downside, eSpeak wails a little and sounds very much like an alien or a robot (which may be exactly why you would want to incorporate it into a project). The text-to-speech (TTS) conversion is also not that accurate, and because of poor pronunciation some words can be difficult to make out.
For this post, a fully installed Raspberry Pi running the latest version of Raspbian was used. The default sound output, from either the 3.5mm audio jack or the HDMI cable, needs to be audible. An internet connection is required during installation. Without a screen, keyboard and mouse, PuTTY and/or WinSCP can be used to do the testing and coding.
eSpeak should work out of the box with Raspbian’s default sound settings. The only requirement is to choose the desired sound output (i.e. HDMI or audio jack). To test the sound output on Raspbian, the following terminal command can be used:
You should hear a short clip of a voice singing “haaa”. If this clip can be heard, then eSpeak should also be heard.
Alternatively, eSpeak can be used with the -w option to write WAV files containing the speech instead of playing it on the sound device. More on this below.
Installing eSpeak on a Raspberry Pi
While connected to the internet, the following terminal command is used to install eSpeak:
sudo apt-get install espeak
To see if eSpeak has been installed correctly use:
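One way to check, sketched here under the assumption that a version printout is enough proof of a working install, is to confirm that the espeak binary is on the PATH and ask it for its version. The function name check_espeak is made up for this example.

```shell
# Report whether the espeak binary is on the PATH and, if so, print its version.
# check_espeak is a hypothetical helper name, not part of eSpeak itself.
check_espeak() {
    if command -v espeak >/dev/null 2>&1; then
        espeak --version
    else
        echo "espeak not found; try: sudo apt-get install espeak"
        return 1
    fi
}

check_espeak || true
```

If a version line is printed, the package is installed; speaking a short phrase is then the real test of the sound output.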
Using the eSpeak command
On the Raspberry Pi, eSpeak is run from the terminal. The espeak command can be used in a couple of ways. The simplest is to type the text to be spoken (within double quotes) after the espeak command:
espeak "Hello, world"
To read text from a text file, use:
espeak -f <text file>
If no text is entered after the espeak command, the program takes its text from stdin, treating each line as a separate sentence. I.e. by just typing:
espeak
followed by text on subsequent lines, each line is spoken when Enter/RETURN is pressed. Pressing Ctrl + Z returns to the command prompt, although it only suspends eSpeak; Ctrl + D (end of input) or Ctrl + C exits it cleanly.
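Because eSpeak reads stdin when it is given no text, the output of another command can be piped straight into it. A minimal sketch, assuming a hypothetical helper called time_sentence (it is not part of eSpeak) that prints a sentence for eSpeak to read:

```shell
# Print a sentence containing the current time.
# time_sentence is a hypothetical helper used only for this illustration.
time_sentence() {
    date '+The time is %H:%M'
}

# With no text argument, espeak speaks whatever arrives on stdin, line by line.
if command -v espeak >/dev/null 2>&1; then
    time_sentence | espeak
fi
```

Any command that writes plain text to stdout could take the place of time_sentence here.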
eSpeak command options
eSpeak has plenty of handy command line options which will alter its default use. These include changing the accent/language, gendered tone, pitch, speed, etc. of the spoken voice. Command line options can be ‘stacked’ onto each other. To see the version of eSpeak and all the command line options:
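eSpeak's built-in help fits this description: as far as I can tell it prints a version line followed by the option summary. A minimal sketch, assuming a hypothetical wrapper espeak_help_head that trims the output to its first few lines:

```shell
# Show the first lines of eSpeak's built-in help (typically the version line,
# then the options). The default of 15 lines is an arbitrary choice.
# espeak_help_head is a hypothetical helper name.
espeak_help_head() {
    espeak --help | head -n "${1:-15}"
}

if command -v espeak >/dev/null 2>&1; then
    espeak_help_head
fi
```

Running espeak --help on its own prints the complete list.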
The voice used by eSpeak is determined by a voice accent/language file and a variant that sets its tone (e.g. male or female). To change the accent/language, the corresponding voice file needs to be selected. To see a list of the available voice files, the following command is used:
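The listing comes from the --voices option. A minimal sketch, assuming a hypothetical wrapper list_voices whose optional argument is a language code such as en:

```shell
# List the available voices; an optional language code narrows the list.
# list_voices is a hypothetical helper name.
list_voices() {
    if [ -n "${1:-}" ]; then
        espeak --voices="$1"
    else
        espeak --voices
    fi
}

if command -v espeak >/dev/null 2>&1; then
    list_voices        # every installed voice
    list_voices en     # only the English voices and accents
fi
```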
To use the Afrikaans accent/language, for example, the following command option is used:
espeak -v af
Here, af is the corresponding entry in the Language column of the list of available voices.
By default, all languages/accents are generated in a male tone. With eSpeak the tone of a voice can also be changed by adding a variant to the voice name:
-v <voice filename>[+<variant>]
According to the official documentation, the variants are “+m1 +m2 +m3 +m4 +m5 +m6 for male voices and +f1 +f2 +f3 +f4 which simulate female voices by using higher pitches. Other variants are +croak and +whisper.”
To use the Afrikaans accent/language with a mid-tone female voice, for example, the following command option is used:
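Going by the syntax above, this takes the form af+f<n>. A minimal sketch, assuming f2 is the "mid-tone" variant (the number is a guess; f3 is equally plausible) and using a hypothetical helper voice_spec to join the two parts:

```shell
# Join a language code and a variant into the form expected by -v.
# voice_spec is a hypothetical helper name.
voice_spec() {
    printf '%s+%s' "$1" "$2"
}

# "f2" as the mid-tone female variant is an assumption, not a documented fact.
if command -v espeak >/dev/null 2>&1; then
    espeak -v "$(voice_spec af f2)" "Goeie dag"
fi
```

Typed directly, the same call is simply espeak -v af+f2 "Goeie dag".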
Some of the other more useful command line options include:
-f <text file> speaks a text file.
--stdin takes the text input from stdin.
-a <integer> sets amplitude (volume) in a range of 0 to 200. The default is 100.
-p <integer> adjusts the pitch in a range of 0 to 99. The default is 50.
-s <integer> sets the speed in words-per-minute (approximate values for the default English voice, others may differ slightly). The default value is 170. Range 80 to 390.
-g <integer> inserts a pause between words. The value is the length of the pause, in units of 10 ms (at the default speed of 170 wpm).
-l <integer> inserts a line-break length, default value 0. If set, then lines which are shorter than this are treated as separate clauses and spoken separately with a break between them. This can be useful for some text files, but bad for others.
-w <wave file> writes the speech output to a file in WAV format, rather than speaking it.
-z removes the end-of-sentence pause which normally occurs at the end of the text.
--stdout writes the speech output to stdout as it is produced, rather than speaking it. The data starts with a WAV file header which indicates the sample rate and format of the data. The length field is set to zero because the length of the data is unknown when the header is produced.
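To show how these options stack, here is a minimal sketch that combines a voice, a speed and a pitch, and writes the result to a WAV file with -w instead of speaking it. The helper name speak_to_wav, the output path /tmp/hello.wav and the chosen values are all assumptions made for illustration.

```shell
# Render text to a WAV file with a chosen voice, speed and pitch.
# Usage: speak_to_wav <voice> <speed-wpm> <pitch> <outfile> <text>
# speak_to_wav is a hypothetical helper name.
speak_to_wav() {
    espeak -v "$1" -s "$2" -p "$3" -w "$4" "$5"
}

if command -v espeak >/dev/null 2>&1; then
    # Slower than the default 170 wpm and slightly higher than the default pitch of 50.
    speak_to_wav en+f3 140 60 /tmp/hello.wav "Hello, world"
fi
```

The resulting file can be played with any WAV player, for example aplay /tmp/hello.wav where the ALSA utilities are installed. A commonly used alternative is to stream the audio with --stdout into such a player instead of writing a file.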
See the eSpeak documentation for more information.