Speech Codec Wav Samples
Overview
Below are a variety of "before and after" .wav file samples for different LBR (low bit rate) speech (voice) codecs, including MELP, GSM, and G.729A/B, with bit rates ranging from 600 bps to 13000 bps. Click on the .wav file links in the table below to listen to the samples (note: all .wav file entries are mono, sampled at 8 kHz). The original speech samples are in the left column, with the different codec types across the remaining columns. Each row is a different language or sample type, such as speech with added background noise. In several cases, a language sample may include both male and female speakers.
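The format stated above (mono, 8 kHz) can be checked on a downloaded sample with Python's standard `wave` module. The following is a minimal sketch, not part of the original page. It assumes the samples are plain PCM .wav files, and the helper names `describe_wav` and `is_mono_8khz` are hypothetical. With 16-bit samples, mono 8 kHz audio works out to 8000 x 16 = 128000 bps, which matches the uncompressed bitrate quoted in the table headers.

```python
import sys
import wave


def describe_wav(path):
    """Return basic format facts for a PCM .wav file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits_per_sample = wav.getsampwidth() * 8  # sample width is in bytes
        duration_sec = wav.getnframes() / sample_rate
    return {
        "channels": channels,
        "sample_rate_hz": sample_rate,
        "bits_per_sample": bits_per_sample,
        # Uncompressed bitrate = channels x sample rate x bits per sample
        "bitrate_bps": channels * sample_rate * bits_per_sample,
        "duration_sec": duration_sec,
    }


def is_mono_8khz(info):
    """True if the file matches the format stated for the samples on this page."""
    return info["channels"] == 1 and info["sample_rate_hz"] == 8000


if __name__ == "__main__" and len(sys.argv) > 1:
    # The path is supplied by the user, e.g. a sample saved from the table.
    info = describe_wav(sys.argv[1])
    print(info)
    print("mono, 8 kHz:", is_mono_8khz(info))
```

Running the script with the path of a saved sample prints its channel count, sampling rate, and uncompressed bitrate.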
Underneath each sample is its PESQ score, an algorithmic comparison between the original sample and the processed sample that is designed to closely approximate a MOS score. A PESQ score of 4.5 is perfect, meaning the processed sample shows no degradation relative to the original sample. PESQ scores refer to the "Original" sample in the far left column unless otherwise indicated. More information on PESQ is given below.
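The scores on this page are raw PESQ values with a maximum of 4.5. ITU-T P.862.1 defines a separate mapping from a raw P.862 score to the MOS-LQO scale. The page does not say that this mapping was applied to any score listed here, so the sketch below only illustrates how a raw score could be placed on that scale. The function name `pesq_to_mos_lqo` is hypothetical.

```python
import math


def pesq_to_mos_lqo(raw_pesq):
    """Map a raw P.862 PESQ score to MOS-LQO using the P.862.1 logistic mapping.

    Assumption: the input is a raw narrowband P.862 score (range -0.5 to 4.5).
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))


# Example raw scores taken from the English2 row of the table below.
for raw in (4.5, 3.570, 2.666):
    print(f"raw PESQ {raw:.3f} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

The mapping is monotonic, so it preserves the ranking of codecs. It only changes the scale on which the differences are reported.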
Speech Recognition
Signalogic uses these wav files in speech recognition training, testing, and analysis work, for example comparing noise reduction and silence detection algorithms utilized by state-of-the-art codecs with those used by popular open source speech recognition software, such as Kaldi ◳. The most advanced codec in use today is the 3GPP sponsored EVS codec ◳, which accurately classifies and models voice, music and other sounds, and background noise, applies precision silence detection, and offers a wide variety of sampling rates and bitrates.
For Kaldi online decoding, the Kaldi ASR Offloading project offers a production alternative to GStreamer. SigSRF software, deployed worldwide by telecom, LEA, and analytics users, has an inference library that interfaces to Kaldi run-time libraries. This allows RTP packet audio streams using wideband codecs to be processed with full RFC and jitter buffer support, including multiple stream groups, then passed to Kaldi's online decoding raw audio interface.
More information on Signalogic speech recognition projects can be found at:
- Kaldi Online Decoding RTP Packet Interface ◳
- Edge AI Server (TI Wiki "Small AI Server" page) ◳
- DeepDome™ Home/Business Assistant Privacy, with Two-Level Wake Word Security ◳
- SigSRF SDK Overview ◳
- DirectCore® Software ◳
- 2-D Spectrograph page ◳
Speech Codec Samples
Numbers given in parentheses below are the bitrate (in bps) for uncompressed wav file samples, and the inherent algorithm frame size (in msec) for compressed (i.e. codec encoder output) wav file samples. Typically, the frame size is also the delay of the codec, but in some cases there may be additional "look ahead" delay.
Fs = 8 kHz (128000 bps) | 2400 bps (22.5 msec) | 2400 bps (22.5 msec) | 1200 bps (67.5 msec) | 2400 bps (22.5 msec) | 2700 bps (20 msec) | 4000 bps (20 msec) | 600 bps (30 msec) | 8000 bps (10 msec) | 13000 bps (20 msec) | 2400 + AT&T NPP [4] | 8000 bps + AT&T NPP | 13000 bps + AT&T NPP | 13000 bps + AT&T NPP | 13000 bps + AT&T NPP
Language & Score | Male, Female (sample pair in each column)
English1 | eng2_m | eng2_f | male600 | female600 | eng_m | eng_f | eng_m | eng_f
(ITU) PESQ Score | 4.5 | 4.5 | 2.673 | 2.293
English2 | eng_m | eng_f | eng_m1 | eng_f1 | eng_m2 | eng_f2 | eng_m3 | eng_f3 | eng_m4 | eng_f4 | eng_m5 | eng_f5 | eng_m6 | eng_f6 | eng_m7 | eng_f7 | eng_m9 | eng_f9 | eng_m10 | eng_f10
(ITU) PESQ [1] Score | 4.5 | 4.5 | 2.666 | 2.445 | 2.86 | 2.583 | 2.413 | 2.323 | 2.923 | 2.704 | 2.958 | 2.734 | 3.266 | 3.078 | 3.570 | 3.265 | 2.583 | 2.434 | 3.254 | 3.076
French | f_m | f_f | f_m1 | f_f1 | f_m2 | f_f2 | f_m3 | f_f3 | f_m4 | f_f4 | f_m5 | f_f5 | f_m6 | f_f6 | f_m7 | f_f7 | f_m9 | f_f9 | f_m10 | f_f10 | f_m | f_f | f_m | f_f
(ITU) PESQ Score | 4.5 | 4.5 | 2.401 | 2.549 | 2.482 | 2.575 | 2.249 | 2.365 | 2.786 | 2.756 | 2.770 | 2.829 | 3.162 | 3.243 | 3.349 | 3.352 | 2.482 | 2.599 | 3.343 | 3.315
German
(ITU) PESQ Score
Japanese
(ITU) PESQ Score
Chinese | ch_m | ch_f | ch_m1 | ch_f1 | ch_m2 | ch_f2 | ch_m3 | ch_f3 | ch_m4 | ch_f4 | ch_m5 | ch_f5 | ch_m6 | ch_f6 | ch_m7 | ch_f7 | ch_m9 | ch_f9 | ch_m10 | ch_f10
(ITU) PESQ Score | 4.5 | 4.5 | 2.781 | 2.572 | 3.080 | 2.769 | 2.738 | 2.477 | 3.120 | 2.739 | 3.124 | 2.809 | 3.490 | 3.164 | 3.730 | 3.601 | 2.969 | 2.641 | 3.548 | 3.462
NSA test vector [5] | nsa_m | nsa_f | nsa_m1 | nsa_f1 | nsa_m2 | nsa_f2 | nsa_m3 | nsa_f3 | nsa_m4 | nsa_f4 | nsa_m5 | nsa_f5 | nsa_m6 | nsa_f6 | nsa_m600 | nsa_f600 | nsa_m7 | nsa_f7 | nsa_m9 | nsa_f9 | nsa_m10 | nsa_f10
(ITU) PESQ Score | 4.5 | 4.5 | 3.185 | 2.963 | 3.270 | 3.063 | 2.976 | 2.761 | 3.331 | 3.029 | 3.330 | 2.963 | 3.637 | 3.451 | 2.694 | 2.275 | 3.882 | 3.901 | 3.197 | 2.988 | 3.865 | 3.659
GSM test vector
(ITU) PESQ Score
G.729A test vector
(ITU) PESQ Score
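The bitrate and frame size figures in the column headings above determine how many bits each codec spends per frame, and how much each one compresses the 128000 bps original. The following is a small worked sketch. The function names are hypothetical, and the (bitrate, frame size) pairs are copied from the headings without assigning codec names to them.

```python
UNCOMPRESSED_BPS = 128000  # Fs = 8 kHz column heading


def bits_per_frame(bitrate_bps, frame_ms):
    """Bits produced by the encoder for one frame of the given duration."""
    return bitrate_bps * frame_ms / 1000.0


def compression_ratio(bitrate_bps, uncompressed_bps=UNCOMPRESSED_BPS):
    """How many times smaller the encoded stream is than the original."""
    return uncompressed_bps / bitrate_bps


# (bitrate in bps, frame size in msec) pairs as listed in the table headings.
COLUMNS = [
    (2400, 22.5),
    (1200, 67.5),
    (2700, 20),
    (4000, 20),
    (600, 30),
    (8000, 10),
    (13000, 20),
]

for rate, frame_ms in COLUMNS:
    print(
        f"{rate:>5} bps, {frame_ms:>4} msec: "
        f"{bits_per_frame(rate, frame_ms):.0f} bits/frame, "
        f"{compression_ratio(rate):.1f}:1 compression"
    )
```

For instance, the 2400 bps column with 22.5 msec frames works out to 54 bits per frame, and the 8000 bps column with 10 msec frames to 80 bits per frame.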
Speech + Noise Codec Samples
w/o Noise Fs = 8 kHz (128000 bps) | with Noise Fs = 8 kHz (128000 bps) | 2400 bps (22.5 msec) | 2400 bps (22.5 msec) | 1200 bps (67.5 msec) | 2400 bps (22.5 msec) | 2700 bps (20 msec) | 4000 bps (20 msec) | 8000 bps (10 msec) | 13000 bps (20 msec) | 2400 + AT&T NPP | 8000 bps + AT&T NPP | + AT&T NPP | MELPe + RLS [6]
Language & Score
English + car background noise (1) | car | car1 | car2 | car3 | car4 | car5 | car6 | car7 | car10 | car9 | car12
PESQ Score | 2.458 | 2.561 | 2.37 | 2.740 | 2.723 | 2.968 | 3.515 [11] | 2.836 [7] | 3.655 [12]
English + white noise (2) | white noise original | white noise | white noise1 | white noise2 | white noise3 | white noise4 | white noise5 | white noise6 | white noise7 | white noise9 | white noise10 | white noise9 | white noise11 | white noise13
PESQ Score | 4.5 | 1.615 | 2.251 | 2.237 | 1.574 | 2.509 | 2.795 | 2.933 | 3.550 [8] | 1.513 [9] | 2.946 [10] | 3.121 [9]
English + wideband noise (3) | wideband noise original | wideband noise | wideband noise1 | wideband noise2 | wideband noise3 | wideband noise4 | wideband noise5 | wideband noise6 | wideband noise7 | wideband noise9 | wideband noise10 | wideband noise11 | wideband noise12 | wideband noise14
PESQ Score | 4.5 | 1.695 | 2.643 [13] | 2.696 [13] | 2.453 [13] | 2.855 [13] | 2.788 [13] | 3.154 [13] | 3.418 [13] | 3.913 | 2.080 [14] | 1.989 [14] | 2.064 [14] | 3.253
English + street background noise (4) | street noise original | street noise | street noise1 | street noise2 | street noise3 | street noise4 | street noise5 | street noise6 | street noise7 | street noise9 | street noise10 | street noise11 | street noise12 | street noise14
PESQ Score | 4.5 | 1.800 | 2.283 [15] | 2.351 [15] | 2.197 [15] | 2.493 [15] | 2.422 [15] | 2.925 [15] | 3.359 [15] | 3.799 | 2.139 [16] | 1.989 [16] | 2.133 [16] | 3.239
External Links -- More Samples
Below are listed more web pages with audio/voice codec samples:
- http://www.kyastem.co.jp/english/e-sample.html
- http://www.hawksoft.com/hawkvoice/codecs.shtml
- SigSRF SDK (more wav files are included in the demo download) ◳
About PESQ
PESQ (Perceptual Evaluation of Speech Quality) is an ITU algorithm that is increasingly used to emulate MOS (mean opinion score), which is obtained from human listening tests of speech codecs. The objective is to automate, and make reproducible, the measurement of degradation over telephony channels due to speech compression, line conditions, delay/echo, etc., and to obtain results that closely correlate with MOS human listening tests. More information on the PESQ approach and algorithm can be found at: http://www.pesq.org/
The current ITU PESQ recommendation is P.862E ◳, and the current software version of ITU PESQ is v1.2, which is a significant improvement over earlier versions. PESQ developers continue to work on limitations in maximum input file size, worst-case variation in delay (dynamic time-alignment with the reference input), and noisy backgrounds.
AT&T Noise Preprocessor
The AT&T website contains no published information about the AT&T Noise Preprocessor. Instead, below we provide the abstract of the AT&T patent:
The system and method of the invention relates to voice detection technology for determining instants of time at which a snapshot of noise characteristics results in improved adaptation of noise floors used in voice detection. The approach is based on the "lower envelope" of the smoothed input signal power. Incorporation of this approach in a simple time domain VAD (Voice Activity Detector) results in an effective low-complexity system which, on the basis of simulations, gives good performance down to SNR values of about 0 dB. In the invention the lower envelope also provides the updated value of the noise threshold during the presence of speech. The invention can also be embedded in other, more complex (e.g., frequency domain) VADs at low computational cost.
More information on the AT&T patent:
Patent Number | 5,991,718
Author | David Malah
Title | System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments
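The abstract above describes tracking the "lower envelope" of the smoothed input power and using it as an adaptive noise floor in a time domain VAD. The sketch below illustrates that general idea with a simple minimum tracker. It is not the patented AT&T algorithm, whose details are not given on this page. The function names and the `smooth`, `rise`, and `margin` values are hypothetical choices made for illustration.

```python
def frame_power(samples, frame_len):
    """Mean power of each complete, non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def lower_envelope_vad(powers, smooth=0.7, rise=1.01, margin=4.0, floor_min=1e-9):
    """Per-frame speech/non-speech decisions from a tracked noise floor.

    Illustrative only: all parameter values are assumptions, not values
    taken from the patent or from the AT&T Noise Preprocessor.
    """
    if not powers:
        return []
    decisions = []
    smoothed = powers[0]
    noise_floor = max(powers[0], floor_min)
    for p in powers:
        # First-order smoothing of the frame power.
        smoothed = smooth * smoothed + (1.0 - smooth) * p
        if smoothed < noise_floor:
            # The envelope follows dips in power immediately.
            noise_floor = max(smoothed, floor_min)
        else:
            # Otherwise it creeps upward slowly, so it keeps adapting
            # to the noise level even while speech is present.
            noise_floor = min(noise_floor * rise, smoothed)
        # Speech is declared when power is well above the noise floor.
        decisions.append(smoothed > margin * noise_floor)
    return decisions


# Synthetic example: steady noise, a burst of louder "speech", then noise.
example_powers = [1.0] * 20 + [100.0] * 10 + [1.0] * 40
print(lower_envelope_vad(example_powers))
```

Because the floor rises slowly even during speech, a very long stretch of continuous speech would eventually be treated as noise by this simple tracker. A practical detector would need additional safeguards for that case.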
Notes
1 ITU P.862 v1.2. For more information, see the About PESQ section above.
2 Mixed Excitation Linear Predictive.
3 Global System for Mobile Communications.
4 AT&T Noise Preprocessor algorithm. NPP algorithm has some processing delay before algorithm output stabilizes.
5 Mixed Excitation Linear Predictive. MELP = original MELP v1.2, MELPe = enhanced MELP (current standard).
6 MELPe + RLS based adaptive noise cancellation information is located at http://www.owlnet.rice.edu/~ryanking/elec431/compare.html
7 PESQ score referenced to car10 waveform.
8 PESQ score referenced to white noise waveform.
9 PESQ score referenced to white noise original waveform.
10 PESQ score referenced to white noise10 waveform.
11 PESQ score referenced to car waveform.
12 PESQ score referenced to car10 waveform.
13 PESQ score referenced to wideband noise waveform.
14 PESQ score referenced to wideband noise original waveform.
15 PESQ score referenced to street noise waveform.
16 PESQ score referenced to street noise original waveform.
17 Preliminary Simulation of 600bps MELPe-Plus Codec.