
Speech Codec Wav Samples

Overview

Below are "before and after" .wav file samples for several LBR (low bit rate) speech (voice) codecs, including MELP, GSM, and G.729A/B, with bit rates ranging from 600 bps to 13000 bps. Click a .wav file link in the table below to listen to the sample. All .wav files are mono, sampled at 8 kHz.
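Because every sample is stated to be mono at 8 kHz, a downloaded file can be checked against that format before it is used in a test or training pipeline. The following is a minimal sketch using Python's standard wave module. The function names and the silent stand-in file are illustrative assumptions, not part of any Signalogic tool, and the wave module reads only uncompressed PCM wav files.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=8000, expected_channels=1):
    """Return (ok, info), where info holds fields read from the wav header."""
    with wave.open(path, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_sec": wav.getnframes() / wav.getframerate(),
        }
    ok = (info["channels"] == expected_channels
          and info["sample_rate"] == expected_rate)
    return ok, info


def write_silence(path, seconds=1.0, rate=8000):
    """Write a mono 16-bit silent wav file (a stand-in for a real sample)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


if __name__ == "__main__":
    # Hypothetical file name; replace with the path of a downloaded sample.
    demo_path = os.path.join(tempfile.mkdtemp(), "stand_in.wav")
    write_silence(demo_path)
    print(check_wav_format(demo_path))
```

A file at any other sampling rate, or with more than one channel, makes the first element of the returned tuple False.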

Codec types run across the columns, with the original speech sample in the far-left column. Each row is a different language or sample type, such as speech with added background noise. In several cases, a language sample includes both male and female speakers.

Underneath each sample is its PESQ score, an algorithmic comparison between the original sample and the processed sample that is designed to closely approximate a MOS score. A PESQ score of 4.5 is perfect, meaning the processed sample shows no degradation from the original sample. Unless otherwise indicated, PESQ scores are referenced to the "Original" sample in the far-left column. More information on PESQ is given below.

Speech Recognition

Signalogic uses these wav files in speech recognition training, testing, and analysis work, for example to compare the noise reduction and silence detection algorithms used by state-of-the-art codecs with those used by popular open source speech recognition software, such as Kaldi. The most advanced codec in use today is the 3GPP-sponsored EVS codec, which accurately classifies and models voice, music and other sounds, and background noise, applies precision silence detection, and offers a wide variety of sampling rates and bitrates.

For Kaldi online decoding, the Kaldi ASR Offloading project offers a production alternative to GStreamer. SigSRF software, deployed worldwide by telecom, LEA, and analytics users, has an inference library that interfaces to Kaldi run-time libraries. This allows RTP packet audio streams using wideband codecs to be processed with full RFC and jitter buffer support, including multiple stream groups, then given to Kaldi's online decoding raw audio interface.

More information on Signalogic speech recognition projects can be found at:

Kaldi Online Decoding RTP Packet Interface

Speech Codec Samples

Numbers given in () below are bitrate (in bps) for uncompressed wav file samples, and inherent algorithm frame size (in msec) for compressed (i.e. codec encoder output) wav file samples. Typically, the frame size is also the delay of the codec, but in some cases there may be additional "look ahead" delay.
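The two kinds of numbers in parentheses are related by simple arithmetic, sketched below. The function names are illustrative, and the 16-bit sample width is an inference from the stated 128000 bps at 8 kHz mono rather than something the page states directly.

```python
def uncompressed_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate in bps of an uncompressed PCM stream."""
    return sample_rate_hz * bits_per_sample * channels


def bits_per_frame(bitrate_bps, frame_msec):
    """Number of encoder output bits produced for each codec frame."""
    return bitrate_bps * frame_msec / 1000.0


# (codec, bitrate in bps, frame size in msec), taken from the table headers.
CODECS = [
    ("MELPe", 2400, 22.5),
    ("MELPe", 1200, 67.5),
    ("G.729A", 8000, 10),
    ("GSM", 13000, 20),
]

if __name__ == "__main__":
    # Assumed 16-bit samples: 8000 Hz x 16 bits x 1 channel.
    original_bps = uncompressed_bitrate(8000, 16)
    print("Original:", original_bps, "bps")
    for name, bps, frame_msec in CODECS:
        ratio = original_bps / bps
        print(f"{name} {bps} bps: {bits_per_frame(bps, frame_msec):g} bits/frame, "
              f"{ratio:.1f}:1 compression")
```

For example, MELPe at 2400 bps with 22.5 msec frames works out to 54 bits per frame, G.729A to 80 bits per frame, and GSM to 260 bits per frame.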

Original Sample, Fs = 8 kHz (128000 bps)
MELP², 2400 bps (22.5 msec)
MELPe, 2400 bps (22.5 msec)
MELPe, 1200 bps (67.5 msec)
MELPe-Plus, 2400 bps (22.5 msec)
MELPe-Plus, 2700 bps (20 msec)
MELPe-Plus, 4000 bps (20 msec)
MELPe-Plus¹⁷, 600 bps (30 msec)
G.729A, 8000 bps (10 msec)
GSM³, 13000 bps (20 msec)
MELPe, 2400 bps + AT&T NPP⁴
G.729A, 8000 bps + AT&T NPP
GSM, 13000 bps + AT&T NPP
CVSD, 13000 bps + AT&T NPP
G.726, 13000 bps + AT&T NPP

Language & Score

Male / Female (repeated under each of the 15 columns above)

English1
(ITU) PESQ Score: 4.5, 2.673, 2.293

English2
(ITU) PESQ¹ Score: 4.5, 2.666, 2.445, 2.86, 2.583, 2.413, 2.323, 2.923, 2.704, 2.958, 2.734, 3.266, 3.078, 3.570, 3.265, 2.583, 2.434, 3.254, 3.076

French
(ITU) PESQ Score: 4.5, 2.401, 2.549, 2.482, 2.575, 2.249, 2.365, 2.786, 2.756, 2.770, 2.829, 3.162, 3.243, 3.349, 3.352, 2.482, 2.599, 3.343, 3.315

German
(ITU) PESQ Score

Japanese
(ITU) PESQ Score

Chinese
(ITU) PESQ Score: 4.5, 2.781, 2.572, 3.080, 2.769, 2.738, 2.477, 3.120, 2.739, 3.124, 2.809, 3.490, 3.164, 3.730, 3.601, 2.969, 2.641, 3.548, 3.462

NSA test vector⁵
(ITU) PESQ Score: 4.5, 3.185, 2.963, 3.270, 3.063, 2.976, 2.761, 3.331, 3.029, 3.330, 2.963, 3.637, 3.451, 2.694, 2.275, 3.882, 3.901, 3.197, 2.988, 3.865, 3.659

GSM test vector
(ITU) PESQ Score

G.729A test vector
(ITU) PESQ Score

Speech + Noise Codec Samples

Original Sample w/o Noise, Fs = 8 kHz (128000 bps)
Original Sample with Noise, Fs = 8 kHz (128000 bps)
MELP², 2400 bps (22.5 msec)
MELPe, 2400 bps (22.5 msec)
MELPe, 1200 bps (67.5 msec)
MELPe-Plus, 2400 bps (22.5 msec)
MELPe-Plus, 2700 bps (20 msec)
MELPe-Plus, 4000 bps (20 msec)
G.729A, 8000 bps (10 msec)
GSM³, 13000 bps (20 msec)
RLS
AT&T NPP⁴
MELPe, 2400 bps + AT&T NPP
G.729A, 8000 bps + AT&T NPP
GSM + AT&T NPP
MELPe + RLS⁶

Language & Score

English + car background noise (1)
car
PESQ Score: 2.458, 2.561, 2.37, 2.740, 2.723, 2.968, 3.515¹¹, 2.836⁷, 3.655¹²

English + white noise (2)
PESQ Score: 4.5, 1.615, 2.251, 2.237, 1.574, 2.509, 2.795, 2.933, 3.550⁸, 1.513⁹, 2.946¹⁰, 3.121⁹

English + wideband noise (3)
wideband noise original
wideband noise5
PESQ Score: 4.5, 1.695, 2.643¹³, 2.696¹³, 2.453¹³, 2.855¹³, 2.788¹³, 3.154¹³, 3.418¹³, 3.913, 2.080¹⁴, 1.989¹⁴, 2.064¹⁴, 3.253

English + street background noise (4)
street noise original
street noise5
PESQ Score: 4.5, 1.800, 2.283¹⁵, 2.351¹⁵, 2.197¹⁵, 2.493¹⁵, 2.422¹⁵, 2.925¹⁵, 3.359¹⁵, 3.799, 2.139¹⁶, 1.989¹⁶, 2.133¹⁶, 3.239

External Links -- More Samples

The following web pages offer more audio/voice codec samples:

http://www.kyastem.co.jp/english/e-sample.html

http://www.hawksoft.com/hawkvoice/codecs.shtml

SigSRF SDK (more wav files are included in the demo download)

About PESQ

PESQ (Perceptual Evaluation of Speech Quality) is an ITU algorithm increasingly used to emulate MOS (mean opinion score), a subjective listening test for speech codecs. The objective is to automate, and make reproducible, the measurement of degradation over telephony channels due to speech compression, line conditions, delay/echo, etc., and to obtain results that closely correlate with MOS human listening tests.

More information on the PESQ approach and algorithm can be found at: http://www.pesq.org/

The current ITU PESQ recommendation is P.862E, and the current software version of ITU PESQ is v1.2, which is a significant improvement over earlier versions. PESQ developers are still working to address limitations in maximum input file size, worst-case variation in delay (dynamic time-alignment with the reference input), and noisy backgrounds.
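The scores on this page appear to be on the raw P.862 scale, where 4.5 is the maximum. A companion recommendation, ITU-T P.862.1, which is not otherwise discussed on this page, defines a mapping from a raw PESQ score to a MOS-LQO value for easier comparison with listening-test MOS. The sketch below applies that mapping as an illustration only; the function name is an assumption and none of the scores listed above have been converted.

```python
import math


def pesq_to_mos_lqo(raw_pesq):
    """Map a raw P.862 PESQ score to MOS-LQO using the P.862.1 logistic function."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))


if __name__ == "__main__":
    # A few raw scores taken from the tables above, plus the perfect score.
    for raw in (2.293, 3.266, 3.901, 4.5):
        print(f"raw PESQ {raw} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

The mapping is monotonic, so it does not change the ranking of codecs. It only rescales the scores, with the perfect raw score of 4.5 landing at roughly 4.55.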

AT&T Noise Preprocessor

The AT&T website contains no published information about the AT&T Noise Preprocessor. Instead, below we provide the abstract of the AT&T patent:

The system and method of the invention relates to voice detection technology for determining instants of time at which a snapshot of noise characteristics results in improved adaptation of noise floors used in voice detection. The approach is based on the "lower envelope" of the smoothed input signal power. Incorporation of this approach in a simple time domain VAD (Voice Activity Detector) results in an effective low-complexity system which, on the basis of simulations, gives good performance down to SNR values of about 0 dB. In the invention the lower envelope also provides the updated value of the noise threshold during the presence of speech. The invention can also be embedded in other, more complex (e.g., frequency domain) VADs at low computational cost.
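The "lower envelope of the smoothed input signal power" idea in the abstract can be illustrated with a very small time domain VAD. This is a rough sketch of the general concept only, not the patented AT&T method. The function names, the smoothing constant, the rise rate, and the decision margin are all hypothetical values chosen for illustration.

```python
def frame_powers(samples, frame_len=80):
    """Mean-square power of each full frame (80 samples = 10 msec at 8 kHz)."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def lower_envelope_vad(powers, smooth=0.9, rise=1.05, margin=4.0):
    """Return one True (speech) or False (noise) decision per frame.

    The noise floor follows the lower envelope of the smoothed power: it
    drops at once to any new minimum and otherwise creeps upward slowly,
    so it keeps adapting even while speech is present.
    """
    if not powers:
        return []
    smoothed = powers[0]
    floor = max(powers[0], 1e-9)
    decisions = []
    for p in powers:
        # First-order recursive smoothing of the frame power.
        smoothed = smooth * smoothed + (1.0 - smooth) * p
        # Lower envelope: slow rise, capped by the smoothed power itself.
        floor = max(min(floor * rise, smoothed), 1e-9)
        decisions.append(smoothed > margin * floor)
    return decisions


if __name__ == "__main__":
    # Synthetic frame powers: noise, a burst of "speech", then noise again.
    powers = [1.0] * 20 + [100.0] * 10 + [1.0] * 20
    print(lower_envelope_vad(powers))
```

With these illustrative settings the burst is flagged as speech, followed by a hangover period while the smoothed power decays back toward the noise floor. A slower rise rate lets the floor survive longer speech segments, at the cost of slower adaptation when the background noise level increases.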

More information on the AT&T patent:

Patent Number	5,991,718
Author	David Malah
Title	System and method for noise threshold adaptation for voice activity detection in nonstationary noise environments

More information about the AT&T Noise Preprocessor can be found in this 2001 ICASSP paper.

Send e-mail to: Joe Alfred [ALIPM] at AT&T

Notes

¹ ITU P.862 v1.2. For more information, see About PESQ section above.
² Mixed Excitation Linear Predictive.
³ Global System for Mobile Communications.
⁴ AT&T Noise Preprocessor algorithm. NPP algorithm has some processing delay before algorithm output stabilizes.
⁵ Mixed Excitation Linear Predictive. MELP = original MELP v1.2, MELPe = enhanced MELP (current standard).
⁶ MELPe + RLS based adaptive noise cancellation information is located at http://www.owlnet.rice.edu/~ryanking/elec431/compare.html
⁷ PESQ score referenced to car10 waveform.
⁸ PESQ score referenced to white noise waveform.
⁹ PESQ score referenced to white noise original waveform.
¹⁰ PESQ score referenced to white_noise10 waveform.
¹¹ PESQ score referenced to car waveform.
¹² PESQ score referenced to car10 waveform.
¹³ PESQ score referenced to wideband noise waveform.
¹⁴ PESQ score referenced to wideband noise original waveform.
¹⁵ PESQ score referenced to streetnoise waveform.
¹⁶ PESQ score referenced to streetnoise original waveform.
¹⁷ Preliminary simulation of 600 bps MELPe-Plus codec.