
[Speech Recognition] Chinese-English Language Identification Based on MATLAB MFCC+LPC Features and SVM [MATLAB Source Code Included, Issue 612]

Date: 2021-09-05 19:15:59


I. Overview of Language Identification and Audio Processing

1 Basic Principle

Language identification is the task of determining which language a segment of audio is spoken in, for example English, Chinese, or French. The overall idea of this project is to convert the speech data into a spectrogram or MFCC features, then analyze those features to decide which language the recording belongs to.

2 Public Dataset

Topcoder competition data (44.1 kHz MP3 recordings, 10 seconds each, 176 languages, 66,176 (176 × 376) clips in total, including many less-common languages).

3 Basic Audio Processing Pipeline

Speech is read in, features are extracted from the audio signal, the features are analyzed, and the final result is produced. Feature extraction is usually based on the spectrogram or on MFCC features; a rough sketch of this flow is given below.
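As a purely illustrative sketch of that flow: random placeholder vectors stand in for the real MFCC/LPC features, and fitcsvm/predict from the Statistics and Machine Learning Toolbox are used here in place of the older svmtrain/svmclassify called in the source code in Section II.

% Illustrative pipeline sketch (placeholder features, modern SVM API)
rng(0);
TrainData = rand(60, 3001);               % 60 training clips x 3001 features (placeholder)
Group     = [zeros(30,1); ones(30,1)];    % 0 = English, 1 = Chinese
model     = fitcsvm(TrainData, Group);    % train a binary SVM on the features

TestData  = rand(1, 3001);                % feature vector of one unknown clip (placeholder)
label     = predict(model, TestData);     % 1 -> Chinese, 0 -> English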

4 Details

4.1 Speech Input

The input audio can be a WAV (waveform audio) file, an MP3 file, or a signal captured from a microphone, for example as sketched below.
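A minimal sketch using MATLAB's built-in audio I/O; the file name 'sample.wav' and the 5-second recording length are placeholders.

% Read an audio file (WAV/MP3/FLAC); 'sample.wav' is a placeholder name.
[x, fs] = audioread('sample.wav');
x = mean(x, 2);                        % mix down to mono if the file is stereo

% Or capture 5 seconds from the default microphone: 16 kHz, 16-bit, mono.
rec = audiorecorder(16000, 16, 1);
recordblocking(rec, 5);
xmic = getaudiodata(rec);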

4.2 Audio Signal Feature Extraction

The purpose of speech signal processing is to determine how the energy in the speech is distributed over frequency. The standard mathematical tool is the Fourier transform, but the Fourier transform requires a stationary input signal, so the speech is split into frames; each short segment that is cut out (typically 20-30 ms) is called a frame. [Over such a short window the signal is assumed to be stationary.] A framing sketch is given below.
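A minimal framing sketch, assuming fs = 16000 and the same 25 ms frame length / 10 ms shift used in the source code in Section II; the random signal is a placeholder for a real mono speech vector (e.g. from audioread above).

% Split a speech signal into overlapping, Hamming-windowed frames.
fs  = 16000;
x   = randn(fs, 1);                        % placeholder: replace with real speech
Tw  = 25;  Ts = 10;                        % frame duration / shift (ms)
Nw  = round(1e-3*Tw*fs);                   % samples per frame (400)
Ns  = round(1e-3*Ts*fs);                   % samples per frame shift (160)
win = 0.54 - 0.46*cos(2*pi*(0:Nw-1).'/(Nw-1));   % Hamming window (column)

nFrames = 1 + floor((length(x)-Nw)/Ns);    % number of full frames in x
frames  = zeros(Nw, nFrames);
for i = 1:nFrames
    seg = x((i-1)*Ns + (1:Nw));            % cut out one short segment
    frames(:,i) = seg(:) .* win;           % windowed frame, assumed stationary
end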

Frame the speech → compute the FFT (a fast implementation of the discrete Fourier transform) of each frame → take the magnitude/energy of the FFT result. These values are all non-negative; treated like image pixels and displayed, they form a spectrogram.

In the spectrogram, the x-axis is time and the y-axis is frequency, so the energy distribution within a given frequency band can be inspected directly, as in the sketch below.
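Continuing the framing sketch above, each frame can be transformed with the FFT and the log magnitudes displayed as a spectrogram (time on the x-axis, frequency on the y-axis); nfft = 512 is an assumed FFT size.

% Spectrogram from the framed signal: FFT each column, keep the magnitudes.
nfft = 512;
S = fft(frames, nfft);                     % FFT of each column (one frame per column)
S = abs(S(1:nfft/2+1, :));                 % magnitudes of the non-negative frequencies
t = ((0:nFrames-1)*Ns + Nw/2) / fs;        % frame centre times (s)
f = (0:nfft/2) * fs / nfft;                % frequency axis (Hz)

imagesc(t, f, 20*log10(S + eps));          % log-magnitude spectrogram
axis xy;                                   % low frequencies at the bottom
xlabel('Time (s)'); ylabel('Frequency (Hz)');
colorbar;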

II. Partial Source Code

clc;clear;
load traindata Myfeature
A1 = zeros(1,30);
A2 = ones(1,30);
Group = [A1,A2];
TrainData = Myfeature;
SVMStruct = svmtrain(TrainData,Group);
N = 5.3;
Tw = 25;                 % analysis frame duration (ms)
Ts = 10;                 % analysis frame shift (ms)
alpha = 0.97;            % preemphasis coefficient
R = [ 300 3700 ];        % frequency range to consider
M = 20;                  % number of filterbank channels
C = 13;                  % number of cepstral coefficients
L = 22;                  % cepstral sine lifter parameter
fs = 16000;
hamming = @(N)(0.54-0.46*cos(2*pi*[0:N-1].'/(N-1)));
[filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, '选择语音');
% no file selected
if filename == 0
    return;
end
[speech,fs] = audioread([pathname, filename]);
[voice,fs] = extractvoice_simple(speech,-30, -20,0.2);
voicex = voice(1:N*16000);
[ mfccs, FBEs, frames ] = ...
    mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );
ceps_mfccx = mfccs(:);
[cep,ER] = lpces(voicex,17,256,256);
ceps_lpc = cep(2:17,:);              % LPC
% [lpc,ER]=lpces(voice,12,256,256);
% ceps_lpcc=lpc2lpcc(cep);           % LPCC
ceps_lpcx = ceps_lpc(:);
ceps = [ceps_mfccx(1000:2000); ceps_lpcx(1:2000)];
TestData = ceps';
languagex = svmclassify(SVMStruct,TestData);
if languagex == 1
    language = 'Chinese'
else
    language = 'English'
end
% t=[1:2000];
% figure
% scatter(t,ceps_lpcx(1:2000),50,'r');
% xlabel('sample point');
% ylabel('LPC');
% title('LPC features');
% hold on
% [filename, pathname] = uigetfile({'*.*';'*.flac'; '*.wav'; '*.mp3'; }, '选择语音');
% % no file selected
% if filename == 0
%     return;
% end
% [speech,fs] = audioread([pathname, filename]);
% [voice,fs]=extractvoice_simple(speech,-30, -20,0.2);
% voicex=voice(1:N*16000);
% [ mfccs, FBEs, frames ] = ...
%     mfcc( voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L );
% ceps_mfccx=mfccs(:);
% [cep,ER]=lpces(voicex,17,256,256); ceps_lpc=cep(2:17,:);  % LPC

function [ H, f, c ] = trifbank( M, K, R, fs, h2w, w2h )
% TRIFBANK Triangular filterbank.
%
%   [H,F,C]=TRIFBANK(M,K,R,FS,H2W,W2H) returns matrix of M triangular filters
%   (one per row), each K coefficients long along with a K coefficient long
%   frequency vector F and M+2 coefficient long cutoff frequency vector C.
%   The triangular filters are between limits given in R (Hz) and are
%   uniformly spaced on a warped scale defined by forward (H2W) and backward
%   (W2H) warping functions.
%
%   Inputs
%       M is the number of filters, i.e., number of rows of H
%
%       K is the length of frequency response of each filter,
%         i.e., number of columns of H
%
%       R is a two element vector that specifies frequency limits (Hz),
%         i.e., R = [ low_frequency high_frequency ];
%
%       FS is the sampling frequency (Hz)
%
%       H2W is a Hertz scale to warped scale function handle
%
%       W2H is a warped scale to Hertz scale function handle
%
%   Outputs
%       H is a M by K triangular filterbank matrix (one filter per row)
%
%       F is a frequency vector (Hz) of 1xK dimension
%
%       C is a vector of filter cutoff frequencies (Hz);
%         note that C(2:end) also represents filter center frequencies,
%         and the dimension of C is 1x(M+2)
%
%   Example
%       fs = 16000;          % sampling frequency (Hz)
%       nfft = 2^12;         % fft size (number of frequency bins)
%       K = nfft/2+1;        % length of each filter
%       M = 23;              % number of filters
%
%       hz2mel = @(hz)(1127*log(1+hz/700));     % Hertz to mel warping function
%       mel2hz = @(mel)(700*exp(mel/1127)-700); % mel to Hertz warping function
%
%       % Design mel filterbank of M filters each K coefficients long,
%       % filters are uniformly spaced on the mel scale between 0 and Fs/2 Hz
%       [ H1, freq ] = trifbank( M, K, [0 fs/2], fs, hz2mel, mel2hz );
%
%       % Design mel filterbank of M filters each K coefficients long,
%       % filters are uniformly spaced on the mel scale between 300 and 3750 Hz
%       [ H2, freq ] = trifbank( M, K, [300 3750], fs, hz2mel, mel2hz );
%
%       % Design mel filterbank of 18 filters each K coefficients long,
%       % filters are uniformly spaced on the Hertz scale between 4 and 6 kHz
%       [ H3, freq ] = trifbank( 18, K, [4 6]*1E3, fs, @(h)(h), @(h)(h) );
%
%       hfig = figure('Position', [25 100 800 600], 'PaperPositionMode', ...
%                     'auto', 'Visible', 'on', 'color', 'w'); hold on;
%       subplot( 3,1,1 );
%       plot( freq, H1 );
%       xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%       subplot( 3,1,2 );
%       plot( freq, H2 );
%       xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%       subplot( 3,1,3 );
%       plot( freq, H3 );
%       xlabel( 'Frequency (Hz)' ); ylabel( 'Weight' ); set( gca, 'box', 'off' );
%
%   Reference
%       [1] Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing:
%           A guide to theory, algorithm, and system development.
%           Prentice Hall, Upper Saddle River, NJ, USA (pp. 314-315).

%   Author: Kamil Wojcicki, UTD, June

    if( nargin~=6 ), help trifbank; return; end;   % very lite input validation

    f_min = 0;           % filter coefficients start at this frequency (Hz)
    f_low = R(1);        % lower cutoff frequency (Hz) for the filterbank
    f_high = R(2);       % upper cutoff frequency (Hz) for the filterbank
    f_max = 0.5*fs;      % filter coefficients end at this frequency (Hz)
    f = linspace( f_min, f_max, K );   % frequency range (Hz), size 1xK
    fw = h2w( f );

    % filter cutoff frequencies (Hz) for all filters, size 1x(M+2)
    c = w2h( h2w(f_low)+[0:M+1]*((h2w(f_high)-h2w(f_low))/(M+1)) );
    cw = h2w( c );

    H = zeros( M, K );   % zero otherwise

    for m = 1:M

        % implements Eq. (6.140) on page 314 of [1]
        % k = f>=c(m)&f<=c(m+1);   % up-slope
        % H(m,k) = 2*(f(k)-c(m)) / ((c(m+2)-c(m))*(c(m+1)-c(m)));
        % k = f>=c(m+1)&f<=c(m+2); % down-slope
        % H(m,k) = 2*(c(m+2)-f(k)) / ((c(m+2)-c(m))*(c(m+2)-c(m+1)));

        % implements Eq. (6.141) on page 315 of [1]
        k = f>=c(m)&f<=c(m+1);     % up-slope
        H(m,k) = (f(k)-c(m))/(c(m+1)-c(m));
        k = f>=c(m+1)&f<=c(m+2);   % down-slope
        H(m,k) = (c(m+2)-f(k))/(c(m+2)-c(m+1));

    end
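The script above loads a 60 × 3001 training matrix Myfeature from traindata.mat. A hedged sketch of how such a matrix might be assembled with the project's own helpers (extractvoice_simple, mfcc, lpces) is shown below; the file names, the label ordering (English clips first, to match Group = [zeros, ones] and the languagex == 1 → Chinese test), and the reuse of the parameters N, Tw, Ts, alpha, hamming, R, M, C, L from the script above are assumptions, not part of the published source.

% Hypothetical training-set builder; assumes the same parameters and helpers
% as the recognition script above, and placeholder file names per class.
nPerClass = 30;                              % 30 English + 30 Chinese clips
featLen   = 1001 + 2000;                     % MFCC slice + LPC slice (see above)
Myfeature = zeros(2*nPerClass, featLen);
for i = 1:2*nPerClass
    if i <= nPerClass
        fname = sprintf('english_%02d.wav', i);              % label 0 (English)
    else
        fname = sprintf('chinese_%02d.wav', i - nPerClass);  % label 1 (Chinese)
    end
    [speech, fs] = audioread(fname);
    [voice, fs]  = extractvoice_simple(speech, -30, -20, 0.2);
    voicex       = voice(1:N*16000);
    mfccs        = mfcc(voicex, fs, Tw, Ts, alpha, hamming, R, M, C, L);
    ceps_mfccx   = mfccs(:);
    cep          = lpces(voicex, 17, 256, 256);
    ceps_lpc     = cep(2:17, :);
    ceps_lpcx    = ceps_lpc(:);
    Myfeature(i, :) = [ceps_mfccx(1000:2000); ceps_lpcx(1:2000)]';
end
save traindata Myfeature                     % loaded by the recognition script above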

III. Results

IV. MATLAB Version and References

1 MATLAB Version

a

2 References

[1] Han Jiqing, Zhang Lei, Zheng Tieran. Speech Signal Processing (3rd Edition) [M]. Tsinghua University Press.

[2] Liu Ruobian. Deep Learning: Speech Recognition Technology in Practice [M]. Tsinghua University Press.
