Speech Recognition with Deep Recurrent Neural Networks (Alex Graves et al., 2013)

Speech Recognition with Deep Recurrent Neural Networks: Notes

  • I. Achievement
  • II. Architectures and Methods
    • 1. Architectures
    • 2. Methods
    • 3. Dataset
  • III. Sentence Expression
  • Summary

I. Achievement

At the beginning of the article, the authors review existing work that applied recurrent neural networks to different tasks, such as cursive handwriting recognition. They then pose a question: can RNNs also benefit from depth in space, that is, from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks? To answer it, they propose deep Long Short-Term Memory (LSTM) RNNs and assess their potential for speech recognition. They also present an enhancement to a recently introduced end-to-end learning method that jointly trains two separate RNNs as acoustic and linguistic models, and they further improve LSTM networks themselves.

II. Architectures and Methods

1. Architectures

One shortcoming of conventional RNNs is that they can only make use of previous context and do not exploit future context. To address this, the authors use an architecture called the bidirectional RNN (BRNN). A BRNN computes not only the forward hidden sequence, denoted $\vec{h}$, but also the backward hidden sequence, denoted $\overleftarrow{h}$: the backward layer is iterated from t = T to 1, the forward layer from t = 1 to T, and the output sequence y is computed from both at every timestep. (Fig. 2 in the paper illustrates the BRNN architecture.)
The authors then combine standard LSTM with the BRNN structure, calling the result bidirectional LSTM, which can access long-range context in both input directions.
Finally, they make the network deep: deep RNNs are created by stacking multiple recurrent layers, with the output sequence of one layer forming the input sequence for the next. This completes the paper's architecture; a minimal sketch follows.
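Below is a minimal PyTorch sketch of a deep bidirectional LSTM of this kind, not the paper's original implementation. The input and output sizes (123 features, 62 classes) follow the paper's TIMIT setup; the hidden size and layer count here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepBiLSTM(nn.Module):
    """Sketch of a deep bidirectional LSTM: stacked LSTM layers,
    each processing the input sequence in both directions, with a
    per-timestep output layer on the concatenated forward and
    backward hidden states."""
    def __init__(self, num_inputs=123, hidden_size=250,
                 num_layers=3, num_outputs=62):
        super().__init__()
        # bidirectional=True yields both a forward and a backward
        # hidden sequence; num_layers stacks layers in depth, the
        # output of one layer forming the input of the next.
        self.rnn = nn.LSTM(num_inputs, hidden_size,
                           num_layers=num_layers,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden_size, num_outputs)

    def forward(self, x):            # x: [batch, time, features]
        h, _ = self.rnn(x)           # h: [batch, time, 2*hidden]
        return self.out(h)           # per-timestep logits

x = torch.randn(4, 100, 123)         # 4 utterances, 100 frames each
logits = DeepBiLSTM()(x)             # -> [4, 100, 62]
```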

2. Methods

(1) Connectionist Temporal Classification
CTC uses a softmax layer to define a separate output distribution $\Pr(k \mid t)$ at every step t along the input sequence, covering the K target labels plus an extra blank symbol. By summing over all alignments, these per-step distributions yield a differentiable distribution over all possible output sequences y, so the network can be trained without a frame-level alignment. A minimal training sketch follows.
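Here is a minimal sketch of CTC training using PyTorch's built-in `nn.CTCLoss`; the shapes, the 62-class output (61 phonemes plus blank), and the random tensors are illustrative stand-ins for a real network and dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# CTC expects per-timestep log-probabilities Pr(k | t); index 0
# is reserved for the blank symbol here.
ctc_loss = nn.CTCLoss(blank=0)

# Stand-in for network output: [time, batch, K + 1 classes].
logits = torch.randn(100, 4, 62, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1)   # softmax layer -> Pr(k|t)

targets = torch.randint(1, 62, (4, 30))     # phoneme label sequences
input_lengths = torch.full((4,), 100, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)

# The loss marginalises over all alignments, so it is a
# differentiable function of the per-timestep distributions.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```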
(2) RNN Transducer
In speech recognition, an RNN transducer combines a CTC-like network with a separate RNN that predicts each phoneme given the previous ones, thereby yielding a jointly trained acoustic and language model. RNN transducers can be trained from random initial weights; however, they work better when initialised with the weights of a pretrained CTC network and a pretrained prediction network. A sketch of the joint output step follows.
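As a rough sketch (not the paper's code), the joint step can be written as a small feedforward network that combines the acoustic-network output at frame t with the prediction-network output at label position u to give $\Pr(k \mid t, u)$; all layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    """Sketch of the transducer's joint step: activations of the
    acoustic network (per frame t) and the prediction network
    (per label position u) are fed into a feedforward output
    network that defines Pr(k | t, u)."""
    def __init__(self, acoustic_dim=500, pred_dim=250,
                 hidden=250, num_labels=62):
        super().__init__()
        self.fc = nn.Linear(acoustic_dim + pred_dim, hidden)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, f, g):
        # f: [batch, T, acoustic_dim], g: [batch, U, pred_dim]
        T, U = f.size(1), g.size(1)
        f = f.unsqueeze(2).expand(-1, -1, U, -1)   # broadcast over u
        g = g.unsqueeze(1).expand(-1, T, -1, -1)   # broadcast over t
        h = torch.tanh(self.fc(torch.cat([f, g], dim=-1)))
        return self.out(h).log_softmax(-1)         # [batch, T, U, K]
```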
(3) Decoding
The authors compare prefix search with beam search. CTC networks were previously decoded with prefix search, a form of best-first decoding. Here the authors instead decode CTC networks with beam search, exploiting the same beam search as the transducer, with the modification that the output label probabilities $\Pr(k \mid t, u)$ do not depend on the previous outputs, i.e. $\Pr(k \mid t, u) = \Pr(k \mid t)$. A compact sketch of CTC beam search is given below.
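The following is a compact sketch of prefix beam search over CTC outputs (in the style of later CTC-decoding write-ups, not code from the paper). It takes a [T, K] matrix of per-frame probabilities $\Pr(k \mid t)$ and tracks, for each candidate prefix, the probability of ending in blank versus ending in a label.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, beam_width=10, blank=0):
    """Prefix beam search over a [T, K] matrix of per-frame
    probabilities Pr(k | t). Returns the most probable label
    sequence with blanks removed and repeats collapsed."""
    T, K = probs.shape
    # Map each prefix (tuple of labels) to (p_blank, p_non_blank):
    # the probability of emitting that prefix and ending in a
    # blank / in its final label.
    beams = {(): (1.0, 0.0)}
    for t in range(T):
        new_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for k in range(K):
                p = probs[t, k]
                if k == blank:
                    b, nb = new_beams[prefix]
                    new_beams[prefix] = (b + p * (p_b + p_nb), nb)
                elif prefix and prefix[-1] == k:
                    # Repeating the last label: extending the prefix
                    # needs a preceding blank; otherwise the frame
                    # merges into the existing final label.
                    ext = prefix + (k,)
                    b, nb = new_beams[ext]
                    new_beams[ext] = (b, nb + p * p_b)
                    b, nb = new_beams[prefix]
                    new_beams[prefix] = (b, nb + p * p_nb)
                else:
                    ext = prefix + (k,)
                    b, nb = new_beams[ext]
                    new_beams[ext] = (b, nb + p * (p_b + p_nb))
        # Keep only the beam_width most probable prefixes.
        beams = dict(sorted(new_beams.items(),
                            key=lambda kv: -(kv[1][0] + kv[1][1]))[:beam_width])
    return list(max(beams, key=lambda pre: sum(beams[pre])))
```

Tracking blank and non-blank endings separately is what lets the search correctly merge different alignments that collapse to the same label sequence.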
(4) Regularisation
Regularisation is vital for good performance with RNNs, as their flexibility makes them prone to overfitting. The paper uses two regularisers: early stopping and weight noise. Weight noise was added once per training sequence, rather than at every timestep. Weight noise tends to "simplify" neural networks, in the sense of reducing the amount of information required to transmit the parameters, which improves generalisation. A sketch of one weight-noise training step follows.
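A hedged PyTorch sketch of the weight-noise recipe: Gaussian noise is added to the weights once for a whole training sequence, gradients are computed at the noisy weights, and the update is applied to the clean weights. The function and its arguments are illustrative; sigma = 0.075 is the standard deviation the paper reports for its experiments.

```python
import torch

def weight_noise_step(model, loss_fn, inputs, targets,
                      optimizer, sigma=0.075):
    """One training step with Gaussian weight noise added once per
    training sequence (not per timestep)."""
    # Perturb the weights for this sequence, remembering the noise.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * sigma
            p.add_(noise)
            noises.append(noise)
    # Gradients are computed at the noisy weights.
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    # Restore the clean weights, then apply the update to them.
    with torch.no_grad():
        for p, noise in zip(model.parameters(), noises):
            p.sub_(noise)
    optimizer.step()
    return loss.item()
```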

3. Dataset

The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems: https://catalog.ldc.upenn.edu/docs/LDC93S1/TIMIT.html
To learn more about it, see https://blog.csdn.net/qq_39373179/article/details/103788208


III. Sentence Expression

dramatic improvements: a marked improvement; used to introduce a method's results in a given field
particularly fruitful: highly effective; used to describe how well a method handles a given problem
Given that ...: placed at the start of a sentence; a more formal alternative to "because of"
purpose-built: designed specifically for; used to introduce the use of a particular method in a particular field
a crucial element of ...: a key factor in ...; placed at the start of a sentence, which is then inverted
Note: inversion is a sentence pattern commonly used in English papers and can express the intended meaning more clearly.


Summary

This paper was published in 2013 by Hinton and his students. Its main contributions are improving the structure of the RNN (LSTM), extending LSTM and BRNN into deep neural networks, and successfully applying the result to speech (phoneme) classification, lowering the error rate.