EXPLOITING SPARSENESS IN DEEP NEURAL NETWORKS FOR LARGE VOCABULARY
SPEECH RECOGNITION
Dong Yu 1, Frank Seide 2, Gang Li 2, Li Deng 1
1 Microsoft Research, Redmond, USA
2 Microsoft Research Asia, Beijing, P.R.C.
{dongyu,fseide,ganl,deng}@microsoft.com
ABSTRACT
Recently, we developed context-dependent deep neural network (DNN) hidden Markov models for large vocabulary speech recognition. While reducing errors by 33% compared to its discriminatively trained Gaussian-mixture counterpart on the switchboard benchmark task, the DNN requires many more parameters. In this paper, we report our recent work on improving the generalization ability, model size, and computation speed of DNNs by exploiting parameter sparseness. We formulate the goal of enforcing sparseness as soft regularization and convex constraint optimization problems, and propose solutions under the stochastic gradient ascent setting. We also propose novel data structures to exploit the random sparseness patterns to reduce model size and computation time. The proposed solutions have been evaluated on the voice-search and switchboard datasets. They decreased the number of nonzero connections to one third while reducing the error rate by 0.2-0.3% over the fully connected model on both datasets. The nonzero connections were further reduced to only 12% and 19% on the two respective datasets without sacrificing speech recognition performance. Under these conditions we can reduce the model size to 18% and 29%, and the computation to 14% and 23%, respectively, on these two datasets.
Index Terms — speech recognition, deep belief networks, deep neural networks, sparseness
1. INTRODUCTION
Recently, we have witnessed the resurrection of artificial neural network (NN) hidden Markov model (HMM) hybrid systems for speech recognition. This is mainly attributed to the discovery of the strong modeling ability of deep neural networks (DNNs1) and the availability of high-speed general purpose graphical processing units (GPGPUs) for training DNNs. A notable advance is the context-dependent DNN-HMM (CD-DNN-HMM), in which DNNs directly model senones (i.e., tied CD phone states) and approximate their emission probabilities in HMM speech recognizers [1].
CD-DNN-HMMs have been shown to be highly promising. They have achieved 16% [1] and 33% [2, 3] relative recognition error reduction over strong, discriminatively trained CD-GMM (Gaussian mixture model)-HMMs, respectively, on a voice search (VS) task [4] and the switchboard (SWB) phone-call transcription task [5].
Unfortunately, CD-DNN-HMMs use many more parameters than the corresponding CD-GMM-HMMs due to the large number of layers used in DNNs and the direct modeling of as many as 20,000 or more senones to achieve high recognition accuracy. In addition, the lower layers in DNNs are shared across all the states and need to be computed even if only one state is active in the search path. It is thus of practical importance to reduce the DNN model size so that fast computation becomes possible and better generalization can be achieved. In this paper we attack this problem by exploiting parameter sparseness and formulating the task of enforcing sparseness as soft regularization and convex constraint optimization problems. We compare the performance under the stochastic gradient ascent (SGA) setting, which, to the best of our knowledge, is the only scalable training procedure for DNNs at present. We further propose data structures that exploit the seemingly random sparseness patterns to save storage and to speed up decoding.

1 We define DNNs as multi-layer perceptrons (MLPs) with many more hidden layers than used before. DNNs are more difficult to train and may benefit from procedures such as deep belief network (DBN) pretraining.
2. CD-DNN-HMM
The CD-DNN-HMM combines the discriminative modeling power of DNNs with the sequential modeling power of HMMs. The basic idea behind CD-DNN-HMMs has been known since the 1990s [6]. CD-DNN-HMMs differ from previous work in that they directly model tied CD phone states and do so using DNNs. Empirical evaluation [1, 2] indicated that both of these aspects are critical for CD-DNN-HMMs to outperform the speaker-independent state-of-the-art CD-GMM-HMMs on large vocabulary speech recognition tasks.
A DNN models the posterior probability $P_{s|o}(s|o)$ of a class $s$ given an observation vector $o$ as a stack of $(L+1)$ layers of log-linear models. The first $L$ layers, $\ell = 0 \ldots L-1$, model hidden binary units $h^\ell$ given input vectors $v^\ell$ as Bernoulli distributions

$$P^\ell_{h|v}(h^\ell|v^\ell) = \prod_{j=1}^{N^\ell} \frac{e^{z^\ell_j(v^\ell)\cdot h^\ell_j}}{e^{z^\ell_j(v^\ell)\cdot 1} + e^{z^\ell_j(v^\ell)\cdot 0}}, \quad 0 \le \ell < L \qquad (1)$$
and the top layer $L$ models the class posterior as a multinomial distribution

$$P^L_{s|v}(s|v^L) = \frac{e^{z^L_s(v^L)}}{\sum_{s'} e^{z^L_{s'}(v^L)}} = \operatorname{softmax}_s\!\left(z^L(v^L)\right) \qquad (2)$$

where $z^\ell(v^\ell) = (W^\ell)^T v^\ell + a^\ell$ is the activation at layer $\ell$, $W^\ell$ and $a^\ell$ are the weight matrix and bias vector at layer $\ell$, and $h^\ell_j$ and $z^\ell_j(v^\ell)$ are the $j$-th components of $h^\ell$ and $z^\ell(v^\ell)$, respectively.
The precise modeling of $P_{s|o}(s|o)$ is infeasible as it requires integration over all possible values of $h^\ell$ across all layers. An effective practical trick is to replace the marginalization with the "mean-field approximation" [7]. Given observation $o$, we set $v^0 = o$ and choose the conditional expectation $E^\ell_{h|v}\{h^\ell|v^\ell\} = \sigma\!\left(z^\ell(v^\ell)\right)$ as input $v^{\ell+1}$ to the next layer, where $\sigma_j(z) = 1/(1+e^{-z_j})$ is the sigmoid function.
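To make this mean-field forward pass concrete, the short NumPy sketch below (our own illustration, not the authors' GPU implementation; the layer sizes and helper names are hypothetical) computes $v^{\ell+1} = \sigma(z^\ell(v^\ell))$ for the hidden layers and a softmax over classes at the top.

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid sigma_j(z) = 1 / (1 + exp(-z_j)).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Numerically stable softmax over the class scores.
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_forward(obs, weights, biases):
    """Mean-field forward pass: v^0 = o, v^{l+1} = sigmoid(z^l(v^l)) for the
    hidden layers, and softmax(z^L(v^L)) at the top layer."""
    acts = [obs]
    for l, (W, a) in enumerate(zip(weights, biases)):
        z = W.T @ acts[-1] + a              # z^l(v^l) = (W^l)^T v^l + a^l
        top = (l == len(weights) - 1)
        acts.append(softmax(z) if top else sigmoid(z))
    return acts                             # acts[-1] approximates P(s | o)

# Hypothetical sizes: 39-dim input, two 32-unit hidden layers, 10 classes.
rng = np.random.default_rng(0)
dims = [39, 32, 32, 10]
weights = [0.1 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
posterior = dnn_forward(rng.standard_normal(39), weights, biases)[-1]
```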
Given $T$ training samples $o(t)$ and the associated ground-truth labels $s(t)$, DNNs are often trained to maximize the total log conditional probability

$$D = \sum_{t=1}^{T} \log P_{s|o}\bigl(s(t)\,|\,o(t)\bigr). \qquad (3)$$
The only feasible procedure to train DNNs on a large amount of training samples so far is the error back-propagation (BP) procedure with stochastic gradient ascent (SGA). In its basic form,

$$\frac{\partial D}{\partial W^\ell} = \sum_t v^\ell_t \,\bigl(\omega^\ell_t\, e^\ell_t\bigr)^T; \qquad \frac{\partial D}{\partial a^\ell} = \sum_t \omega^\ell_t\, e^\ell_t \qquad (4)$$

where the error signals are $e^L_t = (\log\operatorname{softmax})'\bigl(z^L(v^L_t)\bigr)$ and $e^{\ell-1}_t = W^\ell \cdot \omega^\ell_t \cdot e^\ell_t$, with $\omega^\ell_t = \operatorname{diag}\bigl(\sigma'(z^\ell(v^\ell_t))\bigr)$ if $0 \le \ell < L$ and $1$ otherwise; the component-wise derivatives are $\sigma'_j(z) = \sigma_j(z)\cdot(1-\sigma_j(z))$ and $(\log\operatorname{softmax})'_j(z) = \delta_{s(t),j} - \operatorname{softmax}_j(z)$, where $\delta$ is the Kronecker delta.
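The gradient of Eq. (4) can be sketched as a per-sample gradient-ascent step (again our own illustration; actual training uses minibatches and GPUs). It reuses the hypothetical dnn_forward helper from the sketch above.

```python
def dnn_backprop_step(obs, label, weights, biases, lr=0.1):
    """One per-sample SGA step on D = log P(s|o), following Eq. (4).
    Assumes the dnn_forward / sigmoid / softmax helpers sketched above."""
    acts = dnn_forward(obs, weights, biases)   # [v^0, ..., v^L, softmax output]
    L = len(weights) - 1                       # index of the top (softmax) layer
    err = -acts[-1]                            # e^L_j = delta_{s,j} - softmax_j(z^L)
    err[label] += 1.0
    for l in range(L, -1, -1):
        if l < L:
            # omega^l = diag(sigma'(z^l)) with sigma(z^l) = v^{l+1}.
            err = err * acts[l + 1] * (1.0 - acts[l + 1])
        grad_W = np.outer(acts[l], err)        # dD/dW^l = v^l (omega^l e^l)^T
        grad_a = err                           # dD/da^l = omega^l e^l
        if l > 0:
            err = weights[l] @ err             # e^{l-1} = W^l omega^l e^l
        weights[l] += lr * grad_W              # gradient *ascent* on D
        biases[l] += lr * grad_a
```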
It is more difficult for DNNs to converge to good solutions, and they are computationally more demanding to train, than shallow MLPs. It is only recently that training DNNs has become feasible, with easy access to high-speed GPGPUs and the discovery of effective weight initialization techniques, in particular "deep belief network" pretraining [8]. The detailed steps for training a CD-DNN-HMM system can be found in [1].
3. EXPLOITING SPARSENESS

It is our experience that the recognition accuracy of CD-DNN-HMMs typically increases with the number of hidden units and layers, as long as the training process is controlled by a held-out set. For example, on the SWB task, we continue to see small but consistent accuracy improvements even with 9 hidden layers [2]. The resulting optimal models, however, are large. As shown in Section 4, the CD-DNN-HMMs trained using 24 hours of VS data and 300 hours of SWB data have 19 and 45 million parameters, respectively, which correspond to 12 and 2 times the size of the corresponding traditional CD-GMM-HMMs.

Fortunately, inspection of fully connected DNNs after training has shown that a large portion of all connections have very small weights. Fig. 1 illustrates the distribution of weight magnitudes of a typical 7-hidden-layer DNN. In this specific example, the magnitude of 70% of the weights is below 0.1. This inspired us to reduce the model size by removing connections with small weight magnitudes so that we can work with deeper and wider DNNs more effectively. Note that we did not observe similar patterns in the bias parameters. This is expected since nonzero bias terms indicate the shift of hyperplanes from the origin. Since the number of bias parameters is very small compared to that of weight parameters, keeping the bias parameters intact does not affect the final model size in a noticeable way.

Fig. 1. The weight magnitude distribution of a 7-hidden-layer DNN, illustrated as the percentage of weights whose magnitude is below a threshold.

The task of enforcing sparseness can be formulated as a multi-objective optimization problem, since we want to maximize the log conditional probability D and minimize the number of nonzero weights at the same time. This two-objective optimization problem can be converted into a single-objective optimization problem with soft regularization and convex constraint formulations. Note that researchers have investigated MLP weight sparseness in the past. The most well known work [9, 10] pruned the weights after training converges based on second-order derivatives. Unfortunately, these algorithms are difficult to scale up to the large training sets we typically use in speech recognition, and their advantages vanish if additional training iterations are carried out upon the pruned weights.

3.1. Soft Regularization Formulation

In the soft regularization formulation, we convert the problem to maximizing the criterion

$$D_0 = \sum_{t=1}^{T} \log P_{s|o}\bigl(s(t)\,|\,o(t)\bigr) - \alpha \cdot \|W\|_0, \qquad (5)$$
where the $L_0$ matrix norm $\|W\|_0$ is the number of nonzero weights and $\alpha$ is a balancing parameter. We further approximate $D_0$ as

$$D_1 = \sum_{t=1}^{T} \log P_{s|o}\bigl(s(t)\,|\,o(t)\bigr) - \alpha \cdot \|W\|_1, \qquad (6)$$

by replacing the $L_0$-norm with the $L_1$-norm.
Maximizing $D_1$ can be done by following the sub-gradient

$$\frac{\partial D_1}{\partial W} = \frac{\partial D}{\partial W} - \alpha \cdot \operatorname{sgn}(W) \qquad (7)$$

where $\operatorname{sgn}(\cdot)$ is the sign function applied element-wise.
Unfortunately, the SGA method with sub-gradient (7) usually does not generate exactly sparse solutions. To enforce a sparse solution, one often truncates the solution after every K steps by forcing parameters with magnitude smaller than a threshold θ to zero [11]. This truncation step, however, is somewhat arbitrary and does not derive directly from optimizing $D_1$. In addition, K is difficult to select correctly. In general, it is not desirable to use a small K (e.g., 1), especially when the minibatch size is small, since in that case each SGA update only slightly modifies the weights. A parameter that is close to zero then remains close to zero after several SGA updates and will be rounded back to zero if K is not sufficiently large. Consequently, truncation can be done only after a reasonably large number of K steps, in the hope that nonzero coefficients have sufficient time to grow above θ. On the other hand, a large K means that every time the parameters are truncated, D is reduced and requires a similar number of steps to recover the loss.
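For concreteness, here is a minimal sketch (our own illustration, not the authors' training code) of this sub-gradient update with truncation every K steps. The gradient grad_D stands for the BP gradient of D from Eq. (4), and lr, alpha, K, and theta are hypothetical hyper-parameters.

```python
import numpy as np

def l1_subgradient_step(W, grad_D, lr=0.01, alpha=1e-4):
    # One SGA step on D_1 = D - alpha * ||W||_1, following Eq. (7).
    return W + lr * (grad_D - alpha * np.sign(W))

def truncate(W, theta):
    # Force near-zero weights exactly to zero; applied only every K steps.
    W = W.copy()
    W[np.abs(W) < theta] = 0.0
    return W

# Hypothetical use inside a training loop:
# for t, grad_D in enumerate(minibatch_gradients):
#     W = l1_subgradient_step(W, grad_D)
#     if (t + 1) % K == 0:        # K and theta are both hard to choose well
#         W = truncate(W, theta)
```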
3.2. Convex Constraint Formulation

In the convex constraint formulation, we maximize the log conditional probability D subject to the constraint

$$\|W\|_0 \le q \qquad (8)$$
where q is the maximal number of nonzero weights allowed.
This constrained optimization problem is hard to solve. However, an approximate solution can be found following two observations. First, after sweeping through the full training set several times, the weights become relatively stable: they tend to remain either large or small in magnitude. Second, in a stabilized model, the importance of a connection is approximated well by the magnitude of its weight (times the magnitude of the corresponding input value, but these are relatively uniform within each layer, since the input features are normalized to zero mean and unit variance and the hidden-layer values are probabilities). This leads to the very simple yet efficient and effective Algorithm 1.

Algorithm 1 Main Steps to Train a Sparse DNN
1: Train a fully connected DNN by sweeping through the full training set several times using labels from forced alignment.
2: Keep only the connections whose weight magnitudes are in the top q.
3: Continue training the DNN with the sparseness pattern generated in Step 2 unchanged.

After Step 2, sparseness constraint (8) is enforced. However, the log conditional probability D is reduced due to connection pruning, especially when the degree of sparseness is high (i.e., q is small). It is thus important to apply Step 3 and continue training the DNN, which tends to converge much faster than the original training.
We have developed an empirically effective approach to keeping the same sparse connections (and thus the same sparseness constraint) in Step 3. We can either mask the pruned connections or round weights with magnitude below min{0.02, θ/2} to zero, where θ is the minimal weight magnitude that survived the pruning and 0.02 is determined by the sparseness pattern shown in Fig. 1. The masking approach is cleaner but requires storage of a huge masking matrix. In our implementation we used the rounding alternative. Note that it is important to round only weights smaller than min{0.02, θ/2}, instead of θ, to zero. This is because weights may shrink and be suddenly removed, and it is desirable to keep the effect of such removals to a minimum without sacrificing the degree of sparseness.
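A minimal NumPy sketch of Steps 2 and 3 under these choices (our own illustration; the function names are hypothetical): prune_to_top_q keeps the q largest-magnitude weights and records θ, and round_small_weights is the rounding variant applied after each SGA update in Step 3.

```python
import numpy as np

def prune_to_top_q(W, q):
    """Step 2: keep only the q largest-magnitude weights of W; return the
    pruned matrix and theta, the smallest magnitude that survived pruning."""
    flat = np.abs(W).ravel()
    theta = np.partition(flat, flat.size - q)[flat.size - q]
    return np.where(np.abs(W) >= theta, W, 0.0), theta

def round_small_weights(W, theta):
    """Step 3 (rounding variant): after each SGA update, zero out weights whose
    magnitude fell below min(0.02, theta / 2), keeping the pattern near-fixed."""
    W[np.abs(W) < min(0.02, theta / 2.0)] = 0.0
    return W
```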
Compared to the sub-gradient solution associated with the soft regularization formulation, Algorithm 1 carries several benefits. First, the threshold q can be easily determined by examining the weights after Step 1; determining the truncation parameter θ in the sub-gradient solution, however, is very difficult. Second, we can mask or truncate the weights after each SGA update in Step 3 without performance degradation, whereas in the sub-gradient solution it is very difficult and sometimes impossible to find a good update interval K. Third, the sub-gradient solution requires tuning the balancing parameter α, while there is no such parameter in Algorithm 1.

3.3. Data Structure
The learned sparse weights generally have random patterns. Here we propose data structures that effectively exploit the sparse weights to reduce the model size and to speed up the decoding calculation ($W^T v$). One example is depicted in Figure 2. The basic idea is simple: only store and calculate with the nonzero weights (nzws). To speed up the calculation, we store the indexes and actual weights in adjacent groups so that they can be retrieved efficiently with good locality. A slightly different but almost equally efficient data structure is to group pairs of indexes and weights. With the proposed data structure, each column can be multiplied with the input vector in parallel. To further speed up the calculation, parallelization can also happen within each column.
Note that the data structure shown in Figure 2 is just one implementation. The best data structure depends heavily on the chosen hardware architecture and the trade-off between storage size and computation speed. For example, the index blocks $i_k$ can be further compressed by storing delta indexes (which require only one byte per index). Furthermore, if streaming SIMD extension (SSE) instructions are used, we can group frames into batches of four and store the nonzero weights row-first to achieve a similar computation speedup.
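One way to picture the layout of Figure 2 is the CSC-like sketch below (our own approximation of the structure, not the paper's exact memory layout): for each column we keep the row indexes of its nonzero weights and the weights themselves in adjacent arrays, and $W^T v$ touches only those entries. Because the columns are independent, the per-column dot products are the natural unit of parallelism mentioned above.

```python
import numpy as np

class SparseColumns:
    """Column-wise sparse storage: for each column of W, the row indexes of its
    nonzero weights and the weights themselves are kept in adjacent blocks."""

    def __init__(self, W):
        self.n_rows, self.n_cols = W.shape
        self.indexes, self.values = [], []
        for j in range(self.n_cols):
            nz = np.flatnonzero(W[:, j])
            self.indexes.append(nz.astype(np.uint32))
            self.values.append(W[nz, j].astype(np.float32))

    def transpose_times(self, v):
        # Compute W^T v touching only the stored nonzero weights; the loop over
        # columns (and the dot product within a column) can be parallelized.
        out = np.empty(self.n_cols, dtype=np.float32)
        for j in range(self.n_cols):
            out[j] = self.values[j] @ v[self.indexes[j]]
        return out
```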
The storage saving from using the data structure shown in Figure 2 is obvious. For an N × M single-precision matrix with x% nonzero weights, the normal matrix requires 4 × N × M bytes. With the proposed data structure, it requires 2 + 6 × M × (1 + x% × N) bytes, which takes less space whenever x% is below roughly two thirds.
The speedup of the calculation depends heavily on the implementation and hardware used. For a naive matrix-vector multiplication (i.e., when SSE is not used), the dense matrix requires N × M multiplications and summations and 2 × N × M memory accesses. With the proposed data structure, it requires only x% × N × M multiplications and summations, and 3 × x% × N × M memory accesses.
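As a quick sanity check of these estimates, the short script below plugs in a hypothetical 2048 × 2048 layer with 20% nonzero weights. The byte-level breakdown in sparse_bytes (a per-column overhead plus a 2-byte index and 4-byte weight per nonzero entry) is our own reading of the formula above, not a specification of the actual implementation.

```python
def dense_bytes(n, m):
    # A normal single-precision N x M matrix: 4 bytes per weight.
    return 4 * n * m

def sparse_bytes(n, m, frac_nz):
    # Assumed layout matching the estimate above: a tiny global header plus,
    # per column, a small count/offset and 6 bytes (2-byte index + 4-byte
    # weight) per nonzero entry: 2 + 6 * M * (1 + x% * N) bytes.
    return 2 + 6 * m * (1 + frac_nz * n)

n = m = 2048                                 # hypothetical 2048 x 2048 layer
print(dense_bytes(n, m) / 2**20)             # ~16.0 MiB dense
print(sparse_bytes(n, m, 0.20) / 2**20)      # ~4.8 MiB at 20% nonzero (~30%)
print(0.20 * n * m, n * m)                   # ~839k vs. ~4.2M multiplications
```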
Fig. 2. A data structure that exploits the random sparseness pattern in weight matrices to save storage and speed up calculation, where nzws = nonzero weights.
4. EXPERIMENTAL RESULTS
Evaluation was done on the VS and SWB datasets.

4.1. Dataset Description
The VS dataset contains US-wide business and location search queries with a 24-hour (or 32,057-utterance) training set, a 6.5-hour (or 8,777-utterance) development set, and a 9.5-hour (or 12,758-utterance) test set. For the sake of easy comparison, we have used the same lexicon, tri-gram language model (LM), and 39-dim MFCC features as in previous studies [1]. The LM contains 65K word unigrams with a perplexity of 117 and an OOV rate of 6% on the test set. The number of senones and the number of mixture components per senone are 761 and 24, respectively, and were optimized for CD-GMM-HMM systems trained under the maximum likelihood (ML) criterion.
The SWB dataset used in this study contains the standard 309-hour Switchboard-I training set [5] and the NIST 2000 Hub5 and RT03S (FSH portion) evaluation sets. The system uses 13-dim PLP features with windowed mean-variance normalization and up to third-order derivatives, reduced to 39 dimensions by HLDA. The 3-state speaker-independent crossword triphones share 9304 40-mixture senones optimized for the ML-trained GMM-HMM system.

4.2. Results and Discussions
To obtain the experimental results shown in Tables 1 and 2, we followed the steps in [1] to build the fully-connected CD-DNN-HMMs and the steps in Alg. 1 to build the sparse models. In all the setups, DNNs were first pre-trained generatively using the DBN pretraining algorithm [8] and then fine-tuned discriminatively using the BP algorithm. Additional details can be found in [1, 2]. CD-DNN-HMMs outperform discriminatively trained CD-GMM-HMMs by 16% and 33% relative error reduction on the VS and SWB datasets, respectively. Note that in both cases the size of the CD-DNN-HMM is much larger than that of the corresponding CD-GMM-HMM.

We do not present the results obtained by the sub-gradient algorithm associated with the soft regularization formulation in Tables 1 and 2 due to its limitations discussed in Section 3. Because of the many parameters that must be tuned in the sub-gradient algorithm and the fact that training DNNs is expensive, we applied this algorithm only to several configurations on the VS dataset. In all the configurations we tried, the sub-gradient algorithm consistently underperforms Alg. 1 by 1-2% absolute. Furthermore, if we use θ instead of min{0.02, θ/2} as the truncation threshold in Step 3 of Alg. 1, we lose 0.1-0.3% absolute.

Overall, by exploiting the sparseness property of the model, we obtained a 0.2-0.3% absolute error reduction and simultaneously reduced the connections to only one third on both the VS and SWB datasets. Alternatively, we can reduce the number of weights to 12% and 19% on the VS and SWB datasets, respectively, without sacrificing recognition accuracy. In that case, the CD-DNN-HMM is only 1.5 and 0.3 times as large as the CD-GMM-HMM on the VS and SWB datasets, respectively, and takes only 18% and 29% of the model size of the fully-connected models. This translates to reducing the DNN computation to only 14% and 23% of that needed by the fully-connected models on the VS and SWB datasets, respectively, using either the basic CPU implementation or the SSE implementation with frame batching.

Table 1. Model size, computation time, and percent query error rate (QER) with and without sparseness constraints on the VS dataset. CD-DNN-HMMs contain 5 hidden layers each with 2048 nodes, trained using CD-DNN-HMM alignment. 'nz' means 'nonzero.' The OOV rate for both the dev and test sets is about 6%.

model            params   Size   Time   QER (dev)   QER (test)
sparse: 46% nz    8.8M    69%    55%      27.7        30.1
sparse: 31% nz    6.0M    47%    37%      27.7        30.1
sparse: 21% nz    4.0M    32%    25%      27.8        30.2
sparse: 12% nz    2.3M    18%    14%      27.9        30.4
sparse:  5% nz    1.0M     8%     6%      29.7        31.7

Table 2. Model size, computation time, and percent word error rate (WER) with and without sparseness constraints on the SWB dataset. CD-DNN-HMMs contain 7 hidden layers each with 2048 nodes, trained using CD-DNN-HMM alignment. 'nz' means 'nonzero.'

model            params   Size   Time   WER (SWB)   WER (FSH)
sparse: 52% nz   23.6M    78%    62%      16.1        18.5
sparse: 34% nz   15.2M    51%    41%      16.1        18.4
sparse: 24% nz   11.0M    36%    29%      16.2        18.5
sparse: 19% nz    8.6M    29%    23%      16.4        18.7
sparse: 15% nz    6.6M    22%    18%      16.5        18.7
5. CONCLUSION
The main focus of the research reported in this paper is to reduce the model size, speed up the computation, and improve the generalization ability of CD-DNN-HMMs. We have shown that while DNNs can become very large as more layers are added, the majority of the weights are close to zero, and this is exploited to reduce the model size in this work. We formulated the task of reducing the number of connections in the DNN as soft regularization and convex constraint optimization problems and proposed solutions under the SGA setting. The method derived from the convex constraint formulation, which we reported in detail in this paper, performs better and is easier to implement than the sub-gradient algorithm associated with the soft regularization formulation. Experiments on the VS and SWB datasets demonstrated that a significant improvement in model size and computation time can be obtained with the same or sometimes slightly higher recognition accuracy.
Acknowledgment
Our thanks go to A. Gunawardana, K. Thambiratnam, J. Droppo, N. Ström, A. Stolcke, A. Jayamohan, and I. Kouzminykh.
6. REFERENCES
[1] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition," IEEE Trans. Speech and Audio Proc., Special Issue on Deep Learning for Speech and Lang. Processing, 2012.
[2] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, 2011, pp. 437-440.
[3] F. Seide, G. Li, X. Chen, and D. Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription," in Proc. ASRU, 2011.
[4] D. Yu, Y. C. Ju, Y. Y. Wang, G. Zweig, and A. Acero, "Automated directory assistance system - from theory to practice," in Proc. Interspeech, 2007, pp. 2709-2711.
[5] J. Godfrey and E. Holliman, "Switchboard-1 Release 2," Linguistic Data Consortium, 1997.
[6] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco, "Connectionist probability estimators in HMM speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 1, pp. 161-174, 1994.
[7] C. Peterson and J. Anderson, "A mean field theory learning algorithm for neural networks," Complex Systems, pp. 995-1019, 1987.
[8] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[9] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, Morgan Kaufmann, 1990, pp. 598-605.
[10] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993, pp. 164-171.
[11] J. Langford, L. Li, and T. Zhang, "Sparse online learning via truncated gradient," J. Mach. Learn. Res., vol. 10, pp. 777-801, 2009.