Dan Povey's homepage

Current Interests

This page describes my current scientific interests-- mostly, this is a list of things I would be working on if I had more time. If anyone is interested in working with me on any of these ideas, please let me know. Caution: OUT OF DATE!

Kaldi (speech recognition toolkit)

"I don't always use speech recognition software, but when I do, I prefer Kaldi."
See http://kaldi-asr.org for more details.
I am currently working on getting Tomas Mikolov's RNNLMs working with Kaldi.
Other TODO items include getting the fMPE/fMMI code working and adding speech/silence segmentation. These will probably get done within the next few months. In the medium term I'd also like to add "genone style" models and subspace fMLLR (as in my papers with Kaisheng Yao), and to make some improvements to SGMMs. Others are working on neural-net based systems.

Acoustic Modeling

SGMMs (Subspace Gaussian Mixture Models)

See My Publications for more details. An immediate (not-too-ambitious) goal with SGMMs is to implement a new version in Kaldi with a few new features. I want to add the previously published speaker-dependent weights ("symmetric SGMM"), and also something new: a kind of two-level hierarchy of Gaussians in which the covariance matrices and the projection matrices M and N are shared at the top level, and the bottom level has only mean offsets and the weight-projection vectors w. This should enable better models with less training data.
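To make the proposed two-level structure concrete, here is a toy numpy sketch. All names and dimensions here are my own illustrative choices (they are not from any actual SGMM implementation): the top level shares the mean-projection matrices across its leaves, and each leaf contributes only a mean offset and a weight-projection vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: feature dim, subspace dim, top-level Gaussians,
# and leaves per top-level Gaussian.
D, S, I, L = 10, 6, 4, 3

# Top level: quantities shared by all leaves of each top-level Gaussian.
M = rng.standard_normal((I, D, S))            # mean-projection matrices
# Bottom level: per-leaf mean offsets and weight-projection vectors.
offset = 0.1 * rng.standard_normal((I, L, D))
w = rng.standard_normal((I, L, S))

def leaf_means(v):
    """Means of all I*L leaf Gaussians for a state vector v:
    the shared projection M_i @ v, plus each leaf's own offset."""
    return np.einsum('ids,s->id', M, v)[:, None, :] + offset

def leaf_log_weights(v):
    """Log mixture weights from the per-leaf weight-projection vectors,
    normalized over all leaves (softmax, as in the SGMM weight model)."""
    a = np.einsum('ils,s->il', w, v)
    return a - np.logaddexp.reduce(a.ravel())
```

The parameter saving comes from the fact that only the (I, L, D) offsets and (I, L, S) weight vectors grow with the number of leaves; the expensive shared quantities stay at the top level.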
In the medium term I'd like to do a "predictive" version of SGMMs. Explained very coarsely, this involves predicting the next frame from the previous few frames. Note: this is not just an auto-regressive HMM (I know, those don't work). I have previously done some preliminary experiments in this direction which showed a small improvement, and I think this could be turned into a large one. The basic idea, when using features of the spliced-cepstra + LDA type, is to predict the later spliced cepstra from the earlier ones; the predictive model is synthesized from the GMM, using p(last frames | first frames) = p(last frames, first frames) / p(first frames), where the numerator is the regular GMM likelihood and the denominator is the same GMM likelihood after projecting to a reduced dimension. Believe it or not, this type of model does work, but I'm not sure whether it will really give "worthwhile" improvements.
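A minimal numpy sketch of that likelihood ratio, assuming (purely for illustration) full-covariance Gaussians and that "projecting to a reduced dimension" means keeping the first d dimensions, so the marginal GMM is obtained by dropping the remaining rows and columns of each mean and covariance:

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian (full covariance)."""
    d = len(x)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def gmm_loglik(x, weights, means, covs):
    """Log-likelihood of x under a GMM, via log-sum-exp over components."""
    lls = [np.log(wt) + gauss_logpdf(x, m, c)
           for wt, m, c in zip(weights, means, covs)]
    return np.logaddexp.reduce(lls)

def predictive_loglik(x, weights, means, covs, d_first):
    """log p(x[d_first:] | x[:d_first]) as the ratio of the full GMM
    likelihood to the same GMM marginalized onto the first d_first dims."""
    full = gmm_loglik(x, weights, means, covs)
    marg = gmm_loglik(x[:d_first], weights,
                      [m[:d_first] for m in means],
                      [c[:d_first, :d_first] for c in covs])
    return full - marg
```

The point of the construction is that no separate predictive model has to be trained: both the numerator and the denominator come from the one GMM.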
I am also interested in stuff with SGMMs where you handle the phonetic context in a more clever way (Brian Mak is working in this direction too, and I have started some work with Thang Vu using Kaldi).

Pronunciation and stress modeling

Pronunciation and stress models are hard to learn automatically from data, but I think this is important enough that it deserves attention. For example, in English, the words "to" and "the" are pronounced in distinct ways depending on the next word, so cross-word pronunciation modeling makes sense. The machinery to rescore lattices with these types of models is already being built in Kaldi in order to support RNNLMs, so it's not that much extra work.
Stress modeling-- is it possible to train and use context-dependent models of phonetic stress? The general idea would probably be to start with a stress-marked dictionary and see if you can improve it somehow, but ideally you'd like to learn this stuff entirely from data. Again, a lattice-rescoring framework would probably be the easiest way to do this. I had in mind a model of the probability of sequences of stress markers in a sentence, and stress-dependent vowel phonemes.
In this vein, I'm also interested in duration models, especially "whole-sequence" duration models that assign a probability to the entire duration sequence at once-- these might be practical in a lattice-rescoring context.

"Deep" neural networks

In general neural-network approaches to speech recognition are not really my style, but people seem to have been getting good results with them lately (e.g. recent work at Microsoft, that has been confirmed by other groups). Things I would like to try include:
Try to apply the "Krylov Subspace Descent" method (a fast form of quasi-Newton gradient descent) to neural network training for speech recognition. (I have a publication with Oriol Vinyals where we described this method.)
Random sparse networks. The general idea is: in a large neural network, you have a limited number of connections between layers, and just initialize these randomly. It might be possible to avoid dealing with sparse matrices by somehow using a random permutation and then something like a block-diagonal structure-- or something like that. The motivation is: as networks get larger, having O(#neurons^2) parameters is a bit excessive. We'd like to have sparsity, but I think in a neural network, the pattern of sparsity doesn't really matter as long as it's decided before the parameters are trained.
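The "random permutation plus block-diagonal structure" trick can be sketched in a few lines of numpy. This is my own toy illustration of the idea, not anything implemented: a fixed random permutation of the inputs followed by a block-diagonal weight matrix gives a sparse effective weight matrix (the sparsity pattern is fixed before training, as suggested above), but the forward pass is just dense block multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)

class RandomSparseLayer:
    """Sparse-by-construction linear layer: permute inputs with a fixed
    random permutation, then apply a block-diagonal weight matrix.
    Each output block sees only one block of permuted inputs, so the
    effective dim x dim matrix is sparse, with dim^2 / num_blocks
    parameters instead of dim^2 -- and no sparse-matrix code needed."""
    def __init__(self, dim, num_blocks):
        assert dim % num_blocks == 0
        self.perm = rng.permutation(dim)   # fixed before training
        k = dim // num_blocks
        self.blocks = 0.1 * rng.standard_normal((num_blocks, k, k))

    def forward(self, x):
        b, k, _ = self.blocks.shape
        xp = x[self.perm].reshape(b, k)    # permute, split into blocks
        return np.einsum('bij,bj->bi', self.blocks, xp).reshape(-1)
```

Stacking several such layers with different permutations would let information mix across blocks while keeping the per-layer parameter count linear in the block size.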

Other stuff

Language modeling ideas

I'm working on integrating Tomas Mikolov's recurrent neural network LMs (which have been giving very impressive results) into Kaldi. If I had time I would implement this stuff myself-- there are a few things that I'd probably do a bit differently, e.g. I'd probably have parameter-specific learning rates and do the SGD-with-regularization in a different way. I am also interested in other formulations that might be similar but easier to understand, e.g. ideas similar to how the weights are handled in SGMMs.
There is also something I haven't had time to write up, which relates to Kneser-Ney-like smoothing done with fractional counts-- and an entropy-pruning method that goes with it, like Stolcke pruning but better (it operates on fractional counts, it has more information available to it, and it also modifies the backoff distribution).
If I had a lot more time I would probably investigate the types of models that Stan Chen has been working with-- maxent/exponential models with regularization, with class-based features and the like.

Weighted Finite State Transducers (WFSTs)

I have an ICASSP'12 paper about lattice generation; this is a rather ingenious method that uses a generic solution to the following problem: given an FST, give me an FST that accepts the same set of input-label sequences, but for each input-label sequence I want just the best (lowest-cost) output-label sequence. It turns out that this is possible using determinization in a special semiring. Note: some people at OGI (now part of OHSU), namely Zak Shafran, Richard Sproat, Mahsa Yarmohammadi and Brian Roark, have been working on the same problem too (we will probably work together on this, going forward). See here for the ICASSP paper.
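To give the flavor of the "special semiring", here is a toy sketch (the class name and details are mine; the weight type in the actual paper and in Kaldi differs in its particulars). The output labels are moved into the weight itself; "times" accumulates along a path, and "plus" picks one alternative outright, so ordinary determinization in this semiring keeps exactly one output sequence per input sequence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LatticeWeight:
    """Weight = (cost, output-label sequence). With output labels
    carried in the weight, determinization over this semiring discards
    all but the cheapest output sequence for each input sequence."""
    cost: float
    labels: tuple

    def times(self, other):
        # accumulate along a path: add costs, concatenate labels
        return LatticeWeight(self.cost + other.cost,
                             self.labels + other.labels)

    def plus(self, other):
        # choose between alternative paths: a total order (cost first,
        # label sequence as tie-break) makes "keep the best" well-defined
        if (self.cost, self.labels) <= (other.cost, other.labels):
            return self
        return other

ZERO = LatticeWeight(float('inf'), ())   # no path
ONE = LatticeWeight(0.0, ())             # empty path
```

The subtle requirement is that "plus" must select one of its arguments rather than blend them (a condition on the semiring that makes the determinization well-defined), which is why a total order on the weights is needed.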

Something else in this space is the question of FST intersection in the weighted case. The task is, given two WFSTs, to come up with an algorithm that will give you a WFST that, for each (istring, ostring) pair, assigns a weight equal to the (times) of the weights assigned by the two input WFSTs. I came up with an algorithm that I thought would be efficient in certain cases (e.g. where one of the WFSTs was mostly deterministic, or something like that, I forget now). Mike Riley pointed out to me that there are some mathematical results which imply that you can't do this in better than exponential time in general (or maybe it was even undecidable whether the result is empty or not-- anyway, something extremely dispiriting). It would be interesting if there were some efficient algorithm, applicable to an important sub-class of WFSTs, that was useful in practice. BTW, the algorithm was similar in spirit to composition, except it had to "remember" some left-over symbols, and this enlarges the state space of the output WFST (a state is no longer just a pair of the original states; it is also indexed by the leftover symbols). The problem is that you can't always bound in advance how many leftover symbols there will be, and this leads to a kind of blowup of the state space. Perhaps for particular types of WFST you could show that this wouldn't happen. There is also the question of whether there is any killer app for this type of algorithm-- although for the more mathematically inclined, this might not matter.
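To pin down what is being asked for (the specification, not an algorithm), here is the weighted intersection written out for the degenerate case where each machine's weighted relation is a finite dict; the representation is my own, with costs in the tropical semiring so that "times" is addition of costs.

```python
def intersect_relations(A, B):
    """A, B: dicts mapping (input_string, output_string) -> cost,
    i.e. finite weighted relations with tropical weights. The
    intersection assigns to each pair the (times) of the two weights,
    here the sum of costs; pairs absent from either side have weight
    zero (infinite cost) and are omitted. For real WFSTs the relations
    are infinite, so this enumeration is exactly what an efficient
    algorithm would have to avoid -- and, per the hardness results
    mentioned above, in general cannot."""
    return {pair: A[pair] + B[pair] for pair in A.keys() & B.keys()}
```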

The Center for Language and Speech Processing
Hackerman Hall 226
3400 North Charles Street
Baltimore, MD 21218
dpovey AT gmail DOT com

Back to my homepage