Thanks. Your model looks very similar (if not identical) to what I was proposing. I’ll probably blog more about it after reading your paper. I’m trying to get back into this line of work.

I’m also very interested in Wicentowski’s models for morphology; I used them before (with some generalizations to language model emissions instead of simple multinomials). They work pretty darn well out of the box for languages that are well modeled by the affixing schemes. What I’d really like to do is combine Wicentowski-style morphology with Goldwater/Johnson-style morphology induction.

Of course, the features are going to need to be different for sequence alignment than you were suggesting for words, as there isn’t a notion of word (though there’s some higher-level structure in the form of motifs, and perhaps other repeated patterns that haven’t been well characterized for regulation).

I think I really do have a CRF, because I'm modeling the conditional probability of generating an alignment that yields a read of a given length at a given position in the reference sequence.
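As a rough sketch (the notation here is mine, not the original post's), a log-linear CRF over alignments would condition on the reference and normalize only over alignments:

```latex
p(a \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{c} \mathbf{w} \cdot \mathbf{f}(a_c, x) \Big),
\qquad
Z(x) \;=\; \sum_{a'} \exp\!\Big( \sum_{c} \mathbf{w} \cdot \mathbf{f}(a'_c, x) \Big),
```

where the cliques $c$ range over local edit operations in the alignment and $\mathbf{f}$ collects their features.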

An obvious way to discriminatively train would be to take a known set of alignments and infer appropriate weights for the clique potentials. As presented above, we just guess at these using domain knowledge. This is exactly what I’d like to do in a hierarchical model. It’d be easy to extend the Gibbs sampler to sample the edit weights given alignments, and given the edit weights, everything else would be as above.
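A minimal sketch of that discriminative training idea, under assumed toy features: each candidate alignment is summarized by hypothetical counts of (match, mismatch, gap), one candidate is the known gold alignment, and we fit clique-potential weights by gradient ascent on the conditional log-likelihood (gradient = gold features minus expected features). Everything here is illustrative, not the post's actual model.

```python
import math

# Hypothetical candidate alignments, each summarized by feature counts:
# (matches, mismatches, gaps). Index 0 is the known (gold) alignment.
candidates = [
    (5, 0, 0),  # gold: all matches
    (3, 2, 0),
    (3, 0, 2),
    (2, 1, 2),
]
gold = 0

# One weight per edit-operation feature (the clique potentials).
w = [0.0, 0.0, 0.0]

def probs(w):
    """Log-linear distribution over the candidate alignments."""
    s = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
    m = max(s)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in s]
    z = sum(exps)
    return [e / z for e in exps]

# Gradient ascent on conditional log-likelihood:
#   grad_j = f_j(gold) - E_p[f_j]
for _ in range(200):
    p = probs(w)
    expected = [sum(p[i] * candidates[i][j] for i in range(len(candidates)))
                for j in range(3)]
    for j in range(3):
        w[j] += 0.1 * (candidates[gold][j] - expected[j])

# After training, the gold alignment should dominate the distribution.
p = probs(w)
best = max(range(len(p)), key=p.__getitem__)
```

In the full model the expectation would be computed by dynamic programming over the alignment lattice (or sampled, as in the Gibbs extension mentioned above), not by enumerating candidates; enumeration just keeps the sketch self-contained.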

A natural Markov random field would be the joint probability of a read and reference sequence, though you’d probably only want to generate the pieces of the reference aligned. Here, I’m assuming the reference is known and generating the conditional probability of an alignment.
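The distinction between the joint MRF and the conditional model can be made concrete (again with notation of my own choosing): the two share the same potentials, and differ only in what the partition function sums over.

```latex
\underbrace{p(y, a, x) \;=\;
  \frac{\exp\big(\mathbf{w}\cdot\mathbf{f}(y,a,x)\big)}
       {\sum_{y',a',x'} \exp\big(\mathbf{w}\cdot\mathbf{f}(y',a',x')\big)}}_{\text{MRF: generates read and reference}}
\qquad
\underbrace{p(y, a \mid x) \;=\;
  \frac{\exp\big(\mathbf{w}\cdot\mathbf{f}(y,a,x)\big)}
       {\sum_{y',a'} \exp\big(\mathbf{w}\cdot\mathbf{f}(y',a',x)\big)}}_{\text{CRF: reference given}}
```

Conditioning on the known reference $x$ removes the sum over reference sequences from the normalizer, which is what makes the conditional (CRF) form tractable here.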
