LingPipe 3.9 Released

Yatta! CRFs for tagging and chunking and a whole lot more are out and available for download from the usual place:
  • LingPipe 3.9 Home Page

    The home page details the updates. Major changes in addition to CRFs include a new tag package to which HMMs were retrofitted, serialization for tokenized LMs, updates to 2010 MEDLINE, and some new utility methods and classes.

    4.0 Up Next

    We’re no longer planning to adopt the AGPL license, but we are still planning major changes for 4.0, including getting rid of all the currently deprecated classes and methods. Many existing classes are likely to be relocated. Hopefully this will be done soon, as we’re not planning any new functionality for 4.0. Yet.

Programming esprit d’escalier

    Mixing Metaphors

    Last night I was struck by a bolt of esprit d’escalier.*

    Anecdotal Evidence

    I was refactoring our part-of-speech tutorial to the new tagging interface introduced for CRFs. I’d spent over an hour stuck on a bug for unknown token evaluations, which had evocatively repeatable bug patterns. Finally, I had the sense to give up, get on the subway, and go home. As soon as I was sitting on my sofa watching Defying Gravity, it occurred to me what the problem was. I was updating the known token set for the initial segment of the test set and then recalculating test set scores. D’oh!

    Down the Garden Path

    The problem is that your brain gets into a rut. To borrow a metaphor from psycholinguistics, you wander down a programming garden path.

    The Zen of Being Your Own Rubber Ducky

    Way back in high school (post-internet, pre-web), Douglas Hofstadter’s book Gödel, Escher, Bach introduced me to the joy of the kōan. I liked the one about running around the house without thinking about white elephants. That is exactly how it feels when you’re down the garden path, trying to escape the labyrinth to reach enlightenment. I contend that if you were good enough at zen meditation, you could be your own rubber ducky.

    * The Wikipedia translates the expression “esprit d’escalier” as “staircase wit” and defines it succinctly as “thinking of a clever comeback when it is too late”. I love the Wikipedia; they even point to an esprit d’escalier-themed Seinfeld episode, “The Comeback”.

    Hunt and Thomas introduced the term “rubber ducky” in their book The Pragmatic Programmer. It’s about how useful it is to explain your problem out loud, even if only to a rubber ducky. The best rubber ducky is another programmer who actually understands what you’re talking about at some level. Mitzi and I have found each other more useful than our cats as rubber duckies.

Inferring Splice-Variant mRNA Expression from RNA-Seq

    I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and the relative mature mRNA expression of gene splice variants.

    The Model

    The data from which we infer expression is

  • K, the number of splice variants,
  • N, the number of reads, and
  • y = y[1], …, y[N], the reads.

    For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores. We assume two model hyperparameters,

  • phi, the genetic variation, sample preparation, and sequencing parameters, and
  • alpha, the prior read counts (plus one).

    The general-purpose parameters phi reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, there is the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to the end of the read, and sensitivity to context such as the previously read base (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base). We will infer two of the model parameters,

  • t = t[1], …, t[N], the mapping of each read to a splice variant, and
  • theta = (theta[1], …, theta[K]), with theta[k] ≥ 0 and theta[1] + … + theta[K] = 1, the read expression probabilities.

    The model structure in sampling notation is

  • theta ~ Dirichlet(alpha)
  • t[n] ~ Discrete(theta) for n in 1:N
  • y[n] ~ Channel(t[n], phi) for n in 1:N.

    We select the expression levels theta based on prior counts alpha. The heart of the mapping model is the discrete sampling of the latent mappings t[n] based on the expression level parameter theta. This is effectively a beta-binomial model of expression at the corpus level. The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y[n] being observed given that it arose from the splice variant t[n] under the sequencing model parameterized by phi.
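    Putting the pieces together, the sampling notation determines the unnormalized joint posterior over the latent quantities; written in LaTeX with the symbols introduced above:

        p(\theta, t \mid y, \alpha, \phi)
          \;\propto\;
          \mathrm{Dirichlet}(\theta \mid \alpha)
          \prod_{n=1}^{N} \theta_{t[n]} \, p(y[n] \mid t[n], \phi)

    The posterior over expression alone, p(theta | y, alpha, phi), follows by summing out the read mappings t, which is what the Gibbs sampler described below does by simulation.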

    Simple Channel Examples

    For purposes of illustration, we’re going to assume a really simple instance of this model. First, we’ll assume that alpha[k] = 1 for 1 ≤ k ≤ K, which renders a uniform distribution over possible values of theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate alpha. Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure. Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple mappings for a read, we will treat them all as equally likely; in a more articulated model, we would take into account sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant. Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here. With all of these assumptions in place, if a read y[n] of length M is mapped to splice variant k, we set

    p(y[n] | t[n] = k, phi) = 1 / (L[k] − M + 1),

    where L[k] is the length of splice variant k. The denominator reflects the number of positions at which a sequence of the given read length M may occur in an mRNA of length L[k].

    A Simple BUGS Implementation

    Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS. BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution. I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data. One debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute it. Attaching the data converts the parameters of the model to matrix data representing the Gibbs samples.

    An Example

    I’ll close with an example that illustrates how the R and BUGS programs work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms. There are three exons, A, BB, and C, where the notation is meant to indicate that the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and variant 3 of exons A and C. Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match variants 2 and 3. Next, we specify the number of samples drawn from each exon in each splice variant. There are 3 samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3. If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing the sample counts by variant length. But our expression parameter theta is defined by reads, not by transcripts. The observed data is much simpler: it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.
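    Spelling out the arithmetic in the paragraph above (exons A and C have unit length, BB has length 2, so the variant lengths are 3, 3, and 2):

        \theta^{\text{true}}
          = \left( \tfrac{3}{19},\; \tfrac{6}{19},\; \tfrac{10}{19} \right)
          \approx (0.158,\; 0.316,\; 0.526),
        \qquad
        \text{transcripts}
          = \left( \tfrac{3}{3},\; \tfrac{6}{3},\; \tfrac{10}{2} \right)
          = (1,\; 2,\; 5).

    The read-based proportions on the left are what theta estimates; dividing the sample counts by the variant lengths gives the transcript counts on the right.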

    Coding the Example in R

    We set the Dirichlet prior to be uniform, which also determines our number of splice variants K. We actually assume there are 10 times as many samples as in our figure when setting the read data. That is, there are 60 reads from exon 1, 60 reads from exon 2, and 70 reads from exon 3, so N=190. Finally, we use R’s matrix constructor to set up the read model. I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).
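    The matrix being described, with one row per splice variant and one column per exon (row 2, for variant 2 over exons BB and C, follows the same length-proportional logic as the others):

        P(\text{exon} \mid \text{variant}) =
        \begin{pmatrix}
          1/3 & 2/3 & 0   \\
          0   & 2/3 & 1/3 \\
          1/2 & 0   & 1/2
        \end{pmatrix}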

    The Output

    I ran the R program, which called BUGS, which terminated neatly and returned to R. The R-hat convergence statistics were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

    Bayesian Point Estimate

    We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable theta[1]. We can then generate the Bayesian (L2) estimates for expression levels. The “right” answer for our simulation is also easy to compute with R, and our estimates are pretty darn close. (Recall that we can normalize these sample counts by length to get mRNA transcript counts for each variant.) How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples. With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.
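    The Bayesian point estimate minimizing expected square (L2) loss is the posterior mean, approximated by averaging the M Gibbs samples of each component:

        \hat{\theta}[k] \;=\; \frac{1}{M} \sum_{m=1}^{M} \theta^{(m)}[k]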

    A Triangle Plot and Correlated Expressions

    Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) is for variant 3. It takes some practice to view triangle plots like this; it’s perhaps easier to view two variables at a time. The two-way interaction plots show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given that variants 1 and 2 share the large exon BB. Finally, R source generated the PNG files for both the triangle plot and the ordinary two-way plots.

LingPipe To-Do List

    I never catch up with the pile of API designs and features I have planned for the future of LingPipe. After that last release, I’d like to just list the major items on the “to-do” list for LingPipe. Comments appreciated, as always. As are feature requests. It’s not very expensive to add to this list, and you never know what’ll grab me to implement next.

    Demos and Commands

    We could really use command-line options for much of our technology. I think we’re losing market share because LingPipe requires Java coding. A good start might be a command-line package for classifiers, like all the other ones out there, to allow users to plug and play. We could also use more demos. Right now, we just have a few annotation demos up, but that’s only a small part of what we do.

    External Interface Implementations

    We’ve never really gotten around to properly integrating large chunks of LingPipe in deployment container wrappers such as Apache UIMA or development environments such as the University of Sheffield’s GATE system. We don’t support any kind of RDF output for the semantic web. We’ve pretty much stopped trying to write XML output format wrappers for everything.

    Clusters and Clouds

    We’ve also not provided any kind of wrappers to make it easy for people to run large LingPipe jobs on local clusters or distributed cloud computing platforms like Amazon’s EC2.

    Multi-threaded Implementation

    Almost all of the runtime classes for LingPipe are thread safe, at least for read operations like classification, clustering, or chunking. But what we don’t have is any kind of threading in our base classes. I just bought a quad-core notebook, so it’s probably time to start thinking about this. There are all sorts of opportunities for concurrency in basic LingPipe classes, from K-means clustering to the per-epoch log loss reports in logistic regression or CRFs to any of the cross-validating evaluations. The tricky part is concurrent training, though that’s also possible for approaches such as EM. It would also be possible if I reorganized logistic regression or CRFs to more directly support blocked updates, because the gradient for each instance in a block may be computed independently.
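    To illustrate that last point, here’s a minimal sketch in present-day Java (hypothetical types and method names; nothing here is LingPipe’s API) of farming out per-instance gradient computations in a block and summing the results for a single blocked update:

        import java.util.List;
        import java.util.function.Function;

        class BlockedGradient {

            // Sums per-instance gradients computed concurrently;
            // instanceGradient maps one training instance's feature
            // vector to its gradient contribution.
            static double[] blockGradient(List<double[]> block,
                                          Function<double[],double[]> instanceGradient,
                                          int numDimensions) {
                return block.parallelStream()
                    .map(instanceGradient)              // independent per-instance work
                    .reduce(new double[numDimensions],  // identity: the zero vector
                            BlockedGradient::add);      // associative combine
            }

            // elementwise vector addition, returning a fresh array
            static double[] add(double[] a, double[] b) {
                double[] sum = new double[a.length];
                for (int i = 0; i < a.length; ++i)
                    sum[i] = a[i] + b[i];
                return sum;
            }
        }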

    Improvements and Generalizations

    Many of our modules are not written as generally as possible, either at the level of API or the level of algorithms.

  • Fully online stochastic gradient for logistic regression, conditional random fields (CRFs), and singular value decomposition (SVD).
  • All iterative algorithms producing iterators over epochs. Right now, only latent Dirichlet allocation (LDA) is set up this way, but I could do this for semi-supervised expectation-maximization (EM) classifier training, and all the SGDs for logistic regression, CRFs, and SVD.
  • Refactoring SVD to produce iterators over states/dimensions.
  • Weighted examples for just about anything trainable, from LM classifiers to logistic regression; this would make it trivial to train on probabilistic examples by weighting categories by probability.
  • Entropy-based pruning for language models.
  • Information gain for feature selection; minimum count feature selection.
  • Generalize min-max heaps to take a java.util.Comparator rather than requiring entries to implement LingPipe’s util.Scored.
  • Soft k-means abstracted from the demo into a package for clustering.
  • More efficient vector processing for logistic regression, CRFs, etc., where there is no map from strings to numbers, and not necessarily even a vector constructed. In most of these modules, we need to compute a vector dot product with a coefficient vector, not actually construct a feature vector. Ideally, we could go all the way to Vowpal-Wabbit-like implicit feature maps (see the sketch after this list).
  • Integrate short priority queues where appropriate, because they’re more efficient than the general bounded priority queues we’re currently using.
  • More general priors and annealing for SVD, to match the other versions of SGD.
  • A more efficient sorted collection with removes, for more efficient hierarchical clustering.
  • Removing all the specific corpus.Handler subinterfaces other than ObjectHandler. Then I can generalize cross-validating corpora, and just have the object handled as a type parameter for corpus.Parser and corpus.Corpus.
  • Add iterator methods to corpora that can enumerate instances rather than requiring handling via callbacks to visitors.
  • Privatize everything that can be.
  • Finalize methods and classes that shouldn’t be overridden. I’ve been very sloppy about this all along.
  • More aggressively copy incoming arguments to constructors and wrap getters in immutable views. Later classes are much better than earlier ones at doing this. (Maybe I should just say re-read Effective Java and try one more time to apply all its advice.)
  • Serialize dynamic LM classifiers, and serialize tokenized LMs to support it.
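    Here’s a minimal sketch of the Vowpal-Wabbit-style implicit feature map mentioned in the vector-processing item above: hash each feature name straight into the coefficient array and accumulate the dot product on the fly, with no symbol table and no feature vector object (illustrative only, not a LingPipe API):

        class HashedDotProduct {

            // Dot product of a sparse feature list with a dense
            // coefficient vector, using hashing in place of a
            // string-to-index symbol table.
            static double dotProduct(String[] featureNames,
                                     double[] featureValues,
                                     double[] coefficients) {
                double sum = 0.0;
                for (int i = 0; i < featureNames.length; ++i) {
                    int index = Math.floorMod(featureNames[i].hashCode(),
                                              coefficients.length);
                    sum += featureValues[i] * coefficients[index];
                }
                return sum;
            }
        }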

    Rearrangements

    So many things should move around that it’d be impossible to mention them all. For instance, putting all the HMM classes in the HMM package, and all the LM classes in the LM package, for a start. I’m planning to move most of the corpus-specific parsers (in corpus.parsers) out of LingPipe to wherever they’re used. I’m also planning to move the entire MEDLINE package to the sandbox project lingmed. I should rename classes like util.Math which conflict with Java library class names.

    New Modules

  • Wikipedia Wikitext parser (probably not for LingPipe proper).
  • Probabilistic context-free grammar (PCFG) parser, possibly with Collins-style rules.
  • Discriminative statistical context-free grammars with more global tree kernel features.
  • Dependency parsers with the same characteristics as the CFGs.
  • Linear support vector machines (SVMs) with regularization via SGD.
  • Morphological analyzer (to work as a stemmer, lemmatizer or feature generator), preferably with semi-supervised learning. I’d like to take an approach that blends the best aspects of Yarowsky and Wicentowski’s morphology model and Goldwater and Johnson’s context-sensitive Bayesian models.
  • Some kind of feature language for CRFs.
  • More finely synched cache (along the lines of that suggested in Goetz et al.’s awesome Java Concurrency in Practice).
  • Logistic regression for a long-distance, rescoring tagger or chunker.
  • Longer-distance maximum-entropy Markov model (MEMM) type taggers and chunkers, with a greedy option as discussed by Ratinov and Roth.
  • Higher-order HMM rescoring taggers and chunkers.
  • More efficient primitive collections (I just finished a map implementation for boolean features).
  • Unions, differences, etc. for feature extractors.
  • Decision tree classifiers.
  • Meta-learning, like boosting (requires weighted training instances).
  • Jaro-Winkler (or other edit distances) trie versus trie (scalable all-versus-all processors based on prefix tries).
  • Prune zeros out of coefficient vectors and symbol tables for classifiers, CRFs, etc.
  • Standard linear regression (in addition to logistic).
  • Other loss functions for linear and generalized regressions, such as probit and robit.
  • Dirichlet-process-based clusterer.
  • Discriminative choice analysis (DCA) estimators, classifiers and coref implementation (the base classes are nearly done).

    Please let us know if there’s something you’d like to see on this list.

Custom Java Map for Binary Features

    Update: 10 February, 2010. You can either jump to David R. MacIver’s comment, which shows that my work is not yet done, or try to figure out what’s wrong with the implementation below.

    Update II: 10 February, 2010. Of course, I should’ve also added a top-level method to BinaryMap to return the positive set as the key set.

    The built-in Java collection implementations tend to be heavy. For instance, hash maps are based on hash sets of map entries, where a map entry is a simple container object with a reference to a key and a value. For many natural language applications, such as CRF taggers and chunkers, feature maps are often designed to be boolean, with features taking only binary values, 0 or 1. These maps are almost always sparse, only representing the values that are 1. Given the proportion of time (most of it) and memory (pretty much all of it) spent extracting features compared to decoding in CRFs or classification with logistic regression, it seems worthwhile to consider more minimal representations. Luckily, the Java collections were designed to be easy to extend (thanks, Dr. Bloch). In this post, I’ll walk you through LingPipe’s new binary map class, which implements a generic map to numbers based on a pluggable set implementation. Because these maps need to be modifiable so that they can be easily built up in feature extractors, there’s a lot of work to do in the implementation. There are so many inner classes that the whole resembles a set of nesting Russian dolls (pictured above).

    Class Declaration

    Skipping all the java.util imports, we declare the map to be in our util package and extend Java’s abstract map. The return type will be a number, but the keys may be any old type, as specified by the generic E.
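    A minimal sketch of what such a declaration might look like (illustrative only; not LingPipe’s actual source):

        // Binary feature map: keys in a "positive set" map to 1,
        // everything else is implicitly 0.
        public class BinaryMap<E> extends java.util.AbstractMap<E,Number> {
            // member variable, constructors, entrySet() and the inner
            // classes are sketched in the following sections
        }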

    Member Variable and Constructors

    There’s only a single member variable, containing the set of items that map to 1, and it’s specified in the constructor, defaulting to a hash set.
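    Continuing the sketch above, the member variable and constructors might look like this (again, illustrative rather than LingPipe’s actual code):

        // the single member variable: the set of keys mapping to 1
        private final Set<E> mPositiveSet;

        // default to a hash set for the positive keys
        public BinaryMap() {
            this(new HashSet<E>());
        }

        // or plug in any Set implementation for the positive keys
        public BinaryMap(Set<E> positiveSet) {
            mPositiveSet = positiveSet;
        }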

    Implementing the Entry Set Method

    Java’s abstract map class does so much work that only one method must be implemented, entrySet(), which returns a set view of the map entries underlying the map. The abstract map class can then use that to implement everything from gets and puts to size to hash code methods. Here, we implement it to return a custom entry set (see below). I’ll fill in implementations of the other map methods later, although this one method would be enough to get the job done in terms of space savings. It’d just be inefficient in terms of speed.
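    A sketch of that one required override, returning the custom entry set defined in the next section:

        @Override
        public Set<Map.Entry<E,Number>> entrySet() {
            return new EntrySet();
        }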

    The Entry Set Implementation

    Like nesting Russian dolls, this implementation’s going to involve quite a few inner classes. The entry set implementation itself is an inner class of the top-level binary map class. Only the iterator and size methods are necessary: the size method just delegates to the containing binary map’s positive set’s size method, and the iterator method returns a custom iterator implementation described below. (That’s why it’s an inner class, not a static nested class.) The remaining methods (clear, contains, is-empty and remove) are provided for efficiency; I grayed out their code to signal its optionality.
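    A sketch of that inner class, showing just the two required methods (the optional efficiency overrides are left out here):

        // inner (non-static) class, so it can see the containing
        // map's positive set
        class EntrySet extends AbstractSet<Map.Entry<E,Number>> {
            @Override
            public Iterator<Map.Entry<E,Number>> iterator() {
                return new EntrySetIterator();
            }
            @Override
            public int size() {
                return mPositiveSet.size();
            }
        }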

    Some Constants

    I declared two constants for convenience in the top-level binary map class.
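    Plausibly these are the boxed values for 0 and 1, which the entry and value sketches below return; that’s my assumption rather than the actual definitions:

        // boxed values for the only two possible map values
        static final Integer ZERO = Integer.valueOf(0);
        static final Integer ONE = Integer.valueOf(1);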

    Implementing Entry Set’s Iterator

    The entry set iterator is declared as an inner class of the top-level binary map, and it implements the appropriate iterator interface. The code consists of the implementations of the has-next, next, and remove methods. When it is constructed, it uses the containing binary map’s positive set to create an iterator over the positive instances. The has-next and remove methods are simply delegated to that positive set iterator. This means that when we’re iterating over the entry set for the top-level map, remove calls have the intended effect on the top-level map, namely removing one of its positive entries. Of course, if the positive set on which the map is based is immutable, the remove operation will throw an unsupported operation exception, as documented in the iterator interface documentation. The next method itself just returns an instance of another inner class, a positive map entry.
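    Sketched as a fragment of the class above (the entry class it returns is in the next section):

        // iterates map entries by walking the positive key set;
        // remove() falls through to the positive set's iterator,
        // so it removes the entry from the map itself
        class EntrySetIterator implements Iterator<Map.Entry<E,Number>> {
            final Iterator<E> mPositiveIterator = mPositiveSet.iterator();
            @Override
            public boolean hasNext() {
                return mPositiveIterator.hasNext();
            }
            @Override
            public Map.Entry<E,Number> next() {
                return new PositiveMapEntry(); // consumes mPositiveIterator.next()
            }
            @Override
            public void remove() {
                mPositiveIterator.remove();
            }
        }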

    Positive Map Entry Implementation

    The positive map entry is itself an inner class declared inside of the entry set iterator (which is itself an inner class of the top-level binary map). During construction, it just consumes the next element from the iterator’s contained positive set iterator. That’s then held locally as the entry’s key. The value is just the constant number 1. Note that the set-value method throws an unsupported operation exception; it could’ve been defined to do nothing for a value of 1, remove the positive entry for a value of 0, and throw an illegal argument exception otherwise. But that would almost certainly cause the iterator from which the value was drawn to throw a concurrent modification exception. The equality and hash code methods are described in Java’s map entry interface; they must be defined this way for compatibility with Java’s collections.
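    A sketch of that entry class, nested inside the iterator so its construction can consume the next positive key (null handling is skipped for brevity):

        // inner class of EntrySetIterator in the sketch above
        class PositiveMapEntry implements Map.Entry<E,Number> {
            final E mKey = mPositiveIterator.next(); // consume the next positive key
            @Override
            public E getKey() {
                return mKey;
            }
            @Override
            public Number getValue() {
                return ONE; // the value is always 1
            }
            @Override
            public Number setValue(Number value) {
                throw new UnsupportedOperationException();
            }
            // equality and hash code as required by Map.Entry's contract
            @Override
            public boolean equals(Object that) {
                if (!(that instanceof Map.Entry<?,?>)) return false;
                Map.Entry<?,?> e = (Map.Entry<?,?>) that;
                return mKey.equals(e.getKey()) && ONE.equals(e.getValue());
            }
            @Override
            public int hashCode() {
                return mKey.hashCode() ^ ONE.hashCode();
            }
        }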

    Map Methods for Efficiency

    The remaining methods defined in the top-level map are for efficiency. All of these implementations are pretty much straightforward delegations to the contained set of objects mapping to 1. The return types of methods such as the put method are defined in the map interface documentation. There are additional methods that could’ve been added for even more efficiency, such as computing hash code and equality as specified in the map interface’s documentation.
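    A sketch of what those delegations might look like; the behavior of put for values other than 1 is my assumption, not necessarily what LingPipe does:

        @Override
        public boolean containsKey(Object key) {
            return mPositiveSet.contains(key);
        }
        @Override
        public Number get(Object key) {
            return mPositiveSet.contains(key) ? ONE : null;
        }
        @Override
        public Number put(E key, Number value) {
            if (value == null || value.intValue() != 1)
                throw new IllegalArgumentException("only the value 1 may be stored");
            return mPositiveSet.add(key) ? null : ONE; // previous value, per the Map contract
        }
        @Override
        public Number remove(Object key) {
            return mPositiveSet.remove(key) ? ONE : null;
        }
        @Override
        public int size() {
            return mPositiveSet.size();
        }
        @Override
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        }
        @Override
        public void clear() {
            mPositiveSet.clear();
        }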

    Values Collection Implementation

    Alas, one more doll to unpack: the value collection implementation.

    Values Iterator Implementation

    The final inner class is the values iterator, which also gets a custom implementation. This one’s a little trickier because it doesn’t delegate the remove method, but keeps track of the state. Iterators only allow their remove method to be called once after each next call.
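    The sketch below simply leans on the positive-set iterator’s own once-per-next bookkeeping rather than tracking the state explicitly as the text describes; the names and structure are assumptions:

        // iterator over the values collection: every value is 1, and
        // remove() must apply to the key whose value was last returned
        class ValuesIterator implements Iterator<Number> {
            final Iterator<E> mPositiveIterator = mPositiveSet.iterator();
            @Override
            public boolean hasNext() {
                return mPositiveIterator.hasNext();
            }
            @Override
            public Number next() {
                mPositiveIterator.next(); // advance past the corresponding key
                return ONE;
            }
            @Override
            public void remove() {
                mPositiveIterator.remove(); // legal only once per call to next()
            }
        }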

    Override Considered Helpful

    Extending the Java collections is a good illustration of the utility of the override attribute. If you mark a method definition with the attribute @Override and it doesn’t actually override a superclass method, the compiler throws an error. The reason it’s so crucial for collections is that many of the collection methods are defined at the Object level, like set’s contains(Object) method. Even though in a type-safe world that object is only going to be of the set’s element type, given Java’s retrofitted type system, this can’t be guaranteed. The types in the collections are as strong as they can be given Java’s type system and the need for backward compatibility with pre-generics Java.

    Serialization and Synchronization

    I also made the map serializable using the serialization proxy pattern, as supported by LingPipe’s util.AbstractExternalizable base class. For synchronization, the class may be wrapped in the synchronized wrappers from Java’s java.util.Collections utility class.

    Whew!

    No wonder programming books are so darn long. And this was a relatively simple example that only took me a few hours to write and unit test. Check it out in the next LingPipe release. It’ll be paired with a binary vector implementation along the same lines.

Java Inheritance vs. Equality: Chunking and Span Case Study

    There’s a tension between allowing inheritance and defining equality properly in Java. I’ve been working on refactoring many of the LingPipe interfaces and abstract base classes, and this problem comes up again and again.

    Spans and Chunks

    In dealing with text annotations, the simplest meaningful data structure is the span, which picks out a start and end (or start and length) in a given character sequence. The term “chunk” typically applies to spans with additional type information, such as person/organization/location for named-entity mention chunking or noun phrase/verb phrase for phrase chunking.

    Inheritance

    Using inheritance, we might define a pair of interfaces, Span and Chunk. Equality requirements can’t be specified in Java interfaces themselves, so they are typically specified informally in the Javadoc.
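    A sketch of the kind of pair under discussion (illustrative; not LingPipe’s actual interfaces), with the questionable subinterface declaration marked:

        interface Span {
            int start();
            int end();
        }

        // "extends Span" is the italicized declaration in question below
        interface Chunk extends Span {
            String type();
        }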

    Span and Chunk Equality?

    It makes sense to consider a span equal to another object if that object is a span with the same start and end points. It also makes sense to consider a chunk equal to another object if that object is a chunk with the same start and end points and the same type. This is all well and good if chunk and span are independent interfaces. But what if we include the italicized declaration making Chunk a subinterface of Span? If I have a span sp and a chunk ch that have the same start and end points, we’d wind up having sp.equals(ch) evaluate to true, but ch.equals(sp) evaluate to false. Unfortunately, this breaks Java’s implicit contract on the equals method, which requires symmetry (a.equals(b) if and only if b.equals(a)). I’ve actually written special-purpose JUnit assert methods (in com.aliasi.test.unit.Asserts) to test object equality both ways, and to ensure that if two objects are equal, they have the same hash code. So even though chunks include all the information in spans, they don’t satisfy a span’s notion of identity.

    Encapsulation vs. Inheritance

    It’s much cleaner conceptually to instead use encapsulation, so that a chunk has a span but isn’t itself a span per se. Now we can say that two chunks are equal if their spans and types are equal. We could even add convenience methods for start() and end() to the Chunk interface.
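    The encapsulated version might look like this (again, just a sketch):

        interface Span {
            int start();
            int end();
        }

        // a chunk has a span rather than being one
        interface Chunk {
            Span span();
            String type();
            // optional convenience delegations: int start(); int end();
        }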

    Adding Scores via Overspecification

    The current state of LingPipe interfaces does not have a Span-like interface, and Chunk includes not only start, end and type, but also score information. This is a clumsy interface style because not all chunks necessarily have score information. If we include score equality in the definition of equality, the degenerate scoreless chunks need to all take the same default score. It wouldn’t help to add scores via inheritance. We actually do that, defining Chunk to extend the Scored interface.

    Adding Scores via Encapsulation

    With encapsulation, it’d make sense to define a scored chunk relative to a chunk the same way we defined a chunk relative to a span. Alternatively, we could have a generic scored object interface and use ScoredObject<Chunk> for scored chunks.
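    Sketches of the two encapsulated alternatives (hypothetical signatures, not necessarily LingPipe’s):

        // a scored chunk has a chunk, the same way a chunk has a span
        interface ScoredChunk {
            Chunk chunk();
            double score();
        }

        // or, generically, wrap any object with a score
        interface ScoredObject<E> {
            E getObject();
            double score();
        }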

    Implementations

    The main problem with encapsulation is that it involves more objects, and despite the discounts provided by recent JVMs, object creation and destruction is still crazy expensive compared to arithmetic. If we implemented a chunk as a holder for a span and a string type, the object representation for the span itself is unnecessary memory overhead. If we only store the start, end and type for a chunk, we’d need to create the span just to return it to a client, which is just the kind of object creation that efficient Java programs try to avoid.

    Example 2: Java Set Equality

    Java’s java.util.Set.equals(Object) method specifies: “Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set).” That’s a sensible definition following the mathematical notion of a set. But what about the superinterface java.util.Collection? It’s not a coincidence that set’s super-interface, java.util.Collection, does not define equality. Although it might make sense to define collections as equal if they had the same iterated objects in the same order, this would conflict with set’s definition of equality in the same way that span’s definition conflicted with chunk’s. Similarly, java.util.SortedSet can’t extend set’s definition of equality to require sort-order equality, because it would mess up the use of Set itself and make it impossible for a class to implement both Set and SortedSet.

ThinkPad W510 Review: First Impressions

    [Update: 15 March 2011: I finally caved in and got a 13″ MacBook Air. I just got it yesterday and am already up and running. I love the low-throw keyboard and massive multitouch touchpad. And the (lack of) weight. Today, I got rid of the W510. That’s all it took: one day. I haven’t used a Mac since 1992, but it is, in fact, pretty easy (though I was perplexed by the extension cord, which, like many of Apple’s products, is completely non-obvious). Luckily, I do know Unix and the Mac OS is easy to learn with their online tutorials. The Air’s hardly a desktop replacement, though I’d probably buy one even if I wanted to run Windows on it. Alternatively, I might consider a MacBook Pro, which came out a year after Lenovo’s machine with roughly similar specs to the W510.]

    [Update: 25 September 2010: After six months or so, the machine has developed a serious looseness on the lower right side of the keyboard/case that makes typing on it even worse. It rattles, and I can feel the lower right side of the case moving along with the keys. Typing, I notice it on the whole right side of the keyboard, but it’s most pronounced with keys in the lower right. Wah! My favorite part of the IBM ThinkPads was the keyboard. Does anyone have other suggestions out there for a high-performance notebook with a good keyboard? It’s not something most reviewers even mention. At Best Buy, I hated pretty much all the notebook keyboards I tried, but they hardly had a full range, and mostly consumer models.]

    I’ve had my Lenovo ThinkPad W510 for a week [update: I’ve now had it for nearly a month and my opinion hasn’t changed]. Long enough to transfer all of my work, install everything I need, and use it for a couple of days for work. My previous machine was a Lenovo ThinkPad Z61P with a Core Duo, 4GB of RAM and a 1920×1200 15.5″ monitor. I’ve had ThinkPads going back to the late 1990s.

    The Specs

    Here’s the most info I could find from Lenovo:

  • Lenovo: ThinkPad W510 Reference [pdf] (complete lineup of what’s available and in what configurations)
  • Lenovo: ThinkPad W510 Repair Manual [pdf]

    CPU and Memory

    I got the cheapest CPU, the quad-core Intel Core i7-720QM. Intel gave it a slow 1.6GHz clock for when all four cores are engaged, but then uses “Turbo Boost” to max out at 2.8GHz under lighter load. The step up to the 820QM at 3GHz max didn’t seem worthwhile, much less the step to the extreme version. The W510, unlike the T-series notebooks, has 4 DIMM slots. I got 8GB of DDR3 memory in 4 DIMMs, but only at 1066MHz. I figure I’ll upgrade to 16GB of DDR3 at 1333MHz when it gets a bit cheaper or when I have a job for which I need 16GB.

    128GB Solid State Drive

    I have to say that the SSD isn’t the life-changer I’d been led to believe it would be. It may just be all the over-hype for desktop use. What it does do is make startup from sleep super-duper fast, as in almost instant. And I have to keep checking that my gzipped tar files really have unpacked. Lots of random file access is way faster. It’s also startlingly quiet. So the notebook doesn’t thrum any more when in use. Very cool. I think it’s also more energy efficient than traditional disks, but I’ve only been plugged in w/o a power meter so far.

    The Monitor

    Why don’t other manufacturers release higher-definition monitors in smaller sizes? Apple’s 17″, not to mention HP and Sony’s 18″-plus monsters, are just too big. But I want high res in a smaller size, which pretty much leaves me with Lenovo. I opted for the 15.6″ FHD display (95% gamut) with LED backlight. I didn’t get the touch screen, and the option for FHD without touch screen seems to have gone away. They’ve already had supply issues with these machines. It took some digging to decode FHD into “full HD”, meaning 1920×1080 native resolution. I think this is a huge mistake; I want my 1920×1200 back. I’m a programmer and I like my vertical buffer space in the shell, emacs, and web docs/PDFs. The brightness and color balance are nothing short of amazing. From across the room, the screen with an emacs window up reminds me of my white-balanced photo light box. My old ThinkPad screens look green compared to the new one. After working for an hour or two in emacs, I find the incandescents in our house look out of balance. Photos look way better on this screen than on our previous ThinkPad screens and better than on our Samsung 24″ LCD monitors at work. Did I say “amazing” yet? It’s so bright I can’t leave it on full brightness indoors. Speaking of monitors, it also has a DisplayPort output, which I haven’t tried yet. That’ll mean yet another dongle, DisplayPort to HDMI. No idea if the W510 sends out sound on that or not.

    The Keyboard

    What Lenovo has done to the ThinkPad keyboard is nothing short of criminal. I type really quickly, and I loved the old ThinkPad keyboards. I’m using a ThinkPad USB keyboard on my workstation at work as I type this. Over the last three ThinkPads (over a period of 6 or so years), the keyboards have just gotten worse. Basically, the keys are longer throw, less definitively clicky, and much, much stiffer. The new key size for the escape and delete keys was a good idea; at least they left all the keys I use in the same places as before. Does anyone know if the T410 and T510 use the same keyboards? Lenovo: give me my old keyboard back. I’m willing to pay in both US$ and weight! I tried all the other brands out there, and there’s nothing like the old ThinkPad keyboards. And Sony’s and Apple’s are just dreadful (though I like the compact iMac keyboards more than any notebook keyboards out now). [Update: After a month of use, the keyboard strikes me as even worse than it did on first impressions. I’m consistently dropping keystrokes because the force required is so much greater than on my old ThinkPad. Argh!]

    The Battery and Power Supply

    WTF? The 9-cell battery sticks out of the back of the machine a good ways and is really heavy. I almost never get stuck relying on battery power, but then I don’t take many long flights any more. What’s annoying is that there’s no 6-cell option, so at least for now, I’m stuck with this unsightly thing. The 135-watt power supply is enormous and also heavy. It supplies more amps than the previous power supply; I have no idea if I could get away with a less powerful one. Thankfully they’ve kept the same power-supply connector (what they call a “tip”). IBM used to swap them every model, so I had to keep buying new secondary power supplies. I don’t see any lighter power supply that’s compatible with my machine, which is a real drag.

    Useless Built Ins

    No, I don’t need a 2MP video camera (though who knows, I may Skype with video now), and I don’t need a fingerprint reader, and I don’t need memory card readers. I also uninstalled the MS Office trials, but the machine was otherwise free of preinstalled crap.

    Windows 7 Pro 64 bit

    This machine came with Windows 7 Pro, 64-bit version (it was $70 cheaper than “Ultimate”, whatever that is). I was burned on my dual quad-core workstation by not getting Windows Vista Pro (which is required to run two physical CPUs; I hate crippleware, of which Windows Home is a prime example), so I made sure this machine came with Windows 7 Pro. Of course, 64 bit. All I can say is “seventh verse, same as the first”. I’m a Windows user and can’t tell the difference here. But then some of the speed may be the new Windows instead of the new machine. I put everything back to Windows Classic mode; I just can’t get used to the fat translucent title bars. I turn off ClearType because I prefer my fonts to be crisp. I put everything back to normal size (100% zoom), because at 125% zoom the fonts all look yucky. It’s tiny, but then I can see a lot. As I get older, it’s tougher on my eyes. I wish Windows resized everything more gracefully so I could keep emacs and the shell at native res, but crank up icons, windows, menus, etc., to something more manageable without them looking ugly and unbalanced. But what’s with not letting me resize windows near the bottom of the screen? It just exacerbates the loss of 120 vertical pixels with the “FHD” format. If anyone knows how to turn that off, let me know. [Update: Windows 7 snaps the window to full height if you let it go in this region, which is really useful.]

    The Optical Drive

    Alas, “Multi Recorder Optical Drive (12.7mm)” did not mean Blu-ray. You get Blu-ray at this price on mainstream computers other than Apple’s. C’mon, they’re not that expensive, and I’m paying the premium to Netflix because I love 1080p.

    Graphics Card

    This thing comes with an “NVIDIA Quadro FX 880M Graphics with 1GB DDR3 memory”. I don’t really use it to play 3D games. It seems to have more memory assigned to it; I’ll have to look into that. I’d probably have preferred a lower-power built-in solution.

    The Design

    Are you kidding me? It was like there was a contest for plain, and Lenovo put their entire brain trust behind it. I kind of like it. It sort of reminds me of something Cayce Pollard, the protagonist of William Gibson’s Pattern Recognition, might like.

    Performance Evaluation

    I actually care more about the subjective performance than the kinds of charts they print in most online computer reviews. Having said that, here’s the “Windows Experience Index” for the machine, with components assessed on a 1.0 to 7.9 scale (I’m not making that range up):

  • Processor: 6.9
  • Memory: 7.3
  • Graphics: 6.4
  • Gaming Graphics: 6.4
  • Disk: 6.9

    Two-Day Shipping

    Sure, that was two-day shipping. Not including the trip from Shanghai via Korea and then the days it took clearing customs in Kentucky. I’m not sure why customs takes so long. Overall, it was torture watching the shipping tracker.

    Lenovo’s Call Center

    I’d just ordered a T400 a week before the new models came out, and they let me cancel that order, no questions asked and no hassle at all. Their call center is awesome. From the accents, I’d guess it’s in India. It’s one of the best call centers I’ve ever called. I’ve had two interactions with them over the last few years: one to change a complicated order and one to cancel the T400. Both were extremely fast, effective, and clear.

DCA LaSO via SGD

    AKA, I’m going to use discrete choice analysis (DCA) to provide a probabilistic model for the learning as search optimization (LaSO) paradigm, and I’ll fit it with stochastic gradient descent (SGD). Too bad LaSO isn’t one, or I’d have hit the TLA trifecta. It occurred to me to do this after considering DCA for coref and then reading Daumé and Marcu’s paper:
  • Daumé, Hal III, and Daniel Marcu. 2005. A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model. In HLT/EMNLP.

    (Hint to authors: anything you call “large scale” today will look tiny tomorrow.) The paper does a really nice job of describing their LaSO technique and demonstrating its utility for within-document coreference resolution. There are other, more general papers on LaSO, but this one’s nice in that it works through a clean applied example involving within-document coreference that we care about here at Alias-i. What’s really cool is that they jointly solve the entity extraction and coref problems, and get to use arbitrary history-based features.

    LaSO

    The LaSO paradigm is simply one of scoring various search states. You extract feature vectors from a search state and input. At a given search state, you can generate a finite number of next possible search states consistent with the input. These are then scored using the feature vectors. The trick is in the training, where they consider a single deviation from the correct search path at every step. They train these decisions discriminatively. It’s more general than the kind of history-based parsing in:
  • Black, Ezra, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer and Salim Roukos. 1992. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing. In HLT.

    that I used in my very first paper on stat NLP:
  • Manning, Christopher D. and Bob Carpenter. 1997. Left-corner Language Models and Parsing. In IWPT.

    In the usual history-based parsing setup, there was a fixed set of choices corresponding to parser operations. For instance, Manning and I used left-corner derivations and Black et al. used top-down derivations. In the LaSO setup, there can be an arbitrary number of choices. For instance, when considering to which antecedent a pronoun refers, the number of antecedents isn’t bounded in advance.

    DCA + LaSO

    Daumé and Marcu consider two non-probabilistic scoring approaches, a perceptron-like approach and an approximate wide-margin approach. Given the setup, we can model the finite set of choices from any given search state using DCA. This in turn provides a proper probabilistic measure over the entire set of possible outputs. I’m not sure it’s going to work well in practice because it’ll locally normalize all the decisions. If you look at Andrew McCallum’s work on CRF-like approaches to structured prediction problems like coref, or if you look at Daumé, Marcu and Langford’s extension of LaSO to SEARN, one might think about normalizing on a more global level. The problem, as usual, is the normalization constant. On the other hand, considering Ratinov and Roth’s experiments with entity detection, it seems like it’s worth pursuing the simple greedy (or more generally beam-based) approach.
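    One way to write the resulting local model: the probability of choosing candidate next state c from search state s is a multinomial logit over the (finite, state-dependent) choice set, with beta the coefficient vector and phi(s,c) the extracted feature vector (my notation, not the paper’s):

        p(c \mid s)
          \;=\;
          \frac{\exp\!\left( \beta^{\top} \phi(s, c) \right)}
               {\sum_{c' \in \mathrm{choices}(s)} \exp\!\left( \beta^{\top} \phi(s, c') \right)}

    Multiplying these local probabilities along a greedy or beam-searched path is exactly the local normalization worried about above.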

    Fitting with SGD

    It’s easy to generate maximum a posteriori (MAP) coefficient estimates for the DCA model using stochastic gradient descent (SGD). In fact, it’s just like fitting other DCA models. Not coincidentally, I finished unit-testing DCA fitting via SGD yesterday. Look for it in the next LingPipe release. I’ll either figure out how to abstract the entire search, like LaSO, or just provide the DCA fitter and a coref-based example.

LDA Clustering for Gene Pathway Inference

    Genetic activation is hierarchically structured at many different levels. Tight clusters of associated genes are co-activated to construct molecular machines or participate in genetic pathways. Biologists spend considerable time trying to work out these pathways down to the rate-equation level. More abstract interaction graphs are available from the KEGG Pathway Database, the web page for which describes it as “wiring diagrams of molecular interactions, reactions, and relations.” I’ve blogged previously about inferring gene or mRNA transcription levels. Now suppose you have several tissue samples from the same organism (e.g. from different cells or at different developmental stages, or just replicates of the same experiment). These might be mapped to genes, to splice variants, or to individual exons. What we’d like to be able to do is start making inferences about genetic pathways and their activation.

    The Simplest Model: LDA

    I wrote the simplest possible model down, stepped back, and realized it was just latent Dirichlet allocation (LDA), where “documents” are samples and “words” are mapped reads. I really like BUGS notation for models because it combines the ease of a graphical model presentation with the precision of a programming language. Note that the expression parameters theta and phi are scaled as ratios over reads. These values may be normalized by gene (or variant or exon) length (taking into account the vagaries of the mapping program) to molar units (by analogy with the number of reads being like weight). These models are really easy to fit and really easy to scale. As output, we get an estimate of the joint posterior distribution p(theta,phi|y,alpha,beta) over pathway expression and pathway composition (we just marginalize out the nuisance parameter t in the usual way, by summation). LingPipe’s implementation is explained with sample code in:

  • LingPipe: Clustering Tutorial

    The inputs are the hyperparameters alpha (the dimension of which determines the number of pathways T) and beta (determining the number of genes K), and the (ragged) input matrix y (determining the number of samples S and reads per sample R[s]). The output is a sequence of Gibbs samples. Each sample contains a value for all of the parameters: the source t of each read, the gene-in-pathway expressions phi, and the pathway-in-sample values theta. We can just discard t to get a simulation of the posterior p(theta,phi|y,alpha,beta).
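    In sampling notation, this is standard LDA under the identifications above (samples for documents, mapped reads for words, pathways for topics), using the symbols named in the surrounding text:

        \theta_s \sim \mathrm{Dirichlet}(\alpha)         \quad s \in 1{:}S
        \phi_j   \sim \mathrm{Dirichlet}(\beta)          \quad j \in 1{:}T
        t_{s,r}  \sim \mathrm{Discrete}(\theta_s)        \quad s \in 1{:}S, \; r \in 1{:}R[s]
        y_{s,r}  \sim \mathrm{Discrete}(\phi_{t_{s,r}})  \quad s \in 1{:}S, \; r \in 1{:}R[s]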

    Reasoning with Samples

    [223002760240] |As I discussed in our tutorial (and Griffiths and Steyvers discuss in the paper cited), the cluster assignments themselves in LDA are not identifiable. [223002760250] |What we do get is a kind of random effects mixture model of correlation among genes. [223002760260] |On the plus side, we can easily carry out multiple comparisons or other plug-in inferences from any derived sample statistics (such as pairwise or groupwise gene co-activation). [223002760270] |Given that we’re usually using clustering like LDA as part of exploratory data analysis, we can just look at samples and topics and see what pops up. [223002760280] |And if we have ground truth, as supplied at least partially by KEGG, we can look at whether or not known pathways emerge as “topics”. [223002760290] |
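    To make the plug-in idea concrete, here’s a small sketch (hypothetical interfaces; LingPipe’s actual LDA sample API is documented in the tutorial linked above): the posterior expectation of any derived statistic is just its average over the Gibbs samples.

import java.util.List;

// Hypothetical container for one Gibbs draw of the LDA parameters.
interface GibbsSample {
    double[][] theta(); // theta[s][t]: pathway t proportion in sample s
    double[][] phi();   // phi[t][k]: gene k proportion in pathway t
}

class PlugInInference {

    // Any derived statistic f(theta,phi), e.g. a pairwise gene co-activation score.
    interface Statistic {
        double apply(double[][] theta, double[][] phi);
    }

    // Posterior mean of the statistic is its average over the Gibbs samples.
    static double posteriorMean(List<GibbsSample> samples, Statistic f) {
        double sum = 0.0;
        for (GibbsSample sample : samples)
            sum += f.apply(sample.theta(), sample.phi());
        return sum / samples.size();
    }

    // Example: how strongly genes g1 and g2 co-occur in pathway t under one draw.
    static Statistic coActivation(final int t, final int g1, final int g2) {
        return new Statistic() {
            public double apply(double[][] theta, double[][] phi) {
                return phi[t][g1] * phi[t][g2];
            }
        };
    }
}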

    Joint Inference with Mapping

    [223002760300] |Perhaps the most pleasant feature of Bayesian inference is the ease with which it lets you combine models to perform joint inference at multiple levels or over multiple domains. [223002760310] |Because all variable quantities are treated as first-class random variables, and reasoning is carried out by integrating uncertainty, models may be plugged together hierarchically whenever the components fit. [223002760320] |The y variables inferred by the RNA-Seq expression model would be the y variables assumed as data here. [223002760330] |Very convenient. [223002760340] |Inference is the same, at least in theory — there are just more variables to sample. [223002760350] |The big difference is that now y has multiple constraints, so its conditional sampling distribution has to be worked out and simulated. [223002760360] |BUGS does this kind of thing automatically. [223002760370] |I’m still struggling with Casella and Robert’s Monte Carlo Statistical Methods. [223002760380] |

    Hierarchical Model of Priors

    [223002760390] |Just as I did for the hierarchical Bayesian model of batting ability and the hierarchical model of categorical data coding, with enough samples S that are in some sense exchangeable, it’s possible to estimate the priors alpha or beta in models like these. [223002760400] |All the usual benefits accrue, like smoothed multiple comparisons. [223002760410] |We just put a diffuse “hyperprior” on alpha and/or beta and include sampling for those variables. [223002760420] |(Again, a bit easier said than done for those still working out all these simulations.) [223002760430] |But again, it’s easy in BUGS. [223002760440] |

    Better Priors than Dirichlet

    [223002760450] |The Dirichlet prior only has a number of parameters equal to the number of dimensions and thus cannot fully characterize covariance of expected expression. [223002760460] |A more reasonable model might be a multivariate normal on the logit scale. [223002760470] |This is what Andrew Gelman suggested when I asked a while ago, and it’s only just sinking in. [223002760480] |Another problem with the Dirichlet (and multivariate normal) priors is that they fix the number of clusters ahead of time. [223002760490] |It’d be possible to plug in something like a Dirichlet process prior here. [223002760500] |I’m not sure if the model it entails is reasonable. [223002760510] |I’ll have a better idea after fitting some data. [223002760520] |
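    For what it’s worth, here’s roughly what the logit-scale multivariate normal (the logistic normal) would look like for the sample-level expression parameters, in my notation; the covariance matrix Sigma is exactly what the Dirichlet can’t express, and the same construction could replace the Dirichlet over gene-in-pathway expression:

    \[
    \eta_s \sim \mbox{MultiNormal}(\mu, \Sigma),
    \qquad
    \theta_{s,t} = \frac{\exp(\eta_{s,t})}{\sum_{t'=1}^{T} \exp(\eta_{s,t'})}.
    \]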

    Not Just Theory

    [223002760530] |Well, OK, this post is just theory. [223002760540] |But Mitzi’s already run the expression models I coded based on the previous post, and next up, I’ll try to convince her to run LDA on the data. [223002760550] |When we have something ready to release to the public (warning: biologists tend to keep their data close), I’ll announce it on the blog. [223002760560] |

    Prior Art?

    [223002760570] |I’m guessing this (or at least the non-Bayesian version of it) has been done, but I don’t know how to look it up (there’s also confusion with “linear discriminant analysis” and “low density array”). [223002760580] |And there’s lots of relevant stuff related to other factor models out there, which were popular for analyzing micro-array data. [223002760590] |Feel free to post relevant references in the comments. [223002760600] |The only things I could find (albeit after only half an hour or so of searching) are: [223002760610] |
  • Barnett, John and Tommi Jaakkola. 2007. [223002760620] |Topic Models for Gene Expression Analysis. [223002760630] |Web page.
  • Gerber, Georg et al. 2007. [223002760650] |Automated Discovery of Functional Generality of Human Gene Expression Programs. [223002760660] |PLoS Comp. Bio.
  • Flaherty, Patrick et al. 2005. [223002760690] |A Latent Variable Model for Chemogenomic Profiling. [223002760700] |Bioinformatics.
    [223002760710] |Also, the Stephens Lab at U. Chicago (which brought you the awesome Marioni et al. paper on RNA-Seq expression using Poisson GLMs) mentions LDA, but in the context of population genetics. [223002770010] |Function Input/Output Polarity and Java Wildcard Generics: extends vs. super [223002770020] |One of the things people seem to find confusing is the distinction in wildcard or captured generics between extends and super. [223002770030] |

    Wild Card extends

    [223002770040] |One issue with Java generics is that even though String extends CharSequence, List<String> does not extend List<CharSequence>. [223002770050] |The most specific variable type to which we can assign either list type is List<? extends CharSequence>, which is extended by both List<String> and List<CharSequence>. [223002770060] |We can assign an object of type List<E> to a variable with static type List<? extends CharSequence> as long as E extends CharSequence. [223002770070] |
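    Here’s what the extends case looks like in code (my example, not from the original post; the commented-out lines are the ones that won’t compile):

import java.util.ArrayList;
import java.util.List;

public class ExtendsDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<String>();
        strings.add("hello");
        List<CharSequence> charSeqs = new ArrayList<CharSequence>();

        // List<CharSequence> bad = strings;        // won't compile
        List<? extends CharSequence> ok1 = strings;  // fine: String extends CharSequence
        List<? extends CharSequence> ok2 = charSeqs; // fine

        CharSequence cs = ok1.get(0);    // reading is fine; elements are CharSequences
        // ok1.add("world");             // won't compile; actual element type unknown
        System.out.println(cs.length());
    }
}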

    Wild Cards super

    [223002770080] |The super keyword moves in the inheritance hierarchy in the opposite direction from extends. [223002770090] |A variable of type List<? super String> may be assigned an object of type List<E> if E is a supertype of String (equivalently, String extends E). [223002770100] |Thus a variable of type List<? super String> may be assigned an object of type List<String> or of type List<CharSequence>. [223002770110] |What about List<StringBuilder> (hint: StringBuilder implements CharSequence)? [223002770120] |A variable of type List<? super String> may be assigned a List<CharSequence>, but not a List<StringBuilder>, because StringBuilder extends CharSequence (by implementing it), but StringBuilder is not a supertype of String. [223002770130] |
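    And the corresponding picture for super (again my example; the commented-out line is the one that won’t compile):

import java.util.ArrayList;
import java.util.List;

public class SuperDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<String>();
        List<CharSequence> charSeqs = new ArrayList<CharSequence>();
        List<Object> objects = new ArrayList<Object>();
        List<StringBuilder> builders = new ArrayList<StringBuilder>();

        List<? super String> ok1 = strings;   // String is (trivially) a supertype of String
        List<? super String> ok2 = charSeqs;  // CharSequence super String
        List<? super String> ok3 = objects;   // Object super String
        // List<? super String> bad = builders;  // won't compile: StringBuilder is not a supertype of String

        ok2.add("hello");        // writing Strings is fine: the list holds some supertype of String
        Object o = ok2.get(0);   // but reads only come back as Object
        System.out.println(o);
    }
}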

    Function Inputs and Outputs

    [223002770140] |Suppose we have a simple interface, Foo<A,B>, with two generic arguments, A for input, and B for output (a reconstructed sketch of the interface appears at the end of this section). [223002770150] |Functions present a further problem in that their inputs and outputs generalize in different directions. [223002770160] |In our example Foo<A,B>, A is a function input. [223002770170] |A more general function takes a more general input than A. [223002770180] |That means a more general type is Foo<? super A,B>. [223002770190] |It can accept any input to apply() that Foo<A,B> could and more. [223002770200] |But what about outputs? [223002770210] |In order for an output type to be used wherever a B could be used, it must extend B so that it implements everything required for a B. Therefore, a more general result type is Foo<A,? extends B>. [223002770220] |An object assigned to this type is known to provide outputs that at least implement everything a B would. [223002770230] |
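    The interface definition didn’t survive the formatting above, but it was presumably something along these lines (my reconstruction; useFoo and the particular type arguments are made up for illustration):

// One generic input type A and one generic output type B.
interface Foo<A,B> {
    B apply(A x);
}

class FooDemo {
    // A consumer that only needs a Foo which accepts Integers and produces CharSequences.
    static void useFoo(Foo<? super Integer, ? extends CharSequence> foo) {
        // We may pass any Integer in, because the input type is some supertype of Integer,
        CharSequence out = foo.apply(42);
        // and whatever comes out is usable as a CharSequence.
        System.out.println(out.length());
    }

    public static void main(String[] args) {
        // A Foo<Number,String> is more general than a Foo<Integer,CharSequence>:
        // it accepts more inputs (any Number) and its outputs promise more (String).
        Foo<Number,String> f = new Foo<Number,String>() {
            public String apply(Number x) { return "got " + x; }
        };
        useFoo(f);  // fine: Number super Integer, String extends CharSequence
    }
}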

    Function Argument Polarity

    [223002770240] |This is related to the polarity of a type variable in a functional type. [223002770250] |Basic types are positive, and arguments to functions reverse polarity. [223002770260] |For example, if A and B are primitive types, for a function from A to B (of type A->B), A is negative polarity and B positive polarity. [223002770270] |Multiple arguments are all negative, so A->(B->C) has C as positive and A and B as negative. [223002770280] |Compound arguments further reverse polarity, so in (A->B)->C, C is positive, and (A->B) is negative, so B is negative, and A is positive again. [223002770290] |In general, you want to extend positive polarity arguments and super negative polarity arguments in order to be able to use the compound object wherever the original object could be used. [223002770300] |

    What if a Generic Parameter is Input and Output?

    [223002770310] |That’s actually the case with List<E>, because the get() method returns an E result, whereas the add() method consumes an E argument. [223002770320] |Thus there’s really no way to generalize the type of a collection. [223002770330] |If you generalize List<E> by subclassing E (that is, List<? extends E>), you restrict the set of inputs; if you generalize it by superclassing E (List<? super E>), you wind up with more general outputs that aren’t guaranteed to implement everything E does. [223002790010] |Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application [223002790020] |Joshua Mason emailed us with a link to his (with a bunch of co-authors) recent ACM paper “English Shellcode” (http://www.cs.jhu.edu/~sam/ccs243-mason.pdf). [223002790030] |Shellcode attacks attempt to seize control of a computer by masquerading as data. [223002790040] |The standard defense is to look for tell-tale patterns in the data that reflect the syntax of assembly language instructions. [223002790050] |It is sort of like spam filtering. The filter would have to reject strings that looked like: [223002790060] |“7IIQZjJX0B1PABkBAZB2BA2AA0AAX8BBPux” [223002790070] |which would not be too hard if you knew to expect language data. [223002790080] |Mason et al. changed the code generation process so that lots of variants of the injection are tried but filtered against a language model of English based on the text of Wikipedia and Project Gutenberg. The result is an injection attack that looks like: [223002790090] |“There is a major center of economic activity, such as Star Trek, including The Ed Sullivan Show. [223002790100] |The former Soviet Union.” [223002790110] |This is way better than I would have thought possible and it is going to be very difficult to filter. [223002790120] |It would be interesting to see how automatic essay grading software would score the above. [223002790130] |It is gibberish, but sophisticated-sounding gibberish. [223002790140] |And it used LingPipe for the language processing. [223002790150] |I am a firm believer in the white hats publicizing exploits before black hats deploy them surreptitiously. [223002790160] |This one could be a real problem, however. [223002790170] |Breck [223002800010] |Yahoo!’s Learning to Rank Challenge [223002800020] |[This is a refactoring of the material that should have been broken out from yesterday's discussion of the expected reciprocal rank evaluation metric.] [223002800030] |Yahoo! is hosting an online Learning to Rank Challenge. [223002800040] |Sort of like a poor man’s Netflix, given that the top prize is US$8K. [Update: I clearly can't read. [223002800050] |As Olivier Chapelle, one of the organizers, points out, the rules clearly state that "If no member of a winning Team is able to attend, a representative of the Sponsor will give the presentation on the winning Team's behalf." [223002800060] |Haifa's an interesting city, especially if you can hang out with the natives at the Technion, but it's an expensive trip in both time and money from the US.] [223002800070] |Like Netflix, this is an incredibly rich data set tied to a real large-scale application. [223002800080] |I can imagine how much editorial time went into creating it. [223002800090] |There are over 500K real query relevance judgments! [223002800100] |

    Training Data

    [223002800110] |The supervised training data consists of 500K or so pairs of queries and documents represented as 750-dimensional vectors, along with provided “editorial grades” of relevance from 0 (completely irrelevant) to 4 (a near-perfect match). [223002800120] |There’s also a smaller set of 20K or so pairs with some overlapping and some new features to evaluate adaptation. [223002800130] |Unfortunately, there’s no raw data, just the vectors. [223002800140] |

    Test Data

    [223002800150] |The contest involves ranking a set of documents with respect to a query, where the input is the set of document-query vectors. [223002800160] |An evaluation set consists of a number of query-document pairs represented as vectors, where there may be any number of potential documents for each query. [223002800170] |The systems must return rankings of the documents for each query. [223002800180] |

    Binary Logistic Regression for Ranking

    [223002800190] |In case you were curious, I’m going to tackle the learning to rank problem by generalizing LingPipe’s logistic regression models to allow probabilistic training. [223002800200] |Then I can train editorial grade stopping probabilities directly. [223002800210] |If I can predict those well, I can score well at ranking. [223002800220] |I don’t see any advantage to training something like DCA compared to a binary logistic regression on a per-example basis here. [223002800230] |I’m not sure which regression packages that scale to data this size allow probabilistic training. [223002800240] |It could always be faked by making an even bigger data set with positive and negative items. [223002800250] |

    No IP Transfer, But a Non-Compete

    [223002800260] |I’d like to be able to play with these data sets and publish about them and release demos based on them without transferring any intellectual property to a third party other than a publication and rights to publicity. [223002800270] |IANAL, but I don’t think I can do that with this data. [223002800280] |We still haven’t gone through the legal fine print, nor is it clear it will be worth our while to do so in the end. [223002800290] |It’s expensive to run this stuff by lawyers to look for gotchas. [223002800300] |And my own read makes it seem like a no-go. [223002800310] |It doesn’t appear from a quick read that they require the winners to give them the algorithm. [223002800320] |They do require the winners have the rights to their submission. [223002800330] |And they will take possession of copyright on submissions (presumably not including the algorithm itself). [223002800340] |This is great if I’m reading it right. [223002800350] |On the other hand, the non-compete clause (4a) immediately jumped out at me as a deal breaker: [223002800360] |THE TEAM WILL NOT ENTER THE ALGORITHM NOR ANY OUTPUT OR RESULTS THEREOF IN ANY OTHER COMPETITION OR PROMOTION OFFERED BY ANYONE OTHER THAN THE SPONSOR FOR ONE YEAR AFTER THE CONCLUSION OF THE CONTEST PERIOD; [223002800370] |Not only are they shouting, I’m not sure what constitutes “the algorithm”. [223002800380] |It’d be absurd to tie up LingPipe’s logistic regression by entering a contest. [223002800390] |Yet it’s not clear what more there is to “THE ALGORITHM”. [223002800400] |And if we release something as part of LingPipe, our royalty-free license lets people do whatever they want with it. [223002800410] |So I don’t see how we could comply. [223002800420] |The other issue that’s problematic for a company is (4g): [223002800430] |agrees to indemnify and hold the Contest Entities and their respective subsidiaries, affiliates, officers, directors, agents, co-branders or other partners, and any of their employees (collectively, the “Contest Indemnitees”), harmless from any and all claims, damages, expenses, costs (including reasonable attorneys’ fees) and liabilities (including settlements), brought or asserted by any third party against any of the Contest Indemnitees due to or arising out of the Team’s Submissions or Additional Materials, or any Team Member’s conduct during or in connection with this Contest, including but not limited to trademark, copyright, or other intellectual property rights, right of publicity, right of privacy and defamation. [223002800440] |One doesn’t like to put one’s company on the line for a contest that’ll at most net US$8K (less travel and conference registration) and minor publicity. [223002810010] |LingPipe 3.9.1 Released [223002810020] |We’re happy to announce the release of LingPipe 3.9.1, which is available for download from: [223002810030] |
  • LingPipe Home Page
    [223002810040] |The home page contains the major release notes, the highlights of which are discrete choice analysis models and more deprecations and consolidation of interfaces. [223002810050] |Let us know if you have questions or run into any problems. [223002810060] |

    The Path to LingPipe 4.0

    [223002810070] |We’re now working on LingPipe 4.0, which will remove all of the classes and methods marked as deprecated in 3.9.1. [223002810080] |If you manage to compile without deprecation warnings in 3.9.1, you should be able to compile in LingPipe 4.0. [223002820010] |Stochastic Gradient Descent for Probabilistic Training Data [223002820020] |I’ve been talking for a while about using the Bayesian hierarchical models of data annotation to generate probabilistic corpora. [223002820030] |It’s really easy to extend stochastic gradient descent (SGD) optimization algorithms to handle probabilistic corpora. [223002820040] |I’m going to consider maximum likelihood estimation for binary logistic regression, but the same thing can be done for conditional random field (CRF) taggers, support vector machine (SVM) classifiers, or perceptron classifiers. [223002820050] |I’ll discuss priors and multinomials after the main algorithm. [223002820060] |

    The Training Data

    [223002820070] |Suppose we have $N$ labeled training instances $(x_1, y_1), \ldots, (x_N, y_N)$, each consisting of a $D$-dimensional input vector $x_n$ and a probabilistic outcome $y_n \in [0,1]$. [223002820080] |A non-probabilistic corpus constrains $y_n$ to be either $0$ or $1$. [223002820090] |

    The Model

    [223002820100] |Binary logistic regression for $D$-dimensional inputs is parameterized by a $D$-dimensional vector $\beta$. [223002820110] |The probability of a 1 outcome for a $D$-dimensional input $x$ is [223002820120] |$p(y = 1 \mid x, \beta) = \mathrm{logit}^{-1}(\beta^\top x)$, [223002820130] |where [223002820140] |$\mathrm{logit}^{-1}(z) = 1 / (1 + \exp(-z))$. [223002820150] |

    Maximizing Log Likelihood

    [223002820160] |Our goal’s to find the parameter $\beta$ that maximizes the training data log likelihood [223002820170] |$\log p(y \mid x, \beta) = \sum_{n=1}^{N} \big(\, y_n \log \mathrm{logit}^{-1}(\beta^\top x_n) + (1 - y_n) \log \big(1 - \mathrm{logit}^{-1}(\beta^\top x_n)\big) \big)$. [223002820180] |The maximum likelihood estimate is thus [223002820190] |$\beta^* = \arg\max_{\beta} \log p(y \mid x, \beta)$. [223002820200] |

    Stochastic Gradient Descent

    [223002820210] |SGD finds the maximum likelihood estimate $\beta^*$ given training data $(x_1, y_1), \ldots, (x_N, y_N)$. [223002820220] |For simplicity, we’ll only consider a fixed learning rate ($\eta > 0$), though the algorithm obviously generalizes to annealing schedules or other learning rate enhancement schemes. [223002820230] |We’ll let $E$ be the number of epochs for which to run training, though you could also measure convergence in the algorithm. [223002820240] |Easy peasy, as they say in the old country. [223002820250] |
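    Here’s a sketch of the update loop in code, just to show how little changes for probabilistic outcomes: the per-instance gradient contribution is (y[n] - p) * x[n], exactly as for 0/1 outcomes, except that y[n] may now be any value in [0,1]. (The names are mine; no priors, no annealing, and nothing LingPipe-specific.)

// Minimal SGD for binary logistic regression with probabilistic outcomes y[n] in [0,1].
public class ProbabilisticSgd {

    static double[] fit(double[][] x, double[] y, double learningRate, int numEpochs) {
        int numDims = x[0].length;
        double[] beta = new double[numDims];
        for (int epoch = 0; epoch < numEpochs; ++epoch) {
            for (int n = 0; n < x.length; ++n) {
                double p = inverseLogit(dotProduct(beta, x[n]));
                double error = y[n] - p;  // same update as 0/1 data; y[n] just isn't restricted to {0,1}
                for (int d = 0; d < numDims; ++d)
                    beta[d] += learningRate * error * x[n][d];
            }
        }
        return beta;
    }

    static double dotProduct(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    static double inverseLogit(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }
}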

    Priors

    [223002820260] |Priors don’t actually interact with the probabilistic training. [223002820270] |So you just add them in as usual. [223002820280] |I’d recommend either blocking them (as I did for CRFs in LingPipe) to update every few hundred training instances, or using lazy updates (as I did for logistic regression in LingPipe). [223002820290] |

    Multinomial Classifiers

    [223002820300] |It’s also easy to generalize to multinomials, where the data consists of a probability for each of K outcomes. [223002820310] |You just add an inner loop over the dimensions (minus 1 or not depending on whether you use K or K-1 parameter vectors for a K-dimensional problem), and the error term is calculated versus the probability for that outcome. [223002830010] |Finite-State Queries in Lucene [223002830020] |Otis just posted Robert Muir‘s presentation to the NY Lucene Meetup: [223002830030] |
  • Robert Muir. [223002830040] |Finite State Queries in Lucene. [223002830050] |

    What’s it Do?

    [223002830060] |Read the slide show. [223002830070] |It’s really nicely presented. [223002830080] |The idea’s simple for us automata hackers. [223002830090] |The set of terms making up the keys of the core term-document (reverse) index is used as a suffix array (i.e. a compact representation of a trie). [223002830100] |A query is then converted into a finite state automaton which is intersected with the trie implicitly represented by the suffix array. [223002830110] |This is much, much faster than doing an edit distance comparison with every term, because the prefix computations are shared. [223002830120] |If you put a bound on the maximum distance, you can reject most terms high up in the trie as you implicitly run the intersection. [223002830130] |It warms my heart to see this kind of nice algorithm so neatly implemented. [223002830140] |The really groovy part is that the suffix array was already there! [223002830150] |
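    Here’s the flavor of the shared-prefix trick in a stripped-down sketch (mine, not Robert Muir’s Lucene code, and walking an explicit trie rather than Lucene’s term dictionary): each trie edge extends one dynamic programming row, rows for shared prefixes are computed only once, and a whole subtree is pruned as soon as the smallest entry in its row exceeds the distance bound.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FuzzyTrieMatch {

    static class Node {
        final Map<Character,Node> children = new TreeMap<Character,Node>();
        boolean isTerm = false;
    }

    static void add(Node root, String term) {
        Node node = root;
        for (int i = 0; i < term.length(); ++i) {
            char c = term.charAt(i);
            Node child = node.children.get(c);
            if (child == null) {
                child = new Node();
                node.children.put(c, child);
            }
            node = child;
        }
        node.isTerm = true;
    }

    // Collect all terms within maxDist edits of the query.
    static List<String> search(Node root, String query, int maxDist) {
        List<String> hits = new ArrayList<String>();
        // Row for the empty candidate prefix: distance to each query prefix is all inserts.
        int[] row = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); ++j)
            row[j] = j;
        for (Map.Entry<Character,Node> entry : root.children.entrySet())
            search(entry.getValue(), String.valueOf(entry.getKey()), entry.getKey(),
                   query, row, maxDist, hits);
        return hits;
    }

    static void search(Node node, String prefix, char c, String query,
                       int[] prevRow, int maxDist, List<String> hits) {
        // Extend the previous DP row by one character of the candidate term.
        int[] row = new int[query.length() + 1];
        row[0] = prevRow[0] + 1; // delete all of the candidate prefix
        int min = row[0];
        for (int j = 1; j <= query.length(); ++j) {
            int subst = prevRow[j - 1] + (query.charAt(j - 1) == c ? 0 : 1);
            int insert = row[j - 1] + 1;
            int delete = prevRow[j] + 1;
            row[j] = Math.min(subst, Math.min(insert, delete));
            min = Math.min(min, row[j]);
        }
        if (node.isTerm && row[query.length()] <= maxDist)
            hits.add(prefix);
        if (min <= maxDist)  // otherwise prune: no extension can get back under the bound
            for (Map.Entry<Character,Node> entry : node.children.entrySet())
                search(entry.getValue(), prefix + entry.getKey(), entry.getKey(),
                       query, row, maxDist, hits);
    }
}

    Lucene’s version intersects a Levenshtein automaton with the term dictionary rather than carrying explicit DP rows, but the pruning effect is the same.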

    Bonus Material

    [223002830160] |As an added bonus, there’s an extended section with hints on how to do term expansion at query time instead of indexing time (though this does affect TF/IDF) and on how to integrate morphological processing (e.g. stemming or lemmatization). [223002830170] |

    Similar Applications

    [223002830180] |LingPipe uses the same kind of algorithm for spell checking, only with an explicit trie with source probabilities and a customizable position-specific edit-distance model. [223002830190] |The spell checker only handles single input strings rather than automata and provides ranked results. [223002830200] |I’m guessing Google’s using this kind of thing all over the place for speech rec, because they hired the guys who wrote the AT&T Finite State Machine Library and went on to build Google’s OpenFst Library. [223002830210] |

    Do it Yourself

    [223002830220] |If you want to learn how to do this kind of thing yourself, you could do much worse than reading my perennial recommendation, Dan Gusfield’s Algorithms on Strings, Trees, and Sequences. [223002830230] |

    BWT Anyone?

    [223002830240] |Luckily, memory will probably keep up with our terms in language indexing. [223002830250] |In biological sequencing applications, where they store billions of different 50-grams, the Burrows-Wheeler-transformed compressed index lets you do the same thing. [223002830260] |The only problem for search is that it doesn’t let you update dynamically (that I know of). [223002840010] |Building High Precision Classifiers, Taggers, Chunkers, Spelling Correctors, … [223002840020] |Not only is F measure hard to optimize, it’s not what customers want. [223002840030] |Neither are area under a precision-recall curve or high average precision stats. [223002840040] |Sure, these might be reasonable proxies, but customers typically choose an operating point for their systems and go with it. [223002840050] |What they really want is a service level agreement (SLA) for precision and/or recall. [223002840060] |

    High Recall

    [223002840070] |We’ve been focused on high recall in our research applications for DARPA (person, location, organization and interaction extraction) and NIH (gene, protein, disease, and relationship extractors). [223002840080] |In these cases, intelligence analysts or biologists are willing to pore through a lot of false positives in order to find a true positive. [223002840090] |

    High Precision

    [223002840100] |When we deal with corporate customers building customer-facing information extraction or classification applications, they almost always want high precision. [223002840110] |Concretely, the goal is almost always: [223002840120] |Build a system with 90% (or 95% or 99%) precision with as much recall as possible. [223002840130] |I’d like to see some bakeoffs with this metric. [223002840140] |Precision is closely related to the frequentist statistical notion of a false discovery rate (FDR). [223002840150] |[Storey and Tibshirani 2003] provides a good discussion of the issues along with some interesting inference techniques in a multiple hypothesis testing framework. [223002840160] |Being a Bayesian, I like hierarchical models plus posterior inference for this job, but that’s not the point of this post. [223002840170] |

    How We Build High Precision Search Systems

    [223002840180] |Suppose we have a search-type information extraction problem. [223002840190] |For instance, we may be looking for protein-protein interactions in text, or spelling corrections without too many false positive suggestions. [223002840200] |Often what we’re looking for has low prevalence (e.g. indications of terrorists interacting with each other in a situation involving money), so we can’t just grab a random sample of data and start annotating. [223002840210] |As usual, there’s a premium on skilled annotators (no, we can’t just use Mechanical Turk, because we’re usually dealing with proprietary corporate databases in these situations), a less than well-defined task (e.g. find complaints about our product), and a huge pile of unlabeled data (usually millions and sometimes billions of cases [or trillions if you're at MS, Yahoo! or Google]). [223002840220] |In the simple case of a two-category classifier, we start with no training data and then iteratively: [223002840230] |
  • Find some positive cases of what we’re looking for. [223002840240] |
  • Build a search-like application to find some more. [223002840250] |
  • Label the highest-probability results from our search as positive or negative.
    [223002840260] |This is very different than most active learning strategies, which typically select instances with high uncertainty to label. [223002840270] |Instead, what we’re doing is labeling the data which the last system was most confident was positive. [223002840280] |This can be dangerous in that the boundary can be hard to detect. [223002840290] |And the coding standard defining the boundary may change. [223002840300] |This isn’t as much of a problem as it may seem, because we always go back and check the data we already have. [223002840310] |Like active learning, the result is a biased data set favoring items that score toward the high end of positiveness in one of the iterations. [223002840320] |Active learning results in a data set focused on the positive/negative boundary. [223002840330] |We’ve never had time to evaluate this strategy on large bodies of data, like say the Reuters corpus. [223002840340] |(Mainly, because we haven’t been working on that particular problem.) [223002840350] |

    Setting Operating Points

    [223002840360] |Now we come to practice and need to choose an operating point. [223002840370] |If we have a probabilistic classifier, we return items that the classifier estimates to be above a certain likelihood of being true. [223002840380] |Clearly if we set that point at 0.90 and our system is very accurate in its estimates, we’ll wind up with higher than 90% precision if the items are distributed across the 0.9 to 1.0 range. [223002840390] |So we often set the threshold lower even when our probabilistic classifier is fairly accurate. [223002840400] |(Logistic regression, by the way, is much more accurate in its probability estimates than naive Bayes because of the way it accounts for dependencies.) [223002840410] |
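    Concretely, choosing the operating point from held-out data is just a sweep down the ranked scores (a sketch only; it ignores score ties, and in practice you’d want an interval estimate of the precision, which is the variance issue discussed below):

import java.util.Arrays;
import java.util.Comparator;

public class OperatingPoint {

    // Returns the lowest score threshold whose held-out precision is at least targetPrecision,
    // maximizing recall subject to the precision constraint; +infinity if nothing qualifies.
    static double chooseThreshold(final double[] scores, boolean[] isPositive,
                                  double targetPrecision) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; ++i)
            order[i] = i;
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return Double.compare(scores[b], scores[a]); // descending by score
            }
        });
        double bestThreshold = Double.POSITIVE_INFINITY;
        int truePositives = 0;
        for (int rank = 0; rank < order.length; ++rank) {
            if (isPositive[order[rank]])
                ++truePositives;
            double precision = truePositives / (double) (rank + 1);
            if (precision >= targetPrecision)
                bestThreshold = scores[order[rank]]; // everything at or above this score qualifies
        }
        return bestThreshold;
    }
}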

    Choosing Operating Point by Utility

    [223002840420] |An even better approach is to assume a utility (equivalent loss) for each true positive, false positive, true negative, and false negative. [223002840430] |Then you can assign a utility to each operating point and maximize the utility rather than just taking a 90% precision operating point. [223002840440] |This is true even if we assign 1 as the utility of a true positive and -9 as the utility of a false positive, and 0 as the utility of false and true negatives. [223002840450] |The 90% precision operating point provides an expected utility of 0 here (for every ten items returned at exactly 90% precision, nine true positives contribute +9 and one false positive contributes -9), and we may be able to do better if we set the operating point higher (thus lowering expected recall but improving precision and hence utility). [223002840460] |Adding utilities to a probability model puts us in the land of statistical decision theory. [223002840470] |

    Plain Old Variance, Over-Dispersion, and Non-Stationarity

    [223002840480] |Garden variety binomial variance is an issue here. [223002840490] |If we don’t have a lot of data, we’ll wind up with a pretty fat posterior on the 90% precision threshold (or utility). [223002840500] |If we set a very conservative estimate here to meet our high precision service level agreement, we’ll potentially lose lots of recall. [223002840510] |Overdispersion is a huge problem. [223002840520] |One of my favorite examples was from one of the MUC entity evaluations. [223002840530] |A single document had several mentions of the planet Mars, which was labeled as a location. [223002840540] |Because the whole document was included, there were way more instances of the phrase “Mars” in the test data than you’d expect given the frequency of the term itself. [223002840550] |A classic case of overdispersion (or “burstiness” as it’s often called in language data; see, e.g., (Jansche 2003) for an overview). [223002840560] |Another major problem is that data in the wild is rarely stationary. [223002840570] |Instead, from hour to hour, day to day, or week to week, the topics slightly or dramatically change. [223002840580] |Systems trained on MUC data from the 1990s don’t reflect today’s distribution of persons, locations or organizations in the news. Similarly, prior data just won’t predict the frequency of the location “Eyjafjallajökull” in this week’s news. [223002850010] |LingPipe 3.9.2 Released [223002850020] |LingPipe Version 3.9.2 is now available from the [223002850030] |
  • LingPipe Home Page.
    [223002850040] |The home page lists the updates. [223002850050] |It’s backward compatible with 3.9.1. [223002860010] |Sequence Alignment with Conditional Random Fields [223002860020] |Having spent hours figuring out how to properly normalize edit probabilities of an edit channel model so they sum to 1 over all reads of a given length given a reference sequence, the answer seems obvious in retrospect. [223002860030] |Use conditional random fields (CRFs), because they can normalize anything and are easy to program. [223002860040] |An added benefit is that biologists are used to thinking of edit costs on a log scale. [223002860050] |Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below). [223002860060] |

    The Sequence Alignment Problem

    [223002860070] |The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts). [223002860080] |The present generation of sequencers (e.g. Illumina or SOLiD) produce tens or even hundreds of millions of reads of 50 bases or so per run. [223002860090] |Typically, simpler models like pure edit distance are computed, using hashing heuristics to narrow search or suffix arrays to share prefix calculation. [223002860100] |The model described here is intended to be run over a narrower set of alignment hypotheses. [223002860110] |

    Generative Expression Models Require Probabilistic Alignments

    [223002860120] |In a previous post, I described a Bayesian hierarchical model of mRNA expression. [223002860130] |That model contains a random variable for each read that is dependent on the sequence from which it is read. [223002860140] |Inference with this model requires us to calculate the probability of a read given a reference sequence. [223002860150] |The model described here directly plugs into the expression model. [223002860160] |Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment. [223002860170] |

    Alignments

    [223002860180] |Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in reference), and deletes (gaps in query/read). [223002860190] |Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence AGTTAGG (below) to the reference sequence ACGTACG (above), with gaps written as dashes: reference A C G T - A C G over read A - G T T A G G. [223002860200] |The first C in the reference is deleted, the second T in the read is inserted, and the second G in the read is substituted for the second C in the reference. [223002860210] |We can represent alignments as sequences of edit operations. [223002860220] |For the alignment above, we have the sequence of eight edits [223002860230] |m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G). [223002860240] |The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment. [223002860250] |As we noted before, there are typically very many ways to align a read to a reference. [223002860260] |Here’s an alternative alignment, pairing reference A C G T A C G with read A G T T A G G position by position, [223002860270] |which corresponds to the following sequence of seven edits [223002860280] |m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G). [223002860290] |

    Read Probabilities from Alignment Probabilities

    [223002860300] |Suppose we have a reference sequence of $N$ bases, $x = x_1, \ldots, x_N$. [223002860310] |The CRF we propose will model $p(a, m \mid x, n)$, the probability of alignment $a$ aligning at position $m$ of reference sequence $x$ and producing a read of length $n$. [223002860320] |We’ll let $\mathrm{read}(a)$ be the read generated by the alignment $a$. [223002860330] |We compute the probability of a read $y$ of length $n$ starting at position $m$ of the reference sequence as [223002860340] |$p(y, m \mid x, n) = \sum_{\{a \,:\, \mathrm{read}(a) = y\}} p(a, m \mid x, n)$. [223002860350] |We can then marginalize out the starting position by summing to calculate the probability of a read, [223002860360] |$p(y \mid x, n) = \sum_{m=1}^{N} p(y, m \mid x, n)$. [223002860370] |

    Linear Basis

    [223002860380] |In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. [223002860390] |We break down the basis into a contribution from edits and starting position. [223002860400] |For edits, we take the potential of an alignment $a = a_1, \ldots, a_J$ to be $\varphi(a) = \sum_{j=1}^{J} \varphi(a_j)$, [223002860410] |where $\varphi(a_j)$ is the potential of the edit operation $a_j$. [223002860420] |Recall that the edits include their edited bases, so we have four match edits m(Z), twelve substitution edits s(Z1,Z2) with Z1 != Z2, four insert edits i(Z) and four delete edits d(Z), for a total of 24 edit operations. [223002860430] |So $\varphi$ provides a value in $\mathbb{R}$ for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights. [223002860440] |For starting positions, we take a potential function $\psi(m, x)$ for starting an alignment at position $m$ of reference sequence $x$. [223002860450] |The probability of an alignment $a$ at a starting position $m$ is defined to be proportional to the exponentiated sum of potentials, [223002860460] |$p(a, m \mid x, n) \propto \mathbb{I}(\mathrm{len}(\mathrm{read}(a)) = n) \, \exp(\varphi(a) + \psi(m, x))$, [223002860470] |where we take $\mathbb{I}(\mathrm{len}(\mathrm{read}(a)) = n)$ to be an indicator variable with value $1$ if $\mathrm{len}(\mathrm{read}(a)) = n$ and $0$ otherwise. [223002860480] |This makes sure our distribution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable. [223002860490] |Normalization is straightforward to calculate in theory. [223002860500] |To compute normalized probabilities, we take [223002860510] |$p(a, m \mid x, n) = \mathbb{I}(\mathrm{len}(\mathrm{read}(a)) = n) \, \exp(\varphi(a) + \psi(m, x)) / Z(x, n)$, where [223002860520] |$Z(x, n) = \sum_{m} \sum_{a} \mathbb{I}(\mathrm{len}(\mathrm{read}(a)) = n) \, \exp(\varphi(a) + \psi(m, x))$. [223002860530] |As they say in the old country, Bob’s your uncle. [223002860540] |

    Sum-Product Algorithm for Read Probabilities

    [223002860550] |It’s just the usual sum-product algorithm for CRFs. [223002860560] |It’s just like the usual Smith-Waterman style edit algorithm except that (1) we must consume all the read, (2) start costs are given by $\psi(m, x)$, and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow. [223002860570] |
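    Here’s a sketch of the forward pass for the unnormalized read score, in code (my own sketch and notation: matchOrSubstitute/insert/delete stand in for the 24 edit potentials, psi holds the start-position potentials, trailing deletes are allowed for simplicity, and computing the normalizer Z(x,n) would take a second, similar pass summing over all reads of length n):

import java.util.Arrays;

public class ReadForward {

    // Log of the unnormalized probability of read y given reference x:
    // the log sum over start positions m and alignments a with read(a) = y
    // of exp(sum of edit potentials + start potential psi[m]).
    // psi[m] is defined for m = 0..x.length, so psi.length == x.length + 1.
    static double logUnnormalized(char[] x, char[] y, double[] psi, EditPotentials phi) {
        int N = x.length;
        int n = y.length;
        // alpha[i][j] = log sum over alignment prefixes that have emitted y[0..i)
        // and consumed the reference up to (exclusive) position j, for some start m <= j.
        double[][] alpha = new double[n + 1][N + 1];
        for (int i = 0; i <= n; ++i)
            Arrays.fill(alpha[i], Double.NEGATIVE_INFINITY);
        for (int j = 0; j <= N; ++j) {
            alpha[0][j] = psi[j];  // start the alignment at reference position j
            if (j > 0)             // or arrive here via a leading delete
                alpha[0][j] = logSumExp(alpha[0][j], alpha[0][j - 1] + phi.delete(x[j - 1]));
        }
        for (int i = 1; i <= n; ++i) {
            for (int j = 0; j <= N; ++j) {
                double cell = alpha[i - 1][j] + phi.insert(y[i - 1]);  // insert read base
                if (j > 0) {
                    cell = logSumExp(cell,
                        alpha[i - 1][j - 1] + phi.matchOrSubstitute(x[j - 1], y[i - 1]));
                    cell = logSumExp(cell, alpha[i][j - 1] + phi.delete(x[j - 1]));  // delete ref base
                }
                alpha[i][j] = cell;
            }
        }
        double total = Double.NEGATIVE_INFINITY;
        for (int j = 0; j <= N; ++j)
            total = logSumExp(total, alpha[n][j]);  // read fully emitted, any end position
        return total;
    }

    interface EditPotentials {
        double matchOrSubstitute(char refBase, char readBase);
        double insert(char readBase);
        double delete(char refBase);
    }

    static double logSumExp(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double max = Math.max(a, b);
        return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    }
}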

    Sum-Product for Normalization

    [223002860580] |To compute the normalizing term $Z(x, n)$, we run a similar sum-product algorithm, only with an extra sum for all the possible reads. [223002860590] |

    Viterbi Algorithm for Best Alignments

    [223002860600] |With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. [223002860610] |We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments. [223002860620] |

    Affine Gap Basis and Edit Context Sensitivity

    [223002860630] |It’s easy to take affine gaps into account. [223002860640] |Just condition the edit operations on the previous edit operation. [223002860650] |You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment. [223002860660] |It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. [223002860670] |This is also easy to dynamic program, though adding context adds to space and time requirements. [223002860680] |

    Modeling Uneven Reads

    [223002860690] |Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at the 3′ and 5′ ends or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming). [223002860700] |These effects are not small. [223002860710] |[Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the Illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. [223002860720] |Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. [223002860730] |(I sure wish there were variances shown in the Wang et al. graphs.) [223002860740] |

    Modeling Input Quality

    [223002860750] |We can also add input quality to our calculations if it’s on a log scale. [223002860760] |Then we have to select which base was actually read, making it more like the use of CRFs for tagging. [223002860770] |For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. [223002860780] |As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms. [223002870010] |McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance [223002870020] |Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. [223002870030] |I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. [223002870040] |(So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.) [223002870050] |
  • McCallum, Andrew, Kedar Bellare and Fernando Pereira. 2005. [223002870060] |A conditional random field for discriminatively-trained finite-state string edit distance. [223002870070] |In UAI.
    [223002870080] |I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. [223002870090] |But once you grok CRFs, building new models is easy, because as I said last time, CRFs are really good at normalizing sequential labeling or classification problems. [223002870100] |

    Applications

    [223002870110] |McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. [223002870120] |In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing). [223002870130] |

    What’s Being Modeled?

    [223002870140] |McCallum et al.’s goal was to model the probability that two strings match. [223002870150] |This reduces the problem to a binary classification problem. [223002870160] |In contrast, what I was modeling in the last post is the probability of a query/read sequence being generated from a reference sequence. [223002870170] |The reads are more like the tags generated in a tagging CRF. [223002870180] |The alignment is a local alignment of the read to the reference. [223002870190] |We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. [223002870200] |They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence. [223002870210] |The difference is in normalization terms, both of which are easy to calculate via dynamic programming. [223002870220] |As they point out in their discussion, what they did is to pair HMMs (i.e. as used by [Ristad and Yianilos 1998]), as CRFs are to regular HMMs. [223002870230] |I turned a similar crank to estimate a related probability. [223002870240] |

    Other Innovations

    [223002870250] |What I really like about McCallum et al.’s approach is that it allows chunked matching operations. [223002870260] |Although they don’t consider affine gap models (where longer gaps aren’t that much more unlikely than shorter gaps), they do consider token-sensitive operations like deleting a token that appears in both strings or in a dictionary, and consider deleting to the end of the current token (useful for matching acronyms). [223002870270] |The possibilities are endless. [223002870280] |With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. [223002870290] |The FST coding is straightforward, with the only problem being the usual one of explosion of dependencies. [223002870300] |Neatly (for both McCallum et al. and me), the sum-product (forward/back) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding. [223002870310] |They don’t allow non-local transpositions (as in, say a soft TF/IDF-like approach to matching), so everything remains reasonably tractable. [223002870320] |They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). [223002870330] |I’m pretty sure I’ll have the same problem. [223002870340] |They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. [223002870350] |This allows them to have separate features as in a max entropy or discrete choice (DCA) model and unlike in a logistic regression-type approach where there are no outcome-specific features. [223002870360] |Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, like is a numerical character substituted for another numerical character. [223002870370] |Curiously, there are no features for the actual characters being matched, just classes. [223002870380] |Maybe they didn’t have enough data. [223002870390] |What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads). [223002870400] |

    Direct Regression

    [223002870410] |(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. [223002870420] |These features were things like n-grams. [223002870430] |This also sounds like a good idea. [223002870440] |I’m guessing the error patterns would be dissimilar enough that the two classifiers could be fruitfully combined. [223002870450] |The Tsuruoka model would be easy to implement in LingPipe. [223002870460] |I had to do a custom implementation for my own model decoder and haven’t built the estimator yet. [223002870470] |
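    A sketch of the flavor of feature extraction involved (my guess at the style, not Tsuruoka et al.’s actual feature set): character n-grams of each string plus overlap counts, which could then be fed to LingPipe’s logistic regression or any other binary classifier.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StringPairFeatures {

    // Character n-gram features for a string pair, for a match/no-match classifier.
    static Map<String,Double> features(String s1, String s2, int n) {
        Map<String,Double> feats = new HashMap<String,Double>();
        Set<String> grams1 = ngrams(s1, n);
        Set<String> grams2 = ngrams(s2, n);
        for (String gram : grams1)
            feats.put("1:" + gram, 1.0);
        for (String gram : grams2)
            feats.put("2:" + gram, 1.0);
        int shared = 0;
        for (String gram : grams1)
            if (grams2.contains(gram))
                ++shared;
        feats.put("sharedCount", (double) shared);
        feats.put("sharedProportion",
                  grams1.isEmpty() ? 0.0 : shared / (double) grams1.size());
        return feats;
    }

    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); ++i)
            grams.add(s.substring(i, i + n));
        return grams;
    }
}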

    Further Applications and Generalizations

    [223002870480] |I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers for switching to camel case from seemingly random case). [223002870490] |You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. [223002870500] |The discriminative training would be a huge help to sort out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc. [223002870510] |McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. [223002870520] |I’m surprised now that I haven’t seen more work building on this. [223002870530] |It’d also be easy to take any of the pair HMM applications to things like phonology or morphology that are out there. [223002880010] |Upgrading Java Classes with Backward-Compatible Serialization [223002880020] |When using Java’s default serialization, any changes to the member variables or method signatures of the class breaks backward compatibility in the sense that anything serialized from the old class won’t deserialize in the new class. [223002880030] |So do similar changes to superclasses or changing the superclass type. [223002880040] |After a bit of overview, I’ll show you the trick I’ve been using to maintain backward compatibility when upgrading classes with new meaningful parameters. [223002880050] |

    Serial Version ID

    [223002880060] |The problem stems from the fact that Java uses reflection to compute a serial version ID for each class. [223002880070] |(You can calculate it yourself using the serialver command that ships with Sun/Oracle’s JDK.) [223002880080] |When methods or member variables change, the serial version changes. [223002880090] |To deserialize, the serialized class’s ID must match the current class’s ID. [223002880100] |As long as the (non-transient) member variables don’t change, this problem can be tackled by explicitly declaring the serial version ID. [223002880110] |(See the Serializable interface javadoc for how to declare). [223002880120] |With an explicit serial version ID declared, the serialization mechanism uses the declared value rather than computing it from the class signature via reflection. [223002880130] |If there are serializable objects out there you need to deserialize, use serialver to calculate what the version ID used to be, then declare it in the class. [223002880140] |If you’re starting from scratch, the ID can be anything. [223002880150] |So always use an explicit serial version when you first create a class. [223002880160] |It can’t hurt, and it’s likely to help with backward compatibility. [223002880170] |
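    Declaring it looks like this (Widget is a made-up class just for illustration; the value itself is arbitrary for a brand-new class, and for an existing class you’d use whatever serialver reports):

import java.io.Serializable;

public class Widget implements Serializable {

    // Freezes the serialized form's version ID so that reflective changes
    // (new methods, recompilation) don't break deserialization.
    static final long serialVersionUID = 42L;

    private final String mName;

    public Widget(String name) {
        mName = name;
    }
}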

    Taking Control with the Externalizable Interface

    [223002880180] |But what if (non-transient) member variables (non-static class variables) change? [223002880190] |Version IDs won’t help, because the default read and write implemented through reflection will try to deserialize the new variables and find they’re not there and throw an exception. [223002880200] |You need to take control using the Externalizable interface, which extends Serializable. [223002880210] |Now you control what gets read and written. [223002880220] |Serialization and deserialization will still work, as long as the new member variable is either not final or is defined in the nullary (no-arg) constructor. [223002880230] |

    The Simplest Case of The Problem

    [223002880240] |Suppose we have the class [223002880250] |So far, so good. [223002880260] |If I add the following main(), [223002880270] |and compile and run, I get [223002880280] |So far, so good. [223002880290] |But now suppose I add a second variable to Foo, [223002880300] |And try to run just the deserialization with the existing file, [223002880310] |Because I haven’t changed readExternal(), we still get the same answer. [223002880320] |But what if I want to serialize the second argument? [223002880330] |And give it a default value of null for classes serialized before it was added? [223002880340] |The obvious changes to read and write don’t work, [223002880350] |It’ll throw an exception, as in [223002880360] |Now what? [223002880370] |
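    The code listings didn’t survive the formatting here, so here’s a minimal reconstruction of the kind of class under discussion (my sketch, not the original listing): version 1 has a single member, and the naive version 2 upgrade shown in the comments reads two objects where old instances only wrote one, which is what blows up.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Version 1: a single member variable.
public class Foo implements Externalizable {

    static final long serialVersionUID = 42L;

    private String mFirst;

    public Foo() { }                    // required no-arg constructor for Externalizable

    public Foo(String first) {
        mFirst = first;
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(mFirst);
    }

    public void readExternal(ObjectInput in)
            throws IOException, ClassNotFoundException {
        mFirst = (String) in.readObject();
    }
}

// Version 2 (the naive upgrade): add a second member and read/write both.
// Deserializing instances written by version 1 now fails, because the old
// stream contains only one object and the new readExternal() tries to read two.
//
//     public void writeExternal(ObjectOutput out) throws IOException {
//         out.writeObject(mFirst);
//         out.writeObject(mSecond);
//     }
//
//     public void readExternal(ObjectInput in)
//             throws IOException, ClassNotFoundException {
//         mFirst = (String) in.readObject();
//         mSecond = (String) in.readObject();  // blows up on old data
//     }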

    The Trick: Marker Objects

    [223002880380] |Here’s what I’ve been doing (well, actually, I’m doing this through a serialization proxy, but the idea’s the same), [223002880390] |The new implementation always writes an instance of Boolean.TRUE (any small object that’s easily identifiable would work), followed by values for the first and second argument. [223002880400] |(In practice, these might be null in this class, so we’d have to be a bit more careful.) [223002880410] |The trick here is the external read method reads an object, then checks if it’s Boolean.TRUE or not. [223002880420] |If it’s not, we have an instance serialized by the previous version, so we cast that object to a string, assign it to the first argument and return. [223002880430] |If it is, we read two args and assign them. [223002880440] |Now the result of deserializing our previous instance works again: [223002880450] |And there you have it, backward compatibility with new non-transient member variables. [223002880460] |
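    And here’s a reconstruction of the trick itself (again my sketch of the two methods, not the original listing):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Foo implements Externalizable {

    static final long serialVersionUID = 42L;

    private String mFirst;
    private String mSecond;

    public Foo() { }

    public Foo(String first, String second) {
        mFirst = first;
        mSecond = second;
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(Boolean.TRUE);   // marker: written by the new version
        out.writeObject(mFirst);
        out.writeObject(mSecond);
    }

    public void readExternal(ObjectInput in)
            throws IOException, ClassNotFoundException {
        Object marker = in.readObject();
        if (!Boolean.TRUE.equals(marker)) {
            // Old-format instance: the object just read is the first member itself.
            mFirst = (String) marker;
            mSecond = null;              // default for data serialized before the upgrade
            return;
        }
        // New-format instance: marker found, so read both members.
        mFirst = (String) in.readObject();
        mSecond = (String) in.readObject();
    }
}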

    The Fine Print

    [223002880470] |The trick only works if there’s an object that’s being serialized (or a primitive with a known restricted range of possible values). [223002880480] |It doesn’t have to be the first variable, but there has to be something you can test to tell which version of the class you have.