[223002390010] |Examples of “Futility of Commenting Code” [223002390020] |Continuing on from the previous blog entry, The Futility of Commenting Code, I’d like to address the dissenting comment of Sandman, who claims to have only seen the kind of caricature comments I made up in posts about commenting. [223002390030] |Well, here goes reality. [223002390040] |

A Real Example

[223002390050] |Consider the following real example, which I chose because I knew the source for Apache projects was available online and I suspected Apache Commons would be less heavily checked than more popular projects. [223002390060] |I dove into the e-mail package, because it’s the first one I actually use. [223002390070] |I found this example right away: [223002390080] |
  • source for org.apache.commons.mail.ByteArrayDataSource (revision 782475) [223002390090] |pieces of which I repeat here: [223002390100] |I wasn’t talking about API comments, but these display the same problem. [223002390110] |This public constructor is documented with “Create a datasource from a String”, but in fact, there are two string parameters, both documented as “a String”. [223002390120] |That’s what the declaration says, so this is exactly the kind of redundant comment I was talking about. [223002390130] |Next, consider the one code comment. [223002390140] |It starts on the wrong foot, with “Assumption that the string contains only ASCII characters!”. [223002390150] |If you look at the method call, data.getBytes("iso-8859-1"), you’ll see that it’s actually more general than documented, in that it’ll work for any ISO-8859-1 characters (aka Latin1). [223002390160] |The second part of the comment, “Else just pass in a charset into this constructor and use it in getBytes()” makes no sense, because the bytes are hard coded, and there is no char encoding argument to the constructor. [223002390170] |Furthermore, the catch block (with its throw) should just be removed. [223002390180] |It’s catching an UnsupportedEncodingException, which extends IOException, then throwing a fresh IOException. [223002390190] |Remove it, and an unsupported encoding throws an unsupported encoding exception; you don’t even need to change the signature, because unsupported encoding exceptions are kinds of I/O exceptions. [223002390200] |There are two problems with the code as is. First, the IOException reports an unhelpful message; the unsupported encoding exception has the info on what went wrong in its message. [223002390210] |Second, it’s throwing a more general type, making it harder for catchers to do the right thing. [223002390220] |You might also consider it a problem that the original stack trace is lost. [223002390230] |Another instance of (almost) commenting the language is later in the same file: [223002390240] |I’d argue comments like “Write the InputData to OutputStream” are useless, because this is the idiom for buffered writes. [223002390250] |Aside on Bad Code [223002390260] |Maybe you think you should always buffer streams. [223002390270] |In this case, that’s wrong. [223002390280] |Both bufferings simply cause twice as many assignments as necessary. [223002390290] |Buffering the input stream is useless because you’re already reading into the byte array buffer. [223002390300] |Buffering the output stream is useless, because you’re writing to a byte array output stream baos. [223002390310] |Furthermore, you don’t need to close or flush a byte array output stream, because both are no-ops (see the ByteArrayOutputStream javadoc). [223002390320] |Finally, the naming is wonky, because osWriter is not a writer, it’s an output stream. [223002390330] |This isn’t an isolated case. [223002390340] |Other files in this package have similar problems with doc. [223002390350] |
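Going back to the constructor: here’s roughly what it could look like with the catch block dropped. This is only a sketch with made-up class and field names, not the actual Commons source; the point is just that letting the UnsupportedEncodingException propagate preserves the message and stack trace without changing the signature.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical simplification; names do not match the Commons class.
public class StringDataSource {

    private final byte[] mData;
    private final String mType;

    // getBytes() throws UnsupportedEncodingException, which extends
    // IOException, so no catch-and-rethrow is needed and the original
    // message and stack trace survive.
    public StringDataSource(String data, String aType) throws IOException {
        mData = data.getBytes("ISO-8859-1");
        mType = aType;
    }

    public InputStream getInputStream() {
        return new ByteArrayInputStream(mData);
    }

    public String getContentType() {
        return mType;
    }
}
```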

    Another Example

    [223002390360] |While we’re picking on Apache Commons, I checked out another package I’ve used, fileUpload. [223002390370] |
  • org.apache.commons.fileupload/DiskFileUpload: That's the kind of pre-formatted useless stuff I was talking about. We know what a member variable and constructor look like in Java. There weren't any other code comments in that file. In that same package, the fileupload.ParameterParser class has some, though, and I'm guessing they're of the kind that others mentioned as "good" comments, such as: [223002390380] |I'd argue that perhaps a method called trimLeadingWhiteSpace() implemented the same way would be clearer. [223002390390] |But if you're not going to define new methods, I'd say this kind of comment helps. [223002390400] |But always verify that the code does what it says it does; don't take the comment's word (or method name's word) for it. [223002390410] |I couldn't leave that file without commenting on their return-only-at-the-end-of-a-function style: [223002390420] |
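Going back to the trimLeadingWhiteSpace() suggestion, here’s the kind of helper I have in mind; the signature and implementation are mine, not anything lifted from the Commons code:

```java
// Hypothetical refactoring sketch; not taken from Apache Commons.
public class ParseUtils {

    /**
     * Returns the index of the first non-whitespace character at or
     * after start, or end if the rest of the buffer is whitespace.
     */
    static int trimLeadingWhiteSpace(char[] buf, int start, int end) {
        int pos = start;
        while (pos < end && Character.isWhitespace(buf[pos]))
            ++pos;
        return pos;
    }
}
```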

    More Examples

    [223002390430] |I figured only a couple of examples wouldn't be convincing. [223002390440] |So here are some more links and examples: [223002390450] |
  • collections.buffer.BoundedBuffer Documenting what's clear from the code. (And nice section divider comments.)
  • digester.Digester Yes, that's what return, log, and throw do. [223002400010] |Happy 95th Birthday, Martin Gardner [223002400020] |Martin Gardner, who turns 95 today, was a huge influence on me. [223002400030] |I devoured his books and Scientific American “Mathematical Games” columns in junior high, high school, and even well into college. [223002400040] |There’s a nice profile of Martin Gardner by John Tierney in today’s New York Times. [223002400050] |(Already cited in four places by the Wikipedia article linked above.) [223002400060] |Being now into my forties and trying to figure out what to do with my life, I’m fascinated by the fact that Gardner started at age 42 with no background in math other than a love of puzzles! [223002400070] |I’m not surprised a journalist would call “recreational mathematics” an oxymoron. [223002400080] |We already knew Gardner liked the A-HA style of puzzle —his books were full of them, requiring no real math knowledge to understand. [223002400090] |I wonder what he’d have done if he’d grown up on programming puzzles rather than mathematical ones? [223002400100] |We’d probably have more interesting, but equally ridiculous, programmer interview quizzes. [223002410010] |Learn Measure Theory for Under US$10 [223002410020] |My friend and former colleague Christer Samuelsson sent me a present a while back (if you send me presents, you may be mentioned here too, though I don’t have mugs for you): [223002410030] |
  • Kolmogorov, A. N. and S. V. Fomin; translated and adapted by Richard Silverman. 1970. [223002410040] |Introductory Real Analysis. [223002410050] |Dover. [223002410060] |Yes, it’s written by the Kolmogorov, the man who invented (discovered, if you’re a Platonist) a consistent basis for probability using measure theory. [223002410070] |Not content with laying the foundations for probability, Kolmogorov also kicked off the study of algorithmic information theory through what later became known as Kolmogorov complexity, and I can also highly recommend Li and Vitanyi’s book on that subject, though I’ve only read the introductory bits. [223002410080] |You can check out the contents by following the Amazon link above. [223002410090] |I’ll summarize by saying it covers the basics, from set theory to metric topology, including a long discussion of linear spaces and operators, a chapter on measure, and sections on Lebesgue integration in the first integration chapter and the Stieltjes generalization later (really cool, in that it lets you merge integration and summation). [223002410100] |It presupposes the requisite “mathematical sophistication”, which really means you’ll be hard pressed to read more than a page every 10 or 15 minutes. [223002410110] |Seriously, though, if you didn’t understand calc in high school (or college), this probably isn’t the book for you. [223002410120] |In fact, if you haven’t studied higher math at all, this probably isn’t the place to start. [223002410130] |Having said that, I hated calc in school, and only felt I understood it after doing analysis and topology, which cranked the level of abstraction beyond $\mathbb{R}^n$ and Euclidean distances. [223002410140] |It’d help to have some background in abstract algebra and set theory, too, though everything is explained clearly (though quickly) from first principles. [223002410150] |I don’t know whether to thank Kolmogorov, his co-author, or his translator, but this book is wonderfully clear. [223002410160] |The examples are direct, the notation’s clear, and it’s fairly easy to use modularly (e.g. just to look up what a measure or σ-algebra is). [223002420010] |Mihalcea (2007) Using Wikipedia for Automatic Word Sense Disambiguation [223002420020] |In response to a recent thread on the LingPipe mailing list about using Wikipedia for a word-sense disambiguation corpus, I’d like to point out Rada Mihalcea’s 2007 NAACL paper: [223002420030] |
  • Mihalcea, Rada. 2007. [223002420040] |Using Wikipedia for Automatic Word Sense Disambiguation. [223002420050] |NAACL. [223002420060] |The idea’s very simple. [223002420070] |A Wikipedia page represents a single sense of an ambiguous phrase. [223002420080] |Both the senses and text for training can be extracted from Wikipedia’s link structure. [223002420090] |Rada maintains the SenseEval web pages and serves on the advising committee for the bakeoffs, which have pretty much defined the field recently. [223002420100] |

    Leaning on Wikitext

    [223002420110] |The wikitext markup used for Wikipedia pages contains explicit disambiguation and mention-level information. [223002420120] |Consider these examples from Rada’s paper: [223002420130] |
  • In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty three, and entered private practice in Boston. [223002420140] |
  • It is danced in 3/4 time (like most waltzes), with the couple turning approx. [223002420150] |180 degrees every [[bar (music)|bar]]. [223002420160] |The double brackets “[[ ]]” enclose links with the item on the left being the disambiguated reference, e.g. “bar (law)” or “bar (music)” and the item on the right the ambiguous word, e.g. “bar”. [223002420170] |When viewing the page, the text to the right of the vertical bar shows up. [223002420180] |The scale’s impressive —there were 1108 examples of “bar” used as link text with 40 different categories, and this was a few years ago. [223002420190] |Rada extracted paragraphs around the mentions for training data. [223002420200] |These are clearly marked in the Wikitext, so nothing fancy needs to be done for paragraph extraction (beyond the pain of Wikitext parsing itself). [223002420210] |
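If you just want the link structure, a regular expression gets you surprisingly far. This is only a sketch of the idea, with names of my own choosing; it ignores nested templates and the rest of the dark corners of Wikitext:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy extractor for [[target|anchor]] and [[target]] links; ignores
// nesting, templates, and the rest of the Wikitext grammar.
public class WikiLinkExtractor {

    private static final Pattern LINK
        = Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    public static void main(String[] args) {
        String wikitext =
            "It is danced in 3/4 time (like most waltzes), with the couple"
            + " turning approx. 180 degrees every [[bar (music)|bar]].";
        Matcher m = LINK.matcher(wikitext);
        while (m.find()) {
            String sense = m.group(1);              // e.g. "bar (music)"
            String word = (m.group(2) != null)      // e.g. "bar"
                ? m.group(2) : m.group(1);
            System.out.println(word + " -> " + sense);
        }
    }
}
```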

    What About Disambiguation Pages?

    [223002420220] |Wikipedia is full of disambiguation pages, which Wikipedia itself describes as: [223002420230] |Disambiguation in Wikipedia is the process of resolving the conflicts that occur when articles about two or more different topics could have the same “natural” page title. [223002420240] |This category contains disambiguation pages: non-article pages containing links to other Wikipedia articles and disambiguation pages. [223002420250] |For instance, the Wikipedia page for “bar” lists an astounding number of ambiguous meanings for the word. [223002420260] |There are 11 popular usages (e.g. bar serving alcohol, bar of gold), 5 math and science usages (e.g. unit of pressure, bar chart), 4 uses in the law, and many other usages, some of which appear under other categories. [223002420270] |I’d think the pages linked to would be useful for training. [223002420280] |But Rada didn’t use them, listing reasons: [223002420290] |
  • mismatch between disambiguation pages and usage, causing precision problems from links in a disambiguation page that are never referenced with the word and recall problems from some usages not showing up on the disambiguation page, [223002420300] |
  • inconsistencies in the naming of disambiguation pages [223002420310] |The first issue is still problematic, though the recall side of it seems to be improving. [223002420320] |The second issue is also still problematic. [223002420330] |Although the category helps find such pages, Wikipedia’s formatting is still more like natural language than a database. [223002420340] |

    Evaluation

    [223002420350] |As you might expect, it worked pretty well. [223002420360] |Rada evaluated a naive Bayes classifier with features including contextual words with part-of-speech tags. [223002420370] |

    Coding it in LingPipe

    [223002420380] |This is something you could easily replicate in LingPipe. [223002420390] |You could use one of the naive Bayes classifiers. [223002420400] |Or any of the other classifiers we consider in our word-sense disambiguation tutorial, including K-nearest neighbors and character or token-based language model classifiers. [223002420410] |If you need part-of-speech tags, check out our part of speech tagging tutorial. [223002420420] |With naive Bayes and easily extracted unsupervised data, you could also use EM semi-supervised training, as described in our EM tutorial. [223002420430] |You could also use a discriminative linear classifier, as described in our logistic regression tutorial. [223002420440] |The hard part’s all the data munging for Wikipedia. [223002420450] |If you build a parser in Java and want to share, I’d be happy to link to it from here or host it in our development sandbox.
  • Endeca Extend Partner Program Adds Leading Text Analytics Software Vendors [223002430040] |The “leading text analytics software vendors” are us (props to Breck for naming us with an “A”), Basis Technology, Lexalytics, MetaCarta, NetOwl, Nstein, Semantia and Temis. [223002430050] |But wait, that’s not all. [223002430060] |A slew of text analytics companies had either joined earlier or announced joining now, including ChoiceStream, BayNote, Lexalytics, Coremetrics, NStein, and Searchandise. [223002430070] |It’s no surprise that we’re all thinking Endeca has quite a bit of potential as a channel partner. [223002430080] |After the usual marketing blather (e.g. “leveraging the extensibility of the McKinley platform”, “lower cost of ownership”, “value-added capabilities”, etc.) and vague promises (e.g. “unrestricted exploration of unstructured content”), the third paragraph of Endeca’s press release explains what it’s all about in allowing Endeca’s search customers to [223002430090] |…run their data through an Endeca Extend partner solution, extract additional meta-data elements from the text, and append that meta-data to the original content [223002430100] |

    Endeca Records

    [223002430110] |Endeca stores documents in record data structures, which associate string keys with lists of string values. [223002430120] |This is the same rough structure as is found in a Lucene Document. [223002430130] |One striking difference is that Endeca’s API is cleaner and better documented. [223002430140] |Overall, I’m very impressed with Endeca’s API. [223002430150] |Looking at their API reminds me of the APIs we built at SpeechWorks, where wiser heads prevailed on me to forego complex controls designed for control-freak grad students in favor of making easy things easy. [223002430160] |Another striking difference is that Lucene’s document structure is much richer, allowing for binary blobs to be stored by those trying to use Lucene as a database. [223002430170] |Lucene also allows both documents as a whole and fields within a document to be boosted, adding a multiplier to their search scores for matching queries. [223002430180] |

    Manipulator Extensions

    [223002430190] |Endeca’s produced an API for extensions. [223002430200] |An extension visits records, modifies them, and writes them back to the index. [223002430210] |It can also write into its own scratch space on the file system and generate all new records. [223002430220] |An extension consists of three components: configuration, factory, and runtime. [223002430230] |

    Class 1. [223002430240] |Configuration

    [223002430250] |The bean-like configuration class provides setters and getters for strings, booleans, integers, and doubles. [223002430260] |These are labeled with attributes and accessed through reflection. [223002430270] |There’s then a method to validate a configuration that returns a list of errors as structured objects. [223002430280] |I’m a big fan of immutable objects, so working with beans drives me crazy. [223002430290] |They could use some more doc on concurrency and lifecycle order; as is, I was conservative and programmed defensively against changes in config. [223002430300] |Configuration is handled through an administrator interface. [223002430310] |As I said, it’s bean-like. [223002430320] |
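To give a feel for the shape of the thing, here’s a schematic bean-style config. The class and property names are invented for illustration and the validation hook is simplified; this is not Endeca’s actual extension API.

```java
import java.util.ArrayList;
import java.util.List;

// Schematic only; names and structure are mine, not Endeca's API.
public class ClassifierConfig {

    private String classifierPath = "";
    private String inputFieldNames = "";
    private String outputFieldName = "classification";

    public String getClassifierPath() { return classifierPath; }
    public void setClassifierPath(String path) { classifierPath = path; }

    public String getInputFieldNames() { return inputFieldNames; }
    public void setInputFieldNames(String names) { inputFieldNames = names; }

    public String getOutputFieldName() { return outputFieldName; }
    public void setOutputFieldName(String name) { outputFieldName = name; }

    /** Returns human-readable problems; an empty list means the config is usable. */
    public List<String> validate() {
        List<String> errors = new ArrayList<String>();
        if (classifierPath.length() == 0)
            errors.add("classifierPath must be non-empty");
        if (inputFieldNames.length() == 0)
            errors.add("at least one input field name is required");
        return errors;
    }
}
```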

    Class 2. [223002430330] |Factory

    [223002430340] |There is then a factory class with a method that returns the config class (so the admin interface can tell what kind of config to build for it). [223002430350] |It also contains a method that takes an Endeca application context and configuration and produces a runtime application. [223002430360] |The context provides services like logging, a path to local file space, and a hook into a pipe into which modified records may be sent. [223002430370] |

    Class 3. [223002430380] |Runtime

    [223002430390] |The runtime simply provides a record visitor method. [223002430400] |To write out changes, you grab the output channel from the context provided to the factory. [223002430410] |There are also some lifecycle methods used as callbacks: interrupt processing, processing of records is complete, and final cleanup. [223002430420] |You can still write out answers during the completion callback. [223002430430] |

    Endeca’s Demo Manipulator Extension

    [223002430440] |Endeca has great programmers and their Java API design was really clear. [223002430450] |I love it when vendors follow standard patterns and idioms in their API designs. [223002430460] |Especially when they use generics usefully. [223002430470] |The PDF developer doc’s still in progress, but their Javadoc’s mostly in place. [223002430480] |What was really sweet is that they gave us a working demo extension program with all of its configuration, tests, and even mock objects for use in JUnit testing the entire framework without a complete install of Endeca’s platform. [223002430490] |I’m so happy when someone sends me a Java package that unpacks then compiles with Ant without griping. [223002430500] |

    LingPipe Classifier CAS Manipulator Extension

    [223002430510] |The first extension I wrote is configured with a path to a serialized text classifier on the classpath. [223002430520] |I then configured a list of field names (only strings are available, so I went with comma-separated values) from which to collect text, and a field name into which to write the result of classification. [223002430530] |[Correction: 5 Nov 2009, Endeca let me know that they had this covered; if I declare the variables in the bean-like configuration to be list-like values, the reflection-based config system will figure it out. [223002430540] |This is awesome. [223002430550] |I always hate rolling in ad-hoc little parsing "languages" like CSV in config. [223002430560] |It's just sooo hard to doc and code correctly.] [223002430570] |
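Stripping away the Endeca-specific plumbing, the core of the classifier runtime looks roughly like the following sketch. The record is faked as a map from field names to value lists, error handling is omitted, and the config is assumed to have been parsed already; only the LingPipe calls (deserializing with AbstractExternalizable and calling classify()) reflect a real API.

```java
import com.aliasi.classify.Classification;
import com.aliasi.classify.Classifier;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch of the per-record classification step; the record type here
// is a stand-in for Endeca's record class.
public class ClassifierRuntimeSketch {

    private final Classifier<CharSequence,Classification> mClassifier;
    private final List<String> mInputFields;
    private final String mOutputField;

    @SuppressWarnings("unchecked")
    public ClassifierRuntimeSketch(File modelFile,
                                   List<String> inputFields,
                                   String outputField)
        throws IOException, ClassNotFoundException {

        mClassifier = (Classifier<CharSequence,Classification>)
            AbstractExternalizable.readObject(modelFile);
        mInputFields = inputFields;
        mOutputField = outputField;
    }

    /** Concatenates the configured fields' text and writes the best category. */
    public void process(Map<String,List<String>> record) {
        StringBuilder text = new StringBuilder();
        for (String field : mInputFields) {
            List<String> vals = record.get(field);
            if (vals == null) continue;
            for (String val : vals)
                text.append(val).append(' ');
        }
        Classification c = mClassifier.classify(text);
        record.put(mOutputField, Collections.singletonList(c.bestCategory()));
    }
}
```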

    LingPipe Chunker CAS Manipulator Extension

    [223002430580] |The second extension is a chunker. [223002430590] |It requires a path to a chunker. [223002430600] |Optionally, it allows a sentence detector to be configured for preprocessing (most of our chunkers work better at the sentence level). [223002430610] |It also optionally allows a dictionary (and tokenizer factory) to be specified for overriding the chunks found by the chunker. [223002430620] |Then, a list of field names from which to read text. [223002430630] |The output gets written into chunk-specific fields. [223002430640] |Because a given field name can contain multiple values, you can keep the resulting spans separate. [223002430650] |
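The chunker side is similar. Here’s a sketch of the per-field step without the sentence detector or the dictionary override, again with a stand-in for the record type; only the LingPipe chunking calls reflect a real API.

```java
import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only; sentence detection and the dictionary override are omitted.
public class ChunkerRuntimeSketch {

    private final Chunker mChunker;

    public ChunkerRuntimeSketch(File chunkerFile)
        throws IOException, ClassNotFoundException {
        mChunker = (Chunker) AbstractExternalizable.readObject(chunkerFile);
    }

    /** Returns chunk texts keyed by chunk type, e.g. "ORGANIZATION" -> ["Endeca", ...]. */
    public Map<String,List<String>> chunkText(String text) {
        Map<String,List<String>> chunksByType = new HashMap<String,List<String>>();
        Chunking chunking = mChunker.chunk(text);
        for (Chunk chunk : chunking.chunkSet()) {
            String span = text.substring(chunk.start(), chunk.end());
            List<String> spans = chunksByType.get(chunk.type());
            if (spans == null) {
                spans = new ArrayList<String>();
                chunksByType.put(chunk.type(), spans);
            }
            spans.add(span);
        }
        return chunksByType;
    }
}
```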

    Endeca’s Faceting

    [223002430660] |Endeca’s big on faceted search. [223002430670] |You may be familiar with it from two of the best online stores, NewEgg and Amazon. [223002430680] |It’s easy to treat our classifier plugin output as a facet. [223002430690] |For instance, classify documents by sentiment and now sentiment’s a facet. [223002430700] |Do a search, and you’ll get a summary of how many positive and how many negative documents, with an option to restrict search to either subset. [223002430710] |It’s also easy to treat our chunker output as a facet. [223002430720] |For instance, if you include a company name chunker, you’ll be able to use companies as facets (e.g. as NewEgg does with manufacturers, though documents may contain references to more than one company). [223002430730] |

    Buying Plugins

    [223002430740] |Drop Breck a line. [223002430750] |Now that I have my head around the bigger picture, it’s pretty easy to build these kinds of extensions. [223002430760] |So if there’s something you’d like integrated into Endeca and you’re willing to pay for it, let us know. [223002440010] |Hierarchical Bayesian Batting Ability, with Multiple Comparisons [223002440020] |Just in time for the last game(s) of this year’s World Series, the final installment of my example of Bayesian stats using baseball batting ability. [223002440030] |I took us off-topic (text analytics) with a general intro to Bayesian stats (What is “Bayesian” Statistical Inference?). [223002440040] |As a first example, I compared Bayesian calculations of binomial posteriors with point estimates (Batting Averages: Bayesian vs. MLE Estimates). [223002440050] |In exploring the question of “where do priors come from?”, I started with simple non-Bayesian point estimates (Moment Matching for Empirical Bayes Beta Priors for Batting Averages). [223002440060] |Finally, I realized I’d meant “ability” where I’d said “average” (former is a latent parameter, latter is a statistic calculated from at bats), when I considered Bayesian point estimates (Bayesian Estimators for the Beta-Binomial Model of Batting Ability). [223002440070] |

    Hierarchical Model Recap

    [223002440080] |For batter $n$, the number of hits $y_n$ in $K_n$ at bats is modeled as a binomial, [223002440090] |$y_n \sim \mathrm{Binomial}(K_n, \theta_n)$ for $n \in 1{:}N$, [223002440100] |where the ability, or chance of getting a hit, for batter $n$ is $\theta_n$. [223002440110] |Ability is modeled as having a beta distribution [223002440120] |$\theta_n \sim \mathrm{Beta}(\alpha, \beta)$ for $n \in 1{:}N$ [223002440130] |with prior number of hits $\alpha$ and prior number of outs $\beta$. [223002440140] |These parameters, which act as priors for the binomial parameter, are themselves given priors. [223002440150] |The mean of the beta, $\phi = \alpha / (\alpha + \beta)$, is given a uniform prior, [223002440160] |$\phi \sim \mathrm{Uniform}(0, 1)$. [223002440170] |The prior for the scale $\kappa = \alpha + \beta$ is modeled by a Pareto distribution, which has probability proportional to $\kappa^{-2.5}$, [223002440180] |$\kappa \sim \mathrm{Pareto}(1.5)$. [223002440190] |

    Fitting

    [223002440200] |I used the 2006 AL position player data (given in this previous blog post). [223002440210] |That fixes the number of players $N$ and the hits $y_n$ and at-bats $K_n$ for each player $n$. [223002440220] |I then used the BUGS encoding of the model above (as shown in this previous post). [223002440230] |BUGS automatically implements Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC) inference. [223002440240] |I ran 3 chains of 1000 samples each, retaining the last 500 samples, for 1500 posterior samples total. [223002440250] |All parameters had $\hat{R}$ values very near 1, indicating convergence of the Markov chains. [223002440260] |As usual, we read off statistics from the samples and used the sampled values for inference, where they allow the integrals involved in Bayesian inference (as described in this blog post) to be computed. [223002440270] |

    Beta Posterior

    [223002440280] |We can inspect the posterior for the beta mean and scale parameters. [223002440290] |We plot them as a 2D scatterplot with their means picked out as lines. [223002440300] |As usual, the scale is much harder to estimate than the mean. [223002440310] |

    Ability Posteriors

    [223002440320] |We can also plot the ability for particular players. [223002440330] |Here’s histograms of sampled posteriors for the players with the best average posterior ability estimates, with their hits and at-bats provided as labels above the plot: [223002440340] |Notice how the variance of the posterior is determined by the number of at bats. [223002440350] |The player with 60 at bats has a much wider posterior than the player with 695 at bats. [223002440360] |

    Multiple Comparisons: Who’s the Best Batter?

    [223002440370] |We’re going to do what Andrew, Jennifer and Masanao suggest, which is to use our hierarchical model to make a posterior comparison of players’ abilities. [223002440380] |In particular, we’ll estimate the posterior probability that a particular player is the best player. [223002440390] |We simply look at all 1500 posterior samples, which include ability samples as plotted above, and count how many times a player has the highest ability in a sample. [223002440400] |Then divide by 1500, and we’re done. [223002440410] |It’s a snap in R, and here are the results, for the same batters as the plot above: [223002440420] |The .347 batter with 521 at bats not only has the highest estimated chance in our model of having the highest ability, he also won the batting crown (Joe Mauer of Minnesota). [223002440430] |The beta-binomial hierarchical model estimates only a 12% chance that this batter has the highest ability. [223002440440] |The estimate is very close for the .343 batter with 623 at bats (Derek Jeter of the Yankees). [223002440450] |[It turns out the race to the batting crown came down to the wire.] [223002440460] |The larger number of at bats provides more evidence that the batter has a higher ability than average, thus pulling the posterior ability estimate further away from the prior mean. [223002440470] |Finally, note that we’re assigning some chance that the .367 batter with only 60 at bats is best in the league. [223002440480] |That’s because when the samples are on the high side of the distribution, this batter’s best.
  • Phrase Detectives (the AnaWiki annotation game) [223002450040] |

    Annotation Game

    [223002450050] |Annotation games (aka “games with a purpose”) were popularized by von Ahn’s ESP game, though I first heard about them through David Stork’s Open Mind project. [223002450060] |Unlike Mechanical Turk, the games approach tries to make the task somewhat fun and competitive. [223002450070] |It seems like making the users “detectives” is a thin veneer of “fun”, but they maintain the metaphor beautifully throughout the whole site, so it works. [223002450080] |As with many games, Phrase Detectives pays out in leader board bragging rights and cash prizes rather than directly for work completed as on Mechanical Turk. [223002450090] |

    Phrase Detective Tasks

    [223002450100] |The really brilliant part is how they break the coref annotation task into four easy-to-answer questions about a single highlighted phrase. [223002450110] |
  • Not Mentioned Before: yes/no question as to whether the referent of highlighted phrase was previously mentioned in the text [223002450120] |
  • Mentioned Before: highlight previous mention of a given phrase [223002450130] |
  • Non-Referring: pick out non-referential nouns (like the “there” in “there is trouble brewing”) [223002450140] |
  • Property of Another Phrase: pick out other phrases that describe someone already mentioned (e.g. attributives or appositives) [223002450150] |The site also has nice clean, easy-to-follow graphics, and appears to still have users after two years. [223002450160] |

    Adjudication Phase

    [223002450170] |OK, they call it “Detectives Conference”, but the idea is you get to vote yes/no as to whether someone else’s answer is right. [223002450180] |This is a good idea widely used on Mechanical Turk because it’s easier to check someone’s work than to create it from scratch. [223002450190] |

    Read All About It

    [223002450200] |It was developed by academics, so there are at least as many papers as contributors: [223002450210] |
  • Chamberlain, Jon, Massimo Poesio and Udo Kruschwitz. 2008. [223002450220] |Phrase Detectives: A Web-based Collaborative Annotation Game. [223002450230] |I-KNOW. [223002450240] |
  • Chamberlain, Poesio and Kruschwitz. 2008. [223002450250] |Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts. [223002450260] |ACL workshop. [223002450270] |
  • Poesio, Kruschwitz and Chamberlain. 2008. [223002450280] |ANAWIKI Creating anaphorically annotated resources through web cooperation. [223002450290] |LREC. [223002450300] |
  • Kruschwitz, Chamberlain and Poesio. 2009. [223002450310] |(Linguistic) Science Through Web Collaboration in the ANAWIKI project [223002450320] |

    Coreference Annotation

    [223002450330] |There are “expert” annotated within-doc coref corpora for the MUC 7 and ACE 2005 evaluations (available from LDC, who charge an arm and a leg for this stuff, especially for commercial rights). [223002450340] |LingPipe does within-document coreference and we’ve worked on cross-document coreference. [223002450350] |

    More Like This

    [223002450360] |As soon as you find one of these things, you find more. [223002450370] |Check out: [223002450380] |
  • Sentiment Quiz (Facebook app to do sentiment annotation) [223002450390] |I’d love to hear about more of these if anyone knows any. [223002460010] |Chang et al. (2009) Reading Tea Leaves: How Humans Interpret Topic Models [223002460020] |This forthcoming NIPS paper outlines a neat little idea for evaluating clustering output: [223002460030] |
  • Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. 2009. [223002460040] |Reading Tea Leaves: How Humans Interpret Topic Models. [223002460050] |NIPS. [223002460060] |The question they pose is how to evaluate clustering output, specifically topic-word and document-topic coherence, for human interpretability. [223002460070] |

    Bag of Words

    [223002460080] |Everything’s in the bag of words setting, so a topic is modeled as a discrete distribution over words. [223002460090] |

    Multiple Topics per Document

    [223002460100] |They only consider models that allow multiple topics per document. [223002460110] |Specifically, the clusterers all model a document as a discrete distribution over topics. [223002460120] |The clusterers considered share strong family resemblances: probabilistic latent semantic indexing (pLSI) and two forms of latent Dirichlet allocation (LDA), the usual one and one in which topics for documents are drawn from a logistic prior modeling topic correlations rather than a uniform Dirichlet. [223002460130] |

    Intrusion Tasks

    [223002460140] |To judge the coherence of a topic, they take the top six words from a topic, delete one of the words and insert a top word from a different topic. [223002460150] |They then measure whether subjects can detect the “intruder”. [223002460160] |To judge the coherence of the topics assigned to a document, they do the same thing for document distributions: they take the top topics for a document, delete one and insert a topic not assigned with high probability to the document. [223002460170] |

    Analysis

    [223002460180] |They considered two small corpora of roughly 10K articles, 10K word types and 1M tokens, one from Wikipedia pages and one from NY Times articles. [223002460190] |These can be relatively long documents compared to tweets, customer support requests, MEDLINE abstracts, etc, but are shorter than full-text research articles or corporate 10K or 10Q statements. [223002460200] |They only consider models with 50, 100, and 150 topics, and restrict parameterizations to add-1 smoothing (aka the Laplace form of the Dirichlet prior) for per-document topic distributions. [223002460210] |I didn’t see any mention of what the prior was for the per-topic word distributions. [223002460220] |I’ve found both of these parameters to have a huge effect on LDA output, with larger prior counts in both cases leading to more diffuse topic assignments to documents. [223002460230] |They only consider point estimates of the posteriors, which they compute using EM or variational inference. [223002460240] |This is not surprising given the non-identifiability of topics in the Gibbs sampler. [223002460250] |

    Mechanical Turker Voting

    [223002460260] |They used 8 mechanical Turkers per task (aka HIT) of 10 judgments (wildly overpaying at US$0.07 to US$0.15 per HIT). [223002460270] |

    (Pseudo Expected) Predictive Log Likelihood

    [223002460280] |They do the usual sample cross-entropy rate evaluations (aka [pseudo expected] predictive log likelihoods). [223002460290] |Reporting these to four decimal places is a mistake, because the different estimation methods for the various models have more variance than the differences shown here. [223002460300] |Also, there’s a huge effect from the priors. [223002460310] |For both points, check out Asuncion et al.’s analysis of LDA estimation, which the first author, Jonathan Chang, blogged about. [223002460320] |

    Model Precision

    [223002460330] |Their evaluation for precision is the percentage of subjects who pick out the intruder. [223002460340] |It’d be interesting to see the effect of adjusting for annotator bias and accuracy. [223002460350] |This’d be easy to evaluate with any of our annotation models. [223002460360] |For instance, it’d be interesting to see if it reduced the variance in their figure 3. [223002460370] |There’s variation among the three models at different numbers of topics over the different corpora. [223002460380] |I’m just not sure how far to trust their model precision estimates. [223002460390] |

    Their Take Home Message

    [223002460400] |The authors drive home the point that traditional measures such as expected predictive log likelihood are negatively correlated with their notion of human evaluated precision. [223002460410] |As I said, I’m skeptical about the robustness of this inference given the variation in estimation techniques and the strong effect of priors. [223002460420] |The authors go so far as to suggest using humans in the model selection loop. [223002460430] |Or developing an alternative estimation technique. [223002460440] |If they’d been statisticians rather than computer scientists, my guess is that they’d be calling for better models, not a new model selection or estimation technique! [223002460450] |

    The Real Take Home Message

    [223002460460] |I think the real contribution here is the evaluation methodology. [223002460470] |If you’re using clustering for exploratory data analysis, this might be a way to vet clusters for further consideration. [223002460480] |

    What They Could’ve Talked About

    [223002460490] |Although they mention Griffiths and Steyvers’ work on using LDA for traditional psychometrics, I think a more interesting result is Griffiths and Steyvers’ use of KL-divergence to measure the stability of topics across Gibbs samples (which I describe and recreate in the LingPipe clustering tutorial). [223002460500] |Using KL divergence to compare different clusters may give you a Bayesian method to automatically assess Chang et al.’s notion of precision.

    Spelling Correction and Chinese Tokenization

    [223002470050] |LingPipe models both spelling correction and Chinese tokenization (word splitting) using noisy channel models. [223002470060] |We have two tutorials: [223002470070] |
  • Spelling Correction Tutorial
  • [223002470080] |
  • Chinese Tokenization Tutorial
  • [223002470090] |The two components of a noisy channel model are the source $p(x)$ from which messages $x$ consisting of strings are generated probabilistically, and the channel $p(y \mid x)$, which models the probability of observing the signal $y$ when the message is $x$. [223002470100] |For spell checking, the source model of what people are likely to say may be estimated using document data and/or query logs, and the channel model of possible brainos and typos may be estimated using matched query/correction data (which itself may be inferred via EM from query logs and/or document data). [223002470110] |For Chinese, the channel deterministically deletes spaces, which requires probabilistic reconstruction for the observer who doesn’t get word boundary information. [223002470120] |LingPipe’s spelling package implements (confusion-matrix weighted) edit distance, [223002470130] |and other models, such as n-gram comparison. [223002470140] |This is all described in the spelling tutorial as it relates to spell checking, and also in: [223002470150] |
  • String Comparison Tutorial
  • [223002470160] |LingPipe doesn’t enforce a probabilistic interpretation of edit distance. [223002470170] |Furthermore, decoding of the intended message $x$ given the observed signal $y$ is implemented using the Viterbi approximation, which estimates the channel probability $p(y \mid x)$ based on the single best alignment between the message $x$ and the observed signal $y$. [223002470180] |

    Mature mRNA Sequencing

    [223002470190] |Mitzi’s lab’s working on sequencing the untranslated end regions of mature mRNA (aka the 3′ UTR) in C. elegans (aka the nematode worm), a widely studied model organism. [223002470200] |The 3′ UTR is the sequence of bases after the stop codon and before the poly-A tail found on mature mRNA. [223002470210] |

    Short Read Sequencers

    [223002470220] |I’ve been talking with Mitzi lately about the next generation short-read DNA sequencers, like Illumina and SOLiD. [223002470230] |Perhaps it’s not surprising, given that it’s locally coherent string data, that the same techniques are used in bioinformatics as in computational linguistics. [223002470240] |

    Isoform Expression Models

    [223002470250] |I’ve been formulating Bayesian models for transcriptome analysis. [223002470260] |In particular, isoform (aka alternative splicing) expression. [223002470270] |As usual, there’s an underlying generative model of the process. [223002470280] |One of the components of this model is how the read that’s observed from the device (the observed data) is generated given the biological sequence from which it was generated (the estimand of interest). [223002470290] |The noisy channel model is appropriate for this setting; I just need to make it probabilistic. [223002470300] |

    Levenshtein Distance

    [223002470310] |The simplest models of spelling correction or genetic sequencing involve simple Levenshtein distance calculations. [223002470320] |The Levenshtein distance is the minimum number of insert, delete or substitution operations required to turn one sequence into the other. [223002470330] |It’s symmetric. [223002470340] |But it’s not probabilistic. [223002470350] |
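For reference, here’s the textbook dynamic-programming version of Levenshtein distance in Java (two-row variant; my code, nothing LingPipe-specific):

```java
public class Levenshtein {

    /** Classic dynamic-programming Levenshtein distance with unit costs. */
    public static int distance(CharSequence x, CharSequence y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); ++j)
            prev[j] = j;                       // j insertions
        for (int i = 1; i <= x.length(); ++i) {
            curr[0] = i;                       // i deletions
            for (int j = 1; j <= y.length(); ++j) {
                int subst = prev[j-1] + (x.charAt(i-1) == y.charAt(j-1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j-1] + 1;
                curr[j] = Math.min(subst, Math.min(delete, insert));
            }
            int[] tmp = prev; prev = curr; curr = tmp;   // rotate rows
        }
        return prev[y.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("ACGTACG", "AGTTAGG"));  // prints 3
    }
}
```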

    Probabilistic Channels

    [223002470360] |It’s actually quite simple to model $p(y \mid x)$. [223002470370] |Just make sure all the moves from a given state sum to 1.0. [223002470380] |A state corresponds to a prefix of the reference sequence $x$. [223002470390] |Legal moves are to skip (delete) the next symbol from the message $x$, to add (insert) a spurious symbol into the observation $y$, match the next symbols from $x$ and $y$, or to substitute the next symbols. [223002470400] |

    Viterbi versus Forward Algorithm

    [223002470410] |The usual approach to calculation is to make the so-called Viterbi assumption, and look at the probability of the best sequence of moves to consume the message sequence and observed sequence. [223002470420] |Given a sequence of edit operations, we can generate an alignment between sequences. [223002470430] |For instance, consider matching the message ACGTACG to the observed signal AGTTAGG. [223002470440] |The following alignment consists of a match of A, a delete of C, two matches GT, an insertion of a T, a match on A, a substitution of a G for a C, and a final match on G: [223002470450] |To make the probabilities add up to 1.0 for the moves from a particular state, we need to subdivide the inserts and substitutions. [223002470460] |With the genetic alphabet of symbols {A, C, G, T}, the 9 possible moves are subdivided by output, I(A), I(C), I(G), I(T), D, S(A), S(C), S(G), S(T). [223002470470] |We have probabilities for these outputs for each symbol. [223002470480] |In the genetic case, that’s four multinomials of order 9. [223002470490] |Note that we’ve included the match among the substitution operations for simplicity. [223002470500] |If we know the next output symbol from a state, only one of the four substitutions will be legal —the one that outputs the next output symbol. [223002470510] |But there are many more alignments for the same pair of strings. [223002470520] |Some even have the same Levenshtein distance. [223002470530] |For instance, we can align the two sequences using three substitutions: [223002470540] |Edit distance assigns a probability to a path of operations, which corresponds to an alignment, say $a$. [223002470550] |Given the underlying message (or sequence) $x$, the observed sequence $y$ is determined by the edit operations $a$. [223002470560] |What we need to do is integrate (or sum) out the alignments in the usual way in order to calculate the probability of the observed signal (query/read) $y$ given the message (meaning/reference sequence) $x$: [223002470570] |$p(y \mid x) = \sum_a p(y, a \mid x)$. The Viterbi approximation is: [223002470580] |$p(y \mid x) \approx \max_a p(y, a \mid x)$. The approximation is only reasonable from a probabilistic point of view if the best alignment accounts for most of the probability mass. [223002470590] |Or, if we are doing calculations proportionally and the best alignment probability is a constant fraction of the total generation probability. [223002470600] |Simply using the forward algorithm instead of the Viterbi algorithm computes this in $\mathcal{O}(m \times n)$ time (and space), where $m$ is the length of $x$ and $n$ the length of $y$. [223002470610] |We can actually tighten up the space requirements to $\mathcal{O}(\min(m, n))$; this is how I did things for LingPipe’s implementation and what I contributed for Lucene’s fuzzy matcher. [223002470620] |We can tighten even further if we have an upper bound on the maximum number of edits to consider, say $d$. [223002470630] |Then the time complexity reduces to $\mathcal{O}(d \times \min(m, n))$. [223002470640] |(As for all things string algorithm related, see Gusfield’s string algorithms book and Durbin, Eddy, Krogh and Mitchison’s probabilistic string matching book). [223002470650] |Of course, we’ll compute it all on a log scale so the products become sums. [223002470660] |This will require our old friend the log sum of exponentials operation. [223002470670] |
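Here’s a sketch of that operation in Java, using the usual trick of factoring out the max to avoid underflow; the class name is mine:

```java
// Stable computation of log(exp(x1) + ... + exp(xn)) for log-scale sums.
public class LogSumExp {

    public static double logSumExp(double... logProbs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : logProbs)
            if (lp > max) max = lp;
        if (max == Double.NEGATIVE_INFINITY)
            return Double.NEGATIVE_INFINITY;  // all terms are zero probabilities
        double sum = 0.0;
        for (double lp : logProbs)
            sum += Math.exp(lp - max);        // each term is in (0, 1]
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        // log(0.25 + 0.25) should equal log(0.5)
        System.out.println(logSumExp(Math.log(0.25), Math.log(0.25)));
        System.out.println(Math.log(0.5));
    }
}
```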

    Filtering

    [223002470680] |For the biological sequence matching problem, we’re facing on the order of 100M reads of 50 bases each, matching against a set of 30K isoforms of 1.5K bases each. [223002470690] |This is prohibitive for the forward algorithm. [223002470700] |There are various ways to share the prefix operations (again, see Gusfield’s book). [223002470710] |Typically, though, practitioners use n-gram sequence indexes to spot potential matches, only considering sequences with good n-gram overlap, thus eliminating most of the work up front. [223002470720] |This can be done either heuristically, as in BLAST (described in both books), or exactly, as in SHRiMP (the latter link is to a paper with a good description of the SOLiD color space reads and uses automaton composition to compose noisy channels, one for color read errors, one for sequencing errors, and one for mutations). [223002470730] |This is like the notion of blocking in database record linkage (including database deduplication), which we also tend to carry out using character n-grams for natural language sequences. [223002470740] |The transcendental math’s still going to dominate the calculations, so typically a Viterbi pass using Levenshtein distance is done after blocking with n-grams to further filter out unlikely alignments. [223002470750] |Biological sequence analysis sure looks a lot like natural language sequence analysis! [223002480010] |Computing Marginal Channel Probabilities with Generalized Needleman-Wunsch [223002480020] |In my last post on Probabilistic Edit Channel Models, I tried to motivate a properly calculated marginal probability of an observation (e.g., a read) given a source (e.g., a reference sequence). [223002480030] |In this post, I’ll show you how to build a properly normalized probabilistic generalization of the well-known Needleman-Wunsch algorithm for global alignment that computes full marginal probabilities rather than best path probabilities. [223002480040] |Global alignment matches all of its two input sequences. [223002480050] |This approach is easily generalized to the Smith-Waterman algorithm for local alignments or to fully matching a read sequence against a subsequence of a reference sequence. [223002480060] |See the last section below. [223002480070] |

    Edit Operations

    [223002480080] |We’ll assume a simple alphabet of bases, {A,C,G,T}. [223002480090] |We’ll also assume four possible actions given a reference sequence and a position in that reference sequence: [223002480100] |
  • delete the next symbol from the reference (advance the ref position)
  • [223002480110] |
  • insert a symbol into the read (don’t advance the ref position)
  • [223002480120] |
  • generate a matched symbol for the read (advance ref)
  • [223002480130] |
  • generate a mismatched symbol for the read (advance ref)
  • [223002480140] |There’s only one way to delete, which we’ll write D. [223002480150] |There are four possibilities for insertion depending on the base inserted, which we’ll write IA, IC, IG, IT. [223002480160] |There are a total of four possibilities for substitution and matching, depending on the base generated in the read, SA, SC, SG, ST; the one with the symbol corresponding to the next reference symbol is the match case. [223002480170] |

    Probabilistic Edits

    [223002480180] |Given a reference sequence, we generate a read sequence by iterating over the reference sequence base by base, and for each base, performing one of the above operations. [223002480190] |If the operation is not an insert, the position in the reference sequence advances. [223002480200] |For instance, consider the alignment we introduced in the previous post, [223002480210] |which is generated by the sequence of operations: [223002480220] |What we need to estimate is the conditional probability of one of the nine edit operations given the next base in the input. [223002480230] |We’ll assume the edit probabilities are known. [223002480240] |Typically, these edits are inferred using little or no training data through the EM algorithm. [223002480250] |The sequence of edit operations is: σ = [SA, D, SG, ST, IT, SA, SG, SG]. [223002480260] |Given the reference sequence ACGTACG and the edit operations σ, the read AGTTAGG is uniquely determined. [223002480270] |With this clean generative model, given an input reference sequence, the probability of all read sequences sums to 1.0. [223002480280] |Of course, we will know the read in most of what we do and only need to compute its probability. [223002480290] |The probability of that sequence of edit operations given the reference sequence ACGTACG is given by the product of all the probabilities in the rightmost column above: [223002480300] |p(σ|ACGTACG) = p(SA|A) p(D|C) p(SG|G) p(ST|T) p(IT|T) p(SA|A) p(SG|C) p(SG|G) [223002480310] |What we’re interested in calculating here is p(AGTTAGG | ACGTACG), namely the probability of the read given the reference sequence. [223002480320] |This is done by summing over all possible alignments: [223002480330] |p(AGTTAGG | ACGTACG) = Σ_σ p(σ | ACGTACG), where the sum is over the edit sequences σ that generate the read AGTTAGG. Note that this probability will be zero if the edit sequence and reference don’t generate the read. [223002480340] |The usual alignment programs compute the best alignment, namely: [223002480350] |σ* = argmax_σ p(σ | ACGTACG). We could use a regular max in place of the argmax to compute the probability of the best alignment instead of the alignment itself. [223002480360] |Note that we take the maximum probability; usually the algorithm is formulated with edit costs and it takes the minimum cost. [223002480370] |We could convert to the usual setup with minimization by reformulating in terms of negative log probabilities. [223002480380] |

    Needleman-Wunsch Algorithm

    [223002480390] |The Needleman-Wunsch algorithm is a classic instance of dynamic programming where the distances computed for shorter substrings are extended in the inner loop to distances for longer strings. [223002480400] |The first two loops deal with the boundary conditions of all-insertions and all-deletions. [223002480410] |Given its simple loop, this algorithm clearly takes time proportional to read length times reference length. [223002480420] |
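In code, the recursion looks roughly like this sketch: a best-alignment (Viterbi-style) score on the log scale, with the first two loops filling in the all-deletion and all-insertion boundaries. The EditModel interface is a placeholder of my own for whatever edit probabilities you’ve estimated; it conditions inserts only on the inserted base, which is a simplification of the nine-move model described above.

```java
// Sketch of best-alignment (Viterbi-style) scoring on a log scale.
// EditModel is a hypothetical interface standing in for an estimated
// edit model; it is not part of any particular library.
public class BestAlignmentScorer {

    public interface EditModel {
        double logDelete(char refChar);
        double logInsert(char readChar);
        double logSubstitute(char refChar, char readChar);  // includes match
    }

    private final EditModel mModel;

    public BestAlignmentScorer(EditModel model) {
        mModel = model;
    }

    /** Returns the log probability of the single best alignment of read to ref. */
    public double logBestAlignment(String read, String ref) {
        double[][] dp = new double[read.length() + 1][ref.length() + 1];
        dp[0][0] = 0.0;
        for (int j = 1; j <= ref.length(); ++j)      // boundary: all deletions
            dp[0][j] = dp[0][j-1] + mModel.logDelete(ref.charAt(j-1));
        for (int i = 1; i <= read.length(); ++i)     // boundary: all insertions
            dp[i][0] = dp[i-1][0] + mModel.logInsert(read.charAt(i-1));
        for (int i = 1; i <= read.length(); ++i) {
            for (int j = 1; j <= ref.length(); ++j) {
                double subst = dp[i-1][j-1]
                    + mModel.logSubstitute(ref.charAt(j-1), read.charAt(i-1));
                double delete = dp[i][j-1] + mModel.logDelete(ref.charAt(j-1));
                double insert = dp[i-1][j] + mModel.logInsert(read.charAt(i-1));
                dp[i][j] = Math.max(subst, Math.max(delete, insert));
            }
        }
        return dp[read.length()][ref.length()];
    }
}
```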

    Marginal Needleman-Wunsch

    [223002480430] |I’ll call this the marginal Needleman-Wunsch, but if someone knows a better name, let me know. [223002480440] |It bears the same relation to the standard N-W algorithm as the forward algorithm for HMMs bears to the Viterbi algorithm. [223002480450] |The only change is that we’re calling the familiar log sum of exponentials function here rather than max. [223002480460] |Recall the definition: [223002480470] |logSumExp(x, y, z) = log(exp(x) + exp(y) + exp(z)). We use this because in our case, x, y, and z are log probabilities, so that exp(x), exp(y), and exp(z) are linear probabilities, so that their sum is a linear probability, and hence the final result is the total probability of the paths converted back to the log scale. [223002480480] |We don’t just maintain linear probabilities because we’re likely to hit underflow problems. [223002480490] |(This may not actually be an issue for the limited number of errors allowed in these models; if not, then we just need a sum of the linear probabilities, which will be much faster to execute computationally.) [223002480500] |That’s it. [223002480510] |We’ve computed the (log of) the sum of all path probabilities. [223002480520] |
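Putting it together, here’s a sketch of the marginal version in the same style as the best-alignment scorer above: the recursion is identical except that max becomes log-sum-exp, and it keeps only two rows, so space is proportional to the reference length. It reuses the hypothetical EditModel above and a logSumExp() helper like the one sketched in the last post.

```java
// Marginal (forward) scorer: sums over all alignments on the log scale.
// Uses the hypothetical EditModel and LogSumExp utilities sketched earlier.
public class MarginalAlignmentScorer {

    private final BestAlignmentScorer.EditModel mModel;

    public MarginalAlignmentScorer(BestAlignmentScorer.EditModel model) {
        mModel = model;
    }

    /** Returns log p(read | ref), summing over all alignments. */
    public double logMarginal(String read, String ref) {
        double[] prev = new double[ref.length() + 1];
        double[] curr = new double[ref.length() + 1];
        prev[0] = 0.0;
        for (int j = 1; j <= ref.length(); ++j)          // boundary: all deletions
            prev[j] = prev[j-1] + mModel.logDelete(ref.charAt(j-1));
        for (int i = 1; i <= read.length(); ++i) {
            curr[0] = prev[0] + mModel.logInsert(read.charAt(i-1));  // all insertions
            for (int j = 1; j <= ref.length(); ++j) {
                double subst = prev[j-1]
                    + mModel.logSubstitute(ref.charAt(j-1), read.charAt(i-1));
                double delete = curr[j-1] + mModel.logDelete(ref.charAt(j-1));
                double insert = prev[j] + mModel.logInsert(read.charAt(i-1));
                curr[j] = LogSumExp.logSumExp(subst, delete, insert);
            }
            double[] tmp = prev;   // rotate rows; only two are ever kept
            prev = curr;
            curr = tmp;
        }
        return prev[ref.length()];
    }
}
```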

    When the Reference is Exhausted

    [223002480530] |One small detail: we need to be able to generate symbols in the read after the reference is exhausted, so we have to include that option in the algorithm. [223002480540] |That is, we need five discrete distributions over edit operations, one for each next base {A,C,G,T} in the reference, and one to use when the reference is exhausted. [223002480550] |Although not conceptually difficult or expensive to compute, it complicates the boundary conditions for the algorithm and the conceptualization of the algorithm, so the algorithm isn’t technically correct as stated. [223002480560] |To correct it, the probability terms need to be fiddled to account for the boundary. [223002480570] |

    Generalizing to Local Alignments

    [223002480580] |The Smith-Waterman-like generalization to local alignments requires allowing zero-cost deletes on the first and last row of the edit matrix, and zero-cost inserts on the first and last column of the matrix. [223002480590] |This isn’t quite properly normalized, but can be if we assume a uniform model of read start and end positions. [223002480600] |We can also restrict local alignments in the usual way to the case where the read completely matches a substring of the reference sequence by only allowing zero-cost deletes on the first and last row of the distance matrix. [223002480610] |

    Multiple Edits and Correlated Errors

    [223002480620] |The model above is Markovian; only the current reference symbol determines the output symbol. [223002480630] |This assumption may be relaxed to allow compound edit operations instead of the simple Markovian insertion, substitution, and deletion operations of the Needleman-Wunsch algorithm. [223002480640] |For example, University of Toronto’s SHRiMP alignment package considers pairwise operations generated due to the AB SOLiD platform’s colorspace reads (there’s a shortage of capital “I”s for names, hence “Alias-i”). [223002480650] |With this approach, SHRiMP can model the correlated errors from the SOLiD colorspace reads relative to a reference sequence due to single nucleotide polymorphisms (SNPs) in the samples relative to the reads (which induce two sequential, correlated, color space read errors). [223002480660] |SHRiMP’s generalization of Needleman-Wunsch is similar both algorithmically and conceptually to the phonetic extensions to the edit distance component of the noisy channel model for natural language spelling correction proposed by Kristina Toutanova and Bob Moore in their 2002 ACL paper, Pronunciation Modeling for Improved Spelling Correction.

    Definition 1

    [223002500050] |A $\sigma$-algebra is a set $\mathcal{X}$ of sets which [223002500060] |
  • is closed under difference, if $A, B \in \mathcal{X}$ then $A - B \in \mathcal{X}$, [223002500070] |
  • is closed under countable unions, if $A_1, A_2, \ldots \in \mathcal{X}$ then $\bigcup_n A_n \in \mathcal{X}$, and [223002500080] |
  • has a unit $E \in \mathcal{X}$ such that $A \subseteq E$ for all $A \in \mathcal{X}$. [223002500090] |

    Definition 2

    [223002500100] |A $\sigma$-additive measure is a function $\mu$ over a $\sigma$-algebra $\mathcal{X}$ taking non-negative values such that [223002500110] |$\mu(\bigcup_n A_n) = \sum_n \mu(A_n)$ for pairwise disjoint sets $A_1, A_2, \ldots \in \mathcal{X}$. [223002500120] |
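As a toy check of the definitions (an example of mine, not from the original post), take the counting measure on a three-element unit:

```latex
% Counting measure: E = {1,2,3}, X = the power set of E, mu(A) = |A|.
% Sigma-additivity for the pairwise disjoint sets {1} and {2,3}:
\mu(\{1\} \cup \{2,3\}) \;=\; \mu(\{1,2,3\}) \;=\; 3
  \;=\; 1 + 2 \;=\; \mu(\{1\}) + \mu(\{2,3\}).
```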

    Immediate Consequences

    [223002500130] |The closure of a $\sigma$-algebra under countable intersections follows from de Morgan’s law, [223002500140] |$\bigcap_n A_n = E - \bigcup_n (E - A_n)$. [223002500150] |The empty set $\emptyset = E - E$ is the unique zero element such that $A \cup \emptyset = A$ for all $A \in \mathcal{X}$. [223002500160] |Further, $\mu(\emptyset) = 0$ and $\mu(A) \leq \mu(E)$ for all $A \in \mathcal{X}$. [223002500170] |A $\sigma$-algebra forms a ring, with symmetric difference $A \,\triangle\, B = (A - B) \cup (B - A)$ as addition, $\cap$ as multiplication, $E$ as the multiplicative identity, $\emptyset$ as the additive identity, and $\triangle$ itself as additive difference (every set is its own additive inverse). [223002500180] |

    Constructing Lebesgue Measures

    [223002500190] |It’s not entirely trivial to show that interesting measures exist, even working over the real numbers. [223002500200] |The easiest useful measure to construct is the Lebesgue measure over the unit interval $[0, 1]$. [223002500210] |Start by defining the measure of a subinterval to be its length. [223002500220] |Extend the definition to countable unions of subintervals by summation, following the definition of measure. [223002500230] |It remains to tidy up some boundary cases producing zero-measure results, which isn’t entirely trivial (analysis is built on counterexamples and edge cases), but the basic idea pretty much works as is. [223002510010] |Probability Measures and Random Variables [223002510020] |I introduced measure theory in my last post, Measure Theory in Two Definitions. [223002510030] |With that out of the way, we can move on to probability measures and random variables. [223002510040] |

    Probability Measure

    [223002510050] |A probability measure $p$ is a measure taking values in the closed interval $[0, 1]$ and mapping the unit to 1, $p(E) = 1$. [223002510060] |It follows that $p(\emptyset) = 0$. [223002510070] |In a probability measure, the sets $A \in \mathcal{X}$ are called events and we write $\Pr[A]$ for $p(A)$. [223002510080] |The sample space for a probability measure is $E$. [223002510090] |

    Those Pesky Random Variables

    [223002510100] |Given a probability measure $p$, a random variable $X$ is a real-valued function over the measure’s sample space $E$ such that $\{e \in E : X(e) \le x\} \in \mathcal{X}$ (so that it has a measure) for all $x$. [223002510110] |The cumulative distribution function $F_X$ for $X$ is defined by [223002510120] |$F_X(x) = p(\{e \in E : X(e) \le x\})$. [223002510130] |The probability that random variable $X$ is less than or equal to a fixed value $x$ is [223002510140] |$\Pr[X \le x] = F_X(x)$. [223002510150] |The probability that the variable falls in the interval $(a, b]$ is defined by [223002510160] |$\Pr[a < X \le b] = F_X(b) - F_X(a)$. [223002510170] |When $F_X$ is differentiable, the density function $p_X$ is [223002510180] |$p_X(x) = \frac{d}{dx} F_X(x)$. [223002510190] |When the density is defined, [223002510200] |$F_X(x) = \int_{-\infty}^{x} p_X(t) \, dt$ and [223002510210] |$\Pr[a < X \le b] = \int_a^b p_X(t) \, dt$. [223002510220] |
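For a concrete instance of these definitions (again my example), let $X$ be the identity function on the unit interval under the Lebesgue measure, that is, a uniform draw from $[0, 1]$:

```latex
F_X(x) = \Pr[X \le x] = x \quad \text{for } x \in [0, 1],
\qquad
p_X(x) = \frac{d}{dx} F_X(x) = 1,
\qquad
\Pr[a < X \le b] = \int_a^b p_X(x) \, dx = b - a.
```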

    End of Story

    [223002510230] |This still seems like a lot of work just to get going with random variables. [223002510240] |We haven’t even defined multivariate densities or conditional and joint probabilities. [223002510250] |In applied work, we define integrable multivariate density functions explicitly and reason through integration. [223002510260] |For instance, in the hierarchical batting ability model I discussed as an example of Bayesian inference, the model is fully described by the density $p(y, \theta, \phi, \kappa) = \mathrm{Pareto}(\kappa) \times \mathrm{Uniform}(\phi) \times \prod_{n=1}^N \mathrm{Beta}(\theta_n \mid \phi\kappa, (1 - \phi)\kappa) \, \mathrm{Binomial}(y_n \mid K_n, \theta_n)$. [223002510270] |Technically, we could construct the measure from the density function as follows. [223002510280] |First, construct the sample space from which the parameters are drawn. [223002510290] |Continuing the baseball example, we draw our parameter assignment $(\theta_1, \ldots, \theta_N, \phi, \kappa)$ from the sample space [223002510300] |$E = [0, 1]^N \times [0, 1] \times (0, \infty)$. [223002510310] |We then take the Lebesgue measurable events $A \subseteq E$ with measure defined by (multivariate) Lebesgue-Stieltjes integration, [223002510320] |$p(A) = \int_A p(\theta, \phi, \kappa) \, d\theta \, d\phi \, d\kappa$, where, of course, $p(E) = 1$. [223002510330] |If $A$ is a simple hypercube in $E$, this works out to the usual sum/integral: [223002510340] |$p(A) = \int_{\kappa_{lo}}^{\kappa_{hi}} \int_{\phi_{lo}}^{\phi_{hi}} \int_{\theta_{1,lo}}^{\theta_{1,hi}} \cdots \int_{\theta_{N,lo}}^{\theta_{N,hi}} p(\theta, \phi, \kappa) \, d\theta_N \cdots d\theta_1 \, d\phi \, d\kappa$.

    Multilevel Models of Coding and Diagnosis with Multiple Tests and No Gold Standards

[223002520070] |Bob Carpenter, Alias-i [223002520080] |I’ll introduce multilevel generalizations of some well known models drawn from the epidemiology literature and evaluate their fit to diagnostic testing and linguistic annotation data. [223002520090] |The analogy is that data annotation for machine learning (or, e.g., ICD-9 coding) is the same kind of process as diagnostic testing. [223002520100] |The observed data consists of multiple testers supplying results for multiple tested units. [223002520110] |True labels for some units may be known (perhaps with selection bias), and not all units need to undergo every test. [223002520120] |In all models, there are parameters for outcome prevalence, the result for each unit, and some form of test accuracy. [223002520130] |I’ll also consider models with parameters for individual test accuracy and bias (equivalently sensitivity and specificity in the binary case) and item difficulty/severity. [223002520140] |I’ll focus on the priors for annotator accuracy/bias and item difficulty, showing how diffuse hyperpriors allow them to be effectively inferred along with the other parameters using Gibbs sampling. [223002520150] |The posterior samples may be used for inferences for diagnostic precision, multiple comparisons of test accuracies, population prevalence, unit-level labels, etc. [223002520160] |I’ll show that the resulting multilevel models can be fit using data simulated according to the model. [223002520170] |I will then fit the model to a range of clinical and natural language data. [223002520180] |I’ll discuss their advantages for inference with epidemiology data ranging from dentists diagnosing caries based on x-rays and oncologists diagnosing tumors based on slides to infection diagnosis based on exams and serum tests, and with natural language data including name spotting, word stemming, and classification. [223002520190] |I’ll conclude by discussing extensions to further pooling through random effects for different testing facilities, different kinds of annotators (e.g. doctors vs. ICD-9 specialists), different kinds of subjects/units (e.g. genetic predisposition to diseases, or articles drawn from different journals), etc. [223002520200] |All the software (Java, R, BUGS) and data discussed in this talk are freely available from the LingPipe sandbox in the hierAnno project. [223002520210] |You may already be familiar with all this from the data annotation thread on this blog. [223002530010] |Multimodal Priors for Binomial Data [223002530020] |Andrey Rzhetsky gave me some great feedback on my talk at U. Chicago. [223002530030] |I want to address one of the points he raised in this blog post: [223002530040] |

    What if the prior is multimodal?

    [223002530050] |This is a reasonable question. [223002530060] |When you look at the Mechanical Turk annotation data we have, sensitivities and specificities are suggestively multimodal, with the two modes representing the spammers, and what I’ve decided to call “hammers” (because the opposite of “spam” is “ham” in common parlance). [223002530070] |Here’s a replay from an earlier post, Filtering Out Bad Annotators: [223002530080] |The center of each circle represents the sensitivity and specificity of an annotator relative to the gold standard from Snow et al.’s paper on Mech Turk for NLP. [223002530090] |

    Beta Priors are Unimodal or Amodal

[223002530100] |But a Beta(α, β) distribution has a single mode at (α − 1) / (α + β − 2) in the situation where α > 1 and β > 1, and has no mode if α ≤ 1 or β ≤ 1. [223002530110] |
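Here’s a small R check of the two regimes (my own example values): Beta(5, 2) has a single interior mode at (5 − 1) / (5 + 2 − 2) = 0.8, while Beta(0.5, 0.5) piles up at the boundaries and has no mode.

    theta <- seq(0.001, 0.999, length.out = 999)

    d1 <- dbeta(theta, 5, 2)        # alpha, beta > 1: unimodal
    theta[which.max(d1)]            # approximately 0.8, the analytic mode

    d2 <- dbeta(theta, 0.5, 0.5)    # alpha, beta < 1: no mode
    which.max(d2)                   # maximum on the grid is at an endpoint, not an interior mode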

    Mixture Priors

[223002530120] |One possibility would be to use a mixture of two beta distributions, where for a mixture weight λ ∈ [0,1] and beta parameters α₁, β₁, α₂, β₂ > 0, we define: [223002530130] |p(θ) = λ Beta(θ | α₁, β₁) + (1 − λ) Beta(θ | α₂, β₂). [223002530140] |
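In R, such a mixture prior density is one line; the particular weight and shape parameters below (a spammy component near chance and a “hammer” component near perfect accuracy) are made-up values for illustration only.

    # two-component beta mixture density for an accuracy parameter theta
    dbetamix <- function(theta, lambda, a1, b1, a2, b2) {
      lambda * dbeta(theta, a1, b1) + (1 - lambda) * dbeta(theta, a2, b2)
    }

    # e.g., 20% spammers near 0.5 and 80% hammers near 0.9 (invented numbers)
    curve(dbetamix(x, lambda = 0.2, a1 = 10, b1 = 10, a2 = 18, b2 = 2),
          from = 0, to = 1, xlab = "sensitivity", ylab = "prior density")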

    Hierarchical Fit

    [223002530150] |We can fit the mixture component along with the parameters of the beta mixture components in the usual way. [223002530160] |It’s just one more variable to sample in the Gibbs sampler. [223002530170] |
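As a sketch of that extra Gibbs step (my own illustration, with the conditioning on everything else left implicit), the component indicator for an annotator whose current sensitivity draw is theta gets resampled from its conditional distribution:

    # one Gibbs update for an annotator's mixture indicator z in {1, 2},
    # given the current theta and the mixture parameters
    sample_z <- function(theta, lambda, a1, b1, a2, b2) {
      w <- c(lambda * dbeta(theta, a1, b1),
             (1 - lambda) * dbeta(theta, a2, b2))
      sample(1:2, size = 1, prob = w)   # sample() normalizes the weights
    }

    sample_z(theta = 0.55, lambda = 0.2, a1 = 10, b1 = 10, a2 = 18, b2 = 2)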

    Hierarchical Priors

    [223002530180] |On the other hand, if you consider baseball batting data, there would be at least two modes in the prior for batting average corresponding to fielding position (i.e. shortstops and second-basemen typically don’t bat as well as outfielders and I imagine there’s more variance in batting ability for infielders). [223002530190] |If we didn’t know the fielding position, it’d make sense to use a mixture prior. [223002530200] |But if we knew the fielding position, we’d want to create a hierarchical model with a position prior nested inside of a league prior. [223002540010] |NYC NLP/ML Group [223002540020] |Joseph Turian‘s back in town and has started a mailing list/newsgroup and Twitter feed for natural language processing (NLP) and machine learning (ML) in New York City (NYC): [223002540030] |
  • NY NLP/ML Google Group [223002540040] |
  • NY NLP/ML Twitter Feed [223002540050] |Joseph’s trying to organize something like the Foo Camps for NLP and ML programming in NY. [223002540060] |If you’re interested, drop him a line, or wait for the announcement. [223002540070] |P.S. Thanks to Joseph for not requiring registration for reading! [223002540080] |P.P.S. Check out some of Joseph’s latest work surveying features for named entity recognition. [223002550010] |LREC 2010 Tutorial: Statistical Models of the Annotation Process [223002550020] |

    17 May 2010, Malta

    [223002550030] |Massimo Poesio and I are going to be giving a half-day tutorial at LREC 2010 in the afternoon of Monday 17 May 2010. [223002550040] |The title is Statistical Models of the Annotation Process. [223002550050] |We’ll be covering most of the material I’ve discussed in this blog’s data annotation thread and more. [223002550060] |What’s really cool is that it’s in Malta! [223002550070] |I’ll be there for most of the LREC 2010 programme. [223002550080] |I’m particularly excited about the 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining and the New Challenges for NLP Frameworks Workshop (where by “framework”, they mean tools like GATE and LingPipe). [223002550090] |I’ve always liked workshops more than conferences. [223002550100] |

    5–6 May 2010, Edinburgh

[223002550110] |Before Malta, I’ll be in Edinburgh 6–7 May 2010 for a reunion of the department in which I did my Ph.D., the Centre for Cognitive Science (originally School of Epistemics). [223002550120] |I’ll be giving a talk on data annotation there, too. [223002560010] |Blegging for Help: Web Scraping for Content? [223002560020] |I’m reduced to blegging (a lexical mashup of “blog” and “beg”). [223002560030] |We run into this problem time and time again and have to tell customers we just don’t know the answer. [223002560040] |If you do, please help by leaving a comment or sending me e-mail and I’ll summarize on the blog. [223002560050] |

    How Can I Scrape Web Content?

    [223002560060] |I really don’t know a good general-purpose method of pulling the content out of arbitrary web pages and leaving the boilerplate, advertising, navigation, etc. behind. [223002560070] |The research papers I’ve seen look at large corpora of documents and build background models for sites to pick out boilerplate from fresh content. [223002560080] |They also resort to techniques like 2D rendering just to put contiguous blocks of HTML together logically. [223002560090] |Then there’s the problem of pulling out text that’s interrupted by advertisements, photos, figures, quotes, etc. [223002560100] |

    Yes, I Can Tidy HTML

    [223002560110] |I love tools like NekoHTML that do a reasonable job of converting any old crufty HTML from the web into XHTML so I can parse it with SAX. [223002560120] |It’s what we use in our web demos for HTML input. [223002560130] |

    No, I Don’t Need Structured Parsing

    [223002560140] |Yes, I know there are tons of things out there that let you build custom rule-based scrapers for particular sites. [223002560150] |It’s a pain, but we’ve done this for fixed content. [223002560160] |But I’m talking about making a query on a search engine and downloading the HTML (or whatever) from the top 500 hits and trying to turn it into something you can feed to a natural language processing system to do something like indexing or information extraction. [223002560170] |

    AJAX and Flash a Plus

    [223002560180] |Just thinking about getting the HTML as rendered given all the AJAX makes my head spin. [223002560190] |I also hear that Flash allows some kind of searchability now, so there might be relevant text in there. [223002560200] |

    We’ll Pay

    [223002560210] |If there’s a service out there that does a decent job of this or software we can buy, we’ll gladly pony up. [223002560220] |It’s not like we want to build this ourselves. [223002570010] |Kohlschütter, Fankhauser, Nejdl (2010) Boilerplate Detection Using Shallow Text Features [223002570020] |A lead from my previous post on scraping text out of boilerplate was the following paper, scheduled to be presented next month (specifically, Saturday 6 February 2010) right here in Brooklyn at ACM’s 3rd annual conference on web search and data mining, WSDM 2010: [223002570030] |
  • Kohlschütter, Christian, Peter Fankhauser, and Wolfgang Nejdl (2010) Boilerplate Detection Using Shallow Text Features. [223002570040] |In WSDM ’10. [223002570050] |

    Boilerpipe

[223002570060] |A Java API-based implementation of the system described in the paper is available under a generous Apache 2.0 license from: [223002570070] |
• Google Code: Boilerpipe [223002570080] |It uses our favorite normalizer, NekoHTML, as a front end. [223002570090] |I’ll take it out for a spin in a separate review. [223002570100] |Being a Java API with free licensing makes it a very tempting option for the likes of us. [223002570110] |

    The Approach(es)

    [223002570120] |Kohlschütter et al. evaluate a range of features working directly over cleaned HTML for scraping content out of boilerplate like navigation and advertisements. [223002570130] |

    HTML Only

    [223002570140] |They do not consider HTML rendering solutions, which simplifies but also limits the applicability of the final system. [223002570150] |Of course, there’s no reason you couldn’t plug in something like Crowbar (thanks for that link, too) and use its HTML instead of the direct download. [223002570160] |

    Newswire vs. Cleaneval

[223002570170] |As Kohlschütter pointed out in a comment on the last post, he and his crew concluded that the Cleaneval bakeoff used highly non-representative web text, and thus wasn’t a good baseline. [223002570180] |As an alternative, they collected a corpus of 250K English news articles downloaded from Google News links from different countries (USA, Canada, UK, South Africa, India, Australia) in four topics (world, technology, sports, entertainment) from over 7500 different sources. [223002570190] |The paper advertises the data’s availability from the following site: [223002570200] |
  • Boilerpipe Data Site [223002570210] |But it’s not there yet (11 January 2010). [223002570220] |

    Four-Way and Two-Way Tasks

    [223002570230] |The authors consider the two-way task of boilerplate versus content and the four-way classification task of boilerplate, full-text/comments, headline, and supplemental. [223002570240] |

    Decision Tree Learner

[223002570250] |They evaluated decision tree learners, though they claim linear SVMs (fit with sequential minimal optimization, aka SMO) had similar performance. [223002570260] |

    Micro-F Evaluation

    [223002570270] |They evaluated on a word-by-word basis. [223002570280] |This is a kind of micro evaluation. [223002570290] |A macro evaluation might rank by block. [223002570300] |

    Features

    [223002570310] |The authors didn’t consider structure other than H, P, DIV, and A. [223002570320] |The features that seemed to work involved number of words and density of text relative to links. [223002570330] |The only global features they use of a collection are the global word frequencies (thus getting at something like stop list type features). [223002570340] |They consider all sorts of local features, such as average word length, total number of words, average sentence length (using a simple heuristic sentence detector), ratio of periods to other characters, number of capitalized words and ratio of capitalized to non-capitalized words, etc. etc. [223002570350] |One of their most useful features is link density, which is number of words in anchor (A) spans compared to total number of words. [223002570360] |They also consider position in the document, though this is tricky with only the raw HTML rather than the rendered output, as relative locations may change visually relative to the underlying HTML. [223002570370] |

    Quotients Features and Context

[223002570380] |The authors extract these basic features from blocks, but then use the current (C), previous (P), and next (N) values in classification. [223002570390] |Specifically, they find it useful to look at the quotient of the number of words in the previous block to the current block (their notation “P/C”). [223002570400] |The authors say “the text flow capturing variants of these features (number of words quotient, text density quotient, …), which relate the value of the current block to the previous one, provide the highest information gain, indicating that intradocument context plays an important role”. [223002570410] |It’d be nice if these papers came with a table of features with simple definitions to make it easy for someone to browse them. [223002570420] |
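To make the feature definitions concrete, here’s a tiny R sketch of my reading of the paper’s description (not their code): link density per block and the previous-to-current word-count quotient, computed from made-up block counts.

    words        <- c(12, 340, 8, 95)   # total words per block (invented)
    anchor_words <- c(10,   6, 7,  3)   # words inside anchor (A) spans per block

    link_density <- anchor_words / words

    # "P/C": number of words in the previous block over the current block
    pc_quotient <- c(NA, head(words, -1) / tail(words, -1))

    cbind(words, link_density, pc_quotient)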

    Sequence Model Decoding?

    [223002570430] |The classification in this paper is apparently done independently in a block-by-block fashion. [223002570440] |I’d think using a sequence model like structured SVMs or CRFs would provide higher accuracy. [223002570450] |Given the small number of categories (i.e. boilerplate/not), it’d be very speedy, too, even with second-order models. [223002570460] |

    (Greedy) Information Gain

    [223002570470] |The authors rank features by information gain. [223002570480] |The big winners are number of words quotient, text density quotient, number of words, and the link and text densities. [223002570490] |Clearly these are related measures, but it’s prohibitive to consider all the possible subsets of features. [223002570500] |

    Conclusions

    [223002570510] |They conclude word count is most important. [223002570520] |Coupled with link density, you’re most of the way to the best results achieved using all local features plus global frequency. [223002570530] |

    Or, Ask Ryan

    [223002570540] |Ryan Nitz, enterprise programmer par excellence, just happened to be over for dinner last night. [223002570550] |I asked him if he’d dealt with this problem and he said he just used the ratio of punctuation to letters. [223002570560] |That seems similar to but simpler and more portable than the stopword approach Aria Haghighi suggested in a comment in the last post. [223002580010] |Nobel (Memorial) Prize for Logistic Regression (aka Discrete Choice Analysis) [223002580020] |I’ve been meaning to blog on this topic ever since Dave Lewis was in town giving a talk to the Columbia stat student seminar. [223002580030] |He was talking about one of his favorite topics, (Bayesian) logistic regression. [223002580040] |It also fits in with (another) one of my favorite topics, the scientific zeitgeist. [223002580050] |As Andrew Gelman once said (and I paraphrase), no matter what model you come up with, someone in psychometrics thought it up 50 years ago (or maybe it was econometrics). [223002580060] |

    Multinomial Regression

[223002580070] |For simplicity, we’ll stick to the simple multinomial regression case where we have K possible output categories. [223002580080] |Our training data then consists of input/output pairs (x, c), where the input x is any old object and the output category c ∈ {1, …, K}. [223002580090] |

    “Max Ent” Feature Extraction

[223002580100] |In the maximum entropy style presentations, there is a feature function φ from pairs of inputs and output categories to D-dimensional feature vectors, so φ(x, c) ∈ ℝ^D for c ∈ {1, …, K}. [223002580110] |The model consists of (the feature extractors and) a single coefficient vector β ∈ ℝ^D. [223002580120] |The probability that an input x is categorized as category c is then [223002580130] |p(c | x) = exp(β · φ(x, c)) / Z(x), [223002580140] |where, as usual, the normalizing constant is computed by summation: [223002580150] |Z(x) = Σ_{c′=1}^{K} exp(β · φ(x, c′)). [223002580160] |The thing to notice here is that a feature in the same position can be produced by the same input in two different categories. [223002580170] |
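Here’s a tiny R sketch of the max-ent-style computation with made-up numbers, representing φ(x, c) as one precomputed row per category:

    beta <- c(0.5, -1.0, 2.0)        # single coefficient vector (invented values)
    phi  <- rbind(c(1, 0, 1),        # phi(x, 1)
                  c(0, 1, 1),        # phi(x, 2)
                  c(1, 1, 0))        # phi(x, 3)

    scores <- as.vector(phi %*% beta)   # beta . phi(x, c) for each category c
    exp(scores) / sum(exp(scores))      # p(c | x), normalized by Z(x)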

    Stats Textbook Feature Extraction

[223002580180] |If you pull out a statistics textbook, you’ll see feature extraction for multinomial logistic regression presented rather differently. [223002580190] |In this setting, we have a feature extractor that works only on inputs, producing a D-dimensional feature vector that does not depend on the output category, ψ(x) ∈ ℝ^D. [223002580200] |The model then consists of a D-dimensional coefficient vector β_c for each output category c ∈ {1, …, K}. [223002580210] |The category probability is then defined by [223002580220] |p(c | x) = exp(β_c · ψ(x)) / Z(x), [223002580230] |with normalizer computed in the usual way: [223002580240] |Z(x) = Σ_{c′=1}^{K} exp(β_{c′} · ψ(x)). [223002580250] |Here, the coefficient vector for each output category is distinct, so there’s no (direct) way to produce the same feature (after multiplication with coefficient) from the same input for two different categories. [223002580260] |
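And the stats-textbook version of the same computation, with a single feature vector ψ(x) and one (invented) coefficient vector per category, the last pegged to zero:

    psi  <- c(1, 0, 1)                  # psi(x), independent of the output category
    Beta <- rbind(c(0.5, -1.0, 2.0),    # beta_1 (invented)
                  c(0.0,  0.3, 1.0),    # beta_2 (invented)
                  c(0.0,  0.0, 0.0))    # beta_3 pegged to zero

    scores <- as.vector(Beta %*% psi)   # beta_c . psi(x) for each category c
    exp(scores) / sum(exp(scores))      # p(c | x)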

    LingPipe’s Approach

    [223002580270] |LingPipe adopts the stats textbook approach for both logistic regression and CRFs. [223002580280] |Mainly because it’s more efficient and I couldn’t find good enough reasons to go with the max ent approach. [223002580290] |I tried to squeeze examples out of people at conferences, via e-mail, over meals, and when I blogged about label-specific features for CRFs! [223002580300] |As is also the case in most stats textbook presentations, without loss of generality (for inference), one of the output category coefficient vectors can be set to zero. [223002580310] |It’s actually required to identify a unique solution, because we can always add or subtract constants from all vectors without changing the predictions. [223002580320] |With priors on the coefficients, you can get a different result pegging one category to zero, so it makes most sense to do this transformation after fitting. [223002580330] |LingPipe does it before fitting, which isn’t the way I’d do it if I had a do-over. [223002580340] |On the other hand, it works well for binary classification problems when everyone uses one coefficient vector (though again, results vary with Gaussian priors because of their non-linearity). [223002580350] |

    Stats Text to Max Ent Conversion

[223002580360] |The conversion from the stats textbook version to the max ent version is simple. [223002580370] |We just define β to be the concatenation of the β_c, setting [223002580380] |β = (β_1, β_2, …, β_K) ∈ ℝ^{K·D}. [223002580390] |For the feature function, we just define it to only set non-zero values in the appropriate range, with [223002580400] |φ(x, c) = (0^{(c−1)·D}, ψ(x), 0^{(K−c)·D}), [223002580410] |where we take 0^D to be the D-dimensional zero vector, and 0^{M·D} to be M instances of the zero vector concatenated. [223002580420] |We get the same result after the translation, because [223002580430] |β · φ(x, c) = β_c · ψ(x). [223002580440] |
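The translation is easy to check numerically; this sketch rebuilds the stats-textbook quantities from the previous sketch, concatenates the coefficient vectors, builds the block-structured φ(x, c), and verifies the scores (and hence the probabilities) match.

    psi  <- c(1, 0, 1)
    Beta <- rbind(c(0.5, -1.0, 2.0), c(0.0, 0.3, 1.0), c(0.0, 0.0, 0.0))
    K <- nrow(Beta); D <- length(psi)

    beta <- as.vector(t(Beta))          # (beta_1, beta_2, beta_3) concatenated

    # phi(x, c): psi(x) in the c-th block, zero vectors elsewhere
    phi <- function(psi, c, K) {
      v <- rep(0, K * length(psi))
      v[((c - 1) * length(psi) + 1):(c * length(psi))] <- psi
      v
    }

    maxent_scores <- sapply(1:K, function(c) sum(beta * phi(psi, c, K)))
    all.equal(as.vector(Beta %*% psi), maxent_scores)   # TRUE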

    Max Ent to Stats Text Conversion

[223002580450] |Uh oh. [223002580460] |This isn’t so easy. [223002580470] |The problem is that in the max ent presentation, we can have the same feature generated for several different categories. [223002580480] |In the stats textbook presentation, each feature is implicitly separated because the coefficients are separated. [223002580490] |The solution is to allow tying of coefficients in the stats textbook model. [223002580500] |That is, we need to be able to enforce β_{c,i} = β_{c′,j} for arbitrary pairs of coefficient positions (c, i) and (c′, j). [223002580510] |Then, we just put the tied features there. [223002580520] |(It’s hard to write out as a formula, but hopefully the idea’s clear.) [223002580530] |

    Worth a Nobel (Memorial) Prize in Econ?

[223002580540] |Daniel McFadden was awarded (half the) 2000 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel for work he’d done on discrete choice analysis (DCA). [223002580550] |As you may have guessed from the lead in, DCA’s just the term researchers in econometrics use for logistic regression. [223002580560] |As Dave pointed out in his talk, they tend to use it in the same way as computational linguists working in the max ent tradition, with lots of feature tying. [223002580570] |McFadden lists his 1974 paper on logit as his first econometrics publication on his list of publications (he’s still teaching intro stats at Berkeley!). [223002580580] |The key definition is formula (12) on page 110, where he gives the probability function for multinomial logistic regression and uses max-ent style feature extraction. [223002580590] |It’s clear from McFadden’s extensive references that statisticians had been fitting binomial logistic regressions since 1950 and multinomials since 1960. [223002580600] |Perhaps not coincidentally, McFadden was working on DCA about the same time Nelder and Wedderburn introduced generalized linear models (GLM), which generalize logistic regression to other link functions. [223002590010] |Discrete Choice Analysis Example: Transportation Mode [223002590020] |I forgot to include the example in yesterday’s post. [223002590030] |I’m reconstructing this from memory, so it doesn’t exactly match the one Dave Lewis used in his talk. [223002590040] |Suppose you have a decision to make among different modes of transportation to get you from home to work: car, bus, train, bicycle, walking. [223002590050] |Each will come with a cost in dollars and time (and risk, but I’ll leave that to Professor Risk). [223002590060] |It makes sense to tie the parameters for cost and for time across the different modes of transportation. [223002590070] |We’ll use dimension 1 of our feature vectors for time (in minutes) and dimension 2 for money (in US dollars). [223002590080] |For instance, x = (15, 1) represents a mode of transportation for a person that takes 15 minutes and costs one dollar. [223002590090] |It’s statistics, so in order to fit a model, we need exchangeable replicates. [223002590100] |In this case, we use multiple commuters. [223002590110] |For instance, suppose we consider commuter 1, for whom walking takes 30 minutes and has no cost, taking the bus takes 20 minutes and costs $2, and driving takes 10 minutes and costs $5. [223002590120] |With our simple model of transportation choice, we represent the three outcomes for commuter 1 as [223002590130] |x[1,walk] = (30, 0) for walking, [223002590140] |x[1,bus] = (20, 2) for taking the bus, and [223002590150] |x[1,drive] = (10, 5) for driving. [223002590160] |We might then have a second person who lives further from work, with correspondingly larger time values in the feature vectors [223002590170] |x[2,walk] for walking, [223002590180] |x[2,bus] for the bus, and [223002590190] |x[2,drive] for driving. [223002590200] |The model parameters are the two coefficients for minutes and dollars, yielding a 2D vector β = (β[time], β[money]). [223002590210] |Suppose we have fit a model (presumably with a well-defined prior), and want to look at predictions. [223002590220] |For person 1 we get probabilities for choice of transportation mode [223002590230] |p(walk) = exp(β · x[1,walk]) / Z[1] for walking, [223002590240] |p(bus) = exp(β · x[1,bus]) / Z[1] for busing, and [223002590250] |p(drive) = exp(β · x[1,drive]) / Z[1] for driving, where Z[1] = exp(β · x[1,walk]) + exp(β · x[1,bus]) + exp(β · x[1,drive]). [223002590260] |This all fits very neatly into the max-ent or DCA style of generating features for a multinomial logistic model. [223002590270] |It even seems easier to conceptualize in this case than tied parameters.
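Here’s a quick R sketch of the prediction step for commuter 1; the coefficient values are invented (chosen only so that longer and pricier commutes score worse), since the point is just the arithmetic of the multinomial logit, not a fitted model.

    beta <- c(time = -0.10, money = -0.30)   # invented coefficients, not fitted values

    x1 <- rbind(walk  = c(30, 0),            # (minutes, dollars) for commuter 1
                bus   = c(20, 2),
                drive = c(10, 5))

    scores <- as.vector(x1 %*% beta)         # beta . x for each mode
    p <- exp(scores) / sum(exp(scores))      # multinomial logit choice probabilities
    round(setNames(p, rownames(x1)), 3)
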
[223002590280] |Of course, if we really wanted to model commuter choices, we’d use a much more complex model including features like (disposable) income, neighborhood, etc., plus some non-linear combinations of existing features. [223002590290] |Not all of these need be shared across choices the way time and money are. [223002600010] |Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased [223002600020] |Update: I turned the crank a little too hard on this post. [223002600030] |While it’s technically correct according to one way of interpreting maximum likelihood, the more standard approach does not suffer the same bias. [223002600040] |I describe the “right” way to do EM in the followup post, [223002600050] |
  • Marginalizing Latent Variables in EM.
[223002600060] |While this is good news for EM (and RNA-Seq), it undercuts one of the motivations for the Bayesian estimator, which I’ll present soon. [223002600070] |I spent much of the winter break looking over Mitzi’s shoulder at papers on RNA-Seq. [223002600080] |For those who don’t know what it is, I added an appendix Background on RNA-Seq going over the problems. [223002600090] |

    Statistical Problem Synopsis

    [223002600100] |As usual, the problem is that the MLE falls on the parameter space boundary. [223002600110] |Because of this, the MLE has what is known as statistical bias. [223002600120] |In the case of MLEs for RNA-Seq expression, the bias is toward a winner-take-all solution in which one isoform is assigned all the uncertain mappings. [223002600130] |

    You’ve Seen Biased MLEs Before

[223002600140] |You’ve seen it in the context of estimating the variance of a distribution from samples y_1, …, y_n. [223002600150] |The maximum likelihood estimate of the mean is the same as the unbiased estimate m, [223002600160] |m = (1/n) Σ_{i=1}^{n} y_i. [223002600170] |But the maximum likelihood estimate of variance is biased: [223002600180] |var_MLE = (1/n) Σ_{i=1}^{n} (y_i − m)². [223002600190] |It’s biased on the low side (because it’s based on the estimate of the mean, not the actual mean). [223002600200] |The unbiased estimate arises by subtracting one from the denominator, [223002600210] |var_unbiased = (1/(n − 1)) Σ_{i=1}^{n} (y_i − m)². [223002600220] |But unlike the case for expression I’ll discuss next, as the sample size approaches infinity, the unbiased and MLE estimates of variance converge to the same value. [223002600230] |
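A quick simulation in R shows both the bias and the fact that it washes out as the sample size grows (standard result; sketch only):

    set.seed(1)
    n <- 5
    sims <- replicate(100000, {
      y <- rnorm(n)                          # true variance is 1
      c(mle = mean((y - mean(y))^2),         # divides by n
        unbiased = var(y))                   # divides by n - 1
    })
    rowMeans(sims)    # mle averages about (n - 1) / n = 0.8; unbiased averages about 1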

    Isoform Expression Example

[223002600240] |Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. [223002600250] |For simplicity, we’ll suppose all the exons are the same length. [223002600260] |Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. [223002600270] |Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. [223002600280] |In table form, we have the following mapping counts:

                     exon 1   exon 2   exon 3
    reads              70       20       50
    isoform 1 (A)      yes      yes      no
    isoform 2 (B)      yes      no       yes

[223002600290] |The 20 reads mapping to exon 2 must be part of isoform 1 (A), and similarly the 50 reads mapping to exon 3 must belong to isoform 2 (B). [223002600300] |That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2. [223002600310] |We’ll let N represent the total (out of 70) number of multi-mapped reads assigned to isoform 1. [223002600320] |

    Calculating Expression and Likelihood

[223002600330] |For a given value of N, we can directly calculate the maximum likelihood estimate for isoform expression for isoform 1 (θ₁) and isoform 2 (θ₂), by [223002600340] |θ₁ = (20 + N) / 140 and θ₂ = (120 − N) / 140. [223002600350] |The (unnormalized) likelihood function for N, which determines θ, is [223002600360] |L(N) = ((20 + N) / 140)^(20 + N) × ((120 − N) / 140)^(120 − N), and the (unnormalized) log likelihood is [223002600370] |log L(N) = (20 + N) log((20 + N) / 140) + (120 − N) log((120 − N) / 140). [223002600380] |Here’s a plot of the (unnormalized) log likelihood versus N that illustrates the problem with the MLE estimator: [223002600390] |The maximum likelihood (which is at the same point as the maximum of the unnormalized likelihood) is achieved at the boundary of the possible parameter values, so the MLE is N = 0. [223002600400] |The MLE thus assigns all the multi-mapped reads to the second isoform, which is why I called it a “winner take all bias”. [223002600410] |The unbiased estimator is the expected value of N, [223002600420] |E[N], [223002600430] |which will obviously fall somewhere between 0 and 70, but closer to 0 than 70 given the skew of the likelihood function. [223002600440] |Just to be clear, it’s not just the MLE that’s a problem. [223002600450] |Maximum a posteriori (MAP) estimates, which use Bayesian priors but not Bayesian estimators, have the same problem. [223002600460] |
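Here’s the same calculation written out in R from the counts above (my transcription of the example, not code from any of the papers); the maximum is at the boundary N = 0.

    N <- 0:70                               # multi-mapped reads assigned to isoform 1
    theta1 <- (20 + N) / 140
    theta2 <- (120 - N) / 140

    # unnormalized log likelihood (the constant 140 * log(1/2) term is dropped)
    loglik <- (20 + N) * log(theta1) + (120 - N) * log(theta2)

    N[which.max(loglik)]                    # 0: all multi-maps assigned to isoform 2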

    Who’s using MLE and MAP?

    [223002600470] |It’s actually even more popular in biology than in computational linguistics. [223002600480] |And while there’s not much that’s been published on RNA-Seq expression profiling, I know these three papers (and would be glad to hear about more): [223002600490] |
  • de Hoon, Michiel J. L. et al. (2010) Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. [223002600500] |Genome Research.
  • Marioni, John C. et al. (2009) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. [223002600520] |Genome Research.
  • Jiang, Hui et al. (2009) Statistical inferences for isoform expression in RNA-Seq. [223002600540] |Bioinformatics.
[223002600550] |Ironically, MLE is being sold as an improvement to the one-step remapping approach outlined in the following two papers, the first of which overlaps significantly in authorship with the de Hoon et al. paper cited above: [223002600560] |
  • Hashimoto, T. et al. (2009) Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. [223002600570] |Bioinformatics
  • Mortazavi, A. et al. (2009) Mapping and quantifying mammalian transcriptomes by RNA-Seq. [223002600590] |Nature Methods.
[223002600600] |While I like the fact that the former papers write down a full generative model, I say “ironically”, because the one-step remapping approach gets the “right” answer for the example above. [223002600610] |The remapping of multi-maps by Mortazavi et al. is done proportionally to the expression computed over the uniquely mapped reads. [223002600620] |In our example, that’s 20 to 50, so the 70 multi-mapped reads are split in the same proportion, with 20 going to isoform 1 and 50 to isoform 2, giving isoform 1 an expression of 40 and isoform 2 an expression of 100. [223002600630] |
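In R, the proportional split is just this arithmetic (a sketch of the calculation, not their code):

    unique_counts <- c(iso1 = 20, iso2 = 50)   # exon 2 and exon 3 reads
    multi <- 70                                # exon 1 reads shared by both isoforms

    unique_counts + multi * unique_counts / sum(unique_counts)   # 40 and 100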

    Stay Tuned for the Next Installment

[223002600640] |In the next post in this series, I’ll show how to estimate the full Bayesian posterior using Gibbs sampling, incorporating probabilistic information from alignment. [223002600650] |From the Bayesian posterior, an unbiased point estimate is easily derived, as is estimation variance (i.e. “confidence”). [223002600660] |I’ll even show how to derive answers when there are no multi-mapped reads, and when exons and isoforms are of different sizes. [223002600670] |

    Background on RNA-Seq

[223002600680] |For mammalian (and other complex) genomes, genes are made up of exons (of a few dozen to a few hundred bases) and introns (from very small to tens of thousands of bases in humans). [223002600690] |RNA is transcribed from a gene’s DNA. [223002600700] |After transcription, the spliceosome chops out the introns and splices the exons together, adds a cap on the 5′ end (start) and a sequence of A bases (the poly(A) tail) on the 3′ end (end). [223002600710] |The result is so-called mature mRNA (messenger RNA), because it is ready to be translated into a protein. [223002600720] |The spliceosome, for various not fully understood reasons (including regulation, cryptic [ambiguous] splice junctions, and perhaps chromatin structure), does not always splice all of the exons together to form mRNA. [223002600730] |In different contexts, alternative splice variants, whose variant protein products are called isoforms, are generated at different levels of expression (concentration). [223002600740] |The ribosome complex translates mature mRNA (whose alphabet has four bases) into the 20 different amino acids making up proteins. Three bases, together called a codon, are used per amino acid, with the so-called genetic code mapping codons to amino acids; there is also a start codon and stop codon so the ribosome knows where to start and end translation. [223002600750] |The (short) portion of the mRNA before the start codon is called the 5′ UTR, and the portion of mRNA between the stop codon and the poly(A) tail is called the 3′ UTR. [223002600760] |In RNA-Seq experiments, mature mRNA is isolated in the lab (typically by grabbing its poly(A) tail), then broken up into short sequences using specialized enzymes called RNases. [223002600770] |The resulting short sequences are then copied to complementary DNA (cDNA), yielding tens of millions of short sequences, which are then randomly attached to a slide (how this is done depends on the particular sequencer). [223002600780] |Then the sequencer reads the bases, which is a very noisy process; if you’re lucky, you get confidence scores along with the reads that can be interpreted as probabilities of errors. [223002600790] |The different machines have different error profiles. [223002600800] |We now have tens of millions of noisily read sequences. [223002600810] |These are next mapped to known genome positions using a combination of indexing and edit distance (there are some really groovy algorithms here; more on those later). [223002600820] |Reads that map to one position with a higher score than all other positions are said to map uniquely (though they may map to other positions with lower likelihood). [223002600830] |Reads that map to multiple positions are called multi-maps. [223002600840] |The genome drafts have been marked up for exons (though this is also noisy) in what are called gene models (no Wikipedia ref I could find!). [223002600850] |Multi-maps are very common if the target is an isoform, because isoforms share exons. [223002600860] |But they also appear elsewhere because of the large number of paralogs (similar sequences) in copies and copy variants of genes in large genomes like those of mammals. [223002600870] |The problem we’re looking at here is calculating isoform expression taking into account multi-maps. [223002610010] |DCA for Discriminative Coreference Resolution [223002610020] |

    DCA isn’t Just Tied Logistic Regression

[223002610030] |After posting about discrete choice analysis (DCA) a few days ago, then adding a post about analyzing transportation choices, it occurred to me that DCA isn’t really just a form of logistic regression with parameter tying. [223002610040] |In its most general form, any number of choices can be presented at run time, whereas in both the machine-learning max-ent style approaches and the stats textbook type approaches to logistic regression, there is always a fixed, finite set of outcome categories. [223002610050] |

    DCA for Within-Document Coreference

[223002610060] |One of the projects on our plate right now is within-document coreference resolution, which involves resolving which noun phrases in a document refer to the same entities. [223002610070] |Thus it’s a bit more general than the classical anaphora resolution problem, which finds antecedents for pronouns. [223002610080] |In our case, two mentions of “John Smith” in the same document need to be linked (if they refer to the same person; they won’t if you’re reading Amit Bagga and Breck’s John Smith resolution paper, or for that matter, the recreation of their results in LingPipe’s clustering tutorial). [223002610090] |LingPipe’s heuristic coreference package is based on Breck’s thesis work (called CogNIAC). [223002610100] |If you troll through the code or read Breck’s paper, you’ll see that it’s essentially a greedy online algorithm that visits the entity mentions in a document in order, and for each mention either links it to a previous linked chain of mentions, or it starts a new chain consisting only of the current mention. [223002610110] |Essentially, it’s a greedy online clustering algorithm. [223002610120] |The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many are intervening), entity type (person, location, and organization, and for persons, gender). [223002610130] |So far, there’s nothing for linking common nouns (as there is in ACE), and nothing for linking compound entities (e.g. “John and Mary …They …”). [223002610140] |

    A DCA Model

[223002610150] |We could just throw all those features into a DCA type model! [223002610160] |The discrete choices consist of each candidate antecedent mention chain along with the choice of creating a new mention chain. [223002610170] |If we can extract a feature vector φ(C, m) given a mention chain C and mention m, we’re in business. [223002610180] |We could potentially even use the other mention chains in generating the features for mention m, as in φ(S, C, m), where S is the set of previous mention chains, C ∈ S is the candidate mention chain, and m is the mention being resolved. [223002610190] |With all existing mention chains in play, we can also generate a reasonable feature vector φ(S, NEW, m) for the choice of creating a new mention chain out of m given the antecedent candidates S. [223002610200] |The model consists of a coefficient vector β. [223002610210] |Given a mention m and set of potential antecedent chains S, the probability of a given antecedent is defined in the usual way for log-linear logistic-type models. [223002610220] |For linking to an antecedent chain C ∈ S, we have [223002610230] |p(C | m, S) ∝ exp(β · φ(S, C, m)). [223002610240] |For a new chain, which we’ll write NEW, we have [223002610250] |p(NEW | m, S) ∝ exp(β · φ(S, NEW, m)), with everything normalized over the candidates in S plus the new-chain choice. [223002610260] |Given annotated data and a feature extractor, the training cases are easy to create. [223002610270] |Each instance consists of the features for all previous mention chains and for creating a fresh chain, along with the decision. [223002610280] |

    Does Any NLP Package Implement DCA?

    [223002610290] |Alas, LingPipe can’t handle these models the way I wrote logistic regression. [223002610300] |In fact, I’m not sure anyone’s NLP package can handle the full DCA type models. [223002610310] |I’m afraid I can never quite figure out from the Mallet javadoc what exactly it can and can’t do because everything’s so abstract and there are so many implementations of the interfaces and so little high-level doc. [223002610320] |But classes like cc.mallet.classify.MaxEnt seem to assume a fixed set of output categories. [223002610330] |The Open NLP javadoc is much more straightforward; their opennlp.maxent.GISModel also requires a fixed set of categories. [223002610340] |It also doesn’t look like Weka supports DCA-style models from the Weka Javadoc, but again, the interfaces are so general, and the implementations so plentiful, I’m not sure. [223002610350] |I’d love to hear about which packages do support this kind of thing. [223002610360] |Presumably something the econometricians are using would do it. [223002610370] |

    Implementing DCA with SGD

[223002610380] |DCA won’t be at all hard to write. [223002610390] |In fact, I’ve just finished CRFs (stay tuned for the release, probably next week). [223002610400] |Stochastic gradient descent (or a quasi-Newton method) is easy to adapt. [223002610410] |The correct antecedent candidate or choice to create a new chain is treated as a positive (1) outcome, with all other candidate antecedent mention chains treated as negative (0) outcomes. [223002610420] |
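Here’s a rough R sketch of the kind of SGD update I have in mind, treating a single resolution decision as a multinomial logit over however many candidate feature vectors happen to be present; the toy features and names are mine.

    # One SGD step for a DCA-style model.  Rows of X are feature vectors for the
    # candidate antecedent chains plus the new-chain choice; y is the row index
    # of the correct choice; beta is the shared coefficient vector.
    sgd_step <- function(beta, X, y, rate = 0.1) {
      scores <- as.vector(X %*% beta)
      p <- exp(scores - max(scores))          # subtract the max for stability
      p <- p / sum(p)
      grad <- X[y, ] - colSums(p * X)         # observed minus expected features
      beta + rate * grad                      # gradient ascent on the log likelihood
    }

    # toy decision: three candidates with two features each, candidate 2 correct
    X <- rbind(c(1, 0), c(0.5, 1), c(0, 0.2))
    beta <- c(0, 0)
    for (i in 1:100) beta <- sgd_step(beta, X, y = 2)
    beta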

    Coefficient Priors

    [223002610430] |Of course, we can put priors on the coefficients. [223002610440] |It sure would be nice to be able to create Bayesian estimators. [223002610450] |Having just blogged about MLE and MAP’s bias, I’m worried about the MAP estimates for logistic regression all these systems (including LingPipe) use. [223002610460] |

    OK, you could use SVMs, too

    [223002610470] |Of course, SVM (hinge), perceptron (0/1), or other simple loss functions could be used instead of the logistic loss function. [223002610480] |SGD will still work as is. [223002610490] |And you could use non-linear kernels, which often squeeze a bit more accuracy out for very little work (contrasted with hand tuning non-linear features, such as interaction features, in a standard linear model). [223002610500] |

    Coreference Training Data?

[223002610510] |Well, there was MUC-7, which kicked things off, and then ACE more recently. [223002610520] |MUC-6 and MUC-7 and several years of ACE datasets are available from the LDC for a fee. [223002610530] |A couple months ago, I blogged about Massimo Poesio et al.’s effort to collect a coreference corpus using the Phrase Detectives annotation game. [223002610540] |This should be out soon, and should be much larger than the MUC and ACE datasets. [223002620010] |Marginalizing Latent Variables in EM [223002620020] |After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I “turned the crank” (an idiom in mathematics for solving an equation by brute force without thinking too much about it). [223002620030] |

    Expression Mixture Model

[223002620040] |Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. [223002620050] |We defined a joint likelihood function for a sequence of observed reads y, gene or splice variant sources z, and mRNA expression θ, setting [223002620060] |p(y, z | θ) = Π_n p(y_n | z_n) p(z_n | θ). [223002620070] |

    Maximum Likelihood

    [223002620080] |Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters. [223002620090] |

    What’s a Parameter?

[223002620100] |So that’s what I did, calculating [223002620110] |(θ*, z*) = argmax_{θ, z} p(y, z | θ). [223002620120] |The problem is that there’s an implicit “except the latent parameters” inside the usual understanding of the MLE. [223002620130] |What I should have done is marginalize out the latent gene sources z, taking [223002620140] |θ* = argmax_θ p(y | θ). [223002620150] |That requires marginalizing out the latent mapping of reads to gene or splice variant sources z, [223002620160] |p(y | θ) = Σ_z p(y, z | θ) = Π_n Σ_{z_n} p(y_n | z_n) p(z_n | θ). [223002620170] |On a log scale, that’s [223002620180] |log p(y | θ) = Σ_n log Σ_{z_n} p(y_n | z_n) p(z_n | θ). [223002620190] |

    The Example

    [223002620200] |I’ll repeat the example here for convenience: [223002620210] |

    Isoform Expression Example

[223002620220] |Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. [223002620230] |For simplicity, we’ll suppose all the exons are the same length. [223002620240] |Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. [223002620250] |Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. [223002620260] |In table form, we have the following mapping counts:

                     exon 1   exon 2   exon 3
    reads              70       20       50
    isoform 1 (A)      yes      yes      no
    isoform 2 (B)      yes      no       yes

[223002620270] |The 20 reads mapping to exon 2 must be part of isoform 1 (A), and similarly the 50 reads mapping to exon 3 must belong to isoform 2 (B). [223002620280] |That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2. [223002620290] |

    Turning the Crank In the Right Direction

[223002620300] |By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant. [223002620310] |Now when we turn the crank, we see that [223002620320] |log p(y | θ) = Σ_n log Σ_{z_n} p(y_n | z_n) p(z_n | θ)
    = 70 log(θ₁/2 + θ₂/2) + 20 log(θ₁/2 + 0·θ₂) + 50 log(0·θ₁ + θ₂/2)
    = 70 log(θ₁/2 + θ₂/2) + 20 log(θ₁/2) + 50 log(θ₂/2)
    = 70 log(1/2) + 20 log(θ₁/2) + 50 log(θ₂/2). [223002620330] |The reduction on the third line is because the probability of a read in the second exon, p(y = 2 | z), is only non-zero for isoform z = 1, and similarly for a read in the third exon, where p(y = 3 | z) is only non-zero for the second gene or splice variant, z = 2. [223002620340] |The final reduction follows because θ₁ + θ₂ = 1. [223002620350] |
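The marginalized log likelihood is easy to evaluate numerically; here’s a quick R check of the algebra above, which also locates the maximum.

    theta1 <- seq(0.001, 0.999, by = 0.001)

    # marginal log likelihood with the read assignments summed out
    loglik <- 70 * log(1/2) + 20 * log(theta1 / 2) + 50 * log((1 - theta1) / 2)

    theta1[which.max(loglik)]   # about 2/7 = 0.286: expression 40 out of 140 for isoform 1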

    Not so Biased After All

[223002620360] |After marginalizing out the read mappings z, the MLE for expression is right where we’d expect it, at the solution of [223002620370] |∂/∂θ₁ [20 log(θ₁/2) + 50 log((1 − θ₁)/2)] = 0, [223002620380] |which is [223002620390] |θ = (θ₁, θ₂) = (2/7, 5/7), i.e., expression of 40 out of 140 reads for isoform 1 and 100 out of 140 for isoform 2. [223002620400] |It turns out EM isn’t so biased for these discrete mixtures, at least when you turn the crank the right way. [223002630010] |Dekel and Shamir (2009) Vox Populi: Collecting High-Quality Labels from a Crowd [223002630020] |Aleks just sent me a pointer to this paper from last year’s COLT: [223002630030] |
  • Dekel, Ofer and Ohad Shamir. 2009. [223002630040] |Vox Populi: Collecting High-Quality Labels from a Crowd. [223002630050] |In COLT. [223002630060] |The paper presents, proves theorems about, and evaluates a very simple idea in the context of crowdsourcing binary classifier data coding. [223002630070] |Unlike other approaches I’ve blogged about, Dekel and Shamir use only one coder per item, and often coders only label a handful of items. [223002630080] |They train an SVM on all the labels, then use the SVM to evaluate coders. [223002630090] |They then prune out low-quality coders using the SVM as a reference. [223002630100] |This paper reminded me of (Brodley and Friedl 1999), which was also about using multiple trained classifiers to remove bad labels. [223002630110] |Brodley and Friedl remove items on an item by item basis rather than on an annotator by annotator basis. [223002630120] |This paper is also somewhat reminiscent of (Raykar et al. 2009), who train a classifier and use it as another annotator. [223002630130] |

    Lots of Theory

    [223002630140] |There’s lots of theory saying why this’ll work. [223002630150] |It is a COLT paper after all. [223002630160] |

    Evaluation: Web Page Query Relevance

    [223002630170] |They ran an eval on Amazon’s Mechanical Turk asking people to judge a pair consisting of a web query and web page as to whether the page is relevant to the query. [223002630180] |They used 15 coders per query/page pair in order to define a gold standard by voting. [223002630190] |They then evaluated their approach using only one coder per item by subsampling their larger data set. [223002630200] |One thing I noticed is that as they lowered the maximum number of items labeled by an annotator, their average annotator accuracy went up. [223002630210] |This is consistent with our findings and those in Snow et al. that the spammy Mechanical Turkers complete a disproportionate number of tasks. [223002630220] |With only one coder per item, Dekel and Shamir can’t evaluate the way everyone else is evaluating crowdsourcing, because they know their resulting data will remain noisy because it’s only singly annotated. [223002630230] |Estimates of coder sensitivity and specificity could be made based on their performance relative to the SVM’s best guesses. [223002630240] |That’d provide some handle on final corpus quality in terms of false positives and negatives relative to ground truth. [223002630250] |Rather than evaluating trained classifier performance, Dekel and Shamir measure the “noise level” of the resulting data set after pruning. [223002630260] |What we really care about in practice is how much pruning bad annotators helps train a more reliable classifier (or help evaluate if that’s the goal). [223002630270] |They discuss this kind of end-to-end evaluation under “future work”! [223002630280] |It’s really striking how different the evaluation versus theory requirements are for different conferences. [223002650010] |Comparing Precision-Recall Curves the Bayesian Way? [223002650020] |

    Back Story

[223002650030] |There’s an interesting thread on the BioNLP mailing list (here’s a link to the publicly readable thread). [223002650040] |The thread was kicked off when Andrew Clegg asked: [223002650050] |Suppose I have two precision-recall curves for two different IR algorithms on the same test data and query. [223002650060] |What would be the correct statistical test to determine if they are significantly different? [223002650070] |I’m particularly sympathetic to (part of) Bill Hersh’s second response. [223002650080] |…All statistical inference tells you is how likely your differences are due to chance. [223002650090] |It says nothing about whether the difference is important. … [223002650100] |to which Andrew replied [223002650110] |…in my case it’s the other way round, I think we’ve already displayed a clear benefit in terms of usability, I’m just thinking of including significance testing in case a very picky reviewer asks if it’s down to chance. [223002650120] |Can you say CYA? [223002650130] |There were a host of classical tests proposed, including the new-to-me Page trend test. [223002650140] |Others suggested taking an aggregate statistic like area-under-the-curve and doing a standard difference test. [223002650150] |

    ROC Curves vs. PR Curves

[223002650160] |There’s much more literature on receiver operating characteristic curves (aka ROC curves) than precision-recall (PR) curves. [223002650170] |Just as a refresher, if we take a given number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), precision, recall, sensitivity and specificity are: [223002650180] |Sensitivity = TP / (TP + FN) [223002650190] |Specificity = TN / (TN + FP) [223002650200] |Recall = Sensitivity = TP / (TP + FN) [223002650210] |Precision = TP / (TP + FP) [223002650220] |If we have a ranked list of outputs for which we know their gold-standard category (relevant or not in the case of IR), we can consider initial segments of the ranked list and compute the precision/recall at that point. [223002650230] |The ROC curve plots 1-specificity (the false positive rate) versus sensitivity (the true positive rate) at various operating points for an algorithm. [223002650240] |The PR curve plots precision versus recall. [223002650250] |In PR curves, it is often the case that only “interpolated” results are considered, which means taking the convex hull of the curve. [223002650260] |This is easily accomplished by making sure the function is non-increasing in recall by propagating later higher results back. [223002650270] |For instance, if precision at 1 recall is 1/2 and precision at 2 recall is 2/3 and at 3 recall is 1/2, then the precision at 1 recall is interpolated to 2/3 so the function is non-increasing. [223002650280] |(I’m not sure what this does to hypothesis testing.) [223002650290] |There’s a lot more information and examples in the class documentation for [223002650300] |
  • LingPipe’s classify.ScoredPrecisionRecallEvaluation [223002650310] |

    The Bayesian Way

    [223002650320] |It’s in exactly these kinds of situations where a Bayesian approach seems natural. [223002650330] |Just add a prior and compute a posterior over entire PR curves. [223002650340] |Then compare the PR curves any way you want. [223002650350] |Say by calculating how likely it is for one PR curve to be above the other at every operating point. [223002650360] |Or how likely it is for one to be above the other at a given operating point (say 99% recall). [223002650370] |Or how likely one is to be substantially better than another, or how much you expect one to be higher than another. [223002650380] |See my earlier post on Bayesian counterpart to Fisher’s exact test on contingency table data. [223002650390] |The nifty part is that “adjustments” for multiple comparisons are built in. [223002650400] |So we can ask what’s the posterior probability of one system having at least 1% higher precision than all others at 99% recall. [223002650410] |But once we’re doing multiple comparisons, it makes sense to build a hierarchical model. [223002650420] |Often, a system will report results on multiple queries, so there’s a set of precision-recall curves for each system, leading to two levels of grouping. [223002650430] |I provide an example of Bayesian multiple comparisons in a hierarchical model in my earlier blog post on Bayesian batting ability with multiple comparisons. [223002650440] |

    Will it Work?

[223002650450] |How do I model the posterior PR (or ROC) curve? [223002650460] |From that, a regular or interpolated PR (or ROC) curve can be generated if desired. [223002650470] |I’m specifically worried about precision and recall sharing the TP term in their numerators and denominators; ROC curves seem simpler. [223002650480] |I’m also worried about the low counts at the low recall end of the curves. [223002650490] |And about the correlations among the different operating points. [223002650500] |It doesn’t seem that a naive independent binomial for each recall point would be justified, though it’d make the posteriors very clean, especially with uniform priors. [223002650510] |For instance, suppose that after the first 5 true positives in the systems’ ranked lists, we have seen 7, 10, 4, 11, and 21 false positives for the five systems. [223002650520] |The maximum likelihood point estimate of precision (equivalently, the posterior mode under a uniform prior) is then 5/12, 5/15, 5/9, 5/16, and 5/26. [223002650530] |Can we just use Beta(6,8), …, Beta(6,22) as the posteriors for comparison? [223002650540] |Somehow it just seems wrong. [223002650550] |
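For what it’s worth, here’s what the naive version looks like in R: independent Beta posteriors for precision at this operating point under uniform priors, compared by Monte Carlo. This bakes in exactly the independence assumption I’m worried about above, so treat it as a strawman sketch rather than a recommendation.

    set.seed(42)
    tp <- 5
    fp <- c(7, 10, 4, 11, 21)       # false positives seen by the five systems

    # posterior draws of precision under a uniform prior: Beta(1 + TP, 1 + FP)
    draws <- sapply(fp, function(f) rbeta(1e5, 1 + tp, 1 + f))

    # posterior probability that system 3 (4 false positives) beats each of the others
    colMeans(draws[, 3] > draws[, -3])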

    Any Ideas?

    [223002650560] |Does anyone know how to compute PR or ROC posteriors appropriately? [223002650570] |

    Does Anyone Have Data?

    [223002650580] |Andrew, or anyone: if you have the ranked data and gold standard answers, I’d love to code this up in R/BUGS and see what it looks like in the context of data someone cares about.