[223002160010] |Brodley and Friedl (1999) Identifying Mislabeled Training Data [223002160020] |Today’s blast from the past is the descriptively titled: [223002160030] |
  • Brodley, Carla E. and Mark A. Friedl. 1999. [223002160040] |Identifying mislabeled training data. [223002160050] |JAIR. [223002160060] |It’s amazing this paper is so long given the simplicity of their approach —it’s largely careful evaluation on three approaches versus five data sets and a thorough literature survey. [223002160070] |It’s also amazing just how many people have tackled this problem over time in more or less the same way. [223002160080] |The approach is almost trivial: cross-validate using multiple classifiers, throw away training instances on which the classifiers disagree with the gold standard, then train on the remaining items. [223002160090] |They consider three forms of disagreement: at least one classifier disagrees, a majority of classifiers disagree, and all classifiers disagree with the gold label. [223002160100] |They consider three classifiers, 1-nearest-neighbor with Euclidean distance, linear classifiers (using the “thermal training rule”, whatever that is), and C4.5-trained decision trees. [223002160110] |They conclude that filtering improves accuracy, though you’ll have to be more patient than me to dissect the dozens of detailed tables they provide for more inisght. [223002160120] |(Good for them for including confidence intervals; I still wish more papers did that today.) [223002160130] |The authors are careful to set aside 10% of each data set for testing; the cross-validation-based filtering is on the other 90% of the data, to which they’ve introduced simulated errors. [223002160140] |One could substitute annotators for algorithms, or even use a mixture of the two to help with active learning. [223002160150] |And you could try whatever form of learners you like. [223002160160] |What I’d like to see is a comparison with probabilistic training on the posterior category assignments, as suggested by Padhraic Smyth et al. in their 1994 NIPS paper, Inferring ground truth from subjective labelling of Venus images. [223002160170] |I’d also like to see more of an analysis of noisy training data on evaluation, along the lines of Lam and Stork’s 2003 IJCAI paper, Evaluating classifiers by means of test data with noisy labels. [223002160180] |Because my experience is that gold standards are less pure than their audience of users imagines. [223002170010] |Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebhoz-Schumann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval [223002170020] |I’m about to implement the technique for assigning Medical Subject Heading (MeSH) terms to text described in this paper from the TLAs EBI, HMI and TNO ICT: [223002170030] |
  • Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebhoz-Schumann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval. [223002170040] |Bioinformatics. [open access home, including link to the Methodology Supplement] [223002170050] |The same technique could be applied to assign Gene Ontology (GO) terms to texts, tags to tweets or blog posts or consumer products, or keywords to scientific articles. [223002170060] |

    k-NN via Search

    [223002170070] |To assign MeSH terms to a text chunk, they: [223002170080] |
  • convert the text chunk to a search query, [223002170090] |
  • run the query against a relevance-based MEDLINE index, then [223002170100] |
  • rank MeSH terms by frequency in the top k (k=10) hits. [223002170110] |In other words, k-nearest-neighbors (k-NN) where “distance” is implemented by a relevance-based search. [223002170120] |

    Just the Text, Ma’am

    [223002170130] |Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation. [223002170140] |k-NN implemented as search could be faceted to include journals, years, authors, etc. [223002170150] |For products, this could include all the facets seen on sites like Amazon or New Egg. [223002170160] |

    Language-Model-Based Search

    [223002170170] |They use language-model-based search, though I doubt that’s critical for success. [223002170180] |Specifically, they estimate the maximum-likelihood unigram language model for the query and interpolated (with a model trained on all documents) model for each document, and then rank documents versus that query by cross-entropy of the query model given the document model (given the MLE estimate for the query, this is just the log probability of the query in the document’s LM.) [223002170190] |Other LM-based search systems measure similarity by KL-divergence. [223002170200] |There weren’t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij’s Ph.D. thesis on LM-based IR. [223002170210] |

    Application to MEDLINE

    [223002170220] |The text being assigned MeSH terms was another MEDLINE title-plus-abstract. [223002170230] |This may seem redundant given that MEDLINE citations are already MeSH annotated, but it’s useful if you’re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document). [223002170240] |It’s easy to apply the authors’ approach to arbitrary texts, such as paragraphs out of textbooks or full text articles or long queries of the form favored by TREC. [223002170250] |

    Efficient k-NN!

    [223002170260] |I did a double-take when I saw k-nearest-neighbors and efficiency together. [223002170270] |As we all know, k-NN scales linearly with training set size and MEDLINE is huge. [223002170280] |But, in this case, the search toolkit can do the heavy lifting. [223002170290] |The advantage of doing k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as are found in MEDLINE itself. [223002170300] |

    Error Analysis

    [223002170310] |The authors did a nice job in the little space they devoted to error analysis, with more info in the supplements (PR curves and some more parameter evaluations and the top few hits for one example). [223002170320] |They reported that k-NN was better than some other systems (e.g. thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudocuments) at assigning the sparse set of MeSH terms found in actual MEDLINE citations. [223002170330] |Errors tended to be more general MeSH terms that just happened to show up in related documents. [223002170340] |I’d also like to see how sensitive performance is to the parameter setting of k=10, as it was chosen to optimize F measure against the sparse terms in MEDLINE. [223002170350] |(All results here are for optimal operating points (aka oracle results), which means the results are almost certainly over-optimistic.) [223002170360] |

    What You’ll Need for an Implementation

    [223002170370] |It should be particularly easy to reproduce for anyone with: [223002170380] |
  • half a clue about Lucene (search), [223002170390] |
  • a MEDLINE parser, and [223002170400] |
  • optionally, a MeSH parser. [223002170410] |Look for a demo soon. [223002170420] |

    Discriminative Semi-Supervised Classifiers?

    [223002170430] |It’d be nice to see an evaluation with text generated from MeSH and articles referencing those terms that used any of the semi-supervisied or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier like logistic regression or SVMs. [223002170440] |

    Improving IR with MeSH (?)

    [223002170450] |I didn’t quite follow this part of the paper as I wasn’t sure what exactly they indexed and what exactly they used as queries. [223002170460] |I think they’re assigning MeSH terms to the query and then adding them to the query. [223002170470] |Presumably they also index the MeSH terms for this. [223002170480] |I did get that only the best MeSH-term assigner improved search performance (quantized to a single number using TREC’s mean average precision metric). [223002170490] |Alternatives like generating a 24,000-category multinomial classifier are possible, but won’t run very fast (though if it’s our tight naive Bayes, it might be as fast as the authors [223002180010] |Active Learning High Confidence Items for High Precision? [223002180020] |A popular active learning strategy is to label new items about which the current classifier is uncertain. [223002180030] |The usual citation is the 1994 ICML paper by Dave Lewis and Jason Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning. [223002180040] |The intuition is that this focuses the learner on the marginal cases. [223002180050] |I just realized in my response to a comment from Ramesh Nallapati, that what we’ve been doing here at Alias-i is exactly the opposite of uncertainty sampling. [223002180060] |We sample from the highest confidence positive examples of a category for new labels. [223002180070] |This is most likely to uncover highly misclassified negative examples, the bane of high precision systems. [223002180080] |(I say “we”, but this was mainly Breck’s idea, as he does most of our on-the-ground application development with customers.) [223002180090] |Our hypothesis is that this is the quickest route to high precision classifiers. [223002180100] |I know I mainly talk about high-recall systems in the blog, as that’s what we’re focusing on for biology researchers as part of our NIH grant. [223002180110] |But our commercial customers are often most concerned about not looking bad and will put up with some missed cost savings or opportunities. [223002180120] |Our general focus is on high precision or high recall operating points, as we rarely interpret customers’ requests as wanting maximum F measure. [223002180130] |For high recall, we often focus our tuning efforts on the least confident positive examples and the more highly ranked false positives to see if we can move up the relative rank of the positive examples. [223002180140] |In some classification problems, the marginal examples are truly marginal and trying to tease them into absolute categories is counterproductive because the hair-splitting involved is so frustrating. [223002180150] |It’s much much faster, not to mention more fun, to make decisions on high-confidence items. [223002180160] |Remember, a lot of UI is also about perception of time, not the amount of time itself. [223002180170] |The time benefits from labeling easier examples may compound with even greater perceived speed. [223002180180] |Has anyone else tried this kind of “heterogeneous certainty sampling”? [223002180190] |We don’t have any contrastive data. [223002190010] |When to Catch, Pass, or Throw Exceptions? [223002190020] |

    Rule 1. [223002190030] |When to Catch an Exception

    [223002190040] |Catch an exception if you can solve the problem raised by the exception (e.g. by waiting for system resources, balking on a client request in an orderly fashion, interrupting a thread, logging the problem and continuing, or freeing memory). [223002190050] |

    Rule 2. [223002190060] |When to Throw an Exception

    [223002190070] |Throw an exception with an appropriately specific type when a problem arises that you can’t solve. [223002190080] |Runtime exceptions are for programming errors (e.g. null pointers,index out of bounds, or illegal arguments). [223002190090] |Errors are for system resource errors (e.g. out of memory). [223002190100] |Checked exceptions are for predictable forms of application-specific errors (e.g. ill-formed input for parsers, busy network resources, or file permissions). [223002190110] |

    Corollary 3. [223002190120] |When to Pass an Exception

    [223002190130] |Pass any exception that is raised that you don’t want to catch. [223002190140] |Pass the exception as-is if possible. [223002190150] |This requires no code, but for checked exceptions, requires an appropriate try/catch block or a method/constructor declared to throw the exception. [223002190160] |If the exception may not be passed as is, use a try/catch to catch it and throw a new reasonably typed exception that wraps the adapted exception. [223002190170] |That is, don’t just throw an instance of RuntimeException, which is neither an error nor specific enough for a client to catch. [223002200010] |What is “Bayesian” Statistical Inference? [223002200020] |Warning: Dangerous Curves This entry presupposes you already know some math stats, such as how to manipulate joint, marginal and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). [223002200030] |You should also be familiar with the difference between model parameters (e.g. a regression coefficient or Poisson rate parameter) and observable samples (e.g. reported annual income or the number of fumbles in a quarter of a given American football game). [223002200040] |

    Examples

    [223002200050] |I followed up this post with some concrete examples for binary and multinomial outcomes: [223002200060] |
  • Beta-Binomial: [223002200070] |
  • Bayesian Posteriors vs. MLE Estimates for Binomial Batting Ability [223002200080] |
  • Bayesian Point Estimators of Batting Ability [223002200090] |
  • Hierarchical Bayesian Batting Ability with Multiple Comparisons [223002200100] |
  • Bayesian Counterpart to Fisher Exact Test [223002200110] |
  • Dirichlet-Multinomial: Bayesian Naive Bayes [223002200120] |

    Bayesian Inference is Based on Probability Models

    [223002200130] |Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. [223002200140] |Bayesian statistical inference is carried out using standard probability theory. [223002200150] |

    What’s a Prior?

    [223002200160] |The full Bayesian probability model includes the unobserved parameters. [223002200170] |The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. [223002200180] |The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution. [223002200190] |

    Non-Bayesian Statistics

    [223002200200] |Non-Bayesian statisticians eschew probability models of unobservable model parameters. [223002200210] |Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as definining the probability that a model parameter (such as the mean height of an adult male American) is in a defined range say (say 5’6″ to 6’0″). [223002200220] |Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades. [223002200230] |

    Bayesian Technical Apparatus

    [223002200240] |The sampling distribution models the probability of observable data given unobservable model parameters . [223002200250] |The prior distribution models the probability of the parameters . [223002200260] |The full joint distribution over parameters and data is computed with the chain rule, . [223002200270] |The posterior distribution of the parameters given the observed data is derived from the sampling and prior distributions via Bayes’s rule, [223002200280] |The posterior predictive distribution for new data given observed data is the average of the sampling distribution over parameters proportional to their posterior probability, [223002200290] |The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. [223002200300] |In particular, the posterior is an overdispersed variant of the sampling distribution. [223002200310] |The extra dispersion arises by integrating over the posterior. [223002200320] |

    Conjugate Priors

    [223002200330] |Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. [223002200340] |For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. [223002200350] |With a beta posterior and binomial sampling distribuiton, the predictive posterior distribution is beta-binomial, the overdispersed form of the binomial. [223002200360] |If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the predictive posterior distribution is negative-binomial, the overdispersed form of the Poisson. [223002200370] |

    Point Estimate Approximations

    [223002200380] |An approximate alternative to full Bayesian inference uses for prediction, where is a point estimate. [223002200390] |The maximum of the posterior distribution provides the-so called maximum a posteriori (MAP) estimate, [223002200400] |If the prior is uniform, the MAP estimate is called the maximum likelihood estimate (MLE), because it maximizes the likelihood of the data . [223002200410] |The MLE is popular among non-Bayesian statisticians because the prior may be dropped from the optimization because it only contributes a constant factor. [223002200420] |By definition, the unbiased estimator for the parameter is the expected value of the posterior, [223002200430] |A Bayesian point estimator chooses the estimate that minimizes expected loss between an estimate and the actual value. [223002200440] |The expectation is over the posterior distribution for the value being estimated. [223002200450] |If the loss function is the squared difference between the estimate and actual value, the Bayes estimate is the mean of its posterior, as given above. [223002200460] |This is because the mean of a set of values is the estimate that minimizes square loss versus those values. [223002200470] |Point estimates may be reasonably accurate if the posterior has low variance. [223002200480] |If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. [223002200490] |This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model. [223002210010] |Batting Averages: Bayesian vs. MLE Estimates [223002210020] |It’s American baseball season, and I’ve been itching to do more baseball stats. [223002210030] |So in the interest of explaining Bayesian posterior inference with an example… [223002210040] |

    Batting Averages

    [223002210050] |A player’s batting average is his number of “hits” divided by his number of “at bats“. [223002210060] |

    Binomial Model of Batting

    [223002210070] |For simplicitly, we’ll model a sequence of at bats using a binomial distribution with a probability of getting a hit in each at bat. [223002210080] |This assumes that the sequence of at bats is independent and identically distributed, aka iid. [223002210090] |Of course this is wrong. [223002210100] |All models are wrong. [223002210110] |It’s just that this one’s more wrong than usual, despite the popularity of the batting average as a summary statistic. [223002210120] |(See the second last section for a reference to a better model.) [223002210130] |

    Maximum Likelihood Estimate (MLE)

    [223002210140] |Given a sample (aka training data) consisting of a player with hits in at bats (where ), the maximum-likelihood estimate (MLE) of the ability is , aka the player’s batting average. [223002210150] |

    Uniform Prior

    [223002210160] |We assume a uniform (aka noninformative) prior distribution over abilities. [223002210170] |This is also wrong. [223002210180] |We know major league position players don’t bat under 0.200 or they’d be fired, and no one’s had a batting average of over 0.400 since Ted Williams. [223002210190] |But we’ll have to come back to issues with the prior later. [223002210200] |This post is just to clarify what integration over the posterior does for inference, not what an informative prior does. [223002210210] |

    Bayesian Posterior Estimate

    [223002210220] |The beta prior is conjugate to the binomial. [223002210230] |Because the uniform distribution may be expressed as a beta distribution, , the posterior is a beta distribution. [223002210240] |Because the beta and binomial play nicely together, the posterior ability distribution may be expressed analytically as . [223002210250] |The maximum of is , so the maximum a posteriori (MAP) estimate of is also , the same as the maximum likelihood estimate. [223002210260] |

    Posterior Predictive Inference

    [223002210270] |Now let’s look at prediction. [223002210280] |For concreteness, suppose we’ve observed M=10 at bats, of which m=3 were hits, for a batting average of 3/10 = 0.300. [223002210290] |What we want to know is the likelihood of n hits in the next N=5 at bats. [223002210300] |There are six possible outcomes, 0, 1, 2, 3, 4, or 5 hits. [223002210310] |Using the MLE or MAP point estimate, this distribution is . [223002210320] |The Bayesian estimate requires integrating over the posterior, effectively averaging the binomial prediction for each ability by the ability’s probability. [223002210330] |The posterior in this case is , so the integral results in the well known beta-binomial distribution: [223002210340] |I solved the integral numerically using Monte Carlo integration, which is a fancy way of saying I took a bunch of samples from the beta posterior and plugged them into and averaged the resulting predictions. [223002210350] |It’s a few lines of R (see below). [223002210360] |The beta-binomial may be expessed analytically, but that’s even more typing than the Monte Carlo solution. [223002210370] |I’m sure there’s an R package that’d do it for me. [223002210380] |

    Comparing the Results

    [223002210390] |The important issue is to look at the outputs. [223002210400] |Given that we’ve seen a batter get 3 hits in 10 at bats, here’s our prediction using a binomial model with uniform prior, shown with both the MLE/MAP point estimate and the average of the Bayesian predictions: [223002210410] |The Bayesian estimate here is more dispersed (flatter, higher variance, higher entropy) than the MLE estimate. [223002210420] |Of course, a true Bayesian estimate would provide a joint distribution over outcomes 0,1,2,3,4,5 (which reduce to a simplex, because any 5 values determine the 6th) for downstream reasoning. [223002210430] |The nice thing about passing uncertainty along by integration is that it’s very plug-and-play. [223002210440] |

    Advanced Reading

    [223002210450] |Luckily for us baseball stats geeks, there’s some advanced reading hot off the press from Wharton Business School: [223002210460] |
  • Jensen, Shane T., Blake McShane, and Abraham J. Wyner. 2009. [223002210470] |Hierarchical Bayesian Modeling of Hitting Performance in Baseball. [223002210480] |Bayesian Analysis. [223002210490] |

    R Source

    [223002210500] |Here’s how I generated the table. [223002220010] |LaTeX on WordPress-Hosted Blogs [223002220020] |If WordPress is hosting your blog, it’s very easy. [223002220030] |For instance, if I type: [223002220040] |what you see is: [223002220050] |The sum of the first integers, . [223002220060] |Let me take a second to say [223002220070] |w00t! [223002220080] |and [223002220090] |yatta! [223002220100] |

    How’s it Work?

    [223002220110] |It somehow (i.e. no doc on what exacty it does) preprocesses the blog post, and replaces the LaTeX with links to PNG images, such as: [223002220120] |http://s2.wordpress.com/latex.php?latex=n&bg=ffffff&fg=000000&s=0 [223002220130] |If you click on the link, you get a PNG image of the letter “n” as set in LaTeX math mode. [223002220140] |

    More Details

    [223002220150] |The WordPress Support Page also explains how to change font size and foreground/background colors (you can see the parameters coded into the URL above). [223002220160] |It also points out that you can use the usual math packages, amsmath, amsfonts, and amssymb. [223002220170] |

    Oversearch

    [223002220180] |Why couldn’t I find this earlier? [223002220190] |I thought I needed a plugin, so I searched for [wordpress plugin latex] and then started trying to figure out how to install plugins on WordPress-hosted sites. [223002220200] |Doh! [223002220210] |

    Escapes All the Way Down

    [223002220220] |So now the question arises of how I managed to show you this text on the blog, given that WordPress automatically replaces the text [223002220230] |with the relevant LaTeX output? [223002220240] |I know two ways to do this, discovering both of which involved guessing at how the WordPress LaTeX plugin does substitution. [223002220250] |You can use a dummy span element, which isn’t rendered as anything in most styles: [223002220260] |or with unicode escapes: [223002220270] |In general, “&#xNNNN” is the way to include unicode character with code point 0xNNNN (the “NNNN” is a four-digit or two-digit hex number). [223002220280] |I wind up spending lots of time looking at unicode.org’s code point charts. [223002230010] |Moment-Matching “Empirical” Bayes Beta Priors for Batting Averages [223002230020] |Continuing our characterization of Bayesian inference, we now turn to the use of informative priors. [223002230030] |In my previous post on Bayesian batting averages, we saw that even with a noninformative uniform prior, Bayesian multinomial inference led to more dispersed predictive distributions than inference with the point maximum likelihood/MAP estimate because of the posterior averaging over uncertainty. [223002230040] |

    Distribution of Hits vs. At-Bats

    [223002230050] |Here’s the scatterplot of hits vs. at-bats for all 308 of the 2006 American League position players. [223002230060] |The red diagonal line is where someone with a league-average 0.275 batting average would fall (league average is total number of at hits for all players divided by the total number of at bats for all players). [223002230070] |We’ll soon see the effect of the apparent correlation between at-bats and average (better players get put into bat more and higher up in the lineup). [223002230080] |

    Model Refresher

    [223002230090] |Just so we’re all on the same page, recall that we draw player ‘s batting average where is the prior number of hits plus 1 and the prior number of outs plus 1. [223002230100] |Player ‘s number of hits in at bats is then modeled as . [223002230110] |In words, each at bat has an independent chance of being a hit. [223002230120] |We call player ‘s ability. [223002230130] |Part of our inference problem is to characterize a player’s ability as a posterior distribution over abilities. [223002230140] |The more at bats we’ve seen, the tighter (smaller variance) the posterior estimate will be. [223002230150] |We are interested in inferences about the number of hits in the next at bats given an observation of hits in at bats along with the beta prior parameters . [223002230160] |The Bayesian estimate integrates over the posterior estimate of ability: [223002230170] |

    Simulation from Uniform Prior

    [223002230180] |Last time, we assumed a noninformative uniform prior on batting averages (interpreted as 0 prior hits and 0 prior outs). [223002230190] |If we generate simulated data from this model, we get a plot that looks as follows (drawn to the same scale). [223002230200] |Uh-oh. [223002230210] |That’s nothing at all like what we observe in the league. [223002230220] |

    “Empirical” Bayes

    [223002230230] |Short of the fully Bayesian approach to uncertainty is what’s known as “empirical Bayes”. [223002230240] |The term “empirical” here is in contrast to assuming a non-informative prior, not in contrast to fully Bayesian models, which are also empirical in this sense. [223002230250] |An empirical Bayes estimate involves a point estimate of the prior (in this case ) that’s used as a hyperparameter in the model (that is, a parameter that’s fixed as data). [223002230260] |

    Moment Matching Point Estimates

    [223002230270] |Often the point estimate is derived by a so-called moment-matching estimate. [223002230280] |For moment-matching, it’s usual to choose an estimate that has the same mean (1st moment) and variance (2nd moment around the mean) as the empirical data, though other moments may be used. [223002230290] |Moment matching for the beta prior is straightforward. [223002230300] |The mean (1st moment) of is [223002230310] |. [223002230320] |The prior scale (prior at bats plus 2) may be determined from the mean and variance (second moment around the mean), by [223002230330] |. [223002230340] |In our case, the sample mean of the player batting averages is 0.248 with a sample deviation (positive square root of variance, ) of 0.0755. [223002230350] |Note that the mean of averages, 0.248, is less than the league average of 0.275. [223002230360] |This bias is common in macro-averages such as macro-F measure for search or classifier evaluations. [223002230370] |In this case, moment matching leads to a prior mean parameter of 0.248 and a prior scale of 31.8. [223002230380] |For convenience, we reparameterized the usual parameterization (here is prior hits and prior outs) with the prior average and prior scale . [223002230390] |A sample drawn from this model is: [223002230400] |Close, but still not quite like what we observe. [223002230410] |With this model, there’s too much variance in averages and also everything’s a bit too low compared to the empirical data. [223002230420] |The extra variance in the sampled season is due to equally weighting players with low numbers of at bats. [223002230430] |As we know from the central limit theorem, averages based on smaller counts have higher variance (deviation declines inversely proportionally to ). [223002230440] |If we only consider the 88 batters with 400 at bats or more, the mean average is 0.284, higher than the league average. [223002230450] |The prior scale derived by moment-matching with variance is 133.9. [223002230460] |A sample looks as follows. [223002230470] |The variance is closer to the empirical variance, but still a bit too high. [223002230480] |And now the predicted mean is wrong in the opposite direction, above the league average, rather than below. [223002230490] |On the other hand, given the bias toward higher averages with more at bats, this curve predicts the high end of the distribution better. [223002230500] |Not surprising given that’s where it was estimated from. [223002230510] |(In a better model, we’d probably want to take number of at bats into account as a predictor in a regression-style model rather than use a simple beta-binomial setup.) [223002230520] |

    Posterior Inferences

    [223002230530] |Now we can just plug that estimate into our example from last time. [223002230540] |What if someone gets 3 out of 10 hits? [223002230550] |Here’s the new table with the two empirical Bayes posterior predictions added. [223002230560] |I’ve been kind to the MLE and non-informative Bayes estimates by taking a fairly likely number of hits (3) for the number of at bats (10). [223002230570] |The predictions are even more varied if you consider a batter going 5/10 or 1/10, neither of which is that unusual. [223002230580] |Here, the high prior counts will dominate the estimates. [223002230590] |Here’s the 5/10 case: [223002230600] |Comparing these two tables makes it evident how the priors pull the estimates closer to the league averages. [223002230610] |The higher the prior at-bat count, the stronger the pull. [223002230620] |If we up the count to 30/100 at bats, we’ll pull all the averages closer to the empirical averages. [223002230630] |After 30 observed at bats, the differences between the various Bayesian estimates begins to be dampened. [223002230640] |The differences would be a bit greater if we’d chosen a 20/100 batter who’d have been further away from the prior means. [223002230650] |

    Censorship and Moment-Matching are Ad Hoc

    [223002230660] |The problem here is that the “empirical” Bayes estimates based on moment matching and censoring (removing items from) the data are ad hoc. [223002230670] |If we truncate the data to try to reduce the variance, we lose the contribution to the averages of low-count players. [223002230680] |Also, as we truncate the prior average changes, which is not desirable. [223002230690] |Also, there’s no reason to believe matching moments will make sense for binomial data based on varying numbers of at bats. [223002230700] |In the next post, on our way to fully Bayesian inference, we’ll consider the so-called Bayes estimator based on squared loss, which provides a somewhat more principled point estimate for empirical Bayes. [223002240010] |Retraction: Only 1% Precision at 99.9% Recall for BioCreative Gene Chunks [223002240020] |Due to the bug we fixed in our precision-recall curve evaluations in the latest version (3.8.2) of LingPipe, I have to retract the results reported in: [223002240030] |
  • Carpenter, Bob. 2007. [223002240040] |LingPipe for 99.99% Recall of Gene Mentions. [223002240050] |In Proceedings of the 2nd Biocreative Workshop. [223002240060] |Valencia, Spain. [223002240070] |and unfortunately, the paragraph I contributed to: [223002240080] |
  • Larry Smith et al. 2008. [223002240090] |Overview of BioCreative II gene mention recognition. [223002240100] |Genome Biology 9(Suppl 2):S2. doi:10.1186/gb-2008-9-s2-s2 [223002240110] |My very first MEDLINE entry, and it’s buggy. [223002240120] |

    Original (Erroneous) Results

    [223002240130] |In those papers, I reported 7% precision at 99.99% recall, 8% precision at 99.9% recall, and 11% precision at 99% recall. [223002240140] |

    Corrected Results

    [223002240150] |It turns out the real numbers, using default LingPipe settings (n-gram = 5, interpolation = 5) with a max of 1024 chunks/sentence in our chunk.CharLmHmmChunker, the actual results are: [223002240160] |Reducing n-gram length to 4 raises precision at 99.9% recall to 1% and 3-grams raise precision at 99.9% recall to 1.3%. [223002240170] |As I speculated in the paper, less tightly fit models do a bit better in high recall settings, even though they do worse in high-F-measure evaluations. [223002240180] |Luckily it still only takes 2 minutes to do a complete confidence-based 20-fold cross-validation in a single thread. [223002240190] |The code for the evaluation’s all checked into the [223002250010] |Google Fast Flip’s SearchMe-like Interface [223002250020] |I liked the idea behind Google’s fast flip when I first saw it on SearchMe. [223002250030] |It looks like SearchMe is history; the site’s down and the first hit for [searchme] on Bing (yes, still my default search engine) has the title “SearchMe IP for Sale”. [223002250040] |So I’m glad to see Google rolled out a knockoff, albeit only for news. [223002250050] |The slick part is that the Google version is much faster, with their standard minimal interface and super-zippy AJAX caching like in Google Maps. [223002250060] |Can I have this faster interface for general web search, please? [223002250070] |

    Build Your Own

    [223002250080] |The news faceting they’re doing on fast flip would be trivial to recreate —they classify by a few categories (e.g. politics, sports). [223002250090] |Interesting, like with Google news, the assigned categories aren’t necessarily related to the source’s own categorization of articles. [223002250100] |They also facet by author, but only a little bit and only in cases where I saw a clear and easy to parse byline “by X”. [223002250110] |

    It’s all About the Ads

    [223002250120] |Google is stripping ads out of the original sources and displaying their own vertical banner ads in the right column. [223002250130] |Displaying the full original content makes it even less likely for me to click through to the original source than it was with Google News. [223002250140] |So Google’s giving the original content providers a cut of the ad revenue. [223002250150] |Even though Google touts its search results and newspapers their journalism, they’re both in the advertising business. [223002250160] |It remains to be seen how they’ll get along with this new form of co-opetition. [223002260010] |CrowdControl: Gold Standard Crowdsource Evals [223002260020] |CrowdControl is a new technology from Dolores Labs. [223002260030] |It’s a component of their new offering CrowdFlower, a federated crowdsourcing application server/API. [223002260040] |It fronts for providers like Amazon Mechanical Turk and others. [223002260050] |The interface looks a little bit simpler than Mechanical Turk’s and there are other nice add-ons, including help with application building from the experts at Dolores. [223002260060] |So far, I’ve only looked over Luke‘s shoulder for a demo of the beta when I was out visiting Dolores Labs. [223002260070] |

    CrowdControl: Supervised Annotator Evaluation

    [223002260080] |So what’s CrowdControl? [223002260090] |For now, at least, it’s described on the CrowdFlower Examples page. [223002260100] |In a nutshell, it’s Dolores labs’ strategy of inserting gold-standard items into crowdsourced tasks and using responses to them to estimate annotator accuracy in an ongoing fashion. [223002260110] |You may have seen this technique described and evaluated for natural language data in: [223002260120] |
  • Rion Snow, Brendan O’Connor, Dan Jurafsky and Andrew Ng. 2008. [223002260130] |Cheap and Fast — But is it Good? [223002260140] |Evaluating Non-Expert Annotations for Natural Language Tasks. [223002260150] |EMNLP. [223002260160] |Dolores Labs uses the evaluations to give annotators feedback, filter out bad/spam annotators, and derive posterior confidence estimates for labels. [223002260170] |

    Unsupervised Approach

    [223002260180] |I took their data and showed you could do just as well as gold-standard seeding using completely unsupervised model-based gold-standard inference (see my blog post on Dolores’s data and my analysis and the related thread of posts). [223002260190] |The main drawback to using inference is that it requires more data to get off the ground (which isn’t usually a problem in this setting). [223002260200] |It also requires a bit more computational power to run the model, as well as all the complexity in scheduling, which is admittedly a pain (we haven’t implemented it). [223002260210] |It’d also be harder for customers to understand and runs the risk of giving workers bad feedback if you use it as a feedback mechanism (but then so does any non-highly-redundantly-rechecked gold standard). [223002260220] |

    Overestimating Confidence

    [223002260230] |I was mostly interested in posterior estimates of gold-standard labels and their accuracy. [223002260240] |The problem is that the model they use in the paper above (Dawid and Skene’s), typically overestimates confidence due to false independence assumptions among annotations for an item. [223002260250] |The problem I have in my random effects models is that I can’t estimate the item difficulties accurately enough to draw measurably better predictive inferences. [223002260260] |That is, I can get slightly more diffuse priors in the fully Bayesian item difficulty approach, but not enough of an estimate on item difficulty to really change the gold-standard inference. [223002260270] |

    Semi-Supervised Approach

    [223002260280] |A mixture of gold-standard inference and gold-standard seeding (a semi-supervised approach) should work even better. [223002260290] |The model structure doesn’t even change —you just get more certainty for some votes. [223002260300] |In fact, the Bayesian Gibbs sampling and point estimated EM computations generalize very neatly, too. [223002260310] |I just haven’t gotten around to evaluating it. [223002270010] |Tsuruoka, Tsujii, Ananiadou (2009) Stochastic Gradient Descent Training for L1-regularized Log-Linear Models with Cumulative Penalty [223002270020] |I’m a big fan of descriptive titles and clearly stated goals, both of which today’s paper has: [223002270030] |
  • Tsuruoaka, Yoshimasa, Jun’ichi Tsujii, and Sophia Ananiadou. 2009. [223002270040] |Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. [223002270050] |ACL. [223002270060] |

    SGD with Lazy Update and Clipping

    [223002270070] |The paper introduces a variant of truncated stochastic gradient, a perceptron-like online algorithm for estimating L1-regularized (what we’d call a Laplace or double-exponential prior in the Bayesian world) log-linear models (what we call logistic regression). [223002270080] |LingPipe employs the variant the authors evaluated as a baseline and dubbed “SGD-L1 (Clipping + Lazy-Update)” (see LingPipe’s logistic regression tutorial), which is the simplest form of truncated stochastic gradient where batch size is one (that is, it’s fully stochastic, updating stats on a per-item basis). [223002270090] |

    The Problem: Non-Sparse Solutions

    [223002270100] |The beauty of L1 regularization is that it induces sparse solutions (many coefficients zero). [223002270110] |The problem with the stochastic approximation to the gradient is that it can leave many features non-zero solely because of their late position in the training sequence (their figure 1 is a nice illustration). [223002270120] |Tsuroaka et al. evaluate this effect on three CRF-sequence tagging problems (a form of logistic regression where you use the undirected graphical model form of forward-backward to compute the gradient and log loss), and propose a novel solution. [223002270130] |

    The Solution: Cumulative Penalty

    [223002270140] |Because the gradient of the L1 regularizer only depends on the sign of a feature and not its scale, the authors suggest buffering regularization penalties and then applying them right after a feature is estimated. [223002270150] |Furthermore, if the regularization weight is greater than the feature (that is, applying it drives the feature past zero, so it’s clipped), the amount of clipping is buffered for future use. [223002270160] |LingPipe basically uses an algorithm that’s almost identical, except that (1) the regularization gradient is applied before estimation and weight update, and at least once afterwards, at the end of an epoch at the latest; and (2) if there’s clipping, the buffered gradient is reduced to zero rather than buffered. [223002270170] |Another slight difference in their algorithm is that they randomly permute the items before each epoch. [223002270180] |

    Evaluation for Sparsity

    [223002270190] |They showed that their new method reduces the final number of non-zero features compared to the baseline truncated stochastic gradient from roughly 90K to roughly 30K. Exponential decay annealing reduces even more to zero. [223002270200] |Here’s where you want to see plots for different annealing schedules and numbers of iterations. [223002270210] |A standard quasi-Newton implementation resulted in roughly 20K features, so there’s still some slippage in their method compared to the “correct” result. [223002270220] |

    A Possible Bug?

    [223002270230] |What I didn’t see in their algorithm (in figure 2) was something that applied the penalties at the end of training. [223002270240] |More penalty accumulates for variables that never get seen again, and that needs to be applied. [223002270250] |This could actually account for the 20K vs. 30K feature discrepancy when compared to quasi-Newton methods. [223002270260] |I introduced this bug in the first version of logistic regression in LingPipe! [223002270270] |Mike Ross caught this problem during code review. [223002270280] |

    Batch Size 1?

    [223002270290] |I’d have thought using larger batch sizes as proposed by Langford et al. in their truncated stochastic gradient paper would also help to drive features down. [223002270300] |If you’re updated in a batch of 100 features, you get a 100* boost in the regularization gradient that applies. [223002270310] |Of course, this won’t necessarily have the same effect with the cumulative algorithm —it’ll probably help more in the simple clipping-based approach. [223002270320] |

    Evaluation for Log Loss

    [223002270330] |The quasi-Newton method did better in all of the evaluations (though the results were very close on their own NLPBA gene tagging data). [223002270340] |The authors’ cumulative penalty versions had log loss in between the simple truncation algorithm and the quasi-Newton method, but they were all pretty close (1/10th to 1/20th of a bit per example). [223002270350] |I’m thinking this could just be an issue of not running SGD long enough with low enough learning rates. [223002270360] |I found in my experiments, and others have found the same thing, that it takes a long time to converge to several decimal places. [223002270370] |Differing estimates present another issue for speed evaluation. [223002270380] |Without pegging log loss to be the same, you haven’t really found the minimum with SGD. [223002270390] |You can see in their figure 2 of learning rates that they could’ve stopped after 20 passes and gotten to nearly the same log loss. [223002270400] |What we’ve found is that held out performance is sometimes even better this way —it effectively gives you an overlay of extra regularization from the early stopping. [223002270410] |(In an aside, their figures 3 and 4 clip the last 130 epochs of the OWL-QN baseline, misleadingly make it look like the log loss was better for SGD than quasi-Newton.) [223002270420] |

    Oracle Evaluation for Speed

    [223002270430] |Evaluating SGD for speed is tricky, as I mentioned in my favorable review of Pegasos. [223002270440] |The problem is that convergence speed varies dramatically w.r.t. the learning rate and annealing schedule. [223002270450] |If I’m reading the Tsuroaka et al. paper correctly, in section 4.1 the authors say they use 30 passes (the final number they use) to tune the learning rate for SGD, using base learning rates of 1, 0.5, 0.2, and 0.1 (?eh —I need much smaller ones as did Langford et al.) and exponential annelaing rates of 0.9, 0.85 and 0.8 (a total of 12 settings, with no indication of how problem-specific they are). [223002270460] |The problem I have with this approach is that the authors are effectively using an oracle to choose learning rates and then claiming their system is faster. [223002270470] |It’s exactly this tuning of learning rates and number of epochs that’s so cumbersome for SGD approaches. [223002270480] |To make matters worse, they use Microsoft’s research-use-only quasi-Newton system OWL-QN to tune the regularization parameter. [223002270490] |This kind of empirical Bayes tuning of the priors interacts with the learning rate and number of epoch tuning in an unfortunate way. [223002270500] |Or in a fortunate way if you use path following to evaluate a range of priors. [223002270510] |The simple method, as used in LingPipe, was just as fast to get through 30 epochs. [223002270520] |That’s not surprising, but it’s the wrong evaluation. [223002270530] |You can see that the cumulative versions get to the same log loss much faster than the simple approach. [223002270540] |

    Exponential vs. Inverse Annealing

    [223002270550] |They argue for exponential annealing over inverse annealing on practical rather than theoretical grounds; LingPipe implements both forms along with constant learning rates, as well as providing an annealing interface for custom implementations to be plugged in. Inverse annealing is required in theory, but I’ve found the same thing in practice —exponential workds better.) [223002270560] |

    Comparing Apples and Oranges

    [223002270570] |As the authors point out, another problem with evaluating for speed is that they compared with a system they didn’t write, so who knows what unknowns are lurking in that comparison. [223002270580] |Even if a single person writes all the code, there’s no reason to believe their experience and skill is equal or the time they spend tuning for application speed is the same. [223002270590] |Having more comparable implementations (and I don’t even know how to measure this) might drive the ratio in either direction. [223002270600] |

    Memory vs. CPU Speed!

    [223002270610] |Why does everyone only report CPU type and speed (e.g. “Xeon 2.13 GHz processor”)? [223002270620] |Different Xeons have very different amounts of L2 cache and some have different numbers of cores. [223002270630] |Perhaps an even bigger variable is memory and front-side bus speed. [223002270640] |We’ve found that to be as big a factor in these large models where essentially every example needs to be moved into and out of the CPU. [223002270650] |Error correcting memory is slower, but more robust. [223002270660] |Etc. etc. [223002280010] |Did We Learn Anything from the Netflix Prize? [223002280020] |Netflix finally awarded the grand prize (the linked forum page links to the technical descriptions of the top systems). [223002280030] |

    What Did We Learn?

    [223002280040] |Off the shelf regression works pretty darn well. [223002280050] |Improving on it takes a huge amount of both general and domain-specific effort. [223002280060] |The most surprising aspect of the competition to me was how close the final race was. [223002280070] |I’m sure that was in part due to the openness of the competition, such as the winners of the intermediate progress prizes having to publicly disclose their approach(es). [223002280080] |As we tell our customers, it’s not necessarily the “secret sauce” or “magic pixie dust”, so much as a whole lot of elbow grease. [223002280090] |This result actually undercuts the surprisingly common naive argument for machine learning as opposed to heuristic approaches: with machine learning, you don’t have to have experts analyze and tune the data. [223002280100] |You just feed the data into the hopper and get a system out the other end. [223002280110] |Those of us who’ve been around the block a few times know this isn’t true given the results of other competitions, such as the various BioCreative bakeoffs and the i2b2 Obesity Challenge. [223002280120] |

    The Result

    [223002280130] |It took teams of experts thousands of hours to reduce the average error from .95 stars (on a 1-5 star scale) to .85 stars. [223002280140] |That’s only 1/10th of a star. [223002280150] |I don’t even think that’d be noticeable on the Netflix interface. [223002280160] |In absolute terms, that’s a reduction from 23.75% error (.95/4) to 21.25% (.85/4) error. [223002280170] |What’s more is that the computational resources required to achieve this reduction in error were vast. [223002280180] |Hundreds of relatively complex classifiers of various kinds were combined into an ensemble used for the final predictions. [223002280190] |The contest was over a relatively small static data collection (100M ratings, which fit in under 1GB memory), whereas Netflix is faced with a vast amount of dynamic data (new users, new films, new ratings). [223002280200] |The majority of the improvement came right away, with a simple partial singular value decomposition (SVD) implemented with a stochastic gradient alternating least squares fitter. [223002280210] |When I say simple, I mean that the algorithm can be implemented efficiently in a matter of hours. [223002280220] |I ported this approach for LingPipe’s SVD. [223002280230] |Finer-grained improvements came only after detailed study and analysis of the data, and involved things like modeling kinks in ratings due to unknown sources at particular dates and other non-linear effects such as user-specific rating scales. [223002280240] |

    More or Less Data?

    [223002280250] |While we could run experiments to see how well we could’ve done with only 1M or 10M ratings, what would’ve happend with 1G or 10G ratings? [223002280260] |I can’t help being reminded of the Banko and Brill paper on large data sets, the moral of which was that more data trumps fancy learning techniques. [223002280270] |I’d think we’d find a particularly profound effect on the k-nearest-neighbor approaches. [223002290010] |Bayesian Estimators for the Beta-Binomial Model of Batting Ability [223002290020] |In my last post introducing Bayesian stats through the simplest non-trivial distribution, the binomial, I introduced moment-matching “empirical Bayes” point estimates of the beta priors. [223002290030] |In this post, I’ll introduce the so-called “Bayesian estimator” point estimate for the beta priors. [223002290040] |In addition, I’ll show why maximum a posteriori (MAP) estimates of batting average differ from the Bayesian estimates for batting average. [223002290050] |

    Bayesian Estimators

    [223002290060] |Suppose we provide an estimate for a parameter that has true value . [223002290070] |We define a loss function , lower values of which indicate that is a better estimate of . [223002290080] |The Bayesian estimator minimizes the expected loss given a Bayesian posterior distribution over parameter : [223002290090] |

    With Squared Loss, Bayesian Estimates are Posterior Means

    [223002290100] |The most common loss function, due to its convenience, is squared loss: [223002290110] |As stated, this assumes a single real-valued parameter; the natural extension is to square distance for parameter vectors. [223002290120] |With squared loss, the expected loss is minimized by the mean of the posterior. [223002290130] |In other words, the Bayes estimator for a parameter under squared loss is the posterior mean of that parameter: [223002290140] |

    Beta-Binomial Batting Model

    [223002290150] |Our model for batting so far is very simple, with player ‘s ability being drawn from a beta prior with fixed hyperparameters (prior hits plus 1) and (prior outs plus 1): [223002290160] |The number of hits for player in at bats is drawn from a binomial sampling distribution: [223002290170] |The observed batting average is just . [223002290180] |Recall that the binomial is defined by: [223002290190] |and the beta prior by: [223002290200] |Given the convenient exponential form of the prior and likelihood function, the posterior is easily computed as: [223002290210] |Given this final form, which is proportional to a beta, we have [223002290220] |. [223002290230] |Thus the beta prior is conjugate to the binomial distribution, because when the prior over abilities is a beta density and the sampling distribution for the number of hits a binomial distribution, the posterior over abilities is also a beta density. [223002290240] |

    Bayesian Estimator for Batting Ability

    [223002290250] |The posterior mean for batting ability for player is thus the mean of the posterior beta distribution. [223002290260] |The mean of a beta distribution is . [223002290270] |So for the posterior , the Bayes estimators for batting ability are just: [223002290280] |

    Contrast with the MAP Estimate

    [223002290290] |Note that this is not the same estimate as the maximum a posteriori (MAP) estimate, because the mode (maximum point) and mean of a beta distribution are not the same. [223002290300] |The MAP estimate is based on the mode, [223002290310] |

    Diffuse Priors for the Beta Priors

    [223002290320] |In order to derive a Bayes estimate of the prior parameters and , we need to extend our model with a prior for the priors (as we said last time, it’s elephants all the way down). [223002290330] |As is common in these priors-for-priors situations, we’ll put a very diffuse prior on the priors themselves. [223002290340] |For convenience, we’ll reparameterize the beta distribution using its mean and scale . [223002290350] |Given the mean and scale, we can derive and by: [223002290360] |. [223002290370] |We then put priors directly on the mean and scale. [223002290380] |Because the mean has a range of 0 to 1, we can use a uniform prior: [223002290390] |Because the scale ranges from 0 to infinity, it’s impossible to have a (proper) uniform prior, because the integral would diverge. [223002290400] |Instead, we use a Pareto prior, [223002290410] |The Pareto distribution is defined on the interval by [223002290420] |In other words, the Pareto distribution is a power-law distribution with exponent . [223002290430] |This is a fairly diffuse prior, slightly favoring smaller values of the scale. [223002290440] |

    BUGS Code

    [223002290450] |Here’s the full model coded in BUGS: [223002290460] |This model assumes J batters, with abilities coded as ability[j], hits as hits[j], at bats as atbats[j]. [223002290470] |The prior parameter is coded as prior.hits, and as prior.outs. [223002290480] |The reparameterized prior mean is coded as prior.avg, and the reparameterized prior scale as prior.atbats. [223002290490] |

    Gibbs Sampling

    [223002290500] |With the model coded in BUGS, we can automatically run Gibbs sampling to collect a number of samples from the posterior distribution (we suppressed the conditioning on the hyperparameters). [223002290510] |We can then approximate the posterior marginal averages (to any number of decimal places) by taking the average of the samples for the respective variables. [223002290520] |It’s very easy to call BUGS from R using the >R2WinBUGS package. [223002290530] |So we read the data into R, use it to call BUGS, then we collect the results back in R, where we can calculate their means. [223002290540] |This presupposes that you’ve installed BUGS at the location specified by bugs.directory, have provided the BUGS program at the specified full path (I can’t get relative paths to work here), and have provided the data in the file 2006ALnonPitch.tsv (see the last section for a dump of the data in this file). [223002290550] |The format of the tsv (tab-separated values) file is one player per line, with the number of hits, a tab, the number of at bats, and a newline. [223002290560] |The parameters specify three simultaneous Markov chains for 1000 iterations each. [223002290570] |By default, BUGS will discard the first half of the iterations to remove noise from before the chains converge. [223002290580] |We initialize the ability values to , and . [223002290590] |Better initialization helps models converge to the posterior samples more quickly, but aren’t even necessary for easy-to-fit models like these that run relatively quickly. [223002290600] |The data and parameters are specified. [223002290610] |Data must be defined in R before calling BUGS, and parameters are returned as a two-dimensional array of samples (indexed by chain and sample number). [223002290620] |BUGS is not super-fast, taking about 15 minutes to take 3000 samples in 3 chains on my old notebook (it’s about twice as fast on my newish workstation). [223002290630] |

    Bayesian Estimates

    [223002290640] |While I don’t have time to explain Gibbs sampling in detail in this context, let me just say that the three chains mixed very well with values all having values very near 1.0. [223002290650] |The mean prior batting average is 0.271, and the mean prior at bats was 403. [223002290660] |Note that this average is lower and the number of at bats higher than with the moment-matching estimates. [223002290670] |I calculated these values in R after closing the BUGS window by: [223002290680] |This corresponds to beta density parameters and . [223002290690] |I’ll discuss more about the joint marginal average/at-bats posterior and the marginal ability posteriors in the next and (hopefully) final post in the series. [223002290700] |

    Who Do We Think is Better?

    [223002290710] |Who do we think is a better batter, someone we’ve seen go 40/100 (a 0.400 average) or someone we’ve seen go 175/500 (a 0.350 average)? [223002290720] |For now, suppose we fix an empirical Bayes prior to the Bayesian estimate , so that and The Bayesian estimator (posterior mean) for a player’s ability when we observe the player getting get hits in at bats, is then the expected value of the posterior, which is: [223002290730] |If we plug in we get an estimated ability of 0.295. [223002290740] |If we plug in , we get an estimated ability of 0.314! [223002290750] |If we believe this model, we can’t just compare batting averages —the number of at bats is a critical element of our posterior estimate for low numbers of at bats. [223002290760] |This is the effect of the prior, not the Bayesian inference per se. [223002290770] |Of course, the scale of the prior being so large is the result of using the Bayesian estimator for the prior rather than the moment-matching estimate, so the effect is greater. [223002290780] |

    In the Limit

    [223002290790] |In the limit, of course, the Bayesian estimate approaches the maximum likelihood estimate: [223002290800] |

    Data from Baseball1

    [223002290810] |I gathered the data for the 2006 American League position players from Baseball1 (click on the “download database” menu item on the nav bar). [223002290820] |It requires merging the player table (to get positions and exclude piters) and the batting table. [223002290830] |For completeness, here’s the munged data: [223002300010] |Convexity of (Root) Mean Square Error, or Why Committees Won the Netflix Prize [223002300020] |As John Langord mentioned in his blog, the reason committees won the Netflix prize is because root mean square error is convex. [223002300030] |Let’s see what that means with an example and with some math. [223002300040] |

    A Regression Problem

    [223002300050] |The contest involves predicting ratings for movies. [223002300060] |The actual ratings given the movies by users are known to Netflix to be , where the ratings are integers for . [223002300070] |A submission consists of floating point approximations of real numbers where for . [223002300080] |

    (Root Mean) Square Error

    [223002300090] |The error function used by Netflix was root mean square error (RMSE), which is defined as the square root of the mean of the square error: [223002300100] |The mean scales square error to a per-example error, and the square root puts it back on the original point scale. [223002300110] |But note that the best RMSE is achieved at the point of best SE, so the winning entry is the one with the best SE, so we won’t need to worry about means or square roots, just square error. [223002300120] |(Note that square error is just Euclidean distance between the entry vector and true answer vector.) [223002300130] |

    Graphing Upward Curving Error

    [223002300140] |Suppose we have a true rating of for movie . [223002300150] |We plot square error for : [223002300160] |Note that the error curves upward. [223002300170] |Note that the error is much higher for outlier guesses; a guess of 1 has an error of 9 in this case, whereas a guess of 2 has an error of 4 and a guess of 3 an error of only 1. [223002300180] |This tends to drive estimates toward the center of the range to guard against huge error. [223002300190] |By way of comparison, absolute (or taxicab distance) error would be and would not penalize extreme guesses nearly as severely as squared error. [223002300200] |We have plotted as red circles the error for two guesses and , with values [223002300210] |. [223002300220] |Now consider the guess that results from averaging the guesses 3.0 and 4.5, namely 3.75. [223002300230] |It has lower error than either 3.0 or 4.5, [223002300240] |The error is plotted on the curve. [223002300250] |Note that the error for the average of the guesses 0.0625 is not only less than the average of the errors, 1.25/2 = 0.625, it is less than the error of the individual guesses, 1.0 and 0.25. [223002300260] |

    Convexity

    [223002300270] |The reason committees fare so well with root mean square error is because the error function is convex. [223002300280] |Convexity for functions means the same thing as the dictionary definition —curved upwards. [223002300290] |When functions curve upwards, as square error does, the error for averaging two guesses is always lower than the average error from the two guesses. [223002300300] |Mathematically, a function is convex if for all and and , [223002300310] |This says that the value of the function for a weigted average of inputs is less than or equal to the weighted average of the value of the function at those inputs. [223002300320] |In the example above, is the square error function and the examples are weighted equally with . [223002300330] |We know square error is convex because its differentiable everywhere and its second derivative is positive: [223002300340] |The result can also be easily established algebraically by expanding all of the terms, cancelling and rearranging. [223002300350] |Everything generalizes to multiple dimensions by additivity. [223002300360] |Because the full square error is the sum of the dimensionwise square errors, all the partial second derivatives are positive and the entire error function is convex. [223002300370] |

    Committees Average Their Members’ Predictions

    [223002300380] |All of the top-scoring Netflix Prize entries averaged the predictions of multiple base systems. [223002300390] |Given the convexity of error, the error for the average of two equal scoring systems is less than or equal to their average error. [223002300400] |So in general, two predictions with the same error can be averaged to produce a prediction with lower error. [223002300410] |And as we saw in the example above, this can even happen when the predictions are not the same. [223002300420] |It can help to tune the weighting , typically by weighting the better scoring systems more heavily. [223002300430] |

    Discrete Predictions and Posterior Variance

    [223002300440] |As we’ve been talking about in other blog posts, point estimates sweep posterior variance under the rug. [223002300450] |The square error further pushes numerical scores toward non-comittal average ratings. [223002300460] |Suppose we build a discrete 5-way classifier and estimate a posterior for . [223002300470] |The Bayesian estimator here would minimize expected square error with respect to the posterior, namely the expected rating, which weights each prediction by its estimated probability: [223002300480] |For instance, here are plots of two posterior estimates with the same mean prediction, 2.95, the first of which is for a film the system thinks the user will love or hate, [223002300490] |and one for which the system is pretty sure the user will be indifferent, [223002300500] |I find these posterior distributions much more useful than a single plot. [223002300510] |This is how Amazon and NewEgg display their customer ratings, though neither site predicts how much you’ll like something numerically like Netflix does. [223002310010] |Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers [223002310020] |Ironically, naive Bayes, as standardly presented, is not Bayesian. [223002310030] |At least not in the sense that we estimate a posterior distribution over parameters given data and use it for predictive inference for new . [223002310040] |

    What is Naive Bayes?

    [223002310050] |The naive Bayes model assumes that a corpus of documents is generated by selecting a category for a document then generating the words of that document independnetly based on a category-specific distribution. [223002310060] |The naivete in question is in assuming the words are independent, an assumption that’s clearly violated by natural language texts. [223002310070] |In symbols, if we have documents, we assign document to category based on a discrete distribution over categories: [223002310080] |for [223002310090] |We then generate words for document according to a discrete distribution over words for category : [223002310100] |for [223002310110] |

    Naive Bayes Priors

    [223002310120] |Given a corpus of training documents, we typically set and by maximum likelihood or additive smoothing. [223002310130] |These are both maximum a posteriori (MAP) estimates given (uniform) Dirichlet priors. [223002310140] |Specifically, we assume prior category counts (plus 1) and prior word counts (plus 1) : [223002310150] |for [223002310160] |As is clear from the posteriors defined in the next section, if we set , the MAP parameter estimates are the maximum likelihood estimates (MLE). [223002310170] |If we have what’s known as “add 1 smoothing” or “Laplace smoothing”. [223002310180] |

    Maximum a Posteriori Estimates

    [223002310190] |The maximum a posteriori (MAP) estimates given priors and data are: [223002310200] |The MAP parameter estimates are calculated using our old friend additive smoothing: [223002310210] |Note the subtraction of one from the prior counts; Dirichlet distributions are parameterized with and being prior counts plus one. [223002310220] |If , that’s a prior count of zero for categories and words, and hence a maximum likelihood posterior. [223002310230] |

    Inference with Point Estimates

    [223002310240] |With point estimates, inference is simple for a sequence of words. [223002310250] |The probability of category given our parameters is: [223002310260] |

    Bayesian Posteriors for Naive Bayes

    [223002310270] |Because the Dirichlet is conjugate to the discrete (or multinomial) distribution, the posterior parameter distributions and given training data have the closed form solutions: [223002310280] |, where [223002310290] |and [223002310300] |, where [223002310310] |

    Bayesian Inference for Naive Bayes

    [223002310320] |It’s easy to convert the point-estimates used here to full Bayesian inference, thus creating a Bayesian version of naive Bayes. [223002310330] |(The naivete, by the way, refers to the independence assumption over words, not the point estimates.) [223002310340] |We just replace the point estimates with integrals over their posteriors. [223002310350] |where as before, [223002310360] |, and [223002310370] |. [223002310380] |As usual, the integrals could be evaluated by Gibbs sampling. [223002310390] |Or the integrals could be factored into Dirichlet-Multinomial form and solved analytically, along the same lines as for the Beta-Binomial model. [223002310400] |

    Evaluation?

    [223002310410] |Has anyone evaluated this setup for text classification? [223002310420] |I’m thinking it might help with the over-attenuation often found in naive Bayes by allowing a more dispersed posterior compared to the point estimates. [223002320010] |Whitehill, Ruvolo, Wu, Bergsma, and Movellan (2009) Optimal Integration of Labels from Labelers of Unknown Expertise [223002320020] |[Update: 4:51 PM, 5 October 2009 after corrections from Jacob Whitehill (thanks!); they did use a prevalence estimate and did allow mixed panel fits, as the post now reflects.] [223002320030] |Thanks to Panos for pointing out this upcoming NIPS poster, which makes a nice addition to our running data annotation thread: [223002320040] |
  • Whitehill, Jacob, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier Movellan. 2009. [223002320050] |Optimal integration of labels from labelers of unknown expertise. [223002320060] |NIPS Poster. [223002320070] |The authors’ knowledge of the epidemiology literature was limited when they stated: [223002320080] |To our knowledge BLOG [bilinear log odds generative] is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised manner. [223002320090] |Just check out the literature survey portion of my technical report for a range of different approaches to this problem, some of which have even been applied to binary image data classification such as Handelman’s X-ray data for dental caries diagnoses (see the tech report for the Handleman data). [223002320100] |

    Model Overview

    [223002320110] |In this case, the authors use a logistic scale (that’s the log-odds part) consisting of the product of an annotator accuracy term and an item (inverse) difficulty term (that’s the “bilinear” part). [223002320120] |Although the authors mention item-response and Rausch models (see below), they do not exploit their full expressive power. [223002320130] |In particular, the authors model annotator accuracy, but do not break out sensitivity and specificity separately, and thus do not model annotator bias (a tendency to overlabel cases in one category or another). [223002320140] |I and others have found huge degrees of annotator bias for real data (e.g. Handelman’s dentistry data and the Snow et al. NLP data). [223002320150] |The authors’ model also clips difficulty at a random coin flip, whereas in reality, some positive items may be so hard to find as to have less than a 50% chance of being modeled correctly. [223002320160] |They impose unit normal priors over annotator accuracy and normal priors over the log of item difficulty (ensuring item difficulties are non-negative). [223002320170] |They fit the models with EM using conjugate gradient to solve the logistic regression in the M-step. [223002320180] |Epidemiologists have fitted empirical Bayes priors by using other expert opinion, and I went further and actually fitted the full hierarchical Bayesian model using Gibbs sampling (in BUGS; the code is in the sandbox project). [223002320190] |Point estimates (like EM-derived maximum a posterior estimates as the authors use) always underestimate posterior uncertainty compared to full Bayesian inference. [223002320200] |Posterior uncertainty in item difficulty is especially difficult to estimate with only 10 annotators. [223002320210] |In fact, we found the Bayesian posteriors for item difficulty to be so diffuse with only 10 annotators that using the full posterior effectively eliminated the item difficulty effect. [223002320220] |

    Synthetic Data

    [223002320230] |They run synthetic data and show fits. [223002320240] |Not surprisingly given the results I’ve reported elsewhere about fitting item difficulty, they only report fits for difficulty with 50 annotators! [223002320250] |(I found reasonable fits for a linear (non-multiplicative) model with 20 annotators, though recall the reviewers for my rejected ACL paper thought even 10 annotators was unrealistic for real data!) [223002320260] |They also simulate very low numbers of noisy annotators compared to the actual numbers found on Mechanical Turk (even with pre-testing, we had 1/10 noisy annotations and without testing, Snow et al. found 1/2 noisy labels). [223002320270] |I was surprised they had such a hard time adjusting for the noisy labelers. [223002320280] |I think this may be due to trying to model item difficulty. [223002320290] |Without item difficulty, as in the Dawid and Skene-type models, there’s no problem filtering out bad annotators. [223002320300] |

    Pinning Values

    [223002320310] |The authors note that you can fix some values to known gold-standard values and thus improve accuracy. [223002320320] |I noted this in my papers and in my comment on Dolores Labs’ CrowdControl, which only uses gold-standard values to evaluate labelers. [223002320330] |

    Real Data

    [223002320340] |They show some nice evaluations for image data consisting of synthetic images and classification of Duchenne smiles. [223002320350] |As with other data sets of this kind (such as my results and Raykar et al.’s results), they show decreasing advantage of the model-based methods over pure voting as the number of annotators approaches 10. [223002320360] |This is as we’d expect —the Bayesian priors and proper weighting are most important for sparse data. [223002320370] |

    Mathematical Presentation

    [223002320380] |The authors suppose items (images in this case) and annotators. [223002320390] |The correct label for item is and the label provided by the annotator is is . [223002320400] |They consider fitting for the case where not every annotator labels every item. [223002320410] |The authors model correctness of an annotation by: [223002320420] |where is a measure of an annotators ability and a measure of inverse item difficulty. [223002320430] |The authors observe some limits to help understand the parameterization. [223002320440] |First, as inverse item difficulties approach 0, items become more difficult to label, and accuracy approaches chance (recall ): [223002320450] |. [223002320460] |As inverse item difficulties approach infinity, the item becomes easier to label: [223002320470] |. [223002320480] |As annotator accuracy approaches infinity, accuracy approaches perfection: [223002320490] |. [223002320500] |As annotator accuracy approaches zero, accuracy approaches chance: [223002320510] |. [223002320520] |If accuracy is less than zero, the annotator is adversarial. [223002320530] |We didn’t find any adversarial annotators in any of our Mechanical Turk data, but there were lots of noisy ones, so some of the models I fit just constrained prior accuracies to be non-adversarial. [223002320540] |Others have fixed priors to be non-adversarial. [223002320550] |In some settings, I found initialization to non-adversarial accuracies in EM or Gibbs sampling led to the desired solution. [223002320560] |Of course, when lots of annotators are adversarial and priors are uniform, the solution space is bimodal with a flip of every annotator’s adversarial status and every item’s label. [223002320570] |The authors also model prevalence with a term. [223002320580] |If prevalence is 0.5, it drops out, but their Duchenne smile example was unbalanced, so the prevalence term is important. [223002320590] |

    Comparison with Item-Response and Ideal Point Models

    [223002320600] |The authors mention item-response theory (IRT), which is also where I got the idea to use these models in the first place (too bad the epidemiologists beat us all to the punch). [223002320610] |A basic IRT-like model for annotation would use the difference between annotator accuracy and item difficulty . [223002320620] |By allowing to be positive or negative, you can model positive and negative items on the same scale. [223002320630] |Or you can fit separate and for positive and negative items, thus independently modeling sensitivity and specificity. [223002320640] |Discriminativeness can be modeled by a multiplicative factor , producing a predictor . [223002320650] |In this way, the term models a positive/negative bias and the the sharpness of the decision boundary. [223002320660] |I’m a big fan of the approach in (Uebersax and Grove 1993), which handles ordinal responses as well as discrete ones. [223002320670] |Too bad I can’t find an open-access version of the paper. [223002330010] |API Design: Should I Reify Taggings for CRFs and HMMs? [223002330020] |I finished the technical core and testing for CRFs, including regularized stochastic gradient estimation as well as first-best, n-best and marginal decoders. [223002330030] |Now I’m thinking about retrofitting everything so that CRFs work the same way as HMMs from an interface perpsective. [223002330040] |

    Current HMM Interfaces

    [223002330050] |For hmm.HmmDecoder, the current interface is: Note that this is using arrays for inputs and outputs (I wrote it before generics), and constructs n-best lists and scored outputs using standardized wrappers/interfaces like Java's java.util.Iterator and com.aliasi.util.ScoredObject from LingPipe. [223002330060] |

    Tag Package

    [223002330070] |I started a new package, tag, with three new interfaces for first-best, n-best, and marginal taggers. [223002330080] |I'll adapt the HMM decoders to implement them and deprecate the methods described above (if necessary). [223002330090] |The big question is what should the interfaces look like? [223002330100] |

    First-Best Taggers

    [223002330110] |I generalized inputs to lists of generic objects -- they used to be fixed to arrays of strings. [223002330120] |The first interface is for first-best tagging: [223002330130] |with an associated tag.Tagging abstract base class: [223002330140] |Just to be safe, the public constructor copies the input tags and tokens, and the methods that return the lists provide unmodifiable views. [223002330150] |I also have the tags being returned as a list. [223002330160] |There's nothing in principle preventing the tags from being structured -- it just didn't seem worth the implementation hassle and resulting interface complexity. [223002330170] |Question 1 [223002330180] |Should there be a Tagging interface (or should I just return lists of tags and leave it up to users to keep track of inputs)? [223002330190] |

    N-Best Sequence Taggers

    [223002330200] |There's another interface for n-best taggers: [223002330210] |The ScoredTagging object extends Tagging and implements the interface util.Scored: [223002330220] |I like the iterator over the results, which works well in practice. [223002330230] |What I'm less certain about is: [223002330240] |Question 2 [223002330250] |Should I create a ScoredTagging interface (or just use ScoredObject)? [223002330260] |

    Marginal (per Tag, per Phrase) Tagging

    [223002330270] |The final interface is for marginal results, which allow the computation of the conditional probability of a sequence of tags at a given position given the input sequence. [223002330280] |This one's easy, as I need a structure for the result. [223002330290] |What I need to do is adapt the HMM TagWordLattice so that it implements the common interface TagLattice. [223002330300] |I think all that needs to have in it is: [223002330310] |Of course, I can add utility methods for linear probabilities and accessing tags by name rather than symbol table identifier. [223002330320] |

    Forward-Backward Lattice

    [223002330330] |There's also the implementation issue of whether to combine the currently separate forward-backward (or generalized sum-product) implementations of the tag lattices in HMMs and CRFs, and then whether to expose the implementation methods, which are necessary for extracting n-best phrases in chunkers. [223002330340] |The forward/backward phrase extraction requires forward, backward, and transition scores, and an overall normalizer (conventionally written as Z) to normalize scores to conditional probabilities. [223002330350] |Should I just require a forward-backward lattice in the interface? [223002330360] |I never feel it's worth it to add really generic interfaces, such as: [223002330370] |That'd let the classes that need the forward-backward version specify T. [223002330380] |

    Taggings and Handlers

    [223002330390] |The next big issue is whether to change the way taggings are supplied to taggers. [223002330400] |The current method uses a specific handler: [223002330410] |This allows whitespaces to be provided, but we never use them at the tagger level. [223002330420] |This brings up the question of whether to reify taggings for training: [223002330430] |Question 3 [223002330440] |Should I replace TagHandler with ObjectHandler? [223002330450] |A "yes" answer would allow me to deprecate TagHandler going forward and makes tagging more parallel to chunking. [223002330460] |But it's a lot of work (not that it should worry you), will require lots of changes to existing code, and it messes with longer-term backward compatibility. [223002330470] |

    Evaluator Overloading

    [223002330480] |Right now tagger evaluators only work for HMMs. [223002330490] |I need to generalize. [223002330500] |The classifier evaluator inspects the classifier being evaluated with reflection to see what it can do, and evaluates all of the parts that can be evaluated. [223002330510] |Which brings up: [223002330520] |Question 4 [223002330530] |Should I create three distinct evaluators, one for first-best, one for n-best, and one for marginal taggers (or should I just overload a single one)? [223002330540] |

    Speak Now, or Forever ...

    [223002330550] |My current thinking is to answer "yes" to all the questions. [223002330560] |So now's a good time to chime in if you don't like those answers! [223002340010] |OMOP Cup: Drug Safety Surveillance Bakeoff [223002340020] |[Update: 13 October 2009: David Madigan says they got the wrong legalese the first go round and are going to replace it.] [223002340030] |David Madigan (co-lead of BMR) and crew, just announced the 2009/2010 OMOP Cup, a short bakeoff aimed at predicting which drugs cause which conditions in patients. [223002340040] |

    The Coarse Print

    [223002340050] |There’s no way we’d enter, given the ridiculously restrictive OMOP Cup rules. [223002340060] |The official rules only mention source in passing and only suggest you send them a tech report. [223002340070] |But the home page says you have to “share your method”. [223002340080] |In addition, you have to transfer the rights to the contest organizers, which means you wouldn’t even own the rights to use your own technique after submitting it! [223002340090] |If that’s not enough, the rules also require you to indemnify the organizers against damages (e.g. when a patent troll sues them because they think your entry infringed their patent). [223002340100] |If you work for a company or for a university, you may not even have the right to reassign your intellectual property this way. [223002340110] |

    The Two Challenges

    [223002340120] |There are two predictive challenges, based on the same underlying training data. [223002340130] |For full details, see the site above and the challenge overviews: [223002340140] |
  • Challenge 1 Overview [pdf] [223002340150] |
  • Challenge 2 Overview [pdf] [223002340160] |
  • Official Rules [pdf] [223002340170] |

    Schedule

    [223002340180] |Data’s out now, progress prizes ($5K total) will be awarded end of November 2009, and grand prizes ($15K total) at the end of March 2010. [223002340190] |

    Simulated Training Data

    [223002340200] |Unfortunately, they’re using simulated data. [223002340210] |For what it’s worth, here’s OMOP’s call for simulations, so you can figure out some of the basics of how they were planning to simulate. [223002340220] |The basic data sizes are bigger than for Netflix, consisting of roughly: [223002340230] |
  • 5K drugs [223002340240] |
  • 5K conditions [223002340250] |
  • 10M persons [223002340260] |
  • 300M condition occurrences over 10 years [223002340270] |
  • 90M drug exposures over 10 years [223002340280] |
  • 4K positive, 4K negative associations (labeled training data) [223002340290] |The basic observational training data is organized into four DB table dumps: [223002340300] |
  • Conditions: start date, person, condition [223002340310] |
  • Drug Exposure: start date, end date, person, drug [223002340320] |
  • Person Observations: start date, end date, person id, person status (alive/dead), prescription data (yes/no) [223002340330] |
  • Person Data: id, birth year, gender (M/F), race (white/non-white) [223002340340] |The labeled training data contains 4000 examples of positive drug-condition associations and 4000 examples of negative associations. [223002340350] |I have no idea if the drug and condition data link to real-world drugs and conditions, though the challenge indicates they want you to use outside data, so they probably do (I’ll post any links that people send me about ways to use this data). [223002340360] |OMOP’s common data model (CDM) specification is huge, and all you get in the data files are numerical codes. [223002340370] |

    System Output

    [223002340380] |For Challenge 1 ($10K grand prize), you just provide a score for each drug/condition pair, and the scores are only used for ranking. [223002340390] |For Challenge 2 ($5K grand prize), there’s a time component I didn’t understand, and they’re only using the first 500 of the 5000 drugs. [223002340400] |

    Evaluation

    [223002340410] |For the first bakeoff, it’s just average precision (see LingPipe’s ScoredPrecisionRecall class documentation or the task descriptions linked above for an explanation of (mean) average precision). [223002340420] |For the second bakeoff, it’s mean average precision, where means are over years. [223002340430] |

    Leader Board

    [223002340440] |They’re supposed to have a leaderboard and a way of evaluating responses online as Netflix did. [223002340450] |So far, I don’t see it on their site. [223002340460] |

    What’s OMOP?

    [223002340470] |OMOP is the Observational Medical Outcomes Partnership, a “public-private partnership”, the stated goal of which is to “improve the monitoring of drugs for safety”. [223002350010] |Bayesian Counterpart to Fisher Exact Test on Contingency Tables [223002350020] |I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables. [223002350030] |

    Contingency Table

    [223002350040] |I’ll use the data in the example in the Wikipedia article on contingency tables: [223002350050] |

    Hypothesis

    [223002350060] |The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women. [223002350070] |

    Beta-Binomial Model

    [223002350080] |The observed data has left-handed men of a total of men and left-handed women of a total of women. [223002350090] |Letting be the proportion of left-handed men and the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial, [223002350100] |, and [223002350110] |. [223002350120] |He then suggested simple uniform priors on the proportions, which we know are equivalent to priors: [223002350130] |, and [223002350140] |. [223002350150] |Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors: [223002350160] |, and [223002350170] |We could try to compute the posterior density of analytically by solving the integral [223002350180] |or we could just follow Andrew’s sage advice and estimate the integral by simulation. [223002350190] |

    R Code

    [223002350200] |I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them. [223002350210] |Running this code prints out: [223002350220] |The first numbers displayed are the posterior quantiles for the difference. [223002350230] |The posterior median (50% quantile) difference is 0.084. [223002350240] |The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. [223002350250] |The posterior central 95% quantile is thus (-0.046,0.218). [223002350260] |Quantiles get the obvious interpretation in Bayesian stats —given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218). [223002350270] |The second number printed is the posterior probability that men have a higher percentage of left-handness than women, which is 0.901, or about 90%. [223002350280] |R makes this simple by letting us write mean(theta1 >theta2). [223002350290] |This is also the value you’d get from integrating the posterior density from to . [223002350300] |The final number is the mean difference between theta1 and theta2, which is the unbiased Bayesian estimator of the difference. [223002350310] |This value is 0.0848, which is a bit higher than the median difference of 0.084. [223002350320] |The code also generates the following graph of the posterior: [223002350330] |The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference . [223002350340] |

    Significant?

    [223002350350] |Don’t ask. [223002350360] |You’re a Bayesian now. [223002350370] |Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of lefthandedness among men than women, or that there’s a 95% chance the difference between male prevalence of lefthandness and female prevalence of lefthandedness falls in the range (-.05,0.22). [223002350380] |[Remember not to use too many digits. [223002350390] |The outputs from R were so you could see the small differences.] [223002360010] |What’s Wrong with Probability Notation? [223002360020] |[Update 21 October 2009: Check out Michael Collins's comment and my reply. [223002360030] |Michael points out that introductions to probability theory carefully subscript probability functions with their random variables and distinguish random variables from regular variables by capitalizing the former. [223002360040] |I replied that it's really the practice that's problematic, and I'm talking about the notation in Gelman et al.'s Bayesian Data Analysis or Blei, Ng and Jordan's LDA paper.] [223002360050] |

    What’s Wrong?

    [223002360060] |What’s wrong with the probability notation used in Bayesian stats papers? [223002360070] |The triple whammy of [223002360080] |
  • overloading for every probability function, [223002360090] |
  • using bound variables named after random variables, and [223002360100] |
  • using the bound variable names to distinguish probability functions. [223002360110] |

    Probabilty Notation is Bad

    [223002360120] |The first and third issues arise explicitly and the second implicitly in the usual expression of the first step of Bayes’s rule, [223002360130] |, [223002360140] |where each of the four uses of corresponds to a different probability function! [223002360150] |In computer science, we’re used to using names to distinguish functions. [223002360160] |So and are the same function applied to different arguments. [223002360170] |In probability notation, and are different probability functions, picked out by their arguments. [223002360180] |

    Random Variables Don’t Help

    [223002360190] |As Michael (and others) pointed out in the comments, if these are densities determined by random variables, we use the capitalized random variables to distinguish the distributions and bound variables in the usual mathematical sense, disambiguating our random variables with [223002360200] |. [223002360210] |When we have dozens of parameters in our multivariate densities, this gets messy really quickly. [223002360220] |So practitioners fall back to the unsubscripted notation. [223002360230] |

    Great Expectations

    [223002360240] |The third issue appears in expectation notation (and in information theory notation for entropy, mutual information, etc.). [223002360250] |Here, statisticans write and for random variables with the probability function and sample space implicit. [223002360260] |The way you then see expectation notation written in applied Bayesian modeling is often: [223002360270] |. [223002360280] |What’s really sneaky here is the use of as a global random variable on the left side of the equation and as a bound variable on the right hand side. [223002360290] |Distinguishing random variables with capital letters, this would look like [223002360300] |. [223002360310] |

    Continuous vs. Discrete

    [223002360320] |The definitions are even more overloaded than they first appear, because of the different definitions for continuous and discrete probabilities. [223002360330] |In Bayes’s rule, if is continuous in , we’re meant to understand integration [223002360340] |, [223002360350] |in place of summation [223002360360] |. [223002360370] |Similarly, for expectations of continuous densities, we write [223002360380] |or if we’re being very careful with random variables, [223002360390] |. [223002360400] |Intros to probability theory often use for continuous probability density functions and reserve for discrete probability mass functions. [223002360410] |They’ll start with notation (or ) for the event probability function. [223002360420] |

    Samples Spaces to the Rescue?

    [223002360430] |In applied work, we rarely, if ever, need to talk about the sample space , measures from subsets of to , or actually consider a random variable as a function from to . [223002360440] |I don’t recall ever seeing a model defined in this way. [223002360450] |Instead, we typically construct multivariate densities modularly by combining simpler distributions. [223002360460] |For instance, a hierarchical beta-binomial model, such as the one I used for the post about Bayesian batting averages, would typically be expressed as: [223002360470] |or in sampling notation by stating and . [223002360480] |In fact, that’s just how it gets coded in the model interpreter BUGS (Bayesian inference using Gibbs sampling) and compiler HBC (hierarchical Bayes compiler). [223002360490] |Since we only really ever talk about probability density functions, why not get rid of random variable notation altogether? [223002360500] |We could start with a joint density function over a vector , and consider projections of in lieu of random variables. [223002360510] |This we can fully specify with high-school calculus. [223002360520] |If we use consistent naming for the dimensions, we can get away without ever formally definining a random variable. [223002360530] |In practice, it’s not really that hard to keep our sampling distributions separate from our posteriors , even if we write them both as . [223002360540] |Technically, we could take as the sample space and then define the random variables as projections. [223002360550] |But this formal definition of random variables doesn’t buy us much other than a connection to the usual way of writing things in theory textbooks. [223002360560] |If it makes you feel better, you can treat the normal ambiguous definitions this way; some of the lowercase letters are random variables, some are bound variables, and we just drop all the subscripts to pick out densities. [223002360570] |

    Lambda Calculus to the Rescue?

    [223002360580] |Maybe we can do better. [223002360590] |We could express our models as joint densities, and borrow a ready-made language for talking about functions, the lambda calculus. [223002360600] |For instance, we could define a discrete bivariate for reference. [223002360610] |We could then distinguish the marginals , using [223002360620] |instead of , and [223002360630] |instead of . [223002360640] |We could similarly distinguish the conditionals, writing [223002360650] |instead of , and [223002360660] |instead of . [223002360670] |Bayes’s rule now becomes: [223002360680] |. [223002360690] |Clear, no? [223002360700] |

    Perhaps Not

    [223002360710] |OK, so maybe the statisticians are onto something here with their sloppy notation in practice. [223002360720] |All attempts to distinguish function names in stats only seem to make matters worse. [223002360730] |This is especially problematic for Bayesian stats, where a fixed prior in one model becomes an estimated parameter in the next. [223002370010] |Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+ [223002370020] |I’ve finished up the first order linear-chain CRF tagger implementation and a bunch of associated generalizations in the tagging interface. [223002370030] |Now it’s time to code up chunkers using CRFs, and I’m faced with the usual problem of how to encode chunks, which are about character spans, as taggings, which are sequences of tokens (words or other units) and tags (aka labels, categories). [223002370040] |

    An Example

    [223002370050] |Here’s an example of a tokenized string and its encoding using several standards: [223002370060] |

    IO Encoding

    [223002370070] |The simplest encoding is the IO encoding, which tags each token as either being in (I_X) a particular type of named entity type X or in no entity (O). [223002370080] |This encoding is defective in that it can’t represent two entities next to each other, because there’s no boundary tag. [223002370090] |

    BIO Encoding

    [223002370100] |The “industry standard” encoding is the BIO encoding (anyone know who invented this encoding?). [223002370110] |It subdivides the in tags as either being begin-of-entity (B_X) or continuation-of-entity (I_X). [223002370120] |

    BMEWO Encoding

    [223002370130] |The BMEWO encoding further distinguishes end-of-entity (E_X) tokens from mid-entity tokens (M_X), and adds a whole new tag for single-token entities (W_X). [223002370140] |I believe the BMEWO encoding was introduced in Andrew Borthwick’s NYU thesis and related papers on “max entropy” named entity recognition around 1998, following Satoshi Sekine’s similar encoding for decision tree named entity recognition. [223002370150] |(Satoshi and David Nadeau just released their Survey of NER.) [223002370160] |

    BMEWO+ Encoding

    [223002370170] |I introduced the BMEWO+ encoding for the LingPipe HMM-based chunkers. [223002370180] |Because of the conditional independence assumptions in HMMs, they can’t use information about preceding or following words. [223002370190] |Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information. [223002370200] |This allows a different model to generate words after person entities (e.g. John said), for example, than generates words before location entities (e.g. in Boston). [223002370210] |The tag transition constraints (B_X must be followed by M_X or E_X, etc.) propagate decisions, allowing a strong location-preceding word to trigger a location. [223002370220] |Note that it also adds a begin and end of sequence subcategorization to the out tags. [223002370230] |This helped reduce the confusion between English sentence capitalization and proper name capitalization. [223002370240] |

    Transition and Feature Tying

    [223002370250] |Consider the BMEWO notation. [223002370260] |At least for regular named-entity recognition in text, we probably don’t want to distinguish being preceded by an end-of-entity token labeled E_X from a whole entity token labeled W_X. [223002370270] |In HMMs, we’d tie transitions, so p(T|E_X) = p(T|W_X) for all tags T. [223002370280] |In CRFs, we could tie the feature generation process, so having either preceding tag (end or whole entity) produces the same features, and hence the same probability estimates. [223002370290] |

    Complexity

    [223002370300] |First-best decoding in a first-order CRF or HMM is quadratic in the number of tags (and linear in the number of tokens). [223002370310] |Here’s a rundown: [223002370320] |

    Legal Transition Pruning

    [223002370330] |It’s not quite as bad as number of tags squared in practice, because not every tag can follow every other tag in the more complex codings. [223002370340] |Even in BIO encoding, I_X can only follow B_X. In BMEWO encoding, M_X and E_X can only follow B_X or I_X, and O can only follow O, E_X, or W_X. [223002370350] |Of course, this is a major pain to compute and specify in the API, as you need some way of figuring out legal transitions. [223002370360] |For HMMs, I used the fact that transition probabilities were zero to rule out consideration. [223002370370] |I’ll have to add some way to specify legal transitions if I implement something general for CRF tagger decoding. [223002370380] |

    Necessary for CRFs?

    [223002370390] |Jenny Finkel recently told me she really thought IO encoding is all you need in most cases. [223002370400] |It’s how I coded a single-entity problem for the mechanical Turkers for NE annotation. [223002370410] |As far as encodability, it’s fine for MUC, almost OK for Biocreative [there are a couple contiguous entities, I think], but degenerate in general. [223002370420] |The really nice thing about this encoding is that for a single entity type, we can get away with a single feature vector. [223002370430] |The recent Ratinov and Roth NE survey concluded BMEWO was superior to BIO for a second-order online logistic model (basically what’s known as a MEMM, but with best guess at each token enforced going forward). [223002370440] |I found BMEWO+ worked best for single-entity and small entity set tagging for our HMMs. [223002370450] |

    What to Do for First-Order CRFs?

    [223002370460] |What I don’t know is what I should do for CRFs. [223002370470] |I’ve settled on only using first-order CRFs, and not having feature extraction depend on the output tag (or more precisely, every feature is crossed with every output tag at every position). [223002370480] |I’m tempted to try to punt and write up a flexible TaggingChunkingCodec interface and let the user specify it. [223002370490] |The real problem is that evaluation’s not simple in this context, because results vary pretty dramatically based on features. [223002370500] |That’s why I love and hate CRFs. [223002380010] |The Futility of Commenting Code [223002380020] |I often read about novice programmers’ aspirations for a greater density of code comments, because they think it’s “professional”. [223002380030] |I have news for you. [223002380040] |Professional coders don’t comment their own code much and never trust the comments of others they find in code. [223002380050] |Instead, we try to learn to read code and write more readable code. [223002380060] |

    API Documentation vs. Code Comments

    [223002380070] |I’m a big believer in clear API documentation. [223002380080] |Java makes this easy with javadoc. [223002380090] |It even produces relatively spiffy results. [223002380100] |But if you look at the LingPipe source code, you’ll find very few code comments. [223002380110] |

    Comments Lie

    [223002380120] |The reason to be very suspicious of code comments is that they can lie. [223002380130] |The code is what’s executed, so it can’t lie. [223002380140] |Sure, it can be obfuscated, but that’s not the same as lying. [223002380150] |I don’t mean little white lies, I mean big lies that’ll mess up your code if you believe them. [223002380160] |I mean comments like “verifies the integrity of the object before returning”, when it really doesn’t. [223002380170] |A typical reason for slippage between comments and code is patches to the code that are not accompanied by patches to the comments. [223002380180] |(This can happen for API doc, too, if you fiddle with your APIs.) [223002380190] |Another common reason is that the code author didn’t actually understand what the code was doing, so wrote comments that were wrong. [223002380200] |If you really mess up along these lines, your code’s likely to be called out on blogs like Coding Horror or Daily WTF. [223002380210] |

    Most Comments Considered Useless

    [223002380220] |The worst offenses in the useless category are things that simply repeat what the code says. [223002380230] |It’s even better when embedded in a bloated Fortran-like comment block: [223002380240] |Thanks. [223002380250] |I didn’t know what int meant or what n+1 did. [223002380260] |This is a classic rookie mistake, because rookies don’t know the language or its libraries, so often what they do seems mysterious to them. [223002380270] |For instance, commenting n >>>2 with “shift two bits to right and fill in on the left with zeroes” may seem less egregious, but your reader should know that >>>is the unsigned shift operator (or look it up). [223002380280] |There is a grey line in the sand (I love mixed metaphors) here. [223002380290] |When you’re pulling out techniques like those in Bloch and Gafter’s Java Puzzlers, you might want to reconsider how you code something, or adding a wee comment. [223002380300] |

    Eliminate, don’t Comment Out, Dead Code

    [223002380310] |I remember being handed a 30-page Tcl/Tk script at Bell Labs back in the 1990s. [223002380320] |It ran some kind of speech recognition pipeline, because that’s what I was up to at the time. [223002380330] |I picked it up and found dozens of methods with the same name, and lots of commented-out code. [223002380340] |This makes it really hard to follow what’s going on, especially if whole blocks get commented out with /* ... */. [223002380350] |Please don’t do this. [223002380360] |You should lLearn any of thea version control systems instead, like SVN. [223002380370] |

    When do I Comment Code?

    [223002380380] |I add comments in code in two situations. [223002380390] |The first is when I wrote something inefficiently, but I know a better way to do it. [223002380400] |I’ll write something like “use dynamic programming to reduce quadratic to linear”. [223002380410] |This is a bad practice, and I wish I could stop myself from doing it. [223002380420] |I feel bad writing something inefficient when I know how to do it more efficiently, and I certainly don’t want people reading the code to think I’m clueless. [223002380430] |I know only one compelling reason to leave comments: when you write code that’s not idiomatic in the language it’s written in, or when you do something for efficiency that’s obscure. [223002380440] |And even then, keep the comments telegraphic, like “C style strings for efficiency”, “unfolds the E step and M step of EM”, “first step unfolded for boundary checks” or something along those lines. [223002380450] |

    Write Readable Code Instead

    [223002380460] |What you should be doing is trying to write code that’s more readable. [223002380470] |I don’t actually mean in Knuth’s literate programming sense; Knuth wants programs to look more like natural language, which is a Very Bad Idea. [223002380480] |For one, Knuth has a terrible track record, bringing us TeX, which is a great typesetting language, but impossible to read, and a three-volume set of great algorithms written in some of the most impenetrable, quirky pseudocode you’re ever likely to see. [223002380490] |Instead, I think we need to become literate in the idioms and standard libraries of the programming language we’re using and then write literately in that language. [223002380500] |The biggest shock for me in moving from academia to industry is just how much of other people’s code I have to read. [223002380510] |It’s a skill, and I got much better at it with practice. [223002380520] |Just like reading English. [223002380530] |Unfortunately, effective programming is as hard, if not harder, than effective essay writing. [223002380540] |Writing understandable code that’s also efficient enough and general enough takes lots of revision, and ideally feedback from a good code reviewer. [223002380550] |Not everyone agrees about what’s most readable. [223002380560] |Do I call out my member variables from local variables with a prefix? [223002380570] |I do, but lots of perfectly reasonable programmers disagree, including the coders for Java itself. [223002380580] |Do I use really verbose names for variables in loops? [223002380590] |I don’t, but some programmers do. [223002380600] |Do I use four-word-long class names, or abbreviations? [223002380610] |The former, but it’s another judgment call. [223002380620] |However you code, try to do it consistently. [223002380630] |Also, remember that coding standards and macro libraries are not good places to innovate. [223002380640] |It takes a while to develop a consistent coding style, but it’s worth the investment.