[223001870010] |Greg Little’s Turkit: Multi-Pass Annotation Refinement [223001870020] |I had the luxury of dashing out to California last week for the San Francisco Turk Meetup hosted by Dolores Labs (see the O’Reilly radar blog summary). [223001870030] |The TurKit Application presented by Greg Little really caught my eye. [223001870040] |You, too, can read the pre-publication paper: [223001870050] |
  • Little, Greg, Lydia B. Chilton, Robert C. Miller, and Max Goldman. 2009. [223001870060] |TurKit: Tools for iterative tasks on Mechanical Turk. [223001870070] |Not yet published. [223001870080] |TurKit’s an open-source package that lets you submit a single task to multiple Turkers in a novel way. [223001870090] |Rather than having them all do the task independently, the Turkers work on the tasks sequentially. [223001870100] |That is the first Turker does it from scratch, then the second Turker works to improve the first Turker’s answer. [223001870110] |The example in the talk involved transcribing a very messily written note. [223001870120] |It’s amazing to see the progression from the first Turker to the eighth Turker. [223001870130] |TurKit has a beautiful programming paradigm which lets you write jobs imperatively. [223001870140] |The base language is JavaScript (with JSON-based serialization). [223001870150] |The beautiful part is that TurKit lets the programmer write loops, the inner operations of which call out to Turk annotation jobs. [223001870160] |This is completely natural, and completely unsupported by the Turk API, which is your usual REST-type stateless application. [223001870170] |It reminds me of Python’s yield-based iterator implementations. [223001880010] |Raykar et al. (2009) Supervised Learning from Multiple Experts: Whom to Trust when Everyone Lies a Bit [223001880020] |Good learners for evil teachers. [223001890010] |Dasgupta and Hsu (2008) Hierarchical Sampling for Active Learning [223001890020] |Once I got onto the ICML discussion site, I followed some of the papers that had discussions. [223001890030] |I’d been meaning to circle back around to this paper, as it addressed several of the concerns I had about active learning: [223001890040] |
  • Dasgupta, Sanjoy and Daniel Hsu. 2008. [223001890050] |Hierarchical sampling for active learning. [223001890060] |In ICML. [223001890070] |The general topic's still how to efficiently (in human effort) collect useful training data (e.g. for classifiers). [223001890080] |Update at 3:02 PM: As soon as I wrote this, I read a post on active learning in John Langford's blog, which points to the following tutorial page, which, among other things, has a great bibliography:
  • Sanjoy Dasgupta and John Langford. 2009. [223001890100] |Active Learning Tutorial. [223001890110] |Presented at ICML. [223001890120] |What's worried me since the first time I heard about active learning is that the data's likely to wind up biased, because it's not being chosen randomly. [223001890130] |For example, by choosing examples at the margin (e.g. always choosing the least confident annotation), you can wind up with a highly non-representative data set. [223001890140] |Dasgupta and Hsu do some analysis along with providing a nice literature survey. [223001890150] |The basic idea of this paper is that you can cluster the underlying data, then sample from the clusters. [223001890160] |If the underlying data clusters well w.r.t. the classification problem, this can be very helpful. [223001890170] |Needless to say, there's a whole lot of math explaining exactly what it means for data to cluster well w.r.t. the classification problem. [223001890180] |What struck me skimming through this paper is that K-means++ initialization for K-means clustering had the same motivation (it's now implemented in LingPipe, and we've found it to work well in practice). [223001890190] |K-means++ works by sampling a new centroid with probability proportional to its squared distance from the closest existing centroid. [223001890200] |Outliers are by definition a long way away from a centroid, and thus have high individual selection probabilities. [223001890210] |The idea is that by sampling, the more representative items can win out by sheer numbers, even if each is individually unlikely to be chosen. [223001890220] |So what about choosing items for active learning with probability proportional to their expected probability of being erroneously classified? [223001890230] |I'm thinking this'd have the same kind of effect as K-means++. [223001890240] |Has anyone done this already? [223001890250] |PS: I also learned from Foster's active learning paper that my academic hero and former colleague, Herb Simon, had the first citation in this field back in the early 1970s. [223001890260] |As usual, he was decades ahead of his time. [223001890270] |Update: John Langford just posted a blog entry on active search, mentioning this:
  • Sanjoy Dasgupta and John Langford. 2009. [223001890290] |Active Learning Tutorial. [223001890300] |At ICML. [223001900010] |Chain Conditional Random Fields: Implementation and Design Issues [223001900020] |I’m starting to implement conditional random fields (CRF) for LingPipe. (See, I really don’t hate CRFs.) If you don’t know CRFs, here are three good tutorials: [223001900030] |
  • Sutton, Charles and Andrew McCallum. 2007. [223001900040] |An Introduction to Conditional Random Fields for Relational Learning. [223001900050] |In Lise Getoor and Ben Taskar, eds., Introduction to Statistical Relational Learning. [223001900060] |MIT Press. [223001900070] |
  • McDonald, Ryan. 2007. [223001900090] |Generalized Linear Classifiers in NLP. [223001900100] |Course presented to the Swedish Graduate School in Language Technology. [223001900110] |
  • Klein, Dan and Chris Manning. 2003. [223001900120] |Maxent Models, Conditional Estimation, and Optimization, without the Magic. [223001900130] |Tutorial at ACL and NAACL. [223001900140] |Implementing CRFs brings up several design issues. [223001900150] |N-gram Order [223001900160] |Like HMMs, chain CRFs extract features over the sequence of tags. [223001900170] |The issue is whether the features can span more than a single pair of tags (a bigram). [223001900180] |For instance, can I use the previous label and the label two back to help model the current label? [223001900190] |Although it'd be nice to implement everything very generally for arbitrary n-gram contexts, it's actually a pain in practice: everything gets deeper loops, and allocating all the arrays is hard to do generically in Java. [223001900200] |And then extracting answers with confidence and doing n-best also gets more complex. [223001900210] |I'm also not sure that any order beyond first is going to be tractable enough for training. [223001900220] |Feature Extraction [223001900230] |I plan to do the usual factoring of features into node features (which depend only on the inputs) and edge features (which depend on the input and also the previous category). [223001900240] |Even though edge features are more general, it's much more efficient to implement the usually much more prevalent node features independently (it all goes through the forward-backward algorithm). [223001900250] |Nodes and edges have their own interfaces, [223001900260] |and [223001900270] |then we have two feature extractors, one over nodes and one over edges, required to estimate a CRF from a corpus of labeled tag data:
  • FeatureExtractor, and [223001900290] |
  • FeatureExtractor. [223001900300] |One or More Coefficient Vectors? [223001900310] |Given the feature extraction, I have to follow our logistic regression model in having different coefficient vectors per category rather than a single coefficient vector that essentially concatenates all the individual vectors. [223001900320] |Having different vectors makes feature extraction much more economical. [223001900330] |In the “standard” CRF presentation, a separate feature extraction is done for each possible tag, both for nodes and edges. [223001900340] |That does allow some richer features (e.g. using a feature for labels that don’t change entity type, such as B_PERSON to I_PERSON), and also allows features to be structural zeroes for some tags. [223001900350] |Note that the output tag is implicit in the proposed feature extraction, which means it’s always included to distinguish both node and edge features across output tags. [223001900360] |K-1 Coefficient Vectors vs. K coefficient Vectors [223001900370] |That leaves the issue of whether to use (K-1) coefficient vectors (where K is the number of tags), as you see in typical statistics presentations of logistic regression. [223001900380] |The first advantage is one less vector to store, and even more importantly, one fewer vector to update. [223001900390] |The second is a system with an identified answer even for Laplace priors or maximum likelihood. [223001900400] |The alternative is to use K coefficient vectors, as you see in the typical “machine learning” presentation of CRFs or logistic regression. [223001900410] |I’m planning to go with the stats approach, as I did for our logistic regression. [223001900420] |Order of Coefficient Vectors and Sparsity [223001900430] |There’s an issue of how to store the coefficient matrix vector, either tag dominant or feature dominant. [223001900440] |That is, do we have a map from features to categories to numerical values, or from categories to features to numerical values. [223001900450] |The latter is the natural way to do it if you look at the typical definitions. [223001900460] |The former way would be useful if many features are non-zero for half the categories or fewer. [223001900470] |I don’t think that’ll happen much given the way the regressions work, so I’m planning to go with (features to categories to numerical values). [223001900480] |I pretty much have to use sparse vectors for the edge and node features. [223001900490] |But what about the model’s regression coefficients? [223001900500] |I’ll be taking the coefficient vector’s dot-product with the sparse edge and node features. [223001900510] |That’s much more efficient for dense coefficient vectors. [223001900520] |I could save some space for really big models, but I think the time’s going to be more precious here. [223001900530] |Luckily, this implementation detail will be easy to change under the hood if I change my mind later. [223001900540] |Arithmetic Precision [223001900550] |We won’t be able to estimate coefficients with more accuracy than is representable as a float. [223001900560] |So for storing the models, floats make most sense. [223001900570] |The problem is that at run time, we need to do double-precision arithmetic for decoding (in all three cases, Viterbi first best, Viterbi/A* n-best, or full forward-backward marginal decoding). [223001900580] |It’s slower to cast floats to doubles every time you do a multiply, so I’ll probably keep the coefficients dense and doubles as long as they’ll fit. 
[223001900590] |If they won’t fit, I’ll probably switch the implementation to sparse floats. [223001900600] |It’ll all be transparent and backward compatible at the API level if I make the switch after the first release. [223001900610] |Start/End Tags [223001900620] |For HMMs, I added a BEGIN-OF-SENTENCE tag and an END-OF-SENTENCE tag. [223001900630] |The start tag is not generated, but used as context to generate the first word in the sentence. [223001900640] |The end-of-sentence tag is generated, because it’s required to get the normalizations right. [223001900650] |For CRFs, it seems to make most sense to include a BEGIN-OF-SENTENCE tag that doesn’t have an estimated coefficient, but is just used in defining the edge features for the first token in a sentence. [223001900660] |The end marker is actually derivable from the node (word-only) features. [223001900670] |So is the begin marker, but it seems simpler this way than to have a null value passed in to the edge feature extractor. [223001900680] |Stochastic Gradient Descent [223001900690] |I’m not even considering other implementations. [223001900700] |I’ll parameterize it the same way as for logistic regression. [223001900710] |This means I’ll allow coefficient priors (Laplace, Gaussian, Cauchy, with the option of non-zero means and different variances/scales per dimension). [223001900720] |I’ll also allow an annealing schedule. [223001900730] |SGD Code Re-use [223001900740] |A serious question is whether I should try to re-use the SGD code in logistic regression. [223001900750] |The basic algorithm’s the same, only now I have to do forward-backward in the inner loop to compute the gradient of the log likelihood function. [223001900760] |I’m thinking I won’t share the code, but that’s mainly because the refactoring to something with the right callbacks in the inner loop looks daunting. [223001900770] |Forward/Backward Decoding and N-best [223001900780] |I need to do forward/backward for training, not just confidence-based extraction. [223001900790] |What I’d like to do is provide a common format for the lattice output I can share between HMM taggers and CRFs. [223001900800] |I’ve already defined a new package com.aliasi.tag that has three interfaces for tagging: first-best, n-best (joint or conditional probabilities), and marginal (per tag) taggers matching those for HMMs. [223001900810] |Relation to Chunking [223001900820] |I’d also like to provide a general tagger to chunker implementation, which means more refactoring, but will immediately let me use CRFs to do named-entity extraction with n-best and confidence-based (per entity mention) extractions. [223001900830] |I’d very much like to refactor the CharLmHmm chunker to rely on a generic tagger. [223001900840] |This should be doable. [223001900850] |Threading and/or File Compilation [223001900860] |Unfortunately, storing in memory all of the features for an input task the size of the Penn Treebank POS data (about 50 tags and 1M words) is prohibitive. [223001900870] |There are 50M edge feature extractions and 1M node feature extractions. [223001900880] |Even if the edge feature extractions are tiny (say 40 bytes), that’s a lot of memory; if we add in previous tag plus word features, it won’t fit in memory even on a big workstation. [223001900890] |If there are hundreds of word features per node, which is conservative if we use 5-word contexts, n-grams and subword features, and represent them tightly, we’re still looking at several KB/vector and thus several GB of memory. 
[223001900900] |So I’m inclined to do the feature extraction online. [223001900910] |This is making me queasy, though, because it’ll be very expensive if I do it naively in a single thread (even measured relative to forward/backward, perhaps dominating training time). [223001900920] |Alternatively, I can run feature extraction in one thread and gradient computation in another. [223001900930] |Then we probably won’t notice the feature extraction if there are enough CPUs free. [223001900940] |The other alternative, which matches what may CRF packages do, is to require all of the feature vectors for the training corpus to be serialized. [223001900950] |Then they can be read back in much more efficiently than creating them from scratch. [223001900960] |It won’t take that much disk space, but the requirement will be nontrivial (perhaps as much as 20 or 30GB with Treebank POS size corpora and subword features). [223001900970] |Generic Inputs [223001900980] |There’s really nothing word-specific about CRFs. [223001900990] |In fact, they could take arbitrarily structured inputs instead of word tokens. [223001901000] |This would require generifying the CRF interface to the type of objects being tagged, and similarly generifying feature extraction and the lattice representations. [223001901010] |I could probably get away then with having the HMMs just instantiate the generic to String. [223001901020] |I may just go ahead and do this once it’s implemented for strings. [223001901030] |It’ll probably be easier to refactor a string-based implementation than to start with something more generic. [223001901040] |(Thanks again to those Java engineers for making things so easy to modify with backward compatibility.) [223001901050] |Overflow / Underflow [223001901060] |Overflow happens when a linear predictor gets too big and exponentiating it overflows. [223001901070] |We handle this by computing softmax the right way. [223001901080] |It’s easy to underflow during forward-backward, which is on a linear scale. [223001901090] |Sutton and McCallum outline two approaches to handling this. [223001901100] |One is what we did for HMM forward/backward, which is to normalize each forward vector (the forward values for all tags at a particular token) to sum to 1.0. [223001901110] |In this case, the usual computation for the normalizing constant Z just drops out. [223001901120] |The other approach Sutton and McCallum mention, which I’d never heard of before reading their paper, is to keep the forward/backward values on the log scale and then perform computations in the exponentiated domain and rescale. [223001901130] |This can actually be done with reasonable precision (always a problem with logs/exps). [223001901140] |As Sutton and McCallum note, this can be expensive because of the increased number of calls to log() and exp(). [223001901150] |Unsupported Features [223001901160] |I have to admit I didn’t understand the discussion of “unsupported” features in Sutton and McCallum, which are defined to be features that don’t appear in the training data. [223001901170] |This may be because I’ve already factored my features so that there are no output-specific features, so there may not be any such “unsupported” features. [223001901180] |I’m definitely restricting to only features that appear in the training data, because I can’t train anything using stochastic gradient. [223001901190] |Boolean Feature Vectors [223001901200] |Many NLP applications use boolean feature vectors, which only allow values 0 or 1. 
[223001901210] |By restricting to boolean feature vectors, many computations can be made more efficient. [223001901220] |I think I may go partway there by including dense and sparse boolean feature vector representations. [223001901230] |The sparse ones would require a very high degree of sparsity to be worth it in terms of space, but they may dominate in things like cross-products because they’d use very few comparisons. [223001910010] |Veeramachaneni and Kondadadi (2009) Surrogate Learning –From Feature Independence to Semi-Supervised Classification [223001910020] |It’s more fun with classifier algebra today, as we look in our friends at Thomson-Reuters R&D: [223001910030] |
  • Veeramachaneni, Sriharsha and Ravi Kumar Kondadadi (2009) Surrogate Learning –From Feature Independence to Semi-Supervised Classification. [223001910040] |In NAACL Workshop on Semi-Supervised Learning.
[223001910050] |The Algebra [223001910060] |The problem is binary classification into a class y in {0,1}, and the modeling assumption is that inputs consist of pairs of vectors x = (x1,x2) where x1 in X1 and x2 in X2, and where the x1 and x2 are conditionally independent given the class y: [223001910070] |p(x1,x2|y) = p(x1|y) * p(x2|y) [223001910080] |Note that we don't have to make the co-training assumption that both x1 and x2 are sufficient to build a classifier in and of themselves. [223001910090] |To prevent division by zero, they also assume p(x1|x2) > 0 and p(x1|y) > 0, and that p(x1|y=0) != p(x1|y=1). [223001910100] |With this, they derive a formula for p(y|x) involving only p(x1|y) and p(x1|x2): [223001910110] |Pr(y=0|x1,x2) = p(x1|y=0) / p(x1|x2) [223001910120] |* [ p(x1|y=1) - p(x1|x2) ] / [ p(x1|y=1) - p(x1|y=0) ] [223001910130] |What's neat about their analysis is that you can train p(x1|x2) without supervision about p(y|x1,x2). [223001910140] |If you can get a good handle on p(x1|y) in some way, you can bring in all that information about x2 through p(x1|x2). [223001910150] |As the authors note, their approach is in many ways similar to Ando and Zhang's 2007 two-view model. [223001910160] |Special Case Examples [223001910170] |They run two examples, both of which are very real problems for Thomson-Reuters: one on physician record linkage and one on discovering paraphrases for merger and acquisition sentences. [223001910180] |In both experiments they make a further assumption that x1 is a binary variable with 100% recall for y=1, so that Pr(y=1, x1=0) = 0, which lets them loosen the restriction on independence to cases where y=0, namely [223001910190] |p(x1,x2|y=0) = p(x1|y=0) * p(x2|y=0) [223001910200] |Physician Record Linkage [223001910210] |For the record-linkage case, they are classifying two physician records as to whether they are the same physician (y=1) or not. [223001910220] |The 100% recall feature x1 is defined to be "have the same year of graduation". [223001910230] |That is, they're assuming that if two records refer to the same physician, they have the same year of graduation. x2 is then the remaining information, and they estimate p(x1|x2) using logistic regression. [223001910240] |They estimate p(x1|y=0) with a simple frequency estimate by taking random samples of pairs of physicians (some of these might actually match, but it's sparse). [223001910250] |They show that their approach does a pretty good job compared to training a logistic regression model on a few hundred supervised instances (though not overwhelmingly better). [223001910260] |The improvements over the supervised case all come in recall, which is not surprising. [223001910270] |It'd be nice to see the whole precision-recall curves for the problem, which is typical for record linkage apps (see, e.g., one of the first examples in Gelman et al.'s Bayesian Data Analysis, which covers calibrating record linkage, a closely related problem). [223001910280] |They did a nice job of dealing with missing data, too, but that's really a separate issue to some extent. [223001910290] |Paraphrasing Merger and Acquisition Sentences [223001910300] |This is a neat bootstrap (in the colloquial usage, not the statistical one) approach to finding sentences about mergers and acquisitions between pairs of companies. [223001910310] |It's similar to Agichtein and Gravano's Snowball system. [223001910320] |They used a named entity detector to tag the data, and then restricted attention to sentences containing pairs of companies.
[223001910330] |They then created high-precision patterns like "ORG bought ORG" and "offer for ORG" to create seeds (it's not my fault that bootstrapping employs such a horribly mixed metaphor). [223001910340] |They use the seeds to find relation sentences. [223001910350] |They then go and find other sentences with the same pair of organizations within a time frame around the seed sentence of a month or two and consider those positive. [223001910360] |They consider all other instances of one of the organizations and another organization negative (another noisy rule, like the graduation year). [223001910370] |The 100% recall feature x1 is that the pair of organizations occurs in a seed sentence (this is also noisy). [223001910380] |They use a bag of unigram, bigram and trigram word features for x2. [223001910390] |For some reason they use an SVM for p(x1=1|x2). [223001910400] |They don't actually mention how they estimate p(x1|y=0). [223001910410] |They give a few operating points, one of which has 99% recall at 79% precision, which is nice. [223001910420] |Conclusion [223001910430] |I'd have given this paper the thumbs-up, but the explanation of the paraphrase task could've been more clearly related back to the technique of the paper. [223001910440] |Overall, I'd like to see more of this kind of innovative work on both technique and application in the main ACL. [223001920010] |Log Sum of Exponentials [223001920020] |I've decided, at least in the first pass, to calculate forward/backward for CRFs on a log scale. [223001920030] |The alternative way to maintain floating point stability, by keeping log multipliers for columns of outputs, was just making my head hurt too much in the nested loops. [223001920040] |The log computation involves taking the logarithm of a sum over the exponentiated values found in an array. [223001920050] |In symbols: log ( exp(x[1]) + exp(x[2]) + ... + exp(x[N]) ). [223001920060] |The problem, of course, is that if any of the x values get sufficiently large (greater than log(Double.MAX_VALUE), which is about 710), they overflow to positive infinity when you calculate the exponent; if all of the values are tiny (less than -750 or so), they all underflow to 0.0 and the resulting log sum is zero. [223001920070] |Luckily, we can get more arithmetic stability with a little algebra. [223001920080] |First, find the maximum value in x: m = max(x[1], ..., x[N]). [223001920090] |Then use it to normalize the largest value to 0: log ( exp(x[1]) + ... + exp(x[N]) ) = m + log ( exp(x[1] - m) + ... + exp(x[N] - m) ). [223001920100] |So the largest value passed to the exponential function is 0. [223001920110] |If there are really tiny values after subtraction, they'll become zero and drop out, as they should with limited precision arithmetic. [223001920120] |I've implemented a new static utility method to compute the log of the sum of the exponentiated values of an array: [223001920130] |com.aliasi.util.Math.logSumOfExponentials(double[]). [223001920140] |It'll be out in the next release (hopefully along with CRFs). [223001920150] |In case you're curious where this comes up, here's the inductive step of the forward pass of the forward-backward algorithm for CRFs (φ(n,k') is the feature vector for token position n and previous outcome k'; β[k] is the coefficient vector for outcome k; α[n][k] is the forward value for outcome k at position n): α[n][k] = Σ over k' of α[n-1][k'] * exp(β[k] · φ(n,k')), [223001920160] |and here it is on the log scale: log α[n][k] = logSumExp over k' of ( log α[n-1][k'] + β[k] · φ(n,k') ). [223001930010] |$1M Netflix Prize Qualifier BellKor Pragmatic Chaos [223001930020] |Finally! [223001930030] |The BellKor Pragmatic Chaos team, winner of the last two Netflix Progress Prizes, just topped the 10% improvement on the development set required to be evaluated for the grand prize of US$ 1M.
Check out the Netflix Prize Leaderboard. [223001930040] |Given what the public's said about this contest before, I can't say I'm too shocked at the comments on the NY Times' Big Blog post And the Winner of the $1 Million Netflix Prize (Probably) Is …. [223001930050] |They were overwhelmingly mocking and negative. [223001930060] |They thought the academics were giving Netflix free work. [223001930070] |They thought that 10% better than lousy recommendations was still lousy. [223001930080] |I've said this before, but what the public doesn't realize is just how valuable this kind of data is. They'd probably be horrified at how much time we "waste" on bakeoffs like this one where there is no reward. [223001930090] |A lot of groups out there probably would've paid Netflix for this kind of data, just like we pay the partly publicly funded Linguistic Data Consortium. [223001930100] |Plain and simple, this is a regression problem on a relatively huge data set, with data missing nonrandomly. [223001930110] |The top teams achieved a fairly amazing reduction in variance by pooling multiple predictors. [223001930120] |Given the (root) mean squared error evaluation metric and the bias-variance decomposition of mean squared error, it's not surprising the best systems were all ensembles of classifiers. [223001930130] |Fitting these models and dreaming up structures and parameterizations was a Herculean effort. [223001930140] |Check out the details up to last year in:
  • Bell, Robert M., Yehuda Koren, and Chris Volinsky. 2008. [223001930160] |The BellKor 2008 Solution to the Netflix Prize. [223001940010] |Convergence for (Stochastic) Gradient Descent (or EM) [223001940020] |One of the most confusing aspects of the stochastic gradient descent (SGD) and expectation maximization (EM) algorithms as usually written is the line that says "iterate until convergence". [223001940030] |For some reason, most presentations don't indicate how to measure convergence. [223001940040] |There are two common approaches, and perhaps more that I don't know about: [223001940050] |1. Convergence by Likelihood + Prior [223001940060] |The first common approach is to calculate the log likelihood [223001940070] |log p(y|θ) [223001940080] |for the data y given current parameter settings θ, and the log prior for the parameters, [223001940090] |log p(θ), [223001940100] |then monitor the sum [223001940110] |log p(y|θ) + log p(θ) [223001940120] |for convergence. [223001940130] |This approach makes sense intuitively, because it's the value of the likelihood and prior that's being maximized by the search algorithms (in loss terms, the negation of the log likelihood and prior is minimized). [223001940140] |2. Convergence by Vector Distance [223001940150] |The second approach is to monitor the coefficient vectors themselves. [223001940160] |Typically, this would involve measuring Euclidean (or taxicab) distance between the parameter vectors across epochs. [223001940170] |Note that the vector-based approach is the only one that makes sense for a truly online algorithm. [223001940180] |3. Convergence by Held-Out Evaluation [223001940190] |I forgot to mention this in the first post. [223001940200] |You can also use held-out labeled data to evaluate the classifier or tagger being produced, and stop training at the point the error is minimized. [223001940210] |The main problem with this approach is that the search problem's not convex. [223001940220] |In practice, people often use this as a form of "early stopping", whereby training is halted before convergence. [223001940230] |As Bishop's machine learning book shows, this can be viewed as a kind of regularization if the learning rate's low enough. [223001940240] |By stopping early, coefficients haven't had time to reach their optimal value. [223001940250] |Relative Convergence and Rolling Averages [223001940260] |Algorithms like stochastic gradient descent (unlike EM) can get into situations where log likelihood actually decreases across an epoch. [223001940270] |SGD and EM both sometimes make small steps, then start making bigger steps. [223001940280] |So what I've done is taken a rolling average, as well as allowing a minimum number of epochs to be set. [223001940290] |Likelihood Converges, Vectors Diverge [223001940300] |If the vectors converge, the likelihood converges, but the converse doesn't hold, for two reasons. [223001940310] |First, the vectors can plain old diverge. [223001940320] |If the data's separable and there's no prior, there is no limit to how big the coefficients can get; the larger the coefficients get, the higher the likelihood. [223001940330] |(For logistic regression, this is because 1/(1+exp(-x)) approaches 0 as x approaches negative infinity and approaches 1 as x approaches infinity.) [223001940340] |The second problem is that the coefficients may not be identified.
[223001940350] |For instance, if I take maximum likelihood and use a different vector for each category (instead of one fewer than the number of categories), then the maximum a posteriori (MAP) solution isn’t identified in the statistical sense. [223001940360] |That is, two different sets of vectors may produce the same result (subtract any vector from all the coefficient vectors and the results don’t change for logistic regression). [223001950010] |DARPA grants BBN $30M for “Universal Reading System” [223001950020] |Wow. [223001950030] |DARPA just granted BBN a $30M contract (US dollars, of course), spread over 5 years, to develop a “universal reading system.” [223001950040] |
  • BBN Press Release: BBN Technologies Awarded $30 Million in Defense Funding to Teach Machines to Read [223001950050] |I was amused by this quote from BBN's press release, neatly worded in the subjunctive: [223001950060] |Prem Natarajan, vice president, Speech and Language Processing, BBN Technologies, said, "The machine reading system that DARPA envisions is not evolutionary, but revolutionary. [223001950070] |Such a system could eliminate many of the impediments to stability that our military faces such as a lack of understanding of local customs, and give us the ability to assess global technology developments continuously." [223001950080] |It's nice to envision something revolutionary, but here's what BBN says it's going to do: [223001950090] |BBN will leverage its expertise in natural language processing and distillation to develop a universal text engine that captures knowledge from text and transforms it into the formal representations required by artificial intelligence systems. [223001950100] |DARPA's been promoting this approach to "event extraction" (basically populating a Minsky-style frame from text) in pretty much the same way since the Message Understanding Conferences (MUC) and before. [223001950110] |The DARPA call for proposals would be not unfamiliar to an AI researcher from the 1970s. [223001950120] |I found this announcement in
  • Steve Arnold’s blog post: DARPA: We Want to Be Like Google, [223001950140] |which linked to this [223001950150] |
  • CNET article: Reading machine to snoop on Web. [223001950160] |The titles are perhaps more revealing of the authors than the proposal and grant. [223001960010] |Medical Subject Headings (MeSH) Parser [223001960020] |I just finished up the doc on a complete parser and object model for the U.S. National Library of Medicine's XML-based Medical Subject Headings (MeSH) distribution. [223001960030] |MeSH contains over 25,000 controlled vocabulary items in the biomedical domain, organized into highly structured and cross-linked records. [223001960040] |Here are links that explain what's in MeSH: [223001960050] |
  • MeSH Element Descriptions
  • MeSH example files and download links
  • Online MeSH Vocabulary Searching and Browsing [223001960080] |What’s really neat is that MEDLINE is distributed with MeSH terms added by a dedicated staff of research librarians. [223001960090] |The parser and object representations follow the XML pretty closely, which may be familiar from LingPipe’s longstanding MEDLINE parsing and object package (which has its own tutorial). [223001960100] |The parsers for MeSH and MEDLINE can take the compressed distribution files and parse out object-oriented representations of the associated MeSH or MEDLINE records. [223001960110] |The parsing pattern employed in LingPipe’s corpus package is very much like SAX’s parser/handler pattern. [223001960120] |At one point, I wrote a whole blog entry on LingPipe’s parser/handler framework. [223001960130] |A handler is attached to a parser, and as the parser parses the raw data, it sends object-oriented representations of records to the handler. [223001960140] |There’s a simple demo command in the package that just prints them out to show how the code can be used. [223001960150] |As part of the preparations for LingPipe 4.0, I’m going to be moving the MEDLINE package (com.aliasi.medline) out of LingPipe proper and into the LingMed sandbox project. [223001960160] |LingMed will probably graduate from the sandbox and start getting distributed on its own. [223001960170] |LingMed also includes parsers for distributions of NLM’s Entrez-Gene, NLM et al.’s Online Mendelian Inheritance in Man (OMIM) and the GO Consortium’s Gene Ontology (GO). [223001960180] |These all work using the same basic SAX-like pattern. [223001960190] |LingMed also has data access layer implementations of search for many of these data sets (like Entrez-Gene and MEDLINE) integrated with Lucene. [223001960200] |In particular, you can do fielded searches, yet retrieve object-oriented results (rather than Lucene documents) out the other side. [223001960210] |We’ll continue to add to the parsing and search base over time. [223001960220] |Mitzi‘s working on Wormbase as part of her job at NYU, and I’ve done the preliminary parsing for Uniprot. [223001960230] |Downloading LingMed [223001960240] |You can find instructions on the LingPipe site for anonymous subversion (SVN) access to our sandbox, which includes specific information about checking out the LingMed project. [223001960250] |Let us know if you find it useful or want us to add other features. [223001960260] |We’d love to get some other folks using this. [223001960270] |As is, the documentation’s a bit sketchy in places. [223001960280] |Please feel free to send the LingPipe mailing list or me (carp@alias-i.com) questions about it directly. [223001970010] |Collapsed Gibbs Sampler for Hierarchical Annotation Model [223001970020] |The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. [223001970030] |Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler. [223001970040] |Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. 
[223001970050] |While I was procrastinating on refactoring feature extraction for CRFs by reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their papers on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I've blogged about. [223001970060] |Hierarchical Model of Binomial Annotation [223001970070] |The model's quite simple, at least in the binomial case. [223001970080] |I've further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing): [223001970090] |The Collapsed Sampler [223001970100] |The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. [223001970110] |With categories given, it's easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors. [223001970120] |The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. [223001970130] |The formula's straightforwardly derived with Bayes' rule: [223001970140] |p(c[i]|x, θ0, θ1) ∝ p(c[i]) * Π_{j in 1:J} p(x[i,j] | c[i], θ0[j], θ1[j]) [223001970150] |Moment-Matching Beta Priors [223001970160] |*The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. [223001970170] |Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). [223001970180] |In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra). [223001970190] |I'm not too worried about the Pareto scale prior —it's pretty diffuse. [223001970200] |I suppose I could've done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it's not the ML estimate; check out Thomas Minka's paper Estimating a Dirichlet Distribution and references therein.]). [223001970210] |Initialization [223001970220] |The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. [223001970230] |These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. [223001970240] |Then each epoch just resamples all the categories, then recomputes all the other estimates. [223001970250] |This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is that it hardly seemed worth the extra coding effort. [223001970260] |I made sure to scale for underflow, and that's about it. [223001970270] |It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman's BDA book's appendix), about two hours to code, and about forty-five minutes to write up for this blog entry. [223001970280] |Speed and Convergence [223001970290] |It's very speedy and converges very, very quickly compared to the full Gibbs sampler in BUGS.
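To spell out the moment-matching step above: given the empirical mean and variance of the current sensitivity (or specificity) samples, solve for the beta parameters. A minimal sketch (the class and method names here are mine, not LingPipe's):

    // Method-of-moments fit of a Beta(alpha, beta) prior to observed rates
    // (e.g. the current per-annotator sensitivities). Assumes 0 < mean < 1 and
    // variance < mean * (1 - mean); otherwise the moment equations break down.
    public class BetaMoments {
        public static double[] fit(double[] rates) {
            double mean = 0.0;
            for (double r : rates) mean += r;
            mean /= rates.length;
            double var = 0.0;
            for (double r : rates) var += (r - mean) * (r - mean);
            var /= rates.length;
            double scale = mean * (1.0 - mean) / var - 1.0;  // = alpha + beta
            return new double[] { mean * scale, (1.0 - mean) * scale };
        }
    }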
[223001970300] |I spent another hour, after everything was built and running, writing the online sample handler that'd compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. [223001970310] |Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn't want to store all of the samples (part of what's slowing BUGS down). [223001970320] |Identifiability [223001970330] |I didn't run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you'd get them here, in fact, if the initial sensitivities and specificities were worse than 0.5). [223001970340] |Open Question (for me) [223001970350] |My only remaining question is: why does this work? [223001970360] |I don't understand the theory of collapsed samplers. [223001970370] |First, I don't get nearly the range of possible samples for the priors given that they're estimated from discrete sets. [223001970380] |Second, I don't apply beta-binomial inference in the main sampling equation —that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. [223001970390] |Is this kosher? [223001970400] |Downloading from LingPipe Sandbox [223001970410] |You can find the code in the LingPipe Sandbox in the project hierAnno (the original R and BUGS code and the data are also in that project). [223001970420] |It's not documented at all yet, but the one Ant task should run out of the box; I'll probably figure out how to move the application into LingPipe. [223001970430] |[Update: The evaluation looks fine for named entities; I'm going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it's a bit worse than voting as is, and the estimates aren't useful because the specificities so dominate the calculations. [223001970440] |For Snow et al.'s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there's a bug, the collapsed sampler isn't stable, or I really do need a prior on those scales!] [223001970450] |[Update 2: The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). [223001970460] |Voting after throwing out the two really noisy annotators is probably competitive as is. [223001970470] |I still don't know why the RTE data's blowing out variance.] [223001980010] |Deleting Values in Welford's Algorithm for Online Mean and Variance [223001980020] |To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the variable is added back into the estimators. [223001980030] |What if our variable's normal? [223001980040] |Well, it turns out we can generalize Welford's algorithm, which I describe in a comment to: [223001980050] |
  • LingPipe Blog: Computing Sample Mean and Variance Online in One Pass
  • [223001980060] |Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes: [223001980070] |Works like a charm. [223001980080] |I’m sure someone’s done this already, but I didn’t find any references in a quick check. [223001980090] |The same technique can obviously be applied to covariance and correlation calculations. [223001980100] |This’ll be in the next release of LingPipe. [223001980110] |References [223001980120] |
  • Welford, B. P. 1962. [223001980130] |Note on a method for calculating corrected sums of squares and products. [223001980140] |Technometrics 4(3): 419-420.
  • Cook, John. [223001980160] |Accurately computing running variance. [223001980170] |Web page.
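Here's a minimal sketch of the updates, including the extra delete method; the handle()/unHandle() names match the description above, but treat the details as a sketch rather than the exact LingPipe source:

    // Welford-style online mean/variance with deletion, so a Gibbs sampler can
    // remove a value, resample it, and add the new value back in O(1).
    public class OnlineNormalEstimator {
        private long mN = 0L;
        private double mMean = 0.0;
        private double mSqErrSum = 0.0;  // sum of squared deviations from the mean

        public void handle(double x) {
            ++mN;
            double delta = x - mMean;
            mMean += delta / mN;
            mSqErrSum += delta * (x - mMean);
        }

        // Inverse of handle(): removes a value that was previously added.
        public void unHandle(double x) {
            if (mN == 1L) {
                mN = 0L; mMean = 0.0; mSqErrSum = 0.0;
                return;
            }
            double meanWithout = (mN * mMean - x) / (mN - 1L);
            mSqErrSum -= (x - mMean) * (x - meanWithout);
            mMean = meanWithout;
            --mN;
        }

        public double mean() { return mMean; }
        public double variance() { return mN > 0L ? mSqErrSum / mN : 0.0; }  // ML estimate
    }

Deleting a value and then handling it again restores the exact same state, which is just what the Gibbs resampling loop needs.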
[223001990010] |EM Goes KABOOM, or: How I Learned to Stop Worrying and Love the Bomb [223001990020] |Yup, just like I was warned: [223001990030] |KABOOM! [223001990040] |Soft K-means can blow up. [223001990050] |Put one cluster exactly on one data point and let its variance go to zero —you can obtain an arbitrarily large likelihood! [223001990060] |Maximum likelihood methods can break down by finding highly tuned models that fit part of the data perfectly. [223001990070] |This phenomenon is known as overfitting. [223001990080] |The reason we are not interested in these solutions with enormous likelihood is this: sure, these parameter-settings may have enormous posterior probability density, but the density is large over only a very small volume of parameter space. [223001990090] |So the probability mass associated with these likelihood spikes is usually tiny. [223001990100] |(MacKay 2003; Chapter 22, p. 308) [223001990110] |The problem for my hierarchical annotation model, which estimates annotator sensitivity and specificity and their beta priors, is that after a few samples, the annotator sensitivities are all fairly close, so the variance estimate is lowered. [223001990120] |In the next round, the low variance on the prior pulls the sensitivities closer to the prior mean, which in turn lowers the variance for the next round. [223001990130] |You can see where this is going. [223001990140] |(It's "poof" goes the variance [to 0], or "kaboom" go the scale parameters and log likelihood [to infinity].) [223001990150] |The solution, of course, is to (a) fix priors, or (b) put a prior on the beta prior scale. Option (b) is required for the full Bayesian model, and since BUGS did the heavy lifting, it's what I've used up until now. [223001990160] |It's been very stable. [223001990170] |Solution (b) is a pain for the beta distribution, because there's no conjugate prior, which hugely complicates maximum likelihood estimation (see Thomas Minka's Estimating a Dirichlet distribution —the beta is just a Dirichlet with k=2). [223001990180] |I could just add in some more data samples and let them act as prior counts, even if I can't characterize the resulting prior analytically. [223001990190] |That should probably work, too, even with my (admittedly low-road) moment-matching estimate. [223001990200] |For now, along with moment-matching for priors (which can diverge), I implemented solution (a), fixing the priors. [223001990210] |With fixed beta priors and EM estimates, I now have a straight-up implementation of the Dawid and Skene model (sorry, JSTOR's still the best reference online; if anyone finds an open version, let me know)! [223001990220] |The good news is that with fixed Beta(1,1) priors for sensitivity and specificity, the models converge lickety-split and get the same answers as the full Bayesian models (not so good if you're looking for evidence that Bayesian models are the way to go). [223001990230] |I could even compute beta estimates based on final sensitivity and specificity estimates if I wanted. [223001990240] |And the results are still better than voting, even if we allow voting to reject annotators who are obviously bad. [223001990250] |References [223001990260] |You can read MacKay's chapter 22 or the whole book online: [223001990270] |
  • David J. C. MacKay. 2003. [223001990280] |Information Theory, Inference, and Learning Algorithms. [223001990290] |Cambridge University Press.
  • Chapter 22, Maximum Likelihood and Clustering [pdf]
  • PDF index by chapter (off-by-one error in numbering vs. PDF) [223001990320] |I love Cambridge University Press! [223001990330] |They’re letting their authors distribute PDFs for free (recent examples: Marti Hearst’s book on search; Manning, Raghavan, and Schütze’s IR book). [223002000010] |Ratinov and Roth (2009) Design Challenges and Misconceptions in Named Entity Recognition [223002000020] |If you’re interested in named entity (mention) recognition, you should check out this paper: [223002000030] |
  • Ratinov, Lev and Dan Roth. 2009. [223002000040] |Design Challenges and Misconceptions in Named Entity Recognition. [223002000050] |In CoNLL. [223002000060] |
  • Try: Online Demo
  • Download: Code and Resources [223002000080] |There are two things to read this for: (1) the compendium of features to consider, and (2) the analysis of greedy left-to-right decoding. [223002000090] |You might also like their state-of-the-art results. [223002000100] |(They only cover person, location, organization and miscellaneous entity mentions in English, so it’d be nice to see what other domains, entity types, and languages might require.) [223002000110] |

    BIO vs. BMEWO

    [223002000120] |Presumably you already knew from careful study of LingPipe (or Andrew Borthwick’s thesis) that begin-in-out (BIO) coding is inferior to begin-mid-end-whole-out (BMEWO, which the authors call BILOU, for “begin-in-last-out-unit”) in some cases. [223002000130] |This is particularly important if you’re doing some kind of greedy decoding, because it gives you context info that couldn’t otherwise be resolved. [223002000140] |
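    As a quick illustration (my own example, with LingPipe-style tag names), here's the same location chunking under both encodings; BMEWO's whole/mid/end distinctions are exactly the extra context a greedy decoder gets to condition on:

        // Example only: parallel arrays for one sentence under BIO vs. BMEWO coding.
        String[] tokens = { "She", "moved", "from", "Brooklyn", "to", "New",   "York",  "City" };
        String[] bio    = { "O",   "O",     "O",    "B_LOC",    "O",  "B_LOC", "I_LOC", "I_LOC" };
        String[] bmewo  = { "O",   "O",     "O",    "W_LOC",    "O",  "B_LOC", "M_LOC", "E_LOC" };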

    Local Features

    [223002000150] |Although the features they considered were hardly exhaustive, they covered a fair range in a fair amount of detail. [223002000160] |If anyone knows a better compendium, or just wants to add features, please comment. [223002000170] |Their baseline system has the usual suspects: previous two predictions (making it a second-order model), word, word type/shape (no details given, but some systems use multiple type or shape features), prefixes and suffixes of the word (no arbitrary substrings and no indication of length they used), and windows of these features around the current word, as well as some interactions (specifically token n-grams and previous label). [223002000180] |The problem with this kind of evaluation is that there are so many parameters to play with there’s no way to evaluate all combinations of parameters. [223002000190] |Most unconventionally, they didn’t use any part-of-speech features (though see below on clustering for a hint as to why). [223002000200] |
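    Here's a rough sketch in Java of the kind of local features described above; the shape function and affix lengths are my guesses, not the paper's exact choices:

        import java.util.ArrayList;
        import java.util.List;

        // Baseline-style local features for the token at position i: word and shape
        // in a +/-2 window, plus prefixes and suffixes of the current word.
        public class LocalFeatures {
            public static List<String> extract(String[] words, int i) {
                List<String> feats = new ArrayList<String>();
                for (int k = -2; k <= 2; ++k) {
                    int j = i + k;
                    String w = (j < 0 || j >= words.length) ? "*BOUNDARY*" : words[j];
                    feats.add("W[" + k + "]=" + w);
                    feats.add("SHAPE[" + k + "]=" + w.replaceAll("[A-Z]", "X")
                                                    .replaceAll("[a-z]", "x")
                                                    .replaceAll("[0-9]", "9"));
                }
                String w = words[i];
                for (int len = 1; len <= 4 && len <= w.length(); ++len) {
                    feats.add("PRE=" + w.substring(0, len));
                    feats.add("SUF=" + w.substring(w.length() - len));
                }
                return feats;
            }
        }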

    Contextual Features

    [223002000210] |They considered the words around a feature in 200 and 1000 token windows (I presume that’s +/- 100 and +/- 500 tokens). [223002000220] |They ignored doc boundaries and stated consecutive docs were often related in their corpus, so your mileage may vary here. [223002000230] |I’d stick to doc boundaries, myself. [223002000240] |

    Context Aggregation

    [223002000250] |This feature I’d never seen before. [223002000260] |When they look at a word w[n] in context (w[n-2],w[n-1],w[n],w[n+1],w[n+2]), they create a vector by looking at all the instances within 200 tokens (+/- 200?). [223002000270] |Sort of like Ando and Zhang (see paper refs), only much simpler and more context sensitive. [223002000280] |

    Two-Stage Predictions

    [223002000290] |They took the output of a baseline NE detector on a corpus and fed its voted results back into the system, restricting themselves to a window around the instance in question. [223002000300] |Sort of like Krishnan and Manning (see paper refs), only context sensitive. [223002000310] |

    Prediction History

    [223002000320] |Working left-to-right through the corpus, they kept track of earlier predictions in the corpus, taking percentage of assignment as feature value (again, there’s all sorts of issues in scaling these features —logs might be good, and then there’s smoothing for low counts). [223002000330] |Again, they restrict this to context, this time the previous 1000 tokens. [223002000340] |
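    Here's roughly how I picture the prediction-history features (a sketch; the scaling and smoothing issues mentioned above are left open, and the representation of earlier decisions is a placeholder):

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Prediction-history features: the percentage of times this token string
        // received each label among the decisions made in the last 1000 tokens.
        public class PredictionHistory {
            public static Map<String,Double> extract(String token,
                                                     List<String[]> decisions) { // {token, label}
                Map<String,Integer> counts = new HashMap<String,Integer>();
                int total = 0;
                int start = Math.max(0, decisions.size() - 1000);
                for (int i = start; i < decisions.size(); ++i) {
                    String[] d = decisions.get(i);
                    if (!d[0].equals(token)) continue;
                    Integer c = counts.get(d[1]);
                    counts.put(d[1], c == null ? 1 : c + 1);
                    ++total;
                }
                Map<String,Double> feats = new HashMap<String,Double>();
                for (Map.Entry<String,Integer> e : counts.entrySet())
                    feats.put("HIST_" + e.getKey(), e.getValue() / (double) total);
                return feats;
            }
        }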

    Unlabeled Text

    [223002000350] |Again, reminiscent of Rie Ando and Tong Zhang's work, with an approach borrowed from Peter Brown's earlier work at IBM (again, see refs), they derived features from Percy Liang's hierarchical clustering of Reuters (it's nice to have a pile of in-domain data). [223002000360] |What's neat is that they used sequences of branching decisions (e.g. 0110 for left, right, right, left) in the dendrogram as features, taking length 4 prefixes (which they claim look like POS tags, hence the explanation for why no POS tagger), as well as length 6, 10, and 20 (no explanation why these). [223002000370] |
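    In code, the cluster features are just path-prefix lookups (a sketch; the prefix lengths are the ones reported in the paper, and the word-to-path table is whatever the hierarchical clustering produces):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;

        // Bit-path prefix features from a hierarchical word clustering: the table
        // maps each word to its branching path in the dendrogram, e.g. "01101001...".
        public class ClusterFeatures {
            private static final int[] PREFIX_LENGTHS = { 4, 6, 10, 20 };

            public static List<String> extract(String word, Map<String,String> wordToPath) {
                List<String> feats = new ArrayList<String>();
                String path = wordToPath.get(word);
                if (path == null) return feats;
                for (int len : PREFIX_LENGTHS)
                    feats.add("CLUSTER" + len + "="
                              + path.substring(0, Math.min(len, path.length())));
                return feats;
            }
        }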

    Dictionaries and Gazetteers

    [223002000380] |They threw in 30 dictionaries, derived from the web and Wikipedia, covering a range of classes. [223002000390] |They’re included in the download, and the paper has some lists of Wikipedia categories used to define types of entities. [223002000400] |
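    And the gazetteer features are just membership tests (a sketch; the dictionary names and representation here are placeholders):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.Set;

        // Gazetteer features: one boolean feature per dictionary containing the token.
        public class GazetteerFeatures {
            public static List<String> extract(String token,
                                               Map<String,Set<String>> dictionaries) {
                List<String> feats = new ArrayList<String>();
                for (Map.Entry<String,Set<String>> dict : dictionaries.entrySet())
                    if (dict.getValue().contains(token.toLowerCase()))
                        feats.add("DICT_" + dict.getKey());
                return feats;
            }
        }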

    Evaluation

    [223002000410] |I ran some text from the front-page of the NYTimes through their online demo, which is a softball test. [223002000420] |The main problem was messing up the character encodings (and whitespace). [223002000430] |Here’s the output: [223002000440] |[LOC ALBANY ] � [PER Pedro Espada Jr. ] returned to the [MISC Democrats ] and was named [ORG Senate ] majority leader on Thursday as part of a deal worked out by [ORG Senate ] [MISC Democratic ] leaders , ending a monthlong stalemate that has hobbled state government . [223002000450] |Skip to next paragraph Related City Room: [MISC Ravitch Signed Oath ] at [LOC Brooklyn ] Steakhouse [PER Paterson Picks M.T.A. Figure ] as His [ORG No. ] 2 ( July 9 , 2009 ) [MISC Mr. Espada�s ] return gave the [MISC Democrats ] 32 votes in the [ORG Senate ] , a clear two-vote margin that re-established their control of the chamber . [223002000460] |Under the deal , Senator [PER Malcolm A. Smith ] of [LOC Queens ] will be president for an undetermined period of time , and Senator [PER John L. Sampson ] of [LOC Brooklyn ] will be the leader of the [MISC Democratic ] caucus . [223002000470] |Details of the arrangement were explained by [PER Mr. Smith ] at a news conference. �At the end of the day , [MISC Democrats ] always come together , � he said . [223002000480] |The defection of [PER Mr. Espada ] , of the [LOC Bronx ] , plunged [LOC Albany ] into chaos on June 8 . [223002000490] |When it was his turn to speak at [ORG Thursday�s ] news conference , [PER Mr. Espada ] said , �This has obviously taken a toll on the institution.� He described the last several weeks as a time of �chaos and dysfunction� and added , �I profoundly apologize.� Talk of a deal started to bubble up Thursday afternoon . [223002000500] |As it became clear that a deal was at hand , [PER Steve Pigeon ] , a top aide to the [LOC Rochester ] billionaire [PER Tom Golisano ] , a supporter of the [MISC Republican ] takeover , left [MISC Mr. Espada�s ] office , saying little . [223002000510] |[PER Mr. Pigeon ] , who helped orchestrate the coup along with [PER Mr. Golisano ] , then huddled with a top [ORG Senate ] [MISC Republican ] , [PER George D. Maziarz ] , on a stairway near [MISC Mr. Espada�s ] office . [223002000520] |After the conversation , [PER Mr. Maziarz ] was speaking as if a deal was a fait accompli . [223002000530] |Asked if he was disappointed , he said he was not , and said he believed that the rules and reforms [MISC Republicans ] had pushed through last month would still stand . [223002000540] |Treating “Mr. Espada” as a miscellaneous entry is almost certainly a mistake they make because they messed up the char encodings and hence the tokenization missed the possesive apostrophe. [223002000550] |They also mangled the whitespaces in the output. [223002000560] |These detail issues can make or break a real system in practice, but they’re not all that hard to clean up. [223002000570] |More topical text from justjared.buzznet.com is harder: [223002000580] |Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids � [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) � from the new issue of [LOC OK ] ! . [223002000590] |In this picture , Prince , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . 
[223002000600] |Also pictured is [PER Michael ] at [LOC Neverland ] , celebrating [MISC Prince�s ] sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . [223002000610] |Cute ! [223002000620] |Disclaimer: The LingPipe online tagger demo does much worse on this data. [223002000630] |NE mention detection is very hard once you stray even a little bit outside of newswire. [223002000640] |

    Official Evaluation

    [223002000650] |They claim to have the highest F measure for CoNLL ’03 (F=0.908; Ando and Zhang had F=0.892). [223002000660] |That’s F measure under an evaluation that ignores punctuation, and we all know these systems are overtrained at this point; they also had the advantage of having clustering data over the same data set. [223002000670] |Our readers probably know by now how much I distrust post-hoc results like these. [223002000680] |They did label some of their own web page data, for which they acknowledge the borderline status of some of their labeling decisions even after adjudication. This directly supports the point I’ve been trying to make about labeling difficulty and noisy corpora; the MUC data’s also noisy, and I’m guessing so is the CoNLL data. [223002000690] |

    (Missing) Conditional Model Details

    [223002000700] |They report that there was very little search error in doing greedy decoding, and even less in doing a ten-item beam search, when evaluated on a non-contextualized baseline system. [223002000710] |I’m not clear on exactly how they used regularized average perceptrons; they never spell out the model in detail. [223002000720] |For instance, they don’t discuss how they do regularization or feature selection (e.g. is there a min count?)! [223002000730] |More confusingly, they don’t discuss how they extend perceptrons to structured predictions for sequence analysis, though from the rest of the paper, it looks like they’re deciding on a per-token basis (though if that’s the case, why do they say they need to convert to probabilities to run Viterbi?). [223002000740] |I love these feature survey papers. [223002000750] |Unfortunately, drawing general conclusions from this kind of evaluation is nearly impossible, even when they’re done as carefully as this one. [223002000760] |We don’t know whether a CRF would have worked better. [223002000770] |Given where they go with the system, perhaps they should’ve used MEMMs and a per-tag error function. [223002010010] |Enriched Inputs for Long-Distance CRF Tagging [223002010020] |I’ve been wrestling with how to extract features for CRFs. [223002010030] |Looking at papers like (Ratinov and Roth 2009), and also at what others had done before, it’s pretty clear that longer-distance contextual features are going to be helpful. [223002010040] |These can be at the global (corpus) or more local (sentence/paragraph) level. [223002010050] |At the corpus level, we have things like majority taggings by baseline taggers. [223002010060] |At the sentence level, we have things like part-of-speech tags. [223002010070] |I’ve been struggling with how to preserve a tagging interface like we use for HMMs: [223002010080] |String[] tag(String[] inputs); [223002010090] |And while the Node/Edge feature factoring is theoretically sound, it’s wrong computationally if the input is an array of strings. [223002010100] |We often want to bring in richer contextual or long-distance information that we don’t want to recompute on a per-node or per-edge basis. [223002010110] |At the same time, I’m doing my homework and looking at packages like CRFSuite, which follow the CoNLL input format. [223002010120] |With the CoNLL input format, rather than only token strings, you have multiple features per input item. [223002010130] |Here’s an example from the CRFSuite tutorial (linked above): [223002010140] |The first column is a token, the second column a part-of-speech tag, and the third column a BIO encoding of phrasal chunks. [223002010150] |This can go on with more and more information from dictionaries, etc., as additional columns. [223002010160] |We might even want to bring in Ando and Zhang-like spectral features. [223002010170] |CRFSuite then gives you a language for extracting features from the columns. [223002010180] |For instance, the following specifies a node feature consisting of a window of +/- 2 tokens around the current token: [223002010190] |whereas this one supplies part-of-speech interaction (bigram) features: [223002010200] |While I’m not sure I want to develop a language and parser, I am going to generalize the input from String[] to List<E>, where, for instance, E could be structured as in the CoNLL example as a sequence of strings (one per column, perhaps with named accessors, like getPartOfSpeech()); a minimal sketch follows. 
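Here's a minimal sketch of the kind of generalization I mean (the names ChainTagger and TokenDatum are hypothetical, not LingPipe classes):

    import java.util.List;

    // Hypothetical generalization of the String[]-based tagger interface; not LingPipe API.
    interface ChainTagger<E> {
        String[] tag(List<E> inputs);
    }

    // One possible element type E, mirroring CoNLL-style columns.
    class TokenDatum {
        private final String mToken;
        private final String mPartOfSpeech;
        private final boolean mInGazetteer;
        TokenDatum(String token, String partOfSpeech, boolean inGazetteer) {
            mToken = token;
            mPartOfSpeech = partOfSpeech;
            mInGazetteer = inGazetteer;
        }
        String token() { return mToken; }
        String getPartOfSpeech() { return mPartOfSpeech; }
        boolean inGazetteer() { return mInGazetteer; }
    }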
[223002010210] |This reduces to the earlier token-tagging model when E is String. [223002010220] |The benefit is that I can stick to the simple Node and Edge feature extractor model, only the nodes and edges will have all this extra information available; the downside is that clients will have to compute it for inputs. [223002010230] |This could be wrapped in a bigger chunker, but hey, no one said CRF feature extraction was easy. [223002010240] |This will give the client a way to pass in extra information. [223002010250] |For instance, one column could be reserved for a boolean flag indicating whether the token was tagged as part of an entity earlier in the sequence of sentences being processed. [223002010260] |Several other columns could indicate gazetteer matches. [223002010270] |It puts more work on the client, but I don’t see how to specify a reasonable feature extractor otherwise that doesn’t do a ton of redundant work. [223002010280] |(I thought about adding a metadata object along with the sequence; as is, the items in the sequence can still link back to some kind of metadata.) [223002010290] |One alternative would be to extract a whole feature lattice at once, although that precludes later implementing aggressive beam search, which mainly saves time in feature extraction. [223002010300] |I’m also going to get rid of begin-of-sentence features; those can be encoded in the node features. [223002020010] |Nadeau and Sekine (2007) A Survey of Named Entity Recognition and Classification [223002020020] |Following a tip from the last NE survey, today we look at 15 years of named entity research in: [223002020030] |
  • Nadeau, David and Satoshi Sekine (2007) A survey of named entity recognition and classification. [223002020040] |Linguisticae Investigationes 30(1):3–26. [223002020050] |This paper is extracted from the literature survey for Nadeau’s thesis on semi-supervised NE. [223002020060] |Nice work! [223002020070] |It surveys work in many languages and language families, many genres of text or spoken transcription, and many entity types ranging from highly general to very specific. [223002020080] |It surveys the shift from hand-written patterns (not dead yet, especially in conjunction with learned feature weights) to supervised learning to semi-supervised learning, with a detour through approximate dictionary-matching methods such as soundex, edit distance and Jaro-Winkler. [223002020090] |They call bootstrapping from sources like Wikipedia unsupervised because the exact task isn’t labeled, but they feel semi-supervised to me. [223002020100] |Mainly, I wanted to read this for the overview of features, which they cover far better than any other survey I’ve seen, breaking features down into (1) word-level features, such as case, punctuation, word shape, part-of-speech, morphology, and word function, (2) list lookup features such as common words, known entities, and entity cues (e.g. “Mr.” or “Corp.”), and (3) document and corpus features, such as multiple occurrence voting, local syntactic context, document meta-information and corpus frequency information. [223002020110] |They also cover eval, but I think this has been rather weak in the field, so I wouldn’t get too carried away with MUC or ACE-type evals. [223002020120] |In my mind, this survey needs a discussion of the following techniques to be brought up to date: [223002020130] |
  • Dan Roth’s joint relation/entity model and other joint inference systems such as Finkel and Manning’s parsing/entity model (the survey does talk about Ji and Grishman’s coref/entity model) [223002020140] |
  • Finkel et al.’s long-distance model with Gibbs sampling for inference [223002020150] |
  • Ando and Zhang’s feature induction method and similar clustering-type unsupervised methods [223002020160] |
  • Committee-based methods, which I’m not even sure have been used in this domain other than for mixing forward and backward Markovian models [223002020170] |
  • Domain adaptation methods, such as Finkel’s hierarchical models or Daumé’s frustratingly easy adaptation [223002020180] |
  • Nested named entities, as found in the Genia corpus [223002020190] |
  • Parsing-based approaches such as Finkel and Manning’s approach to nested entities [223002020200] |Another thing that’d be good to include in a broader survey would be the encoding of chunking problems as tagging problems, which the Ratinov and Roth paper touched on, but didn’t explore fully. [223002020210] |What I liked about this paper is that it avoided categorical conclusions about what works and what doesn’t — I think the jury’s still out on that one. [223002030010] |Finkel and Manning (2009) Hierarchical Bayesian Domain Adaptation [223002030020] |If, like me, you learned Bayesian regression from Gelman and Hill’s book, Data Analysis Using Regression and Multilevel/Hierarchical Models, you’re going to love: [223002030030] |
  • Finkel, Jenny Rose and Christopher D. Manning. 2009. [223002030040] |Hierarchical Bayesian domain adaptation. [223002030050] |In EMNLP.
  • [223002030060] |

    In a Nutshell

    [223002030070] |Finkel and Manning apply the standard hierarchical model for the parameters of a generalized linear model to structured natural language problems. [223002030080] |Although they call it “domain adaptation”, it’s really just a standard hierarchical model. [223002030090] |In particular, they focus on CRFs (a kind of structured logistic regression), applying them to named entity recognition and dependency parsing. [223002030100] |Their empirical results showed a modest gain from partial pooling over complete pooling or no pooling. [223002030110] |

    The Standard Model

    [223002030120] |In the standard non-hierarchical Bayesian regression model (including generalized linear models like logistic regression), each regression coefficient has a prior, usually in the form of a Gaussian (normal, L2, ridge) or Laplace (L1, lasso) distribution. [223002030130] |The maximum likelihood solution corresponds to an improper prior with infinite variance. [223002030140] |In most natural language applications, the prior is assumed to have zero mean and a diagonal covariance matrix with shared variance, so that each coefficient has an independent normal prior with zero mean and the same variance. [223002030150] |LingPipe allows arbitrary means and diagonal covariance matrices, as does Genkin, Madigan and Lewis’s package Bayesian Logistic Regression. [223002030160] |With data of the same type (e.g. person, location and organization named-entity data in English) from multiple sources (e.g. CoNLL, MUC6, MUC7), there are two standard approaches. [223002030170] |First, train individual models, one for each source, then apply the models in-source. [223002030180] |That’s the unpooled result, where the data in each domain isn’t used to help train other domains. [223002030190] |The second approach is to completely pool the data into one big training set and then apply the resulting model to each domain. [223002030200] |
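In symbols (my notation, with a single scale shared across features), the two standard choices are:

    β[f] ~ Normal(0, σ²)      # Gaussian / L2 / ridge
    β[f] ~ Laplace(0, b)      # Laplace / L1 / lasso

independently for each feature f, with σ² or b fixed as a hyperparameter rather than fit.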

    The Hierarchical Model

    [223002030210] |The hierarchical model generalizes these two approaches, and allows a middle ground where there is partial pooling across domains. [223002030220] |A hierarchical model fits a coefficient for each domain, with a per-feature prior that is shared across domains. [223002030230] |That is, we might have a feature f and coefficients β[1,f], β[2,f] and β[3,f] for three different domains. [223002030240] |We assume the β[i,f] are drawn from a shared prior (normal in Finkel and Manning’s case), say Normal(μ[f], σ[f]²). [223002030250] |The prior mean μ[f] and variance σ[f]² for feature f’s coefficients can now be fit in the model along with the per-domain coefficients. [223002030260] |In BUGS-like sampling notation, this is easy to write down for a simple classification problem; a sketch appears at the end of this section. [223002030270] |Finkel and Manning fix the prior variances σ[f] to a single shared value σ as a hyperparameter, which makes sense because they don't really have enough data points to fit them in a meaningful way. [223002030280] |This is too bad, because the variance is often the most useful part of setting hierarchical priors (see, e.g., Genkin et al.'s papers). [223002030290] |I suspect the prior variance is going to be a very sensitive hyperparameter in this model. [223002030300] |Of course, μ[f] itself requires a prior, which Finkel and Manning fix as a hyperparameter to have mean ν=0 and variance τ². I suspect τ will also be a pretty sensitive parameter in this model. [223002030310] |The completely pooled result arises as σ² approaches zero, so that the β[d,f] are all equal to μ[f]. [223002030320] |The completely unpooled result in this case arises when ν=0 and τ² approaches zero, so that μ[f]=0 for all f, so that β[d,f] ~ Normal(0, σ[f]²). [223002030330] |
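Concretely, the BUGS-like sketch I have in mind looks something like this (my own rendering of the setup just described, not the paper's notation; d ranges over domains, f over features, i over items, with a simple binary outcome y):

    μ[f] ~ Normal(ν, τ²)            # shared prior mean for feature f; ν = 0 fixed
    β[d,f] ~ Normal(μ[f], σ²)       # domain d's coefficient for feature f; σ² fixed and shared
    y[d,i] ~ Bernoulli(inverse-logit(Σ_f β[d,f] * x[d,i,f]))    # simple binary classification case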

    Relation to Daumé (2007)

    [223002030340] |Turns out Hal did the same thing in his 2007 ACL paper, Frustratingly easy domain adaptation, only with a different presentation and slightly different parameter tying. [223002030350] |I read Hal's paper and didn't see the connection; Finkel and Manning thank David Vickrey for pointing out the relation. [223002030360] |

    More Levels

    [223002030370] |Of course, as Finkel and Manning mention, it's easy to add more hierarchical structure. [223002030380] |For instance, with several genres, such as newspapers, television, e-mail, blogs, etc., there could be an additional level where the genre priors were drawn from fixed high level priors (thus pooling across genres), and then within-genre coefficients can be drawn from those. [223002030390] |

    More Predictors (Random Effects)

    [223002030400] |There's no reason these models need to remain hierarchical. [223002030410] |They can be extended to general multilevel models by adding predictors at each level and allowing groupings that aren't strictly nested. [223002030420] |One example is that we could pool by year and by genre. [223002030430] |Check out Gelman and Hill's book for lots of examples. [223002030440] |

    Estimation

    [223002030450] |The really cool part is how easy the estimation is. Apparently the loss function's still convex, so all our favorite optimizers work just fine. [223002030460] |The hierarchical priors are easy to marginalize, and thus easy to compute the gradient for. [223002030470] |This is because normals are in the exponential family and the domains are assumed to be exchangeable, so the big product of exponentials turns into a simple sum on the log scale, which is where the gradient is computed (see the paper for the formula). [223002040010] |Russ Altman on “Poisoning Your Prior” [223002040020] |[Update: 22 July 2009: Please see Russ's comments and my responses for clarification of my confusion about what he meant by "validation".] [223002040030] |Russ Altman was recently quoted as saying: [223002040040] |“You should get your prior, put it in an envelope, and get it time-stamped by the post office,” says Russ Altman. [223002040050] |“Otherwise you are going to have a rough idea of what genes are involved, and that is going to poison your prior.” [223002040060] |in the article: [223002040070] |
  • Shekhar, Chandra. 2009. [223002040080] |From SNPs to Prescriptions: Can Genes Predict Drug Response? [223002040090] |Biomedical Computation Review 5(3):10–19.
  • [223002040100] |Altman’s NIH Center Simbios publishes the glossy magazine Biomedical Computation Review, so I doubt it’s a misquote. [223002040110] |To give you some notion of context, Altman’s group is using prior information for gene-drug, drug-target, and gene-gene interactions derived from public databases as the basis for analyzing a genome-wide association study. [223002040120] |He’s paraphrased in the article as saying the key is to “avoid infusing any bias into the analysis of the association data”, for which he offers the advice quoted above. [223002040130] |Noooooo! [223002040140] |It’s especially distressing to hear this kind of FUD coming from someone using informative priors. [223002040150] |What Altman’s afraid of is that strong enough priors will “bias” our estimates. [223002040160] |In the limit, set the prior mean to the desired answer and prior variance to zero, and your posterior is the same as your prior. [223002040170] |Adjust the prior variance to get whatever posterior you want. [223002040180] |The mistake he’s making is that we’re not playing the hypothesis testing game here, as we might do in a clinical drug trial. [223002040190] |Instead, we’re doing general statistical modeling. [223002040200] |Our performance is measured on held-out data, not on fit to distribution as in classical statistics. [223002040210] |The empirical bottom line is: tuning your prior can help with prediction on held-out data. [223002040220] |Every application of a probabilistic model depends on subjective assumptions about model structure and its fit to some real-world process. [223002040230] |For a Bayesian, the prior’s no different in kind than other modeling assumptions, such as linearity of regression predictors or exchangeability of samples (i.e. repeatable trials, on which all of classical statistics is based). [223002040240] |If you’re going to use informative priors, as Altman et al. did, then there’s no reason, either empirical or conceptual, not to fit them along with every other aspect of the model. [223002040250] |Yes, priors are subject to overfitting. [223002040260] |So are ordinary regression coefficients. [223002040270] |The fact that we want to use our posterior predictions to guide our research helps insure against overfitting priors. [223002040280] |This would be different if we wanted to use our posterior predictions to get our drug past the FDA. [223002040290] |Gelman et al.’s Bayesian Data Analysis has extensive advice on testing Bayesian model (over)fit, with lots of biology-derived examples (dosage response, survival rates, etc.). [223002040300] |How about we rename “prior” to “hierarchical parameter”, or better yet, “multilevel parameter”, to account for non-nested “priors”, and just get on with modeling? [223002050010] |Label-Specific Features versus Priors for Tying CRF Features [223002050020] |In the first-order chain CRF case, the issue is whether the feature extractor has the signature φ(inputs,position,previousTag) or ψ(inputs,position,previousTag,currentTag). [223002050030] |With the former, the current tag is implicitly included, so each feature output by φ is multiplied by the number of output tags. [223002050040] |As noted by Sutton and McCallum, this can save both time and space in feature generation, which dominates the run-time performance of linear classifiers like CRFs. [223002050050] |I’ve asked many people over the past few months about how they use label-specific edge and node features for CRFs. 
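To make the two signatures concrete, here's a hedged Java sketch of the alternatives (interface names and the map-based feature representation are mine, not LingPipe's):

    import java.util.List;
    import java.util.Map;

    // phi: current tag implicit -- every feature is crossed with every output tag.
    interface NodeFeatureExtractor<E> {
        Map<String,Double> features(List<E> inputs, int position, String previousTag);
    }

    // psi: current tag explicit -- lets related tags (e.g. B-PER and I-PER) share features.
    interface LabelSpecificNodeFeatureExtractor<E> {
        Map<String,Double> features(List<E> inputs, int position,
                                    String previousTag, String currentTag);
    }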
[223002050060] |Jenny Finkel had the clearest motivation, which is that some tags are related and may benefit from tied features. [223002050070] |For instance, if I’m doing BIO encoding for chunkers, I might want ψ(in,pos,prev,B-PER) and ψ(in,pos,prev,I-PER) to generate some shared features which separate tokens in people names from other tokens, such as “input word at position is in a list of person name tokens”. [223002050080] |In the implicit current tag approach using φ(in,pos,prev), any feature has a distinct instance for each output category, so we’d have one feature for output label B-PER (first token in a person name) and another for I-PER (non-initial token in person name). [223002050090] |One advantage of tying is that it reduces the overall number of parameters, which is always good if it’s motivated, which here means the parameters really should have the same value. [223002050100] |It can also help if they shouldn’t really have the same value, but we don’t have enough data to estimate them properly separately. [223002050110] |If I implement the factored approach with φ(in,pos,prev), I won’t be able to tie parameters at all. [223002050120] |I’d love to hear other opinions on the matter from people with experience training and tuning CRFs. [223002050130] |Having also just read Jenny’s hierarchical modeling paper, it occurred to me that we could impose a hierarchical model on the coefficients. [223002050140] |Tying simply reduces to a prior with zero variance and unknown mean (the prior mean’s the tied value). [223002050150] |That’d be the “right” way to do this, of course, as it’d allow any degree of pooling between completely pooled (tying) and unpooled (completely separating). [223002050160] |P.S. Jenny also helped me understand what Sutton and McCallum meant by “unsupported features”, which are features that do not show up in any positive instance in the training data; Sutton and McCallum just said “those which never occur in the training data”, and I didn’t think anyone included features that didn’t occur at all in the training data. [223002060010] |Does Reviewing (or Other Service) Help Your Career? [223002060020] |There’s been a back-and-forth on the Computational Linguistics Editorial Board mailing list about how to get people to complete reviews in a timely fashion. [223002060030] |I raised the issue that we (reviewers) are all volunteers, so it’s hard to motivate people to do the work. [223002060040] |Another ed. board member pointed out that it’s good for your career. [223002060050] |Is it? [223002060060] |As Mark Steedman once pointed out to me, the reward for doing a good job on administration is more administrative duties. [223002060070] |Hardly the carrot we were looking for to motivate our reviewers. [223002060080] |In my experience, during tenure and promotion at research universities in the U.S., admin per se is universally ignored (or at most given pro forma treatment as part of some onerous template for promotion packages). [223002060090] |On the other hand, general “standing” in the field can be helped by this kind of thing and will come through in reference letters indirectly. [223002060100] |But there’s an asymmetry of information concern here. [223002060110] |Only the editor of the journal (Robert Dale at the present time) knows who completes reviews in a timely manner. [223002060120] |If I’m an area chair for a conference or on a program committee, I might also see who returns things in a timely manner, and who does a good job. 
[223002060130] |My feeling is that there’s not much penalty for never agreeing to be on an ed board, and never agreeing to review a paper. [223002070010] |Arabic Named Entity Recognition with the ANER Corpus [223002070020] |One thing I forgot to mention on the release notes for LingPipe 3.8.2 is that I’ve added a section on building an Arabic named entity recognizer to the [223002070030] |
  • LingPipe Named Entity Tutorial [223002070040] |

    Benajiba’s ANER Corpus

    [223002070050] |It’s based on Yassine Benajiba‘s freely distributed (thanks!) corpus: [223002070060] |
  • ANER Corpus (Arabic Named Entity Recognition) [223002070070] |It’s 150K tokens in CoNLL format (easy to parse, but lossy for whitespace) using person, location, organization and miscellaneous types (like CoNLL’s English corpus). [223002070080] |Here’s a sample (rendered with style value direction:rtl to get the ordering right; the actual file’s in the usual character order): [223002070090] |فرانكفورت B-LOC (د O ب O أ) O أعلن O اتحاد B-ORG صناعة I-ORG السيارات I-ORG في O ألمانيا B-LOC ... [223002070100] |Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo. [223002070110] |

    LingPipe Model and Evaluation

    [223002070120] |I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries). [223002070130] |There’s absolutely nothing Arabic-specific about the model. [223002070140] |Overall performance is in line with Benajiba’s, while differing substantially on the four entity types. [223002070150] |Here are the LingPipe 6-fold (125K token train, 25K token test) cross-validated results: [223002070160] |Dictionaries improved recall a bit, but hurt precision even more. [223002070170] |Bigger dictionaries and more training data would certainly help here. [223002070180] |

    References

    [223002070190] |For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see: [223002070200] |
  • Benajiba, Y. and P. Rosso. 2007. [223002070210] |ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. [223002070220] |In IICAI-2007.
  • [223002070230] |
  • Benajiba, Y., P. Rosso, and J. M. Benedí. 2007. [223002070240] |ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy. [223002070250] |In CICLing-2007.
  • [223002070260] |And of course, there’s more information in LingPipe’s Named Entity Tutorial. [223002080010] |LingPipe 3.8.2 Released [223002080020] |LingPipe 3.8.2 is now available from: [223002080030] |
  • LingPipe Home Page [223002080040] |This is a patch release, but includes lots of little utility methods and even some utility classes. [223002080050] |The home page above lists the changes. [223002090010] |Has Open Source Failed Commercially? [223002090020] |I just read in Business Week‘s Technology at Work Blog, a post by Peter Yared, The Failure of Commercial Open Source. [223002090030] |Yared argues that companies aren’t making money from open source, because: [223002090040] |
  • Open source only works for commodities (e.g. operating systems) [223002090050] |
  • Open source is as expensive as proprietary software to develop [223002090060] |
  • Selling software is miserable [223002090070] |
  • Customers are switching to software as a service (SaaS) [223002090080] |He argues that open source is only successful when it’s free and supported by a broad community of developers, and argues that only a handful of commercial software companies based on open source have had “liquidity events” (did I mention this was in Business Week?). [223002090090] |I replied (perhaps into the ether given their moderation): [223002090100] |We are operating under a MySQL-like model, but with even more restrictions on our source code. [223002090110] |We have no marketing or sales department (but then we’re only 2.5 people full time with an additional 0-2 contractors at any given time). [223002090120] |A paycheck and a job I love with lots of flexibility is the only “liquidity event” I’m looking for. [223002090130] |It helps having the source out there, even if it’s not covered by an official open source license, because then developers can kick the tires, go beyond the doc, and help fix bugs. [223002090140] |And users can try it and inspect it before they buy. [223002090150] |We don’t need help patching the bugs, but we do need help finding them. [223002090160] |As much as we unit test, bugs wind up in production code, and users find them. [223002090170] |When they report source code locations and suggest patches, it really does save us time. [223002090180] |We’ve dealt with several Fortune 500 companies, but only the U.S. government has put us through the “approved vendor” mill, and only the U.S. defense department was ever interested in security audits (and then mostly to make sure you weren’t sending packets out over the net). [223002090190] |A big issue for sales is that companies want indemnification against lawsuits, mainly for patent infringement. [223002090200] |Are SaaS businesses really taking off? [223002090210] |All the big companies we talk to are too paranoid to let their data offsite. [223002100010] |Zou and Hastie (2005) Regularization and Variable Selection via the Elastic Net [Prior] [223002100020] |I don’t know how the elastic net prior (aka regularizer) snuck by me unnoticed. [223002100030] |It was introduced in: [223002100040] |
  • Zou, Hui and Trevor Hastie. 2005. [223002100050] |Regularization and Variable Selection via the Elastic Net. [223002100060] |Journal of the Royal Statistical Society B. 67(Part 2):301–320. [223002100070] |I was reading about it in the technical report appearing as part of the documentation to their glmnet R package: [223002100080] |
  • Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2009. [223002100090] |Regularized Paths for Generalized Linear Models via Coordinate Descent. [223002100100] |CRAN documentation for glmnet. [223002100110] |The idea’s basically an “interpolation” between the Laplace (L1, lasso) and Gaussian (L2, ridge) priors: [223002100120] |where β is the vector of coefficients, λ a factor for prior variance, and α the blending ratio between L2 and L1 priors, themselves represented by the squared L2 norm and (unsquared) L1 norm terms. [223002100130] |Note that the interpolation happens inside the exponential form, so it’s not a mixture of a Gaussian and Laplace prior, but rather a mixture of their bases. [223002100140] |Thus on the log scale used in gradient methods, the prior takes on a pleasantly differentiable form: [223002100150] |where Z is the usual regularization term, here dependent on the variance and mixture parameters. [223002100160] |The later paper divides the L2 component by 2, which gives you a slightly different blend. [223002100170] |In general, I suppose you could give the two distributions being blended arbitrary variances λ1 and λ2. [223002100180] |The authors are particularly interested in two situations that are problematic for L1 or L2 alone: [223002100190] |
  • the number of dimensions is larger than the number of training items [223002100200] |
  • the features/coefficients are highly correlated [223002100210] |Note that bag-of-words and other lexically-based feature representations in natural language classification and tagging are of exactly this nature. [223002100220] |In the former case, L1 breaks down with too few features selected. [223002100230] |In the latter case, L2 tends to select all the features and L1 only one of the features. [223002100240] |With the elastic net, with most of the weight on L1, you get the benefits of L1’s sparseness and fat tails, with better behavior in the face of correlated variables and large numbers of dimensions. [223002100250] |There’s a beautiful plot in the later paper with prior variance decreasing on the X-axis versus the fitted coefficient values on the Y-axis, with a line for each dimension/feature (and there are lots of non-zero coefficients overlaid on the Gaussian, which never pushes any feature to 0 through regularization): [223002100260] |The Zou and Hastie paper also discusses “bridge regression” (bad pun), which takes an arbitrary Ln penalty for n in the interval [1,2], and thus provides a slightly different way of blending L1 and L2 priors. [223002100270] |The limits at 1 and 2 are the L1 and L2 priors. [223002100280] |Although the gradient is still easy to calculate, the resulting solutions for n > 1 are not sparse for the same reason L2 isn’t sparse — the shrinkage is proportional to a fraction of the current value. [223002100290] |Given the general exponential form, the gradient of the elastic net’s log probability is easy to compute. [223002100300] |It just reduces to interpolating between the L1 and L2 gradients. [223002100310] |I’ll probably just go ahead and add elastic net priors to LingPipe as implementations of stats.RegressionPrior. [223002100320] |That’ll let them plug-and-play in multinomial logistic regression and CRFs. [223002110010] |CiteXplore: MEDLINE with Citation Details and Entity Highlighting [223002110020] |[Update, 10 August 2009: check out the comment for some responses to feedback from EBI's development team.] [223002110030] |I found this site through the syllabus for EBI’s upcoming text mining course (October 2009): [223002110040] |
  • EBI: CiteXPlore [223002110050] |It supports searches over MEDLINE with gene name, protein name, MeSH or protein-protein interaction highlighting. [223002110060] |They’re sensibly exploiting external resources, such as iHOP (“info hyperlinked over proteins”), Entrez-PubMed and Entrez-Gene. [223002110070] |I greatly prefer the link-out approach to trying to encapsulate everything on their own page. [223002110080] |What’s really neat is that CiteXplore mined the full-text articles for citation details, which they link back to MEDLINE. [223002110090] |It looks very much like the ACM’s proprietary “Portal”, only over MEDLINE. [223002110100] |One limitation is that CiteXplore only links back into MEDLINE (thus leaving some refs as “not found”), whereas ACM pulls out arbitrary references. [223002110110] |This is something we couldn’t do, because we don’t have subscriptions to all the full text doc feeds (I have no idea how complete CiteXplore is in this regard, but they do have proprietary articles linked for citations.) [223002110120] |CiteXplore supports named-entity tagging for genes and/or proteins, though it looks like gene search also includes proteins. [223002110130] |For instance, pull up their page and do a search for [serpina3], which is a gene that has the same name in six different mammals and birds. [223002110140] |(Sorry I can’t link to the search — like Entrez itself, they used a web engine that doesn’t allow query bookmarking; they also don’t support “open in new window” or “open in new tab”, which is really frustrating.) [223002110150] |Then pull down the “gene name highlighting” menu item and click on the “highlight” button. [223002110160] |The gene names it finds are rendered in green. [223002110170] |CiteXplore finds some of the SERPINA3 mentions, but misses others. [223002110180] |For instance, in the context “Alpha 1-antichymotrypsin/SerpinA3” it seems to have tokenization problems, because both are names of the same gene, but neither is highlighted. [223002110190] |The mention of the bos taurus (cow) gene SERPINA3 (PMID 18384666) isn’t highlighted, but mus musculus (mouse) is (PMID 15638460). [223002110200] |Unfortunately, the mouse mention is linked to the human SerpinA3 in iHOP. [223002110210] |(Argh! iHOP’s all Javascripted up to intercept search on page and their search doesn’t jump to the matches. [223002110220] |Why, oh why, do people try to innovate on things that already work?) [223002110230] |CiteXplore misses mentions with too many parens — they match “alpha-1-antichymotrypsin”, but miss “alpha(1)-antichymotrypsin”. [223002110240] |And they have trouble with mixed case, matching “SERPINA3” and “serpina3”, but missing “SerpinA3”. [223002110250] |Other misses seem inexplicable, as in “SERPINA3 (aka alpha-1-antichymotrypsin)”. [223002110260] |They also miss the alias “ACT” for SERPINA3 (it overlaps with a common word, so is probably just hard filtered out). [223002110270] |That one’s tricky, as it requires context, and is very hard to recognize with high precision. [223002110280] |There are also other tokenization problems in that they seem to have inserted extra whitespace after open parens. [223002110290] |I really wish people would take more care with text data. [223002110300] |This is why you need tokenizers that give you info on whitespace (either directly or through start/end offsets for tokens). [223002110310] |I also tried CiteXplore’s “Proteins / Gene ontology” highlighting, but that adds precision problems to their recall problems. 
[223002110320] |For instance, it matches the “Alpha 1” in “Alpha1-antichymotrypsin” to the Uniprot entry for mating type protein alpha 1. [223002110330] |They also mess up the whitespace in this application, inserting extra spaces after hyphens. [223002110340] |There are also bigger recall problems. [223002110350] |The search is by exact string match, so overall, they only find 38 citations for the search [serpina3]. [223002110360] |Entrez-PubMed itself only finds 35 citations for the same query. [223002110370] |CiteXplore’s advanced search allows expansion by synonym, raising the number of hits for query [serpina3] to 133. [223002110380] |Judging from their display, they only add “aact” as a synonym for “serpina3”. iHOP, which now has greatly improved recall over its original version, finds what looks to be many more articles about SERPINA3 searching by Entrez-Gene ID (12 for human), but they don’t report hit numbers, and I’m not going to count them. [223002110390] |They also deal with the precision problems involved in expanding with synonyms (and also skip the synonym “ACT”). [223002110400] |CiteXplore also has a frustratingly short fuse on session timeouts; I had to keep redoing all these searches as I wrote this post. [223002110410] |CiteXplore seems to be based on EBI’s text matching system Whatizit, which supports tagging arbitrary text, not just MEDLINE citations found by search. [223002110420] |There’s also a freely downloadable paper explaining the system: [223002110430] |
  • Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. 2008. [223002110440] |Text processing through Web services: calling Whatizit. [223002110450] |Bioinformatics 24(2):296–298. [223002110460] |An application like this could be built with LingPipe’s exact dictionary matcher coupled with a normalizing tokenizer. [223002120010] |Serializing Immutables and Singletons with a Serialization Proxy [223002120020] |[Update: 3 January 2011. [223002120030] |Blog commenter Konrad correctly points out a serious error in the original version of this post. [223002120040] |It turns out you can serialize immutable objects because there's no requirement that there be a public no-argument constructor. [223002120050] |Konrad's example code in the comment provides an example.] [223002120060] |A question recently came up on our mailing list/newsgroup that asked how to implement java.io.Serializable for classes containing non-serializable singletons. [223002120070] |I decided to kick up the abstraction level of the answer and discuss LingPipe’s serialization proxy approach to serializing everything, including singletons and immutables. [223002120080] |Josh Bloch discusses the serialization proxy pattern in the very last section of the second edition of Effective Java. [223002120090] |Java lets you take complete control over serialization, and I strongly advise you to use this control for forward compatibility, working around non-serializable (but constructable) member objects like singletons, and minimizing the size of serialized objects. [223002120100] |

    A Simple Immutable Class

    [223002120110] |Suppose we have a simple immutable class we want to serialize (a sketch appears at the end of this section): [223002120120] |This class is immutable. [223002120130] |The reason to prefer immutable classes is discussed at the beginning of Bloch’s book; the basic idea is that immutables guarantee thread safety (assuming the immutable member variables are thread safe for reads) and consistent object state (guaranteed by the constructor). [223002120140] |The only constructor takes two arguments, and sets the final variables for the x and y coordinates of the vector. [223002120150] |For illustration, there’s only a single non-trivial method computing vector length, but the same basic argument applies for any immutable class. [223002120160] |A more robust implementation might require values to be finite (that is, not infinite and not not-a-number). [223002120170] |If we simply declare Vector2D to implement Serializable, we have a problem. [223002120180] |It’ll write OK. [223002120190] |[Update: The following is wrong and has thus been redacted.] [223002120200] |But trying to read it back in causes a problem when reflection tries to invoke a nullary (no argument) constructor Vector2D(), which doesn’t exist. [223002120210] |[Update: See my reply to Konrad's comment to see where my thinking went wrong and why you still need custom serialization to deal with immutables that have consistency conditions on their members enforced through the constructor.] [223002120220] |
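Here's a minimal sketch of the kind of class under discussion, reconstructed from the description above:

    public class Vector2D {
        private final double mX;
        private final double mY;
        public Vector2D(double x, double y) {
            mX = x;
            mY = y;
        }
        public double x() { return mX; }
        public double y() { return mY; }
        // the one non-trivial method: vector length
        public double length() {
            return Math.sqrt(mX * mX + mY * mY);
        }
    }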

    Serialization Proxy

    [223002120230] |Before serializing an object, Java checks to see if the class implements the method Object writeReplace(). [223002120240] |If so, when an instance of the class is serialized, the value returned by writeReplace() is serialized instead of the object itself. [223002120250] |When an instance of a class is being deserialized, Java checks to see if it implements Object readResolve(). [223002120260] |If it does, after the usual steps of deserialization, the return value of readResolve() is returned as the result of deserialization. [223002120270] |The pattern typically employs a static, private nested class (the “serialization proxy”) that is serialized in place of the original object through writeReplace(). [223002120280] |The proxy itself stores all the information needed to reconstitute the class being serialized. [223002120290] |The method readResolve() is then used to return an instance of the original, immutable class. [223002120300] |It sounds like a mouthful, but is really quite simple; a sketch follows at the end of this section. [223002120310] |That’s it. [223002120320] |What’s really nice is that it doesn’t affect the original class (Vector2D) interface. [223002120330] |The writeReplace() method and nested static serialization proxy class Vector2D.SerializationProxy are both private. [223002120340] |
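Here's what that looks like for the Vector2D sketch (again my reconstruction):

    import java.io.Serializable;

    public class Vector2D implements Serializable {
        static final long serialVersionUID = 1L;
        private final double mX;
        private final double mY;
        public Vector2D(double x, double y) {
            mX = x;
            mY = y;
        }
        public double x() { return mX; }
        public double y() { return mY; }

        // serialize the proxy in place of this instance
        private Object writeReplace() {
            return new SerializationProxy(this);
        }

        private static class SerializationProxy implements Serializable {
            static final long serialVersionUID = 1L;
            private final double mX;
            private final double mY;
            SerializationProxy(Vector2D v) {
                mX = v.x();
                mY = v.y();
            }
            // called after the proxy is read back in; rebuilds the immutable through its constructor
            private Object readResolve() {
                return new Vector2D(mX, mY);
            }
        }
    }

Because deserialization goes back through Vector2D's two-argument constructor, any consistency conditions the constructor enforces still hold for deserialized instances.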

    Serializing Singletons

    [223002120350] |Suppose we have a singleton and we want to make it serializable. [223002120360] |Continuing our earlier example, let’s define a singleton: [223002120370] |(Yes, I know that this isn’t a good example of a singleton because you can construct the same vector directly, but it’ll suffice for this example.) [223002120380] |Because Origin2D is a singleton, we don’t have a public nullary constructor. [223002120390] |But even if we did, we’d have the problem that the deserialized instance would be a new instance, thus defeating the singleton pattern. [223002120400] |Here all we need to do is to define a readResolve() method to return the singleton: [223002120410] |
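A minimal sketch of that (my reconstruction, with the class shape assumed):

    import java.io.Serializable;

    public class Origin2D implements Serializable {
        static final long serialVersionUID = 1L;
        public static final Origin2D INSTANCE = new Origin2D();
        private Origin2D() { }
        public double x() { return 0.0; }
        public double y() { return 0.0; }
        // hand back the singleton rather than the freshly deserialized copy
        private Object readResolve() {
            return INSTANCE;
        }
    }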

    Complete Control with Externalizable

    [223002120420] |LingPipe takes even more control by defining the serialization proxies to implement the interface java.io.Externalizable, which extends Serializable. [223002120430] |It defines two methods, writeExternal(ObjectOutput) and readExternal(ObjectInput). [223002120440] |If an object implements Externalizable, these methods are called instead of the default reflection-based serialization, which simply tries to serialize each of the member objects in turn (writing primitive values directly using DataOutput and DataInput). [223002120450] |If a class implements Externalizable, only the fully qualified class name and serial version ID is written automatically; the class itself is responsible for all other serialization and deserialization. [223002120460] |
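To make that concrete, the proxy from the sketch above could implement Externalizable rather than relying on default serialization (again a sketch, not LingPipe's AbstractExternalizable; note the public no-argument constructor Externalizable requires):

    // imports: java.io.Externalizable, java.io.IOException, java.io.ObjectInput, java.io.ObjectOutput
    private static class SerializationProxy implements Externalizable {
        static final long serialVersionUID = 1L;
        private double mX;
        private double mY;
        public SerializationProxy() { }   // required for Externalizable deserialization
        SerializationProxy(Vector2D v) {
            mX = v.x();
            mY = v.y();
        }
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeDouble(mX);
            out.writeDouble(mY);
        }
        public void readExternal(ObjectInput in) throws IOException {
            mX = in.readDouble();
            mY = in.readDouble();
        }
        private Object readResolve() {
            return new Vector2D(mX, mY);
        }
    }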

    LingPipe’s AbstractExternalizable

    [223002120470] |LingPipe provides an abstract base class, com.aliasi.util.AbstractExternalizable, which may be used as a serialization proxy. [223002120480] |It has two abstract methods, Externalizable‘s writeExternal() and Object readObject(ObjectInput), whose return value is used in the concrete implementation of readResolve(). [223002120490] |It also has some static utility methods to read and write objects to streams. [223002120500] |

    Serial Version ID

    [223002120510] |I didn’t add serial version IDs to the example classes to keep the explanation of the proxies simple. [223002120520] |You should add these to all serializable classes so that if the class changes, it won’t break serialization. [223002120530] |That is, it preserves forward compatibility. [223002120540] |This’ll typically be a declaration of the form shown at the end of this section. [223002120550] |Every time Java serializes an object, it writes the object’s serial version ID. [223002120560] |If one is not defined in a static variable named serialVersionUID, one is computed through reflection. [223002120570] |So defining one also speeds up the serialization/deserialization process. [223002120580] |If you default to the reflection-based version and the class changes, so will the ID, and you’ll get conflicts trying to read in previously serialized objects. [223002120590] |If you define the ID yourself from the beginning, the value doesn’t matter; if you have a class released into the wild, you should use Java’s serialver utility (which is distributed with Sun’s JDK) to compute the value created by reflection to insure backward compatibility. [223002120600] |
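That is, something along these lines (the particular value is arbitrary; what matters is that it stays fixed once instances are serialized in the wild):

    private static final long serialVersionUID = 42L;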

    Compilable versus Serializable

    [223002120610] |LingPipe also provides the util.Compilable interface, which defines a method void compileTo(ObjectOutput). [223002120620] |We use this method to compile objects like language-model classifiers, whose compiled form are very different from their regular form. [223002120630] |Some of these classes, like TradNaiveBayesClassifier, also implement Serializable, which writes and reads back in an object with the same behavior as the one serialized. [223002120640] |The deserialized form allows further training; the compiled form is more speed and memory efficient. [223002130010] |LingPipe / GATE Integration [223002130020] |Following on their 2009 summer school in July, Hamish Cunningham and crew have finished the initial integration of LingPipe into GATE (aka the “General Architecture for Text Engineering”, described on their home page as the “Eclipse of natural language engineering”). [223002130030] |So far, the GATE integration includes LingPipe’s tokenization, sentence splitting, part-of-speech tagging, named-entity (mention) recognition, and language ID (which I suppose means classifiers are integrated to some degree). [223002130040] |The integration is now available in the GATE nightly builds from the [223002130050] |
  • GATE SVN Repository. [223002130060] |You can find it under gate/plugins/LingPipe. [223002130070] |Although I don’t see them yet, Hamish tells me there will soon be details in: [223002130080] |
  • GATE User Guide (online), and [223002130090] |
  • GATE User Guide (PDF) [8+MB]. [223002130100] |The integration was carried out with Alias-i’s permission and LingPipe remains under our Royalty-Free License. [223002130110] |GATE itself is distributed under the more permissive LGPL. [223002140010] |Rzhetsky, Shatkay and Wilbur (2009) How to Get the Most out of Your Curation Effort [223002140020] |Following the scientific zeitgeist, here’s another paper rediscovering epidemiological accuracy models for data coding, this time from Mitzi‘s former boss and GeneWays mastermind Andrey Rzhetsky, IR and bio text mining specialist Hagit Shatkay, and the co-creator of the Biocreative entity data, NLM/NCBI’s own John Wilbur: [223002140030] |
  • Rzhetsky, Andrey, Hagit Shatkay, and W. John Wilbur. 2009. [223002140040] |How to get the most out of your curation effort. [223002140050] |PLoS Computational Biology. [free download] [223002140060] |Their motivations and models look familiar. [223002140070] |They use a Dawid-and-Skene-like multinomial model and a “fixed effects” correlation-based model (to account for the overdispersion relative to the independence assumptions of the multinomial model). [223002140080] |Neither they nor the reviewers knew about any of the other work in this area, which is not surprising given that it’s either very new, quite obscure, or buried in the epidemiology literature under a range of different names. [223002140090] |
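For readers who haven't run into it, the Dawid-and-Skene-style setup looks roughly like this (my sketch, not the paper's exact parameterization; i ranges over items, j over coders, and there are K categories):

    π ~ Dirichlet(α)                   # prevalence of the K categories
    c[i] ~ Categorical(π)              # latent true category of item i
    y[i,j] ~ Categorical(θ[j,c[i]])    # coder j's label for item i; θ[j,k] is coder j's
                                       # response distribution when the true category is k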

    Data Distribution

    [223002140100] |What’s really cool is that they’ve distributed their data through PLoS. [223002140110] |And not just the gold standard, all of the raw annotation data. [223002140120] |This is a great service to the community. [223002140130] |

    What they Coded

    [223002140140] |Özlem Uzuner‘s i2b2 Obesity Challenge and subsequent labeling we’ve done in house convinced me that modality/polarity is really important. [223002140150] |(Yes, this should be obvious, but isn’t when you’ve spent years looking at newswire and encyclopedia data.) [223002140160] |Rzhetsky et al. used 8 coders (and 5 follow-up control coders) to triple code sentences (selected with careful distributions from various kinds of literature and paper sections) for: [223002140170] |
  • Focus (Categorical): generic, methodological, scientific [223002140180] |
  • Direct Evidence for Claim (Ordinal): 0, 1, 2, 3 [223002140190] |
  • Polarity (Ordinal): Positive/Negative, with a 0,1,2,3 certainty scale [223002140200] |I’m not sure why they allowed Positive+0 and Negative+0, as they describe 0 certainty as completely uncertain. [223002140210] |Given the ordinal nature of their data, they could’ve used something like Uebersax and Grove’s 1993 model based on ordinal regression (and a really nice decomposition of sensitivity and specificity into accuracy and bias). [223002150010] |Diagnostic Precision from Sensitivity, Specificity and Prevalence [223002150020] |Suppose you have a diagnostic for your favorite binary condition (YFBC). [223002150030] |For the sake of concreteness, let’s consider the mammogram test for breast cancer. [223002150040] |The American College of Preventive Medicine estimates sensitivity at between 0.75 and 0.90 and specificity at between 0.90 and 0.95. [223002150050] |Let’s give the mammogram the benefit of the doubt and assume the high end, with sensitivity of 0.90 and specificity of 0.95. [223002150060] |

    Sensitivity

    [223002150070] |Recall that sensitivity is the accuracy of the test on women who have breast cancer. [223002150080] |In other words, if a woman who has breast cancer gets a mammogram, the test will come back positive 90% of the time because the test has a 0.9 sensitivity. [223002150090] |Cases where the condition is present and the test is positive are called true positives (TP). [223002150100] |For 10% of the women with breast cancer, the mammogram test result will be a false negative (FN), where the test misdiagnoses a woman who has the disease. [223002150110] |

    Specificity

    [223002150120] |Specificity is the accuracy of the mammogram for women who do not have breast cancer. [223002150130] |That is, if a woman does not have breast cancer, the test will be negative 95% of the time. [223002150140] |If a woman does not have breast cancer and the test results are negative, the outcome is called a true negative (TN). [223002150150] |The other 5% of the time we get a false positive, which means the test is positive for a woman who does not have breast cancer. [223002150160] |

    Prevalence

    [223002150170] |Before we can calculate diagnostic precision, we need to know the prevalence of the condition, which is the percentage of the population that has the condition. [223002150180] |We rarely know population prevalences, but they can be estimated. [223002150190] |For breast cancer, the following paper provides estimates of prevalence of breast cancer, which increase with age: [223002150200] |
  • Michelle E Kruijshaar, Jan J Barendregt, and Lonneke V van de Poll-Franse. 2003. [223002150210] |Estimating the prevalence of breast cancer using a disease model: data problems and trends. [223002150220] |Population Health Metrics 1:5. [223002150230] |
  • Graph from paper: Estimated Prevalence vs. Age [223002150240] |Consulting the chart above, the prevalence of breast cancer is roughly 0.015 for women aged 50. [223002150250] |

    Precision (aka Positive Predictive Value)

    [223002150260] |The precision of a test, also known as its positive predictive value, is the chance that a subject testing positive actually has the condition being tested for. [223002150270] |Precision is thus defined as TP/(TP+FP). [223002150280] |

    My Mammogram is Positive, Do I Have Breast Cancer?

    [223002150290] |Now let’s suppose you’re a 50-year old woman whose mammogram was positive. [223002150300] |What’s the chance you have breast cancer? [223002150310] |Let’s calculate all the probabilities. [223002150320] |
  • p(TP) = prevalence * sensitivity = 0.015 * 0.9 = 0.0135 [223002150330] |
  • p(FN) = prevalence * (1 – sensitivity) = 0.015 * 0.1 = 0.0015 [223002150340] |
  • p(TN) = (1 – prevalence) * specificity = 0.985 * 0.95 = 0.94 [223002150350] |
  • p(FP) = (1 – prevalence) * (1 – specificity) = 0.985 * 0.05 = 0.049 [223002150360] |Thus a generous estimate (remember we took the upper bounds for the accuracy estimates) of the chance of a 50-year old having cancer given that she tests positive on a mammogram is only: [223002150370] |p(TP) / (p(TP) + p(FP)) = 0.0135 / (0.0135 + 0.049) ≈ 0.21 [223002150380] |Now consider a 40-year old woman getting a mammogram. [223002150390] |The prevalence is only 0.005 (less than half a percent), leading to a test precision of 0.083, meaning only 1 in 12 40-year old women with a positive mammogram will actually have breast cancer. [223002150400] |If the sensitivity and specificity are closer to the low end of the estimates, say sensitivity of 0.7 and specificity of 0.90, precision for 50-year old subjects is only 0.096 (less than 1 in 10), and that for 40-year old subjects 0.033, or about 1/30. [223002150410] |That’s why you only start getting diagnostic tests when you’re older — they’re just too inaccurate for the young and likely healthy. [223002150420] |Or if you’re known to be more likely to get breast cancer, for instance because of incidence in the family or genetic tests; the prevalence for your situation will determine the test’s predictive accuracy. [223002150430] |The U.S. National Cancer Institute recommends “Women in their 40s and older should get a mammogram every 1 to 2 years.” [223002150440] |That’s a whole lot of chances for false positives. [223002150450] |Low positive predictive accuracy is also why there are follow-up imaging tests (sonograms) even before biopsies. [223002150460] |But keep in mind that the same caveats apply to those imperfect tests. [223002150470] |
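The arithmetic above is easy to check; here's a throwaway sketch in plain Java (nothing LingPipe-specific):

    public class DiagnosticPrecision {
        // positive predictive value from sensitivity, specificity and prevalence
        static double positivePredictiveValue(double sensitivity,
                                              double specificity,
                                              double prevalence) {
            double pTruePositive = prevalence * sensitivity;
            double pFalsePositive = (1.0 - prevalence) * (1.0 - specificity);
            return pTruePositive / (pTruePositive + pFalsePositive);
        }
        public static void main(String[] args) {
            // 50-year old, optimistic accuracy estimates: roughly 0.21
            System.out.println(positivePredictiveValue(0.90, 0.95, 0.015));
            // 40-year old: roughly 0.08
            System.out.println(positivePredictiveValue(0.90, 0.95, 0.005));
        }
    }

Running it prints the 50- and 40-year-old figures used above (about 0.21 and 0.08).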

    More Information

    [223002150480] |For more information on other tests (sonogram, MRI, physical exam and various biopsy procedures), do a PubMed Search. [223002150490] |This article from the National Cancer Institute looks interesting: [223002150500] |
  • Taplin S, Abraham L, Barlow WE, Fenton JJ, Berns EA, Carney PA, Cutter GR, Sickles EA, Carl D, Elmore JG. 2008. [223002150510] |Mammography facility characteristics associated with interpretive accuracy of screening mammography. [223002150520] |J. Natl. [223002150530] |Cancer Inst. [223002150540] |In a survey of 50+ facilities between 1996 and 2002, they found varying values for sensitivity and specificity among facilities and by procedure (specialist in breast cancer reading, double versus single reading, etc.), and an overall positive predictive value of mammograms of only 4% (meaning 1/25 women who test positive actually have breast cancer).