[223000950010] |Scientific Zeitgeist vs. “Independent” Discoveries [223000950020] |It’s striking just how much the zeitgeist affects scientific thinking. [223000950030] |I’m thinking about my own present work on deriving gold standards at the moment, but this has been a recurring theme over my whole scientific career. [223000950040] |The first theorem I proved in grad school was the equivalence of ID/LP grammars and standard CFGs. [223000950050] |It’s a particularly trivial reduction, so it was no surprise that Stu Shieber had already done it. [223000950060] |But it was my first exposure to independently discovering a known result. [223000950070] |While at CMU, I was working with a student on a project on unification-based phonology, and it turns out my old Ph.D. advisor, Ewan Klein, had done the same thing. [223000950080] |Surprising? [223000950090] |Not really. [223000950100] |We were working from Ewan and Steve Bird’s earlier work, and had done the next logical step. [223000950110] |Well, so had they. [223000950120] |Fast forward to the present, and I find myself in the position of having independently recreated about thirty years worth of work in epidemiology (without the empirical legwork, of course). [223000950130] |Am I a genius? [223000950140] |Of course not. [223000950150] |It’s just the benefit of a 2008 perspective on hieararchical Bayesian modeling and having the same problem to solve. [223000950160] |So how’d I stumble onto the same models? [223000950170] |Well, first, I had a problem to solve, namely figuring out what the recall of a set of annotations for gene mention linkage into Entrez-Gene without a gold standard. [223000950180] |Then I started paying attention to the literature. [223000950190] |The first thing that struck me was Klebanov et al.’s paper on easy/hard cases for annotation and establishing a gold standard: [223000950200] |
  • Beata Beigman Klebanov, Eyal Beigman, Daniel Diermeier. 2008. [223000950210] |Analyzing Disagreements. [223000950220] |COLING.
[223000950230] |This got me thinking about inferring gold standards rather than inter-annotator agreement rates, as well as the nature of errors. [223000950240] |We’d done some data annotation ourselves, and this point of view makes a lot of sense. [223000950250] |I also found at the time Reidsma and Carletta’s paper on annotator bias, which brought home the point that different annotators had different abilities. [223000950260] |
  • Dennis Reidsma and Jean Carletta. [223000950270] |In press. [223000950280] |Reliability measurement without limits. [223000950290] |To appear in Comptuational Linguistics. [223000950300] |This also felt right from our own and others’ data annotation efforts, but it’s just not built into things like the kappa statistics. [223000950310] |At the same time, I’ve started dropping in on Andrew Gelman and Jennifer Hill’s working group on multiple imputation. [223000950320] |My goal was to learn R and BUGS while learning how the statisticians thought about these latent data problems. [223000950330] |I always recommend diving into the deep end on problems like this; it takes a while to pick up the native’s lingo, but it’s almost always worth it. [223000950340] |Reading through Gelman et al.’s Bayesian Data Analysis led me to hierarchical binomial models —it’s their first example. [223000950350] |Reading Gelman and Hill’s regression book led me to the item-response models, which are the same kind of thing in a generalized linear model setting with item- and annotator-level predictors. [223000950360] |Hmm, I think, these models I’m learning about from Andrew and Jennifer could be applied to these problems I’m having in data annotation. [223000950370] |So I wrote out the basic model I had in mind that separated out sensitivity and specificity so I could account for the kind of bias that Reidsma and Carletta mentioned. [223000950380] |I coded it up in R and BUGS and it seemed to provide quite reasonable results. [223000950390] |I was really really psyched. [223000950400] |I haven’t been this excited about an idea I’ve had in ages. [223000950410] |I wrote my model up on the board at the next group meeting, and Jennifer suggested adding difficulty coefficients for the items (problems to classify) like in the item-response models. [223000950420] |As far as I know she knows as little about the work in epidemiology as I did. [223000950430] |I then modified the binomial model for easy/normal mixtures. [223000950440] |Now I have a whole suite of models that I’ve never seen before, though they’re built out of the same lego bricks as other models with which I am familiar. [223000950450] |So I write off to Beata and post a blog entry with some sketches of my models. [223000950460] |Then I hear from Brendan O’Connor of Dolores Labs that Panos Ipeirotis’s blog has a post on Amazon Mechanical Turk with lots of references to articles trying to do the same thing. [223000950470] |My point of entry was Dawid and Skene’s 1979 use of EM to get a point estimate of annotator accuracy and gold standard categories. [223000950480] |Turns out (Bruce and Wiebe 1999) in a paper on sentiment corpus construction had a reference to Dawid and Skene as well as an application. [223000950490] |I’d read Bruce and Wiebe’s paper, but the fundamental philosophical points it raises didn’t stand out at the time. [223000950500] |I’m actually surprised more people don’t reference this. [223000950510] |From this, I could find more entries in the epidemiology literature that sited it. [223000950520] |What I found is that in a sequence of incremental steps, the epidemiologists had already explored exactly the same models over the past 30 years and have more or less caught up to the models that I sketched on the board (the basic form of which is the same graphical model as Dawid and Skene used) have evolved in the direction Jennifer suggested. 
[223000950530] |There were only two aspects of what I came up with that I didn’t find in the epidemiology literature: inference for the hierarchical parameters (they tended to estimate them with prior knowledge and prune interactions based on knowledge of their diagnostic tests), and general missing and replicated data in the sense of not every annotator annotating every piece of data and some annotating a piece of data more than once. [223000950540] |Once you kick over to general Bayesian inference, as provided by BUGS, the missing and replicated data is trivial if you code it up the way Gelman and Hill code up item-response models. [223000950550] |The estimation of hierarchical parameters (for priors on annotator sensitivity and specificity and problem difficulty) also wind up being easy to estimate (actually, difficulty is hard —the models seem to be only weakly identified). [223000950560] |This extension wasn’t so much an extension, as the way I was taught to think about models by hanging out with Jennifer and Andrew and reading their books. [223000950570] |If you want other examples of the zeitgeist effect, I would highly recommend James Gleick’s biographies of Richard Feynman and Isaac Newton (though the former is the much better and more interesting book). [223000950580] |As Newton said, If I have seen a little further it is by standing on the shoulders of Giants., though as it turns out, he either re-invented or borrowed the expression himself. [223000960010] |Dolores Labs’ Text Entailment Data from Amazon Mechanical Turk [223000960020] |Dolores Labs has been doing some really nifty work on collecting and aggregating data using Amazon’s Mechanical Turk. [223000960030] |I’ve been working on inferring gold standards and annotator accuracies. [223000960040] |When they posted their forthcoming EMNLP paper, I immediately began to lust after the data, which the authors kindly made available: [223000960050] |
  • Dolores Labs’ Blog Post. [223000960060] |AMT is fast, cheap, and good for machine learning data
  • Rion Snow, Brendan O’Connor, Dan Jurafsky and Andrew Ng. 2008. [223000960080] |Cheap and Fast — But is it Good? [223000960090] |Evaluating Non-Expert Annotations for Natural Language Tasks. [223000960100] |EMNLP.
  • Annotation Data from the paper
  • Recognizing Textual Entailment (RTE) Task as seen by Mechanical Turk workers [223000960130] |I honed in on the text entailment data, as it looked easiest to deal with in a binary framework (other instances were multinomial or scalar). [223000960140] |So I wrote a little Java program (I probably could’ve done this in R if I were fluent) to munge their data so that I could fit it with my gold-standard inference models. [223000960150] |I then fixed the beta priors for sensitivity and specificity to be uniform (alpha=beta=1); BUGS is really nifty for this, you just move the relevant variables from the list of parameters to the list of data, give them values, then use the exact same BUGS model. [223000960160] |Here’s the result (click on image to enlarge): [223000960170] |Mechanical Turkers Performance on RTE Data [223000960180] |The axes represent sensitivity and specificity, which may range from 0.0 to 1.0. [223000960190] |Each circle represents one of the 164 annotators, with an area proportional to the number of examples annotated (which ranged from the minimum allowed of 20 to all 800 items). [223000960200] |The center of the circle is placed at the maximum likelihood estimate of sensitivity and specificity given the “gold standard” data provided in the distribution. [223000960210] |The grey lines represent estimate residuals, starting at the center of the circles and terminating at the model’s estimate of sensitivity and specificity. [223000960220] |Keep in mind that with only 20 examples, posterior intervals are around 10%, and many of the max likelihood estimates are 1.0 (all those on the top or right-hand boundaries). [223000960230] |The green diagonal running from (0,1) to (1,0) represents the expected performance of an annotator flipping a biased coin. [223000960240] |Look at all those annotators near that line. [223000960250] |Four of them annotated a lot of data, and it’s all noise. [223000960260] |I’m sold on hierarchical models, so let’s compare the model with hierarchical beta priors estimated along with the sensitivities and specificities (this is the model from my paper): [223000960270] |Mechanical Turker Performance Estimates with Beta Priors [223000960280] |I’ve added blue lines at the means of the estimated beta priors, which had 95% intervals for specificity and sensitivity of (.80, .87) and (.82, .87), with intervals for the scale of (1.9, 3.8) and (7.0, 19.9) respectively. [223000960290] |You’ll see that the hierarchical priors are pulling the estimates in their direction, which is not surprising given the counts of 20 for most of the annotators. [223000960300] |As expected from the prior scales, sensitivity exerts a strong influence. [223000960310] |Of course, we know we shouldn’t trust this too much given the sensitivity biases of the random annotators. [223000960320] |So what about Panos Ipeirotis’s question about correlation between sensitivity and specificity? [223000960330] |I don’t see a strong positive or negative correlation, at least in this case. [223000960340] |Positive correlation would indicate good annotators are just good, and negative would indicate bias (trading specificity for sensitivity). [223000960350] |In general, it might be interesting to get annotators to use a scale of how much a sentence is entailed; then if they had general biases, you could fit a threshold to quantize their results into the most consistent form. 
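For concreteness, here is a minimal Java sketch of the maximum likelihood estimates that anchor each annotator’s circle in the plots above: sensitivity is the fraction of gold-positive items the annotator labeled positive, and specificity is the fraction of gold-negative items labeled negative. The class name and array layout are made up for illustration (this is not LingPipe API), and it assumes each annotator saw at least one positive and one negative gold item.

```java
// Maximum likelihood sensitivity and specificity for one annotator.
// Hypothetical helper, not LingPipe API: takes parallel arrays of the
// annotator's binary labels and the corresponding gold-standard labels.
public class SensSpec {
    public static double[] mlEstimate(boolean[] labels, boolean[] gold) {
        int tp = 0, fn = 0, tn = 0, fp = 0;
        for (int i = 0; i < gold.length; ++i) {
            if (gold[i]) { if (labels[i]) ++tp; else ++fn; }
            else         { if (labels[i]) ++fp; else ++tn; }
        }
        double sensitivity = tp / (double) (tp + fn); // accuracy on gold positives
        double specificity = tn / (double) (tn + fp); // accuracy on gold negatives
        return new double[] { sensitivity, specificity };
    }
}
```

With only 20 items for most annotators, these ratios are very coarse, which is why so many of the maximum likelihood estimates sit exactly on the 1.0 boundaries in the plot.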
[223000960360] |Here are the more traditional Bayesian posterior plots with 80% intervals around the gold standards and the prior mean indicated as a horizontal: [223000960370] |Specificity Estimate 80% Posterior Intervals (Hierarchical Model) [223000960380] |Sensitivity Estimate 80% Posterior Intervals (Hierarchical Model) [223000960390] |Finally, here’s a look at the category residuals, which I measure as the absolute difference between the true label and the sample mean: [223000960400] |Absolute Category Residuals (Hierarchical Model) [223000960410] |Sorry for the dense plot; I haven’t figured out how to put histogram verticals on a log scale yet. [223000960420] |700 of the values are within epsilon of no residual error. [223000960430] |57 of the 800 examples have a residual error of 0.5 or more, meaning that the model would assign the wrong category. [223000960440] |23 of these had a residual error of 0.975 or more, which means the model was way off; 697 were within 0.025, meaning the model was spot on. [223000960450] |It’d sure be nice to see just how “correct” the experts were. [223000960460] |A gold standard with 95% accuracy would have 40 errors over 800 items. [223000960470] |Next up, I’d like to see what’ll happen when I remove the clear junk annotations. [223000960480] |Playing with R and BUGS is addictive! [223000960490] |Update: 16 September 2008. [223000960500] |Taking a majority vote of annotators leads to 50 errors and 65 deadlocked cases. [223000960510] |If we flip a coin for the deadlocks, we get an expected number of errors of 50+65/2=82.5, for an accuracy of 89.7%, as reported by Snow et al. Taking maximum a posteriori estimates from our models produces 59 errors/92.6% accuracy (hierarchically estimated prior) and 57 errors/92.9% accuracy (uniform prior). [223000960520] |Snow et al. report roughly the same gain (they don’t report the exact number) by weighting annotators’ votes according to their accuracies as estimated with Laplace smoothing (aka add 1 smoothing, which is equivalent to a Beta(2,2) prior under a MAP estimate) against the gold standard. [223000960530] |Although Snow et al. mention (Dawid and Skene 1979), they don’t actually apply that model’s EM estimator to infer sensitivities and specificities without a gold standard. [223000960540] |Our results show that you don’t need to have the gold standard to get a reliable estimate of annotator accuracies that will improve collective voting. [223000960550] |Of course, with the BUGS implementation of the full Bayesian model, you can add as much information as you have. [223000960560] |If you know all or some gold standard labels, these can be fixed as data, which will then be used to help estimate the unknown parameters. [223000960570] |Effectively, it just gets added to the Gibbs sampler as something that’s always sampled the same way. [223000960580] |“Expert” inter-annotator agreement on this ranged from 91-96%. [223000960590] |Postscript: If anyone’s read Ted Pedersen’s “last words”, “Empiricism is Not a Matter of Faith”, in the latest issue of Computational Linguistics, this is how science is supposed to work. [223000970010] |Swype, T9, Dasher, and iTap: Predictive Text Decoding [223000970020] |A friend recently showed me the Swype interface on his phone. 
[223000970030] |It lets you drag your fingers across a soft keyboard roughly indicating the keys you intend to type: [223000970040] |Swype Text Input Interface [223000970050] |Given my fat fingers and the dimunitive keyboard, the only way Swype can work is with what the Wikipedia calls predictive text. [223000970060] |The predictive part is easy —that’s just a noisy channel model, though with a continuous (in theory) real-time input we’ll call “motion”. [223000970070] |Let’s pretend for the moment we’re doing one word at a time, so that we have [223000970080] |p(word|motion) prop-to p(motion|word) p(word) [223000970090] |The source model p(word) is our old friend the language model. [223000970100] |The trick here is to make the LM adaptive so that it learns about you as it goes. [223000970110] |This is the same technology required for dictation systems. [223000970120] |Then we need a channel model that describes which motions are likely given a word, p(motion|word). [223000970130] |I don’t do much continous estimation, but I do know that both position and velocity (and probably acceleration) are important. [223000970140] |Everything could get quantized like in most speech recognizers with numerical differentiation estimating velocity and acceleration, or with something like a Kalman filter, if you want to get fancy. [223000970150] |I’d guess a simple model like an HMM would work for the channel model given a quantized motion, or you could get fancier and directly use a conditional model like a CRF. [223000970160] |Either way, you need to be able to do real-time decoding using the CPU on a phone, which I’m guessing is a serious engineering challenge. [223000970170] |Swype was invented by Cliff Kushler, the co-inventor of T9 predictive text interface for phones, which itself is very similar to iTap. [223000970180] |Predictive text entry for Japanese, especially kana (syllabic) to kanji (ideographic) conversion, has been in the field for years. [223000970190] |My favorite predictive text interface by far is David MacKay’s Dasher, which animates the arithmetic coding process, thus reducing text entry to a kind of driving game. [223000970200] |It’d work for any language with an orthographic order on characters or words. [223000980010] |Fool’s Gold Standard [223000980020] |I’m still digging into the Dolores Labs data from the Amazon Mechanical Turk. [223000980030] |As a result, I’ve dug back through the Pascal Recognizing Textual Entailment (RTE) challenge data, and found it to be more than a little confusing. [223000980040] |If you want to look at the raw data on which the Mechanical Turk evals were run, here’s the official RTE 1 Test Data. [223000980050] |You can also download Ido Dagan et al.’s overview of RTE1. [223000980060] |Among other things, they told their annotators to ignore tense and treat anything that’s not “fully entailed” but is “very probable” as being true. [223000980070] |These mechanical Turk annotators, who only saw these instructions, were instructed to “Assume that you do not know anything about the situation except what the Text itself says. [223000980080] |Also, note that every part of the Hypothesis must be implied by the Text in order for it to be true.” [223000980090] |Here’s a FALSE case from the official training data in which the second sentence does not follow from the first: [223000980100] |
  • The chief of the European Central Bank, Vim Duesenburg, had spoken two days ago about "doubtless" signs of growth slowdown, following the financial crises that shook Asia then the rest of the developing countries in the last 18 months. [223000980110] |
  • The Asian economic crash affected the Euro. [223000980120] |And here’s a TRUE case from the official data: [223000980130] |
  • Azerbaijanis began voting this morning, Sunday, in the first round of the presidential elections, of which opinion polls are favoring current President Heydar Aliyev to win. [223000980140] |
  • Azerbaijanis are voting in the presidential elections, in which Heydar Aliyev is expected to be re-elected. [223000980150] |I compiled a list of inferences for which my model chose the wrong answer. [223000980160] |You can find them in rte1-errors.doc (which is actually an .xml file, but %^&*! [223000980170] |WordPress doesn’t allow XML docs; just save it and rename it to .xml, or just go ahead and view it in OpenOffice). [223000980180] |The XML file contains every example to which the model assigned the wrong category, along with some notes, whether I agreed with the gold standard, and the posterior mean of the category variable in the model. [223000980190] |Here’s an example the gold standard marked TRUE, which the Turkers were pretty sure was false: [223000980200] |
  • Seiler was reported missing March 27 and was found four days later in a marsh near her campus apartment. [223000980210] |
  • Abducted Audrey Seiler found four days after missing. [223000980220] |Examples like this bring up the problem of what it means to probably entail something, where the following was marked as FALSE in the gold standard data (as it should be): [223000980230] |
  • Kennedy had just won California's Democratic presidential primary when Sirhan shot him in Los Angeles on June 5, 1968. [223000980240] |
  • Sirhan killed Kennedy. [223000980250] |More subtle cases are on the not-quite inferred boundary, where the following was marked TRUE, though it didn’t say they’d remove his name, just the accusatory paragraphs: [223000980260] |
  • The Supreme Court was due to decide today, Wednesday, on the appeal filed by De Klerk for the purpose of omitting the accusatory paragraphs against him in the report, especially the paragraphs on his responsibility for bloody attacks that were carried out in the eighties against anti-apartheid organizations. [223000980270] |
  • De Klerk awaits the Supreme court decision regarding the omission of his name in the incriminating report. [223000980280] |and this one, marked TRUE, even though filing for an IPO isn’t the same as going public: [223000980290] |
  • Google files for its long awaited IPO. [223000980300] |
  • Google goes public. [223000980310] |And here’s one marked TRUE that’s clearly FALSE, because between now and 2009 doesn’t mean it’ll be in 2009 (note that this was done several years ago, to boot, so we can’t blame inference on the part of the gold standard annotators): [223000980320] |
  • Sonia Gandhi can be defeated in the next elections in India if between now and 2009, BJP can make Rural India Shine, if it can make every young man have a job that he can be proud of. [223000980330] |
  • Next elections in India will take place in 2009. [223000980340] |There are two morals to this story. [223000980350] |First, you should always look at your data! [223000980360] |Over half of the residual disagreements between the Turker annotations and the gold standard were of this highly suspect nature and some were just wrong. [223000980370] |Second, we need to be very careful in explaining and clarifying the tasks we’re asking annotators to do. [223000980380] |The user interface component is very important. [223000980390] |I’ll end with one the Turkers got wrong, as it’s an interesting example: [223000980400] |
  • Microsoft was established in Italy in 1985. [223000980410] |
  • Microsoft was established in 1985. [223000980420] |Don’t get me started on scalar implicature and intersective modifiers! [223000990010] |Filtering out Bad Annotators [223000990020] |As I mentioned a couple posts ago, the amount of noise in Dolores Labs’ results from their Amazon Mechanical Turk experiments was pretty amazing. [223000990030] |Here’s the graph from that post: [223000990040] |Mechanical Turker Performance Estimates. [223000990050] |Circles represent individual Turkers, with area proportional to number of annotations. [223000990060] |Ends of gray lines are the hierarchical model estimates (blue lines are prior means). [223000990070] |Green line is chance performance. [223000990080] |Click to enlarge. [223000990090] |I wanted to see how much this was affecting my models’ estimates. [223000990100] |I removed any annotators where the original model estimated their sensitivity or specificity below 50%. [223000990110] |A better way to do this might be to remove annotators whose performance was near the biased-coin-flip line (the green line in the graph). [223000990120] |Here’s what it looks like without the spamnotations: [223000990130] |Same as previous graph, but with annotators with estimated sensitivity or specificity below 50% removed. [223000990140] |This eliminated 3120/8000 annotations, or about 39% of the data. [223000990150] |The above graph is the re-estimated sensitivities, specificities and mean priors. [223000990160] |The priors have better means with the noisy annotators removed. [223000990170] |Note how much the prior pulls the low sensitivity (vs. the gold standard) annotators with very few annotations toward the prior. [223000990180] |Without the hierarchical beta prior on sensitivity and specificity, more would’ve been eliminated. [223000990190] |For instance, for sensitivity, the prior MAP estimate is Beta(9.8,1.8) for sensitivity, which effectively adds 9.8 correct annotations and 1.8 errors to each estimator’s empirical counts to determine the mean of the posterior estimate. [223000990200] |With annotators only having 20 annotations, only half of which were positive, the prior dwarfs the actual annotation performance. [223000990210] |That’s why the prior mean indicated where the blue lines cross, draws the estimates like a magnet. [223000990220] |Now the fun part. [223000990230] |With the noisy annotators eliminated, majority vote is 92.5% accurate (60/800 errors) versus the model-based estimate which is 92.7% accurate (58/800 errors). [223000990240] |So as Brendan O’Connor speculated in a previous blog comment, once you eliminate the noisy annotators, majority voting and coin-tossing on ties works just fine. [223000990250] |Keep in mind that there’s a chicken-and-egg estimation problem here. [223000990260] |Without gold standard data, you don’t know the annotator’s accuracies, so you can’t eliminate the noisy ones so easily. [223000990270] |The noisy annotators were so bad in this experiment, I’m guessing they could be eliminated by having low inter-annotator agreement on average. [223000990280] |Because many of the prolific noise generators had a specificity bias, this would actually lead to decent raw inter-annotator agreement (68% chance agreement). [223000990290] |This is exactly the kind of noise that kappa statistics should be able to detect, because their performance won’t be above random given their base distributions. [223000990300] |The model fits better on the training data without the priors. 
[223000990310] |It eliminates a couple more errors, for instance (still well in the noise of the annotations themselves). [223000990320] |The point of employing the hierarchical priors isn’t to measure the annotators’ performance on the training data, but to improve estimates on future data (aka held out data). [223000990330] |This is the same effect you get if you estimate baseball batting averages (as in Jim Albert‘s highly readable Bayesian stats books). [223000990340] |You wouldn’t estimate someone’s batting average in future at-bats at .700 if they went 7/10 in their first ten at bats, would you? [223000990350] |If you’re a diehard non-Bayesian, you’d simply abstain from predicting anything based on 10 at bats. [223000990360] |But if you’re a Bayesian, you try to squeeze out all you can from the data. [223000990370] |By the same reasoning, one wouldn’t want to estimate an annotator’s sensitivity at 100% after only 10 positive examples. [223001000010] |How to Extract Quotes from the News [223001000020] |Google recently jumped on the American election-targeted site bandwagon by launching a Google Labs (aka not quite ready for a beta) demo called In Quotes. [223001000030] |In their own words, “The “In Quotes” feature allows you to find quotes from stories linked to from Google News.” [223001000040] |As it’s released as of today, there’s a pulldown menu of 20 people whose quotes are being extracted and indexed, where those 20 are chosen based on their hotness in Google News. [223001000050] |This is not a new idea: check out this description from Arno Klein of how he used OpenNLP and Reuter’s AltertNet News two years ago to create his Quotes Over Time demo. [223001000060] |Google’s approach looks straightforward, and could be implemented in LingPipe. [223001000070] |Step 0 is to spider news sites, scrape the important text out of them, and then de-duplicate the results. [223001000080] |I’m going to skip this step, as it really has nothing to do with the quote extraction problem per se. [223001000090] |Step 1 is to break the input into sentences. [223001000100] |The Indo-European sentence detector that ships with LingPipe will work just fine for this problem. [223001000110] |The main trickiness in sentence detection is dealing with quotes, parentheticals and abbreviations. [223001000120] |We discuss all of this in our sentence detection tutorial. [223001000130] |Step 2 is to do named entity extraction. [223001000140] |Google’s using a finite set of names, which can be done with our exact dictionary matching named entity extractor. [223001000150] |We cover dictionary-based entity extraction in our named entity tutorial. [223001000160] |Be sure to add pronouns in anticipation of step 3. [223001000170] |Step 3 is to perform within-document coreference. [223001000180] |LingPipe’s coref module will do this. [223001000190] |There’s an example in the generic tutorials on the web. [223001000200] |Step 4 is to pull out sentences with quote symbols and index them by speaker, as determined by within-document coreference. [223001000210] |Within-doc coref is necessary to pull out sentences like and link them to the actual speaker. [223001000220] |They are becoming more isolated in the world,” he said. [223001000230] |Step 5 is to extract contiguous sentences with quotes by the same speaker. [223001000240] |This’ll let you get the actual content supplied in the above sentence: [223001000250] |“Syria and Iran continue to sponsor terror, yet their numbers are growing fewer. 
[223001000260] |They are becoming more isolated in the world,” he said. [223001000270] |“Like slavery and piracy, terrorism has no place in the modern world.” [223001000280] |As shown in this example, English punctuation does not include quotes at the beginning of a sentence that is continuing a quote that left off in the last sentence. [223001000290] |That is, unless there’s non-quoted material like the attribution he said in between, or when a new paragraph is starting. [223001000300] |So it’ll help if you can break inputs down into paragraphs before inputting them to the system. [223001000310] |Step 6 is to de-duplicate quotes. [223001000320] |This is quite easy for quoted material given that it’s (usually) not paraphrased, except in punctuation and attribution. [223001000330] |The string comparison tutorial provides a few methods that’ll work for this, like character n-gram vector comparison. [223001000340] |This would be tricky to scale, except that you can restrict the search to pairs of quotes by the same person. [223001000350] |You’ll have to play around with precision and recall here by setting thresholds. [223001000360] |It’s clear from Google News that even they can’t do this extremely accurately. [223001000370] |Step 7 is to index everything for searchability from the server side. [223001000380] |Lucene’s a good tool to do this. [223001000390] |Just create docs consisting of a field with the quote, field with the speaker, field with the source name and link, and with a field for a unique ID per quote as determined in step 5. [223001000400] |Step 8 is to pull out topics for display using search over the free text indexes of the quotes. [223001000410] |For instance, the Lucene query would work (Lucene’s queries are implicitly disjunctive). [223001000420] |Step into uncharted territory by generalizing what Google’s up to with a statistical named entity detector. [223001000430] |You can keep the dictionary, but you’ll need to do some kind of cross-document coreference resolution before you create the index. [223001010010] |LingPipe 3.6.0 Released [223001010020] |As usual, everything’s available for download at: [223001010030] |
  • LingPipe Home Page
  • [223001010040] |Intermediate Release [223001010050] |The latest release of LingPipe is LingPipe 3.6.0. [223001010060] |This release replaces LingPipe 3.5.1, with which it is backward compatible other than for some unused methods being removed from the MEDLINE package. [223001010070] |In addition to minor API additions in the forms of utility methods, and some clarifications in the Javadoc for various methods and classes, the following substantial changes were made: [223001010080] |Hyphenation and Syllabification Tutorial [223001010090] |There’s a new hyphenation and syllabification tutorial, with evaluations on a range of publicly available and for-a-fee datasets. [223001010100] |Spell Checking Loader Improvements [223001010110] |We increased the speed and reduced the memory requirements for loading a compiled model into the spell checker. [223001010120] |Now it shouldn’t take any overhead beyond the size of the compiled model, and loading should be about twice as fast. [223001010130] |Length-Based Tokenizer Filter [223001010140] |We added a class tokenizer.LengthStopFilterTokenizer, which filters tokens out of a tokenizer that are longer than a specified maximum length. [223001010150] |MEDLINE Package Update [223001010160] |The com.aliasi.medline package was updated to reflect the fact that there are no book entries in MEDLINE. [223001010170] |(There are Book elements in NLM DTDs used for MEDLINE, but there are no books in the data itself.) [223001010180] |The changes were removing the medline.Book class, the methods inBook(), inJournal(), and book(), and in() from the class medline.Article, and the element constant BOOK_ELT from medline.MedlineCitationSet. [223001010190] |The MEDLINE tutorials were also updated to remove all the branching logic for books. [223001010200] |MEDLINE Tutorial Update [223001010210] |We’ve updated the MEDLINE tutorial with much better downloaders and cleaned up the broken links and erroneous property specifications in the read-me so that they match the build file. [223001010220] |LingMed Sandbox Project [223001010230] |We’ve included a link from the LingPipe Sandbox to the code we use here for our back-end updating, storage, and indexing of bio-medical resources such as MEDLINE, Entrez-Gene, OMIM and GO. [223001010240] |It contains extensive documentation and build files, but has lots of moving parts ranging from MySQL to RMI to Log4J. [223001010250] |The project includes a robust downloader to keep MEDLINE up to date, as well as index construction that may be used remotely through Lucene’s RMI integration. [223001010260] |There’s a generic abstraction layer that supports object-relation mapping and querying through MySQL and object-document mapping and search through Lucene. [223001010270] |The LingMed sandbox project also includes a basic version of our gene linkage application, which links mentions of genes and proteins to Entrez-Gene using name matching and context matching. [223001010280] |Citations Web Page [223001010290] |We added a new page for citations of LingPipe, including papers, books, classes, and patents. [223001030010] |Optimizing Feature Extraction [223001030020] |A widely quoted optimization tip is that the fastest code is code that isn’t executed. [223001030030] |The same goes for memory; if you want to use less, don’t allocate. [223001030040] |I’m currently applying this optimization technique to feature extraction. 
[223001030050] |As I’ve discussed here before, feature extraction is the slowest aspect of linear classification, which is otherwise just a sequence of multiply and add operations. [223001030060] |As things stand in LingPipe 3.6, features are extracted from generic objects and passed to classifiers as maps from names represented as strings to numbers. [223001030070] |This is mediated by the feature extractor interface: [223001030080] |Even worse, the map gets converted to an instance of matrix.Vector using a symbol.SymbolTable to store the feature to dimension mapping. [223001030090] |For binary logistic regression, the vector is then multiplied by a vector stored in the classifier, with results greater than zero resulting in category 1. [223001030100] |Using callbacks for inversion of control, we can get rid of the map and vector altogether. [223001030110] |With some noise, we can also get rid of the symbol table. [223001030120] |First, we need a vector entry handler interface: [223001030130] |Implementations of this interface will receive callbacks to process vectors. [223001030140] |For binary logistic regression, the handler is simple, incrementing an accumulator by multiplying the value by the parameter vector’s value for that dimension: [223001030150] |A handler is used for each classification, getting the dimensions and values fed to it through callbacks. [223001030160] |Calling the same dimension twice is equivalent to summing the feature values, making this especially convenient for structured classification settings. [223001030170] |Then, instead of implementing FeatureExtractor, we require a parser: [223001030180] |For instance, the type E could be an XML document represented as a DOM tree, or it might be a sequence of tokens tagged with part-of-speech tags. [223001030190] |The parser extracts features and sends them to the handler. [223001030200] |In code, the classifier is simple: [223001030210] |The handler defined above is then an inner class within LinearClassifier. [223001030220] |There is no need with this arrangement to create features themselves if the dimensions can be extracted. [223001030230] |We can use hash codes for them. [223001030240] |With n-grams, we can even use the Karp-Rabin algorithm to compute the hash codes. [223001030250] |For convenience, I’ve implemented adapters based on feature parsers that are analogous to vector entry parsers, only with string-based features rather than vector dimensions. [223001040010] |Computational Linguistics Curriculum [223001040020] |I recently received a request from someone asking how to tune their undergrad curriculum for computational linguistics. [223001040030] |As I’ll recount later, I have considerable experience in designing curricula for computational linguistics at all levels of study. [223001040040] |Here’s what I’d recommend. [223001040050] |My main piece of advice is to make sure you’re very solid in at least one of the component fields, which I take to be (1) computer science, (2) statistics, (3) linguistics, and (4) cognitive psychology. [223001040060] |The biggest danger of an interdisciplinary eduction is becoming a jack of all trades, master of none. [223001040070] |

    Course Work

    [223001040080] |For CS, it’s mainly the software side that’s relevant, including programming languages/data structures, algorithm analysis and automata theory/logic/discrete math. [223001040090] |These days, you’ll need a strong stats background for the machine learning component of the field, and I’d recommend a core math stats sequence, a course on linear regression and a course on Bayesian data analysis if one’s available. [223001040100] |As background to stats and CS theory, you’ll need math through calc, diff eqs and matrices, though any kind of more advanced math would help, either analysis or abstract algebra. [223001040110] |Numerical methods are extremely helpful. [223001040120] |For linguistics, what you want to learn for comptuational linguistics is the good old-fashioned, data-intensive, detail-oriented descriptive linguistics. [223001040130] |That’s still practiced among the laboratory or acoustic phonetics folks, but by few others. [223001040140] |Even so, you’ll want to take in phonetics/phonology, syntax, semantics and pragmatics. [223001040150] |Sociolinguistics is good, too. [223001040160] |The more empirical data you get to play with, the better. [223001040170] |It helps to get some basic cognitive psych and psycholinguistics if the latter’s available. [223001040180] |This isn’t so important from an engineering perspective, but it really helps to have some basic ideas about how the brain works. [223001040190] |And there are some really really nifty experiments. [223001040200] |Sometimes you can count these as your social science requirements. [223001040210] |If you can take other social sci requirements in something quantitative like micro-economics, all the better, especially if you can get beyond maximizing convex functions into things like behavioral econ and decision theory. [223001040220] |Then there are the interdisciplinary courses like machine learning, artificial intelligence, and computational linguistics itself. [223001040230] |By all means, take these classes. [223001040240] |Other interdisciplinary studies include speech recognition, which is often taught in electrical engineering. [223001040250] |Speech recognition’s great because it gives you some continuous math as a basis and also has efficiency issues beyond what you see with text. [223001040260] |Information retrieval is also very useful if you can find a class in that in either CS or a library science school. [223001040270] |Genomics sequence analysis would also be a great thing to take as the algorithms and math are very similar to much of comp ling and it’s a really fun area. [223001040280] |

    Courseware

    [223001040290] |You can do worse than follow Chris Manning’s Stanford Course on NLP. [223001040300] |It was the first version of this course in 1995 that got me into statistical NLP. [223001040310] |Why doesn’t Michael Collins have an MIT courseware version? [223001040320] |

    Software to Study

    [223001040330] |There are also lots of software packages out there distributed with source. [223001040340] |Steve Bird’s NLTK is designed explicitly for teaching and is based on Python. [223001040350] |Our toolkit, LingPipe, isn’t explicitly designed for teaching, but contains a large number of tutorials. [223001040360] |The two other big toolkits out there, Mallet and MinorThird, are much harder to understand without already knowing the field. [223001040370] |

    Books to Read

    [223001040380] |For recommended reading in these topics, I put together a comprehensive computational linguistics reading list on Amazon and occasionally update it (it’s up to date as of now). [223001040390] |

    Do Some Real Research

    [223001040400] |If at all possible, do some real research. [223001040410] |It’s the single biggest factor a grad school will look at if you have the grades and test scores to make their basic cuts. [223001040420] |I’d also highly recommend browsing recent ACL proceedings online to get a feeling for what the field’s really like. [223001040430] |And I’d highly suggest going to one of the ACL meetings. [223001040440] |Next year, University of Colorado Boulder’s hosting NAACL 2009. [223001040450] |Another great opportunity for undergrads is the Johns Hopkins Summer Workshops. [223001040460] |This is a phenomenal opportunity to work with good and diverse teams. [223001040470] |There’s nothing like applying to grad school with a publication and references from top researchers. [223001040480] |

    Internships

    [223001040490] |Almost everyone in this field seems to offer internships. [223001040500] |They’ll be harder to land as an undergrad unless you have an advisor hooked into the field who can set you up. [223001040510] |Try to get an internship that’s different from what you do in school. [223001040520] |The best thing you can get is product programming experience on real projects with teams of more than one person. [223001040530] |

    Blogs to Read

    [223001040540] |You’re already here, so you can presumably read our links. [223001040550] |This’ll give you more of a feel for the day-to-day in the field than the textbooks. [223001040560] |

    What the Profs Think

    [223001040570] |You can see what some of the professors in the field think of curricula and teaching by looking at the proceedings of this year’s ACL workshop on teaching comp ling. [223001040580] |

    How’d I Get Here?

    [223001040590] |I started back in 6th grade (1973-74) when my parents got me a Digi-Comp I from Edmund Scientific as a present. [223001040600] |The reproduction manufacturers, rightly conclude that “…Digi-Comp is an ingenious, transparent Logical Gizmo that can teach anyone about binary numbers and Boolean algebra …”. [223001040610] |I don’t know about transparent, but the books explained boolean algebra, binary arithmetic, and game trees, concepts even an elementary school student can grasp. [223001040620] |In 12th grade (1980-81), I read Hofstadter’s inspiring book Gödel, Escher, Bach, which made artificial intelligence in general, and learning logic in particular, sound fascinating. [223001040630] |As an undergraduate math major in Michigan State University’s Honors College (1981-1984), I created a computational linguistics curriculum for myself without knowing it. [223001040640] |I took philosophy of language and analytic philosophy and non-standard logics as my humanities classes, I took developmental psych, micro-econ, cognitive psych and pyshcolinguistics as soc classes, and split the rest of my classes between computer science and (mostly discrete!) math. [223001040650] |As a Ph.D. student at the University of Edinburgh (1984-1987; go to the U.K. for speed), I found myself in another computational linguistics degree, this time masquerading as a School of Epistemics Centre for Cognitive Science. [223001040660] |Our four qualifying exams were in syntax, semantics, computational linguistics and psycholinguistics, which is not exactly a general cognitive science curriculum. [223001040670] |After a brief stint writing my thesis while hanging out at Stanford’s Center for the Study of Langauge and Information (1987-1988), I landed a faculty gig in Carnegie Mellon’s Computational Linguistics program (1988-1996), which is now part of the Language Technologies Institute. [223001040680] |We introduced new M.S. and B.S. programs while I was there and I had a significant hand in designing (and teaching) the undergraduate and graduate curricula in computational linguistics. [223001040690] |Sitting in on Chris Manning’s 1995/96 class on statistical NLP was the last straw sending me down the statistical NLP path. [223001040700] |I learned how to program professionally at SpeechWorks (2000–2002). [223001040710] |For that, you need a real project, a good team, and a great mentor, like Sasha Caskey. [223001050010] |Chinese Information Retrieval with LingPipe and Lucene [223001050020] |The standard approach to information retrieval in English involves tokenizing text into a sequence of words, stemming the words to their morphological roots, then tossing out stop words that are too frequent or non-discriminative. [223001050030] |This is how the standard settings work in Apache’s Java-based Lucene search engine. [223001050040] |Taking this approach for Chinese would require being able to find the words. [223001050050] |But the state of the art in Chinese word segmentation is only 97-98% or so on a relatively well-edited and homogeneous test set making up the third SIGHan segmentation bakeoff. [223001050060] |And keep in mind that this is the usual average performance figure, with rarer words being harder to find. [223001050070] |Out-of-training-vocabulary performance is at best 85%. [223001050080] |Unfortunately, it’s those rarer words in the long tail, likes names, that are often the key disrciminative component of a query. 
[223001050090] |It turns out that you can go a very long way indexing Chinese using character unigrams and bigrams. [223001050100] |That is, instead of breaking text into tokens, pull out all the characters and index those as if they were words, then pull out all the two-character sequences and index those the same way. [223001050110] |This is what we did for Technorati. [223001050120] |The evaluation on real queries showed what you’d expect: very good recall, but often poor precision because of unintended cross-word and sub-word effects. [223001050130] |Consider an English example of the word sequence “and add it to your”, which would look like “andaddittoyour” without spaces. [223001050140] |The problem is that without spaces, it matches the words “dad”, “toy”, “ditto”, “you”, and “our”. [223001050150] |(In swear word filtering, naive replacement in subwords leads to the “clbuttic” mistake). [223001050160] |By the way, subword problems come up in English at an alarming rate. [223001050170] |How many of you know that “Spider-man” is two words while “Batman” is only one? [223001050180] |In looking at customer query logs for TV data, we saw that users have no idea how many spaces to put in many examples. [223001050190] |We just expect our search engines to have solved that one for us. [223001050200] |They actually solve this problem roughly the same way as we suggest for Chinese. [223001050210] |(This is also a problem in noun compounding languages like German and Danish, or generally highly compounding languages like Turkish.) [223001050220] |The cross-word effects would be completely mitigated if we could teach Chinese query writers to insert spaces between words. [223001050230] |I’m told this is impossible given the conventions of the language, but my reply is “where there’s a will, there’s a way”, and I think the adoption of Pinyin provides some direct support for my conjecture. [223001050240] |But let’s suppose we can’t get the Chinese searchers to insert spaces; it won’t take care of the subword problem anyway. [223001050250] |So how can we improve precision for unigram and bigram indexing? [223001050260] |Add words when we can find them and let TF/IDF sort out the weights. [223001050270] |This is the idea behind this paper, which evaluates a number of such combinations: [223001050280] |Jian-Yun Nie, Jiangfeng Gao, Jian Zhang, and Ming Zhou. 2001. [223001050290] |On the Use of Words and N-grams for Chinese Information Retrieval. [223001050300] |Proceedings of the 5th International Workshop Information Retrieval with Asian Languages. [223001050310] |The best indexing seems to arise from a mixture of character unigrams, character bigrams and words. [223001050320] |So how do we get the words? [223001050330] |Two ways, for maximal precision and control. [223001050340] |First, we use a dictionary. [223001050350] |We can build an efficient tokenizer to find dictionary words using LingPipe’s Aho-Corasick implementation dict.ExactDictionaryChunker, the use of which is described in the javadoc and in the named entity tutorial. [223001050360] |Just make sure you use the right tokenizer, which for Chinese would be the tokenizer.CharacterTokenizerFactory. [223001050370] |A dedicated implementation would be faster than going through tokenizers, given that we’re treating each character as a token. [223001050380] |As to where to find a dictionary, that’s up to you. [223001050390] |Training data is still available on the old SIGHan site or from LDC. 
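Stepping back to the character unigram and bigram indexing that all of this builds on, here is a minimal, library-free Java sketch of the term extraction; in a real system this logic would sit inside a Lucene analyzer or be assembled from LingPipe tokenizer factories as described above. The class name is made up, and for simplicity it assumes BMP code points and does no whitespace or punctuation filtering.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of character unigram + bigram term extraction for indexing
// unsegmented Chinese text.  Hypothetical helper class; assumes BMP
// characters and skips no whitespace or punctuation.
public class CharNGramTerms {
    public static List<String> terms(String text) {
        List<String> terms = new ArrayList<String>();
        for (int i = 0; i < text.length(); ++i) {
            terms.add(text.substring(i, i + 1));      // unigram term
            if (i + 1 < text.length())
                terms.add(text.substring(i, i + 2));  // bigram term
        }
        return terms;
    }
}
```

Dictionary matches from the exact dictionary chunker would then simply be added as extra terms alongside the unigrams and bigrams, with TF/IDF sorting out the relative weights.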
[223001050400] |The second approach to word spotting can be approximate, and for that, we recommend our noisy channel Chinese word segmenter, as described in our tutorial on Chinese word segmentation. [223001050410] |The only fiddly bits of coding are wrapping up our dictionary chunker and chinese word segmenter as Lucene analyzers (see lucene.analysis.Analyzer. [223001050420] |You can see how to adapt LingPipe tokenizer factories for use as Lucene analyzers in our new sandbox project LingMed; look for the top-level read-me.html file and the adapter implementation for LingPipe tokenizers in src/com/aliasi/lingmed/lucene/LuceneAnalyzer.java. [223001050430] |We’ve sidestepped the fact that Chinese has two distinct ways of rendering characters, traditional Chinese characters (used mainly in Taiwan and Hong Kong and among ex-pats) and simplifed Chinese characters (used primarily on the mainland). [223001050440] |The unicode standard, of course, contains both. [223001050450] |It’s a simple matter to load a single dictionary on both types of characters, and a single word segmenter operating over both types of characters. [223001050460] |We’ve had good luck in training multilingual (English/Hindi) named entity extractors. [223001050470] |It works because it relies on local context to predict words, and local context is almost always homogeneous in character set. [223001050480] |That way, you can handle docs with mixed character sets and don’t have to do a character set ID pass up front (which is tricky, because the code points overlap in unicode and queries are often short). [223001050490] |We should point out that this method can also work for Japanese. [223001050500] |Though you might want to up the n-gram count for the syllabic characters in Hirigana and Katakana, while keeping 1-grams and 2-grams for the imported Chinese ideograms making up Kanji. [223001060010] |Bayesian Active Learning [223001060020] |Active learning for classifiers dynamically orders items to be annotated by humans so as to train classifiers with as little human annotation effort as possible. [223001060030] |Active learning typically starts with a few items chosen at random for annotation. [223001060040] |After they’re annotated, one or more classifiers is trained using the annotated data. [223001060050] |Then the next item to be annotated is chosen based on the behavior of the classifier(s). [223001060060] |With one classifier, this is typically done by taking unlabeled items for which the classifier is most uncertain; with more than one classifier, disagreements among the classifiers are often used to rank items for annotation. [223001060070] |Suppose for simplicity that we have a binary classification problem, perfect annotators, and a single classifier. [223001060080] |Furthermore, let’s assume we’ll settle for a greedy algorithm that always chooses one item at a time; it’s easy in principle, if not in notation and computation, to consider groups of items to annotate. [223001060090] |As an item selection metric, I’m proposing expected error reduction. [223001060100] |If we have some items annotated, we can train a classifier. [223001060110] |We can even use the unlabeled items if the classifier allows semi-supervised learning. [223001060120] |With this probabilistic classifier, we can estimate p[i] = Pr(c[i]=1|item[i]), the probability that item i is of category 1. 
[223001060130] |The expected error under log loss for item i is E(err[i]) = - p[i] * log(1-p[i]) - (1-p[i]) * log(p[i]), which is just the loss if it’s category 1 times the likelihood that it’s category 1 plus the same for category 0. [223001060140] |The total expected loss for the corpus, E(err), is just the sum of the item losses, Σi E(err[i]). [223001060150] |The key step is estimating the effect of annotating item i. Consider labeling the item 1 and compute E(err|c[i]=1), the new expected error for a corpus with the i-th item having category 1. [223001060160] |Do the same for category 0. [223001060170] |Then weight them by the classifier’s estimates. [223001060180] |Labeling item i provides an expected error after the labeling of p[i] * E(err|c[i]=1) + (1-p[i]) * E(err|c[i]=0). [223001060190] |The next item to label is then the one with the largest expected error reduction. [223001060200] |The beauty of this approach is that it takes all of our information into account. [223001060210] |It considers the reduction in error from the example in hand, because that will have no uncertainty after its labeled. [223001060220] |But it also considers the effect of labeling that item on all the other items. [223001060230] |A danger of standard active learning is that it just keeps pulling out outlier examples that have high uncertainty but the labeling of which doesn’t bear much on the overall problem. [223001060240] |The estimated error reduction approach takes into account how much knowing the label for item i will affect inference about the entire corpus. [223001060250] |We can push as much Bayesian uncertainty in labeling through probabilistic supervision and evaluation as we have computer power. [223001060260] |The bummer’s that a naive implementation is quadratic in corpus size even if the training algorithm is linear (like logistic regression estimated by SGD). [223001060270] |It’s easy to consider subsets to annotate in the same way; it just requires even more cycles to compute as the combinatorics of set selection interact with retraining. [223001060280] |The other shortcoming is that it doesn’t take into account expected noise from annotators. [223001060290] |The noise issue is addressed in Victor Sheng et al.’s 2008 KDD paper Get Another Label?. [223001060300] |It opens the problem up to considering not just new items, but getting further labels for existing items. [223001060310] |The technically challenging step is to estimate annotator accuracy, though I think we have a good handle on that. [223001060320] |But if you can’t re-use annotators whose accuracies have been inferred, then you need to be able to estimate the accuracy of a randomly chosen new annotator. [223001060330] |Inferences for new annotators is possible given the model I describe in Bayesian Hierarchical Models of Categorical Data Annotation. [223001060340] |Now all I need is an algorithm. [223001070010] |Lam and Stork (2003) The Effect of Annotator Error on Classifier Evaluation [223001070020] |Anyone who’s looked at a corpus realizes that the “gold standards” are hardly 24 karat; they are full of impurities in the form of mislabeled items. [223001070030] |Just how many, no one really knows. [223001070040] |That’d require a true gold standard to evaluate! [223001070050] |We’ve been wondering how we can conclude we have a 99.9% recall entity annotator when the gold standard data’s highly unlikely to be 100% accurate itself. 
[223001070060] |Check out this 2003 publication from the OpenMind Initiative’s list of publications: [223001070070] |Lam, Chuck P. and David G. Stork. 2003. [223001070080] |Evaluating classifiers by means of test data with noisy labels. [223001070090] |In Proceedings of IJCAI. [223001070100] |In particular, table 1, which plots the observed error rates for different true error rates and corpus mislabeling rates. [223001070110] |Here’s a small slice: [223001070120] |For simplicity, Lam and Stork assumed the classifier being evaluated and the data annotators make independent errors. [223001070130] |This is not particularly realistic, as problems that are hard for humans tend to be hard for classification algorithms, too. [223001070140] |Even the authors point out that it’s common for the same errors to be in training data and test data, thus making it very likely errors will be correlated. [223001070150] |Lam and Stork’s paper is also the first I know to raise the problem addressed by Sheng et al.’s 2008 KDD paper, Get Another Label?, namely: [223001070160] |…it is no longer obvious how one should spend each additional labeling effort. [223001070170] |Should one spend it labeling the unlabeled data, or should one spend it increasing the accuracy of already labeled data? [223001070180] |(Lam and Stork 2003) [223001070190] |Lam and Stork also discuss a problem related to that discussed in Snow et al.’s 2008 EMNLP paper, Cheap and Fast –But is it Good?, namely how many noisy annotators are required to estimate the true error rate (their figure 1). [223001070200] |The answer, if they’re noisy, is “a lot”. [223001070210] |Snow et al. considered how many really noisy annotators were required to recreate a gold standard approximated by presumably less noisy annotators, which is a rather different estimation. [223001070220] |Of course, this is all very Platonic in assuming that the truth is out there. [223001070230] |Here at Alias-i, we are willing to accept that there is no truth, or at least that some cases are so borderline as to not be encompassed by a coding standard, and that any attempt to extend the standard will still leave a fuzzy boundary. [223001070240] |The question we have now is whether our models of annotation will distinguish the borderline cases from hard cases. [223001070250] |With hard cases, enough good annotators should converge to a single truth. [223001070260] |With borderline cases, there should be no convergence. [223001080010] |Non-Identifiability, Arithmetic Precision, and Convergence [223001080020] |I just received mail from an understandably confused user who ran our singular-value decomposition (SVD) tutorial example multiple times and got multiple different answers. [223001080030] |As I replied to that user, there are two reasons for this, one trivial and one deeper. [223001080040] |The trivial problem is non-identifiability. [223001080050] |SVD is a matrix factoring. [223001080060] |As such, you can think of it as a solution to a matrix equation. [223001080070] |If you remember your quadratic equations, you’ll remember that some equations have multiple solutions. [223001080080] |In the simplest case, the equation x * x = 4 has two solutions, one where x = 2 and one where x = -2. [223001080090] |This is why you sometimes get all the signs flipped in the SVD factoring of a matrix. [223001080100] |You’ll get the same inferences in terms of document-document, document-term and term-term similarity, though. 
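A tiny numeric check of that point, with made-up singular vectors: flipping the signs of both the left and right vectors leaves every reconstructed value, and hence every similarity computation, unchanged.

    // made-up rank-1 factorization; flipping both vectors' signs changes nothing
    public class SvdSignFlip {
        public static void main(String[] args) {
            double[] u = { 0.6, 0.8 };    // left singular vector (invented)
            double[] v = { 0.8, -0.6 };   // right singular vector (invented)
            double singularValue = 2.0;
            for (int i = 0; i < u.length; ++i) {
                for (int j = 0; j < v.length; ++j) {
                    double original = singularValue * u[i] * v[j];
                    double signFlipped = singularValue * (-u[i]) * (-v[j]);
                    System.out.println(original + " == " + signFlipped);
                }
            }
        }
    }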
[223001080110] |In practice, the way to solve the first problem is to simply choose one of the representations to serve as the canonical solution. [223001080120] |For instance, vectors may be ordered lexicographically by dimension, giving you the positive solution x=2 in the example above. [223001080130] |You see this kind of maneuver in both Bayesian and non-Bayesian statistics. [223001080140] |The second problem is deeper and has to do with the stochastic nature of the search. [223001080150] |Our SVD is implemented with stochastic gradient descent, which initializes the vectors randomly and then follows the gradient (slope) of the error curve (least squares in the case of SVD) by taking littler and littler steps over time (so-called simulated annealing) until it’s near the answer. [223001080160] |This kind of search is guaranteed to work in theory. [223001080170] |Unfortunately, that theory requires an arbitrary number of epochs (iterations), arbitrary arithmetic precision, and may take a long time to even get close to an answer. [223001080180] |In practice, the exact values that aren’t well identified by stochastic search often don’t matter much, [223001080190] |because the error in SVD is measured by least squares, and many different near-solutions have nearly identical least-squares error. [223001080200] |In practice, the way to mitigate this insoluble problem is to run multiple trials and take something like a consensus. [223001080210] |You see this done in Gibbs sampling, where the idea of starting with diffuse initial values and assessing chain mixing (using, for example, the R-hat statistic) is standard operating procedure in Bayesian model fitting. [223001090010] |Updating and Deleting Documents in Lucene 2.4: LingMed Case Study [223001090020] |Lucene Java 2.4 was recently released, and this is a good excuse to review the cumulative changes to the org.apache.lucene.index IndexReader and IndexWriter classes. [223001090030] |Before Lucene Java 2.1, documents were added to an index using IndexWriter.addDocument() and deleted using IndexReader.delete(). [223001090040] |Document updates were written as a delete followed by an add. [223001090050] |In 2.1 the method IndexWriter.deleteDocument() was added to the API, and in version 2.2 IndexWriter.updateDocument() was added as well. [223001090060] |Most of the demos and tutorials out there predate this release, causing no end of confusion and even heartbreak for the newbie developer. [223001090070] |Doug Cutting summarizes this nicely, in his comments on this blog post: [223001090080] |The presence of IndexReader.delete() traces back to the origins of Lucene. [223001090090] |IndexReader and IndexWriter were written, then deletion was added. [223001090100] |IndexWriter should really have been called IndexAppender. [223001090110] |The only place deletion was possible was IndexReader, since one must read an index to figure out which document one intends to delete, and IndexWriter only knows how to append to indexes. [223001090120] |This has big performance implications: IndexReader is a heavy-weight view of an index, while IndexWriter is lightweight. …The issue has been discussed at length for years, but no acceptable solution had been developed. [223001090130] |Recently Ning Li contributed to Lucene a fix for this, a version of IndexWriter that can efficiently intermix deletes with additions. [223001090140] |In the LingMed sandbox project, we use Lucene to maintain a local version of the MEDLINE citation index, where each MEDLINE citation is stored as a Lucene document.
[223001090150] |Every weekday the NLM releases a set of updates which contains new citations, revised citations, and lists of citations that should be deleted from the index. [223001090160] |This is an XML file, and we use classes from the com.aliasi.medline package to process the updates file. [223001090170] |As an article goes through the publishing pipeline (e.g. accepted, pre-print, corrections added), the corresponding citation in the MEDLINE index is updated accordingly. [223001090180] |We only want to keep the latest version of a citation in our Lucene index. [223001090190] |The MEDLINE updates file encodes all citations as MedlineCitation entities, therefore our handler must figure out whether the citation is new or revised. [223001090200] |Since all MEDLINE citations have a unique PMID (PubMed identifier) element, we search the index for that PubMed id —if it exists then we call IndexWriter.updateDocument, else we call IndexWriter.addDocument. [223001090210] |Search over the index requires us to open an IndexReader. [223001090220] |This raises the question: if the IndexWriter makes changes to the index, how can the IndexReader see them? [223001090230] |The answer is found in the IndexWriter's javadoc: [223001090240] |an IndexReader or IndexSearcher will only see the index as of the “point in time” that it was opened. [223001090250] |Any changes committed to the index after the reader was opened are not visible until the reader is re-opened. [223001090260] |Given this, processing a set of MEDLINE updates file(s) is pretty straightforward: before processing each file we open an IndexReader on the index, and after processing the entire file we call IndexWriter.commit(), which flushes any pending changes out to the index. [223001090270] |(The Lucene javadoc says that after a commit “a reader will see changes” but only if the reader goes looking for them! [223001090280] |See also this discussion on the Lucene mailing list). [223001090290] |We use a com.aliasi.medline.MedlineParser to parse the updates file. [223001090300] |The parser takes a visitor in the form of a MedlineHandler, which processes the MEDLINE citations as they are extracted by the parser. [223001090310] |The parser invokes the handler’s handle(MedlineCitation) method on each MedlineCitation element that it parses out of the updates file, and invokes the handler’s delete(String) method on each PubMed identifier in the DeleteCitation element. [223001090320] |Here is the calling code which processes the MEDLINE updates file(s) (calls to the log4j logger omitted): [223001090330] |The class MedlineIndexer implements MedlineHandler and does the work of updating the index: [223001090340] |Instantiating a MedlineIndexer automatically opens a fresh IndexReader and IndexSearcher on the index. [223001090350] |The MedlineIndexer.close() method closes these objects and commits updates on the index. [223001090360] |The MedlineCodec maps the MedlineCitation.pmid() to its own field in the Lucene Document, so that we can uniquely identify documents by PubMed identifier. [223001090370] |To look up documents by PubMed id we create an org.apache.lucene.index.Term object: [223001090380] |Here is the MedlineIndexer.handle() method: [223001090390] |We use the IndexSearcher's docFreq() method to check if a document with this PubMed id is in the index. [223001090400] |If so, then the handler updates the document, else it adds it. [223001090410] |To delete a citation from the index we use the deleteDocuments method.
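Here is a hedged sketch of the MedlineIndexer logic just described. The field name "pmid", the constructor arguments, and the document-building details are assumptions for illustration (the actual LingMed code delegates document construction to the MedlineCodec), and it assumes MedlineHandler declares just the handle and delete callbacks mentioned above; the update-or-add and delete logic follow the description in the text.

    import java.io.IOException;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;

    import com.aliasi.medline.MedlineCitation;
    import com.aliasi.medline.MedlineHandler;

    public class MedlineIndexerSketch implements MedlineHandler {

        static final String PMID_FIELD = "pmid";   // assumed field name

        private final IndexWriter mIndexWriter;
        private final IndexSearcher mSearcher;     // point-in-time view of the index

        public MedlineIndexerSketch(IndexWriter indexWriter, Directory indexDir)
            throws IOException {
            mIndexWriter = indexWriter;
            mSearcher = new IndexSearcher(indexDir);   // opened fresh per updates file
        }

        // new citation: add; revised citation: update (delete + add in one call)
        public void handle(MedlineCitation citation) {
            try {
                Term idTerm = new Term(PMID_FIELD, citation.pmid());
                Document doc = toDocument(citation);
                if (mSearcher.docFreq(idTerm) > 0)
                    mIndexWriter.updateDocument(idTerm, doc);
                else
                    mIndexWriter.addDocument(doc);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        // DeleteCitation entries arrive as plain PubMed identifiers
        public void delete(String pmid) {
            try {
                mIndexWriter.deleteDocuments(new Term(PMID_FIELD, pmid));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        // stand-in for the MedlineCodec: index the PubMed id as its own field
        Document toDocument(MedlineCitation citation) {
            Document doc = new Document();
            doc.add(new Field(PMID_FIELD, citation.pmid(),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            // ... title, abstract, and the rest of the citation go here ...
            return doc;
        }

        // close the searcher and flush pending adds/updates/deletes
        public void close() throws IOException {
            mSearcher.close();
            mIndexWriter.commit();
        }
    }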
[223001090420] |The MedlineIndexer.delete() method just calls deleteDocuments() with a Term built from the PubMed identifier, as in the sketch above. [223001090430] |Indexing the MEDLINE daily updates file is a simple batch-oriented process. [223001090440] |The IndexWriter holds a write-lock on the index (a file-system based lock, see the LockFactory javadoc), so only a single update process can ever be running. [223001090450] |However there can be any number of IndexReaders open on the index. [223001090460] |Their view of the index is whatever state the index was in when either their open() or reopen() method was called. [223001090470] |Once you understand this about the Lucene IndexReader and IndexWriter classes, you’re ready to start building applications capable of handling a stream of search requests interleaved with updates to the index (or else you’re ready to use Solr, or something like it). [223001100010] |i2b2 Obesity Challenge: No Machine Learning Necessary [223001100020] |I attended the i2b2 Obesity Challenge Workshop over the weekend, where the top-performing systems by all metrics were primarily hand-built rule-based systems. [223001100030] |The papers gave me a sense of déjà vu; they were not only built just like the expert systems of the 1970s (such as Mycin), they were motivated by a desire for explainable conclusions. [223001100040] |That is, a clinician is going to need to review the machine’s findings, and rules are easy to understand. [223001100050] |The task was to classify (anonymized) patient discharge summaries from Massachusetts General Hospital’s Weight Center for patients at risk of obesity or diabetes as to whether they actually were obese and whether they had 15 other co-morbidities such as diabetes, coronary artery disease, congestive heart failure, gout, and sleep apnea. [223001100060] |These discharge summaries are hundreds of sentences long and discuss everything from family history and patient medical history to lab test reports and prescription lists. [223001100070] |The best-performing machine learning systems that treated the docs as simple bags of words were rule learners like Ripper and decision trees. [223001100080] |Linear classifiers performed best using the top few features (usually selected by measuring information gain, which is classification entropy minus conditional entropy given the feature). [223001100090] |In terms of feature extraction and document parsing, zoning really helped. [223001100100] |The family history section (fairly easily extracted in this data) was a common source of false-positives for diseases for naive systems. [223001100110] |The second important step was to import synonym and abbreviation dictionaries for drugs and diseases. [223001100120] |We saw a lot of use of resources like UMLS and RxNorm for that. [223001100130] |Given the task had yes/no/unknown categories, everyone expected approaches such as Chapman’s NegEx to have more of an impact than they did (though one team got more mileage by customizing NegEx with a specialized dictionary for the obesity task). [223001100140] |These all point to the difference between this task and other classification tasks such as overall sentiment, topic, or language identification —it’s more of an information extraction problem than a full-text classification problem. [223001100150] |In this, it’s like aspect-oriented sentiment extraction. [223001100160] |This bucks the prevailing trend in the field where recent bakeoff winners have been built following a three-step program: [223001100170] |1. collect and annotate data, [223001100180] |2.
extract features with a rule-based system to create a vectorized representation of a document, then [223001100190] |3. fit one or more discriminative linear classifiers (e.g. SVMs, logistic regression, or perceptrons). [223001100200] |This is a hybrid method, which really undercuts all the claims of automation from the machine learning crowd. [223001100210] |Perhaps that’s why everyone’s so obsessed with adaptation and semi-supervised learning these days. [223001100220] |At the same time, all the rule-based systems leaned heavily on the data collection step to tune their rules. [223001100230] |Clearly, none of the machine learning-based entries (including ours) spent nearly enough time on feature extraction. [223001100240] |MITRE and Mayo Clinic leveraged Mayo’s existing entity extraction and normalization systems, and the results were pretty good, though they didn’t have time to customize the resources much for the challenge (the knowledge needed was fairly deep and broad, though as one team pointed out, fully accessible on the web through keyword lookups). [223001100250] |I also suggested to Özlem Uzuner (the challenge organizer) that we might run the same task again next year with another pass over the data by the annotators (my current hobby horse!). [223001100260] |One of the huge pains for this kind of eval is scrubbing for anonymity, which makes large semi-supervised tasks problematic. [223001100270] |It’s also hard to get good gold-standard agreement and arrive at a consistent coding standard with only a pair of annotators and a tie-breaker in a single pass. I’d love to have a chance to take the features of the winning systems and carry out step (2). [223001100280] |I can’t do it now, because we had to destroy all the data after the workshop due to privacy and liability concerns. [223001100290] |Cincinnati Children’s Hospital managed to get their ICD-9-CM coding data released to the public, which I’m told is quite remarkable. [223001100300] |Their Medical NLP Challenge to perform ICD-9 coding of radiology reports showed a similar pattern of results to the i2b2 Obesity Challenge, except for UPenn’s entry, which placed second following the above methodology. [223001100310] |If you’re interested in how we did, we were in the middle of the pack of 28 systems. [223001100320] |A few quick and dirty feature extraction tricks for associating drug terms and diseases and for distributing negation helped a bit, as did using information gain to select features before training with L1-regularized logistic regression. [223001110010] |Medical Coding: Accident Involving Spacecraft Injuring Other Person [223001110020] |It was déjà vu all over again on day 2 of the AMIA 2008 Annual Symposium, when I attended an all-day tutorial titled Clinical Classifications and Biomedical Ontologies: Terminology Evolution, Principles and Practicalities (really —it was just plain déjà vu in my last post). [223001110030] |If a satellite or debris from an exploding space shuttle drops on your head, ICD-9 E845.9 is the code for you (“accident involving spacecraft injuring other person”). [223001110040] |Luckily, this code’s never been needed, but unfortunately, E845.0, its sister concept, has been (“accident involving spacecraft injuring occupant of spacecraft”). [223001110050] |In less extreme cases, we saw there were hundreds of codes for tuberculosis, but only one for prostate cancer (185), which is quite diverse in its forms and their repercussions.
[223001110060] |The cause is that whoever was in charge of the TB section of ICD-9 really “went to town”, with the implication that the urologists/oncologists were slackers. [223001110070] |Medical ontologies are being built the same way good-old-fashioned-AI ontologies were built back in the 1980s and early 1990s —with description logics. [223001110080] |I’m a big fan of description logics; I even wrote a book on the topic back when I was teaching description logics at CMU. [223001110090] |To add to the sense of déjà vu, a former victim student, Thomas Polzin of MultiModal, was in the small audience with me. [223001110100] |The uses to which description logics are being put in ontology development are in their sweet spot, which is complex type and consistency checking. [223001110110] |It sure brought back memories to see a hierarchy of description logics on the complexity scale. [223001110120] |It also reminded me that logics (by themselves) just aren’t good ways to program algorithms, which require a knowledge of compilation and execution, no matter how strong the siren call of the proponents of declarative this or that. [223001110130] |That’s why I’ve always advocated using description logics for type checking in grammar development and logic programming (a position reinforced by Dick Crouch of Powerset and Tracy Holloway King of Xerox in their ACL software workshop talk this past summer). [223001110140] |The other thing that got me thinking about my more theoretical days was the emphasis on philosophy of language, which I also used to teach. [223001110150] |The tutorial started at the “chasm of semantic despair” (somewhere between genomics and patient record data at around the cellular level, where terminology is unclear and evolving, though I think that’s just what the presenters had most experience with). [223001110160] |We were pitched the “value proposition” (better organize your data if you want to improve quality of care; oddly not much talk of billing, which all goes through ICD-9 codes in the U.S. as of 2008). [223001110170] |Then we started from the beginning with Aristotle. [223001110180] |When talking ontologies, I’m more often reminded of Procrustes than of Aristotle. [223001110190] |Being scientists, the presenters were not surprisingly, although unbeknownst to them, squarely within the logical positivist tradition. [223001110200] |Clearly they never read Wittgenstein or Quine, or they’d be post-analytic. [223001110210] |Much less Rorty, or they’d be neo-pragmatic, like me. [223001110220] |Keep tuned for more philosophy once I clear out the technical backlog. [223001110230] |The number of controlled vocabularies and ontologies in the medical domain is rather overwhelming. [223001110240] |You’ve got ICD-9 (clinical modification of course, with ICD-10 scheduled for 2011 and ICD-11 in the works) for diseases and billing, Snomed 3 for clinical terminology, LOINC for lab terminology, NDC for drugs and packaging, and the UMLS to unify them all, with clinical additions like HL7 for electronic health records (Check out the U.S. National Cancer Institute’s BioPortal, which lets you search or browse some of these terminologies; we actually did some coding in the tutorial, which was illustrative.) [223001110250] |The tutorial was being taught by Chris Chute of Mayo Clinic and Jim Cimino of the NLM. [223001110260] |As I learned later, this was like walking into an intro to Java tutorial being taught by Josh Bloch and Doug Lea. 
[223001110270] |These guys are chairing the committees that are designing these ontologies (for instance, Chris Chute is the chair of the ICD-11 revision for the World Health Organization!). [223001110280] |This was one of the better tutorials I’ve ever seen, and I’d highly recommend it. [223001110290] |It looks like they give this presentation regularly, judging by its practiced nature and their deck of 400 detailed PowerPoint slides. [223001110300] |Unfortunately, they’re doctors, not computer scientists, so I can’t find any of their slides or other reference materials on the web. [223001120010] |White Paper: Multilevel Bayesian Models of Categorical Data Annotation [223001120020] |Whew. [223001120030] |I spent most of the weekend finishing this off. [223001120040] |It’s one stop shopping for all the info on models and applications I’ve been blogging about (hopefully with enough introductory material that it’ll be comprehensible even if you don’t know anything about latent data models, multilevel models, or Bayesian inference): [223001120050] |
  • Carpenter, Bob. [223001120060] |2008. Multilevel Bayesian Models of Categorical Data Annotation. [223001120070] |Technical Report. [223001120080] |Alias-i.
  [223001120090] |It’s 52 pages, but it’s mostly figures. [223001120100] |As usual, any feedback here or to my e-mail (carp@alias-i.com) would be greatly appreciated. [223001120110] |Here’s the abstract: [223001120120] |This paper demonstrates the utility of multilevel Bayesian models of data annotation for classifiers (also known as coding or rating). [223001120130] |The observable data is the set of categorizations of items by annotators (also known as raters or coders) from which data may be missing at random or may be replicated (that is, it handles fixed panel and varying panel designs). [223001120140] |Estimated model parameters include the prevalence of category 1 outcomes, the “true” category of each item (the latent class), the accuracy in terms of sensitivity and specificity of each annotator (latent annotator traits), and the difficulty of each item (latent item traits). [223001120150] |The multilevel parameters represent the average behavior and variances among the annotators and the items. [223001120160] |We perform inference with Gibbs sampling, which approximates the full posterior distribution of parameters as a set of samples. [223001120170] |Samples from the posterior category distribution may be used for probabilistic supervision and evaluation of classifiers, as well as in gold-standard adjudication and active learning. [223001120180] |We evaluate our approach with simulated data and two real data sets, including data for which a prior “gold standard” exists. [223001120190] |All the code (R, BUGS, with some Java for munging), all the data, and the source for the paper are available from the LingPipe sandbox in the project hierAnno; see the sandbox page for information on checking it out, or just use this command (you’ll need to have CVS installed): [223001120200] |Next up, I’m hoping to collect some named entity data in time to write this all up for a NAACL/HLT 2009 submission, so I’m especially keen to get feedback before then. [223001130010] |Lucene or a Database? Yes! [223001130020] |This question comes up pretty often on the Lucene mailing lists. [223001130030] |Assuming that’s an inclusive OR in the question, we say yes! to both. [223001130040] |In many situations this is not just a clever answer, it’s the correct one —while the two systems can both be used as a data store, there are some things that a database can do better than Lucene, and some things that Lucene can do that a database can’t do at all. [223001130050] |Lucene provides full-text search over a collection of data that returns results in order of relevancy. [223001130060] |If search over (lots of) text data is a non-negotiable requirement, then a date with Lucene (or one of its cousins) is in your future, because this is something that most database systems do badly, if at all. [223001130070] |Here’s a nice critique of this feature in MySQL (an otherwise fine RDBMS). [223001130080] |Lucene provides search and indexing over Document objects. [223001130090] |A Document is a set of Fields. [223001130100] |Indexing a Field processes its contents into one or more Term objects which contain the Field name and the Term string value. [223001130110] |Lucene maintains an inverted index which maps terms into documents. [223001130120] |Queries against the index are scored by TF-IDF, a measure of document relevance. [223001130130] |A Lucene Analyzer defines how index terms are extracted from text.
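A small, self-contained illustration of that life cycle against the Lucene 2.4-era API: a Document is built out of Fields, an Analyzer (here the stock StandardAnalyzer) turns the field text into Terms in the inverted index, and a search comes back as documents paired with TF-IDF-based relevance scores. The field names and text are made up for the example.

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class LuceneHelloWorld {
        public static void main(String[] args) throws IOException {
            RAMDirectory indexDir = new RAMDirectory();

            // index one document with two analyzed fields
            IndexWriter writer
                = new IndexWriter(indexDir, new StandardAnalyzer(), true,
                                  IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            doc.add(new Field("title", "Lucene or a Database? Yes!",
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body",
                              "Lucene maintains an inverted index mapping terms to documents.",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // search the title field; results are (document, score) pairs
            IndexSearcher searcher = new IndexSearcher(indexDir);
            TopDocs hits = searcher.search(new TermQuery(new Term("title", "lucene")),
                                           null, 10);
            for (ScoreDoc hit : hits.scoreDocs)
                System.out.println(searcher.doc(hit.doc).get("title")
                                   + " : " + hit.score);
            searcher.close();
        }
    }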
[223001130140] |The developer can define their own Analyzer, or use one of the pre-existing analyzers in the Lucene API, which can do tokenization for many different languages, as well as stop-listing of common words, case normalization, stemming, and more. [223001130150] |For a nice example of a custom analyzer, see Bob’s chapter on “Orthographic variation with Lucene” in the Lucene in Action book. [223001130160] |Relational databases store data as rows and columns in tables. [223001130170] |A table declaration specifies the columns, the type of data stored in each column, and constraints over columns. [223001130180] |The DBMS enforces these constraints as rows are added to the tables. [223001130190] |In Lucene there is no notion of document type. [223001130200] |A document is a set of fields, and processing a document consists of processing those fields into fielded terms that are added to the index. [223001130210] |Lucene was designed for rapid indexing of documents. [223001130220] |Checking document completeness is the responsibility of the calling application; ditto enforcing cross-document constraints. [223001130230] |In the latter case, trying to enforce a uniqueness constraint on a term in the index is likely to impair indexing speed. [223001130240] |Searches against the two differ in both the search operations and the search results. [223001130250] |Database queries are specified in SQL, which allows for composition of data from different tables via JOIN, complex boolean selectional restrictions, and ordering of results. [223001130260] |Search over a database returns a ResultSet object containing a table which is a new view on the data resulting from executing the search query, that is, it may contain data stored in several different tables underlyingly. [223001130270] |Lucene search is over the term index, and the results returned are a set of pairs of documents and scores. [223001130280] |Lacking explicit document types, a query is over some number of fields. [223001130290] |Lucene has its own query language, and the package org.apache.lucene.search provides classes to do this programmatically. [223001130300] |Lucene uses the score to limit the number of documents returned. [223001130310] |Overriding this behavior requires getting all matching documents for a query, which can be very expensive; queries designed to find things like dates or numbers within some range can also be inefficient. [223001130320] |So if you want to provide relevance-based text search over some heterogeneous data store that has both a transactional component, such as an accounting system, and large amounts of stored text, you’re going to have to use both Lucene and a database, and the questions to ask are:
  • What is the mapping between the database tables and views and Lucene documents? [223001130340] |
  • What text, if any, needs to be stored in the index? [223001130350] |
  • How do I keep my Lucene index in sync with the database? [223001130360] |The Scaling Web blog has a nice case study. [223001130370] |They have some good tips about how to conflate information from multiple table columns into good fielded documents, and how best to use IndexReader and IndexWriter objects in order to keep the index up-to-date with the database. [223001130380] |Of course there are situations where just Lucene or just a database is a sufficient data store. [223001130390] |In the LingMed sandbox project we use Lucene to store a local version of the MEDLINE citation index. [223001130400] |The challenge there is making sure that updates to the index are visible to all search clients, and this was covered in an earlier blog post: “Updating and Deleting Documents in Lucene”. [223001130410] |Each MEDLINE citation has a unique PubMed ID, and treating this as an index field allows for rapid search by PubMed ID. [223001130420] |The citation title and abstract are also indexed into their own fields, and we can use Lucene to find out how many different documents contain a word in the article title or abstract –that is, we can quickly and easily calculate the document frequency for an item. [223001130430] |LingMed uses a LingPipe Chunker and a LanguageModel to find all mentions of genes in MEDLINE citations and assign a score to each. [223001130440] |A MySQL database is the appropriate data store because we need to keep track of a many-to-many relationship between genes and articles (each gene mention); as well as information about the article itself. [223001130450] |We don’t need sophisticated text search over this data, nor do we want to rank our results by TF-IDF; we are using LingPipe to compute our own scores, and use SQL’s “ORDER BY” clause to retrieve the best-scoring items. [223001130460] |The correct answer to the question “Lucene or a Database?” always depends on the specifics of the situation. [223001130470] |Furthermore, as more functionality is added to Lucene (and family), the question is worth revisiting, and is revisited fairly frequently on the Lucene mailing lists. [223001130480] |These questions are usually accompanied by a specific use case, and the comments of the Lucene contributors provide good guidelines which should help you find the answer that is right for your application. [223001160010] |Known Unknowns in Discharge Summary Mining [223001160020] |In a previous post, I summarized the workshop for the i2b2 Obesity Challenge. [223001160030] |I’ve finally uploaded our presentation: [223001160040] |
  • Carpenter, Bob, Breck Baldwin, Carlos Cano, and Leon Peshkin. 2008. [223001160050] |Known Unknowns in Discharge Summary Mining. [223001160060] |Talk presented to the AMIA Second i2b2 Shared-Task and Workshop Challenges in Natural Language Processing for Clinical Data. [223001160070] |Obesity Challenge (A Shared-Task on Obesity): Who’s obese and what co-morbidities do they (definitely/likely) have? [223001160080] |Washington, D.C.
  [223001160090] |I tried to explain how we’d misunderstood the task definition. [223001160100] |Let’s consider just the textual task, and just the congestive-heart-failure co-morbidity. [223001160110] |The task was to classify patient discharge summaries (dozens to hundreds of semi-structured sentences about a patient and their history, meds, and course of treatment) into one of four categories: (YES) the text stated the person had CHF, (NO) the text stated the person did not have CHF, (QUESTIONABLE) the text stated that it was questionable whether a person had CHF, and (UNKNOWN) the text said nothing about the patient’s CHF status. [223001160120] |The intuitive form of the CHF task removed the UNKNOWN and asked the annotators to use their best judgement rather than requiring explicit statements in the text (annotators disagreed on how explicit things had to be to count as evidenced by the text). [223001160130] |The organizers argued that the key to the eval was getting the questionable categories right (I think that’s the “likely” in the title), because they were important. [223001160140] |I somehow think that if patients with questionable disease statuses were so important, something would go in their chart to that effect. [223001160150] |As is, there were 39 questionable cases (across all 16 co-morbidities) in 11,630 cases. [223001160160] |Clearly the clinicians didn’t feel questionable conclusions (or for that matter, negative conclusions, of which there were only 87 instances across all 16 co-morbidities) were worth recording. [223001160170] |The point of my talk (after describing how we used regularized sparse logistic regression with some clever feature extraction for drug names and negation by Carlos and Leon) was more philosophical (and doesn’t necessarily represent the views of my co-authors, who didn’t have a chance to review my “late-breaking” presentation). [223001160180] |To my way of thinking, the task is really a quantization of four possible states of posterior information. [223001160190] |But first, let’s look at the prior, which summarizes our uncertainty about the population distribution p(CHF) for congestive heart failure (I’m just making these distributions up from simple betas —they’re not intended to be realistically shaped). [223001160200] |Population distribution of congestive heart failure (hypothetical). [223001160210] |What I’d like to do is gather some data, like a discharge summary Text, and infer a posterior p(CHF|Text). [223001160220] |For instance, the following posteriors are likely candidates to describe as NO and YES: [223001160230] |Posterior distribution for a NO. [223001160240] |Posterior corresponding to YES. [223001160250] |whereas the following would constitute an unknown case: [223001160260] |Posterior corresponding to UNKNOWN. [223001160270] |Note that it is the same as the prior. [223001160280] |Note that the posterior p(CHF|Text) in the UNKNOWN case is the same as the prior p(CHF), indicating the text didn’t give us any information about CHF. [223001160290] |On the other hand, for a QUESTIONABLE case the posterior p(CHF|Text) is both broad and centered near 0.5, indicating a high degree of uncertainty about whether the patient has CHF given the text. [223001160300] |Posterior corresponding to QUESTIONABLE; note the shift and degree of uncertainty.
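For concreteness, here are the four shapes written down as beta distributions; the (alpha, beta) values are invented for illustration, just as the plots were, but they show the pattern: YES piles mass near 1, NO near 0, UNKNOWN just reproduces the prior, and QUESTIONABLE is broad and centered near 0.5.

    // made-up beta posteriors matching the four cases sketched above
    public class ChfPosteriorShapes {

        static void report(String label, double alpha, double beta) {
            double mean = alpha / (alpha + beta);
            double variance = alpha * beta
                / ((alpha + beta) * (alpha + beta) * (alpha + beta + 1.0));
            System.out.printf("%-13s mean=%.3f  sd=%.3f%n",
                              label, mean, Math.sqrt(variance));
        }

        public static void main(String[] args) {
            report("prior", 2.0, 8.0);          // population prevalence, p(CHF)
            report("YES", 40.0, 2.0);           // mass piled up near 1
            report("NO", 2.0, 40.0);            // mass piled up near 0
            report("UNKNOWN", 2.0, 8.0);        // identical to the prior
            report("QUESTIONABLE", 4.0, 4.0);   // broad and centered near 0.5
        }
    }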
[223001160310] |While this view of information is quite pleasing theoretically, and seems promising for practical applications in that it’s easy to combine with other information, it remains a difficult paradigm to evaluate in a bakeoff situation. [223001170010] |Bayesian Posterior Entropy as a Measure of Uncertainty [223001170020] |Bob and Mitzi's chalkboard [223001170030] |Mark Johnson’s comment in my last post about a Bayesian proposal for identifying known unknowns sent Mitzi and me to the chalkboard after dinner (yes, we’re nerdy enough to have a chalkboard in our breakfast nook) to see if at least estimating normal means works the same way. [223001170040] |Yes, it does. [223001170050] |Posterior variance is always lower than prior variance about the mean parameter. [223001170060] |Of course it is, since the precisions (inverse variances) are non-negative and additive. [223001170070] |I’ve read all of this many times, but it takes a real example to make things stick. [223001170080] |While writing various beta distributions on the board, it dawned on me that uncertainty in our predictive inferences, not uncertainty in our parameters, is what matters. [223001170090] |For that, entropy is a better measure. [223001170100] |For a Bernoulli distribution with parameter θ, the entropy is defined by H(θ) = - θ log θ - (1-θ) log (1-θ). [223001170110] |Suppose we have a posterior distribution p(θ|x) over the Bernoulli parameter θ, where θ represents the chance that someone has congestive heart failure given that they have report x. [223001170120] |To measure uncertainty, look at the Bayesian estimate of the posterior mean of H(θ), which is just the weighted average of the entropy over the posterior distribution of θ, namely ∫ H(θ) p(θ|x) dθ. [223001170130] |This now has the right properties. [223001170140] |The highest possible posterior entropy here is with all probability mass centered at 0.5, leading to an entropy of 1. [223001170150] |Even if the posterior for θ is tighter, if it’s shifted more centrally, it’ll result in increased entropy. [223001170160] |The lowest possible entropy will result with all of the probability mass centered at either 0 or 1, leading (in the limit of a delta function [cool animation on Wikipedia]) to an entropy of 0. [223001170170] |PS: Sitmo.com is a very sweet little app for adding LaTeX-derived images to blogs, which is especially easy with their permanent links (I don’t quite trust that) and WordPress’s image insertion button. [223001170180] |It reminds me of a Mac app I used in the 90s that tried to render LaTeX WYSIWYG. [223001170190] |But it’s still a major pain compared to the joys of directly using LaTeX or even of just rendering something that looks ugly in HTML. [223001180010] |Semi-Supervised Classification vs. the LingPipe API [223001180020] |Semi-supervised learning involves training a classifier, tagger or some other model using a combination of labeled and unlabeled data. [223001180030] |The challenge is to get some mileage out of the unlabeled data over a fully supervised baseline using only the labeled data. [223001180040] |This is particularly helpful in an application context where you need to build classifiers on the fly with only a few handfuls of training examples.

    Literature Survey

    [223001180060] |I’d highly recommend this very nicely written paper on using expectation maximization (EM) for semi-supervised learning: [223001180070] |
  • Kamal Nigam, Andrew McCallum, and Tom M. Mitchell. 2006. [223001180080] |Semi-Supervised Text Classification Using EM. [223001180090] |In O. Chapelle, A. Zien, and B. Scholkopf (Eds.) Semi-Supervised Learning. [223001180100] |MIT Press.
  [223001180110] |They evaluate a semi-supervised naive Bayes classifier over the 20 Newsgroups data. [223001180120] |This one figure and caption pretty much sums up the results: [223001180130] |As usual, they found EM-based training most useful when the labeled data is small.

    The Algorithm

    [223001180150] |The basic EM algorithm for semi-supervised classification is quite simple: [223001180160] |First train on classified data, then look at each unlabeled item, compute the conditional probability of the various categories (these are expected category counts, hence the name “E step”), then train using these probabilities as weights (assumes training is a maximization or “M” step). [223001180170] |Sometimes only sufficient statistics are collected inside the inner for-loop with the M step outside. [223001180180] |Convergence is usually measured by some error metric not improving, typically negative log likelihood. [223001180190] |With only conditional models, this just has to be the log likelihood of the categories given the input. [223001180200] |With a joint model, it’s typically joint log likelihoods. [223001180210] |The error can be calculated over the supervised data, the supervised data and unsupervised data, or over some held out evaluation set. [223001180220] |
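A minimal sketch of that loop, with the E step writing expected category counts as weights and the M step retraining on them. The classifier, trainer, and weight representation are stand-ins for illustration, not LingPipe interfaces, and convergence is checked on absolute rather than relative log likelihood improvement for brevity.

    import java.util.List;

    // stand-in for a classifier returning p(category | item)
    interface CondClassifier<E> {
        double[] conditionalProbs(E item);
    }

    // stand-in for weighted training: labeled items get weight 1.0 on their
    // category and 0.0 elsewhere; unlabeled items get the E-step expectations
    interface WeightedTrainer<E> {
        CondClassifier<E> train(List<E> items, List<double[]> categoryWeights);
        double logLikelihood(CondClassifier<E> c,
                             List<E> items, List<double[]> categoryWeights);
    }

    class SemiSupervisedEm {

        // items holds the labeled items first, then numUnlabeled unlabeled items
        static <E> CondClassifier<E> train(WeightedTrainer<E> trainer,
                                           List<E> items,
                                           List<double[]> weights,
                                           int numUnlabeled,
                                           int maxEpochs,
                                           double minImprovement) {
            CondClassifier<E> classifier = trainer.train(items, weights); // supervised start
            double lastLogLikelihood = Double.NEGATIVE_INFINITY;
            for (int epoch = 0; epoch < maxEpochs; ++epoch) {
                // E step: expected category counts for the unlabeled items
                for (int i = items.size() - numUnlabeled; i < items.size(); ++i)
                    weights.set(i, classifier.conditionalProbs(items.get(i)));
                // M step: retrain on labeled data plus expected counts
                classifier = trainer.train(items, weights);
                // convergence check on log likelihood improvement
                double logLikelihood = trainer.logLikelihood(classifier, items, weights);
                if (logLikelihood - lastLogLikelihood < minImprovement)
                    break;
                lastLogLikelihood = logLikelihood;
            }
            return classifier;
        }
    }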

    API Design Issues

    [223001180230] |I believe the term of art for this kind of vexing design problem is “Ay, Carumba!“ [223001180240] |The algorithm’s fully general and works on any classifier that computes conditional probabilities, which in LingPipe, means an instance of Classifier. [223001180250] |It requires the classifier be trainable with weights, but this can be finessed and quantized at the same time by multiplying the E values by a number, say 100, rounding, and training as if there were that many instances (making sure to adjust smoothing parameters to deal with the multiplicative effect). [223001180260] |In LingPipe, that means the Bernoulli, naive Bayes, character and token-based language-model classifiers, K-nearest-neighbors, and logistic regression. [223001180270] |But we don’t need a classifier, we need a classifier factory, because we create new classifiers within the algorithm, so now we’re dealing with type Factory>. [223001180280] |But wait, that’s not all. [223001180290] |We also need that classifier to be trainable, which we can approximate by requiring it to implement ClassifierHandler. [223001180300] |So that means what we really need to pull the classifier out is: [223001180310] |To truly encapsulate the entire EM algorithm, we need to take the data as input. [223001180320] |The supervised data is going to need to be an instance of Corpus> if we’re going to use the quantization fudge for generality, or Corpus> if we aren’t. [223001180330] |The unsupervised data is much simpler, being an instance of Corpus>. [223001180340] |Next, because we have the whole convergence loop inside, we need to set up a maximum number of epochs and a minimum amount of relative improvement in log likelihood to consider the result converged. [223001180350] |As in our other algorithms, we’ll need to monitor feedback, so that leaves us with something like: [223001180360] |Now let’s hope that everything here implements the same generic type E or we’ll need to complicate to (? super E) or (? extends E) depending on the polarity of the context. [223001180370] |And keep in mind we’re assuming some form of general convergence measurement; if that needs to be configurable we’re in for an even more complex method. [223001180380] |So far, we haven’t included any kind of annealing schedule, which Nigam et al. found to be useful in encouraging EM to converge reliably without getting stuck in local optima. [223001180390] |Our logistic regression API already has a flexible annealing schedule abstract base class, so we could always include an implementation as a further argument. [223001180400] |I’m thinking it’d be easier, and it’ll certainly be more flexible, to just write the EM algorithm and monitor it on the outside rather than writing all the implementations required for the EM-training method. [223001180410] |I’m already planning a simplified API for our existing implementations of iterative methods: regularized logistic regression, latent Dirichlet allocation (LDA), and singular value decomposition (SVD). [223001190010] |LingPipe 3.7.0 Released [223001190020] |LingPipe 3.7.0 is now available from: [223001190030] |
  • LingPipe Home Page [223001190040] |The only significant change is an update to the MEDLINE DTDs used by the MEDLINE parser to their 2009 versions, with corresponding changes to the tutorials. [223001200010] |Gibbs Sampling for Semi-Supervised Learning [223001200020] |I promised to use “estimation” instead of “learning”, but I’m torn when it comes to standard terminology like semi-supervised learning, of which statistical estimation is only one approach. [223001200030] |My last post on semi-supervised learning considered the expectation maximization (EM) algorithm and how it could be integrated into LingPipe. [223001200040] |In this post, I’m going to do the same for Gibbs sampling. [223001200050] |The algorithm is just as simple as EM, if not simpler; there’s a sketch below. [223001200060] |The algorithm has the same basic structure as EM: start by training a model on the labeled data, then assign labels to the unlabeled data points at random and train on them too. [223001200070] |Next, iterate over all of the training data several times. [223001200080] |While looping over the training data, the Gibbs sampler visits each item once. [223001200090] |It first decrements the count from the previous epoch (not necessary in the first epoch, of course), then samples a label from the current model, then increments the current model’s data counts with the sampled value. [223001200100] |Unlike EM, Gibbs sampling doesn’t create a new model in each epoch. [223001200110] |If we can compute the expectation (E) step of the EM algorithm, we can trivially sample by ordering the items, generating a random number in [0,1] and choosing the corresponding label based on cumulative probability [in fact, this is a clever way to do generation given an arithmetic coding]. [223001200120] |The extra bit that Gibbs sampling needs is the ability to decrement counts in models. [223001200130] |Sometimes Gibbs sampling is run in batches where all of the labels are sampled given the previous model and re-incremented; in this setup, you can use the same sort of model generation as in EM to avoid having to decrement, and can have the same freedom to define complex maximization (M) steps. [223001200140] |If the estimators have simple sufficient statistics, as do most of the exponential family, like multinomials with conjugate Dirichlet priors or Gaussians, the models can be fairly easily updated online for each item sampled. [223001200150] |The result of Gibbs sampling is a (Markov) chain of samples for every parameter in θ, including the missing data labels (the missing data labels by themselves constitute a multiple imputation of the missing data). [223001200160] |In practice, multiple Gibbs samplers are run, and their resulting sequences of estimates are checked for convergence. [223001200170] |Typically the earlier epochs, before multiple chains begin to mix well, are discarded. [223001200180] |EM is deterministic, but requires several starting points to be evaluated to ensure convergence to the global maximum (unless the loss is convex, in which case only one run is necessary). [223001200190] |In either case, we want to take training cases (both supervised and unsupervised) X and draw inferences about new data Y, which we write as p(Y|X). [223001200200] |In both cases, we have the same parametric likelihood function p(Y|θ). [223001200210] |The EM algorithm produces a maximum likelihood estimate θ* of the model parameters (or a maximum a posteriori estimate, if we have a non-uniform prior p(θ)).
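Here is the sketch promised above: a minimal version of the sampling loop, assuming a model whose sufficient statistics can be incremented and decremented one item at a time (e.g., naive Bayes with Dirichlet priors), with the labeled data already folded into the model's counts. The IncrementalModel interface is a stand-in for illustration, not LingPipe API.

    import java.util.List;
    import java.util.Random;

    // stand-in for a model with decrementable counts; the labeled training
    // data is assumed to have been counted into the model already
    interface IncrementalModel<E> {
        int numCategories();
        void increment(E item, int label);
        void decrement(E item, int label);
        double[] conditionalProbs(E item);   // p(label | item) under current counts
    }

    class GibbsSemiSupervised {

        // returns one sampled labeling of the unlabeled items per epoch
        static <E> int[][] sample(IncrementalModel<E> model, List<E> unlabeled,
                                  int numEpochs, Random random) {
            int[] labels = new int[unlabeled.size()];
            int[][] samples = new int[numEpochs][];
            // initialize: label the unlabeled items at random and add their counts
            for (int i = 0; i < unlabeled.size(); ++i) {
                labels[i] = random.nextInt(model.numCategories());
                model.increment(unlabeled.get(i), labels[i]);
            }
            for (int epoch = 0; epoch < numEpochs; ++epoch) {
                for (int i = 0; i < unlabeled.size(); ++i) {
                    E item = unlabeled.get(i);
                    model.decrement(item, labels[i]);          // remove last epoch's count
                    labels[i] = sample(model.conditionalProbs(item), random);
                    model.increment(item, labels[i]);          // add the new count
                }
                samples[epoch] = labels.clone();               // one sample per epoch
            }
            return samples;
        }

        // inverse-CDF sampling from a discrete distribution
        static int sample(double[] probs, Random random) {
            double u = random.nextDouble();
            double cumulative = 0.0;
            for (int k = 0; k < probs.length; ++k) {
                cumulative += probs[k];
                if (u < cumulative)
                    return k;
            }
            return probs.length - 1;
        }
    }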
[223001200220] |The EM point estimate θ* is then used to reason about future cases, approximating p(Y|X) by p(Y|θ*). [223001200230] |The full definition of p(Y|X) requires the evaluation of an integral: ∫Θ p(Y|θ) p(θ|X) dθ. [223001200240] |This integral averages the predictions p(Y|θ) weighted by the probability p(θ|X) of the parameters. [223001200250] |In Gibbs sampling, we get a sequence of samples of parameter values θ(1),...,θ(N). [223001200260] |We use these to approximate the integral defining p(Y|X) as Σn p(Y|θ(n))/N. [223001200270] |The motivation for this more involved reasoning is that it incorporates all of our uncertainty about the parameter values θ, whereas EM places all of the weight on the maximum likelihood value. [223001200280] |In some cases, the maximum likelihood parameters give a good approximation of p(Y|X) and in other cases they don’t. Gibbs sampling will approximate p(Y|X) to arbitrary precision given enough samples. [223001200290] |In practice, 100 to 1000 are often used. [223001200300] |The question is, will it matter in practice? [223001210010] |Thomas Bayes vs. Sherlock Holmes: The Case of Who’s Laughing Now? [223001210020] |I’ve been thinking about writing an introductory book on linear classifiers. [223001210030] |All the math bits are easy, but how do I introduce naive Bayes with a simple example? [223001210040] |(Suggestions appreciated!) [223001210050] |While I was at home over the holidays, my parents (mom’s a huge British mystery fan) rented Dr. Bell and Mr. Doyle: The Dark Beginnings of Sherlock Holmes (it’s pretty good if you like that sort of thing). [223001210060] |That got me thinking about my favorite fictional detective, Encyclopedia Brown. [223001210070] |Combined with my love of Martin Gardner’s mathematical games column in Scientific American, I thought a little puzzle might be in order. [223001210080] |Here’s my first attempt —it could use some help in the story part. [223001210090] |Or is this just too undignified for a textbook? [223001210100] |

    The Case of Who’s Laughing Now?

    [223001210110] |Mr. and Mrs. Green had very different senses of humor and somewhat distinctive laughs. [223001210120] |Only one of them ever laughs at a time. [223001210130] |But they both laugh by saying “hee” or “haw”, sometimes using a mix of the two sounds in succession, such as “hee hee haw hee haw”. [223001210140] |Over time, Sherlock has observed that when Mr. Green laughs, 20% of the utterances are “hee” and 80% are “haw”; Mrs. Green is more ladylike, with 60% “hee” and only 40% “haw”. [223001210150] |One day, Sherlock was walking by the Green house, and heard the laugh “hee haw haw” from a window. [223001210160] |He had no knowledge of whether Mr. or Mrs. Green was more likely to be laughing, but knew it had to be one of them. [223001210170] |What odds should Sherlock post to create a fair bet that the laugh was Mr. Green’s? [223001210180] |

    Hint

    [223001210190] |I did mention naive Bayes, didn’t I? [223001210200] |

    The Answer

    [223001210210] |Coming soon… [223001220010] |“With Bayes’s Rule, it’s Elementary”, says Sherlock [223001220020] |In my last post, Thomas Bayes vs. Sherlock Holmes: The Case of Who’s Laughing Now?, I asked the following riddle: [223001220030] |Mr. and Mrs. Green had very different senses of humor and somewhat distinctive laughs. [223001220040] |Only one of them ever laughs at a time. [223001220050] |But they both laugh by saying “hee” or “haw”, sometimes using a mix of the two sounds in succession, such as “hee hee haw hee haw”. [223001220060] |Over time, Sherlock has observed that when Mr. Green laughs, 20% of the utterances are “hee” and 80% are “haw”; Mrs. Green is more ladylike, with 60% “hee” and only 40% “haw”. [223001220070] |One day, Sherlock was walking by the Green house, and heard the laugh “hee haw haw” from a window. [223001220080] |He had no knowledge of whether Mr. or Mrs. Green was more likely to be laughing, but knew it had to be one of them. [223001220090] |What odds should Sherlock post to create a fair bet that the laugh was Mr. Green’s? [223001220100] |Rich W. calculated the answer in a response to the last post, but made a calculation mistake which I didn’t catch; thanks to Andraz Tori for the correction in the comment; I just updated the values here so they’re right: [223001220110] |Let me show the rest of the work. [223001220120] |What we want to calculate is the odds, which are defined by p(Mrs|hee haw haw) / p(Mr|hee haw haw). [223001220130] |As we’ll see below, the result is 3/4, for odds of 3:4 for Mrs. Green. [223001220140] |Here we already see the potential for confusion. [223001220150] |As Rich W. noted, if we were gambling, the fair payoff (meaning an expected winnings of 0 for both sides) would be the inverse of the odds. [223001220160] |So if the odds are 3:4 for Mrs. Green, the fair payoff will be $4 for every $3 bet on Mrs. Green. [223001220170] |So now we only need to calculate p(Mrs|hee haw haw) and similarly for Mr. Green. Here’s where Bayes’s rule, p(A|B) = p(B|A) p(A) / p(B), [223001220180] |comes into play. [223001220190] |It lets us define the posterior probability p(A|B) as the likelihood p(B|A) times the prior p(A) divided by the marginal p(B). [223001220200] |In our case, we have p(Mrs|hee haw haw) = p(hee haw haw|Mrs) p(Mrs) / p(hee haw haw), and likewise for Mr. Green. [223001220210] |Now let’s plug that into the odds formula: p(Mrs|hee haw haw) / p(Mr|hee haw haw) = (p(hee haw haw|Mrs) p(Mrs)) / (p(hee haw haw|Mr) p(Mr)), with the marginal p(hee haw haw) canceling. [223001220220] |In Bayesian parlance, this derivation is of equality of the posterior odds (p(Mrs|hee haw haw)/p(Mr|hee haw haw)) with the prior odds (p(Mrs)/p(Mr)) times the likelihood ratio (p(hee haw haw|Mrs)/p(hee haw haw|Mr)). [223001220230] |When the puzzle said there’s no reason ahead of time to assume it was either Mr. or Mrs. Green who was laughing, I meant to imply equal priors, so that p(Mr) = p(Mrs) = 0.5. [223001220240] |That makes the prior odds 1, so they drop out: p(Mrs|hee haw haw) / p(Mr|hee haw haw) = p(hee haw haw|Mrs) / p(hee haw haw|Mr) = (0.6 * 0.4 * 0.4) / (0.2 * 0.8 * 0.8) = 0.096 / 0.128 = 3/4. [223001220250] |So far, we’ve only stipulated the likelihood function. [223001220260] |In our next installment, Sherlock’s going to have to do some estimation. [223001220270] |OK —I don’t think I’ve gotten the hang of this yet. [223001220280] |This should all be told with a story setup (why does Sherlock want to know who’s laughing) and from Sherlock’s point of view. [223001220290] |He needs to round up the subjects and then lecture them. [223001230010] |Naive Bayes, Binomials, and Bags of Words [223001230020] |Sherlock’s analysis in the last blog post, “With Bayes’s Rule, it’s Elementary”, says Sherlock, explicitly said that when Mr. Green laughs, 80% of the words are “haw” and 20% are “hee”. [223001230030] |This can’t mean every laugh utterance has 80% “haw”s.
[223001230040] |With only three words it’s impossible to have an 80/20 balance. [223001230050] |It also doesn’t mean that if he’s uttered 4 “haw”s in a row, the next word has to be a “hee”. [223001230060] |Instead, the idea is that when Mr. Green laughs, he picks the next word to utter with an 80% probability of “haw” and a 20% probability of “hee”, no matter what he’s uttered already. [223001230070] |Sherlock makes a strong independence assumption according to which each word in an utterance is generated independently. [223001230080] |This allows us to factor p(hee haw haw|X) = p(hee|X) * p(haw|X) * p(haw|X). [223001230090] |With this representation, it’s clear that p(hee haw haw|X) = p(haw hee haw|X) = p(haw haw hee|X). [223001230100] |Order doesn’t matter, only the counts, which in this case are 2 “haw”s and 1 “hee”. [223001230110] |Real language is more subtle. [223001230120] |If I say “car” in a conversation, I’m more likely to say “driver” or “gas” than if I hadn’t said “car”. [223001230130] |Also, position matters: “Mrs. Green likes Mr. Green” is a different statement than “Mr. Green likes Mrs. Green”, not to mention the word salad “Green Mrs. Mr. likes Green”. [223001230140] |That’s why they call strong independence assumptions “naive” —they’re usually not met by reality. [223001230150] |With naive Bayes, the probability of an utterance depends only on the count, not the order, of words. [223001230160] |In this sense, naive Bayes uses what is known as the “bag of words” representation. [223001230170] |A bag of words contains a (non-negative, integer) count for each word, but does not represent order. [223001230180] |We can calculate the likelihood that Mr. Green utters 0, 1, 2, or 3 “hee”s when uttering a 3-word laugh. [223001230190] |You just add up the probabilities of all the different permutations. [223001230200] |That is: p(1 “hee” in 3 words|Mr) = p(hee haw haw|Mr) + p(haw hee haw|Mr) + p(haw haw hee|Mr) = 3 * .2 * .8 * .8. [223001230210] |The same logic applies to Mrs. Green, who utters 60% “hee”s and 40% “haw”s. [223001230220] |We summarize the probabilities of the various counts in the following table: for Mr. Green, p(0 “hee”s) = .512, p(1 “hee”) = .384, p(2 “hee”s) = .096, and p(3 “hee”s) = .008; for Mrs. Green, p(0 “hee”s) = .064, p(1 “hee”) = .288, p(2 “hee”s) = .432, and p(3 “hee”s) = .216. [223001230230] |The binomial coefficient (M choose N) is a mathematical expression standing for how many different ways there are to order M objects consisting of N objects of one type (e.g. “hee”) and (M-N) objects of the other type (e.g. “haw”). [223001230240] |The general formula is (M choose N) = M! / (N! * (M-N)!). [223001230250] |Recall that the factorial is defined by M! = M * (M-1) * (M-2) * ... * 2 * 1. [223001230260] |Working this out for our case of a three-word laugh, (3 choose 0) = (3 choose 3) = 1, whereas (3 choose 1) = (3 choose 2) = 3. [223001230270] |Now what’s the probability that Mr. Green utters 1 “hee” in a 3-word laugh? [223001230280] |We just multiply the probability of any one of the instances by the number of variants. [223001230290] |That is, p(hee haw haw|Mr) = .2 * .8 * .8, times (3 choose 1) = 3 variants, for a total of 3 * .2 * .8 * .8 = .384. [223001230300] |That’s what’s shown in the table. [223001230310] |In general, we have the following formula: p(N “hee”s|M words, X) = (M choose N) * p(hee|X)^N * (1 - p(hee|X))^(M-N). [223001230320] |This is known as the binomial distribution over the non-negative integers (though all values above M have probability zero). [223001230330] |It turns out we don’t even need to know the order of the “hee”s and “haw”s to estimate who’s laughing. [223001230340] |Going back to our example of “hee haw haw”, we see that it is 1 “hee” and 2 “haw”s. [223001230350] |We have p(1 “hee”|3 words, Mr) = 3 * .2 * .8 * .8 = .384 and p(1 “hee”|3 words, Mrs) = 3 * .6 * .4 * .4 = .288. [223001230360] |Applying Bayes’s rule as we did before, only now with counts rather than orderings, we have p(Mrs|1 “hee”, 3 words) / p(Mr|1 “hee”, 3 words) = p(1 “hee”|3 words, Mrs) / p(1 “hee”|3 words, Mr) = .288 / .384 = 3/4. [223001230370] |All of the binomial coefficients drop out of the equations, as there’s a (3 choose 1) in all of the p(1 “hee”|3 words, X) terms.
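Here's the arithmetic carried out both ways in a small program, to make the cancellation concrete. The per-word probabilities and equal priors come from the puzzle; everything else is straightforward bookkeeping.

    // Work out p(Mr | laugh) by naive Bayes over the sequence and by the
    // binomial over the counts; the (3 choose 1) factor cancels, so the two
    // posteriors come out identical.
    public class HeeHawPosterior {

        public static void main(String[] args) {
            double pHeeMr = 0.2, pHeeMrs = 0.6;     // per-word "hee" probabilities
            double priorMr = 0.5, priorMrs = 0.5;   // equal priors

            // naive Bayes: product over the observed sequence "hee haw haw"
            double seqMr  = pHeeMr  * (1 - pHeeMr)  * (1 - pHeeMr);   // 0.128
            double seqMrs = pHeeMrs * (1 - pHeeMrs) * (1 - pHeeMrs);  // 0.096

            // binomial: 1 "hee" out of 3 words, with (3 choose 1) = 3
            double binMr  = 3 * pHeeMr  * (1 - pHeeMr)  * (1 - pHeeMr);   // 0.384
            double binMrs = 3 * pHeeMrs * (1 - pHeeMrs) * (1 - pHeeMrs);  // 0.288

            System.out.println("p(Mr | hee haw haw)      = "
                               + posteriorMr(seqMr, seqMrs, priorMr, priorMrs));
            System.out.println("p(Mr | 1 hee in 3 words) = "
                               + posteriorMr(binMr, binMrs, priorMr, priorMrs));
            // both print 0.5714..., i.e. posterior odds of 4:3 for Mr. (3:4 for Mrs.)
        }

        static double posteriorMr(double likeMr, double likeMrs,
                                  double priorMr, double priorMrs) {
            return likeMr * priorMr / (likeMr * priorMr + likeMrs * priorMrs);
        }
    }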
[223001230380] |The bottom line is that the naive Bayes model, which handles order by ignoring it, and the binomial model, which explicitly models only the counts, lead to the same posterior probability estimate for who’s laughing. [223001230390] |Other probability models, such as logistic regression, can also use a bag of words representation; but logistic regression is not naive in that it doesn’t assume the words are independent. [223001240010] |Book Review: The Numerati, by Stephen Baker [223001240020] |I was very excited when this book came out, because I love (well, love-hate) pop science books. [223001240030] |Each chapter is a standalone case study similar to a New Yorker profile. [223001240040] |Each chapter covers a different application of data mining, ranging from supermarket organization to political marketing to finding terrorists to diagnosing disease. [223001240050] |Baker even has a blog on the same topic as the book. [223001240060] |I (and many readers of this blog) belong to the tribe Baker’s calling “The Numerati”. [223001240070] |There’s a well known problem with reading pop science in one’s own field. [223001240080] |The mistakes and inconsistencies are glaring, and somehow seem very annoying, like we’ve been let down by the press. [223001240090] |The use of “-ati” in the title evokes the mysticism of the Illuminati (which, according to Wikipedia, means “enlightened” in Latin, traditionally referred to a secret society of the 1700s, and now refers to a “purported conspirational organization which acts as a shadowy power behind the throne”.) [223001240100] |And yes, Baker mentions total information awareness in the section on finding terrorists, so I’m not thinking the connection’s too much of a stretch. [223001240110] |It’s really the mysticism that bugged me. [223001240120] |I don’t mind the flowery language. [223001240130] |Baker introduces Nicolas Nicolov (who many of you may know) thusly: [223001240140] |“I’ve got bigger fish to fry [than sarcasm detection],” says Nicolas Nicolov, Umbria’s chief scientist. [223001240150] |A Romanian-born computer scientist, Nicolov got his doctorate in Edinburgh before moving to America, first to IBM’s Watson Lab and then to Umbria [now part of J.D. Power]. [223001240160] |He has an angular face and dark deep-set eyes, and he sports thick black bangs —a bit like Jim Carrey in his early movies. [223001240170] |He works in a small, dark office down the hall from [Ted] Kremer’s [the chief technical officer's] sunny, expansive digs. [223001240180] |It feels like I’ve stepped into a cave. [223001240190] |Cool. [223001240200] |Baker’s writing about one of us, right down to the modest digs [OK, we have a sunny loft in Brooklyn]. [223001240210] |How will he describe the technology? [223001240220] |Always start with an example (my mom was a writing teacher). [223001240230] |Nicolas offers that when analyzing consumers’ feelings about a product, “big” is a bad thing for a notebook computer, but a good thing for a notebook computer’s hard drive. [223001240240] |Ouch. [223001240250] |What about the tech? [223001240260] |Imagine a vast multidimensional space, Nicolov instructs me. [223001240270] |Remember that each document Umbria studies has dozens of markers —the strange spellings, fonts, word choices, colors, and grammar that set it apart from others. [223001240280] |In this enormous space I’m supposed to imagine, each marker occupies its own patch of real estate.
[223001240290] |We can tell he’s talking about features/predictors, where each feature’s “patch of real estate” is actually just a dimension.
[223001240300] |You know, like the old X, Y, and Z of 3D real space.
[223001240310] |The X dimension doesn’t occupy its own patch of real estate.
[223001240320] |After some descriptions of dimensions possibly involving punctuation (you can tell Nicolas must’ve really been working to get this much across), we get:
[223001240330] |And each document —blog or splog —is given an assignment: it must produce a line —or vector —that intersects with each and every one of its own markers in the entire universe.
[223001240340] |It’s a little like those grade-school exercises where a child follows a series of numbers or letters with her pencil and ends up with a picture of a puppy or Christmas tree.
[223001240350] |Puppies?
[223001240360] |Christmas trees?
[223001240370] |What’s going on?
[223001240380] |Nicolov tries to draw a diagram on the whiteboard.
[223001240390] |But he gives up in short order.
[223001240400] |It’s impossible, because in a world of two dimensions, or even three, each of the vectors would have to squiggle madly and perform ridiculous U-turns to meet up with each of its markers.
[223001240410] |OK, clearly the author doesn’t understand vectors.
[223001240420] |We’re talking high school physics here, not algebraic geometry.
[223001240430] |After an anecdote about detecting spies by asking them about spitballs, we’re back to spam detection.
[223001240440] |What next?
[223001240450] |The splog neighborhood must be cordoned off, condemned.
[223001240460] |Imagine placing a big shield between the good [ham] and bad [spam] vectors.
[223001240470] |Speaking geometrically, the shield is a plane.
[223001240480] |The spam fighters maneuver it with a mouse, up and down, this way and that.
[223001240490] |The plane defines the border between the two worlds, and as the scientists position it, the machine churns through thousands of rules and statistics that divide legitimate blogs from spam.
[223001240500] |If you liked this, you’ll love the rest of the book.
[223001240510] |It’s all like this.
[223001240520] |Your reward for reading to the end of the review is a recommendation for a book I couldn’t put down, Cory Doctorow’s Little Brother, from which you’ll learn a whole lot more technical detail about data mining than from Baker’s book (including spam filtering, gait detection, histograms, etc.), all in the context of an action-packed work of young-adult fiction.
[223001240530] |Doctorow has his own blog.
[223001240540] |Thanks to Forbidden Planet for recommending this one!
[223001250010] |LingPipe is (X)Lint Free
[223001250020] |What a difference one line makes in the Java compiler.
[223001250030] |By setting the compiler flag -Xlint:all (in ant, via a compilerarg element nested in the javac task; a sketch follows below), you see all the compiler warnings that don’t cause outright errors.
[223001250040] |The notion of checking a program for “lint” derives from the Unix utility lint, which operated on C programs.
[223001250050] |With javac, lint checking includes deprecation (finds deprecated methods without @Deprecated annotations), unchecked (finds questionable uses of type casting and generics), fallthrough (finds fall-through cases in switch statements), path (don’t know what this does —didn’t have any of those errors), serial (finds serializable classes without a serial version ID), and finally (also a mystery).
[223001250060] |We had thousands of warnings.
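For concreteness, here’s roughly what turning this on looks like in an ant build file. The source and destination directories are placeholders, but the nested compilerarg element is the standard way to pass -Xlint:all through to javac:

    <javac srcdir="src" destdir="build/classes" debug="on">
        <compilerarg value="-Xlint:all"/>
    </javac>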
[223001250070] |Luckily, sometimes a fix takes care of dozens of warnings in one line.
[223001250080] |It’s taken me three days to fix them all, or else verify the code was right and suppress the warnings (some switch statements and some safe casts the compiler couldn’t verify).
[223001250090] |But it’s done.
[223001250100] |I was mainly trying to follow the master’s most excellent advice from the second edition of Effective Java.
[223001250110] |In fact, Josh Bloch (sorry, Joss Whedon) is my master now (and has been since the first edition; I’ll buy the t-shirt if someone makes one up).
[223001250120] |This is absolutely the best book for taking your Java code to the next level.
[223001250130] |(After you’ve digested it, check out Brian Goetz et al.’s Java Concurrency in Practice, of which Bloch’s a co-author, for some real mental gymnastics.)
[223001250140] |I had the opportunity to read (OK, made the mistake of reading) the second edition after I gave it to Mitzi for Christmas (no, I’m not that much of a Grinch —we both pick gifts we’ll both enjoy; she got me some killer Volnay and I also got her some Nortec beats).
[223001250150] |The lint check found lots of places where I was naughty and forgot to put in serial version IDs (needed so serialized objects stay compatible as classes evolve).
[223001250160] |Most of the warnings had to do with unchecked type conversions.
[223001250170] |There I was just lazy when I converted to generics for release 3.0.
[223001250180] |I generified everything I could find that was causing warnings, suppressed warnings for necessary casts and array allocations (the type system isn’t perfect), and am now much more confident that things are working right on the inside.
[223001250190] |What does this buy the LingPipe user?
[223001250200] |Well, you won’t get any warnings if you turn lint on and compile LingPipe now.
[223001250210] |You can bask in the glow of the warm fuzzy feeling I have for a compilation well done.
[223001250220] |I don’t think I changed a single bit of behavior with this exercise (except protecting against serialization compatibility errors when class signatures change).
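To give a flavor of the fixes, here’s a made-up example (not actual LingPipe code) showing the two most common repairs: an explicit serial version ID to satisfy the serial check, and a suppressed warning on a generic array allocation the compiler can’t verify on its own.

    import java.io.Serializable;

    // Made-up example, not actual LingPipe code: two typical lint fixes.
    public class FixedQueue<E> implements Serializable {

        // serial lint: an explicit version ID keeps serialized instances
        // compatible when the class changes in innocuous ways
        static final long serialVersionUID = 1L;

        private final E[] mElements;

        // unchecked lint: generic array allocation can't be checked by the
        // compiler, but it's safe here because the array never escapes
        @SuppressWarnings("unchecked")
        public FixedQueue(int capacity) {
            mElements = (E[]) new Object[capacity];
        }

        public E get(int i) {
            return mElements[i];
        }
    }

Bloch’s advice is to keep each @SuppressWarnings on the smallest possible scope and to leave a comment saying why the suppression is safe.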