[223000580010] |ACE ‘08 X-Doc Coref Bakeoff [223000580020] |The ACE ’08 site is up. [223000580030] |In addition to the schedule, you’ll want to check out the evaluation plan. [223000580040] |Anyone can play, but there’s a gag order on any result comparisons and the data’s strictly proprietary (from LDC). [223000580050] |The evaluation plan is particularly interesting as a case study in specifying an ontology and tagging standard for a complicated problem. [223000580060] |If you’ve never thought about something like this, I’d recommend it highly. [223000580070] |Don’t spend too much time worrying about their evaluation metrics. [223000580080] |One interesting note: they’ll be using Breck Baldwin’s and Amit Bagga’s B-Cubed measure for cross-document coreference scoring. [223000580090] |I still like relational scoring myself, especially as it allows for uncertainty on coreference to be “integrated out”. [223000580100] |But I’ve never been able to convince anyone else about its usefulness. [223000580110] |We (Alias-i) probably won’t have time to participate. [223000580120] |Our research energy’s going to be mostly directed toward our NIH grant, which itself has a cross-document coreference and entity extraction component, but one focused on high recall and database linkage rather than first-best clustering. [223000580130] |If I had my research druthers and decided to take on this problem, I’d focus on Dirichlet process clusterers. [223000580140] |In particular, I’d like to see a truly Bayesian version of something like Haghighi and Klein (2007) that used posterior sampling. [223000580150] |Even more fun would be to integrate (pun intended) with a Bayesian tagger, like the one described in Finkel et al. (2005). [223000580160] |In fact, it looks like Finkel et al. (2006) are already thinking along these lines in other arenas. [223000580170] |I’ve been fascinating with cascading processing, and in particular on-line disambiguation, ever since grad school, where I was encouraged in this pursuit by Mark Steedman (who knew computational linguists had Wikipedia entries?). [223000580180] |Online processing and disparate information integration for disambiguation was even the subect of my job talk at Carnegie Mellon way back in 1989. [223000580190] |It was what I was working on at the end of my time at Bell Labs, which spun out into my last ACL publication, Collins et al. (2004). [223000590010] |Positive, Negative and Neutral Sentiment [223000590020] |“Classical” sentiment analysis, as defined in Pang and Lee’s seminal experiments, classifies reviews into two categories: positive and negative. [223000590030] |Pang and Lee created training data from Rotten Tomatoes reviews, which are published with stars. [223000590040] |Their data consisted of two classes representing negative and positive reviews. [223000590050] |Neutral reviews (those getting 3/5 stars) were not included in the data set, and thus the resulting systems aren’t tested on their ability to reject neutral reviews as being neither positive nor negative. [223000590060] |Here’s what I’ve been suggesting: use two classifiers, one for positive/non-positive and one for negative/non-negative. [223000590070] |Then you get a 4-way classification into positive (+pos,-neg), negative (-pos,+neg), mixed (+pos,+neg) and neutral (-pos,-neg). [223000590080] |The problem here is that I need data to try it out. [223000590090] |This level of annotation can’t be extracted from review text plus star rating. 
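To make the two-classifier combination concrete, here is a minimal sketch of the idea; the BinaryClassifier interface and the class names are illustrative, not LingPipe API:

    // Sketch: combine a positive/non-positive classifier and a
    // negative/non-negative classifier into a 4-way sentiment decision.
    // Any binary classifier could stand in for this interface.
    interface BinaryClassifier {
        boolean accept(CharSequence text);
    }

    enum Sentiment { POSITIVE, NEGATIVE, MIXED, NEUTRAL }

    class PolaritySentimentClassifier {
        private final BinaryClassifier mPosVsNonPos;
        private final BinaryClassifier mNegVsNonNeg;

        PolaritySentimentClassifier(BinaryClassifier posVsNonPos,
                                    BinaryClassifier negVsNonNeg) {
            mPosVsNonPos = posVsNonPos;
            mNegVsNonNeg = negVsNonNeg;
        }

        Sentiment classify(CharSequence text) {
            boolean pos = mPosVsNonPos.accept(text);
            boolean neg = mNegVsNonNeg.accept(text);
            if (pos && neg) return Sentiment.MIXED;
            if (pos) return Sentiment.POSITIVE;
            if (neg) return Sentiment.NEGATIVE;
            return Sentiment.NEUTRAL;
        }
    }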
[223000590100] |I need to know sentence-by-sentence sentiment. [223000590110] |Someone just forwarded me a pointer to a 2005 IJCAI paper that actually takes neutral sentiments seriously: [223000590120] |M. Koppel and J. Schler (2005) Using Neutral Examples for Learning Polarity. [223000590130] |In IJCAI. [223000590140] |What Koppel and Schler built was a three-way classifier for positive, negative and neutral sentiment. [223000590150] |They do it by combining binary classifiers into n-way classifiers using a standard round-robin approach. [223000590160] |Specifically, they build three binary classifiers: positive/negative, positive/neutral, and negative/neutral. [223000590170] |Then they run all three and take a vote (with some tie-breaking scheme). [223000590180] |They use SVMs, but any binary classifier may be used. [223000590190] |This approach is generalizable. [223000590200] |You can slice and dice the categories differently. [223000590210] |What I suggested was actually combining two classifiers each of which would be trained on all the data, namely positive/neutral+negative and negative/neutral+positive. [223000590220] |You can go further and fully expand all the combinations, adding positive+negative/neutral to round out the set of six binary classification problems. [223000590230] |You can just add the larger categories into the vote. [223000590240] |On another note, I’d actually like to get the time to take my proposed approach and build it into a hierarchical model, like McDonald et al.’s hierarchical SVM-based approach. [223000590250] |I’d use something like latent Dirichlet analysis (coming soon to LingPipe) instead of SVMs, so I could predict posterior probabilities, but that’s a relatively trifling detail compared to the overall structure of the model. [223000590260] |It would actually be possible to partially supervise LDA, or the whole model could be induced as latent structure from the top-level review ratings. [223000590270] |Even more fun would ensue if we could use a kind of hierarchical LDA, with a level dedicated to overall sentiment and then another level per genre/domain (this’d be hierarchical on the word distributions, not the topic distributions as in standard hierarchical LDA). [223000600010] |Blog Moved to WordPress [223000600020] |We’re rearranging our IT infrastructure so we don’t have to manage our own servers for easily hostable activities like e-mail, blogging, version control and web hosting. [223000600030] |As part of that activity, we moved our blog here. [223000600040] |Not that it was painless. [223000600050] |Our blog was running an ancient version of WordPress (1.5.2), so I had to install a plugin to enable me to dump the old contents out in XML so I could read them into wordpress.com. [223000600060] |Alas, the first thing I tried didn’t even come close to installing, but then I tried Aaron Brazell WordPress XML Export plugin, and it almost worked. [223000600070] |It installed (gee, PHP is easy —just drop in and go), but then threw a SQL error. [223000600080] |Luckily, WordPress has an interface for editing the PHP, so I squinted and chopped, and it worked (just needed to remove a field from a query that didn’t exist in 1.5.2). [223000600090] |The only problem was that my export was 5MB and they only took up to 4MB. [223000600100] |Ouch. [223000600110] |So I trolled the XML and realized it wasn’t just bad markup, it was spam. 
[223000600120] |So I went in and moderated the spam away, dumped it out, and miraculously it started again on the WordPress side. [223000600130] |The only thing that’s messed up is links. [223000600140] |Before, they used to be ?p=50 and now they’re descriptive (search engine optimized, I suppose). [223000600150] |The old numerical links still work, but they’ve been renumbered to be continuous. [223000600160] |So the links are broken. [223000600170] |And somehow they put in a backslash before “Carp’s Blog”, but that was easy to edit. [223000600180] |I had to copy the blogroll by hand, but that wasn’t too big. [223000600190] |Let’s hope comments work better here. [223000600200] |The spam was killing us, so we basically just turned off comments on the old blog. [223000600210] |The only problem is that there’s no way (I can find, anyway) to turn on comments for all of our older posts. [223000600220] |They all showed up with comments off after the import. [223000600230] |Update 23 January 2008: Doh! [223000600240] |We forgot about redirecting users who are hitting our RSS feed, blogrolling our top-level site, and linking to individual entries. [223000600250] |That required figuring out RSS redirects, HTML redirects, and Breck writing a little Perl CGI to handle the fact that the old WordPress used URLs with properties, so we couldn’t just use static redirects. [223000610010] |Comments Turned ON [223000610020] |Comments are now possible for the LingPipe blog. [223000610030] |Depending on how things go, we may have to leave moderation turned on. [223000610040] |So far, WordPress is doing a much better job at filtering the spam than we were. [223000620010] |LingPipe 3.4.0 Released [223000620020] |LingPipe 3.4.0 is out. [223000620030] |As usual, find out more and download from the home page: [223000620040] |
  • LingPipe Home Page
  • [223000620050] |Here are the release notes. [223000620060] |

    Intermediate Release

    [223000620070] |The latest release of LingPipe is LingPipe 3.4.0. [223000620080] |This release replaces LingPipe 3.3.0, with which it is backward compatible other than for one constructor for the approximate dictionary chunker (see below for details). [223000620090] |

    Serializable LM and Spelling Correction Training

    [223000620100] |This release adds the ability to serialize instances of lm.NGramProcessLM and spell.TrainSpellChecker. [223000620110] |This will allow more data to be added to training without having to rescan earlier data. [223000620120] |Serializing spelling training led to adding a serialization implementation to spell.FixedWeightEditDistance. [223000620130] |
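    Assuming the new serializability is via standard Java serialization (an assumption; the mechanism isn't spelled out here), saving a partially trained language model and reading it back later to add more data looks roughly like this:

        import com.aliasi.lm.NGramProcessLM;
        import java.io.*;

        // Sketch: write a partially trained LM to disk, then read it back
        // later to continue training without rescanning earlier data.
        // Assumes NGramProcessLM implements java.io.Serializable as of 3.4.0.
        public class LmSerializationSketch {

            static void writeModel(NGramProcessLM lm, File file)
                    throws IOException {
                ObjectOutputStream out
                    = new ObjectOutputStream(new FileOutputStream(file));
                try {
                    out.writeObject(lm);
                } finally {
                    out.close();
                }
            }

            static NGramProcessLM readModel(File file)
                    throws IOException, ClassNotFoundException {
                ObjectInputStream in
                    = new ObjectInputStream(new FileInputStream(file));
                try {
                    return (NGramProcessLM) in.readObject();
                } finally {
                    in.close();
                }
            }
        }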

    Count-Based Training for LM Classifiers

    [223000620140] |There is a new method in classify.DynamicLMClassifier that allows counts to be included with training instances. [223000620150] |

    Revised Approximate Dictionary Chunker

    [223000620160] |The dict.ApproxDictionaryChunker class was refactored for compatibility with the util.Distance and util.Proximity interfaces as well as the spell.WeightedEditDistance class. [223000620170] |Unfortunately, a constructor had to be removed, so it’s not fully backward compatible. [223000620180] |The documentation has been completely rewritten (as has the doc for edit distance), and a tutorial example is available as part of the named entity tutorial (see below for a link). [223000620190] |

    Factory Interface

    [223000620200] |In support of some of the projects in development, we’ve added a generic factory interface, util.Factory. [223000620210] |

    String Comparison Tutorial

    [223000620220] |There’s a new string comparison tutorial which covers weighted edit distances and other ways to compare strings. [223000620230] |

    Approximate Dictionary Chunker Tutorial

    [223000620240] |There’s a new section in the named entity tutorial covering approximate dictionary-based chunking. [223000620250] |

    New Demo Jars

    [223000620260] |There are no longer any dependencies in LingPipe itself. [223000620270] |The demos depend on various third party libraries. [223000620280] |The following jars in demos/lib are newer versions than found in the last release: [223000620290] |
  • commons-fileupload-1.2.jar
  • [223000620300] |
  • commons-net-1.4.1.jar
  • [223000620310] |
  • lucene-core-2.3.0.jar
  • [223000620320] |
  • luke-0.8.1.jar
  • [223000620330] |
  • nekohtml-1.9.6.1.jar
  • [223000620340] |
  • servlet-api-6.0.16.jar
  • [223000620350] |

    Sandbox Temporary Home

    [223000620360] |The sandbox projects are temporarily living at threattracker.com. [223000620370] |See the Sandbox Home for info on the sandbox. [223000630010] |Understanding Java Generics through Erasure [223000630020] |Now that I’ve been playing with generics for nearly a year, I think I better understand how to convey what’s really going on with them in one simple example. [223000630030] |

    Erasure for Compilation

    [223000630040] |Consider the following very simple class, which merely holds a reference to a character sequence generically: [223000630050] |The first key to understanding this class is to understand how it looks with all the generics "erased": [223000630060] |The erased version is the one that is compiled. [223000630070] |Thus a client can use the class with no generics whatsoever according to the erased version. [223000630080] |Note that erasure leaves behind the interface bound that the generic parameter E extends. [223000630090] |So clients can assign the result of a get() to a CharSequence variable. [223000630100] |They cannot construct an instance by calling something like new Foo(new Integer(5)), because it doesn’t match the erasure. [223000630110] |
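    Here is a sketch of such a class and of its erased form (the names and field are illustrative reconstructions):

        // A simple generic holder for a character sequence.
        class Foo<E extends CharSequence> {
            private final E mCs;
            public Foo(E cs) {
                mCs = cs;
            }
            public E get() {
                return mCs;
            }
        }

        // What Foo looks like after erasure -- this is what actually gets
        // compiled.  The bound CharSequence is left behind in place of E.
        // (Shown for illustration only; the compiler produces it, you
        // don't declare both versions yourself.)
        class Foo {
            private final CharSequence mCs;
            public Foo(CharSequence cs) {
                mCs = cs;
            }
            public CharSequence get() {
                return mCs;
            }
        }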

    Casting for Clients

    [223000630120] |Now what happens with clients of the class that use the generic? [223000630130] |For instance, I might do this: [223000630140] |If the target of compilation is the erased class, how is this simple client able to assign the value of the getter to a string? [223000630150] |The answer is that the compiler inserts implicit casts, so that the code above is equivalent to one with appropriate casts inserted: [223000630160] |The guarantee is that the casts so inserted will not cause run-time class cast exceptions, because the declaration of foo as Foo<String> requires a compile-time check that all arguments match the declaration. [223000630170] |Thus we can do this: [223000630180] |but we cannot do this: [223000630190] |It will fail the compile-time test. [223000630200] |
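    A sketch of the client code being described, using the Foo class sketched above (variable names are illustrative):

        public class FooClient {
            public static void main(String[] args) {
                // The generic client: assign the getter result directly to a String.
                Foo<String> foo = new Foo<String>("hello");
                String s = foo.get();

                // Against the erased class, the compiler inserts an implicit cast,
                // so the two lines above are equivalent to:
                //     Foo fooRaw = new Foo("hello");
                //     String s2 = (String) fooRaw.get();

                // The cast cannot fail at run time, because the declaration
                // Foo<String> is checked at compile time.  So this compiles:
                Foo<String> ok = new Foo<String>("hello");

                // but this does not compile -- a StringBuilder is a
                // CharSequence, but it is not a String:
                //     Foo<String> bad = new Foo<String>(new StringBuilder("oops"));

                System.out.println(s + " " + ok.get());
            }
        }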

    Why no Instances?

    [223000630210] |The erased form of the class indicates why it’s impossible to create generic arrays. [223000630220] |For instance, I couldn’t declare a method in Foo like this: [223000630230] |The problem is with the erasure: [223000630240] |and the corresponding client code: [223000630250] |Specifically, the implicit cast causes an error: [223000630260] |This is because Java doesn’t let you cast a CharSequence[] to a String[]. [223000640010] |Our NIH work and Autism [223000640020] |We are funded by NIH (thanks to all you tax payers) to develop better ways to connect wet lab work to databases and text sources using software. [223000640030] |I have blogged before about the importance of better linguistics in bioinformatics and now we are on Phase II of taking this issue on. [223000640040] |As a starting place we are developing software that links a database of genes to PubMed. [223000640050] |We have further restricted ourselves (at least initially) to working on the slice of PubMed that is returned by the query “(gene or genes or DNA or mRNA) and (autism or autistic)” which returned 1228 abstracts. [223000640060] |Why genes? [223000640070] |Much of modern research is driven by studies that investigate the genome of disease groups. [223000640080] |This in turn generates via micro array experiments large sets of candidate genes that must be winnowed out by text and database driven informatics techniques–this is our strong suit. [223000640090] |Why autism? [223000640100] |We have a relationship with Harvard and that is what they are working on as part of the Autism Consortium. [223000640110] |We need a user base, a smaller slice of the medical world that we can get a handle on and a genetic disease with difficult informatics problems. [223000640120] |

    Where do we stand now?

    [223000640130] |The short answer is, roughly speaking, that we are humbled. [223000640140] |We cut our teeth in the intelligence world starting in 2000 and got NIH funding because we promised to deliver what we could do for tracking Osama bin Laden for medical researchers. [223000640150] |While tracking bin Laden had its challenges, tracking genes has opened a world of tough problems. [223000640160] |One dimension of the problem is obvious upon examination. [223000640170] |Allow me to elaborate. [223000640180] |I decided that I wanted a sanity check on exact dictionary matches of genes from the Entrez Gene database into the autism literature. [223000640190] |Being a good researcher you always want to establish a baseline system to get a sense of the scope of the problem. [223000640200] |We built a somewhat effective system for doing this for human genes based on work by William Hayes et al as reported in the AZuRE system for our Phase I work. [223000640210] |Now in Phase II I decided that we needed to move beyond our species to the very important animal models that are also in Entrez Gene–mouse being particularly significant for autism. [223000640220] |Little did I know that genetics researchers name thier genes with the explicit purpose of making my life difficult. [223000640230] |I wrote a simple exact dictionary matcher for known aliases of genes. [223000640240] |Entrez Gene has half a million genes spanning more than 500 species with more than a million unique aliases for genes. [223000640250] |Nothing particularly concerning there yet. [223000640260] |Looking up gene names in an autism abstract from PubMed however yields an interesting problem. [223000640270] |Here is the abstract: [223000640280] |Note: Genes found with simple dictionary lookup, abstract on left, found genes on right in bold. [223000640290] |Species and Entrez Gene id listed per alias. [223000640300] |Not all species shown. [223000640310] |You don’t need to be a doctor to know that the phrases ‘we’, ‘unbalanced’, ‘a’ and ‘or’ are not gene names in this context. [223000640320] |But believe it or not, they are actually legitimate in other contexts. [223000640330] |As in ‘The mutant gene we is probably responsible for a disrupted induction signal from the dermal papilla towards ectodermal cells of a hair follicle.’ [223000640340] |I am not kidding and I feel like Dave Barry saying it. [223000640350] |This is really a challenging problem, and unlike Dave Barry I have to solve it. [223000640360] |Soon I will post about our approaches to solving this over generation problem. [223000640370] |In the world of computational linguistics it would be said that we have a precision problem. [223000640380] |That is, we can’t accurately find examples of genes without finding a pile of junk in the process. [223000640390] |It has the same quality as Paul Samuelson claiming that Wall Street indexes predicted 9 out of the past 5 recessions. [223000640400] |Wish us luck, we all will live healthier if we get this sorted out…. [223000640410] |Breck [223000650010] |EM Clustering, Fat Gaussians and Number of Clusters [223000650020] |I just got back from a very nice talk at Columbia by Sanjoy Dasgupta, who’s on leave at Yahoo! from UCSD. [223000650030] |The talk was basically about an old paper, Dasgupta and Schulman 2000, in which the authors introduce a “fat Gaussian” to overcome the curse of dimensionality in EM (soft) clustering. 
[223000650040] |Dasgupta started by making the point that a Gaussian with 1000 independent dimensions has almost all sample points roughly the same distance from the mean, which is easy to prove given the central limit theorem. [223000650050] |In other words, the standard Gaussian defines a thin high-likelihood shell around the expected distance. [223000650060] |In reality (or at least in experimental data sets), data’s more dispersed than independent high-dimensional Gaussians can represent. [223000650070] |Therefore, they introduce fat Gaussians, which are just mixtures of Gaussians with different variances, thus averaging over a bunch of thin shells at their respective expected distances from the mean. [223000650080] |He went on to show various separability results for clustering based on these fat Gaussians and mostly separable data. [223000650090] |What struck me most about the talk was Dasgupta’s answer to a question from an audience member who asked if his method would help determine the number of clusters. [223000650100] |This is a perennial problem in clustering. [223000650110] |K-means and EM-clustering require the number of clusters to be set ahead of time. [223000650120] |Dasgupta’s answer was in two parts. [223000650130] |The official answer was “no”. [223000650140] |But he qualified that by saying that what he really believed was that the problem was ill-formed because different numbers of clusters were useful for different applications. [223000650150] |Some natural language processing problems are like this, and some are different. [223000650160] |Categorizing documents by topic is like this, because the world doesn’t come pre-segmented into a known number of topics. [223000650170] |For instance, we might categorize news into business, politics, sports and entertainment, leading to four categories. [223000650180] |Or we might go further, and divide them into regional and international politics, baseball, soccer and bowling, and movies versus music. [223000650190] |It depends on the application. [223000650200] |But one reason we’re interested in clustering is for coreference resolution, including database deduplication and grouping mentions of the same entity (such as a consumer electronics product, gene name or person) in text. [223000650210] |For these tasks, there is a notion of “true” number of clusters, though one might quibble that Bob-as-cook is different than Bob-as-programmer. [223000650220] |EM clustering that estimates means and variances in a maximum likelihood setting is ill-formed. [223000650230] |The max likelihood solution in the 2-cluster setting places one cluster mean on a single data point with zero variance, leading to an infinite value of the density for that point, with the other points belonging to a single high-variance covering distribution. [223000650240] |One solution to this problem is to fix the variances to positive values ahead of time and only estimate the means. [223000650250] |Another thing that strikes me about EM clustering is that it’s almost always done with a uniform distribution over categories. [223000650260] |That is, the likelihood of a point x being in cluster c is p(c) * p(x|c). [223000650270] |I’ve never seen a presentation of Gaussian EM clustering where p(c) is taken to be anything other than uniform. 
[223000650280] |That is, we’re not modeling the situation where clusters have different likelihoods —the generative model is to draw a cluster according to a uniform distribution and then draw a point from that cluster’s distribution. [223000650290] |The Bayesian (equivalently minimum description length) solution to the number of clusters problem also addresses the uniform-cluster size problem. [223000650300] |Specifically, we put a prior on the model structure and then choose the clustering that maximizes the likelihood times the prior: p(data|model) * p(model). [223000650310] |The standard Bayesian prior for this kind of setting would comprise something like a Dirichlet process prior on the distribution of categories p(c); we could also put Bayesian priors on the means or more commonly variances of the Gaussians. [223000650320] |With a Dirichlet process, there is no bound on the number of categories, and draws from a Dirichlet process prior lead to power-law (Zipf-like) distributions of number and size of categories. [223000650330] |You can see this working quite well in coreference resolution as evidenced by the recent paper Haghighi and Klein 2007. [223000650340] |The only problem with the Bayesian “solution” is that the concentration parameter of the Dirichlet process prior is just another way of setting k. [223000650350] |That is, you need to make a prior assumption about how costly it is to introduce new categories. [223000650360] |This is just replacing a direct control (number of clusters) with an indirect one (cost to introduce a new cluster). [223000650370] |Haghighi and Klein, for instance, set the concentration parameter of their Dirichlet process prior empirically. [223000660010] |Spinning Straw into Gold: How to do gold standard data right [223000660020] |We have been struggling with how to evaluate whether we are finding ALL the genes in MEDLINE/PubMed abstracts. [223000660030] |And we want to do it right. [223000660040] |There has been a fair amount of work on how to evaluate natural language problems–search via TREC, BIOCREATIVE, MEDTAG but nothing out there really covered what we consider to be a key problem in text based bioinformatics–coverage or recall in application to existing databases of Entrez Gene. [223000660050] |

    What is the Rumpelstiltskin tie in?

    [223000660060] |From the Wikipedia: [223000660070] |“In order to make himself appear more important, a miller lied to the king that his daughter could spin straw into gold. [223000660080] |The king called for the girl, shut her in a tower room with straw and a spinning wheel, and demanded that she spin the straw into gold by morning, for three nights, or be executed.” Much drama ensues, but in the end a fellow named Rumpelstiltskin saves the day. [223000660090] |The cast breaks down as follows: [223000660100] |
  • The king: The National Institutes of Health (NIH) who really prefer that you deliver on what you promise on grants.
  • [223000660110] |
  • The miller: Our NIH proposal in which we say “We are committed to making all the facts, or total recall, available to scientists…” Even worse is that this is from the one paragraph summary of how we were going to spend $750,000 of the NIH’s money. [223000660120] |They will be asking about this.
  • [223000660130] |
  • The daughter: I (Breck), who sees lots of straw and no easy path to developing adequate gold standard data to evaluate ourselves against.
  • [223000660140] |
  • The straw: 15 million MEDLINE/PubMed abstracts and 500,000 genes that need to be connected in order to produce the gold. [223000660150] |Really just a subset of it.
  • [223000660160] |
  • The gold: A scientifically valid sample of mappings between genes and abstracts against which we can test our claims of total recall. [223000660170] |This is commonly called “Gold Standard Data.”
  • [223000660180] |
  • Rumpelstiltskin: Bob, and lucky for me I do know his name.
  • [223000660190] |

    Creating Gold from Straw

    [223000660200] |Creating linguistic gold standard data is difficult, detail oriented, frustrating and ultimately some of the most important work that one can do to take on natural language problems seriously. [223000660210] |I was around when version 1 of the Penn Treebank was created and would chat with Beatrice Santorini about the difficulties they encountered for things as simple seeming as part-of-speech tagging. [223000660220] |I annotated MUC-6 data for named entities and coreference, did the John Smith corpus of cross-document coref with Amit Bagga and have done countless customer projects. [223000660230] |All of those efforts gave me insights that I would not have had otherwise about how language is actually used rather than the idealized version you get in standard linguistics classes. [223000660240] |The steps for creating a gold standard are: [223000660250] |
  • Define what you are trying to annotate: We started with a very open-ended “let’s see what looks annotatable” attitude for linking Entrez Gene to MEDLINE/PubMed. [223000660260] |By the time we felt we had a sufficiently robust linguistic phenomenon, we had a standard that mapped abstracts as a whole to gene entries in Entrez Gene. [223000660270] |The relevant question was: “Does this abstract mention anywhere a literal instance of the gene?” [223000660280] |Gene families were not taken to mention the member genes, so “the EXT family of genes” would not count, but “EXT1 and EXT2 are not implicated in autism” would.
  • [223000660290] |
  • Validate that you can get multiple people to do the same annotation: Bob and I sat down and annotated 20 of the same abstracts independently and compared our results. [223000660300] |We found that we had 36 shared mappings from gene to abstract, with Bob finding 3 mappings that I did not and me finding 4 that Bob did not. [223000660310] |In terms of recall I found 92% (36/39) of what Bob did. [223000660320] |Bob found 90% (36/40) of what I found. [223000660330] |Pretty good, eh? [223000660340] |Not really, see below.
  • [223000660350] |
  • Annotate enough data to be statistically meaningful: Once we are convinced we have a reliable phenomenon, then we need to be sure we have enough examples to minimize chance occurrences.
  • [223000660360] |

    The Tricky Bit

    [223000660370] |I (the daughter) need to stand in front of the king (the NIH) and say how good our recall is. Better if that number is close to 100%. [223000660380] |But what is 100% recall? [223000660390] |Even a corpus annotation with an outrageously high 90% interannotator agreement leads to problems: [223000660400] |
  • A marketing problem: Even if we hit 99.99% recall on the corpus, we don’t know what’s up with the 5% error. [223000660410] |We can report 99.99% recall against the corpus, but not against the truth. Imagine, after being total rock stars and modeling Bob at 99.99%, having to show a slide that says we can only claim recall of 85-95% on the data. [223000660420] |So I can throw out the 99.99% number and introduce a salad of footnotes and diagrams. [223000660430] |I see congressional investigations in my future.
  • [223000660440] |
  • A scientific problem: It bugs me that I don’t have a handle on what truth looks like. [223000660450] |We really do think recall is the key to text bioinformatics and that text bioinformatics is the key to curing lots of diseases. [223000660460] |

    Rumpelstiltskin Saves the Day

    [223000660470] |So, here we are in our hip Brooklyn office space, sun setting beautifully over the Williamsburg bridge, Bob and I sitting around with a lot of straw. [223000660480] |It is getting tense as I imagine the king’s reaction to the “standard approach” of working from an interannotator-agreement-validated data set. [223000660490] |Phrases like “cannot be done in a scientifically robust way”, “we should just do what everyone else does” and “maybe we should focus on precision” were bandied about with increasing panic. [223000660500] |But the next morning Rumpelstiltskin walked in with the gold. [223000660510] |And it goes like this: [223000660520] |The problem is in estimating what the truth is given somewhat unreliable annotators. [223000660530] |Assuming that Bob and I make independent errors, and after adjudication (we both looked at where we differed and decided what the real errors were), we figured that each of us would miss 5% (1/20) of the abstract-to-gene mappings. [223000660540] |If we take the union of our annotations, we end up missing only 0.25% of mentions (1/400), by multiplying our recall errors (1/20 × 1/20); this assumes independence of errors, a big assumption. [223000660550] |Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. [223000660560] |Basically, we should take annotations from all remotely qualified annotators and not worry about it. [223000660570] |We know that is going to push down our precision (accuracy), but we are not in that business anyway. [223000660580] |
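    Spelled out, with each of us independently missing 1/20 of the true mappings, the arithmetic is:

        \[
        P(\text{both miss a mapping}) = \frac{1}{20} \times \frac{1}{20} = \frac{1}{400} = 0.25\%,
        \qquad
        \text{recall of the union} \approx 1 - \frac{1}{400} = 99.75\%.
        \]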

    An apology

    [223000660590] |At ISMB in Detroit, I stood up and criticized the BioCreative/GENETAG folks for adopting a crazy-seeming annotation standard that went something like this: “Annotate all the gene mentions in this data. [223000660600] |Don’t get too worried about the phrase boundaries, but make sure the mention is specific.” [223000660610] |I now see that approach as a sane way to increase recall. [223000660620] |I see the error of my ways and feel much better since we have demonstrated 99.99% recall against gene mentions for that task–note this is a different, but related task to linking Entrez Gene ids to text abstracts. [223000660630] |And thanks to the BioCreative folks for all the hard work pulling together those annotations and running the bakeoffs. [223000670010] |(Abstract) Base Classes vs. Interfaces [223000670020] |When definining a low-level abstraction as part of a framework, say vectors and matrices to make this concrete, there’s a choice of whether to code to interfaces or specify a (possibly abstract) base class. [223000670030] |There are two key differences between the base class and interface approach. [223000670040] |First, in Java (and .NET from what I gather), classes must extend exactly one other class (Object is the default), but can implement zero or more interfaces. [223000670050] |Second, interfaces can’t specify any implementations of methods (though they may define static constants). [223000670060] |But as Kirill Osenkov’s blog entry Choosing: Interface vs. Abstract Class points out, there are serious downstream consequences of this choice. [223000670070] |Most notably, once an interface is released, it has profound backward compatibility implications. [223000670080] |Particularly, any attempt to add a method to the interface will break backward compatibility for anyone who has implemented the interface. [223000670090] |As long as the base class implements the added method, there’s no such problem. [223000670100] |The conclusion I’ve come to after several years of coding to interfaces (following the Java collections framework for guidance), is that they almost all should’ve been (abstract) base classes. [223000670110] |The exceptions are util.Compilable, util.Scored, and other lightweight marker-type interfaces, most notably corpus.Handler. [223000670120] |So what do you do if you forgot a critical method in an interface? [223000670130] |I’m in that position right now with our matrix.Vector interface. [223000670140] |I’ve just implemented multinomial logistic regression classification (aka max entropy, aka soft max, aka log linear classifiers) and found that I need two new vector operations to make the inner loop efficient. [223000670150] |One, I need to be able to find the non-zero dimensions of a vector, and two, I need to add a scaled sparse vector to a vector. [223000670160] |I’ve added the methods to the interface and to the low-level abstract implementation matrix.AbstractVector. [223000670170] |I’m banking on no one having implemented Vector, because there’s really no reason to do it, so changing the interface won’t be so terrible. [223000670180] |I’d dearly love to add a method to the tokenizer.TokenizerFactory interface to create tokenizers from character sequences as well as slices. [223000670190] |But alas, I fear too many users have coded to the existing interface, so I’m not going to do that. [223000680010] |How Fat is Your (Prior’s) Tail? [223000680020] |What kind of prior should you use for logistic regression (aka maximum entropy) coefficients? 
[223000680030] |The common choices are the Laplace prior (aka L1 regularization, aka double exponential prior, aka the lasso), the Gaussian prior (aka L2 regularization, aka normal, aka ridge regression), and more recently, the Cauchy prior (aka Lorentz, aka Student-t with one degree of freedom). [223000680040] |These are all symmetric priors, and we typically use zero mean distributions as priors (or zero median in the Cauchy, which has no mean because the integral diverges). [223000680050] |The effect is that there’s a penalty for larger coefficients, or looked at the other way, the priors shrink the parameters. [223000680060] |[paragraph revised June 13, 2008] The main difference between the priors is how fat their tails are. [223000680070] |For a given variance, the Laplace prior has more mass around zero and fatter tails than the Gaussian; the Cauchy has very thick tails compared to the Gaussian. [223000680080] |There’s an interesting paper by Goodman, Exponential Priors for Maximum Entropy Models, for which Microsoft received a blatantly ridiculous patent. [223000680090] |Maybe we’ll get a nastygram from Redmond for using a textbook technique. [223000680100] |Like the stats textbooks, Goodman notes that the Laplace prior likes to shrink coefficients to zero, effectively performing a kind of Bayesian feature selection. [223000680110] |There’s also a more recent offering by Genkin, Lewis and Madigan, Large-scale Bayesian logistic regression for text categorization, which covers both L1 and L2 regularization, which also comes with an open-source implementation for the multinomial case. [223000680120] |This paper does a good job of comparing regularized logistic regression to SVM baselines. [223000680130] |Dave Lewis has been a great help in understanding the math and algorithms behind logistic regression, particularly the one-parameter vector vs. two-parameter vector case, and the implications for priors and offsets in sparse computations. [223000680140] |There’s an interesting paper by Gelman, Jakulin, Su and Pittau, A Default Prior Distribution for Logistic and Other Regression Models, in which they argue that after centering and variance adjusting inputs, the Cauchy prior is a good general purpose prior. [223000680150] |They evaluate on a range of binary classification problems. [223000680160] |I just had coffee with Andrew Gelman after guest-lecturing in his stat computation class, and we had a chat about regularization and about discriminitive versus generative models. [223000680170] |He wasn’t happy with Laplace priors being applied willy-nilly, suggesting instead that feature selection be done separately from feature estimation. [223000680180] |Andrew also surprised me in that he thinks of logistic regression as a generative model (“of course it’s generative, it’s Bayesian”, I believe he said); but for me, this is odd, because the data’s “generated” trivially. [223000680190] |He was keen on my trying out some kind of posterior variance fiddling to move from a maximum a posteriori form of reasoning to a more Bayesian one. [223000680200] |The only problem is that I just can’t invert the million by million matrix needed for the posterior variance estimates. [223000680210] |The issue here is a culture conflict between “machine learning” and “statistics”. [223000680220] |If you read Hill and Gelman’s book on regression, it’s study after study of small dimensionality problems with dense data, not much data, and carefully engineered features and interaction features. 
[223000680230] |Natural language problems, like most machine learning problems, involve millions of features and very sparse vectors. [223000680240] |Maybe different techniques work better for quantitatively different problems. [223000690010] |Symbols as Hash Codes for Super-fast Linear Classifiers [223000690020] |I was talking to John Langford, of Machine Learning(Theory) blog fame, after his talk in the Columbia machine learning reading group yesterday. [223000690030] |He mentioned a way to speed up linear classifiers at runtime that’s intriguing. [223000690040] |A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). [223000690050] |Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, SVMs, and it’s even the innermost loop in discriminitive sequence models like CRFs. [223000690060] |The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. [223000690070] |If you use a symbol table, the bottleneck is the hash lookup in the symbol table. [223000690080] |The feature weight vector β is almost always an array, so once you have the symbol ID, you pay the array lookup, multiplication of feature weight (typically 1 in NLP problems) by value found, and add to the sum for the dot product. [223000690090] |Only the array lookup is time consuming here. [223000690100] |Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter [223000690110] |So what happens if we replace the symbol generated by a symbol table with a hash code? [223000690120] |Instant speedup. [223000690130] |We eliminate the expensive hash lookup, which requires an array lookup almost certainly out of L2 cache, and then iterating over the collision set doing a string-match until we get a match or exhaust the bucket. [223000690140] |The price we pay is possible collisions. [223000690150] |In effect, any two features that have the same hash code get conflated. [223000690160] |If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hashcode of “goalie” and “pitcher” are the same, as they’re highly discriminitive in this domain. [223000690170] |Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. [223000690180] |Now we’re really flying. [223000690190] |The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. [223000690200] |It’s the lookup out of cache and subsequent matching that’s the time-sink. [223000690210] |In practice, John reports that experiments they’ve done have shown that this isn’t a problem. [223000690220] |He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. [223000690230] |It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy. [223000690240] |This approach extends to the more complex, structured features common in discriminitive classifiers. [223000690250] |We never need to build an explicit feature representation if we can generate a hash code for it. 
[223000690260] |If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. [223000690270] |I might also be thinking about perfect hashing, because I’m a neat freak. [223000700010] |Logistic Regression by Any Other Name [223000700020] |I (Bob) have been working on logistic regression. [223000700030] |In particular, multinomial logistic regression with a range of different priors and a sparse regularized stochastic gradient descent optimizer. [223000700040] |It’ll be out in LingPipe 3.5 any day now. [223000700050] |One of the problems I had is the diverse literature around regularized logistic regression and the sheer volume of terms used to refer to it and its properties. [223000700060] |Ditto for the stochastic gradient and other optimization literature. [223000700070] |Here’s the “also known as” section from the forthcoming LingPipe 3.5 Javadoc (we’re our own search engine optimization providers): [223000700080] |I wrote up everything I learned in a white paper, Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. [223000700090] |The result was a slew of algebra reminiscent of Dan Klein and Chris Manning’s max-ent tutorial, but with more general regularization, a different (k-1)-vector parameterization, and a different optimization scheme. [223000700100] |I added a new white papers section to the blog to hold it. [223000700110] |I made a minor original contribution: a stochastic gradient descent algorithm with regularization that fully preserves both sparseness and the stocahsticity of regularization. [223000700120] |The amount of math I had to do to put the pieces back together took me straight back to grad school. [223000700130] |My parents are visiting and asked me last night why there was an enormous pile of math books on the table (Strang on matrices [I couldn't remember what positive definite means], a calc textbook [I couldn't even remember how to differentiate logs and exponentials], Larsen and Marx (or was it DeGroot and Schervish?) [need the basic density definitions and properties], Gelman et al.’s Bayesian Data Analysis and Regression books, rounded off with Bishop’s, MacKay’s, and Hastie et al.’s trio of machine learning books, Cover and Thomas’s info theory book, and Manning and Schuetze for good measure). [223000700140] |The answer is that they all contained pieces of the puzzle. [223000700150] |I work out the full error function under maximum likelihood and Gaussian, Laplace and Cauchy priors, derive the gradients and Hessians, and show how to build it all into a stochastic gradient descent solver. [223000700160] |I show the links between logistic regression and max entropy, work out the impact of feature functions, exponential bases, binary features, input centering and variance normalization, and even kernel methods. [223000700170] |I have to admit I skipped full duality (but did talk about kernelization), didn’t work out the Hessian positive definiteness proof for error function is concavity, and don’t cover any of the convergence proofs for SGD. [223000700180] |But there are lots of references to books, papers, tutorials and software packages. [223000700190] |During this process I felt like the great Nicolai Ivanovich Lobachevsky as satirized in the Tom Lehrer song Lobachevsky. [223000700200] |The unit tests are now passing. [223000700210] |I’d like to thank Dave Lewis for help with test cases and understanding some of the math. 
[223000700220] |Dave and I were undergrad math majors at Michigan State together, and now find ourselves doing logistic regression for NLP; he’s contributing to the BMR: Bayesian Multinomial Regression Software package (check out their cool use of priors for IDF-like term weighting). [223000710010] |SGD Special Ingredient: Reckless Driving [223000710020] |I’ve been playing around with convergence for my SGD implementation for the upcoming LingPipe 3.5, in the context of the 2008 i2b2 Obesity Challenge, the full title of which is "Second i2b2 Shared-Task and Workshop; Challenges in Natural Language Processing for Clinical Data; Obesity Challenge (A Shared-Task on Obesity): Who’s obese and what co-morbidities do they (definitely/likely) have?". [223000710030] |Participants have until April 15, 2008, to register to participate. [223000710040] |Slide 37 of Léon Bottou’s NIPS tutorial Learning with Large Datasets reveals the "secret ingredient" behind (his) successful stochastic gradient search (where η is the learning rate, which he calls a "gain"): [223000710050] |At any moment during training, we can: [223000710060] |
  • Select a small subsample of examples.
  • [223000710070] |
  • Try various gains η on the subsample.
  • [223000710080] |
  • Pick the gain η that most reduces the cost.
  • [223000710090] |
  • Use it for the next 100000 iterations on the full dataset.
  • [223000710100] |This is a kind of meta-descent algorithm, the most well known of which is Nicolas Shraudolph’s Stochastic Meta-Descent. [223000710110] |I’m going to try Bottou’s approach tomorrow. [223000710120] |But I’m skeptical it’ll work very well, because the optimal learning rate changes pretty quickly through the corpus, so balancing batch size and number of rates to probe may prove tricky. [223000710130] |What I need is a control stick, sort of like David MacKay et al.’s Dasher interface for convergence. [223000710140] |Until today, I’ve been using a simple annealing schedule to set the learning rate. [223000710150] |The good news is that the rumors were true —sparse SGD is very fast. [223000710160] |The bad news is that driving the SGD implementation is like bobsledding on a steep hill in the dark over hilly terrain (that reference was for you, Jeff). [223000710170] |Too high a rate, and the first bump’ll blow you off the hill; too heavy a hand on the brakes, and it’ll take months to get to the bottom. [223000710180] |Just right and you’re there in a few hundred iterations. [223000710190] |Today, I played around with the so-called "reckless driver" heuristic, which increases the learning rate for the next epoch if the current epoch lowered error, and decreases it if the overall error goes up. [223000710200] |It helps. [223000710210] |Most of the time. [223000710220] |I’m undecided about whether to include it in the release and if so, whether or not to parameterize the acceleration/deceleration terms. [223000710230] |If you can get close to a solution early on with a fairly low learning rate, turning the rate up later on seems to help. [223000710240] |Yes, I know that eventually the learning rate has to decrease to get within epsilon of a solution. [223000710250] |So much for asymptotics and unbounded number of steps in the theoretical analysis. [223000710260] |I’m reporting total error and breaking out the log likelihood component. [223000710270] |What’s interesting to me is that the log likelihood goes way down (almost to zero, indicating the data’s pretty much separable), then starts climbing up as regularization kicks in to shrink all the parameters back down. [223000710280] |I’m wondering if separate learning rates for regularization and likelihood gradient contributions make sense. [223000710290] |I’m also realizing why the "early stopping" heuristic was introduced as a poor man’s regularization. [223000710300] |Even with proper Bayesian regularization, the optimal solution under cross-validation for many choices of prior is not the maximum a posteriori solution. [223000710310] |For instance, if I take a prior Gaussian with a variance of 2.0 for every dimension, the MAP solution is inferior to "early stopping". [223000710320] |Hello, empirical Bayes. [223000710330] |And perhaps dimension-specific priors, as in Genkin, Madigan and Lewis’s Bayesian Multinomial Regression Software, which lets you get the effects of variance normalizing the input dimensions. [223000710340] |The other good news is that (at least under most parameterizations) regularized logistic regression definitely works better than naive Bayes or our character language model or TF/IDF classifiers. [223000710350] |As I suspected all along, logistic regression, even with just bag-of-word or bag-of-character-ngrams features, is incredibly sensitive to priors, and the task has incredibly high variance under cross-validation no matter which classifiers we use. 
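For concreteness, the reckless-driver rate update amounts to something like this (a sketch; the acceleration and braking factors are made up for illustration, not what will ship in LingPipe):

    // Sketch of the "reckless driver" learning-rate heuristic: speed up
    // after a good epoch, brake hard after a bad one.  Factors illustrative.
    public class RecklessDriver {

        private static final double ACCELERATE = 1.1;  // modest speed-up
        private static final double BRAKE = 0.5;       // heavy braking

        private double mLearningRate;
        private double mLastError = Double.POSITIVE_INFINITY;

        public RecklessDriver(double initialLearningRate) {
            mLearningRate = initialLearningRate;
        }

        // call once per epoch with the total regularized error
        public double nextRate(double error) {
            mLearningRate *= (error < mLastError) ? ACCELERATE : BRAKE;
            mLastError = error;
            return mLearningRate;
        }
    }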
[223000710360] |On the very next page of his tutorial (38), Bottou describes the batch version of regularized SGD to maintain sparsity in the per-input gradient updates. [223000710370] |I independently came up with the same algorithm, and then went on to derive my fully lazy implementation which preserves stochasticity. [223000720010] |The Entropy of English vs. Chinese [223000720020] |I (Bob) have long been fascinated by the idea of comparing the communication efficiency of different languages. [223000720030] |Clearly there’s a noisy-channel problem that languages have in some way optimized through evolution. [223000720040] |There was some interesting discussion recently by Mark Liberman on the Language Log in an entry Comparing Communication Efficiency Across Languages and a reply to a follow-up by Bob Moore in Mailbag: comparative communication efficiency. [223000720050] |Mark does a great job of pointing out what the noisy channel issues are and why languages might not all be expected to have the same efficiency. [223000720060] |He cites grammatical marking issues like English requiring articles, plural markers, etc., on every noun. [223000720070] |The spoken side is even more interesting, and not just because spoken language is more "natural" in an evolutionary sense. [223000720080] |Just how efficiently (and accurately) the sounds of a language are encoded in its characters plays a role in the efficiency of the writing system. [223000720090] |For instance, Arabic orthography doesn’t usually encode the vowels in their spellings, so you need to use context to sort them out. [223000720100] |The alphabet includes vowels, but they are conventionally employed only for important texts, like the Qur’an. [223000720110] |I would add to Mark’s inventory of differences between English and Chinese the fact that English has a lot of borrowings on both the lexical and spelling side, which increase entropy. [223000720120] |That is, we could probably eke out some gains by recoding “ph” and “f”, collapsing the distinction between reduced vowels and so on; for instance, we wouldn’t have to code the difference between “Stephen” and “Steven” which is only present in the written language (at least in my dialect). [223000720130] |There are lots of other differences. [223000720140] |It may seem that Chinese doesn’t waste bits coding spaces between words. [223000720150] |Or encoding capitalized versus uncapitalized letters. [223000720160] |Surprisingly, when I was working on language modeling in LingPipe, I tested the compressibility (with character n-grams ranging from 5-16) of English drawn from LDC’s Gigaword corpus, with and without case normalization. [223000720170] |The unnormalized version could be compressed more, indicating that even though there are more superficial distinctions (higher uniform model entropy), in fact, these added more information than they took away. [223000720180] |Ditto for punctuation. [223000720190] |I didn’t try removing spaces, but I should have. [223000720200] |I also found counter-intuitively that MEDLINE could be compressed tighter than Gigaword English. [223000720210] |So even though it looks worse to non-specialists, it’s actually more predictable. [223000720220] |So why can’t we measure entropy? [223000720230] |First of all, even the Gigaword New York times section is incredibly non-stationary. [223000720240] |Evaluations on different samples have much more variance than would have been expected if the data were stationary. 
[223000720250] |Second of all, what’s English? [223000720260] |We can only measure compressibility of a corpus, and they vary by content. [223000720270] |Finally, why can’t we trust Brown et al.‘s widely cited paper? [223000720280] |Because the result will depend on what background training data is used. [223000720290] |They used a ton of data from "similar" sources to what they were testing. [223000720300] |The problem with this game is how close are you allowed to get? [223000720310] |Given the test set, it’s pretty easy to engineer a training set by carefully culling data. [223000720320] |We might try to compress a fixed corpus, but that leads to all the usual problems of overtraining. [223000720330] |This is the approach of the Hutter Prize (based on compressing the Wikipedia). [223000720340] |So instead, we create baskets of corpora and evaluate those, with the result that there’s no clear “winning” compression method. [223000730010] |LingPipe 3.5.0 Released [223000730020] |

    Intermediate Release

    [223000730030] |The latest release of LingPipe is LingPipe 3.5.0. [223000730040] |This release replaces LingPipe 3.4.0, with which it is backward compatible other than for the matrix.Vector interface (details below). [223000730050] |

    Logistic Regression (aka Max Entropy)

    [223000730060] |The main addition in this release is multinomial logistic regression (often called "maximum entropy classification" in the computational linguistics literature). [223000730070] |Logistic regression produces a probabilistic discriminative classifier with state-of-the-art accuracy. [223000730080] |The regression estimators use stochastic gradient descent for scalability to large feature spaces with sparse inputs. [223000730090] |There is a direct matrix-based implementation in [223000730100] |
  • stats.LogisticRegression
  • [223000730110] |and an adapter based on general feature extraction in: [223000730120] |
  • classify.LogisticRegressionClassifier
  • [223000730130] |There are two support classes introduced for logistic regression, one for Laplace, Gaussian and Cauchy priors (also known as "regularizers"): [223000730140] |
  • matrix.RegressionPrior
  • [223000730150] |and one for annealing schedules to control the gradient descent: [223000730160] |
  • stats.AnnealingSchedule
  • [223000730170] |There is a new tutorial describing how to use these classes referenced below. [223000730180] |

    Cross-Validating Classification Corpus

    [223000730190] |To support cross-validation evaluations for classifiers, there is a new corpus implementation: [223000730200] |
  • classify.XValidatingClassificationCorpus
  • [223000730210] |There is a new tutorial section describing how to use this class referenced below. [223000730220] |

    LineParser and SVMlight Classification Parser

    [223000730230] |There’s an implementation of a parser for the SVMlight file-based classifier format: [223000730240] |
  • corpus.parsers.SvmLightClassificationParser
  • [223000730250] |This parser is based on a new abstract line-based parser implementation in: [223000730260] |
  • corpus.LineParser
  • [223000730270] |

    Pair Utility Class

    [223000730280] |There is a new class for pairs of heterogeneous types introduced primarily as a utility for methods that return pairs of results: [223000730290] |
  • util.Pair
  • [223000730300] |
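    A utility like this is tiny; here's the general shape of such a class, with hypothetical field and method names (consult util.Pair's javadoc for the actual API).

        // Minimal sketch of an immutable pair of heterogeneous types. Names
        // are hypothetical; see util.Pair's javadoc for the real methods.
        public final class PairSketch<A, B> {
            private final A a;
            private final B b;

            public PairSketch(A a, B b) {
                this.a = a;
                this.b = b;
            }

            public A a() { return a; }

            public B b() { return b; }
        }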

    Additional Vector Methods

    [223000730310] |The interface matrix.Vector has been updated with two new methods that allow efficient inspection of non-zero dimensions. [223000730320] |This was done to allow the Vector interface to be used directly in logistic regression. [223000730330] |Vector and matrix client code remains unaffected. [223000730340] |A conflict will arise only for implementations of Vector defined outside of LingPipe. [223000730350] |
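    To see why non-zero inspection matters, here is the kind of loop it enables over a sparse vector. The interface and method names below are hypothetical; check the matrix.Vector javadoc for the actual signatures added in 3.5.0.

        // Why non-zero dimension inspection matters for sparse vectors: inner
        // products (and hence SGD updates) become O(#non-zeros) rather than
        // O(#dimensions). Names here are hypothetical, not matrix.Vector's.
        interface SparseVectorSketch {
            int[] nonZeroDimensions();   // indices of dimensions with non-zero values
            double value(int dimension); // value at the given dimension
        }

        class SparseOps {
            static double dot(SparseVectorSketch x, double[] beta) {
                double sum = 0.0;
                for (int d : x.nonZeroDimensions())
                    sum += x.value(d) * beta[d];
                return sum;
            }
        }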

    Logistic Regression Tutorial

    [223000730360] |There’s a new classification tutorial which covers the new logistic regression classes in the stats and classify packages: [223000730370] |
  • Logistic Regression Tutorial
  • [223000730380] |

    Cross-Validation Tutorial Section

    [223000730390] |There’s a new section in the topic classification tutorial covering cross-validation of classifiers and the corpus.Corpus class. [223000730400] |
  • Topic Classification Tutorial: Cross-Validation
  • [223000750010] |Tokenization vs. Eager Regular Expressions [223000750020] |We now have regular-expression based chunkers and tokenizers in LingPipe. [223000750030] |They work by compiling a regex using java.util.regex.Pattern and then running a java.util.regex.Matcher‘s find() method over the inputs and pulling out the matches. [223000750040] |Recently, I (Bob) have been tuning part-of-speech taggers for French trained on the French Treebank. [223000750050] |I thought writing a regex-based tokenizer would make sense to match the way the treebank itself tokenized as best as possible. [223000750060] |Unfortunately, it’s impossible to write a pure tokenizer to match because, like the Penn Treebank and Penn BioIE projects, the annotators decided to use syntactic and semantic contextual information in deciding when to group a sequence of characters into a token. [223000750070] |For instance, hyphens after prefixes (a lexical syntactic issue) are coded one way and hyphens before suffixes another; in the Penn Treebank, periods at the end of sentences (a contextual decision) are coded as separate tokens whereas those appearing sentence internally are coded as part of the token they follow. [223000750080] |It turns out that regular expressions don’t work the way I thought they would. [223000750090] |I wanted to write a bunch of regexes and then or (|) them together to produce a larger regular expression that would greedily match as much as it could in each chunk. [223000750100] |Let’s consider a simple example: [223000750110] |And consider running a find against the string "aacac". [223000750120] |What do we get? [223000750130] |Let’s ask Java. [223000750140] |This just sets up the pattern based on the regex, generates a matcher from the input and then runs find on the matcher and prints out the first thing found. [223000750150] |What’ll it do? [223000750160] |Ouch. [223000750170] |I was expecting the whole thing to match. [223000750180] |Apparently, that’s what a POSIX regex would do. [223000750190] |But Java follows the Perl model, in which eagerness overcomes greediness. [223000750200] |Specifically, if the first disjunct of an alternation matches, the matcher does not try to find a longer match in the second disjunct. [223000750210] |Unfortunately, there are no greediness/reluctance modifiers for the disjunct. [223000750220] |So what do we do? [223000750230] |Refactor the regex, of course. [223000750240] |How about this one? [223000750250] |This should do the trick of matching the longest possible sequence of alternating as and bs or alternating as and cs. [223000750260] |Sure enough, adding the following to the main() method: [223000750270] |produces the expected output: [223000760010] |F-measure Loss for “Logistic Regression” [223000760020] |Given our recent inclusion of regularized logistic regression into LingPipe and our ongoing focus on high-recall classifiers, taggers and chunkers, I’ve been studying this paper: [223000760030] |
  • Martin Jansche. [223000760040] |Maximum expected F-measure training of logistic regression models. [223000760050] |Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing. [223000760060] |October 2005. [223000760070] |First a recap. [223000760080] |The only difference between logistic regression, perceptron and SVM models is the error function being optimized. [223000760090] |They’re all simple linear models. [223000760100] |Logistic regression uses negative log likelihood as error, whereas perceptrons use 0/1 loss and SVMs use hinge loss. [223000760110] |What Martin’s doing in the paper cited above is presenting yet another loss function: F-measure-based loss. [223000760120] |The reason he still calls it logistic regression is that he’s using the same generalized linear model link function, the logit (aka the inverse logistic or inverse sigmoid), to convert linear basis predictors into probability estimates. [223000760130] |What’s different is that he’s using F-measure as error. [223000760140] |So why is this so hard? [223000760150] |Standard logistic regression models present a convex error minimization problem (equivalently, a concave maximum a posteriori (MAP) probability maximization problem). [223000760160] |F-measure, as illustrated in the wonderfully elaborate 3-d diagrams in the paper, does not present a convex error function. [223000760170] |This makes the optimization problem intractable in theory. [223000760180] |Aside from framing the problem, the insight reported in this paper was using expectations as defined by the logistic regression model’s probability estimates to approximate the delta functions needed for truly defining F-measure loss. [223000760190] |After that, there’s a lot of heavy lifting on the implementation and evaluation side. [223000760200] |The really nice part about Martin’s formulation is that it works not only for the standard balanced F(0.5) measure (the equally weighted harmonic mean of precision and recall), but also for arbitrary F(α) measures, where α determines the weighting of precision and recall. [223000760210] |Martin evaluated α=0.5 (balanced) and α=0.75 (recall weighted 0.75, precision weighted 0.25), and as expected, the α=0.75 setting indeed produced higher recall (though counter-intuitively not a better F(0.75) measure). [223000760220] |He also evaluated the result of training a regular old maximum likelihood logistic regression model and showed that it performed terribly (the task, by the way, was sentence extraction for summarization; I think it’d be easier for the rest of us to evaluate these techniques on something we can reproduce, like the Reuters corpus or 20 newsgroups). [223000760230] |He then showed that a posterior fitting of the binary classification threshold away from 0.0 improved performance immensely. [223000760240] |I’m left with two questions. [223000760250] |(1) If you want high recall, can you just train a regular logistic regression model and then set the acceptance probability lower than 0.5? And (2) how does a reasonably regularized logistic regression fare? [223000760260] |I saw the same kind of bad performance for maximum likelihood and ridge regression (Gaussian priors) in Genkin, Lewis and Madigan’s paper about large-scale logistic regression with priors (also well worth reading). [223000760270] |I have a long history with Martin’s F-measure error paper. [223000760280] |Martin couldn’t get a visa to travel from the U.S. to Canada for the conferences.
[223000760290] |So he had me tack up the poster for this paper for him. [223000760300] |Too bad I didn’t understand the technical details at the time. [223000760310] |(During the same trip, I presented his treebank transfer paper, which, as far as I know, was the first truly Bayesian NLP paper and is well worth reading.) [223000770010] |Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions [223000770020] |Pearson’s chi-squared independence test may be used to compute collocations in a text corpus; that is, sequences of words that occur together more often than they would by chance. [223000770030] |This typically finds names, idioms, set phrases and the like in text. [223000770040] |There’s a whole chapter on collocations in Manning and Schütze’s Stat NLP book, and there’s a LingPipe Interesting Phrases Tutorial. [223000770050] |In the case of bigrams, we have two words A and B, and want to know if they occur as a bigram (A,B) more often than chance. [223000770060] |The simple binary chi-squared independence test is based on a contingency table of bigram counts for (+A,+B), (+A,-B), (-A,+B), and (-A,-B), where (+A,+B) is the count of the bigram (A,B), (+A,-B) is the count of all bigrams (A,D) where D != B, (-A,+B) is the count of all bigrams (C,B) where C != A, and (-A,-B) is the count of all bigrams (C,D) where C != A and D != B. The formula for the test statistic itself is defined in the javadoc for method Statistics.chiSquaredIndependence(double,double,double,double). [223000770070] |The LingPipe class lm.TokenizedLM provides a convenient wrapper with which to count sequences of token n-grams. [223000770080] |The tokenized LM class has a train(CharSequence) method which allows texts to be given. [223000770090] |This method stores the n-gram counts after tokenizing. [223000770100] |So let’s see what happens if we give it two sentences as inputs: [223000770110] |
  • John ran
  • [223000770120] |
  • John ran home
  • [223000770130] |This produces the following counts: [223000770140] |This causes a subtle problem for chi-squared statistics: the count for a final word may be higher than its count as the first element of a bigram. [223000770150] |The same problem happens for initial words and their counts as the second element of a bigram. [223000770160] |So what are our exact bigram counts for (ran,home)? They’re (+ran,+home)=1, (+ran,-home)=0, (-ran,+home)=0, and (-ran,-home)=2. [223000770170] |(Leave aside for now the fact that these counts aren’t high enough for the significance test to be accurate; it’s just an illustrative example.) [223000770180] |The trie data structure used to represent counts (lm.TrieIntSeqCounter) can easily compute the number of words following a given word with a local lookup of a word’s daughters in the trie. [223000770190] |It can’t easily compute the number of words preceding a given word; that would require iterating over all characters. [223000770200] |So would computing all bigrams. [223000770210] |So the chi-squared implementation takes a shortcut. [223000770220] |It approximates count(+ran,-home)=0 by count(+ran)-count(+ran,+home) = 2-1 = 1. [223000770230] |We’re in effect overcounting the negative cases. [223000770240] |Thus our chi-squared statistic is wrong. [223000770250] |But what if we implicitly assume there are boundary tokens, so we model the following sequences: [223000770260] |
  • BOS John ran EOS
  • [223000770270] |
  • BOS John ran home EOS
  • [223000770280] |giving us the following additional counts: a unigram count of 2 for each of BOS and EOS, plus bigram counts of (BOS,John)=2, (ran,EOS)=1, and (home,EOS)=1. [223000770290] |Our approximate counts now become exact: [223000770300] |
  • count(+ran,+home)=1
  • [223000770310] |
  • count(+ran,-home) = count(+ran)-count(+ran,+home) = 1
  • [223000770320] |
  • count(-ran,+home) = count(home)-count(+ran,+home) = 0
  • [223000770330] |
  • count(-ran,-home) = totalCount-count(+ran,+home)-count(-ran,+home)-count(+ran,-home) = 9-1-0-1 = 7
  • [223000770340] |Et voilà. [223000770350] |An ex post facto rationalization. [223000770360] |What LingPipe is counting in chi-squared collocation tests includes the implicit boundaries. [223000780010] |An Online Logistic Regression API with Regularization [223000780020] |Large-scale and online classification problems require a classifier that allows online training. [223000780030] |By large-scale, I mean problems so large they won’t fit in memory. [223000780040] |For this blog entry, size doesn’t really matter. [223000780050] |I’m worried about honest-to-goodness online requirements, which force a learner to get a training example, classify it, get feedback, do some learning, then forget it, all in a constant amount of time per example. [223000780060] |Client-side spam detection and many server-side solutions would require a truly online algorithm in this sense. [223000780070] |So it’s perhaps not surprising that Goodman and Yih of Microsoft Research used an online approach in their paper Online Discriminative Spam Filter Training. [223000780080] |They use standard binomial logistic regression (max likelihood, not regularized) with stochastic gradient descent. [223000780090] |They don’t give many details about their implementation other than say it was a naive Perl implementation that was fast enough for doing evaluations. [223000780100] |Your typical “online” learner requires training examples to be processed one-at-a-time. [223000780110] |This solves the large-scale problem, in that you only need to bring one example into memory at a time, assuming the parameters fit in memory. [223000780120] |Even so, online learners like Bottou’s SGD and our own logistic regression still load all the data into memory, and then make multiple passes over it. [223000780130] |Langford, Li and Strehl’s Vowpal Wabbit doesn’t load all the data into memory. [223000780140] |VW even bounds the size of the features so a constant stream of new features won’t blow out memory (see my previous blog entry on using hash codes as features). [223000780150] |Unfortunately, VW runs from the command-line and requires the training examples to be in a data file before the program starts. [223000780160] |This could be modified, because the good folks at Yahoo! released it with source under a BSD license. [223000780170] |Handling logistic regression in an online setting without regularization (priors) is straightforward. [223000780180] |You just use stochastic gradient descent and take the loop over the data out of the algorithm and make it a function that updates the parameters. [223000780190] |All you need is a set of parameter vectors for coefficients that’s resizable. [223000780200] |This could either be done by making them map-like structures (presumably what Goodman and Yih did in Perl), or resizing them like Java’s util.ArrayList. [223000780210] |The latter is preferable from a size and speed point of view. [223000780220] |The algorithm won’t converge in a single pass the way the multiple pass algorithms do, but it’ll be fast and quite straightforward. [223000780230] |This leaves two problems. [223000780240] |First, how do we handle the learning rate? [223000780250] |We can just leave it as a constant. [223000780260] |Annealing doesn’t seem to make much sense, as the later examples will count less and less. [223000780270] |This may be just the opposite of what you want in an online learning setting, where more recent information may be more representative of what’s to follow. 
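For concreteness, here's a rough sketch of what such a truly online learner might look like: a resizable coefficient vector, a constant-time update per example, and a learning rate that can be reset at any time. The class and method names are hypothetical, and regularization is deliberately left out, which is exactly the open question discussed below.

    import java.util.ArrayList;

    // Sketch of a truly online binary logistic regression learner with a
    // resizable coefficient vector and a settable learning rate. Hypothetical
    // class and method names; regularization is intentionally omitted.
    public class OnlineLogisticRegressionSketch {

        private final ArrayList<Double> beta = new ArrayList<Double>();
        private double learningRate;

        public OnlineLogisticRegressionSketch(double initialLearningRate) {
            this.learningRate = initialLearningRate;
        }

        public void setLearningRate(double rate) {
            this.learningRate = rate;
        }

        // probability the sparse input (parallel feature id / value arrays) is positive
        public double classify(int[] featureIds, double[] values) {
            double sum = 0.0;
            for (int k = 0; k < featureIds.length; ++k)
                sum += values[k] * coefficient(featureIds[k]);
            return 1.0 / (1.0 + Math.exp(-sum));
        }

        // one training step: classify, then update only the dimensions that
        // fired, in time proportional to the number of active features
        public void handle(int[] featureIds, double[] values, int label) {
            double error = label - classify(featureIds, values);
            for (int k = 0; k < featureIds.length; ++k) {
                int id = featureIds[k];
                setCoefficient(id, coefficient(id) + learningRate * error * values[k]);
            }
        }

        private double coefficient(int featureId) {
            return featureId < beta.size() ? beta.get(featureId) : 0.0;
        }

        private void setCoefficient(int featureId, double value) {
            while (beta.size() <= featureId)
                beta.add(0.0); // grow the vector as new features show up
            beta.set(featureId, value);
        }
    }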
[223000780280] |In an API, we can just have a method to set the learning rate at any given point. [223000780290] |The second problem is what to do about regularization. [223000780300] |In the standard Bayesian maximum a posteriori estimation setting, the prior is fixed ahead of time and contributes a single error term for all of the data. [223000780310] |In a setting where you know the number of training instances, it’s easy to handle. [223000780320] |You just regularize at a rate of 1/numberOfTrainingExamples per example. [223000780330] |Regularization can be done lazily to preserve sparseness, as I point out in my tech report on lazy regularized SGD. [223000780340] |What I don’t know is how to do regularization online. [223000780350] |What I’ve done in a preliminary implementation of truly online logistic regression is set a per-instance regularization discount. [223000780360] |For instance, in a corpus with 100 training instances, that’d be 1/100. [223000780370] |But what should the regularization discount factor be for an unbounded number of instances when you have to apply it online? [223000780380] |If we leave this discount at a constant, it has the effect of reducing the variance of the prior as the number of training examples increases. [223000780390] |Normally, the data would outweigh any fixed prior. [223000780400] |But with a constant regularization discount factor, the prior keeps up at its regular pace for all the data. [223000780410] |This might even be desirable, as it will keep tamping down the coefficients. [223000780420] |For instance, it’ll guarantee that a feature seen only once will eventually go to zero under a Laplace prior (and approach zero with a Gaussian or Cauchy prior). [223000780430] |It’d also be nice to be able to serialize the learner at any point, as well as compile it to a fixed representation, which catches up all of the lazy online regularization. [223000790010] |Per-Tag Error Function for CRFs and MEMMs for Removing Label Bias [223000790020] |As you may have noticed from the posts, I (Bob) have been swotting up on discriminative methods, such as logistic regression and conditional random fields (CRFs). [223000790030] |A couple posts ago, I mentioned Martin Jansche’s training of a logistic regression-like classifier based on (an approximation of) F-measure error. [223000790040] |The goal was to build a classifier with a better F-measure than one trained on traditional log loss. [223000790050] |Along these same lines, I came across a paper minimizing yet another error function, this time for sequence taggers: [223000790060] |
  • Kakade, Sham, Teh, Yee-Whye, and Sam T. Roweis. 2002. [223000790070] |An Alternate Objective Function for Markovian Fields. [223000790080] |Procs of ICML
  • [223000790090] |Kakade et al. look at discriminative sequence tagging models including max-entropy Markov models (MEMM) and conditional random fields (CRF). [223000790100] |MEMMs and CRFs are both trained like logistic regression using log loss. [223000790110] |The key point is that this is defined on a per-input basis, not a per-tag basis. [223000790120] |Kakade et al. use a per-tag error function instead, with some surprising results. [223000790130] |Here’s the general setup. [223000790140] |The training data consists of n pairs of tag/token sequences tags[i], tokens[i] for i = 1,…,n. [223000790150] |The models (either CRFs or MEMMs) supply probability estimates p(tags[i] | tokens[i], β), and the error for parameters β is the sum of the log loss values per input in the data: Err(β) = - Σ_i log p(tags[i] | tokens[i], β). [223000790160] |The chain rule reduces this for bigram context models such as MEMMs (but not CRFs) to Err(β) = - Σ_i Σ_j log p(tag[i][j] | tag[i][j-1], tokens[i], β), [223000790170] |and as they point out, this decomposition is the source of the so-called label-bias problem in MEMMs as originally discussed in Léon Bottou’s thesis. [223000790180] |What Kakade et al. do is replace this whole-sequence error with the following error function, which is decomposed per tag using marginal tag probabilities: Err(β) = - Σ_i Σ_j log p(tag[i][j] | tokens[i], β). [223000790190] |As with Jansche’s paper, the goal is to devise an error function that matches the evaluation function more closely (e.g. per-tag part-of-speech tagging error). [223000790200] |And also like Jansche’s paper, the work is all in the gradient calculations of the new error function. [223000790210] |That, and the actual training algorithm, which, as you can guess, makes use of the forward-backward algorithm to compute p(tag[i][j] | tokens[i]) (note that this use of forward-backward is implemented in LingPipe’s HMM tag-word lattice class). [223000790220] |What’s really neat about this technique is how well it works. [223000790230] |They use a synthetic example, and one example with very high accuracy (McCallum’s FAQ dataset). [223000790240] |It would sure be nice to see if this worked on a data set we care more about, like part-of-speech tagging. [223000790250] |The big problem is that per-tag optimized output is not very suitable for tagging in support of named entity extraction or other chunking tasks, because the first-best output isn’t guaranteed to be consistent. [223000800010] |Book: Building Search Applications: Lucene, LingPipe and Gate [223000800020] |

    There’s a book about LingPipe!

    [223000800030] |
  • Konchady, Manu. 2008. [223000800040] |Building Search Applications: Lucene, LingPipe, and Gate. [223000800050] |Mustru Publishing.
  • [223000800060] |The title is linked to the Amazon page; it’s also available as an inexpensive download from Lulu. [223000800070] |

    The Bottom Line

    [223000800080] |The subtitle “A practical guide to building search applications using open source software” pretty much sums it up (comment added June 22, 2008: please see Seth Grimes’s comment below about LingPipe’s royalty-free license not being compatible with other open-source licenses). [223000800090] |It takes a reader that knows Java, but nothing at all about search or associated text processing algorithms, and provides a hands-on, step-by-step guide for building a state-of-the-art search engine. [223000800100] |I (Bob) gave Manu feedback on a draft, but there wasn’t much to correct on the LingPipe side, so I can vouch for the book’s technical accuracy. [223000800110] |(Disclaimer: I didn’t actually try to run the code.) [223000800120] |

    Chapter by Chapter Overview

    [223000800130] |After (1) a brief discussion of application issues, the chapters include (2) tokenization in all three frameworks, (3) indexing with Lucene, (4) searching with Lucene, (5) sentence extraction, part-of-speech tagging, interesting/significant phrase extraction, and entity extraction with LingPipe and Gate, (6) clustering with LingPipe, (7) topic and language classification with LingPipe, (8) enterprise and web search, page rank/authority calculation, and crawling with Nutch, (9) tracking news, sentiment analysis with LingPipe, detecting offensive content and plagiarism, and finally, (10) future directions including vertical search, tag-based search and question-answering. [223000800140] |For those wanting introductions to the LingPipe APIs mentioned above, Konchady’s book is a gentler starting point than our own tutorials. [223000800150] |That may sound like a whole lot of ground to cover in 400 pages, but Konchady pulls the reader along by illustrating everything with working code and not getting bogged down in theoretical boundary conditions. [223000800160] |There are pointers to theory, and a bit of math where necessary, but the book never loses sight of its goal of providing a practical introduction. [223000800170] |In that way, it’s like the Manning in Action series. [223000800180] |The book’s hot off the presses, so it’s up to date with Lucene 2.3 and LingPipe 3.3. [223000800190] |

    About the Author

    [223000800200] |Manu Konchady’s an old hand at search and text processing. [223000800210] |You may remember him from such books as Text Mining Application Programming and High Performance Parallel Algorithms for Scientific Computing with Application to a Coupled Ocean Model. [223000810010] |Shame and Joy of Open Source: Perf Bug in FastCache [223000810020] |Hats off to Brian Wilson of Endeca for profiling and then debugging a huge problem in my (Bob’s) implementation of com.aliasi.util.FastCache. [223000810030] |The joy is that Brian could find and diagnose the bug; the shame is mine for having made this colossal blunder in the first place and then not spotting it with size-sensitive tests. [223000810040] |The immediate upshot is that I’m going to roll out a 3.5.1 version with the bug patched this weekend. [223000810050] |The fast cache is designed to be a thread-safe frequency-of-use based cache with pruning triggered by a new entry exceeding the load factor. [223000810060] |The pruning is supposed to scale counts by dividing by 2 and then discard records with zero counts. [223000810070] |You might be able to spot the problem yourself by looking at the implementation, which is online at: 3.5.0 version of FastCache.java. [223000810080] |I’ll give you a hint. [223000810090] |It’s in the prune() method. [223000810100] |Specifically, in this line: [223000810110] |Here’s what it should be: [223000810120] |Here’s the 3.5.1 version of FastCache.java. [223000810130] |The problem was that I didn’t decrement the counts. [223000810140] |So why didn’t we notice this before? [223000810150] |Because the implementation in place now is not only sound (that is, it produces the right answers), but it’s also relatively well-behaved in the face of power-law distributed natural language data. [223000810160] |Every time the size threshold gets exceeded, all records with counts of 3 or less get discarded. [223000810170] |Since language is dominated by 1 and 2 counts in the tail, this isn’t such a problem in processes that don’t run forever on a really diverse set of data with lots of local duplicate terms. [223000810180] |The other problem is that I only meant to divide by 2, not by 4, which a right-shift (zero filled) by 2 achieves. [223000810190] |I’ll blame the psycholinguistic lexical priming effect for the 2 bug (I was thinking divide by 2 so inserted a 2). [223000810200] |This is a very serious bug because not only does it allow the cache to grow without bound, but once the cache exceeds capacity with records of count 4 and above, every new entry triggers a visit to the entire cache. [223000810210] |In the original implementation, pruning recreated a whole bunch of new map entry records, and also a whole bunch of soft reference wrappers. [223000810220] |I also tweaked the code so as not to recreate records, but just modify the ones that exist. [223000810230] |This is the clue that Brian found in profiling the code. [223000810240] |And even then I didn’t spot the bug. [223000810250] |Ironically, I’m in Columbus for the 2008 Assoc for Comp Ling Conference, where I’m co-organizing (with Kevin Bretonnel Cohen, who did most of the work) a workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing. [223000820010] |The Curse of “Intelligent” Tokenization [223000820020] |We’re now running into a problem we’ve run into before: so-called “intelligent” tokenization.
[223000820030] |The earliest version of this of which I’m aware is the Penn Treebank tokenization, which assumes sentence splitting has already been done. [223000820040] |That way, the end-of-sentence punctuation can be treated differently from other punctuation. [223000820050] |Specifically, “Mr. Smith ran. [223000820060] |Then he jumped.” gets split into two sentences, “Mr. Smith ran.” and “Then he jumped.”. [223000820070] |Now the fun starts. [223000820080] |The periods at the end of a sentence are split off into their own token. [223000820090] |The period after “Mr” remains, so the tokens are “Mr.”, “Smith”, “ran” and “.”. [223000820100] |Note that the Treebank tokenizer also replaces double quotes with either left or right LaTeX-style quotes, so there’s no way to reconstruct the underlying text from the tokens. [223000820110] |Like many other projects, they also throw away the whitespace information in the data, so there’s no way to train something to do tokenization that’s whitespace sensitive because we just don’t have the whitespace. [223000820120] |That’s what you get for releasing data like “John/PN ran/V ./PUNCT”: you just don’t know if there was a space between that final verb “ran” and the full stop “.”. [223000820130] |You’ll also see their script builds in all sorts of knowledge about English, such as a handful of contractions, so that “cannot” is split out into two tokens, “can” and “not”. [223000820140] |The most recent form of “intelligent” tokenization I’ve seen is perpetrated by UPenn, this time as part of their BioIE project. [223000820150] |There, the data’s not even consistently tokenized, because they left it to annotators to decide on token boundaries. [223000820160] |They then use statistical chunkers to uncover the tokenizations probabilistically. [223000820170] |Google’s n-gram data is also distributed without a tokenizer. [223000820180] |It looks very simple, but there are lots of heuristic boundary conditions that make it very hard to run on new text. [223000820190] |Practically speaking, the data’s out of bounds anyway because of its research-only license. [223000820200] |Unlike the Penn Treebank or French Treebank, there’s no option to buy commercial licenses. [223000820210] |I’ve just been struggling with the French Treebank, which follows the Penn Treebank’s lead in using “intelligent” tokenization. [223000820220] |The problem for us is that we don’t know French, so we can’t quite read the technical docs (in French), nor induce from the corpus how tokenization was done. [223000820230] |This is all terribly problematic for the “traditional” parsing model of first running tokenization, then running analysis. [223000820240] |The problem is that the tokenization depends on the analysis and vice-versa. [223000820250] |At least with the Penn approach, there’s code to do their ad hoc sentence splitting and then their ad hoc heuristic tokenization. [223000820260] |It may not be coherent from an English point of view (a handful of contractions will be split; others won’t), but at least it’s reproducible. [223000820270] |Our own approach (in practice; in theory we can plug and play any tokenizer that can be coded) has been to take very fine-grained tokenizations so that the tokenization would be compatible with any old kind of tagger. [223000820280] |Our HMM chunker pays attention to tokenization, but the rescoring chunker uses whitespace and longer-distance token information.
[223000820290] |At the BioNLP workshop at ACL 2008, Peter Corbett presented a paper (with Anne Copestake) on Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition. [223000820300] |It’s a really neat paper that addresses issues of confidence estimation, and particularly trading precision for recall (or vice versa). [223000820310] |But they weren’t able to reproduce our 99.99% recall gene/protein name extraction. [223000820320] |During the question period, we got to the bottom of what was going on, which turned out to be intelligent tokenization making mistakes so that entities weren’t extractable because they were only parts of tokens. [223000820330] |I’m hoping Peter does the analysis to see how many entities are permanently lost due to tokenization errors. [223000820340] |So why do people do “intelligent” tokenization? [223000820350] |The hope is that by making the token decisions more intelligently, downstream processing like part-of-speech tagging is easier. [223000820360] |For instance, it’s difficult to even make sense of assigning part-of-speech tags to three tokens derived from “p-53”, namely “p”, “-” and “53”. [223000820370] |Especially if you throw away whitespace information. [223000820380] |Tokenization is particularly vexing in the bio-medical text domain, where there are tons of words (or at least phrasal lexical entries) that contain parentheses, hyphens, and so on. [223000820390] |This turned out to be a problem for WordNet. [223000820400] |In some sense, tokenization is even more vexing in Chinese, which isn’t written with spaces. [223000820410] |To get around that problem, our named-entity detector just treats each character as a token; that worked pretty well for our entry in the SIGHAN 3 bakeoff. [223000820420] |There were even two papers on jointly modeling segmentation and tagging for Chinese at this year’s ACL (Jiang et al. and Zhang et al.). [223000820430] |Joint modeling of this kind seems like a promising approach to allowing “intelligent” tokenization; by extending the tokenization model far enough, we could even maintain high end-to-end recall, which is not possible with a state-of-the-art first-best probabilistic tokenizer. [223000830010] |Lucene’s Missing Token Stream Factory [223000830020] |While we’re on the subject of tokenization, every time I (Bob) use Lucene, I’m struck by its lack of a tokenizer factory abstraction. [223000830030] |Lucene’s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens. [223000830040] |Lucene uses the abstract class Token for tokens. [223000830050] |A Lucene token contains a string representing its text, a start and end position, and a lexical type which I’ve never seen used. [223000830060] |Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn’t create objects for them beyond their string texts. [223000830070] |But that’s the subject of another blog entry. [223000830080] |A Lucene Document is essentially a mapping from field names, represented as strings, to values, also represented as strings. [223000830090] |Each field in a document may be treated differently with respect to tokenization. [223000830100] |For instance, some might be dates, others ordinary text, and others keyword identifiers.
[223000830110] |To handle this fielded document structure, the Lucene class analysis.Analyzer maps field names, represented as strings, and values, represented as instances of java.io.Reader, to token streams. [223000830120] |The choice of Reader for values is itself puzzling because it introduces I/O exceptions and the question of who’s responsible for closing the reader. [223000830130] |Lucene overloads the analyzer class itself to provide the functionality of LingPipe’s tokenizer factory. [223000830140] |Lucene classes such as SimpleAnalyzer and CJKAnalyzer return the same token stream no matter which field is specified. [223000830150] |In other words, the field name is ignored. [223000830160] |What would be useful would be a Lucene interface analysis.TokenStreamFactory with a simple method TokenStream tokenizer(CharSequence input) (note how we’ve replaced the analyzer’s reader input with a generic string, i.e. a CharSequence). [223000830170] |Then analyzers could be built by associating token stream factories with fields. [223000830180] |This would be the natural place to implement Lucene’s simple analyzer, CJK analyzer, and so on. [223000830190] |The current analyzer behavior is then easily derived with an analyzer which sets a default token stream factory for fields. [223000850010] |How Can I Build a Classifier with no Negative Data? [223000850020] |As part of our NIH grant, we’re working on the database linkage problem from gene/protein mentions in MEDLINE to database entries in EntrezGene. [223000850030] |Basically, it’s what the biologists call "gene normalization", and was the basis of Biocreative Task 2. [223000850040] |I can summarize the problem we’re having with a simple example. [223000850050] |We’d like to classify all 17M or so entries in MEDLINE as to whether they’re about genomics or not. [223000850060] |EntrezGene provides links to 200K citations that are about particular genes, so we have a pile of articles about genomics (making up about 300 million characters). [223000850070] |What we don’t have is any negative training data. [223000850080] |So my question is: how do I build a classifier for articles about genomics versus those that are not about genomics? [223000850090] |The job running in the background, giving me time to write this post, is generating cross-validated cross-entropy rates for all of these 200K citations. [223000850100] |That is, I train a character-level language model on 180K citations and use it to evaluate the other 20K, for all possible choices. [223000850110] |This gives me a reference set of scores for positive examples (assuming there’s no bias in that 200K set, which there is in terms of recency and particular gene focus, not to mention focus on human genomics). [223000850120] |I’m going to plot this and see what the curve looks like. [223000850130] |Empirically, we can then set a threshold that would accept 99% of the articles we have. [223000850140] |Unfortunately, I have no idea how well this’ll work in practice at rejecting the articles that aren’t about genomics. [223000850150] |For the genomics/non-genomics problem, we can just annotate a few thousand examples; it’ll only take a day or two. [223000850160] |The real problem is that we want to build models to classify contexts for the 30K or so human gene entries in Entrez. [223000850170] |Some of them have a handful of example docs, some have hundreds. [223000850180] |We’re going to pull out articles with potential mentions, then filter with the classifier.
[223000850190] |It’s related to what we did in Phase I of our grant and reported in our gene linkage white paper. [223000850200] |In that setting, we can generate candidate docs using approximate matching of aliases, then rank the candidate docs according to their language model scores against the known articles for the gene in question. [223000850210] |This is great in a search context, but doesn’t give us a go/no-go decision point, which we need for some of our downstream applications. [223000850220] |If anyone knows how to tackle this problem, I’d love to hear about it. [223000850230] |I might even implement it as part of LingPipe if the idea’s simple and general enough. [223000860010] |Good Kappa’s not Enough [223000860020] |I stumbled across Reidsma and Carletta’s Reliability measurement without limits, which is pending publication as a Computational Linguistics journal squib (no, not a non-magical squib, but a linguistics squib). [223000860030] |The issue they bring up is that if we’re annotating data, a high value for the kappa statistic isn’t enough to guarantee what they call "reliability". [223000860040] |The problem is that the disagreements may not be random. [223000860050] |They focus on simulating the case where an annotator over-uses a label, which results in kappa greatly overestimating reliability relative to performance measured against the truth. [223000860060] |The reason is that the statistical model will be able to pick up on the pattern of mistakes and reproduce them, making the task look more reliable than it is. [223000860070] |This discussion is similar to the case we’ve been worrying about here in trying to figure out how we can annotate a named-entity corpus with high recall. [223000860080] |If there are hard cases that annotators miss (over-using the no-entity label), random agreement assuming equally hard problems will over-estimate the "true" recall. [223000860090] |Reidsma and Carletta’s simulation shows that there’s a strong effect from the relationship between true labels and features of the instances (as measured by Cramer’s phi). [223000860100] |

    Review of Cohen’s Kappa

    [223000860110] |Recall that kappa is a "chance-adjusted measure of agreement", which has been widely used in computational linguistics since Carletta’s 1996 squib on kappa, defined by kappa = (agreement - chanceAgreement) / (1 - chanceAgreement), [223000860120] |where agreement is just the empirical percentage of cases on which annotators agree, and chanceAgreement is the percentage of cases on which they’d agree by chance. [223000860130] |For Cohen’s kappa, chance agreement is measured by assuming annotators pick categories at random according to their own empirical category distribution (but there are lots of variants, as pointed out in this Artstein and Poesio tech report, a version of which is also in press at The Journal of Kappa Studies, aka Computational Linguistics). [223000860140] |Kappa values will range between -1 and 1, with 1 only occurring if the annotators agree perfectly. [223000860150] |I (Bob) don’t like kappa, because it’s not estimating a probability (despite being an arithmetic combination of [maximum likelihood] probability estimates). [223000860160] |The only reason to adjust for chance is that it allows one, in theory, to compare different tasks. [223000860170] |The way this plays out in practice is that an arbitrary cross-task threshold is defined above which a labeling task is considered "reliable". [223000860180] |A final suggestion for those using kappa: confidence intervals from bootstrap resampling would be useful to see how reliable the estimate of kappa itself is. [223000870010] |Good Kappa’s Not Necessary, Either [223000870020] |My last blog post, Good Kappa’s Not Enough, summarized Reidsma and Carletta’s arguments that a good kappa score is not sufficient for agreement. [223000870030] |In this post, I’d like to point out why it’s not necessary, either. [223000870040] |My real goal’s to refocus the problem on discovering when a gold standard can be trusted. [223000870050] |Suppose we have five annotators who are each 80% accurate for a binary classification problem whose true category distribution is 50-50. [223000870060] |Now let’s say they annotate an example item (1,1,1,1,0), meaning annotator 1 assigns category 1, annotator 2 assigns category 1, up through annotator 5 who assigns category 0. [223000870070] |What can we conclude about the example? [223000870080] |Assuming the errors are independent (as kappa does), what’s the likelihood that the example really is of category 1 versus category 0? [223000870090] |Bayes’ rule lets us calculate p(category 1 | annotations) = (0.8^4 * 0.2) / (0.8^4 * 0.2 + 0.2^4 * 0.8), which is roughly 0.985. [223000870100] |Recall the definition of kappa from above. [223000870110] |If errors are distributed randomly, agreement will be around 0.8^2 + 0.2^2 = 0.68 in a large sample, and the chance agreement will be 0.5^2 + 0.5^2 = 0.5, for a kappa value of (0.68-0.5)/(1-0.5)=0.36. [223000870120] |That’s a level of kappa that leads those who follow kappa to say “go back and rethink your task, your agreement’s not high enough”. [223000870130] |Unfortunately, with 80% per-annotator accuracy, we only expect 74% or so of the examples to have a 4:1 or 5:0 vote by 80% accurate annotators (74% ≈ 5 * 0.8^4 * 0.2^1 + 0.8^5). [223000870140] |I believe the question we really care about is when we can trust an annotation enough to put it in the gold standard. [223000870150] |So let’s say we have two 80% accurate annotators and the true category is 1. [223000870160] |The likelihoods of the various annotation outcomes are: both correct with probability 0.8 * 0.8 = 0.64, a split vote with probability 2 * 0.8 * 0.2 = 0.32, and both wrong with probability 0.2 * 0.2 = 0.04, so even when the two agree, the chance they’re right is only 0.64 / (0.64 + 0.04), or about 94%. [223000870170] |So clearly two annotators aren’t enough to be confident to the 99% level. [223000870180] |We’d need 90% accurate annotators for that.
[223000870190] |But what about three annotators? [223000870200] |The chance of three 80% accurate annotators all agreeing on the wrong category is only 0.2^3 = 0.8%. [223000870210] |And in 0.8^3 = 51.2% of the cases, they will agree and be right. [223000870220] |So we use a minimum of three annotators, and if they agree, go on. [223000870230] |If they disagree, we need to bring in more annotators until we’re confident of the outcome. [223000870240] |When we get to a four out of five vote, as in our first example, we’re confident again. [223000870250] |But even 3/4 agreement is still pretty weak, yielding only a 94% chance that the agreed upon value is correct. [223000870260] |Of course, this line of reasoning supposes we know the annotators’ accuracy. [223000870270] |In practice, we can’t evaluate an annotator’s accuracy because we don’t know the true labels for items. [223000880010] |SearchMe.com is Useful [223000880020] |I’m repeating pretty much verbatim a comment I left on Páraic Sheridan’s blog, Returned Emigrant, in response to his post Search is not a solved problem. [223000880030] |Discussing whether search is a solved problem reminds me of a talk Hynek Hermansky gave at ICSLP in Sydney in ’98 on why speech recognition isn’t a solved problem. [223000880040] |Hynek made an analogy to flight, where at any point between balloons and jet airplanes, flight might’ve been considered solved. [223000880050] |SearchMe.com is great. [223000880060] |They provide reduced-resolution page views with overlaid snippets, and it’s fast. [223000880070] |This just feels right for navigation searches (e.g. typing “lingpipe” to try to find our home page or blog). [223000880080] |And the page views add more value than I could’ve imagined to the snippets. [223000880090] |Like Páraic, I’m so enamored of it that it’s replacing Google as my default search engine on Firefox (just click on the link in the upper right of searchme.com’s home page). [223000880100] |I sure hope they can scale as more people find out about them. [223000880110] |Cuil.com, by focusing on recall (and marketing), seems less useful, even if they get the bugs ironed out. [223000880120] |Despite the fact that we’re focusing on recall for genomics information extraction tasks, I’ve never felt recall was an issue for most web searches. [223000880130] |I could use more approximate and contextual matching, perhaps, but the index size has never seemed an issue. [223000880140] |I miss Excite.com, which used to run TF/IDF rather than social-network-based search ranking algorithms. [223000880150] |I missed Excite even as I was starting to use Google for many searches. [223000880160] |But then again, if Excite had been successful, we wouldn’t have the Apache Lucene search engine. [223000880170] |PowerSet.com was focusing on some kind of precision and question answering (and marketing), which I also felt was of questionable value (for me as a searcher; it clearly worked for their VCs and founders) compared to using Google. [223000880180] |Plus, they never showed (at least to the public) that their tech scaled either in complexity (different page types, multiple pages for entities) or in size (number of web pages). [223000890010] |Epidemiologists’ Bayesian Latent Class Models of Inter-Annotator Agreement [223000890020] |I (Bob) have been looking at the inter-annotator agreement and gold-standard adjudication problems.
[223000890030] |I’ve also been hanging out with Andrew Gelman and Jennifer Hill thinking about multiple imputation, which has me studying their regression book, which covers the item-response (Rasch) model, and Gelman et al.’s book, which describes the beta-binomial model. [223000890040] |After thinking about how these tools could be put together, I believed I might have a novel approach to modeling inter-annotator agreement. [223000890050] |I got so excited I went and wrote a 30-page paper based on a series of simulations in R and BUGS, which may still turn out to be useful as an introduction to these approaches. [223000890060] |It’s already been useful in giving me practice in building hierarchical models, running simulations, and generating graphics from BUGS and R. [223000890070] |I even analyzed one of our customer’s real annotated data sets (more on that to follow, I hope, as they had 10 annotators look at 1000 examples each over half a dozen categories). [223000890080] |Alas, the epidemiologists have beaten me to the punch. [223000890090] |They generalized the item-response model in exactly the way I wanted to and have even implemented it in WinBUGS. [223000890100] |In fact, they even used pretty much the same variable names! [223000890110] |It just goes to show how much people with the same tools (hierarchical Bayesian modeling, logistic regression, beta-binomial distributions, 0/1 classification problems, sensitivity vs. specificity distinctions) tend to build the same things. [223000890120] |If you can only read one paper, make it this one (if you can find it; I had to schlep up to Columbia where I can download papers on campus): [223000890130] |
  • Albert, Paul S. and Lori E. Dodd. 2004. [223000890140] |A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error without a Gold Standard. [223000890150] |Biometrics 60:427-435.
  • [223000890160] |It cites most of the relevant literature other than (Dawid and Skene 1979), who, as far as I can tell, first introduced latent class models into this domain using EM for estimation. [223000890170] |To make a long story short, the model is quite simple in modern sampling/graphical notation, and just about as easy to code up in BUGS. [223000890180] |Here’s the model without accounting for missing data. [223000890190] |First the variable key: [223000890200] |Recall that sensitivity is accuracy on positive cases, which is just recall, TP/(TP+FN), whereas specificity is just accuracy on negative cases, TN/(TN+FP). [223000890210] |Precision is TP/(TP+FP), but that doesn’t account for TN cases, and is thus incomplete as a full probability specification when paired with recall. [223000890220] |That’s why ROC curves, which plot specificity vs. sensitivity, are more popular than precision-recall curves in the rest of the civilized world. [223000890230] |Other than the annotations x[i,j], all other variables are unknown. [223000890240] |That includes the prevalence pi, the true categories c, the annotator specificities and selectivities a[0] and a[1]. [223000890250] |The sampling model without the noninformative priors: [223000890260] |That’s it. [223000890270] |Not that complex as these things go. [223000890280] |Scaling the difficulties to have 0 mean and variance 1 identifies the scale of the model; as Gelman and Hill describe in their book, there are lots of ways this can be done, including only scaling the mean of the difficulties to be 0. [223000890290] |There are lots of different priors that could be put on what’s here the logit(a[m,j]) terms. [223000890300] |The more traditional thing to do in this kind of model is to use a normal prior. [223000890310] |In any case, you’re not going to be able to estimate the priors for specificity and selectivity with only a handful of annotators. [223000890320] |There’s a simplified version of this model mentioned in Dodd and Albert where items are divided into easy and regular cases, with the easy cases having all annotators agree and regular cases having annotators respond independently according to their own specificity and selectivity. [223000890330] |The point of the Albert and Dodd paper cited above wasn’t to introduce these models, but to evaluate a range of them one against the other by simulating data in one model and evaluating it in the other. [223000890340] |They also evaluated real data and saw how it fit differently in the different models. [223000890350] |I should also point out that the following paper mentions the Dawid and Skene model in the computational linguistics literature: [223000890360] |
  • Bruce, Rebecca F. and Janyce M. Wiebe. 1999. [223000890370] |Recognizing subjectivity: a case study of manual tagging. [223000890380] |Natural Language Engineering 1:1-16.
  • [223000890390] |Bruce and Wiebe even talk about using the posterior distribution over true categories as a proxy for a gold standard, which seems to be the right way to go with this work. [223000890400] |But they use a single latent variable (roughly problem difficulty) and don’t separately model annotator specificity from selectivity, which is critical in both the simulations and real world data analyses I’ve done recently. [223000890410] |The depressing conclusion for NLP and other applications of classifiers is that it’s clear that with only 3 annotators, it’s going to be impossible to get a gold standard of very high purity. [223000890420] |Even with 5 annotators, there are going to be lots of difficult cases. [223000890430] |The other applications besides inter-annotator agreement that I’ve run across in the past couple of days are educational testing, epidemiology of infections and evaluating multiple tests (e.g. stool inspection and serology), evaluations of health care facilities in multiple categories, evaluations of dentists and their agreement on caries (pre-cavities), adjustments to genome-wide prevalence assertions, and many more. [223000900010] |Hunter/Gatherer vs Farming with Respect to Information Access [223000900020] |Antonio Valderrabanos of Bitext.com gave a talk at NYU today in which he compared the current search/NLP strategies of information providers to humanity’s hunter/gatherer stage and offered a vision of information farming. [223000900030] |I kept having images of Google’s web spider out digging roots and chasing animals with a pointed stick, wearing a grubby little loincloth. [223000900040] |Then I would switch to images of a farm-stocked supermarket with well-organized shelves, helpful clerks and lots of choice. [223000900050] |The analogy brought up a strong bias that I have in applying natural language processing (NLP) to real-world problems: I generally assume that the software must encounter text as it occurs in the “wild”; after all, that is what humans do so well, and we are in the business of emulating human language processing, right? [223000900060] |Nope, not on the farm we’re not. [223000900070] |On the farm we use NLP to help add information that was never part of the standard written form. [223000900080] |We use NLP to suggest and assign meta tags, connect entities to databases of concepts and create new database entries for new entities. [223000900090] |These are things that humans are horrible at, but humans are excellent at choosing among NLP-driven suggestions, and NLP is pretty good at making suggestions. [223000900100] |So NLP is helping create the tools to index and cross-reference all the information in the supermarket at the concept level. [223000900110] |Humans function as filters of what is correct. [223000900120] |At least initially. [223000900130] |As the information supermarket gets bigger, the quality of the (machine-learning based) NLP will get better, perhaps good enough to start automatically bringing in “wild” information with decent concept indexing and meta tagging. [223000900140] |A keyword index is a crude yet effective tool, but an inventory system it is not, and that is what we need to advance to the next level of information access. [223000910010] |Where’s Georgia? DB Linkage isn’t Easy [223000910020] |In case you didn’t see the link on Slashdot, Google was supplying maps of the American state Georgia when they should’ve been linking to the Caucasian country Georgia.
[223000910030] |As Homer (the Simpson, not the classical Greek poet, the Alaskan city, the Illinois city, the American painter, or the tunnel in New Zealand) would say, “D’oh!” [223000910040] |Slashdot was just picking up the story from Valleywag. [223000910050] |Google didn’t take this lying down; the current page shows a map centered on Vienna, which, if you click on it, takes you to New York, saying “U.N.”. [223000910060] |The problem is that linking textual mentions to database entries is non-trivial, even for the relatively simple problem of geo-location. [223000910070] |This is the business Metacarta is in, and users we’ve talked to say they do a very good job of it. [223000910080] |You see similar problems for products, as in the app formerly known as Froogle or ShopWiki, as the Slashdot story pointed out in the case of yet another search engine mismatching product photos. [223000910090] |We ran into this problem trying to find rap artists in text, who have names like “The Game.” [223000910100] |And we’re battling the problem in our ongoing NIH project on linking gene and protein mentions in text to Entrez-Gene. [223000910110] |I’ve noted before that the NY Times site runs into problems in cases such as distinguishing the Pittsburgh suburb named Mount Lebanon from the Middle Eastern country Lebanon. [223000910120] |In general, text matching doesn’t work really well by itself. [223000910130] |You run into false positives by linking the Pittsburgh suburb to the Middle East, but you get false negatives if you just don’t link “Mount Lebanon”. [223000910140] |But the Times has full editorial control, so they can catch these problems and fix them manually during copy editing. [223000910150] |To solve this problem automatically with better results than plain text matching, we need context, which is available on the web in the form of both texts and links. [223000910160] |We discuss this in our clustering tutorial, which uses context to disambiguate multiple John Smiths in the news, and in our white paper on gene linkage. [223000910170] |This basically reduces the linkage problem to that of word sense disambiguation. [223000910180] |The only real problem (besides the remaining difficult-to-disambiguate cases) is that it’s relatively storage- and compute-intensive to use context compared to a simple dictionary matcher. [223000920010] |Hierarchical Bayesian Models of Categorical Data Annotation [223000920020] |Here’s a 2-page write-up of one of the models I’ve been looking at for evaluating data annotation in order to assess coding standards and annotator sensitivity and specificity: [223000920030] |
  • Carpenter, Bob. 2008. [223000920040] |Hierarchical Bayesian Models of Categorical Data Analysis.
  • [223000920050] |I’ve submitted it as a poster to the New York Academy of Sciences 3rd Annual Machine Learning Symposium, which will be October 10, 2008. [223000920060] |Please let me know what you think (carp@alias-i.com). [223000920070] |I didn’t have room to squeeze in the more complex model that accounts for “easy” items. [223000920080] |This model and the one for easy items derive from the epidemiology literature (cited in the paper), where they’re trying to estimate disease prevalence from a heterogeneous set of tests. [223000920090] |I’ve added some more general Bayesian reasoning, and suggested applications for annotation (though Bruce and Wiebe were mostly there in their 1999 paper, which I cite), and for training using probabilistic supervision (don’t think anyone’s done this yet). [223000920100] |I’m happy to share the R scripts and BUGS models I used to generate the data, fit the models, and display the results. [223000920110] |I’d also love to know how to get rid of those useless vertical axes in the posterior histogram plots. [223000930010] |Biocreative II Gene Mention Evaluation Paper [223000930020] |My first paper that’ll be indexed in MEDLINE! [223000930030] |Too bad I’m the 19th of 30-some authors: [223000930040] |
[223000930010] |BioCreative II Gene Mention Evaluation Paper
[223000930020] |My first paper that’ll be indexed in MEDLINE!
[223000930030] |Too bad I’m the 19th of 30-some authors:
[223000930040] |  • Larry Smith et al. 2008.
[223000930050] |Overview of BioCreative II gene mention recognition.
[223000930060] |Genome Biology 9(Suppl 2):S2.
[223000930070] |The paper’s an overview of the BioCreative II gene mention (GM) eval, which was held almost two years ago (October 2006).
[223000930080] |It’s a nice thing to read if you’re looking for ideas on how to build an NE recognizer, as 19 different systems are described (most of them based on CRFs).
[223000930090] |It also describes experiments using CRFs to combine the annotations of all the systems, which improved the best single-system F-measure of .87 to a committee-based score of .90, and showed that even low-ranking systems contributed to the combined accuracy.
[223000930100] |Shades of Netflix.
[223000930110] |Perhaps not surprisingly, many of the individual systems were themselves committee-based, often a forward-backward combo of the same learner with B-I-O tagging of named-entity tokens.
[223000930120] |Since we’ve been doing annotation recently, I was struck by that section of the paper.
[223000930130] |It turns out over 10% of the annotations in BioCreative I were changed for BioCreative II (which used a superset of the original data).
[223000930140] |That’s a strong indication that their original coding standards were not well enough defined.
[223000930150] |Here’s what they (we?) say:
[223000930160] |It can be argued that the difficulty experienced by human annotators in reaching mutual agreement directly limits the performance of automated systems, and this can be influenced by the clarity of the annotation guidelines.
[223000930170] |So far so good.
[223000930180] |It has been pointed out [ed: by us at Alias-i among others] that the guidelines for annotating genes are surprisingly short and simple given the complex guidelines for annotating named entities in news wires [1].
[223000930190] |Basically, as I understand it, the guidelines were nonexistent: they told biologists to mark “gene mentions”, including proteins.
[223000930200] |However, a gene is a scientific concept, and it is only reasonable to rely on domain experts to recognize and annotate gene mentions.
[223000930210] |Of course we’re not going to have good luck if the people annotating the data don’t know what genes are.
[223000930220] |The question is, what more do we need to tell them?
[223000930230] |Thus, the gene annotation guidelines can be conveyed by reference to a body of knowledge shared by individuals with experience and training in molecular biology, …
[223000930240] |I’m not sure how to read this.
[223000930250] |The “can be conveyed” suggests the authors consider the instruction “just annotate the genes” to be sufficient.
[223000930260] |…and it is not feasible to give a complete specification for gene annotation that does not rely on this extensive background knowledge.
[223000930270] |It’s not as if the previous clause follows from this one.
[223000930280] |The NE guidelines for newswire don’t start with a definition of what it is to be a person or a company.
[223000930290] |While it’s necessary to employ annotators who are domain experts, domain expertise doesn’t define linguistic boundary conditions, and that’s what this kind of named entity detection game is all about.
[223000930300] |What we call a gene is a matter of convention, and such conventions need to be laid out in annotation standards to remove ambiguity, scientific concept or no. Scientific concepts are no more rigid than other concepts in language.
[223000930310] |Granted, we need to lean on background knowledge, but that’s also true when annotating corporations in newswire.
[223000930320] |We annotated a few hundred abstracts from the autism literature and were overwhelmed by the slipperiness of the term “gene”.
[223000930330] |There are all kinds of tricky boundary cases where we know what the article is referring to in a scientific sense, but don’t know whether a phrase should count as a gene mention or not.
[223000930340] |Does a mention of a gene family constitute a mention of a gene?
[223000930350] |What about substructures?
[223000930360] |What if a protein’s mentioned that we know is produced by a single gene?
[223000930370] |These are all linguistic decisions relating to corpus construction.
[223000930380] |The term “gene” is just too vague on its own.
[223000930390] |We’ll never get a complete definition, but hopefully we’ll get one that’s clearer, as measured by better inter-annotator reliability.
[223000930400] |Nevertheless, we believe that some improvement could be achieved by documenting current annotation decisions for difficult and ambiguous gene mentions.
[223000930410] |P.S.
[223000930420] |Sorted by F-measure, the results put us 11th out of 19 teams, at 0.80 versus the top score of 0.87.
[223000930430] |Our first-best system had the second highest recall in the eval (though the system with the highest recall also had 10% higher precision than ours).
[223000930440] |We did no feature tuning and used no external resources, creating our submission in a matter of hours.
[223000930450] |We also had by far the highest-recall submission at 99.9%, as described in the paper.
[223000930460] |We still can’t get anyone to compete on precision at 99.99% recall, but we’re pretty sure it could be done better than our 7% precision submission.
[223000930470] |And more importantly, we think that’s what should be done.
[223000940010] |Posterior Category Distributions from Annotations (without a Gold Standard)
[223000940020] |Here’s Handelman’s dentistry data (see the paper cited below), which had 5 dentists evaluate X-rays for caries (a kind of pre-cavity), in the form of a contingency table:
[223000940030] |For instance, there were 789 cases where annotators 1-4 said there was no caries, whereas annotator 5 said there was.
[223000940040] |There were 3 cases where annotators 1-3 said there was caries, and annotators 4-5 said there was not.
[223000940050] |The goal is to infer the true categories given these annotations.
[223000940060] |We described a general latent-category approach in our previous post, Hierarchical Bayesian Models of Categorical Data Annotation.
[223000940070] |But the paper didn’t have room to display the derived category densities, which is one of the main inference tasks.
[223000940080] |Here are the posterior density estimates for the categories for each of the possible annotations, with horizontal axes all drawn on [0,1] and vertical axes to scale (the skewed distros are clipped).
[223000940090] |Click on it to view full size.
[223000940100] |Posterior Category Densities with Shared Scales
[223000940110] |The thing to note here is that the results are rather different from what a simple voting scheme would give.
[223000940120] |I’m hoping it’ll actually perform much better.
[223000940130] |The natural comparison is to the simple voting scheme, as evaluated for de-duplication by Dolores Labs and described in Brendan O’Connor’s blog post Wisdom of small crowds, part 1: how to aggregate Turker judgments for classification (the threshold calibration trick).
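To see how the model-based answer can differ from a vote, here’s a minimal sketch in R of the posterior calculation for a single annotation pattern, assuming fixed point estimates of prevalence and of each annotator’s sensitivity and specificity (the full model puts priors on these and averages over their posteriors); the numbers below are invented for illustration, not fitted to Handelman’s data:

    # Sketch: posterior probability that an item is positive (has caries)
    # given one annotation pattern, with fixed point estimates of prevalence,
    # sensitivity, and specificity.  Values are invented, not fit to the data.

    posterior_positive <- function(y, prev, sens, spec) {
      like_pos <- prod(ifelse(y == 1, sens, 1 - sens))   # P(annotations | positive)
      like_neg <- prod(ifelse(y == 1, 1 - spec, spec))   # P(annotations | negative)
      prev * like_pos / (prev * like_pos + (1 - prev) * like_neg)
    }

    prev <- 0.15
    sens <- c(0.45, 0.60, 0.70, 0.55, 0.80)   # hypothetical per-dentist sensitivity
    spec <- c(0.99, 0.98, 0.96, 0.98, 0.90)   # hypothetical per-dentist specificity

    posterior_positive(c(0, 0, 0, 0, 1), prev, sens, spec)  # only annotator 5 says caries
    posterior_positive(c(1, 1, 1, 0, 0), prev, sens, spec)  # annotators 1-3 say caries

With these made-up numbers, a lone positive among four negatives leaves the posterior probability of caries under 5%, while three positives out of five push it over 99%, a distinction a simple majority vote can’t make.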
[223000940140] |(I would also highly recommend all of Dolores Labs’ follow-up posts, but specifically the most recent, with a link to Rion et al.’s forthcoming EMNLP paper, AMT is fast, cheap, and good for machine learning data.)
[223000940150] |I’m trying to follow Gelman and Hill’s advice (from their regression book) on displaying multiple graphs.
[223000940160] |Here’s what R draws by default (thanks to Brendan O’Connor for telling me how to get rid of the vertical axes), which has different horizontal and vertical scales per plot:
[223000940170] |Posterior Category Densities, R default scales
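For what it’s worth, here’s one way to get the small-multiple layout in base R, with a shared [0,1] horizontal axis and the vertical axes suppressed; the draws are simulated stand-ins for the real posterior samples:

    # Small-multiple density plots with a shared [0,1] horizontal axis and
    # suppressed vertical axes.  Simulated draws stand in for real posteriors.

    set.seed(1)
    draws <- list(
      "0 0 0 0 1" = rbeta(5000, 2, 40),
      "1 1 1 0 0" = rbeta(5000, 30, 10),
      "0 0 0 0 0" = rbeta(5000, 1, 99),
      "1 1 1 1 1" = rbeta(5000, 80, 2)
    )

    op <- par(mfrow = c(2, 2), mar = c(2, 1, 2, 1))
    for (pattern in names(draws)) {
      plot(density(draws[[pattern]], from = 0, to = 1),
           main = pattern, xlab = "", ylab = "",
           xlim = c(0, 1), yaxt = "n", bty = "n")   # yaxt = "n" drops the vertical axis
    }
    par(op)

The key bits are just yaxt = "n" on each panel plus a shared xlim; lattice or ggplot2 would give shared scales via faceting, but this stays within base graphics.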