[223001260010] |Cache Liveness Testing with Java’s Scheduled Thread Pool Executor [223001260020] |Perhaps needless to say, unchecked generics were not the only thing I wanted to clean up in LingPipe. [223001260030] |Reading the Effective Java book’s section on threading convinced me to use the java.util.concurrent libraries for execution rather than managing my own threads. [223001260040] |It also reminded me that without variables being synchronized on both sides or being declared volatile, changes are not guaranteed to be visible across threads. [223001260050] |There aren’t member variables changing here, but there are values in arrays that change. [223001260060] |So I wondered about the liveness of LingPipe’s util.FastCaceh thread-safe cache. [223001260070] |The existing tests were for safety, not liveness. [223001260080] |What if every thread had to recreate its own cache entries? [223001260090] |First, a batch of imports: [223001260100] |Then the definition of the java.lang.Runnable, instances of which will be run in multiple threads simultaneously. [223001260110] |The first part just sets up the member variables, which store the cache, number of entries, and number of cache hits: [223001260120] |An instance of FastCache implements java.util.Map; in this case, the cache maps integers to integers. [223001260130] |In our HMM decoder implementation, the cache maps strings (tokens) to arrays of doubles (emission probabilities indexed by tag/category ID). [223001260140] |The meat is in the run() method: [223001260150] |Basically, it looks up each entry in the cache, and if it finds it, increments the number of hits, and if it doesn’t find it, it adds it to the cache. [223001260160] |It sleeps a small random number of milliseconds for each entry; without the sleep, the first thread populates the cache before the others even get going! [223001260170] |The actual test harness is run through JUnit (3.8 —I haven’t gone over to the annotation version —that’s coming up, too). [223001260180] |Here’s the method: [223001260190] |It’s very straightforward: set up the cache, then create an array for all of the test class instances, then create the executor, execute the test classes, then shut down the executor, wait for all the threads it’s working on to terminate, validate the results, then calculate the number of cache hits to print. Note that the values are chosen so the cache’s size and load factor guarantee it’ll have to expire entries. [223001260200] |Here’s what the run produces on my dual quad-core Xeon Windows Vista 64 workstation: [223001260210] |Overall, I’m happy with the way this works. [223001260220] |Of course, any of you thread gurus out there who want to help make it better, I’m all ears. [223001260230] |I’m still a relative novice. [223001270010] |Lucene’s Missing Term/Field Query Structure [223001270020] |In a follow-up to an earlier blog post, Lucene’s Missing TokenStream factory, I’d like to discuss a problem with Lucene’s query structure. [223001270030] |Not Lucene’s query syntax, which has its own problems, but gets the term/field distinction (mostly) right. [223001270040] |But be careful —the query syntax has a different notion of term than the object representations in Lucene’s search package. [223001270050] |The underlying problem is the same as with the missing token stream factories: a missing abstraction at the plain text level that winds up conflating notions of plain text, like ranges, and notions of fielded queries, like ranges of plain text in a specified field. 
[223001270060] |At the root of the query structure problem is the index.Term class, which is constructed with a field and a textual value for that field. [223001270070] |The simplest possible query is a search.TermQuery, which consists of a single term. [223001270080] |So far, no problem. [223001270090] |But what about a search.RangeQuery? [223001270100] |It’s constructed using a pair of terms. [223001270110] |There’s a note in the constructor that both terms must be from the same field. [223001270120] |What’s the problem? [223001270130] |Range queries shouldn’t be built out of two field/text pairs, but rather out of one field and two text objects. [223001270140] |Phrase queries (search.PhraseQuery) have the same problem as range queries, in that they should only apply to terms with the same field. [223001270150] |Although the add term checks that added terms have the same field and throws an exception if they don’t, there is no warning in the javadoc. [223001270160] |The solution to the problem is simple. [223001270170] |Split the structure of queries into two types, one for the text part of a query and one for restricting it to fields. [223001270180] |It’s easiest to see in BNF, though what I’m really talking about is object structure, not the query syntax: [223001270190] |The logical operations all distribute through fields, so a query like [AUTHOR: (Smith OR Jones)] is equivalent to ([AUTHOR:Smith] OR [AUTHOR:Jones]). [223001270200] |An advantage the query object representation has over the query syntax is that there are no default fields. [223001270210] |Lucene’s query syntax, on the other hand, allows a fielded query to consist of a term query. [223001270220] |The use of an analyzer in constructing an actual query parser then fills in the missing field. [223001270230] |The class search.WildcardQuery is equally problematic in that it takes a term as an argument. [223001270240] |It’s overloading the textual value of the term to represent not a term, but a string with structure including a special character for the multiple character (*) and single character (?) wildcards. [223001270250] |But what if I want a question mark or asterisk in my term itself? [223001270260] |The query syntax handles this problem, but not the object representation. [223001270270] |The classes search.PrefixQuery and search.FuzzyQuery have the same problem with conflating terms and a syntax for search. [223001270280] |Phrase queries and boolean queries have the additional problem of being mutable, which makes their use in collections problematic (just as using mutable collections within other collections is problematic). [223001270290] |For instance, if you construct a phrase query, its notion of equality and hash code change as more terms are added, because equality is defined to require having an equal list of terms. [223001280010] |Porteous et al. (2008) Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation [223001280020] |I read lots of papers, so I thought I’d start a blog thread to share what I’m thinking about them. [223001280030] |This is the kickoff edition, based on a paper I read on the subway to and from Columbia Uni this moring: [223001280040] |
  • Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. [223001280050] |Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation. [223001280060] |KDD ’08. [223001280070] |In the first pass writing this, I forgot to mention the results. [223001280080] |They were able to speed up LDA with 4000 latent topics over a pruned MEDLINE corpus of 8.2M docs, 140K words, and 730M token instances by a factor of 8 (with some footnotes). [223001280090] |Some 800 topic runs for smaller data sets were sped up by a factor of 4. [223001280100] |I’d have expected a larger speedup for 4K topics, but issues such as cache locality can have a huge effect on this kind of thing. [223001280110] |One question is how much constant overhead was shared between the runs? [223001280120] |You might be familiar with collapsed Gibbs sampling for LDA through the LingPipe class cluster.LatentDirichletAllocation, which is explained with simple and real-world examples in our clustering tutorial. [223001280130] |If you’re not familiar with LDA and its collapsed Gibbs sampler, I’m not going to have time to explain it in this note! [223001280140] |Collapsed Gibbs sampling for LDA was introduced in (Griffiths and Steyvers 2004). [223001280150] |The collapsing is of the multinomials, so that a Gibbs sample is represented by an assignment of topics to each word in each document. [223001280160] |The Dirichlet priors are hyperparameters (that is, set to constants, not fitted) in the collapsed sampler. [223001280170] |Porteous et al. present a method for speeding up the sampling step of the inner loop of collapsed LDA, but their method is actually much more general. [223001280180] |The approach is pretty straightforward, and reminds me of arithmetic coding. [223001280190] |It relies on being able to visit outcomes in roughly decreasing order or probability, compute their unnormalized probability efficiently, and computing an upper bound on the normalization constant given only the unnormalized probabilities of the items seen so far (plus other knowledge if it’s easy to compute). [223001280200] |Suppose we have K topics and W words and we’re looking at word w in document j. [223001280210] |We can compute unnormalized probabilities q[k] for each topic k, so that the probability of a topic k is p[k] = q[k]/Z, where the normalization term Z=q[1]+...+q[K]. [223001280220] |If we can create a sequence of upper bounds Z[1],...,Z[K], such that Z[i] >Z[K] = Z, then we have a sequence of decreasing upper bounds on p[k], namely q[k]/Z[1],...,q[k]/Z[K], such that the final bound is exact q[k]/Z[K] = q[k]/Z = p[k]. [223001280230] |The point of the algorithm is to commit to a sampled outcome as soon as soon as the probability is bounded. [223001280240] |We start by generating random number u in Uniform(0,1). [223001280250] |If it is lower than q[1]/Z[1], we can return topic 1 without computing any of the other q[k] or Z[k]. [223001280260] |If it is greater, we have to compute q[2] and Z[2], at which point we may or may not have bounded the result (we have if (q[1]+q[2])/Z[2] ). [223001280270] |If the bounds are fairly tight and we consider the topics with high values of q[k] first, we can potentially return answers with much less work than computing all normalized probabilities. [223001280280] |The amount of speedup is clearly related to the amount of skew in the distributions and the tightness of the upper bounds on the normalization constants. 
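Here is a rough Java sketch of the early-commit idea; it is not Porteous et al.'s exact bookkeeping (in particular, the sequence of bounds is assumed to be handed in by the caller rather than derived from Hölder's inequality), and it falls back to a full scan if the bounds never get tight enough.

    import java.util.Random;

    public class EarlyCommitSampler {

        // Samples an index k with probability q[k]/Z, where Z = q[0] + ... + q[K-1].
        // zUpper[k] must be an upper bound on Z valid once q[0..k] have been seen;
        // the bounds should be nonincreasing.  If the weights arrive in roughly
        // decreasing order and the bounds are tight, the method often returns after
        // inspecting only a few weights.
        static int sample(double[] q, double[] zUpper, Random random) {
            double u = random.nextDouble();
            double[] partialSums = new double[q.length];
            double sum = 0.0;
            for (int k = 0; k < q.length; ++k) {
                double prevSum = sum;
                sum += q[k];                        // q[k] would be computed lazily here
                partialSums[k] = sum;
                // commit early: u definitely falls at or below index k ...
                boolean below = u * zUpper[k] < sum;      // uses the upper bound on Z
                // ... and definitely above index k-1 (the partial sum is a lower bound on Z)
                boolean above = u * sum >= prevSum;
                if (below && above)
                    return k;
            }
            // no early commit; Z is now known exactly, so finish with a standard scan
            double z = sum;
            for (int k = 0; k < q.length; ++k)
                if (u * z < partialSums[k])
                    return k;
            return q.length - 1;   // guards against round-off
        }
    }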
[223001280290] |LDA topic distribution per word is highly skewed, especially as the Dirichlet parameter gets smaller, as is typically the case with large topic models. [223001280300] |In cases where the distribution is uniform, the sort is poor, or the upper bounds are very loose, Porteous et al.’s optimization can take longer than Griffiths and Steyvers’ direct sampling. [223001280310] |For LDA, Porteous et al. sort topics by document probability, which is a good proxy for overall topic probability. [223001280320] |They compute upper bounds Z[k] using a nifty inequality discovered in the late 1800s called Hölder’s inequality (which is based on vector norms). [223001280330] |They consider two parameterizations, but this is not the only way to bound probabilities. [223001280340] |Porteous et al.’s technique can be applied in many situations in graphical models. [223001280350] |The situation in LDA has us estimate p(topic=k|word,document,θ,φ) which is proportional to p(topic|θ[doc]) * p(word|φ[k]). [223001280360] |This is a typical setup if we’re Gibbs sampling an interior node in a graphical model. [223001280370] |It involves the parent node (θ[doc]), the daughter node (w), and the daughter’s mother (aka an aunt) φ[k]. [223001280380] |I wouldn’t be surprised if general Gibbs sampling packages like HBC could incorporate Porteous et al.’s algorithm as a compiler optimization. [223001280390] |It’d be even better if we could make like Java’s Just in Time (JIT) compiler and do run-time profiling to decide when to apply Porteous et al.’s algorithm. [223001280400] |Finally, a big raspberry for KDD for making their proceedings closed source through the pay-to-access ACM portal. [223001290010] |Lacks a Convincing Comparison with Simple Non-Bayesian Approaches [223001290020] |That was just one comment from my 2009 NAACL rejection letter (reprinted in full with a link to the paper below). [223001290030] |As Andrew says, I’m glad to hear they had so many great submissions. [223001290040] |This was for a paper on gold-standard inference that I’ve been blogging about. [223001290050] |Here’s a link to the submission: [223001290060] |
  • Bob Carpenter. 2008. [223001290070] |A Multilevel Bayesian Model of Categorical Data Annotation. [223001290080] |Rejected by NAACL 2009. [223001290090] |The only thing it adds to the white paper and graphs on the blog is another mechanical Turk experiment that recreated the MUC 6 PERSON entity annotations. [223001290100] |As with the Snow et al. work, the Turkers did better than the gold standard. [223001290110] |I should’ve paid more attention to Simon Peyton Jones’s most excellent advice. [223001290120] |Ironically, I used to give most of that advice to our own students. [223001290130] |Always easier medicine to give than to take. [223001290140] |In particular, I needed to make the use cases much clearer to those who weren’t knee deep in Bayesian analysis and inter-annotator agreement statistics. [223001290150] |The trap I fell into has a name. [223001290160] |It’s called “the curse of knowledge”. [223001290170] |The best general study of this phenomenon I know is Elizabeth Newton’s research, which is described in Heath and Heath’s great book Made to Stick (see the tappers and listeners section of the excerpt). [223001290180] |Good psych experiments are so compelling they give me goosebumps. [223001290190] |It’s hard to compare the Bayesian posterior intervals with non-Bayesian estimates. [223001290200] |The whole point of the Bayesian analysis is to analyze the posteriors. [223001290210] |But if you’re not a Bayesian, what do you care? [223001290220] |The standard in NLP is first-best analyses on held-out data sets with a simple accuracy measure. [223001290230] |No one ever talks about log loss except in model fitting and language modeling evaluation. [223001290240] |So even a probabilistic model that’s not Bayesian can’t get evaluated on its own terms. [223001290250] |This goes for simple posterior predictive inferences in Bayesian analyses. [223001290260] |I guess the reviewer missed the comparison with simple voting, which is the de facto standard (coupled with censorship of uncertain cases). [223001290270] |What I really should’ve addressed is two sides of the issue of what happens with fewer annotators. [223001290280] |The answer is that the Bayesian approach has an even stronger advantage in agreeing with the gold standard annotations than simple voting. [223001290290] |I only did that after the analysis after the submission in a “Doh!” moment anticipating the problem reviewers might have with 10 annotators. [223001290300] |The second aspect is to push harder on the fools gold standard point that the state of the art produces very poor corpora in terms of consistency. [223001290310] |It is possible to see if the informative priors help with cross-validation. [223001290320] |But even without cross-validation, it’s obvious with 20 samples when the annotator made no mistakes —they’re unlikely to have 100% accuracy on the rest of the corpus. [223001290330] |You don’t estimate a player’s batting average (in baseball) to be .500 after he goes 10 for 20 in his first 20 at bats. [223001290340] |Everyone in NLP knows low count estimates need to be smoothed. [223001290350] |So now that we’ve decided we need a prior (call it regularization or smoothing if you like), we’re just haggling about price. [223001290360] |So I just optimized that, too. [223001290370] |It’s what any good Bayesian would do. [223001290380] |Here’s the full set of comments and ratings (on a 1-5 scale, with 5 being the best). 
[223001300010] |Java Quiz: Iterating over Binary Tree Elements [223001300020] |I was thinking this’d make a good Java quiz question. [223001300030] |It’s something I have to do all the time in order to make collections accessible in a streaming fashion. [223001300040] |Getting an iterator is the easiest way for a client to visit the elements of a collection one at a time. [223001300050] |The usual reason is to apply some operation to them. [223001300060] |If the collection’s large (sometimes offline, even), it’s less memory intensive to iterate. [223001300070] |Problem: Suppose you have a standard binary tree implementation (see below) and want to implement an instance of java.util.Iterator that iterates over the elements. [223001300080] |To satisfy the spirit of the exercise, the implementation must take constant time per item iterated and may take at most an amount of memory proportional to the depth of the tree. [223001300090] |Here’s a standard CS 101 binary tree implementation, with the add() and toString() methods for convenience: [223001300100] |To see how a tree gets built, here’s an example: [223001300110] |If I compile and run, here’s what I get: [223001300120] |Hint: It’s easy to walk over the nodes in order using a recursive program. [223001300130] |Here’s a simple string builder: [223001300140] |If I add the following to my main(): [223001300150] |I get the following line of output: [223001310010] |Quiz Answer: Continuation Passing Style for Converting Visitors to Iterators [223001310020] |Here’s the answer to the question raised in my previous post, Java Quiz: Iterating over Binary Tree Elements. [223001310030] |The answer is surprisingly easy once you see it. [223001310040] |What’s really cool is that it’s an instance of a general programming technique called continuation passing style (CPS). [223001310050] |In the form discussed here, CPS is a general technique to convert recursion into iteration. [223001310060] |Enough of the generalities, here’s my answer: [223001310070] |I now add the following to the main() from the last post: [223001310080] |and here’s what I see: [223001310090] |Conceptually, the way the program works is very simple. [223001310100] |The main idea is to keep track of the information that would be on the call stack in the recursive implementation in an explicit programmatic stack data structure. [223001310110] |The elements in the queue represent the continuations. [223001310120] |They essentially tell the program where to pick up processing when called next, just like the stack does in recursion. [223001310130] |Envious Aside: Continuation passing (literally, not just in style) is what lets Python implement the super-cool yield construction. [223001310140] |Yields in Python are a general way to turn visitors into iterators. [223001310150] |They’re possible because continuations in Python are first-class object. [223001310160] |You can even write completely stackless Python. [223001310170] |Back to Java. [223001310180] |Our code maintains a stack of the nodes whose values are left to be printed, in order of how they should be printed. [223001310190] |Because you print the left daughters before the root, you always queue up the entire set of left daughters for a node. [223001310200] |After initialization, the stack consists from bottom to top of the root, the root’s left daughter, the root’s left daughter’s left daughter, and so on, down to the lower “left corner” of the tree. [223001310210] |The iterator has more elements if the stack’s not empty. 
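Here is a minimal sketch of an iterator along these lines; the TreeNode class is a hypothetical stand-in for the post's CS 101 tree implementation, not LingPipe's or the post's actual code.

    import java.util.Iterator;
    import java.util.NoSuchElementException;
    import java.util.Stack;

    // hypothetical node class standing in for the post's binary tree
    class TreeNode<E> {
        final E value;
        TreeNode<E> left;
        TreeNode<E> right;
        TreeNode(E value) { this.value = value; }
    }

    class InOrderIterator<E> implements Iterator<E> {

        private final Stack<TreeNode<E>> stack = new Stack<TreeNode<E>>();

        InOrderIterator(TreeNode<E> root) {
            pushLeftPath(root);   // seed the stack with the leftmost path
        }

        public boolean hasNext() {
            return !stack.isEmpty();
        }

        public E next() {
            if (!hasNext())
                throw new NoSuchElementException();
            TreeNode<E> node = stack.pop();
            pushLeftPath(node.right);   // the right subtree is due after this node
            return node.value;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }

        // push a node and all of its left descendants; they print before it
        private void pushLeftPath(TreeNode<E> node) {
            for ( ; node != null; node = node.left)
                stack.push(node);
        }
    }

The stack never holds more than one root-to-leaf path, and each node is pushed and popped exactly once, which is where the memory and amortized time bounds come from.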
[223001310220] |The next element is always the value on the top of the stack. [223001310230] |After returning it, you need to push it’s right daughter, as the descendants of the current node’s right daughter all have to be output before the other elements on the stack. [223001310240] |One thing to look at is how I implemented the recursive add using iteration (I probably would’ve used a for-loop, but thought the while loop easier to undertsand). [223001310250] |You could’ve also done this recursively: [223001310260] |The recursive version’s clearer (at least to me), but programming languages tend to be much better at iteration than recursion (even languages designed for recursion, like Lisp and Prolog). [223001310270] |Java has relatively small stack sizes (typically, they’re configurable), and miniscule maximum depths (usually in the thousands). [223001310280] |Of course, if your binary tree is 1000 nodes deep, you have bigger problems than Java’s stack! [223001310290] |Also check out the little details. [223001310300] |I made variables final that could be final, kept the class’s privates to itself, and implemented the proper exceptions. [223001310310] |To see that it satisfies the performance constraints, first note that the amount of memory used is never greater than the depth of the tree, because the elements of the stack always lie on a single path from the bottom of a tree up. [223001310320] |Constant time per element is a bit tricky, and you might want to call me on it, because it requires an amortized analysis. [223001310330] |For each element returned, its node was pushed onto the stack once and popped off the stack once. [223001310340] |That’s all the work that happens during next(). [223001310350] |You also do tests for every value’s returned daughters, which is still only a constant amount of work per element. [223001310360] |Extra Credit: Use java.util.AbstractSet to implement SortedSet. [223001310370] |The only methods that need to be implemented are size() and iterator(), so you’re mostly there. [223001310380] |Extra Extra Credit: Discuss the problem of concurrent modification and implement a solution for the iterator like Java’s. [223001310390] |You might also want to note that the tree should be balanced somehow, but that seems like an orthogonal problem to me from the issue with iterators. [223001310400] |Check out Java’s own java.util.TreeSet implementation. [223001310410] |The code for the collections framework is very useful as a study tool. [223001320010] |Matthew Wilkins Evaluates POS Taggers [223001320020] |There’s a fascinating thread on Matthew Wilkins‘s blog Work Product. [223001320030] |Matthew’s a humanities postdoc studying literature. [223001320040] |His thread starts in a 25 October 2008 entry: [223001320050] |
  • Work Product: POS Taggers [223001320060] |with this quote: [223001320070] |I surely just need to test a bunch of them [part of speech taggers] is some semi-systematic way, but is there any existing consensus about what works best for literary material? [223001320080] |I would highly recommend reading from the first post in the thread forward. [223001320090] |It’s a great fly-on-the-wall view of a non-specialist coming to grips with natural language processing. [223001320100] |Over the past three months, Matthew’s evaluated a whole bunch of different part-of-speech taggers looking for something that’ll satisfy his accuracy, speed, and licensing needs. [223001320110] |The data is literary English, for now, mostly culled from Project Gutenberg. [223001320120] |The current entry, Evaluating POS Taggers: Conclusions, dated 27 January 2009, starts with: [223001320130] |OK, I’m as done as I care to be with the evaluation stage of this tagging business, which has taken the better part of three months of intermittent work. [223001320140] |This for a project that I thought would take a week or two. [223001320150] |There’s a lesson here, surely, … [223001320160] |Amen, brother. [223001320170] |That’s one reason why Kevin Cohen’s organizing the software workshops at ACL (this year Marc Light co-chairs, but I’m still on the PC). [223001320180] |So I suggested to Matthew that he submit this diary of his work to the NAACL 2009 Software Workshop, which is explicitly calling for just such case studies. [223001330010] |Denis, Gilleron &Tommasi (2002) Text Classification from Positive and Unlabeled Examples [223001330020] |I’m finally getting back to the problem of how to learn with only positively labeled data (and lots of unlabeled data). [223001330030] |I discussed this last in the entry How Can I Build a Classifier with No Negative Data?. [223001330040] |This paper takes an interesting (and, in retrospect, obvious) approach: [223001330050] |
  • Denis, François, Rémi Gilleron, and Marc Tommasi. 2002. [223001330060] |Text Classification from Positive and Unlabeled Examples. [223001330070] |In Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2002).
  • [223001330080] |The foundation is naive Bayes with a bag of words representation for a binary classification task with two categories {0,1}. [223001330090] |They remind us in their equation (4) that naive Bayes yields a mixture of unigrams language model for words w: p(w) = p(cat=0) * p(w|cat=0) + p(cat=1) * p(w|cat=1). [223001330100] |Denis et al. then (well, actually before) observe that we can estimate p(w|cat=1) from positive examples. [223001330110] |And we can estimate p(w) over the entire corpus without labels. [223001330120] |We can probably make a pretty good guess at p(cat=0), which gives us p(cat=1) = 1.0 - p(cat=0). [223001330130] |If the documents are at all long and the category probabilities are not horrendously skewed, the cat probs usually don’t contribute much to the category estimates in naive Bayes. [223001330140] |Aha! [223001330150] |Now we can solve for p(w|cat=0), which they do for us in their equation (6): p(w|cat=0) = (p(w) - p(cat=1) * p(w|cat=1)) / p(cat=0). [223001330160] |Denis et al. note that for this to make sense, the document length distributions have to be the same. [223001330170] |This not being the case may not matter much in practice. [223001330180] |(Recall that the independence assumptions of Naive Bayes need to be met, which they almost never are for bags of words —actual counts are almost always overdispersed relative to the expectation under a multinomial.) [223001330190] |The paper includes some eval over the WebKB Course Dataset, a set of 1000 or so academic department web pages from four universities classified into types. [223001330200] |Denis et al. took the course pages (22% of total) to be category 1 and all other pages to be category 0. [223001330210] |They got quite decent results with both 20 and 50 positively-labeled documents and up to 200 unlabeled docs. [223001330220] |They showed that classification performance was robust to errors in the category distribution specification p(cat=0). [223001330230] |One good idea’s all a paper (or a thesis for that matter) needs. [223001330240] |They did more eval and added cotraining in: [223001330250] |
  • François Denis, Anne Laurent, Rémi Gilleron, Marc Tommasi. 2003. [223001330260] |Text classification and co-training from positive and unlabeled examples. [223001330270] |In Proceedings of ICML. [223001340010] |Unit Testing Exceptions with JUnit 4.5 Annotations [223001340020] |I not only decided to clean up all the unchecked warnings from the compiler, I decided to upgrade our unit testing framework to JUnit 4.5. [223001340030] |There’s a very nice introduction to JUnit 4.5 at: [223001340040] |
  • cavdar.net’s JUnit 4 in 60 Seconds
  • [223001340050] |I just wrote a script with a bunch of regexes to transform our current tests into the new form. [223001340060] |That meant taking every public void method whose name started with “test” and adding the annotation @Test. [223001340070] |That is, my test methods now all meet this year’s look and feel for JUnit: [223001340080] |I’m still not a big fan of the static import, which adds the method to the local namespace, but that’s the recommended practice, and I believe in following recommended practice more than I believe in innovating, at least for syntactic details like this. [223001340090] |Now if you want to test that a class throws an exception, what I used to do was this: [223001340100] |The pattern’s simple: set up a try/catch block, with the thing you expect to throw an exception in the try block. [223001340110] |If the statement succeeds without an exception, you hit the fail method, which registers a test failure with JUnit. [223001340120] |If you throw the exception, it gets caught and registers a test success. [223001340130] |I had to implement the success method myself to register successes. [223001340140] |The recommended style for this test now is: [223001340150] |This certainly looks much cleaner. [223001340160] |The test succeeds if and only if the exception is thrown. [223001340170] |Now consider this test, which has the same form as the testGamma() method in unit.io.BitInputTest: [223001340180] |The test checks that the first call to next succeeds, the second throws an exception, then the has-next method returns false. [223001340190] |I can’t see how to do this in the new style. [223001340200] |The first problem is that I don’t want the test to succeed if the first call to next throws an exception. [223001340210] |So I’d have to write a separate test to make sure at least that part succeeds. [223001340220] |But the real problem is the post test for has-next to return false, which I can’t figure out how to finagle into a test with an expected exception. [223001350010] |Doubling Speed and Halving Memory in Binomial Naive Bayes [223001350020] |It’s easy to convert naive Bayes into a linear classifier, then reduce the number of coefficient vectors to the number of categories minus 1. [223001350030] |We’ll work through the binomial (two category) case here, where we can get away with a single parameter vector. [223001350040] |The general case is the same with more subscripts. [223001350050] |First, let’s express naive Bayes as a linear classifier. [223001350060] |Suppose we have a binary naive Bayes text classifier over tokens: [223001350070] |The typical parameterization is with a coefficient vector for each category: [223001350080] |Naive Bayes represents documents as bags of words, which provide a count for each word in the document. [223001350090] |Specifically, an input document represented as a vector of token counts, with a constant zero dimension set to one for an intercept: [223001350100] |Then the joint probability of a category and a document is given by: [223001350110] |Thus it’s essentially a multinomial classifier. [223001350120] |Now suppose we only have two categories, 0 and 1. [223001350130] |The conditional probabilities are computed in the usual way by marginalization: [223001350140] |Now suppose we define a new feature vector: [223001350150] |and look at its behavior when multiplied by an document token vector: [223001350160] |we see that it produces the log odds. 
[223001350170] |Noting that the two conditionals add to 1, that is, p(cat=0|doc) + p(cat=1|doc) = 1, [223001350180] |we continue the expansion to derive: log odds = log ( p(cat=1|doc) / (1 - p(cat=1|doc)) ), [223001350190] |so that: p(cat=1|doc) = 1 / (1 + exp(-log odds)), the logistic sigmoid of the single coefficient vector multiplied by the document count vector. [223001350200] |Therefore, we only need to store the single parameter vector and multiply it by our document count vector. [223001350210] |That neatly cuts our storage and the number of arithmetic operations in half. [223001350220] |If we have K categories, we can get away with (K-1) feature vectors by subtracting the last feature vector from all the previous vectors. [223001350230] |The time and space savings are 1/K. [223001350240] |In practice, there’s no need to allocate a vector of token counts. [223001350250] |Instead, as each token is streamed out of the tokenizer, its coefficient is looked up and added to the running total, which is initialized with the category log probability difference (aka the intercept). [223001350260] |This recasting of naive Bayes as a linear classifier is not novel. [223001350270] |The question becomes how best to choose the coefficient vector. [223001350280] |Naive Bayes tends to be outperformed in terms of accuracy by discriminative methods like logistic regression or linear support vector machines, but these methods are orders of magnitude more expensive to train. [223001350290] |There’s a good discussion of discriminative versus generative estimation of linear classifier coefficients in: [223001350300] |
  • Klein, Dan and Christopher D. Manning. 2002. [223001350310] |Conditional structure versus conditional estimation in NLP models. [223001350320] |In Proceedings of EMNLP. [223001350330] |Note: Klein and Manning use a vector per category, as is traditional in natural language and machine learning approaches to both naive Bayes and logistic regression. [223001360010] |Java Floating Point: strictfp, StrictMath vs. LingPipe’s Forward-Backward Algorithm [223001360020] |(Note: there are two questions I have, highlighted in bold.) [223001360030] |I’ve been wrestling with varying floating point behavior across platforms, a bug feature of modern programming languages over multiple platforms and compilers. [223001360040] |Here’s a nice intro from Sun that pre-dates Java, but discusses the IEEE floating point standard, the main platform-based variations, and their effects on the C language: [223001360050] |David Goldberg. 1991. [223001360060] |What Every Computer Scientist Should Know About Floating-Point Arithmetic. [223001360070] |ACM Computing Surveys. [223001360080] |Here’s the whole book as a PDF. [223001360090] |It’s just the appendix; the rest of the doc goes into gory detail about floating point. [223001360100] |Much more so than in my numerical optimization (Nocedal and Wright) or matrix computation (Golub and van Loan) books. [223001360110] |One of the features of LingPipe’s implementation of hidden Markov models (HMMs) is a full implementation of the forward-backward algorithm (also see Bishop’s pattern recognition textbook). [223001360120] |Forward-backward is just a special case of the general sum-product algorithm for graphical models. [223001360130] |It’s also used in conditional random fields (CRFs). [223001360140] |Forward-backward is what allows us to calculate the probability of a tag for a token given the whole sequence of input tokens. [223001360150] |It’s also what’s behind the same calculation for named entities in our HMM named-entity extractor. [223001360160] |As I mentioned previously, in a post on POS tagger eval, Matthew Wilkins has a corpus of 1.6M tokens of literary text with POS tags. [223001360170] |During the evaluation, our HMM decoder threw an exception in fold 347 of a 500-fold cross-validation run (it was so many folds so as not to blow out memory storing cases —I’ll make storage optional in the next release). [223001360180] |I fixed the problem by adding a strictfp declaration to every class, and replacing all calls to java.lang.Math with the corresponding calls to java.lang.StrictMath. [223001360190] |It wasn’t noticeably slower to decode on my desktop machine (dual quad-core Xeon, Vista64), in either 1.5 or 1.6. [223001360200] |Is this what others have found with strict math? [223001360210] |What I read on the web led me to believe it’d be much slower. [223001360220] |The bug was in the middle the forward portion of the forward-backward algorithm. [223001360230] |Forward-backward’s a nasty algorithm numerical-computation-wise because it involves a sequence of multiply-adds (basically vector dot products, where the output of one set of computations makes up the input to the next). [223001360240] |Here’s how the forward values (alphas) are defined for a given token position (n) and category/tag (t): [223001360250] |For each input token (n), there are multiply-adds accumulated over all of the tags (t), each involving the output of the previous (n-1) calculation! 
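Here is a rough Java sketch of the textbook forward recursion, with the rescaling discussed below folded in as a per-position renormalization; the model parameters are assumed to be handed in as dense arrays, and this is a sketch, not LingPipe's implementation.

    public class ForwardPass {

        // Computes rescaled forward values; after renormalization, alpha[n][t]
        // equals p(tag[n]=t | tokens[0..n]).  The underlying recursion is
        //   alpha[n][t] = emit(t, token[n]) * sum over s of alpha[n-1][s] * transit(s, t).
        static double[][] forward(int[] tokens, int numTags,
                                  double[] startProbs,      // p(tag)
                                  double[][] transitProbs,  // p(tag[n]=t | tag[n-1]=s)
                                  double[][] emitProbs) {   // p(token | tag)
            int numTokens = tokens.length;
            double[][] alpha = new double[numTokens][numTags];
            for (int t = 0; t < numTags; ++t)
                alpha[0][t] = startProbs[t] * emitProbs[t][tokens[0]];
            rescale(alpha[0]);
            for (int n = 1; n < numTokens; ++n) {
                for (int t = 0; t < numTags; ++t) {
                    double sum = 0.0;
                    for (int s = 0; s < numTags; ++s)
                        sum += alpha[n - 1][s] * transitProbs[s][t];   // multiply-add chain
                    alpha[n][t] = sum * emitProbs[t][tokens[n]];
                }
                rescale(alpha[n]);   // keeps the values on a usable linear scale
            }
            return alpha;
        }

        // divide a row by its sum; the sum is the per-position scaling factor
        static void rescale(double[] row) {
            double total = 0.0;
            for (double x : row) total += x;
            for (int i = 0; i < row.length; ++i) row[i] /= total;
        }
    }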
[223001360260] |The algorithm’s not as bad as it looks because we store the previous alphas and use dynamic programming. [223001360270] |Because the forward values (alphas) are joint probabilities of the input tokens and the tag assigned at a position, they can easily underflow in practice on a linear scale. [223001360280] |That’s why Viterbi (first best algorithm) uses log scales (that, and it converts multiplication to addition). [223001360290] |In practice, we really only care about the conditional probabilities. [223001360300] |So we can rescale: [223001360310] |As those of you who follow floating point know, this is all risky business. [223001360320] |What happened in practice was that the rescaled values summed to 1.09 at one point (I only rescaled up, so this is a problem in the forward calculations at some point). [223001360330] |Finally: Does anyone know any references about calculating forward-backward in a way that’s sensitive to how floating point arithmetic actually works? [223001370010] |Elkan and Noto (2008): Learning Classifiers from Only Positive and Unlabeled Data [223001370020] |Charles Elkan sent along links to these two papers in response to my previous post, Denis et al.’s Text Classification from Positive and Unlabeled Examples. [223001370030] |The title overlap indicates just how much we need auto-alerting using tech like Hal’s WhatToSee: [223001370040] |
  • K. Noto, M. H. Saier Jr., and C. Elkan. [223001370050] |Learning to Find Relevant Biological Articles Without Negative Training Examples. [223001370060] |In Proceedings of the Australasian Joint Conference on Artificial Intelligence (AI ’08).
  • [223001370070] |
  • C. Elkan and K. Noto. [223001370080] |Learning Classifiers from Only Positive and Unlabeled Data. [223001370090] |In Proceedings of the Fourteenth International Conference on Knowledge Discovery and Data Mining (KDD ’08).
  • [223001370100] |Because they’ve supplied the data, I can use it in my upcoming tutorial on semi-supervised learning (I’m also doing standard EM). [223001370110] |

    The Setup

    [223001370120] |The basic setup is the same as for Denis et al.: a two-category (positive/negative) classification problem with a relatively tiny data set. [223001370130] |In this case, there are a total of 7500 or so SwissProt text records, 2500 or so of which are positive instances that mention transport proteins. [223001370140] |The labeled training set then consists of a randomly selected subset of the positive examples. [223001370150] |The goal is to estimate a classifier p(positive|text) given only the collection of 7500 texts and 300 labeled positive examples. [223001370160] |

    Lemma 1

    [223001370170] |Elkan and Noto show that if you can estimate the probability of a positive item being labeled, p(lab|pos), and you can estimate the probability of an item being labeled, p(lab|text), then you can estimate: p(pos|text) = p(lab|text) / p(lab|pos). [223001370180] |Given that p(lab|pos) is a constant, we also see that p(pos|text) is proportional to p(lab|text). [223001370190] |They lean on this ratio to train a binary classifier to estimate p(lab|text), which provides the same ranking by probability as the estimand of interest, p(pos|text). [223001370200] |

    The Evaluation

    [223001370210] |They evaluate by precision at 300 and the results look very good. [223001370220] |The KDD paper cuts off the ROC curve at .95 recall saying the only range of interest is precision-at-300; the AI’08 paper shows the whole thing, where they hit 100% recall at a respectable 70% sensitivity, which almost seems too good to be true. [223001370230] |Their application of ranking articles is more like traditional search in that it does not require a properly calibrated p(pos|text), only a ranking. [223001370240] |They’re interested in supplying a list of 300 papers to a human curator to review them for inclusion in a database of transport proteins. [223001370250] |That’s a bit different than our concern. [223001370260] |We’re working on the much larger scale problem of linking Entrez-Gene (hundreds of thousands of genes with between a handful and thousands positive examples plus metadata) and MEDLINE (tens of millions of citations with titles, authors, abstracts, dates, and other meta-data like MeSH terms). [223001370270] |For each of the genes in Entrez, we have positive-only data consisting of MEDLINE citations to papers. [223001370280] |We also have lots of metadata and attached descriptive text. [223001370290] |We’d like to be able to calibrate p(gene|text) across genes so that for each paper, we can rank genes in order of likelihood they’re mentioned in the text. [223001370300] |This is very different than ranking texts in order of likelihood of mentioning a gene, which we also want to do, but is easier with traditional “more like this” methodology. [223001370310] |

    Estimation

    [223001370320] |In order to derive estimates of p(pos|text), the authors use a 1D logistic regression to estimate probabilities p(lab|text) from SVM classifier margins. [223001370330] |Of course, the sigmoid is monotonic, so this preserves the same ranking as the SVMs. [223001370340] |The authors estimate p(lab|pos) as the expectation of p(lab|x) over labeled positives, computed by averaging the classifier’s estimates p(lab|x) over a cross-validated sample of labeled positive examples. [223001370350] |
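As a concrete sketch of the two estimates just described (hypothetical method names, not code from the paper):

    public class PositiveOnlyCalibration {

        // Estimates c = p(lab|pos) by averaging the labeled-vs-unlabeled classifier's
        // probability estimates over held-out labeled positives, as described above.
        static double estimateLabelingRate(double[] pLabGivenTextForHeldOutPositives) {
            double sum = 0.0;
            for (double p : pLabGivenTextForHeldOutPositives)
                sum += p;
            return sum / pLabGivenTextForHeldOutPositives.length;
        }

        // Lemma-1 style conversion: p(pos|text) = p(lab|text) / c, capped at 1.0
        // as a guard against noisy estimates of c.
        static double pPosGivenText(double pLabGivenText, double c) {
            return Math.min(1.0, pLabGivenText / c);
        }
    }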

    Does Negative Data Help?

    [223001370360] |In the AI’08 paper, they consider adding negative data gleaned from early review rounds where they gave 300 papers to curators. [223001370370] |Surprisingly, their results showed the negative data didn’t help much at all. [223001370380] |They were surprised by the variance in their data samples, which ranged from 1.18% to 4.25% relevant articles across samples of 300 papers. [223001370390] |I’m not surprised – the 95% posterior interval for Binomial(n|300,0.025)/300 is (0.010, 0.043). [223001370400] |
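That interval is easy to check numerically; here is a small sketch that accumulates the binomial probability mass function for n=300 and p=0.025 and reads off a central 95% interval for the proportion.

    public class BinomialInterval {

        public static void main(String[] args) {
            int n = 300;
            double p = 0.025;
            // binomial pmf via the recurrence pmf(k+1) = pmf(k) * (n-k)/(k+1) * p/(1-p)
            double pmf = Math.pow(1.0 - p, n);   // pmf(0)
            double cdf = pmf;
            int lower = -1;
            int upper = -1;
            for (int k = 0; k < n; ++k) {
                if (lower < 0 && cdf >= 0.025) lower = k;
                if (upper < 0 && cdf >= 0.975) upper = k;
                pmf *= (n - k) / (double) (k + 1) * p / (1.0 - p);
                cdf += pmf;
            }
            // prints roughly 0.010 and 0.043, matching the interval quoted above
            System.out.println(lower / (double) n + " " + upper / (double) n);
        }
    }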

    Underestimation Bias (I think)

    [223001370410] |Their figure shows the result of a (mis-specified, in the statistical sense) two-dimensional logistic regression on two categories generated by 2D normal distributions (from the figure, it looks like zero covariance), using an expanded basis with intercept and quadratic terms: 1, x, y, x*x, y*y. [223001370420] |The blue pluses are positive cases, about 20% of which are labeled (the labeled cases are not visually separated from the unlabeled positive cases); the red circles are negative cases. [223001370430] |The smaller blue ellipse is the 50% contour for the labeled-vs-unlabeled classifier, and the larger blue ellipse is the 50% contour for a classifier trained on positive-vs-negative. [223001370440] |This neatly shows why the classifier trained on the labeled versus unlabeled distinction has high precision. [223001370450] |It also neatly shows a case where Lemma 1 fails in practice, because the probabilities are not in the ratio dictated by Lemma 1. [223001370460] |My main concern with this paper is that training a classifier on labeled versus unlabeled data is going to systematically underestimate p(lab|text), because the unlabeled data contains many points drawn from the same set of underlying positive examples (blue pluses) as the labeled examples. [223001370470] |Their diagram shows this very neatly, though we can’t prove anything with a mis-specified model anecdote. [223001370480] |

    What about Recall?

    [223001370490] |It’s hard to evaluate recall in this setup, and the AI’08 paper provides some heuristics that require evaluating some negative data in the end from a sample. [223001370500] |For the gene linkage problem of generating classifiers for every gene in Entrez-Gene, it’s hard to even get a sample where there’s enough positive density to allow estimation. [223001370510] |

    Questions and Quibbles

    [223001370520] |A question I had is why they didn’t iterate. [223001370530] |Given what they’re doing, it’d seem natural to combine this with EM-like iterations. [223001370540] |Or Gibbs samples. [223001370550] |Quibble 1: After all the estimation, they don’t measure how well calibrated the probabilities are —we only see the ROC curves. [223001370560] |Although every reviewer wants more graphs, I really want to see how well calibrated the probabilities are. [223001370570] |Quibble 2: The authors really shouldn’t use four decimal places (0.xxxx) for reporting results on 6000 data points in their tables, but rounding makes them look like less of an improvement on “prior art”. [223001370580] |Quibble 3: I didn’t see any justification for their claim in the AI’08 paper that SVMs have a “small and useful boost in accuracy compared to methods like maximum entropy [aka logistic regression]“. [223001370590] |What’s the gain from their use of a quadratic kernel over the linear kernel? [223001370600] |Did they regularize? [223001370610] |I suppose SVMs are like IBM and you can’t go wrong these days buying into them. [223001370620] |I’d have thought for estimating the probabilities (Quibble 1), logistic regression might have an advantage because training maximizes the sum of the log estimates. [223001370630] |And even if SVMs do have a small and useful boost in accuracy, we’re talking about ranking on the one hand and probability estimate on the other, not first-best classifier accuracy. [223001380010] |Linear Models for N-Gram Probabilities [223001380020] |I’ve been thinking about linear classifiers, and while it’s well known how to convert naive Bayes (multinomial) classifiers into linear models (just treat log p(cat) as intercept, and log p(word|cat) as dimensions), it’s less well known how to do this with higher-order n-grams. [223001380030] |One use is an n-gram language-model based search engine implemented on top of a vector-based retrieval engine (I’m not sure Lucene’s scoring’s flexible enough for this). [223001380040] |Well, it turns out to be easy if you use interpolated smoothing like Witten-Bell. [223001380050] |In the bigram case, the smoothed estimate is: [223001380060] |where lam(b) in [0,1] is the interpolation coefficient for context b and lam() is the interpolation coefficient for the empty n-gram. p'(a|b) and p'(a) are maximum likelihood estimates, and unif is the uniform probability of a token. [223001380070] |For Witten-Bell smoothing, interpolation is defined as: [223001380080] |where count(b,_) is the number of times the token b is seen before another token in training, making lam(b) increase with the number of training instances. [223001380090] |Let count[X] be the count for n-gram X (where the length may be 0, 1 or 2). [223001380100] |For instance, the vector “John likes John” has counts for the following dimensions: [223001380110] |Then define coefficient vector beta, with dimensions for all of the n-grams seen in training (after pruning): [223001380120] |If we look at the contribution from terms in our example, and sum, we see how it works, assuming all bigrams have been seen in training other than [likes,John]. [223001380130] |Doing the sums leaves us with the log estimate: [223001380140] |It’s no problem extending this to higher order n-grams. [223001380150] |Many other kinds of interpolated smoothing will work, too. 
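Here is a rough Java sketch of the interpolated estimate the coefficients are built from; the Witten-Bell weights shown (counts divided by counts plus the number of distinct continuations) are one common parameterization and may not match LingPipe's exactly.

    public class InterpolatedBigram {

        // Interpolated bigram estimate as described above:
        //   p(a|b) = lam(b) * pML(a|b) + (1 - lam(b)) * (lam0 * pML(a) + (1 - lam0) * unif)
        // lam(b) and lam0 use a Witten-Bell-style weight (an assumption in this sketch).
        static double estimate(long bigramCount,      // count(b,a)
                               long contextCount,     // count(b,_)
                               int followerTypes,     // distinct tokens seen after b
                               long unigramCount,     // count(a)
                               long totalTokens,      // total training tokens
                               int observedTypes,     // distinct token types seen in training
                               int vocabSize) {       // size of the full vocabulary
            double unif = 1.0 / vocabSize;
            double pMlUnigram = unigramCount / (double) totalTokens;
            double lam0 = totalTokens / (double) (totalTokens + observedTypes);
            double pUnigram = lam0 * pMlUnigram + (1.0 - lam0) * unif;
            if (contextCount == 0L)
                return pUnigram;                      // context never seen: back off entirely
            double pMlBigram = bigramCount / (double) contextCount;
            double lamB = contextCount / (double) (contextCount + followerTypes);
            return lamB * pMlBigram + (1.0 - lamB) * pUnigram;
        }
    }

Roughly speaking, the log of this estimate is what gets spread across the bigram and unigram dimensions of the coefficient vector, with the empty-context terms landing on the intercept.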
[223001390010] |Document-Length-Normalized Naive Bayes [223001390020] |I just finished re-reading the Nigam, McCallum and Mitchell paper Semi-Supervisied Text Classification Using EM before implementing an EM tutorial and new naive Bayes implementation. [223001390030] |I was struck by: [223001390040] |For pre-processing, stopwords are removed and word counts of each document are scaled such that each document has constant length, with potentially fractional word counts. [223001390050] |(Nigam et al. p. 9; my bold) [223001390060] |I figured they meant an L1 norm, which amounts to dividing by the sum of (the absolute value of the) counts. [223001390070] |But they could’ve meant the L2 norm that’s used in vector-based information retrieval, which amounts to dividing by vector length, which is just the square root of the sum of the squared counts. [223001390080] |In fact, (Rennie et al. 2003) used L2 norms (after log-scale TF/IDF weighting!). [223001390090] |I also didn’t know what length they normed to, but figured it’d be something small so that the conditional probability estimates of naive Bayes would have less variance (more on this below). [223001390100] |So I wrote off to Kamal Nigam, who kindly responded with answers. [223001390110] |Yes, they used an L1 norm, but they used a surprisingly long length, on the order of hundreds of words, which will cause most naive Bayes estimates to tend toward 0 or 1 due to the faulty independence assumption underlying naive Bayes. [223001390120] |They used the stoplist from Andrew McCallum’s BOW toolkit, and Kamal was pretty sure they case normalized. [223001390130] |(20 Newsgroups also works a smidgen better for classification if you strip out the headers, which I’m not sure if they did or not.) [223001390140] |It’s pretty clear how to generalize naive Bayes to handle fractional counts —just use floating point accumulators (better make that a double) and carry out all the other calculations in exactly the same way. [223001390150] |Let’s say I have two categories with the following estimates after training: [223001390160] |Now what happens when I get the document “a a b”? [223001390170] |so that: [223001390180] |That’s without length normalization. [223001390190] |What if we make the length norm 1? [223001390200] |Then we have to raise all of our token probabilities (not the category probabilities) to the 1/3 power, yielding: [223001390210] |This works out to an estimate of p(cat1|a a b) = 0.91. [223001390220] |Suppose we take the norm to be 100, we raise the conditional probs to the power of (100/3), yielding p(cat1|a a b)=1.0 (well, actually 0.999999925). [223001390230] |In general, length norms lower than the length of the document drive the estimates p(cat|text) closer to the category distribution p(cat|text), in this case 0.9, and length norms greater than the document length drive the estimates closer to 0 or 1. [223001390240] |Thus with a balanced category distribution, the posterior category estimates are driven to the uniform distribution. [223001390250] |One side effect of this normalization is that the joint probabilities are no longer coherent. [223001390260] |It’s easy to see that the infinite chain of inputs “a”, “a a”, “a a a”, …, each of which has the same “joint probability” after length normalization, is going to drive our total probablity above 1.0. [223001390270] |Nigam et al. 
just went ahead and carried out EM in the usual way in an attempt to maximize joint probability estimates of the data and the models (simple Dirichlet prior and MAP estimate). [223001400010] |Rennie, Shih, Teevan, and Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers [223001400020] |It turns out Nigam et al. weren’t the only ones fiddling with naive Bayes. [223001400030] |I just finished: [223001400040] |
  • Rennie, Shih, Teevan, Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. [223001400050] |In ICML.
  • [223001400060] |Rennie et al. introduce four distinct modifications to Naive Bayes: [223001400070] |
  • log term frequency normalization, to cope with overdispersion, [223001400080] |
  • length normalization, to dampen inter-term correlation, [223001400090] |
  • inverse doc frequency normalization, to cope with noise from common features, and [223001400100] |
  • complementation, to cope with low-count bias. [223001400110] |

    Log Term Frequency (TF)

    [223001400120] |Word counts tend to be overdispersed relative to what is predicted by a multinomial. [223001400130] |Simply put, two or three occurrences of a word in a document are much more common than predicted by multinomials. [223001400140] |Ken Church called this “burstiness”. [223001400150] |The two principled ways to address this are (1) mixture models, including the gamma/Poisson [negative binomial], the Dirichlet/multinomial, and zero-inflated models, and (2) latent topic models, where the latent topics model correlations among terms. [223001400160] |Rennie et al. motivate log word frequency transforms on the basis that raw linear counts are overdispersed relative to a standard multinomial model, whereas log-scaled counts are a much better fit. [223001400170] |Rather than training on raw document counts, as in standard naive Bayes, the count vectors count(w,d) of words in documents are first transformed based on term frequency (TF) and inverse document frequency (IDF). [223001400180] |The TF transform replaces each raw count with its log: newCount(w,d) = log(1 + count(w,d)), [223001400190] |where count(w,d) is the count of word w in document d. [223001400200] |

    Inverse Document Frequency (IDF)

    [223001400210] |To help eliminate noise caused by common words (defined as those that show up in many documents), word counts are next scaled by inverse document frequency: newCount(w,d) = count(w,d) * log( numDocs / numDocs(w) ), [223001400220] |where numDocs is the total number of documents and numDocs(w) is the total number of documents containing word w. [223001400230] |

    Length Normalization (L2)

    [223001400240] |Length normalization dampens the correlations among words in documents. [223001400250] |It also scales probabilities for different lengths, which is not important for Rennie et al.’s first-best classifiers. [223001400260] |To L2 (Euclidean) length normalize, they divide the vector of TF/IDF counts by its length: newCount(w,d) = count(w,d) / length(d), [223001400270] |where the L2 (Euclidean) length of the term vector for document d is defined by: length(d) = sqrt( sum over words w of count(w,d)^2 ). [223001400280] |I’ve been doing classification like this ever since I started doing classification, which is why the result so far looks like LingPipe’s classify.TfIdfClassifierTrainer. [223001400290] |
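Putting the three transforms together, here is a hedged sketch of the document vector preparation; the log and IDF forms follow the usual TF/IDF conventions and my reading of the paper, so check Rennie et al. for the exact variants.

    public class TfIdfTransform {

        // Applies log term frequency, inverse document frequency, then L2 length
        // normalization to a dense count vector.  docFreq[w] is the number of
        // training documents containing word w (assumed >= 1 for any word that
        // appears) and numDocs is the total number of documents.
        static double[] transform(int[] counts, int[] docFreq, int numDocs) {
            double[] x = new double[counts.length];
            for (int w = 0; w < counts.length; ++w) {
                if (counts[w] == 0) continue;                          // leave zeros alone
                double tf = Math.log(1.0 + counts[w]);                 // dampens burstiness
                double idf = Math.log(numDocs / (double) docFreq[w]);  // downweights common words
                x[w] = tf * idf;
            }
            double sumOfSquares = 0.0;
            for (double v : x)
                sumOfSquares += v * v;
            double length = Math.sqrt(sumOfSquares);
            if (length > 0.0)
                for (int w = 0; w < x.length; ++w)
                    x[w] /= length;                                    // unit L2 length
            return x;
        }
    }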

    Complementation

    [223001400300] |What I’d never seen done is complementation. [223001400310] |Complementation is introduced to deal with the bias seen with imbalanced training data. [223001400320] |What’s really nifty is their comparison to one-versus-all formulations of linear classifiers. [223001400330] |I’ll talk about both of these issues in a future blog post. [223001400340] |Instead of training a category based on its own counts and then taking the most likely category, Rennie et al. train a category based on all other category counts, then take the lowest scoring category. [223001400350] |That is, we count all the instances (after all the TF/IDF and length norms) of the word in categories other than c. [223001400360] |Finally, the discrete distribution underlying category c is defined as: p(w|c) proportional to count~c(w) + alpha, normalized over all words, where count~c(w) is the transformed count of word w in all categories other than c and alpha is an additive smoothing constant. [223001400370] |
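Here is a small sketch of the complement counts and the smoothed distribution; alpha is an additive smoothing constant (Rennie et al. use add-one smoothing, if I recall correctly).

    public class ComplementCounts {

        // Computes the complement-class distribution described above: the
        // (transformed) count of each word in every category other than c, plus an
        // additive smoothing constant alpha, normalized over words.
        // counts[k][w] is the count of word w in training documents of category k.
        static double[] complementDistribution(double[][] counts, int c, double alpha) {
            int numWords = counts[0].length;
            double[] complement = new double[numWords];
            for (int k = 0; k < counts.length; ++k) {
                if (k == c) continue;                    // every category except c
                for (int w = 0; w < numWords; ++w)
                    complement[w] += counts[k][w];
            }
            double total = 0.0;
            for (int w = 0; w < numWords; ++w) {
                complement[w] += alpha;                  // keeps zero counts off the log scale
                total += complement[w];
            }
            for (int w = 0; w < numWords; ++w)
                complement[w] /= total;                  // normalize to a distribution
            return complement;
        }
    }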

    Classification

    [223001400380] |The output is a set of weights that can be used as the basis of a regular old linear classifier (with intercept feature of the category probabilities). [223001400390] |As such, this looks like search-engine TF/IDF computations for information retrieval (see my post TF/IDF, Scaling and Symmetry for gory details), for which there are typically no normalizations applied to the queries. [223001400400] |It’s possible to roll most of the norms into the training, so this is no big deal, and the final length normalization doesn’t affect first-best answers. [223001400410] |
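And a sketch of the corresponding decision rule, which takes the lowest-scoring category under the complement distributions (the category log prior intercept is left out to keep it short):

    public class ComplementScorer {

        // Scores each category against a transformed document vector using
        // complement-class log weights and returns the lowest-scoring category,
        // matching the "take the lowest scoring category" rule above.
        // complementDist[c][w] is the complement distribution for category c.
        static int classify(double[] doc, double[][] complementDist) {
            int best = -1;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int c = 0; c < complementDist.length; ++c) {
                double score = 0.0;
                for (int w = 0; w < doc.length; ++w)
                    score += doc[w] * Math.log(complementDist[c][w]);
                if (score < bestScore) {       // lowest complement score wins
                    bestScore = score;
                    best = c;
                }
            }
            return best;
        }
    }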

    First-Best, Conditional Probabilities and EM

    [223001400420] |Recall that (Nigam, McCallum and Mitchell 2003) (L1) length-normalized the documents being classified, to fit better semi-supervised models with expectation maximization (EM). [223001400430] |Although this has no effect on first-best classification, it does change the category probability estimates, and thus affects expectations, and hence EM. [223001400440] |Ironically, Rennie et al. list not being able to run EM as a drawback to their ad-hoc normalizations. [223001400450] |As Nigam et al. showed, you really only need the conditional estimates for computing the E step, and these may be computed with Rennie et al.’s modifications as well as with Nigam et al.’s. [223001400460] |Then the M step can be any old model fitting, including Rennie et al.’s. [223001410010] |Lucene 2.4 in 60 seconds [223001410020] |[Update: 17 Nov 2010. [223001410030] |We now have an extensive (20 pages in textbook format) tutorial for Lucene 3.0 as an appendix for the LingPipe book. [223001410040] |The PDF for the current draft and code samples in a tarball are available from the following site: [223001410050] |
  • Text Analysis in LingPipe [223001410060] |As a bonus, the tokenization chapter provides adapter implementations for converting LingPipe tokenizers to Lucene analyzers and vice-versa.] [223001410070] |This is a tutorial on getting started with the Lucene 2.4 Java API which avoids using deprecated classes and methods. [223001410080] |In particular it shows how to process search results without using Hits objects, as this class is scheduled for removal in Lucene 3.0. [223001410090] |The time estimate of 60 seconds to complete this tutorial is more wish than promise; my goal is to present the essential concepts in Lucene as concisely as possible. [223001410100] |In the best “Hello World” tradition I have written a class called HelloLucene which builds a Lucene index over a set of 3 texts: { “hello world”, “hello sailor”, “goodnight moon” }. HelloLucene has two methods (besides main): buildIndex and searchIndex, which are called in turn. buildIndex builds an index over these 3 texts; searchIndex takes the args vector and runs a search over the index for each string, printing out results to the terminal. [223001410110] |Here is buildIndex and its required import statements: [223001410120] |A Lucene data store is called an Index. [223001410130] |It is a collection of Document objects. [223001410140] |A Document consists of one or more named Field objects. [223001410150] |Documents are indexed on a per-field basis. [223001410160] |An IndexWriter object is used to create this index. [223001410170] |Let’s go over the call to its constructor method: [223001410180] |The first argument is the index that the IndexWriter operates on. [223001410190] |This example keeps the index on disk, so the FSDirectory class is used to get a handle to this index. [223001410200] |The second argument is a Lucene Analyzer. [223001410210] |This object controls how the input text is tokenized into terms used in search and indexing. HelloLucene uses a StandardAnalyzer, which is designed to index English language texts. [223001410220] |It ignores punctuation, removes common function words such as "if", "and", and “"but", and converts words to lowercase. [223001410230] |The third argument determines the amount of text that is indexed. [223001410240] |The constant IndexWriter.MaxFieldLength.LIMITED defaults to 10,000 characters. [223001410250] |The for loop maps texts into Document objects, which contain a single Field with name "text". [223001410260] |The last 2 arguments to the Field constructor method specify that the contents of the field are stored in the index, and that they are analyzed by the IndexWriter's Analyzer. [223001410270] |The IndexWriter.addDocument() method adds each document to the index. [223001410280] |After all texts have been processed the IndexWriter is closed. [223001410290] |Both indexing and search are operations over Document objects. [223001410300] |Searches over the index are specified on a per-field basis, (just like indexing). [223001410310] |Lucene computes a similarity score between each search query and all documents in the index and the search results consist of the set of best-scoring documents for that query. [223001410320] |Here is searchIndex and its required import statements: [223001410330] |Access to a Lucene index (or indexes) is provided by a Searcher object. searchIndex instantiates a IndexSearcher over the directory "helloLuceneIndex". [223001410340] |Search happens via a call to the Searcher.search() method. 
[223001410350] |Before calling the search() method, the search string must be processed into a Lucene Query. [223001410360] |To prepare the query we create a QueryParser and call its parse() method on the search string. [223001410370] |The QueryParser is constructed using the same Analyzer as was used for indexing because the QueryParser processes the search string into a set of search terms which are matched against the terms in the index; therefore both the indexed text and the search text must be tokenized in the same way. [223001410380] |Here are the statements which process the results: [223001410390] |The search() method returns a TopDocs object, which has 2 public fields: scoreDocs and totalHits. [223001410400] |The latter is the number of documents that matched the query, and the former is the array of results in the form of ScoreDoc objects, where a ScoreDoc is itself a simple object consisting of two public fields: doc, the Document id (an int value); and score, a float value calculated by Lucene (using a similarity metric best described as Unexplainable Greek kung fu). searchIndex reports the TopDocs.totalHits, then iterates over the TopDocs.scoreDocs array. [223001410410] |The ScoreDoc.doc is used to retrieve the Document object from the index, and finally the method Document.get() retrieves the contents of the Field with name "text". [223001410420] |In order to compile this program, download the Lucene 2.4 distribution, which contains the jarfile lucene-core-2.4.0.jar. [223001410430] |Either add this to your classpath, or simply put it in the same directory as HelloLucene, and compile with the following command: [223001410440] |Invoke the program with the following command: [223001410450] |This produces the following results: [223001410460] |The first query "hello world" matches exactly against our first text, but it also partially matches the text "hello sailor". [223001410470] |The second query "hello" matches both texts equally well. [223001410480] |The third query "moon" matches against the only document which contains that word. [223001410490] |The fourth query "foo" doesn’t match anything in the index. [223001410500] |
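Here, similarly, is an illustrative sketch of a searchIndex method matching the description above; the field and directory names come from the text, while details such as the 10-hit cutoff passed to search() are assumptions rather than the original listing.

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public static void searchIndex(String[] queryStrings)
        throws IOException, ParseException {
    Searcher searcher
        = new IndexSearcher(FSDirectory.getDirectory("helloLuceneIndex"));
    // parse queries against the "text" field with the same analyzer used for indexing
    QueryParser parser = new QueryParser("text", new StandardAnalyzer());
    for (String queryString : queryStrings) {
        Query query = parser.parse(queryString);
        TopDocs results = searcher.search(query, null, 10);  // no filter, at most 10 hits (assumed cutoff)
        System.out.println(results.totalHits + " hits for query: " + queryString);
        for (ScoreDoc scoreDoc : results.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);  // retrieve the stored document
            System.out.println("  score=" + scoreDoc.score
                               + " text=" + doc.get("text"));
        }
    }
    searcher.close();
}

Compiled against lucene-core-2.4.0.jar and run with the four query strings above, it should print hit counts and stored field values along the lines just described.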

    Exercise for the reader

    [223001410510] |Refactor HelloLucene.java into two programs: BuildHelloLucene.java and
    SearchHelloLucene.java, where BuildHelloLucene uses its command line arguments to build the index, instead of the strings supplied by String[] texts. [223001410520] |Running BuildHelloLucene with inputs "hello world", "hello sailor", and "goodnight moon" and then running SearchHelloLucene with inputs "hello world", "hello", "moon" and "foo" should give the same results as above. [223001410530] |Different input texts and searches will expose the behaviour of the StandardAnalyzer in particular, and Lucene in general.
  • JSR 247: Data Mining 2.0
  • [223001420040] |Is anyone using it? [223001420050] |Are there public implementations? [223001420060] |It seems to have become inactive around 2006 on the Java site. [223001420070] |The only reference I can find is the Oracle javadoc for JSR 73 (Data Mining 1.0), which is part of the Oracle Data Mining kit (here’s their overview PDF). [223001420080] |I found it when looking through the source code distribution for Satnam Alag’s Collective Intelligence in Action (Manning Publications, 2008). [223001420090] |The author included decompiled source for a version of someone’s JSR 247 implementation, rooted at package javax.datamining. [223001420100] |The number of interfaces defined is truly remarkable. [223001420110] |For example, javax.datamining.algorithm includes [223001420120] |
  • svm.classification.SVMClassificationSettingsFactory
  • [223001420130] |as one of several SVM-specific interfaces. [223001420140] |The package javax.datamining.supervised contains: [223001420150] |
  • classification.ClassificationTestMetricsTaskFactory,
  • [223001420160] |just one of dozens of interfaces dealing with classifier test configuration. [223001420170] |I couldn’t actually find implementations of SVMs in the Manning distribution of the decompiled javax.datamining package. [223001440010] |Building a Stemming Corpus: Coding Standards [223001440020] |We’d very much like to be able to offer stemmers for multiple languages, for applications in search and feature extraction for classifiers. [223001440030] |We’d like to base them on pre-compiled dictionaries for common words and an automatic system trained on the dictionary. [223001440040] |We only have research rights to CELEX 2, and it’s only available for English, Dutch and German anyway. [223001440050] |It’s also small, with only about 50K English words. [223001440060] |We’ve had success with Amazon’s Mechanical Turk, so we thought we’d try to define a stemming standard. [223001440070] |The legwork (well, Python-work and REST-work) is being done by Emily Jamison, who’s working as an intern for us. [223001440080] |I’ll discuss the results we got from the mechanical Turk in a later blog entry. [223001440090] |Emily’s about to launch a third mechanical Turk run based on what we’ve learned from the first two. [223001440100] |The rest of this post is just about defining how to stem. [223001440110] |

    The Universe of Tokens

    [223001440120] |We started with the set of lowercased alphabetic tokens appearing at least twice in the English section of the Leipzig Corpus. [223001440130] |Emily sampled 1000 of these at random and had them each stemmed by 5 Turkers. [223001440140] |We’ve actually done this twice trying out two different interfaces. [223001440150] |

    What’s a Stem?

    [223001440160] |For technical reasons, I want a corpus consisting of a word and the next simpler word(s) out of which it’s made, not counting productive affixes —basically an unlabeled morphological parse tree. [223001440170] |Thus it might take a sequence of such stemming steps to produce a root form (e.g. “restfully” to “restful” to “rest”). [223001440180] |In the first run, we tried to collect the affixes, too (e.g. “ly”, “ful” and no-stem). [223001440190] |My first motivation is term (feature) extraction for search (or general classifiers). [223001440200] |I want to include the word itself and its whole chain of stems in the index and in the result of parsing queries. [223001440210] |That way, if the whole long form matches, you get a higher score than if just a stem matches. [223001440220] |This is similar to indexing Chinese with n-grams and words and should have a similar advantage for precision and recall. [223001440230] |Hopefully this will mitigate the main problem raised in my earlier post, To Stem or Not to Stem?. [223001440240] |The second motivation is to gather data to train an automatic stemmer. [223001440250] |This is much easier to do a single stem at a time, because the data’s awfully sparse for long sequences of prefixes and suffixes and compounds. [223001440260] |A third motivation is to make the task easier for the mechanical Turkers. [223001440270] |The first run that asked for all the subwords and affixes was just too much work (4-5 minutes per 20 words for all stems versus 1 minute per 20 words for just the “next” stem). [223001440280] |

    Does it Work?

    [223001440290] |Emily and I each stemmed the 1000 words after discussing the standard (but not the examples). [223001440300] |It took us both less than an hour, making me think we could, with reasonable precautions against RSI, label a corpus of the most frequent 100,000 words ourselves. [223001440310] |Our agreement rate was an abysmal 85%. [223001440320] |Most of the errors were on compounds. [223001440330] |Not counting compounds, our agreement was closer to 95%, with most of the errors either brainos or typos, not misunderstandings of the standard. [223001440340] |

    Case Studies

    [223001440350] |In the rest of this post, I break down the cases which required some thoughtful adjudication. [223001440360] |The raw word’s on the left and the stem in our gold standard on the right. [223001440370] |I include notes on CELEX-2’s stemming in parentheses when they disagreed with our results. [223001440380] |Foreign/Historical Roots. [223001440390] |Here, we often have multiple words in English that share a foreign or historical root that is not itself an English word. [223001440400] |For instance, “hypo” and “epidemi” are not English words, nor is there a shared English stem for “euphoria” and “euphoric”. [223001440410] |
  • odious: odious [223001440420] |
  • euphoria: euphoria (CELEX-2 stems “euphoric” to “euphoria”) [223001440430] |
  • mathematical: mathematics [223001440440] |
  • epidemiology: epidemic (CELEX: missing) [223001440450] |
  • hypocrisy: hypocrisy (CELEX: hypocrite) [223001440460] |
  • maximize: maximum [223001440470] |Prefix/Suffix Ambiguity. [223001440480] |A word with a productive prefix and suffix can often be analyzed in two ways. [223001440490] |For instance, “unlocking” could be “un” + “locking” or “unlock” + “ing”. [223001440500] |Here, because the action is an unlock action, we prefer that root. [223001440510] |Similarly, we prefer “modernised” to “unmodern” as a reduction for “unmodernised”. [223001440520] |The case of “overcollateralization” is trickier, with either “overcollateralize” or “collateralization” being a reasonable choice. [223001440530] |
  • unmodernised: modernised (CELEX: missing) [223001440540] |
  • prearrangement: prearrange [223001440550] |
  • unlocking: unlock (CELEX: missing) [223001440560] |
  • incorrigible: corrigible [223001440570] |
  • disability: disable [223001440580] |
  • inconvenience: inconvenient (CELEX: convenient) [223001440590] |
  • resentencing: sentencing (CELEX: missing) [223001440600] |
  • overcollateralization: collateralization (CELEX: missing) [223001440610] |Compound/Affix Ambiguity. [223001440620] |Compounds provide a general case of ambiguity with affixes or suffixes. [223001440630] |Again, we want the natural word to be left behind after stemming, which is why “headquarters” must be analyzed as “head” + “quarters” rather than “headquarter” + “s” —it’s not the plural of “headquarter”. [223001440640] |On the other hand, “cockfighting” becomes “cockfight” + “ing”, because the compound is a natural word. [223001440650] |In contrast, “newsgathering” becomes “news” + “gathering”, because “newsgather” isn’t a typical compound. [223001440660] |Cases like “ultraleftist” are harder, because they make sense either way. [223001440670] |Also note that you get non-affixational morphology with “signalmen”, the plural of “signalman”. [223001440680] |
  • cockfighting: cockfight (CELEX: cock fighting) [223001440690] |
  • headquarters: head quarters [223001440700] |
  • ultraleftist: ultra leftist (CELEX: missing) [223001440710] |
  • omnipotence: omnipotent [223001440720] |
  • steelmaking: steel making (CELEX: missing) [223001440730] |
  • stockholder: stock holder [223001440740] |
  • signalmen: signalman (-) [223001440750] |
  • weatherbeaten: weather beaten (CELEX: weather-beaten) [223001440760] |
  • supercomputer: super computer (CELEX: missing) [223001440770] |
  • newsgathering: news gathering (CELEX: missing) [223001440780] |To be honest, I steamrolled my own opinions in a couple of cases here. [223001440790] |Emily still prefers “ultraleft” + “ist” to “ultra” + “leftist”. [223001440800] |Digging my old linguistic semantics hat out of the closet, I’d argue there are two ambiguous derivations (sort of like “the ball in the box on the table”). “ultraleft” + “ist” means anyone on the extreme left, whereas “ultra” + “leftist” means someone who is an extreme instance of a leftist. [223001440810] |To see this subtle distinction, note that “ultra” doesn’t mean moving more to the left —it’s just extreme in general. [223001440820] |So you have to read “ultra” + “leftist” the way you’d read “ultramoderate”. [223001440830] |Non-Standard Affixes. [223001440840] |Next, we have non-standard affixes or “funny” words, like “freebie”, which is clearly related to “free”. [223001440850] |
  • freebie: free (CELEX: freebie) [223001440860] |Fixed Affixes. [223001440870] |Some prefixes seem to be awfully tightly bound; it’s not clear “recall” should stem to “call” even though they’re morphologically related. [223001440880] |
  • enclose: close (CELEX: enclose) [223001440890] |
  • recall: call [223001440900] |Comparatives/Superlatives. [223001440910] |We were unclear what to do with comparatives, but decided they should reduce to their basic scalar adjectives (“fitter” and “fittest” to “fit”), as it makes the most sense linguistically: [223001440920] |
  • fitter: fit [223001440930] |
  • furthest: far (CELEX: missing) [223001440940] |For search and feature extraction, it might make sense to first reduce the superlative to the comparative (“fittest”: “fitter”), then the comparative to the scalar adjective (“fitter”: “fit”). [223001440950] |False versus Rare Roots. [223001440960] |Sometimes words aren’t compounds even though they look like it, such as “bulldoze”, which is a back-derivation from the proper name “Bulldozer”. [223001440970] |Who knew that “fangle”, “percept” and “aniline” were words? [223001440980] |Or that “weary” isn’t related to “wear”? [223001440990] |And then there’s “incommunicado” where we might say “communicado” (73K Google hits), even though “communicado” is not in any of the online dictionaries. [223001441000] |Cases like “chloracne” are less clear, because “chlor”, one of the compounded nouns, isn’t an ordinary English word. [223001441010] |We break apart even the words that standard introductions to linguistic morphology tell you are not morphologically complex, like “gooseberry” and “strawberry”. [223001441020] |This might argue that “bulldoze” should be broken into “bull” + “doze”, especially for search, where users often don’t know which words are compounds. [223001441030] |
  • bulldoze: bulldoze (CELEX: bull doze) [223001441040] |
  • fangled: fangle (CELEX: missing) [223001441050] |
  • weary: weary [223001441060] |
  • perceptual: percept (CELEX: missing) [223001441070] |
  • incommunicado: incommunicado [223001441080] |
  • doomsday: dooms day (CELEX: doomsday) [223001441090] |
  • chloracne: chlor acne (CELEX: missing) [223001441100] |
  • polyaniline: poly aniline (CELEX: missing) [223001441110] |
  • gooseberry: goose berry [223001441120] |Semantic Drift. [223001441130] |Some words don’t seem very closely related to their stems because the derived word has gathered a fixed meaning. [223001441140] |We stemmed them anyway. [223001441150] |
  • directory: direct [223001441160] |I don’t think the stem should be “director”. [223001441170] |The examples surrounding “authority” versus “author” in my previous blog post were clearer. [223001441180] |Misspellings/Contractions/Slang. [223001441190] |We just treated them as if they were spelled properly and did as well as we could. [223001441200] |
  • oughta: ought (CELEX: missing) [223001441210] |Derivational Ambiguity: Our first example was better, leaving us asking whether “butcher” should stem to “butch” (imagine the gender-based usage, not the meat-cutting one). [223001441220] |In some cases there are ambiguities. [223001441230] |In our sample of 1000, we found this one: [223001441240] |
  • sweater: sweat [223001441250] |More than One Issue. [223001441260] |Some words have multiple issues (non-standard roots plus compound ambiguity; misspellings plus foreign roots). [223001441270] |
  • sadomasochism: sado masochism (sadomasochism) [223001441280] |
  • honoaria: honoarium (CELEX: missing but they stem “honourarium” to “honour”) [223001441290] |

    Google(fight) Bug

    [223001441300] |Emily found an interesting bug in Google’s results exploring “honoaria”, which reported 891K hits (“honoraria” reported 896K hits): [223001441310] |but ran out of actual results after 66: [223001441320] |So much for Google-based lexicography. [223001441330] |What’s the world coming to when you can’t trust a good old fashioned googlefight? [223001450010] |Atkinson, Sack, Santoro, and Strothotte (1986) Min-max heaps and generalized priority queues. CACM. [223001450020] |I’m busy working on auto-completion for text inputs (like on the Google home page), so I’ve been thinking hard about efficiency, and thus about bounded priority queues. [223001450030] |Auto-completion’s easy compared to more general spell checking in that it has a natural A* search (the A* bound is the best scoring completion of a prefix). [223001450040] |Good A* search requires good priority queues. [223001450050] |Recall that a priority queue is a data structure with an element ordering (e.g. a java.util.Comparator) that supports operations to add elements, and to pop (or peek at) the largest element. [223001450060] |A reasonable implementation is with a resizable heap, which makes the add and pop operations logarithmic on the number of entries and the peek operation constant time. [223001450070] |For search algorithms, the beam heuristic (in one incarnation), restricts the queue to the best K elements (it’s a heuristic because it can result in search errors). [223001450080] |This means a bounded priority queue, which has a fixed upper bound on size. [223001450090] |The problem is that you can’t use a regular heap-based priority queue for this, because there’s no way to pop off the minimal element in less than linear time in the size of the heap. [223001450100] |You can use a fully sorted implementation, like java.util.TreeSet, and that’s exactly what LingPipe’s util.BoundedPriorityQueue does. [223001450110] |But it’s inefficient, because the red-black tree underlying the set implementation does a whole lot of work, including allocating nodes. [223001450120] |Today’s review paper has the “right” algorithm to do bounded priority queues of fixed maximum sizes, which I implemented way back in LingPipe 2.4 in util.MinMaxHeap: [223001450130] |
  • Atkinson, M.D., J.-R. Sack, N. Santoro and T. Strothotte. 1986. [223001450140] |Min-max heaps and generalized priority queues. [223001450150] |Communications of the ACM 29(10):996-1000.
  • [223001450160] |The usual heap is a binary tree satisfying the heap property, namely that each element is larger than all of its descendants. [223001450170] |It’s faster than fully ordering, because of the reduced requirement on sorting all elements. [223001450180] |A min-max heap is like a heap where every other level in the binary tree satisfies the heap property, with the root being the largest element in the queue, and the two daughters of the root being the two smallest elements in the queue, and so on down, alternating greater or lesser at each level. [223001450190] |With this arrangement of nodes, getting either the largest or smallest element is constant time. [223001450200] |The bubble-up portion of the min-max heap algorithm is very similar to that of the heap algorithm, only with a lot of grand-daughter/grand-mother computations in the tree rather than simple mother/daughter computations. [223001450210] |Because it’s backed by an array, it doesn’t require any allocation of node objects on the fly. [223001450220] |It would be possible to resize such a min-max heap, but the basic version was plenty twisty to code on its own. [223001450230] |A related algorithm is the double-ended priority queue, which basically keeps two tree structures overlaid on top of each other, one for the largest-to-smallest and one for the reverse ordering. [223001450240] |It’s a better algorithm if you need intervals, which for some reason java.util.SortedSet requires. [223001450250] |I’ve just extended LingPipe’s priority queues to implement SortedSet, but I had to punt on the dynamic views returned by the headSet(), tailSet() and subSet() methods. [223001450260] |For auto-complete, I also just added a small bounded priority queue for accumulating results, because bubble-sort’s much faster for small sets than all the tree-fiddling in tree sets. [223001460010] |The Incompleteness of Alias Lists [223001460020] |Yes, Alias-i was named after the notion of alias, as in lists of aliases for persons, or for genes, or for consumer products. [223001460030] |The problem with alias lists is that they’re notoriously incomplete. [223001460040] |The OpenCyc project lists the following aliases for ultra-thin flat panel display: [223001460050] |English Aliases: [ "ultra thin flat panel screen", "ultra thin flat panel screens", "ultra thin screen", "ultra thin screens", "ultra-thin display", "ultra-thin displays", "ultra-thin flat panel displays", "ultra-thin screen", "ultra-thin screens" ] [223001460060] |Lists like these also seem arbitrary. [223001460070] |Why provide “ultra-thin” versus “ultra thin” variations in only some of the aliases? [223001460080] |Is “ultra thin display”, which is not listed, an inferior alias? [223001460090] |For some reason, the title isn’t included as an alias. [223001460100] |The temptation is to use a regular expression, but often there are dependencies that are hard to capture with a regular expression. [223001460110] |For instance, if I want to be able to say “tall white coffee” in all six orders, it’s pretty much impossible to characterize with a regex; the tightest I could come up with was (tall (white coffee|coffee white))|(white (tall coffee|coffee tall))|(coffee (tall white|white tall)). [223001460120] |So what we wind up doing instead is taking a set of aliases and then looking at some kind of approximate matching. [223001460130] |Our goal right now’s to get a high recall extraction over gene names.
[223001460140] |We know we can get the mentions, but what about the Entrez-Gene-ID/mention pairs? [223001470010] |Affero Gnu Public License (AGPLv3) for LingPipe 4.0? [223001470020] |We’re busy preparing to launch LingPipe 4.0. [223001470030] |We’re leaning toward switching from our “unauthorized” Alias-i Royalty Free License (version 1) to the GNU Affero General Public License (version 3) (AGPLv3). [223001470040] |The motivation for our license was very similar to Affero’s. Affero, the original license developer, wanted to close the so-called “Application Service Provider (ASP) loophole”, which didn’t count running software as a service as redistribution. [223001470050] |Thus application service providers (ASPs) may run Gnu Public Licensed (GPLed) software as a service without distributing their own software linked to it. [223001470060] |The AGPL is stronger (i.e. its viral license is more contagious) in that it considers service providers to be distributing the software. [223001470070] |We’re neither lawyers nor software freedom fighters (though I am getting tired of all this software patent nonsense). [223001470080] |The motivation for our royalty-free license was partly to prevent potential paying customers from running LingPipe on proprietary data, while keeping the system as “open” as possible, especially for research and teaching purposes. [223001470090] |Our business model was thus something like MySQL’s business model. [223001470100] |Here’s a nice survey of open-source business models —we’re a mix of 3, 4, and 5, plus 1 if you count DARPA/NIH grants as donations/subsidies. [223001470110] |Check out Bruce Perens’s quite sensible review of the plethora of OS licenses: How Many Open Source Licenses Do You Need? [223001470120] |(Spoiler: Perens, who wrote the original version of OSI’s Open Source Definition, now thinks 4 are sufficient; I suppose you can’t blame a computer scientist for generalizing!). [223001470130] |We’re wondering what the ramifications of a switch to AGPLv3 would be. [223001470140] |On the upside, we will potentially attract more users and perhaps even developers by having a license that plays nicely with others (specifically GPL and Apache/BSD). [223001470150] |We also think it may be more attractive to funding agencies, like NIH. [223001470160] |On the downside, we’re worried it may adversely affect our ability to sell our software, or perhaps even the business as a whole. [223001470170] |Are we crazy, or should we have done this ages ago? [223001480010] |Eclipse IDE for 64-bit Windows and 64-bit Java [223001480020] |New: Belorussian translation Ide Eclipse of the blog courtesy of Amanda L. [223001480030] |Here’s where you want to go for Eclipse for 64-bit Java + 64-bit Windows, including Windows XP, Vista, or Windows 7, or for any other “oddball” configuration not linked from their main download page: [223001480040] |
  • Eclipse Download List: http://download.eclipse.org/eclipse/downloads/ [223001480050] |From there you need to click through for a specific release. [223001480060] |The most recent release is at the top of the first table; right now it points to: [223001480070] |
  • Latest Release List: Eclipse 3.5 (“Galileo”) Release Page [223001480080] |From the release page, you have to choose a platform and click on the http link. [223001480090] |That’ll then take you to the download page. [223001480100] |For 64-bit Windows (x86_64) use: [223001480110] |
  • 64-bit Windows Zip: 64-bit Windows Zip File [223001480120] |I’m using the Windows (x86_64) builds on Vista on both an Intel Core-2-based laptop and dual-Xeon-based workstation. [223001480130] |It works with the 1.5 JDKs from Sun. [223001480140] |We have instructions on getting it running in our LingPipe Eclipse Tutorial. [223001480150] |I’m about to add that you need to close the funny start page they have now, which wasn’t obvious. [223001480160] |This took some finding. [223001480170] |Actually, it took a hint from Mike, because you certainly won’t find it by navigating, which will take you to: [223001480180] |
  • Eclipse Partial Download Page with Cute Graphics:http://www.eclipse.org/downloads/ [223001480190] |which doesn’t have links to 64-bit versions of Eclipse for Windows. [223001480200] |If you search for on their site, you get old bboard questions without answers. [223001480210] |If you search for on Google, Yahoo, or MSN, only MSN has anything close to a helpful answer in the first 10 hits. [223001490010] |TreeSet.removeAll() and Comparators Not Consistent with Equals [223001490020] |I thought it’d be convenient to upgrade com.aliasi.util‘s BoundedPriorityQueue to have it implement java.util‘s SortedSet. [223001490030] |That way I could use more generic returns for things like the auto-completer. [223001490040] |Turns out I built a tighter bubble-sort based sorted set for that in the end, but that’s a different story. [223001490050] |So I replaced the extends AbstractCollection with extends AbstractSet. [223001490060] |Everything’s hunky-dory, and it’s passing all of our unit tests. [223001490070] |Move on a few days, and I’m doing massive refactorings to remove compiler warnings, so I want to use Eclipse, which has even more aggressive compiler warnings than the JDK. [223001490080] |I couldn’t get it to work on 64-bit Windows, so I go home and fire up our antique ThinkPad. [223001490090] |Eclipse only does JDK 1.5, so I plug that in. [223001490100] |I do a little refactoring, then run unit tests. [223001490110] |Failing in cluster.CompleteLinkClusterer. [223001490120] |WTF? [223001490130] |I shouldn’t have changed anything recently that has an effect on clustering. [223001490140] |I trace through the code by dumping out dendrograms as they get created (agglomerative clusterers are easy to trace). [223001490150] |It’s getting the distances wrong. [223001490160] |The only way it could do that is if the priority queue of inter-cluster distances is messed up. [223001490170] |Uh oh, that’s my new BoundedPriorityQueue. [223001490180] |How’d I mess it up? [223001490190] |Next, I back all the changes I’ve made with Eclipse and still it’s failing. [223001490200] |But hang on, it worked on my machine! [223001490210] |So I switch the JDK back to 1.6 and voilà, it works again. [223001490220] |How can it work in 1.6 and not 1.5? [223001490230] |Unfortunately, I still don’t know the answer. [223001490240] |But I did, through more fine-grained tracing, figure out where it was going wrong. [223001490250] |The culprit is the inconsistent behavior of TreeSet‘s removeAll() method, which is only javadoc-ed on the superclass, AbstractSet, which points to the source of the problem: [223001490260] |This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. [223001490270] |If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. [223001490280] |If it is so contained, it is removed from this set with the iterator’s remove method. [223001490290] |If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set’s remove method. [223001490300] |The problem here is that iterating over the tree set itself (what happens when it’s smaller) uses a test with contains() and then calls the remove method on the iterator, whereas iterating over the argument collection calls remove() on the tree set. 
[223001490310] |These differ with respect to whether they use the natural ordering (equals()) or whether they use the tree set’s comparator method. [223001490320] |Thus tree sets whose comparators are not compatible with natural equality run into problems. [223001490330] |Needless to say, there’ve been lots of “bugs” filed against this behavior, every one of which Sun dismisses as “obeying the letter of the spec”, but I happen to agree with user “gouldina” on this one: [223001490340] |
  • JDK Bug 4730113: TreeSet removeAll(), retainAll() don’t use comparator; addAll() does
  • [223001490350] |
  • JDK Bug 6394757: (coll) AbstractSet.removeAll is surprisingly dependent on relative collection sizes [223001490360] |Here’s the code from the 1.5 JDK (version 1.26): [223001490370] |So I simply implemented my own removeAll() for bounded priority queues that used the iterator-based approach (first case above). [223001490380] |That’s how AbstractCollection implements it, and why the program worked before. [223001490390] |Problem solved. [223001490400] |Of course, it’s ridiculously inefficient for large queues. [223001490410] |I need a better data structure here. [223001490420] |The root cause (pun intended) is a trickier question. [223001490430] |Both JDK 1.5 and 1.6 have the same definition of removeAll(), but it’s a maze of twisty little passages, all different, out from there through all the indirection of the remove() and contains() through the superclasses and contained map’s definitions. [223001490440] |Something must be different in 1.5 and 1.6, because I’m getting different answers. [223001490450] |But I have better things to do than trace down what it is. [223001510010] |Decorator and Adapter Patterns Everywhere [223001510020] |Now that we’ve been gearing up some applications involving logistic regression and feature-based clustering, I’ve been getting lots of requests for how to do things in LingPipe. [223001510030] |The answers all point to two general design patterns: [223001510040] |
  • Wikipedia: Adapter Pattern (aka Wrapper)
  • [223001510050] |
  • Wikipedia; Decorator Pattern
  • [223001510060] |Really, I’m not an architecture astronaut. [223001510070] |No more pattern talk, I promise, just case studies. [223001510080] |

    Case Study 1: Single Evaluation Spanning Cross-Validations

    [223001510090] |Now that we have a cross-validating corpus built in for classifiers, it’s getting heavy use. [223001510100] |The typical use is to evaluate each fold (perhaps collecting cross-fold stats for summaries): [223001510110] |You can print out stats in the loop, or collect up stats for the whole corpus. [223001510120] |But what if we want to combine the evaluations? [223001510130] |The trick is to write a simple mutable classifier filter (more patterns): [223001510140] |Now, we pass the filter into the evaluator and set the classifier in it before each round: [223001510150] |That’s it. [223001510160] |After the loop’s done, the single evaluator holds the combined eval under cross-validation. [223001510170] |We got around the immutability of the single classifier held by the evaluator by writing a simple filter that has a mutable object. [223001510180] |
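As a sketch of the kind of filter involved (using made-up interface and class names rather than LingPipe's actual classifier and evaluator APIs), the wrapper simply delegates to whichever classifier it currently holds:

// Illustrative only; these are not LingPipe's actual interfaces.
interface Classifier<E,C> {
    C classify(E input);
}

// Mutable filter: the evaluator holds a single reference to this wrapper,
// and we swap in each fold's freshly trained classifier before evaluating.
class SettableClassifier<E,C> implements Classifier<E,C> {
    private Classifier<E,C> mDelegate;
    public void setClassifier(Classifier<E,C> delegate) {
        mDelegate = delegate;
    }
    public C classify(E input) {
        return mDelegate.classify(input);  // forward to the current delegate
    }
}

Inside the cross-validation loop, each fold's classifier is handed to setClassifier() before the evaluator visits that fold's test cases, so one evaluator accumulates statistics across every fold.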

    Case 2: Nullary Tokenizer Factory Constructors

    [223001510190] |This case just came up on our mailing list. [223001510200] |I made a regrettable design decision in writing only the fully qualified class name of a tokenizer during serialization so that it gets reconstituted using reflection over the nullary (no-arg) constructor. [223001510210] |But what if you want a regular-expression based tokenizer that has no nullary constructors? [223001510220] |Simple, write an adapter. [223001510230] |That’s it. [223001510240] |Same behavior only now we have the necessary nullary constructor for my brain-damaged serializer. [223001510250] |
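Concretely, such an adapter can be tiny. This is a sketch under some assumptions: a regex-based factory along the lines of LingPipe's RegExTokenizerFactory constructed from the expression, the TokenizerFactory interface with its single tokenizer(char[],int,int) method, and a regex that is just an illustrative choice.

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Adapter supplying the nullary constructor needed for serialization by class name.
public class SimpleRegexTokenizerFactory implements TokenizerFactory {
    private final TokenizerFactory mFactory
        = new RegExTokenizerFactory("\\S+"); // tokens = maximal runs of non-whitespace
    public Tokenizer tokenizer(char[] cs, int start, int length) {
        return mFactory.tokenizer(cs, start, length);  // delegate to the wrapped factory
    }
}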

    Case Study 3: Decorators

    [223001510260] |Let’s say you have a corpus and it’s being passed into some program that takes a long time to chew on it but doesn’t give any feedback. [223001510270] |We can instrument the corpus with a decorator to give us a little feedback: [223001510280] |Yes, that’s it. [223001510290] |Well, actually we need to fill in the ellipses with the same thing for visitTrain(). [223001510300] |On the same topic, suppose we have a text corpus and we want to restrict it to only texts of length 30 to 50 (yes, that just came up this week in a client project). [223001510310] |We just apply the same trick twice, filtering the corpus by filtering the handler: [223001510320] |Basically, these approaches all put something in between the implementation you have and the implementation that’s needed, usually acting as a filter. [223001510330] |While it’s not quite Lisp, it’s getting close in terms of parenthesis nesting, and is a handy tool for your Java toolbox. [223001520010] |Better Fact Checking than in Science [223001520020] |Magazines and newspapers often have better fact checking than scientific journals. [223001520030] |Not just a little bit better, but way better. [223001520040] |Check out this John McPhee article from the 9 February 2009 issue of The New Yorker: [223001520050] |
  • Personal History: Checkpoints [registration required to get beyond the intro]
  • [223001520060] |The article goes into amazing detail about fact checking at The New Yorker magazine. [223001520070] |In particular, it shows a great sensitivity to implications of punctuation and grammatical decisions. [223001520080] |I wasn’t the only one struck by this article. [223001520090] |Check out Peter Norvig’s response, which covers more of the details: [223001520100] |
  • Peter Norvig’s All We Want are the Facts, Ma’am [223001520110] |at the Language Log: [223001520120] |
  • Chris Potts’s Fact-checking commas [223001520130] |and some related posts by Andrew: [223001520140] |
  • Andrew Gelman’s Harold Ross would never have let this one get by [223001520150] |
  • Andrew Gelman’s Is there such a thing as a statistical fact checker? [223001520160] |It also invoked comparisons to software: [223001520170] |
  • Julia Lerman’s Fact Checking [223001520180] |Of course, so many more people read The New Yorker or The New York Times than a scientific paper that there are lots of blogs about where their fact-checking went wrong. [223001530010] |Speeding up K-means Clustering with Algebra and Sparse Vectors [223001530020] |This blog entry’s going to show you how to implement K-means clustering a bit more efficiently than the naive algorithm with the help of a little algebra and sparse vector manipulation. [223001530030] |The current implementation of k-means clustering in LingPipe is woefully slow: [223001530040] |
  • LingPipe Javadoc: cluster.KMeansClusterer [223001530050] |We’re working on a medium-size clustering problem for a customer with a few hundred thousand short-ish text messages being clustered. [223001530060] |It was taking roughly forever (OK, several hours/epoch). [223001530070] |I knew it was slow when I was writing it. [223001530080] |You can tell by the apologetic javadoc implementation note: [223001530090] |Implementation Note: The current implementation is an inefficient brute-force approach that computes the distance between every element and every centroid on every iteration. [223001530100] |These distances are computed between maps, which itself is not very efficient. [223001530110] |In the future, we may substitute a more efficient version of k-means implemented to this same interface. [223001530120] |More efficient implementations could use kd-trees to index the points, could cache scores of input feature vectors against clusters, and would only compute differential updates when elements change cluster. [223001530130] |Threading could also be used to compute the distances. [223001530140] |All true, but there’s a simpler approach than kd-trees or multiple threads or caching that I just implemented and now the epochs are taking roughly no time at all (OK, about a minute). [223001530150] |K-means clustering is very simple. [223001530160] |You start with a set of items and a feature extractor to convert them to feature vectors. [223001530170] |These feature vectors are maps from strings to numbers. [223001530180] |In the first step, items are assigned to clusters somehow, for instance randomly (though there are better ways to do this for some purposes). [223001530190] |Then, on each iteration until a maximum number of iterations or convergence, the mean (aka centroid) of each cluster is computed by averaging all the vectors (centroids are calculated dimensionwise), and each item’s Euclidean distance to each centroid is computed. [223001530200] |After all these computations, each item is reassigned to the closest centroid at the end of an epoch. [223001530210] |Typically, when I do heavy lifting with features, I extract the features once and then re-use the vectors. [223001530220] |I did that with k-means, but didn’t consider the downstream computations closely enough. [223001530230] |When you have a 100,000-feature centroid vector and a 100-feature item vector, it’s rather inefficient to compute Euclidean distance, which is defined as: [223001530240] |Of course, we don’t care about distance per se, so we can save the square root operation and work with squared distances. [223001530250] |There are a couple of problems. [223001530260] |Hash maps in Java involving numbers are wasteful in size and time to convert back and forth to primitives for arithmetic. [223001530270] |They’re also slow to iterate (even linked hash maps). [223001530280] |That’s an obvious bottleneck. [223001530290] |But the real bottleneck is iterating at all. [223001530300] |Remember, the fastest code is the code that doesn’t execute at all (as I noted in a comment about point 6 of John Langford’s excellent blog entry on machine learning code optimization). [223001530310] |
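In symbols, the distance being computed is just the standard Euclidean distance between an item's feature vector elt and a centroid c, along with its squared variant:

\mathrm{dist}(elt, c) = \sqrt{\sum_i (elt[i] - c[i])^2}, \qquad \mathrm{dist}^2(elt, c) = \sum_i (elt[i] - c[i])^2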

    The Algebra

    [223001530320] |The key to optimization is noting that most of the elt[i] values are zero. [223001530330] |So what we do is compute the squared length of each centroid and store it at the beginning of each epoch: [223001530340] |The second part of this equation is key, as it shows how the length of the centroid is related to distances to dimensions with zero values. [223001530350] |Here’s the formula: [223001530360] |Now note that when elt[i]=0, the terms inside the sum cancel. [223001530370] |So we keep going with: [223001530380] |This simple algebra means that after computing the length of each centroid, each distance computation only requires a number of operations proportional to the number of non-zero elements in the item’s feature vector. [223001530390] |
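Spelling that algebra out (this is just the standard expansion, with elt the item vector and c the centroid): the squared length stored for each centroid is

\|c\|^2 = \sum_i c[i]^2

and expanding the squared distance dimensionwise gives

\mathrm{dist}^2(elt, c) = \sum_i (elt[i] - c[i])^2 = \sum_i \left( elt[i]^2 - 2\, elt[i]\, c[i] + c[i]^2 \right)

When elt[i] = 0 the first two terms vanish and only c[i]^2 is left, which is already accounted for by \|c\|^2, so

\mathrm{dist}^2(elt, c) = \|c\|^2 + \sum_{i\,:\,elt[i] \neq 0} \left( elt[i]^2 - 2\, elt[i]\, c[i] \right)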

    Sparse Vectors

    [223001530400] |Combining this with more reasonable sparse vector representations (parallel int[] index and double[] value arrays), and reasonable dense vector representations (i.e. double[]), K-means is now hundreds of times faster for the kinds of problems we care about: large sets of sparse feature vectors. [223001530410] |I know I can squeeze out at least another factor of 2 or 3 on my dual quad-core machine by multi-threading; k-means and other EM-like algorithms are trivial to parallelize because all the heavy computations in the inner loops are independent. [223001530420] |
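As an illustrative sketch (not LingPipe's actual implementation), the resulting inner loop looks roughly like this: given an item's sparse representation as parallel index and value arrays, a dense centroid, and the centroid's squared length precomputed once per epoch, the squared distance only touches the item's non-zero dimensions:

// Illustrative sketch, not LingPipe's actual implementation.
// indexes/values: the item's sparse feature vector; centroid: dense;
// centroidLength2: the centroid's squared length, computed once per epoch.
static double squaredDistance(int[] indexes, double[] values,
                              double[] centroid, double centroidLength2) {
    double dist2 = centroidLength2;
    for (int k = 0; k < indexes.length; ++k) {
        double eltI = values[k];
        double cI = centroid[indexes[k]];
        dist2 += eltI * eltI - 2.0 * eltI * cI;  // only non-zero dimensions contribute
    }
    return dist2;
}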

    Memory Locality

    [223001530430] |The next obvious speedup would be to improve memory locality (Langford’s tip number 3, and the heart of a lot of the speedups in Jon Bentley’s wonderful book Programming Pearls, the pearls in which were extracted from his Dr. Dobb’s columns). [223001530440] |As it is, I iterate over the items, then for each item, I iterate over the cluster centroids computing distances, which involves iterating over the non-zero values in the item’s feature vector. [223001530450] |It’d be much better for memory locality to do this by iterating over the non-zero values in the item’s feature vector and pulling back an array indexed by cluster. [223001530460] |That way, there’d be a number of non-memory-contiguous lookups equal to the number of non-zero features, and everything else would be local. [223001540010] |Softmax without Overflow [223001540020] |Softmax is another name for the generalization of the logistic sigmoid function to n outcomes. [223001540030] |It’s popular as an activation function for neural networks and as an (inverse) link function for generalized linear models like logistic regression. [223001540040] |Softmax converts an arbitrary real-valued vector into a multinomial probability vector. [223001540050] |Except when it doesn’t. [223001540060] |LingPipe uses softmax in two places. [223001540070] |First, to construct a classify.JointClassification using joint log probabilities, which requires properly scaled linear probabilities in order to construct its superclass classify.ConditionalClassification. [223001540080] |The second is inside multinomial logistic regression, where it’s used to convert the linear predictors into probabilities. [223001540090] |Suppose we have a vector of finite numbers: [223001540100] |The softmax operation converts it into a vector whose entries are finite, non-negative, and sum to 1: [223001540110] |where the normalizing term is the usual sum: [223001540120] |so that by construction: [223001540130] |The goal is to actually compute the terms in the softmax vector. [223001540140] |The problem? [223001540150] |Overflow. [223001540160] |The terms in x might be large. [223001540170] |So large that Math.exp(x[i]) overflows. [223001540180] |That doesn’t even need to be that big, just Math.log(Double.MAX_VALUE), or about 710. [223001540190] |It’s especially easy to do in logistic regression under many conditions (e.g. large absolute feature values, large variance within a dimension of features, large predictive differences, weak priors with separable features). [223001540200] |It’s algebra month here at Alias-i, so you shouldn’t be surprised when I note that the vector x can be shifted dimensionwise without changing the softmax result: [223001540210] |where the additive term factors through the normalizing constant: [223001540220] |and thus drops out to establish the equivalence. [223001540230] |What we do is rescale so there’s no overflow by subtracting the maximum of the x values from each x[i]. [223001540240] |That converts overflow into underflow. [223001540250] |Underflow is no problem, because that rounds off to zero, which is a well-behaved floating point number. [223001540260] |The code is roughly as follows: [223001540270] |The real code’s less clear because it caches the exponent computation Math.exp(xs[i] - a) in ps while computing the sum for Z and then just rescales ps by Z. [223001540280] |In practice, suppose x = (305, 300, 150). [223001540290] |So we subtract the max (305) from each entry to get x' = (0,-5,-155).
[223001540300] |We then calculate the normalizer Z = exp(0) + exp(-5) + exp(-155) ~ 1 + 0.0067 + 0 ~ 1.0067. [223001540310] |Note how the exponentiation of the big negative value (-155) is effectively zero. [223001540320] |We now compute the linear probabilities as p' = (exp(0)/Z, exp(-5)/Z, exp(-155)/Z) ~ (0.9933, 0.0067, 0). [223001540330] |The old classifier implementation used the min rather than the max to rescale, which isn’t aggressive enough in that it still allows overflow when the inputs may be large and positive. [223001540340] |That wasn’t a problem for classifiers when the numbers were log probabilities, because they’re already all negative; so there the scaling prevents catastrophic underflow, and we already had that right. [223001540350] |This method is guaranteed not to have any overflow. [223001540360] |And it’s guaranteed not to have all zero values, too. [223001540370] |I’m afraid it’s simply not possible to represent two numbers more than exp(711) apart or so using IEEE double-precision floating point arithmetic. [223001540380] |And frankly, once events are relatively this unlikely, we’re not too worried about rounding to zero. [223001540390] |Look for the new implementation in LingPipe 4.0’s logistic regression classification function. [223001540400] |Coming soon to a web site near you. [223001550010] |Should I Build or Borrow a Text Annotation GUI? [223001550020] |There’s been a hot thread on the BioNLP mailing list about which text annotation GUI to use. [223001550030] |In this case, they wanted to mark up relations. [223001550040] |There were, of course, lots of tools suggested. [223001550050] |But I happen to agree with Peter Corbett, who said: [223001550060] |One option is to make your own. [223001550070] |This is the option I took, and I’m very happy with the results. [223001550080] |I spent a while looking at other annotation tools and finding them to be missing key features or cumbersome to use or difficult to translate the results into the right format. [223001550090] |After discussing this in a group meeting, the consensus was that it would be better and easier to build our own tool than to build a pre- and post-processing pipeline to adapt our formats to a different annotation tool. [223001550100] |Of course, this only works if you have the resources in both time and programmer knowledge to do this. [223001550110] |Luckily, we do here at Alias-i, so that’s what we’ve done in the following cases. [223001550120] |

    Named-Entity Annotation

    [223001550130] |I built a token-centric annotation tool that could be driven by the keyboard and by custom callback behaviors for dragging and highlighting. [223001550140] |We’ve used this for annotating standard named entities and bibliographic citations and I can annotate around 5K tokens/hour accurately using it. [223001550150] |You can get this one from: [223001550160] |
  • LingPipe Sandbox [223001550170] |

    Alignment

    [223001550180] |I built a tool for aligning parallel sequences of symbols back when I was at Bell Labs. [223001550190] |I was working on pronunciation models, so the symbols were short n-grams of chars and short n-grams of phonemes. [223001550200] |I had to build some custom control behavior to get the aligned symbol slider interface right. [223001550210] |

    Morphology and Stemming

    [223001550220] |This was a project for a customer who wanted to develop corpora in multiple languages. [223001550230] |I built a pretty simple unsupervised model (like Wicentowski’s supervised model), and used these annotations to train it. [223001550240] |

    Gene Linkage

    [223001550250] |This is still an active project. [223001550260] |As part of our NIH grant, we’re linking gene mentions in MEDLINE to Entrez-Gene. [223001550270] |Breck and Mitzi built this one to be served by Tomcat over the web and record results in a database. [223001550280] |It uses our high-recall mention finder to highlight potential mentions, and then uses some heuristic matching to pull out possible matches from Entrez-Gene and OMIM, organizing the results using Homologene. [223001550290] |

    Context-Free Grammar Trees

    [223001550300] |My favorite annotation tool was also built at Bell Labs. [223001550310] |I used the inside-outside algorithm to derive probabilities of subtrees, and used a gradient green-yellow-red color scheme to indicate which were likely. [223001550320] |The user clicked accept/reject on a node, and that was filtered from the set of possibilities, inside-outside was recalculated, and the new best analysis consistent with all previous clicks was put up. [223001550330] |We were using it for short sentences arising from a spoken dialogue system for movie listings. [223001550340] |I don’t think it’d have scaled (as a GUI) to the 40+ word sentences common in newswire. [223001550350] |

    Tag a Little, Learn a Little

    [223001550360] |In all cases, I used the tag-a-little, learn-a-little paradigm, which first crossed my radar in the (apparently now retired) Alembic Workbench. [223001550370] |I also auto-fed examples with a big “done” button that also allows you to page back and forth between examples to inspect and correct your previous work. [223001550380] |I can’t stress just how important this is. [223001550390] |In all cases, we need to adjudicate results, but I’ve not built that into any of the interfaces. [223001550400] |

    Mechanical Turk

    [223001550410] |We’ve been using more and more mechanical turk interfaces. [223001550420] |So far, I’ve done much simpler web-form based interfaces for named-entity, and Emily’s gone through three iterations until we found one that works for stemming. [223001550430] |She’s also working on rolling out gene linkage to the Turkers this week. [223001550440] |For named entity and stemming, their annotations are amazingly accurate, especially in aggregate if you give them pre-qualifying tests to reinforce the anno standard. [223001550450] |

    Excel or SQL Dumps

    [223001550460] |This is actually what all of our customers tend to do. [223001550470] |This is particularly easy for classification and spell checking evaluation tasks.