[223001560010] |Computing Sample Mean and Variance Online in One Pass [223001560020] |Update, 30 April 2009: [223001560030] |OK, the good method mentioned in the links below, and in the final comment of mine below, is now implemented in LingPipe. [223001560040] |Here’s the Javadoc and code: [223001560050] |
  • Javadoc: stats.OnlineNormalEstimator [223001560060] |
  • Code: stats/OnlineNormalEstimator.java [223001560070] |Update, 6 April 2009: [223001560080] |Just check out: [223001560090] |
  • Wikipedia: Algorithms for Calculating Variance
[223001560100] |In particular, Welford’s algorithm, which is both online and fairly numerically stable. [223001560110] |There’s a nice evaluation and description in: [223001560120] |
  • John D. Cook: Accurately Computing Running Variance [223001560130] |It shows the problem with computing variances for a set of large values that are close together. [223001560140] |A two pass algorithm that first estimates the mean is even more robust; that combines Welford’s algorithm with Andrew’s suggestion in the comments about subtracting out the mean. [223001560150] |Original Post: [223001560160] |Algebra week continues here at Alias-i with this piece of advice on how to compute sample means and variances in one pass through the data. [223001560170] |Here’s the standard textbook definition of the sample mean of an array x of length n: [223001560180] |Here’s the formula for sample variance: [223001560190] |From the way they’re written, it looks like you need to make two passes through the array x, first to compute the mean and then again to compute the variance. [223001560200] |But it turns out with a little algebra, it’s easy to see we only need a single pass. [223001560210] |The trick is to keep a running total of the sums of the x values: [223001560220] |and the sum of the squares of the x values: [223001560230] |Of course, you need to keep track of n, the number of examples seen. [223001560240] |It’s easy to compute the mean, you just divide by the length as in the first equation above. [223001560250] |With the mean in hand, the variance is easy to compute, too. [223001560260] |We just expand out the square: [223001560270] |and then rearrange: [223001560280] |to derive: [223001560290] |and the final formulas for mean (repeated from above) and variance: [223001560300] |The sample mean and sample variance are maximum likelihood estimates of the true mean and variance. [223001560310] |But the sample variance is biased. [223001560320] |The unbiased variance estimate is larger by a factor of n/(n-1): [223001560330] |In conclusion, the sum of values, sum of squares, and number of samples are collectively sufficient statistics for estimating a normal distribution. [223001560340] |The other nice thing about this is that you can also do it online by updating the sum of values and sum of squares as each value arrives. [223001560350] |The Wikipedia has a nice discussion of unbiased estimators covering sample mean and variance. [223001570010] |How General Should Generic Wildcard Captures Be? [223001570020] |In working on the next release of LingPipe, I’m finding many opportunities to generalize the generic types. [223001570030] |Let’s take the cached feature parser I’m working on right now in the other window. [223001570040] |Feature vectors are extracted as instances of [223001570050] |I had to do that to allow LingPipe utility classes like [223001570060] |and [223001570070] |to be usable as collectors for feature extractors. [223001570080] |The cached feature parser accepts a map that’ll act as a cache for feature vectors. [223001570090] |The most general type that’d work is: [223001570100] |Intuitively (?!), this says that we need any old map (it can be a subtype for the outer type) that can hold mappings from E objects to results that implement maps from strings to numbers. [223001570110] |But is anyone other than an obsessive type geek going to be able to even parse the expression? [223001570120] |Should I just go with the simpler form here: [223001570130] |I think it makes sense to use the simpler form in this setting because it’s going to be a cache that gets constructed under the user’s control and most of the constructors for these kinds of maps are very general (e.g. 
all of Java’s Map implementations and LingPipe’s util.FastCache implementation). [223001570140] |You might think making the nested types covariant would help, but as Josh Bloch points out in Effective Java (2nd Ed.), this is problematic because the direction of subclassing differs for arguments and return results. [223001570150] |That’s why there are both super and extends in the wildcard capture definitions above. [223001570160] |In the simplest case, you extend return results and super arguments. [223001580010] |Arthur & Vassilvitskii (2007) k-means++: The Advantages of Careful Seeding [223001580020] |The method written up in this paper comes independently evaluated and recommended by Bruce Carneal at UCSD: [223001580030] |Arthur, David and Sergei Vassilvitskii (2007) k-means++: The Advantages of Careful Seeding. [223001580040] |SODA 2007. (also available: presentation slides). [223001580050] |As we all know, k-means, like EM, is a mode-finding algorithm, meaning that it’s guaranteed to find local optima, but unlikely (even in practice) to find global optima. [223001580060] |k-means++ is a simple probabilistic means of initialization for k-means clustering that not only has the best known theoretical guarantees on expected outcome quality, but also reportedly works very well in practice. [223001580070] |The key is a cluster-initialization technique that preserves diversity of seeds while being robust to outliers. [223001580080] |

    Hard Clustering Error

    [223001580090] |We’ll only consider n-dimensional Euclidean space, though the approach can be generalized. [223001580100] |We have a set X of points and are looking for a set C of points, of size K, called centers that minimizes the error: [223001580110] |where the point C(x) in C is the closest center to the point x: [223001580120] |Sorry to overload C (C as a set of clusters; C(x) as a function from an element to its cluster) here, but it’s standard for equivalence relations like clusterings. [223001580130] |
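The two formulas referred to above were images in the original post. Reconstructed from the surrounding definitions, they are the usual hard-clustering (k-means) objective:

```latex
\mathrm{Err}(C, X) \;=\; \sum_{x \in X} \lVert x - C(x) \rVert^2,
\qquad
C(x) \;=\; \operatorname*{arg\,min}_{c \in C} \lVert x - c \rVert
```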

    K-means

    [223001580140] |The k-means algorithm is simple. [223001580150] |It works from a given set of centers, which have traditionally been chosen randomly from among the points in the set X. [223001580160] |The K-means loop iteratively performs two deterministic steps until (guaranteed) convergence: (1) pass over all data points and reassign them to their closest centers, then (2) recalculate the centers as the mean of the points assigned to them. [223001580170] |

    K-means++

    [223001580180] |Arthur and Vassilvitskii created a very simple probabilistic method for generating initial centers for K-means clustering of a set of points X: [223001580190] |
  • Sample the first center c[1] from a uniform distribution over X. [223001580200] |
  • For k = 2 to K, sample the k-th center c[k] from a multinomial over X in which a point x has probability θx proportional to D(x)², [223001580210] |where D(x) is defined as the distance from x to the closest existing center (both formulas are spelled out just below). [223001580220] |
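The two formulas for the sampling step were also images in the original; reconstructed from the paper’s D²-weighting, they are:

```latex
\theta_x \;=\; \frac{D(x)^2}{\sum_{x' \in X} D(x')^2},
\qquad
D(x) \;=\; \min_{1 \le j < k} \lVert x - c[j] \rVert
```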

    Empirical Results

    [223001580230] |The authors (and others since then) found both order-of-magnitude speedups and order-of-magnitude error reductions over random-initialized k-means on “real” data sets. [223001580240] |Like many of these speedups, the results improve relatively as the number of clusters increases. [223001580250] |For instance, they report a factor of 4 speedup and factor of 20 reduction in error for a tiny (4600 item) spam dataset with a few dozen features over 50 clusters. [223001580260] |Unfortunately, they don’t evaluate any large-dimensional problems, though they do some large-item problems (n=500,000). [223001580270] |

    Theoretical Guarantees

    [223001580280] |If the inputs are nicely separated, it had already been proved that this initialization had good behavior (the same initialization with a different analysis had been independently proposed by Ostrovsky et al. in a FOCS 2006 paper; more of that scientific zeitgeist). [223001580290] |If the points aren’t known to be well behaved, it’s expected to produce a result that’s O(log k) competitive with the best possible clustering. [223001580300] |Being O(f) competitive means that the expected error is bounded to be on the order of f times the optimal error. [223001580310] |For k-means++, the theorem proved by Arthur and Vassilvitskii bounds the expected k-means++ error for any set of data points X (reconstructed below), [223001580320] |where the expectation is over the probability p(C) of a given clustering (determined by the initial assignment) in the k-means++ algorithm. [223001580330] |
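The bound itself, reconstructed from the paper (with Err as defined above, K centers, and C_OPT an optimal set of centers), is:

```latex
\mathbb{E}\big[\mathrm{Err}(C, X)\big] \;\le\; 8\,(\ln K + 2)\;\mathrm{Err}(C_{\mathrm{OPT}}, X)
```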

    Diversity with Robustness to Outliers

    [223001580340] |In the past, folks (like me) often chose points successively to maximize their distance from existing clusters. [223001580350] |The problem with this approach is that it can be dominated by outliers. [223001580360] |Sanjoy Dasgupta (in a COLT 2003 paper) introduced a subsampled approach that used a subset of the initial points, which was far less prone to outliers, because there are presumably relatively few outliers. [223001580370] |Searching over all of X finds all of the outliers by definition. [223001580380] |The k-means++ algorithm shares with Dasgupta’s approach a kind of robustness in the face of outliers. [223001580390] |While outliers may have a large maximum distance (i.e. a high D(x) value), by definition, there aren’t very many of them. [223001580400] |So with sampling, the common points are balanced against the outliers, using their strength in numbers against the outliers’ overall strength. [223001580410] |
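To make the D²-weighted sampling concrete, here is a minimal sketch of the seeding step. It is only an illustration over dense double[] vectors with squared Euclidean distance, not LingPipe’s implementation:

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansPlusPlusSeeder {

    // Chooses numCenters initial centers from points by D^2-weighted sampling.
    public static double[][] seed(double[][] points, int numCenters, Random random) {
        double[][] centers = new double[numCenters][];
        centers[0] = points[random.nextInt(points.length)]; // first center: uniform

        double[] minSqDists = new double[points.length];
        Arrays.fill(minSqDists, Double.POSITIVE_INFINITY);

        for (int k = 1; k < numCenters; ++k) {
            // update D(x)^2 against the most recently added center
            double total = 0.0;
            for (int i = 0; i < points.length; ++i) {
                double sqDist = squaredDistance(points[i], centers[k - 1]);
                if (sqDist < minSqDists[i])
                    minSqDists[i] = sqDist;
                total += minSqDists[i];
            }
            // sample the next center with probability proportional to D(x)^2;
            // outliers have big D(x), but ordinary points win by sheer numbers
            double target = random.nextDouble() * total;
            int chosen = points.length - 1;
            double cumulative = 0.0;
            for (int i = 0; i < points.length; ++i) {
                cumulative += minSqDists[i];
                if (cumulative >= target) {
                    chosen = i;
                    break;
                }
            }
            centers[k] = points[chosen];
        }
        return centers;
    }

    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```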

    Is the Theoretical Analysis Practical?

    [223001580420] |Very nice analysis (that’s why it’s published in a high-end theory conference). [223001580430] |But even with a small number of clusters we’re still looking at being a factor of 20 away from the best solution (in the worst case). [223001580440] |I don’t know if this is like other theoretical error bounds, such as for SVM error, that tend to be beaten by the reality of typical problems (which usually have way less error than predicted by the bounds). [223001580450] |

    The Code

    [223001580460] |I have the rest of the day blocked out to implement, optimize and test k-means++ for LingPipe. [223001580470] |The paper authors themselves were also kind enough to provide: [223001580480] |
  • Arthur and Vassilvitskii’s C++ Test Code (zip file) [223001590010] |Joint Referential (Un)Certainty: The “Wallace and Gromit” Dilemma [223001590020] |Mitzi and I were talking and she said she loved …“corn chips”. [223001590030] |Hmm. [223001590040] |I was expecting “cheese”, which is what she’s usually eating in the other room at this time. [223001590050] |So I was primed to think “Wallace and Gromit”. [223001590060] |But I couldn’t remember which of the pair, Wallace or Gromit was which. [223001590070] |I remember the characters. [223001590080] |One’s a head-in-the-clouds human cheese lover, the other a clever pooch. [223001590090] |Back when I worked on semantics, this is just the kind of data that’d get me very excited. [223001590100] |Why? [223001590110] |Because it’s inconsistent with theories of reference, like the one I put forward in my semantics book. [223001590120] |My theory of plurals would have it that to understand a conjunction like “Wallace and Gromit”, you first identified a referent for each of the conjuncts, “Wallace” and “Gromit”, which could then be used to pick out a group. [223001590130] |In this case, I know the two members of the group, I just don’t know which had which name. [223001590140] |But maybe “Wallace and Gromit” as a whole is a name. [223001590150] |That is, maybe it’s a frozen expression, at least for me. [223001590160] |Lots of names are like that for me, like “Johnson and Johnson”. [223001590170] |Speaking of Johnson and Johnson, could they be just one person? [223001590180] |That is, could both conjuncts refer to the same person? [223001590190] |It probably mostly refers to a company as a fixed expression these days. [223001590200] |At one point, “Johnson and Johnson” would’ve caused confusion for named entity detectors (conjunction of two person names, or a combined company name; annotation standards like Genia’s Technical Term Annotation let you keep both). [223001590210] |This is a problem for us now in our high recall entity extraction with terms like “insulin receptor” —is that a reference to insulin (one thing), or the receptor (another thing)? [223001590220] |Mitzi’s a virtual font of referential uncertainty data tonight. [223001590230] |She said she knew that “Abbott and Costello” (the comedy team) had first names “Lou” and “Bud”, but she didn’t know which went with which last name (hint: that’s Lou Costello on the left and Bud Abbott on the right). [223001600010] |JDK 7 Twice as Fast* as JDK 6 for Arrays and Arithmetic [223001600020] |The 7th version of the Java Developer’s Kit (aka JDK 7) delivers quite a speed boost over JDK 6 array accesses. [223001600030] |For us, this is huge. [223001600040] |It’s like another year and a half of Moore’s law for free. [223001600050] |Only in software. [223001600060] |And you don’t even have to write multi-threaded code. [223001600070] |I’ve been profiling my new K-Means++ implementation for the next LingPipe release on some randomly generated data. [223001600080] |It’s basically a stress test for array gets, array sets, and simple multiply-add arithmetic. [223001600090] |Many LingPipe modules are like this at run-time: named entity, part-of-speech tagging, language modeling, LM-based classifiers, and much more. [223001600100] |While I was waiting for a run using JDK 1.6 to finish, I installed the following beta release of JDK 7: [223001600110] |You can get it, too: [223001600120] |
  • sun.com: JDK 1.7 Early Access Download Page [223001600130] |
  • java.net: JDK 1.7 Binaries (including Installer) [223001600140] |I believe much of the reason it’s faster is the work of these fellows: [223001600150] |
  • Würthinger, Thomas, Christian Wimmer, and Hanspeter Mössenböck. 2007. [223001600160] |Array Bounds Check Elimination for the Java HotSpot Client Compiler. [223001600170] |PPPJ. [223001600180] |Java’s always suffered relative to C in straight matrix multiplication because Java does range checks on every array access (set or get). [223001600190] |With some clever static and run-time analysis, Würthinger et al. are able to eliminate most of the array bounds checks. [223001600200] |They show on matrix benchmarks that this one improvement doubles the speed of the LU matrix factorization benchmark in the U.S. National Institute of Standards and Technology (NIST) benchmark suite SciMark 2, which, like our clustering algorithm, is basically just a stress test for array access and arithmetic. [223001600210] |So far, my tests have only been on a Thinkpad Z61P notebook running Windows Vista (64 bit) with an Intel Core 2 CPU (T2700; 2.0GHz), and 4GB of reasonably zippy memory. [223001600220] |I don’t know if the speedups will be as great for other OSes or for 32-bit JDKs. [223001600230] |I’m pretty excited about the new fork-join concurrency, too, as it’s just what we’ll need to parallelize the inner loops without too much work for us or the operating system. [223001600240] |*Update: 2:30 PM, 30 March 2009 JDK 7 is only about 15% faster than Sun’s JDK 6 on my quad Xeons (E5410, 2.33GHz) at work running the same code. [223001600250] |I’ll have to check the exact specs on both of my memory buses. [223001600260] |The notebook has surprisingly fast memory and the Xeon’s running ECC registered memory that I don’t think is quite as fast. [223001600270] |Update: 11:00 AM, 31 March 2009 Like other matrix algorithms, k-means clustering is extremely front-side-bus sensitive (the connection between memory and the CPU), because the bottleneck is often between memory and the CPU’s L2 cache. [223001600280] |Memory’s significantly slower than CPU these days. [223001600290] |The Intel dual quad-core Xeon E5410s have 12MB of L2 cache at 2.3GHz, whereas the Thinkpad Z61P’s Intel Core 2 Mobile T7200 has only 4MB of L2 cache at 2GHz. [223001600300] |The Core 2 has a 667 MHz front-side bus whereas the Xeon reports a 1333 MHz front-side bus (or is that just a difference in how the specs are reported?). [223001600310] |I actually don’t know what kind of memory’s in the workstation —I’ll have to crack it open and look. [223001600320] |I’ve got 4GB of RAM in the notebook, but the motherboard can only see 3GB (that is, it’s not Windows —the same thing happened when I installed Ubuntu on the notebook, and it’s a known design limitation in many motherboards); I have 16GB of RAM in the workstation and the motherboard can see all of it. [223001600330] |But it has two physical chips, each of which shares the memory, so the motherboard’s architecture’s very different. [223001600340] |There are so many confounding factors that I can’t tease apart what’s speeding up in JDK 7 so much on the notebook. [223001600350] |Anyway, go forth and test. [223001600360] |If you’re using a machine like my notebook to do number crunching, JDK 7 really is twice as fast as JDK 6 for matrix algorithms. [223001610010] |Bruce Lee’s Jeet Kune Do for Software [223001610020] |When you’re in the weeds in the middle of a software project, pair programming like there’s no tomorrow, ask yourself: [223001610030] |

    Who would you rather have watching your back?

    [223001610040] |But hang on, Kato’s a fictional character. [223001610050] |Why not ask the actor who brought the Green Hornet’s sidekick to life? [223001610060] |That is, [223001610070] |

    What would Bruce Lee do?

    [223001610080] |Sure, Bruce is agile. [223001610090] |He’s extreme. [223001610100] |But he is not locked into some process involving index cards or post-it notes and following a certified scrum master like a sheep. [223001610110] |No way. [223001610120] |Or rather, Bruce Lee’s way of no way, known in martial arts circles as: [223001610130] |

    Jeet Kune Do (截拳道)

    [223001610140] |Jeet Kune Do is the school of martial arts and philosophy founded by Bruce Lee. [223001610150] |The following are excerpts from the Wikipedia article on Jeet Kune Do, with my own headings with lessons for coding. [223001610160] |Code Defensively: “One of the theories of JKD is that a fighter should do whatever is necessary to defend himself, regardless of where the techniques come from.” [223001610170] |Be Pragmatic: “Additionally, JKD advocates that any practitioner be allowed to interpret techniques for themselves, and change them for their own purposes. …Lee felt the dynamic property of JKD was what enabled its practitioners to adapt to the constant changes and fluctuations of live combat.” [223001610180] |Be Proactive: “JKD practitioners also subscribe to the notion that the best defense is a strong offense, hence the principle of “Intercepting”.” [223001610190] |Be Economical: “JKD students are told to waste no time or movement. [223001610200] |When it comes to combat, JKD practitioners believe the simplest things work best. [223001610210] |Economy of motion is the principle by which JKD practitioners achieve “efficiency” described in the three parts of JKD. [223001610220] |Utilizing this principle conserves both energy and time. [223001610230] |Energy and time are two crucial components in a physical confrontation that often lead to success if employed efficiently.” [223001610240] |This last point, particularly, seems to be lost in heavy software processes. [223001610250] |Bruce Lee’s own description of Jeet Kune Do is embedded in its koan-like motto: [223001610260] |“Using no way as way” [223001610270] |Hunt and Thomas couldn’t have said it better themselves. [223001620010] |Reading Models (or other Resources) from the Classpath in Servlets (and Other Applications) [223001620020] |In some circumstances, such as deploying models as part of a servlet, it’s convenient to be able to read them from the classpath. [223001620030] |This is just one reason why we read models through input streams rather than files. [223001620040] |It’s actually really easy: [223001620050] |The class can be anything. [223001620060] |You can also use this.getClass() if it’s non-static. [223001620070] |You might also want to buffer that input for efficiency. [223001620080] |Then I just compile it (making sure LingPipe’s on the classpath): [223001620090] |then fire it up (making sure LingPipe, the newly compiled class, and the model’s directory are on the classpath): [223001620100] |As is conventional, the backslash (\) indicates that there’s no line break; thus the entire command above is on a single line. It’s critical here that the directory c:\carp\lingpipe\trunk\demos\models\ is on the classpath. [223001620110] |This is the directory where we store our models for distribution. [223001620120] |The point of putting them on the classpath is that we can now call them by name rather than by an explicit path. [223001620130] |The output is just what you’d expect: [223001620140] |For servlets, just put models in the same place as the other compiled classes or jars in the war (by convention, WEB-INF/lib) and you’re good to go. [223001620150] |Here’s the ant target to build the war file for our demos: [223001620160] |Then you can build a model in the servlet in the same way as in the standalone example above. 
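The code snippets from this post didn’t survive the reformatting here, so the following is a minimal sketch of the classpath-reading idiom described above. It is not the actual demo code; the class name is made up, and the resource name is whatever model file you put on the classpath:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.ObjectInputStream;

public class ClasspathModelReader {

    // Reads a serialized object (e.g. a compiled LingPipe model) from the
    // classpath rather than from the file system, buffering the input.
    public static Object read(String resourceName) throws Exception {
        InputStream in = ClasspathModelReader.class.getResourceAsStream("/" + resourceName);
        ObjectInputStream objIn = new ObjectInputStream(new BufferedInputStream(in));
        try {
            return objIn.readObject();
        } finally {
            objIn.close();
        }
    }
}
```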
[223001620170] |Here’s the actual method from our com.aliasi.demo.framework package (in $LINGPIPE/demos/generic), in which the following method is implemented in the abstract base class as AbstractTextDemo.readResource(String), which is used by the command-line, GUI, and servlet-based demos: [223001620180] |If I were doing this today, I'd throw the exceptions rather than converting them to runtime exceptions. [223001620190] |And, of course, there's no reason to throw an IOException then catch it and modify it; the test should just directly throw the right kind of exception. [223001630010] |Convergence is Relative: SGD vs. Pegasos, LibLinear, SVM^light, and SVM^perf [223001630020] |I still can’t quite get over how well stochastic gradient descent (SGD) works for the kinds of large scale, sparse convex optimization problems we find in natural language processing —SVMs, CRFs, logistic regression, etc. [223001630030] |I just finished reading this ICML paper that was recommended to me: [223001630040] |
  • Shalev-Schwartz, Shai, Yoram Singer, and Nathan Srebro. 2007. [223001630050] |Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. [223001630060] |ICML. [223001630070] |

    Basic Approach

    [223001630080] |It’s a nice paper that introduces a blocked gradient optimization approach for support vector machines (SVMs) with linear kernels. [223001630090] |(The issues for linear SVMs are almost the same as for logistic regression.) [223001630100] |Setting the blocks to a single item yields stochastic gradient, and setting them to the whole corpus gives you a traditional gradient approach. [223001630110] |The innovation is a step that handles L2 (aka Gaussian prior) regularization by projecting the coefficients down onto an L2-ball (this technique’s since also been used for L1 norms). [223001630120] |The benefit over standard SGD is that it automatically sets the pesky learning rate parameter (see below for an empirical discussion). [223001630130] |

    Convergence

    [223001630140] |The paper also introduces a theoretical convergence analysis that shows it converges in O(d/(λε)) where d is a bound on non-zero features in an example, λ is the regularization parameter, and ε is the error bound on estimation. [223001630150] |Note that like other SGD-like methods, the convergence is [223001630160] |
  • independent of the number of samples, and [223001630170] |
  • independent of total dimensionality [223001630180] |though it is dependent on d, the maximum number of non-zero features per instance. [223001630190] |

    Does it Work in Practice?

    [223001630200] |Shalev-Schwartz et al. then show that their approach crushes not only Thorsten Joachims’s SVM^light but also handily outperforms his linear-kernel optimization using cutting-planes in SVM^perf. [223001630210] |But what really struck me is Figure 2 on page 7, which compares the rate of convergence of Pegasos with that of a sparse stochastic gradient, as described in Tong Zhang’s earlier ICML paper: [223001630220] |
  • Zhang, Tong. 2004. [223001630230] |Solving large scale linear prediction problems using stochastic gradient descent algorithms. [223001630240] |ICML. [223001630250] |The problem is text classification of arXiv papers as to whether they’re about astro-physics or not, still a small problem with 100K features and 60K examples with high sparsity. [223001630260] |The lines in red are Tong’s SGD with different constant learning rates: 10, 1, 0.1, 0.01, 0.001 (presumably arranged left-to-right, with 10 being the non-converger). [223001630270] |The blue line is Pegasos. [223001630280] |The horizontal axis is log time, and the vertical axis represents the error function (for SVMs, that’s hinge loss on the training set; logistic regression uses log loss rather than hinge loss, but basically works the same way). [223001630290] |Shalev-Schwartz et al. conclude that the main advantage of their approach over SGD (which enjoys many of the same high-level advantages, like convergence independent of number of samples) is that you don’t have to set the learning rate. [223001630300] |This may seem to be nitpicking, but it’s not. [223001630310] |Those five runs of SGD represent quite a bit more processing time than one run of Pegasos. [223001630320] |Perhaps even more crucially, when you’re doing SGD, you don’t know what the final converged error rate is, so you can’t tell if you’ve gotten there without trying lots of rates. [223001630330] |

    Embarrassing Parallelism

    [223001630340] |The graph and our own experiences show that you don’t actually need to run SGD to convergence for the different learning rates. [223001630350] |You can see from the early part of the SGD learning rate curves which learning rate has the steepest initial descent. [223001630360] |That, combined with a reasonable annealing rate (lowering the learning rate over time), winds up being fast in practice. [223001630370] |But it requires babysitting by someone with a feel for the results. [223001630380] |Evaluating multiple learning rates can be easily parallelized —it’s what’s known as an “embarrassingly parallel” process. [223001630390] |Or just interleaved on one processor. [223001630400] |Running all 5 SGD runs up to 10^4 seconds each gets you to the result in half the time it takes to run Pegasos for 10^5 seconds. [223001630410] |

    Annealing for Final Convergence

    [223001630420] |Also note that annealing lets you get closer to the optimal answer than is shown in the graph above. [223001630430] |The reason the faster learning rate curves asymptote above the error of Pegasos is that near convergence, learning rates need to be lowered. [223001630440] |

    How Much Accuracy is Necessary?

    [223001630450] |For a general overview of the properties of SGD and convergence, I’d highly recommend Léon Bottou’s very readable slides from his 2007 NIPS tutorial: [223001630460] |
  • Bottou, Léon. 2007. [223001630470] |Learning with Large Scale Datasets. [223001630480] |NIPS Tutorial. [223001630490] |For me, the most significant piece of the presentation was this graph from page 31, comparing C.-J. Lin et al.’s LibLinear coordinate descent-based regularized SVM solver to SGD: [223001630500] |The horizontal axis is the accuracy of the solution relative to the optimal solution (the ε from above). [223001630510] |The top graph shows how held-out test error (also called generalization error) drops as estimation accuracy increases. [223001630520] |By the time we’re within 0.0001 of the minimum error, we’ve asymptoted on held-out accuracy. [223001630530] |More simply, we don’t need more than 4 or 5 decimal places of accuracy with respect to error (what LingPipe uses as a stopping condition for regression). [223001630540] |We’ve found exactly the same thing on the real problems (e.g. classifying clinical reports by disease and linkage between textual gene mentions and Entrez-Gene). [223001630550] |The bottom graph shows the time it takes the solvers on the vertical axis, with SGD being way faster for lower accuracies. [223001630560] |This is why I was so vexed when I wrote the first unit tests for our SGD logistic regression solver. [223001630570] |They were taking forever (i.e. 100 seconds) to converge to 7 decimal places of accuracy. [223001630580] |

    Convergence by Held-Out Performance

    [223001630590] |With training data you can use cross-validation to estimate convergence using held-out performance. [223001630600] |We monitor cross-validation error along with training set error along with the variance of the same. [223001630610] |Our cross-validating corpus is getting a heavy workout in conjunction with SGD for logistic regression. [223001630620] |

    The Bottom Line

    [223001630630] |The bottom-line of Bottou’s graph (and this blog post) is that SGD estimation time grows inversely proportionally to accuracy. [223001630640] |It’s really fast to get within a thousandth or ten thousandth of the optimal solution. [223001630650] |And getting further doesn’t help much, if at all, on held-out data. [223001640010] |Tokenizer Pattern Smackdown: Factories, Constructors, Singletons, Filters, and Adapters [223001640020] |We’re rolling out a suite of feature extraction utilities and tokenizer factories for the upcoming 3.8 release. [223001640030] |This has caused me (and those in my questioning range, namely Mike, Mitzi and Breck), considerable anguish. [223001640040] |I’m trying to be a good boy and follow Josh Bloch’s advice in Effective Java. [223001640050] |But like Paul Grice’s conversational maxims, which urge you to be both concise and unambiguous, Bloch’s individually reasonable sounding advice is rather contradictory when considered collectively. [223001640060] |

    The Tokenization Interfaces

    [223001640070] |We’re pretty much locked into our basic interfaces and base classes, which for the point of this discussion may be thought of as: [223001640080] |

    Chaining Tokenizers with Filters

    [223001640090] |Tokenizers like to chain. [223001640100] |Typically, you start with a base tokenizer, like our Indo-European tokenizer, then filter it by lowercasing tokens, converting to stems, removing stopwords, and so on. [223001640110] |This is a classic instance of the filter pattern. [223001640120] |Way back in version 1.0, tokenizers were often external to models. [223001640130] |This is still the case, for instance, with HMMs, which take in an array of strings (representing tokens) and return an array of strings (representing tags, or hidden states). [223001640140] |It’s then the client’s job to make sure the tokens supplied at run time match those used for training. [223001640150] |The natural thing to do was to use a filter pattern to implement. [223001640160] |With simple tokenizers and filtered tokenizers, we’d do things like: [223001640170] |Looks just like java.io stream, reader and writers in action. [223001640180] |That is, we construct an instance of tokenizer using a constructor (in this case, IndoEuropeanTokenizer(cs,start,len)). [223001640190] |Then we apply a bunch of filters through their constructors (e.g. LowerCaseFilterTokenizer(ieTok)). [223001640200] |Each filter holds onto its contained tokenizer and modifies its output in some way. [223001640210] |
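Since the code samples didn’t survive here either, this is a rough reconstruction of the constructor-chaining example; the two class names are the ones mentioned in the text (package assumed to be com.aliasi.tokenizer), and the token-iteration idiom may differ slightly by LingPipe version:

```java
import com.aliasi.tokenizer.IndoEuropeanTokenizer;
import com.aliasi.tokenizer.LowerCaseFilterTokenizer;
import com.aliasi.tokenizer.Tokenizer;

public class FilterChainExample {
    public static void main(String[] args) {
        char[] cs = "The Quick Brown Fox".toCharArray();

        // base tokenizer over a character array slice
        Tokenizer ieTok = new IndoEuropeanTokenizer(cs, 0, cs.length);

        // filter it by lower-casing tokens; stemming and stoplisting
        // filters wrap it in exactly the same way
        Tokenizer tok = new LowerCaseFilterTokenizer(ieTok);

        String token;
        while ((token = tok.nextToken()) != null)
            System.out.println(token);
    }
}
```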

    Type Raising: Factories

    [223001640220] |In order to supply tokenization capabilities to models, we need a way to create tokenizers from strings, which is where the tokenizer factory comes in. [223001640230] |This interface is easy but the implementation is harder. [223001640240] |To apply the tokenizer-level filters, we need to use: [223001640250] |Oh, and you probably want to make that implement java.io.Serializable, which means you need a static nested class or new class file on which to hang the declaration. [223001640260] |Writing robust serialization is a pain (LingPipe’s util.AbstractExternalizable helps immensely in following Josh Bloch’s recommended practices). [223001640270] |
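Here’s a sketch of what such a factory looks like, assuming the interface’s single tokenizer(char[],int,int) method; the real thing would also implement java.io.Serializable as noted:

```java
import com.aliasi.tokenizer.IndoEuropeanTokenizer;
import com.aliasi.tokenizer.LowerCaseFilterTokenizer;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Applies the tokenizer-level filters inside a factory so that models can
// create tokenizers directly from text. Serialization is omitted for brevity.
public class LowerCasedIndoEuropeanTokenizerFactory implements TokenizerFactory {
    public Tokenizer tokenizer(char[] cs, int start, int length) {
        Tokenizer ieTok = new IndoEuropeanTokenizer(cs, start, length);
        return new LowerCaseFilterTokenizer(ieTok);
    }
}
```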

    Filtering Tokenizer Factories

    [223001640280] |This was all getting to be too much. [223001640290] |What we really need is something to filter tokenizer factories. [223001640300] |What we eventually settled on was the same-old pattern, reasoning that it’d be much more familiar to programmers because of its ubiquity in the Java libraries (like java.io and org.xml.sax). [223001640310] |What we now have is something like this: [223001640320] |
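In code, the factory-level filter pattern looks roughly like this; LowerCaseTokenizerFactoryFilter is the filter name used later in this post, and the shipped 3.8 class names may differ slightly:

```java
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.LowerCaseTokenizerFactoryFilter; // name as used in this post
import com.aliasi.tokenizer.TokenizerFactory;

public class FactoryFilterExample {
    public static TokenizerFactory factory() {
        TokenizerFactory baseFactory = new IndoEuropeanTokenizerFactory();
        // wrap the base factory the same way java.io wraps streams
        return new LowerCaseTokenizerFactoryFilter(baseFactory);
    }
}
```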

    Singleton Base Factories

    [223001640330] |Typically, a tokenizer factory is stateless and thread safe, and hence can be modeled using the singleton pattern, meaning a single instance that is used wherever an instance of that tokenizer factory is needed. [223001640340] |So I’ve gone through and deprecated the constructors (e.g. new IndoEuropeanTokenizerFactory) in favor of static final instances (e.g. IndoEuropeanTokenizerFactory.INSTANCE), which is by far the safest way to implement singletons in Java. [223001640350] |Java’s class loader even provides lazy instantiation (the singleton’s created when the class is loaded, and being static and final, is always safely published). [223001640360] |It’s an easy change for the clients, with the first-line nullary constructor call being replaced by the instance: [223001640370] |
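The change is a one-liner on the client side, roughly:

```java
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class SingletonExample {
    // Before (now deprecated): construct a fresh, stateless factory each time.
    TokenizerFactory deprecatedStyle = new IndoEuropeanTokenizerFactory();

    // After: reuse the single, safely published instance.
    TokenizerFactory singletonStyle = IndoEuropeanTokenizerFactory.INSTANCE;
}
```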

    Singleton Tokenizer Factory Filters?

    [223001640380] |So why not do the same thing with the filters? [223001640390] |That is, define an interface such as this one: [223001640400] |They’re easy to implement, too, given that we’ve already defined the constructor-based filter LowerCaseTokenizerFactoryFilter. [223001640410] |Here we’ve just gone ahead and assigned them to a singleton, as we don’t need more than one instance of the filter: [223001640420] |We can even collect the singleton tokenizer factories and filters into a big utility class and privatize the classes that implement them. [223001640430] |Then what we’re left with is something like this: [223001640440] |In the end, we decided against it, as cool as that is, and as much as it follows all of Josh Bloch’s advice to a tee. [223001640450] |By hiding all the classes, you cut down on cognitive load. [223001640460] |Unfortunately, you also cut down on findability and make one big class with tons of doc and singletons in it. [223001640470] |You do get the advantage of being able to return whatever types you want and retain freedom to change implementations (modulo serializability backward compatibility) at will. [223001640480] |

    Let’s Not Innovate on Style

    [223001640490] |But in the end, we decided it was so non-standard that it’d just be confusing to clients. [223001640500] |So we went with the more familiar constructor-based filter pattern. [223001640510] |Although I did make the base factories singletons. [223001640520] |

    Adapters to the Rescue

    [223001640530] |What I did need to do was add adapter implementations. [223001640540] |Most of the tokenizer factory filter implementations really only modify the tokenizer, and then, typically they only modify the tokens and whitespaces individually. [223001640550] |So I broke out ye olde adapter pattern, first for tokenizers: [223001640560] |and then one more level to filter the tokenizer itself (in simplified form; the real implementation has to deal with deletions of tokens, too, in order to generalize to stop lists): [223001640570] |It makes implementing the filters a piece of cake. [223001640580] |The basic idea is to adapt a token modifier up to a tokenizer modifier and then a tokenizer factory modifier. [223001640590] |
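Here’s a simplified sketch of the adapter idea, under the assumption that nextToken() is the only method a Tokenizer subclass must implement; the adapter class name is made up, and the real 3.8 classes also handle whitespace and token deletion for stop lists:

```java
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

// Adapts a per-token modifier up to a tokenizer, and the tokenizer up to a
// tokenizer factory. Subclasses just override modifyToken().
public abstract class ModifyTokenAdapterFactory implements TokenizerFactory {

    private final TokenizerFactory mBaseFactory;

    public ModifyTokenAdapterFactory(TokenizerFactory baseFactory) {
        mBaseFactory = baseFactory;
    }

    // the one method subclasses implement, e.g. lowercasing or stemming
    public abstract String modifyToken(String token);

    public Tokenizer tokenizer(char[] cs, int start, int length) {
        final Tokenizer baseTokenizer = mBaseFactory.tokenizer(cs, start, length);
        return new Tokenizer() {
            public String nextToken() {
                String token = baseTokenizer.nextToken();
                return token == null ? null : modifyToken(token);
            }
        };
    }
}
```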

    Feature Extractors: More of the Same

    [223001640600] |Most of this discussion applies to feature extractors, too, which we’ve modified in the same way as tokenizers for 3.8. [223001640610] |In fact, when you create feature extractors on top of tokenizers, it looks even more like Java I/O, because you move from tokenizer factories to feature extractors the same way Java I/O moves from streams to readers/writers. [223001650010] |Backward-Compatible Java (De)Serialization [223001650020] |Time to ‘fess up. [223001650030] |Sometimes when you look at code you wrote in the past, it looks very ugly. [223001650040] |I took a very ill-advised expedient in serializing tokenized language models in the original implementation. [223001650050] |

    Current TokenizerFactory Serialization

    [223001650060] |I simply wrote out their fully qualified class name: [223001650070] |Then I read it back in the obvious way: [223001650080] |where Reflection.newInstance(String) is our (about to be deprecated) utility to create an object from the name of its class using the no-arg constructor. [223001650090] |
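The two elided snippets amounted to something like the following sketch (using the util.Reflection helper mentioned above):

```java
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.Reflection;

class ClassNameSerialization {

    // Writing: just record the factory's fully qualified class name.
    static void writeFactory(ObjectOutput out, TokenizerFactory factory)
            throws IOException {
        out.writeUTF(factory.getClass().getName());
    }

    // Reading: reconstruct the factory reflectively, which silently requires
    // a no-arg constructor; that requirement is the root of the problem below.
    static TokenizerFactory readFactory(ObjectInput in) throws IOException {
        String className = in.readUTF();
        return (TokenizerFactory) Reflection.newInstance(className);
    }
}
```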

    Pure Unadulterated Laziness

    [223001650100] |What was I thinking? [223001650110] |The sad truth is that I was too lazy to write all those serializers. [223001650120] |Always think twice before implementing something because it’s easy. [223001650130] |The whole util.Reflection was just such a hack; the original reflection implementations threw all those exceptions for a reason! [223001650140] |

    The Real Problem

    [223001650150] |The problem here is that folks ran into run-time issues with their factories (and ours) not having no-arg constructors. [223001650160] |For instance, suppose we have a factory that requires a string and integer to construct: [223001650170] |

    A Hacked Solution

    [223001650180] |In practice, when you’re building a specific model, there’s a fixed value for the constructor args. [223001650190] |So you can define a quick adapter class: [223001650200] |
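For instance, the adapter is just a subclass with a no-arg constructor that bakes in the fixed arguments; the class and argument names here are made up for illustration:

```java
// Extends a hypothetical factory that requires a String and an int to construct.
public class FixedConfigTokenizerFactory extends ConfigurableTokenizerFactory {
    public FixedConfigTokenizerFactory() {
        super("whitespace", 3); // the fixed values this model always uses
    }
}
```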

    Refactoring the Right Way

    [223001650210] |So I want to now refactor to serialize the factory rather than writing the name of its class. [223001650220] |But how to handle backward compatibility? [223001650230] |At least in this case, it wasn’t too hard. [223001650240] |I use the fact that I used to write out a string to write out a “magic value” as a flag, in this case the empty string, because it’s short and it can’t be a class name. [223001650250] |To read, I just first read the string, and if it’s empty, read the object: [223001650260] |There you have it, backward compatibility. [223001660010] |What Should Tokenizers Do? [223001660020] |I’ve written before on the curse of intelligent tokenization, where I bemoaned the context-sensitivity of tokenizers like the Penn Treebank’s, which is sensitive to ends of sentences, and Penn’s BioIE’s and OpenNLP’s tokens, both of which can be extracted by statistical decoders. [223001660030] |

    Question of the Day

    [223001660040] |Should tokenizers modify tokens or should they just return slices of their input? [223001660050] |I think there’s a lot to be said for both views, both in terms of conceptual clarity and efficiency. [223001660060] |We chose to allow the former in LingPipe, but I’m not sure I’d make the same choice if I had a do-over. [223001660070] |I’m confident in saying that tokens should link to a span of text, if not be identified with a span of text. [223001660080] |Otherwise, you can’t do things like highlighting in applications very easily. [223001660090] |LingPipe only gets halfway there with a start position; if there’s stemming, etc. you can lose the end position. [223001660100] |Maybe I’ll go and add that; luckily I made Tokenizer an abstract base class, so I can add (implemented) methods without compromising backward compatibility. [223001660110] |

    Java’s Tokenizers

    [223001660120] |

    Java’s Stream Tokenizer

    [223001660130] |At one extreme are the familiar programming language tokenizers like Lex, which is the model followed by Java’s own java.io.StreamTokenizer, as originally written by the master himself, James Gosling. [223001660140] |First note that the input is completely streaming in the form of a java.io.Reader; this means it can really scale to lots of text with little memory. [223001660150] |The method nextToken() returns an integer type of the next token (i.e. number, word, end-of-line, and end-of-stream in the stream tokenizer) and toString() gives you a string representation. [223001660160] |More modern Java would convert that integer type to an enum, and no one would dream of overloading toString() that way. [223001660170] |The to-string method actually ends with the statement return "Token[" + ret + "], line " + LINENO;; as you can see, no one expected you to actually use these as tokens in an NLP sense. [223001660180] |This way of doing things is the most efficient way possible if you’re not going to modify any underlying tokens and just return slices. [223001660190] |You can easily generate Java strings, UIMA offset representations, or OpenPipeline-type tokens. [223001660200] |

    Java’s String Tokenizer

    [223001660210] |If you then consider Java’s StringTokenizer (which no one took credit for in the code!), you’ll see that they follow a pattern like Chris Cleveland used for OpenPipeline (and commented about in our last blog entry about tokenization). [223001660220] |That is, they take an entire representation of a string rather than streaming, then return substrings of it as tokens, which means no more char array allocation. [223001660230] |They don’t need to keep all of the strings —you can free up tokens for garbage collection after using them. [223001660240] |

    Tokenization for NLP

    [223001660250] |

    LingPipe and Lucene

    [223001660260] |Back in version 1.0, LingPipe only needed tokenizers to convert strings into tokens for named-entity detectors and soon after, general HMMs for part-of-speech taggers. [223001660270] |These models are defined as mappings from sequences of symbols (e.g. tokens in natural language or base pairs in genomics) to sequences of labels (or tags). [223001660280] |The models are neither position-sensitive nor whitespace sensitive. [223001660290] |LingPipe, like Apache Lucene, set up a base tokenizer that essentially iterates tokens, along with filters. [223001660300] |Lucene’s Tokenizer iterates tokens, and their TokenFilter modifes these streams. [223001660310] |One difference is that Lucene takes Reader instances as input for tokenization, like Java’s stream tokenizers, where LingPipe operates over character array slices, like Java’s string tokenizer. [223001660320] |With a do-over, I’d have tokenizers accept CharSequence instances rather than character array slices. [223001660330] |But it’s an interface, so I can’t do it without sacrificing backward compatibility. [223001660340] |This pattern makes it particularly easy to define operations like case normalization, stemming, stoplisting, synonym expansion, etc. etc. [223001660350] |But should we? [223001660360] |

    Mallet

    [223001660370] |Mallet, on the other hand, reifies complete tokenizations in a cc.mallet.extract.Tokenization object, which is a document plus set of spans. [223001660380] |This is what I was talking about in the comments as representing an entire tokenization; they even named it properly. [223001660390] |

    LingPipe Chunking

    [223001660400] |A Mallet tokenization shares some degree of similarity with LingPipe’s chunkings. [223001660410] |A chunking is a character sequence plus a set of chunks, where a chunk has a start position, end position, string-based tag, and double-based score. [223001660420] |They can overlap, there can be multiple chunks with the same spans (but not the same span and type), etc. [223001660430] |It’d be easy to implement an adapter that would implement a chunker (function taking char sequences to chunkings) based on a tokenizer. [223001660440] |There are, in fact, reg-ex based chunkers in the chunk package and reg-ex based tokenizer factories in the tokenizer package. [223001660450] |

    UIMA

    [223001660460] |UIMA‘s common analysis system (CAS) encodings tend to do the same thing, but they’re very flexible, so you could build just about anything you want. [223001660470] |So to filter the output of a tokenizer, you get the whole tokenization to work with. [223001660480] |This gives you the utmost in flexibility, though it doesn’t necessarily track the history of derivations. [223001660490] |Of course, you can represent just about anything in CAS, so you could track this kind of thing if you want, and do it very flexibly, as in pointing to the three base tokens that produced a token trigram, or pointing to dictionary resources online used for synonymy, or pointing to a whole analysis tree for stemming. [223001660500] |Once you start thinking like this, tokenization becomes a very rich process. [223001660510] |

    MinorThird

    [223001660520] |MinorThird defines its edu.cmu.minorthird.text.Tokenizer interface to map documents (or strings) to arrays of TextToken objects. [223001660530] |These token objects hold a pointer to the underlying string or document data and represent token positions as offsets. [223001660540] |

    OpenNLP

    [223001660550] |OpenNLP defines its opennlp.tools.tokenize.Tokenizer interface to map an input string to an array of tokens represented as strings or as span objects. [223001660560] |An OpenNLP span contains a start and end offset, as well as a method, getCoveredText(), to return the string yield. [223001660570] |But then I didn’t see any typing information other than the covered text or any (public) extensions. [223001660580] |There were lots of useful utility methods which we’ve packed (some of) into static methods to consider overlaps, containment, begins and ends. [223001660590] |It also implements Comparable (the generic form would be Comparable<Span>), which our chunks don’t (but there are static utility comparators). [223001660600] |

    OpenPipeline

    [223001660610] |Dieselpoint, Inc.’s OpenPipeline follows a hybrid model that uses Java string reader style representations of tokens, but then allows them to pipeline through filters like Mallet/UIMA. [223001660620] |I had to download the API to get at the javadoc —I couldn’t find it online. [223001660630] |[Update: Chris uploaded the Javadocs, so you can get them here: [223001660640] |
  • OpenPipeline Javadoc [223001660650] |I believe what Chris Cleveland, the architect of OpenPipeline, was suggesting was to do a simple tokenization as spans, then everything on top of that is a different kind of analysis. [223001660660] |I think this makes sense as an answer to the question of the day. [223001660670] |

    And More…

    [223001660680] |Please feel free to add other models of tokenization if I’ve missed them. [223001670010] |LingPipe 3.8.0 Released [223001670020] |LingPipe 3.8.0 is now available at: [223001670030] |
  • LingPipe Home Page
[223001670040] |This is a big release —the details are available on the home page linked above. [223001670050] |The big news is that 3.8 is likely to be the last backward-compatible release before migration to 4.0, which will remove methods, classes, and constants deprecated in 3.8. [223001670060] |Please let us know if you have questions or comments.

    Why not File Formats?

    [223001680040] |The usual solution, as adopted by almost all of the popular machine learning toolkits, is to define a structured file format and read training and evaluation data from such files. [223001680050] |It’s a very unix-y way of working on problems. [223001680060] |The problem is that it sweeps under the rug all the problems of raw data parsing and feature extraction. [223001680070] |Feature extraction happens outside of the tools themselves, making the process difficult to instrument for debugging in conjunction with the models (e.g. you have to keep track of which input line corresponded to which classification instance to put the system’s answers back together with the data you care about). [223001680080] |It’s also rather inefficient to convert to a line and string-based representation when what you really want is feature vectors. [223001680090] |But then that’s not our major motivation, or we would’ve gone even further toward raw vector internals. [223001680100] |

    Programmatic Access or Bust

    [223001680110] |The unix-y style of working tends to encourage the application of a succession of scripts to on-disk representations. [223001680120] |The problem is that the sequence of steps used to create the final representations is often lost during production due to the dynamic nature of scripting languages and the problems that tend to crop up during munging. [223001680130] |You can solve this problem for file-based systems by insisting the file-based representation is generated by a single script. [223001680140] |I’d encourage you to do so if you’re working in this world. [223001680150] |

    The LingPipe Way

    [223001680160] |We prefer to have everything built into Java. [223001680170] |Partly because I’m too lazy to learn other languages beyond a reading knowledge, and partly because it allows us to bring a unified toolkit to bear on the problem. [223001680180] |Sure, Java’s a pain to wheel out for small problems (you are scripting compilation and run, I hope), but it’s just right for larger, more complex problems with lots of moving parts, such as when comparing different approaches to feature extraction. [223001680190] |To support code-based development of data sets, we’ve introduced object-oriented representations of data handlers, data parsers, and corpora. [223001680200] |The model we followed for handlers and parsers is that of SAX parsing. [223001680210] |Corpora were introduced later to represent whole data sets. [223001680220] |

    Handler and Handlers

    [223001680230] |There’s a marker interface corpus.Handler, which all data handlers extend. [223001680240] |It’s modeled on SAX’s ContentHandler interface. [223001680250] |There are specific handler interfaces that extend it, such as: [223001680260] |Basically, they all accept streaming content as specified by their methods. [223001680270] |Our online trainable models implement these methods. [223001680280] |For instance, trainable HMMs implement the TagHandler interface, trainable chunkers implement the ChunkingHandler interface, and trainable classifiers like naive Bayes and k-nearest neighbors implement the ClassificationHandler interface. [223001680290] |This allows different models to accept the same form of input. [223001680300] |It’s like defining a file format, only in code. [223001680310] |
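The handler interfaces referred to above have roughly the following shapes; the signatures are paraphrased from the discussion later in this post, so check the corpus package Javadoc for the exact generic bounds:

```java
// marker interface, playing the role SAX's ContentHandler plays for XML
interface Handler { }

interface TextHandler extends Handler {
    void handle(char[] cs, int start, int length);
}

interface TagHandler extends Handler {
    void handle(String[] tokens, String[] whitespaces, String[] tags);
}

interface ClassificationHandler<E, C> extends Handler {
    void handle(E input, C classification);
}
```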

    Parsers

    [223001680320] |Parsers are modeled on SAX’s XMLReader interface. [223001680330] |Specifically, the abstract base class is defined as: [223001680340] |The generic parameter specifies the type of handler that receives events from the parser. [223001680350] |Just as with an XML reader, you can set and get the handler from a parser. [223001680360] |Then, when one of the parse methods is called, events extracted from the specified input (file, URL, InputSource or string) are sent to the handler. [223001680370] |Parsers are data-source specific. [223001680380] |So we have parsers for the Brown corpus, Penn Treebank, Reuters classification corpus, CoNLL tagging corpora, MEDLINE, and so on. [223001680390] |In fact, you need one for each data format. [223001680400] |Here’s an example of training our most accurate chunker using MUC-format named-entity data in two files: [223001680410] |

    One Parser (per app) to Rule them All

    [223001680420] |An alternative would be to provide a single format for each type of data, then require external data sources to be converted to this format externally. [223001680430] |You can still do that with LingPipe, and in fact, many of our users find it easier to munge data than implement parsers. [223001680440] |

    Corpora

    [223001680450] |We ran into a problem with parsers and handlers when we started building models such as latent Dirichlet allocation clustering, EM semi-supervised classifier training, and logistic regression model fitting. [223001680460] |In these cases, we needed the entire corpus to be provided to the estimator at once in order to do batch processing (this isn’t strictly required for these models, but it’s how they’re typically implemented in practice). [223001680470] |So we introduced another abstract base class representing an entire corpus of data: [223001680480] |The pattern here is that a corpus sends all of the information in a corpus to a single handler. [223001680490] |Note that this is divided by testing and training data. [223001680500] |An alternative would’ve been to define a single method visit(H), then implement two separate corpora for testing and training data. [223001680510] |We lean very heavily on our XValidatingClassificationCorpus implementation, which implements Handler to collect a bunch of data, which it then divides into training and test splits using the usual cross-validation logic with a configurable number of folds. [223001680520] |It doesn’t even require a parser, because it accumulates data programmatically. [223001680530] |Here’s an example of how it works, supposing we want to evaluate naive Bayes on a bunch of data encoded in multiple files in SVMlight format. [223001680540] |You can use: [223001680550] |An alternative would be to require data to be munged into a single file. [223001680560] |That way, a parser could handle it. [223001680570] |Then something like cross-validation would happen by creating separate training and testing corpora to send to a parser. [223001680580] |
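For reference, the corpus base class described at the top of this section boils down to something like this sketch (method names paraphrased from the discussion; the single-method visit(H) alternative mentioned above would collapse the two):

```java
import java.io.IOException;

abstract class Corpus<H extends Handler> {

    // send every event in the training section of the corpus to the handler
    public abstract void visitTrain(H handler) throws IOException;

    // send every event in the test section of the corpus to the handler
    public abstract void visitTest(H handler) throws IOException;
}
```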

    Simplifying those Handlers?

    [223001680590] |I made a conscious decision to follow the SAX reader/handler paradigm. [223001680600] |That meant that it’s possible for handlers to define all sorts of content-handling methods. [223001680610] |SAX’s content handler has eleven. [223001680620] |The only time I ever used this flexibility is for MEDLINE; there events can be either add citation or delete citation events coming from update files. [223001680630] |I also used multiple arguments. [223001680640] |Tag handlers have a handle method with three arguments, text handlers with three arguments, classification handlers with two objects, etc. [223001680650] |But what if we replaced handler with: [223001680660] |That is, we require a single object to be handled. [223001680670] |That means we need to convert the objects of the handlers into first-class objects. [223001680680] |This is easy with text —just replace (char[],int,int) with CharSequence. [223001680690] |Similarly we could replace (E,C) in the classifier handler with a new interface Classified representing an object of type E classified as a C. [223001680700] |The big problem is MEDLINE. [223001680710] |Mike suggested the workaround of a single command-type object, with two subtypes, delete and citation, but some other kind of compound container could be used. [223001680720] |We actually introduced this interface as ObjectHandler, which extends Handler, and extended all the single-argument handlers to implement it. [223001680730] |With the data handled being represented as a single object, we could change the parser specification and rename to: [223001680740] |Corpora undergo the same change (and also a change of name): [223001680750] |It’d be pretty easy; just deprecate all the current handlers other than ObjectHandler, create object representations of all handler arguments, then replace parsers and corpora with readers and data sets in all the classes that use them. [223001680760] |I could even do it in two steps by deprecating the old interfaces while introducing the new ones. [223001680770] |It doesn’t seem clean enough to be worth the effort, but I’m curious what everyone else thinks. [223001680780] |Of course, only the API nuts like me are still following this discussion, so the answers may be biased. [223001680790] |
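The proposed replacement boils down to a one-method interface; the generic parameter didn’t survive the formatting above, so this is a sketch:

```java
interface ObjectHandler<E> extends Handler {
    void handle(E e);
}
```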

    Corpora and Parsers as Iterators

    [223001680800] |If we move to a pure object handling setup, it’d be easy (API-wise; implementation’s a pain) to extend corpora to implement iterators over data objects (one each for testing and training). [223001680810] |Parsers could return iterators given inputs. [223001680820] |This is a much more familiar pattern for handling data, and would work for all of our online trainable models. [223001680830] |I’m about to roll out a truly online logistic regression estimator in the next LingPipe; at that point, running batch-style will require the user to send in the same data multiple times. [223001680840] |It’s all just a matter of control. [223001690010] |Drunk Bavarian Corpus [223001690020] |I was amused by ELRA‘s release of: [223001690030] |
  • Bavarian Archive for Speech Signals’ Alcohol Language Corpus
  • [223001690040] |The description says “ALC contains recordings of speakers that are either intoxicated or sober.” [223001690050] |Is there a third kind of person? [223001690060] |I assume they’re serious, because none of the posts are dated April 1. [223001690070] |The price is 1000 euros give or take plus VAT; ELRA jitters their prices, so the real price is 1020 euros plus VAT, though you might see if they’ll take 1019.99. [223001690080] |There’s even a sample labeled recording, headset, intoxicated. [223001690090] |Here’s the README, including ASCII art logo. [223001690100] |Or you can read the paper:
  • Schiel, F., Heinrich, Chr., Barfüßer, S., Gilg, Th. [223001690120] |(2008). [223001690130] |ALC – Alcohol Language Corpus. [223001690140] |In: Proc. of LREC 2008, Marrakesh, Morocco.
  • [223001690150] |What’s next, the Drunk American College Student Corpus from LDC? [223001700010] |Goyal, Daumé III, Venkatasubramanian (2009) Streaming for Large Scale NLP: Language Modeling [223001700020] |Hal, of the NLPers blog, was just in town for a talk at Columbia on hierarchical Bayesian models (which was awesome —more on the zeitgeist of hierarchical Bayesian models in NLP in future posts). [223001700030] |So I thought I’d catch up on his recent work, which includes this paper for the upcoming NAACL meeting in Colorado this summer: [223001700040] |
  • Goyal, Amit, Hal Daumé III, and Suresh Venkatasubramanian. 2009. [223001700050] |Streaming for Large Scale NLP: Language Modeling. [223001700060] |NAACL.
  • [223001700070] |This paper imports a nice, simple algorithm from [223001700080] |
  • Manku, Gurmeet Singh and Rajeev Motwani. 2002. [223001700090] |Approximate frequency counts over data streams. [223001700100] |VLDB.
  • [223001700110] |and evaluates it on some NLP data. [223001700120] |Goyal et al.’s a prime instance of a least publishable unit (LPU). [223001700130] |I’m quite surprised it was accepted; I would’ve expected to see the contribution of this paper as an aside in a machine translation paper. [223001700140] |But then I wouldn’t have found it. [223001700150] |I really like small publishable units, as long as they’re not redundant (in a field; I also don’t mind importing results from other fields and do it all the time myself). [223001700160] |Another advantage is that you can go into depth on one thing, which is really critical if you’re trying to implement (or decide to implement) something yourself. [223001700170] |The motivation is streaming data that requires (approximate) word n-gram counts because of its size. [223001700180] |Having said that, they evaluated only on relatively small data sets (i.e. word 5-grams over fractions of English Gigaword under 1G words total). [223001700190] |But there are lots of ways to deal with data this size. [223001700200] |We managed to build character 8-grams in memory on even more data without pruning. [223001700210] |You could also perform multiple passes, each storing a shard of the data, as is common with search engines, and then merge them —that’s what LingPipe does to scale beyond what’ll fit in memory, following the Lucene search index model. [223001700220] |You can speed this up with pre-passes for short n-grams to prune out any extensions of low-count low-order n-grams. [223001700230] |[Revised:] Goyal et al. cite Alex Franz and Thorsten Brants’s Google n-gram effort, and it’d be interesting to see how close you could get to that effort on a single machine. [223001700240] |Franz and Brants ran over 1000 times more data (over 1T words) and collected many more n-grams (min count of 40). [223001700250] |You simply couldn’t have produced the Google scale corpus using the Goyal et al. methods as they described their implementation on a regular desktop workstation (Goyal et al. restricted themselves to 8GB of RAM), because it produced hundreds of millions of n-grams. [223001700260] |Goyal et al. took a threshold of 100 occurrences in 1G instances, which is a whole lot smaller (i.e. requiring 2000 times the relative frequency to keep a word). [223001700270] |Presumably they could’ve collected that corpus (at frequency cutoff 80,000), but how long would it’ve taken to run? [223001700280] |Is it even the right question to ask of an experimental paper? [223001700290] |In a nutshell, the Manku and Motwani algorithm, as presented by Goyal et al., keeps counts, error estimates, and periodically prunes the counts. [223001700300] |It assumes you know the number N of tokens in the corpus ahead of time and have set thresholds ε and s such that ε << s. When it finishes, the algorithm guarantees that:
  • the algorithm returns all words with actual counts > sN, [223001700360] |
  • the algorithm returns no word with actual counts < (s – ε)N, and [223001700370] |
  • returned counts are less than or equal to actual counts by an amount at most εN. [223001700380] |The harder bit’s the (presumably amortized; I didn’t read the original paper) analysis required to establish its space bound: [223001700390] |
  • The algorithm requires at most O(1/ε log (εN)) space. [223001700400] |Goyal et al. actually retain all counts (that is, they effectively set s=ε). [223001700410] |This admits large relative frequency errors, because a word with actual count T = εN could have a returned count of 1 and another with actual count T could have a returned count of T. In NLP, the appearance of a term is often much more important than the counts, so I appreciate their motivation. [223001700420] |When s >> ε, as the algorithm developers intended, the relative frequency errors are bounded much more tightly because the min count sN is much greater than the max error εN. [223001700430] |You could actually run blocks of 1/ε tokens and find your N at the end if you didn’t know N ahead of time, as would be the case in a truly streaming application. [223001700440] |The empirical evals show exactly what you’d expect —the counts derived using this approach have slightly inferior performance compared to storing full counts and pruning. [223001700450] |They did crude implementations with hash tables for the n-grams rather than tries or PAT-tries, which is suboptimal in space and in time. [223001700460] |It’s what Joel Spolsky called Shlemiel the painter’s algorithm, the usual source of which is the use of += on strings in Java. [223001700470] |In technical terms, it’s a quadratic algorithm that keeps repeating work (1+2+3+…+N = O(N²) steps for a problem of size N) where there’s a simple linear algorithm. [223001700480] |LingPipe updates n-gram counts using the standard linear algorithms which are well known from the suffix tree literature. [223001700490] |But I think the point of Goyal et al. was to evaluate the technique, not build a super-scalable implementation. [223001700500] |It’s not quite possible to use our sequence counters to implement this algorithm with LingPipe. [223001700510] |We don’t store the error bounds. [223001700520] |If you just compute each block separately and prune to minimum count 2 in each block, you can easily establish the same bounds on accuracy (though clearly not on size; the fact that things are going to disk should make this easier to handle). [223001700530] |You could store the error bounds separately and prune by hand, but you’d need to compute the error bounds yourself by finding all the new n-grams in a block, so it’s really not practical as LingPipe’s written now.
  • McDonald, Ryan. 2007. [223001710050] |Generalized Linear Classifiers in NLP. [223001710060] |Course presented to the Swedish Graduate School in Language Technology. [223001710070] |Among other things, Ryan shows how the maximum entropy estimate subject to expectation matching constraints yields the same unique solution as the maximum likelihood estimate (where the derivative of error shows the max likelihood estimate matches expectations). [223001710080] |But I’m not going to harp on terminology or philosophy today, but rather on the more immediate practical problems of feature extraction and coefficient encoding. [223001710090] |

    K vs. K-1 Coefficient Vectors?

    [223001710100] |Consider multinomial classifiers with K outcome categories c[1],…,c[K] and n-dimensional vector inputs x. [223001710110] |In logistic regression, as typically done in stats, you have K-1 n-dimensional coefficient vectors β[1], …, β[K-1] and the probability distribution over categories given an input vector x is defined so that p(c[k]|x) is proportional to exp(β[k] * x) for 1 <= k <= K-1, and p(c[K]|x) is proportional to exp(0 * x) = 1. [223001710120] |Sometimes, you see K coefficient vectors and p(c[K]|x) is proportional to exp(β[K] * x) just like for the other dimensions. [223001710130] |For instance, Dave Lewis and David Madigan et al.’s package Bayesian Multinomial Regression (BMR) lets you choose K or K-1 coefficient vectors; LingPipe uses the K-1 encoding. [223001710140] |Dave (Lewis) told me they included the K-vector approach because their users found it easier to interpret than pinning the last coefficient vector to 0. [223001710150] |The problem with the K coefficient vector approach is that the parameters are not identified in the statistical sense. [223001710160] |Basically, if you subtract β[k] from all the coefficient vectors (leaving β[k] = 0), you get exactly the same predictions. [223001710170] |Zero-centered (or other) priors can identify unique solutions that minimize coefficient values. [223001710180] |Gaussian priors (aka L2 regularization) always lead to a unique solution. [223001710190] |Laplace priors might not. [223001710200] |Imagine a very simple binary (outcomes K = 2) single regression (dimensions n = 1), and the vector β[1] = 1.0 in the K-1 coefficient coding, which implicitly sets β[2] = 0.0. [223001710210] |For an input x, we get [223001710220] |p(c[1]|x) = exp(x * 1.0) / ( exp(x * 1.0) + exp(x * 0.0) ) [223001710230] |p(c[2]|x) = exp(x * 0.0) / ( exp(x * 1.0) + exp(x * 0.0) ). [223001710240] |We get the same results with the K-vector coefficient encoding, taking β[1] = 0.5 and β[2] = -0.5, namely [223001710250] |p(c[1]|x) = exp(x * 0.5) / ( exp(x * 0.5) + exp(x * -0.5) ) [223001710260] |p(c[2]|x) = exp(x * -0.5) / ( exp(x * 0.5) + exp(x * -0.5) ) [223001710270] |(To see the equivalence, multiply the numerator and denominator of the second set of equations by exp(x * 0.5).) [223001710280] |Coefficients of 0.5 and -0.5 are more likely in a zero-centered Gaussian prior (L2 regularization) than coefficients of 1.0 and 0.0, but have the same likelihood under a Laplace prior (L1 regularization). [223001710290] |Sampling approaches to modeling the posterior distribution p(θ|X) are more problematic, because the usual measures of Gibbs sampler convergence (e.g. R hat) won’t work. [223001710300] |The usual solution is to monitor the K-1 dimensional transform, because that is identified. [223001710310] |
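To make the identifiability point concrete, here is a small, self-contained check in plain Java (no LingPipe) that subtracting the same vector from every coefficient vector leaves the predicted probabilities unchanged; the numbers are arbitrary.

    public class SoftmaxShiftDemo {

        // p(c[k]|x) proportional to exp(beta[k] * x)
        static double[] probs(double[][] beta, double[] x) {
            double[] p = new double[beta.length];
            double sum = 0.0;
            for (int k = 0; k < beta.length; ++k) {
                double dot = 0.0;
                for (int d = 0; d < x.length; ++d)
                    dot += beta[k][d] * x[d];
                p[k] = Math.exp(dot);
                sum += p[k];
            }
            for (int k = 0; k < p.length; ++k)
                p[k] /= sum;
            return p;
        }

        public static void main(String[] args) {
            double[] x = { 0.7, -1.2 };
            // K = 3 coefficient vectors, n = 2 dimensions
            double[][] beta = { { 1.0, 0.5 }, { -0.3, 2.0 }, { 0.0, 0.0 } };
            // subtract beta[0] from every vector; predictions should not change
            double[][] shifted = new double[beta.length][x.length];
            for (int k = 0; k < beta.length; ++k)
                for (int d = 0; d < x.length; ++d)
                    shifted[k][d] = beta[k][d] - beta[0][d];
            System.out.println(java.util.Arrays.toString(probs(beta, x)));
            System.out.println(java.util.Arrays.toString(probs(shifted, x)));
            // both lines print the same probabilities (up to rounding)
        }
    }

The same invariance is why a zero-centered Gaussian prior picks out the minimum-norm representative among all the equivalent coefficient settings.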

    Using a Single Coefficient Vector

    [223001710320] |What always throws me when I read max entropy presentations is that they don’t use K or K-1 vectors, they use a single vector, with a different kind of feature extraction. [223001710330] |Specifically, there’s a different input feature vector for each category, traditionally written φ(c[k],e) for input e and category c[k]. [223001710340] |Recall that the usual set up in logistic regression is to use a single feature extractor ψ(e) that is independent of category (the value of ψ(e) is the input vector x we used above before talking about feature extraction). [223001710350] |Now with K vectors for input e, φ(c[1],e) through φ(c[K],e), we have a single coefficient vector β and we model probabilities p(c[k]|e) as being proportional to φ(c[k],e) * β. [223001710360] |It’s easy to code the usual logistic setup in the max entropy pattern. [223001710370] |Just set β = (β[1],…,β[K]) and set φ(c[k],e) to (0,…,0,ψ(e),0,…,0), with ψ(e) in the k-th block. [223001710380] |Going the other way around doesn’t work, because there may be features extracted by more general φ that can’t be replicated in the usual logistic setup. [223001710390] |I asked Hal Daumé III when he was in town if people used this flexibility in the models. [223001710400] |He said usually not, but he did mention one case he’d seen, which is in first-order CRFs, where the conditioning is on the previous category as well as the entire input string. [223001710410] |He said he’d seen people code up the feature “my tag is the same as the previous tag”, which is not something you can code directly in the logistic setting. [223001710420] |There, you can only get “is my tag T and is the previous tag T” duplicated for every tag T. [223001710430] |Are there other examples? [223001710440] |Is it worth implementing the more general feature extraction case? [223001720010] |The Danger of Overloading Generic Methods [223001720020] |We ran into a problem with generic methods in logistic regression. [223001720030] |LingPipe’s logistic regression classifier uses a feature extractor to map objects to feature vectors, which are maps from strings to numbers. [223001720040] |There used to be a single public classification method, so the class looked like the sketch below. [223001720050] |We thought it’d be convenient to have a second classification method that could classify feature vectors (which are mappings from string-valued features to numerical values). [223001720060] |It compiles just fine, and I think it’s pretty clear what the difference is between the overloaded methods. [223001720070] |Now consider creating an instance of this type. [223001720080] |All together now: “Doh!”. [223001720090] |The new class can be constructed, but I can’t call either classify() method with a Map argument, because it matches both methods, which turn out to have identical signatures from a client perspective. [223001720100] |Java method resolution (which happens statically) picks the most specific signature that matches the argument type. [223001720110] |You can see why it’s stuck with two identical argument types. [223001720120] |But hang on, the compiler shouldn’t let us compile a class with two methods with the same signature. [223001720130] |So why did it work? [223001720140] |It all goes back to the semantics of generics in Java, which are based on generic erasure.
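Here is a hedged reconstruction of the kind of class being described; it is not the actual LingPipe source, but it reproduces the compile-time versus call-time behavior discussed above.

    import java.util.Map;

    class Classification { }

    class Classifier<E> {
        // classify an arbitrary input of type E
        public Classification classify(E input) {
            return new Classification();
        }
        // convenience overload for pre-extracted feature vectors; after
        // erasure this is classify(Map) while the method above is
        // classify(Object), so the class itself compiles without complaint
        public Classification classify(Map<String, ? extends Number> featureVector) {
            return new Classification();
        }
    }

    class Client {
        Classification demo(Map<String, Double> features) {
            Classifier<Map<String, ? extends Number>> classifier =
                new Classifier<Map<String, ? extends Number>>();
            // With E bound to Map<String,? extends Number>, both classify()
            // methods now take exactly the same argument type, so javac
            // rejects the following call as ambiguous:
            //
            //     return classifier.classify(features);
            //
            return null;
        }
    }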
[223001720150] |With erasure, these two methods have distinct signatures: roughly, classify(Object) and classify(Map). [223001720160] |The problem is that this isn’t the view presented to client code; to clients, there are two methods with identical signatures, so code won’t be able to use them, because it won’t be able to resolve the method signature. [223001720170] |And that, kids, is how the method [223001720180] |Classification classifyFeatures(Map) [223001720190] |got its name. [223001730010] |Cross-Validation isn’t Enough for “Empirical” Bayes Prior Estimation [223001730020] |Andrew set me straight the other day on why cross validation isn’t enough to estimate priors in hierarchical models. [223001730030] |I naively thought you could just run cross-validation and choose the prior that works the best overall. [223001730040] |By best overall, I meant the one that leads to the lowest cross-validation error. [223001730050] |Suppose we have K folds of data d[1],…,d[K]. [223001730060] |We are interested in finding the prior parameter vector β that maximizes [223001730070] |Σk log p(d[k] | d[1..k-1, k+1..K], β), [223001730080] |where the probability density function p is the usual predictive inference, which integrates over model parameters θ given the data d[1..k-1, k+1..K] and prior β to predict d[k], [223001730090] |p(d[k] | d[1..k-1, k+1..K], β) = ∫Θ p(d[k] | θ) p(θ | d[1..k-1, k+1..K], β) dθ [223001730100] |Andrew pointed out that the degenerate prior with mean set to the corpus maximum likelihood estimate (let’s call it θ*) and zero variance works the best. [223001730110] |That’s because θ* is the parameter that maximizes the corpus likelihood: [223001730120] |Σk log p(d[k] | θ*) [223001730130] |and with a zero-variance prior β with mean θ*, the integral over parameters θ works out to [223001730140] |p(d[k] | d[1..k-1, k+1..K], β) = p(d[k] | θ*) [223001730150] |The problem is clear here. [223001730160] |But I’m not sure the prior with mean θ* and zero variance is always optimal. [223001730170] |Isn’t it possible that we could use a different prior β’ and by carefully carving up the folds find that we could estimate a better θ[k] for a fold k and hence a better overall performance? [223001730180] |I think it may be possible because we’re only trying to predict one fold at a time. [223001730190] |But then collectively we may not be able to win. [223001730200] |I could try to work out some examples, but instead I’ll ask, are there any statisticians in the audience? [223001740010] |Langford, Li, and Zhang (2009) Sparse Online Learning via Truncated Gradient [223001740020] |This paper is the bomb: [223001740030] |
  • Langford, John, Lihong Li and Tong Zhang. 2009. [223001740040] |Sparse online learning via truncated gradient. [223001740050] |JMLR 10:777–801. [223001740060] |It contains a very general theoretical analysis of the convergence of the kind of truncated L1 normalization (Laplace prior) we have in LingPipe’s logistic regression implementation. [223001740070] |They establish convergence for a very general case of loss including standard least squares linear regression, SVMs and logistic regression as special cases, and a very general stochastic condition which allows updates for N examples at a time. [223001740080] |(Is this latter condition useful for parallelizing this algorithm?) [223001740090] |One of the paper’s main motivations is to find sparse sets of coefficients, which is why they explore L1 norms (Cauchy priors [free preprint version] also have this property of having a density spike around the mean coupled with fat tails). [223001740100] |I’ve been planning to add a truly online version of logistic regression training that’d only store the coefficients in memory and a handful of examples (the batch size, which when set to 1, gives traditional stochastic gradient). [223001740110] |Our implementation only meets the first of the other two desiderata laid out by Langford et al. (p. 779): [223001740120] |
  • The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features. [223001740130] |
  • The algorithm should be memory efficient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically. [223001740140] |What I haven’t done is made the coefficient vectors sparse. [223001740150] |The problem with doing that is there’s a pretty steep penalty to pay in efficiency for doing so, at least using any kind of sparse matrices I know how to implement. [223001740160] |There’s also the issue of symbol tables if there’s a truly vast set of features, but that’s not really the gradient estimator’s responsibility. [223001740170] |There’s the usual set of experimental results, too, showing that the method works well on known data sets, such as Reuters RCV1. [223001740180] |They also evaluate some of their own (Yahoo! advertising) data with 3 x 10⁹ features, 26 x 10⁶ training instances. [223001740190] |Note that this is nearly three orders of magnitude more features than data points, which is why you need to apply such a strong prior; without it you’d have rather an overfitting problem (there’s more discussion in Tong’s further work on the subject, Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations). [223001740200] |They use held-out evals like machine learning folks rather than reporting log loss on the training data like statisticians. [223001740210] |All in all, L1 with truncated gradient does a pretty good job at removing features —even better than the regular Lasso (they evaluated prior variances that led to fewer and fewer features; sort of like the diagram in Hastie et al.’s machine learning book when they describe the Lasso). [223001740220] |Maybe we should be experimenting with even lower prior variances from a zero mean. [223001750010] |Java Object Allocations Costly in Inner Loops [223001750020] |I’d pretty much stopped worrying about short-lived object allocations in Java. [223001750030] |GC’s gotten so fast it’s usually not the bottleneck in anything I’ve profiled lately, especially for short-lived objects in programs where throughput rather than low maximum latency is critical. [223001750040] |That is, until I got a phone call from a customer asking if I’d added multi-threading to the logistic regression estimator. [223001750050] |The short answer was “no”. [223001750060] |Hmm. [223001750070] |Logistic regression version 3.8 uses two threads. [223001750080] |What changed? [223001750090] |Just a really tiny, really short-lived object allocation inside the next-to-innermost loop. [223001750100] |It was a wrapper for the non-zero dimensions. [223001750110] |Mike added it in order to clean up my original code, which had cut-and-pasted the prior/regularization/shrinkage steps in multiple places to deal with middle/end of loop differences and sparse/dense array differences. [223001750120] |Here’s what version 3.8.0 looks like (slightly simplified), with this loop being executed once per training epoch: [223001750130] |and then here’s the code in the inner loop: [223001750140] |With the primitive int iterator (in util.Iterators along with factory creation methods as used above), there’s no further boxing/unboxing required for the numbers. [223001750150] |Mike’s changes seemed like a good idea to me when I code-reviewed them, as they made the code much cleaner and more modular. [223001750160] |And after all, the innermost loop went over the non-zero dimensions themselves, doing multiply/add steps after method calls.
[223001750170] |And usually hundreds of these. [223001750180] |And all the code to actually follow the likelihood gradient is even more expensive than the prior gradients. [223001750190] |So how expensive could the object allocation be, relatively speaking? [223001750200] |Very. [223001750210] |It was enough to cause a second thread to be pinned on the computer running the job. [223001750220] |That’s exactly what you’d expect if you maxed out the garbage collector with short-lived object deallocations. [223001750230] |In the server versions of the JVMs (pretty much all we use as all our machines are 64 bit), garbage collection runs concurrently by default. [223001750240] |At work, on our dual quad-core Xeons with big caches and fast memory, you couldn’t even notice it in terms of either wall clock time or its effect on the system. [223001750250] |Debugging on my Core Duo notebook was a different story. [223001750260] |Firing up logistic regression version 3.8.0 so hosed the machine I couldn’t even open Firefox without a 2 minute wait for the screen to draw. [223001750270] |To make a long story short, I ripped the lid off again and we now have the best of both worlds. [223001750280] |I refactored Mike’s refactoring to preserve the modularity but remove the object allocation. [223001750290] |Profiling showed it worked in that logistic regression estimation is only using a single thread, which is what you’d expect, because there’s no object allocation anywhere. [223001750300] |It’s also faster on my notebook. [223001750310] |The moral of the story is that fast Java programs need to look like C (without the malloc, of course). [223001750320] |In most cases, we’re not deep enough in inner loops in CPU-intensive enough jobs for this to matter. [223001750330] |Alas, sometimes, we are. [223001750340] |Stay tuned —I’m about to roll out LingPipe 3.8.1 which has this patch in place. [223001750350] |For comparison, here’s the new code: [223001750360] |where nonZeroDimensions() returns the array of non-zero dimensions for a sparse vector. [223001750370] |The update code is pretty much the same: [223001750380] |(Warning: There’s another bug in the above code relative to 3.7.0, namely that the lastRegularizaton gets reset too soon, preventing the 2nd to K-1st coefficient vectors from getting regularized. [223001750390] |On the plus side, we’ve also fixed a bug and a separate arithmetic overflow problem from 3.7.0.) [223001760010] |LingPipe 3.8.1 Released [223001760020] |We just released LingPipe 3.8.1, which patches LingPipe 3.8.0. [223001760030] |As usual, it’s available from the home page: [223001760040] |
  • LingPipe Home Page [223001760050] |There are only two changes: [223001760060] |
  • Bug fixes/performance for logistic regression [223001760070] |We removed the object allocations introduced in 3.8 to reduce garbage collection overhead. [223001760080] |We fixed a bug introduced in 3.8 that prevented all but the first category in a multinomial logistic regression from being regularized. [223001760090] |
  • Start/End points for tokenizers [223001760100] |We added an end point method to tokenizers, and implemented start/end with unit tests for all of the supplied tokenizers. [223001760110] |This makes for easier text highlighting, especially for filtered tokenizers like Porter stemming or stoplisting. [223001770010] |Java Multiplication (Much) Faster than Division [223001770020] |

    Micro-benchmarking

    [223001770030] |I wrote this little micro benchmark to test it out: [223001770040] |
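The original benchmark code is not reproduced here, but a minimal sketch along the same lines looks like this; the loop size and repetition count are arbitrary, and as with any micro-benchmark the absolute numbers depend heavily on JIT warmup and hardware.

    public class DivideVsMultiply {

        static final int N = 100 * 1000 * 1000;   // iterations per measurement

        public static void main(String[] args) {
            for (int rep = 0; rep < 5; ++rep) {    // repeat so the JIT warms up
                long t0 = System.nanoTime();
                double a = justAdd(N);
                long t1 = System.nanoTime();
                double m = multiply(N);
                long t2 = System.nanoTime();
                double d = divide(N);
                long t3 = System.nanoTime();
                System.out.printf("add=%dms multiply=%dms divide=%dms (sums %.1f %.1f %.1f)%n",
                                  (t1 - t0) / 1000000, (t2 - t1) / 1000000,
                                  (t3 - t2) / 1000000, a, m, d);
            }
        }

        // baseline: loop and addition only
        static double justAdd(int n) {
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                sum += i;
            return sum;
        }

        // multiplication in the inner loop
        static double multiply(int n) {
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                sum += i * 0.001;
            return sum;
        }

        // division in the inner loop
        static double divide(int n) {
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                sum += i / 1000.0;
            return sum;
        }
    }

Printing the sums keeps the JIT from optimizing the loops away entirely.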

    Run time on 1.6 JVM

    [223001770050] |Here’s the timing on my workstation, a dual quad-core Xeon (E5410 2.33GHz), with 16GB of ECC/registered DDR2667 memory, Vista 64 bit OS. I’m running the following 1.6 JVM: Java(TM) SE Runtime Environment (build 1.6.0_05-b13), Java HotSpot(TM) 64-Bit Server VM (build 10.0-b19, mixed mode) [223001770060] |Because there’s more going on than just the inner loop’s multiplication or division, here are timings with the inner loop doing just sum += i: [223001770070] |Here’s the result with multiplication: [223001770080] |Multiplication is not very costly compared to the basic loops and additions. [223001770090] |Compare this to division (i/1000.0): [223001770100] |This is a huge penalty for division over multiplication or nothing at all. [223001770110] |All that extra cost is due to division. [223001770120] |I tried different numbers to multiply and divide and their choice doesn’t matter. [223001770130] |

    Compilation in JDK 1.5 vs. 1.6

    [223001770140] |These are numbers from compiling in the 1.6 JDK. [223001770150] |Compiling in 1.5 (as we compile LingPipe), and then running in the 1.6 JVM didn’t make a difference. [223001770160] |

    Run time in 1.5 JVM

    [223001770170] |Here are numbers for the 1.5 JVM, Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_15-b04), Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_15-b04, mixed mode). [223001770180] |First, with just the addition in the loop: [223001770190] |I have no idea why the 1.5 JVM is twice as fast as the 1.6 JVM for simple loops. [223001770200] |Here’s the result for multiplication, which clearly has a measurable cost in 1.5, with the total compute time for multiplication still a bit faster than in 1.6: [223001770210] |Division is relatively slow in the 1.5 JVM: [223001770220] |Of course, one has to be careful that precision doesn’t get hurt by switching divides to multiplies. [223001770230] |
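As a tiny illustration of that caveat: 0.1 has no exact binary representation, so multiplying by 0.1 is not always bit-for-bit the same as dividing by 10.0.

    public class DivideVsMultiplyPrecision {
        public static void main(String[] args) {
            double byDivision = 3.0 / 10.0;        // prints 0.3
            double byMultiplication = 3.0 * 0.1;   // prints 0.30000000000000004
            System.out.println(byDivision);
            System.out.println(byMultiplication);
            System.out.println(byDivision == byMultiplication);   // false
        }
    }

The two results differ by one unit in the last place, which is harmless in most scoring code but worth checking when exact reproducibility matters.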

    Comparison to Logarithms

    [223001770240] |In the 1.5 JVM, swapping Math.log(x) for x/1000.0 was about ten times slower. [223001770250] |In the 1.6 JVM, logs were only about three times slower than division. [223001770260] |As I’ve said before, use one of the more recent builds of 1.6 in server mode if you want the fastest JVM for LingPipe. [223001780010] |Quantized Information Gain (Conditional Entropy) for Real- and Count-Valued Features [223001780020] |I’ve just implemented information gain for feature selection, but I’m unhappy with how poorly its selection criterion is suited to linear classifiers like logistic regression, which take real-valued inputs, or even naive Bayes, which takes count-valued inputs. [223001780030] |I’ve been thinking that the best I can do is to quantize the input features to three categories, positive, negative, and zero.

    Information Gain (aka Mutual Information)

    [223001780050] |Information gain (closely related to mutual information) is defined intuitively as the reduction in your uncertainty about the category of an item being classified (X) provided by knowing the value of a feature (Y). [223001780060] |Uncertainty is measured in entropy for distributions, sample entropy or estimated model entropy for data sets. [223001780070] |Uncertainty given the value of a feature is conditional entropy. [223001780080] |The gain is the difference between the entropy and conditional entropy. [223001780090] |In symbols, the entropy H(X) of a discrete random variable X with distribution p() is [223001780100] |H(X) = –Σx in X p(x) log p(x). [223001780110] |Conditional entropy is just the entropy of random variable X given the value of Y, [223001780120] |H(X|Y) = Σy in Y p(y) H(X|Y=y), [223001780130] |where H(X|Y=y) is the entropy of the variable X restricted to cases where Y=y, [223001780140] |H(X|Y=y) = –Σx in X p(x|y) log p(x|y). [223001780150] |The information gain of a feature Y with respect to X is just its mutual information with respect to X: [223001780160] |IG_X(Y) = I(X; Y) = H(X) – H(X|Y). [223001780170] |Because H(X) is fixed, a feature Y has a higher information gain than feature Z if H(X|Y) < H(X|Z).

    Count and Continuous Variables

    [223001780200] |In the most common case, Y is a boolean (Bernoulli) random variable, taking on only two possible values, 0 and 1. [223001780210] |More than two outcomes are typically coded with boolean indicator variables, exactly one of which takes on value 1 for any input. [223001780220] |In these cases, information gain makes sense. [223001780230] |But we’re usually dealing with count variables (e.g. naive Bayes, language model classifiers) or continuous variables (e.g. logistic regression, K-nearest neighbors, decision trees). [223001780240] |The definition of conditional entropy already covers the count case, because it’s countable. [223001780250] |But it assumes that naive Bayes could somehow treat the counts non-monotonically. [223001780260] |In fact, naive Bayes probability estimates are always monotonic in the features (because they simply multiply counts by word log probability estimates). [223001780270] |Continuous variables are handled by replacing the sum with an integral. [223001780280] |But with discriminative classifiers like logistic regression, we don’t even have a generative model of the input data, so what do we integrate? [223001780290] |Logistic regression, at least without expanding the feature basis with interactions, quadratic terms, etc., is also monotonic, though it can go up and down because of the possibility of negative values. [223001780300] |What I’ve been toying with doing is just quantizing continuous variables and count variables into three categories: negative (Y < 0), zero (Y = 0), and positive (Y > 0). [223001780310] |If no one has any objections or better ideas, that’s what’s going into com.aliasi.features for the next release. [223001780320] |It reduces to the usual definition for binary features (coded either way, as variables that take on only values 0 or 1, or take on only values -1 or 1). [223001780330] |This is pretty much what George Forman did in his 2003 JMLR paper on feature selection, An Extensive Empirical Study of Feature Selection Metrics for Text Classification (in their special issue on feature selection).
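Here is a rough sketch of what information gain with the proposed three-way quantization might look like in code; it is illustration only and not the com.aliasi.features implementation.

    // quantized information gain from empirical counts
    public class QuantizedInfoGain {

        // bucket a feature value as 0 (negative), 1 (zero), or 2 (positive)
        static int bucket(double y) {
            return y < 0.0 ? 0 : (y == 0.0 ? 1 : 2);
        }

        // entropy of a vector of counts, in nats
        static double entropy(double[] counts) {
            double total = 0.0;
            for (double c : counts)
                total += c;
            double h = 0.0;
            for (double c : counts) {
                if (c > 0.0) {
                    double p = c / total;
                    h -= p * Math.log(p);
                }
            }
            return h;
        }

        // categories[i] is the category index of item i (0..numCats-1);
        // featureVals[i] is that item's value for the feature being scored
        static double infoGain(int[] categories, double[] featureVals, int numCats) {
            double[] catCounts = new double[numCats];
            double[][] jointCounts = new double[3][numCats];   // bucket by category
            for (int i = 0; i < categories.length; ++i) {
                catCounts[categories[i]] += 1.0;
                jointCounts[bucket(featureVals[i])][categories[i]] += 1.0;
            }
            double hX = entropy(catCounts);    // H(X)
            double hXgivenY = 0.0;             // H(X|Y) = sum over y of p(y) H(X|Y=y)
            double total = categories.length;
            for (int b = 0; b < 3; ++b) {
                double bucketTotal = 0.0;
                for (double c : jointCounts[b])
                    bucketTotal += c;
                if (bucketTotal > 0.0)
                    hXgivenY += (bucketTotal / total) * entropy(jointCounts[b]);
            }
            return hX - hXgivenY;
        }
    }

For boolean features only two of the three buckets are ever occupied, so this reduces to the usual definition.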
[223001780340] |Spoiler Warning: Forman found information gain to be among the top feature selection methods for text classifiers, but only considered count data quantized into two categories, zero and positive. [223001790010] |Lucida Console is My Font [223001790020] |The Hivelogic blog has a post about Top 10 Programming Fonts. [223001790030] |I have to say my tastes run rather in the opposite direction from the poster’s. Here’s a screenshot of Hivelogic’s top recommended coding font (Inconsolata) next to some un-antialiased Lucida Console at about the same size, along with some bigger Lucida console: [223001790040] |It looks exactly the same in a DOS/Cygwin shell window. [223001790050] |To me, there’s just no comparison. [223001790060] |The anti-aliased fonts are blurry. [223001790070] |Cleartype’s even blurrier in my opinion. [223001790080] |It looks great from a distance, but for small programming fonts? [223001790090] |Lucida Console without anti-aliasing is where it’s at. [223001790100] |If you want Lucida for coding/shell use, you can get the true type fonts from Windows, or from the Java 1.6 SE distribution. [223001790110] |Just unpack it and look in: [223001790120] |jre/lib/fonts/LucidaTypewriterRegular.ttf [223001790130] |There’s also a bold version. [223001790140] |They’re confusingly named —the full name’s “Lucida Sans Typewriter” (they also developed a serifed version, but it’s not distributed with java, but does come with the PcTeX distro). [223001790150] |I’ve been composing text using emacs and typesetting math in LaTeX since 1984. [223001790160] |Back in the day, it was pretty much some form of courier or another. [223001790170] |Nowadays, for me, it’s all about Lucida console. [223001790180] |Maybe it’s all the Java, because the font’s not only distributed with Java, it’s the font used in all my favorite Java books (i.e. Arnold and Gosling Java Programming Language, Goetz’s Concurrency, and Bloch’s Effective Java). [223001790190] |I like Lucida Console so much I shelled out $99 to buy a version of the Lucida family of fonts (including Lucida Bright and Lucida Typewriter) for TeX (from the PcTeX Store —it works with any TeX —I’m using MikTeX). [223001790200] |The nice thing about Lucida Console and Lucida Bright together is that they have balanced color (that is, the amount of black ink looks about the same from a distance) and balanced x-height so they look good next to each other. [223001790210] |Courier and Times do not play nicely together, and you have to go through contortions to get Times and Lucida Console to look OK (see the colophon in the Arnold et al. book). [223001790220] |For more design discussion, check out my previous post, LingPipe Site Colophon. [223001800010] |Flexible Dictionary-Based Chunking for Extracting Gene and Protein Names [223001800020] |We’ve been using our exact dictionary matching chunker rather heavily lately in extracting gene and protein names from free text. [223001800030] |We’ve been looking at BioCreative 2′s Protein-Protein Interaction (PPI) interacting pairs subtask (IPS). [223001800040] |I’d be curious to hear how others dealt with the data, which is not annotated at the mention level (it’s full text articles run through HTML and PDF converters, so there’s a huge amount of “noise” from figures and nav controls to boot). [223001800050] |Gene and protein dictionaries present quite a difficult matching problem, because the available dictionaries (e.g. Entrez-Gene and Uniprot-KB) are very spotty. 
[223001800060] |One of the recurring problems is case and modifiers. [223001800070] |For instance, it’s common to have a gene such as “Trp53″ (Entrez-Gene 22059) referred to in text as “mP53wt” (mouse p-53, wild type). [223001800080] |This is clearly problematic if you have a space-based notion of tokenization. [223001800090] |LingPipe lets you define flexible tokenizers. [223001800100] |For this problem, we get rid of all the punctuation (treating it as whitespace), and choose to break at case changes and digits, using a regular expression tokenizer with pattern [223001800110] |([a-z]+)|([A-Z]+)|([0-9]+) [223001800120] |This’ll tokenize “mP53wt” into four tokens, “m”, “P”, “53″, and “wt”. [223001800130] |We typically wrap the reg-ex tokenizer in a lowercasing filter, so that the final matching is case insensitive. [223001800140] |LingPipe’s exact-match dictionary chunker is not sensitive to whitespaces. [223001800150] |It tokenizes the phrases in the dictionary and the text being chunked, and sees if there’s a sequence match. [223001800160] |Hence a dictionary entry “p53″ will match “mP53wt” producing a chunk starting at position 1 and ending at position 4 (one past last char). [223001800170] |The tokenizer factories support stoplists, so it’s possible to filter out words like “protein” (allowing “Neuron-specific X11 protein” to match “Neuron-specific X11″), or like “related” (e.g. “transformation related protein 53″), or like “member” (e.g. “inwardly rectifying subfamily J member 2″). [223001800180] |The last example suggests we might want to stem, as well, so that “inwardly” would match “inward” and “rectifying” would match “rectify”. [223001800190] |It’s also possible to normalize words using a token modifying tokenizer factory to reduce “alpha”, “A”, and “α”to the same token. [223001800200] |Warning: In LingPipe 3.8.1, it’s not possible to write a dictionary chunker that uses a reductive tokenizer (like a stemmer), because it was counting the space by the length of the tokens and whitespaces. [223001800210] |As of the next release, this problem has been fixed by converting the dictionary chunker to rely on the tokenizer’s start and end positions for the previous token. [223001800220] |If you’re curious to try before then, I can send you the patch. [223001800230] |The other big problems are species normalization (Entrez lists a whole host of organisms with TP53 genes), and word-order variation. [223001800240] |Stay tuned for more details on what we’re doing for these. [223001810010] |Bulmer (1967) Principles of Statistics [223001810020] |When I asked Steve Ansolabahere what to read as an intro to stats, he recommended this: [223001810030] |
  • Bulmer, M. G. 1967. [223001810040] |Principles of Statistics. [223001810050] |2nd Edition. [223001810060] |Dover. [223001810070] |[Update: I first wrote 1966, which is between the first edition in '65 and the second in '67.] [223001810080] |As Mitzi says, clever boy, that Steve. [223001810090] |This book is fantastic. [223001810100] |It’s concise (250 page), incredibly clear and straightforward, and at just the right level of mathematical rigor not to get bogged down in extraneous details. [223001810110] |And the price is right, US $10.17 brand new from Amazon. [223001810120] |I’ve always been a fan of Dover Press. [223001810130] |In a few short chapters, with no dissertations on combinatorics or urns in sight, Bulmer lays out basic probability theory (e.g. conditional probability, chain rule, independence, etc.). [223001810140] |He actually uses real data for examples. [223001810150] |Random variables arise naturally after this rather than as some measure-theoretic notion that’s always way too abstract for a first pass at the material (Larsen and Marx’s Mathematical Statistics is very good if you really want the gory details for Euclidean space using the minimal amount of analysis). [223001810160] |There’s a wonderful discussion of expectations, variance and moments along with a very nice intro to moment generating functions with comprehensible examples. [223001810170] |Throughout, the examples are only as complicated as they need to be to get the basic idea across. [223001810180] |Bulmer first goes into depth on the discrete distributions, binomial, Poisson, and exponential. [223001810190] |There’s a very nice chapter on the normal distribution and its relation to the central limit theorem. [223001810200] |What I liked best about the book, though, was its coverage of χ2 and t distributions and their relations to estimation. [223001810210] |I just read it again and I marvel at how Bulmer explains this entire topic in 15 easy pages. [223001810220] |Really. [223001810230] |Read it. [223001810240] |After covering traditional point-estimate hypothesis testing (boo, hiss), Bulmer jumps into a philosophical discussion of statistical inference, providing a nice intro to Bayesian and non-Bayesian approaches. [223001810250] |This is followed by a discussion of point estimates and estimation robustness. [223001810260] |If you’re forced to go classical (why else would you?), this is a good way to think about things. [223001810270] |Finally, there’s a cursory section on regression which covers the linear case and then the bivariate case and its relation to correlation very cleanly without any matrix notation. [223001810280] |Alas, no logistic regression. [223001810290] |On the other hand, once you’ve read this, you should be ready for Gelman and Hill’s book on regression. [223001810300] |If you don’t believe me, read the reviews on Amazon. [223001810310] |It’s a stack of love letters to the author for his or her insight. [223001820010] |June 10 Mechanical Turk Meetup –San Francisco [223001820020] |I’m heading West for a June 10 meetup about the Mechanical Turk, hosted by the Turk experts, Dolores Labs: [223001820030] |
  • Mechanical Turk/Crowdsourcing Meetup Page [223001820040] |The page contains a description of the program, and there’s space for a few more; just RSVP at the meetup page above. [223001820050] |Lukas (our host) just made up a title for my presentation (which is of joint work with Breck Baldwin and Emily Jamison), “Failure study of [a] biology task”. [223001820060] |I’m going to be talking about some of the NLP tasks we’ve run successfully (named entity tagging and morphological analysis) and also one (linking MEDLINE to Entrez-Gene) that didn’t work. [223001820070] |I’ll discuss some of the UI issues along the way. [223001820080] |I’ll also describe the high level view of inferring gold standards and annotator accuracy, particularly with respect to its effect on applications. [223001820090] |But don’t worry —I won’t be assuming the audience is full of Bayesian statistician. [223001830010] |Bing Brings its A Game [223001830020] |[Update: Bing won the race to index this blog post. [223001830030] |A search 30 minutes after this post went live for showed up on Bing but not Google (Yahoo! wants to spell-check "bing" to "being").] [223001830040] |Wow. [223001830050] |I just did fifteen general non-structured searches to compare Microsoft’s new search engine, Bing to Google and Yahoo!. [223001830060] |I must be like Woz, because after a full bout (fifteen rounds), while there was not a clear knockout, I have to say I’m amazed at how well Bing works. [223001830070] |Noticeably better than Google for many queries, in fact. [223001830080] |I liked Bing’s snippets much much more than Google’s, but they should adopt Google’s indentation policy for multiple results from the same site (that does cause re-ordering, but it’s useful for me as a consumer). [223001830090] |The Bing snippets would be even better if they highlighted the search terms! [223001830100] |In most cases, Bing had a noticeably better diversity of results. [223001830110] |For those two reasons alone, I’m going to switch to Bing for the near term and see how I like it. [223001830120] |I’m not happy with either Bing’s or Google’s attempt to “facet” results (by clustering or what?). [223001830130] |I’m very happy with how up-to-date both engines are (e.g. ) and how they present a few news headlines. [223001830140] |Yahoo! has the best followup and related query suggestions as you type, but the ones they present with results aren’t as good as the live ones. [223001830150] |Bing and Google take different approaches (related versus refinement), and both are OK, but not particularly insightful. [223001830160] |Google’s spell checker seems awfully aggressive. [223001830170] |And I’m feeling that they’re being a little too agressive stemming (I need some good examples of this —none came to mind right away; but for example, “infinity” is a synonym for LaTeX’s “infty” command, at least if you can go by highlighting!). [223001830180] |Google and Bing are both surprisingly up to date. [223001830190] |I’d hate to have to solve the spidering and updating problem they face. [223001830200] |I like Bing’s clean layout —Google’s results feel cramped by comparison. [223001830210] |Now that I look at it, Google’s one-column layout feels very late 1990s, early 2000s. Bing’s narrow left-hand column with types of news and related searches (for query refinement) seems cleaner than Google’s approach, where you have to scroll down to related searches. 
[223001830220] |I really appreciate the big targets for the paging at the bottom of Bing (ditto Yahoo!); clearly Google’s designers have never heard of Fitt’s Law. [223001830230] |I’m really impressed with how fast Bing is. It’s at least as responsive as Google if not more so. [223001830240] |Let’s see how that goes if it gets more popular. [223001830250] |The champ’s a bit woozy, and there’s still one round left: [223001830260] |It’s too close for me to call with just these 15 queries. [223001830270] |But clearly Bing’s a contender for general web search. [223001830280] |Should Google be worried? [223001830290] |Is the ad business zero sum? [223001830300] |(Those questions weren’t rhetorical —I know nothing about the ad business other than that it’s been profitable for Google and annoying for me as a user.) [223001840010] |Predicting Category Prevalence by Adjusting for Biased Classifiers [223001840020] |Suppose you are interested in text-based epidemiology, as our customer Health Monitoring Services is. [223001840030] |Among the information they get are real-time inputs in the form of chief complaints (a few words about what’s wrong with the patient) from hospital emergency rooms. [223001840040] |These are then classifed by syndrome (e.g. botulinic, gastrointestinal, neurological, etc.) to carry out so-called syndromic surveillance. [223001840050] |(HMS is doing lots of other cool stuff; this is just the very basics.) [223001840060] |Public health agencies like the U.S. Center for Disease Control are more interested in overall prevalence of conditions than in individual cases. [223001840070] |That is, how many people have the flu or are infected with HIV? [223001840080] |And where are they geographically? [223001840090] |(Epidemiologists, in general, also care about the reliability of diagnostic tests, such as the sensitivity and specificity of a particular blood test.) [223001840100] |The naive way to do this computation is to run everything through a first-best classifier and then count the results in each category. [223001840110] |You can smooth it a bit, and then look for significant divergences from expected norms (tricky given seasonality, etc., but doable). [223001840120] |A slightly more sophisticated way to do this would be to count expectations. [223001840130] |That is, run a probabilistic classifier to compute posterior inferences about the distribution of categories for a given input. [223001840140] |Then sum the probabilities, which is the expected number of cases according to the model. [223001840150] |This should work better. [223001840160] |Well, I just learned an even better way to do this last weekend. [223001840170] |Gary King (of the Social Science Statistics Blog) presented this paper co-authored with Daniel Hopkins: [223001840180] |
  • Hopkins, Daniel and Gary King. 2008. [223001840190] |A Method of automated nonparametric content analysis for social science. [223001840200] |They point out that social scientists are much more interested in population trends than the opinions of single members of the public. [223001840210] |So they propose to estimate the prevalence of positive sentiment toward political candidates. [223001840220] |But they’re worried about bias, and don’t care about individual level classification accuracy, which is what all the models train for in one way or another (i.e. SVMs, perceptrons, naive Bayes, logistic regression). [223001840230] |They also point out that the distribution of categories may be very different in the test set than the training set, because things like opinions may drift over time. [223001840240] |So maybe we can get away from assuming our test set is a random sample from the same pool as the training (in which case the problem’s solved, because the training distribution of categories will match the test distribution by definition!). [223001840250] |Dan and Gary pulled out the wayback machine (that’s what the image above is of) and found this (sorry for the closed source; I can’t get at anything beyond the abstract, either): [223001840260] |
  • Levy, P. S. and E. H. Kass. 1970. [223001840270] |A three population model for sequential screening for Bacteriuria. [223001840280] |American Journal of Epidemiology. [223001840290] |91:148–154. [223001840300] |It uses what’s in hindsight, a blindingly obvious adjustment for bias. [223001840310] |Suppose we take a classifier and measure sensitivity, [223001840320] |sens = TP / (TP + FN), [223001840330] |which is just its accuracy truly positive items (has the disease), and specificity, [223001840340] |spec = TN / (TN + FP), [223001840350] |which is its accuracy on truly negative items (don’t have the disease). [223001840360] |We can estimate sensitivity and specificity by cross-validation on our training set easily enough; the functionality’s built into LingPipe (combine a cross-validating corpus and a classifier evaluator). [223001840370] |Suppose C is a category taking on values 0 or 1; if C’ is an estimate of C for a new data point, we have: [223001840380] |Pr(C’=1) = sens * Pr(C=1) + (1 –spec) * Pr(C=0) [223001840390] |That is, there are two ways in which the system estimates the category C’ to be 1, by being right for an item which truly is of category 1, or being wrong for an item which truly is of category 0. [223001840400] |An instance is of category 1 with probability Pr(C=1) and is classified correctly with probability sensitivity and is of category 0 with probability Pr(C=0) = (1 –Pr(C=1)) and classified in error with probability (1 –specificity). [223001840410] |Now we just apply some algebra to solve for Pr(C=1): [223001840420] |Pr(C=1) = (Pr(C’=1) + spec –1) / (sens + spec –1) [223001840430] |Et voilà. [223001840440] |An unbiased estimate for Pr(C=1). [223001840450] |I love algebra. [223001840460] |Of course, this assumes that your sensitivity and specificity are the same on the test set as on the training set, but you have to assume something to get off the ground here. [223001860010] |Comparing Bernoulli and Multinomial (Reply) [223001860020] |So what do you do when a blog doesn’t allow comments? [223001860030] |Comment from your own blog. [223001860040] |The BioNLP mailing list had a link to: [223001860050] |
  • Emerging Computational Linguistics Blog [223001860060] |It’s apparently posted by a M.S. student at U. Washington, but there’s no more evidence, even through whois, about who it is. [223001860070] |And there are no comments allowed! [223001860080] |The perspective of a student learning computational linguistics is interesting. [223001860090] |I was particularly struck by the absurdity of this year-old post: [223001860100] |
  • Emerging CL. [223001860110] |3 February 2008. [223001860120] |Comparing Bernoulli and Multinomial [Classifiers] [223001860130] |It’s a classic, “let’s compare two approaches” post, here comparing the two simplest linear classifiers, Bernoulli (each feature is true/false) and Multinomial (each feature gets a count, aka naive Bayes). [223001860140] |They both use Bayes’s rule to compute the category probabilities: [223001860150] |Bayes’s Rule: p(cat|x) ∝ p(x|cat) p(cat) [223001860160] |Presumably we take the same space of features numbered 1 to D (the dimensionality). [223001860170] |The Bernoulli model uses boolean feature values, setting: [223001860180] |p(feats|cat) = Πd in 1:D feats(d) ? θ[d,cat] : (1 – θ[d,cat]) [223001860190] |where θ[d,cat] is the probability that feature d is true in the specified category and feats(d) is true or false, based on whether the feature appeared in the input. [223001860200] |The multinomial uses count values (0,1,2,…) for features, setting: [223001860210] |p(feats|cat) ∝ Πd in 1:D φ[d,cat]^feats(d) [223001860220] |where φ[d,cat] is the probability of the word d in the model and feats(d) is the number of times it appeared in the input being classified. [223001860230] |We don’t usually compute the normalizing term for first-best classifiers. [223001860240] |There are two problems with this report. [223001860250] |One, it shows the multinomial classifier being about 40 times faster! [223001860260] |How can that be given that the inside of the product is simpler for the Bernoulli model? [223001860270] |The answer is that they must have used a sub-optimal implementation. [223001860280] |I can guess what they did —implemented Bernoulli as the naive product over all D features and implemented the multinomial in the usual way ignoring the zero counts, which work out to φ^0 = 1 and thus cancel. [223001860290] |The Bernoulli should be much faster if it’s implemented the right way, which is to start with an array of scores computed assuming all features are false. [223001860300] |score[cat] = Πd in 1:D (1 – θ[d,cat]) [223001860310] |Then for each true feature d, [223001860320] |score[cat] *= θ[d,cat] / (1 – θ[d,cat]) [223001860330] |If we convert all that to logs, the Bernoulli will be faster. [223001860340] |And it’s a lot cheaper to compute boolean feature vectors than to use counts (you can use a set instead of a map, and in a good implementation, feature extraction dominates). [223001860350] |The second problem is that the experiments explored a set of prior parameter values where the best performance was at the edge of the values tested. [223001860360] |They tried priors of 0.1, 0.5, 1, and 2, with 0.1 performing the best. [223001860370] |We know even more diffuse priors are good for naive Bayes, so I suspect setting the prior concentration to less than 0.1 would’ve worked even better. [223001860380] |The real issue is that I see conference papers with the same two flaws in evaluation. [223001860390] |What typically happens to the professionals is that they use a baseline implementation of someone else’s algorithm with which they’re not familiar, and a well-tuned version of their own. [223001860400] |It’s notoriously tricky to do fair efficiency and scalability tests without a no-holds-barred policy and a bakeoff. [223001860410] |And even then, the skill of the programmers is likely to dominate the evaluation.
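For what it’s worth, here is a minimal sketch of the faster Bernoulli scoring described above, done in log space; the class and names are made up for illustration, and the feature probabilities are assumed to be smoothed away from 0 and 1.

    import java.util.Set;

    // precompute each category's score with every feature absent, then
    // adjust only for the features that are actually present;
    // theta[cat][d] is the estimated probability that feature d is
    // present given the category
    public class BernoulliScorer {

        private final double[][] logOdds;   // log(theta/(1-theta)) per category, feature
        private final double[] baseline;    // sum over d of log(1-theta) per category
        private final double[] logPrior;    // log p(cat)

        public BernoulliScorer(double[][] theta, double[] prior) {
            int numCats = theta.length;
            int numFeats = theta[0].length;
            logOdds = new double[numCats][numFeats];
            baseline = new double[numCats];
            logPrior = new double[numCats];
            for (int c = 0; c < numCats; ++c) {
                logPrior[c] = Math.log(prior[c]);
                for (int d = 0; d < numFeats; ++d) {
                    baseline[c] += Math.log(1.0 - theta[c][d]);
                    logOdds[c][d] = Math.log(theta[c][d]) - Math.log(1.0 - theta[c][d]);
                }
            }
        }

        // log p(cat) + log p(feats|cat); only the features present in the
        // input (a set of feature ids) are touched at classification time
        public double[] logScores(Set<Integer> presentFeatures) {
            double[] scores = new double[baseline.length];
            for (int c = 0; c < scores.length; ++c) {
                scores[c] = logPrior[c] + baseline[c];
                for (int d : presentFeatures)
                    scores[c] += logOdds[c][d];
            }
            return scores;
        }
    }

This makes the per-document work proportional to the number of features present, just as in the multinomial case.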