Expected Error Histograms from Illumina 36bp Reads

    Mitzi and I have been working on reproducing the results reported in the following (hugely influential) recent paper.

  • Marioni JC, Mason CE, Mane SM, Stephens M, and Gilad Y. 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res.

    The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.

    Error Profile of Illumina

    What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data. For each 36bp read, Illumina provides a probability estimate for each base. For instance, we might have the 17th position assign a probability of 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus assigning position 17 in the above example to A. But according to Illumina, there’s only a 97% chance that call is correct!

    Calculating Expected Errors per Read

    So how many errors do we expect in a 36bp read? If $p_n$ is the probability that the most likely base at position $n$ is correct, the expected number of errors in the first $N$ bases is $\sum_{n=1}^{N} (1 - p_n)$. With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data. But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Redoing the same plot with the last 4bp truncated dramatically reduces the expected number of errors.
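
    To make the computation concrete, here is a minimal Java sketch of the expected-error calculation just described; the array layout and names (probs, numBases, and so on) are my own illustration, not anything from Marioni et al.’s pipeline or Mitzi’s code.

        public class ExpectedErrors {

            // Expected number of errors in the first numBases bases of a read,
            // where probs[n] is the probability the first-best call at
            // position n is correct: sum of (1 - probs[n]).
            public static double expectedErrors(double[] probs, int numBases) {
                double sum = 0.0;
                for (int n = 0; n < numBases; ++n)
                    sum += 1.0 - probs[n];
                return sum;
            }

            // Histogram of expected error counts over all reads, one bin
            // per unit of expected error.
            public static int[] histogram(double[][] readProbs, int numBases, int numBins) {
                int[] counts = new int[numBins];
                for (double[] probs : readProbs) {
                    int bin = (int) expectedErrors(probs, numBases);
                    counts[Math.min(bin, numBins - 1)] += 1;
                }
                return counts;
            }
        }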

    Calculating Expected Number of Correct Reads

    Mitzi also plotted the probability that the first-best calls up to length $N$ are all correct for a read, which is $\prod_{n=1}^{N} p_n$. The probabilities are so low that it requires a log (base 10) scale when using all 36bp. The startling conclusion is that there’s almost no chance at all that the first-best base calls lead to the correct sequence. This has unsettling ramifications for procedures like SNP calling.
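
    On the log-10 scale used for the plot, the all-correct probability is just a sum of logs of the per-base probabilities. A method along these lines (which could sit alongside the sketch above) computes it:

        // Log (base 10) probability that the first-best calls for the
        // first numBases positions are all correct: sum of log10(probs[n]).
        public static double log10ProbAllCorrect(double[] probs, int numBases) {
            double log10Prob = 0.0;
            for (int n = 0; n < numBases; ++n)
                log10Prob += Math.log10(probs[n]);
            return log10Prob;
        }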

    The Done Thing

    What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s one alignment with one edit and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded. Why throw away non-unique reads? There’s good reason to believe that the unique alignments are not really unique. Not everyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model. Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit-counting systems like Bowtie. Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and by integrating out the actual sequence of edits). The answer’s simple, really: it’s easier to code and easier to explain.

    Can We Do Better?

    Like every newbie, I feel we can. Not surprisingly, I think we should use all the data we have, integrating a proper probabilistic channel model for alignment with a Bayesian mRNA expression model. Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model) and then we’ll run them through the scalable collapsed Gibbs sampler I wrote for the mRNA expression model. After that, we’ll work in the more refined channel edit model that can account for biases induced by the wet lab process and the sequencer, and the quality scores coming from the sequencer.

    Bioinformatics?

    No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.

LREC 2010 Tutorial: Modeling Data Annotation

    Massimo and I are giving our tutorial this afternoon in Malta at LREC 2010. Here’s a link to my slides:
  • Carpenter, Bob and Massimo Poesio. 2010. Models of Data Annotation. Part II of our tutorial at LREC.

    Thanks to Massimo for lots of great suggestions and feedback from when I gave the talk in Rovereto Italy last week (U. Trento), when we talked about it at length between then and now, and on the first few drafts of the slides. And here’s a link to Massimo’s slides:

  • Poesio, Massimo and Bob Carpenter (and Ron Artstein). 2010. Statistical Models of the Annotation Process. Part I of our tutorial at LREC.

    Massimo’s part was based on his and Ron’s CL Journal review of kappa-like stats, and the tutorials they’ve given in the past. As such, they’re much more mature. The examples are really tight and to the point.

Real and Perceived Non-Determinism in java.util.HashSet Iteration (and Print) Order

    We had a recent inquiry on the mailing list about LingPipe’s complete-link clustering implementation. For the same inputs, it was producing what looked like different results. Well, it actually was producing different results. Different results that were structurally equivalent, but not Java equal. It turns out that the culprit was our old friends Object.equals(Object), Object.hashCode() and collection implementations inside java.util.

    Perceived Non-Determinism Sample

    Here’s the simplest kind of program I could think of that illustrates what appears to be a problem with non-deterministic behavior (a sketch is given below). First note that the static nested class Foo extends java.lang.Object and hence inherits Object‘s implementation of equals(Object) and hashCode() [links are to the JDK 1.5 Javadoc]. As a result, two Foo objects holding the same integer are not equal unless they are the same object. The main() runs a loop adding 10 instances of Foo to a java.util.HashSet then printing them out. The result is different orderings. Of course, given our definition of Foo, the sets aren’t actually equal, because the members aren’t actually equal. The members just appear to be equal because they’re structurally equivalent in terms of their content. If you want instances of Java classes to behave as if they’re equal, you have to override the default implementations of equals(Object) and hashCode() accordingly.
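
    A minimal sketch of the kind of program described (the class name Foo comes from the description above; the rest of the details are my own guesses at its shape). Because Foo inherits Object’s identity-based hashCode(), the print order can come out differently on different runs:

        import java.util.HashSet;
        import java.util.Set;

        public class HashOrderDemo {

            // Inherits Object's equals() and hashCode(), so two Foo instances
            // holding the same int are not equal unless they're the same
            // object, and their hash codes are identity-based.
            static class Foo {
                final int mValue;
                Foo(int value) { mValue = value; }
                @Override
                public String toString() { return "Foo(" + mValue + ")"; }
            }

            public static void main(String[] args) {
                Set<Foo> set = new HashSet<Foo>();
                for (int i = 0; i < 10; ++i)
                    set.add(new Foo(i));
                System.out.println(set);  // order may differ from run to run
            }
        }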

    But what if the Objects are Equal?

    So what if we look at a case where the objects and hence sets really are equal? Alas, we can still get what looks to be non-equal output (a sketch of such a program is given below). Here the print order is different because the iteration order is different in the different hash sets. With different sizes, the underlying array of buckets is a different size, hence after the modular arithmetic, elements wind up ordered differently. But the sets produced here are actually equal in the Java sense that set1.equals(set2) would return true. They just don’t look equal because the print and iteration order is different.
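
    Here is a sketch of the second kind of program; the particular elements and initial capacities are my own choices, and whether and how the print orders differ depends on the JDK’s internal hash spreading, but with elements spread over a range wider than the smaller bucket array, they generally do:

        import java.util.HashSet;
        import java.util.Set;

        public class EqualSetsDemo {

            public static void main(String[] args) {
                Set<Integer> set1 = new HashSet<Integer>(16);   // small bucket array
                Set<Integer> set2 = new HashSet<Integer>(1024); // large bucket array
                for (int i = 0; i < 10; ++i) {
                    set1.add(i * 17);
                    set2.add(i * 17);
                }
                System.out.println("set1=" + set1);
                System.out.println("set2=" + set2);
                System.out.println("set1.equals(set2)=" + set1.equals(set2)); // true
            }
        }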

    Bug or Feature?

    Let’s just call it a quirk. The doc for hash sets and other collections is pretty explicit. That doesn’t mean it’s not confusing in many applications.

Hierarchical Agglomerative Clustering is Non-Deterministic

    Some of you may have followed the recent back-and-forth on the LingPipe newsgroup concerning the non-deterministic behavior of LingPipe’s implementation of complete-link and single-link agglomerative clustering. It’s not our fault. Really. The basic algorithm is non-deterministic.

    The Usual Algorithm

    As it’s usually presented, we initialize with every element in its own singleton cluster. Then at each step, we choose two clusters to merge based on the distance between them. For single-link clustering, distance is the distance between the two closest points (i.e. min of pairwise distances); for complete-link, the distance between the two furthest points (i.e. max of pairwise distances). Usually we choose the closest pair of clusters to merge. But what if there’s a tie?

    Some Examples

    Neither the Wikipedia nor the new(ish) Stanford IR book discusses what to do in the case of ties, leaving the algorithm non-deterministic. Here are links:

  • Wikipedia: Cluster Analysis
  • Stanford IR Book: Hierarchical Agglomerative Clustering
  • Stanford IR Book: Single-Link and Complete-Link Clustering

    (As of today), the Wikipedia says [my emphasis], “This method builds the hierarchy from the individual elements by progressively merging clusters. …The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.” Similarly, the Stanford IR book’s section on hierarchical agglomerative clustering says [my emphasis], “In each iteration, the two most similar clusters are merged.” Neither source says what to do when there are ties.

    Example

    Suppose we have four points on a line, {1, 3, 4, 6}, with the usual Euclidean distance function (e.g. d(1,3)=2, d(3,4)=1, d(6,1)=5, d(4,4)=0, …). And let’s suppose we go with complete-link clustering. The first step merges {3,4}, and then there are two alternative possibilities at the second step, because d(1,{3,4}) = 3.0 and d({3,4},6) = 3.0. Without the scores, the dendrograms are {{1,{3,4}},6} vs. {1,{{3,4},6}}. Very different structures. Single-link clustering suffers from the same non-determinism. Specifically, for the example above, the same first step applies, and then there is the same ambiguity in the second step, where both distances are 2.0 (rather than the 3.0 in the complete-link algorithm).
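
    A tiny sketch of the arithmetic behind the tie (my own code, just spelling out the complete-link distances for this example):

        public class CompleteLinkTie {

            // Complete-link distance: maximum pairwise distance between clusters.
            static double completeLink(double[] c1, double[] c2) {
                double max = 0.0;
                for (double x : c1)
                    for (double y : c2)
                        max = Math.max(max, Math.abs(x - y));
                return max;
            }

            public static void main(String[] args) {
                double[] merged = { 3, 4 };  // first step merges {3,4} at distance 1
                System.out.println("d(1,{3,4}) = " + completeLink(new double[] { 1 }, merged));
                System.out.println("d({3,4},6) = " + completeLink(merged, new double[] { 6 }));
                // both print 3.0, so the second merge is a tie
            }
        }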

    Breaking the Symmetry

    Usually an implementation will break the tie in some programmatic way, such as choosing the first one to appear in some data structure. The problem for LingPipe, which uses hash-based Java collections over objects with reference-based equality, is that the ordering’s not fixed. (This is well documented, though a bit subtle.) Specifically, different runs can produce different results. See the last blog post for more details on the non-determinism of Java hash-based collections.

    Cophenetic Distance

    I suspect it’s not possible to normalize at the dendrogram level for either complete-link or single-link clustering. On one interpretation of clustering, dendrograms allow us to infer cophenetic distances, which is the distance at which clusters are merged. Unfortunately, the cophenetic distances derived from the two clusterings resulting from the non-deterministic complete-link example above are different, with either 1 or 6 being closer to {3,4} depending on which is merged first. I’m guessing the same thing will happen with single-link, but haven’t worked out the details (in this example, the cophenetic distances are the same). Thus I believe the real problem’s in the model implicitly assumed in the clustering strategies. Furthermore, both of these algorithms are notoriously non-robust to small differences in distances or the set of objects being clustered.

    Does it Matter?

    Are there applications where this will matter? I suppose if you really believe in clustering results and try to analyze them via cophenetic correlation or other measures, it will.

Should We Migrate to Java 1.6?

    The impending release of LingPipe 4 brings up the question of Java version. Specifically: should we start releasing LingPipe compiled with J2SE 1.6 rather than 1.5? And what about the models?

    Compiling for 1.5 Still Possible

    LingPipe would still be compilable with the 1.5 JDK, so users could build their own 1.5-compatible LingPipe jar. But the official release would be compiled with Java 1.6 and that’s what we’d target for support. There are no immediate plans to add any dependencies on features introduced in Java 1.6.

    No More Public Java 1.5 Support

    The reason we’re considering migrating is that the 1.5 version of the JDK (officially Java SE 5.0) reached its end of service life (EOSL) in November 2009. In practice, this means no more public bug fixes for 1.5. Oracle will support 1.5 for those willing to pay for it through “Java for Business”.

LingPipe 4.0.0 Released

    After a week of relentless deprecation and pushing every last change through the unit tests and tutorials, we’re simultaneously rolling out LingPipe 4.0.0 and LingPipe 3.9.3. As usual, the latest is available from the
  • LingPipe Home Page.

    There’s a link from there to the latest 3.9 release, 3.9.3. 3.9.3 is a new release, and is only very minimally different than 3.9.2. I expected more updates for deprecation, but the whole process went more smoothly than anticipated.

    Migrating from LingPipe 3.9 to LingPipe 4.0

    Models are Backward Compatible

    The good news is that all of the models compiled in 3.9 will run as is in 4.0 (using the 4.0 methods, of course). You can read about the mild wizardry required to do this in a previous blog entry, Upgrading Java classes with backward-compatible serialization.

    Remove Deprecation Warnings from Code

    The migration process for existing code importing LingPipe classes is simple. Get your code to where it compiles in 3.9 with no deprecation warnings and you’re good to go for 4.0. All of the deprecated methods and classes in 3.9 indicate what to replace them with in their javadoc.

    3.9 Branched and Supported

    We’re expecting to have to support 3.9 for a while longer, because 4.0 isn’t backward compatible without code changes. We’re anticipating continuing to patch bugs for 3.9 and release subsequent versions on the 3.9 branch. At least until all of our support-paying customers upgrade all of their code to 4.0.

    The Time to Make it Shorter

    Other than updating all the jars, there are no new features in 4.0. In fact, it’s smaller than 3.9. As Pascal said (originally in French, referring to a letter), “I have only made this longer, because I have not had the time to make it shorter.” Well, we finally found the time to make LingPipe shorter.

Piecewise-Linear or Isotonic Regression for Calibrating Naive Bayes (or Other Predictors)

    I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:
  • Zadrozny, Bianca and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayes classifiers. In ICML.

    I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.

    Failed Independence Assumptions in Naive Bayes

    The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they’re not. This is, in fact, the naivete from which the name is derived, though it’s a misnomer, because the model’s right for some situations. For instance, it’s common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It’s what you’ll get from LingPipe’s NaiveBayesClassifier and TradNaiveBayesClassifier classes. In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because of redundant tokens with high correlations being treated as independent. This is especially easy to see for repeated words. Suppose we’re classifying sports documents and it’s 90% likely to be an article about basketball if it mentions “lebron” (actually it’s probably higher, due to the celebrity of LeBron James). So what if we see “lebron” twice? In the model, the probability of being about basketball goes up to roughly $0.9^2 / (0.9^2 + 0.1^2) \approx 0.99$. The problem is that two “lebron”s don’t tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the Salsa band The Lebrón Brothers is also likely to mention “lebron” more than once. Once you’ve seen one, the effect of the next one should go down. (The fact that all three uses of “lebron” are spelled differently is an orthogonal issue I shouldn’t have confused here; the same holds for a fully capitalized name like “Smith”.)
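
    To see how fast the posterior saturates, here is a tiny sketch of the arithmetic, assuming two categories and a uniform prior so that a single mention yields the 0.9 posterior (the setup is my own stylization of the example above):

        public class RepeatedTokenPosterior {

            public static void main(String[] args) {
                double p = 0.9;  // posterior for basketball after one "lebron"
                for (int mentions = 1; mentions <= 5; ++mentions) {
                    // treating each mention as independent multiplies the odds
                    double odds = Math.pow(p / (1.0 - p), mentions);
                    System.out.println(mentions + " mention(s): " + odds / (1.0 + odds));
                }
            }
        }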

    Measuring Term Correlation

    One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church’s analysis of correlation in a paper titled One term or two? (but only available pay-per-view) and Martin Jansche’s careful follow up. It’s traditional to hack calibrations by converting every document to the same virtual document length $\ell$, scaling each document’s token log likelihoods by $\ell / n$ for a document of $n$ tokens. This is what we do in our traditional naive Bayes class in LingPipe (it doesn’t change first-best results, but helps enormously for EM and presumably for n-best results ordering).

    Binned Calibration

    What Zadrozny and Elkan did for naive Bayes is to train a classifier and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category. They then split these into ten bins of equal size. For each bin, they calculate the proportion of true answers in the bin. New items are assigned probabilities based on which bin their estimate for the probability of the best category falls into. They talk about using held-out data, but don’t. I’m not sure why; naive Bayes is pretty easy to overfit. Ideally they’d have just done cross-validation on the whole data set to get predictions, or used a representative held-out data set of appropriate size.
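
    Here is a minimal sketch of the binning procedure as I read it (this is my own illustration, not Zadrozny and Elkan’s code); scores holds the predicted probability of the best category for each training item, and correct whether that category was actually right:

        import java.util.Arrays;

        public class BinnedCalibration {

            // Sort items by the predicted probability of their best category,
            // split into equal-sized bins, and record each bin's upper score
            // bound and empirical accuracy.
            public static double[][] fit(double[] scores, boolean[] correct, int numBins) {
                int n = scores.length;
                double[][] pairs = new double[n][2];
                for (int i = 0; i < n; ++i) {
                    pairs[i][0] = scores[i];
                    pairs[i][1] = correct[i] ? 1.0 : 0.0;
                }
                Arrays.sort(pairs, (a, b) -> Double.compare(a[0], b[0]));
                double[][] bins = new double[numBins][2];
                int binSize = n / numBins;
                for (int b = 0; b < numBins; ++b) {
                    int start = b * binSize;
                    int end = (b == numBins - 1) ? n : start + binSize;
                    double sumCorrect = 0.0;
                    for (int i = start; i < end; ++i)
                        sumCorrect += pairs[i][1];
                    bins[b][0] = pairs[end - 1][0];          // upper score bound for bin b
                    bins[b][1] = sumCorrect / (end - start); // empirical accuracy in bin b
                }
                return bins;
            }

            // Calibrated probability for a new item: the accuracy of the
            // first bin whose upper bound covers the item's raw score.
            public static double calibrate(double[][] bins, double score) {
                for (double[] bin : bins)
                    if (score <= bin[0])
                        return bin[1];
                return bins[bins.length - 1][1];
            }
        }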

    Isotonic Regression

    The Microsoft paper referred to Zadrozny and Elkan’s approach as isotonic regression, but on my reading, Zadrozny and Elkan didn’t enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn’t need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or similarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don’t contain much data or you’re fitting a more complex model.

    Binning versus Piecewise Linear Normalization

    Binning may be useful for absolute probability prediction, but it’s not very useful for sorting results because the binned estimate is not a properly isotonic function of the raw estimate (that is, we can have raw estimates $p_1 < p_2$ whose binned estimates satisfy $\hat{p}_1 > \hat{p}_2$). Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).
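
    A sketch of the piecewise-linear alternative (my own code): interpolate between representative points for the bins, assuming the bin accuracies have already been adjusted to be non-decreasing.

        public class PiecewiseLinearCalibration {

            // binScores[b] is a representative raw score for bin b (e.g. its
            // mean score) and binAccuracies[b] its empirical accuracy; both
            // arrays are assumed sorted and the accuracies non-decreasing.
            public static double calibrate(double[] binScores, double[] binAccuracies, double score) {
                int n = binScores.length;
                if (score <= binScores[0]) return binAccuracies[0];
                if (score >= binScores[n - 1]) return binAccuracies[n - 1];
                for (int b = 1; b < n; ++b) {
                    if (score <= binScores[b]) {
                        double t = (score - binScores[b - 1]) / (binScores[b] - binScores[b - 1]);
                        return binAccuracies[b - 1] + t * (binAccuracies[b] - binAccuracies[b - 1]);
                    }
                }
                return binAccuracies[n - 1];  // unreachable
            }
        }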

    Length, Anyone?

    In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.

    Category Effects?

    I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.

    Comparative Effects

    It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

    It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way. The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

  • Li, Bo, Victor Ruotti, Ron M. Stewart, James A. Thomson and Colin N. Dewey. 2010. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493–500.

    It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

    The Basic Expression Model

    In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable used for read expression. The model in this paper assumes a fixed number of reads, implicitly assumes a uniform prior on the expression parameter, and then chooses a splice variant for each read. So far, identical (other than that I considered arbitrary Dirichlet priors). [Update, June 10, 2010: I forgot to mention:

    Noise Isoform

    Li et al. include a so-called "noise isoform", to which they assign a simple probability distribution. This will act as a sink for reads that do not map elsewhere. I don't quite see, given how it's defined, how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

    Position and Strand-Based Reads

    Things look different when Li et al. start to generate the reads themselves. What I did was assume an arbitrary read distribution, with parameters characterizing the likelihood of observing a given read from a given reference source. Li et al. decompose part of this distribution in their model. First, they assume a start position for the read on the reference sequence and a strand, both of which are generated from the reference sequence by their own distributions. This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, given the position, strand and reference sequence, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to the underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

    Non-Probabilistic Alignment

    [Updated, June 10, 2010: I don't know what I was thinking -- Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix for substitutions (no indels). I'm not sure how the matrices are estimated.] What they’re not using is the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner). I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is). The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

    EM MLE vs. Gibbs Sampling Bayesian Posterior

    The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior of expression given the reads, the channel model, and a Dirichlet prior. Li et al. use EM to find the maximum likelihood estimate of expression. As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE corresponds to the MAP estimate under a uniform Dirichlet prior (all concentration parameters equal to 1).

    Bootstrap Standard Error Estimates

    One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates. Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates. The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts. [Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It's really pretty much like Gibbs sampling in terms of the application, if not philosophy.]

    Simulations

    They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment. [Update: June 10, 2010: BOWTIE has a --all setting that the documentation says will generate all matching reads, but there's also a maximum number of backtracks parameter that can eliminate some matches if there are 2 or more edits allowed. Even if BOWTIE can be configured to find all the matches up to edit distance 2, there's no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater. My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.] What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facility at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to those estimated from parallel microarray data using Li and Wong’s D-Chip factor model for microarray expression.

    20 to 25 Bases!

    The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for calculating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Contextual Effects and Read Quality in a Probabilistic Aligner

    In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.

    Context Plus Read Quality

    Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina. Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb files (with the simpler fastq format, where you only get the probability of the best base, we can either treat the others as uniformly likely or use knowledge of the confusion error profile of the sequencer). We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep. The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.

    Modeling Alignments

    What we’re interested in for our work on RNA-Seq based mRNA expression is modeling the probability of a read given a reference sequence. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment at a (latent) start position, and then marginalizing over alignments and start positions. The best alignment will be computable via a standard sequence max algorithm, and the marginalization through the standard product-sum algorithm. Both run in time proportional to the read length times the reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.
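
    In symbols, writing $y$ for the read, $x$ for the reference sequence, $a$ for the alignment and $i$ for the start position (these symbol names are my own choices), the marginalization is

        $p(y \mid x) = \sum_{a} \sum_{i} p(y, a, i \mid x)$.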

    The Model

    We assume there is a score for each start position in the reference sequence. For instance, this may be a log probability of starting at a given position given the lab fractionation and priming/amplification protocol. We will also assume there is a quality score on a (natural) log probability scale for the probability of each base at each position in the read. For instance, these may be the Phred quality scores reported in Illumina’s prb files converted to (natural) log probability of match. Our definition only normalizes for reads of a fixed length. We will recursively define a score on an additive scale and then exponentiate to normalize. The alignment score function is defined recursively over the alignment sequence, the read and the current read position, the reference sequence and the current reference position, and the previous alignment operation (a sketch of the kind of recursion meant is given below, after the edit operations are described). Reminds me of my Prolog programming days.

    The Edit Operations

    An alignment is a sequence of edit operations; the base case is the empty edit sequence. A substitution operation matches an observed base from the read against a base of the reference sequence; if the two bases differ, it gets the score for that particular mismatch, and if they’re the same, it gets a match score. An insertion operation accounts for an observed base in the read with no counterpart in the reference sequence, and a deletion operation for a base of the reference sequence with no counterpart in the read. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced. Note that the base case requires the entire read to be consumed.
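
    To make the shape of the recursion concrete, here is my own sketch of the kind of definition meant (not the original equations), writing $m$ and $n$ for the current read and reference positions, $q(y,m,b)$ for the log quality score of base $b$ at read position $m$, and $s_{\mathrm{start}}$, $s_{\mathrm{subst}}$, $s_{\mathrm{ins}}$, $s_{\mathrm{del}}$ for the start and edit scores:

        $p(y, a, i \mid x) \propto \exp\big( s_{\mathrm{start}}(i, x) + f(a, y, 1, x, i, \mathrm{start}) \big)$,

    where $f$ is defined recursively by

        $f(\epsilon, y, m, x, n, \mathrm{op}) = 0$ if $m = |y| + 1$ (the whole read consumed),

        $f(\mathrm{subst}(b) \cdot a, y, m, x, n, \mathrm{op}) = q(y, m, b) + s_{\mathrm{subst}}(b, y, m, x, n, \mathrm{op}) + f(a, y, m+1, x, n+1, \mathrm{subst}(b))$,

        $f(\mathrm{ins}(b) \cdot a, y, m, x, n, \mathrm{op}) = q(y, m, b) + s_{\mathrm{ins}}(b, y, m, x, n, \mathrm{op}) + f(a, y, m+1, x, n, \mathrm{ins}(b))$, and

        $f(\mathrm{del} \cdot a, y, m, x, n, \mathrm{op}) = s_{\mathrm{del}}(y, m, x, n, \mathrm{op}) + f(a, y, m, x, n+1, \mathrm{del})$.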

    Scoring Functions

    Note that the substitution, insertion, and deletion score functions all have access to the read, the reference sequence, the present positions in the read and reference sequence, as well as the previous edit operation.

    Enough Context

    This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place. Further note that because the start position score is a function of the entire reference sequence and not just the position, it is possible to take into account not only the position but also the bases at surrounding positions.

    Dynamic Programming Same as Before

    The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalizing constant, because the score has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read. In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.) In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.

    Inferring the Actual Read Sequence for SNP, Etc. Calling

    Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can compute the probability that the underlying sequence takes a particular value, given that the read was generated from the reference sequence, by marginalizing over alignments and start positions, summing the probability of every alignment that produces that sequence (remember the read’s non-deterministic here). The sequence produced by an alignment is defined recursively: the empty alignment produces the empty string, each substitution or insertion contributes its observed base, and deletions contribute nothing, with the results concatenated. I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).

    Reference Sequences with Uncertainty

    We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference-side non-determinism as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.

The Unbearable Heaviness of Java Strings

    Instantaneous Update: No sooner had I posted this than I received a really helpful clarifying comment from Ismael Juma with a link to exactly the references I was looking for. It turns out the 64-bit JVMs do compress references in some cases. And people ask me why I waste my time blogging!

    Update: 23 June 2010. Super cool. As of Java SE 6 build 14, all you need to do is provide the java command the argument -XX:+UseCompressedOops and all references will consume 32 bits (“OOP” is for “ordinary object pointer”). That does limit you to 4 billion objects on the heap (2^32), which if they’re just plain objects will consume 32GB. You’d need at least 64GB of RAM to make it worth turning the option off, because 4 billion objects consume 64GB without OOP compression. Does anyone know how fast the JVM is with compression? I’ll have to run some tests.

    Let’s start with the punchline. In Sun/Oracle’s 64-bit JVM, a string consumes 60 bytes plus the size of its UTF-16 encoding, which is at least 2 times the number of Unicode characters. In a 32-bit JVM, if anyone’s still using one, there’s a 36-byte fixed cost plus the cost of the UTF-16 encoding.

    My number one peeve with Java is that there is no equivalent of C’s struct. I can do without pointer arithmetic, but objects in Java are just too heavy for serious lifting (somehow that metaphor worked out backward). It’s the reason there are parallel arrays all over LingPipe: arrays of structured objects just take up too much space and cause too much garbage collection.

    UTF-16?

    Yes, Java made the questionable choice of using UTF-16 to encode strings internally. Each char is a UTF-16 byte pair. Any Unicode character with a code point at or below U+FFFF uses a single UTF-16 byte pair to encode, and anything higher requires two UTF-16 byte pairs. (In the designer’s defense, there weren’t any Unicode characters with code points above U+FFFF when Java was released, and 64-bit architectures in laptops seemed a long way off.) The bummer is that for all of us with piles of ASCII or near-ASCII (e.g. Latin1 or Windows-1252) text, UTF-16 takes nearly twice as much space as UTF-8.

    Java Object Memory Layout

    The layout of objects in memory is not part of the Java language specification, so implementations may do this differently. We’ll only consider Sun/Oracle’s 32-bit and 64-bit JVMs. After a lot of guesswork and empirical profiling, I finally found some references that look solid on the topic of how Java objects are laid out in memory:

  • Code Instructions: Java Objects Memory Structure
  • Streamy Development: HOWTO: Determine the size of a Java Object or Class
  • Just Another Geek: Memory Structure of Java Object

    What’s worrisome is that they look awfully similar and there are no external references. How did they figure this out? Looking at the byte code? By looking at the implementation of Java itself? By reverse engineering?

    Principles of Object Layout

    Assuming they’re correct, and the argument looks convincing, the basic idea of sizing a Java object (not including the size of any object it references) is based on a few principles motivated by byte alignment. The idea is that it’s more convenient for the underlying architecture if items start at positions that are even multiples of their size in bytes. So a double or long would want to start at position 0, 8, 16, …, an int or float at positions 0, 4, 8, 12, …, a short or char at positions 0, 2, 4, 6, …, and bytes at any position at all. References (aka pointers) will be 8 bytes on 64-bit architectures and 4 bytes on 32-bit architectures.

    Basic Object Layout

    An instance of Object requires two references, which consume 8 bytes (32 bit) or 16 bytes (64 bit). One reference is to the class of the object, and one is the ID of the object. Objects get IDs because they may be relocated in memory yet need a unique identifier to resolve reference equality (==). The ID is also used for synchronization and for the hash code method implemented in Object. In order to support byte alignment, the member variables are laid out from largest to smallest, that is doubles and longs, then ints and floats, then shorts and chars, then bytes. Objects are reference aligned, because they start with references, so object sizes are rounded up to the nearest even multiple of 4 bytes (32 bit) or 8 bytes (64 bit).

    Subclass Layout

    Subclasses add their own member length to their superclass’s. Subclasses point to their own class definition with one of their references and lay out their member values below the superclass’s. There are some subtleties here surrounding ordering member variables in the subclass layout to reuse space that would’ve been padding in the superclass.

    Array Layout

    Arrays take up 12 bytes (32 bit) or 20 bytes (64 bit) plus their member size, plus an additional 4 bytes if the array holds longs, doubles or object references. Arrays behave like objects in Java, and contain the same first two references (to the array class and to the ID). They additionally contain an integer length (a choice that, sadly, limits the size of Java arrays). Then they contain the equivalent of their length in member variables. These start after the length, which means 4 more bytes of padding for double or long or object reference arrays.

    Strings

    The Sun JDK implements strings as holding an array of char values, an integer start and an integer length defining a slice of the array, and an integer hash code. That’s 5 refs, 4 ints, and a number of char values equal to the length (in UTF-16 byte pairs) of the string, plus padding. So with 8-byte refs in the 64-bit JVM, that’s 60 bytes, and with 4-byte refs in the 32-bit JVM, it’s 36 bytes. Plus the length of the string times 2.
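
    A back-of-the-envelope sizing helper based on those numbers (the constants are just the fixed costs quoted above; final padding to an 8-byte boundary is ignored):

        public class StringMemory {

            // Approximate heap cost of a string: fixed overhead plus 2 bytes per char.
            public static long bytes64(String s) { return 60 + 2L * s.length(); }
            public static long bytes32(String s) { return 36 + 2L * s.length(); }

            public static void main(String[] args) {
                String s = "hello world";
                System.out.println("64-bit JVM: ~" + bytes64(s) + " bytes");
                System.out.println("32-bit JVM: ~" + bytes32(s) + " bytes");
            }
        }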

    Ouch!

    That’s why efficient coding in Java looks like C. It’s just too expensive to allocate objects in tight loops, no matter what you may have heard to the contrary about the compiler being smarter than you and generational garbage collection curing all ills.

Training Examples: A Stack Overflow for NLP and ML and …

    Joseph Turian’s just created a site

  • Training Examples Q&A

    which describes itself as a place "where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!" It’s modeled pretty directly on the hugely popular Stack Overflow, which bills itself as a "collaboratively edited question and answer site for programmers – regardless of platform or language." In case you’re too lazy to look yourself, questions at the top of Training Examples right now are
  • how to discovery emerging topics in customer contacts?
  • How do I test a method (murmurhash) for generating random numbers?
  • Finding related users based on user data
  • Algorithm to discover features from raw data?
  • When is low-dimensional dense representations better than high-dimensional and sparse? And vice-versa?
  • What are the state of the art optimization algorithms for problems with numerous local minima?

    Par for the course, it’s a mix of wildly general (non-convex optimization) and reasonably specific (testing a random number generator) questions. Joseph cites the following motivators for using Training Examples:
  • Communicate with experts outside of your lab.
  • Answer a question once publicly, instead of potentially many times over email.
  • Share knowledge you have that might not be reflected in your publication record.
  • Find new collaborators.

    Those are also good reasons for starting a blog or commenting on one! The site requires registration for posting or tracking questions, but not for reading questions or answers. I have gazillions of passwords and a regular process for managing them, but it’s still enough of a pain that I don’t just register at every site I visit. Of course, with something like this, you need registration, or it’ll be nothing but spam in ten minutes flat. It’s a constant battle on the blog, but I don’t want to moderate or require registration.

Unicode Shell for Java: Windows PowerShell ISE

    I finally figured out how to have my Java programs write Unicode to a Windows shell and show up looking the right way. Here’s the final result (click to enlarge) and here’s how I did it.

    1. Fire Up PowerShell ISE

    My new Windows 7 Professional notebook comes with an application called PowerShell ISE (make sure to run the ISE version; the unmarked one is more like DOS and has the same problems noted below). The “ISE” is for “integrated scripting environment”. It defaults to Consolas font, which looks a lot like Lucida console.

    2. Set Character Encoding to UTF-8

    This’ll set the DOS character encoding to be UTF-8.

    3. Set Java System.out to UTF-8

    Before writing to System.out, Java’s constant for the standard output, you’ll need to set it up to use UTF-8. It’s a one-liner, shown below; the true value enables auto-flushing, which is a good idea for standard output.
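
    The standard way to do this in Java, which is presumably the one-liner meant, is:

        // requires java.io.PrintStream; the constructor throws UnsupportedEncodingException
        System.setOut(new PrintStream(System.out, true, "UTF-8"));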

    Demo Program

    The test program is simple (the backslash escapes are Java literals for Unicode code points); a sketch along the same lines is given below. You can see the output in action in the screen dumps at the top of this post.
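
    The code points here are my own picks, not necessarily the ones in the original screen shots:

        import java.io.PrintStream;

        public class UnicodeOut {
            public static void main(String[] args) throws Exception {
                System.setOut(new PrintStream(System.out, true, "UTF-8"));
                System.out.println("Latin:  caf\u00e9");                // é
                System.out.println("Han:    \u4e2d\u6587");             // 中文
                System.out.println("Tamil:  \u0b85\u0b95\u0bae\u0bcd"); // அகம்
            }
        }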

    Problems with DOS

    The DOS shell will let you set the character encoding and change the font to Lucida console. But it oddly duplicates characters at the end of lines if the lines contain non-ASCII code points. You can see this on the example. And it can’t handle the Tamil or Han characters.

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

    [Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

    I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

  • Carpenter, Bob. 2010. Integrating out multinomial parameters in latent Dirichlet allocation and naive Bayes for collapsed Gibbs sampling. LingPipe Technical Report.

    Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation. I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

    On to Bayesian Naive Bayes

    My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup. It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item. This is so cool. I actually learned something from a derivation. In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

    Thanks to Wikipedia

    I couldn’t have done the derivation for LDA (or NB) without the help of

  • Wikipedia: Latent Dirichlet Allocation

    It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

    Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

    But Wikipedia’s Derivation is Wrong!

    At least I’m pretty sure it is as of 5 PM EST, 13 July 2010. Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of functions left. The only hint at what’s going on is in the text above, which says it drops terms that don’t depend on the currently considered topic assignment (the assignment for the current word of the current document). The Wikipedia’s calculation then proceeds to drop a term without any justification, even though that term clearly depends on the current assignment. The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of function applications are dropped that even more clearly depend on the current assignment. It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system, where loop variables start from 1, so you don’t need to keep (as large) a symbol table in your head. I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

    LaTeX Rocks

    I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

LaTeX Problem’s Not Our Fault

    Thanks to those who’ve sent me e-mail saying that the LaTeX is broken on the site. It’s not our fault. Really. It’s a problem with WordPress’s server. Are lots of people having problems seeing LaTeX on our site?

    What’s Going On

    [223003020080] |WordPress uses a web service to decode the LaTeX expressions to PNGs. [223003020090] |When I write something like [223003020100] |in a blog post, it gets rendered as . [223003020110] |What’s really going on is that WordPress is preprocessing my doc and replacing the LaTeX escapes with links to a web service that returns PNG images. [223003020120] |For instance, the example above translates to the following link: [223003020130] |If you click on the link, you should see just the formula. [223003020140] |(Note: Given that they run the preprocessor, WordPress might change the path and the link might go stale. [223003020150] |Or their server might be down. [223003020160] |If that happens, right click on the image in your browser and select “View Image Info” or whatever the variant for doing that on your browser is.) [223003020170] |

    What Happens When it Fails

    [223003020180] |When WordPress’s web server goes down, it fills in something like “latex path not specified”. [223003020190] |Not sure whether that’s postprocessing or just the result from the server not being configured properly. [223003020200] |Brendan O’Connor was kind enough to send me a screen shot: [223003030010] |The Latin1 Transcoding Trick (for Java Servlets, etc.) [223003030020] |This post uses Java servlets as an example, but applies more broadly to a situation where a program automatically converts streams of bytes to streams of Unicode characters. [223003030030] |

    The Trick

    [223003030040] |Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want. [223003030050] |Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. [223003030060] |These char values may be translated back into bytes with a simple cast, as in the sketch after this paragraph. [223003030070] |After the cast, you have the original bytes back again in the array bs. [223003030080] |In Java, char values act as unsigned 16-bit values, whereas byte values are signed 8-bit values. [223003030090] |The casting preserves values through the magic of integer arithmetic overflow and twos-complement notation. [223003030100] |(I attach a program that’ll verify this works at the end.) [223003030110] |You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (which I highly recommend —it tracks the Unicode standard very closely, performing character encoding detection, fully general and configurable Unicode normalization, and even transliteration). [223003030120] |
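    Here is a minimal sketch of the cast being described; the method and input names are mine, and only the array name bs comes from the text:

        // Recover the raw bytes from a string that was decoded as Latin1 (ISO-8859-1).
        // Each Latin1-decoded char is a code point in the range 0-255, so casting it
        // to byte recovers the original byte via twos-complement wraparound.
        static byte[] latin1ToBytes(String latin1DecodedString) {
            char[] cs = latin1DecodedString.toCharArray();
            byte[] bs = new byte[cs.length];
            for (int i = 0; i < cs.length; ++i)
                bs[i] = (byte) cs[i];
            return bs;
        }

    For strings whose characters are all below 256, the loop is equivalent to calling latin1DecodedString.getBytes("ISO-8859-1"), which is the form that shows up in the servlet example below.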

    Use in Servlets for Forms

    [223003030130] |I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming, (2nd Edition, O’Reilly). [223003030140] |Hunter uses the trick for decoding form data. [223003030150] |The problem is that there’s no way in HTML to declare what character encoding is used on a form. [223003030160] |What Hunter does is add a hidden field for the value of the char encoding followed by the Latin1 transcoding trick to recover the actual string. [223003030170] |An illustrative code sketch along the lines of Hunter’s appears after this paragraph. [223003030180] |Of course, this assumes that getParameter() will use Latin1 to do the decoding so that getBytes("ISO-8859-1") returns the original bytes. [223003030190] |According to Hunter, this is typically what happens because browsers insist on submitting forms using ISO-8859-1, no matter what the user has chosen as an encoding. [223003030200] |We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story). [223003030210] |
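    Hunter’s exact listing doesn’t reproduce here, but the pattern is roughly the following sketch; the hidden field name "charset" and the helper method are illustrative, not copied from the book:

        import java.io.UnsupportedEncodingException;
        import javax.servlet.http.HttpServletRequest;

        public class FormDecoder {
            // Recover a correctly decoded parameter value, assuming the servlet
            // container decoded the raw form bytes as Latin1 and the form carried
            // its real character encoding in a hidden "charset" field.
            static String decodeParameter(HttpServletRequest request, String name)
                    throws UnsupportedEncodingException {
                String charset = request.getParameter("charset"); // hidden form field
                String latin1 = request.getParameter(name);       // decoded as Latin1 by the container
                return new String(latin1.getBytes("ISO-8859-1"), charset);
            }
        }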

    Testing Transcoding

    [223003030220] |which prints out [223003050010] |LingPipe Book Draft Available [223003050020] |We’ve started work on a proper LingPipe book. [223003050030] |You can download the current partial draft of the book and sample code from: [223003050040] |
  • LingPipe Book Home Page [223003050050] |In case you were wondering why the blog’s been quieter these days, this is it! [223003050060] |

    Our Goals

    [223003050070] |Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. [223003050080] |Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing. [223003050090] |We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials. [223003050100] |

    Progress so Far

    [223003050110] |So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. [223003050120] |The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. [223003050130] |The third chapter covers regular expressions, focusing on their interaction with Unicode. [223003050140] |The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. [223003050150] |The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own. [223003050160] |The first appendix is an introduction to Java, including the primitives, objects and arrays. [223003050170] |The second appendix contains suggested further reading in areas related to natural language processing. [223003050180] |I hope to keep churning out chapters at around one per week. [223003050190] |As I complete chapters, I’ll release new versions. [223003050200] |

    Comments Most Appreciated

    [223003050210] |C’mon —you know you want to be listed in the front of the book for sending us feedback. [223003050220] |Any comments, suggestions, etc., should go to carp@alias-i.com. [223003050230] |The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing. [223003050240] |I would love to get feedback about the general level of description, the tone, or get suggestions for demos or how to improve the existing code or descriptions. [223003050250] |

    Eventual Publication

    [223003050260] |We’ll be publishing it ourselves, probably through CreateSpace. [223003050270] |That’ll let us sell through Amazon. [223003050280] |If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway). [223003050290] |We plan to continue distributing PDF versions for free. [223003050300] |

    It’s about Time

    [223003050310] |I’m psyched, because it’s been over ten years since my last book. [223003050320] |I’m also working on a book on Bayesian categorical modeling —the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. [223003050330] |Don’t hold your breath; I’m trying to finish the LingPipe book first. [223003060010] |Sierra & Bates. Head First Java 2nd Ed. (Review) [223003060020] |After the comments on the last post about relying more heavily on Java textbooks, I ordered the most popular ones from Amazon. [223003060030] |I’ll start the reviews with the number one most popular book on Java (ranked by Amazon sales; it also has 269 reviews; I’m happy to see Effective Java is number 4): [223003060040] |
  • Sierra, Kathy and Bert Bates. 2005. [223003060050] |Head First Java. [223003060060] |2nd Edition. [223003060070] |O’Reilly. [223003060080] |

    Mashup: Elementary School Workbook + Kitschy Humor

    [223003060090] |I grew up with texts like The Little LISPer, but this book’s even stranger. [223003060100] |It strikes me as a mashup of an elementary school math problem workbook and one of those cutesy humor books you see near the checkout counters of bookstores around the holidays. [223003060110] |Here’s an example of one of their quizzes, where you fill in missing pieces of a Java program; just like school writing exercises: [223003060120] |
  • Pool Puzzle (from page 396) [223003060130] |

    The Design and Layout

    [223003060140] |I think they were striving for a sense of the enthusiastic amateur. [223003060150] |The “fun” fonts and askew alignments make it look non-linear. [223003060160] |Code comments are added in a handwritten post-it note style. [223003060170] |You really need to head over to Amazon and “look inside”. [223003060180] |

    Who’s it For?

    [223003060190] |It’s aimed at Java novices. [223003060200] |Other than that, I’m not sure. [223003060210] |Maybe for potential Java developers who like crossword puzzles or true/false quizzes with titles like “Be the Compiler”. [223003060220] |I’m more of a Java Puzzlers guy myself (thank my other childhood friend, Martin Gardner). [223003060230] |Unless you laughed at the photo included above, don’t buy it for the humor. [223003060240] |

    How Much Does it Cover?

    [223003060250] |It covers a startling amount of Java, including advanced topics like threads, sockets, RMI, Swing, audio, … [223003060260] |Of course, there can’t be much depth given this range. [223003060270] |But you do get a wide range of hello-world-type programs. [223003060280] |

    Is it Sound Java?

    [223003060290] |The authors clearly know what they’re doing. [223003060300] |The advice and coverage are too basic if you want to program Java seriously, but it’ll get you started on the right track. [223003060310] |There’s even good high-level advice like telling you to write unit tests (though not using anything as complex as JUnit). [223003060320] |

    It’s an Industry

    [223003060330] |Like other popular series (i.e., …for Dummies, …in Action, …in a Nutshell), this one’s turned into an industry, with dozens of titles covering everything from HTML to statistics. [223003060340] |As such, they’ve created all the trappings of such an enterprise, such as “Head First learning principles”: [223003060350] |
  • make it visual —put the words within or near graphics [223003060360] |
  • use a conversational and personalized style [223003060370] |
  • get the learner to think more deeply [223003060380] |
  • get —and keep —the reader’s attention [223003060390] |
  • touch their emotions [223003060400] |And they break out all sorts of “metacognitive” tips explaining why they wrote such a whacky book and how you should go about learning with it (like read it before bed and mix right-brain/left-brain activities for learning). [223003060410] |Many reviewers like the approach well enough to blurb about it. [223003080010] |McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review) [223003080020] |I’m very happy to see that the 2nd Edition is out: [223003080030] |
  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. [223003080040] |Lucene in Action. [223003080050] |Second Edition. [223003080060] |Manning Press. [223003080070] |Manning’s offering 40% off until September 30, 2010. [223003080080] |Follow the link to the book and use code lingpipeluc40 when you check out. [223003080090] |

    You Need This Book

    [223003080100] |If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. [223003080110] |As far as I know, this book is the only way to come up to speed on the intricacies of Lucene. [223003080120] |

    What if I have the First Edition?

    [223003080130] |Keep it for your projects that require 2.x versions of Lucene. [223003080140] |So much has changed in version 3, which is not backward compatible with previous versions, that the first edition of this book will be of almost no use. [223003080150] |Every example has changed. [223003080160] |It’s pretty much a whole new book (other than the first page where it describes Lucene as simple —that used to be true, but is not any longer). [223003080170] |I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. [223003080180] |I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. [223003080190] |(In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!) [223003080200] |

    Is it a Good Book?

    [223003080210] |I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. [223003080220] |Now let me show you why. [223003080230] |The authors clearly know their Lucene. [223003080240] |Not surprising, given that they’re three of the core developers for Lucene. [223003080250] |(Erik Hatcher was also involved in the Ant build system, and has another Manning …in Action book for that.) [223003080260] |Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. [223003080270] |As such, the book’s better than good —it’s really a great set of code examples presented in a coherent style and order, with easy-to-follow explanations. [223003080280] |The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. [223003080290] |It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. [223003080300] |It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles. [223003080310] |They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview of the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative). [223003080320] |If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. [223003080330] |Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!). [223003080340] |They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself. [223003080350] |

    Needs a Revision

    [223003080360] |Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. [223003080370] |I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. [223003080380] |As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is —all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.) [223003080390] |The initial two lines of a paragraph are repeated on pages xxx and xxxi. [223003080400] |There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii. [223003080410] |There are mistakes in code fragments, like a missing comma in the parse() example on p. 239. [223003080420] |There are also bugs in code. [223003080430] |I only studied a few programs in the term vector section (which is great) and in the Tika section. [223003080440] |Listing 5.18 has a bug in that it calls searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3). [223003080450] |There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193). [223003080460] |I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it. [223003080470] |

    What’s Missing?

    [223003080480] |There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. [223003080490] |As a result, the authors face a daunting curatorial challenge. [223003080500] |From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. [223003080510] |It’s more than just stripping accents from Western characters. [223003080520] |

    JUnit Examples

    [223003080530] |Don’t get me wrong. [223003080540] |I love JUnit. [223003080550] |We use it for unit testing in LingPipe. [223003080560] |And I recommend reading our tests to help understand our classes. [223003080570] |I just don’t think it’s a good way to present examples in an intro book. [223003080580] |On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods. [223003080590] |

    What about the Code?

    [223003080600] |Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. [223003080610] |I have to fight my own tendency to err on the latter side. [223003080620] |The authors favor simplicity in most cases, as they explain up front. [223003080630] |The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. [223003080640] |(We do the same thing in our tutorials.) [223003080650] |Many (but not all) methods that throw exceptions are just declared to throw Exception. [223003080660] |I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. [223003080670] |Others might argue it makes the code cleaner and less cluttered. [223003080680] |The authors explain that’s their motivation and point out proper exception etiquette. [223003080690] |We do the same thing in some of our tutorial code. [223003080700] |The use of generics is spotty, with raw allocations like new TreeMap() that should specify their type parameters, and subsequent use of non-generic Iterator instances. [223003080710] |As a result, they have to cast their gets and can’t use auto-boxing. [223003080720] |This just clutters up the code using the maps and makes their intended contents opaque. [223003080730] |I’d recommend autoboxing, or its explicit counterpart, Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf(), being a factory, can share instances). [223003080740] |I’d like to see more foreach and fewer explicit iterators. [223003080750] |They don’t like ternary operators or min/max operators, which results in awkward multi-line if/else fragments [223003080760] |where a one-line ternary or Math.min()/Math.max() expression would be simpler. [223003080770] |I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!). [223003080780] |There is a more serious issue for floating point precision they try to deal with in listing 5.22 in the last if/else block. [223003080790] |They try to avoid having cosines take on values outside of [-1,1], which would be illegal inputs to acos(). [223003080800] |Their approach is insufficient. [223003080810] |The only way to guarantee that your natural cosine computations fall in this range is to clamp them explicitly, as in the sketch after this paragraph. [223003080820] |That’s another problem we ran into for our k-means clustering implementation in LingPipe. [223003080830] |Once the vectors get big, it’s easy to get results greater than 1 for close matches. [223003080840] |Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y). [223003080850] |
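    The clamp being described would look something like the following sketch (not listing 5.22 itself; the variable and method names are made up), which also uses the single-square-root form of the denominator mentioned at the end of the paragraph:

        // Angle between two term vectors.  Floating-point rounding can push the
        // computed cosine slightly outside [-1, 1], which makes Math.acos() return
        // NaN, so clamp before calling it.
        static double angle(double dotProduct, double lengthSqX, double lengthSqY) {
            double cosine = dotProduct / Math.sqrt(lengthSqX * lengthSqY);
            double clamped = Math.max(-1.0, Math.min(1.0, cosine));
            return Math.acos(clamped);
        }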

    Layout, Etc.

    [223003080860] |I’m not a big fan of Manning’s approach to code formatting and presentation. [223003080870] |Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. [223003080880] |I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste. [223003080890] |The code formatting also bugged me. [223003080900] |Rather than breaking before assignment equal signs, they break after. [223003080910] |They use two characters of indent, which is hard to scan. [223003080920] |And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense). [223003080930] |The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. [223003080940] |A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. [223003080950] |The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers). [223003080960] |

    How Much Math?

    [223003080970] |For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). [223003080980] |If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (Morgan Kaufmann text, so it’s prohibitively expensive). [223003080990] |Manning, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. [223003081000] |Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines. [223003081010] |In the cosine example in the “Leveraging Term Vectors” section (5.10), they define cosine in a formula without ever defining the length or dot product functions. [223003081020] |If someone needs cosine defined for them, they probably need the definitions for basic vector operations. [223003081030] |In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. [223003081040] |I would’ve just stuck with largest cosine for the sort, rather than smallest angle, and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short). [223003081050] |For spell checking, it’d have been nice to get up to something like the noisy channel model. [223003081060] |Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. [223003081070] |Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there. [223003081080] |
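    For readers who do need them, the missing definitions are the standard ones (my notation, not the book’s):

        \cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert},
        \qquad
        x \cdot y = \sum_i x_i y_i,
        \qquad
        \lVert x \rVert = \sqrt{\textstyle\sum_i x_i^2}.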

    What about Solr?

    [223003081090] |Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. [223003081100] |This book only mentions it in passing in a page-long section 10.8. [223003081110] |

    Other Books on Lucene

    [223003081120] |I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon. [223003081130] |Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4. [223003081140] |

    About those Cover Illustrations

    [223003081150] |On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. [223003081160] |They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. [223003081170] |I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller] and found two copies of the book for sale at a Donald Heald Books (you may need to search again). [223003081180] |Here’s the scoop [223003081190] |[DALVIMART, Octavien (illustrator) - William ALEXANDER (1767-1816)]. [223003081200] |The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French [223003081210] |London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). [223003081220] |Folio (14 1/8 x 10 1/2 inches). [223003081230] |Parallel text in French and English. [223003081240] |Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. [223003081250] |Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. [223003081260] |Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate). [223003081270] |A good copy of this classic work on the costume of the Ottoman Empire. [223003081280] |Starting with the "Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. [223003081290] |Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). [223003081300] |“The Drawings, from which …[the] plates have been engraved, were made on the spot …[in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. [223003081310] |They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). [223003081320] |Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue. [223003081330] |Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337. [223003090010] |* Your Mileage May Vary [223003090020] |Alex Smola just introduced a blog, Adventures in Data, that’s focusing on many of the same issues as this one. [223003090030] |His first few posts have been on coding tweaks and theorems for gradient descent for linear classifiers, a problem Yahoo!’s clearly put a lot of effort into. [223003090040] |

    Lazy Updates for SGD Regularization, Again

    [223003090050] |In the wide world of independently discovering algorithmic tricks, Smola has a post on Lazy Updates for Generic Regularization in SGD. [223003090060] |It presents the same technique as I summarized in my pedantically named white paper Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. [223003090070] |It turns out to be a commonly used, but not often discussed technique. [223003090080] |
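    For concreteness, here is a minimal sketch of the trick for an L2 (Gaussian prior) penalty with a constant learning rate; the names are mine, and neither LingPipe’s nor Smola’s actual code looks exactly like this:

        // Lazy regularization for sparse SGD: rather than shrinking every weight on
        // every example, remember when each weight was last touched and apply the
        // accumulated shrinkage only when its feature next fires.
        static void sparseSgdStep(double[] w, long[] lastUpdate, long t,
                                  int[] features, double[] values,
                                  double err, double rate, double lambda) {
            for (int k = 0; k < features.length; ++k) {
                int j = features[k];
                long skipped = t - lastUpdate[j];
                w[j] *= Math.pow(1.0 - rate * lambda, skipped); // catch up on prior shrinkage
                w[j] -= rate * err * values[k];                 // likelihood gradient for this example
                lastUpdate[j] = t;
            }
        }

    Here err stands in for the usual residual (prediction minus label) in logistic regression, and t is the number of examples seen so far.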

    Blocking’s Faster

    [223003090090] |I wrote a long comment on that post, explaining that I originally implemented it that way in LingPipe, but eventually found it faster in real large-scale applications to block the prior updates. [223003090100] |

    Your Mileage May Vary

    [223003090110] |To finally get to the point of the post (blame my love of rambling New Yorker articles), your mileage may vary. [223003090120] |The reason blocking works for us is that we have feature sets like character n-grams where each text being classified produces on the order of 1/1000 of all features. [223003090130] |So if you update by block every 1000 epochs, it’s a win. [223003090140] |That’s not even counting the removal of the memory and time overhead of keeping the sparseness records, or the gain in memory locality. [223003090150] |
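    The blocked alternative replaces the per-feature timestamps with one dense pass over all the weights at the end of each block of examples; again, this is just a sketch with made-up names, not LingPipe’s code:

        // Blocked prior updates: every blockSize examples, apply the shrinkage for
        // the whole block to every weight in one dense, cache-friendly pass.
        static void maybeApplyBlockedPrior(double[] w, long examplesSeen, int blockSize,
                                           double rate, double lambda) {
            if (examplesSeen % blockSize != 0) return;
            double shrink = Math.pow(1.0 - rate * lambda, blockSize);
            for (int j = 0; j < w.length; ++j)
                w[j] *= shrink;
        }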

    Horses for Courses

    [223003090160] |The only possible conclusion is that it’s horses for courses. [223003090170] |Each application needs its own implementation for optimal results. [223003090180] |As another example, I can estimate my annotation models using Gibbs sampling in BUGS in 20 or 30 minutes. [223003090190] |I can estimate them in my custom Java program in a second or two. [223003090200] |I only wrote the program because I needed to scale to a named-entity corpus and BUGS fell over. [223003090210] |As yet another example, I’m finding coding up genomic data very similar to text data (edit distance, Bayesian models) and also very different (4-character languages, super-long strings). [223003090220] |Barry Richards, the department head of Cog Sci when I was a grad student, gave a career retrospective talk about all the work he did in optimizing planning systems. [223003090230] |Turns out each application required a whole new kind of software. [223003090240] |As a former logician, he found the lack of generality very dispiriting. [223003090250] |Not being able to resist one further idiom, this sounds like a probletunity to me. [223003090260] |

    Threading Blog Discussions

    [223003090270] |This post is a nice illustration of Fernando Pereira’s comment on Hal Daumé III’s blog about the problematic nature of threading discussions across blogs. [223003090280] |

    That Zeitgeist Again

    [223003090290] |I need to generalize my earlier post about the scientific zeitgeist to include coding tricks. [223003090300] |The zeitgeist post didn’t even discuss my rediscoveries in the context of SGD! [223003100010] |R’s Scoping [223003100020] |[Update: 10 September 2010 I didn't study Radford Neal's example closely enough before making an even bigger mess of things. [223003100030] |I'd like to blame it on HTML formatting, which garbled Radford's formatting and destroyed everyone else's examples, but I was actually just really confused about what was going on in R. [223003100040] |So I'm scratching most of the blog entry and my comments, and replacing them with Radford's example and a pointer to the manual.] [223003100050] |

    A Better Mousetrap

    [223003100060] |There’s been an ongoing discussion among computational statisticians about writing something better than R, in terms of both speed and comprehensibility: [223003100070] |
  • Andrew Gelman: The Future of R [223003100080] |
  • Julien Cornebise (via Christian Robert): On R Shortcomings [223003100090] |
  • Radford Neal: Two Surprising Things about R (following up his earlier series, Design flaws in R) [223003100100] |

    Radford Neal’s Example

    [223003100110] |Radford’s example had us define two functions, [223003100120] |This illustrates what’s going on, assuming you can parse R. [223003100130] |I see it, I believe it. [223003100140] |The thing to figure out is why a=10 was picked up in the call to g() in f, but b=200 was not picked up in the call to f() in h. Instead, the global assignment b=3 was picked up. [223003100150] |

    RTFM

    [223003100160] |Even after I RTFM-ed, I was still confused. [223003100170] |
  • Venables, W. N., D. M. Smith and the R Core Development Team. 2010. [223003100180] |Introduction to R 2.11.1. [223003100190] |It has a section 10.7 titled “Scope”, but I found their example [223003100200] |and the following explanation confusing, [223003100210] |The variable n in the function sq is not an argument to that function. [223003100220] |Therefore it is a free variable and the scoping rules must be used to ascertain the value that is to be associated with it. [223003100230] |Under static scope (S-Plus) the value is that associated with a global variable named n. [223003100240] |Under lexical scope (R) it is the parameter to the function cube since that is the active binding for the variable n at the time the function sq was defined. [223003100250] |The difference between evaluation in R and evaluation in S-Plus is that S-Plus looks for a global variable called n while R first looks for a variable called n in the environment created when cube was invoked. [223003100260] |I was particularly confused by the “environment created when cube was invoked” part, because I couldn’t reconcile it with Radford’s example. [223003100270] |Let’s consider a slightly simpler example without nested function calls. [223003100280] |This shows it can’t be the value of j at the time f is defined, because it changes when I change j later. [223003100290] |I think it’s actually determining how it’s going to find j when it’s defined. [223003100300] |If there’s a value of j that’s lexically in scope (not just defined in the current environment), it’ll use that value. [223003100310] |If not, it’ll use the environment of the caller. [223003100320] |And things that go on in subsequent function definitions and calls, as Radford’s example illustrates, don’t count. [223003100330] |Am I the only one who finds this confusing? [223003100340] |At least with all your help, I think I finally understand what R’s doing. [223003110010] |Grammar as Science [223003110020] |I had an hour free time on the University of Chicago campus, so after strolling the quad and gawking at the Robie house, I toddled into the university bookstore. [223003110030] |After marveling at the number of X and Philosophy titles*, I wandered over to the linguistics section, where I found: [223003110040] |
  • Richard Larson. 2010. [223003110050] |Grammar as Science. [223003110060] |MIT Press. [223003110070] |I was struck by the beginning of the back-cover blurb: [223003110080] |This introductory text takes a novel approach to the study of syntax. [223003110090] |Grammar as Science offers an introduction to syntax as an exercise in scientific theory construction. [223003110100] |If I’m not mistaken, this is implying that other introductions to syntax are exercises in something other than science. [223003110110] |I’ll buy that. [223003110120] |But I’m not sure I buy the next bit: [223003110130] |Syntax provides an excellent instrument for introducing students from a wide variety of backgrounds to the principles of scientific theorizing and scientific thought; it engages general intellectual themes present in all scientific theorizing as well as those arising specifically within the modern cognitive sciences. … [223003110140] |This book aside, I doubt Chomskyan linguistics is a good exemplar of science for reasons that (part of) Peggy Speas’s endorsement illuminates: [223003110150] |[Grammar as Science] shows students explicitly how to ‘think like a linguist.’ [223003110160] |Students who use this book will come away with an extraordinarily strong grasp of the real underpinnings of linguistics. [223003110170] |The key here is “think like a linguist”. [223003110180] |If you’ve ever tried to take Chomsky’s conception of grammar from Aspects and explain it to students (which I’m embarrassed to admit that I have done in the past), you’d have your doubts about the relation between “think like a scientist” and “think like a linguist”, too. [223003110190] |I was further amused by (the beginning of) Norbert Hornstein’s endorsement, also on the back cover: [223003110200] |What makes modern generative linguistics a science, and an interesting one at that, is a subtle interaction among several factors: the questions that it asks, the abstractions that it makes to answer these questions, the technical apparatus it uses to make these questions precise, the evidence it uses to adjudicate between different answers, and the aesthetic that animates it when putting all these ingredients together. [223003110210] |Silly me. [223003110220] |Here I was thinking what makes something interesting science is predictive power. [223003110230] |

    Show, Don’t Tell

    [223003110240] |One piece of very good advice for writers in any genre is show, don’t tell. [223003110250] |Sci-fi writers and linguists seem to have a particularly hard time following this advice. [223003110260] |I’m afraid the newbie/off-worlder/woken-up-from-cryosleep listener who has everything explained to him obeys the letter but not the spirit of the dictum. [223003110270] |That’s similar to “serious” texts where thin characters have a “dialogue” that’s a proxy for the author explaining to the reader. [223003110280] |Examples include Gödel, Escher, Bach in logic/AI (which I loved when I was in high school when it came out) and Rhyme and Reason in linguistics, which covers very similar “philosophy of science” ground as Grammar as Science. [223003110290] |I’ve never seen a physics or chemistry or even a biology book start off by asserting that what follows is really science. [223003110300] |Most scientists, and most scientific texts, just do science. [223003110310] |Methinks the linguists do protest too much. [223003110320] |

    Been There, Done That

    [223003110330] |If you look at the first chapter of the second HPSG book, Carl Pollard (this was his axe to grind) preaches the same gospel, as in this excerpt from page 7: [223003110340] |* The following are real, though I’m too lazy to link past the first: Anime and Philosophy, Star Wars and Philosophy, Alice in Wonderland and Philosophy, The Matrix and Philosophy, Facebook and Philosophy, Transformers and Philosophy, Batman and Philosophy, The Undead and Philosophy and so on (and on and on). [223003110350] |Is this some kind of works project for philosophers? [223003120010] |How to Define (Interpolated) Precision-Recall/ROC Curves and the Area under Them? [223003120020] |I’m revising LingPipe’s ScoredPrecisionRecallEvaluation class. [223003120030] |My goal is to replace what’s currently there with more industry-standard definitions, as well as fixing some bugs (see the last section for a list). [223003120040] |

    Uninterpolated and Interpolated Curves?

    [223003120050] |I’ve seen multiple different definitions for uninterpolated and interpolated curves. [223003120060] |For the uninterpolated case, I’ve seen some authors include all the points on the curve and others include only the points corresponding to relevant results (i.e., true positives). [223003120070] |LingPipe took the latter approach, but Manning et al.’s IR book took the former. [223003120080] |For the interpolated case, I’ve seen some authors (e.g., Manning et al.) set precision for a given recall point r to be the maximum precision at an operating point with recall r or above. [223003120090] |I’ve seen others just drop points that were dominated (LingPipe’s current behavior). [223003120100] |I seem to recall reading that others used the convex hull (which is what LingPipe’s doc says it does, but the doc’s wrong). [223003120110] |
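    For concreteness, the Manning et al. style of interpolation just described sets the interpolated precision at recall level r to the best precision available at that recall or higher:

        p_{\mathrm{interp}}(r) = \max_{r' \geq r} \, p(r').

    Dropping dominated points and taking the convex hull can both be seen as prunings of the same underlying operating points; the hull removes more points than the dominance test does.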

    Limit Points?

    [223003120120] |Should we always add the limit points corresponding to precision/recall values of (0,1) and (1,0)? [223003120130] |

    Break-Even Point?

    [223003120140] |If we use all the points on the precision-recall curve, the break-even point (BEP) will be at the rank equal to the number of relevant results (i.e., BEP equals R-precision). [223003120150] |If we remove points that don’t correspond to true positives, there might not be an operating point that is break even (e.g., consider results: irrel, irrel, rel, rel in the case where there are only two relevant results —BEP is 0/0 in the full case, but undefined if we only consider true positive operating points). [223003120160] |

    Area Under the Curves?

    [223003120170] |For area under the curve (AUC) computations, I’ve also seen multiple definitions. [223003120180] |In some cases, I’ve seen people connect all the operating points (however defined), and then calculate the actual area. [223003120190] |Others define a step function more in keeping with interpolating to maximum precision at equal or higher recall, and then take the area under that. [223003120200] |

    Bugs in LingPipe’s Current Definitions

    [223003120210] |For the next version of LingPipe, I need to fix three bugs: [223003120220] |1. The ROC curves should have 1-specificity on the X axis and sensitivity on the Y axis (LingPipe currently returns sensitivity on the X axis and specificity on the Y axis, which doesn’t correspond to anyone’s definition of ROC curves.) [223003120230] |2. The break-even point calculation can return too low a value. [223003120240] |3. Area under curve calculations don’t match anything I can find in the literature, nor do they match the definitions in the javadoc. [223003130010] |First Day of School [223003130020] |Today is my (Bob Carpenter‘s) first day as a research scientist at Columbia University Department of Statistics. [223003130030] |I’m part of a decent-sized team working for Andrew Gelman on implementing scalable Bayesian inference. [223003130040] |Having spent almost thirty years on the school calendar the first time around (from kindergarten in ’68 to tenured professor in ’96), the rhythm of the school calendar feels very natural. [223003130050] |Some of my more senior friends have expressed envy at my (relative) lack of responsibility (for designing curricula, recruiting students and hiring faculty, raising funds, teaching, advising students, etc.). [223003130060] |Ironically, when I left academia the first time, many non-academics thought I’d finally have to “work for a living”. [223003130070] |In fact, industry’s a cakewalk compared to life as part of the professoriate. [223003130080] |

    Why Academia?

    [223003130090] |My motivation for moving is to have more time to spend learning about and working on hierarchical models of the kind I’ve been exploring for data coding and RNA sequencing. [223003130100] |We’ll also be working on hierarchical models for social science, such as extending the multilevel logistic regression Andrew and crew used for Red State, Blue State, Rich State, Poor State to larger data sets, more predictors, more interactions, and more levels (the details of the basic setup are in Andrew and Jennifer Hill’s regression book). [223003130110] |

    What about the LingPipe Blog?

    [223003130120] |I’ll still blog here about natural language processing in general and LingPipe in particular. [223003130130] |I’ll move my general statistical blogging to Andrew’s blog. [223003130140] |I don’t know what I’m going to do with the bioinformatics stuff, but I may just start another blog for that. [223003130150] |If you haven’t guessed, I think blogs are most useful if they’re focused. [223003130160] |

    What about LingPipe and Alias-i?

    [223003130170] |I’m still working part time for Breck at LingPipe*. [223003130180] |I’ll still be answering e-mail and supporting the software. [223003130190] |I’ll even be adding to it as needed. [223003130200] |We’re about to roll out version 4.0.1, which fixes some inconsistencies in the tutorials and fixes some non-standard definitions and some bugs in the scored precision recall evaluation. [223003130210] |It’ll also have blocked logistic regression. [223003130220] |I’m also about to roll out the next draft of the LingPipe book. [223003130230] |It’s now 350+ pages, with lots more general and LingPipe-specific content. [223003130240] |*Breck’s changed the name of the company from “Alias-i” to “LingPipe”. [223003130250] |It’s déjà vu —my last employer changed its name from “Applied Language Technologies” to “SpeechWorks” to match the product name. [223003130260] |Breck’s working on the rebranding and he’s hired our friend Philip Riley to work out the design side of things. [223003140010] |Google’s Custom Search Adware and Instant Autoclicking [223003140020] |I’ve temporarily inherited a machine running Linux Mint, which appears to be a form of Ubuntu. [223003140030] |The default home page for a new account in the Firefox browser is Google. [223003140040] |Not the Google we all know and love, but a really crappy adware version based on something called “Google Custom Search“. [223003140050] |Not surprisingly, the second bullet point on Google’s ad page for the service says “Make money with Adsense for Search”. [223003140060] |When I do a search through the Linux Mint Google Custom Search page, I get results that don’t look at all like they do when searching from google.com. [223003140070] |For instance, here’s the lousy result I get with Mint’s Google Custom Search app, along with the usual result from google.com: [223003140080] |
  • [named entity recognition] on “Custom Search” [223003140090] |
  • [named entity recognition] on google.com [223003140100] |The damnedest part is that Mint has linked their adware-infested version of google to the Google toolbar, so even if I search from there, I get crapped-up results without anything like the new Bing-like suggestions Google are offering in the left nav bar, no “Instant”, and lots of ads. [223003140110] |The only way to get regular Google search is to go to google.com directly. [223003140120] |

    on Google Instant Auto-clicks Ads

    [223003140130] |I think I’ve discovered the real motivation for Google Instant —automatically clicking ads if you hit return twice. [223003140140] |I hadn’t noticed this on my last machine because I’d turned the Instant feature off. [223003140150] |What I see is that if Google Instant is left on (the default), and you enter your search to completion, you’re already looking at your results (if you’ve paused long enough after typing). [223003140160] |Then, if you hit return, nothing happens. [223003140170] |If you hit return again, perhaps wondering why nothing happened the first time, it auto-clicks the first link, which is typically a sponsored ad. [223003140180] |Sorry, Basistech, you’ll be paying Google for auto-enabling this “feature”. [223003140190] |This behavior goes away if you turn Google Instant off. [223003140200] |

    Two More Reasons to Use Bing

    [223003140210] |That’s two more reasons to use Bing, which I’ve now configured as the default search tool on the Firefox toolbar for my linux account. [223003140220] |

    Google as Kleenex

    [223003140230] |I would also like to note that my mom now “googles” using Bing. [223003140240] |Kleenex, a brand of facial tissue, was the poster child for a brand name turned generic. [223003140250] |Companies fight this tooth and nail, but language will not bend to their corporate wills. [223003140260] |I set mom up with a Yahoo! e-mail account ages ago when she left her company. [223003140270] |She spends lots of time in e-mail, and does most of her search from her My Yahoo! page. [223003140280] |No wonder Bing’s picking up market share. [223003150010] |LingPipe 4.0.1 Patch Release [223003150020] |I just released LingPipe version 4.0.1. [223003150030] |As usual, you can read more about and download the latest version from the [223003150040] |
  • LingPipe Home Page. [223003150050] |The latest version of the book is soon to follow, with much more LingPipe-specific content. [223003150060] |

    Release Notes

    [223003150070] |
  • fixes some bugs and brings definitions in line with standards for classify.ScoredPrecisionRecallEvaluation,
  • [223003150080] |
  • added more documentation and utilities to classify.ConfusionMatrix,
  • [223003150090] |
  • updates stat.LogisticRegression and classify.LogisticRegressionClassifier so that likelihood gradient updates are blocked along with prior updates,
  • [223003150100] |
  • adds a new implementation, tokenizer.TokenNGramTokenizerFactory,
  • [223003150110] |
  • updates toString() methods in all tokenizer factories to expose nested structure,
  • [223003150120] |
  • added a collection utilities class util.CollectionUtils,
  • [223003150130] |
  • added some more utility methods to the utility classes util.Streams and util.Strings,
  • [223003150140] |
  • added utility methods for corpus.ListCorpus and corpus.Parser,
  • [223003150150] |
  • fixed a bug in sparse vector addition with other sparse vectors (which nothing else depended on), and
  • [223003150160] |
  • brings the tokenized LM tutorial in line with LingPipe 4.
  • [223003160010] |Evaluating with Probabilistic Truth: Log Loss vs. O/1 Loss [223003160020] |I’ve been advocating an approach to classifier evaluation in which the “truth” is itself a probability distribution over possible labels and system responses take the same form. [223003160030] |

    Is the Truth Out There?

    [223003160040] |The main motivation is that we don’t actually know what the truth is for some items being classified. [223003160050] |Not knowing the truth can come about because our labelers are biased or inconsistent, because our coding standard is underspecified, or because the items being classified are truly marginal. [223003160060] |That is, the truth may not actually be out there, at least in any practically useful sense. [223003160070] |For example, we can squint all we want at the use of a pronoun in text and still not be able to come to any consensus about its antecedent, no matter how many experts in anaphora we assemble. [223003160080] |Similarly, we might look at a phrase like fragile X in a paper on autism and truly not know if it refers to the gene or the disease. [223003160090] |We may not know a searcher’s intent when we try to do spelling correction on a query like Montona —maybe they want Montana the state, maybe they want the pop star, or maybe they really want the west central region of the island of São Vicente. [223003160100] |The truth’s not always out there. [223003160110] |At least for language data collected from the wild. [223003160120] |

    Partial Credit for Uncertainty

    [223003160130] |The secondary motivation is that many classifiers, like most of LingPipe’s, are able to return probabilistic answers. [223003160140] |The basic motivation of log loss is that we want to evaluate the probability assigned by the system to the correct answers. [223003160150] |The less uncertain the system is about the true answer, the better. [223003160160] |If we want to be able to trade precision (or specificity) for recall (i.e., sensitivity), it’s nice to have probabilistically normalized outputs over which to threshold. [223003160170] |

    Probabilistic Reference Truth

    [223003160180] |Let’s just consider binary (two category) classification problems, where a positive outcome is coded as $1$ and a negative one as $0$. [223003160190] |Suppose there are $N$ items being classified, $x_1, \ldots, x_N$. [223003160200] |A probabilistic corpus then consists of a sequence of probabilistic categorizations $y_1, \ldots, y_N$, where $y_i \in [0,1]$ is the reference probability that item $x_i$ is positive, for $1 \leq i \leq N$. [223003160210] |A non-probabilistic corpus is just an edge case, with $y_i \in \{0,1\}$ for $1 \leq i \leq N$. [223003160220] |

    Probabilistic System Responses

    [223003160230] |Next, suppose that the systems we’re evaluating are themselves allowed to return conditional probabilities of positive responses, which we will write as $\hat{y}_i \in [0,1]$ for $1 \leq i \leq N$. [223003160240] |A non-probabilistic response is again an edge case, with $\hat{y}_i \in \{0,1\}$. [223003160250] |

    Log Loss

    [223003160260] |The log loss for response $\hat{y}$ for reference $y$ is the negative expected log probability of the response given the reference, namely [223003160270] |$\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^N \big( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big)$. [223003160280] |If we think of $y_i$ and $\hat{y}_i$ as parameters to Bernoulli distributions, each term is the cross-entropy between them, which differs from the KL divergence only by the entropy of the reference (a constant with respect to the response). [223003160290] |If the reference is restricted to being $0$ or $1$, this reduces to the usual definition of log loss for a probabilistic classifier (that is, $y_i \in \{0,1\}$ while $\hat{y}_i$ may be any value in $[0,1]$). [223003160300] |
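    In code, the sum is straightforward (natural logs, so the result is in nats); the arrays give each item’s reference and response probability of the positive category, following the notation above:

        // Log loss of probabilistic responses yHat[] against a probabilistic
        // reference y[].  Terms with zero reference weight are skipped so that a
        // certain reference matched by an equally certain response contributes
        // zero loss rather than 0 * log(0) = NaN.
        static double logLoss(double[] y, double[] yHat) {
            double loss = 0.0;
            for (int i = 0; i < y.length; ++i) {
                if (y[i] > 0.0)
                    loss -= y[i] * Math.log(yHat[i]);
                if (y[i] < 1.0)
                    loss -= (1.0 - y[i]) * Math.log(1.0 - yHat[i]);
            }
            return loss;
        }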

    Plotting a Single Response

    [223003160310] |Because we just sum over all the responses, we can see what the result is for a single item $i$. [223003160320] |Suppose $y_i = 0.7$, which means the reference says there’s a 70% chance of item $i$ being positive, and a 30% chance of it being negative. [223003160330] |The plot of the loss for response $\hat{y}_i$ given reference $y_i = 0.7$ is a convex curve over $\hat{y}_i \in (0,1)$. [223003160340] |It’s easy to see that a response near the reference value is better than one at either extreme when the reference is $y_i = 0.7$. [223003160350] |The red vertical dashed line is at the point $\hat{y}_i = 0.7$, which is where $\hat{y}_i = y_i$. [223003160360] |The blue horizontal line is at the corresponding loss value, which is the minimum loss for any $\hat{y}_i$ when $y_i = 0.7$. [223003160370] |

    Classifier Uncertainty Matches Reference Uncertainty

    [223003160380] |It’s easy to see from the graph (and prove via differentiation) that the log loss function is convex and the minimum loss occurs when $\hat{y}_i = y_i$. [223003160390] |That is, best performance occurs when the system response has the same uncertainty as the gold standard. [223003160400] |In other words, we want to train the system to be uncertain in the same places as the reference (training) data is uncertain. [223003160410] |

    Log Loss Degenerates at the Edges

    [223003160420] |If the reference is $y_i = 1$ and the response is $\hat{y}_i = 0$, the log loss is infinite. [223003160430] |Similarly for $y_i = 0$ and $\hat{y}_i = 1$. [223003160440] |As long as the response is not 0 or 1, the results will be finite. [223003160450] |If the response is 0 or 1, the loss will only be finite if the reference category is the same value, $y_i = \hat{y}_i$. [223003160460] |In fact, having the reference equal to the response, $y_i = \hat{y}_i$, with both being 0 or 1, is the only way to get zero loss. [223003160470] |In all other situations, loss will be non-zero. [223003160480] |Loss is never negative because we’re negating logs of values between 0 and 1. [223003160490] |This is why you don’t see log loss used for bakeoffs. [223003160500] |If you’re using a system that doesn’t have probabilistic output and the truth is binary 0/1, then any system that makes even one wrong first-best call will have infinite error. [223003160510] |

    Relation to Regularization via Prior Data

    [223003160520] |This matching of uncertainty can be thought of as a kind of regularization. [223003160530] |One common way to implement priors if you only have maximum likelihood estimators is to add some artificial data. [223003160540] |If I add a new data point which has all features turned on, and give it one positive instance and one negative instance, and I’m doing logistic regression, it acts as a regularizer pushing all the coefficients closer to zero (the further they are away from zero, the worse they do on the combination of the one negative and one positive example). [223003160550] |If I add more of this kind of data, the regularization effect is stronger. [223003160560] |Here, we can think of adding 3 negative examples and 7 positive examples, or 0.3 and 0.7 if we allow fractional training data. [223003160570] |

    Comparison to 0/1 Loss

    [223003160580] |Consider what happens if instead of log loss, we consider the standard 0/1 loss as expressed in a confusion matrix (true positive, true negative, false positive, and false negative counts). [223003160590] |With a probabilistic reference category $y_i$ and probabilistic system response $\hat{y}_i$, the expected confusion matrix counts are [223003160600] |$\mathrm{TP} = \sum_{i=1}^N y_i \hat{y}_i$, [223003160610] |$\mathrm{TN} = \sum_{i=1}^N (1 - y_i)(1 - \hat{y}_i)$, [223003160620] |$\mathrm{FP} = \sum_{i=1}^N (1 - y_i)\,\hat{y}_i$, and [223003160630] |$\mathrm{FN} = \sum_{i=1}^N y_i (1 - \hat{y}_i)$. [223003160640] |With this, we can compute expected accuracy, precision, recall (= sensitivity), specificity, and F-measure as [223003160650] |$\mathrm{acc} = (\mathrm{TP} + \mathrm{TN}) / (\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN})$, [223003160660] |$\mathrm{prec} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$, [223003160670] |$\mathrm{recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$, [223003160680] |$\mathrm{spec} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$, and [223003160690] |$F = 2 \cdot \mathrm{prec} \cdot \mathrm{recall} / (\mathrm{prec} + \mathrm{recall})$. [223003160700] |All of these statistical measures are degenerate for probabilistic evaluation. [223003160710] |For instance, if $y_i > 0.5$, then accuracy is maximized with $\hat{y}_i = 1$; if $y_i < 0.5$, accuracy is maximized with $\hat{y}_i = 0$; and if $y_i = 0.5$, accuracy is the same for all $\hat{y}_i$! [223003160720] |The rest of the measures are no better behaved. [223003160730] |

    R Source

    [223003160740] |Plotting functions is easy in R with the curve() function. [223003160750] |Here’s how I generated the one above. [223003160760] |Just be careful with that dynamic lexical scope if you want to redefine the reference result p ($y_i$ in the definitions above).