Processing Tweets with LingPipe #1: Search and CSV Data Structures

I have been focusing on teaching developers how to approach the challenges of computational linguistics. An interesting source to work with is Twitter, since it is culturally relevant these days, fairly easy to work with from a data-feed perspective, and it is just so darned cute. Like many cute things, it is not long before it is eating your shoes, resisting toilet training and overall testing one's patience. We will be using LingPipe to help tame the beast. I will start with how to work with the search client and will move toward language ID, deduplication and a bit of sentiment classification in later blog entries.

The code for this post is available in this tarball or from our subversion repository, which can be accessed via the command:

svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Go ahead and download the software and make sure you can compile. I am developing with Ant, so I just type 'ant' in the teaching/twitter directory:

Searching Twitter

We are using the Twitter4j library for accessing the Twitter API. The folks at Twitter apparently make contributions to it; while that is not an endorsement, it certainly increases credibility. Getting some data on disk is the first item. Run a search on Obama as follows:

The file name will be the query, searches/Obama.csv. Note that subsequent searches will not overwrite previous search results; rerunning the above query would create a file searches/Obama_1.csv. Looking at the csv output in a spreadsheet like OpenOffice or Excel (import as comma-separated UTF-8 text), the tweets are in column D. To make them more readable, select the entire column D, resize its width to around 5″, and modify the formatting to wrap the text.

Pretty cute, eh? Lots of entrepreneurs have gotten to this stage of perusing Twitter and see piles of customers clamoring for access to this feed that is effectively the id of the internet. Clean it up with some superego and profit is sure to follow. Well, the cute gets ugly fast. Run a few queries in areas of interest and the messiness becomes apparent quickly. Below we go into some details and a bit of data analysis.

Details

We will be using the CSV data structure to work with the Twitter search results. Spreadsheets provide an easy way to view, sort and manipulate the tweets, and there is a simple parser that allows us to interface with LingPipe functionality. The relevant bits of the search code in src/SearchTwitter4j.java are:

The only tricky bit of this search interface is the two levels of indexing for search results: pages, and tweets per page. Otherwise we accumulate the tweets and send them off to our TweetWriter class for writing to disk. The code for writing is in src/TweetWriter.java:

This code uses the Super Csv package to handle reading and writing csv (comma-separated value) formatted files. For this tutorial I am only interested in the text of the Tweet object, but much more information is available; see the Twitter4j documentation on the returned Tweet object. The basic pattern is to populate a row with string values and write the row out. I have adopted the convention that the first line of a csv file contains headers, which is handled before iterating over the tweets. Writing out a tweet involves 3 empty fields followed by the text of the tweet. Later those empty fields will be populated by human and machine annotations, so this is our foundational data structure.

The remaining odd bit is whitespaceNormedTweet = tweet.getText().replaceAll("\\s+"," ").trim();, which replaces runs of whitespace, including new lines and carriage returns, with a single space. This is in clear violation of my "do not adulterate the data" stance, but for the purposes of this tutorial the improved readability of the csv format makes it worth it.

Also note that the IO has been wrapped for maximum paranoia regarding open file handles that might take out a JVM on repeated failures of writer, streamWriter or fileOut. I just know that someone is going to copy and paste the above code into a production system, so it might as well be well behaved. BTW, a real implementation would stream tweets to disk to keep down memory use. Thanks to Bob for noting all this and re-writing the closes.
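The inline listings from the original post did not survive into this format, so here is a minimal sketch of the shape of the two classes, assuming the Twitter4j 2.x-era search API and Super Csv's CsvListWriter; the header names and page count are made up for illustration, and the real versions live in the tarball's src/SearchTwitter4j.java and src/TweetWriter.java:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.util.ArrayList;
    import java.util.List;
    import org.supercsv.io.CsvListWriter;
    import org.supercsv.prefs.CsvPreference;
    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Tweet;
    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;

    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            String queryString = args[0]; // e.g. "Obama"
            Twitter twitter = new TwitterFactory().getInstance();
            List<Tweet> tweets = new ArrayList<Tweet>();
            for (int page = 1; page <= 15; ++page) { // outer index: pages
                Query query = new Query(queryString);
                query.setRpp(100);   // inner index: tweets per page
                query.setPage(page);
                QueryResult result = twitter.search(query);
                tweets.addAll(result.getTweets());
            }
            writeCsv(tweets, new File("searches", queryString + ".csv"));
        }

        static void writeCsv(List<Tweet> tweets, File file)
                throws IOException {
            FileOutputStream fileOut = null;
            OutputStreamWriter streamWriter = null;
            CsvListWriter writer = null;
            try {
                fileOut = new FileOutputStream(file);
                streamWriter = new OutputStreamWriter(fileOut, "UTF-8");
                writer = new CsvListWriter(streamWriter,
                                           CsvPreference.STANDARD_PREFERENCE);
                // convention: first row is headers (names are illustrative)
                writer.writeHeader("trueLabel", "guessLabel", "score", "text");
                for (Tweet tweet : tweets) {
                    // flatten newlines so each tweet stays on one csv row
                    String normed
                        = tweet.getText().replaceAll("\\s+", " ").trim();
                    // 3 empty annotation fields, then the tweet text
                    writer.write("", "", "", normed);
                }
            } finally {
                // close whichever wrapper got furthest, so a constructor
                // failure can't leak the underlying file handle
                if (writer != null) writer.close();
                else if (streamWriter != null) streamWriter.close();
                else if (fileOut != null) fileOut.close();
            }
        }
    }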

Exploring the Data

Blog entries are supposed to be short, so I will cut the presentation here so that I may blog another day in good conscience. But a little data analysis to whet the appetite seems appropriate.

Duplicates and Near Duplicates:

We now have an on-disk data structure for storing searched-for tweets that we can browse with a spreadsheet and modify at will. Let's look at what our harvest has reaped. Select the entire sheet and sort by column D. Scrolling down a bit, I find for my Obama search some near duplicates that share prefixes:

Looking more closely, we see that the repeats are minor variants, differing in the referred-to URL, the #2 vs. #5, and some more variation in the last example. For the purposes of this tutorial I assume that duplicates and near duplicates need to be eliminated; there are other use cases where retweets, which are near duplicates passed on by different users, are desirable. We will be handling duplicates and near duplicates in the next blog post.

Languages:

I browsed the tweets and started to mark up what the language was. I didn't have to look past around 100 tweets to get a pretty rich smattering of languages.

[Figure: diversity of languages displayed.]

In another blog post I will approach the issue of language identification; fortunately it is pretty easy for languages that are not near neighbors like British English and American English. If you want to get a heads-up on language id, look at our language ID tutorial.

That is all for now. My next posts will attend to deduplication, then language ID, before moving on to classification for topic and sentiment.

Breck

Metaphor Fail: Not that Kind of Tsunami, Yahoo!

Mitzi pointed out the increasing politicization of Yahoo! Weather News, which now offers extensive coverage of the recent American mid-term elections. For instance, today's roundup of weather-related news includes this excerpt from a piece by Gail Russell Chaddock in the Christian Science Monitor [my emphasis]:

Why Nancy Pelosi remains leader of House Democrats despite huge loss

Call it a triumph of inside politics. While midterms 2010 felt like a tsunami outside of Washington, the waves didn't reach Capitol Hill's corner offices, where House leadership teams in both parties barely budged.

The wave metaphor is quite popular these days among Republican commentators, so there's more and more political commentary on the Weather page.

Classification, The Mother-in-Law Recipe*

Add one part heuristic dictionary matching, one part metaphor, stir, and put into production.

Classification, The Real Recipe

Post-process the previous recipe by filtering output with a conservative threshold using a simple bag-of-words classifier. After all, it's a precision-related game. If nothing else, you can bootstrap training data from the seed terms.

* Mitzi** uses the term "mother-in-law recipe" for one with missing steps. The result is that the unsuspecting daughter-in-law could never duplicate mother's results. Mitzi comes from a family of competitive (and talented) cooks. Mitzi got just such a recipe from my dad's mother for vanilla buttercream frosting, which was legendary for both its quality and the fact that only grandma could make it. I think that was simple omission, because Grandma gave her a live demo of how she made it when Mitzi said she couldn't reproduce the recipe.

** I couldn't find a reference on the internet, but in about ten seconds Mitzi found one in print in Michael J. Rosen's Great American Bake Sale.

Information Gain and Task-Based Costs: Biased versus Spam Annotators

Following on from Sheng et al.'s Get Another Label? paper, Panos and crew have a Human Computation Workshop paper,
  • P. Ipeirotis, F. Provost, J. Wang. 2010. Quality Management on Amazon Mechanical Turk. KDD-HCOMP.

The motivation for the new paper is to try to separate bias from accuracy. That is, some annotators try hard, but the bias in their answers gives them an overall low exact accuracy. But their responses can still be useful if we correct for their bias. Luckily, the Dawid-and-Skene-type models for estimating and adjusting for annotators' accuracies do just that.

    G, PG, R, or X?

As an example, Ipeirotis et al. consider having workers on Amazon Mechanical Turk classify images into G, PG, R, and X ratings following MPAA ratings guidelines. This is really an ordinal classification problem, where the approach of Uebersax and Grove (1993) seems most appropriate, but let's not worry about that for right now. We can imagine a purely categorical example, such as classifying tokens in context based on part-of-speech, or classifying documents based on a flat topic list.

    Bias or Inaccuracy?

Uebersax and Grove discuss the bias versus accuracy issue. It's easy to see in a two-category case that sensitivity and specificity (the label response rates for 1 and 0 category items, respectively) may be reparameterized as accuracy and bias. Biased annotators have sensitivities that are lower (bias toward 0) or higher (bias toward 1) than their specificities.

What Ipeirotis et al. point out is that you can derive information from a biased annotator if their biases can be estimated. As they show, a model like Dawid and Skene's (1979) performs just such a calculation in its "weighting" of annotations in a generative probability model. The key is that it uses the information about the likely response of an annotator given the true category of an item to estimate the category of that item.

    Decision-Theoretic Framework

Ipeirotis et al. set up a decision-theoretic framework where there is a loss (aka cost, which may be negative, and thus a gain) for each decision, based on the true category of the item and the classification. One nice thing about the image classification task is that it makes the misclassification costs easy to think about. For instance, classifying an X-rated image as G and having it land on a children's site is probably a more costly error than rating a G image as PG and banning it from the same site. In the ordinal setting, there's a natural scale, where rating an X-rated image as R is closer to the true result than PG or G.

    Getting Another Label

Consider a binary setting where the true category of item $i$ is $c_i \in \{0,1\}$, the prevalence (overall population proportion) of true ($c_i = 1$) items is $\pi$, and annotator $j$ has sensitivity (accuracy on 1 items; response rate to positive items) of $\theta_{1,j}$ and a specificity (accuracy on 0 items; 1 − the response rate to negative items) of $\theta_{0,j}$. If $y_{i,j}$ is the response of annotator $j$ to item $i$, then we can calculate probabilities for the category by

$\Pr[c_i = 1 \mid y_i] \propto \pi \prod_j \theta_{1,j}^{y_{i,j}} \, (1 - \theta_{1,j})^{1 - y_{i,j}}$

and

$\Pr[c_i = 0 \mid y_i] \propto (1 - \pi) \prod_j \theta_{0,j}^{1 - y_{i,j}} \, (1 - \theta_{0,j})^{y_{i,j}}$.

The thing to take away from the above formula is that it reduces to an odds ratio. Before annotation, the odds ratio is $\pi / (1 - \pi)$. For each annotator, we multiply the odds ratio by the annotator's ratio, which is $\theta_{1,j} / (1 - \theta_{0,j})$ if the label is $y_{i,j} = 1$, and is $(1 - \theta_{1,j}) / \theta_{0,j}$ if the label is negative ($y_{i,j} = 0$).
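A minimal sketch of this multiplicative update, just to make the arithmetic concrete (the numbers in the main method are made up for illustration):

    public class OddsUpdate {
        // multiply the prior odds by the annotator's likelihood ratio:
        // theta1 = sensitivity, theta0 = specificity
        static double updatedOdds(double priorOdds, boolean label,
                                  double theta1, double theta0) {
            return priorOdds * (label ? theta1 / (1.0 - theta0)
                                      : (1.0 - theta1) / theta0);
        }

        public static void main(String[] args) {
            // prevalence 0.2 gives prior odds 0.25; a positive label from
            // an annotator with sensitivity 0.9 and specificity 0.8
            // multiplies the odds by 0.9/0.2 = 4.5
            System.out.println(updatedOdds(0.25, true, 0.9, 0.8)); // 1.125
        }
    }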

    Random Annotators Don’t Affect Category Estimates

Spammers in these settings can be characterized by responses that do not depend on the input items. That is, no matter what the true category, the response profile will be identical. For example, an annotator could always return a given label (say 1), or always choose randomly among the possible labels with the same distribution (say 1 with 20% chance and 0 with 80% chance, corresponding, say, to just clicking some random checkboxes on an interface).

In the binary classification setting, if annotations don't depend on the item being classified, we have specificity = 1 − sensitivity, or in symbols, $\theta_{0,j} = 1 - \theta_{1,j}$ for annotator $j$. That is, there's always a $\theta_{1,j}$ chance of returning the label 1 no matter what the input. Plugging this into the odds-ratio formulation above, it's clear that adding such a spammy annotator has no effect. The updates to the odds ratio do nothing, because $\theta_{1,j} / (1 - \theta_{0,j}) = 1$ and $(1 - \theta_{1,j}) / \theta_{0,j} = 1$.

    Cooperative, Noisy and Adversarial Annotators

In the binary case, as long as the sum of sensitivity and specificity is greater than one, $\theta_{1,j} + \theta_{0,j} > 1$, there is positive information to be gained from an annotator's response. If the sum is exactly 1, the annotator's responses are pure noise and there is no information to be gained from their response. If the sum is less than 1, the annotator is being adversarial. That is, they know the answers and are intentionally returning the wrong answers. In this case, their annotations will bias the category estimates.

    Information Gain

The expected information gain from having an annotator label an item is easily calculated. We need to calculate the probability of the true category, then the probability of each response, and figure out the odds of each and the contribution to our overall estimates.

The random variable in question is $c_i$, the category of item $i$. We will assume current odds, after zero or more annotations, of $\Pr[c_i = 1] / \Pr[c_i = 0]$, and consider the so-called information gain from observing $y_{i,j}$, the label provided for item $i$ by annotator $j$,

$\mathrm{IG} = H[c_i] - \mathbb{E}_{y_{i,j}}\bigl[ H[c_i \mid y_{i,j}] \bigr]$,

where $H$ is the entropy of the distribution over $c_i$ and the expectation is with respect to our model's posterior. Expanding the expectation in the second term gives us

$\mathbb{E}_{y_{i,j}}\bigl[ H[c_i \mid y_{i,j}] \bigr] = \Pr[y_{i,j} = 1] \, H[c_i \mid y_{i,j} = 1] + \Pr[y_{i,j} = 0] \, H[c_i \mid y_{i,j} = 0]$.

The formulas for the terms inside the entropy are given above. As before, we'll calculate the probabilities of responses using our model posteriors. Carrying this through normalization, the probabilities on which the expectations are based are

$\Pr[y_{i,j} = 1] = Q_1 / (Q_1 + Q_0)$ and $\Pr[y_{i,j} = 0] = Q_0 / (Q_1 + Q_0)$, where

$Q_1 = \pi \, \theta_{1,j} + (1 - \pi)(1 - \theta_{0,j})$ and

$Q_0 = (1 - \pi) \, \theta_{0,j} + \pi (1 - \theta_{1,j})$,

so that the probability the annotation is 1 is proportional to the sum of the probability that the true category is 1 ($\pi$) times the probability the response was correct ($\theta_{1,j}$), and the probability that the true category is 0 ($1 - \pi$) times the probability the response was incorrect ($1 - \theta_{0,j}$).

It's easy to see that spam annotators provide zero information gain, because, as we showed in the last section, if annotator $j$ provides random responses, then $\Pr[c_i \mid y_{i,j}] = \Pr[c_i]$.
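Here is a sketch of that calculation in code, following the expansion above; it's easy to check numerically that an annotator with $\theta_{0,j} = 1 - \theta_{1,j}$ yields exactly zero gain:

    public class InfoGain {
        // expected information gain (in bits) about c[i] from one label
        // by annotator j, given current P(c[i]=1) = p, sensitivity theta1
        // and specificity theta0
        static double infoGain(double p, double theta1, double theta0) {
            double q1 = p * theta1 + (1 - p) * (1 - theta0);  // P(label=1)
            double q0 = (1 - p) * theta0 + p * (1 - theta1);  // P(label=0)
            double p1 = p * theta1 / q1;                 // P(c=1 | label=1)
            double p0 = p * (1 - theta1) / q0;           // P(c=1 | label=0)
            return entropy(p) - (q1 * entropy(p1) + q0 * entropy(p0));
        }

        static double entropy(double p) {
            if (p <= 0.0 || p >= 1.0) return 0.0;
            return -(p * log2(p) + (1 - p) * log2(1 - p));
        }

        static double log2(double x) { return Math.log(x) / Math.log(2.0); }

        public static void main(String[] args) {
            System.out.println(infoGain(0.5, 0.9, 0.8)); // a useful annotator
            System.out.println(infoGain(0.5, 0.2, 0.8)); // a spammer: 0.0
        }
    }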

    Decision-Theoretic Framework

Ipeirotis et al. go one step further and consider a decision-theoretic context in which the penalties for misclassifications may be arbitrary numbers and the goal is minimizing expected loss (equivalently, maximizing expected gain). Rather than pure information gain, the computation would proceed through the calculation of expected true positives, true negatives, false positives, and false negatives, each with a weight.

The core of Bayesian decision theory is that expected rewards are always improved by improving posterior estimates. As long as an annotator isn't spammy, their contribution is expected to tighten up our posterior estimate and hence improve our decision-making ability.

Suppose we have weights $L_{k,k'}$, which are the losses for classifying an item of category $k$ as being of category $k'$. Returning to the binary case, consider an item whose current estimated chance of being positive is $p$. Our expected loss is

$\mathbb{E}[\text{loss}] = L_{1,1}\,p^2 + L_{1,0}\,p(1-p) + L_{0,1}\,(1-p)\,p + L_{0,0}\,(1-p)^2$.

We are implicitly assuming that the system operates by sampling the category from its estimated distribution. This is why $p$ shows up twice after $L_{1,1}$: once for the probability that the category is 1 and once for the probability that 1 is the label chosen.

In practice, we often quantize answers to a single category. The site that wants a porn filter on an image presumably doesn't want a soft decision; it needs to either display an image or not. In this case, the decision criterion is to return the result that minimizes expected loss. For instance, assigning category 1 when the probability the category is 1 is $p$ leads to expected loss

$L_{1,1}\,p + L_{0,1}\,(1-p)$,

and the loss for assigning category 0 is expected to be

$L_{1,0}\,p + L_{0,0}\,(1-p)$.

The decision is simple: return the result corresponding to the smallest loss.

After the annotation by annotator $j$, the positive and negative probabilities get updated and plugged in to produce a new estimate $p'$, which we plug back in.

I'm running out of steam on the derivation front, so I'll leave the rest as an exercise. It's a hairy formula, especially when unfolded to the expectation. But it's absolutely what you want to use as the basis of the decision as to which items to label.
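A sketch of the quantized decision rule just described (the loss matrix in the main method is made up for illustration):

    public class MinLossDecision {
        // return the category (0 or 1) with the smaller expected loss,
        // given P(c=1) = p and losses L[k][kHat] for classifying a
        // true-k item as kHat
        static int decide(double p, double[][] L) {
            double lossSay1 = L[1][1] * p + L[0][1] * (1 - p);
            double lossSay0 = L[1][0] * p + L[0][0] * (1 - p);
            return lossSay1 <= lossSay0 ? 1 : 0;
        }

        public static void main(String[] args) {
            // rows are the true category, columns the assigned category;
            // missing a true 1 is 100 times worse than the reverse
            double[][] L = { { 0, 1 }, { 100, 0 } };
            System.out.println(decide(0.5, L)); // a coin-flip item gets a 1
        }
    }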

    Bayesian Posteriors and Expectations

In practice, we work with estimates of the prevalence $\pi$, sensitivities $\theta_{1,j}$, and specificities $\theta_{0,j}$. For full Bayesian inference, Gibbs sampling lets us easily compute the integrals required to use our uncertainty in the parameter estimates in calculating our estimate of $\Pr[c_i = 1 \mid y_i]$ and its uncertainty.

    Confused by Section 4

I don't understand why Ipeirotis et al. say, in section 4,

The main reason for this failure [in adjusting for spam] is the inability of the EM algorithm to identify the "strategic" spammers; these sophisticated spammers identify the class with the highest class prior (in our case the "G" class) and label all the pages as such.

Huh? It does in my calculations, as shown above, and in my experience with the Snow et al. NLP data and our own NE data. One problem in practice may be that if a spammer annotates very few items, we don't have a good estimate of their accuracies, and can't adjust for their inaccuracy. Otherwise, I don't see a problem.

Need Another Label(s)!

It occurred to me while working through the math for my last post that there are situations when you not only need another label for an existing item, but need more than one to achieve an expected positive payoff.

    Prior and Posterior Odds Ratios

First, let's summarize the result of adding another annotator. For a given image $i$, suppose the current (prior) odds of clean ($c_i = 1$) versus porn ($c_i = 0$) are $\phi$, which corresponds to a probability of being clean of $\phi / (\phi + 1)$. For instance, if the odds are 4:1 that the image is clean, the probability it is clean is 4/5.

Now suppose we get a label $y_{i,j}$ from annotator $j$ for image $i$. We update the odds to include the annotator's label to get new (posterior) odds $\phi'$, by the following formula if the label is $y_{i,j} = 1$:

$\phi' = \phi \times \dfrac{\theta_{1,j}}{1 - \theta_{0,j}}$.

We just multiply the prior odds by the likelihood ratio of the annotation given the category, $\Pr[y_{i,j} = 1 \mid c_i = 1] \,/\, \Pr[y_{i,j} = 1 \mid c_i = 0]$. The new probability of a clean image is thus $\phi' / (\phi' + 1)$.

    The Porn Payoff

Suppose the payoff for the porn task is as follows. A porn image classified as clean has a payoff of -100, a porn image classified as porn has a payoff of 0, a clean image classified as clean has a payoff of 1, and a clean image classified as porn has a payoff of -1. In this setup, we need odds of better than 100:1 (a bit more than 99% probability) that an image is clean before we return the decision that it's clean; otherwise we maximize our payoff by saying it's porn. Unless we are 100% sure of our decision, we always have an expected loss from returning a decision that an image is porn, because the payoff is zero for a true negative (classifying porn as porn), whereas the payoff is -1 for rejecting a non-porn image.

Now suppose that the prevalence of clean images is 20 in 21 (of course, we work with an estimate). So we start with odds of 20:1. The decision to classify an item as clean before annotation has an expected payoff of [20*1 + 1*(-100)]/21, or about -4 per decision. The decision to classify an image as porn before annotation has an expected payoff of [20*(-1) + 1*0]/21, which is around -1 per decision.

    Need Multiple Annotators

Clearly we need to annotate an item before we can have a positive expected payoff. Suppose for the sake of argument (and easy arithmetic) that we have an annotator with sensitivity 0.9 (they correctly classify 90% of the clean images as clean and reject 10% of the clean images as porn) and specificity 0.8 (they correctly classify 80% of the porn images as porn and let 20% through as clean). In this context, we actually need more than one annotator to annotate an item before we get a positive expected payoff.

We first need to work through our expectations properly. First, we start with 20:1 odds (probability 20/21) that an image is clean, so we can expect to see 20 clean images for each porn image. Then we have to look at the annotator's expected response. If the image is clean, there's a 90% chance the annotator says it's clean and a 10% chance they say it's porn. If the image is porn, there's an 80% chance they say it's porn and a 20% chance they say it's clean. That lets us work out the expectations for true category and response a priori. For instance, the chance the image is clean and the annotator says it's porn is 20/21 * 0.1 = 2/21.

We then calculate the updated odds under both possible annotator responses, figure out the new odds, and weight them by our expectations before seeing the label. I'll let you work out the arithmetic, but the upshot is that until you have more than one annotator, you can't get 100:1 odds or more of an image not being porn. The key step is noting that we only get positive expected payoff if we return that the image is clean, but the posterior odds if the annotator provides a 1 (clean) label are only 20/1 * 0.9/0.2 = 90:1. And don't forget to factor in that we only land in the happy situation of getting a clean label around 90% of the time, so the expected gain in payoff from the annotation is less than the improvement from 20:1 to 90:1 we get in the ideal case.

In reality, you're likely to have slightly better porn/not-porn annotators than this, because it's a relatively easy decision problem. But you're also likely to have spammers, as we mentioned last time, so what you really have is a mixed pool of spammers and cooperators.
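A sketch of the arithmetic, for readers who want to check the 90:1 claim themselves:

    public class OneLabelIsNotEnough {
        public static void main(String[] args) {
            double priorOdds = 20.0;             // 20:1 the image is clean
            double theta1 = 0.9, theta0 = 0.8;   // sensitivity, specificity

            // posterior odds after a single "clean" label:
            // 20 * 0.9/0.2 = 90:1, short of the 100:1 the payoffs demand
            double posteriorOdds = priorOdds * theta1 / (1.0 - theta0);
            System.out.println(posteriorOdds);   // 90.0

            // chance of even getting a clean label from this annotator
            double pClean = priorOdds / (priorOdds + 1.0);
            double pLabelClean
                = pClean * theta1 + (1.0 - pClean) * (1.0 - theta0);
            System.out.println(pLabelClean);     // ~0.867, "around 90%"
        }
    }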

    Unknown Annotators

I'll say again that one of the pleasant properties of the hierarchical-model extension of Dawid and Skene is that it allows us to predict the behavior of a new, unknown annotator. This is particularly useful in a Mechanical Turk setting where we can't choose our annotators directly (though we can feed items to an annotator if we write our own web app and they volunteer to do more). We just sum over predictions from each posterior sample of the hyperpriors, then sample an annotator from that, calculate what the outcome would be for them, and average the result. What's extra cool is that this includes all the extra dispersion in posterior inference that's warranted due to our uncertainty in the hyperpriors representing the population of annotators.

Danger Zone! of Data Science

I'm not a big fan of links-only posts, but I'm still chuckling at the "Danger Zone!" quadrant from Drew Conway's post The Data Science Venn Diagram on the Dataists blog:

Venn diagram by Drew Conway.

I just realized I posted to a cross-post, which is one of the reasons I don't like all this linking about. The original blog entry is on Drew Conway's blog, Zero Intelligence Agents.

Call for PhD Ideas for Lucene/Solr Implementation

Otis over at the Sematext blog wants people to identify hard problems faced by those working in search, which PhD students can select from and hopefully solve with a Lucene/Solr implementation. The students get a 'real world problem' and the world gets a concrete open-source implementation of the solution. The call is: Lucene / Solr for Academia: PhD Thesis Ideas.

My suggested PhD idea is tolerable-precision, high-recall dictionary matching of phrases. Mike Ross spent a good deal of time trying to get 100% matches of genes in MEDLINE abstracts given a dictionary of genes (Entrez Gene) and aliases. The core of the problem is that not all the mentions of genes are in the alias set for the gene. There are huge issues around efficiency in addition to getting it working at all.

T-Shirts, Suits, and Khakis

I've been traveling more, which means more instances of the following conversation.

    Q & A

Question: I can see that you [Bob] write most of the product code. What does everyone else at Alias-i do? And how do you make money?

Answer: I (Bob) do the t-shirt work, Breck does the suit work, and we both pitch in on the khakis work, though Breck does the lion's share. Although most of our income is license sales, not contracting, we do help our customers get up and running, though they do the integration with their products. Amazingly, we've been off the government research teat for the last two years. And yes, there really are just the two of us, a part-time admin, and the occasional contractor, so we don't need to sell oodles of software to stay afloat. By the way, the company's now called LingPipe.

    Explanation

The t-shirts do all the deep technical stuff involving math, stats, and algorithms. In my case, I also do the tech writing. The suits do all the business stuff involving business development, contracts, partnerships, marketing, sales, licensing, accounting, etc. The khakis do two things: keep the IT infrastructure working (almost all Breck, with some contractors) and work on customer-facing services/consulting projects (mostly Breck, with some contractors).

Literally? For me, yes, though I can be made presentable. Breck's more of the classic Billyburg hipster. Neither of us would be caught dead actually wearing the khakis-and-blue-shirt uniform of tech support and tech consulting.

    Paying the Bills

The question always strikes me as naive in the same way as a student's question about what the professor/lab-director, who never has any time for "real work" because they're always traveling or in meetings or giving presentations, actually does. If the lab director spent their time doing the fun stuff, there'd be no lab to direct! It's the same deal with a software company; there's lots more to do than write software.

One could argue that selling software is the only thing that really matters. Although it helps to have nice software to sell, it's not strictly necessary. On the other hand, nice software with no sales gets you bupkus.

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open-ended near-duplicate detection strategies such as Jaccard distance and will introduce the very important area of evaluation.

As before, you can download the code with this tarball or get it from subversion with the command

svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance; I will instead focus on more ideal solutions that favor simplicity and quality.

    Hashing Driven Deduplication

A simple way to find exact duplicates is to use a Java HashSet data structure for comparison of tweets. Once we get the tweets into a list, we iterate over them to see whether the exact string is already in the HashSet. Below is the recognition phase from src/DedupeSimpleHash.java:

The code iterates over the CSV-format search results, tests whether the HashSet already contains the string, and adds it if not. I could have just added all the strings and the HashSet would have behaved the same, but that is slightly automagical and might prove confusing. Below is the code for writing out the CSV format for the presumably unique tweets:

Note that I have kept the tweet in the same row position as in the search results. This convention will allow interaction with other components. Running the program on searches/Obama.csv yields the following:
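Since the listing itself is missing here, this is a minimal sketch of the recognition phase being described; the real version is in the tarball's src/DedupeSimpleHash.java:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class DedupeSketch {
        // keep a tweet only if its exact text hasn't been seen before
        static List<String> dedupe(List<String> tweets) {
            Set<String> seen = new HashSet<String>();
            List<String> unique = new ArrayList<String>();
            for (String tweet : tweets) {
                if (!seen.contains(tweet)) { // add() alone returns the same
                    seen.add(tweet);         // test, but this is explicit
                    unique.add(tweet);
                }
            }
            return unique;
        }
    }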

    Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good, but looking at the 'unique' tweets and sorting on the texts in a spreadsheet shows some other issues (note that your results will vary, since we are not running the searches on the same day):

These posts differ only in the bit.ly url, which makes them still pretty much the same tweet as far as I am concerned. Looking further, I see that tweets also get retweeted with an 'RT @user:' prefix pretty often. It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to make the tweets more uniform by eliminating "trivial" differences. A look at src/DedupeNormalizedHash.java shows regular expressions that replace the urls and retweet prefixes with the empty string. Running the normalized approach, the modifications are evident:

Both patterns apply to the tweet, removing 'RT @rolandsmartin: ' and 'http://bit.ly/gXZfqN http://fb.me/NRKRnQWy'. The normalization allows for identification of another 157 duplicates.

Normalization is a standard technique for working with text data. You see it in search engines, where 'walks', 'walking' and 'walked' are normalized to the root word 'walk' with a stemmer. Both documents and queries are normalized the same way, allowing a query 'Books about walking' to match documents that don't mention 'walking' but do mention 'walk', 'walks' or 'walked.' Normalization can bring more trouble than the problems it solves and should be used carefully; perhaps a future blog post is called for on our preference for character n-grams in situations where stemming is typically used.

Looking at the output of the normalized hash approach reveals more opportunities to eliminate near duplicates, and there may be more patterns to exploit, but eventually diminishing returns will bring the pattern-driven normalization effort to a close. Let's move to more open-ended approaches that better address the long-tail side of the issue:

The first pair differ only in an additional '<—indeed!', which is not a pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to decide explicitly what algorithm you use to say that the tweets are (mostly) the same. In my next post in the series I will cover Jaccard distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.

Breck

A Call to Code: Microsoft Research/Bing Query Spell Check Challenge

And I quote: "Microsoft Research in partnership with Bing is happy to launch the Speller Challenge." First place is $10,000; the contest starts Jan 17, 2011 and runs until May 27, 2011.

Speller Challenge

We have the baseline implementation of the required system in our tutorials; see our Spelling Tutorial. The system is not wrapped in a web service, and you will need to dig up your own training data—Wikipedia anyone?
There may be issues if you want to use LingPipe tools to build the system; here is our Royalty Free License. For the purposes of this contest we will create a free license that is non-restrictive. If you'd like to request a free license for the contest, send me an email at breck@lingpipe.com.

Breck

Processing Tweets with LingPipe #3: Near Duplicate Detection and Evaluation

    Summary:

In post #1 we covered how to search Twitter and get a useful on-disk data representation of the search results. In post #2 we covered the first level of deduplication, using HashSets as the test of sameness, and extended sameness by doing some normalization of the tweets, which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command

svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Despite the solutions above, we still have annoyingly similar tweets that are not useful for my goals in this set of posts. In particular, the near duplicates really foul up any attempt at cross validation, where one part of the data is used as training and the other as test data. If there are duplicates, then we are training on test data for some examples, and results end up looking too good.

    Tokenization and Jaccard Distance

The duplicate-detection problem in Twitter is really about word overlap, with a slight game-of-telephone quality to it. The elaborations of previous tweets tend to be leading or following comments, with the core of the source tweet preserved. Not much rephrasing is going on, so word overlap between tweets is the obvious place to go. That entails two elaborations to our approach:
  • Find words in the tweets: tokenization
  • Measure similarity of tweets without sensitivity to tweet length: Jaccard distance

    Tokenization

A very simple tokenizer can just break on the characters that separate words, as follows:

This code is in src/Tokenize1.java. It produces the output:

Another approach is to define what characters belong in a token and match that, but that is a little more complex to program, because String does not have a simple method call to capture an array of regex matches; see the regex chapter in the LingPipe book for more discussion. This approach produces the following output:

Note that the texts separating the tokens are very different. Between 'RT' and 'mparent77772' is the separator '@'. Depending on the needs of the application, it can be easier to define tokens by what they look like rather than by what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand, we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go; see chapter 8 of the LingPipe book draft for more details, or look at the Javadoc. An equivalent tokenization in the LingPipe API is created as follows:

Its output reports on both the whitespaces and the tokens. We add the normalization features of post #2 by extending ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In src/NormalizedTokenizerFactory.java there is a main method:

Running the above with ant normalizedTokenizerFactory produces nicely normalized output. How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.
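The listings are missing here too, so this is a rough stand-in for the token-matching flavor of tokenization, assuming LingPipe 4's RegExTokenizerFactory (the example tweet and regex are illustrative; the real classes are src/Tokenize1.java and src/NormalizedTokenizerFactory.java in the tarball):

    import com.aliasi.tokenizer.RegExTokenizerFactory;
    import com.aliasi.tokenizer.Tokenizer;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class TokenizeSketch {
        public static void main(String[] args) {
            String tweet = "RT @mparent77772: Obama on the move";
            // tokens defined by what they look like: runs of letters,
            // digits and apostrophes, so '@' and ':' become separators
            TokenizerFactory tf = new RegExTokenizerFactory("[a-zA-Z0-9']+");
            char[] cs = tweet.toCharArray();
            Tokenizer tokenizer = tf.tokenizer(cs, 0, cs.length);
            String token;
            while ((token = tokenizer.nextToken()) != null)
                System.out.println("token=|" + token + "|");
        }
    }

Swapping in a different TokenizerFactory, or wrapping this one in a ModifyTokenTokenizerFactory subclass that maps urls and retweet prefixes to nothing, changes the tokenization without touching any downstream code; that composability is the main reason to prefer the LingPipe factories over one-off regex loops.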

    Jaccard Distance as a Measure of Similarity

Jaccard distance is a good way to impress folks with a fancy-sounding term that, in our usage, is really just percent word overlap. If you think it helps, you can also call it by the term 'coefficient de communauté', but then you might have to reveal that it was popularized by a French botanist, which kind of undermines its jargon impact score. From the Jaccard Javadoc, the proximity of two character sequences:

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ExploreJaccard.java, and the relevant bit of code is:

The goal of the above wildly inefficient program is to explore what the closest tweet is, as determined by token overlap. The filterNormalizedDeduplicates(tweets, tokFactory) call filters out duplicates as discussed in post #2. We can then decide on a threshold for throwing out tweets with too much overlap. Running the program on the Obama.csv example:

and then viewing the output .csv file with a sort on column B, we get (click to see larger image):

Note that we get some values of 1 even though there are differences between Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving as the only difference the phrase "replacing trains by high speed buses" in Tweet1 versus "replacing high speed trains by high speed buses" in Tweet2. Since Jaccard works over sets of words rather than word counts, the additional "high speed" adds no new words to the set; they already exist in the tweet.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets still look quite similar:

There are 14 unique tokens, with 7 overlapping, yielding 50% overlap. 36% of the Obama-query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the", finds less overlap between tweets: only 3% of the tweets have another tweet with 50% overlap. The query "today" yields 14% of tweets overlapping. The lesson here is that some queries yield more uniform tweets, which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query: the "obama" search result entropy is lower (less random) than the results for "the", which are more random.

The "today" query had usefully different tweets at the 50% overlap level:

Despite sharing half their words, they are clearly not retweets and contain different information. It might be quite difficult to automatically reject "obama" near duplicates at 50% token overlap but retain the "today" near duplicates with 50% overlap; I encourage people to suggest approaches in the comments.

Finally, we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity, in src/DedupeJaccard.java. The interesting bit is:

Deduplication along these lines is a big deal for web search as well as for cleaning up Twitter feeds.
The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before, which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets, as well as into more sophisticated similarity metrics. A recent paper with a decent overview of past work is http://research.microsoft.com/pubs/130851/ANDD.pdf.

Running the code on the Obama search results removes half the tweets with a .5 cutoff. Looking at the remaining tweets, another issue immediately presents itself:

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. The next post will create a language-identification classifier, which will return to the notion of entropy discussed above.
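Since the relevant listing is missing, here is a sketch of the proximity computation, assuming LingPipe's com.aliasi.spell.JaccardDistance class (the tweets are the near-duplicate pair discussed above):

    import com.aliasi.spell.JaccardDistance;
    import com.aliasi.tokenizer.RegExTokenizerFactory;
    import com.aliasi.tokenizer.TokenizerFactory;

    public class JaccardSketch {
        public static void main(String[] args) {
            TokenizerFactory tf = new RegExTokenizerFactory("[a-zA-Z0-9']+");
            JaccardDistance jaccard = new JaccardDistance(tf);
            String t1 = "Obama replacing trains by high speed buses";
            String t2 = "Obama replacing high speed trains by high speed buses";
            // proximity is the fraction of token types the two texts share;
            // identical token sets give 1.0 despite the extra "high speed"
            System.out.println(jaccard.proximity(t1, t2)); // 1.0
            // a dedupe filter keeps a tweet only if its proximity to every
            // previously kept tweet falls below a cutoff, e.g. 0.5
        }
    }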

    Wrapup

In the quest to clean up the search results, I have introduced some key concepts in text processing that will resurface in various forms throughout these posts. Summarizing:
  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word-based processing. Oddly enough, in the next post we won't be using words for language id, but we will still be tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform, but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard distance to estimate the similarity of texts. We will see many more examples, implementing topic assignment, sentiment and more.
  • [223003270850] |
  • Evaluation: We have picked an operating point for near-duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.
Breck

Monitoring Convergence of EM for MAP Estimates with Priors

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say "until convergence", which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems, where there are no priors. Given data $y$ and parameters $\theta$, the goal is to find the parameters that maximize the likelihood function $p(y \mid \theta)$.

    Likelihood and Missing Data

Usually EM is used for latent-parameter problems, where there are latent variables $z$ which are treated like missing data, so that the full likelihood function is actually $p(y, z \mid \theta)$. For instance, $z$ might be mixture-component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as

$p(y, z \mid \theta) = p(z \mid \theta) \, p(y \mid z, \theta)$.

Even though the expectation (E) step of EM computes "expectations" for $z$ given current estimates of $\theta$ and the data $y$, these "expectations" aren't used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes $z$ away. Specifically, the maximum likelihood estimate is the one that maximizes the likelihood with $z$ marginalized out,

$\theta^* = \arg\max_\theta \, p(y \mid \theta) = \arg\max_\theta \sum_z p(y, z \mid \theta)$.

    Monitoring Likelihood or Parameters

There's more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g., by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

    Linearly Separable Data for Logistic Regression

In data that's linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter approaches infinity, the difference its (absolute) growth makes to the log likelihood diminishes (we're way out on the extremes of the logistic sigmoid at this point, where the slope's nearly 0).

    Convergence with MAP?

Textbooks often don't mention, whether for philosophical or pedagogical reasons, that it's possible to use EM for general maximum a posteriori (MAP) estimation when there are priors. Pure non-Bayesians talk about "regularization" or "shrinkage" (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate's the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, MAP estimates balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or to "regularize", if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.

    Maximization Step with Priors

In EM with priors, the maximization step needs to set $\theta^{(n)}$, the parameter estimate in the $n$-th epoch, to the value that maximizes the total probability, $\log p(y \mid \theta) + \log p(\theta)$, given the current "expectation" for the latent parameters $z$ based on the data $y$ and the previous epoch's estimate $\theta^{(n-1)}$. That is, you can't just set $\theta^{(n)}$ to maximize the likelihood, $\log p(y \mid \theta)$.

There are analytic solutions for the maximizer in many conjugate settings, like Dirichlet-Multinomial or Normal-Normal, so this isn't as hard as it may sound. And often you can get away with increasing the objective rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).
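As a concrete instance of one of those analytic solutions, here is the standard closed-form M step for a multinomial parameter with a Dirichlet prior, written with $e_k$ for the expected counts computed in the E step (assuming $\alpha_k > 1$ so the mode is interior):

    % MAP maximization step for multinomial \theta with a
    % Dirichlet(\alpha) prior, given expected counts e_k from the E step
    \theta_k^{(n)}
      = \frac{e_k + \alpha_k - 1}
             {\sum_{k'=1}^K \left( e_{k'} + \alpha_{k'} - 1 \right)}

The prior acts like $\alpha_k - 1$ pseudo-counts added to the expected counts, which is exactly the stabilizing effect described above.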

    Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus the log prior, $\log p(y \mid \theta) + \log p(\theta)$, not just the log likelihood $\log p(y \mid \theta)$. What EM guarantees is that every iteration increases this sum. If you just monitor the likelihood term $\log p(y \mid \theta)$, you'll see it bouncing around rather than monotonically increasing. That's because the prior's having its effect, and you need to take that into account.
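A minimal sketch of what such a monitor might look like, assuming the caller computes the objective (log likelihood plus log prior for MAP-EM) each epoch:

    public class EmConvergence {
        // relative-difference test on the EM objective; EM guarantees
        // objective >= lastObjective, so the difference is non-negative
        static boolean converged(double lastObjective,
                                 double objective,
                                 double relativeTolerance) {
            double relDiff = Math.abs(objective - lastObjective)
                / (Math.abs(objective) + Math.abs(lastObjective));
            return relDiff < relativeTolerance;
        }
    }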

pyhi: Python Package/Module Hello World with all the Trimmings

I've been teaching myself Python, and being the compulsive neat freak that I am, I first had to figure out its namespaces and how to package everything properly.

    pyhi: a Demo Python Package with all the Trimmings

If you want a skeletal application that does everything the right way (as far as I can tell from their online style recommendations), including package/module namespaces, unit tests, installer, documentation, and package scripts, check out:
  • pyhi Home Page

Of course, I'm happy to get advice if there are better ways to do this.

    Why Python?

I'm working with Andrey Rzhetsky and James Evans at the University of Chicago on a version of the Bayesian (and EM) annotation models in Python. I'm also working with Matt Hoffman, Andrew Gelman and Michael Malecki on Python at Columbia for Bayesian inference. Watch this space (and Andrew's blog) for news.

    Does Python Rock?

The short story is that I learned enough in a week to already use it for scripting and munging instead of Java. Compared to Perl, it's a positively genius work of minimalism and consistency. Everything works pretty much the way you'd expect. When you need C/Fortran back ends (all the optimization, distribution, and matrix libs), Python's a relatively clean front end. Numpy and PyMC are nice; the design of PyMC is particularly well thought out.

I love the generators and named arguments/defaults. I hate the whitespace syntax (no moving blocks of code with emacs auto-indent). I wish I had a bit more control over types and pre-allocation, but that's easily solved with utility functions. At least as of version 2.6, character strings are the usual mess (Java's became a mess when Unicode surpassed 16-bit code points), with one type for bytes and a different one for unicode strings (sort of like Java, only there are no built-in Java types for byte-sequence literals). The lack of backward compatibility among versions of Python itself reminds me how fantastic the Java releases have been in that regard, particularly the heroic effort of retro-fitting generics.

I find the lack of proper for(;;) loops or ++ operators rather perplexing; I get that they want everything to be a first-class object, but loop ranges seem to be taking this a bit far. And the "friendly" syntax for ternary operators is an oddly verbose and syntactically contorted choice for Python ("a if cond else b"). At least they left in break/continue. The idea of executing a file on import probably makes sense for an interpreted language, but boy is it ever slow (seconds to import numpy and pymc). It does let you wrap the imports in try/catch blocks, which strikes me as odd, but then I'm used to Java's more declarative, configurable, and just-in-time import mechanism.

Why doesn't assignment return its value? I can't write the usual C-style idiomatic I/O loops. There are so many opportunities for function chaining that aren't used. It must be some kind of stylistic battle where the Pythonistas love long skinny programs more than short fat ones.

Having to install C- and Fortran-based packages takes me straight back to 1980s Unix build hell and makes me appreciate the lovely distribution mechanism that is Java jar files. I found the Enthought distribution helpful (it's free for academics but pay-per-year for industrialists), because it includes numpy, after which the PyMC installer worked (on 32-bit Windows; I couldn't get anything 64-bit working due to GCC conflicts I didn't have the patience to sort out).

Of course, Python's a dog in terms of speed and memory usage compared to Java, much less to C, but at least it's an order of magnitude faster than R.

Chris Harrison's Awesome Word Association Visualizations

Wow! These visualizations, which I just saw linked from Slashdot, blew me away:
  • Chris Harrison's Visualizations

I particularly like the word-associations visualization, which compares pairs of words, such as good/evil, investigates the words that follow them in bigrams based on conditional-probability bands, and then sorts the words in each band by unigram frequency. The word spectrum visualization is also nice. By the use of space and scale, Harrison was able to show much more information than I've ever seen in a graph like this. Usually they look like giant hairballs.

The natural language processing part of this exercise is pretty much trivial. It'd be easy to do with the LingPipe language modeling package, for instance. I'd like to see some part-of-speech type things done this way, but that'd be of more interest to linguistic geeks than to the general public. Translation would also be interesting if you knew two languages. The Netflix data or other collaborative filtering data would be fun to visualize, too. As would phrasal data with a binary feature, such as Ryan McDonald et al.'s phrase-sentiment graph.

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

Following on from Breck's straightforward LingPipe-based application of Jaccard distance over sets (defined as the size of their intersection divided by the size of their union) in his last post on deduplication, I'd like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.

    The Book

Check out Chapter 3, Finding Similar Items, from:
  • Rajaraman, Anand and Jeff Ullman. 2010 (draft). Mining of Massive Data Sets.

It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it's not surprising it's at the perfect level for serious practitioners. Other presentations I've seen have been very theory heavy (though feel free to point out other refs in the comments).

    The Drill

[223003310110] |Here’s an overview of the scaling process, as currently understood, which may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is hairy, to a linear problem, the constant factor for which is manageable. [223003310120] |Step 1. [223003310130] |Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. [223003310140] |LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. [223003310150] |Of course, if all you need to do is the next steps, you don’t need to build string-based tokens; you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book). [223003310160] |Step 2. [223003310170] |From each document, extract multiple shingles. [223003310180] |Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof, which is where the name “shingle” comes from. [223003310190] |Rajaraman and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items. [223003310200] |Step 3. [223003310210] |Calculate minhash values for each of the shingles in a doc. [223003310220] |This provides a compressed representation of sets with the nifty property that the chance that two minhashes are equal is the same as the Jaccard similarity itself (explained in the book cited above). [223003310230] |There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code. [223003310240] |Step 4. [223003310250] |Calculate locality-sensitive hashes of the minhash values. [223003310260] |The point of locality-sensitive hashing is to map similar items to similar buckets. [223003310270] |There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation. [223003310280] |Step 5. [223003310290] |Test for equality using the locality-sensitive hashes (LSH). [223003310300] |This reduces the quadratic problem of comparing all pairs of docs to comparing only docs that collide in a bucket, with each hash comparison taking roughly constant time. [223003310310] |You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is! [223003310320] |Step 6. [223003310330] |You can then check the actual docs if you want to prevent false positive matches. [223003310340] |The book draft has nice examples of all of these things. [223003310350] |It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math; it just sticks to what you need to know to understand the problem and build an implementation yourself. [223003310360] |In fact, I may do just that. [223003310370] |
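To make steps 2 and 3 concrete, here is a minimal, self-contained Java sketch of shingling and MinHash signatures. It is an illustration only, not the book’s algorithms verbatim and not LingPipe’s API; the class and method names are made up, and the per-function hash remixing is an arbitrary stand-in for a properly seeded family of hash functions.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative shingling (step 2) plus MinHash signatures (step 3).
    public class MinHashSketch {

        // Overlapping word k-grams ("shingles") from a whitespace-tokenized doc.
        static Set<String> shingles(String doc, int k) {
            String[] toks = doc.toLowerCase().split("\\s+");
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + k <= toks.length; ++i) {
                StringBuilder sb = new StringBuilder(toks[i]);
                for (int j = 1; j < k; ++j)
                    sb.append(' ').append(toks[i + j]);
                grams.add(sb.toString());
            }
            return grams;
        }

        // One MinHash coordinate per hash function: the minimum over shingles
        // of a cheap remix of the shingle's hashCode(). Idealized, two docs
        // agree on a coordinate with probability equal to their Jaccard
        // similarity.
        static int[] signature(Set<String> shingles, int numHashes) {
            int[] sig = new int[numHashes];
            java.util.Arrays.fill(sig, Integer.MAX_VALUE);
            for (String s : shingles) {
                int h = s.hashCode();
                for (int i = 0; i < numHashes; ++i) {
                    int r = h ^ (0x9E3779B9 * (i + 1)); // arbitrary mixing constant
                    r ^= r >>> 16;
                    if (r < sig[i]) sig[i] = r;
                }
            }
            return sig;
        }

        public static void main(String[] args) {
            int[] s1 = signature(shingles("the quick brown fox jumped", 2), 64);
            int[] s2 = signature(shingles("the quick brown dog jumped", 2), 64);
            int agree = 0;
            for (int i = 0; i < s1.length; ++i)
                if (s1[i] == s2[i]) ++agree;
            System.out.println("estimated Jaccard similarity = " + agree / 64.0);
        }
    }

Under the idealization that each remixed hash behaves like a random permutation of the shingle universe, the fraction of signature coordinates on which two documents agree estimates their Jaccard similarity, which is exactly the property step 4 builds on.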

    Tunable Tradeoffs in Accuracy versus Speed

[223003310380] |One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. [223003310390] |By using more and more shingles and more and more locality-sensitive hash functions, you can get pretty much arbitrarily close to 1.0 in accuracy. [223003310400] |Given that the problem’s not so well defined to begin with, we can usually tolerate a bit of error on both the false positive and false negative side. [223003310410] |
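For the record, here is the standard banding analysis behind that tuning knob (from the Rajaraman and Ullman chapter): split each MinHash signature into b bands of r rows and call a pair a candidate if it agrees on every row of at least one band. If the true Jaccard similarity is s, then

    P(candidate pair) = 1 - (1 - s^r)^b

This is a steep S-curve in s: increasing r pushes the similarity threshold up (fewer false positives), while increasing b pulls it down (fewer false negatives).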

    What Does Nutch Do?

    [223003310420] |The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. [223003310430] |It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). [223003310440] |Does anyone have a pointer to how it works? [223003320010] |Which Automatic Differentiation Tool for C/C++? [223003320020] |I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. [223003320030] |In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book). [223003320040] |

    Why Hamiltonian Monte Carlo?

[223003320050] |Hamiltonian MC methods use gradient (derivative) information to guide Metropolis-Hastings proposals; a minimal sketch of the leapfrog update at the heart of the method appears after the reference below. [223003320060] |Matt is finding that it works well for the multilevel deep interaction models we’re working on with Andrew. [223003320070] |As the theory suggests, it works much better than Gibbs sampling when the variables are highly correlated, as they tend to be with multilevel regression coefficients. [223003320080] |The basic idea goes back to the paper: [223003320090] |
  • Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. [223003320100] |Hybrid Monte Carlo. [223003320110] |Physics Letters B 195(2):216–222. doi:10.1016/0370-2693(87)91197-X [223003320120] |
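As promised above, a minimal sketch of the leapfrog update that uses the gradient to drive proposals. It is generic rather than tied to any particular model; the Target interface and all names here are hypothetical.

    // Hypothetical sketch of the leapfrog integrator at the core of HMC.
    interface Target {
        double[] gradLogDensity(double[] theta); // gradient of log p(theta)
    }

    class Leapfrog {
        // One step of size eps: half-step on momentum, full step on position,
        // half-step on momentum; theta and p are updated in place.
        static void step(Target t, double[] theta, double[] p, double eps) {
            double[] g = t.gradLogDensity(theta);
            for (int i = 0; i < p.length; ++i) p[i] += 0.5 * eps * g[i];
            for (int i = 0; i < theta.length; ++i) theta[i] += eps * p[i];
            g = t.gradLogDensity(theta);
            for (int i = 0; i < p.length; ++i) p[i] += 0.5 * eps * g[i];
        }
    }

A full HMC iteration draws a fresh Gaussian momentum p, runs some number L of these steps, and then accepts or rejects the endpoint with a Metropolis test on the change in the Hamiltonian (negative log density plus kinetic energy p·p/2).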

    Why AD?

[223003320130] |Because we want to model general directed graphical models a la BUGS, we need to compute gradients. [223003320140] |If you need general gradients, it sure seems like automatic differentiation (AD) deserves at least some consideration. [223003320150] |AD is a technique that operates on arbitrary functions defined by source code, generating new source code that computes the derivative (thus it’s a kind of automatic code generation). [223003320160] |At a high level, it does just what you might expect: it computes the derivatives of built-in functions and constants, then uses the chain rule to put them together; a toy dual-number illustration follows the refs below. [223003320170] |Because it generates code, you can even have a go at tuning the resulting derivative code further. [223003320180] |Here are some basic refs: [223003320190] |
  • Wikipedia: automatic differentiation [223003320200] |
  • Community Site with Software Links: Autodiff.org [223003320210] |
  • Survey Paper: Bischof, C. H. and P. D. Hovland. 2008. [223003320220] |On the implementation of automatic differentiation tools. [223003320230] |Higher-Order and Symbolic Computation 21(3):311–331. [223003320240] |
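To make the chain-rule idea concrete, here is a toy forward-mode example using dual numbers, in Java for consistency with the rest of this blog. Real AD tools like those cited above work on C/C++/Fortran source and are vastly more sophisticated; every name here is illustrative.

    // Toy forward-mode AD: carry a value and its derivative together and
    // apply the chain rule inside each primitive operation.
    class Dual {
        final double val;  // f(x)
        final double dot;  // f'(x)
        Dual(double val, double dot) { this.val = val; this.dot = dot; }
        static Dual var(double x) { return new Dual(x, 1.0); } // seed dx/dx = 1
        static Dual con(double c) { return new Dual(c, 0.0); } // constants
        Dual add(Dual o) { return new Dual(val + o.val, dot + o.dot); }
        Dual mul(Dual o) { return new Dual(val * o.val, dot * o.val + val * o.dot); }
        static Dual exp(Dual a) {
            double e = Math.exp(a.val);
            return new Dual(e, e * a.dot); // d/dx exp(u) = exp(u) u'
        }
    }

For instance, with Dual x = Dual.var(2.0), evaluating x.mul(Dual.exp(x)) gives val = 2e^2 and dot = 3e^2, which matches f'(x) = (1 + x)e^x for f(x) = x exp(x) at x = 2.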

    So Which One?

[223003320250] |So what I want to know is, if I’m going to be coding in C, which implementation of AD should I be using? [223003320260] |(There are implementations in everything from Fortran to Python to Haskell.) [223003320270] |Update 21 January 2010: So, two problems rear their ugly heads. [223003320280] |One: almost all of the tools other than Tapenade require you to seriously modify the code with different data types and flags for the code generator. [223003320290] |Two: most of these tools are for non-commercial use only and can’t be redistributed. [223003320300] |Not exactly how we’d like to roll together our own academic software. [223003320310] |This is especially annoying because they’re almost all government-research funded (Tapenade in France, many of the others in the US, with copyright held tightly by institutions like the University of Chicago and INRIA). [223003320320] |Many of the packages require extensive secondary packages to be installed. [223003320330] |I miss Java and Apache. [223003320340] |

    Tapenade’s Online AD for C, C++ and Fortran

    [223003320350] |Update 21 January 2010: maybe people coming from the blog killed it, but the Tapenade Online app is no longer up and hasn’t been since an hour or two after this blog post. [223003320360] |You can try it on your own C, C++, or Fortran program using the [223003320370] |
  • Tapenade Online AD app [223003320380] |For instance, consider a simple (intentionally non-optimal) C function. [223003320390] |Running Tapenade in forward mode generates a derivative function. [223003320400] |If you plug in 1.0 for xd (in general, this argument can be used for chaining functions), you can see that the generated function returns the derivative and also sets the value of the argument f (given as a pointer) to the original function’s value. [223003320410] |In backward mode, Tapenade generates adjoint code. [223003320420] |If you squint hard enough, you’ll realize this method computes the same derivative. [223003320430] |It’s set up to run the chain rule in reverse. [223003320440] |

    Citation Stats?

    [223003320450] |Yes, I know this is totally crass, but nevertheless, I went to autodiff.org and checked out the many systems for C/C++ they cite, and then did a Google Scholar query [SystemName "automatic differentiation"], restricted to results after (including?) 2005. [223003320460] |Here’s what I found: [223003320470] |This doesn’t seem to be expressing a particularly clear community preference. [223003320480] |So I was hoping some of you may have suggestions. [223003320490] |I should add that many of the papers are themselves surveys, so don’t actually correspond to someone using the package, which is what I’d really like to know. [223003330010] |The Matrix Cookbook [223003330020] |I was browsing through the doc for the Eigen matrix package (C++ templated, which I need for automatic differentiation) and they had a pointer to the following extended (70+ page) matrix “cheat sheet”: [223003330030] |
  • K. B. Petersen and M. S. Pedersen. 2008. [223003330040] |The Matrix Cookbook. [223003330050] |Technical Report. [223003330060] |It contains all kinds of useful definitions and derivations, most relevantly for me, gradients of all the matrix operations (inverses, determinants, traces, eigenvalues). [223003330070] |They even have the gradient for the multivariate normal density function (see section 8), which has a pleasingly simple form. [223003330080] |Now if I could just get them to do the Wishart and multivariate-t, which they define, but don’t differentiate. [223003330090] |And how great is that pair of author names, “Petersen” and “Pedersen”? [223003330100] |They’re reminiscent of Tintin’s Thomson and Thompson. [223003330110] |The authors even have a blog about The Matrix Cookbook, though it hasn’t been updated in almost a year. [223003340010] |LingPipe Book Draft, Version 0.3 [223003340020] |I’ve finished the section on naive Bayes and made a bunch of fairly minor updates to earlier sections based on comments from Breck. [223003340030] |As before, you can find the latest version at: [223003340040] |
  • LingPipe Book, v0.3
  • [223003350010] |Apache Lucene 3.0 Tutorial [223003350020] |With this release of the LingPipe Book, I created a standalone version of the tutorial for version 3 of the Apache Lucene search library. [223003350030] |It contains about 20 pages covering the basics of analysis, indexing and search. [223003350040] |It’s distributed with sample code and an Ant build file with targets to run the demos. [223003350050] |
  • Lucene Version 3.0 Tutorial (pdf)
  • [223003350060] |
  • Supporting Code (tar, gzipped)
  • [223003350070] |

    Building the Source

[223003350080] |The Ant build file is in the file src/applucene/build.xml and should be run from that directory. [223003350090] |The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, while still being able to share libs. [223003350100] |There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated. [223003350110] |

    More In-Depth Info on Lucene

    [223003350120] |The standard reference for Lucene is not its own site or javadoc, which are fairly limited tutorial-wise, but rather the recently released (as of February 2011) book by three Lucene committers: [223003350130] |
  • Michael McCandless, Erik Hatcher, and Otis Gospodnetić. 2010. [223003350140] |Lucene in Action, Second Edition. [223003350150] |Manning Press.
  • [223003350160] |Looking at the Manning Press page for the book (linked above), I just realized they blurbed one of my previous blog posts, a review of Lucene in Action! [223003350170] |

    But wait, there’s more

    [223003350180] |If you’re interested in natural language, or just need a tutorial on character encodings and Java strings and I/O, you can find the rest of the LingPipe book at its home page: [223003350190] |
  • LingPipe Book (Draft) [223003350200] |Enjoy. [223003350210] |And as always, let me know if you have any comments, here, or directly to carp@lingpipe.com. [223003360010] |LingPipe Baseline for MITRE Name Matching Challenge [223003360020] |We’ve released a baseline system for MITRE’s name matching challenge. [223003360030] |Unfortunately, there’s no training data, so I used a simple heuristic matcher that we’ve used before for this problem and related problems. [223003360040] |

    In First Place (for now)

    [223003360050] |As of right this second, we’re in first place on their leaderboard (team name “Agent Smith”). [223003360060] |There aren’t many participants yet, so I don’t expect to enjoy this position for long. [223003360070] |

    Check out the LingPipe Baseline

[223003360080] |You can anonymously check out the source for our entry from the LingPipe Sandbox with the Apache Subversion version control system via the command: [223003360090] |There’s an Ant build file that runs the eval and a README.txt to help you get started. [223003360100] |I’ll repeat the contents of the README (as it stands now) below. [223003360110] |

    Mean Average Precision Scoring

[223003360120] |The contest is being scored by mean average precision. [223003360130] |This measure does not penalize you for returning long lists of irrelevant results. [223003360140] |Our baseline returns lots of results, and its results for the variously balanced F measures that are reported could probably be improved by tuning thresholds and the maximum number of results returned. [223003360150] |What I’d like to see, as usual, is a precision-recall curve or ROC curve derived from setting a threshold on the scores. [223003360160] |But it’s hard to rank bakeoff entries by ROC curve, so IR bakeoffs typically resort to area under one of these curves or a measure like mean average precision (sort of like area under the curve; see the LingPipe book‘s section on classifier evaluation for more discussion). [223003360170] |
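For reference, the standard definition: for a query q with R_q relevant items, average precision is

    AP(q) = (1/R_q) * SUM_k Prec(k) * rel(k)

where rel(k) is 1 if the item at rank k is relevant and 0 otherwise, and Prec(k) is the precision among the top k results; MAP is the mean of AP(q) over queries. Because every term is attached to a relevant item, tacking irrelevant results onto the end of a ranking leaves the score unchanged, which is exactly the lack of penalty noted above.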

    README File

    [223003360180] |

    The Contest

    [223003360190] |MITRE is hosting an unsupervised name-matching challenge. [223003360200] |The home page is: [223003360210] |http://www.mitre.org/work/challenge/ [223003360220] |From there, you can register for the challenge and download the data, which consists of two lists of names, one long “index” file (about 825K names), and one short “query” file (around 9K names). [223003360230] |The task is to rank potential matches in the index file for each name in the query file. [223003360240] |The names are broken into forenames and surnames, but there’s no guarantee this assignment is accurate. [223003360250] |There is no gold standard as to which names actually match, so MITRE’s using a liberal notion of matching corresponding to “requires further human review to reject”. [223003360260] |Furthermore, this notion of matching may change as the task evolves. [223003360270] |

    Example Matches

[223003360280] |The only way (I know of) to see some example matches is to download the distribution and look at the documentation (.docx format; I had to install the latest OpenOffice to read it). [223003360290] |They list examples of what they take to be potential matches for further review. [223003360300] |

    The LingPipe Baseline

[223003360310] |This directory contains baseline code for an entry based on character n-grams. [223003360320] |The entry is set up so that the character n-gram scores are used as a filter, which should have high sensitivity (low false negative rate) for matches, though low specificity (high false positive rate). [223003360330] |Parameters controlling its aggressiveness may be tuned. [223003360340] |If you download the data and unpack it into directory $DATADIR, you can run the task as follows: [223003360350] |This’ll write the output match rankings into a fresh file in the output subdirectory /runs. [223003360360] |If you don’t change the output routines, the output will be in the appropriate format for submission to the contest. [223003360370] |You can turn verbose debugging on via the flag DEBUG in the code itself. [223003360380] |

    Three Pass Refinement

    [223003360390] |Our approach is based on three passes. [223003360400] |

    1. Indexing

    [223003360410] |Create a character n-gram index and select potential pairs based on having at least one n-gram in common. [223003360420] |The n-gram length is parameterizable through the constant INDEX_NGRAM. [223003360430] |Setting it lower increases run time but may increase sensitivity for obscure matches. [223003360440] |
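Here is an illustrative sketch of what such a first pass might look like; this is not the actual contest code (the names and data structures are invented), just the shape of an n-gram-keyed inverted index over the name strings.

    import java.util.*;

    // Hypothetical first-pass candidate generation: an inverted index from
    // character n-grams to the names containing them; any two names sharing
    // an n-gram become a candidate pair.
    class NGramIndex {
        static final int INDEX_NGRAM = 4; // illustrative value for the tunable

        static Set<String> ngrams(String name, int n) {
            Set<String> grams = new HashSet<String>();
            for (int i = 0; i + n <= name.length(); ++i)
                grams.add(name.substring(i, i + n));
            return grams;
        }

        static Map<String, List<Integer>> index(List<String> names) {
            Map<String, List<Integer>> inv = new HashMap<String, List<Integer>>();
            for (int id = 0; id < names.size(); ++id)
                for (String g : ngrams(names.get(id), INDEX_NGRAM)) {
                    List<Integer> post = inv.get(g);
                    if (post == null) inv.put(g, post = new ArrayList<Integer>());
                    post.add(id);
                }
            return inv;
        }
    }

Candidate matches for a query name are then just the union of the postings lists for the query’s n-grams.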

    2. TF/IDF Scoring and Filtering

[223003360450] |Rank the first-pass possible matches by TF/IDF distance over their character n-grams. [223003360460] |The range of n-grams is parameterizable with MATCH_NGRAM_MIN and MATCH_NGRAM_MAX; setting these to 2 and 4 respectively produces matches based on 2-, 3-, and 4-grams. [223003360470] |TF/IDF weighting will weight the less frequent n-grams more heavily. [223003360480] |The maximum number of candidates surviving this stage may be bounded by setting the MAX_RESULTS_INDEX variable in the code. [223003360490] |
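As a reminder of the general form (one standard instantiation; the exact weighting in the distributed code may differ), the TF/IDF score between a query name q and a candidate d over shared n-grams g is the cosine

    score(q, d) = SUM_g w_q(g) * w_d(g) / (||w_q|| * ||w_d||),  with  w_d(g) = tf_d(g) * log(N / df(g))

where tf_d(g) counts occurrences of g in d, df(g) is the number of names containing g, and N is the total number of names. The log(N / df(g)) factor is what weights the rarer n-grams more heavily, as noted above.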

    3. Rescoring

[223003360500] |The rescoring method [223003360510] |takes the original fields (first name/last name) as string arrays and a double consisting of the n-gram match score, and allows an arbitrary score to be returned. [223003360520] |As distributed, this method just returns the score passed in. [223003360530] |The maximum number of results surviving the final ranking is determined by the variable MAX_RESULTS. [223003360540] |This will write a system response ready for submission to the challenge in the default output directory /runs. [223003360550] |
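For concreteness, the hook described above presumably has roughly the following shape; the actual method name and signature in the distribution may differ, so treat this as a hypothetical sketch:

    // Hypothetical rescoring hook: gets the query and candidate name fields
    // plus the n-gram match score, and returns the final score.
    class Rescorer {
        double rescore(String[] queryFields, String[] candidateFields,
                       double ngramScore) {
            return ngramScore; // as distributed: pass the filter score through
        }
    }

Any domain knowledge, say rewarding exact surname matches or penalizing large edit distances over forenames, would go here.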

    Questions and Comments

[223003360560] |Let us know if you have questions or comments about our distribution. [223003360570] |Feel free to post your uses here or give the rest of us suggestions on how to improve the baseline. [223003370010] |When You Hear Hoofbeats… [223003370020] |I’ve been learning C++ and feeling like a total newbie. [223003370030] |I spent about half an hour debugging a “forgot to put a semicolon at the end of a class definition” bug and several hours on a “virtual functions must be defined, not just declared, in abstract base classes” bug. [223003370040] |Oddly, I sorted out my templating/overloading bug really quickly. [223003370050] |

    If you hear hoofbeats…

    [223003370060] |This got me to thinking about my old friend, Hunt and Thomas’s Pragmatic Programmer book. [223003370070] |One of their pieces of advice for debugging is: [223003370080] |If you hear hoofbeats, think horses, not zebras. [223003370090] |(Of course, this isn’t new —Wikipedia’s Zebra (medicine) page is for a surprising diagnosis in a medical context, with the analogy attributed to Theodore Woodward in the 1940s.) [223003370100] |

    Uh, it’s not working

[223003370110] |It dawned on me that the problem newbies have is that we’ve never seen a horse, so when we hear hoofbeats, we have no idea what to expect. [223003370120] |A better analogy is that I grew up in the American wild west and recently moved to Africa, so the sounds are vaguely familiar in one sense (animals, birds, water) yet entirely different in another (what’s a parrot?). [223003370130] |But my instincts are pretty off. [223003370140] |Who’d have thought you needed to define a virtual method even in an abstract base class? [223003370150] |I found some online resources pretty helpful. [223003370160] |Ironically, most useful were the clueless newbs who sent their error messages without their code, and the wizened sages who conjectured on possible causes. [223003370170] |I realize I can do this in my sleep in Java, but have no clue at all when it comes to C++. [223003370180] |

    Please build the Hoofbeats web site

[223003370190] |I found sites like StackOverflow pretty useless for the problems I was having. [223003370200] |That’s because they’re indexed sensibly by content. [223003370210] |What I want is a reverse index where you type in your error message and get back an enumerated list of possible causes. [223003370220] |You could even set it up like a community site. [223003370230] |For instance, if your error message was “hoofbeats”, people could add possible causes like “horses” (972 votes [let's be optimistic]), “zebras” (112 votes, qualified by “likely only if you’re in Africa or at a zoo”), and “antelope” (22 votes). [223003370240] |Pretty please. [223003370250] |Not just for me, for humanity. [223003380010] |Scoring Betting Pools for March Madness with a Bradley-Terry Model [223003380020] |I was commenting on a Wall St. Journal blog post on NCAA bracket math and figured I could actually elaborate on the math here. [223003380030] |For those of you who don’t know what it is, “March Madness” is a tournament of college basketball teams in the United States. [223003380040] |The same approach could be used for the World Cup, chess tournaments, baseball seasons, etc. [223003380050] |Anything where bettors assign strengths to teams and then some subset of those teams play each other. [223003380060] |It can be round robin, single elimination, or anything in between. [223003380070] |

    The Problem: A Betting Pool

[223003380080] |Now suppose we want to run an office betting pool [just for fun, of course, as we don't want to get in trouble with the law]. [223003380090] |We want everyone to express some preferences about which teams are better and then evaluate who made the best predictions. [223003380100] |The contest will consist of 63 games total in a single-elimination tournament. [223003380110] |(First round is 32 games, next 16, next 8, then 4, 2, and 1, totaling 63.) [223003380120] |The big problem is that who plays in the second round depends on who wins in the first round. [223003380130] |With 64 teams, there are 2^63 possible ways the tournament can play out (two outcomes for each of the 63 games), a few too many to enumerate. [223003380140] |You could do something like have everyone rank the teams and then somehow try to line those up. [223003380150] |See the WSJ blog post for more ad hoc suggestions. [223003380160] |

    The Bradley-Terry Model

[223003380170] |The Bradley-Terry model is a model for predicting the outcome of pairwise comparisons. [223003380180] |Suppose there are N items (teams, in this case) being compared. [223003380190] |Each item i gets a coefficient alpha_i indicating how strong the team is, with larger numbers being better. [223003380200] |The model then assigns the following probability to a matchup: Pr(team i defeats team j) = logit^{-1}(alpha_i - alpha_j). [223003380210] |The inverse logit is logit^{-1}(x) = 1 / (1 + exp(-x)). [223003380220] |For instance, if team i and team j have the same strength, that is alpha_i = alpha_j, then the probability is even. [223003380230] |If alpha_i - alpha_j grows large, then the probability approaches 1 that team i will defeat team j. [223003380240] |As an aside, there have been all kinds of adjustments to this model. [223003380250] |For instance, you can add an intercept term for home team advantage, and perhaps have this vary by team. [223003380260] |You can imagine adding all sorts of other random effects for games. [223003380270] |We can also add a third outcome for ties if the sport allows it. [223003380280] |

    Scoring Predictions

[223003380290] |Suppose we have each bettor assign a number alpha_i to each team i. [223003380300] |This vector of team strength coefficients determines the probability that team i defeats team j for any pair of teams i and j. [223003380310] |Suppose the games are numbered 1 to 63 and that in game k, the result is that team i_k defeated team j_k. [223003380320] |Then the score assigned to a bettor’s ratings of team strengths is: [223003380330] |score(alpha) = SUM_{k=1}^{63} log logit^{-1}(alpha_{i_k} - alpha_{j_k}). [223003380340] |Higher scores are better. [223003380350] |In words, what happens is that for each game, you get a score that’s the log of the probability you predicted for the team that won. [223003380360] |The total score is just the sum of these individual game scores, so it’s the log of the total probability that your rankings assigned to what actually happened. [223003380370] |For instance, let’s suppose there are three teams playing each other round robin. [223003380380] |Let’s suppose that team 1 beats team 2, team 1 beats team 3 and team 3 beats team 2. [223003380390] |Now suppose we assigned strengths alpha_1 to team 1, alpha_2 to team 2 and alpha_3 to team 3. [223003380400] |The total score would be [223003380410] |log logit^{-1}(alpha_1 - alpha_2) + log logit^{-1}(alpha_1 - alpha_3) + log logit^{-1}(alpha_3 - alpha_2). [223003380420] |Exponentiating the total score gives the probability assigned by the coefficients to the outcomes of the three games, which for the example strengths works out to 0.20. [223003380430] |Breaking out the probabilities for the individual games, note that [223003380440] |Pr(1 beats 2) = logit^{-1}(alpha_1 - alpha_2), Pr(1 beats 3) = logit^{-1}(alpha_1 - alpha_3), and Pr(3 beats 2) = logit^{-1}(alpha_3 - alpha_2). [223003380450] |Note that multiplying these probabilities together yields the same 0.20. [223003380460] |
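Here is a small Java sketch of the scoring rule; the team strengths in main() are made up for illustration (so the resulting probability will not be the 0.20 of the example above), and teams are indexed from 0 rather than 1.

    // Sketch of scoring a bettor's strength vector against observed games.
    class BradleyTerryScore {
        static double inverseLogit(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        // winners[k] and losers[k] are the team indices for game k.
        static double score(double[] alpha, int[] winners, int[] losers) {
            double logProb = 0.0;
            for (int k = 0; k < winners.length; ++k)
                logProb += Math.log(inverseLogit(alpha[winners[k]] - alpha[losers[k]]));
            return logProb; // higher (closer to zero) is better
        }

        public static void main(String[] args) {
            double[] alpha = { 1.0, -1.0, 0.0 }; // invented strengths
            int[] winners = { 0, 0, 2 };         // round robin from the text:
            int[] losers  = { 1, 2, 1 };         // 1 beats 2, 1 beats 3, 3 beats 2
            System.out.println("total probability = "
                               + Math.exp(score(alpha, winners, losers)));
        }
    }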

    Fitting the Model Given Data

[223003380470] |Given the outcomes, it’s easy to optimize the coefficients. [223003380480] |It’s just the maximum likelihood estimate for the Bradley-Terry model! [223003380490] |Of course, we can compute Bayesian posteriors and go the full Bayesian inference route (Gelman et al.’s Bayesian Data Analysis goes over the chess case, where there may be draws). [223003380500] |An obvious way to do this would be to use the season’s games to fit the coefficients. [223003380510] |You could add in all the other teams, too, which provide information on team strengths, then just use the coefficients for the teams in the tournament for prediction. [223003380520] |A multilevel model would make sense in this setting, of course, to smooth the strengths. [223003380530] |The pooling could be overall, by division, or whatever else made sense for the problem. [223003380540] |
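Concretely, with games k having winner i_k and loser j_k, the log likelihood is SUM_k log logit^{-1}(alpha_{i_k} - alpha_{j_k}), and its gradient has the tidy form

    d/d alpha_m = SUM_k (1 - logit^{-1}(alpha_{i_k} - alpha_{j_k})) * ([i_k = m] - [j_k = m])

where [·] is 1 if the condition holds and 0 otherwise. The log likelihood is concave, so any gradient-based optimizer will do; note that the coefficients are only identified up to an additive constant, which is one more reason to pin one coefficient down or add the prior/multilevel smoothing just mentioned.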

    How to Explain it to the Punters?

    [223003380550] |No idea. [223003380560] |You could have them assign numbers and then they could explore what the predictions are for any pair of teams. [223003390010] |What is a (Mathematical) Model? [223003390020] |I was shocked and dismayed when I heard from a reader that I’d used the term “model” over 200 times in the LingPipe book without ever saying what it meant. [223003390030] |This kind of thing is why it’s so hard to write introductory material. [223003390040] |Perhaps I shouldn’t have been surprised at this comment, because other people had expressed some uncertainty about the term “model” to me in the past. [223003390050] |

    What is a (Mathematical) Model?

    [223003390060] |In short, when I say “model”, I mean it in the bog-standard scientific sense, as explained on: [223003390070] |
  • Wikipedia: Mathematical Model [223003390080] |Quite simply, it’s just a bunch of math used to describe a phenomenon. [223003390090] |Nothing interesting here either philosophically or conceptually, just the usual scientific method. [223003390100] |For instance, Newton’s equation (force equals mass times acceleration) is a mathematical model that may be used to describe the motions of the planets, among other things. [223003390110] |Newton derived his model from Kepler’s observation that the planets swept out equal areas in equal times in their orbits. [223003390120] |Newton realized that by introducing the notion of “gravity”, he could model the orbits of the planets. [223003390130] |Of course, he had to invent calculus to do so, but that’s another story. [223003390140] |

    Prediction vs. Postdiction

    [223003390150] |Typically models are used for predicting future events, but sometimes they’re used to retroactively try to understand past events (“backcasting” ["aftcasting" if you're nautical] is the time opposite of “forecasting”, and “postdiction” the opposite of “prediction”). [223003390160] |For instance, climate scientists attempt to postdict/backcast earth temperatures from data such as tree rings; we’re working on fitting models of such data with Matt Schofield as part of our Bayesian inference project at Columbia. [223003390170] |

    All Models are Wrong, but …

[223003390180] |As the statistician George E. P. Box said, “Essentially, all models are wrong, but some are useful.” [223003390190] |For instance, Newton’s model is wrong in that it doesn’t correct for relativistic effects at very high velocities. [223003390200] |But it proved useful at predicting everything from eclipses to the tides. [223003390210] |The models we’ve used in LingPipe, such as the HMM model of part-of-speech tagging, are also clearly wrong. [223003390220] |Language just isn’t Markovian (meaning that the n-th word depends only on a fixed window of the previous few words). [223003390230] |But we can still do pretty well at predicting part of speech tags with the simplified model. [223003400010] |Resistance is Futile —I’ve been Assimilated (by Apple) [223003400020] |The second most popular post ever on this blog was my review of Lenovo’s Thinkpad W510. [223003400030] |I used to really like the IBM Thinkpads. [223003400040] |(The number one post is the link to 64-bit versions of Eclipse, and I don’t even use Eclipse except for an occasional refactoring or stepwise debug.) [223003400050] |I just gave my 15-month-old quad-core, high-res, 8GB Thinkpad W510 with a 128GB SSD to one of our grad students after Andrew bought me a shiny new Macbook Air 13″ with 4GB and a 256GB SSD. [223003400060] |Apple only just a few weeks ago came out with similar hardware for the Macbook Pro, by the way, and Lenovo’s not exactly rushing out of the gate with new CPUs. [223003400070] |

    Evaluation Period Honeymoon

    [223003400080] |I thought at the very least, if I hated Mac OS X, I could just install Windows. [223003400090] |If the Air was too small, I could just use it for travel. [223003400100] |Within about 24 hours, having watched Apple’s converting from Windows to Mac video, I’m a convert. [223003400110] |So much so that Mitzi’s been getting sick of hearing me tell her how great the new Macs are. [223003400120] |We’d both last used them with any frequency when I had a Mac Quadra in the early 1990s. [223003400130] |Among my favorite things is emacs commands in text windows. [223003400140] |And I’m not sure how I lived without the gestures for scrolling and navigating the web. [223003400150] |It really is a unix machine underneath, though I never had any problems with Cygwin on Windows. [223003400160] |I also love the keyboards. [223003400170] |That and the weight are what finally soured me on the Thinkpads —their new ones are terrible. [223003400180] |Did I mention the insane battery life? [223003400190] |Or just how nice it feels? [223003400200] |It’s like moving from a Kodak Brownie to a Leica. [223003400210] |The last time I was this impressed with a piece of hardware was when my grad school department got its first Sun Workstation I way back in 1985 or 1986. [223003400220] |It’s not perfect. [223003400230] |Does Microsoft own the patent on grabbing windows by their sides to resize them? [223003400240] |It’s not possible in either Linux or the Mac [I was corrected in the comments; it is possible in Linux, just very trying as a test of hand-eye-mouse coordination]. [223003400250] |And why do I have a delete key that’s really a backspace and no real delete key? [223003400260] |(Yes, I realize fn-delete adds up to a delete.) [223003400270] |And what’s the function key doing where my control key should be? [223003400280] |Does anyone ever use caps lock for anything? [223003400290] |Why do computers still have them? [223003400300] |Also, I don’t find Macs as blindingly obvious as everyone else says they are. [223003400310] |For instance, I had no clue as to what to do with the “power cord”. [223003400320] |I went online. [223003400330] |I read the manual. [223003400340] |I’d have thought it was the wrong piece, but it was listed in the parts. [223003400350] |Turns out you need to take their power supply apart and then it clips in. Huh? [223003400360] |It reminds me of getting an iPod many years ago and having no clue how to turn it off. [223003400370] |

    No Half Measures

    [223003400380] |Given that the money was turned on at Columbia, I ordered the Macbook Air for home and travel, as well as a 27″ iMac for the office, and an iPad 2. [223003400390] |The iMac is very sweet, but somehow not nearly as cool as the Air. [223003400400] |I can’t wait until the iPad shows up —I’m tired of printing out PDFs. [223003400410] |As soon as the iPhone 5 shows up, I’m getting one from Verizon (I don’t even have a cellphone now). [223003400420] |

    Welcome to the New Borg

    [223003400430] |The old joke used to be that Microsoft was like The Borg, whose tagline was “you will be assimilated”. [223003400440] |Slashdot even uses a Borged-out Bill Gates as an icon. [223003400450] |It turns out that as hard as Google’s trying to be the new Borg, the current Borg headgear apparatus rests on Apple’s head. [223003400460] |After all, they’ve not just locked down their hardware, they’re also trying to take over the music and movie distribution business. [223003400470] |That’s why I’m so surprised I couldn’t find a Steve-Jobs-as-Borg image. [223003400480] |C’mon Slashdot, help me out here. [223003400490] |

    Tip of the Day

    [223003400500] |When your significant other says he or she thinks you like your new Macbook Air more than him or her, do NOT reply, “It’s so thin!”. [223003400510] |Luckily, Mitzi has a sense of humor to go with her yoga-and-running-toned body. [223003410010] |LingPipe Book Draft 0.4 [223003410020] |I just put up the latest draft of the LingPipe book. [223003410030] |Since the last edition, I’ve added a chapter on character language models and added details of the various classifier interface and classification types. [223003410040] |The location’s the same: [223003410050] |
  • LingPipe Book [223003410060] |I’m about to roll out LingPipe 4.0.2, which patches a few minor inconsistencies that have turned up in writing the book (like process character LMs not being serializable and some remaining generic arrays that I’ve recoded with lists [and deprecated the generic array versions]). [223003420010] |Pro Bono Projects and Internships with LingPipe [223003420020] |We have initiated a pro bono program that teams a LingPipe employee with a programmer wanting to learn about LingPipe to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project. [223003420030] |So far it is going pretty well, with one classification project underway. [223003420040] |The way it went was that I worked with an intern, a solid NLP programmer, to help a Ga Tech researcher get a classification system up and running. [223003420050] |Now the intern is running the project with some feedback from me, but not much is required at this point. [223003420060] |We are taking applications for interns wanting to learn how to code with LingPipe. [223003420070] |We also will take applications for projects. [223003420080] |Non-profits or academic research only, please. [223003420090] |Just send breck@lingpipe.com an email outlining your skills/needs and we will see what we can do. [223003420100] |Breck [223003430010] |Zeno’s Version Numbering [223003430020] |In a talk at Columbia last week, Yoav Goldberg mentioned that version 0.99 had replaced version 0.9 of his Hebrew parser. [223003430030] |I suggest we use the term “Zeno Version Numbering” to describe such a scheme. [223003430040] |I first saw it used for the Lucene index browser Luke, which was up to version 0.9.9 before moving. [223003430050] |

    Zeno, according to Rick Martin

    [223003430060] |Zeno’s paradox is perhaps best explained as a joke. [223003430070] |Here’s Richard F. “Rick” Martin’s version (from his awesome engineer/physicist/mathematician jokes page): [223003430080] |In the high school gym, all the girls in the class were lined up against one wall, and all the boys against the opposite wall. [223003430090] |Then, every ten seconds, they walked toward each other until they were half the previous distance apart. [223003430100] |A mathematician, a physicist, and an engineer were asked, “When will the girls and boys meet?” [223003430110] |The mathematician said: “Never.” [223003430120] |The physicist said: “In an infinite amount of time.” [223003430130] |The engineer said: “Well… in about two minutes, they’ll be close enough for all practical purposes.” [224000010010] |Welcome [224000010020] |I enjoy reading and (occasionally) posting on other research blogs, most specifically John Langford's Machine Learning Theory. [224000010030] |As much as machine learning is a part of my life, I've not seen a place to discuss issues in NLP/CL/etc. in an open forum. [224000010040] |I've found in my life as an NLPer (a term I started using with Yee Whye Teh and hope others will adopt) that there are often lots of discussions across lunch tables and over coffee that should be made public for the benefit of others. [224000010050] |I welcome comments by people both within and without the NLP/CL community, and would love to have occasional "guest" posts (email me if you'd like to post something). [224000010060] |I'll begin with the first real post shortly and see where things go from there. [224000020010] |Interesting topics in NLP [224000020020] |Papers for HLT/NAACL are due Friday, but since I'm probably not submitting anything, I'll take this opportunity to write down some thoughts about what NLP seems to care about and what I think it might care about more. [224000020030] |Topic-wise, we have several big areas that typically overwhelm conference proceedings: [224000020040] |
  • Machine Translation
  • [224000020050] |
  • Summarization
  • [224000020060] |
  • Language Modeling
  • [224000020070] |
  • Information Extraction
  • [224000020080] |
  • Parsing
  • [224000020090] |There’s obviously a lot more to NLP than these topics, but these by far receive the highest publication rate at conferences like ACL, NAACL, HLT and COLING. [224000020100] |I think it’s fairly obvious why. [224000020110] |Each has some strong combination of the "required" two aspects for publication at these conferences: [224000020120] |
  • Importance (or strong community beliefs thereof)
  • [224000020130] |
  • Evaluatability (especially on common data sets with common metrics)
  • [224000020140] |There is a third aspect under which contributions can be measured that doesn't seem to play a strong role in our field right now: Interestingness. [224000020150] |The problem is that evaluating interestingness is difficult. [224000020160] |As pointed out by Sam Roweis at a discussion at the BayesNLP workshop, this is something that SIGGRAPH is very successful at. [224000020170] |Perhaps SIGGRAPH papers aren't on the most important or evaluatable topics, but they are certainly interesting. [224000020180] |Putting too much weight on interestingness is probably a bad idea, especially if it completely forgoes the other two aspects. [224000020190] |But maybe we'd be better off as a community if we had one or two tracks (or perhaps a workshop) that focused exclusively on interesting, new problems, even if they fumble a bit on evaluation and obvious short-term importance. [224000020200] |Some topics I care about (i.e., think are interesting) that are not commonly seen at ACL-like conferences are: [224000020210] |
  • User modeling or personalization in general
  • [224000020220] |
  • CogSci-oriented human-learning or -usage of language
  • [224000020230] |
  • Situated language experiments, in real or simulated worlds
  • [224000020240] |
  • Super-sentence models of coherence, cohesion, relation between documents
  • [224000020250] |
  • Tasks involving multiple media (text + {video, images, charts/tables, ...})
  • [224000020260] |I don’t expect these things to actually start appearing soon, and I know there are conferences that focus more on each aspect individually, but I think it’s time that our community as a whole starts paying more serious attention to what goes on in these areas. [224000030010] |Bayesian Methods for NLP (summary) [224000030020] |I recently co-organized a BayesNLP workshop with Yee Whye. [224000030030] |Here’s a brief summary of a subset of the talks and the discussions that took place. [224000030040] |Topic models and language modeling: A lot of the discussion and papers were about either TMs, LMs or both. [224000030050] |Much of the discussion of topic models was how to introduce Markov-style dependencies. [224000030060] |There are essentially three ways to do this: (1) make word i dependent on topic i and word (i-1); (2) make topic i dependent on topic (i-1); (3) both. [224000030070] |Basically this comes out in how you structure the "beta" language models (in LDA terminology). [224000030080] |There is a trade-off between number of params (|vocab| * (# top)^2 versus |vocab|^2 * (# top)) and the ability to fit data. [224000030090] |I think the general consensus is that if you have a lot of data (which you should!) then you should use the more expressive models. [224000030100] |The major problem with these models is that they often are evaluated by their perplexity on test data. [224000030110] |These perplexities are significantly higher than those obtained by people in the speech community, which raises the "why should I care" question (see this other entry). [224000030120] |There are several potential answers: (1) topics can be embedded in a task (say MT) and this leads to better performance; (2) topics are used to enable new tasks (browsing Science repositories); (3) topics can be compared with what humans do in a CogSci manner. [224000030130] |This topic led into some incomplete discussion on what sorts of problems we might want to work on in the future. [224000030140] |I don’t think there was a solid decision made. [224000030150] |In terms of what applications might be interesting, I think the agreement was that Bayesian techniques are most useful in problems for which there is insufficient data to fit all parameters well. [224000030160] |Since "there's no data like more data" has become a mantra in NLP, this seems like it would include every problem! [224000030170] |My opinion is that Bayesian methods will turn out to be most useful for largely unsupervised tasks, where my prior knowledge can be encoded as structure. [224000030180] |I think there's lots of room to grow into new application domains (similar to some stuff Andrew McCallum has been working on in social network analysis). [224000030190] |Introducing new tasks makes evaluation difficult, which can make publication difficult (your eight pages have to go both to technique and evaluation), but I think it's the right way for the community to head. [224000030200] |I also really like Yee Whye's talk (which happened to propose basically the same model as a paper by Goldwater, Griffiths and Johnson at this same NIPS), where he basically gave an interpretation of KN smoothing as a nonparametric Bayesian model with a Poisson-Dirichlet prior. [224000030210] |Unlike previous methods to explain why KN works, this actually gives superior results to interpolated KN (though it loses to modified interpolated KN). 
[224000030220] |Shaojun talked about integrating a whole bunch of stuff (Markov models, grammars and topics) into a language model using directed Markov fields as an "interface" language. [224000030230] |This was really cute and they seem to be doing really well (going against the above comment that it's hard to get comparable perplexities). [224000030240] |I believe there's an upcoming CL paper on this topic. [224000030250] |If anyone else took part in the BNLP workshop and would like to comment, you're more than welcome. [224000040010] |NLP and NIPS [224000040020] |Several people have already posted what they thought were the coolest NIPS papers this year elsewhere. [224000040030] |Instead of saying what I thought were necessarily the best, I'll say what I thought were most interesting from an NLP perspective. [224000040040] |Max Margin Semi-Supervised Learning for Structured Variables. [224000040050] |(Yasemin Altun, David McAllester, Mikhail Belkin) This paper talks about the problem of using unlabeled data to learn structured prediction models, in the style of M3Ns. [224000040060] |The idea is the same as manifold-based semi-supervised learning in standard classification, extended to structure: penalize disagreement of "nearby" labels (where nearby is defined in feature space, not structure space). [224000040070] |They work in the very low data setting and find that (a) unlabeled data helps and (b) structure helps. [224000040080] |Neither is surpising, but both are good. [224000040090] |I wonder how to scale this to large data sets: all these manifold-based SS techniques require matrix operations on the gram matrix from the unlabeled data, which must be at least O(n^2) but is often worse (n = # of unlabeled points). [224000040100] |In NLP, we have bajillions of unlabeled points. [224000040110] |There's got to be a good way to do this that will scale, but I haven't see anything yet. [224000040120] |Group and Topic Discovery from Relations and Their Attributes. [224000040130] |(Xuerui Wang, Natasha Mohanty, Andrew McCallum) This is a multi-clustering (cluster multiple things at once) paper, where they simultaneously identify clusters of groups and clusters of topics. [224000040140] |E.g., they want to cluster senators by how they vote and bills by what they are about. [224000040150] |If you do these two things simultaneously, you can get better statistics and thus do better overall. [224000040160] |Structured Prediction via the Extragradient Method. [224000040170] |(Ben Taskar, Simon Lacoste-Julien, Michael Jordan) This is a clever solution to extending structured prediction outside the world of Markov fields to the world of quadratic cost min-flow problems (think matching). [224000040180] |They get big speedups over previous optimization techniques and can handle a larger set of problems. [224000040190] |I plan on writing a separate post on structured prediction, where I'll discuss this and related papers in more detail. [224000040200] |But I thought it was cool enough to mention here anyway. [224000040210] |Interpolating between types and tokens by estimating power-law generators. [224000040220] |(Tom Griffiths, Sharon Goldwater, Mark Johnson) Despite the scariness of the name, this paper is very interesting to NLP folk. [224000040230] |Everything in language looks like a Zipf/power-law distribution, and this paper shows that the Pitman-Yor process models this very well. 
[224000040240] |This is the same model that Yee Whye suggested for modeling Kneser-Ney smoothing, so it’s clearly something that has some real weight in NLP. [224000040250] |Context as Filtering. [224000040260] |(Daichi Mochihashi, Yuji Matsumoto) Discusses the issue of long-range dependencies in language modeling, breaking out of the bag of words-style histories paradigm. [224000040270] |They use both an LDA-like and Dirichlet-mixture topic model that "shifts" through time and can perform inference by treating it as a particle filter problem. [224000040280] |The results are on 11m words of data, which is reasonable, and seem promising, though I’d like to see more comparative results. [224000040290] |And that’s it! [224000040300] |Feel free to post other cool NIPS papers. [224000050010] |Summarization [224000050020] |Summarization is one of the canonical NLP problems, but of all the big ones (MT, speech, IR, IE, etc.) it is in my opinion the most difficult (let the flames begin!). [224000050030] |The reason I think it’s so hard is because it’s unclear what a summary is. [224000050040] |When one cannot define a problem well, it is impossible to solve, and difficult ⊆ impossible. [224000050050] |There has been an enormous amount of effort to define specific summarization problems that we can solve, a comparable amount of effort on figuring out how to measure how good a solution is, and a lot of effort on building models/systems that can solve these problems. [224000050060] |We’ve tried many things; see the history of DUC for a small, biased, incomplete set. [224000050070] |That said, I think the field has much potential, but not necessarily by trying to mimic what a human would do when asked to produce a summary. [224000050080] |I’m interested in doing summarization-like tasks that a human would never be able to do reliably. [224000050090] |Here are some examples to give a flavor of what I’m talking about: [224000050100] |
  • Let me go to a scientific engine like Rexa or CiteSeer and ask for a summary of "reinforcement learning." [224000050110] |The system should know what papers I've read, what papers I've written, the relationship between all the papers in its database, an analysis of the goodness of authors and so on. [224000050120] |What it produces for me must be markedly different from what it would produce for Satinder Singh.
  • [224000050130] |
  • Let me go to Amazon and ask about new digital cameras. [224000050140] |Maybe I'm interested in spending $200-$300. [224000050150] |It should know that I've never bought a digital camera before, or that I've bought 4. I want a summary of important specifications, user comments and so on.
  • [224000050160] |
  • Let me go to my own inbox and ask for a summary of what I've been discussing with John recently.
  • [224000050170] |One can imagine many more similar tasks, but these are three obvious ones. [224000050180] |The nice thing about these is that even partial solutions would be enormously useful (to me, at least...my mom might not care about the first). [224000050190] |These are also things that people really can't do well. [224000050200] |If someone asks me for something like the first one but, say, on structured prediction instead of reinforcement learning, I can give a summary, but it will be heavily biased. [224000050210] |It is worse for the second, where I can basically only produce anecdotal evidence, and virtually impossible for the third. [224000050220] |The most important outstanding issue is how to measure success at such problems. [224000050230] |I cannot imagine how to do this without doing user studies, but people probably felt the same way about MT a few years ago. [224000050240] |How about now? [224000050250] |But given the amount of personalization in these tasks, I feel that it would be harder to do automatic evaluation of them. [224000050260] |Probably the most important things to measure in user studies are subjective satisfaction, how many times multiple searches had to be performed and so on. [224000050270] |One could also take a TREC style approach for comparative pair-wise evaluations by marking system 2 down if it missed something system 1 found that a human thought was important. [224000050280] |There are also tons of subproblems that can be pulled from this tangle of tasks. [224000050290] |Most notably, personalization methods, social network analysis methods, redundancy identification, coherence, information presentation (UI) techniques, generation of multimodal outputs (tables, graphs, etc.), dealing with imperfect input (googling for "reinforcement learning" also produces irrelevant documents), opinion identification, processing of ungrammatical input, anti-spam, and so on. [224000050300] |I'm not a huge proponent of just solving subtasks that aren't proven to be necessary, but it is sometimes helpful to take this approach. [224000050310] |I think we just have to keep the big picture in our minds.