[223000240010] |Getting Users the Old-Fashioned Way: Paying Them [223000240020] |IBM recently announced 20 "winners" of their 2006 UIMA Innovation Award. [223000240030] |These 20 full time faculty members received $10,000-30,000 each to “encourage the use of open source and open standards-based tools for academic curricula and research,” which is later qualified to “the porting of significant analysis algorithms to the UIMA framework, and in the use of UIMA to support knowledge acquisition for the semantic web.” [223000240040] |See IBM’s UIMA Home Page for more info. [223000240050] |It’s interesting to see that Nstein rolled out UIMA support for a suite of tools including doc/sentence classification for topic, objectivity and polarity, as well entity extraction and some relation extraction. [223000240060] |I’m guessing the incentive program might have something to do with a recent up-tick in requests we’ve received from students for UIMA components. [223000240070] |If anyone has any UIMA wrappers laying around for LingPipe components that we could share, please send them to us and we’ll start a collection in the sandbox. [223000250010] |Biocreative 2006: Results Released, but not public yet [223000250020] |The named entity results for the 2006 Biocreative came out today, but I’m still waiting to hear if we can talk about them. [223000250030] |We received a warning not to talk about the results along with the results. [223000250040] |So I’m trying to clarify with the organizers. [223000260010] |Researcher 6-Degrees of Separation [223000260020] |Chris Volinsky, of AT&T Research, put up a nice application that plays Six Degrees of Separation online with “proximity networks,” his social-network derived notion of distance, a fast interface, and a decent visualization. [223000260030] |You might want to connect Breck Baldwin and Willard von Orman Quine; Breck’s the president and founder of Alias-i, whereas W.V.O. Quine is one of the greatest 20th Century philosophers of language. [223000260040] |More fun can be had by connecting Bob Carpenter and Hal Daume III; I (Bob) am the software/machine learning guy at Alias-i and Hal’s the author of the NLPers Blog. [223000260050] |Here’s Volinsky’s description of the application and here’s the top-level start page. [223000260060] |If you’d like to try something more mainstream, you can also connect Hollywood actors to one another. [223000260070] |Dustin Hoffman’s a long way from Katherine Hepburn. [223000270010] |Why do you Hate CRFs? [223000270020] |After my talk at Columbia, a grad student asked me “Why do you hate CRFs?”. [223000270030] |This is a tough question to answer because of the failed presupposition, which assumes I hate CRFs and asks me to explain why. [223000270040] |I want to defeat the question by saying that I don’t hate CRFs. [223000270050] |The right question is: [223000270060] |Why isn’t there a CRF package in LingPipe? [223000270070] |This is a good question because if you look at any of the recent tagging/chunking bakeoffs (CoNLL, Biocreative, etc.) you’ll see that the top-scoring systems are CRFs or similar richly featured conditional models. [223000270080] |The short, though highly idiomatic, answer is “horses for courses”. [223000270090] |(Etymology hint: racing horses specialize in muddy or other conditions, as explained in Abbott and Costello’s classic “Mudder and Fodder” sketch). [223000270100] |To put this in context, we’re looking for high recall taggers that have a decent precision at very high recall. 
[223000270110] |Our chunk-level forward-backward HMM tagger achieves 99.95% recall at 8% precision, even though it’s a whopping 15% behind the best systems in first-best performance. [223000270120] |There are three main problems with using CRFs for the kinds of problems in which we’re interested: [223000270130] |
  • CRFs need fairly extensive feature engineering to outperform a good HMM baseline,
  • [223000270140] |large feature sets and discriminative training can lead to very high model complexity and variance, and
  • [223000270150] |these features and training regimes make estimation (training) and inference (decoding/tagging) computationally intensive.
[223000270160] |1. Portability: Unfortunately, the performance of CRFs is directly attributable to their large sets of hand-tuned features. [223000270170] |Consider the following quote from (Finkel, Dingare, Nguyen, Nissim, Manning and Sinclair 2004): [223000270180] |Using the set of features designed for that task in CoNLL 2003 [24], our system achieves an f-score of 0.76 on the BioCreative development data, a dramatic ten points lower than its f-score of 0.86 on the CoNLL newswire data. [223000270190] |Despite the massive size of the final feature set (almost twice as many features as used for CoNLL), its final performance of 0.83 is still below its performance on the CoNLL data. [223000270200] |The baseline features are listed in section 3 (below). [223000270210] |Surprisingly, LingPipe’s HMM performed similarly to baseline CRFs out of the box on BioCreative data. [223000270220] |Finkel et al.’s paper describes the wealth of features they applied to port their CRF system from CoNLL to BioCreative. [223000270230] |They also describe heuristics, such as mismatched paren balancing, pruning given various known suffixes, etc. [223000270240] |2. Bias-Variance Tradeoff: What you see in heavily tuned conditional models such as CRFs is a strongly attenuated probability distribution. [223000270250] |Such models are usually much more confident in their decisions than they should be. [223000270260] |This overconfidence stems from a lack of modeling of dependencies, and shows up in our models, too, especially in classification tasks, where we’re surprised every time the document mentions “baseball”. [223000270270] |This is why speech recognition is so bad: the acoustic context-dependent phoneme mixture models are surprised every 1/100th of a second that the speaker’s still using a Texas accent. [223000270280] |Interestingly, 2005’s winning Spam detection entry for TREC (Bratko and Filipic 2005) mitigated this problem in an interesting way by learning on the document being classified. [223000270290] |Topic-level models (such as LDA) and more highly dispersed frequency models (such as zero-inflated, hierarchical or simple over-dispersed models) are interesting approaches to this problem. [223000270300] |There’s also a nice discussion of this attenuation problem in section 1.2.3 of a classification overview paper by Nigam, McCallum and Mitchell (2006): [223000270310] |"The faulty word independence assumption exacerbates the tendency of naive Bayes to produce extreme (almost 0 or 1) class probability estimates. [223000270320] |However, classification accuracy can be quite high even when these estimates are inappropriately extreme." [223000270330] |In statistical terms, CRFs and other conditional models are often unbiased, but high variance. [223000270340] |As usual, this arises from high model complexity (large numbers of parameters with free structure). [223000270350] |That is, small changes in the training data or test samples lead to wide differences in models. [223000270360] |Our language-model based approaches, on the other hand, tend to have lower variance and higher bias. [223000270370] |They’re not nearly as sensitive to training data, but they tend to have built-in biases. [223000270380] |Another way to say this is that the CRFs are much more tightly fit to their training data than the language-model based generative approaches we’ve adopted in LingPipe.
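To see how fast an independence assumption saturates a posterior, here is a toy two-class calculation in plain Java (not LingPipe code; the likelihoods are invented). Each repeated mention of “baseball” is scored as fresh, independent evidence, so the class probability races toward 1 even though a single mention only justifies modest odds.

    // Toy illustration of conditional-independence overconfidence.
    public class OverconfidenceDemo {
        public static void main(String[] args) {
            double pWordGivenSports = 0.010;   // invented per-token likelihoods
            double pWordGivenPolitics = 0.002;
            double priorSports = 0.5;
            for (int count = 1; count <= 10; ++count) {
                // each of the count occurrences is scored as independent evidence
                double likeSports = priorSports * Math.pow(pWordGivenSports, count);
                double likePolitics = (1.0 - priorSports) * Math.pow(pWordGivenPolitics, count);
                double pSports = likeSports / (likeSports + likePolitics);
                System.out.printf("count=%2d  P(sports|doc)=%.6f%n", count, pSports);
            }
        }
    }

With these invented numbers the evidence only favors sports 5:1 per mention, but by ten mentions the posterior is effectively 1.0, which is exactly the kind of extreme estimate the Nigam et al. quote is describing.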
[223000270390] |Many models, including support vector machines, boosting, perceptrons and even active learning, are explicitly designed to bias the global statistics in such a way as to emphasize the cases near the decision boundary. [223000270400] |While this helps with first-best decisions, it tends to hurt any confidence-based decisions. [223000270410] |We see this same bias/variance tradeoff within the models supplied by LingPipe. [223000270420] |Our rescoring chunkers are nearly useless for confidence-based extraction due to their overly attenuated models. [223000270430] |3. Efficiency: The current algorithms for training CRFs are slow. [223000270440] |As in hours if not days of CPU time. [223000270450] |Run time is much better, but still slow. [223000270460] |For instance, Finkel et al. (2004) report beam-based decoding speeds of roughly 10KB/second. [223000270470] |Ben Wellner’s Carafe: CRFs for IE implementation (in Objective ML, of all things!) is reportedly very fast —around 50K words/sec, with beam-based pruning, feature caching and bigram sequence stats. [223000270480] |So it’s clear that careful engineering of features with pruning and caching can be very useful. [223000270490] |In fact, this result’s surprising enough that it’s making me want to do more exploration of pruning in our own first-best decoders (remember, “horses for courses”). [223000270500] |The problem with pruning in a high recall tagger is that it explicitly eliminates hypotheses for which the model estimates will have lower likelihood than the first-best hypotheses. [223000270510] |For high recall applications, these hypotheses are relevant. [223000270520] |The reason training and decoding are fairly slow is simple: lots of features. [223000270530] |For decoding, time is pretty much wholly dependent on the number of features looked up per character or token of input. [223000270540] |Because even our models are larger than existing CPU caches and because the models are accessed randomly, the time is mostly determined by front-side bus speed, which determines how fast data shuttles between memory and the outermost cache of the CPU. [223000270550] |Here’s a list of “baseline” CRF features from (Krishnan and Manning 2006): [223000270560] |Our baseline CRF is a sequence model in which labels for tokens directly depend only on the labels corresponding to the previous and next tokens. [223000270570] |We use features that have been shown to be effective in NER, namely the current, previous and next words, character n-grams of the current word, Part of Speech tag of the current word and surrounding words, the shallow parse chunk of the current word, shape of the current word, the surrounding word shape sequence, the presence of a word in a left window of size 5 around the current word and the presence of a word in a right window of size 5 around the current word. [223000270580] |This gives us a competitive baseline CRF using local information alone, whose performance is close to the best published local CRF models, for Named Entity Recognition. [223000270590] |That is a lot of features. [223000270600] |LingPipe’s HMM feature set includes exactly two components to predict the tag for the current word: the current word itself and the tag of the previous word. [223000270610] |In the interest of full disclosure, we estimate the word probabilities a character at a time using language models, which is where the work is in our decoder.
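For contrast with the feature inventory just quoted, here is roughly what the per-token score of such an HMM looks like. This is a hedged sketch with hypothetical interface names, not the actual LingPipe API:

    // Sketch of a first-order HMM's per-token score: the previous tag and
    // the current word are the only predictors. Interface names are made up.
    interface Transition {
        double logProb(String tag, String previousTag);   // log p(tag | previousTag)
    }

    interface Emission {
        // log p(word | tag); in LingPipe this is estimated a character at
        // a time by a per-tag character language model, which is where
        // most of the decoding work goes.
        double logProb(String word, String tag);
    }

    class HmmTokenScorer {
        private final Transition transition;
        private final Emission emission;
        HmmTokenScorer(Transition transition, Emission emission) {
            this.transition = transition;
            this.emission = emission;
        }
        double score(String previousTag, String tag, String word) {
            return transition.logProb(tag, previousTag) + emission.logProb(word, tag);
        }
    }

A Viterbi decoder just maximizes the sum of these scores over tag sequences, so there is no per-token feature extraction, caching or weight lookup beyond the two probabilities.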
[223000270620] |Theoretically, if CRFs provide much sharper estimates and much better tuning, they could turn out to be faster than the kind of simple HMMs we use for first-best decoding. [223000270630] |This counterintuitive outcome arises in many search settings, where more expensive but tighter pruning leads to overall performance improvements. [223000280010] |JDK 1.6 for Number Crunching [223000280020] |OK, the next time Martin Jansche and Chris Manning independently recommend something, I’m going to listen. [223000280030] |I was relaxing this weekend after our Friday NIH grant submission by playing around with training some Gaussian mixture models with EM. [223000280040] |As we all know, Gaussian estimates involve transcendental calculations (square roots and exponentials), and EM applies them in a tight loop. [223000280050] |I plugged in the 1.6 JDK (there’s now a release candidate; the end of 1.4 may be in sight), and EM per-iteration times went down 40%. [223000280060] |This is running 32-bit on my 3GHz Pentium 4 at home in -server mode. [223000280070] |And the memory footprint was smaller with less dynamic resizing; this app doesn’t generate a lot of small objects, so I don’t know what’s up with this behavior. [223000280080] |Maybe 1.6 will be a lot faster in other areas, too. [223000280090] |I hear they’re still tuning the GC. [223000280100] |For LingPipe, 1.6 will provide a huge speed boost for all of the dynamic estimators, which compute logs for each estimate. [223000280110] |This includes language models, taggers, chunkers and classifiers. [223000280120] |Note that this is estimation time, not training time —training doesn’t involve transcendental operations for our models, which is one of the reasons we chose them. [223000280130] |But it will make compilation of the models faster, as compilation just runs all the possible estimates it can precompile. [223000280140] |All I can say is, wow. [223000280150] |Thanks, Sun. [223000290010] |Code Spelunking: Jaro-Winkler String Comparison [223000290020] |We’re working on a database record linkage (also known as database deduplication) problem for a customer. [223000290030] |The standard string comparison for this task is the Jaro-Winkler distance, named after Matt Jaro and Bill Winkler of the U.S. Census Bureau. [223000290040] |Here are some quick links to what we’ve done: [223000290050] |
  • LingPipe Javadoc: com.aliasi.spell.JaroWinklerDistance
  • LingPipe Java Source: JaroWinklerDistance.java
  • [223000290060] |LingPipe String Comparison Tutorial
  • [223000290070] |Unfortunately, all of the definitions I could find were as vague as the Jaro-Winkler Wikipedia entry. [223000290080] |And some definitions actually varied when considering the boundary conditions. [223000290090] |I next went and looked in two open-source string comparison packages, namely SecondString and Simmetrics. [223000290100] |Here's the Simmetrics Java source and the SecondString Java source. [223000290110] |Interestingly, they contained almost identical definitions that matched neither the definitions in Winkler's papers nor their own documentation (which is clearly wrong in the third term in Jaro distance Simmetrics case). [223000290120] |Hmm. [223000290130] |This makes me very skeptical of the widely cited results reported in (Cohen et al. 2003); more on their soft TF/IDF measure in a future blog post. [223000290140] |To get to the bottom of the matter, I composed a long e-mail missive to William Winkler at the U.S. Census Bureau. [223000290150] |I outlined the vexing boundary conditions. [223000290160] |He was not only kind enough to write back, he sent me a copy of the C code he actually used in his papers, along with pointers to tables in the paper that had sample output. [223000290170] |With the C code in hand, GCC (Gnu's C compiler), and a quick squint at Kernighan and Ritchie (the C version of Gosling et al.) to remind myself how to allocate an array in C, I was in business. [223000290180] |I implemented Jaro-Winkler as an instance of the interface com.aliasi.spell.StringDistance. [223000290190] |I implemented all of Winkler's tests as unit tests along with a number of pure boundary condition tests. [223000290200] |It'll be out in the next release of LingPipe, along with some TF/IDF-based distances. [223000290210] |Here's the class javadoc, which contains links to all of the relevant papers and test cases. [223000290220] |---------------------------------------- [223000290230] |The JaroWinklerDistance class implements the original Jaro string comparison as well as Winkler's modifications. [223000290240] |As a distance measure, Jaro-Winkler returns values between 0 (exact string match) and 1 (no matching characters). [223000290250] |Note that this is reversed from the original definitions of Jaro and Winkler in order to produce a distance-like ordering. [223000290260] |The original Jaro-Winkler string comparator returned 1 for a perfect match and 0 for complete mismatch; our method returns one minus the Jaro-Winkler measure. [223000290270] |The Jaro-Winkler distance measure was developed for name comparison in the U.S. Census. [223000290280] |It is designed to compae surnames to surnames and given names to given names, not whole names to whole names. [223000290290] |There is no character-specific information in this implementation, but assumptions are made about typical lengths and the significance of initial matches that may not apply to all languages. [223000290300] |The easiest way to understand the Jaro measure and the Winkler variants is procedurally. [223000290310] |The Jaro measure involves two steps, first to compute the number of "matches" and second to compute the number of "transpositions". [223000290320] |The Winkler adjustment involves a final rescoring based on an exact match score for the initial characters of both strings. [223000290330] |

    Formal Definition of Jaro-Winkler Distance

    [223000290340] |Suppose we are comparing character sequences cs1 and cs2. [223000290350] |The Jaro-Winkler distance is defined by the following steps. [223000290360] |After the definitions, we consider some examples. [223000290370] |Step 1: Matches: The match phase is a greedy alignment step of characters in one string against the characters in another string. [223000290380] |The maximum distance (measured by array index) at which characters may be matched is defined by: [223000290390] |The match phase is a greedy alignment that proceeds character by character through the first string, though the distance metric is symmetric (that is, reversing the order of arguments does not affect the result). [223000290400] |For each character encountered in the first string, it is matched to the first unaligned character in the second string that is an exact character match. [223000290410] |If there is no such character within the match range window, the character is left unaligned. [223000290420] |Step 2: Transpositions: After matching, the subsequence of characters actually matched in both strings is extracted. [223000290430] |These subsequences will be the same length. [223000290440] |The number of characters in one string that do not line up (by index in the matched subsequence) with identical characters in the other string is the number of "half transpositions". [223000290450] |The total number of transpositions is the number of half transpositions divided by two, rounding down. [223000290460] |The Jaro distance is then defined in terms of the number of matching characters matches and the number of transpositions, transposes: [223000290470] |In words, the measure is the average of three values: (a) the percentage of the first string matched, (b) the percentage of the second string matched, and (c) the percentage of matches that were not transposed. [223000290480] |Step 3: Winkler Modification: The Winkler modification to the Jaro comparison, resulting in the Jaro-Winkler comparison, boosts scores for strings that match character for character initially. [223000290490] |Let boostThreshold be the minimum score for a string that gets boosted. [223000290500] |This value was set to 0.7 in Winkler's papers (see references below). [223000290510] |If the Jaro score is below the boost threshold, the Jaro score is returned unadjusted. [223000290520] |The second parameter for the Winkler modification is the size of the initial prefix considered, prefixSize. [223000290530] |The prefix size was set to 4 in Winkler's papers. [223000290540] |Next, let prefixMatch(cs1,cs2,prefixSize) be the number of characters in the prefix of cs1 and cs2 that exactly match (by original index), up to a maximum of prefixSize. [223000290550] |The modified distance is then defined to be: [223000290560] |Examples: We will present the alignment steps in the form of tables, with offsets in the second string below the first string positions that match. [223000290570] |For a simple example, consider comparing the given (nick)name AL to itself. [223000290580] |Both strings are of length 2. [223000290590] |Thus the maximum match distance is max(2,2)/2 - 1 = 0, meaning all matches must be exact. [223000290600] |The matches are illustrated in the following table: [223000290610] |The notation in the matches row is meant to indicate that the A at index 0 in cs1 is matched to the A at index 0 in cs2. [223000290620] |Similarly for the L at index 1 in cs1, which matches the L at index 1 in cs2.
[223000290630] |This results in matches(AL,AL) = 2. [223000290640] |There are no transpositions, so the Jaro distance is just: [223000290650] |Applying the Winkler modification yields the same result: [223000290660] |Next consider a more complex case, matching MARTHA and MARHTA. [223000290670] |Here the match distance is max(5,5)/2 - 1 = 1, allowing matching characters to be up to one character away. [223000290680] |This yields the following alignment. [223000290690] |Note that the T at index 3 in the first string aligns with the T at index 4 in the second string, whereas the H at index 4 in the first string alings with the H at index 3 in the second string. [223000290700] |The strings that do not directly align are rendered in bold. [223000290710] |This is an instance of a transposition. [223000290720] |The number of half transpositions is determined by comparing the subsequences of the first and second string matched, namely MARTHA and MARHTA. [223000290730] |There are two positions with mismatched characters, 3 and 4. [223000290740] |This results in two half transpositions, or a single transposition, for a Jaro distance of: [223000290750] |Three initial characters match, MAR, for a Jaro-Winkler distance of: [223000290760] |Next, consider matching strings of different lengths, such as JONES and JOHNSON: [223000290770] |The unmatched characters are rendered in italics. [223000290780] |Here the subsequence of matched characters for the two strings are JONS and JONS, so there are no transpositions. [223000290790] |Thus the Jaro distance is: [223000290800] |The strings JONES and JOHNSON only match on their first two characters, JO, so the Jaro-Winkler distance is: [223000290810] |We will now consider some artificial examples not drawn from (Winkler 2006). [223000290820] |First, compare ABCVWXYZ and CABVWXYZ, which are of length 8, allowing alignments up to 8/4 - 1 = 3 positions away. [223000290830] |This leads to the following alignment: [223000290840] |Here, there are 8/8 matches in both strings. [223000290850] |There are only three half-transpositions, in the first three characters, because no position of CAB has an identical character to ABC. [223000290860] |This yields a total of one transposition, for a Jaro score of: [223000290870] |There is no initial prefix match, so the Jaro-Winkler comparison produces the same result. [223000290880] |Now consider matching ABCVWXYZ to CBAWXYZ. [223000290890] |Here, the initial alignment is 2, 1, 0, which yields only two half transpositions. [223000290900] |Thus under the Jaro distance, ABC is closer to CBA than to CAB, though due to integer rounding in computing the number of transpositions, this will only affect the final result if there is a further transposition in the strings. [223000290910] |Now consider the 10-character string ABCDUVWXYZ. [223000290920] |This allows matches up to 10/2 - 1 = 4 positions away. [223000290930] |If matched against DABCUVWXYZ, the result is 10 matches, and 4 half transposes, or 2 transposes. [223000290940] |Now consider matching ABCDUVWXYZ against DBCAUVWXYZ. [223000290950] |Here, index 0 in the first string (A) maps to index 3 in the second string, and index 3 in the first string (D) maps to index 0 in the second string, but positions 1 and 2 (B and C) map to themselves. [223000290960] |Thus when comparing the output, there are only two half transpositions, thus making the second example DBCAUVWXYZ closer than DABCUVWXYZ to the first string ABCDUVWXYZ. 
[223000290970] |Note that the transposition count cannot be determined solely by the mapping. [223000290980] |For instance, the string ABBBUVWXYZ matches BBBAUVWXYZ with alignment 4, 0, 1, 2, 5, 6, 7, 8, 9, 10. [223000290990] |But there are only two half-transpositions, because only index 0 and index 3 mismatch in the subsequences of matching characters. [223000291000] |Contrast this with ABCDUVWXYZ matching DABCUVWXYZ, which has the same alignment, but four half transpositions. [223000291010] |
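Before the implementation notes, here is a compact sketch of the procedure just walked through, coded from the description above rather than from the strcmp95 or LingPipe sources. The 0.7 boost threshold and the prefix size of 4 come from the text; the 0.1 prefix weight and the empty-string boundary case are assumptions flagged in the comments. Values are returned on this class's distance scale, where 0 means an exact match.

    // From-scratch sketch of Jaro-Winkler as described above; not the LingPipe source.
    public class JaroWinklerSketch {

        public static double distance(CharSequence cs1, CharSequence cs2) {
            return 1.0 - proximity(cs1, cs2);
        }

        // Proximity on the original Jaro-Winkler scale: 1 = exact match.
        public static double proximity(CharSequence cs1, CharSequence cs2) {
            int len1 = cs1.length();
            int len2 = cs2.length();
            if (len1 == 0 && len2 == 0)
                return 1.0;  // assumption: two empty strings treated as an exact match
            if (len1 == 0 || len2 == 0)
                return 0.0;

            // Step 1: greedy matching within a window of max(len1,len2)/2 - 1.
            int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
            boolean[] matched2 = new boolean[len2];
            StringBuilder sub1 = new StringBuilder();  // matched chars of cs1, in cs1 order
            int matches = 0;
            for (int i = 0; i < len1; ++i) {
                int lo = Math.max(0, i - window);
                int hi = Math.min(len2 - 1, i + window);
                for (int j = lo; j <= hi; ++j) {
                    if (!matched2[j] && cs1.charAt(i) == cs2.charAt(j)) {
                        matched2[j] = true;          // first unaligned exact match wins
                        sub1.append(cs1.charAt(i));
                        ++matches;
                        break;
                    }
                }
            }
            if (matches == 0)
                return 0.0;
            StringBuilder sub2 = new StringBuilder();  // matched chars of cs2, in cs2 order
            for (int j = 0; j < len2; ++j)
                if (matched2[j])
                    sub2.append(cs2.charAt(j));

            // Step 2: half transpositions are positions where the two matched
            // subsequences disagree; a full transposition is two halves.
            int halfTransposes = 0;
            for (int k = 0; k < matches; ++k)
                if (sub1.charAt(k) != sub2.charAt(k))
                    ++halfTransposes;
            int transposes = halfTransposes / 2;

            double jaro = ((double) matches / len1
                           + (double) matches / len2
                           + (double) (matches - transposes) / matches) / 3.0;

            // Step 3: Winkler boost for an initial exact-match prefix of up to
            // 4 characters, applied only at or above the 0.7 threshold. The 0.1
            // weight is the value usually quoted for strcmp95; it is an
            // assumption here, not something stated in the text above.
            if (jaro < 0.7)
                return jaro;
            int maxPrefix = Math.min(4, Math.min(len1, len2));
            int prefix = 0;
            while (prefix < maxPrefix && cs1.charAt(prefix) == cs2.charAt(prefix))
                ++prefix;
            return jaro + prefix * 0.1 * (1.0 - jaro);
        }

        public static void main(String[] args) {
            System.out.println(distance("MARTHA", "MARHTA"));
            System.out.println(distance("JONES", "JOHNSON"));
            System.out.println(distance("ABCVWXYZ", "CABVWXYZ"));
        }
    }

On the worked examples above, the sketch finds 6 matches and a single transposition for MARTHA and MARHTA, and 4 matches with no transpositions for JONES and JOHNSON, matching the counts in the tables described.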

    Implementation Notes

    [223000291020] |This class's implementation is a literal translation of the C algorithm used in William E. Winkler's papers and for the 1995 U.S. Census Deduplication. [223000291030] |The algorithm is the work of multiple authors and is available from the following link: [223000291040] |
  • Winkler, Bill, George McLaughlin, Matt Jaro and Maureen Lynch. [223000291050] |1994. strcmp95.c, Version 2. [223000291060] |United States Census Bureau.
[223000291070] |Unlike the C version, the distance(CharSequence,CharSequence) and compare(CharSequence,CharSequence) methods do not require their inputs to be padded with spaces. [223000291080] |In addition, spaces are treated just like any other characters within the algorithm itself. [223000291090] |There is also no case normalization in this class's version. [223000291100] |Furthermore, the boundary conditions are changed so that two empty strings return a score of 1.0 rather than zero, as in the original algorithm. [223000291110] |The algorithm, along with applications in record linkage, is sketched in the following highly readable survey article: [223000291120] |
  • Winkler, William E. 2006. [223000291130] |Overview of Record Linkage and Current Research Directions. [223000291140] |Statistical Research Division, U.S. Census Bureau.
[223000291150] |This document provides test cases in Table 6, which are the basis for the unit tests for this class (though note the three 0.0 results in the table do not agree with the return results of strcmp95.c or the results of this class, which matches strcmp95.c). [223000291160] |The description of the matching procedure above is based on the actual strcmp95 code, the boundary conditions of which are not obvious from the text descriptions in the literature. [223000291170] |An additional difference is that strcmp95, but not the algorithms in Winkler's papers nor the algorithm in this class, provides the possibility of partial matches with similar-sounding characters (e.g. c and k). [223000291180] |The greedy nature of the alignment phase in the Jaro-Winkler algorithm actually prevents the optimal alignments from being found in some cases. [223000291190] |Consider aligning ABCAWXYZ with BCAWXYZ: [223000291200] |Here the first pair of A characters are matched, leading to three half transposes (the first three matched characters). [223000291210] |A better scoring, though illegal, alignment would be the following, because it has the same number of matches, but no transposes: [223000291220] |The illegal links are highlighted in yellow. [223000291230] |Note that neither alignment matches in the initial character, so the Winkler adjustments do not apply. [223000291240] |

    Acknowledgements

    [223000291250] |We'd like to thank Bill Winkler for helping us understand the versions of the algorithm and providing the strcmp95.c code as a reference implementation. [223000300010] |Naughty or Nice? Servers and Notebooks for the Holidays [223000300020] |I hope I’ve been nice enough this year to get a custom workstation and Thinkpad. [223000300030] |If not, I’m sure we’ll be getting $%^&* Dell. [223000300040] |

    Quad-core, 16GB RAM, 1TB disk: $4820.83 plus shipping

    [223000300050] |I’ve had my home machine for 3.5 years. [223000300060] |Here’s a diagram from when I built it: [223000300070] |
  • Bob’s Home Computer, Music and Video Setup
  • [223000300080] |I never did get the PlayStation after trying a friend’s, and the VCR and DVD player are long gone. [223000300090] |The power supply blew out and was replaced, the graphics card blew out and was replaced, and the memory was swapped for ECC memory to stop Java from throwing seg faults in critical loops. [223000300100] |I also added some 250GB disks about the time they were considered large. [223000300110] |Finally, something’s tempting me to upgrade, and it’s called 64-bit Java and Intel Core Duo. [223000300120] |Ever since we bought two dual-CPU, big memory (8GB and 16GB) servers at work from SWT, I’ve been thinking about replacing my home machine. [223000300130] |When I built it, my home machine was faster than anything in the office. [223000300140] |And watching films on video cassette was still considered a viable option. [223000300150] |The only problem is that our servers at work are loud. [223000300160] |Very loud. [223000300170] |And the dual-core AMDs suck down amazing quantities of power. [223000300180] |So I’m thinking about getting the following box of components and building a new machine for home, the specifications of which are available at: [223000300190] |
  • NewEgg Wish List
[223000300200] |Just to summarize, that includes a pair of Intel dual-core, 2.33 GHz Woodcrest chips with 4MB L2 cache. [223000300210] |Some of our models might even fit into that space. [223000300220] |And it has a heat sink, not a fan; the chip runs at only 65 watts! [223000300230] |And eight 2GB ECC registered memory chips at 667MHz, for a total of 16GB RAM. [223000300240] |Because most of our high performance computing is memory bound, this should give a nice boost over my existing 400MHz memory, not to mention the 333MHz memory in our servers at work. [223000300250] |Toss in a couple of 500GB SATA drives, and I’m good to go. [223000300260] |If anyone knows anything about building computers and would recommend something other than what I’ve specified, let me know —I’m not buying it until early 2007 when we’re back from traveling the great Midwest.

    24" Monitor a Necessity

    [223000300280] |But even more than a good computer, get a good monitor, like a 24" Samsung, with 1920×1200 resolution. [223000300290] |The 30" Apple ones are just too big; even the 24" are a bit too big physically (not too high resolution). [223000300300] |I considered a 24" screen a necessity when they were three times the price they are now. [223000300310] |The screen works very well for watching movies, too. [223000300320] |

    WUXGA, Dual-Core Thinkpad

    [223000300330] |Here’s what I ordered for my wife (she does the database programming and networking and I spec out and buy the hardware): [223000300340] |
  • Lenovo ThinkPad Z61P
  • [223000300350] |Lenovo (nee IBM) has the best screens and keyboards. [223000300360] |Paired with 2GB of memory and Intel dual core CPUs, this thing should rock. [223000300370] |The Z61P has the WUXGA screen, which is 1920×1200, which means it’s big enough for full 1080p HDTV. [223000300380] |Not to mention nice looking fonts. [223000300390] |Not that I’ll get to use it; she’ll probably take it into work and use it for boring things like gene sequencing and microarray analysis. [223000300400] |At least I’ll inherit her current Thinkpad. [223000310010] |Industrialist or Auteur? [223000310020] |I found the following quote in Anthony Lane’s New Yorker article about Walt Disney, "Wonderful World", Decmeber 11, 2006 (p. 69). [223000310030] |It turns out after Snow White, Disney was "still fretting over the shortcomings of his heroine", saying "’The bridge on her nose floats all over her face’". [223000310040] |Lane goes on to say: [223000310050] |He [Walt Disney] became an industry, but the one thing that links the industrialist, whatever the product, with the auteur, whatever the form, is obsessive pedantry —the will to get things right, whatever the cost may be. [223000310060] |By the criterion of "obsessive pedantry", I’m clearly an auteur. [223000310070] |In other words, you can take the boy out of academia, but you can’t take the academia out of the boy. [223000310080] |I’ve had auteurial leanings since an early age; the first creative work I ever sold was a claymation film I made with friends (they did the art, I handled the technical details); I sold it for $100 (a princely sum to a 9th grader in 1978) to the TV show Kidsworld, which was designed to showcase kids’ work. [223000310090] |In case academic English isn’t your first language, here’s a American Heritage Dictionary definition of "pedantry": [223000310100] |1. Pedantic attention to detail or rules. [223000310110] |2. An instance of pedantic behavior. [223000310120] |3. The habit of mind or manner characteristic of a pedant. [223000310130] |and here’s "pedantic": [223000310140] |1. Characterized by a narrow, often ostentatious concern for book learning and formal rules: a pedantic attention to details. [223000310150] |If you need proof, check out two blog posts ago on Jaro-Winkler string comparison. [223000320010] |LingPipe 2.4.0 Released [223000320020] |I just put LingPipe 2.4.0 up on our web servers. [223000320030] |It’s a fairly minor change in terms of code, but the upgrades are not 100% backward compatible. [223000320040] |I ran tests with Sun JDKs 1.4, 1.5, and the now-official JDK 1.6. [223000320050] |We’re sticking with 1.4 jar builds until Sun declares the end of 1.4 support. [223000320060] |Here’s the list of changes from the new home page: [223000320070] |MEDLINE 2007: The major new feature is support for the new 2007 MEDLINE document type definitions (DTDs). [223000320080] |The tutorials have also been updated to reflect the new format. [223000320090] |Sequence Training for Token Language Models: The token language models and underlying tries have been upgraded to support sequence training with explicit counts. [223000320100] |This means the new LDC Google disk can be used to train a token language model. [223000320110] |Character language models already supported this feature. [223000320120] |The underlying dynamic language model interface lm.LanguageModel.Dynamic was also changed to support sequence-level training with counts. 
[223000320130] |New Class spell.TfIdfDistance: This matches strings based on token frequency and inverse document frequency (TF/IDF). [223000320140] |Any tokenizer may be used, including character n-grams. [223000320150] |New Class spell.JaroWinklerDistance: This matches strings based on the Jaro-Winkler distance used to deduplicate name data at the U.S. Census Bureau. [223000330010] |What is NLP-based Search? [223000330020] |Matthew Hurst (Latest Post) and Fernando Pereira (Latest Post) have been having an interesting discussion of the hype surrounding Powerset. [223000330030] |The reason I call it hype is because of the New York Times‘s breathless coverage (not worth reading) and the lack of details in Powerset’s own description: [223000330040] |Powerset is a Silicon Valley company building a transformative consumer search engine based on natural language processing. [223000330050] |Our unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. [223000330060] |Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search. [223000330070] |By making search more natural and intuitive, Powerset is fundamentally changing how we search the web, and delivering higher quality results. [223000330080] |Powerset’s search engine is currently under development. [223000330090] |Please check back in the near future for more information about our technology and for signing up for our alpha. [223000330100] |I call what’s going on “hype” because I can’t tell what’s being proposed.The last thing I remember being hyped to this extent without being described was the Segway scooter. [223000330110] |So my question is: What does this “transformative consumer search engine” look like from a user’s perspective? [223000330120] |I’ll leave you with a pointer to my two comments on Matthew Hurst’s latest blog entry, NLP and Search: Free Your Mind. [223000330130] |Matthew was suggesting that we can leverage redundancy of information and present answers, not documents (or passages). [223000330140] |I’ll repeat my entries here: [223000330150] |I couldn’t agree more about doing confidence-based collective data extraction. [223000330160] |Those who’ve participated in the DARPA bakeoffs over time have been doing just what Matthew suggests. [223000330170] |That’s why we built LingPipe to handle confidence-based extraction for tagging, chunking and classification. [223000330180] |But we prefer to use an expectation-based form of inference rather than a winner-take all, which only makes sense for bakeoff entries. [223000330190] |Most social networking and clustering software works just fine with expectations. [223000330200] |In real life, at least for intelligence analysts and biologists, recall is the game. [223000330210] |The reason is the long tail. [223000330220] |Most relations are not mentioned dozens of times in various forms. [223000330230] |In other words, you may have to deal with the equivalent of “youtube data mining”, because language is vague. [223000330240] |But it’s not just recall of relations. [223000330250] |Intelligence analysts need to link back to sources to generate their reports. [223000330260] |Biologists won’t trust current relation extraction programs to guide their research because even reasoning cumulatively, they’re not precise enough. 
[223000330270] |Too much depends on context outside of the phrase, sentence or even paragraph/document in which a fact is stated. [223000330280] |Could anyone help someone like me with limited imagination? [223000330290] |I’m trying to envision what the next generation of NLP-enabled search is going to look like from a user’s perspective. [223000330300] |Let’s say my wife and I are trying to settle a bet that arose over dinner, such as whether the lead singer of Jethro Tull was Scottish or English. [223000330310] |I actually went to Google and used their newish pattern-based search: [223000330320] |Google QUERY: ian anderson was born in * [223000330330] |Here are the first few “answers”: [223000330340] |1. Paiseley, Scotland [223000330350] |2. 1947 [223000330360] |3. Scotland in 1947 [223000330370] |4. Fife in 1947 [223000330380] |5. Philadelphia [223000330390] |6. Croydon before England won the world cup [223000330400] |7. Williston, ND [223000330410] |8. Nottingham on June 29, 1948 [223000330420] |9. 1981 10. digg [223000330430] |Now which of those is the “right” answer? [223000330440] |Well, first it depends on which Ian Anderson we’re talking about. [223000330450] |There are 10 with their own Wikipedia pages. [223000330460] |The voting thing doesn’t help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. [223000330470] |I’m confused about answer 1, because it’s “Paiseley”, which is spelled wrong. [223000330480] |Wikipedia claims the answer is Dunfermline. [223000330490] |Clearly the source is important in providing answers. [223000340010] |Efficiency through Ignorance [223000340020] |We’ve been working on systems that achieve very high recall (e.g. 99.95% at 10% precision for MEDLINE gene mentions using the untuned models from our named entity tutorial). [223000340030] |In order to achieve this level of recall, pruning thresholds must be set fairly high, which entails a difficult balancing act between efficiency and accuracy. [223000340040] |While diving into the literature to see how other systems achieved their efficiency, I began to notice how many systems made strong “closed-world” assumptions in their pruning strategies. [223000340050] |For instance, Thorsten Brants’s widely used TnT Tagger is fairly typical in only tagging “known” words with tags that were found in the training data (see the TnT paper, formula 5 and section 2.3). [223000340060] |This is a mode in Adwait Ratnaparkhi’s tagger (see Tag Dictionary on p. 136 in the paper). [223000340070] |Collins’s parser uses a similar heuristic, in that it only considers lexical categories that were assigned by an underlying tagger, in this case, Ratnaparkhi’s (see section 3.1 and the description in table 1 of the paper). [223000340080] |Similar assumptions are often made (erroneously) about word senses, citing Gale, Church, and Yarowsky’s (in)famous One sense per discourse paper. [223000340090] |In all of these cases, efficiency is derived through a kind of enforced ignorance. [223000340100] |To some extent, these decisions might also help first-best accuracy, by forcing the lexical statistics to dominate when they might otherwise be overwhelmed by contextual evidence. [223000340110] |All of this reminded me of a talk by William Woods from KR ’94. [223000340120] |I never saw the talk, but found the abstract so compelling that I’ve never forgotten it: [223000340130] |Beyond Ignorance-Based Systems [223000340140] |W. A.
Woods —Sun Microsystems Laboratories, Inc., USA [223000340150] |The field of artificial intelligence has a long tradition of exploiting the potential of limited domains. [223000340160] |While this is beneficial as a way to get started and has utility for applications of limited scope, these approaches will not scale to systems with more open-ended domains of knowledge. [223000340170] |Many “knowledge-based” systems actually derive their success as much from ignorance as from the knowledge that they contain. [223000340180] |That is, they succeed because they don’t know any better. [223000340190] |Too great a reliance on a closed-world assumption and default reasoning in a limited domain can result in a system that is fundamentally limited and cannot be extended beyond its initial domain. [223000340200] |If the field of knowledge-based systems is to move beyond this stage, we need to develop knowledge representation and reasoning technology that is more robust in the face of domain extensions. [223000340210] |Nonmonotonic reasoning becomes a liability if the fundamental abilities of a system can be destroyed by the addition of knowledge from a new domain. [223000340220] |This talk will discuss some of the challenges that we must meet to develop systems that can handle diverse ranges of knowledge. [223000340230] |The problem is that there’s a very long tail in natural language data. [223000340240] |Heuristic pruning of this kind leaves a definite blind spot in systems. [223000340250] |The real problem is that even without these a priori pruning heuristics (e.g. one sense per discourse, or only tag known words with known tags), training corpus statistics may often lead systems into roughly similar decisions purely statistically. [223000350010] |Ignorance is Bliss: Tenfold Speedup through Pruning [223000350020] |HMM Decoding [223000350030] |I added two beams to the first-best HMM Viterbi decoder. [223000350040] |Both beams are synchronous, operating on a single slice of the lattice corresponding to an input token. [223000350050] |One beam prunes on emission probabilities, removing tag hypotheses for which p(token|tag) is much lower than p(token|tag’) for the best tag’. [223000350060] |The other beam prunes on the Viterbi score going forward, removing any tag whose likelihood estimate is much lower than the best tag’s. Normal Viterbi decoding has complexity O(k² n), where n is input length and k is the number of tags. [223000350070] |The square comes from considering all pairs of tags for the previous token and current token. [223000350080] |Our pruning attacks both sizes, reducing the k² factor to c1 × c2, where c1 is the number of forward hypotheses for the last tag surviving pruning and c2 is the number of tags surviving the emission pruning. [223000350090] |In practice, for our Brown corpus-based part-of-speech tagger, there are 90 tags, and therefore 90 × 90 = 8100 tag pairs. [223000350100] |Pruning reduces this number roughly ten-fold on average without loss of first-best accuracy on held-out data. [223000350110] |The beam-configurable version should be out soon in LingPipe 2.4.1. [223000350120] |Spelling [223000350130] |We’ve been working on developing our second large-scale spelling corrector, this time trained with 600MB of movie and TV-related data versus our last round training with 50GB of legal case law data (all English with some light Spanish in titles in the TV case). [223000350140] |In both cases, the out-of-the-box speed wasn’t good enough —around 10 queries/second.
[223000350150] |All of the time is spent exploring insertions and substitutions. [223000350160] |But these are constrained to continue within a known token in the token-sensitive case. [223000350170] |This means we never correct to a token we didn’t see in training data, which is generally good for accuracy on languages with light morphology like English. [223000350180] |But it also means we have a huge lever to pull for increasing speed. [223000350190] |As the number of terms is reduced, speed goes way up. [223000350200] |The second parameter to tune is the do-not-edit tokens. [223000350210] |We generally don’t edit user tokens of 1 or 2 characters, nor do we edit longer tokens with high corpus counts. [223000350220] |Combining the do-not-edit list with a pruned known-token list, we increased speed ten-fold. [223000350230] |Unfortunately, to hit the uncached rate of 100 queries/second cost us about 5% in accuracy overall, mostly in false negatives (failure to suggest corrections for errors). [223000350240] |You may not realize this from the demo on trivial amounts of data, but our spelling corrector is surprisingly accurate when you train it on a gigabyte of domain specific data. [223000350250] |We’re seeing correction rates of 60-70% (percentage of user errors for which our first-best suggestion is correct) with false positive rates around 1%. [223000350260] |Of course, if 10% of the queries contain errors, the final breakdown is: 89% user correct, no system suggestion; 7% user error, system correct suggestion, 3% user error, system wrong or no suggestion, 1% user correct, system spurious suggestion. [223000350270] |Many of the errors are fairly unrecoverable by local edits. [223000350280] |Keep in mind that this is a devilishly difficult domain for which to annotate truth, because user intentions are implicit. [223000350290] |For instance, is “toomestone” the name of a band on YouTube (152 hits on Google) or a misspelling of “tombstone” (6.5 million hits)? [223000350300] |Google suggests “tombstone” and that’s a true-positive in the second case, but a false-positive in the first case. [223000350310] |Of course, if a user with the same IP address types “tombstone”, you can at least guess it was misspelling case. [223000350320] |Google will suggest “tomb stone” might mean “tombstone”, but clearly “tomb” and “stone” could make up a query that’s not related to tombstones (for instance, for a survey of tomb construction). [223000350330] |At best, we can hope the statistics play out to let us guess right most of the time. [223000350340] |After all, if the band Toomestone had more fans and more people searching for it, there’d be more hits for it online. [223000350350] |This is the whole point of building the source side (modeling what to expect) of the noisy channel model. [223000350360] |This time around we also built some spelling-specific weighted edit distances. [223000350370] |We found that users simply don’t know when something is one word or two words, especially for names of products (e.g. they’d type “ling pipe” for “lingpipe” or “postagger” for “pos tagger”), so space insertion/deletion needed to be relatively cheap. [223000350380] |Second, they made all the usually noted errors of reduced vowel confusion (“memoribilia” vs. “memorabilia”). [223000350390] |Unfortunately, our distance model is single character and context-free, so we can’t easily model the confusion around when consonants are duplicated (“parallel” vs. “parralell” vs. “parallell”). 
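Here is a minimal sketch of the kind of single-character, context-free weighted edit distance described above, with cheap space insertion/deletion and cheap vowel-for-vowel substitution. The weights are invented for illustration; they are not the ones used in the corrector discussed here.

    // Sketch of a spelling-oriented weighted edit distance: single character,
    // context free, with discounted space edits and vowel confusions.
    public class SpellingEditDistance {

        private static final String VOWELS = "aeiou";

        private static double insertOrDeleteCost(char c) {
            return c == ' ' ? 0.2 : 1.0;   // users merge and split words freely
        }

        private static double substituteCost(char a, char b) {
            if (a == b)
                return 0.0;
            if (VOWELS.indexOf(a) >= 0 && VOWELS.indexOf(b) >= 0)
                return 0.4;                // reduced-vowel confusions are common
            return 1.0;
        }

        // Standard dynamic-programming edit distance with the costs above.
        public static double distance(String s, String t) {
            int n = s.length(), m = t.length();
            double[][] d = new double[n + 1][m + 1];
            for (int i = 1; i <= n; ++i)
                d[i][0] = d[i - 1][0] + insertOrDeleteCost(s.charAt(i - 1));
            for (int j = 1; j <= m; ++j)
                d[0][j] = d[0][j - 1] + insertOrDeleteCost(t.charAt(j - 1));
            for (int i = 1; i <= n; ++i) {
                for (int j = 1; j <= m; ++j) {
                    double del = d[i - 1][j] + insertOrDeleteCost(s.charAt(i - 1));
                    double ins = d[i][j - 1] + insertOrDeleteCost(t.charAt(j - 1));
                    double sub = d[i - 1][j - 1]
                        + substituteCost(s.charAt(i - 1), t.charAt(j - 1));
                    d[i][j] = Math.min(sub, Math.min(del, ins));
                }
            }
            return d[n][m];
        }

        public static void main(String[] args) {
            System.out.println(distance("ling pipe", "lingpipe"));       // cheap: one space deletion
            System.out.println(distance("memoribilia", "memorabilia"));  // cheap: vowel substitution
            System.out.println(distance("parralell", "parallel"));       // consonant edits get no discount
        }
    }

With weights like these, “ling pipe” lands very close to “lingpipe” and “memoribilia” close to “memorabilia”, while the consonant-doubling errors still pay full price.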
[223000350400] |Of course, you’re going to add your own cache, aren’t you? [223000350410] |In both cases, query logs show a huge number of repetitions (cache hits of over 2/3 in the TV-related queries after 20K queries). [223000350420] |That’s one of the advantages Google has —virtually infinite cache space for things they can cast as indexing/search/lookup problems. [223000350430] |Caching and Pruning [223000350440] |Combined, these two general techniques have taken our very slow HMM and spelling correctors up to usable speeds. [223000360010] |A PDF’n Mess [223000360020] |Don’t believe the marketing. [223000360030] |I’m working on a little project to mine research articles in “portable document format,” known on the street as PDF. [223000360040] |Maybe they know something about the word “portable” that I don’t. [223000360050] |I didn’t quite realize just how much the “document” in “PDF” was a visual markup. [223000360060] |I’ve been using PDFBox, a Java API for extracting content from PDFs. [223000360070] |It makes a noble effort and is recommended by projects such as Lucene and Nutch. [223000360080] |Anyway, here’s the problem. [223000360090] |Let’s take a random PDF-formatted paper from last year. [223000360100] |Now open the doc in Adobe Acrobat, the reference browser, and try to find the phrase “fulfills” (Hint: it follows the phrase “which fully”). [223000360110] |No luck? [223000360120] |Not surprising. [223000360130] |The “fi” in “fulfills” is what’s known as a ligature. [223000360140] |Those who are very picky about type know that just jamming an “f” and “i” character together is suboptimal because of the drop on the “f” interfering with the dot on the “i” if they are kerned properly. [223000360150] |So in hot-metal type days, and through this day on well-designed typefaces for the computer, any time an “f” is followed by an “i”, the pair are replaced with a special “fi” ligature “character”. [223000360160] |There are even Unicode code points for ligatures to further confuse the issue. [223000360170] |At least for Unicode, we have a clear standard and IBM’s liberally licensed International Components for Unicode, which will normalize those ligatures right back to two character sequences. [223000360180] |Don’t even get me started with anything outside of ASCII; simple Latin1 vowels with umlauts come out as two character sequences in Acrobat and as “currency1” in PDFBox. [223000360190] |Japanese is lost in deep character conversion lookup exceptions, etc. etc. [223000360200] |And yes I’ve already pleaded for help on PDFBox’s forums, but I can hardly blame the poor guy running this thing as a free open source project if Adobe can’t get it right in their latest Acrobat. [223000360210] |If anyone’s got a better idea for converting PDFs to usable text, drop me (carp) a line (at alias-i.com). [223000370010] |U.S. Govt Research Spending: AAAS FY 2008 Report [223000370020] |As some of you may have read on our About Alias-i page, up until two years ago, when we started picking up serious commercial work, the majority of our revenue was generated through government research programs. [223000370030] |Our latest grant was in the area of text data mining for the National Library of Medicine (NLM), one of the institutes making up the National Institutes of Health (NIH). [223000370040] |This is the best thing I’ve found for an overview of where research and development money is being spent by the United States government: [223000370050] |Kei Koizumi. [223000370060] |February 2007.
[223000370070] |R&D in the [U.S.] FY 2008 Budget. [223000370080] |American Association for the Advancement of Science. [223000370090] |It’s especially nice to see the upward trend of NIH funding over the past 10 years. [223000370100] |We’re waiting to hear back on a Phase II SBIR proposal; it was our Phase I SBIR that launched our work in biomedical text data mining. [223000370110] |We were originally funded by the Defense Advanced Research Projects Agency (DARPA), the folks who funded the folks who invented the internet. [223000370120] |It’s really a shame that DARPA has been cutting back on academic funding (here’s another article), though I can’t say I miss phrases like “tooth to tail ratio”. [223000370130] |Even the funding that exists for academics in speech and language processing, such as in the Global Autonomous Language Exploitation (GALE) program, is moving even further away from “research” and toward “development.” [223000370140] |I found it very surprising that the social science research budget (not counting psychology) was 5% of NIH’s budget. [223000370150] |That’s a whole lot of social science. [223000380010] |To Stem or Not to Stem? [223000380020] |I’ve been working for a customer on unsupervised and semi-supervised approaches to stemming for information retrieval. [223000380030] |I’d never thought too hard about stemming per se, preferring to tackle the underlying problem in applications through character n-gram indexing (we solve the same problems facing tokenized classifiers using character language models). [223000380040] |I have thought about computational morpho-phonology back in my logic programming/constraint grammar phase. [223000380050] |I found two very interesting evaluation papers in the same issue of JASIS, the first of which is: [223000380060] |Paice, C. D. 1996. [223000380070] |Method for evaluation of stemming algorithms based on error counting. [223000380080] |Journal of the American Society for Information Science 47(8):632–649. [223000380090] |Wiley Pay Link (I couldn’t find a free copy online) [223000380100] |The Paice paper introduces a stemming-as-clustering evaluation, treating the clustering evaluation as one of equivalence recall and precision (just like LingPipe’s clustering precision/recall evaluation). [223000380110] |Paice assumes, quite rightly for stemming, if I may say so, that the issue in question is whether two words word1 and word2 have the same stem, not the actual identity of the stem. [223000380120] |This makes sense because reverse-indexed IR systems don’t care about the form of a word, just its identity. [223000380130] |Paice then draws a distinction between “light” and “heavy” stemming, but I found this unmotivated as a two-way classification.
[223000380140] |Extending Paice’s example, here are the words I found in the English Gigaword corpus with etymological root author: [223000380150] |antiauthoritarian, antiauthoritarianism, antiauthority, author, authoratative, authoratatively, authordom, authored, authoress, authoresses, authorhood, authorial, authoring, authorisation, authorised, authorises, authoritarian, authoritarianism, authoritarians, authoritative, authoritatively, authoritativeness, authorities, authoritory, authority, authorization, authorizations, authorize, authorized, authorizer, authorizers, authorizes, authorizing, authorless, authorly, authors, authorship, authorships, coauthor, coauthored, coauthoring, coauthors, cyberauthor, deauthorized, multiauthored, nonauthor, nonauthoritarian, nonauthorized, preauthorization, preauthorizations, preauthorized, quasiauthoritarian, reauthorization, reauthorizations, reauthorize, reauthorized, reauthorizes, reauthorizing, semiauthoritarian, semiauthorized, superauthoritarian, unauthorised, unauthoritative, unauthorized [223000380160] |As you can see, it’s rather difficult to draw a line here. [223000380170] |The shared root of author and authorize is lost to most native speakers, despite having a regular suffix -ize. [223000380180] |In contrast, the relation between notary and notarize is regular, but the relation between note and notary feels more opaque. [223000380190] |And we’re not even considering word sense ambiguity here, a closely related problem for search. [223000380200] |These uncertainties in derivational stem equivalence are why many people want to restrict stemming to inflectional morphology. [223000380210] |Inflections are forms of words, like the verb forms and noun number (e.g. -s makes a noun plural), whereas derivational morphology often converts one syntactic type to another (e.g. -ly converts an adjective to an adverb). [223000380220] |The problem with restricting to inflectional morphology is that it’s not the right cut at the problem. [223000380230] |Sometimes derivational stemming helps and sometimes inflectional stemming hurts. [223000380240] |This brings us to the second interesting paper: [223000380250] |Hull, David A. 1996. [223000380260] |Stemming algorithms: a case study for detailed evaluation. [223000380270] |Journal of the American Society for Information Science. [223000380280] |47(1): 70–84. [223000380290] |[CiteSeer link to paper] [223000380300] |Hull goes through a couple hundred TREC queries with a fine-toothed comb, finding that it’s almost impossible to declare a single winner among stemmers. [223000380310] |He compared the simple Porter and Lovins stemmers with more linguistically sophisticated algorithms developed at Xerox (now part of Inxight’s offerings), and less sophisticated algorithms like reducing every word to its first 5 characters or just removing final s. [223000380320] |There was simply not a clear winner. [223000380330] |The beauty of the paper is in the careful case studies in which he contrasted stemmers. [223000380340] |For example, query 21 contained the word superconductivity, but matches documents containing only the form superconductors, clearly requiring derivational morphology for high recall. [223000380350] |A similar problem occurs for surrogate mother and surrogate motherhood in query 70. [223000380360] |Another case was genetic engineering vs. genetically engineered. [223000380370] |Another example where stemming helped was failure versus fail in the context of bank failures.
[223000380380] |An example where stemming hurt was in client server matching serve and client, leading to numerous false positive matches; this is really a frequency argument, as servers in the computer sense do serve data. [223000380390] |Another example was reducing fishing to fish, causing recall problems even though it’s a simple inflectional change. [223000380400] |Other problems arose in converting privatization to private, which hurt precision. [223000380410] |What’s a poor computational linguist to do? [223000380420] |One thing I’d recommend is to index both the word and its stem. [223000380430] |If you’re using a TF/IDF-based document ranker, such as the one underlying Apache Lucene, you can index both the raw word and any number of stemmed forms in the same position and hope IDF sorts them out in the long run. [223000380440] |That seems to work for Chinese, where folks tend to index both words and unigrams and/or bigrams because n-grams provide high recall and words provide added precision. [223000380450] |If anyone knows of other evaluation’s like Hull’s that involve n-gram based stemmers, I’d love to hear about them. [223000390010] |How Home Dentistry Kits and LingPipe are Similar [223000390020] |LingPipe is a tough sell to most commercial customers without professional services. [223000390030] |Occasionally I will do a deal where all I do is cash a check but almost all of our customers want lots of help. [223000390040] |Suggest that they build it themselves and they look at me like I suggested a home root canal. [223000390050] |Why? [223000390060] |Take one of our simplest capabilities, language model classification. [223000390070] |There is a simple, runs out of the box tutorial that takes a developer through line by line what needs to be done to do some classification. [223000390080] |It is really simple. [223000390090] |Yet I cannot get certain customers working with it. [223000390100] |The sticking point, I believe, is that unfamiliarity plus the slightly loose nature of machine learning techniques is too great a jump conceptually. [223000390110] |The DynamicLMClassifier needs the labels of the categories (easy), boolean choice of whether to use a bounded or sequence based language model (starting to feel a little woozy) and a character n-gram size (whoa, a ‘whackety n-germ thingy’). [223000390120] |The tutorial suggests that 6 is a good initial n-gram value but they are lost at this point I think. [223000390130] |It gets worse because I suggest in the tutorial that they try different n-gram sizes to see what produces the best score. [223000390140] |The scoring is nicely provided as part of the tutorial as well. [223000390150] |This only gets worse as we dig deeper into the LingPipe API. [223000390160] |Tuning these systems requires a particular mindset that is not a part of a core computer science curriculum. [223000390170] |It doesn’t require great intelligence, but experience is a must. [223000390180] |Until we find a way to sort this out we will continue to see such systems out of general production. [223000390190] |My mantra is “make computational linguistics as easy to use as a database.” [223000390200] |We have a ways to go before we move away from the black art status of our field. [223000390210] |breck [223000400010] |LingPipe 2.4.1 Released [223000400020] |LingPipe 2.4.1 is out. [223000400030] |It’s backward compatible, patches a few bugs, and adds a few new features: linear and logistic regression, HMM decoder pruning, and spelling evaluation. 
[223000400040] |It also updates some jars and patches all the bugs we’ve found (or you’ve found). [223000400050] |More information and a download link are available from the LingPipe home page. [223000410010] |Spring Cleaning: Generics for LingPipe 3.0 [223000410020] |We’ve decided to refactor LingPipe for generics for the 3.0 release. [223000410030] |After three solid days of work (and untold nights and mornings reading about generics), I’ve finished the refactor of com.aliasi.util; the Iterators class and SmallSet were rather challenging for a rookie like me. [223000410040] |My major complaint about Java when I first saw it was the lack of parametric typing. [223000410050] |Many, if not most, of my programming errors are from assigning something to the wrong argument. [223000410060] |I’m hopeless in Perl, for instance. [223000410070] |As always, be careful what you wish for. [223000410080] |Little prepared me for the complexity of Java 1.5 generics. [223000410090] |My hat’s off to the architects and developers of this amazing retro-fit. [223000410100] |The addition of generics to a non-generic language is a heroic feat of backward compatibility engineering. [223000410110] |What I should’ve realized before I started, but only dawned on me about 10 classes in is that just adding types doesn’t break backward compatibility anywhere. [223000410120] |That’s why it’s possible to run 1.4-style programs in a 1.5 or 1.6 compiler or JVM. [223000410130] |Very very cool. [223000410140] |Luckily, using generic classes is a whole lot easier than writing them. [223000410150] |Even writing them is easy if you’re only worried about type safety and saving some work, and not on making something general and extensible. [223000410160] |My main gripe has to do with arrays. [223000410170] |As in pre-generics Java, they’re the poor cousins of types. [223000410180] |There’s no way to create arrays of generic types without having a handle on the java.lang.Class object for the type being created. [223000410190] |For instance, there are two toArray() methods in java.util.Collection. [223000410200] |Suppose we have a Collection. [223000410210] |The first array creation method has the signature Object[] toArray(), not E[] toArray() as one might expect, because it’s simply not possible to define the latter. [223000410220] |The second array creation method has the signature T[] toArray(T[] a), which is problematic in two ways. [223000410230] |First, you need to pass in an array to fill. [223000410240] |Second, you can still generate type errors if you do something like this: [223000410250] |If you omit the second line (in bold), the program compiles and runs with no complaint. [223000410260] |If you add the second line, the program compiles, but throws a runtime exception during the underlying call to System.arrayCopy(). [223000410270] |In LingPipe, I often used arrays exactly because they were type safe compared to returning collections. [223000410280] |They also tend to be smaller than collections in memory, as there’s no overhead for wrappers, buckets, tree balancing counts, etc. [223000410290] |Now I’ve painted myself into a corner with generics. [223000410300] |I’ve had to replace some return types that would’ve been E[] for a type variable E, with List, because it’s the only way I could ensure type safety without requiring a class object (or array) be passed in. Luckily, none of that stuff’s in deep loops. 
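Here is a minimal sketch of the toArray() pitfall described a few sentences back (a reconstruction of the usual example, not necessarily the exact one from the original post): the call compiles because T is inferred from the argument, then blows up at runtime once there is actually an element to copy.

    import java.util.ArrayList;
    import java.util.List;

    public class ToArrayPitfall {
        public static void main(String[] args) {
            List<String> xs = new ArrayList<String>();
            xs.add("hello");   // comment this line out and the program runs silently

            // Compiles fine: T is inferred as Integer from the argument type.
            // At runtime, copying a String into an Integer[] throws an
            // ArrayStoreException from the underlying array copy.
            Integer[] ys = xs.toArray(new Integer[0]);
            System.out.println(ys.length);
        }
    }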
[223000410310] |That’s all arrays of primitive types, which nobody ever expected to play nicely with generics or anything else. [223000410320] |As always, my advice to someone wanting to learn Java types is to study the java.util collections framework. [223000410330] |That, and the Kernighan and Ritchie of Java: [223000410340] |Ken Arnold, James Gosling and David Holmes. 2006. [223000410350] |The Java Programming Language, Fourth Edition. [223000410360] |Addison-Wesley. [223000410370] |I had to read Chapter 11 (Generic Types) many many many many times. [223000410380] |Check out Section 16.10.1 (Genericity and Dynamic Arrays) for a deep explanation of what’s going on in the above example. [223000410390] |Here are some links to complement the book: [223000410400] |
  • Gilad Bracha’s Generics Tutorial. [223000410410] |The Sun-recommended tutorial.
  • [223000410420] |
  • Sun’s Java Tutorial: Generics Trail
  • [223000410430] |
  • Angelika Langer’s Generics FAQ
  • [223000410440] |
  • Ken Arnold (author of above book) wrote a polemic blog entry Generics Considered Harmful in which he ‘fesses up that "writing generified classes is rocket science" *
  • [223000410450] |*Note that Arnold actually got his types wrong and his example should be T[] toArray(Class); makes me think he didn’t read his own book! [223000410460] |Update: April 11, 2007 [223000410470] |It occurred to me yesterday that java.util.ArrayList must allocate generic arrays, because it implements List based on arrays. [223000410480] |Checking out Java’s source shows you how to “cheat” the type system: [223000410490] |It’s easy to forget that types are simply erased in the implementation. [223000410500] |This kind of operation is safe if it’s under your control, as it is in java.util.ArrayList and many of the LingPipe internals. [223000410510] |It’ll still cause the compiler to gripe. [223000410520] |I think I can live with that. [223000410530] |Update April 26, 2007: I’m done. [223000410540] |We’re about to release 3.0. [223000410550] |But there was one final bug that Breck caught compiling on 1.6. [223000410560] |Turns out that it’s a known issue with Java 1.6 throwing an assertion error from the compiler. [223000410570] |It’s already in Sun’s DB of bugs, as bug 6500207. [223000410580] |It arises with conjunctions of types in generic restriction. [223000410590] |The bug report has a simple example; it arose for us in a class that wasn’t going out with 3.0 anyway, an EM/soft clusterer with Dirichlet priors, so hopefully it’ll be fixed soon. [223000420010] |LingPipe 3.0 Released [223000420020] |We’re happy to announce the release of LingPipe 3.0. [223000420030] |As usual, you may view it or download it from the LingPipe Home Page. [223000420040] |

    Major Release

    [223000420050] |The latest release of LingPipe is LingPipe 3.0.0. [223000420060] |This major release replaces LingPipe 2.4.1. [223000420070] |Although 3.0 maintains all of the functionality of 2.4, it is not 100% backward compatible. [223000420080] |All demos and tutorials have been updated to the new API. [223000420090] |

    Generics

    [223000420100] |The major change that will be evident is that the API has been modified to use generic types. [223000420110] |This has allowed us to remove many redundant classes and generalize other common behavior. [223000420120] |Many of the classes now implement java.lang.Iterable, allowing them to be used in the Java 1.5 foreach construct. [223000420130] |Most of the type-specific utility methods have been replaced with generic alternatives. [223000420140] |Keep in mind that like in Java 1.5 itself, you don’t need to use the generics. [223000420150] |You may continue to write non-generic code against our API in the same way as for the Java 1.5 collections framework. [223000420160] |
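    Here is a schematic of what that looks like in client code; the Tokens class below is a made-up stand-in for any class that now implements Iterable, not an actual LingPipe class.

        import java.util.Arrays;
        import java.util.Iterator;
        import java.util.List;

        public class GenericsDemo {

            // stand-in for a LingPipe-style container that now implements Iterable
            static class Tokens implements Iterable<String> {
                private final List<String> toks;
                Tokens(String... toks) { this.toks = Arrays.asList(toks); }
                public Iterator<String> iterator() { return toks.iterator(); }
            }

            public static void main(String[] args) {
                Tokens tokens = new Tokens("the", "quick", "brown", "fox");

                // 1.5-style: generic types plus the foreach construct
                for (String tok : tokens)
                    System.out.println(tok);

                // 1.4-style still compiles: raw types and explicit iterators keep working
                for (Iterator it = tokens.iterator(); it.hasNext(); )
                    System.out.println((String) it.next());
            }
        }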

    Clustering

    [223000420170] |The com.aliasi.cluster package was completely rewritten from the ground up. [223000420180] |Now, rather than clustering the rows of a labeled matrix, a clusterer clusters a set of objects under a distance measure. [223000420190] |The dendrogram classes for hierarchical clustering results have been cleaned of their indexing behavior, which was only necessary for the previous implementations. [223000420200] |For the new API, there’s a completely new clustering tutorial, which, among other things, uses linguistic examples such as clustering documents by topic or entity mentioned. [223000420210] |We’ve included Bagga and Baldwin’s John Smith data (197 New York Times articles annotated for which of 35 different John Smiths is mentioned); it’s available as the tarball demos/data/johnSmith.tar.gz. [223000420220] |
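    To illustrate the new style of input, a set of objects plus a distance measure, here is a toy threshold-based single-link clusterer; it is a conceptual sketch only, not the com.aliasi.cluster API, for which see the new clustering tutorial.

        import java.util.ArrayList;
        import java.util.LinkedHashSet;
        import java.util.List;
        import java.util.Set;

        public class DistanceClusteringDemo {

            interface Distance<E> {
                double distance(E e1, E e2);
            }

            // single-link clustering at a fixed threshold: an item joins (and merges)
            // every existing cluster containing some member within maxDist of it
            static <E> List<Set<E>> cluster(Set<E> items, Distance<E> dist, double maxDist) {
                List<Set<E>> clusters = new ArrayList<Set<E>>();
                for (E item : items) {
                    Set<E> merged = new LinkedHashSet<E>();
                    merged.add(item);
                    List<Set<E>> next = new ArrayList<Set<E>>();
                    for (Set<E> cluster : clusters) {
                        if (minDistance(item, cluster, dist) <= maxDist)
                            merged.addAll(cluster);
                        else
                            next.add(cluster);
                    }
                    next.add(merged);
                    clusters = next;
                }
                return clusters;
            }

            static <E> double minDistance(E item, Set<E> cluster, Distance<E> dist) {
                double min = Double.POSITIVE_INFINITY;
                for (E member : cluster)
                    min = Math.min(min, dist.distance(item, member));
                return min;
            }

            public static void main(String[] args) {
                Distance<String> firstCharDistance =
                    (s1, s2) -> s1.charAt(0) == s2.charAt(0) ? 0.0 : 1.0;
                Set<String> docs = new LinkedHashSet<String>(
                    List.of("apple", "apricot", "banana", "blueberry", "cherry"));
                // prints three clusters, grouped by first letter
                System.out.println(cluster(docs, firstCharDistance, 0.5));
            }
        }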

    LingPipe in Eclipse

    [223000420230] |We added a tutorial on LingPipe in Eclipse, which explains how to get started building LingPipe in the Eclipse Integrated Development Environment (IDE). [223000420240] |

    Distance and Proximity

    [223000420250] |Two generic interfaces, Distance and Proximity, were added to the utility package. [223000420260] |These are used not only in clustering, but also in the distance functions in the com.aliasi.spell package. [223000420270] |

    Matrices and Vectors

    [223000420280] |The com.aliasi.matrix package was simplified to remove the complexities of labeling. [223000420290] |In the future, we plan to build this package out with sparse and memory-mapped matrices. [223000420300] |

    Iterators

    [223000420310] |Iterators that were formerly in util.Arrays, namely ArrayIterator and ArraySliceIterator, may now be found in the unified Iterators utility class. [223000420320] |A new Iterators.Empty class was added in order to support genericity; it replaces the overloaded constant. [223000420330] |Finally, util.SequenceIterator was rolled into util.Iterators along with the others, as util.Iterators.Sequence. [223000420340] |

    MEDLINE Parsing Standardized

    [223000420350] |The medline.MedlineParser class was modified to implement corpus.Parser. [223000420360] |At the same time, the class medline.MedlineHandler was modified to implement corpus.Handler. [223000420370] |The unused corpus.MedlineCitationHandler interface was removed. [223000420380] |

    ObjectToCounter Simplified

    [223000420390] |The util.ObjectToCounter interface was removed; we only ever used the util.ObjectToCounterMap implementation, a generic version of which remains. [223000420400] |

    Unused Classes Removed

    [223000420410] |In the code review for generics, we found unused classes in the com.aliasi.coref package, Entity and EntityFactory. [223000420420] |The class util.SmallArray was removed. [223000420430] |The interface util.StringDistance was removed; it is replaced with the generic util.Distance interface. [223000420440] |Finally, the util.Visitor interface was removed; the corpus.Handler interface is doing its job. [223000430010] |TF/IDF, Scaling and Symmetry [223000430020] |Search engines typically use TF/IDF scaling only on the query, because it’s too hard to compute dynamically on the documents, leading to an asymmetry in scoring. [223000430030] |I’m building some discriminitave classifiers for LingPipe. [223000430040] |I just finished with voted kernel perceptrons, which are way cool, and though lighter than SVMs, still awfully heavy. [223000430050] |A simpler discriminative classifier is based on inverse-document frequency weighting, as is commonly used in search engines. [223000430060] |In this situation, the weight applied to a search term is inversely related to the number of documents in which it appears. [223000430070] |For instance, a term appearing in docFrequency documents, where there is a total number of numDocs documents, is typically weighted by a damped inverse ratio like: [223000430080] |For base 2 logs, and numDocs=1024, we have [223000430090] |Basically, this approach upweights terms that don’t show up in many documents. [223000430100] |If we are using TF/IDF ranking for classification, IDF weighting provides a parametric form of discriminitive feature weighting. [223000430110] |Term frequencies are typically dampened as well, often with a square root penalty: [223000430120] |This has the effect of making score rise linearly with the number of overlapping counts, rather than quadratically, as it’d do in the un-dampened raw frequency counts. [223000430130] |Scoring is typically done by cosine. [223000430140] |That is, a length-normalized dot product of query and document vectors. [223000430150] |What I didn’t notice until I dug into the internals of the Apache Lucene search engine and really paid attention to the scoring section in Witten et al.’s Managing Gigabytes is that IDF normalization is only applied to the query vector. [223000430160] |That is: [223000430170] |The reason for this is that if doc frequencies and the number of docs continually changes, so would the length of a document vector if it were normalized by IDF. [223000430180] |In this case, every document in the result set would have to be renormalized for every query, requiring multiplications and additions for each term in the document. [223000430190] |Even if we could afford the arithmetic overhead, we find that reverse indexes don’t even support looking up all of the terms in a document easily, because they’re maps from terms to documents, not the other way around. [223000430200] |Instead, in dynamic search engines, IDF is not applied to doc vectors, allowing their lengths to be computed as soon as they are added to the index. [223000430210] |The upshot of only IDF-weighting queries is a surprising asymmetry. [223000430220] |If we index a document by its text and then create a query out of the same text, the score will be less than 1 (recall that if all entries are positive counts, cosines range between 0 and 1 inclusive). 
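Here is a toy illustration of that asymmetry using textbook formulas (square-root damped term frequencies on both sides, log IDF applied only to the query), not Lucene’s exact scoring; a document scored against a query built from its own text comes out below 1.

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class QueryOnlyIdfDemo {

        // idf(term) = log2(numDocs / docFrequency)
        static double idf(int docFrequency, int numDocs) {
            return Math.log((double) numDocs / docFrequency) / Math.log(2.0);
        }

        static double cosine(Map<String,Double> x, Map<String,Double> y) {
            double dot = 0.0, xx = 0.0, yy = 0.0;
            Set<String> terms = new HashSet<>(x.keySet());
            terms.addAll(y.keySet());
            for (String term : terms) {
                double a = x.getOrDefault(term, 0.0);
                double b = y.getOrDefault(term, 0.0);
                dot += a * b;  xx += a * a;  yy += b * b;
            }
            return dot / (Math.sqrt(xx) * Math.sqrt(yy));
        }

        public static void main(String[] args) {
            int numDocs = 1024;                                        // made-up collection stats
            Map<String,Integer> docFrequency = Map.of("gene", 512, "p53", 4, "the", 1024);
            Map<String,Integer> termCounts   = Map.of("gene", 2, "p53", 1, "the", 4);

            Map<String,Double> docVector = new HashMap<>();            // sqrt-damped TF only
            Map<String,Double> queryVector = new HashMap<>();          // sqrt-damped TF times IDF
            for (Map.Entry<String,Integer> e : termCounts.entrySet()) {
                double tf = Math.sqrt(e.getValue());
                docVector.put(e.getKey(), tf);
                queryVector.put(e.getKey(), tf * idf(docFrequency.get(e.getKey()), numDocs));
            }
            // prints a value less than 1.0 even though the query text equals the doc text
            System.out.println(cosine(queryVector, docVector));
        }
    }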
[223000430230] |To score maximally against a document, what is required is a query with weights that will invert the IDF weighting applied to queries so that the query vector actually matches the document vector. [223000430240] |That is, the best scoring vector against a document is the vector created from the document term frequencies with inverse IDF weighting. [223000430250] |For the TF/IDF classifier I’m building into LingPipe, the IDFs are computed for document vectors at compile time, thus ensuring that the only query that will score 1.0 against a document is the query constructed out of the document itself. [223000430260] |Of course if you’re a major search engine, and thus matching by boolean match and sorting by some social network metric of influence with some fiddling, none of this really matters. [223000430270] |I actually miss Excite, which did TF/IDF search. [223000430280] |Perhaps that’s not surprising given that Doug Cutting, the founder and original developer of Lucene, was their lead search engine developer. [223000440010] |LingPipe 3.1.0 Released [223000440020] |LingPipe 3.1.0 has some important bug fixes: small set generics, dictionary chunker case sensitivity, and dictionary training of chunkers. [223000440030] |There are new classifiers (Perceptron, K-nearest-neighbor and TF/IDF) and a new clusterer (K-means) based on a new feature extraction interface and vector distances. [223000440040] |There are also some new tutorial sections: chunking text into noun phrases and verb phrases, and dictionary-based entity extraction. [223000440050] |Read more about and download the latest version from the LingPipe Home Page. [223000450010] |Is 90% Entity Detection Good Enough? [223000450020] |Ryan McDonald made the following comment on the NLPers blog: Tracking the State of the Art: [223000450030] |Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90% This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. [223000450040] |Entity taggers are so robust today, that they can often be (and are) used out-of-the-box for many real world applications. [223000450050] |Could someone point me to 90% accuracy gene taggers? [223000450060] |Ideally, boxed ones that could be used for real-world applications. [223000450070] |This is better than the best result from the last Biocreative on a single category, GENE, (not that anyone would know, because they haven’t made results public), and that evaluation used soft boundary decisions by allowing multiple “correct” boundaries. [223000450080] |Results for the GENIA corpus, with the kind of structured categories Ryan’s talking about have been much lower. [223000450090] |Most of the 90% accuracy of high-accuracy taggers is derived from tagging the same thing right again and again (e.g. “the” and “be” for English POS taggers, “George Bush” and “IBM” for English entity taggers trained on WSJ). [223000450100] |Entity mentions have a power law (Zipfian) distribution, so most of them aren’t mentioned very frequently. [223000450110] |Thus 90% per-token accuracy translates into terrible per-type accuracy. [223000450120] |Especially when one moves away temporally and topically from the training data (e.g. from well-edited, conventiaonlly structured Wall St. Journal news articles to movie or music reviews). [223000450130] |The situation gets worse if we’re looking for simple relations through collocations. 
[223000450140] |Errors in sentences tend to be correlated, not independent, as many folks have noticed over the years. [223000450150] |Thus we’re likely to get higher than .9 * .9 recall of common binary relations, but much lower than .9 * .9 recall of rarer binary relations. [223000450160] |For applications, we simply don’t need the correlation between “George Bush” and “White House”, or between “p53″ and “apoptosis”; everyone knows those relations. [223000450170] |Anyway, this problem is what’s led us to focus on high-recall approaches. [223000450180] |Check out our named entity recognition tutorial for more information and even examples that run out of the box. [223000450190] |Ryan McDonald was at the talk at Columbia Uni I gave on entity extraction with high recall, and his point was that often the errors are one of boundaries or of the type assignment, and that it’s really no big deal for an app if the phrase is off by a token or two on the boundaries. [223000450200] |I suppose that depends on the application. [223000450210] |Boundary words are often discriminative in gene names or person names, and we’ve had a world of trouble historically in coreference by clustering when the names get off by one. [223000470010] |Release: LingPipe GUI Named Entity and Chunk Annotator [223000470020] |We’re soft-releasing our new GUI tool for annotating corpora with named entities or other chunks. [223000470030] |It’s set up a little differently than other such tools with which you might be familiar, such as Callisto or WordFreak. [223000470040] |It’s basically token-oriented with combo-box controls, making it look a lot like the CoNLL named entity data in the GUI view. [223000470050] |The goal was to make it easy to drive from the keyboard rather than working as a text editor with highlight and menu select. [223000470060] |It does tag-a-little, learn-a-little style semi-automatic annotation (a la MITRE’s seminal Alembic Workbench) and tracks progress through a corpus annotation project. [223000470070] |It assumes inputs are in a specified XML format. [223000470080] |For now, the annotator’s in the sandbox, under project name citationEntities. [223000470090] |The name arose because the project started as something to annotate a bunch of bibliographic data. [223000470100] |Even though it’s in the sandbox, I’ve used it to annotate a ton of data already and I’ve ironed out all the bugs I’ve found. [223000470110] |You can find our general instructions for checking projects out of the sandbox at: [223000470120] |
  • LingPipe Sandbox
  • [223000470130] |You can also use the following command to download the entire project into the current working directory: [223000470140] |Just make sure to put it all on one line; it’s on two here because of the blog formatting. [223000470150] |The place to start is the top-level read-me file: [223000470160] |Let us know if you have any comments at lingpipe@alias-i.com [223000480010] |Read/Write Locks for Multi-Threaded Training and Execution [223000480020] |I finally built an application that exercised the write half of the read/write lock pattern used for most of LingPipe’s classifiers and chunkers (see the recent blog entry describing our chunk annotation GUI). [223000480030] |Perhaps not coincidentally, we’re getting customer requests and mailing-list requests for the same functionality: the ability to have multiple users (perhaps over the web), add new training data dynamically, while having the server stay up to date. [223000480040] |It’s an issue for everything from significant phrase extraction to chunking. [223000480050] |This post will explain how LingPipe’s models need to be synchronized so that they are thread safe under concurrent training and decoding (where decoding may be significant phrase extraction, language model estimation, chunking, classification or tagging). [223000480060] |Throughout, I’ve implemented our models to be thread-safe for multiple readers. [223000480070] |That’s not an issue, and if that’s all you want to do, have at it. [223000480080] |We have demos that run on servlets this way. [223000480090] |A model is a member object of a servlet and multiple threads access it in the default configuration. [223000480100] |The problem comes in for writes, which means training in this case, including resetting training and decoding parameters. [223000480110] |When a thread is writing to a model, no other thread may read or write from the model. [223000480120] |Luckily, this is a very common pattern of synchronization, known as the read-write lock pattern. [223000480130] |In addition to generics, the 1.5 JDK finally brought Doug Lea‘s wonderful util.concurrent library into the language as java.util.concurrent. [223000480140] |If you don’t know anything about threads, but like detailed coding books, I can recommend Brian Goetz’s master class textbook, Java Concurrency in Practice, to which Doug Lea, Josh Bloch and other Java API design and textbook gurus contributed. [223000480150] |In code, read-write locks are almost trivial. [223000480160] |Here’s what you do. [223000480170] |Let’s say you have a dynamic classifier: [223000480180] |We’ll also need a read-write lock: [223000480190] |The parameter true makes the annotator fair in that it schedules threads for the lock in the order of requests. [223000480200] |This is not necessary for LingPipe, but may be desirable for applications. [223000480210] |Then, if we have a piece of training data on a thread, we simply grab the write lock: [223000480220] |Then do the training inside the scope of the lock, making sure to release the lock in a finally block in case the handler throws an exception (this may be unchecked, such as an underlying out-of-memory exception, or may be a checked exception, such as an I/O exception in some cases). [223000480230] |As with training, decoding requires acquiring a lock, doing some work, and then freeing the lock in a finally block: [223000480240] |Note that only the action requiring synchronization is locked. 
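Here is the whole pattern in one minimal sketch; the TrainableModel interface and its method names are placeholders standing in for whichever LingPipe class is being trained and decoded, not the actual API.

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class ReadWriteLockedModel {

        // placeholder for a LingPipe class that supports interleaved training and decoding
        interface TrainableModel {
            void train(String text, String category);
            String classify(String text);
        }

        private final TrainableModel model;

        // true => "fair": threads acquire the lock in the order they requested it
        private final ReadWriteLock rwLock = new ReentrantReadWriteLock(true);

        public ReadWriteLockedModel(TrainableModel model) {
            this.model = model;
        }

        // training is a write: no other reader or writer may run concurrently
        public void handle(String text, String category) {
            rwLock.writeLock().lock();
            try {
                model.train(text, category);
            } finally {
                rwLock.writeLock().unlock();   // release even if training throws
            }
        }

        // decoding is a read: any number of readers may run concurrently
        public String classify(String text) {
            rwLock.readLock().lock();
            try {
                return model.classify(text);
            } finally {
                rwLock.readLock().unlock();
            }
        }
    }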
[223000480250] |Whatever action the thread needs to take with the resulting chunking will happen after the try/finally block. [223000480260] |That’s it. [223000480270] |And it works for any of the LingPipe classes that allow training and execution to be interleaved. [223000480280] |Just remember that anything that modifies the object itself is a write operation. [223000480290] |One final note. [223000480300] |LingPipe’s cache class, com.aliasi.util.FastCache, which may be used, for example, with HMMs, is already thread safe. [223000480310] |Thus even though it does writes, it doesn’t need the class’s write lock. [223000480320] |It may be used in models without requring any modification to the above read/write synchronziation pattern. [223000480330] |For instance, we’ve used it in servlets that do spelling correction to handle caches in that multi-threaded environment. [223000490010] |Genes are Generic, People are Specific [223000490020] |Although many reasons are cited for genes and product names being more difficult to find than person names in text, the main linguistic difference is explained by the syntactic and semantic distinction between proper names and common nouns. [223000490030] |Named entity recognizers for people look for names, like “Pervez Musharraf”. [223000490040] |Syntactically, person named entity mentions take the form of proper names. [223000490050] |Semantically, these names refer to a single person. [223000490060] |By convention in English, they’re capitalized. [223000490070] |Note that names may be ambiguous in that the name "John Smith" may be used to refer to many different people. [223000490080] |But each use as a named entity will refer to a single individual. [223000490090] |Of course, a name may be mentioned as a string without referring to an individual; I can say "There are many John Smiths." without referring to one. [223000490100] |This is an example of the use-mention distinction. [223000490110] |Named entity recognizers for genes and proteins look for names of genes or proteins, like P53 or McrBC. [223000490120] |Syntactically, gene mention names take the form of proper names. [223000490130] |Semantically, they refer to a kind, and are thus more like common nouns. [223000490140] |There are many instances of P53 and the use of "P53" doesn’t typically refer to a single sample, much less a single molecule. [223000490150] |Another distinction is that “p53″ can be used as a mass term, like “water”, or a countable term like “glass(es)” (compare “much water” vs. “many glasses”, or “water is everywhere” with “there are glasses everywhere”). [223000490160] |In addition to not being conventionally capitalized in English, a fundamental problem with common nouns is that they can be subclassed or superclassed. [223000490170] |For instance, we see "P53-wt" and "P53-m" for a wild type (natural) or mutated form of P53. [223000490180] |P53 is also found in everything from mice to zebrafish, so the name may be qualified further for species. [223000490190] |Similarly, there is more than one known human mutation of p53, so the particular mutant strain may be identified (often by a catalog number). [223000490200] |The genericity in a gene or protein name is different than the ambiguity in a proper name like "Clinton", which may refer to Bill, Hilary, the fort in New York’s Central Park, or any number of other people, towns, monuments or manufacturers. [223000490210] |It’s not like these are all types of Clintons the way a cricket is a type of insect. 
[223000490220] |Linguistically, you can’t use the word "Clinton" to refer to all of these things the way you can use P53 to refer to all of its variants. [223000490230] |Because gene “names” act like nouns, it makes it particularly difficult to draw the line between plain old common descriptive nouns and more proper-name-like gene names. [223000490240] |For instance, consider the common noun “tumor supressor gene”. [223000490250] |This is a noun like “starting first-baseman”, not a name such as “Jason Giambi”. [223000490260] |Functional descriptors may also be added to gene names, as in “p53 tumor supressor protein”. [223000490270] |Semantically, the “tumor supressor protein” is a non-restrictive modifier, with the whole meaning is “protein that supresses tumors”. [223000490280] |All by itself, the name “p53″ picks out the type of protein. [223000490290] |Syntactically, “protein” is the head (root noun) of the phrase (that is, what’s being referred to is a protein, not a tumor). [223000490300] |With proper names, English syntax requires apositives, such as “Albany, capital of New York State”, or modifiers as in “New York State capital Albany”; with people these may be honorifics, as in the use of “Senator” in “Senator Hilary Clinton” or apositives, as in “Hilary Clinton, Senator from New York.” [223000490310] |Most of the other problems with gene-name finding are shared with person-name finding. [223000490320] |For instance, metonymy is a huge issue in both domains. [223000490330] |The phrase “New York” may refer to the city, the government, or any of the many sports franchises calling the city home. [223000490340] |So you find less ambiguous forms, like “New York Yankees”, used in the same way you find “p53 gene”. [223000490350] |The main difference is that the former is still proper name, whereas the latter is still a common noun syntactically. [223000490360] |As another example, string variation is highly prevalent in company names (e.g. “IBM”, “I.B.M.”, “International Business Machines, Inc”, “Big Blue”, etc.) which also have multiple subsidiaries (e.g. “IBM Europe”). [223000490370] |Names transliterated from foreign languags are particularly problematic (e.g. “Al Jazirah”, “Al Jazeera”, “El Jazira”; ..). [223000490380] |Nicknames are also popular for celebrities (e.g. “Sean Combs”, “P. Diddy”, “Puff Daddy”; or “Reggie Jackson”, “Mr. October”). [223000490390] |Product names, by the way, are just like gene names. [223000490400] |A term such as “Canon EOS” refers to a whole range of different cameras such as “Canon EOS Rebel Xti”. [223000490410] |There are multiple instances of these cameras in the world, just like P53. [223000490420] |And that’s why sites like ShopWiki Search “canon eos” and Froogle Search=”canon eos” have a hard time sorting them out. [223000490430] |In contrast, there’s only one New York City, one New York Yankees, and one Senator Hilary Rodham Clinton. [223000500010] |LingPipe 3.1.2 Released [223000500020] |LingPipe 3.1.2 is a minor revision of 3.1.1. [223000500030] |Here’s the change list: [223000500040] |Medline Citation Parser [223000500050] |Properly handle reprinted-in elements. [223000500060] |TF/IDF Classifier Unknown Features [223000500070] |Added code to ignore unknown features at classification time. [223000500080] |Cached Mention Identity Conditions [223000500090] |The identity conditions have been converted back to those for an Object and the unit tests changed. 
[223000500100] |This was to make the code compatible with our forthcoming cross-document coreference code. [223000500110] |In particular, this will not change the behavior of within-document coreference resolution. [223000500120] |Dynamic Classifier Handler Interfaces [223000500130] |Modified classifer evaluators and the dynamic LM classifiers to implement the classification handler interface. [223000500140] |LM Handler Interfaces [223000500150] |The trainable LM classes now implement the text handler interface. [223000500160] |Static Converter in Chunk Handler Adapter [223000500170] |Added a static method to convert chunkings to BIO-taggings. [223000500180] |Spelling Sizing and Tuning [223000500190] |There’s an added section on sizing and tuning the spelling corector. [223000510010] |Batch vs. Online Learning : Handler, Parser, and Corpus APIs [223000510020] |Online learning assumes that training cases are presented one at a time in a stream and the model may be output at any point. [223000510030] |Batch learning assumes the training cases are all available at once. [223000510040] |Note that this distinction is independent of the generative/discriminitive distinction. [223000510050] |Simple frequency-based discriminitive models could easily be online whereas latent mixture generative models may require batch processing for estimation. [223000510060] |Most of LingPipe’s online trainable models are also online executable (e.g. language models, LM-based classifiers and taggers, etc.). [223000510070] |This means you can interleave training and executing without compiling in between. [223000510080] |Some models may require compilation after being given online training (e.g. the spelling trainer), but that’s only a matter of efficient execution —the decoder is written against the compiled LM implementation, not the generic LM interface. [223000510090] |For online learning, we use the Handler marker interface. [223000510100] |Sub-interfaces define actual handlers, such as the ChunkHandler, which defines a single method handle(Chunking). [223000510110] |The idea is that training data is incrementally provided through the handle() methods of the various handlers. [223000510120] |Although the handle methods of handlers may be called directly, they are typically called by parsers. [223000510130] |The Parser (where H extends Handler) interface is generic, requiring the type of its handler to be specified. [223000510140] |The setHandler(H) and getHandler() methods manage the handlers. [223000510150] |A wide range of parse() methods accept arguments for the input data to be parsed, which may be presented as a file, input source, character array slice or character sequence. [223000510160] |This setup is a classic instance of the visitor pattern. [223000510170] |The individual models with handler methods don’t need to know anything about looping over data. [223000510180] |Without the pattern baggage, this is simply what’s known as a callback. [223000510190] |In Java, the executable code is encapsulated as an object, which usually implements some interface for abstraction from implementation. [223000510200] |In C, you just have a function pointer used as a callback, and there’s no type checking. [223000510210] |The setup should be familiar from XML parsing using SAX. [223000510220] |A SAX parser must implement the XMLReader interface, whereas a SAX (content) handler implements the ContentHandler interface. 
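Here is a schematic of the callback setup just described, with simplified signatures; the real interfaces live in com.aliasi.corpus, the parser is generic in its handler type, and handle() takes a Chunking rather than a plain String.

    public class HandlerParserSketch {

        interface Handler { }                             // marker interface

        interface ChunkHandler extends Handler {
            void handle(String chunking);                 // training-case callback
        }

        // the parser loops over raw data and pushes each training case to its handler
        static class LineParser {
            private ChunkHandler handler;
            public void setHandler(ChunkHandler handler) { this.handler = handler; }
            public ChunkHandler getHandler() { return handler; }
            public void parse(String rawData) {
                for (String line : rawData.split("\n"))
                    handler.handle(line);                 // the callback, i.e. the "event"
            }
        }

        public static void main(String[] args) {
            LineParser parser = new LineParser();
            // an online-trainable model would implement ChunkHandler itself;
            // this handler just prints each training case it is handed
            parser.setHandler(chunking -> System.out.println("training on: " + chunking));
            parser.parse("John/PER ran\nMary/PER swam");
        }
    }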
[223000510230] |This is sometimes called an event-based model, with the callbacks from the reader to the handler being called "events". [223000510240] |The nice thing about LingPipe’s implementation is that many of the gory details are implemented in two abstract classes, InputSourceParser and StringParser. [223000510250] |The input source parser implements the string parsing methods in terms of input source parsing, and the string parser does the reverse. [223000510260] |These are instances of the adapter pattern, which is used elsewhere in LingPipe. [223000510270] |There is also an adapter for the case where the input’s XML, namely the XMLParser subclass of InputSourceParser. [223000510280] |By allowing plug-and-play with parsers, we do not require our data to be in any standard on-disk representation. [223000510290] |We need merely write parsers for representations as they arise. [223000510300] |We prefer not to munge underlying data on disk if we can help it, as it almost invariably becomes a version management nightmare. [223000510310] |So everything’s simple and standard for online parsers. [223000510320] |But how do we handle batch processing? [223000510330] |If you’re used to thinking of files, the idea’s that you’d just run your loops multiple times over the files. [223000510340] |But our parser/handler setup isn’t quite so simple, as there’s no guarantee the data’s ever in files in a simple way. [223000510350] |Instead, we introduce a higher-order abstraction, the Corpus interface, where H extends Handler. [223000510360] |This interface specifies a method VisitCorpus(H), where the argument is a handler. [223000510370] |The idea is that the handler is passed the entire corpus, one argument at a time. [223000510380] |There’s also a convenience abstract implementation for on-disk corpora, DiskCorpus, which is constructed with a parser and disk directory. [223000510390] |Our batch learning methods don’t define handlers, they require corpora to be given to them wholesale. [223000510400] |For instance, the PerceptronClassifier takes a corpus argument in its constructor. [223000510410] |It is then able to call the corpus’s visitCorpus() method as many times as is necessary to achieve convergence or max out the number of allowed visits. [223000510420] |This looks more complicated than it is in practice. [223000510430] |If you’re not convinced, you’ll find further grist for your mill by reading Peter Norvig’s take on "patterns" (pitting Lisp against Java). [223000530010] |But does it do Greek? Unicode, Java Chars and the BMP [223000530020] |Yes, Extended Greek will work just fine in LingPipe. [223000530030] |LingPipe processes at the char level, not at the byte level. [223000530040] |Extended Greek involves characters in the 0x1F00 to 0x1FFF range. [223000530050] |That’s in the basic multilingual plane (BMP) of unicode. [223000530060] |The Ancient Greek numbers at 10140-1018F could be problematic if you need them, depending on what you need to do. [223000530070] |Even these should work OK in most of LingPipe. [223000530080] |The significance of being in the BMP is that the UTF-16 coding and hence Java’s char encoding is transparent. [223000530090] |Specifically, for a code point in the BMP, the UTF-16 is just the two-byte sequence making up the unsigned integer representation of the code point. [223000530100] |For any such code points, Java’s char representation is just the unsigned integer corresponding to the code point. 
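A couple of lines of Java show the distinction; the particular code points are just examples.

    public class BmpDemo {
        public static void main(String[] args) {
            // Extended Greek characters such as U+1F00 are in the BMP,
            // so each one is a single Java char.
            String greek = "\u1F00\u1F01\u1FFE";
            System.out.println(greek.length());                           // 3 chars
            System.out.println(greek.codePointCount(0, greek.length()));  // 3 code points

            // The Ancient Greek numbers around U+10144 are outside the BMP,
            // so each one takes a surrogate pair (two chars) in UTF-16.
            String ancient = new String(Character.toChars(0x10144));
            System.out.println(ancient.length());                             // 2 chars
            System.out.println(ancient.codePointCount(0, ancient.length()));  // 1 code point
        }
    }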
[223000530110] |The significance of all this for LingPipe is that we treat char values in java.lang.String and java.lang.CharSequence as characters. [223000530120] |Thus if we have a 5-gram language model, we treat that as 5 chars. [223000530130] |The problem for code points beyond the BMP is that they take more than one char in Java to represent. [223000530140] |For these code points, Java’s String.length() is not the length of the the string in terms of number of code points, but in terms of number of chars. [223000530150] |Thus it actually overstates the length. [223000530160] |Non-BMP code points can be problematic depending on the tokenizers used, which we define at the char level, not at the code point level. [223000530170] |So you could write a nefarious tokenizer that split surrogate pairs of chars representing non-BMP code points into different tokens. [223000530180] |Our character n-gram tokenizers are nefarious in this way, but even they won’t necessarily be problematic in their intended applications (like classification). [223000530190] |As long as you’re not nefarious, and don’t mind a little inaccuracy in things like per-char cross-entropy rates due to the length, everything should actually work OK. [223000530200] |If you still tokenize on whitespace, you should be OK with non-BMP code points in our classifiers, POS taggers and chunkers. [223000530210] |Places where they won’t make sense is in things like edit distance, where our API is defined in terms of chars, not code points. [223000530220] |And string comparison like TF/IDF over character n-grams will be off in terms of using chars, not code points, so the dimensionality winds up being a little higher. [223000530230] |For applications like classification, this shouldn’t matter. [223000530240] |Of course, it’d be possible to write a proper n-gram code point (vs. char) tokenizer by taking the surrogate pairs into account. [223000530250] |Here’s a little test program that doesn’t involve LingPipe so you (and I) can see what’s up with the encodings: [223000540010] |Why Didn’t You Win the Netflix Prize? [223000540020] |Netflix awarded its first progress prize in the Netflix Prize competition. [223000540030] |The winner was a group from AT&T Research consisting of Robert Bell, Yehuda Koren and Chris Volinsky. [223000540040] |According to Netflix’s press release on the award, the AT&T team spent 2000 hours working on this project. [223000540050] |That’s a full-time, don’t go to any meetings, don’t take any vacation, person-year. [223000540060] |And it pretty much sums up why we weren’t even in the running. [223000540070] |

    How’d AT&T Do It?

    [223000540080] |Here are all the juicy details, in AT&T’s Disclosure Tech Report. [223000540090] |(Props to AT&T for letting them release this stuff.) [223000540100] |I’m not surprised by how they did it. [223000540110] |I doubt anyone who’s studied machine learning in the past ten years will be surprised. [223000540120] |The goal was to reduce prediction variance, so they built an ensemble (a.k.a. committee). [223000540130] |Of course they built an ensemble. [223000540140] |It’s the most widely practiced method for reducing the variance of a predictor. [223000540150] |AT&T used a really big ensemble. [223000540160] |One with 107 reasonably tuned learners, drawn from a half dozen classes or so. [223000540170] |They implemented a rich K nearest neighbors approach, Toronto’s Boltzmann machine approach, singular value decompositions (standard L2 and Lasso-style L1 norms with regularization), some specialized factor models borrowed from Arek Paterek (another Netflix prize participant), and several flavors of regression model. [223000540180] |
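    For anyone who has not seen a blend before, here is the idea in toy form (nothing like AT&T’s actual 107-model blend, which also fits the combination weights): each component model predicts a rating and the ensemble combines the predictions, which is what reduces the variance.

        import java.util.List;

        public class EnsembleDemo {

            interface Predictor {
                double predict(int user, int movie);
            }

            // unweighted average of the component predictions; real blends fit weights
            static double blend(List<Predictor> predictors, int user, int movie) {
                double sum = 0.0;
                for (Predictor p : predictors)
                    sum += p.predict(user, movie);
                return sum / predictors.size();
            }

            public static void main(String[] args) {
                Predictor movieMean = (u, m) -> 3.6;   // made-up per-movie mean model
                Predictor userMean  = (u, m) -> 3.1;   // made-up per-user mean model
                Predictor neighbors = (u, m) -> 4.2;   // made-up nearest-neighbor model
                System.out.println(blend(List.of(movieMean, userMean, neighbors), 42, 7));
            }
        }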

    How Good is a 10% Reduction in Error?

    [223000540190] |The US$1M prize is for a 10% reduction in Netflix’s current system’s error. [223000540200] |The current Netflix system has an (average) error of around 0.951, and the US$1M prize would go to a system with error of around 0.856. [223000540210] |That’s on a 1-5 scale. [223000540220] |That’s an improvement of only 1/10th of a star. [223000540230] |The subsequent error, +/- 0.86 stars, is relatively large. [223000540240] |Because there are only 4 stars of range (5-1=4), the progress prize is for a system with 21.4% estimation error overall (.85/4), down from the 23.8% or so of Netflix’s (stripped down) Cinematch algorithm. [223000540250] |That’s pretty much the difference between 3 stars (liked it) and 2 stars (didn’t like it). [223000540260] |Put another way, if Netflix used a 0-100 scale, these algorithms would be more than 20 points off in their predictions on average. [223000540270] |(It’s actually a little more subtle than that, because they use root mean square error, not mean absolute error, which reduces the effect of small errors (less than 1 star) but increases the effect of larger errors, but the point still applies.) [223000540280] |The naive system that predicts the average rating for a user and average rating for the movie has something like a 1.1 error. [223000540290] |It’s very difficult to squeak out performance on such noisy data. [223000540300] |It may simply not be possible to win the grand prize due to inherent variability of people’s ratings. [223000540310] |
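    Here is the root-mean-square versus mean-absolute point as a short calculation on some made-up per-rating errors.

        public class RmseVsMae {
            public static void main(String[] args) {
                double[] errors = { 0.5, 0.5, 0.5, 3.0 };   // hypothetical per-rating errors
                double sumAbs = 0.0, sumSq = 0.0;
                for (double e : errors) {
                    sumAbs += Math.abs(e);
                    sumSq += e * e;
                }
                System.out.println("MAE  = " + sumAbs / errors.length);              // 1.125
                System.out.println("RMSE = " + Math.sqrt(sumSq / errors.length));    // ~1.56
            }
        }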

    Will Netflix Use It?

    [223000540320] |The AT&T approach is unlikely to be practical for Netflix. [223000540330] |Their approach involved combining 100s of predictors, each of which is very slow to either run or train, and most of which can only be run as-is on static data collections. [223000540340] |

    Why’d AT&T (or Anyone) Participate?

    [223000540350] |In the machine learning community, it’s common to participate in bakeoffs, even if there are no prizes at all. [223000540360] |For instance, the 2007 KDD Cup used the Netflix data to run a bakeoff to predict which movies a user would rent and how many times a given title would be rented. [223000540370] |These are perhaps even more pressing problems for Netflix. [223000540380] |The research community’s happy to play along because we’re starved for real data on which to test our models. [223000540390] |The researchers even published their results. [223000540400] |The AT&T system was only possible, as they themselves acknowledge, because several other teams published their results, which were then incorporated as predictors into AT&T’s system. [223000540410] |

    Would LingPipe Help?

    [223000540420] |We’re about to roll out the next release of LingPipe, which contains an implementation of one of the factor models: (ridge) regularized singular value decomposition. [223000540430] |This model alone gets you to 0.91 error on Netflix, but like the others, is both static and expensive to compute (though runtime predictions are fast). [223000540440] |I finally sorted out the regularized least squares stochastic gradient descent algorithm by following an implementation published through the Netflix prize forums (full citations forthcoming in the Javadoc and tutorial) and a stack of books with chapters on regression (e.g. Hastie et al., Bishop, Gelman et al., etc.). [223000540450] |You’ll also see that AT&T used some text features in comparing films in their KNN classifiers. [223000540460] |There’s a wide range of other features (like actors, directors, rating, length, country of origin, etc.) that could be extracted, aligned with the film, and used. [223000540470] |I also implemented an EM-estimated variant of probabilistic latent semantic analysis (pLSA), which only achieved a 0.95 error. [223000540480] |That may go into LingPipe at some time, as we’re working on general factor models (such as latent Dirichlet and EM clustering) for text and document clustering. [223000540490] |I think it’d work better with Gibbs sampling; the EM really gets stuck. [223000550010] |LingPipe 3.2.0 Released [223000550020] |The latest release of LingPipe is LingPipe 3.2.0. [223000550030] |This release replaces LingPipe 3.1.2, with which it is backward compatible with one exception (see next section). [223000550040] |

    Backward Incompatibility

    [223000550050] |The p-value methods in stats.BinomialDistribution and stats.MultinomialDistribution have been removed. [223000550060] |The javadoc for the classes includes the code for the removed implementations, which were based on the Jakarta Commons Math library. [223000550070] |

    Zero External Dependencies

    [223000550080] |The reason we removed the p-value methods is that they were the last remaining piece of functionality in LingPipe that required external dependencies. [223000550090] |As of this release, there is no longer a dependency on the Jakarta Commons Math library or the Apache Xerces XML libraries. [223000550100] |The latter functionality has been folded into Java itself. [223000550110] |

    New Features

    [223000550120] |Singular Value Decomposition [223000550130] |We’ve included an implementation of singular value decomposition. [223000550140] |It uses stochastic gradient descent, so is scalable to large matrices and allows partial matrix input (matrices with some unknown values). [223000550150] |The implementation is encapsulated in a single class, matrix.SvdMatrix, the javadoc of which explains the mathematics of SVD and our implementation. [223000550160] |SGML Normalizer [223000550170] |We were tired of writing one-off SGML entity normalizers, so we imported a class representing the entire standard, util.Sgml. [223000550180] |Line Tokenizer Factory [223000550190] |To support our work on document parsing (text extraction, bibliography extraction, e-mail signature extraction, etc.), we’ve included a new tokenizer, tokenizer.LineTokenizerFactory, that produces tokens based on lines of text. [223000550200] |It is used in the sandbox project citationEntities [223000550210] |Soundex Tokenizer Filter [223000550220] |We’ve provided an implementation of Soundex as a tokenizer filter in tokenizer.SoundexFilterTokenizer. [223000550230] |Soundex reduces strings of words to a simplified, (English) pronunciation-based representation. [223000550240] |We were mainly motivated by exploring features in our feature-based classifiers. [223000550250] |The javadoc contains a complete description of the Soundex algorithm. [223000550260] |Distances and Proximities [223000550270] |We made sure that if one of the interfaces util.Distance or util.Proximity was implemented by an object, so was the other. [223000550280] |This allows them all to be plugged into distance-based clusterers and classifiers (e.g. k nearest neighbors). [223000550290] |
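    To make the stochastic gradient descent point above concrete, here is a bare sketch of the kind of update such a factorization performs on the observed cells of a partial matrix; this is not the matrix.SvdMatrix API (see its javadoc and the SVD tutorial for the real thing).

        public class SgdSvdSketch {
            public static void main(String[] args) {
                int numRows = 4, numCols = 5, order = 2;       // tiny factorization
                double learningRate = 0.01, regularization = 0.02;
                double[][] left = randomFactors(numRows, order);
                double[][] right = randomFactors(numCols, order);

                // observed cells of the partial matrix: {row, col, value}
                double[][] cells = { {0, 1, 4.0}, {1, 3, 2.0}, {2, 0, 5.0}, {3, 4, 3.0} };

                for (int epoch = 0; epoch < 200; ++epoch) {
                    for (double[] cell : cells) {
                        int i = (int) cell[0], j = (int) cell[1];
                        double value = cell[2];
                        double error = value - dot(left[i], right[j]);
                        for (int k = 0; k < order; ++k) {
                            double l = left[i][k], r = right[j][k];
                            // move each factor along the gradient of the squared error,
                            // shrunk toward zero by the (ridge) regularization term
                            left[i][k]  += learningRate * (error * r - regularization * l);
                            right[j][k] += learningRate * (error * l - regularization * r);
                        }
                    }
                }
                // prints a fitted value close to the observed 4.0
                System.out.println("fitted (0,1): " + dot(left[0], right[1]));
            }

            static double dot(double[] x, double[] y) {
                double sum = 0.0;
                for (int k = 0; k < x.length; ++k) sum += x[k] * y[k];
                return sum;
            }

            static double[][] randomFactors(int n, int order) {
                java.util.Random random = new java.util.Random(42);
                double[][] factors = new double[n][order];
                for (int i = 0; i < n; ++i)
                    for (int k = 0; k < order; ++k)
                        factors[i][k] = 0.1 * random.nextDouble();
                return factors;
            }
        }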

    Tutorials

    [223000550300] |Singular Value Decomposition [223000550310] |The SVD Tutorial walks through the applications to latent semantic indexing and basic n-gram smoothing. [223000550320] |It also covers all the tuning parameters (learning rate and annealing, initial values, early stopping, regularization, etc.). [223000550330] |Word Sense Disambiguation Tutorial [223000550340] |The Word Sense Tutorial provides details on creating a complete Senseval word sense disambiguation system using contextual classification. [223000550350] |Word sense disambiguation is the problem of determining which dictionary entry (for a given dictionary) corresponds to a given token of a word in a text. [223000560010] |Alias-i is Hiring [223000560020] |Alias-i, the makers of the LingPipe suite of NLP tools, is hiring one or two developers to work on: [223000560030] |
  • Natural language processing API development
  • [223000560040] |
  • Natural language processing application development
  • [223000560050] |
  • Enterprise architecture design and integration
  • [223000560060] |
  • User interface design and coding
  • [223000560070] |We need one full time equivalent position devoted to our Phase II Small Business Innovation Research (SBIR) grant from the U.S. National Institutes of Health (NIH). [223000560080] |Our goal is a topic- and entity-faceted search engine making heavy use of query refinement and topic discovery, and suited to queries derived from high throughput lab experiments. [223000560090] |We are collaborating with researchers at the Harvard University Medical School, who have real biology problems to solve. [223000560100] |We need part of a full time position devoted to improving our “source released” commercial product, LingPipe, a suite of Java NLP tools. [223000560110] |This will involve expanding current packages as well as developing new functionality. [223000560120] |A strong background in machine learning and statistical methods is a big plus. [223000560130] |This project overlaps with the NIH project. [223000560140] |We need at least half of a full time equivalent position devoted to enterprise application development and integration. [223000560150] |Our customers range from from small startups to the Fortune 500. [223000560160] |In 2007, we have delivered product licenses and consulting for structured database deduplication, part-of-speech tagging, named entity extraction and database coreference, classification, and spell checking. [223000560170] |We’re located in a sunny loft office in Brooklyn, New York, just across the East River from downtown Manhattan, and convenient to public transportion. [223000560180] |We are not considering full-time telecommuters. [223000560190] |If you’re interested, please send us an electronic copy of your resume, a cover note briefly explaining how your skills and interests match our requirements, and a description of your visa status for working in the United States. [223000560200] |No headhunters, please. [223000560210] |Contact: Breck Baldwin [223000560220] |E-mail: jobs@alias-i.com [223000560230] |Home: http://www.alias-i.com/ [223000560240] |LingPipe: http://www.alias-i.com/lingpipe [223000560250] |About: http://www.alias-i.com/lingpipe/web/about.html [223000570010] |(Junior) Varsity Interview: String Reverse [223000570020] |We’re hiring, so interviews have been on my mind. [223000570030] |I’m cross-posting what was mostly a response to reading Stephans Blog [sic], which was itself a riff on Joel on Software’s Guerilla Guide to Interviewing. [223000570040] |Both ask us to look at the string reverse problem. [223000570050] |Apparently, if you are interviewing at a tech company, at some point along the way you’re going to be asked to reverse a string. [223000570060] |Or a sequence of tokens. [223000570070] |Really. [223000570080] |No, really, I’m not kidding. [223000570090] |All of the answers I’ve seen are pure junior varsity when it comes to string processing. [223000570100] |Before reversing a “string”, we need to get clear on what we want. [223000570110] |At one level, a Java String object is just an array of chars. [223000570120] |At another level, it represents a sequence of unicode code points. [223000570130] |The tension arises from the fact that String(char[]) lets you construct a string with a sequence of chars that don’t correspond to valid unicode code points. [223000570140] |This tension was ratcheted up in the 1.5 JDK when they moved to Unicode 4.0. [223000570150] |In the 1.5 JDK, Sun changed the behavior of the StringBuffer.reverse() method in a way that was not backward compatible with 1.4. 
[223000570160] |That is, there are StringBuffer instances for which reverse() in 1.4 returns a different value than in 1.5. [223000570170] |The 1.5 version of reverse() is sensitive to surrogate pairs (unicode code points requiring more than 16 bits, and hence more than one char, in UTF-16). [223000570180] |In 1.5, both java.lang.StringBuilder and java.lang.StringBuffer use the implementation of reverse() from their common superclass, java.lang.AbstractStringBuilder. [223000570190] |Here’s the first paragraph of doc for java.lang.StringBuffer.reverse() from JDK 1.5: [223000570200] |“Causes this character sequence to be replaced by the reverse of the sequence. [223000570210] |If there are any surrogate pairs included in the sequence, these are treated as single characters for the reverse operation. [223000570220] |Thus, the order of the high-low surrogates is never reversed.” [223000570230] |And here’s the first paragraph of doc from java.lang.StringBuffer.reverse() from JDK 1.4: [223000570240] |“The character sequence contained in this string buffer is replaced by the reverse of the sequence.” [223000570250] |Following Stephan’s suggestion to use the built-in has either a good or bad side effect. [223000570260] |Moving from 1.4 to 1.5 either breaks backward compatibility for the string as char sequence representation, or appropriately handles unicode 5.0 in the string as sequence of code points representation. [223000570270] |Extra credit 1: Recursion won’t work because it’ll blow out the stack if we’re using Sun’s JDKs, which (at least so far) don’t perform tail recursion optimization (a kind of last call optimization). [223000570280] |Extra credit 2: The exception thrown when trying to reverse a null string should be a null pointer exception. [223000570290] |That’s how Sun codes the JDK itself (see, e.g., the java.lang.String.String(String) constructor). [223000570300] |It’s a runtime exception because it’s a coding error to send reverse(String) a null string (assuming you want behavior like your call to StringBuffer.reverse()). [223000570310] |It should be a null pointer exception, as that’ll lead you right to the problem while debugging. [223000570320] |Do I get a callback for a second interview?
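And since no interview answer is complete without code, here is a sketch of the difference discussed above (the supplementary code point is arbitrary): reversing by chars scrambles a surrogate pair, while StringBuilder.reverse() in 1.5 and later keeps each pair intact.

    public class ReverseDemo {

        // 1.4-style reverse: treats the string as a plain char array
        static String reverseByChar(String s) {
            char[] cs = s.toCharArray();
            for (int i = 0, j = cs.length - 1; i < j; ++i, --j) {
                char tmp = cs[i];  cs[i] = cs[j];  cs[j] = tmp;
            }
            return new String(cs);
        }

        public static void main(String[] args) {
            // 'a' + a supplementary (non-BMP) code point + 'b' = 4 chars, 3 code points
            String s = "a" + new String(Character.toChars(0x1D50A)) + "b";

            String byChar = reverseByChar(s);                              // splits the pair
            String byBuilder = new StringBuilder(s).reverse().toString();  // keeps the pair

            System.out.println(byChar.codePointCount(0, byChar.length()));        // 4
            System.out.println(byBuilder.codePointCount(0, byBuilder.length()));  // 3
        }
    }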