[224002270010] |ICML 2010 Retrospective [224002270020] |Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. [224002270030] |I'm very glad I spent some time afterwards out of Haifa, though. [224002270040] |Overall, I saw a lot of really good stuff. [224002270050] |The usual caveats apply (I didn't see everything, it's a biased sample, blah blah blah). [224002270060] |Here are some things that stood out: [224002270070] |Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). [224002270080] |This was probably one of my favorite papers of the conference, even though I had learned a bit about the work when I visited UIUC a few months ago. [224002270090] |Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. [224002270100] |But then you also have a bunch of parallel data. [224002270110] |What can you do? [224002270120] |You can turn the parallel data into a classification problem: are these two sentences translations of each other? [224002270130] |You can pair random sentences to get negative examples. [224002270140] |A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! [224002270150] |(Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) [224002270160] |Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? [224002270170] |Ming-Wei reassured him that it wasn't. [224002270180] |In thinking about it after the fact, I wonder if it is actually really important that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. [224002270190] |I suspect this might be the case. [224002270200] |A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). [224002270210] |The idea here is that without using citation structure, you can model influence in large document collections. [224002270220] |The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. [224002270230] |The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). [224002270240] |It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce each other. [224002270250] |That is, you might regularly cite the original EM paper even if your paper has almost nothing to do with it. [224002270260] |(The example from the talk was the Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.) [224002270270] |Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Siddiqi, G. Gordon, A. Smola). [224002270280] |This received one of the best paper awards. [224002270290] |While I definitely liked this paper, actually what I liked more was that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). [224002270300] |That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T.
Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. [224002270310] |You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! [224002270320] |In the case that the data was actually generated by an HMM, you get good performance and good guarantees. [224002270330] |Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. [224002270340] |That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. [224002270350] |At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. [224002270360] |(Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.) [224002270370] |Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). [224002270380] |This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. [224002270390] |(A restaurant is just the set of counts for a given context.) [224002270400] |You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). [224002270410] |You can do this intelligently (via a heuristic) or just stupidly: pick them at random. [224002270420] |Turns out it doesn't matter. [224002270430] |The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. [224002270440] |The chance that something that occurs a lot keeps getting picked to be thrown out is low. [224002270450] |There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being! [224002270460] |Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). [224002270470] |Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). [224002270480] |The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. [224002270490] |You now have a dual minimization, which you can solve simultaneously with the weight minimization, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. [224002270500] |Moreover, you can "warm start" your inference for the next round. [224002270510] |It's a very nice idea. [224002270520] |I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping for. [224002270530] |However, most of the graphs they were using weren't very large, so maybe as you move toward harder problems, the speed-ups will be more obvious. [224002270540] |Deep learning via Hessian-free optimization (J. Martens).
[224002270550] |Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. [224002270560] |The "obvious" take-away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). [224002270570] |But the less obvious take-away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. [224002270580] |For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propagation. [224002270590] |The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). [224002270600] |But if James' story holds up, this seems to not be true: he doesn't do any clever initialization and still finds good local optima! [224002270610] |A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). [224002270620] |Yes, that's right: a vision paper. [224002270630] |Why should you read this paper? [224002270640] |Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: SIFT features), you get something that looks like a multiset of features (hrm.... sounds familiar). [224002270650] |These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. [224002270660] |(Sound familiar yet?) [224002270670] |The analysis asks: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes? [224002270680] |Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ratio of between-class variances to overall variances) to see how these regimes work out. [224002270690] |And they look basically like they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped). [224002270700] |Some other papers that I liked, but don't want to write too much about: [224002270710] |
  • Learning Programs: A Hierarchical Bayesian Approach (P. Liang, M. Jordan, D. Klein). [224002270720] |Structured models over programs are very hard; this paper gives one approach to modeling them.
  • Budgeted Nonparametric Learning from Data Streams (R. Gomes, A. Krause). [224002270740] |Shows that a clustering problem and a Gaussian process problem are submodular, goes from there.
  • Internal Rewards Mitigate Agent Boundedness (J. Sorg, S. Singh, R. Lewis). [224002270760] |Exactly what the title says.
  • The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data (J. Vogt, S. Prabhakaran, T. Fuchs, V. Roth). [224002270780] |Been wanting to do something like this for a while, but they did it better than I would have!
  • Sparse Gaussian Process Regression via L_1 Penalization (F. Yan, Y. Qi). [224002270800] |Very interesting way to get sparsity in a GP basically by changing your approximating distribution.
[224002270810] |Some papers that other people said they liked were: [224002270820] |
  • Multi-Class Pegasos on a Budget (Z. Wang, K. Crammer, S. Vucetic)
  • Risk minimization, probability elicitation, and cost-sensitive SVMs (H. Masnadi-Shirazi, N. Vasconcelos)
  • Asymptotic Analysis of Generative Semi-Supervised Learning (J. Dillon, K. Balasubramanian, G. Lebanon)
[224002270850] |Hope to see you at ACL! [224002280010] |ACL 2010 Retrospective [224002280020] |ACL 2010 finished up in Sweden a week ago or so. [224002280030] |Overall, I enjoyed my time there (the local organization was great, though I think we got hit with unexpected heat, so those of us who didn't feel like booking a room at the Best Western -- hah! why would I have done that?! -- had no A/C and my room was about 28-30 Celsius every night). [224002280040] |But you don't come here to hear about sweltering nights, you come to hear about papers. [224002280050] |My list is actually pretty short this time. [224002280060] |I'm not quite sure why that happened. [224002280070] |Perhaps NAACL sucked up a lot of the really good stuff, or I went to the wrong sessions, or something. [224002280080] |(Though my experience was echoed by a number of people (n=5) I spoke to after the conference.) Anyway, here are the things I found interesting. [224002280090] |
  • Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, by Matthew Gerber and Joyce Chai (this was the Best Long Paper award recipient). [224002280100] |This was by far my favorite paper of the conference. [224002280110] |For all you students out there (mine included!), pay attention to this one. [224002280120] |It was great because they looked at a fairly novel problem, in a fairly novel way, put clear effort into doing something (they annotated a bunch of data by hand), developed features that were significantly more interesting than the usual off-the-shelf set, and got impressive results on what is clearly a very hard problem. [224002280130] |Congratulations to Matthew and Joyce -- this was a great paper, and the award is highly deserved. [224002280140] |
  • Challenge Paper: The Human Language Project: Building a Universal Corpus of the World’s Languages, by Steven Abney and Steven Bird. [224002280150] |Basically this would be awesome if they can pull it off -- a giant structured database with stuff from tons of languages. [224002280160] |Even just having tokenization in tons of languages would be useful for me.
  • Extracting Social Networks from Literary Fiction, by David Elson, Nicholas Dames and Kathleen McKeown. [224002280180] |(This was the IBM best student paper.) [224002280190] |Basically they construct networks of characters from British fiction and try to analyze some literary theories in terms of those networks, and find that there might be holes in the existing theories. [224002280200] |My biggest question, as someone who's not a literary theorist, is why did those theories exist in the first place? [224002280210] |The analysis was over 80 or so books; surely literary theorists have read and pondered all of them.
  • Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish, by Reyyan Yeniterzi and Kemal Oflazer. [224002280230] |You probably know that I think translating morphology and translating out of English are both interesting topics, so it's perhaps no big surprise that I liked this paper. [224002280240] |The other thing I liked about this paper is that they presented things that worked, as well as things that might well have worked but didn't.
  • Learning Common Grammar from Multilingual Corpus, by Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawada. [224002280260] |I wouldn't go so far as to say that I thought this was a great paper, but I would say there is the beginning of something interesting here. [224002280270] |They basically learn a coupled PCFG in Jenny Finkel's hierarchical-Bayes style, over multiple languages. [224002280280] |The obvious weakness is that languages don't all have the same structure. [224002280290] |If only there were an area of linguistics that studies how they differ.... [224002280300] |(Along similar lines, see Phylogenetic Grammar Induction, by Taylor Berg-Kirkpatrick and Dan Klein, which has a similar approach/goal.)
  • Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation, by Michael Bloodgood and Chris Callison-Burch. [224002280320] |The "trend" referenced in the title is that active learning always asymptotes depressingly early. [224002280330] |They have turkers translate bits of sentences in context (i.e., in a whole sentence, translate the highlighted phrase) and get a large bang-for-the-buck. [224002280340] |Right now they're looking primarily at out-of-vocabulary stuff, but there's a lot more to do here.
[224002280350] |A few papers that I didn't see, but other people told me good things about: [224002280360] |
  • “Was It Good? [224002280370] |It Was Provocative.” [224002280380] |Learning the Meaning of Scalar Adjectives, by Marie-Catherine de Marneffe, Christopher D. Manning and Christopher Potts.
  • Unsupervised Ontology Induction from Text, by Hoifung Poon and Pedro Domingos.
  • Improving the Use of Pseudo-Words for Evaluating Selectional Preferences, by Nathanael Chambers and Daniel Jurafsky.
  • Learning to Follow Navigational Directions, by Adam Vogel and Daniel Jurafsky.
  • Compositional Matrix-Space Models of Language, by Sebastian Rudolph and Eugenie Giesbrecht. [224002280430] |(This was described to me as "thought provoking" though not necessarily more.)
  • Top-Down K-Best A* Parsing, by Adam Pauls, Dan Klein and Chris Quirk.
[224002280450] |At any rate, I guess that's a reasonably long list. [224002280460] |There were definitely good things, but with a fairly heavy tail. [224002280470] |If you have anything you'd like to add, feel free to comment. [224002280480] |(As an experiment, I've turned comment moderation on as a way to try to stop the spam... [224002280490] |I'm not sure I'll do it indefinitely; I hadn't turned it on before because I always thought/hoped that Google would just start doing spam detection and/or putting hard captchas up or something to try to stop spam, but sadly they don't seem interested.) [224002290010] |Why Discourse Structure? [224002290020] |I come from a strong lineage of discourse folks. [224002290030] |Writing a parser for Rhetorical Structure Theory was one of the first class projects I had when I was a grad student. [224002290040] |Recently, with the release of the Penn Discourse Treebank, there has been a bit of a flurry of interest in this problem (I had some snarky comments right after ACL about this). [224002290050] |I've also talked about why this is a hard problem, but never really about why it is an interesting problem. [224002290060] |My thinking about discourse has changed a lot over the years. [224002290070] |My current thinking about it is in an "interpretation as abduction" sense. [224002290080] |(And I sincerely hope all readers know what that means... if not, go back and read some classic papers by Jerry Hobbs.) [224002290090] |This is a view I've been rearing for a while, but I finally started putting it into words (probably mostly Jerry's words) in a conversation at ACL with Hoifung Poon and Joseph Turian (I think it was Joseph... my memory fades quickly these days :P). [224002290100] |This view is that discourse is that thing that gives you an interpretation above and beyond whatever interpretations you get from a sentence. [224002290110] |Here's a slightly refined version of the example I came up with on the fly at ACL: [224002290120] |
  • I only like traveling to Europe. [224002290130] |So I submitted a paper to ACL.
  • I only like traveling to Europe. [224002290150] |Nevertheless, I submitted a paper to ACL.
[224002290160] |What does the hearer (H) infer from these sentences? [224002290170] |Well, if we look at the sentences on their own, then H infers something like Hal-likes-travel-to-Europe-and-only-Europe, and H infers something like Hal-submitted-a-paper-to-ACL. [224002290180] |But when you throw discourse in, you can derive two additional bits of information. [224002290190] |In example (1), you can infer ACL-is-in-Europe-this-year and in (2) you can infer the negation of that. [224002290200] |Pretty amazing stuff, huh? [224002290210] |Replacing a "so" with a "nevertheless" completely changes this interpretation. [224002290220] |What does this have to do with interpretation as abduction? [224002290230] |Well, we're going to assume that this discourse is coherent. [224002290240] |Given that assumption, we have to ask ourselves: in (1), what do we have to assume about the world to make this discourse coherent? [224002290250] |The answer is that you have to assume that ACL is in Europe. [224002290260] |And similarly for (2). [224002290270] |Of course, there are other things you could assume that would make this discourse coherent. [224002290280] |For (1), you could assume that I have a rich benefactor who likes ACL submissions and will send me to Europe every time I submit something to ACL. [224002290290] |For (2), you could assume that I didn't want my paper to get in, but I wanted a submission to get reviews, and so I submitted a crappy paper. [224002290300] |Or something. [224002290310] |But these fail the Occam's Razor test. [224002290320] |Or, perhaps they are a priori simply less likely (i.e., you have to assume more to get the same result). [224002290330] |Interestingly, I can change the interpretation of (2), for instance, by adding a third sentence to the discourse: "I figured that it would be easy to make my way to Europe after going to Israel." [224002290340] |Here, we would abduce that ACL is in Israel, and that I'm willing to travel to Israel on my way to Europe. [224002290350] |For you GOFAI folks, this would be something like non-monotonic reasoning. [224002290360] |Whenever I talk about discourse to people who don't know much about it, I always get this nagging sense of "yes, but why do I care that you can recognize that sentence 4 is background to sentence 3, unless I want to do summarization?" [224002290370] |I hope that this view provides some alternative answer to that question. [224002290380] |Namely, that there's some information you can get from sentences, but there is additional information in how those sentences are glued together. [224002290390] |Of course, one of the big problems we have is that we have no idea how to represent sentence-level interpretations, or at least only some ideas and no way to get there in the general case. [224002290400] |In the sentence-level case, we've seen some progress recently in terms of representing semantics in a sort of substitutability manner (ala paraphrasing), which is nice because the representation is still text. [224002290410] |One could ask if something similar might be possible at a discourse level. [224002290420] |Obviously you could paraphrase discourse connectives, but that's missing the point. [224002290430] |What else could you do? [224002300010] |Multi-task learning: should our hypothesis classes be the same? [224002300020] |It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks.
[224002300030] |In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. [224002300040] |In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used. [224002300050] |I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple of papers that challenged it, albeit indirectly. [224002300060] |One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. [224002300070] |They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. [224002300080] |What they see is that different features are useful for spotting entity spans, and for labeling entity types. [224002300090] |That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly). [224002300100] |Now, to some degree this is not surprising. [224002300110] |If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. [224002300120] |On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space? [224002300130] |This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do? [224002300140] |One way is to punt: you can just concatenate your feature vectors and cross your fingers. [224002300150] |Or, more nuanced, you can have some set of shared features and some set of features unique to each task. [224002300160] |This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data. [224002300170] |An alternative approach is to let the two classifiers "talk" via unlabeled data. [224002300180] |Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree." [224002300190] |A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. [224002300200] |That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. [224002300210] |Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. [224002300220] |This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.
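To make that last idea a bit more concrete, here is a minimal sketch of the simplest two-view version: plain linear CCA between two hypothetical task-specific feature representations of the same data points. Everything below (the feature maps, dimensions, and the step of appending the projections back onto each task's features) is made up for illustration; the 4-way kernel variant described above would additionally include the two label views and replace dot products with task-specific kernels.

    # Two tasks see the same instances through different feature maps; CCA looks for
    # directions in the two views that are maximally correlated, i.e., a shared subspace.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.RandomState(0)
    n = 500                      # instances labeled for both tasks
    X_task1 = rng.randn(n, 50)   # stand-in features for task 1 (say, spotting entity spans)
    X_task2 = rng.randn(n, 80)   # stand-in features for task 2 (say, labeling entity types)

    cca = CCA(n_components=10)
    cca.fit(X_task1, X_task2)
    Z1, Z2 = cca.transform(X_task1, X_task2)   # projections into the shared 10-d space

    # One could now append Z1 (resp. Z2) to each task's original features and train the
    # two models in their own, different hypothesis classes, hoping the shared directions
    # carry whatever signal the tasks have in common.

Whether those shared directions actually help is, of course, exactly the empirical question raised above.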
[224002300230] |I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption. [224002310010] |Readers kill blogs? [224002310020] |I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so. [224002310030] |
  • On the one hand, yesterday, Aleks stated that the main reason he blogs is to see comments. [224002310040] |(Similarly, Lance also thinks comments are a very important part of having an "open" blog.)
  • On the other hand, people are more and more moving to systems like Google Reader, as re-blogged by Fernando, also yesterday.
[224002310060] |I actually completely agree with both points. [224002310070] |The problem is that I worry that they are actually fairly opposed. [224002310080] |I comment much less on other people's blogs now that I use reader, because the 10 second overhead of clicking on the blog, being redirected, entering a comment, blah blah blah, is just too high. [224002310090] |Plus, I worry that no one (except the blog author) will see my comment, since most readers don't (by default) show comments in with posts. [224002310100] |Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier. [224002310110] |That is, unless they want to lose out to tweets! [224002310120] |Until then, I'd like to encourage people to continue commenting here. [224002320010] |Finite State NLP with Unlabeled Data on Both Sides [224002320020] |(Can you tell, by the recent frequency of posts, that I'm trying not to work on getting ready for classes next week?) [224002320030] |[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.] [224002320040] |The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. [224002320050] |Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :). [224002320060] |The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. [224002320070] |For instance, to do speech-to-speech transliteration from English to Japanese, you might build a component system like: [224002320080] |English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM [224002320090] |You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B). [224002320100] |[Plus, of course, if you're missing any of these, EM comes to your rescue!] [224002320110] |Let's take a simpler example, though the point I want to make applies to long chains, too. [224002320120] |Suppose I want to just do translation from French to English. [224002320130] |I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). [224002320140] |For the E2F transducer, I'll need parallel English/French text, of course. [224002320150] |The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences. [224002320160] |But, presumably, I also have lots of monolingual French text. [224002320170] |Forgetting for a moment the math, which seems to suggest that this can't help me, we can ask: why should this help? [224002320180] |Well, it probably won't help with my English language model, but it should be able to help with my transducer. [224002320190] |Why?
[224002320200] |Because my transducer is supposed to give me p(f|e). [224002320210] |If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening. [224002320220] |More generally, I feel like the following two operations should probably give roughly the same probabilities: [224002320230] |
  • Drawing an English sentence from the language model p(e).
  • Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.
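To make the two operations above concrete, here is a toy sketch in which everything is made up: three "English sentences", two "French sentences", a toy LM p(e) and a toy transducer q(f|e) (none of this comes from any real system). It builds the distribution over e induced by operation 2 by brute force and compares it, via KL divergence, to p(e) from operation 1.

    # Operation 1: draw e ~ p(e).  Operation 2: pick f uniformly from a tiny
    # "GigaFrench", then draw e ~ p(e|f), where p(e|f) is proportional to
    # p(e) * q(f|e) by Bayes' rule.  Compare the two resulting distributions over e.
    import math

    p_e = {"e1": 0.5, "e2": 0.3, "e3": 0.2}      # toy English LM
    q_f_given_e = {                              # toy transducer q(f|e)
        "e1": {"f1": 0.9, "f2": 0.1},
        "e2": {"f1": 0.2, "f2": 0.8},
        "e3": {"f1": 0.5, "f2": 0.5},
    }
    giga_french = ["f1", "f2"]                   # stand-in monolingual French data

    def p_e_given_f(f):
        # p(e|f) proportional to p(e) * q(f|e), renormalized over the toy English vocabulary
        scores = {e: p_e[e] * q_f_given_e[e][f] for e in p_e}
        z = sum(scores.values())
        return {e: s / z for e, s in scores.items()}

    # Distribution over e induced by operation 2: average p(e|f) over the French sample.
    op2 = {e: 0.0 for e in p_e}
    for f in giga_french:
        posterior = p_e_given_f(f)
        for e in p_e:
            op2[e] += posterior[e] / len(giga_french)

    kl = sum(p_e[e] * math.log(p_e[e] / op2[e]) for e in p_e)
    print("operation 1 (LM)    :", p_e)
    print("operation 2 (via q) :", {e: round(v, 3) for e, v in op2.items()})
    print("KL(op1 || op2)      =", round(kl, 4))

A transducer that puts reasonable mass on the monolingual French data keeps this divergence small, which is the intuition behind the objective discussed next.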
[224002320250] |If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that has low KL divergence between 1 and 2, above. [224002320260] |If you work through the (short) math, and throw away terms that are independent of the transducer, then you end up wanting to minimize [ sum_e p(e) log sum_f q(f|e) ]. [224002320270] |Here, the sum over f is a finite sum over GigaFrench, and the sum over e is an infinite sum over positive probability English sentences given by the English LM p(e). [224002320280] |One could then apply something like posterior regularization (Kuzman Ganchev, Graça and Taskar) to do the learning. [224002320290] |There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models. [224002320300] |Anyway, it seems like an interesting idea. [224002320310] |I'm definitely not aware of anyone who has tried it. [224002330010] |Calibrating Reviews and Ratings [224002330020] |NIPS decisions are going out soon, and then we're done with submitting and reviewing for a blessed few months. [224002330030] |Except for journals, of course. [224002330040] |If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :). [224002330050] |One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. [224002330060] |Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. [224002330070] |Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. [224002330080] |(For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. [224002330090] |I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. [224002330100] |Perhaps I should be nicer; I know CS tends to be harder on itself than other fields.) [224002330110] |As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration. [224002330120] |There's also the issue of areas. [224002330130] |Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). [224002330140] |For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? [224002330150] |How do you calibrate across areas, other than by some form of affirmative action for less-represented areas? [224002330160] |A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. [224002330170] |The example he gives is very nice. [224002330180] |If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars, and a 79% recommendation.
[224002330190] |On the other hand, if you look up In'N'Out Burger, an LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation. [224002330200] |So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. [224002330210] |And we expect this to work?! [224002330220] |Probably the main issue here is calibrating for expectations. [224002330230] |As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews. [224002330240] |If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised. [224002330250] |If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been). [224002330260] |One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). [224002330270] |You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. [224002330280] |For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is. [224002330290] |(Anticipating comments, I don't think this is just an "aspect" issue. [224002330300] |I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.) [224002330310] |I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at. [224002330320] |(For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) [224002330330] |There is maybe a situation-dependent order, and the distance to the hotel, or "$" rating, or area classes are heuristics for describing this "situation." [224002330340] |But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here. [224002340010] |Online Learning Algorithms that Work Harder [224002340020] |It seems to be a general goal in practical online learning algorithm development to have the updates be very, very simple. [224002340030] |Perceptron is probably the simplest, and involves just a few adds. [224002340040] |Winnow takes a few multiplies. [224002340050] |MIRA takes a bit more, but still nothing hugely complicated. [224002340060] |Same with stochastic gradient descent algorithms for, eg., hinge loss. [224002340070] |I think this maybe used to make sense. [224002340080] |I'm not sure that it makes sense any more. [224002340090] |In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data. [224002340100] |There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting). [224002340110] |Why do I feel this way?
[224002340120] |Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual-threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning. [224002340130] |This is what John Langford observed in his Vowpal Wabbit implementation. [224002340140] |He has to do multiple passes. [224002340150] |He deals with the I/O bottleneck by creating an I/O-friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes. [224002340160] |In this case, basically what John is seeing is that I/O is too slow. [224002340170] |Or, phrased differently, learning is too fast :). [224002340180] |I never thought I'd say that, but I think it's true. [224002340190] |Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use. [224002340200] |But I think the problem is actually quite a bit more severe. [224002340210] |You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized. [224002340220] |In particular, someone has already gone through the effort of computing all your features for you. [224002340230] |Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update. [224002340240] |So in the real world, you're probably going to be computing your features and learning on the fly. [224002340250] |(Or at least that's what I always do.) [224002340260] |In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features. [224002340270] |One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes. [224002340280] |In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P). [224002340290] |I've never tried this myself because I am lazy. [224002340300] |Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially). [224002340310] |A similar problem crops up in structured problems. [224002340320] |Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients. [224002340330] |(This is similar to the problems you run into when trying to parallelize the structured perceptron.) [224002340340] |Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per example and required fewer passes; I hope someone will invent one for me! [224002360010] |AIStats 2011 Call for Papers [224002360020] |The full call, and some changes to the reviewing process. [224002360030] |The submission deadline is Nov 1, and the conference is April 11-13, in Fort Lauderdale, Florida. [224002360040] |Promises to be warm :). [224002360050] |The changes to the reviewing process are interesting.
[224002360060] |Basically the main change is that the author response is replaced by a journal-esque "revise and resubmit." [224002360070] |That is, you get 2 reviews, edit your paper, submit a new version, and get a 3rd review. [224002360080] |The hope is that this will reduce author frustration from the low bandwidth of author response. [224002360090] |Like with a journal, you'll also submit a "diff" saying what you've changed. [224002360100] |I can see this going really well: the third reviewer will presumably see a (much) better paper than the first two. [224002360110] |The disadvantage, which irked me at ICML last year, is that it often seemed like the third reviewer made the deciding call, and I would want to make sure that the first two reviewers also get updated. [224002360120] |I can also see it going poorly: authors invest even more time in "responding" and no one listens. [224002360130] |That will be increased frustration :). [224002360140] |The other change is that there'll be more awards. [224002360150] |I'm very much in favor of this, and I spent two years on the NAACL exec trying to get NAACL to do the same thing, but always got voted down :). [224002360160] |Oh well. [224002360170] |The reason I think it's a good idea is two-fold. [224002360180] |First, I think we're bad at selecting single best papers: a committee decision can often lead to selecting least offensive papers rather than ones that really push the boundary. [224002360190] |I also think there are lots of ways for papers to be great: they can introduce new awesome algorithms, have new theory, have a great application, introduce a cool new problem, utilize a new linguistic insight, etc., etc., etc... [224002360200] |Second, best papers are most useful at promotion time (hiring, and tenure), where you're being compared with people from other fields. [224002360210] |Why should our field put our people at a disadvantage by not awarding great work that they can list on their CVs? [224002360220] |Anyway, it'll be an interesting experiment, and I encourage folks to submit! [224002370010] |Very sad news.... [224002370020] |I heard earlier this morning that Fred Jelinek passed away last night. [224002370030] |Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes. [224002370040] |Fred is probably most infamous for the famous "Every time I fire a linguist the performance of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up." [224002370050] |And in Fred's 2009 ACL Lifetime Achievement Award speech, he basically said that such a thing never happened. [224002370060] |I doubt that will have any effect on how much the story is told. [224002370070] |Fred has had a remarkable influence on the field. [224002370080] |So much so that I won't attempt to list anything here: you can find all about him all over the internet. [224002370090] |Let me just say that the first time I met him, I was intimidated. [224002370100] |Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech. [224002370110] |Here's roughly how a segment of our conversation went: [224002370120] |Hal: What new projects are going on these days? [224002370130] |Fred: (Excitedly.) [224002370140] |We have a really exciting new speech recognition problem.
[224002370150] |We're trying to map speech signals directly to fluent text. [224002370160] |Hal: (Really confused.) [224002370170] |Isn't that the speech recognition problem? [224002370180] |Fred: (Playing the "teacher role" now.) [224002370190] |Normally when you transcribe speech, you end up with a transcript that includes disfluencies like "uh" and "um" and also false starts [Ed note: like "I went... [224002370200] |I went to the um store"]. [224002370210] |Hal: So now you want to produce the actual fluent sentence, not the one that was spoken? [224002370220] |Fred: Right. [224002370230] |Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them! [224002370240] |We then talked for a while about how they were doing this, and other fun topics. [224002370250] |A few weeks later, I got a voicemail on my home message machine from Fred. [224002370260] |That was probably one of the coolest things that have ever happened to me in life. [224002370270] |I actually saved it (but subsequently lost it, which saddens me greatly). [224002370280] |The content is irrelevant: the point is that Fred -- Fred! -- called me -- me! -- at home! [224002370290] |Amazing. [224002370300] |I'm sure that there are lots of other folks who knew Fred better than me, and they can add their own stories in comments if they'd like. [224002370310] |Fred was a great asset to the field, and I will certainly miss his physical presence in the future, though his work will doubtless continue to affect the field for years and decades to come. [224002380010] |ACL / ICML Symposium? [224002380020] |ACL 2011 ends on June 24, in Portland (that's a Friday). [224002380030] |ICML 2011 begins on June 28, near Seattle (the following Tuesday). [224002380040] |This is pretty much as close to a co-location as we're probably going to get in a long time. [224002380050] |A few folks have been discussing the possibility of having a joint NLP/ML symposium in between. [224002380060] |The current thought is to have it on June 27 at the ICML venue (for various logistical reasons). [224002380070] |There are buses and trains that run between the two cities, and we might even be able to charter some buses. [224002380080] |One worry is that it might only attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium. [224002380090] |As an NLPer/MLer, I believe in data. [224002380100] |So please provide data by filling out the form below and, if you wish, adding comments. [224002380110] |If you wouldn't attend any, you don't need to fill out the poll :). [224002380120] |The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend is too problematic to make it possible." [224002390010] |My Giant Reviewing Error [224002390020] |I try to be a good reviewer, but like everything, reviewing is a learning process. [224002390030] |About five years ago, I was reviewing a journal paper and made an error. [224002390040] |I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter. [224002390050] |I was reviewing a paper, which I thought was overall pretty strong. [224002390060] |I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review.
[224002390070] |Not a connection that made the current paper irrelevant, but something the authors should probably talk about. [224002390080] |In the revision response, the authors said that they had looked to try to find Smith's paper, but couldn't figure out which one I was talking about, and asked for a pointer. [224002390090] |I spent the next five hours looking for the reference and couldn't find it myself. [224002390100] |It turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation. [224002390110] |But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely. [224002390120] |Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper! [224002390130] |I remember that happening to me! [224002390140] |But, sadly, this is not true. [224002390150] |I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do. [224002390160] |In fact, I don't think a single reviewing cycle has passed where I don't get a review like this. [224002390170] |The problem with such reviews is that they enable a reviewer to make whatever claim they want, without any expectation that they have to back it up. [224002390180] |And the claims are usually wrong. [224002390190] |They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are. [224002390200] |Here are some of the most ridiculous cases I've seen. [224002390210] |I mention these just to show how often this problem occurs. [224002390220] |These are all on papers of mine. [224002390230] |
  • One reviewer wrote "This idea is so obvious this must have been done before." [224002390240] |This is probably the most humorous example I've seen, but the reviewer was clearly serious. [224002390250] |And no, this was not in a review for one of the "frustratingly easy" papers.
  • In an NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times. [224002390270] |Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing. [224002390280] |(Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.) [224002390290] |It strongly seemed that this was basically their way of saying "you are not one of us."
  • In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations. [224002390310] |The paper was rejected, we went and implemented Y, and found that it worked worse on task A. [224002390320] |We later found one paper saying that Y works better than X on task B, for B fairly different from A.
  • In another paper, we were told that what we were doing had been done before and in this case a citation was provided. [224002390340] |The citation was to one of our own papers, and it was quite different by any reasonable metric. [224002390350] |At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
  • We were told that we missed an enormous amount of related work that could be found by a simple web search. [224002390370] |I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that. [224002390380] |But here, no keywords were provided. [224002390390] |It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find it: that's fine -- just tell me how to find it if you don't want to actually provide citations.
[224002390400] |There are other examples I could cite from my own experience, but I think you get the idea. [224002390410] |I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem. [224002390420] |It's really just an issue of laziness. [224002390430] |If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong. [224002390440] |But I was lazy. [224002390450] |Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case). [224002390460] |I've really, really tried ever since my experience above to not ever do this again. [224002390470] |And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think. [224002390480] |I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims. [224002400010] |Comparing Bounds [224002400020] |This is something that's bothered me for quite a while, and I don't know of a good answer. [224002400030] |I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar). There are (at least?) two issues with comparing bounds: the first is the obvious "these are both upper bounds, what does it mean to compare them?" [224002400040] |The second is the slightly less obvious "but your empirical losses may be totally different" issue. [224002400050] |It's actually the second one that I want to talk about, but I have much less of a good internal feel about it. [224002400060] |Let's say that I'm considering two learning approaches. [224002400070] |Say it's SVMs versus logistic regression. [224002400080] |Both regularized. [224002400090] |Or something. [224002400100] |Doesn't really matter. [224002400110] |At the end of the day, I'll have a bound that looks roughly like: [224002400120] |expected test error <= empirical training error + f( complexity / N) [224002400130] |Here, f is often "sqrt", but could really be any function. [224002400140] |And N is the number of data points. [224002400150] |Between two algorithms, both "f" and "complexity" can vary. [224002400160] |For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (eg., O(D log D)). [224002400170] |Or one might have a square root. [224002400180] |Who knows. [224002400190] |Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds. [224002400200] |At the end of the day, we of course want to say "my algorithm is better than your algorithm." [224002400210] |(What else is there in life?) [224002400220] |The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)." [224002400230] |Here's where two issues crop up. [224002400240] |The first is that our bound is just an upper bound. [224002400250] |For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100."
[224002400260] |Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5. [224002400270] |In this way, the bounds can be misleading. [224002400280] |My general approach with this issue is to squint, as I do for experimental results. [224002400290] |I don't actually care about constant factors: I just care about things like "what does the dependence on D look like." [224002400300] |Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me. [224002400310] |Beyond that I don't really care. [224002400320] |I especially don't care if the proof techniques are quite similar. [224002400330] |For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers. [224002400340] |They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis. [224002400350] |(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...) [224002400360] |The other issue I don't have as good a feeling for. [224002400370] |The issue is that we're entirely ignoring the "empirical training error" question. [224002400380] |In fact, this is often measured differently between different algorithms! [224002400390] |For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..." [224002400400] |Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss. [224002400410] |Now I really don't know what to do. [224002400420] |We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data. [224002400430] |The problem is that the source training error might be totally incomparable to the (source + target) training error. [224002400440] |But the issue is for sure more general. [224002400450] |For instance, what if your training error is measured in squared error? [224002400460] |Now this can be huge when hinge loss is still rather small. [224002400470] |In fact, your squared error could be quadratically large in your hinge loss. [224002400480] |Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not. [224002400490] |(Neither does log loss.) [224002400500] |This worries me greatly, much more than the issue of comparing upper bounds. [224002400510] |Does this bother everyone, or is it just me? [224002400520] |Is there a good way to think about this that gets you out of this conundrum? [224002410010] |Managing group papers [224002410020] |Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) comes around, we usually have a slew of papers (>=3, typically) that are getting prepared. [224002410030] |I would say on average 1 doesn't make it, but that's par for the course. [224002410040] |For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level that they desired. [224002410050] |Anywhere from not understanding the problem/motivation to typos or errors in equations.
[224002410060] |My experience was that it was useful, both for distributing some of my workload and getting an alternative perspective, and for keeping everyone abreast of what everyone else is working on. [224002410070] |In fact, it was so successful that two students suggested to me that I require more-or-less complete drafts of papers at least one week in advance so that this can take place. [224002410080] |How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs. [224002410090] |It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off. [224002410100] |Maybe I'll try it though. [224002410110] |The bigger question is how to manage such a thing. [224002410120] |I was thinking of installing some conference management software locally (e.g., HotCRP, which I really like) and giving students "reviewer" access. [224002410130] |Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) could "review" them. [224002410140] |(Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!) [224002410150] |The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested. [224002410160] |Or to allow discussion. [224002410170] |Or something like that. [224002410180] |I'm curious if anyone has ever tried anything like this and whether it was successful or not. [224002410190] |It seems like if you can get a culture of this established, it could actually be quite useful. [224002420010] |Crowdsourcing workshop (/tutorial) decisions [224002420020] |Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers. [224002420030] |People have suggested crowdsourcing this, enabling participants to say -- well ahead of the conference -- which papers they'd go to... then let an algorithm schedule. [224002420040] |I think there are various issues with this model, but don't want to talk about it. [224002420050] |What I do want to talk about is applying the same ideas to workshop acceptance decisions. [224002420060] |This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials. [224002420070] |(I think what I have to say applies equally to tutorials as to workshops.) [224002420080] |I feel like a workshop (or tutorial) is successful if it is well attended. [224002420090] |This applies both from a monetary perspective and a scientific perspective. [224002420100] |(Note, though, that I think that small workshops can also be successful, especially if they are either fostering a small community, bringing people in, or serving other purposes. [224002420110] |That is to say, size is not all that matters. [224002420120] |But it is a big part of what matters.) [224002420130] |We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process more difficult).
[224002420140] |The idea would be that I could create a poll, like the following: [224002420150] |
  • Are you going to ACL? [224002420160] |Yes, maybe, no
  • [224002420170] |
  • Are you going to EMNLP? [224002420180] |Yes, maybe, no
  • [224002420190] |
  • If workshop A were offered at a conference you were going to, would you go to workshop A?
  • [224002420200] |
  • If workshop B...
  • [224002420210] |
  • And so on
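Just to make the tallying concrete, here is a minimal sketch in Python of how responses to a poll like this could be aggregated; the workshop names, the toy responses, and the choice to weight "maybe" attendees at one half are all invented for illustration, not anything we actually ran:

    # Toy poll responses (invented): for each respondent, whether they plan to
    # attend the conference and which workshops they say they would go to.
    responses = [
        {"attending": "yes",   "workshops": {"MT", "Annotation"}},
        {"attending": "maybe", "workshops": {"MT", "ML4NLP"}},
        {"attending": "yes",   "workshops": {"Annotation", "ML4NLP"}},
        {"attending": "no",    "workshops": {"MT"}},
    ]

    # Weight "maybe" respondents at one half (an arbitrary choice); ignore "no".
    weight = {"yes": 1.0, "maybe": 0.5, "no": 0.0}

    expected = {}     # workshop -> expected attendance
    co_interest = {}  # (workshop A, workshop B) -> weighted count of people who want both
    for r in responses:
        w = weight[r["attending"]]
        for a in r["workshops"]:
            expected[a] = expected.get(a, 0.0) + w
            for b in r["workshops"]:
                if a < b:
                    co_interest[(a, b)] = co_interest.get((a, b), 0.0) + w

    print(expected)     # rough expected attendance per workshop
    print(co_interest)  # large values = don't schedule these two against each other

The co-interest counts are exactly the kind of thing a scheduling step could use to avoid putting heavily co-attended workshops in the same slot.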
  • [224002420220] |This gives you two forms of information. [224002420230] |First it can help estimate expected attendance (though we ask proposers to estimate that, too, and I think they do a reasonable job if you skew their estimates down by about 10%). [224002420240] |But more importantly, it gives correlations between workshops. [224002420250] |This lets you be sure that you're not scheduling things on top of each other that people might want to go to. [224002420260] |Some of these are obvious (for instance, if we got 10 MT workshop proposals... which didn't actually happen but is moderately conceivable :P), but some are not. [224002420270] |For instance, maybe people who care about annotation also care about ML, but maybe not? [224002420280] |I actually have no idea. [224002420290] |Of course we're not going to do this this year. [224002420300] |It's too late already, and it would be unfair to publicise all the proposals, given that we didn't tell proposers in advance that we would do so. [224002420310] |And of course I don't think this should exclusively be a popularity contest. [224002420320] |But I do believe that popularity should be a factor. [224002420330] |And it should probably be a reasonably big factor. [224002420340] |Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions. [224002420350] |Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me. [224002420360] |I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) more willing to try new things than ACL (for better or for worse). [224002420370] |But I do think it would be interesting to try to see what sort of response you get. [224002420380] |Of course, just polling on this blog wouldn't be sufficient: you'd want to spam, perhaps, all of last year's attendees. [224002420390] |But this isn't particularly difficult. [224002420400] |Is there anything I'm not thinking of that would make this obviously not work? [224002420410] |I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public? [224002420420] |I find that a bit hard to swallow. [224002420430] |Perhaps there's a small embarrassment factor if you're public and then don't get accepted. [224002420440] |But I wouldn't advocate making the voting results public -- they would be private to the organizers / workshop chairs. [224002420450] |I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting. [224002420460] |This is probably true. [224002420470] |But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile. [224002420480] |I doubt I'm any better at guessing this than any other random NLP person... [224002420490] |So what am I forgetting? [224002430010] |NIPS 2010 Retrospective [224002430020] |Happy New Year and I know I've been silent but I've been busy. [224002430030] |But no teaching this semester (YAY!) so maybe you'll see more posts. [224002430040] |At any rate, I'm really late to the table, but here are my comments about this past year's NIPS.
[224002430050] |Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe. [224002430060] |Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference. [224002430070] |But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized. [224002430080] |(I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...). [224002430090] |The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada: [224002430100] |A: Wow, that map is really taking a long time to load. [224002430110] |B: I know. [224002430120] |It's probably some socialized Canadian WiFi service. [224002430130] |C: No, it's probably just slow because every third bit has to be a Canadian bit? [224002430140] |D: No no, it's because every bit has to be sent in both English and French! [224002430150] |Okay it's not that funny, but it was funny at the time. [224002430160] |(And really "B" is as much a joke about the US as it was about Canada :P.) [224002430170] |But I'm sure you are here to hear about papers, not stupid Canada jokes. [224002430180] |So here's my take. [224002430190] |The tutorial on optimization by Stephen Wright was awesome. [224002430200] |I hope this shows up on videolectures soon. [224002430210] |(Update: it has!) [224002430220] |I will make it required reading / watching for students. [224002430230] |There's just too much great stuff in it to go into, but how about this: momentum is the same as CG! [224002430240] |Really?!?! [224002430250] |There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao) and sparse eigenvector work. [224002430260] |Lots of awesome stuff. [224002430270] |My favorite part of NIPS. [224002430280] |Some papers I saw that I really liked: [224002430290] |A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting. [224002430300] |The crux is a clever generalization of the "weak learning" notion from binary. [224002430310] |The idea is that a weak binary classifier is one that has a small advantage over random guessing (which, in the binary case, gives 50/50). [224002430320] |Generalize this and it works. [224002430330] |Structured sparsity-inducing norms through submodular functions (Francis Bach): I need to read this. [224002430340] |This was one of those talks where I understood the first half and then got lost. [224002430350] |But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms. [224002430360] |Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher): The title says it all! [224002430370] |It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme. [224002430380] |A Reduction from Apprenticeship Learning to Classification (Umar Syed, Robert Schapire): Right up my alley, some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification.
[224002430390] |Similar to a recent paper by Stephane Ross and Drew Bagnell on Efficient Reductions for Imitation Learning. [224002430400] |Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms? [224002430410] |Implicit Differentiation by Perturbation (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster). [224002430420] |The idea is this: you have a graphical model and some loss function L(.) defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta. [224002430430] |Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation. [224002430440] |I think this is amazing. [224002430450] |Probabilistic Deterministic Infinite Automata (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all. [224002430460] |DP-style construction of infinite automata. [224002430470] |Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph. [224002430480] |It should be regularized in a sensible way. [224002430490] |Very LASSO-esque model, as you might expect given the author list :). [224002430500] |Other papers I saw that I liked but not enough to write mini summaries of: [224002430510] |Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy) Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan) Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations (Danial Lashkari, Ramesh Sridharan, Polina Golland) Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum) Joint Analysis of Time-Evolving Binary Matrices and Associated Documents (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin) Learning Efficient Markov Networks (Vibhav Gogate, William Webb, Pedro Domingos) Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher) Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh) [224002430520] |Two students who work with me (though one isn't actually mine :P), who went to NIPS also shared their favorite papers.
[224002430530] |The first is a list from Avishek Saha: [224002430540] |A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire) [224002430550] |Repeated Games against Budgeted Adversaries (Jacob Abernethy, Manfred Warmuth) [224002430560] |Non-Stochastic Bandit Slate Problems (Satyen Kale, Lev Reyzin, Robert Schapire) [224002430570] |Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum) [224002430580] |Learning Bounds for Importance Weighting (Corinna Cortes, Yishay Mansour, Mehryar Mohri) [224002430590] |Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh) [224002430600] |The second list is from Piyush Rai, who apparently aimed for recall (though not with a lack of precision) :P: [224002430610] |Online Learning: Random Averages, Combinatorial Parameters, and Learnability (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., rademacher averages, covering numbers, etc.). [224002430620] |Online Learning in The Manifold of Low-Rank Matrices (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. could also be used for online multitask learning. [224002430630] |Fast global convergence rates of gradient methods for high-dimensional statistical recovery (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency which explains the faster practical convergence of gradient methods than what the theory guarantees. [224002430640] |Copula Processes (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this. [224002430650] |Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the distribution p(y|x=x). [224002430660] |Structured sparsity-inducing norms through submodular functions (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms. [224002430670] |Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know". [224002430680] |Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings. [224002430690] |A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): we know that boosting in the binary case requires "slightly better than random" weak learners.
this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm. [224002430700] |Multitask Learning without Label Correspondences (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different and may have some complex (e.g., hierarchical, and potentially time varying) output spaces. the paper uses a mutual information criterion to learn the correspondence between the output spaces. [224002430710] |Learning Multiple Tasks with a Sparse Matrix-Normal Penalty (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices. [224002430720] |Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions (Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in a classification setting but due to the applicability of the proposed method for a wide variety of loss functions, one can possibly also use it for unsupervised learning problems as well (e.g., spectral clustering, kernel pca, etc). [224002430730] |Getting lost in space: Large sample analysis of the resistance distance (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows a rather surprising result that the commute distance between two vertices in the graph is meaningless if the graph is "large" and the nodes represent high dimensional variables. the paper proposes a correction and calls it "amplified commute distance". [224002430740] |A Bayesian Approach to Concept Drift (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept. [224002430750] |MAP Estimation for Graphical Models by Likelihood Maximization (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this in a pretty fast manner. they say that you can use this method with the max-product lp algorithms to yield even better solutions, with a quicker convergence. [224002430760] |Energy Disaggregation via Discriminative Sparse Coding (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. :) [224002430770] |Semi-Supervised Learning with Adversarially Missing Label Information (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true. this paper gives an algorithm to deal with the case when the labels could be adversarially missing. [224002430780] |Multi-View Active Learning in the Non-Realizable Case (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting. [224002430790] |Self-Paced Learning for Latent Variable Models (M.
Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first. [224002430800] |More data means less inference: A pseudo-max approach to structured learning (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. :) [224002430810] |Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in pool-based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search. [224002430820] |Active Instance Sampling via Matrix Partition (Yuhong Guo): frames batch mode active learning as a matrix partitioning problem and proposes a local optimization technique for the matrix partitioning problem. [224002430830] |A Discriminative Latent Model of Image Region and Object Tag Correspondence (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda. [224002430840] |Factorized Latent Spaces with Structured Sparsity (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data. [224002430850] |Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda for the case when you have access to features for each word in the vocabulary [224002450010] |What are your plans between ACL and ICML? [224002450020] |I'll tell you what they should be: attending the Symposium on Machine Learning in Speech and Language Processing, jointly sponsored by IMLS, ICML and ISCA, that I'm co-organizing with Dan Roth, Geoff Zweig and Joseph Keshet (the exact date is June 27, in the same venue as ICML in Bellevue, Washington). [224002450030] |So far we've got a great list of invited speakers from all of these areas, including Mark Steedman, Stan Chen, Yoshua Bengio, Lawrence Saul, Sanjoy Dasgupta and more. [224002450040] |(See the web page for more details.) [224002450050] |We'll also be organizing some sort of day trips (together with the local organizers of ICML) for people who want to join! [224002450060] |You should also consider submitting papers (deadline is April 15). [224002450070] |I know I said a month ago that I would blog more. [224002450080] |I guess that turned out to be a lie. [224002450090] |The problem is that I only have so much patience for writing and I've been spending a lot of time writing non-blog things recently. [224002450100] |I decided to use my time off teaching to do something far more time consuming than teaching. [224002450110] |This has been a wondrously useful exercise for me and I hope that, perhaps starting in 2012, other people can take advantage of this work. [224002460010] |Grad school survey, revisited [224002460020] |You may recall a while ago I ran a survey on where people applied to grad school.
[224002460030] |Obviously I've been sitting on these results for a while now, but I figured that, since it's that time of year when people are choosing grad schools, I would say how things turned out. [224002460040] |Here's a summary of things that people thought were most important (deciding factor), and moderately important (contributing factor, in parens):
  • Academic Program
  • [224002460060] |
  • Specialty degree programs in my research area, 48%
  • [224002460070] |
  • (Availability of interesting courses, 16%)
  • [224002460080] |
  • (Time to completion, 4%)
  • [224002460090] |
  • Application Process
  • [224002460100] |
  • Nothing
  • [224002460110] |
  • Faculty Member(s)
  • [224002460120] |
  • Read research papers by faculty member, 44%
  • [224002460130] |
  • Geographic Area
  • [224002460140] |
  • (Outside interests/personal preference, 15%)
  • [224002460150] |
  • Recommendations from People [224002460160] |
  • Professors in technical area, 45%
  • [224002460170] |
  • (Teachers/academic advisors, 32%)
  • [224002460180] |
  • (Technical colleagues, 20%)
  • [224002460190] |
  • Reputation
  • [224002460200] |
  • ... of research group, 61%
  • [224002460210] |
  • ... of department/college, 50%
  • [224002460220] |
  • (Ranking of university, 35%)
  • [224002460230] |
  • (Reputation of university, 34%)
  • [224002460240] |
  • Research Group
  • [224002460250] |
  • Research group works on interesting problems, 55%
  • [224002460260] |
  • Many faculty in a specialty area (e.g., ML), 44%
  • [224002460270] |
  • (Many faculty/students in general area (e.g., AI), 33%)
  • [224002460280] |
  • (Research group publishes a lot, 26%)
  • [224002460290] |
  • Web Presence
  • [224002460300] |
  • (Learned about group via web search, 37%)
  • [224002460310] |
  • (Learned about dept/univ via web search, 24%)
  • [224002460320] |
  • General
  • [224002460330] |
  • Funding availability, 49%
  • [224002460340] |
  • (High likelihood of being accepted, 12%)
  • [224002460350] |
  • (Size of dept/university, 5%)
  • [224002460360] |Overall these seem pretty reasonable. [224002460370] |And of course they all point to the fact that everyone should come to Maryland :P. Except for the fact that we don't have specialty degree programs, though that's the one thing on the list that I actually think is a bit silly: it might make sense for MS, but I don't really think it should be an important consideration for Ph.D.s. [224002460380] |You can get the full results if you want to read them and the comments: they're pretty interesting, IMO. [224002480010] |Postdoc Position at CLIP (@UMD) [224002480020] |Okay, now is where I take serious unfair advantage of having this blog. [224002480030] |We have a postdoc opening. [224002480040] |See the official ad below for details: [224002480050] |A postdoc position is available in the Computational Linguistics and Information Processing (CLIP) Laboratory in the Institute for Advanced Computer Studies at University of Maryland. [224002480060] |We are seeking a talented researcher in natural language processing, with strong interests in the processing of scientific literature. [224002480070] |A successful candidate should have a strong NLP background with a track record of top-tier research publications. [224002480080] |A Ph.D. in computer science and strong organizational and coordination skills are a must. [224002480090] |In addition to pursuing original research in scientific literature processing, the ideal candidate will coordinate the efforts of the other members of that project. [224002480100] |While not necessary, experience in one or more of the following areas is highly advantageous: summarization, NLP or data mining for scientific literature, machine learning, and the use of linguistic knowledge in computational systems. [224002480110] |Additionally, experience with large-data NLP and system building will be considered favorably. [224002480120] |The successful candidate will work closely with current CLIP faculty, especially Bonnie Dorr, Hal Daume III and Ken Fleischmann, while interacting with a large team involving NLP researchers across several other prominent institutions. [224002480130] |The duration of the position is one year, starting Summer or Fall 2011, and is potentially extendible. [224002480140] |CLIP is a dynamic interdisciplinary computational linguistics program with faculty from across the university, and major research efforts in machine translation, information retrieval, semantic analysis, generation, and development of large-scale statistical language processing tools. [224002480150] |Please send a CV and names and contact information of 3 referees, preferably by e-mail, to: [224002480160] |Jessica Touchard jessica AT cs DOT umd DOT edu Department of Computer Science A.V. Williams Building, Room 1103 University of Maryland College Park, MD 20742 [224002480170] |Specific questions about the position may be addressed to Hal Daume III at hal AT umiacs DOT umd DOT edu. [224002490010] |Seeding, transduction, out-of-sample error and the Microsoft approach... [224002490020] |My former master's student Adam Teichert (now at JHU) did some work on inducing part of speech taggers using typological information. [224002490030] |We wanted to compare the usefulness of using small amounts of linguistic information with small amounts of lexical information in the form of seeds. [224002490040] |(Other papers give seeds different names, like initial dictionaries or prototypes or whatever... it's all the same basic idea.)
[224002490050] |The basic result was that if you don't use seeds, then typological information can help a lot. [224002490060] |If you do use seeds, then your baseline performance jumps from like 5% to about 40% and then using typological information on top of this isn't really that beneficial. [224002490070] |This was a bit frustrating, and led us to think more about the problem. [224002490080] |The way we got seeds was to look at the wikipedia page about Portuguese (for instance) and use their example list of words for each tag. [224002490090] |An alternative popular way is to use labeled data and extract the few most frequent words for each part of speech type. [224002490100] |They're not identical, but there is definitely quite a bit of overlap between the words that Wikipedia lists as examples of determiners and the most frequent determiners (this correlation is especially strong for closed-class words). [224002490110] |In terms of end performance, there are two reasons seeds can help. [224002490120] |The first, which is the interesting case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, knowing that determiners often precede nouns in Portuguese). [224002490130] |The second, which is the uninteresting case, is that now every time you see one of your seeds, you pretty much always get it right. [224002490140] |In other words, just by specifying seeds, especially by frequency (or approximately by frequency a la Wikipedia), you're basically ensuring that you get 90% accuracy (due to ambiguity) on some large fraction of the corpus (again, especially for closed-class words which have short tails). [224002490150] |This phenomenon is mentioned in the text (but not the tables :P), for instance, in Haghighi & Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]." [224002490160] |Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds). [224002490170] |This is very similar to the idea of transductive learning in machine learning land. [224002490180] |Transduction is an alternative to semi-supervised learning. [224002490190] |The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled. [224002490200] |Your goal is to simply label the unlabeled data. [224002490210] |You need not "induce" the labeling function (though many approaches do, in passing). [224002490220] |The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a bit stronger assumption of noise on the "labeled" part. [224002490230] |The irony is that in machine learning land, you would never report "combined training and test accuracy" -- this would be ridiculous. [224002490240] |Yet this is what we seem to like to do in NLP land. [224002490250] |This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you didn't see at training time. [224002490260] |This is your out-of-sample error, and is obviously much harder than your standard generalization error. [224002490270] |(The famous no-free-lunch theorems are from an out-of-sample analysis.)
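To put some toy numbers on the seed effect described above (the coverage and accuracy figures below are invented for illustration; only the Haghighi and Klein numbers quoted earlier are real), here is the back-of-the-envelope arithmetic:

    # Invented numbers: suppose seed words cover 40% of tokens and, being mostly
    # frequent closed-class items, get tagged correctly 90% of the time, while
    # accuracy on everything else is much lower.
    seed_coverage = 0.40      # fraction of tokens covered by seeds (invented)
    seed_accuracy = 0.90      # accuracy on seed tokens (invented)
    nonseed_accuracy = 0.45   # accuracy on non-seed tokens (invented)

    overall = seed_coverage * seed_accuracy + (1 - seed_coverage) * nonseed_accuracy
    print(overall)  # 0.63 -- looks respectable even though non-seed accuracy is only 0.45

That gap between overall accuracy and non-seed accuracy is the same flavor of gap as the 68.8% versus 47.7% that Haghighi and Klein report.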
[224002490280] |The funny thing about out-of-sample error is that sometimes you prefer not to get more training examples, because you then know you won't be tested on them! [224002490290] |If you were getting it right already, this just hurts you. [224002490300] |(Perhaps you should be allowed to see x and say "no I don't want to see y"?) [224002490310] |I think the key question is: what are we trying to do. [224002490320] |If we're trying to build good taggers (i.e., we're engineers) then overall accuracy is what we care about and including "seed" performance in our evaluations makes sense. [224002490330] |But when we're talking about 45% tagging accuracy (like Adam and I were), then this is a pretty pathetic claim. [224002490340] |In the case that we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists) then accuracy on non-seeds is perhaps more interesting. [224002490350] |(Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.) [224002490360] |This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS....). [224002490370] |We were working on web search stuff. [224002490380] |His comment was that he really didn't care about doing well on the 1000 most frequent queries. [224002490390] |Microsoft could always hire a couple annotators to manually do a good job on these queries. [224002490400] |And in fact, this is what is often done. [224002490410] |What we care about is the heavy tail, where there are too many somewhat common things to have humans annotate them all. [224002490420] |This is precisely the same situation here. [224002490430] |I can easily get 1000 seeds for a new language. [224002490440] |Do I actually care how well I do on those, or do I care how well I do on the other 20000+ things? [225000010010] |Day One [225000010020] |I created this blog with the intention of providing a wider audience for my list of companies who hire computational linguists (see link to the right). [225000010030] |If it morphs into something else, so be it. [225000020010] |OMG! [225000020020] |Hey look! [225000020030] |I blogged again...this could become a habit. [225000040010] |Language Death vs. Language Murder [225000040020] |Today, The Huffington Post linked to an article about language death titled "Researchers Say Many Languages Are Dying" and I feel compelled to give my two cents. [225000040030] |As a caveat, I should say that I do not have special training in anthropological linguistics or socio-linguistics, beyond what everyone who does a PhD at a functionalism-biased linguistics department is required to undergo. [225000040040] |I will spend the next few days looking into this topic, as it causes passions to flare. [225000040050] |I will start with a "gut reaction" post, with the hopes of adding more substance in the coming days. [225000040060] |My two cents = I don't think there is anything inherently "wrong" with the death of a language, just like I don't think there is anything inherently wrong with the death of a certain species or a certain person. [225000040070] |I believe it's true that most species of living things that have ever existed are currently extinct. [225000040080] |This is probably also true of languages. [225000040090] |Extinction is natural. [225000040100] |The HuffPo article quoted Professor K.
David Harrison, an assistant professor of linguistics at Swarthmore College, as saying this: "When we lose a language, we lose centuries of human thinking about time, seasons, sea creatures, reindeer, edible flowers, mathematics, landscapes, myths, music, the unknown and the everyday." [225000040110] |My gut reaction is that this is an overly bold claim and ought to be scaled back. [225000040120] |I think calling this a "loss" is probably the wrong way to analyze the change that occurs with language death. [225000040130] |But even if it were true, such loss is inevitable, and not necessarily bad. [225000040140] |Think of it this way: when a person dies, we "lose" the lifetime of experience and knowledge that she held. [225000040150] |This is sad, surely, but also natural and we accept it. [225000040160] |It seems to me that feeling sad or angry over language death conflates the death with murder. [225000040170] |It's language murder that ought to be stopped. [225000040180] |Language murder is probably the result of specific policy decisions that governments make regarding education, published materials, and public discourse. [225000040190] |Language death is natural. [225000040200] |Language murder is intentional and rational. [225000040210] |More later. [225000040220] |The HuffPo article is here: [225000050010] |The Heavy Weights Weigh In .. or do they? [225000050020] |Well, Eric Bakovic over at Language Log brings up language death today, and seems to at least implicitly support Harrison's view that language death is somehow inherently bad, a position I rejected in my earlier post and a position that The Language Guy also challenges (see my previous post for link). [225000050030] |However, the Bakovic post is remarkably devoid of any explicit claims about language death; rather it simply links to a variety of resources. [225000050040] |Since Language Log is ostensibly the world's most respectable linguistics blog, boasting such regular contributors as Zwicky, Liberman, Partee, and Nunberg (all far superior linguists to me), its postings on language death (and all linguistic phenomena for that matter) are likely to be taken as conventional wisdom within the field. [225000050050] |But here's the thing: linguistics has a bad history with conventional wisdom. [225000050060] |My chosen field has a 40 year history of failed theories. [225000050070] |And I suspect the very emotionally charged issue of language death is another example of bad conventional wisdom within the linguistics community. [225000050080] |It's not clear to me at this stage what positions the other Language Log contributors take on language death, so I will take some time this weekend to review their archives and see if they have previous posts discussing it. [225000050090] |For now, I repeat my earlier assertion that there is nothing inherently wrong with language death, and I promise to follow up with more substance this weekend after some thoughtful review of the literature. [225000070010] |Language Death and Tough Questions [225000070020] |As promised, I've been following up on the contentious issue of language death, and I'm beginning to formulate a research direction. [225000070030] |I'm noticing a clear bias among those who champion the fight against language death: they all assume it's bad. [225000070040] |Over the last few days I've developed a few foundational questions that I feel are being overlooked. [225000070050] |Question #2 is one of these paradigm shifters, so watch out!
[225000070060] |My current research questions regarding language death: [225000070070] |
  • Is language death a separate phenomenon from language change?
  • [225000070080] |
  • Is language death good? (or, less caustically: are there any favorable outcomes of language death?)
  • [225000070090] |
  • How do current rates of language death compare with historical rates?
  • [225000070100] |
  • What is the role of linguists wrt language death? [225000070110] |Here’s an attempt at first principles regarding language death [225000070120] |
  • language change is natural
  • [225000070130] |
  • language change is a basic part of how language works
  • [225000070140] |
  • language change is good
  • [225000070150] |
  • language death is natural
  • [225000070160] |
  • ???
  • [225000070170] |I resist taking this further … for now. [225000070180] |But I suspect that there is an analogous argument to be made for language death (or, perhaps more likely, that language death is not a separate phenomenon from language change, and analyzing it as separate clouds the important issues that linguists need to study). [225000070190] |In the last two days, I’ve had a brief opportunity to read up on language death, and it appears that David Crystal is one of the world’s leading figures championing the fight against language death. [225000070200] |I’ve just read a sample of Crystal’s book Language Death (Cambridge University Press: 2000). [225000070210] |There is a PDF of the first 23 pages of the first chapter What Is Language Death freely available via Cambridge Press online. [225000070220] |The second chapter appears to begin just a few pages later, so it’s not clear why the PDF was cut short (or, if the chapter really ends as abruptly as the PDF), but it’s not a significant deletion, and the point is quite clear. [225000070230] |Language death is rampant, regardless of how or what you count. [225000070240] |My general impression of the Crystal chapter: The mission of the first chapter, as its title implies, is to establish the facts of language death, and it does an admirable job with this task. [225000070250] |Unfortunately, this is beside the point for me. [225000070260] |I have no reason to debate the FACT of language death; rather I want to debate the alleged PROBLEM of language death. [225000070270] |I consider the fact of language death and the problem of language death to be two different things. [225000070280] |I look forward to reading Crystal’s second chapter Why Should We Care? and the third chapter Why Do Languages Die? [225000070290] |These should address my central questions more directly. [225000070300] |Here are what I consider to be the highlights of Crystal’s chapter 1:
  • Language death is like person death because languages need people to exist
  • [225000070320] |
  • Language death = no one speaks it anymore
  • [225000070330] |
  • Language needs 2 speakers to be “alive”
  • [225000070340] |
  • Speakers are “archives” of language
  • [225000070350] |
  • A dead language with no record = never existed
  • [225000070360] |
  • Ethnologue lists about 6,300 living languages
  • [225000070370] |
  • Difficult estimating rate of language loss
  • [225000070380] |
  • Almost half of Ethnologue languages don’t even have surveys (let alone descriptions)
  • [225000070390] |
  • Difficulties in establishing relationship between dialects
  • [225000070400] |
  • Crystal accepts mutual intelligibility criteria as definition of language (Quechua = 12 diff languages)
  • [225000070410] |
  • Crystal accepts 5k-7k as range of # of languages
  • [225000070420] |
  • Footnote 19 = maybe 31,000-600,000 languages ever existed; 140,000 reasonable “middle road” estimate
  • [225000070430] |
  • A language must have fluent living speakers to be “alive”
  • [225000070440] |
  • How many speakers to be viable -- Unclear
  • [225000070450] |
  • 10,000-20,000 speakers suggests viability in the short term
  • [225000070460] |
  • 96% of world population speaks just 4% of the existing languages
  • [225000070470] |
  • 500 languages have less than 100 speakers
  • [225000070480] |
  • 1500 less than 1000
  • [225000070490] |
  • 3,340 less than 10,000
  • [225000070500] |
  • Therefore, about 4k languages are in danger of death
  • [225000070510] |
  • Difficult to estimate current rate of death (me: surely it must be even MORE difficult to estimate historical rates)
  • [225000070520] |
  • Canadian survey = appears to be a downward trend in aboriginal languages spoken at home
  • [225000070530] |
  • Teen years seem to be when people begin to disfavor their home language
  • [225000070540] |
  • Experts agree -- majority of world languages are in danger in next 100 years
  • [225000070550] |
  • How to determine which languages are “more” endangered than others
  • [225000070560] |Okay, that's where I am now. [225000070570] |I hope to review Crystal's chapters 2 & 3 and respond this weekend. [225000080010] |Poser Responds! [225000080020] |At 3:47pm yesterday (Sept 21), Bill Poser over at Language Log posted this interesting claim: "The rate of language loss has accelerated as communication and travel have become more rapid and efficient, but the phenomenon is far from new." [225000080030] |14 minutes earlier, at 3:33pm (same day), The Lousy Linguist (uh, me) posted this question: "How do current rates of language death compare with historical rates?" [225000080040] |It's doubtful that Poser was directly answering me (but The Lousy Linguist can dream...), but it does seem to directly answer the question. [225000080050] |Unfortunately, no supporting evidence for the claim is offered, and there's the rub. [225000080060] |As Crystal is quick to point out, rates of contemporary language death are very difficult to determine (in fact, he refers to the attempts as "well-informed guesswork", p15 of the PDF). [225000080070] |And as I was even quicker to point out "surely it must be even MORE difficult to estimate historical rates". [225000080080] |In the one chapter of Crystal's book that I have so far read, he opts for the position that 50% of the world's languages will be "lost" in the next 100 years. [225000080090] |I have no reason not to accept this as fair. [225000080100] |But I don't know how this compares with the past (it seems intuitive that this is far faster than historical rates, but honestly, I have only vague intuition to go on here, and no one else seems to have anything better). [225000080110] |And, of course (insert broken record here) we have yet to tackle the truly important question of what linguistic effect this loss has. [225000090010] |cko's challenge [225000090020] |In her comment on my first post below, cko challenged me (rightly so) to "think about the language used in this discussion of language extinction. [225000090030] |Where does it come from? [225000090040] |What do these analogies contribute to this discussion and what do they potentially hide?" [225000090050] |And it was the analysis of language death as cultural "loss" that got me thinking about this issue. [225000090060] |But the real danger is missing any potential VALUE that language death may provide language evolution. [225000090070] |This is a side of the issue wholly ignored, as far as I can tell. [225000090080] |And it's precisely because of those framing metaphors like "death" and "loss" that linguists have overlooked the study of favorable outcomes. [225000090090] |We have an academic duty to view the whole picture of language death, even if that picture contradicts our political perspective. [225000090100] |There is an obvious correlate to contemporary debates over global warming. [225000090110] |Again, cko's comment is instructive: "much of the language loss that is currently occurring is due to non-natural forces". [225000090120] |This may or may not be true. [225000090130] |That's the point. [225000090140] |We just don't know how the current rate of language loss compares to historical rates. [225000090150] |We don't know because 1) estimating current rates is difficult and 2) estimating historical rates is nearly impossible. [225000090160] |Yet conventional wisdom holds that contemporary language death rates MUST be unnaturally driven. [225000090170] |We evil humans are KILLING languages!!!! [225000090180] |O my god! [225000100010] |Faster, Pussycat! Kill!
Kill! [225000100020] |To return for a moment to Poser's comment on Language Log that increasingly efficient communication and travel are the cause of accelerated language death: this may be true, and surely this is the hand of humans wielding the knife of change, but it's also the cause of far greater language contact, which, ya know, has its own benefits. [225000100030] |This analysis of causation still gives us no reason to believe that language death is bad. [225000100040] |So, we are faced with the conflation of three phenomena: language change, death, and murder. [225000100050] |It is language murder (the rational choices by governments and institutions to effect policy changes that cause the decline of a language or languages ... that's my definition for now) which should be challenged and fought against, not the fact of language death per se. [225000100060] |Let's not transfer our anger over language murder to language death, which may turn out to serve some positive ends, if only we would study the effects with dispassionate hearts. [225000100070] |Mad props to Russ. [225000110010] |First Answers [225000110020] |Well, I posed these questions below, so it's only fair I pose some possible answers too. [225000110030] |These are my first impressions:
  • Is language death a separate phenomenon from language change?
  • [225000110050] |In terms of linguistic effect, I suspect not [225000110060] |
  • Are there any favorable outcomes of language death?
  • [225000110070] |I suspect, yes [225000110080] |
  • How do current rates of language death compare with historical rates?
  • [225000110090] |Nearly impossible to tell [225000110100] |
  • What is the role of linguists wrt language death? [225000110110] |One might ask: what is the role of mechanics wrt global warming? [225000110120] |UPDATE: post with original questions here. [225000110130] |Additional response to David Crystal here. [225000120010] |Let’s be clear [225000120020] |
  • I am NOT a monolingualist. [225000120030] |I neither advocate nor support efforts towards ‘one language’, English only, or related endeavors. [225000120040] |
  • I believe language diversity and multilingualism are good.
  • [225000120050] |
  • I believe in describing as many languages as possible and I support descriptive field linguists.
  • [225000130010] |The Nile [225000130020] |Uh, I realize now that my previous post kinda maybe made it sound like I am a global warming denier. [225000130030] |I am not. [225000130040] |I believe that humans are contributing to the unnaturally accelerated rate of climate change. [225000130050] |We should stop doing that. [225000130060] |However, I am not convinced that humans are contributing to an unnaturally accelerated rate of language death. [225000130070] |Nor am I convinced that there are no favorable outcomes to language death. [225000140010] |Poor Juxtaposition [225000140020] |The large Northeast university at which I am a "long term" graduate student recently posted this title on its "Weekly Student Affairs Survey" [225000140030] |"1 in 4, what do you know about sexual assault/rape? (tell us now and win!)" [225000140040] |It immediately struck me as poor juxtaposition to place the rather tacky "tell us now and win!" right after a question about such a serious topic. [225000150010] |Why? [225000150020] |Why am I challenging the conventional wisdom on language death (i.e., that it's bad)? [225000150030] |Because I believe linguists need to analyze the relationship between language death and natural language evolution dispassionately. [225000150040] |The cause of my posts on this topic was the recognition that conventional wisdom within linguistics seemed to hold that contemporary language death is somehow threatening the structure of language and culture, and I'm not convinced that's true. [225000160010] |I Got Yer Deictic Center Right HERE [225000160020] |Hmmm, just wondering if there is anything intellectually interesting about the blogosphere's use of hyper-linked "here" (just like I did here) [225000170010] |Lie Berries [225000170020] |Well, it goes without saying that Crystal's book Language Death was not "On Shelf" as my research library's database claimed it was. [225000170030] |I have requested a trace, but I hold little hope for its timely discovery. [225000170040] |Google Book Search has the first few pages of each chapter available for viewing, so I've been able to sample chapters 3 and 4, Why Should We Care and Why Do Languages Die. [225000170050] |I've also found a couple book reviews that have summarized the chapters a bit, so I'm forming an impression of the contents, though I caution that an "impression" is all I will have to go on until I can get the book. [225000180010] |Unsafe In Any Post [225000180020] |Ahhhh shucks, Arnold Zwicky over at Language Log references little ol' me here in his round-up of snowclone dead ends, adding credibility to the veracity of my blog title. [225000180030] |It would be quite lovely indeed if Dr. Zwicky would also comment on my recent meditations here, here and here on the possibility that language death may well have favorable outcomes for language evolution (I'm not above fishing for recognition). [225000190010] |When "here" is "there" [225000190020] |As a follow-up to my previous post here, it seems to be rather interesting that the blogosphere's use of hyper-linked "here" is closer to the natural language use of "there", as a pointer to a distant referent. [225000190030] |The referent is NOT in fact "here", but somewhere else. [225000190040] |It is true that one must go through the link (which is, in essence, 'closer') to get to the distant referent, but the referent of "here" is not here, it's there.
[225000190050] |In terms of usage, it's closer to Monty Hall's classic use of 'here' when he stood next to door number 3 and said "your new car might be through here!" (nothing good EVER was behind door number 3!). [225000190060] |I'm going to name this Let's-Make-A-Deal Deixis ... or Monty's Deixis ... or Door-Number Three Deixis ... [225000190070] |Shoot! [225000190080] |I may need to start an internet poll! [225000200010] |Freako-linguistics [225000200020] |Yesterday, Dubner over at Freakonomics posted about names and naming. [225000200030] |His basic point is here in this excerpt: [225000200040] |It has always struck me that a lot of the things we do and use and see every day have names that aren’t very accurate or appropriate or idiomatic….I don’t mean to say that most invented common nouns are bad. [225000200050] |This is a common claim made by non-linguists, and all of us who have taught intro to linguistics courses have heard it ad nauseam. [225000200060] |But when I read Dubner’s version of this old grumble, I was struck by two things. [225000200070] |I understand complaining that a noun’s name is not accurate or not appropriate, but just what the hell does he mean not idiomatic? [225000200080] |Second, Dubner seems to think there is such a thing as a noun name that is NOT invented. [225000200090] |Pray tell, Mr. Dubner, can you list some examples? [225000200100] |Okay, so it seems I get to add a third task for me to take up on this blog: a Lin 101 review of the arbitrariness of language. [225000200110] |Jeeeez! [225000200120] |I might have to brush up on Saussure. [225000200130] |Ugh! [225000210010] |Daume on POS tagging [225000210020] |Hal Daume over at his natural language processing blog makes a damned interesting claim (and his commenters basically agree): [225000210030] |Proposition: mark-up is always a bad idea. [225000210040] |That is: we should never be marking up data in ways that it's not "naturally" marked up. [225000210050] |For instance, part-of-speech tagged data does not exist naturally. [225000210060] |Parallel French-English data does. [225000210070] |The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing. [225000210080] |His point seems to be that humans naturally translate texts, so that’s worth “computationalizing” (great word, BTW), but humans do not naturally POS tag, so why bother? [225000210090] |Okay, but is this false? [225000210100] |Do humans naturally POS tag when processing language? [225000210110] |I think it’s fair to say that humans naturally categorize natural language input, and some of this categorization could be likened to POS tagging. [225000210120] |I’m going to need to brush up on my rusty psycholinguistics and make a more substantive post on this later. [225000220010] |A Hypothesis [225000220020] |I used the phrase "an hypothesis" below, following what I thought was the prescriptivism I was surely taught in 6th grade that a noun starting with the letter "h" should be preceded by the "an" form of the indefinite article. [225000220030] |But it didn't seem right, so I Googled the two versions. [225000220040] |The results are thus: Results 1 - 10 of about 756,000 for "an hypothesis". [225000220050] |Results 1 - 10 of about 2,070,000 for "a hypothesis". [225000220060] |And so wins "a hypothesis" ... for now.
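For what it's worth, the arithmetic on those two counts (using the rough hit counts quoted above, which are themselves only estimates):

    an_hypothesis_hits = 756000    # reported hits for "an hypothesis"
    a_hypothesis_hits = 2070000    # reported hits for "a hypothesis"

    ratio = a_hypothesis_hits / an_hypothesis_hits
    share = a_hypothesis_hits / (a_hypothesis_hits + an_hypothesis_hits)
    print(round(ratio, 1))     # about 2.7 times as common
    print(round(share * 100))  # "a hypothesis" gets roughly 73% of the hits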
[225000240010] |Promises Promises [225000240020] |Okay, that now makes 2 blog promises I must keep: 1) Propose an hypothesis about how language death might have favorable outcomes for language evolution and 2) review any psycholinguistic evidence regarding POS tagging. [225000260010] |You say tomato... [225000260020] |Andrew Sullivan proudly asserts his refusal here to use the term “Myanmar” to refer to the Southeast Asian country found at the coordinates 22 00 N, 98 00 E (thank you CIA World Factbook). [225000260030] |Wikipedia explains the history of the two names here: [225000260040] |The colloquial name Bama is supposed to have originated from the name Myanma by shortening of the first syllable (loss of nasal "an", reduced to non-nasal "a", and loss of "y" glide), and then by transformation of "m" into "b". [225000260050] |This sound change from "m" to "b" is frequent in colloquial Burmese, and occurs in many other words. [225000260060] |Although Bama may be a later transformation of the name Myanma, both names have been in use alongside each other for centuries. [225000260070] |I respect Sullivan’s point that he wants to resist totalitarian p.r.; however, if Wikipedia is correct and “both names have been in use alongside each other for centuries” then this seems like a trivial way to do it. [225000270010] |The Innateness Hypothesis [225000270020] |Juan Uriagereka is a very good linguist, no doubt. [225000270030] |He writes here that [225000270040] |"Language is an innate faculty, rather than a learned behavior...Language may indeed be unique to humans, but the processes that underlie it are not." [225000270050] |Yes, most linguists agree that some sort of cognitive endowment is unique to humans which helps us learn and use language, but the nature of that endowment is far from well understood. [225000270060] |Uriagereka has a particularly apropos background for the topic of language evolution; I respect that. [225000270070] |Nonetheless, I suspect he is a tad biased towards the Chomskyan view. [225000270080] |Therefore, I will try to list the arguments AGAINST the innateness hypothesis for y'all this weekend (I guess this makes #4 on my list of blog promises). [225000280010] |The Perfect Storm [225000280020] |Eric Bakovic over at Language Log has posted again on endangered language, and yet again has given no indication of his own opinion of the issues; I think this is indicative of the entrenched assumption within the linguistic community that language death is bad, so there is no need to explicitly discuss that part of the issue. [225000280030] |As y’all know, I have challenged this position here, here, here, and here. [225000280040] |Bakovic’s comments section, however, does include a juicy argument by the center of the storm himself, K. David Harrison. [225000280050] |He claims languages [225000280060] |have unique structures [225000280070] |contain useful (to human survival) knowledge [225000280080] |are being abandoned by speakers in favor of global languages [225000280090] |I have posted a response on the Language Log comments here. [225000280100] |I will try to post more this weekend.