[224001350010] |Understanding Model Predictions [224001350020] |One significant (IMO) issue that we face when attempting to do some sort of feature engineering is trying to understand not only what sorts of errors our model is making, but why. [224001350030] |This is an area in which pattern-based methods seem to have a leg up on classification-based methods. [224001350040] |If an error occurs, we know exactly what pattern fired to yield that error, and we can look back at where the pattern came from and (perhaps) see what went wrong. [224001350050] |I've been trying to think for a while what the best way to do this in a classification style system is, especially when the output is structured. [224001350060] |I think I have a way, but in retrospect it seems so obvious that I feel someone must have tried it before. [224001350070] |Note that unlike some past blog posts, I haven't actually tried doing this yet, but I think maybe the idea is promising. [224001350080] |Suppose we learn a system to solve some task, and we have some held out dev data on which we want to do, say, feature engineering. [224001350090] |What we'd like to do is (a) find common varieties of errors and (b) figure out why these errors are common. [224001350100] |I'm going to assume that (a) is solved... for instance in tagging problems you could look at confusion matrices (though note that in something like MT or summarization, where the structure of the output space changes based on past decisions, this may be nontrivial). [224001350110] |Let's say we're doing POS tagging and we've observed some error class that we'd like to understand better, so we can add new features that will fix it. [224001350120] |One way of thinking about this is that our classifier predicted X in some context where it should have produced Y. The context, of course, is described by a feature set. [224001350130] |So what we want to do, essentially, is look back over the training data for similar contexts, but where the correct answer was X (what our current model predicted). [224001350140] |In the case of linear classifiers, it seems something as simple as cosine difference over feature vectors may be a sufficiently good metric to use. [224001360010] |NetFlix "solved" (the small version) [224001360020] |See the post on Hunch. [224001360030] |Maybe that last 1.5% might benefit not from fancy ML, but from fancy (or even stupid) NLP. [224001360040] |"But what data do we have to run the NLP on," my friends may ask. [224001360050] |How about stuff like this, or if you're adventurous like this? [224001360060] |(Had I enough time, I might give it a whirl, but alas...) [224001360070] |p.s., If any of the above links encourage copyright violations, then I'm not actually advocating their use. [224001370010] |Translation out of English [224001370020] |If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). [224001370030] |The parenthetical options are usually just due to the availability of corpora for those languages. [224001370040] |The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. 
[224001370050] |There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition! [224001370060] |Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. [224001370070] |Our beloved n-gram language models work beautifully on a language with such a fixed word order and (to a first approximation) no morphology. [224001370080] |Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task. [224001370090] |I think the state of affairs in translation out of English is much worse off. [224001370100] |In particular, I mean translation from a morphologically-poor language to a morphologically-rich language. [224001370110] |(Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.) [224001370120] |Why do I think this is an interesting problem? [224001370130] |For one, I think it challenges a handful of preconceived notions about translation. [224001370140] |For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. [224001370150] |There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. [224001370160] |From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." [224001370170] |I don't want to belittle that work---it was fantastic. [224001370180] |But one would hope for more than a slight reduction, given how badly word-based ngram models work in Arabic. [224001370190] |Second, one of my biggest pet-peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. [224001370200] |That is, the goal of MT is just to map from one language to another. [224001370210] |It is still up to the human reading the (translated) document to do interpretation. [224001370220] |One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. [224001370230] |This is not a morphological issue per se (for instance, the degree to which context plays a role in interpretation in Japanese is significantly higher than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. [224001370240] |It just so happens that a lot of this information is often marked in morphology, which languages like English lack. [224001370250] |That said, there is at least one good reason why one might not work on translation out of English. [224001370260] |For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). [224001370270] |The language other than English that I speak best is Japanese.
[224001370280] |But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. [224001370290] |Moreover, Japanese is not what I would call a morphologically rich language. [224001370300] |I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese. [224001370310] |If I spoke Arabic, I think English to Arabic translation would be quite interesting. [224001370320] |Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex, but well-studied morphology (even in the NLP literature). [224001370330] |As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. [224001370340] |Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. [224001370350] |Such things might be important in generating, eg., gender and number morphemes. [224001370360] |But alas, my Arabic is not quite up to par. [224001370370] |(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. [224001370380] |Moreover, from the perspective of "information poor", it's pretty close to the top!) [224001380010] |NIPS 2007 Pre-prints Up (and WhatToSee-d) [224001380020] |NIPS 2007 preprints are online, with a big warning that they're not final versions. [224001380030] |Be that as it may, I've indexed them on WhatToSee for your perusal. [224001380040] |(More info on WhatToSee) [224001390010] |Domain adaptation vs. transfer learning [224001390020] |The standard classification setting is an input distribution p(X) and a label distribution p(Y|X). [224001390030] |Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. [224001390040] |Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. [224001390050] |In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distribution stays the same, but the labels change. [224001390060] |The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. [224001390070] |(Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.) [224001390080] |As a brief aside, we actually need to be a bit more specific about the domain adaptation case. [224001390090] |In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). [224001390100] |In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. [224001390110] |(Many people have pointed this out; I think I first heard it from Shai Ben-David a few years ago.) [224001390120] |It is because of this that theoretical work in domain adaptation has been required to make stronger assumptions. [224001390130] |A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers), there exists one that doesn't do too badly on either the source or the target domain.
[224001390140] |This is a stronger assumption than "p(Y|X) doesn't change", but it actually enables us to do stuff. [224001390150] |(Though, see (*) below for a bit more discussion on this assumption.) [224001390160] |Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. [224001390170] |In DA, the two "sides" are the source domain and the target domain. [224001390180] |In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). [224001390190] |In all cases, we want something that does well on the target. [224001390200] |Let's enumerate the four possibilities: [224001390210] |
  • Source labeled, target labeled (S+T+)
  • [224001390220] |
  • Source labeled, target only unlabeled (S+T-)
  • [224001390230] |
  • Source only unlabeled, target labeled (S-T+)
  • [224001390240] |
  • Source only unlabeled, target only unlabeled (S-T-)
  • [224001390250] |We can immediately throw away S-T- because this is basically an unsupervised learning problem. [224001390260] |The typical assumption in TL is S+T+. [224001390270] |That is, we have labeled data for both tasks. [224001390280] |(Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.) [224001390290] |In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on). [224001390300] |Now, I think it's fair to say that any of the T- settings are impossible for TL. [224001390310] |Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. [224001390320] |(I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.) [224001390330] |This begs the question: in TL and DA, does the S-T+ setting make any sense? [224001390340] |For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. [224001390350] |Usually we want to do DA so that we don't have to label (much) target data. [224001390360] |However, one could make a semi-supervised sort of argument here. [224001390370] |Maybe it's just hard to come by target data, labeled or otherwise. [224001390380] |In this case, we'd like to use a bunch of unlabeled source data to help out. [224001390390] |(Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) [224001390400] |From a more theoretical perspective, I don't really see anything wrong with it. [224001390410] |In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here. [224001390420] |For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). [224001390430] |Why do I think this makes sense? [224001390440] |Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. [224001390450] |If this is true, and we have labeled target data, I see no reason why we would need labeled source data. [224001390460] |That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. [224001390470] |Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same. [224001390480] |What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. [224001390490] |I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. [224001390500] |There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. [224001390510] |Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. [224001390520] |Maybe some assumption on the bound of the KL divergence or some such thing. 
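To make that last suggestion slightly more concrete, here is a minimal sketch (my own construction, not taken from any TL paper) of estimating the average KL divergence between two tasks' labeling distributions, given a probabilistic classifier for each task and a shared pool of inputs; the classifiers and the pool are entirely hypothetical.

    import numpy as np

    def mean_kl(p_src, p_tgt, eps=1e-12):
        """Average KL( p_src(y|x) || p_tgt(y|x) ) over a shared sample of inputs.
        p_src, p_tgt: arrays of shape (n_examples, n_labels); each row is the
        conditional label distribution predicted by the source/target model."""
        p = np.clip(p_src, eps, 1.0)
        q = np.clip(p_tgt, eps, 1.0)
        return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))

    # Toy usage: two 3-class "tasks" evaluated on the same 5 (unlabeled) inputs.
    rng = np.random.default_rng(0)
    p_src = rng.dirichlet(np.ones(3), size=5)
    p_tgt = rng.dirichlet(np.ones(3), size=5)
    print(mean_kl(p_src, p_tgt))  # small value => the two tasks label this pool similarly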
[224001390530] |How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. [224001390540] |This is actually quite surprising, given that similar techniques have been used for both. [224001390550] |(*) I think one exception to this assumption occurs in microarray analysis in computational biology. [224001390560] |One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. [224001390570] |What is typically done in compbio is to do a fairly heavy-handed normalization of the data, usually by some complex rank-ordering and binning process. [224001390580] |A lot of information is lost in this transformation, but it at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. [224001390590] |One can think of not doing the normalization step and instead thinking of this as a DA problem. [224001390600] |However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. [224001390610] |(You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.) [224001400010] |NIPS retrospective [224001400020] |I got back from NIPS on Sunday; sadly I got sick on the last day. [224001400030] |Despite this, it was by far my most productive conference ever. [224001400040] |Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. [224001400050] |There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. [224001400060] |The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. [224001400070] |I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. [224001400080] |I'll conclude with a bit of my sense of what was in and out this year. [224001400090] |
  • Kristina Toutanova and Mark Johnson had a topic-model style paper for doing semisupervised POS tagging. [224001400100] |There are two new things here. [224001400110] |First, they actually model context (yay!) which we know is important in language. [224001400120] |Second, they introduce a notion of a confusion class (what possible POS tags can a word get) and actually model this. [224001400130] |This latter is something that makes sense in retrospect, but is not obvious to think of (IMO). [224001400140] |They actually get good results (which is non-trivial for topic models in this context).
  • [224001400150] |
  • Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. [224001400160] |If you saw their paper at ACL, they're attacking a pretty similar problem, by looking at phonological divergences between a handful of languages. [224001400170] |I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms...I'd love to use both, but I really like working with thousands of languages and it's hard to find corpora for all of these :).
  • [224001400180] |
  • John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. [224001400190] |If you care at all about domain adaptation, you should read this paper.
  • [224001400200] |
  • Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. [224001400210] |They show that things can go very wrong in widely different ways. [224001400220] |This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
  • [224001400230] |
  • Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. [224001400240] |(For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
  • [224001400250] |
  • Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel is of the form K(x,z) = f(x-z). [224001400260] |This is the case for, eg., the RBF kernel. [224001400270] |This is something that I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill-equipped. [224001400280] |Essentially they apply a Fourier analysis to such kernels and then represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. [224001400290] |The upshot is that you can now effectively do kernel learning in a linear model in primal space, which is very nice when the number of examples is large. (A small numerical sketch of this idea appears just below.)
  • [224001400300] |
  • David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. [224001400310] |What he was particularly interested in was analyzing whether standard methods of determining correlation between events (eg., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, eg., some genetic population. [224001400320] |This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. [224001400330] |Curing HIV might be a bit more of an interesting application, though :). [224001400340] |Essentially what David and his colleagues do is to build a hierarchy on top of the data points and then let this explain some of the interactions and then figure out what isn't explained by this hierarchy. [224001400350] |In the linguistic sense, this is effectively separating historical from areal divergences. [224001400360] |I'll have to read up more on what they do, precisely.
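Following up on the Rahimi and Recht entry above, here is a minimal numerical sketch of the random-features idea for the RBF kernel (the one shift-invariant kernel I'm comfortable writing down from memory); the feature count and bandwidth are arbitrary, and other kernels of the form f(x-z) require a different sampling distribution over frequencies.

    import numpy as np

    def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
        """Map X (n x d) to Z (n x n_features) so that Z @ Z.T approximates
        the RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        # Frequencies sampled from the (Gaussian) Fourier transform of the RBF kernel.
        W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
        b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    # Sanity check against the exact kernel on a few random points.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 10))
    Z = random_fourier_features(X, n_features=20000, gamma=0.5, seed=1)
    exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    print(np.abs(Z @ Z.T - exact).max())  # typically well under 0.05

Once the data is mapped through Z, any linear learner in primal space is effectively doing (approximate) kernel learning, which is the large-scale payoff.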
  • [224001400370] |There were several themes that stuck out at NIPS this year, though they aren't represented in the above list. [224001400380] |Probably the biggest one is recommender systems. [224001400390] |Just search the titles for "matrix" and you'll come up with a handful of such systems. [224001400400] |I have to believe that this is spurred, at least in part, by the NetFlix challenge. [224001400410] |Another theme was deep belief networks, which even had a rogue workshop dedicated to them. [224001400420] |I have a feeling we'll be seeing more and more of these things. [224001400430] |A final theme was randomized algorithms, particularly as applied to large scale learning. [224001400440] |I think this is a great direction, especially at NIPS, where, historically, things have been less focused on the large scale. [224001400450] |My only serious quibble with the program this year is that I saw a handful of papers (and at least one that got an oral and one a spotlight) that really fell down in terms of evaluation. [224001400460] |That is, these papers all proposed a new method for solving task X and then proceeded to solve it. [224001400470] |The experiments, however, contained only comparisons to algorithms that did not have access to all the information that X had. [224001400480] |I'll give an example of something that would fall in this category but I have not seen appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. [224001400490] |I'm not a huge quibbler over comparing against the absolute state of the art, but it really really irks me when there are no comparisons to algorithms that use the same information, especially when standard (read: they are in any reasonable machine learning textbook) algorithms exist. [224001410010] |Particle filtering versus beam search [224001410020] |I had a very interesting discussion at NIPS with Vikash Mansingka about beam search and particle filtering. [224001410030] |Much of what is below is a result of this conversation and he deserves much credit. [224001410040] |In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. [224001410050] |Just to set the stage, let's talk about generic beam search: [224001410060] |The key variant in beam search algorithms is the score function that's used. [224001410070] |The most naive is just path cost---how much did we have to pay to get to the current hypothesis? [224001410080] |The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the amount of cost remaining to complete. [224001410090] |There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. [224001410100] |We'll come back to these shortly. [224001410110] |Particle filtering (using as similar terminology to the beam search as possible), on the other hand, looks something like: [224001410120] |To map beam search to particle filtering, use negative log probabilities for scores. [224001410130] |(This turns max search into min search.) [224001410140] |In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step. 
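Since the step-by-step versions of the two procedures help make the mapping concrete, here is a minimal sketch of how I line the loops up; the expand, score, proposal and weighting functions are hypothetical placeholders, and I've labeled the particle filter's inner steps B-D so that the references to 2B and 2D below have something to point at.

    import numpy as np

    def beam_search(expand, score, initial, n_steps, beam_size):
        """Generic beam search: greedily keep the top-k partial hypotheses by score."""
        beam = [initial]                                           # 1. start from an initial hypothesis
        for _ in range(n_steps):                                   # 2. repeatedly:
            candidates = [h2 for h in beam for h2 in expand(h)]    #    B. expand every hypothesis in the beam
            candidates.sort(key=score, reverse=True)               #    C. rank by score (higher is better)
            beam = candidates[:beam_size]                          #    D. prune deterministically to the top k
        return beam

    def particle_filter(propose, log_incr_weight, initial, n_steps, n_particles, rng):
        """A bootstrap-style particle filter written to mirror the beam search above."""
        particles = [initial] * n_particles                        # 1. start with copies of the initial state
        log_w = np.zeros(n_particles)
        for _ in range(n_steps):                                   # 2. repeatedly:
            particles = [propose(p, rng) for p in particles]       #    B. *sample* extensions from a proposal
            log_w += np.array([log_incr_weight(p) for p in particles])  # C. reweight (model score vs. proposal)
            w = np.exp(log_w - log_w.max()); w /= w.sum()
            if 1.0 / np.sum(w ** 2) < n_particles / 2.0:           #    D. if the effective sample size is low,
                idx = rng.choice(n_particles, size=n_particles, p=w)
                particles = [particles[i] for i in idx]            #       resample and reset the weights
                log_w = np.zeros(n_particles)
        return particles, log_w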
[224001410150] |There are, however, some minor issues: what proposal distribution is used in step 2B in PF and what does resampling do. [224001410160] |In general, the proposal distribution should be the true posterior probability, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). [224001410170] |The most common thing to do is to use a transition probability, which corresponds exactly to using a path cost in beam search. [224001410180] |However, just as we know that using a heuristic can help in beam search, this suggests that using a proposal distribution in PF that contains something like a future cost heuristic is going to be helpful. [224001410190] |I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS. [224001410200] |What 2D in PF does is to measure whether the samples that we have in the beam have fallen out of the true posterior---in other words, are we "off the mark." [224001410210] |There's a straightforward way to compute this (see the wikipedia page linked above). [224001410220] |If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. [224001410230] |My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land. [224001410240] |It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. [224001410250] |But I tend to doubt it. [224001410260] |One other thing that I've never seen done in PF is to maintain multiple beams. [224001410270] |Admittedly this makes the resampling step harder (maybe...I haven't worked through it), but it may be helpful. [224001410280] |One thing, for instance, that we found in the PF for the coalescent was that the PF tends to like to merge clusters early with no regard for how hard life is going to be later. [224001410290] |We use resampling to try to fix this, but another way would be to use multiple beams. [224001410300] |This is commonly done in, eg., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered. [224001410310] |So what's the point? [224001410320] |I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. [224001410330] |For instance, the following seem promising: [224001410340] |
  • Use resampling techniques in beam search (one literal sketch of this appears just below)
  • [224001410350] |
  • Use multiple beams in particle filtering
  • [224001410360] |
  • Use A*-style heuristics in the proposal distribution for particle filtering
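As one very literal (and untested) reading of the first suggestion, the deterministic top-k pruning step of beam search can be replaced with sampling without replacement proportional to exponentiated scores; whether this is really the right notion of "resampling" for beam search is exactly the open question, and the temperature below is an arbitrary knob.

    import numpy as np

    def stochastic_prune(candidates, scores, beam_size, rng, temperature=1.0):
        """Prune a candidate pool to `beam_size` hypotheses by sampling without
        replacement proportional to exp(score / temperature), instead of a hard top-k."""
        scores = np.asarray(scores, dtype=float)
        w = np.exp((scores - scores.max()) / temperature)
        w /= w.sum()
        idx = rng.choice(len(candidates), size=beam_size, replace=False, p=w)
        return [candidates[i] for i in idx]

    # Toy usage with ten dummy hypotheses; a low temperature approaches ordinary top-k pruning.
    rng = np.random.default_rng(0)
    print(stochastic_prune([f"hyp{i}" for i in range(10)], rng.normal(size=10), 4, rng))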
  • [224001410370] |An alternative suggestion would be for NLPers to pay more attention to particle filtering. [224001410380] |It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use). [224001420010] |Those Darn Biologists... [224001420020] |I've recently been talking to a handful of biologists. [224001420030] |The lab/bench kind, not the computational kind. [224001420040] |As an experiment in species observation, they intrigue me greatly. [224001420050] |I'm going to stereotype, and I'm sure it doesn't hold universally, but I think it's pretty true and several (non-biologist) friends have confirmed this (one of whom is actually married to a biologist). [224001420060] |The reason they intrigue me is because they are wholly uninterested in techniques that allow you to do X better, once they already have a solution for X. [224001420070] |This is a bit of an exaggeration, but not a huge one (I feel). [224001420080] |It is perhaps best served by example. [224001420090] |Take dimensionality reduction. [224001420100] |This is a very very well studied problem in the machine learning community. [224001420110] |There are a bajillion ways to do it. [224001420120] |What do biologists use? [224001420130] |PCA. [224001420140] |Still. [224001420150] |And it's not that people have tried other things and they've failed. [224001420160] |Other things have worked. [224001420170] |Better. [224001420180] |But no one cares. [224001420190] |Everyone still uses PCA. [224001420200] |Part of this is probably psychological. [224001420210] |After using PCA for a decade, people become comfortable with it and perhaps able to anticipate a bit of what it will do. [224001420220] |Changing techniques would erase this. [224001420230] |Part of this is probably cultural. [224001420240] |If you're trying to publish a bio paper (i.e., a paper in a bio journal) and you're doing dimensionality reduction and you aren't using PCA, you're probably going to really have to convince the reviewers that there's a good reason not to use PCA and that PCA just plain wouldn't work. [224001420250] |Now, that's not to say that biologists are luddites. [224001420260] |They seem perfectly happy to accept new solutions to new problems. [224001420270] |For instance, still in the dimensionality reduction setting, non-negative matrix factorization techniques are starting to show their heads in microarray settings. [224001420280] |But it's not because they're better at solving the same old problem. [224001420290] |What was observed was that if you have not just one, but a handful of microarray data sets, taken under slightly different experimental conditions, then the variance captured by PCA is the variance across experiments, not the variance that you actually care about. [224001420300] |It has been (empirically) observed that non-negative matrix factorization techniques don't suffer from this problem. [224001420310] |Note that it's only fairly recently that we've had such a surplus of microarray data that this has even become an issue. [224001420320] |So here we have an example of a new problem needing a new solution. [224001420330] |Let's contrast this with the *ACL modus operandi. [224001420340] |There were 132 papers in ACL 2007. [224001420350] |How many of them do you think introduced a new problem? [224001420360] |Now, I know I'm being a bit unfair. 
[224001420370] |I'm considering dimensionality reduction across different platforms (or experimental conditions) as a different problem from dimensionality reduction on a single platform. [224001420380] |You could argue that these really are the same problem. [224001420390] |I just kind of disagree. [224001420400] |Just as biologists tend to be wary of new techniques, so we tend to be wary of new problems. [224001420410] |I think anyone who has worked on a new problem and tried to get it published in a *ACL has probably gone through an experience that left a few scars. [224001420420] |I know I have. [224001420430] |It's just plain hard to introduce a new problem. [224001420440] |The issue is that if you work on an age-old problem, then all you have to do is introduce a new technique and then show that it works better than some old technique. [224001420450] |And convince the reviewers that the technique is interesting (though this can even be done away with if the "works better" part is replaced with "works a lot better"). [224001420460] |On the other hand, if you work on a new problem, you have to do a lot. [224001420470] |(1) You have to introduce the new problem and convince us that it is interesting. [224001420480] |(2) You have to introduce a new evaluation metric and convince us that it is reasonable. [224001420490] |(3) You have to introduce a new technique and show that it's better than whatever existed before. [224001420500] |The problem is that (1) is highly subjective, so a disinterested reviewer can easily kill such a paper with a "this problem isn't interesting" remark. [224001420510] |(2) is almost impossible to do right, especially the first time around. [224001420520] |Again, this is something that's easy to pick on as a reviewer and instantly kill a paper. [224001420530] |One might argue that (3) is either irrelevant (being a new problem, there isn't anything that existed before) or no harder than in the "age old problem" case. [224001420540] |I've actually found this not to be true. [224001420550] |If you work on an age old problem, everyone knows what the right baseline is. [224001420560] |If you work on a new problem, not only do you have to come up with your own technique, but you also have to come up with reasonable baselines. [224001420570] |I've gotten (on multiple occasions) reviews for "new problems" papers along the lines of "what about a baseline that does XXX." [224001420580] |The irony is that while XXX is often creative and very interesting, it's not really a baseline and would actually probably warrant its own paper! [224001420590] |That said, I didn't write this post to complain about reviewing, but rather to point out a marked difference between how our community works and how another community (that is also successful) works. [224001420600] |Personally, I don't think either is completely healthy. [224001420610] |I feel like there is merit in figuring out how to do something better. [224001420620] |But I also feel that there is merit in figuring out what new problems are and how to solve them. [224001420630] |The key problem with the latter is that the common reviewer complaints I cited above (problem not interesting or metric not useful) often are real issues with "new problem" papers. [224001420640] |And I'm not claiming that my cases are exceptions.
[224001420650] |And once you accept a "new problem" paper on a problem that really isn't interesting, you've opened Pandora's box and you can expect to see a handful of copycat papers next year (I could, but won't, give a handful of examples of this happening in the past 5 years). [224001420660] |And it's really hard to reject the second paper on a topic as being uninteresting, given that the precedent has been set. [224001430010] |Syntax-based and syntax-informed translation [224001430020] |Happy new year, all... [224001430030] |Apologies for being remiss about posting recently. [224001430040] |(Can I say "apologies about my remission"?) [224001430050] |This post is a bit more of a review of what's out there, rather than trying out new ideas. [224001430060] |Citations are not meant to get anywhere close to full coverage. [224001430070] |There are a handful of ways of viewing the syntactic MT problem, essentially delineated by which side(s) of the translation the tree appears on, and whether the tree is induced entirely from data or whether it's induced from a treebank. [224001430080] |There's additionally the constituent- versus dependency-tree issue. [224001430090] |
  • String-to-tree (constituent). [224001430100] |Here, we have a string on the input and a tree on the output. [224001430110] |In Chinese-to-English translation, this would mean that at translation time, we're essentially trying to parse Chinese into an English tree by means of local reordering. [224001430120] |This was essentially what Yamada and Knight were doing back in 2001 and 2002. [224001430130] |In order to train such a model, we need (a) parallel data and (b) a parser on the target language. [224001430140] |Translation is typically done by some variant of CKY.
  • [224001430150] |
  • Tree-to-string (constituent). [224001430160] |Here, we map an input tree to an output string. [224001430170] |In C2E, this means that at translation time, we first parse the Chinese, then reorder and flatten it out into an English string. [224001430180] |This is essentially what the JHU workshop on adding syntactic features to phrase-based translation was trying to do (though in somewhat of a weak way), and also what Liang Huang has been doing lately. [224001430190] |In order to train, we need (a) parallel data and (b) a parser on the source side. [224001430200] |Translation can be done any way you want, though once parsing is done, it's usually a lot easier than running CKY. (A toy sketch of this parse-reorder-flatten pipeline appears just below.)
  • [224001430210] |
  • Tree-to-tree (constituent). [224001430220] |Here, we map an input tree to an output tree. [224001430230] |In C2E, this means that at translation time, we first parse the Chinese, then reorder the tree and translate subtrees/leaves into English. [224001430240] |Koehn and Collins have worked on this problem. [224001430250] |In order to train, we need (a) parallel data, (b) a parser on the source side and (c) a parser on the target side. [224001430260] |Translation is similar to Tree-to-string.
  • [224001430270] |
  • String-to-tree-to-string (constituent). [224001430280] |I'm not sure if there's agreed-upon terminology for this task, but the idea here is to translate without having a parser on either side. [224001430290] |At translation time, we essentially parse and translate simultaneously (rather like string-to-tree). [224001430300] |But at training time, we don't have access to source trees: we have to induce them from just (a) parallel data. [224001430310] |This is typified by Wu's inverse transduction grammars and more recently by David Chiang's Hiero system (plus others).
  • [224001430320] |
  • Any of the above but on dependency trees. [224001430330] |To my knowledge, the only groups that are seriously working on this are Microsoft (see, eg., Chris Quirk) and our friends at Charles University.
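Following up on the tree-to-string entry above, here is a toy sketch of the parse-reorder-flatten-translate pipeline for going out of English; the tree, the reordering rule and the dictionary are all invented (a fake SOV, case-marking target language), and real systems are of course enormously more involved.

    # A parse tree is (label, children...) for internal nodes, or a plain word string at a leaf.
    english_tree = ("S",
        ("NP", "john"),
        ("VP", ("V", "ate"), ("NP", ("DET", "an"), ("N", "apple"))))

    # Toy reordering rules keyed on (parent label, child label sequence): make VPs head-final.
    reorder_rules = {("VP", ("V", "NP")): (1, 0)}

    # Toy lexical dictionary into the hypothetical target language (articles are dropped).
    dictionary = {"john": "john-NOM", "ate": "eat-PAST", "an": "", "apple": "apple-ACC"}

    def reorder(tree):
        if isinstance(tree, str):
            return tree
        children = [reorder(c) for c in tree[1:]]
        labels = tuple(c[0] if not isinstance(c, str) else "LEX" for c in children)
        order = reorder_rules.get((tree[0], labels), tuple(range(len(children))))
        return (tree[0],) + tuple(children[i] for i in order)

    def flatten(tree):
        return [tree] if isinstance(tree, str) else [w for c in tree[1:] for w in flatten(c)]

    def translate(tree):
        words = [dictionary.get(w, w) for w in flatten(reorder(tree))]
        return " ".join(w for w in words if w)

    print(translate(english_tree))  # john-NOM apple-ACC eat-PAST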
  • [224001430340] |Below is my attempt to categorize the pros and cons of these different approaches along a wide variety of metrics. [224001430350] |The metrics are: resource usage (how much labeled data do we need), training ease (computationally), testing ease (computationally), ngram LM (is it easy to include a target ngram language model), syntactic LM (is it easy to include a target syntactic language model), minimal assumptions (does the model make strong assumptions about, for instance, the degree of matchy-ness between source and target?), extensibility (is it easy to add additional features to the input side, such as NE info without breaking things). [224001430360] |I didn't include dependencies on this list because it essentially depends on what the underlying model is, and then remains the same as in the constituent case. [224001430370] |The only caveat here is that syntactic language modeling with dependency trees is not as well proven as with constituency trees. [224001430380] |As far as syntactic language modeling goes, pretty much this is easy if you are producing trees on the target; otherwise, not. [224001430390] |Tree-to-tree gets the unique distinction of being bad with respect to assumptions, essentially because you have to do tree-matching which is hard to do well. [224001430400] |You could argue that certain forms of s2t2s should be here too (Wellington, Turian and Melamed had a paper on the assumptions that ITG makes at ACL 2006). [224001430410] |In terms of resource usage, T2T is obviously the worst, S2T2S is obviously the best. [224001430420] |Whether S2T or T2s is better depends on the language pair. [224001430430] |There's of course the issue of which work best. [224001430440] |For this, I don't really know (I'm sure people involved in MTeval and GALE can chime in here, though of course these competitions are always "unfair" in the sense that they also measure engineering ability). [224001430450] |At this point, I'd like to relate this back to the recent discussion of Translation out of English. [224001430460] |One thing to notice is that when translating into English, S2T has the advantage that all we need is an English parser (there are a few of those lying around if you search hard enough) whereas T2S requires a foreign parser. [224001430470] |However, when translating from English, T2S starts looking a lot more attractive because we have good English parsers. [224001430480] |Actually, we have a lot of good annotations for English (NE coref, for instance; discourse parsing to some degree; etc.). [224001430490] |From this perspective, the relative ease of extensibility of T2S models begins to look pretty attractive. [224001430500] |That is to say, if I were actually to work myself on translation out of English, I would probably seriously consider T2S models. [224001430510] |Of course, if you're translating from X to Y, where neither X nor Y is English, then you're kind of hosed and might prefer S2T2S. [224001440010] |We're hiring machine learning/robotics... [224001440020] |Just a quick note because this is a bit off-topic for this blog, but I wanted to let you all know that this year, we're trying to hire a machine learning/robotics person. [224001440030] |If you do machine learning that in any way is robotics related, I'd encourage you to apply. [224001440040] |(I suppose same holds the other way around, but I probably wouldn't be quite as excited.) 
[224001440050] |The "deadline for full consideration" is listed as Jan 15, which has passed, but if you apply by the end of the month, things would be fine. [224001440060] |In the interest of full disclosure, we're also hiring in graphics, systems and sensor networks. [224001450010] |An NLPer in the Algorithmists Court [224001450020] |I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. [224001450030] |Moreover, it was in SF which is a short cheap flight and admits free housing by crashing with one of my very close friends. [224001450040] |My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. [224001450050] |This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-pollination in the next few years. [224001450060] |I'll try to get around at some point to post about things I learned in the seminar. [224001450070] |Some are more ML-related, some more NLP. [224001450080] |Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. [224001450090] |(Sadly, I had to miss the last day because I teach early Tuesday mornings.) [224001450100] |Below are a few of the papers I saw that I liked enough to put them on my "to read" list. [224001450110] |I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. [224001450120] |If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. [224001450130] |I also warn you guys that most of these things really have nothing to do with NLP. [224001450140] |The first invited talk was by Bonnie Berger, talking about computational biology. [224001450150] |Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphs, etc. [224001450160] |There are some surface connections to NLP (eg., some other people use stochastic CFGs to do structure prediction, though she does not). [224001450170] |The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). [224001450180] |The rationale is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. [224001450190] |I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future. [224001450200] |The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). [224001450210] |At a very high level, his talk was a bit of evangelizing functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). [224001450220] |The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. [224001450230] |I confess I learned a lot more about shuffling (which seems to be a bit of a love for Persi), but it was still pretty interesting.
[224001450240] |Not enough for me to learn functional combinatorics, but still interesting. [224001450250] |One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. [224001450260] |Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K << D dimensions while approximately preserving pairwise Euclidean distances; what Ailon and Liberty show is how to make such an embedding fast to apply. Another paper I liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. [224001450330] |The problem they are tackling is trying to determine whether two variables are correlated, when the variables come in pairs in a stream. [224001450340] |(For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) [224001450350] |The two quantities that are relevant are the joint distribution over the pairs, and the product-of-marginals distribution over the pairs. [224001450360] |If these are very similar, the two items in the pair are likely to be independent. [224001450370] |They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. [224001450380] |For l2, they have a very clever analysis that is very straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). [224001450390] |They are able to get pretty good results. [224001450400] |They have similar results for the other metrics. [224001450410] |Again, I can already think of a handful of possible applications of this stuff. [224001450420] |It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M-most-correlated pairs? [224001450430] |I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. [224001450440] |This would have immediate application to Bayes net structure learning, among other things. [224001450450] |Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. [224001450460] |This paper is a bit harder for me to get across (much less understand!). [224001450470] |At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomized construction (and then prove that it happens with high probability). [224001450480] |Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). [224001450490] |The other plus of this method is that the embedding matrix is sparse, which means that the images of the vectors under the embedding are sparse. [224001450500] |There are also interesting connections to compressed sensing, which are mentioned in the paper. [224001450510] |There were other papers that I liked, but that I'm not going to try to summarize. [224001450520] |Timothy Chan had an impressive (if somewhat ugly) result On the bichromatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo).
[224001450530] |Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). [224001450540] |Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they basically are extending. [224001450550] |Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. [224001450560] |I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. [224001450570] |But I suppose that's what this 1300-page proceedings they dumped on me is for. [224001450580] |In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting) and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. [224001450590] |In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, then you get a balanced distribution. [224001450600] |This is in contrast to typical Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes where you get power law distributions. [224001450610] |(The "power of two" reference is to hashing strategies where you hash by two independent hash functions and insert into the lesser occupied of the two bins.) [224001450620] |(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.) [224001470010] |Kernels, distances and strings [224001470020] |Kernels and distances are closely related. [224001470030] |Given a kernel K(x,z), this induces an RKHS H such that K(x,z) = <f(x), f(z)>, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||f(x)||^2 = K(x,x). [224001470040] |This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x)+K(z,z)-2K(x,z). [224001470050] |On the other hand, we also know how to turn distance metrics into kernels. [224001470060] |If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X. [224001470070] |Now, there are three questions that arise. [224001470080] |The first is: in the distance->kernel setting, what does the RKHS "look like"? [224001470090] |I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. [224001470100] |In practice, this "some point" is going to be one of the training points. [224001470110] |The second question follows from the fact that we can iterate this construction.
[224001470120] |One way to ask this is: if we use a kernel to induce a distance, then use this distance to introduce a new kernel, what does this new kernel look like. [224001470130] |Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like. [224001470140] |I don't have a good intuitive sense for the answer to any of these. [224001470150] |Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on. [224001470160] |The final question is a bit "off topic" from the above two, but still relevant. [224001470170] |There's been a lot of work in kernels for discrete data, like strings. [224001470180] |The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. [224001470190] |It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance is not a valid kernel. [224001470200] |However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)) should be a valid kernel. [224001470210] |Yet I'm not aware of this being used anywhere. [224001470220] |Is there any reason why? [224001470230] |The reason I ask is because it's a much more intuitive than the subsequence kernel, which I only have a vague sense about. [224001480010] |ACL 2008 Workshops and Tutorials [224001480020] |The list of workshops and tutorials for ACL is now up. [224001480030] |There are actually two separate workshops on MT! [224001480040] |I'm particularly excited to see the that BioNLP is continuing. [224001490010] |ACL papers up [224001490020] |As Chris Brew pointed out, ACL accepts have been posted. [224001490030] |In keeping with tradition, here are some of the top key words (stemmed, with stop words, including things like "model" or "based" removed): [224001490040] |
  • translat (16)
  • [224001490050] |
  • learn (14)
  • [224001490060] |
  • word (12)
  • [224001490070] |
  • gener (12)
  • [224001490080] |
  • evalu (10)
  • [224001490090] |
  • unsupervis (9)
  • [224001490100] |
  • pars (9)
  • [224001490110] |
  • machin (9)
  • [224001490120] |
  • phrase (8)
  • [224001490130] |
  • languag (8)
  • [224001490140] |
  • grammar (8)
  • [224001490150] |
  • depend (8)
  • [224001490160] |
  • automat (8)
  • [224001490170] |
  • segment (7)
  • [224001490180] |
  • web (6)
  • [224001490190] |
  • text (6)
  • [224001490200] |
  • supervis (6)
  • [224001490210] |
  • semant (6)
  • [224001490220] |
  • question (6)
  • [224001490230] |
  • featur (6)
  • [224001490240] |I'm happy to say that a cursory glance of the list includes at least a handful of topics that I don't consider the normal ACL fodder. [224001490250] |I'm also happy to say that, as chair for the summarization track, we had a number of excellent summarization papers this year. [224001500010] |What to do with a million summaries? [224001500020] |Let's pretend. [224001500030] |Let's pretend that someone gave you one million document/summary pairs. [224001500040] |If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document. [224001500050] |For those of us who work on summarization, this seems like it would be a pretty cool gift. [224001500060] |Most of the corpora we're used to using have, at best, a few hundred such pairs. [224001500070] |Some (eg., for headline generation) have more, but then I didn't allow you to pretend that these were headlines (I can get you billions of these, if you really want them). [224001500080] |So here's the question I pose: what would you do with a million summaries? [224001500090] |I actually have a hard time answering this question. [224001500100] |Sure, we have whatever our favorite sentence extraction method is. [224001500110] |If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to query (if a query exists), similarity to document centroid, etc. [224001500120] |This would probably be optimized against some automatic measure like one of the many flavors of Rouge. [224001500130] |Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable and we have some results that suggest that we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. [224001500140] |Here, we see that we're pretty much matching oracle performance at sentence extraction on DUC04 data when evaluated using Rouge(I've independently seen other people present similar results, so I think it's replicable). [224001500150] |(Yes, I'm aware there are caveats here: the use of Rouge, the fixation to one corpus, etc.) [224001500160] |But now we have a million doc/sum pairs. [224001500170] |Fitting a dozen parameters on a million examples seems a bit wasteful. [224001500180] |It also seems a bit wasteful to continue to focus on sentence extraction in this case. [224001500190] |Why? [224001500200] |Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). [224001500210] |Second, I have a seriously hard time thinking of too many more high level features that I could possibly tune (my best entry ever into DUC---I think we came in second or third, depending on the scoring---had about 25, many of which ended up getting very very small weights). [224001500220] |So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. [224001500230] |It turns out that somebody has already tried this, with a reasonable amount of success. [224001500240] |In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal with words; probably also consider throwing out the HMM issue and do something akin to Model 1. [224001500250] |The key difference is that I would continue to include the additional features, like stem-identity, WordNet, etc. [224001500260] |I might also throw in some word clustering just for fun. 
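To be concrete about the "something akin to Model 1" option, here is a minimal EM sketch for Model 1-style word alignment between documents and their summaries, without any of the extra features (stem identity, WordNet, word clusters) mentioned above; the two document/summary pairs are made up and far too small to learn anything interesting.

    from collections import defaultdict

    # Toy (document, summary) pairs; the summary is modeled as generated from the
    # document's words plus a NULL token.
    pairs = [("the senate passed the budget bill".split(), "senate passes budget".split()),
             ("the bill funds the military budget".split(), "bill funds budget".split())]

    def model1(pairs, iterations=20):
        t = defaultdict(lambda: 1.0)   # t[(s, d)] = P(summary word s | document word d), uniform start
        for _ in range(iterations):
            count, total = defaultdict(float), defaultdict(float)
            for doc, summ in pairs:
                doc = doc + ["<NULL>"]
                for s in summ:                               # E-step: expected alignment counts
                    z = sum(t[(s, d)] for d in doc)
                    for d in doc:
                        count[(s, d)] += t[(s, d)] / z
                        total[d] += t[(s, d)] / z
            t = defaultdict(lambda: 1.0,                     # M-step: renormalize per document word
                            {(s, d): c / total[d] for (s, d), c in count.items()})
        return t

    t = model1(pairs)
    ranked = sorted(((round(t[("budget", d)], 3), d) for d in set(pairs[0][0]) | {"<NULL>"}), reverse=True)
    print(ranked)   # which document words the summary word "budget" ends up preferring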
[224001500270] |So let's say that the alignments worked. [224001500280] |Now what? [224001500290] |We could decode, of course, by intersecting the learned alignment model with a language model. [224001500300] |I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments. [224001500310] |So now we've got a million document/summary pairs that have been aligned. [224001500320] |What now? [224001500330] |You could say "learn to create abstracts", but I'm actually not particularly thrilled with this idea, either. [224001500340] |(Why? [224001500350] |Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. [224001500360] |What this means is that they do a lot of word copying, at least up to the stem. [224001500370] |If you look in the alignments paper, there are some results that say that over half of the words in the summary are aligned to identical words (stems) in the document, even with a human doing the alignment. [224001500380] |What this means is that if you're using a lexically-based scoring method, like Rouge, odds are against you if you ever change a word because chances are the human writing the summary didn't change it.) [224001500390] |You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. [224001500400] |But I also don't think we have a really good understanding of this. [224001500410] |In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow and I don't think it's really scalable to 1 million doc/sum pairs. [224001500420] |But I suppose I would probably start with something like that. [224001500430] |But that's what I would do. [224001500440] |What would you do? [224001500450] |(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." [224001500460] |I have about 120,000. [224001500470] |But there are some complications with this data. [224001500480] |Maybe I'll post about it in the near future.) [224001520010] |N-best lists and duplication [224001520020] |One thing that continually irks me about (some varieties of) n-best lists is duplicates. [224001520030] |Duplicates typically arise due to an "arg max" approximation in a structured search process. [224001520040] |In other words, we have some model that decomposes into p(out | in) = \sum_hid p(out, hid | in). [224001520050] |Here, "in" is the input, "out" is the desired output and "hid" is some hidden/latent variable. [224001520060] |For example, in phrase-based MT, "in" is French, "out" is English and "hid" might be the phrasal segmentation. [224001520070] |The problem that arises is that we typically do not compute p(out | in) by summing over hidden states; instead, we try to find the "out"/"hid" pair that maximizes the joint probability p(out, hid | in). [224001520080] |Now, the first problem with this is that we're not actually after the out/hid pair that maximizes the joint; we're after the out that maximizes the marginal. [224001520090] |But this is largely a separate issue. [224001520100] |The issue that I care about is that when we look at an N-best list, we get the top N out/hid pairs, rather than the top N outputs. [224001520110] |This leads to "duplication" in the N-best list, where many of the top N are actually the same output, but with different latent variables. [224001520120] |One might ask: is this a big problem? [224001520130] |I suspect so.
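To make the duplication issue concrete, here is a minimal sketch of collapsing a single n-best list of output/hidden pairs into unique outputs, approximately marginalizing over the hidden variable by log-sum-exp-ing the scores of duplicates. The length of the result is the "number of unique outputs" I count below (the actual bookkeeping in the experiment may have differed slightly).

```python
# Collapse one n-best list of (output, hidden, log_prob) entries into unique
# outputs, summing the probabilities of entries that share the same output.
import math
from collections import defaultdict

def collapse_nbest(nbest):
    marg = defaultdict(lambda: float("-inf"))  # log P(out), summed over hid
    for out, _hid, logp in nbest:
        a = marg[out]
        m = max(a, logp)
        # log-sum-exp accumulation: log(exp(a) + exp(logp))
        marg[out] = m + math.log(math.exp(a - m) + math.exp(logp - m))
    # highest (approximately) marginalized score first
    return sorted(marg.items(), key=lambda kv: -kv[1])
```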
[224001520140] |Below is a figure that shows how many unique outputs we get in a top 5000 list from phrase-based translation (French to English). [224001520150] |There are 2000 sentences being translated (along the x-axis) and each produces some number of unique outputs (along the y-axis). [224001520160] |The sentences are sorted by the number of uniques they produce. [224001520170] |So the ones on the left produce almost all 5000 unique outputs, but the ones on the right produce only one or two unique outputs. [224001520180] |(Both axes are log-scaled.) [224001520190] |What we observe is that the median number of unique outputs in the top 5000 list is only 56! [224001520200] |The mean is only 178. [224001520210] |Over 10% of them have ten or fewer uniques. [224001520220] |Only six have over 2500 uniques. [224001520230] |This does not bode well for any system that intends to pipeline the N-best list into some other module (whether it is a straightforward reranker or something more complex; for instance, like a summarization system or search system). [224001520240] |One may ask: does the number of unique outputs correlate with the input sentence length? [224001520250] |The answer is: sort of. [224001520260] |The following figure plots input sentence length along the x-axis and number of unique translations along the y-axis. [224001520270] |We see that there is some weak correlation. [224001520280] |Pretty much all of the examples that lead to large numbers of uniques are short sentences (length 10 or less). [224001520290] |"Average" sentences of length 30-50 seem to produce somewhere around 250 uniques, and long sentences (>100) tend to produce only a handful of outputs. [224001520300] |Since average sentences are, in many ways, the most interesting, this is a bit frustrating. [224001520310] |I don't particularly care about getting lots of translations for 10-word-long sentences because they're kind of boring anyway. [224001520320] |So what can be done about this? [224001520330] |Here are a few things that come to mind, though I'm not sure any is a complete solution: [224001520340] |
  • Generate more than 5000 total and hope to get more than 50 unique. [224001520350] |Sure, this works (at the expense of computation and memory), but a 1% return is really quite bad. [224001520360] |If we could even get, say, a 10% or 20% return I would be much much happier.
  • [224001520370] |
  • Instead of using n-best lists, generate samples from the posterior (here, I assume that marginalization is still too much to ask for). [224001520380] |Probably you'd want to take the 1-best as well, since there's no guarantee that the MAP would show up in a finite sample. [224001520390] |I'm also not sure we know how to do this (efficiently) for arbitrary models.
  • [224001520400] |
  • Try to optimize the n-best list for diversity, instead of simply optimality. [224001520410] |You can write this down, but it seems hard to implement and may be computationally less pleasant than just generating 50,000 total.
  • [224001520420] |
  • Improve hypothesis recombination inside the decoder. [224001520430] |Here, my worry would be that it wouldn't be until the final step when we could combine everything, in which case this boils down to the "generate 50,000" approach.
  • [224001520440] |I'm sure there are other ways, but these are what come to mind. [224001520450] |How do other people deal with this? [224001520460] |What size n-best lists are popular? [224001530010] |Should NAACL and ICML co-locate? [224001530020] |It's been a while since we've had a survey, so here's a new one. [224001530030] |The question is: should NAACL and ICML attempt to co-locate in the near future (probably no sooner than 2010, since planning for 2009 is already fairly well established)? [224001530040] |The reason the question is NAACL-specific and not, eg., ACL or EMNLP, is because I don't have any control over what ACL or EMNLP does. [224001530050] |(For that matter, I don't have any control over what ICML does either.) [224001550010] |ICWSM Report [224001550020] |(Guest Post by Kevin Duh -- Thanks, Kevin!!!) [224001550030] |I recently attended ICWSM (International Conference on Weblogs and Social Media), which consisted of an interesting mix of researchers from NLP, Data Mining, Psychology, Sociology, and Information Sciences. [224001550040] |Social media (which, defined generally, can include blogs, newsgroups, and online communities like facebook, flickr, youtube, del.icio.us) now accounts for the majority of content produced and consumed on the Web. [224001550050] |As the area grows in importance, people are getting really interested in finding ways to better understand the phenomenon and to better build applications on top of it. [224001550060] |This conference, the second in the series, has nearly 200 participants this year. [224001550070] |I think this is a rewarding area for NLPers and MLers to test their wits on: there are many interesting applications and open problems. [224001550080] |In the following, I'll pick out some papers, just to give a flavor of the range of work in this area. [224001550090] |For a full list of papers, see the conference program. [224001550100] |Most papers are available online (do a search); some are linked from the conference blog. [224001550110] |Interesting new applications: [224001550120] |1) International sentiment analysis for News and Blogs -- M. Bautin, L. Vijayarenu, S. Skiena (StonyBrook) Suppose you want to monitor the sentiment of particular named entities (e.g. Bush, Putin) on news and blogs across different countries for comparison. [224001550130] |This may be useful for, e.g., political scientists analyzing global reactions to the same event. [224001550140] |There are two approaches: One is to apply sentiment analyzers trained in different languages; Another is to apply machine translation on foreign text, then apply an English sentiment analyzer. [224001550150] |Their approach is the latter (using an off-the-shelf MT engine). [224001550160] |Their system generates very-fun-to-watch "heat maps" of named entities that are popular/unpopular across the globe. [224001550170] |I think this paper opens up a host of interesting questions for NLPers: Is sentiment polarity something that can be translated across languages? [224001550180] |How would one modify an MT system for this particular task? [224001550190] |Is it more effective to apply MT, or to build multilingual sentiment analyzers? [224001550200] |2) Recovering Implicit Thread Structure in Newsgroup Style Conversations, by Y-C. [224001550210] |Wang, M. Joshi, C. Rose, W. Cohen (CMU) Internet newsgroups can be quite messy in terms of conversation structure. [224001550220] |One long thread can actually represent different conversations among multiple parties.
[224001550230] |This work aims to use natural language cues to tease apart the conversations of a newsgroup thread. [224001550240] |Their output is a conversation graph that shows the series of post-replies in a more coherent manner. [224001550250] |3) BLEWS: Using blogs to provide context for news articles -- M. Gamon, S. Basu, D. Belenko, D. Fisher, M. Hurst, C. Konig (Microsoft) Every news article has its bias (e.g. liberal vs. conservative). [224001550260] |A reader who wishes to be well-educated on an issue should ideally peruse articles on all sides of the spectrum. [224001550270] |This paper presents a system that aids the reader in quickly understanding the political leaning (and emotional charge) of an article. [224001550280] |It does so by basically looking at how many conservative vs. liberal blogs link to a news article. [224001550290] |I think this paper is a good example of how one can creatively combine a few existing technologies (NLP, visualization, link analysis) to produce an application that has a lot of value-added. [224001550300] |Methods and algorithms adapted for social media data: [224001550310] |4) Document representation and query expansion models for blog recommendation -- J. Arguello, J. Elsas, J. Callan, J. Carbonell (CMU) This is an information retrieval paper, where the goal is to retrieve blogs relevant to a user query. [224001550320] |This is arguably a harder problem than traditional webpage retrieval, since blogs are composed of many posts, and they can be on slightly different topics. [224001550330] |The paper adopts a language modeling approach and asks the question: should we model blogs at the blog-level, or at the post-level? [224001550340] |They also explored what kind of query expansion would work for blog retrieval. [224001550350] |This paper is a nice example of how one can apply traditional methods to a new problem, and then discover a whole range of interesting and new research problems due to domain differences. [224001550360] |Understanding and analyzing social communities: [224001550370] |5) Wikipedian Self-governance in action: Motivating the policy-lens -- I. Beschastnikh, T. Kriplean, D. McDonald (UW) [Best paper award] Wikipedia is an example of self-governance, where participant editors discuss/argue about what should and can be edited. [224001550380] |Over the years, a number of community-generated policies and guidelines have formed. [224001550390] |These include policies such as "all sources need to be verified" and "no original research should be included in Wikipedia". [224001550400] |Policies are themselves subject to modification, and they are often used as justification by different editors under different perspectives. [224001550410] |How are these policies used in practice? [224001550420] |Are they being used by knowledgeable Wikipedian "lawyers" or administrators at the expense of everyday editors? [224001550430] |This paper analyzes the Talk pages of Wikipedia to see how policies are used and draws some very interesting observations about the evolution of Wikipedia. [224001550440] |6) Understanding the efficiency of social tagging systems using information theory -- E. Chi, T. Mytkowicz (PARC) Social communities such as del.icio.us allow users to tag webpages with arbitrary terms; how efficient is this evolving vocabulary of tags for categorizing the webpage of interest? [224001550450] |Is there a way to measure whether a social community is "doing well"?
[224001550460] |This paper looks at this problem with the tools of information theory. [224001550470] |For example, they compute the conditional entropy of documents given tags H(doc|tag) over time and observe that the efficiency is actually decreasing as popular tags are becoming overused. [224001550480] |Overall, I see three general directions of research for an NLPer in this field: The first approach focuses on building novel web applications that require NLP as a sub-component for the value-added. [224001550490] |NLPers in industry or large research groups are well-suited to build these applications; this is where start-ups may spring up. [224001550500] |The second approach is more technical: it focuses on how to adapt existing NLP techniques to new data such as blogs and social media. [224001550510] |This is a great area for individual researchers and grad student projects, since the task is challenging but clearly-defined: beat the baseline (old NLP technique) by introducing novel modifications, new features and models. [224001550520] |Success in this space may be picked up by the groups that build the large applications. The third avenue of research, which is less examined (as far as I know), is to apply NLP to help analyze social phenomena. [224001550530] |The Web provides an incredible record of human artifacts. [224001550540] |If we can study all that is said and written on the web, we can really understand a lot about social systems and human behavior. [224001550550] |I don't know when NLP technology will be ready, but I think it would be really cool to use NLP to study language for language's sake, and more importantly, to study language in its social context--perhaps we could call that "Social Computational Linguistics". [224001550560] |I imagine this area of research will require collaboration with the social scientists; it is not yet clear what NLP technology is needed in this space, but papers (5) and (6) above may be a good place to start. [224001560010] |A Bit of Levity [224001560020] |Some out there might find this amusing (if you know me, you'll know why). [224001560030] |Semester's over; back to regular blogging soon! [224001580010] |Adaptation versus adaptability [224001580020] |Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain; do the following. [224001580030] |Produce a classifier (say) that has low expected loss on new data drawn from the target domain. [224001580040] |(For clarity: we typically assume that it is the data distribution that changes between domains, not the task; that would be standard multi-task learning.) [224001580050] |Obviously I think this is a fun problem (I publish on it, and blog about it reasonably frequently). [224001580060] |It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist). [224001580070] |One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. [224001580080] |In particular, I can imagine many ways to tweak it: [224001580090] |
  • Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
  • [224001580100] |
  • Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
  • [224001580110] |
  • Continuing (2), perhaps it might not even come from one of the domains we've seen before!
  • [224001580120] |
  • Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." [224001580130] |Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
  • [224001580140] |
  • Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.
  • [224001580150] |I am referring to the union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former that it's more online). [224001580160] |All of these points beg the question that I often try to ask myself when defining problems: who is the client? [224001580170] |Here's one possible client: Google (or Yahoo or MSN or whatever). [224001580180] |Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. [224001580190] |In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." [224001580200] |It seems like for this client, we must be at (2) or (3) and not (1). [224001580210] |There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). [224001580220] |The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." [224001580230] |(This is a bit more like transduction than induction.) [224001580240] |However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. [224001580250] |However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before! [224001580260] |(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. [224001580270] |For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. [224001580280] |However, there are markedly different types of documents in there. [224001580290] |Some are "news." [224001580300] |Some are lists of stock prices and their recent changes. [224001580310] |One thing I tried in the Frustratingly Easy paper but that didn't work is the following. [224001580320] |Take the WSJ section of the Treebank and run document clustering on it (using bag of words). [224001580330] |Treat each of these clusters as a separate domain and then do adaptation. [224001580340] |This introduces the "when I get a test example I have to classify it into a domain" issue. [224001580350] |At the end of the day, I couldn't get this to work. [224001580360] |(Perhaps BOW is the wrong clustering representation? [224001580370] |Perhaps a flat clustering is a bad idea?) [224001580380] |Which brings us to the final question (5): what really is a domain? [224001580390] |One alternative way to ask this is: given data partitioned into two bags, are these two bags drawn from the same underlying distribution? [224001580400] |Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. [224001580410] |However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." [224001580420] |Furthermore, if you're just handed a gigabyte of data, you're not going to want to try every possible split into subdomains.
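For what it's worth, here is a minimal sketch of that clustering experiment using scikit-learn; the TF-IDF representation, plain k-means, and the choice of k are assumptions made for the sketch, not necessarily what I actually ran.

```python
# Cluster training documents by bag-of-words and treat each cluster as a
# "pseudo-domain"; test documents get assigned to their nearest cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def split_into_pseudo_domains(train_docs, test_docs, k=5):
    """train_docs, test_docs: lists of raw document strings.
    Returns a 'domain' id for every training and test document."""
    vec = TfidfVectorizer(max_features=20000, stop_words="english")
    X_train = vec.fit_transform(train_docs)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    train_domains = km.labels_                            # cluster id per training doc
    test_domains = km.predict(vec.transform(test_docs))   # nearest cluster per test doc
    return train_domains, test_domains
```

Each cluster id then plays the role of a "domain" label for whatever adaptation method you like, and the test-time assignment step is exactly the "which domain does this test example belong to" problem mentioned above.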
[224001580430] |One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently). [224001590010] |Continuing bad ideas [224001590020] |I occasionally--more than I would like--run into the following problem. [224001590030] |I am working on some task for which there is prior work (shocking!). [224001590040] |I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. [224001590050] |Typically this involves comparing fairly directly against said prior work. [224001590060] |Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose. [224001590070] |Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to them. [224001590080] |Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work. [224001590090] |The problem is: what if they did it wrong (in my opinion)? [224001590100] |There are many ways of doing things wrong and it would be hard for me to talk about specifics without pointing fingers. [224001590110] |But just for concreteness, let's take the following case relating to the final example in the above paragraph. [224001590120] |Say that we're doing POS tagging and the only prior POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%. [224001590130] |This is (fairly clearly) totally unfair. [224001590140] |However, it leaves me with the following options:
  • I can repeat their bad idea and test on every tenth sentence.
  • [224001590160] |
  • I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
  • [224001590170] |
  • I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
  • [224001590180] |
  • I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
  • [224001590190] |
  • I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
  • [224001590200] |I think that if all else is equal, (5) is the best option. [224001590210] |I think it's scientifically sound, truthful, and gives results that are actually comparable. [224001590220] |Unfortunately, it's also the most work because it involves reimplementing a complete other system which in many cases may be verging on prohibitive. [224001590230] |(I suppose another option would be to contact the authors and ask them to rerun in the correct way -- I imagine some people might respond positively to this, though probably not all.) [224001590240] |(4) is tempting, because it allows me to "break the bad habit" but also compare directly. [224001590250] |The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? [224001590260] |If they're great (for me), does that actually tell me anything? [224001590270] |And if they're terrible (for me), does that tell me anything? [224001590280] |It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry. [224001590290] |(3) just seems intellectually dishonest. [224001590300] |(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. [224001590310] |After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. [224001590320] |If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. [224001590330] |(Of course, measuring "markedly different" may be difficult for some tasks.) [224001590340] |(1) also seems intellectually dishonest. [224001590350] |At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). [224001590360] |Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. [224001590370] |One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. [224001590380] |Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem. [224001590390] |I'm curious how other people feel about this. [224001590400] |I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead. [224001600010] |Measures of correlation [224001600020] |As NLPers, we're often tasked with measuring similarity of things, where things are usually words or phrases or something like that. [224001600030] |The most standard example is measuring collocations. [224001600040] |Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should." [224001600050] |There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. [224001600060] |If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated! [224001600070] |The most straightforward measure, in my mind, is mutual information. 
[224001600080] |If we have two words "X" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). [224001600090] |In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." [224001600100] |Here, we're basically asking ourselves: do X and Y take on the same values more often than chance? [224001600110] |I.e., do they seem roughly statistically independent? [224001600120] |The mutual information statistic gives, for every X/Y pair (every pair of words), a score. [224001600130] |We can then sort all word pairs by this score and find the strongest collocations. [224001600140] |One issue with MI seems to be that it doesn't really care if pairs are common or not. [224001600150] |Very infrequent (typically noisy) pairs seem to pop to the front. [224001600160] |One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. [224001600170] |This is quite similar to the RlogF measure that's locally quite popular. [224001600180] |The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. [224001600190] |I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. [224001600200] |Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. [224001600210] |You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio. [224001600220] |Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). [224001600230] |It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this. [224001600240] |In writing this, I just came across collocations.de, which, among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. [224001600250] |They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. [224001600260] |The intersection just contains t-score, so maybe that's the way to go. [224001600270] |Of course, these things are hard to validate because so much depends on what you're trying to do. [224001610010] |Evaluating topic models [224001610020] |I think it's fair to say that I am a fan of the Bayesian lifestyle. [224001610030] |I have at least a handful of papers with "Bayesian" in the title, and not in the misleading "I used Bayes' rule on a noisy-channel model" sense. [224001610040] |It's probably also fair to say that I'm a fan of NLP. [224001610050] |So... let's think about it. [224001610060] |A fan of Bayes, a fan of NLP. [224001610070] |He must do topic modeling (ie LDA-style models), right? [224001610080] |Well, no. [224001610090] |Not really. [224001610100] |Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.
[224001610110] |The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is because it's notoriously difficult to evaluate. [224001610120] |(I got away with this in the summarization example because I could evaluate the summaries directly.) [224001610130] |The purpose of this post is to discuss how one can try to evaluate topic models. [224001610140] |At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). [224001610150] |For a very simple example, let's take LDA. [224001610160] |Here, we have p(theta | alpha) p(phi | beta) p(z | theta) p(w | z, phi). [224001610170] |In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture. [224001610180] |For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as p(z | a) q(w | z), and let's assume p and q are not conjugate (if they were, life would be too easy). [224001610190] |Now, we want to "evaluate" these models. [224001610200] |That is, we want to check to see if they really are good models of documents. [224001610210] |(Or more precisely, are they better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") [224001610220] |Well, the natural thing to do would be to take some held-out data and evaluate according to the model. [224001610230] |Whichever model assigns higher probability to the heldout data is probably better. [224001610240] |At this point, we need to take a moment to talk about inference. [224001610250] |There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. [224001610260] |Each gives you something totally different. [224001610270] |In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). [224001610280] |We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. [224001610290] |In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes p(a|w). [224001610300] |(This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.) [224001610310] |Now, back to evaluation. [224001610320] |The issue that comes up is that in order to evaluate -- that is, in order to compute p(w | a), we have to do more inference. [224001610330] |In particular, we have to marginalize over the zs for the heldout data. [224001610340] |In the MC camp, this would mean taking our samples to describe a posterior distribution on a given w (marginalizing out z) and then using this posterior to evaluate the heldout likelihood. [224001610350] |This would involve another run of a sampler to marginalize out the zs for the new data. [224001610360] |In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (eg., running EM on the test data). [224001610370] |All of this is quite unfortunate. [224001610380] |In both cases, re-running a sampler or re-running EM is going to be computationally expensive.
[224001610390] |Life is probably slightly better in the deterministic camp where you usually get a fairly reasonable approximation to the evidence. [224001610400] |In the MC camp, life is pretty bad. [224001610410] |We can run this sampler, but (a) it is usually going to have pretty high variance and (b) (even worse!) it's just plain hard to evaluate evidence in a sampler. [224001610420] |At least I don't know of any really good ways and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!). [224001610430] |So, what recourse do we have? [224001610440] |One reasonable standard thing to do is to "hold out" data in a different way. [224001610450] |For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. [224001610460] |The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. [224001610470] |This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. [224001610480] |The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. [224001610490] |Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. [224001610500] |(There are other issues, too, though these are less irksome to me in a topic-modeling setting.) [224001610510] |Another standard thing to do is to throw the latent variables into some sort of classification problem. [224001610520] |That is, take (eg) the 20newsgroups data set, training and test combined. [224001610530] |Run your topic model and get document-level parameters. [224001610540] |Use these as features in, say, a logistic regression and see how well you do. [224001610550] |This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. [224001610560] |The problem is that this estimate is cloaked behind classification. [224001610570] |Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture. [224001610580] |The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. [224001610590] |(Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.) [224001610600] |The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. [224001610610] |We then evaluate likelihood according to q instead of p. Here's how I think of it. [224001610620] |Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). [224001610630] |In most cases, we'll have this. [224001610640] |If we have an ML-II estimate of a, we just follow the topic model's generative story; if we have samples over a, we just use those samples. [224001610650] |Easy enough. [224001610660] |So, what we're going to do is generate a ton of faux documents from the posterior. [224001610670] |On each of these documents, we estimate some simpler model.
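Concretely -- jumping slightly ahead to the multinomial instantiation described next -- here is a minimal sketch of the whole recipe. It assumes you already have a sample_doc() routine that draws one faux document (a list of word ids) from the fitted model, and the add-alpha smoothing is just one simple stand-in for the smoothing issue discussed below.

```python
# Empirical-likelihood-style evaluation: fit one smoothed unigram multinomial
# per faux document drawn from the model, then score held-out documents under
# the evenly weighted mixture of those multinomials.
import numpy as np

def empirical_log_likelihood(sample_doc, test_docs, vocab_size,
                             n_faux=1000, alpha=0.01):
    # one (add-alpha smoothed) multinomial per faux document
    log_q = np.zeros((n_faux, vocab_size))
    for r in range(n_faux):
        counts = np.bincount(sample_doc(), minlength=vocab_size).astype(float)
        log_q[r] = np.log((counts + alpha) / (counts.sum() + alpha * vocab_size))
    # score each test document under the even mixture (log-sum-exp over components)
    total = 0.0
    for doc in test_docs:
        doc_counts = np.bincount(np.asarray(doc), minlength=vocab_size)
        comp_ll = log_q @ doc_counts                 # log-likelihood per component
        m = comp_ll.max()
        total += m + np.log(np.exp(comp_ll - m).sum()) - np.log(n_faux)
    return total  # total log-likelihood of the test set under q
```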
[224001610680] |For instance, we might simply estimate a Multinomial (bag of words) on each faux document. [224001610690] |We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. [224001610700] |The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is really easy. [224001610710] |We just take our test documents, compute their average likelihood under each of these faux multinomials, and voila -- we're done! [224001610720] |This method is, of course, not without its own issues. [224001610730] |For one, a multinomial might not be a good model to use. [224001610740] |For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. [224001610750] |The estimates might also have high variance -- how many faux documents do I need? [224001610760] |Some sort of kernel smoothing can help here, but then that introduces additional bias. [224001610770] |I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see. [224001610780] |But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC. [224001620010] |Old school conference blogging [224001620020] |These days, the "conference blog" has come to be a fairly accepted part of an academic blogger's repertoire. [224001620030] |I've actually received an email on the third day of a conference that people knew I was attending asking "why haven't you blogged yet!" [224001620040] |Colleagues who blog have shared similar stories. [224001620050] |How did the world manage without us?! [224001620060] |I occasionally like to read through old papers, like from the late 70s, early 80s, mostly in stats, but occasionally in NLP or ML. [224001620070] |Mostly it's fun; often I learn things. [224001620080] |The old NLP papers typically amuse me more than anything else, but it's a bit of a relief to see the set of things that were considered interesting or important 20-30 years ago. [224001620090] |I was browsing through the ACL 1980 proceedings (I won't share the answer to "how old were you then?") on the anthology and came across a paper titled "Parsing." [224001620100] |I thought it was quite impressive that someone would be audacious enough to title their paper "Parsing." [224001620110] |I mean, can you imagine going to a talk at ACL 2008 and seeing the title "Machine Translation?" [224001620120] |It's absurd. [224001620130] |So I read the paper. [224001620140] |Well, it turns out it's not really a research paper. [224001620150] |It's the 1980 ACL's parsing area chair's impression of all the parsing papers that year, and how they related to the previous year. [224001620160] |"Fantastic!" [224001620170] |I thought. [224001620180] |These are like pre-web-era conference blogs. [224001620190] |But it's not just some random guy blogging about his favorite ACL papers across a variety of areas, but rather the one person who probably knew all the parsing papers in ACL that year better than anyone. [224001620200] |Having area chaired myself, I know what all the summarization submissions were about, and I know quite well what all the accepted submissions were about. [224001620210] |(Or, at least, I did right after we issued accept/reject notifications.)
[224001620220] |If I had sat down right then and written a page summary, relating it to the past year, it probably would have taken about two hours. [224001620230] |(An average blog post takes 30 mins, but I figure I would want to be more diligent, and also be sure to look up what happened the year before.) [224001620240] |I would actually love to see something like that reinstated. [224001620250] |I think it captures an important area that's missing from your standard conference blogs -- namely, that you get coverage. [224001620260] |As I age, I attend fewer and fewer talks, so my conference blog posts are necessarily really spotty. [224001620270] |But if a chair from every area were to write a one page summary, I think that would be great. [224001620280] |And I really don't think it's that much added effort -- I easily spent 10-20 times that going over reviews, looking for inconsistencies, reading papers, etc. [224001620290] |However, while I think it's useful, I also don't think that it's really something that needs to be in our proceedings. [224001620300] |So I'm going to try to start a trend. [224001620310] |Every time I am area chair for a conference, I will post to the blog a summary of the papers in my area. [224001620320] |If you ever are an area chair, and would like to participate, please contact me. [224001620330] |Perhaps I'll be the only one who ever does this, but I hope not. [224001620340] |I'll start with ACL 2008 summarization as my next post. [224001630010] |ACL 2008 What-To-See'd [224001630020] |Sorry for the delay, but I had to wait until I got here and actually got the proceedings CD, since the proceedings don't yet seem to be online. [224001630030] |Find some good talks to see! [224001630040] |Yes, I know that I promised another post in the interim, but this doesn't really count in my mind. [224001640010] |CL is open access [224001640020] |Just officially announced. [224001640030] |Minor details: as of Mar 2009 (first issue next year), there will be no print version (electronic only) and it will be open access. [224001640040] |Also switched to an online electronic management system. [224001640050] |Obviously I think this is fantastic. [224001640060] |Many many thanks go to Robert Dale and the other ACL and CL board members for making this happen. [224001650010] |ACS: ACL 2008 Summarization Track [224001650020] |This is the first in what I hope will be a long line of "Area Chair Summaries" from NLP-related conferences. [224001650030] |If you would like to contribute one, please contact me! [224001650040] |For ACL this year, I was lucky to be the Area Chair for the summarization track. [224001650050] |I know I've said it before, but we really got a ton of great papers this year. [224001650060] |In the end, seven were accepted for presentation (note there are also some summarization-related papers that officially fell under "Generation" for which I was not the area chair). [224001650070] |I would like to say that there was some sort of unifying theme this year, but there's not one that I can actually come up with.
[224001650080] |The papers were: [224001650090] |P08-1035 [bib]: Tadashi Nomoto -- A Generic Sentence Trimmer with CRFs [224001650100] |P08-1036 [bib]: Ivan Titov; Ryan McDonald -- A Joint Model of Text and Aspect Ratings for Sentiment Summarization [224001650110] |P08-1092 [bib]: Fadi Biadsy; Julia Hirschberg; Elena Filatova -- An Unsupervised Approach to Biography Production Using Wikipedia [224001650120] |P08-1093 [bib]: Qiaozhu Mei; ChengXiang Zhai -- Generating Impact-Based Summaries for Scientific Literature [224001650130] |P08-1094 [bib]: Ani Nenkova; Annie Louis -- Can You Summarize This? [224001650140] |Identifying Correlates of Input Difficulty for Multi-Document Summarization [224001650150] |P08-1054 [bib]: Gerald Penn; Xiaodan Zhu -- A Critical Reassessment of Evaluation Baselines for Speech Summarization [224001650160] |P08-1041 [bib]: Giuseppe Carenini; Raymond T. Ng; Xiaodong Zhou -- Summarizing Emails with Conversational Cohesion and Subjectivity [224001650170] |Most of these you can guess the contents more-or-less by their titles, but here's a quick run-down. [224001650180] |Nomoto's is probably the hardest to guess. [224001650190] |I dare-say it actually sounds a bit boring from just the title; it leads me to think it's yet another sentence compression method. [224001650200] |But if you start reading the paper, you find out all about compression under dependency structures, summarization of Japanese text, and a fairly thorough evaluation. [224001650210] |The Titov and McDonald paper attempts to model associations between fine-grained user reviews of restaurants (eg: how do you rate the food versus the ambiance?) and the actual text in the reviews. [224001650220] |This enables them to produce summaries that are specific to particular aspects of the review. [224001650230] |Biadsy, Hirschberg and Filatova present a model for producing biographies, by trying to identify biography-like sentences from Wikipedia (a source that's gaining more and more attention these days). [224001650240] |One aspect that I found most interesting here was that they attempt to do full-on reference resolution and referring expression generation. [224001650250] |This has always been something I've been too scared to touch, but they actually present some results that show that it's worthwhile! [224001650260] |Mei and Zhai talk about a sentence-retrieval method for summarizing scientific documents, which they gathered from DBLP. [224001650270] |They take advantage of citation sentences (called "citances" by Marti Hearst and also explored deeply by Simone Teufel) and make a "citation" language model. [224001650280] |This language model is interpolated with the standard document language model to perform extraction. [224001650290] |This allows them to extract sentences that readers care about, not what the author thinks you should care about. [224001650300] |The key limitation, of course, is that it only works once the paper has been cited for a while. [224001650310] |(It would be nice to see how many citations you need before it's worth doing this.) [224001650320] |The paper by Nenkova and Louis describes a model for predicting if a batch of documents is going to be difficult (for a machine) to summarize. [224001650330] |This is akin to the notion of query-difficulty that you see in IR. [224001650340] |The results are about what you'd expect, but it's nice to see them played out. [224001650350] |In particular, you see that more cohesive sets of documents are easier to summarize. [224001650360] |Read the paper for more.
[224001650370] |Penn and Zhu look at how well Rouge works when trying to summarize speech. [224001650380] |They look at both Switchboard data (telephone conversations) and lectures. [224001650390] |They have many interesting findings that cast doubt not only on the role that Rouge plays in speech summarization, but also on what sort of baselines are reasonable for speech, and what role meta-data plays in speech summarization. [224001650400] |If you care at all about the intersection of speech and summarization, this truly is a must-read. [224001650410] |Last but not least, Carenini, Ng and Zhou discuss the task of summarizing emails (following up on previous work by the same authors on roughly the same task). [224001650420] |Since the most relevant past work appeared in WWW07, I'll actually comment on that too (maybe unknown to many readers). [224001650430] |There is quite a bit of work here that primarily aims at finding useful features for summarizing chains of emails. [224001650440] |They look at things like cue words, semantic similarity, lexical similarity, "PageRank" measures and a handful of others. [224001650450] |This is all done in a graph-based framework that is pieced together based on processing the email chain (from what I can tell, this is highly non-trivial). [224001650460] |After writing this, I think that maybe the connecting thread is that there's a lot of work that aims at doing summarization for things other than straight news stories. [224001650470] |I think this is a fantastic step for the community. [224001650480] |Keep up the good work, guys! [224001670010] |ICML 2008 papers up [224001670020] |See here. [224001670030] |Has also been WhatToSee-d. [224001670040] |Here are the top words (stems) from ICML this year (a quick sketch of one way such counts can be computed follows the list): [224001670050] |
  • learn (53) -- no kidding!
  • [224001670060] |
  • model (20)
  • [224001670070] |
  • kernel (17) -- can't seem to shake these guys
  • [224001670080] |
  • estim (11)
  • [224001670090] |
  • reinforc (10)
  • [224001670100] |
  • linear (10) -- as in both "linear time" and "linear model"
  • [224001670110] |
  • classif (10) -- no kidding
  • [224001670120] |
  • process (9) -- yay nonparametric Bayes!
  • [224001670130] |
  • analysi (9)
  • [224001670140] |
  • supervis (8)
  • [224001670150] |
  • structur (8) -- our paper is one of these
  • [224001670160] |
  • rank (8) -- always popular in the web days
  • [224001670170] |
  • effici (8) -- this is great!
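(For the curious: counts like these take only a few lines to produce; the sketch below just Porter-stems the words in the accepted-paper titles and counts them, which may or may not be exactly what WhatToSee does.)

```python
# Rough stand-in for the stem counts above: lowercase, tokenize, drop a few
# stopwords, Porter-stem, and count. `titles` is assumed to be a list of
# accepted-paper title strings.
import re
from collections import Counter
from nltk.stem import PorterStemmer

def top_stems(titles, n=15):
    stemmer = PorterStemmer()
    stop = {"a", "an", "and", "the", "for", "of", "in", "on", "with", "to", "via", "using"}
    counts = Counter()
    for title in titles:
        for tok in re.findall(r"[a-z]+", title.lower()):
            if tok not in stop:
                counts[stemmer.stem(tok)] += 1
    return counts.most_common(n)
```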
  • [224001680010] |ICML, UAI and COLT 2008 all WhatToSee'd [224001680020] |See here [224001690010] |ICML Business Meeting [224001690020] |Here are some points, coming to you roughly live. [224001690030] |
  • ICML 2009 will be in Montreal, June 15-19. [224001690040] |Colocated with COLT (June 20-23) and Symposium on RL (June 19-21).
  • [224001690050] |
  • ICML 2010 will be in Haifa, Israel (led by IBM folks). [224001690060] |No official dates yet.
  • [224001690070] |
  • The ICML board will be holding elections for members soon: probably around 10 new. [224001690080] |Requests for nominations will be sent out in the next month or two from IMLS.
  • [224001690090] |
  • Sounds like tracks will be implemented in SPC vs. review -- just like *ACL. [224001690100] |To me, clearly a good idea. [224001690110] |Bidding on 600 papers sucks.
  • [224001690120] |
  • There's discussion about having a combination of an "AAAI-style nectar track" and an "applications track." [224001690130] |The idea would be that if you've published a paper at an applications conference (eg, ACL, CVPR, ISMB, etc.), you could rewrite >50% of it, sell it to an ICML audience, and resubmit. [224001690140] |The goal would be to get some applications blood into ICML, while not simultaneously preventing people from submitting their apps papers to the apps conferences.
  • [224001700010] |ICML/UAI/COLT 2008 Retrospective [224001700020] |I know it's a bit delayed, but better late than never, eh? [224001700030] |I have to say: I thought ICML/UAI/COLT this year was fantastic. [224001700040] |The organizers all did a fantastic job. [224001700050] |I can't possibly list everything I liked, but here are some highlights: [224001700060] |The tutorials were incredible. [224001700070] |I'm reaching that point where I often skip out on tutorials because there's nothing new enough. [224001700080] |This time, there was an embarrassment of riches: I couldn't go to everything I wanted to. [224001700090] |In the end, I went to the Smola+Gretton+Fukumizu tutorial on distributional embeddings and the Krause+Guestrin tutorial on submodularity. [224001700100] |Both were fantastic. [224001700110] |My only complaint is that while all the plenary talks at ICML/UAI/COLT were videotaped, the tutorials were not! [224001700120] |Anyway, I will blog separately about both of these tutorials in the near future; also check out their web pages: the slides are probably reasonably self-explanatory. [224001700130] |The first day began with an invited talk by John Winn, who does (at least used to do) probabilistic modeling for computer vision. [224001700140] |I'm not sure if this was the intended take-away message, but the thing I liked best about the talk is that he listed about a dozen things that you would have to model, were you to model "vision." [224001700150] |He then talked about problems and techniques in terms of which of these you model, which of them you fix, and which of them you treat as "noise." [224001700160] |I think this is a great way to think about modeling any kind of complex real world phenomenon (hint hint!). [224001700170] |My favorite morning talk was Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. [224001700180] |This is a really clever use of slice sampling, which everyone should know about! [224001700190] |The idea is to resample entire sequences of hidden states in one go. [224001700200] |The trouble with this in an infinite model is that I don't know how to do forward-backward over an infinite state space. [224001700210] |The trick is to slice the probability mass on the lattice and only sample over those state/observation pairs that stick out above the slice. [224001700220] |This is effectively a fully Bayesian way to do beam-like methods, and is certainly not limited to infinite HMMs. [224001700230] |That afternoon, there was the joint award session, with a "ten year" award (great idea!) going to Blum and Mitchell's original co-training paper from COLT 1998. [224001700240] |This is clearly a very influential paper, and one I will blog about separately. [224001700250] |I particularly liked two of the best paper award recipients: [224001700260] |
  • Percy Liang and Michael Jordan for An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. [224001700270] |The idea here is to look at what various estimators are like in the limit of infinite data. [224001700280] |The conclusion (as I read it) is that if your model is correct (i.e., the true distribution is realized by a certain set of parameters in your model family), then you're asymptotically better off using generative models. [224001700290] |If the model is even a tiny tiny bit incorrect, then you're better off using discriminative. [224001700300] |The key weakness of this analysis seems to be that, since we're looking only at asymptotics, there's no notion of a "sort-of-wrong" model -- the wrongness gets blown up in the limit.
  • [224001700310] |
  • Shai Shalev-Shwartz and Nati Srebro for SVM Optimization: Inverse Dependence on Training Set Size. [224001700320] |The idea here is that having lots of data should be a good thing, not something that makes us say "ugh, now my SVM will take forever to run." [224001700330] |The key to getting an inverse dependence on the size of the training set is to sort of change the problem. [224001700340] |The idea is that if you have 1,000 training points, you may get 88% accuracy and it may take an hour. [224001700350] |If you have 1,000,000 training points, we're not going to ask that we find something with 90% accuracy, but rather still aim for 88% accuracy but try to get there faster. [224001700360] |The analysis is based on the idea that if we have lots more data, we can be a bit looser with finding the exact best model parameters, which is what usually takes so long. [224001700370] |We can afford to have not exactly the best parameters, since we're still going for 88% accuracy, not 90%.
  • [224001700380] |Although I didn't see the talk, I enjoyed the poster on Training Structural SVMs when Exact Inference is Intractable by Thomas Finley and Thorsten Joachims. [224001700390] |They investigate the difference between approximate inference methods that are overgenerating versus undergenerating. [224001700400] |The former is exemplified by a conversion from an IP to an LP. [224001700410] |The latter by, say, search. [224001700420] |I don't think there are huge cut and dry situations where one is better than the other, but I think this is a really good step to understanding these models better. [224001700430] |Tuesday morning, there was a great talk on Efficient Projections onto the L1-Ball for Learning in High Dimensions by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra. [224001700440] |Here, they're trying to extend the Pegasos algorithm to do L1 regularized SVMs. [224001700450] |The difficult step is, after taking a sub-gradient step, projecting yourself back onto the L1 ball. [224001700460] |They have a very clever algorithm for doing this that makes use of red-black trees to maintain a notion of when a dimension was last updated (necessary in order to get complexity that depends on the sparseness of feature vectors). [224001700470] |The only worry I have about this approach is that I'm very happy dealing with arrays for everything, but once someone tells me that my algorithm needs to use a tree structure, I start worrying. [224001700480] |Not because of big-O issues, but just because trees mean pointers, pointers mean pointer chasing, and pointer chasing (especially out of cache) is a nightmare for actual efficiency. [224001700490] |All the papers in the Tuesday 10:40am online learning section I found interesting. [224001700500] |Confidence-weighted linear classification (Mark Dredze, Koby Crammer and Fernando Pereira) presents a technique for keeping track of how confident you are about your weights for various features (if you squint it looks very PAC-Bayes) and they get very good online performance with only one or two passes over the data. [224001700510] |Francesco Orabona, Joseph Keshet and Barbara Caputo had a paper on memory-bounded online perceptrons called the Projectron. [224001700520] |The idea is simple and effective; however, the big take-away I had from this paper has to do with the baseline they compare against. [224001700530] |If you only allow your perceptron to maintain, say, 100 support vectors, just use the first 100 points that the perceptron updates on. [224001700540] |This does pretty much as well as most of the past work on this problem! [224001700550] |How did this not get noticed before? [224001700560] |Finally, Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari talked about bandit problems; this is interesting work, but still seems far from practical. [224001700570] |The idea of naively picking an unknown label with uniform probability over a gigantic set of labels is just not going to work in the real world. [224001700580] |But I think it's a really good start. [224001700590] |The entire NLP section on the last afternoon was good.
[224001700590] |The entire NLP section on the last afternoon was good. [224001700600] |David Chen and Ray Mooney talked about how to generate sportscasts from RoboCup trials; Ronan Collobert and Jason Weston talked about using neural nets to solve every NLP problem in the world (more on this in another post); Jason Wolfe, Aria Haghighi and Dan Klein talked about how to parallelize the M step in addition to the E step; and then Percy Liang, me, and Dan Klein talked about whether it's reasonable or not to get rid of features in structured prediction (more on this in another post). [224001700610] |The next day was workshop day. [224001700620] |The two workshops I went to were the Prior Knowledge in Text workshop (which I co-organized with Marc Dymetman, Guillaume Bouchard and Yee Whye Teh), and the Nonparametric Bayes workshop (where I talked a tiny bit about HBC). [224001700630] |I'll relate the news of our own workshop at a later date; you can read a bit about the NPBayes workshop here. [224001700640] |Then came UAI and COLT. [224001700650] |I must admit that by this time I was a bit conferenced out, so I missed a bunch of sessions. [224001700660] |However, there were still some papers that stood out for me. [224001700670] |In UAI, David Mimno and Andrew McCallum presented Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, essentially a conditional variant of LDA in which the document-level Dirichlet hyperparameters are given by a generalized linear model over arbitrary document features (a tiny sketch of this prior appears below). [224001700680] |One audience question that came up (David, if you're reading, maybe you can answer this!) had to do with putting the GLM on the "alpha" hyperparameter. [224001700690] |The fact that there was such a big improvement over the baseline suggests that LDA-like models are highly sensitive to the setting of their hyperparameters. [224001700700] |Like the questioner, I was surprised that tweaking a hyperparameter could have such a big influence! [224001700710] |Nevertheless, very cool. [224001700720] |The best paper went to David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola and Yair Weiss for Tightening LP Relaxations for MAP using Message Passing. [224001700730] |The idea is to take your marginal polytope, initially defined by simple single-node marginal constraints, and iteratively add pairwise, 3-wise, etc., constraints. [224001700740] |There are some implementation details that allow you to do warm restarts. [224001700750] |They managed to scale amazingly well. [224001700760] |This furthers my confidence that LPs are a really great way to do MAP inference in graphical models. [224001700770] |Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar talked about Multi-View Learning over Structured and Non-Identical Outputs. [224001700780] |I see this as a bit of an extension to the "alignment by agreement" and "agreement-based learning" work, but now the difference is that the output spaces don't have to "match up." [224001700790] |Marina Meila and Le Bao had a poster on Estimation and clustering with infinite rankings, where they give a nonparametric model over rankings, focusing on properties of the distribution. [224001700800] |Kurt Miller, Tom Griffiths and Mike Jordan had an extension to the Indian Buffet Process that intentionally removes exchangeability in order to model cases where we know the data is not exchangeable.
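Backing up to the Mimno and McCallum model for a moment: as I understand it, the key modeling step is that each document's Dirichlet prior over topics is a log-linear function of its observed features. A tiny sketch (illustrative names, not their code) of that prior:

```python
import numpy as np

def dmr_document_alphas(doc_features, Lambda):
    """Document-specific Dirichlet hyperparameters, DMR-style.

    doc_features: (D, F) observed features for each document
    Lambda:       (K, F) per-topic regression weights (learned)

    Returns a (D, K) array with alpha[d, k] = exp(x_d . lambda_k),
    the prior weight on topic k for document d; documents with
    similar features get similar topic priors.
    """
    return np.exp(doc_features @ Lambda.T)
```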
[224001700810] |Chong Wang, Dave Blei and David Heckerman had a nice result on how to do topic modeling over continuous time (modeled as Brownian motion). [224001700820] |The cool thing is that by going continuous, they're able to get a more efficient algorithm than previous work that functioned in discrete time steps. [224001700830] |At COLT, there were also a number of good papers. [224001700840] |Shai Ben-David, Tyler Lu and David Pal ask: Does Unlabeled Data Provably Help? [224001700850] |Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning. [224001700860] |I'll ruin the suspense: no, it does not. [224001700870] |(At least not without strong assumptions about the label distribution.) [224001700880] |Nina Balcan, Avrim Blum and Nati Srebro show that learning with similarity functions is even better than we thought. [224001700890] |One paper at COLT that I especially liked, partly because I wasn't aware of much of the previous work it cites, was On the Margin Explanation of Boosting Algorithms by Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou and Jufu Feng. [224001700900] |The history here is roughly like this. [224001700910] |Boosting came along and worked pretty well. [224001700920] |People started trying to explain it with margin-based analyses. [224001700930] |Then Breiman came along and described an algorithm, arc-gv, that provably achieves a larger minimum margin than AdaBoost (and does so in practice as well), yet works worse! [224001700940] |This paper attempts to re-analyze boosting from the perspective of an "Equilibrium Margin," which provides sharper bounds and results that actually agree with what is observed empirically.
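For concreteness, the "margin" all of this boosting work argues about is the normalized voting margin of the ensemble. A minimal sketch (my own, just to fix notation) of computing the margin distribution:

```python
import numpy as np

def voting_margins(weak_preds, alphas, y):
    """Normalized voting margins of a boosted ensemble.

    weak_preds: (T, n) predictions of T weak learners in {-1, +1}
    alphas:     (T,) nonnegative combination weights
    y:          (n,) true labels in {-1, +1}

    margin_i = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t.
    arc-gv maximizes the minimum of these margins; Equilibrium-Margin-style
    analyses look at the whole distribution rather than just the minimum.
    """
    scores = alphas @ weak_preds            # weighted vote per example
    return y * scores / alphas.sum()
```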