[224001000010] |Very Sad News [224001000020] |Karen Sparck-Jones passed away this morning. [224001000030] |You can find more information about her here and in many other places on the web. [224001000040] |I, personally, remember Karen as endlessly enthusiastic and always willing to push me to think about problems in new (and interesting) lights. [224001000050] |She's been immensely important in both NLP and IR, and her death is an enormous loss to the field; I have greatly valued her contributions in DUC discussions, among other things. [224001000060] |Many will miss her. [224001000070] |I'll leave this discussion with a quote from an email I have from her from only a few weeks ago, since I think it's something we should all think about, whatever our context: [224001000080] |Document collections is a much misunderstood topic -- you have to think what it's for and, e.g., where (in retrieval) you are going to get real queries from. [224001000090] |Just assembling some bunch of stuff and saying hey guys, what would you like to do with this is useless. [224001010010] |Discourse is darn hard [224001010020] |Discourse is an area I've been peripherally interested in for quite a while; probably unduly influenced by my advisor's (ex-?)interest in the problem. [224001010030] |The general sense I get is that discourse relations are relatively easy to identify when cue phrases exist. [224001010040] |This is essentially the approach taken by the Penn Discourse Treebank and Daniel's original parser. [224001010050] |Unfortunately, cue phrases exist only for a small subset of sentences, so it's tempting to look at lexical techniques, since they work so well in other problems. [224001010060] |The intuition is that if one sentence says "love" and the next says "hate," then maybe they are in some sort of contrastive relation. [224001010070] |Unfortunately, the world is rarely this nice. [224001010080] |The amount of inference necessary to reliably identify seemingly simple, coarse-grained relationships seems astonishing. [224001010090] |Here are some examples of contrast relations: [224001010100] |
  • The interest-only securities will be sold separately by BT Securities. [224001010110] |The principal-only securities will be repackaged by BT Securities into a Freddie Mac Remic, Series 103, that will have six classes.
  • James Earl Jones is cast to play the Rev. Mr. Johns. [224001010130] |The network deals a lot with unknowns, including Scott Wentworth, who portrayed Mr. Anderson, and Bill Alton as Father Jenco.
  • At the same time, second-tier firms will continue to lose ground. [224001010150] |In personal computers, Apple, Compaq and IBM are expected to tighten their hold on their business.
[224001010160] |In (1), we need to know that interest-only and principal-only securities are essentially themselves contrastive. [224001010170] |We then need to combine this information with the coreference of BT Securities in both cases (should be easy). [224001010180] |It seems plausible that we could get this with enough data. [224001010190] |For (2), we essentially need to know that James Earl Jones is famous, and that "deals with unknowns" is the same as "casts." [224001010200] |For (3), we need to know that Apple, Compaq and IBM are not second-tier firms. [224001010210] |These latter two seem quite difficult. [224001010220] |Next, let's consider background. [224001010230] |Here are some short examples of background relations: [224001010240] |
  • The March 24 oil spill soiled hundreds of miles of shoreline along Alaska's southern coast and wreaked havoc with wildlife and the fishing industry. [224001010250] |Exxon Corp. is resigning from the National Wildlife Federation 's corporate advisory panel, saying the conservation group has been unfairly critical of the Exxon Valdez oil spill along the Alaskan coast.
  • Insurers typically retain a small percentage of the risks they underwrite and pass on the rest of the losses. [224001010270] |After Hugo hit, many insurers exhausted their reinsurance coverage and had to tap reinsurers to replace that coverage in case there were any other major disasters before the end of the year.
  • [224001010280] |The first example here might be identifiable by noting that the tense in the two sentences is different: one is present, one is past. [224001010290] |Beyond this, however, we'd essentially have to just recognize that the oil spill referred to in both sentences is, in fact, the same spill. [224001010300] |We'd have to infer this from the fact that otherwise the first sentence is irrelevant. [224001010310] |In (2), we'd have to recognize that the "passed on" losses go to reinsurers (I didn't even know such things existed). [224001010320] |As a final example, let's look at some evidence relations: [224001010330] |
  • A month ago, the firm started serving dinner at about 7:30 each night; about 50 to 60 of the 350 people in the investment banking operation have consistently been around that late. [224001010340] |Still, without many actual deals to show off, Kidder is left to stress that it finally has "a team" in place, and that everyone works harder.
  • Yesterday, Mobil said domestic exploration and production operations had a $16 million loss in the third quarter, while comparable foreign operations earned $234 million. [224001010360] |Individuals familiar with Mobil's strategy say that Mobil is reducing its U.S. work force because of declining U.S. output.
[224001010370] |These examples also seem quite difficult. [224001010380] |In (1), we have to recognize that the "team" is supposed to refer to the "50 to 60 ... people" and that having dinner, within the context of a company, is a "team-like" activity. [224001010390] |(2) is somewhat easier -- knowing that a "loss" is related to a "decline" helps a lot, though it is entirely possible that the type of relation would differ. [224001010400] |You also pretty much have to know that "domestic" means "U.S." [224001010410] |I feel like, in many ways, this discourse relation problem subsumes the now-popular "textual entailment" problem. [224001010420] |In fact, some of the RST relations look a lot like textual entailment: elaboration-general-specific, elaboration-set-member, Contrast (as a negative entailment), consequence, etc. [224001010430] |I don't have a strong conclusion here other than: this is a hard problem. [224001010440] |There've been some attempts at learning lexical things for this task, but I feel like it needs more than that. [224001020010] |AI Stats post mortem [224001020020] |Got back from AIStats a week or so ago; overall, I enjoyed it. [224001020030] |Though by the end, I was totally sick of Puerto Rican food. [224001020040] |(I'm a huge fan of Cuban food and expected stronger similarities. [224001020050] |Oh well.) [224001020060] |Here are some things I learned at AIStats: [224001020070] |
  • Multiple instance learning. [224001020080] |There was a paper on this topic. [224001020090] |I can't comment on how good the paper was (in comparison to other MIL papers), but I didn't know about the MIL task before. [224001020100] |The problem is as follows. [224001020110] |Instead of getting a set of labeled training points, you get a set of sets of points. [224001020120] |Each set of points comes with a +1/-1 label. +1 means that at least one example in the set is positive; -1 means none are. [224001020130] |Vision people like this because they have an image that contains a person but they don't know where the person is, so they scan a box around the image and generate lots of examples. [224001020140] |We know at least one will contain a person. [224001020150] |I realized after seeing the poster that I've actually worked on this problem! [224001020160] |The BayeSum paper is actually a MIL algorithm, but I didn't know the right terminology. [224001020170] |I wonder how it compares in vision problems. (A tiny sketch of the setup appears just after this list.)
  • Deep belief networks. [224001020190] |These seem to be growing in popularity -- both Hinton's group and LeCun's group have their own version of DBNs. [224001020200] |You can basically think of these as neural nets with tons of layers of hidden units. [224001020210] |They can presumably learn really complicated functions of inputs. [224001020220] |It's also possible to train them unsupervisedly (is that a word?) and then just tune the unsupervised model to some classification task. [224001020230] |I don't really understand how they work, but they seem to be doing something interesting.
  • It is possible to learn (probably approximate) underestimates for the heuristic in A* search. [224001020250] |The problem is that you have a given, fixed path cost and you want to learn an admissible heuristic. [224001020260] |It seems quite similar to the Berkeley paper at NAACL this year, though different in important ways. [224001020270] |It seems like a reasonable thing to try if you're stuck with a fixed path cost. [224001020280] |I'm curious how these techniques compare to keeping a heuristic fixed and learning a path cost, ala Drew -- it seems that on the surface they'd be similar, but learning the path cost seems a bit easier to me.
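To make the multiple instance learning setup from the first bullet concrete, here's a rough sketch in Python; the toy data and the "score a bag by its best instance" rule are purely illustrative, and not what the AIStats paper (or BayeSum) actually does.

    # Multiple instance learning: each training example is a *bag* of instances
    # with a single bag-level label: +1 if at least one instance in the bag is
    # positive, -1 if none are. (Toy data, purely for illustration.)
    import numpy as np

    bags = [
        (np.array([[0.1, 0.9], [0.8, 0.2]]), +1),  # contains a positive instance
        (np.array([[0.2, 0.1], [0.3, 0.2]]), -1),  # contains no positive instance
    ]

    def bag_score(w, instances):
        # Score a bag by its best-scoring instance (a common MIL heuristic).
        return max(float(w @ x) for x in instances)

    w = np.array([0.0, 1.0])  # a hypothetical linear scorer over instance features
    for instances, label in bags:
        print(label, bag_score(w, instances))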
[224001020290] |There were, of course, other interesting papers. [224001020300] |Yee Whye has listed a few. [224001030010] |Multinomial on a graph [224001030020] |I'm going to try something here to see if it works; my guess is not, but maybe it will at least spark some discussion. [224001030030] |(I'm also testing LaTeX in Blogger.) [224001030040] |In "text modeling" applications, we're usually dealing with lots of multinomials, especially multinomials over our vocabulary (i.e., unigram language models). [224001030050] |These are parameterized by a vector $\theta$ of length equal to the size of the vocabulary; $\theta_i$ should be the probability of seeing word $i$. [224001030060] |The probability of observing an entire document of length $N$, containing $x_1$ copies of word 1, $x_2$ copies of word 2, and so on, is: $p(x \mid \theta) = \frac{N!}{\prod_i x_i!} \prod_i \theta_i^{x_i}$ [224001030070] |What I'm interested in is extensions to this where we don't assume independence. [224001030080] |In particular, suppose that we have an ontology. [224001030090] |Let's say that our ontology is just some arbitrary graph. [224001030100] |What I want is a distribution that essentially prefers to keep drawing values "locally" within the graph. [224001030110] |That is, if I've already drawn "computer" then I'm not so surprised to also see "program." [224001030120] |One way of thinking about this is by having a sort of random walk model. [224001030130] |Think about placing the individual $\theta_i$s on the graph and then allowing them to disperse. [224001030140] |So that maybe there is some $i$ for which $\theta_i = 1$ (and the rest are zero). [224001030150] |We certainly want our distribution to prefer draws for word $i$, but draws of word $j$ where $j$ is "near" $i$ should also be acceptable. [224001030160] |My first cut attempt at this is as follows. [224001030170] |Let $G$ be a graph over our vocabulary; let $N(j)$ be a neighborhood of $j$ (maybe, all nodes within 10 steps); let $N^{-1}(i)$ be the inverse neighborhood (the set of all $j$ that have $i$ in $N(j)$). [224001030180] |Let $d(i,j)$ be the distance between two nodes and let $\alpha$ be a dampening parameter (how fast probability diffuses on the graph). [224001030190] |Define our probabilistic model as: $p(x \mid \theta) \propto \prod_i \left[ \sum_{j \in N^{-1}(i)} \theta_j e^{-\alpha d(i,j)} / Z_j \right]^{x_i}$ [224001030200] |Where we let $Z_j = \sum_{i \in N(j)} e^{-\alpha d(i,j)}$. [224001030210] |The idea behind this model is as follows. [224001030220] |When we observe word $i$, we have to assign a probability to it. [224001030230] |This probability is given by any node $j$ that contains $i$ in its neighborhood. [224001030240] |Each such node has some mass $\theta_j$ which it can distribute. [224001030250] |It distributes this mass proportional to $e^{-\alpha d(i,j)}$ to all nodes $i \in N(j)$, where $Z_j$ serves to normalize this diffusion. [224001030260] |I'm pretty sure that the normalizing constant for this distribution is the same as the multinomial. [224001030270] |After all, there is a one-to-one mapping between the multinomial parameters and the "graph multinomial" parameters, so we should still get to normalize. [224001030280] |Now, because I feel an incredible need to be Bayesian, I need a conjugate prior for this "graph multinomial" distribution. [224001030290] |Since the graph multinomial is essentially just a reparameterization of the standard multinomial, it seems that a vanilla Dirichlet would be an appropriate prior. [224001030300] |However, I cannot prove that it is conjugate -- in particular, I can't seem to come up with appropriate posterior updates. [224001030310] |So this is where y'all come in to save the day! [224001030320] |I can't imagine that no one has come up with such a "graph multinomial" distribution -- I just don't know what keywords to search for.
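For concreteness, here's a little sketch of how I'm imagining the (unnormalized) likelihood computation; the data structures and names below are just mine, for illustration.

    # Unnormalized log-likelihood under the "graph multinomial" described above:
    # each word j spreads its mass theta[j] over its neighborhood N(j),
    # proportional to exp(-alpha * d(i, j)), normalized by Z_j.
    import math
    from collections import defaultdict

    def graph_multinomial_loglik(counts, theta, neighborhood, dist, alpha):
        # neighborhood[j] = N(j), the words that j can diffuse mass to
        # dist[(j, i)]    = graph distance between j and i, for i in N(j)
        Z = {j: sum(math.exp(-alpha * dist[(j, i)]) for i in nb)
             for j, nb in neighborhood.items()}
        s = defaultdict(float)  # s[i] = total mass diffusing onto word i
        for j, nb in neighborhood.items():
            for i in nb:
                s[i] += theta[j] * math.exp(-alpha * dist[(j, i)]) / Z[j]
        # multinomial coefficient omitted, as usual
        return sum(c * math.log(s[i]) for i, c in counts.items())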
[224001030330] |If someone has seen something at all like this, I'd love to know. [224001030340] |Alternatively, if someone can tell me what the appropriate conjugate prior is for this distribution, plus the posterior update rules, that would also be fantastic. [224001040010] |ACL papers up [224001040020] |List is here. [224001040030] |Top stemmed, non-stop words from titles are: [224001040040] |
  • Model (22)
  • Translat (16)
  • Languag (16)
  • Machin (15)
  • Learn (15)
  • Base (14) -- as in "syntax-based"
  • Word (13)
  • Statist (13)
  • Pars (11) -- as in parsing
  • Semant (10) -- wow!
  • Approach (10)
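(In case anyone wants to reproduce counts like these, the recipe is roughly the sketch below; the stopword list and the choice of stemmer here are illustrative rather than exactly what I used.)

    # Rough recipe for "top stemmed, non-stop words from titles":
    # lowercase, tokenize, drop stopwords, stem, count.
    import re
    from collections import Counter
    from nltk.stem import PorterStemmer  # any reasonable stemmer would do

    STOPWORDS = {"a", "an", "and", "for", "from", "in", "of", "on",
                 "the", "to", "via", "with", "using"}  # illustrative list
    stem = PorterStemmer().stem

    def top_title_terms(titles, k=10):
        counts = Counter(stem(w)
                         for title in titles
                         for w in re.findall(r"[a-z]+", title.lower())
                         if w not in STOPWORDS)
        return counts.most_common(k)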
[224001050010] |Searn versus HMMs [224001050020] |Searn is my baby and I'd like to say it can solve (or at least be applied to) every problem we care about. [224001050030] |This is of course not true. [224001050040] |But I'd like to understand the boundary between reasonable and unreasonable. [224001050050] |One big apparent weakness of Searn as it stands is that it appears applicable only to fully supervised problems. [224001050060] |That is, we can't do hidden variables, we can't do unsupervised learning. [224001050070] |I think this is a wrong belief. [224001050080] |It's something I've been thinking about for a long time and I think I finally understand what's going on. [224001050090] |This post is about showing that you can actually recover forward-backward training of HMMs as an instance of Searn with a particular choice of base classifier, optimal policy, loss function and approximation method. [224001050100] |I'll not prove it (I haven't even done this myself), but I think that even at a hand-waving level, it's sufficiently cool to warrant a post. [224001050110] |I'm going to have to assume you know how Searn works in order to proceed. [224001050120] |The important aspect is essentially that we train on the basis of an optimal policy (which may be stochastic) and some loss function. [224001050130] |Typically I've made the "optimal policy" assumption, which means that when computing the loss for a certain prediction along the way, we approximate the true expected loss with the loss given by the optimal policy. [224001050140] |This makes things efficient, but we can't do it in HMMs. [224001050150] |So here's the problem setup. [224001050160] |We have a sequence of words, each of which will get a label (for simplicity, say the labels are binary). [224001050170] |I'm going to treat the prediction task as predicting both the labels and the words. [224001050180] |(This looks a lot like estimating a joint probability, which is what HMMs do.) [224001050190] |The search strategy will be to first predict the first label, then predict the first word, then predict the second label and so on. [224001050200] |The loss corresponding to an entire prediction (of both labels and words) is just going to be the Hamming loss over the words, ignoring the labels. [224001050210] |Since the loss doesn't depend on the labels (which makes sense because they are latent so we don't know them anyway), the optimal policy has to be agnostic about their prediction. [224001050220] |Thus, we set up the optimal policy as follows. [224001050230] |For predictions of words, the optimal policy always predicts the correct word. [224001050240] |For predictions of labels, the optimal policy is stochastic. [224001050250] |If there are K labels, it predicts each with probability 1/K. Other optimal policies are possible and I'll discuss that later. [224001050260] |Now, we have to use a full-blown version of Searn that actually computes expected losses as true expectations, rather than with an optimal policy assumption. [224001050270] |Moreover, instead of sampling a single path from the current policy to get to a given state, we sample all paths from the current policy. [224001050280] |In other words, we marginalize over them. [224001050290] |This is essentially akin to not making the "single sample" assumption on the "left" of the current prediction. [224001050300] |So what happens in the first iteration?
[224001050310] |Well, when we're predicting the Nth word, we construct features over the current label (our previous prediction) and predict. [224001050320] |Let's use a naive Bayes base classifier. [224001050330] |But we're computing expectations to the left and right, so we'll essentially have "half" an example for predicting the Nth word from state 0 and half an example for predicting it from state 1. [224001050340] |For predicting the Nth label, we compute features over the previous label only and again use a naive Bayes classifier. [224001050350] |The examples thus generated will look exactly like a training set for the first maximization in EM (with all the expectations equal to 1/2). [224001050360] |We then learn a new base classifier and repeat. [224001050370] |In the second iteration, the same thing happens, except now when we predict a label, there can be an associated loss due to messing up future word predictions. [224001050380] |In the end, if you work through it, the weight associated with each example is given by an expectation over previous decisions and an expectation over future decisions, just like in forward-backward training. [224001050390] |You just have to make sure that you treat your learned policy as stochastic as well. [224001050400] |So with this particular choice of optimal policy, loss function, search strategy and base learner, we recover something that looks essentially like forward-backward training. [224001050410] |It's not identical because in true F-B, we do full maximization each time, while in Searn we instead take baby steps. [224001050420] |There are two interesting things here. [224001050430] |First, this means that in this particular case, where we compute true expectations, somehow the baby steps aren't necessary in Searn. [224001050440] |This points to a potential area to improve the algorithm. [224001050450] |Second, and perhaps more interesting, it means that we don't actually have to do full F-B. [224001050460] |The Searn theorem holds even if you're not computing true expectations (you'll just wind up with higher variance in your predictions). [224001050470] |So if you want to do, eg., Viterbi F-B but are worried about convergence, this shows that you just have to use step sizes. [224001050480] |(I'm sure someone in the EM community must have shown this before, but I haven't seen it.) [224001050490] |Anyway, I'm about 90% sure that the above actually works out if you set about to prove it. [224001050500] |Assuming its validity, I'm about 90% sure it holds for EM-like structured prediction problems in general. [224001050510] |If so, this would be very cool. [224001050520] |Or, at least I would think it's very cool :). [224001060010] |EMNLP papers, tales from the trenches [224001060020] |First, EMNLP/CoNLL papers have been posted. [224001060030] |I must congratulate the chairs for publishing this so fast -- for too many conferences we have to wait indefinitely for the accepted papers to get online. [224001060040] |Playing the now-standard game of looking at top terms, we see: [224001060050] |
  • model (24)
  • translat (17)
  • base (16)
  • learn (13) -- woohoo!
  • machin (12) -- mostly as in "machin translat"
  • word (11) -- mostly from WSD
  • structur (10) -- yay, as in "structured"
  • disambigu (10) -- take a wild guess
  • improv (9) -- boring
  • statist, semant, pars, languag, depend, approach (all 7)
  • [224001060150] |EMNLP/CoNLL this year was the first time I served as area chair for a conference. [224001060160] |Overall it was a very interesting and enlightening experience. [224001060170] |It has drastically changed my view of the conference process. [224001060180] |I must say that overall it was a very good experience: Jason Eisner did a fantastic job at keeping things on track. [224001060190] |I want to mention a few things that I noticed about the process: [224001060200] |
  • The fact that reviewers only had a few weeks, rather than a few months to review didn't seem to matter. [224001060210] |Very few people I asked to review declined. [224001060220] |(Thanks to all of you who accepted!) [224001060230] |It seems that, in general, people are friendly and happy to help out with the process. [224001060240] |My sense is that 90% of the reviews get done in the last few days anyway, so having 3 weeks or 10 weeks is irrelevant.
  • The assignment process of papers to reviewers is hard, especially when I don't personally know many of the reviewers (this was necessary because my area was broader than my own expertise). [224001060260] |Bidding helps a lot here, but the process is not perfect. [224001060270] |(People bid differently: some "want" to review only 3 papers, some "want" half...) [224001060280] |If done poorly, this can lead to uninformative reviews.
  • Papers were ranked 1-5, with half-points allowed if necessary. [224001060300] |None of my papers got a 5. [224001060310] |One got a 4.5 from one reviewer. [224001060320] |Most got between 2.5 and 3.5, which is highly uninformative. [224001060330] |I reviewed for AAAI this year, where you had to give 1,2,3 or 4, which forces you to not abstain. [224001060340] |This essentially shifts responsibility from the area chair to the reviewer. [224001060350] |I'm not sure which approach is better.
  • EMNLP asked for many criteria to be evaluated by reviewers; more than in conferences past. [224001060370] |I thought this was very useful to help me make my decisions. [224001060380] |(Essentially, in addition to high overall recommendation, I looked for high scores on "depth" and "impact.") [224001060390] |So if you think (like I used to) that these other scores are ignored: be assured, they are not (unless other area chairs behave differently).
  • Blindness seems like a good thing. [224001060410] |I've been very back and forth on this until now. [224001060420] |This was the first time I got to see author names with papers and I have to say that it is really hard to not be subconsciously biased by this information. [224001060430] |This is a debate that will never end, but for the time being, I'm happy with blind papers.
  • 20-25% acceptance rate is more than reasonable (for my area -- I don't know about others). [224001060450] |I got 33 papers, of which three basically stood out as "must accepts." [224001060460] |There were then a handful over the bar, some of which got in, some of which didn't. [224001060470] |There is certainly some degree of randomness here (I believe largely due to the assignment of papers to reviewers), and if that randomness hurt your paper, I truly apologize. [224001060480] |Not to belittle the vast majority of papers in my area, but I honestly don't think that the world would be a significantly worse place if only those top three papers had gotten in. [224001060490] |This would make for a harsh 9% acceptance rate, but I don't have a problem with this. [224001060500] |I know that this comment will probably not make me many friends, but probably about half of the papers in my area were clear rejects. [224001060510] |It seems like some sort of cascaded approach to reviewing might be worth consideration. [224001060520] |The goal wouldn't be to reduce the workload for reviewers, but to have them concentrate their time on papers that stand a chance. [224001060530] |(Admittedly, some reviewers do this anyway internally, but it would probably be good to make it official.)
  • Discussion among the reviewers was very useful. [224001060550] |I want to thank those reviewers who added extra comments at the end to help me make the overall decisions (which then got passed up to the "higher powers" to be interpolated with other area chair's decisions). [224001060560] |There were a few cases where I had to recruit extra reviewers, either to get an additional opinion or because one of my original reviewers went AWOL (thanks, all!). [224001060570] |I'm happy to say that overall, almost all reviews were in on time, and without significant harassment.
  • [224001060580] |So that was my experience. [224001060590] |Am I glad I did it? [224001060600] |Yes. [224001060610] |Would I do it again? [224001060620] |Absolutely. [224001060630] |Thanks again to all my reviewers, all the authors, and especially Jason for doing a great job organizing. [224001070010] |Problems with Sometimes-hidden Variables [224001070020] |Following in the footsteps of Andrew, I may start posting NLP-related questions I receive by email to the blog (if the answer is sufficiently interesting). [224001070030] |At least its a consistent source of fodder. [224001070040] |Here's one from today: [224001070050] |I am looking for a task, preferably in nlp and info retriv, in which we have labeled data where some are completely observed (x,h) and some are partially observed x. (h is a latent variable and is observed for some data points and not observed for others) Briefly, the data should be like {(x_i,h_i,y_i),(x_j,y_j)} where y is the class label. [224001070060] |I think there are a lot of such examples. [224001070070] |The first one to spring to mind is alignments for MT, but please don't use the Aachen alignment data; see Alex's ACL06 paper for one way to do semi-supervised learning when the majority of the work is coming from the unsupervised part, not from the supervised part. [224001070080] |You could try to do something like the so-called end-to-end system, but with semi-supervised training on the alignments. [224001070090] |Of course, building an MT system is a can of worms we might not all want to open. [224001070100] |Depending on exactly what you're looking for, there may be (many) other options. [224001070110] |One might be interested in the case where the ys are named entity tags or entities with coref annotated and the hs are parse trees (ala the BBN corpus). [224001070120] |By this reasoning, pretty much any pipelined system can be considered of this sort, provided that there is data annotated for both tasks simultaneously (actually, I think a very interesting research question is what to do when this is not the case). [224001070130] |Unfortunately, typically if the amount of annotated data differs between the two tasks, it is almost certainly going to be the "easier" task for which more data is available, and typically you would want this easier task to be the hidden task. [224001070140] |Other than that, I can't think of any obvious examples. [224001070150] |You could of course fake it (do parsing, with POS tags as the hidden, and pretend that you don't have fully annotated POS data, but I might reject such a paper on the basis that this is a totally unrealistic setting -- then again, I might not if the results were sufficiently convincing). [224001070160] |You might also look at multitask learning literature; I know that in such settings, it is often useful to treat the "hidden" variable as just another output variable that's sometimes present and sometimes not. [224001070170] |This is easiest to do in the case of neural networks, but can probably be generalized. [224001080010] |Future DUC Plans [224001080020] |(Guest post by Lucy Vanderwende -- Thanks, Lucy!) [224001080030] |There was a general sigh of relief upon hearing the proposal for the next DUC to occur in November 2008, giving us all a nice long lead time for system development. [224001080040] |The change comes about by co-locating DUC with TREC, which predictably occurs once a year, in November, rather than the erratic schedule that DUC has followed in the past few years. 
[224001080050] |As a result, system submissions will no longer be due at the same time as the deadline for submission to a major NLP conference: another deep sigh of relief. [224001080060] |Perhaps we will see more DUC workshop papers extended and published as conference publications, now that there is more time to do system analysis after the DUC results have been made available. [224001080070] |Did you know that we should send NIST a reference to every paper published that uses the DUC data? [224001080080] |Co-locating with TREC has another change in store for DUC (though not yet set in stone). [224001080090] |DUC will, likely, have two tracks: summarization and question-answering (the QA track will move out of TREC and into DUC). [224001080100] |At the DUC2007 workshop, Hoa Trang Dang gave us an overview of the types of questions and question templates that are currently the focus of TREC QA. [224001080110] |Workshop participants were very interested, and the opportunities for synergy between the two communities seem plentiful. [224001080120] |For summarization, DUC will continue to have a main task and a pilot task. [224001080130] |NIST's proposal is for the 2008 main task to be Update Summarization (producing short, ~100-word, multi-document summaries in the context of a set of earlier articles); everyone at the workshop was excited by this proposal, and several groups have already worked on update summarization since it was the pilot for 2007. [224001080140] |NIST's proposal for the 2008 pilot task is summarization of opinions from blogs; this is less certain and there are some remaining questions, for example, whether it would be "single blog" or "multiple blog" summarization, summarization of the initial blog post or blog post + comments, and what the summary size ought to be. [224001080150] |There was a lot of enthusiasm for this topic, especially because there was a TREC track on blogs in which for the last two years they have tagged blogs as positive or negative wrt opinions, i.e., there's at least some training data. [224001080160] |For more information, check back with the main website for DUC. [224001090010] |Non-linear models in NLP [224001090020] |If you talk to (many) "real" machine learning people, they express profound disbelief that almost everything we do in NLP is based on linear classifiers (maxent/lr, svms, perceptron, etc.). [224001090030] |We only rarely use kernels and, while decision trees used to be popular, they seem to have fallen out of favor. [224001090040] |Very few people use them for large-scale apps. [224001090050] |(Our NYU friends are an exception.) [224001090060] |There are two possible explanations for this. [224001090070] |(1) We really only need linear models; (2) we're too lazy to use anything other than linear models (or, alternatively, non-linear models don't scale). [224001090080] |My experience tells me that for most of our sequence-y problems (parsing, tagging, etc.), there's very little to be gained by moving to, eg., quadratic SVMs. [224001090090] |I even tried doing NE tagging with boosted decision trees under Searn, because I really wanted it to work nicely, but it failed. [224001090100] |I've also pondered the idea of making small decision trees with perceptrons on the leaves, so as to account for small amounts of non-linearity. [224001090110] |Using default DT construction technology (eg., information gain), this doesn't seem to help either.
[224001090120] |(Ryan McDonald has told me that other people have tried something similar and it hasn't worked for them either.) [224001090130] |Perhaps this is because IG is the wrong metric to use (there exist DTs for regression with linear models on the leaves and they are typically learned so as to maximize the "linearness" of the underlying data, but this is computationally too expensive). [224001090140] |One counter-example is the gains that people have gotten by using latent variable models (eg., Koo and Collins), which are essentially non-linearified linear models. [224001090150] |In somewhat a similar vein, one could consider "edge" features in CRFs (or any structured prediction technique) to be non-linear features, but this is perhaps stretching it. [224001090160] |Part of this may be because over time we've adapted to using features that don't need to be non-linearified. [224001090170] |If we went back and treated each character in a word as a single feature and then required the learning algorithm to recover what important features (like, "word is 'Bush'") then clearly non-linearity would be required. [224001090180] |This is essentially what vision people do, and exactly the cases where things like deep belief networks really shine. [224001090190] |But so long as we're subjecting our learning algorithms to featuritis (John Langford's term), perhaps there's really not much to gain. [224001100010] |WhatToSee [224001100020] |I've been using a small number of Perl scripts for helping me decide what papers to read and I just put them up on the web for people to play with. [224001100030] |See http://hal3.name/WhatToSee. [224001100040] |It's currently queued with a few recent years of ACL, NAACL, ICML and NIPS. [224001100050] |Yes, it would be more useful if conferences were to publish the proceedings before the conference, but don't complain to me about that. [224001100060] |Feel free to submit additional indices if you want (an index has to be an online webpage that contains links that point to online .PDF files -- if no such page exists, you can always create one on your own webpage and point to that). [224001100070] |There are probably ways to break it, but hopefully that won't happen frequently. [224001110010] |Whence JCLR? [224001110020] |Journal publication is not too popular for NLPers -- we tend to be a conference driven bunch. [224001110030] |While I could care less about some arguments for journals (eg., the folks on tenure committees like them), I do feel that they serve a purpose beyond simply acting as an archive (which things like arxiv.org, the ACL anthology, citeseer, rexa, etc. do anyway). [224001110040] |In particular, a journal paper is often a place where you get to really set the stage for your problem, describe your algorithms so that they're actually reimplementable, and go in to serious error analysis. [224001110050] |Certainly not every paper that appears in a *ACL should continue on to the journal path, but many times a handful of papers could be merged. [224001110060] |One significant problem is that we're currently really limited in our choice of publication venues. [224001110070] |Computational Linguistics (MIT Press) is definitely the place to publish a journal paper if you can. [224001110080] |Unfortunately, CL only puts out four issues per year, each with about 4-5 papers. [224001110090] |Sure there aren't hundreds of good papers per year, but I have to believe there are more than 16-20. 
[224001110100] |Moreover, I don't feel that CL actually mirrors the *ACL proceedings -- there are many papers published in CL that I don't think match with the general sensitivities of *ACL. [224001110110] |In addition to the small number of CL papers, the turn around time is quite slow. [224001110120] |I was personally very impressed with my turnaround time two years ago (date of submission -> date of publication was about a year) and I know that Robert Dale (who's editing now) has done a lot to try to improve this. [224001110130] |But still, a year is a long time. [224001110140] |And I've heard of papers that take several years to get through. [224001110150] |Finally, it's not open. [224001110160] |I hate pay-for-access journals almost as much as I had pay-for-access conference proceedings. [224001110170] |Sure, if you attend an *ACL you get it for "free" and most universities have agreements, but this is more a principle thing than a practical thing. [224001110180] |Things were similar in machine learning land about six years ago (though in fact I think they were worse). [224001110190] |The big journal there was Machine Learning (published by Springer). [224001110200] |They had roughly the same problems, to the extent that a large fraction of the editorial board resigned to found the Journal of Machine Learning Research (JMLR). [224001110210] |JMLR has since become very successful, publishes dozens of papers per year, and has incredibly quick turnaround (I have seen a journal version of a NIPS paper appear in JMLR before NIPS even happens). [224001110220] |The creation of JMLR was greatly assisted by the SPARC group, which helps fledgling journal get off the ground. [224001110230] |I would love to see a similar thing happen in the NLP community. [224001110240] |I, personally, cannot make this happen (I don't have enough weight to throw around), but in talking to colleagues (the majority of whom also don't have enough weight) this seems to be something that many people would be in favor of. [224001110250] |I don't think it has to be a "JCLR is better than CL" sort of thing; I think it's very possible for both to co-exist, essentially serving slightly different purposes for slightly different communities. [224001110260] |In particular, aside from fast turnaround and online pubs, some things that I would love to see happen with such a journal are: Strongly encouraged sharing of code/data (if one could build in some sort of copyright protection for private data, this would be even better since it would let more people share); and a built-in board for paper discussion (probably with membership); ability for authors to easily submit addenda. [224001110270] |A while back I went through the SPARC suggestions of how to begin such a thing and it's very non-trivial. [224001110280] |But it's doable. [224001110290] |And I'd be willing to help. [224001110300] |The biggest thing that would be required would be a bunch of people with white hair who are willing to commit body-and-soul to such a move. [224001120010] |ICML 2007 papers up [224001120020] |See here; there seem to be a lot of good-looking papers (I just printed about 20). [224001120030] |It's already indexed on WhatToSee so go ahead and try that out. [224001120040] |Here are top words from titles: [224001120050] |
  • learn (51) -- shocking!
  • model (16)
  • kernel (16) -- and I thought this was just a fad
  • cluster (13)
  • algorithm (13)
  • structur (10) -- this is both for "structured prediction" as well as "structure learning" as well as other things
  • machin (10)
  • function (10) -- as in "learning X functions" -- X is similarity or distance or basis or...
  • data (10)
  • supervis (8)
  • process (8) -- both Gaussian and Dirichlet
  • linear (8)
  • featur (8)
  • discrimin (8)
  • classif (8)
  • analysi (8)
[224001150010] |Open Access CL Proposal [224001150020] |Following up on the Whence JCLR discussion, Stuart Shieber, Fernando Pereira, Ryan McDonald, Kevin Duh and I have just submitted a proposal for an open access version of CL to the ACL exec committee, hopefully to be discussed in Prague. [224001150030] |In the spirit of open access, you can read the official proposal as well as see discussion that led up to it on our wiki. [224001150040] |Feel free to email me with comments/suggestions, post them here, or bring them with you to Prague! [224001160010] |Tracking the State of the Art [224001160020] |I just received the following email from Yoav Goldberg: [224001160030] |I believe a resource well needed in the ACL community is a "state-of-the-art-repository", that is a public location in which one can find information about the current state-of-the-art results, papers and software for various NLP tasks (e.g. NER, Parsing, WSD, PP-Attachment, Chunking, Dependency Parsing, Summarization, QA, ...). [224001160040] |This will help newcomers to the field to get the feel of "what's available" in terms of both tasks and available tools, and will allow active researchers to keep current on fields other than their own. [224001160050] |For example, I am currently quite up to date with what's going on with parsing, PoS tagging and chunking (and of course the CoNLL shared tasks are great when available, yet in many cases not updated enough), but I recently needed to do some Anaphora Resolution, and was quite lost as for where to start looking... [224001160060] |I think the ACL Wiki is an ideal platform for this, and if enough people will show some interest, I will create a "StateOfTheArt" page and start populating it. [224001160070] |But, before I do that, I would like to (a) know if there really is an interest in something like this and (b) hear any comments you might have about it (how you think it should be organized, what should be the scope, how it can be advertised other than in this blog, etc). [224001160080] |I find this especially amusing because this is something that I'd been planning to blog about for a few weeks and just haven't found the time! [224001160090] |I think that this is a great idea. [224001160100] |If we could start a community effect where every time you publish a paper with new results on a common task, you also publish those results on the wiki, it would make life a lot easier for everyone. [224001160110] |I would suggest that the pages essentially consist of a table with the following columns: paper reference (and link), scores in whatever the appropriate metric(s) are, brief description of extra resources used. [224001160120] |If people feel compelled, they would also be encouraged to write a paragraph summary under the table with a bit more detail. [224001160130] |I would certainly agree to use this and, to support the effort, I would be happy to go back through all my old papers and post their results on this page. [224001160140] |It would be nice if someone (Yoav perhaps???) could initialize pages for the main tasks, so that that burden is lifted. [224001160150] |I'm sure other suggestions would be taken to heart, so comment away! [224001170010] |First-best, Balanced F and All That [224001170020] |Our M.O. in NLP land is to evaluate our systems in a first-best setting, typically against a balanced F measure (balanced F means that precision and recall are weighted equally).
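(For reference, the general weighted F is sketched below; balanced F is just the beta = 1 case, which is symmetric in precision and recall -- exactly the property that hides the trade-off.)

    def f_beta(precision, recall, beta=1.0):
        # Weighted harmonic mean of precision and recall: beta = 1 is balanced F,
        # beta > 1 favors recall, beta < 1 favors precision.
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # f_beta(0.70, 0.50) == f_beta(0.50, 0.70) ~= 0.583: the P=70, R=50 and
    # P=50, R=70 systems discussed below get exactly the same balanced F.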
[224001170030] |Occasionally we see precision/recall curves, but this is typically in straightforward classification tasks, not in more complex applications. [224001170040] |Why is this (potentially) bad? [224001170050] |Well, it's typically because our evaluation criteria are uncalibrated against human use studies. [224001170060] |In other words, picking on balanced F for a second, it may turn out that for some applications it's better to have higher precision, while for others it's better to have higher recall. [224001170070] |Reporting a balanced F removes our ability to judge this. [224001170080] |Sure, one can report precision, recall and F (and people often do this), but this doesn't give us a good sense of the trade-off. [224001170090] |For instance, if I report P=70, R=50, F=58, can I conclude that I could just as easily get P=50, R=70, F=58 or P=R=F=58 using the same system but tweaked differently? [224001170100] |Likely not. [224001170110] |But this seems to be the sort of conclusion we like to draw, especially when we compare across systems by using balanced F as a summary. [224001170120] |The issue is that it's essentially impossible for any single metric to capture everything we need to know about the performance of a system. [224001170130] |This holds even further up the line, in applications like MT. [224001170140] |The sort of translations that are required to do cross-lingual IR, for instance, are of a different nature than those that are required to put a translation in front of a human. [224001170150] |(I'm told that for cross-lingual IR, it's hard to beat just doing "query expansion" using model 1 translation tables.) [224001170160] |I don't think the solution is to proliferate error metrics, as has been seemingly popular recently. [224001170170] |The problem is that once you start to apply 10 different metrics to a single problem (something I'm guilty of myself), you actually cease to be able to understand the results. [224001170180] |It's reasonable for someone to develop a sufficiently deep intuition about a single metric, or two metrics, or maybe even three metrics, to be able to look at numbers and have an idea what they mean. [224001170190] |I feel that this is pretty impossible with ten very diverse metrics. [224001170200] |(And even if possible, it may just be a waste of time.) [224001170210] |One solution is to evaluate at different "cutoffs," ala precision/recall curves or ROC curves. [224001170220] |The problem is that while this is easy for thresholded binary classifiers (just change the threshold), it is less clear for other classifiers, much less for complex applications. [224001170230] |For instance, in my named entity tagger, I can trade off precision/recall by postprocessing the weights and increasing the "bias" toward the "out of entity" tag. [224001170240] |While this is an easy hack to accomplish, there's nothing to guarantee that this is actually doing the right thing. [224001170250] |In other words, I might be able to do much better were I to directly optimize some sort of unbalanced F. [224001170260] |For a brain teaser, how might one do this in Pharaoh? [224001170270] |(Solutions welcome in comments!) [224001170280] |Another option is to force systems to produce more than a first-best output. [224001170290] |In the limit, if you can get every possible output together with a probability, you can compute something like expected loss.
[224001170300] |This is good, but limits you to probabilistic classifiers, which makes life really hard in structure land where things quickly become #P-hard or worse to normalize. [224001170310] |Alternatively, one could produce ranked lists (up to, say, 100 best) and then look at something like precision at 5, 10, 20, 40, etc., as they do in IR. [224001170320] |But this presupposes that your algorithm can produce k-best lists. [224001170330] |Moreover, it doesn't answer the question of how to optimize for producing k-best lists. [224001170340] |I don't think there's a one-size-fits-all answer. [224001170350] |Depending on your application and your system, some of the above options may work. [224001170360] |Some may not. [224001170370] |I think the important thing to keep in mind is that it's entirely possible (and likely) that different approaches will be better at different points of trade-off. [224001180010] |3 Small Newses [224001180020] |(Yeah, I know, "news" isn't a count noun.) [224001180030] |
  • WhatToSee has been updated with ACL and EMNLP 2007, so figure out what talks you want to go to!
  • Yoav has set up a State Of The Art wiki page (see previous blog post on this topic)... please contribute!
  • The proposal for making CL an open-access journal has been accepted, so we get our 5 minutes of fame -- come by to support (or not). [224001180060] |The business meeting is scheduled for 1:30pm on 26 June.
[224001190010] |ACL Business Meeting Results [224001190020] |This afternoon here in Prague was the ACL business meeting. [224001190030] |A few interesting points were brought up. [224001190040] |As we all know, ACL will be in Columbus, OH next year. [224001190050] |It will actually be joint with HLT, which means that (as I previously expected), there won't be a separate HLT next year. [224001190060] |Combined with the fact that when ACL is in North America, there is no NAACL, it looks like there will only be one North American conference next year (unless EMNLP--which is now officially a conference--chooses not to co-locate with ACL/HLT). [224001190070] |The paper submission deadline looks to be around 11 Jan -- calls will be out in September. [224001190080] |EACL 2008 will be in Greece. [224001190090] |The new information: ACL 2009 will be in Singapore, which was one of my two guesses (the other being Beijing). [224001190100] |This should be a really nice location, though I'm saddened since I've already been there. [224001190110] |A few changes have been proposed for ACL 2008 in terms of reviewing. [224001190120] |None will necessarily happen, but for what it's worth I've added my opinion here. [224001190130] |If you have strong feelings, you should contact the board, or perhaps Kathy McKeown, who is the conference chair. [224001190140] |
  • Conditional accepts and/or author feedback. [224001190150] |I'd be in favor of doing one of these, but not both (redundancy). [224001190160] |I'd prefer author feedback.
  • Increased poster presence with equal footing in the proceedings, ala NIPS. [224001190180] |I would also be in favor of this because already we are at four tracks and too much is going on. [224001190190] |Alternatively, we could reduce the number of accepted papers, which I actually don't think would be terrible, but going to posters seems like a safer solution. [224001190200] |The strongest argument against this is a personality one: ACLers tend to ignore poster sessions. [224001190210] |Something would have to be done about this. [224001190220] |Spotlights may help.
  • Wildcards from senior members. [224001190240] |The idea would be that "senior" (however defined?) members would be able to play a single wildcard to accept an otherwise controversial paper. [224001190250] |I'm probably not in favor of this, partially because it seems to introduce annoying political issues "What? [224001190260] |I'm not senior enough for you?" [224001190270] |(I wouldn't say that, since I'm not, but...); partially because it seems that this is essentially already the job of area chairs. [224001190280] |There may be a problem here, but it seems that there are better, more direct solutions.
  • Something having to do with extra reviewing of borderline papers. [224001190300] |I didn't quite get what was meant here; it didn't seem that the proposal was to have fewer than 3 reviews, but to ask for more in case of confusion. [224001190310] |I would actually argue for being more extreme: have a single reviewer (maybe 2) do an initial round of rejects and then get three reviews only for those papers that have any chance at all of being accepted. [224001190320] |I doubt this idea will fly, but it would be interesting to check in previous papers how many got in that had one reviewer give a really bad score... how many got in where two reviewers gave a really bad score. [224001190330] |If these numbers are really, really low, then it should be safe. [224001190340] |Anyone have access to this data???
  • [224001190350] |Finally, we talked about the "grassroots" efforts. [224001190360] |The proposals were: archive videos, augment the anthology to include link structure, augmenting the anthology with tech reports and journals (given permission from publishers), and ours to make CL open access. [224001190370] |Speaking with respect to ours, the only big complains were with respect to typesetting information, but several people did voice support, both in the meeting and in person. [224001190380] |I remain hopeful! [224001190390] |I'll post more about technical content after the conference. [224001200010] |ACL and EMNLP 2007 Report [224001200020] |ACL/EMNLP just concluded. [224001200030] |Overall, I thought both conferences were a success, though by now I am quite ready to return home. [224001200040] |Prague was very nice. [224001200050] |I especially enjoyed Tom Mitchell's invited talk on linking fMRI experiments to language. [224001200060] |They actually use lexical semantic information to be able to identify what words people are thinking about when they scan their brains. [224001200070] |Scary mind-reading stuff going on here. [224001200080] |I think this is a very interesting avenue of research---probably not one I'll follow myself, but one that I'm really happy someone is persuing. [224001200090] |This is a really long post, but I hope people (both who attended and who didn't) find it useful. [224001200100] |Here are some highlights, more or less by theme (as usual, there are lots of papers I'm not mentioning, the majority of which because I didn't see them): [224001200110] |Machine Translation: [224001200120] |The overall theme at the Stat-MT workshop was that it's hard to translate out of domain. [224001200130] |I didn't see any conclusive evidence that we've figure out how to do this well. [224001200140] |Since domain adaptation is a topic I'm interested in, I'd like to have seen something. [224001200150] |Probably the most interesting paper I saw here was about using dependency order templates to improve translation, but I'm actually not convinced that this is actually helping much with the domain issues: the plots (eg., Fig 7.1) seem to indicate that the improvement is independent of domain when compared to the treelet system. [224001200160] |They do do better than phrases though. [224001200170] |There was a cute talk about trying character-based models for MT. [224001200180] |Doesn't seem like this does much of interest, and it only really works for related languages, but it was nice to see someone try. [224001200190] |This idea was echoed later in EMNLP but none of the talks there really stood out for me. [224001200200] |The only other theme that I picked up at Stat-MT (I didn't stay all day) was that a lot of people are doing some form of syntactic MT now. [224001200210] |Phrase-based seems to be on its way out (modulo the next paragraph). [224001200220] |There were also a lot of talks using Philipp Koehn's new Moses translation system, both at Stat-MT as well as at ACL and EMNLP. [224001200230] |I won't link you to all of them because they all tried very similar things, but Philipp's own paper is probably a good reference. [224001200240] |The idea is to do factored translation (ala factored language modeling) by splitting both the input words and the output words into factors (eg., lemma+morphology) and translating each independently. [224001200250] |The plus is that most of your algorithms for phrase-based translation remain the same, and you can still use max-Bleu traning. 
[224001200260] |These are also the cons. [224001200270] |It seems to me (being more on the MT side) that what we need to do is rid ourselves of max-Bleu, and then just switch to a purely discriminative approach with tons of features, rather than a linear combination of simple generative models. [224001200280] |There were also a lot of word-alignment talks. [224001200290] |The most conclusive, in my mind, was Alex Fraser's (though I should be upfront about bias: he was my officemate for 5 years). [224001200300] |He actually introduced a new generative alignment model (i.e., one that does have "IBM" in the name) that accounts for phrases directly in the model (no more symmetrization, either). [224001200310] |And it helps quite a bit. [224001200320] |There was also a paper on alignments tuned for syntax by John DeNero and Dan Klein that I liked (I tried something similar previously in summarization, but I think their model makes more sense). [224001200330] |(A second bias: I have known John since I was 4 years old.) [224001200340] |The google folks had a paper on training language models on tera-word corpora. [224001200350] |The clever trick here is that if your goal is a LM in a linear model (see below), it doesn't matter if its normalized or not. [224001200360] |This makes the estimation much easier. [224001200370] |They also (not too surprisingly) find that when you have lots of words, backoff doesn't matter as much. [224001200380] |Now, if only the google ngram corpus included low counts. [224001200390] |(This paper, as well as many other google papers both within and without MT, makes a big deal of how to do all this computation in a map-reduce framework. [224001200400] |Maybe it's just me, but I'd really appreciate not reading this anymore. [224001200410] |As a functional programming languages guy, map-reduce is just map and fold. [224001200420] |When I've written my code as a map-fold operation, I don't put it in my papers... should I?) [224001200430] |The talk that probably got the most attention at EMNLP was on WSD improving machine translation by Marine Carpuat and Dekai Wu. [224001200440] |I think Marine and Dekai must have predicted a bit of difficulty from the audience, because the talk was put together a bit tongue in cheek, but overall came across very well. [224001200450] |The key to getting WSD to help is: (a) integrate in the decoder, (b) do WSD on phrases not just words, and (c) redefine the WSD task :). Okay, (c) is not quite fair. [224001200460] |What they do is essentially train a classifier to do phrase prediction, rather than just using a t-table or a phrase-table. [224001200470] |(Actually, Berger and the Della Pietras did this back in the 90s, but for word translation.) [224001200480] |Daniel Marcu nicely complimented that he would go back to LA and tell the group that he saw a very nice talk about training a classifier to predict phrases based on more global information, but that he may not mention that it was called WSD. [224001200490] |They actually had a backup slide prepared for this exact question. [224001200500] |Personally, I wouldn't have called it WSD if I had written the paper. [224001200510] |But I don't think it's necessarily wrong to. [224001200520] |I liken it to David Chiang's Hiero system: is it syntactic? [224001200530] |If you say yes, I think you have to admit that Marine and Dekai's system uses WSD. [224001200540] |Regardless of what you call it, I think this paper may have quite a bit of impact. [224001200550] |(p.s., MT-people, listen up. 
[224001200560] |When you have a model of the form "choose translation by an argmax over a sum over features of a weight times a feature value", please stop referring to it as a log-linear model. [224001200570] |It's just a linear model.) [224001200580] |Machine Learning: [224001200590] |Daisuke Okanohara and Jun'ichi Tsujii presented a paper on learning discriminative language models with pseudo-negative samples. [224001200600] |Where do they get their negative samples? [224001200610] |They're just "sentences" produced by a trigram language model! [224001200620] |I find it hard to believe no one has done this before because it's so obvious in retrospect, but I liked it. [224001200630] |However, I think they underplay the similarity to the whole-sentence maxent language models from 2001. [224001200640] |Essentially, when one trains a WSMELM, one has to do sampling to compute the partition function. [224001200650] |The samples are actually generated by a base language model, typically a trigram. [224001200660] |If you're willing to interpret the partition function as a collection of negative examples, you end up with something quite similar. [224001200670] |There were two papers on an application of the matrix-tree theorem to dependency parsing, one by the MIT crowd, the other by the Smiths. [224001200680] |A clever application (by both sets of authors) of the m-t theorem essentially allows you to efficiently (cubic time) compute marginals over dependency links in non-projective trees. [224001200690] |I think both papers are good and if this is related to your area, it's worth reading both. [224001200700] |My only nitpick is the too-general title of the MIT paper :). [224001200710] |John Blitzer, Mark Dredze and Fernando Pereira had a very nice paper on an application of SCL (a domain adaptation technique) to sentiment classification. [224001200720] |Sentiment is definitely a hot topic (fad topic?) right now, but it's cool to see some fancy learning stuff going on there. [224001200730] |If you know SCL, you know roughly what they did, but the paper is a good read. [224001200740] |One of my favorite papers at ACL was on Dirichlet process models for coreference resolution by Aria Haghighi and Dan Klein. [224001200750] |I'd say you should probably read this paper, even if it's not your area. [224001200760] |One of my favorite papers at EMNLP was on bootstrapping for dependency parsing by David Smith and Jason Eisner. [224001200770] |They use a clever application of Renyi entropy to obtain a reasonable bootstrapping algorithm. [224001200780] |I was not aware of this, but during the question period, it was raised that apparently these entropy-based measures can sometimes do funky things (make you too confident in wrong predictions). [224001200790] |But I think this is at least somewhat true for pretty much all semi-supervised or bootstrapping models. [224001200800] |Random other stuff: [224001200810] |I learned about system combination from the BBN talk. [224001200820] |The idea here is to get lots of outputs from lots of models and try to combine the outputs in a meaningful way. [224001200830] |The high-level approach for translation is to align all the outputs using some alignment technique. [224001200840] |Now, choose one as a pivot. [224001200850] |For each aligned phrase in the pivot, try replacing it with the corresponding phrase from one of the other outputs. [224001200860] |It's kinda crazy that this works, but it helps at least a few bleu points (which I'm told is a lot).
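To make the pivot idea a bit more concrete, here is a toy sketch in Python. This is my own reconstruction, emphatically not BBN's system; I assume the phrase alignments between the pivot and the other outputs, as well as some scoring function, are already given, and the vote-based scorer below is invented purely for illustration.

    def combine(pivot_phrases, aligned_alternatives, score):
        """pivot_phrases: the pivot hypothesis, segmented into phrases.
        aligned_alternatives: for each pivot position, candidate phrases from
        the other systems aligned to that position (alignment assumed given).
        score: maps a full hypothesis (list of phrases) to a number, e.g. a
        language-model score plus a consensus bonus (hypothetical)."""
        combined = list(pivot_phrases)
        for i, candidates in enumerate(aligned_alternatives):
            best, best_score = combined[i], score(combined)
            for cand in candidates:
                trial = combined[:i] + [cand] + combined[i + 1:]
                if score(trial) > best_score:
                    best, best_score = cand, score(trial)
            combined[i] = best  # greedily keep the best replacement at slot i
        return " ".join(combined)

    # Purely illustrative scorer: prefer phrases that more systems proposed.
    votes = {"the cat": 3, "a cat": 1, "sat down": 2, "sits": 1}
    print(combine(["a cat", "sits"],
                  [["the cat"], ["sat down"]],
                  lambda hyp: sum(votes.get(p, 0) for p in hyp)))
    # prints: the cat sat down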
[224001200870] |On principle I don't like the idea. [224001200880] |It seems like just a whole lot of engineering. [224001200890] |But if you're in the "get good results" game, it seems like just as good a strategy as anything else. [224001200900] |(I'm also curious: although currently quite ad-hoc, this seems a lot like doing an error-correcting output code. [224001200910] |Does anyone know if it has been formalized as such? [224001200920] |Do you get any gains out of this?) [224001200930] |My final blurb is to plug a paper by MSR on single-document summarization. [224001200940] |Yes, that's right, single-document. [224001200950] |And they beat the baseline. [224001200960] |The cool thing about this paper is that they use the "highlights" put up on many CNN news articles as training. [224001200970] |Not only are these not extracts, but they're also "out of order." [224001200980] |My sense from talking to the authors is that most of the time a single highlight corresponds to one sentence, but is simplified. [224001200990] |I actually downloaded a bunch of this data a month ago or so (it's annoying -- CNN says that you can only download 50 per day and you have to do it "manually" -- it's unclear that this is actually enforceable or if it would fall under fair use, but I can understand from Microsoft's perspective it's better safe than sorry). [224001201000] |I was waiting to collect a few more months of this data and then release it for people to use, so check back later. [224001201010] |(I couldn't quite tell if MSR was going to release their version or not... if so, we should probably talk and make sure that we don't waste effort.) [224001201020] |Wrapping Up: [224001201030] |Since we're ending on a summarization note, here's a challenge: create a document summarization system that will generate the above post from the data in the anthology. [224001201040] |(Okay, to be fair, you can exclude the information that's obtained from conversations and questions. [224001201050] |But if we start videotaping ACL, then that should be allowable too.) [224001201060] |I put pictures from Prague up on my web site; feel free to email me if you want a high-res version of any of them. [224001201070] |Also, if I was talking to you at the conference about sequential Monte Carlo, email me -- I have new info for you, but I can't remember who you are :). [224001210010] |Collapsed Gibbs [224001210020] |(The contents of this post are largely due to a conversation with Percy Liang at ACL.) [224001210030] |I'm a big fan of Gibbs sampling for Bayesian problems, just because it's so darn easy. [224001210040] |The standard setup for Gibbs sampling over a space of variables a,b,c (I'll assume there are no exploitable independences) is: [224001210050] |
  • Draw a conditioned on b,c
  • [224001210060] |
  • Draw b conditioned on a,c
  • [224001210070] |
  • Draw c conditioned on a,b
• [224001210080] |This is quite a simple story that can, in some cases, be "improved." [224001210090] |For instance, it is often possible to jointly draw a and b, yielding: [224001210100] |
  • Draw a,b conditioned on c
  • [224001210110] |
  • Draw c conditioned on a,b
• [224001210120] |This is the "blocked Gibbs sampler." [224001210130] |Another variant, which is commonly used in our community, is when one of the variables (say, b) can be analytically integrated out, yielding: [224001210140] |
  • Draw a conditioned on c
  • [224001210150] |
  • Draw b conditioned on a,c
  • [224001210160] |
  • Draw c conditioned on a,b
• [224001210170] |This is the "collapsed Gibbs sampler." [224001210180] |In fact, we can often collapse b out entirely and, in cases where we don't actually care about its value, we get: [224001210190] |
  • Draw a conditioned on c
  • [224001210200] |
  • Draw c conditioned on a
• [224001210210] |To make this concrete, consider Mark Johnson's EMNLP paper on unsupervised part of speech tagging. [224001210220] |Here, there are essentially two types of variables. [224001210230] |One set is the tags assigned to each word. [224001210240] |The second set is the parameters (eg., probability of word given tag). [224001210250] |The standard Gibbs sampler would perform the following actions: [224001210260] |
  • For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and the "probability of word given tag" parameters.
  • [224001210270] |
  • For each tag type (not token), draw a multinomial parameter vector for "probability of word given tag" conditioned on the current assignment of tags to words.
  • [224001210280] |It turns out that if our prior on "p(word|tag)" is Dirichlet, we can collapse out the second step by integrating over these parameters. [224001210290] |This yields a one-step collapsed Gibbs sampler: [224001210300] |
  • For each word token, draw a tag for that word conditioned on the word itself, the tag to the left, and all other current assignments of tags to words.
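Here is a minimal, self-contained sketch of one sweep of such a collapsed sampler for a bigram HMM tagger with symmetric Dirichlet priors on the emission and transition distributions. This is my own toy illustration, not Mark Johnson's code; the hyperparameters, the boundary handling, and the decision to rebuild counts per sentence are invented for clarity, and I drop a small correction term in the transition factor (noted in the comments).

    import random
    from collections import defaultdict

    def collapsed_gibbs_sweep(words, tags, K, V, alpha=0.1, beta=0.1):
        """One sweep over a single sentence.
        words: word ids in [0, V); tags: current tag assignment (mutated);
        K: number of tag types; alpha, beta: Dirichlet hyperparameters for
        the emission and transition distributions."""
        BOS = K  # boundary "tag" at the start and end of the sentence

        # Rebuild count tables from the current assignment (a real sampler
        # would maintain corpus-wide counts incrementally).
        emit, emit_tot = defaultdict(int), defaultdict(int)
        trans, trans_tot = defaultdict(int), defaultdict(int)
        prev = BOS
        for w, t in zip(words, tags):
            emit[t, w] += 1; emit_tot[t] += 1
            trans[prev, t] += 1; trans_tot[prev] += 1
            prev = t
        trans[prev, BOS] += 1; trans_tot[prev] += 1

        for i, w in enumerate(words):
            left = tags[i - 1] if i > 0 else BOS
            right = tags[i + 1] if i + 1 < len(words) else BOS
            t_old = tags[i]

            # Remove token i from the counts.
            emit[t_old, w] -= 1; emit_tot[t_old] -= 1
            trans[left, t_old] -= 1; trans_tot[left] -= 1
            trans[t_old, right] -= 1; trans_tot[t_old] -= 1

            # Unnormalized full conditional for each candidate tag: emission
            # term, transition from the left tag, and transition into the
            # right tag. (The correction that applies when the candidate
            # coincides with a neighboring tag is ignored to keep things simple.)
            scores = []
            for t in range(K):
                s = (emit[t, w] + alpha) / (emit_tot[t] + alpha * V)
                s *= trans[left, t] + beta
                s *= (trans[t, right] + beta) / (trans_tot[t] + beta * (K + 1))
                scores.append(s)

            # Sample a new tag proportional to the scores.
            r = random.random() * sum(scores)
            t_new, acc = 0, scores[0]
            while acc < r:
                t_new += 1
                acc += scores[t_new]
            tags[i] = t_new

            # Add token i back in with its new tag.
            emit[t_new, w] += 1; emit_tot[t_new] += 1
            trans[left, t_new] += 1; trans_tot[left] += 1
            trans[t_new, right] += 1; trans_tot[t_new] += 1
        return tags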
• [224001210310] |My general M.O. has been: if you can collapse out a variable, you should. [224001210320] |This seems intuitively reasonable because you're now sampling over a smaller space and so it should be easier. [224001210330] |The point of this post is to acknowledge that this may not always be the case. [224001210340] |In fact, it's sort of obvious in retrospect. [224001210350] |There are many models for which auxiliary variables are added just to make the sampling easier. [224001210360] |This is, in effect, un-collapsing the sampler. [224001210370] |If "always collapse" were a good rule to follow, then people would never add auxiliary variables. [224001210380] |While this is a convincing argument (for me, at least), it's not particularly intuitive. [224001210390] |I think that the intuition comes from considering the mixing rate of the Markov chain specified by the standard Gibbs sampler and the collapsed Gibbs sampler. [224001210400] |It seems that essentially what's happening by using a collapsed sampler is that the variance of the Markov chain is decreasing. [224001210410] |In the tagging example, consider a frequent word. [224001210420] |In the collapsed setting, the chance that the tag for a single token of this word will change in a Gibbs step is roughly inversely proportional to its term frequency. [224001210430] |This means that the collapsed sampler is going to have a tendency to get stuck (and this is exactly what Mark's results seem to suggest). [224001210440] |On the other hand, in the uncollapsed case, it is reasonably plausible that a large number of tags could change for a single word type "simultaneously" due to a slightly different draw of the "p(word|tag)" parameter vector. [224001210450] |(Interestingly, in the case of LDA, the collapsed sampler is the standard approach and my sense is that it is actually somehow not causing serious problems here. [224001210460] |But I actually haven't seen experiments that bear on this.) [224001230010] |What's the Use of a Crummy Translation? [224001230020] |I'm currently visiting Microsoft Research Asia (in Beijing) for two weeks (thanks for having me, guys!). [224001230030] |I speak basically no Chinese. [224001230040] |I took one half of a semester about 6 years ago. [224001230050] |I know much more Japanese; enough so that I can read signs that indicate direction, dates and times, but that's about it... the remainder is too divergent for me to make out at all (perhaps a native Japanese speaker would feel differently, but certainly not a gaijin like me). [224001230060] |My experience here has reminded me of a paper that Ken Church and Ed Hovy wrote almost 15 years ago now, Good Applications for Crummy Machine Translation. [224001230070] |I'm not sure how many people have read it recently, but it essentially makes the following point: MT should enter the user's world in small steps, only insofar as it is actually going to work. [224001230080] |To say that MT quality has improved significantly in 15 years is probably already an understatement, but it is still certainly far from something that can even compare to the translation quality of a human, even in the original training domain. [224001230090] |That said, I think that maybe we are a bit too modest as a community. [224001230100] |MT output is actually relatively readable these days, especially for relatively short input sentences.
[224001230110] |The fact that "real world companies" such as Google and LanguageWeaver seem to anticipate making a profit off of MT shows that at least a few crazies out there believe that it is likely to work well enough to be useful. [224001230120] |At this point, rather than gleefully shouting the glories of MT, I would like to point out the difference between the title of this post and the title of the Church/Hovy paper. [224001230130] |I want to know what to do with a crummy translation. [224001230140] |They want to know what to do with crummy machine translation. [224001230150] |This brings me back to the beginning of this post: my brief experience in Beijing. [224001230160] |(Discourse parsers: I challenge you to get that dependency link!) [224001230170] |
  • This voucher can not be encashed and can be used for one sitting only.
  • [224001230180] |
  • The management reserves the right of explanation.
  • [224001230190] |
  • Office snack is forbidden to take away.
  • [224001230200] |
  • Fizzwater bottles please recycle.
• [224001230210] |The first two are at my hotel, which is quite upscale; the second two are here on the fridge at Microsoft. [224001230220] |There are so many more examples, in subway stations, on the bus, on tourism brochures, in trains, at the airport, I could go on collecting these forever. [224001230230] |The interesting thing is that although two of these use words that aren't even in my vocabulary (encashed and fizzwater), one is grammatical but semantically nonsensical (what are they explaining?) and one is missing an indirect object (but if it had one, it would be semantically meaningless), I still know what they all mean. [224001230240] |Yes, they're sometimes amusing and worth a short chuckle, but overall the important points are gotten across: no cash value; you can be kicked out; don't steal snacks; recycle bottles. [224001230250] |The question I have to ask myself is: are these human translations really better than something a machine could produce? [224001230260] |My guess is that machine translation outputs would be less entertaining, but I have a hard time imagining that they would be less comprehensible. [224001230270] |I guess I want to know: if we're holding ourselves to the standard of a level of human translation, what level is this? [224001230280] |Clearly it's not the average translation level that large tourism companies in China hold themselves to. [224001230290] |Can we already beat these translations? [224001230300] |If so, why don't we relish this fact? [224001240010] |Explanatory Models [224001240020] |I am frequently asked the question: why does your application for solving XXX make such and such an error? [224001240030] |(You can easily replace "your" with any possessive noun and the statement remains valid.) [224001240040] |My standard answer is to shrug and say "who knows." [224001240050] |This is quite different from, for instance, work in pattern matching for information extraction (many other citations are possible). [224001240060] |In this setting, when the system makes an error, one can ask the system "what pattern caused this error." [224001240070] |You can then trace the pattern back to the source documents from which it came and obtain some understanding for what is going on. [224001240080] |This is frequently not the case for your generic sequence labeling algorithm. [224001240090] |If, say, a CRF misses a person name, what can you do about it? [224001240100] |Can you understand why it made the error? [224001240110] |More generally, if a model of any variety errs, can it say anything about why this error came to be? [224001240120] |One way to approach this problem would be to try to inspect the weights of the learned algorithm, but there are so many trade-offs going on internally that I've never been able to do this successfully (by hand, at least --- perhaps a clever tool could help, but I'm not sure). [224001240130] |An alternative that I've been thinking about recently (but probably won't work on because I haven't enough time right now) is instead to pose the question as: what is the minimal change to the input required so that I would have made the decision correctly? [224001240140] |I guess one way to think about this is to consider the case that a POS system misses tagging "Fred" in the sentence "Fred is not happy" as an NNP and instead calls it a "VBD." [224001240150] |Presumably we have a bunch of window features about Fred that give us its identity, prefixes and suffixes, etc.
[224001240160] |Perhaps if "Fred" had been "Harry" this wouldn't have happened because "Fred" has the spelling feature "-ed." (Ok, clearly this is a very crafted example, but you get the idea.) [224001240170] |The question is: how do you define minimal change in features? [224001240180] |If we're in an HMM (where we're willing to assume feature independence), then I don't think this is a big problem. [224001240190] |But in the case where "Fred" ends in "-ed", it also ends in "-d" and both of these make it look more like a VBD. [224001240200] |Such an explanatory system would ideally like to know that if "-d" weren't there, then neither would be "-ed", and use this for computing the minimal change. [224001240210] |It would also have to know that certain features are easier to change than others. [224001240220] |For instance, if it has only ever seen the word "Xavier" in the training data as an NNP, then it could also suggest that if the word were "Xavier" instead of "Fred" then this error would not have happened. [224001240230] |But this is sort of silly, because it gives us no information. [224001240240] |(I'm working under the assumption that we want to do this so that we can add/remove features to our model to help correct for errors [on development data :P].) [224001240250] |It seems like neither of these problems is insurmountable. [224001240260] |Indeed, just looking at something like feature frequency across the entire training set would give you some sense of which features are easy to change, as well as which ones are highly correlated. [224001240270] |(I guess you could even do this on unlabeled data.) [224001240280] |I feel like it's possible to create a methodology for doing this for a specific problem (eg., NE identification), but I'd really like to see some work on a more generic framework that can be applied to a whole host of problems (why did my MT system make that translation?). [224001240290] |Perhaps something already exists and I just haven't seen it. [224001250010] |Conferences: Costs and Benefits [224001250020] |I typically attend two or three conferences per year; usually NIPS (which has been in Vancouver since I started attending), and an ACL-related one; the third is typically a second ACL-related conference or ICML, depending on the year. [224001250030] |Typically two of these are domestic, one is international. [224001250040] |Domestic conferences cost me about $1500 and international ones vary, but Prague weighed in at around $4000. [224001250050] |This means that my travel costs (just for myself!) are about $5500-$7000 per year. [224001250060] |Moreover, this takes 2-3 weeks of my year (more than 5% of my non-vacation time). [224001250070] |When I was a student, this question never entered my mind (I seemed to have a nearly endless supply of money); now, I find myself wondering: are conferences worth the time and money investment? [224001250080] |I'll focus on international conferences because these are the biggest sink in terms of both money and time. [224001250090] |In particular, I'll consider Prague, which hosted both ACL and EMNLP. [224001250100] |Here's what I feel like I gained from this trip: [224001250110] |
  • (1) I saw some interesting papers presented.
  • [224001250120] |
  • (2) I saw some interesting invited talks (Tom Mitchell's stands out for me).
  • [224001250130] |
  • (3) I had semi-deep hallway conversations with 3 or 4 people.
  • [224001250140] |
  • (4) I had non-deep hallway conversations with probably ~20 people.
  • [224001250150] |
  • (5) I gave two presentations. [224001250160] |(The implication is that this may make me "more famous" and that this is a good thing for some reason :P.)
  • [224001250170] |
  • (6) I saw an area of the world that I hadn't yet been to.
  • [224001250180] |
  • (7) I spent a not insignificant amount of time socializing with ~20 friends who I pretty much only see at conferences.
  • [224001250190] |So the question is, was this worth just over $4 grand and 10 days of my life that could have been spent doing research (or taking a vacation)? [224001250200] |I have mixed feelings. [224001250210] |
  • (1) does not seem compelling -- for conferences close to me that I do not attend, I still read proceedings. [224001250220] |Sure, sometimes presentations are helpful and there's a bit of a serendipity aspect, but overall, I'd say this is something I could do in a day in the park with a copy of the proceedings.
  • [224001250230] |
  • (2) is important. [224001250240] |Especially when the invited talks are good and aren't just a long version of some paper presentation---i.e., when you can get a good sense of the overall research direction and the important long term results---I feel like these things are worth something.
  • [224001250250] |
  • (3) is important. [224001250260] |Some people say that hallway conversations are the most important; maybe it's just me, but it's pretty rare for me to have hallway conversations that are sufficiently deep to be really meaningful in the long run, but I'd say around 3 per conference is something that you can hope for. [224001250270] |At least with these, they seem to have either led to collaboration or at least new ideas to try out in my own work.
  • [224001250280] |
  • (4) provides good social networking... [224001250290] |I don't feel like these really change how I think about problems (and I think the same is true for the people I had such conversations with). [224001250300] |The only important thing here is that if you find out what new problems other people are working on, you can learn about new areas that may interest you.
  • [224001250310] |
  • (5) is nebulous to me; I feel like the key purpose in conference talks is advertisement. [224001250320] |It's a bit unclear what I'm advertising for---citations, perhaps?---but hopefully something I've done will save someone else some time, or will give them ideas of something to try or something along these lines. [224001250330] |But this is highly correlated with (1), which suggests that it's actually not particularly useful.
  • [224001250340] |
  • (6) shouldn't be underestimated, but if I compare taking a vacation with going to a conference, they're very different. [224001250350] |In particular, even at a conference where I a priori intend to spend a bunch of time touristing, I never seem able to accomplish this as much as I would like. [224001250360] |Of course, $4k out of grant money versus $4k out of personal money is very different.
  • [224001250370] |
  • (7) also shouldn't be underestimated, but like (6) is maybe best accomplished in other ways.
• [224001250380] |Based on this, I feel like overall the main benefits to going to a conference are: seeing invited talks, having deep hallway conversations, and a minor bit of socializing and serendipity. [224001250390] |The thing that occurred to me recently is that it's actually possible to achieve these things without going to conferences. [224001250400] |In particular, consider the following model. [224001250410] |I invite one or two "famous types" to my university to give invited talks. [224001250420] |Each of these would cost maybe $2000, but a lot (if not all) of this would be subsidized by the department. [224001250430] |So I get invited talks for (nearly) free; for safety, even say it costs me $1k. [224001250440] |I now have $3k left. [224001250450] |With this $3k I tour around the country and spend a few days at different labs/universities and meet with people. [224001250460] |If I know someone well enough at a lab, I can probably stay with them, which means my only real cost is airfare (assuming their university doesn't want to invite me and pay for it) and incidentals. [224001250470] |For domestic flights, it's hard to imagine that I wouldn't be able to pull off each of these trips for around $750. [224001250480] |And that eats up the $4k. [224001250490] |What do I get out of this model? [224001250500] |Well, I'd give talks at four universities. [224001250510] |This is not quite as broad an audience as at a conference, but they're more focused and my talk can be longer. [224001250520] |Instead of having semi-deep hallway conversations, at the very least I get 4 very deep office conversations, which is potentially much more useful. [224001250530] |I get one or two invited talks per year, by people that I choose (modulo availability). [224001250540] |What do I lose? [224001250550] |I lose seeing other papers presented, which I don't think is too serious. [224001250560] |I lose socializing and touristing (in foreign countries). [224001250570] |This is too bad, but is perhaps better served by a legitimate vacation. [224001250580] |The only other big thing I lose is conversations with multiple people simultaneously (eg., in Prague, one of my "good" conversations was with Ryan McDonald and Joakim Nivre... this would not be possible under my proposed model). [224001250590] |I also lose seeing questions and answers asked at talks, which are occasionally quite interesting, but also something that I'm willing to live without. [224001250600] |Overall, I think the biggest thing I would lose is a sense of community, which I think is hard to quantify and yet still important. [224001250610] |Though, I'm also not proposing that I would never go to a conference, but that maybe 2-3 per year is overkill for the benefits obtained (especially for expensive destinations). [224001250620] |If I went to one domestic (I count Canada as domestic) conference per year and visited 2-3 other sites, I'm not sure that I'd be any worse off. [224001250630] |(Of course, the fact that I'm in the States helps here... you probably couldn't get away with this model outside of US/Canada.) [224001260010] |Topic modeling: syntactic versus semantic [224001260020] |Topic modeling has turned into a bit of a cottage industry in the NLP/machine learning world. [224001260030] |Most seems to stem from latent Dirichlet allocation, though this of course built on previous techniques, the most well-known of which is latent semantic analysis.
[224001260040] |At the end of the day, such "topic models" really look more like dimensionality reduction techniques (eg., the similarity to multinomial PCA); however, in practice, they're often used as (perhaps soft) clustering methods. [224001260050] |Words are mapped to topics; topics are used as features; this is fed into some learning algorithm. [224001260060] |One thing that's interested me for a while is how, when viewed as clustering algorithms, these topic models compare with more standard word clustering algorithms from the NLP community. [224001260070] |For instance, the Brown clustering technique (built into SRILM) clusters words based on context. [224001260080] |(Lots of other word clustering techniques exist, but they pretty much all cluster based on local context, where local is either positionally local or local in a syntactic tree.) [224001260090] |I think the general high level story is that "topic models" go for semantics while "clustering models" go for syntax. [224001260100] |That is, clustering models will tend to cluster words together that appear in similar local context, while topic models will cluster words together that appear in a similar global context. [224001260110] |I've even heard stories that when given a choice of using POS tags as features in a model versus Brown clusters, it really doesn't make a difference. [224001260120] |I think this sentiment is a bit unfair to clustering models. [224001260130] |Saying that context-based clustering models only find syntactically similar words is just not true. [224001260140] |Consider the example clusters from the original LDA paper (the top portion of Figure 8). If we look up "film" ("new" seems odd) in CBC (Clustering By Committee), we get: movie, film, comedy, drama, musical, thriller, documentary, flick, etc. [224001260150] |(I left out multiword entries). [224001260160] |The LDA list contains: new, film, show, music, movie, play, musical, best, actor, etc. [224001260170] |We never get things like "actor" or "york" (presumably this is why "new" appeared), "love" or "theater", but it's unclear if this is good or not. [224001260180] |Perhaps with more topics, these things would have gone into separate topics. [224001260190] |If we look up "school", we get: hospital, school, clinic, center, laboratory, lab, library, institute, university, etc. [224001260200] |Again, this is a different sort of list than the LDA list, which contains: school, students, schools, education, teachers, high, public, teacher, bennett, manigat, state, president, etc. [224001260210] |It seems like the syntactic/semantic distinction is not quite right. [224001260220] |In some sense, with the first list, LDA is being more liberal in what it considers film-like, with CBC being more conservative. [224001260230] |OTOH, with the "school" list, CBC seems to be more liberal. [224001260240] |I realize, of course, that this is comparing apples and oranges... the data sets are different, the models are different, the preprocessing is different, etc. [224001260250] |But it's still pretty clear that both sorts of models are getting at the same basic information. [224001260260] |It would be cool to see some work that tried to get leverage from both local context and global context, but perhaps this wouldn't be especially beneficial since these approaches---at least looking at these two lists---don't seem to produce results that are strongly complementary.
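Just to make the local-versus-global distinction concrete, here is a tiny, purely illustrative Python sketch (toy corpus, invented details, nothing remotely like the scale of CBC or LDA): represent each word either by the words immediately adjacent to it or by the documents it occurs in, and look at nearest neighbors under each representation.

    import math
    from collections import defaultdict

    docs = [
        "the new film was a fine comedy and the actor was funny".split(),
        "the school hired a new teacher and the students were happy".split(),
        "the movie was a drama about a teacher and a student".split(),
    ]

    local = defaultdict(lambda: defaultdict(int))   # word -> adjacent-word counts
    globl = defaultdict(lambda: defaultdict(int))   # word -> document counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            globl[w][d] += 1
            for j in (i - 1, i + 1):
                if 0 <= j < len(doc):
                    local[w][doc[j]] += 1

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u if k in v)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def neighbors(word, rep, n=3):
        return sorted(((cosine(rep[word], rep[w]), w) for w in rep if w != word),
                      reverse=True)[:n]

    # With real data, the "local" neighbors tend to be distributionally similar
    # words and the "global" neighbors topically related ones; on this toy
    # corpus the output is only meant to show the two representations differ.
    print("local:", neighbors("film", local))
    print("global:", neighbors("film", globl))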
[224001260270] |I've also seen questions abound regarding getting topics out of topic models that are "disjoint" in some sense...this is something CBC does automatically. [224001260280] |Perhaps a disjoint-LDA could leverage these ideas. [224001270010] |The Hierarchical Bayes Compiler [224001270020] |I've been working for a while on a compiler for statistical models. [224001270030] |(Those of you who knew me back in the day will know that I used to be a programming languages/compiler nut.) [224001270040] |The basic idea is to have a language in which you can naturally express hierarchical Bayesian models and either simulate them directly or compile them into your language of choice (in this case, right now, the only language you can choose is C :P). [224001270050] |The key differentiation between HBC (the Hierarchical Bayes Compiler) and tools like WinBUGS is that you're actually supposed to be able to use HBC to do large-scale simulations. [224001270060] |Also, although right now it only supports Gibbs sampling, message passing and stochastic EM variants are in the works. [224001270070] |It also almost knows how to automatically collapse out variables (i.e., automatic Rao-Blackwellization), but this is still a bit buggy. [224001270080] |Anyway, you can see more information on the official website. [224001270090] |I'd really appreciate it if people who are interested would send me bug reports when they encounter any problems. [224001270100] |It's a pretty complex bit of machinery and some of it is still hacked together rather than done properly (especially type checking), so I expect to find some bugs there. [224001270110] |So just to whet your appetite if you haven't yet clicked on the above link, here's a complete implementation of LDA including hyperparameter estimation in HBC: [224001280010] |Word order and what-not [224001280020] |I've recently been looking at some linguistic issues related to word order phenomena. [224001280030] |This is somewhat in line with the implicational universals stuff that I am working on with Lyle Campbell, but also somewhat tangential. [224001280040] |Here's a little background. [224001280050] |Most linguists tend to believe in things like nouns, verbs, adjectives, subjects and objects, genitives, adpositions (prepositions and postpositions), etc. [224001280060] |Indeed, these are some of the basic building blocks of typological studies of word order. [224001280070] |A common example is that languages that are OV (i.e., the object precedes the verb) are also postpositional (think Hindi or Japanese). [224001280080] |On the other hand, VO languages are also prepositional (think English). [224001280090] |The general set of orders that are considered important is: V/O, Adj/N, Gen/N, PrepP/PostP and a few others. [224001280100] |However, one of these is underspecified. [224001280110] |In particular, PrepP/PostP tells us nothing about the placement of the embedded clause with respect to its head. [224001280120] |For instance, in English, which is PrepP, the head precedes the PP ("The man *with* the axe", axe comes after man). [224001280130] |Or, in Japanese, which is PostP, the head comes after the PP (glossed: "the axe *with* the man" -- meaning that the man has the axe). [224001280140] |However, it is unclear if other orders are possible. [224001280150] |For instance, are there PrepP languages for which "with the axe" ("with" has to come before "the axe" in a PrepP language) precedes "the man", something like "with the axe the man"?
[224001280160] |Or, are there PostP languages for which "the axe-with" comes after the head ("the man the axe-with")? [224001280170] |I certainly don't know any, but I only know enough about maybe 4 or 5 languages out of around 7000 to be able to tell. [224001280180] |It seems like interpretation in such a language would be difficult, but of course that doesn't stop German from separating verbs and auxiliaries and indeed Germans don't seem to have a hard time understanding each other. [224001280190] |A second thing that is left unanswered is how these relate to each other. [224001280200] |Consider Gen/N and Adj/N. [224001280210] |If you are a GenN + AdjN language, which comes first? [224001280220] |In English, the Gen has to come first ("The man's happy brother" not "The happy man's brother" -- the latter means that it's the man, not the brother, who is happy). [224001280230] |Is this reversible for a given language, or are there any languages that allow both? [224001280240] |Again, it seems like it would make interpretation difficult. [224001280250] |I've asked the few typologists that I know these two questions and they actually didn't know. [224001280260] |I'm hoping that a query to the blogosphere (a word I hate) and a query to linguist-list will turn up something. [224001280270] |I'll post anything I hear here. [224001290010] |Journals are on the Mind [224001290020] |Anyone who has ever read this blog before knows that I'm a huge supporter of moving our Computational Linguistics journal to open access. [224001290030] |Those who talk to me know I'm in favor of more drastic measures, but would be content with just this one small change. [224001290040] |I'm not here to beat a horse (living or dead), but to say that this appears to be something on many people's minds. [224001290050] |I just got an email about voting for new members of the ACL board. [224001290060] |I think only members get to vote, but I don't see anything that says that everyone can't view the statements. [224001290070] |The list of candidates with their statements is here. [224001290080] |What I find promising is how often open access CL is mentioned! [224001290090] |Indeed, both candidates for VP-elect say something. [224001290100] |Ido writes: [224001290110] |Among other possibilities, the proposal to make CL open access, with a shorter reviewing cycle, seems worth pursuing. [224001290120] |Additionally, we can increase the annual capacity of journal papers and include shorter ones. [224001290130] |And Jan writes: [224001290140] |Third, ACL's role is to help the widest possible dissemination of the field's results, including novel ways such as electronic and open-access publications, training workshops and summer schools (to attract excellent "new blood"). [224001290150] |The issue comes up again for one of the Exec members. [224001290160] |Hwee Tou says "I believe some issues of current concern to the ACL membership include the role of open access journals..." [224001290170] |Obviously I'm not here to tell anyone who to vote for. [224001290180] |Indeed, since both candidates for VP mention the issue, it's not really a dividing point! [224001290190] |But I'm very very very pleased to see that this has become something of an important front. [224001300010] |Mark-up Always the Wrong Tree? [224001300020] |Almost a year ago I responded to a very interesting article in CL. [224001300030] |The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions.
[224001300040] |In this post I'm going to take a more extreme position. [224001300050] |It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration. [224001300060] |Proposition: mark-up is always a bad idea. [224001300070] |That is: we should never be marking up data in ways that it's not "naturally" marked up. [224001300080] |For instance, part-of-speech tagged data does not exist naturally. [224001300090] |Parallel French-English data does. [224001300100] |The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing. [224001300110] |Here's why I think this is a reasonable position to take. [224001300120] |In some sense, we're striving for machines that can do things that humans do. [224001300130] |We have little to no external evidence that when humans (for instance) perform translation, they also perform part-of-speech tagging along the way. [224001300140] |Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or being lazy about annotating. [224001300150] |We may be happy to speculate that humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). [224001300160] |But the fact is, simply, that all we can observe are the inputs and outputs of some processes (eg., translation) and that we should base all of our models on these observables. [224001300170] |Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument. [224001300180] |I think the first hole is just a "we're not there yet" issue. [224001300190] |That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this. [224001300200] |The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. [224001300210] |For instance, many summarization people believe that coreference issues are a big problem. [224001300220] |Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks at hand-annotated coref data. [224001300230] |On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. [224001300240] |The very process of working on these small problems (like "solving" coref on small annotated data sets) gives us an opportunity to better understand what goes into these problems. [224001300250] |The hole with the second hole :) is the following. [224001300260] |If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. [224001300270] |Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. [224001300280] |So we should stop doing it. [224001300290] |(Similar arguments can be made for other tasks; eg., NE tagging in English.) [224001300300] |The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. [224001300310] |And these are tasks that computers can do.
[224001300320] |If we can force some humans to do these tasks, maybe it would be worthwhile. [224001300330] |But, to be honest, I can't think of any such thing off the top of my head. [224001300340] |Maybe I'm just too closed-minded. [224001310010] |Bootstrapping [224001310020] |There are many bootstrapping algorithms, but they all have (roughly) the same general form. [224001310030] |Suppose we want to solve a binary classification problem. [224001310040] |We do the following: [224001310050] |
  • Build (by hand) a classifier that predicts positive with high precision (and low recall)
  • [224001310060] |
  • Build (by hand) a classifier that predicts negative with high precision (and low recall)
  • [224001310070] |
  • Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)
  • [224001310080] |
  • Train a new classifier on the output of (3)
  • [224001310090] |
  • Apply the new classifier to the original data set and go to (4)
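In code, the loop looks roughly like the sketch below. This is a toy illustration using scikit-learn; the seed word lists, the two-word threshold, and the confidence cutoff are all invented, and the rule-based classifiers in a real system would be far better than word lists.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    POS_WORDS = {"wonderful", "terrible", "love", "hate"}   # toy "positive" cues
    NEG_WORDS = {"said", "reported", "percent", "year"}     # toy "negative" cues

    def seed_label(sentence):
        """Steps (1) and (2): high-precision, low-recall rule-based labels."""
        toks = set(sentence.lower().split())
        if len(toks & POS_WORDS) >= 2 and not (toks & NEG_WORDS):
            return 1
        if len(toks & NEG_WORDS) >= 2 and not (toks & POS_WORDS):
            return 0
        return None  # punt

    def bootstrap(pool, iterations=5, confidence=0.9):
        # Step (3): apply the seed rules to the big unlabeled pool.
        labeled = []
        for s in pool:
            y = seed_label(s)
            if y is not None:
                labeled.append((s, y))
        vec, clf = CountVectorizer(), None
        for _ in range(iterations):
            # Step (4): train a classifier on the current labeled subset
            # (assumes both classes are represented among the seeds).
            X = vec.fit_transform([s for s, _ in labeled])
            clf = LogisticRegression(max_iter=1000).fit(X, [y for _, y in labeled])
            # Step (5): relabel the whole pool, keeping only confident predictions.
            probs = clf.predict_proba(vec.transform(pool))
            labeled = [(s, int(p[1] > 0.5)) for s, p in zip(pool, probs)
                       if max(p) > confidence]
        return clf, vec

Even this tiny sketch exhibits the two worries discussed next: the bag-of-words features still include the seed words themselves, so nothing stops the learned classifier from simply memorizing the seed rules, and training only ever happens on the subset of sentences the rules (or confident predictions) fired on.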
  • [224001310100] |Note that the hand-built classifier can alternatively just be a hand-labeling of a small number of "representative" examples. [224001310110] |I've never actually used such an algorithm before, but I hear that they work pretty well. [224001310120] |There's always been one thing that's surprised me, though. [224001310130] |And that's the fact that they work pretty well. [224001310140] |There are two reasons why the fact that these algorithms work surprises me. [224001310150] |First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. [224001310160] |Second, there is a domain adaptation problem. [224001310170] |To be more explicit, let me give an example. [224001310180] |Suppose I want to classify sentences as subjective or objective. [224001310190] |This is a well studied problem to which bootstrapping algorithms have been applied. [224001310200] |The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. [224001310210] |One way to do this is to create word lists. [224001310220] |Get a list of highly subjective words and a list of highly objective words (okay this latter is hard and you have to be clever to do it, but let's suppose that we could actually do it). [224001310230] |Now, to make our rule based classifier, we might say: if a sentence contains at least two positive words and no negative words it's positive; if a sentence contains at least two negative words and no positive words it's negative; else, punt. [224001310240] |Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. [224001310250] |In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier. [224001310260] |This is where things go haywire for my understanding. [224001310270] |First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. [224001310280] |In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. [224001310290] |Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead. [224001310300] |One might think that to get around this problem we would remove from the bag of words any of the words from the initial word list, essentially forcing the classifier to learn to generalize. [224001310310] |But people don't seem to do this. [224001310320] |In fact, in several cases that I've seen, people actually include extra features of the form "how many positive (resp. negative) words appear in this sentence." [224001310330] |But now we have a single feature that will allow us to replicate the rule-based classifier. [224001310340] |This scares me greatly. [224001310350] |Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." [224001310360] |But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 positive and 0 negative, or >=2 negative and 0 positive words." 
[224001310370] |The fact that this will generalize is totally not obvious. [224001310380] |And yet these things seem to work pretty well. [224001310390] |I just don't quite understand why. [224001320010] |F-measure versus Accuracy [224001320020] |I had a bit of a revelation a few years ago. [224001320030] |In retrospect, it's obvious. [224001320040] |And I'm hoping someone else out there hasn't realized this because otherwise I'll feel like an idiot. [224001320050] |The realization was that F-measure (for a binary classification problem) is not invariant under label switching. [224001320060] |That is, if you just change which class it is that you call "positive" and which it is that you call "negative", then your overall F-measure will change. (For a concrete example: with 100 instances, 10 of class A and 90 of class B, a classifier that gets 5 of the A's and 85 of the B's right has F of 0.5 if A is "positive" but roughly 0.94 if B is "positive"; accuracy is 0.9 either way.) [224001320070] |What this means is that you have to be careful, when using F-measure, about how you choose which class is the "positive" class. [224001320080] |On the other hand, the simple "accuracy" metric is (of course) invariant under label switching. [224001320090] |So when using accuracy, you needn't worry about which class you consider "positive." [224001320100] |In the olden days, when people pretty much just used F-measure to analyze things like retrieval quality, this wasn't a problem. [224001320110] |It was "clear" which class was positive (good documents) and which was negative. (Or maybe it wasn't...) But now, when people use F-measure to compare results on a large variety of tasks, it makes sense to ask: when is accuracy the appropriate measure and when is F the appropriate measure? [224001320120] |I think that, if you were to press anyone on an immediate answer to this question, they would say that they favor F when one of the classes is rare. [224001320130] |That is, if one class occurs only in 1% of the instances, then a classifier that always reports "the other class" will get 99% accuracy, but terrible F. [224001320140] |I'm going to try to convince you that while rarity is a reasonable heuristic, there seems to be something deeper going on. [224001320150] |Suppose I had a bunch of images of people drinking soda (from a can) and your job was to classify if they were drinking Coke or Pepsi. [224001320160] |I think it would be hard to argue that F is a better measure here than accuracy: how would I choose which one is "positive"? [224001320170] |Now, suppose the task were to distinguish between Coke and Diet Dr. Pepper. [224001320180] |Coke is clearly going to be the majority class here (by a long shot), but I still feel that F is just the wrong measure. [224001320190] |Accuracy still seems to make more sense. [224001320200] |Now, suppose the task were to distinguish between Coke and "anything else." [224001320210] |All of a sudden, F is much more appealing, even though Coke probably isn't much of a minority (maybe 30%). [224001320220] |What seems to be important here is the notion of X versus not-X, rather than X versus Y. In other words, the question seems to be: does the "not-X" space make sense? [224001320230] |Let's consider named entity recognition (NER). [224001320240] |Despite the general suggestion that F is a bad metric for NER, I would argue that it makes more sense than accuracy. [224001320250] |Why? [224001320260] |Because it just doesn't make sense to try to specify what "not a name" is. [224001320270] |For instance, consider the string "Bill Clinton used to be the president; now it's Bush." [224001320280] |Clearly "Bill Clinton" is a person. [224001320290] |But is "Clinton used"? [224001320300] |Not really.
[224001320310] |What about "Bill the"? [224001320320] |Or "Bill Bush"? [224001320330] |I think a substantial part of the problem here is not that names are rare, but that it's just not reasonable to develop an algorithm that finds all not-names. [224001320340] |They're just not well defined. [224001320350] |This suggests that F is a more appropriate measure for NER than, at least, accuracy. [224001320360] |One might argue--and I did initially myself--that this is an artifact of the fact that names are often made up of multiple words, and so there's a segmentation issue. [224001320370] |(The same goes for the computer vision problem of trying to draw bounding boxes around humans in images, for which, again, F seems to make more sense.) [224001320380] |But I think I've been convinced that this isn't actually the key issue. [224001320390] |It seems, again, that what it boils down to is that it makes sense to ask one to find Entities, but it doesn't make sense to ask one to find non-Entities, in the same way that it doesn't make sense to ask one to find non-Cokes. [224001320400] |(Of course, in my heart of hearts I believe that you need to use a real--i.e., type 4--evaluation metric, but if you're stuck without this, then perhaps this yields a reasonable heuristic.) [224001330010] |Gender and text, gender and speech [224001330020] |For some crazy reason I decided a while ago that I wanted to learn Japanese. [224001330030] |Essentially, I wanted to learn a language as unlike English as I could find. [224001330040] |So I did some summer intensive thing before college (that amounted to a year of class) and then continued taking class for all three years of undergrad. [224001330050] |At the end, I could get by passably for most conversation topics (business, politics, current events, etc.) other than research stuff (at some point I learned how to say NLP, but I don't remember anymore...I wonder if en-eru-pi would be understood...). [224001330060] |During the whole time we were required to meet weekly with conversation partners so as to practice our speaking skills. [224001330070] |For the first "semester" during the summer, I had a male professor. [224001330080] |For all remaining seven semesters, my profs were female. [224001330090] |With the exception of one conversation partner (who was from Hokkaido and spoke quickly with a strong accent and who was quickly replaced by someone who I could understand a bit more), all of my conversation partners were female. [224001330100] |At the end of my four years, I was speaking to a friend (who was neither a conversation partner nor a prof) in Japanese and after about three turns of conversation, he says to me (roughly): "you talk like a girl." [224001330110] |Based on the set up of this post, you may have seen that coming. [224001330120] |But the thing that is most interesting is that Japanese is not one of those languages where the speaker's gender is encoded in (eg.) verb morphology. [224001330130] |In fact, as best I could tell at that point, the only thing that I did that was effeminate was to use too many sentence ending particles that were more commonly used by women (-ka-na, I think, was one, but it's been too long now to really remember). [224001330140] |The guy who said this to me was a close enough friend that I tried to figure out what it was about my speech that made him assess that I talk like a girl.
[224001330150] |The sentence particle thing was part of it, but he said that there was also something else that he couldn't really figure out; he was hypothesizing it was something to do with emphasis patterns. [224001330160] |It's not at all surprising that, given that the majority of native speakers I talked to were female, if there were some underlying bias sufficiently subtle that the profs weren't able to intentionally avoid it, I would have picked it up. [224001330170] |Now, getting back to en-eru-pi. [224001330180] |There's been a reasonable amount of work in the past few years on identifying the gender of the authors of texts. [224001330190] |I know both Moshe Koppel and Shlomo Argamon, to name two, have worked on this problem. [224001330200] |I also remember seeing a web site a year or so ago where you could enter a few sentences that you wrote and it would guess your gender. [224001330210] |I don't remember what it cued off of---I think distribution of types of verbs and adjectives, mostly, but I do remember that given a short paragraph, it's shockingly accurate. [224001330220] |What I don't know is (a) if anyone has done this for something other than English and (b) if someone has done it for speech. [224001330230] |Of course, if you have speech, you have extra information (eg., pitch) which might be useful. [224001330240] |But given my Japanese friend's reaction to my speech pattern (my voice is rather low), there has to be more going on. [224001330250] |And I'm not convinced that what is going on will be the same between written text and (say) transcribed speech. [224001330260] |If someone wanted to try such an experiment for non-English text, you could probably just mine non-English from some social networking site (like myspace or facebook), where people tend to list their genders. [224001330270] |I'm not sure how to do it for speech. [224001330280] |Maybe there's some speech transcription corpus out there that's annotated with gender, but I don't know what it is. [224001330290] |Although I don't see a huge financial market out there for an answer, I'm personally curious what it is about my English writing patterns that made the web site I referred to earlier strongly convinced that I'm male, and what it is about my Japanese speech patterns that makes it clear that I'm not. [224001340010] |Non-parametric versus model selection/averaging [224001340020] |Non-parametric approaches (probably the most familiar of which is the Dirichlet process, but there are a whole host of them) are nice because they don't require us to pre-specify a bunch of things that in standard parametric inference would essentially be a model selection issue. [224001340030] |For instance, in the DP, we needn't specify how many "clusters" gave rise to our data (in the context of a mixture model). [224001340040] |This brings up the immediate question, though: instead of doing inference in a non-parametric model, why don't you just do model selection (eg by comparing marginals) or model averaging? [224001340050] |You can just vary whatever it is that is the "non-parametric" part of the model. [224001340060] |For instance, in a DP, you run a bunch of inferences with different numbers of clusters and either choose the best (model selection) or average with respect to the marginals (model averaging). [224001340070] |In something like an IBP, you can run with a different number of latent factors and select or average.
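In code, the alternative being suggested looks something like the sketch below (Python, with a hypothetical fit_mixture routine standing in for whatever fixed-K inference you run; estimating the log marginal likelihood it returns is exactly the hard part discussed in the points that follow):

    import math

    def select_or_average(data, candidate_Ks, fit_mixture):
        """fit_mixture(data, K) is assumed to return (model, log_marginal);
        both the routine and its marginal estimate are hypothetical here."""
        fits = [(K,) + tuple(fit_mixture(data, K)) for K in candidate_Ks]

        # Model selection: keep the single K with the best marginal.
        best_K, best_model, _ = max(fits, key=lambda t: t[2])

        # Model averaging: weight each model by its normalized marginal likelihood.
        m = max(lm for _, _, lm in fits)
        weights = {K: math.exp(lm - m) for K, _, lm in fits}
        Z = sum(weights.values())
        weights = {K: w / Z for K, w in weights.items()}
        return best_K, best_model, weights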
[224001340080] |I've been asked this general question a few times by non-ML people and I rarely feel like I can give a compelling answer. [224001340090] |In particular, I'm not aware of any non-toy experimental comparisons between non-parametric inference and model selection/averaging in any of these models. [224001340100] |And even toy ones are hard to come by. [224001340110] |But beyond empirical evidence, I often have a hard time even formulating a coherent qualitative argument. [224001340120] |Here are some points I've come up with, but maybe commenters can either debunk them or add... [224001340130] |
  • (1) In some cases, there are lots of parts of the model for which we don't know the structure, so to do model selection/averaging would require trying a ridiculously large number of models. [224001340140] |For instance, I might have two components in my model that are DP-ish, so now I have to try quadratically many models.
  • [224001340150] |
  • (2) I may not know a good upper/lower bound on the number of components (eg., in a DP). [224001340160] |So I'm going to have to try a really large range. [224001340170] |In fact, although it's well known that the expected number of clusters in a DP grows as o(log N), where N is the number of data points, it is actually unbounded (and there's a conjecture that it's w(log log N), which isn't terribly slow); a quick CRP simulation at the end of this post illustrates this growth.
  • [224001340180] |
  • (3) Comparing marginal likelihoods across models with a different number of parameters is just plain hard. [224001340190] |In fact, for most cases, I don't know how to do it, especially if you want to live in MCMC world. [224001340200] |(In variational world you could compare the lower bound on the marginals, but it's always a bit nerve-wracking to compare two lower bounds -- you'd rather compare lowers and uppers.) [224001340210] |I'm aware of things like reversible jump MCMC and so on, but in most cases these aren't actually applicable to the models you want. [224001340220] |Alternatively, if all you want to do is select (not average), you could always do something with held-out data. [224001340230] |The problem is that I can think of counter-arguments to most of these points. [224001340240] |In the case of (1), you could argue that if the space is too big, then your sampler isn't going to hit everywhere anyway. [224001340250] |In the case of (2), my guess is that for most of these models the marginal will be semi-convex, so you can just start small and keep increasing until things seem to get worse. [224001340260] |For (3), this seems to be an argument for developing better MCMC techniques for comparing marginals, not necessarily an argument in favor of non-parametric methods. [224001340270] |But I can go back yet again. [224001340280] |To counter the counter to (1), you can argue that the sampler is at least guaranteed after a long time to hit what you care about, whereas if you construct some arbitrary search policy, you may not be. [224001340290] |For (2), well...I don't know...I'm pretty convinced by the counter-argument to (2) :P... For (3), you could just disagree and say: why should we develop better MCMC techniques for comparing marginals when we can get away from this whole business by doing non-parametric inference? [224001340300] |Overall, I think non-parametric inference is interesting, useful and fun. [224001340310] |But I'd like better arguments against the nay-sayers (who, in my experience, are actually typically non-ML people). [224001340320] |(Note that I'm ignoring the case where the non-parametric model is actually known--or roughly known--to be the right model for your problem. [224001340330] |Of course if it's the right model, then you should use it. [224001340340] |I'm more referring to the question of using non-parametric methods to get around model selection issues.)
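As a small addendum to point (2) above, here is a quick Chinese restaurant process simulation (my own toy; the concentration parameter, seed, and values of N are arbitrary) showing how slowly the number of clusters grows with N, which is part of what makes picking a range of K to search over awkward:

    import math
    import random

    def crp_num_clusters(N, alpha=1.0, seed=0):
        random.seed(seed)
        counts = []                              # customers per table
        for n in range(N):                       # n customers already seated
            r = random.random() * (n + alpha)
            acc = 0.0
            for i, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[i] += 1               # join an existing table
                    break
            else:
                counts.append(1)                 # new table, prob alpha/(n+alpha)
        return len(counts)

    for N in (100, 1000, 10000, 100000):
        print(N, crp_num_clusters(N), "vs alpha*log(N) ~", round(math.log(N), 1))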
  • [224001340230] |The problem is that I can think of counter-arguments to most of these points. [224001340240] |In the case of (1), you could argue that if the space is too big, then your sampler isn't going to hit everywhere anyway. [224001340250] |In the case of (2), my guess is that for most of these models the marginal will be semi-convex, so you can just start small and keep increasing until things seem to get worse. [224001340260] |For (3), this seems to be an argument for developing better MCMC techniques for comparing marginals, not necessarily an argument in favor of non-parametric methods. [224001340270] |But I can go back yet again. [224001340280] |To counter the counter to (1), you can argue that the sampler is at least guaranteed after a long time to hit what you care about, whereas if you construct some arbitrary search policy, you may not be. [224001340290] |For (2), well...I don't know...I'm pretty convinced by the counter-argument to (2) :P... For (3), you could just disagree and say: why should we develop better MCMC techniques for comparing marginals when we can get away from this whole business by doing non-parametric inference. [224001340300] |Overall, I think non-parametric inference is interesting, useful and fun. [224001340310] |But I'd like better arguments against the nay-sayers (who, in my experience, are actually typically non-ML people). [224001340320] |(Note that I'm ignoring the case where the non-parametric model is actually known--or roughly known--to be the right model for your problem. [224001340330] |Of course if it's the right model, then you should use it. [224001340340] |I'm more referring to the question of using non-parametric methods to get around model selection issues.)