The text files in this directory constitute a sentence-segmented version of the WSJ raw text as released in LDC95T07. Some normalisation was performed to remove line breaks before segmenting, while maintaining paragraph breaks as blank lines in the text file.

These files were prepared by Rebecca Dridan (at the University of Oslo), in consultation with Stephan Oepen, working from several LDC releases of the original WSJ text (and, of course, the 1999 release of the PTB).

The initial .START line of each document was ignored. In aligning sentences with the annotations in the PTB, certain mappings were used:

  " was allowed to match to `` or ''
  ` was allowed to match to '
  ' was allowed to match to `
  * was allowed to match to \\*
  / was allowed to match to \\/
  PTB bracket substitutions were allowed for, e.g. -LCB- and friends.

In cooking the ‘raw’ WSJ text to approximate what one would expect to be the input to a full analysis pipeline aiming to arrive at the PTB annotations, we had to balance two guiding principles: (a) what was (probably) the original form in the printed newspaper and (b) respecting the decisions made in annotation. Not surprisingly, these principles can be in conflict. For example, ‘'Tis’ in the ‘raw’ appears in the PTB as ‘'T-’ ‘is’; in this case, we rank principle (a) over principle (b) (21583012). Likewise, where the ‘raw’ has ‘The hotel and gaming company’, the PTB has ‘Gaming’; again, we keep to the ‘raw’ in this case (21503003). Somewhat ambivalently, we applied principle (b) in changing ‘raw’ ‘16/ 64-inch’ (which an editor with an engineering background would likely typeset as ‘16/64-inch’) to ‘16 64-inch’ (matching the PTB); for this one, maybe one should check the original WSJ print version.

A comparatively frequent application of principle (a) relates to single quotes.
It appears that whatever format and encoding conversion was applied to the original WSJ files, double quotes were not disambiguated (between opening and closing), while single quotes use LaTeX-style conventions to distinguish left and right quotes (i.e. a backquote for opening quotes, a straight single quote for closing quotes and apostrophes); we find a few cases like ‘rock `n' roll’ in the ‘raw’, which the PTB has edited to ‘'n'’ (two apostrophes), suggesting there was a lossy process of single quote disambiguation.

One class of segmentation errors in the PTB relates to sentence-initial or -final quote marks, as for example in

  [20044058] "Being a teacher just became my life," says the 37-year-old Mrs. Yeargin, a teacher for 12 years before her dismissal. "
  [20044059] I loved the school, its history.

  [20526014] He quotes one student saying, "You're just the kind of Jewboy we Southerners can't stand.
  [20526015] " Mr. Sohmer confesses that it was partly in response to such attitudes that he is now "a dweller on one of the two islands off the coast of America."

In these cases, it seems clear from context and the use of whitespace around the quote marks that the PTB has mis-segmented. For compatibility with gold-standard sentence boundaries from the PTB, we (again ambivalently) decided not to correct the quote marks in these examples. Searching for /^" / and / "$/, we estimate that there are about two dozen instances in each class.

Excess full stops in the tree (e.g. those after ‘U.S.’) were not added to the raw text.

With the exception of the two examples given above, it should be possible to go from these raw text files to the tree yields of the PTB using only the mappings listed, sentence segmentation and tokenisation, modulo excess full stops.

Below, we list the changes that were necessary to make to the raw text to produce these files.
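The search for mis-placed sentence-initial and sentence-final quote marks mentioned above (/^" / and / "$/) can be sketched in Python as follows. This is only an illustration: the function name find_stray_quotes is made up here, and the sketch assumes the sentences are available as a list of strings, one sentence per string, which is an assumption about how the files are read rather than something this README specifies.

```python
import re

# The two patterns quoted in the README: a double quote followed by a
# space at the start of a sentence, and a space followed by a double
# quote at the end of a sentence.
LEADING_QUOTE = re.compile(r'^" ')
TRAILING_QUOTE = re.compile(r' "$')


def find_stray_quotes(sentences):
    """Return (leading, trailing): indices of sentences matching each pattern.

    `sentences` is assumed to be a list of strings, one sentence each
    (hypothetical input format, not prescribed by the README).
    """
    leading, trailing = [], []
    for index, sentence in enumerate(sentences):
        if LEADING_QUOTE.search(sentence):
            leading.append(index)
        if TRAILING_QUOTE.search(sentence):
            trailing.append(index)
    return leading, trailing


if __name__ == "__main__":
    examples = [
        '"Being a teacher just became my life," says the 37-year-old '
        'Mrs. Yeargin, a teacher for 12 years before her dismissal. "',
        "I loved the school, its history.",
        '" Mr. Sohmer confesses that it was partly in response to such '
        'attitudes that he is now "a dweller on one of the two islands '
        'off the coast of America."',
    ]
    print(find_stray_quotes(examples))
```

Matches found this way are only candidates; each one still needs to be checked by hand against its context.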
The vast majority of changes involved deleting punctuation from the end of a sentence, usually a double quote, or removing mid-sentence paragraph breaks. We also list the places where text was omitted mid file.

20004002 delete /Donoghue
after 20118121 delete .START
20162031 delete random symbols < >
before 20166006 deleted excess .START
20203018 deleted middle section of the raw sentence
before 20285001 delete sentence
20408009 change `n' to 'n'
before 20901001 delete e/
20994048 change `n' to 'n'
20994051 change `90s to '90s
20998039 Co,. to Co.,
before 21139023 delete >
21161034 delete random < symbol
21331036 change `S to 'S
before 21625071 delete random f
21625071 staf to staff
22170053 changed But's that to But that's
22312002 change 16/ 64-inch to 16 64-inch

delete wide chars in raw:
20142010 21069009 21870061 22055020 22377048

swap ." to ". :
20749019 20749025 20749029 20749033 20749037 20749043 20749045 20749048
20749050

normalize m-dash (‘---’) to n-dash (‘--’):
20350026 20675027

delete final '":
21125024 21996073 22367021

delete final ):
20452023 20455031 22227036

delete final ":
20003028 20034039 20044135 20051034 20060020 20072019 20089065 20090052
20098019 20100042 20102047 20109045 20114039 20121050 20128044 20160005
20181009 20205002 20214080 20222003 20242029 20259018 20261016 20267100
20282047 20286100 20287006 20304026 20305054 20320004 20326029 20367052
20408013 20430024 20441016 20445117 20465095 20473029 20495005 20518038
20525042 20530048 20559037 20560029 20563017 20564058 20578061 20584048
20592044 20597007 20601029 20604058 20617061 20633088 20635006 20654009
20666038 20673007 20723022 20725127 20741037 20742068 20743009 20748025
20758077 20765109 20768039 20776051 20799051 20800058 20808013 20903002
20909101 20922034 20931038 20935011 20943010 20949009 20956057 20962019
20971021 20975050 20976038 20983048 20984072 20994099 20998045 21000033
21010039 21025004 21057147 21094055 21110024 21123022 21136022 21150017
21159013 21178016 21179007
21234006 21246028 21248027 21261018 21264034 21265004 21273049 21274045
21275015 21306019 21316020 21366057 21367085 21368041 21374018 21375052
21376068 21379017 21396017 21432048 21455068 21457070 21471008 21474047
21475004 21515049 21549053 21556042 21560024 21563019 21582025 21586061
21587013 21594029 21603091 21611016 21615083 21617040 21629036 21634129
21647049 21650023 21654038 21675008 21677056 21682041 21686014 21691051
21693023 21727008 21791016 21802034 21822037 21826023 21844074 21852041
21860036 21870115 21875135 21900007 21901008 21903049 21918007 21922025
21928043 22010014 22012016 22013167 22044036 22048042 22055044 22063028
22100057 22110028 22125030 22156025 22161070 22165033 22169033 22202046
22213010 22221007 22223042 22224010 22235006 22250064 22269006 22276097
22303029 22306062 22314070 22321034 22347031 22354017 22357010 22359013
22370008 22381044 22384050 22386049 22393011 22397063 22406050 22417095
22446035

The following ‘sentences’ cover text spans that included paragraph breaks in the raw text.
These paragraph breaks have been converted to single spaces in the ‘cooked’ data:

20091008 20095009 20102001 20102015 20102027 20102038 20105014 20105018
20105024 20105028 20105035 20139002 20179040 20179042 20179043 20179044
20280001 20280020 20280032 20312002 20312003 20331042 20331049 20360010
20360028 20360038 20403002 20403005 20403008 20411005 20416001 20416016
20416028 20434010 20444013 20508005 20508010 20508014 20576008 20576017
20576024 20594002 20594003 20594006 20595001 20595013 20595024 20710003
20728015 20730001 20730023 20730034 20732012 20732022 20747010 20747015
20749017 20756018 20757001 20757002 20757005 20761023 20772012 20783012
20793017 20998001 20998014 20998025 20998037 21052051 21052052 21054002
21092011 21092018 21092029 21095011 21116008 21131017 21151020 21212002
21212005 21218001 21218018 21218031 21240004 21252004 21252007 21253028
21256009 21258015 21260008 21262019 21284026 21328045 21328047 21328048
21328049 21373001 21373015 21373026 21398001 21398004 21412018 21416021
21435002 21470001 21470016 21470028 21470042 21482001 21490045 21490051
21529018 21571018 21578043 21583001 21583015 21583027 21583036 21588005
21605006 21694028 21735051 21735058 21755019 21758002 21758003 21758006
21790001 21790015 21790026 21790040 21798002 21798003 21809058 22011011
22021023 22021039 22029013 22036023 22038002 22038005 22086041 22086048
22111001 22111015 22111033 22171001 22204006 22215002 22215005 22225001
22225018 22225031 22239067 22250002 22265011 22308034 22330026 22338013
22366001 22366017 22366031 22373001 22401002 22412035

Text from the raw files has been omitted after the text in the following sentences:

20094014 20104002 20133003 20148027 20190005 20200002 20211002 20248007
20266006 20268005 20269006 20331049 20364005 20410002 20434010 20449056
20455038 20511005 20581006 20586046 20603002 20605002 20608002 20609048
20611002 20614002 20694014 20696005 20728020 20735010 20747020 20751004
20818053 20911017 20957007 20974002 20980010
20992020 21056005 21070002 21095011 21107017 21228005 21253028 21259009
21260008 21377071 21382005 21401002 21402003 21417003 21430002 21436050
21450061 21490052 21497005 21529018 21557005 21558005 21564049 21585011
21588009 21602014 21605006 21616037 21619036 21623088 21625072 21632008
21658015 21735058 21745002 21747002 21751002 21814012 21856081 21862008
21871002 21941005 21946044 21961002 21964002 21965002 21970038 22011011
22017031 22021039 22108010 22114005 22139002 22153064 22206004 22230071
22300086 22301004 22311002 22351056 22352008 22368007 22374002 22374002
22376093 22407049 22412083 22413042 22415065
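
As a closing illustration, the character mappings listed at the top of this file can be used to check a raw sentence against a PTB tree yield, roughly as sketched below. This is not the procedure actually used to prepare these files; all names (PTB_TOKEN_TO_RAW, canon_ptb_token, canon_raw, yields_match) are hypothetical. The sketch assumes the caller supplies only the terminal tokens of a tree (no trace or empty elements), that the bracket substitutions are the usual -LRB-, -RRB-, -LCB-, -RCB-, -LSB- and -RSB-, and that PTB escapes consist of one or more backslashes before * or /. It is also slightly more permissive than the listed mappings, since it treats backquote and straight single quote as fully interchangeable, and it does not handle excess full stops.

```python
import re

# PTB tokens that stand for a single raw character (assumed standard
# PTB conventions; the README only names -LCB- "and friends").
PTB_TOKEN_TO_RAW = {
    "``": '"',
    "''": '"',
    "-LRB-": "(",
    "-RRB-": ")",
    "-LCB-": "{",
    "-RCB-": "}",
    "-LSB-": "[",
    "-RSB-": "]",
}

# One or more backslashes in front of * or / (escaped forms in the PTB).
ESCAPED = re.compile(r"\\+([*/])")


def canon_ptb_token(token):
    """Map one PTB terminal token to a canonical raw-like form."""
    if token in PTB_TOKEN_TO_RAW:
        return PTB_TOKEN_TO_RAW[token]
    token = ESCAPED.sub(r"\1", token)
    # Backquote and straight single quote may match each other.
    return token.replace("`", "'")


def canon_raw(sentence):
    """Drop all whitespace from a raw sentence and unify single quotes."""
    return "".join(sentence.split()).replace("`", "'")


def yields_match(raw_sentence, ptb_tokens):
    """True if the raw sentence and the PTB tokens agree under the mappings."""
    ptb_text = "".join(canon_ptb_token(token) for token in ptb_tokens)
    return canon_raw(raw_sentence) == ptb_text


if __name__ == "__main__":
    print(yields_match("rock `n' roll", ["rock", "'n'", "roll"]))
```

Comparing with whitespace removed sidesteps tokenisation differences, at the cost of not verifying token boundaries.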