Email test

This is a test to see if email addresses are showing up properly in posts. bro...@u.washington.edu

About Cyberling

Advances in computing technology over the past few decades, including general communications technology like the World Wide Web as well as specific advances in computational linguistics, have opened the possibility of a cyberinfrastructure for linguistics that will advance the field by allowing linguists to analyze and test hypotheses against much larger data sets, collaborate with more people across greater distances, and as a result ask questions not previously answerable.

We envision a research climate in which data including audio and video recordings, transcriptions, interlinear glossed text, dictionaries, acceptability judgments, typological classifications, psycholinguistic results, language acquisition data, and more are available for virtually all the world's languages through web-based portals. These benefits are made possible by data that are encoded in standardized formats and annotated with standardized metadata, so that they are discoverable, searchable, and aggregatable, maximizing researchers' ability both to find particular data sets and examples of interest and to test hypotheses against large quantities of data.

However, in order to realize the promise of a cyberinfrastructure, the field needs to solve three problems: (1) the culture change problem, (2) the design problem, and (3) the funding problem. Regarding (1), we need to establish a culture in the field of publishing and sharing data and annotations, and of expecting hypotheses to be tested against available data sets. Regarding (2), we need to identify existing standards and software that can contribute to a general cyberinfrastructure and plan how to build from them.
Finally, regarding (3), we need to develop a funding model which will sustain not only research contributions by linguists and computational linguists, but also software development (including user interface work) by software engineers. These problems cannot be solved by isolated research projects; rather, they require widespread communication, participation, and buy-in from the field.

In July 2009, the Cyberling 2009 Workshop (held in conjunction with the LSA Linguistic Institute at UC Berkeley) brought together researchers from diverse subfields of linguistics (as well as some non-linguists) interested in issues of cyberinfrastructure. The results of those conversations are documented in the workshop wiki. The workshop was a wonderful opportunity to discuss issues pertaining to cyberinfrastructure for linguistics across many different perspectives, and we would like to continue that conversation and collaboration online, without waiting for the next opportunity to meet face-to-face.

The goal of this blog is to provide a site for that online collaboration. We hope the "breaking news" aspect of a blog will bring people back to the site regularly to see updates and participate in the discussion, while tags on the posts will support organization of the information so that the blog also becomes a useful repository.

The Cyberling 2009 Workshop and the initial development of this blog were funded by the National Science Foundation under grant number BCS-0936577. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Introducing the Cyberling blog

The goal of this blog is to provide a point of virtual collaboration regarding the creation, promotion, and maintenance of cyberinfrastructure for the field of linguistics (and the language sciences more broadly). At the Cyberling 2009 workshop in Berkeley, CA (July 2009), "communication" was identified as a key issue in the development of cyberinfrastructure. In particular, we need to communicate about standards (availability and development), tool and resource availability, needs assessment, and principles & practices.

People working on tools and standards across linguistics and the language sciences more broadly need to be aware of each other and each other's efforts, and need to be able to communicate with potential users for needs assessment. (Mark Liberman commented that every successful piece of software starts with someone scratching an itch: they have a problem, build a solution, and share that solution. But not everyone who has a problem that can be solved with software has the means or skills to build that software themselves.) People potentially using tools and standards need to be able to find them. People who should be using tools and standards but don't yet know about them need to be reached.

It is our hope that this blog will become a useful vehicle for communication across all these dimensions, as well as a useful repository for information about cyberinfrastructure for linguistics. We envision contributions to be of these six basic types:
  • conference/workshop announcements (and reports)
  • project announcements
  • funding opportunities
  • issues/op-eds
  • tutorials
  • software/hardware/book/paper reviews
Rather than attempting to organize the information in an encyclopedic fashion, we are experimenting with a set of tags which can be assigned to posts to build an index to the posts for later reference. We hope that the comments on the blog posts will be both collegial and lively, and also that visitors will use the associated forum for more general discussion. Welcome!

Endangered Languages Information and Infrastructure Project (ELIIP)

Last week the Center for American Indian Languages at the University of Utah hosted a workshop going under the acronym ELIIP (for Endangered Languages Information and Infrastructure Project) as the first step towards a larger project "intended to produce an authoritative catalogue, database, and updatable website of information on endangered languages and enrich the infrastructure of the discipline by integrating accurate EL information into a network of digital information and research facilities".

The workshop participants ranged from regional language experts to data infrastructure specialists to representatives of non-profit foundations, all assembled to give advice (largely via working groups) to the project organizers, Lyle Campbell of the University of Utah and Helen and Anthony Aristar-Dry of Eastern Michigan University/LINGUIST List. While technical issues were far from absent from the discussion, quite a bit of the attention was focused, not surprisingly, on the social problems relating to data input and review. Ideally, the "best" experts (or, more likely, expert) would review the data for each language in the database, and local experts (possibly members of native speaker communities) would also be able to offer comments on the data for review by those with editorial control.
When one considers that the number of "endangered" languages will probably be in the thousands on almost any counting scheme, this is no small editorial task, and purely technical solutions will only get us so far.

New NSF-OCI Software Development for Cyberinfrastructure (SDCI) solicitation

On Thursday, November 19, the NSF Office of Cyberinfrastructure (OCI) announced a new Software Development for Cyberinfrastructure (SDCI) solicitation, with a full proposal deadline of February 28, 2010. It expects to make 25 to 30 awards totaling $15,000,000 over three years. The program synopsis reads as follows.

"The purpose of the Software Development for Cyberinfrastructure (SDCI) program is to develop, deploy, and sustain a set of reusable and expandable software components and systems that benefit a broad set of science and engineering applications. SDCI is a continuation of the NSF Middleware Initiative (NMI) in an expanded context appropriate to the current expanded vision of cyberinfrastructure.

"This program supports software development across five major software areas: system software and tools for High Performance Computing (HPC) environments; software promoting NSF's strategic vision for digital data; network software to support distributed software; software in the form of middleware capabilities and services; and cybersecurity. SDCI funds software activities for enhancing scientific productivity and for facilitating research and education collaborations through sharing of data, instruments, and computing and storage resources. The program requires open source software development."

Of the five software areas to be supported by this program, Software for Digital Data is perhaps the one in which computational linguists may be able to make a contribution.
The solicitation lists four specific focus areas of interest:

  • Documentation/Metadata,
  • Security/Protection,
  • Data transport/management, and
  • Data analytics and visualization.
For the Documentation/Metadata focus area, the solicitation lists the following development areas of interest: "Tools for automated/facilitated metadata creation/acquisition, including linking data and metadata to assist in curation efforts; tools to enable the creation and application of ontologies, semantic discovery, assessment, comparison, and integration of new composite ontologies." Such tools, however, should be designed "to support multiple application domains and large-scale end use communities."

In addition, proposals will be evaluated for their ability to deal with one or more of the following cross-cutting software issues:
  • sustainability,
  • self-manageability, and
  • power/energy efficiency
as described in the solicitation at the beginning of Section II. Proposals must also be categorized as either New Development or Improvement and Support. Budgets for New Development must not exceed $500,000 per year, and those for Improvement and Support must not exceed $1,000,000 per year. All SDCI proposals should satisfy eleven "common requirements", some of which have been described above; these are listed at the end of Section II.

Language Description Heritage (LDH) Digital Library

At the Max Planck Society in Germany we are currently building up the Language Description Heritage (LDH), a digital library to share extant linguistic description and analysis. We plan to officially announce this initiative around February 2010. Currently, we are busy finishing the practical workflow and the communication with the authors who want to submit their work to this digital library.

At this stage, I would like to have some feedback on the explanation of the goals and procedures on the project website (specifically, the pages "About", "Objective" and "For Authors"). The technical details, like the sidebar and the entries themselves (as shown in "Archive"), are not yet finished, so please don't comment on that aspect.

http://ldh.blogs.mpdl.mpg.de/

Please send any comments, questions, or suggestions for improvement to Michael Cysouw at cysouw [at] eva.mpg.de. Thanks, and Merry Christmas!

Michael Cysouw

LSA Data Sharing Resolution

At the recently concluded Annual Meeting of the Linguistic Society of America (LSA) in Baltimore, the following resolution on data sharing was passed by those at the Business Meeting. It will soon be sent along to the whole membership of the Society for their vote.
The resolution was put forth by the LSA's Technology Advisory Committee.

--------------------------------------------

Whereas modern computing technology has the potential of advancing linguistic science by enabling linguists to work with datasets at a scale previously unimaginable; and

Whereas this will only be possible if such data are made available and standards ensuring interoperability are followed; and

Whereas data collected, curated, and annotated by linguists forms the empirical base of our field; and

Whereas working with linguistic data requires computational tools supporting analysis and collaboration in the field, including standards, analysis tools, and portals that bring together linguistic data and tools to analyze them,

Therefore, be it resolved at the annual business meeting on 8 January 2010 that the Linguistic Society of America encourages members and other working linguists to:
  • make the full data sets behind publications available, subject to all relevant ethical and legal concerns;
  • annotate data and provide metadata according to current standards and best practices;
  • seek wherever possible institutional review board human subjects approval that allows full recordings and transcripts to be made available for other research;
  • contribute to the development of computational tools which support the analysis of linguistic data;
  • work towards assigning academic credit for the creation and maintenance of linguistic databases and computational tools; and
  • when serving as reviewers, expect full data sets to be published (again subject to legal and ethical considerations) and expect claims to be tested against relevant publicly available datasets.
--------------------------------------------

The resolution passed in the Business Meeting by a comfortable enough margin that no vote count was required. Some members of the Society expressed reservations about the resolution, including: (i) the logically separate points it brings together, (ii) the overall framing of the resolution towards users of data rather than producers of data, and (iii) the relatively limited mention of "ethical" issues. My own sense is that some of these points could be addressed in revisions to the resolution that would probably be acceptable to both its original authors and those with objections. However, the LSA's somewhat antiquated resolution process does not make it easy to make such revisions on anything less than a full-year cycle. So, for now the above resolution is the one that will move forward to the membership.

After the resolution was presented at the Business Meeting, the LSA Ethics Committee decided it would discuss the resolution on its Ethics Discussion Blog in the near future, specifically to address what ethical issues it raises.

Presentation on the ISO 639 family of language codes

At the 2010 Annual Meeting of the Linguistic Society of America (LSA) in Baltimore there was a presentation by Rebecca Guenther of the Library of Congress (who is also rotating chair of the ISO 639 Joint Advisory Committee) about ISO language code standards. Her slides are now posted on the page of the LSA's Technology Advisory Committee and may be of interest to those wishing to know more about the ISO 639 family of standards, the most prominent of which for linguistics is ISO 639-3, which attempts to be comprehensive for the world's languages.
A "data problem"

On January 8, Fritz Newmeyer gave a very interesting talk at the University of Washington about the lack of evidence for a particular parameter from Principles and Parameters theory. As I understood it, the main points of his talk were, first, that when parameters from P&P theory are tested against a wide variety of languages, the correlations they are meant to capture tend not to hold up, but also that it can be very difficult to say this for sure, because multiple parameters can of course interact, obscuring the functioning of the parameter of interest from the purview of relatively superficial surveys.

In this context, Newmeyer mentioned what he characterized as a "data problem": every descriptive linguist and every typologist is working from their own interpretation of such fundamental concepts as "adjective" or "subject" or "case". This problem struck me as just the kind of problem that a full-fledged cyberinfrastructure for our field could (and eventually should) address. Furthermore, there are at least two ways in which cyberinfrastructure can help here.

First is through standardization. To the extent that resources like the GOLD ontology catch on, linguists can at least "opt in" to linking their terminology to the ontology, and this should improve comparability across studies. The second is through publication and aggregation of data: if the linguists that Newmeyer refers to are empirical linguists, then their definitions of these concepts ought to be grounded in linguistic facts (primarily facts about the distribution of formatives or the meanings of utterances).
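To make the standardization point concrete, here is a minimal sketch of the "opt in" idea: two projects keep their own terminology but map it to shared ontology concepts, which makes the overlap (and the mismatches) mechanically checkable. The URIs and term inventories below are invented for illustration; they are not actual GOLD identifiers.

```python
# Hypothetical sketch: each project's local terms are mapped to shared
# concept URIs (invented, GOLD-style), making inventories comparable.
project_a_terms = {
    "adj": "http://example.org/gold/Adjective",
    "subj": "http://example.org/gold/Subject",
}
project_b_terms = {
    "adjective": "http://example.org/gold/Adjective",
    "agent-subject": "http://example.org/gold/AgentSubject",
}

def comparable_concepts(terms_a, terms_b):
    """Return the shared concept URIs that both term inventories cover."""
    return set(terms_a.values()) & set(terms_b.values())

shared = comparable_concepts(project_a_terms, project_b_terms)
# Only "Adjective" is shared; the two notions of "subject" stay distinct,
# surfacing exactly the terminological mismatch Newmeyer describes.
```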
If the data behind analyses were published along with the analyses (in accessible, standards-compliant ways, with the relevant annotations included), then it ought to be possible to algorithmically check the compatibility of different uses of the same term, or at least for the interested linguist to "drill down" to get more information about the use of the terms in that particular work.

Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)

As described in a recently posted Dear Colleague letter, NSF has set up six task forces, including NSF Program Officers and "distinguished members from the external science and engineering community", to "develop a long term vision" for a Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21). The areas are:
  • Campus Bridging
  • Grand Challenges
  • Software and Tools
  • Data
  • High Performance Computing
  • Work Force Development
Draft versions of the documents produced by these task forces will be posted on an external wiki for public access and comment. If you wish to track or contribute to CF21, send email to acci-task-forces AT nsf DOT gov.

Strategic Technologies for Cyberinfrastructure (STCI) solicitation from NSF Office of Cyberinfrastructure

On January 26, the NSF Office of Cyberinfrastructure (OCI) announced its latest Strategic Technologies for Cyberinfrastructure (STCI) program solicitation, with target dates for full proposal submission of April 21 and August 5, 2010. The goal of the STCI program "is to support activities that lead to innovative cyberinfrastructure but are not currently funded by other programs or solicitations". The announcement lists six specific review criteria in addition to NSF's intellectual merit and broader impacts criteria, and states: "Proposals that do not address these points will not be competitive in this program." More information about STCI, including funding levels, is contained in two sets of slides (HERE and HERE) that were prepared by Jennifer Schopf, one of STCI's Program Directors. Note that STCI and SDCI are separate and distinct OCI programs!

The World Loanword Database goes online: interview with Robert Forkel

The World Loanword Database (WOLD, http://wold.livingsources.org/), edited by Martin Haspelmath and Uri Tadmor and published by the Max Planck Digital Library (http://www.mpdl.mpg.de/), is a new digital resource for linguists that allows tracing the origin of loanwords. We had the opportunity to interview WOLD web developer Robert Forkel and ask him about the design philosophy and technology behind the platform.
Soon (in about 1-2 weeks) we will also post an interview with Martin Haspelmath on the potential of WOLD for data-driven linguistic research.

Cornelius Puschmann: Robert, WOLD is a rich, open-access resource for studying a range of different questions in linguistics. Could you tell us a bit more about the history of WOLD itself, how it came into being?

Robert Forkel: Martin can tell you everything about the concept and history of WOLD, so I'll focus on the development process. Successful collaboration with the Max Planck Institute for Evolutionary Anthropology (EVA, http://www.eva.mpg.de/english/index.htm) on the World Atlas of Language Structures Online (WALS, http://wals.info/) led to the Cross-Linguistic Database Platform project (http://www.mpdl.mpg.de/projects/intern/cldp_de.htm). The idea behind the platform is the post-hoc integration of distributed resources via linked data (http://linkeddata.org/). WOLD is the second linked data resource for linguistics we have developed, so now the work on integrating the two can begin.

Cornelius Puschmann: Where does the data for WOLD come from and who contributed to it, apart from the editors and yourself?

Robert Forkel: I'll also refer you to Martin for a detailed answer to that question. The short version is that the data was contributed by a large group of researchers over several years in the Loanword Typology Project and then adapted for Web publication.

Cornelius Puschmann: What kind of technology is WOLD based on and how can researchers interact with the data?

Robert Forkel: WOLD is implemented using a Python web application framework (currently TurboGears, but we'll move to Pylons soon), serving data stored in a relational database (PostgreSQL).
Good question regarding how researchers can interact with the data -- we'd like to find out more about that once more people use WOLD. As stated above, we want to establish linked data and RDF as data access and exchange protocols. This will be beneficial to our own integration plans, but ideally it would also replace CSV/Excel/etc. as exchange formats. Our own plan in terms of data integration involves harvesting dispersed data and putting it in a central repository where it could be queried using SPARQL (http://www.w3.org/TR/rdf-sparql-query/). Pretty much like OLAC (http://www.language-archives.org/), just for data.

Cornelius Puschmann: How long did it take to develop WOLD and what resources, in terms of specialists and work hours, are needed to put a project on this scale together?

Robert Forkel: There is no simple answer to this, since different steps were involved, with the development of the WOLD web platform just being the last one. The data for WOLD was collected in a project running over several years. During this project, the data was stored in a FileMaker database (http://www.filemaker.com/), which made for easy data input but also required an extra data migration step for the online publication. Having gathered experience with this kind of toolset and the workflow of the linguists in the WALS Online project helped a lot. The work on the online publication of the data was also an ongoing process over the course of more than a year. There are always delays in a project with many contributors and parties involved, where careful coordination between scholars and developers is pivotal. I think to put together a project of this scale requires an organization which can dedicate small amounts of resources over a longer period of time.
The finished web application right now could probably be rewritten within a week or two -- which I'm actually doing for the switch to a new software framework. But as with WALS, an iterative process was essential. There is simply no way of imagining (let alone specifying) such an application without looking at it and discussing it with practitioners.

Cornelius Puschmann: How does WOLD tie in with other MPDL/MPG-EVA projects and who do you see as target audiences for the different resources you provide?

Robert Forkel: In various ways. For resources like the Intercontinental Dictionary Series (http://lingweb.eva.mpg.de/ids/), and word lists in general, the ties are very strong, i.e. I think it should be possible to mix and match data from these resources without much programming. In fact, we are thinking about reusing the web application serving WOLD to serve IDS as well, thereby publishing the IDS data as linked data too. With resources like WALS, integration will probably be on a more superficial level, à la "and what does WALS say about language X?" Finding out what it may mean to query WALS and WOLD and IDS data at once is ultimately the goal of the Cross-Linguistic Database Platform project, so stay tuned.

Regarding the target audience: the first week after its publication, WOLD showed that, just as with WALS, the user community is not restricted to linguistic specialists, but quite diverse.

Cornelius Puschmann: How do legal and licensing issues come into play when developing such resources? What role does Open Access play?

Robert Forkel: Legal and data licensing issues should come into play at a very early stage of your project. There is significant demand for qualified real legal advice, since all of this is uncharted terrain.
With WOLD we were in the fortunate situation that the data had not been published before and the editors agreed to publishing it under a Creative Commons Attribution (CC-BY) license, which I'm told qualifies as "real" open access. Even so, licensing and conveying license information remains a largely unsolved problem for research data, if not in principle, then practically in each concrete dataset I've encountered so far. A lot of insecurity in this area stems from a lack of precedent and explicit licensing terms. Being able to publish WOLD and WALS open access is certainly essential for getting an entity like the MPDL involved, since we are committed to open access (http://oa.mpg.de/openaccess-berlin/berlindeclaration.html). Publishing restricted data would be hard to justify in our context.

Cornelius Puschmann: Where do you see the field moving in terms of digital resources and cyberinfrastructure in the future?

Robert Forkel: Well, fortunately for researchers, I don't see the field moving forward so quickly that one risks falling behind. My personal opinion is that if in maybe three years a WOLD vocabulary can be imported into Excel or Google Spreadsheets by simply giving the vocabulary URL -- and be meaningfully merged with a word list from IDS -- I'd consider that a bright future.

Cornelius Puschmann: What are your recommendations for developers and researchers who want to build such resources or contribute to existing ones?

Robert Forkel: Get in touch! Actually the "contribution" question is still a big one for us. WALS has been a tremendous success in eliciting feedback.

I'd like to thank Robert for taking the time to chat with me.
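The "mix and match without much programming" integration Forkel envisions rests on a simple idea: records from different resources are keyed by shared concept URIs, so they can be joined with no schema negotiation. A minimal sketch, with all URIs and field values invented for illustration:

```python
# Toy illustration of linked-data-style merging: two resources
# (a WOLD-like loanword list and an IDS-like word list, both with
# invented data) are joined purely on shared concept URIs.
wold_records = {
    "http://example.org/concept/dog": {"loanword": False, "word": "hund"},
    "http://example.org/concept/tea": {"loanword": True, "word": "te"},
}
ids_records = {
    "http://example.org/concept/dog": {"word_list_form": "hundur"},
    "http://example.org/concept/sky": {"word_list_form": "himinn"},
}

def merge_by_uri(*resources):
    """Aggregate per-concept records from any number of resources."""
    merged = {}
    for resource in resources:
        for uri, record in resource.items():
            merged.setdefault(uri, {}).update(record)
    return merged

combined = merge_by_uri(wold_records, ids_records)
# combined["http://example.org/concept/dog"] now carries fields from
# both resources; concepts present in only one resource survive too.
```

In a real deployment the dictionaries would be replaced by RDF graphs queried over SPARQL, but the join-on-URI principle is the same.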
ISO 639-3 changes, four years on

About a year ago, in working on this paper, I attempted to do a rough count of the affiliations of people submitting code change requests for the ISO 639-3 language codes. The three-letter ISO 639-3 language codes are one of the more successful pieces of linguistic cyberinfrastructure, and, given their history as being largely derived from the old Ethnologue codes, it has been interesting to look at the extent to which proposed revisions to the codeset were coming from SIL and its various affiliates (including other missionary organizations) as opposed to other groups (e.g., linguists associated with purely academic institutions or members of the general public).

According to my quick and dirty semi-automated check of the requests from 2009 that have been processed, it looks like 2009 continues a trend seen in 2008, where about half of the requests come from "SIL" and the other half from non-SIL sources (including academic linguists and even some conlangers), which is an increase from the first two years of code change requests, which were more heavily skewed towards SIL. This trend of greater non-SIL participation would seem to be a good thing insofar as it means more eyes are on the standard. That being said, participation still doesn't seem to be where it should be, since not all that many linguists submitted change requests (maybe seven or so), though some of those who did submitted a lot. There are probably various reasons for this, but I suspect a big one is simply that it takes time to work on a change request, and it's not (yet!) a very valued endeavor. I wonder if a "stick" approach might work well here: for example, if one gets a documentation grant, perhaps submitting all appropriate code change requests should be considered a required outcome.
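The "quick and dirty semi-automated check" described above amounts to bucketing change requests by submitter affiliation. A sketch of what such a tally might look like, with the request records invented as stand-ins for the public change-request index and a deliberately crude SIL test:

```python
# Sketch of a rough affiliation tally; the records below are invented
# examples, not real ISO 639-3 change requests, and the substring test
# for "SIL" is intentionally naive.
from collections import Counter

change_requests = [
    {"id": "2009-001", "affiliation": "SIL International"},
    {"id": "2009-002", "affiliation": "University X"},
    {"id": "2009-003", "affiliation": "SIL International"},
    {"id": "2009-004", "affiliation": "independent"},
]

def tally_by_source(requests):
    """Bucket requests into SIL vs. non-SIL by a naive string match."""
    return Counter(
        "SIL" if "SIL" in r["affiliation"] else "non-SIL"
        for r in requests
    )

counts = tally_by_source(change_requests)  # counts: SIL=2, non-SIL=2
```

A real count would of course need the actual request records and a more careful treatment of SIL affiliates, which is where the "semi-automated" part comes in.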
Focus on applications

A lot of digital ink has been spilled in recent years laying out standards and best practices for language documentation and archiving, and rightly so. Coherent standards greatly improve the usefulness and longevity of archived data, and getting standards right is a difficult process. And measures like the recent LSA resolution and the requirements of funding agencies are an important step towards getting researchers to use these standards. But even more important (I believe) is the development of tools which let researchers take advantage of these emerging standards in the earliest stages of their research.

As an example, let me describe a language documentation project I've been tangentially involved with (which shall remain nameless). It began as a graduate field methods class, but the instructors and students quickly realized that they had found an excellent language informant who spoke an extremely interesting language, and over the last five years it's developed bit by bit into something a lot more. Currently, the workflow looks a little like this: the researchers meet with the speakers in a variety of settings and circumstances, both here in San Diego and back in the old country, during intensive field sessions and in brief meetings fit in around everyone's work schedules. The linguists for the most part take handwritten notes, which they later type up as Microsoft Word documents. For better or worse (mostly worse), this collection of Word documents constitutes the main database for the project.
[221000140110] |One research assistant (a linguistics PhD student who's reasonably computer savvy but no specialist) is tasked with working through all the Word documents, cleaning things up, regularizing the various orthographies, trying to build an index listing the locations of relevant examples, and constructing a lexical database based on examples culled from the notes. [221000140120] |This database (constructed using FileMaker Pro, for no particular reason) will be used to make a web-accessible dictionary and, when the project is complete, will be archived in the appropriate places using the appropriate standards. [221000140130] |But what about all the other data? [221000140140] |Most likely, if it gets archived at all, it will be as a heap of unprocessed documents, confusing to project participants and more or less useless to outsiders. [221000140150] |How could this project be helped? [221000140160] |The PIs are all for data sharing and archiving. [221000140170] |They just don't have the expertise to do it, and resources are limited. [221000140180] |Given the choice between hiring an XML expert or doing another three months of data collection, which would you pick? [221000140190] |But what if they had access to a suite of tools that fit into their workflow and made their linguistic lives easier, and which also, as a side benefit, made it easier for them to publish the results of the field research in a standards-compliant format? [221000140200] |If it didn't require them to significantly change the way they do their work and could also reduce the amount of database futzing their grad students have to do, I'm sure they would use it. [221000140210] |Do such tools exist? [221000140220] |If so, what can we do to publicize their existence? [221000140230] |If not, what needs to be done to create them? [221000140240] |There are certainly plenty of tools out there intended for field linguists... who uses them, who doesn't, and why?
[221000140250] |And do they all support current standards and best practices? [221000140260] |I'm not sure to what extent these questions are being addressed, but they don't seem to be getting as much attention as other infrastructure issues. [221000140270] |Don't forget the applications! [221000140280] |To paraphrase Niklaus Wirth, Standards + Tools = Archives. [221000150010] |Etnolinguistica.Org: a report from South America [221000150020] |For the past few years, I've been part of a team involved in building an information hub on indigenous South American languages, a place to create and gather online resources for both academic researchers and the general public. [221000150030] |The project, Etnolinguistica.Org, started in 2002 as a mailing list. [221000150040] |The list quickly evolved into a major forum for the discussion of research topics on South American languages and the promotion of events and online resources—in sum, a meeting point for all those interested in South American linguistics and related areas. [221000150050] |As a result of the list's popularity, the website currently comprises more than 700 pages, including conference abstracts, articles, and a comprehensive, up-to-date library of links to open-access periodicals, news articles, and other online resources. [221000150060] |The project is community-driven, as the list's users (ranging from experienced scholars to undergraduate students) remain by far our most important sources. [221000150070] |Since 2009, the website has also published Cadernos de Etnolingüística (ISSN 1946-7095), a peer-reviewed, open-access online journal on South American languages. [221000150080] |Our most popular features, in terms of both hits and community participation, are our dissertation repository, which currently lists 165 freely available theses and dissertations (many of them author-submitted), and the Curt Nimuendaju Digital Library, which offers hard-to-find, out-of-print books and articles.
[221000150090] |Named after a pioneer of Brazilian ethnography and linguistics, the library includes, in addition to items digitized by its own volunteer staff or by similar projects, a number of items donated by interested readers (including authors or their heirs). [221000150100] |The direct participation of linguists actively involved in the documentation of South American languages is the main characteristic of Etnolinguistica.Org, helping to keep our information relevant and accurate. [221000150110] |To further contextualize the information we provide, we've recently started a directory of linguists working on South American indigenous languages. [221000150120] |Each entry is an individual page containing basic information on the researcher: name, institutional affiliation, means of contact (email addresses are duly protected via ReCaptcha), interest areas, and languages of interest. [221000150130] |The directory is cross-referenced with our ever-growing list of online resources, in such a way that, by clicking on a given language tag, one finds not only a list of online materials but also ways of getting directly in touch with linguists working on that language. [221000150140] |As a further step towards that goal, Etnolinguistica.Org will launch a catalogue of South American languages later this year (for examples, take a look here and here). [221000150150] |That integration between authors and resources will hopefully ensure a certain measure of control, by the scientific community, over the quality of the information being provided. [221000160010] |NSF Software Infrastructure for Sustained Innovation (SI**2) [221000160020] |On March 16, the National Science Foundation announced the Software Infrastructure for Sustained Innovation (SI**2) Program Solicitation 10-551 at http://www.nsf.gov/pubs/2010/nsf10551/nsf10551.pdf. [221000160030] |This is an NSF-wide solicitation, led by the Office of Cyberinfrastructure (OCI).
[221000160040] |It is the first tangible result of the March 3 Dear Colleague letter at http://www.nsf.gov/pubs/2010/nsf10029/nsf10029.jsp?org=OCI announcing the Cyberinfrastructure Framework for 21st Century Science and Engineering. [221000160050] |This posting provides only the basic information about the solicitation. [221000160060] |I think it is a very promising one for supporting certain kinds of research and development for linguistic cyberinfrastructure, and I hope our community is able to take advantage of it. [221000160070] |There are two deadlines: letters of intent are due by 5pm proposer's local time on May 10, 2010. [221000160080] |These are required; you cannot submit a full proposal without having submitted an LOI. [221000160090] |Full proposals are due on June 14, 2010. [221000160100] |There are two categories of submissions for this solicitation; a third will be available starting in 2011. [221000160110] |1. Scientific Software Elements (SSE) 2. Scientific Software Integration (SSI) [221000160120] |SSE awards are for small groups that will create and deploy robust software elements for which there is a demonstrated need ... [read the solicitation, p. 5, for the details]. [221000160130] |These are expected to total $300K-$500K over 3 years. [221000160140] |About 18 SSE awards will be made in FY 2010, subject to availability of funds. [221000160150] |SSI awards are for larger multidisciplinary groups organized around a common research problem and common software infrastructure, and will result in sustainable community software ... [more details in solicitation, p. 5]. [221000160160] |These are expected to total approximately $1M per year for 3-5 years. [221000160170] |About 4 SSI awards will be made in FY 2010, subject to availability of funds.
[221000170010] |Language Description Heritage (LDH) Digital Library [221000170020] |Dear colleagues, [221000170030] |It is my pleasure to announce the Language Description Heritage (LDH) open access digital library, available online at [221000170040] |http://ldh.livingsources.org [221000170050] |The LDH is being compiled at the Max Planck Society in Germany, specifically at the MPI for Evolutionary Anthropology in Leipzig, in cooperation with the Max Planck Digital Library in Munich. [221000170060] |The goal of the LDH is to make available existing descriptive and analytic work about the world’s languages. [221000170070] |The main focus is to provide easy access to traditionally difficult-to-obtain scientific contributions. [221000170080] |Specifically, there are many unpublished theses and manuscripts with valuable data on individual languages that are often unknown and unavailable to the wider linguistic community. [221000170090] |Also, many out-of-print publications with limited availability in research libraries deserve a much wider audience and recognition. [221000170100] |To enhance the flow of scientific discussion, we offer this platform to make electronic versions of these contributions freely available. [221000170110] |The Language Description Heritage Digital Library minimally provides photographic scans, downloadable in PDF format (more is planned for the future). [221000170120] |Most importantly, all content in this digital library is available under a permissive Creative Commons (CC-by) license, so everything can be freely used for all scientific purposes. [221000170130] |If you are the author and/or rights-holder of a suitable publication, please consider making your work available under a CC license. [221000170140] |This is a very simple process. [221000170150] |Basically, you sign a permission form (http://ldh.livingsources.org/files/2009/08/formular13081.pdf) and send it to us.
[221000170160] |Detailed instructions can be found at http://ldh.livingsources.org/for-authors/ [221000170170] |We recommend assigning a bare CC-by ("Attribution") license to your work, though you might also opt for an even freer CC-zero ("No Rights Reserved", equivalent to "Public Domain"). [221000170180] |Clear and open licensing enhances the exchange of scientific ideas. [221000170190] |In choosing a license, please be aware that there is a difference between scientific recognition and commercial recognition of your work. [221000170200] |Whatever license you choose for your work, it does not regulate scientific recognition! [221000170210] |To obtain more scientific recognition, it is best to make your work as broadly and easily available as possible, so others can find and acknowledge it without restriction. [221000170220] |To enhance the exchange of scientific results, we recommend choosing a highly permissive license. [221000170230] |Best, Michael Cysouw [221000170240] |--------------- [221000170250] |Max Planck Institute for Evolutionary Anthropology – Library Language Description Heritage (LDH) project Deutscher Platz 6 04103 Leipzig, Germany [221000170260] |email: l...@eva.mpg.de [221000170270] |Scientific Mentoring: Prof. Dr. Bernard Comrie, Dr. Michael Cysouw Library Assistance: Gisela Lausberg, Kirstin Baumgarten [221000200010] |NSF Fellowships for Transformative Computational Science using CyberInfrastructure (CI TraCS) [221000200020] |The NSF Office of Cyberinfrastructure announced a new solicitation for post-doctoral fellowships for transformative computational science using cyberinfrastructure (CI TraCS) at http://www.nsf.gov/pubs/2010/nsf10553/nsf10553.pdf. [221000200030] |Applicants must be US citizens, nationals, or legally admitted permanent resident aliens of the US; must have received a doctoral degree by the start date of the award, but no more than two years before the beginning of the year in which the award is made (e.g., a recipient of an award that starts on 1 Sept 2010 must have received the doctoral degree between January 2008 and August 2010 inclusive); and must have selected a host institution and sponsoring scientist(s) different from those associated with their doctoral degree (but see pp. 4-5 of the solicitation). [221000200040] |Postdoctoral research activities under CI TraCS must be computational in nature and CI-based, and applicants are expected to include a plan for education and mentoring activities in their proposal; as a guideline, proposers should plan for educational activities to take up between 10% and 25% of their time. [221000200050] |Fellowships are awarded to the applicant, who must identify a host research organization, such as a college, university, privately-sponsored nonprofit institute, government agency or laboratory, or for-profit organization.
(The last mentioned is prefaced with "under special conditions" in the solicitation.) [221000200070] |Fellows will be expected to participate in an annual Fellows' workshop. [221000200080] |The award is for up to 3 years; the total fellowship amount is $240K over 3 years: a stipend of $60K in Year 1, $65K in Year 2, and $70K in Year 3; an institutional allowance of $5K per year; and a research allowance supplement of $10K per year. [221000200090] |Fellows moving on to a tenure-track faculty position following their fellowship may apply for a $50K start-up supplement. [221000200100] |CI TraCS proposals are to be submitted by the individual, not by the host institution. [221000200110] |Instructions are available on the NSF FastLane homepage by clicking on the Postdoctoral Fellowships link. [221000200120] |Applicants must first register as individual researchers before they or their references can access the application procedures. [221000200130] |A complete submission consists of: [221000200140] |
• a one page Project Summary, including separate paragraphs describing the proposal's intellectual merit and broader impacts
[221000200150] |• a 10 page maximum Project Description, with the following information:
[221000200160] |  • a plan for research and education activities, highlighting the key CI-related components and how CI will be used to advance the discipline
[221000200170] |  • a justification for the choice of host institution and sponsoring scientist(s), and a description of available mentoring, facilities, and resources
[221000200180] |  • a description of the applicant's long-term career goals and the role of the fellowship in achieving them
[221000200190] |• a list of references cited in the Project Summary and Description
[221000200200] |• a 2-page CV of the applicant
[221000200210] |• a letter of commitment from the host institution and sponsoring scientist, including a mentoring plan and a 2-page CV for each sponsoring scientist
[221000200220] |• two reference letters, including one from the applicant's doctoral dissertation advisor; the other should not be submitted by a sponsoring scientist
[221000200230] |• a 1-page abstract of the applicant's dissertation research
[221000200240] |Read the solicitation carefully to make sure you understand all the details. [221000200250] |Questions should be sent to the cognizant program officers, Manish Parashar and Mimi McClure, at citracs AT nsf DOT gov. Upcoming submission deadlines are 5pm proposer's local time on:
[221000200260] |• June 21, 2010
[221000200270] |• January 13, 2011
[221000200280] |• January 13, 2012
[221000200290] |It is estimated that 6 to 8 new awards will be made each year, depending on the quality of the proposals and the availability of funds. [221000200300] |The anticipated program budget is $2M annually. [221000210010] |Dictionaries and Endangered Languages [221000210020] |The Endangered Languages and Dictionaries Project at the University of Cambridge investigates ways of writing dictionaries that better facilitate the maintenance and revitalization of endangered languages. [221000210030] |It explores the relationship between documenting a language and sustaining it, and entails collaboration with linguists, dictionary-makers, and educators, as well as members of endangered-language communities themselves, in order to determine which lexicographic methodologies work particularly well pedagogically for language maintenance and revitalization. [221000210040] |In addition to developing a methodology for writing dictionaries that are more community-focussed and collaborative in their making, content, and format, the Project is creating an online catalogue of dictionary projects around the world. [221000210050] |If you would like your dictionary to be included in the catalogue, please fill out the Dictionary Survey via http://www.lucy-cav.cam.ac.uk/pages/the-college/people/sarah-ogilvie/ela... or contact Sarah Ogilvie at svo...@cam.ac.uk. [221000210060] |We really hope you will want to participate, in order to make the catalogue as comprehensive as possible. [221000210070] |-- Dr Sarah Ogilvie Alice Tong Sze Research Fellow Lucy Cavendish College Lady Margaret Road University of Cambridge Cambridge CB3 0BU. [221000210080] |Tel. [221000210090] |Office (+44) 01223 764018; Mobile (+44) 07540 133790 [221000230010] |LiLT Special Volume: Implementation of Linguistic Analyses against Data [221000230020] |We are pleased to announce that Linguistic Issues in Language Technology Volume 3, Implementation of Linguistic Analyses against Data, has appeared.
[221000230030] |This volume, edited by Terry Langendoen and Emily Bender, contains papers by presenters at the LSA 2009 invited symposium "Computational Linguistics in Support of Linguistic Analysis". [221000230040] |Table of contents: [221000230050] |
• Special volume introduction, D. Terence Langendoen, Emily M. Bender
• Computational Linguistics in Support of Linguistic Theory, Emily M. Bender, D. Terence Langendoen
• Reweaving a Grammar for Wambaya, Emily M. Bender
• Computational strategies for reducing annotation effort in language documentation, Alexis Palmer, Taesun Moon, Jason Baldridge, Katrin Erk, Eric Campbell, Telma Can
• Affective 'this', Christopher Potts, Florian Schwarz
[221000230060] |We would like to thank Chris Kennedy and Larry Horn, who as co-chairs of the Program Committee for that meeting invited us to organize the symposium; David Lightfoot for encouraging us to move forward quickly to disseminate this work; and Annie Zaenen for agreeing to consider the papers for publication in LiLT and getting them reviewed in a very timely fashion. [221000240010] |Workshop on Advanced Corpus Solutions, PACLIC 24 [221000240020] |Call for papers: http://www.hf.uio.no/tekstlab/paclic/index.html [221000240030] |Submission deadline: June 14, 2010 Workshop date: November 4, 2010 [221000240040] |This workshop invites papers on advances in corpus types and corpus tools in support of linguistic research. [221000250010] |Resolution on Cyberinfrastructure for Linguistics on LSA ballot [221000250020] |The LSA resolution on cyberinfrastructure for linguistics is now up for a vote of the membership. [221000250030] |LSA members can vote here. [221000250040] |(Information on joining LSA.)
[221000250050] |Here is the LSA's summary of the resolution: [221000250060] |"Resolution #2, a 'Resolution on Cyberinfrastructure,' expresses the LSA’s support for exploiting the power of modern information technology to the fullest extent in linguistic research, by making available in digital form full data sets behind publications, working towards standards ensuring interoperability, creating new analysis tools, and other relevant measures." [221000250070] |We posted the full text of the resolution earlier this year. [221000250080] |The other resolution on the ballot is also highly relevant to cyberling: [221000250090] |"Resolution #1, a 'Resolution Recognizing the Scholarly Merit of Language Documentation,' puts the LSA on record as supporting the recognition of work in language documentation as a scholarly contribution to be given weight in the awarding of advanced degrees and in decisions on hiring, tenure, and promotion of faculty." [221000260010] |Conference on Electronic Grammaticography [221000260020] |Dates: 11-Feb-2011 - 12-Feb-2011 Location: Leipzig, Germany Contact Person: Sebastian Nordhoff Meeting Email: sebastian_nordhoff at eva.mpg.de General Web Site: http://www.eva.mpg.de/lingua/conference/11-grammaticography2011 Call for Papers: http://www.eva.mpg.de/lingua/conference/11-grammaticography2011/files/ca... [221000260030] |Abstract deadline: 1-Oct-2010 [221000260040] |This meeting will bring together field linguists, computer scientists, and publishers with the aim of exploring production and dissemination of grammatical descriptions in electronic/hypertextual format. [221000270010] |A linguist’s perspective on Creative Commons’ data sharing whitepaper [221000270020] |Edit: this post on (legal aspects of) data sharing by Creative Commons' Kaitlin Thaney is also highly recommended. 
[221000270030] |If you're involved in academic publishing -- whether as a researcher, librarian, or publisher -- data sharing and data publishing are probably hot issues to you. [221000270040] |Beyond its versatility as a platform for the dissemination of articles and ebooks, the Internet is increasingly also a place where research data lives. [221000270050] |Scholars are no longer restricted to referring to data in their publications or including charts and graphs alongside the text, but can link directly to data published and stored elsewhere, or even embed data into their papers, a process facilitated by standards such as the Resource Description Framework (RDF). [221000270060] |Journals such as Earth System Science Data and the International Journal of Robotics Research give us a glimpse of how this approach might evolve in the future -- from journals to data journals, publications concerned with presenting valuable data for reuse, paving the way for a research process that is increasingly collaborative. [221000270070] |Technology is gradually catching up with the need for genuinely digital publications, a need fueled by the advantages of being able to combine text, images, links, videos, and a wide variety of datasets to produce a next-generation multi-modal scholarly article. [221000270080] |Systems such as Fedora and PubMan are meant to facilitate digital publishing and assure best-practice data provenance and storage. [221000270090] |They are able to handle different types of data and associate any number of individual files with a "data paper" that documents them. [221000270100] |However, technology is the much smaller issue when weighing the advantages of data publishing against its challenges -- of which there are many, both for practitioners and for those supporting them. [221000270110] |Best practices on the individual level are cultural norms that need to be established over time.
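To make the RDF idea above a little more concrete, here is a minimal sketch of how a paper can point at the dataset it builds on as machine-readable triples, serialized as N-Triples. The URIs are invented placeholders, and Dublin Core terms are just one plausible vocabulary choice, not a fixed schema.

```python
# A paper-to-dataset link expressed as RDF triples. Everything at
# example.org is a made-up placeholder; dcterms is one common vocabulary
# for this kind of bibliographic metadata.
PAPER = "<http://example.org/papers/my-data-paper>"
DATA = "<http://example.org/datasets/lexicon-v1>"
DCT = "http://purl.org/dc/terms/"

triples = [
    (PAPER, f"<{DCT}references>", DATA),
    (DATA, f"<{DCT}title>", '"Lexicon, raw entries (v1)"'),
    (DATA, f"<{DCT}license>", "<http://creativecommons.org/licenses/by/4.0/>"),
]

# N-Triples: one "subject predicate object ." statement per line.
ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
print(ntriples)
```

The point is that the link from article to data (and the data's license) becomes part of the published record itself, rather than a footnote a human has to interpret.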
[221000270120] |Scientists still don't have sufficient incentives to openly share their data, as tenure processes are tied to publishing results based on data, but not to sharing data directly. [221000270130] |And finally, technology is prone to failure when there are no agreed-upon standards guiding its use, and such standards need to be gradually (meaning painfully slowly, compared with technology's breakneck pace) established and accepted by scholars, not decreed by committee. [221000270140] |In March, Jonathan Rees of NeuroCommons (a project within Creative Commons/Science Commons) published a working paper that outlines such standards for reusable scholarly data. [221000270150] |One thing I really appreciate about Rees' approach is that it is remarkably discipline-independent and not limited to the sciences (as opposed to the social sciences and the humanities). [221000270160] |Rees outlines how data papers differ from traditional papers: [221000270170] |A data paper is a publication whose primary purpose is to expose and describe data, as opposed to analyze and draw conclusions from it. [221000270180] |The data paper enables a division of labor in which those possessing the resources and skills can perform the experiments and observations needed to collect potentially interesting data sets, so that many parties, each with a unique background and ability to analyze the data, may make use of it as they see fit. [221000270190] |The key phrase here (which is why I couldn't resist boldfacing it) is division of labor. [221000270200] |Right now, to use an auto manufacturing analogy, a scholar does not just design a beautiful car (an analysis in the form of a research paper that culminates in observations or theoretical insights); she also has to build an engine (the data that her observations are based on).
[221000270210] |It doesn't matter if she is a much better engineer than designer: the car will only run (she'll only get tenure) if both the engine and the car meet the same requirements. [221000270220] |The car analogy isn't terribly fitting, but it serves to make the point that our current system lacks a division of labor, making it pretty inefficient. [221000270230] |It's based more on the idea of producing smart people than on the idea of getting smart people to produce reusable research. [221000270240] |Rees notes that data publishing is a complicated process and lists a set of rules for successful sharing of scientific data. [221000270250] |From the paper: [221000270260] |
  • The author must be professionally motivated to publish the data
  • [221000270270] |The effort and economic burden of publication must be acceptable
  • [221000270280] |The data must become accessible to potential users
  • [221000270290] |The data must remain accessible over time
  • [221000270300] |The data must be discoverable by potential users
  • [221000270310] |The user’s use of the data must be permitted
  • [221000270320] |The user must be able to understand what was measured and how (materials and methods)
  • [221000270330] |The user must be able to understand all computations that were applied and their inputs
  • [221000270340] |The user must be able to apply standard tools to all file formats
[221000270350] |At a glance, these rules signify very different things. #1 and #2 are preconditions rather than prescriptions, while #3–#6 are concerned with what the author needs to do in order to make the data available. [221000270360] |Finally, rules #7–#10 are concerned with making the data as useful to others as possible. [221000270370] |Rules #7–#10 depend on who "the user" is and qualify as "do-this-as-best-as-you-can"-style suggestions rather than strict requirements, not because they aren't important, but because it's impossible for the author to guarantee their successful implementation. [221000270380] |By contrast, #3–#6 are concerned with providing and preserving access and are requirements -- I can't guarantee that you'll understand (or agree with) my electronic dictionary of Halh Mongolian, but I can make sure it's stored in an institutional or disciplinary repository that is indexed in search engines, mirrored so the data can't be lost, and licensed in a legally unambiguous way, rather than uploading it to my personal website and hoping for the best when it comes to long-term availability, ease of discovery and legal re-use. [221000270390] |Finally, Rees gives some good advice beyond tech issues to publishers who want to implement data publishing: [221000270400] |Set a standard. [221000270410] |There won't be investment in data set reusability unless granting agencies and tenure review boards see it as a legitimate activity. [221000270420] |A journal that shows itself credible in the role of enabling reuse will be rewarded with submissions and citations, and will in turn reward authors by helping them obtain recognition for their service to the research community. [221000270430] |This is critical. [221000270440] |Don't wait for universities, grant agencies or even scholars to agree on standards entirely on their own -- they can't and won't if they don't know how digital publishing works (legal aspects included). 
[221000270450] |Start an innovative journal and set a standard yourself by being successful. [221000270460] |Encourage use of standard file formats, schemas, and ontologies. [221000270470] |It is impossible to know what file formats will be around in ten years, much less a hundred, and this problem worries digital archivists. [221000270480] |Open standards such as XML, RDF/XML, and PNG should be encouraged. [221000270490] |Plain text is generally transparent but risky due to character encoding ambiguity. [221000270500] |File formats that are obviously new or exotic, that lack readily available documentation, or that do not have non-proprietary parsers should not be accepted. [221000270510] |Ontologies and schemas should enjoy community acceptance. [221000270520] |This is an important suggestion that is entirely compatible with linguistic data (dictionaries, word lists, corpora, transcripts, etc.) and simplified by the fact that we have comparatively small datasets. [221000270530] |Even a megaword corpus is small compared to climate data or gene banks. [221000270540] |Aggressively implement a clean separation of concerns. [221000270550] |To encourage submissions and reduce the burden on authors and publishers, avoid the imposition of criteria not related to data reuse. [221000270560] |These include importance (this will not be known until after others work with the data) and statistical strength (new methods and/or meta-analysis may provide it). [221000270570] |The primary peer review criterion should be adequacy of experimental and computational methods description in the service of reuse. [221000270580] |This will be a tough nut to crack, because it breaks with tradition to a degree. [221000270590] |Relevance was always high on the list of requirements while publications were scarce -- paper costs money, so what was published had to be important to as many people as possible. 
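Rees' warning about character encoding ambiguity, quoted above, is easy to demonstrate: a file's bytes carry no record of the encoding that produced them, so the same bytes can decode without error into different text. A minimal Python sketch (the example word is arbitrary):

```python
# The same byte sequence, read under two different encodings, yields different
# text -- the "character encoding ambiguity" that makes bare plain text risky
# as an archival format.
data = "Tüvan".encode("utf-8")      # b'T\xc3\xbcvan'

as_utf8 = data.decode("utf-8")      # round-trips correctly
as_latin1 = data.decode("latin-1")  # decodes without error, but to mojibake

print(as_utf8)    # Tüvan
print(as_latin1)  # TÃ¼van
```

Declaring the encoding explicitly in metadata, or standardizing on UTF-8, removes the ambiguity; leaving it implicit shifts the burden onto every future user of the data.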
[221000270600] |With data publishing, this scarcity-driven filtering is no longer necessary -- whether something is important or statistically strong (applying this to linguistics, one might say representative, well-documented, etc.) is impossible to know from the outset. [221000270610] |It's much more sensible to get it out there and deal with the analysis later, rather than creating an artificial scarcity of data. [221000270620] |But it will take time and cultural change to get researchers (and both funding agencies and hiring committees) to adapt to this approach. [221000270630] |In the meantime, while we're still publishing traditional (non-data) papers, we can at least work on making them more accessible. [221000270640] |Something like arXiv for linguistics wouldn't hurt. [221000280010] |Resolutions pass! [221000280020] |The LSA announced today that both resolutions on the May 31 ballot (the Resolution on Cyberinfrastructure and the Resolution Recognizing the Scholarly Merit of Language Documentation) have passed! [221000290010] |Inspiring article about data sharing [221000290020] |NYTimes: Sharing of Data Leads to Progress on Alzheimer’s [221000290030] |http://www.nytimes.com/2010/08/13/health/research/13alzheimer.html?_r=3&pagewanted=1&hp [221000300010] |RELISH Meeting in Nijmegen [221000300020] |On 4–5 August, the RELISH project held a workshop on lexicon tools and lexical standards. [221000300030] |Slides from many of the presentations are posted on the workshop site. [221000300040] |An important goal of the workshop was to work towards harmonization of standards for lexical data, and it was noteworthy for including computational linguists, field linguists, and software engineers among its participants. [221000300050] |In addition to discussion of existing standards like LIFT and LMF, there also appeared to be an emerging agreement among participants that ISOcat will soon be in a position to provide a useful backbone for interoperability via its data category registry. 
[221000310010] |Conference on Electronic Grammaticography—Location Change [221000310020] |The location for the Conference on Electronic Grammaticography, previously announced on this blog, has been moved to the University of Hawaii so that it can be held under the umbrella of the 2nd International Conference on Language Documentation and Conservation. [221000310030] |Abstracts are due on 31 August 2010. [221000320010] |Invitation from NSF/SBE for white papers describing grand challenges [221000320020] |The NSF Directorate for the Social, Behavioral, and Economic Sciences (SBE) released last week a Dear Colleague Letter inviting members of the research community (individuals and groups) to submit, by September 30th, white papers of at most 2,000 words outlining what they think are "grand challenge" questions in the fields supported by SBE "that are both foundational and transformative". [221000320030] |These contributions will be used to help the Directorate make plans to support research over the coming decade and beyond. [221000320040] |The white-paper submission form provides guidance for contributors and self-contained instructions for submission. [221000320050] |I strongly encourage all the Cyberling participants to seriously consider submitting white papers, if only to ensure that the ideas we have about advancing human language science and technology are on the table when SBE makes its plans for supporting the next decade of research across all the social, behavioral and economic sciences. [221000330010] |Abney & Bird's Grand Challenge: The Human Language Project [221000330020] |Steven Abney and Steven Bird published a provocative paper (.pdf) at ACL 2010 calling on the computational linguistics community to work to create a "Universal Corpus", an undertaking that they compare in both scale and potential impact to the Human Genome Project. 
[221000330030] |Here is the abstract: [221000330040] | We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. [221000330050] |The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. [221000330060] |We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. [221000330070] |We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world’s linguistic heritage before more languages fall silent. [221000330080] |Will the community take up this challenge? [221000330090] |Will the linguistics and computational linguistics communities succeed in working together on it? [221000330100] |It seems to me that neither community could do it alone, but achieving it will take better communication between the two fields than we have at present. 
[221000340050] |In many cases, the design of the tasks has been refined over recurrent field seasons, yielding well-adapted, sensitive instruments for investigating, e.g., semantic distinctions in a language without a writing system or a culture with only minimal schooling. [221000340060] |In this way, the tasks are the joint product of many scholars working in over 50 languages and cultures. [221000340070] |For years these field manuals have been available on demand, but they have now been put online for the first time, and this site will serve as the online repository for both older manuals and new ones currently under development. [221000340080] |Free registration is required to access the materials. [221000340090] |That is so we can keep track of new users and potential new data — we plan an archiving system that will allow users to contribute to the joint enterprise. [221000340100] |—Stephen C. Levinson, Asifa Majid, and Mark Dingemanse, Language & Cognition group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands [221000350010] |The Endangered Languages Archive (ELAR) at SOAS [221000350020] |The Endangered Languages Archive (ELAR) at SOAS preserves and disseminates digital documentation of endangered languages around the world, especially (but not limited to) the outcomes of ELDP-funded projects. [221000350030] |ELAR's recently re-launched website is designed specifically to suit the needs of endangered languages archiving, using "Web 2.0" methods to implement a nuanced access control system and make the site user-friendly for a range of audiences. [221000350040] |We see the site not only as a source of valuable data but also as a forum for negotiation and exchange between depositors and users. [221000350050] |There are currently 19 language documentations available, from a variety of regions including Alaska, Australia, India, Mexico, Siberia, Solomon Islands, and Tanzania. 
[221000350060] |We are adding additional documentations to the site at the rate of approximately one each week. [221000360010] |A Grand Challenge for Linguistics: Scaling Up and Integrating Models [221000360020] |In response to NSF's call for White Papers in the SBE 2020 Initiative, Jeff Good and I have submitted a paper outlining our take on Cyberinfrastructure for Linguistics, why it's necessary, and how it can come about. [221000360030] |The abstract: [221000360040] | The preeminent grand challenge facing the field of linguistics is the integration of theories and analyses from different levels of linguistic structure and aspects of language use to develop comprehensive models of language. [221000360050] |Addressing this challenge will require massive scaling-up in the size of data sets used to develop and test hypotheses in our field as well as new computational methods, i.e., the deployment of cyberinfrastructure on a grand scale, including new standards, tools and computational models, as well as requisite culture change. [221000360060] |Dealing with this challenge will allow us to break the barrier of only looking at pieces of languages to actually being able to build comprehensive models of all languages. [221000360070] |This will enable us to answer questions that current paradigms cannot adequately address, not only transforming Linguistics but also impacting all fields that have a stake in linguistic analysis. [221000370010] |NSF emphasizes data sharing policy [221000370020] |Good news from NSF: In the most recent update to the Grant Proposal Guide, they have strengthened and emphasized the requirements for data sharing. [221000370030] |Here is the summary from the GPG Summary of Significant Changes page: [221000370040] |Chapter II.C.2.j, Special Information and Supplementary Documentation, contains a clarification of NSF’s long standing data policy. 
[221000370050] |All proposals must describe plans for data management and sharing of the products of research, or assert the absence of the need for such plans. [221000370060] |Fastlane will not permit submission of a proposal that is missing a Data Management Plan. [221000370070] |Cross-references are included in the Project Description section (II.C.2.d), the Results from Prior NSF Support (II.C.2.d(iii)), Proposals for Conferences, Symposia and Workshops (II.D.8), and the Proposal Preparation Checklist (Exhibit II-1). [221000370080] |The Data Management Plan will be reviewed as part of the intellectual merit or broader impacts of the proposal or both. [221000370090] |The actual Grant Proposal Guide (section II.C.2.j) contains this text: [221000370100] | Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. [221000370110] |This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include: [221000370120] |
  • the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
  • [221000370130] |the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
  • [221000370140] |policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
  • [221000370150] |policies and provisions for re-use, re-distribution, and the production of derivatives; and
  • [221000370160] |plans for archiving data, samples, and other research products, and for preservation of access to them.
[221000370170] |The text above refers to Chapter VI.D.4 of the Award and Administration Guide, which includes the following: [221000370180] | b. Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. [221000370190] |Grantees are expected to encourage and facilitate such sharing. [221000370200] |Privileged or confidential information should be released only in a form that protects the privacy of individuals and subjects involved. [221000370210] |General adjustments and, where essential, exceptions to this sharing expectation may be specified by the funding NSF Program or Division/Office for a particular field or discipline to safeguard the rights of individuals and subjects, the validity of results, or the integrity of collections or to accommodate the legitimate interest of investigators. [221000370220] |A grantee or investigator also may request a particular adjustment or exception from the cognizant NSF Program Officer. [221000380010] |Subscribe to eLanguage journals [221000380020] |This isn't well-advertised, and it even takes a while to find it on the page, but it's possible to subscribe (for free) to the eLanguage online journals. [221000380030] |The value of a subscription is that you'll get email alerts when a new issue is published---allowing you to look it over and see what's new, and allowing the authors who publish in these fora to have greater impact. [221000380040] |Once you have one account, it's fairly easy to add subscriptions for the other eLanguage journals. [221000380050] |Here's the link to subscribe to Linguistic Issues in Language Technology. 
[221000390010] |Copyright-free language descriptions [221000390020] |Dear colleagues, [221000390030] |the Language Description Heritage project is slowly picking up steam, making more and more content publicly available under a permissive license. [221000390040] |Please check the announcement blog to see our current list of available works: [221000390050] |http://ldh.livingsources.org/archive/ [221000390060] |We are currently going through out-of-copyright works from before 1935. [221000390070] |If you happen to have a digital version of any such work lying around (and it is not yet available in the above-mentioned archive), then we would be happy to check the copyright and make it available in the project. [221000390080] |Note that we are currently not scanning any works ourselves! [221000390090] |Just checking the copyright for those works that are already available in a scanned version is keeping us busy already. [221000390100] |Please send any suggestions of titles (with links to the digital version) to: ldh (at) eva (dot) mpg (dot) de. [221000390110] |Best, Michael Cysouw [221000400010] |New annotation tool: DiscoverText [221000400020] |Stuart Shulman recently gave a workshop at UW on a new annotation tool he is developing: DiscoverText. [221000400030] |It has some limitations that make it unsuitable for some linguistic purposes, but it is powerful in other ways and might be a good choice for certain types of tasks. [221000400040] |Highlights: [221000400050] |
  • Robust support for collaborative annotation, including non-destructive adjudication, tag merging, etc.
  • [221000400060] |Annotation can be crowdsourced, and the results filtered based on various annotator credentials.
  • [221000400070] |Built-in function to scrape publicly available data from Facebook, Twitter, and RSS.
  • [221000400080] |A nice subset-via-search function called "bucketing" that will extract tokens and their surrounding context (of customizable size) and collect them into a subset of your corpus.
  • [221000400090] |Alleged to be Unicode-compliant, though this wasn't explicitly demonstrated in the workshop.
[221000400100] |Lowlights: [221000400110] |
  • The codable unit size is fixed (determined by delimiters when the data is imported), so as far as I can tell, you can't annotate at the segment level, word level, and phrase level all within the same dataset.
  • [221000400120] |DiscoverText is hosted software that appears to have a "freemium" usage model (some functionality free, some requiring a paid account). [221000400130] |It is not yet clear where the line will be drawn once the product goes out of beta.
[221000400140] |Stuart emphasized that the platform is still in development and he is eager for suggestions and feature requests, so if this tool looks valuable for your research I would encourage you to contact him. [221000410010] |New journal: Open Research Computation [221000410020] |A new journal, Open Research Computation, has been launched and, while it isn't geared towards linguistics specifically, it looks very interesting from a general cyberinfrastructure perspective. [221000410030] |Here are its Aims and Scope: [221000410040] |Open Research Computation publishes peer reviewed articles that describe the development, capacities, and uses of software designed for use by researchers in any field. [221000410050] |Submissions relating to software for use in any area of research are welcome as are articles dealing with algorithms, useful code snippets, as well as large applications or web services, and libraries. [221000410060] |Open Research Computation differs from other journals with a software focus in its requirement for the software source code to be made available under an Open Source Initiative compliant license, and in its assessment of the quality of documentation and testing of the software. [221000410070] |In addition to articles describing software Open Research Computation also welcomes submissions that review or describe developments relating to software based tools for research. [221000410080] |These include, but are not limited to, reviews or proposals for standards, discussion of best practice in research software development, educational and support resources and tools for researchers that develop or use software based tools. [221000410090] |Further discussion can be found here: http://cameronneylon.net/blog/open-research-computation-an-ordinary-jour.... [221000420010] |CL Review of Interest [221000420020] |The current issue of Computational Linguistics includes a review (by Eric J. M. 
Smith) of Vladimir Pericliev's book Machine-Aided Linguistic Discovery: An Introduction and Some Examples. [221000420030] |The review gives a quick overview of the problems that Pericliev approaches and the techniques he applies. [221000430010] |Open Data and corpora for (computational) linguistic research [221000430020] |I recommend this guest post by Nancy Ide over on the Open Knowledge Foundation Blog. [221000430030] |Ide gives a brief history of the ANC, and describes issues pertaining to Creative Commons licensing and copyright that arise when textual data are repurposed for linguistic and computational linguistic research. [221000440010] |Beyond the PDF? [221000440020] |While looking for something on this blog http://cameronneylon.net/category/blog/ (which I recommend in general), I stumbled on the fact that an interesting workshop recently took place entitled Beyond the PDF. [221000440030] |The workshop goal is described as follows: [221000440040] |The goal of the workshop was not to produce a white paper! [221000440050] |Rather it was to identify a set of requirements, and a group of willing participants to develop a mandate, open source code and a set of deliverables to be used by scholars to accelerate data and knowledge sharing and discovery. [221000440060] |Our starting point, and the only prerequisite to participating, was the belief that we need to move Beyond the PDF (meant to capture a common philosophy, not necessarily to be taken literally). [221000440070] |In a heady moment we might also describe our efforts as the desire to contribute to the development of a free and open digital printing press for the 21st century. [221000440080] |A platform, when utilized, moves us beyond a static and disparate data and knowledge representation to a rich integrated content which grows and changes the more we learn. [221000440090] |A system (content plus platform) with which a scholar can interact and which, once evaluated, shows improved understanding and interest. 
[221000440100] |The only name I recognized from the linguistics world among the participants was Eduard Hovy. [221000440110] |(I only looked at the list quickly. [221000440120] |Sorry if I missed anyone.) [221000440130] |In addition to the workshop's general goal of helping build cyberinfrastructure, which is of clear relevance to the cyberlinguists out there, it reminded me of a problem that I've long been aware of but don't have a good solution for: it's clear that lots of other people out there share many of our needs for cyberinfrastructure, but we're not very good at connecting with, say, the biologist who may encounter data management issues with a structure similar to those of the descriptive linguist, often because of the stark differences in the content of the data. [221000440140] |This is a hard problem to solve, since it can involve "lateral" connections across fields to connect people who would otherwise never know about each other, rather than the more usual "big idea" sort of interdisciplinary dialogue where famous (and often divisive) scholars face off against each other. [221000440150] |My impression is that people who like to work directly with the data generally are not interested in the limelight. [221000440160] |This is probably good for their productivity, but it makes it hard for them to find each other. [221000450010] |2011 LSA Orthography Symposium [221000450020] |The 85th Annual Meeting of the LSA (Pittsburgh, 2011) included a symposium on creating orthographies for unwritten languages. [221000450030] |Creating an orthography has important implications for both speaker community access and long-term preservation of and access to linguistic data. 
[221000450040] |The organizers of the symposium have made the materials (abstracts, handouts, and slides) available here: http://www.sil.org/linguistics/2011LSASymposium/ [221000460010] |NSF SBE 2020 White Papers (Updated) [221000460020] |The NSF has now made the SBE 2020 (Future Research in Social, Behavioral & Economic Sciences) white papers available. [221000460030] |There are at least eleven (a count updated twice, from three, then seven) -- in addition to the one that Jeff Good and I submitted, noted below -- that are relevant to cyberling: [221000460040] |
  • Anthony Aristar: Endangered Languages and Linguistic Infrastructure
  • [221000460050] |John T. Hale: Linguistic Theory as an Integral Part of SBE's Vision for the Language Sciences
  • [221000460060] |Matthew Wagers: Widening the Net: Challenges for Gathering Linguistic Data in the Digital Age
  • [221000460070] |(update) Gregory R. Crane: Analyzing human systems across time, space, language, and culture
  • [221000460080] |(update 2) Natasha Warner et al.: SBE Grand Challenge: Understanding the complexity and variability of spoken and signed languages
  • [221000460090] |(update 2) Lyle Campbell: Documentation and Analysis of Endangered Languages, Cultures, and Knowledge Systems
  • [221000460100] |(update 2) Clifton Pye: A Distributed Architecture for the Documentation of Language and Culture
  • [221000460110] |(update 3) John A. Goldsmith et al.: Defining and Redefining NSF Funding for Linguistics
  • [221000460120] |(update 3) Rakesh Bhatt et al.: Migration, Multilingualism, and Minorities: New Challenges for the Linguistic Sciences
  • [221000460130] |(update 3) Chilin Shih et al.: Speech Variation, Graded Competency, and Human Communication
  • [221000460140] |(update 3) Jürgen Bohnemeyer: Semantic typology as an approach to mapping the nature-nurture divide in cognition
[221000460150] |I may well (still) have missed some --- if I have, please leave a comment and I will update this story. [221000470010] |Reproducibility in computational science [221000470020] |The AAAS meeting going on right now includes a symposium on The Digitization of Science: Reproducibility and Interdisciplinary Knowledge Transfer. [221000470030] |Mark Liberman is speaking at it, and has posted the symposium abstract as well as some remarks at Language Log. [221000470040] |It is interesting to me to think that digitization has the potential to lead to less reproducibility, but once the point is raised, it's easy to see how that could come about. [221000470050] |Clearly, as we talk about making data available along with research reports (papers, etc.), we should include with the data the scripts, software, and "recipes" (how to use the software) needed to reproduce the results. [221000470060] |In the comments on today's Language Log post, Mark also provides a link to a previous LL post about a 2008 event in Berlin on a similar topic: Open Data and Reproducible Research: Blurring the Boundaries between Research and Publication. That symposium page includes links to presentations. [221000480010] |OKCon 2011 [221000480020] |From the call for participation: [221000480030] | The 6th Annual Open Knowledge Conference (OKCon) will take place on 30th June – 1st July 2011 in Berlin. [221000480040] |OKCon is a wide-ranging conference that brings together individuals and organizations from across the open knowledge spectrum for two days of presentations, workshops and exchange of ideas. [221000480050] | Open knowledge promises significant social and economic benefits in a wide range of areas from governance to science, culture to technology. [221000480060] |Opening up access to content and data can radically increase access and reuse, bridge gaps, improve transparency and thus foster innovation and increase societal welfare. 
[221000490010] |Open Linguistics [221000490020] |The Open Knowledge Foundation (cf. the next post on OKCon) has a working group on Linguistic data, known as Open Linguistics. [221000490030] |That website has links to various resources, including linguistics-related posts on the Open Knowledge Foundation blog. [221000500010] |LRTS Sharing Workshop at IJCNLP 2011 [221000500020] |FLaReNet, Language Grid and META-SHARE are co-hosting the Workshop on Language Resources, Technology and Services in the Sharing Paradigm at IJCNLP 2011. [221000500030] |From the call for papers: [221000500040] |The Workshop aims at addressing (some of the) technological, market and policy challenges posed by the “sharing and openness paradigm”, the major role that language resources can play and the consequences of this paradigm on language resources themselves. [221000510010] |Language Documentation Meets Corpus Linguistics [221000510020] |I just saw an announcement for this conference: [221000510030] |Language Documentation Meets Corpus Linguistics: How to Exploit DOBES Corpora for Descriptive Linguistics and Language Typology? [221000510040] |From the workshop description: [221000510050] |The major goal of this workshop is to bring together documentary linguists and corpus linguists in order to explore and discover ways how DOBES corpora—and there are more than 50 digital corpora in the archive of the Max Planck Institute for Psycholinguistics in Nijmegen by now—can be automatically or semi-automatically exploited for descriptive linguistics and language typology. [221000510060] |This looks like a good example of how the development of a better cyberinfrastructure for language documentation in the last decade or so is now allowing us to actually conduct new kinds of research.