## Imagining the Future of Dictionaries:<br/> Tracing the Early History of eLexicography

*Encyclopedistics 2020<br/> 15 October 2020*

Toma Tasovac

<img src="https://i.imgur.com/yusgTpG.png" width="40%" style="border:none; background:none;">

Note:

---

### DH and eLexicography: common roots

<!-- .slide: data-background-image="https://i.imgur.com/ZrEbyWs.png" data-background-size="90%" -->

---

<!-- .slide: data-background-image="https://i.imgur.com/uuNztb8.png" data-background-size="cover" -->

---

<!-- .slide: data-background-image="https://i.imgur.com/j0i133z.png" data-background-size="cover" -->

---

### The sixties: a brave new world <!-- .element: style="color:black" -->

<!-- .slide: data-background-image="https://i.imgur.com/VJo8NhW.jpg" data-background-size="contain" data-background-repeat="no-repeat" -->

---

#### Random House Dictionary of the English Language

- 260,000 entries <!-- .element: class="fragment" -->
- using computers for "sorting, codifying, rearranging, and checking the data at hand and the text to be written" (Urdang 1966, 31) <!-- .element: class="fragment" -->

---

#### Dictionary Data

- illustrations
- main entry words (including pronunciations, inflected forms)
- definitions
- variations
- etymologies
- run-on entries
- additional information

---

#### Converting paper dictionaries

- Olney et al.
(1968)
- Webster’s Seventh New Collegiate Dictionary (W7)
- The New Merriam-Webster Pocket Dictionary (MPD)
- areas of use: taxonomy extraction, text analysis, speech processing, syntactic and semantic parsing, detection of circular definitions

---

#### Lexicographic data format

- WEBMARC, based on MARC (Machine-Readable Cataloging), the format used for bibliographic data
- "One of the advantages of computer files over printed texts is their capacity to absorb additional information as a result of being compared and merged with other data files" (Sherman 1974: 25)

---

#### Dictionaries belong together

- digitizing existing dictionaries
- “great linguistic value”
- “historical data can best be analyzed and compared in computer files” (Sherman 1974: 25)

Note:
Sherman clearly understood the potential of digitizing other dictionaries: he explicitly mentions Daniel Jones’ English Pronouncing Dictionary, which, if converted to WEBMARC, would provide a basis for a systematic lexical comparison of British and American pronunciations. But he also stresses the need for digitizing older dictionaries of the English language because of their “great linguistic value” and because “historical data can best be analyzed and compared in computer files” (25).

---

#### The sixties were very busy

- Trésor de la Langue Française
- Lexical Archive of the Italian Language
- Dictionary of the Older Scottish Tongue
- Dictionary of American Regional English
- Hebrew Historical Dictionary
- early Russian work on “automatization” and “statistics” in lexicology and lexicography (see Фрумкина 1966; Москович 1966)
- a plan for the “mechanical” processing of an etymological dictionary of Hungarian (see Папп 1968)

---

#### The seventies: imagining the future

<!-- .slide: data-background-image="https://i.imgur.com/DTrXmQ2.jpg" data-background-size="50%" data-background-repeat="repeat" -->

---

##### Technology as a double-edged sword

---

#### "How to Make a N.U.D.E"

- N.U.D.E = New Utopian Dictionary of the English Language (Revard 1973)
- utopia: unlimited resources
- computer-accessible
- semantic relations between senses
- citation slips + previous dictionaries (mutually aligned)
- lexical network
- there are no copyright problems in Utopia

---

#### Beyond cost-saving

- speed
- structure
- “the outcome of our research will be conditioned by the devices we use in its execution” (Bailey 1973: 293; echoing Richards 1955)
- machine-readable lexicographic data “will be increasingly formalized” in the future (Lehmann 1973: 312)

---

#### Why formalisms matter

> the effort to provide completely regular and reasonably formal definitions in which usage at any point is consistent with usage at any other, is bound to bring to light anomalies, irregularities, inherent in the language itself, and these are likely to be precisely the sort useful to linguists who want to find the holes as well as the nodes of the lexical network. (Revard 1973: 91-2) <!-- .element style="font-size: 80%" -->

---

#### Facts vs. life?
> Linguistic facts, in our historical dictionaries, are left somehow apart from the life of the community; we have yet to capture in our representations of meaning the continuum from precision to imprecision inherent in our citations and certainly pervasive in the collective whole. (Bailey 1973: 294) <!-- .element style="font-size: 80%" -->

---

#### Role of human intervention?

> [We cannot] reasonably expect the computer to execute the direction that Murray issued to the volunteer readers for his dictionary: ‘Make a quotation for every word that strikes you [Oh machina] as rare, obsolete, old-fashioned, new, peculiar, or used in a peculiar way.’ (Bailey 1973: 294) <!-- .element style="font-size: 80%" -->

---

#### Oh, that feeling...

> I’m sure that these machines, no matter what champion speed-readers they may be, will not ever be able to supplant the human reader with his sensitivity and Sprachgefühl. (Chapman 1973: 309) <!-- .element style="font-size: 80%" -->

---

#### Apogee of development

> the evolution of computers into the mundane tasks of lexicography is at or near the saturation point, that is, we’ve gone about as far as we can go in developing new applications under the current rules of the game, and any further qualitative advantages to be offered by automation will require a revolution in both semantic analysis techniques and in the interaction between lexicographer and computers.
(Venezky 1973: 287) <!-- .element style="font-size: 80%" -->

---

### The missing infrastructure

- proposal for a "Central Archive for Lexicography in English" (Barnhart 1973) <!-- .element: class="fragment" -->
- 25-30 million quotations for 500,000 lexical items <!-- .element: class="fragment" -->
- not a corpus <!-- .element: class="fragment" -->

---

> The archive will have to assemble phonological, paradigmatic, syntactic and semantic information for lexical items, and list the syntactic and semantic properties of constituents required in the syntactic environments of these items. It will have to establish the implication relations between such properties. It will also have to establish the logical relations between the meanings of linguistic items. (Lehmann 1973: 317) <!-- .element style="font-size: 80%" -->

---

### Brotherhood and unity?

- “Socialized lexicography, in short, is now upon us” (Bailey 1973)
- “sustained and continuous growth through addition (or deletion) of information” (Bailey 1973: 296)
- scholars from different institutions working together

---

### Raw data matters & should be shared

> For a relatively low cost, the citation file itself can be published on micro-fiche, making the evidence - if not the interpretation of the evidence - available to scholars before a generation or two has passed (Bailey 1969: 171-2)

---

#### The eighties: toward standardization of text representation

<!-- .slide: data-background-image="https://i.imgur.com/UBBo2L3.jpg" data-background-size="60%" data-background-repeat="repeat" -->

---

### NLP lexicons

- Longman Dictionary of Contemporary English (LDOCE)
- publishers agreed to share their computer files with the researchers
- dictionary of English as a foreign language was seen as particularly useful

---

### Why LDOCE?

- grammatical codes incl. valency
- semantic codes
- definitional vocabulary restricted to 2000 items

---

### MRDs for Humans?
- "on-line" environments provide qualitatively different modes of using dictionaries
- maximizing the use of information

---

### Queries for everybody?

> “the user wishes to see all entries for three-syllable nouns which describe movable solid objects, whose second syllable has a schwa as peak, and whose third syllable has a coda that is a voiced stop” (Boguraev et al.: 67)

---

### Representing data structures

> if a standard data structure were to be used as a common defining medium for different record formats and data bases, then this common structure could serve as a unifying basis for exchanging and integrating related materials. A common data structure would at least standardize our vocabulary for describing and documenting data base format and content. (Sherman 1974: 21) <!-- .element style="font-size:80%"-->

---

### Getting there

- Generalized Markup Language (GML) (Goldfarb et al. 1970)
- Rigorous descriptive markup (see Goldfarb 1981: 69)
- GML became a basis for SGML (ISO 8879, 1986)
- which eventually led to XML, a W3C standard markup language

---

### So, what does this all mean?

- a history of electronic lexicography is yet to be written
- we are still in the early years of eLexicography
- socialized lexicography
- copyright
- lexicographic complexity and ambiguity

---

Thank you for bearing with me!
