# Toward TEI Lex-0 Publisher: A workshop

- **When:** December 16-17th, 2019
- **Where:** DARIAH Coordination Office, Germaine-Tillion-Saal (7th Floor), Friedrichstr. 191, Berlin
- **Instructors:** Magdalena Turska and Wolfgang Meier, eXist Solutions
- **Sponsor:** Belgrade Center for Digital Humanities
- **Local Organizer:** DARIAH WG "Lexical Resources"
- **GitHub repo:** https://github.com/BCDH/tei-lex-0-publisher-workshop

## Goal

The goal of the two-day workshop/hackathon is to:

- introduce members of the DARIAH WG "Lexical Resources" and other interested parties to **TEI Publisher** (https://teipublisher.com/index.html), a highly customizable, open-source publication toolbox based on the TEI Processing Model;
- kickstart the development of the **TEI Lex-0 Publisher**, a generic publication framework for dictionaries and other lexical data; and
- build a pool of knowledge as a starting point for creating good documentation and training materials on TEI Lex-0 Publisher for **DARIAH-Campus**

## Participants

- Alix Chagué
- Axel Herold
- Mohamed Khemakhem
- Maxim Kupreyev
- Boris Lehečka
- Simona Olivieri
- Laurent Romary
- Clarissa Stincone
- Toma Tasovac

## Schedule

**Monday, December 16th (10:00 - 18:00)**

*Lunch will be served at 13:00.*

- Introduction to the TEI Processing Model and the architecture of TEI Publisher
- The specificity of lexical data: formalizing and prioritizing our user needs and translating them into feature requests
- Hands-on work

**Tuesday, December 17th (09:00 - 16:00)**

*Lunch will be served at 12:30.*

- Hands-on work

## Background

The DARIAH WG “Lexical Resources” is the *spiritus movens* behind the community-based initiative to develop TEI Lex-0, a stricter subset of TEI, to be used specifically for encoding dictionaries, pooling lexical data together and performing lexical research across national and linguistic boundaries.
During our Lexical Data Masterclasses, we established that our community very much needs a generic TEI Lex-0 publication framework:

1. **in the educational context**, when we teach the principles of TEI Lex-0 and best practices in encoding lexical data, we need an easy-to-use publication platform to show immediately the affordances of well-structured lexical data; and
2. **in the context of scholarly editing projects**, we need, in the long term, a solution which will make it significantly easier for individual scholars and/or smaller, underresourced institutions to publish high-quality editions of historical dictionaries and other types of lexical data, with functionalities which will include, among others, basic and advanced search, facetted browsing and geobrowsing.

*The more preparatory work on formalizing and prioritizing feature requests we do before the workshop, the better.* See below.

## Features wishlist

This is a place for brainstorming. We can't achieve everything in two days, but it's important that we dream big so that we can get as much input from the instructors on how to achieve what we want. We will have to prioritize for the workshop.

- multiple TEI Lex-0 dictionaries per installation
- grouping dictionaries by categories (if taxonomy is used), by headwords language (and something else?)
- searching through all, or selected dictionaries
- **paginated browsing/reading interface** per individual dictionary
    - (we can require that each letter is wrapped in a separate div)
    - ideally both as a list (```//entry/form[@type="lemma"][1]```) and as full entries, similar to this:
      ![](https://i.imgur.com/i9YYcQi.png)
      ![](https://i.imgur.com/th3S8p9.png)
    - configurable number of entries per page of the browsing interface
        - number of lemmas and number of dictionary entries should be configurable separately
    - all lemmas from the entry can/should appear in the list too (```//entry/form[@type="lemma"]``` or ```//entry/form[@type="lemma" and not(@extent)]```)
- not only entries (within *body*) but also passages from *front*, *back* and *teiHeader* (license, publisher) should be available
- list of cited sources (with bibliographic data)
    - link between the abbreviation of a source used within an entry and its full bibliographic record (show more data about the source in a tooltip or somewhere else on the page, or expand the abbreviation)
- **general entries**
    - group entries (from different dictionaries) with the same lemma into a group represented by this lemma, something à la
      ![](https://vokabular.ujc.cas.cz/obrazky/general-entries.png)
- **search**
    - searching with wildcards and/or regular expressions (or does this apply only to full-text search?)
    - lemma search with autocomplete dropdown, ideally filterable by one or more dictionaries, à la
      ![](https://i.imgur.com/BHUqZ2c.png)
      ![](https://i.imgur.com/z8xiq3P.png)
    - full-text search
        - (what is meant by full-text? searching across all words/tokens in the entry, or using lemmatization, e.g. *use* finds word forms like *use*, *used*, *using*? in the second case, lemmatization should be aware of the language in the entry and in the query)
        - across full entries
        - across definitions
        - across examples
        - across translation equivalents (for bilingual dictionaries)
    - advanced search
        - combination of multiple search conditions (chaining two or more conditions with negation, AND and OR statements)
    - facetting
        - based on ```<usg>``` types and values
        - based on etym languages
        - based on ```pos```, ```num```..., and ```gram``` types and values
        - based on cited sources
        - based on bibl references
        - based on other types of XPath constructs
    - geomap
        - for entries with geolocated entries (must provide a TEI Lex-0 recipe for this), something along the lines of
          ![Screenshot 2019-11-29 09.47.42](https://i.imgur.com/u4mOuYy.jpg)
          ![Screenshot 2017-11-14 09.06.35](https://i.imgur.com/pPQZPEN.png)
    - timeline
        - for entries with dated examples/quotations
- **how to cite**: a simple way to show users how to cite
    - individual entry
    - whole dictionary
    - whole site (with multiple dictionaries)
- **(definitely not for this workshop, but worth discussing)**: creation of reversed bilingual dictionaries
    - if you start with, say, an English-German dictionary, it would be good if we could create a reversed index of German words
    - although this could be very tricky since TEI Publisher doesn't do any XSLT... hm... something to think about
- **entry parts visibility**
    - user can select which parts of the entry should be visible or hidden (for example, definitions only without examples)

# Notes from the workshop

- high variance in encoded material
- TEI only covers text encoding, but does not address the broader context
- single processing across diverse outputs
- TEI Simple
- the editor should have a leading role in the development of the application; relief for the editor
- projects are unique; technical requirements repeat (get slide)
- components behind a web page (get slide)
- sustainability
    - data sources are standardized and reusable
    - but the software used for their publication is not
- TEI Processing Model
    - express the rules of the intended processing in ODD, the language of TEI itself
    - simple, standardized syntax
    - following the principle of literate programming: complete documentation in TEI

## What is the TEI Processing Model?

- media-agnostic description of output transformations
- maps TEI elements to abstract behaviors
- expresses the intended processing for a document within the TEI vocabulary itself
- just 3 new elements and 24 predefined behaviors
- futureproofing our data and our presentation?
- TEI Source + ODD: enough to reconstruct an edition
- chaining ODDs
    - we'll be working on creating a default TEI Lex-0 ODD, which can be used as a basis for creating local customized ODDs for actual projects
- page templates
- web components to build a web page
    - web components are to a large extent natively supported by web browsers
    - a web component encapsulates all the styling, look and feel; no style pollution

## eXist

- accepts any XML, of any size or complexity
- directly locate any node in a huge collection
- avoid loading documents into memory
- evaluate XPath/XQuery via indexes, not tree traversals

## Create new ODD

- ![](https://i.imgur.com/0moMelw.png)
- ![](https://i.imgur.com/7Nik3cI.png)
- ![](https://i.imgur.com/dOcJuC0.png)
- associate the uploaded dictionary with the new odd
- edit odd
    - create element "entry"
      ![](https://i.imgur.com/7waAjbw.png)
    - define behavior as block
      ![](https://i.imgur.com/IS8hvB2.png)
    - make lemma bold
      ![](https://i.imgur.com/naY6Mcf.png)
    - sense @n
        - content is the default parameter, this is where processing continues
        - we can do it all as one sequence
          ![](https://i.imgur.com/ZIUSXd9.png)
        - or we can split it into two different models
            - models are processed in the order in which they are encountered in the ODD
            - model sequence
                - first model deals with @n
                - second model is simply inline: it processes the rest of the content
                  ![](https://i.imgur.com/hSLZIYc.png)
                  ![](https://i.imgur.com/NTEiKm5.png)
- you can hard-code a processing instruction to use an odd
  ![1](https://i.imgur.com/X3P1QPR.jpg)
- create model for persName
    - two parameters
        - default: .
        - alternate: @ref
    - external parameter bit: http://0.0.0.0:8081/exist/apps/tei-publisher/doc/documentation.xml?odd=docbook.odd&root=2.9.8.8
    - ![](https://i.imgur.com/JTGpLxB.png)
    - we actually had to get rid of ```parameters?root``` and use ```root(.)``` instead
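
The models built in the ODD editor above are only visible in the screenshots. As a rough sketch, and not the workshop's actual ODD, the corresponding source might look like the fragment below. It uses the TEI Processing Model elements ```<model>```, ```<modelSequence>``` and ```<param>```. The ```mode``` values, the choice of ```form[@type='lemma']``` as the element that gets the bold rendition, and the CSS in ```<outputRendition>``` are assumptions made for illustration.

```xml
<!-- Fragment of an ODD <schemaSpec>; all elements are in the TEI namespace.
     Sketch only: mode values, predicates and CSS are assumptions. -->
<elementSpec ident="entry" mode="change">
    <!-- render each dictionary entry as a block -->
    <model behaviour="block"/>
</elementSpec>

<elementSpec ident="form" mode="change">
    <!-- assumption: the lemma is encoded as form[@type='lemma'] -->
    <model predicate="@type='lemma'" behaviour="inline">
        <outputRendition>font-weight: bold;</outputRendition>
    </model>
</elementSpec>

<elementSpec ident="sense" mode="change">
    <modelSequence>
        <!-- first model: output the sense number taken from @n -->
        <model behaviour="inline">
            <param name="content" value="@n"/>
        </model>
        <!-- second model: plain inline; "content" defaults to the current
             node, so processing continues with the rest of the sense -->
        <model behaviour="inline"/>
    </modelSequence>
</elementSpec>

<elementSpec ident="persName" mode="change">
    <!-- alternate takes two parameters: the default and the alternate content -->
    <model behaviour="alternate">
        <param name="default" value="."/>
        <param name="alternate" value="@ref"/>
    </model>
</elementSpec>
```

Keeping the sense number and the sense content as two models inside one ```<modelSequence>``` mirrors the "split it into two different models" variant from the notes; the single-model variant would fold both into one ```content``` expression.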
## App Generator

- Admin > generate app
- creating an account for the app (lex/lex)
- once generated, you can download the generated app (Admin > Download xar), which you can then install in any eXist-db
- the exported app .xar can be unzipped (unzip xxx.xar) and then pushed to git

## HTML templates

- default HTML template is lex/templates/pages/view.html (in eXide)
- the center of the page is inside ```<main></main>```
- inside main we have ```<pb-navigation direction="backwards"></pb-navigation>``` and ```<pb-navigation direction="forwards"></pb-navigation>``` and ```<pb-view></pb-view>```
- Wolfgang saved view.html as lex.html so that we can work on it
- ```lex/modules/config.xqm``` is the main configuration file: for instance the default-language, default-view
- change default-template to ```lex.html```
- Wolfgang removed the table of contents and the breadcrumbs from lex.html
- removed toc-toggle from toolbar.html
- all components send messages via channels
- ```<pb-view></pb-view>``` loads something from the server and displays it, but it is bound to the page-by-page view; we need to change this by writing some XQuery and then rewiring our pb-view to use our XQuery
- create a new xquery
    - ```collection("/db/apps/lex")//tei:entry``` would get all entries from all dictionaries
    - ```doc("/db/apps/lex/data/kluge-lutz-1898.tei.xml")//tei:entry``` would get all entries from a specific dictionary
    - we need to parametrize this so that we can select which particular dictionary we want to display

    ```
    declare namespace tei = "http://www.tei-c.org/ns/1.0";

    <div>
    {
        (: file name of the dictionary to display, passed as ?doc=... :)
        let $doc := request:get-parameter("doc", ())
        return
            doc("/db/apps/lex/data/" || $doc)//tei:entry
    }
    </div>
    ```
- save the xquery file in ```lex/modules/lex.xql```
- ```/db/apps/lex/modules/components.xql```
- stuff in ```/modules/lib``` comes from TEI Publisher and will be overwritten by a new version, so we don't want to change anything inside ```lib```
- then we're copying a bunch of stuff from components.xql to lex.xql, such as imports (TODO what exactly: import config etc.)
- then, we have to wire lex.xql into tei-publisher
    - we don't want to use ```<pb-view></pb-view>``` anymore but something simpler
    - ```<pb-load src="document1" url="modules/lex.xql" auto="auto"></pb-load>```
- eXide works directly on the db, but we need to sync back to the file system so that we can push it to git
    - in eXide: Application > Synchronise, then give it a path to where we have things on the filesystem and click on synchronise
    - added some stuff to .gitignore
    - pushed it
- in our workshop repo there is now tei-lex
    - it has a build.xml
    - we need to run ant
    - it will generate the .xar file
- for production use, it's recommended to have separate .xars for data and for the app
- current setup
  ![1](https://i.imgur.com/WNABpPJ.jpg)
- there is an effort underway to do more refactoring
  ![2](https://i.imgur.com/wkJF24T.jpg)
- remember we can work in atom or oxygen on the local file system and then sync up to the db

## Pagination

- limit to 10 per page
- we need a start offset, a range; and we need to limit the list, i.e. return a subsequence
- we then want to add ```<pb-paginate></pb-paginate>``` to lex.html
- in lex.xql we need to set HTTP headers so that lex.html can get that info
- we added an id parameter and the logic for it, so that we can also display only one individual entry by xml:id
- there is a bug with passing the lemma id, Wolfgang will fix it
- now adding a query parameter so that we can search

## Index

- in ```collection.xconf``` we added an indexing rule on ```tei:orth``` and stripped out some stuff we don't need
- we can search for acc*
- to wire up the search: ```<pb-search></pb-search>```
- we'll need to fix the search autocomplete to be based only on the lemma; and we want to have a separate full-text search
- then we changed the index to be on entry and created a field named lemma; this is more precise

## Autocomplete

- built-in, by default does full-text, but we want to change that
- ```tei-query.xql``` has ```function teis:autocomplete```; we add a case "lemma" and then create a hidden input
- we need to cache the query so that the pagination will work without having to repeat the query
- we also added ancestor-or-self logic to make sure that nested entries always get shown in the context of the ancestor entry

## Short and long view

## Facetting

- defined in collection.xconf, config.xqm
- we defined a facet etymology
- added highlighting

## Multidoc search

- adding $scope instead of $doc

## cross-references

- in odd, we had to add ref...
- don't work at the moment
- need to be inline
- need to emit something... pb-load does not react...
- web components
- material design
- tachyons
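
The index set-up noted under Index and Facetting (an index on the entry, a field named lemma, an etymology facet) is not recorded in the notes. A minimal sketch of what such a ```collection.xconf``` could look like follows, using the Lucene ```<field>``` and ```<facet>``` configuration of eXist-db 5. The two XPath expressions and the facet dimension name are assumptions about how the dictionaries are encoded, not the configuration written at the workshop.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <lucene>
            <!-- index whole entries, so that hits are returned per entry -->
            <text qname="tei:entry">
                <!-- "lemma" field for lemma-only search and autocomplete;
                     the expression is an assumption about the encoding -->
                <field name="lemma" expression="tei:form[@type='lemma']/tei:orth"/>
                <!-- "etymology" facet; dimension name and expression are hypothetical -->
                <facet dimension="etymology" expression=".//tei:etym//tei:lang"/>
            </text>
        </lucene>
    </index>
</collection>
```

With a configuration like this, a wildcard lemma search such as acc* should be expressible as a field query, for example ```//tei:entry[ft:query(., "lemma:acc*")]```, while a plain ```ft:query(., "acc*")``` would search the full text of the entry.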