TEI encoding and annotation

Documents published in the EHRI digital editions are encoded in the Text Encoding Initiative (TEI) P5 standard. While TEI is multi-layered and can be very complex, it is widely adopted and considered a standard format for digital editions of texts of all kinds.

The TEI customisation can differ depending on the characteristics and needs of a particular edition. While this allows for flexibility, all EHRI editions rely on references to names, dates, places and people (TEI module namesdates), as described in the TEI documentation.

TEI source information

In EHRI editions, the bibliographic information about the source is encoded in two forms in the TEI Header (/TEI/teiHeader/fileDesc/sourceDesc/):

  • Display-ready source information in the <bibl> element, which can also include additional information such as the original language. This element can capture citations of non-archival sources, such as newspaper articles, and should always be included.

  • For archival sources, in addition to <bibl>, a structured description in the <msDesc> element, which can contain <country>, <repository> and <collection>, among other structured information, for example:

<msDesc>
    <msIdentifier>
        <country ref="https://portal.ehri-project.eu/countries/cz">
            Tschechische Republik</country>
        <repository ref="https://portal.ehri-project.eu/institutions/cz-002286">
            Nationalarchiv Prag</repository>
        <collection ref="https://portal.ehri-project.eu/units/cz-002286-1075">
            Innenministerium (225)</collection>
        <idno>1936-1940, Sign. X/R/3/2, K. 1186-16, Nr. 11651</idno>
    </msIdentifier>
    <physDesc/>
</msDesc>
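
The structured description can also be read back programmatically. The following is a minimal sketch, assuming a file in the standard TEI namespace and using only Python's standard library. The function name `source_identifiers` and the trimmed sample header are illustrative and not part of any EHRI tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Illustrative sample: a trimmed TEI header wrapping the <msDesc> example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc>
    <msDesc>
      <msIdentifier>
        <country ref="https://portal.ehri-project.eu/countries/cz">
            Tschechische Republik</country>
        <repository ref="https://portal.ehri-project.eu/institutions/cz-002286">
            Nationalarchiv Prag</repository>
        <collection ref="https://portal.ehri-project.eu/units/cz-002286-1075">
            Innenministerium (225)</collection>
        <idno>1936-1940, Sign. X/R/3/2, K. 1186-16, Nr. 11651</idno>
      </msIdentifier>
    </msDesc>
  </sourceDesc></fileDesc></teiHeader>
</TEI>"""


def source_identifiers(tei_xml: str) -> dict:
    """Return {element name: (label, ref)} for the children of <msIdentifier>."""
    root = ET.fromstring(tei_xml)
    path = "tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:msDesc/tei:msIdentifier"
    ident = root.find(path, TEI_NS)
    result = {}
    if ident is None:
        return result
    for child in ident:
        name = child.tag.split("}", 1)[-1]            # drop the namespace part
        label = " ".join((child.text or "").split())  # normalise whitespace
        result[name] = (label, child.get("ref"))      # ref is None if absent
    return result


identifiers = source_identifiers(SAMPLE)
```

Each entry pairs the display label with the linked portal URL, so a missing `ref` (as on `<idno>`) shows up as `None`.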

Referencing vocabularies, main elements

  1. References to keywords, places, organisations and people

  • <term>: for keywords, with the attribute type “subject”. Use links to EHRI Terms:

<term type="subject"
      ref="https://portal.ehri-project.eu/keywords/ehri_terms-1141">passport</term>

  • <placeName> element for places. For camps and ghettos, links to EHRI camps or EHRI ghettos are preferred, with the attribute type “camp”/“ghetto”:

He was deported to <placeName ref="https://portal.ehri-project.eu/keywords/ehri_camps-2" type="camp">Birkenau</placeName>

Linking to Geonames records is recommended for places that are not included in the EHRI portal:

<placeName ref="http://www.geonames.org/2804979/zeilsheim.html">Zeisheim u Frankfurtu</placeName>

  • <orgName> element for organisations:

<orgName ref="https://portal.ehri-project.eu/authorities/ehri_cb-347">JOINT</orgName>

  • <persName> element for people, linked to a record in the EHRI portal or in an external database:

<persName ref="https://portal.ehri-project.eu/authorities/ehri_pers-000272">Mengele</persName>

<persName ref="http://yvng.yadvashem.org/nameDetails.html?itemId=4763965">Felixem Stiastny-m</persName>

Also mark people (or other types of entities) that have no corresponding record in the EHRI portal or in other commonly used repositories, for example:

SS officer <persName>Nowak</persName>

Attribute @ref: use the URL of the linked record as a unique identifier. When copying URLs from the EHRI Portal, leave out the language fragment at the end of the URL (for instance “#desc-eng”).
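
Removing a trailing language marker such as “#desc-eng” can be done mechanically before a URL is pasted into @ref. This is a small sketch using Python's standard library. The helper name `clean_ref` is hypothetical, and the sample URL simply combines the keyword link and the fragment mentioned in this section.

```python
from urllib.parse import urldefrag


def clean_ref(url: str) -> str:
    """Drop the fragment (e.g. '#desc-eng') and surrounding whitespace from a URL."""
    return urldefrag(url.strip()).url


# Illustrative input: a portal URL copied together with its language fragment.
cleaned = clean_ref("https://portal.ehri-project.eu/keywords/ehri_terms-1141#desc-eng")
```

URLs without a fragment pass through unchanged, so the helper can be applied to every @ref value regardless of where it was copied from.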

  2. Text formatting

  • <hi rend="bold">Bold text</hi>

  • <hi rend="italic">Italic text</hi>

  • <hi rend="underlined">Underlined text</hi>

  3. Dates

  • <date when="1940-02-11">11th February 1940</date>

  • <date when="1940-02">early February 1940</date>
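
The @when values in the examples are given at day or month precision. The sketch below checks that a value has one of the forms YYYY, YYYY-MM or YYYY-MM-DD and names a real calendar date. It assumes an edition restricts itself to these three forms; TEI itself permits further formats (for example times), which this hypothetical helper `is_valid_when` would reject.

```python
import re
from datetime import date

# Year, optionally followed by month, optionally followed by day.
WHEN_PATTERN = re.compile(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?")


def is_valid_when(value: str) -> bool:
    """Return True for YYYY, YYYY-MM or YYYY-MM-DD values that exist in the calendar."""
    match = WHEN_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # A missing month or day defaults to 1 so the calendar check still runs.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True
```

A check like this catches values copied from the running text (such as “11.2.1940”) as well as impossible dates.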

  4. Quotations

  • We use the <q> element to mark quotations (replace the quotation marks with the element tags), for example:

[...] came to pick him up with the words <q>Another one's croaked.</q>

  5. Notes, remarks

  • We use the <note> element with the attribute type “translation”/“gloss” for remarks:

<note type="translation">special treatment</note>

<note type="gloss">The real date of the event must have been May 1942.</note>

  6. Page breaks

  • Use the <pb> element, adding the attribute “facs” when the page break should point to an image of the individual page stored outside the document, for example:

<pb n="1" facs="EHRI-ET-YV3549264_01.jpg"/>
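
Because each <pb> carries both a page number and an image reference, the page-to-image mapping can be listed directly from the markup. This is a minimal sketch using Python's standard library. The function name `page_images`, the second file name and the placeholder paragraphs in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample: only the first <pb> is taken from the example above.
SAMPLE = """<body>
  <pb n="1" facs="EHRI-ET-YV3549264_01.jpg"/>
  <p>First page.</p>
  <pb n="2" facs="EHRI-ET-YV3549264_02.jpg"/>
  <p>Second page.</p>
</body>"""


def page_images(xml_text: str) -> list:
    """Return (page number, image reference) pairs in document order."""
    root = ET.fromstring(xml_text)
    pages = []
    for element in root.iter():
        # Compare local names so the function also works on namespaced TEI.
        if element.tag.rsplit("}", 1)[-1] == "pb" and element.get("facs"):
            pages.append((element.get("n"), element.get("facs")))
    return pages


pages = page_images(SAMPLE)
```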

  7. Other languages, camp language/slang

  • Use the <foreign> element with the “xml:lang” attribute to mark words or phrases in other languages, for example:

<foreign xml:lang="de">Sonderbehandlung</foreign>

  • Similarly, use the <distinct> element for camp language or slang:

I went to the <distinct type="camp_language">Schleusse</distinct>.

  8. Typos

  • For historical editions, methodologies often recommend correcting mistakes, such as typos, that have no bearing on the context or meaning of the document (this approach can be explained in the introduction to the edition). To record an individual mistake, or where the correction carries meaning, use the <sic> element on its own or, together with the correction, in this form:

<choice>
    <sic>deprtation</sic>
    <corr>deportation</corr>
</choice>
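
Recording both readings lets a display layer decide which one to show. The following sketch flattens a fragment to plain text and keeps either the <corr> or the <sic> reading inside each <choice>. It uses Python's standard library; the function names and the sample sentence are illustrative, and the sketch assumes that a <choice> contains only <sic>/<corr> alternatives.

```python
import xml.etree.ElementTree as ET

# Illustrative sample sentence wrapped around the <choice> example.
SAMPLE = """<p>The <choice>
    <sic>deprtation</sic>
    <corr>deportation</corr>
</choice> took place in May.</p>"""


def local_name(element: ET.Element) -> str:
    """Element name without a namespace part, if any."""
    return element.tag.rsplit("}", 1)[-1]


def reading_text(element: ET.Element, prefer: str = "corr") -> str:
    """Flatten an element to text, keeping only the preferred child of each <choice>."""
    if local_name(element) == "choice":
        for child in element:
            if local_name(child) == prefer:
                return reading_text(child, prefer)
        return ""  # the preferred alternative is not present
    parts = [element.text or ""]
    for child in element:
        parts.append(reading_text(child, prefer))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


corrected = reading_text(ET.fromstring(SAMPLE))
original = reading_text(ET.fromstring(SAMPLE), prefer="sic")
```

With the default setting the result reads “The deportation took place in May.”; with `prefer="sic"` the original spelling is kept.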

TEI enhancement utility

A command-line utility written in PHP (to allow for possible integration into Omeka) was developed to support enrichment of the linked controlled vocabularies. It traverses the entities linked in the body of the TEI files and performs rule-based enrichment of the TEI headers by fetching metadata from EHRI and Geonames resources.

The utility adds normalised records to the TEI header, in conformance with the Dublin Core - TEI mapping listed above. Currently, it uses the EHRI API to process the following EHRI vocabularies: places, camps, ghettos and terms. Using the Geonames RDF service, it creates place records containing geographic coordinates and links to further resources (such as Wikipedia articles). An argument can be specified to prefer data in a specific language (if available). The utility can be extended to include other services that provide machine-readable information.
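
The traversal step described above can be illustrated with a short sketch. It is written in Python for brevity and is not the PHP utility itself: it only collects the distinct @ref URLs from the body and groups them by the service they point to, without making any network requests. The function name `collect_refs`, the grouping labels and the sample document (assembled from the examples in this section) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
ENTITY_TAGS = {"term", "placeName", "orgName", "persName"}

# Illustrative sample assembled from the markup examples in this section.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>He was deported to <placeName ref="https://portal.ehri-project.eu/keywords/ehri_camps-2" type="camp">Birkenau</placeName></p>
<p><placeName ref="http://www.geonames.org/2804979/zeilsheim.html">Zeisheim u Frankfurtu</placeName></p>
<p><orgName ref="https://portal.ehri-project.eu/authorities/ehri_cb-347">JOINT</orgName></p>
<p>SS officer <persName>Nowak</persName></p>
</body></text></TEI>"""


def collect_refs(tei_xml: str) -> dict:
    """Group the distinct @ref URLs found in the TEI body by target service."""
    root = ET.fromstring(tei_xml)
    body = root.find("tei:text/tei:body", TEI_NS)
    grouped = defaultdict(set)
    if body is None:
        return grouped
    for element in body.iter():
        name = element.tag.rsplit("}", 1)[-1]
        ref = element.get("ref")
        if name not in ENTITY_TAGS or not ref:
            continue  # unlinked entities such as <persName>Nowak</persName>
        host = urlparse(ref).netloc
        if host == "portal.ehri-project.eu":
            service = "ehri"
        elif host.endswith("geonames.org"):
            service = "geonames"
        else:
            service = "other"
        grouped[service].add(ref)
    return grouped


refs = collect_refs(SAMPLE)
```

An enrichment step would then look up each collected URL once per service and write the normalised record to the header; that lookup is deliberately left out here.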

Documentation of command line options

TODO