Identifiers
===========
There are two kinds of identifier in the EHRI DB: local and global.
Unfortunately we haven't been too thorough about disambiguating the
two, so the terminology is confusing in places. As a rule, the word
'identifier' means the local identifier, whereas the word 'ID' means
the global one. Here's what they mean in practice:
Local identifiers exist within a namespace defined by their parent
scope. Item types at the top level of a hierarchy - for example,
countries - have an identifier that consists of their ISO 3166
two-letter country code. Since they have no higher scope, their local
identifier is the same as their global one: for example, the Netherlands
has local identifier ``nl`` and also global identifier ``nl``.
Repositories, however, belong to a scope (the country in which they
reside) so *their* global identifier consists of the local identifier of
their scope *added to* their own local identifier. For example, the
local identifier of USHMM is ``005578`` and that of its country (the
U.S.) ``us``, therefore USHMM's global identifier is ``us-005578``.
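The rule above can be sketched in a few lines of Python. This is an illustration only: the function name ``global_id`` is hypothetical and not part of the EHRI codebase, and the sketch assumes the local identifiers need no further clean-up.

```python
def global_id(scope_ids, local_id):
    """Join the local identifiers of the enclosing scopes (outermost first)
    with the item's own local identifier, using a hyphen."""
    return "-".join([*scope_ids, local_id])


# A top-level item has no scope, so local and global identifiers coincide.
print(global_id([], "nl"))
# A repository is scoped by its country.
print(global_id(["us"], "005578"))
```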
Why we do this
--------------
Identifiers are only useful when they uniquely identify something.
However, identity within hierarchies is contextual. For example, within
an archival collection ``c1`` there can only be a single archival unit
with the identifier ``1``. Deriving the *global identifier* of an item
from its own local identifier plus that of its parent items therefore
provides a means to ensure uniqueness within a given hierarchical scope:
if the resulting global ID is already taken, the local identifier is
not unique within its scope.
For example, if we import an EAD file from repository ``001500`` in
country ``us`` with the following structure:
.. code:: xml

    <c01>
      <did><unitid>100</unitid></did>
      <c02><did><unitid>1</unitid></did></c02>
    </c01>

... the resultant global ID of the first unitid would be
``us-001500-100`` and that of its child item ``us-001500-100-1``.
Transliteration
~~~~~~~~~~~~~~~
Prior to creating the hierarchical ID the local identifier is also
transformed by removing all punctuation and certain other URI reserved
characters and replacing them with at most one underscore per sequence.
Leading and trailing underscores are then removed. Finally, the result
is lower cased.
The final hierarchical ID is then formed by joining each transliterated
local identifier with a **hyphen** character.
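A rough Python sketch of the transliteration step follows. The function name ``transliterate`` is hypothetical, and the exact set of characters that gets replaced is an assumption: here every run of characters that are not letters or digits (including whitespace and existing underscores) is treated as one sequence.

```python
import re


def transliterate(local_id):
    """Normalise a raw local identifier for use in a hierarchical ID."""
    # Assumption: any run of non-alphanumeric characters is one sequence
    # and becomes a single underscore.
    replaced = re.sub(r"[\W_]+", "_", local_id)
    # Remove leading and trailing underscores, then lower-case the result.
    return replaced.strip("_").lower()


print(transliterate("DOC-1"))
print(transliterate("A.B // C"))
```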
Relative identifiers are therefore preferred in EHRI since they provide
the neatest global identifiers. However, in many cases EAD files are
structured with absolute identifiers, e.g.:
.. code:: xml

    <c01>
      <did><unitid>100</unitid></did>
      <c02><did><unitid>100 1</unitid></did></c02>
    </c01>

In this case, where the child identifier is *prefixed* by its parent
identifier, the prefix is removed prior to transliteration. So if, for
example, there was a hierarchy like so:
- ``us``

  - ``005578``

    - ``DOC-1``

      - ``DOC-1 / 1``

        - ``DOC-1 / 1 / 2``

          - ``DOC-1 / 1 / 2 / 3``

... the process of generating the final hierarchical ID would be as
follows:
- ``us`` => ``005578`` => ``DOC-1`` => ``DOC-1 / 1`` =>
  ``DOC-1 / 1 / 2`` => ``DOC-1 / 1 / 2 / 3`` (raw strings)
- ``us`` => ``005578`` => ``DOC-1`` => ``/ 1`` => ``/ 2`` => ``/ 3``
  (parent prefixes removed, note leading space-slash-space)
- ``us`` => ``005578`` => ``DOC_1`` => ``___1`` => ``___2`` => ``___3``
  (replace punctuation and whitespace with underscores)
- ``us`` => ``005578`` => ``doc_1`` => ``1`` => ``2`` => ``3`` (remove
  leading/trailing replacements)
- ``us-005578-doc_1-1-2-3`` (joining sections with hyphens)
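The steps above can be sketched end to end in Python. The names ``transliterate`` and ``hierarchical_id`` are hypothetical, and the prefix handling is deliberately naive: it strips the parent's raw identifier whenever the child's raw identifier starts with it, which is an assumption about how the real importer decides what counts as a prefix.

```python
import re


def transliterate(local_id):
    # Assumption: runs of non-alphanumeric characters become one underscore.
    return re.sub(r"[\W_]+", "_", local_id).strip("_").lower()


def hierarchical_id(raw_ids):
    """Build a hierarchical ID from raw local identifiers, listed from the
    top-level scope down to the item itself."""
    parts = []
    parent = None
    for raw in raw_ids:
        relative = raw
        # Strip the parent's raw identifier if the child is prefixed by it.
        if parent and raw.startswith(parent) and len(raw) > len(parent):
            relative = raw[len(parent):]
        parts.append(transliterate(relative))
        parent = raw
    # Join the transliterated sections with hyphens.
    return "-".join(parts)


path = ["us", "005578", "DOC-1", "DOC-1 / 1", "DOC-1 / 1 / 2", "DOC-1 / 1 / 2 / 3"]
print(hierarchical_id(path))  # us-005578-doc_1-1-2-3
```

Note that a purely textual prefix test like this one would also strip ``100`` from a child called ``1000``, so a real implementation needs a stricter notion of what a parent prefix is.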
Identifiers for descriptions
----------------------------
Descriptions for items (e.g. documentary units, repositories etc.) also
have identifiers that are unique within their scope (the item). A
description's local identifier is typically its ISO 639-2 three-letter
language code, with an optional additional disambiguator if there is
more than one description in the same language. The global identifier of
a description is its item's global identifier joined to the
description's local identifier with a period (".").
For example:
- ``us-005578-doc_1-1-2-3.eng`` (description local ID appended with a
period.)
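As a small Python illustration (both function names are hypothetical), a description ID can be built and taken apart again. Splitting on the first period relies on the assumption that an item's global ID never contains a period, which holds if periods are removed as punctuation during transliteration.

```python
def description_id(item_global_id, description_local_id):
    """Join an item's global ID and a description's local identifier."""
    return f"{item_global_id}.{description_local_id}"


def split_description_id(desc_global_id):
    """Return (item global ID, description local identifier)."""
    item_id, _, local_id = desc_global_id.partition(".")
    return item_id, local_id


desc = description_id("us-005578-doc_1-1-2-3", "eng")
print(desc)
print(split_description_id(desc))
```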
Restrictions
------------
This scheme places some restrictions on what can be used as an
identifier in an EHRI item:
- it must contain some non-punctuation characters
- the sequence of characters with punctuation removed *must be unique
within the parent scope*
Trade-offs
----------
The main trade-off in this scheme is normalisation vs. ease of
determining uniqueness. It is quite difficult (and quite costly) to
determine if a given identifier is unique within the scope of its parent
item. (In the worst case it involves iterating through every single node
in the graph, which makes importing items exceedingly slow.) Creating
graph IDs from a concatenation of local identifiers with the parent
scopes allows uniqueness checks via a single index lookup, which is very
cheap. The downside is that an item's graph ID is de-normalised with the
hierarchical structure to which it belongs. If it is moved to another
parent scope, its graph ID will no longer be valid. For this reason we
recommend that moving an item within a hierarchy be thought of as a copy
followed by a delete.
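To make the trade-off concrete, here is a toy Python sketch in which a plain dictionary stands in for the graph's ID index. The class ``IdIndex`` and its methods are hypothetical and bear no relation to the real storage layer; local identifiers are assumed to be transliterated already.

```python
class IdIndex:
    """Toy stand-in for the graph's ID index: a dict keyed by global ID."""

    def __init__(self):
        self._items = {}

    def create(self, scope_id, local_id, data):
        # The global ID is the scope's global ID plus the local identifier.
        gid = f"{scope_id}-{local_id}" if scope_id else local_id
        # A single lookup tells us whether the identifier is unique in scope.
        if gid in self._items:
            raise ValueError(f"'{local_id}' is not unique within '{scope_id}'")
        self._items[gid] = data
        return gid

    def move(self, gid, local_id, new_scope_id):
        # A move is a copy into the new scope followed by a delete.
        new_gid = self.create(new_scope_id, local_id, self._items[gid])
        del self._items[gid]
        return new_gid


index = IdIndex()
index.create(None, "us", {"type": "country"})
index.create(None, "nl", {"type": "country"})
old = index.create("us", "005578", {"type": "repository"})
new = index.move(old, "005578", "nl")
print(old, "->", new)  # us-005578 -> nl-005578
```

The sketch moves a single item only. In a real hierarchy every descendant's ID embeds the old parent's ID too, so each of them would need the same copy-and-delete treatment.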
Validation
----------
Maintaining hierarchical structures is difficult in any database system:
whilst integrity guarantees might best be maintained using a traditional
self-referential foreign-key structure in a relational database (which
can better handle integrity issues using compound keys), any system that
aspires to good performance will run into problems when moving trees
within trees (especially when optimisations like the adjacency list or
nested set model are employed.) Graph databases make the *relationship*
side of things much easier where hierarchies are concerned, but since
EHRI is an integration project we also have to worry about the
**identity** of things at various levels so that we can point back to
whatever it was we were integrating. There are therefore numerous
complexities involved that make sanity checking hierarchical structures
pretty important, especially - as with graph DBs - when there's a
separate *indexing* stage involved. Confusing matters further is the
fact that there are two types of hierarchy:
- permission scope
- parental
Most items, for instance, archival units, can only have one parent.
However, some, such as concepts (in SKOS vocabularies) can belong to
multiple different trees and therefore have several different parents.
**All** items, however, can only have a single permission scope. For
archival units this will be the parent item or the repository. For
repositories it will be the country they are in. For concepts it will be
the vocabulary they belong to (rather than the higher-level broader
concepts which they may have as immediate parents).
The ``IdRegenerator`` class is a 'check' tool that ensures IDs match
permission scopes when a node is moved between permission scopes. It can
be called via HTTP on the host that runs the graph server via the
``tools`` endpoint.
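The kind of check described here can be illustrated with a short Python sketch. It is not the ``IdRegenerator`` implementation: the ``Node`` class, its fields and the function names are all hypothetical, and local identifiers are assumed to be transliterated already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    global_id: str                  # the ID currently stored for the item
    local_id: str                   # the item's (transliterated) local identifier
    scope: Optional["Node"] = None  # permission scope, None at the top level


def expected_id(node):
    """Derive the ID a node should have from its chain of permission scopes."""
    parts = []
    current = node
    while current is not None:
        parts.append(current.local_id)
        current = current.scope
    return "-".join(reversed(parts))


def find_mismatches(nodes):
    """Return (stored ID, expected ID) pairs for nodes whose ID is stale."""
    return [(n.global_id, expected_id(n)) for n in nodes
            if n.global_id != expected_id(n)]


us = Node("us", "us")
repo = Node("us-005578", "005578", scope=us)
# This unit was moved under the US repository but kept its old ID.
unit = Node("gb-001-doc_1", "doc_1", scope=repo)
print(find_mismatches([us, repo, unit]))
```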