Example Ingest
The current ingest procedure is somewhat long-winded and technical. This is an example using a single EAD XML file containing a large number (48,000) of individual documentary unit items in a single fonds. The repository is the International Tracing Service (ITS), which has the EHRI repository ID de-002409.
This ingest covers importing the EAD file into the staging server, at which point it should be ready for verification and, if necessary, changes before the production ingest.
Before you start
First, log into the EHRI staging server via SSH and open a bunch of shells. In one of them, tail the following file, which will give us some information about what went wrong when something inevitably goes wrong the first few times we try:
tail -f /opt/webapps/neo4j-version/logs/log/neo4j.log
Back up the database
The Neo4j DB lives in /opt/webapps/data/neo4j/databases/graph.db. You can back it up without shutting down the server by running:
/opt/webapps/neo4j-backup.sh graph.db.BAK
To restore the DB the procedure is:
- shut down Neo4j
- replace /opt/webapps/data/neo4j/databases/graph.db with the backup directory you specified previously
- ensure all files in the graph.db directory are owned and writable by the webadm group:
    - chgrp -R webadm graph.db
    - chmod -R g+rw graph.db
- restart Neo4j
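Put together, a restore might look something like this (a sketch, assuming the backup directory graph.db.BAK from above sits alongside the live database, and that the service name matches the restart command used later in this document):
sudo service neo4j-service stop
cd /opt/webapps/data/neo4j/databases
mv graph.db graph.db.broken    # keep the old DB around, just in case
cp -r graph.db.BAK graph.db    # restore the backup
chgrp -R webadm graph.db       # fix group ownership...
chmod -R g+rw graph.db         # ...and permissions
sudo service neo4j-service start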
Procedure
Onwards with the ingest...
Next, in another shell, copy the file(s) to be ingested to the server and place them in /opt/webapps/data/import-data/de/de-002409, the working directory for ITS data.
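For example, from your local machine (the hostname here is a placeholder for the actual staging server):
scp KHSK_GER.xml <staging-host>:/opt/webapps/data/import-data/de/de-002409/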
Import properties handle certain mappings between tags (with particular attributes) and EHRI fields. The ITS data has a particular mapping indicating that when the <unitid> has a type="refcode" attribute, that is the main doc unit identifier, and that the rest are the alternates. In this case, the file is:
/opt/webapps/data/import-data/de/de-002409/its-pertinence.properties
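The mapping syntax itself is importer-specific, so inspect the real file on the server rather than relying on this; purely as a hypothetical illustration of the general path=field shape such files take (these lines are invented, not the actual ITS mappings):
# hypothetical example only -- see its-pertinence.properties for the real mappings
did/unitid/=otherIdentifiers
did/unitid/@type$refcode=identifier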
The actual import is done via the /ehri/import/ead endpoint on the Neo4j extension. It is documented here: http://ehri.github.io/docs/api/ehri-rest/ehri-extension/wsdocs/resource_ImportResource.html
The basic procedure is:
- obtain an appropriate import properties file (which we've done in this case)
- write an appropriate log file, describing what we're doing
- stick the EAD XML on the server
- run a curl command, POSTing the XML data to the ingest endpoint, with the appropriate parameters
- re-index the data held by our repository (ITS, de-002409) to make it searchable in the portal UI
To make the curl command less cumbersome, let's export the path to the properties file as an environment variable:
export PROPERTIES=/opt/webapps/data/import-data/de/de-002409/its-pertinence.properties
Also, let's write a log file and export its path as an environment variable:
echo "Importing ITS data with properties: $PROPERTIES" > LOG.txt
export LOG=`pwd`/LOG.txt
Now we can POST the data to the ingest endpoint:
curl -XPOST \
-H "X-User:mike" \
-H "Content-type: text/xml" \
--data-binary @KHSK_GER.xml \
"http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true"
These parameters are:
- the X-User header tells the web service which user is responsible for the ingest.
- the Content-type header tells it to expect XML data.
- the scope=de-002409 query parameter tells it we're importing this EAD into the ITS repository.
- the log=$LOG parameter tells it to find the log text in a local file.
- the properties=$PROPERTIES parameter tells it to find the import properties in a local file.
- the commit=true|false parameter tells the web service to actually commit the transaction. By default it will not, which provides a way of doing "dry run" ingests.
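To do such a dry run, simply drop commit=true from the command above (or set it to false) and check the returned counts before committing for real:
curl -XPOST \
-H "X-User:mike" \
-H "Content-type: text/xml" \
--data-binary @KHSK_GER.xml \
"http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES"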
Note: when importing a single EAD containing ~50,000 items in a single transaction the staging server might run out of memory. If it does, the only option is to increase the Neo4j heap size by uncommenting and setting dbms.memory.heap.max_size=MORE_MB (say, 3500) in $NEO4J_HOME/conf/neo4j-wrapper.conf and restarting Neo4j by running:
sudo service neo4j-service restart
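After that change, the relevant (formerly commented-out) line in neo4j-wrapper.conf would look something like this, with the value in MB per the MORE_MB placeholder above:
dbms.memory.heap.max_size=3500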
Additional note: Certain date patterns are fuzzy parsed by the importer and invalid dates such as 31st April will currently throw a runtime exception resulting in a BadRequest from the web service. So fix all these first ;)
If all goes well you should get something like this:
{"created":48430,"unchanged":0,"message":"Import ITS 0.4 data using its-pertinence.properties.\n","updated":0,"errors":{}}
In theory, that ingest should be idempotent, so you can run the same command again and not change anything. Instead you'd get a reply like:
{"created":0,"unchanged":48430,"message":"Import ITS 0.4 data using its-pertinence.properties.\n","updated":0,"errors":{}}
Indexing
The final step is to re-index the ITS repository, making the items searchable. This can be done from the Portal Admin UI, or via the following command:
java -jar /opt/webapps/docview/bin/indexer.jar \
--clear-key-value holderId=de-002409 \
--index -H "X-User=admin" \
--stats \
--solr http://localhost:8080/ehri/portal \
--rest http://localhost:7474/ehri \
"Repository|de-002409"
(This tool is a library/CLI utility that is used by the portal UI and is available on the server: see the https://github.com/EHRI/ehri-search-tools project for more details.)
Updating existing collections
To update existing collections, when, for example, adding descriptions in another language, the procedure is exactly the same with one exception: the import curl command needs an additional parameter:
&allow-update=true
Without this parameter the importer will throw a mode violation error when it ends up updating an existing collection.
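In other words, the full ingest command becomes (the file name here is just a placeholder for the new, e.g. English-language, EAD):
curl -XPOST \
-H "X-User:mike" \
-H "Content-type: text/xml" \
--data-binary @KHSK_ENG.xml \
"http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true&allow-update=true"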
Overwriting existing descriptions
If you want to overwrite an existing item description with data from a new EAD, the EAD must have the same sourceFileId value as exists on the current description. The sourceFileId is a property computed from two aspects of the EAD file: the eadheader/eadid value and the eadheader/profiledesc/langusage/language/@langcode value, combined thus: [EADID]#[UPPER-CASE-LANGCODE].
For example, if the eadid is 100 and the language code is eng, the sourceFileId will be 100#ENG.
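In EAD terms, those two values come from a header along these lines (trimmed down to just the relevant elements):
<eadheader>
  <eadid>100</eadid>
  <profiledesc>
    <langusage>
      <language langcode="eng">English</language>
    </langusage>
  </profiledesc>
</eadheader>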
Only documentary unit descriptions created via the EAD ingest process will have a sourceFileId; those created using the portal interface will not. For descriptions that have the property, it is visible (but not editable) on the portal admin pages.
Note: the consequence of the above is that the eadid value should not contain the language code, since this is redundant and will result in a sourceFileId like eng#ENG.
Ingesting multiple files in an archive
It is possible to ingest multiple EAD files in a single transaction by providing the importer with an archive file (containing multiple XML files) instead of a single XML file. Currently the following formats are supported:
- zip (although some extensions are problematic)
- tar
- tar.gz
The importer will assume the data it is given is an archive if the content type of the request is given as application/octet-stream (aka miscellaneous binary), instead of either text/xml (XML) or text/plain (local file paths).
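An archive ingest therefore looks just like the single-file one, with only the content type and payload changed (the archive name here is a placeholder):
curl -XPOST \
-H "X-User:mike" \
-H "Content-type: application/octet-stream" \
--data-binary @its-ead-files.tar.gz \
"http://localhost:7474/ehri/import/ead?scope=de-002409&log=$LOG&properties=$PROPERTIES&commit=true"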
Note: if several EAD files provide different translations of the same items it is necessary to enable update ingests via &allow-update=true, as described above.