Harvesting Tools

To integrate the data, we first need to get it onto the server. We use two kinds of harvesters. The first is shell-oaiharvester, an open-source harvester that follows the OAI-PMH protocol. Some institutions in the project did not provide an OAI-PMH endpoint, so we developed a second harvester for the ResourceSync Framework, which can be found in the EHRI resydes repository on GitHub.

OAI-PMH

Download:
https://github.com/wimmuskee/shell-oaiharvester
Installation:
/opt/shell-oaiharvester/
Config file:
/opt/shell-oaiharvester/config.xml
Records:
/var/opt/oai-pmh-harvester/

The harvester is configured in config.xml. For every endpoint, at least the following properties can be set:

id
the identifier for this repository, to be used to run the harvester
baseurl
the OAI-PMH endpoint of the CHI
metadataprefix
the PMH metadataprefix, usually something like oai_ead
recordpath
the location to store the records
set (optional)
the PMH set to be harvested
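
In OAI-PMH terms, baseurl, metadataprefix and set correspond to the parts of a ListRecords request. The sketch below shows how such a request URL is put together; it is only an illustration of the protocol, not the code shell-oaiharvester itself uses. The function name and the example.org endpoint are hypothetical, and no URL encoding is applied.

```shell
#!/bin/sh
# Build an initial OAI-PMH ListRecords request URL.
# Arguments: baseurl, metadataprefix, set (optional).
# Hypothetical helper for illustration; values are not URL-encoded.
build_list_records_url() {
    baseurl=$1
    metadataprefix=$2
    set_spec=${3:-}
    url="${baseurl}?verb=ListRecords&metadataPrefix=${metadataprefix}"
    # The set argument is optional in OAI-PMH, so append it only when given.
    if [ -n "$set_spec" ]; then
        url="${url}&set=${set_spec}"
    fi
    printf '%s\n' "$url"
}

# Hypothetical endpoint and set name.
build_list_records_url "https://example.org/oai" "oai_ead" "archives"
```

Opening such a URL in a browser or with curl can be a quick way to check that an endpoint answers with the expected metadataprefix before adding it to config.xml.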

Run Command

/opt/shell-oaiharvester/oaiharvester -c config.xml -r <repository-id>

This runs the harvester, retrieves all new and updated records from <repository-id>, and stores them at the recordpath, where the ingest process picks them up.
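
When several repositories are configured, the run command can be wrapped in a small loop. This is a minimal sketch, assuming the harvester returns a non-zero exit status on failure (not confirmed here); the function name and the repository ids in the usage comment are hypothetical.

```shell
#!/bin/sh
# Paths taken from the installation described above; override via environment if needed.
HARVESTER=${HARVESTER:-/opt/shell-oaiharvester/oaiharvester}
CONFIG=${CONFIG:-/opt/shell-oaiharvester/config.xml}

# Run the harvester once for every repository id given as an argument.
# Returns 1 if at least one run failed, 0 otherwise.
# Assumption: the harvester signals failure through its exit status.
harvest_all() {
    status=0
    for repo in "$@"; do
        if ! "$HARVESTER" -c "$CONFIG" -r "$repo"; then
            echo "harvest failed for repository: $repo" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (hypothetical repository ids):
# harvest_all chi-a chi-b
```

Continuing with the next repository after a failure means one unreachable endpoint does not block the others.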

ResourceSync

Download:
https://github.com/EHRI/resydes
Installation:
/opt/oai-resourcesync/
Config files:
/opt/oai-resourcesync/cfg/
Records:
/var/opt/oai-rs-harvester/

The ResourceSync Framework describes a protocol to keep a destination (EHRI) in sync with a source (the CHI). It uses sitemaps to do so. The config file uri-list.txt lists every sitemap that needs to be synced. The syncapp-context.xml file configures the harvester; for instance, the baseDirectory in which the retrieved files are stored can be set there.

Run Command

/opt/oai-resourcesync/run.sh

This runs the harvester, retrieves all new and updated files from the CHI, and stores them at the baseDirectory. It also deletes files that are no longer exposed at the CHI.
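
To get a feel for what a sitemap exposes, the sketch below lists the resource URIs (the loc elements) found in a ResourceSync document. It is an inspection aid under stated assumptions, not part of resydes: the function name and the sample document are hypothetical, it relies on grep -o (available in GNU and BSD grep), and it uses simple pattern matching rather than a real XML parser.

```shell
#!/bin/sh
# Print the value of every <loc> element found in a sitemap document on stdin.
# Hypothetical helper; pattern matching only, no XML parsing.
list_resource_uris() {
    grep -o '<loc>[^<]*</loc>' | sed -e 's/<loc>//' -e 's/<\/loc>//'
}

# Hypothetical resource list, shaped like a ResourceSync sitemap document.
sample='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2016-01-01T00:00:00Z"/>
  <url><loc>https://example.org/ead/file1.xml</loc></url>
  <url><loc>https://example.org/ead/file2.xml</loc></url>
</urlset>'

printf '%s\n' "$sample" | list_resource_uris
```

Piping the output of curl for a real sitemap URI into list_resource_uris would show which files the source currently exposes.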

Selective Harvesting

See selective harvesting.