For data integration, we first need to get the data onto the server. We use two kinds of harvesters. The first is shell-oaiharvester, an open source harvester that follows the OAI-PMH protocol. Some institutions in the project did not provide an OAI-PMH endpoint, so for those we developed a harvester for the ResourceSync Framework, which can be found in the EHRI resydes repository on GitHub.
- Config file: config.xml
The harvester needs to be configured. For every endpoint, at least the following properties can be set:
- the identifier for this repository, to be used to run the harvester
- the OAI-PMH endpoint of the CHI
- the PMH metadataprefix, usually something like
- the location to store the records
- the PMH set to be harvested
/opt/shell-oaiharvester/oaiharvester -c config.xml -r <repository-id>
This runs the harvester, retrieves all new and updated records from <repository-id>, and stores them at the recordspath. There they are picked up by the ingest process.
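To make the configured properties more concrete, the following is a minimal sketch of how an OAI-PMH ListRecords request could be assembled from an endpoint, a metadata prefix, a set, and a date. It is not taken from shell-oaiharvester. The function name, its parameters, the example endpoint, and the example values are all assumptions made for illustration; only the query argument names (verb, metadataPrefix, set, from) come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def list_records_url(endpoint, metadata_prefix, set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    endpoint        -- base URL of the OAI-PMH endpoint
    metadata_prefix -- value for the metadataPrefix argument
    set_spec        -- optional set to restrict the harvest to
    from_date       -- optional lower date bound, e.g. "2016-01-01"
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    if from_date:
        # Only records created or changed since this date are requested.
        params["from"] = from_date
    return endpoint + "?" + urlencode(params)


# Placeholder endpoint and values, not a real CHI configuration.
url = list_records_url(
    "https://chi.example.org/oai",
    "oai_dc",
    set_spec="archives",
    from_date="2016-01-01",
)
print(url)
```

Passing a date corresponds to the "new and updated records" behaviour described above: an incremental harvest only asks for records changed since the previous run.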
- Config files:
The ResourceSync Framework describes a protocol to keep a destination (EHRI) in sync with a source (the CHI). It uses sitemaps to do so. The config file uri-list.txt lists every sitemap that needs to be synced. The file syncapp-context.xml configures the harvester; for instance, the baseDirectory in which the retrieved files are stored is set there.
Running the harvester retrieves all new and updated files from the CHI and stores them at the baseDirectory. It also deletes files that are no longer exposed at the CHI.
See selective harvesting.
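The sitemap-based synchronization described above can be illustrated with a short sketch. It is not the resydes implementation, and the actual harvester may detect changes differently. The function plan_sync, the local_state mapping, and the sample URIs are hypothetical, and change detection is simplified to comparing lastmod values.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemap format that ResourceSync builds on.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def plan_sync(sitemap_xml, local_state):
    """Compare a sitemap with the local copy (hypothetical helper).

    local_state maps a resource URI to the lastmod value of the local file.
    Returns (to_fetch, to_delete) as sorted lists of URIs.
    """
    remote = {}
    for url in ET.fromstring(sitemap_xml).findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:
            remote[loc] = lastmod
    # New or changed at the source: fetch again.
    to_fetch = sorted(u for u, m in remote.items() if local_state.get(u) != m)
    # No longer exposed at the source: remove the local copy.
    to_delete = sorted(u for u in local_state if u not in remote)
    return to_fetch, to_delete


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://chi.example.org/ead/a.xml</loc><lastmod>2016-02-01</lastmod></url>
  <url><loc>http://chi.example.org/ead/b.xml</loc><lastmod>2016-01-01</lastmod></url>
</urlset>"""

local = {
    "http://chi.example.org/ead/b.xml": "2016-01-01",
    "http://chi.example.org/ead/c.xml": "2015-12-01",
}
to_fetch, to_delete = plan_sync(SAMPLE, local)
print(to_fetch, to_delete)
```

In this example a.xml is new and would be fetched, b.xml is unchanged, and c.xml is no longer listed in the sitemap and would be deleted locally.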