Compare commits

...

28 Commits
v0.2 ... main

Author SHA1 Message Date
Felix Lohmeier a23a93e5cd Reference to new repos 2022-04-08 11:40:24 +02:00
Felix Lohmeier a3cd4c1849 Changed notation of vl:doctype 2022-03-15 18:33:40 +01:00
Felix Lohmeier 2e698c3fe3 fix dc:contributor java.lang issue 2022-02-22 10:38:55 +01:00
Felix Lohmeier 659ad70ec6 fix #37 deviating MIME type for records with file extension .pdf in the URL 2021-07-06 13:39:07 +02:00
Felix Lohmeier 170bb53b57 Added pub.uni-bielefeld.de to the docs 2021-06-14 15:35:54 +02:00
Felix Lohmeier 1b5f3000bc fix #36 Bielefeld: identifiers for the link in the union catalog are missing 2021-05-28 13:47:14 +02:00
Felix Lohmeier 5c727fdbcd New data source: PUB UB Bielefeld #18 2021-05-11 22:20:40 +02:00
Felix Lohmeier dd614a6e2d Temporarily without journal issues #31 and with only one direct link #25 2021-03-19 16:12:58 +01:00
Felix Lohmeier 8cd0b69f70 Follow-up to #28 Münster: filter out records with 'restriction on access' 2021-03-08 17:37:53 +01:00
Felix Lohmeier 4571ebd6fc fix #24 Münster: evaluate GND numbers 2021-03-08 17:37:06 +01:00
Felix Lohmeier 6734927ecd fix #30 Münster: partially filter mods:note type 2021-03-08 17:19:28 +01:00
Felix Lohmeier cf247a86c1 fix #28 Münster: filter out records with 'restriction on access' 2021-03-08 17:13:23 +01:00
Felix Lohmeier cf3c006d78 Münster: specification of further vl:doctype 2021-03-02 17:18:52 +01:00
Felix Lohmeier 3b154c21cb fix #27 Münster: linkcheck should output further information 2021-03-02 14:59:24 +01:00
Felix Lohmeier 1f1298c6f0 delete accidentally uploaded temporary files 2021-03-02 13:45:18 +01:00
Felix Lohmeier 3711d241f2 resolve #26 refactoring based on the openrefine-task-runner template 2021-03-02 13:32:12 +01:00
Felix Lohmeier 192bbef02d replace .reverse()[0] with [-1] 2021-02-10 11:23:31 +01:00
Felix Lohmeier 1c77a9ab50 individual port number for Münster 2021-02-10 11:14:27 +01:00
Felix Lohmeier 7554346261 resolve #19 new data source: miami ULB Münster 2021-02-06 02:51:16 +01:00
Felix Lohmeier 11fd9aa54a make the PDF filter case-insensitive (.PDF) to be safe 2021-02-05 16:44:45 +01:00
Felix Lohmeier 6fe88c393e fix #22 split without extra string 2021-02-03 15:11:17 +01:00
Felix Lohmeier 159ccc1a17 Harvesting and import ULB Münster miami #19 2021-01-25 18:08:44 +01:00
Felix Lohmeier cb989c0410 fix #21 delete temporary files at the start 2021-01-25 17:48:47 +01:00
Felix Lohmeier 65edbbf873 make the workaround for too many open files clearer 2021-01-20 15:48:54 +01:00
Felix Lohmeier acd10b3ebb Use further variables so that the diff between two data sources is as meaningful as possible #20 2021-01-20 15:35:02 +01:00
Felix Lohmeier 4d259e30fe task diff: replace status: with if in cmd #9 2021-01-20 13:15:09 +01:00
Felix Lohmeier 8d78f56cbf add label: for overarching tasks #9 2021-01-20 12:30:21 +01:00
Felix Lohmeier 3760451b36 Revert "Statusprüfungen in Taskfiles der Datenquelle #9"
This reverts commit 1286c8177b.
2021-01-20 12:16:34 +01:00
70 changed files with 2843 additions and 495 deletions

.gitignore

@@ -1,3 +1,8 @@
data
openrefine
*/harvest/*
*/refine/*
*/split/*
*/validate/*
*/zip/*
*/*.log
.openrefine
.task

README.md

@@ -1,141 +1,14 @@
# Data integration for noah.nrw
Harvesting of OAI-PMH interfaces and transformation into METS/MODS for the portal [noah.nrw](https://noah.nrw/)
**:warning: This is a prototype for the beta version of the portal.**
> :warning: **Note:** This repo is no longer up to date. The workflows are now split up as follows:
## Data flow
| Workflow | GitHub Repository|
|:------------------|-----------------------------------------------------------------------------------------|
| bielefeld | [noah-bielefeld-pub](https://github.com/opencultureconsulting/noah-bielefeld-pub) |
| muenster | [noah-muenster-miami](https://github.com/opencultureconsulting/noah-muenster-miami) |
| siegen | [noah-siegen-opus](https://github.com/opencultureconsulting/noah-siegen-opus) |
| wuppertal | [noah-wuppertal-elpub](https://github.com/opencultureconsulting/noah-wuppertal-elpub) |
![Data flow diagram](flowchart.svg)
## Tools used
* Harvesting (with cache): [metha](https://github.com/miku/metha/)
* Transformation: [OpenRefine](https://github.com/OpenRefine/OpenRefine) and [openrefine-client](https://github.com/opencultureconsulting/openrefine-client)
* :warning: For production use, the switch to [metafacture](https://github.com/metafacture) is planned.
* Task runner: [Task](https://github.com/go-task/task)
## System requirements
* GNU/Linux (tested with Fedora 32)
* Java 8+
## Installation
1. Clone the Git repository
```sh
git clone https://github.com/opencultureconsulting/noah.git
cd noah
```
2. [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) (requires Java 8+)
```sh
# install into the subdirectory openrefine
wget -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
mkdir -p openrefine
tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
# disable automatic launch of the browser
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini"
# increase the autosave period from 5 minutes to 25 hours
sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini"
```
3. [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
```sh
# install into the subdirectory openrefine
mkdir -p openrefine
wget -O openrefine/openrefine-client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
chmod +x openrefine/openrefine-client
```
4. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha-0.2.20-0.x86_64.rpm
sudo dnf install ./metha-0.2.20-0.x86_64.rpm && rm metha-0.2.20-0.x86_64.rpm
```
b) DEB-based (Debian, Ubuntu, etc.)
```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha_0.2.20_amd64.deb
sudo apt install ./metha_0.2.20_amd64.deb && rm metha_0.2.20_amd64.deb
```
5. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```
b) DEB-based (Debian, Ubuntu, etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```
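After installing, it can help to confirm that all external tools are on the PATH before running any tasks. This is a hedged sketch that mirrors the preconditions declared in Taskfile.yml; the helper name `require` is made up for illustration:

```shell
# check that each required external tool is available on the PATH
require() {
  status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "requirement $tool missing" >&2
      status=1
    fi
  done
  return $status
}

# the workflow relies on these tools (see preconditions in Taskfile.yml)
require metha-sync java curl xmllint || echo "install the missing tools before running task" >&2
```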
## Usage
* Harvest, transform, and validate all data sources (parallelized)
```
task
```
* Harvest, transform, and validate one data source
```
task siegen:default
```
* Harvest, transform, and validate two data sources (parallelized)
```
task --parallel siegen:default wuppertal:default
```
* Check the links of a data source
```
task siegen:linkcheck
```
* Delete the cache of a data source
```
task siegen:delete
```
* List the available tasks
```
task --list
```
## Configuration
* Workflow for each data source in [tasks](tasks)
* Example: [tasks/siegen.yml](tasks/siegen.yml)
* OpenRefine transformation rules in [rules](rules)
* Example: [rules/siegen/hbz.json](rules/siegen/hbz.json)
* General tasks (e.g. validation) in [Taskfile.yml](Taskfile.yml)
## Known Issues
> too many open files
```
ulimit -n 10000
```
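The limit only needs raising when the current soft limit is below the suggested value; a hedged sketch for an interactive session:

```shell
# raise the open-files soft limit for the current shell session only
# if it is below the value suggested above
current=$(ulimit -n)
if [ "$current" != "unlimited" ] && [ "$current" -lt 10000 ]; then
  # raising beyond the hard limit fails without root privileges
  ulimit -n 10000 2>/dev/null || echo "could not raise limit (hard limit: $(ulimit -Hn))" >&2
fi
ulimit -n
```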
## OAI-PMH Data Provider
The transformed data is provided via the file-based OAI-PMH data provider [oai_pmh](https://github.com/opencultureconsulting/oai_pmh). Installation and usage notes can be found there.
The previous technical approach can be found at https://github.com/opencultureconsulting/noah/tree/v0.3.


@@ -1,110 +1,196 @@
# https://taskfile.dev
# https://github.com/opencultureconsulting/openrefine-task-runner
version: '3'
output: prefixed
includes:
siegen: ./tasks/siegen.yml
wuppertal: ./tasks/wuppertal.yml
bielefeld: bielefeld
muenster: muenster
siegen: siegen
wuppertal: wuppertal
silent: true
output: prefixed
vars:
DATE: '{{ now | date "2006-01-02"}}'
env:
OPENREFINE:
sh: readlink -e openrefine/refine
OPENREFINE_CLIENT:
sh: readlink -e openrefine/openrefine-client
sh: readlink -m .openrefine/refine
CLIENT:
sh: readlink -m .openrefine/client
tasks:
default:
desc: all data sources (parallel)
preconditions:
- sh: test -n "$(command -v metha-sync)"
msg: "requirement metha missing"
- sh: test -n "$(command -v java)"
msg: "requirement JAVA runtime environment (jre) missing"
- sh: test -x "$OPENREFINE"
msg: "requirement OpenRefine missing"
- sh: test -x "$OPENREFINE_CLIENT"
msg: "requirement openrefine-client missing"
- sh: test -n "$(command -v curl)"
msg: "requirement curl missing"
- sh: test -n "$(command -v xmllint)"
msg: "requirement xmllint missing"
desc: execute all projects in parallel
deps:
- task: wuppertal:default
- task: siegen:default
- task: bielefeld:main
- task: muenster:main
- task: siegen:main
- task: wuppertal:main
install:
desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
cmds:
- | # delete existing install and recreate folder
rm -rf .openrefine
mkdir -p .openrefine
- > # download OpenRefine archive
wget --no-verbose -O openrefine.tar.gz
https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
- | # install OpenRefine into subdirectory .openrefine
tar -xzf openrefine.tar.gz -C .openrefine --strip 1
rm openrefine.tar.gz
- | # optimize OpenRefine for batch processing
sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
- > # download openrefine-client into subdirectory .openrefine
wget --no-verbose -O .openrefine/client
https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
- chmod +x .openrefine/client # make client executable
start:
dir: ./{{.PROJECT}}/refine
cmds:
- | # verify that OpenRefine is installed
if [ ! -f "$OPENREFINE" ]; then
echo 1>&2 "OpenRefine missing; try task install"; exit 1
fi
- | # delete temporary files and log file of previous run
rm -rf ./*.project* workspace.json
rm -rf "{{.PROJECT}}.log"
- > # launch OpenRefine with specific data directory and redirect its output to a log file
"$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
-d ../{{.PROJECT}}/refine
>> "{{.PROJECT}}.log" 2>&1 &
- | # wait until OpenRefine API is available
timeout 30s bash -c "until
wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
do sleep 1
done"
stop:
dir: ./{{.PROJECT}}/refine
cmds:
- | # shut down OpenRefine gracefully
PID=$(lsof -t -i:{{.PORT}})
kill $PID
while ps -p $PID > /dev/null; do sleep 1; done
- > # archive the OpenRefine project
tar cfz
"{{.PROJECT}}.openrefine.tar.gz"
-C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
.
- rm -rf ./*.project* workspace.json # delete temporary files
kill:
dir: ./{{.PROJECT}}/refine
cmds:
- | # shut down OpenRefine immediately to save time and disk space
PID=$(lsof -t -i:{{.PORT}})
kill -9 $PID
while ps -p $PID > /dev/null; do sleep 1; done
- rm -rf ./*.project* workspace.json # delete temporary files
check:
dir: data/{{.PROJECT}}/refine
dir: ./{{.PROJECT}}/refine
cmds:
- test -n "{{.PROJECT}}"; test -n "{{.MINIMUM}}"
# check the OpenRefine log file for warnings and error messages
- if grep -i 'exception\|error' openrefine.log; then echo 1>&2 "log file $PWD/openrefine.log contains warnings!" && exit 1; fi
# check whether the minimum number of 1250 records was generated
- if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then echo 1>&2 "Unexpectedly low number of records in $PWD/{{.PROJECT}}.txt!" && exit 1; fi
- | # find log file(s) and check for "exception" or "error"
if grep -i 'exception\|error' $(find . -name '*.log'); then
echo 1>&2 "log contains warnings!"; exit 1
fi
- | # check whether the minimum number of records was generated
if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then
echo 1>&2 "Unexpectedly low number of records in $PWD/{{.PROJECT}}.txt!"; exit 1
fi
split:
dir: data/{{.PROJECT}}/split
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}/split
cmds:
- test -n "{{.PROJECT}}"
# split into individual files
- csplit -q ../refine/{{.PROJECT}}.txt --suppress-matched '/<!-- SPLIT -->/' "{*}"
- csplit -s -z ../refine/{{.PROJECT}}.txt '/<mets:mets /' "{*}"
# delete any existing XML files
- rm -f *.xml
# use the identifiers as file names
- for f in xx*; do mv "$f" "$(xmllint --xpath "//*[local-name(.) = 'recordIdentifier']/text()" "$f").xml"; done
sources:
- ../refine/{{.PROJECT}}.txt
generates:
- ./*.xml
validate:
dir: data/{{.PROJECT}}
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}/validate
cmds:
- test -n "{{.PROJECT}}"
# validate against the METS schema
- wget -q -nc https://www.loc.gov/standards/mets/mets.xsd
- xmllint --schema mets.xsd --noout split/*.xml > validate.log 2>&1
- xmllint --schema mets.xsd --noout ../split/*.xml > validate.log 2>&1
sources:
- ../split/*.xml
generates:
- validate.log
zip:
dir: data/{{.PROJECT}}
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}/zip
cmds:
- test -n "{{.PROJECT}}"
# create a ZIP archive with timestamp
- zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip split/*.xml
- zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip ../split/*.xml
sources:
- ../split/*.xml
generates:
- '{{.PROJECT}}_{{.DATE}}.zip'
diff:
dir: data/{{.PROJECT}}
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}
cmds:
- test -n "{{.PROJECT}}"
# compare the contents of the two most recent ZIP archives
- unzip -q -d old $(ls -t *.zip | sed -n 2p)
- unzip -q -d new $(ls -t *.zip | sed -n 1p)
- if test -n "$(ls -t zip/*.zip | sed -n 2p)"; then unzip -q -d old $(ls -t zip/*.zip | sed -n 2p); unzip -q -d new $(ls -t zip/*.zip | sed -n 1p); fi
- diff -d old new > diff.log || exit 0
- rm -rf old new
# check that the diff contains fewer than 500 lines
- if (( 500 < $(wc -l <diff.log) )); then echo 1>&2 "Unexpectedly large changes in $PWD/diff.log!" && exit 1; fi
# archive the diff
- cp diff.log {{.PROJECT}}_{{.DATE}}.diff
status:
# skip the task if fewer than two ZIP archives exist
- test -z $(ls -t *.zip | sed -n 2p)
- cp diff.log zip/{{.PROJECT}}_{{.DATE}}.diff
sources:
- split/*.xml
generates:
- diff.log
linkcheck:
dir: data/{{.PROJECT}}
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}
cmds:
- test -n "{{.PROJECT}}"
# extract links
- xmllint --xpath '//@*[local-name(.) = "href"]' split/*.xml | cut -d '"' -f2 > links.txt
# determine the HTTP status code of every link
- curl --silent --head --write-out "%{http_code} %{url_effective}\n" $(while read line; do echo "-o /dev/null $line"; done < links.txt) > linkcheck.log
- rm -rf links.txt
- grep -o 'href="[^"]*"' split/*.xml | sed 's/:href=/\t/' | tr -d '"' | sort -k 2 --unique > links.txt
# determine HTTP status codes
- awk '{ print "url = " $2 "\noutput = /dev/null"; }' links.txt > curl.cfg
- curl --silent --head --location --write-out "%{http_code}\t%{url_effective}\n" --config curl.cfg > curl.log
# build a table with status code, effective URL, file name, and start URL
- paste curl.log links.txt > linkcheck.log
- rm -rf curl.cfg curl.log links.txt
# check the log file for status codes != 2XX
- if grep '^[^2]' linkcheck.log; then echo 1>&2 "log file $PWD/linkcheck.log contains problematic status codes!" && exit 1; fi
sources:
- split/*.xml
generates:
- linkcheck.log
delete:
dir: data/{{.PROJECT}}
label: '{{.TASK}}-{{.PROJECT}}'
dir: ./{{.PROJECT}}
cmds:
- test -n "{{.PROJECT}}"
- rm -rf harvest
- rm -rf refine
- rm -rf split
- rm -rf validate
- rm -f diff.log
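The split task above can be tried in isolation. A minimal sketch with a constructed two-record sample; sed stands in for the `xmllint --xpath` rename step to keep the sketch dependency-free:

```shell
# split a file with several METS records into one XML file per record,
# named after its recordIdentifier (mirrors the split task above)
workdir=$(mktemp -d)
cd "$workdir"
cat > sample.txt <<'EOF'
<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mods:recordIdentifier>rec-1</mods:recordIdentifier></mets:mets>
<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mods:recordIdentifier>rec-2</mods:recordIdentifier></mets:mets>
EOF
# one piece per record; -z drops the empty piece before the first match
csplit -s -z sample.txt '/<mets:mets /' '{*}'
# rename each piece after its recordIdentifier (sed instead of xmllint)
for f in xx*; do
  id=$(sed -n 's/.*<mods:recordIdentifier>\([^<]*\)<.*/\1/p' "$f" | head -n 1)
  mv "$f" "$id.xml"
done
ls *.xml   # lists rec-1.xml and rec-2.xml
```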

bielefeld/Taskfile.yml

@@ -0,0 +1,143 @@
version: '3'
tasks:
main:
desc: pub UB Bielefeld
vars:
MINIMUM: 12000 # minimum number of records expected
PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: harvest
- task: refine
# the following tasks starting with ":" are defined in Taskfile.yml and shared by all data sources
- task: :check
vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
- task: :split
vars: {PROJECT: '{{.PROJECT}}'}
- task: :validate
vars: {PROJECT: '{{.PROJECT}}'}
- task: :zip
vars: {PROJECT: '{{.PROJECT}}'}
- task: :diff
vars: {PROJECT: '{{.PROJECT}}'}
harvest:
dir: ./{{.PROJECT}}/harvest
desc: harvest pub UB Bielefeld
vars:
URL: https://pub.uni-bielefeld.de/oai
FORMAT: mods
SET: open_access
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- METHA_DIR=$PWD metha-sync --format {{.FORMAT}} --set {{.SET}} --no-intervals {{.URL}} # selective harvesting with metha fails for this endpoint, hence the option --no-intervals
- METHA_DIR=$PWD metha-cat --format {{.FORMAT}} --set {{.SET}} {{.URL}} > {{.PROJECT}}.xml
status:
- test -f ./{{.PROJECT}}.xml # Since selective harvesting does not work, this status check tests whether the file already exists so that a full dump is not downloaded every time. For the time being, updates must be done manually with task bielefeld:harvest --force
refine:
dir: ./{{.PROJECT}}
vars:
PORT: 3337 # assign a different port for each project
RAM: 4G # maximum RAM for OpenRefine java heap space
PROJECT: '{{splitList ":" .TASK | first}}'
LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
cmds:
- mkdir -p refine
- task: :start # launch OpenRefine
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # import (requires an absolute path to the XML file)
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m harvest/{{.PROJECT}}.xml)"
--recordPath Records --recordPath Record
--storeEmptyStrings false --trimStrings true
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # preprocessing: move the identifier into the first column id; delete unneeded columns (without distinguishing features); rename the remaining columns (remove the path)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/vorverarbeitung.json
> {{.LOG}}
- > # delete records without a PDF
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nur-mit-pdf.json
> {{.LOG}}
- > # index: generate column index with row.record.index
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/index.json
> {{.LOG}}
- > # sorting: nonSort for the first element in title
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nonsort.json
> {{.LOG}}
- > # extract ORCID iDs from name - description
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/orcid.json
> {{.LOG}}
- > # convert role statements in name - role - roleTerm to MARC relators (persons only)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/roleterm.json
> {{.LOG}}
- > # extract doctype for mods:genre from setSpec in the OAI header
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/doctype.json
> {{.LOG}}
- > # derive the Visual Library doctype from doctype
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/vldoctype.json
> {{.LOG}}
- > # extract ddc for mods:classification from setSpec in the OAI header
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/ddc.json
> {{.LOG}}
- > # encode special characters in relatedItem - location - url
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/urlencode.json
> {{.LOG}}
- > # uniformly set internetMediaType to application/pdf when the URL ends in .pdf
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/mime.json
> {{.LOG}}
- > # add rights statements from dc:rights in OAI_DC format
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/rights.json
> {{.LOG}}
- > # enrich HT number via lobid-resources: OR search when there are multiple URNs; if there are multiple hits, only the first is taken
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/hbz.json
> {{.LOG}}
- | # export to METS:MODS with templating
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- harvest/{{.PROJECT}}.xml
- config/**
generates:
- refine/{{.PROJECT}}.openrefine.tar.gz
- refine/{{.PROJECT}}.txt
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
linkcheck:
desc: check pub UB Bielefeld links
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :linkcheck
vars: {PROJECT: '{{.PROJECT}}'}
delete:
desc: delete pub UB Bielefeld cache
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :delete
vars: {PROJECT: '{{.PROJECT}}'}
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"
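Both the `PROJECT` variable and the standalone `default` task derive the task namespace from the directory name. A small sketch of that shell idiom (the directory name `bielefeld` below is just the example from this Taskfile):

```shell
# derive "<directory>:main" from the current working directory,
# as the standalone default task does with DIR="${PWD##*/}:main"
demo=$(mktemp -d)
mkdir -p "$demo/bielefeld"
cd "$demo/bielefeld"
DIR="${PWD##*/}:main"   # ${PWD##*/} strips everything up to the last slash
echo "$DIR"   # → bielefeld:main
```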

bielefeld/config/ddc.json

@@ -0,0 +1,35 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "setSpec",
"expression": "grel:filter(row.record.cells[columnName].value,v,v.contains('ddc'))[0].replace('ddc:','')",
"onError": "set-to-blank",
"newColumnName": "ddc",
"columnInsertIndex": 39,
"description": "Create column ddc at index 39 based on column setSpec using expression grel:filter(row.record.cells[columnName].value,v,v.contains('ddc'))[0].replace('ddc:','')"
}
]
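The GREL expression above picks the first setSpec value containing `ddc` and strips the prefix. A rough shell equivalent on sample values (the setSpec list below is made up for illustration):

```shell
# sample setSpec values as they might appear in an OAI header
setspecs='open_access
doc-type:article
ddc:020'
# keep the first value containing "ddc" and strip the "ddc:" prefix,
# mirroring filter(...)[0].replace('ddc:','') from ddc.json
printf '%s\n' "$setspecs" | grep 'ddc' | head -n 1 | sed 's/^ddc://'
# → 020
```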


@@ -0,0 +1,55 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "setSpec",
"expression": "grel:filter(row.record.cells[columnName].value,v,v.contains('doc-type'))[0].replace('doc-type:','')",
"onError": "set-to-blank",
"newColumnName": "doctype",
"columnInsertIndex": 39,
"description": "Create column doctype at index 39 based on column setSpec using expression grel:filter(row.record.cells[columnName].value,v,v.contains('doc-type'))[0].replace('doc-type:','')"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "doctype",
"expression": "value",
"edits": [
{
"from": [
"other"
],
"fromBlank": false,
"fromError": false,
"to": "Other"
}
],
"description": "Mass edit cells in column doctype"
}
]

bielefeld/config/hbz.json

@@ -0,0 +1,84 @@
[
{
"op": "core/column-addition-by-fetching-urls",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "relatedItem - identifier - type",
"expression": "value",
"columnName": "relatedItem - identifier - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "urn",
"l": "urn"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "relatedItem - identifier",
"urlExpression": "grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'",
"onError": "set-to-blank",
"newColumnName": "hbz",
"columnInsertIndex": 13,
"delay": 0,
"cacheResponses": true,
"httpHeadersJson": [
{
"name": "authorization",
"value": ""
},
{
"name": "user-agent",
"value": "OpenRefine 3.4.1 [437dc4d]"
},
{
"name": "accept",
"value": "*/*"
}
],
"description": "Create column hbz at index 13 by fetching URLs based on column relatedItem - identifier using expression grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "relatedItem - identifier - type",
"expression": "value",
"columnName": "relatedItem - identifier - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "urn",
"l": "urn"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "hbz",
"expression": "grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column hbz using expression grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)"
}
]
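The fetch/transform pair above queries lobid-resources and keeps the first member's hbzId. A sketch of the extraction step on a canned response (the JSON below is illustrative, not real lobid output; sed stands in for a JSON parser to keep the sketch dependency-free):

```shell
# canned lobid-resources search response (illustrative)
response='{"totalItems":1,"member":[{"hbzId":"HT012345678","title":"Example"}]}'
# take the first hbzId, mirroring value.parseJson().member[0].hbzId
printf '%s' "$response" | sed -n 's/.*"hbzId":"\([^"]*\)".*/\1/p'
# → HT012345678
```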


@@ -0,0 +1,15 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "record-based"
},
"baseColumnName": "id",
"expression": "grel:row.record.index",
"onError": "set-to-blank",
"newColumnName": "index",
"columnInsertIndex": 1,
"description": "Create column index at index 1 based on column id using expression grel:row.record.index"
}
]


@@ -0,0 +1,25 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "text",
"name": "relatedItem - location - url - displayLabel",
"columnName": "relatedItem - location - url - displayLabel",
"query": "\\.pdf$",
"mode": "regex",
"caseSensitive": false,
"invert": false
}
],
"mode": "row-based"
},
"columnName": "relatedItem - physicalDescription - internetMediaType",
"expression": "grel:'application/pdf'",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column relatedItem - physicalDescription - internetMediaType using expression grel:'application/pdf'"
}
]


@@ -0,0 +1,85 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "titleInfo - title",
"expression": "grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))",
"onError": "set-to-blank",
"newColumnName": "nonsort",
"columnInsertIndex": 27
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
},
{
"type": "list",
"name": "nonsort",
"expression": "isBlank(value)",
"columnName": "nonsort",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "titleInfo - title",
"expression": "grel:value.split(' ').slice(1).join(' ')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10
}
]


@@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "relatedItem - location - url - displayLabel",
"expression": "grel:isNonBlank(filter(row.record.cells[columnName].value,v,v.toLowercase().contains('.pdf')).join(''))",
"columnName": "relatedItem - location - url - displayLabel",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]


@@ -0,0 +1,35 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "name - description - type",
"expression": "value",
"columnName": "name - description - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "orcid",
"l": "orcid"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "name - description",
"expression": "grel:value",
"onError": "set-to-blank",
"newColumnName": "orcid",
"columnInsertIndex": 9,
"description": "Create column orcid at index 9 based on column name - description using expression grel:value"
}
]


@@ -0,0 +1,274 @@
[
{
"op": "core/column-addition-by-fetching-urls",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "id",
"urlExpression": "grel:'https://pub.uni-bielefeld.de/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=' + value",
"onError": "set-to-blank",
"newColumnName": "rights",
"columnInsertIndex": 1,
"delay": 0,
"cacheResponses": true,
"httpHeadersJson": [
{
"name": "authorization",
"value": ""
},
{
"name": "user-agent",
"value": "OpenRefine 3.4.1 [437dc4d]"
},
{
"name": "accept",
"value": "*/*"
}
],
"description": "Create column rights at index 1 by fetching URLs based on column id using expression grel:'https://pub.uni-bielefeld.de/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=' + value"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "rights",
"expression": "grel:forEach(value.parseXml().select('dc|rights'),v,v.xmlText()).join(',')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column rights using expression grel:forEach(value.parseXml().select('dc|rights'),v,v.xmlText()).join(',')"
},
{
"op": "core/multivalued-cell-split",
"columnName": "rights",
"keyColumnName": "id",
"mode": "separator",
"separator": ",",
"regex": false,
"description": "Split multi-valued cells in column rights"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "rights",
"expression": "value",
"columnName": "rights",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "dppl_3_0",
"l": "dppl_3_0"
}
},
{
"v": {
"v": "info:eu-repo/semantics/openAccess",
"l": "info:eu-repo/semantics/openAccess"
}
},
{
"v": {
"v": "cc_0_3_0",
"l": "cc_0_3_0"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "rights",
"expression": "grel:null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column rights using expression grel:null"
},
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "rights",
"expression": "grel:value",
"onError": "set-to-blank",
"newColumnName": "rights_url",
"columnInsertIndex": 2,
"description": "Create column rights_url at index 2 based on column rights using expression grel:value"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "text",
"name": "rights",
"columnName": "rights",
"query": "creativecommons",
"mode": "text",
"caseSensitive": false,
"invert": false
}
],
"mode": "row-based"
},
"columnName": "rights",
"expression": "grel:value.replace('https://','').replace('http://','').replace('creativecommons.org/licenses/','CC ').replace('/',' ').trim().toUppercase()",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column rights using expression grel:value.replace('https://','').replace('http://','').replace('creativecommons.org/licenses/','CC ').replace('/',' ').trim().toUppercase()"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "rights",
"expression": "value",
"edits": [
{
"from": [
"CREATIVECOMMONS.ORG PUBLICDOMAIN ZERO 1.0"
],
"fromBlank": false,
"fromError": false,
"to": "CC0 1.0"
}
],
"description": "Mass edit cells in column rights"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "rights",
"expression": "value",
"edits": [
{
"from": [
"https://opendatacommons.org/licenses/by/summary/index.html"
],
"fromBlank": false,
"fromError": false,
"to": "ODC-By"
}
],
"description": "Mass edit cells in column rights"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "rights",
"expression": "value",
"edits": [
{
"from": [
"https://opendatacommons.org/licenses/odbl/summary/index.html"
],
"fromBlank": false,
"fromError": false,
"to": "ODbL"
}
],
"description": "Mass edit cells in column rights"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "rights",
"expression": "value",
"edits": [
{
"from": [
"https://opendatacommons.org/licenses/pddl/summary/index.html"
],
"fromBlank": false,
"fromError": false,
"to": "PDDL"
}
],
"description": "Mass edit cells in column rights"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "rights",
"expression": "value",
"edits": [
{
"from": [
"https://rightsstatements.org/vocab/InC/1.0/"
],
"fromBlank": false,
"fromError": false,
"to": "Urheberrechtsschutz"
}
],
"description": "Mass edit cells in column rights"
}
]
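Taken together, the operations in this file first shorten Creative Commons URLs into readable labels (the `replace` chain) and then map the remaining known license URLs via mass edits. A rough Python equivalent of the whole chain, for reference; the `normalize_rights` helper and its `edits` table are illustrative, mirroring the JSON above rather than being part of the repository:

```python
def normalize_rights(value: str) -> str:
    """Mimic the OpenRefine chain: shorten CC URLs, then map known URLs to labels."""
    # Equivalent of the text-transform on the facet "creativecommons":
    # strip the scheme, turn the licenses path into a "CC " prefix,
    # replace slashes with spaces, trim, uppercase.
    if "creativecommons" in value.lower():
        value = (value.replace("https://", "").replace("http://", "")
                      .replace("creativecommons.org/licenses/", "CC ")
                      .replace("/", " ").strip().upper())
    # Equivalent of the core/mass-edit operations (from -> to) above.
    edits = {
        "CREATIVECOMMONS.ORG PUBLICDOMAIN ZERO 1.0": "CC0 1.0",
        "https://opendatacommons.org/licenses/by/summary/index.html": "ODC-By",
        "https://opendatacommons.org/licenses/odbl/summary/index.html": "ODbL",
        "https://opendatacommons.org/licenses/pddl/summary/index.html": "PDDL",
        "https://rightsstatements.org/vocab/InC/1.0/": "Urheberrechtsschutz",
    }
    return edits.get(value, value)
```

For example, `https://creativecommons.org/licenses/by-sa/4.0/` becomes `CC BY-SA 4.0`, and the public-domain URL passes through the replace chain first and is then caught by the mass edit, yielding `CC0 1.0`.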


@ -0,0 +1,62 @@
[
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "name - role - roleTerm",
"expression": "value",
"edits": [
{
"from": [
"author"
],
"fromBlank": false,
"fromError": false,
"to": "aut"
}
],
"description": "Mass edit cells in column name - role - roleTerm"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "name - role - roleTerm",
"expression": "value",
"edits": [
{
"from": [
"editor"
],
"fromBlank": false,
"fromError": false,
"to": "edt"
}
],
"description": "Mass edit cells in column name - role - roleTerm"
},
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "name - role - roleTerm",
"expression": "value",
"edits": [
{
"from": [
"supervisor"
],
"fromBlank": false,
"fromError": false,
"to": "dgs"
}
],
"description": "Mass edit cells in column name - role - roleTerm"
}
]
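The three mass edits above normalize free-text role terms to MARC relator codes (`aut`, `edt`, `dgs` for degree supervisor). Expressed as a plain lookup, hypothetically:

```python
# Lookup equivalent of the three core/mass-edit operations above.
ROLE_TO_MARC = {"author": "aut", "editor": "edt", "supervisor": "dgs"}

def map_role(term: str) -> str:
    # Unknown terms pass through unchanged, just as a mass edit
    # leaves cells that match none of its "from" values untouched.
    return ROLE_TO_MARC.get(term, term)
```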


@ -0,0 +1,130 @@
{{
if(row.index - row.record.fromRowIndex == 0,
with(cross(cells['index'].value, 'bielefeld' , 'index'), rows,
'<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">' + '\n' +
' <mets:dmdSec ID="' + 'DMD' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">' + '\n' +
' <mets:xmlData>' + '\n' +
' <mods xmlns="http://www.loc.gov/mods/v3" version="3.7" xmlns:vl="http://visuallibrary.net/vl">' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['titleInfo - title'].value)), r,
' <titleInfo' + forNonBlank(r.cells['titleInfo - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + '\n' +
forNonBlank(r.cells['nonsort'].value, v,
' <nonSort>' + v.escape('xml') + '</nonSort>' + '\n'
, '') +
forNonBlank(r.cells['titleInfo - title'].value, v,
' <title>' + v.escape('xml') + '</title>' + '\n'
, '') +
' </titleInfo>' + '\n'
).join('') +
forEachIndex(rows, i, r, if(r.cells['name - type'].value == 'personal',
' <name type="personal"' + '>' + '\n' +
' <namePart type="' + r.cells['name - namePart - type'].value.escape('xml') + '">' + r.cells['name - namePart'].value.escape('xml') + '</namePart>' + '\n' +
if(and(isBlank(rows[i+1].cells['name - type'].value), isNonBlank(rows[i+1].cells['name - namePart - type'].value)),
' <namePart type="' + rows[i+1].cells['name - namePart - type'].value.escape('xml') + '">' + rows[i+1].cells['name - namePart'].value.escape('xml') + '</namePart>' + '\n'
, '') +
forNonBlank(r.cells['orcid'].value, v,
' <nameIdentifier type="orcid" typeURI="http://orcid.org">' + v.escape('xml') + '</nameIdentifier>' + '\n'
, '') +
forNonBlank(r.cells['name - role - roleTerm'].value, v,
' <role>' + '\n' +
' <roleTerm type="code" authority="marcrelator">' + v.escape('xml') + '</roleTerm>' + '\n' +
' </role>' + '\n'
, '') +
' </name>' + '\n'
, '')).join('') +
' <typeOfResource>text</typeOfResource>' + '\n' +
' <genre authority="dini">' + cells['doctype'].value.escape('xml') + '</genre>' + '\n' +
' <originInfo>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['originInfo - dateIssued'].value)), r,
' <dateIssued encoding="w3cdtf">' + r.cells['originInfo - dateIssued'].value.escape('xml') + '</dateIssued>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['dateOther'].value)), r,
' <dateOther encoding="w3cdtf"' + forNonBlank(r.cells['dateOther - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['dateOther'].value.escape('xml') + '</dateOther>' + '\n'
).join('') +
' </originInfo>' + '\n' +
' <language>' + '\n' +
' <languageTerm type="code" authority="iso639-2b">' + cells['language - languageTerm'].value.escape('xml') + '</languageTerm>' + '\n' +
' </language>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['abstract'].value)), r,
' <abstract' + forNonBlank(r.cells['abstract - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + '>' + r.cells['abstract'].value.escape('xml') + '</abstract>' + '\n'
).join('') +
if(isNonBlank(row.record.cells['subject - topic'].value),
' <subject>' + '\n'
, '') +
forEach(filter(rows, r, isNonBlank(r.cells['subject - topic'].value)), r,
' <topic>' + r.cells['subject - topic'].value.escape('xml') + '</topic>' + '\n'
).join('') +
if(isNonBlank(row.record.cells['subject - topic'].value),
' </subject>' + '\n'
, '') +
forEach(filter(rows, r, isNonBlank(r.cells['ddc'].value)), r,
' <classification authority="ddc">' + r.cells['ddc'].value.escape('xml') + '</classification>' + '\n'
).join('') +
forEachIndex(rows, i, r, if(and(r.cells['relatedItem - type'].value == 'host', r.cells['relatedItem - part - detail - type'].value == 'volume'),
' <relatedItem type="host">' + '\n' +
' <titleInfo>' + '\n' +
' <title>' + r.cells['relatedItem - titleInfo - title'].value.escape('xml') + '</title>' + '\n' +
' </titleInfo>' + '\n' +
' <part>' + '\n' +
' <detail type="volume">' + '\n' +
' <number>' + r.cells['relatedItem - part - detail - number'].value.escape('xml') + '</number>' + '\n' +
' </detail>' + '\n' +
forNonBlank(rows[i+1].cells['relatedItem - part - detail - number'].value, v,
' <detail type="issue">' + '\n' +
' <number>' + v.escape('xml') + '</number>' + '\n' +
' </detail>' + '\n'
, '') +
forNonBlank(r.cells['relatedItem - part - extent'].value.split('-')[0], v,
' <extent unit="page">' + '\n' +
' <start>' + v.escape('xml') + '</start>' + '\n' +
forNonBlank(r.cells['relatedItem - part - extent'].value.split('-')[1], x,
' <end>' + x.escape('xml') + '</end>' + '\n'
, '') +
' </extent>' + '\n'
, '') +
' </part>' + '\n' +
' </relatedItem>' + '\n'
, '')).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['relatedItem - identifier'].value)), r,
' <identifier' + forNonBlank(r.cells['relatedItem - identifier - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['relatedItem - identifier'].value.escape('xml') + '</identifier>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['hbz'].value)), r,
' <identifier type="sys">' + r.cells['hbz'].value.escape('xml') + '</identifier>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['rights_url'].value)), r,
' <accessCondition type="use and reproduction" xlink:href="' + r.cells['rights_url'].value.escape('xml') + '">' + r.cells['rights'].value.escape('xml') + '</accessCondition>' + '\n'
).join('') +
' <recordInfo>' + '\n' +
' <recordIdentifier>' + 'bielefeld_pub_' + cells['id'].value.escape('xml') + '</recordIdentifier>' + '\n' +
' </recordInfo>' + '\n' +
forNonBlank(cells['vldoctype'].value, v,
' <extension>' + '\n' +
' <vl:doctype>' + v.escape('xml') + '</vl:doctype>' + '\n' +
' </extension>' + '\n'
, '') +
' </mods>' + '\n' +
' </mets:xmlData>' + '\n' +
' </mets:mdWrap>' + '\n' +
' </mets:dmdSec>' + '\n' +
' <mets:fileSec>' + '\n' +
forEachIndex(filter(rows, r, and(isNonBlank(r.cells['relatedItem - location - url'].value), r.cells['relatedItem - type'].value == 'constituent')), i, r,
' <mets:fileGrp USE="' + if(r.cells['relatedItem - location - url'].value == filter(row.record.cells['relatedItem - location - url'].value, v, v.toLowercase().contains('.pdf'))[0], 'pdf upload', 'generic file') + '">' + '\n' +
' <mets:file MIMETYPE="' + r.cells['relatedItem - physicalDescription - internetMediaType'].value.escape('xml') + '" ID="FILE' + i + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:FLocat xlink:href="' + r.cells['relatedItem - location - url'].value.escape('xml') + '" LOCTYPE="URL"/>' + '\n' +
' </mets:file>' + '\n' +
' </mets:fileGrp>' + '\n'
).join('') +
' </mets:fileSec>' + '\n' +
' <mets:structMap TYPE="LOGICAL">' + '\n' +
' <mets:div TYPE="document" ID="' + 'bielefeld_pub_' + cells['id'].value.escape('xml') + '" DMDID="' + 'DMD' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:fptr FILEID="' + 'FILE0' + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '"/>' + '\n' +
forEachIndex(filter(rows, r, and(isNonBlank(r.cells['relatedItem - location - url'].value), r.cells['relatedItem - type'].value == 'constituent')).slice(1), i, r,
' <mets:div TYPE="part" ID="' + 'PART' + (i+1) + '_' + cells['id'].value.escape('xml') + '" LABEL="' + r.cells['relatedItem - location - url - displayLabel'].value.escape('xml') + '">' + '\n' +
' <mets:fptr FILEID="' + 'FILE' + (i+1) + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '"/>' + '\n' +
' </mets:div>' + '\n'
).join('') +
' </mets:div>' + '\n' +
' </mets:structMap>' + '\n' +
'</mets:mets>' + '\n'
), '')
}}
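The template above runs once per record (the `row.index - row.record.fromRowIndex == 0` guard) and gathers multi-valued fields from all rows of that record. A minimal Python sketch of this record-oriented pattern, with made-up row data and a much-simplified `<mods>` body; the field names here are illustrative, not the full template:

```python
from itertools import groupby
from xml.sax.saxutils import escape

# Hypothetical flat rows as OpenRefine sees them. For simplicity the id is
# repeated on continuation rows here; in Refine's records mode it would be
# blank there and grouping happens via the record boundary instead.
rows = [
    {"id": "2905380", "title": "Some article", "topic": "history"},
    {"id": "2905380", "title": "", "topic": "sociology"},
]

def record_to_mods(record_rows):
    """One <mods> per record; multi-valued fields come from all rows."""
    first = record_rows[0]
    topics = [r["topic"] for r in record_rows if r["topic"]]
    xml = ["<mods>"]
    xml.append("  <titleInfo><title>%s</title></titleInfo>" % escape(first["title"]))
    xml.append("  <subject>")
    xml += ["    <topic>%s</topic>" % escape(t) for t in topics]
    xml.append("  </subject>")
    xml.append("</mods>")
    return "\n".join(xml)

records = [list(g) for _, g in groupby(rows, key=lambda r: r["id"])]
mods = record_to_mods(records[0])
```

The GREL version achieves the same grouping with `cross(cells['index'].value, 'bielefeld', 'index')` and the `filter(rows, r, isNonBlank(...))` idiom in place of the list comprehensions.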


@ -0,0 +1,35 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "relatedItem - type",
"expression": "value",
"columnName": "relatedItem - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "constituent",
"l": "constituent"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "relatedItem - location - url",
"expression": "grel:'https://' + forEach(value.replace('https://','').split('/'),v,v.escape('url')).join('/')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column relatedItem - location - url using expression grel:'https://' + forEach(value.replace('https://','').split('/'),v,v.escape('url')).join('/')"
}
]
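The expression above percent-encodes each path segment of the constituent URLs while keeping the scheme and the slashes between segments intact. A Python sketch of the same idea, assuming GREL's `escape(…, 'url')` behaves like per-segment percent-encoding:

```python
from urllib.parse import quote

def escape_url_path(url: str) -> str:
    """Strip the https:// scheme, percent-encode each path segment
    separately, and reassemble -- mirroring the GREL expression above."""
    segments = url.replace("https://", "").split("/")
    return "https://" + "/".join(quote(s, safe="") for s in segments)
```

Splitting before encoding is the point: encoding the whole path at once would also escape the `/` separators.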


@ -0,0 +1,15 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"baseColumnName": "doctype",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "vldoctype",
"columnInsertIndex": 3,
"description": "Create column vldoctype"
}
]
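The `with(...)/forEach(...)` construct above is effectively a lookup table from DINI doctype to Visual Library doctype, with unknown values becoming blank via `"onError": "set-to-blank"`. A hypothetical Python rendering (table abbreviated):

```python
# The GREL pair list rewritten as a plain dictionary (excerpt).
DOCTYPE_TO_VL = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "doctoralThesis": "oaDoctoralThesis",
    "masterThesis": "oaMasterThesis",
    "PeriodicalPart": "journal issue",
    # ... remaining pairs exactly as in the GREL expression above
}

def vldoctype(doctype):
    # Unknown values yield None, i.e. a blank cell (set-to-blank).
    return DOCTYPE_TO_VL.get(doctype)
```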


@ -0,0 +1,395 @@
[
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - recordInfo - recordIdentifier",
"newColumnName": "id",
"description": "Rename column Record - metadata - mods - recordInfo - recordIdentifier to id"
},
{
"op": "core/column-move",
"columnName": "id",
"index": 0,
"description": "Move column id to position 0"
},
{
"op": "core/column-removal",
"columnName": "Record - header - identifier",
"description": "Remove column Record - header - identifier"
},
{
"op": "core/column-removal",
"columnName": "Record - header - datestamp",
"description": "Remove column Record - header - datestamp"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - version",
"description": "Remove column Record - metadata - mods - version"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - xsi:schemaLocation",
"description": "Remove column Record - metadata - mods - xsi:schemaLocation"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - name - role - roleTerm - type",
"description": "Remove column Record - metadata - mods - name - role - roleTerm - type"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - name - description - xsi:type",
"description": "Remove column Record - metadata - mods - name - description - xsi:type"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - relatedItem - accessCondition",
"description": "Remove column Record - metadata - mods - relatedItem - accessCondition"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - relatedItem - accessCondition - type",
"description": "Remove column Record - metadata - mods - relatedItem - accessCondition - type"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - apa",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - apa"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - ama",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - ama"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - mla",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - mla"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - ieee",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - ieee"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - dgps",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - dgps"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - bio1",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - bio1"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - wels",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - wels"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - lncs",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - lncs"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - chicago",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - chicago"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - default",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - default"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - harvard1",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - harvard1"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - frontiers",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - frontiers"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - apa_indent",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - apa_indent"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - angewandte-chemie",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - angewandte-chemie"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - extension - bibliographicCitation - aps",
"description": "Remove column Record - metadata - mods - extension - bibliographicCitation - aps"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - originInfo - dateIssued - encoding",
"description": "Remove column Record - metadata - mods - originInfo - dateIssued - encoding"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - originInfo - place - placeTerm - type",
"description": "Remove column Record - metadata - mods - originInfo - place - placeTerm - type"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - recordInfo - recordChangeDate",
"description": "Remove column Record - metadata - mods - recordInfo - recordChangeDate"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - recordInfo - recordChangeDate - encoding",
"description": "Remove column Record - metadata - mods - recordInfo - recordChangeDate - encoding"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - recordInfo - recordCreationDate",
"description": "Remove column Record - metadata - mods - recordInfo - recordCreationDate"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - recordInfo - recordCreationDate - encoding",
"description": "Remove column Record - metadata - mods - recordInfo - recordCreationDate - encoding"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - language - languageTerm - type",
"description": "Remove column Record - metadata - mods - language - languageTerm - type"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - language - languageTerm - authority",
"description": "Remove column Record - metadata - mods - language - languageTerm - authority"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - dateOther - encoding",
"description": "Remove column Record - metadata - mods - dateOther - encoding"
},
{
"op": "core/column-removal",
"columnName": "Record - metadata - mods - targetAudience",
"description": "Remove column Record - metadata - mods - targetAudience"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - type",
"newColumnName": "name - type",
"description": "Rename column Record - metadata - mods - name - type to name - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - namePart",
"newColumnName": "name - namePart",
"description": "Rename column Record - metadata - mods - name - namePart to name - namePart"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - namePart - type",
"newColumnName": "name - namePart - type",
"description": "Rename column Record - metadata - mods - name - namePart - type to name - namePart - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - role - roleTerm",
"newColumnName": "name - role - roleTerm",
"description": "Rename column Record - metadata - mods - name - role - roleTerm to name - role - roleTerm"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - identifier",
"newColumnName": "name - identifier",
"description": "Rename column Record - metadata - mods - name - identifier to name - identifier"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - identifier - type",
"newColumnName": "name - identifier - type",
"description": "Rename column Record - metadata - mods - name - identifier - type to name - identifier - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - description",
"newColumnName": "name - description",
"description": "Rename column Record - metadata - mods - name - description to name - description"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - name - description - type",
"newColumnName": "name - description - type",
"description": "Rename column Record - metadata - mods - name - description - type to name - description - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - type",
"newColumnName": "relatedItem - type",
"description": "Rename column Record - metadata - mods - relatedItem - type to relatedItem - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - identifier",
"newColumnName": "relatedItem - identifier",
"description": "Rename column Record - metadata - mods - relatedItem - identifier to relatedItem - identifier"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - identifier - type",
"newColumnName": "relatedItem - identifier - type",
"description": "Rename column Record - metadata - mods - relatedItem - identifier - type to relatedItem - identifier - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - location - url",
"newColumnName": "relatedItem - location - url",
"description": "Rename column Record - metadata - mods - relatedItem - location - url to relatedItem - location - url"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - location - url - displayLabel",
"newColumnName": "relatedItem - location - url - displayLabel",
"description": "Rename column Record - metadata - mods - relatedItem - location - url - displayLabel to relatedItem - location - url - displayLabel"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - physicalDescription - internetMediaType",
"newColumnName": "relatedItem - physicalDescription - internetMediaType",
"description": "Rename column Record - metadata - mods - relatedItem - physicalDescription - internetMediaType to relatedItem - physicalDescription - internetMediaType"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - part - detail - type",
"newColumnName": "relatedItem - part - detail - type",
"description": "Rename column Record - metadata - mods - relatedItem - part - detail - type to relatedItem - part - detail - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - part - detail - number",
"newColumnName": "relatedItem - part - detail - number",
"description": "Rename column Record - metadata - mods - relatedItem - part - detail - number to relatedItem - part - detail - number"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - part - extent",
"newColumnName": "relatedItem - part - extent",
"description": "Rename column Record - metadata - mods - relatedItem - part - extent to relatedItem - part - extent"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - part - extent - unit",
"newColumnName": "relatedItem - part - extent - unit",
"description": "Rename column Record - metadata - mods - relatedItem - part - extent - unit to relatedItem - part - extent - unit"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - relatedItem - titleInfo - title",
"newColumnName": "relatedItem - titleInfo - title",
"description": "Rename column Record - metadata - mods - relatedItem - titleInfo - title to relatedItem - titleInfo - title"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - subject - topic",
"newColumnName": "subject - topic",
"description": "Rename column Record - metadata - mods - subject - topic to subject - topic"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - note",
"newColumnName": "note",
"description": "Rename column Record - metadata - mods - note to note"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - note - type",
"newColumnName": "note - type",
"description": "Rename column Record - metadata - mods - note - type to note - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - titleInfo - type",
"newColumnName": "titleInfo - type",
"description": "Rename column Record - metadata - mods - titleInfo - type to titleInfo - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - titleInfo - title",
"newColumnName": "titleInfo - title",
"description": "Rename column Record - metadata - mods - titleInfo - title to titleInfo - title"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - genre",
"newColumnName": "genre",
"description": "Rename column Record - metadata - mods - genre to genre"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - originInfo - dateIssued",
"newColumnName": "originInfo - dateIssued",
"description": "Rename column Record - metadata - mods - originInfo - dateIssued to originInfo - dateIssued"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - originInfo - publisher",
"newColumnName": "originInfo - publisher",
"description": "Rename column Record - metadata - mods - originInfo - publisher to originInfo - publisher"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - originInfo - place - placeTerm",
"newColumnName": "originInfo - place - placeTerm",
"description": "Rename column Record - metadata - mods - originInfo - place - placeTerm to originInfo - place - placeTerm"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - language - languageTerm",
"newColumnName": "language - languageTerm",
"description": "Rename column Record - metadata - mods - language - languageTerm to language - languageTerm"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - abstract",
"newColumnName": "abstract",
"description": "Rename column Record - metadata - mods - abstract to abstract"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - abstract - lang",
"newColumnName": "abstract - lang",
"description": "Rename column Record - metadata - mods - abstract - lang to abstract - lang"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - dateOther",
"newColumnName": "dateOther",
"description": "Rename column Record - metadata - mods - dateOther to dateOther"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - dateOther - type",
"newColumnName": "dateOther - type",
"description": "Rename column Record - metadata - mods - dateOther - type to dateOther - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - accessCondition",
"newColumnName": "accessCondition",
"description": "Rename column Record - metadata - mods - accessCondition to accessCondition"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - metadata - mods - accessCondition - type",
"newColumnName": "accessCondition - type",
"description": "Rename column Record - metadata - mods - accessCondition - type to accessCondition - type"
},
{
"op": "core/column-rename",
"oldColumnName": "Record - header - setSpec",
"newColumnName": "setSpec",
"description": "Rename column Record - header - setSpec to setSpec"
}
]
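Nearly all of the renames above follow one rule: strip the common `Record - metadata - mods - ` prefix. An operation list like this could be generated instead of written by hand; a sketch, where the `rename_op` helper is hypothetical and not part of the repository:

```python
PREFIX = "Record - metadata - mods - "

def rename_op(old: str) -> dict:
    """Build one core/column-rename operation by stripping the common prefix."""
    new = old[len(PREFIX):] if old.startswith(PREFIX) else old
    return {
        "op": "core/column-rename",
        "oldColumnName": old,
        "newColumnName": new,
        "description": f"Rename column {old} to {new}",
    }

ops = [rename_op("Record - metadata - mods - titleInfo - title")]
```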


@ -3,17 +3,29 @@ wuppertal[elpub.bib.uni-wuppertal.de] --- metha_wuppertal
click wuppertal "http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider?verb=ListRecords&metadataPrefix=oai_dc" _blank
siegen[dspace.ub.uni-siegen.de] --- metha_siegen
click siegen "https://dspace.ub.uni-siegen.de/oai/request?verb=ListRecords&metadataPrefix=xMetaDissPlus" _blank
muenster[miami.uni-muenster.de] --- metha_muenster
click muenster "https://repositorium.uni-muenster.de/oai/miami?verb=ListRecords&metadataPrefix=mets" _blank
bielefeld[pub.uni-bielefeld.de] --- metha_bielefeld
click bielefeld "https://pub.uni-bielefeld.de/oai?verb=ListRecords&metadataPrefix=mods&set=open_access" _blank
subgraph Harvesting
metha_wuppertal["fa:fa-cogs metha"]
metha_siegen["fa:fa-cogs metha"]
metha_muenster["fa:fa-cogs metha"]
metha_bielefeld["fa:fa-cogs metha"]
end
subgraph Transformation
metha_wuppertal -->|Dublin Core| refine_wuppertal[fa:fa-cogs OpenRefine]
metha_siegen -->|xMetaDissPlus| refine_siegen[fa:fa-cogs OpenRefine]
metha_muenster -->|METS/MODS| refine_muenster[fa:fa-cogs OpenRefine]
metha_bielefeld -->|MODS| refine_bielefeld[fa:fa-cogs OpenRefine]
end
subgraph OAI-PMH Data Provider
refine_wuppertal -->|METS/MODS| oai_wuppertal["noah.opencultureconsulting.com/ubw/"]
click oai_wuppertal "https://noah.opencultureconsulting.com/ubw/?verb=ListRecords&metadataPrefix=mets" _blank
refine_siegen -->|METS/MODS| oai_siegen["noah.opencultureconsulting.com/ubs/"]
click oai_siegen "https://noah.opencultureconsulting.com/ubs/?verb=ListRecords&metadataPrefix=mets" _blank
refine_muenster -->|METS/MODS| oai_muenster["noah.opencultureconsulting.com/ulbm/"]
click oai_muenster "https://noah.opencultureconsulting.com/ulbm/?verb=ListRecords&metadataPrefix=mets" _blank
refine_bielefeld -->|METS/MODS| oai_bielefeld["noah.opencultureconsulting.com/ubb/"]
click oai_bielefeld "https://noah.opencultureconsulting.com/ubb/?verb=ListRecords&metadataPrefix=mets" _blank
end

File diff suppressed because one or more lines are too long

(binary image changed: 15 KiB before, 28 KiB after)

muenster/Taskfile.yml

@ -0,0 +1,147 @@
version: '3'
tasks:
main:
desc: miami ULB Münster
vars:
MINIMUM: 6600 # minimum number of records expected in the harvest
PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: harvest
- task: refine
# The tasks below that start with ":" are defined once for all data sources in the root Taskfile.yml
- task: :check
vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
- task: :split
vars: {PROJECT: '{{.PROJECT}}'}
- task: :validate
vars: {PROJECT: '{{.PROJECT}}'}
- task: :zip
vars: {PROJECT: '{{.PROJECT}}'}
- task: :diff
vars: {PROJECT: '{{.PROJECT}}'}
harvest:
dir: ./{{.PROJECT}}/harvest
vars:
URL: http://repositorium.uni-muenster.de/oai/miami
FORMAT: mets
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
- METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml
refine:
dir: ./{{.PROJECT}}
vars:
PORT: 3336 # assign a different port for each project
RAM: 4G # maximum RAM for OpenRefine java heap space
PROJECT: '{{splitList ":" .TASK | first}}'
LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
cmds:
- mkdir -p refine
- task: :start # launch OpenRefine
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # Import (requires an absolute path to the XML file)
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m harvest/{{.PROJECT}}.xml)"
--recordPath Records --recordPath Record --recordPath metadata --recordPath mets:mets
--storeEmptyStrings false --trimStrings true
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # Preprocessing: move the identifier into the first column, id
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/vorverarbeitung.json
> {{.LOG}}
- > # Remove older entries with the same identifier (by mets:metsHdr - CREATEDATE)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/duplicates.json
> {{.LOG}}
- > # Delete aggregations (these records are referenced by subordinate works via relatedItem)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/ohne-aggregationen.json
> {{.LOG}}
- > # Delete records without a direct link to a PDF
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nur-mit-pdf.json
> {{.LOG}}
- > # Remove the separate download link when only one file is present
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/flocat.json
> {{.LOG}}
- > # For now, delete records containing more than one direct link https://github.com/opencultureconsulting/noah/issues/25
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nur-ein-direktlink.json
> {{.LOG}}
- > # For now, delete journal issues https://github.com/opencultureconsulting/noah/issues/31
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/keine-zeitschriftenhefte.json
> {{.LOG}}
- > # Datensätze mit "restriction on access" löschen
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/restriction.json
> {{.LOG}}
- > # Index: generate column index from row.record.index
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/index.json
> {{.LOG}}
- > # Sorting: mods:nonSort for the first element in mods:title
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nonsort.json
> {{.LOG}}
- > # Visual Library doctype from mods:genre
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/doctype.json
> {{.LOG}}
- > # Strip HTML codes from abstracts and delete abstracts without a language attribute
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/abstract.json
> {{.LOG}}
- > # Make mets:file - ID unique to avoid validation errors
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/file-id.json
> {{.LOG}}
- > # Partially filter mods:note type
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/note.json
> {{.LOG}}
- > # Enrich HT number via lobid-resources: OR search when there are multiple URNs; if there are multiple hits, only the first is used
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/hbz.json
> {{.LOG}}
- | # Export to METS:MODS with templating
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- harvest/{{.PROJECT}}.xml
- config/**
generates:
- refine/{{.PROJECT}}.openrefine.tar.gz
- refine/{{.PROJECT}}.txt
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
linkcheck:
desc: check links for miami ULB Münster
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :linkcheck
vars: {PROJECT: '{{.PROJECT}}'}
delete:
desc: delete cache for miami ULB Münster
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :delete
vars: {PROJECT: '{{.PROJECT}}'}
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"



@ -0,0 +1,81 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "0",
"l": "0"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract",
"expression": "null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract using expression null"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "0",
"l": "0"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
"expression": "null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang using expression null"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract",
"expression": "grel:value.parseHtml().htmlText().trim()",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract using expression grel:value.parseHtml().htmlText().trim()"
}
]
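The last transform above (config/abstract.json) applies `value.parseHtml().htmlText().trim()` to strip HTML markup from abstracts. A rough stdlib-only Python equivalent, assuming GREL's `htmlText()` simply concatenates the text nodes (whitespace handling may differ):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    # parseHtml().htmlText().trim() analogue
    parser = _TextExtractor()
    parser.feed(text)
    return "".join(parser.parts).strip()
```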


@ -0,0 +1,34 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "doctype",
"columnInsertIndex": 20
}
]
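The `with(...)/forEach(...)` expression above (config/doctype.json) is effectively a lookup table from `mods:genre` values to Visual Library doctypes. Sketched in Python, with the pairs copied from the GREL expression; unmapped genres yield no doctype (the GREL version produces an empty string there):

```python
# genre -> Visual Library doctype, as listed in the GREL expression
DOCTYPES = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "conferenceObject": "conference_object",
    "CourseMaterial": "course_material",
    "doctoralThesis": "oaDoctoralThesis",
    "lecture": "lecture",
    "Manuscript": "handwritten",
    "masterThesis": "oaMasterThesis",
    "MusicalNotation": "notated music",
    "PeriodicalPart": "journal issue",
    "preprint": "oaPreprint",
    "report": "oaBdArticle",
    "ResearchData": "research_data",
    "review": "review",
    "StudyThesis": "oaStudyThesis",
    "Other": "oaBdOther",
    "workingPaper": "working_paper",
}

def to_doctype(genre):
    # None for genres outside the table
    return DOCTYPES.get(genre)
```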


@ -0,0 +1,59 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "record-based"
},
"columnName": "mets:mets - mets:metsHdr - CREATEDATE",
"expression": "value.toDate()",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:metsHdr - CREATEDATE using expression value.toDate()"
},
{
"op": "core/row-reorder",
"mode": "record-based",
"sorting": {
"criteria": [
{
"valueType": "date",
"column": "mets:mets - mets:metsHdr - CREATEDATE",
"blankPosition": 2,
"errorPosition": 1,
"reverse": false
}
]
},
"description": "Reorder rows"
},
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "grel:with(value.cross('muenster', columnName), rows, if(rows.length() > 1, if(rows.index.sort()[-1] > row.index, 'is duplicate of a higher row number', 'has duplicate(s) with lower row number'), 'unique'))",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "is duplicate of a higher row number",
"l": "is duplicate of a higher row number"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]
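The three operations above (config/duplicates.json) sort records by `mets:metsHdr - CREATEDATE` and then use `value.cross(...)` to remove every record whose `id` also occurs on a record with a newer creation date. A hypothetical Python sketch of the same deduplication, assuming each record is a dict with `id` and `createdate` fields whose string values sort chronologically:

```python
def drop_older_duplicates(records):
    # keep only the newest record per id, preserving input order
    newest = {}
    for rec in records:
        seen = newest.get(rec["id"])
        if seen is None or rec["createdate"] > seen["createdate"]:
            newest[rec["id"]] = rec
    return [r for r in records if newest[r["id"]] is r]
```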


@ -0,0 +1,35 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
"expression": "isBlank(value)",
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
"expression": "grel:'FILE_' + row.record.cells['id'].value[0].split(':')[-1] + '_' + (row.index - row.record.fromRowIndex + 1)",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID using expression grel:'FILE_' + row.record.cells['id'].value[0].split(':')[-1] + '_' + (row.index - row.record.fromRowIndex + 1)"
}
]
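The transform above (config/file-id.json) rebuilds `mets:file - ID` as `FILE_<id suffix>_<position>` so the IDs are unique within each record. The same construction in Python, assuming the last colon-separated segment of the OAI identifier and a 0-based row offset within the record:

```python
def make_file_id(record_id, position):
    # 'FILE_' + last id segment + '_' + 1-based position,
    # as in the GREL expression above
    return "FILE_" + record_id.split(":")[-1] + "_" + str(position + 1)
```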


@ -0,0 +1,54 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:structMap - mets:div - mets:div - ID",
"expression": "grel:row.record.cells[columnName].value.length()",
"columnName": "mets:mets - mets:structMap - mets:div - mets:div - ID",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": 2,
"l": "2"
}
}
],
"selectBlank": false,
"selectError": false
},
{
"type": "list",
"name": "mets:mets - mets:fileSec - mets:fileGrp - USE",
"expression": "value",
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - USE",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "DOWNLOAD",
"l": "DOWNLOAD"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
"expression": "grel:null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href using expression grel:null"
}
]

84
muenster/config/hbz.json Normal file

@ -0,0 +1,84 @@
[
{
"op": "core/column-addition-by-fetching-urls",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "urn",
"l": "urn"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier",
"urlExpression": "grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'",
"onError": "set-to-blank",
"newColumnName": "hbz",
"columnInsertIndex": 37,
"delay": 0,
"cacheResponses": true,
"httpHeadersJson": [
{
"name": "authorization",
"value": ""
},
{
"name": "user-agent",
"value": "OpenRefine 3.4.1 [437dc4d]"
},
{
"name": "accept",
"value": "*/*"
}
],
"description": "Create column hbz at index 37 by fetching URLs based on column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier using expression grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "urn",
"l": "urn"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "hbz",
"expression": "grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column hbz using expression grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)"
}
]
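The two operations above (config/hbz.json) fetch a lobid-resources phrase search for each URN and keep only the first hit's `hbzId`. A sketch of the URL construction and response parsing; the URL-encoding is added here for safety, and the sample JSON in the test is made up:

```python
import json
from urllib.parse import quote

def lobid_query_url(urn):
    # phrase search for the URN, as in the urlExpression above
    return "https://lobid.org/resources/search?q=" + quote('urn:"' + urn + '"')

def extract_hbz_id(response_text):
    # forNonBlank(value.parseJson().member[0].hbzId, v, v, null):
    # take the first hit only, or nothing
    member = json.loads(response_text).get("member") or []
    return member[0].get("hbzId") if member else None
```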


@ -0,0 +1,15 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [],
"mode": "record-based"
},
"baseColumnName": "id",
"expression": "grel:row.record.index",
"onError": "set-to-blank",
"newColumnName": "index",
"columnInsertIndex": 1,
"description": "Create column index at index 1 based on column id using expression grel:row.record.index"
}
]


@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "PeriodicalPart",
"l": "PeriodicalPart"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]


@ -0,0 +1,87 @@
[
{
"op": "core/column-addition",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title",
"expression": "grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))",
"onError": "set-to-blank",
"newColumnName": "nonsort",
"columnInsertIndex": 43,
"description": "Create column nonsort at index 43 based on column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title using expression grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "id",
"expression": "isBlank(value)",
"columnName": "id",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
},
{
"type": "list",
"name": "nonsort",
"expression": "isBlank(value)",
"columnName": "nonsort",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title",
"expression": "grel:value.split(' ').slice(1).join(' ')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title using expression grel:value.split(' ').slice(1).join(' ')"
}
]
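The two operations above (config/nonsort.json) move a leading German or English article from `mods:title` into a separate `nonsort` column. Equivalent logic in Python, returning the `(nonSort, title)` pair:

```python
# leading articles recognised by the GREL expression
ARTICLES = {"a", "das", "dem", "den", "der", "des", "die",
            "ein", "eine", "einem", "einen", "einer", "eines", "the"}

def split_nonsort(title):
    first, _, rest = title.partition(" ")
    if first.lower() in ARTICLES:
        # keep the trailing space, as the GREL expression does
        return first + " ", rest
    return "", title
```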

67
muenster/config/note.json Normal file

@ -0,0 +1,67 @@
[
{
"op": "core/mass-edit",
"engineConfig": {
"facets": [],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
"expression": "value",
"edits": [
{
"from": [
"thesis"
],
"fromBlank": false,
"fromError": false,
"to": "thesis statement"
}
],
"description": "Mass edit cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
"invert": true,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "citation/reference",
"l": "citation/reference"
}
},
{
"v": {
"v": "ownership",
"l": "ownership"
}
},
{
"v": {
"v": "thesis statement",
"l": "thesis statement"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "row-based"
},
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
"expression": "grel:null",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type using expression grel:null"
}
]
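The mass edit and transform above (config/note.json) first normalise the note type `thesis` to `thesis statement` and then blank every `mods:note` type outside a small whitelist (the inverted facet). The same rule as a Python function, with `None` standing in for a blanked cell:

```python
# note types that survive the inverted facet above
KEEP_NOTE_TYPES = {"citation/reference", "ownership", "thesis statement"}

def clean_note_type(note_type):
    if note_type == "thesis":
        note_type = "thesis statement"  # the mass edit
    # everything outside the whitelist is blanked
    return note_type if note_type in KEEP_NOTE_TYPES else None
```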


@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
"expression": "grel:with(row.record.cells[columnName].value, x, and(x.length() == 1, x[0].toLowercase().contains('.pdf')))",
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]


@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
"expression": "grel:row.record.cells[columnName].value.join('').toLowercase().contains('.pdf')",
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]
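This row removal keeps only records whose `xlink:href` values contain at least one PDF link (`join('').toLowercase().contains('.pdf')` over the record's cells). As a Python predicate over a record's list of hrefs (the example URLs are invented):

```python
def has_pdf_link(hrefs):
    # case-insensitive check across all hrefs of the record
    return ".pdf" in "".join(hrefs).lower()
```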


@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
"expression": "grel:isBlank(row.record.cells[columnName].value.join(''))",
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": true,
"l": "true"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]


@ -0,0 +1,30 @@
[
{
"op": "core/row-removal",
"engineConfig": {
"facets": [
{
"type": "list",
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type",
"expression": "value",
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type",
"invert": false,
"omitBlank": false,
"omitError": false,
"selection": [
{
"v": {
"v": "restriction on access",
"l": "restriction on access"
}
}
],
"selectBlank": false,
"selectError": false
}
],
"mode": "record-based"
},
"description": "Remove rows"
}
]


@ -0,0 +1,138 @@
{{
if(row.index - row.record.fromRowIndex == 0,
with(cross(cells['index'].value, 'muenster' , 'index'), rows,
'<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">' + '\n' +
' <mets:dmdSec ID="' + cells['mets:mets - mets:dmdSec - ID'].value.escape('xml') + '">' + '\n' +
' <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">' + '\n' +
' <mets:xmlData>' + '\n' +
' <mods xmlns="http://www.loc.gov/mods/v3" version="3.7" xmlns:vl="http://visuallibrary.net/vl">' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title'].value)), r,
' <titleInfo' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - type'].value.replace('uniform', ''), v, ' type="' + v.escape('xml') + '"', '') + '>' + '\n' +
forNonBlank(r.cells['nonsort'].value, v,
' <nonSort>' + v.escape('xml') + '</nonSort>' + '\n'
, '') +
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title'].value, v,
' <title>' + v.escape('xml') + '</title>' + '\n'
, '') +
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:subTitle'].value, v,
' <subTitle>' + v.escape('xml') + '</subTitle>' + '\n'
, '') +
' </titleInfo>' + '\n'
).join('') +
forEachIndex(rows, i, r, if(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - type'].value == 'personal',
' <name type="personal"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - valueURI'].value, v, ' authority="gnd" authorityURI="http://d-nb.info/gnd/" valueURI="' + v.escape('xml') + '"', '') + '>' + '\n' +
' <displayForm>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:displayForm'].value.escape('xml') + '</displayForm>' + '\n' +
' <namePart type="' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value.escape('xml') + '">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart'].value.escape('xml') + '</namePart>' + '\n' +
if(and(isBlank(rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - type'].value), isNonBlank(rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value)),
' <namePart type="' + rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value.escape('xml') + '">' + rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart'].value.escape('xml') + '</namePart>' + '\n'
, '') +
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:role - mods:roleTerm'].value, v,
' <role>' + '\n' +
' <roleTerm type="code" authority="marcrelator">' + v.escape('xml') + '</roleTerm>' + '\n' +
' </role>' + '\n'
, '') +
' </name>' + '\n'
, '')).join('') +
' <typeOfResource>text</typeOfResource>' + '\n' +
' <genre authority="dini">' + cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre'].value.escape('xml') + '</genre>' + '\n' +
' <originInfo>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued'].value)), r,
' <dateIssued encoding="w3cdtf"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued - keyDate'].value, v, ' keyDate="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued'].value.escape('xml') + '</dateIssued>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther'].value)), r,
' <dateOther encoding="w3cdtf"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther - keyDate'].value, v, ' keyDate="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther'].value.escape('xml') + '</dateOther>' + '\n'
).join('') +
' </originInfo>' + '\n' +
' <language>' + '\n' +
' <languageTerm type="code" authority="iso639-2b">' + cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:language - mods:languageTerm'].value.escape('xml') + '</languageTerm>' + '\n' +
' </language>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract'].value)), r,
' <abstract' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract'].value.escape('xml') + '</abstract>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note'].value)), r,
' <note' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note'].value.escape('xml') + '</note>' + '\n'
).join('') +
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('ger'),
' <subject lang="ger">' + '\n'
, '') +
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value == 'ger'), r,
forEach(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic'].value.split(';'), v,
' <topic>' + v.trim().escape('xml') + '</topic>' + '\n'
).join('')
).join('') +
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('ger'),
' </subject>' + '\n'
, '') +
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('eng'),
' <subject lang="eng">' + '\n'
, '') +
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value == 'eng'), r,
forEach(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic'].value.split(';'), v,
' <topic>' + v.trim().escape('xml') + '</topic>' + '\n'
).join('')
).join('') +
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('eng'),
' </subject>' + '\n'
, '') +
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:classification - authority'].value == 'ddc'), r,
' <classification authority="ddc">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:classification'].value.escape('xml') + '</classification>' + '\n'
).join('') +
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - type'].value == 'host'), r,
' <relatedItem type="host">' + '\n' +
' <titleInfo>' + '\n' +
' <title>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:titleInfo - mods:title'].value.escape('xml') + '</title>' + '\n' +
' </titleInfo>' + '\n' +
' <part>' + '\n' +
' <detail type="issue">' + '\n' +
' <number>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:titleInfo - mods:title'].value.escape('xml') + '</number>' + '\n' +
' </detail>' + '\n' +
' <extent unit="page">' + '\n' +
' <start>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:part - mods:extent - mods:start'].value.escape('xml') + '</start>' + '\n' +
' <end>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:part - mods:extent - mods:end'].value.escape('xml') + '</end>' + '\n' +
' </extent>' + '\n' +
' </part>' + '\n' +
' </relatedItem>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type'].value)), r,
' <identifier' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier'].value.escape('xml') + '</identifier>' + '\n'
).join('') +
forNonBlank(cells['hbz'].value, v,
' <identifier type="sys">' + v.escape('xml') + '</identifier>' + '\n'
, '') +
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type'].value)), r,
' <accessCondition type="use and reproduction" xlink:href="' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - mods:extension - ma:maWrap - ma:licence - ma:targetUrl'].value.escape('xml') + '">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - mods:extension - ma:maWrap - ma:licence - ma:displayLabel'].value.replace('InC 1.0', 'Urheberrechtsschutz').escape('xml') + '</accessCondition>' + '\n'
).join('') +
' <recordInfo>' + '\n' +
' <recordIdentifier>' + 'muenster_miami_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '</recordIdentifier>' + '\n' +
' </recordInfo>' + '\n' +
forNonBlank(cells['doctype'].value, v,
' <extension>' + '\n' +
' <vl:doctype>' + v.escape('xml') + '</vl:doctype>' + '\n' +
' </extension>' + '\n'
, '') +
' </mods>' + '\n' +
' </mets:xmlData>' + '\n' +
' </mets:mdWrap>' + '\n' +
' </mets:dmdSec>' + '\n' +
' <mets:fileSec>' + '\n' +
forEachIndex(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value)), i, r,
' <mets:fileGrp USE="' + if(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value == filter(row.record.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value, v, v.toLowercase().contains('.pdf'))[0], 'pdf upload', 'generic file') + '">' + '\n' +
' <mets:file MIMETYPE="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - MIMETYPE'].value.escape('xml') + '" ID="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '">' + '\n' +
' <mets:FLocat xlink:href="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value.escape('xml') + '" LOCTYPE="URL"/>' + '\n' +
' </mets:file>' + '\n' +
' </mets:fileGrp>' + '\n'
).join('') +
' </mets:fileSec>' + '\n' +
' <mets:structMap TYPE="LOGICAL">' + '\n' +
' <mets:div TYPE="document" ID="' + 'muenster_miami_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '" DMDID="' + cells['mets:mets - mets:dmdSec - ID'].value.escape('xml') + '">' + '\n' +
' <mets:fptr FILEID="' + cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '"/>' + '\n' +
forEachIndex(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value)).slice(1), i, r,
' <mets:div TYPE="part" ID="' + 'PART' + (i+1) + '_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '" LABEL="' + if(r.cells['mets:mets - mets:fileSec - mets:fileGrp - USE'].value == 'DOWNLOAD', 'Download ZIP-Archiv (mit allen Dateien)' , r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value.split('/').reverse()[0].escape('xml')) + '">' + '\n' +
' <mets:fptr FILEID="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '"/>' + '\n' +
' </mets:div>' + '\n'
).join('') +
' </mets:div>' + '\n' +
' </mets:structMap>' + '\n' +
'</mets:mets>' + '\n'
), '')
}}
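In the METS template above, the `USE` attribute is set by comparing each file link with the first link in the record whose URL contains `.pdf` (matched case-insensitively): that file becomes the `pdf upload` group, everything else `generic file`. A rough Python sketch of the same classification, using hypothetical URLs:

```python
def classify_files(hrefs):
    """Label the first PDF link 'pdf upload' and all other links 'generic file',
    mirroring the GREL filter(...)[0] comparison in the template above."""
    pdfs = [h for h in hrefs if ".pdf" in h.lower()]
    first_pdf = pdfs[0] if pdfs else None
    return ["pdf upload" if h == first_pdf else "generic file" for h in hrefs]

labels = classify_files([
    "https://example.org/files/data.zip",
    "https://example.org/files/thesis.PDF",  # matched despite upper case
    "https://example.org/files/appendix.pdf",
])
```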

@@ -0,0 +1,14 @@
[
{
"op": "core/column-rename",
"oldColumnName": "mets:mets - OBJID",
"newColumnName": "id",
"description": "Rename column mets:mets - OBJID to id"
},
{
"op": "core/column-move",
"columnName": "id",
"index": 0,
"description": "Move column id to position 0"
}
]

siegen/Taskfile.yml Normal file
@@ -0,0 +1,141 @@
version: '3'
tasks:
main:
desc: OPUS Siegen
vars:
MINIMUM: 1250 # minimum number of records expected
PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: harvest
- task: refine
# The following tasks prefixed with ":" are defined in Taskfile.yml and shared by all data sources
- task: :check
vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
- task: :split
vars: {PROJECT: '{{.PROJECT}}'}
- task: :validate
vars: {PROJECT: '{{.PROJECT}}'}
- task: :zip
vars: {PROJECT: '{{.PROJECT}}'}
- task: :diff
vars: {PROJECT: '{{.PROJECT}}'}
harvest:
dir: ./{{.PROJECT}}/harvest
vars:
URL: https://dspace.ub.uni-siegen.de/oai/request
FORMAT: xMetaDissPlus
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
- METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml
refine:
dir: ./{{.PROJECT}}
vars:
PORT: 3334 # assign a different port for each project
RAM: 4G # maximum RAM for OpenRefine java heap space
PROJECT: '{{splitList ":" .TASK | first}}'
LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
cmds:
- mkdir -p refine
- task: :start # launch OpenRefine
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # Import (requires an absolute path to the XML file)
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m harvest/{{.PROJECT}}.xml)"
--recordPath Records --recordPath Record
--storeEmptyStrings false --trimStrings true
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # Preprocessing: identifier into the first column; delete columns that are not needed (no distinguishing features); rename the remaining columns (strip the path)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/vorverarbeitung.json
> {{.LOG}}
- > # Extract URNs: remove duplicates and merge the different URNs
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/urn.json
> {{.LOG}}
- > # Add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS Flocat from it
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/direktlinks.json
> {{.LOG}}
- > # Delete records without a direct link to a PDF
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nur-mit-pdf.json
> {{.LOG}}
- > # Split dc:subject into ddc and topic
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/ddc-topic.json
> {{.LOG}}
- > # Standardized rights statements (canonical name from CC links in dc:rights)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/cc.json
> {{.LOG}}
- > # Derive the Internet media type from ddb:transfer: mapping maintained manually following Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/mime.json
> {{.LOG}}
- > # Add DOIs from the OAI_DC format: additionally query the DC format for every record and extract dc:identifier entries of type doi
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/doi.json
> {{.LOG}}
- > # Enrich HT numbers via lobid-resources: OR search when there are several URNs; if there are several hits, only the first one is used
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/hbz.json
> {{.LOG}}
- > # mods:nonSort sort handling for the first element in dc:title
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nonsort.json
> {{.LOG}}
- > # Extract DINI publication types from dc:type
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/dini.json
> {{.LOG}}
- > # Visual Library doctype from dc:type: if thesis:level == thesis.habilitation then doctype oaHabil
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/doctype.json
> {{.LOG}}
- > # Prepare the data structure for templating: one record per row, and delete empty rows
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/join.json
> {{.LOG}}
- | # Export to METS:MODS via templating
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
" --suffix "
" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- harvest/{{.PROJECT}}.xml
- config/**
generates:
- refine/{{.PROJECT}}.openrefine.tar.gz
- refine/{{.PROJECT}}.txt
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
linkcheck:
desc: check OPUS Siegen links
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :linkcheck
vars: {PROJECT: '{{.PROJECT}}'}
delete:
desc: delete OPUS Siegen cache
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :delete
vars: {PROJECT: '{{.PROJECT}}'}
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"
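The `default` task re-derives the namespaced task name from the directory it runs in; `${PWD##*/}` is POSIX parameter expansion that strips everything up to the last slash. A minimal sketch of that expansion (the `siegen` directory here is just an example):

```shell
#!/bin/sh
# ${PWD##*/} expands to the last path component, i.e. the directory name
mkdir -p /tmp/taskdemo/siegen
cd /tmp/taskdemo/siegen
DIR="${PWD##*/}:main"
echo "$DIR"
```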

@@ -26,11 +26,11 @@
"mode": "row-based"
},
"baseColumnName": "dc:type",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "doctype",
"columnInsertIndex": 7,
"description": "Create column doctype at index 7 based on column dc:type using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
"description": "Create column doctype"
},
{
"op": "core/text-transform",

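The change above replaces one target value in the GREL lookup table that derives the Visual Library doctype from dc:type (`conferenceObject` now maps to `conference_object`). The expression behaves like a plain dictionary lookup with a blank fallback; a Python sketch of the updated mapping:

```python
# Lookup table mirroring the GREL with(...)/forEach(...) expression above
DOCTYPES = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "conferenceObject": "conference_object",
    "doctoralThesis": "oaDoctoralThesis",
    "masterThesis": "oaMasterThesis",
    "PeriodicalPart": "journal issue",
    "StudyThesis": "oaStudyThesis",
    "Other": "oaBdOther",
}

def doctype(dc_type):
    # onError "set-to-blank" roughly corresponds to returning "" on a miss
    return DOCTYPES.get(dc_type, "")
```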
@@ -6,7 +6,7 @@
{
"type": "list",
"name": "ddb:transfer",
"expression": "grel:row.record.cells['ddb:transfer'].value.join('').contains('.pdf')",
"expression": "grel:row.record.cells['ddb:transfer'].value.join('').toLowercase().contains('.pdf')",
"columnName": "ddb:transfer",
"invert": false,
"omitBlank": false,

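Adding `toLowercase()` makes the record-level PDF filter case-insensitive, so a record whose only direct link ends in `.PDF` is no longer discarded. The GREL expression joins all `ddb:transfer` values of a record and tests for the substring; roughly, in Python (the URLs are hypothetical):

```python
def has_pdf(urls):
    # Join all transfer URLs of one record and test case-insensitively,
    # like row.record.cells['ddb:transfer'].value.join('').toLowercase().contains('.pdf')
    return ".pdf" in "".join(urls).lower()
```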
@ -1,147 +0,0 @@
# https://taskfile.dev
version: '3'
tasks:
default:
desc: OPUS Siegen
deps: [harvest]
cmds:
- task: refine
- task: check
- task: split
- task: validate
- task: zip
- task: diff
harvest:
dir: data/siegen/harvest
cmds:
- METHA_DIR=$PWD metha-sync --format xMetaDissPlus https://dspace.ub.uni-siegen.de/oai/request
- METHA_DIR=$PWD metha-cat --format xMetaDissPlus https://dspace.ub.uni-siegen.de/oai/request > siegen.xml
refine:
dir: data/siegen/refine
ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
env:
PORT: 3334
RAM: 4G
PROJECT: siegen
cmds:
# start OpenRefine
- $OPENREFINE -v warn -p $PORT -m $RAM -d $PWD > openrefine.log 2>&1 &
- timeout 30s bash -c "until curl -s http://localhost:$PORT | cat | grep -q -o OpenRefine ; do sleep 1; done"
# Import (requires an absolute path to the XML file)
- $OPENREFINE_CLIENT -P $PORT --create "$(readlink -e ../harvest/siegen.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName $PROJECT
# Preprocessing: identifier into the first column; delete columns that are not needed (no distinguishing features); rename the remaining columns (strip the path)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/vorverarbeitung.json $PROJECT
# Extract URNs: remove duplicates and merge the different URNs
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/urn.json $PROJECT
# Add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS Flocat from it
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/direktlinks.json $PROJECT
# Delete records without a direct link to a PDF
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/nur-mit-pdf.json $PROJECT
# Split dc:subject into ddc and topic
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/ddc-topic.json $PROJECT
# Standardized rights statements (canonical name from CC links in dc:rights)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/cc.json $PROJECT
# Derive the Internet media type from ddb:transfer: mapping maintained manually following Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/mime.json $PROJECT
# Add DOIs from the OAI_DC format: additionally query the DC format for every record and extract dc:identifier entries of type doi
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/doi.json $PROJECT
# Enrich HT numbers via lobid-resources: OR search when there are several URNs; if there are several hits, only the first one is used
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/hbz.json $PROJECT
# mods:nonSort sort handling for the first element in dc:title
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/nonsort.json $PROJECT
# Extract DINI publication types from dc:type
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/dini.json $PROJECT
# Visual Library doctype from dc:type: if thesis:level == thesis.habilitation then doctype oaHabil
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/doctype.json $PROJECT
# Prepare the data structure for templating: one record per row, and delete empty rows
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/join.json $PROJECT
# Export to METS:MODS via templating
- |
$OPENREFINE_CLIENT -P $PORT --export --template "$(< ../../../rules/siegen/template.txt)" --rowSeparator "
<!-- SPLIT -->
" --suffix "
" --output siegen.txt $PROJECT
# statistics on runtime and resource usage
- ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:$PORT)
# stop OpenRefine
- PID=$(lsof -t -i:$PORT); kill $PID; while ps -p $PID > /dev/null; do sleep 1; done
# archive the OpenRefine project for debugging
- tar cfz siegen.openrefine.tar.gz -C $(grep -l siegen *.project/metadata.json | cut -d '/' -f 1) .
# delete temporary files
- rm -rf ./*.project* && rm -f workspace.json
sources:
# if "dir:" is set for the task, relative paths are resolved from dir
- ../harvest/siegen.xml
- ../../../rules/siegen/*.json
- ../../../rules/siegen/template.txt
#TODO - ../../../rules/common/*.json
generates:
- openrefine.log
- siegen.txt
- siegen.openrefine.tar.gz
check:
cmds:
# Tasks prefixed with ":" are defined in Taskfile.yml and shared by all data sources
- task: :check
vars: {PROJECT: "siegen", MINIMUM: "1250"}
sources:
# if "dir:" is not set for the task, relative paths are resolved from Taskfile.yml
- data/siegen/refine/openrefine.log
- data/siegen/refine/siegen.txt
split:
cmds:
- task: :split
vars: {PROJECT: "siegen"}
sources:
- data/siegen/refine/siegen.txt
generates:
- data/siegen/split/*.xml
validate:
cmds:
- task: :validate
vars: {PROJECT: "siegen"}
sources:
- data/siegen/split/*.xml
generates:
- data/siegen/validate.log
zip:
cmds:
- task: :zip
vars: {PROJECT: "siegen"}
sources:
- data/siegen/split/*.xml
generates:
- data/siegen/siegen_{{.DATE}}.zip
diff:
cmds:
- task: :diff
vars: {PROJECT: "siegen"}
sources:
- data/siegen/split/*.xml
generates:
- data/siegen/diff.log
linkcheck:
desc: check OPUS Siegen links
cmds:
- task: :linkcheck
vars: {PROJECT: "siegen"}
sources:
- data/siegen/split/*.xml
generates:
- data/siegen/linkcheck.log
delete:
desc: delete OPUS Siegen cache
cmds:
- task: :delete
vars: {PROJECT: "siegen"}

@ -1,150 +0,0 @@
# https://taskfile.dev
version: '3'
tasks:
# Tasks prefixed with ":" are defined in Taskfile.yml and shared by all data sources
default:
desc: Elpub Wuppertal
deps: [harvest]
cmds:
- task: refine
- task: check
- task: split
- task: validate
- task: zip
- task: diff
harvest:
dir: data/wuppertal/harvest
cmds:
- METHA_DIR=$PWD metha-sync --format oai_dc http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
- METHA_DIR=$PWD metha-cat --format oai_dc http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider > wuppertal.xml
refine:
dir: data/wuppertal/refine
ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
env:
PORT: 3335
RAM: 4G
PROJECT: wuppertal
cmds:
# start OpenRefine
- $OPENREFINE -v warn -p $PORT -m $RAM -d $PWD > openrefine.log 2>&1 &
- timeout 30s bash -c "until curl -s http://localhost:$PORT | cat | grep -q -o OpenRefine ; do sleep 1; done"
# Import (requires an absolute path to the XML file)
- $OPENREFINE_CLIENT -P $PORT --create "$(readlink -e ../harvest/wuppertal.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName $PROJECT
# Preprocessing: identifier into the first column; delete columns that are not needed (no distinguishing features); rename the remaining columns (strip the path)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/vorverarbeitung.json $PROJECT
# Remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/html.json $PROJECT
# Normalize DDC uniformly to three digits (affects dc:subjects and oai:setSpec)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/ddc.json $PROJECT
# Set dc:publisher
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/publisher.json $PROJECT
# Extract URNs, DOIs and PDF links from dc:identifier
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/identifier.json $PROJECT
# Generate direct links by resolving the URNs against nbn-resolving, and delete records without a direct link to a PDF
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/nbn.json $PROJECT
# Split dc:subject into ioo and topic
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/subjects.json $PROJECT
# Standardized rights statements, part 1 (links to CC licenses)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/rights.json $PROJECT
# Prepare the data structure for templating: one record per row, and delete empty rows
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/join.json $PROJECT
# Merge same-language title statements into Title/Subtitle
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/subtitle.json $PROJECT
# Language codes per ISO 639-2/B (affects dc:language and the xml:lang attributes of dc:coverage, dc:description and dc:title)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/language.json $PROJECT
# Standardized rights statements, part 2 (canonical name for CC licenses)
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/rights-cc.json $PROJECT
# Enrich HT numbers via lobid-resources
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/hbz.json $PROJECT
# mods:nonSort sort handling for the first element in dc:title
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/nonsort.json $PROJECT
# Export to METS:MODS via templating
- |
$OPENREFINE_CLIENT -P $PORT --export --template "$(< ../../../rules/wuppertal/template.txt)" --rowSeparator "
<!-- SPLIT -->
" --suffix "
" --output wuppertal.txt $PROJECT
# statistics on runtime and resource usage
- ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:$PORT)
# stop OpenRefine
- PID=$(lsof -t -i:$PORT); kill $PID; while ps -p $PID > /dev/null; do sleep 1; done
# archive the OpenRefine project for debugging
- tar cfz wuppertal.openrefine.tar.gz -C $(grep -l wuppertal *.project/metadata.json | cut -d '/' -f 1) .
# delete temporary files
- rm -rf ./*.project* && rm -f workspace.json
sources:
# if "dir:" is set for the task, relative paths are resolved from dir
- ../harvest/wuppertal.xml
- ../../../rules/wuppertal/*.json
- ../../../rules/wuppertal/template.txt
#TODO - ../../../rules/common/*.json
generates:
- openrefine.log
- wuppertal.txt
- wuppertal.openrefine.tar.gz
check:
cmds:
# Tasks prefixed with ":" are defined in Taskfile.yml and shared by all data sources
- task: :check
vars: {PROJECT: "wuppertal", MINIMUM: "1300"}
sources:
# if "dir:" is not set for the task, relative paths are resolved from Taskfile.yml
- data/wuppertal/refine/openrefine.log
- data/wuppertal/refine/wuppertal.txt
split:
cmds:
- task: :split
vars: {PROJECT: "wuppertal"}
sources:
- data/wuppertal/refine/wuppertal.txt
generates:
- data/wuppertal/split/*.xml
validate:
cmds:
- task: :validate
vars: {PROJECT: "wuppertal"}
sources:
- data/wuppertal/split/*.xml
generates:
- data/wuppertal/validate.log
zip:
cmds:
- task: :zip
vars: {PROJECT: "wuppertal"}
sources:
- data/wuppertal/split/*.xml
generates:
- data/wuppertal/wuppertal_{{.DATE}}.zip
diff:
cmds:
- task: :diff
vars: {PROJECT: "wuppertal"}
sources:
- data/wuppertal/split/*.xml
generates:
- data/wuppertal/diff.log
linkcheck:
desc: check Elpub Wuppertal links
cmds:
- task: :linkcheck
vars: {PROJECT: "wuppertal"}
sources:
- data/wuppertal/split/*.xml
generates:
- data/wuppertal/linkcheck.log
delete:
desc: delete Elpub Wuppertal cache
cmds:
- task: :delete
vars: {PROJECT: "wuppertal"}

wuppertal/Taskfile.yml Normal file
@@ -0,0 +1,145 @@
version: '3'
tasks:
main:
desc: Elpub Wuppertal
vars:
MINIMUM: 1300 # minimum number of records expected
PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: harvest
- task: refine
# The following tasks prefixed with ":" are defined in Taskfile.yml and shared by all data sources
- task: :check
vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
- task: :split
vars: {PROJECT: '{{.PROJECT}}'}
- task: :validate
vars: {PROJECT: '{{.PROJECT}}'}
- task: :zip
vars: {PROJECT: '{{.PROJECT}}'}
- task: :diff
vars: {PROJECT: '{{.PROJECT}}'}
harvest:
dir: ./{{.PROJECT}}/harvest
vars:
URL: http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
FORMAT: oai_dc
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
- METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml
refine:
dir: ./{{.PROJECT}}
vars:
PORT: 3335 # assign a different port for each project
RAM: 4G # maximum RAM for OpenRefine java heap space
PROJECT: '{{splitList ":" .TASK | first}}'
LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
cmds:
- mkdir -p refine
- task: :start # launch OpenRefine
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # Import (requires an absolute path to the XML file)
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m harvest/{{.PROJECT}}.xml)"
--recordPath Records --recordPath Record
--storeEmptyStrings false --trimStrings true
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # Preprocessing: identifier into the first column; delete columns that are not needed (no distinguishing features); rename the remaining columns (strip the path)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/vorverarbeitung.json
> {{.LOG}}
- > # Remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/html.json
> {{.LOG}}
- > # Normalize DDC uniformly to three digits (affects dc:subjects and oai:setSpec)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/ddc.json
> {{.LOG}}
- > # Set dc:publisher
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/publisher.json
> {{.LOG}}
- > # Extract URNs, DOIs and PDF links from dc:identifier
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/identifier.json
> {{.LOG}}
- > # Generate direct links by resolving the URNs against nbn-resolving, and delete records without a direct link to a PDF
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nbn.json
> {{.LOG}}
- > # Split dc:subject into ioo and topic
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/subjects.json
> {{.LOG}}
- > # Standardized rights statements, part 1 (links to CC licenses)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/rights.json
> {{.LOG}}
- > # Prepare the data structure for templating: one record per row, and delete empty rows
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/join.json
> {{.LOG}}
- > # Merge same-language title statements into Title/Subtitle
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/subtitle.json
> {{.LOG}}
- > # Language codes per ISO 639-2/B (affects dc:language and the xml:lang attributes of dc:coverage, dc:description and dc:title)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/language.json
> {{.LOG}}
- > # Standardized rights statements, part 2 (canonical name for CC licenses)
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/rights-cc.json
> {{.LOG}}
- > # Enrich HT numbers via lobid-resources
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/hbz.json
> {{.LOG}}
- > # mods:nonSort sort handling for the first element in dc:title
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/nonsort.json
> {{.LOG}}
- | # Export to METS:MODS via templating
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
" --suffix "
" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- harvest/{{.PROJECT}}.xml
- config/**
generates:
- refine/{{.PROJECT}}.openrefine.tar.gz
- refine/{{.PROJECT}}.txt
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
linkcheck:
desc: check Elpub Wuppertal links
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :linkcheck
vars: {PROJECT: '{{.PROJECT}}'}
delete:
desc: delete Elpub Wuppertal cache
vars:
PROJECT: '{{splitList ":" .TASK | first}}'
cmds:
- task: :delete
vars: {PROJECT: '{{.PROJECT}}'}
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"

@@ -57,11 +57,11 @@
"mode": "row-based"
},
"columnName": "setSpec",
"expression": "grel:value.split(':').reverse()[0]",
"expression": "grel:value.split(':')[-1]",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column setSpec using expression grel:value.split(':').reverse()[0]"
"description": "Text transform on cells in column setSpec using expression grel:value.split(':')[-1]"
},
{
"op": "core/text-transform",

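This refactoring replaces `.split(':').reverse()[0]` with GREL's negative index `[-1]`; both select the last colon-separated segment of `setSpec`. The equivalence in Python, with a hypothetical setSpec value:

```python
spec = "doc-type:ddc:610"            # hypothetical setSpec value
last_old = spec.split(":")[::-1][0]  # old style: reverse the list, take the first
last_new = spec.split(":")[-1]       # new style: take the last element directly
```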
@@ -157,7 +157,7 @@
{
"type": "list",
"name": "url",
"expression": "grel:row.record.cells['url'].value.join('').contains('.pdf')",
"expression": "grel:row.record.cells['url'].value.join('').toLowercase().contains('.pdf')",
"columnName": "url",
"invert": false,
"omitBlank": false,

@@ -17,11 +17,11 @@
<role>
<roleTerm type="code" authority="marcrelator">aut</roleTerm>
</role>
</name>{{forNonBlank(cells['dc:contributor'].value,x,forEach(x.split('␞'),v,'
</name>{{forNonBlank(cells['dc:contributor'].value, x, forEach(x.split('␞'), v, '
<name type="personal">
<displayForm>'+ v.escape('xml') +'</displayForm>
<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>
<namePart type="given">' + v.split(',')[1].trim().escape('xml') + '</namePart>
<displayForm>'+ v.escape('xml') +'</displayForm>' + forNonBlank(v.split(',')[1], z, '
<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>' + '
<namePart type="given">' + z.trim().escape('xml') + '</namePart>', '') + '
<role>
<roleTerm type="code" authority="marcrelator">ctb</roleTerm>
</role>
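The corrected template wraps the family/given name parts in `forNonBlank(v.split(',')[1], …)`, so contributors without a `Family, Given` structure (e.g. corporate bodies) only get a `displayForm` instead of triggering the java.lang error noted in the commit history. A Python sketch of the guarded logic, with hypothetical names:

```python
def name_parts(contributor):
    """Return (display_form, family, given); family and given are None
    when the name has no 'Family, Given' structure."""
    parts = contributor.split(",", 1)
    if len(parts) == 2 and parts[1].strip():
        return contributor, parts[0], parts[1].strip()
    return contributor, None, None
```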