resolve #26: refactoring based on the openrefine-task-runner template
This commit is contained in:
parent 192bbef02d
commit 3711d241f2
.gitignore
@@ -1,3 +1,8 @@
data
openrefine
*/harvest/*
*/refine/*
*/split/*
*/validate/*
*/zip/*
*/*.log
.openrefine
.task
README.md (69 changed lines)
@@ -18,6 +18,7 @@ Harvesting von OAI-PMH-Schnittstellen und Transformation in METS/MODS für das P

* GNU/Linux (tested with Fedora 32)
* Java 8+
* [cURL](https://curl.se), xmllint

## Installation

@@ -28,29 +29,7 @@ Harvesting von OAI-PMH-Schnittstellen und Transformation in METS/MODS für das P
   cd noah
   ```

-2. [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) (requires Java 8+)
-
-   ```sh
-   # install into the openrefine subdirectory
-   wget -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
-   mkdir -p openrefine
-   tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
-   # disable automatic launch of the browser
-   sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini"
-   # raise the autosave period from 5 minutes to 25 hours
-   sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini"
-   ```
-
-3. [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
-
-   ```sh
-   # install into the openrefine subdirectory
-   mkdir -p openrefine
-   wget -O openrefine/openrefine-client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
-   chmod +x openrefine/openrefine-client
-   ```
-
-4. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)
+2. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)

   a) RPM-based (Fedora, CentOS, SLES, etc.)

@@ -66,7 +45,7 @@ Harvesting von OAI-PMH-Schnittstellen und Transformation in METS/MODS für das P
   sudo apt install ./metha_0.2.20_amd64.deb && rm metha_0.2.20_amd64.deb
   ```

-5. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
+3. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)

   a) RPM-based (Fedora, CentOS, SLES, etc.)

@@ -82,32 +61,52 @@ Harvesting von OAI-PMH-Schnittstellen und Transformation in METS/MODS für das P
   sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
   ```

+4. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
+
+   ```sh
+   task install
+   ```

 ## Usage

 * If needed, raise ulimit beforehand to avoid aborts due to "too many open files"

   ```
-  ulimit -n 10000
+  ulimit -n 20000
   ```

-* Harvest, transform, and validate all data sources (parallelized)
+* All data sources (parallelized)

   ```
   task
   ```

-* Harvest, transform, and validate one data source
+* One data source

   ```
-  task siegen:default
+  task siegen:main
   ```

-* Harvest, transform, and validate two data sources (parallelized)
+* Two data sources (parallelized)

   ```
-  task --parallel siegen:default wuppertal:default
+  task --parallel siegen:main wuppertal:main
   ```

+* Start processing anyway, even when the checksum comparison says there is nothing to do
+
+  ```sh
+  task siegen:main --force
+  ```
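Task decides whether a run is needed by checksumming the files listed under `sources:` of each task; `--force` skips that up-to-date check. A minimal sketch of the idea in plain shell (the checksum file name is hypothetical; Task keeps its own state under `.task/`):

```sh
# rough equivalent of the sources checksum test that --force bypasses
sha256sum siegen/harvest/siegen.xml siegen/config/*.json > checksums.new
if cmp -s checksums.new checksums.txt; then
    echo "up to date, nothing to do"    # Task would skip the commands
else
    mv checksums.new checksums.txt      # remember the new state
    echo "sources changed, running task"
fi
```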
+* For troubleshooting: print the commands without executing them
+
+  ```sh
+  task siegen:main --dry --verbose --force
+  ```

 * Check the links of one data source

   ```
@@ -128,11 +127,11 @@ Harvesting von OAI-PMH-Schnittstellen und Transformation in METS/MODS für das P

 ## Configuration

-* Workflow for each data source in [tasks](tasks)
-  * Example: [tasks/siegen.yml](tasks/siegen.yml)
-* OpenRefine transformation rules in [rules](rules)
-  * Example: [rules/siegen/hbz.json](rules/siegen/hbz.json)
-* General tasks (e.g. validation) in [Taskfile.yml](Taskfile.yml)
+* The workflow for each data source is defined in its own `Taskfile.yml`
+  * Example: [siegen/Taskfile.yml](siegen/Taskfile.yml)
+* The OpenRefine transformation rules used in a workflow live in the `config` subfolder of the respective data source
+  * Example: [siegen/config/hbz.json](siegen/config/hbz.json)
+* General tasks (e.g. validation) are defined in the [Taskfile.yml](Taskfile.yml) of the root folder.

 ## OAI-PMH Data Provider

Taskfile.yml (173 changed lines)
@@ -1,84 +1,111 @@
-# https://taskfile.dev
+# https://github.com/opencultureconsulting/openrefine-task-runner

 version: '3'

-output: prefixed
-
 includes:
-  muenster: ./tasks/muenster.yml
-  siegen: ./tasks/siegen.yml
-  wuppertal: ./tasks/wuppertal.yml
+  muenster: muenster
+  siegen: siegen
+  wuppertal: wuppertal
+
+silent: true
+output: prefixed

 vars:
   DATE: '{{ now | date "2006-01-02"}}'

 env:
   OPENREFINE:
-    sh: readlink -e openrefine/refine
-  OPENREFINE_CLIENT:
-    sh: readlink -e openrefine/openrefine-client
+    sh: readlink -m .openrefine/refine
+  CLIENT:
+    sh: readlink -m .openrefine/client

 tasks:
   default:
-    desc: all data sources (parallel)
-    preconditions:
-      - sh: test -n "$(command -v metha-sync)"
-        msg: "requirement metha missing"
-      - sh: test -n "$(command -v java)"
-        msg: "requirement JAVA runtime environment (jre) missing"
-      - sh: test -x "$OPENREFINE"
-        msg: "requirement OpenRefine missing"
-      - sh: test -x "$OPENREFINE_CLIENT"
-        msg: "requirement openrefine-client missing"
-      - sh: test -n "$(command -v curl)"
-        msg: "requirement curl missing"
-      - sh: test -n "$(command -v xmllint)"
-        msg: "requirement xmllint missing"
+    desc: execute all projects in parallel
     deps:
-      - task: muenster:default
-      - task: wuppertal:default
-      - task: siegen:default
+      - task: muenster:main
+      - task: siegen:main
+      - task: wuppertal:main

-  openrefine-start:
-    label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}/refine
+  install:
+    desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
     cmds:
-      - test -n "{{.PROJECT}}"; test -n "{{.PORT}}"; test -n "{{.RAM}}"
-      # delete temporary files
-      - rm -rf ./*.project* && rm -f workspace.json
-      # start OpenRefine and write a log file for later checks
-      - $OPENREFINE -v warn -p {{.PORT}} -m {{.RAM}} -d $PWD > openrefine.log 2>&1 &
-      # wait until OpenRefine is reachable
-      - timeout 30s bash -c "until curl -s http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine ; do sleep 1; done"
+      - | # delete existing install and recreate folder
+        rm -rf .openrefine
+        mkdir -p .openrefine
+      - > # download OpenRefine archive
+        wget --no-verbose -O openrefine.tar.gz
+        https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
+      - | # install OpenRefine into subdirectory .openrefine
+        tar -xzf openrefine.tar.gz -C .openrefine --strip 1
+        rm openrefine.tar.gz
+      - | # optimize OpenRefine for batch processing
+        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
+        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
+        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
+      - > # download openrefine-client into subdirectory .openrefine
+        wget --no-verbose -O .openrefine/client
+        https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
+      - chmod +x .openrefine/client # make client executable
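The two `refine.ini` edits matter for unattended batch runs: headless mode keeps OpenRefine from opening a browser, and the long autosave period (1440 minutes, i.e. 25 hours) avoids constant project rewrites during processing. A quick way to verify the install step took effect (paths as created by the task above; the expected output is an assumption based on the sed commands):

```sh
# confirm that both refine.ini tweaks from the install task are in place
grep -E 'refine\.headless|REFINE_AUTOSAVE_PERIOD' .openrefine/refine.ini
# expected output:
# REFINE_AUTOSAVE_PERIOD=1440
# JAVA_OPTIONS=-Drefine.headless=true
```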
-  openrefine-stop:
-    label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}/refine
+  start:
+    dir: ./{{.PROJECT}}/refine
     cmds:
-      - test -n "{{.PROJECT}}"; test -n "{{.PORT}}"
-      # statistics on runtime and resource usage
-      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}})
-      # shut down OpenRefine
-      - PID=$(lsof -t -i:{{.PORT}}); kill $PID; while ps -p $PID > /dev/null; do sleep 1; done
-      # archive the OpenRefine project for debugging
-      - tar cfz {{.PROJECT}}.openrefine.tar.gz -C $(grep -l {{.PROJECT}} *.project/metadata.json | cut -d '/' -f 1) .
+      - | # verify that OpenRefine is installed
+        if [ ! -f "$OPENREFINE" ]; then
+          echo 1>&2 "OpenRefine missing; try task install"; exit 1
+        fi
+      - | # delete temporary files and log file of previous run
+        rm -rf ./*.project* workspace.json
+        rm -rf "{{.PROJECT}}.log"
+      - > # launch OpenRefine with specific data directory and redirect its output to a log file
+        "$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
+        -d ../{{.PROJECT}}/refine
+        >> "{{.PROJECT}}.log" 2>&1 &
+      - | # wait until OpenRefine API is available
+        timeout 30s bash -c "until
+        wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
+        do sleep 1
+        done"
+  stop:
+    dir: ./{{.PROJECT}}/refine
+    cmds:
+      - | # shut down OpenRefine gracefully
+        PID=$(lsof -t -i:{{.PORT}})
+        kill $PID
+        while ps -p $PID > /dev/null; do sleep 1; done
+      - > # archive the OpenRefine project
+        tar cfz
+        "{{.PROJECT}}.openrefine.tar.gz"
+        -C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
+        .
+      - rm -rf ./*.project* workspace.json # delete temporary files
+
+  kill:
+    dir: ./{{.PROJECT}}/refine
+    cmds:
+      - | # shut down OpenRefine immediately to save time and disk space
+        PID=$(lsof -t -i:{{.PORT}})
+        kill -9 $PID
+        while ps -p $PID > /dev/null; do sleep 1; done
+      - rm -rf ./*.project* workspace.json # delete temporary files

   check:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}/refine
+    dir: ./{{.PROJECT}}/refine
     cmds:
-      - test -n "{{.PROJECT}}"; test -n "{{.MINIMUM}}"
-      # check the OpenRefine log file for warnings and error messages
-      - if grep -i 'exception\|error' openrefine.log; then echo 1>&2 "Logdatei $PWD/openrefine.log enthält Warnungen!" && exit 1; fi
-      # check whether the minimum of 1250 records was generated
-      - if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then echo 1>&2 "Unerwartet geringe Anzahl an Datensätzen in $PWD/{{.PROJECT}}.txt!" && exit 1; fi
-    sources:
-      - openrefine.log
-      - '{{.PROJECT}}.txt'
+      - | # find log file(s) and check for "exception" or "error"
+        if grep -i 'exception\|error' $(find . -name '*.log'); then
+          echo 1>&2 "log contains warnings!"; exit 1
+        fi
+      - | # check whether the minimum number of records was generated
+        if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then
+          echo 1>&2 "Unerwartet geringe Anzahl an Datensätzen in $PWD/{{.PROJECT}}.txt!"; exit 1
+        fi
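The record-count check uses `grep -c`, which counts the lines containing `recordIdentifier`; since the export writes the records in a predictable format, that count serves as a record counter. Standalone, the same test reads (threshold and path are examples):

```sh
# fail when fewer records than expected were exported
MINIMUM=1250                                          # example threshold
RECORDS=$(grep -c recordIdentifier siegen/refine/siegen.txt)
if (( MINIMUM > RECORDS )); then
    echo 1>&2 "only $RECORDS records, expected at least $MINIMUM"
    exit 1
fi
```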
   split:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}/split
+    dir: ./{{.PROJECT}}/split
     cmds:
       - test -n "{{.PROJECT}}"
       # split into individual files
@@ -94,42 +121,42 @@ tasks:

   validate:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}
+    dir: ./{{.PROJECT}}/validate
     cmds:
       - test -n "{{.PROJECT}}"
       # validate against the METS schema
       - wget -q -nc https://www.loc.gov/standards/mets/mets.xsd
-      - xmllint --schema mets.xsd --noout split/*.xml > validate.log 2>&1
+      - xmllint --schema mets.xsd --noout ../split/*.xml > validate.log 2>&1
     sources:
-      - split/*.xml
+      - ../split/*.xml
     generates:
       - validate.log

   zip:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}
+    dir: ./{{.PROJECT}}/zip
     cmds:
       - test -n "{{.PROJECT}}"
       # create a ZIP archive with timestamp
-      - zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip split/*.xml
+      - zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip ../split/*.xml
     sources:
-      - split/*.xml
+      - ../split/*.xml
     generates:
       - '{{.PROJECT}}_{{.DATE}}.zip'

   diff:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}
+    dir: ./{{.PROJECT}}
     cmds:
       - test -n "{{.PROJECT}}"
       # compare the contents of the two most recent ZIP archives
-      - if test -n "$(ls -t *.zip | sed -n 2p)"; then unzip -q -d old $(ls -t *.zip | sed -n 2p); unzip -q -d new $(ls -t *.zip | sed -n 1p); fi
+      - if test -n "$(ls -t zip/*.zip | sed -n 2p)"; then unzip -q -d old $(ls -t zip/*.zip | sed -n 2p); unzip -q -d new $(ls -t zip/*.zip | sed -n 1p); fi
       - diff -d old new > diff.log || exit 0
       - rm -rf old new
       # check that the diff contains fewer than 500 lines
       - if (( 500 < $(wc -l <diff.log) )); then echo 1>&2 "Unerwartet große Änderungen in $PWD/diff.log!" && exit 1; fi
       # archive the diff
-      - cp diff.log {{.PROJECT}}_{{.DATE}}.diff
+      - cp diff.log zip/{{.PROJECT}}_{{.DATE}}.diff
     sources:
       - split/*.xml
     generates:
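In the diff task, `ls -t` lists archives newest first, so `sed -n 1p` picks the current archive and `sed -n 2p` the previous one; the `test -n` guard skips the comparison on the very first run, when only one archive exists. For illustration (file names are hypothetical):

```sh
# selecting the two most recent archives, as the diff task does
ls -t zip/*.zip | sed -n 1p   # newest, e.g. zip/siegen_2021-02-15.zip
ls -t zip/*.zip | sed -n 2p   # previous, e.g. zip/siegen_2021-02-01.zip
```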
@@ -137,16 +164,18 @@ tasks:

   linkcheck:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}
+    dir: ./{{.PROJECT}}
     cmds:
       - test -n "{{.PROJECT}}"
       # extract links
-      - xmllint --xpath '//@*[local-name(.) = "href"]' split/*.xml | cut -d '"' -f2 > links.txt
+      - xmllint --xpath '//@*[local-name(.) = "href"]' split/*.xml | cut -d '"' -f2 | sort | uniq > links.txt
       # determine the HTTP status code of every link
-      - curl --silent --head --write-out "%{http_code} %{url_effective}\n" $(while read line; do echo "-o /dev/null $line"; done < links.txt) > linkcheck.log
-      - rm -rf links.txt
+      - awk '{ print "url = " $0 "\noutput = /dev/null"; }' links.txt > curl.cfg
+      - curl --silent --head --location --write-out "%{http_code} %{url_effective}\n" --config curl.cfg > linkcheck.log
       # check the log file for status codes != 2XX
       - if grep '^[^2]' linkcheck.log; then echo 1>&2 "Logdatei $PWD/linkcheck.log enthält problematische status codes!" && exit 1; fi
+      # clean up on success
+      - rm -rf curl.cfg links.txt
     sources:
       - split/*.xml
     generates:
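The `awk` step turns the link list into a curl config file, so a single curl invocation can probe every URL. Assuming a two-line `links.txt`, the generated `curl.cfg` looks like this:

```sh
# demonstrate the curl.cfg generation on two example URLs
printf '%s\n' 'https://example.org/a.pdf' 'https://example.org/b.pdf' > links.txt
awk '{ print "url = " $0 "\noutput = /dev/null"; }' links.txt
# prints:
# url = https://example.org/a.pdf
# output = /dev/null
# url = https://example.org/b.pdf
# output = /dev/null
```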
@@ -154,9 +183,11 @@ tasks:

   delete:
     label: '{{.TASK}}-{{.PROJECT}}'
-    dir: data/{{.PROJECT}}
+    dir: ./{{.PROJECT}}
     cmds:
       - test -n "{{.PROJECT}}"
       - rm -rf harvest
       - rm -rf refine
       - rm -rf split
+      - rm -rf validate
       - rm -f diff.log
muenster/Taskfile.yml (new file)
@@ -0,0 +1,132 @@
version: '3'

tasks:
  main:
    desc: miami ULB Münster
    vars:
      MINIMUM: 7695 # minimum number of records expected
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in the top-level Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: ./{{.PROJECT}}/harvest
    vars:
      URL: http://repositorium.uni-muenster.de/oai/miami
      FORMAT: mets
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml
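`METHA_DIR` points metha's cache at the project's harvest folder, so `metha-sync` only fetches records added since the last run, and `metha-cat` concatenates the cached responses into a single XML file. The same harvest can be reproduced by hand (directory layout and URL as in the task above):

```sh
# incremental OAI-PMH harvest into a project-local cache, then export to one file
cd muenster/harvest
export METHA_DIR=$PWD
metha-sync --format mets http://repositorium.uni-muenster.de/oai/miami
metha-cat --format mets http://repositorium.uni-muenster.de/oai/miami > muenster.xml
```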
  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3336 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import (requires an absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record --recordPath metadata --recordPath mets:mets
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # preprocessing: put the identifier into the first column, id
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # remove older entries (by mets:metsHdr CREATEDATE) with the same identifier
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/duplicates.json
        > {{.LOG}}
      - > # delete aggregations (these records are referenced by subordinate works via relatedItem)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ohne-aggregationen.json
        > {{.LOG}}
      - > # delete records without a direct link to a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nur-mit-pdf.json
        > {{.LOG}}
      # index: generate an index column from row.record.index
      - >
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/index.json
        > {{.LOG}}
      - > # sorting: mods:nonSort for the first element in mods:title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - > # Visual Library doctype from mods:genre
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doctype.json
        > {{.LOG}}
      - > # strip HTML codes from abstracts and delete abstracts without a language tag
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/abstract.json
        > {{.LOG}}
      - > # remove the separate download link when only one file is present
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/flocat.json
        > {{.LOG}}
      - > # make mets:file IDs unique to avoid validation errors
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/file-id.json
        > {{.LOG}}
      - > # enrich HT number via lobid-resources: OR search when there are several URNs; with several hits only the first is used
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - | # export to METS/MODS via templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
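The LOG variable expands to a bash process substitution, so each `> {{.LOG}}` becomes `> >(tee -a "refine/muenster.log") 2>&1`: stdout and stderr of every client call are appended to one project log while tee keeps them visible. The pattern in isolation:

```sh
# redirect stdout and stderr of a command into an append-only log via tee
echo "step done" > >(tee -a demo.log) 2>&1
cat demo.log    # -> step done (tee may flush with a slight delay)
```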
  linkcheck:
    desc: check links for miami ULB Münster
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: delete the miami ULB Münster cache
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
(file diff suppressed because it is too large)
(file diff suppressed because it is too large)
siegen/Taskfile.yml (new file)
@@ -0,0 +1,141 @@
version: '3'

tasks:
  main:
    desc: OPUS Siegen
    vars:
      MINIMUM: 1250 # minimum number of records expected
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in the top-level Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}
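PROJECT is derived from the running task's name: `{{splitList ":" .TASK | first}}` splits `siegen:main` on `:` and keeps the first element, which by convention equals the project directory. A shell analogue of that template expression:

```sh
# shell equivalent of the Go template {{splitList ":" .TASK | first}}
TASK="siegen:main"
PROJECT="${TASK%%:*}"   # remove everything from the first ":" -> "siegen"
echo "$PROJECT"
```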
  harvest:
    dir: ./{{.PROJECT}}/harvest
    vars:
      URL: https://dspace.ub.uni-siegen.de/oai/request
      FORMAT: xMetaDissPlus
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3334 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import (requires an absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # preprocessing: identifier into the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (strip the path)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # extract URNs: remove duplicates and merge differing URNs
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/urn.json
        > {{.LOG}}
      - > # add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS FLocat from it
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/direktlinks.json
        > {{.LOG}}
      - > # delete records without a direct link to a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nur-mit-pdf.json
        > {{.LOG}}
      - > # split dc:subject into ddc and topic
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ddc-topic.json
        > {{.LOG}}
      - > # standardized rights statements (canonical name from CC links in dc:rights)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/cc.json
        > {{.LOG}}
      - > # derive the Internet media type from ddb:transfer: mapping done manually after Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/mime.json
        > {{.LOG}}
      - > # add DOIs from the OAI_DC format: additionally query the DC format for every record and extract dc:identifier entries of type doi
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doi.json
        > {{.LOG}}
      - > # enrich HT number via lobid-resources: OR search when there are several URNs; with several hits only the first is used
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - > # sorting: mods:nonSort for the first element in dc:title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - > # extract DINI publication types from dc:type
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/dini.json
        > {{.LOG}}
      - > # Visual Library doctype from dc:type: if thesis:level == thesis.habilitation then doctype oaHabil
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doctype.json
        > {{.LOG}}
      - > # prepare the data structure for templating: one record per row and delete empty rows
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/join.json
        > {{.LOG}}
      - | # export to METS/MODS via templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
        " --suffix "
        " --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  linkcheck:
    desc: check links for OPUS Siegen
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: delete the OPUS Siegen cache
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
tasks/muenster.yml (deleted)
@@ -1,94 +0,0 @@
# https://taskfile.dev

version: '3'

tasks:
  default:
    desc: miami ULB Münster
    vars:
      PROJECT: muenster
      MINIMUM: 7695 # minimum number of records expected
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: data/{{.PROJECT}}/harvest
    vars:
      URL: http://repositorium.uni-muenster.de/oai/miami
      FORMAT: mets
      PROJECT: muenster
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: data/{{.PROJECT}}/refine
    ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
    vars:
      PORT: 3336
      RAM: 4G
      PROJECT: muenster
    cmds:
      - task: :openrefine-start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      # import (requires an absolute path to the XML file)
      - $OPENREFINE_CLIENT -P {{.PORT}} --create "$(readlink -e ../harvest/{{.PROJECT}}.xml)" --recordPath Records --recordPath Record --recordPath metadata --recordPath mets:mets --storeEmptyStrings false --trimStrings true --projectName {{.PROJECT}}
      # preprocessing: put the identifier into the first column, id
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/vorverarbeitung.json {{.PROJECT}}
      # remove older entries (by mets:metsHdr CREATEDATE) with the same identifier
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/duplicates.json {{.PROJECT}}
      # delete aggregations (these records are referenced by subordinate works via relatedItem)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/ohne-aggregationen.json {{.PROJECT}}
      # delete records without a direct link to a PDF
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nur-mit-pdf.json {{.PROJECT}}
      # index: generate an index column from row.record.index
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/index.json {{.PROJECT}}
      # sorting: mods:nonSort for the first element in mods:title
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nonsort.json {{.PROJECT}}
      # Visual Library doctype from mods:genre
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/doctype.json {{.PROJECT}}
      # strip HTML codes from abstracts and delete abstracts without a language tag
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/abstract.json {{.PROJECT}}
      # remove the separate download link when only one file is present
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/flocat.json {{.PROJECT}}
      # make mets:file IDs unique to avoid validation errors
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/file-id.json {{.PROJECT}}
      # enrich HT number via lobid-resources: OR search when there are several URNs; with several hits only the first is used
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/hbz.json {{.PROJECT}}
      # export to METS/MODS via templating
      - $OPENREFINE_CLIENT -P {{.PORT}} --export --template "$(< ../../../rules/{{.PROJECT}}/template.txt)" --rowSeparator "" --output {{.PROJECT}}.txt {{.PROJECT}}
      - task: :openrefine-stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - ../harvest/{{.PROJECT}}.xml
      - ../../../rules/{{.PROJECT}}/*.json
      - ../../../rules/{{.PROJECT}}/template.txt
      #TODO - ../../../rules/common/*.json
    generates:
      - openrefine.log
      - '{{.PROJECT}}.txt'
      - '{{.PROJECT}}.openrefine.tar.gz'

  linkcheck:
    desc: check links for miami ULB Münster
    cmds:
      - task: :linkcheck
        vars: {PROJECT: "muenster"}

  delete:
    desc: delete the miami ULB Münster cache
    cmds:
      - task: :delete
        vars: {PROJECT: "muenster"}
tasks/siegen.yml (deleted, 101 lines)
@@ -1,101 +0,0 @@
# https://taskfile.dev

version: '3'

tasks:
  default:
    desc: OPUS Siegen
    vars:
      PROJECT: siegen
      MINIMUM: 1250 # minimum number of records expected
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: data/{{.PROJECT}}/harvest
    vars:
      URL: https://dspace.ub.uni-siegen.de/oai/request
      FORMAT: xMetaDissPlus
      PROJECT: siegen
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: data/{{.PROJECT}}/refine
    ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
    vars:
      PORT: 3334
      RAM: 4G
      PROJECT: siegen
    cmds:
      - task: :openrefine-start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      # import (requires an absolute path to the XML file)
      - $OPENREFINE_CLIENT -P {{.PORT}} --create "$(readlink -e ../harvest/{{.PROJECT}}.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName {{.PROJECT}}
      # preprocessing: identifier into the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (strip the path)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/vorverarbeitung.json {{.PROJECT}}
      # extract URNs: remove duplicates and merge differing URNs
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/urn.json {{.PROJECT}}
      # add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS FLocat from it
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/direktlinks.json {{.PROJECT}}
      # delete records without a direct link to a PDF
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nur-mit-pdf.json {{.PROJECT}}
      # split dc:subject into ddc and topic
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/ddc-topic.json {{.PROJECT}}
      # standardized rights statements (canonical name from CC links in dc:rights)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/cc.json {{.PROJECT}}
      # derive the Internet media type from ddb:transfer: mapping done manually after Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/mime.json {{.PROJECT}}
      # add DOIs from the OAI_DC format: additionally query the DC format for every record and extract dc:identifier entries of type doi
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/doi.json {{.PROJECT}}
      # enrich HT number via lobid-resources: OR search when there are several URNs; with several hits only the first is used
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/hbz.json {{.PROJECT}}
      # sorting: mods:nonSort for the first element in dc:title
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nonsort.json {{.PROJECT}}
      # extract DINI publication types from dc:type
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/dini.json {{.PROJECT}}
      # Visual Library doctype from dc:type: if thesis:level == thesis.habilitation then doctype oaHabil
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/doctype.json {{.PROJECT}}
      # prepare the data structure for templating: one record per row and delete empty rows
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/join.json {{.PROJECT}}
      # export to METS/MODS via templating
      - |
        $OPENREFINE_CLIENT -P {{.PORT}} --export --template "$(< ../../../rules/{{.PROJECT}}/template.txt)" --rowSeparator "
        " --suffix "
        " --output {{.PROJECT}}.txt {{.PROJECT}}
      - task: :openrefine-stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - ../harvest/{{.PROJECT}}.xml
      - ../../../rules/{{.PROJECT}}/*.json
      - ../../../rules/{{.PROJECT}}/template.txt
      #TODO - ../../../rules/common/*.json
    generates:
      - openrefine.log
      - '{{.PROJECT}}.txt'
      - '{{.PROJECT}}.openrefine.tar.gz'

  linkcheck:
    desc: check links for OPUS Siegen
    cmds:
      - task: :linkcheck
        vars: {PROJECT: "siegen"}

  delete:
    desc: delete the OPUS Siegen cache
    cmds:
      - task: :delete
        vars: {PROJECT: "siegen"}
tasks/wuppertal.yml (deleted)
@@ -1,103 +0,0 @@
# https://taskfile.dev

version: '3'

tasks:
  default:
    desc: Elpub Wuppertal
    vars:
      PROJECT: wuppertal
      MINIMUM: 1300 # minimum number of records expected
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: data/{{.PROJECT}}/harvest
    vars:
      URL: http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
      FORMAT: oai_dc
      PROJECT: wuppertal
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: data/{{.PROJECT}}/refine
    ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
    vars:
      PORT: 3335
      RAM: 4G
      PROJECT: wuppertal
    cmds:
      - task: :openrefine-start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      # import (requires an absolute path to the XML file)
      - $OPENREFINE_CLIENT -P {{.PORT}} --create "$(readlink -e ../harvest/{{.PROJECT}}.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName {{.PROJECT}}
      # preprocessing: identifier into the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (strip the path)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/vorverarbeitung.json {{.PROJECT}}
      # remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/html.json {{.PROJECT}}
      # normalize DDC uniformly to three digits (affects dc:subjects and oai:setSpec)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/ddc.json {{.PROJECT}}
      # set dc:publisher
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/publisher.json {{.PROJECT}}
      # extract URNs, DOIs and PDF links from dc:identifier
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/identifier.json {{.PROJECT}}
      # generate direct links by resolving the URNs via nbn-resolving and delete records without a direct link to a PDF
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nbn.json {{.PROJECT}}
      # split dc:subject into ioo and topic
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/subjects.json {{.PROJECT}}
      # standardized rights statements, part 1 (links to CC licenses)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/rights.json {{.PROJECT}}
      # prepare the data structure for templating: one record per row and delete empty rows
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/join.json {{.PROJECT}}
      # merge same-language title statements into title/subtitle
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/subtitle.json {{.PROJECT}}
      # language codes per ISO 639-2/B (affects dc:language and the xml:lang attributes of dc:coverage, dc:description and dc:title)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/language.json {{.PROJECT}}
      # standardized rights statements, part 2 (canonical name for CC licenses)
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/rights-cc.json {{.PROJECT}}
      # enrich HT number via lobid-resources
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/hbz.json {{.PROJECT}}
      # sorting: mods:nonSort for the first element in dc:title
      - $OPENREFINE_CLIENT -P {{.PORT}} --apply ../../../rules/{{.PROJECT}}/nonsort.json {{.PROJECT}}
      # export to METS/MODS via templating
      - |
        $OPENREFINE_CLIENT -P {{.PORT}} --export --template "$(< ../../../rules/{{.PROJECT}}/template.txt)" --rowSeparator "
        " --suffix "
        " --output {{.PROJECT}}.txt {{.PROJECT}}
      - task: :openrefine-stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - ../harvest/{{.PROJECT}}.xml
      - ../../../rules/{{.PROJECT}}/*.json
      - ../../../rules/{{.PROJECT}}/template.txt
      #TODO - ../../../rules/common/*.json
    generates:
      - openrefine.log
      - '{{.PROJECT}}.txt'
      - '{{.PROJECT}}.openrefine.tar.gz'

  linkcheck:
    desc: check links for Elpub Wuppertal
    cmds:
      - task: :linkcheck
        vars: {PROJECT: "wuppertal"}

  delete:
    desc: delete the Elpub Wuppertal cache
    cmds:
      - task: :delete
        vars: {PROJECT: "wuppertal"}
wuppertal/Taskfile.yml (new file)
@@ -0,0 +1,145 @@
version: '3'

tasks:
  main:
    desc: Elpub Wuppertal
    vars:
      MINIMUM: 1300 # minimum number of records expected
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # The tasks prefixed with ":" below are defined in the top-level Taskfile.yml and are identical for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: ./{{.PROJECT}}/harvest
    vars:
      URL: http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
      FORMAT: oai_dc
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3335 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import (requires an absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # preprocessing: identifier into the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (strip the path)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/html.json
        > {{.LOG}}
      - > # normalize DDC uniformly to three digits (affects dc:subjects and oai:setSpec)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ddc.json
        > {{.LOG}}
      - > # set dc:publisher
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/publisher.json
        > {{.LOG}}
      - > # extract URNs, DOIs and PDF links from dc:identifier
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/identifier.json
        > {{.LOG}}
      - > # generate direct links by resolving the URNs via nbn-resolving and delete records without a direct link to a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nbn.json
        > {{.LOG}}
      - > # split dc:subject into ioo and topic
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/subjects.json
        > {{.LOG}}
      - > # standardized rights statements, part 1 (links to CC licenses)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/rights.json
        > {{.LOG}}
      - > # prepare the data structure for templating: one record per row and delete empty rows
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/join.json
        > {{.LOG}}
      - > # merge same-language title statements into title/subtitle
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/subtitle.json
        > {{.LOG}}
      - > # language codes per ISO 639-2/B (affects dc:language and the xml:lang attributes of dc:coverage, dc:description and dc:title)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/language.json
        > {{.LOG}}
      - > # standardized rights statements, part 2 (canonical name for CC licenses)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/rights-cc.json
        > {{.LOG}}
      - > # enrich HT number via lobid-resources
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - > # sorting: mods:nonSort for the first element in dc:title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - | # export to METS/MODS via templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
        " --suffix "
        " --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
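The export step uses a YAML literal block (`|`) because `--rowSeparator` and `--suffix` take arguments containing a literal newline, which a folded block (`>`) would collapse into spaces. Assuming bash, the same call can be written with explicit ANSI-C quoted newlines (port and paths as in the task above):

```sh
# equivalent export invocation with explicit newline arguments
"$CLIENT" -P 3335 "wuppertal" --export \
  --template "$(< config/template.txt)" \
  --rowSeparator $'\n' --suffix $'\n' \
  --output "$(readlink -m refine/wuppertal.txt)"
```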
  linkcheck:
    desc: check links for Elpub Wuppertal
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: delete the Elpub Wuppertal cache
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"