Compare commits
28 Commits
Author | SHA1 | Date |
---|---|---|
Felix Lohmeier | a23a93e5cd | |
Felix Lohmeier | a3cd4c1849 | |
Felix Lohmeier | 2e698c3fe3 | |
Felix Lohmeier | 659ad70ec6 | |
Felix Lohmeier | 170bb53b57 | |
Felix Lohmeier | 1b5f3000bc | |
Felix Lohmeier | 5c727fdbcd | |
Felix Lohmeier | dd614a6e2d | |
Felix Lohmeier | 8cd0b69f70 | |
Felix Lohmeier | 4571ebd6fc | |
Felix Lohmeier | 6734927ecd | |
Felix Lohmeier | cf247a86c1 | |
Felix Lohmeier | cf3c006d78 | |
Felix Lohmeier | 3b154c21cb | |
Felix Lohmeier | 1f1298c6f0 | |
Felix Lohmeier | 3711d241f2 | |
Felix Lohmeier | 192bbef02d | |
Felix Lohmeier | 1c77a9ab50 | |
Felix Lohmeier | 7554346261 | |
Felix Lohmeier | 11fd9aa54a | |
Felix Lohmeier | 6fe88c393e | |
Felix Lohmeier | 159ccc1a17 | |
Felix Lohmeier | cb989c0410 | |
Felix Lohmeier | 65edbbf873 | |
Felix Lohmeier | acd10b3ebb | |
Felix Lohmeier | 4d259e30fe | |
Felix Lohmeier | 8d78f56cbf | |
Felix Lohmeier | 3760451b36 | |
@ -1,3 +1,8 @@
data
openrefine
*/harvest/*
*/refine/*
*/split/*
*/validate/*
*/zip/*
*/*.log
.openrefine
.task

145
README.md
@ -1,141 +1,14 @@
# Data integration for noah.nrw

Harvesting of OAI-PMH interfaces and transformation into METS/MODS for the portal [noah.nrw](https://noah.nrw/)

**:warning: This is a prototype for the beta version of the portal.**
> :warning: **Attention:** This repo is no longer up to date. The workflows are now split up as follows:

## Data flow
| Workflow | GitHub Repository |
|:------------------|-----------------------------------------------------------------------------------------|
| bielefeld | [noah-bielefeld-pub](https://github.com/opencultureconsulting/noah-bielefeld-pub) |
| muenster | [noah-muenster-miami](https://github.com/opencultureconsulting/noah-muenster-miami) |
| siegen | [noah-siegen-opus](https://github.com/opencultureconsulting/noah-siegen-opus) |
| wuppertal | [noah-wuppertal-elpub](https://github.com/opencultureconsulting/noah-wuppertal-elpub) |

![Data flow diagram](flowchart.svg)

## Tools used

* Harvesting (with cache): [metha](https://github.com/miku/metha/)
* Transformation: [OpenRefine](https://github.com/OpenRefine/OpenRefine) and [openrefine-client](https://github.com/opencultureconsulting/openrefine-client)
  * :warning: The use of [metafacture](https://github.com/metafacture) is planned for production.
* Task runner: [Task](https://github.com/go-task/task)

## System requirements

* GNU/Linux (tested with Fedora 32)
* Java 8+

## Installation

1. Clone the Git repository

```sh
git clone https://github.com/opencultureconsulting/noah.git
cd noah
```

2. [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) (requires Java 8+)

```sh
# install into subdirectory openrefine
wget -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
mkdir -p openrefine
tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
# do not launch the browser automatically
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini"
# increase the autosave period from 5 minutes to 24 hours (1440 minutes)
sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini"
```

3. [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)

```sh
# install into subdirectory openrefine
mkdir -p openrefine
wget -O openrefine/openrefine-client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
chmod +x openrefine/openrefine-client
```

4. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)

a) RPM-based (Fedora, CentOS, SLES, etc.)

```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha-0.2.20-0.x86_64.rpm
sudo dnf install ./metha-0.2.20-0.x86_64.rpm && rm metha-0.2.20-0.x86_64.rpm
```

b) DEB-based (Debian, Ubuntu, etc.)

```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha_0.2.20_amd64.deb
sudo apt install ./metha_0.2.20_amd64.deb && rm metha_0.2.20_amd64.deb
```

5. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)

a) RPM-based (Fedora, CentOS, SLES, etc.)

```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```

b) DEB-based (Debian, Ubuntu, etc.)

```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```

## Usage

* Harvest, transform and validate all data sources (in parallel)

```
task
```

* Harvest, transform and validate a single data source

```
task siegen:default
```

* Harvest, transform and validate two data sources (in parallel)

```
task --parallel siegen:default wuppertal:default
```

* Check the links of a data source

```
task siegen:linkcheck
```

* Delete the cache of a data source

```
task siegen:delete
```

* List available tasks

```
task --list
```

## Configuration

* Workflow for each data source in [tasks](tasks)
  * Example: [tasks/siegen.yml](tasks/siegen.yml)
* OpenRefine transformation rules in [rules](rules)
  * Example: [rules/siegen/hbz.json](rules/siegen/hbz.json)
* General tasks (e.g. validation) in [Taskfile.yml](Taskfile.yml)

## Known Issues

> too many open files

```
ulimit -n 10000
```

## OAI-PMH Data Provider

The transformed data is served by the file-based OAI-PMH data provider [oai_pmh](https://github.com/opencultureconsulting/oai_pmh). Installation and usage instructions can be found there.
The old technical approach is documented at https://github.com/opencultureconsulting/noah/tree/v0.3.

184
Taskfile.yml
@ -1,110 +1,196 @@
# https://taskfile.dev
# https://github.com/opencultureconsulting/openrefine-task-runner

version: '3'

output: prefixed

includes:
  siegen: ./tasks/siegen.yml
  wuppertal: ./tasks/wuppertal.yml
  bielefeld: bielefeld
  muenster: muenster
  siegen: siegen
  wuppertal: wuppertal

silent: true
output: prefixed

vars:
  DATE: '{{ now | date "2006-01-02"}}'

env:
  OPENREFINE:
    sh: readlink -e openrefine/refine
  OPENREFINE_CLIENT:
    sh: readlink -e openrefine/openrefine-client
    sh: readlink -m .openrefine/refine
  CLIENT:
    sh: readlink -m .openrefine/client

tasks:
  default:
    desc: all data sources (parallel)
    preconditions:
      - sh: test -n "$(command -v metha-sync)"
        msg: "requirement metha missing"
      - sh: test -n "$(command -v java)"
        msg: "requirement JAVA runtime environment (jre) missing"
      - sh: test -x "$OPENREFINE"
        msg: "requirement OpenRefine missing"
      - sh: test -x "$OPENREFINE_CLIENT"
        msg: "requirement openrefine-client missing"
      - sh: test -n "$(command -v curl)"
        msg: "requirement curl missing"
      - sh: test -n "$(command -v xmllint)"
        msg: "requirement xmllint missing"
    desc: execute all projects in parallel
    deps:
      - task: wuppertal:default
      - task: siegen:default
      - task: bielefeld:main
      - task: muenster:main
      - task: siegen:main
      - task: wuppertal:main

  install:
    desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
    cmds:
      - | # delete existing install and recreate folder
        rm -rf .openrefine
        mkdir -p .openrefine
      - > # download OpenRefine archive
        wget --no-verbose -O openrefine.tar.gz
        https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
      - | # install OpenRefine into subdirectory .openrefine
        tar -xzf openrefine.tar.gz -C .openrefine --strip 1
        rm openrefine.tar.gz
      - | # optimize OpenRefine for batch processing
        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 24 hours (1440 minutes)
      - > # download openrefine-client into subdirectory .openrefine
        wget --no-verbose -O .openrefine/client
        https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
      - chmod +x .openrefine/client # make client executable

  start:
    dir: ./{{.PROJECT}}/refine
    cmds:
      - | # verify that OpenRefine is installed
        if [ ! -f "$OPENREFINE" ]; then
          echo 1>&2 "OpenRefine missing; try task install"; exit 1
        fi
      - | # delete temporary files and log file of previous run
        rm -rf ./*.project* workspace.json
        rm -rf "{{.PROJECT}}.log"
      - > # launch OpenRefine with specific data directory and redirect its output to a log file
        "$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
        -d ../{{.PROJECT}}/refine
        >> "{{.PROJECT}}.log" 2>&1 &
      - | # wait until OpenRefine API is available
        timeout 30s bash -c "until
        wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
        do sleep 1
        done"

  stop:
    dir: ./{{.PROJECT}}/refine
    cmds:
      - | # shut down OpenRefine gracefully
        PID=$(lsof -t -i:{{.PORT}})
        kill $PID
        while ps -p $PID > /dev/null; do sleep 1; done
      - > # archive the OpenRefine project
        tar cfz
        "{{.PROJECT}}.openrefine.tar.gz"
        -C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
        .
      - rm -rf ./*.project* workspace.json # delete temporary files

  kill:
    dir: ./{{.PROJECT}}/refine
    cmds:
      - | # shut down OpenRefine immediately to save time and disk space
        PID=$(lsof -t -i:{{.PORT}})
        kill -9 $PID
        while ps -p $PID > /dev/null; do sleep 1; done
      - rm -rf ./*.project* workspace.json # delete temporary files

  check:
    dir: data/{{.PROJECT}}/refine
    dir: ./{{.PROJECT}}/refine
    cmds:
      - test -n "{{.PROJECT}}"; test -n "{{.MINIMUM}}"
      # check the OpenRefine log file for warnings and error messages
      - if grep -i 'exception\|error' openrefine.log; then echo 1>&2 "Log file $PWD/openrefine.log contains warnings!" && exit 1; fi
      # check whether the minimum number of 1250 records was generated
      - if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then echo 1>&2 "Unexpectedly low number of records in $PWD/{{.PROJECT}}.txt!" && exit 1; fi
      - | # find log file(s) and check for "exception" or "error"
        if grep -i 'exception\|error' $(find . -name '*.log'); then
          echo 1>&2 "log contains warnings!"; exit 1
        fi
      - | # check whether the minimum number of records was generated
        if (( {{.MINIMUM}} > $(grep -c recordIdentifier {{.PROJECT}}.txt) )); then
          echo 1>&2 "Unexpectedly low number of records in $PWD/{{.PROJECT}}.txt!"; exit 1
        fi

  split:
    dir: data/{{.PROJECT}}/split
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}/split
    cmds:
      - test -n "{{.PROJECT}}"
      # split into individual files
      - csplit -q ../refine/{{.PROJECT}}.txt --suppress-matched '/<!-- SPLIT -->/' "{*}"
      - csplit -s -z ../refine/{{.PROJECT}}.txt '/<mets:mets /' "{*}"
      # delete any existing XML files
      - rm -f *.xml
      # use the identifier as file name
      - for f in xx*; do mv "$f" "$(xmllint --xpath "//*[local-name(.) = 'recordIdentifier']/text()" "$f").xml"; done
    sources:
      - ../refine/{{.PROJECT}}.txt
    generates:
      - ./*.xml

  validate:
    dir: data/{{.PROJECT}}
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}/validate
    cmds:
      - test -n "{{.PROJECT}}"
      # validate against the METS schema
      - wget -q -nc https://www.loc.gov/standards/mets/mets.xsd
      - xmllint --schema mets.xsd --noout split/*.xml > validate.log 2>&1
      - xmllint --schema mets.xsd --noout ../split/*.xml > validate.log 2>&1
    sources:
      - ../split/*.xml
    generates:
      - validate.log

  zip:
    dir: data/{{.PROJECT}}
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}/zip
    cmds:
      - test -n "{{.PROJECT}}"
      # create a ZIP archive with timestamp
      - zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip split/*.xml
      - zip -q -FS -j {{.PROJECT}}_{{.DATE}}.zip ../split/*.xml
    sources:
      - ../split/*.xml
    generates:
      - '{{.PROJECT}}_{{.DATE}}.zip'

  diff:
    dir: data/{{.PROJECT}}
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}
    cmds:
      - test -n "{{.PROJECT}}"
      # compare the contents of the two most recent ZIP archives
      - unzip -q -d old $(ls -t *.zip | sed -n 2p)
      - unzip -q -d new $(ls -t *.zip | sed -n 1p)
      - if test -n "$(ls -t zip/*.zip | sed -n 2p)"; then unzip -q -d old $(ls -t zip/*.zip | sed -n 2p); unzip -q -d new $(ls -t zip/*.zip | sed -n 1p); fi
      - diff -d old new > diff.log || exit 0
      - rm -rf old new
      # check that the diff contains fewer than 500 lines
      - if (( 500 < $(wc -l <diff.log) )); then echo 1>&2 "Unexpectedly large changes in $PWD/diff.log!" && exit 1; fi
      # archive the diff
      - cp diff.log {{.PROJECT}}_{{.DATE}}.diff
    status:
      # do not run the task if fewer than two ZIP archives exist
      - test -z $(ls -t *.zip | sed -n 2p)
      - cp diff.log zip/{{.PROJECT}}_{{.DATE}}.diff
    sources:
      - split/*.xml
    generates:
      - diff.log

  linkcheck:
    dir: data/{{.PROJECT}}
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}
    cmds:
      - test -n "{{.PROJECT}}"
      # extract links
      - xmllint --xpath '//@*[local-name(.) = "href"]' split/*.xml | cut -d '"' -f2 > links.txt
      # determine the HTTP status code of all links
      - curl --silent --head --write-out "%{http_code} %{url_effective}\n" $(while read line; do echo "-o /dev/null $line"; done < links.txt) > linkcheck.log
      - rm -rf links.txt
      - grep -o 'href="[^"]*"' split/*.xml | sed 's/:href=/\t/' | tr -d '"' | sort -k 2 --unique > links.txt
      # determine the HTTP status code
      - awk '{ print "url = " $2 "\noutput = /dev/null"; }' links.txt > curl.cfg
      - curl --silent --head --location --write-out "%{http_code}\t%{url_effective}\n" --config curl.cfg > curl.log
      # build a table with status code, effective URL, file name and start URL
      - paste curl.log links.txt > linkcheck.log
      - rm -rf curl.cfg curl.log links.txt
      # check the log file for status codes != 2XX
      - if grep '^[^2]' linkcheck.log; then echo 1>&2 "Log file $PWD/linkcheck.log contains problematic status codes!" && exit 1; fi
    sources:
      - split/*.xml
    generates:
      - linkcheck.log

  delete:
    dir: data/{{.PROJECT}}
    label: '{{.TASK}}-{{.PROJECT}}'
    dir: ./{{.PROJECT}}
    cmds:
      - test -n "{{.PROJECT}}"
      - rm -rf harvest
      - rm -rf refine
      - rm -rf split
      - rm -rf validate
      - rm -f diff.log

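The last step of the `linkcheck` task in the Taskfile above flags every log line whose status code does not start with 2. The pattern can be tried in isolation; the following is a minimal sketch, assuming a tab-separated log with the status code in the first column. The sample URLs and the helper name `failed_links` are hypothetical and not part of the Taskfile.

```shell
# Hypothetical sample of a linkcheck.log: status code, then URL, tab-separated
sample_log=$(printf '200\thttps://example.org/a.pdf\n404\thttps://example.org/b.pdf\n500\thttps://example.org/c.pdf\n')

# Print every line whose first character is not "2" (same pattern as in the task)
failed_links() {
  printf '%s\n' "$1" | grep '^[^2]'
}

failed_links "$sample_log"
```

Because `grep` exits with a non-zero status when nothing matches, the same call can serve as the condition of an `if`, which is how the task uses it.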
@ -0,0 +1,143 @@
version: '3'

tasks:
  main:
    desc: pub UB Bielefeld
    vars:
      MINIMUM: 12000 # minimum number of records to expect
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # the following tasks prefixed with ":" are defined in Taskfile.yml and are the same for all data sources
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: ./{{.PROJECT}}/harvest
    desc: harvest pub UB Bielefeld
    vars:
      URL: https://pub.uni-bielefeld.de/oai
      FORMAT: mods
      SET: open_access
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} --set {{.SET}} --no-intervals {{.URL}} # selective harvesting with metha fails for this endpoint, hence the option --no-intervals
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} --set {{.SET}} {{.URL}} > {{.PROJECT}}.xml
    status:
      - test -f ./{{.PROJECT}}.xml # since selective harvesting does not work, this status check tests whether the file exists, to avoid downloading a full data dump on every run; until further notice, updates must be triggered manually with task bielefeld:harvest --force

  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3337 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import (requires the absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # preprocessing: move identifier into first column id; delete unneeded columns (without distinguishing features); rename remaining columns (remove path)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # delete records without a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nur-mit-pdf.json
        > {{.LOG}}
      - > # index: generate column index with row.record.index
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/index.json
        > {{.LOG}}
      - > # sorting: nonSort for the first element in title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - > # extract ORCID iDs from name - description
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/orcid.json
        > {{.LOG}}
      - > # convert role statements in name - role - roleTerm to MARC relators (persons only)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/roleterm.json
        > {{.LOG}}
      - > # extract doctype for mods:genre from setSpec in the OAI header
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doctype.json
        > {{.LOG}}
      - > # derive Visual Library doctype from doctype
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vldoctype.json
        > {{.LOG}}
      - > # extract ddc for mods:classification from setSpec in the OAI header
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ddc.json
        > {{.LOG}}
      - > # encode special characters in relatedItem - location - url
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/urlencode.json
        > {{.LOG}}
      - > # set internetMediaType uniformly to application/pdf for URLs with file extension .pdf
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/mime.json
        > {{.LOG}}
      - > # add rights statements from dc:rights in format OAI_DC
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/rights.json
        > {{.LOG}}
      - > # enrich with HT number via lobid-resources: OR search for multiple URNs; if there are multiple hits, only the first is used
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - | # export to METS:MODS using templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  linkcheck:
    desc: check links of pub UB Bielefeld
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: delete cache of pub UB Bielefeld
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"

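The project Taskfile above derives `PROJECT` from the task name with `{{splitList ":" .TASK | first}}`, and its `default` task builds the task name from the directory name with `${PWD##*/}`. Both string operations can be reproduced in plain shell. This is only a sketch: the helper names and the example path are hypothetical, and it uses shell parameter expansion rather than Task's Go templates.

```shell
# Hypothetical helper: everything before the first ":" (the task namespace)
project_from_task() {
  printf '%s\n' "${1%%:*}"
}

# Hypothetical helper: last path component plus ":main",
# as the default task does with ${PWD##*/}
task_from_dir() {
  printf '%s:main\n' "${1##*/}"
}

project_from_task "bielefeld:refine"   # prints: bielefeld
task_from_dir "/srv/noah/bielefeld"    # prints: bielefeld:main
```

This works because the namespace given in `includes` is identical to the directory name, as the comment in the `main` task notes.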
@ -0,0 +1,35 @@
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "setSpec",
    "expression": "grel:filter(row.record.cells[columnName].value,v,v.contains('ddc'))[0].replace('ddc:','')",
    "onError": "set-to-blank",
    "newColumnName": "ddc",
    "columnInsertIndex": 39,
    "description": "Create column ddc at index 39 based on column setSpec using expression grel:filter(row.record.cells[columnName].value,v,v.contains('ddc'))[0].replace('ddc:','')"
  }
]

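The GREL expression in the operation above takes all `setSpec` values of a record, keeps the first one containing `ddc` and strips the `ddc:` prefix. A rough shell equivalent, for illustration only: the helper name `extract_ddc` and the sample values are hypothetical, and the real transformation runs inside OpenRefine, not in the shell.

```shell
# Hypothetical helper: print the first argument containing "ddc",
# with every "ddc:" removed; prints nothing if no argument matches
extract_ddc() {
  printf '%s\n' "$@" | grep 'ddc' | head -n 1 | sed 's/ddc://g'
}

# Assumed sample setSpec values of one record
extract_ddc "open_access" "ddc:530" "doc-type:article"
```

An empty result corresponds to the `"onError": "set-to-blank"` setting, which leaves the cell blank when no matching `setSpec` exists.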
@ -0,0 +1,55 @@
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "setSpec",
    "expression": "grel:filter(row.record.cells[columnName].value,v,v.contains('doc-type'))[0].replace('doc-type:','')",
    "onError": "set-to-blank",
    "newColumnName": "doctype",
    "columnInsertIndex": 39,
    "description": "Create column doctype at index 39 based on column setSpec using expression grel:filter(row.record.cells[columnName].value,v,v.contains('doc-type'))[0].replace('doc-type:','')"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "doctype",
    "expression": "value",
    "edits": [
      {
        "from": [
          "other"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "Other"
      }
    ],
    "description": "Mass edit cells in column doctype"
  }
]

@ -0,0 +1,84 @@
[
  {
    "op": "core/column-addition-by-fetching-urls",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "relatedItem - identifier - type",
          "expression": "value",
          "columnName": "relatedItem - identifier - type",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "urn",
                "l": "urn"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "relatedItem - identifier",
    "urlExpression": "grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'",
    "onError": "set-to-blank",
    "newColumnName": "hbz",
    "columnInsertIndex": 13,
    "delay": 0,
    "cacheResponses": true,
    "httpHeadersJson": [
      {
        "name": "authorization",
        "value": ""
      },
      {
        "name": "user-agent",
        "value": "OpenRefine 3.4.1 [437dc4d]"
      },
      {
        "name": "accept",
        "value": "*/*"
      }
    ],
    "description": "Create column hbz at index 13 by fetching URLs based on column relatedItem - identifier using expression grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "relatedItem - identifier - type",
          "expression": "value",
          "columnName": "relatedItem - identifier - type",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "urn",
                "l": "urn"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "hbz",
    "expression": "grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column hbz using expression grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)"
  }
]

@ -0,0 +1,15 @@
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "record-based"
    },
    "baseColumnName": "id",
    "expression": "grel:row.record.index",
    "onError": "set-to-blank",
    "newColumnName": "index",
    "columnInsertIndex": 1,
    "description": "Create column index at index 1 based on column id using expression grel:row.record.index"
  }
]

@ -0,0 +1,25 @@
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "text",
          "name": "relatedItem - location - url - displayLabel",
          "columnName": "relatedItem - location - url - displayLabel",
          "query": "\\.pdf$",
          "mode": "regex",
          "caseSensitive": false,
          "invert": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "relatedItem - physicalDescription - internetMediaType",
    "expression": "grel:'application/pdf'",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column relatedItem - physicalDescription - internetMediaType using expression grel:'application/pdf'"
  }
]

@ -0,0 +1,85 @@
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "titleInfo - title",
    "expression": "grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))",
    "onError": "set-to-blank",
    "newColumnName": "nonsort",
    "columnInsertIndex": 27
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        },
        {
          "type": "list",
          "name": "nonsort",
          "expression": "isBlank(value)",
          "columnName": "nonsort",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "titleInfo - title",
    "expression": "grel:value.split(' ').slice(1).join(' ')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]

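The two operations above split a leading article off the title: the first stores it (with a trailing space) in `nonsort`, the second removes the first word from `titleInfo - title` wherever `nonsort` is not blank. The logic can be sketched in shell as follows. The helper name `split_nonsort` and the `|`-separated output format are hypothetical and serve only to show both resulting values on one line.

```shell
# Article list copied from the GREL expression, padded with spaces for matching
ARTICLES=" a das dem den der des die ein eine einem einen einer eines the "

# Hypothetical helper: print "<nonsort>|<remaining title>"
split_nonsort() {
  title=$1
  first=${title%% *}                                      # first word of the title
  lower=$(printf '%s' "$first" | tr '[:upper:]' '[:lower:]')
  case $ARTICLES in
    *" $lower "*)                                         # first word is an article
      case $title in
        *" "*) rest=${title#* } ;;                        # drop the first word
        *) rest= ;;                                       # title was the article only
      esac
      printf '%s |%s\n' "$first" "$rest"
      ;;
    *)
      printf '|%s\n' "$title"                             # nonsort stays blank
      ;;
  esac
}

split_nonsort "Die Physik der Sterne"
split_nonsort "Physik heute"
```

As in the GREL version, the comparison is case-insensitive, while the stored article keeps its original capitalisation.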
@ -0,0 +1,30 @@
[
  {
    "op": "core/row-removal",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "relatedItem - location - url - displayLabel",
          "expression": "grel:isNonBlank(filter(row.record.cells[columnName].value,v,v.toLowercase().contains('.pdf')).join(''))",
          "columnName": "relatedItem - location - url - displayLabel",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "record-based"
    },
    "description": "Remove rows"
  }
]

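The facet in the operation above selects records in which none of the `displayLabel` values contains `.pdf` (case-insensitive); those records are then removed. A minimal shell sketch of the same test, with a hypothetical helper name and made-up labels:

```shell
# Hypothetical helper: succeed if any argument contains ".pdf", ignoring case
has_pdf() {
  for label in "$@"; do
    case $(printf '%s' "$label" | tr '[:upper:]' '[:lower:]') in
      *.pdf*) return 0 ;;
    esac
  done
  return 1
}

# Assumed display labels of two records
if has_pdf "cover.jpg" "Thesis.PDF"; then echo keep; else echo remove; fi
if has_pdf "cover.jpg"; then echo keep; else echo remove; fi
```

The first record would be kept and the second removed, which mirrors the facet selection `false` combined with `core/row-removal` in record mode.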
@ -0,0 +1,35 @@
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "name - description - type",
          "expression": "value",
          "columnName": "name - description - type",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "orcid",
                "l": "orcid"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "name - description",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "orcid",
    "columnInsertIndex": 9,
    "description": "Create column orcid at index 9 based on column name - description using expression grel:value"
  }
]

@ -0,0 +1,274 @@
|
|||
[
  {
    "op": "core/column-addition-by-fetching-urls",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "baseColumnName": "id",
    "urlExpression": "grel:'https://pub.uni-bielefeld.de/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=' + value",
    "onError": "set-to-blank",
    "newColumnName": "rights",
    "columnInsertIndex": 1,
    "delay": 0,
    "cacheResponses": true,
    "httpHeadersJson": [
      {
        "name": "authorization",
        "value": ""
      },
      {
        "name": "user-agent",
        "value": "OpenRefine 3.4.1 [437dc4d]"
      },
      {
        "name": "accept",
        "value": "*/*"
      }
    ],
    "description": "Create column rights at index 1 by fetching URLs based on column id using expression grel:'https://pub.uni-bielefeld.de/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=' + value"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "id",
          "expression": "isBlank(value)",
          "columnName": "id",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": false,
                "l": "false"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "grel:forEach(value.parseXml().select('dc|rights'),v,v.xmlText()).join(',')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column rights using expression grel:forEach(value.parseXml().select('dc|rights'),v,v.xmlText()).join(',')"
  },
  {
    "op": "core/multivalued-cell-split",
    "columnName": "rights",
    "keyColumnName": "id",
    "mode": "separator",
    "separator": ",",
    "regex": false,
    "description": "Split multi-valued cells in column rights"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "rights",
          "expression": "value",
          "columnName": "rights",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "dppl_3_0",
                "l": "dppl_3_0"
              }
            },
            {
              "v": {
                "v": "info:eu-repo/semantics/openAccess",
                "l": "info:eu-repo/semantics/openAccess"
              }
            },
            {
              "v": {
                "v": "cc_0_3_0",
                "l": "cc_0_3_0"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "grel:null",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column rights using expression grel:null"
  },
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "rights",
    "expression": "grel:value",
    "onError": "set-to-blank",
    "newColumnName": "rights_url",
    "columnInsertIndex": 2,
    "description": "Create column rights_url at index 2 based on column rights using expression grel:value"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "text",
          "name": "rights",
          "columnName": "rights",
          "query": "creativecommons",
          "mode": "text",
          "caseSensitive": false,
          "invert": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "grel:value.replace('https://','').replace('http://','').replace('creativecommons.org/licenses/','CC ').replace('/',' ').trim().toUppercase()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column rights using expression grel:value.replace('https://','').replace('http://','').replace('creativecommons.org/licenses/','CC ').replace('/',' ').trim().toUppercase()"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "value",
    "edits": [
      {
        "from": [
          "CREATIVECOMMONS.ORG PUBLICDOMAIN ZERO 1.0"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "CC0 1.0"
      }
    ],
    "description": "Mass edit cells in column rights"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "value",
    "edits": [
      {
        "from": [
          "https://opendatacommons.org/licenses/by/summary/index.html"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "ODC-By"
      }
    ],
    "description": "Mass edit cells in column rights"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "value",
    "edits": [
      {
        "from": [
          "https://opendatacommons.org/licenses/odbl/summary/index.html"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "ODbL"
      }
    ],
    "description": "Mass edit cells in column rights"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "value",
    "edits": [
      {
        "from": [
          "https://opendatacommons.org/licenses/pddl/summary/index.html"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "PDDL"
      }
    ],
    "description": "Mass edit cells in column rights"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "rights",
    "expression": "value",
    "edits": [
      {
        "from": [
          "https://rightsstatements.org/vocab/InC/1.0/"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "Urheberrechtsschutz"
      }
    ],
    "description": "Mass edit cells in column rights"
  }
]
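
Note: after `rights_url` has preserved the original value, the operations above turn Creative Commons URLs in `rights` into short labels and map a few other licence URLs through mass edits. The string handling can be sketched in Python as follows; `rights_label` and `MASS_EDITS` are hypothetical names, and the sketch covers only the label step, not the OAI fetch or the cell split.

```python
# Mass-edit pairs copied from the operations above.
MASS_EDITS = {
    "CREATIVECOMMONS.ORG PUBLICDOMAIN ZERO 1.0": "CC0 1.0",
    "https://opendatacommons.org/licenses/by/summary/index.html": "ODC-By",
    "https://opendatacommons.org/licenses/odbl/summary/index.html": "ODbL",
    "https://opendatacommons.org/licenses/pddl/summary/index.html": "PDDL",
    "https://rightsstatements.org/vocab/InC/1.0/": "Urheberrechtsschutz",
}


def rights_label(value):
    """Derive a display label from a rights URL, following the GREL replace chain."""
    label = value
    if "creativecommons" in value.lower():  # text facet, case-insensitive
        label = (
            value.replace("https://", "")
            .replace("http://", "")
            .replace("creativecommons.org/licenses/", "CC ")
            .replace("/", " ")
            .strip()
            .upper()
        )
    # Values without a mass-edit entry are left unchanged.
    return MASS_EDITS.get(label, label)


print(rights_label("https://creativecommons.org/licenses/by/4.0/"))
```

The public-domain dedication does not live under `/licenses/`, which is why it first becomes `CREATIVECOMMONS.ORG PUBLICDOMAIN ZERO 1.0` and is then corrected by a mass edit.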
|
|
@ -0,0 +1,62 @@
|
|||
[
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "name - role - roleTerm",
    "expression": "value",
    "edits": [
      {
        "from": [
          "author"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "aut"
      }
    ],
    "description": "Mass edit cells in column name - role - roleTerm"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "name - role - roleTerm",
    "expression": "value",
    "edits": [
      {
        "from": [
          "editor"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "edt"
      }
    ],
    "description": "Mass edit cells in column name - role - roleTerm"
  },
  {
    "op": "core/mass-edit",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "name - role - roleTerm",
    "expression": "value",
    "edits": [
      {
        "from": [
          "supervisor"
        ],
        "fromBlank": false,
        "fromError": false,
        "to": "dgs"
      }
    ],
    "description": "Mass edit cells in column name - role - roleTerm"
  }
]
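
Note: the three mass edits above replace role terms with the codes that the export template later writes with `authority="marcrelator"`. Expressed as a lookup table in Python (a sketch; `ROLE_CODES` and `role_code` are illustrative names):

```python
# Role term -> code, copied from the mass edits above.
ROLE_CODES = {
    "author": "aut",
    "editor": "edt",
    "supervisor": "dgs",
}


def role_code(term):
    """Return the mapped code; unmapped terms pass through unchanged, as in a mass edit."""
    return ROLE_CODES.get(term, term)


print([role_code(t) for t in ["author", "editor", "supervisor", "translator"]])
```

Any role term outside these three would reach the template unmapped, so it may be worth checking the harvested data for other values.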
|
|
@ -0,0 +1,130 @@
|
|||
{{
if(row.index - row.record.fromRowIndex == 0,
with(cross(cells['index'].value, 'bielefeld' , 'index'), rows,
'<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">' + '\n' +
' <mets:dmdSec ID="' + 'DMD' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">' + '\n' +
' <mets:xmlData>' + '\n' +
' <mods xmlns="http://www.loc.gov/mods/v3" version="3.7" xmlns:vl="http://visuallibrary.net/vl">' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['titleInfo - title'].value)), r,
' <titleInfo' + forNonBlank(r.cells['titleInfo - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + '\n' +
forNonBlank(r.cells['nonsort'].value, v,
' <nonSort>' + v.escape('xml') + '</nonSort>' + '\n'
, '') +
forNonBlank(r.cells['titleInfo - title'].value, v,
' <title>' + v.escape('xml') + '</title>' + '\n'
, '') +
' </titleInfo>' + '\n'
).join('') +
forEachIndex(rows, i, r, if(r.cells['name - type'].value == 'personal',
' <name type="personal"' + '>' + '\n' +
' <namePart type="' + r.cells['name - namePart - type'].value.escape('xml') + '">' + r.cells['name - namePart'].value.escape('xml') + '</namePart>' + '\n' +
if(and(isBlank(rows[i+1].cells['name - type'].value), isNonBlank(rows[i+1].cells['name - namePart - type'].value)),
' <namePart type="' + rows[i+1].cells['name - namePart - type'].value.escape('xml') + '">' + rows[i+1].cells['name - namePart'].value.escape('xml') + '</namePart>' + '\n'
, '') +
forNonBlank(r.cells['orcid'].value, v,
' <nameIdentifier type="orcid" typeURI="http://orcid.org">' + v.escape('xml') + '</nameIdentifier>' + '\n'
, '') +
forNonBlank(r.cells['name - role - roleTerm'].value, v,
' <role>' + '\n' +
' <roleTerm type="code" authority="marcrelator">' + v.escape('xml') + '</roleTerm>' + '\n' +
' </role>' + '\n'
, '') +
' </name>' + '\n'
, '')).join('') +
' <typeOfResource>text</typeOfResource>' + '\n' +
' <genre authority="dini">' + cells['doctype'].value.escape('xml') + '</genre>' + '\n' +
' <originInfo>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['originInfo - dateIssued'].value)), r,
' <dateIssued encoding="w3cdtf">' + r.cells['originInfo - dateIssued'].value.escape('xml') + '</dateIssued>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['dateOther'].value)), r,
' <dateOther encoding="w3cdtf"' + forNonBlank(r.cells['dateOther - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['dateOther'].value.escape('xml') + '</dateOther>' + '\n'
).join('') +
' </originInfo>' + '\n' +
' <language>' + '\n' +
' <languageTerm type="code" authority="iso639-2b">' + cells['language - languageTerm'].value.escape('xml') + '</languageTerm>' + '\n' +
' </language>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['abstract'].value)), r,
' <abstract' + forNonBlank(r.cells['abstract - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + '>' + r.cells['abstract'].value.escape('xml') + '</abstract>' + '\n'
).join('') +
if(isNonBlank(row.record.cells['subject - topic'].value),
' <subject>' + '\n'
, '') +
forEach(filter(rows, r, isNonBlank(r.cells['subject - topic'].value)), r,
' <topic>' + r.cells['subject - topic'].value.escape('xml') + '</topic>' + '\n'
).join('') +
if(isNonBlank(row.record.cells['subject - topic'].value),
' </subject>' + '\n'
, '') +
forEach(filter(rows, r, isNonBlank(r.cells['ddc'].value)), r,
' <classification authority="ddc">' + r.cells['ddc'].value.escape('xml') + '</classification>' + '\n'
).join('') +
forEachIndex(rows, i, r, if(and(r.cells['relatedItem - type'].value == 'host', r.cells['relatedItem - part - detail - type'].value == 'volume'),
' <relatedItem type="host">' + '\n' +
' <titleInfo>' + '\n' +
' <title>' + r.cells['relatedItem - titleInfo - title'].value.escape('xml') + '</title>' + '\n' +
' </titleInfo>' + '\n' +
' <part>' + '\n' +
' <detail type="volume">' + '\n' +
' <number>' + r.cells['relatedItem - part - detail - number'].value.escape('xml') + '</number>' + '\n' +
' </detail>' + '\n' +
forNonBlank(rows[i+1].cells['relatedItem - part - detail - number'].value, v,
' <detail type="issue">' + '\n' +
' <number>' + v.escape('xml') + '</number>' + '\n' +
' </detail>' + '\n'
, '') +
forNonBlank(r.cells['relatedItem - part - extent'].value.split('-')[0], v,
' <extent unit="page">' + '\n' +
' <start>' + v.escape('xml') + '</start>' + '\n' +
forNonBlank(r.cells['relatedItem - part - extent'].value.split('-')[1], x,
' <end>' + x.escape('xml') + '</end>' + '\n'
, '') +
' </extent>' + '\n'
, '') +
' </part>' + '\n' +
' </relatedItem>' + '\n'
, '')).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['relatedItem - identifier'].value)), r,
' <identifier' + forNonBlank(r.cells['relatedItem - identifier - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['relatedItem - identifier'].value.escape('xml') + '</identifier>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['hbz'].value)), r,
' <identifier type="sys">' + r.cells['hbz'].value.escape('xml') + '</identifier>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['rights_url'].value)), r,
' <accessCondition type="use and reproduction" xlink:href="' + r.cells['rights_url'].value.escape('xml') + '">' + r.cells['rights'].value.escape('xml') + '</accessCondition>' + '\n'
).join('') +
' <recordInfo>' + '\n' +
' <recordIdentifier>' + 'bielefeld_pub_' + cells['id'].value.escape('xml') + '</recordIdentifier>' + '\n' +
' </recordInfo>' + '\n' +
forNonBlank(cells['vldoctype'].value, v,
' <extension>' + '\n' +
' <vl:doctype>' + v.escape('xml') + '</vl:doctype>' + '\n' +
' </extension>' + '\n'
, '') +
' </mods>' + '\n' +
' </mets:xmlData>' + '\n' +
' </mets:mdWrap>' + '\n' +
' </mets:dmdSec>' + '\n' +
' <mets:fileSec>' + '\n' +
forEachIndex(filter(rows, r, and(isNonBlank(r.cells['relatedItem - location - url'].value), r.cells['relatedItem - type'].value == 'constituent')), i, r,
' <mets:fileGrp USE="' + if(r.cells['relatedItem - location - url'].value == filter(row.record.cells['relatedItem - location - url'].value, v, v.toLowercase().contains('.pdf'))[0], 'pdf upload', 'generic file') + '">' + '\n' +
' <mets:file MIMETYPE="' + r.cells['relatedItem - physicalDescription - internetMediaType'].value.escape('xml') + '" ID="FILE' + i + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:FLocat xlink:href="' + r.cells['relatedItem - location - url'].value.escape('xml') + '" LOCTYPE="URL"/>' + '\n' +
' </mets:file>' + '\n' +
' </mets:fileGrp>' + '\n'
).join('') +
' </mets:fileSec>' + '\n' +
' <mets:structMap TYPE="LOGICAL">' + '\n' +
' <mets:div TYPE="document" ID="' + 'bielefeld_pub_' + cells['id'].value.escape('xml') + '" DMDID="' + 'DMD' + cells['id'].value.escape('xml') + '">' + '\n' +
' <mets:fptr FILEID="' + 'FILE0' + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '"/>' + '\n' +
forEachIndex(filter(rows, r, and(isNonBlank(r.cells['relatedItem - location - url'].value), r.cells['relatedItem - type'].value == 'constituent')).slice(1), i, r,
' <mets:div TYPE="part" ID="' + 'PART' + (i+1) + '_' + cells['id'].value.escape('xml') + '" LABEL="' + r.cells['relatedItem - location - url - displayLabel'].value.escape('xml') + '">' + '\n' +
' <mets:fptr FILEID="' + 'FILE' + (i+1) + '_bielefeld_pub_' + cells['id'].value.escape('xml') + '"/>' + '\n' +
' </mets:div>' + '\n'
).join('') +
' </mets:div>' + '\n' +
' </mets:structMap>' + '\n' +
'</mets:mets>' + '\n'
), '')
}}
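
Note: inside the host `relatedItem`, the template above splits `relatedItem - part - extent` on `-` and emits `<start>` and, when present, `<end>`. A small Python sketch of just that step follows. `extent_xml` is a hypothetical helper, its indentation is illustrative, and `xml.sax.saxutils.escape` stands in for GREL's `escape('xml')`, which may escape a slightly different character set.

```python
from xml.sax.saxutils import escape


def extent_xml(extent):
    """Build the <extent unit="page"> fragment from a value such as '12-34'."""
    parts = (extent or "").split("-")
    start = parts[0]
    end = parts[1] if len(parts) > 1 else ""
    if not start:
        # Mirrors the outer forNonBlank: no start page, no <extent> element.
        return ""
    lines = ['<extent unit="page">', "  <start>" + escape(start) + "</start>"]
    if end:
        lines.append("  <end>" + escape(end) + "</end>")
    lines.append("</extent>")
    return "\n".join(lines) + "\n"


print(extent_xml("12-34"))
```

A value with more than one hyphen, such as an article number containing `-`, would be cut at the first two parts, in the template as in this sketch.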
|
|
@ -0,0 +1,35 @@
|
|||
[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [
        {
          "type": "list",
          "name": "relatedItem - type",
          "expression": "value",
          "columnName": "relatedItem - type",
          "invert": false,
          "omitBlank": false,
          "omitError": false,
          "selection": [
            {
              "v": {
                "v": "constituent",
                "l": "constituent"
              }
            }
          ],
          "selectBlank": false,
          "selectError": false
        }
      ],
      "mode": "row-based"
    },
    "columnName": "relatedItem - location - url",
    "expression": "grel:'https://' + forEach(value.replace('https://','').split('/'),v,v.escape('url')).join('/')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column relatedItem - location - url using expression grel:'https://' + forEach(value.replace('https://','').split('/'),v,v.escape('url')).join('/')"
  }
]
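
Note: the transform above escapes each path segment of a download URL separately, so that the `/` separators survive. A hedged Python equivalent follows. It uses `urllib.parse.quote_plus` as a stand-in for GREL's `escape('url')`, so the exact set of escaped characters may differ from OpenRefine's; `escape_url_segments` is a hypothetical name and the URL is a made-up example.

```python
from urllib.parse import quote_plus


def escape_url_segments(url):
    """Escape every '/'-separated segment of an https URL, keeping the slashes."""
    rest = url.replace("https://", "")
    return "https://" + "/".join(quote_plus(segment) for segment in rest.split("/"))


print(escape_url_segments("https://example.org/download/123/my file.pdf"))
```

Like the GREL expression, this always prepends `https://` and assumes the value does not already contain percent-encoded characters, which would otherwise be encoded a second time.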
|
|
@ -0,0 +1,15 @@
|
|||
[
  {
    "op": "core/column-addition",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "baseColumnName": "doctype",
    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
    "onError": "set-to-blank",
    "newColumnName": "vldoctype",
    "columnInsertIndex": 3,
    "description": "Create column vldoctype"
  }
]
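
Note: the GREL expression above is a lookup table written as a list of pairs; it yields the matching second element, or an empty string when `doctype` has no entry. The same mapping as a Python dictionary (a sketch; `VL_DOCTYPES` and `vl_doctype` are illustrative names, the pairs are copied from the expression):

```python
VL_DOCTYPES = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "conferenceObject": "conference_object",
    "CourseMaterial": "course_material",
    "doctoralThesis": "oaDoctoralThesis",
    "lecture": "lecture",
    "Manuscript": "handwritten",
    "masterThesis": "oaMasterThesis",
    "MusicalNotation": "notated music",
    "PeriodicalPart": "journal issue",
    "preprint": "oaPreprint",
    "report": "oaBdArticle",
    "ResearchData": "research_data",
    "review": "review",
    "StudyThesis": "oaStudyThesis",
    "Other": "oaBdOther",
    "workingPaper": "working_paper",
}


def vl_doctype(doctype):
    """Return the mapped document type, or '' when there is no mapping (lookup is case-sensitive)."""
    return VL_DOCTYPES.get(doctype, "")


print(vl_doctype("doctoralThesis"))
```

Since unmatched values produce a blank cell, the export template's `forNonBlank(cells['vldoctype'].value, ...)` simply omits the `<extension>` element for them.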
|
|
@ -0,0 +1,395 @@
|
|||
[
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - recordInfo - recordIdentifier",
    "newColumnName": "id",
    "description": "Rename column Record - metadata - mods - recordInfo - recordIdentifier to id"
  },
  {
    "op": "core/column-move",
    "columnName": "id",
    "index": 0,
    "description": "Move column id to position 0"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - header - identifier",
    "description": "Remove column Record - header - identifier"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - header - datestamp",
    "description": "Remove column Record - header - datestamp"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - version",
    "description": "Remove column Record - metadata - mods - version"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - xsi:schemaLocation",
    "description": "Remove column Record - metadata - mods - xsi:schemaLocation"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - name - role - roleTerm - type",
    "description": "Remove column Record - metadata - mods - name - role - roleTerm - type"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - name - description - xsi:type",
    "description": "Remove column Record - metadata - mods - name - description - xsi:type"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - relatedItem - accessCondition",
    "description": "Remove column Record - metadata - mods - relatedItem - accessCondition"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - relatedItem - accessCondition - type",
    "description": "Remove column Record - metadata - mods - relatedItem - accessCondition - type"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - apa",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - apa"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - ama",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - ama"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - mla",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - mla"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - ieee",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - ieee"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - dgps",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - dgps"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - bio1",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - bio1"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - wels",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - wels"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - lncs",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - lncs"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - chicago",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - chicago"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - default",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - default"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - harvard1",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - harvard1"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - frontiers",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - frontiers"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - apa_indent",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - apa_indent"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - angewandte-chemie",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - angewandte-chemie"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - extension - bibliographicCitation - aps",
    "description": "Remove column Record - metadata - mods - extension - bibliographicCitation - aps"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - originInfo - dateIssued - encoding",
    "description": "Remove column Record - metadata - mods - originInfo - dateIssued - encoding"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - originInfo - place - placeTerm - type",
    "description": "Remove column Record - metadata - mods - originInfo - place - placeTerm - type"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - recordInfo - recordChangeDate",
    "description": "Remove column Record - metadata - mods - recordInfo - recordChangeDate"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - recordInfo - recordChangeDate - encoding",
    "description": "Remove column Record - metadata - mods - recordInfo - recordChangeDate - encoding"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - recordInfo - recordCreationDate",
    "description": "Remove column Record - metadata - mods - recordInfo - recordCreationDate"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - recordInfo - recordCreationDate - encoding",
    "description": "Remove column Record - metadata - mods - recordInfo - recordCreationDate - encoding"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - language - languageTerm - type",
    "description": "Remove column Record - metadata - mods - language - languageTerm - type"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - language - languageTerm - authority",
    "description": "Remove column Record - metadata - mods - language - languageTerm - authority"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - dateOther - encoding",
    "description": "Remove column Record - metadata - mods - dateOther - encoding"
  },
  {
    "op": "core/column-removal",
    "columnName": "Record - metadata - mods - targetAudience",
    "description": "Remove column Record - metadata - mods - targetAudience"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - type",
    "newColumnName": "name - type",
    "description": "Rename column Record - metadata - mods - name - type to name - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - namePart",
    "newColumnName": "name - namePart",
    "description": "Rename column Record - metadata - mods - name - namePart to name - namePart"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - namePart - type",
    "newColumnName": "name - namePart - type",
    "description": "Rename column Record - metadata - mods - name - namePart - type to name - namePart - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - role - roleTerm",
    "newColumnName": "name - role - roleTerm",
    "description": "Rename column Record - metadata - mods - name - role - roleTerm to name - role - roleTerm"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - identifier",
    "newColumnName": "name - identifier",
    "description": "Rename column Record - metadata - mods - name - identifier to name - identifier"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - identifier - type",
    "newColumnName": "name - identifier - type",
    "description": "Rename column Record - metadata - mods - name - identifier - type to name - identifier - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - description",
    "newColumnName": "name - description",
    "description": "Rename column Record - metadata - mods - name - description to name - description"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - name - description - type",
    "newColumnName": "name - description - type",
    "description": "Rename column Record - metadata - mods - name - description - type to name - description - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - type",
    "newColumnName": "relatedItem - type",
    "description": "Rename column Record - metadata - mods - relatedItem - type to relatedItem - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - identifier",
    "newColumnName": "relatedItem - identifier",
    "description": "Rename column Record - metadata - mods - relatedItem - identifier to relatedItem - identifier"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - identifier - type",
    "newColumnName": "relatedItem - identifier - type",
    "description": "Rename column Record - metadata - mods - relatedItem - identifier - type to relatedItem - identifier - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - location - url",
    "newColumnName": "relatedItem - location - url",
    "description": "Rename column Record - metadata - mods - relatedItem - location - url to relatedItem - location - url"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - location - url - displayLabel",
    "newColumnName": "relatedItem - location - url - displayLabel",
    "description": "Rename column Record - metadata - mods - relatedItem - location - url - displayLabel to relatedItem - location - url - displayLabel"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - physicalDescription - internetMediaType",
    "newColumnName": "relatedItem - physicalDescription - internetMediaType",
    "description": "Rename column Record - metadata - mods - relatedItem - physicalDescription - internetMediaType to relatedItem - physicalDescription - internetMediaType"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - part - detail - type",
    "newColumnName": "relatedItem - part - detail - type",
    "description": "Rename column Record - metadata - mods - relatedItem - part - detail - type to relatedItem - part - detail - type"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - part - detail - number",
    "newColumnName": "relatedItem - part - detail - number",
    "description": "Rename column Record - metadata - mods - relatedItem - part - detail - number to relatedItem - part - detail - number"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - part - extent",
    "newColumnName": "relatedItem - part - extent",
    "description": "Rename column Record - metadata - mods - relatedItem - part - extent to relatedItem - part - extent"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - part - extent - unit",
    "newColumnName": "relatedItem - part - extent - unit",
    "description": "Rename column Record - metadata - mods - relatedItem - part - extent - unit to relatedItem - part - extent - unit"
  },
  {
    "op": "core/column-rename",
    "oldColumnName": "Record - metadata - mods - relatedItem - titleInfo - title",
    "newColumnName": "relatedItem - titleInfo - title",
    "description": "Rename column Record - metadata - mods - relatedItem - titleInfo - title to relatedItem - titleInfo - title"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - subject - topic",
|
||||
"newColumnName": "subject - topic",
|
||||
"description": "Rename column Record - metadata - mods - subject - topic to subject - topic"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - note",
|
||||
"newColumnName": "note",
|
||||
"description": "Rename column Record - metadata - mods - note to note"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - note - type",
|
||||
"newColumnName": "note - type",
|
||||
"description": "Rename column Record - metadata - mods - note - type to note - type"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - titleInfo - type",
|
||||
"newColumnName": "titleInfo - type",
|
||||
"description": "Rename column Record - metadata - mods - titleInfo - type to titleInfo - type"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - titleInfo - title",
|
||||
"newColumnName": "titleInfo - title",
|
||||
"description": "Rename column Record - metadata - mods - titleInfo - title to titleInfo - title"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - genre",
|
||||
"newColumnName": "genre",
|
||||
"description": "Rename column Record - metadata - mods - genre to genre"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - originInfo - dateIssued",
|
||||
"newColumnName": "originInfo - dateIssued",
|
||||
"description": "Rename column Record - metadata - mods - originInfo - dateIssued to originInfo - dateIssued"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - originInfo - publisher",
|
||||
"newColumnName": "originInfo - publisher",
|
||||
"description": "Rename column Record - metadata - mods - originInfo - publisher to originInfo - publisher"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - originInfo - place - placeTerm",
|
||||
"newColumnName": "originInfo - place - placeTerm",
|
||||
"description": "Rename column Record - metadata - mods - originInfo - place - placeTerm to originInfo - place - placeTerm"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - language - languageTerm",
|
||||
"newColumnName": "language - languageTerm",
|
||||
"description": "Rename column Record - metadata - mods - language - languageTerm to language - languageTerm"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - abstract",
|
||||
"newColumnName": "abstract",
|
||||
"description": "Rename column Record - metadata - mods - abstract to abstract"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - abstract - lang",
|
||||
"newColumnName": "abstract - lang",
|
||||
"description": "Rename column Record - metadata - mods - abstract - lang to abstract - lang"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - dateOther",
|
||||
"newColumnName": "dateOther",
|
||||
"description": "Rename column Record - metadata - mods - dateOther to dateOther"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - dateOther - type",
|
||||
"newColumnName": "dateOther - type",
|
||||
"description": "Rename column Record - metadata - mods - dateOther - type to dateOther - type"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - accessCondition",
|
||||
"newColumnName": "accessCondition",
|
||||
"description": "Rename column Record - metadata - mods - accessCondition to accessCondition"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - metadata - mods - accessCondition - type",
|
||||
"newColumnName": "accessCondition - type",
|
||||
"description": "Rename column Record - metadata - mods - accessCondition - type to accessCondition - type"
|
||||
},
|
||||
{
|
||||
"op": "core/column-rename",
|
||||
"oldColumnName": "Record - header - setSpec",
|
||||
"newColumnName": "setSpec",
|
||||
"description": "Rename column Record - header - setSpec to setSpec"
|
||||
}
|
||||
]
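The rename operations above all follow one pattern: the import prefix `Record - metadata - mods - ` (or `Record - header - `) is dropped from the column name. As a minimal sketch, assuming the column names are known in advance, such a list of `core/column-rename` operations could be generated rather than written by hand. The function name `rename_ops` and the `PREFIXES` tuple are illustrative and not part of the repository.

```python
# Prefixes seen in the rename operations above (assumed to be the only ones).
PREFIXES = ("Record - metadata - mods - ", "Record - header - ")


def rename_ops(columns, prefixes=PREFIXES):
    """Build OpenRefine core/column-rename operations that strip a known prefix."""
    ops = []
    for old in columns:
        for prefix in prefixes:
            if old.startswith(prefix):
                new = old[len(prefix):]
                ops.append({
                    "op": "core/column-rename",
                    "oldColumnName": old,
                    "newColumnName": new,
                    "description": f"Rename column {old} to {new}",
                })
                break  # only the first matching prefix is removed
    return ops
```

Columns that carry neither prefix are skipped, so the helper can be fed the full column list of a project.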
|
|
@ -3,17 +3,29 @@ wuppertal[elpub.bib.uni-wuppertal.de] --- metha_wuppertal
|
|||
click wuppertal "http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider?verb=ListRecords&metadataPrefix=oai_dc" _blank
siegen[dspace.ub.uni-siegen.de] --- metha_siegen
click siegen "https://dspace.ub.uni-siegen.de/oai/request?verb=ListRecords&metadataPrefix=xMetaDissPlus" _blank
muenster[miami.uni-muenster.de] --- metha_muenster
click muenster "https://repositorium.uni-muenster.de/oai/miami?verb=ListRecords&metadataPrefix=mets" _blank
bielefeld[pub.uni-bielefeld.de] --- metha_bielefeld
click bielefeld "https://pub.uni-bielefeld.de/oai?verb=ListRecords&metadataPrefix=mods&set=open_access" _blank
subgraph Harvesting
    metha_wuppertal["fa:fa-cogs metha"]
    metha_siegen["fa:fa-cogs metha"]
    metha_muenster["fa:fa-cogs metha"]
    metha_bielefeld["fa:fa-cogs metha"]
end
subgraph Transformation
    metha_wuppertal -->|Dublin Core| refine_wuppertal[fa:fa-cogs OpenRefine]
    metha_siegen -->|xMetaDissPlus| refine_siegen[fa:fa-cogs OpenRefine]
    metha_muenster -->|METS/MODS| refine_muenster[fa:fa-cogs OpenRefine]
    metha_bielefeld -->|MODS| refine_bielefeld[fa:fa-cogs OpenRefine]
end
subgraph OAI-PMH Data Provider
    refine_wuppertal -->|METS/MODS| oai_wuppertal["noah.opencultureconsulting.com/ubw/"]
    click oai_wuppertal "https://noah.opencultureconsulting.com/ubw/?verb=ListRecords&metadataPrefix=mets" _blank
    refine_siegen -->|METS/MODS| oai_siegen["noah.opencultureconsulting.com/ubs/"]
    click oai_siegen "https://noah.opencultureconsulting.com/ubs/?verb=ListRecords&metadataPrefix=mets" _blank
    refine_muenster -->|METS/MODS| oai_muenster["noah.opencultureconsulting.com/ulbm/"]
    click oai_muenster "https://noah.opencultureconsulting.com/ubm/?verb=ListRecords&metadataPrefix=mets" _blank
    refine_bielefeld -->|METS/MODS| oai_bielefeld["noah.opencultureconsulting.com/ubb/"]
    click oai_bielefeld "https://noah.opencultureconsulting.com/ubb/?verb=ListRecords&metadataPrefix=mets" _blank
end
|
||||
|
|
|
@ -0,0 +1,147 @@
|
|||
version: '3'

tasks:
  main:
    desc: miami ULB Münster
    vars:
      MINIMUM: 6600 # minimum number of records expected
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # The following tasks, prefixed with ":", are the same for all data sources and are defined in Taskfile.yml
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: ./{{.PROJECT}}/harvest
    vars:
      URL: http://repositorium.uni-muenster.de/oai/miami
      FORMAT: mets
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3336 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # Import (requires the absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record --recordPath metadata --recordPath mets:mets
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # Preprocessing: move the identifier into the first column, id
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # Remove older entries (by mets:metsHdr - CREATEDATE) that share the same identifier
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/duplicates.json
        > {{.LOG}}
      - > # Delete aggregations (these records are referenced by subordinate works via relatedItem)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ohne-aggregationen.json
        > {{.LOG}}
      - > # Delete records without a direct link to a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nur-mit-pdf.json
        > {{.LOG}}
      - > # Remove the separate download link if there is only one file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/flocat.json
        > {{.LOG}}
      - > # For now, delete records that contain more than one direct link https://github.com/opencultureconsulting/noah/issues/25
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nur-ein-direktlink.json
        > {{.LOG}}
      - > # For now, delete journal issues https://github.com/opencultureconsulting/noah/issues/31
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/keine-zeitschriftenhefte.json
        > {{.LOG}}
      - > # Delete records with "restriction on access"
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/restriction.json
        > {{.LOG}}
      - > # Index: generate column index from row.record.index
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/index.json
        > {{.LOG}}
      - > # Sorting: mods:nonSort for the first element in mods:title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - > # Visual Library doctype from mods:genre
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doctype.json
        > {{.LOG}}
      - > # Remove HTML markup from abstracts and delete abstracts without a language attribute
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/abstract.json
        > {{.LOG}}
      - > # Make mets:file - ID unique to avoid validation errors
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/file-id.json
        > {{.LOG}}
      - > # Partially filter mods:note type
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/note.json
        > {{.LOG}}
      - > # Enrichment with the HT number via lobid-resources: OR search for multiple URNs; if there are several hits, only the first is used
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - | # Export to METS:MODS using templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  linkcheck:
    desc: miami ULB Münster, check links
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: miami ULB Münster, delete cache
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
|
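The `refine` task in the Taskfile above derives `PROJECT` from the task name and then applies fifteen config files in a fixed order. The following Python sketch only mirrors that structure for illustration: it builds the argument lists without running anything. The helper names and the example task name `muenster:refine` are assumptions; the flags `-P` and `--apply` are the ones used in the Taskfile.

```python
# Config files applied by the refine task, in the order given in the Taskfile.
STEPS = [
    "vorverarbeitung", "duplicates", "ohne-aggregationen", "nur-mit-pdf",
    "flocat", "nur-ein-direktlink", "keine-zeitschriftenhefte", "restriction",
    "index", "nonsort", "doctype", "abstract", "file-id", "note", "hbz",
]


def project_from_task(task_name):
    """Mirror '{{splitList ":" .TASK | first}}': take the part before the first colon."""
    return task_name.split(":")[0]


def apply_commands(client, port, project, steps=STEPS):
    """Return one argv list per --apply call; nothing is executed here."""
    return [
        [client, "-P", str(port), project, "--apply", f"config/{step}.json"]
        for step in steps
    ]
```

Keeping the step order in one list makes it easier to see that filtering steps (duplicates, PDF checks, restrictions) run before the enrichment steps that add columns.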
|
@ -0,0 +1,81 @@
|
|||
[
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "0",
|
||||
"l": "0"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract",
|
||||
"expression": "null",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract using expression null"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "0",
|
||||
"l": "0"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang",
|
||||
"expression": "null",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang using expression null"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract",
|
||||
"expression": "grel:value.parseHtml().htmlText().trim()",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract using expression grel:value.parseHtml().htmlText().trim()"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,34 @@
|
|||
[
|
||||
{
|
||||
"op": "core/column-addition",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "id",
|
||||
"expression": "isBlank(value)",
|
||||
"columnName": "id",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
|
||||
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
|
||||
"onError": "set-to-blank",
|
||||
"newColumnName": "doctype",
|
||||
"columnInsertIndex": 20
|
||||
}
|
||||
]
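The GREL expression in the operation above walks a list of `[genre, doctype]` pairs and joins the matches, which amounts to a case-sensitive dictionary lookup that yields an empty string for unknown genres. A minimal Python equivalent, with the pairs copied from the expression and the function name `doctype` chosen for illustration:

```python
# Mapping copied from the GREL expression: mods:genre -> Visual Library doctype.
DOCTYPES = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "conferenceObject": "conference_object",
    "CourseMaterial": "course_material",
    "doctoralThesis": "oaDoctoralThesis",
    "lecture": "lecture",
    "Manuscript": "handwritten",
    "masterThesis": "oaMasterThesis",
    "MusicalNotation": "notated music",
    "PeriodicalPart": "journal issue",
    "preprint": "oaPreprint",
    "report": "oaBdArticle",
    "ResearchData": "research_data",
    "review": "review",
    "StudyThesis": "oaStudyThesis",
    "Other": "oaBdOther",
    "workingPaper": "working_paper",
}


def doctype(genre):
    """Return the mapped doctype, or '' when the genre is not in the table."""
    return DOCTYPES.get(genre, "")
```

Because the comparison is case-sensitive, `Report` would not match `report`; the lookup behaves the same way.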
|
|
@ -0,0 +1,59 @@
|
|||
[
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:metsHdr - CREATEDATE",
|
||||
"expression": "value.toDate()",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:metsHdr - CREATEDATE using expression value.toDate()"
|
||||
},
|
||||
{
|
||||
"op": "core/row-reorder",
|
||||
"mode": "record-based",
|
||||
"sorting": {
|
||||
"criteria": [
|
||||
{
|
||||
"valueType": "date",
|
||||
"column": "mets:mets - mets:metsHdr - CREATEDATE",
|
||||
"blankPosition": 2,
|
||||
"errorPosition": 1,
|
||||
"reverse": false
|
||||
}
|
||||
]
|
||||
},
|
||||
"description": "Reorder rows"
|
||||
},
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "id",
|
||||
"expression": "grel:with(value.cross('muenster', columnName), rows, if(rows.length() > 1, if(rows.index.sort()[-1] > row.index, 'is duplicate of a higher row number', 'has duplicate(s) with lower row number'), 'unique'))",
|
||||
"columnName": "id",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "is duplicate of a higher row number",
|
||||
"l": "is duplicate of a higher row number"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
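Taken together, the three operations above sort records by `CREATEDATE` in ascending order and then remove every record whose `id` reappears further down, so only the newest record per identifier survives. Note that the facet looks the project up by the literal name `muenster`. A rough Python sketch of the same idea, assuming each record is a dict with an `id` and an ISO-formatted `created` value (both field names are illustrative) and ignoring the blank and error ordering that OpenRefine applies:

```python
from datetime import datetime


def drop_older_duplicates(records):
    """Keep only the most recently created record for each id."""
    # Ascending sort, like the row-reorder step (sorted() is stable).
    ordered = sorted(records, key=lambda r: datetime.fromisoformat(r["created"]))
    # Position of the last (newest) occurrence of every id.
    last_index = {r["id"]: i for i, r in enumerate(ordered)}
    # A record is dropped when the same id occurs at a higher position.
    return [r for i, r in enumerate(ordered) if last_index[r["id"]] == i]
```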
|
|
@ -0,0 +1,35 @@
|
|||
[
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
|
||||
"expression": "isBlank(value)",
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
|
||||
"expression": "grel:'FILE_' + row.record.cells['id'].value[0].split(':')[-1] + '_' + (row.index - row.record.fromRowIndex + 1)",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID using expression grel:'FILE_' + row.record.cells['id'].value[0].split(':')[-1] + '_' + (row.index - row.record.fromRowIndex + 1)"
|
||||
}
|
||||
]
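The expression above rebuilds each `mets:file` ID from the last colon-separated segment of the record's `id` plus the 1-based position of the row inside its record. A small Python sketch of that computation; the function name and the sample identifier used in the comment are hypothetical:

```python
def file_id(record_id, row_index, from_row_index):
    """Build 'FILE_<last id segment>_<row position within the record>'."""
    suffix = record_id.split(":")[-1]
    # Rows are counted from the first row of the record, starting at 1.
    position = row_index - from_row_index + 1
    return f"FILE_{suffix}_{position}"


# Hypothetical example: third row of a record that starts at row 10.
# file_id("oai:example.org:1234", 12, 10)
```

Since the position is relative to the record, two records can reuse the same row offsets without producing the same ID, as long as their identifiers differ.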
|
|
@ -0,0 +1,54 @@
|
|||
[
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:structMap - mets:div - mets:div - ID",
|
||||
"expression": "grel:row.record.cells[columnName].value.length()",
|
||||
"columnName": "mets:mets - mets:structMap - mets:div - mets:div - ID",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": 2,
|
||||
"l": "2"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
},
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:fileSec - mets:fileGrp - USE",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - USE",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "DOWNLOAD",
|
||||
"l": "DOWNLOAD"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
|
||||
"expression": "grel:null",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href using expression grel:null"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,84 @@
|
|||
[
|
||||
{
|
||||
"op": "core/column-addition-by-fetching-urls",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "urn",
|
||||
"l": "urn"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier",
|
||||
"urlExpression": "grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'",
|
||||
"onError": "set-to-blank",
|
||||
"newColumnName": "hbz",
|
||||
"columnInsertIndex": 37,
|
||||
"delay": 0,
|
||||
"cacheResponses": true,
|
||||
"httpHeadersJson": [
|
||||
{
|
||||
"name": "authorization",
|
||||
"value": ""
|
||||
},
|
||||
{
|
||||
"name": "user-agent",
|
||||
"value": "OpenRefine 3.4.1 [437dc4d]"
|
||||
},
|
||||
{
|
||||
"name": "accept",
|
||||
"value": "*/*"
|
||||
}
|
||||
],
|
||||
"description": "Create column hbz at index 37 by fetching URLs based on column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier using expression grel:'https://lobid.org/resources/search?q=' + 'urn:\"' + value \n + '\"'"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "urn",
|
||||
"l": "urn"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "hbz",
|
||||
"expression": "grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column hbz using expression grel:forNonBlank(value.parseJson().member[0].hbzId,v,v,null)"
|
||||
}
|
||||
]
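The two operations above first fetch `https://lobid.org/resources/search?q=urn:"<URN>"` for every URN and then keep only `member[0].hbzId` from the JSON response. The Python sketch below shows both halves without making a network call. It percent-encodes the query, which the GREL expression does not do, so treat that as an assumption; the helper names are illustrative as well.

```python
import json
from urllib.parse import quote


def lobid_url(urn):
    """Build the search URL for one URN (query is percent-encoded here)."""
    return "https://lobid.org/resources/search?q=" + quote(f'urn:"{urn}"')


def first_hbz_id(body):
    """Return hbzId of the first hit, or None if the response has none."""
    try:
        return json.loads(body)["member"][0]["hbzId"] or None
    except (ValueError, KeyError, IndexError, TypeError):
        # Invalid JSON, no "member" list, empty list, or no "hbzId" field.
        return None
```

Like the GREL version, this takes only the first hit and silently yields nothing when the search returns no members.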
|
|
@ -0,0 +1,15 @@
|
|||
[
|
||||
{
|
||||
"op": "core/column-addition",
|
||||
"engineConfig": {
|
||||
"facets": [],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"baseColumnName": "id",
|
||||
"expression": "grel:row.record.index",
|
||||
"onError": "set-to-blank",
|
||||
"newColumnName": "index",
|
||||
"columnInsertIndex": 1,
|
||||
"description": "Create column index at index 1 based on column id using expression grel:row.record.index"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,30 @@
|
|||
[
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "PeriodicalPart",
|
||||
"l": "PeriodicalPart"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,87 @@
|
|||
[
|
||||
{
|
||||
"op": "core/column-addition",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "id",
|
||||
"expression": "isBlank(value)",
|
||||
"columnName": "id",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title",
|
||||
"expression": "grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))",
|
||||
"onError": "set-to-blank",
|
||||
"newColumnName": "nonsort",
|
||||
"columnInsertIndex": 43,
|
||||
"description": "Create column nonsort at index 43 based on column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title using expression grel:with(['a', 'das', 'dem', 'den', 'der', 'des', 'die', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'the'],x,if(inArray(x,value.split(' ')[0].toLowercase()),value.split(' ')[0] + ' ',''))"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "id",
|
||||
"expression": "isBlank(value)",
|
||||
"columnName": "id",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
},
|
||||
{
|
||||
"type": "list",
|
||||
"name": "nonsort",
|
||||
"expression": "isBlank(value)",
|
||||
"columnName": "nonsort",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title",
|
||||
"expression": "grel:value.split(' ').slice(1).join(' ')",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title using expression grel:value.split(' ').slice(1).join(' ')"
|
||||
}
|
||||
]
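The pair of operations above splits a leading article off the title: the first word goes into `nonsort` (with a trailing space) when its lowercase form is in the article list, and the title keeps the remaining words. A minimal Python sketch of the same logic, with the article list copied from the expression and the function name chosen for illustration:

```python
# Articles copied from the GREL expression.
ARTICLES = {
    "a", "das", "dem", "den", "der", "des", "die", "ein", "eine",
    "einem", "einen", "einer", "eines", "the",
}


def split_nonsort(title):
    """Return (nonsort, title) with a leading article moved into nonsort."""
    words = title.split(" ")
    if words[0].lower() in ARTICLES:
        return words[0] + " ", " ".join(words[1:])
    # No leading article: nonsort stays empty and the title is unchanged.
    return "", title
```

For example, `split_nonsort("Die Verwandlung")` returns `("Die ", "Verwandlung")`, while a title such as `"Faust"` is returned unchanged with an empty `nonsort`.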
|
|
@ -0,0 +1,67 @@
|
|||
[
|
||||
{
|
||||
"op": "core/mass-edit",
|
||||
"engineConfig": {
|
||||
"facets": [],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
|
||||
"expression": "value",
|
||||
"edits": [
|
||||
{
|
||||
"from": [
|
||||
"thesis"
|
||||
],
|
||||
"fromBlank": false,
|
||||
"fromError": false,
|
||||
"to": "thesis statement"
|
||||
}
|
||||
],
|
||||
"description": "Mass edit cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
|
||||
"invert": true,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "citation/reference",
|
||||
"l": "citation/reference"
|
||||
}
|
||||
},
|
||||
{
|
||||
"v": {
|
||||
"v": "ownership",
|
||||
"l": "ownership"
|
||||
}
|
||||
},
|
||||
{
|
||||
"v": {
|
||||
"v": "thesis statement",
|
||||
"l": "thesis statement"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "row-based"
|
||||
},
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type",
|
||||
"expression": "grel:null",
|
||||
"onError": "keep-original",
|
||||
"repeat": false,
|
||||
"repeatCount": 10,
|
||||
"description": "Text transform on cells in column mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type using expression grel:null"
|
||||
}
|
||||
]
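The two operations above amount to a small normalization rule: rename `thesis` to `thesis statement`, then blank every note type outside the three listed values. A minimal Python sketch of that logic follows. It assumes the inverted list facet selects every row whose value is not among the listed ones; the function name `normalize_note_type` is hypothetical and not part of the repository.

```python
from typing import Optional

# Values kept by the inverted list facet in the second operation.
ALLOWED_NOTE_TYPES = {"citation/reference", "ownership", "thesis statement"}


def normalize_note_type(value: Optional[str]) -> Optional[str]:
    """Mimic the mass edit followed by the text transform to null."""
    # Step 1 (core/mass-edit): rename "thesis" to "thesis statement".
    if value == "thesis":
        value = "thesis statement"
    # Step 2 (core/text-transform with inverted facet): blank everything else.
    if value not in ALLOWED_NOTE_TYPES:
        return None
    return value
```

Keeping the allowed values in one set makes it easy to compare them against the facet selection when the rule file changes.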
|
|
@ -0,0 +1,30 @@
|
|||
[
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
|
||||
"expression": "grel:with(row.record.cells[columnName].value, x, and(x.length() == 1, x[0].toLowercase().contains('.pdf')))",
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,30 @@
|
|||
[
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
|
||||
"expression": "grel:row.record.cells[columnName].value.join('').toLowercase().contains('.pdf')",
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": false,
|
||||
"l": "false"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
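The row removal above works on whole records: all `xlink:href` values of a record are joined, lowercased and searched for `.pdf`, and records for which the result is `false` are removed. A minimal Python sketch of that check follows. It assumes blank cells contribute nothing to the joined string; the names `record_has_pdf` and `keep_records_with_pdf` are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def record_has_pdf(hrefs: Iterable[Optional[str]]) -> bool:
    """Return True if any file link of a record mentions '.pdf' (case-insensitive)."""
    joined = "".join(h for h in hrefs if h)  # blank cells contribute nothing
    return ".pdf" in joined.lower()


def keep_records_with_pdf(
    records: Dict[str, List[Optional[str]]]
) -> Dict[str, List[Optional[str]]]:
    """Drop the records for which the facet expression would evaluate to false."""
    return {rid: hrefs for rid, hrefs in records.items() if record_has_pdf(hrefs)}
```

Lowercasing before the search is what lets links ending in `.PDF` pass the filter as well.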
|
|
@ -0,0 +1,30 @@
|
|||
[
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
|
||||
"expression": "grel:isBlank(row.record.cells[columnName].value.join(''))",
|
||||
"columnName": "mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": true,
|
||||
"l": "true"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,30 @@
|
|||
[
|
||||
{
|
||||
"op": "core/row-removal",
|
||||
"engineConfig": {
|
||||
"facets": [
|
||||
{
|
||||
"type": "list",
|
||||
"name": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type",
|
||||
"expression": "value",
|
||||
"columnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
||||
"omitError": false,
|
||||
"selection": [
|
||||
{
|
||||
"v": {
|
||||
"v": "restriction on access",
|
||||
"l": "restriction on access"
|
||||
}
|
||||
}
|
||||
],
|
||||
"selectBlank": false,
|
||||
"selectError": false
|
||||
}
|
||||
],
|
||||
"mode": "record-based"
|
||||
},
|
||||
"description": "Remove rows"
|
||||
}
|
||||
]
|
|
@ -0,0 +1,138 @@
|
|||
{{
|
||||
if(row.index - row.record.fromRowIndex == 0,
|
||||
with(cross(cells['index'].value, 'muenster' , 'index'), rows,
|
||||
'<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">' + '\n' +
|
||||
' <mets:dmdSec ID="' + cells['mets:mets - mets:dmdSec - ID'].value.escape('xml') + '">' + '\n' +
|
||||
' <mets:mdWrap MIMETYPE="text/xml" MDTYPE="MODS">' + '\n' +
|
||||
' <mets:xmlData>' + '\n' +
|
||||
' <mods xmlns="http://www.loc.gov/mods/v3" version="3.7" xmlns:vl="http://visuallibrary.net/vl">' + '\n' +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title'].value)), r,
|
||||
' <titleInfo' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - type'].value.replace('uniform', ''), v, ' type="' + v.escape('xml') + '"', '') + '>' + '\n' +
|
||||
forNonBlank(r.cells['nonsort'].value, v,
|
||||
' <nonSort>' + v.escape('xml') + '</nonSort>' + '\n'
|
||||
, '') +
|
||||
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:title'].value, v,
|
||||
' <title>' + v.escape('xml') + '</title>' + '\n'
|
||||
, '') +
|
||||
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:titleInfo - mods:subTitle'].value, v,
|
||||
' <subTitle>' + v.escape('xml') + '</subTitle>' + '\n'
|
||||
, '') +
|
||||
' </titleInfo>' + '\n'
|
||||
).join('') +
|
||||
forEachIndex(rows, i, r, if(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - type'].value == 'personal',
|
||||
' <name type="personal"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - valueURI'].value, v, ' authority="gnd" authorityURI="http://d-nb.info/gnd/" valueURI="' + v.escape('xml') + '"', '') + '>' + '\n' +
|
||||
' <displayForm>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:displayForm'].value.escape('xml') + '</displayForm>' + '\n' +
|
||||
' <namePart type="' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value.escape('xml') + '">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart'].value.escape('xml') + '</namePart>' + '\n' +
|
||||
if(and(isBlank(rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - type'].value), isNonBlank(rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value)),
|
||||
' <namePart type="' + rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart - type'].value.escape('xml') + '">' + rows[i+1].cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:namePart'].value.escape('xml') + '</namePart>' + '\n'
|
||||
, '') +
|
||||
forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:name - mods:role - mods:roleTerm'].value, v,
|
||||
' <role>' + '\n' +
|
||||
' <roleTerm type="code" authority="marcrelator">' + v.escape('xml') + '</roleTerm>' + '\n' +
|
||||
' </role>' + '\n'
|
||||
, '') +
|
||||
' </name>' + '\n'
|
||||
, '')).join('') +
|
||||
' <typeOfResource>text</typeOfResource>' + '\n' +
|
||||
' <genre authority="dini">' + cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre'].value.escape('xml') + '</genre>' + '\n' +
|
||||
' <originInfo>' + '\n' +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued'].value)), r,
|
||||
' <dateIssued encoding="w3cdtf"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued - keyDate'].value, v, ' keyDate="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateIssued'].value.escape('xml') + '</dateIssued>' + '\n'
|
||||
).join('') +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther'].value)), r,
|
||||
' <dateOther encoding="w3cdtf"' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther - keyDate'].value, v, ' keyDate="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:originInfo - mods:dateOther'].value.escape('xml') + '</dateOther>' + '\n'
|
||||
).join('') +
|
||||
' </originInfo>' + '\n' +
|
||||
' <language>' + '\n' +
|
||||
' <languageTerm type="code" authority="iso639-2b">' + cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:language - mods:languageTerm'].value.escape('xml') + '</languageTerm>' + '\n' +
|
||||
' </language>' + '\n' +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract'].value)), r,
|
||||
' <abstract' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract - lang'].value, v, ' lang="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:abstract'].value.escape('xml') + '</abstract>' + '\n'
|
||||
).join('') +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note'].value)), r,
|
||||
' <note' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:note'].value.escape('xml') + '</note>' + '\n'
|
||||
).join('') +
|
||||
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('ger'),
|
||||
' <subject lang="ger">' + '\n'
|
||||
, '') +
|
||||
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value == 'ger'), r,
|
||||
forEach(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic'].value.split(';'), v,
|
||||
' <topic>' + v.trim().escape('xml') + '</topic>' + '\n'
|
||||
).join('')
|
||||
).join('') +
|
||||
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('ger'),
|
||||
' </subject>' + '\n'
|
||||
, '') +
|
||||
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('eng'),
|
||||
' <subject lang="eng">' + '\n'
|
||||
, '') +
|
||||
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value == 'eng'), r,
|
||||
forEach(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic'].value.split(';'), v,
|
||||
' <topic>' + v.trim().escape('xml') + '</topic>' + '\n'
|
||||
).join('')
|
||||
).join('') +
|
||||
if(row.record.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:subject - mods:topic - lang'].value.inArray('eng'),
|
||||
' </subject>' + '\n'
|
||||
, '') +
|
||||
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:classification - authority'].value == 'ddc'), r,
|
||||
' <classification authority="ddc">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:classification'].value.escape('xml') + '</classification>' + '\n'
|
||||
).join('') +
|
||||
forEach(filter(rows, r, r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - type'].value == 'host'), r,
|
||||
' <relatedItem type="host">' + '\n' +
|
||||
' <titleInfo>' + '\n' +
|
||||
' <title>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:titleInfo - mods:title'].value.escape('xml') + '</title>' + '\n' +
|
||||
' </titleInfo>' + '\n' +
|
||||
' <part>' + '\n' +
|
||||
' <detail type="issue">' + '\n' +
|
||||
' <number>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:titleInfo - mods:title'].value.escape('xml') + '</number>' + '\n' +
|
||||
' </detail>' + '\n' +
|
||||
' <extent unit="page">' + '\n' +
|
||||
' <start>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:part - mods:extent - mods:start'].value.escape('xml') + '</start>' + '\n' +
|
||||
' <end>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:relatedItem - mods:part - mods:extent - mods:end'].value.escape('xml') + '</end>' + '\n' +
|
||||
' </extent>' + '\n' +
|
||||
' </part>' + '\n' +
|
||||
' </relatedItem>' + '\n'
|
||||
).join('') +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type'].value)), r,
|
||||
' <identifier' + forNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier - type'].value, v, ' type="' + v.escape('xml') + '"', '') + '>' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:identifier'].value.escape('xml') + '</identifier>' + '\n'
|
||||
).join('') +
|
||||
forNonBlank(cells['hbz'].value, v,
|
||||
' <identifier type="sys">' + v.escape('xml') + '</identifier>' + '\n'
|
||||
, '') +
|
||||
forEach(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - type'].value)), r,
|
||||
' <accessCondition type="use and reproduction" xlink:href="' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - mods:extension - ma:maWrap - ma:licence - ma:targetUrl'].value.escape('xml') + '">' + r.cells['mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:accessCondition - mods:extension - ma:maWrap - ma:licence - ma:displayLabel'].value.replace('InC 1.0', 'Urheberrechtsschutz').escape('xml') + '</accessCondition>' + '\n'
|
||||
).join('') +
|
||||
' <recordInfo>' + '\n' +
|
||||
' <recordIdentifier>' + 'muenster_miami_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '</recordIdentifier>' + '\n' +
|
||||
' </recordInfo>' + '\n' +
|
||||
forNonBlank(cells['doctype'].value, v,
|
||||
' <extension>' + '\n' +
|
||||
' <vl:doctype>' + v.escape('xml') + '</vl:doctype>' + '\n' +
|
||||
' </extension>' + '\n'
|
||||
, '') +
|
||||
' </mods>' + '\n' +
|
||||
' </mets:xmlData>' + '\n' +
|
||||
' </mets:mdWrap>' + '\n' +
|
||||
' </mets:dmdSec>' + '\n' +
|
||||
' <mets:fileSec>' + '\n' +
|
||||
forEachIndex(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value)), i, r,
|
||||
' <mets:fileGrp USE="' + if(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value == filter(row.record.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value, v, v.toLowercase().contains('.pdf'))[0], 'pdf upload', 'generic file') + '">' + '\n' +
|
||||
' <mets:file MIMETYPE="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - MIMETYPE'].value.escape('xml') + '" ID="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '">' + '\n' +
|
||||
' <mets:FLocat xlink:href="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value.escape('xml') + '" LOCTYPE="URL"/>' + '\n' +
|
||||
' </mets:file>' + '\n' +
|
||||
' </mets:fileGrp>' + '\n'
|
||||
).join('') +
|
||||
' </mets:fileSec>' + '\n' +
|
||||
' <mets:structMap TYPE="LOGICAL">' + '\n' +
|
||||
' <mets:div TYPE="document" ID="' + 'muenster_miami_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '" DMDID="' + cells['mets:mets - mets:dmdSec - ID'].value.escape('xml') + '">' + '\n' +
|
||||
' <mets:fptr FILEID="' + cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '"/>' + '\n' +
|
||||
forEachIndex(filter(rows, r, isNonBlank(r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value)).slice(1), i, r,
|
||||
' <mets:div TYPE="part" ID="' + 'PART' + (i+1) + '_' + cells['id'].value.split(':').reverse()[0].escape('xml') + '" LABEL="' + if(r.cells['mets:mets - mets:fileSec - mets:fileGrp - USE'].value == 'DOWNLOAD', 'Download ZIP-Archiv (mit allen Dateien)' , r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - mets:FLocat - xlink:href'].value.split('/').reverse()[0].escape('xml')) + '">' + '\n' +
|
||||
' <mets:fptr FILEID="' + r.cells['mets:mets - mets:fileSec - mets:fileGrp - mets:file - ID'].value.escape('xml') + '"/>' + '\n' +
|
||||
' </mets:div>' + '\n'
|
||||
).join('') +
|
||||
' </mets:div>' + '\n' +
|
||||
' </mets:structMap>' + '\n' +
|
||||
'</mets:mets>' + '\n'
|
||||
), '')
|
||||
}}
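Two small string operations recur in the template above: the record identifier is built from the part of `id` after the last colon, and `mods:topic` values are split on `;`, trimmed and XML-escaped. A minimal Python sketch of both follows. It assumes GREL's `escape('xml')` also escapes quotes; the function names and the sample inputs in the usage note are hypothetical.

```python
from typing import List
from xml.sax.saxutils import escape

# Extra entities so that quotes are escaped too (assumed to match escape('xml')).
_QUOTES = {'"': "&quot;", "'": "&apos;"}


def record_identifier(oai_id: str, prefix: str = "muenster_miami_") -> str:
    """Take the part after the last ':' and prepend the source prefix."""
    return prefix + escape(oai_id.split(":")[-1], _QUOTES)


def topic_elements(value: str) -> List[str]:
    """Split a ';'-separated subject string into trimmed <topic> elements."""
    return [
        "<topic>" + escape(part.strip(), _QUOTES) + "</topic>"
        for part in value.split(";")
    ]
```

For example, `record_identifier("oai:example.org:4711")` yields `muenster_miami_4711`, and `topic_elements("Recht; Kunst & Kultur")` yields one escaped `<topic>` element per subject.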
|
|
@ -0,0 +1,14 @@
|
|||
[
  {
    "op": "core/column-rename",
    "oldColumnName": "mets:mets - OBJID",
    "newColumnName": "id",
    "description": "Rename column mets:mets - OBJID to id"
  },
  {
    "op": "core/column-move",
    "columnName": "id",
    "index": 0,
    "description": "Move column id to position 0"
  }
]
|
|
@ -0,0 +1,141 @@
|
|||
version: '3'
|
||||
|
||||
tasks:
|
||||
main:
|
||||
desc: OPUS Siegen
|
||||
vars:
|
||||
MINIMUM: 1250 # minimum number of records expected
|
||||
PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
|
||||
cmds:
|
||||
- task: harvest
|
||||
- task: refine
|
||||
# The following tasks starting with ":" are defined in Taskfile.yml and are identical for all data sources
|
||||
- task: :check
|
||||
vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
|
||||
- task: :split
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
- task: :validate
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
- task: :zip
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
- task: :diff
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
|
||||
harvest:
|
||||
dir: ./{{.PROJECT}}/harvest
|
||||
vars:
|
||||
URL: https://dspace.ub.uni-siegen.de/oai/request
|
||||
FORMAT: xMetaDissPlus
|
||||
PROJECT: '{{splitList ":" .TASK | first}}'
|
||||
cmds:
|
||||
- METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
|
||||
- METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml
|
||||
|
||||
refine:
|
||||
dir: ./{{.PROJECT}}
|
||||
vars:
|
||||
PORT: 3334 # assign a different port for each project
|
||||
RAM: 4G # maximum RAM for OpenRefine java heap space
|
||||
PROJECT: '{{splitList ":" .TASK | first}}'
|
||||
LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
|
||||
cmds:
|
||||
- mkdir -p refine
|
||||
- task: :start # launch OpenRefine
|
||||
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
|
||||
- > # Import (requires the absolute path to the XML file)
|
||||
"$CLIENT" -P {{.PORT}}
|
||||
--create "$(readlink -m harvest/{{.PROJECT}}.xml)"
|
||||
--recordPath Records --recordPath Record
|
||||
--storeEmptyStrings false --trimStrings true
|
||||
--projectName "{{.PROJECT}}"
|
||||
> {{.LOG}}
|
||||
- > # Preprocessing: move the identifier to the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (remove path)
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/vorverarbeitung.json
|
||||
> {{.LOG}}
|
||||
- > # Extract URNs: remove duplicates and merge different URNs
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/urn.json
|
||||
> {{.LOG}}
|
||||
- > # Add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS FLocat from it
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/direktlinks.json
|
||||
> {{.LOG}}
|
||||
- > # Delete records without a direct link to a PDF
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/nur-mit-pdf.json
|
||||
> {{.LOG}}
|
||||
- > # Split dc:subject into ddc and topic
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/ddc-topic.json
|
||||
> {{.LOG}}
|
||||
- > # Standardized rights statements (canonical name from CC links in dc:rights)
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/cc.json
|
||||
> {{.LOG}}
|
||||
- > # Derive the Internet media type from ddb:transfer: manual mapping based on Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/mime.json
|
||||
> {{.LOG}}
|
||||
- > # Add DOIs from the OAI_DC format: additionally query the DC format for all records and extract dc:identifier of type doi from it
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/doi.json
|
||||
> {{.LOG}}
|
||||
- > # Enrich with the HT number via lobid-resources: OR search if there are several URNs; if there are several hits, only the first is used
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/hbz.json
|
||||
> {{.LOG}}
|
||||
- > # Sorting: mods:nonSort for the first element in dc:title
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/nonsort.json
|
||||
> {{.LOG}}
|
||||
- > # Extract DINI publication types from dc:type
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/dini.json
|
||||
> {{.LOG}}
|
||||
- > # Visual Library doctype from dc:type: if thesis:level == thesis.habilitation, then doctype oaHabil
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/doctype.json
|
||||
> {{.LOG}}
|
||||
- > # Prepare the data structure for templating: one record per row and delete empty rows
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
|
||||
--apply config/join.json
|
||||
> {{.LOG}}
|
||||
- | # Export to METS:MODS using templating
|
||||
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
|
||||
" --suffix "
|
||||
" --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
|
||||
- | # print allocated system resources
|
||||
PID="$(lsof -t -i:{{.PORT}})"
|
||||
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
|
||||
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
|
||||
- task: :stop # shut down OpenRefine and archive the OpenRefine project
|
||||
vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
|
||||
sources:
|
||||
- Taskfile.yml
|
||||
- harvest/{{.PROJECT}}.xml
|
||||
- config/**
|
||||
generates:
|
||||
- refine/{{.PROJECT}}.openrefine.tar.gz
|
||||
- refine/{{.PROJECT}}.txt
|
||||
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
|
||||
|
||||
linkcheck:
|
||||
desc: OPUS Siegen links überprüfen
|
||||
vars:
|
||||
PROJECT: '{{splitList ":" .TASK | first}}'
|
||||
cmds:
|
||||
- task: :linkcheck
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
|
||||
delete:
|
||||
desc: OPUS Siegen cache löschen
|
||||
vars:
|
||||
PROJECT: '{{splitList ":" .TASK | first}}'
|
||||
cmds:
|
||||
- task: :delete
|
||||
vars: {PROJECT: '{{.PROJECT}}'}
|
||||
|
||||
default: # enable standalone execution (running `task` in project directory)
|
||||
cmds:
|
||||
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"
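The Taskfile above derives `PROJECT` from the namespaced task name (`splitList ":" .TASK | first`), and the `default` task builds the name `<directory>:main` from the current directory. A minimal Python sketch of both derivations follows. It assumes task names of the form `namespace:task`; the function names are hypothetical.

```python
def project_from_task(task_name: str) -> str:
    """Return the namespace in front of the first ':' of a namespaced task name."""
    return task_name.split(":")[0]


def main_task_for_directory(path: str) -> str:
    """Build '<directory name>:main', like the shell expression "${PWD##*/}:main"."""
    return path.rstrip("/").split("/")[-1] + ":main"
```

Because the namespace equals the directory name, the same value can be used both for paths such as `harvest/{{.PROJECT}}.xml` and for the variables passed to the shared tasks.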
|
|
@ -26,11 +26,11 @@
|
|||
"mode": "row-based"
|
||||
},
|
||||
"baseColumnName": "dc:type",
|
||||
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
|
||||
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
|
||||
"onError": "set-to-blank",
|
||||
"newColumnName": "doctype",
|
||||
"columnInsertIndex": 7,
|
||||
"description": "Create column doctype at index 7 based on column dc:type using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
|
||||
"description": "Create column doctype"
|
||||
},
|
||||
{
|
||||
"op": "core/text-transform",
|
|
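The `doctype` expression in the hunk above is a lookup table written in GREL: it walks a list of `[dc:type, doctype]` pairs and joins the matches, which yields an empty string when nothing matches. A minimal Python sketch of the same idea with a dictionary follows. Only a subset of the pairs is reproduced, and the names `DOCTYPE_BY_DC_TYPE` and `doctype_for` are hypothetical.

```python
from typing import Optional

# Subset of the pairs listed in the expression; the others follow the same pattern.
DOCTYPE_BY_DC_TYPE = {
    "article": "oaArticle",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "doctoralThesis": "oaDoctoralThesis",
    "masterThesis": "oaMasterThesis",
}


def doctype_for(dc_type: Optional[str]) -> str:
    """Return the mapped doctype, or '' when there is no matching pair."""
    return DOCTYPE_BY_DC_TYPE.get(dc_type, "")
```

The comparison is exact and case-sensitive, as with `value == v[0]` in the expression, so `Article` would not match `article`.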
@ -6,7 +6,7 @@
|
|||
{
|
||||
"type": "list",
|
||||
"name": "ddb:transfer",
|
||||
"expression": "grel:row.record.cells['ddb:transfer'].value.join('').contains('.pdf')",
|
||||
"expression": "grel:row.record.cells['ddb:transfer'].value.join('').toLowercase().contains('.pdf')",
|
||||
"columnName": "ddb:transfer",
|
||||
"invert": false,
|
||||
"omitBlank": false,
|
147
tasks/siegen.yml
|
@ -1,147 +0,0 @@
|
|||
# https://taskfile.dev
|
||||
|
||||
version: '3'
|
||||
|
||||
tasks:
|
||||
default:
|
||||
desc: OPUS Siegen
|
||||
deps: [harvest]
|
||||
cmds:
|
||||
- task: refine
|
||||
- task: check
|
||||
- task: split
|
||||
- task: validate
|
||||
- task: zip
|
||||
- task: diff
|
||||
|
||||
harvest:
|
||||
dir: data/siegen/harvest
|
||||
cmds:
|
||||
- METHA_DIR=$PWD metha-sync --format xMetaDissPlus https://dspace.ub.uni-siegen.de/oai/request
|
||||
- METHA_DIR=$PWD metha-cat --format xMetaDissPlus https://dspace.ub.uni-siegen.de/oai/request > siegen.xml
|
||||
|
||||
refine:
|
||||
dir: data/siegen/refine
|
||||
ignore_error: true # temporary workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
|
||||
env:
|
||||
PORT: 3334
|
||||
RAM: 4G
|
||||
PROJECT: siegen
|
||||
cmds:
|
||||
# Start OpenRefine
|
||||
- $OPENREFINE -v warn -p $PORT -m $RAM -d $PWD > openrefine.log 2>&1 &
|
||||
- timeout 30s bash -c "until curl -s http://localhost:$PORT | cat | grep -q -o OpenRefine ; do sleep 1; done"
|
||||
# Import (requires the absolute path to the XML file)
|
||||
- $OPENREFINE_CLIENT -P $PORT --create "$(readlink -e ../harvest/siegen.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName $PROJECT
|
||||
# Preprocessing: move the identifier to the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (remove path)
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/vorverarbeitung.json $PROJECT
|
||||
# Extract URNs: remove duplicates and merge different URNs
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/urn.json $PROJECT
|
||||
# Add missing direct links from the METS format: if ddb:transfer is empty, additionally query the METS format and extract the METS FLocat from it
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/direktlinks.json $PROJECT
|
||||
# Delete records without a direct link to a PDF
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/nur-mit-pdf.json $PROJECT
|
||||
# Split dc:subject into ddc and topic
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/ddc-topic.json $PROJECT
|
||||
# Standardized rights statements (canonical name from CC links in dc:rights)
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/cc.json $PROJECT
|
||||
# Derive the Internet media type from ddb:transfer: manual mapping based on Apache http://svn.apache.org/viewvc/httpd/httpd/trunk/docs/conf/mime.types?view=markup
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/mime.json $PROJECT
|
||||
# Add DOIs from the OAI_DC format: additionally query the DC format for all records and extract dc:identifier of type doi from it
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/doi.json $PROJECT
|
||||
# Enrich with the HT number via lobid-resources: OR search if there are several URNs; if there are several hits, only the first is used
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/hbz.json $PROJECT
|
||||
# Sorting: mods:nonSort for the first element in dc:title
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/nonsort.json $PROJECT
|
||||
# Extract DINI publication types from dc:type
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/dini.json $PROJECT
|
||||
# Visual Library doctype from dc:type: if thesis:level == thesis.habilitation, then doctype oaHabil
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/doctype.json $PROJECT
|
||||
# Prepare the data structure for templating: one record per row and delete empty rows
|
||||
- $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/siegen/join.json $PROJECT
|
||||
# Export to METS:MODS using templating
|
||||
- |
|
||||
$OPENREFINE_CLIENT -P $PORT --export --template "$(< ../../../rules/siegen/template.txt)" --rowSeparator "
|
||||
<!-- SPLIT -->
|
||||
" --suffix "
|
||||
" --output siegen.txt $PROJECT
|
||||
# Statistics on runtime and resource usage
|
||||
- ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:$PORT)
|
||||
# Stop OpenRefine
|
||||
- PID=$(lsof -t -i:$PORT); kill $PID; while ps -p $PID > /dev/null; do sleep 1; done
|
||||
# Archive the OpenRefine project for debugging
|
||||
- tar cfz siegen.openrefine.tar.gz -C $(grep -l siegen *.project/metadata.json | cut -d '/' -f 1) .
|
||||
# Delete temporary files
|
||||
- rm -rf ./*.project* && rm -f workspace.json
|
||||
sources:
|
||||
# if "dir:" is set for the task, relative paths are resolved from dir
|
||||
- ../harvest/siegen.xml
|
||||
- ../../../rules/siegen/*.json
|
||||
- ../../../rules/siegen/template.txt
|
||||
#TODO - ../../../rules/common/*.json
|
||||
generates:
|
||||
- openrefine.log
|
||||
- siegen.txt
|
||||
- siegen.openrefine.tar.gz
|
||||
|
||||
check:
|
||||
cmds:
|
||||
# Tasks with ":" are defined in Taskfile.yml and are identical for all data sources
|
||||
- task: :check
|
||||
vars: {PROJECT: "siegen", MINIMUM: "1250"}
|
||||
sources:
|
||||
# if "dir:" is not set for the task, relative paths are resolved from Taskfile.yml
|
||||
- data/siegen/refine/openrefine.log
|
||||
- data/siegen/refine/siegen.txt
|
||||
|
||||
split:
|
||||
cmds:
|
||||
- task: :split
|
||||
vars: {PROJECT: "siegen"}
|
||||
sources:
|
||||
- data/siegen/refine/siegen.txt
|
||||
generates:
|
||||
- data/siegen/split/*.xml
|
||||
|
||||
validate:
|
||||
cmds:
|
||||
- task: :validate
|
||||
vars: {PROJECT: "siegen"}
|
||||
sources:
|
||||
- data/siegen/split/*.xml
|
||||
generates:
|
||||
- data/siegen/validate.log
|
||||
|
||||
zip:
|
||||
cmds:
|
||||
- task: :zip
|
||||
vars: {PROJECT: "siegen"}
|
||||
sources:
|
||||
- data/siegen/split/*.xml
|
||||
generates:
|
||||
- data/siegen/siegen_{{.DATE}}.zip
|
||||
|
||||
diff:
|
||||
cmds:
|
||||
- task: :diff
|
||||
vars: {PROJECT: "siegen"}
|
||||
sources:
|
||||
- data/siegen/split/*.xml
|
||||
generates:
|
||||
- data/siegen/diff.log
|
||||
|
||||
  linkcheck:
    desc: Check OPUS Siegen links
    cmds:
      - task: :linkcheck
        vars: {PROJECT: "siegen"}
    sources:
      - data/siegen/split/*.xml
    generates:
      - data/siegen/linkcheck.log

  delete:
    desc: Delete OPUS Siegen cache
    cmds:
      - task: :delete
        vars: {PROJECT: "siegen"}
|
|
@ -1,150 +0,0 @@
|
|||
# https://taskfile.dev

version: '3'

tasks:
  # Tasks prefixed with ":" are defined once for all data sources in Taskfile.yml
  default:
    desc: Elpub Wuppertal
    deps: [harvest]
    cmds:
      - task: refine
      - task: check
      - task: split
      - task: validate
      - task: zip
      - task: diff

  harvest:
    dir: data/wuppertal/harvest
    cmds:
      - METHA_DIR=$PWD metha-sync --format oai_dc http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
      - METHA_DIR=$PWD metha-cat --format oai_dc http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider > wuppertal.xml

  refine:
    dir: data/wuppertal/refine
    ignore_error: true # provisional workaround to avoid an orphaned Java process on exit https://github.com/go-task/task/issues/141
    env:
      PORT: 3335
      RAM: 4G
      PROJECT: wuppertal
    cmds:
      # Start OpenRefine
      - $OPENREFINE -v warn -p $PORT -m $RAM -d $PWD > openrefine.log 2>&1 &
      - timeout 30s bash -c "until curl -s http://localhost:$PORT | cat | grep -q -o OpenRefine ; do sleep 1; done"
      # Import (requires an absolute path to the XML file)
      - $OPENREFINE_CLIENT -P $PORT --create "$(readlink -e ../harvest/wuppertal.xml)" --recordPath Records --recordPath Record --storeEmptyStrings false --trimStrings true --projectName $PROJECT
      # Preprocessing: move the identifier to the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (remove path)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/vorverarbeitung.json $PROJECT
      # Remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/html.json $PROJECT
      # Normalize DDC to three digits (affects dc:subjects and oai:setSpec)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/ddc.json $PROJECT
      # Set dc:publisher
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/publisher.json $PROJECT
      # Extract URNs, DOIs and PDF links from dc:identifier
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/identifier.json $PROJECT
      # Generate direct links by resolving the URNs via nbn-resolving and delete records without a direct link to a PDF
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/nbn.json $PROJECT
      # Split dc:subject into ioo and topic
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/subjects.json $PROJECT
      # Standardized rights statements, part 1 (links to CC licenses)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/rights.json $PROJECT
      # Prepare the data structure for templating: one record per row and delete empty rows
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/join.json $PROJECT
      # Merge titles in the same language into Title/Subtitle
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/subtitle.json $PROJECT
      # Language codes according to ISO-639-2b (affects dc:language and the xml:lang attributes for dc:coverage, dc:description and dc:title)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/language.json $PROJECT
      # Standardized rights statements, part 2 (canonical name for CC licenses)
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/rights-cc.json $PROJECT
      # Enrich with HT number via lobid-resources
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/hbz.json $PROJECT
      # Sorting: mods:nonSort for the first element in dc:title
      - $OPENREFINE_CLIENT -P $PORT --apply ../../../rules/wuppertal/nonsort.json $PROJECT
      # Export to METS:MODS with templating
      - |
        $OPENREFINE_CLIENT -P $PORT --export --template "$(< ../../../rules/wuppertal/template.txt)" --rowSeparator "
        <!-- SPLIT -->
        " --suffix "
        " --output wuppertal.txt $PROJECT
      # Statistics on runtime and resource usage
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:$PORT)
      # Shut down OpenRefine
      - PID=$(lsof -t -i:$PORT); kill $PID; while ps -p $PID > /dev/null; do sleep 1; done
      # Archive the OpenRefine project for debugging
      - tar cfz wuppertal.openrefine.tar.gz -C $(grep -l wuppertal *.project/metadata.json | cut -d '/' -f 1) .
      # Delete temporary files
      - rm -rf ./*.project* && rm -f workspace.json
    sources:
      # if "dir:" is set for the task, relative paths are resolved from dir
      - ../harvest/wuppertal.xml
      - ../../../rules/wuppertal/*.json
      - ../../../rules/wuppertal/template.txt
      #TODO - ../../../rules/common/*.json
    generates:
      - openrefine.log
      - wuppertal.txt
      - wuppertal.openrefine.tar.gz

  check:
    cmds:
      # Tasks prefixed with ":" are defined once for all data sources in Taskfile.yml
      - task: :check
        vars: {PROJECT: "wuppertal", MINIMUM: "1300"}
    sources:
      # if "dir:" is not set for the task, relative paths are resolved from Taskfile.yml
      - data/wuppertal/refine/openrefine.log
      - data/wuppertal/refine/wuppertal.txt

  split:
    cmds:
      - task: :split
        vars: {PROJECT: "wuppertal"}
    sources:
      - data/wuppertal/refine/wuppertal.txt
    generates:
      - data/wuppertal/split/*.xml

  validate:
    cmds:
      - task: :validate
        vars: {PROJECT: "wuppertal"}
    sources:
      - data/wuppertal/split/*.xml
    generates:
      - data/wuppertal/validate.log

  zip:
    cmds:
      - task: :zip
        vars: {PROJECT: "wuppertal"}
    sources:
      - data/wuppertal/split/*.xml
    generates:
      - data/wuppertal/wuppertal_{{.DATE}}.zip

  diff:
    cmds:
      - task: :diff
        vars: {PROJECT: "wuppertal"}
    sources:
      - data/wuppertal/split/*.xml
    generates:
      - data/wuppertal/diff.log

  linkcheck:
    desc: Check Elpub Wuppertal links
    cmds:
      - task: :linkcheck
        vars: {PROJECT: "wuppertal"}
    sources:
      - data/wuppertal/split/*.xml
    generates:
      - data/wuppertal/linkcheck.log

  delete:
    desc: Delete Elpub Wuppertal cache
    cmds:
      - task: :delete
        vars: {PROJECT: "wuppertal"}
|
|
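Review note: the removed `refine` task waits for OpenRefine with `timeout 30s bash -c "until curl ...; do sleep 1; done"`. The same poll-until-ready pattern can be sketched as a small reusable function. The name `wait_until` and the one-second polling interval are assumptions made for illustration; nothing like this exists in the repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): run a command once per
# second until it succeeds or the given number of seconds has elapsed.
wait_until() {
  timeout_s="$1"
  shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1    # give up: the command never succeeded in time
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Example with a condition that is already true, so no waiting happens.
wait_until 5 test -d / && echo "ready"
```

In the task itself the condition would be the `curl ... | grep -q OpenRefine` pipeline from the hunk above, wrapped in `sh -c '...'` so that it can be passed as a single command.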
@ -0,0 +1,145 @@
|
|||
version: '3'

tasks:
  main:
    desc: Elpub Wuppertal
    vars:
      MINIMUM: 1300 # minimum number of expected records
      PROJECT: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: harvest
      - task: refine
      # The following tasks prefixed with ":" are defined once for all data sources in Taskfile.yml
      - task: :check
        vars: {PROJECT: '{{.PROJECT}}', MINIMUM: '{{.MINIMUM}}'}
      - task: :split
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :validate
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :zip
        vars: {PROJECT: '{{.PROJECT}}'}
      - task: :diff
        vars: {PROJECT: '{{.PROJECT}}'}

  harvest:
    dir: ./{{.PROJECT}}/harvest
    vars:
      URL: http://elpub.bib.uni-wuppertal.de/servlets/OAIDataProvider
      FORMAT: oai_dc
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - METHA_DIR=$PWD metha-sync --format {{.FORMAT}} {{.URL}}
      - METHA_DIR=$PWD metha-cat --format {{.FORMAT}} {{.URL}} > {{.PROJECT}}.xml

  refine:
    dir: ./{{.PROJECT}}
    vars:
      PORT: 3335 # assign a different port for each project
      RAM: 4G # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
      LOG: '>(tee -a "refine/{{.PROJECT}}.log") 2>&1'
    cmds:
      - mkdir -p refine
      - task: :start # launch OpenRefine
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # Import (requires an absolute path to the XML file)
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m harvest/{{.PROJECT}}.xml)"
        --recordPath Records --recordPath Record
        --storeEmptyStrings false --trimStrings true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # Preprocessing: move the identifier to the first column; delete unneeded columns (without distinguishing features); rename the remaining columns (remove path)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/vorverarbeitung.json
        > {{.LOG}}
      - > # Remove HTML tags and convert subscript and superscript to Unicode (affects dc:description, dc:source and dc:title)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/html.json
        > {{.LOG}}
      - > # Normalize DDC to three digits (affects dc:subjects and oai:setSpec)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/ddc.json
        > {{.LOG}}
      - > # Set dc:publisher
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/publisher.json
        > {{.LOG}}
      - > # Extract URNs, DOIs and PDF links from dc:identifier
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/identifier.json
        > {{.LOG}}
      - > # Generate direct links by resolving the URNs via nbn-resolving and delete records without a direct link to a PDF
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nbn.json
        > {{.LOG}}
      - > # Split dc:subject into ioo and topic
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/subjects.json
        > {{.LOG}}
      - > # Standardized rights statements, part 1 (links to CC licenses)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/rights.json
        > {{.LOG}}
      - > # Prepare the data structure for templating: one record per row and delete empty rows
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/join.json
        > {{.LOG}}
      - > # Merge titles in the same language into Title/Subtitle
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/subtitle.json
        > {{.LOG}}
      - > # Language codes according to ISO-639-2b (affects dc:language and the xml:lang attributes for dc:coverage, dc:description and dc:title)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/language.json
        > {{.LOG}}
      - > # Standardized rights statements, part 2 (canonical name for CC licenses)
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/rights-cc.json
        > {{.LOG}}
      - > # Enrich with HT number via lobid-resources
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/hbz.json
        > {{.LOG}}
      - > # Sorting: mods:nonSort for the first element in dc:title
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/nonsort.json
        > {{.LOG}}
      - | # Export to METS:MODS with templating
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}" --export --template "$(< config/template.txt)" --rowSeparator "
        " --suffix "
        " --output "$(readlink -m refine/{{.PROJECT}}.txt)" > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - harvest/{{.PROJECT}}.xml
      - config/**
    generates:
      - refine/{{.PROJECT}}.openrefine.tar.gz
      - refine/{{.PROJECT}}.txt
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  linkcheck:
    desc: Check Elpub Wuppertal links
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :linkcheck
        vars: {PROJECT: '{{.PROJECT}}'}

  delete:
    desc: Delete Elpub Wuppertal cache
    vars:
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds:
      - task: :delete
        vars: {PROJECT: '{{.PROJECT}}'}

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
|
|
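Review note: the new Taskfile ties the project name to the directory name in two directions. `default` builds the namespaced task name from the current directory with `${PWD##*/}`, and every task recovers the project from the task name with `splitList ":" .TASK | first`. A minimal shell sketch of both string operations follows; the function names and the example path are hypothetical and only illustrate the parameter expansions.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the repository).

# Directory path -> namespaced task name, as done by `default`.
task_name_for_dir() {
  dir="${1%/}"                        # drop a trailing slash, if any
  printf '%s:main\n' "${dir##*/}"     # keep only the last path component
}

# Namespaced task name -> project, analogous to `splitList ":" .TASK | first`.
project_from_task() {
  printf '%s\n' "${1%%:*}"            # keep everything before the first ":"
}

task_name_for_dir "/srv/noah/wuppertal"    # prints wuppertal:main
project_from_task "wuppertal:refine"       # prints wuppertal
```

This only works as long as the directory name and the include namespace in the parent Taskfile.yml are identical, which is what the comment on `PROJECT` in `main` states.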
@ -57,11 +57,11 @@
|
|||
       "mode": "row-based"
     },
     "columnName": "setSpec",
-    "expression": "grel:value.split(':').reverse()[0]",
+    "expression": "grel:value.split(':')[-1]",
     "onError": "keep-original",
     "repeat": false,
     "repeatCount": 10,
-    "description": "Text transform on cells in column setSpec using expression grel:value.split(':').reverse()[0]"
+    "description": "Text transform on cells in column setSpec using expression grel:value.split(':')[-1]"
   },
   {
     "op": "core/text-transform",
|
|
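Review note: both GREL expressions in this hunk select the last colon-separated segment of `setSpec`; the change only replaces `reverse()[0]` with the shorter negative index `[-1]`. For readers who want to check the intended result outside OpenRefine, here is a rough shell equivalent. The function name and the sample values are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: return the text after the last ":" in a value,
# or the whole value if it contains no ":".
last_segment() {
  printf '%s\n' "${1##*:}"    # strip the longest prefix ending in ":"
}

last_segment "ddc:540"        # prints 540
last_segment "a:b:c"          # prints c
last_segment "plain"          # prints plain
```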
@ -157,7 +157,7 @@
|
|||
   {
     "type": "list",
     "name": "url",
-    "expression": "grel:row.record.cells['url'].value.join('').contains('.pdf')",
+    "expression": "grel:row.record.cells['url'].value.join('').toLowercase().contains('.pdf')",
     "columnName": "url",
     "invert": false,
     "omitBlank": false,
|
|
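Review note: adding `toLowercase()` makes the facet match links ending in `.PDF` or `.Pdf` as well, which the previous case-sensitive `contains('.pdf')` would have dropped. A small shell sketch of the same case-insensitive test is shown below; the function name and the example URLs are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument contains ".pdf" in any letter case.
has_pdf_link() {
  lowered="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$lowered" in
    *.pdf*) return 0 ;;
    *)      return 1 ;;
  esac
}

has_pdf_link "http://example.org/files/Thesis.PDF" && echo "match"
has_pdf_link "http://example.org/files/index.html" || echo "no match"
```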
@ -17,11 +17,11 @@
|
|||
 <role>
 <roleTerm type="code" authority="marcrelator">aut</roleTerm>
 </role>
-</name>{{forNonBlank(cells['dc:contributor'].value,x,forEach(x.split('␞'),v,'
+</name>{{forNonBlank(cells['dc:contributor'].value, x, forEach(x.split('␞'), v, '
 <name type="personal">
-<displayForm>'+ v.escape('xml') +'</displayForm>
-<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>
-<namePart type="given">' + v.split(',')[1].trim().escape('xml') + '</namePart>
+<displayForm>'+ v.escape('xml') +'</displayForm>' + forNonBlank(v.split(',')[1], z, '
+<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>' + '
+<namePart type="given">' + z.trim().escape('xml') + '</namePart>', '') + '
 <role>
 <roleTerm type="code" authority="marcrelator">ctb</roleTerm>
 </role>
|
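Review note: the template change wraps both `namePart` elements in `forNonBlank(v.split(',')[1], z, ...)`, so a contributor without a comma (for example a corporate name) yields only `displayForm` instead of failing on a missing second segment. The guard can be sketched in shell as follows. The function name, the `family=`/`given=` output format, and the sample names are assumptions; XML escaping is left out, and the sketch trims before testing for emptiness, which is a simplification of the GREL logic.

```shell
#!/bin/sh
# Hypothetical helper: print family and given name only when the input
# has a non-empty part after the first comma; print nothing otherwise.
name_parts() {
  case "$1" in
    *,*)
      family="${1%%,*}"        # text before the first comma
      given="${1#*,}"          # text after the first comma ...
      given="${given%%,*}"     # ... up to the next comma, if any
      given="$(printf '%s' "$given" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')"
      if [ -n "$given" ]; then
        printf 'family=%s\ngiven=%s\n' "$family" "$given"
      fi
      ;;
  esac
  return 0
}

name_parts "Curie, Marie"       # prints family=Curie and given=Marie
name_parts "Some Institute"     # prints nothing
```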