Compare commits

4 Commits
v0.3 ... main

Author SHA1 Message Date
Felix Lohmeier a23a93e5cd Reference to new repos 2022-04-08 11:40:24 +02:00
Felix Lohmeier a3cd4c1849 Changed spelling of vl:doctype 2022-03-15 18:33:40 +01:00
Felix Lohmeier 2e698c3fe3 fix dc:contributor java.lang issue 2022-02-22 10:38:55 +01:00
Felix Lohmeier 659ad70ec6 fix #37 divergent MIME type for records with file extension .pdf in URL 2021-07-06 13:39:07 +02:00
7 changed files with 47 additions and 143 deletions

README.md (141 changes)

@ -2,138 +2,13 @@
Harvesting of OAI-PMH interfaces and transformation into METS/MODS for the [noah.nrw](https://noah.nrw/) portal
**:warning: This is a prototype for the beta version of the portal.**
> :warning: **Attention:** This repo is no longer up to date. The workflows are now split up as follows:
## Data flow
| Workflow | GitHub Repository|
|:------------------|-----------------------------------------------------------------------------------------|
| bielefeld | [noah-bielefeld-pub](https://github.com/opencultureconsulting/noah-bielefeld-pub) |
| muenster | [noah-muenster-miami](https://github.com/opencultureconsulting/noah-muenster-miami) |
| siegen | [noah-siegen-opus](https://github.com/opencultureconsulting/noah-siegen-opus) |
| wuppertal | [noah-wuppertal-elpub](https://github.com/opencultureconsulting/noah-wuppertal-elpub) |
[![Data flow diagram](flowchart.svg)](https://raw.githubusercontent.com/opencultureconsulting/noah/main/flowchart.svg)
## Tools used
* Harvesting (with cache): [metha](https://github.com/miku/metha/)
* Transformation: [OpenRefine](https://github.com/OpenRefine/OpenRefine) and [openrefine-client](https://github.com/opencultureconsulting/openrefine-client)
* :warning: The use of [metafacture](https://github.com/metafacture) is planned for production.
* Task Runner: [Task](https://github.com/go-task/task)
## System requirements
* GNU/Linux (tested with Fedora 32)
* Java 8+
* [cURL](https://curl.se), xmllint
## Installation
1. Clone the Git repository
```sh
git clone https://github.com/opencultureconsulting/noah.git
cd noah
```
2. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha-0.2.20-0.x86_64.rpm
sudo dnf install ./metha-0.2.20-0.x86_64.rpm && rm metha-0.2.20-0.x86_64.rpm
```
b) DEB-based (Debian, Ubuntu, etc.)
```sh
wget https://github.com/miku/metha/releases/download/v0.2.20/metha_0.2.20_amd64.deb
sudo apt install ./metha_0.2.20_amd64.deb && rm metha_0.2.20_amd64.deb
```
3. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```
b) DEB-based (Debian, Ubuntu, etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```
4. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
```sh
task install
```
## Usage
* If necessary, raise ulimit beforehand to avoid an abort caused by "too many open files"
```
ulimit -n 20000
```
* All data sources (parallelized)
```
task
```
* One data source
```
task siegen:main
```
* Two data sources (parallelized)
```
task --parallel siegen:main wuppertal:main
```
* Start processing anyway, even if the checksum check indicates that there is nothing to do
```sh
task siegen:main --force
```
* For troubleshooting: print the commands without executing them
```sh
task siegen:main --dry --verbose --force
```
* Check the links of a data source
```
task siegen:linkcheck
```
* Delete the cache of a data source
```
task siegen:delete
```
* List available tasks
```
task --list
```
## Configuration
* The workflow of a data source is defined in its own `Taskfile.yml`
* Example: [siegen/Taskfile.yml](siegen/Taskfile.yml)
* The OpenRefine transformation rules used in the workflow are located in the `config` subfolder of the respective data source
* Example: [siegen/config/hbz.json](siegen/config/hbz.json)
* General tasks (e.g. validation) are defined in the [Taskfile.yml](Taskfile.yml) of the main folder.
## OAI-PMH Data Provider
The file-based OAI-PMH data provider [oai_pmh](https://github.com/opencultureconsulting/oai_pmh) is used to serve the transformed data. Installation and usage instructions can be found there.
The old technical approach is documented at https://github.com/opencultureconsulting/noah/tree/v0.3.

@ -93,6 +93,10 @@ tasks:
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/urlencode.json
> {{.LOG}}
- > # set internetMediaType uniformly to application/pdf when the URL has the file extension .pdf
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/mime.json
> {{.LOG}}
- > # add rights statements from dc:rights in OAI_DC format
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/rights.json

@ -0,0 +1,25 @@
[
{
"op": "core/text-transform",
"engineConfig": {
"facets": [
{
"type": "text",
"name": "relatedItem - location - url - displayLabel",
"columnName": "relatedItem - location - url - displayLabel",
"query": "\\.pdf$",
"mode": "regex",
"caseSensitive": false,
"invert": false
}
],
"mode": "row-based"
},
"columnName": "relatedItem - physicalDescription - internetMediaType",
"expression": "grel:'application/pdf'",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column relatedItem - physicalDescription - internetMediaType using expression grel:'application/pdf'"
}
]
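
To make the effect of this rule easier to follow, here is a minimal Python sketch of the same idea: rows whose URL column ends in `.pdf` (case-insensitive) get their media type overwritten with `application/pdf`, and all other rows are left alone. The function name `normalize_pdf_mime` and the list-of-dicts row model are assumptions for illustration only; the actual workflow applies the JSON above through OpenRefine.

```python
import re

# Column names copied from the rule above.
URL_COLUMN = "relatedItem - location - url - displayLabel"
MIME_COLUMN = "relatedItem - physicalDescription - internetMediaType"

# Same pattern as the facet query, matched case-insensitively.
PDF_SUFFIX = re.compile(r"\.pdf$", re.IGNORECASE)


def normalize_pdf_mime(rows):
    """Return copies of the rows; rows whose URL ends in .pdf get application/pdf."""
    result = []
    for row in rows:
        updated = dict(row)
        url = row.get(URL_COLUMN) or ""  # treat a missing or empty URL as no match
        if PDF_SUFFIX.search(url):
            updated[MIME_COLUMN] = "application/pdf"
        result.append(updated)
    return result
```

Rows that do not match keep whatever media type the source delivered, which mirrors the facet restricting the transform to the matching rows.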

@ -6,10 +6,10 @@
"mode": "row-based"
},
"baseColumnName": "doctype",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "vldoctype",
"columnInsertIndex": 40,
"description": "Create column vldoctype at index 40 based on column doctype using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
"columnInsertIndex": 3,
"description": "Create column vldoctype"
}
]
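
The GREL expression in this hunk is a lookup over a list of pairs: every pair whose first element equals the cell value contributes its second element, and the results are joined into one string. As a rough illustration, the same logic can be sketched in Python using the pairs from the variant with underscore spellings (`conference_object`, `course_material`, `research_data`, `working_paper`). The names `DOCTYPE_PAIRS` and `map_doctype` are hypothetical and not part of the repository.

```python
# Pairs copied from the expression with the underscore spellings.
DOCTYPE_PAIRS = [
    ("article", "oaArticle"),
    ("bachelorThesis", "oaBachelorThesis"),
    ("book", "oaBook"),
    ("bookPart", "oaBookPart"),
    ("conferenceObject", "conference_object"),
    ("CourseMaterial", "course_material"),
    ("doctoralThesis", "oaDoctoralThesis"),
    ("lecture", "lecture"),
    ("Manuscript", "handwritten"),
    ("masterThesis", "oaMasterThesis"),
    ("MusicalNotation", "notated music"),
    ("PeriodicalPart", "journal issue"),
    ("preprint", "oaPreprint"),
    ("report", "oaBdArticle"),
    ("ResearchData", "research_data"),
    ("review", "review"),
    ("StudyThesis", "oaStudyThesis"),
    ("Other", "oaBdOther"),
    ("workingPaper", "working_paper"),
]


def map_doctype(value):
    """Join the targets of all pairs whose source equals value (case-sensitive)."""
    return "".join(target for source, target in DOCTYPE_PAIRS if value == source)
```

An unknown value yields an empty string here, so unmapped document types end up without a `vldoctype` value rather than with a guessed one.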

@ -26,7 +26,7 @@
"mode": "row-based"
},
"baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "doctype",
"columnInsertIndex": 20

@ -26,11 +26,11 @@
"mode": "row-based"
},
"baseColumnName": "dc:type",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
"onError": "set-to-blank",
"newColumnName": "doctype",
"columnInsertIndex": 7,
"description": "Create column doctype at index 7 based on column dc:type using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
"description": "Create column doctype"
},
{
"op": "core/text-transform",

@ -17,11 +17,11 @@
<role>
<roleTerm type="code" authority="marcrelator">aut</roleTerm>
</role>
</name>{{forNonBlank(cells['dc:contributor'].value,x,forEach(x.split('␞'),v,'
</name>{{forNonBlank(cells['dc:contributor'].value, x, forEach(x.split('␞'), v, '
<name type="personal">
<displayForm>'+ v.escape('xml') +'</displayForm>
<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>
<namePart type="given">' + v.split(',')[1].trim().escape('xml') + '</namePart>
<displayForm>'+ v.escape('xml') +'</displayForm>' + forNonBlank(v.split(',')[1], z, '
<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>' + '
<namePart type="given">' + z.trim().escape('xml') + '</namePart>', '') + '
<role>
<roleTerm type="code" authority="marcrelator">ctb</roleTerm>
</role>
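
The change in this hunk guards the `namePart` elements with `forNonBlank`, so a contributor without a comma-separated given name no longer triggers an error when `v.split(',')[1]` is accessed. A minimal Python sketch of that guard is shown below. The function `contributor_name_parts` is a hypothetical helper, treating a whitespace-only given name as missing is an assumption, and `xml.sax.saxutils.escape` only approximates the template's `escape('xml')`.

```python
from xml.sax.saxutils import escape


def contributor_name_parts(name):
    """Build the name elements; family/given parts appear only if a given name exists."""
    parts = name.split(",")
    lines = ["<displayForm>" + escape(name) + "</displayForm>"]
    # Guard: only read the second part if the split actually produced one.
    given = parts[1] if len(parts) > 1 else None
    if given is not None and given.strip():
        lines.append('<namePart type="family">' + escape(parts[0]) + "</namePart>")
        lines.append('<namePart type="given">' + escape(given.strip()) + "</namePart>")
    return lines
```

For a corporate contributor such as an institution name without a comma, only the `displayForm` element is produced.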