Compare commits

4 Commits

| Author | SHA1 | Date |
|---|---|---|
| Felix Lohmeier | a23a93e5cd | |
| Felix Lohmeier | a3cd4c1849 | |
| Felix Lohmeier | 2e698c3fe3 | |
| Felix Lohmeier | 659ad70ec6 | |

141 README.md

@@ -2,138 +2,13 @@
 Harvesting of OAI-PMH interfaces and transformation into METS/MODS for the portal [noah.nrw](https://noah.nrw/)

-**:warning: This is a prototype for the beta version of the portal.**
+> :warning: **Warning:** This repo is no longer up to date. The workflows are now split up as follows:

-## Data flow
+| Workflow | GitHub Repository |
+|:------------------|-----------------------------------------------------------------------------------------|
+| bielefeld | [noah-bielefeld-pub](https://github.com/opencultureconsulting/noah-bielefeld-pub) |
+| muenster | [noah-muenster-miami](https://github.com/opencultureconsulting/noah-muenster-miami) |
+| siegen | [noah-siegen-opus](https://github.com/opencultureconsulting/noah-siegen-opus) |
+| wuppertal | [noah-wuppertal-elpub](https://github.com/opencultureconsulting/noah-wuppertal-elpub) |

-[![Data flow diagram](flowchart.svg)](https://raw.githubusercontent.com/opencultureconsulting/noah/main/flowchart.svg)

-## Tools used

-* Harvesting (with cache): [metha](https://github.com/miku/metha/)
-* Transformation: [OpenRefine](https://github.com/OpenRefine/OpenRefine) and [openrefine-client](https://github.com/opencultureconsulting/openrefine-client)
-* :warning: For production use, [metafacture](https://github.com/metafacture) is planned.
-* Task runner: [Task](https://github.com/go-task/task)

-## System requirements

-* GNU/Linux (tested with Fedora 32)
-* Java 8+
-* [cURL](https://curl.se), xmllint

-## Installation

-1. Clone the Git repository

-```sh
-git clone https://github.com/opencultureconsulting/noah.git
-cd noah
-```

-2. [metha 0.2.20](https://github.com/miku/metha/releases/tag/v0.2.20)

-a) RPM-based (Fedora, CentOS, SLES, etc.)

-```sh
-wget https://github.com/miku/metha/releases/download/v0.2.20/metha-0.2.20-0.x86_64.rpm
-sudo dnf install ./metha-0.2.20-0.x86_64.rpm && rm metha-0.2.20-0.x86_64.rpm
-```

-b) DEB-based (Debian, Ubuntu, etc.)

-```sh
-wget https://github.com/miku/metha/releases/download/v0.2.20/metha_0.2.20_amd64.deb
-sudo apt install ./metha_0.2.20_amd64.deb && rm metha_0.2.20_amd64.deb
-```

-3. [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)

-a) RPM-based (Fedora, CentOS, SLES, etc.)

-```sh
-wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
-sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
-```

-b) DEB-based (Debian, Ubuntu, etc.)

-```sh
-wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
-sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
-```

-4. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)

-```sh
-task install
-```

-## Usage

-* If necessary, raise ulimit beforehand to avoid aborts caused by "too many open files"

-```
-ulimit -n 20000
-```

-* All data sources (in parallel)

-```
-task
-```

-* One data source

-```
-task siegen:main
-```

-* Two data sources (in parallel)

-```
-task --parallel siegen:main wuppertal:main
-```

-* Start processing anyway, even if the checksum check indicates there is nothing to do

-```sh
-task siegen:main --force
-```

-* For troubleshooting: print commands without executing them

-```sh
-task siegen:main --dry --verbose --force
-```

-* Check the links of a data source

-```
-task siegen:linkcheck
-```

-* Delete the cache of a data source

-```
-task siegen:delete
-```

-* List available tasks

-```
-task --list
-```

-## Configuration

-* The workflow of a data source is defined in its own `Taskfile.yml`
-* Example: [siegen/Taskfile.yml](siegen/Taskfile.yml)
-* The OpenRefine transformation rules used in the workflow are located in the `config` subfolder of the respective data source
-* Example: [siegen/config/hbz.json](siegen/config/hbz.json)
-* General tasks (e.g. validation) are defined in the [Taskfile.yml](Taskfile.yml) of the main folder.

-## OAI-PMH Data Provider

-The file-based OAI-PMH data provider [oai_pmh](https://github.com/opencultureconsulting/oai_pmh) is used to serve the transformed data. Installation and usage instructions can be found there.
+The old technical approach is documented at https://github.com/opencultureconsulting/noah/tree/v0.3.

@@ -93,6 +93,10 @@ tasks:
 "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
 --apply config/urlencode.json
 > {{.LOG}}
+- > # Set internetMediaType uniformly to application/pdf when the URL ends in .pdf
+"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
+--apply config/mime.json
+> {{.LOG}}
 - > # Add rights statements from dc:rights in the OAI_DC format
 "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
 --apply config/rights.json

@@ -0,0 +1,25 @@
+[
+  {
+    "op": "core/text-transform",
+    "engineConfig": {
+      "facets": [
+        {
+          "type": "text",
+          "name": "relatedItem - location - url - displayLabel",
+          "columnName": "relatedItem - location - url - displayLabel",
+          "query": "\\.pdf$",
+          "mode": "regex",
+          "caseSensitive": false,
+          "invert": false
+        }
+      ],
+      "mode": "row-based"
+    },
+    "columnName": "relatedItem - physicalDescription - internetMediaType",
+    "expression": "grel:'application/pdf'",
+    "onError": "keep-original",
+    "repeat": false,
+    "repeatCount": 10,
+    "description": "Text transform on cells in column relatedItem - physicalDescription - internetMediaType using expression grel:'application/pdf'"
+  }
+]

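The new operation file above (presumably the `config/mime.json` applied by the Taskfile step) can be read as: select rows whose `relatedItem - location - url - displayLabel` value matches the case-insensitive regex `\.pdf$`, then overwrite `relatedItem - physicalDescription - internetMediaType` with `application/pdf`. The following Python sketch mirrors that logic on a plain dict. The function name `normalize_media_type` and the dict-per-row representation are assumptions made for illustration and are not part of the repository.

```python
import re

# Column names copied from the operation file; everything else is hypothetical.
FACET_COLUMN = "relatedItem - location - url - displayLabel"
TARGET_COLUMN = "relatedItem - physicalDescription - internetMediaType"

# Same pattern as the facet query, matched case-insensitively.
PDF_SUFFIX = re.compile(r"\.pdf$", re.IGNORECASE)


def normalize_media_type(row):
    """Return a copy of the row with the media type forced to application/pdf
    when the facet column ends in .pdf; other rows are returned unchanged."""
    label = row.get(FACET_COLUMN) or ""
    if PDF_SUFFIX.search(label):
        updated = dict(row)
        updated[TARGET_COLUMN] = "application/pdf"
        return updated
    return row
```

Rows that do not match the facet keep whatever media type they already had, which corresponds to the facet restricting the transform to matching rows only.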
@@ -6,10 +6,10 @@
       "mode": "row-based"
     },
     "baseColumnName": "doctype",
-    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
+    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
     "onError": "set-to-blank",
     "newColumnName": "vldoctype",
-    "columnInsertIndex": 40,
-    "description": "Create column vldoctype at index 40 based on column doctype using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
+    "columnInsertIndex": 3,
+    "description": "Create column vldoctype"
   }
 ]

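The GREL expression in this hunk is effectively a lookup table: `with(...)` binds the list of pairs to `x`, `forEach` yields `v[1]` for the pair whose `v[0]` equals the cell value and `null` for all others, and `.join('')` collapses the result to either the mapped value or an empty string. A minimal Python sketch of the same behaviour, using the pairs from the updated expression, might look as follows. The names `DOCTYPE_MAP` and `map_doctype` are hypothetical and only serve the illustration.

```python
# Pairs transcribed from the updated GREL expression in the hunk above.
DOCTYPE_MAP = {
    "article": "oaArticle",
    "bachelorThesis": "oaBachelorThesis",
    "book": "oaBook",
    "bookPart": "oaBookPart",
    "conferenceObject": "conference_object",
    "CourseMaterial": "course_material",
    "doctoralThesis": "oaDoctoralThesis",
    "lecture": "lecture",
    "Manuscript": "handwritten",
    "masterThesis": "oaMasterThesis",
    "MusicalNotation": "notated music",
    "PeriodicalPart": "journal issue",
    "preprint": "oaPreprint",
    "report": "oaBdArticle",
    "ResearchData": "research_data",
    "review": "review",
    "StudyThesis": "oaStudyThesis",
    "Other": "oaBdOther",
    "workingPaper": "working_paper",
}


def map_doctype(value):
    """Return the mapped document type, or an empty string when the value
    is not in the table (the comparison is exact and case-sensitive)."""
    return DOCTYPE_MAP.get(value, "")
```

Because the comparison is exact, a source value such as `Article` with a capital A would map to an empty string in this sketch, just as it would fall through every `if` in the GREL version.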
@@ -26,7 +26,7 @@
       "mode": "row-based"
     },
     "baseColumnName": "mets:mets - mets:dmdSec - mets:mdWrap - mets:xmlData - mods:mods - mods:genre",
-    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['CourseMaterial','courseMaterial'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','researchData'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','workingPaper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
+    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['CourseMaterial','course_material'], ['doctoralThesis','oaDoctoralThesis'], ['lecture','lecture'], ['Manuscript','handwritten'], ['masterThesis','oaMasterThesis'], ['MusicalNotation','notated music'], ['PeriodicalPart','journal issue'], ['preprint','oaPreprint'], ['report','oaBdArticle'], ['ResearchData','research_data'], ['review','review'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'],['workingPaper','working_paper'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
     "onError": "set-to-blank",
     "newColumnName": "doctype",
     "columnInsertIndex": 20

@@ -26,11 +26,11 @@
       "mode": "row-based"
     },
     "baseColumnName": "dc:type",
-    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
+    "expression": "grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conference_object'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['PeriodicalPart','journal issue'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))",
     "onError": "set-to-blank",
     "newColumnName": "doctype",
     "columnInsertIndex": 7,
-    "description": "Create column doctype at index 7 based on column dc:type using expression grel:with([ ['article','oaArticle'], ['bachelorThesis','oaBachelorThesis'], ['book','oaBook'], ['bookPart','oaBookPart'], ['conferenceObject','conferenceObject'], ['doctoralThesis','oaDoctoralThesis'], ['masterThesis','oaMasterThesis'], ['StudyThesis','oaStudyThesis'], ['Other','oaBdOther'] ], x, forEach(x, v, if(value == v[0], v[1], null)).join(''))"
+    "description": "Create column doctype"
   },
   {
     "op": "core/text-transform",

@@ -17,11 +17,11 @@
 <role>
 <roleTerm type="code" authority="marcrelator">aut</roleTerm>
 </role>
-</name>{{forNonBlank(cells['dc:contributor'].value,x,forEach(x.split('␞'),v,'
+</name>{{forNonBlank(cells['dc:contributor'].value, x, forEach(x.split('␞'), v, '
 <name type="personal">
-<displayForm>'+ v.escape('xml') +'</displayForm>
-<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>
-<namePart type="given">' + v.split(',')[1].trim().escape('xml') + '</namePart>
+<displayForm>'+ v.escape('xml') +'</displayForm>' + forNonBlank(v.split(',')[1], z, '
+<namePart type="family">' + v.split(',')[0].escape('xml') + '</namePart>' + '
+<namePart type="given">' + z.trim().escape('xml') + '</namePart>', '') + '
 <role>
 <roleTerm type="code" authority="marcrelator">ctb</roleTerm>
 </role>

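The template change in this last hunk appears to guard against contributor values that contain no comma: the old version always called `v.split(',')[1].trim()`, while the new one wraps both `namePart` elements in `forNonBlank(v.split(',')[1], z, ..., '')` so that they are only emitted when a second part exists. A rough Python sketch of that guard is shown below. The function name `contributor_names`, the closing `</name>` tag (the hunk is cut off before it), and the use of `xml.sax.saxutils.escape` in place of GREL's `escape('xml')` are assumptions.

```python
from xml.sax.saxutils import escape

SEPARATOR = "\u241e"  # the multi-value separator used in the template


def contributor_names(cell_value):
    """Render one <name> element per contributor; family/given parts are
    only written when the value has a non-blank part after the first comma."""
    if not cell_value or not cell_value.strip():
        return ""
    rendered = []
    for v in cell_value.split(SEPARATOR):
        parts = v.split(",")
        given = parts[1] if len(parts) > 1 else ""
        lines = ['<name type="personal">',
                 "<displayForm>" + escape(v) + "</displayForm>"]
        if given.strip():
            lines.append('<namePart type="family">' + escape(parts[0]) + "</namePart>")
            lines.append('<namePart type="given">' + escape(given.strip()) + "</namePart>")
        lines += ["<role>",
                  '<roleTerm type="code" authority="marcrelator">ctb</roleTerm>',
                  "</role>",
                  "</name>"]  # closing tag assumed; not visible in the hunk
        rendered.append("\n".join(lines))
    return "\n".join(rendered)
```

With an input such as `"Doe, Jane␞ACME Institute"`, the first contributor gets family and given name parts, while the second is rendered with `displayForm` only.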