Optimize marc2dc transformation rules in ch. 3.5

Felix Lohmeier 2017-11-26 00:34:46 +01:00
parent 57d40c2c95
commit e5be768c9c
3 changed files with 365 additions and 52 deletions

@ -115,7 +115,7 @@ Wenn Sie sich auf Basis der Empfehlung der LoC, der Statistik und Stichproben f
2. Copy the selected data from column `content` into a new "Dublin Core" column using the function `add column based on column...` \(the column name is the Dublin Core field\). 2. Copy the selected data from column `content` into a new "Dublin Core" column using the function `add column based on column...` \(the column name is the Dublin Core field\).
3. If needed, edit the data in the new column with transformations, e.g. to insert separators. 3. If needed, edit the data in the new column with transformations, e.g. to insert separators.
4. Merge values that belong together \(e.g. a person and their life dates\) in the new column using the function `join multi-valued cells`. To avoid merging too much \(e.g. several persons\), the column `index` must come first. 4. Merge values that belong together \(e.g. a person and their life dates\) in the new column using the function `join multi-valued cells`. To avoid merging too much \(e.g. several persons\), the column `index` must come first.
5. Finally, use the function `join multi-valued cells` once more with the familiar separator `␟` to merge the data into one row per record. For this, the column `id` must come first. 5. Finally, use the function `join multi-valued cells` once more with the familiar separator `␟` to merge the data into one row per record. For this, the column `id` must come first. To improve performance, you can alternatively apply the transformation `row.record.cells["column name"].value.join("␟")` (together with a facet "by blank" with value `false` on the column `id`) to the new columns.
Example for "Author" \(MARC21 `100a,D,d,e` to Dublin Core `dc:creator`\): Example for "Author" \(MARC21 `100a,D,d,e` to Dublin Core `dc:creator`\):
@ -144,7 +144,8 @@ Example for "Author" \(MARC21 `100a,D,d,e` to Dublin Core `dc:creator`\):
* Column `creator` / Edit cells / Join multi-valued cells... / Separator: ` ` \(space\) * Column `creator` / Edit cells / Join multi-valued cells... / Separator: ` ` \(space\)
5. Finally, merge the data into one row per record 5. Finally, merge the data into one row per record
* Column `id` / Edit column / Move column to beginning * Column `id` / Edit column / Move column to beginning
* Column `creator` / Edit cells / Join multi-valued cells... / Separator: `␟` \(Unit Separator\) * Column `id` / Facet / Customized facets / Facet by blank... / select value `false`
* Column `creator` / Edit cells / Transform... / Expression: `row.record.cells["creator"].value.join("␟")`
6. Check the result and correct it if necessary 6. Check the result and correct it if necessary
* Column `creator` / Facet / Text facet * Column `creator` / Facet / Text facet
* Column `creator` / Edit cells / Cluster and edit... / Method: nearest neighbor * Column `creator` / Edit cells / Cluster and edit... / Method: nearest neighbor
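
The record-wise join from step 5 can be illustrated outside OpenRefine. The following is a minimal Python sketch, not OpenRefine's implementation: the function name `join_per_record`, the list-of-dicts row model, and the sample values are assumptions made for illustration, and it assumes that blank cells are skipped when the values of a record are collected. A record starts at each row with a non-blank `id`, and only that first row is rewritten, which mirrors the facet "by blank" with value `false` on `id`.

```python
SEP = "\u241f"  # the "␟" symbol used as separator in the course material


def is_blank(value):
    """Treat None and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def join_per_record(rows, column, key="id", sep=SEP):
    """Return a copy of rows in which the first row of each record holds
    all non-blank values of `column` from that record, joined with `sep`."""
    result = [dict(row) for row in rows]
    first = None      # index of the first row of the current record
    collected = {}    # first-row index -> values found in that record
    for i, row in enumerate(rows):
        if not is_blank(row.get(key)):
            first = i
            collected[first] = []
        if first is not None and not is_blank(row.get(column)):
            collected[first].append(str(row[column]))
    for start, values in collected.items():
        if values:  # records without any value are left untouched
            result[start][column] = sep.join(values)
    return result


# Hypothetical sample: record 1 spans two rows, record 2 has a single row.
rows = [
    {"id": "1", "creator": "Person A"},
    {"id": "", "creator": "Person B"},
    {"id": "2", "creator": "Person C"},
]
joined = join_per_record(rows, "creator")
```

In this sketch, rows with a blank `id` keep their original values; they are neither changed nor removed.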

@ -24,7 +24,7 @@ curl "http://oai.swissbib.ch/oai/DB=2.1?verb=ListRecords&metadataPrefix=m21-xml%
JSON file with transformation rules for a mapping from MARC21 to Dublin Core: [openrefine-marc2dc.json](https://raw.githubusercontent.com/felixlohmeier/kurs-bibliotheks-und-archivinformatik/master/openrefine/openrefine-marc2dc.json) JSON file with transformation rules for a mapping from MARC21 to Dublin Core: [openrefine-marc2dc.json](https://raw.githubusercontent.com/felixlohmeier/kurs-bibliotheks-und-archivinformatik/master/openrefine/openrefine-marc2dc.json)
Result as TSV file: [openrefine/einstein-nebis\_2017-11-02.tsv](https://github.com/felixlohmeier/kurs-bibliotheks-und-archivinformatik/raw/master/openrefine/einstein-nebis_2017-11-02.tsv) Result as TSV file: [openrefine/einstein-nebis\_2017-11-02.tsv](https://github.com/felixlohmeier/kurs-bibliotheks-und-archivinformatik/raw/master/openrefine/einstein-nebis_2017-11-02.tsv) (save the file as `einstein.tsv` in the `Downloads` folder for use in chapter 4)
The following mappings were implemented in it as examples: The following mappings were implemented in it as examples:

@ -245,11 +245,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column creator", "description": "Text transform on cells in column creator using expression grel:row.record.cells[\"creator\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "creator", "columnName": "creator",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"creator\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
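
Because the same `core/text-transform` operation is repeated for every Dublin Core column, the block could also be generated rather than edited by hand. The following is a minimal Python sketch that assumes the field layout of the operation shown in this diff; the helper name `record_join_operation` and the column list are hypothetical, and the sketch does not escape quotes in column names.

```python
import json

SEP = "\u241f"  # the "␟" separator used throughout the rules file


def record_join_operation(column, key_column="id", sep=SEP):
    """Build one text-transform operation that joins all values of
    `column` within a record, restricted to rows with a non-blank key."""
    expression = f'grel:row.record.cells["{column}"].value.join("{sep}")'
    return {
        "op": "core/text-transform",
        "description": (
            f"Text transform on cells in column {column} "
            f"using expression {expression}"
        ),
        "engineConfig": {
            "mode": "row-based",
            "facets": [
                {
                    "omitError": False,
                    "expression": "isBlank(value)",
                    "selectBlank": False,
                    # keep only rows where isBlank(value) is false
                    "selection": [{"v": {"v": False, "l": "false"}}],
                    "selectError": False,
                    "invert": False,
                    "name": key_column,
                    "omitBlank": False,
                    "type": "list",
                    "columnName": key_column,
                }
            ],
        },
        "columnName": column,
        "expression": expression,
        "onError": "keep-original",
        "repeat": False,
        "repeatCount": 10,
    }


# Hypothetical subset of the columns touched by this commit.
columns = ["creator", "title", "contributor"]
operations = [record_join_operation(name) for name in columns]
operations_json = json.dumps(operations, indent=2, ensure_ascii=False)
```

The resulting `operations_json` string holds a JSON array in the same shape as the operations in this file; whether a given OpenRefine version accepts it unchanged is not verified here.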
@ -381,11 +407,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column title", "description": "Text transform on cells in column title using expression grel:row.record.cells[\"title\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "title", "columnName": "title",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"title\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -645,11 +697,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column contributor", "description": "Text transform on cells in column contributor using expression grel:row.record.cells[\"contributor\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "contributor", "columnName": "contributor",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"contributor\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -704,11 +782,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column language", "description": "Text transform on cells in column language using expression grel:row.record.cells[\"language\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "language", "columnName": "language",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"language\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -840,11 +944,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column publisher", "description": "Text transform on cells in column publisher using expression grel:row.record.cells[\"publisher\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "publisher", "columnName": "publisher",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"publisher\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -899,11 +1029,37 @@
"onError": "set-to-blank" "onError": "set-to-blank"
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column coverage", "description": "Text transform on cells in column coverage using expression grel:row.record.cells[\"coverage\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "coverage", "columnName": "coverage",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"coverage\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -958,11 +1114,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column date", "description": "Text transform on cells in column date using expression grel:row.record.cells[\"date\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "date", "columnName": "date",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"date\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -1221,11 +1403,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column identifier", "description": "Text transform on cells in column identifier using expression grel:row.record.cells[\"identifier\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "identifier", "columnName": "identifier",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"identifier\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -1280,11 +1488,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column rights", "description": "Text transform on cells in column rights using expression grel:row.record.cells[\"rights\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "rights", "columnName": "rights",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"rights\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -1358,11 +1592,37 @@
"index": 0 "index": 0
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column type", "description": "Text transform on cells in column type using expression grel:row.record.cells[\"type\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "type", "columnName": "type",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"type\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/text-transform", "op": "core/text-transform",
@ -1391,7 +1651,7 @@
} }
] ]
}, },
"columnName": "type", "columnName": "uniques",
"expression": "grel:value.split(\"␟\").uniques().join(\"␟\")", "expression": "grel:value.split(\"␟\").uniques().join(\"␟\")",
"onError": "keep-original", "onError": "keep-original",
"repeat": false, "repeat": false,
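
The GREL expression `value.split("␟").uniques().join("␟")` in this hunk removes duplicate entries from a joined cell. A rough Python equivalent, as a sketch only: the function name `unique_values` is hypothetical, and keeping the order of first occurrence is an assumption about the intended result rather than a documented guarantee of `uniques()`.

```python
SEP = "\u241f"  # the "␟" separator between values in one cell


def unique_values(value, sep=SEP):
    """Split a joined cell, drop repeated entries, and join it again.
    dict.fromkeys keeps the first occurrence of each entry in order."""
    return sep.join(dict.fromkeys(value.split(sep)))


# Hypothetical cell content with one repeated entry.
deduplicated = unique_values("Text" + SEP + "Text" + SEP + "Image")
```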
@ -1545,11 +1805,37 @@
"onError": "set-to-blank" "onError": "set-to-blank"
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column description", "description": "Text transform on cells in column description using expression grel:row.record.cells[\"description\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "description", "columnName": "description",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"description\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",
@ -1604,11 +1890,37 @@
"onError": "set-to-blank" "onError": "set-to-blank"
}, },
{ {
"op": "core/multivalued-cell-join", "op": "core/text-transform",
"description": "Join multi-valued cells in column extent", "description": "Text transform on cells in column extent using expression grel:row.record.cells[\"extent\"].value.join(\"␟\")",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"selection": [
{
"v": {
"v": false,
"l": "false"
}
}
],
"selectError": false,
"invert": false,
"name": "id",
"omitBlank": false,
"type": "list",
"columnName": "id"
}
]
},
"columnName": "extent", "columnName": "extent",
"keyColumnName": "id", "expression": "grel:row.record.cells[\"extent\"].value.join(\"␟\")",
"separator": "␟" "onError": "keep-original",
"repeat": false,
"repeatCount": 10
}, },
{ {
"op": "core/column-addition", "op": "core/column-addition",