Compare commits
23 Commits
Author | SHA1 | Date |
---|---|---|
Felix Lohmeier | 2cc2378085 | |
dependabot[bot] | 9e6e42261b | |
Felix Lohmeier | 4e32074d85 | |
Felix Lohmeier | a9c494856b | |
Felix Lohmeier | ca19d7ef16 | |
Felix Lohmeier | 93be203efe | |
Felix Lohmeier | 2894b0194f | |
Felix Lohmeier | 4199fadc04 | |
Felix Lohmeier | 80fb37cb65 | |
Felix Lohmeier | 68dbc04c01 | |
Felix Lohmeier | b259cf571c | |
Felix Lohmeier | f6c8ee9d98 | |
Felix Lohmeier | b8114260ec | |
Felix Lohmeier | d60f732244 | |
Felix Lohmeier | 52fff4281b | |
Felix Lohmeier | 978344a55c | |
Felix Lohmeier | 9471abc35f | |
Felix Lohmeier | 566cd372af | |
Felix Lohmeier | 0d1150c5fb | |
Felix Lohmeier | 38ac46b450 | |
Felix Lohmeier | 197285194a | |
Felix Lohmeier | c8e57230ca | |
Felix Lohmeier | 2af30448dc | |
@@ -1,6 +1,6 @@
-# downloaded program libraries
+# ignore downloaded program libraries
 openrefine
 openrefine-client
-# examples output directories
+# ignore output directories of examples
 examples/powerhouse-museum/output

**README.md** (241 changed lines)
@@ -1,6 +1,6 @@
 ## OpenRefine batch processing (openrefine-batch.sh)

-[![Codacy Badge](https://api.codacy.com/project/badge/Grade/66bf001c38194f5bb722f65f5e15f0ec)](https://www.codacy.com/app/mail_74/openrefine-batch?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-batch&utm_campaign=badger)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/ad8a97e42e634bbe87203ea48efb436e)](https://www.codacy.com/gh/opencultureconsulting/openrefine-batch/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)

 Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
@@ -17,9 +17,21 @@ If you prefer a containerized approach, see a [variation of this script for Dock
 - **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
 - **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to files in the format specified (default: TSV - tab-separated values).

+### Demo via binder
+
+[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)
+
+- free to use on-demand server with Jupyterlab and Bash Kernel
+- no registration needed, will start within a few minutes
+- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
+
 ### Install

-Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
+Download the script and grant file permissions to execute:
+```
+wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh
+chmod +x openrefine-batch.sh
+```

 That's all. The script will automatically download copies of OpenRefine and the python client on first run and will tell you if something (python, java) is missing.
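The download-and-make-executable pattern in the Install diff above can be tried out safely on a local stand-in file before touching the real script; `demo.sh` below is a hypothetical placeholder, not a file from the repository.

```shell
# Create a stand-in for the downloaded script (demo.sh is a made-up name).
cat > demo.sh <<'EOF'
#!/bin/bash
echo "hello from demo"
EOF
chmod +x demo.sh   # grant execute permission, as in the Install step
./demo.sh          # now runnable directly; prints: hello from demo
```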
@@ -39,7 +51,7 @@ cp CONFIGFILES config/
 * you may use hard symlinks instead of cp: `ln INPUTFILE input/`

 **CONFIGFILES**
-* JSON files with [OpenRefine transformation rules)](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
+* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)

 **OUTPUT/**
 * path to directory where results and temporary data should be stored
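The hard-link tip above (`ln` without `-s`) avoids copying data: both names then refer to the same inode. A small sketch with made-up file names:

```shell
# Create a hypothetical input file and hard-link it into input/ instead of copying.
mkdir -p input
echo "id,name" > INPUTFILE.csv
ln INPUTFILE.csv input/INPUTFILE.csv
# Both directory entries point at the same inode, so no data was duplicated.
stat -c %i INPUTFILE.csv input/INPUTFILE.csv
```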
@@ -50,6 +62,17 @@ cp CONFIGFILES config/

 [Example Powerhouse Museum](examples/powerhouse-museum)

+download example data
+
+```
+wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
+unzip master.zip openrefine-batch-master/examples/*
+mv openrefine-batch-master/examples .
+rm -f master.zip
+```
+
+execute openrefine-batch.sh
+
 ```
 ./openrefine-batch.sh \
 -a examples/powerhouse-museum/input/ \
@@ -61,12 +84,10 @@ cp CONFIGFILES config/
 -RX
 ```

-clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
-
 ### Help Screen

 ```
-[14:45 felix ~/openrefine-batch]$ ./openrefine-batch.sh
+[felix@tux openrefine-batch]$ ./openrefine-batch.sh
 Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...

 == basic arguments ==
@@ -81,38 +102,61 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 -i INPUTOPTIONS several options provided by openrefine-client, see below...
 -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
 -p PORT PORT on which OpenRefine should listen (default: 3333)
+-t TEMPLATING several options for templating export, see below...
 -E do NOT export files
 -R do NOT restart OpenRefine after each transformation (e.g. config file)
 -X do NOT restart OpenRefine after each project (e.g. input file)
 -h displays this help screen

 == inputoptions (mandatory for xml, json, fixed-width, xslx, ods) ==
--i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
+-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
 -i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
--i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
+-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)

 == more inputoptions (optional, only together with inputformat) ==
--i projectName=PROJECTNAME (all formats)
+-i projectName=PROJECTNAME (all formats), default: filename
 -i limit=LIMIT (all formats), default: -1
--i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
--i trimStrings=TRIMSTRINGS (xml, json), default: false
--i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
--i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
+-i includeFileSources=true/false (all formats), default: false
+-i trimStrings=true/false (xml, json), default: false
+-i storeEmptyStrings=true/false (xml, json), default: true
+-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
 -i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
--i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
--i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
--i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
--i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
--i processQuotes=PROCESSQUOTES (csv, tsv), default: true
--i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
+-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
+-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
+-i skipDataLines=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
+-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
+-i processQuotes=true/false (csv, tsv), default: true
+-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
+-i linesPerRow=LINESPERROW (line-based), default: 1

-== example ==
+== templating options (alternative exportformat) ==
+-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
+-t mode=row-based/record-based (engine mode, default: row-based)
+-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
+-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
+-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
+-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
+-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
+-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
+-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files
+-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)
+
+== examples ==

 download example data

 wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
 unzip master.zip openrefine-batch-master/examples/*
 mv openrefine-batch-master/examples .
 rm -f master.zip

 example 1 (input, transform, export to tsv)

 ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX

-clone or download GitHub repository to get example data:
-https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
+example 2 (input, transform, templating export)
+
+./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
 ```
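The help screen above shows that `-i key=value` may be repeated to pass several input options at once. A hedged sketch (not the script's actual parser) of how such repeated flags can be collected with `getopts` and forwarded as `--key=value` arguments:

```shell
# Collect repeated -i key=value arguments into one array of --key=value flags.
# collect_inputoptions is a made-up helper name for illustration.
collect_inputoptions() {
  local OPTIND opt
  inputoptions=()
  while getopts "i:" opt; do
    case $opt in
      i) inputoptions+=("--$OPTARG") ;;
    esac
  done
}
collect_inputoptions -i processQuotes=false -i guessCellValueTypes=true
printf '%s\n' "${inputoptions[@]}"
# prints:
# --processQuotes=false
# --guessCellValueTypes=true
```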
### Logging
@@ -120,107 +164,127 @@ https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
 The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:

 ```
-[00:41 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
+[felix@tux openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
 Download OpenRefine...
-openrefine-linux-2017-10-26.tar.gz 100%[=====================================================================================================================>] 66,34M 5,62MB/s in 12s
+openrefine-linux-3.5.0.tar.gz 100%[=========================================================================================================================================>] 125,73M 9,50MB/s in 13s
 Install OpenRefine in subdirectory openrefine...
-Total bytes read: 79861760 (77MiB, 128MiB/s)
+Total bytes read: 154163200 (148MiB, 87MiB/s)

 Download OpenRefine client...
-openrefine-client_0-3-1_linux-64bit 100%[=====================================================================================================================>] 5,39M 5,08MB/s in 1,1s
+openrefine-client_0-3-10_linux 100%[=========================================================================================================================================>] 4,25M 9,17MB/s in 0,5s

-Input directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/input
+Input directory: /home/felix/git/openrefine-batch/examples/powerhouse-museum/input
 Input files: phm-collection.tsv
 Input format: --format=tsv
 Input options: --processQuotes=false --guessCellValueTypes=true
-Config directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/config
+Config directory: /home/felix/git/openrefine-batch/examples/powerhouse-museum/config
 Transformation rules: phm-transform.json
 Cross directory: /dev/null
 Cross projects:
 OpenRefine heap space: 2048M
 OpenRefine port: 3333
-OpenRefine workspace: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output
+OpenRefine workspace: /home/felix/git/openrefine-batch/examples/powerhouse-museum/output
 Export to workspace: true
 Export format: tsv
 Templating options:
 restart after file: false
 restart after transform: false

 === 1. Launch OpenRefine ===

-starting time: Sa 28. Okt 00:42:33 CEST 2017
+starting time: Di 9. Nov 22:37:25 CET 2021

 Using refine.ini for configuration
 You have 15913M of free memory.
 Your current configuration is set to use 2048M of memory.
 OpenRefine can run better when given more memory. Read our FAQ on how to allocate more memory here:
 https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Allocate-More-Memory
+/usr/bin/java -cp server/classes:server/target/lib/* -Drefine.headless=true -Xms2048M -Xmx2048M -Drefine.memory=2048M -Drefine.max_form_content_size=1048576 -Drefine.verbosity=info -Dpython.path=main/webapp/WEB-INF/lib/jython -Dpython.cachedir=/home/felix/.local/share/google/refine/cachedir -Drefine.data_dir=/home/felix/git/openrefine-batch/examples/powerhouse-museum/output -Drefine.webapp=main/webapp -Drefine.port=3333 -Drefine.interface=127.0.0.1 -Drefine.host=127.0.0.1 -Drefine.autosave=1440 com.google.refine.Refine
 Starting OpenRefine at 'http://127.0.0.1:3333/'

-00:42:33.199 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
-00:42:33.200 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
-00:42:33.206 [ refine_server] Initializing context: '/' from '/home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/openrefine/webapp' (6ms)
-00:42:33.418 [ refine] Starting OpenRefine 2017-10-26 [TRUNK]... (212ms)
-00:42:33.424 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
-00:42:35.993 [ refine] Running in headless mode (2569ms)
+log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).
+log4j:WARN Please initialize the log4j system properly.
+log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
+SLF4J: Class path contains multiple SLF4J bindings.
+SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/webapp/WEB-INF/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
+SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/server/target/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
+SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
+SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
+22:37:28.211 [ refine] Starting OpenRefine 3.5.0 [d4209a2]... (0ms)
+22:37:28.213 [ refine] initializing FileProjectManager with dir (2ms)
+22:37:28.213 [ refine] /home/felix/git/openrefine-batch/examples/powerhouse-museum/output (0ms)
+22:37:28.223 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (10ms)

 === 2. Import all files ===

-starting time: Sa 28. Okt 00:42:36 CEST 2017
+starting time: Di 9. Nov 22:37:33 CET 2021

 import phm-collection.tsv...
-00:42:36.393 [ refine] POST /command/core/create-project-from-upload (400ms)
-New project: 1721413008439
-00:42:40.731 [ refine] GET /command/core/get-rows (4338ms)
-Number of rows: 75814
+22:37:33.804 [ refine] GET /command/core/get-csrf-token (5581ms)
+22:37:33.872 [ refine] POST /command/core/create-project-from-upload (68ms)
+22:37:44.653 [ refine] GET /command/core/get-models (10781ms)
+22:37:44.790 [ refine] POST /command/core/get-rows (137ms)
+id: 2252508879578
+rows: 75814
 STARTED ELAPSED %MEM %CPU RSS
-00:42:32 00:07 5.7 220 937692
+22:37:25 00:19 10.2 202 1670620

 === 3. Prepare transform & export ===

-starting time: Sa 28. Okt 00:42:40 CEST 2017
+starting time: Di 9. Nov 22:37:44 CET 2021

 get project ids...
-00:42:40.866 [ refine] GET /command/core/get-all-project-metadata (135ms)
-1721413008439: phm-collection.tsv
+22:37:45.112 [ refine] GET /command/core/get-csrf-token (322ms)
+22:37:45.115 [ refine] GET /command/core/get-all-project-metadata (3ms)
+2252508879578: phm-collection

-=== 4. Transform phm-collection.tsv ===
+=== 4. Transform phm-collection ===

-starting time: Sa 28. Okt 00:42:40 CEST 2017
+starting time: Di 9. Nov 22:37:45 CET 2021

 transform phm-transform.json...
-00:42:40.963 [ refine] GET /command/core/get-models (97ms)
-00:42:40.967 [ refine] POST /command/core/apply-operations (4ms)
+22:37:45.303 [ refine] GET /command/core/get-csrf-token (188ms)
+22:37:45.308 [ refine] GET /command/core/get-models (5ms)
+22:37:45.324 [ refine] POST /command/core/apply-operations (16ms)
+File /home/felix/git/openrefine-batch/examples/powerhouse-museum/config/phm-transform.json has been successfully applied to project 2252508879578
 STARTED ELAPSED %MEM %CPU RSS
-00:42:32 00:29 7.1 142 1162720
+22:37:25 00:34 11.9 175 1940600

-=== 5. Export phm-collection.tsv ===
+=== 5. Export phm-collection ===

-starting time: Sa 28. Okt 00:43:02 CEST 2017
+starting time: Di 9. Nov 22:37:59 CET 2021

 export to file phm-collection.tsv...
-00:43:02.555 [ refine] GET /command/core/get-models (21588ms)
-00:43:02.557 [ refine] GET /command/core/get-all-project-metadata (2ms)
-00:43:02.561 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
+22:37:59.944 [ refine] GET /command/core/get-csrf-token (14620ms)
+22:37:59.947 [ refine] GET /command/core/get-models (3ms)
+22:37:59.951 [ refine] GET /command/core/get-all-project-metadata (4ms)
+22:37:59.954 [ refine] POST /command/core/export-rows/phm-collection.tsv (3ms)
+Export to file /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv complete
 STARTED ELAPSED %MEM %CPU RSS
-00:42:32 00:53 7.1 81.1 1164684
+22:37:25 00:38 12.4 181 2021388

 output (number of lines / size in bytes):
-167017 60619468 /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
+75728 59431272 /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv

 cleanup...
-00:43:26.161 [ ProjectManager] Saving all modified projects ... (23600ms)
-00:43:29.586 [ project_utilities] Saved project '1721413008439' (3425ms)
+22:38:06.850 [ ProjectManager] Saving all modified projects ... (6896ms)
+22:38:10.014 [ project_utilities] Saved project '2252508879578' (3164ms)

 === Statistics ===

 starting time and run time of each step:
-Start process Sa 28. Okt 00:42:33 CEST 2017 (00:00:00)
-Launch OpenRefine Sa 28. Okt 00:42:33 CEST 2017 (00:00:03)
-Import all files Sa 28. Okt 00:42:36 CEST 2017 (00:00:04)
-Prepare transform & export Sa 28. Okt 00:42:40 CEST 2017 (00:00:00)
-Transform phm-collection.tsv Sa 28. Okt 00:42:40 CEST 2017 (00:00:22)
-Export phm-collection.tsv Sa 28. Okt 00:43:02 CEST 2017 (00:00:28)
-End process Sa 28. Okt 00:43:30 CEST 2017 (00:00:00)
+Start process Di 9. Nov 22:37:25 CET 2021 (00:00:00)
+Launch OpenRefine Di 9. Nov 22:37:25 CET 2021 (00:00:08)
+Import all files Di 9. Nov 22:37:33 CET 2021 (00:00:11)
+Prepare transform & export Di 9. Nov 22:37:44 CET 2021 (00:00:01)
+Transform phm-collection Di 9. Nov 22:37:45 CET 2021 (00:00:14)
+Export phm-collection Di 9. Nov 22:37:59 CET 2021 (00:00:11)
+End process Di 9. Nov 22:38:10 CET 2021 (00:00:00)

-total run time: 00:00:57 (hh:mm:ss)
-highest memory load: 1137 MB
+total run time: 00:00:45 (hh:mm:ss)
+highest memory load: 1974 MB
 ```
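The STARTED/ELAPSED/%MEM/%CPU/RSS blocks in the log above are the `ps` statistics the script prints for each step. The script's exact invocation is not shown in this diff, so the following is an assumed sketch of a call that produces the same columns, here for the current shell process:

```shell
# Show start time, elapsed time, memory and CPU share, and resident set size
# for one process (here: the shell itself via $$).
ps -o start,etime,%mem,%cpu,rss -p $$
```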
### Docker
@@ -229,8 +293,14 @@ A variation of the shell script orchestrates a [docker container for OpenRefine]

 **Install**

-1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
-2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh && chmod +x openrefine-batch-docker.sh`
+1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
+* **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot)
+* or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
+2. Download the script and grant file permissions to execute:
+```
+wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh
+chmod +x openrefine-batch-docker.sh
+```

 **Usage**
@@ -239,15 +309,36 @@ mkdir input
 cp INPUTFILES input/
 mkdir config
 cp CONFIGFILES config/
-sudo ./openrefine-batch-docker.sh input/ config/ OUTPUT/
+./openrefine-batch-docker.sh -a input/ -b config/ -c OUTPUT/
 ```

-Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
+The script may ask you for sudo privileges. Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.

-### Todo
-
-- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
-- [ ] provide more example data from other OpenRefine tutorials
+**Example**
+
+[Example Powerhouse Museum](examples/powerhouse-museum)
+
+download example data
+
+```
+wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
+unzip master.zip openrefine-batch-master/examples/*
+mv openrefine-batch-master/examples .
+rm -f master.zip
+```
+
+execute openrefine-batch-docker.sh
+
+```
+./openrefine-batch-docker.sh \
+-a examples/powerhouse-museum/input/ \
+-b examples/powerhouse-museum/config/ \
+-c examples/powerhouse-museum/output/ \
+-f tsv \
+-i processQuotes=false \
+-i guessCellValueTypes=true \
+-RX
+```
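The "may ask you for sudo" behaviour described above boils down to one check: can the current user reach the Docker daemon directly? A hedged sketch of that check, which also tolerates `docker` being absent entirely:

```shell
# Probe the Docker daemon; fall back to sudo only when the direct call fails.
if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  echo "docker usable without sudo"
else
  echo "docker needs sudo (or is not installed)"
fi
```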
### Licensing
@@ -0,0 +1 @@
+openjdk-8-jre
@@ -0,0 +1,5 @@
+#!/bin/bash
+set -e
+
+# Install bash_kernel https://github.com/takluyver/bash_kernel
+python -m bash_kernel.install
@@ -0,0 +1,2 @@
+jupyter-server-proxy==3.2.1
+bash_kernel==0.7.2
@@ -0,0 +1 @@
+{"metadata":{"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"},"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"}},"nbformat_minor":5,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Example Powerhouse Museum\n\nOutput will be stored in examples/powerhouse-museum/output/phm-collection.tsv","metadata":{}},{"cell_type":"code","source":"./openrefine-batch.sh \\\n-a examples/powerhouse-museum/input/ \\\n-b examples/powerhouse-museum/config/ \\\n-c examples/powerhouse-museum/output/ \\\n-f tsv \\\n-i processQuotes=false \\\n-i guessCellValueTypes=true \\\n-RX","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}
@@ -73,6 +73,41 @@
 ]
 }
 },
+{
+"op": "core/text-transform",
+"description": "Text transform on cells in column Categories using expression grel:value.replace('||', '|')",
+"engineConfig": {
+"mode": "record-based",
+"facets": [
+{
+"mode": "text",
+"caseSensitive": false,
+"query": "||",
+"name": "Categories",
+"type": "text",
+"columnName": "Categories"
+}
+]
+},
+"columnName": "Categories",
+"expression": "grel:value.replace('||', '|')",
+"onError": "keep-original",
+"repeat": false,
+"repeatCount": 10
+},
+{
+"op": "core/text-transform",
+"description": "Text transform on cells in column Categories using expression grel:value.split('|').uniques().join('|')",
+"engineConfig": {
+"mode": "record-based",
+"facets": []
+},
+"columnName": "Categories",
+"expression": "grel:value.split('|').uniques().join('|')",
+"onError": "keep-original",
+"repeat": false,
+"repeatCount": 10
+},
 {
 "op": "core/multivalued-cell-split",
 "description": "Split multi-valued cells in column Categories",
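The two GREL transforms added above first collapse `||` to `|` and then deduplicate the `|`-separated values. Their effect on a single cell can be mimicked in plain shell; the sample value below is made up for illustration:

```shell
# Sample cell value (hypothetical): duplicate separator, duplicate entry.
cell='Numismatics||Medals|Medals'
step1=${cell//||/|}   # grel:value.replace('||', '|')
# grel:value.split('|').uniques().join('|'): split, keep first occurrences, rejoin.
step2=$(tr '|' '\n' <<<"$step1" | awk '!seen[$0]++' | paste -sd'|' -)
echo "$step2"   # prints: Numismatics|Medals
```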
@@ -82,34 +117,6 @@
 "separator": "|",
 "regex": false
 },
-{
-"op": "core/row-removal",
-"description": "Remove rows",
-"engineConfig": {
-"mode": "row-based",
-"facets": [
-{
-"omitError": false,
-"expression": "isBlank(value)",
-"selectBlank": false,
-"invert": false,
-"selectError": false,
-"selection": [
-{
-"v": {
-"v": true,
-"l": "true"
-}
-}
-],
-"name": "Categories",
-"omitBlank": false,
-"type": "list",
-"columnName": "Categories"
-}
-]
-}
-},
 {
 "op": "core/mass-edit",
 "description": "Mass edit cells in column Categories",
@@ -538,28 +545,6 @@
 "description": "Join multi-valued cells in column Categories",
 "columnName": "Categories",
 "keyColumnName": "Record ID",
-"separator": ", "
-},
-{
-"op": "core/text-transform",
-"description": "Text transform on cells in column Categories using expression grel:value.split(\", \").uniques().join(\", \")",
-"engineConfig": {
-"mode": "record-based",
-"facets": []
-},
-"columnName": "Categories",
-"expression": "grel:value.split(\", \").uniques().join(\", \")",
-"onError": "set-to-blank",
-"repeat": false,
-"repeatCount": 10
-},
-{
-"op": "core/multivalued-cell-split",
-"description": "Split multi-valued cells in column Categories",
-"columnName": "Categories",
-"keyColumnName": "Record ID",
-"mode": "separator",
-"separator": ",",
-"regex": false
+"separator": "|"
 }
 ]
|
@@ -1,23 +1,32 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# openrefine-batch-docker.sh, Felix Lohmeier, v1.16, 2021-11-09
# https://github.com/felixlohmeier/openrefine-batch

# check system requirements
DOCKER="$(which docker 2> /dev/null)"
DOCKER="$(command -v docker 2> /dev/null)"
if [ -z "$DOCKER" ] ; then
echo 1>&2 "This action requires you to have 'docker' installed and present in your PATH. You can download it for free at http://www.docker.com/"
exit 1
fi
DOCKERINFO="$(docker info 2>/dev/null | grep 'Server Version')"
if [ -z "$DOCKERINFO" ] ; then
echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands. Try to run the script with 'sudo ./openrefine-batch-docker.sh ...'"
exit 1
if [ -z "$DOCKERINFO" ]
then
echo "command 'docker info' failed, trying again with sudo..."
DOCKERINFO="$(sudo docker info 2>/dev/null | grep 'Server Version')"
echo "OK"
docker=(sudo docker)
if [ -z "$DOCKERINFO" ] ; then
echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands.'"
exit 1
fi
else
docker=(docker)
fi

# help screen
function usage () {
cat <<EOF
Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
Usage: ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...

== basic arguments ==
-a INPUTDIR path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
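This hunk swaps `which docker` for `command -v docker`. A minimal sketch of that dependency-check pattern (the `require` helper name is mine, not from the script): `command -v` is a shell builtin specified by POSIX, so it behaves consistently even on systems where `which` is missing or nonstandard.

```shell
#!/bin/bash
# Check that an external dependency exists before using it.
# `command -v` is a POSIX shell builtin, so it works even where
# `which` is absent or prints nonstandard output.
require() {
  local bin="$1"
  if ! command -v "$bin" > /dev/null 2>&1; then
    echo 1>&2 "This script requires '$bin' in your PATH."
    return 1
  fi
}

require sh && echo "sh found"
```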
@@ -30,36 +39,58 @@ Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUT
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-v VERSION OpenRefine version (2.7, 2.7rc2, 2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7)
-t TEMPLATING several options for templating export, see below...
-v VERSION OpenRefine version (3.5.0, 3.4.1, 3.4, 3.3, 3.2, 3.1, 3.0, 2.8, 2.7, ...; default: 3.5.0)
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen

== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)

== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i projectName=PROJECTNAME (all formats), default: filename
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i includeFileSources=true/false (all formats), default: false
-i trimStrings=true/false (xml, json), default: false
-i storeEmptyStrings=true/false (xml, json), default: true
-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
-i skipDataLines=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i processQuotes=true/false (csv, tsv), default: true
-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1

== example ==
== templating options (alternative exportformat) ==
-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
-t mode=row-based/record-based (engine mode, default: row-based)
-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files)
-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)

sudo ./openrefine-batch-docker.sh \
== examples ==

download example data

wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip

example 1 (input, transform, export to tsv)

./openrefine-batch-docker.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
@@ -68,16 +99,16 @@ sudo ./openrefine-batch-docker.sh \
-i guessCellValueTypes=true \
-RX

clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
example 2 (input, transform, templating export)

./openrefine-batch-docker.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
EOF
exit 1
}

# defaults
ram="2048M"
version="2.7"
version="3.5.0"
restartfile="true"
restarttransform="true"
export="true"
@@ -93,7 +124,7 @@ if [ "$NUMARGS" -eq 0 ]; then
fi

# get user input
options="a:b:c:d:e:f:i:m:p:ERXh"
options="a:b:c:d:e:f:i:m:t:v:ERXh"
while getopts $options opt; do
case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
@@ -104,6 +135,7 @@ while getopts $options opt; do
f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;;
t ) templating+=("--${OPTARG}") ; exportformat="txt" ;;
v ) version=${OPTARG} ;;
E ) export="false" ;;
R ) restarttransform="false" ;;
@@ -114,7 +146,7 @@ while getopts $options opt; do
* ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
esac
done
shift $(($OPTIND - 1))
shift $((OPTIND - 1))

# check for mandatory options
if [ -z "$outputdir" ]; then
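The option string gains `t:` and `v:` in this version, and `shift $(($OPTIND - 1))` loses its redundant `$`. A self-contained sketch of the same getopts pattern (option letters and variable names mirror the script; the sample arguments are illustrative): options that take a value carry a trailing colon, a repeatable option appends to an array, and the final `shift $((OPTIND - 1))` drops everything getopts consumed so only positional arguments remain.

```shell
#!/bin/bash
# Minimal getopts loop in the style of openrefine-batch.sh:
# `m:`/`t:`/`v:` take arguments; -t may be given several times.
parse() {
  local OPTIND opt
  ram="2048M"; version="3.5.0"; templating=()
  while getopts "m:t:v:" opt; do
    case $opt in
      m ) ram=$OPTARG ;;
      t ) templating+=("--${OPTARG}") ;;  # accumulate repeated -t options
      v ) version=$OPTARG ;;
      * ) return 1 ;;
    esac
  done
  shift $((OPTIND - 1))   # drop parsed options, keep positional args
  positional=("$@")
}

parse -m 4G -t prefix='{ "rows" : [ ' -t suffix='] }' input.tsv
echo "$ram ${#templating[@]} ${positional[0]}"
```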
@@ -122,6 +154,11 @@ if [ -z "$outputdir" ]; then
echo 1>&2 "example: ./openrefine-batch-docker.sh -c output/"
exit 1
fi
if [ "$(ls -A "$outputdir" 2>/dev/null)" ];then
echo 1>&2 "path to directory for exported files (and OpenRefine workspace) is not empty"
echo 1>&2 "$outputdir"
exit 1
fi
if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide recordpath in multiple arguments without slashes"
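The new guard refuses to run when the output directory already contains files. A sketch of that `ls -A` emptiness test in isolation (helper name and temp-dir workflow are mine): `ls -A` lists everything except `.` and `..`, so any output at all means the directory is not empty.

```shell
#!/bin/bash
# Refuse to reuse a non-empty working directory, mirroring the
# `ls -A "$outputdir"` guard added in this version.
dir_is_empty() {
  [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

workdir=$(mktemp -d)
dir_is_empty "$workdir" && echo "empty, safe to use"
touch "$workdir/file"
dir_is_empty "$workdir" || echo "not empty, refusing"
rm -rf "$workdir"
```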
@@ -156,6 +193,7 @@ echo "OpenRefine version: $version"
echo "OpenRefine workspace: $outputdir"
echo "Export to workspace: $export"
echo "Export format: $exportformat"
echo "Templating options: ${templating[*]}"
echo "Docker container name: $uuid"
echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform"
@@ -163,38 +201,52 @@ echo ""

# declare additional variables
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Start process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Start process"
memoryload=()

# safe cleanup handler
cleanup()
{
echo "cleanup..."
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
}
trap "cleanup;exit" SIGHUP SIGINT SIGQUIT SIGTERM

# launch server
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
# wait until server is available
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# show server logs
docker attach ${uuid} &
${docker[*]} attach ${uuid} &
echo ""

# import all files
if [ -n "$inputfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# run client with input command
sudo docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
${docker[*]} run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
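The new "safe cleanup handler" registers one `cleanup` function with `trap` so the container is removed both on interrupts and at normal end of script. A Docker-free sketch of the same pattern (the tempfile workload stands in for the container):

```shell
#!/bin/bash
# Register a cleanup handler once and reuse it for both the interrupt
# path and the normal exit path, as the script does with its container.
scratch=$(mktemp)

cleanup() {
  echo "cleanup..."
  rm -f "$scratch"
}
# run cleanup when interrupted or terminated, then exit
trap "cleanup; exit" SIGHUP SIGINT SIGQUIT SIGTERM

echo "working with $scratch"
cleanup   # the normal end of the script calls the same function
```

Putting the shared teardown in a function avoids the earlier duplication, where the same stop/rm/delete block appeared once inline at the end and nowhere on the signal path.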
@@ -202,11 +254,11 @@ if [ -n "$inputfiles" ]; then
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
echo ""
fi
done
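After every restart the script blocks in an `until ... curl ... | grep -q "OpenRefine"` loop until the server answers on port 3333. A sketch of that wait-until-ready idiom that needs neither Docker nor a network (a background job touching a flag file stands in for the server coming up):

```shell
#!/bin/bash
# Poll until a service is ready, in the spirit of the script's
# `until docker run ... curl ... | grep -q "OpenRefine"` loop.
ready_flag=$(mktemp -u)          # path only, file not created yet
( sleep 1; touch "$ready_flag" ) &   # stand-in for the server starting

until [ -e "$ready_flag" ]; do
  sleep 0.2   # short delay between probes
done
echo "server is up"
rm -f "$ready_flag"
wait
```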
@@ -215,18 +267,18 @@ fi
# transform and export files
if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""

# get project ids
echo "get project ids..."
sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
${docker[*]} run --rm --link ${uuid} felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
echo ""

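The rewrite also drops the needless `cat file | cut ...` pipelines in favor of passing the file to `cut` directly. A sketch of the same character-position slicing used on the project listing (the sample line is illustrative, not real `openrefine-client -l` output):

```shell
#!/bin/bash
# Slice a fixed-width listing by character position, as done for the
# project list (ID in columns 2-14, name from column 17 on).
listing=' 1234567890123  my-project.tsv'   # illustrative sample line

id=$(printf '%s\n' "$listing" | cut -c 2-14)
name=$(printf '%s\n' "$listing" | cut -c 17-)
echo "id=$id name=$name"
```

Calling `cut -c 2-14 file` instead of `cat file | cut -c 2-14` saves a process and a pipe without changing the result.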
@@ -237,11 +289,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
# restart server to advertise copied projects
echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
echo ""
fi

@@ -251,16 +303,16 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# apply transformation rules
if [ -n "$jsonfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# run client with apply command
sudo docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
${docker[*]} run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -f ${jsonfile} ${projectids[i]}
# allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
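Throughout the diff, hard-coded `sudo docker` / `docker` calls are replaced by `${docker[*]}`, an array set once during the requirements check. A sketch of that command-in-an-array pattern with a harmless stand-in (here `env` plays the role of `sudo`, since a real `sudo` would prompt):

```shell
#!/bin/bash
# Store a command plus an optional prefix in an array so every call
# site works the same with and without elevation.
if [ "$(id -u)" -eq 0 ]; then
  runner=(echo)        # already root: no prefix needed
else
  runner=(env echo)    # prefix required; `env` stands in for `sudo`
fi

# each array element stays a separate word when expanded
"${runner[@]}" "hello from the runner array"
```

The script expands the array unquoted as `${docker[*]}`, which works for simple words; the quoted `"${runner[@]}"` form shown here is the more robust expansion when elements could contain spaces.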
@@ -268,11 +320,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# restart server to clear memory
if [ "$restarttransform" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
fi
echo ""
done
@@ -281,17 +333,17 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# export project to workspace
if [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
# get filename without extension
filename=${projectnames[i]%.*}
echo "export to file $(unknown).${exportformat}..."
# run client with export command
sudo docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="$(unknown).${exportformat}" ${projectids[i]}
${docker[*]} run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -E --output="$(unknown).${exportformat}" "${templating[@]}" ${projectids[i]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
@@ -301,11 +353,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
fi
echo ""

@@ -319,32 +371,25 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
fi
fi

# cleanup
echo "cleanup..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
# run cleanup function
cleanup
echo ""

# calculate and print checkpoints
echo "=== Statistics ==="
echo ""
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="End process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="End process"
echo "starting time and run time of each step:"
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointdate[$((checkpoints + 1))]=$(date +%s)
for i in $(seq 1 $checkpoints); do
diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
diffsec="$((${checkpointdate[$((i + 1))]} - ${checkpointdate[$i]}))"
printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
done
echo ""
diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
diffsec="$((checkpointdate[$checkpoints] - checkpointdate[1]))"
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"

# calculate and print memory load
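The statistics section subtracts epoch timestamps and formats the difference with GNU `date -u`. A sketch of that duration trick in isolation (the 3725-second step is made up): interpreting the difference as seconds since the epoch in UTC yields an HH:MM:SS string for durations under 24 hours.

```shell
#!/bin/bash
# Compute a step duration from two epoch timestamps and format it as
# HH:MM:SS with GNU date, as the statistics section does.
start=$(date +%s)
end=$((start + 3725))   # pretend the step took 1h 2m 5s

diffsec=$((end - start))
echo "run time: $(date -d@${diffsec} -u +%H:%M:%S)"
```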
@@ -352,4 +397,4 @@ max=${memoryload[0]}
for n in "${memoryload[@]}" ; do
((n > max)) && max=$n
done
echo "highest memory load: $(($max / 1024)) MB"
echo "highest memory load: $((max / 1024)) MB"
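The memory statistic scans the sampled RSS values for their maximum with an arithmetic conditional. The same loop, runnable on its own (the sample values are made up; real ones come from `ps -o rss`):

```shell
#!/bin/bash
# Find the maximum of a numeric bash array, as the script does for the
# sampled Java RSS values (kilobytes from `ps -o rss`).
memoryload=(204800 512000 307200)   # illustrative samples

max=${memoryload[0]}
for n in "${memoryload[@]}"; do
  ((n > max)) && max=$n   # arithmetic comparison, no `test` needed
done
echo "highest memory load: $((max / 1024)) MB"
```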
@@ -1,10 +1,10 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# openrefine-batch.sh, Felix Lohmeier, v1.16, 2021-11-09
# https://github.com/felixlohmeier/openrefine-batch

# declare download URLs for OpenRefine and OpenRefine client
openrefine_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-linux-2017-10-26.tar.gz"
client_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-client_0-3-1_linux-64bit"
openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/3.5.0/openrefine-linux-3.5.0.tar.gz"
client_URL="https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux"

# check system requirements
JAVA="$(which java 2> /dev/null)"
@@ -34,7 +34,7 @@ if [ ! -d "openrefine-client" ]; then
echo "Download OpenRefine client..."
mkdir -p openrefine-client
wget -q -P openrefine-client $wget_opt $client_URL
chmod +x openrefine-client/openrefine-client_0-3-1_linux-64bit
chmod +x openrefine-client/openrefine-client_0-3-10_linux
echo ""
fi

@@ -55,33 +55,55 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-p PORT PORT on which OpenRefine should listen (default: 3333)
-t TEMPLATING several options for templating export, see below...
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen

== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)

== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i projectName=PROJECTNAME (all formats), default: filename
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i includeFileSources=true/false (all formats), default: false
-i trimStrings=true/false (xml, json), default: false
-i storeEmptyStrings=true/false (xml, json), default: true
-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
-i skipDataLines=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i processQuotes=true/false (csv, tsv), default: true
-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1

== example ==
== templating options (alternative exportformat) ==
-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
-t mode=row-based/record-based (engine mode, default: row-based)
-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files)
-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)

== examples ==

download example data

wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip

example 1 (input, transform, export to tsv)

./openrefine-batch.sh \
-a examples/powerhouse-museum/input/ \
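The templating options combine as: prefix, then the templated rows joined by rowSeparator, then suffix. A pure-bash sketch of that assembly using the prefix/separator/suffix values from example 2 (the row strings are illustrative, not real export output):

```shell
#!/bin/bash
# Assemble a templating export the way prefix/rowSeparator/suffix
# combine around templated rows in OpenRefine's export/templating menu.
prefix='{ "rows" : [ '
rowSeparator=','
suffix=' ] }'
rows=('{"id":1}' '{"id":2}')   # illustrative templated rows

out="$prefix"
for i in "${!rows[@]}"; do
  (( i > 0 )) && out+="$rowSeparator"   # separator only between rows
  out+="${rows[$i]}"
done
out+="$suffix"
echo "$out"
```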
@@ -92,9 +114,9 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
-i guessCellValueTypes=true \
-RX

clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
example 2 (input, transform, templating export)

./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
EOF
exit 1
}
@@ -118,7 +140,7 @@ if [ "$NUMARGS" -eq 0 ]; then
 fi
 
 # get user input
-options="a:b:c:d:e:f:i:m:p:ERXh"
+options="a:b:c:d:e:f:i:m:p:t:ERXh"
 while getopts $options opt; do
 case $opt in
 a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
@@ -130,6 +152,7 @@ while getopts $options opt; do
 i ) inputoptions+=("--${OPTARG}") ;;
 m ) ram=${OPTARG} ;;
 p ) port=${OPTARG} ;;
+t ) templating+=("--${OPTARG}") ; exportformat="txt" ;;
 E ) export="false" ;;
 R ) restarttransform="false" ;;
 X ) restartfile="false" ;;
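The new `t` case follows the same pattern the script already uses for `-i`: each repeated flag appends one `--KEY=VALUE` word to a bash array, which is later expanded with a quoted `"${templating[@]}"` so values containing spaces survive word splitting. A minimal standalone sketch of that pattern (the helper name `parse_templating` is made up for the demo):

```shell
#!/bin/bash
# Collect every repeated -t KEY=VALUE flag into one array element each,
# prefixed with "--" the way the script builds client arguments.
parse_templating() {
  templating=()
  local OPTIND opt
  while getopts "t:" opt; do
    case $opt in
      t ) templating+=("--${OPTARG}") ;;
    esac
  done
}

parse_templating -t rowSeparator=',' -t 'prefix={ "rows" : [ '
# Quoted expansion keeps each option as a single word, spaces and all.
printf '%s\n' "${templating[@]}"
```

Passing the array unquoted (`${templating[@]}`) would split `prefix={ "rows" : [ ` into several words, which is exactly the bug the quoting avoids.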
@@ -139,7 +162,7 @@ while getopts $options opt; do
 * ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
 esac
 done
-shift $(($OPTIND - 1))
+shift $((OPTIND - 1))
 
 # check for mandatory options
 if [ -z "$outputdir" ]; then
@@ -147,6 +170,11 @@ if [ -z "$outputdir" ]; then
 echo 1>&2 "example: ./openrefine-batch.sh -c output/"
 exit 1
 fi
+if [ "$(ls -A "$outputdir" 2>/dev/null)" ];then
+echo 1>&2 "path to directory for exported files (and OpenRefine workspace) is not empty"
+echo 1>&2 "$outputdir"
+exit 1
+fi
 if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
 echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
 echo 1>&2 "please provide recordpath in multiple arguments without slashes"
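The added guard tests emptiness via command substitution: `ls -A` lists all entries except `.` and `..`, so any non-empty output (dotfiles included) means the directory has content. A standalone sketch of the same test, wrapped in a hypothetical helper and exercised against a temporary directory:

```shell
#!/bin/bash
# Succeeds (exit 0) if the directory contains any entry; `ls -A` lists
# hidden files too, so a directory holding only dotfiles counts as non-empty.
dir_not_empty() {
  [ "$(ls -A "$1" 2>/dev/null)" ]
}

tmp=$(mktemp -d)
dir_not_empty "$tmp" || echo "empty"
touch "$tmp/.hidden"
dir_not_empty "$tmp" && echo "has content"
rm -r "$tmp"
```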
@@ -180,23 +208,38 @@ echo "OpenRefine port: $port"
 echo "OpenRefine workspace: $outputdir"
 echo "Export to workspace: $export"
 echo "Export format: $exportformat"
+echo "Templating options: ${templating[*]}"
 echo "restart after file: $restartfile"
 echo "restart after transform: $restarttransform"
 echo ""
 
 # declare additional variables
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Start process"
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Start process"
 memoryload=()
 
+# safe cleanup handler
+cleanup()
+{
+echo "cleanup..."
+kill ${pid}
+wait
+rm -r -f "${outputdir:?}"/workspace*.json
+# delete duplicates from copied projects
+if [ -n "$crossprojects" ]; then
+for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
+fi
+}
+trap "cleanup;exit" SIGHUP SIGINT SIGQUIT SIGTERM
+
 # launch server
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
-echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Launch OpenRefine"
+echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
 echo ""
-echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
+echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
 pid=$!
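Moving the teardown steps into a `cleanup` function and registering it with `trap` is the key robustness change here: the background OpenRefine server gets killed and the workspace files removed even when the script is interrupted, not only on the happy path. A reduced sketch of the pattern, with a `sleep` standing in for the server process:

```shell
#!/bin/bash
# Start a long-running background job, then guarantee it is reaped on exit.
sleep 300 &
pid=$!

cleanup()
{
  echo "cleanup..."
  kill ${pid} 2>/dev/null
  wait ${pid} 2>/dev/null
}
# On interrupt/termination signals, run the same teardown and exit.
trap "cleanup;exit" SIGHUP SIGINT SIGQUIT SIGTERM

# ... main work would happen here ...
cleanup   # the normal path calls the identical function explicitly
```

Because both the signal handler and the normal exit path call one function, the teardown logic can no longer drift out of sync, which is exactly what the later hunk in this diff exploits.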
@@ -207,16 +250,16 @@ echo ""
 # import all files
 if [ -n "$inputfiles" ]; then
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Import all files"
-echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Import all files"
+echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
 echo ""
-echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
+echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 for inputfile in "${inputfiles[@]}" ; do
 echo "import ${inputfile}..."
 # run client with input command
-openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
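After each import the script samples the server's resident set size: `ps --no-headers -o rss -p $pid` prints just that one field in KiB, and the value is appended to an array so the peak can be reported in the statistics section. Standalone, the sampling step looks roughly like this (monitoring the demo shell itself instead of OpenRefine; `--no-headers` is GNU/procps syntax, as in the script):

```shell
#!/bin/bash
# Sample the resident set size (KiB) of a process a few times.
pid=$$            # monitor this shell itself for the demo
memoryload=()
for i in 1 2 3; do
  memoryload+=($(ps --no-headers -o rss -p ${pid}))
done
echo "rss samples (KiB): ${memoryload[*]}"
```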
@@ -238,18 +281,18 @@ fi
 # transform and export files
 if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
-echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Prepare transform & export"
+echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
 echo ""
-echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
+echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 
 # get project ids
 echo "get project ids..."
-openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -l > "${outputdir}/projects.tmp"
-projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
-projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -l > "${outputdir}/projects.tmp"
+projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
+projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
 cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
 echo ""
 
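Besides the client update, this hunk drops two useless uses of `cat`: `cut` can read the file directly. The `-c` ranges slice each line by character position, which works because the client prints project ids in a fixed-width column. A small sketch with a fabricated listing (the sample id and exact column layout are illustrative assumptions, not taken from real client output):

```shell
#!/bin/bash
# Fabricated stand-in for a fixed-width project listing:
# col 1 is a space, cols 2-14 hold the id, the name starts at col 17.
printf ' 1234567890123: example.tsv\n' > projects.tmp
projectids=($(cut -c 2-14 projects.tmp))    # cut reads the file itself; no cat needed
projectnames=($(cut -c 17- projects.tmp))
echo "${projectids[0]} -> ${projectnames[0]}"
rm projects.tmp
```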
@@ -275,16 +318,16 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 # apply transformation rules
 if [ -n "$jsonfiles" ]; then
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
-echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Transform ${projectnames[i]}"
+echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
 echo ""
-echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
+echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 for jsonfile in "${jsonfiles[@]}" ; do
 echo "transform ${jsonfile}..."
 # run client with apply command
-openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
 # allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
@@ -306,17 +349,17 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 # export project to workspace
 if [ "$export" = "true" ]; then
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
-echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="Export ${projectnames[i]}"
+echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
 echo ""
-echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
+echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 # get filename without extension
 filename=${projectnames[i]%.*}
 echo "export to file ${filename}.${exportformat}..."
 # run client with export command
-openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" ${projectids[i]}
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
@@ -345,32 +388,25 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 fi
 fi
 
-# cleanup
-echo "cleanup..."
-kill ${pid}
-wait
-rm -r -f "${outputdir:?}"/workspace*.json
-# delete duplicates from copied projects
-if [ -n "$crossprojects" ]; then
-for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
-fi
+# run cleanup function
+cleanup
 echo ""
 
 # calculate and print checkpoints
 echo "=== Statistics ==="
 echo ""
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-checkpointname[$(($checkpoints + 1))]="End process"
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
+checkpointname[$((checkpoints + 1))]="End process"
 echo "starting time and run time of each step:"
 checkpoints=${#checkpointdate[@]}
-checkpointdate[$(($checkpoints + 1))]=$(date +%s)
+checkpointdate[$((checkpoints + 1))]=$(date +%s)
 for i in $(seq 1 $checkpoints); do
-diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
+diffsec="$((${checkpointdate[$((i + 1))]} - ${checkpointdate[$i]}))"
 printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
 done
 echo ""
-diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
+diffsec="$((checkpointdate[$checkpoints] - checkpointdate[1]))"
 echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
 
 # calculate and print memory load
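The statistics section works from two parallel arrays filled at every stage: `checkpointdate` holds epoch seconds from `date +%s`, `checkpointname` a label. The run time of a step is the difference of consecutive timestamps, rendered as HH:MM:SS by feeding the seconds back through GNU `date -u -d@SECONDS` (which treats them as an offset from the epoch, so it only works for durations under 24 hours). A reduced sketch of the pattern:

```shell
#!/bin/bash
# Two parallel arrays: epoch seconds and a label per checkpoint.
checkpointdate=($(date +%s))
checkpointname=("Start process")
sleep 1                       # stand-in for the actual work
checkpointdate+=($(date +%s))
checkpointname+=("End process")

# Elapsed seconds between consecutive checkpoints, formatted as HH:MM:SS.
diffsec=$((checkpointdate[1] - checkpointdate[0]))
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
```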
@@ -378,4 +414,4 @@ max=${memoryload[0]}
 for n in "${memoryload[@]}" ; do
 ((n > max)) && max=$n
 done
-echo "highest memory load: $(($max / 1024)) MB"
+echo "highest memory load: $((max / 1024)) MB"
Binary file not shown.
Binary file not shown.