Compare commits: v1.7 ... master

23 Commits

Author SHA1 Message Date
Felix Lohmeier 2cc2378085 Merge pull request #6 from opencultureconsulting/dependabot/pip/binder/jupyter-server-proxy-3.2.1 (Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder) 2022-01-28 17:23:34 +01:00
dependabot[bot] 9e6e42261b
Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder
Bumps [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) from 1.5.3 to 3.2.1.
- [Release notes](https://github.com/jupyterhub/jupyter-server-proxy/releases)
- [Changelog](https://github.com/jupyterhub/jupyter-server-proxy/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterhub/jupyter-server-proxy/compare/v1.5.3...v3.2.1)

---
updated-dependencies:
- dependency-name: jupyter-server-proxy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-01-27 16:25:09 +00:00
Felix Lohmeier 4e32074d85 OpenRefine 3.5.0 2021-11-09 23:14:30 +01:00
Felix Lohmeier a9c494856b cleanup README 2021-06-17 13:00:33 +02:00
Felix Lohmeier ca19d7ef16 add jupyter notebook 2021-06-17 12:59:47 +02:00
Felix Lohmeier 93be203efe add binder config 2021-06-17 12:42:22 +02:00
Felix Lohmeier 2894b0194f fix codacy badge 2021-02-12 13:38:33 +01:00
Felix Lohmeier 4199fadc04 OpenRefine 3.4.1, openrefine-client 0.3.10 2021-01-04 17:37:49 +01:00
Felix Lohmeier 80fb37cb65 release v1.14 2020-08-08 13:45:28 +02:00
Felix Lohmeier 68dbc04c01 update openrefine-client to v0.3.9 2020-08-08 13:43:00 +02:00
Felix Lohmeier b259cf571c release v1.13: improved use of sudo in docker version, pinned version of openrefine-client, improved README 2019-08-06 21:21:59 +02:00
Felix Lohmeier f6c8ee9d98 update sample log in README.md 2019-07-29 23:37:39 +02:00
Felix Lohmeier b8114260ec Update to OpenRefine 3.2 2019-07-29 23:29:12 +02:00
Felix Lohmeier d60f732244 release v1.11, added support for templating export 2017-12-11 21:57:48 +01:00
Felix Lohmeier 52fff4281b release v1.10, safe cleanup handler (2) issue #4 2017-11-07 21:24:14 +01:00
Felix Lohmeier 978344a55c release v1.10, safe cleanup handler issue #4 2017-11-05 18:09:41 +01:00
Felix Lohmeier 9471abc35f release v1.9, fix codacy issues 2017-11-05 16:40:01 +01:00
Felix Lohmeier 566cd372af release v1.9, fix codacy issues 2017-11-05 16:25:42 +01:00
Felix Lohmeier 0d1150c5fb fix docker usage command 2017-11-05 16:24:06 +01:00
Felix Lohmeier 38ac46b450 updated example transform history 2017-11-02 23:20:09 +01:00
Felix Lohmeier 197285194a ignore downloaded libraries 2017-10-28 12:16:21 +02:00
Felix Lohmeier c8e57230ca release v1.8, updated OpenRefine version (dev snapshot 2017-10-28) 2017-10-28 12:09:25 +02:00
Felix Lohmeier 2af30448dc provide precompiled sources of OpenRefine dev version 2017-10-28 10:36:54 +02:00
11 changed files with 452 additions and 286 deletions

4
.gitignore vendored

@ -1,6 +1,6 @@
# downloaded program libraries
# ignore downloaded program libraries
openrefine
openrefine-client
# examples output directories
# ignore output directories of examples
examples/powerhouse-museum/output

241
README.md

@ -1,6 +1,6 @@
## OpenRefine batch processing (openrefine-batch.sh)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/66bf001c38194f5bb722f65f5e15f0ec)](https://www.codacy.com/app/mail_74/openrefine-batch?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-batch&utm_campaign=badger)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/ad8a97e42e634bbe87203ea48efb436e)](https://www.codacy.com/gh/opencultureconsulting/openrefine-batch/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)
Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
@ -17,9 +17,21 @@ If you prefer a containerized approach, see a [variation of this script for Dock
- **Step 1**: Experiment with your data (or parts of it) in the graphical user interface of OpenRefine. Once you are happy with all transformation rules, [extract the JSON code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. transform.json).
- **Step 2**: Put your data and the JSON file(s) in two different directories and execute the script. The script will automatically import all data files into OpenRefine projects, apply the transformation rules in the JSON files to each project and export all projects to files in the specified format (default: TSV, tab-separated values).
### Demo via binder
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)
- free-to-use, on-demand server with JupyterLab and Bash kernel
- no registration needed; the server starts within a few minutes
- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM; the server is deleted after 10 minutes of inactivity
### Install
Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
Download the script and grant file permissions to execute:
```
wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh
chmod +x openrefine-batch.sh
```
That's all. The script will automatically download copies of OpenRefine and the python client on first run and will tell you if something (python, java) is missing.
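The prerequisite check mentioned above can be sketched as follows. This is a minimal illustration, not the code used by openrefine-batch.sh: the function name `missing_tools` is hypothetical, and the assumption that the required commands are called exactly `java` and `python` is taken from the sentence above, not from the script.

```shell
#!/bin/bash
# Hypothetical helper (not part of openrefine-batch.sh): print the names of
# all given commands that cannot be found in PATH, separated by spaces.
missing_tools() {
  local tool
  local -a missing=()
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || missing+=("$tool")
  done
  # Prints an empty line if nothing is missing.
  echo "${missing[*]}"
}

# Assumption: the two runtimes named in the README are on PATH as java and python.
missing="$(missing_tools java python)"
if [ -n "$missing" ]; then
  echo "missing: $missing" 1>&2
fi
```

Running a check like this before the first download makes it easier to tell a missing runtime apart from a network problem.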
@ -39,7 +51,7 @@ cp CONFIGFILES config/
* you may use hard links instead of cp: `ln INPUTFILE input/`
**CONFIGFILES**
* JSON files with [OpenRefine transformation rules)](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
**OUTPUT/**
* path to directory where results and temporary data should be stored
@ -50,6 +62,17 @@ cp CONFIGFILES config/
[Example Powerhouse Museum](examples/powerhouse-museum)
download example data
```
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip
```
execute openrefine-batch.sh
```
./openrefine-batch.sh \
-a examples/powerhouse-museum/input/ \
@ -61,12 +84,10 @@ cp CONFIGFILES config/
-RX
```
clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
### Help Screen
```
[14:45 felix ~/openrefine-batch]$ ./openrefine-batch.sh
[felix@tux openrefine-batch]$ ./openrefine-batch.sh
Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== basic arguments ==
@ -81,38 +102,61 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-p PORT PORT on which OpenRefine should listen (default: 3333)
-t TEMPLATING several options for templating export, see below...
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i projectName=PROJECTNAME (all formats), default: filename
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i includeFileSources=true/false (all formats), default: false
-i trimStrings=true/false (xml, json), default: false
-i storeEmptyStrings=true/false (xml, json), default: true
-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i processQuotes=true/false (csv, tsv), default: true
-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
== templating options (alternative exportformat) ==
-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
-t mode=row-based/record-based (engine mode, default: row-based)
-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files)
-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)
== examples ==
download example data
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip
example 1 (input, transform, export to tsv)
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
example 2 (input, transform, templating export)
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
```
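The `recordPath` rule from the help screen (one `-i recordPath=...` argument per path segment, without slashes) can be sketched as a small helper. `record_path_args` is a hypothetical name and not part of openrefine-batch.sh. The sketch only builds the argument list and does not call the script.

```shell
#!/bin/bash
# Hypothetical helper: turn a slash-separated record path such as
# /collection/record/ into repeated "-i recordPath=SEGMENT" arguments.
record_path_args() {
  local path="$1" part
  local -a parts args=()
  # Split on "/"; a leading slash yields an empty first field, skipped below.
  IFS='/' read -r -a parts <<< "$path"
  for part in "${parts[@]}"; do
    if [ -n "$part" ]; then
      args+=(-i "recordPath=$part")
    fi
  done
  printf '%s\n' "${args[*]}"
}

record_path_args /collection/record/
# prints: -i recordPath=collection -i recordPath=record
```

Assuming segments without spaces, the output can be appended to a call, e.g. `./openrefine-batch.sh -a input/ -b config/ -c OUTPUT/ -f xml $(record_path_args /collection/record/)`.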
### Logging
@ -120,107 +164,127 @@ https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
```
[00:41 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
[felix@tux openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
Download OpenRefine...
openrefine-linux-2017-10-26.tar.gz 100%[=====================================================================================================================>] 66,34M 5,62MB/s in 12s
openrefine-linux-3.5.0.tar.gz 100%[=========================================================================================================================================>] 125,73M 9,50MB/s in 13s
Install OpenRefine in subdirectory openrefine...
Total bytes read: 79861760 (77MiB, 128MiB/s)
Total bytes read: 154163200 (148MiB, 87MiB/s)
Download OpenRefine client...
openrefine-client_0-3-1_linux-64bit 100%[=====================================================================================================================>] 5,39M 5,08MB/s in 1,1s
openrefine-client_0-3-10_linux 100%[=========================================================================================================================================>] 4,25M 9,17MB/s in 0,5s
Input directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/input
Input directory: /home/felix/git/openrefine-batch/examples/powerhouse-museum/input
Input files: phm-collection.tsv
Input format: --format=tsv
Input options: --processQuotes=false --guessCellValueTypes=true
Config directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/config
Config directory: /home/felix/git/openrefine-batch/examples/powerhouse-museum/config
Transformation rules: phm-transform.json
Cross directory: /dev/null
Cross projects:
OpenRefine heap space: 2048M
OpenRefine port: 3333
OpenRefine workspace: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output
OpenRefine workspace: /home/felix/git/openrefine-batch/examples/powerhouse-museum/output
Export to workspace: true
Export format: tsv
Templating options:
restart after file: false
restart after transform: false
=== 1. Launch OpenRefine ===
starting time: Sa 28. Okt 00:42:33 CEST 2017
starting time: Di 9. Nov 22:37:25 CET 2021
Using refine.ini for configuration
You have 15913M of free memory.
Your current configuration is set to use 2048M of memory.
OpenRefine can run better when given more memory. Read our FAQ on how to allocate more memory here:
https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Allocate-More-Memory
/usr/bin/java -cp server/classes:server/target/lib/* -Drefine.headless=true -Xms2048M -Xmx2048M -Drefine.memory=2048M -Drefine.max_form_content_size=1048576 -Drefine.verbosity=info -Dpython.path=main/webapp/WEB-INF/lib/jython -Dpython.cachedir=/home/felix/.local/share/google/refine/cachedir -Drefine.data_dir=/home/felix/git/openrefine-batch/examples/powerhouse-museum/output -Drefine.webapp=main/webapp -Drefine.port=3333 -Drefine.interface=127.0.0.1 -Drefine.host=127.0.0.1 -Drefine.autosave=1440 com.google.refine.Refine
Starting OpenRefine at 'http://127.0.0.1:3333/'
00:42:33.199 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
00:42:33.200 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
00:42:33.206 [ refine_server] Initializing context: '/' from '/home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/openrefine/webapp' (6ms)
00:42:33.418 [ refine] Starting OpenRefine 2017-10-26 [TRUNK]... (212ms)
00:42:33.424 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
00:42:35.993 [ refine] Running in headless mode (2569ms)
log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/webapp/WEB-INF/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/server/target/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22:37:28.211 [ refine] Starting OpenRefine 3.5.0 [d4209a2]... (0ms)
22:37:28.213 [ refine] initializing FileProjectManager with dir (2ms)
22:37:28.213 [ refine] /home/felix/git/openrefine-batch/examples/powerhouse-museum/output (0ms)
22:37:28.223 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (10ms)
=== 2. Import all files ===
starting time: Sa 28. Okt 00:42:36 CEST 2017
starting time: Di 9. Nov 22:37:33 CET 2021
import phm-collection.tsv...
00:42:36.393 [ refine] POST /command/core/create-project-from-upload (400ms)
New project: 1721413008439
00:42:40.731 [ refine] GET /command/core/get-rows (4338ms)
Number of rows: 75814
22:37:33.804 [ refine] GET /command/core/get-csrf-token (5581ms)
22:37:33.872 [ refine] POST /command/core/create-project-from-upload (68ms)
22:37:44.653 [ refine] GET /command/core/get-models (10781ms)
22:37:44.790 [ refine] POST /command/core/get-rows (137ms)
id: 2252508879578
rows: 75814
STARTED ELAPSED %MEM %CPU RSS
00:42:32 00:07 5.7 220 937692
22:37:25 00:19 10.2 202 1670620
=== 3. Prepare transform & export ===
starting time: Sa 28. Okt 00:42:40 CEST 2017
starting time: Di 9. Nov 22:37:44 CET 2021
get project ids...
00:42:40.866 [ refine] GET /command/core/get-all-project-metadata (135ms)
1721413008439: phm-collection.tsv
22:37:45.112 [ refine] GET /command/core/get-csrf-token (322ms)
22:37:45.115 [ refine] GET /command/core/get-all-project-metadata (3ms)
2252508879578: phm-collection
=== 4. Transform phm-collection.tsv ===
=== 4. Transform phm-collection ===
starting time: Sa 28. Okt 00:42:40 CEST 2017
starting time: Di 9. Nov 22:37:45 CET 2021
transform phm-transform.json...
00:42:40.963 [ refine] GET /command/core/get-models (97ms)
00:42:40.967 [ refine] POST /command/core/apply-operations (4ms)
22:37:45.303 [ refine] GET /command/core/get-csrf-token (188ms)
22:37:45.308 [ refine] GET /command/core/get-models (5ms)
22:37:45.324 [ refine] POST /command/core/apply-operations (16ms)
File /home/felix/git/openrefine-batch/examples/powerhouse-museum/config/phm-transform.json has been successfully applied to project 2252508879578
STARTED ELAPSED %MEM %CPU RSS
00:42:32 00:29 7.1 142 1162720
22:37:25 00:34 11.9 175 1940600
=== 5. Export phm-collection.tsv ===
=== 5. Export phm-collection ===
starting time: Sa 28. Okt 00:43:02 CEST 2017
starting time: Di 9. Nov 22:37:59 CET 2021
export to file phm-collection.tsv...
00:43:02.555 [ refine] GET /command/core/get-models (21588ms)
00:43:02.557 [ refine] GET /command/core/get-all-project-metadata (2ms)
00:43:02.561 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
22:37:59.944 [ refine] GET /command/core/get-csrf-token (14620ms)
22:37:59.947 [ refine] GET /command/core/get-models (3ms)
22:37:59.951 [ refine] GET /command/core/get-all-project-metadata (4ms)
22:37:59.954 [ refine] POST /command/core/export-rows/phm-collection.tsv (3ms)
Export to file /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv complete
STARTED ELAPSED %MEM %CPU RSS
00:42:32 00:53 7.1 81.1 1164684
22:37:25 00:38 12.4 181 2021388
output (number of lines / size in bytes):
167017 60619468 /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
75728 59431272 /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
cleanup...
00:43:26.161 [ ProjectManager] Saving all modified projects ... (23600ms)
00:43:29.586 [ project_utilities] Saved project '1721413008439' (3425ms)
22:38:06.850 [ ProjectManager] Saving all modified projects ... (6896ms)
22:38:10.014 [ project_utilities] Saved project '2252508879578' (3164ms)
=== Statistics ===
starting time and run time of each step:
Start process Sa 28. Okt 00:42:33 CEST 2017 (00:00:00)
Launch OpenRefine Sa 28. Okt 00:42:33 CEST 2017 (00:00:03)
Import all files Sa 28. Okt 00:42:36 CEST 2017 (00:00:04)
Prepare transform & export Sa 28. Okt 00:42:40 CEST 2017 (00:00:00)
Transform phm-collection.tsv Sa 28. Okt 00:42:40 CEST 2017 (00:00:22)
Export phm-collection.tsv Sa 28. Okt 00:43:02 CEST 2017 (00:00:28)
End process Sa 28. Okt 00:43:30 CEST 2017 (00:00:00)
Start process Di 9. Nov 22:37:25 CET 2021 (00:00:00)
Launch OpenRefine Di 9. Nov 22:37:25 CET 2021 (00:00:08)
Import all files Di 9. Nov 22:37:33 CET 2021 (00:00:11)
Prepare transform & export Di 9. Nov 22:37:44 CET 2021 (00:00:01)
Transform phm-collection Di 9. Nov 22:37:45 CET 2021 (00:00:14)
Export phm-collection Di 9. Nov 22:37:59 CET 2021 (00:00:11)
End process Di 9. Nov 22:38:10 CET 2021 (00:00:00)
total run time: 00:00:57 (hh:mm:ss)
highest memory load: 1137 MB
total run time: 00:00:45 (hh:mm:ss)
highest memory load: 1974 MB
```
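If the log is saved to a file, the statistics at the end can be read back for comparisons between runs. The following is a sketch under assumptions: the helper names `total_run_time` and `to_seconds` are hypothetical, and the pattern assumes the `total run time` line looks exactly like the one in the sample above.

```shell
#!/bin/bash
# Hypothetical helpers (not part of openrefine-batch.sh) for reading a saved log.

# Print the hh:mm:ss value from the "total run time" line of the log file $1.
total_run_time() {
  sed -n 's/^total run time: \([0-9:]*\) (hh:mm:ss)$/\1/p' "$1"
}

# Convert hh:mm:ss to seconds; 10# forces base 10 so that "08" is not
# interpreted as an invalid octal number.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

A log file could be captured with `tee`, for example by appending `2>&1 | tee run.log` to the call shown in the sample, and then inspected with `to_seconds "$(total_run_time run.log)"`.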
### Docker
@ -229,8 +293,14 @@ A variation of the shell script orchestrates a [docker container for OpenRefine]
**Install**
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh && chmod +x openrefine-batch-docker.sh`
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
* **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot)
* or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
2. Download the script and grant file permissions to execute:
```
wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh
chmod +x openrefine-batch-docker.sh
```
**Usage**
@ -239,15 +309,36 @@ mkdir input
cp INPUTFILES input/
mkdir config
cp CONFIGFILES config/
sudo ./openrefine-batch-docker.sh input/ config/ OUTPUT/
./openrefine-batch-docker.sh -a input/ -b config/ -c OUTPUT/
```
Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
The script may ask you for sudo privileges. Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
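Whether `sudo` will be needed can be estimated up front by checking group membership. This is only a sketch: `user_in_group` is a hypothetical helper, the group name `docker` is the one suggested by the linked post-installation steps, and a newly added group usually takes effect only after logging in again.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the current user belongs to the group named $1.
user_in_group() {
  # id -nG prints the names of all groups of the current user on one line.
  id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

if user_in_group docker; then
  echo "member of the docker group: sudo should not be needed"
else
  echo "not in the docker group: the script may ask for sudo"
fi
```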
### Todo
**Example**
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
- [ ] provide more example data from other OpenRefine tutorials
[Example Powerhouse Museum](examples/powerhouse-museum)
download example data
```
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip
```
execute openrefine-batch-docker.sh
```
./openrefine-batch-docker.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
-f tsv \
-i processQuotes=false \
-i guessCellValueTypes=true \
-RX
```
### Licensing

1
binder/apt.txt Normal file

@ -0,0 +1 @@
openjdk-8-jre

5
binder/postBuild Executable file

@ -0,0 +1,5 @@
#!/bin/bash
set -e
# Install bash_kernel https://github.com/takluyver/bash_kernel
python -m bash_kernel.install

2
binder/requirements.txt Normal file

@ -0,0 +1,2 @@
jupyter-server-proxy==3.2.1
bash_kernel==0.7.2

1
demo.ipynb Normal file

@ -0,0 +1 @@
{"metadata":{"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"},"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"}},"nbformat_minor":5,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Example Powerhouse Museum\n\nOutput will be stored in examples/powerhouse-museum/output/phm-collection.tsv","metadata":{}},{"cell_type":"code","source":"./openrefine-batch.sh \\\n-a examples/powerhouse-museum/input/ \\\n-b examples/powerhouse-museum/config/ \\\n-c examples/powerhouse-museum/output/ \\\n-f tsv \\\n-i processQuotes=false \\\n-i guessCellValueTypes=true \\\n-RX","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}

View File

@ -73,6 +73,41 @@
]
}
},
{
"op": "core/text-transform",
"description": "Text transform on cells in column Categories using expression grel:value.replace('||', '|')",
"engineConfig": {
"mode": "record-based",
"facets": [
{
"mode": "text",
"caseSensitive": false,
"query": "||",
"name": "Categories",
"type": "text",
"columnName": "Categories"
}
]
},
"columnName": "Categories",
"expression": "grel:value.replace('||', '|')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10
},
{
"op": "core/text-transform",
"description": "Text transform on cells in column Categories using expression grel:value.split('|').uniques().join('|')",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "grel:value.split('|').uniques().join('|')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10
},
{
"op": "core/multivalued-cell-split",
"description": "Split multi-valued cells in column Categories",
@ -82,34 +117,6 @@
"separator": "|",
"regex": false
},
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"invert": false,
"selectError": false,
"selection": [
{
"v": {
"v": true,
"l": "true"
}
}
],
"name": "Categories",
"omitBlank": false,
"type": "list",
"columnName": "Categories"
}
]
}
},
{
"op": "core/mass-edit",
"description": "Mass edit cells in column Categories",
@ -538,28 +545,6 @@
"description": "Join multi-valued cells in column Categories",
"columnName": "Categories",
"keyColumnName": "Record ID",
"separator": ", "
},
{
"op": "core/text-transform",
"description": "Text transform on cells in column Categories using expression grel:value.split(\", \").uniques().join(\", \")",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "grel:value.split(\", \").uniques().join(\", \")",
"onError": "set-to-blank",
"repeat": false,
"repeatCount": 10
},
{
"op": "core/multivalued-cell-split",
"description": "Split multi-valued cells in column Categories",
"columnName": "Categories",
"keyColumnName": "Record ID",
"mode": "separator",
"separator": ",",
"regex": false
"separator": "|"
}
]
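The two text transforms added in this change (`value.replace('||', '|')` followed by `value.split('|').uniques().join('|')`) can be approximated outside OpenRefine for quick checks on single values. This is a shell sketch, not GREL: the function names are hypothetical, and edge cases such as empty tokens may behave differently than in OpenRefine.

```shell
#!/bin/bash
# Hypothetical shell approximations of the two GREL expressions in the hunk above.

# Like value.replace('||', '|'): replace every double pipe with a single pipe.
collapse_double_pipes() {
  printf '%s\n' "${1//||/|}"
}

# Like value.split('|').uniques().join('|'): keep the first occurrence of
# each pipe-separated value and preserve the original order.
dedupe_pipe() {
  printf '%s\n' "$1" | tr '|' '\n' | awk '!seen[$0]++' | paste -sd '|' -
}

dedupe_pipe "$(collapse_double_pipes 'Glass||Ceramics|Glass')"
# prints: Glass|Ceramics
```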

View File

@ -1,23 +1,32 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# openrefine-batch-docker.sh, Felix Lohmeier, v1.16, 2021-11-09
# https://github.com/felixlohmeier/openrefine-batch
# check system requirements
DOCKER="$(which docker 2> /dev/null)"
DOCKER="$(command -v docker 2> /dev/null)"
if [ -z "$DOCKER" ] ; then
echo 1>&2 "This action requires you to have 'docker' installed and present in your PATH. You can download it for free at http://www.docker.com/"
exit 1
fi
DOCKERINFO="$(docker info 2>/dev/null | grep 'Server Version')"
if [ -z "$DOCKERINFO" ] ; then
echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands. Try to run the script with 'sudo ./openrefine-batch-docker.sh ...'"
exit 1
if [ -z "$DOCKERINFO" ]
then
echo "command 'docker info' failed, trying again with sudo..."
DOCKERINFO="$(sudo docker info 2>/dev/null | grep 'Server Version')"
echo "OK"
docker=(sudo docker)
if [ -z "$DOCKERINFO" ] ; then
echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands."
exit 1
fi
else
docker=(docker)
fi
# help screen
function usage () {
cat <<EOF
Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
Usage: ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== basic arguments ==
-a INPUTDIR path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
@ -30,36 +39,58 @@ Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUT
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-v VERSION OpenRefine version (2.7, 2.7rc2, 2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7)
-t TEMPLATING several options for templating export, see below...
-v VERSION OpenRefine version (3.5.0, 3.4.1, 3.4, 3.3, 3.2, 3.1, 3.0, 2.8, 2.7, ...; default: 3.5.0)
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i projectName=PROJECTNAME (all formats), default: filename
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i includeFileSources=true/false (all formats), default: false
-i trimStrings=true/false (xml, json), default: false
-i storeEmptyStrings=true/false (xml, json), default: true
-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i processQuotes=true/false (csv, tsv), default: true
-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
== templating options (alternative exportformat) ==
-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
-t mode=row-based/record-based (engine mode, default: row-based)
-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files)
-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)
sudo ./openrefine-batch-docker.sh \
== examples ==
download example data
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip
example 1 (input, transform, export to tsv)
./openrefine-batch-docker.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
@ -68,16 +99,16 @@ sudo ./openrefine-batch-docker.sh \
-i guessCellValueTypes=true \
-RX
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
example 2 (input, transform, templating export)
./openrefine-batch-docker.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
EOF
exit 1
}
# defaults
ram="2048M"
version="2.7"
version="3.5.0"
restartfile="true"
restarttransform="true"
export="true"
@ -93,7 +124,7 @@ if [ "$NUMARGS" -eq 0 ]; then
fi
# get user input
options="a:b:c:d:e:f:i:m:p:ERXh"
options="a:b:c:d:e:f:i:m:t:v:ERXh"
while getopts $options opt; do
case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
@ -104,6 +135,7 @@ while getopts $options opt; do
f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;;
t ) templating+=("--${OPTARG}") ; exportformat="txt" ;;
v ) version=${OPTARG} ;;
E ) export="false" ;;
R ) restarttransform="false" ;;
@ -114,7 +146,7 @@ while getopts $options opt; do
* ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
esac
done
shift $(($OPTIND - 1))
shift $((OPTIND - 1))
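The option loop above collects every repeated `-i` and `-t` argument into an array, prefixes it with `--`, and switches the export format to `txt` as soon as a templating option is given. A minimal sketch of that pattern follows. The function name `parse_demo`, the reduced option string, and the `tsv` starting value for `exportformat` are assumptions made for illustration; they are not taken from the script.

```shell
#!/bin/bash
# Sketch only: parse_demo and its reduced option string are illustrative.
parse_demo() {
  local opt OPTIND=1 OPTARG   # reset getopts state on every call
  inputoptions=()
  templating=()
  exportformat="tsv"          # assumed starting value for this demo
  while getopts "i:t:" opt; do
    case $opt in
      i ) inputoptions+=("--${OPTARG}") ;;
      t ) templating+=("--${OPTARG}") ; exportformat="txt" ;;
      * ) return 1 ;;
    esac
  done
}

parse_demo -i recordPath=collection -i recordPath=record -t prefix='{ "rows" : [ '
printf '%s\n' "${inputoptions[@]}"
echo "templating options: ${#templating[@]}, export format: ${exportformat}"
```

Because each argument is stored as its own array element, values that contain spaces (such as the prefix above) survive intact as long as the array is later expanded as `"${templating[@]}"`.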
# check for mandatory options
if [ -z "$outputdir" ]; then
@ -122,6 +154,11 @@ if [ -z "$outputdir" ]; then
echo 1>&2 "example: ./openrefine-batch-docker.sh -c output/"
exit 1
fi
if [ "$(ls -A "$outputdir" 2>/dev/null)" ];then
echo 1>&2 "path to directory for exported files (and OpenRefine workspace) is not empty"
echo 1>&2 "$outputdir"
exit 1
fi
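The new guard refuses to run when the output directory already contains files. The same test can be sketched in isolation as follows; the helper name `dir_has_entries` and the temporary directory are hypothetical and only serve the demonstration.

```shell
# Sketch only: dir_has_entries is a hypothetical helper name.
dir_has_entries() {
  # ls -A also lists hidden entries but omits . and ..;
  # a missing directory produces no output and counts as "no entries"
  [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

workdir=$(mktemp -d)
dir_has_entries "$workdir" || echo "empty: safe to use as workspace"
touch "$workdir/.hidden"
dir_has_entries "$workdir" && echo "not empty: refuse to continue"
rm -r -f "${workdir:?}"
```

Note that a directory that does not exist yet passes this check just like an empty one.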
if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide recordpath in multiple arguments without slashes"
@ -156,6 +193,7 @@ echo "OpenRefine version: $version"
echo "OpenRefine workspace: $outputdir"
echo "Export to workspace: $export"
echo "Export format: $exportformat"
echo "Templating options: ${templating[*]}"
echo "Docker container name: $uuid"
echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform"
@ -163,38 +201,52 @@ echo ""
# declare additional variables
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Start process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Start process"
memoryload=()
# safe cleanup handler
cleanup()
{
echo "cleanup..."
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
}
trap "cleanup;exit" SIGHUP SIGINT SIGQUIT SIGTERM
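The cleanup handler is defined once, registered for termination signals with `trap`, and later called explicitly at the normal end of the run. A reduced sketch of this idea is shown below. The temporary file stands in for the container and workspace files, and the signal names are written without the `SIG` prefix, which bash accepts as well; both choices are assumptions of this example.

```shell
# Sketch only: the temporary file stands in for the container and workspace files.
scratch=$(mktemp)

cleanup() {
  echo "cleanup..."
  # ${var:?} aborts the command instead of expanding to an empty path
  rm -f "${scratch:?}"
}
# on a termination signal: clean up first, then exit
trap "cleanup;exit" HUP INT QUIT TERM

# ... work would happen here ...

# normal end of the run: call the same function explicitly
cleanup
```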
# launch server
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
# wait until server is available
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
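The `until` loop above polls once per second until the server answers with a page that contains "OpenRefine", and it never gives up. A hedged variant with an upper bound on the number of attempts could look like the sketch below. The helper `wait_until`, the retry cap, and the `localhost` address in the commented usage line are assumptions and do not appear in the script.

```shell
# Sketch only: wait_until and its retry cap are not part of the script.
# usage: wait_until MAX_TRIES COMMAND [ARGS...]
wait_until() {
  _tries=$1
  shift
  until "$@"; do
    _tries=$((_tries - 1))
    # give up once the attempts are used up
    [ "$_tries" -gt 0 ] || return 1
    sleep 1
  done
}

# A real probe might look like this (assuming a server on localhost:3333):
# wait_until 60 sh -c 'curl --silent -N http://localhost:3333 | grep -q OpenRefine'

# stand-in probe that succeeds immediately
wait_until 3 true && echo "server is available"
```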
# show server logs
docker attach ${uuid} &
${docker[*]} attach ${uuid} &
echo ""
# import all files
if [ -n "$inputfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# run client with input command
sudo docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
${docker[*]} run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
@ -202,11 +254,11 @@ if [ -n "$inputfiles" ]; then
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
echo ""
fi
done
@ -215,18 +267,18 @@ fi
# transform and export files
if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
# get project ids
echo "get project ids..."
sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
${docker[*]} run --rm --link ${uuid} felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
echo ""
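The project list is split into IDs and names purely by character position: columns 2 to 14 hold the ID and everything from column 17 onwards is the name. The sketch below replays this on made-up sample data; the two listing lines are fabricated to match the offsets used by `cut` and are not real client output.

```shell
#!/bin/bash
# Sketch only: the listing below is made-up sample data shaped to fit the cut offsets.
listing=$(mktemp)
printf ' %s: %s\n' 1234567890123 example-a.tsv 2345678901234 example-b.tsv > "$listing"

projectids=($(cut -c 2-14 "$listing"))     # 13-character project id
projectnames=($(cut -c 17- "$listing"))    # name follows the ": " separator

for i in "${!projectids[@]}"; do
  echo "${projectids[$i]} -> ${projectnames[$i]}"
done
rm -f "${listing:?}"
```

Since the arrays are filled through unquoted command substitution, a project name that contains whitespace would be split into several elements; the sample names avoid this on purpose.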
@ -237,11 +289,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
# restart server to advertise copied projects
echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
echo ""
fi
@ -251,16 +303,16 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# apply transformation rules
if [ -n "$jsonfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# run client with apply command
sudo docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
${docker[*]} run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -f ${jsonfile} ${projectids[i]}
# allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
@ -268,11 +320,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# restart server to clear memory
if [ "$restarttransform" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
fi
echo ""
done
@ -281,17 +333,17 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# export project to workspace
if [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
# get filename without extension
filename=${projectnames[i]%.*}
echo "export to file ${filename}.${exportformat}..."
# run client with export command
sudo docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.${exportformat}" ${projectids[i]}
${docker[*]} run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -E --output="${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java))
@ -301,11 +353,11 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
${docker[*]} stop -t=5000 ${uuid}
${docker[*]} rm ${uuid}
${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
${docker[*]} attach ${uuid} &
fi
echo ""
@ -319,32 +371,25 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
fi
fi
# cleanup
echo "cleanup..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
# run cleanup function
cleanup
echo ""
# calculate and print checkpoints
echo "=== Statistics ==="
echo ""
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="End process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="End process"
echo "starting time and run time of each step:"
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointdate[$((checkpoints + 1))]=$(date +%s)
for i in $(seq 1 $checkpoints); do
diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
diffsec="$((${checkpointdate[$((i + 1))]} - ${checkpointdate[$i]}))"
printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
done
echo ""
diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
diffsec="$((checkpointdate[$checkpoints] - checkpointdate[1]))"
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
# calculate and print memory load
@ -352,4 +397,4 @@ max=${memoryload[0]}
for n in "${memoryload[@]}" ; do
((n > max)) && max=$n
done
echo "highest memory load: $(($max / 1024)) MB"
echo "highest memory load: $((max / 1024)) MB"
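The statistics step derives the run time of each stage from consecutive epoch timestamps and reports the largest recorded resident set size. The sketch below walks through the same arithmetic on made-up numbers. The sample values and the helper `format_duration` are assumptions; the helper formats hh:mm:ss with plain integer arithmetic instead of the GNU `date -d@... -u` call used in the script.

```shell
#!/bin/bash
# Sketch only: the timestamps and RSS values below are made-up sample data.
checkpointdate=([1]=1000 [2]=1065 [3]=4725)
checkpointname=([1]="Start process" [2]="Import all files" [3]="End process")
memoryload=(204800 1048576 524288)   # resident set size as printed by ps -o rss

# pure-arithmetic hh:mm:ss, used here instead of the script's GNU date call
format_duration() {
  printf '%02d:%02d:%02d' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

steps=$((${#checkpointdate[@]} - 1))
for i in $(seq 1 "$steps"); do
  diffsec=$((checkpointdate[i + 1] - checkpointdate[i]))
  printf '%20s (%s)\n' "${checkpointname[$i]}" "$(format_duration "$diffsec")"
done

max=${memoryload[0]}
for n in "${memoryload[@]}"; do
  if ((n > max)); then max=$n; fi
done
echo "highest memory load: $((max / 1024)) MB"
```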


@ -1,10 +1,10 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# openrefine-batch.sh, Felix Lohmeier, v1.16, 2021-11-09
# https://github.com/felixlohmeier/openrefine-batch
# declare download URLs for OpenRefine and OpenRefine client
openrefine_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-linux-2017-10-26.tar.gz"
client_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-client_0-3-1_linux-64bit"
openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/3.5.0/openrefine-linux-3.5.0.tar.gz"
client_URL="https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux"
# check system requirements
JAVA="$(which java 2> /dev/null)"
@ -34,7 +34,7 @@ if [ ! -d "openrefine-client" ]; then
echo "Download OpenRefine client..."
mkdir -p openrefine-client
wget -q -P openrefine-client $wget_opt $client_URL
chmod +x openrefine-client/openrefine-client_0-3-1_linux-64bit
chmod +x openrefine-client/openrefine-client_0-3-10_linux
echo ""
fi
@ -55,33 +55,55 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-p PORT PORT on which OpenRefine should listen (default: 3333)
-t TEMPLATING several options for templating export, see below...
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record, default xml: record, default json: _ _
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
-i sheets=SHEETS (xls, xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i projectName=PROJECTNAME (all formats), default: filename
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i includeFileSources=true/false (all formats), default: false
-i trimStrings=true/false (xml, json), default: false
-i storeEmptyStrings=true/false (xml, json), default: true
-i guessCellValueTypes=true/false (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xls, xlsx, ods), default: 1, default fixed-width: 0
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: 0, default line-based: -1
-i storeBlankRows=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i processQuotes=true/false (csv, tsv), default: true
-i storeBlankCellsAsNulls=true/false (csv, tsv, line-based, fixed-width, xls, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
== templating options (alternative exportformat) ==
-t template=TEMPLATE (mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)
-t mode=row-based/record-based (engine mode, default: row-based)
-t prefix=PREFIX (text string that you enter in the *prefix* textfield in the browser app)
-t rowSeparator=ROWSEPARATOR (text string that you enter in the *row separator* textfield in the browser app)
-t suffix=SUFFIX (text string that you enter in the *suffix* textfield in the browser app)
-t filterQuery=REGEX (Simple RegEx text filter on filterColumn, e.g. ^12015$)
-t filterColumn=COLUMNNAME (column name for filterQuery, default: name of first column)
-t facets=FACETS (facets config in json format, may be extracted with browser dev tools in browser app)
-t splitToFiles=true/false (will split each row/record into a single file; it specifies a presumably unique character series for splitting; prefix and suffix will be applied to all files)
-t suffixById=true/false (enhancement option for splitToFiles; will generate filename-suffix from values in key column)
== examples ==
download example data
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
unzip master.zip openrefine-batch-master/examples/*
mv openrefine-batch-master/examples .
rm -f master.zip
example 1 (input, transform, export to tsv)
./openrefine-batch.sh \
-a examples/powerhouse-museum/input/ \
@ -92,9 +114,9 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
-i guessCellValueTypes=true \
-RX
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
example 2 (input, transform, templating export)
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX -t template='{ "Record ID" : {{jsonize(cells["Record ID"].value)}}, "Object Title" : {{jsonize(cells["Object Title"].value)}}, "Registration Number" : {{jsonize(cells["Registration Number"].value)}}, "Description." : {{jsonize(cells["Description."].value)}}, "Marks" : {{jsonize(cells["Marks"].value)}}, "Production Date" : {{jsonize(cells["Production Date"].value)}}, "Provenance (Production)" : {{jsonize(cells["Provenance (Production)"].value)}}, "Provenance (History)" : {{jsonize(cells["Provenance (History)"].value)}}, "Categories" : {{jsonize(cells["Categories"].value)}}, "Persistent Link" : {{jsonize(cells["Persistent Link"].value)}}, "Height" : {{jsonize(cells["Height"].value)}}, "Width" : {{jsonize(cells["Width"].value)}}, "Depth" : {{jsonize(cells["Depth"].value)}}, "Diameter" : {{jsonize(cells["Diameter"].value)}}, "Weight" : {{jsonize(cells["Weight"].value)}}, "License info" : {{jsonize(cells["License info"].value)}} }' -t rowSeparator=',' -t prefix='{ "rows" : [ ' -t suffix='] }' -t splitToFiles=true
EOF
exit 1
}
@ -118,7 +140,7 @@ if [ "$NUMARGS" -eq 0 ]; then
fi
# get user input
options="a:b:c:d:e:f:i:m:p:ERXh"
options="a:b:c:d:e:f:i:m:p:t:ERXh"
while getopts $options opt; do
case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
@ -130,6 +152,7 @@ while getopts $options opt; do
i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;;
p ) port=${OPTARG} ;;
t ) templating+=("--${OPTARG}") ; exportformat="txt" ;;
E ) export="false" ;;
R ) restarttransform="false" ;;
X ) restartfile="false" ;;
@ -139,7 +162,7 @@ while getopts $options opt; do
* ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
esac
done
shift $(($OPTIND - 1))
shift $((OPTIND - 1))
# check for mandatory options
if [ -z "$outputdir" ]; then
@ -147,6 +170,11 @@ if [ -z "$outputdir" ]; then
echo 1>&2 "example: ./openrefine-batch.sh -c output/"
exit 1
fi
if [ "$(ls -A "$outputdir" 2>/dev/null)" ];then
echo 1>&2 "path to directory for exported files (and OpenRefine workspace) is not empty"
echo 1>&2 "$outputdir"
exit 1
fi
if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide recordpath in multiple arguments without slashes"
@ -180,23 +208,38 @@ echo "OpenRefine port: $port"
echo "OpenRefine workspace: $outputdir"
echo "Export to workspace: $export"
echo "Export format: $exportformat"
echo "Templating options: ${templating[*]}"
echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform"
echo ""
# declare additional variables
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Start process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Start process"
memoryload=()
# safe cleanup handler
cleanup()
{
echo "cleanup..."
kill ${pid}
wait
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
}
trap "cleanup;exit" SIGHUP SIGINT SIGQUIT SIGTERM
# launch server
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
pid=$!
@ -207,16 +250,16 @@ echo ""
# import all files
if [ -n "$inputfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# run client with input command
openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
openrefine-client/openrefine-client_0-3-10_linux -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
memoryload+=($(ps --no-headers -o rss -p ${pid}))
@ -238,18 +281,18 @@ fi
# transform and export files
if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
# get project ids
echo "get project ids..."
openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -l > "${outputdir}/projects.tmp"
projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
openrefine-client/openrefine-client_0-3-10_linux -P ${port} -l > "${outputdir}/projects.tmp"
projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
echo ""
@@ -275,16 +318,16 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# apply transformation rules
if [ -n "$jsonfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# run client with apply command
openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
openrefine-client/openrefine-client_0-3-10_linux -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
# allocated system resources
ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
memoryload+=($(ps --no-headers -o rss -p ${pid}))
@@ -306,17 +349,17 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# export project to workspace
if [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$((checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
echo ""
# get filename without extension
filename=${projectnames[i]%.*}
echo "export to file ${filename}.${exportformat}..."
# run client with export command
openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" ${projectids[i]}
openrefine-client/openrefine-client_0-3-10_linux -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
memoryload+=($(ps --no-headers -o rss -p ${pid}))
@@ -345,32 +388,25 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
fi
fi
# cleanup
echo "cleanup..."
kill ${pid}
wait
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
# run cleanup function
cleanup
echo ""
# calculate and print checkpoints
echo "=== Statistics ==="
echo ""
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="End process"
checkpointdate[$((checkpoints + 1))]=$(date +%s)
checkpointname[$((checkpoints + 1))]="End process"
echo "starting time and run time of each step:"
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointdate[$((checkpoints + 1))]=$(date +%s)
for i in $(seq 1 $checkpoints); do
diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
diffsec="$((${checkpointdate[$((i + 1))]} - ${checkpointdate[$i]}))"
printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
done
echo ""
diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
diffsec="$((checkpointdate[$checkpoints] - checkpointdate[1]))"
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
# calculate and print memory load
@@ -378,4 +414,4 @@ max=${memoryload[0]}
for n in "${memoryload[@]}" ; do
((n > max)) && max=$n
done
echo "highest memory load: $(($max / 1024)) MB"
echo "highest memory load: $((max / 1024)) MB"
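
Review note: the statistics block above formats each run time with `date -d@${diffsec} -u +%H:%M:%S`, which wraps around once a step takes 24 hours or more, and it finds the peak RSS sample with an inline loop. The sketch below shows one possible way to do both with shell arithmetic only. The function names `format_duration` and `max_of` and the sample values are hypothetical and are not part of the script in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the script in this diff.

# Format a number of seconds as hh:mm:ss with shell arithmetic only.
# Unlike `date -d@N -u +%H:%M:%S`, the hour field does not wrap at 24.
format_duration() {
  local secs=$1
  printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Print the largest integer among the arguments (e.g. RSS samples in kB).
max_of() {
  local max=$1 n
  for n in "$@"; do
    ((n > max)) && max=$n
  done
  echo "$max"
}

# Made-up sample data: epoch seconds per checkpoint and RSS samples in kB.
checkpointdate=(1000 1075 4736)
memoryload=(2048 8192 4096)

# Run time of each step is the difference between neighbouring checkpoints.
for ((i = 0; i < ${#checkpointdate[@]} - 1; i++)); do
  echo "step $((i + 1)): $(format_duration $((checkpointdate[i + 1] - checkpointdate[i])))"
done

echo "highest memory load: $(($(max_of "${memoryload[@]}") / 1024)) MB"
```

Keeping the arithmetic inside `$(( ))` without a leading `$` on the variable names follows the style this diff already moves towards.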