release v1.0

2025-05-18 00:00:46 +02:00 · 2017-03-14 23:17:33 +01:00 · acf9f046b7
commit acf9f046b7
parent f466df1e46
4 changed files with 851 additions and 302 deletions
--- a/README.md
+++ b/README.md
@@ -6,7 +6,9 @@ Shell script to run OpenRefine in batch mode (import, transform, export). This b
 2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
 3. finally exports the data in TSV (tab-separated values) format.
-It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/felixlohmeier/openrefine/) (server) and a [docker container for a python client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum.
+It orchestrates [OpenRefine](https://github.com/OpenRefine/OpenRefine) (server) and a [python client](https://github.com/felixlohmeier/openrefine-client) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum.
 If you prefer a containerized approach, see a [variation of this script for Docker](#docker) below.
 ### Typical Workflow
@@ -15,22 +17,23 @@ It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/fel
 ### Install
-1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
+Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
-2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
+
 That's all. The script will automatically download copies of OpenRefine and the python client on first run and will tell you if something (python, java) is missing.
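The kind of requirement check mentioned above could look roughly like the following sketch. It is not taken from openrefine-batch.sh: the function name `missing_commands` is hypothetical, and the tool names `python` and `java` are simply the ones named in the sentence above.

```shell
#!/bin/bash
# Hypothetical helper (not part of openrefine-batch.sh): print every given
# command that cannot be found in PATH, one per line.
missing_commands() {
    local cmd missing=()
    for cmd in "$@"; do
        # command -v returns non-zero if the command is not available
        if ! command -v "$cmd" > /dev/null 2>&1; then
            missing+=("$cmd")
        fi
    done
    # print nothing at all if every command was found
    if [ "${#missing[@]}" -gt 0 ]; then
        printf '%s\n' "${missing[@]}"
    fi
}

# Assumed tool names, taken from the README sentence above.
missing="$(missing_commands python java)"
if [ -n "$missing" ]; then
    echo "missing requirements:" $missing 1>&2
fi
```

Running such a check before the first call makes it clear up front whether anything needs to be installed.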
 ### Usage
 ```
-mkdir -p input && cp INPUTFILES input/
+mkdir input
-mkdir -p config && cp CONFIGFILES config/
+cp INPUTFILES input/
-sudo ./openrefine-batch.sh input/ config/ OUTPUT/
+mkdir config
 cp CONFIGFILES config/
 ./openrefine-batch.sh input/ config/ OUTPUT/
 ```
 Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
 **INPUTFILES**
 * any data that [OpenRefine supports](https://github.com/OpenRefine/OpenRefine/wiki/Importers). CSV, TSV and line-based files should work out of the box. XML, JSON, fixed-width, XLSX and ODS need one additional input parameter (see chapter [Options](https://github.com/felixlohmeier/openrefine-batch#options) below)
-* multiple slices of data may be transformed into a into a single file [by providing a zip or tar.gz archive])
+* multiple slices of data may be transformed into a single file [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers)
 * you may use hard links instead of cp: `ln INPUTFILE input/`
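As a small illustration of the `ln` tip: a hard link is a second directory entry for the same file content, so no data is copied. Note that hard links only work when both paths are on the same filesystem. The sketch below uses a throwaway directory and a made-up file name (`source.tsv`), not real input data.

```shell
#!/bin/bash
# Work in a temporary directory so that no real input/ directory is touched.
workdir="$(mktemp -d)"
mkdir "$workdir/input"
printf 'id\tname\n1\texample\n' > "$workdir/source.tsv"

# Create a hard link instead of a copy.
ln "$workdir/source.tsv" "$workdir/input/source.tsv"

# -ef is true when both paths refer to the same device and inode.
if [ "$workdir/source.tsv" -ef "$workdir/input/source.tsv" ]; then
    echo "same file, no extra disk space used"
fi
```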
 **CONFIGFILES**
@@ -41,140 +44,204 @@ Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon
 * Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
 * OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
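The cleanup of project directories can also be wrapped in a small function. This is a minimal sketch, assuming a hypothetical helper `remove_projects` that is not part of the script; it relies on `find -maxdepth`, which GNU and BSD find provide, and is demonstrated on a temporary directory with made-up names.

```shell
#!/bin/bash
# Hypothetical helper: delete OpenRefine project directories (*.project)
# directly inside the given directory and keep everything else.
remove_projects() {
    local outputdir="$1"
    # ${outputdir:?} aborts with an error if the argument is empty
    find "${outputdir:?}" -maxdepth 1 -type d -name '*.project' -exec rm -r -f {} +
}

# Demonstration on a temporary directory (names are made up).
demo="$(mktemp -d)"
mkdir "$demo/1234567890123.project"
touch "$demo/result.tsv"
remove_projects "$demo"
ls "$demo"
```

Only directories whose names end in `.project` are removed, so exported TSV files stay in place.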
-#### Example
+### Example
 [Example Powerhouse Museum](examples/powerhouse-museum)
 ```
 ./openrefine-batch.sh \
 -a examples/powerhouse-museum/input/ \
 -b examples/powerhouse-museum/config/ \
 -c examples/powerhouse-museum/output/ \
 -f tsv \
 -i processQuotes=false \
 -i guessCellValueTypes=true \
 -RX
 ```
 clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
-```
+### Help Screen
 sudo ./openrefine-batch.sh \
 examples/powerhouse-museum/input/ \
 examples/powerhouse-museum/config/ \
 examples/powerhouse-museum/output/ \
 examples/powerhouse-museum/cross/ \
 2G 2.7rc1 restartfile-false restarttransform-false export-true \
 tsv --processQuotes=false --guessCellValueTypes=true
 ```
 #### Options
 ```
-sudo ./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restartfile $restarttransform $export $inputformat $inputoptions
+[18:20 felix ~/openrefine-batch]$ ./openrefine-batch.sh
 Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 == basic arguments ==
    -a INPUTDIR      path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
    -b TRANSFORMDIR  path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to transform only)
    -c OUTPUTDIR     path to directory for exported files (and OpenRefine workspace)
 == options ==
    -d CROSSDIR      path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
    -f INPUTFORMAT   (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
    -i INPUTOPTIONS  several options provided by openrefine-client, see below...
    -m RAM           maximum RAM for OpenRefine java heap space (default: 2048M)
    -p PORT          PORT on which OpenRefine should listen (default: 3333)
    -E               do NOT export files
    -R               do NOT restart OpenRefine after each transformation (e.g. config file)
    -X               do NOT restart OpenRefine after each project (e.g. input file)
    -h               displays this help screen
 == inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
    -i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
    -i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
    -i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
 == more inputoptions (optional, only together with inputformat) ==
    -i projectName=PROJECTNAME (all formats)
    -i limit=LIMIT (all formats), default: -1
    -i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
    -i trimStrings=TRIMSTRINGS (xml, json), default: false
    -i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
    -i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
    -i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
    -i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
    -i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
    -i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
    -i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i processQuotes=PROCESSQUOTES (csv, tsv), default: true
    -i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i linesPerRow=LINESPERROW (line-based), default: 1
 == example ==
 ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true
 clone or download GitHub repository to get example data:
 https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
 ```
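The `recordPath` rule from the help screen can be applied mechanically. The following is a minimal sketch, assuming a hypothetical helper `recordpath_to_args` (not part of the script) that splits a slash-separated path into repeated `-i recordPath=...` arguments and stores them in an array:

```shell
#!/bin/bash
# Hypothetical helper: convert a path such as /collection/record/ into
# repeated "-i recordPath=..." arguments in the global array recordpath_args.
recordpath_to_args() {
    local path="$1" segment segments
    recordpath_args=()
    # split on "/"; leading or trailing slashes produce empty segments
    IFS='/' read -r -a segments <<< "$path"
    for segment in "${segments[@]}"; do
        # skip the empty segments
        if [ -n "$segment" ]; then
            recordpath_args+=(-i "recordPath=${segment}")
        fi
    done
}

recordpath_to_args "/collection/record/"
# prints: -i recordPath=collection -i recordPath=record
printf '%s\n' "${recordpath_args[*]}"
```

The array could then be appended to a call, for example `./openrefine-batch.sh -a input/ -b config/ -c OUTPUT/ -f xml "${recordpath_args[@]}"`.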
 1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
 2. configdir: path to directory with [OpenRefine transformation rules (json files)](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
 3. outputdir: path to directory for exported files (and OpenRefine workspace)
 4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
 5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
 6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc1)
 7. restartfile: restart docker after each project (e.g. input file) to clear memory (restartfile-true/restartfile-false; default: restartfile-true)
 8. restarttransform: restart docker container after each transformation (e.g. config file) to clear memory (restarttransform-true/restarttransform-false; default: restarttransform-false)
 9. export: toggle on/off (export-true/export-false; default: export-true)
 10. inputformat: (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
 11. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
 inputoptions (mandatory for xml, json, fixed-width, xlsx, ods):
 * `--recordPath=RECORDPATH` (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: `--recordPath=collection --recordPath=record`
 * `--columnWidths=COLUMNWIDTHS` (fixed-width): please provide widths separated by comma (e.g. 7,5)
 * `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
 more inputoptions (optional, only together with inputformat):
 * `--projectName=PROJECTNAME` (all formats)
 * `--limit=LIMIT` (all formats), default: -1
 * `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
 * `--trimStrings=TRIMSTRINGS` (xml, json), default: false
 * `--storeEmptyStrings=STOREEMPTYSTRINGS` (xml, json), default: true
 * `--guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json)`, default: false
 * `--encoding=ENCODING (csv, tsv, line-based, fixed-width)`, please provide short encoding name (e.g. UTF-8)
 * `--ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods)`, default: -1
 * `--headerLines=HEADERLINES` (csv, tsv, fixed-width, xlsx, ods), default: 1
 * `--skipDataLines=SKIPDATALINES` (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
 * `--storeBlankRows=STOREBLANKROWS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
 * `--processQuotes=PROCESSQUOTES` (csv, tsv), default: true
 * `--storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
 * `--linesPerRow=LINESPERROW` (line-based), default: 1
 ### Logging
-The script uses `docker attach` to print log messages from OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
+The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
 ```
-[17:54 felix ~/openrefine-batch]$ sudo ./openrefine-batch.sh \
+[17:55 felix ~/openrefine-batch]$ ./openrefine-batch.sh \
-> examples/powerhouse-museum/input/ \
+> -a examples/powerhouse-museum/input/ \
-> examples/powerhouse-museum/config/ \
+> -b examples/powerhouse-museum/config/ \
-> examples/powerhouse-museum/output/ \
+> -c examples/powerhouse-museum/output/ \
-> examples/powerhouse-museum/cross/ \
+> -f tsv \
-> 2G 2.7rc1 restartfile-false restarttransform-false export-true \
+> -i processQuotes=false \
-> tsv --processQuotes=false --guessCellValueTypes=true
+> -i guessCellValueTypes=true \
-Input directory:         /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/input
+> -RX
 Input directory:         /home/felix/openrefine-batch/examples/powerhouse-museum/input
 Input files:             phm-collection.tsv
 Input format:            --format=tsv
 Input options:           --processQuotes=false --guessCellValueTypes=true
-Config directory:        /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/config
+Config directory:        /home/felix/openrefine-batch/examples/powerhouse-museum/config
 Transformation rules:    phm-transform.json
-Cross directory:         /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/cross
+Cross directory:         /dev/null
 Cross projects:          
-OpenRefine heap space:   2G
+OpenRefine heap space:   2048M
-OpenRefine version:      2.7rc1
+OpenRefine port:         3333
-OpenRefine workspace:    /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output
+OpenRefine workspace:    /home/felix/openrefine-batch/examples/powerhouse-museum/output
-Export TSV to workspace: export-true
+Export TSV to workspace: true
-Docker container name:   6b622f38-bbdd-4a28-b590-0c7fdf9d577b
+restart after file:      false
-restart after file:      restartfile-false
+restart after transform: false
 restart after transform: restarttransform-false
-begin: Mi 1. Mär 17:54:45 CET 2017
+=== 1. Launch OpenRefine ===
-start OpenRefine server...
+starting time: Di 14. Mär 17:58:08 CET 2017
 2d836891cbc79f730f18262c9f98b6406b5323ca9fd84636afb194a664abf66e
-=== IMPORT ===
+Starting OpenRefine at 'http://127.0.0.1:3333/'
 17:58:08.758 [            refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
 17:58:08.760 [            refine_server] refine.memory size: 2048M JVM Max heap: 1908932608 (2ms)
 17:58:08.787 [            refine_server] Initializing context: '/' from '/home/felix/openrefine-batch/openrefine/webapp' (27ms)
 17:58:09.463 [                   refine] Starting OpenRefine 2.7-rc.1 [TRUNK]... (676ms)
 17:58:09.476 [       FileProjectManager] Failed to load workspace from any attempted alternatives. (13ms)
 17:58:12.003 [                   refine] Running in headless mode (2527ms)
 === 2. Import all files ===
 starting time: Di 14. Mär 17:58:12 CET 2017
 import phm-collection.tsv...
-16:54:59.290 [                   refine] POST /command/core/create-project-from-upload (4748ms)
+17:58:13.068 [                   refine] POST /command/core/create-project-from-upload (1065ms)
-New project: 1831307645035
+New project: 2073385535316
-16:55:15.514 [                   refine] GET /command/core/get-rows (16224ms)
+17:58:26.543 [                   refine] GET /command/core/get-rows (13475ms)
 Number of rows: 75814
 STARTED     ELAPSED %MEM %CPU   RSS
-17:54:46       00:31  9.7  109 788156
+17:58:07       00:18  9.8  168 795024
-=== TRANSFORM / EXPORT ===
+=== 3. Prepare transform & export ===
 starting time: Di 14. Mär 17:58:26 CET 2017
 get project ids...
-16:55:21.258 [                   refine] GET /command/core/get-all-project-metadata (5744ms)
+17:58:26.778 [                   refine] GET /command/core/get-all-project-metadata (235ms)
- 1831307645035: phm-collection.tsv
+ 2073385535316: phm-collection.tsv
--- begin project 1831307645035 @ Mi 1. Mär 17:55:22 CET 2017 ---
+=== 4. Transform phm-collection.tsv ===
 starting time: Di 14. Mär 17:58:26 CET 2017
 transform phm-transform.json...
-16:55:23.983 [                   refine] GET /command/core/get-models (2725ms)
+17:58:26.917 [                   refine] GET /command/core/get-models (139ms)
-16:55:24.002 [                   refine] POST /command/core/apply-operations (19ms)
+17:58:26.934 [                   refine] POST /command/core/apply-operations (17ms)
 STARTED     ELAPSED %MEM %CPU   RSS
-17:54:46       01:26 13.3  118 1076800
+17:58:07       01:02 13.5  134 1096916
-export to file 1831307645035.tsv...
+
-16:56:14.909 [                   refine] GET /command/core/get-models (50907ms)
+=== 5. Export phm-collection.tsv ===
-16:56:14.933 [                   refine] GET /command/core/get-all-project-metadata (24ms)
+
-16:56:14.949 [                   refine] POST /command/core/export-rows/phm-collection.tsv.tsv (16ms)
+starting time: Di 14. Mär 17:59:09 CET 2017
 export to file phm-collection.tsv...
 17:59:09.944 [                   refine] GET /command/core/get-models (43010ms)
 17:59:09.956 [                   refine] GET /command/core/get-all-project-metadata (12ms)
 17:59:09.967 [                   refine] POST /command/core/export-rows/phm-collection.tsv.tsv (11ms)
 STARTED     ELAPSED %MEM %CPU   RSS
-17:54:46       03:10 13.9 59.2 1130304
+17:58:07       02:24 13.5 60.5 1098056
 --- finished project 1831307645035 @ Mi 1. Mär 17:57:57 CET 2017 ---
 output (number of lines / size in bytes):
-  167017 60527726 /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output/1831307645035.tsv
+  167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
 cleanup...
-16:58:00.158 [           ProjectManager] Saving all modified projects ... (105209ms)
+18:00:35.425 [           ProjectManager] Saving all modified projects ... (85458ms)
-16:58:07.242 [        project_utilities] Saved project '1831307645035' (7084ms)
+18:00:42.357 [        project_utilities] Saved project '2073385535316' (6932ms)
 6b622f38-bbdd-4a28-b590-0c7fdf9d577b
 6b622f38-bbdd-4a28-b590-0c7fdf9d577b
-finish: Mi 1. Mär 17:58:09 CET 2017
+=== Statistics ===
 starting time and run time of each step:
                      Start process Di 14. Mär 17:58:08 CET 2017 (00:00:00)
                  Launch OpenRefine Di 14. Mär 17:58:08 CET 2017 (00:00:04)
                   Import all files Di 14. Mär 17:58:12 CET 2017 (00:00:14)
         Prepare transform & export Di 14. Mär 17:58:26 CET 2017 (00:00:00)
       Transform phm-collection.tsv Di 14. Mär 17:58:26 CET 2017 (00:00:43)
          Export phm-collection.tsv Di 14. Mär 17:59:09 CET 2017 (00:01:34)
                        End process Di 14. Mär 18:00:43 CET 2017 (00:00:00)
 total run time: 00:02:35 (hh:mm:ss)
 highest memory load: 1072 MB
 ```
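The run times in the statistics block are the differences between consecutive checkpoint timestamps, printed as hh:mm:ss. A minimal sketch of that arithmetic follows, assuming made-up timestamps and a hypothetical helper `format_duration`; the script's own implementation may differ.

```shell
#!/bin/bash
# Hypothetical helper: format a number of seconds as hh:mm:ss.
format_duration() {
    local seconds="$1"
    printf '%02d:%02d:%02d\n' $((seconds / 3600)) $((seconds % 3600 / 60)) $((seconds % 60))
}

# Made-up checkpoints as Unix timestamps, 14 and 43 seconds apart.
checkpointname=("Import all files" "Transform" "End process")
checkpointdate=(1000 1014 1057)

# The run time of a step is the start of the next step minus its own start.
for ((i = 0; i < ${#checkpointdate[@]} - 1; i++)); do
    diff=$((checkpointdate[i + 1] - checkpointdate[i]))
    printf '%s (%s)\n' "${checkpointname[i]}" "$(format_duration "$diff")"
done
```

In the sample log, the total run time of 00:02:35 corresponds to 155 seconds between "Start process" and "End process".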
 ### Docker
 A variation of the shell script orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/felixlohmeier/openrefine/) (server) and a [docker container for the python client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) instead of native applications.
 **Install**
 1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
 2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh && chmod +x openrefine-batch-docker.sh`
 **Usage**
 ```
 mkdir input
 cp INPUTFILES input/
 mkdir config
 cp CONFIGFILES config/
 sudo ./openrefine-batch-docker.sh input/ config/ OUTPUT/
 ```
 Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
 ### Todo
 - [ ] use getopts for parsing of arguments
 - [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
 - [ ] add option to delete openrefine projects in output directory
 - [ ] provide more example data from other OpenRefine tutorials
--- a/examples/powerhouse-museum/README.md
+++ b/examples/powerhouse-museum/README.md
@@ -7,13 +7,14 @@ Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Dat
 ## Usage
 ```
-sudo ./openrefine-batch.sh \
+./openrefine-batch.sh \
-examples/powerhouse-museum/input/ \
+-a examples/powerhouse-museum/input/ \
-examples/powerhouse-museum/config/ \
+-b examples/powerhouse-museum/config/ \
-examples/powerhouse-museum/output/ \
+-c examples/powerhouse-museum/output/ \
-examples/powerhouse-museum/cross/ \
+-f tsv \
-2G 2.7rc1 restartfile-false restarttransform-false export-true \
+-i processQuotes=false \
-tsv --processQuotes=false --guessCellValueTypes=true
+-i guessCellValueTypes=true \
 -RX
 ```
 ## input/phm-collection.tsv
--- a/openrefine-batch-docker.sh
+++ b/openrefine-batch-docker.sh
@@ -0,0 +1,339 @@
 #!/bin/bash
 # openrefine-batch-docker.sh, Felix Lohmeier, v1.0.1, 14.03.2017
 # https://github.com/felixlohmeier/openrefine-batch
 # check system requirements
 DOCKER="$(which docker 2> /dev/null)"
 if [ -z "$DOCKER" ] ; then
    echo 1>&2 "This action requires you to have 'docker' installed and present in your PATH. You can download it for free at http://www.docker.com/"
    exit 1
 fi
 DOCKERINFO="$(docker info 2>/dev/null | grep 'Server Version')"
 if [ -z "$DOCKERINFO" ] ; then
    echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands. Try to run the script with 'sudo ./openrefine-batch-docker.sh ...'"
    exit 1
 fi
 # help screen
 function usage () {
    cat <<EOF
 Usage: ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 == basic arguments ==
    -a INPUTDIR      path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
    -b TRANSFORMDIR  path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to transform only)
    -c OUTPUTDIR     path to directory for exported files (and OpenRefine workspace)
 == options ==
    -d CROSSDIR      path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
    -f INPUTFORMAT   (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
    -i INPUTOPTIONS  several options provided by openrefine-client, see below...
    -m RAM           maximum RAM for OpenRefine java heap space (default: 2048M)
    -v VERSION       OpenRefine version (2.7rc2, 2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc2)
    -E               do NOT export files
    -R               do NOT restart OpenRefine after each transformation (e.g. config file)
    -X               do NOT restart OpenRefine after each project (e.g. input file)
    -h               displays this help screen
 == inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
    -i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
    -i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
    -i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
 == more inputoptions (optional, only together with inputformat) ==
    -i projectName=PROJECTNAME (all formats)
    -i limit=LIMIT (all formats), default: -1
    -i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
    -i trimStrings=TRIMSTRINGS (xml, json), default: false
    -i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
    -i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
    -i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
    -i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
    -i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
    -i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
    -i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i processQuotes=PROCESSQUOTES (csv, tsv), default: true
    -i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i linesPerRow=LINESPERROW (line-based), default: 1
 == example ==
 ./openrefine-batch-docker.sh \
 -a examples/powerhouse-museum/input/ \
 -b examples/powerhouse-museum/config/ \
 -c examples/powerhouse-museum/output/ \
 -f tsv \
 -i processQuotes=false \
 -i guessCellValueTypes=true
 clone or download GitHub repository to get example data:
 https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
 EOF
   exit 1
 }
 # defaults
 ram="2048M"
 version="2.7rc2"
 restartfile="true"
 restarttransform="true"
 export="true"
 inputdir=/dev/null
 configdir=/dev/null
 crossdir=/dev/null
 # check input
 NUMARGS=$#
 if [ "$NUMARGS" -eq 0 ]; then
  usage
 fi
 # get user input
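 # -v takes the OpenRefine version (this variant has no -p option);
 # the leading colon enables silent error mode so that the \? and : cases below receive the offending option in OPTARG
 options=":a:b:c:d:f:i:m:v:ERXh"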
 while getopts $options opt; do
   case $opt in
   a )  inputdir=$(readlink -f "${OPTARG}"); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
   b )  configdir=$(readlink -f "${OPTARG}"); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
   c )  outputdir=$(readlink -m "${OPTARG}"); mkdir -p "${outputdir}" ;;
   d )  crossdir=$(readlink -f "${OPTARG}"); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
   f )  format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
   i )  inputoptions+=("--${OPTARG}") ;;
   m )  ram=${OPTARG} ;;
   v )  version=${OPTARG} ;;
   E )  export="false" ;;
   R )  restarttransform="false" ;;
   X )  restartfile="false" ;;
   h )  usage ;;
   \? ) echo 1>&2 "Unknown option: -$OPTARG"; usage; exit 1;;
   :  ) echo 1>&2 "Missing option argument for -$OPTARG"; usage; exit 1;;
   *  ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
   esac
 done
 shift $(($OPTIND - 1))
 # check for mandatory options
 if [ -z "$outputdir" ]; then
    echo 1>&2 "please provide path to directory for exported files (and OpenRefine workspace)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh -c output/"
    exit 1
 fi
 if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide recordpath in multiple arguments without slashes"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i recordPath=collection -i recordPath=record"
    exit 1
 fi
 if [ "$format" = "fixed-width" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide column widths separated by comma (e.g. 7,5)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i columnWidths=7,5"
    exit 1
 fi
 if [ "$format" = "xlsx" ] || [ "$format" = "ods" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i sheets=0"
    exit 1
 fi
 # print variables
 uuid=$(cat /proc/sys/kernel/random/uuid)
 echo "Input directory:         $inputdir"
 echo "Input files:             ${inputfiles[*]}"
 echo "Input format:            $inputformat"
 echo "Input options:           ${inputoptions[*]}"
 echo "Config directory:        $configdir"
 echo "Transformation rules:    ${jsonfiles[*]}"
 echo "Cross directory:         $crossdir"
 echo "Cross projects:          ${crossprojects[*]}"
 echo "OpenRefine heap space:   $ram"
 echo "OpenRefine version:      $version"
 echo "OpenRefine workspace:    $outputdir"
 echo "Export TSV to workspace: $export"
 echo "Docker container name:   $uuid"
 echo "restart after file:      $restartfile"
 echo "restart after transform: $restarttransform"
 echo ""
 # declare additional variables
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="Start process"
 # launch server
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
 echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
 echo ""
 echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
 echo ""
 docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
 # wait until server is available
 until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 # show server logs
 docker attach ${uuid} &
 echo ""
 # import all files
 if [ -n "$inputfiles" ]; then
    checkpoints=${#checkpointdate[@]}
    checkpointdate[$(($checkpoints + 1))]=$(date +%s)
    checkpointname[$(($checkpoints + 1))]="Import all files"
    echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
    echo ""
    echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
    echo ""
    for inputfile in "${inputfiles[@]}" ; do
        echo "import ${inputfile}..."
        # run client with input command
        docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
        # show allocated system resources
        ps -o start,etime,%mem,%cpu,rss -C java --sort=start
        echo ""
        # restart server to clear memory
        if [ "$restartfile" = "true" ]; then
            echo "save project and restart OpenRefine server..." 
            docker stop -t=5000 ${uuid}
            docker rm ${uuid}
            docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
            until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
            docker attach ${uuid} &
            echo ""
        fi
    done
 fi
 # transform and export files
 if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
    checkpoints=${#checkpointdate[@]}
    checkpointdate[$(($checkpoints + 1))]=$(date +%s)
    checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
    echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
    echo ""
    echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
    echo ""
    # get project ids
    echo "get project ids..."
    docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
    projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
    projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
    cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
    echo ""
    # provide additional OpenRefine projects for cross function
    if [ -n "$crossprojects" ]; then
        echo "provide additional projects for cross function..."
        # copy given projects to workspace
        rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
        # restart server to advertise copied projects
        echo "restart OpenRefine server to advertise copied projects..." 
        docker stop -t=5000 ${uuid}
        docker rm ${uuid}
        docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
        until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
        docker attach ${uuid} &
        echo ""
    fi
    # loop for all projects
    for ((i=0;i<${#projectids[@]};++i)); do
        # apply transformation rules
        if [ -n "$jsonfiles" ]; then
            checkpoints=${#checkpointdate[@]}
            checkpointdate[$(($checkpoints + 1))]=$(date +%s)
            checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
            echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
            echo ""
            echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
            echo ""
            for jsonfile in "${jsonfiles[@]}" ; do
                echo "transform ${jsonfile}..."
                # run client with apply command
                docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
                # allocated system resources
                ps -o start,etime,%mem,%cpu,rss -C java --sort=start
                echo ""
                # restart server to clear memory
                if [ "$restarttransform" = "true" ]; then
                  echo "save project and restart OpenRefine server..." 
                  docker stop -t=5000 ${uuid}
                  docker rm ${uuid}
                  docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
                  until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
                  docker attach ${uuid} &
                fi
                echo ""
            done
        fi
        # export project to workspace
        if [ "$export" = "true" ]; then
            checkpoints=${#checkpointdate[@]}
            checkpointdate[$(($checkpoints + 1))]=$(date +%s)
            checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
            echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
            echo ""
            echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
            echo ""
            # get filename without extension
            filename=${projectnames[i]%.*}
            echo "export to file ${filename}.tsv..."
            # run client with export command
            docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.tsv" ${projectids[i]}
            # show allocated system resources
            ps -o start,etime,%mem,%cpu,rss -C java --sort=start
            echo ""
        fi
        # restart server to clear memory
        if [ "$restartfile" = "true" ]; then    
              echo "restart OpenRefine server..." 
              docker stop -t=5000 ${uuid}
              docker rm ${uuid}
              docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
              until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
              docker attach ${uuid} &
        fi
        echo ""        
    done
    # list output files
    if [ "$export" = "true" ]; then
        echo "output (number of lines / size in bytes):"
        wc -c -l "${outputdir}"/*.tsv
        echo ""
    fi
 fi
 # cleanup
 echo "cleanup..."
 docker stop -t=5000 ${uuid}
 docker rm ${uuid}
 rm -r -f "${outputdir:?}"/workspace*.json
 # delete duplicates from copied projects
 if [ -n "$crossprojects" ]; then
    for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
 fi
 echo ""
 # calculate and print checkpoints
 echo "=== Statistics ==="
 echo ""
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="End process"
 echo "starting time and run time of each step:"
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 for i in $(seq 1 $checkpoints); do
    diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
    printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
 done
 echo ""
 diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
 echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
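Note on the run-time lines above: they render elapsed seconds through `date -d@... -u +%H:%M:%S`, which relies on GNU `date` and wraps around once a step runs longer than 24 hours. A minimal sketch of an arithmetic-only alternative follows; `format_duration` is a hypothetical helper name used for illustration, not something the script defines.

```shell
#!/bin/bash
# Hypothetical helper (not part of openrefine-batch.sh):
# format a number of seconds as hh:mm:ss using integer arithmetic only,
# so hours keep counting past 23 instead of wrapping to 00.
format_duration () {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

format_duration 3661    # one hour, one minute, one second
format_duration 90000   # 25 hours
```

In the statistics loop this could stand in for the `$(date -d@${diffsec} -u +%H:%M:%S)` substitution, under the assumption that `diffsec` is a non-negative integer.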
--- a/openrefine-batch.sh
+++ b/openrefine-batch.sh
@ -1,240 +1,382 @@
 #!/bin/bash
-# openrefine-batch.sh, Felix Lohmeier, v0.6.4, 01.03.2017
+# openrefine-batch.sh, Felix Lohmeier, v1.0.1, 14.03.2017
 # https://github.com/felixlohmeier/openrefine-batch
-# user input
+# declare download URLs for OpenRefine and OpenRefine client
-if [ -z "$1" ]
+openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/2.7-rc.2/openrefine-linux-2.7-rc.2.tar.gz"
-  then
+client_URL="https://github.com/felixlohmeier/openrefine-client/archive/v0.3.1.tar.gz"
-    echo 1>&2 "please provide path to directory with source files (leave empty to transform only)"
+
-    exit 2
+# check system requirements
-  else
+PYTHON="$(which python 2> /dev/null)"
-    inputdir=$(readlink -f $1)
+if [ -z "$PYTHON" ] ; then
-    if [ -n "${inputdir// }" ] ; then
+    echo 1>&2 "This action requires you to have 'python' installed and present in your PATH. You can download it for free at http://www.python.org/"
-      inputfiles=($(find -L ${inputdir}/* -type f -printf "%f\n" 2>/dev/null))
+    exit 1
    fi
 fi
-if [ -z "$2" ]
+PYTHON_VERSION="$($PYTHON --version 2>&1 | cut -f 2 -d ' ' | cut -f 1,2 -d .)"
-  then
+if [ "$PYTHON_VERSION" != "2.6" ] && [ "$PYTHON_VERSION" != "2.7" ]; then
-    echo 1>&2 "please provide path to directory with config files (leave empty to import only)"
+    echo 1>&2 "This action requires Python version 2.6.x or 2.7.x. You can download it for free at http://www.python.org/"
-    exit 2
+    exit 1
  else
    configdir=$(readlink -f $2)
    if [ -n "${configdir// }" ] ; then
      jsonfiles=($(find -L ${configdir}/* -type f -printf "%f\n" 2>/dev/null))
    fi
 fi
-if [ -z "$3" ]
+JAVA="$(which java 2> /dev/null)"
-  then
+if [ -z "$JAVA" ] ; then
-    echo 1>&2 "please provide path to output directory"
+    echo 1>&2 "This action requires you to have 'Java JRE' installed. You can download it for free at https://java.com"
-    exit 2
+    exit 1
  else
    outputdir=$(readlink -m $3)
    mkdir -p ${outputdir}
 fi
 if [ -z "$4" ]
  then
    echo 1>&2 "please provide path to directory with additional OpenRefine projects for use with cross function (may be empty)"
    exit 2
  else
    crossdir=$(readlink -f $4)
    if [ -n "${crossdir// }" ] ; then
      crossprojects=($(find -L ${crossdir}/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null))
    fi
 fi
 if [ -z "$5" ]
  then
    ram="4G"
  else
    ram="$5"
 fi
 if [ -z "$6" ]
  then
    version="2.7rc1"
  else
    version="$6"
 fi
 if [ -z "$7" ]
  then
    restartfile="restartfile-true"
  else
    restartfile="$7"
 fi
 if [ -z "$8" ]
  then
    restarttransform="restarttransform-false"
  else
    restarttransform="$8"
 fi
 if [ -z "$9" ]
  then
    export="export-true"
  else
    export="$9"
 fi
 if [ -z "${10}" ]
  then
    inputformat=""
  else
    inputformat="--format=${10}"
 fi
 if [ -z "${11}" ]
  then
    inputoptions=""
  else
    inputoptions=( "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" "${21}" "${22}" "${23}" "${24}" "${25}" )
 fi
-# variables
+# autoinstall OpenRefine
-uuid=$(cat /proc/sys/kernel/random/uuid)
+if [ ! -d "openrefine" ]; then
    echo "Download OpenRefine..."
    mkdir -p openrefine
    wget -q --show-progress $openrefine_URL
    echo "Install OpenRefine in subdirectory openrefine..."
    tar -xzf "$(basename $openrefine_URL)" -C openrefine --strip 1 --totals
    rm -f "$(basename $openrefine_URL)"
    sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' openrefine/refine.ini
    echo ""
 fi
 # autoinstall OpenRefine client
 if [ ! -d "openrefine-client" ]; then
    echo "Download OpenRefine client..."
    mkdir -p openrefine-client
    wget -q --show-progress $client_URL
    echo "Install OpenRefine client in subdirectory openrefine-client..."
    tar -xzf "$(basename $client_URL)" -C openrefine-client --strip 1 --totals
    rm -f "$(basename $client_URL)"
    echo ""
 fi
 # help screen
 function usage () {
    cat <<EOF
 Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 == basic arguments ==
    -a INPUTDIR      path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
    -b TRANSFORMDIR  path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to import only)
    -c OUTPUTDIR     path to directory for exported files (and OpenRefine workspace)
 == options ==
    -d CROSSDIR      path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
    -f INPUTFORMAT   (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
    -i INPUTOPTIONS  several options provided by openrefine-client, see below...
    -m RAM           maximum RAM for OpenRefine java heap space (default: 2048M)
    -p PORT          PORT on which OpenRefine should listen (default: 3333)
    -E               do NOT export files
    -R               do NOT restart OpenRefine after each transformation (e.g. config file)
    -X               do NOT restart OpenRefine after each project (e.g. input file)
    -h               displays this help screen
 == inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
    -i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record
    -i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
    -i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
 == more inputoptions (optional, only together with inputformat) ==
    -i projectName=PROJECTNAME (all formats)
    -i limit=LIMIT (all formats), default: -1
    -i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
    -i trimStrings=TRIMSTRINGS (xml, json), default: false
    -i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
    -i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
    -i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
    -i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
    -i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
    -i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
    -i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i processQuotes=PROCESSQUOTES (csv, tsv), default: true
    -i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
    -i linesPerRow=LINESPERROW (line-based), default: 1
 == example ==
 ./openrefine-batch.sh \
 -a examples/powerhouse-museum/input/ \
 -b examples/powerhouse-museum/config/ \
 -c examples/powerhouse-museum/output/ \
 -f tsv \
 -i processQuotes=false \
 -i guessCellValueTypes=true
 clone or download GitHub repository to get example data:
 https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
 EOF
   exit 1
 }
 # defaults
 ram="2048M"
 port="3333"
 restartfile="true"
 restarttransform="true"
 export="true"
 inputdir=/dev/null
 configdir=/dev/null
 crossdir=/dev/null
 # check input
 NUMARGS=$#
 if [ "$NUMARGS" -eq 0 ]; then
  usage
 fi
 # get user input
 options="a:b:c:d:f:i:m:p:ERXh"
 while getopts $options opt; do
   case $opt in
   a )  inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
   b )  configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
   c )  outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
   d )  crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
   f )  format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
   i )  inputoptions+=("--${OPTARG}") ;;
   m )  ram=${OPTARG} ;;
   p )  port=${OPTARG} ;;
   E )  export="false" ;;
   R )  restarttransform="false" ;;
   X )  restartfile="false" ;;
   h )  usage ;;
   \? ) echo 1>&2 "Unknown option: -$OPTARG"; usage; exit 1;;
   :  ) echo 1>&2 "Missing option argument for -$OPTARG"; usage; exit 1;;
   *  ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
   esac
 done
 shift $(($OPTIND - 1))
 # check for mandatory options
 if [ -z "$outputdir" ]; then
    echo 1>&2 "please provide path to directory for exported files (and OpenRefine workspace)"
    echo 1>&2 "example: ./openrefine-batch.sh -c output/"
    exit 1
 fi
 if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide recordPath in multiple arguments without slashes"
    echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i recordPath=collection -i recordPath=record"
    exit 1
 fi
 if [ "$format" = "fixed-width" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide column widths separated by comma (e.g. 7,5)"
    echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i columnWidths=7,5"
    exit 1
 fi
 if [ "$format" = "xlsx" ] || [ "$format" = "ods" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)"
    echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i sheets=0"
    exit 1
 fi
 # print variables
 echo "Input directory:         $inputdir"
-echo "Input files:             ${inputfiles[@]}"
+echo "Input files:             ${inputfiles[*]}"
 echo "Input format:            $inputformat"
-echo "Input options:           ${inputoptions[@]}"
+echo "Input options:           ${inputoptions[*]}"
 echo "Config directory:        $configdir"
-echo "Transformation rules:    ${jsonfiles[@]}"
+echo "Transformation rules:    ${jsonfiles[*]}"
 echo "Cross directory:         $crossdir"
-echo "Cross projects:          ${crossprojects[@]}"
+echo "Cross projects:          ${crossprojects[*]}"
 echo "OpenRefine heap space:   $ram"
-echo "OpenRefine version:      $version"
+echo "OpenRefine port:         $port"
 echo "OpenRefine workspace:    $outputdir"
 echo "Export TSV to workspace: $export"
 echo "Docker container name:   $uuid"
 echo "restart after file:      $restartfile"
 echo "restart after transform: $restarttransform"
 echo ""
-# time
+# declare additional variables
-echo "begin: $(date)"
+checkpoints=${#checkpointdate[@]}
-echo ""
+checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="Start process"
 memoryload=()
 # launch server
-echo "start OpenRefine server..."
+checkpoints=${#checkpointdate[@]}
-docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
 echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
 echo ""
 echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
 echo ""
 openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
 pid=$!
 # wait until server is available
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 # show server logs
 docker attach ${uuid} &
 echo ""
 # import all files
 if [ -n "$inputfiles" ]; then
-  echo "=== IMPORT ==="
+    checkpoints=${#checkpointdate[@]}
-  echo ""
+    checkpointdate[$(($checkpoints + 1))]=$(date +%s)
    checkpointname[$(($checkpoints + 1))]="Import all files"
    echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
    echo ""
    echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
    echo ""
    for inputfile in "${inputfiles[@]}" ; do
        echo "import ${inputfile}..."
        # run client with input command
-        docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
+        openrefine-client/refine.py -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
-        # show statistics
+        # show allocated system resources
-        ps -o start,etime,%mem,%cpu,rss -C java --sort=start
+        ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
        memoryload+=($(ps --no-headers -o rss -p ${pid}))
        echo ""
        # restart server to clear memory
-        if [ "$restartfile" = "restartfile-true" ]; then
+        if [ "$restartfile" = "true" ]; then
            echo "save project and restart OpenRefine server..." 
-            docker stop -t=5000 ${uuid}
+            kill ${pid}
-            docker rm ${uuid}
+            wait
-            docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+            echo ""
-            until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+            openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
-            docker attach ${uuid} &
+            pid=$!
            until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
            echo ""
        fi
    done
 fi
 # transform and export files
-if [ -n "$jsonfiles" ] || [ "$export" = "export-true" ]; then
+if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
-  echo "=== TRANSFORM / EXPORT ==="
+    checkpoints=${#checkpointdate[@]}
-  echo ""
+    checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-  
+    checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
-  # get project ids
+    echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
-  echo "get project ids..."
+    echo ""
-  projects=($(docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | tee ${outputdir}/projects.tmp | cut -c 2-14))
+    echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
-  cat ${outputdir}/projects.tmp && rm ${outputdir}/projects.tmp
+    echo ""
-  echo ""
+    
-  
+    # get project ids
-  # provide additional OpenRefine projects for cross function
+    echo "get project ids..."
-  if [ -n "$crossprojects" ]; then
+    openrefine-client/refine.py -P ${port} -l > "${outputdir}/projects.tmp"
-      echo "provide additional projects for cross function..."
+    projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
-      # copy given projects to workspace
+    projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
-      rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir
+    cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
-      # restart server to advertise copied projects
+    echo ""
-      echo "restart OpenRefine server to advertise copied projects..." 
+    
-      docker stop -t=5000 ${uuid}
+    # provide additional OpenRefine projects for cross function
-      docker rm ${uuid}
+    if [ -n "$crossprojects" ]; then
-      docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+        echo "provide additional projects for cross function..."
-      until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+        # copy given projects to workspace
-      docker attach ${uuid} &
+        rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
-      echo ""
+        # restart server to advertise copied projects
-  fi
+        echo "restart OpenRefine server to advertise copied projects..." 
-  
+        kill ${pid}
-  # loop for all projects
+        wait
-  for projectid in "${projects[@]}" ; do
+        echo ""
-      # time
+        openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
-      echo "--- begin project $projectid @ $(date) ---"
+        pid=$!
-      echo ""
+        until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
-  
+        echo ""
-      # apply transformation rules
+    fi
-      if [ -n "$jsonfiles" ]; then
+    
-          for jsonfile in "${jsonfiles[@]}" ; do
+    # loop for all projects
-              echo "transform ${jsonfile}..."
+    for ((i=0;i<${#projectids[@]};++i)); do
-              # run client with apply command
+        
-              docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
+        # apply transformation rules
-              # show statistics
+        if [ -n "$jsonfiles" ]; then
-              ps -o start,etime,%mem,%cpu,rss -C java --sort=start
+            checkpoints=${#checkpointdate[@]}
-              # restart server to clear memory
+            checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-              if [ "$restarttransform" = "restarttransform-true" ]; then
+            checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
-                  echo "save project and restart OpenRefine server..." 
+            echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
-                  docker stop -t=5000 ${uuid}
+            echo ""
-                  docker rm ${uuid}
+            echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
-                  docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+            echo ""
-                  until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+            for jsonfile in "${jsonfiles[@]}" ; do
-                  docker attach ${uuid} &
+                echo "transform ${jsonfile}..."
-              fi
+                # run client with apply command
-              echo ""
+                openrefine-client/refine.py -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
-          done
+                # allocated system resources
-      fi
+                ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
-  
+                memoryload+=($(ps --no-headers -o rss -p ${pid}))
-      # export project to workspace
+                echo ""
-      if [ "$export" = "export-true" ]; then
+                # restart server to clear memory
-          echo "export to file ${projectid}.tsv..."
+                if [ "$restarttransform" = "true" ]; then
-          # run client with export command
+                    echo "save project and restart OpenRefine server..." 
-          docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
+                    kill ${pid}
-          # show statistics
+                    wait
-          ps -o start,etime,%mem,%cpu,rss -C java --sort=start
+                    echo ""
-          # restart server to clear memory
+                    openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
-          if [ "$restartfile" = "restartfile-true" ]; then    
+                    pid=$!
-              echo "restart OpenRefine server..." 
+                    until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
-              docker stop -t=5000 ${uuid}
+                fi
-              docker rm ${uuid}
+                echo ""
-              docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+            done
-              until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+        fi
-              docker attach ${uuid} &
+        
-          fi
+        # export project to workspace
-          echo""
+        if [ "$export" = "true" ]; then
-      fi
+            checkpoints=${#checkpointdate[@]}
-  
+            checkpointdate[$(($checkpoints + 1))]=$(date +%s)
-      # time
+            checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
-      echo "--- finished project $projectid @ $(date) ---"
+            echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
-      echo ""
+            echo ""
-  done
+            echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
-  
+            echo ""
-  # list output files
+            # get filename without extension
-  if [ "$export" = "export-true" ]; then
+            filename=${projectnames[i]%.*}
-      echo "output (number of lines / size in bytes):"
+            echo "export to file ${filename}.tsv..."
-      wc -c -l ${outputdir}/*.tsv
+            # run client with export command
-      echo ""
+            openrefine-client/refine.py -P ${port} -E --output="${outputdir}/${filename}.tsv" ${projectids[i]}
-  fi
+            # show allocated system resources
            ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
            memoryload+=($(ps --no-headers -o rss -p ${pid}))
            echo ""
        fi
        # restart server to clear memory
        if [ "$restartfile" = "true" ]; then    
            echo "restart OpenRefine server..." 
            kill ${pid}
            wait
            echo ""
            openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
            pid=$!
            until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
        fi
        echo ""        
    done
    # list output files
    if [ "$export" = "true" ]; then
        echo "output (number of lines / size in bytes):"
        wc -c -l "${outputdir}"/*.tsv
        echo ""
    fi
 fi
 # cleanup
 echo "cleanup..."
-docker stop -t=5000 ${uuid}
+kill ${pid}
-docker rm ${uuid}
+wait
-rm -r -f ${outputdir}/workspace*.json
+rm -r -f "${outputdir:?}"/workspace*.json
 # delete duplicates from copied projects
 if [ -n "$crossprojects" ]; then
-  for i in "${crossprojects[@]}" ; do rm -r -f ${outputdir}/${i} ; done
+    for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
 fi
 echo ""
-# time
+# calculate and print checkpoints
-echo "finish: $(date)"
+echo "=== Statistics ==="
 echo ""
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 checkpointname[$(($checkpoints + 1))]="End process"
 echo "starting time and run time of each step:"
 checkpoints=${#checkpointdate[@]}
 checkpointdate[$(($checkpoints + 1))]=$(date +%s)
 for i in $(seq 1 $checkpoints); do
    diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
    printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
 done
 echo ""
 diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
 echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
 # calculate and print memory load
 max=${memoryload[0]:-0}
 for n in "${memoryload[@]}" ; do
    ((n > max)) && max=$n
 done
 echo "highest memory load: $(($max / 1024)) MB"
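Note on the server start-up checks: the `until wget ... ; do sleep 1; done` loops in this script keep polling indefinitely if OpenRefine never becomes reachable (for example when the port is already in use). A minimal sketch of a bounded retry follows; `wait_until` is a hypothetical helper that is not part of the script, and the attempt count and delay are illustrative values.

```shell
#!/bin/bash
# Hypothetical helper (not part of openrefine-batch.sh):
# run a command repeatedly until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until () {
    local attempts=$1 delay=$2
    shift 2
    local n
    for ((n = 0; n < attempts; n++)); do
        # stop polling as soon as the command reports success
        "$@" && return 0
        sleep "$delay"
    done
    # every attempt failed
    return 1
}

# Possible use with the readiness probe from the script (assumes port 3333):
# wait_until 60 1 sh -c 'wget -q -O - http://localhost:3333 | grep -q "OpenRefine"' \
#     || { echo 1>&2 "OpenRefine did not start in time"; exit 1; }
```

The helper returns a non-zero status on timeout, so the caller decides whether to abort or retry with a longer limit.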