release v1.0

Felix Lohmeier 2017-03-14 23:17:33 +01:00
parent f466df1e46
commit acf9f046b7
4 changed files with 851 additions and 302 deletions

269
README.md

@@ -6,7 +6,9 @@ Shell script to run OpenRefine in batch mode (import, transform, export). This b
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and 2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
3. finally exports the data in TSV (tab-separated values) format. 3. finally exports the data in TSV (tab-separated values) format.
It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/felixlohmeier/openrefine/) (server) and a [docker container for a python client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum. It orchestrates [OpenRefine](https://github.com/OpenRefine/OpenRefine) (server) and a [python client](https://github.com/felixlohmeier/openrefine-client) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum.
If you prefer a containerized approach, see a [variation of this script for Docker](#docker) below.
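The restart-and-wait cycle described above can be sketched as a small polling helper. This is only an illustration under assumptions: the name `wait_until` is hypothetical and does not appear in the script, which (in its Docker variant) polls the server with `curl` until the response contains "OpenRefine".

```shell
# Hypothetical helper (not part of openrefine-batch.sh):
# run a command once per second until it succeeds or the attempts run out.
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}

# Succeeds immediately because the root directory always exists.
wait_until 3 test -d / && echo "ready"

# Assumed usage against a local OpenRefine server on the default port 3333:
# wait_until 30 sh -c 'curl --silent http://127.0.0.1:3333 | grep -q OpenRefine'
```

Bounding the number of attempts is a design choice of this sketch; the loop in the script itself waits indefinitely.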
### Typical Workflow ### Typical Workflow
@@ -15,22 +17,23 @@ It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/fel
### Install ### Install
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker` Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
That's all. The script will automatically download copies of OpenRefine and the python client on first run and will tell you if something (python, java) is missing.
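A dependency check of this kind might look roughly like the sketch below. It is modelled on the `docker` check in `openrefine-batch-docker.sh`; the function name `require` is hypothetical, and the actual check in `openrefine-batch.sh` may differ.

```shell
# Hypothetical helper (not taken from the script): report whether a command is on the PATH.
require() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Check the two runtimes the README mentions and count failures
# instead of stopping at the first missing tool.
missing=0
for tool in java python; do
    require "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```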
### Usage ### Usage
``` ```
mkdir -p input && cp INPUTFILES input/ mkdir input
mkdir -p config && cp CONFIGFILES config/ cp INPUTFILES input/
sudo ./openrefine-batch.sh input/ config/ OUTPUT/ mkdir config
cp CONFIGFILES config/
./openrefine-batch.sh input/ config/ OUTPUT/
``` ```
Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
**INPUTFILES** **INPUTFILES**
* any data that [OpenRefine supports](https://github.com/OpenRefine/OpenRefine/wiki/Importers). CSV, TSV and line-based files should work out of the box. XML, JSON, fixed-width, XLSX and ODS need one additional input parameter (see chapter [Options](https://github.com/felixlohmeier/openrefine-batch#options) below)
* multiple slices of data may be transformed into a single file [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers)
* you may use hard links instead of cp: `ln INPUTFILE input/`
**CONFIGFILES** **CONFIGFILES**
@@ -41,140 +44,204 @@ Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon
* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv` * Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
* OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project` * OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
#### Example ### Example
[Example Powerhouse Museum](examples/powerhouse-museum)
```
./openrefine-batch.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
-f tsv \
-i processQuotes=false \
-i guessCellValueTypes=true \
-RX
```
clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
``` ### Help Screen
sudo ./openrefine-batch.sh \
examples/powerhouse-museum/input/ \
examples/powerhouse-museum/config/ \
examples/powerhouse-museum/output/ \
examples/powerhouse-museum/cross/ \
2G 2.7rc1 restartfile-false restarttransform-false export-true \
tsv --processQuotes=false --guessCellValueTypes=true
```
#### Options
``` ```
sudo ./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restartfile $restarttransform $export $inputformat $inputoptions [18:20 felix ~/openrefine-batch]$ ./openrefine-batch.sh
Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== basic arguments ==
-a INPUTDIR path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
-b TRANSFORMDIR path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to transform only)
-c OUTPUTDIR path to directory for exported files (and OpenRefine workspace)
== options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-p PORT PORT on which OpenRefine should listen (default: 3333)
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
``` ```
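The way `-f` and `-i` are turned into arguments for the client can be illustrated with a reduced sketch. It mirrors the `f )` and `i )` branches of the `getopts` loop in `openrefine-batch-docker.sh`, but the function name `build_client_args` is hypothetical and the options are collected in a plain string rather than a bash array so that the example stays portable.

```shell
# Simplified sketch of the option translation, not the full script:
# -f FORMAT becomes --format=FORMAT, and every -i KEY=VALUE becomes --KEY=VALUE.
build_client_args() {
    OPTIND=1
    args=""
    while getopts ":f:i:" opt; do
        case $opt in
            f ) args="$args --format=$OPTARG" ;;
            i ) args="$args --$OPTARG" ;;
            * ) echo "unsupported option: -$OPTARG" >&2; return 1 ;;
        esac
    done
    # strip the leading space left by the first append
    echo "${args# }"
}

build_client_args -f xml -i recordPath=collection -i recordPath=record
# prints: --format=xml --recordPath=collection --recordPath=record
```

Repeating `-i` is what allows a record path such as /collection/record/ to be passed as two separate `recordPath` arguments.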
1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
2. configdir: path to directory with [OpenRefine transformation rules (json files)](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
3. outputdir: path to directory for exported files (and OpenRefine workspace)
4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc1)
7. restartfile: restart docker after each project (e.g. input file) to clear memory (restartfile-true/restartfile-false; default: restartfile-true)
8. restarttransform: restart docker container after each transformation (e.g. config file) to clear memory (restarttransform-true/restarttransform-false; default: restarttransform-false)
9. export: toggle on/off (export-true/export-false; default: export-true)
8. inputformat: (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
9. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
inputoptions (mandatory for xml, json, fixed-width, xslx, ods):
* `--recordPath=RECORDPATH` (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: `--recordPath=collection --recordPath=record`
* `--columnWidths=COLUMNWIDTHS` (fixed-width): please provide widths separated by comma (e.g. 7,5)
* `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
more inputoptions (optional, only together with inputformat):
* `--projectName=PROJECTNAME` (all formats)
* `--limit=LIMIT` (all formats), default: -1
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
* `--storeEmptyStrings=STOREEMPTYSTRINGS` (xml, json), default: true
* `--guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json)`, default: false
* `--encoding=ENCODING (csv, tsv, line-based, fixed-width)`, please provide short encoding name (e.g. UTF-8)
* `--ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods)`, default: -1
* `--headerLines=HEADERLINES` (csv, tsv, fixed-width, xlsx, ods), default: 1
* `--skipDataLines=SKIPDATALINES` (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
* `--storeBlankRows=STOREBLANKROWS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
* `--processQuotes=PROCESSQUOTES` (csv, tsv), default: true
* `--storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
* `--linesPerRow=LINESPERROW` (line-based), default: 1
### Logging ### Logging
The script uses `docker attach` to print log messages from OpenRefine server and `ps` to show statistics for each step. Here is a sample log: The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
``` ```
[17:54 felix ~/openrefine-batch]$ sudo ./openrefine-batch.sh \ [17:55 felix ~/openrefine-batch]$ ./openrefine-batch.sh \
> examples/powerhouse-museum/input/ \ > -a examples/powerhouse-museum/input/ \
> examples/powerhouse-museum/config/ \ > -b examples/powerhouse-museum/config/ \
> examples/powerhouse-museum/output/ \ > -c examples/powerhouse-museum/output/ \
> examples/powerhouse-museum/cross/ \ > -f tsv \
> 2G 2.7rc1 restartfile-false restarttransform-false export-true \ > -i processQuotes=false \
> tsv --processQuotes=false --guessCellValueTypes=true > -i guessCellValueTypes=true \
Input directory: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/input > -RX
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
Input files: phm-collection.tsv Input files: phm-collection.tsv
Input format: --format=tsv Input format: --format=tsv
Input options: --processQuotes=false --guessCellValueTypes=true Input options: --processQuotes=false --guessCellValueTypes=true
Config directory: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/config Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
Transformation rules: phm-transform.json Transformation rules: phm-transform.json
Cross directory: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/cross Cross directory: /dev/null
Cross projects: Cross projects:
OpenRefine heap space: 2G OpenRefine heap space: 2048M
OpenRefine version: 2.7rc1 OpenRefine port: 3333
OpenRefine workspace: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output OpenRefine workspace: /home/felix/openrefine-batch/examples/powerhouse-museum/output
Export TSV to workspace: export-true Export TSV to workspace: true
Docker container name: 6b622f38-bbdd-4a28-b590-0c7fdf9d577b restart after file: false
restart after file: restartfile-false restart after transform: false
restart after transform: restarttransform-false
begin: Mi 1. Mär 17:54:45 CET 2017 === 1. Launch OpenRefine ===
start OpenRefine server... starting time: Di 14. Mär 17:58:08 CET 2017
2d836891cbc79f730f18262c9f98b6406b5323ca9fd84636afb194a664abf66e
=== IMPORT === Starting OpenRefine at 'http://127.0.0.1:3333/'
17:58:08.758 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
17:58:08.760 [ refine_server] refine.memory size: 2048M JVM Max heap: 1908932608 (2ms)
17:58:08.787 [ refine_server] Initializing context: '/' from '/home/felix/openrefine-batch/openrefine/webapp' (27ms)
17:58:09.463 [ refine] Starting OpenRefine 2.7-rc.1 [TRUNK]... (676ms)
17:58:09.476 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (13ms)
17:58:12.003 [ refine] Running in headless mode (2527ms)
=== 2. Import all files ===
starting time: Di 14. Mär 17:58:12 CET 2017
import phm-collection.tsv... import phm-collection.tsv...
16:54:59.290 [ refine] POST /command/core/create-project-from-upload (4748ms) 17:58:13.068 [ refine] POST /command/core/create-project-from-upload (1065ms)
New project: 1831307645035 New project: 2073385535316
16:55:15.514 [ refine] GET /command/core/get-rows (16224ms) 17:58:26.543 [ refine] GET /command/core/get-rows (13475ms)
Number of rows: 75814 Number of rows: 75814
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
17:54:46 00:31 9.7 109 788156 17:58:07 00:18 9.8 168 795024
=== TRANSFORM / EXPORT === === 3. Prepare transform & export ===
starting time: Di 14. Mär 17:58:26 CET 2017
get project ids... get project ids...
16:55:21.258 [ refine] GET /command/core/get-all-project-metadata (5744ms) 17:58:26.778 [ refine] GET /command/core/get-all-project-metadata (235ms)
1831307645035: phm-collection.tsv 2073385535316: phm-collection.tsv
--- begin project 1831307645035 @ Mi 1. Mär 17:55:22 CET 2017 --- === 4. Transform phm-collection.tsv ===
starting time: Di 14. Mär 17:58:26 CET 2017
transform phm-transform.json... transform phm-transform.json...
16:55:23.983 [ refine] GET /command/core/get-models (2725ms) 17:58:26.917 [ refine] GET /command/core/get-models (139ms)
16:55:24.002 [ refine] POST /command/core/apply-operations (19ms) 17:58:26.934 [ refine] POST /command/core/apply-operations (17ms)
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
17:54:46 01:26 13.3 118 1076800 17:58:07 01:02 13.5 134 1096916
export to file 1831307645035.tsv...
16:56:14.909 [ refine] GET /command/core/get-models (50907ms) === 5. Export phm-collection.tsv ===
16:56:14.933 [ refine] GET /command/core/get-all-project-metadata (24ms)
16:56:14.949 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (16ms) starting time: Di 14. Mär 17:59:09 CET 2017
export to file phm-collection.tsv...
17:59:09.944 [ refine] GET /command/core/get-models (43010ms)
17:59:09.956 [ refine] GET /command/core/get-all-project-metadata (12ms)
17:59:09.967 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (11ms)
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
17:54:46 03:10 13.9 59.2 1130304 17:58:07 02:24 13.5 60.5 1098056
--- finished project 1831307645035 @ Mi 1. Mär 17:57:57 CET 2017 ---
output (number of lines / size in bytes): output (number of lines / size in bytes):
167017 60527726 /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output/1831307645035.tsv 167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
cleanup... cleanup...
16:58:00.158 [ ProjectManager] Saving all modified projects ... (105209ms) 18:00:35.425 [ ProjectManager] Saving all modified projects ... (85458ms)
16:58:07.242 [ project_utilities] Saved project '1831307645035' (7084ms) 18:00:42.357 [ project_utilities] Saved project '2073385535316' (6932ms)
6b622f38-bbdd-4a28-b590-0c7fdf9d577b
6b622f38-bbdd-4a28-b590-0c7fdf9d577b
finish: Mi 1. Mär 17:58:09 CET 2017 === Statistics ===
starting time and run time of each step:
Start process Di 14. Mär 17:58:08 CET 2017 (00:00:00)
Launch OpenRefine Di 14. Mär 17:58:08 CET 2017 (00:00:04)
Import all files Di 14. Mär 17:58:12 CET 2017 (00:00:14)
Prepare transform & export Di 14. Mär 17:58:26 CET 2017 (00:00:00)
Transform phm-collection.tsv Di 14. Mär 17:58:26 CET 2017 (00:00:43)
Export phm-collection.tsv Di 14. Mär 17:59:09 CET 2017 (00:01:34)
End process Di 14. Mär 18:00:43 CET 2017 (00:00:00)
total run time: 00:02:35 (hh:mm:ss)
highest memory load: 1072 MB
``` ```
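The run times in the statistics block are differences between checkpoints that the script records with `date +%s`. The code that prints the statistics is not shown here, so the following is only a sketch of how such a difference could be formatted as hh:mm:ss; the function name `format_duration` and the two timestamps are illustrative assumptions.

```shell
# Format a duration given in whole seconds as hh:mm:ss.
format_duration() {
    printf '%02d:%02d:%02d\n' $(($1 / 3600)) $(($1 % 3600 / 60)) $(($1 % 60))
}

# Two epoch timestamps, 155 seconds apart (illustrative values).
start=1489510688
end=1489510843
format_duration $((end - start))
# prints: 00:02:35
```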
### Docker
A variation of the shell script orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/felixlohmeier/openrefine/) (server) and a [docker container for the python client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) instead of native applications.
**Install**
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch-docker.sh && chmod +x openrefine-batch-docker.sh`
**Usage**
```
mkdir input
cp INPUTFILES input/
mkdir config
cp CONFIGFILES config/
sudo ./openrefine-batch-docker.sh input/ config/ OUTPUT/
```
Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
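Whether `sudo` is needed can be estimated before calling the script. The sketch below is a heuristic only: it assumes the Docker group is literally named `docker` (as in the post-installation guide) and ignores other setups. The function name `needs_sudo` is hypothetical.

```shell
# Hypothetical helper: print "no" if the user is root or belongs to a group
# named "docker", otherwise print "yes".
# Usage: needs_sudo UID GROUP...
needs_sudo() {
    uid=$1
    shift
    if [ "$uid" -eq 0 ]; then
        echo no
        return
    fi
    for group in "$@"; do
        if [ "$group" = "docker" ]; then
            echo no
            return
        fi
    done
    echo yes
}

# Evaluate for the current user; group names are split on whitespace.
needs_sudo "$(id -u)" $(id -nG)
```

A definitive test is to run `docker info` as the current user, which is what the script itself does on startup.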
### Todo ### Todo
- [ ] use getopts for parsing of arguments
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor - [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
- [ ] add option to delete openrefine projects in output directory - [ ] add option to delete openrefine projects in output directory
- [ ] provide more example data from other OpenRefine tutorials - [ ] provide more example data from other OpenRefine tutorials


@@ -7,13 +7,14 @@ Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Dat
## Usage ## Usage
``` ```
sudo ./openrefine-batch.sh \ ./openrefine-batch.sh \
examples/powerhouse-museum/input/ \ -a examples/powerhouse-museum/input/ \
examples/powerhouse-museum/config/ \ -b examples/powerhouse-museum/config/ \
examples/powerhouse-museum/output/ \ -c examples/powerhouse-museum/output/ \
examples/powerhouse-museum/cross/ \ -f tsv \
2G 2.7rc1 restartfile-false restarttransform-false export-true \ -i processQuotes=false \
tsv --processQuotes=false --guessCellValueTypes=true -i guessCellValueTypes=true \
-RX
``` ```
## input/phm-collection.tsv ## input/phm-collection.tsv

339
openrefine-batch-docker.sh Executable file

@@ -0,0 +1,339 @@
#!/bin/bash
# openrefine-batch-docker.sh, Felix Lohmeier, v1.0.1, 14.03.2017
# https://github.com/felixlohmeier/openrefine-batch
# check system requirements
DOCKER="$(which docker 2> /dev/null)"
if [ -z "$DOCKER" ] ; then
    echo 1>&2 "This action requires you to have 'docker' installed and present in your PATH. You can download it for free at http://www.docker.com/"
    exit 1
fi
DOCKERINFO="$(docker info 2>/dev/null | grep 'Server Version')"
if [ -z "$DOCKERINFO" ] ; then
    echo 1>&2 "This action requires you to start the docker daemon. Try 'sudo systemctl start docker' or 'sudo start docker'. If the docker daemon is already running then maybe some security privileges are missing to run docker commands. Try to run the script with 'sudo ./openrefine-batch-docker.sh ...'"
    exit 1
fi
# help screen
function usage () {
cat <<EOF
Usage: ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== basic arguments ==
-a INPUTDIR path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
-b TRANSFORMDIR path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to transform only)
-c OUTPUTDIR path to directory for exported files (and OpenRefine workspace)
== options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-v VERSION OpenRefine version (2.7rc2, 2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc2)
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
./openrefine-batch-docker.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
-f tsv \
-i processQuotes=false \
-i guessCellValueTypes=true
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
EOF
exit 1
}
# defaults
ram="2048M"
version="2.7rc2"
restartfile="true"
restarttransform="true"
export="true"
inputdir=/dev/null
configdir=/dev/null
crossdir=/dev/null
# check input
NUMARGS=$#
if [ "$NUMARGS" -eq 0 ]; then
    usage
fi
# get user input
# the leading ":" switches getopts to silent mode, so the \? and : branches below receive $OPTARG
# "v:" matches the -v VERSION option from the help screen (this variant has no -p PORT option)
options=":a:b:c:d:f:i:m:v:ERXh"
while getopts "$options" opt; do
    case $opt in
        a ) inputdir=$(readlink -f "${OPTARG}"); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
        b ) configdir=$(readlink -f "${OPTARG}"); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
        c ) outputdir=$(readlink -m "${OPTARG}"); mkdir -p "${outputdir}" ;;
        d ) crossdir=$(readlink -f "${OPTARG}"); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
        f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
        i ) inputoptions+=("--${OPTARG}") ;;
        m ) ram=${OPTARG} ;;
        v ) version=${OPTARG} ;;
        E ) export="false" ;;
        R ) restarttransform="false" ;;
        X ) restartfile="false" ;;
        h ) usage ;;
        \? ) echo 1>&2 "Unknown option: -$OPTARG"; usage; exit 1;;
        : ) echo 1>&2 "Missing option argument for -$OPTARG"; usage; exit 1;;
        * ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
    esac
done
shift $(($OPTIND - 1))
# check for mandatory options
if [ -z "$outputdir" ]; then
    echo 1>&2 "please provide path to directory for exported files (and OpenRefine workspace)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh -c output/"
    exit 1
fi
if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide recordpath in multiple arguments without slashes"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i recordPath=collection -i recordPath=record"
    exit 1
fi
if [ "$format" = "fixed-width" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide column widths separated by comma (e.g. 7,5)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i columnWidths=7,5"
    exit 1
fi
if [ "$format" = "xlsx" ] || [ "$format" = "ods" ] && [ -z "$inputoptions" ]; then
    echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
    echo 1>&2 "please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)"
    echo 1>&2 "example: ./openrefine-batch-docker.sh ... -f $format -i sheets=0"
    exit 1
fi
# print variables
uuid=$(cat /proc/sys/kernel/random/uuid)
echo "Input directory: $inputdir"
echo "Input files: ${inputfiles[*]}"
echo "Input format: $inputformat"
echo "Input options: ${inputoptions[*]}"
echo "Config directory: $configdir"
echo "Transformation rules: ${jsonfiles[*]}"
echo "Cross directory: $crossdir"
echo "Cross projects: ${crossprojects[*]}"
echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version"
echo "OpenRefine workspace: $outputdir"
echo "Export TSV to workspace: $export"
echo "Docker container name: $uuid"
echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform"
echo ""
# declare additional variables
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Start process"
# launch server
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
# wait until server is available
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# show server logs
docker attach ${uuid} &
echo ""
# import all files
if [ -n "$inputfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# run client with input command
docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
echo ""
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
echo ""
fi
done
fi
# transform and export files
if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
# get project ids
echo "get project ids..."
docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
echo ""
# provide additional OpenRefine projects for cross function
if [ -n "$crossprojects" ]; then
echo "provide additional projects for cross function..."
# copy given projects to workspace
rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
# restart server to advertise copied projects
echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
echo ""
fi
# loop for all projects
for ((i=0;i<${#projectids[@]};++i)); do
# apply transformation rules
if [ -n "$jsonfiles" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# run client with apply command
docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
# allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
echo ""
# restart server to clear memory
if [ "$restarttransform" = "true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
fi
echo ""
done
fi
# export project to workspace
if [ "$export" = "true" ]; then
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
# get filename without extension
filename=${projectnames[i]%.*}
echo "export to file ${filename}.tsv..."
# run client with export command
docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.tsv" ${projectids[i]}
# show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
echo ""
fi
# restart server to clear memory
if [ "$restartfile" = "true" ]; then
echo "restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
fi
echo ""
done
# list output files
if [ "$export" = "true" ]; then
echo "output (number of lines / size in bytes):"
wc -c -l "${outputdir}"/*.tsv
echo ""
fi
fi
# cleanup
echo "cleanup..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects
if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi
echo ""
# calculate and print checkpoints
echo "=== Statistics ==="
echo ""
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="End process"
echo "starting time and run time of each step:"
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
for i in $(seq 1 $checkpoints); do
diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
done
echo ""
diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
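
Review note: the three-line checkpoint pattern above is repeated before every step, and the run time is formatted with `date -d@... -u +%H:%M:%S`, which wraps around after 24 hours. The following is a minimal sketch of how the same bookkeeping could be factored out, not the author's implementation. The function names `checkpoint` and `format_duration` are hypothetical, and the arrays are 0-indexed here, unlike in the script.

```shell
#!/bin/bash
# Hypothetical helpers; these names are not part of openrefine-batch.sh.
checkpointdate=()
checkpointname=()

# Record a named checkpoint together with the current epoch time.
checkpoint () {
  checkpointdate+=("$(date +%s)")
  checkpointname+=("$1")
}

# Format a number of seconds as hh:mm:ss (hours may exceed 24).
format_duration () {
  local s=$1
  printf '%02d:%02d:%02d' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

checkpoint "Start process"
checkpoint "Import all files"
checkpoint "End process"

# Print the run time of each step; the last entry only marks the end.
for ((i = 0; i < ${#checkpointdate[@]} - 1; i++)); do
  diffsec=$(( ${checkpointdate[i+1]} - ${checkpointdate[i]} ))
  printf '%35s (%s)\n' "${checkpointname[i]}" "$(format_duration "$diffsec")"
done
```

With a helper like this, each step would need a single `checkpoint "..."` call instead of three assignments.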


@ -1,240 +1,382 @@
#!/bin/bash #!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v0.6.4, 01.03.2017 # openrefine-batch.sh, Felix Lohmeier, v1.0.1, 14.03.2017
# https://github.com/felixlohmeier/openrefine-batch # https://github.com/felixlohmeier/openrefine-batch
# user input # declare download URLs for OpenRefine and OpenRefine client
if [ -z "$1" ] openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/2.7-rc.2/openrefine-linux-2.7-rc.2.tar.gz"
then client_URL="https://github.com/felixlohmeier/openrefine-client/archive/v0.3.1.tar.gz"
echo 1>&2 "please provide path to directory with source files (leave empty to transform only)"
exit 2 # check system requirements
else PYTHON="$(which python 2> /dev/null)"
inputdir=$(readlink -f $1) if [ -z "$PYTHON" ] ; then
if [ -n "${inputdir// }" ] ; then echo 1>&2 "This action requires you to have 'python' installed and present in your PATH. You can download it for free at http://www.python.org/"
inputfiles=($(find -L ${inputdir}/* -type f -printf "%f\n" 2>/dev/null)) exit 1
fi
fi fi
if [ -z "$2" ] PYTHON_VERSION="$($PYTHON --version 2>&1 | cut -f 2 -d ' ' | cut -f 1,2 -d .)"
then if [ "$PYTHON_VERSION" != "2.6" ] && [ "$PYTHON_VERSION" != "2.7" ]; then
echo 1>&2 "please provide path to directory with config files (leave empty to import only)" echo 1>&2 "This action requires Python version 2.6.x. or 2.7.x. You can download it for free at http://www.python.org/"
exit 2 exit 1
else
configdir=$(readlink -f $2)
if [ -n "${configdir// }" ] ; then
jsonfiles=($(find -L ${configdir}/* -type f -printf "%f\n" 2>/dev/null))
fi
fi fi
if [ -z "$3" ] JAVA="$(which java 2> /dev/null)"
then if [ -z "$JAVA" ] ; then
echo 1>&2 "please provide path to output directory" echo 1>&2 "This action requires you to have 'Java JRE' installed. You can download it for free at https://java.com"
exit 2 exit 1
else
outputdir=$(readlink -m $3)
mkdir -p ${outputdir}
fi
if [ -z "$4" ]
then
echo 1>&2 "please provide path to directory with additional OpenRefine projects for use with cross function (may be empty)"
exit 2
else
crossdir=$(readlink -f $4)
if [ -n "${crossdir// }" ] ; then
crossprojects=($(find -L ${crossdir}/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null))
fi
fi
if [ -z "$5" ]
then
ram="4G"
else
ram="$5"
fi
if [ -z "$6" ]
then
version="2.7rc1"
else
version="$6"
fi
if [ -z "$7" ]
then
restartfile="restartfile-true"
else
restartfile="$7"
fi
if [ -z "$8" ]
then
restarttransform="restarttransform-false"
else
restarttransform="$8"
fi
if [ -z "$9" ]
then
export="export-true"
else
export="$9"
fi
if [ -z "${10}" ]
then
inputformat=""
else
inputformat="--format=${10}"
fi
if [ -z "${11}" ]
then
inputoptions=""
else
inputoptions=( "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" "${21}" "${22}" "${23}" "${24}" "${25}" )
fi fi
# variables # autoinstall OpenRefine
uuid=$(cat /proc/sys/kernel/random/uuid) if [ ! -d "openrefine" ]; then
echo "Download OpenRefine..."
mkdir -p openrefine
wget -q --show-progress $openrefine_URL
echo "Install OpenRefine in subdirectory openrefine..."
tar -xzf "$(basename $openrefine_URL)" -C openrefine --strip 1 --totals
rm -f "$(basename $openrefine_URL)"
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' openrefine/refine.ini
echo ""
fi
# autoinstall OpenRefine client
if [ ! -d "openrefine-client" ]; then
echo "Download OpenRefine client..."
mkdir -p openrefine-client
wget -q --show-progress $client_URL
echo "Install OpenRefine client in subdirectory openrefine-client..."
tar -xzf "$(basename $client_URL)" -C openrefine-client --strip 1 --totals
rm -f "$(basename $client_URL)"
echo ""
fi
# help screen
function usage () {
cat <<EOF
Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== basic arguments ==
-a INPUTDIR path to directory with source files (leave empty to transform only ; multiple files may be imported into a single project by providing a zip or tar.gz archive, cf. https://github.com/OpenRefine/OpenRefine/wiki/Importers )
-b TRANSFORMDIR path to directory with OpenRefine transformation rules (json files, cf. http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html ; leave empty to import only)
-c OUTPUTDIR path to directory for exported files (and OpenRefine workspace)
== options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
-p PORT PORT on which OpenRefine should listen (default: 3333)
-E do NOT export files
-R do NOT restart OpenRefine after each transformation (e.g. config file)
-X do NOT restart OpenRefine after each project (e.g. input file)
-h displays this help screen
== inputoptions (mandatory for xml, json, fixed-width, xlsx, ods) ==
-i recordPath=RECORDPATH (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: -i recordPath=collection -i recordPath=record
-i columnWidths=COLUMNWIDTHS (fixed-width): please provide widths separated by comma (e.g. 7,5)
-i sheets=SHEETS (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
== more inputoptions (optional, only together with inputformat) ==
-i projectName=PROJECTNAME (all formats)
-i limit=LIMIT (all formats), default: -1
-i includeFileSources=INCLUDEFILESOURCES (all formats), default: false
-i trimStrings=TRIMSTRINGS (xml, json), default: false
-i storeEmptyStrings=STOREEMPTYSTRINGS (xml, json), default: true
-i guessCellValueTypes=GUESSCELLVALUETYPES (xml, csv, tsv, fixed-width, json), default: false
-i encoding=ENCODING (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
-i ignoreLines=IGNORELINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
-i headerLines=HEADERLINES (csv, tsv, fixed-width, xlsx, ods), default: 1
-i skipDataLines=SKIPDATALINES (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
-i storeBlankRows=STOREBLANKROWS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i processQuotes=PROCESSQUOTES (csv, tsv), default: true
-i storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
-i linesPerRow=LINESPERROW (line-based), default: 1
== example ==
./openrefine-batch.sh \
-a examples/powerhouse-museum/input/ \
-b examples/powerhouse-museum/config/ \
-c examples/powerhouse-museum/output/ \
-f tsv \
-i processQuotes=false \
-i guessCellValueTypes=true
clone or download GitHub repository to get example data:
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
EOF
exit 1
}
# defaults
ram="2048M"
port="3333"
restartfile="true"
restarttransform="true"
export="true"
inputdir=/dev/null
configdir=/dev/null
crossdir=/dev/null
# check input
NUMARGS=$#
if [ "$NUMARGS" -eq 0 ]; then
usage
fi
# get user input
options="a:b:c:d:f:i:m:p:ERXh"
while getopts $options opt; do
case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;;
p ) port=${OPTARG} ;;
E ) export="false" ;;
R ) restarttransform="false" ;;
X ) restartfile="false" ;;
h ) usage ;;
\? ) echo 1>&2 "Unknown option: -$OPTARG"; usage; exit 1;;
: ) echo 1>&2 "Missing option argument for -$OPTARG"; usage; exit 1;;
* ) echo 1>&2 "Unimplemented option: -$OPTARG"; usage; exit 1;;
esac
done
shift $(($OPTIND - 1))
# check for mandatory options
if [ -z "$outputdir" ]; then
echo 1>&2 "please provide path to directory for exported files (and OpenRefine workspace)"
echo 1>&2 "example: ./openrefine-batch.sh -c output/"
exit 1
fi
if [ "$format" = "xml" ] || [ "$format" = "json" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide recordpath in multiple arguments without slashes"
echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i recordPath=collection -i recordPath=record"
exit 1
fi
if [ "$format" = "fixed-width" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide column widths separated by comma (e.g. 7,5)"
echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i columnWidths=7,5"
exit 1
fi
if [ "$format" = "xlsx" ] || [ "$format" = "ods" ] && [ -z "$inputoptions" ]; then
echo 1>&2 "error: you specified the inputformat $format but did not provide mandatory input options"
echo 1>&2 "please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)"
echo 1>&2 "example: ./openrefine-batch.sh ... -f $format -i sheets=0"
exit 1
fi
# print variables
echo "Input directory: $inputdir" echo "Input directory: $inputdir"
echo "Input files: ${inputfiles[@]}" echo "Input files: ${inputfiles[*]}"
echo "Input format: $inputformat" echo "Input format: $inputformat"
echo "Input options: ${inputoptions[@]}" echo "Input options: ${inputoptions[*]}"
echo "Config directory: $configdir" echo "Config directory: $configdir"
echo "Transformation rules: ${jsonfiles[@]}" echo "Transformation rules: ${jsonfiles[*]}"
echo "Cross directory: $crossdir" echo "Cross directory: $crossdir"
echo "Cross projects: ${crossprojects[@]}" echo "Cross projects: ${crossprojects[*]}"
echo "OpenRefine heap space: $ram" echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version" echo "OpenRefine port: $port"
echo "OpenRefine workspace: $outputdir" echo "OpenRefine workspace: $outputdir"
echo "Export TSV to workspace: $export" echo "Export TSV to workspace: $export"
echo "Docker container name: $uuid"
echo "restart after file: $restartfile" echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform" echo "restart after transform: $restarttransform"
echo "" echo ""
# time # declare additional variables
echo "begin: $(date)" checkpoints=${#checkpointdate[@]}
echo "" checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Start process"
memoryload=()
# launch server # launch server
echo "start OpenRefine server..." checkpoints=${#checkpointdate[@]}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Launch OpenRefine"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
pid=$!
# wait until server is available # wait until server is available
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# show server logs
docker attach ${uuid} &
echo "" echo ""
# import all files # import all files
if [ -n "$inputfiles" ]; then if [ -n "$inputfiles" ]; then
echo "=== IMPORT ===" checkpoints=${#checkpointdate[@]}
echo "" checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Import all files"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
for inputfile in "${inputfiles[@]}" ; do for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..." echo "import ${inputfile}..."
# run client with input command # run client with input command
docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]} openrefine-client/refine.py -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
# show statistics # show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
memoryload+=($(ps --no-headers -o rss -p ${pid}))
echo "" echo ""
# restart server to clear memory # restart server to clear memory
if [ "$restartfile" = "restartfile-true" ]; then if [ "$restartfile" = "true" ]; then
echo "save project and restart OpenRefine server..." echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid} kill ${pid}
docker rm ${uuid} wait
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data echo ""
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
docker attach ${uuid} & pid=$!
until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
echo "" echo ""
fi fi
done done
fi fi
# transform and export files # transform and export files
if [ -n "$jsonfiles" ] || [ "$export" = "export-true" ]; then if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "=== TRANSFORM / EXPORT ===" checkpoints=${#checkpointdate[@]}
echo "" checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="Prepare transform & export"
echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo ""
# get project ids # get project ids
echo "get project ids..." echo "get project ids..."
projects=($(docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | tee ${outputdir}/projects.tmp | cut -c 2-14)) openrefine-client/refine.py -P ${port} -l > "${outputdir}/projects.tmp"
cat ${outputdir}/projects.tmp && rm ${outputdir}/projects.tmp projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
echo "" projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
echo ""
# provide additional OpenRefine projects for cross function # provide additional OpenRefine projects for cross function
if [ -n "$crossprojects" ]; then if [ -n "$crossprojects" ]; then
echo "provide additional projects for cross function..." echo "provide additional projects for cross function..."
# copy given projects to workspace # copy given projects to workspace
rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir rsync -a --exclude='*.project/history' "${crossdir}"/*.project "${outputdir}"
# restart server to advertise copied projects # restart server to advertise copied projects
echo "restart OpenRefine server to advertise copied projects..." echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid} kill ${pid}
docker rm ${uuid} wait
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data echo ""
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
docker attach ${uuid} & pid=$!
echo "" until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
fi echo ""
fi
# loop for all projects # loop for all projects
for projectid in "${projects[@]}" ; do for ((i=0;i<${#projectids[@]};++i)); do
# time
echo "--- begin project $projectid @ $(date) ---"
echo ""
# apply transformation rules # apply transformation rules
if [ -n "$jsonfiles" ]; then if [ -n "$jsonfiles" ]; then
for jsonfile in "${jsonfiles[@]}" ; do checkpoints=${#checkpointdate[@]}
echo "transform ${jsonfile}..." checkpointdate[$(($checkpoints + 1))]=$(date +%s)
# run client with apply command checkpointname[$(($checkpoints + 1))]="Transform ${projectnames[i]}"
docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid} echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
# show statistics echo ""
ps -o start,etime,%mem,%cpu,rss -C java --sort=start echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
# restart server to clear memory echo ""
if [ "$restarttransform" = "restarttransform-true" ]; then for jsonfile in "${jsonfiles[@]}" ; do
echo "save project and restart OpenRefine server..." echo "transform ${jsonfile}..."
docker stop -t=5000 ${uuid} # run client with apply command
docker rm ${uuid} openrefine-client/refine.py -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data # allocated system resources
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
docker attach ${uuid} & memoryload+=($(ps --no-headers -o rss -p ${pid}))
fi echo ""
echo "" # restart server to clear memory
done if [ "$restarttransform" = "true" ]; then
fi echo "save project and restart OpenRefine server..."
kill ${pid}
wait
echo ""
openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
pid=$!
until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
fi
echo ""
done
fi
# export project to workspace # export project to workspace
if [ "$export" = "export-true" ]; then if [ "$export" = "true" ]; then
echo "export to file ${projectid}.tsv..." checkpoints=${#checkpointdate[@]}
# run client with export command checkpointdate[$(($checkpoints + 1))]=$(date +%s)
docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid} checkpointname[$(($checkpoints + 1))]="Export ${projectnames[i]}"
# show statistics echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
ps -o start,etime,%mem,%cpu,rss -C java --sort=start echo ""
# restart server to clear memory echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
if [ "$restartfile" = "restartfile-true" ]; then echo ""
echo "restart OpenRefine server..." # get filename without extension
docker stop -t=5000 ${uuid} filename=${projectnames[i]%.*}
docker rm ${uuid} echo "export to file ${filename}.tsv..."
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data # run client with export command
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done openrefine-client/refine.py -P ${port} -E --output="${outputdir}/${filename}.tsv" ${projectids[i]}
docker attach ${uuid} & # show allocated system resources
fi ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
echo"" memoryload+=($(ps --no-headers -o rss -p ${pid}))
fi echo ""
fi
# time # restart server to clear memory
echo "--- finished project $projectid @ $(date) ---" if [ "$restartfile" = "true" ]; then
echo "" echo "restart OpenRefine server..."
done kill ${pid}
wait
echo ""
openrefine/refine -p ${port} -d "${outputdir}" -m ${ram} &
pid=$!
until wget -q -O - http://localhost:${port} | cat | grep -q -o "OpenRefine" ; do sleep 1; done
fi
echo ""
# list output files done
if [ "$export" = "export-true" ]; then
echo "output (number of lines / size in bytes):" # list output files
wc -c -l ${outputdir}/*.tsv if [ "$export" = "true" ]; then
echo "" echo "output (number of lines / size in bytes):"
fi wc -c -l "${outputdir}"/*.tsv
echo ""
fi
fi fi
# cleanup # cleanup
echo "cleanup..." echo "cleanup..."
docker stop -t=5000 ${uuid} kill ${pid}
docker rm ${uuid} wait
rm -r -f ${outputdir}/workspace*.json rm -r -f "${outputdir:?}"/workspace*.json
# delete duplicates from copied projects # delete duplicates from copied projects
if [ -n "$crossprojects" ]; then if [ -n "$crossprojects" ]; then
for i in "${crossprojects[@]}" ; do rm -r -f ${outputdir}/${i} ; done for i in "${crossprojects[@]}" ; do rm -r -f "${outputdir}/${i}" ; done
fi fi
echo "" echo ""
# time # calculate and print checkpoints
echo "finish: $(date)" echo "=== Statistics ==="
echo ""
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
checkpointname[$(($checkpoints + 1))]="End process"
echo "starting time and run time of each step:"
checkpoints=${#checkpointdate[@]}
checkpointdate[$(($checkpoints + 1))]=$(date +%s)
for i in $(seq 1 $checkpoints); do
diffsec="$((${checkpointdate[$(($i + 1))]} - ${checkpointdate[$i]}))"
printf "%35s $(date --date=@${checkpointdate[$i]}) ($(date -d@${diffsec} -u +%H:%M:%S))\n" "${checkpointname[$i]}"
done
echo ""
diffsec="$((${checkpointdate[$checkpoints]} - ${checkpointdate[1]}))"
echo "total run time: $(date -d@${diffsec} -u +%H:%M:%S) (hh:mm:ss)"
# calculate and print memory load
max=${memoryload[0]}
for n in "${memoryload[@]}" ; do
((n > max)) && max=$n
done
echo "highest memory load: $(($max / 1024)) MB"
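
Review note: if neither an import nor a transform step runs, `memoryload` stays empty, so `$max` is empty and the final arithmetic expansion fails. The following is a minimal sketch of a guarded maximum, assuming the samples are RSS values in KiB as reported by `ps -o rss`. The function name `max_of` and the sample values are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the largest of its integer arguments, or 0 if none are given.
max_of () {
  local max=0 n
  for n in "$@"; do
    (( n > max )) && max=$n
  done
  echo "$max"
}

# Example RSS samples in KiB (made-up values for illustration).
memoryload=(204800 512000 307200)
echo "highest memory load: $(( $(max_of "${memoryload[@]}") / 1024 )) MB"
```

Starting from 0 instead of `${memoryload[0]}` keeps the arithmetic valid when no samples were collected.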