release v1.7, added option to specify export format, e.g. csv for issue #3

Felix Lohmeier 2017-10-28 00:47:51 +02:00
parent f0db5b6cf3
commit 4f66502440
3 changed files with 83 additions and 75 deletions

@ -6,7 +6,7 @@ Shell script to run OpenRefine in batch mode (import, transform, export). This b
1. imports all data from a given directory into OpenRefine 1. imports all data from a given directory into OpenRefine
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and 2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
3. finally exports the data in TSV (tab-separated values) format. 3. finally exports the data in csv, tsv, html, xlsx or ods.
It orchestrates [OpenRefine](https://github.com/OpenRefine/OpenRefine) (server) and a [python client](https://github.com/felixlohmeier/openrefine-client) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum. It orchestrates [OpenRefine](https://github.com/OpenRefine/OpenRefine) (server) and a [python client](https://github.com/felixlohmeier/openrefine-client) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum.
@ -15,7 +15,7 @@ If you prefer a containerized approach, see a [variation of this script for Dock
### Typical Workflow ### Typical Workflow
- **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json). - **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files. - **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to files in the format specified (default: TSV - tab-separated values).
### Install ### Install
@ -43,7 +43,7 @@ cp CONFIGFILES config/
**OUTPUT/** **OUTPUT/**
* path to directory where results and temporary data should be stored * path to directory where results and temporary data should be stored
* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv` * Transformed data will be stored in this directory in the format specified (default: TSV). Show results: `ls OUTPUT/*.tsv`
* OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project` * OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
### Example ### Example
@ -76,6 +76,7 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== options == == options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions ) -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods) -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below... -i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M) -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@ -119,108 +120,107 @@ https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample: The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
``` ```
[14:46 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX [00:41 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
Download OpenRefine... Download OpenRefine...
openrefine-linux-2.7.tar.g 100%[=====================================>] 60,23M 8,89MB/s in 11s openrefine-linux-2017-10-26.tar.gz 100%[=====================================================================================================================>] 66,34M 5,62MB/s in 12s
Install OpenRefine in subdirectory openrefine... Install OpenRefine in subdirectory openrefine...
Total bytes read: 72990720 (70MiB, 136MiB/s) Total bytes read: 79861760 (77MiB, 128MiB/s)
Download OpenRefine client... Download OpenRefine client...
v0.3.1.tar.gz [ <=> ] 563,12K 1015KB/s in 0,6s openrefine-client_0-3-1_linux-64bit 100%[=====================================================================================================================>] 5,39M 5,08MB/s in 1,1s
Install OpenRefine client in subdirectory openrefine-client...
Total bytes read: 3082240 (3,0MiB, 90MiB/s)
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input Input directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/input
Input files: phm-collection.tsv Input files: phm-collection.tsv
Input format: --format=tsv Input format: --format=tsv
Input options: --processQuotes=false --guessCellValueTypes=true Input options: --processQuotes=false --guessCellValueTypes=true
Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config Config directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/config
Transformation rules: phm-transform.json Transformation rules: phm-transform.json
Cross directory: /dev/null Cross directory: /dev/null
Cross projects: Cross projects:
OpenRefine heap space: 2048M OpenRefine heap space: 2048M
OpenRefine port: 3333 OpenRefine port: 3333
OpenRefine workspace: /home/felix/openrefine-batch/examples/powerhouse-museum/output OpenRefine workspace: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output
Export TSV to workspace: true Export to workspace: true
Export format: tsv
restart after file: false restart after file: false
restart after transform: false restart after transform: false
=== 1. Launch OpenRefine === === 1. Launch OpenRefine ===
starting time: Di 20. Jun 13:51:06 CEST 2017 starting time: Sa 28. Okt 00:42:33 CEST 2017
Starting OpenRefine at 'http://127.0.0.1:3333/' Starting OpenRefine at 'http://127.0.0.1:3333/'
13:51:06.727 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms) 00:42:33.199 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
13:51:06.728 [ refine_server] refine.memory size: 2048M JVM Max heap: 1908932608 (1ms) 00:42:33.200 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
13:51:06.737 [ refine_server] Initializing context: '/' from '/home/felix/openrefine-batch/openrefine/webapp' (9ms) 00:42:33.206 [ refine_server] Initializing context: '/' from '/home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/openrefine/webapp' (6ms)
13:51:06.973 [ refine] Starting OpenRefine 2.7 [TRUNK]... (236ms) 00:42:33.418 [ refine] Starting OpenRefine 2017-10-26 [TRUNK]... (212ms)
13:51:06.978 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (5ms) 00:42:33.424 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
13:51:09.377 [ refine] Running in headless mode (2399ms) 00:42:35.993 [ refine] Running in headless mode (2569ms)
=== 2. Import all files === === 2. Import all files ===
starting time: Di 20. Jun 13:51:09 CEST 2017 starting time: Sa 28. Okt 00:42:36 CEST 2017
import phm-collection.tsv... import phm-collection.tsv...
13:51:09.900 [ refine] POST /command/core/create-project-from-upload (523ms) 00:42:36.393 [ refine] POST /command/core/create-project-from-upload (400ms)
New project: 2034248478869 New project: 1721413008439
13:51:14.110 [ refine] GET /command/core/get-rows (4210ms) 00:42:40.731 [ refine] GET /command/core/get-rows (4338ms)
Number of rows: 75814 Number of rows: 75814
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
13:51:05 00:08 5.3 191 864692 00:42:32 00:07 5.7 220 937692
=== 3. Prepare transform & export === === 3. Prepare transform & export ===
starting time: Di 20. Jun 13:51:14 CEST 2017 starting time: Sa 28. Okt 00:42:40 CEST 2017
get project ids... get project ids...
13:51:14.207 [ refine] GET /command/core/get-all-project-metadata (97ms) 00:42:40.866 [ refine] GET /command/core/get-all-project-metadata (135ms)
2034248478869: phm-collection.tsv 1721413008439: phm-collection.tsv
=== 4. Transform phm-collection.tsv === === 4. Transform phm-collection.tsv ===
starting time: Di 20. Jun 13:51:14 CEST 2017 starting time: Sa 28. Okt 00:42:40 CEST 2017
transform phm-transform.json... transform phm-transform.json...
13:51:14.265 [ refine] GET /command/core/get-models (58ms) 00:42:40.963 [ refine] GET /command/core/get-models (97ms)
13:51:14.273 [ refine] POST /command/core/apply-operations (8ms) 00:42:40.967 [ refine] POST /command/core/apply-operations (4ms)
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
13:51:05 00:23 7.0 155 1142712 00:42:32 00:29 7.1 142 1162720
=== 5. Export phm-collection.tsv === === 5. Export phm-collection.tsv ===
starting time: Di 20. Jun 13:51:29 CEST 2017 starting time: Sa 28. Okt 00:43:02 CEST 2017
export to file phm-collection.tsv... export to file phm-collection.tsv...
13:51:29.824 [ refine] GET /command/core/get-models (15551ms) 00:43:02.555 [ refine] GET /command/core/get-models (21588ms)
13:51:29.827 [ refine] GET /command/core/get-all-project-metadata (3ms) 00:43:02.557 [ refine] GET /command/core/get-all-project-metadata (2ms)
13:51:29.841 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (14ms) 00:43:02.561 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
STARTED ELAPSED %MEM %CPU RSS STARTED ELAPSED %MEM %CPU RSS
13:51:05 00:49 7.0 75.7 1144808 00:42:32 00:53 7.1 81.1 1164684
output (number of lines / size in bytes): output (number of lines / size in bytes):
167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv 167017 60619468 /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
cleanup... cleanup...
13:51:55.783 [ ProjectManager] Saving all modified projects ... (25942ms) 00:43:26.161 [ ProjectManager] Saving all modified projects ... (23600ms)
13:51:58.324 [ project_utilities] Saved project '2034248478869' (2541ms) 00:43:29.586 [ project_utilities] Saved project '1721413008439' (3425ms)
=== Statistics === === Statistics ===
starting time and run time of each step: starting time and run time of each step:
Start process Di 20. Jun 13:51:06 CEST 2017 (00:00:00) Start process Sa 28. Okt 00:42:33 CEST 2017 (00:00:00)
Launch OpenRefine Di 20. Jun 13:51:06 CEST 2017 (00:00:03) Launch OpenRefine Sa 28. Okt 00:42:33 CEST 2017 (00:00:03)
Import all files Di 20. Jun 13:51:09 CEST 2017 (00:00:05) Import all files Sa 28. Okt 00:42:36 CEST 2017 (00:00:04)
Prepare transform & export Di 20. Jun 13:51:14 CEST 2017 (00:00:00) Prepare transform & export Sa 28. Okt 00:42:40 CEST 2017 (00:00:00)
Transform phm-collection.tsv Di 20. Jun 13:51:14 CEST 2017 (00:00:15) Transform phm-collection.tsv Sa 28. Okt 00:42:40 CEST 2017 (00:00:22)
Export phm-collection.tsv Di 20. Jun 13:51:29 CEST 2017 (00:00:30) Export phm-collection.tsv Sa 28. Okt 00:43:02 CEST 2017 (00:00:28)
End process Di 20. Jun 13:51:59 CEST 2017 (00:00:00) End process Sa 28. Okt 00:43:30 CEST 2017 (00:00:00)
total run time: 00:00:53 (hh:mm:ss) total run time: 00:00:57 (hh:mm:ss)
highest memory load: 1117 MB highest memory load: 1137 MB
``` ```
### Docker ### Docker
@ -247,7 +247,6 @@ Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon
### Todo ### Todo
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor - [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
- [ ] add option to delete openrefine projects in output directory
- [ ] provide more example data from other OpenRefine tutorials - [ ] provide more example data from other OpenRefine tutorials
### Licensing ### Licensing

@ -1,5 +1,5 @@
#!/bin/bash #!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.1, 2017-06-20 # openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# https://github.com/felixlohmeier/openrefine-batch # https://github.com/felixlohmeier/openrefine-batch
# check system requirements # check system requirements
@ -26,6 +26,7 @@ Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUT
== options == == options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions ) -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods) -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below... -i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M) -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@ -80,6 +81,7 @@ version="2.7"
restartfile="true" restartfile="true"
restarttransform="true" restarttransform="true"
export="true" export="true"
exportformat="tsv"
inputdir=/dev/null inputdir=/dev/null
configdir=/dev/null configdir=/dev/null
crossdir=/dev/null crossdir=/dev/null
@ -91,13 +93,14 @@ if [ "$NUMARGS" -eq 0 ]; then
fi fi
# get user input # get user input
options="a:b:c:d:f:i:m:p:ERXh" options="a:b:c:d:e:f:i:m:p:ERXh"
while getopts $options opt; do while getopts $options opt; do
case $opt in case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;; a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;; b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;; c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;; d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
e ) format="${OPTARG}" ; exportformat="${OPTARG}" ;;
f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;; f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
i ) inputoptions+=("--${OPTARG}") ;; i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;; m ) ram=${OPTARG} ;;
@ -151,7 +154,8 @@ echo "Cross projects: ${crossprojects[*]}"
echo "OpenRefine heap space: $ram" echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version" echo "OpenRefine version: $version"
echo "OpenRefine workspace: $outputdir" echo "OpenRefine workspace: $outputdir"
echo "Export TSV to workspace: $export" echo "Export to workspace: $export"
echo "Export format: $exportformat"
echo "Docker container name: $uuid" echo "Docker container name: $uuid"
echo "restart after file: $restartfile" echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform" echo "restart after transform: $restarttransform"
@ -171,9 +175,9 @@ echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
echo "" echo ""
echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})" echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
echo "" echo ""
docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
# wait until server is available # wait until server is available
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# show server logs # show server logs
docker attach ${uuid} & docker attach ${uuid} &
echo "" echo ""
@ -190,7 +194,7 @@ if [ -n "$inputfiles" ]; then
for inputfile in "${inputfiles[@]}" ; do for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..." echo "import ${inputfile}..."
# run client with input command # run client with input command
docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]} sudo docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show allocated system resources # show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java)) memoryload+=($(ps --no-headers -o rss -C java))
@ -200,8 +204,8 @@ if [ -n "$inputfiles" ]; then
echo "save project and restart OpenRefine server..." echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid} docker stop -t=5000 ${uuid}
docker rm ${uuid} docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} & docker attach ${uuid} &
echo "" echo ""
fi fi
@ -220,7 +224,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# get project ids # get project ids
echo "get project ids..." echo "get project ids..."
docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp" sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14)) projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-)) projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp" cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
@ -235,8 +239,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "restart OpenRefine server to advertise copied projects..." echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid} docker stop -t=5000 ${uuid}
docker rm ${uuid} docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} & docker attach ${uuid} &
echo "" echo ""
fi fi
@ -256,7 +260,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
for jsonfile in "${jsonfiles[@]}" ; do for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..." echo "transform ${jsonfile}..."
# run client with apply command # run client with apply command
docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]} sudo docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
# allocated system resources # allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java)) memoryload+=($(ps --no-headers -o rss -C java))
@ -266,8 +270,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "save project and restart OpenRefine server..." echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid} docker stop -t=5000 ${uuid}
docker rm ${uuid} docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} & docker attach ${uuid} &
fi fi
echo "" echo ""
@ -285,9 +289,9 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "" echo ""
# get filename without extension # get filename without extension
filename=${projectnames[i]%.*} filename=${projectnames[i]%.*}
echo "export to file ${filename}.tsv..." echo "export to file ${filename}.${exportformat}..."
# run client with export command # run client with export command
docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.tsv" ${projectids[i]} sudo docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.${exportformat}" ${projectids[i]}
# show allocated system resources # show allocated system resources
ps -o start,etime,%mem,%cpu,rss -C java --sort=start ps -o start,etime,%mem,%cpu,rss -C java --sort=start
memoryload+=($(ps --no-headers -o rss -C java)) memoryload+=($(ps --no-headers -o rss -C java))
@ -299,8 +303,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "restart OpenRefine server..." echo "restart OpenRefine server..."
docker stop -t=5000 ${uuid} docker stop -t=5000 ${uuid}
docker rm ${uuid} docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} & docker attach ${uuid} &
fi fi
echo "" echo ""
@ -310,7 +314,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# list output files # list output files
if [ "$export" = "true" ]; then if [ "$export" = "true" ]; then
echo "output (number of lines / size in bytes):" echo "output (number of lines / size in bytes):"
wc -c -l "${outputdir}"/*.tsv wc -c -l "${outputdir}"/*.${exportformat}
echo "" echo ""
fi fi
fi fi
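
The `getopts` change above adds `e:` to the option string and sets a `tsv` default for `exportformat`. A minimal sketch of that pattern in isolation follows. The function name `parse_export_format` and the validation step are assumptions made for illustration; the scripts in this commit pass the value through without checking it against the list in the usage text.

```shell
#!/bin/bash
# Hypothetical helper, not part of openrefine-batch.sh or the Docker variant.
# It parses -e with a tsv default and rejects values outside the list
# shown in the usage text (csv, tsv, html, xls, xlsx, ods).
parse_export_format() {
  # Reset OPTIND locally so the function can be called more than once.
  local OPTIND=1 opt exportformat="tsv"   # default mirrors exportformat="tsv" in the diff
  while getopts "e:" opt; do
    case $opt in
      e ) exportformat="${OPTARG}" ;;
      * ) return 2 ;;                      # unknown option or missing argument
    esac
  done
  # Validation is an addition of this sketch, not behaviour of the scripts.
  case "$exportformat" in
    csv|tsv|html|xls|xlsx|ods ) printf '%s\n' "$exportformat" ;;
    * ) echo "unsupported export format: $exportformat" >&2; return 1 ;;
  esac
}

parse_export_format -e csv
parse_export_format
```

Called without `-e`, the sketch falls back to `tsv`, which matches the default echoed as `Export format: tsv` in the sample log.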

@ -1,5 +1,5 @@
#!/bin/bash #!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v1.6, 2017-10-26 # openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
# https://github.com/felixlohmeier/openrefine-batch # https://github.com/felixlohmeier/openrefine-batch
# declare download URLs for OpenRefine and OpenRefine client # declare download URLs for OpenRefine and OpenRefine client
@ -50,6 +50,7 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
== options == == options ==
-d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions ) -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
-f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods) -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
-i INPUTOPTIONS several options provided by openrefine-client, see below... -i INPUTOPTIONS several options provided by openrefine-client, see below...
-m RAM maximum RAM for OpenRefine java heap space (default: 2048M) -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@ -104,10 +105,12 @@ port="3333"
restartfile="true" restartfile="true"
restarttransform="true" restarttransform="true"
export="true" export="true"
exportformat="tsv"
inputdir=/dev/null inputdir=/dev/null
configdir=/dev/null configdir=/dev/null
crossdir=/dev/null crossdir=/dev/null
# check input # check input
NUMARGS=$# NUMARGS=$#
if [ "$NUMARGS" -eq 0 ]; then if [ "$NUMARGS" -eq 0 ]; then
@ -115,13 +118,14 @@ if [ "$NUMARGS" -eq 0 ]; then
fi fi
# get user input # get user input
options="a:b:c:d:f:i:m:p:ERXh" options="a:b:c:d:e:f:i:m:p:ERXh"
while getopts $options opt; do while getopts $options opt; do
case $opt in case $opt in
a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;; a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;; b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;; c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;; d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
e ) format="${OPTARG}" ; exportformat="${OPTARG}" ;;
f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;; f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
i ) inputoptions+=("--${OPTARG}") ;; i ) inputoptions+=("--${OPTARG}") ;;
m ) ram=${OPTARG} ;; m ) ram=${OPTARG} ;;
@ -174,7 +178,8 @@ echo "Cross projects: ${crossprojects[*]}"
echo "OpenRefine heap space: $ram" echo "OpenRefine heap space: $ram"
echo "OpenRefine port: $port" echo "OpenRefine port: $port"
echo "OpenRefine workspace: $outputdir" echo "OpenRefine workspace: $outputdir"
echo "Export TSV to workspace: $export" echo "Export to workspace: $export"
echo "Export format: $exportformat"
echo "restart after file: $restartfile" echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform" echo "restart after transform: $restarttransform"
echo "" echo ""
@ -309,9 +314,9 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
echo "" echo ""
# get filename without extension # get filename without extension
filename=${projectnames[i]%.*} filename=${projectnames[i]%.*}
echo "export to file ${filename}.tsv..." echo "export to file ${filename}.${exportformat}..."
# run client with export command # run client with export command
openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.tsv" ${projectids[i]} openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" ${projectids[i]}
# show allocated system resources # show allocated system resources
ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
memoryload+=($(ps --no-headers -o rss -p ${pid})) memoryload+=($(ps --no-headers -o rss -p ${pid}))
@ -335,7 +340,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
# list output files # list output files
if [ "$export" = "true" ]; then if [ "$export" = "true" ]; then
echo "output (number of lines / size in bytes):" echo "output (number of lines / size in bytes):"
wc -c -l "${outputdir}"/*.tsv wc -c -l "${outputdir}"/*.${exportformat}
echo "" echo ""
fi fi
fi fi
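
The export hunks above replace the hard-coded `.tsv` suffix with `${exportformat}` after stripping the extension from the project name. The naming step can be sketched on its own as shown below. The function name `export_filename` is hypothetical and only illustrates the parameter expansion; it does not call OpenRefine or the client.

```shell
#!/bin/bash
# Hypothetical helper: build the export file name the way the diff does,
# by stripping the last extension from the project name and appending
# the chosen export format.
export_filename() {
  local projectname="$1" exportformat="${2:-tsv}"   # tsv default as in the scripts
  local filename="${projectname%.*}"                # same expansion as filename=${projectnames[i]%.*}
  printf '%s.%s\n' "$filename" "$exportformat"
}

export_filename "phm-collection.tsv" csv
export_filename "phm-collection.tsv"
```

Only the last extension is removed, so a project named `archive.tar.gz` would be exported as `archive.tar.csv` under this scheme. With the script itself, the format is chosen by passing `-e` (for example `-e csv`) alongside the `-a`, `-b` and `-c` options listed in the usage text.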