release v1.7, added option to specify export format, e.g. csv for issue #3
parent f0db5b6cf3
commit 4f66502440
99
README.md
@@ -6,7 +6,7 @@ Shell script to run OpenRefine in batch mode (import, transform, export). This b
 
 1. imports all data from a given directory into OpenRefine
 2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
-3. finally exports the data in TSV (tab-separated values) format.
+3. finally exports the data in csv, tsv, html, xlsx or ods.
 
 It orchestrates [OpenRefine](https://github.com/OpenRefine/OpenRefine) (server) and a [python client](https://github.com/felixlohmeier/openrefine-client) that communicates with the OpenRefine API. By restarting the server after each process it reduces memory requirements to a minimum.
 
@@ -15,7 +15,7 @@ If you prefer a containerized approach, see a [variation of this script for Dock
 ### Typical Workflow
 
 - **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
-- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files.
+- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to files in the format specified (default: TSV - tab-separated values).
 
 ### Install
 
@@ -43,7 +43,7 @@ cp CONFIGFILES config/
 
 **OUTPUT/**
 * path to directory where results and temporary data should be stored
-* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
+* Transformed data will be stored in this directory in the format specified (default: TSV). Show results: `ls OUTPUT/*.tsv`
 * OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
 
 ### Example
@@ -76,6 +76,7 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 
 == options ==
 -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
+-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
 -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
 -i INPUTOPTIONS several options provided by openrefine-client, see below...
 -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@@ -119,108 +120,107 @@ https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
 The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
 
 ```
-[14:46 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
+[00:41 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
 Download OpenRefine...
-openrefine-linux-2.7.tar.g 100%[=====================================>] 60,23M 8,89MB/s in 11s
+openrefine-linux-2017-10-26.tar.gz 100%[=====================================================================================================================>] 66,34M 5,62MB/s in 12s
 Install OpenRefine in subdirectory openrefine...
-Total bytes read: 72990720 (70MiB, 136MiB/s)
+Total bytes read: 79861760 (77MiB, 128MiB/s)
 
 Download OpenRefine client...
-v0.3.1.tar.gz [ <=> ] 563,12K 1015KB/s in 0,6s
-Install OpenRefine client in subdirectory openrefine-client...
-Total bytes read: 3082240 (3,0MiB, 90MiB/s)
+openrefine-client_0-3-1_linux-64bit 100%[=====================================================================================================================>] 5,39M 5,08MB/s in 1,1s
 
-Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
+Input directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/input
 Input files: phm-collection.tsv
 Input format: --format=tsv
 Input options: --processQuotes=false --guessCellValueTypes=true
-Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
+Config directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/config
 Transformation rules: phm-transform.json
 Cross directory: /dev/null
 Cross projects:
 OpenRefine heap space: 2048M
 OpenRefine port: 3333
-OpenRefine workspace: /home/felix/openrefine-batch/examples/powerhouse-museum/output
-Export TSV to workspace: true
+OpenRefine workspace: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output
+Export to workspace: true
+Export format: tsv
 restart after file: false
 restart after transform: false
 
 === 1. Launch OpenRefine ===
 
-starting time: Di 20. Jun 13:51:06 CEST 2017
+starting time: Sa 28. Okt 00:42:33 CEST 2017
 
 Starting OpenRefine at 'http://127.0.0.1:3333/'
 
-13:51:06.727 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
-13:51:06.728 [ refine_server] refine.memory size: 2048M JVM Max heap: 1908932608 (1ms)
-13:51:06.737 [ refine_server] Initializing context: '/' from '/home/felix/openrefine-batch/openrefine/webapp' (9ms)
-13:51:06.973 [ refine] Starting OpenRefine 2.7 [TRUNK]... (236ms)
-13:51:06.978 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (5ms)
-13:51:09.377 [ refine] Running in headless mode (2399ms)
+00:42:33.199 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
+00:42:33.200 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
+00:42:33.206 [ refine_server] Initializing context: '/' from '/home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/openrefine/webapp' (6ms)
+00:42:33.418 [ refine] Starting OpenRefine 2017-10-26 [TRUNK]... (212ms)
+00:42:33.424 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
+00:42:35.993 [ refine] Running in headless mode (2569ms)
 
 === 2. Import all files ===
 
-starting time: Di 20. Jun 13:51:09 CEST 2017
+starting time: Sa 28. Okt 00:42:36 CEST 2017
 
 import phm-collection.tsv...
-13:51:09.900 [ refine] POST /command/core/create-project-from-upload (523ms)
-New project: 2034248478869
-13:51:14.110 [ refine] GET /command/core/get-rows (4210ms)
+00:42:36.393 [ refine] POST /command/core/create-project-from-upload (400ms)
+New project: 1721413008439
+00:42:40.731 [ refine] GET /command/core/get-rows (4338ms)
 Number of rows: 75814
 STARTED ELAPSED %MEM %CPU RSS
-13:51:05 00:08 5.3 191 864692
+00:42:32 00:07 5.7 220 937692
 
 === 3. Prepare transform & export ===
 
-starting time: Di 20. Jun 13:51:14 CEST 2017
+starting time: Sa 28. Okt 00:42:40 CEST 2017
 
 get project ids...
-13:51:14.207 [ refine] GET /command/core/get-all-project-metadata (97ms)
-2034248478869: phm-collection.tsv
+00:42:40.866 [ refine] GET /command/core/get-all-project-metadata (135ms)
+1721413008439: phm-collection.tsv
 
 === 4. Transform phm-collection.tsv ===
 
-starting time: Di 20. Jun 13:51:14 CEST 2017
+starting time: Sa 28. Okt 00:42:40 CEST 2017
 
 transform phm-transform.json...
-13:51:14.265 [ refine] GET /command/core/get-models (58ms)
-13:51:14.273 [ refine] POST /command/core/apply-operations (8ms)
+00:42:40.963 [ refine] GET /command/core/get-models (97ms)
+00:42:40.967 [ refine] POST /command/core/apply-operations (4ms)
 STARTED ELAPSED %MEM %CPU RSS
-13:51:05 00:23 7.0 155 1142712
+00:42:32 00:29 7.1 142 1162720
 
 
 === 5. Export phm-collection.tsv ===
 
-starting time: Di 20. Jun 13:51:29 CEST 2017
+starting time: Sa 28. Okt 00:43:02 CEST 2017
 
 export to file phm-collection.tsv...
-13:51:29.824 [ refine] GET /command/core/get-models (15551ms)
-13:51:29.827 [ refine] GET /command/core/get-all-project-metadata (3ms)
-13:51:29.841 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (14ms)
+00:43:02.555 [ refine] GET /command/core/get-models (21588ms)
+00:43:02.557 [ refine] GET /command/core/get-all-project-metadata (2ms)
+00:43:02.561 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
 STARTED ELAPSED %MEM %CPU RSS
-13:51:05 00:49 7.0 75.7 1144808
+00:42:32 00:53 7.1 81.1 1164684
 
 
 output (number of lines / size in bytes):
-167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
+167017 60619468 /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
 
 cleanup...
-13:51:55.783 [ ProjectManager] Saving all modified projects ... (25942ms)
-13:51:58.324 [ project_utilities] Saved project '2034248478869' (2541ms)
+00:43:26.161 [ ProjectManager] Saving all modified projects ... (23600ms)
+00:43:29.586 [ project_utilities] Saved project '1721413008439' (3425ms)
 
 === Statistics ===
 
 starting time and run time of each step:
-Start process Di 20. Jun 13:51:06 CEST 2017 (00:00:00)
-Launch OpenRefine Di 20. Jun 13:51:06 CEST 2017 (00:00:03)
-Import all files Di 20. Jun 13:51:09 CEST 2017 (00:00:05)
-Prepare transform & export Di 20. Jun 13:51:14 CEST 2017 (00:00:00)
-Transform phm-collection.tsv Di 20. Jun 13:51:14 CEST 2017 (00:00:15)
-Export phm-collection.tsv Di 20. Jun 13:51:29 CEST 2017 (00:00:30)
-End process Di 20. Jun 13:51:59 CEST 2017 (00:00:00)
+Start process Sa 28. Okt 00:42:33 CEST 2017 (00:00:00)
+Launch OpenRefine Sa 28. Okt 00:42:33 CEST 2017 (00:00:03)
+Import all files Sa 28. Okt 00:42:36 CEST 2017 (00:00:04)
+Prepare transform & export Sa 28. Okt 00:42:40 CEST 2017 (00:00:00)
+Transform phm-collection.tsv Sa 28. Okt 00:42:40 CEST 2017 (00:00:22)
+Export phm-collection.tsv Sa 28. Okt 00:43:02 CEST 2017 (00:00:28)
+End process Sa 28. Okt 00:43:30 CEST 2017 (00:00:00)
 
-total run time: 00:00:53 (hh:mm:ss)
-highest memory load: 1117 MB
+total run time: 00:00:57 (hh:mm:ss)
+highest memory load: 1137 MB
 ```
 
 ### Docker
@@ -247,7 +247,6 @@ Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon
 ### Todo
 
 - [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
-- [ ] add option to delete openrefine projects in output directory
 - [ ] provide more example data from other OpenRefine tutorials
 
 ### Licensing

openrefine-batch-docker.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# openrefine-batch.sh, Felix Lohmeier, v1.1, 2017-06-20
+# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
 # https://github.com/felixlohmeier/openrefine-batch
 
 # check system requirements
@@ -26,6 +26,7 @@ Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUT
 
 == options ==
 -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
+-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
 -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
 -i INPUTOPTIONS several options provided by openrefine-client, see below...
 -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@@ -80,6 +81,7 @@ version="2.7"
 restartfile="true"
 restarttransform="true"
 export="true"
+exportformat="tsv"
 inputdir=/dev/null
 configdir=/dev/null
 crossdir=/dev/null
@@ -91,13 +93,14 @@ if [ "$NUMARGS" -eq 0 ]; then
 fi
 
 # get user input
-options="a:b:c:d:f:i:m:p:ERXh"
+options="a:b:c:d:e:f:i:m:p:ERXh"
 while getopts $options opt; do
 case $opt in
 a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
 b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
 c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
 d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
+e ) format="${OPTARG}" ; exportformat="${OPTARG}" ;;
 f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
 i ) inputoptions+=("--${OPTARG}") ;;
 m ) ram=${OPTARG} ;;
@@ -151,7 +154,8 @@ echo "Cross projects: ${crossprojects[*]}"
 echo "OpenRefine heap space: $ram"
 echo "OpenRefine version: $version"
 echo "OpenRefine workspace: $outputdir"
-echo "Export TSV to workspace: $export"
+echo "Export to workspace: $export"
+echo "Export format: $exportformat"
 echo "Docker container name: $uuid"
 echo "restart after file: $restartfile"
 echo "restart after transform: $restarttransform"
@@ -171,9 +175,9 @@ echo "=== $checkpoints. ${checkpointname[$(($checkpoints + 1))]} ==="
 echo ""
 echo "starting time: $(date --date=@${checkpointdate[$(($checkpoints + 1))]})"
 echo ""
-docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
 # wait until server is available
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 # show server logs
 docker attach ${uuid} &
 echo ""
@@ -190,7 +194,7 @@ if [ -n "$inputfiles" ]; then
 for inputfile in "${inputfiles[@]}" ; do
 echo "import ${inputfile}..."
 # run client with input command
-docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
+sudo docker run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
@@ -200,8 +204,8 @@ if [ -n "$inputfiles" ]; then
 echo "save project and restart OpenRefine server..."
 docker stop -t=5000 ${uuid}
 docker rm ${uuid}
-docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 docker attach ${uuid} &
 echo ""
 fi
@@ -220,7 +224,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 
 # get project ids
 echo "get project ids..."
-docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
+sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l > "${outputdir}/projects.tmp"
 projectids=($(cat "${outputdir}/projects.tmp" | cut -c 2-14))
 projectnames=($(cat "${outputdir}/projects.tmp" | cut -c 17-))
 cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
@@ -235,8 +239,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 echo "restart OpenRefine server to advertise copied projects..."
 docker stop -t=5000 ${uuid}
 docker rm ${uuid}
-docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 docker attach ${uuid} &
 echo ""
 fi
@@ -256,7 +260,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 for jsonfile in "${jsonfiles[@]}" ; do
 echo "transform ${jsonfile}..."
 # run client with apply command
-docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
+sudo docker run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectids[i]}
 # allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
@@ -266,8 +270,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 echo "save project and restart OpenRefine server..."
 docker stop -t=5000 ${uuid}
 docker rm ${uuid}
-docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 docker attach ${uuid} &
 fi
 echo ""
@@ -285,9 +289,9 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 echo ""
 # get filename without extension
 filename=${projectnames[i]%.*}
-echo "export to file ${filename}.tsv..."
+echo "export to file ${filename}.${exportformat}..."
 # run client with export command
-docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.tsv" ${projectids[i]}
+sudo docker run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client -H ${uuid} -E --output="${filename}.${exportformat}" ${projectids[i]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
@@ -299,8 +303,8 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 echo "restart OpenRefine server..."
 docker stop -t=5000 ${uuid}
 docker rm ${uuid}
-docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+sudo docker run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
+until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 docker attach ${uuid} &
 fi
 echo ""
@@ -310,7 +314,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 # list output files
 if [ "$export" = "true" ]; then
 echo "output (number of lines / size in bytes):"
-wc -c -l "${outputdir}"/*.tsv
+wc -c -l "${outputdir}"/*.${exportformat}
 echo ""
 fi
 fi

openrefine-batch.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-# openrefine-batch.sh, Felix Lohmeier, v1.6, 2017-10-26
+# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
 # https://github.com/felixlohmeier/openrefine-batch
 
 # declare download URLs for OpenRefine and OpenRefine client
@@ -50,6 +50,7 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
 
 == options ==
 -d CROSSDIR path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the cross function, cf. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions )
+-e EXPORTFORMAT (csv, tsv, html, xls, xlsx, ods)
 -f INPUTFORMAT (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
 -i INPUTOPTIONS several options provided by openrefine-client, see below...
 -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
@@ -104,10 +105,12 @@ port="3333"
 restartfile="true"
 restarttransform="true"
 export="true"
+exportformat="tsv"
 inputdir=/dev/null
 configdir=/dev/null
 crossdir=/dev/null
 
+
 # check input
 NUMARGS=$#
 if [ "$NUMARGS" -eq 0 ]; then
@@ -115,13 +118,14 @@ if [ "$NUMARGS" -eq 0 ]; then
 fi
 
 # get user input
-options="a:b:c:d:f:i:m:p:ERXh"
+options="a:b:c:d:e:f:i:m:p:ERXh"
 while getopts $options opt; do
 case $opt in
 a ) inputdir=$(readlink -f ${OPTARG}); if [ -n "${inputdir// }" ] ; then inputfiles=($(find -L "${inputdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
 b ) configdir=$(readlink -f ${OPTARG}); if [ -n "${configdir// }" ] ; then jsonfiles=($(find -L "${configdir}"/* -type f -printf "%f\n" 2>/dev/null)); fi ;;
 c ) outputdir=$(readlink -m ${OPTARG}); mkdir -p "${outputdir}" ;;
 d ) crossdir=$(readlink -f ${OPTARG}); if [ -n "${crossdir// }" ] ; then crossprojects=($(find -L "${crossdir}"/* -maxdepth 0 -type d -printf "%f\n" 2>/dev/null)); fi ;;
+e ) format="${OPTARG}" ; exportformat="${OPTARG}" ;;
 f ) format="${OPTARG}" ; inputformat="--format=${OPTARG}" ;;
 i ) inputoptions+=("--${OPTARG}") ;;
 m ) ram=${OPTARG} ;;
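The hunk above adds `-e` to the `getopts` option string and stores its argument in `exportformat`, which defaults to `tsv`. The following is a minimal sketch of that behaviour in isolation, not the script itself: `parse_formats` is a hypothetical helper introduced only for illustration, and it handles just `-e` and `-f`.

```shell
#!/bin/bash
# Sketch only: mirrors the -e / -f handling shown in this commit.
# parse_formats is a hypothetical helper, not part of openrefine-batch.sh.
parse_formats() {
  local OPTIND=1 opt           # reset getopts state for each call
  local exportformat="tsv"     # default export format, as in the commit
  local inputformat=""
  while getopts "e:f:" opt; do
    case $opt in
      e ) exportformat="${OPTARG}" ;;
      f ) inputformat="--format=${OPTARG}" ;;
      * ) return 1 ;;          # unknown option
    esac
  done
  echo "export=${exportformat} input=${inputformat}"
}

parse_formats -f tsv -e csv    # export format given explicitly
parse_formats -f tsv           # falls back to the tsv default
```

Keeping the input and export formats in separate variables, as this sketch does, means the order of `-e` and `-f` on the command line does not matter.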
@@ -174,7 +178,8 @@ echo "Cross projects: ${crossprojects[*]}"
 echo "OpenRefine heap space: $ram"
 echo "OpenRefine port: $port"
 echo "OpenRefine workspace: $outputdir"
-echo "Export TSV to workspace: $export"
+echo "Export to workspace: $export"
+echo "Export format: $exportformat"
 echo "restart after file: $restartfile"
 echo "restart after transform: $restarttransform"
 echo ""
@@ -309,9 +314,9 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 echo ""
 # get filename without extension
 filename=${projectnames[i]%.*}
-echo "export to file ${filename}.tsv..."
+echo "export to file ${filename}.${exportformat}..."
 # run client with export command
-openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.tsv" ${projectids[i]}
+openrefine-client/openrefine-client_0-3-1_linux-64bit -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" ${projectids[i]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
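The export file name in the hunk above is built by stripping the last extension from the project name (`${projectnames[i]%.*}`) and appending the chosen export format. A minimal sketch of that expansion follows; `export_name` is a hypothetical helper introduced only for illustration and is not part of the script.

```shell
#!/bin/bash
# Sketch only: reproduces the parameter expansion used for the output name.
# export_name is a hypothetical helper, not part of openrefine-batch.sh.
export_name() {
  local projectname="$1"
  local exportformat="${2:-tsv}"      # tsv is the script's default
  local filename="${projectname%.*}"  # drop the shortest trailing ".ext"
  echo "${filename}.${exportformat}"
}

export_name phm-collection.tsv csv    # project imported from a TSV file
export_name phm-collection.tsv        # no format given, default applies
```

Only the last extension is removed, so a project named `data.tar.gz` would be exported as `data.tar.csv` under this scheme.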
@@ -335,7 +340,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 # list output files
 if [ "$export" = "true" ]; then
 echo "output (number of lines / size in bytes):"
-wc -c -l "${outputdir}"/*.tsv
+wc -c -l "${outputdir}"/*.${exportformat}
 echo ""
 fi
 fi
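The listing step above passes the glob `"${outputdir}"/*.${exportformat}` straight to `wc`. In bash an unmatched glob is handed to the command literally by default, so `wc` would report a missing file if nothing was exported in that format. The sketch below shows one possible guard. It is an assumption about how this could be handled, not something the commit does, and `list_exports` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch only: a guarded variant of the "list output files" step.
# list_exports is a hypothetical helper, not part of openrefine-batch.sh.
list_exports() {
  local outputdir="$1" exportformat="$2"
  local files=()
  shopt -s nullglob                       # unmatched globs expand to nothing
  files=("${outputdir}"/*."${exportformat}")
  shopt -u nullglob
  if [ "${#files[@]}" -eq 0 ]; then
    echo "no .${exportformat} files in ${outputdir}"
    return 0
  fi
  wc -c -l "${files[@]}"                  # lines and bytes, as in the script
}

# usage with a throwaway directory
tmpdir=$(mktemp -d)
list_exports "${tmpdir}" csv
printf 'a\tb\n1\t2\n' > "${tmpdir}/sample.tsv"
list_exports "${tmpdir}" tsv
rm -r "${tmpdir}"
```

Line counts are only meaningful for the text formats (csv, tsv, html); for xls, xlsx and ods the byte size is the more useful figure.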