release v0.6
parent
2f0d8fb080
commit
4ee785ecf3
128
README.md
|
@ -1,6 +1,6 @@
|
|||
## OpenRefine batch processing (openrefine-batch.sh)
|
||||
|
||||
Shell script to run OpenRefine on Windows, Linux or Mac in batch mode (import, transform, export). This bash script automatically...
|
||||
Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
|
||||
|
||||
1. imports all data from a given directory into OpenRefine
|
||||
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
|
||||
|
@ -10,45 +10,55 @@ It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/fel
|
|||
|
||||
### Typical Workflow
|
||||
|
||||
- Step 1: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
|
||||
- Step 2: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files.
|
||||
- **Step 1**: Experiment with your data (or parts of it) in the graphical user interface of OpenRefine. Once you are satisfied with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. transform.json).
|
||||
- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files into OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to TSV files.
|
||||
|
||||
### Install
|
||||
|
||||
Linux:
|
||||
|
||||
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
|
||||
2. Open Terminal and enter `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
|
||||
|
||||
Mac:
|
||||
|
||||
1. Install Docker
|
||||
2. ...
|
||||
|
||||
Windows:
|
||||
|
||||
1. Install Docker
|
||||
2. Install Cygwin with Bash
|
||||
3. ...
|
||||
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
|
||||
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
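
Before running the script it can help to confirm that Docker is installed and that the daemon is reachable. The following is a minimal sketch for a POSIX shell; the helper names `require_cmd` and `docker_ready` are hypothetical and not part of openrefine-batch.sh.

```shell
# require_cmd is a hypothetical helper (not part of openrefine-batch.sh):
# it succeeds only if the named command can be found on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# docker_ready is also hypothetical: docker must be installed and
# `docker info` must be able to talk to the daemon.
docker_ready() {
  require_cmd docker && docker info >/dev/null 2>&1
}

if docker_ready; then
  echo "Docker is installed and the daemon is reachable"
else
  echo "Docker is missing, not running, or not accessible to this user"
  echo "(on systemd systems try: sudo systemctl start docker)"
fi
```

If the check fails only for a non-root user, the daemon may be running but reachable only via `sudo`.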
|
||||
|
||||
### Usage
|
||||
|
||||
```
|
||||
./openrefine-batch.sh input/ config/ output/
|
||||
mkdir -p input && cp INPUTFILES input/
|
||||
mkdir -p config && cp CONFIGFILES config/
|
||||
sudo ./openrefine-batch.sh input/ config/ OUTPUT/
|
||||
```
|
||||
|
||||
Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
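
The rule above can be sketched as a small check: root and members of the `docker` group can call the script directly, everyone else needs `sudo`. This is an illustration only; the helper `docker_prefix` is hypothetical, and it assumes the group created in the post-installation steps is named `docker`.

```shell
# docker_prefix is a hypothetical helper (not part of openrefine-batch.sh).
# Arguments: numeric user id, space-separated list of group names.
# Prints "sudo" unless the user is root or belongs to the "docker" group.
docker_prefix() {
  uid="$1"
  grouplist="$2"
  if [ "$uid" -eq 0 ]; then
    return 0
  fi
  case " $grouplist " in
    *" docker "*) return 0 ;;
  esac
  echo "sudo"
}

# Apply the check to the current user and show the resulting command line.
prefix=$(docker_prefix "$(id -u)" "$(id -nG)")
echo "run as: ${prefix:+$prefix }./openrefine-batch.sh input/ config/ OUTPUT/"
```

Note that a freshly added group membership usually only takes effect after logging out and in again.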
|
||||
|
||||
#### INPUTFILES
|
||||
* any data that [OpenRefine supports](https://github.com/OpenRefine/OpenRefine/wiki/Importers). CSV, TSV and line-based files should work out of the box. XML, JSON, fixed-width, XLSX and ODS need one additional input parameter (see the Options section below).
|
||||
* multiple slices of data may be transformed into a single file [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers)
|
||||
* you may use hard links instead of cp: `ln INPUTFILE input/`
|
||||
|
||||
#### CONFIGFILES
|
||||
* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
|
||||
|
||||
#### OUTPUT/
|
||||
* path to directory where results and temporary data should be stored
|
||||
* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
|
||||
* OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
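
The two housekeeping commands above can be wrapped in small functions. This is a sketch under assumptions: `show_results` and `clean_workspace` are hypothetical names, and the demo runs against a throwaway directory with a made-up project id rather than a real OpenRefine workspace.

```shell
# show_results (hypothetical helper): list exported TSV files in a directory.
show_results() {
  ls "$1"/*.tsv 2>/dev/null
}

# clean_workspace (hypothetical helper): delete the OpenRefine project
# directories (e.g. 1234567890123.project) but keep the exported TSV files.
clean_workspace() {
  rm -r -f "$1"/*.project
}

# Demo on a temporary directory that mimics an OUTPUT/ directory.
demo=$(mktemp -d)
mkdir "$demo/1234567890123.project"
: > "$demo/1234567890123.tsv"

clean_workspace "$demo"
show_results "$demo"

# Remove the demo directory again.
rm -r -f "$demo"
```

Keep the `.project` directories if you still want to inspect the results in the OpenRefine GUI.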
|
||||
|
||||
#### Example
|
||||
|
||||
Clone or [download the GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get the example data
|
||||
|
||||
```
|
||||
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 tsv --processQuotes=false --guessCellValueTypes=true
|
||||
sudo ./openrefine-batch.sh \
|
||||
examples/powerhouse-museum/input/ \
|
||||
examples/powerhouse-museum/config/ \
|
||||
examples/powerhouse-museum/output/ \
|
||||
examples/powerhouse-museum/cross/ \
|
||||
2G 2.7rc1 restartfile-false restarttransform-false export-true \
|
||||
tsv --processQuotes=false --guessCellValueTypes=true
|
||||
```
|
||||
|
||||
#### Options
|
||||
|
||||
```
|
||||
./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restart $inputformat $inputoptions
|
||||
sudo ./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restartfile $restarttransform $export $inputformat $inputoptions
|
||||
```
|
||||
|
||||
1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
|
||||
|
@ -56,8 +66,10 @@ clone or [download GitHub repository](https://github.com/felixlohmeier/openrefin
|
|||
3. outputdir: path to directory for exported files (and OpenRefine workspace)
|
||||
4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
|
||||
5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
|
||||
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev)
|
||||
7. restart: restart docker container after each transformation to clear memory (restart-true/restart-false)
|
||||
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc1)
|
||||
7. restartfile: restart docker after each project (e.g. input file) to clear memory (restartfile-true/restartfile-false; default: restartfile-true)
|
||||
8. restarttransform: restart docker container after each transformation (e.g. config file) to clear memory (restarttransform-true/restarttransform-false; default: restarttransform-false)
|
||||
9. export: toggle on/off (export-true/export-false; default: export-true)
|
||||
10. inputformat: (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
|
||||
11. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
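
The way positional arguments fall back to the documented defaults could be sketched with shell parameter expansion. This is an illustration, not the script's actual code: the function `parse_args` and the variable name `export_flag` are hypothetical (the script itself uses if/else blocks and a variable called `export`), and the input options are joined into one string here, whereas the script keeps them in an array.

```shell
# parse_args is a hypothetical function illustrating the defaults listed above.
parse_args() {
  inputdir="$1"
  configdir="$2"
  outputdir="$3"
  crossdir="$4"
  ram="${5:-4G}"                                  # default: 4G
  version="${6:-2.7rc1}"                          # default: 2.7rc1
  restartfile="${7:-restartfile-true}"            # default: restart after each file
  restarttransform="${8:-restarttransform-false}" # default: no restart per transform
  export_flag="${9:-export-true}"                 # default: export TSV
  # Only build --format=... when an input format was given.
  inputformat="${10:+--format=${10}}"
  # Everything after the tenth argument is passed on as input options.
  if [ "$#" -gt 10 ]; then
    shift 10
    inputoptions="$*"
  else
    inputoptions=""
  fi
}

parse_args input/ config/ output/ cross/ 2G 2.7rc1 \
  restartfile-false restarttransform-false export-true \
  tsv --processQuotes=false --guessCellValueTypes=true
echo "ram=$ram version=$version format=$inputformat options=$inputoptions"
```

Because the arguments are positional, every earlier option has to be given explicitly in order to set a later one.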
|
||||
|
||||
|
@ -67,6 +79,7 @@ inputoptions (mandatory for xml, json, fixed-width, xslx, ods):
|
|||
* `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
|
||||
|
||||
more inputoptions (optional, only together with inputformat):
|
||||
* `--projectName=PROJECTNAME` (all formats)
|
||||
* `--limit=LIMIT` (all formats), default: -1
|
||||
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
|
||||
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
|
||||
|
@ -86,80 +99,13 @@ more inputoptions (optional, only together with inputformat):
|
|||
The script uses `docker attach` to print log messages from OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
|
||||
|
||||
```
|
||||
[03:27 felix ~/openrefine-batch (master *)]$ ./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 restart-true tsv --processQuotes=false --guessCellValueTypes=true
|
||||
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
|
||||
Input files: phm-collection.tsv
|
||||
Input format: --format=tsv
|
||||
Input options: --processQuotes=false --guessCellValueTypes=true
|
||||
Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
|
||||
Transformation rules: phm-transform.json
|
||||
Cross directory: /home/felix/openrefine-batch/examples/powerhouse-museum/cross
|
||||
Cross projects:
|
||||
OpenRefine heap space: 4G
|
||||
OpenRefine version: 2.7rc1
|
||||
Docker restart: restart-true
|
||||
Docker container: 6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
Output directory: /home/felix/openrefine-batch/examples/powerhouse-museum/output
|
||||
|
||||
begin: Mo 27. Feb 03:28:45 CET 2017
|
||||
|
||||
start OpenRefine server...
|
||||
[sudo] password for felix:
|
||||
92499ecd252a8768ea5b57e0be0fb30fe6340eab67d28b1be158e0ad01f79419
|
||||
|
||||
import phm-collection.tsv...
|
||||
New project: 2325849087106
|
||||
Number of rows: 75814
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
03:28:55 00:29 10.0 122 812208
|
||||
save project and restart OpenRefine server...
|
||||
02:29:28.170 [ ProjectManager] Saving all modified projects ... (4594ms)
|
||||
02:29:36.414 [ project_utilities] Saved project '2325849087106' (8244ms)
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
f28de26b99475c4db09dbfb9ab3d445aa8127dedd08b8e729cb6b4d65c96bf38
|
||||
|
||||
begin project 2325849087106 @ Mo 27. Feb 03:29:52 CET 2017
|
||||
transform phm-transform.json...
|
||||
02:29:54.372 [ refine] GET /command/core/get-models (2815ms)
|
||||
02:29:57.525 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3153ms)
|
||||
02:29:57.640 [ refine] POST /command/core/apply-operations (115ms)
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
03:29:38 01:07 19.6 128 1588152
|
||||
save project and restart OpenRefine server...
|
||||
02:30:46.280 [ ProjectManager] Saving all modified projects ... (48640ms)
|
||||
02:30:53.404 [ project_utilities] Saved project '2325849087106' (7124ms)
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
186b0bda0ca542642ce1875d55f8341648e05248eb359541b80191832783f40b
|
||||
export to file 2325849087106.tsv...
|
||||
02:31:08.149 [ refine] GET /command/core/get-models (4039ms)
|
||||
02:31:11.485 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3336ms)
|
||||
02:31:11.756 [ refine] GET /command/core/get-all-project-metadata (271ms)
|
||||
02:31:11.774 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (18ms)
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
03:30:55 01:59 11.6 28.6 942900
|
||||
restart OpenRefine server...
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
eb0f91675b5fbf21b4c17cceb6d93146876ea19316b7ab44af78a36f64ff1037
|
||||
finished project 2325849087106 @ Mo 27. Feb 03:33:11 CET 2017
|
||||
|
||||
output (number of lines / size in bytes):
|
||||
167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/2325849087106.tsv
|
||||
|
||||
cleanup...
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
||||
|
||||
finish: Mo 27. Feb 03:33:17 CET 2017
|
||||
```
|
||||
|
||||
### Todo
|
||||
|
||||
- [ ] howto for installation on Mac and Windows
|
||||
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
|
||||
- [ ] use getopts for parsing of arguments
|
||||
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
|
||||
- [ ] add option to delete openrefine projects in output directory
|
||||
- [ ] provide more example data from other OpenRefine tutorials
|
||||
|
||||
|
|
|
@ -7,14 +7,20 @@ Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Dat
|
|||
## Usage
|
||||
|
||||
```
|
||||
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
|
||||
sudo ./openrefine-batch.sh \
|
||||
examples/powerhouse-museum/input/ \
|
||||
examples/powerhouse-museum/config/ \
|
||||
examples/powerhouse-museum/output/ \
|
||||
examples/powerhouse-museum/cross/ \
|
||||
2G 2.7rc1 restartfile-false restarttransform-false export-true \
|
||||
tsv --processQuotes=false --guessCellValueTypes=true
|
||||
```
|
||||
|
||||
## phm-collection.tsv
|
||||
## input/phm-collection.tsv
|
||||
|
||||
* The [Powerhouse Museum in Sydney](https://maas.museum/powerhouse-museum/) provides a freely available metadata export of its collection on its website. The collection metadata has been retrieved from the website freeyourmetadata.org that has redistributed the data: http://data.freeyourmetadata.org/powerhouse-museum/
|
||||
|
||||
## phm-tutorial.json
|
||||
## config/phm-tutorial.json
|
||||
|
||||
* All steps from the tutorial above, extracted from the history of the processed tutorial project, retrieved from the website freeyourmetadata.org: [phm-collection-cleaned.google-refine.tar.gz](http://data.freeyourmetadata.org/powerhouse-museum/phm-collection-cleaned.google-refine.tar.gz)
|
||||
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
#!/bin/bash
|
||||
# openrefine-batch.sh, Felix Lohmeier, v0.5, 27.02.2017
|
||||
# openrefine-batch.sh, Felix Lohmeier, v0.6, 01.03.2017
|
||||
# https://github.com/felixlohmeier/openrefine-batch
|
||||
|
||||
# user input
|
||||
|
@ -49,128 +49,177 @@ if [ -z "$6" ]
|
|||
fi
|
||||
if [ -z "$7" ]
|
||||
then
|
||||
restart="restart-true"
|
||||
restartfile="restartfile-true"
|
||||
else
|
||||
restart="$7"
|
||||
restartfile="$7"
|
||||
fi
|
||||
if [ -z "$8" ]
|
||||
then
|
||||
inputformat=""
|
||||
restarttransform="restarttransform-false"
|
||||
else
|
||||
inputformat="--format=${8}"
|
||||
restarttransform="$8"
|
||||
fi
|
||||
if [ -z "$9" ]
|
||||
then
|
||||
export="export-true"
|
||||
else
|
||||
export="$9"
|
||||
fi
|
||||
if [ -z "${10}" ]
|
||||
then
|
||||
inputformat=""
|
||||
else
|
||||
inputformat="--format=${10}"
|
||||
fi
|
||||
if [ -z "${11}" ]
|
||||
then
|
||||
inputoptions=""
|
||||
else
|
||||
inputoptions=( "$9" "${10}" "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" )
|
||||
inputoptions=( "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" "${21}" "${22}" "${23}" "${24}" "${25}" )
|
||||
fi
|
||||
|
||||
# variables
|
||||
uuid=$(cat /proc/sys/kernel/random/uuid)
|
||||
echo "Input directory: $inputdir"
|
||||
echo "Input files: ${inputfiles[@]}"
|
||||
echo "Input format: $inputformat"
|
||||
echo "Input options: ${inputoptions[@]}"
|
||||
echo "Config directory: $configdir"
|
||||
echo "Transformation rules: ${jsonfiles[@]}"
|
||||
echo "Cross directory: $crossdir"
|
||||
echo "Cross projects: ${crossprojects[@]}"
|
||||
echo "OpenRefine heap space: $ram"
|
||||
echo "OpenRefine version: $version"
|
||||
echo "Docker container: $uuid"
|
||||
echo "Docker restart: $restart"
|
||||
echo "Output directory: $outputdir"
|
||||
echo "Input directory: $inputdir"
|
||||
echo "Input files: ${inputfiles[@]}"
|
||||
echo "Input format: $inputformat"
|
||||
echo "Input options: ${inputoptions[@]}"
|
||||
echo "Config directory: $configdir"
|
||||
echo "Transformation rules: ${jsonfiles[@]}"
|
||||
echo "Cross directory: $crossdir"
|
||||
echo "Cross projects: ${crossprojects[@]}"
|
||||
echo "OpenRefine heap space: $ram"
|
||||
echo "OpenRefine version: $version"
|
||||
echo "OpenRefine workspace: $outputdir"
|
||||
echo "Export TSV to workspace: $export"
|
||||
echo "Docker container name: $uuid"
|
||||
echo "restart after file: $restartfile"
|
||||
echo "restart after transform: $restarttransform"
|
||||
echo ""
|
||||
|
||||
# time
|
||||
echo "begin: $(date)"
|
||||
echo ""
|
||||
|
||||
# launch openrefine server
|
||||
# launch server
|
||||
echo "start OpenRefine server..."
|
||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
# wait until server is available
|
||||
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
# show server logs
|
||||
docker attach ${uuid} &
|
||||
echo ""
|
||||
|
||||
# import all files
|
||||
if [ -n "$inputfiles" ]; then
|
||||
# import all files
|
||||
echo "=== IMPORT ==="
|
||||
echo ""
|
||||
for inputfile in "${inputfiles[@]}" ; do
|
||||
echo "import ${inputfile}..."
|
||||
# import
|
||||
sudo docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
|
||||
# show server logs
|
||||
sudo docker attach ${uuid} &
|
||||
# statistics
|
||||
# run client with input command
|
||||
docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
|
||||
# show statistics
|
||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||
# restart server to clear memory
|
||||
echo "save project and restart OpenRefine server..."
|
||||
sudo docker stop -t=5000 ${uuid}
|
||||
sudo docker rm ${uuid}
|
||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
echo ""
|
||||
# restart server to clear memory
|
||||
if [ "$restartfile" = "restartfile-true" ]; then
|
||||
echo "save project and restart OpenRefine server..."
|
||||
docker stop -t=5000 ${uuid}
|
||||
docker rm ${uuid}
|
||||
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
docker attach ${uuid} &
|
||||
echo ""
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
# get project ids
|
||||
projects=($(sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | cut -c 2-14))
|
||||
echo "=== TRANSFORM / EXPORT ==="
|
||||
echo ""
|
||||
|
||||
# copy existing projects for use with OpenRefine cross function
|
||||
# get project ids
|
||||
echo "get project ids..."
|
||||
projects=($(docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | tee ${outputdir}/projects.tmp | cut -c 2-14))
|
||||
cat ${outputdir}/projects.tmp && rm ${outputdir}/projects.tmp
|
||||
echo ""
|
||||
|
||||
# provide additional OpenRefine projects for cross function
|
||||
if [ -n "$crossprojects" ]; then
|
||||
echo "provide additional projects for cross function..."
|
||||
# copy given projects to workspace
|
||||
rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir
|
||||
# restart server to advertise copied projects
|
||||
echo "restart OpenRefine server to advertise copied projects..."
|
||||
docker stop -t=5000 ${uuid}
|
||||
docker rm ${uuid}
|
||||
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
docker attach ${uuid} &
|
||||
echo ""
|
||||
fi
|
||||
|
||||
# loop for all projects
|
||||
for projectid in "${projects[@]}" ; do
|
||||
echo "begin project $projectid @ $(date)"
|
||||
# show server logs
|
||||
sudo docker attach ${uuid} &
|
||||
# time
|
||||
echo "--- begin project $projectid @ $(date) ---"
|
||||
echo ""
|
||||
|
||||
# apply transformation rules
|
||||
if [ -n "$jsonfiles" ]; then
|
||||
# apply transformation rules
|
||||
for jsonfile in "${jsonfiles[@]}" ; do
|
||||
echo "transform ${jsonfile}..."
|
||||
# apply
|
||||
sudo docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
|
||||
# statistics
|
||||
# run client with apply command
|
||||
docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
|
||||
# show statistics
|
||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||
if [ "$restart" = "restart-true" ]; then
|
||||
# restart server to clear memory
|
||||
# restart server to clear memory
|
||||
if [ "$restarttransform" = "restarttransform-true" ]; then
|
||||
echo "save project and restart OpenRefine server..."
|
||||
sudo docker stop -t=5000 ${uuid}
|
||||
sudo docker rm ${uuid}
|
||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
sudo docker attach ${uuid} &
|
||||
docker stop -t=5000 ${uuid}
|
||||
docker rm ${uuid}
|
||||
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
docker attach ${uuid} &
|
||||
fi
|
||||
echo ""
|
||||
done
|
||||
fi
|
||||
# export files
|
||||
echo "export to file ${projectid}.tsv..."
|
||||
# export
|
||||
sudo docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
|
||||
# statistics
|
||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||
# restart server to clear memory
|
||||
echo "restart OpenRefine server..."
|
||||
sudo docker stop -t=5000 ${uuid}
|
||||
sudo docker rm ${uuid}
|
||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
|
||||
# export project to workspace
|
||||
if [ "$export" = "export-true" ]; then
|
||||
echo "export to file ${projectid}.tsv..."
|
||||
# run client with export command
|
||||
docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
|
||||
# show statistics
|
||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||
# restart server to clear memory
|
||||
if [ "$restartfile" = "restartfile-true" ]; then
|
||||
echo "restart OpenRefine server..."
|
||||
docker stop -t=5000 ${uuid}
|
||||
docker rm ${uuid}
|
||||
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||
docker attach ${uuid} &
|
||||
fi
|
||||
echo ""
|
||||
fi
|
||||
|
||||
# time
|
||||
echo "finished project $projectid @ $(date)"
|
||||
echo "--- finished project $projectid @ $(date) ---"
|
||||
echo ""
|
||||
done
|
||||
|
||||
# list output files
|
||||
echo "output (number of lines / size in bytes):"
|
||||
wc -c -l ${outputdir}/*.tsv
|
||||
echo ""
|
||||
if [ "$export" = "export-true" ]; then
|
||||
echo "output (number of lines / size in bytes):"
|
||||
wc -c -l ${outputdir}/*.tsv
|
||||
echo ""
|
||||
fi
|
||||
|
||||
# cleanup
|
||||
echo "cleanup..."
|
||||
sudo docker stop -t=5000 ${uuid}
|
||||
sudo docker rm ${uuid}
|
||||
docker stop -t=5000 ${uuid}
|
||||
docker rm ${uuid}
|
||||
rm -r -f ${outputdir}/workspace*.json
|
||||
echo ""
|
||||
|
||||
|
|