release v0.6

This commit is contained in:
Felix Lohmeier 2017-03-01 17:48:13 +01:00
parent 2f0d8fb080
commit 4ee785ecf3
3 changed files with 164 additions and 163 deletions

README.md

@@ -1,6 +1,6 @@
## OpenRefine batch processing (openrefine-batch.sh)
Shell script to run OpenRefine on Windows, Linux or Mac in batch mode (import, transform, export). This bash script automatically...
Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
1. imports all data from a given directory into OpenRefine
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
@@ -10,45 +10,55 @@ It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/fel
### Typical Workflow
- Step 1: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
- Step 2: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files.
- **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. transform.json; a minimal example is sketched below).
- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files into OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to TSV files.
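A minimal transform.json might look like the following; this is only a sketch to show the shape of the extracted json code, with a single column-rename operation and made-up column names:

```
[
  {
    "op": "core/column-rename",
    "description": "Rename column Object Title to title",
    "oldColumnName": "Object Title",
    "newColumnName": "title"
  }
]
```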
### Install
Linux:
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
2. Open Terminal and enter `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
Mac:
1. Install Docker
2. ...
Windows:
1. Install Docker
2. Install Cygwin with Bash
3. ...
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
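To check that Docker itself works before running the script, the usual smoke test from the Docker documentation can be used (optional, not part of the script):

```
sudo docker run hello-world
```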
### Usage
```
./openrefine-batch.sh input/ config/ output/
mkdir -p input && cp INPUTFILES input/
mkdir -p config && cp CONFIGFILES config/
sudo ./openrefine-batch.sh input/ config/ OUTPUT/
```
Why `sudo`? By default only root can access the Unix socket of the Docker daemon, so non-root users have to use `sudo`. If you created a docker group as described in the [post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) (sketched below), you may call the script without `sudo`.
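The group setup boils down to two commands (a sketch of the documented post-install steps; you need to log out and back in afterwards for the group change to take effect):

```
# create the docker group (it may already exist) and add the current user to it
sudo groupadd docker
sudo usermod -aG docker $USER
```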
#### INPUTFILES
* any data that [OpenRefine supports](https://github.com/OpenRefine/OpenRefine/wiki/Importers). CSV, TSV and line-based files should work out of the box. XML, JSON, fixed-width, XLSX and ODS need one additional input parameter (see the options below).
* multiple slices of data may be transformed into a single file [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers); see the example below
* you may use hard links instead of cp: `ln INPUTFILE input/`
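For example, several slices could be packed into one archive inside the input directory (the file names here are placeholders):

```
# pack multiple slices into a single archive so they are imported as one OpenRefine project
tar czf input/slices.tar.gz part1.tsv part2.tsv part3.tsv
```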
#### CONFIGFILES
* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
#### OUTPUT/
* path to directory where results and temporary data should be stored
* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
* OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
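To have that look in a browser, the same OpenRefine docker image the script uses can be started manually with the output directory mounted as workspace (a sketch derived from the command inside the script; the published port, version and heap size are assumptions):

```
# serve the OUTPUT/ workspace on http://localhost:3333
sudo docker run -d -p 3333:3333 -v $(pwd)/OUTPUT:/data felixlohmeier/openrefine:2.7rc1 -i 0.0.0.0 -m 4G -d /data
```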
#### Example
Clone or [download the GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get the example data.
```
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 tsv --processQuotes=false --guessCellValueTypes=true
sudo ./openrefine-batch.sh \
examples/powerhouse-museum/input/ \
examples/powerhouse-museum/config/ \
examples/powerhouse-museum/output/ \
examples/powerhouse-museum/cross/ \
2G 2.7rc1 restartfile-false restarttransform-false export-true \
tsv --processQuotes=false --guessCellValueTypes=true
```
#### Options
```
./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restart $inputformat $inputoptions
sudo ./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restartfile $restarttransform $export $inputformat $inputoptions
```
1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
@@ -56,8 +66,10 @@ clone or [download GitHub repository](https://github.com/felixlohmeier/openrefin
3. outputdir: path to directory for exported files (and OpenRefine workspace)
4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev)
7. restart: restart docker container after each transformation to clear memory (restart-true/restart-false)
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc1)
7. restartfile: restart the docker container after each project (e.g. input file) to clear memory (restartfile-true/restartfile-false; default: restartfile-true)
8. restarttransform: restart the docker container after each transformation (e.g. config file) to clear memory (restarttransform-true/restarttransform-false; default: restarttransform-false)
9. export: toggle export of each project to a TSV file on/off (export-true/export-false; default: export-true)
10. inputformat: file format (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
11. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
@@ -67,6 +79,7 @@ inputoptions (mandatory for xml, json, fixed-width, xlsx, ods):
* `--sheets=SHEETS` (xlsx, ods): provide sheet numbers separated by commas (e.g. 0,1); default: 0 (first sheet)
more inputoptions (optional, only together with inputformat):
* `--projectName=PROJECTNAME` (all formats)
* `--limit=LIMIT` (all formats), default: -1
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
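Putting the positional parameters together, a call for an xlsx file with some of the options above might look like this (paths, sheet number and project name are placeholders):

```
sudo ./openrefine-batch.sh \
input/ config/ output/ cross/ \
4G 2.7rc1 restartfile-true restarttransform-false export-true \
xlsx --sheets=0 --projectName=mydata
```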
@@ -86,80 +99,13 @@ more inputoptions (optional, only together with inputformat):
The script uses `docker attach` to print log messages from the OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
```
[03:27 felix ~/openrefine-batch (master *)]$ ./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 restart-true tsv --processQuotes=false --guessCellValueTypes=true
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
Input files: phm-collection.tsv
Input format: --format=tsv
Input options: --processQuotes=false --guessCellValueTypes=true
Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
Transformation rules: phm-transform.json
Cross directory: /home/felix/openrefine-batch/examples/powerhouse-museum/cross
Cross projects:
OpenRefine heap space: 4G
OpenRefine version: 2.7rc1
Docker restart: restart-true
Docker container: 6b7eb36f-fc72-4040-b135-acee36948c13
Output directory: /home/felix/openrefine-batch/examples/powerhouse-museum/output
begin: Mo 27. Feb 03:28:45 CET 2017
start OpenRefine server...
[sudo] password for felix:
92499ecd252a8768ea5b57e0be0fb30fe6340eab67d28b1be158e0ad01f79419
import phm-collection.tsv...
New project: 2325849087106
Number of rows: 75814
STARTED ELAPSED %MEM %CPU RSS
03:28:55 00:29 10.0 122 812208
save project and restart OpenRefine server...
02:29:28.170 [ ProjectManager] Saving all modified projects ... (4594ms)
02:29:36.414 [ project_utilities] Saved project '2325849087106' (8244ms)
6b7eb36f-fc72-4040-b135-acee36948c13
6b7eb36f-fc72-4040-b135-acee36948c13
f28de26b99475c4db09dbfb9ab3d445aa8127dedd08b8e729cb6b4d65c96bf38
begin project 2325849087106 @ Mo 27. Feb 03:29:52 CET 2017
transform phm-transform.json...
02:29:54.372 [ refine] GET /command/core/get-models (2815ms)
02:29:57.525 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3153ms)
02:29:57.640 [ refine] POST /command/core/apply-operations (115ms)
STARTED ELAPSED %MEM %CPU RSS
03:29:38 01:07 19.6 128 1588152
save project and restart OpenRefine server...
02:30:46.280 [ ProjectManager] Saving all modified projects ... (48640ms)
02:30:53.404 [ project_utilities] Saved project '2325849087106' (7124ms)
6b7eb36f-fc72-4040-b135-acee36948c13
6b7eb36f-fc72-4040-b135-acee36948c13
186b0bda0ca542642ce1875d55f8341648e05248eb359541b80191832783f40b
export to file 2325849087106.tsv...
02:31:08.149 [ refine] GET /command/core/get-models (4039ms)
02:31:11.485 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3336ms)
02:31:11.756 [ refine] GET /command/core/get-all-project-metadata (271ms)
02:31:11.774 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (18ms)
STARTED ELAPSED %MEM %CPU RSS
03:30:55 01:59 11.6 28.6 942900
restart OpenRefine server...
6b7eb36f-fc72-4040-b135-acee36948c13
6b7eb36f-fc72-4040-b135-acee36948c13
eb0f91675b5fbf21b4c17cceb6d93146876ea19316b7ab44af78a36f64ff1037
finished project 2325849087106 @ Mo 27. Feb 03:33:11 CET 2017
output (number of lines / size in bytes):
167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/2325849087106.tsv
cleanup...
6b7eb36f-fc72-4040-b135-acee36948c13
6b7eb36f-fc72-4040-b135-acee36948c13
finish: Mo 27. Feb 03:33:17 CET 2017
```
### Todo
- [ ] howto for installation on Mac and Windows
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
- [ ] use getopts for parsing of arguments
- [ ] add option to delete openrefine projects in output directory
- [ ] provide more example data from other OpenRefine tutorials

examples/powerhouse-museum/README.md

@@ -7,14 +7,20 @@ Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Dat
## Usage
```
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
sudo ./openrefine-batch.sh \
examples/powerhouse-museum/input/ \
examples/powerhouse-museum/config/ \
examples/powerhouse-museum/output/ \
examples/powerhouse-museum/cross/ \
2G 2.7rc1 restartfile-false restarttransform-false export-true \
tsv --processQuotes=false --guessCellValueTypes=true
```
## phm-collection.tsv
## input/phm-collection.tsv
* The [Powerhouse Museum in Sydney](https://maas.museum/powerhouse-museum/) provides a freely available metadata export of its collection on its website. The collection metadata has been retrieved from the website freeyourmetadata.org that has redistributed the data: http://data.freeyourmetadata.org/powerhouse-museum/
## phm-tutorial.json
## config/phm-tutorial.json
* All steps from the tutorial above, extracted from the history of the processed tutorial project, retrieved from the website freeyourmetadata.org: [phm-collection-cleaned.google-refine.tar.gz](http://data.freeyourmetadata.org/powerhouse-museum/phm-collection-cleaned.google-refine.tar.gz)

openrefine-batch.sh

@@ -1,5 +1,5 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v0.5, 27.02.2017
# openrefine-batch.sh, Felix Lohmeier, v0.6, 01.03.2017
# https://github.com/felixlohmeier/openrefine-batch
# user input
@@ -49,128 +49,177 @@ if [ -z "$6" ]
fi
if [ -z "$7" ]
then
restart="restart-true"
restartfile="restartfile-true"
else
restart="$7"
restartfile="$7"
fi
if [ -z "$8" ]
then
inputformat=""
restarttransform="restarttransform-false"
else
inputformat="--format=${8}"
restarttransform="$8"
fi
if [ -z "$9" ]
then
export="export-true"
else
export="$9"
fi
if [ -z "${10}" ]
then
inputformat=""
else
inputformat="--format=${10}"
fi
if [ -z "${11}" ]
then
inputoptions=""
else
inputoptions=( "$9" "${10}" "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" )
inputoptions=( "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" "${21}" "${22}" "${23}" "${24}" "${25}" )
fi
# variables
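# a random uuid serves as a unique name for the docker container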
uuid=$(cat /proc/sys/kernel/random/uuid)
echo "Input directory: $inputdir"
echo "Input files: ${inputfiles[@]}"
echo "Input format: $inputformat"
echo "Input options: ${inputoptions[@]}"
echo "Config directory: $configdir"
echo "Transformation rules: ${jsonfiles[@]}"
echo "Cross directory: $crossdir"
echo "Cross projects: ${crossprojects[@]}"
echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version"
echo "Docker container: $uuid"
echo "Docker restart: $restart"
echo "Output directory: $outputdir"
echo "Input directory: $inputdir"
echo "Input files: ${inputfiles[@]}"
echo "Input format: $inputformat"
echo "Input options: ${inputoptions[@]}"
echo "Config directory: $configdir"
echo "Transformation rules: ${jsonfiles[@]}"
echo "Cross directory: $crossdir"
echo "Cross projects: ${crossprojects[@]}"
echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version"
echo "OpenRefine workspace: $outputdir"
echo "Export TSV to workspace: $export"
echo "Docker container name: $uuid"
echo "restart after file: $restartfile"
echo "restart after transform: $restarttransform"
echo ""
# time
echo "begin: $(date)"
echo ""
# launch openrefine server
# launch server
echo "start OpenRefine server..."
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
# wait until server is available
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# show server logs
docker attach ${uuid} &
echo ""
# import all files
if [ -n "$inputfiles" ]; then
# import all files
echo "=== IMPORT ==="
echo ""
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# import
sudo docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show server logs
sudo docker attach ${uuid} &
# statistics
# run client with input command
docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
# show statistics
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
# restart server to clear memory
echo "save project and restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
echo ""
# restart server to clear memory
if [ "$restartfile" = "restartfile-true" ]; then
echo "save project and restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
echo ""
fi
done
fi
# get project ids
projects=($(sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | cut -c 2-14))
echo "=== TRANSFORM / EXPORT ==="
echo ""
# copy existing projects for use with OpenRefine cross function
# get project ids
echo "get project ids..."
projects=($(docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | tee ${outputdir}/projects.tmp | cut -c 2-14))
cat ${outputdir}/projects.tmp && rm ${outputdir}/projects.tmp
echo ""
# provide additional OpenRefine projects for cross function
if [ -n "$crossprojects" ]; then
echo "provide additional projects for cross function..."
# copy given projects to workspace
rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir
# restart server to advertise copied projects
echo "restart OpenRefine server to advertise copied projects..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
echo ""
fi
# loop for all projects
for projectid in "${projects[@]}" ; do
echo "begin project $projectid @ $(date)"
# show server logs
sudo docker attach ${uuid} &
# time
echo "--- begin project $projectid @ $(date) ---"
echo ""
# apply transformation rules
if [ -n "$jsonfiles" ]; then
# apply transformation rules
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# apply
sudo docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
# statistics
# run client with apply command
docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
# show statistics
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
if [ "$restart" = "restart-true" ]; then
# restart server to clear memory
# restart server to clear memory
if [ "$restarttransform" = "restarttransform-true" ]; then
echo "save project and restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
sudo docker attach ${uuid} &
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
fi
echo ""
done
fi
# export files
echo "export to file ${projectid}.tsv..."
# export
sudo docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
# statistics
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
# restart server to clear memory
echo "restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# export project to workspace
if [ "$export" = "export-true" ]; then
echo "export to file ${projectid}.tsv..."
# run client with export command
docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
# show statistics
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
# restart server to clear memory
if [ "$restartfile" = "restartfile-true" ]; then
echo "restart OpenRefine server..."
docker stop -t=5000 ${uuid}
docker rm ${uuid}
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
docker attach ${uuid} &
fi
echo""
fi
# time
echo "finished project $projectid @ $(date)"
echo "--- finished project $projectid @ $(date) ---"
echo ""
done
# list output files
echo "output (number of lines / size in bytes):"
wc -c -l ${outputdir}/*.tsv
echo ""
if [ "$export" = "export-true" ]; then
echo "output (number of lines / size in bytes):"
wc -c -l ${outputdir}/*.tsv
echo ""
fi
# cleanup
echo "cleanup..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
docker stop -t=5000 ${uuid}
docker rm ${uuid}
rm -r -f ${outputdir}/workspace*.json
echo ""