mirror of
https://github.com/opencultureconsulting/openrefine-batch.git
synced 2025-03-30 00:00:25 +01:00
release v0.6
parent
2f0d8fb080
commit
4ee785ecf3
128
README.md
@ -1,6 +1,6 @@
|
|||||||
## OpenRefine batch processing (openrefine-batch.sh)
|
## OpenRefine batch processing (openrefine-batch.sh)
|
||||||
|
|
||||||
Shell script to run OpenRefine on Windows, Linux or Mac in batch mode (import, transform, export). This bash script automatically...
|
Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
|
||||||
|
|
||||||
1. imports all data from a given directory into OpenRefine
|
1. imports all data from a given directory into OpenRefine
|
||||||
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
|
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
|
||||||
@ -10,45 +10,55 @@ It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/fel
|
|||||||
|
|
||||||
### Typical Workflow
|
### Typical Workflow
|
||||||
|
|
||||||
- Step 1: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
|
- **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
|
||||||
- Step 2: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files.
|
- **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects in TSV-files.
|
||||||
|
|
||||||
### Install
|
### Install
|
||||||
|
|
||||||
Linux:
|
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux) and **a)** [configure Docker to start on boot](https://docs.docker.com/engine/installation/linux/linux-postinstall/#configure-docker-to-start-on-boot) or **b)** start Docker on demand each time you use the script: `sudo systemctl start docker`
|
||||||
|
2. Download the script and grant file permissions to execute: `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
|
||||||
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
|
|
||||||
2. Open Terminal and enter `wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh`
|
|
||||||
|
|
||||||
Mac:
|
|
||||||
|
|
||||||
1. Install Docker
|
|
||||||
2. ...
|
|
||||||
|
|
||||||
Windows:
|
|
||||||
|
|
||||||
1. Install Docker
|
|
||||||
2. Install Cygwin with Bash
|
|
||||||
3. ...
|
|
||||||
|
|
||||||
### Usage
|
### Usage
|
||||||
|
|
||||||
```
|
```
|
||||||
./openrefine-batch.sh input/ config/ output/
|
mkdir -p input && cp INPUTFILES input/
|
||||||
|
mkdir -p config && cp CONFIGFILES config/
|
||||||
|
sudo ./openrefine-batch.sh input/ config/ OUTPUT/
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Why `sudo`? Non-root users can only access the Unix socket of the Docker daemon by using `sudo`. If you created a Docker group in [Post-installation steps for Linux](https://docs.docker.com/engine/installation/linux/linux-postinstall/) then you may call the script without `sudo`.
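Whether the `sudo` prefix is needed can be checked before calling the script. The following is a minimal sketch, not part of openrefine-batch.sh: `docker_prefix` is a hypothetical helper, and it assumes the group is named `docker`, as in the Docker post-installation steps.

```shell
# Sketch only: docker_prefix is a hypothetical helper, not part of openrefine-batch.sh.
# Arguments: numeric user id, space-separated list of group names.
# Prints "sudo" if the user is neither root nor a member of the docker group.
docker_prefix() {
  _uid=$1
  _groups=" $2 "
  if [ "$_uid" -eq 0 ]; then
    echo ""
    return
  fi
  case "$_groups" in
    *" docker "*) echo "" ;;
    *) echo "sudo" ;;
  esac
}

# check the current user
prefix=$(docker_prefix "$(id -u)" "$(id -nG)")
echo "call the script as: $prefix ./openrefine-batch.sh input/ config/ OUTPUT/"
```

Group membership only takes effect after logging in again, so a freshly added user may still see `sudo` here.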
|
||||||
|
|
||||||
|
#### INPUTFILES
|
||||||
|
* any data that [OpenRefine supports](https://github.com/OpenRefine/OpenRefine/wiki/Importers). CSV, TSV and line-based files should work out of the box. XML, JSON, fixed-width, XLSX and ODS need one additional input parameter (see chapter options below).
|
||||||
|
* multiple slices of data may be transformed into a single file [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers)
|
||||||
|
* you may use hard links instead of cp: `ln INPUTFILE input/`
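A short sketch of what `ln` without `-s` does. The temporary directory and file names are made up for illustration. Both names point to the same inode, so no data is copied. A symbolic link, by contrast, would probably not resolve inside the container if its target lies outside the mounted input directory.

```shell
# Sketch only: a temporary directory stands in for your working directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/input"
printf 'a\tb\n' > "$workdir/data.tsv"

# create a hard link instead of a copy
ln "$workdir/data.tsv" "$workdir/input/data.tsv"

# both names share one inode number
inode_src=$(ls -i "$workdir/data.tsv" | awk '{print $1}')
inode_dst=$(ls -i "$workdir/input/data.tsv" | awk '{print $1}')
echo "$inode_src $inode_dst"
```

Hard links only work within one filesystem. Use `cp` if the input directory is on a different disk.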
|
||||||
|
|
||||||
|
#### CONFIGFILES
|
||||||
|
* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
|
||||||
|
|
||||||
|
#### OUTPUT/
|
||||||
|
* path to directory where results and temporary data should be stored
|
||||||
|
* Transformed data will be stored in this directory in TSV (tab-separated values) format. Show results: `ls OUTPUT/*.tsv`
|
||||||
|
* OpenRefine stores data in directories like "1234567890123.project". You may have a look at the results by starting OpenRefine with this workspace. Delete the directories if you do not need them: `rm -r -f OUTPUT/*.project`
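The two commands above can be tried safely on a scratch directory first. This is a sketch under assumptions: a temporary directory stands in for OUTPUT/, and the project id and file content are invented.

```shell
# Sketch only: a temporary directory stands in for OUTPUT/.
outputdir=$(mktemp -d)
mkdir -p "$outputdir/1234567890123.project"
printf 'id\tname\n1\texample\n' > "$outputdir/1234567890123.tsv"

# show exported results
ls "$outputdir"/*.tsv

# delete the OpenRefine project directories, keep the exported TSV files
rm -r -f "$outputdir"/*.project
```

The glob `*.project` matches only the project directories, so the exported `.tsv` files stay in place.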
|
||||||
|
|
||||||
#### Example
|
#### Example
|
||||||
|
|
||||||
clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
|
clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
|
||||||
|
|
||||||
```
|
```
|
||||||
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 tsv --processQuotes=false --guessCellValueTypes=true
|
sudo ./openrefine-batch.sh \
|
||||||
|
examples/powerhouse-museum/input/ \
|
||||||
|
examples/powerhouse-museum/config/ \
|
||||||
|
examples/powerhouse-museum/output/ \
|
||||||
|
examples/powerhouse-museum/cross/ \
|
||||||
|
2G 2.7rc1 restartfile-false restarttransform-false export-true \
|
||||||
|
tsv --processQuotes=false --guessCellValueTypes=true
|
||||||
```
|
```
|
||||||
|
|
||||||
#### Options
|
#### Options
|
||||||
|
|
||||||
```
|
```
|
||||||
./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restart $inputformat $inputoptions
|
sudo ./openrefine-batch.sh $inputdir $configdir $outputdir $crossdir $ram $version $restartfile $restarttransform $export $inputformat $inputoptions
|
||||||
```
|
```
|
||||||
|
|
||||||
1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
|
1. inputdir: path to directory with source files (multiple files may be imported into a single project [by providing a zip or tar.gz archive](https://github.com/OpenRefine/OpenRefine/wiki/Importers))
|
||||||
@ -56,8 +66,10 @@ clone or [download GitHub repository](https://github.com/felixlohmeier/openrefin
|
|||||||
3. outputdir: path to directory for exported files (and OpenRefine workspace)
|
3. outputdir: path to directory for exported files (and OpenRefine workspace)
|
||||||
4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
|
4. crossdir: path to directory with additional OpenRefine projects (will be copied to workspace before transformation step to support the [cross function](https://github.com/OpenRefine/OpenRefine/wiki/GREL-Other-Functions#crosscell-c-string-projectname-string-columnname))
|
||||||
5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
|
5. ram: maximum RAM for OpenRefine java heap space (default: 4G)
|
||||||
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev)
|
6. version: OpenRefine version (2.7rc1, 2.6rc2, 2.6rc1, dev; default: 2.7rc1)
|
||||||
7. restart: restart docker container after each transformation to clear memory (restart-true/restart-false)
|
7. restartfile: restart docker after each project (e.g. input file) to clear memory (restartfile-true/restartfile-false; default: restartfile-true)
|
||||||
|
8. restarttransform: restart docker container after each transformation (e.g. config file) to clear memory (restarttransform-true/restarttransform-false; default: restarttransform-false)
|
||||||
|
9. export: toggle on/off (export-true/export-false; default: export-true)
|
||||||
8. inputformat: (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
|
8. inputformat: (csv, tsv, xml, json, line-based, fixed-width, xlsx, ods)
|
||||||
9. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
|
9. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
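The positional arguments and their defaults can be summarised in a few lines of shell. This is only a sketch: `parse_args` and `export_mode` are hypothetical names, and it uses parameter expansion with defaults rather than the `if [ -z ... ]` blocks of the script itself. The default values are the ones listed above.

```shell
# Sketch only: parse_args is a hypothetical helper, not part of openrefine-batch.sh.
parse_args() {
  inputdir=${1:-}
  configdir=${2:-}
  outputdir=${3:-}
  crossdir=${4:-}
  ram=${5:-4G}
  version=${6:-2.7rc1}
  restartfile=${7:-restartfile-true}
  restarttransform=${8:-restarttransform-false}
  export_mode=${9:-export-true}
  # only build the --format option if an input format was given
  inputformat=${10:+--format=${10}}
  # everything after the tenth argument is passed on as input options
  if [ "$#" -gt 10 ]; then shift 10; else set --; fi
  inputoptions=$*
}

parse_args input/ config/ OUTPUT/
echo "ram=$ram version=$version $restartfile $restarttransform $export_mode"
```

Because the arguments are positional, all earlier values must be given explicitly as soon as you want to set a later one such as inputformat.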
|
||||||
|
|
||||||
@ -67,6 +79,7 @@ inputoptions (mandatory for xml, json, fixed-width, xslx, ods):
|
|||||||
* `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
|
* `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
|
||||||
|
|
||||||
more inputoptions (optional, only together with inputformat):
|
more inputoptions (optional, only together with inputformat):
|
||||||
|
* `--projectName=PROJECTNAME` (all formats)
|
||||||
* `--limit=LIMIT` (all formats), default: -1
|
* `--limit=LIMIT` (all formats), default: -1
|
||||||
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
|
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
|
||||||
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
|
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
|
||||||
@ -86,80 +99,13 @@ more inputoptions (optional, only together with inputformat):
|
|||||||
The script uses `docker attach` to print log messages from OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
|
The script uses `docker attach` to print log messages from OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
|
||||||
|
|
||||||
```
|
```
|
||||||
[03:27 felix ~/openrefine-batch (master *)]$ ./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ examples/powerhouse-museum/cross/ 4G 2.7rc1 restart-true tsv --processQuotes=false --guessCellValueTypes=true
|
|
||||||
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
|
|
||||||
Input files: phm-collection.tsv
|
|
||||||
Input format: --format=tsv
|
|
||||||
Input options: --processQuotes=false --guessCellValueTypes=true
|
|
||||||
Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
|
|
||||||
Transformation rules: phm-transform.json
|
|
||||||
Cross directory: /home/felix/openrefine-batch/examples/powerhouse-museum/cross
|
|
||||||
Cross projects:
|
|
||||||
OpenRefine heap space: 4G
|
|
||||||
OpenRefine version: 2.7rc1
|
|
||||||
Docker restart: restart-true
|
|
||||||
Docker container: 6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
Output directory: /home/felix/openrefine-batch/examples/powerhouse-museum/output
|
|
||||||
|
|
||||||
begin: Mo 27. Feb 03:28:45 CET 2017
|
|
||||||
|
|
||||||
start OpenRefine server...
|
|
||||||
[sudo] password for felix:
|
|
||||||
92499ecd252a8768ea5b57e0be0fb30fe6340eab67d28b1be158e0ad01f79419
|
|
||||||
|
|
||||||
import phm-collection.tsv...
|
|
||||||
New project: 2325849087106
|
|
||||||
Number of rows: 75814
|
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
|
||||||
03:28:55 00:29 10.0 122 812208
|
|
||||||
save project and restart OpenRefine server...
|
|
||||||
02:29:28.170 [ ProjectManager] Saving all modified projects ... (4594ms)
|
|
||||||
02:29:36.414 [ project_utilities] Saved project '2325849087106' (8244ms)
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
f28de26b99475c4db09dbfb9ab3d445aa8127dedd08b8e729cb6b4d65c96bf38
|
|
||||||
|
|
||||||
begin project 2325849087106 @ Mo 27. Feb 03:29:52 CET 2017
|
|
||||||
transform phm-transform.json...
|
|
||||||
02:29:54.372 [ refine] GET /command/core/get-models (2815ms)
|
|
||||||
02:29:57.525 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3153ms)
|
|
||||||
02:29:57.640 [ refine] POST /command/core/apply-operations (115ms)
|
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
|
||||||
03:29:38 01:07 19.6 128 1588152
|
|
||||||
save project and restart OpenRefine server...
|
|
||||||
02:30:46.280 [ ProjectManager] Saving all modified projects ... (48640ms)
|
|
||||||
02:30:53.404 [ project_utilities] Saved project '2325849087106' (7124ms)
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
186b0bda0ca542642ce1875d55f8341648e05248eb359541b80191832783f40b
|
|
||||||
export to file 2325849087106.tsv...
|
|
||||||
02:31:08.149 [ refine] GET /command/core/get-models (4039ms)
|
|
||||||
02:31:11.485 [ project] Loaded project 2325849087106 from disk in 3 sec(s) (3336ms)
|
|
||||||
02:31:11.756 [ refine] GET /command/core/get-all-project-metadata (271ms)
|
|
||||||
02:31:11.774 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (18ms)
|
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
|
||||||
03:30:55 01:59 11.6 28.6 942900
|
|
||||||
restart OpenRefine server...
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
eb0f91675b5fbf21b4c17cceb6d93146876ea19316b7ab44af78a36f64ff1037
|
|
||||||
finished project 2325849087106 @ Mo 27. Feb 03:33:11 CET 2017
|
|
||||||
|
|
||||||
output (number of lines / size in bytes):
|
|
||||||
167017 60527726 /home/felix/openrefine-batch/examples/powerhouse-museum/output/2325849087106.tsv
|
|
||||||
|
|
||||||
cleanup...
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
6b7eb36f-fc72-4040-b135-acee36948c13
|
|
||||||
|
|
||||||
finish: Mo 27. Feb 03:33:17 CET 2017
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Todo
|
### Todo
|
||||||
|
|
||||||
- [ ] howto for installation on Mac and Windows
|
|
||||||
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
|
|
||||||
- [ ] use getopts for parsing of arguments
|
- [ ] use getopts for parsing of arguments
|
||||||
|
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
|
||||||
- [ ] add option to delete openrefine projects in output directory
|
- [ ] add option to delete openrefine projects in output directory
|
||||||
- [ ] provide more example data from other OpenRefine tutorials
|
- [ ] provide more example data from other OpenRefine tutorials
|
||||||
|
|
||||||
|
@ -7,14 +7,20 @@ Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Dat
|
|||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
```
|
```
|
||||||
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
|
sudo ./openrefine-batch.sh \
|
||||||
|
examples/powerhouse-museum/input/ \
|
||||||
|
examples/powerhouse-museum/config/ \
|
||||||
|
examples/powerhouse-museum/output/ \
|
||||||
|
examples/powerhouse-museum/cross/ \
|
||||||
|
2G 2.7rc1 restartfile-false restarttransform-false export-true \
|
||||||
|
tsv --processQuotes=false --guessCellValueTypes=true
|
||||||
```
|
```
|
||||||
|
|
||||||
## phm-collection.tsv
|
## input/phm-collection.tsv
|
||||||
|
|
||||||
* The [Powerhouse Museum in Sydney](https://maas.museum/powerhouse-museum/) provides a freely available metadata export of its collection on its website. The collection metadata has been retrieved from the website freeyourmetadata.org that has redistributed the data: http://data.freeyourmetadata.org/powerhouse-museum/
|
* The [Powerhouse Museum in Sydney](https://maas.museum/powerhouse-museum/) provides a freely available metadata export of its collection on its website. The collection metadata has been retrieved from the website freeyourmetadata.org that has redistributed the data: http://data.freeyourmetadata.org/powerhouse-museum/
|
||||||
|
|
||||||
## phm-tutorial.json
|
## config/phm-tutorial.json
|
||||||
|
|
||||||
* All steps from the tutorial above, extracted from the history of the processed tutorial project, retrieved from the website freeyourmetadata.org: [phm-collection-cleaned.google-refine.tar.gz](http://data.freeyourmetadata.org/powerhouse-museum/phm-collection-cleaned.google-refine.tar.gz)
|
* All steps from the tutorial above, extracted from the history of the processed tutorial project, retrieved from the website freeyourmetadata.org: [phm-collection-cleaned.google-refine.tar.gz](http://data.freeyourmetadata.org/powerhouse-museum/phm-collection-cleaned.google-refine.tar.gz)
|
||||||
|
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
# openrefine-batch.sh, Felix Lohmeier, v0.5, 27.02.2017
|
# openrefine-batch.sh, Felix Lohmeier, v0.6, 01.03.2017
|
||||||
# https://github.com/felixlohmeier/openrefine-batch
|
# https://github.com/felixlohmeier/openrefine-batch
|
||||||
|
|
||||||
# user input
|
# user input
|
||||||
@ -49,128 +49,177 @@ if [ -z "$6" ]
|
|||||||
fi
|
fi
|
||||||
if [ -z "$7" ]
|
if [ -z "$7" ]
|
||||||
then
|
then
|
||||||
restart="restart-true"
|
restartfile="restartfile-true"
|
||||||
else
|
else
|
||||||
restart="$7"
|
restartfile="$7"
|
||||||
fi
|
fi
|
||||||
if [ -z "$8" ]
|
if [ -z "$8" ]
|
||||||
then
|
then
|
||||||
inputformat=""
|
restarttransform="restarttransform-false"
|
||||||
else
|
else
|
||||||
inputformat="--format=${8}"
|
restarttransform="$8"
|
||||||
fi
|
fi
|
||||||
if [ -z "$9" ]
|
if [ -z "$9" ]
|
||||||
|
then
|
||||||
|
export="export-true"
|
||||||
|
else
|
||||||
|
export="$9"
|
||||||
|
fi
|
||||||
|
if [ -z "${10}" ]
|
||||||
|
then
|
||||||
|
inputformat=""
|
||||||
|
else
|
||||||
|
inputformat="--format=${10}"
|
||||||
|
fi
|
||||||
|
if [ -z "${11}" ]
|
||||||
then
|
then
|
||||||
inputoptions=""
|
inputoptions=""
|
||||||
else
|
else
|
||||||
inputoptions=( "$9" "${10}" "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" )
|
inputoptions=( "${11}" "${12}" "${13}" "${14}" "${15}" "${16}" "${17}" "${18}" "${19}" "${20}" "${21}" "${22}" "${23}" "${24}" "${25}" )
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# variables
|
# variables
|
||||||
uuid=$(cat /proc/sys/kernel/random/uuid)
|
uuid=$(cat /proc/sys/kernel/random/uuid)
|
||||||
echo "Input directory: $inputdir"
|
echo "Input directory: $inputdir"
|
||||||
echo "Input files: ${inputfiles[@]}"
|
echo "Input files: ${inputfiles[@]}"
|
||||||
echo "Input format: $inputformat"
|
echo "Input format: $inputformat"
|
||||||
echo "Input options: ${inputoptions[@]}"
|
echo "Input options: ${inputoptions[@]}"
|
||||||
echo "Config directory: $configdir"
|
echo "Config directory: $configdir"
|
||||||
echo "Transformation rules: ${jsonfiles[@]}"
|
echo "Transformation rules: ${jsonfiles[@]}"
|
||||||
echo "Cross directory: $crossdir"
|
echo "Cross directory: $crossdir"
|
||||||
echo "Cross projects: ${crossprojects[@]}"
|
echo "Cross projects: ${crossprojects[@]}"
|
||||||
echo "OpenRefine heap space: $ram"
|
echo "OpenRefine heap space: $ram"
|
||||||
echo "OpenRefine version: $version"
|
echo "OpenRefine version: $version"
|
||||||
echo "Docker container: $uuid"
|
echo "OpenRefine workspace: $outputdir"
|
||||||
echo "Docker restart: $restart"
|
echo "Export TSV to workspace: $export"
|
||||||
echo "Output directory: $outputdir"
|
echo "Docker container name: $uuid"
|
||||||
|
echo "restart after file: $restartfile"
|
||||||
|
echo "restart after transform: $restarttransform"
|
||||||
echo ""
|
echo ""
|
||||||
|
|
||||||
# time
|
# time
|
||||||
echo "begin: $(date)"
|
echo "begin: $(date)"
|
||||||
echo ""
|
echo ""
|
||||||
|
|
||||||
# launch openrefine server
|
# launch server
|
||||||
echo "start OpenRefine server..."
|
echo "start OpenRefine server..."
|
||||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
# wait until server is available
|
||||||
|
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||||
|
# show server logs
|
||||||
|
docker attach ${uuid} &
|
||||||
echo ""
|
echo ""
|
||||||
|
|
||||||
|
# import all files
|
||||||
if [ -n "$inputfiles" ]; then
|
if [ -n "$inputfiles" ]; then
|
||||||
# import all files
|
echo "=== IMPORT ==="
|
||||||
|
echo ""
|
||||||
for inputfile in "${inputfiles[@]}" ; do
|
for inputfile in "${inputfiles[@]}" ; do
|
||||||
echo "import ${inputfile}..."
|
echo "import ${inputfile}..."
|
||||||
# import
|
# run client with input command
|
||||||
sudo docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
|
docker run --rm --link ${uuid} -v ${inputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
|
||||||
# show server logs
|
# show statistics
|
||||||
sudo docker attach ${uuid} &
|
|
||||||
# statistics
|
|
||||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||||
# restart server to clear memory
|
|
||||||
echo "save project and restart OpenRefine server..."
|
|
||||||
sudo docker stop -t=5000 ${uuid}
|
|
||||||
sudo docker rm ${uuid}
|
|
||||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
|
||||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
|
||||||
echo ""
|
echo ""
|
||||||
|
# restart server to clear memory
|
||||||
|
if [ "$restartfile" = "restartfile-true" ]; then
|
||||||
|
echo "save project and restart OpenRefine server..."
|
||||||
|
docker stop -t=5000 ${uuid}
|
||||||
|
docker rm ${uuid}
|
||||||
|
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||||
|
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||||
|
docker attach ${uuid} &
|
||||||
|
echo ""
|
||||||
|
fi
|
||||||
done
|
done
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# get project ids
|
echo "=== TRANSFORM / EXPORT ==="
|
||||||
projects=($(sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | cut -c 2-14))
|
echo ""
|
||||||
|
|
||||||
# copy existing projects for use with OpenRefine cross function
|
# get project ids
|
||||||
|
echo "get project ids..."
|
||||||
|
projects=($(docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | tee ${outputdir}/projects.tmp | cut -c 2-14))
|
||||||
|
cat ${outputdir}/projects.tmp && rm ${outputdir}/projects.tmp
|
||||||
|
echo ""
|
||||||
|
|
||||||
|
# provide additional OpenRefine projects for cross function
|
||||||
if [ -n "$crossprojects" ]; then
|
if [ -n "$crossprojects" ]; then
|
||||||
|
echo "provide additional projects for cross function..."
|
||||||
|
# copy given projects to workspace
|
||||||
rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir
|
rsync -a --exclude='*.project/history' $crossdir/*.project $outputdir
|
||||||
|
# restart server to advertise copied projects
|
||||||
|
echo "restart OpenRefine server to advertise copied projects..."
|
||||||
|
docker stop -t=5000 ${uuid}
|
||||||
|
docker rm ${uuid}
|
||||||
|
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||||
|
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||||
|
docker attach ${uuid} &
|
||||||
|
echo ""
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# loop for all projects
|
# loop for all projects
|
||||||
for projectid in "${projects[@]}" ; do
|
for projectid in "${projects[@]}" ; do
|
||||||
echo "begin project $projectid @ $(date)"
|
# time
|
||||||
# show server logs
|
echo "--- begin project $projectid @ $(date) ---"
|
||||||
sudo docker attach ${uuid} &
|
echo ""
|
||||||
|
|
||||||
|
# apply transformation rules
|
||||||
if [ -n "$jsonfiles" ]; then
|
if [ -n "$jsonfiles" ]; then
|
||||||
# apply transformation rules
|
|
||||||
for jsonfile in "${jsonfiles[@]}" ; do
|
for jsonfile in "${jsonfiles[@]}" ; do
|
||||||
echo "transform ${jsonfile}..."
|
echo "transform ${jsonfile}..."
|
||||||
# apply
|
# run client with apply command
|
||||||
sudo docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
|
docker run --rm --link ${uuid} -v ${configdir}:/data felixlohmeier/openrefine-client -H ${uuid} -f ${jsonfile} ${projectid}
|
||||||
# statistics
|
# show statistics
|
||||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||||
if [ "$restart" = "restart-true" ]; then
|
# restart server to clear memory
|
||||||
# restart server to clear memory
|
if [ "$restarttransform" = "restarttransform-true" ]; then
|
||||||
echo "save project and restart OpenRefine server..."
|
echo "save project and restart OpenRefine server..."
|
||||||
sudo docker stop -t=5000 ${uuid}
|
docker stop -t=5000 ${uuid}
|
||||||
sudo docker rm ${uuid}
|
docker rm ${uuid}
|
||||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||||
sudo docker attach ${uuid} &
|
docker attach ${uuid} &
|
||||||
fi
|
fi
|
||||||
|
echo ""
|
||||||
done
|
done
|
||||||
fi
|
fi
|
||||||
# export files
|
|
||||||
echo "export to file ${projectid}.tsv..."
|
# export project to workspace
|
||||||
# export
|
if [ "$export" = "export-true" ]; then
|
||||||
sudo docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
|
echo "export to file ${projectid}.tsv..."
|
||||||
# statistics
|
# run client with export command
|
||||||
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
docker run --rm --link ${uuid} -v ${outputdir}:/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
|
||||||
# restart server to clear memory
|
# show statistics
|
||||||
echo "restart OpenRefine server..."
|
ps -o start,etime,%mem,%cpu,rss -C java --sort=start
|
||||||
sudo docker stop -t=5000 ${uuid}
|
# restart server to clear memory
|
||||||
sudo docker rm ${uuid}
|
if [ "$restartfile" = "restartfile-true" ]; then
|
||||||
sudo docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
echo "restart OpenRefine server..."
|
||||||
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
docker stop -t=5000 ${uuid}
|
||||||
|
docker rm ${uuid}
|
||||||
|
docker run -d --name=${uuid} -v ${outputdir}:/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
|
||||||
|
until docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
|
||||||
|
docker attach ${uuid} &
|
||||||
|
fi
|
||||||
|
echo ""
|
||||||
|
fi
|
||||||
|
|
||||||
# time
|
# time
|
||||||
echo "finished project $projectid @ $(date)"
|
echo "--- finished project $projectid @ $(date) ---"
|
||||||
echo ""
|
echo ""
|
||||||
done
|
done
|
||||||
|
|
||||||
# list output files
|
# list output files
|
||||||
echo "output (number of lines / size in bytes):"
|
if [ "$export" = "export-true" ]; then
|
||||||
wc -c -l ${outputdir}/*.tsv
|
echo "output (number of lines / size in bytes):"
|
||||||
echo ""
|
wc -c -l ${outputdir}/*.tsv
|
||||||
|
echo ""
|
||||||
|
fi
|
||||||
|
|
||||||
# cleanup
|
# cleanup
|
||||||
echo "cleanup..."
|
echo "cleanup..."
|
||||||
sudo docker stop -t=5000 ${uuid}
|
docker stop -t=5000 ${uuid}
|
||||||
sudo docker rm ${uuid}
|
docker rm ${uuid}
|
||||||
rm -r -f ${outputdir}/workspace*.json
|
rm -r -f ${outputdir}/workspace*.json
|
||||||
echo ""
|
echo ""
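The stop / rm / run / wait / attach sequence appears several times in the script above. One possible way to factor it out is sketched below. `wait_until`, `server_ready` and `restart_server` are hypothetical names, the limit of 60 attempts is an arbitrary assumption, and the docker commands are copied from the script. Unlike the `until` loops in the script, this version gives up after a fixed number of attempts instead of waiting forever.

```shell
# Sketch only: these helpers are hypothetical and not part of openrefine-batch.sh.
# They expect uuid, outputdir, version and ram to be set as in the script.

# Retry a command once per second, give up after $1 attempts.
wait_until() {
  tries=$1
  shift
  while ! "$@"; do
    tries=$((tries - 1))
    [ "$tries" -gt 0 ] || return 1
    sleep 1
  done
}

# Succeeds as soon as the OpenRefine start page is served.
server_ready() {
  docker run --rm --link "$uuid" --entrypoint /usr/bin/curl \
    felixlohmeier/openrefine-client --silent -N "http://$uuid:3333" \
    | grep -q "OpenRefine"
}

# Stop, remove and relaunch the server container, then show its logs again.
restart_server() {
  docker stop -t=5000 "$uuid"
  docker rm "$uuid"
  docker run -d --name="$uuid" -v "$outputdir:/data" \
    "felixlohmeier/openrefine:$version" -i 0.0.0.0 -m "$ram" -d /data
  wait_until 60 server_ready || return 1
  docker attach "$uuid" &
}
```

With these helpers each restart block in the script would shrink to a single `restart_server` call, for example inside `if [ "$restartfile" = "restartfile-true" ]; then ... fi`.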
|
||||||
|
|
||||||
|