Compare commits
8 Commits
Author | SHA1 | Date |
---|---|---|
Felix Lohmeier | 2cc2378085 | |
dependabot[bot] | 9e6e42261b | |
Felix Lohmeier | 4e32074d85 | |
Felix Lohmeier | a9c494856b | |
Felix Lohmeier | ca19d7ef16 | |
Felix Lohmeier | 93be203efe | |
Felix Lohmeier | 2894b0194f | |
Felix Lohmeier | 4199fadc04 | |
125
README.md
|
@@ -1,6 +1,6 @@
 ## OpenRefine batch processing (openrefine-batch.sh)
 
-[![Codacy Badge](https://api.codacy.com/project/badge/Grade/66bf001c38194f5bb722f65f5e15f0ec)](https://www.codacy.com/app/mail_74/openrefine-batch?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-batch&utm_campaign=badger)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/ad8a97e42e634bbe87203ea48efb436e)](https://www.codacy.com/gh/opencultureconsulting/openrefine-batch/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)
 
 Shell script to run OpenRefine in batch mode (import, transform, export). This bash script automatically...
 
|
@@ -17,6 +17,14 @@ If you prefer a containerized approach, see a [variation of this script for Dock
 - **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. transform.json).
 - **Step 2**: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files in OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to files in the format specified (default: TSV - tab-separated values).
 
+### Demo via binder
+
+[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-batch/master?urlpath=lab/tree/demo.ipynb)
+
+- free to use on-demand server with Jupyterlab and Bash Kernel
+- no registration needed, will start within a few minutes
+- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
+
 ### Install
 
 Download the script and grant file permissions to execute:
|
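Reviewer note: the two-directory workflow described in Step 2 could be tried out roughly as follows. This is only a sketch under stated assumptions: the directory layout, `data.tsv`, and the empty rule set in `transform.json` are hypothetical placeholders, while the flags `-a`, `-b`, `-c` and `-f` are the ones that appear in the example run and usage text elsewhere in this diff. The script is only executed if it is present in the current directory; otherwise the command is printed.

```shell
#!/bin/bash
# Hypothetical working layout: data files and JSON rules live in separate directories.
workdir="$(mktemp -d)"
mkdir -p "${workdir}/input" "${workdir}/config" "${workdir}/output"

# A tiny placeholder data file and an empty rule set stand in for real content (assumptions).
printf 'id\tname\n1\texample\n' > "${workdir}/input/data.tsv"
printf '[]\n' > "${workdir}/config/transform.json"

# -a input dir, -b transform dir, -c output dir, -f export format (flags as in the example run).
cmd=(./openrefine-batch.sh -a "${workdir}/input/" -b "${workdir}/config/" -c "${workdir}/output/" -f tsv)

if [ -x ./openrefine-batch.sh ]; then
  "${cmd[@]}"
else
  # Script not present here: just show what would be executed.
  echo "would run: ${cmd[*]}"
fi
```

Building the command as an array keeps paths with spaces intact and makes the dry-run output match what would actually be executed.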
@@ -158,12 +166,12 @@ The script prints log messages from OpenRefine server and makes use of `ps` to s
 ```
 [felix@tux openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
 Download OpenRefine...
-openrefine-linux-3.2.tar.gz 100%[===============================================>] 101,13M 9,46MB/s in 19s
+openrefine-linux-3.5.0.tar.gz 100%[=========================================================================================================================================>] 125,73M 9,50MB/s in 13s
 Install OpenRefine in subdirectory openrefine...
-Total bytes read: 125419520 (120MiB, 74MiB/s)
+Total bytes read: 154163200 (148MiB, 87MiB/s)
 
 Download OpenRefine client...
-openrefine-client_0-3-9_linux 100%[===============================================>] 4,25M 2,61MB/s in 1,6s
+openrefine-client_0-3-10_linux 100%[=========================================================================================================================================>] 4,25M 9,17MB/s in 0,5s
 
 Input directory: /home/felix/git/openrefine-batch/examples/powerhouse-museum/input
 Input files: phm-collection.tsv
|
@@ -184,107 +192,99 @@ restart after transform: false
|
|||
|
||||
=== 1. Launch OpenRefine ===
|
||||
|
||||
starting time: Sa 8. Aug 13:32:45 CEST 2020
|
||||
starting time: Di 9. Nov 22:37:25 CET 2021
|
||||
|
||||
You have 15927M of free memory.
|
||||
Using refine.ini for configuration
|
||||
You have 15913M of free memory.
|
||||
Your current configuration is set to use 2048M of memory.
|
||||
OpenRefine can run better when given more memory. Read our FAQ on how to allocate more memory here:
|
||||
https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
|
||||
https://github.com/OpenRefine/OpenRefine/wiki/FAQ-Allocate-More-Memory
|
||||
/usr/bin/java -cp server/classes:server/target/lib/* -Drefine.headless=true -Xms2048M -Xmx2048M -Drefine.memory=2048M -Drefine.max_form_content_size=1048576 -Drefine.verbosity=info -Dpython.path=main/webapp/WEB-INF/lib/jython -Dpython.cachedir=/home/felix/.local/share/google/refine/cachedir -Drefine.data_dir=/home/felix/git/openrefine-batch/examples/powerhouse-museum/output -Drefine.webapp=main/webapp -Drefine.port=3333 -Drefine.interface=127.0.0.1 -Drefine.host=127.0.0.1 -Drefine.autosave=1440 com.google.refine.Refine
|
||||
Starting OpenRefine at 'http://127.0.0.1:3333/'
|
||||
|
||||
13:32:46.213 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
|
||||
13:32:46.214 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
|
||||
13:32:46.224 [ refine_server] Initializing context: '/' from '/home/felix/git/openrefine-batch/openrefine/webapp' (10ms)
|
||||
log4j:WARN No appenders could be found for logger (org.eclipse.jetty.util.log).
|
||||
log4j:WARN Please initialize the log4j system properly.
|
||||
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
|
||||
SLF4J: Class path contains multiple SLF4J bindings.
|
||||
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/server/target/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
|
||||
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/webapp/WEB-INF/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
|
||||
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/webapp/WEB-INF/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
|
||||
SLF4J: Found binding in [jar:file:/home/felix/git/openrefine-batch/openrefine/server/target/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
|
||||
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
|
||||
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
|
||||
13:32:46.937 [ refine] Starting OpenRefine 3.2 [55c921b]... (713ms)
|
||||
13:32:46.937 [ refine] initializing FileProjectManager with dir (0ms)
|
||||
13:32:46.937 [ refine] /home/felix/git/openrefine-batch/examples/powerhouse-museum/output (0ms)
|
||||
13:32:46.947 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (10ms)
|
||||
13:32:52.249 [ refine] Running in headless mode (5302ms)
|
||||
22:37:28.211 [ refine] Starting OpenRefine 3.5.0 [d4209a2]... (0ms)
|
||||
22:37:28.213 [ refine] initializing FileProjectManager with dir (2ms)
|
||||
22:37:28.213 [ refine] /home/felix/git/openrefine-batch/examples/powerhouse-museum/output (0ms)
|
||||
22:37:28.223 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (10ms)
|
||||
|
||||
=== 2. Import all files ===
|
||||
|
||||
starting time: Sa 8. Aug 13:32:53 CEST 2020
|
||||
starting time: Di 9. Nov 22:37:33 CET 2021
|
||||
|
||||
import phm-collection.tsv...
|
||||
13:32:53.686 [ refine] POST /command/core/create-project-from-upload (1437ms)
|
||||
13:33:01.606 [ refine] GET /command/core/get-models (7920ms)
|
||||
13:33:01.722 [ refine] POST /command/core/get-rows (116ms)
|
||||
id: 1705197298924
|
||||
22:37:33.804 [ refine] GET /command/core/get-csrf-token (5581ms)
|
||||
22:37:33.872 [ refine] POST /command/core/create-project-from-upload (68ms)
|
||||
22:37:44.653 [ refine] GET /command/core/get-models (10781ms)
|
||||
22:37:44.790 [ refine] POST /command/core/get-rows (137ms)
|
||||
id: 2252508879578
|
||||
rows: 75814
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
13:32:45 00:16 6.0 201 993192
|
||||
22:37:25 00:19 10.2 202 1670620
|
||||
|
||||
=== 3. Prepare transform & export ===
|
||||
|
||||
starting time: Sa 8. Aug 13:33:01 CEST 2020
|
||||
starting time: Di 9. Nov 22:37:44 CET 2021
|
||||
|
||||
get project ids...
|
||||
13:33:02.003 [ refine] GET /command/core/get-all-project-metadata (281ms)
|
||||
1705197298924: phm-collection
|
||||
22:37:45.112 [ refine] GET /command/core/get-csrf-token (322ms)
|
||||
22:37:45.115 [ refine] GET /command/core/get-all-project-metadata (3ms)
|
||||
2252508879578: phm-collection
|
||||
|
||||
=== 4. Transform phm-collection ===
|
||||
|
||||
starting time: Sa 8. Aug 13:33:02 CEST 2020
|
||||
starting time: Di 9. Nov 22:37:45 CET 2021
|
||||
|
||||
transform phm-transform.json...
|
||||
13:33:02.187 [ refine] GET /command/core/get-models (184ms)
|
||||
13:33:02.193 [ refine] POST /command/core/apply-operations (6ms)
|
||||
File /home/felix/git/openrefine-batch/examples/powerhouse-museum/config/phm-transform.json has been successfully applied to project 1705197298924
|
||||
22:37:45.303 [ refine] GET /command/core/get-csrf-token (188ms)
|
||||
22:37:45.308 [ refine] GET /command/core/get-models (5ms)
|
||||
22:37:45.324 [ refine] POST /command/core/apply-operations (16ms)
|
||||
File /home/felix/git/openrefine-batch/examples/powerhouse-museum/config/phm-transform.json has been successfully applied to project 2252508879578
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
13:32:45 00:32 6.3 165 1037688
|
||||
22:37:25 00:34 11.9 175 1940600
|
||||
|
||||
|
||||
=== 5. Export phm-collection ===
|
||||
|
||||
starting time: Sa 8. Aug 13:33:17 CEST 2020
|
||||
starting time: Di 9. Nov 22:37:59 CET 2021
|
||||
|
||||
export to file phm-collection.tsv...
|
||||
13:33:18.001 [ refine] GET /command/core/get-models (15808ms)
|
||||
13:33:18.005 [ refine] GET /command/core/get-all-project-metadata (4ms)
|
||||
13:33:18.007 [ refine] POST /command/core/export-rows/phm-collection.tsv (2ms)
|
||||
22:37:59.944 [ refine] GET /command/core/get-csrf-token (14620ms)
|
||||
22:37:59.947 [ refine] GET /command/core/get-models (3ms)
|
||||
22:37:59.951 [ refine] GET /command/core/get-all-project-metadata (4ms)
|
||||
22:37:59.954 [ refine] POST /command/core/export-rows/phm-collection.tsv (3ms)
|
||||
Export to file /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv complete
|
||||
STARTED ELAPSED %MEM %CPU RSS
|
||||
13:32:45 00:35 6.7 168 1098564
|
||||
22:37:25 00:38 12.4 181 2021388
|
||||
|
||||
|
||||
output (number of lines / size in bytes):
|
||||
75728 59431272 /home/felix/git/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
|
||||
|
||||
cleanup...
|
||||
13:33:24.667 [ ProjectManager] Saving all modified projects ... (6660ms)
|
||||
13:33:28.044 [ project_utilities] Saved project '1705197298924' (3377ms)
|
||||
22:38:06.850 [ ProjectManager] Saving all modified projects ... (6896ms)
|
||||
22:38:10.014 [ project_utilities] Saved project '2252508879578' (3164ms)
|
||||
|
||||
=== Statistics ===
|
||||
|
||||
starting time and run time of each step:
|
||||
Start process Sa 8. Aug 13:32:45 CEST 2020 (00:00:00)
|
||||
Launch OpenRefine Sa 8. Aug 13:32:45 CEST 2020 (00:00:08)
|
||||
Import all files Sa 8. Aug 13:32:53 CEST 2020 (00:00:08)
|
||||
Prepare transform & export Sa 8. Aug 13:33:01 CEST 2020 (00:00:01)
|
||||
Transform phm-collection Sa 8. Aug 13:33:02 CEST 2020 (00:00:15)
|
||||
Export phm-collection Sa 8. Aug 13:33:17 CEST 2020 (00:00:12)
|
||||
End process Sa 8. Aug 13:33:29 CEST 2020 (00:00:00)
|
||||
Start process Di 9. Nov 22:37:25 CET 2021 (00:00:00)
|
||||
Launch OpenRefine Di 9. Nov 22:37:25 CET 2021 (00:00:08)
|
||||
Import all files Di 9. Nov 22:37:33 CET 2021 (00:00:11)
|
||||
Prepare transform & export Di 9. Nov 22:37:44 CET 2021 (00:00:01)
|
||||
Transform phm-collection Di 9. Nov 22:37:45 CET 2021 (00:00:14)
|
||||
Export phm-collection Di 9. Nov 22:37:59 CET 2021 (00:00:11)
|
||||
End process Di 9. Nov 22:38:10 CET 2021 (00:00:00)
|
||||
|
||||
total run time: 00:00:44 (hh:mm:ss)
|
||||
highest memory load: 1072 MB
|
||||
```
|
||||
|
||||
### Performance gain with extended cross function
|
||||
|
||||
The original cross function expects normalized data (one foreign key per cell in the base column). If you have multiple key values in one cell, you need to split them into multiple rows before you apply cross (and join the results afterwards). This can be quite "expensive" if you work with bigger datasets.
|
||||
|
||||
There is a [fork available that extends the cross function](https://github.com/felixlohmeier/OpenRefine/wiki) to support an integrated split, which may provide a massive performance gain for this special use case.
|
||||
|
||||
Here is a code snippet to install this fork together with openrefine-batch.sh in a blank directory:
|
||||
```
|
||||
wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh
|
||||
sed -i 's/.tar.gz/-with-pr1294.tar.gz/' openrefine-batch.sh
|
||||
./openrefine-batch.sh
|
||||
total run time: 00:00:45 (hh:mm:ss)
|
||||
highest memory load: 1974 MB
|
||||
```
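Reviewer note: the `highest memory load` figure at the end of the log appears to correspond to the largest RSS value printed by `ps` during the run, converted from KiB to MB by integer division (2021388 / 1024 gives 1974 for the newer run, and 1098564 / 1024 gives 1072 for the older one). This is an inference from the numbers shown in the log, not something the diff states, and the sketch below is an illustration rather than the script's actual code.

```shell
#!/bin/bash
# RSS samples in KiB, copied from the `ps` lines of the newer run in the log above.
memoryload=(1670620 1940600 2021388)

# Find the largest sample.
max=0
for rss in "${memoryload[@]}"; do
  if [ "$rss" -gt "$max" ]; then
    max="$rss"
  fi
done

# Integer division by 1024 converts KiB to (binary) megabytes.
highest_mb=$(( max / 1024 ))
echo "highest memory load: ${highest_mb} MB"
```

With the three samples above this prints `highest memory load: 1974 MB`, matching the value in the updated log.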
|
||||
|
||||
### Docker
|
||||
|
@@ -340,11 +340,6 @@ execute openrefine-batch-docker.sh
 -RX
 ```
 
-### Todo
-
-- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
-- [ ] provide more example data from other OpenRefine tutorials
-
 ### Licensing
 
 MIT License
|
|
|
@@ -0,0 +1 @@
+openjdk-8-jre
|
|
@@ -0,0 +1,5 @@
+#!/bin/bash
+set -e
+
+# Install bash_kernel https://github.com/takluyver/bash_kernel
+python -m bash_kernel.install
|
|
@@ -0,0 +1,2 @@
+jupyter-server-proxy==3.2.1
+bash_kernel==0.7.2
|
|
@@ -0,0 +1 @@
+{"metadata":{"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"},"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"}},"nbformat_minor":5,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Example Powerhouse Museum\n\nOutput will be stored in examples/powerhouse-museum/output/phm-collection.tsv","metadata":{}},{"cell_type":"code","source":"./openrefine-batch.sh \\\n-a examples/powerhouse-museum/input/ \\\n-b examples/powerhouse-museum/config/ \\\n-c examples/powerhouse-museum/output/ \\\n-f tsv \\\n-i processQuotes=false \\\n-i guessCellValueTypes=true \\\n-RX","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}
|
|
@@ -1,5 +1,5 @@
 #!/bin/bash
-# openrefine-batch-docker.sh, Felix Lohmeier, v1.14, 2020-08-08
+# openrefine-batch-docker.sh, Felix Lohmeier, v1.16, 2021-11-09
 # https://github.com/felixlohmeier/openrefine-batch
 
 # check system requirements
|
@@ -40,7 +40,7 @@ Usage: ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDI
 -i INPUTOPTIONS several options provided by openrefine-client, see below...
 -m RAM maximum RAM for OpenRefine java heap space (default: 2048M)
 -t TEMPLATING several options for templating export, see below...
--v VERSION OpenRefine version (3.2, 3.1, 3.0, 2.8, 2.7, ...; default: 3.2)
+-v VERSION OpenRefine version (3.5.0, 3.4.1, 3.4, 3.3, 3.2, 3.1, 3.0, 2.8, 2.7, ...; default: 3.5.0)
 -E do NOT export files
 -R do NOT restart OpenRefine after each transformation (e.g. config file)
 -X do NOT restart OpenRefine after each project (e.g. input file)
|
@@ -108,7 +108,7 @@ EOF
 
 # defaults
 ram="2048M"
-version="3.2"
+version="3.5.0"
 restartfile="true"
 restarttransform="true"
 export="true"
|
@@ -229,7 +229,7 @@ echo "starting time: $(date --date=@${checkpointdate[$((checkpoints + 1))]})"
 echo ""
 ${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
 # wait until server is available
-until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.9 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 # show server logs
 ${docker[*]} attach ${uuid} &
 echo ""
|
@@ -246,7 +246,7 @@ if [ -n "$inputfiles" ]; then
 for inputfile in "${inputfiles[@]}" ; do
 echo "import ${inputfile}..."
 # run client with input command
-${docker[*]} run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client:v0.3.9 -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
+${docker[*]} run --rm --link ${uuid} -v ${inputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -c $inputfile $inputformat ${inputoptions[@]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
|
@@ -257,7 +257,7 @@ if [ -n "$inputfiles" ]; then
 ${docker[*]} stop -t=5000 ${uuid}
 ${docker[*]} rm ${uuid}
 ${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.9 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 ${docker[*]} attach ${uuid} &
 echo ""
 fi
|
@@ -276,7 +276,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 
 # get project ids
 echo "get project ids..."
-${docker[*]} run --rm --link ${uuid} felixlohmeier/openrefine-client:v0.3.9 -H ${uuid} -l > "${outputdir}/projects.tmp"
+${docker[*]} run --rm --link ${uuid} felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -l > "${outputdir}/projects.tmp"
 projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
 projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
 cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
|
@@ -292,7 +292,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 ${docker[*]} stop -t=5000 ${uuid}
 ${docker[*]} rm ${uuid}
 ${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.9 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 ${docker[*]} attach ${uuid} &
 echo ""
 fi
|
@@ -312,7 +312,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 for jsonfile in "${jsonfiles[@]}" ; do
 echo "transform ${jsonfile}..."
 # run client with apply command
-${docker[*]} run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client:v0.3.9 -H ${uuid} -f ${jsonfile} ${projectids[i]}
+${docker[*]} run --rm --link ${uuid} -v ${configdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -f ${jsonfile} ${projectids[i]}
 # allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
|
@@ -323,7 +323,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 ${docker[*]} stop -t=5000 ${uuid}
 ${docker[*]} rm ${uuid}
 ${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.9 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 ${docker[*]} attach ${uuid} &
 fi
 echo ""
|
@@ -343,7 +343,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 filename=${projectnames[i]%.*}
 echo "export to file ${filename}.${exportformat}..."
 # run client with export command
-${docker[*]} run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client:v0.3.9 -H ${uuid} -E --output="${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
+${docker[*]} run --rm --link ${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H ${uuid} -E --output="${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -C java --sort=start
 memoryload+=($(ps --no-headers -o rss -C java))
|
@@ -356,7 +356,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 ${docker[*]} stop -t=5000 ${uuid}
 ${docker[*]} rm ${uuid}
 ${docker[*]} run -d --name=${uuid} -v ${outputdir}:/data:z felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
-until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.9 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
+until ${docker[*]} run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client:v0.3.10 --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
 ${docker[*]} attach ${uuid} &
 fi
 echo ""
|
|
|
@@ -1,10 +1,10 @@
 #!/bin/bash
-# openrefine-batch.sh, Felix Lohmeier, v1.14, 2020-08-08
+# openrefine-batch.sh, Felix Lohmeier, v1.16, 2021-11-09
 # https://github.com/felixlohmeier/openrefine-batch
 
 # declare download URLs for OpenRefine and OpenRefine client
-openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/3.2/openrefine-linux-3.2.tar.gz"
-client_URL="https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.9/openrefine-client_0-3-9_linux"
+openrefine_URL="https://github.com/OpenRefine/OpenRefine/releases/download/3.5.0/openrefine-linux-3.5.0.tar.gz"
+client_URL="https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux"
 
 # check system requirements
 JAVA="$(which java 2> /dev/null)"
|
@@ -34,7 +34,7 @@ if [ ! -d "openrefine-client" ]; then
 echo "Download OpenRefine client..."
 mkdir -p openrefine-client
 wget -q -P openrefine-client $wget_opt $client_URL
-chmod +x openrefine-client/openrefine-client_0-3-9_linux
+chmod +x openrefine-client/openrefine-client_0-3-10_linux
 echo ""
 fi
 
|
@@ -259,7 +259,7 @@ if [ -n "$inputfiles" ]; then
 for inputfile in "${inputfiles[@]}" ; do
 echo "import ${inputfile}..."
 # run client with input command
-openrefine-client/openrefine-client_0-3-9_linux -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -c ${inputdir}/${inputfile} $inputformat "${inputoptions[@]}"
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
|
@@ -290,7 +290,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 
 # get project ids
 echo "get project ids..."
-openrefine-client/openrefine-client_0-3-9_linux -P ${port} -l > "${outputdir}/projects.tmp"
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -l > "${outputdir}/projects.tmp"
 projectids=($(cut -c 2-14 "${outputdir}/projects.tmp"))
 projectnames=($(cut -c 17- "${outputdir}/projects.tmp"))
 cat "${outputdir}/projects.tmp" && rm "${outputdir:?}/projects.tmp"
|
@@ -327,7 +327,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 for jsonfile in "${jsonfiles[@]}" ; do
 echo "transform ${jsonfile}..."
 # run client with apply command
-openrefine-client/openrefine-client_0-3-9_linux -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -f ${configdir}/${jsonfile} ${projectids[i]}
 # allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
|
@@ -359,7 +359,7 @@ if [ -n "$jsonfiles" ] || [ "$export" = "true" ]; then
 filename=${projectnames[i]%.*}
 echo "export to file ${filename}.${exportformat}..."
 # run client with export command
-openrefine-client/openrefine-client_0-3-9_linux -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
+openrefine-client/openrefine-client_0-3-10_linux -P ${port} -E --output="${outputdir}/${filename}.${exportformat}" "${templating[@]}" ${projectids[i]}
 # show allocated system resources
 ps -o start,etime,%mem,%cpu,rss -p ${pid} --sort=start
 memoryload+=($(ps --no-headers -o rss -p ${pid}))
|
|