release v1.8, updated OpenRefine version (dev snapshot 2017-10-28)
parent
2af30448dc
commit
c8e57230ca
127
README.md
|
@ -39,7 +39,7 @@ cp CONFIGFILES config/
|
||||||
* you may use hard links instead of cp: `ln INPUTFILE input/`
|
* you may use hard links instead of cp: `ln INPUTFILE input/`
|
||||||
|
|
||||||
**CONFIGFILES**
|
**CONFIGFILES**
|
||||||
* JSON files with [OpenRefine transformation rules)](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
|
* JSON files with [OpenRefine transformation rules](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html)
|
||||||
|
|
||||||
**OUTPUT/**
|
**OUTPUT/**
|
||||||
* path to directory where results and temporary data should be stored
|
* path to directory where results and temporary data should be stored
|
||||||
|
@ -50,6 +50,17 @@ cp CONFIGFILES config/
|
||||||
|
|
||||||
[Example Powerhouse Museum](examples/powerhouse-museum)
|
[Example Powerhouse Museum](examples/powerhouse-museum)
|
||||||
|
|
||||||
|
download example data
|
||||||
|
|
||||||
|
```
|
||||||
|
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
|
||||||
|
unzip master.zip openrefine-batch-master/examples/*
|
||||||
|
mv openrefine-batch-master/examples .
|
||||||
|
rm -f master.zip
|
||||||
|
```
|
||||||
|
|
||||||
|
execute openrefine-batch.sh
|
||||||
|
|
||||||
```
|
```
|
||||||
./openrefine-batch.sh \
|
./openrefine-batch.sh \
|
||||||
-a examples/powerhouse-museum/input/ \
|
-a examples/powerhouse-museum/input/ \
|
||||||
|
@ -61,12 +72,10 @@ cp CONFIGFILES config/
|
||||||
-RX
|
-RX
|
||||||
```
|
```
|
||||||
|
|
||||||
clone or [download GitHub repository](https://github.com/felixlohmeier/openrefine-batch/archive/master.zip) to get example data
|
|
||||||
|
|
||||||
### Help Screen
|
### Help Screen
|
||||||
|
|
||||||
```
|
```
|
||||||
[14:45 felix ~/openrefine-batch]$ ./openrefine-batch.sh
|
[11:36 felix ~/openrefine-batch]$ ./openrefine-batch.sh
|
||||||
Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
|
Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
|
||||||
|
|
||||||
== basic arguments ==
|
== basic arguments ==
|
||||||
|
@ -109,10 +118,17 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
|
||||||
|
|
||||||
== example ==
|
== example ==
|
||||||
|
|
||||||
|
download example data
|
||||||
|
|
||||||
|
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
|
||||||
|
unzip master.zip openrefine-batch-master/examples/*
|
||||||
|
mv openrefine-batch-master/examples .
|
||||||
|
rm -f master.zip
|
||||||
|
|
||||||
|
execute openrefine-batch.sh
|
||||||
|
|
||||||
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
|
./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
|
||||||
|
|
||||||
clone or download GitHub repository to get example data:
|
|
||||||
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
|
|
||||||
```
|
```
|
||||||
|
|
||||||
### Logging
|
### Logging
|
||||||
|
@ -120,26 +136,26 @@ https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
|
||||||
The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
|
The script prints log messages from OpenRefine server and makes use of `ps` to show statistics for each step. Here is a sample:
|
||||||
|
|
||||||
```
|
```
|
||||||
[00:41 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
|
[11:36 felix ~/openrefine-batch]$ ./openrefine-batch.sh -a examples/powerhouse-museum/input/ -b examples/powerhouse-museum/config/ -c examples/powerhouse-museum/output/ -f tsv -i processQuotes=false -i guessCellValueTypes=true -RX
|
||||||
Download OpenRefine...
|
Download OpenRefine...
|
||||||
openrefine-linux-2017-10-26.tar.gz 100%[=====================================================================================================================>] 66,34M 5,62MB/s in 12s
|
openrefine-linux-2017-10-2 100%[=====================================>] 66,34M 5,62MB/s in 12s
|
||||||
Install OpenRefine in subdirectory openrefine...
|
Install OpenRefine in subdirectory openrefine...
|
||||||
Total bytes read: 79861760 (77MiB, 128MiB/s)
|
Total bytes read: 79861760 (77MiB, 129MiB/s)
|
||||||
|
|
||||||
Download OpenRefine client...
|
Download OpenRefine client...
|
||||||
openrefine-client_0-3-1_linux-64bit 100%[=====================================================================================================================>] 5,39M 5,08MB/s in 1,1s
|
openrefine-client_0-3-1_li 100%[=====================================>] 5,39M 5,17MB/s in 1,0s
|
||||||
|
|
||||||
Input directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/input
|
Input directory: /home/felix/openrefine-batch/examples/powerhouse-museum/input
|
||||||
Input files: phm-collection.tsv
|
Input files: phm-collection.tsv
|
||||||
Input format: --format=tsv
|
Input format: --format=tsv
|
||||||
Input options: --processQuotes=false --guessCellValueTypes=true
|
Input options: --processQuotes=false --guessCellValueTypes=true
|
||||||
Config directory: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/config
|
Config directory: /home/felix/openrefine-batch/examples/powerhouse-museum/config
|
||||||
Transformation rules: phm-transform.json
|
Transformation rules: phm-transform.json
|
||||||
Cross directory: /dev/null
|
Cross directory: /dev/null
|
||||||
Cross projects:
|
Cross projects:
|
||||||
OpenRefine heap space: 2048M
|
OpenRefine heap space: 2048M
|
||||||
OpenRefine port: 3333
|
OpenRefine port: 3333
|
||||||
OpenRefine workspace: /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output
|
OpenRefine workspace: /home/felix/openrefine-batch/examples/powerhouse-museum/output
|
||||||
Export to workspace: true
|
Export to workspace: true
|
||||||
Export format: tsv
|
Export format: tsv
|
||||||
restart after file: false
|
restart after file: false
|
||||||
|
@ -147,80 +163,93 @@ restart after transform: false
|
||||||
|
|
||||||
=== 1. Launch OpenRefine ===
|
=== 1. Launch OpenRefine ===
|
||||||
|
|
||||||
starting time: Sa 28. Okt 00:42:33 CEST 2017
|
starting time: Sa 28. Okt 11:38:19 CEST 2017
|
||||||
|
|
||||||
Starting OpenRefine at 'http://127.0.0.1:3333/'
|
Starting OpenRefine at 'http://127.0.0.1:3333/'
|
||||||
|
|
||||||
00:42:33.199 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
|
11:38:19.275 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
|
||||||
00:42:33.200 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (1ms)
|
11:38:19.275 [ refine_server] refine.memory size: 2048M JVM Max heap: 2058354688 (0ms)
|
||||||
00:42:33.206 [ refine_server] Initializing context: '/' from '/home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/openrefine/webapp' (6ms)
|
11:38:19.281 [ refine_server] Initializing context: '/' from '/home/felix/openrefine-batch/openrefine/webapp' (6ms)
|
||||||
00:42:33.418 [ refine] Starting OpenRefine 2017-10-26 [TRUNK]... (212ms)
|
11:38:19.478 [ refine] Starting OpenRefine 2017-10-28 [TRUNK]... (197ms)
|
||||||
00:42:33.424 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
|
11:38:19.484 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (6ms)
|
||||||
00:42:35.993 [ refine] Running in headless mode (2569ms)
|
11:38:22.010 [ refine] Running in headless mode (2526ms)
|
||||||
|
|
||||||
=== 2. Import all files ===
|
=== 2. Import all files ===
|
||||||
|
|
||||||
starting time: Sa 28. Okt 00:42:36 CEST 2017
|
starting time: Sa 28. Okt 11:38:22 CEST 2017
|
||||||
|
|
||||||
import phm-collection.tsv...
|
import phm-collection.tsv...
|
||||||
00:42:36.393 [ refine] POST /command/core/create-project-from-upload (400ms)
|
11:38:22.479 [ refine] POST /command/core/create-project-from-upload (469ms)
|
||||||
New project: 1721413008439
|
New project: 1530205635037
|
||||||
00:42:40.731 [ refine] GET /command/core/get-rows (4338ms)
|
11:38:26.474 [ refine] GET /command/core/get-rows (3995ms)
|
||||||
Number of rows: 75814
|
Number of rows: 75814
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
STARTED ELAPSED %MEM %CPU RSS
|
||||||
00:42:32 00:07 5.7 220 937692
|
11:38:18 00:07 5.8 214 946616
|
||||||
|
|
||||||
=== 3. Prepare transform & export ===
|
=== 3. Prepare transform & export ===
|
||||||
|
|
||||||
starting time: Sa 28. Okt 00:42:40 CEST 2017
|
starting time: Sa 28. Okt 11:38:26 CEST 2017
|
||||||
|
|
||||||
get project ids...
|
get project ids...
|
||||||
00:42:40.866 [ refine] GET /command/core/get-all-project-metadata (135ms)
|
11:38:26.589 [ refine] GET /command/core/get-all-project-metadata (115ms)
|
||||||
1721413008439: phm-collection.tsv
|
1530205635037: phm-collection.tsv
|
||||||
|
|
||||||
=== 4. Transform phm-collection.tsv ===
|
=== 4. Transform phm-collection.tsv ===
|
||||||
|
|
||||||
starting time: Sa 28. Okt 00:42:40 CEST 2017
|
starting time: Sa 28. Okt 11:38:26 CEST 2017
|
||||||
|
|
||||||
transform phm-transform.json...
|
transform phm-transform.json...
|
||||||
00:42:40.963 [ refine] GET /command/core/get-models (97ms)
|
11:38:26.684 [ refine] GET /command/core/get-models (95ms)
|
||||||
00:42:40.967 [ refine] POST /command/core/apply-operations (4ms)
|
11:38:26.687 [ refine] POST /command/core/apply-operations (3ms)
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
STARTED ELAPSED %MEM %CPU RSS
|
||||||
00:42:32 00:29 7.1 142 1162720
|
11:38:18 00:28 7.2 139 1169204
|
||||||
|
|
||||||
|
|
||||||
=== 5. Export phm-collection.tsv ===
|
=== 5. Export phm-collection.tsv ===
|
||||||
|
|
||||||
starting time: Sa 28. Okt 00:43:02 CEST 2017
|
starting time: Sa 28. Okt 11:38:47 CEST 2017
|
||||||
|
|
||||||
export to file phm-collection.tsv...
|
export to file phm-collection.tsv...
|
||||||
00:43:02.555 [ refine] GET /command/core/get-models (21588ms)
|
11:38:47.214 [ refine] GET /command/core/get-models (20527ms)
|
||||||
00:43:02.557 [ refine] GET /command/core/get-all-project-metadata (2ms)
|
11:38:47.217 [ refine] GET /command/core/get-all-project-metadata (3ms)
|
||||||
00:43:02.561 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
|
11:38:47.221 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (4ms)
|
||||||
STARTED ELAPSED %MEM %CPU RSS
|
STARTED ELAPSED %MEM %CPU RSS
|
||||||
00:42:32 00:53 7.1 81.1 1164684
|
11:38:18 00:50 7.2 81.2 1170760
|
||||||
|
|
||||||
|
|
||||||
output (number of lines / size in bytes):
|
output (number of lines / size in bytes):
|
||||||
167017 60619468 /home/felix/owncloud/Business/projekte/openrefine/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
|
167017 60619468 /home/felix/openrefine-batch/examples/powerhouse-museum/output/phm-collection.tsv
|
||||||
|
|
||||||
cleanup...
|
cleanup...
|
||||||
00:43:26.161 [ ProjectManager] Saving all modified projects ... (23600ms)
|
11:39:12.562 [ ProjectManager] Saving all modified projects ... (25341ms)
|
||||||
00:43:29.586 [ project_utilities] Saved project '1721413008439' (3425ms)
|
11:39:15.953 [ project_utilities] Saved project '1530205635037' (3391ms)
|
||||||
|
|
||||||
=== Statistics ===
|
=== Statistics ===
|
||||||
|
|
||||||
starting time and run time of each step:
|
starting time and run time of each step:
|
||||||
Start process Sa 28. Okt 00:42:33 CEST 2017 (00:00:00)
|
Start process Sa 28. Okt 11:38:19 CEST 2017 (00:00:00)
|
||||||
Launch OpenRefine Sa 28. Okt 00:42:33 CEST 2017 (00:00:03)
|
Launch OpenRefine Sa 28. Okt 11:38:19 CEST 2017 (00:00:03)
|
||||||
Import all files Sa 28. Okt 00:42:36 CEST 2017 (00:00:04)
|
Import all files Sa 28. Okt 11:38:22 CEST 2017 (00:00:04)
|
||||||
Prepare transform & export Sa 28. Okt 00:42:40 CEST 2017 (00:00:00)
|
Prepare transform & export Sa 28. Okt 11:38:26 CEST 2017 (00:00:00)
|
||||||
Transform phm-collection.tsv Sa 28. Okt 00:42:40 CEST 2017 (00:00:22)
|
Transform phm-collection.tsv Sa 28. Okt 11:38:26 CEST 2017 (00:00:21)
|
||||||
Export phm-collection.tsv Sa 28. Okt 00:43:02 CEST 2017 (00:00:28)
|
Export phm-collection.tsv Sa 28. Okt 11:38:47 CEST 2017 (00:00:30)
|
||||||
End process Sa 28. Okt 00:43:30 CEST 2017 (00:00:00)
|
End process Sa 28. Okt 11:39:17 CEST 2017 (00:00:00)
|
||||||
|
|
||||||
total run time: 00:00:57 (hh:mm:ss)
|
total run time: 00:00:58 (hh:mm:ss)
|
||||||
highest memory load: 1137 MB
|
highest memory load: 1143 MB
|
||||||
|
```
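
The `highest memory load` figure in the sample appears to be the largest `RSS` value reported by `ps` (in KiB), converted to MB with integer division. The following is a minimal sketch of that calculation, not the script's actual code; the function name `highest_rss_mb` is hypothetical, and the three input values are the `RSS` readings from the new sample log.

```shell
# Sketch only: highest_rss_mb is a hypothetical helper, not part of openrefine-batch.sh.
# Takes RSS readings in KiB (as printed in the RSS column of ps) and
# prints the largest one converted to MB using integer division.
highest_rss_mb() {
  local max=0 rss
  for rss in "$@"; do
    if [ "$rss" -gt "$max" ]; then
      max=$rss
    fi
  done
  echo $(( max / 1024 ))
}

# RSS values taken from the import, transform and export steps of the sample log
highest_rss_mb 946616 1169204 1170760   # prints 1143
```

The result matches the `highest memory load: 1143 MB` line of the sample, which suggests the statistic is derived this way, but treat that as an assumption.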
|
||||||
|
|
||||||
|
### Performance gain with extended cross function
|
||||||
|
|
||||||
|
The original cross function expects normalized data (one foreign key per cell in the base column). If a cell holds multiple key values, you first need to split them into multiple rows before applying cross (and join the results afterwards). This can be quite expensive for larger datasets.
|
||||||
|
|
||||||
|
There is a [fork that extends the cross function](https://github.com/felixlohmeier/OpenRefine/wiki) to support an integrated split, which may provide a massive performance gain for this special use case.
|
||||||
|
|
||||||
|
Here is a code snippet to install this fork together with openrefine-batch.sh in a blank directory:
|
||||||
|
```
|
||||||
|
wget https://github.com/felixlohmeier/openrefine-batch/raw/master/openrefine-batch.sh && chmod +x openrefine-batch.sh
|
||||||
|
sed -i 's/.tar.gz/-with-pr1294.tar.gz/' openrefine-batch.sh
|
||||||
|
./openrefine-batch.sh
|
||||||
```
|
```
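
To see what the `sed` line in the snippet above changes, here is a small sketch that applies the same substitution to a single URL instead of editing the script in place. It assumes the downloaded script declares `openrefine_URL` as in v1.8 of this repository; the variable names `url`, `patched` and `patched_strict` are illustrative only.

```shell
# Sketch only: apply the substitution to one URL rather than to the script file.
# The URL is the openrefine_URL value declared in openrefine-batch.sh v1.8.
url='https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-linux-2017-10-28.tar.gz'

# Same expression as in the install snippet; the unescaped dots match any character.
patched=$(printf '%s\n' "$url" | sed 's/.tar.gz/-with-pr1294.tar.gz/')

# Stricter variant with escaped dots; it yields the same result for this URL.
patched_strict=$(printf '%s\n' "$url" | sed 's/\.tar\.gz/-with-pr1294.tar.gz/')

printf '%s\n' "$patched"
```

The patched URL ends in `openrefine-linux-2017-10-28-with-pr1294.tar.gz`, so the install only works if an archive with that name is actually published. The `client_URL` line contains no `.tar.gz` and is left unchanged by the substitution.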
|
||||||
|
|
||||||
### Docker
|
### Docker
|
||||||
|
|
|
@ -1,5 +1,5 @@
|
||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
|
# openrefine-batch-docker.sh, Felix Lohmeier, v1.8, 2017-10-28
|
||||||
# https://github.com/felixlohmeier/openrefine-batch
|
# https://github.com/felixlohmeier/openrefine-batch
|
||||||
|
|
||||||
# check system requirements
|
# check system requirements
|
||||||
|
@ -59,6 +59,15 @@ Usage: sudo ./openrefine-batch-docker.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUT
|
||||||
|
|
||||||
== example ==
|
== example ==
|
||||||
|
|
||||||
|
download example data
|
||||||
|
|
||||||
|
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
|
||||||
|
unzip master.zip openrefine-batch-master/examples/*
|
||||||
|
mv openrefine-batch-master/examples .
|
||||||
|
rm -f master.zip
|
||||||
|
|
||||||
|
execute openrefine-batch-docker.sh
|
||||||
|
|
||||||
sudo ./openrefine-batch-docker.sh \
|
sudo ./openrefine-batch-docker.sh \
|
||||||
-a examples/powerhouse-museum/input/ \
|
-a examples/powerhouse-museum/input/ \
|
||||||
-b examples/powerhouse-museum/config/ \
|
-b examples/powerhouse-museum/config/ \
|
||||||
|
@ -68,16 +77,13 @@ sudo ./openrefine-batch-docker.sh \
|
||||||
-i guessCellValueTypes=true \
|
-i guessCellValueTypes=true \
|
||||||
-RX
|
-RX
|
||||||
|
|
||||||
clone or download GitHub repository to get example data:
|
|
||||||
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
|
|
||||||
|
|
||||||
EOF
|
EOF
|
||||||
exit 1
|
exit 1
|
||||||
}
|
}
|
||||||
|
|
||||||
# defaults
|
# defaults
|
||||||
ram="2048M"
|
ram="2048M"
|
||||||
version="2.7"
|
version="dev"
|
||||||
restartfile="true"
|
restartfile="true"
|
||||||
restarttransform="true"
|
restarttransform="true"
|
||||||
export="true"
|
export="true"
|
||||||
|
|
|
@ -1,9 +1,9 @@
|
||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
# openrefine-batch.sh, Felix Lohmeier, v1.7, 2017-10-28
|
# openrefine-batch.sh, Felix Lohmeier, v1.8, 2017-10-28
|
||||||
# https://github.com/felixlohmeier/openrefine-batch
|
# https://github.com/felixlohmeier/openrefine-batch
|
||||||
|
|
||||||
# declare download URLs for OpenRefine and OpenRefine client
|
# declare download URLs for OpenRefine and OpenRefine client
|
||||||
openrefine_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-linux-2017-10-26.tar.gz"
|
openrefine_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-linux-2017-10-28.tar.gz"
|
||||||
client_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-client_0-3-1_linux-64bit"
|
client_URL="https://github.com/opencultureconsulting/openrefine-batch/raw/master/src/openrefine-client_0-3-1_linux-64bit"
|
||||||
|
|
||||||
# check system requirements
|
# check system requirements
|
||||||
|
@ -83,6 +83,15 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
|
||||||
|
|
||||||
== example ==
|
== example ==
|
||||||
|
|
||||||
|
download example data
|
||||||
|
|
||||||
|
wget https://github.com/opencultureconsulting/openrefine-batch/archive/master.zip
|
||||||
|
unzip master.zip openrefine-batch-master/examples/*
|
||||||
|
mv openrefine-batch-master/examples .
|
||||||
|
rm -f master.zip
|
||||||
|
|
||||||
|
execute openrefine-batch.sh
|
||||||
|
|
||||||
./openrefine-batch.sh \
|
./openrefine-batch.sh \
|
||||||
-a examples/powerhouse-museum/input/ \
|
-a examples/powerhouse-museum/input/ \
|
||||||
-b examples/powerhouse-museum/config/ \
|
-b examples/powerhouse-museum/config/ \
|
||||||
|
@ -92,9 +101,6 @@ Usage: ./openrefine-batch.sh [-a INPUTDIR] [-b TRANSFORMDIR] [-c OUTPUTDIR] ...
|
||||||
-i guessCellValueTypes=true \
|
-i guessCellValueTypes=true \
|
||||||
-RX
|
-RX
|
||||||
|
|
||||||
clone or download GitHub repository to get example data:
|
|
||||||
https://github.com/felixlohmeier/openrefine-batch/archive/master.zip
|
|
||||||
|
|
||||||
EOF
|
EOF
|
||||||
exit 1
|
exit 1
|
||||||
}
|
}
|
||||||
|
|