🎉 first draft
parent fb4b884706
commit 62eb0cddbf
@ -0,0 +1,7 @@
.task
openrefine
*/output
example-doaj/input
example-doaj/config
example-powerhouse/input
example-powerhouse/config
144
README.md
@ -1,2 +1,142 @@
# openrefine-tasks
# OpenRefine Task Runner (💎+🤖)
Templates for OpenRefine batch processing (import, transform, export) using a task runner and a Python client.

Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).

## Features

* run tasks in parallel
* basic error handling by monitoring the OpenRefine server log
* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
* prevent unnecessary work by fingerprinting generated files and their sources
* the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
    * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
    * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
    * export to CSV, TSV, HTML, XLS, XLSX, ODS
    * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
    * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
* tasks are easy to extend with additional commands (e.g. to download input data or validate results)

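The log-based error handling mentioned in the feature list can be illustrated with a small shell function. This is only a sketch modelled on the `check` task in the general Taskfile; the function name `check_log` is an assumption made for this example and does not exist in the repository.

```sh
#!/bin/sh
# Hypothetical helper (not part of the repository): scan one log file for
# the words "exception" or "error", ignoring case.
# Prints the matching lines and returns 1 if any are found, 0 otherwise.
# Note: a missing log file is reported by grep but treated like "no match".
check_log() {
  if grep -i -e 'exception' -e 'error' "$1"; then
    echo "log contains warnings!" >&2
    return 1
  fi
}
```

A call such as `check_log example-doaj/output/openrefine.log || exit 1` would then stop a surrounding script as soon as the server log reports a problem.
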
## Typical workflow

**Step 1**: Experiment with your data (or a sample of it) in OpenRefine's graphical user interface. Once you are satisfied with the transformation rules, [extract the JSON code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. dedup.json).

**Step 2**: Configure a task to automate importing your data set, applying the JSON file and exporting to the required output format.

**Possible automation benefits:**

* When you receive updated data (in the same structure), you just need to drop the file into the project's input directory and start the task like this:

    ```sh
    task example-doaj:main
    ```

* The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.

* Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.

## Requirements

* GNU/Linux (tested with Fedora 32)
* Java 8+ (for OpenRefine)

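Before installing, it can help to check which command-line tools are present. The sketch below is not part of the repository: the function name `missing_tools` is hypothetical, and the tool list is an assumption derived from the commands that appear in the Taskfiles of this commit, not an official list of requirements.

```sh
#!/bin/sh
# Hypothetical helper: print every command from the argument list
# that cannot be found on PATH (one name per line).
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool list assumed from the Taskfile commands, not an official list.
missing=$(missing_tools java wget tar sed lsof ps timeout readlink)
if [ -n "$missing" ]; then
  echo "missing tools:" $missing >&2
fi
```
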
## Install

1. Clone this git repository

    ```sh
    git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
    cd openrefine-task-runner
    ```

2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)

    a) RPM-based (Fedora, CentOS, SLES, etc.)

    ```sh
    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
    ```

    b) DEB-based (Debian, Ubuntu, etc.)

    ```sh
    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
    ```

3. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)

    ```sh
    task install
    ```

## Usage

* Run all tasks in parallel

    ```sh
    task
    ```

* Run a specific task

    ```sh
    task example-duplicates:main
    ```

* Run some tasks in parallel

    ```sh
    task --parallel example-duplicates:main example-doaj:main
    ```

* Force run a task even when the task is up-to-date

    ```sh
    task example-duplicates:main --force
    ```

* Dry-run in verbose mode for debugging

    ```sh
    task example-duplicates:main --dry --verbose --force
    ```

* List available tasks

    ```sh
    task --list
    ```

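The `--force` option above bypasses the up-to-date check. To illustrate the general idea of such a check, here is a minimal sketch of fingerprinting source files with checksums. It is not go-task's implementation; the function names `fingerprint` and `up_to_date` and the state file are assumptions made for this example, and `sha256sum` from GNU coreutils is assumed to be available.

```sh
#!/bin/sh
# Hypothetical helper: print one combined checksum over all regular files
# below the given directories.
fingerprint() {
  find "$@" -type f -exec sha256sum {} + | sort | sha256sum | cut -d ' ' -f 1
}

# Hypothetical helper: return 0 if the current fingerprint equals the one
# stored in the state file; otherwise store the new fingerprint and return 1.
# Usage: up_to_date <state-file> <dir>...
up_to_date() {
  state=$1; shift
  new=$(fingerprint "$@")
  [ -f "$state" ] && [ "$(cat "$state")" = "$new" ] && return 0
  printf '%s\n' "$new" > "$state"
  return 1
}
```

With these helpers, `up_to_date .fingerprint input config || echo "sources changed"` would report a change only when a file below `input` or `config` was added, removed or modified since the last call.
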
### How to develop your own tasks

(first draft, will be elaborated later)

1. create a new folder for your project
2. copy an example Taskfile.yml into it
3. provide input data in the subdirectory `input`
4. provide OpenRefine transformation history files in the subdirectory `config`
5. add commands to the project's Taskfile (check the openrefine-client help screen for available options: `openrefine/client --help`)
6. add the project to the general Taskfile (see the comments `# add your project here`)
7. check memory load and increase `RAM` in the project's Taskfile if needed

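The first steps of the list above could be scripted roughly as follows. This is a sketch under assumptions: the helper `new_project` is hypothetical, and it only creates the folders and copies an example Taskfile with a new `PORT` value. The import, apply and export commands still have to be adapted by hand, and the project still has to be added to the general Taskfile.

```sh
#!/bin/sh
# Hypothetical helper (not part of the repository): scaffold a new project
# folder from an existing example.
# Usage: new_project <example-dir> <new-project-dir> <port>
new_project() {
  example=$1; project=$2; port=$3
  # Create the project folder with its input and config subdirectories.
  mkdir -p "$project/input" "$project/config" || return 1
  # Copy the example Taskfile and give the new project a port of its own.
  sed "s/^\( *PORT: \)[0-9][0-9]*/\1$port/" "$example/Taskfile.yml" \
    > "$project/Taskfile.yml"
}
```

For example, `new_project example-duplicates my-project 3337` would create `my-project` with a Taskfile that uses port 3337 (a port number chosen here only for illustration).
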
### Getting help

Please file an [issue](https://github.com/opencultureconsulting/openrefine-task-runner/issues) if you are missing a feature or have found a bug. You are also welcome to ask questions!

## To do

- [ ] Codacy badge (needs to be public)
- [ ] add client log messages to openrefine.log (tee -a)
- [ ] differentiate examples
    - [ ] example for loading multiple input files by providing a zip archive
    - [ ] example for downloading "fresh" input data as a dependent task and generating archives/diffs
    - [ ] example for applying multiple json files
    - [ ] example for templating xml and validation with xmllint
- [ ] describe example datasets (and differences) with source code examples
- [ ] elaborate how-to for developing tasks
- [ ] document openrefine-client options and defaults (tables for input and output with file-format-specific defaults) including templating
- [ ] how-to for extracting input options from OpenRefine GUI (via metadata in open project)
- [ ] document known issues, e.g. [import xls, xlsx, ods](https://github.com/opencultureconsulting/openrefine-client/issues/4)
- [ ] add Binder files and badge
- [ ] add example notebooks (links to nbviewer and Binder)
@ -0,0 +1,84 @@
# https://github.com/opencultureconsulting/openrefine-tasks

version: '3'

includes:
  example-doaj:
    taskfile: example-doaj
    dir: example-doaj
  example-duplicates:
    taskfile: example-duplicates
    dir: example-duplicates
  example-powerhouse:
    taskfile: example-powerhouse
    dir: example-powerhouse
  # add your project here

silent: true
output: prefixed

tasks:
  default:
    desc: execute all projects in parallel
    deps:
      - task: example-doaj:refine
      - task: example-duplicates:refine
      - task: example-powerhouse:refine
      # add your project here
    cmds:
      - task: check

  install:
    desc: (re)install OpenRefine and openrefine-client into subdirectory openrefine
    cmds:
      - | # delete existing install and recreate folder
        rm -rf openrefine; mkdir -p openrefine
      - | # install OpenRefine into subdirectory openrefine
        wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
        tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
      - sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "openrefine/refine" # fix path issue in OpenRefine startup file
      - sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini" # do not try to open OpenRefine in browser
      - sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini" # set autosave period from 5 minutes to 24 hours
      - | # install openrefine-client into subdirectory openrefine
        wget --no-verbose -O openrefine/client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
        chmod +x openrefine/client

  start:
    dir: ./{{.PROJECT}}/output
    cmds:
      - | # check install and delete any temporary OpenRefine files
        if [ ! -f "../../openrefine/refine" ]; then
          echo 1>&2 "OpenRefine missing; try task install"; exit 1
        fi
        rm -rf ./*.project* workspace.json
      - | # launch OpenRefine with specific data directory and redirect its output to a log file
        ../../openrefine/refine -v warn -p {{.PORT}} -m {{.RAM}} \
          -d ../{{.PROJECT}}/output \
          > openrefine.log 2>&1 &
      - | # wait until OpenRefine API is available
        timeout 30s bash -c "until
          wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
          do sleep 1
        done"

  stop:
    dir: ./{{.PROJECT}}/output
    cmds:
      - | # shut down OpenRefine
        PID=$(lsof -t -i:{{.PORT}})
        kill $PID
        while ps -p $PID > /dev/null; do sleep 1; done
      - | # archive the OpenRefine project
        tar cfz \
          {{.PROJECT}}.openrefine.tar.gz \
          -C $(grep -l {{.PROJECT}} *.project/metadata.json | cut -d '/' -f 1) \
          .

  check:
    desc: check OpenRefine log for any warnings and exit on error
    dir: ./{{.PROJECT}}
    cmds:
      - | # find log file(s) and check for "exception" or "error"
        if grep -i 'exception\|error' $(find . -name openrefine.log); then
          echo 1>&2 "log contains warnings!"; exit 1
        fi
@ -0,0 +1,77 @@
version: '3'

tasks:
  main:
    desc: Library Carpentry Lesson covering DOAJ
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'}

  refine:
    vars:
      PORT: 3335 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    deps: # will be executed each run independent of up-to-date check
      - task: download
    cmds: # tasks prepended with ":" are defined in Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  download:
    cmds:
      - mkdir -p input config
      - wget --no-verbose -O input/doaj-article-sample.csv https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
      - wget --no-verbose -O config/doaj-openrefine.json https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json

  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
          --create doaj-article-sample.csv \
          --projectName {{.PROJECT}}
    ignore_error: true # workaround

  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --apply doaj-openrefine.json
    ignore_error: true # workaround

  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --output "$(readlink -m doaj-results.tsv)"
    ignore_error: true # workaround

  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
@ -0,0 +1,70 @@
version: '3'

tasks:
  main:
    desc: Removing duplicates in a very small test dataset
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'}

  refine:
    vars:
      PORT: 3334 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds: # tasks prepended with ":" are defined in Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
          --create duplicates.csv \
          --encoding UTF-8 \
          --projectName {{.PROJECT}}
    ignore_error: true # workaround

  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --apply duplicates-deletion.json
    ignore_error: true # workaround

  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --output "$(readlink -m deduped.xls)"
    ignore_error: true # workaround

  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
@ -0,0 +1,69 @@
[
  {
    "op": "core/row-reorder",
    "description": "Reorder rows",
    "mode": "record-based",
    "sorting": {
      "criteria": [
        {
          "errorPosition": 1,
          "caseSensitive": false,
          "valueType": "string",
          "column": "email",
          "blankPosition": 2,
          "reverse": false
        }
      ]
    }
  },
  {
    "op": "core/column-addition",
    "description": "Create column count at index 1 based on column email using expression grel:facetCount(value, \"value\", \"email\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "count",
    "columnInsertIndex": 1,
    "baseColumnName": "email",
    "expression": "grel:facetCount(value, \"value\", \"email\")",
    "onError": "set-to-blank"
  },
  {
    "op": "core/blank-down",
    "description": "Blank down cells in column email",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "email"
  },
  {
    "op": "core/row-removal",
    "description": "Remove rows",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "isBlank(value)",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectError": false,
          "invert": false,
          "name": "email",
          "omitBlank": false,
          "type": "list",
          "columnName": "email"
        }
      ]
    }
  }
]
@ -0,0 +1,11 @@
email,name,state,gender,purchase
danny.baron@example1.com,Danny Baron,CA,M,TV
melanie.white@example2.edu,Melanie White,NC,F,iPhone
danny.baron@example1.com,D. Baron,CA,M,Winter jacket
ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight
arthur.duff@example4.com,Arthur Duff,OR,M,Dining table
danny.baron@example1.com,Daniel Baron,CA,M,Bike
jean.griffith@example5.org,Jean Griffith,WA,F,Power drill
melanie.white@example2.edu,Melanie White,NC,F,iPad
ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier
arthur.duff@example4.com,Arthur Duff,OR,M,Night table
@ -0,0 +1,79 @@
version: '3'

tasks:
  main:
    desc: Powerhouse Museum Tutorial
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'}

  refine:
    vars:
      PORT: 3336 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    deps: # will be executed each run independent of up-to-date check
      - task: download
    cmds: # tasks prepended with ":" are defined in Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  download:
    cmds:
      - mkdir -p input config
      - wget --no-verbose -O input/phm-collection.tsv https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
      - wget --no-verbose -O config/phm-transform.json https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json

  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
          --create phm-collection.tsv \
          --processQuotes false \
          --guessCellValueTypes true \
          --projectName {{.PROJECT}}
    ignore_error: true # workaround

  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --apply phm-transform.json
    ignore_error: true # workaround

  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
          --output "$(readlink -m phm-results.tsv)"
    ignore_error: true # workaround

  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"