opencultureconsulting/openrefine-task-runner

mirror of https://github.com/opencultureconsulting/openrefine-task-runner.git synced 2025-05-04 00:00:23 +02:00

OpenRefine Task Runner (💎+🤖)

Templates for OpenRefine batch processing (import, transform, export) using the task runner go-task and the openrefine-client to control OpenRefine via its HTTP API.

Features

run tasks in parallel
basic error handling by monitoring the OpenRefine server log
dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
prevent unnecessary work by fingerprinting generated files and their sources
the openrefine-client used here supports many core features of OpenRefine:
- import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
- apply undo/redo history from given JSON file(s)
- export to CSV, TSV, HTML, XLS, XLSX, ODS
- templating export to additional formats like JSON or XML
- works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
tasks are easy to extend with additional commands (e.g. to download input data or validate results)
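
The fingerprinting idea from the feature list can be illustrated with a small shell sketch. This is not how go-task implements it internally; it only shows the principle of comparing a stored checksum of the source files with a fresh one and skipping the work when nothing has changed. The function names, the `.stamp` file and the sample input are hypothetical.

```shell
#!/usr/bin/env bash
# Illustration only: skip work when the sources have not changed.
# fingerprint, up_to_date and the .stamp file are hypothetical names.
set -euo pipefail

# Print one checksum over the contents of all given files.
fingerprint() {
  cat -- "$@" | sha256sum | cut -d' ' -f1
}

# Succeed if the checksum stored in the stamp file matches the current sources.
up_to_date() {
  local stamp="$1"; shift
  [ -f "$stamp" ] && [ "$(cat "$stamp")" = "$(fingerprint "$@")" ]
}

# Demo in a throwaway directory with a made-up input file.
work="$(mktemp -d)"
printf 'id,name\n1,a\n' > "$work/input.csv"

# First check: no stamp file yet, so the work has to run.
if up_to_date "$work/.stamp" "$work/input.csv"; then first="skip"; else first="run"; fi

# Record the fingerprint after the (imaginary) work has finished.
fingerprint "$work/input.csv" > "$work/.stamp"

# Second check: sources unchanged, so the work can be skipped.
if up_to_date "$work/.stamp" "$work/input.csv"; then second="skip"; else second="run"; fi

echo "first: $first, second: $second"
```

Changing the input file afterwards makes `up_to_date` fail again, which corresponds to a task being re-run when its sources are updated.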

Typical workflow

Step 1: Experiment with your data (or parts of it) in the graphical user interface of OpenRefine. Once you are satisfied with the transformation rules, extract the undo/redo history as JSON and save it as a file (e.g. dedup.json).

Step 2: Configure a task to automate importing your data set, applying the JSON file and exporting to the required output format.
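
The three stages of such a task (import, transform, export) can be sketched as plain openrefine-client calls. This is a minimal sketch, not the content of the shipped Taskfiles: the port, the project name and all file names are assumptions, and the flags are written as I understand the client's help screen, so please verify them with `--help`. Starting and stopping the dedicated OpenRefine instance is not shown. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: import -> transform -> export for one data set.
# PORT, PROJECT and all file names are illustrative assumptions.
set -euo pipefail

CLIENT="${CLIENT:-openrefine/client}"  # client path as mentioned in this README
PORT="${PORT:-3334}"                   # hypothetical port of a dedicated instance
PROJECT="${PROJECT:-dedup}"            # hypothetical project name
DRY_RUN="${DRY_RUN:-1}"                # 1 = print the commands instead of running them

# Run a command, or just print it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

batch() {
  # 1. import the data set into a new project
  run "$CLIENT" -P "$PORT" --create "input/data.csv" --projectName="$PROJECT"
  # 2. apply the transformation history saved in step 1
  run "$CLIENT" -P "$PORT" --apply "config/dedup.json" "$PROJECT"
  # 3. export the result (format chosen by file extension, as far as I know)
  run "$CLIENT" -P "$PORT" --export --output="output/data.tsv" "$PROJECT"
}

batch
```

Setting `DRY_RUN=0` would execute the commands against a running OpenRefine server on the given port.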

Possible automation benefits:

When you receive updated data (in the same structure), you only need to drop in the new file and start the task like this:
```
task example-doaj
```
The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.

Requirements

GNU/Linux (tested with Fedora 32)
Java 8+ (for OpenRefine)
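
A quick way to see whether the tools used in this README are present is a small preflight check. This is only a sketch: the function name `missing_tools` is hypothetical, and the check tests for presence in `PATH` only, not for the Java version.

```shell
#!/usr/bin/env bash
# Sketch: report which of the tools used in this README are not installed.
# missing_tools is a hypothetical helper, not part of this repository.
set -euo pipefail

# Print every command from the argument list that is not found in PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing="$(missing_tools java git wget)"
if [ -n "$missing" ]; then
  # Unquoted on purpose so that the names are joined on one line.
  echo "missing:" $missing
else
  echo "all required tools found"
fi
```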

Demo via Binder

free-to-use, on-demand server with JupyterLab and Bash kernel
OpenRefine, openrefine-client and go-task preinstalled
no registration needed, will start within a few minutes
restricted to 4 GB RAM and server will be deleted after 10 minutes of inactivity
service is provided by GESIS and is intended for use by social scientists

Install

Clone this git repository

```
git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
cd openrefine-task-runner
```

Install Task 3.2.2

a) RPM-based (Fedora, CentOS, SLES, etc.)

```
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```

b) DEB-based (Debian, Ubuntu etc.)

```
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```

Run install task to download OpenRefine 3.4.1 and openrefine-client 0.3.10
```
task install
```

Usage

Run all tasks in parallel
```
task
```
Run a specific task
```
task example-duplicates:main
```

Run some tasks in parallel

```
task --parallel example-duplicates:main example-doaj:main
```

Force run a task even when the task is up-to-date
```
task example-duplicates:main --force
```

Dry-run in verbose mode for debugging

```
task example-duplicates:main --dry --verbose --force
```

List available tasks
```
task --list
```

How to develop your own tasks

(first draft, will be elaborated later)

create a new folder
copy an example Taskfile.yml
provide input data in subdirectory input
provide OpenRefine transformation history files in subdirectory config
add commands to the project-specific Taskfile (see the openrefine-client help screen for available options: openrefine/client --help)
add the project to the general Taskfile
check memory load and increase RAM if needed
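
The first four steps above can be sketched as a small scaffolding function. This is a hedged sketch, not a script shipped with this repository: the function name `scaffold` and the project name `my-project` are hypothetical, and editing the copied Taskfile as well as registering the project in the general Taskfile still have to be done by hand.

```shell
#!/usr/bin/env bash
# Sketch: create a new project folder from one of the examples.
# scaffold and my-project are hypothetical names.
set -euo pipefail

# Usage: scaffold TEMPLATE_DIR NEW_DIR
scaffold() {
  local template="$1" project="$2"
  if [ -e "$project" ]; then
    echo "refusing to overwrite existing $project" >&2
    return 1
  fi
  # subdirectories for input data and OpenRefine transformation history files
  mkdir -p "$project/input" "$project/config"
  # start from an example Taskfile and adapt it afterwards
  cp "$template/Taskfile.yml" "$project/Taskfile.yml"
  echo "created $project"
}

# Example call (commented out so that running this file has no side effects):
# scaffold example-duplicates my-project
```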

Getting help

Please file an issue if you are missing a feature or have found a bug. You are also welcome to ask any questions!