openrefine-task-runner/README.md

124 lines
5.1 KiB
Markdown
Raw Normal View History

2021-02-20 00:22:12 +01:00
# OpenRefine Task Runner (💎+🤖)
2022-04-06 22:01:03 +02:00
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
2021-02-23 17:36:36 +01:00
2022-04-06 20:43:30 +02:00
Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
The workflow is defined in [Taskfile.yml](Taskfile.yml) and can be executed either locally (`task default`) or with [GitHub Actions](.github/workflows/default.yml).
2021-02-20 00:22:12 +01:00
## Features
* basic error handling by monitoring the OpenRefine server log
2022-04-06 13:30:59 +02:00
* dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
2021-02-20 00:22:12 +01:00
* prevent unnecessary work by fingerprinting generated files and their sources
* the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
* import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
* apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
* export to CSV, TSV, HTML, XLS, XLSX, ODS
* [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
2022-04-06 13:30:59 +02:00
* works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
2021-02-20 00:22:12 +01:00
* tasks are easy to extend with additional commands (e.g. to download input data or validate results)
2022-04-06 20:43:30 +02:00
## Requirements
* GNU/Linux (tested with Fedora 34)
* JAVA 8+ (for OpenRefine)
2021-02-20 00:22:12 +01:00
## Typical workflow
**Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. dedup.json).
**Step 2**: Configure a task to automate importing your data set, applying the json file and exporting to the required output format.
**Possible automation benefits:**
2022-04-06 13:30:59 +02:00
* When you receive updated data (in the same structure), you just need to drop the input file and start the task like this:
2021-02-20 00:22:12 +01:00
```sh
2022-04-06 13:30:59 +02:00
task
2021-02-20 00:22:12 +01:00
```
* The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
* Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
2021-06-11 20:01:01 +02:00
## Demo via binder
2022-04-06 22:01:03 +02:00
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
2021-06-11 20:01:01 +02:00
- free to use on-demand server with Jupyterlab and Bash Kernel
- OpenRefine, openrefine-client and go-task [preinstalled](binder/postBuild)
- no registration needed, will start within a few minutes
2022-04-06 13:30:59 +02:00
- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
2021-06-11 20:01:01 +02:00
2021-02-20 00:22:12 +01:00
## Install
1. Clone this git repository
```sh
git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
cd openrefine-task-runner
```
2022-04-06 13:30:59 +02:00
2. Install [Task 3.10.0](https://github.com/go-task/task/releases/tag/v3.10.0)+
2021-02-20 00:22:12 +01:00
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
2022-04-06 13:26:22 +02:00
wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
2021-02-20 00:22:12 +01:00
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```
b) DEB-based (Debian, Ubuntu etc.)
```sh
2022-04-06 13:26:22 +02:00
wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
2021-02-20 00:22:12 +01:00
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```
2022-04-06 13:27:50 +02:00
3. Run install task to download [OpenRefine 3.5.2](https://github.com/OpenRefine/OpenRefine/releases/tag/3.5.2) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
2021-02-20 00:22:12 +01:00
```sh
task install
```
## Usage
2022-04-06 20:43:30 +02:00
* Run workflow
2021-02-20 00:22:12 +01:00
```sh
2022-04-06 13:30:59 +02:00
task default
2021-02-20 00:22:12 +01:00
```
2022-04-06 13:30:59 +02:00
* Override settings with environment variables
2021-02-20 00:22:12 +01:00
```sh
2022-04-06 13:30:59 +02:00
OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
2021-02-20 00:22:12 +01:00
```
* Force run a task even when the task is up-to-date
```sh
2022-04-06 13:30:59 +02:00
task default --force
2021-02-20 00:22:12 +01:00
```
* Dry-run in verbose mode for debugging
```sh
2022-04-06 13:30:59 +02:00
task default --dry --verbose --force
2021-02-20 00:22:12 +01:00
```
* List available tasks
```sh
task --list
```
2022-04-06 13:30:59 +02:00
### Examples
2021-02-20 00:22:12 +01:00
2022-04-06 13:30:59 +02:00
* [noah-biejournals](https://github.com/opencultureconsulting/noah-biejournals): Harvesting des Zeitschriftenservers BieJournals der UB Bielefeld und Transformation in METS/MODS für das Portal noah.nrw
2021-02-20 00:22:12 +01:00
### Getting help
Please file an [issue](https://github.com/opencultureconsulting/openrefine-task-runner/issues) if you miss some features or if you have tracked a bug. And you are welcome to ask any questions!