Templates for OpenRefine batch processing (import, transform, export) using a task runner and a Python client.
Felix Lohmeier 30ea93e3f3 minimize task install 2022-04-06 22:10:41 +02:00

OpenRefine Task Runner (💎+🤖)


Templates for OpenRefine batch processing (import, transform, export) using the task runner go-task and the openrefine-client to control OpenRefine via its HTTP API.

The workflow is defined in Taskfile.yml and can be executed either locally (task default) or with GitHub Actions.

Features

  • basic error handling by monitoring the OpenRefine server log
  • dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
  • prevent unnecessary work by fingerprinting generated files and their sources
  • the openrefine-client used here supports many core features of OpenRefine:
    • import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
    • apply undo/redo history from given JSON file(s)
    • export to CSV, TSV, HTML, XLS, XLSX, ODS
    • templating export to additional formats like JSON or XML
    • works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
  • tasks are easy to extend with additional commands (e.g. to download input data or validate results)
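
The fingerprinting idea from the feature list can be illustrated with a small shell sketch. It is not how go-task implements the feature; the function names (`fingerprint`, `needs_run`, `mark_done`) and the stamp file are hypothetical and only show the principle: a step is repeated when its generated file is missing or when the checksum of its sources has changed.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and the stamp file are hypothetical.

# Print one SHA-256 digest over the concatenated contents of the given files.
fingerprint() {
  cat -- "$@" | sha256sum | cut -d' ' -f1
}

# needs_run STAMP GENERATED SOURCE...
# Succeeds (exit 0) when the step has to be (re)run.
needs_run() {
  local stamp=$1 generated=$2
  shift 2
  [ -f "$generated" ] || return 0   # output missing
  [ -f "$stamp" ] || return 0       # no fingerprint recorded yet
  [ "$(fingerprint "$@")" != "$(cat "$stamp")" ]   # sources changed
}

# mark_done STAMP SOURCE...
# Record the current fingerprint of the sources after a successful run.
mark_done() {
  local stamp=$1
  shift
  fingerprint "$@" > "$stamp"
}
```

A step would then be wrapped as `if needs_run stamp out.csv in.csv; then ...; mark_done stamp in.csv; fi`. Because the digest is taken over the concatenated contents, this sketch cannot tell apart changes that only move bytes between source files.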

Requirements

  • GNU/Linux (tested with Fedora 34)
  • Java 8+ (for OpenRefine)

Typical workflow

Step 1: Experiment with your data (or a sample of it) in OpenRefine's graphical user interface. Once you are satisfied with the transformation rules, extract the operation history as JSON (Undo / Redo > Extract...) and save it as a file (e.g. dedup.json).

Step 2: Configure a task that automates importing your data set, applying the JSON file and exporting to the required output format.
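
The three stages of Step 2 could look roughly like the following dry-run sketch. It only prints the openrefine-client calls instead of executing them, so it runs without an OpenRefine server. The flags (`-P`, `--create`, `--projectName`, `--apply`, `--export`, `--output`) are written as recalled from the client's command-line help and should be checked against the installed version. The client name, project name and file paths are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the client calls for import, transform and export.
# CLIENT, PROJECT and all file paths are hypothetical placeholders;
# verify the flags with the client's --help before relying on them.
CLIENT="${CLIENT:-openrefine-client}"
PORT="${OPENREFINE_PORT:-3333}"   # 3333 is OpenRefine's standard port
PROJECT="${PROJECT:-example}"

plan_workflow() {
  local input=$1 history=$2 output=$3
  # 1. import: create a project from the input file
  printf '%s -P %s --create %s --projectName %s\n' \
    "$CLIENT" "$PORT" "$input" "$PROJECT"
  # 2. transform: apply the saved undo/redo history
  printf '%s -P %s %s --apply %s\n' \
    "$CLIENT" "$PORT" "$PROJECT" "$history"
  # 3. export: write the result to the output file
  printf '%s -P %s %s --export --output %s\n' \
    "$CLIENT" "$PORT" "$PROJECT" "$output"
}

plan_workflow input/duplicates.csv config/dedup.json output/example.csv
```

In a Taskfile, each printed line would typically become one command of a task, so that import options and the applied history are documented in one place.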

Possible automation benefits:

  • When you receive updated data (in the same structure), you only need to replace the input file and start the task like this:

    task
    
  • The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.

  • Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.

Demo via binder


  • free-to-use on-demand server with JupyterLab and Bash kernel
  • OpenRefine, openrefine-client and go-task preinstalled
  • no registration needed; starts within a few minutes
  • restricted to 2 GB RAM; the server is deleted after 10 minutes of inactivity

Install

  1. Clone this git repository

    git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
    cd openrefine-task-runner
    
  2. Install Task 3.10.0+

    a) RPM-based (Fedora, CentOS, SLES, etc.)

    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
    

    b) DEB-based (Debian, Ubuntu etc.)

    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
    
  3. Run the install task to download OpenRefine 3.5.2 and openrefine-client 0.3.10

    task install
    

Usage

  • Run workflow

    task default
    
  • Override settings with environment variables

    OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
    
  • Force a task to run even when it is up to date

    task default --force
    
  • Dry-run in verbose mode for debugging

    task default --dry --verbose --force
    
  • List available tasks

    task --list
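
The environment-variable override shown above follows a common shell pattern: a setting keeps a built-in default unless the caller provides a value. The sketch below illustrates the pattern only; the function name is hypothetical and the default values are placeholders that may differ from those set in Taskfile.yml.

```shell
#!/usr/bin/env bash
# Sketch of "default unless overridden" settings.
# show_settings is a hypothetical helper; the defaults are placeholders
# and are not taken from Taskfile.yml.
show_settings() {
  local memory="${OPENREFINE_MEMORY:-1024M}"
  local port="${OPENREFINE_PORT:-3333}"
  printf 'memory=%s port=%s\n' "$memory" "$port"
}

# Uses the defaults when neither variable is set in the environment.
show_settings

# Overrides apply only to this single call.
OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 show_settings
```

Prefixing the assignments to a single command, as in the usage example, keeps the override local to that run and leaves the rest of the shell session unchanged.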
    

Examples

  • noah-biejournals: Harvesting the BieJournals journal server of Bielefeld University Library and transforming the data to METS/MODS for the portal noah.nrw

Getting help

Please file an issue if you are missing a feature or have found a bug. Questions are welcome, too!