🎉 first draft

2025-05-18 00:00:43 +02:00 · 2021-02-20 00:22:12 +01:00 · 2021-02-20 00:22:12 +01:00 · 62eb0cddbf
commit 62eb0cddbf
parent fb4b884706
8 changed files with 539 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,7 @@
+.task
+openrefine
+*/output
+example-doaj/input
+example-doaj/config
+example-powerhouse/input
+example-powerhouse/config
--- a/README.md
+++ b/README.md
@@ -1,2 +1,142 @@
-# openrefine-tasks
-Templates for OpenRefine batch processing (import, transform, export) using a task runner and a Python client.
+# OpenRefine Task Runner (💎+🤖)
+
+Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api). 
+
+## Features
+
+* run tasks in parallel
+* basic error handling by monitoring the OpenRefine server log
+* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
+* prevent unnecessary work by fingerprinting generated files and their sources
+* the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
+  * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
+  * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
+  * export to CSV, TSV, HTML, XLS, XLSX, ODS
+  * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
+  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
+* tasks are easy to extend with additional commands (e.g. to download input data or validate results)
+
+## Typical workflow
+
+**Step 1**: Experiment with your data (or a sample of it) in the graphical user interface of OpenRefine. Once you are satisfied with the transformation rules, [extract the JSON code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. dedup.json).
+
+**Step 2**: Configure a task to automate importing your data set, applying the JSON file and exporting to the required output format.
+
+**Possible automation benefits:**
+
+* When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
+
+  ```sh
+  task example-doaj:main
+  ```
+
+* The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
+
+* Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
+
+## Requirements
+
+* GNU/Linux (tested with Fedora 32)
+* Java 8+ (for OpenRefine)
+
+## Install
+
+1. Clone this git repository
+
+    ```sh
+    git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
+    cd openrefine-task-runner
+    ```
+
+2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
+
+    a) RPM-based (Fedora, CentOS, SLES, etc.)
+
+    ```sh
+    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
+    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
+    ```
+
+    b) DEB-based (Debian, Ubuntu etc.)
+
+    ```sh
+    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
+    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
+    ```
+
+3. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
+
+   ```sh
+   task install
+   ```
+
+## Usage
+
+* Run all tasks in parallel
+
+    ```sh
+    task
+    ```
+
+* Run a specific task
+
+    ```sh
+    task example-duplicates:main
+    ```
+
+* Run some tasks in parallel
+
+    ```sh
+    task --parallel example-duplicates:main example-doaj:main
+    ```
+
+* Force run a task even when the task is up-to-date
+
+    ```sh
+    task example-duplicates:main --force
+    ```
+
+* Dry-run in verbose mode for debugging
+
+    ```sh
+    task example-duplicates:main --dry --verbose --force
+    ```
+
+* List available tasks
+
+    ```sh
+    task --list
+    ```
+
+### How to develop your own tasks
+
+(first draft, will be elaborated later)
+
+1. create a new folder for your project
+2. copy the Taskfile.yml of one of the examples into it
+3. provide input data in the subdirectory input
+4. provide OpenRefine transformation history files in the subdirectory config
+5. add commands to the project's Taskfile (check the openrefine-client help screen for available options: `openrefine/client --help`)
+6. add the project to the general Taskfile (under `includes` and in the `deps` of the default task)
+7. check the memory load and increase RAM if needed
+
+### Getting help
+
+Please file an [issue](https://github.com/opencultureconsulting/openrefine-task-runner/issues) if you are missing a feature or have found a bug. Questions are welcome too!
+
+## To do
+
+- [ ] Codacy badge (needs to be public)
+- [ ] add client log messages to openrefine.log (tee -a)
+- [ ] differentiate examples
+  - [ ] example for loading multiple input files by providing a zip archive
+  - [ ] example for downloading "fresh" input data as a dependent task and generating archives/diffs
+  - [ ] example for applying multiple json files
+  - [ ] example for templating xml and validation with xmllint
+- [ ] describe example datasets (and differences) with source code examples
+- [ ] elaborate how-to for developing tasks
+  - [ ] document openrefine-client options and defaults (tables for input and output with file-format-specific defaults) including templating
+  - [ ] how-to for extracting input options from OpenRefine GUI (via metadata in open project)
+  - [ ] document known issues, e.g. [import xls, xlsx, ods](https://github.com/opencultureconsulting/openrefine-client/issues/4)
+- [ ] add Binder files and badge
+- [ ] add example notebooks (links to nbviewer and Binder)
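Steps 1 to 4 of "How to develop your own tasks" in the README above can be sketched as a small shell helper. This is only an illustration under assumptions: `new_project`, the project name `example-mine` and the port `3337` are hypothetical and not part of this commit. Steps 5 to 7 (adding commands, registering the project in the general Taskfile, tuning RAM) still have to be done by hand.

```sh
# Sketch only: scaffold a new project folder from an existing example.
# new_project and its arguments are illustrative names, not part of the repository.
new_project() {
  template="$1"; name="$2"; port="$3"
  if [ ! -f "$template/Taskfile.yml" ]; then
    echo "no Taskfile.yml in template: $template" >&2; return 1
  fi
  if [ -e "$name" ]; then
    echo "target already exists: $name" >&2; return 1
  fi
  # steps 1, 3 and 4: new folder with input and config subdirectories
  mkdir -p "$name/input" "$name/config"
  # step 2: copy the example Taskfile and give the project its own port
  sed "s/^\( *PORT: \)[0-9]*/\1$port/" "$template/Taskfile.yml" > "$name/Taskfile.yml"
}
```

Run from the repository root, a call could look like `new_project example-duplicates example-mine 3337`.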
--- a/Taskfile.yml
+++ b/Taskfile.yml
@@ -0,0 +1,84 @@
+# https://github.com/opencultureconsulting/openrefine-tasks
+
+version: '3'
+
+includes:
+  example-doaj:
+    taskfile: example-doaj
+    dir: example-doaj
+  example-duplicates:
+    taskfile: example-duplicates
+    dir: example-duplicates
+  example-powerhouse:
+    taskfile: example-powerhouse
+    dir: example-powerhouse
+  # add your project here
+
+silent: true
+output: prefixed
+
+tasks:
+  default:
+    desc: execute all projects in parallel
+    deps:
+      - task: example-doaj:refine
+      - task: example-duplicates:refine
+      - task: example-powerhouse:refine
+      # add your project here
+    cmds:
+      - task: check
+
+  install:
+    desc: (re)install OpenRefine and openrefine-client into subdirectory openrefine
+    cmds:
+      - | # delete existing install and recreate folder
+        rm -rf openrefine; mkdir -p openrefine
+      - | # install OpenRefine into subdirectory openrefine
+        wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
+        tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
+      - sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "openrefine/refine" # fix path issue in OpenRefine startup file
+      - sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini" # do not try to open OpenRefine in browser
+      - sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini" # set autosave period from 5 minutes to 24 hours (1440 minutes)
+      - | # install openrefine-client into subdirectory openrefine
+        wget --no-verbose -O openrefine/client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
+        chmod +x openrefine/client
+
+  start:
+    dir: ./{{.PROJECT}}/output
+    cmds:
+      - | # check install and delete any temporary OpenRefine files
+        if [ ! -f "../../openrefine/refine" ]; then
+          echo 1>&2 "OpenRefine missing; try task install"; exit 1
+        fi
+        rm -rf ./*.project* workspace.json
+      - | # launch OpenRefine with specific data directory and redirect its output to a log file
+        ../../openrefine/refine -v warn -p {{.PORT}} -m {{.RAM}} \
+          -d ../{{.PROJECT}}/output \
+          > openrefine.log 2>&1 & 
+      - | # wait until OpenRefine API is available
+        timeout 30s bash -c "until
+          wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
+          do sleep 1
+        done"
+
+  stop:
+    dir: ./{{.PROJECT}}/output
+    cmds:
+      - | # shut down OpenRefine
+        PID=$(lsof -t -i:{{.PORT}})
+        kill $PID
+        while ps -p $PID > /dev/null; do sleep 1; done
+      - | # archive the OpenRefine project
+        tar cfz \
+          {{.PROJECT}}.openrefine.tar.gz \
+          -C $(grep -l {{.PROJECT}} *.project/metadata.json | cut -d '/' -f 1) \
+          .
+
+  check:
+    desc: check OpenRefine log for any warnings and exit on error
+    dir: ./{{.PROJECT}}
+    cmds:
+      - | # find log file(s) and check for "exception" or "error"
+        if grep -i 'exception\|error' $(find . -name openrefine.log); then
+          echo 1>&2 "log contains warnings!"; exit 1
+        fi
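The `check` task above greps every `openrefine.log` below the project directory for "exception" or "error". If `find` matches no file, the `$(find …)` substitution expands to nothing and `grep` would read standard input instead. A possible standalone variant that avoids this, assuming GNU find and grep (`check_logs` is an illustrative name, not part of the Taskfile):

```sh
# Sketch only: a variant of the log check that cannot block on stdin.
# check_logs is an illustrative name, not part of the Taskfile.
check_logs() {
  dir="${1:-.}"
  # "-exec ... {} +" runs grep only if find matched at least one log file;
  # "|| true" keeps a "no match" exit status from aborting under "set -e"
  matches=$(find "$dir" -name openrefine.log -exec grep -i -H -E 'exception|error' {} + || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    echo "log contains warnings!" >&2
    return 1
  fi
}
```

Called as `check_logs example-duplicates`, it returns a non-zero status only when a log line matches.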
--- a/example-doaj/Taskfile.yml
+++ b/example-doaj/Taskfile.yml
@@ -0,0 +1,77 @@
+version: '3'
+
+tasks:
+  main:
+    desc: Library Carpentry Lesson covering DOAJ
+    cmds:
+      - task: refine
+      - task: :check # check OpenRefine log for any warnings and exit on error
+        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
+
+  refine:
+    vars:
+      PORT: 3335 # assign a different port for each project
+      RAM: 2048M # maximum RAM for OpenRefine java heap space
+      PROJECT: '{{splitList ":" .TASK | first}}'
+    deps: # will be executed each run independent of up-to-date check
+      - task: download
+    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
+      - task: :start
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
+      - task: import
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: apply
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: export
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: stats
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: :stop
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+    sources:
+      - input/**
+      - config/**
+    generates:
+      - output/openrefine.log
+      - output/{{.PROJECT}}.openrefine.tar.gz
+    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
+
+  download:
+    cmds:
+      - mkdir -p input config
+      - wget --no-verbose -O input/doaj-article-sample.csv https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
+      - wget --no-verbose -O config/doaj-openrefine.json https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json
+
+  import:
+    dir: input
+    cmds:
+      - | # import file
+        ../../openrefine/client -P {{.PORT}} \
+        --create doaj-article-sample.csv \
+        --projectName {{.PROJECT}}
+    ignore_error: true # workaround
+
+  apply:
+    dir: config
+    cmds:
+      - | # apply transformation rules
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --apply doaj-openrefine.json
+    ignore_error: true # workaround
+
+  export:
+    dir: output
+    cmds:
+      - | # export to file; use readlink to log full path to output file
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --output "$(readlink -m doaj-results.tsv)"
+    ignore_error: true # workaround
+
+  stats:
+    cmds:
+      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
+    ignore_error: true # workaround
+
+  default: # enable standalone execution (running `task` in project directory)
+    cmds:
+      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
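The project Taskfile above derives `PROJECT` from the task name with `{{splitList ":" .TASK | first}}`, and its `default` task builds the qualified task name from the current directory with `${PWD##*/}`. The same string handling in plain shell, as a sketch with made-up example values:

```sh
# Sketch only: the name handling used by the project Taskfiles, in plain POSIX shell.
# The task name and directory below are example values.
task_name="example-doaj:refine"
project="${task_name%%:*}"            # drop everything from the first ":" on
echo "$project"                       # example-doaj

project_dir="/home/user/openrefine-task-runner/example-doaj"
qualified="${project_dir##*/}:main"   # keep the last path component, append ":main"
echo "$qualified"                     # example-doaj:main
```

This is why the folder name, the namespace under `includes` and the OpenRefine project name all have to match.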
--- a/example-duplicates/Taskfile.yml
+++ b/example-duplicates/Taskfile.yml
@@ -0,0 +1,70 @@
+version: '3'
+
+tasks:
+  main:
+    desc: Removing duplicates in a very small test dataset
+    cmds:
+      - task: refine
+      - task: :check # check OpenRefine log for any warnings and exit on error
+        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
+
+  refine:
+    vars:
+      PORT: 3334 # assign a different port for each project
+      RAM: 2048M # maximum RAM for OpenRefine java heap space
+      PROJECT: '{{splitList ":" .TASK | first}}'
+    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
+      - task: :start
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
+      - task: import
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: apply
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: export
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: stats
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: :stop
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+    sources:
+      - input/**
+      - config/**
+    generates:
+      - output/openrefine.log
+      - output/{{.PROJECT}}.openrefine.tar.gz
+    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
+
+  import:
+    dir: input
+    cmds:
+      - | # import file
+        ../../openrefine/client -P {{.PORT}} \
+        --create duplicates.csv \
+        --encoding UTF-8 \
+        --projectName {{.PROJECT}}
+    ignore_error: true # workaround
+
+  apply:
+    dir: config
+    cmds:
+      - | # apply transformation rules
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --apply duplicates-deletion.json
+    ignore_error: true # workaround
+
+  export:
+    dir: output
+    cmds:
+      - | # export to file; use readlink to log full path to output file
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --output "$(readlink -m deduped.xls)"
+    ignore_error: true # workaround
+
+  stats:
+    cmds:
+      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
+    ignore_error: true # workaround
+
+  default: # enable standalone execution (running `task` in project directory)
+    cmds:
+      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
--- a/example-duplicates/config/duplicates-deletion.json
+++ b/example-duplicates/config/duplicates-deletion.json
@@ -0,0 +1,69 @@
+[
+  {
+    "op": "core/row-reorder",
+    "description": "Reorder rows",
+    "mode": "record-based",
+    "sorting": {
+      "criteria": [
+        {
+          "errorPosition": 1,
+          "caseSensitive": false,
+          "valueType": "string",
+          "column": "email",
+          "blankPosition": 2,
+          "reverse": false
+        }
+      ]
+    }
+  },
+  {
+    "op": "core/column-addition",
+    "description": "Create column count at index 1 based on column email using expression grel:facetCount(value, \"value\", \"email\")",
+    "engineConfig": {
+      "mode": "row-based",
+      "facets": []
+    },
+    "newColumnName": "count",
+    "columnInsertIndex": 1,
+    "baseColumnName": "email",
+    "expression": "grel:facetCount(value, \"value\", \"email\")",
+    "onError": "set-to-blank"
+  },
+  {
+    "op": "core/blank-down",
+    "description": "Blank down cells in column email",
+    "engineConfig": {
+      "mode": "row-based",
+      "facets": []
+    },
+    "columnName": "email"
+  },
+  {
+    "op": "core/row-removal",
+    "description": "Remove rows",
+    "engineConfig": {
+      "mode": "row-based",
+      "facets": [
+        {
+          "omitError": false,
+          "expression": "isBlank(value)",
+          "selectBlank": false,
+          "selection": [
+            {
+              "v": {
+                "v": true,
+                "l": "true"
+              }
+            }
+          ],
+          "selectError": false,
+          "invert": false,
+          "name": "email",
+          "omitBlank": false,
+          "type": "list",
+          "columnName": "email"
+        }
+      ]
+    }
+  }
+]
--- a/example-duplicates/input/duplicates.csv
+++ b/example-duplicates/input/duplicates.csv
@@ -0,0 +1,11 @@
+email,name,state,gender,purchase
+danny.baron@example1.com,Danny Baron,CA,M,TV
+melanie.white@example2.edu,Melanie White,NC,F,iPhone
+danny.baron@example1.com,D. Baron,CA,M,Winter jacket
+ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight
+arthur.duff@example4.com,Arthur Duff,OR,M,Dining table
+danny.baron@example1.com,Daniel Baron,CA,M,Bike
+jean.griffith@example5.org,Jean Griffith,WA,F,Power drill
+melanie.white@example2.edu,Melanie White,NC,F,iPad
+ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier
+arthur.duff@example4.com,Arthur Duff,OR,M,Night table
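Taken together, the operations in `duplicates-deletion.json` sort the rows by `email`, add a `count` column, blank repeated `email` values and remove the blanked rows, so one row per address should remain. A rough shell approximation for comparison, assuming a CSV without quoted or embedded commas such as `duplicates.csv` above. It skips the `count` column and compares case-sensitively (unlike the sort criterion in the JSON), and `dedupe_by_first_column` is an illustrative name:

```sh
# Sketch only: approximate the OpenRefine dedup steps with sort and awk.
dedupe_by_first_column() {
  # pass the header line through unchanged
  IFS= read -r header && printf '%s\n' "$header"
  # stable sort on column 1, then keep only the first row for each key
  LC_ALL=C sort -s -t, -k1,1 | awk -F, '!seen[$1]++'
}
```

For example, `dedupe_by_first_column < example-duplicates/input/duplicates.csv` should leave the header plus six rows for the sample data, which can serve as a quick plausibility check of the exported file.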
--- a/example-powerhouse/Taskfile.yml
+++ b/example-powerhouse/Taskfile.yml
@@ -0,0 +1,79 @@
+version: '3'
+
+tasks:
+  main:
+    desc: Powerhouse Museum Tutorial
+    cmds:
+      - task: refine
+      - task: :check # check OpenRefine log for any warnings and exit on error
+        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
+
+  refine:
+    vars:
+      PORT: 3336 # assign a different port for each project
+      RAM: 2048M # maximum RAM for OpenRefine java heap space
+      PROJECT: '{{splitList ":" .TASK | first}}'
+    deps: # will be executed each run independent of up-to-date check
+      - task: download
+    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
+      - task: :start
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
+      - task: import
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: apply
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: export
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: stats
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+      - task: :stop
+        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
+    sources:
+      - input/**
+      - config/**
+    generates:
+      - output/openrefine.log
+      - output/{{.PROJECT}}.openrefine.tar.gz
+    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
+
+  download:
+    cmds:
+      - mkdir -p input config
+      - wget --no-verbose -O input/phm-collection.tsv https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
+      - wget --no-verbose -O config/phm-transform.json https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json
+
+  import:
+    dir: input
+    cmds:
+      - | # import file
+        ../../openrefine/client -P {{.PORT}} \
+        --create phm-collection.tsv \
+        --processQuotes false \
+        --guessCellValueTypes true \
+        --projectName {{.PROJECT}}
+    ignore_error: true # workaround
+
+  apply:
+    dir: config
+    cmds:
+      - | # apply transformation rules
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --apply phm-transform.json
+    ignore_error: true # workaround
+
+  export:
+    dir: output
+    cmds:
+      - | # export to file; use readlink to log full path to output file
+        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
+        --output "$(readlink -m phm-results.tsv)"
+    ignore_error: true # workaround
+
+  stats:
+    cmds:
+      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
+    ignore_error: true # workaround
+
+  default: # enable standalone execution (running `task` in project directory)
+    cmds:
+      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
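The three example Taskfiles use ports 3334, 3335 and 3336, following the comment "assign a different port for each project". A small sketch for spotting accidental reuse once more projects are added, assuming each project keeps its `PORT:` line in `<project>/Taskfile.yml` (`find_duplicate_ports` is a hypothetical helper, not part of this commit):

```sh
# Sketch only: list ports assigned in more than one project Taskfile.
find_duplicate_ports() {
  root="${1:-.}"
  # collect the PORT lines, reduce them to the bare number, print repeated numbers
  grep -h -E '^ *PORT: *[0-9]+' "$root"/*/Taskfile.yml \
    | sed -E 's/^ *PORT: *([0-9]+).*/\1/' \
    | sort | uniq -d
}
```

Run from the repository root, `find_duplicate_ports .` prints nothing when every project has its own port.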