🎉 first draft

2025-04-13 00:00:12 +02:00 · 2021-02-20 00:22:12 +01:00 · 2021-02-20 00:22:12 +01:00 · 62eb0cddbf
commit 62eb0cddbf
parent fb4b884706
8 changed files with 539 additions and 2 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,7 @@
 .task
 openrefine
 */output
 example-doaj/input
 example-doaj/config
 example-powerhouse/input
 example-powerhouse/config
--- a/README.md
+++ b/README.md
@@ -1,2 +1,142 @@
-# openrefine-tasks
+# OpenRefine Task Runner (💎+🤖)
-Templates for OpenRefine batch processing (import, transform, export) using a task runner and a Python client.
+
 Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api). 
 ## Features
 * run tasks in parallel
 * basic error handling by monitoring the OpenRefine server log
 * dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
 * prevent unnecessary work by fingerprinting generated files and their sources
 * the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
  * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
  * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
  * export to CSV, TSV, HTML, XLS, XLSX, ODS
  * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
 * tasks are easy to extend with additional commands (e.g. to download input data or validate results)
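The fingerprinting idea from the feature list can be illustrated with a minimal shell sketch. It is not how go-task implements its up-to-date check; the helper `needs_run` and the `.fingerprint` file are hypothetical names chosen for illustration, and the sketch assumes `sha256sum` from GNU coreutils.

```sh
# Hypothetical helper: succeeds (exit 0) if the files below input/ and config/
# differ from the fingerprint recorded by the previous call.
# Simplification: the fingerprint is recorded right away; a real runner
# would record it only after the work has finished successfully.
needs_run() {
  dir=$1
  # hash every source file, then hash the sorted list of hashes
  new=$(cd "$dir" && find input config -type f -exec sha256sum {} + | sort | sha256sum)
  old=$(cat "$dir/.fingerprint" 2>/dev/null || true)
  [ "$new" != "$old" ] || return 1
  printf '%s\n' "$new" > "$dir/.fingerprint"
}

# Demo in a throwaway directory
demo=$(mktemp -d)
mkdir -p "$demo/input" "$demo/config"
echo 'email,name' > "$demo/input/data.csv"
needs_run "$demo" && echo "first run: work needed"
needs_run "$demo" || echo "second run: up to date"
echo 'a@example.org,A' >> "$demo/input/data.csv"
needs_run "$demo" && echo "after change: work needed"
rm -rf "$demo"
```

In this repository the same effect comes from the `sources` and `generates` entries of each project's Taskfile, so no such helper is needed.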
 ## Typical workflow
 **Step 1**: Experiment with your data (or a sample of it) in the graphical user interface of OpenRefine. Once you are satisfied with the transformation rules, [extract the JSON code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. dedup.json).
 **Step 2**: Configure a task to automate importing your data set, applying the JSON file and exporting to the required output format.
 **Possible automation benefits:**
 * When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
  ```sh
  task example-doaj
  ```
 * The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
 * Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
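Before a history file from Step 1 is wired into a task, it can help to list the operations it contains. The following is a minimal sketch: the helper `list_ops` is a hypothetical name, the sample file is an abbreviated stand-in (two operations from `example-duplicates/config/duplicates-deletion.json` with most fields left out), and plain `grep` is used instead of a JSON parser, so it assumes each `"op"` key and its value sit on one line.

```sh
# Abbreviated sample history (illustration only, not a complete history file)
history=$(mktemp)
printf '%s\n' '[' \
  '  {"op": "core/row-reorder", "description": "Reorder rows"},' \
  '  {"op": "core/blank-down", "description": "Blank down cells in column email"}' \
  ']' > "$history"

# Hypothetical helper: print one operation name per line
list_ops() {
  grep -o '"op": *"[^"]*"' "$1" | cut -d '"' -f 4
}

list_ops "$history"
```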
 ## Requirements
 * GNU/Linux (tested with Fedora 32)
 * Java 8+ (for OpenRefine)
 ## Install
 1. Clone this git repository
    ```sh
    git clone https://github.com/opencultureconsulting/openrefine-task-runner.git
    cd openrefine-task-runner
    ```
 2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
    a) RPM-based (Fedora, CentOS, SLES, etc.)
    ```sh
    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
    ```
    b) DEB-based (Debian, Ubuntu, etc.)
    ```sh
    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
    ```
 3. Run the install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
   ```sh
   task install
   ```
 ## Usage
 * Run all tasks in parallel
    ```sh
    task
    ```
 * Run a specific task
    ```sh
    task example-duplicates:main
    ```
 * Run some tasks in parallel
    ```sh
    task --parallel example-duplicates:main example-doaj:main
    ```
 * Force run a task even when the task is up-to-date
    ```sh
    task example-duplicates:main --force
    ```
 * Dry-run in verbose mode for debugging
    ```sh
    task example-duplicates:main --dry --verbose --force
    ```
 * List available tasks
    ```sh
    task --list
    ```
 ### How to develop your own tasks
 (first draft, will be elaborated later)
 1. create a new folder
 2. copy an example Taskfile.yml
 3. provide input data in the subdirectory `input`
 4. provide OpenRefine transformation history files in the subdirectory `config`
 5. add commands to the project's Taskfile (check the openrefine-client help screen for available options: `openrefine/client --help`)
 6. add the project to the root Taskfile.yml (see the `# add your project here` comments)
 7. check memory load and increase RAM if needed
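Steps 1 to 4 can be sketched in shell. The project name `my-project`, the port 3337 and the stand-in Taskfile written to a temporary directory are assumptions for illustration; in a real checkout you would skip the stand-in and copy the existing `example-duplicates/Taskfile.yml`. Steps 5 to 7 still have to be done by hand.

```sh
# Stand-in for a checkout so the sketch runs anywhere (illustration only)
repo=$(mktemp -d)
mkdir -p "$repo/example-duplicates"
printf '%s\n' "version: '3'" "tasks:" "  refine:" "    vars:" "      PORT: 3334" \
  > "$repo/example-duplicates/Taskfile.yml"

new_project=my-project   # hypothetical project name
new_port=3337            # hypothetical port; must differ from the other projects

cd "$repo"
mkdir -p "$new_project/input" "$new_project/config"   # steps 1, 3 and 4 (folders only)
cp example-duplicates/Taskfile.yml "$new_project/"    # step 2
# give the new project its own port (GNU sed, as used elsewhere in this repository)
sed -i "s/PORT: 3334/PORT: $new_port/" "$new_project/Taskfile.yml"
grep 'PORT' "$new_project/Taskfile.yml"
```

The port is changed because each example project launches its own OpenRefine instance on a separate port (3334 to 3336 in the bundled examples).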
 ### Getting help
 Please file an [issue](https://github.com/opencultureconsulting/openrefine-task-runner/issues) if you are missing a feature or have found a bug. Questions are welcome, too!
 ## To do
 - [ ] Codacy badge (needs to be public)
 - [ ] add client log messages to openrefine.log (tee -a)
 - [ ] differentiate examples
  - [ ] example for loading multiple input files by providing a zip archive
  - [ ] example for downloading "fresh" input data as a dependent task and generating archives/diffs
  - [ ] example for applying multiple json files
  - [ ] example for templating xml and validation with xmllint
 - [ ] describe example datasets (and differences) with source code examples
 - [ ] elaborate how-to for developing tasks
  - [ ] document openrefine-client options and defaults (tables for input and output with file-format-specific defaults) including templating
  - [ ] how-to for extracting input options from OpenRefine GUI (via metadata in open project)
  - [ ] document known issues, e.g. [import xls, xlsx, ods](https://github.com/opencultureconsulting/openrefine-client/issues/4)
 - [ ] add Binder files and badge
 - [ ] add example notebooks (links to nbviewer and Binder)
--- a/Taskfile.yml
+++ b/Taskfile.yml
@@ -0,0 +1,84 @@
 # https://github.com/opencultureconsulting/openrefine-tasks
 version: '3'
 includes:
  example-doaj:
    taskfile: example-doaj
    dir: example-doaj
  example-duplicates:
    taskfile: example-duplicates
    dir: example-duplicates
  example-powerhouse:
    taskfile: example-powerhouse
    dir: example-powerhouse
  # add your project here
 silent: true
 output: prefixed
 tasks:
  default:
    desc: execute all projects in parallel
    deps:
      - task: example-doaj:refine
      - task: example-duplicates:refine
      - task: example-powerhouse:refine
      # add your project here
    cmds:
      - task: check
  install:
    desc: (re)install OpenRefine and openrefine-client into subdirectory openrefine
    cmds:
      - | # delete existing install and recreate folder
        rm -rf openrefine; mkdir -p openrefine
      - | # install OpenRefine into subdirectory openrefine
        wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
        tar -xzf openrefine.tar.gz -C openrefine --strip 1 && rm openrefine.tar.gz
      - sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "openrefine/refine" # fix path issue in OpenRefine startup file
      - sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "openrefine/refine.ini" # do not try to open OpenRefine in browser
      - sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "openrefine/refine.ini" # set autosave period from 5 minutes to 24 hours (1440 minutes)
      - | # install openrefine-client into subdirectory openrefine
        wget --no-verbose -O openrefine/client https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
        chmod +x openrefine/client
  start:
    dir: ./{{.PROJECT}}/output
    cmds:
      - | # check install and delete any temporary OpenRefine files
        if [ ! -f "../../openrefine/refine" ]; then
          echo 1>&2 "OpenRefine missing; try task install"; exit 1
        fi
        rm -rf ./*.project* workspace.json
      - | # launch OpenRefine with specific data directory and redirect its output to a log file
        ../../openrefine/refine -v warn -p {{.PORT}} -m {{.RAM}} \
          -d ../{{.PROJECT}}/output \
          > openrefine.log 2>&1 & 
      - | # wait until OpenRefine API is available
        timeout 30s bash -c "until
          wget -q -O - http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
          do sleep 1
        done"
  stop:
    dir: ./{{.PROJECT}}/output
    cmds:
      - | # shut down OpenRefine
        PID=$(lsof -t -i:{{.PORT}})
        kill $PID
        while ps -p $PID > /dev/null; do sleep 1; done
      - | # archive the OpenRefine project
        tar cfz \
          {{.PROJECT}}.openrefine.tar.gz \
          -C $(grep -l {{.PROJECT}} *.project/metadata.json | cut -d '/' -f 1) \
          .
  check:
    desc: check OpenRefine log for any warnings and exit on error
    dir: ./{{.PROJECT}}
    cmds:
      - | # find log file(s) and check for "exception" or "error"
        if grep -i 'exception\|error' $(find . -name openrefine.log); then
          echo 1>&2 "log contains warnings!"; exit 1
        fi
--- a/example-doaj/Taskfile.yml
+++ b/example-doaj/Taskfile.yml
@@ -0,0 +1,77 @@
 version: '3'
 tasks:
  main:
    desc: Library Carpentry Lesson covering DOAJ
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
  refine:
    vars:
      PORT: 3335 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    deps: # executed on every run, regardless of the up-to-date check
      - task: download
    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  download:
    cmds:
      - mkdir -p input config
      - wget --no-verbose -O input/doaj-article-sample.csv https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
      - wget --no-verbose -O config/doaj-openrefine.json https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json
  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
        --create doaj-article-sample.csv \
        --projectName {{.PROJECT}}
    ignore_error: true # workaround
  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --apply doaj-openrefine.json
    ignore_error: true # workaround
  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --output "$(readlink -m doaj-results.tsv)"
    ignore_error: true # workaround
  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
--- a/example-duplicates/Taskfile.yml
+++ b/example-duplicates/Taskfile.yml
@@ -0,0 +1,70 @@
 version: '3'
 tasks:
  main:
    desc: Removing duplicates in a very small test dataset
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
  refine:
    vars:
      PORT: 3334 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
        --create duplicates.csv \
        --encoding UTF-8 \
        --projectName {{.PROJECT}}
    ignore_error: true # workaround
  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --apply duplicates-deletion.json
    ignore_error: true # workaround
  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --output "$(readlink -m deduped.xls)"
    ignore_error: true # workaround
  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"
--- a/example-duplicates/config/duplicates-deletion.json
+++ b/example-duplicates/config/duplicates-deletion.json
@@ -0,0 +1,69 @@
 [
  {
    "op": "core/row-reorder",
    "description": "Reorder rows",
    "mode": "record-based",
    "sorting": {
      "criteria": [
        {
          "errorPosition": 1,
          "caseSensitive": false,
          "valueType": "string",
          "column": "email",
          "blankPosition": 2,
          "reverse": false
        }
      ]
    }
  },
  {
    "op": "core/column-addition",
    "description": "Create column count at index 1 based on column email using expression grel:facetCount(value, \"value\", \"email\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "count",
    "columnInsertIndex": 1,
    "baseColumnName": "email",
    "expression": "grel:facetCount(value, \"value\", \"email\")",
    "onError": "set-to-blank"
  },
  {
    "op": "core/blank-down",
    "description": "Blank down cells in column email",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "email"
  },
  {
    "op": "core/row-removal",
    "description": "Remove rows",
    "engineConfig": {
      "mode": "row-based",
      "facets": [
        {
          "omitError": false,
          "expression": "isBlank(value)",
          "selectBlank": false,
          "selection": [
            {
              "v": {
                "v": true,
                "l": "true"
              }
            }
          ],
          "selectError": false,
          "invert": false,
          "name": "email",
          "omitBlank": false,
          "type": "list",
          "columnName": "email"
        }
      ]
    }
  }
 ]
--- a/example-duplicates/input/duplicates.csv
+++ b/example-duplicates/input/duplicates.csv
@@ -0,0 +1,11 @@
 email,name,state,gender,purchase
 danny.baron@example1.com,Danny Baron,CA,M,TV
 melanie.white@example2.edu,Melanie White,NC,F,iPhone
 danny.baron@example1.com,D. Baron,CA,M,Winter jacket
 ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight
 arthur.duff@example4.com,Arthur Duff,OR,M,Dining table
 danny.baron@example1.com,Daniel Baron,CA,M,Bike
 jean.griffith@example5.org,Jean Griffith,WA,F,Power drill
 melanie.white@example2.edu,Melanie White,NC,F,iPad
 ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier
 arthur.duff@example4.com,Arthur Duff,OR,M,Night table
--- a/example-powerhouse/Taskfile.yml
+++ b/example-powerhouse/Taskfile.yml
@@ -0,0 +1,79 @@
 version: '3'
 tasks:
  main:
    desc: Powerhouse Museum Tutorial
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {PROJECT: '{{splitList ":" .TASK | first}}'} 
  refine:
    vars:
      PORT: 3336 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      PROJECT: '{{splitList ":" .TASK | first}}'
    deps: # executed on every run, regardless of the up-to-date check
      - task: download
    cmds: # tasks prefixed with ":" are defined in the root Taskfile.yml
      - task: :start
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - task: import
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: apply
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: export
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: stats
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
      - task: :stop
        vars: {PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - input/**
      - config/**
    generates:
      - output/openrefine.log
      - output/{{.PROJECT}}.openrefine.tar.gz
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  download:
    cmds:
      - mkdir -p input config
      - wget --no-verbose -O input/phm-collection.tsv https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
      - wget --no-verbose -O config/phm-transform.json https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json
  import:
    dir: input
    cmds:
      - | # import file
        ../../openrefine/client -P {{.PORT}} \
        --create phm-collection.tsv \
        --processQuotes false \
        --guessCellValueTypes true \
        --projectName {{.PROJECT}}
    ignore_error: true # workaround
  apply:
    dir: config
    cmds:
      - | # apply transformation rules
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --apply phm-transform.json
    ignore_error: true # workaround
  export:
    dir: output
    cmds:
      - | # export to file; use readlink to log full path to output file
        ../../openrefine/client -P {{.PORT}} {{.PROJECT}} \
        --output "$(readlink -m phm-results.tsv)"
    ignore_error: true # workaround
  stats:
    cmds:
      - ps -o start,etime,%mem,%cpu,rss -p $(lsof -t -i:{{.PORT}}) # print allocated system resources
    ignore_error: true # workaround
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - PROJECT="${PWD##*/}:main" && cd .. && task "$PROJECT"