minimize task install

fix binder link
run tasks individually
13 changed files with 142 additions and 335 deletions
--- a/.github/workflows/default.yml
+++ b/.github/workflows/default.yml
@@ -0,0 +1,33 @@
+name: default
+
+on:
+  workflow_dispatch: # allows you to run this workflow manually from the Actions tab
+
+jobs:
+  main:
+    runs-on: ubuntu-20.04
+    steps:
+    - uses: actions/checkout@v2
+    - name: install go-task 3.10.0
+      run: |
+        wget --no-verbose -O task.tar.gz https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
+        sudo tar -xzf task.tar.gz -C /usr/local/bin task && rm task.tar.gz
+    - name: install OpenRefine and openrefine-client
+      run: task install
+    - name: start OpenRefine
+      run: task start
+    - name: run workflow
+      run: task example
+    - name: print stats and check log file
+      run: task check
+    - uses: actions/upload-artifact@v2
+      if: always()
+      with:
+        name: OpenRefine project and logfile
+        path: .openrefine/tmp
+        retention-days: 7
+    - name: git commit and push
+      run: |
+        git config user.name "Automated"
+        git config user.email "actions@users.noreply.github.com"
+        task git
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,2 @@
 .task
 .openrefine
-*/output
-*/*.log
-*/*.openrefine.tar.gz
-example-doaj/input
-example-doaj/config
-example-powerhouse/input
-example-powerhouse/config
--- a/README.md
+++ b/README.md
@@ -1,23 +1,29 @@
 # OpenRefine Task Runner (💎+🤖)

-[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)

-Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api). 
+Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
+
+The workflow is defined in [Taskfile.yml](Taskfile.yml) and can be executed either locally (`task default`) or with [GitHub Actions](.github/workflows/default.yml).

 ## Features

-* run tasks in parallel
 * basic error handling by monitoring the OpenRefine server log
-* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
+* dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
 * prevent unnecessary work by fingerprinting generated files and their sources
 * the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
  * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
  * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
  * export to CSV, TSV, HTML, XLS, XLSX, ODS
  * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
-  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
+  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
 * tasks are easy to extend with additional commands (e.g. to download input data or validate results)

+## Requirements
+
+* GNU/Linux (tested with Fedora 34)
+* JAVA 8+ (for OpenRefine)
+
 ## Typical workflow

 **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. dedup.json).
@@ -26,30 +32,24 @@ Templates for OpenRefine batch processing (import, transform, export) using the

 **Possible automation benefits:**

-* When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
+* When you receive updated data (in the same structure), you just need to drop the input file and start the task like this:

  ```sh
-  task example-doaj
+  task
  ```

 * The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.

 * Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.

-## Requirements
-
-* GNU/Linux (tested with Fedora 32)
-* JAVA 8+ (for OpenRefine)
-
 ## Demo via binder

-[![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)

 - free to use on-demand server with Jupyterlab and Bash Kernel
 - OpenRefine, openrefine-client and go-task [preinstalled](binder/postBuild)
 - no registration needed, will start within a few minutes
- [restricted](https://notebooks.gesis.org/faq/) to 4 GB RAM and server will be deleted after 10 minutes of inactivity
- service is provided by GESIS and is intended for use by social scientists
+- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity

 ## Install

@@ -60,23 +60,23 @@ Templates for OpenRefine batch processing (import, transform, export) using the
    cd openrefine-task-runner
    ```

-2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
+2. Install [Task 3.10.0](https://github.com/go-task/task/releases/tag/v3.10.0)+

    a) RPM-based (Fedora, CentOS, SLES, etc.)

    ```sh
-    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
+    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
    ```

    b) DEB-based (Debian, Ubuntu etc.)

    ```sh
-    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
+    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
    ```

-3. Run install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
+3. Run install task to download [OpenRefine 3.5.2](https://github.com/OpenRefine/OpenRefine/releases/tag/3.5.2) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)

   ```sh
   task install
@@ -84,34 +84,28 @@ Templates for OpenRefine batch processing (import, transform, export) using the

 ## Usage

-* Run all tasks in parallel
+* Run workflow

    ```sh
-    task
+    task default
    ```

-* Run a specific task
+* Override settings with environment variables

    ```sh
-    task example-duplicates:main
-    ```
-
-* Run some tasks in parallel
-
-    ```sh
-    task --parallel example-duplicates:main example-doaj:main
+    OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
    ```

 * Force run a task even when the task is up-to-date

    ```sh
-    task example-duplicates:main --force
+    task default --force
    ```

 * Dry-run in verbose mode for debugging

    ```sh
-    task example-duplicates:main --dry --verbose --force
+    task default --dry --verbose --force
    ```

 * List available tasks
@@ -120,17 +114,9 @@ Templates for OpenRefine batch processing (import, transform, export) using the
    task --list
    ```

-### How to develop your own tasks
+### Examples

-(first draft, will be elaborated later)
-
-1. create a new folder
-2. copy an example Taskfile.yml
-3. provide input data in subdirectory input
-4. provide OpenRefine transformation history files in subdirectory config
-5. add commands to specific Taskfile (check openrefine-client help screen for available options: `openrefine/client --help`)
-6. add project to general Taskfile
-7. check memory load and increase RAM if needed
+* [noah-biejournals](https://github.com/opencultureconsulting/noah-biejournals): harvesting the BieJournals journal server of Bielefeld University Library and transforming the data into METS/MODS for the noah.nrw portal

 ### Getting help

--- a/Taskfile.yml
+++ b/Taskfile.yml
@@ -1,102 +1,89 @@
-# https://github.com/opencultureconsulting/openrefine-task-runner
-
 version: '3'

-includes:
-  example-doaj: example-doaj
-  example-duplicates: example-duplicates
-  example-powerhouse: example-powerhouse
-  # add the directory name of your project here
-
 silent: true
-output: prefixed

 env:
-  OPENREFINE:
-    sh: readlink -m .openrefine/refine
-  CLIENT:
-    sh: readlink -m .openrefine/client
+  OPENREFINE_MEMORY: 5120M
+  OPENREFINE_PORT: 3333
+  OPENREFINE_APPDIR:
+    sh: readlink -m .openrefine
+  OPENREFINE_TMPDIR:
+    sh: mkdir -p .openrefine/tmp; readlink -m .openrefine/tmp

 tasks:
  default:
-    desc: execute all projects in parallel
-    deps:
-      - task: example-doaj:refine
-      - task: example-duplicates:refine
-      - task: example-powerhouse:refine
-      # add the directory name of your project here
+    desc: run workflow in batch mode
    cmds:
+      - defer: { task: cleanup } # will always be executed last
+      - task: start
+      - task: example
      - task: check
-
-  install:
-    desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
-    cmds:
-      - | # delete existing install and recreate folder
-        rm -rf .openrefine
-        mkdir -p .openrefine
-      - > # download OpenRefine archive
-        wget --no-verbose -O openrefine.tar.gz
-        https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
-      - | # install OpenRefine into subdirectory .openrefine
-        tar -xzf openrefine.tar.gz -C .openrefine --strip 1
-        rm openrefine.tar.gz
-      - | # optimize OpenRefine for batch processing
-        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
-        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
-        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
-      - > # download openrefine-client into subdirectory .openrefine
-        wget --no-verbose -O .openrefine/client
-        https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
-      - chmod +x .openrefine/client # make client executable
+    sources:
+      - Taskfile.yml
+      - input/**
+      - config/**
+    generates:
+      - output/**
+    preconditions:
+      - sh: test -f "${OPENREFINE_APPDIR}/refine"
+        msg: "OpenRefine missing; try task install"

  start:
-    dir: ./{{.DIR}}
-    cmds:
-      - | # verify that OpenRefine is installed
-        if [ ! -f "$OPENREFINE" ]; then
-          echo 1>&2 "OpenRefine missing; try task install"; exit 1
-        fi
-      - | # delete temporary files and log file of previous run
-        rm -rf ./*.project* workspace.json
-        rm -rf "{{.PROJECT}}.log"
-      - > # launch OpenRefine with specific data directory and redirect its output to a log file
-        "$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
-        -d ../{{.DIR}}
-        >> "{{.PROJECT}}.log" 2>&1 &
+      - echo "start OpenRefine with max. $OPENREFINE_MEMORY on port $OPENREFINE_PORT..."
+      - | # launch OpenRefine with specific data directory and redirect its output to a log file
+        "${OPENREFINE_APPDIR}/refine" -v warn -p "$OPENREFINE_PORT" -m "$OPENREFINE_MEMORY" -d "${OPENREFINE_TMPDIR}" > "${OPENREFINE_TMPDIR}/log.txt" 2>&1 &
      - | # wait until OpenRefine API is available
-        timeout 30s bash -c "until
-          wget -q -O - -o /dev/null http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
-          do sleep 1
-        done"
+        timeout 30s bash -c "until wget -q -O - -o /dev/null http://localhost:${OPENREFINE_PORT} | cat | grep -q -o OpenRefine; do sleep 1; done"

-  stop:
-    dir: ./{{.DIR}}
-    cmds:
-      - | # shut down OpenRefine gracefully
-        PID=$(lsof -t -i:{{.PORT}})
-        kill $PID
-        while ps -p $PID > /dev/null; do sleep 1; done
-      - > # archive the OpenRefine project
-        tar cfz
-        "{{.PROJECT}}.openrefine.tar.gz"
-        -C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
-        .
-      - rm -rf ./*.project* workspace.json # delete temporary files
-
-  kill:
-    dir: ./{{.DIR}}
-    cmds:
-      - | # shut down OpenRefine immediately to save time and disk space
-        PID=$(lsof -t -i:{{.PORT}})
-        kill -9 $PID
-        while ps -p $PID > /dev/null; do sleep 1; done
-      - rm -rf ./*.project* workspace.json # delete temporary files
+  example:
+      - | # import (requires absolute path)
+        "${OPENREFINE_APPDIR}/client" \
+        --create "$(readlink -m input/duplicates.csv)" \
+        --projectName example
+      - | # apply undo/redo history
+        for f in config/*.json; do
+          "${OPENREFINE_APPDIR}/client" example --apply "$f"
+        done
+      - | # export to TSV
+        mkdir -p output
+        "${OPENREFINE_APPDIR}/client" example \
+        --output output/deduped.tsv

  check:
-    desc: check OpenRefine log for any warnings and exit on error
-    dir: ./{{.DIR}}
+     - | # print stats
+        PID="$(lsof -t -i:${OPENREFINE_PORT})"
+        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM"
+        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time"
+     - | # check log file for any warnings
+       if grep -i 'exception\|error' "${OPENREFINE_TMPDIR}/log.txt"
+         then echo 1>&2 "log contains warnings!"; echo; cat "${OPENREFINE_TMPDIR}/log.txt"; exit 1
+       fi
+
+  cleanup:
+      - | # kill OpenRefine immediately
+        PID="$(lsof -t -i:${OPENREFINE_PORT})"
+        kill -9 $PID
+      - | # delete temporary files
+        rm -rf "${OPENREFINE_TMPDIR}"
+
+  install:
+    desc: install OpenRefine and openrefine-client into subdirectory ${OPENREFINE_APPDIR}
    cmds:
-      - | # find log file(s) and check for "exception" or "error"
-        if grep -i 'exception\|error' $(find . -name '*.log'); then
-          echo 1>&2 "log contains warnings!"; exit 1
-        fi
+      - mkdir -p "${OPENREFINE_APPDIR}"
+      - | # install OpenRefine into subdirectory ${OPENREFINE_APPDIR}
+        wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.5.2/openrefine-linux-3.5.2.tar.gz
+        tar -xzf openrefine.tar.gz -C "${OPENREFINE_APPDIR}" --strip 1 && rm openrefine.tar.gz
+      - | # optimize OpenRefine for batch processing
+        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "${OPENREFINE_APPDIR}/refine" # fix path issue in OpenRefine startup file
+        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "${OPENREFINE_APPDIR}/refine.ini" # do not try to open OpenRefine in browser
+        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "${OPENREFINE_APPDIR}/refine.ini" # set autosave period from 5 minutes to 24 hours
+      - | # install openrefine-client into subdirectory ${OPENREFINE_APPDIR}
+        wget --no-verbose -O "${OPENREFINE_APPDIR}/client" https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
+        chmod +x "${OPENREFINE_APPDIR}/client"
+
+  git:
+    desc: commit and push if something changed
+    cmds:
+      - git add -A
+      - git commit -m "latest change $(date -u)" || exit 0
+      - git push
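
The new `check` task greps the OpenRefine log for `exception` or `error` and exits with an error if either appears. The same idea can be sketched as a standalone shell function, which may be handy for trying the check against a log file without starting Task. The function name `check_log` and the return code 2 for an unreadable file are assumptions made for this sketch and are not part of the Taskfile; unlike the task, the sketch writes the log to stderr rather than stdout.

```sh
# Sketch of the log check from the `check` task as a reusable function.
# `check_log` is a hypothetical helper name, not something the Taskfile defines.
#
# Returns 0 if the log is clean, 1 if it mentions "exception" or "error"
# (in any letter case), and 2 if the file cannot be read.
check_log() {
  logfile="$1"
  if [ ! -r "$logfile" ]; then
    echo "cannot read log file: $logfile" >&2
    return 2
  fi
  if grep -qiE 'exception|error' "$logfile"; then
    echo "log contains warnings!" >&2
    cat "$logfile" >&2
    return 1
  fi
  return 0
}
```

A call such as `check_log .openrefine/tmp/log.txt || exit 1` would mirror what the task does with `${OPENREFINE_TMPDIR}/log.txt`.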
--- a/binder/postBuild
+++ b/binder/postBuild
@@ -5,8 +5,8 @@ set -e
 python -m bash_kernel.install

 # Install go-task https://github.com/go-task/task
-wget -q https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.tar.gz
-tar -xzf task_linux_amd64.tar.gz
+wget -q https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
+tar -xzf task_linux_amd64.tar.gz task
 rm task_linux_amd64.tar.gz
 mkdir -p $HOME/.local/bin
 mv task $HOME/.local/bin/
--- a/binder/requirements.txt
+++ b/binder/requirements.txt
@@ -1,2 +1,2 @@
-jupyter-server-proxy==1.5.3
+jupyter-server-proxy==3.2.1
 bash_kernel==0.7.2
--- a/example-duplicates/config/duplicates-deletion.json
+++ b/example-duplicates/config/duplicates-deletion.json
--- a/demo.ipynb
+++ b/demo.ipynb
@@ -1 +0,0 @@
-{"cells":[{"metadata":{},"cell_type":"markdown","source":"## Run all tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run a specific task"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run some tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --parallel example-duplicates:main example-doaj:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Force run a task even when the task is up-to-date"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Dry-run in verbose mode for debugging"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --dry --verbose --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## List available tasks"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --list","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"},"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"}},"nbformat":4,"nbformat_minor":5}
--- a/example-doaj/Taskfile.yml
+++ b/example-doaj/Taskfile.yml
@@ -1,70 +0,0 @@
-version: '3'
-
-tasks:
-  main:
-    desc: Library Carpentry Lesson covering DOAJ
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
-    cmds:
-      - task: refine
-      - task: :check # check OpenRefine log for any warnings and exit on error
-        vars: {DIR: '{{.DIR}}'}
-
-  refine:
-    dir: ./{{.DIR}}
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}'
-      PROJECT: doaj
-      PORT: 3334 # assign a different port for each project
-      RAM: 2048M # maximum RAM for OpenRefine java heap space
-      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
-    deps:
-      - task: download # will be executed each run independent of up-to-date check
-    cmds:
-      - task: :start # launch OpenRefine
-        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
-      - > # import file
-        "$CLIENT" -P {{.PORT}}
-        --create "$(readlink -m input/doaj-article-sample.csv)"
-        --projectName "{{.PROJECT}}"
-        > {{.LOG}}
-      - > # apply transformation rules
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --apply config/doaj-openrefine.json
-        > {{.LOG}}
-      - mkdir -p output
-      - > # export to file
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --output "$(readlink -m output/doaj-results.tsv)"
-        > {{.LOG}}
-      - | # print allocated system resources
-        PID="$(lsof -t -i:{{.PORT}})"
-        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
-        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
-      - task: :stop # shut down OpenRefine and archive the OpenRefine project
-        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
-    sources:
-      - Taskfile.yml
-      - input/**
-      - config/**
-    generates:
-      - ./{{.PROJECT}}.openrefine.tar.gz
-      - output/**
-    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
-
-  download:
-    dir: ./{{.DIR}}
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}'
-    cmds:
-      - mkdir -p input config
-      - > # Download input
-        wget --no-verbose -O input/doaj-article-sample.csv
-        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
-      - > # Download config
-        wget --no-verbose -O config/doaj-openrefine.json
-        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json
-
-  default: # enable standalone execution (running `task` in project directory)
-    cmds:
-      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-duplicates/Taskfile.yml
+++ b/example-duplicates/Taskfile.yml
@@ -1,56 +0,0 @@
-version: '3'
-
-tasks:
-  main:
-    desc: Removing duplicates in a very small test dataset
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
-    cmds:
-      - task: refine
-      - task: :check # check OpenRefine log for any warnings and exit on error
-        vars: {DIR: '{{.DIR}}'}
-
-  refine:
-    dir: ./{{.DIR}}
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}'
-      PROJECT: duplicates
-      PORT: 3335 # assign a different port for each project
-      RAM: 2048M # maximum RAM for OpenRefine java heap space
-      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
-    cmds:
-      - task: :start # launch OpenRefine
-        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
-      - > # import file
-        "$CLIENT" -P {{.PORT}}
-        --create "$(readlink -m input/duplicates.csv)"
-        --encoding UTF-8
-        --projectName "{{.PROJECT}}"
-        > {{.LOG}}
-      - > # apply transformation rules
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --apply config/duplicates-deletion.json
-        > {{.LOG}}
-      - mkdir -p output
-      - > # export to file
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --output "$(readlink -m output/deduped.xls)"
-        > {{.LOG}}
-      - | # print allocated system resources
-        PID="$(lsof -t -i:{{.PORT}})"
-        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
-        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
-      - task: :stop # shut down OpenRefine and archive the OpenRefine project
-        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
-    sources:
-      - Taskfile.yml
-      - input/**
-      - config/**
-    generates:
-      - ./{{.PROJECT}}.openrefine.tar.gz
-      - output/**
-    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
-
-  default: # enable standalone execution (running `task` in project directory)
-    cmds:
-      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-powerhouse/Taskfile.yml
+++ b/example-powerhouse/Taskfile.yml
@@ -1,72 +0,0 @@
-version: '3'
-
-tasks:
-  main:
-    desc: Powerhouse Museum Tutorial
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
-    cmds:
-      - task: refine
-      - task: :check # check OpenRefine log for any warnings and exit on error
-        vars: {DIR: '{{.DIR}}'}
-
-  refine:
-    dir: ./{{.DIR}}
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}'
-      PROJECT: phm
-      PORT: 3336 # assign a different port for each project
-      RAM: 2048M # maximum RAM for OpenRefine java heap space
-      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
-    deps:
-      - task: download # will be executed each run independent of up-to-date check
-    cmds:
-      - task: :start # launch OpenRefine
-        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
-      - > # import file
-        "$CLIENT" -P {{.PORT}}
-        --create "$(readlink -m input/phm-collection.tsv)"
-        --processQuotes false
-        --guessCellValueTypes true
-        --projectName "{{.PROJECT}}"
-        > {{.LOG}}
-      - > # apply transformation rules
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --apply config/phm-transform.json
-        > {{.LOG}}
-      - mkdir -p output
-      - > # export to file
-        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
-        --output "$(readlink -m output/phm-results.tsv)"
-        > {{.LOG}}
-      - | # print allocated system resources
-        PID="$(lsof -t -i:{{.PORT}})"
-        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
-        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
-      - task: :stop # shut down OpenRefine and archive the OpenRefine project
-        vars: {DIR: '{{.DIR}}', PORT: '{{.PORT}}', PROJECT: '{{.PROJECT}}'}
-    sources:
-      - Taskfile.yml
-      - input/**
-      - config/**
-    generates:
-      - ./{{.PROJECT}}.openrefine.tar.gz
-      - output/**
-    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
-
-  download:
-    dir: ./{{.DIR}}
-    vars:
-      DIR: '{{splitList ":" .TASK | first}}'
-    cmds:
-      - mkdir -p input config
-      - > # Download input
-        wget --no-verbose -O input/phm-collection.tsv
-        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
-      - > # Download config
-        wget --no-verbose -O config/phm-transform.json
-        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json
-
-  default: # enable standalone execution (running `task` in project directory)
-    cmds:
-      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-duplicates/input/duplicates.csv
+++ b/example-duplicates/input/duplicates.csv
--- a/output/deduped.tsv
+++ b/output/deduped.tsv
@@ -0,0 +1,7 @@
+email	count	name	state	gender	purchase
+arthur.duff@example4.com	2	Arthur Duff	OR	M	Dining table
+ben.morisson@example6.org	1	Ben Morisson	FL	M	Amplifier
+ben.tyler@example3.org	1	Ben Tyler	NV	M	Flashlight
+danny.baron@example1.com	3	Danny Baron	CA	M	TV
+jean.griffith@example5.org	1	Jean Griffith	WA	F	Power drill
+melanie.white@example2.edu	2	Melanie White	NC	F	iPhone
Author	SHA1	Message	Date
Felix Lohmeier	30ea93e3f3	minimize task install	2022-04-06 22:10:41 +02:00
Felix Lohmeier	f867050950	fix binder link	2022-04-06 22:01:03 +02:00
Felix Lohmeier	07b30f66c9	run tasks individually	2022-04-06 21:25:27 +02:00
Felix Lohmeier	0691b1f5e1	move stats to check task	2022-04-06 21:21:42 +02:00
Automated	5794f3cee0	latest change Wed Apr 6 19:15:21 UTC 2022	2022-04-06 19:15:21 +00:00
Felix Lohmeier	5ea4913f77	Revert "simulate failed gh actions run" This reverts commit `ed71aa005c`.	2022-04-06 21:14:32 +02:00
Felix Lohmeier	01dbe1d58f	fix yaml	2022-04-06 21:12:53 +02:00
Felix Lohmeier	d35b679f6d	deferred tasks don't fail	2022-04-06 21:11:08 +02:00
Automated	fceca60918	latest change Wed Apr 6 18:47:53 UTC 2022	2022-04-06 18:47:53 +00:00
Felix Lohmeier	ed71aa005c	simulate failed gh actions run	2022-04-06 20:47:07 +02:00
Felix Lohmeier	7812e5c8de	fix github actions workflow	2022-04-06 20:46:16 +02:00
Felix Lohmeier	c0facb81e0	simplify even more	2022-04-06 20:43:30 +02:00
Automated	cfb37d72e6	latest change Wed Apr 6 11:54:55 UTC 2022	2022-04-06 11:54:55 +00:00
Felix Lohmeier	8ee91ee84f	reduce complexity	2022-04-06 13:52:21 +02:00
Felix Lohmeier	1341e1b45c	update task in github action	2022-04-06 13:28:33 +02:00
Felix Lohmeier	12a1b1ab39	update OpenRefine to 3.5.2	2022-04-06 13:27:50 +02:00
Felix Lohmeier	847a622a10	update task to v3.10.0	2022-04-06 13:26:22 +02:00
Felix Lohmeier	72aabe685f	Merge pull request #6 from opencultureconsulting/dependabot/pip/binder/jupyter-server-proxy-3.2.1 ⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder	2022-01-28 17:19:51 +01:00
dependabot[bot]	88f9fe1e5f	⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder Bumps [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) from 1.5.3 to 3.2.1. - [Release notes](https://github.com/jupyterhub/jupyter-server-proxy/releases) - [Changelog](https://github.com/jupyterhub/jupyter-server-proxy/blob/main/CHANGELOG.md) - [Commits](https://github.com/jupyterhub/jupyter-server-proxy/compare/v1.5.3...v3.2.1) --- updated-dependencies: - dependency-name: jupyter-server-proxy dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2022-01-27 16:25:09 +00:00
Felix Lohmeier	7c199424c6	OpenRefine 3.5.0	2021-11-09 23:46:29 +01:00
Felix Lohmeier	21b05626e9	set retention days for GitHub Artifacts	2021-08-01 12:06:10 +02:00
Felix Lohmeier	bebd9d8b39	install go-task to /usr/local/bin	2021-07-14 23:14:45 +02:00
Felix Lohmeier	b3752aaf58	fix go-task install	2021-07-14 23:03:13 +02:00
Felix Lohmeier	b770ffcb3f	debug system path variable	2021-07-14 22:56:04 +02:00
Felix Lohmeier	2f0ef9feca	improve go-task install	2021-07-14 22:49:07 +02:00
Felix Lohmeier	78567e5f44	use github context	2021-07-14 22:32:59 +02:00
Felix Lohmeier	02af29fec1	Update and rename openrefine-task-runner.yml to all-tasks.yml	2021-07-14 22:23:53 +02:00
Felix Lohmeier	5d474b5dfa	fix calling task	2021-07-14 22:03:24 +02:00
Felix Lohmeier	965f82b2de	Create openrefine-task-runner.yml	2021-07-14 21:56:46 +02:00
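
The new `default` task registers `cleanup` with `defer` so that it always runs last, whether or not the workflow steps succeed. For readers more familiar with plain shell, a rough analogue is an `EXIT` trap inside a subshell. This is only a sketch under that assumption: the function name `run_with_cleanup` and its arguments are hypothetical, and it removes a working directory instead of killing OpenRefine.

```sh
# Plain-shell analogue of `defer: { task: cleanup }`.
# `run_with_cleanup` is a hypothetical name chosen for this sketch.
#
# Usage: run_with_cleanup WORKDIR COMMAND [ARGS...]
# Runs COMMAND, then removes WORKDIR even if COMMAND failed.
# The exit status of COMMAND is passed through to the caller.
run_with_cleanup() {
  workdir="$1"
  shift
  (
    # The EXIT trap fires when this subshell ends, on success or failure.
    trap 'rm -rf "$workdir"' EXIT
    "$@"
  )
}
```

For example, `run_with_cleanup "$(mktemp -d)" false` returns a non-zero status, yet the temporary directory is gone afterwards.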