minimize task install

fix binder link
run tasks individually
2022-04-06 22:10:41 +02:00 · 2022-04-06 22:01:03 +02:00 · 2022-04-06 21:25:27 +02:00 · 2022-04-06 21:21:42 +02:00 · 2022-04-06 19:15:21 +00:00 · 2022-04-06 21:14:32 +02:00
13 changed files with 142 additions and 335 deletions
--- a/.github/workflows/default.yml
+++ b/.github/workflows/default.yml
@@ -0,0 +1,33 @@
 name: default
 on:
  workflow_dispatch: # allows you to run this workflow manually from the Actions tab
 jobs:
  main:
    runs-on: ubuntu-20.04
    steps:
    - uses: actions/checkout@v2
    - name: install go-task 3.10.0
      run: |
        wget --no-verbose -O task.tar.gz https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
        sudo tar -xzf task.tar.gz -C /usr/local/bin task && rm task.tar.gz
    - name: install OpenRefine and openrefine-client
      run: task install
    - name: start OpenRefine
      run: task start
    - name: run workflow
      run: task example
    - name: print stats and check log file
      run: task check
    - uses: actions/upload-artifact@v2
      if: always()
      with:
        name: OpenRefine project and logfile
        path: .openrefine/tmp
        retention-days: 7
    - name: git commit and push
      run: |
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        task git
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,2 @@
 .task
 .openrefine
 */output
 */*.log
 */*.openrefine.tar.gz
 example-doaj/input
 example-doaj/config
 example-powerhouse/input
 example-powerhouse/config
--- a/README.md
+++ b/README.md
@@ -1,23 +1,29 @@
 # OpenRefine Task Runner (💎+🤖)
-[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
-Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api). 
+Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
 The workflow is defined in [Taskfile.yml](Taskfile.yml) and can be executed either locally (`task default`) or with [GitHub Actions](.github/workflows/default.yml).
 ## Features
 * run tasks in parallel
 * basic error handling by monitoring the OpenRefine server log
-* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
+* dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
 * prevent unnecessary work by fingerprinting generated files and their sources
 * the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
  * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
  * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
  * export to CSV, TSV, HTML, XLS, XLSX, ODS
  * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
-  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
+  * works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
 * tasks are easy to extend with additional commands (e.g. to download input data or validate results)
 ## Requirements
 * GNU/Linux (tested with Fedora 34)
 * JAVA 8+ (for OpenRefine)
 ## Typical workflow
 **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. dedup.json).
@@ -26,30 +32,24 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 **Possible automation benefits:**
-* When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
+* When you receive updated data (in the same structure), you just need to drop the input file and start the task like this:
  ```sh
-  task example-doaj
+  task
  ```
 * The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
 * Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
 ## Requirements
 * GNU/Linux (tested with Fedora 32)
 * JAVA 8+ (for OpenRefine)
 ## Demo via binder
-[![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
 - free to use on-demand server with Jupyterlab and Bash Kernel
 - OpenRefine, openrefine-client and go-task [preinstalled](binder/postBuild)
 - no registration needed, will start within a few minutes
- [restricted](https://notebooks.gesis.org/faq/) to 4 GB RAM and server will be deleted after 10 minutes of inactivity
+- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
 - service is provided by GESIS and is intended for use by social scientists
 ## Install
@@ -60,23 +60,23 @@ Templates for OpenRefine batch processing (import, transform, export) using the
    cd openrefine-task-runner
    ```
-2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
+2. Install [Task 3.10.0](https://github.com/go-task/task/releases/tag/v3.10.0)+
    a) RPM-based (Fedora, CentOS, SLES, etc.)
    ```sh
-    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
+    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
    sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
    ```
    b) DEB-based (Debian, Ubuntu etc.)
    ```sh
-    wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
+    wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
    sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
    ```
-3. Run install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
+3. Run install task to download [OpenRefine 3.5.2](https://github.com/OpenRefine/OpenRefine/releases/tag/3.5.2) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
   ```sh
   task install
@@ -84,34 +84,28 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 ## Usage
-* Run all tasks in parallel
+* Run workflow
    ```sh
-    task
+    task default
    ```
-* Run a specific task
+* Override settings with environment variables
    ```sh
-    task example-duplicates:main
+    OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
    ```
 * Run some tasks in parallel
    ```sh
    task --parallel example-duplicates:main example-doaj:main
    ```
 * Force run a task even when the task is up-to-date
    ```sh
-    task example-duplicates:main --force
+    task default --force
    ```
 * Dry-run in verbose mode for debugging
    ```sh
-    task example-duplicates:main --dry --verbose --force
+    task default --dry --verbose --force
    ```
 * List available tasks
@@ -120,17 +114,9 @@ Templates for OpenRefine batch processing (import, transform, export) using the
    task --list
    ```
-### How to develop your own tasks
+### Examples
-(first draft, will be elaborated later)
+* [noah-biejournals](https://github.com/opencultureconsulting/noah-biejournals): Harvesting of the BieJournals journal server of Bielefeld University Library and transformation to METS/MODS for the noah.nrw portal
 1. create a new folder
 2. copy an example Taskfile.yml
 3. provide input data in subdirectory input
 4. provide OpenRefine transformation history files in subdirectory config
 5. add commands to specific Taskfile (check openrefine-client help screen for available options: `openrefine/client --help`)
 6. add project to general Taskfile
 7. check memory load and increase RAM if needed
 ### Getting help
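Note on the README hunk above: the new usage entry overrides `OPENREFINE_MEMORY` and `OPENREFINE_PORT` from the calling shell. The following is a minimal plain-shell sketch of that "environment wins over default" pattern. It is only an illustration, not how go-task resolves its `env:` block internally; the function name `resolve_settings` is hypothetical, and the defaults are copied from the Taskfile.yml hunk in this diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): use the values already set
# in the environment, otherwise fall back to the defaults from Taskfile.yml.
resolve_settings() {
  OPENREFINE_MEMORY="${OPENREFINE_MEMORY:-5120M}" # max. Java heap for OpenRefine
  OPENREFINE_PORT="${OPENREFINE_PORT:-3333}"      # port OpenRefine listens on
  echo "memory=${OPENREFINE_MEMORY} port=${OPENREFINE_PORT}"
}

# Example call with per-invocation overrides, mirroring the README usage line:
#   OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 resolve_settings
```

The `${VAR:-default}` expansion keeps the defaults in one place while still allowing a one-off override on the command line.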
--- a/Taskfile.yml
+++ b/Taskfile.yml
@@ -1,102 +1,89 @@
 # https://github.com/opencultureconsulting/openrefine-task-runner
 version: '3'
 includes:
  example-doaj: example-doaj
  example-duplicates: example-duplicates
  example-powerhouse: example-powerhouse
  # add the directory name of your project here
 silent: true
 output: prefixed
 env:
-  OPENREFINE:
+  OPENREFINE_MEMORY: 5120M
-    sh: readlink -m .openrefine/refine
+  OPENREFINE_PORT: 3333
-  CLIENT:
+  OPENREFINE_APPDIR:
-    sh: readlink -m .openrefine/client
+    sh: readlink -m .openrefine
  OPENREFINE_TMPDIR:
    sh: mkdir -p .openrefine/tmp; readlink -m .openrefine/tmp
 tasks:
  default:
-    desc: execute all projects in parallel
+    desc: run workflow in batch mode
    deps:
      - task: example-doaj:refine
      - task: example-duplicates:refine
      - task: example-powerhouse:refine
      # add the directory name of your project here
    cmds:
      - defer: { task: cleanup } # will always be executed last
      - task: start
      - task: example
      - task: check
-
+    sources:
-  install:
+      - Taskfile.yml
-    desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
+      - input/**
-    cmds:
+      - config/**
-      - | # delete existing install and recreate folder
+    generates:
-        rm -rf .openrefine
+      - output/**
-        mkdir -p .openrefine
+    preconditions:
-      - > # download OpenRefine archive
+      - sh: test -f "${OPENREFINE_APPDIR}/refine"
-        wget --no-verbose -O openrefine.tar.gz
+        msg: "OpenRefine missing; try task install"
        https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
      - | # install OpenRefine into subdirectory .openrefine
        tar -xzf openrefine.tar.gz -C .openrefine --strip 1
        rm openrefine.tar.gz
      - | # optimize OpenRefine for batch processing
        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
      - > # download openrefine-client into subdirectory .openrefine
        wget --no-verbose -O .openrefine/client
        https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
      - chmod +x .openrefine/client # make client executable
  start:
-    dir: ./{{.DIR}}
+      - echo "start OpenRefine with max. $OPENREFINE_MEMORY on port $OPENREFINE_PORT..."
-    cmds:
+      - | # launch OpenRefine with specific data directory and redirect its output to a log file
-      - | # verify that OpenRefine is installed
+        "${OPENREFINE_APPDIR}/refine" -v warn -p "$OPENREFINE_PORT" -m "$OPENREFINE_MEMORY" -d "${OPENREFINE_TMPDIR}" > "${OPENREFINE_TMPDIR}/log.txt" 2>&1 &
        if [ ! -f "$OPENREFINE" ]; then
          echo 1>&2 "OpenRefine missing; try task install"; exit 1
        fi
      - | # delete temporary files and log file of previous run
        rm -rf ./*.project* workspace.json
        rm -rf "{{.PROJECT}}.log"
      - > # launch OpenRefine with specific data directory and redirect its output to a log file
        "$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
        -d ../{{.DIR}}
        >> "{{.PROJECT}}.log" 2>&1 &
      - | # wait until OpenRefine API is available
-        timeout 30s bash -c "until
+        timeout 30s bash -c "until wget -q -O - -o /dev/null http://localhost:${OPENREFINE_PORT} | cat | grep -q -o OpenRefine; do sleep 1; done"
          wget -q -O - -o /dev/null http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
          do sleep 1
        done"
-  stop:
+  example:
-    dir: ./{{.DIR}}
+      - | # import (requires absolute path)
-    cmds:
+        "${OPENREFINE_APPDIR}/client" \
-      - | # shut down OpenRefine gracefully
+        --create "$(readlink -m input/duplicates.csv)" \
-        PID=$(lsof -t -i:{{.PORT}})
+        --projectName example
-        kill $PID
+      - | # apply undo/redo history
-        while ps -p $PID > /dev/null; do sleep 1; done
+        for f in config/*.json; do
-      - > # archive the OpenRefine project
+          "${OPENREFINE_APPDIR}/client" example --apply "$f"
-        tar cfz
+        done
-        "{{.PROJECT}}.openrefine.tar.gz"
+      - | # export to TSV
-        -C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
+        mkdir -p output
-        .
+        "${OPENREFINE_APPDIR}/client" example \
-      - rm -rf ./*.project* workspace.json # delete temporary files
+        --output output/deduped.tsv
  kill:
    dir: ./{{.DIR}}
    cmds:
      - | # shut down OpenRefine immediately to save time and disk space
        PID=$(lsof -t -i:{{.PORT}})
        kill -9 $PID
        while ps -p $PID > /dev/null; do sleep 1; done
      - rm -rf ./*.project* workspace.json # delete temporary files
  check:
-    desc: check OpenRefine log for any warnings and exit on error
+     - | # print stats
-    dir: ./{{.DIR}}
+        PID="$(lsof -t -i:${OPENREFINE_PORT})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM"
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time"
     - | # check log file for any warnings
       if grep -i 'exception\|error' "${OPENREFINE_TMPDIR}/log.txt"
         then echo 1>&2 "log contains warnings!"; echo; cat "${OPENREFINE_TMPDIR}/log.txt"; exit 1
       fi
  cleanup:
      - | # kill OpenRefine immediately
        PID="$(lsof -t -i:${OPENREFINE_PORT})"
        kill -9 $PID
      - | # delete temporary files
        rm -rf "${OPENREFINE_TMPDIR}"
  install:
    desc: install OpenRefine and openrefine-client into subdirectory ${OPENREFINE_APPDIR}
    cmds:
-      - | # find log file(s) and check for "exception" or "error"
+      - mkdir -p "${OPENREFINE_APPDIR}"
-        if grep -i 'exception\|error' $(find . -name '*.log'); then
+      - | # install OpenRefine into subdirectory ${OPENREFINE_APPDIR}
-          echo 1>&2 "log contains warnings!"; exit 1
+        wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.5.2/openrefine-linux-3.5.2.tar.gz
-        fi
+        tar -xzf openrefine.tar.gz -C "${OPENREFINE_APPDIR}" --strip 1 && rm openrefine.tar.gz
      - | # optimize OpenRefine for batch processing
        sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "${OPENREFINE_APPDIR}/refine" # fix path issue in OpenRefine startup file
        sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "${OPENREFINE_APPDIR}/refine.ini" # do not try to open OpenRefine in browser
        sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "${OPENREFINE_APPDIR}/refine.ini" # set autosave period from 5 minutes to 25 hours
      - | # install openrefine-client into subdirectory ${OPENREFINE_APPDIR}
        wget --no-verbose -O "${OPENREFINE_APPDIR}/client" https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
        chmod +x "${OPENREFINE_APPDIR}/client"
  git:
    desc: commit and push if something changed
    cmds:
      - git add -A
      - git commit -m "latest change $(date -u)" || exit 0
      - git push
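Note on the Taskfile.yml changes above: the `start` task polls until the OpenRefine API answers, and the `check` task fails the run if the server log mentions an exception or error. A minimal standalone sketch of both ideas follows, assuming bash. The function names `wait_until` and `check_log` are hypothetical and do not exist in the Taskfile. The sketch uses `grep -E` instead of the GNU-specific `\|` alternation seen in the diff, and a retry counter instead of `timeout`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the start/check logic; not part of the Taskfile.

# wait_until ATTEMPTS COMMAND...: run COMMAND up to ATTEMPTS times,
# sleeping one second between tries; returns 1 if it never succeeds.
wait_until() {
  local attempts="$1"
  shift
  until "$@"; do
    attempts=$((attempts - 1))
    if [ "$attempts" -le 0 ]; then
      return 1
    fi
    sleep 1
  done
}

# check_log FILE: return 1 if FILE contains "exception" or "error"
# (case-insensitive), 2 if FILE does not exist, 0 otherwise.
check_log() {
  [ -f "$1" ] || { echo "log file not found: $1" >&2; return 2; }
  if grep -qiE 'exception|error' "$1"; then
    echo "log contains warnings!" >&2
    return 1
  fi
}

# Example (assumes OpenRefine on port 3333 and the log path used in this diff):
#   wait_until 30 wget -q -O /dev/null "http://localhost:3333"
#   check_log ".openrefine/tmp/log.txt"
```

Passing the probe as a command keeps the retry loop independent of `wget`, so the same helper could wrap `curl` or any other check.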
--- a/binder/postBuild
+++ b/binder/postBuild
@@ -5,8 +5,8 @@ set -e
 python -m bash_kernel.install
 # Install go-task https://github.com/go-task/task
-wget -q https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.tar.gz
+wget -q https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
-tar -xzf task_linux_amd64.tar.gz
+tar -xzf task_linux_amd64.tar.gz task
 rm task_linux_amd64.tar.gz
 mkdir -p $HOME/.local/bin
 mv task $HOME/.local/bin/
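Note on the postBuild change above: `tar -xzf task_linux_amd64.tar.gz task` now names the member to extract, so only the `task` binary is unpacked and any other files in the release archive stay out of the working directory. A minimal sketch of that technique follows; the helper name `extract_member` is hypothetical, not something from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of binder/postBuild).
# extract_member ARCHIVE MEMBER DEST: unpack only MEMBER from a gzipped
# tarball into directory DEST, leaving all other archive members untouched.
extract_member() {
  mkdir -p "$3"
  tar -xzf "$1" -C "$3" "$2"
}

# Example, mirroring the postBuild step (archive name taken from the diff):
#   extract_member task_linux_amd64.tar.gz task "$HOME/.local/bin"
```

The member name has to match the path as stored in the archive, which for the go-task release tarball is assumed here to be a top-level `task` entry, as the diff implies.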
--- a/binder/requirements.txt
+++ b/binder/requirements.txt
@@ -1,2 +1,2 @@
-jupyter-server-proxy==1.5.3
+jupyter-server-proxy==3.2.1
 bash_kernel==0.7.2
--- a/example-duplicates/config/duplicates-deletion.json
+++ b/example-duplicates/config/duplicates-deletion.json
--- a/demo.ipynb
+++ b/demo.ipynb
@@ -1 +0,0 @@
 {"cells":[{"metadata":{},"cell_type":"markdown","source":"## Run all tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run a specific task"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run some tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --parallel example-duplicates:main example-doaj:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Force run a task even when the task is up-to-date"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Dry-run in verbose mode for debugging"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --dry --verbose --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## List available tasks"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --list","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"},"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"}},"nbformat":4,"nbformat_minor":5}
--- a/example-doaj/Taskfile.yml
+++ b/example-doaj/Taskfile.yml
@@ -1,70 +0,0 @@
 version: '3'
 tasks:
  main:
    desc: Library Carpentry Lesson covering DOAJ
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}
  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: doaj
      PORT: 3334 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    deps:
      - task: download # will be executed each run independent of up-to-date check
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/doaj-article-sample.csv)"
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doaj-openrefine.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/doaj-results.tsv)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  download:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
    cmds:
      - mkdir -p input config
      - > # Download input
        wget --no-verbose -O input/doaj-article-sample.csv
        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
      - > # Download config
        wget --no-verbose -O config/doaj-openrefine.json
        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-duplicates/Taskfile.yml
+++ b/example-duplicates/Taskfile.yml
@@ -1,56 +0,0 @@
 version: '3'
 tasks:
  main:
    desc: Removing duplicates in a very small test dataset
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}
  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: duplicates
      PORT: 3335 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/duplicates.csv)"
        --encoding UTF-8
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/duplicates-deletion.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/deduped.xls)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-powerhouse/Taskfile.yml
+++ b/example-powerhouse/Taskfile.yml
@@ -1,72 +0,0 @@
 version: '3'
 tasks:
  main:
    desc: Powerhouse Museum Tutorial
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}
  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: phm
      PORT: 3336 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    deps:
      - task: download # will be executed each run independent of up-to-date check
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/phm-collection.tsv)"
        --processQuotes false
        --guessCellValueTypes true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/phm-transform.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/phm-results.tsv)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PORT: '{{.PORT}}', PROJECT: '{{.PROJECT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
  download:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
    cmds:
      - mkdir -p input config
      - > # Download input
        wget --no-verbose -O input/phm-collection.tsv
        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
      - > # Download config
        wget --no-verbose -O config/phm-transform.json
        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json
  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"
--- a/example-duplicates/input/duplicates.csv
+++ b/example-duplicates/input/duplicates.csv
--- a/output/deduped.tsv
+++ b/output/deduped.tsv
@@ -0,0 +1,7 @@
 email	count	name	state	gender	purchase
 arthur.duff@example4.com	2	Arthur Duff	OR	M	Dining table
 ben.morisson@example6.org	1	Ben Morisson	FL	M	Amplifier
 ben.tyler@example3.org	1	Ben Tyler	NV	M	Flashlight
 danny.baron@example1.com	3	Danny Baron	CA	M	TV
 jean.griffith@example5.org	1	Jean Griffith	WA	F	Power drill
 melanie.white@example2.edu	2	Melanie White	NC	F	iPhone
Author	SHA1	Message	Date
Felix Lohmeier	30ea93e3f3	minimize task install	2022-04-06 22:10:41 +02:00
Felix Lohmeier	f867050950	fix binder link	2022-04-06 22:01:03 +02:00
Felix Lohmeier	07b30f66c9	run tasks individually	2022-04-06 21:25:27 +02:00
Felix Lohmeier	0691b1f5e1	move stats to check task	2022-04-06 21:21:42 +02:00
Automated	5794f3cee0	latest change Wed Apr 6 19:15:21 UTC 2022	2022-04-06 19:15:21 +00:00
Felix Lohmeier	5ea4913f77	Revert "simulate failed gh actions run" This reverts commit `ed71aa005c`.	2022-04-06 21:14:32 +02:00
Felix Lohmeier	01dbe1d58f	fix yaml	2022-04-06 21:12:53 +02:00
Felix Lohmeier	d35b679f6d	deferred tasks don't fail	2022-04-06 21:11:08 +02:00
Automated	fceca60918	latest change Wed Apr 6 18:47:53 UTC 2022	2022-04-06 18:47:53 +00:00
Felix Lohmeier	ed71aa005c	simulate failed gh actions run	2022-04-06 20:47:07 +02:00
Felix Lohmeier	7812e5c8de	fix github actions workflow	2022-04-06 20:46:16 +02:00
Felix Lohmeier	c0facb81e0	simplify even more	2022-04-06 20:43:30 +02:00
Automated	cfb37d72e6	latest change Wed Apr 6 11:54:55 UTC 2022	2022-04-06 11:54:55 +00:00
Felix Lohmeier	8ee91ee84f	reduce complexity	2022-04-06 13:52:21 +02:00
Felix Lohmeier	1341e1b45c	update task in github action	2022-04-06 13:28:33 +02:00
Felix Lohmeier	12a1b1ab39	update OpenRefine to 3.5.2	2022-04-06 13:27:50 +02:00
Felix Lohmeier	847a622a10	update task to v3.10.0	2022-04-06 13:26:22 +02:00
Felix Lohmeier	72aabe685f	Merge pull request #6 from opencultureconsulting/dependabot/pip/binder/jupyter-server-proxy-3.2.1 ⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder	2022-01-28 17:19:51 +01:00
dependabot[bot]	88f9fe1e5f	⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder Bumps [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) from 1.5.3 to 3.2.1. - [Release notes](https://github.com/jupyterhub/jupyter-server-proxy/releases) - [Changelog](https://github.com/jupyterhub/jupyter-server-proxy/blob/main/CHANGELOG.md) - [Commits](https://github.com/jupyterhub/jupyter-server-proxy/compare/v1.5.3...v3.2.1) --- updated-dependencies: - dependency-name: jupyter-server-proxy dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2022-01-27 16:25:09 +00:00
Felix Lohmeier	7c199424c6	OpenRefine 3.5.0	2021-11-09 23:46:29 +01:00
Felix Lohmeier	21b05626e9	set retention days for GitHub Artifacts	2021-08-01 12:06:10 +02:00
Felix Lohmeier	bebd9d8b39	install go-task to /usr/local/bin	2021-07-14 23:14:45 +02:00
Felix Lohmeier	b3752aaf58	fix go-task install	2021-07-14 23:03:13 +02:00
Felix Lohmeier	b770ffcb3f	debug system path variable	2021-07-14 22:56:04 +02:00
Felix Lohmeier	2f0ef9feca	improve go-task install	2021-07-14 22:49:07 +02:00
Felix Lohmeier	78567e5f44	use github context	2021-07-14 22:32:59 +02:00
Felix Lohmeier	02af29fec1	Update and rename openrefine-task-runner.yml to all-tasks.yml	2021-07-14 22:23:53 +02:00
Felix Lohmeier	5d474b5dfa	fix calling task	2021-07-14 22:03:24 +02:00
Felix Lohmeier	965f82b2de	Create openrefine-task-runner.yml	2021-07-14 21:56:46 +02:00
`@ -1,2 +1,2 @@`
	`jupyter-server-proxy==1.5.3`	`jupyter-server-proxy==3.2.1`
	`bash_kernel==0.7.2`	`bash_kernel==0.7.2`
		`@ -1 +0,0 @@`
			{"cells":[{"metadata":{},"cell_type":"markdown","source":"## Run all tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run a specific task"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run some tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --parallel example-duplicates:main example-doaj:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Force run a task even when the task is up-to-date"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Dry-run in verbose mode for debugging"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --dry --verbose --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## List available tasks"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --list","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"},"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"}},"nbformat":4,"nbformat_minor":5}