Compare commits

29 Commits
v1.0 ... main

Author SHA1 Message Date
Felix Lohmeier 30ea93e3f3 minimize task install 2022-04-06 22:10:41 +02:00
Felix Lohmeier f867050950 fix binder link 2022-04-06 22:01:03 +02:00
Felix Lohmeier 07b30f66c9 run tasks individually 2022-04-06 21:25:27 +02:00
Felix Lohmeier 0691b1f5e1 move stats to check task 2022-04-06 21:21:42 +02:00
Automated 5794f3cee0 latest change Wed Apr 6 19:15:21 UTC 2022 2022-04-06 19:15:21 +00:00
Felix Lohmeier 5ea4913f77 Revert "simulate failed gh actions run" (this reverts commit ed71aa005c) 2022-04-06 21:14:32 +02:00
Felix Lohmeier 01dbe1d58f fix yaml 2022-04-06 21:12:53 +02:00
Felix Lohmeier d35b679f6d deferred tasks don't fail 2022-04-06 21:11:08 +02:00
Automated fceca60918 latest change Wed Apr 6 18:47:53 UTC 2022 2022-04-06 18:47:53 +00:00
Felix Lohmeier ed71aa005c simulate failed gh actions run 2022-04-06 20:47:07 +02:00
Felix Lohmeier 7812e5c8de fix github actions workflow 2022-04-06 20:46:16 +02:00
Felix Lohmeier c0facb81e0 simplify even more 2022-04-06 20:43:30 +02:00
Automated cfb37d72e6 latest change Wed Apr 6 11:54:55 UTC 2022 2022-04-06 11:54:55 +00:00
Felix Lohmeier 8ee91ee84f reduce complexity 2022-04-06 13:52:21 +02:00
Felix Lohmeier 1341e1b45c update task in github action 2022-04-06 13:28:33 +02:00
Felix Lohmeier 12a1b1ab39 update OpenRefine to 3.5.2 2022-04-06 13:27:50 +02:00
Felix Lohmeier 847a622a10 update task to v3.10.0 2022-04-06 13:26:22 +02:00
Felix Lohmeier 72aabe685f Merge pull request #6 from opencultureconsulting/dependabot/pip/binder/jupyter-server-proxy-3.2.1 (⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder) 2022-01-28 17:19:51 +01:00
dependabot[bot] 88f9fe1e5f ⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder 2022-01-27 16:25:09 +00:00
Bumps [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) from 1.5.3 to 3.2.1.
- [Release notes](https://github.com/jupyterhub/jupyter-server-proxy/releases)
- [Changelog](https://github.com/jupyterhub/jupyter-server-proxy/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterhub/jupyter-server-proxy/compare/v1.5.3...v3.2.1)

---
updated-dependencies:
- dependency-name: jupyter-server-proxy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Felix Lohmeier 7c199424c6 OpenRefine 3.5.0 2021-11-09 23:46:29 +01:00
Felix Lohmeier 21b05626e9 set retention days for GitHub Artifacts 2021-08-01 12:06:10 +02:00
Felix Lohmeier bebd9d8b39 install go-task to /usr/local/bin 2021-07-14 23:14:45 +02:00
Felix Lohmeier b3752aaf58 fix go-task install 2021-07-14 23:03:13 +02:00
Felix Lohmeier b770ffcb3f debug system path variable 2021-07-14 22:56:04 +02:00
Felix Lohmeier 2f0ef9feca improve go-task install 2021-07-14 22:49:07 +02:00
Felix Lohmeier 78567e5f44 use github context 2021-07-14 22:32:59 +02:00
Felix Lohmeier 02af29fec1 Update and rename openrefine-task-runner.yml to all-tasks.yml 2021-07-14 22:23:53 +02:00
Felix Lohmeier 5d474b5dfa fix calling task 2021-07-14 22:03:24 +02:00
Felix Lohmeier 965f82b2de Create openrefine-task-runner.yml 2021-07-14 21:56:46 +02:00
13 changed files with 142 additions and 335 deletions

.github/workflows/default.yml (new file, 33 lines)

@ -0,0 +1,33 @@
name: default
on:
  workflow_dispatch: # allows you to run this workflow manually from the Actions tab
jobs:
  main:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: install go-task 3.10.0
        run: |
          wget --no-verbose -O task.tar.gz https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
          sudo tar -xzf task.tar.gz -C /usr/local/bin task && rm task.tar.gz
      - name: install OpenRefine and openrefine-client
        run: task install
      - name: start OpenRefine
        run: task start
      - name: run workflow
        run: task example
      - name: print stats and check log file
        run: task check
      - uses: actions/upload-artifact@v2
        if: always()
        with:
          name: OpenRefine project and logfile
          path: .openrefine/tmp
          retention-days: 7
      - name: git commit and push
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          task git
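
The install step pins go-task by embedding the version in the release URL. As a rough sketch of how that URL could be parameterized, the snippet below builds it from a version string. `task_release_url` and `TASK_VERSION` are hypothetical names used only for illustration; they are not part of this workflow.

```sh
# Hypothetical helper: build the go-task release URL for a given version.
# The URL pattern is copied from the "install go-task" step above.
task_release_url() {
  printf 'https://github.com/go-task/task/releases/download/v%s/task_linux_amd64.tar.gz\n' "$1"
}

# Hypothetical variable; falls back to the version pinned in the workflow.
TASK_VERSION="${TASK_VERSION:-3.10.0}"
task_release_url "$TASK_VERSION"
```

Keeping the version in one place would mean a later bump touches a single value instead of every URL, but this is only one possible arrangement.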

.gitignore (7 lines changed)

@ -1,9 +1,2 @@
 .task
 .openrefine
-*/output
-*/*.log
-*/*.openrefine.tar.gz
-example-doaj/input
-example-doaj/config
-example-powerhouse/input
-example-powerhouse/config

README.md

@ -1,23 +1,29 @@
 # OpenRefine Task Runner (💎+🤖)
-[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
 Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
+The workflow is defined in [Taskfile.yml](Taskfile.yml) and can be executed either locally (`task default`) or with [GitHub Actions](.github/workflows/default.yml).
 ## Features
-* run tasks in parallel
 * basic error handling by monitoring the OpenRefine server log
-* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
+* dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
 * prevent unnecessary work by fingerprinting generated files and their sources
 * the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
 * import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
 * apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
 * export to CSV, TSV, HTML, XLS, XLSX, ODS
 * [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
-* works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
+* works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
 * tasks are easy to extend with additional commands (e.g. to download input data or validate results)
+## Requirements
+* GNU/Linux (tested with Fedora 34)
+* JAVA 8+ (for OpenRefine)
 ## Typical workflow
 **Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. dedup.json).
@ -26,30 +32,24 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 **Possible automation benefits:**
-* When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
+* When you receive updated data (in the same structure), you just need to drop the input file and start the task like this:
 ```sh
-task example-doaj
+task
 ```
 * The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
 * Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
-## Requirements
-* GNU/Linux (tested with Fedora 32)
-* JAVA 8+ (for OpenRefine)
 ## Demo via binder
-[![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
+[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
 - free to use on-demand server with Jupyterlab and Bash Kernel
 - OpenRefine, openrefine-client and go-task [preinstalled](binder/postBuild)
 - no registration needed, will start within a few minutes
-- [restricted](https://notebooks.gesis.org/faq/) to 4 GB RAM and server will be deleted after 10 minutes of inactivity
+- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
-- service is provided by GESIS and is intended for use by social scientists
 ## Install
@ -60,23 +60,23 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 cd openrefine-task-runner
 ```
-2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
+2. Install [Task 3.10.0](https://github.com/go-task/task/releases/tag/v3.10.0)+
 a) RPM-based (Fedora, CentOS, SLES, etc.)
 ```sh
-wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
+wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
 sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
 ```
 b) DEB-based (Debian, Ubuntu etc.)
 ```sh
-wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
+wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
 sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
 ```
-3. Run install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
+3. Run install task to download [OpenRefine 3.5.2](https://github.com/OpenRefine/OpenRefine/releases/tag/3.5.2) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
 ```sh
 task install
@ -84,34 +84,28 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 ## Usage
-* Run all tasks in parallel
+* Run workflow
 ```sh
-task
+task default
 ```
-* Run a specific task
+* Override settings with environment variables
 ```sh
-task example-duplicates:main
+OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
-```
-* Run some tasks in parallel
-```sh
-task --parallel example-duplicates:main example-doaj:main
 ```
 * Force run a task even when the task is up-to-date
 ```sh
-task example-duplicates:main --force
+task default --force
 ```
 * Dry-run in verbose mode for debugging
 ```sh
-task example-duplicates:main --dry --verbose --force
+task default --dry --verbose --force
 ```
 * List available tasks
@ -120,17 +114,9 @@ Templates for OpenRefine batch processing (import, transform, export) using the
 task --list
 ```
-### How to develop your own tasks
+### Examples
-(first draft, will be elaborated later)
+* [noah-biejournals](https://github.com/opencultureconsulting/noah-biejournals): Harvesting of the BieJournals journal server of Bielefeld University Library and transformation to METS/MODS for the noah.nrw portal
-1. create a new folder
-2. copy an example Taskfile.yml
-3. provide input data in subdirectory input
-4. provide OpenRefine transformation history files in subdirectory config
-5. add commands to specific Taskfile (check openrefine-client help screen for available options: `openrefine/client --help`)
-6. add project to general Taskfile
-7. check memory load and increase RAM if needed
 ### Getting help
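
The README's usage entry on overriding settings with environment variables can be pictured in plain shell. This is only a sketch of the precedence, assuming that values set in the calling environment win over the defaults from the `env` block of Taskfile.yml (5120M and 3333), as the README usage suggests. `effective_settings` is a hypothetical helper, not something the repository provides.

```sh
# Hypothetical helper: print the memory and port settings that would apply.
# Defaults are copied from the env block of Taskfile.yml; values already
# present in the environment are assumed to take precedence.
effective_settings() {
  memory="${OPENREFINE_MEMORY:-5120M}"
  port="${OPENREFINE_PORT:-3333}"
  printf '%s %s\n' "$memory" "$port"
}

# Subshells keep the overrides from leaking into the current shell.
(unset OPENREFINE_MEMORY OPENREFINE_PORT; effective_settings)
(OPENREFINE_MEMORY=2000M; OPENREFINE_PORT=3334; effective_settings)
```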

Taskfile.yml

@ -1,102 +1,89 @@
+# https://github.com/opencultureconsulting/openrefine-task-runner
 version: '3'
-includes:
-example-doaj: example-doaj
-example-duplicates: example-duplicates
-example-powerhouse: example-powerhouse
-# add the directory name of your project here
 silent: true
-output: prefixed
 env:
-OPENREFINE:
-sh: readlink -m .openrefine/refine
-CLIENT:
-sh: readlink -m .openrefine/client
+OPENREFINE_MEMORY: 5120M
+OPENREFINE_PORT: 3333
+OPENREFINE_APPDIR:
+sh: readlink -m .openrefine
+OPENREFINE_TMPDIR:
+sh: mkdir -p .openrefine/tmp; readlink -m .openrefine/tmp
 tasks:
 default:
-desc: execute all projects in parallel
-deps:
-- task: example-doaj:refine
-- task: example-duplicates:refine
-- task: example-powerhouse:refine
-# add the directory name of your project here
+desc: run workflow in batch mode
 cmds:
+- defer: { task: cleanup } # will always be executed last
+- task: start
+- task: example
 - task: check
-install:
-desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
-cmds:
-- | # delete existing install and recreate folder
-rm -rf .openrefine
-mkdir -p .openrefine
-- > # download OpenRefine archive
-wget --no-verbose -O openrefine.tar.gz
-https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
-- | # install OpenRefine into subdirectory .openrefine
-tar -xzf openrefine.tar.gz -C .openrefine --strip 1
-rm openrefine.tar.gz
-- | # optimize OpenRefine for batch processing
-sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
-sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
-sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
-- > # download openrefine-client into subdirectory .openrefine
-wget --no-verbose -O .openrefine/client
-https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
-- chmod +x .openrefine/client # make client executable
+sources:
+- Taskfile.yml
+- input/**
+- config/**
+generates:
+- output/**
+preconditions:
+- sh: test -f "${OPENREFINE_APPDIR}/refine"
+msg: "OpenRefine missing; try task install"
 start:
-dir: ./{{.DIR}}
-cmds:
-- | # verify that OpenRefine is installed
-if [ ! -f "$OPENREFINE" ]; then
-echo 1>&2 "OpenRefine missing; try task install"; exit 1
-fi
-- | # delete temporary files and log file of previous run
-rm -rf ./*.project* workspace.json
-rm -rf "{{.PROJECT}}.log"
-- > # launch OpenRefine with specific data directory and redirect its output to a log file
-"$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
--d ../{{.DIR}}
->> "{{.PROJECT}}.log" 2>&1 &
+- echo "start OpenRefine with max. $OPENREFINE_MEMORY on port $OPENREFINE_PORT..."
+- | # launch OpenRefine with specific data directory and redirect its output to a log file
+"${OPENREFINE_APPDIR}/refine" -v warn -p "$OPENREFINE_PORT" -m "$OPENREFINE_MEMORY" -d "${OPENREFINE_TMPDIR}" > "${OPENREFINE_TMPDIR}/log.txt" 2>&1 &
 - | # wait until OpenRefine API is available
-timeout 30s bash -c "until
-wget -q -O - -o /dev/null http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
-do sleep 1
-done"
+timeout 30s bash -c "until wget -q -O - -o /dev/null http://localhost:${OPENREFINE_PORT} | cat | grep -q -o OpenRefine; do sleep 1; done"
-stop:
-dir: ./{{.DIR}}
-cmds:
-- | # shut down OpenRefine gracefully
-PID=$(lsof -t -i:{{.PORT}})
-kill $PID
-while ps -p $PID > /dev/null; do sleep 1; done
-- > # archive the OpenRefine project
-tar cfz
-"{{.PROJECT}}.openrefine.tar.gz"
--C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
-.
-- rm -rf ./*.project* workspace.json # delete temporary files
-kill:
-dir: ./{{.DIR}}
-cmds:
-- | # shut down OpenRefine immediately to save time and disk space
-PID=$(lsof -t -i:{{.PORT}})
-kill -9 $PID
-while ps -p $PID > /dev/null; do sleep 1; done
-- rm -rf ./*.project* workspace.json # delete temporary files
+example:
+- | # import (requires absolute path)
+"${OPENREFINE_APPDIR}/client" \
+--create "$(readlink -m input/duplicates.csv)" \
+--projectName example
+- | # apply undo/redo history
+for f in config/*.json; do
+"${OPENREFINE_APPDIR}/client" example --apply "$f"
+done
+- | # export to TSV
+mkdir -p output
+"${OPENREFINE_APPDIR}/client" example \
+--output output/deduped.tsv
 check:
-desc: check OpenRefine log for any warnings and exit on error
-dir: ./{{.DIR}}
+- | # print stats
+PID="$(lsof -t -i:${OPENREFINE_PORT})"
+echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM"
+echo "used $(ps --no-headers -o cputime -p "$PID") CPU time"
+- | # check log file for any warnings
+if grep -i 'exception\|error' "${OPENREFINE_TMPDIR}/log.txt"
+then echo 1>&2 "log contains warnings!"; echo; cat "${OPENREFINE_TMPDIR}/log.txt"; exit 1
+fi
+cleanup:
+- | # kill OpenRefine immediately
+PID="$(lsof -t -i:${OPENREFINE_PORT})"
+kill -9 $PID
+- | # delete temporary files
+rm -rf "${OPENREFINE_TMPDIR}"
+install:
+desc: install OpenRefine and openrefine-client into subdirectory ${OPENREFINE_APPDIR}
 cmds:
-- | # find log file(s) and check for "exception" or "error"
-if grep -i 'exception\|error' $(find . -name '*.log'); then
-echo 1>&2 "log contains warnings!"; exit 1
-fi
+- mkdir -p "${OPENREFINE_APPDIR}"
+- | # install OpenRefine into subdirectory ${OPENREFINE_APPDIR}
+wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.5.2/openrefine-linux-3.5.2.tar.gz
+tar -xzf openrefine.tar.gz -C "${OPENREFINE_APPDIR}" --strip 1 && rm openrefine.tar.gz
+- | # optimize OpenRefine for batch processing
+sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "${OPENREFINE_APPDIR}/refine" # fix path issue in OpenRefine startup file
+sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "${OPENREFINE_APPDIR}/refine.ini" # do not try to open OpenRefine in browser
+sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "${OPENREFINE_APPDIR}/refine.ini" # set autosave period from 5 minutes to 25 hours
+- | # install openrefine-client into subdirectory ${OPENREFINE_APPDIR}
+wget --no-verbose -O "${OPENREFINE_APPDIR}/client" https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
+chmod +x "${OPENREFINE_APPDIR}/client"
+git:
+desc: commit and push if something changed
+cmds:
+- git add -A
+- git commit -m "latest change $(date -u)" || exit 0
+- git push
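
Two patterns in the new Taskfile carry most of the error handling: polling until the OpenRefine API answers, and scanning the server log for "exception" or "error". The sketch below isolates both as standalone shell functions so they can be tried without a running server. `wait_until` and `check_log` are hypothetical names, and the sketch uses `grep -E` with `exception|error` as an equivalent of the Taskfile's `'exception\|error'` pattern.

```sh
# Hypothetical helper: run a command once per second until it succeeds
# or the given number of seconds has passed (the Taskfile uses `timeout 30s`).
wait_until() {
  limit="$1"; shift
  elapsed=0
  until "$@"; do
    [ "$elapsed" -ge "$limit" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Hypothetical helper: fail if the log file mentions an exception or error,
# mirroring what the check task does with ${OPENREFINE_TMPDIR}/log.txt.
check_log() {
  if grep -i -q -E 'exception|error' "$1"; then
    echo "log contains warnings!" 1>&2
    return 1
  fi
}

# Possible use against a running server (not executed here):
# wait_until 30 sh -c 'wget -q -O - "http://localhost:${OPENREFINE_PORT:-3333}" | grep -q OpenRefine'
# check_log "${OPENREFINE_TMPDIR}/log.txt"
```

Because matching is case-insensitive and substring-based, harmless log lines that merely contain the word "error" would also trip the check; that trade-off is inherited from the pattern in the Taskfile.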

binder/postBuild

@ -5,8 +5,8 @@ set -e
 python -m bash_kernel.install
 # Install go-task https://github.com/go-task/task
-wget -q https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.tar.gz
-tar -xzf task_linux_amd64.tar.gz
+wget -q https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
+tar -xzf task_linux_amd64.tar.gz task
 rm task_linux_amd64.tar.gz
 mkdir -p $HOME/.local/bin
 mv task $HOME/.local/bin/

binder/requirements.txt

@ -1,2 +1,2 @@
-jupyter-server-proxy==1.5.3
+jupyter-server-proxy==3.2.1
 bash_kernel==0.7.2

demo.ipynb

@ -1 +0,0 @@
{"cells":[{"metadata":{},"cell_type":"markdown","source":"## Run all tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run a specific task"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run some tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --parallel example-duplicates:main example-doaj:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Force run a task even when the task is up-to-date"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Dry-run in verbose mode for debugging"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --dry --verbose --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## List available tasks"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --list","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"},"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"}},"nbformat":4,"nbformat_minor":5}

example-doaj/Taskfile.yml

@ -1,70 +0,0 @@
version: '3'
tasks:
main:
desc: Library Carpentry Lesson covering DOAJ
vars:
DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: refine
- task: :check # check OpenRefine log for any warnings and exit on error
vars: {DIR: '{{.DIR}}'}
refine:
dir: ./{{.DIR}}
vars:
DIR: '{{splitList ":" .TASK | first}}'
PROJECT: doaj
PORT: 3334 # assign a different port for each project
RAM: 2048M # maximum RAM for OpenRefine java heap space
LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
deps:
- task: download # will be executed each run independent of up-to-date check
cmds:
- task: :start # launch OpenRefine
vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # import file
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m input/doaj-article-sample.csv)"
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # apply transformation rules
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/doaj-openrefine.json
> {{.LOG}}
- mkdir -p output
- > # export to file
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--output "$(readlink -m output/doaj-results.tsv)"
> {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- input/**
- config/**
generates:
- ./{{.PROJECT}}.openrefine.tar.gz
- output/**
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
download:
dir: ./{{.DIR}}
vars:
DIR: '{{splitList ":" .TASK | first}}'
cmds:
- mkdir -p input config
- > # Download input
wget --no-verbose -O input/doaj-article-sample.csv
https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
- > # Download config
wget --no-verbose -O config/doaj-openrefine.json
https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"

example-duplicates/Taskfile.yml

@ -1,56 +0,0 @@
version: '3'
tasks:
main:
desc: Removing duplicates in a very small test dataset
vars:
DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: refine
- task: :check # check OpenRefine log for any warnings and exit on error
vars: {DIR: '{{.DIR}}'}
refine:
dir: ./{{.DIR}}
vars:
DIR: '{{splitList ":" .TASK | first}}'
PROJECT: duplicates
PORT: 3335 # assign a different port for each project
RAM: 2048M # maximum RAM for OpenRefine java heap space
LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
cmds:
- task: :start # launch OpenRefine
vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # import file
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m input/duplicates.csv)"
--encoding UTF-8
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # apply transformation rules
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/duplicates-deletion.json
> {{.LOG}}
- mkdir -p output
- > # export to file
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--output "$(readlink -m output/deduped.xls)"
> {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
sources:
- Taskfile.yml
- input/**
- config/**
generates:
- ./{{.PROJECT}}.openrefine.tar.gz
- output/**
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"

example-powerhouse/Taskfile.yml

@ -1,72 +0,0 @@
version: '3'
tasks:
main:
desc: Powerhouse Museum Tutorial
vars:
DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
cmds:
- task: refine
- task: :check # check OpenRefine log for any warnings and exit on error
vars: {DIR: '{{.DIR}}'}
refine:
dir: ./{{.DIR}}
vars:
DIR: '{{splitList ":" .TASK | first}}'
PROJECT: phm
PORT: 3336 # assign a different port for each project
RAM: 2048M # maximum RAM for OpenRefine java heap space
LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
deps:
- task: download # will be executed each run independent of up-to-date check
cmds:
- task: :start # launch OpenRefine
vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
- > # import file
"$CLIENT" -P {{.PORT}}
--create "$(readlink -m input/phm-collection.tsv)"
--processQuotes false
--guessCellValueTypes true
--projectName "{{.PROJECT}}"
> {{.LOG}}
- > # apply transformation rules
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--apply config/phm-transform.json
> {{.LOG}}
- mkdir -p output
- > # export to file
"$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
--output "$(readlink -m output/phm-results.tsv)"
> {{.LOG}}
- | # print allocated system resources
PID="$(lsof -t -i:{{.PORT}})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
- task: :stop # shut down OpenRefine and archive the OpenRefine project
vars: {DIR: '{{.DIR}}', PORT: '{{.PORT}}', PROJECT: '{{.PROJECT}}'}
sources:
- Taskfile.yml
- input/**
- config/**
generates:
- ./{{.PROJECT}}.openrefine.tar.gz
- output/**
ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141
download:
dir: ./{{.DIR}}
vars:
DIR: '{{splitList ":" .TASK | first}}'
cmds:
- mkdir -p input config
- > # Download input
wget --no-verbose -O input/phm-collection.tsv
https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
- > # Download config
wget --no-verbose -O config/phm-transform.json
https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json
default: # enable standalone execution (running `task` in project directory)
cmds:
- DIR="${PWD##*/}:main" && cd .. && task "$DIR"

output/deduped.tsv (new file, 7 lines)

@ -0,0 +1,7 @@
email count name state gender purchase
arthur.duff@example4.com 2 Arthur Duff OR M Dining table
ben.morisson@example6.org 1 Ben Morisson FL M Amplifier
ben.tyler@example3.org 1 Ben Tyler NV M Flashlight
danny.baron@example1.com 3 Danny Baron CA M TV
jean.griffith@example5.org 1 Jean Griffith WA F Power drill
melanie.white@example2.edu 2 Melanie White NC F iPhone
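
The exported file holds one row per email address plus a `count` column. As a small sketch of how such a file could be inspected, the function below lists the addresses whose count is greater than 1. It assumes tab-separated columns with a header row, and assumes `count` is the number of input rows merged for that address; `duplicated_emails` is a hypothetical helper, and the sample rows are copied from the file above.

```sh
# Hypothetical helper: print emails whose count column (2nd field) exceeds 1.
duplicated_emails() {
  awk -F '\t' 'NR > 1 && $2 > 1 { print $1 }' "$1"
}

# Build a small tab-separated sample from rows of output/deduped.tsv.
sample_tsv="$(mktemp)"
printf 'email\tcount\tname\tstate\tgender\tpurchase\n' > "$sample_tsv"
printf 'arthur.duff@example4.com\t2\tArthur Duff\tOR\tM\tDining table\n' >> "$sample_tsv"
printf 'ben.tyler@example3.org\t1\tBen Tyler\tNV\tM\tFlashlight\n' >> "$sample_tsv"
printf 'danny.baron@example1.com\t3\tDanny Baron\tCA\tM\tTV\n' >> "$sample_tsv"

duplicated_emails "$sample_tsv"
rm -f "$sample_tsv"
```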