Compare commits

29 Commits
v1.0 ... main

Author SHA1 Message Date
Felix Lohmeier 30ea93e3f3 minimize task install 2022-04-06 22:10:41 +02:00
Felix Lohmeier f867050950 fix binder link 2022-04-06 22:01:03 +02:00
Felix Lohmeier 07b30f66c9 run tasks individually 2022-04-06 21:25:27 +02:00
Felix Lohmeier 0691b1f5e1 move stats to check task 2022-04-06 21:21:42 +02:00
Automated 5794f3cee0 latest change Wed Apr 6 19:15:21 UTC 2022 2022-04-06 19:15:21 +00:00
Felix Lohmeier 5ea4913f77 Revert "simulate failed gh actions run"
This reverts commit ed71aa005c.
2022-04-06 21:14:32 +02:00
Felix Lohmeier 01dbe1d58f fix yaml 2022-04-06 21:12:53 +02:00
Felix Lohmeier d35b679f6d deferred tasks don't fail 2022-04-06 21:11:08 +02:00
Automated fceca60918 latest change Wed Apr 6 18:47:53 UTC 2022 2022-04-06 18:47:53 +00:00
Felix Lohmeier ed71aa005c simulate failed gh actions run 2022-04-06 20:47:07 +02:00
Felix Lohmeier 7812e5c8de fix github actions workflow 2022-04-06 20:46:16 +02:00
Felix Lohmeier c0facb81e0 simplify even more 2022-04-06 20:43:30 +02:00
Automated cfb37d72e6 latest change Wed Apr 6 11:54:55 UTC 2022 2022-04-06 11:54:55 +00:00
Felix Lohmeier 8ee91ee84f reduce complexity 2022-04-06 13:52:21 +02:00
Felix Lohmeier 1341e1b45c update task in github action 2022-04-06 13:28:33 +02:00
Felix Lohmeier 12a1b1ab39 update OpenRefine to 3.5.2 2022-04-06 13:27:50 +02:00
Felix Lohmeier 847a622a10 update task to v3.10.0 2022-04-06 13:26:22 +02:00
Felix Lohmeier 72aabe685f
Merge pull request #6 from opencultureconsulting/dependabot/pip/binder/jupyter-server-proxy-3.2.1
⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder
2022-01-28 17:19:51 +01:00
dependabot[bot] 88f9fe1e5f
⬆️ Bump jupyter-server-proxy from 1.5.3 to 3.2.1 in /binder
Bumps [jupyter-server-proxy](https://github.com/jupyterhub/jupyter-server-proxy) from 1.5.3 to 3.2.1.
- [Release notes](https://github.com/jupyterhub/jupyter-server-proxy/releases)
- [Changelog](https://github.com/jupyterhub/jupyter-server-proxy/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyterhub/jupyter-server-proxy/compare/v1.5.3...v3.2.1)

---
updated-dependencies:
- dependency-name: jupyter-server-proxy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-01-27 16:25:09 +00:00
Felix Lohmeier 7c199424c6 OpenRefine 3.5.0 2021-11-09 23:46:29 +01:00
Felix Lohmeier 21b05626e9 set retention days for GitHub Artifacts 2021-08-01 12:06:10 +02:00
Felix Lohmeier bebd9d8b39 install go-task to /usr/local/bin 2021-07-14 23:14:45 +02:00
Felix Lohmeier b3752aaf58 fix go-task install 2021-07-14 23:03:13 +02:00
Felix Lohmeier b770ffcb3f debug system path variable 2021-07-14 22:56:04 +02:00
Felix Lohmeier 2f0ef9feca improve go-task install 2021-07-14 22:49:07 +02:00
Felix Lohmeier 78567e5f44 use github context 2021-07-14 22:32:59 +02:00
Felix Lohmeier 02af29fec1 Update and rename openrefine-task-runner.yml to all-tasks.yml 2021-07-14 22:23:53 +02:00
Felix Lohmeier 5d474b5dfa fix calling task 2021-07-14 22:03:24 +02:00
Felix Lohmeier 965f82b2de Create openrefine-task-runner.yml 2021-07-14 21:56:46 +02:00
13 changed files with 142 additions and 335 deletions

33
.github/workflows/default.yml vendored Normal file
@@ -0,0 +1,33 @@
name: default

on:
  workflow_dispatch: # allows you to run this workflow manually from the Actions tab

jobs:
  main:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: install go-task 3.10.0
        run: |
          wget --no-verbose -O task.tar.gz https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
          sudo tar -xzf task.tar.gz -C /usr/local/bin task && rm task.tar.gz
      - name: install OpenRefine and openrefine-client
        run: task install
      - name: start OpenRefine
        run: task start
      - name: run workflow
        run: task example
      - name: print stats and check log file
        run: task check
      - uses: actions/upload-artifact@v2
        if: always()
        with:
          name: OpenRefine project and logfile
          path: .openrefine/tmp
          retention-days: 7
      - name: git commit and push
        run: |
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          task git
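
Note on the `task check` step: it is the step that makes the run fail when OpenRefine logged a problem. A minimal sketch of that kind of log gate in plain shell follows. `check_log` is a hypothetical helper name that does not exist in the repository, and the sketch uses `grep -E` rather than the `\|` alternation found in the Taskfile, on the assumption that portability beyond GNU grep is wanted.

```shell
#!/bin/sh
# check_log is a hypothetical helper, not part of the repository.
# It returns 1 if the given log file mentions "exception" or "error"
# (case-insensitive) and 0 otherwise.
check_log() {
  if grep -E -i -q 'exception|error' "$1"; then
    echo "log contains warnings!" >&2
    return 1
  fi
  return 0
}

# Demo with two throwaway log files (contents are made up).
tmpdir=$(mktemp -d)
printf 'INFO  started\n' > "$tmpdir/clean.txt"
printf 'INFO  started\njava.lang.NullPointerException\n' > "$tmpdir/broken.txt"

check_log "$tmpdir/clean.txt" && echo "clean.txt: ok"
check_log "$tmpdir/broken.txt" 2>/dev/null || echo "broken.txt: failed"
```

Because the function returns a non-zero status, a CI step that calls it stops the job in the same way the `exit 1` in the Taskfile does.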

7
.gitignore vendored

@@ -1,9 +1,2 @@
.task
.openrefine
*/output
*/*.log
*/*.openrefine.tar.gz
example-doaj/input
example-doaj/config
example-powerhouse/input
example-powerhouse/config

View File

@@ -1,23 +1,29 @@
# OpenRefine Task Runner (💎+🤖)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
[![Codacy Badge](https://app.codacy.com/project/badge/Grade/888dbf663fdd409e8d8fcf8472114194)](https://www.codacy.com/gh/opencultureconsulting/openrefine-task-runner/dashboard) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
Templates for OpenRefine batch processing (import, transform, export) using the task runner [go-task](https://github.com/go-task/task) and the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) to control OpenRefine via [its HTTP API](https://docs.openrefine.org/technical-reference/openrefine-api).
The workflow is defined in [Taskfile.yml](Taskfile.yml) and can be executed either locally (`task default`) or with [GitHub Actions](.github/workflows/default.yml).
## Features
* run tasks in parallel
* basic error handling by monitoring the OpenRefine server log
* dedicated OpenRefine instances for each task (your existing OpenRefine data will not be touched)
* dedicated OpenRefine instance with temporary workspace (your existing OpenRefine data will not be touched)
* prevent unnecessary work by fingerprinting generated files and their sources
* the [openrefine-client](https://github.com/opencultureconsulting/openrefine-client) used here supports many core features of OpenRefine:
* import CSV, TSV, line-based TXT, fixed-width TXT, JSON or XML (and specify input options)
* apply [undo/redo history](https://docs.openrefine.org/manual/running/#reusing-operations) from given JSON file(s)
* export to CSV, TSV, HTML, XLS, XLSX, ODS
* [templating export](https://github.com/opencultureconsulting/openrefine-client#templating) to additional formats like JSON or XML
* works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1
* works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.5
* tasks are easy to extend with additional commands (e.g. to download input data or validate results)
## Requirements
* GNU/Linux (tested with Fedora 34)
* JAVA 8+ (for OpenRefine)
## Typical workflow
**Step 1**: Do some experiments with your data (or parts of it) in the graphical user interface of OpenRefine. If you are fine with all transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as file (e.g. dedup.json).
@@ -26,30 +32,24 @@ Templates for OpenRefine batch processing (import, transform, export) using the
**Possible automation benefits:**
* When you receive updated data (in the same structure), you just need to drop the file and start the task like this:
* When you receive updated data (in the same structure), you just need to drop the input file and start the task like this:
```sh
task example-doaj
task
```
* The entire data processing (including options during import) becomes reproducible. The task configuration file can also be used for documentation through source code comments.
* Metadata experts can use OpenRefine's graphical interface and IT staff can incorporate the created transformation rules into regular data processing flows.
## Requirements
* GNU/Linux (tested with Fedora 32)
* JAVA 8+ (for OpenRefine)
## Demo via binder
[![Binder](https://notebooks.gesis.org/binder/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/opencultureconsulting/openrefine-task-runner/main?urlpath=lab/tree/demo.ipynb)
[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/opencultureconsulting/openrefine-task-runner/main)
- free to use on-demand server with Jupyterlab and Bash Kernel
- OpenRefine, openrefine-client and go-task [preinstalled](binder/postBuild)
- no registration needed, will start within a few minutes
- [restricted](https://notebooks.gesis.org/faq/) to 4 GB RAM and server will be deleted after 10 minutes of inactivity
- service is provided by GESIS and is intended for use by social scientists
- [restricted](https://mybinder.readthedocs.io/en/latest/about/about.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
## Install
@@ -60,23 +60,23 @@ Templates for OpenRefine batch processing (import, transform, export) using the
cd openrefine-task-runner
```
2. Install [Task 3.2.2](https://github.com/go-task/task/releases/tag/v3.2.2)
2. Install [Task 3.10.0](https://github.com/go-task/task/releases/tag/v3.10.0)+
a) RPM-based (Fedora, CentOS, SLES, etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.rpm
wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.rpm
sudo dnf install ./task_linux_amd64.rpm && rm task_linux_amd64.rpm
```
b) DEB-based (Debian, Ubuntu etc.)
```sh
wget https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.deb
wget https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.deb
sudo apt install ./task_linux_amd64.deb && rm task_linux_amd64.deb
```
3. Run install task to download [OpenRefine 3.4.1](https://github.com/OpenRefine/OpenRefine/releases/tag/3.4.1) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
3. Run install task to download [OpenRefine 3.5.2](https://github.com/OpenRefine/OpenRefine/releases/tag/3.5.2) and [openrefine-client 0.3.10](https://github.com/opencultureconsulting/openrefine-client/releases/tag/v0.3.10)
```sh
task install
@@ -84,34 +84,28 @@ Templates for OpenRefine batch processing (import, transform, export) using the
## Usage
* Run all tasks in parallel
* Run workflow
```sh
task
task default
```
* Run a specific task
* Override settings with environment variables
```sh
task example-duplicates:main
```
* Run some tasks in parallel
```sh
task --parallel example-duplicates:main example-doaj:main
OPENREFINE_MEMORY=2000M OPENREFINE_PORT=3334 task default
```
* Force run a task even when the task is up-to-date
```sh
task example-duplicates:main --force
task default --force
```
* Dry-run in verbose mode for debugging
```sh
task example-duplicates:main --dry --verbose --force
task default --dry --verbose --force
```
* List available tasks
@@ -120,17 +114,9 @@ Templates for OpenRefine batch processing (import, transform, export) using the
task --list
```
### How to develop your own tasks
### Examples
(first draft, will be elaborated later)
1. create a new folder
2. copy an example Taskfile.yml
3. provide input data in subdirectory input
4. provide OpenRefine transformation history files in subdirectory config
5. add commands to specific Taskfile (check openrefine-client help screen for available options: `openrefine/client --help`)
6. add project to general Taskfile
7. check memory load and increase RAM if needed
* [noah-biejournals](https://github.com/opencultureconsulting/noah-biejournals): Harvesting of the BieJournals journal server of Bielefeld University Library and transformation into METS/MODS for the noah.nrw portal
### Getting help
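
Note on "Override settings with environment variables": the README change relies on values from the calling shell taking precedence over the defaults in the Taskfile (`5120M` and port `3333`). The same default-unless-set pattern can be sketched in plain shell. This illustrates the pattern only and makes no claim about how go-task resolves its `env:` block internally; `effective_settings` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: prints the effective settings, falling back to the
# Taskfile's defaults (5120M, port 3333) when a variable is unset or empty.
effective_settings() {
  memory="${OPENREFINE_MEMORY:-5120M}"
  port="${OPENREFINE_PORT:-3333}"
  echo "memory=$memory port=$port"
}

# Nothing set by the caller: the defaults apply.
unset OPENREFINE_MEMORY OPENREFINE_PORT
effective_settings

# Values set by the caller win over the defaults.
OPENREFINE_MEMORY=2000M
OPENREFINE_PORT=3334
effective_settings
```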

View File

@@ -1,102 +1,89 @@
# https://github.com/opencultureconsulting/openrefine-task-runner
version: '3'
includes:
example-doaj: example-doaj
example-duplicates: example-duplicates
example-powerhouse: example-powerhouse
# add the directory name of your project here
silent: true
output: prefixed
env:
OPENREFINE:
sh: readlink -m .openrefine/refine
CLIENT:
sh: readlink -m .openrefine/client
OPENREFINE_MEMORY: 5120M
OPENREFINE_PORT: 3333
OPENREFINE_APPDIR:
sh: readlink -m .openrefine
OPENREFINE_TMPDIR:
sh: mkdir -p .openrefine/tmp; readlink -m .openrefine/tmp
tasks:
default:
desc: execute all projects in parallel
deps:
- task: example-doaj:refine
- task: example-duplicates:refine
- task: example-powerhouse:refine
# add the directory name of your project here
desc: run workflow in batch mode
cmds:
- defer: { task: cleanup } # will always be executed last
- task: start
- task: example
- task: check
install:
desc: (re)install OpenRefine and openrefine-client into subdirectory .openrefine
cmds:
- | # delete existing install and recreate folder
rm -rf .openrefine
mkdir -p .openrefine
- > # download OpenRefine archive
wget --no-verbose -O openrefine.tar.gz
https://github.com/OpenRefine/OpenRefine/releases/download/3.4.1/openrefine-linux-3.4.1.tar.gz
- | # install OpenRefine into subdirectory .openrefine
tar -xzf openrefine.tar.gz -C .openrefine --strip 1
rm openrefine.tar.gz
- | # optimize OpenRefine for batch processing
sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' ".openrefine/refine" # fix path issue in OpenRefine startup file
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' ".openrefine/refine.ini" # do not try to open OpenRefine in browser
sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' ".openrefine/refine.ini" # set autosave period from 5 minutes to 25 hours
- > # download openrefine-client into subdirectory .openrefine
wget --no-verbose -O .openrefine/client
https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
- chmod +x .openrefine/client # make client executable
sources:
- Taskfile.yml
- input/**
- config/**
generates:
- output/**
preconditions:
- sh: test -f "${OPENREFINE_APPDIR}/refine"
msg: "OpenRefine missing; try task install"
start:
dir: ./{{.DIR}}
cmds:
- | # verify that OpenRefine is installed
if [ ! -f "$OPENREFINE" ]; then
echo 1>&2 "OpenRefine missing; try task install"; exit 1
fi
- | # delete temporary files and log file of previous run
rm -rf ./*.project* workspace.json
rm -rf "{{.PROJECT}}.log"
- > # launch OpenRefine with specific data directory and redirect its output to a log file
"$OPENREFINE" -v warn -p {{.PORT}} -m {{.RAM}}
-d ../{{.DIR}}
>> "{{.PROJECT}}.log" 2>&1 &
- echo "start OpenRefine with max. $OPENREFINE_MEMORY on port $OPENREFINE_PORT..."
- | # launch OpenRefine with specific data directory and redirect its output to a log file
"${OPENREFINE_APPDIR}/refine" -v warn -p "$OPENREFINE_PORT" -m "$OPENREFINE_MEMORY" -d "${OPENREFINE_TMPDIR}" > "${OPENREFINE_TMPDIR}/log.txt" 2>&1 &
- | # wait until OpenRefine API is available
timeout 30s bash -c "until
wget -q -O - -o /dev/null http://localhost:{{.PORT}} | cat | grep -q -o OpenRefine
do sleep 1
done"
timeout 30s bash -c "until wget -q -O - -o /dev/null http://localhost:${OPENREFINE_PORT} | cat | grep -q -o OpenRefine; do sleep 1; done"
stop:
dir: ./{{.DIR}}
cmds:
- | # shut down OpenRefine gracefully
PID=$(lsof -t -i:{{.PORT}})
kill $PID
while ps -p $PID > /dev/null; do sleep 1; done
- > # archive the OpenRefine project
tar cfz
"{{.PROJECT}}.openrefine.tar.gz"
-C $(grep -l "{{.PROJECT}}" *.project/metadata.json | cut -d '/' -f 1)
.
- rm -rf ./*.project* workspace.json # delete temporary files
kill:
dir: ./{{.DIR}}
cmds:
- | # shut down OpenRefine immediately to save time and disk space
PID=$(lsof -t -i:{{.PORT}})
kill -9 $PID
while ps -p $PID > /dev/null; do sleep 1; done
- rm -rf ./*.project* workspace.json # delete temporary files
example:
- | # import (requires absolute path)
"${OPENREFINE_APPDIR}/client" \
--create "$(readlink -m input/duplicates.csv)" \
--projectName example
- | # apply undo/redo history
for f in config/*.json; do
"${OPENREFINE_APPDIR}/client" example --apply "$f"
done
- | # export to TSV
mkdir -p output
"${OPENREFINE_APPDIR}/client" example \
--output output/deduped.tsv
check:
desc: check OpenRefine log for any warnings and exit on error
dir: ./{{.DIR}}
- | # print stats
PID="$(lsof -t -i:${OPENREFINE_PORT})"
echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM"
echo "used $(ps --no-headers -o cputime -p "$PID") CPU time"
- | # check log file for any warnings
if grep -i 'exception\|error' "${OPENREFINE_TMPDIR}/log.txt"
then echo 1>&2 "log contains warnings!"; echo; cat "${OPENREFINE_TMPDIR}/log.txt"; exit 1
fi
cleanup:
- | # kill OpenRefine immediately
PID="$(lsof -t -i:${OPENREFINE_PORT})"
kill -9 $PID
- | # delete temporary files
rm -rf "${OPENREFINE_TMPDIR}"
install:
desc: install OpenRefine and openrefine-client into subdirectory ${OPENREFINE_APPDIR}
cmds:
- | # find log file(s) and check for "exception" or "error"
if grep -i 'exception\|error' $(find . -name '*.log'); then
echo 1>&2 "log contains warnings!"; exit 1
fi
- mkdir -p "${OPENREFINE_APPDIR}"
- | # install OpenRefine into subdirectory ${OPENREFINE_APPDIR}
wget --no-verbose -O openrefine.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/3.5.2/openrefine-linux-3.5.2.tar.gz
tar -xzf openrefine.tar.gz -C "${OPENREFINE_APPDIR}" --strip 1 && rm openrefine.tar.gz
- | # optimize OpenRefine for batch processing
sed -i 's/cd `dirname $0`/cd "$(dirname "$0")"/' "${OPENREFINE_APPDIR}/refine" # fix path issue in OpenRefine startup file
sed -i '$ a JAVA_OPTIONS=-Drefine.headless=true' "${OPENREFINE_APPDIR}/refine.ini" # do not try to open OpenRefine in browser
sed -i 's/#REFINE_AUTOSAVE_PERIOD=60/REFINE_AUTOSAVE_PERIOD=1440/' "${OPENREFINE_APPDIR}/refine.ini" # set autosave period from 5 minutes to 25 hours
- | # install openrefine-client into subdirectory ${OPENREFINE_APPDIR}
wget --no-verbose -O "${OPENREFINE_APPDIR}/client" https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux
chmod +x "${OPENREFINE_APPDIR}/client"
git:
desc: commit and push if something changed
cmds:
- git add -A
- git commit -m "latest change $(date -u)" || exit 0
- git push
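
Note on the `start` task: it polls the OpenRefine port until the page responds, bounded by `timeout 30s`. The same bounded-retry idea can be sketched without `timeout` or `wget`, so that it runs on any POSIX shell. `wait_for` is a hypothetical helper name, and the probe command is supplied by the caller.

```shell
#!/bin/sh
# wait_for LIMIT COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND once per second until it succeeds.
# Returns 0 on success, 1 once LIMIT seconds of waiting have passed.
wait_for() {
  limit=$1
  shift
  waited=0
  until "$@"; do
    if [ "$waited" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# A probe that succeeds immediately, and one that never does.
wait_for 2 true && echo "ready"
wait_for 1 false || echo "gave up"
```

In a setup like the one in this diff, the probe would presumably be the `wget ... | grep -q OpenRefine` pipeline wrapped in `sh -c`, so that it can be passed as a single command.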

View File

@@ -5,8 +5,8 @@ set -e
python -m bash_kernel.install
# Install go-task https://github.com/go-task/task
wget -q https://github.com/go-task/task/releases/download/v3.2.2/task_linux_amd64.tar.gz
tar -xzf task_linux_amd64.tar.gz
wget -q https://github.com/go-task/task/releases/download/v3.10.0/task_linux_amd64.tar.gz
tar -xzf task_linux_amd64.tar.gz task
rm task_linux_amd64.tar.gz
mkdir -p $HOME/.local/bin
mv task $HOME/.local/bin/
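
Note on the changed `tar` line: naming `task` as a member means only that file is unpacked, rather than everything in the release archive. A small offline sketch of the same behaviour with a throwaway archive follows; the file names and contents are made up for the demo.

```shell
#!/bin/sh
# Build a throwaway archive with two members, then extract only one of them.
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
printf 'binary\n' > "$workdir/src/task"
printf 'docs\n' > "$workdir/src/README.md"
tar -czf "$workdir/demo.tar.gz" -C "$workdir/src" task README.md

# Passing a member name after the archive restricts extraction to that member.
tar -xzf "$workdir/demo.tar.gz" -C "$workdir/dest" task

# Only "task" is listed; README.md stayed inside the archive.
ls "$workdir/dest"
```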

View File

@@ -1,2 +1,2 @@
jupyter-server-proxy==1.5.3
jupyter-server-proxy==3.2.1
bash_kernel==0.7.2

View File

@@ -1 +0,0 @@
{"cells":[{"metadata":{},"cell_type":"markdown","source":"## Run all tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run a specific task"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Run some tasks in parallel"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --parallel example-duplicates:main example-doaj:main","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Force run a task even when the task is up-to-date"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Dry-run in verbose mode for debugging"},{"metadata":{"trusted":true},"cell_type":"code","source":"task example-duplicates:main --dry --verbose --force","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## List available tasks"},{"metadata":{"trusted":true},"cell_type":"code","source":"task --list","execution_count":null,"outputs":[]}],"metadata":{"kernelspec":{"name":"bash","display_name":"Bash","language":"bash"},"language_info":{"name":"bash","codemirror_mode":"shell","mimetype":"text/x-sh","file_extension":".sh"}},"nbformat":4,"nbformat_minor":5}

View File

@@ -1,70 +0,0 @@
version: '3'

tasks:
  main:
    desc: Library Carpentry Lesson covering DOAJ
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}

  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: doaj
      PORT: 3334 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    deps:
      - task: download # will be executed each run independent of up-to-date check
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/doaj-article-sample.csv)"
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/doaj-openrefine.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/doaj-results.tsv)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  download:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
    cmds:
      - mkdir -p input config
      - > # Download input
        wget --no-verbose -O input/doaj-article-sample.csv
        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-article-sample.csv
      - > # Download config
        wget --no-verbose -O config/doaj-openrefine.json
        https://github.com/felixlohmeier/openrefine-kimws2019/raw/master/doaj-openrefine.json

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"

View File

@@ -1,56 +0,0 @@
version: '3'

tasks:
  main:
    desc: Removing duplicates in a very small test dataset
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}

  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: duplicates
      PORT: 3335 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/duplicates.csv)"
        --encoding UTF-8
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/duplicates-deletion.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/deduped.xls)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"

View File

@@ -1,72 +0,0 @@
version: '3'

tasks:
  main:
    desc: Powerhouse Museum Tutorial
    vars:
      DIR: '{{splitList ":" .TASK | first}}' # results in the task namespace, which is identical to the directory name
    cmds:
      - task: refine
      - task: :check # check OpenRefine log for any warnings and exit on error
        vars: {DIR: '{{.DIR}}'}

  refine:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
      PROJECT: phm
      PORT: 3336 # assign a different port for each project
      RAM: 2048M # maximum RAM for OpenRefine java heap space
      LOG: '>(tee -a "{{.PROJECT}}.log") 2>&1' # be careful when making changes here, as the path to the log file should match the server log (see main task "start")
    deps:
      - task: download # will be executed each run independent of up-to-date check
    cmds:
      - task: :start # launch OpenRefine
        vars: {DIR: '{{.DIR}}', PROJECT: '{{.PROJECT}}', PORT: '{{.PORT}}', RAM: '{{.RAM}}'}
      - > # import file
        "$CLIENT" -P {{.PORT}}
        --create "$(readlink -m input/phm-collection.tsv)"
        --processQuotes false
        --guessCellValueTypes true
        --projectName "{{.PROJECT}}"
        > {{.LOG}}
      - > # apply transformation rules
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --apply config/phm-transform.json
        > {{.LOG}}
      - mkdir -p output
      - > # export to file
        "$CLIENT" -P {{.PORT}} "{{.PROJECT}}"
        --output "$(readlink -m output/phm-results.tsv)"
        > {{.LOG}}
      - | # print allocated system resources
        PID="$(lsof -t -i:{{.PORT}})"
        echo "used $(($(ps --no-headers -o rss -p "$PID") / 1024)) MB RAM" > {{.LOG}}
        echo "used $(ps --no-headers -o cputime -p "$PID") CPU time" > {{.LOG}}
      - task: :stop # shut down OpenRefine and archive the OpenRefine project
        vars: {DIR: '{{.DIR}}', PORT: '{{.PORT}}', PROJECT: '{{.PROJECT}}'}
    sources:
      - Taskfile.yml
      - input/**
      - config/**
    generates:
      - ./{{.PROJECT}}.openrefine.tar.gz
      - output/**
    ignore_error: true # workaround to avoid an orphaned Java process on error https://github.com/go-task/task/issues/141

  download:
    dir: ./{{.DIR}}
    vars:
      DIR: '{{splitList ":" .TASK | first}}'
    cmds:
      - mkdir -p input config
      - > # Download input
        wget --no-verbose -O input/phm-collection.tsv
        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/input/phm-collection.tsv
      - > # Download config
        wget --no-verbose -O config/phm-transform.json
        https://github.com/opencultureconsulting/openrefine-batch/raw/master/examples/powerhouse-museum/config/phm-transform.json

  default: # enable standalone execution (running `task` in project directory)
    cmds:
      - DIR="${PWD##*/}:main" && cd .. && task "$DIR"

7
output/deduped.tsv Normal file

@@ -0,0 +1,7 @@
email count name state gender purchase
arthur.duff@example4.com 2 Arthur Duff OR M Dining table
ben.morisson@example6.org 1 Ben Morisson FL M Amplifier
ben.tyler@example3.org 1 Ben Tyler NV M Flashlight
danny.baron@example1.com 3 Danny Baron CA M TV
jean.griffith@example5.org 1 Jean Griffith WA F Power drill
melanie.white@example2.edu 2 Melanie White NC F iPhone