# OpenRefine Python Client with extended command line interface [![Codacy Badge](https://api.codacy.com/project/badge/Grade/33129bd15cdc4ece88c8012caab8d347)](https://www.codacy.com/app/felixlohmeier/openrefine-client?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-client&utm_campaign=Badge_Grade) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine-client?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine-client/) [![PyPI](https://img.shields.io/pypi/v/openrefine-client)](https://pypi.org/project/openrefine-client/) [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/felixlohmeier/openrefineder/master) The [OpenRefine Python Client from PaulMakepeace](https://github.com/PaulMakepeace/refine-client-py) provides a library for communicating with an [OpenRefine](http://openrefine.org) server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, macOS). It is also available via Docker Hub, PyPI and Binder. ## Download One-file-executables: - Windows: [openrefine-client_0-3-7_windows.exe](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_windows.exe) (~5 MB) - macOS: [openrefine-client_0-3-7_macos](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_macos) (~5 MB) - Linux: [openrefine-client_0-3-7_linux](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_linux) (~5 MB) For [Docker](#docker) containers, native [Python](#python) installation and free [Binder](#binder) on-demand server see the corresponding chapters below. ## Peek A short video loop that demonstrates the basic features (list, create, apply, export): ![video loop that demonstrates basic features](openrefine-client-peek.gif) ## Usage Ensure you have [OpenRefine](http://openrefine.org) running (i.e. available at http://localhost:3333 or [another URL](#change-url)). To use the client: 1. Open a terminal pointing to the folder where you have [downloaded](#download) the one-file-executable (e.g. Downloads in your home directory). - Windows: Open PowerShell and enter following command ```sh cd ~\Downloads ``` - macOS: Open Terminal (Finder > Applications > Utilities > Terminal) and enter following command ```sh cd ~/Downloads ``` - Linux: Open terminal app (Terminal, Konsole, xterm, ...) and enter following command ```sh cd ~/Downloads ``` 2. Make the file executable. - Windows: not necessary - macOS: ```sh chmod +x openrefine-client_0-3-7_macos ``` - Linux: ```sh chmod +x openrefine-client_0-3-7_linux ``` 3. Execute the file. - Windows: ```sh .\openrefine-client_0-3-7_windows.exe ``` - macOS: ```sh ./openrefine-client_0-3-7_macos ``` - Linux: ```sh ./openrefine-client_0-3-7_linux ``` Using tab completion and command history is highly recommended: - autocomplete filenames: enter a few characters and press `↹` - recall previous command: press `↑` ### Basic commands Execute the client by entering its filename followed by the desired command. The following example will download two small files ([duplicates.csv](https://raw.githubusercontent.com/opencultureconsulting/openrefine-client/master/tests/data/duplicates.csv) and [duplicates-deletion.json](https://raw.githubusercontent.com/opencultureconsulting/openrefine-client/master/tests/data/duplicates-deletion.json)) into the current directory and will create a new OpenRefine project from file duplicates.csv. Download example data (`--download`) and create project from file (`--create`): - Windows: ```sh .\openrefine-client_0-3-7_windows.exe --download "https://git.io/fj5hF" --output=duplicates.csv .\openrefine-client_0-3-7_windows.exe --download "https://git.io/fj5ju" --output=duplicates-deletion.json .\openrefine-client_0-3-7_windows.exe --create duplicates.csv ``` - macOS: ```sh ./openrefine-client_0-3-7_macos --download "https://git.io/fj5hF" --output=duplicates.csv ./openrefine-client_0-3-7_macos --download "https://git.io/fj5ju" --output=duplicates-deletion.json ./openrefine-client_0-3-7_macos --create duplicates.csv ``` - Linux: ```sh ./openrefine-client_0-3-7_linux --download "https://git.io/fj5hF" --output=duplicates.csv ./openrefine-client_0-3-7_linux --download "https://git.io/fj5ju" --output=duplicates-deletion.json ./openrefine-client_0-3-7_linux --create duplicates.csv ``` Other commands: - list all projects: `--list` - show project metadata: `--info "duplicates"` - export project to terminal: `--export "duplicates"` - apply [rules from json file](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html): `--apply duplicates-deletion.json "duplicates"` - export project to file: `--export --output=deduped.xls "duplicates"` - delete project: `--delete "duplicates"` ### Getting help Check `--help` for further options. Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you miss some features in the command line interface or if you have tracked a bug. And you are welcome to ask any questions! ### Change URL By default the client connects to the usual URL of OpenRefine [http://localhost:3333](http://localhost:3333). If your OpenRefine server is running somewhere else then you may set hostname and port with additional command line options (e.g. http://example.com): - set host: `-H example.com` - set port: `-P 80` ### Advanced Templating The OpenRefine [Templating](https://github.com/OpenRefine/OpenRefine/wiki/Export-As-YAML) supports exporting data in any text format (i.e. to construct JSON or XML). The graphical user interface offers four input fields: 1. prefix 2. row template - supports [GREL](https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language) inside two curly brackets, e.g. `{{jsonize(cells["name"].value)}}` 3. row separator 4. suffix This templating functionality is available via the openrefine-client command line interface. It even provides an additional feature for splitting results into multiple files. To try out the functionality create another project from the example file above. ```sh --create duplicates.csv --projectName=advanced ``` The following example code will export... - the columns "name" and "purchase" in JSON format - from the project "advanced" - for rows matching the regex text filter `^F$` in column "gender" macOS/Linux Terminal (multi-line input with `\` ): ```sh "advanced" \ --prefix='{ "events" : [ ' \ --template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \ --rowSeparator=', ' \ --suffix=' ] }' \ --filterQuery='^F$' \ --filterColumn='gender' ``` Windows PowerShell (multi-line input with `` ` ``; quotes needs to be doubled): ```sh "advanced" ` --prefix='{ ""events"" : [ ' ` --template=' { ""name"" : {{jsonize(cells[""name""].value)}}, ""purchase"" : {{jsonize(cells[""purchase""].value)}} }' ` --rowSeparator=', ' ` --suffix=' ] }' ` --filterQuery='^F$' ` --filterColumn='gender' ``` Add the following options to the last command (recall with `↑`) to store the results in multiple files. Each file will contain the prefix, an processed row, and the suffix. ```sh --output=advanced.json --splitToFiles=true ``` Filenames are suffixed with the row number by default (e.g. `advanced_1.json`, `advanced_2.json` etc.). There is another option to use the value in the first column instead: ```sh --output=advanced.json --splitToFiles=true --suffixById=true ``` Because our project "advanced" contains duplicates in the first column "email" this command will overwrite files (e.g. `advanced_melanie.white@example2.edu.json`). When using this option, the first column should contain unique identifiers. ### See also - Linux Bash script to run OpenRefine in batch mode (import, transform, export): [openrefine-batch](https://github.com/opencultureconsulting/openrefine-batch) - [Jupyter notebook demonstrating usage in Linux Bash](https://nbviewer.jupyter.org/github/felixlohmeier/openrefineder/blob/master/openrefine-client-bash.ipynb) - Use case [HOS-MetadataTransformations](https://github.com/subhh/HOS-MetadataTransformations): Automated workflow for harvesting, transforming and indexing of metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack. - Use case [Data processing of ILS data to facilitate a new discovery layer for the German Literature Archive (DLA)](https://doi.org/10.5281/zenodo.2678113): Custom data processing pipeline based on Pandas (a Python library) and OpenRefine. ## Docker [felixlohmeier/openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine-client?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine-client/) ```sh docker pull felixlohmeier/openrefine-client:v0.3.7 ``` ### Option 1: Dockerized client Run client and mount current directory as workspace: ```sh docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 ``` The docker option `--network=host` allows you to connect to a local or remote OpenRefine via the host network: - list projects on default URL (http://localhost:3333) ```sh docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --list ``` - list projects on a remote server (http://example.com) ```sh docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H example.com -P 80 --list ``` Usage: same commands as explained above (see [Basic Commands](#basic-commands) and [Advanced Templating](#advanced-templating)) ### Option 2: Dockerized client and dockerized OpenRefine Run openrefine-client linked to a dockerized OpenRefine ([felixlohmeier/openrefine](https://hub.docker.com/r/felixlohmeier/openrefine/) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine)): 1. Create docker network ```sh docker network create openrefine ``` 2. Run server (will be available at http://localhost:3333) ```sh docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.2 ``` 3. Run client with some [basic commands](#basic-commands): 1. download example files, 2. create project from file, 3. list projects, 4. show metadata, 5. export to terminal, 6. apply transformation rules (deduplication), 7. export again to terminal, 8. export to xls file and 9. delete project ```sh docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --download "https://git.io/fj5hF" --output=duplicates.csv docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --download "https://git.io/fj5ju" --output=duplicates-deletion.json docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --create duplicates.csv docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --list docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --info "duplicates" docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export "duplicates" docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --apply duplicates-deletion.json "duplicates" docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export "duplicates" docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export --output=deduped.xls "duplicates" docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --delete "duplicates" ``` 4. Stop and delete server: ```sh docker stop openrefine-server docker rm openrefine-server ``` 5. Delete docker network: ```sh docker network rm openrefine ``` Customize OpenRefine server: - If you want to add an OpenRefine startup option you need to repeat the default commands (cf. [Dockerfile](https://hub.docker.com/r/felixlohmeier/openrefine/dockerfile)) - `-i 0.0.0.0` sets OpenRefine to be accessible from outside the container, i.e. from host - `-d /data` sets OpenRefine workspace - Example for [allocating more memory](https://github.com/OpenRefine/OpenRefine/wiki/FAQ#out-of-memory-errors---feels-slow---could-not-reserve-enough-space-for-object-heap) to OpenRefine with additional option `-m 4G` ```sh docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.2 -i 0.0.0.0 -d /data -m 4G ``` - The OpenRefine version is defined by the docker tag. Check the [DockerHub repository](https://hub.docker.com/r/felixlohmeier/openrefine) for available tags. Example for OpenRefine `2.8` with same options as above: ```sh docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G ``` - If you want OpenRefine to read and write persistent data in host directory (i.e. store projects) you can mount the container path `/data`. Example for host directory `/home/felix/refine`: ```sh docker run -d -p 3333:3333 -v /home/felix/refine:/data:z --network=openrefine name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G ``` See also: - [GitHub Repository](https://github.com/opencultureconsulting/openrefine-docker) for docker container `felixlohmeier/openrefine` - Linux Bash script to run OpenRefine in batch mode (import, transform, export) with docker containers: [openrefine-batch-docker.sh](https://github.com/opencultureconsulting/openrefine-batch/#docker) ## Python [openrefine-client](https://pypi.org/project/openrefine-client/) [![PyPI](https://img.shields.io/pypi/v/openrefine-client)](https://pypi.org/project/openrefine-client/) (requires Python 2.x) ```sh python2 -m pip install openrefine-client --user ``` This will install the package `openrefine-client` containing modules in `google.refine`. A command line script `openrefine-client` will also be installed. ### Option 1: command line script ```sh openrefine-client --help ``` Usage: same commands as explained above (see [Basic Commands](#basic-commands) and [Advanced Templating](#advanced-templating)) ### Option 2: using cli functions in Python 2.x environment Import module cli: ```python from google.refine import cli ``` Change URL (if necessary): ```python cli.refine.REFINE_HOST = 'localhost' cli.refine.REFINE_PORT = '3333' ``` Help screen: ```python help(cli) ``` Commands: * download (e.g. example data): ```python cli.download('https://git.io/fj5hF','duplicates.csv') cli.download('https://git.io/fj5ju','duplicates-deletion.json') ``` * list projects: ```python cli.ls() ``` * create project: ```python p1 = cli.create('duplicates.csv') ``` * show metadata: ```python cli.info(p1.project_id) ``` * apply rules from file to project: ```python cli.apply(p1.project_id, 'duplicates-deletion.json') ``` * export project to terminal: ```python cli.export(p1.project_id) ``` * export project to file in xls format: ```python cli.export(p1.project_id, 'deduped.xls') ``` * export templating (see [Advanced Templating](#advanced-templating) above): ```python cli.templating( p1.project_id, prefix='''{ "events" : [ ''',template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''', rowSeparator=''', ''',suffix=''' ] }''') ``` * delete project: ```python cli.delete(p1.project_id) ``` ### Option 3: the upstream way This fork can be used in the same way as the upstream [Python client library](https://github.com/PaulMakepeace/refine-client-py/). Some functions in the python client library are not yet compatible with OpenRefine >=3.0 (cf. [issue #19 in refine-client-py](https://github.com/paulmakepeace/refine-client-py/issues/19)). Import module refine: ```python from google.refine import refine ``` Server Commands: * set up connection: ```python server1 = refine.Refine('http://localhost:3333') ``` - show version: ```python server1.server.get_version() server1.server.version ``` - list projects: ```python server1.list_projects() ``` - pretty print the returned dict with json.dumps: ```python import json print(json.dumps(server1.list_projects(), indent=1)) ``` - create project: ```python server1.new_project(project_file='duplicates.csv') ``` * create and open the returned project in one step: ```python project1 = server1.new_project(project_file='duplicates.csv') ``` Project commands: * open project: ```python project1 = server1.open_project('1234567890123') ``` * print full URL to project: ```python project1.project_url() ``` * list columns: ```python project1.columns ``` * compute text facet on first column (**fails with OpenRefine >=3.2**): ```python project1.compute_facets(facet.TextFacet(project1.columns[0])) ``` * print returned object ```python facets = project1.compute_facets(facet.TextFacet(project1.columns[0])).facets[0] for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True): print(facets.choices[k].count, k) ``` * compute clusters on first column: ```python project1.compute_clusters(project1.columns[0]) ``` * apply rules from file to project: ```python project1.apply_operations('duplicates-deletion.json') ``` * export project: ```python project1.export(export_format='tsv') ``` * print the returned fileobject: ```python print(project1.export(export_format='tsv').read()) ``` * save the returned fileobject to file: ```python with open('export.tsv', 'wb') as f: f.write(project1.export(export_format='tsv').read()) ``` * templating export (**function was added in this fork**, see [Advanced Templating](#advanced-templating) above): ```python data = project1.export_templating( prefix='''{ "events" : [ ''',template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''', rowSeparator=''', ''',suffix=''' ] }''') print(data.read()) ``` * print help screen with available commands (many more!): ```python help(project1) ``` * example for custom commands: ```python project1.do_json('get-rows')['total'] ``` * delete project: ```python project1.delete() ``` See also: - Jupyter notebook by Trevor Muñoz (2013-08-18): [Programmatic Use of Open Refine to Facet and Cluster Names of 'Dishes' from NYPL's What's on the menu?](https://nbviewer.jupyter.org/gist/trevormunoz/6265360) - Jupyter notebook by Tony Hirst (2019-01-09) [Notebook demonstrating how to control OpenRefine via a Python client.](https://nbviewer.jupyter.org/github/ouseful-PR/openrefineder/blob/4cef25a4ca6077536c5f49cafb531499fbcad96e/notebooks/OpenRefine%20Demos.ipynb) - Unittests [test_refine.py](tests/test_refine.py) and [test_tutorial.py](tests/test_tutorial.py) (both importing [refinetest.py](tests/refinetest.py)) - [OpenRefine API](https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API) in official OpenRefine wiki ## Binder [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/felixlohmeier/openrefineder/master) - free to use on-demand server with Jupyter notebook, OpenRefine and Bash - no registration needed, will start within a few minutes - [restricted](https://mybinder.readthedocs.io/en/latest/faq.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity - [bash_kernel demo notebook](https://nbviewer.jupyter.org/github/felixlohmeier/openrefineder/blob/master/openrefine-client-bash.ipynb) for using the openrefine-client in a Linux Bash environment [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/felixlohmeier/openrefineder/master?urlpath=/tree/openrefine-client-bash.ipynb) - [python2 demo notebook](https://nbviewer.jupyter.org/github/felixlohmeier/openrefineder/blob/master/openrefine-client-python.ipynb) for using the openrefine-client in a Python 2 environment [![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/felixlohmeier/openrefineder/master?urlpath=/tree/openrefine-client-python.ipynb) ## Development If you would like to contribute to the Python client library please consider a pull request to the upstream repository [refine-client-py](https://github.com/PaulMakepeace/refine-client-py/). ### Tests Ensure you have OpenRefine running (i.e. available at http://localhost:3333). If necessary set the environment variables `OPENREFINE_HOST` and `OPENREFINE_PORT` to change the URL. The Python client library includes several unit tests. - run all tests ```sh python setup.py test ``` - run subset test_facet ```sh python setup.py --test-suite tests.test_facet ``` There is also a script that uses docker images to run the unit tests with different versions of OpenRefine. - run tests on all OpenRefine versions (from 2.0 up to 3.2) ```sh ./tests.sh -a ``` - run tests on tag 3.2 ```sh ./tests.sh -t 3.2 ``` - run tests on tag 3.2 interactively (pause before and after tests) ```sh ./tests.sh -t 3.2 -i ``` - run tests on tags 3.2 and 2.7 ```sh ./tests.sh -t 3.2 -t 2.7 ``` ### Distributing Note to myself: When releasing a new version... 1. Run tests ```sh ./tests.sh -a ``` 2. Make final changes in GitHub - update versions and download links (guess in advance) in [README.md](https://github.com/opencultureconsulting/openrefine-client/blob/master/README.md#download) - check if [Dockerfile](https://github.com/opencultureconsulting/openrefine-client/blob/master/docker/Dockerfile) needs to be changed 3. Build executables with PyInstaller - Run PyInstaller in Python 2 environments on native Windows, macOS and Linux. Should be "the oldest version of the OS you need to support"! Current release is built with: - Ubuntu 16.04 LTS (64-bit) - macOS Sierra 10.12 - Windows 10 - One-file-executables will be available in `dist/`. ```sh git clone https://github.com/opencultureconsulting/openrefine-client.git cd openrefine-client python -m pip install . --user python -m pip install pyinstaller --user pyinstaller --onefile refine.py --hidden-import google.refine.__main__ ``` 4. Create release in GitHub - draft [release notes](https://github.com/opencultureconsulting/openrefine-client/releases) and attach one-file-executables 5. Build package and upload to PyPI ```sh python3 setup.py sdist bdist_wheel python3 -m twine upload dist/* ``` 6. Update Docker container - add new autobuild for release version - trigger latest build 7. Bump openrefine-client version in related projects - openrefine-batch: [openrefine-batch.sh](https://github.com/opencultureconsulting/openrefine-batch/blob/master/openrefine-batch.sh#L7) and [openrefine-batch-docker.sh](https://github.com/opencultureconsulting/openrefine-batch/blob/master/openrefine-batch-docker.sh) - openrefineder: [postBuild](https://github.com/felixlohmeier/openrefineder/blob/master/postBuild) ## Credits [Paul Makepeace](http://paulm.com), author David Huynh, [initial cut](