release v0.3.7 - substantially revised code and docs
fixed bug #1 (option columnWidths broken since v0.3.2)
fixed bug #2 (commands create and export templating broken since v0.3.5)
added --download command
extended --info command
improved performance of --export command
improved error handling and user feedback
removed support for legacy docker option --link
added detailed usage instructions with examples
moved and extended instructions from docker/README.md to README.md
added usage instructions for Python library
added chapter on Binder openrefineder
added badges for docker, pypi and binder
added usage instructions for tests script
added note to myself for distributing releases
moved all functions from parser to cli module
separated export and template function
improved code style (PEP8)
This commit is contained in:
parent
7d66993982
commit
bfd00b55aa
697
README.md
|
@ -1,73 +1,700 @@
|
||||||
# OpenRefine Python Client with extended command line interface
|
# OpenRefine Python Client with extended command line interface
|
||||||
|
|
||||||
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/33129bd15cdc4ece88c8012caab8d347)](https://www.codacy.com/app/felixlohmeier/openrefine-client?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-client&utm_campaign=Badge_Grade)
|
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/33129bd15cdc4ece88c8012caab8d347)](https://www.codacy.com/app/felixlohmeier/openrefine-client?utm_source=github.com&utm_medium=referral&utm_content=opencultureconsulting/openrefine-client&utm_campaign=Badge_Grade) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine-client?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine-client/) [![PyPI](https://img.shields.io/pypi/v/openrefine-client)](https://pypi.org/project/openrefine-client/) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/betatim/openrefineder/22fbb07?filepath=openrefine-client.ipynb)
|
||||||
|
|
||||||
The [OpenRefine Python Client Library from PaulMakepeace](https://github.com/PaulMakepeace/refine-client-py) provides an interface to communicating with an [OpenRefine](http://openrefine.org) server. This fork extends the command line interface (CLI) and supports communication between docker containers.
|
The [OpenRefine Python Client from PaulMakepeace](https://github.com/PaulMakepeace/refine-client-py) provides a library for communicating with an [OpenRefine](http://openrefine.org) server.
|
||||||
|
This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, macOS).
|
||||||
|
It is also available via Docker Hub, PyPI and Binder.
|
||||||
|
|
||||||
## Download
|
## Download
|
||||||
|
|
||||||
One-file-executables:
|
One-file-executables:
|
||||||
|
|
||||||
* Linux: [openrefine-client_0-3-4_linux-64bit](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.4/openrefine-client_0-3-4_linux-64bit) (4,7 MB)
|
- Windows: [openrefine-client_0-3-7_windows.exe](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_windows.exe) (~5 MB)
|
||||||
* Windows: [openrefine-client_0-3-4_windows.exe](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.4/openrefine-client_0-3-4_windows.exe) (4,9 MB)
|
- macOS: [openrefine-client_0-3-7_macos](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_macos) (~5 MB)
|
||||||
* Mac: [openrefine-client_0-3-4_mac](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.4/openrefine-client_0-3-4_mac) (4,4 MB)
|
- Linux: [openrefine-client_0-3-7_linux](https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.7/openrefine-client_0-3-7_linux) (~5 MB)
|
||||||
|
|
||||||
For native Python installation on Windows, Mac or Linux see [Installation](#installation) below.
|
For [Docker](#docker) containers, native [Python](#python) installation and free [Binder](#binder) on-demand server see the corresponding chapters below.
|
||||||
|
|
||||||
## Peek
|
## Peek
|
||||||
|
|
||||||
A short video loop that demonstrates the basic features (list, create, apply, export)
|
A short video loop that demonstrates the basic features (list, create, apply, export):
|
||||||
|
|
||||||
![video loop that demonstrates basic features](openrefine-client-peek.gif)
|
![video loop that demonstrates basic features](openrefine-client-peek.gif)
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
Command line interface:
|
Ensure you have [OpenRefine](http://openrefine.org) running (i.e. available at http://localhost:3333 or [another URL](#change-url)).
|
||||||
|
|
||||||
|
To use the client:
|
||||||
|
|
||||||
|
1. Open a terminal pointing to the folder where you have [downloaded](#download) the one-file-executable (e.g. Downloads in your home directory).
|
||||||
|
|
||||||
|
- Windows: Open PowerShell and enter the following command:
|
||||||
|
|
||||||
|
```
|
||||||
|
cd ~\Downloads
|
||||||
|
```
|
||||||
|
|
||||||
|
- macOS: Open Terminal (Finder > Applications > Utilities > Terminal) and enter the following command:
|
||||||
|
|
||||||
|
```
|
||||||
|
cd ~/Downloads
|
||||||
|
```
|
||||||
|
|
||||||
|
- Linux: Open a terminal app (Terminal, Konsole, xterm, ...) and enter the following command:
|
||||||
|
|
||||||
|
```
|
||||||
|
cd ~/Downloads
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Make the file executable.
|
||||||
|
|
||||||
|
- Windows: not necessary
|
||||||
|
|
||||||
|
- macOS:
|
||||||
|
|
||||||
|
```
|
||||||
|
chmod +x openrefine-client_0-3-7_macos
|
||||||
|
```
|
||||||
|
|
||||||
|
- Linux:
|
||||||
|
|
||||||
|
```
|
||||||
|
chmod +x openrefine-client_0-3-7_linux
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Execute the file.
|
||||||
|
|
||||||
|
- Windows:
|
||||||
|
|
||||||
|
```
|
||||||
|
.\openrefine-client_0-3-7_windows.exe
|
||||||
|
```
|
||||||
|
|
||||||
|
- macOS:
|
||||||
|
|
||||||
|
```
|
||||||
|
./openrefine-client_0-3-7_macos
|
||||||
|
```
|
||||||
|
|
||||||
|
- Linux:
|
||||||
|
|
||||||
|
```
|
||||||
|
./openrefine-client_0-3-7_linux
|
||||||
|
```
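For scripts that call the client on more than one platform, the per-platform filenames from the steps above can be selected programmatically. This is a minimal sketch, assuming the v0.3.7 filenames listed under Download; the helpers `client_filename` and `client_command` are hypothetical and not part of openrefine-client.

```python
import platform

# Filenames of the v0.3.7 one-file-executables as listed under Download.
# platform.system() reports macOS as "Darwin".
EXECUTABLES = {
    "Windows": "openrefine-client_0-3-7_windows.exe",
    "Darwin": "openrefine-client_0-3-7_macos",
    "Linux": "openrefine-client_0-3-7_linux",
}


def client_filename(system=None):
    """Return the executable name for the given (or current) operating system."""
    system = system or platform.system()
    if system not in EXECUTABLES:
        raise ValueError("no one-file-executable listed for %r" % system)
    return EXECUTABLES[system]


def client_command(args, system=None):
    """Build an argument list that runs the client from the current directory."""
    system = system or platform.system()
    # PowerShell expects .\ and POSIX shells expect ./ as prefix.
    prefix = ".\\" if system == "Windows" else "./"
    return [prefix + client_filename(system)] + list(args)


print(client_command(["--list"], system="Linux"))
```

The returned list can be handed to `subprocess.check_call` once the file has been downloaded and made executable.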
|
||||||
|
|
||||||
|
Using tab completion and command history is highly recommended:
|
||||||
|
|
||||||
|
- autocomplete filenames: enter a few characters and press `↹`
|
||||||
|
- recall previous command: press `↑`
|
||||||
|
|
||||||
|
### Basic commands
|
||||||
|
|
||||||
|
Execute the client by entering its filename followed by the desired command.
|
||||||
|
|
||||||
|
The following example will download two small files ([duplicates.csv](https://raw.githubusercontent.com/opencultureconsulting/openrefine-client/master/tests/data/duplicates.csv) and [duplicates-deletion.json](https://raw.githubusercontent.com/opencultureconsulting/openrefine-client/master/tests/data/duplicates-deletion.json)) into the current directory and will create a new OpenRefine project from file duplicates.csv.
|
||||||
|
|
||||||
|
Download example data (`--download`) and create project from file (`--create`):
|
||||||
|
|
||||||
|
- Windows:
|
||||||
|
|
||||||
|
```
|
||||||
|
.\openrefine-client_0-3-7_windows.exe --download "https://git.io/fj5hF" --output=duplicates.csv
|
||||||
|
.\openrefine-client_0-3-7_windows.exe --download "https://git.io/fj5ju" --output=duplicates-deletion.json
|
||||||
|
.\openrefine-client_0-3-7_windows.exe --create duplicates.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
- macOS:
|
||||||
|
|
||||||
|
```
|
||||||
|
./openrefine-client_0-3-7_macos --download "https://git.io/fj5hF" --output=duplicates.csv
|
||||||
|
./openrefine-client_0-3-7_macos --download "https://git.io/fj5ju" --output=duplicates-deletion.json
|
||||||
|
./openrefine-client_0-3-7_macos --create duplicates.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
- Linux:
|
||||||
|
|
||||||
|
```
|
||||||
|
./openrefine-client_0-3-7_linux --download "https://git.io/fj5hF" --output=duplicates.csv
|
||||||
|
./openrefine-client_0-3-7_linux --download "https://git.io/fj5ju" --output=duplicates-deletion.json
|
||||||
|
./openrefine-client_0-3-7_linux --create duplicates.csv
|
||||||
|
```
|
||||||
|
|
||||||
|
Other commands:
|
||||||
|
|
||||||
- list all projects: `--list`
|
- list all projects: `--list`
|
||||||
- create project from file: `--create [FILE]`
|
- show project metadata: `--info "duplicates"`
|
||||||
- apply [rules from json file](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html): `--apply [FILE.json] [PROJECTID/PROJECTNAME]`
|
- export project to terminal: `--export "duplicates"`
|
||||||
- export project to file: `--export [PROJECTID/PROJECTNAME] --output=FILE.tsv`
|
- apply [rules from json file](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html): `--apply duplicates-deletion.json "duplicates"`
|
||||||
- templating export: `--export "My Address Book" --template='{ "friend" : {{jsonize(cells["friend"].value)}}, "address" : {{jsonize(cells["address"].value)}} }' --prefix='{ "address" : [' --rowSeparator=',' --suffix='] }' --filterQuery="^mary$"`
|
- export project to file: `--export --output=deduped.xls "duplicates"`
|
||||||
- show project metadata: `--info [PROJECTID/PROJECTNAME]`
|
- delete project: `--delete "duplicates"`
|
||||||
- delete project: `--delete [PROJECTID/PROJECTNAME]`
|
|
||||||
- check `--help` for further options...
|
|
||||||
|
|
||||||
If you are familiar with python you may try all functions interactively (`python -i refine.py`) or use this library in your own python scripts. Some Examples:
|
### Getting help
|
||||||
|
|
||||||
* show version of OpenRefine server: `refine.RefineServer().get_version()`
|
Check `--help` for further options.
|
||||||
* show total rows of project 2151545447855: `refine.RefineProject(refine.RefineServer(),'2151545447855').do_json('get-rows')['total']`
|
|
||||||
* compute clusters of project 2151545447855 and column key: `refine.RefineProject(refine.RefineServer(),'2151545447855').compute_clusters('key')`
|
|
||||||
|
|
||||||
## Configuration
|
Please file an [issue](https://github.com/opencultureconsulting/openrefine-client/issues) if you are missing a feature in the command line interface or if you have found a bug.
|
||||||
|
And you are welcome to ask any questions!
|
||||||
|
|
||||||
By default the OpenRefine server URL is [http://127.0.0.1:3333](http://127.0.0.1:3333)
|
### Change URL
|
||||||
|
|
||||||
The environment variables `OPENREFINE_HOST` and `OPENREFINE_PORT` enable overriding the host & port as well as the command line options `-H` and `-P`.
|
By default the client connects to the usual URL of OpenRefine [http://localhost:3333](http://localhost:3333).
|
||||||
|
If your OpenRefine server is running somewhere else then you may set hostname and port with additional command line options (e.g. http://example.com):
|
||||||
|
|
||||||
## Installation
|
- set host: `-H example.com`
|
||||||
|
- set port: `-P 80`
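How the final URL is put together can be pictured with a small sketch. It is not the client's actual code: it assumes that `-H` and `-P` take priority over the environment variables `OPENREFINE_HOST` and `OPENREFINE_PORT`, which in turn override the default http://localhost:3333. The helper `resolve_url` is hypothetical.

```python
import os

DEFAULT_HOST = "localhost"
DEFAULT_PORT = "3333"


def resolve_url(host=None, port=None, environ=None):
    """Return the server URL from options, environment variables or defaults.

    Assumed precedence (not confirmed by the client's source):
    command line option > environment variable > default.
    """
    environ = os.environ if environ is None else environ
    host = host or environ.get("OPENREFINE_HOST") or DEFAULT_HOST
    port = str(port or environ.get("OPENREFINE_PORT") or DEFAULT_PORT)
    return "http://%s:%s" % (host, port)


# Equivalent of passing -H example.com -P 80 on the command line.
print(resolve_url("example.com", 80, environ={}))
```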
|
||||||
|
|
||||||
|
### Advanced Templating
|
||||||
|
|
||||||
|
OpenRefine's [Templating](https://github.com/OpenRefine/OpenRefine/wiki/Export-As-YAML) export supports exporting data in any text format (e.g. to construct JSON or XML).
|
||||||
|
The graphical user interface offers four input fields:
|
||||||
|
|
||||||
|
1. prefix
|
||||||
|
2. row template
|
||||||
|
- supports [GREL](https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language) inside two curly brackets, e.g. `{{jsonize(cells["name"].value)}}`
|
||||||
|
3. row separator
|
||||||
|
4. suffix
|
||||||
|
|
||||||
|
This templating functionality is available via the openrefine-client command line interface.
|
||||||
|
It even provides an additional feature for splitting results into multiple files.
|
||||||
|
|
||||||
|
To try out the functionality create another project from the example file above.
|
||||||
|
|
||||||
|
```
|
||||||
|
--create duplicates.csv --projectName=advanced
|
||||||
|
```
|
||||||
|
|
||||||
|
The following example code will export...
|
||||||
|
|
||||||
|
- the columns "name" and "purchase" in JSON format
|
||||||
|
- from the project "advanced"
|
||||||
|
- for rows matching the regex text filter `^CA$` in column "state"
|
||||||
|
|
||||||
|
macOS/Linux Terminal (multi-line input with `\` ):
|
||||||
|
|
||||||
|
```
|
||||||
|
--export "advanced" \
|
||||||
|
--prefix='{ "events" : [
|
||||||
|
' \
|
||||||
|
--template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
|
||||||
|
--rowSeparator=',
|
||||||
|
' \
|
||||||
|
--suffix='
|
||||||
|
] }' \
|
||||||
|
--filterQuery='^CA$' \
|
||||||
|
--filterColumn='state'
|
||||||
|
```
|
||||||
|
|
||||||
|
Windows PowerShell (multi-line input with `` ` ``; double quotes inside arguments need to be doubled):
|
||||||
|
|
||||||
|
```
|
||||||
|
--export "advanced" `
|
||||||
|
--prefix='{ ""events"" : [
|
||||||
|
' `
|
||||||
|
--template=' { ""name"" : {{jsonize(cells[""name""].value)}}, ""purchase"" : {{jsonize(cells[""purchase""].value)}} }' `
|
||||||
|
--rowSeparator=',
|
||||||
|
' `
|
||||||
|
--suffix='
|
||||||
|
] }' `
|
||||||
|
--filterQuery='^CA$' `
|
||||||
|
--filterColumn='state'
|
||||||
|
```
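The way the four fields combine into one output can be emulated in a few lines of Python. This is only a sketch of the concatenation (prefix, rows joined by the row separator, suffix) and not OpenRefine's implementation: `json.dumps` stands in for GREL's `jsonize()`, the rows are made up, and `render_export` and `render_row` are hypothetical helpers.

```python
import json


def render_export(rows, render_row, prefix="", row_separator="", suffix=""):
    """Concatenate prefix, the rendered rows joined by the separator, and suffix."""
    return prefix + row_separator.join(render_row(row) for row in rows) + suffix


def render_row(row):
    # Stand-in for the row template; json.dumps plays the role of jsonize().
    return '  { "name" : %s, "purchase" : %s }' % (
        json.dumps(row["name"]), json.dumps(row["purchase"]))


# Made-up rows standing in for the rows that pass the ^CA$ filter.
rows = [
    {"name": "Example One", "purchase": "TV"},
    {"name": "Example Two", "purchase": "Radio"},
]

output = render_export(
    rows,
    render_row,
    prefix='{ "events" : [\n',
    row_separator=',\n',
    suffix='\n] }',
)
print(output)
```

Because the separator is only placed between rows, the result has no trailing comma and parses as valid JSON.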
|
||||||
|
|
||||||
|
Add the following options to the last command (recall with `↑`) to store the results in multiple files.
|
||||||
|
Each file will contain the prefix, one processed row, and the suffix.
|
||||||
|
|
||||||
|
```
|
||||||
|
--output=advanced.json --splitToFiles=true
|
||||||
|
```
|
||||||
|
|
||||||
|
Filenames are suffixed with the row number by default (e.g. `advanced_1.json`, `advanced_2.json` etc.).
|
||||||
|
There is another option to use the value in the first column instead:
|
||||||
|
|
||||||
|
```
|
||||||
|
--output=advanced.json --splitToFiles=true --suffixById=true
|
||||||
|
```
|
||||||
|
|
||||||
|
Because our project "advanced" contains duplicates in the first column "email" this command will store only one file `advanced_danny.baron@example1.com.json`.
|
||||||
|
When using this option, the first column should contain unique identifiers.
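The naming scheme explains why duplicate identifiers collapse into a single file. The following sketch derives the filenames in the way described above; it is an illustration under the assumption that the suffix is inserted between the base name and the extension, and `split_filenames` is a hypothetical helper, not a function of openrefine-client. The number of rows is made up.

```python
import os


def split_filenames(output, first_column_values, suffix_by_id=False):
    """Return one target filename per row for a split export."""
    base, ext = os.path.splitext(output)
    names = []
    for row_number, value in enumerate(first_column_values, start=1):
        # Default: row number; with suffix_by_id: value of the first column.
        suffix = value if suffix_by_id else row_number
        names.append("%s_%s%s" % (base, suffix, ext))
    return names


# Three made-up rows that share the same value in the first column.
emails = ["danny.baron@example1.com"] * 3

print(split_filenames("advanced.json", emails))
print(sorted(set(split_filenames("advanced.json", emails, suffix_by_id=True))))
```

With row numbers every row gets its own name, whereas identical identifiers map to the same name, so only one file remains.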
|
||||||
|
|
||||||
|
### See also
|
||||||
|
|
||||||
|
- Linux Bash script to run OpenRefine in batch mode (import, transform, export): [openrefine-batch](https://github.com/opencultureconsulting/openrefine-batch)
|
||||||
|
- [Jupyter notebook demonstrating usage in Linux Bash](https://nbviewer.jupyter.org/github/betatim/openrefineder/blob/master/openrefine-client.ipynb)
|
||||||
|
- Use case [HOS-MetadataTransformations](https://github.com/subhh/HOS-MetadataTransformations): Automated workflow for harvesting, transforming and indexing of metadata using metha, OpenRefine and Solr. Part of the Hamburg Open Science "Schaufenster" software stack.
|
||||||
|
- Use case [Data processing of ILS data to facilitate a new discovery layer for the German Literature Archive (DLA)](https://doi.org/10.5281/zenodo.2678113): Custom data processing pipeline based on Pandas (a Python library) and OpenRefine.
|
||||||
|
|
||||||
|
## Docker
|
||||||
|
|
||||||
|
[felixlohmeier/openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine-client?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
|
||||||
|
|
||||||
|
```
|
||||||
|
docker pull felixlohmeier/openrefine-client:v0.3.7
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option 1: Dockerized client
|
||||||
|
|
||||||
|
Run client and mount current directory as workspace:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7
|
||||||
|
```
|
||||||
|
|
||||||
|
The docker option `--network=host` allows you to connect to a local or remote OpenRefine via the host network:
|
||||||
|
|
||||||
|
- list projects on default URL (http://localhost:3333)
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --list
|
||||||
|
```
|
||||||
|
|
||||||
|
- list projects on a remote server (http://example.com)
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H example.com -P 80 --list
|
||||||
|
```
|
||||||
|
|
||||||
|
Usage: same commands as explained above (see [Basic Commands](#basic-commands) and [Advanced Templating](#advanced-templating))
|
||||||
|
|
||||||
|
### Option 2: Dockerized client and dockerized OpenRefine
|
||||||
|
|
||||||
|
Run openrefine-client linked to a dockerized OpenRefine ([felixlohmeier/openrefine](https://hub.docker.com/r/felixlohmeier/openrefine/) [![Docker](https://img.shields.io/microbadger/image-size/felixlohmeier/openrefine?label=docker)](https://hub.docker.com/r/felixlohmeier/openrefine)):
|
||||||
|
|
||||||
|
1. Create docker network
|
||||||
|
|
||||||
|
```
|
||||||
|
docker network create openrefine
|
||||||
|
```
|
||||||
|
|
||||||
|
2. Run server (will be available at http://localhost:3333)
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.2
|
||||||
|
```
|
||||||
|
|
||||||
|
3. Run client with some [basic commands](#basic-commands): 1. download example files, 2. create project from file, 3. list projects, 4. show metadata, 5. export to terminal, 6. apply transformation rules (deduplication), 7. export again to terminal, 8. export to xls file and 9. delete project
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --download "https://git.io/fj5hF" --output=duplicates.csv
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 --download "https://git.io/fj5ju" --output=duplicates-deletion.json
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --create duplicates.csv
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --list
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --info "duplicates"
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export "duplicates"
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --apply duplicates-deletion.json "duplicates"
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export "duplicates"
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --export --output=deduped.xls "duplicates"
|
||||||
|
docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.7 -H openrefine-server --delete "duplicates"
|
||||||
|
```
|
||||||
|
|
||||||
|
4. Stop and delete server:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker stop openrefine-server
|
||||||
|
docker rm openrefine-server
|
||||||
|
```
|
||||||
|
|
||||||
|
5. Delete docker network:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker network rm openrefine
|
||||||
|
```
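The client calls in step 3 repeat the same docker options each time. When driving them from a script, the common part can be factored out. This is a minimal sketch that only builds the argument list; the image tag, network name and server name are taken from the steps above, while `docker_client_command` is a hypothetical helper.

```python
import os

IMAGE = "felixlohmeier/openrefine-client:v0.3.7"


def docker_client_command(client_args, network="openrefine",
                          host="openrefine-server", workdir=None):
    """Build the `docker run` argument list for one client call."""
    workdir = workdir or os.getcwd()
    command = [
        "docker", "run", "--rm",
        "--network=" + network,
        "-v", workdir + ":/data:z",  # mount the workspace as /data
        IMAGE,
    ]
    if host:
        # -H points the client to the OpenRefine container.
        command += ["-H", host]
    return command + list(client_args)


print(" ".join(docker_client_command(["--list"], workdir="/tmp/work")))
```

The list can be passed to `subprocess.check_call`; for the two `--download` calls, which do not need a server, `host=None` omits the `-H` option.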
|
||||||
|
|
||||||
|
Customize OpenRefine server:
|
||||||
|
|
||||||
|
- If you want to add an OpenRefine startup option you need to repeat the default commands (cf. [Dockerfile](https://hub.docker.com/r/felixlohmeier/openrefine/dockerfile))
|
||||||
|
- `-i 0.0.0.0` sets OpenRefine to be accessible from outside the container, i.e. from host
|
||||||
|
- `-d /data` sets OpenRefine workspace
|
||||||
|
|
||||||
|
- Example for [allocating more memory](https://github.com/OpenRefine/OpenRefine/wiki/FAQ#out-of-memory-errors---feels-slow---could-not-reserve-enough-space-for-object-heap) to OpenRefine with additional option `-m 4G`
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.2 -i 0.0.0.0 -d /data -m 4G
|
||||||
|
```
|
||||||
|
|
||||||
|
- The OpenRefine version is defined by the docker tag.
|
||||||
|
Check the [DockerHub repository](https://hub.docker.com/r/felixlohmeier/openrefine) for available tags.
|
||||||
|
Example for OpenRefine `2.8` with same options as above:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
|
||||||
|
```
|
||||||
|
|
||||||
|
- If you want OpenRefine to read and write persistent data in host directory (i.e. store projects) you can mount the container path `/data`. Example for host directory `/home/felix/refine`:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker run -d -p 3333:3333 -v /home/felix/refine:/data:z --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
|
||||||
|
```
|
||||||
|
|
||||||
|
See also:
|
||||||
|
|
||||||
|
- [GitHub Repository](https://github.com/opencultureconsulting/openrefine-docker) for docker container `felixlohmeier/openrefine`
|
||||||
|
- Linux Bash script to run OpenRefine in batch mode (import, transform, export) with docker containers: [openrefine-batch-docker.sh](https://github.com/opencultureconsulting/openrefine-batch/#docker)
|
||||||
|
|
||||||
|
## Python
|
||||||
|
|
||||||
|
[openrefine-client](https://pypi.org/project/openrefine-client/) [![PyPI](https://img.shields.io/pypi/v/openrefine-client)](https://pypi.org/project/openrefine-client/) (requires Python 2.x)
|
||||||
|
|
||||||
```
|
```
|
||||||
pip install openrefine-client
|
pip install openrefine-client
|
||||||
```
|
```
|
||||||
|
|
||||||
(requires Python 2.x, depends on urllib2_file>=0.2.1)
|
This will install the package `openrefine-client` containing modules in `google.refine`.
|
||||||
|
|
||||||
## Tests
|
A command line script `openrefine-client` will also be installed.
|
||||||
|
|
||||||
Ensure you have a Refine server running somewhere and, if necessary, set the environment vars as above.
|
### Option 1: command line script
|
||||||
|
|
||||||
Run tests, build, and install:
|
|
||||||
|
|
||||||
```
|
```
|
||||||
python setup.py test # to do a subset, e.g., --test-suite tests.test_facet
|
openrefine-client --help
|
||||||
|
|
||||||
python setup.py build
|
|
||||||
|
|
||||||
python setup.py install
|
|
||||||
```
|
```
|
||||||
|
|
||||||
There is a Makefile that will do this too, and more.
|
Usage: same commands as explained above (see [Basic Commands](#basic-commands) and [Advanced Templating](#advanced-templating))
|
||||||
|
|
||||||
|
### Option 2: using cli functions in Python 2.x environment
|
||||||
|
|
||||||
|
Import module cli:
|
||||||
|
|
||||||
|
```
|
||||||
|
from google.refine import cli
|
||||||
|
```
|
||||||
|
|
||||||
|
Change URL (if necessary):
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.refine.REFINE_HOST = 'localhost'
cli.refine.REFINE_PORT = '3333'
|
||||||
|
```
|
||||||
|
|
||||||
|
Help screen:
|
||||||
|
|
||||||
|
```
|
||||||
|
help(cli)
|
||||||
|
```
|
||||||
|
|
||||||
|
Commands:
|
||||||
|
|
||||||
|
* download (e.g. example data):
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.download('https://git.io/fj5hF','duplicates.csv')
|
||||||
|
cli.download('https://git.io/fj5ju','duplicates-deletion.json')
|
||||||
|
```
|
||||||
|
|
||||||
|
* list projects:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.ls()
|
||||||
|
```
|
||||||
|
|
||||||
|
* create project:
|
||||||
|
|
||||||
|
```
|
||||||
|
p1 = cli.create('duplicates.csv')
|
||||||
|
```
|
||||||
|
|
||||||
|
* show metadata:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.info(p1.project_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
* apply rules from file to project:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.apply(p1.project_id, 'duplicates-deletion.json')
|
||||||
|
```
|
||||||
|
|
||||||
|
* export project to terminal:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.export(p1.project_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
* export project to file in xls format:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.export(p1.project_id, 'deduped.xls')
|
||||||
|
```
|
||||||
|
|
||||||
|
* export templating (see [Advanced Templating](#advanced-templating) above):
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.templating(p1.project_id, prefix='''{ "events" : [
|
||||||
|
''', template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''', rowSeparator=''',
|
||||||
|
''', suffix='''
|
||||||
|
] }''')
|
||||||
|
```
|
||||||
|
|
||||||
|
* delete project:
|
||||||
|
|
||||||
|
```
|
||||||
|
cli.delete(p1.project_id)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Option 3: the upstream way
|
||||||
|
|
||||||
|
This fork can be used in the same way as the upstream [Python client library](https://github.com/PaulMakepeace/refine-client-py/).
|
||||||
|
|
||||||
|
Some functions in the python client library are not yet compatible with OpenRefine >=3.0 (cf. [issue #19 in refine-client-py](https://github.com/paulmakepeace/refine-client-py/issues/19)).
|
||||||
|
|
||||||
|
Import module refine:
|
||||||
|
|
||||||
|
```
|
||||||
|
from google.refine import refine, facet
|
||||||
|
```
|
||||||
|
|
||||||
|
Server Commands:
|
||||||
|
|
||||||
|
* set up connection:
|
||||||
|
|
||||||
|
```
|
||||||
|
server1 = refine.Refine('http://localhost:3333')
|
||||||
|
```
|
||||||
|
|
||||||
|
- show version:
|
||||||
|
|
||||||
|
```
|
||||||
|
server1.server.get_version()
|
||||||
|
server1.server.version
|
||||||
|
```
|
||||||
|
|
||||||
|
- list projects:
|
||||||
|
|
||||||
|
```
|
||||||
|
server1.list_projects()
|
||||||
|
```
|
||||||
|
|
||||||
|
- pretty print the returned dict with json.dumps:
|
||||||
|
|
||||||
|
```
|
||||||
|
import json
|
||||||
|
print(json.dumps(server1.list_projects(), indent=1))
|
||||||
|
```
|
||||||
|
|
||||||
|
- create project (**function was edited in this fork**):
|
||||||
|
|
||||||
|
```
|
||||||
|
server1.new_project(project_file='duplicates.csv')
|
||||||
|
```
|
||||||
|
|
||||||
|
* create and open the returned project in one step:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1 = server1.new_project(project_file='duplicates.csv')
|
||||||
|
```
|
||||||
|
|
||||||
|
Project commands:
|
||||||
|
|
||||||
|
* open project:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1 = server1.open_project('1234567890123')
|
||||||
|
```
|
||||||
|
|
||||||
|
* print full URL to project:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.project_url()
|
||||||
|
```
|
||||||
|
|
||||||
|
* list columns:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.columns
|
||||||
|
```
|
||||||
|
|
||||||
|
* compute text facet on first column (**fails with OpenRefine >=3.2**):
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.compute_facets(facet.TextFacet(project1.columns[0]))
|
||||||
|
```
|
||||||
|
|
||||||
|
* print returned object
|
||||||
|
|
||||||
|
```
|
||||||
|
facets = project1.compute_facets(facet.TextFacet(project1.columns[0])).facets[0]
|
||||||
|
for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True):
|
||||||
|
print(facets.choices[k].count, k)
|
||||||
|
```
|
||||||
|
|
||||||
|
* compute clusters on first column:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.compute_clusters(project1.columns[0])
|
||||||
|
```
|
||||||
|
|
||||||
|
* apply rules from file to project:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.apply_operations('duplicates-deletion.json')
|
||||||
|
```
|
||||||
|
|
||||||
|
* export project:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.export(export_format='tsv')
|
||||||
|
```
|
||||||
|
|
||||||
|
* print the returned fileobject:
|
||||||
|
|
||||||
|
```
|
||||||
|
print(project1.export(export_format='tsv').read())
|
||||||
|
```
|
||||||
|
|
||||||
|
* save the returned fileobject to file:
|
||||||
|
|
||||||
|
```
|
||||||
|
with open('export.tsv', 'wb') as f:
|
||||||
|
f.write(project1.export(export_format='tsv').read())
|
||||||
|
```
|
||||||
|
|
||||||
|
* templating export (**function was added in this fork**, see [Advanced Templating](#advanced-templating) above):
|
||||||
|
|
||||||
|
```
|
||||||
|
data = project1.export_templating(prefix='''{ "events" : [
|
||||||
|
''', template=''' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''', rowSeparator=''',
|
||||||
|
''', suffix='''
|
||||||
|
] }''')
|
||||||
|
print(data.read())
|
||||||
|
```
|
||||||
|
|
||||||
|
* print help screen with available commands (many more!):
|
||||||
|
|
||||||
|
```
|
||||||
|
help(project1)
|
||||||
|
```
|
||||||
|
|
||||||
|
* example for custom commands:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.do_json('get-rows')['total']
|
||||||
|
```
|
||||||
|
|
||||||
|
* delete project:
|
||||||
|
|
||||||
|
```
|
||||||
|
project1.delete()
|
||||||
|
```
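Large exports do not have to be read into memory in one piece before saving. The following sketch copies any file-like object to disk in chunks; it assumes that the object returned by `project1.export()` offers a `read()` method, as the examples above suggest. `save_export` is a hypothetical helper and an in-memory `io.BytesIO` object stands in for a real export so that the example runs without a server.

```python
import io
import os
import shutil
import tempfile


def save_export(fileobj, path, chunk_size=64 * 1024):
    """Copy a file-like export object to `path` in chunks and return the path."""
    with open(path, "wb") as target_file:
        shutil.copyfileobj(fileobj, target_file, chunk_size)
    return path


# Stand-in for project1.export(export_format='tsv'); the content is made up.
fake_export = io.BytesIO(b"email\tname\nsomeone@example.com\tSomeone\n")

target = os.path.join(tempfile.mkdtemp(), "export.tsv")
save_export(fake_export, target)
print(target)
```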
|
||||||
|
|
||||||
|
See also:
|
||||||
|
|
||||||
|
- Jupyter notebook by Trevor Muñoz (2013-08-18): [Programmatic Use of Open Refine to Facet and Cluster Names of 'Dishes' from NYPL's What's on the menu?](https://nbviewer.jupyter.org/gist/trevormunoz/6265360)
|
||||||
|
- Jupyter notebook by Tony Hirst (2019-01-09): [Notebook demonstrating how to control OpenRefine via a Python client.](https://nbviewer.jupyter.org/github/ouseful-PR/openrefineder/blob/4cef25a4ca6077536c5f49cafb531499fbcad96e/notebooks/OpenRefine%20Demos.ipynb)
|
||||||
|
- Unittests [test_refine.py](tests/test_refine.py) and [test_tutorial.py](tests/test_tutorial.py) (both importing [refinetest.py](tests/refinetest.py))
|
||||||
|
- [OpenRefine API](https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API) in official OpenRefine wiki
|
||||||
|
|
||||||
|
## Binder
|
||||||
|
|
||||||
|
[openrefineder](https://github.com/betatim/openrefineder) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/betatim/openrefineder/22fbb07?filepath=openrefine-client.ipynb)
|
||||||
|
|
||||||
|
- free to use on-demand server with Jupyter notebook, OpenRefine and Bash
|
||||||
|
- no registration needed, will start within a few minutes
|
||||||
|
- [restricted](https://mybinder.readthedocs.io/en/latest/faq.html#how-much-memory-am-i-given-when-using-binder) to 2 GB RAM and server will be deleted after 10 minutes of inactivity
|
||||||
|
- includes [demo notebook](https://nbviewer.jupyter.org/github/betatim/openrefineder/blob/master/openrefine-client.ipynb) for using openrefine-client with Linux Bash
|
||||||
|
|
||||||
|
## Development
|
||||||
|
|
||||||
|
If you would like to contribute to the Python client library please consider a pull request to the upstream repository [refine-client-py](https://github.com/PaulMakepeace/refine-client-py/).
|
||||||
|
|
||||||
|
### Tests
|
||||||
|
|
||||||
|
Ensure you have OpenRefine running (i.e. available at http://localhost:3333). If necessary set the environment variables `OPENREFINE_HOST` and `OPENREFINE_PORT` to change the URL.
|
||||||
|
|
||||||
|
The Python client library includes several unit tests.
|
||||||
|
|
||||||
|
- run all tests
|
||||||
|
|
||||||
|
```
|
||||||
|
python setup.py test
|
||||||
|
```
|
||||||
|
|
||||||
|
- run subset test_facet
|
||||||
|
|
||||||
|
```
|
||||||
|
python setup.py test --test-suite tests.test_facet
|
||||||
|
```
|
||||||
|
|
||||||
|
There is also a script that uses docker images to run the unit tests with different versions of OpenRefine.
|
||||||
|
|
||||||
|
- run tests on all OpenRefine versions (from 2.0 up to 3.2)
|
||||||
|
|
||||||
|
```
|
||||||
|
./tests.sh -a
|
||||||
|
```
|
||||||
|
|
||||||
|
- run tests on tag 3.2
|
||||||
|
|
||||||
|
```
|
||||||
|
./tests.sh -t 3.2
|
||||||
|
```
|
||||||
|
|
||||||
|
- run tests on tag 3.2 interactively (pause before and after tests)
|
||||||
|
|
||||||
|
```
|
||||||
|
./tests.sh -t 3.2 -i
|
||||||
|
```
|
||||||
|
|
||||||
|
- run tests on tags 3.2 and 2.7
|
||||||
|
|
||||||
|
```
|
||||||
|
./tests.sh -t 3.2 -t 2.7
|
||||||
|
```
|
||||||
|
|
### Distributing

Note to myself: When releasing a new version...

1. Run tests

   ```
   ./tests.sh -a
   ```

2. Make final changes in GitHub

   - update versions and download links (guess in advance) in [README.md](https://github.com/opencultureconsulting/openrefine-client/blob/master/README.md#download)
   - check if [Dockerfile](https://github.com/opencultureconsulting/openrefine-client/blob/master/docker/Dockerfile) needs to be changed

3. Build executables with PyInstaller

   - Run PyInstaller in Python 2 environments on native Windows, macOS and Linux. Should be "the oldest version of the OS you need to support"! Current release is built with:

     - Ubuntu 14.04 LTS (64-bit)
     - macOS Sierra 10.12
     - Windows 10

   - One-file-executables will be available in `dist/`.

     ```
     git clone https://github.com/opencultureconsulting/openrefine-client.git
     cd openrefine-client
     pip install pyinstaller
     pyinstaller --onefile refine.py
     ```

4. Create release in GitHub

   - draft [release notes](https://github.com/opencultureconsulting/openrefine-client/releases) and attach one-file-executables

5. Build package and upload to PyPI

   ```
   python3 setup.py sdist bdist_wheel
   python3 -m twine upload dist/*
   ```

6. Update Docker container

   - add new autobuild for release version
   - trigger latest build

7. Bump openrefine-client version in related projects

   - openrefine-batch: [openrefine-batch.sh](https://github.com/opencultureconsulting/openrefine-batch/blob/master/openrefine-batch.sh#L7) and [openrefine-batch-docker.sh](https://github.com/opencultureconsulting/openrefine-batch/blob/master/openrefine-batch-docker.sh)

   - openrefineder: [postBuild](https://github.com/betatim/openrefineder/blob/master/postBuild)

||||||
## Credits
|
## Credits
|
||||||
|
|
||||||
|
@ -79,7 +706,7 @@ David Huynh, [initial cut](http://markmail.org/message/jsxzlcu3gn6drtb7)
|
||||||
|
|
||||||
[Felix Lohmeier](https://felixlohmeier.de), extended the CLI features
|
[Felix Lohmeier](https://felixlohmeier.de), extended the CLI features
|
||||||
|
|
||||||
Some data used in the test suite has been used from publicly available sources,
|
Some data used in the test suite has been taken from publicly available sources:
|
||||||
|
|
||||||
- louisiana-elected-officials.csv: from http://www.sos.louisiana.gov/tabid/136/Default.aspx
|
- louisiana-elected-officials.csv: from http://www.sos.louisiana.gov/tabid/136/Default.aspx
|
||||||
|
|
||||||
|
|
|
@ -1,89 +0,0 @@
|
## batch processing with python-client

There are some client libraries for OpenRefine that communicate with the [OpenRefine API](https://github.com/OpenRefine/OpenRefine/wiki/OpenRefine-API). I have prepared a docker container on top of the [Python Library from PaulMakepeace](https://github.com/PaulMakepeace/refine-client-py/) and extended the CLI with some options to create new OpenRefine projects from files.

If you are looking for a ready to use command line interface to OpenRefine for batch processing then you might be interested in the following bash shell script: [felixlohmeier/openrefine-batch](https://github.com/felixlohmeier/openrefine-batch)

### basic usage

**1) start server:**
> docker run -d --name=openrefine-server felixlohmeier/openrefine

**2) run client with one of the following commands:**

list projects:
> docker run --rm --link openrefine-server felixlohmeier/openrefine-client --list

create project from file:
> docker run --rm --link openrefine-server felixlohmeier/openrefine-client --create [FILE]

apply [rules from json file](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html):
> docker run --rm --link openrefine-server felixlohmeier/openrefine-client --apply [FILE.json] [PROJECTID]

export project to file:
> docker run --rm --link openrefine-server felixlohmeier/openrefine-client --export [PROJECTID] --output=FILE.tsv

check help screen for more options:
> docker run --rm --link openrefine-server felixlohmeier/openrefine-client --help

**3) cleanup:**
> docker stop openrefine-server && docker rm openrefine-server

### example for customized run commands in interactive mode (e.g. for usage in terminals)

**1) start server in terminal A:**

```docker run --rm --name=openrefine-server -p 80:3333 -v /home/felix/refine:/data:z felixlohmeier/openrefine -i 0.0.0.0 -m 4G -d /data```

* automatically remove docker container when it exits
* set name "openrefine-server" for docker container
* publish internal port 3333 to host port 80
* mount host directory /home/felix/refine as working directory
* make openrefine available in the network
* increase java heap size to 4 GB
* set refine workspace to /data
* OpenRefine should be available at http://localhost

**2) start client in terminal B (prints help screen):**

```docker run --rm --link openrefine-server -v /home/felix/refine:/data:z felixlohmeier/openrefine-client```

* automatically remove docker container when it exits
* build up network connection with docker container "openrefine-server"
* mount host directory /home/felix/refine as working directory
* apply history in file /home/felix/refine/history.json to project with id 1234567890123

### example for customized run commands in detached mode (e.g. for usage in shell scripts)

**1) define variables (bring your own example data)**
> workingdir=/home/felix/refine
> inputfile=example.csv
> jsonfile=test.json

**2) start server**

```docker run -d --name=openrefine-server -v ${workingdir}:/data:z felixlohmeier/openrefine -i 0.0.0.0 -m 4G -d /data```

**3) wait until server is ready**

```until docker run --rm --link openrefine-server --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://openrefine-server:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done```

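The polling loop in step 3 can also be sketched in Python 3. This is a minimal illustration, not part of the client: the function names `fetch` and `wait_until_ready` and the `max_tries` parameter are assumptions introduced here. Like the shell loop, it keeps requesting the start page once per second until the response contains the string "OpenRefine".

```python
import http.client
import time
import urllib.request


def fetch(url):
    """Return the page body as text, or '' while the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode('utf-8', errors='replace')
    except (OSError, http.client.HTTPException):
        # URLError and refused connections are OSError subclasses
        return ''


def wait_until_ready(url, fetch=fetch, delay=1.0, max_tries=None):
    """Poll url until the body contains 'OpenRefine'; return the try count.

    max_tries is an addition for illustration: the shell loop has no limit.
    """
    tries = 0
    while True:
        tries += 1
        if 'OpenRefine' in fetch(url):
            return tries
        if max_tries is not None and tries >= max_tries:
            raise RuntimeError('server not ready after %d tries' % tries)
        time.sleep(delay)
```

The `fetch` argument can be swapped for any callable that returns text, which makes the loop testable without a running server.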
**4) create project (import file)**

```docker run --rm --link openrefine-server -v ${workingdir}:/data:z felixlohmeier/openrefine-client --create $inputfile```

**5) get project id**

```project=($(docker run --rm --link openrefine-server -v ${workingdir}:/data felixlohmeier/openrefine-client --list | cut -c 2-14))```

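Step 5 relies on `cut -c 2-14` to pick characters 2 to 14 of every line printed by `--list`. A minimal Python 3 sketch of the same extraction follows. The function name `project_ids` is hypothetical, and the sample text assumes that each line consists of one leading character, a 13-digit id, and then `: name`, which is what the `cut` range implies but is not confirmed here.

```python
def project_ids(list_output):
    """Mimic `cut -c 2-14`: take characters 2 to 14 of every non-empty line."""
    return [line[1:14] for line in list_output.splitlines() if line.strip()]


if __name__ == '__main__':
    # Assumed sample of `--list` output, for illustration only
    sample = ' 1234567890123: example\n 2161595260364: christmas gifts\n'
    print(project_ids(sample))
```

In the shell version `${project}` expands to the first array element, which corresponds to `project_ids(...)[0]` in this sketch.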
**6) apply transformations from json file**

```docker run --rm --link openrefine-server -v ${workingdir}:/data felixlohmeier/openrefine-client --apply ${jsonfile} ${project}```

**7) export project to file**

```docker run --rm --link openrefine-server -v ${workingdir}:/data felixlohmeier/openrefine-client --export --output=${project}.tsv ${project}```

**8) cleanup**

```docker stop -t=500 openrefine-server && docker rm openrefine-server```

|
@ -1,6 +1,6 @@
|
||||||
#! /usr/bin/env python
|
#! /usr/bin/env python
|
||||||
"""
|
"""
|
||||||
Script to provide a command line interface to a OpenRefine server.
|
Script to provide a command line interface to a Refine server.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
# Copyright (c) 2011 Paul Makepeace, Real Programmers. All rights reserved.
|
# Copyright (c) 2011 Paul Makepeace, Real Programmers. All rights reserved.
|
||||||
|
@ -21,164 +21,195 @@ Script to provide a command line interface to a OpenRefine server.
|
||||||
|
|
||||||
import optparse
|
import optparse
|
||||||
import os
|
import os
|
||||||
import sys
|
|
||||||
|
|
||||||
from google.refine import refine
|
from google.refine import refine
|
||||||
from google.refine import cli
|
from google.refine import cli
|
||||||
|
|
||||||
reload(sys)
|
|
||||||
sys.setdefaultencoding('utf-8')
|
|
||||||
|
|
||||||
class myParser(optparse.OptionParser):
|
class myParser(optparse.OptionParser):
|
||||||
|
|
||||||
def format_epilog(self, formatter):
|
def format_epilog(self, formatter):
|
||||||
return self.epilog
|
return self.epilog
|
||||||
|
|
||||||
|
|
||||||
PARSER = \
|
PARSER = \
|
||||||
myParser(description='Script to provide a command line interface to an OpenRefine server.',
|
myParser(description=('Script to provide a command line interface to an '
|
||||||
|
'OpenRefine server.'),
|
||||||
usage='usage: %prog [--help | OPTIONS]',
|
usage='usage: %prog [--help | OPTIONS]',
|
||||||
epilog="""
|
epilog="""
|
||||||
Examples:
|
Example data:
|
||||||
--list # show list of projects (id: name)
|
--download "https://git.io/fj5hF" --output=duplicates.csv
|
||||||
|
--download "https://git.io/fj5ju" --output=duplicates-deletion.json
|
||||||
|
|
||||||
|
Basic commands:
|
||||||
|
--list # list all projects
|
||||||
--list -H 127.0.0.1 -P 80 # specify hostname and port
|
--list -H 127.0.0.1 -P 80 # specify hostname and port
|
||||||
--info 2161595260364 # show metadata of project
|
--create duplicates.csv # create new project from file
|
||||||
--info "christmas gifts"
|
--info "duplicates" # show project metadata
|
||||||
--create example.csv # create new project from file example.csv
|
--apply duplicates-deletion.json "duplicates" # apply rules in file to project
|
||||||
|
--export "duplicates" # export project to terminal in tsv format
|
||||||
|
--export --output=deduped.xls "duplicates" # export project to file in xls format
|
||||||
|
--delete "duplicates" # delete project
|
||||||
|
|
||||||
|
Some more examples:
|
||||||
|
--info 1234567890123 # specify project by id
|
||||||
--create example.tsv --encoding=UTF-8
|
--create example.tsv --encoding=UTF-8
|
||||||
--create example.xml --recordPath=collection --recordPath=record
|
--create example.xml --recordPath=collection --recordPath=record
|
||||||
--create example.json --recordPath=_ --recordPath=_
|
--create example.json --recordPath=_ --recordPath=_
|
||||||
--create example.xlsx --sheets=0
|
--create example.xlsx --sheets=0
|
||||||
--create example.ods --sheets=0
|
--create example.ods --sheets=0
|
||||||
--apply trim.json 2161595260364 # apply rules in trim.json to project 1234...
|
|
||||||
--apply trim.json "christmas gifts"
|
Example for Templating Export:
|
||||||
--export 2161595260364 > project.tsv # export project 2161595260364 in tsv format
|
Cf. https://github.com/opencultureconsulting/openrefine-client#advanced-templating
|
||||||
--export "christmas gifts" > project.tsv
|
|
||||||
--export --output=project.xlsx 2161595260364 # export project in xlsx format
|
|
||||||
--export --output=project.xlsx "christmas gifts"
|
|
||||||
--export "My Address Book" --template='{ "friend" : {{jsonize(cells["friend"].value)}}, "address" : {{jsonize(cells["address"].value)}} }' --prefix='{ "rows" : [' --rowSeparator=',' --suffix='] }' --filterQuery="^mary$"
|
|
||||||
--delete 2161595260364 # delete project
|
|
||||||
--delete "christmas gifts"
|
|
||||||
""")
|
""")
|
||||||
|
|
||||||
group1 = optparse.OptionGroup(PARSER, 'Connection options')
|
group1 = optparse.OptionGroup(PARSER, 'Connection options')
|
||||||
group1.add_option('-H', '--host', dest='host', metavar='127.0.0.1',
|
group1.add_option('-H', '--host', dest='host',
|
||||||
|
metavar='127.0.0.1',
|
||||||
help='OpenRefine hostname (default: 127.0.0.1)')
|
help='OpenRefine hostname (default: 127.0.0.1)')
|
||||||
group1.add_option('-P', '--port', dest='port', metavar='3333',
|
group1.add_option('-P', '--port', dest='port',
|
||||||
|
metavar='3333',
|
||||||
help='OpenRefine port (default: 3333)')
|
help='OpenRefine port (default: 3333)')
|
||||||
PARSER.add_option_group(group1)
|
PARSER.add_option_group(group1)
|
||||||
|
|
||||||
group2 = optparse.OptionGroup(PARSER, 'Commands')
|
group2 = optparse.OptionGroup(PARSER, 'Commands')
|
||||||
group2.add_option('-c', '--create', dest='create', metavar='[FILE]',
|
group2.add_option('-c', '--create', dest='create',
|
||||||
|
metavar='[FILE]',
|
||||||
help='Create project from file. The filename ending (e.g. .csv) defines the input format (csv,tsv,xml,json,txt,xls,xlsx,ods)')
|
help='Create project from file. The filename ending (e.g. .csv) defines the input format (csv,tsv,xml,json,txt,xls,xlsx,ods)')
|
||||||
group2.add_option('-l', '--list', dest='list', action='store_true',
|
group2.add_option('-l', '--list', dest='list',
|
||||||
|
action='store_true',
|
||||||
help='List projects')
|
help='List projects')
|
||||||
|
group2.add_option('--download', dest='download',
|
||||||
|
metavar='[URL]',
|
||||||
|
help='Download file from URL (e.g. example data). Combine with --output to specify a filename.')
|
||||||
PARSER.add_option_group(group2)
|
PARSER.add_option_group(group2)
|
||||||
|
|
||||||
group3 = optparse.OptionGroup(PARSER, 'Commands with argument [PROJECTID/PROJECTNAME]')
|
group3 = optparse.OptionGroup(
|
||||||
group3.add_option('-d', '--delete', dest='delete', action='store_true',
|
PARSER, 'Commands with argument [PROJECTID/PROJECTNAME]')
|
||||||
|
group3.add_option('-d', '--delete', dest='delete',
|
||||||
|
action='store_true',
|
||||||
help='Delete project')
|
help='Delete project')
|
||||||
group3.add_option('-f', '--apply', dest='apply', metavar='[FILE]',
|
group3.add_option('-f', '--apply', dest='apply',
|
||||||
|
metavar='[FILE]',
|
||||||
help='Apply JSON rules to OpenRefine project')
|
help='Apply JSON rules to OpenRefine project')
|
||||||
group3.add_option('-E', '--export', dest='export', action='store_true',
|
group3.add_option('-E', '--export', dest='export',
|
||||||
|
action='store_true',
|
||||||
help='Export project in tsv format to stdout.')
|
help='Export project in tsv format to stdout.')
|
||||||
group3.add_option('-o', '--output', dest='output', metavar='[FILE]',
|
group3.add_option('-o', '--output', dest='output',
|
||||||
|
metavar='[FILE]',
|
||||||
help='Export project to file. The filename ending (e.g. .tsv) defines the output format (csv,tsv,xls,xlsx,html)')
|
help='Export project to file. The filename ending (e.g. .tsv) defines the output format (csv,tsv,xls,xlsx,html)')
|
||||||
group3.add_option('--info', dest='info', action='store_true',
|
group3.add_option('--template', dest='template',
|
||||||
|
metavar='[STRING]',
|
||||||
|
help='Export project with templating. Provide (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app')
|
||||||
|
group3.add_option('--info', dest='info',
|
||||||
|
action='store_true',
|
||||||
help='show project metadata')
|
help='show project metadata')
|
||||||
PARSER.add_option_group(group3)
|
PARSER.add_option_group(group3)
|
||||||
|
|
||||||
group4 = optparse.OptionGroup(PARSER, 'Create options')
|
group4 = optparse.OptionGroup(PARSER, 'General options')
|
||||||
group4.add_option('--columnWidths', dest='columnWidths',
|
group4.add_option('--format', dest='file_format',
|
||||||
help='(txt/fixed-width) please provide widths separated by comma (e.g. 7,5)')
|
help='Override file detection (import: csv,tsv,xml,json,line-based,fixed-width,xls,xlsx,ods; export: csv,tsv,html,xls,xlsx,ods)')
|
||||||
group4.add_option('--encoding', dest='encoding',
|
|
||||||
help='(csv,tsv,txt), please provide short encoding name (e.g. UTF-8)')
|
|
||||||
group4.add_option('--guessCellValueTypes', dest='guessCellValueTypes', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(xml,csv,tsv,txt,json, default: false)')
|
|
||||||
group4.add_option('--headerLines', dest='headerLines', type="int",
|
|
||||||
help='(csv,tsv,txt/fixed-width,xls,xlsx,ods), default: 1, default txt/fixed-width: 0')
|
|
||||||
group4.add_option('--ignoreLines', dest='ignoreLines', type="int",
|
|
||||||
help='(csv,tsv,txt,xls,xlsx,ods), default: -1')
|
|
||||||
group4.add_option('--includeFileSources', dest='includeFileSources', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(all formats), default: false')
|
|
||||||
group4.add_option('--limit', dest='limit', type="int",
|
|
||||||
help='(all formats), default: -1')
|
|
||||||
group4.add_option('--linesPerRow', dest='linesPerRow', type="int",
|
|
||||||
help='(txt/line-based), default: 1')
|
|
||||||
group4.add_option('--processQuotes', dest='processQuotes', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(csv,tsv), default: true')
|
|
||||||
group4.add_option('--projectName', dest='project_name',
|
|
||||||
help='(all formats), default: filename')
|
|
||||||
group4.add_option('--recordPath', dest='recordPath', action='append',
|
|
||||||
help='(xml,json), please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record, default xml: record, default json: _ _')
|
|
||||||
group4.add_option('--separator', dest='separator',
|
|
||||||
help='(csv,tsv), default csv: , default tsv: \\t')
|
|
||||||
group4.add_option('--sheets', dest='sheets', action='append', type="int",
|
|
||||||
help='(xls,xlsx,ods), please provide sheets in multiple arguments, e.g. --sheets=0 --sheets=1, default: 0 (first sheet)')
|
|
||||||
group4.add_option('--skipDataLines', dest='skipDataLines', type="int",
|
|
||||||
help='(csv,tsv,txt,xls,xlsx,ods), default: 0, default line-based: -1')
|
|
||||||
group4.add_option('--storeBlankRows', dest='storeBlankRows', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(csv,tsv,txt,xls,xlsx,ods), default: true')
|
|
||||||
group4.add_option('--storeBlankCellsAsNulls', dest='storeBlankCellsAsNulls', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(csv,tsv,txt,xls,xlsx,ods), default: true')
|
|
||||||
group4.add_option('--storeEmptyStrings', dest='storeEmptyStrings', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(xml,json), default: true')
|
|
||||||
group4.add_option('--trimStrings', dest='trimStrings', metavar='true/false', choices=('true', 'false'),
|
|
||||||
help='(xml,json), default: false')
|
|
||||||
PARSER.add_option_group(group4)
|
PARSER.add_option_group(group4)
|
||||||
|
|
||||||
group5 = optparse.OptionGroup(PARSER, 'Legacy options')
|
group5 = optparse.OptionGroup(PARSER, 'Create options')
|
||||||
group5.add_option('--format', dest='input_format',
|
group5.add_option('--columnWidths', dest='columnWidths',
|
||||||
help='Specify input format (csv,tsv,xml,json,line-based,fixed-width,xls,xlsx,ods)')
|
action='append',
|
||||||
|
type='int',
|
||||||
|
help='(txt/fixed-width), please provide widths in multiple arguments, e.g. --columnWidths=7 --columnWidths=5')
|
||||||
|
group5.add_option('--encoding', dest='encoding',
|
||||||
|
help='(csv,tsv,txt), please provide short encoding name (e.g. UTF-8)')
|
||||||
|
group5.add_option('--guessCellValueTypes', dest='guessCellValueTypes',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(xml,csv,tsv,txt,json, default: false)')
|
||||||
|
group5.add_option('--headerLines', dest='headerLines',
|
||||||
|
type="int",
|
||||||
|
help='(csv,tsv,txt/fixed-width,xls,xlsx,ods), default: 1, default txt/fixed-width: 0')
|
||||||
|
group5.add_option('--ignoreLines', dest='ignoreLines',
|
||||||
|
type="int",
|
||||||
|
help='(csv,tsv,txt,xls,xlsx,ods), default: -1')
|
||||||
|
group5.add_option('--includeFileSources', dest='includeFileSources',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(all formats), default: false')
|
||||||
|
group5.add_option('--limit', dest='limit',
|
||||||
|
type="int",
|
||||||
|
help='(all formats), default: -1')
|
||||||
|
group5.add_option('--linesPerRow', dest='linesPerRow',
|
||||||
|
type="int",
|
||||||
|
help='(txt/line-based), default: 1')
|
||||||
|
group5.add_option('--processQuotes', dest='processQuotes',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(csv,tsv), default: true')
|
||||||
|
group5.add_option('--projectName', dest='project_name',
|
||||||
|
help='(all formats), default: filename')
|
||||||
|
group5.add_option('--projectTags', dest='projectTags',
|
||||||
|
action='append',
|
||||||
|
help='(all formats), please provide tags in multiple arguments, e.g. --projectTags=beta --projectTags=client1')
|
||||||
|
group5.add_option('--recordPath', dest='recordPath',
|
||||||
|
action='append',
|
||||||
|
help='(xml,json), please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: --recordPath=collection --recordPath=record, default xml: record, default json: _ _')
|
||||||
|
group5.add_option('--separator', dest='separator',
|
||||||
|
help='(csv,tsv), default csv: , default tsv: \\t')
|
||||||
|
group5.add_option('--sheets', dest='sheets',
|
||||||
|
action='append',
|
||||||
|
type="int",
|
||||||
|
help='(xls,xlsx,ods), please provide sheets in multiple arguments, e.g. --sheets=0 --sheets=1, default: 0 (first sheet)')
|
||||||
|
group5.add_option('--skipDataLines', dest='skipDataLines',
|
||||||
|
type="int",
|
||||||
|
help='(csv,tsv,txt,xls,xlsx,ods), default: 0, default line-based: -1')
|
||||||
|
group5.add_option('--storeBlankCellsAsNulls', dest='storeBlankCellsAsNulls',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(csv,tsv,txt,xls,xlsx,ods), default: true')
|
||||||
|
group5.add_option('--storeBlankRows', dest='storeBlankRows',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(csv,tsv,txt,xls,xlsx,ods), default: true')
|
||||||
|
group5.add_option('--storeEmptyStrings', dest='storeEmptyStrings',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(xml,json), default: true')
|
||||||
|
group5.add_option('--trimStrings', dest='trimStrings',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='(xml,json), default: false')
|
||||||
PARSER.add_option_group(group5)
|
PARSER.add_option_group(group5)
|
||||||
|
|
||||||
group6= optparse.OptionGroup(PARSER, 'Templating export options')
|
group6 = optparse.OptionGroup(PARSER, 'Templating options')
|
||||||
group6.add_option('--template', dest='template',
|
group6.add_option('--mode', dest='mode',
|
||||||
help='mandatory; (big) text string that you enter in the *row template* textfield in the export/templating menu in the browser app)')
|
metavar='row-based/record-based',
|
||||||
group6.add_option('--mode', dest='mode', metavar='row-based/record-based', choices=('row-based', 'record-based'),
|
choices=('row-based', 'record-based'),
|
||||||
help='engine mode (default: row-based)')
|
help='engine mode (default: row-based)')
|
||||||
group6.add_option('--prefix', dest='prefix',
|
group6.add_option('--prefix', dest='prefix',
|
||||||
help='text string that you enter in the *prefix* textfield in the browser app')
|
help='text string that you enter in the *prefix* textfield in the browser app')
|
||||||
group6.add_option('--rowSeparator', dest='rowSeparator',
|
group6.add_option('--rowSeparator', dest='rowSeparator',
|
||||||
help='text string that you enter in the *row separator* textfield in the browser app')
|
help='text string that you enter in the *row separator* textfield in the browser app')
|
||||||
group6.add_option('--suffix', dest='suffix',
|
group6.add_option('--suffix', dest='suffix',
|
||||||
help='text string that you enter in the *suffix* textfield in the browser app')
|
help='text string that you enter in the *suffix* textfield in the browser app')
|
||||||
group6.add_option('--filterQuery', dest='filterQuery', metavar='REGEX',
|
group6.add_option('--filterQuery', dest='filterQuery',
|
||||||
help='Simple RegEx text filter on filterColumn, e.g. ^12015$'),
|
metavar='REGEX',
|
||||||
group6.add_option('--filterColumn', dest='filterColumn', metavar='COLUMNNAME',
|
help='Simple RegEx text filter on filterColumn, e.g. ^12015$')
|
||||||
help='column name for filterQuery (default: name of first column)')
|
group6.add_option('--filterColumn', dest='filterColumn',
|
||||||
|
metavar='COLUMNNAME',
|
||||||
|
help='column name for filterQuery (default: name of first column)')
|
||||||
group6.add_option('--facets', dest='facets',
|
group6.add_option('--facets', dest='facets',
|
||||||
help='facets config in json format (may be extracted with browser dev tools in browser app)')
|
help='facets config in json format (may be extracted with browser dev tools in browser app)')
|
||||||
group6.add_option('--splitToFiles', dest='splitToFiles', metavar='true/false', choices=('true', 'false'),
|
group6.add_option('--splitToFiles', dest='splitToFiles',
|
||||||
help='will split each row/record into a single file; it specifies a presumably unique character series for splitting; --prefix and --suffix will be applied to all files; filename-prefix can be specified with --output (default: %Y%m%d)')
|
metavar='true/false', choices=('true', 'false'),
|
||||||
group6.add_option('--suffixById', dest='suffixById', metavar='true/false', choices=('true', 'false'),
|
help='will split each row/record into a single file; it specifies a presumably unique character series for splitting; --prefix and --suffix will be applied to all files; filename-prefix can be specified with --output (default: %Y%m%d)')
|
||||||
help='enhancement option for --splitToFiles; will generate filename-suffix from values in key column')
|
group6.add_option('--suffixById', dest='suffixById',
|
||||||
|
metavar='true/false', choices=('true', 'false'),
|
||||||
|
help='enhancement option for --splitToFiles; will generate filename-suffix from values in key column')
|
||||||
PARSER.add_option_group(group6)
|
PARSER.add_option_group(group6)
|
||||||
|
|
||||||
#noinspection PyPep8Naming
|
|
||||||
def main():
|
def main():
|
||||||
"""Command line interface."""
|
"""Command line interface."""
|
||||||
|
|
||||||
# get environment variables in docker network
|
|
||||||
docker_host = os.environ.get('OPENREFINE_SERVER_PORT_3333_TCP_ADDR')
|
|
||||||
if docker_host:
|
|
||||||
os.environ["OPENREFINE_HOST"] = docker_host
|
|
||||||
refine.REFINE_HOST = docker_host
|
|
||||||
docker_port = os.environ.get('OPENREFINE_SERVER_PORT_3333_TCP_PORT')
|
|
||||||
if docker_port:
|
|
||||||
os.environ["OPENREFINE_PORT"] = docker_port
|
|
||||||
refine.REFINE_PORT = docker_port
|
|
||||||
|
|
||||||
options, args = PARSER.parse_args()
|
options, args = PARSER.parse_args()
|
||||||
commands_dict = { group2_arg.dest : getattr(options, group2_arg.dest) for group2_arg in group2.option_list }
|
|
||||||
commands_dict.update({ group3_arg.dest : getattr(options, group3_arg.dest) for group3_arg in group3.option_list })
|
# set environment
|
||||||
commands_dict = { k: v for k, v in commands_dict.items() if v != None }
|
|
||||||
if not commands_dict:
|
|
||||||
PARSER.print_usage()
|
|
||||||
return
|
|
||||||
if options.host:
|
if options.host:
|
||||||
refine.REFINE_HOST = options.host
|
refine.REFINE_HOST = options.host
|
||||||
if options.port:
|
if options.port:
|
||||||
refine.REFINE_PORT = options.port
|
refine.REFINE_PORT = options.port
|
||||||
|
|
||||||
|
# get project_id
|
||||||
if args and not str.isdigit(args[0]):
|
if args and not str.isdigit(args[0]):
|
||||||
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
||||||
idlist = []
|
idlist = []
|
||||||
|
@ -186,32 +217,63 @@ def main():
|
||||||
if args[0] == project_info['name']:
|
if args[0] == project_info['name']:
|
||||||
idlist.append(str(project_id))
|
idlist.append(str(project_id))
|
||||||
if len(idlist) > 1:
|
if len(idlist) > 1:
|
||||||
raise Exception('Found at least two projects. Please specify project by id.')
|
print('Error: Found %s projects with name %s.\n'
|
||||||
|
'Please specify project by id.' % (len(idlist), args[0]))
|
||||||
|
for i in idlist:
|
||||||
|
print('')
|
||||||
|
cli.info(i)
|
||||||
|
return
|
||||||
else:
|
else:
|
||||||
args[0] = idlist[0]
|
try:
|
||||||
|
project_id = idlist[0]
|
||||||
|
except IndexError:
|
||||||
|
print('Error: No project found with name %s.\n'
|
||||||
|
'Try command --list' % args[0])
|
||||||
|
return
|
||||||
|
elif args:
|
||||||
|
project_id = args[0]
|
||||||
|
|
||||||
|
# commands without args
|
||||||
if options.list:
|
if options.list:
|
||||||
cli.list_projects()
|
cli.ls()
|
||||||
if options.create:
|
elif options.download:
|
||||||
cli.create_project(options)
|
cli.download(options.download, output_file=options.output)
|
||||||
if options.delete:
|
elif options.create:
|
||||||
project = refine.RefineProject(args[0])
|
group5_dict = {group5_arg.dest: getattr(options, group5_arg.dest)
|
||||||
project.delete()
|
for group5_arg in group5.option_list}
|
||||||
if options.apply:
|
kwargs = {k: v for k, v in group5_dict.items()
|
||||||
project = refine.RefineProject(args[0])
|
if v is not None and v not in ['true', 'false']}
|
||||||
response = project.apply_operations(options.apply)
|
kwargs.update({k: True for k, v in group5_dict.items()
|
||||||
if response != 'ok':
|
if v == 'true'})
|
||||||
print >> sys.stderr, 'Failed to apply %s: %s' \
|
kwargs.update({k: False for k, v in group5_dict.items()
|
||||||
% (options.apply, response)
|
if v == 'false'})
|
||||||
return project
|
if options.file_format:
|
||||||
if options.export or options.output:
|
kwargs.update({'project_format': options.file_format})
|
||||||
project = refine.RefineProject(args[0])
|
cli.create(options.create, **kwargs)
|
||||||
cli.export_project(project, options)
|
# commands with args
|
||||||
return project
|
elif args and options.info:
|
||||||
if options.info:
|
cli.info(project_id)
|
||||||
cli.info(args[0])
|
elif args and options.delete:
|
||||||
project = refine.RefineProject(args[0])
|
cli.delete(project_id)
|
||||||
return project
|
elif args and options.apply:
|
||||||
|
cli.apply(project_id, options.apply)
|
||||||
|
elif args and options.template:
|
||||||
|
group6_dict = {group6_arg.dest: getattr(options, group6_arg.dest)
|
||||||
|
for group6_arg in group6.option_list}
|
||||||
|
kwargs = {k: v for k, v in group6_dict.items()
|
||||||
|
if v is not None and v not in ['true', 'false']}
|
||||||
|
kwargs.update({k: True for k, v in group6_dict.items()
|
||||||
|
if v == 'true'})
|
||||||
|
kwargs.update({k: False for k, v in group6_dict.items()
|
||||||
|
if v == 'false'})
|
||||||
|
cli.templating(project_id, options.template,
|
||||||
|
output_file=options.output, **kwargs)
|
||||||
|
elif args and (options.export or options.output):
|
||||||
|
cli.export(project_id, output_file=options.output,
|
||||||
|
export_format=options.file_format)
|
||||||
|
else:
|
||||||
|
PARSER.print_usage()
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
# execute only if run as a script
|
# execute only if run as a script
|
||||||
|
|
|
@ -19,136 +19,295 @@ Functions used by the command line interface (CLI)
|
||||||
# along with this program. If not, see <http://www.gnu.org/licenses/>
|
# along with this program. If not, see <http://www.gnu.org/licenses/>
|
||||||
|
|
||||||
|
|
||||||
|
import json
|
||||||
import os
|
import os
|
||||||
import sys
|
import sys
|
||||||
import time
|
import time
|
||||||
import json
|
import urllib
|
||||||
|
|
||||||
from google.refine import refine
|
from google.refine import refine
|
||||||
|
|
||||||
def list_projects():
|
|
||||||
"""Query the OpenRefine server and list projects by ID: name."""
|
def apply(project_id, history_file):
|
||||||
|
"""Apply OpenRefine history from json file to project."""
|
||||||
|
project = refine.RefineProject(project_id)
|
||||||
|
response = project.apply_operations(history_file)
|
||||||
|
if response != 'ok':
|
||||||
|
raise Exception('Failed to apply %s to %s: %s' %
|
||||||
|
(history_file, project_id, response))
|
||||||
|
else:
|
||||||
|
print('File %s has been successfully applied to project %s' %
|
||||||
|
(history_file, project_id))
|
||||||
|
|
||||||
|
|
||||||
|
def create(project_file,
|
||||||
|
project_format=None,
|
||||||
|
project_name=None,
|
||||||
|
columnWidths=None,
|
||||||
|
encoding=None,
|
||||||
|
guessCellValueTypes=False,
|
||||||
|
headerLines=None,
|
||||||
|
ignoreLines=None,
|
||||||
|
includeFileSources=False,
|
||||||
|
limit=None,
|
||||||
|
linesPerRow=None,
|
||||||
|
processQuotes=True,
|
||||||
|
projectName=None,
|
||||||
|
recordPath=None,
|
||||||
|
separator=None,
|
||||||
|
sheets=None,
|
||||||
|
skipDataLines=None,
|
||||||
|
storeBlankCellsAsNulls=True,
|
||||||
|
storeBlankRows=True,
|
||||||
|
storeEmptyStrings=True,
|
||||||
|
trimStrings=False
|
||||||
|
):
|
||||||
|
"""Create a new project from file."""
|
||||||
|
# guess format from file extension
|
||||||
|
if not project_format:
|
||||||
|
project_format = os.path.splitext(project_file)[1][1:].lower()
|
||||||
|
if project_format == 'txt':
|
||||||
|
if columnWidths:
|
||||||
|
project_format = 'fixed-width'
|
||||||
|
else:
|
||||||
|
project_format = 'line-based'
|
||||||
|
# defaults for each file type
|
||||||
|
if project_format == 'xml':
|
||||||
|
project_format = 'text/xml'
|
||||||
|
if not recordPath:
|
||||||
|
recordPath = 'record'
|
||||||
|
elif project_format == 'csv':
|
||||||
|
project_format = 'text/line-based/*sv'
|
||||||
|
elif project_format == 'tsv':
|
||||||
|
project_format = 'text/line-based/*sv'
|
||||||
|
if not separator:
|
||||||
|
separator = '\t'
|
||||||
|
elif project_format == 'line-based':
|
||||||
|
project_format = 'text/line-based'
|
||||||
|
if not skipDataLines:
|
||||||
|
skipDataLines = -1
|
||||||
|
elif project_format == 'fixed-width':
|
||||||
|
project_format = 'text/line-based/fixed-width'
|
||||||
|
if not headerLines:
|
||||||
|
headerLines = 0
|
||||||
|
elif project_format == 'json':
|
||||||
|
project_format = 'text/json'
|
||||||
|
if not recordPath:
|
||||||
|
recordPath = ('_', '_')
|
||||||
|
elif project_format == 'xls':
|
||||||
|
project_format = 'binary/text/xml/xls/xlsx'
|
||||||
|
if not sheets:
|
||||||
|
sheets = 0
|
||||||
|
elif project_format == 'xlsx':
|
||||||
|
project_format = 'binary/text/xml/xls/xlsx'
|
||||||
|
if not sheets:
|
||||||
|
sheets = 0
|
||||||
|
elif project_format == 'ods':
|
||||||
|
project_format = 'text/xml/ods'
|
||||||
|
if not sheets:
|
||||||
|
sheets = 0
|
||||||
|
# execute
|
||||||
|
kwargs = {k: v for k, v in vars().items() if v is not None}
|
||||||
|
project = refine.Refine(refine.RefineServer()).new_project(**kwargs)
|
||||||
|
rows = project.do_json('get-rows')['total']
|
||||||
|
if rows > 0:
|
||||||
|
print('{0}: {1}'.format('id', project.project_id))
|
||||||
|
print('{0}: {1}'.format('rows', rows))
|
||||||
|
return project
|
||||||
|
else:
|
||||||
|
raise Exception(
|
||||||
|
'Project contains 0 rows. Please check --help for mandatory '
|
||||||
|
'arguments for xml, json, xlsx and ods')
|
||||||
|
|
||||||
|
|
||||||
|
def delete(project_id):
|
||||||
|
"""Delete project."""
|
||||||
|
project = refine.RefineProject(project_id)
|
||||||
|
response = project.delete()
|
||||||
|
if not response:
|
||||||
|
raise Exception('Failed to delete %s: %s' %
|
||||||
|
(project_id, response))
|
||||||
|
else:
|
||||||
|
print('Project %s has been successfully deleted' % project_id)
|
||||||
|
|
||||||
|
|
||||||
|
def download(url, output_file=None):
|
||||||
|
"""Integrated download function for your convenience."""
|
||||||
|
if not output_file:
|
||||||
|
output_file = os.path.basename(url)
|
||||||
|
if os.path.exists(output_file):
|
||||||
|
print('Error: File %s already exists.\n'
|
||||||
|
'Delete existing file or try command --output '
|
||||||
|
'to specify a different filename.' % output_file)
|
||||||
|
return
|
||||||
|
urllib.urlretrieve(url, output_file)
|
||||||
|
print('Download to file %s complete' % output_file)
|
||||||
|
|
||||||
|
|
||||||
|
def export(project_id, output_file=None, export_format=None):
|
||||||
|
"""Dump a project to stdout or file."""
|
||||||
|
project = refine.RefineProject(project_id)
|
||||||
|
if not export_format:
|
||||||
|
export_format = 'tsv'
|
||||||
|
if not output_file:
|
||||||
|
sys.stdout.write(project.export(export_format=export_format).read())
|
||||||
|
else:
|
||||||
|
ext = os.path.splitext(output_file)[1][1:]
|
||||||
|
if ext:
|
||||||
|
export_format = ext.lower()
|
||||||
|
with open(output_file, 'wb') as f:
|
||||||
|
f.write(project.export(export_format).read())
|
||||||
|
print('Export to file %s complete' % output_file)
|
||||||
|
|
||||||
|
|
||||||
|
def info(project_id):
|
||||||
|
"""Show project metadata"""
|
||||||
|
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
||||||
|
if project_id in [item[0] for item in projects]:
|
||||||
|
project = refine.RefineProject(project_id)
|
||||||
|
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
||||||
|
for projects_id, projects_info in projects:
|
||||||
|
if project_id == projects_id:
|
||||||
|
print('{0:>20}: {1}'.format('id', project_id))
|
||||||
|
print('{0:>20}: {1}'.format('url', 'http://' +
|
||||||
|
refine.REFINE_HOST + ':' +
|
||||||
|
refine.REFINE_PORT +
|
||||||
|
'/project?project=' + project_id))
|
||||||
|
for k, v in projects_info.items():
|
||||||
|
if v:
|
||||||
|
print('{0:>20}: {1}'.format(k, v))
|
||||||
|
project_model = project.get_models()
|
||||||
|
column_model = project_model['columnModel']
|
||||||
|
columns = [column['name'] for column in column_model['columns']]
|
||||||
|
for (i, v) in enumerate(columns, start=1):
|
||||||
|
print('{0:>20}: {1}'.format('column ' + str(i).zfill(3), v))
|
||||||
|
else:
|
||||||
|
print('Error: No project found with id %s.\n'
|
||||||
|
'Check existing projects with command --list' % (project_id))
|
||||||
|
|
||||||
|
|
||||||
|
def ls():
|
||||||
|
"""Query the server and list projects sorted by mtime."""
|
||||||
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
||||||
|
|
||||||
def date_to_epoch(json_dt):
|
def date_to_epoch(json_dt):
|
||||||
"""Convert a JSON date time into seconds-since-epoch."""
|
"""Convert a JSON date time into seconds-since-epoch."""
|
||||||
return time.mktime(time.strptime(json_dt, '%Y-%m-%dT%H:%M:%SZ'))
|
return time.mktime(time.strptime(json_dt, '%Y-%m-%dT%H:%M:%SZ'))
|
||||||
projects.sort(key=lambda v: date_to_epoch(v[1]['modified']), reverse=True)
|
projects.sort(key=lambda v: date_to_epoch(v[1]['modified']), reverse=True)
|
||||||
|
if projects:
|
||||||
for project_id, project_info in projects:
|
for project_id, project_info in projects:
|
||||||
print('{0:>14}: {1}'.format(project_id, project_info['name']))
|
print('{0:>14}: {1}'.format(project_id, project_info['name']))
|
||||||
|
|
||||||
def info(project_id):
|
|
||||||
"""Show project metadata"""
|
|
||||||
projects = refine.Refine(refine.RefineServer()).list_projects().items()
|
|
||||||
for projects_id, projects_info in projects:
|
|
||||||
if project_id == projects_id:
|
|
||||||
print('{0}: {1}'.format('id', projects_id))
|
|
||||||
print('{0}: {1}'.format('name', projects_info['name']))
|
|
||||||
print('{0}: {1}'.format('created', projects_info['created']))
|
|
||||||
print('{0}: {1}'.format('modified', projects_info['modified']))
|
|
||||||
|
|
||||||
def create_project(options):
|
|
||||||
"""Create a new project from options.create file."""
|
|
||||||
# general defaults are defined in google/refine/refine.py new_project
|
|
||||||
# additional defaults for each file type
|
|
||||||
defaults = {}
|
|
||||||
defaults['xml'] = { 'project_format' : 'text/xml', 'recordPath' : 'record' }
|
|
||||||
defaults['csv'] = { 'project_format' : 'text/line-based/*sv', 'separator' : ',' }
|
|
||||||
defaults['tsv'] = { 'project_format' : 'text/line-based/*sv', 'separator' : '\t' }
|
|
||||||
defaults['line-based'] = { 'project_format' : 'text/line-based', 'skipDataLines' : -1 }
|
|
||||||
defaults['fixed-width'] = { 'project_format' : 'text/line-based/fixed-width', 'headerLines' : 0 }
|
|
||||||
defaults['json'] = { 'project_format' : 'text/json', 'recordPath' : ('_', '_') }
|
|
||||||
defaults['xls'] = { 'project_format' : 'binary/text/xml/xls/xlsx', 'sheets' : 0 }
|
|
||||||
defaults['xlsx'] = { 'project_format' : 'binary/text/xml/xls/xlsx', 'sheets' : 0 }
|
|
||||||
defaults['ods'] = { 'project_format' : 'text/xml/ods', 'sheets' : 0 }
|
|
||||||
# guess format from file extension (or legacy option --format)
|
|
||||||
input_format = os.path.splitext(options.create)[1][1:].lower()
|
|
||||||
if input_format == 'txt' and options.columnWidths:
|
|
||||||
input_format = 'fixed-width'
|
|
||||||
if input_format == 'txt' and not options.columnWidths:
|
|
||||||
input_format = 'line-based'
|
|
||||||
if options.input_format:
|
|
||||||
input_format = options.input_format
|
|
||||||
# defaults for selected format
|
|
||||||
input_dict = defaults[input_format]
|
|
||||||
# user input
|
|
||||||
input_user = { group4_arg.dest : getattr(options, group4_arg.dest) for group4_arg in group4.option_list }
|
|
||||||
input_user['strings'] = { k: v for k, v in input_user.items() if v != None and v not in ['true', 'false'] }
|
|
||||||
input_user['trues'] = { k: True for k, v in input_user.items() if v == 'true' }
|
|
||||||
input_user['falses'] = { k: False for k, v in input_user.items() if v == 'false' }
|
|
||||||
input_user_eval = input_user['strings']
|
|
||||||
input_user_eval.update(input_user['trues'])
|
|
||||||
input_user_eval.update(input_user['falses'])
|
|
||||||
# merge defaults with user input
|
|
||||||
input_dict.update(input_user_eval)
|
|
||||||
input_dict['project_file'] = options.create
|
|
||||||
refine.Refine(refine.RefineServer()).new_project(**input_dict)
|
|
||||||
|
|
||||||
def export_project(project, options):
|
|
||||||
"""Dump a project to stdout or options.output file."""
|
|
||||||
export_format = 'tsv'
|
|
||||||
if options.output and not options.splitToFiles == 'true':
|
|
||||||
ext = os.path.splitext(options.output)[1][1:]
|
|
||||||
if ext:
|
|
||||||
export_format = ext.lower()
|
|
||||||
output = open(options.output, 'wb')
|
|
||||||
else:
|
else:
|
||||||
output = sys.stdout
|
print('Error: No projects found')
|
||||||
if options.template:
|
|
||||||
templateconfig = { group6_arg.dest : getattr(options, group6_arg.dest) for group6_arg in group6.option_list if group6_arg.dest in ['prefix', 'template', 'rowSeparator', 'suffix'] }
|
|
||||||
if options.mode == 'record-based':
|
def templating(project_id,
|
||||||
engine = { 'facets':[], 'mode':'record-based' }
|
template,
|
||||||
|
output_file=None,
|
||||||
|
mode=None,
|
||||||
|
prefix='',
|
||||||
|
rowSeparator='\n',
|
||||||
|
suffix='',
|
||||||
|
filterQuery=None,
|
||||||
|
filterColumn=None,
|
||||||
|
facets=None,
|
||||||
|
splitToFiles=False,
|
||||||
|
suffixById=None
|
||||||
|
):
|
||||||
|
"""Dump a project to stdout or file with templating."""
|
||||||
|
project = refine.RefineProject(project_id)
|
||||||
|
|
||||||
|
# basic config
|
||||||
|
templateconfig = {'prefix': prefix,
|
||||||
|
'suffix': suffix,
|
||||||
|
'template': template,
|
||||||
|
'rowSeparator': rowSeparator}
|
||||||
|
|
||||||
|
# construct the engine config
|
||||||
|
if mode == 'record-based':
|
||||||
|
engine = {'facets': [], 'mode': 'record-based'}
|
||||||
else:
|
else:
|
||||||
engine = { 'facets':[], 'mode':'row-based' }
|
engine = {'facets': [], 'mode': 'row-based'}
|
||||||
if options.facets:
|
if facets:
|
||||||
engine['facets'].append(json.loads(options.facets))
|
engine['facets'].append(json.loads(facets))
|
||||||
if options.filterQuery:
|
if filterQuery:
|
||||||
if not options.filterColumn:
|
if not filterColumn:
|
||||||
filterColumn = project.get_models()['columnModel']['keyColumnName']
|
filterColumn = project.get_models()['columnModel']['keyColumnName']
|
||||||
else:
|
textFilter = {'type': 'text',
|
||||||
filterColumn = options.filterColumn
|
'name': filterColumn,
|
||||||
textFilter = { 'type':'text', 'name':filterColumn, 'columnName':filterColumn, 'mode':'regex', 'caseSensitive':False, 'query':options.filterQuery }
|
'columnName': filterColumn,
|
||||||
|
'mode': 'regex',
|
||||||
|
'caseSensitive': False,
|
||||||
|
'query': filterQuery}
|
||||||
engine['facets'].append(textFilter)
|
engine['facets'].append(textFilter)
|
||||||
templateconfig.update({ 'engine': json.dumps(engine) })
|
templateconfig.update({'engine': json.dumps(engine)})
|
||||||
if options.splitToFiles == 'true':
|
|
||||||
|
# normal output or some refinable magic for splitToFiles functionality
|
||||||
|
if not splitToFiles:
|
||||||
|
if not output_file:
|
||||||
|
sys.stdout.write(project.export_templating(
|
||||||
|
**templateconfig).read())
|
||||||
|
else:
|
||||||
|
with open(output_file, 'wb') as f:
|
||||||
|
f.write(project.export_templating(**templateconfig).read())
|
||||||
|
print('Export to file %s complete' % output_file)
|
||||||
|
else:
|
||||||
# common config for row-based and record-based
|
# common config for row-based and record-based
|
||||||
prefix = templateconfig['prefix']
|
prefix = templateconfig['prefix']
|
||||||
suffix = templateconfig['suffix']
|
suffix = templateconfig['suffix']
|
||||||
split = '===|||THISISTHEBEGINNINGOFANEWRECORD|||==='
|
split = '===|||THISISTHEBEGINNINGOFANEWRECORD|||==='
|
||||||
keyColumn = project.get_models()['columnModel']['keyColumnName']
|
keyColumn = project.get_models()['columnModel']['keyColumnName']
|
||||||
if not options.output:
|
if not output_file:
|
||||||
filename = time.strftime('%Y%m%d')
|
output_file = time.strftime('%Y%m%d')
|
||||||
else:
|
else:
|
||||||
filename = os.path.splitext(options.output)[0]
|
base = os.path.splitext(output_file)[0]
|
||||||
ext = os.path.splitext(options.output)[1][1:]
|
ext = os.path.splitext(output_file)[1][1:]
|
||||||
if not ext:
|
if not ext:
|
||||||
ext = 'txt'
|
ext = 'txt'
|
||||||
if options.suffixById:
|
if suffixById:
|
||||||
ids_template = '{{forNonBlank(cells["' + keyColumn + '"].value, v, v, "")}}'
|
ids_template = ('{{forNonBlank(cells["' +
|
||||||
ids_templateconfig = { 'engine': json.dumps(engine), 'template': ids_template, 'rowSeparator':'\n' }
|
keyColumn +
|
||||||
ids = [line.rstrip('\n') for line in project.export_templating(**ids_templateconfig) if line.rstrip('\n')]
|
'"].value, v, v, "")}}')
|
||||||
if options.mode == 'record-based':
|
ids_templateconfig = {'engine': json.dumps(engine),
|
||||||
# record-based: split-character into template if key column is not blank (=record)
|
'template': ids_template,
|
||||||
template = '{{forNonBlank(cells["' + keyColumn + '"].value, v, "' + split + '", "")}}' + templateconfig['template']
|
'rowSeparator': '\n'}
|
||||||
templateconfig.update({ 'prefix': '', 'suffix': '', 'template': template, 'rowSeparator':'' })
|
ids = [line.rstrip('\n') for line in project.export_templating(
|
||||||
|
**ids_templateconfig) if line.rstrip('\n')]
|
||||||
|
if mode == 'record-based':
|
||||||
|
# record-based: split-character into template
|
||||||
|
# if key column is not blank (=record)
|
||||||
|
template = ('{{forNonBlank(cells["' +
|
||||||
|
keyColumn +
|
||||||
|
'"].value, v, "' +
|
||||||
|
split +
|
||||||
|
'", "")}}' +
|
||||||
|
templateconfig['template'])
|
||||||
|
templateconfig.update({'prefix': '',
|
||||||
|
'suffix': '',
|
||||||
|
'template': template,
|
||||||
|
'rowSeparator': ''})
|
||||||
else:
|
else:
|
||||||
# row-based: split-character into template
|
# row-based: split-character into template
|
||||||
template = split + templateconfig['template']
|
template = split + templateconfig['template']
|
||||||
templateconfig.update({ 'prefix': '', 'suffix': '', 'template': template, 'rowSeparator':'' })
|
templateconfig.update({'prefix': '',
|
||||||
records = project.export_templating(**templateconfig).read().split(split)
|
'suffix': '',
|
||||||
|
'template': template,
|
||||||
|
'rowSeparator': ''})
|
||||||
|
records = project.export_templating(
|
||||||
|
**templateconfig).read().split(split)
|
||||||
del records[0] # skip first blank entry
|
del records[0] # skip first blank entry
|
||||||
if options.suffixById:
|
if suffixById:
|
||||||
for index, record in enumerate(records):
|
for index, record in enumerate(records):
|
||||||
output = open(filename + '_' + ids[index] + '.' + ext, 'wb')
|
output_file = base + '_' + ids[index] + '.' + ext
|
||||||
output.writelines([prefix, record, suffix])
|
with open(output_file, 'wb') as f:
|
||||||
|
f.writelines([prefix, record, suffix])
|
||||||
|
print('Export to files complete. Last file: %s' % output_file)
|
||||||
else:
|
else:
|
||||||
zeros = len(str(len(records)))
|
zeros = len(str(len(records)))
|
||||||
for index, record in enumerate(records):
|
for index, record in enumerate(records):
|
||||||
output = open(filename + '_' + str(index+1).zfill(zeros) + '.' + ext, 'wb')
|
output_file = base + '_' + \
|
||||||
output.writelines([prefix, record, suffix])
|
str(index + 1).zfill(zeros) + '.' + ext
|
||||||
else:
|
with open(output_file, 'wb') as f:
|
||||||
output.writelines(project.export_templating(**templateconfig))
|
f.writelines([prefix, record, suffix])
|
||||||
output.close()
|
print('Export to files complete. Last file: %s' % output_file)
|
||||||
else:
|
|
||||||
output.writelines(project.export(export_format=export_format))
|
|
||||||
output.close()
|
|
||||||
|
|
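Review note: `create()` above guesses the project format from the file extension and, for `.txt` files, chooses between fixed-width and line-based import. The following is a minimal sketch of that decision only, not the committed implementation. The helper name `guess_format` is hypothetical, and the sketch assumes that a non-empty `columnWidths` value is what should select fixed-width.

```python
import os


def guess_format(project_file, columnWidths=None):
    """Hypothetical helper: derive a short format name from the file extension."""
    # same extraction as in create(): extension without the dot, lowercased
    ext = os.path.splitext(project_file)[1][1:].lower()
    if ext == 'txt':
        # a keyword parameter is always bound, so test its value
        # instead of relying on a NameError being raised
        return 'fixed-width' if columnWidths else 'line-based'
    return ext


print(guess_format('data.txt'))
print(guess_format('data.txt', columnWidths='4,8,12'))
print(guess_format('export/records.CSV'))
```

The short names returned here would still need to be mapped to the importer identifiers that `create()` assigns, such as `text/line-based/fixed-width`.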
|
@ -19,8 +19,7 @@ Script to provide a command line interface to a Refine server.
|
||||||
# along with this program. If not, see <http://www.gnu.org/licenses/>
|
# along with this program. If not, see <http://www.gnu.org/licenses/>
|
||||||
|
|
||||||
|
|
||||||
from google.refine import __main__
|
from google.refine import __main__, cli, refine
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
# return project so that it's available interactively, python -i refine.py
|
__main__.main()
|
||||||
refine_project = __main__.main()
|
|
||||||
|
|
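Review note: the dispatch code in `__main__` repeats the same three-step conversion for each option group: drop unset options, keep plain values, and turn the strings `'true'` and `'false'` into booleans. A small sketch of that pattern as a single function follows. The name `normalize_options` and the example dictionary are assumptions for illustration and do not appear in the commit.

```python
def normalize_options(raw):
    """Hypothetical helper: build kwargs from a dict of raw option values."""
    # keep every option that was set and is not a boolean-like string
    kwargs = {k: v for k, v in raw.items()
              if v is not None and v not in ['true', 'false']}
    # map the strings 'true' and 'false' to real booleans
    kwargs.update({k: True for k, v in raw.items() if v == 'true'})
    kwargs.update({k: False for k, v in raw.items() if v == 'false'})
    return kwargs


# example values as they might arrive from the option parser (illustrative only)
example = {'prefix': '[', 'splitToFiles': 'true',
           'suffixById': 'false', 'mode': None}
print(normalize_options(example))
```

Factoring the conversion out like this would keep the `--create` and `--template` branches from drifting apart, at the cost of one more function to maintain.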