This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, Mac). It is also available via Docker Hub, PyPI and Binder.

===================================
OpenRefine Python Client Library
===================================

The OpenRefine Python Client Library provides an interface for
communicating with an `OpenRefine <http://openrefine.org/>`_ server.

If you are looking for a ready-to-use command line interface to OpenRefine, you might be interested in the Docker variant of this library:
`felixlohmeier/openrefine-client <https://hub.docker.com/r/felixlohmeier/openrefine-client/>`_. It includes examples for batch processing (e.g. for use in shell scripts).

If you are familiar with Python and want to go into more depth, read on!

Features
=============

Command line interface:

- list projects: ``refine.py --list``
- create project from file: ``refine.py --create [FILE]``
- apply `rules from json file <http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html>`_: ``refine.py --apply [FILE.json] [PROJECTID]``
- export project to file: ``refine.py --export [PROJECTID] --output=FILE.tsv``
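
For batch use from Python rather than a shell script, the commands above can be assembled as argument lists and passed to ``subprocess``. The following is only a sketch: the helper names ``refine_command`` and ``run_all`` are hypothetical, ``refine.py`` is assumed to be in the current directory, the interpreter name ``python`` is an assumption, and the project ID is assumed to have been read from the ``--list`` output beforehand.

```python
import subprocess


def refine_command(*args, **kwargs):
    """Build an argument list that runs refine.py with the given options.

    The interpreter name is an assumption; pass interpreter="python2"
    (or similar) if your setup needs it.
    """
    interpreter = kwargs.get("interpreter", "python")
    return [interpreter, "refine.py"] + list(args)


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.check_call(command)


# Example command lists, using only the options listed above.
# "input.csv", "rules.json", "output.tsv" and the project ID are placeholders.
list_projects = refine_command("--list")
create_project = refine_command("--create", "input.csv")
apply_rules = refine_command("--apply", "rules.json", "1234567890123")
export_project = refine_command(
    "--export", "1234567890123", "--output=output.tsv"
)
```

Calling ``run_all([apply_rules, export_project])`` would then execute the two steps one after the other against a running server.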

Currently, the following API is supported:

- project creation/import, deletion, export
- facet computation

  - text
  - text filter
  - numeric
  - blank
  - starred & flagged
  - ... extensible class

- 'engine': managing multiple facets and their computation results
- sorting & reordering
- clustering
- transforms
- transposes
- single and mass edits
- annotation (star/flag)
- column

  - move
  - add
  - split
  - rename
  - reorder
  - remove

- reconciliation

  - reconciliation judgment facet
  - guessing column type
  - querying reconciliation services preferences
  - perform reconciliation

Configuration
=============

By default, the OpenRefine server URL is http://127.0.0.1:3333.
The environment variables ``OPENREFINE_HOST`` and ``OPENREFINE_PORT``
override the host and port.
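
The override behaviour can be pictured roughly as follows. This is a minimal sketch of the lookup described above, not the library's actual code; the function name ``server_url`` is hypothetical.

```python
import os

# Defaults as stated in this README.
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = "3333"


def server_url(environ=None):
    """Return the OpenRefine base URL, honouring the override variables.

    environ defaults to os.environ; a plain dict can be passed instead,
    which is handy for testing.
    """
    env = os.environ if environ is None else environ
    host = env.get("OPENREFINE_HOST", DEFAULT_HOST)
    port = env.get("OPENREFINE_PORT", DEFAULT_PORT)
    return "http://{0}:{1}".format(host, port)
```

With neither variable set, ``server_url()`` yields the default URL; setting only ``OPENREFINE_PORT`` changes the port and leaves the host at ``127.0.0.1``.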

Running the full test suite requires a live OpenRefine server. Existing
projects are not affected.

Installation
============

(Someone with more familiarity with Python's byzantine collection of installation
frameworks is very welcome to improve/"best practice" all this.)

#. Install the dependencies (currently only ``urllib2_file``):

   ``sudo pip install -r requirements.txt``

   (If you don't have ``pip`` visit `pip-installer.org <http://www.pip-installer.org/en/latest/installing.html#install-or-upgrade-pip>`_)

#. Ensure you have an OpenRefine server running somewhere and, if necessary, set
   the environment variables described above.

#. Run tests, build, and install:

   ``python setup.py test # to do a subset, e.g., --test-suite tests.test_facet``

   ``python setup.py build``

   ``python setup.py install``

There is a Makefile that will do this too, and more.
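
Before running the tests it can help to confirm that something is listening on the configured host and port. The sketch below is an illustration only and not part of this library; the function name ``server_is_reachable`` is hypothetical. It checks only that a TCP connection can be opened, not that the service answering is actually OpenRefine.

```python
import os
import socket


def server_is_reachable(host=None, port=None, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Host and port fall back to OPENREFINE_HOST / OPENREFINE_PORT and then
    to the defaults 127.0.0.1 and 3333.
    """
    host = host or os.environ.get("OPENREFINE_HOST", "127.0.0.1")
    port = int(port or os.environ.get("OPENREFINE_PORT", "3333"))
    try:
        connection = socket.create_connection((host, port), timeout=timeout)
    except (OSError, socket.error):
        # Connection refused, timed out, or host name not resolvable.
        return False
    connection.close()
    return True


if __name__ == "__main__":
    if server_is_reachable():
        print("Server port is open; the tests can be started.")
    else:
        print("No server reachable; start OpenRefine or check the variables.")
```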

TODO
====

The API so far has been filled out by building a test suite that carries out the
actions in `David Huynh's Refine tutorial <http://davidhuynh.net/spaces/nicar2011/tutorial.pdf>`_. The tutorial shows off a
wide range of Refine features but does not cover the entire suite. Notable gaps
currently include:

- reconciliation support is useful but not complete
- undo/redo
- Freebase
- join columns
- columns from URL

Contribute
============

Pull requests with passing tests welcome! Source is at https://github.com/PaulMakepeace/refine-client-py

Useful Tools
------------

One aspect of development is watching HTTP transactions. To that end, I found
`Fiddler <http://www.fiddler2.com/>`_ on Windows and `HTTPScoop
<http://www.tuffcode.com/>`_ invaluable. The latter neither URL-decodes nor nicely
formats JSON, but the `Online JavaScript Beautifier <http://jsbeautifier.org/>`_
does.

History
=======

OpenRefine used to be called Google Refine, and this library used to be called
the Google Refine Python Client Library.

Credits
=======

Paul Makepeace, author, <paulm@paulm.com>

David Huynh, `initial cut <http://markmail.org/message/jsxzlcu3gn6drtb7>`_

`Artfinder <http://www.artfinder.com/>`_, inspiration

Some data used in the test suite comes from publicly available sources:

- louisiana-elected-officials.csv: from
  http://www.sos.louisiana.gov/tabid/136/Default.aspx

- us_economic_assistance.csv: `"The Green Book" <http://www.data.gov/raw/1554>`_

- eli-lilly.csv: `ProPublica's "Docs for Dollars" <http://projects.propublica.org/docdollars/>`_ leading to a `Lilly Faculty PDF <http://www.lillyfacultyregistry.com/documents/EliLillyFacultyRegistryQ22010.pdf>`_ processed by `David Huynh's ScraperWiki script <http://scraperwiki.com/scrapers/eli-lilly-dollars-for-docs-scraper/edit/>`_