release v0.1

Felix Lohmeier 2017-02-27 00:47:34 +01:00
parent 365885ef7c
commit cfb09fdd84
6 changed files with 76807 additions and 2 deletions

README.md

@ -1,2 +1,177 @@
## OpenRefine batch processing (openrefine-batch.sh)
Shell script to run OpenRefine on Windows, Linux or Mac in batch mode (import, transform, export). This bash script automatically...
1. imports all data from a given directory into OpenRefine
2. transforms the data by applying OpenRefine transformation rules from all json files in another given directory and
3. finally exports the data in TSV (tab-separated values) format.
It orchestrates a [docker container for OpenRefine](https://hub.docker.com/r/felixlohmeier/openrefine/) (server) and a [docker container for a python client](https://hub.docker.com/r/felixlohmeier/openrefine-client/) that communicates with the OpenRefine API. Restarting the server after each step (import, transform, export) keeps memory requirements low.
### Typical Workflow
- Step 1: Experiment with your data (or parts of it) in the graphical user interface of OpenRefine. Once you are satisfied with the transformation rules, [extract the json code](http://kb.refinepro.com/2012/06/google-refine-json-and-my-notepad-or.html) and save it as a file (e.g. transform.json).
- Step 2: Put your data and the json file(s) in two different directories and execute the script. The script will automatically import all data files into OpenRefine projects, apply the transformation rules in the json files to each project and export all projects to TSV files.
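
As a rough illustration of Step 2, the following sketch prepares the directories in a temporary location. The directory names, the sample data and the single-rule transform.json (a `core/blank-down` operation on a hypothetical `id` column) are assumptions made for this example; the final command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: lay out separate directories for data and transformation rules.
# All paths and file contents here are illustrative assumptions.
workdir=$(mktemp -d)
mkdir -p "$workdir/input" "$workdir/config" "$workdir/output"

# A tiny tab-separated sample file (hypothetical data).
printf 'id\tname\n1\tfoo\n1\tbar\n' > "$workdir/input/sample.tsv"

# Transformation rules as extracted from the OpenRefine GUI
# (a JSON array of operations).
cat > "$workdir/config/transform.json" <<'EOF'
[
  {
    "op": "core/blank-down",
    "description": "Blank down cells in column id",
    "engineConfig": { "mode": "row-based", "facets": [] },
    "columnName": "id"
  }
]
EOF

# Print (rather than run) the call that would process this layout.
echo "./openrefine-batch.sh $workdir/input/ $workdir/config/ $workdir/output/"
```

Keeping data and rules in separate directories means the same rules can be reused on a different input directory without copying files around.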
### Install
Linux:
1. Install [Docker](https://docs.docker.com/engine/installation/#on-linux)
2. Open a terminal and enter `wget https://.../openrefine-batch.sh && chmod +x ./openrefine-batch.sh`
Mac:
1. Install Docker
2. ...
Windows:
1. Install Docker
2. Install Cygwin with Bash
3. ...
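
Before the first run it can help to confirm that Docker is actually on the PATH. The helper below is a minimal sketch; `require_tools` is a hypothetical name and not part of openrefine-batch.sh.

```shell
#!/bin/sh
# Sketch: check that required tools are available before running the script.
# require_tools is a hypothetical helper, not part of openrefine-batch.sh.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_tools docker; then
  echo "docker found"
else
  echo "please install Docker first"
fi
```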
### Usage
```
./openrefine-batch.sh input/ config/ output/
```
#### Example
```
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
```
#### Options
```
./openrefine-batch.sh $inputdir $configdir $outputdir $ram $inputformat $inputoptions
```
1. inputdir: path to directory with source files (multiple files may be imported into a single project by providing a zip or tar.gz archive)
2. configdir: path to directory with OpenRefine transformation rules (json files)
3. outputdir: path to directory for exported files (and temporary workspace)
4. ram: maximum RAM for OpenRefine java heap space (default: 4G)
5. inputformat: csv, tsv, xml, json, line-based, fixed-width, xlsx or ods
6. inputoptions: several options provided by [openrefine-client](https://hub.docker.com/r/felixlohmeier/openrefine-client/)
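
The positional interface above could be read along the following lines. This is a sketch of the calling convention only, not the script's actual code: the function name `parse_args` and the variable names are assumptions, and only the `4G` default is taken from the list above.

```shell
#!/bin/sh
# Sketch: read the six positional parameters described above.
# parse_args is a hypothetical helper, not taken from openrefine-batch.sh.
parse_args() {
  if [ "$#" -lt 3 ]; then
    echo "usage: openrefine-batch.sh inputdir configdir outputdir [ram] [inputformat] [inputoptions...]" >&2
    return 1
  fi
  inputdir=$1
  configdir=$2
  outputdir=$3
  ram=${4:-4G}          # documented default: 4G
  inputformat=${5:-}    # empty means no format was given
  # Everything after the fifth argument is treated as input options.
  inputoptions=""
  if [ "$#" -gt 5 ]; then
    shift 5
    inputoptions=$*
  fi
}

parse_args examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ \
  examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
echo "ram=$ram format=$inputformat options=$inputoptions"
```

Because the parameters are positional, `ram` and `inputformat` must be given whenever input options are used; this is the limitation that switching to `getopts` would remove.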
inputoptions (mandatory for xml, json, fixed-width, xlsx, ods):
* `--recordPath=RECORDPATH` (xml, json): please provide path in multiple arguments without slashes, e.g. /collection/record/ should be entered like this: `--recordPath=collection --recordPath=record`
* `--columnWidths=COLUMNWIDTHS` (fixed-width): please provide widths separated by comma (e.g. 7,5)
* `--sheets=SHEETS` (xlsx, ods): please provide sheets separated by comma (e.g. 0,1), default: 0 (first sheet)
more inputoptions (optional, only together with inputformat):
* `--limit=LIMIT` (all formats), default: -1
* `--includeFileSources=INCLUDEFILESOURCES` (all formats), default: false
* `--trimStrings=TRIMSTRINGS` (xml, json), default: false
* `--storeEmptyStrings=STOREEMPTYSTRINGS` (xml, json), default: true
* `--guessCellValueTypes=GUESSCELLVALUETYPES` (xml, csv, tsv, fixed-width, json), default: false
* `--encoding=ENCODING` (csv, tsv, line-based, fixed-width), please provide short encoding name (e.g. UTF-8)
* `--ignoreLines=IGNORELINES` (csv, tsv, line-based, fixed-width, xlsx, ods), default: -1
* `--headerLines=HEADERLINES` (csv, tsv, fixed-width, xlsx, ods), default: 1
* `--skipDataLines=SKIPDATALINES` (csv, tsv, line-based, fixed-width, xlsx, ods), default: 0
* `--storeBlankRows=STOREBLANKROWS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
* `--processQuotes=PROCESSQUOTES` (csv, tsv), default: true
* `--storeBlankCellsAsNulls=STOREBLANKCELLSASNULLS` (csv, tsv, line-based, fixed-width, xlsx, ods), default: true
* `--linesPerRow=LINESPERROW` (line-based), default: 1
### Logging
The script uses `docker attach` to print log messages from the OpenRefine server and `ps` to show statistics for each step. Here is a sample log:
```
[00:08 felix ~/openrefine/openrefine-batch]$ ./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
Input dir: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/input
Input files: phm-collection.tsv
Input format: --format=tsv
Input options: --processQuotes=false --guessCellValueTypes=true
Transformation rules: phm-transform.json
OpenRefine heap space: 4G
OpenRefine version: 2.7rc1
Docker container: 41ca6232-8484-40e0-a606-3bcbf29903f6
Output directory: /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output
begin: Mo 27. Feb 00:08:02 CET 2017
start OpenRefine server...
[sudo] password for felix:
fab9894d902372767cdb38d05b6e247dce722da22192d734862fc2f096a23d51
import phm-collection.tsv...
New project: 1719405033732
Number of rows: 75814
STARTED ELAPSED %MEM %CPU RSS
00:08:13 00:29 10.0 122 813604
save project and restart OpenRefine server...
23:08:46.130 [ ProjectManager] Saving all modified projects ... (4679ms)
23:08:55.190 [ project_utilities] Saved project '1719405033732' (9060ms)
41ca6232-8484-40e0-a606-3bcbf29903f6
41ca6232-8484-40e0-a606-3bcbf29903f6
6bb7ee1f1f2a1d09e191a3fadad9e26aaa89414b2c618a47d3d3ef7c040c6b1a
begin project 1719405033732 @ Mo 27. Feb 00:09:12 CET 2017
transform phm-transform.json...
23:09:13.747 [ refine] GET /command/core/get-models (2489ms)
23:09:16.887 [ project] Loaded project 1719405033732 from disk in 3 sec(s) (3140ms)
23:09:17.140 [ refine] POST /command/core/apply-operations (253ms)
STARTED ELAPSED %MEM %CPU RSS
00:08:57 01:10 20.1 124 1625788
save project and restart OpenRefine server...
23:10:07.930 [ ProjectManager] Saving all modified projects ... (50790ms)
23:10:15.173 [ project_utilities] Saved project '1719405033732' (7243ms)
41ca6232-8484-40e0-a606-3bcbf29903f6
41ca6232-8484-40e0-a606-3bcbf29903f6
cc9c49dcaf54c720d915a55b4e646909f657fb6582c0ac3c9f069996b9cd0b53
export to file 1719405033732.tsv...
23:10:29.972 [ refine] GET /command/core/get-models (4381ms)
23:10:33.826 [ project] Loaded project 1719405033732 from disk in 3 sec(s) (3854ms)
23:10:34.123 [ refine] GET /command/core/get-all-project-metadata (297ms)
23:10:34.140 [ refine] POST /command/core/export-rows/phm-collection.tsv.tsv (17ms)
STARTED ELAPSED %MEM %CPU RSS
00:10:17 02:01 12.8 27.2 1041596
save project and restart OpenRefine server...
41ca6232-8484-40e0-a606-3bcbf29903f6
41ca6232-8484-40e0-a606-3bcbf29903f6
8e1febaf862c2e0bb162c6dfe968015b54f600d6b45f8d1a401b74e7285bc521
finished project 1719405033732 @ Mo 27. Feb 00:12:36 CET 2017
cleanup...
41ca6232-8484-40e0-a606-3bcbf29903f6
41ca6232-8484-40e0-a606-3bcbf29903f6
output (number of lines / size in bytes):
167017 60527726 /home/felix/occcloud/Openness/Kunden+Projekte/OpenRefine/openrefine-batch/examples/powerhouse-museum/output/1719405033732.tsv
finish: Mo 27. Feb 00:12:42 CET 2017
```
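
The statistics lines in the log above can be reproduced with standard tools. The sketch below is an assumption about how such output may be generated, not the script's own code: it asks `ps` for the columns shown in the log header, using the current shell as a stand-in for the OpenRefine server process, and uses `wc` for the line and byte counts of a sample file.

```shell
#!/bin/sh
# Sketch: produce statistics in the same shape as the sample log.
# The current shell ($$) stands in for the OpenRefine server process.
stats=$(ps -o start,etime,%mem,%cpu,rss -p "$$")
echo "$stats"

# Line count and byte size of a file, as in the "output" part of the log.
# The file content is a hypothetical two-line TSV export.
tmpfile=$(mktemp)
printf 'a\tb\n1\t2\n' > "$tmpfile"
sizes=$(wc -l -c "$tmpfile")
echo "$sizes"
rm -f "$tmpfile"
```

RSS is reported by `ps` in kilobytes, so the value 1625788 in the transform step of the log corresponds to roughly 1.6 GB of resident memory.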
### Todo
- [ ] howto for installation on Mac and Windows
- [ ] howto for extracting input options from OpenRefine GUI with Firefox network monitor
- [ ] use getopts for parsing of arguments
- [ ] provide more example data from other OpenRefine tutorials
### Licensing
MIT License
Copyright (c) 2017 Felix Lohmeier
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@ -0,0 +1,83 @@
Creative Commons Attribution-NonCommercial 2.5 Australia
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE LEGAL SERVICES. DISTRIBUTION OF THIS LICENCE DOES NOT CREATE AN ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES REGARDING THE INFORMATION PROVIDED, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM ITS USE.
Licence
THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENCE ("CCPL" OR "LICENCE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORISED UNDER THIS LICENCE AND/OR APPLICABLE LAW IS PROHIBITED.
BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENCE. THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.
1. Definitions
"Collective Work" means a work, such as a periodical issue, anthology or encyclopaedia, in which the Work in its entirety in unmodified form, along with a number of other contributions, constituting separate and independent works in themselves, are assembled into a collective whole. A work that constitutes a Collective Work will not be considered a Derivative Work (as defined below) for the purposes of this Licence.
"Derivative Work" means a work that reproduces a substantial part of the Work, or of the Work and other pre-existing works protected by copyright, or that is an adaptation of a Work that is a literary, dramatic, musical or artistic work. Derivative Works include a translation, musical arrangement, dramatisation, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be adapted, except that a work that constitutes a Collective Work will not be considered a Derivative Work for the purpose of this Licence. For the avoidance of doubt, where the Work is a musical composition or sound recording, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered a Derivative Work for the purpose of this Licence.
"Licensor" means the individual or entity that offers the Work under the terms of this Licence.
"Moral rights law" means laws under which an individual who creates a work protected by copyright has rights of integrity of authorship of the work, rights of attribution of authorship of the work, rights not to have authorship of the work falsely attributed, or rights of a similar or analogous nature in the work anywhere in the world.
"Original Author" means the individual or entity who created the Work.
"Work" means the work or other subject-matter protected by copyright that is offered under the terms of this Licence, which may include (without limitation) a literary, dramatic, musical or artistic work, a sound recording or cinematograph film, a published edition of a literary, dramatic, musical or artistic work or a television or sound broadcast.
"You" means an individual or entity exercising rights under this Licence who has not previously violated the terms of this Licence with respect to the Work, or who has received express permission from the Licensor to exercise rights under this Licence despite a previous violation.
"Licence Elements" means the following high-level licence attributes as selected by Licensor and indicated in the title of this Licence: Attribution, NonCommercial, NoDerivatives, ShareAlike.
2. Fair Dealing and Other Rights. Nothing in this Licence excludes or modifies, or is intended to exclude or modify, (including by reducing, limiting, or restricting) the rights of You or others to use the Work arising from fair dealings or other limitations on the rights of the copyright owner or the Original Author under copyright law, moral rights law or other applicable laws.
3. Licence Grant. Subject to the terms and conditions of this Licence, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) licence to exercise the rights in the Work as stated below:
to reproduce the Work, to incorporate the Work into one or more Collective Works, and to reproduce the Work as incorporated in the Collective Works;
to create and reproduce Derivative Works;
to publish, communicate to the public, distribute copies or records of, exhibit or display publicly, perform publicly and perform publicly by means of a digital audio transmission the Work including as incorporated in Collective Works;
to publish, communicate to the public, distribute copies or records of, exhibit or display publicly, perform publicly, and perform publicly by means of a digital audio transmission Derivative Works;
The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats. All rights not expressly granted by Licensor under this Licence are hereby reserved, including but not limited to the rights set forth in Sections 4(d) and 4(e).
4. Restrictions. The licence granted in Section 3 above is expressly made subject to and limited by the following restrictions:
You may publish, communicate to the public, distribute, publicly exhibit or display, publicly perform, or publicly digitally perform the Work only under the terms of this Licence, and You must include a copy of, or the Uniform Resource Identifier for, this Licence with every copy or record of the Work You publish, communicate to the public, distribute, publicly exhibit or display, publicly perform or publicly digitally perform. You may not offer or impose any terms on the Work that exclude, alter or restrict the terms of this Licence or the recipients' exercise of the rights granted hereunder. You may not sublicense the Work. You must keep intact all notices that refer to this Licence and to the disclaimer of representations and warranties. You may not publish, communicate to the public, distribute, publicly exhibit or display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this Licence. The above applies to the Work as incorporated in a Collective Work, but this does not require the Collective Work apart from the Work itself to be made subject to the terms of this Licence. If You create a Collective Work, upon notice from any Licensor You must, to the extent practicable, remove from the Collective Work any credit as required by Section 4(c), as requested. If You create a Derivative Work, upon notice from any Licensor You must, to the extent practicable, remove from the Derivative Work any credit as required by Section 4(c), as requested.
You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation. The exchange of the Work for other copyrighted works by means of digital file-sharing or otherwise shall not be considered to be intended for or directed toward commercial advantage or private monetary compensation, provided there is no payment of any monetary compensation in connection with the exchange of copyrighted works.
If you publish, communicate to the public, distribute, publicly exhibit or display, publicly perform, or publicly digitally perform the Work or any Derivative Works or Collective Works, You must keep intact all copyright notices for the Work. You must also give clear and reasonably prominent credit to (i) the Original Author (by name or pseudonym if applicable), if the name or pseudonym is supplied; and (ii) if another party or parties (eg a sponsor institute, publishing entity or journal) is designated for attribution in the copyright notice, terms of service or other reasonable means associated with the Work, such party or parties. If applicable, that credit must be given in the particular way made known by the Original Author and otherwise as reasonable to the medium or means You are utilizing, by conveying the identity of the Original Author and the other designated party or parties (if applicable); the title of the Work if supplied; to the extent reasonably practicable, the Uniform Resource Identifier, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and in the case of a Derivative Work, a credit identifying the use of the Work in the Derivative Work (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). Such credit may be implemented in any reasonable manner; provided, however, that in the case of a Derivative Work or Collective Work, at a minimum such credit will appear where any other comparable authorship credit appears and in a manner at least as prominent as such other comparable authorship credit.
For the avoidance of doubt, where the Work is a musical composition:
Performance Royalties Under Blanket Licences. Licensor reserves the exclusive right to collect, whether individually or via a performance rights society, royalties for Your communication to the public, broadcast, public performance or public digital performance (e.g. webcast) of the Work if Your communication to the public, broadcast, public performance or public digital performance is primarily intended for or directed toward commercial advantage or private monetary compensation.
Mechanical Rights and Statutory Royalties. Licensor reserves the exclusive right to collect, whether individually or via a music rights agency, designated agent or a music publisher, royalties for any record You create from the Work ("cover version") and distribute, subject to the compulsory licence created by 17 USC Section 115 of the US Copyright Act (or an equivalent statutory licence under the Australian Copyright Act or in other jurisdictions), if Your distribution of such cover version is primarily intended for or directed toward commercial advantage or private monetary compensation.
Webcasting Rights and Statutory Royalties. For the avoidance of doubt, where the Work is a sound recording, Licensor reserves the exclusive right to collect, whether individually or via a performance-rights society, royalties for Your public digital performance (e.g. webcast) of the Work, subject to the compulsory licence created by 17 USC Section 114 of the US Copyright Act (or the equivalent in other jurisdictions), if Your public digital performance is primarily intended for or directed toward commercial advantage or private monetary compensation.
False attribution prohibited. Except as otherwise agreed in writing by the Licensor, if You publish, communicate to the public, distribute, publicly exhibit or display, publicly perform, or publicly digitally perform the Work or any Derivative Works or Collective Works in accordance with this Licence, You must not falsely attribute the Work to someone other than the Original Author.
Prejudice to honour or reputation prohibited. Except as otherwise agreed in writing by the Licensor, if you publish, communicate to the public, distribute, publicly exhibit or display, publicly perform, or publicly digitally perform the Work or any Derivative Works or Collective Works, You must not do anything that results in a material distortion of, the mutilation of, or a material alteration to, the Work that is prejudicial to the Original Author's honour or reputation, and You must not do anything else in relation to the Work that is prejudicial to the Original Author's honour or reputation.
5. Disclaimer.
EXCEPT AS EXPRESSLY STATED IN THIS LICENCE OR OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, AND TO THE FULL EXTENT PERMITTED BY APPLICABLE LAW, LICENSOR OFFERS THE WORK "AS-IS" AND MAKES NO REPRESENTATIONS, WARRANTIES OR CONDITIONS OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, ANY REPRESENTATIONS, WARRANTIES OR CONDITIONS REGARDING THE CONTENTS OR ACCURACY OF THE WORK, OR OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, THE ABSENCE OF LATENT OR OTHER DEFECTS, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE.
6. Limitation on Liability.
TO THE FULL EXTENT PERMITTED BY APPLICABLE LAW, AND EXCEPT FOR ANY LIABILITY ARISING FROM CONTRARY MUTUAL AGREEMENT AS REFERRED TO IN SECTION 5, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION, NEGLIGENCE) FOR ANY LOSS OR DAMAGE WHATSOEVER, INCLUDING (WITHOUT LIMITATION) LOSS OF PRODUCTION OR OPERATION TIME, LOSS, DAMAGE OR CORRUPTION OF DATA OR RECORDS; OR LOSS OF ANTICIPATED SAVINGS, OPPORTUNITY, REVENUE, PROFIT OR GOODWILL, OR OTHER ECONOMIC LOSS; OR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF OR IN CONNECTION WITH THIS LICENCE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
If applicable legislation implies warranties or conditions, or imposes obligations or liability on the Licensor in respect of this Licence that cannot be wholly or partly excluded, restricted or modified, the Licensor's liability is limited, to the full extent permitted by the applicable legislation, at its option, to:
in the case of goods, any one or more of the following:
the replacement of the goods or the supply of equivalent goods;
the repair of the goods;
the payment of the cost of replacing the goods or of acquiring equivalent goods;
the payment of the cost of having the goods repaired; or
in the case of services:
the supplying of the services again; or
the payment of the cost of having the services supplied again.
7. Termination.
This Licence and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this Licence. Individuals or entities who have received Derivative Works or Collective Works from You under this Licence, however, will not have their licences terminated provided such individuals or entities remain in full compliance with those licences. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this Licence.
Subject to the above terms and conditions, the licence granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different licence terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this Licence (or any other licence that has been, or is required to be, granted under the terms of this Licence), and this Licence will continue in full force and effect unless terminated as stated above.
8. Miscellaneous.
Each time You publish, communicate to the public, distribute or publicly digitally perform the Work or a Collective Work, the Licensor offers to the recipient a licence to the Work on the same terms and conditions as the licence granted to You under this Licence.
Each time You publish, communicate to the public, distribute or publicly digitally perform a Derivative Work, Licensor offers to the recipient a licence to the original Work on the same terms and conditions as the licence granted to You under this Licence.
If any provision of this Licence is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this Licence, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
No term or provision of this Licence shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
This Licence constitutes the entire agreement between the parties with respect to the Work licensed here. To the full extent permitted by applicable law, there are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This Licence may not be modified without the mutual written agreement of the Licensor and You.
The construction, validity and performance of this Licence shall be governed by the laws in force in New South Wales, Australia.
Creative Commons is not a party to this Licence, and, to the full extent permitted by applicable law, makes no representation or warranty whatsoever in connection with the Work. To the full extent permitted by applicable law, Creative Commons will not be liable to You or any party on any legal theory (including, without limitation, negligence) for any damages whatsoever, including without limitation any general, special, incidental or consequential damages arising in connection to this licence. Notwithstanding the foregoing two (2) sentences, if Creative Commons has expressly identified itself as the Licensor hereunder, it shall have all rights and obligations of Licensor.
Except for the limited purpose of indicating to the public that the Work is licensed under the CCPL, neither party will use the trademark "Creative Commons" or any related trademark or logo of Creative Commons without the prior written consent of Creative Commons. Any permitted use will be in compliance with Creative Commons' then-current trademark usage guidelines, as may be published on its website or otherwise made available upon request from time to time.
Creative Commons may be contacted at https://creativecommons.org/.


@ -0,0 +1,23 @@
# Example Powerhouse Museum
## Tutorial
Seth van Hooland, Ruben Verborgh and Max De Wilde (August 5, 2013): Cleaning Data with OpenRefine. In: The Programming Historian. http://programminghistorian.org/lessons/cleaning-data-with-openrefine
## Usage
```
./openrefine-batch.sh examples/powerhouse-museum/input/ examples/powerhouse-museum/config/ examples/powerhouse-museum/output/ 4G tsv --processQuotes=false --guessCellValueTypes=true
```
## phm-collection.tsv
* The [Powerhouse Museum in Sydney](https://maas.museum/powerhouse-museum/) provides a freely available metadata export of its collection on its website. The collection metadata has been retrieved from the website freeyourmetadata.org that has redistributed the data: http://data.freeyourmetadata.org/powerhouse-museum/
## phm-transform.json
* All steps from the tutorial above, extracted from the history of the processed tutorial project, retrieved from the website freeyourmetadata.org: [phm-collection-cleaned.google-refine.tar.gz](http://data.freeyourmetadata.org/powerhouse-museum/phm-collection-cleaned.google-refine.tar.gz)
## License
* The data is released under a [Creative Commons Attribution-NonCommercial 2.5 Australia License](http://creativecommons.org/licenses/by-nc/2.5/au/)


@ -0,0 +1,563 @@
[
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"selectNumeric": false,
"expression": "value",
"selectBlank": true,
"selectError": true,
"selectNonNumeric": true,
"name": "Record ID",
"from": 0,
"to": 510000,
"type": "range",
"columnName": "Record ID"
}
]
}
},
{
"op": "core/row-reorder",
"description": "Reorder rows",
"mode": "record-based",
"sorting": {
"criteria": [
{
"errorPosition": 1,
"valueType": "number",
"column": "Record ID",
"blankPosition": 2,
"reverse": false
}
]
}
},
{
"op": "core/blank-down",
"description": "Blank down cells in column Record ID",
"engineConfig": {
"mode": "row-based",
"facets": []
},
"columnName": "Record ID"
},
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"invert": false,
"selectError": false,
"selection": [
{
"v": {
"v": true,
"l": "true"
}
}
],
"name": "Record ID",
"omitBlank": false,
"type": "list",
"columnName": "Record ID"
}
]
}
},
{
"op": "core/multivalued-cell-split",
"description": "Split multi-valued cells in column Categories",
"columnName": "Categories",
"keyColumnName": "Record ID",
"separator": "|",
"mode": "plain"
},
{
"op": "core/row-removal",
"description": "Remove rows",
"engineConfig": {
"mode": "row-based",
"facets": [
{
"omitError": false,
"expression": "isBlank(value)",
"selectBlank": false,
"invert": false,
"selectError": false,
"selection": [
{
"v": {
"v": true,
"l": "true"
}
}
],
"name": "Categories",
"omitBlank": false,
"type": "list",
"columnName": "Categories"
}
]
}
},
{
"op": "core/mass-edit",
"description": "Mass edit cells in column Categories",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "value",
"edits": [
{
"fromBlank": false,
"fromError": false,
"from": [
"Audio and Visual Equipment",
"Audio and visual equipment"
],
"to": "Audio and Visual Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Photographic Equipment",
"Photographic equipment"
],
"to": "Photographic Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Food and Drink",
"food and drink"
],
"to": "Food and Drink"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Office Equipment",
"Office equipment"
],
"to": "Office Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Documents",
"documents"
],
"to": "Documents"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Musical Instruments",
"Musical instruments"
],
"to": "Musical Instruments"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Material technology",
"Material Technology"
],
"to": "Material technology"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Medicines",
"medicines"
],
"to": "Medicines"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Agricultural Equipment",
"Agricultural equipment"
],
"to": "Agricultural Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Personal Effects",
"Personal effects"
],
"to": "Personal Effects"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Toiletries",
"toiletries"
],
"to": "Toiletries"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Pictorials",
"pictorials"
],
"to": "Pictorials"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Postal Equipment",
"Postal equipment"
],
"to": "Postal Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Health and Medical Equipment",
"Health and medical equipment"
],
"to": "Health and Medical Equipment"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Glass Forms",
"Glass forms"
],
"to": "Glass Forms"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Botanical specimens",
"Botanical Specimens"
],
"to": "Botanical specimens"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Archaeology-Ancient",
"Archaeology-ancient"
],
"to": "Archaeology-Ancient"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Numismatics",
"numismatics"
],
"to": "Numismatics"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Plastics Technology",
"Plastics technology"
],
"to": "Plastics Technology"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Transport-Land",
"Transport-land"
],
"to": "Transport-Land"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Clothing and Dress",
"Clothing and dress"
],
"to": "Clothing and Dress"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Measuring Instruments",
"Measuring instruments"
],
"to": "Measuring Instruments"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Chemical Samples",
"Chemical samples"
],
"to": "Chemical Samples"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Electronics Packaging",
"Electronics packaging"
],
"to": "Electronics Packaging"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Didactic Displays",
"Didactic displays"
],
"to": "Didactic Displays"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Scientific Instruments",
"Scientific instruments"
],
"to": "Scientific Instruments"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Ceremonial Objects",
"Ceremonial objects"
],
"to": "Ceremonial Objects"
}
]
},
{
"op": "core/mass-edit",
"description": "Mass edit cells in column Categories",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "value",
"edits": [
{
"fromBlank": false,
"fromError": false,
"from": [
"Band saws",
"Bandsaws"
],
"to": "Band saws"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Mailbags",
"Mail bags"
],
"to": "Mailbags"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Bookmarks",
"book marks"
],
"to": "Bookmarks"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Bonbon dishes",
"Bon bon dishes"
],
"to": "Bonbon dishes"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Bedsheets",
"Bed sheets"
],
"to": "Bedsheets"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Skullcaps",
"Skull caps"
],
"to": "Skullcaps"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Air bricks",
"Airbricks"
],
"to": "Air bricks"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Transport-Water",
"Transport - Water"
],
"to": "Transport-Water"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Doorknobs",
"Door knobs"
],
"to": "Doorknobs"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Transport-Air",
"Transport - Air"
],
"to": "Transport-Air"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Swatch books",
"Swatchbooks"
],
"to": "Swatch books"
}
]
},
{
"op": "core/mass-edit",
"description": "Mass edit cells in column Categories",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "value",
"edits": [
{
"fromBlank": false,
"fromError": false,
"from": [
"Costumes",
"Costume"
],
"to": "Costumes"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Textile designs",
"Textile design"
],
"to": "Textile designs"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Bed fittings",
"Bed fitting"
],
"to": "Bed fittings"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Paintings",
"Painting"
],
"to": "Paintings"
},
{
"fromBlank": false,
"fromError": false,
"from": [
"Schilling coins",
"Shilling coins"
],
"to": "Schilling coins"
}
]
},
{
"op": "core/multivalued-cell-join",
"description": "Join multi-valued cells in column Categories",
"columnName": "Categories",
"keyColumnName": "Record ID",
"separator": ", "
},
{
"op": "core/text-transform",
"description": "Text transform on cells in column Categories using expression grel:value.split(\", \").uniques().join(\", \")",
"engineConfig": {
"mode": "record-based",
"facets": []
},
"columnName": "Categories",
"expression": "grel:value.split(\", \").uniques().join(\", \")",
"onError": "set-to-blank",
"repeat": false,
"repeatCount": 10
},
{
"op": "core/multivalued-cell-split",
"description": "Split multi-valued cells in column Categories",
"columnName": "Categories",
"keyColumnName": "Record ID",
"separator": ", ",
"mode": "plain"
}
]
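
The last three operations above join the category values of a record into one cell, drop duplicates with `value.split(", ").uniques().join(", ")`, and split the cell again. As a rough illustration of the middle step only, the following sketch reproduces that expression on a single string. The function name `dedupe_values` is hypothetical and not part of this repository, and keeping the first occurrence of each value in its original order is an assumption about how `uniques()` behaves.

```shell
#!/bin/bash
# Hypothetical helper, not part of the repository: mimics the GREL expression
# value.split(", ").uniques().join(", ") on a single cell value.
# Assumption: the first occurrence of each value is kept, in original order.
dedupe_values() {
  awk -v cell="$1" 'BEGIN {
    n = split(cell, parts, ", ")      # split the cell on comma + space
    out = ""
    for (i = 1; i <= n; i++) {
      if (!(parts[i] in seen)) {      # skip values that were already emitted
        seen[parts[i]] = 1
        out = (k++ == 0) ? parts[i] : (out ", " parts[i])
      }
    }
    print out
  }'
}

dedupe_values "Numismatics, Coins, Numismatics"
# prints: Numismatics, Coins
```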


146
openrefine-batch.sh Executable file

@ -0,0 +1,146 @@
#!/bin/bash
# openrefine-batch.sh, Felix Lohmeier, v0.1, 27.02.2017
# https://github.com/...
# usage: ./openrefine-batch.sh INPUTDIR CONFIGDIR OUTPUTDIR [RAM] [INPUTFORMAT] [INPUTOPTIONS...]
# user input
if [ -z "$1" ]
then
echo 1>&2 "please provide path to directory with source files"
exit 2
else
inputdir=$(readlink -f "$1")
inputfiles=($(basename -a "${inputdir}"/*))
fi
if [ -z "$2" ]
then
echo 1>&2 "please provide path to directory with config files"
exit 2
else
configdir=$(readlink -f "$2")
jsonfiles=($(basename -a "${configdir}"/*))
fi
if [ -z "$3" ]
then
echo 1>&2 "please provide path to output directory"
exit 2
else
outputdir=$(readlink -f "$3")
mkdir -p "${outputdir}"
fi
if [ -z "$4" ]
then
ram="4G"
else
ram="$4"
fi
if [ -z "$5" ]
then
inputformat=""
else
inputformat="--format=${5}"
fi
# collect all remaining arguments as import options (empty if none are given)
inputoptions=( "${@:6}" )
# variables
version="2.7rc1"
uuid=$(cat /proc/sys/kernel/random/uuid)
echo "Input dir: $inputdir"
echo "Input files: ${inputfiles[@]}"
echo "Input format: $inputformat"
echo "Input options: ${inputoptions[@]}"
echo "Transformation rules: ${jsonfiles[@]}"
echo "OpenRefine heap space: $ram"
echo "OpenRefine version: $version"
echo "Docker container: $uuid"
echo "Output directory: $outputdir"
echo ""
# time
echo "begin: $(date)"
echo ""
# launch openrefine server
echo "start OpenRefine server..."
sudo docker run -d --name=${uuid} -v "${outputdir}":/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
echo ""
# import all files
for inputfile in "${inputfiles[@]}" ; do
echo "import ${inputfile}..."
# import
sudo docker run --rm --link ${uuid} -v "${inputdir}":/data felixlohmeier/openrefine-client -H ${uuid} -c "$inputfile" $inputformat ${inputoptions[@]}
# show server logs
sudo docker attach ${uuid} &
# statistics
ps -o start,etime,%mem,%cpu,rss -C java
# restart server to clear memory
echo "save project and restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v "${outputdir}":/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
echo ""
done
# get project ids
projects=($(sudo docker run --rm --link ${uuid} felixlohmeier/openrefine-client -H ${uuid} -l | cut -c 2-14))
# loop for all projects
for projectid in "${projects[@]}" ; do
echo "begin project $projectid @ $(date)"
# apply transformation rules
for jsonfile in "${jsonfiles[@]}" ; do
echo "transform ${jsonfile}..."
# show server logs
sudo docker attach ${uuid} &
# apply
sudo docker run --rm --link ${uuid} -v "${configdir}":/data felixlohmeier/openrefine-client -H ${uuid} -f "${jsonfile}" ${projectid}
# statistics
ps -o start,etime,%mem,%cpu,rss -C java
# restart server to clear memory
echo "save project and restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v "${outputdir}":/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
done
# export files
echo "export to file ${projectid}.tsv..."
# show server logs
sudo docker attach ${uuid} &
# export
sudo docker run --rm --link ${uuid} -v "${outputdir}":/data felixlohmeier/openrefine-client -H ${uuid} -E --output=${projectid}.tsv ${projectid}
# statistics
ps -o start,etime,%mem,%cpu,rss -C java
# restart server to clear memory
echo "restart OpenRefine server..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo docker run -d --name=${uuid} -v "${outputdir}":/data felixlohmeier/openrefine:${version} -i 0.0.0.0 -m ${ram} -d /data
until sudo docker run --rm --link ${uuid} --entrypoint /usr/bin/curl felixlohmeier/openrefine-client --silent -N http://${uuid}:3333 | cat | grep -q -o "OpenRefine" ; do sleep 1; done
# time
echo "finished project $projectid @ $(date)"
echo ""
done
# cleanup
echo "cleanup..."
sudo docker stop -t=5000 ${uuid}
sudo docker rm ${uuid}
sudo rm -r -f "${outputdir}"/*.project
sudo rm -r -f "${outputdir}"/workspace*.json
echo ""
# list output files
echo "output (number of lines / size in bytes):"
wc -c -l "${outputdir}"/*.tsv
echo ""
# time
echo "finish: $(date)"
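
The script repeats the same `until ... ; do sleep 1; done` loop after every server restart, and that loop never gives up if the server fails to come back. One possible way to bound the wait is a small retry helper, sketched below. The name `wait_until`, the attempt limit, and the `server_ready` wrapper mentioned in the comments are assumptions for illustration and do not appear in openrefine-batch.sh.

```shell
#!/bin/bash
# Hypothetical helper, not part of openrefine-batch.sh: run a command once per
# second until it succeeds or the maximum number of attempts is reached.
wait_until() {
  local max_attempts="$1"   # upper bound on attempts (roughly seconds)
  shift
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo 1>&2 "gave up after ${max_attempts} attempts: $*"
      return 1
    fi
    sleep 1
  done
}

# Possible usage (assumption): wrap the curl/grep readiness check of the
# script in a function named server_ready, then call
#   wait_until 60 server_ready || exit 1
```

Because the readiness check in the script is a pipeline, it would have to be wrapped in a function before being passed to the helper, which runs a single command with its arguments.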