# Test the cli module in a Python 2 environment

## Install

This notebook requires a Python 2.7 environment and an OpenRefine server running at http://127.0.0.1:3333.
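
A quick pre-flight check can save a confusing traceback later. The sketch below only tests whether something accepts TCP connections on the given host and port; it does not verify that the listener is actually OpenRefine. The helper name `server_reachable` is made up for this illustration and is not part of openrefine-client.

```python
import socket

def server_reachable(host='127.0.0.1', port=3333, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only proves that a port
    is open, not that OpenRefine is the program listening on it.
    """
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True

if not server_reachable():
    print('Nothing is listening on 127.0.0.1:3333; start OpenRefine first.')
```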

In [1]:
import sys
!{sys.executable} -m pip install .. --user --upgrade

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Processing /home/felix/git/openrefine-client
Installing collected packages: openrefine-client
 Found existing installation: openrefine-client 0.3.7
 Uninstalling openrefine-client-0.3.7:
 Successfully uninstalled openrefine-client-0.3.7
 Running setup.py install for openrefine-client ... done
Successfully installed openrefine-client-0.3.7


In [2]:
import tempfile
import shutil
import os
dirpath = tempfile.mkdtemp()
shutil.copytree('data',dirpath + '/data')
print(dirpath)
os.chdir(dirpath)

/tmp/tmp24HyYg


In [3]:
from google.refine import cli

## README.md

### Download

In [4]:
cli.download('https://git.io/fj5hF','duplicates.csv')

Download to file duplicates.csv complete


### Create

In [5]:
p1 = cli.create('duplicates.csv')

id: 2019539621291
rows: 10


### List

In [6]:
cli.ls()

 2019539621291: duplicates


### Info

In [7]:
cli.info(p1.project_id)

 id: 2019539621291
 url: http://127.0.0.1:3333/project?project=2019539621291
 name: duplicates
 modified: 2019-08-21T23:31:03Z
 created: 2019-08-21T23:31:02Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase


### Export

In [8]:
cli.export(p1.project_id)

email	name	state	gender	purchase
danny.baron@example1.com	Danny Baron	CA	M	TV
melanie.white@example2.edu	Melanie White	NC	F	iPhone
danny.baron@example1.com	D. Baron	CA	M	Winter jacket
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table
danny.baron@example1.com	Daniel Baron	CA	M	Bike
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill
melanie.white@example2.edu	Melanie White	NC	F	iPad
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier
arthur.duff@example4.com	Arthur Duff	OR	M	Night table


### Apply

In [9]:
cli.download('https://git.io/fj5ju','duplicates-deletion.json')

Download to file duplicates-deletion.json complete


In [10]:
cli.apply(p1.project_id, 'duplicates-deletion.json')

File duplicates-deletion.json has been successfully applied to project 2019539621291


In [11]:
cli.export(p1.project_id)

email	count	name	state	gender	purchase
arthur.duff@example4.com	2	Arthur Duff	OR	M	Dining table
ben.morisson@example6.org	1	Ben Morisson	FL	M	Amplifier
ben.tyler@example3.org	1	Ben Tyler	NV	M	Flashlight
danny.baron@example1.com	3	Danny Baron	CA	M	TV
jean.griffith@example5.org	1	Jean Griffith	WA	F	Power drill
melanie.white@example2.edu	2	Melanie White	NC	F	iPhone
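
For readers who want to sanity-check the export above without opening the JSON file: the result looks like "keep the first row per email, add a count, sort by email". The sketch below reproduces that effect in plain Python on the rows from the first export. It is an illustration of the observed result only; it makes no claim about which operations `duplicates-deletion.json` actually contains, and `dedupe_by_email` is a made-up helper name.

```python
# Rows copied from the first export: (email, name, state, gender, purchase)
rows = [
    ('danny.baron@example1.com', 'Danny Baron', 'CA', 'M', 'TV'),
    ('melanie.white@example2.edu', 'Melanie White', 'NC', 'F', 'iPhone'),
    ('danny.baron@example1.com', 'D. Baron', 'CA', 'M', 'Winter jacket'),
    ('ben.tyler@example3.org', 'Ben Tyler', 'NV', 'M', 'Flashlight'),
    ('arthur.duff@example4.com', 'Arthur Duff', 'OR', 'M', 'Dining table'),
    ('danny.baron@example1.com', 'Daniel Baron', 'CA', 'M', 'Bike'),
    ('jean.griffith@example5.org', 'Jean Griffith', 'WA', 'F', 'Power drill'),
    ('melanie.white@example2.edu', 'Melanie White', 'NC', 'F', 'iPad'),
    ('ben.morisson@example6.org', 'Ben Morisson', 'FL', 'M', 'Amplifier'),
    ('arthur.duff@example4.com', 'Arthur Duff', 'OR', 'M', 'Night table'),
]

def dedupe_by_email(rows):
    """Keep the first row per email and insert the number of occurrences."""
    first_seen = {}
    counts = {}
    for row in rows:
        email = row[0]
        counts[email] = counts.get(email, 0) + 1
        first_seen.setdefault(email, row)
    # Sort by email and place the count right after the email column.
    return [(email, counts[email]) + first_seen[email][1:]
            for email in sorted(first_seen)]

deduped = dedupe_by_email(rows)
for row in deduped:
    print('\t'.join(str(cell) for cell in row))
```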


### Export XLS

In [12]:
cli.export(p1.project_id, 'deduped.xls')

email	count	name	state	gender	purchase
arthur.duff@example4.com	2	Arthur Duff	OR	M	Dining table
ben.morisson@example6.org	1	Ben Morisson	FL	M	Amplifier
ben.tyler@example3.org	1	Ben Tyler	NV	M	Flashlight
danny.baron@example1.com	3	Danny Baron	CA	M	TV
jean.griffith@example5.org	1	Jean Griffith	WA	F	Power drill
melanie.white@example2.edu	2	Melanie White	NC	F	iPhone


### Delete

In [13]:
cli.delete(p1.project_id)

Project 2019539621291 has been successfully deleted


### Templating

In [14]:
p2 = cli.create('duplicates.csv')

id: 1716843473792
rows: 10


In [15]:
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender')

{ "events" : [
 { "name" : "Melanie White", "purchase" : "iPhone" },
 { "name" : "Jean Griffith", "purchase" : "Power drill" },
 { "name" : "Melanie White", "purchase" : "iPad" }
] }
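
How the four text parameters fit together may be easier to see in plain Python: filter the rows, render each one, join with `rowSeparator`, and wrap in `prefix` and `suffix`. This is a rough model of the output above, not the server's templating engine. The function `render` is hypothetical, and the GREL template is replaced by an ordinary Python function that uses `json.dumps` in place of `jsonize`.

```python
import json
import re

# (name, gender, purchase) taken from the export of duplicates.csv
rows = [
    {'name': 'Danny Baron', 'gender': 'M', 'purchase': 'TV'},
    {'name': 'Melanie White', 'gender': 'F', 'purchase': 'iPhone'},
    {'name': 'D. Baron', 'gender': 'M', 'purchase': 'Winter jacket'},
    {'name': 'Ben Tyler', 'gender': 'M', 'purchase': 'Flashlight'},
    {'name': 'Arthur Duff', 'gender': 'M', 'purchase': 'Dining table'},
    {'name': 'Daniel Baron', 'gender': 'M', 'purchase': 'Bike'},
    {'name': 'Jean Griffith', 'gender': 'F', 'purchase': 'Power drill'},
    {'name': 'Melanie White', 'gender': 'F', 'purchase': 'iPad'},
    {'name': 'Ben Morisson', 'gender': 'M', 'purchase': 'Amplifier'},
    {'name': 'Arthur Duff', 'gender': 'M', 'purchase': 'Night table'},
]

def render(rows, prefix, template, row_separator, suffix,
           filter_column, filter_query):
    """Hypothetical stand-in: prefix + joined row templates + suffix."""
    pattern = re.compile(filter_query)
    matching = [row for row in rows if pattern.search(row[filter_column])]
    return prefix + row_separator.join(template(row) for row in matching) + suffix

def row_template(row):
    # json.dumps plays the role of jsonize() in the GREL template.
    return ' { "name" : %s, "purchase" : %s }' % (
        json.dumps(row['name']), json.dumps(row['purchase']))

output = render(rows,
                prefix='{ "events" : [\n',
                template=row_template,
                row_separator=',\n',
                suffix='\n] }',
                filter_column='gender',
                filter_query='^F$')
print(output)
```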

In [16]:
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True)

Export to files complete. Last file: advanced_3.json


In [17]:
cli.templating(p2.project_id,
prefix='''{ "events" : [
''',
template=' { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^F$',
filterColumn='gender',
output_file='advanced.json',
splitToFiles=True,
suffixById=True)

Export to files complete. Last file: advanced_melanie.white@example2.edu.json


In [18]:
os.listdir(os.getcwd())

['advanced_jean.griffith@example5.org.json',
 'advanced_melanie.white@example2.edu.json',
 'advanced_3.json',
 'advanced_2.json',
 'advanced_1.json',
 'duplicates-deletion.json',
 'duplicates.csv',
 'data']
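
The listing shows three numbered files but only two id-suffixed ones, although three rows matched the filter. A likely explanation is that both Melanie White rows map to the same file name, so the later one overwrites the earlier one. The sketch below models the naming scheme as inferred from the file names above; `split_filenames` is a hypothetical helper, not the library's code.

```python
import os

def split_filenames(output_file, row_ids, suffix_by_id=False):
    """Return one file name per row.

    Naming is inferred from the listing above: <base>_<n><ext> by
    default, <base>_<first-column value><ext> with suffix_by_id.
    """
    base, ext = os.path.splitext(output_file)
    names = []
    for number, row_id in enumerate(row_ids, 1):
        suffix = row_id if suffix_by_id else str(number)
        names.append(base + '_' + suffix + ext)
    return names

# First-column values (email) of the three rows matching gender '^F$'
ids = ['melanie.white@example2.edu',
       'jean.griffith@example5.org',
       'melanie.white@example2.edu']

numbered = split_filenames('advanced.json', ids)
by_id = split_filenames('advanced.json', ids, suffix_by_id=True)

print(numbered)
print(sorted(set(by_id)))  # duplicates collapse to one file name
```

If one file per row is required, the first column has to be unique among the filtered rows.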

### Delete

In [19]:
cli.delete(p2.project_id)

Project 1716843473792 has been successfully deleted


## Unicode

### fruits

In [62]:
p1 = cli.create('data/cli/evil-fruits.tsv')
cli.info(p1.project_id)
cli.export(p1.project_id)

id: 1929957235590
rows: 5
 id: 1929957235590
 url: http://127.0.0.1:3333/project?project=1929957235590
 name: evil-fruits
 modified: 2019-08-21T23:35:47Z
 created: 2019-08-21T23:35:47Z
 rowCount: 5
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/evil-fruits.tsv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'evil-fruits', u'processQuotes': True, u'limit': -1, u'trimStrings': False, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: 🔣
 column 002: code
 column 003: meaning
🔣	code	meaning
🍇	1F347	GRAPES
🍉	1F349	WATERMELON
🍒	1F352	CHERRIES
🍓	1F353	STRAWBERRY
🍍	1F34D	PINEAPPLE


In [21]:
cli.export(p1.project_id, output_file='emojis.csv')
with open('emojis.csv', 'r') as f:
 print(f.read())

Export to file emojis.csv complete
🔣,code,meaning
🍇,1F347,GRAPES
🍉,1F349,WATERMELON
🍒,1F352,CHERRIES
🍓,1F353,STRAWBERRY
🍍,1F34D,PINEAPPLE



In [22]:
cli.templating(p1.project_id,
prefix='''{ "emojis" : [
''',
template=' { "symbol" : {{jsonize(with(row.columnNames[0],cn,cells[cn].value))}}, "meaning" : {{jsonize(cells["meaning"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^1F34',
filterColumn='code')

{ "emojis" : [
 { "symbol" : "🍇", "meaning" : "GRAPES" },
 { "symbol" : "🍉", "meaning" : "WATERMELON" },
 { "symbol" : "🍍", "meaning" : "PINEAPPLE" }
] }

In [23]:
cli.templating(p1.project_id,
prefix='''{ "emojis" : [
''',
template=' { "symbol" : {{jsonize(with(row.columnNames[0],cn,cells[cn].value))}}, "meaning" : {{jsonize(cells["meaning"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^1F34',
filterColumn='code',
output_file='trái cây.json',
splitToFiles=True)

Export to files complete. Last file: trái cây_3.json


In [24]:
cli.templating(p1.project_id,
prefix='''{ "emojis" : [
''',
template=' { "symbol" : {{jsonize(with(row.columnNames[0],cn,cells[cn].value))}}, "meaning" : {{jsonize(cells["meaning"].value)}} }',
rowSeparator=''',
''',
suffix='''
] }''',
filterQuery='^1F34',
filterColumn='code',
output_file='trái cây.json',
splitToFiles=True,
suffixById=True)

Export to files complete. Last file: trái cây_🍍.json


In [25]:
os.listdir(os.getcwd())

['tr\xc3\xa1i c\xc3\xa2y_\xf0\x9f\x8d\x8d.json',
 'tr\xc3\xa1i c\xc3\xa2y_\xf0\x9f\x8d\x89.json',
 'tr\xc3\xa1i c\xc3\xa2y_\xf0\x9f\x8d\x87.json',
 'tr\xc3\xa1i c\xc3\xa2y_3.json',
 'tr\xc3\xa1i c\xc3\xa2y_2.json',
 'tr\xc3\xa1i c\xc3\xa2y_1.json',
 'emojis.csv',
 'advanced_jean.griffith@example5.org.json',
 'advanced_melanie.white@example2.edu.json',
 'advanced_3.json',
 'advanced_2.json',
 'advanced_1.json',
 'duplicates-deletion.json',
 'duplicates.csv',
 'data']
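
The escaped names in the listing are UTF-8 byte strings, which is how Python 2 returns file names when `os.listdir` is given a byte-string path. Decoding them recovers the readable names. A small check, using two names copied from the listing and written with escapes so the snippet itself stays ASCII-only:

```python
# Byte strings as printed by os.listdir() above
names = [b'tr\xc3\xa1i c\xc3\xa2y_\xf0\x9f\x8d\x8d.json',
         b'tr\xc3\xa1i c\xc3\xa2y_3.json']

# Decode the raw bytes as UTF-8 to get text (unicode) strings.
decoded = [name.decode('utf-8') for name in names]

# Expected text: a-acute is U+00E1, a-circumflex is U+00E2,
# the pineapple emoji is U+1F34D.
expected = [u'tr\xe1i c\xe2y_\U0001F34D.json',
            u'tr\xe1i c\xe2y_3.json']

print(decoded == expected)
```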

In [26]:
cli.delete(p1.project_id)

Project 2401578251107 has been successfully deleted


### emoji data

In [63]:
p1 = cli.create('data/cli/dữ liệu biểu tượng cảm xúc.txt',
 project_format='tsv',
 headerLines=0,
 skipDataLines=34,
 limit=20)
cli.info(p1.project_id)
cli.export(p1.project_id)

id: 2314250240290
rows: 20
 id: 2314250240290
 url: http://127.0.0.1:3333/project?project=2314250240290
 name: dữ liệu biểu tượng cảm xúc
 modified: 2019-08-21T23:36:05Z
 created: 2019-08-21T23:36:05Z
 rowCount: 20
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/d\u1eef li\u1ec7u bi\u1ec3u t\u01b0\u1ee3ng c\u1ea3m x\xfac.txt', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'd\u1eef li\u1ec7u bi\u1ec3u t\u01b0\u1ee3ng c\u1ea3m x\xfac', u'processQuotes': True, u'skipDataLines': 34, u'limit': 20, u'trimStrings': False, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False, u'headerLines': 0}]
 column 001: Column 1
 column 002: Column 2
 column 003: Column 3
 column 004: Column 4
 column 005: Column 5
 column 006: Column 6
Column 1	Column 2	Column 3	Column 4	Column 5	Column 6
00A9 ;	text ;	L1 ;	none ;	j	# V1.1 (©) COPYRIGHT SIGN
00AE ;	text ;	L1 ;	none ;	j	# V1.1 (®) REGISTERED SIGN
203C ;	text ;	L1 ;	none ;	

In [64]:
cli.ls()

 2314250240290: dữ liệu biểu tượng cảm xúc
 1929957235590: evil-fruits


### Delete

In [29]:
cli.delete(p1.project_id)

Project 1602939526221 has been successfully deleted


## CSV

### default

In [30]:
p = cli.create('data/cli/duplicates.csv')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1675776970201
rows: 10
 id: 1675776970201
 url: http://127.0.0.1:3333/project?project=1675776970201
 name: duplicates
 modified: 2019-08-21T23:31:05Z
 created: 2019-08-21T23:31:05Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	" D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001

### encoding

check the TV symbol (📺) in row 1 of each export

In [31]:
p = cli.create('data/cli/duplicates.csv', encoding='ISO-8859-1')
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2268199900543
rows: 10
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	" D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001/07/04
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	1	2001-07
danny.baron@example1.com	Daniel Baron			Bike	1	2001
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	1	2000
melanie.white@example2.edu	Melanie White	NC	F	'iPad'	1	1999
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	1	1998
arthur.duff@example4.com	Arthur Duff	OR	M	Night table	1	1997
Project 2268199900543 has been successfully deleted


In [32]:
p = cli.create('data/cli/duplicates.csv', encoding='UTF-8')
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1798292162864
rows: 10
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	" D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001/07/04
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	1	2001-07
danny.baron@example1.com	Daniel Baron			Bike	1	2001
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	1	2000
melanie.white@example2.edu	Melanie White	NC	F	'iPad'	1	1999
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	1	1998
arthur.duff@example4.com	Arthur Duff	OR	M	Night table	1	1997
Project 1798292162864 has been successfully deleted
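
As a reference for what to look for when an encoding mismatch does take effect: the TV symbol occupies four bytes in UTF-8, and reading those bytes as ISO-8859-1 yields four unrelated characters instead of one symbol. The snippet below demonstrates this with Python's own codecs, independent of OpenRefine. In the two exports above the symbol looks the same in both runs.

```python
tv = u'\U0001F4FA'                 # the TV symbol from row 1

utf8_bytes = tv.encode('utf-8')    # bytes as stored in a UTF-8 file
as_utf8 = utf8_bytes.decode('utf-8')
as_latin1 = utf8_bytes.decode('iso-8859-1')

print('%d bytes in UTF-8' % len(utf8_bytes))
print('round trip via UTF-8 intact: %s' % (as_utf8 == tv))
print('%d characters when read as ISO-8859-1' % len(as_latin1))
```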


### guessCellValueTypes

check the OpenRefine GUI at the url below: numbers should be displayed in green

In [33]:
p = cli.create('data/cli/duplicates.csv', guessCellValueTypes=True)
cli.info(p.project_id)

id: 2351526371150
rows: 10
 id: 2351526371150
 url: http://127.0.0.1:3333/project?project=2351526371150
 name: duplicates
 modified: 2019-08-21T23:31:05Z
 created: 2019-08-21T23:31:05Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': True, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date


In [34]:
cli.delete(p.project_id)

Project 2351526371150 has been successfully deleted


### headerLines

check column names: they should be Column 1, Column 2, ... and the original header line should appear as the first data row

In [35]:
p = cli.create('data/cli/duplicates.csv', headerLines=0)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1753036694840
rows: 11
Column 1	Column 2	Column 3	Column 4	Column 5	Column 6	Column 7
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	" D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001/07/04
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	1	2001-07
danny.baron@example1.com	Daniel Baron			Bike	1	2001
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	1	2000
melanie.white@example2.edu	Melanie White	NC	F	'iPad'	1	1999
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	1	1998
arthur.duff@example4.com	Arthur Duff	OR	M	Night table	1	1997
Project 1753036694840 has been successfully deleted


### ignoreLines

check column names: the first 5 lines are ignored, so the first arthur.duff row should become the header

In [36]:
p = cli.create('data/cli/duplicates.csv', ignoreLines=5)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1567779238383
rows: 5
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	1	2001-07
danny.baron@example1.com	Daniel Baron			Bike	1	2001
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	1	2000
melanie.white@example2.edu	Melanie White	NC	F	'iPad'	1	1999
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	1	1998
arthur.duff@example4.com	Arthur Duff	OR	M	Night table	1	1997
Project 1567779238383 has been successfully deleted


### limit

should contain 5 rows

In [37]:
p = cli.create('data/cli/duplicates.csv', limit=5)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2236287775552
rows: 5
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	" D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001/07/04
arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	1	2001-07
Project 2236287775552 has been successfully deleted


### separator and processQuotes

should contain 10 rows and 2 columns (the second one named Column 2)

In [38]:
p = cli.create('data/cli/duplicates.csv', separator='\t', processQuotes=False)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2493837924937
rows: 10
email,name,state,gender,purchase,count,date	Column 2
"danny.baron@example1.com,Danny Baron,CA,M,TV (UTF-8: 📺),1,""Wed, 4 Jul 2001"	
melanie.white@example2.edu,Melanie White,NC,F,,1,2001-07-04T12:08:56	
danny.baron@example1.com, D.	"(""Tab"") Baron,CA,M,Winter jacket,1,2001-07-04"
ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight,1,2001/07/04	
arthur.duff@example4.com,Arthur Duff,OR,M,Dining table,1,2001-07	
danny.baron@example1.com,Daniel Baron,,,Bike,1,2001	
jean.griffith@example5.org,Jean Griffith,WA,F,Power drill,1,2000	
melanie.white@example2.edu,Melanie White,NC,F,'iPad',1,1999	
ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier,1,1998	
arthur.duff@example4.com,Arthur Duff,OR,M,Night table,1,1997	
Project 2493837924937 has been successfully deleted


### projectName

In [39]:
p = cli.create('data/cli/duplicates.csv', projectName='foo')
cli.info(p.project_id)
cli.delete(p.project_id)

id: 1568868311685
rows: 10
 id: 1568868311685
 url: http://127.0.0.1:3333/project?project=1568868311685
 name: foo
 modified: 2019-08-21T23:31:06Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'foo', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date
Project 1568868311685 has been successfully deleted


### projectTags (introduced in OpenRefine 2.8)

check manually at http://127.0.0.1:3333 > Open Project whether the tags were stored

In [40]:
p = cli.create('data/cli/duplicates.csv', projectTags=['client1', 'beta'])
cli.info(p.project_id)

id: 1889306695897
rows: 10
 id: 1889306695897
 url: http://127.0.0.1:3333/project?project=1889306695897
 name: duplicates
 tags: [u'client1', u'beta']
 modified: 2019-08-21T23:31:06Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'projectTags': [u'client1', u'beta'], u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date


In [41]:
cli.delete(p.project_id)

Project 1889306695897 has been successfully deleted


### skipDataLines

should contain 5 rows

In [42]:
p = cli.create('data/cli/duplicates.csv', skipDataLines=5)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1906416549071
rows: 5
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Daniel Baron			Bike	1	2001
jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	1	2000
melanie.white@example2.edu	Melanie White	NC	F	'iPad'	1	1999
ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	1	1998
arthur.duff@example4.com	Arthur Duff	OR	M	Night table	1	1997
Project 1906416549071 has been successfully deleted
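
The four line-related options above (`headerLines`, `ignoreLines`, `limit`, `skipDataLines`) are easy to mix up. The sketch below models them on a plain list of lines in the order that matches the results in this notebook: drop ignored lines, take the header, skip data lines, then apply the limit. It is a simplified model under that assumption, handles only `headerLines` of 0 or 1, and `simulate_import` is a made-up function whose parameter names merely mirror the `cli.create` keywords.

```python
def simulate_import(lines, ignoreLines=0, headerLines=1,
                    skipDataLines=0, limit=-1):
    """Return (header, data_rows) for a list of text lines.

    Simplified model for illustration only; values below 1 mean
    'option not used', and headerLines is treated as 0 or 1.
    """
    remaining = lines[max(ignoreLines, 0):]
    if headerLines > 0 and remaining:
        header, data = remaining[0], remaining[1:]
    else:
        header, data = None, remaining
    data = data[max(skipDataLines, 0):]
    if limit >= 0:
        data = data[:limit]
    return header, data

# Stand-in for duplicates.csv: one header line plus ten data lines
lines = ['header'] + ['row %d' % n for n in range(1, 11)]

for options in ({}, {'headerLines': 0}, {'ignoreLines': 5},
                {'limit': 5}, {'skipDataLines': 5}):
    header, data = simulate_import(lines, **options)
    print('%-22s header=%-8s rows=%d' % (options, header, len(data)))
```

With the stand-in data this gives 11 rows for `headerLines=0` and 5 rows for each of the other three options, matching the row counts reported above.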


### storeBlankCellsAsNulls

check OpenRefine GUI at url below:
* All > View > Show/Hide 'null' values in cells
* row 6 should contain null values in columns state and gender

In [43]:
p = cli.create('data/cli/duplicates.csv', guessCellValueTypes=True)
cli.info(p.project_id)

id: 1641203332364
rows: 10
 id: 1641203332364
 url: http://127.0.0.1:3333/project?project=1641203332364
 name: duplicates
 modified: 2019-08-21T23:31:06Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': True, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date


In [44]:
cli.delete(p.project_id)

Project 1641203332364 has been successfully deleted


## TSV

### default

In [45]:
p = cli.create('data/cli/duplicates.tsv')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2332414205165
rows: 10
 id: 2332414205165
 url: http://127.0.0.1:3333/project?project=2332414205165
 name: duplicates
 modified: 2019-08-21T23:31:06Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.tsv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'limit': -1, u'trimStrings': False, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date
email	name	state	gender	purchase	count	date
danny.baron@example1.com	Danny Baron	CA	M	TV (UTF-8: 📺)	1	Wed, 4 Jul 2001
melanie.white@example2.edu	Melanie White	NC	F		1	2001-07-04T12:08:56
danny.baron@example1.com	"D.	(""Tab"") Baron"	CA	M	Winter jacket	1	2001-07-04
ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	1	2001/07/04
arthur.duff@ex

## JSON

### default

In [46]:
p = cli.create('data/cli/duplicates.json')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1978993820770
rows: 10
 id: 1978993820770
 url: http://127.0.0.1:3333/project?project=1978993820770
 name: duplicates
 modified: 2019-08-21T23:31:06Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.json', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'_', u'_'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: _ - name
 column 002: _ - date
 column 003: _ - email
 column 004: _ - state
 column 005: _ - count
 column 006: _ - gender
 column 007: _ - purchase
_ - name	_ - date	_ - email	_ - state	_ - count	_ - gender	_ - purchase
Danny Baron	Wed, 4 Jul 2001	danny.baron@example1.com	CA	1	M	TV (UTF-8: 📺)
Melanie White	2001-07-04T12:08:56	melanie.white@example2.edu	NC	1	F	
" D.	(""Tab"") Baron"	2001-07-04	danny.baron@exa

### trimStrings (broken, does not work in the GUI either)

check in row 3 whether the leading space before `D.` is removed

In [47]:
p = cli.create('data/cli/duplicates.json', trimStrings=True)
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1892692171021
rows: 10
 id: 1892692171021
 url: http://127.0.0.1:3333/project?project=1892692171021
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:06Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.json', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'_', u'_'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': True, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: _ - name
 column 002: _ - date
 column 003: _ - email
 column 004: _ - state
 column 005: _ - count
 column 006: _ - gender
 column 007: _ - purchase
_ - name	_ - date	_ - email	_ - state	_ - count	_ - gender	_ - purchase
Danny Baron	Wed, 4 Jul 2001	danny.baron@example1.com	CA	1	M	TV (UTF-8: 📺)
Melanie White	2001-07-04T12:08:56	melanie.white@example2.edu	NC	1	F	
" D.	(""Tab"") Baron"	2001-07-04	danny.baron@exam

### recordPath

In [48]:
p = cli.create('data/cli/duplicates.json', recordPath=['_', '_', 'purchase'])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1945894618034
rows: 10
 id: 1945894618034
 url: http://127.0.0.1:3333/project?project=1945894618034
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.json', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'_', u'_', u'purchase'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: purchase
purchase
TV (UTF-8: 📺)

Winter jacket
Flashlight
Dining table
Bike
Power drill
'iPad'
Amplifier
Night table
Project 1945894618034 has been successfully deleted


### storeEmptyStrings

default: True; set to False to store empty strings as null values

check OpenRefine GUI at url below:
* All > View > Show/Hide 'null' values in cells
* row 6 should contain null values in columns state and gender

In [49]:
p = cli.create('data/cli/duplicates.json', storeEmptyStrings=False)
cli.info(p.project_id)

id: 2551263767214
rows: 10
 id: 2551263767214
 url: http://127.0.0.1:3333/project?project=2551263767214
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': False, u'fileSource': u'data/cli/duplicates.json', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'_', u'_'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: _ - name
 column 002: _ - date
 column 003: _ - email
 column 004: _ - count
 column 005: _ - purchase
 column 006: _ - state
 column 007: _ - gender


In [50]:
cli.delete(p.project_id)

Project 2551263767214 has been successfully deleted


## XML

### default

In [51]:
p = cli.create('data/cli/duplicates.xml')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1926835461545
rows: 80
 id: 1926835461545
 url: http://127.0.0.1:3333/project?project=1926835461545
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 80
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.xml', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'root'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: root
 column 002: root - record
 column 003: root - record - name
 column 004: root - record - date
 column 005: root - record - email
 column 006: root - record - count
 column 007: root - record - purchase
 column 008: root - record - state
 column 009: root - record - gender
root	root - record	root - record - name	root - record - date	root - record - email	root - record - count	root - record - purchase	root - r

### trimStrings (broken, does not work in the GUI either)

check whether the leading space before `D.` is removed

In [52]:
p = cli.create('data/cli/duplicates.xml', trimStrings=True)
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1615744471501
rows: 80
 id: 1615744471501
 url: http://127.0.0.1:3333/project?project=1615744471501
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 80
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.xml', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'root'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': True, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: root
 column 002: root - record
 column 003: root - record - name
 column 004: root - record - date
 column 005: root - record - email
 column 006: root - record - count
 column 007: root - record - purchase
 column 008: root - record - state
 column 009: root - record - gender
root	root - record	root - record - name	root - record - date	root - record - email	root - record - count	root - record - purchase	root - re

### recordPath

In [53]:
p = cli.create('data/cli/duplicates.xml', recordPath=['root', 'record', 'purchase'])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1843370951454
rows: 10
 id: 1843370951454
 url: http://127.0.0.1:3333/project?project=1843370951454
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.xml', u'storeBlankRows': True, u'encoding': u'', u'recordPath': [u'root', u'record', u'purchase'], u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: purchase
purchase
TV (UTF-8: 📺)

Winter jacket
Flashlight
Dining table
Bike
Power drill
'iPad'
Amplifier
Night table
Project 1843370951454 has been successfully deleted
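
What a record path selects can be tried out locally with the standard library. The sketch below walks an XML tree one tag per path element and collects the matching nodes. The sample document is invented for this illustration; its root/record layout is only inferred from the column names in the default import above. `select_records` is a hypothetical helper, not OpenRefine's importer.

```python
import xml.etree.ElementTree as ET

# Invented sample; structure guessed from the column names above
sample = (
    '<root>'
    '<record><name>Danny Baron</name><purchase>TV</purchase></record>'
    '<record><name>Melanie White</name><purchase></purchase></record>'
    '<record><name>Ben Tyler</name><purchase>Flashlight</purchase></record>'
    '</root>'
)

def select_records(xml_text, record_path):
    """Return the elements reached by following record_path from the root."""
    root = ET.fromstring(xml_text)
    if not record_path or root.tag != record_path[0]:
        return []
    nodes = [root]
    for tag in record_path[1:]:
        # Descend one level, keeping only children with the wanted tag.
        nodes = [child for node in nodes for child in node if child.tag == tag]
    return nodes

purchases = [node.text or ''
             for node in select_records(sample, ['root', 'record', 'purchase'])]
print(purchases)
```

Each selected node becomes one row, which is consistent with the 10 rows reported above for a path ending in `purchase`, compared with 80 rows for the default path.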


### storeEmptyStrings

default: True; set to False to store empty strings as null values

check OpenRefine GUI at url below:
* All > View > Show/Hide 'null' values in cells
* row 6 should contain null values in columns state and gender

In [54]:
p = cli.create('data/cli/duplicates.csv', storeEmptyStrings=False)
cli.info(p.project_id)

id: 2549624481101
rows: 10
 id: 2549624481101
 url: http://127.0.0.1:3333/project?project=2549624481101
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': False, u'fileSource': u'data/cli/duplicates.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date


In [55]:
cli.delete(p.project_id)

Project 2549624481101 has been successfully deleted


## TXT

### default (line-based)

In [56]:
p = cli.create('data/cli/duplicates.txt')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2029778313736
rows: 11
 id: 2029778313736
 url: http://127.0.0.1:3333/project?project=2029778313736
 name: duplicates
 modified: 2019-08-21T23:31:07Z
 created: 2019-08-21T23:31:07Z
 rowCount: 11
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.txt', u'storeBlankRows': True, u'encoding': u'', u'ignoreLines': -1, u'projectName': u'duplicates', u'processQuotes': True, u'skipDataLines': -1, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False, u'headerLines': 0}]
 column 001: Column 1
Column 1
email name state gender purchase count date 
danny.baron@example1.com Danny Baron CA M TV (UTF-8: 📺) 1 Wed, 4 Jul 2001 
melanie.white@example2.edu Melanie White NC F 1 2001-07-04T12:08:5
"danny.baron@example1.com D.	(""Tab"") Baron CA M Winter jacket 1 2001-07-04 "
ben.tyler@example3.org Ben Tyler NV M Flashlight 1 2001/07/04 
arthur.duff@example4.com Arthur Duf

### linesPerRow

should return 6 rows in 2 columns

In [57]:
p = cli.create('data/cli/duplicates.txt', linesPerRow=2)
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1614710460265
rows: 6
 id: 1614710460265
 url: http://127.0.0.1:3333/project?project=1614710460265
 name: duplicates
 modified: 2019-08-21T23:31:08Z
 created: 2019-08-21T23:31:08Z
 rowCount: 6
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.txt', u'storeBlankRows': True, u'encoding': u'', u'ignoreLines': -1, u'projectName': u'duplicates', u'processQuotes': True, u'limit': -1, u'skipDataLines': -1, u'separator': u',', u'trimStrings': False, u'linesPerRow': 2, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False, u'headerLines': 0}]
 column 001: Column 1
 column 002: Column 2
Column 1	Column 2
email name state gender purchase count date 	danny.baron@example1.com Danny Baron CA M TV (UTF-8: 📺) 1 Wed, 4 Jul 2001 
melanie.white@example2.edu Melanie White NC F 1 2001-07-04T12:08:5	"danny.baron@example1.com D.	(""Tab"") Baron CA M Winter jacket 1 2001-07-04 "
ben.tyler@example3.org Ben Tyler NV M Flashlight 1

### fixed-width: columnWidths and headerLines

In [58]:
p = cli.create('data/cli/duplicates.txt', columnWidths=[27, 21, 6, 7, 15, 6, 1000], headerLines=1)
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 1729341878534
rows: 10
 id: 1729341878534
 url: http://127.0.0.1:3333/project?project=1729341878534
 name: duplicates
 modified: 2019-08-21T23:31:08Z
 created: 2019-08-21T23:31:08Z
 rowCount: 10
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'data/cli/duplicates.txt', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'limit': -1, u'separator': u',', u'trimStrings': False, u'columnWidths': [27, 21, 6, 7, 15, 6, 1000], u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False, u'headerLines': 1}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date
email	name	state	gender	purchase	count	date
danny.baron@example1.com 	Danny Baron 	CA 	M 	TV (UTF-8: 📺) 	1 	Wed, 4 Jul 2001 
melanie.white@example2.edu 	Melanie White 	NC 	F 	 	1 	2001-07-04T12:08:5
danny.baron@example1.com 	" D.	(""Tab"") Baron "	CA 	M 	W
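
The columnWidths option cuts each line at fixed character offsets. The sketch below illustrates that idea in plain Python; it is not OpenRefine's importer. The function name split_fixed_width is hypothetical, and the sample line is built by padding values from the line-based output above to the given widths, which is an assumption about how duplicates.txt is laid out.

```python
def split_fixed_width(line, widths):
    """Cut `line` into consecutive fields of the given character widths."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width])
        start += width
    return fields


WIDTHS = [27, 21, 6, 7, 15, 6, 1000]

# Hypothetical input: every value except the last is padded to its column width
values = ['ben.tyler@example3.org', 'Ben Tyler', 'NV', 'M',
          'Flashlight', '1', '2001/07/04']
line = ''.join(v.ljust(w) for v, w in zip(values[:-1], WIDTHS)) + values[-1]

fields = split_fixed_width(line, WIDTHS)
print([f.strip() for f in fields])
```

The last width (1000) is deliberately larger than any line, so the final field simply takes whatever remains.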

## ZIP

### default

should contain 16 rows (the rows of all files in the archive combined)

In [59]:
p = cli.create('data/cli/duplicates.zip')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2279718038457
rows: 16
 id: 2279718038457
 url: http://127.0.0.1:3333/project?project=2279718038457
 name: duplicates
 modified: 2019-08-21T23:31:08Z
 created: 2019-08-21T23:31:08Z
 rowCount: 16
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'duplicates2.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}, {u'storeEmptyStrings': True, u'fileSource': u'duplicates2.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': False}]
 column 001: email
 column 002: name
 column 003: state
 column 004: gender
 column 005: purchase
 column 006: count
 column 007: date
email	name	stat

### includeFileSources

should contain an additional first column named File

In [60]:
p = cli.create('data/cli/duplicates.zip', includeFileSources=True)
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

id: 2100283089198
rows: 16
 id: 2100283089198
 url: http://127.0.0.1:3333/project?project=2100283089198
 name: duplicates
 modified: 2019-08-21T23:31:08Z
 created: 2019-08-21T23:31:08Z
 rowCount: 16
importOptionMetadata: [{u'storeEmptyStrings': True, u'fileSource': u'duplicates2.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': True}, {u'storeEmptyStrings': True, u'fileSource': u'duplicates2.csv', u'storeBlankRows': True, u'encoding': u'', u'projectName': u'duplicates', u'processQuotes': True, u'separator': u',', u'trimStrings': False, u'limit': -1, u'storeBlankCellsAsNulls': True, u'guessCellValueTypes': False, u'includeFileSources': True}]
 column 001: File
 column 002: email
 column 003: name
 column 004: state
 column 005: gender
 column 006: purchase
 column 007: count
 column 008: date
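
Both ZIP tests rely on the same idea: the rows of every file in the archive end up in one project, and includeFileSources adds a leading File column. The sketch below models that behaviour with the standard library only; it is an illustration under assumptions, not OpenRefine's implementation. The function read_zip_rows, the member names a.csv and b.csv, and the sample rows are all hypothetical, because the actual contents of duplicates.zip are not shown in this notebook. Splitting on commas is a simplification that ignores quoting.

```python
import io
import zipfile


def read_zip_rows(zip_bytes, include_file_sources=False):
    """Combine the data rows of every CSV member of a zip into one table."""
    header, rows = None, []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            lines = archive.read(name).decode('utf-8').splitlines()
            prefix = [name] if include_file_sources else []
            if header is None:
                # The header is taken from the first member only
                first = ['File'] if include_file_sources else []
                header = first + lines[0].split(',')
            for line in lines[1:]:
                rows.append(prefix + line.split(','))
    return header, rows


# Build a small hypothetical archive in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('a.csv', 'email,name\nx@example.com,X\ny@example.com,Y\n')
    archive.writestr('b.csv', 'email,name\nz@example.com,Z\n')

header, rows = read_zip_rows(buf.getvalue(), include_file_sources=True)
print(header)
print(len(rows))
```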

## ODS (broken in OpenRefine >=2.8)

### default

produces many blank columns and rows in OpenRefine <=2.7 (same result with manual import via the GUI)

In [61]:
p = cli.create('data/cli/duplicates.ods')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

Exception: Project not created
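
The cells in this notebook repeat the same four calls (create, info, export, delete). When info or export fails after a project was created, the project stays on the server. A possible wrapper is sketched below, using only the calls that appear in this notebook. The names roundtrip and RecordingCli are hypothetical; RecordingCli is a stand-in that records calls so the sketch runs without an OpenRefine server.

```python
def roundtrip(cli, path, **options):
    """Create a project, print its info and export, and always delete it."""
    p = cli.create(path, **options)  # if this raises, nothing needs cleanup
    try:
        cli.info(p.project_id)
        cli.export(p.project_id)
    finally:
        cli.delete(p.project_id)
    return p.project_id


class _Project(object):
    """Hypothetical minimal project object exposing only project_id."""
    def __init__(self, project_id):
        self.project_id = project_id


class RecordingCli(object):
    """Hypothetical stand-in for google.refine.cli that records calls."""
    def __init__(self):
        self.calls = []

    def create(self, path, **options):
        self.calls.append(('create', path, options))
        return _Project('123')

    def info(self, project_id):
        self.calls.append(('info', project_id))

    def export(self, project_id):
        self.calls.append(('export', project_id))

    def delete(self, project_id):
        self.calls.append(('delete', project_id))


fake = RecordingCli()
roundtrip(fake, 'data/cli/duplicates2.ods', sheets=[0])
print([call[0] for call in fake.calls])
```

In the notebook itself the call would presumably be roundtrip(cli, 'data/cli/duplicates2.ods', sheets=[0]), with cli imported from google.refine as in the install section.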

### sheets

first sheet only, from a file with 2 sheets

In [None]:
p = cli.create('data/cli/duplicates2.ods', sheets=[0])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

both sheets from a file with 2 sheets: should contain 16 rows

In [None]:
p = cli.create('data/cli/duplicates2.ods', sheets=[0, 1])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

## XLS (broken in OpenRefine >=2.8)

### default

In [None]:
p = cli.create('data/cli/duplicates.xls')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

### sheets

first sheet only, from a file with 2 sheets

In [None]:
p = cli.create('data/cli/duplicates2.xls', sheets=[0])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

both sheets from a file with 2 sheets: should contain 16 rows

In [None]:
p = cli.create('data/cli/duplicates2.xls', sheets=[0, 1])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

## XLSX (broken in OpenRefine >=2.8)

### default

In [None]:
p = cli.create('data/cli/duplicates.xlsx')
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

### sheets

first sheet only, from a file with 2 sheets

In [None]:
p = cli.create('data/cli/duplicates2.xlsx', sheets=[0])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)

both sheets from a file with 2 sheets: should contain 16 rows

In [None]:
p = cli.create('data/cli/duplicates2.xlsx', sheets=[0, 1])
cli.info(p.project_id)
cli.export(p.project_id)
cli.delete(p.project_id)