Addendum to Chapter 8
parent b7a1366bf8
commit 6e563ae54e
@ -0,0 +1,13 @@
# 8 Installing and configuring the Solr search index

Goal: install and configure a search index and populate it with the data from Chapter 7. Along the way, get a first impression of search engine technology, using Solr as the example.

Contents:

1. [Installing Solr with Docker](08_1_installation-von-solr-mit-docker.md)
2. [Configuring the Solr schema](08_2_konfiguration-des-solr-schemas.md)
3. [Loading TSV files into Solr](08_3_tsv-dateien-in-solr-laden.md)

Student contributions from the learning journals:

* ...
@ -0,0 +1,374 @@
# 8.1 Installing Solr with Docker

We use Docker again to install Solr. Docker Hub hosts a Solr repository that is maintained by Docker itself and is therefore marked as "official": https://hub.docker.com/_/solr/

Further reading:

* http://reasoncodeexample.com/2016/06/29/dock-the-pain-away-running-solr-in-docker/
* https://lucidworks.com/blog/2015/11/03/solr-on-docker-2/

## Task 1: Install Solr, create an index, load sample data, and get to know the admin UI and the built-in search UI

Notes:

* Follow the [installation instructions on Docker Hub](https://hub.docker.com/_/solr/) under the heading "Run Solr and index example data".
* Within Solr, an index is also called a "core".
* Familiarize yourself with the admin UI.
* Try out the built-in search UI. You reach it by appending the path ```/solr/gettingstarted/browse``` to the address of the admin UI.

## Solution

* Start the Docker container: {%s%}sudo docker run --name my_solr -d -p 8983:8983 -t solr{%ends%}
* Create the index: {%s%}sudo docker exec -it --user=solr my_solr bin/solr create_core -c gettingstarted{%ends%}
* Load the sample data: {%s%}sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/manufacturers.xml{%ends%}
* Admin UI: {%s%}Example: http://192.168.1.1:8983{%ends%}
* Built-in search UI: {%s%}Example: http://192.168.1.1:8983/solr/gettingstarted/browse{%ends%}

## Task 2: Get to know Solr's query features

Disclaimer: The following steps are taken from the [Solr "Quickstart" guide](http://lucene.apache.org/solr/quickstart.html) and have been rearranged and adapted for use with Docker.

Notes:

* In every command, replace ```localhost``` with the IP address of your web server (example: 192.168.1.1).

### Indexing data

The Solr install includes the ```bin/post``` tool to make it easy to get various types of documents into Solr from the start. We'll be using this tool for the indexing examples below.

Let's first index local files in many formats. ```bin/post``` can crawl a directory of files, optionally recursively, sending the raw content of each file into Solr for extraction and indexing.

A Solr install includes an ```example/exampledocs/``` subdirectory, which makes a convenient built-in set of files to start with.

```
sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/
```

Here's what it'll look like:

```
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -classpath /opt/solr/dist/solr-core-6.3.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool example/exampledocs/
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory example/exampledocs (19 files, depth=0)
POSTing file money.xml (application/xml) to [base]
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file more_books.jsonl (application/json) to [base]/json/docs
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file sample.html (text/html) to [base]/extract
POSTing file monitor2.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file solr-word.pdf (application/pdf) to [base]/extract
POSTing file books.csv (text/csv) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
POSTing file books.json (application/json) to [base]/json/docs
POSTing file hd.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
19 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/gettingstarted/update...
Time spent: 0:00:04.283
```

The command line breaks down as follows:

* ```-c gettingstarted```: name of the collection to index into
* ```example/exampledocs/```: path of the ```exampledocs/``` directory, relative to the Solr install directory

You have now indexed fifty documents into the ```gettingstarted``` collection in Solr and committed these changes. To search for "solr", open the Admin UI Query tab (http://localhost:8983/solr/#/gettingstarted/query), enter "solr" in the ```q``` param (replacing ```*:*```, which matches all documents), and click "Execute Query".

You can browse the indexed documents at http://localhost:8983/solr/gettingstarted/browse. The /browse UI gives a feel for how Solr's technical capabilities can be used in a familiar, though somewhat rough and prototypical, interactive HTML view.

Solr supports indexing structured content in a variety of incoming formats. The historically predominant format for getting structured content into Solr has been Solr XML. Solr also supports indexing JSON, either arbitrary structured JSON or "Solr JSON" (which is similar to Solr XML). CSV is a great conduit for data into Solr, especially when the documents are homogeneous, all having the same set of fields. CSV can be conveniently exported from a spreadsheet such as Excel, or from databases such as MySQL. When getting started with Solr, it is often easiest to get your structured data into CSV format and then index that into Solr, rather than attempting a more sophisticated single-step operation.

* XML: ```sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/manufacturers.xml```
* CSV: ```sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/books.csv```
* JSON: ```sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted example/exampledocs/more_books.jsonl```

### Updating Data

You may notice that even if you index content more than once, the results found are not duplicated. This is because the example ```schema.xml``` specifies a ```uniqueKey``` field called ```id```. Whenever you POST commands to Solr to add a document with the same value for the ```uniqueKey``` as an existing document, Solr automatically replaces the existing document. You can see that this has happened by looking at the values for ```numDocs``` and ```maxDoc``` in the core-specific Overview section of the Solr Admin UI (http://localhost:8983/solr/#/gettingstarted).

```numDocs``` represents the number of searchable documents in the index (and will be larger than the number of XML, JSON, or CSV files, since some files contain more than one document). The ```maxDoc``` value may be larger, because the ```maxDoc``` count includes logically deleted documents that have not yet been physically removed from the index. You can re-post the sample files as often as you want and ```numDocs``` will never increase, because the new documents keep replacing the old ones.

### Deleting Data

You can delete data by POSTing a delete command to the update URL and specifying the value of the document's unique key field, or a query that matches multiple documents (be careful with that one!). Since these commands are smaller, we specify them right on the command line rather than reference a JSON or XML file.

Execute the following command to delete a specific document:

```
sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted -d "<delete><id>SP2514N</id></delete>"
```
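
The text mentions that a delete command can also carry a query instead of a document id. The following is a minimal sketch of how such a payload could be assembled. The helper name `delete_by_query_xml` and the example query `cat:book` are illustrative assumptions, not part of the quickstart; the sketch only prints the payload, so nothing is deleted.

```shell
#!/bin/sh
# Hypothetical helper: wrap a Solr query in a delete-by-query command.
# Assumption: the query contains no characters that need XML escaping
# (such as "<" or "&").
delete_by_query_xml() {
  printf '<delete><query>%s</query></delete>\n' "$1"
}

# Print the payload instead of sending it, so nothing is deleted by accident.
delete_by_query_xml 'cat:book'

# To actually run it, the payload could be passed to bin/post in the same way
# as the delete-by-id command (commented out on purpose):
# sudo docker exec -it --user=solr my_solr bin/post -c gettingstarted -d "$(delete_by_query_xml 'cat:book')"
```

Printing first and sending second is a deliberate choice here: a delete query that matches more than expected cannot be undone.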

### Searching

Solr can be queried via REST clients, cURL, wget, Chrome POSTMAN, etc., as well as via the native clients available for many programming languages.

The Solr Admin UI includes a query builder interface: see the ```gettingstarted``` query tab at http://localhost:8983/solr/#/gettingstarted/query. If you click the ```Execute Query``` button without changing anything in the form, you'll get 10 documents in JSON format (```*:*``` in the ```q``` param matches all documents).

The URL sent by the Admin UI to Solr is shown in light grey near the top right of the window. If you click on it, your browser will show you the raw response. To use cURL, give the same URL in quotes on the curl command line:

```
curl "http://localhost:8983/solr/gettingstarted/select?indent=on&q=*:*&wt=json"
```

#### Search for a single term

To search for a term, enter it as the ```q``` param value in the core-specific Solr Admin UI Query section, replacing ```*:*``` with the term you want to find. To search for "foundation":

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation"
```

You'll see:

```
{
  "responseHeader":{
    "status":0,
    "QTime":4,
    "params":{
      "q":"foundation",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"0553293354",
        "cat":["book"],
        "name":["Foundation"],
```

The response indicates that there are 3 hits (```"numFound":3```). All of them are returned, because the defaults ```start=0``` and ```rows=10``` return up to the first 10 results. You can specify these params to page through results, where ```start``` is the (zero-based) position of the first result to return, and ```rows``` is the page size.
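
As a rough illustration of how ```start``` and ```rows``` relate to page numbers, the sketch below computes ```start``` for a given page and prints the resulting select URL. The helper name `page_url` and the idea of 1-based page numbers are assumptions made for this example; the query is expected to be URL-encoded already.

```shell
#!/bin/sh
# Hypothetical helper: build a select URL for a 1-based page number.
# start is zero-based, so page 1 starts at 0, page 2 at rows, and so on.
page_url() {
  query="$1"
  page="$2"
  rows="$3"
  start=$(( (page - 1) * rows ))
  printf 'http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=%s&start=%s&rows=%s\n' \
    "$query" "$start" "$rows"
}

# First and third page with ten results per page.
page_url 'foundation' 1 10
page_url 'foundation' 3 10
```

The printed URLs can be passed to ```curl``` in quotes, in the same way as the other examples in this section.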

To restrict fields returned in the response, use the ```fl``` param, which takes a comma-separated list of field names. E.g. to only return the ```id``` field:

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=foundation&fl=id"
```

To restrict search to a particular field, use the syntax ```q=field:value```, e.g. to search for Foundation only in the ```name``` field:

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=name:Foundation"
```

The above request returns only one document (```"numFound":1```). From the response:

```
...
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"0553293354",
        "cat":["book"],
        "name":["Foundation"],
...
```

#### Phrase search

To search for a multi-term phrase, enclose it in double quotes: ```q="multiple terms here"```. E.g. to search for "CAS latency" (note that the space between terms must be converted to "```+```" in a URL; the Admin UI handles URL encoding for you automatically):

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=\"CAS+latency\""
```

You'll get back:

```
{
  "responseHeader":{
    "status":0,
    "QTime":25,
    "params":{
      "q":"\"CAS latency\"",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"TWINX2048-3200PRO",
        "name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],
        "manu":["Corsair Microsystems Inc."],
        "manu_id_s":"corsair",
        "cat":["electronics",
          "memory"],
        "features":["CAS latency 2, 2-3-3-6 timing, 2.75v, unbuffered, heat-spreader"],
...
```

#### Combining searches

By default, when you search for multiple terms and/or phrases in a single query, Solr will only require that one of them is present in order for a document to match. Documents containing more terms will be sorted higher in the results list.

You can require that a term or phrase is present by prefixing it with a "```+```"; conversely, to disallow the presence of a term or phrase, prefix it with a "```-```".

To find documents that contain both terms "```apple```" and "```ipod```", enter ```+apple``` ```+ipod``` in the ```q``` param in the Admin UI Query tab. Because the "```+```" character has a reserved purpose in URLs (encoding the space character), you must URL encode it for ```curl``` as "```%2B```":

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=%2Bapple+%2Bipod"
```

To search for documents that contain the term "```apple```" but don't contain the term "```ipod```", enter ```+apple``` ```-ipod``` in the ```q``` param in the Admin UI. Again, URL encode "```+```" as "```%2B```":

```
curl "http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=%2Bapple+-ipod"
```
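
The encoding rules used in the phrase and combined searches (space becomes ```+```, a literal ```+``` becomes ```%2B```) can be tried out with a small helper. This is only a sketch: the function name `encode_q` is made up for this example, and it handles just the four characters listed in the comment, so it is not a general URL encoder.

```shell
#!/bin/sh
# Hypothetical helper: encode only "%", "+", the double quote and the space.
# Order matters: "%" first, then "+", and only at the end turn spaces into "+".
encode_q() {
  printf '%s\n' "$1" | sed -e 's/%/%25/g' -e 's/+/%2B/g' -e 's/"/%22/g' -e 's/ /+/g'
}

encode_q '+apple +ipod'
encode_q '+apple -ipod'
encode_q '"CAS latency"'
```

The first two lines of output reproduce the ```q``` values of the two ```curl``` commands in this subsection. As an alternative, ```curl``` can do the encoding itself when called with ```--get``` together with ```--data-urlencode```, for example ```--data-urlencode 'q=+apple +ipod'```.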

#### In depth

For more Solr search options, see the Solr Reference Guide's Searching section: https://cwiki.apache.org/confluence/display/solr/Searching

### Faceting

One of Solr's most popular features is faceting. Faceting allows the search results to be arranged into subsets (or buckets or categories), providing a count for each subset. There are several types of faceting: field values, numeric and date ranges, pivots (decision tree), and arbitrary query faceting.

#### Field facets

In addition to providing search results, a Solr query can return the number of documents that contain each unique value in the whole result set.

From the core-specific Admin UI Query tab, if you check the ```"facet"``` checkbox, you'll see a few facet-related options appear.

To see facet counts from all documents (```q=*:*```): turn on faceting (```facet=true```), and specify the field to facet on via the ```facet.field``` param. If you only want facets, and no document contents, specify ```rows=0```. The ```curl``` command below will return facet counts for the ```manu_id_s``` field:

```
curl 'http://localhost:8983/solr/gettingstarted/select?wt=json&indent=true&q=*:*&rows=0'\
'&facet=true&facet.field=manu_id_s'
```

In your terminal, you'll see:

```
{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "q":"*:*",
      "facet.field":"manu_id_s",
      "indent":"true",
      "rows":"0",
      "wt":"json",
      "facet":"true"}},
  "response":{"numFound":49,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "manu_id_s":[
        "corsair",3,
        "belkin",2,
        "canon",2,
        "apple",1,
        "asus",1,
        "ati",1,
        "boa",1,
        "dell",1,
        "eu",1,
        "maxtor",1,
        "nor",1,
        "uk",1,
        "viewsonic",1,
        "samsung",0]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
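
In this response, ```facet_fields``` is a flat list in which each value is followed by its count. The sketch below shows one way to turn such a list into value/count pairs with standard tools. It works on a shortened, hard-coded copy of the list, the helper name `facet_pairs` is an assumption, and it would break on values that contain commas, so a real JSON parser is the better choice for anything beyond a quick look.

```shell
#!/bin/sh
# A shortened copy of the facet list from the response above.
facet_list='"corsair",3,"belkin",2,"canon",2,"apple",1,"samsung",0'

# Put every element on its own line, drop the quotes, then join each
# value with the count that follows it (tab-separated).
facet_pairs() {
  printf '%s\n' "$1" | tr ',' '\n' | tr -d '"' | paste - -
}

facet_pairs "$facet_list"
```

Each output line holds one facet value and its document count, which makes it easy to spot, for instance, that ```samsung``` has a count of 0.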

#### Range facets

For numerics or dates, it's often desirable to partition the facet counts into ranges rather than discrete values. A prime example of numeric range faceting, using the example product data, is ```price```. The data for these price range facets can be seen in JSON format with this command:

```
curl 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=on&rows=0'\
'&facet=true'\
'&facet.range=price'\
'&f.price.facet.range.start=0'\
'&f.price.facet.range.end=600'\
'&f.price.facet.range.gap=50'\
'&facet.range.other=after'
```

In your terminal you will see:

```
{
  "responseHeader":{
    "status":0,
    "QTime":41,
    "params":{
      "facet.range":"price",
      "q":"*:*",
      "f.price.facet.range.start":"0",
      "facet.range.other":"after",
      "indent":"on",
      "f.price.facet.range.gap":"50",
      "rows":"0",
      "wt":"json",
      "facet":"true",
      "f.price.facet.range.end":"600"}},
  "response":{"numFound":49,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{},
    "facet_ranges":{
      "price":{
        "counts":[
          "0.0",19,
          "50.0",1,
          "100.0",0,
          "150.0",2,
          "200.0",0,
          "250.0",1,
          "300.0",1,
          "350.0",2,
          "400.0",0,
          "450.0",1,
          "500.0",0,
          "550.0",0],
        "gap":50.0,
        "after":2,
        "start":0.0,
        "end":600.0}},
    "facet_intervals":{},
    "facet_heatmaps":{}}}
```
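
The twelve entries under ```counts``` follow directly from the ```start```, ```end``` and ```gap``` values of the request. The following sketch recomputes the bucket boundaries locally, without contacting Solr. The helper name `list_buckets` is an assumption, and the sketch presumes Solr's default setting in which each bucket includes its lower bound but not its upper bound.

```shell
#!/bin/sh
# Values from the request above.
start=0
end=600
gap=50

# Number of buckets between start and end.
bucket_count=$(( (end - start) / gap ))
echo "buckets: $bucket_count"

# Hypothetical helper: print lower and upper bound of each bucket.
# Arguments: start, end, gap (integers).
list_buckets() {
  lower=$1
  while [ "$lower" -lt "$2" ]; do
    echo "$lower - $(( lower + $3 ))"
    lower=$(( lower + $3 ))
  done
}

list_buckets "$start" "$end" "$gap"
```

Documents priced above the last bucket are not part of these twelve ranges; they show up separately as ```"after":2``` because the request sets ```facet.range.other=after```.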

#### Pivot facets

Another faceting type is pivot facets, also known as "decision trees", allowing two or more fields to be nested for all the various possible combinations. Using the example technical product data, pivot facets can be used to see how many of the products in the "book" category (the ```cat``` field) are in stock or not in stock. Here's how to get at the raw data for this scenario:

```
curl 'http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0&wt=json&indent=on'\
'&facet=on&facet.pivot=cat,inStock'
```

This results in the following response (trimmed to just the book category output), which says that out of 14 items in the "book" category, 12 are in stock and 2 are not in stock:

```
...
  "facet_pivot":{
    "cat,inStock":[{
        "field":"cat",
        "value":"book",
        "count":14,
        "pivot":[{
            "field":"inStock",
            "value":true,
            "count":12},
          {
            "field":"inStock",
            "value":false,
            "count":2}]},
...
```

#### More faceting options

For the full scoop on Solr faceting, visit the Solr Reference Guide's Faceting section: https://cwiki.apache.org/confluence/display/solr/Faceting
@ -0,0 +1,79 @@
# 8.2 Configuring the Solr schema

Since Solr version 6.0, the so-called "managed schema" is the default; together with automatic field detection this is also referred to as "schemaless mode". While indexing, Solr analyzes the data and tries to generate the schema itself. Fields can still be defined manually in addition.

Further reading:

* https://support.lucidworks.com/hc/en-us/articles/221618187-What-is-Managed-Schema-
* http://www.solrtutorial.com/schema-xml.html

## Task 1: Configure the schema for the sample data

Create two fields, "Title" and "Contributor".

Notes:

* Open the admin UI. In the "Core Selector" menu, select the index "gettingstarted". Then open "Schema" in the second menu.

## Solution

* Admin UI: {%s%}http://192.168.1.1:8983/solr/#/gettingstarted/schema{%ends%}
* Add the field Title: {%s%}Click the "Add Field" button, enter Title in the name field, and select a field type, for example "string"{%ends%}
* Add the field Contributor: {%s%}Click the "Add Field" button, enter Contributor in the name field, and select a field type, for example "string"{%ends%}

## Task 2: Load sample data via the admin UI

Load the following CSV sample data into Solr via the admin UI:

```
id,Contributor,Title
1,Klaus Gantert,Bibliothekswissen
2,Prof. Christine Gläser und Ursula Schulz, Bibliotheken als Schmelztiegel der Kulturen – ein Bericht aus der Werkstatt ethnographischer Methoden der Kundenforschung.
```

Notes:

* Open the admin UI. In the "Core Selector" menu, select the index "gettingstarted". Then open "Documents" in the second menu.
* Finally, check that the data has been indexed, either with a query in the admin UI or via the browse UI.

## Solution

* Admin UI: {%s%}http://192.168.1.1:8983/solr/#/gettingstarted/documents{%ends%}
* Load the data: {%s%}Select "CSV" as the document type and paste the text above into the text field{%ends%}
* Check: {%s%}Search for gantert in the browse UI at http://192.168.1.1:8983/solr/gettingstarted/browse/{%ends%}

## Task 3: Configure the schema via the admin UI

Notes:

* Use the script ```count-tsv.sh``` from Chapter 7.6, Task 1, to check which columns of the processed data hold multiple values. If the column "Mehrfachbelegung" (multiple values) shows a value greater than 0, the field should be marked as "multiValued".
* Create a field in the schema for every column in the TSV data.
* Open the admin UI. In the "Core Selector" menu, select the index "gettingstarted". Then open "Schema" in the second menu.
* In the following Chapter 8.3 we will index the data in Solr. Solr detects the vast majority of fields automatically. If you want to save yourself some work, define only the fields ```ISBN``` and ```DDC``` manually; Solr should detect all other fields automatically. If you prefer to play it safe, create all fields manually.
* Field names are case-sensitive.

## Solution

* Check for multiple values: {%s%}./count-tsv.sh ~/tsv/haw-prozessiert.tsv{%ends%}
* Admin UI: {%s%}http://192.168.1.1:8983/solr/#/gettingstarted/schema{%ends%}
* Add the field ISBN: {%s%}Click the "Add Field" button, enter ISBN in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field ISSN: {%s%}Click the "Add Field" button, enter ISSN in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Sprache: {%s%}Click the "Add Field" button, enter Sprache in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field LCC: {%s%}Click the "Add Field" button, enter LCC in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field DDC: {%s%}Click the "Add Field" button, enter DDC in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Urheber: {%s%}Click the "Add Field" button, enter Urheber in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Titel: {%s%}Click the "Add Field" button, enter Titel in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Medientyp: {%s%}Click the "Add Field" button, enter Medientyp in the name field, select "string" as the field type, and do NOT mark it as multiValued{%ends%}
* Add the field Ort: {%s%}Click the "Add Field" button, enter Ort in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Verlag: {%s%}Click the "Add Field" button, enter Verlag in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Jahr: {%s%}Click the "Add Field" button, enter Jahr in the name field, select "TrieLong" as the field type, and do NOT mark it as multiValued{%ends%}
* Add the field Datum: {%s%}Click the "Add Field" button, enter Datum in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Beschreibung: {%s%}Click the "Add Field" button, enter Beschreibung in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Schlagwoerter: {%s%}Click the "Add Field" button, enter Schlagwoerter in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Beitragende: {%s%}Click the "Add Field" button, enter Beitragende in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Reihe: {%s%}Click the "Add Field" button, enter Reihe in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Vorgaenger: {%s%}Click the "Add Field" button, enter Vorgaenger in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Nachfolger: {%s%}Click the "Add Field" button, enter Nachfolger in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
* Add the field Link: {%s%}Click the "Add Field" button, enter Link in the name field, select "string" as the field type, and mark it as multiValued{%ends%}
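
Clicking through the "Add Field" dialog nineteen times is tedious. As a possible alternative, Solr's Schema API accepts ```add-field``` commands as JSON. The sketch below only generates such payloads for the string fields in the list; it does not send them. The helper name `add_field_json` is an assumption, and the field Jahr is left out on purpose because the name of its numeric field type depends on the configuration in use.

```shell
#!/bin/sh
# Hypothetical helper: print a Schema API "add-field" command for one field.
# Arguments: field name, field type, multiValued (true or false).
add_field_json() {
  printf '{"add-field":{"name":"%s","type":"%s","multiValued":%s}}\n' "$1" "$2" "$3"
}

# Multi-valued string fields from the solution list.
for name in ISBN ISSN Sprache LCC DDC Urheber Titel Ort Verlag Datum \
            Beschreibung Schlagwoerter Beitragende Reihe Vorgaenger Nachfolger Link; do
  add_field_json "$name" string true
done

# Medientyp is single-valued.
add_field_json Medientyp string false

# Sending one payload to Solr could look like this (commented out, untested here):
# curl -X POST -H 'Content-type:application/json' \
#   --data-binary "$(add_field_json ISBN string true)" \
#   "http://localhost:8983/solr/gettingstarted/schema"
```

Generating the payloads from a single list of names also reduces the risk of typos in upper and lower case, which matter for field names.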
@ -0,0 +1,51 @@
# 8.3 Loading TSV files into Solr

## Reload the configuration

* Open the "Core Admin" menu: http://192.168.1.1:8983/solr/#/~cores/gettingstarted
* Click the "Reload" button

## Empty the index (in the terminal)

The following command deletes all data in the index ```gettingstarted```:

```
curl "http://localhost:8983/solr/gettingstarted/update?commit=true&stream.body=%3Cdelete%3E%3Cquery%3E*%3A*%3C/query%3E%3C/delete%3E"
```
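
The ```stream.body``` parameter in this command is hard to read because it is percent-encoded. The sketch below decodes it locally to show which delete command is actually sent. The helper name `decode_body` is an assumption, and it only handles the three escape sequences that occur in this URL.

```shell
#!/bin/sh
# The stream.body value from the command above.
encoded='%3Cdelete%3E%3Cquery%3E*%3A*%3C/query%3E%3C/delete%3E'

# Hypothetical helper: decode only the escapes that occur here:
# %3C is "<", %3E is ">", %3A is ":".
decode_body() {
  printf '%s\n' "$1" | sed -e 's/%3C/</g' -e 's/%3E/>/g' -e 's/%3A/:/g'
}

decode_body "$encoded"
```

The decoded text is a delete-by-query command whose query ```*:*``` matches every document, which is why the whole index is emptied.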

## Load the data (in the terminal)

The following command indexes the data from the file ```haw-prozessiert.tsv```. The command is this long because Solr has to be told which fields hold multiple values and which character separates those values. It runs for about 5 minutes and prints no status messages in the meantime, so please be patient.

```
cd ~/tsv/
curl "http://localhost:8983/solr/gettingstarted/update/csv?commit=true&separator=%09&f.ISBN.split=true&f.ISBN.separator=%E2%90%9F&f.ISSN.split=true&f.ISSN.separator=%E2%90%9F&f.Sprache.split=true&f.Sprache.separator=%E2%90%9F&f.LCC.split=true&f.LCC.separator=%E2%90%9F&f.DDC.split=true&f.DDC.separator=%E2%90%9F&f.Urheber.split=true&f.Urheber.separator=%E2%90%9F&f.Ort.split=true&f.Ort.separator=%E2%90%9F&f.Verlag.split=true&f.Verlag.separator=%E2%90%9F&f.Datum.split=true&f.Datum.separator=%E2%90%9F&f.Beschreibung.split=true&f.Beschreibung.separator=%E2%90%9F&f.Schlagwoerter.split=true&f.Schlagwoerter.separator=%E2%90%9F&f.Beitragende.split=true&f.Beitragende.separator=%E2%90%9F&f.Reihe.split=true&f.Reihe.separator=%E2%90%9F&f.Vorgaenger.split=true&f.Vorgaenger.separator=%E2%90%9F&f.Nachfolger.split=true&f.Nachfolger.separator=%E2%90%9F&f.Link.split=true&f.Link.separator=%E2%90%9F&f.Titel.split=true&f.Titel.separator=%E2%90%9F" --data-binary @haw-prozessiert.tsv -H 'Content-type:text/plain; charset=utf-8'
```
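
The long URL follows a simple pattern: every multi-valued field contributes one ```split``` and one ```separator``` parameter. The sketch below rebuilds the URL from a list of field names, which may be easier to adapt than editing the one-liner by hand. The variable `fields` and the function `build_update_url` are names made up for this example; the sketch only prints the URL and does not upload anything.

```shell
#!/bin/sh
# Multi-valued fields, in the order used in the command above.
fields="ISBN ISSN Sprache LCC DDC Urheber Ort Verlag Datum Beschreibung Schlagwoerter Beitragende Reihe Vorgaenger Nachfolger Link Titel"

# %09 is the tab that separates the columns; %E2%90%9F is the UTF-8 encoding
# of the character U+241F, which separates multiple values within a column.
build_update_url() {
  url="http://localhost:8983/solr/gettingstarted/update/csv?commit=true&separator=%09"
  for f in $fields; do
    url="${url}&f.${f}.split=true&f.${f}.separator=%E2%90%9F"
  done
  printf '%s\n' "$url"
}

build_update_url

# The upload itself would then be (commented out, needs a running Solr):
# curl "$(build_update_url)" --data-binary @haw-prozessiert.tsv -H 'Content-type:text/plain; charset=utf-8'
```

If a column turns out not to hold multiple values after all, removing its name from `fields` is enough to drop both of its parameters.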

If you would rather index the data from the automated processing, change to the directory ```~/refine/```, use ```ls``` to check the file name, and replace ```haw-prozessiert.tsv``` at the end of the command with that file name.

Further reading:

* https://wiki.apache.org/solr/UpdateCSV#Updating_a_Solr_Index_with_CSV

## Check the result

Open the browse UI (http://192.168.1.1:8983/solr/gettingstarted/browse). It should report more than 200,000 documents. Run a few sample searches to make sure the data has been indexed correctly.

## Stopping and starting Solr

The Docker container my_solr was started in Chapter 8.1 as a background process, which should keep running until the next reboot of the machine. You can stop and start the container manually at any time.

Stop Solr:

```
sudo docker stop my_solr
```

Start Solr:

```
sudo docker start my_solr
```

About 15 to 30 seconds after the start command, the admin UI and the browse UI should be reachable at the usual addresses.