Extension Fulltext Search


Extension Basics

Title
Fulltext Search
Name
ckanext-fulltext
Type
Public extension
Description
Full-text search plugin for CKAN that stores and searches full-text data using Solr, with Apache Tika for document parsing.
CKAN versions
Download-Url (zip)
Last commit
2023-08-31 22:18:52
Url to repo
Category
Data Management & Quality


Background Infos

Description (long)

ckanext-fulltext - Fulltext searching plugin for CKAN

This extension provides plugins that allow CKAN to store and search full text data. It uses a new Solr field to do a full text search and then display the matches in CKAN.

The full text field lets users find datasets that contain the text they are looking for, even when that text is not part of any CKAN field. The full text is therefore stored separately from the other CKAN package data, both in Solr and in the PostgreSQL database.

Additionally, you can parse the full text of documents using a JCC wrapper for Apache Tika.

Plugin Installation

  1. Install the extension into your Python environment:

    (pyenv) $ pip install -e git+https://github.com/transparenzportalhamburg/ckanext-fulltext.git#egg=ckanext-fulltext
    
  2. Your CKAN configuration ini file should contain the following plugin:

    ckan.plugins = inforeg_solr_search
    
  3. Add a new field to your conf/schema.xml that acts like a catch-all field for the content of all resources:

    <field name="fulltext" type="textgen" indexed="true" stored="true"/>
    ...
    <copyField source="fulltext" dest="text"/>
    
  4. Create a fulltext table:

    paster --plugin=ckanext-fulltext fulltext init_fulltext_table --config=/etc/ckan/default/development.ini
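
A quick way to confirm that step 3 took effect is to parse the schema and look for both entries. The following is a minimal sketch, not part of the extension: the function name `check_fulltext_schema` and the sample schema are hypothetical, and a real CKAN schema.xml contains many more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened schema used only for illustration.
SAMPLE_SCHEMA = """<schema name="ckan" version="1.6">
  <fields>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    <field name="fulltext" type="textgen" indexed="true" stored="true"/>
  </fields>
  <copyField source="fulltext" dest="text"/>
</schema>"""


def check_fulltext_schema(schema_xml):
    """Return a list of problems; an empty list means both entries exist."""
    root = ET.fromstring(schema_xml)
    problems = []
    # Look for the catch-all field added in step 3.
    has_field = any(f.get("name") == "fulltext" for f in root.iter("field"))
    # Look for the copyField that feeds the default search field.
    has_copy = any(
        c.get("source") == "fulltext" and c.get("dest") == "text"
        for c in root.iter("copyField")
    )
    if not has_field:
        problems.append('missing <field name="fulltext" .../>')
    if not has_copy:
        problems.append('missing <copyField source="fulltext" dest="text"/>')
    return problems


print(check_fulltext_schema(SAMPLE_SCHEMA))
```

To check a real installation, read the contents of your conf/schema.xml and pass them to the function instead of the sample string.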
    

Tika-Wrapper Installation (for Ubuntu)

To use the tikaparser, you have to install JCC. JCC requires a recent C++ compiler and a Java JDK 1.7+.

sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install openjdk-7-jdk
pip install jcc

Now install the tikaparser:

cd /path/to/ckanext-fulltext/ckanext/fulltext/parser
python setup.py build
python setup.py install
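
Before building the parser, it can help to verify that the prerequisites named above are present. This is a minimal sketch under assumptions: the function name `check_tika_prerequisites` is hypothetical, and it assumes the compiler is available as `g++` and the JDK exposes `javac` on the PATH, which may differ on your system.

```python
import importlib.util
import shutil


def check_tika_prerequisites():
    """Report which build prerequisites appear to be available."""
    return {
        # Assumption: the C++ compiler is installed as g++ (build-essential).
        "c++ compiler": shutil.which("g++") is not None,
        # Assumption: a JDK (not just a JRE) puts javac on the PATH.
        "java compiler": shutil.which("javac") is not None,
        # True once `pip install jcc` has succeeded in this environment.
        "jcc module": importlib.util.find_spec("jcc") is not None,
    }


for name, found in check_tika_prerequisites().items():
    print(name, "found" if found else "missing")
```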

API usage

Once you have downloaded an online resource whose full text you want to search, create a package with a new metadata field full_text_search to store that text.
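
As a rough illustration, the request could be built against CKAN's standard package_create action. This sketch assumes that full_text_search is passed as an extras entry (suggested by its appearance in hide.extras.fields, but not confirmed here); the helper names, the URL and the API key are hypothetical placeholders.

```python
import json
import urllib.request


def build_package_payload(name, full_text):
    """Build a package dict carrying the full text as an extras entry."""
    return {
        "name": name,
        # Assumption: the extension reads the text from this extras key.
        "extras": [{"key": "full_text_search", "value": full_text}],
    }


def build_create_request(ckan_url, api_key, payload):
    """Prepare (but do not send) a POST to CKAN's package_create action."""
    return urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


payload = build_package_payload("my-dataset", "Full text extracted from the document.")
# Placeholder URL and key; replace both with values for your CKAN site.
request = build_create_request("http://localhost:5000", "YOUR-API-KEY", payload)
print(request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it to the CKAN site.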

Hide extras fields

You can set options in the CKAN config file to specify extras fields that are hidden from all users except sysadmins:

ckan.fulltext.hide.fields = extras_field1 extras_field2 ...
hide.extras.fields = full_text_search extras_field1 extras_field2 ...
hide.main.fields = maintainer_email author_email ...
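
The effect of these options can be pictured with a small sketch. This is an illustration of the idea only, not the extension's actual implementation: the function names are hypothetical, and it assumes each option is a whitespace-separated list of field names.

```python
def parse_field_list(value):
    """Split a whitespace-separated config value into field names."""
    return value.split()


def visible_extras(extras, hidden_fields, is_sysadmin):
    """Return the extras a user may see; sysadmins see everything."""
    if is_sysadmin:
        return list(extras)
    hidden = set(hidden_fields)
    return [e for e in extras if e["key"] not in hidden]


hidden = parse_field_list("full_text_search extras_field1 extras_field2")
extras = [
    {"key": "full_text_search", "value": "long document text"},
    {"key": "license_note", "value": "public"},
]
print(visible_extras(extras, hidden, is_sysadmin=False))
```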

Copying and License

This material is copyright (c) 2015 Fachliche Leitstelle Transparenzportal, Hamburg, Germany. Licensed under the GNU Affero General Public License (AGPL) v3.0.

Version
Version release date
(not set)
Contact name
Transparenzportal Hamburg
Contact email
(not set)
Contact Url


Installation Guide

Configuration hints

Requires a Solr schema modification to add the fulltext field. Document parsing requires Apache Tika and JCC. The fulltext table is initialized with a paster command.

Plugins to configure (ckan.ini)
inforeg_solr_search
CKAN Settings (ckan.ini)
# ckan.fulltext.hide.fields = extras_field1 extras_field2
# hide.extras.fields = full_text_search extras_field1 extras_field2
# hide.main.fields = maintainer_email author_email
DB migration to be executed
(not set)