ckanext-fulltext - Fulltext searching plugin for CKAN
This extension provides plugins that allow CKAN to store and search full text data. It uses a new Solr field to do a full text search and then display the matches in CKAN.
The full text field lets users find datasets that contain the text they are looking for, even when that text is not part of any CKAN field. The full text is stored separately from the other CKAN package data, both in Solr and in the PostgreSQL database.
Additionally, you can parse the full text of documents using a JCC wrapper for Apache Tika.
Plugin Installation
Install the extension into your Python environment:
(pyenv) $ pip install -e git+https://github.com/transparenzportalhamburg/ckanext-fulltext.git#egg=ckanext-fulltext
Your CKAN configuration ini file should contain the following plugin:
ckan.plugins = inforeg_solr_search
Add a new field to your conf/schema.xml that acts like a catch-all field for the content of all resources:
<field name="fulltext" type="textgen" indexed="true" stored="true"/>
...
<copyField source="fulltext" dest="text"/>
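After editing the schema, it can help to confirm that both entries are actually present. The following is a minimal sketch, not part of the extension: check_fulltext_schema is a hypothetical helper, and SAMPLE_SCHEMA is a made-up, stripped-down schema used only for demonstration. In practice you would read your own conf/schema.xml and pass its contents in.

```python
import xml.etree.ElementTree as ET


def check_fulltext_schema(schema_xml):
    """Return (has_field, has_copy_field) for a Solr schema given as a string."""
    root = ET.fromstring(schema_xml)
    # iter() walks the whole tree, so it does not matter whether <field>
    # sits directly under <schema> or inside a <fields> wrapper.
    has_field = any(
        f.get("name") == "fulltext" and f.get("indexed") == "true"
        for f in root.iter("field")
    )
    has_copy_field = any(
        c.get("source") == "fulltext" and c.get("dest") == "text"
        for c in root.iter("copyField")
    )
    return has_field, has_copy_field


# Hypothetical minimal schema, only for demonstration.
SAMPLE_SCHEMA = """
<schema name="ckan" version="1.4">
  <fields>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    <field name="fulltext" type="textgen" indexed="true" stored="true"/>
  </fields>
  <copyField source="fulltext" dest="text"/>
</schema>
"""

print(check_fulltext_schema(SAMPLE_SCHEMA))
```

If either value comes back False, the corresponding line is missing or differs from the one shown above. Remember that Solr has to be restarted before schema changes take effect.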
Create a fulltext table:
paster --plugin=ckanext-fulltext fulltext init_fulltext_table --config=/etc/ckan/default/development.ini
Tika-Wrapper Installation (for Ubuntu)
To use the tikaparser you have to install JCC. JCC requires a recent C++ compiler and a Java JDK, version 1.7 or later.
sudo apt-get update
sudo apt-get install build-essential
sudo apt-get install openjdk-7-jdk
pip install jcc
Now install the tikaparser:
cd /path/to/ckanext-fulltext/ckanext/fulltext/parser
python setup.py build
python setup.py install
API usage
Once you have downloaded the full text of an online resource that you want to make searchable, create a package with a new metadata field, full_text_search, to store that text.
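As a rough illustration, a package carrying the full text could be created through CKAN's package_create action. This is a sketch under assumptions, not the extension's documented interface: it assumes full_text_search is passed as a regular CKAN extra, it is written as a standalone Python 3 client script, and the helper names, dataset name, URL and API key are placeholders.

```python
import json
import urllib.request


def build_package_payload(name, full_text):
    """Build a package_create payload that carries the full text as an extra."""
    return {
        "name": name,
        # Assumption: the full text travels as a regular CKAN extra.
        "extras": [{"key": "full_text_search", "value": full_text}],
    }


def create_package(ckan_url, api_key, payload):
    """POST the payload to CKAN's package_create action and return the result."""
    request = urllib.request.Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["result"]


# Placeholder values; nothing is sent until create_package() is called.
payload = build_package_payload("annual-report-2015", "Text extracted from the report ...")
print(json.dumps(payload, indent=2))
```

To send the request, call create_package with the base URL of your CKAN instance and the API key of a user who is allowed to create datasets.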
Hide extras fields
You can set options in the CKAN config file to specify fields that are hidden from every user except sysadmins:
ckan.fulltext.hide.fields = extras_field1 extras_field2 ...
hide.extras.fields = full_text_search extras_field1 extras_field2 ...
hide.main.fields = maintainer_email author_email ...
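The effect of these options can be pictured with a small sketch. It is not the plugin's actual code: it only assumes that each option is a whitespace-separated list of field names and that matching extras are dropped for everyone who is not a sysadmin. The function names and sample data are hypothetical.

```python
def parse_field_list(raw_value):
    """Split a whitespace-separated config value into a set of field names."""
    return set((raw_value or "").split())


def visible_extras(extras, hidden_fields, is_sysadmin):
    """Return the extras a user may see; sysadmins see everything."""
    if is_sysadmin:
        return list(extras)
    return [e for e in extras if e["key"] not in hidden_fields]


# Hypothetical config value and package extras, for demonstration only.
hidden = parse_field_list("full_text_search extras_field1 extras_field2")
extras = [
    {"key": "full_text_search", "value": "long document text"},
    {"key": "license_note", "value": "CC-BY"},
]
print(visible_extras(extras, hidden, is_sysadmin=False))
```

Listing full_text_search among the hidden extras, as in the example value above, keeps the potentially very long document text out of the dataset page for regular users.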
Copying and License
This material is copyright (c) 2015 Fachliche Leitstelle Transparenzportal, Hamburg, Germany.
Licensed under the GNU Affero General Public License (AGPL) v3.0.