ckanext-embeddings
An extension that uses Machine Learning Embeddings to provide similarity-based features for CKAN portals.
Note: This is an alpha version and has not been tested in any real-world deployment.
What This Does
Embeddings are a Machine Learning technique for encoding complex pieces of information, such as text or images, as numerical vectors (lists of numbers) that capture the relationships and similarities between them. The closer two vectors are, the more similar the corresponding objects are according to the underlying model used to create the embeddings.
In the context of CKAN, this plugin computes embeddings for all datasets using their metadata (title, description, or any other relevant metadata field). Being able to compare datasets allows building features that increase discoverability of relevant data for users.
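To make "the closer the vectors are" concrete, here is a minimal sketch that ranks a few datasets by cosine similarity. The dataset names and three-dimensional vectors are made up for illustration, and cosine similarity is an assumption about the distance measure rather than a description of what the plugin does internally.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_similar(target, embeddings, limit=5):
    """Rank every other dataset by similarity to `target`, most similar first."""
    scores = [
        (name, cosine_similarity(embeddings[target], vector))
        for name, vector in embeddings.items()
        if name != target
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:limit]


# Hypothetical toy vectors; real embeddings have far more dimensions.
embeddings = {
    "air-quality-2023": [0.9, 0.1, 0.0],
    "air-pollution-sensors": [0.8, 0.2, 0.1],
    "library-opening-hours": [0.0, 0.1, 0.9],
}

ranking = rank_similar("air-quality-2023", embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The two air-related datasets score close to 1, while the unrelated one scores close to 0.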
Features
1. Similar Datasets
By computing embeddings for all datasets and ranking them against a particular dataset, we can get the most similar datasets. This similarity won’t just take text-based similarity into account but also the meaning and context of the dataset metadata.
The plugin adds a package_similar_show action that returns the closest datasets to the one provided with the id parameter. By default it returns 5 datasets; use the limit parameter to change this.
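As a rough sketch, the action could be called through CKAN's standard action API like any other action. The site URL and dataset name are placeholders, and posting a JSON body is an assumption chosen because it works for any CKAN action; the request is only built here, not sent.

```python
import json
import urllib.request


def build_similar_request(ckan_url, dataset_id, limit=5):
    """Build (but do not send) a POST request for package_similar_show."""
    payload = {"id": dataset_id, "limit": limit}
    return urllib.request.Request(
        f"{ckan_url.rstrip('/')}/api/3/action/package_similar_show",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder site URL and dataset name.
request = build_similar_request("https://ckan.example.org", "my-dataset", limit=3)

# To send it, pass the request to urllib.request.urlopen() and read the
# "result" key of the JSON response, as with any CKAN action.
```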
2. Semantic Search
Rank dataset embeddings against an arbitrary query term, returning the most similar datasets. Uses Solr’s Dense Vector Search capability.
Pass extras={'ext_vector_search':'true'} to the package_search action to perform a semantic search instead of the default Solr search.
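A minimal sketch of the parameters for such a search follows. The helper name, query text and rows value are illustrative; only the extras entry comes from the description above.

```python
import json


def semantic_search_params(query, rows=10):
    """Build package_search parameters that request a semantic search."""
    return {
        "q": query,
        "rows": rows,
        # This flag switches package_search to the vector-based search.
        "extras": {"ext_vector_search": "true"},
    }


params = semantic_search_params("air pollution near schools")

# From code running inside CKAN, the dict can be passed to the action, e.g.
#   toolkit.get_action("package_search")({}, params)
# Over HTTP, post it as JSON to /api/3/action/package_search.
body = json.dumps(params)
print(body)
```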
Requirements
Tested on CKAN 2.10/master. Requires at least CKAN 2.10.4 (or 2.9.11 on the 2.9 line).
The Semantic Search feature requires a custom Solr schema with a Dense Vector Search field. A Dockerfile based on official CKAN Solr images is included.
Installation
Activate your CKAN virtual environment:
. /usr/lib/ckan/default/bin/activate
Clone the source and install:
git clone https://github.com/amercader/ckanext-embeddings.git
cd ckanext-embeddings
pip install .
pip install -r requirements.txt
Add embeddings to the ckan.plugins setting.
Add configuration:
ckan.search.solr_allowed_query_parsers = knn
Restart the CKAN process.
Customizing
Choose the backend for generating embeddings via ckanext.embeddings.backend:
- sentence_transformers (default): Runs locally using Sentence Transformers’ all-MiniLM-L6-v2 model
- openai: Uses OpenAI’s Embeddings API (requires API key)
You can also provide custom backends by extending BaseEmbeddingsBackend.
Configuration
ckanext.embeddings.backend = sentence_transformers
# Alternatively, set the OPENAI_API_KEY environment variable
ckanext.embeddings.openai.api_key = your_api_key
ckanext.embeddings.solr_vector_field_name = vector
License
AGPL-3.0 license