Extension Embeddings


Extension Basics

Title
Embeddings
Name
ckanext-embeddings
Type
Public extension
Description
CKAN extension using Machine Learning Embeddings to provide similarity-based features including similar dataset discovery and semantic search in Solr.
CKAN versions
~2.10, ~2.11

Download-Url (zip)
Last commit
2024-05-22 10:47:42
Url to repo
Category
Data Management & Quality


Background Infos

Description (long)

ckanext-embeddings

An extension that uses Machine Learning Embeddings to provide similarity-based features for CKAN portals.

Note: This is an alpha version and has not been tested in any real-world deployment.

What This Does

Embeddings are a Machine Learning technique for encoding complex pieces of information, such as text or images, as numerical vectors (lists of numbers) that capture the relationships and similarities between them. The closer two vectors are, the more similar the objects they represent, according to the underlying model used to create the embeddings.
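To make "closer vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three vectors are made up for illustration, and this page does not state which distance measure the extension or its Solr field actually uses.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (assumption: real models produce
# vectors with hundreds of dimensions).
air_quality = [0.9, 0.1, 0.0]
air_pollution = [0.8, 0.2, 0.1]
bus_timetables = [0.0, 0.2, 0.9]

related = cosine_similarity(air_quality, air_pollution)
unrelated = cosine_similarity(air_quality, bus_timetables)
print(related > unrelated)
```

A score near 1 means the two vectors point in almost the same direction, so the two related datasets score higher than the unrelated pair.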

In the context of CKAN, this plugin computes embeddings for all datasets using their metadata (title, description, or any other relevant metadata field). Being able to compare datasets allows building features that increase discoverability of relevant data for users.

Features

1. Similar Datasets

By computing embeddings for all datasets and ranking them against a particular dataset, we can find the most similar datasets. This similarity won’t just take text-based similarity into account but also the meaning and context of the dataset metadata.

The plugin adds a package_similar_show action that returns the datasets closest to the one given in the id parameter. Five datasets are returned by default; use the limit parameter to change this.
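As a rough sketch, the action could be called over CKAN's standard action API like this. The site URL and dataset name are placeholders, and the shape of the returned result is not documented on this page, so the example only builds the request and leaves sending it as a comment.

```python
import json
import urllib.request


def build_similar_request(site_url, dataset_id, limit=5):
    """Build a POST request for the package_similar_show action."""
    payload = {"id": dataset_id, "limit": limit}
    return urllib.request.Request(
        url=f"{site_url.rstrip('/')}/api/3/action/package_similar_show",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder site and dataset name (assumptions for illustration only).
request = build_similar_request("https://ckan.example.org", "my-dataset", limit=3)

# To send it against a running site with the plugin enabled:
# with urllib.request.urlopen(request) as response:
#     similar = json.load(response)["result"]
```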

2. Semantic Search

Rank dataset embeddings against an arbitrary query term, returning the most similar datasets. Uses Solr’s Dense Vector Search capability.

Pass extras={'ext_vector_search':'true'} to the package_search action to perform a semantic search instead of the default Solr search.
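A minimal sketch of the parameters for such a call follows. The helper name and the example query are hypothetical; q, rows and extras are standard package_search parameters. Note that the flag value is the string 'true', as in the text above.

```python
def semantic_search_params(query, rows=10):
    """Build a package_search data_dict that requests semantic search."""
    return {
        "q": query,
        "rows": rows,
        # The value is the string 'true', as documented by the extension.
        "extras": {"ext_vector_search": "true"},
    }


params = semantic_search_params("air pollution in cities", rows=5)

# Inside CKAN code (for example a plugin), this could be used as:
# from ckan.plugins import toolkit
# results = toolkit.get_action("package_search")({}, params)
```

The same dictionary can be sent as the JSON body of a POST to /api/3/action/package_search.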

Requirements

Tested on CKAN 2.10/master. Requires at least CKAN 2.10.4 or 2.9.11.

The Semantic Search feature requires a custom Solr schema with a Dense Vector Search field. A Dockerfile based on official CKAN Solr images is included.
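For orientation, a dense vector query in Solr uses the knn query parser, the same parser the installation steps allow through ckan.search.solr_allowed_query_parsers. The sketch below only formats such a query string. The field name vector comes from the configuration example on this page, while the helper name and the sample numbers are assumptions. How the extension builds its own query internally is not shown here.

```python
def build_knn_query(vector, field="vector", top_k=10):
    """Format a Solr knn query string for a dense vector field."""
    if not vector:
        raise ValueError("vector must not be empty")
    values = ", ".join(repr(float(v)) for v in vector)
    return f"{{!knn f={field} topK={top_k}}}[{values}]"


# A toy 3-dimensional vector; a real query vector must have the same
# number of dimensions as the Solr field was defined with.
query = build_knn_query([0.12, -0.5, 0.33], top_k=5)
print(query)  # {!knn f=vector topK=5}[0.12, -0.5, 0.33]
```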

Installation

  1. Activate your CKAN virtual environment:

    . /usr/lib/ckan/default/bin/activate
    
  2. Clone the source and install:

    git clone https://github.com/amercader/ckanext-embeddings.git
    cd ckanext-embeddings
    pip install .
    pip install -r requirements.txt
    
  3. Add embeddings to the ckan.plugins setting.

  4. Add configuration:

    ckan.search.solr_allowed_query_parsers = knn
    
  5. Restart the CKAN process.

Customizing

Choose the backend for generating embeddings via ckanext.embeddings.backend:

  • sentence_transformers (default): Runs locally, using Sentence Transformers’ all-MiniLM-L6-v2 model
  • openai: Uses OpenAI’s Embeddings API (requires API key)

You can also provide custom backends by extending BaseEmbeddingsBackend.

Configuration

ckanext.embeddings.backend = sentence_transformers
# Only needed for the openai backend; the OPENAI_API_KEY env var can be used instead
ckanext.embeddings.openai.api_key = your_api_key
ckanext.embeddings.solr_vector_field_name = vector

License

AGPL-3.0 license

Version
0.1.1
Version release date
2024-05-22
Contact name
Adrià Mercader
Contact email
(not set)
Contact Url
(not set)


Installation Guide

Configuration hints

Add ‘embeddings’ to ckan.plugins. Requires CKAN 2.10.4+ or 2.9.11+ and a custom Solr schema with a Dense Vector Search field. Supports the Sentence Transformers (local) and OpenAI backends. This is an alpha version that has not yet been tested in production.

Plugins to configure (ckan.ini)
embeddings
CKAN Settings (ckan.ini)
ckan.search.solr_allowed_query_parsers = knn
ckanext.embeddings.backend = sentence_transformers
# ckanext.embeddings.openai.api_key = your_api_key
# (alternatively, set the OPENAI_API_KEY environment variable instead of the option above)
ckanext.embeddings.solr_vector_field_name = vector
DB migration to be executed
(not set)