ckanext-embeddings

An extension that uses Machine Learning Embeddings to provide similarity-based features for CKAN portals. Currently there are two features explored:
- Returning similar datasets
- Performing Semantic Search in Solr integrated with the usual CKAN search

| :warning: Note :warning: |
| --- |
| This is just an alpha version and has not been tested in any real-world deployment |

What this does
Embeddings are a Machine Learning technique that encodes complex pieces of information
like text or images as numerical vectors (lists of numbers) that capture the relationships and
similarities between these pieces of information. The closer the vectors are, the more similar
the objects are according to the underlying model used to create the embeddings. There are many
introductory resources to learn more about embeddings; here are some I found useful:
In the context of CKAN, this plugin computes embeddings for all datasets using their metadata
(their title or description, although any other relevant metadata field can also be used). Being able to
compare datasets allows us to build features that make relevant data easier for users to discover.
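
To make the idea of "closeness" concrete, here is a minimal sketch of comparing dataset embeddings. It assumes cosine similarity as the measure, which is a common choice but not necessarily the one this plugin uses. The three-dimensional vectors, the dataset names and the `rank_similar` helper are made up for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_similar(target_id, embeddings, limit=5):
    """Return up to `limit` other dataset ids, most similar first (hypothetical helper)."""
    target = embeddings[target_id]
    scores = [
        (cosine_similarity(target, vector), dataset_id)
        for dataset_id, vector in embeddings.items()
        if dataset_id != target_id  # a dataset is trivially similar to itself
    ]
    scores.sort(reverse=True)
    return [dataset_id for _, dataset_id in scores[:limit]]


# Made-up embeddings, standing in for vectors computed from dataset metadata.
embeddings = {
    "air-quality-2022": [0.9, 0.1, 0.0],
    "air-pollution-sensors": [0.8, 0.2, 0.1],
    "library-opening-hours": [0.0, 0.1, 0.9],
}

print(rank_similar("air-quality-2022", embeddings))
```

With these toy vectors, the two air-related datasets point in nearly the same direction, so `air-pollution-sensors` ranks ahead of `library-opening-hours`.
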
Right now there are two features implemented:
1. Similar datasets
By computing the embeddings of all datasets and ranking them against that of a particular dataset, we can get the most
similar datasets to the one