Integrating Apache Solr with Ballerina Central to Enhance Search Capabilities
During my internship at WSO2, I had the opportunity to work on integrating Apache Solr with Ballerina Central, the official package registry for the Ballerina programming language.
What is Ballerina Central?
Ballerina Central is a globally hosted package management system where you can discover, download, publish, and manage Ballerina packages.
What is Apache Solr?
Solr is an open-source enterprise search platform. Its major features include full-text search, relevance-based search, hit highlighting, faceted search, real-time indexing, dynamic clustering, and much more.
In this article, let’s explore the new search features added to Ballerina Central through the Solr integration.
Relevance Based Search
One of the main reasons why the Central team wanted to introduce Solr into Ballerina Central’s search was to provide users with more relevant search results. But what does “Relevance Based Search” actually mean?
I’ll explain this with an example. In the old implementation of search in Ballerina Central, if you searched for “io” you’d get the following results:
[Screenshot: search results for “io” in the old implementation of search in Ballerina Central]
Notice how the package ballerina/io (highlighted in red) is ranked third in the results while ballerina/constraint is ranked first. Ideally, ballerina/io should have ranked first, as it’s an exact match to the search query. The first two results are “irrelevant” to the query. There are two reasons why this was happening:
- The constraint package includes a keyword called "validation" (highlighted in blue), which contains the letters "io", making it a valid search result. Similarly, the oauth2 package is returned as a valid search result because it includes the keywords "authentication" and "introspection".
- The results are ordered by pull count, which was the default sort option in Central previously. Since the constraint and oauth2 packages have more pulls than the io package, they rank higher in the search results.
“Relevancy” can mean different things depending on the context. For Ballerina Central, “relevant” search results are those that not only match the user’s query but also prioritize packages offered by Ballerina’s standard library.
Solr comes with handy features that allow us to customize searches for relevance. Thanks to this, I introduced a new sorting option called “Sort by Relevance”, which resolves the issues that were present in the old search system.
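As a rough idea of how such a sort could be expressed, the sketch below builds request parameters for Solr's eDisMax query parser, where qf weights fields against each other and bq adds a boost for documents matching a side query. The field names (name, keywords, readme, org), the boost values, and the idea of boosting the ballerina organization are assumptions for illustration, not Central's actual schema or tuning.

```python
from urllib.parse import urlencode


def relevance_params(user_query: str) -> dict[str, str]:
    """Build request parameters for a relevance-ranked package search."""
    return {
        "q": user_query,
        "defType": "edismax",               # Solr's extended DisMax parser
        "qf": "name^10 keywords^3 readme",  # hypothetical fields and weights
        "bq": "org:ballerina^5",            # hypothetical standard-library boost
        "sort": "score desc",               # order by relevance score
    }


params = relevance_params("io")
query_string = urlencode(params)
print(query_string)
```

With weights like these, a hit on the package name counts for more than a hit buried in a keyword or the README, which is the behaviour the old substring search lacked.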
Full-Text Search
Thanks to Solr, we can now index various metadata associated with packages on Central, such as package name, organization name, keywords, authors, license, platform, and README. This comprehensive indexing enables us to perform searches and discover packages based on any of these fields.
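To make the indexing step concrete, here is a sketch of what one package document might look like before it is sent to Solr, whose update handler accepts a JSON array of documents. The package (exampleorg/greeter) is fictional, and the field names and values are placeholders rather than Central's real schema.

```python
import json

# A fictional package; every value here is a placeholder.
package_doc = {
    "id": "exampleorg/greeter",
    "org": "exampleorg",
    "name": "greeter",
    "keywords": ["greeting", "demo"],
    "authors": ["Example Author"],
    "license": "Apache-2.0",
    "platform": "any",
    "readme": "A tiny package that returns a greeting.",
}

# Solr's JSON update format takes a list of documents.
payload = json.dumps([package_doc])
print(payload)
```

Because every field ends up in the index, a query can match on the README text or the license just as easily as on the package name.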
Symbol Search
With Solr’s extensive indexing capabilities, I built a pipeline to index symbols such as records, classes, functions, and more from packages. This enabled us to introduce a new search mode called “Symbol Search.”
Click on the “Symbols” tab to search for symbols or simply start your query with a hashtag (#) to initiate a symbol search, for example, #authenticate.
You can filter the results for specific types of symbols using the type: prefix, for example, type:function or type:record.
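The query conventions described here (a leading # and a type: prefix) could be handled by a small parser like the one below. This is a hypothetical sketch; the function name and the returned structure are my own and not how Central parses queries.

```python
def parse_query(raw: str) -> dict:
    """Split a raw search string into search mode, symbol type, and free text."""
    raw = raw.strip()
    mode = "packages"
    if raw.startswith("#"):
        # A leading hashtag switches to symbol search.
        mode = "symbols"
        raw = raw[1:]

    symbol_type = None
    terms = []
    for token in raw.split():
        if token.startswith("type:") and len(token) > len("type:"):
            symbol_type = token[len("type:"):]
        else:
            terms.append(token)

    return {"mode": mode, "type": symbol_type, "text": " ".join(terms)}


parsed = parse_query("#authenticate type:function")
print(parsed)
```

The parsed type could then be applied as a filter on the symbol index, while the remaining text is used as the search query itself.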
Autocomplete Suggestions
Using Solr’s suggester component, I implemented autocomplete suggestions in the Central search bar. Now, as you begin typing your query, helpful suggestions for both packages and symbols appear, making it easier for users to find what they’re looking for.
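A request to the suggester is an ordinary HTTP query with a handful of suggest.* parameters. The sketch below assumes two suggester dictionaries, one for packages and one for symbols; the dictionary names are hypothetical and would have to match whatever is configured in solrconfig.xml.

```python
from urllib.parse import urlencode


def suggest_params(prefix: str, count: int = 5) -> list[tuple[str, str]]:
    """Build suggester parameters for the text typed so far."""
    return [
        ("suggest", "true"),
        ("suggest.q", prefix),
        ("suggest.count", str(count)),
        # Hypothetical dictionary names; Solr allows the parameter to repeat.
        ("suggest.dictionary", "packageSuggester"),
        ("suggest.dictionary", "symbolSuggester"),
    ]


suggest_query = urlencode(suggest_params("auth"))
print(suggest_query)
```

Querying both dictionaries in one request is one way to get package and symbol suggestions back together for the same prefix.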
Spellchecking
Users sometimes make spelling mistakes in their search queries, leading to unsatisfactory results. To enhance the search experience, I integrated Solr’s spellchecker component. This feature offers helpful “Did you mean?” suggestions, making it easier for users to find the packages they need.
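The idea behind a “Did you mean?” suggestion can be illustrated without Solr at all. The sketch below is a stand-in built on Python's difflib, not Solr's spellchecker: it picks the closest term from a small, made-up vocabulary.

```python
from difflib import get_close_matches
from typing import Optional

# Made-up vocabulary; a real system would draw terms from the index.
KNOWN_TERMS = ["http", "io", "oauth2", "constraint"]


def did_you_mean(term: str, vocabulary: list[str] = KNOWN_TERMS) -> Optional[str]:
    """Return the closest known term, or None if nothing is similar enough."""
    matches = get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else None


suggestion = did_you_mean("htpp")
print(suggestion)  # http
```

Solr's component does the equivalent against the indexed terms themselves, so suggestions are always words that actually occur in the data.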
Hit Highlighting
Similar to Google Search, hit highlighting is used to emphasize terms that match the user’s query in the search results’ description. Solr provides this feature natively, so I integrated it into Central as well.
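In Solr, highlighting is switched on per request with parameters such as hl=true and hl.fl. To show what the end result looks like, here is a simplified stand-in that wraps case-insensitive matches in tags; it is a plain regular-expression sketch, not Solr's highlighter, and the sample description is made up.

```python
import re


def highlight(text: str, term: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap every case-insensitive occurrence of term in pre/post markers."""
    if not term:
        return text
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


snippet = highlight("Provides an HTTP client and an http listener.", "http")
print(snippet)  # Provides an <em>HTTP</em> client and an <em>http</em> listener.
```

The front end then only needs to style the wrapped spans to make the matched terms stand out in each description.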
Boolean Operators
Search queries now support boolean operators such as AND, OR, and NOT for building more advanced queries. For example, you can look up packages that contain both the words sms and freemium by using sms AND freemium as the query.
Include/Exclude Terms
Queries can be refined by including or excluding specific terms using the + and - symbols. For example, to find packages related to Google’s apps while excluding Google Drive, you can use: google -drive.
Wildcard Search
To perform wildcard searches, an asterisk (*) can be used with the search term. For example, web*.
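One practical detail applies to all three query styles above when they are sent over HTTP: operator characters have to be URL-encoded. In particular, a literal + must be percent-encoded, because an unencoded + in a query string is decoded as a space. The sketch below encodes a few example queries as a q parameter; the +google -drive variant is my own addition to show the + case.

```python
from urllib.parse import parse_qs, urlencode

EXAMPLE_QUERIES = ["sms AND freemium", "+google -drive", "web*"]


def encode_query(user_query: str) -> str:
    """Encode a raw user query as the q parameter of a search request."""
    return urlencode({"q": user_query})


encoded = {raw: encode_query(raw) for raw in EXAMPLE_QUERIES}
for raw, query_string in encoded.items():
    # Decoding the query string gives back exactly what the user typed.
    round_trip = parse_qs(query_string)["q"][0]
    print(f"{raw!r} -> {query_string} -> {round_trip!r}")
```

Encoding through a library function rather than string concatenation keeps the operators intact on their way to the search backend.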
Overall, this project was an exciting experience for me. Working with Solr gave me a deeper understanding of search engines, how they function behind the scenes, and the technologies that power them. The project expanded my technical knowledge and strengthened my problem-solving skills, making it a truly valuable learning experience.
If you’re interested in learning more about Solr, feel free to explore some of the articles I’ve written on the topic:
- Setting up Apache Solr on Kubernetes
- Authentication and Authorization in Apache Solr using the Solr Operator
- Spellchecking on Apache Solr
Thanks for reading!