To celebrate the 10th edition of herri, I’ve developed an advanced search tool that turns herri into a fully fledged research instrument: search.herri.org.za.
herri contains a vast array of content: articles, references, videos, audio and images, and it will only grow with future releases. To be able to search and filter all individual parts in near real-time, we needed a technology that can handle thousands of records with ease: Algolia. Thanks to their generous free usage tier, we can use their cloud infrastructure without adding running costs to herri.
Indexing herri
To provide a fast and accurate search, all articles needed to be split up into small pieces, since searching in long texts is generally slow. I’ve built an indexer which scans each article for its parts: headings, paragraphs, lists, quotes, images, videos and audio tracks.
// index content blocks as separate records
$tags = ['h1', 'h2', 'h3', 'h4', 'p', 'blockquote', 'ul', 'ol'];
foreach ($tags as $tag) {
    // collect every element of this type from the parsed article
    $elements = $dom->getElementsByTagName($tag);
    foreach ($elements as $element) {
        $content_parts[] = [
            'tag' => $tag,
            // keep only the text, dropping any nested markup
            'content' => strip_tags($dom->saveHTML($element)),
        ];
        $elementsFound = true;
    }
}
These are then pushed to Algolia (indexed) as individual records. All parts belonging to an article are interlinked so when a part is found, the search will present the article itself and a snippet related to the search query.
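As a rough illustration of how parts can stay linked to their article, here is a minimal sketch. The function name buildRecords() and the fields articleId, articleTitle and position are assumptions made for this example, not herri's actual schema; only objectID is the identifier Algolia expects on each record. The push itself is left out, since the exact client call depends on the Algolia client version in use.

```php
<?php
declare(strict_types=1);

/**
 * Turn the content parts of one article into individual search records.
 * The function name and every field except objectID are hypothetical.
 */
function buildRecords(string $articleId, string $articleTitle, array $contentParts): array
{
    $records = [];
    foreach ($contentParts as $position => $part) {
        $text = trim($part['content']);
        if ($text === '') {
            continue; // skip elements without any text
        }
        $records[] = [
            // unique per part, so each part becomes its own record
            'objectID' => $articleId . '-' . $position,
            // shared by all parts, so a hit can be traced back to its article
            'articleId' => $articleId,
            'articleTitle' => $articleTitle,
            'tag' => $part['tag'],
            'content' => $text,
            'position' => $position,
        ];
    }
    return $records;
}

// Example input in the same shape as $content_parts from the indexer
$records = buildRecords('example-article', 'Example article', [
    ['tag' => 'h1', 'content' => 'Example heading'],
    ['tag' => 'p', 'content' => '   '],
    ['tag' => 'p', 'content' => 'First paragraph.'],
]);
```

Because every record carries the same articleId, a hit on any single part is enough to look up the article it belongs to and show the matching part as the snippet.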
Filtering
You can filter results with or without a search query, by Issue, Section and media Type. This is called a faceted search, which makes it easier to narrow down search results quickly. For herri, it also acts as a tool of exploration, for example browsing all videos in issue #6: https://search.herri.org.za/?issue=6&mediaType=video
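To show how a URL like the one above can map onto facets, here is a minimal sketch that turns the query string into "attribute:value" strings, the format Algolia accepts for its facetFilters search parameter. The function name facetFiltersFromUrl() and the parameter name section are assumptions; only issue and mediaType appear in the example URL, and the real search UI may handle this differently.

```php
<?php
declare(strict_types=1);

/**
 * Translate the query string of a search URL into "attribute:value" facet filters.
 * The function name and the default whitelist are hypothetical.
 */
function facetFiltersFromUrl(
    string $url,
    array $allowedFacets = ['issue', 'section', 'mediaType']
): array {
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return []; // no query string: nothing to filter on
    }
    parse_str($query, $params);

    $filters = [];
    foreach ($allowedFacets as $facet) {
        // ignore unknown parameters, empty values and array-style values
        if (isset($params[$facet]) && is_string($params[$facet]) && $params[$facet] !== '') {
            $filters[] = $facet . ':' . $params[$facet];
        }
    }
    return $filters;
}

$filters = facetFiltersFromUrl('https://search.herri.org.za/?issue=6&mediaType=video');
// $filters is ['issue:6', 'mediaType:video']
```

Whitelisting the facet names keeps unrelated query parameters out of the search request, and an empty result simply means no filter is applied.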
Search UI
We’ve split out videos, images and audio tracks into separate items because herri contains so much rich media to explore. Extracting them from the articles makes it easier to browse around, and I’ve built a mini player for each type to preview the item directly in the search results.
The search User Interface (UI) is built in Nuxt 3, an open source reactive JavaScript framework which enables a super fast and responsive User Experience (UX). herri itself is built on Nuxt 2 and I’ve integrated a simpler ‘mini search’ directly into it using the same Algolia index. This way it’s easy to find articles directly from inside herri as well. It defaults to showing articles in the issue you’re currently browsing, with the option to search all issues or open the Advanced Search for deeper exploration using filters.
A future for digital archives
It’s been a year and a half since I reflected on the public release of ChatGPT and the emerging AI phenomenon. Most big companies have jumped on the bandwagon, racing to be ahead of the curve for when AI is integrated into almost everything we create and consume. Algolia already offers ‘AI search’, which can greatly improve getting quick, relevant results because the AI can predict what you’re looking for and even what you didn’t know you were looking for.
How will this change the way we interact with digital archives? Instead of searching through a database, we can train a custom AI model (LLM) with all the data and use a chat interface to ask it questions in natural language. It means we can converse with the archive as if it were a teacher or researcher.
The AI can even interpret images, video and audio so their information becomes part of the knowledge base. It can also describe them in more depth, drawing on what the model already knows.
Even more powerful: it can create relations between vast amounts of data of all types, beyond human capacity or arguably human creativity. Based on all input, AI could ponder what an old photo would have looked like after reading a text about the set and setting, and render it for you on demand. This is similar to how humans can imagine visuals and audio while reading, based on all the data gathered through life experience. DeepMind, Google’s AI research lab, is already heading in that direction, using AI to add sound to videos based on what it sees and ‘imagines’ the video would sound like.
In the not so distant future, digital archives can go beyond just listing, filtering and searching data. Using AI integration we could greatly enrich and deepen the experience and make interaction much more accessible and fun, by using a natural language interface instead of dry keyword searching and topic browsing: a personal live storyteller you can communicate with, tailored to your taste and interests.
For now, AI integration remains expensive and hard to use for a non-profit project like herri. While most providers offer a free tier for their most basic product, you can only use it inside their own products, with a personal account.
Using their API always costs money per use, an open-ended running cost we cannot bear at the moment.
JURGEN MEEKEL: This is great Martijn, I tried Merlin in Chrome and it works well… and is still for free. How long will it take for Merlin AI to become a paid feature?
“soon.
it takes energy to run those servers, energy needs to be paid by someone, preferably the user. part of the API will probably remain free in order to gain more usage data and train the bot via iteration and self learning to constantly improve output, aka satisfy the user more.
unless politics derive consensus about financing ‘free public bots for the good of humanity’ via subsidies or other means, smart AI will not be accessible for the ‘poor’, hence only enlarging the gap between classes. government sponsored bots are susceptible to corruption and censorship, and also don’t feel like the best solution.”
Herri #8 MARTIJN PANTLIN – Some notes […] on the AI phenomenon
Meanwhile I will keep searching for new ways to present herri content within our budget constraints.
Currently I use Codium AI in my code editor, which predicts what I want to write, suggests and explains code, finds bugs and much more.
While we tinker on, go check it out at search.herri.org.za or use the search icon at the top right at any time.