Timo Hagenow and Duncan Blythe from Superduper

SuperDuperDB: Transforming Databases into AI Powerhouses with Python Simplicity

Host

Rod Rivera

Guests

Timo Hagenow

CEO, Superduper

Duncan Blythe

CTO, Superduper

SuperDuper is a framework for integrating AI models and APIs directly with major databases, eliminating the need for complex MLOps pipelines. Timo and Duncan, the founders of SuperDuper, have diverse backgrounds in business and technical fields, which led them to create a tool that simplifies AI application development. They emphasize the importance of open-source technology and community-driven development. SuperDuper allows developers to build AI applications using a simple Python interface while providing the flexibility to drill down into the details. The tool supports a wide range of use cases, from generative AI to standard tasks like classification and segmentation. SuperDuper aims to become a leading standard for integrating AI with data and applications.

Takeaways

  • SuperDuper simplifies AI application development by integrating AI models and APIs directly with major databases.
  • The founders of SuperDuper have diverse backgrounds in business and technical fields, which influenced the creation of the tool.
  • Open-source technology and community-driven development are key principles of SuperDuper.
  • SuperDuper supports a wide range of use cases, from generative AI to standard tasks like classification and segmentation.

Episode Transcript

Introduction

Rod Rivera: Welcome to the AI Engineer podcast. Today, I'm pleased to have Duncan and Timo from SuperDuper. SuperDuper is a framework that enables anyone with basic Python knowledge to integrate machine learning and AI into their applications. Welcome, Timo and Duncan.

Duncan Blythe: Hi.

Timo Hagenow: It's a pleasure to be here.

Founders' Backgrounds

Rod Rivera: Timo, you have a strong business background, while Duncan, you have a robust technical background. SuperDuper is both a technical product and one aimed at a less technical user base. Could you tell us about yourselves and how you arrived at where you are now?

Timo Hagenow: You're right, I'm not an engineer or data scientist, but I became a tech founder right after my studies. I did an internship at an advertising tech company and then decided to start something with friends. I ended up responsible not just for building up sales; I was also always highly interested in product management. I would design the initial prototypes, build up the development team, and always work very close to the technical side.

I met Duncan through a friend when he was still studying in Berlin, so we've been friends for nearly 10 years. After seven years I handed over my ad tech company, and Duncan, another friend, and I decided to found a company together. This was my first real involvement in AI. We created an AI innovation lab called LF1 in Berlin, focusing on technical development and prototyping.

Our first project was an AI-based e-commerce search and navigation suite. We developed services like semantic text search, reverse image search, and auto data enrichment, all based on vector embeddings of e-commerce products. We started in fashion, creating multimodal vector embeddings of products combining images, textual data, and IDs.

We developed our own models and could spawn different services based on these vector embeddings. It was exciting because we could compare our services side-by-side with existing solutions from big companies. We worked on model development, experimentation, prototyping, and bringing these models into production, serving large online shops with high traffic.

When we finished creating this end-to-end solution, we reached out to some e-commerce players. We were quickly contacted by a SaaS market leader from Britain who wanted to form an exclusive partnership and buy our IP. We did an IP sale and helped them onboard the algorithms, bringing them live very quickly on huge brands like Adidas, ASOS, and about a dozen other big online shops.

While implementing these solutions, we started to build our own tooling. We participated in the PyTorch annual hackathon with a project called PADL, which was a pipeline builder. However, we lost interest in that project, since we were just following the prevailing MLOps narrative and paradigm.

Duncan and I then decided to focus on SuperDuper and follow an open-source path. We believe that in data science, it's important to be open-source first and community-driven. That's how we ended up here with you.

Duncan Blythe: My background is quite different. I studied mathematics as an undergraduate at Oxford. Interestingly, I come from a family where we never had a computer, so until I was about 23, I'd never programmed a computer, despite having gone very far with mathematics.

It was around 2007-2008 when machine learning first popped up on my radar. I quickly had to adapt and learn programming and other skills. I started working in a research lab doing classical machine learning. It was pre-GPU days, but there was a large Sun Grid Engine cluster. It was clear that the people who had mastered their tools were the ones who could bring their mathematical algorithms to the compute.

Because of my background of not being familiar with computers, it immediately seemed bizarre to me how hard it was to connect valuable compute and data with algorithms. I was always looking for ways to optimize the journey between connecting mathematics with compute and data.

Fast forward to 2018, before we started the innovation lab with Timo, I was working at Zalando, which had a small research lab. It was very hard to get compute provisioned because it was a publicly traded European company; you couldn't just click on an instance in the cloud and start your environment. There was a lot of red tape around that process.

We set up a system to automatically start instances, pull in the code, and start computing in a scalable way. We made certain discoveries that allowed us to outperform BERT on German with a model called Flair, which itself became a big open-source project called Flair NLP.

The genesis was that we were in this suboptimal situation where we had to make the most of our resources. Not everyone is in the privileged position where they can just spin up hundreds of GPUs without a problem.

This premise became even more urgent when managing many customers in e-commerce who had different data requirements. It was clear that sometimes the data is the most painful point in the development cycle, especially if you have different sources of data that need to be transformed before they get ingested as tensors into your machine learning models.

We wanted an environment flexible enough to let us do a range of different things in AI and machine learning, and that would also connect flexibly with our data. We wanted it to be super flexible but also easy to get started with. That's what we've created with SuperDuper.

SuperDuper: Bridging the Gap

Rod Rivera: Before we delve fully into SuperDuper, what has changed and what has remained the same in the machine learning and data science space over the past 15 years?

Duncan Blythe: One thing that's stayed the same is that a machine learning model is basically a nonlinear mathematical transformation that takes some numbers and outputs some other numbers. That's remained the same since the 60s, when the first machine learning models appeared, and, with linear models, even hundreds of years before that.

What's become easier on the infrastructure side is provisioning your environment with a predictable operating system. Cloud computing has been around for a while, making it easy to click and get an instance which you connect to by SSH and then manage yourself. But that's still a long way from what you actually want as an AI practitioner: a remote compute environment where you can work exactly as you would locally.

There are things like Colab, which allow you to quickly provision an instance, but then you don't have your data there. You have other things like SageMaker, which make certain machine learning tasks much easier, like training and deploying a classical classification model. But then you're missing the full range of flexibility.

Then you've got new things such as vector databases. These are super popular now because they're very useful in combination with large language models. If you have something that allows you to find similar documents to a question that you want to ask, then you're going to be able to provide these documents as context to your chatbot or your large language model, so that it doesn't confabulate or make up nonsense.

But the problem with those things is that they're limited to a few use cases: basically just searching with vectors and questioning documents. What's missing is the part where you've got complete configurability.

Rod Rivera: With the overwhelming number of AI solutions available, do you have any advice for practitioners trying to decide where to start?

Duncan Blythe: I think you need to think about what you want to do. For 99% of applications, a specialized vector database that can scale to hundreds of millions or billions of documents is probably not relevant. Most developers deploying their own application, or small teams, are going to have maybe a few thousand documents, or perhaps a hundred thousand.

For that, you definitely don't need a specialized vector database. If you're a full stack developer, you'll likely be working with something like Postgres or MongoDB. I would recommend sticking with that because both of these solutions now have inbuilt vector search, which you can just access through those systems.

The pain point is preparing your application and your data in the database so that they're ready to be used by these underlying vector search functionalities. MongoDB now has Atlas Vector Search on its enterprise service, and Postgres has an extension called pgvector.
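For illustration, here's a minimal sketch of what querying pgvector from Python can look like; the connection string, table, and embedding dimension are hypothetical stand-ins. Everything around it, computing embeddings, keeping them current, and wiring this into your application, is exactly the pain point being described:

```python
import psycopg2  # pip install psycopg2-binary

# Hypothetical connection details.
conn = psycopg2.connect("dbname=shop user=postgres")
cur = conn.cursor()

# Enable the extension and store embeddings alongside the raw text.
# 1536 is the dimension of OpenAI's text-embedding-ada-002; use your model's.
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS docs (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(1536)
    );
""")

# Nearest-neighbour search: '<=>' is pgvector's cosine-distance operator.
query_embedding = [0.1] * 1536  # placeholder; use a real model output here
cur.execute(
    "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5;",
    (str(query_embedding),),
)
for (body,) in cur.fetchall():
    print(body)
```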

That's exactly what SuperDuper can help you with. Just by wrapping your database with SuperDuper, you're going to instantly have access in a fully operational way to these underlying vector search functionalities.
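As a rough sketch of what that wrapping can look like in code (the names below follow the project's README from around the time of this episode and may have changed in later releases):

```python
from superduperdb import superduper, VectorIndex, Listener
from superduperdb.ext.openai import OpenAIEmbedding
from superduperdb.backends.mongodb import Collection

# Wrap an existing database with a single call (connection string hypothetical).
db = superduper('mongodb://localhost:27017/documents')
docs = Collection('docs')

# Declare a vector index over the 'txt' field; SuperDuper computes and
# stores the embeddings and keeps them in sync with the collection.
db.add(
    VectorIndex(
        identifier='my-index',
        indexing_listener=Listener(
            model=OpenAIEmbedding(model='text-embedding-ada-002'),
            key='txt',
            select=docs.find(),
        ),
    )
)

# Vector search expressed through the database's own query idiom.
results = db.execute(
    docs.like({'txt': 'how do I set up vector search?'},
              vector_index='my-index', n=5).find()
)
```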

Timo Hagenow: It's about consolidating and removing complexity by providing one environment with a great, familiar experience. You want to use the tools you're already working with and know from the Python ecosystem. You don't want to manage data in different places, mapping raw data to the IDs of a vector store that lives somewhere else.

Getting Started with SuperDuper

Rod Rivera: How would you explain SuperDuper to your parents? What is it for them?

Timo Hagenow: I would say we provide software that helps software developers build AI software more easily. That's something my parents would probably understand.

Rod Rivera: Walk me through the first steps for a data scientist who primarily works with Python notebooks and wants to get started with SuperDuper.

Timo Hagenow: It's very easy to get started. We've implemented a bunch of cool AI use cases and applications, which you can find in the repository. You can run them directly in your browser, in a Jupyter notebook. They're quite straightforward, and that's the beauty of SuperDuper: you can implement, or re-implement, quite robust and complex AI applications very easily and very quickly.

Just follow the notebooks and try them out at demo.SuperDuper.com. In addition, we have a dedicated repository for apps built by the community, and there are already a couple of cool apps inside. I think the best way to get started is to jump directly into these notebooks to get an overview, and then go to the getting-started section in the documentation. After working through a couple of the notebooks, you can pretty much go straight to building your own applications. The learning curve really is that gentle, because it all uses basic Python commands.

Use Cases and Applications

Rod Rivera: What are some great or original use cases for SuperDuper?

Timo Hagenow: Everybody is super excited about generative AI right now: large language model chat, and RAG chat (retrieval-augmented generation). Not just asking a pre-trained model like GPT or ChatGPT, but providing up-to-date context that wasn't part of the model's training. For example, your own technical documentation, or a user manual that gets updated all the time.

Of course, image generation and audio generation, these generative AI use cases are super interesting. But at the same time, it's also all about the standard use cases that really provide a lot of value, whether it's classification, segmentation, data enrichment, anomaly detection, and so on.

What's super interesting, and what the community is not really aware of, are the custom use cases. When you go into different industries, a manufacturer, for example, might have optical sensor data from the machines on their production line.

We have, for example, a use case where you can talk to your meetings or talk to your podcast. We actually have an application built by Duarte Carmo, who created a search solution for the Changelog podcast series. You can now use natural language to search through all the episodes. Once you have implemented vector search for audio, you can ask, "When did Rod mention SuperDuper use cases?" and it will give you the exact timecode in the episode.

You can extend it by building RAG, and then the large language model doesn't just return the timecode; it can actually answer the question about what was discussed and summarize it.
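To make the workflow concrete, here's a simplified sketch using openai-whisper and sentence-transformers; these are illustrative stand-ins, not necessarily what the community app actually uses:

```python
import numpy as np
import whisper                                         # pip install openai-whisper
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# 1. Transcribe the episode into timestamped segments.
stt = whisper.load_model("base")
segments = stt.transcribe("episode.mp3")["segments"]

# 2. Embed each segment's text (normalized, so dot product = cosine similarity).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode([s["text"] for s in segments],
                          normalize_embeddings=True)

# 3. Answer "when was X mentioned?" with a nearest-neighbour lookup.
query = embedder.encode("When did Rod mention SuperDuper use cases?",
                        normalize_embeddings=True)
best = segments[int(np.argmax(vectors @ query))]
print(f"{best['start']:.0f}s: {best['text']}")
```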

Another use case we have is video search. It's a workflow where you take your videos, cut them into frames, vector-embed the images, and then use vector search to search through them. For example, one of the notebooks you can try in the demo environment is video search. It's a 1-minute-30-second video of animals, and you can use natural language to say, "Show me happy ducks swimming on a pond," and it will jump to the second where that scene appears.
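A condensed sketch of that frame-embedding pipeline might look like this, using OpenCV and CLIP as stand-ins for whatever the demo notebook actually uses:

```python
import cv2    # pip install opencv-python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# 1. Sample roughly one frame per second from the video.
cap = cv2.VideoCapture("animals.mp4")
step = max(int(cap.get(cv2.CAP_PROP_FPS)), 1)
frames, timestamps, i = [], [], 0
ok, frame = cap.read()
while ok:
    if i % step == 0:
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        timestamps.append(i / step)  # approximate seconds into the video
    ok, frame = cap.read()
    i += 1

# 2. Embed the frames and the text query into CLIP's shared space.
with torch.no_grad():
    image_emb = model.get_image_features(
        **processor(images=frames, return_tensors="pt"))
    text_emb = model.get_text_features(
        **processor(text=["happy ducks swimming on a pond"],
                    return_tensors="pt", padding=True))

# 3. The best-matching frame tells us which second to jump to.
scores = torch.nn.functional.cosine_similarity(image_emb, text_emb)
print(f"Jump to {timestamps[int(scores.argmax())]:.0f}s")
```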

SuperDuper and Other Frameworks

Rod Rivera: Can I do everything exclusively with SuperDuper, or do I need to connect it with other frameworks like LangChain or LlamaIndex?

Duncan Blythe: SuperDuper is like a system of wrappers that will wrap your functionality. You can, in principle, take an agent function from LangChain and use it together with SuperDuper.

For the particular case of questioning documents, you don't need LangChain at all, even without SuperDuper. You can just perform the search with a vector search library and then insert the results as context into, for instance, the OpenAI API.
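That pattern is only a few lines. Below is a bare-bones sketch; the search() helper here is a naive keyword-overlap placeholder standing in for a real vector search:

```python
from openai import OpenAI  # pip install openai

DOCS = [
    "SuperDuper wraps your database and adds vector search.",
    "pgvector is a Postgres extension for vector similarity search.",
    # ... your document snippets
]

def search(question: str, top_k: int = 3) -> list[str]:
    # Placeholder ranking by keyword overlap; swap in real vector search.
    words = set(question.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:top_k]

def answer(question: str) -> str:
    context = "\n\n".join(search(question))
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How do I add vector search to Postgres?"))
```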

In SuperDuper, we have a slightly different approach, but it also makes it very, very easy to get something like question-your-documents installed and ready. And we have a few upsides that LangChain doesn't have. For instance, we can keep the deployment up to date: if someone in another location inserts documents into your collection or database, with SuperDuper you can deploy it in a way that reacts to those changes. This means you actually have a productionizable deployment of vector search plus question-your-documents.

We are discussing how we're going to approach the topic of agents: how we're going to allow users to have these kinds of interactive components, functionality they use together with their database. Whether we use LangChain or LlamaIndex, or something we make ourselves, is not clear yet.

Timo Hagenow: It would be awesome to hear what people think. This is really supposed to be a community project and a community effort. It would be great if people could give us their perspective on the whole thing and how it should develop.

Advice for Newcomers

Rod Rivera: What advice would you give to a backend developer who wants to get started with AI applications?

Duncan Blythe: The first thing, before you do any of this, is to understand what these new AI functionalities allow you to do, and then to connect that with a use case in your app or your application domain. That comes even before you start programming.

Given your operational requirements, like whether you're allowed to share your data with an external provider or if you need to keep everything behind the firewall, you have a kind of decision tree which will allow you to act.

If it's okay to share everything with an external provider and you're not worried about turnaround times or SLAs, then I would use Python to experiment with the OpenAI API. That would be how to get started understanding the concept of machine learning.

Nowadays, I think going to the level of understanding how to write your own model in PyTorch is not necessary anymore. There are so many very low-code possibilities you can use from the Transformers library, for instance. There are other very easy libraries, such as spaCy if you're interested in NLP, and libraries built on top of PyTorch, such as TorchVision, which let you get something very good in very little time.

All of these can be used inside SuperDuper. So if you then want to experiment with integrating AI with your data, you could try our library and see if that works for you. But generally: start from what you want to do, then connect that with the right model from the ecosystem.

Timo Hagenow: And again, the beauty of SuperDuper is that whatever you decide to experiment with, you integrate it with one Python command. You simply define a query for your input data, and then the inference is taken care of and the outputs are stored back into your database, where you can use them for downstream applications.

Once you've installed SuperDuper, which is just a pip install, it's easy to try out different things and move very fast. And the good part is that your outputs are all structured and available in your database going forward. You can compare, for example, different embedding models and how they work for your use case.
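As a rough sketch of that one-command integration (again, the names follow the project's README from around this time and may differ in current releases):

```python
from superduperdb import superduper, Listener
from superduperdb.ext.sentence_transformers import SentenceTransformer
from superduperdb.backends.mongodb import Collection

# Wrap the existing database (connection string is hypothetical).
db = superduper('mongodb://localhost:27017/shop')
products = Collection('products')

# One command: attach a model to a query. SuperDuper runs inference on the
# matching documents, stores the outputs back in the database, and keeps
# them up to date as new documents arrive. Swapping in a different
# embedding model for comparison is a one-line change.
db.add(
    Listener(
        model=SentenceTransformer('all-MiniLM-L6-v2'),
        key='description',        # the field the model reads
        select=products.find(),   # the documents it applies to
    )
)
```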

Future Predictions

Rod Rivera: As we approach 2024, what are your predictions for the AI industry?

Timo Hagenow: I think AI has reached mainstream society with the breakthroughs of ChatGPT and image generation. Everybody understands that companies really have to adopt AI, otherwise there's a big chance they're going to be left behind. And for developers and technical people, it's simply a fascinating new technology, so everybody wants to be involved.

I don't see another technological leap of that scale coming next year. I also share the assessment that a Terminator or Skynet situation is definitely not going to happen soon, because in the end these models are still quite uncreative.

I think it's more about adoption of what we currently see. Of course, the models will get better, and look at what's happening in image generation. Next we're going to have video generation, which is probably the next big thing.

Duncan Blythe: I would tend to agree with that, but I would add that the models will get smarter and smaller, and more will be on offer open source and shared with the community. You have this new partnership between a consortium of different companies, including Facebook and IBM, with the aim of really delivering on the promise of the open part of AI, which OpenAI, despite the name, isn't really delivering on anymore.

That will give us all sorts of new opportunities to embed these models in our applications. They're going to come to the edge, and then at some point you're going to gradually start seeing walking, talking, squawking robots. The question for developers is how they can get involved with the least overhead possible while keeping that flexibility, and we're hoping to play a role in that journey. It's the notion of AI moving from an academic pursuit to a central part of software development and society. That's a very exciting movement.

Timo Hagenow: There is already so much available across all these different fields, computer vision, NLP, it's crazy. There's so much open-source technology that can be used instantly in so many use cases, potentially creating a lot of value already. It's about finding the use case and the models, then connecting and integrating them with your proprietary data and your applications. And of course there's going to be another wave of more advanced open-source solutions, plus closed-source services from API providers, you know, Mistral and all the others.

The Future of SuperDuper

Rod Rivera: Where do you see SuperDuper one year from now?

Timo Hagenow: It's quite a big vision, but we really hope that we become, maybe not the leading standard, but a leading standard for how AI is integrated with data and how developers and developer teams integrate AI into existing stacks. We hope for adoption, a lot of interest from the community, contributions, and that we can be the starting point for a thriving and sustainable community.

In open source, you often see narrower projects that serve one very small function and go super viral, but a couple of years later they're dying out. We hope this can be different for us, because we want to provide more fundamental, substantial functionality.

Closing Remarks

Rod Rivera: Where can people find you, and do you have any message for the audience?

Timo Hagenow: The best starting point is of course GitHub; it's SuperDuper. From there, you should really just check the README first. We've written a blog post that gives you a good overview of what SuperDuper is, why we created it, and what problem we're trying to solve.

From there, you'll find the different entry points. As I mentioned before, check out the use cases and the docs, and try the first use cases. We have our own community on Slack and Discord, and we would be super happy to welcome you there and start discussing what you think about it: what can be improved, what's missing, what's not working.

Just check it out and provide feedback. Let us know what you think.

Rod Rivera: Well, it has been great having you, Timo and Duncan. Everyone can check out SuperDuper on GitHub, contribute, and join the community. Thanks so much for being here today.

Timo Hagenow: Awesome. Thank you very much for having us. Super nice. Super duper.

Duncan Blythe: It was a pleasure, thank you Rod.

Rod Rivera: Great, great. So yeah, that's super duper.