October 2, 2023

How open source AI will find product-market fit: A conversation with Databricks and AI startup Together

Editor's note: 

In this episode of the Startup Field Guide podcast, we feature a conversation between Reynold Xin (co-founder, Databricks), Vipul Ved Prakash (co-founder, Together) and Wei Lien Dang (GP at Unusual Ventures) about the long-term impact of open source AI. Open-source AI models have become key drivers of innovation and collaboration. An increasing number of developers and end users are leveraging open-source technologies.

Be sure to check out more Startup Field Guide Podcast episodes on Spotify, Apple, and YouTube. Hosted by Unusual Ventures General Partner Sandhya Hegde (former EVP at Amplitude), the SFG podcast uncovers how the top unicorn founders of today really found product-market fit.

If you are interested in learning more about some of the themes and ideas in this episode, please check out our related blog posts.

TL;DR: Open-source LLMs and foundation models

What’s driving the development of open-source models?

  • Chinchilla’s research on training-optimal models and Llama’s work on inference-optimal models have made it possible to train high-quality models with less hardware.
  • Open-source models can be customized and run in private environments.
  • The release of ChatGPT woke everyone up to the incredible potential of AI and to the pressing need for a collective effort toward open-source alternatives.

Why Databricks built Dolly — an open instruction-tuned LLM

  • Databricks built Dolly because they wanted to take an existing model and fine-tune it so it would exhibit the instruction-following behavior of ChatGPT.
  • The goal was to demonstrate that it’s not hard to train a model for a specific domain.

Why Together is building with open source

  • Open, available recipes for generating data sets speed up the process of building models.
  • Together wants to provide not just data sets like RedPajama but also quality labels and content type labels to create a larger data playground.

Role of academia in AI innovation

  • Academia experiments with new ideas, which industry then turns into industrialized artifacts.
  • Research within academia allows industry to scale up those ideas.
  • Open source has become an enabler of collaboration between academia and industry.

Why should product teams choose to build with open source?

  • Open source models offer control and privacy, as sensitive data can be used without being sent to shared cloud APIs. 
  • Fine-tuning open source models can result in higher accuracy, especially when using transfer learning.

Episode Transcript

Wei Lien Dang

Hi everyone. I'm very pleased to welcome you to our fireside chat this morning on open source AI. Super thrilled to be joined by David Hershey, who's a VP on the Unusual Ventures team. My name is Wei Lien Dang. I'm a general partner here at Unusual Ventures focused on AI infrastructure software. And I'm thrilled to welcome two esteemed guest panelists. The first is Vipul Ved Prakash. Vipul has a long track record as a serial founder. He's the founder of Cloudmark and of a company called Topsy, which was acquired by Apple, after which he led search and AI/ML efforts at Apple before founding Together, where he's now co-founder and CEO. Together is one of the prominent platforms building around an open, decentralized AI cloud. And I'm also super happy to welcome Reynold Xin, longtime thought leader and contributor to Apache Spark and co-founder and chief architect at Databricks, where he's also been involved in efforts around an open source instruction-following LLM that was recently announced.

The rise of open source LLMs and foundation models

Wei Lien Dang

One of the reasons we felt this was a really valuable and interesting topic to explore with you today is that we've seen tremendous innovation and growth in the open source AI ecosystem. If you look at the number of developers and end users utilizing open source AI technologies, or at what's been coming out from a research standpoint in terms of foundation models and other areas of the AI-native stack, it's really astounding how quickly things are moving. We at Unusual feel we're at the beginning of the long-term impact of open source AI. So I'd love to start off by getting both of your perspectives. In the last several months we've seen so many new models that the Hugging Face Open LLM leaderboard is changing all the time. And these include models that your teams have trained: Vipul, you have RedPajama from Together; Reynold, you have Dolly.

I would love to start off by asking both of you: what do you believe is behind the recent wave of open source LLMs and foundation models, and why is it important?

Vipul Ved Prakash

You know, unlike the last decade of deep learning, where companies that had data had this asymmetric advantage in using AI, LLMs are built on open data sets. So I think it was a matter of time and interest, and it's really great to see this happening. Some of the research, for instance the Chinchilla work on training-optimal models and then Llama's work on inference-optimal models, has enabled training these models with less hardware while still getting high quality. That quality bar being hit, I think, is one of the reasons we're seeing this interest. And part of the interest in open source is that you can customize these models and run them in private environments; with sensitive data, you have a lot more control over the weights. So there's a lot of interest in seeing these open models become better, and you're seeing this ecosystem effect of multiple research labs and open source projects contributing to the progress.

Wei Lien Dang

Reynold, we'd love to hear your take on the topic. 

Reynold Xin

I think, in general, the technology, the datasets, and the compute power have gotten to the point where it's significantly easier to train the smaller language models. And that explains a lot of it. But I think we have to attribute a lot of the sudden activity and interest in the last three or four months to OpenAI themselves, for the release of ChatGPT, because that for the first time got everybody to wake up and think, hey, AI seems to be a big deal. I'm not talking about everybody who would be calling into a webinar like this, but rather your mothers, your fathers, who might not be in tech. My mom has nothing to do with tech and she's been using ChatGPT all the time, and so are the CEOs of corporate America, the people who can influence directions, and they really are pushing the community forward. The other thing is, nothing gathers a community like a very obvious target, and the target here isn't necessarily something to destroy, but something to catch up to. If there's a big gap and nothing exists in open source, let's actually try to get something out there. I think that definitely had a pretty significant impact, especially in the last few months. That's really where you saw the activity start to accelerate, around late February, early March, and between then and now you've seen massive momentum, opportunities, and activity.

Building AI platforms using open source

Wei Lien Dang

I'm curious. Both of you work at companies that have built platforms to support running these models. Certainly there's innovation in the models themselves, but how did having a platform to serve customers who want to run the models factor in? Why did each of you decide to build around open source, when there are certainly proprietary platforms out there as well? Fundamentally, how did each of your companies think about it?

Reynold Xin

Sure. For us, it's not exclusive. It's fine for us to use something proprietary while at the same time supporting the open source ecosystem. And the reason Dolly was started was not that we felt, hey, we have to build a state-of-the-art model. Initially it started with a very simple goal: let's try to learn about this and see how far we can get. Very quickly we realized there were actually a lot of open source large language models out there that did not have the instruction-following capabilities that ChatGPT has demonstrated. Could we take an existing model and, for a very small amount of money, fine-tune it so it would exhibit some of that instruction-following behavior? The first version of Dolly took us about three days from the inception of the idea, to training, to actually publishing a blog post about it. So it was a very short period. And then we realized, hey, we should be telling the story. The point of the story is not what a lot of people thought it would be, that we're trying to challenge OpenAI by coming up with amazing language models. The point of the story is to demonstrate that it is actually not that hard to train something for your specific domain. It's really, really difficult to build a very general-purpose chatbot that can talk about anything in the world, but it's not that hard to build a domain-specific model, and as a matter of fact, here are all the ingredients you need to get there. That's basically how Dolly came about. It wasn't, hey, let's build something state of the art, but, hey, let's demonstrate to the world that it's possible. And then we quickly followed up. Once we released Dolly 1.0, the big challenge was that you couldn't actually use it in a commercial setting, because the dataset it was trained on did not have a permissive license; we had used OpenAI to generate that data. So we did the Dolly 2.0 follow-up in about two weeks: we just asked every employee at Databricks to write some questions and answers, and that generated the Dolly 15K dataset, which was also open sourced. To be honest, I think the Dolly 15K dataset itself is far more valuable than the Dolly model, because the dataset will probably become part of the training corpus for thousands or even millions of open source models to come. And it won't be the only training dataset to become a part of that. It's sort of our little contribution to the world here.
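
For readers who want to see what "not that hard" looks like concretely, below is a minimal sketch of Dolly-style instruction fine-tuning on the open Dolly 15K dataset. The base model, prompt format, and hyperparameters are illustrative assumptions for the sketch, not Databricks' actual recipe.

```python
# A minimal sketch of Dolly-style instruction fine-tuning.
# Assumptions: Pythia as the base model and a simple prompt format;
# neither is Databricks' exact recipe.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-1.4b"  # any permissively licensed base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# databricks-dolly-15k: ~15k human-written instruction/response pairs
ds = load_dataset("databricks/databricks-dolly-15k", split="train")

def format_example(ex):
    # Concatenate instruction and response into one training sequence.
    text = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = ds.map(format_example, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dolly-style-sft",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=1e-5),
    train_dataset=tokenized,
    # mlm=False gives standard causal language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```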

The importance and impact of open source data in AI Research

Wei Lien Dang

Well, that's an interesting segue. Vipul, I'd be curious to get your perspective on the notion of a dataset. You obviously put out the RedPajama dataset and thought about that as a key starting point, and openness seems foundational to what Together is building. I'd love to hear how and why you decided to build around open source.

Vipul Ved Prakash

Yeah, echoing Reynold, Dolly 15K is an incredible dataset. I do think data has enduring power over time in open source models, because you will see changes in architectures, and more efficient architectures will come up over time, and they will build around the data ecosystem that has been created in the open. So we think that's really important. Also, I've seen some of the talks Databricks has done around how they created the dataset, which I think is also super interesting, because you can now replicate this process in a different setting. And that's how we are thinking about pretraining data. It's very important that you have this pretraining data, because it enables downstream model building. It's also very important that the recipes to generate the data are open and available, because you can take those recipes, use them for another language, and improve the dataset quality over time. And from the perspective of folks building models and exploring architectures or data mixtures, it really speeds up that process to have a dataset that's been carefully prepared. In the future, we want to do a lot more around this, where there's a dataset but also quality labels and content-type labels, so you have this bigger data playground to work with. You see this with the Pile. The Pile has become really central to innovation in open AI research, and it's going to be pretty important. I do believe that over the next few years, the work around data will be the cornerstone of improvements.
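
As a toy illustration of that "data playground" idea: imagine each document in an open corpus shipping with quality and content-type labels, so downstream builders can carve out their own mixtures. The label names and thresholds below are invented for the sketch; they are not RedPajama's actual schema.

```python
# Hypothetical labeled corpus; the labels and values are made up.
corpus = [
    {"text": "The mitochondrion is an organelle ...", "quality": 0.91,
     "content_type": "encyclopedic"},
    {"text": "lol no u", "quality": 0.08, "content_type": "forum"},
    {"text": "def parse(line): ...", "quality": 0.74, "content_type": "code"},
]

# A downstream model builder picks a mixture by filtering on the labels
# instead of re-crawling and re-cleaning the raw data.
mixture = [d["text"] for d in corpus
           if d["quality"] >= 0.5 and d["content_type"] != "forum"]
```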

Open Large Language Models (LLMs) and the need for standardization

Wei Lien Dang

We're seeing the results and impact of these data sets in these new models now. There are leaderboards and rankings that people are paying attention to. What do you think is missing from them? What isn't captured by these lists, these rankings, even some of the benchmarks being used to generate the leaderboards?

Vipul Ved Prakash

One, I think it's a good process. It's also fun for people who are building models; it sets up some friendly competitive dynamics, and it's really acting as a north star for progress. That said, I do think the benchmarks need to be much more principled than they are today. You often see differences when people try to reproduce these benchmarks outside of the leaderboards. There's research from the University of Edinburgh showing that Llama 7B is still five or six points ahead of all the RedPajama-, Falcon-, and MPT-based models. So I think you need a lot more rigor around benchmarking, because it is driving and shaping what researchers are doing. Part of that rigor is processes around decontamination: benchmark datasets will sometimes be included in GitHub code, for example, and they may end up in the models. Making the data open allows a lot of that process to happen. But a lot more work needs to be done. Evaluation is fairly difficult, and we need new research and new processes around how to evaluate these models.
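
One common decontamination heuristic, sketched below: drop any training document that shares a long n-gram with a benchmark's test set. The 13-gram threshold is one commonly used choice, not a value Vipul prescribes here.

```python
# Toy n-gram decontamination: filter training docs that overlap a benchmark.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs, benchmark_texts, n=13):
    # Collect every long n-gram that appears in the benchmark's test data.
    contaminated = set()
    for t in benchmark_texts:
        contaminated |= ngrams(t, n)
    # Keep only training documents with no overlapping n-gram.
    return [d for d in train_docs if not (ngrams(d, n) & contaminated)]
```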

Wei Lien Dang

Reynold, what's your perspective? How should people make sense of these leaderboards?

Reynold Xin

In a different world I'm in, the data systems world, benchmarks are very important, and there are benchmark experts who have spent their entire lives doing nothing but designing benchmarks. We're so early in this process on the large language model side that, first of all, it's great to see them. I do think the ultimate benchmarks are the ones where you have a large number of unbiased humans rate the answers; that's how some of these leaderboards work. But one big challenge, which I honestly don't know how to solve, is that they tend to be fairly simple, and many applications of these large language models are not going to be super simple. So fairly simplistic setups, with humans asking simple questions and rating the answers, represent only a very narrow slice of what these large language models will be used for. And they tend not to be domain specific, while I think a lot of LLMs will be used for domain-specific applications. It's just going to be difficult to weigh, hey, here's what 10,000 random people on the Internet think about the responses, against, hey, here's what doctors would think about this type of response. As a matter of fact, people on the open Internet might not even ask the right questions. That also ties into how you think about evaluation of the models. Just like any machine learning problem, LLMs are no different: you need a continuous, iterative improvement process. You shouldn't just trust, hey, here's a benchmark, let me pick the number one model on it and start using it.

Wei Lien Dang

I think many of us would say we're still early, and there are so many models out there. I know both of you know Chris Ré at Stanford, and he's likened where we are to AI's Linux moment. If you extend the analogy, there were many Linux distributions early on, but eventually they consolidated.

Do you think we'll see eventual industry consolidation long term? Does it make sense to have so many open LLMs and foundation models? Obviously there's an aspect where people have more choice, but there's also something to be said for more standardization and people coalescing around fewer models. I'm curious how each of you thinks about that.

Vipul Ved Prakash

I do think there will be consolidation eventually, but there's a long way to go. There are going to be new architectures and new strategies in the coming years, and you will see a variety of models, and research labs will start adopting and investing in particular approaches. There are signs of consolidation already. For instance, today with the Llama architecture, the amount of tooling built around it is becoming substantial, and when you're building a new model, it makes more sense to adopt that architecture, because all of this tooling just automatically works. That may be one of the ways consolidation happens: around architectures. But as Reynold was saying, we're very early in this process, and once there are better models, users will gravitate towards them. That's the highest-value thing in some ways.

Wei Lien Dang

Reynold, any thoughts on long term consolidation?

Reynold Xin

I think consolidation is necessary and bound to happen, just like with any technology. That doesn't mean there will be no more than three open source models; there will be a distribution where a few wildly popular ones are what people start with. But there might still be innovation happening over time, disruptors that come in and replace them. Is it going to happen next year, or 5 or 10 years from now? I suspect it's more the latter, but to some extent it's not super useful to speculate exactly what will happen here, because the space is innovating so quickly that all you need to know is you have to be ready to embrace the change.

The role of academia in AI research and innovation

Wei Lien Dang

Vipul, one topic you and I have chatted about is the impact of research, and the fact that a lot of what's happening in the ecosystem, even building things like LLM apps, ends up being research driven. Both of you have worked across academic labs and large tech companies, and we've seen things come out of those, like Llama from Meta and work from Stanford, Berkeley, and other universities. What do you think the role of academic labs, large tech companies, and these different stakeholders is, alongside companies like yours, in driving innovation forward?

Vipul Ved Prakash

I would say that in computer science this has always been the case: academia experiments with the crazier ideas, which are then adopted by industry, and industry produces industrialized artifacts from those ideas. Then there's a cycle of academia looking at that and innovating further. I think that's very much the case in AI. There's an incredible amount of scholarship and research happening in academic labs that's making its way into industry, and industry is doing the scale-up of some of these ideas, which is a very productive cycle. This is one of the reasons we at Together are collaborating fairly deeply with academic labs, especially given that open source is a very friendly way of doing this sort of collaboration. I think academia has a huge role to play, and so does industry.

Wei Lien Dang

Yeah, it would seem both of your companies have been real leaders in fostering and figuring out ways to collaborate across academia and industry, and it's interesting to see the cross-pollination between academia and companies like yours in this context. Reynold, I'm curious to hear what you see as some of the most significant innovations coming out of the research community that you and others are looking to productize.

Reynold Xin

Yeah, one caveat: there are so many new things coming out every day that it's becoming very difficult to track, even for me.

I think some skepticism towards academia has been brewing recently around AI: AI requires a lot of compute power and access to data, which academia does not typically have, so does that mean the role of academic research is diminished, especially in the era of large language models? I would challenge that assumption. A very obvious example from the past, which people often don't realize, is Stable Diffusion. One of the most popular AI models came out of latent diffusion work in academia, and it has had a remarkable, profound impact in industry. The LLM side is a little newer, at least in terms of attention; people have been working on large language models in academia for a while, but a lot of the attention is newer. Maybe two things I've seen recently are pretty exciting. One is actually not academic research but work by a bunch of former academics at MosaicML: MPT's ALiBi-based context window, which I think allows up to 84k tokens if I remember correctly. That's pretty remarkable; it's actually even longer than GPT-4's. The other thing is that, as we all know, one of the biggest issues with LLMs is that they hallucinate. Sometimes they make up APIs; they make up facts that are just completely wrong, which is problematic when you want to, for example, use a large language model to facilitate integration and orchestration of systems. I've seen the recent Gorilla work at UC Berkeley by Joey Gonzalez's team, which combines retrieval-based systems with the model: here's an API-calling LLM that always makes sure it's calling the right APIs and passing in the right parameters, and doesn't hallucinate an API. Work like that, maybe not the specific incarnation of it, will push the application of large language models a pretty long way.
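
To make the retrieval-grounded API-calling idea concrete, here is a toy sketch: retrieve the most relevant documented API for a request, then constrain the model to it. The API catalog and the keyword-overlap retriever are stand-ins invented for the example; Gorilla itself pairs a fine-tuned model with embedding-based retrieval.

```python
# Toy catalog of documented APIs; a real system would index thousands.
api_docs = {
    "weather.get_current(city)": "returns current weather conditions for a city",
    "weather.get_forecast(city, days)": "returns a multi day forecast for a city",
}

def retrieve_api(query: str) -> str:
    # Keyword overlap keeps the sketch self-contained; production systems
    # would use embedding similarity instead.
    q = set(query.lower().split())
    return max(api_docs, key=lambda sig: len(q & set(api_docs[sig].split())))

api = retrieve_api("current weather in Paris")
# The retrieved signature is injected into the prompt, so the model is
# grounded in an API that actually exists rather than a hallucinated one.
prompt = f"Answer using only this API: {api}. Request: current weather in Paris."
```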

Innovations in Large Language Models (LLMs)

Wei Lien Dang

Maybe one other question with regard to research: both of you highlighted the central role of data earlier in the conversation, and our observation has been that a lot of the innovation in LLM development has centered on improving input data. What innovations do you see or expect with regard to training data? Who would you look to in terms of moving the space forward?

Vipul Ved Prakash

Yeah, there's a recent paper from Stanford called DoReMi that looks at how to weight, in a principled way, the different data slices that go into a dataset. I think this can have a potentially huge impact. In RedPajama there are seven slices of data from seven different sources, and we weighted them according to the Llama paper; but when we reweight them in different ways, we find that the downstream models have significantly different quality. So I'm very excited about work like that. You can also take some of the large datasets, like Common Crawl, split them further, and reweight the splits. And there are questions around deduplication: how much deduplication is right, and what's the sweet spot? There's research around that which I think is fairly high-leverage in terms of the quality you can get out of the same data and the same model architecture. And I think instruct data is a huge area for improvement. With the RedPajama 7-billion model, we see seven points of difference on HELM benchmarks, and seven points is amazing; it's the difference between Llama and GPT-3 today. That comes from a set of instruct data, including the work Databricks has done with Dolly. So growing the datasets and understanding what kind of data makes models better will have a big impact, and there's a lot of research happening around that.
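
Mechanically, slice reweighting comes down to sampling pretraining batches according to per-source mixture weights, as in the rough sketch below. The slices and weights are placeholders; DoReMi learns its weights with a small proxy model rather than by hand.

```python
import random

# Stand-in data slices; in practice each would be a large tokenized corpus.
slices = {
    "common_crawl": ["cc doc 1", "cc doc 2"],
    "github": ["code doc 1", "code doc 2"],
    "wikipedia": ["wiki doc 1"],
}
# Illustrative mixture weights; DoReMi-style methods learn these.
weights = {"common_crawl": 0.67, "github": 0.15, "wikipedia": 0.18}

def sample_batch(batch_size: int = 8) -> list:
    names, probs = zip(*weights.items())
    # Pick a slice per example according to the mixture weights,
    # then draw a document uniformly from that slice.
    picks = random.choices(names, weights=probs, k=batch_size)
    return [random.choice(slices[name]) for name in picks]
```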

Future of training data in AI

Wei Lien Dang

Reynold, any thoughts on where you see training data going forward?

Reynold Xin

Yeah, my sense is, thanks to the work of everybody, with us honestly playing a very small part and companies like Together playing a bigger part, training data for open source foundation models will become commoditized over time. Everybody will have access to more or less the same set of datasets for a reasonable amount of money, not an insane amount of money. And then a lot of the competitive advantage will come from, hey, when we talk about domain-specific things, what about my data? What about how my customers have been interacting with me in the past? The ability to use that live resource will become the competitive differentiation for most domain-specific use cases. Again, my thesis is that most companies or organizations are not trying to build a general chatbot that can converse about everything from philosophy to state-of-the-art technology. Most companies have specific applications they want to put LLMs to use for, and data specifically relevant to those is going to be key. That data is not available on the open Internet, and it's also not something OpenAI would have; rather, you as an organization have that data.

Building with open source models

Wei Lien Dang

I'm going to hand off to David now. He's going to cover a topic that I would say is really sort of top of mind for our audience, which is how to think about building with these open source models. And I think part of that is how companies like Together and Databricks are enabling people to build on top of the models that you've trained and made available. 

David Hershey

Yeah, thanks. I'll get started by asking you all how you think about when teams should choose to use open source models and the trade-offs they make. You touched on it in your intro at the beginning, but I'm curious: in your mind, what is pushing teams to adopt open source models as opposed to some of the proprietary endpoints, and what trade-offs do you think they make on that journey?

Vipul Ved Prakash

Yeah, I would say there are several reasons. A big one is control and privacy: the data they're using is sensitive, and it's not appropriate to send it to a shared cloud SaaS-type API. We also see that when you fine-tune these models, you're getting 10 or 12 points of accuracy, and the transfer learning works really well. And people are realizing they can often use smaller models. We have several customers who've used 3-billion and 7-billion models, either continuing to pretrain them or fine-tuning them, and then using them with few-shot context to get better accuracy than GPT-4. Anyone who, as Reynold was saying, has their own data and a sense of what tasks they're using the language model for now has the tool set to build something on their own and get high quality and control over it.
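
The "few-shot context" Vipul mentions is just a prompt assembled from a handful of labeled examples ahead of the real query. A minimal sketch, with a hypothetical sentiment task and an invented prompt format:

```python
def few_shot_prompt(examples, query):
    # Each labeled example becomes an input/output demonstration.
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"

# A small fine-tuned model plus a few demonstrations can be competitive
# on narrow tasks like this.
prompt = few_shot_prompt(
    [("the service was excellent", "positive"),
     ("the battery died in an hour", "negative")],
    "shipping was fast and painless",
)
```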

David Hershey

Yeah, maybe to Reynold, to piggyback off that. I think one of the nice things about the proprietary endpoints is that you don't really have to think about the data challenge, and so some of the teams that may not have been managing data before might see this as scary. You went down the route of building this big instruction-tuned dataset at your company. I'm curious if you have advice or thoughts on what it actually means to build data and datasets you can use to tune these models, whether that's fine-tuning or instruction tuning, for folks who are maybe engineers just getting into the data space.

Reynold Xin

I think it's actually less about how you think about the data here and more about how you think about MLOps, or evaluation of your machine learning models, which large language models are a part of. Some people, when you say machine learning, think of large language models; when they say AI, they confuse the hell out of me, but when I say ML it's kind of equivalent to AI. I would argue that, just like any system you build, you want a way to evaluate it. For a non-machine-learning system, you evaluate based on SLAs, you look at uptime, you look at a lot of things, you do regression analysis. It's no different when it comes to machine learning models: you want some way to evaluate how effective they are in production. And the reason that's important is I can guarantee that whatever you deploy, even if it's just calling OpenAI's REST endpoint, is not going to be perfect, and there's massive room for improvement through prompt engineering. What you want is to come up with metrics that make sense in your domain-specific context and evaluate them. You want to log the responses so you can tell after the fact, hey, how am I actually doing, instead of just, hey, here's a new pile of code that gets rolled out into production. Then you want to improve on that and run experiments. All of this is very standard for anybody who's been doing applied machine learning; they've been doing it for the last ten years. A lot of them have done it with bespoke tooling that doesn't exist yet in commercial or open source settings, but the open source ecosystem and the vendors are catching up really quickly, so you won't have to build all of this custom tooling yourself. Ultimately you want to think of it as an optimization problem: there are metrics I've defined, and how do I improve those metrics over time? That's what will differentiate the more successful deployments from the less successful ones: your ability to continuously improve and innovate over time. From that perspective, it doesn't really matter whether you're calling OpenAI or deploying your own. As a matter of fact, think of it as model agnostic: maybe I'm starting with, say, OpenAI, but once I have enough actual responses and I know how to evaluate them, I want to experiment with different open source models, and I'll actually fine-tune some of them. And by the way, it's gotten significantly easier. It's not voodoo magic; you don't need a PhD in machine learning to do it. Anybody can read a tutorial and cobble together some tooling. It's harder to make it production quality, but it's not that hard. That's really what I think people should be focusing on.
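
A bare-bones version of the loop described above: define a domain metric, score logged responses, and compare model variants against the same evaluation set. The metric, eval set, and model callables are placeholders; a real deployment would log production traffic and use richer domain-specific metrics.

```python
from typing import Callable, List, Tuple

def exact_match(pred: str, gold: str) -> float:
    # Placeholder metric; swap in whatever makes sense for your domain.
    return float(pred.strip().lower() == gold.strip().lower())

def evaluate(model_fn: Callable[[str], str],
             eval_set: List[Tuple[str, str]]) -> float:
    scores = [exact_match(model_fn(prompt), gold) for prompt, gold in eval_set]
    return sum(scores) / len(scores)

# In practice the eval set comes from logged prompts plus reviewed answers.
eval_set = [("What does SLA stand for?", "service level agreement")]

# Hypothetical callables: one wrapping a hosted endpoint, one a fine-tuned
# open model. Treating both as plain functions keeps the loop model agnostic.
# baseline  = evaluate(call_hosted_endpoint, eval_set)
# candidate = evaluate(call_finetuned_model, eval_set)
```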

The Growing Accessibility of Machine Learning with Large Language Models

David Hershey

Yeah, I want to pick up off of that and touch on this team thing. You talked about not necessarily needing a PhD anymore, and I think that's really important. One of the more interesting things that's happened with language models is that a lot of software engineers who never really touched machine learning in the past have adopted the endpoint version of this and feel like they can do things. What I'd love to hear from you all is what types of teams you've seen adopting open source models so far. Is it ML PhDs? ML engineers? Software engineers learning to fine-tune? If you've talked to some of your users and gotten a sense of what skill sets out there are allowing people to fine-tune models, I'd love to hear more about it.

Vipul Ved Prakash

Yeah, as I was saying, we're seeing that a lot of folks who don't have a background in machine learning now have access to these tools. There's documentation, there are videos on how to do fine-tuning on the hardware available to you, and we're seeing a lot of that. It's fairly interesting. I would also say that in the domain of few-shot, in-context learning and prompt engineering, we often see people do things that our research team looks at and says, wow, that's amazing, we never thought this was possible. It opens up an area of creativity and a particular style of almost programming these models that is fairly accessible to programmers who have programmed different kinds of systems. And I expect this will become more and more the norm as tooling improves, along with understanding of how to really drive and tune these models. On our platform today, I would say 80% of the customers are not ML PhDs.

Reynold Xin

We're seeing something very similar. Especially in the last few months, a lot more people who have never done machine learning or applied machine learning in the past are coming on board, just playing with large language models. As a matter of fact, the first priority now is large language models, even if all they've done in the past is ETL pipelines. To some extent I feel like academia kind of failed a whole generation here: if you go take a machine learning class, the first thing you learn is linear algebra, then more linear algebra and more matrix multiplications, which makes the whole thing sound very scary. This is equivalent to saying, hey, if you want to learn any programming, start with transistors and understand the physics behind them before you learn any Perl or Ruby or Python. Right? That's not how people learn programming. Yes, if you want to design the next architecture, you might have to understand all the math behind it. But there's a lot of the application side that should be dumbed down substantially, and it also gets easier with better tooling.

Open Source Language Models in the Future

David Hershey

I'd love to close the conversational bit of this by thinking a little about the future and where we see this open source ecosystem going. Maybe I'll go to each of you and ask how you see the open source language model ecosystem evolving over the next few years. I know there's a GPT-4 looming in the distance somewhere, but I'm curious which routes you see development really taking off on.

Reynold Xin

I think the world will realize that for domain-specific applications you don't need hundreds of billions or trillions of parameters; it will probably standardize somewhere in the low billions or double-digit billions. And then there will be commoditized open source foundation models, a few of them, for people to choose from. The key is that you have to train and fit for your specific domain. That's what we see: basically everybody starts with, hey, let's just call the OpenAI endpoint or some Azure endpoint, and once they start doing evaluation, they realize, hey, it might actually be cheaper and more cost effective, let alone the privacy issues, and maybe even more effective, if I do some fine-tuning based on my own context. I think most of the world will realize that for serious applications, which are domain specific, that's probably going to be the way to go. And almost all of those will be powered by open source models.

Vipul Ved Prakash

I believe the open source ecosystem will be the bigger ecosystem here. There's a place for both closed and open models, but if you're looking at a two-year horizon, I do think frontier models will be open source. There's just an incredible amount of ecosystem interest in advancing this technology, and you'll see that interest coalesce into groups that are able to build frontier models. As Reynold was saying, it's already happened in text-to-image, where the models are smaller and we have the compute for it, and we are putting together compute for these larger builds that I think will eventually turn into very competitive technologies.

Wei Lien Dang

It's been a really interesting discussion, getting your perspective on how to make sense of what's been going on with these open source foundation models and what we can expect going forward. A few questions I wanted to pose to each of you. The first is one we hear a lot as people set out to build in this ecosystem: what's a good place to start? Given the explosion of models, do you have any recommendations on which ones make sense to start with, which would be good ones to experiment and play around with? Of course, there are the ones each of your companies is making available, but beyond that, what advice do you have for AI builders out there in terms of which open models to start with?

Vipul Ved Prakash

One decision is the size of the model, because that has an impact on performance. We have great models now at 3 billion, 7 billion, and 40 billion parameters that can serve as the basis for your application. If you can achieve your application goals with the smallest model, that's best in terms of performance; these are still expensive models to run, and that should be a factor. But you also want to try multiple models and see which perform better on your tasks, since your tasks may not be captured by the standard benchmarks. So: pick the size, then try a few different models that rank well at that size.

Reynold Xin

If you just Google "best in class open source generative AI model for free commercial use," you can find it, because we've been asked this question literally hundreds of thousands of times in the last few months, so we created a web page for it. It has a table listing use cases on one dimension, and on the other: do you want quality, or do you want speed, which is related to cost? Once you look at it, you'll feel you should trust us, because we don't even recommend Dolly; we don't recommend our own models. It's an unbiased opinion based on our experience working with the thousands of different companies building large language model applications on Databricks now. And this page gets updated every month, because so many new things come out so quickly. So just take a look at that; that's our recommendation.

Best Practices for Parameterizing Large Language Models Over Time

Wei Lien Dang

I'm interested in best practices and methods for parameterizing models correctly at the outset, and in how to continuously monitor their performance and relevance over time. Any guiding advice on that topic?

Reynold Xin

Yeah, one thing is: don't think of large language models as something new that never existed before. Think of it as an applied machine learning problem, so a lot of the lessons from applied machine learning apply here. There are frameworks we built in the past, for example MLflow, that we're updating to introduce a lot of large-language-model-specific capabilities to make LLMs easier: how you do experiments, how you track them, how you log responses. We'll be announcing a lot of this next week at our Data + AI conference to make a lot of this easier. In general, I would say treat it as an applied machine learning problem, in which the key is to think of the model as something you'll very likely update multiple times. Even if you're just calling a REST endpoint, that's still a model, and it's something you'll very likely update many times. Be agile out there, and abstract away what the model is. I was joking with Matei the other day that there are going to be 100,000 model endpoint abstractions built by different companies in the next year, because everybody will follow this advice and realize, hey, there's nothing standard out there, let me build a new one. I'm pretty sure there will be an open source standard very soon, because it's not a secret sauce for anybody; there's no point building 100,000 of them. So, one: treat it as an applied ML problem and look at best practices from applied ML. Second, very importantly, abstract away the model API so you can swap models in and out. The model API might be a little deeper than you think; it also includes how you do logging. It's not just a request-response pair; it's a little more involved than that.
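
Here is one shape such a model-endpoint abstraction might take: a single interface that any backend implements, with request/response logging folded in. The names and structure are hypothetical, a sketch of the advice above rather than the open source standard Reynold predicts.

```python
import json
import time
from abc import ABC, abstractmethod

class ModelEndpoint(ABC):
    """Any backend: a hosted API, a fine-tuned open model, a local server."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class LoggedEndpoint(ModelEndpoint):
    """Wraps any endpoint and logs request/response pairs for evaluation."""
    def __init__(self, inner: ModelEndpoint, log_path: str):
        self.inner, self.log_path = inner, log_path

    def complete(self, prompt: str) -> str:
        start = time.time()
        response = self.inner.complete(prompt)
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"ts": start,
                                "latency_s": time.time() - start,
                                "prompt": prompt,
                                "response": response}) + "\n")
        return response

# Swapping models is then one line: wrap a different ModelEndpoint
# implementation, and the application code never changes.
```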

Open Source AI and Infrastructure Foundation Models 

Wei Lien Dang

Cool. Maybe one last question to wrap up this chat this morning. The potential and capabilities of these models have certainly captured the mainstream public's imagination, and there's a lot of discussion around standards and guiding principles. What are your thoughts on producing a set of guidelines, not necessarily a regulatory framework, but guidelines, industry norms, and best practices for guiding the development of these models going forward? How do each of you think about that, and how would each of your companies potentially contribute and have a role in that?

Vipul Ved Prakash

The way we think about it: I think it's important, and there's been a lot in the media about the dangers of AI and why, because of that, models have to be regulated. I think it's very important not to overstate those dangers and to describe what the potential concrete issues are. A lot of these are very application specific. If you're building models used to generate fiction and help Hollywood script writing, that probably has a very different set of standards and rules than a model used to approve mortgage applications, for instance. My view today is that the guidelines need to be sectoral rather than universal guidelines on models. Other than that, it doesn't really make a lot of sense to treat these models as dangerous artifacts today. I think the conversation around that is significantly overblown, and in some ways it's possibly also a way to get regulatory protection around big models for companies whose business models depend on larger models being scarce.

Reynold Xin

Yeah, absolutely. I 100% agree it's bizarre that there's a doomsday narrative, but I do see a lot of potential issues. For example, it's probably much easier to spread misinformation now, because with generative AI it sounds real, even more real than in the past. So I think some amount of regulation can be good; GDPR, despite being very painful, is I think a good thing, and the EU is actually working on the EU AI Act. The big warning sign I would put up is that it is actually in many of the largest companies' best interests to have this be even more regulated, because regulation means the cost of innovation goes up. One of the easiest ways to stifle innovation in the open source ecosystem is to make it very difficult to produce and train open source models, because there are so many things you have to follow, and that would dramatically cut down the amount of actual innovation. So we have to be very careful about overly regulating too early. This field is, I think, at its inception, and we cannot regulate innovation out of it. This is probably the only time you see a few large companies actively lobbying for regulation in tech, because they feel they're among the few who can afford the regulation right now.


Building AI platforms using open source

Wei Lien Dang

I'm curious, both of you guys work at companies that have built platforms, though, to support running those models. Certainly there's innovation in the models themselves, but I'm curious how you felt like having that platform to actually be able to serve customers who want to run the models. How that factored in? Why did each of you decide to build around open source? As opposed to certainly there are proprietary platforms out there as well. Fundamentally, how did each of your companies think about it? 

Reynold Xin

Sure. For us, it's not exclusive. Like it's fine for us to be using something proprietary, but at the same time actually be supporting the open source ecosystem. And really the reason Dolly was started was not that we felt, hey, we have to build a state of the art model. Initially it started as a very simple goal, which is, hey, let's try to learn about this and see how far we can get. And then very quickly we realized, hey, there's actually a lot of large open source, large language models out there that did not have instruction-following capabilities that the GPT has demonstrated. Can we actually take an existing model and for very little amount of money, actually fine tune so it would exhibit some of the instruction-following behavior? And it took us actually the first version of Dolly about three days from the inception of idea to training to actually making a blog post about it. So it was actually a very short period. And then we realized, hey, we should be telling the story. And the point of story is actually not what a lot of people thought it would be, which is, hey, they're trying to challenge OpenAI and coming up with amazing language models. The point of story is to demonstrate, hey, it is actually not that hard if you want to train something that's for your specific domain. It's really, really difficult to build a very general purpose chat bot that can talk about anything in the world, but it's actually not that hard if you want to build a domain specific model. As a matter of fact, here's all the ingredients you need to actually get there. And that's basically how Dolly came about. It wasn't really, hey, let's build something state of the art, but it's, hey, let's demonstrate to the world that it's possible. And then we quickly followed up. Once we released Dolly 1.0, the big challenge was, hey, you can't actually use it in a commercial setting because the data set they trained on did not have the permissive license because we're using OpenAI to generate those data sets. So we did this Dolly 2.0 follow up in about two weeks. It would just ask every employee at Databricks to write some questions and answers, and that generated a Dolly 15K data set that was also open source. And so to be honest, I think the Dolly 15 dataset itself is far more valuable than the Dolly model because the dataset will become the part of the training corpus probably for the thousands or even the millions of open source models to come. And it wouldn't be the only training data set to become a part of that. And it's sort of our little contribution to the world here.

The importance and impact of open source data in AI Research

Wei Lien Dang

Well, that's an interesting sort of segue. Vipul, would be curious to get your perspective because for the notion of a dataset I mean, you guys obviously put out the Red Pajama dataset and thought about that as a key starting point and sort of the open nature seems foundational to what Together is building, but would love to kind of hear your perspective on how you decided or why you decided to build around open source.

Vipul Ved Prakash

Yeah, echoing Reynold, I think Dolly 15K is an incredible dataset. I do think data has this enduring power over time in open source models, because you will see changes in architectures, and more efficient architectures come up over time, and they will build around the data ecosystem that has been created in the open. So we think that's really important. Also, I've seen some of the talks Databricks has done around how they created the dataset, which I think is also super interesting, because you can now replicate this process in a different setting. And that's how we are thinking about pretraining data. It's very important that you have this pretraining data, because it enables downstream model building. It's also very important that the recipes to generate the data are open and available, because you can take those recipes, use them for another language, and improve the dataset quality over time. And from the perspective of folks building models and exploring architectures or data mixtures, it really speeds up the process to have a dataset that's been carefully prepared. In the future, we want to do a lot more around this, where there's a dataset but there are also quality labels and content type labels, so you have this bigger data playground to work with. You sort of see this with the Pile; the Pile has become really central to innovation in open AI research, and it's going to be pretty important. I do believe that over the next few years, the work around data will be the cornerstone of improvements.
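
As a rough illustration of the "data playground" idea, here is a hedged sketch of filtering a pretraining corpus by quality and content-type labels. The field names, file, and threshold are hypothetical placeholders, not an actual RedPajama schema.

```python
# Hypothetical sketch: filter pretraining documents by quality and
# content-type labels. The field names ("quality_score", "content_type"),
# the file, and the threshold are illustrative placeholders.
from datasets import load_dataset

corpus = load_dataset("json", data_files="corpus_sample.jsonl", split="train")

def keep(doc):
    # Keep documents a quality classifier scored highly, restricted to
    # the content types the downstream model builder cares about.
    return (doc["quality_score"] >= 0.8
            and doc["content_type"] in {"prose", "code"})

filtered = corpus.filter(keep)
print(f"kept {len(filtered)} of {len(corpus)} documents")
```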

Open large language models (LLMs) and the need for standardization

Wei Lien Dang

We're seeing the results and the impact of these datasets in these new models now. And I'm curious: there are these leaderboards and rankings that people are paying attention to. What do you think is missing from them? What isn't captured in these lists, these rankings, even some of the benchmarks that are being used to generate the leaderboards?

Vipul Ved Prakash

One, I think it's a good process. It's also fun for people who are building models; it sets up some friendly competitive dynamics, and it's really acting as a north star for progress. That said, I do think the benchmarks need to be much more principled than they are today. You often see differences when people try to reproduce these benchmarks outside of the leaderboards. There's research from the University of Edinburgh showing that Llama 7B is still five or six points ahead of all the RedPajama-, Falcon-, and MPT-based models. So I think you need a lot more rigor around benchmarking, because it is driving and shaping what researchers are doing. Part of that rigor is having processes for decontaminating training datasets from benchmark datasets; benchmark examples will sometimes be included in GitHub code, and they may end up in the models. Making the data open allows a lot of that process to happen. But I think a lot more work needs to be done. Evaluation is fairly difficult, and it will benefit from new research around how to evaluate these models and from new processes.
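
To make the decontamination step concrete, here is a minimal sketch of one common approach: dropping training documents that share long n-grams with benchmark examples. The 13-gram window and whitespace tokenization are common simplifying choices, not a fixed standard.

```python
# Minimal sketch of n-gram-based benchmark decontamination: drop any
# training document that shares a 13-gram with a benchmark example.
# Window size and tokenization are common choices, not a fixed standard.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(train_docs: list[str], benchmark: list[str]) -> list[str]:
    contaminated: set[tuple[str, ...]] = set()
    for example in benchmark:
        contaminated |= ngrams(example)
    # Keep only documents with no overlapping n-grams.
    return [doc for doc in train_docs if not (ngrams(doc) & contaminated)]
```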

Wei Lien Dang

Reynold, what's your perspective? How should people make sense of these leaderboards?

Reynold Xin

So in a different world I'm in, which is the data systems world, benchmarks are very important, and there are benchmark experts who have spent their entire lives doing nothing but designing benchmarks. We're so early in this process on the large language model side that, first of all, it's great to see them. I do think the ultimate benchmarks are the ones where you just have unbiased humans rate the answers, and you have a large number of them. That's how some of these leaderboards work. But one of the big challenges, which honestly I don't know how to solve, is that they tend to be fairly simple, and many of the applications of these large language models are not going to be super simple. So fairly simplistic setups, with humans just asking simple questions and rating the answers, represent only a very narrow slice of what these large language models will be used for. And they tend not to be domain specific, while I think a lot of LLMs will be used for domain-specific applications. It's just going to be difficult to weigh, hey, here's what maybe 10,000 random people on the Internet think about the responses, versus, hey, here's what doctors would think about this type of response. As a matter of fact, people on the open Internet might not even ask the right questions. And that draws into how you think about evaluation of the models. I think, just like any machine learning problem, LLMs are no different: you need a continuous, iterative improvement process. You shouldn't just trust, hey, here's a benchmark, let me pick whatever's in first place and start using it.

Wei Lien Dang

I think many of us would say we're still early, and there are so many models out there. I know both of you know Chris Ré at Stanford, and he's likened where we are to AI's Linux moment. And if you extend the analogy, there were many variations of Linux distributions early on, but eventually those consolidated.

Do you think long term we'll see eventual industry consolidation? Does it make sense to have so many open LLMs and foundation models? Obviously there's an aspect where people have more choice and so on, but I think there's also something to be said for more standardization and people coalescing around fewer models. I'm curious how each of you thinks about that.

Vipul Ved Prakash

I do think there will be consolidation eventually, but I think there's a long way to go. There are going to be new architectures and new training strategies in the coming years, and you will see a variety of models, and research labs will start adopting and investing in particular approaches. I think there are signs that there will be consolidation. For instance, today with the Llama architecture, the amount of tooling that's built around it is becoming substantial, and when you're building a new model, it does make more sense to adopt that architecture, because all of this tooling just automatically works. That may be one of the ways in which consolidation happens: it happens around architectures. But as Reynold was saying, we're very early in this process, and once there are better models, users will gravitate towards them. And that's the highest-value thing in some ways.

Wei Lien Dang

Reynold, any thoughts on long term consolidation?

Reynold Xin

I think consolidation is necessary and bound to happen, just like with any technology. That doesn't mean there will be no more than three open source models; there will be a distribution where a few wildly popular ones are what people basically start with. But there will still be innovation happening over time, and disruptors that come in and replace them. Is it going to happen next year? Is it going to happen 5 or 10 years from now? I suspect it's more on the latter side, but it's to some extent not super useful to speculate exactly what will happen here, because the space is innovating so quickly that all you need to know is you have to be ready to embrace the change.

The role of academia in AI research and innovation

Wei Lien Dang

Vipul, one topic that you and I have chatted about is the impact of research, and the fact that a lot of what's happening in the ecosystem, even from the standpoint of building things like LLM apps, ends up being research driven. Both of you have worked across academic labs and large tech companies, and we've seen things come out of those, like Llama from Meta and work from Stanford, Berkeley, and other universities. I'm curious, what do you think the role of academic labs, large tech companies, and these different stakeholders is, alongside companies like yours, in driving the innovation forward?

Vipul Ved Prakash

I would say that in computer science, it has always been the case that academia experiments with the crazier ideas, which are then adopted by industry, and industry produces industrialized artifacts from those ideas. And then there's a cycle of academia looking at that and innovating further. I think that's very much the case in AI. There is an incredible amount of scholarship and research happening in academic labs, and it's making its way into the industry, and industry is doing the scale-up of some of these ideas, which I think is a very productive cycle. This is one of the reasons we are collaborating fairly deeply with academic labs at Together, especially given that open source is a very friendly way of doing this sort of collaboration. But I think academia has a huge role to play, and so does industry.

Wei Lien Dang

Yeah, it would seem like both of your companies have been real leaders in fostering and figuring out ways to collaborate across academia and industry. And I think it's interesting to see the cross-pollination between academia and what's going on with companies like yours in this context. Reynold, I'm curious to hear your opinion on what you see as some of the most significant innovations coming out of the research community that perhaps you and others are looking to productize.

Reynold Xin

Yeah, so one caveat: there are so many new things coming out every day that it's actually becoming very difficult to track, even for me.

I think one skepticism towards academia that has been brewing recently in AI is, hey, AI requires a lot of compute power and access to data, which academia does not typically have. Does that mean the role of academic research is diminished, especially in the era of large language models? I would challenge that assumption. A very obvious example from the past, which people often don't realize, is Stable Diffusion, one of the most popular AI models, which came out of the latent diffusion work in academia. That's something that's had a remarkable, profound impact on the industry. The LLM stuff, I think, is a little bit newer, or at least the attention is; there are people who have been working on large language models, even in academia, for a while, but a lot of the attention is newer. Maybe two things I've seen recently that are pretty exciting. One is actually not academic research, but from a bunch of former academics at MosaicML: MPT's ALiBi-based context window, which I think allows up to 84k tokens, if I remember correctly. That's pretty remarkable; it's actually even longer than GPT-4's. The other thing is, we all know one of the biggest issues with LLMs is that they hallucinate. Sometimes they make up APIs; they make up facts that are just completely wrong, which is problematic when you want to, for example, use a large language model to facilitate integration and orchestration of systems. I've recently seen the Gorilla work at UC Berkeley by Joey Gonzalez's team, which basically combines retrieval-based systems with the model: hey, here's an API-calling LLM that always makes sure I'm calling the right APIs and passing in the right parameters, and I don't hallucinate an API. I think work like that, maybe not the specific incarnation of it, will actually push the application of large language models a pretty long way.
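
Gorilla's actual method is more involved, but the underlying idea, grounding the model in retrieved documentation for real APIs so it has nothing to hallucinate, can be sketched roughly as follows. The embedding model choice, the toy API list, and the prompt format here are simplifying assumptions.

```python
# Rough sketch of retrieval-grounded API calling, in the spirit of Gorilla.
# Embedding model, toy API list, and prompt format are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

api_docs = [
    "requests.get(url, params=None) -> Response: fetch a URL over HTTP",
    "json.loads(s) -> object: parse a JSON string into Python objects",
    "csv.reader(f) -> iterator: read rows from a CSV file object",
]
doc_vecs = embedder.encode(api_docs, convert_to_tensor=True)

def grounded_prompt(task: str, k: int = 2) -> str:
    # Retrieve the k most relevant documented APIs and pin the model to
    # them, so it selects a real call instead of inventing one.
    query = embedder.encode(task, convert_to_tensor=True)
    hits = util.semantic_search(query, doc_vecs, top_k=k)[0]
    context = "\n".join(api_docs[h["corpus_id"]] for h in hits)
    return f"Use only these APIs:\n{context}\n\nTask: {task}\nCall:"

print(grounded_prompt("download a web page"))
```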

Innovations in Large Language Models (LLMs)

Wei Lien Dang

And then maybe one other question with regards to research: both of you highlighted the central role of data earlier in the conversation, and our observation has been that a lot of the innovation in the development of LLMs has centered on improving input data. I'm curious, what innovations do both of you see or expect with regards to training data? Who would you look to in terms of moving the space forward?

Vipul Ved Prakash

Yeah, there's a recent paper from Stanford called DoReMi that's looking at how to weight, in a principled way, the different data slices that go into datasets. I think this can have a potentially huge impact. With RedPajama, there are seven different slices of data from seven different sources, and we weighted them according to the Llama paper. But when we reweight them in different ways, we find that the downstream models have significantly different quality. So I'm very excited about work around that. You can also take some of the large datasets, like Common Crawl, split them further, and reweight the pieces. And there are also questions of deduplication: how much deduplication is right, and what's the sweet spot for it? There's research around all of that, which I think is high-leverage in terms of the quality you can get out of the same data and the same model architecture. And I think instruct data is a huge area for improvement. The RedPajama 7-billion-parameter model saw seven points of difference on HELM benchmarks, and seven points is amazing; it's the difference between Llama and GPT-3 today. And that comes from a set of instruct data, including the work that Databricks has done with Dolly. So I think growing the datasets and understanding what kind of data makes models better will have a big impact, and there's a lot of research happening around that.
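
Mechanically, reweighting data slices comes down to changing the probability with which each source is sampled during training. Here is a hedged sketch; the slice contents and mixture weights are toy placeholders, and DoReMi learns its weights via minimax optimization rather than fixing them by hand.

```python
# Sketch of sampling pretraining batches from weighted source slices.
# Slice contents and weights are toy placeholders; DoReMi learns the
# weights rather than fixing them by hand.
import random

slices = {
    "common_crawl": ["web doc 1", "web doc 2"],  # RedPajama has seven such sources
    "github":       ["def f(): pass"],
    "wikipedia":    ["encyclopedia entry"],
}
weights = {"common_crawl": 0.70, "github": 0.15, "wikipedia": 0.15}

def sample_batch(batch_size: int) -> list[str]:
    names = list(slices)
    picks = random.choices(names,
                           weights=[weights[n] for n in names],
                           k=batch_size)
    # Draw one document from each chosen slice.
    return [random.choice(slices[name]) for name in picks]

print(sample_batch(8))
```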

Future of training data in AI

Wei Lien Dang

Reynold, any thoughts on where you see training data going forward?

Reynold Xin

Yeah, my sense is, thanks to the work of everybody, with us playing a very small part and, honestly, companies like Together playing a bigger part, the training data for open source foundational models will become commoditized over time. Everybody will have access to more or less the same set of datasets for a reasonable amount of money, not an insane amount of money. And then a lot of the competitive advantage will come from, hey, when we talk about domain-specific things, what about my data? What about how my customers have been interacting with me in the past? The ability to leverage that private resource will maybe become the competitive differentiation for most of the domain-specific use cases. Again, my thesis is that most companies or organizations are not trying to build a general chatbot that can converse about everything from philosophy to state-of-the-art technology. Most companies have certain applications they want to put LLMs to use for, and the data specifically relevant to those is going to be the key. That's not the data that will be available on the open Internet, and it's also not the data that OpenAI would have. Rather, it's you, as an organization, who would have that data.

Building with open source models

Wei Lien Dang

I'm going to hand off to David now. He's going to cover a topic that I would say is really sort of top of mind for our audience, which is how to think about building with these open source models. And I think part of that is how companies like Together and Databricks are enabling people to build on top of the models that you've trained and made available. 

David Hershey

Yeah, thanks. I'll get started by asking you both how you think about when teams should choose to use open source models, and the trade-offs they make. You touched on it in your intro at the beginning, but I'm just curious, in your mind, what is pushing teams to adopt open source models as opposed to some of the proprietary endpoints, and what trade-offs do you think they make along that journey?

Vipul Ved Prakash

Yeah, I would say there are several reasons. A big one is control and privacy: the data they are using is sensitive, and it's not appropriate to send it to a shared cloud SaaS-type API. We also see that when you fine-tune these models, you're getting 10 to 12 points of accuracy improvement, and the transfer learning works really well. And people are realizing that they can often use smaller models. We have several customers who've taken 3-billion and 7-billion-parameter models, either continued pretraining them or fine-tuned them, and then used them with few-shot context to get better accuracy than GPT-4. I think anyone who, as Reynold was saying, has their own data and a sense of what tasks they are using the language model for really now has the toolset to build something on their own and get high quality and control over it.
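
As a rough illustration of that last pattern, a small fine-tuned model plus few-shot context, here is a sketch. The model checkpoint name is hypothetical, and the task and examples are placeholders for your own domain.

```python
# Sketch: few-shot prompting a small fine-tuned open model.
# "my-org/my-finetuned-3b" is a hypothetical checkpoint; the task and
# examples are placeholders.
from transformers import pipeline

generate = pipeline("text-generation", model="my-org/my-finetuned-3b")

FEW_SHOT = """\
Review: The checkout page crashes on submit.
Label: bug

Review: Please add dark mode.
Label: feature-request

Review: {text}
Label:"""

def classify(text: str) -> str:
    # The in-context examples steer the model toward the task format.
    out = generate(FEW_SHOT.format(text=text), max_new_tokens=5)
    return out[0]["generated_text"].rsplit("Label:", 1)[-1].strip()

print(classify("The app logs me out every hour."))
```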

David Hershey

Yeah, maybe to Reynold, to piggyback off that. I think one of the nice things about the proprietary endpoints is you don't really have to think about the data challenge, and so some of the teams that haven't managed data before might see this as scary. You went down the route of building this big instruction-tuned dataset with your company, and I'm curious if you have advice or thoughts on what it actually means to build datasets you can use to tune these models, whether to fine-tune or instruction-tune them, for folks who are maybe engineers just getting into the data space.

Reynold Xin

I think it's actually maybe less about how you think about the data here, and a little more about how you think about evaluation of your machine learning models, which large language models are a part of. Some people, when you say machine learning, will think of large language models; when people say AI, they do confuse the hell out of me, but when I say ML, it's roughly equivalent to AI. Here I will argue that, just like with any system you build, what you really want is a way to evaluate it. For a non-machine-learning system, you evaluate based on SLAs; you look at uptime, you look at a lot of things, you do regression analysis. It's no different when it comes to machine learning models: you want some way to evaluate how effective the model is in production. And the reason that's important is I can guarantee that whatever you deploy, even if it's just calling OpenAI's REST endpoint, is not going to be perfect, and there's massive room for improvement through prompt engineering. What you want is to come up with some metrics that make sense in your domain-specific context and evaluate against them. You want to log the responses, and you want to be able to evaluate after the fact, hey, how am I actually doing? Instead of just, hey, here's a new pile of code that gets rolled out to production. Then you want to improve upon that, and you want to run experiments. All of this is very standard for anybody who's been doing applied machine learning; they've been doing it for the last ten years. A lot of them have built bespoke tooling that maybe doesn't exist yet in commercial or open source settings, and the open source ecosystem and the vendors are trying to catch up really quickly, so you won't have to build all of this custom bespoke tooling yourself. But ultimately, you want to think of it as an optimization problem: there are some metrics I'm defining, and how do I improve those metrics over time? And that, I think, is what's going to differentiate the more successful deployments from the less successful ones: your ability to continuously improve and innovate over time. From that perspective, it doesn't really matter whether you're calling OpenAI or deploying your own. As a matter of fact, think of it as model agnostic: hey, maybe I'm starting with, say, OpenAI, but once I have enough actual responses and I know how to evaluate them, I want to experiment and try out different open source models, and I'll actually fine-tune some of them. And by the way, it's gotten significantly easier. It's not voodoo magic; you don't need a PhD in machine learning to do it. Anybody could read a tutorial and cobble together some tooling. It's harder to make it production quality, but it's not that hard. So that's really what I think people should be focusing on.
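
A minimal sketch of the loop Reynold describes, logging every interaction and scoring any model against the same domain metric, might look like the following. The JSONL log and the exact-match metric are stand-ins for whatever your domain actually needs.

```python
# Sketch of model-agnostic evaluation: log every interaction and score
# any model against the same labeled set. The JSONL log and exact-match
# metric are stand-ins for richer, domain-specific choices.
import json
import time

LOG_PATH = "llm_interactions.jsonl"

def log_interaction(model_id: str, prompt: str, response: str) -> None:
    # Append one record per call so you can evaluate after the fact.
    record = {"ts": time.time(), "model": model_id,
              "prompt": prompt, "response": response}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def evaluate(labeled: list[tuple[str, str]], call_model, model_id: str) -> float:
    # Run the same labeled prompts through any model and score them, so
    # swapping endpoints later is an experiment, not a rewrite.
    hits = 0
    for prompt, expected in labeled:
        response = call_model(prompt)
        log_interaction(model_id, prompt, response)
        hits += int(expected.lower() in response.lower())
    return hits / len(labeled)

# Usage: evaluate(test_set, openai_call, "hosted-model") vs.
#        evaluate(test_set, local_call, "my-finetuned-7b")
```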

The Growing Accessibility of Machine Learning with Large Language Models

David Hershey

Yeah, I maybe want to pick up off of that and touch on the team aspect; you talked about not necessarily needing a PhD anymore, and I think that's really important. One of the more interesting things that's happened with language models is that a lot of software engineers who never really touched machine learning in the past have adopted the endpoint version of this and feel like they can do things. What I'd love to hear from you both is what types of teams you've seen adopting open source models so far. Is it ML PhDs? Is it ML engineers? Is it software engineers who are learning to fine-tune? If you've talked to some of your users and gotten a sense of what skill sets out there are allowing people to fine-tune models, I would love to hear more about it.

Vipul Ved Prakash

Yeah, as I was saying, we are seeing a lot of folks who do not have a background in machine learning now have access to these tools. There's documentation, there are videos on how to do fine-tuning on the hardware that's available to you, and we're seeing a lot of that. It's fairly interesting. I would also say that in the domain of few-shot and in-context learning, of prompt engineering, we often see people do things that our research team looks at and says, oh wow, that's amazing, we never thought this was possible. It opens up, I think, an area of creativity and a particular style of almost programming these models, which is fairly accessible to programmers who have programmed different kinds of systems. And I expect this will become more and more of the norm as tooling improves and understanding of how to really drive and tune these models spreads. On our platform today, I would say 80% of the customers are not ML PhDs.

Reynold Xin

We are seeing something very similar, especially in the last few months: a lot more people who have never done machine learning or applied machine learning in the past are now coming on board and just playing with large language models. As a matter of fact, the first priority now is large language models, even if all they've been doing in the past is ETL pipelines. To some extent, I feel like academia kind of failed maybe a whole generation here, because if you go take a machine learning class, the first thing you learn is linear algebra. It's more linear algebra, more matrix multiplications, which makes the whole thing sound very scary. This is equivalent to saying, hey, if you want to learn any programming, start with transistors and understand the physics behind them in order to learn any Perl or Ruby or Python. Right? That's not how people learn programming. Yes, if you want to think about designing the next architecture, you might have to understand all the math behind it. But I think there's a lot of the application side that should be dumbed down substantially, and it also gets easier with better tooling.

Open Source Language Models in the Future

David Hershey

I'd love to close the conversational bit of this by thinking a little about the future and where we see this open source ecosystem going. So maybe I'll go to each of you and ask how you see the open source language model ecosystem evolving over the next few years. I know there's sort of a GPT-4 looming in the distance somewhere, but I'm curious which routes you see development really taking off along.

Reynold Xin

I think the world will realize that for domain-specific applications you don't need hundreds of billions or trillions of parameters; it's probably going to standardize somewhere in the single or low double digit billions. And then there will be commoditized open source foundational models, a few of them for people to choose from. But the key is you have to train and fit the model for your specific domain. That's what we see with basically everybody, even those who start with, hey, let's just call an OpenAI endpoint or some Azure endpoint. Once they start doing evaluation, they realize, hey, it might actually be cheaper and more cost effective, let alone any privacy issues, and maybe even more effective, if I do some fine-tuning based on my own context. I think most of the world will realize that serious applications are domain specific, that's probably going to be the way to go, and almost all of those will be powered by open source models.

Vipul Ved Prakash

I believe the open source ecosystem will be the bigger ecosystem here. I think there's a place for both closed and open models, but if you're looking at a two-year horizon, I do think frontier models will be open source. There is just an incredible amount of ecosystem interest in advancing this technology, and you'll see that interest coalesce into groups that are able to build frontier models. As Reynold was saying, it's already sort of happened in text-to-image; those are smaller models and we have the compute for them, but we are putting together compute for these larger builds, which I think will eventually turn into very competitive technologies.

Wei Lien Dang

It's been a really interesting discussion, just getting your perspectives on how to make sense of what's been going on with these open source foundation models and what we can expect going forward. A few questions I wanted to pose to each of you. The starting one is actually one we hear a lot as people set out to build in this ecosystem and figure out where to start. Given there's been an explosion of so many models, do you have any recommendations on which ones make sense to start with, which would be good ones to experiment and play around with? Of course, there are the ones each of your companies makes available, but I'm curious beyond that. What advice do you have for AI builders out there in terms of what open models to potentially start with?

Vipul Ved Prakash

One decision is the size of the model, because that has an impact on performance. We have great models now at 3 billion, at 7 billion, and at 40 billion parameters that can serve as the basis for your application. If you can achieve your application goals with the smallest model, that's the best in terms of performance; these are still expensive models to run, and that should be a factor. But you also want to try multiple models and see which of them perform better on your tasks, because your tasks may not be captured by the standard benchmarks that exist. So I think it's picking the size, and then trying a few different models that rank well at that size.

Reynold Xin

If you just Google "best in class open source generative AI model for free commercial use," you can find it, because we've been asked this question literally hundreds of thousands of times in the last few months, and we created a web page for it. It actually has a table with the use case along one dimension: do you want quality, or do you want speed, which is related to cost? And once you look at it, you'll feel you should trust us, because we don't even recommend Dolly; we don't even recommend our own models. It's a very unbiased opinion based on our experience working with the thousands of different companies building large language model applications on Databricks now. And this page gets updated every month, because so many new things come out very quickly. So just take a look at that. That's our recommendation.

Best Practices for Parameterizing Large Language Models Over Time

Wei Lien Dang

I'm interested in understanding best practices and methods for parameterizing models correctly at the outset, and how to continuously monitor their performance and relevance over time. Any guiding advice with regards to that topic?

Reynold Xin

Yeah, one thing is: don't think of large language models as something new that never existed before. Think of it as an applied machine learning problem, so a lot of lessons from applied machine learning apply here. There are frameworks we built in the past, for example MLflow, and we're updating it to introduce a lot of large-language-model-specific things to make things easier for LLMs. This includes how you do experiments, how you track them, and how you actually log responses. We'll be announcing a lot of things next week at our Data + AI Summit to make a lot of this easier. But in general, I would say think of it as an applied machine learning problem, in which maybe the key is to treat the model as something you'll very likely want to update multiple times. Even if you're just calling a REST endpoint, that's still a model, and it's something you'll very likely be updating many times. Be agile out there, and abstract away what the model is. I was joking with Matei the other day that I think there are going to be 100,000 model endpoint abstractions built by different companies in the next year, because everybody will be following this advice and everybody will realize, hey, there's nothing standard out there, let me build a new one. I'm pretty sure there will be an open source standard very soon, because it's not like it's a secret sauce for anybody; there's no point building 100,000 of them. So think of it from that perspective. One, treat it as an applied ML problem and look at best practices from applied ML. Second, very importantly, abstract away the model API so you can actually swap models in and out. And the model API might be a little bit deeper than you think: it also includes how it does logging. It's not just a request/response pair; it's a little bit more involved than that.
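
Here is a hedged sketch of the kind of endpoint abstraction Reynold describes: one small interface that hides whether the backing model is a hosted API or a local open source checkpoint, with logging as part of the surface. Both the interface shape and the stub backend are illustrative, not a standard.

```python
# Sketch of a model-endpoint abstraction so the backing model can be
# swapped without touching application code. Interface shape and stub
# backend are illustrative, not a standard.
from abc import ABC, abstractmethod

class ModelEndpoint(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

    def logged_complete(self, prompt: str) -> str:
        # Logging is part of the API surface, not an afterthought.
        response = self.complete(prompt)
        print(f"[{type(self).__name__}] {prompt!r} -> {response!r}")
        return response

class StubEndpoint(ModelEndpoint):
    # Stand-in backend so the sketch runs; swap in a hosted-API client
    # or a local open source checkpoint here.
    def complete(self, prompt: str) -> str:
        return "stub response"

def answer(endpoint: ModelEndpoint, question: str) -> str:
    # Application code depends only on the interface, so swapping models
    # becomes an experiment rather than a rewrite.
    return endpoint.logged_complete(f"Q: {question}\nA:")

print(answer(StubEndpoint(), "Which open model should I start with?"))
```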

Guidelines and regulation for open source AI and foundation models

Wei Lien Dang

Cool. Maybe one last question to wrap up this chat this morning for each of you. The potential and capabilities of these models have certainly captured the mainstream public's imagination, and there's a lot of discussion around standards and guiding principles. What are your thoughts on producing a set of guidelines, not necessarily a regulatory framework, but guidelines, industry norms, and best practices, for guiding the development of these models going forward? And how would each of your companies actually contribute and have a role in that?

Vipul Ved Prakash

The way we think about it: there's been a lot in the media about the dangers of AI and, because of that, the idea that models have to be regulated. I think it's very important not to overstate those dangers and to describe what the potential concrete issues are. A lot of these are very application specific. If you're building models that are being used to generate fiction and help Hollywood with scriptwriting, that probably has a very different set of standards and rules than a model that's being used to approve mortgage applications, for instance. My view today is that the guidelines need to be sectoral, rather than universal guidelines on models. Other than that, it doesn't make a lot of sense to treat these models as dangerous artifacts today. I think the conversation around that is significantly overblown, and in some ways possibly also a way to get regulatory protection around big models for companies whose business models depend on the larger models being scarce.

Reynold Xin

Yeah, absolutely. I 100% agree that the doom narrative is bizarre, but I do see a lot of potential issues. For example, it's probably much easier to spread misinformation, because now it sounds real; it sounds even more real than in the past, with generative AI. In general, I think some amount of regulation can be good. GDPR, despite being very painful, I think is a good thing, and the EU is actually working on the EU AI Act. The big warning sign I would put up is that it is actually in many of the largest companies' best interests to have this be even more regulated, because regulation means the cost of innovation goes up. One of the easiest ways to stifle innovation in the open source ecosystem is to make it very difficult to produce and train open source models, because there are so many things you have to follow, and that would dramatically cut down the amount of actual innovation. So we have to be very careful about overly regulating too early. This field is, I think, at its inception, and we cannot regulate innovation out of it. This is probably the only time in tech you see a few large companies actively lobbying for regulation, because they feel like they are among the few that could afford the regulation right now.
