Exploring Numerai: The Hedge Fund Revolutionizing Data Sharing and Machine Learning
Discover how Numerai disrupts traditional hedge funds by sharing obfuscated data, enabling global participation in machine learning for stock market predictions.
A machine learning approach to stock trading Richard Craib and Lex Fridman
Added on 09/29/2024

Speaker 1: So let's talk about Numerai, which is an incredible company, system, and idea, I think. A good place to start: what is Numerai and how does it work?

Speaker 2: So Numerai is the first hedge fund that gives away all of its data. This is probably the last thing a hedge fund would do, right? Why would we give away our data? It's like giving away your edge. But the reason we do it is because we're looking for people to model our data. And the way we do it is by obfuscating the data. So when you look at Numerai's data, which you can download for free, it just looks like a million rows of numbers between zero and one, and you have no idea what the columns mean. But if you're good at machine learning or have done regressions before, you know you can still find patterns in this data, even though you don't know what the features mean.
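
A minimal sketch of the point about unnamed columns, assuming nothing about Numerai's real files: the data below is synthetic, and the column count, the hidden weights, and the noise level are all made up for illustration. It shows that a plain least-squares regression can pick out which anonymous columns matter without knowing what they mean.

```python
import numpy as np

# Synthetic stand-in for obfuscated data: rows of numbers in [0, 1]
# with unnamed columns. None of this is Numerai's actual data.
rng = np.random.default_rng(42)
n_rows, n_features = 10_000, 20
X = rng.random((n_rows, n_features))

# Hypothetical hidden relationship: only columns 3 and 7 drive the target.
hidden = np.zeros(n_features)
hidden[[3, 7]] = [0.5, -0.3]
y = X @ hidden + 0.1 * rng.standard_normal(n_rows)

# Ordinary least squares with an intercept column appended.
A = np.column_stack([X, np.ones(n_rows)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The two largest coefficients (by magnitude) identify the useful columns,
# even though the columns carry no names or meaning.
top = np.argsort(-np.abs(coef[:-1]))[:2]
print(sorted(top.tolist()))
```

A tree model such as XGBoost would slot into the same place as the regression; the linear fit is used here only to keep the sketch dependency-free.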

Speaker 1: And the data itself is time series data. And even though it's obfuscated and anonymized, what is the source data? Approximately, what are we talking about?

Speaker 2: So we are buying data from lots of different data vendors. And they would also never want us to share that data, so we have strict contracts with them. So we can only share it obfuscated. But that's the kind of data you could never buy yourself unless you had maybe a million dollars a year of budget to buy data. So what's happened with the hedge fund industry is you have a lot of talented people who used to be able to trade and still can trade, but now they have such a data disadvantage that it would never make sense for them to trade themselves. But with Numerai, by giving away this obfuscated data, we can give them a really high-quality data set that would otherwise be very expensive. And they can use whatever new machine learning technique they want to find patterns in that data that we can use in our hedge fund.

Speaker 1: And so how much variety is there in the underlying data? I apologize if I'm using the wrong terms, but one is just the stock price. Then there's options and all that kind of stuff, like, what are they called, order books. Is there maybe other data that's not directly related to the stock market, like natural language as well? All that kind of stuff.

Speaker 2: Yeah. We're really focused on data that's specific to stocks. So things like a P/E ratio: every stock has a P/E ratio. For some stocks it's not as meaningful, but every stock has that. Every stock has one-year momentum, how much it went up in the last year. Those are very common factors, but we try to get lots and lots of those factors that we have for many, many years, like 15 or 20 years of history. And then the setup of the problem is commonly called, in quant, cross-sectional global equity. You're not really trying to say, I believe the stock will go up. You're trying to say the relative position of this stock in feature space makes it not a bad buy in a portfolio.
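
One common way to read "cross-sectional" is that only the relative ordering of stocks within a period matters. The sketch below is an illustration under that assumption, not Numerai's portfolio construction: the function name `cross_sectional_weights` and the example scores are hypothetical. It turns model scores into long/short weights that sum to zero.

```python
import numpy as np

def cross_sectional_weights(scores):
    """Turn per-stock scores for one period into zero-sum long/short weights.

    Only the ordering of the scores is used, not their absolute level.
    """
    s = np.asarray(scores, dtype=float)
    if s.size < 2:
        raise ValueError("need at least two stocks to rank against each other")
    ranks = s.argsort().argsort()        # 0 = lowest score; ties broken by position
    centered = ranks - ranks.mean()      # below-median stocks become negative (short)
    return centered / np.abs(centered).sum()  # absolute weights sum to 1

# Hypothetical scores for four stocks in one period.
scores = [0.2, 0.9, 0.5, 0.7]
weights = cross_sectional_weights(scores)
print(weights)
```

Because the weights sum to zero, the long and short sides offset each other, which is the sense in which such a portfolio does not depend on the market as a whole going up.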

Speaker 1: So it captures some period of time and you're trying to find the patterns, the dynamics captured by the data of that period of time in order to make short-term predictions about what's going to happen.

Speaker 2: Yeah. So our predictions are also not that short. We're not really caring about things like order books and tick data. It's not high frequency at all. We're actually holding things for quite a bit longer. So our prediction time horizon is about one month, and we end up holding stocks for maybe three or four months. So I kind of believe that's a little bit more like investing than kind of plumbing. To go long a stock that's mispriced on one exchange and short it on another exchange, that's just arbitrage. But what we're trying to do is really know something more about the longer-term future of the stock.

Speaker 1: Yeah. So from the patterns in these periods of time series data, you're trying to understand something fundamental about the stock, like about deep value, about whether, in the context of the market, it's underpriced, overpriced, all that kind of stuff. So this is about investing. It's not, like you said, about high frequency trading, which I think is a fascinating open question from a machine learning perspective. But just to build on that: you've anonymized the data, and now you're giving away the data, and now anyone can try to build algorithms that make investing decisions or predictions on top of that data. So what does that look like? What's the goal of that? What are the underlying principles of that?

Speaker 2: So the first thing is, you know, we could obviously model that data in house, right? We can make an XGBoost model on the data, and that would be quite good too. But what we're trying to do is, by opening it up and letting anybody participate, we can do quite a lot better than if we modeled it ourselves. And a lot better, on the stock market, doesn't need to be very much. It really matters, the difference between making 10% and 12% in an equity market neutral hedge fund, because usually you're charging 2% fees. So if you can do 2% better, that's all your fees. It's worth it. So we're trying to make sure that we always have the best possible model. As new machine learning libraries come out and new techniques come out, they get automatically synthesized. If there's a great paper on supervised learning, someone on Numerai will figure out how to use it on Numerai's data.
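
The fee arithmetic above can be written out as a small sketch. It assumes a flat 2% management fee subtracted from the gross return and ignores performance fees and compounding, which the speaker does not discuss; `net_return` is a hypothetical helper name.

```python
def net_return(gross_return, management_fee=0.02):
    """Investor's return after a flat management fee (simplified assumption)."""
    return gross_return - management_fee

# The two gross returns mentioned by the speaker: 10% and 12%.
for gross in (0.10, 0.12):
    print(f"gross {gross:.0%} -> net {net_return(gross):.0%}")
```

Under these assumptions, a fund grossing 12% leaves the investor with the same 10% that a fee-free 10% return would, which is the sense in which two extra points of performance pay for the fees.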

Speaker 1: And is there an ensemble of models going on, or is it more towards one or two or three best-performing models?

Speaker 2: So the way we decide how to weight all of the predictions together is by how much the users are staking on them, how much of the cryptocurrency they're putting behind their models. So they're saying, I believe in my model. You can trust me because I'm going to put skin in the game. And so we can take the stake-weighted predictions from all our users and average those together. And that's a much better model than any one model in the sum, because ensembling a lot of models together is kind of the key thing you need to do in investing too.
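
A minimal sketch of the stake-weighted averaging described above, assuming each model submits one score per stock and each model's weight is its share of the total stake. The function name, the scores, and the stake amounts are hypothetical; the exact weighting Numerai uses is not given in the conversation.

```python
import numpy as np

def stake_weighted_ensemble(predictions, stakes):
    """Average model predictions, weighting each model by its share of total stake."""
    predictions = np.asarray(predictions, dtype=float)  # shape: (n_models, n_stocks)
    stakes = np.asarray(stakes, dtype=float)            # shape: (n_models,)
    if stakes.sum() <= 0:
        raise ValueError("total stake must be positive")
    weights = stakes / stakes.sum()                     # weights sum to 1
    return weights @ predictions                        # shape: (n_stocks,)

# Hypothetical example: three models scoring four stocks in [0, 1].
preds = [
    [0.9, 0.2, 0.5, 0.4],
    [0.7, 0.3, 0.6, 0.1],
    [0.1, 0.8, 0.5, 0.9],
]
stakes = [50.0, 30.0, 20.0]
ensemble = stake_weighted_ensemble(preds, stakes)
print(ensemble)
```

A model with no stake gets zero weight here, so it cannot move the combined signal; that is one way to read "skin in the game".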

Speaker 1: Yeah. So there's a kind of duality from the perspective of a machine learning engineer. It's both a competition, just a really interesting, difficult machine learning problem, and it's a way to invest algorithmically. But the way to invest algorithmically is also a way to put skin in the game that communicates to you the quality of the algorithm, and also forces you to really be serious about the models that you build. So everything just works nicely together. I guess one way to say that is the interests are aligned. Exactly. Okay. So it's just like poker is not fun when it's for very low stakes. The higher the stakes, the more the dynamics of the system start playing out correctly. As a small side note, looking at the big, broad view of machine learning or AI today, is there something you can say about what kind of algorithms seem to do well in these kinds of competitions at this time? Is there some universal thing you can say, like neural networks suck, recurrent neural networks suck, transformers suck, or they're awesome, or old school, more basic kinds of classifiers are better? Is there some kind of conclusion so far that you can state?

Speaker 2: There definitely is something pretty nice about tree models like XGBoost. They just seem to work pretty nicely on this type of data. So out of the box, if you're trying to come a hundredth in the tournament, maybe you would try to use that. But there's something particularly interesting about the problem that not many people understand. If you're familiar with machine learning, this typically will surprise you when you model our data. So one of the things that you look at in finance is you don't want to be too exposed to any one risk. Even if the best sector in the world to invest in over the last 10 years was tech, that does not mean you should put all of your money into tech. So if you train a model, it would say put all your money into tech, it's super good. But what you want to do is actually be very careful about how much exposure you have to certain features. So on Numerai, what a lot of people figure out is, actually, if you train a model on this kind of data, you want to somehow neutralize or minimize your exposure to these certain features. That is unusual, because if you trained a stoplight or stop sign detector in computer vision, and, let's say, you have an autoencoder and it's figuring out, okay, it's going to be red and it's going to be white, that's the last thing you want to reduce your exposure to. Why would you reduce your exposure to the thing that's helping your model the most? And that's actually this counterintuitive thing you have to do with machine learning on financial data.
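
One way to make "neutralize your exposure to certain features" concrete is to subtract from the predictions the part that a linear regression on those features explains. The sketch below assumes that linear-projection reading; the function name `neutralize`, the `proportion` parameter, and the synthetic data are illustrative and are not taken from the conversation.

```python
import numpy as np

def neutralize(predictions, features, proportion=1.0):
    """Remove the part of `predictions` that is linearly explained by `features`.

    proportion=1.0 removes all linear exposure; smaller values remove a fraction.
    """
    p = np.asarray(predictions, dtype=float)
    X = np.asarray(features, dtype=float)
    # Append a constant column so the mean level is removed as well.
    X = np.column_stack([X, np.ones(len(X))])
    beta, *_ = np.linalg.lstsq(X, p, rcond=None)
    return p - proportion * (X @ beta)

# Synthetic example: raw predictions lean heavily on feature 0.
rng = np.random.default_rng(0)
features = rng.random((500, 5))
raw = 0.8 * features[:, 0] + 0.2 * rng.random(500)
neutral = neutralize(raw, features)

exposure_before = np.corrcoef(raw, features[:, 0])[0, 1]
exposure_after = np.corrcoef(neutral, features[:, 0])[0, 1]
print(f"exposure before: {exposure_before:.3f}, after: {exposure_after:.3f}")
```

Full neutralization leaves only the signal that the features cannot explain linearly; a `proportion` below 1 keeps some of the exposure, trading a little of the risk reduction for more of the original signal.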

Speaker 1: So reducing your exposure would help you generalize. So basically, financial data has a large number of patterns that appeared in the past, and also a large number of patterns that have not appeared in the past. And so in that sense, you have to reduce the exposure to red lights, to the color red. That's interesting. But how much of this is art and how much of it is science from your perspective so far, as you start to climb from the 100th position to the 95th in the competition?

Speaker 2: Yeah, well, if you do make yourself super exposed to one or two features, you can have a lot of volatility when you're playing Numerai. You could maybe very rapidly rise to be high if you were getting lucky. And that's a bit like the stock market. Sure, take on massive risk exposure, put all your money into one stock, and you might make 100%, but in the long run it doesn't work out very well. And so the best users are trying to stay high for as long as possible, and not necessarily trying to be first for a little bit.
