Exploring the Impact of Big Data on Predictive Analytics and Decision Making
A data scientist discusses the role of big data in predictive analytics, emphasizing the importance of fine-grained behavior data for improved decision making.
Foster Provost: Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
Added on 09/29/2024

Speaker 1: Thanks a lot for the kind introduction, Ed, and for inviting me. And also, do I have my slides? Yes. Thanks to all my colleagues whose insight I'm going to draw on. And I didn't want to depend on O'Reilly for advertising, so I put a picture of the book up there. So I expected to get in a fight backstage with the prior speaker based on his title, but I actually agree largely with just about everything that he said. You may well have seen Gartner's hype cycle, which tracks technologies through their emergence, their inflated expectations, the resulting disillusionment, and then eventually, hopefully for some technologies, toward the point of being productive for business. Last year, Gartner put big data quite near the top of the peak of inflated expectations. And so Ed asked me to come and talk a little bit about whether big data indeed is better. I wanted to talk a bit about my particular perspective. So I'm a data scientist. I actually like that term because it's better than anything we've had before. My view of big data, therefore, is as a supporting technology for data science. I don't want to diminish all the other great uses of big data, but from my perspective, it's a supporting technology for data science. And data science is a supporting craft for doing better decision making. What I'm going to talk about today is a particular sort of data science, predictive analytics, and the reason I'm going to talk about that is because it's one of the most mature and best-understood fields of data science. By the way, Gartner puts predictive analytics furthest toward the productive-use plateau. So what I'm going to talk about is exactly whether bigger really is better in terms of data in the context of predictive analytics, and we'll get into fine-grained behavior data.
So when we're thinking about data science, and big data as well, one of the fundamental principles we should keep in mind is that data should be thought of as an asset. A lot of people have used the term data assets this week, right? And when we think about assets, we want to think about what value, what return, we can get on our assets, and also about what investments we should make in them: not just what data we have, but how we should invest in those assets. And in order to examine whether or not we're getting a return on our assets, we need some analytic tools, and one such analytic tool is the learning curve. So here's an example. We're going to dig down into a particular case study: a predictive analytics application for a major bank whose name you would recognize, if I could tell you, which has a very sophisticated, long-standing predictive modeling team. What we're showing here on the vertical axis, the y-axis, is the predictive performance, measured in lift. If you don't understand lift, read a good data science book and you'll understand it. For our purposes here, it's how good the predictive model is. And on the horizontal axis is how much data you're using to build your models, the amount of training data in machine learning or statistical learning parlance; here it shows the percentage of a million customers used for building the models. What's important is not that this model gives pretty good predictive performance for this particular application. Once we look at the relationship between the amount of data we have and the predictive performance, we see that the curve flattens out very early: bigger actually isn't better. Okay. Now, once we're thinking from a big data perspective, we should all realize that big data is really an unfortunate term, right?
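The learning-curve analysis described above can be sketched in a few lines. This is a minimal illustration, not the bank's actual data or methodology: the synthetic dataset, the logistic regression model, and the top-decile definition of lift are all stand-in assumptions.

```python
# Learning-curve sketch: how does lift change as the training set grows?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def lift_at(y_true, scores, frac=0.1):
    """Response rate among the top-scored fraction, divided by the base rate."""
    k = max(1, int(len(scores) * frac))
    top = np.argsort(scores)[::-1][:k]          # indices of the top-scored cases
    return y_true[top].mean() / y_true.mean()

# Synthetic stand-in for the bank's customer data (fixed seed for repeatability).
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

lifts = []
for n in [100, 300, 1000, 3000, 10000]:         # increasing training-set sizes
    model = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    scores = model.predict_proba(X_te)[:, 1]
    lifts.append(lift_at(y_te, scores))
    print(f"n={n:>6}  lift@10% = {lifts[-1]:.2f}")
```

Plotting lift against n gives the learning curve; with ordinary low-dimensional data like this, the curve typically flattens early, which is the "bigger isn't better" pattern from the talk's first figure.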
Because it's not necessarily about just more data. It's about different sorts of data. If we take a big data perspective, we ought to think not just about getting more data just like this; maybe there are other sorts of data that might be able to improve our application, right? And, as has also been mentioned, another fundamental principle of data science is to always keep in mind exactly what task you're trying to solve. Here we're trying to model some internal drivers that make some subset of the population more interested in this product, so that they're hopefully more likely to respond to an offer for this product rather than for some other product. So what variables, what data, were used so far? Well, this team was very sophisticated and used all manner of sociodemographic data, psychographic data, geographic data, prior experience with the firm: two to three hundred different variables. But one thing they didn't use is something that banks have regularly, and that is the actual fine-grained behavior: individual fine-grained behaviors from the customers, not things aggregated to the customer level. Why might that be useful? Well, again, if we think back to what we're trying to do, we're trying to predict the individual's tastes, interests, and proclivities for different products. And because we're a bank, we see the merchants these individuals transact with, and if we see a lot of different merchants, we may be able to get a much finer-grained and more nuanced view of their tastes and preferences and proclivities, whatever you want to call them: the fundamental drivers that are going to lead them to prefer one product and maybe thereby respond to an offer. So what we have here on the left is the exact same learning curve we saw before. The blue curve is the learning curve we saw before.
And the red curve is what happens if you actually build models that take advantage of this fine-grained data. You have to be able to do that; I'm not going to talk about the how today. It's a relatively large data set: a million customers, described by over three million merchants, and you're going to build predictive models from that. And when you do, you see that now, as you increase the amount of data, you continue to increase the predictive performance, essentially off the scale, to the point where the data set used for the study was exhausted. This isn't specific to this particular product. That was a savings product; a very different product, a pension product, is on the right. And we see the same phenomenon: indeed, when you're looking at fine-grained behavior data, you get increasing returns to scale. Again, we could talk separately about the statistical learning reasons behind this, but I think the important thing is much more to have looked at the problem differently, brought in a different sort of data that intuitively felt like the appropriate data, and then figured out how to actually build models based on it. I have a couple of footnotes before I conclude. One is that I wanted to dig down into this particular case study because there seem to be (tell me if you know of others) very few studies where you have really strong best practices with traditional data and then a comparison with good practices with another sort of data. I'm sure firms may be doing a lot of this internally, but it's rare to actually see one that we can all look at and study. However, we do see similar learning curve behavior in other data sets that exhibit the same characteristics. What are the characteristics? You have massive numbers of individual behaviors, each of which probably contains a small amount of information, and the data items are sparse.
Any individual, any one of us, can only take so many of these millions of different behaviors, so the individual instances that you're going to be using to build your model, and also for prediction, are going to be sparse. There's a paper that just came out last week in the Big Data Journal that shows that across a variety of data sets with this type of data, we continue to see increasing returns to scale in predictive performance. My other footnote is that I don't want my talk to be interpreted as saying this is something new. Using this kind of data certainly isn't new. A few decades ago, data on all the locations people visited were used for fraud detection. Chandra, who gave a talk a few talks ago, has a very high-impact study on using individual communication behaviors between people for social-based marketing. Our friends at Dstillery, Claudia Perlich and Brian D'Alessandro, who spoke yesterday and the day before, respectively, do this sort of massive, large-scale predictive modeling with massive, large-scale behavior data for online advertising. It's not new, but we rarely see comparisons where you can see that big data has a different impact for different sorts of data. That was my quick message. I think it has a very strong implication for firms that have the ability to invest in data assets of different sizes. Think about two banks, one very large and one small. The very large bank has a potentially much larger data asset, and so it may be able to get significant competitive advantage if it is building predictive models for doing something that's going to be advantageous to the bank. Whether bigger data is better depends upon the characteristics of the data. We saw one specific example of this: sparse, fine-grained data on consumer behavior.
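The sparse, fine-grained representation described in the talk can be sketched as a customer-by-merchant matrix: one row per customer, one column per merchant, a 1 where a transaction occurred. Everything below is hypothetical for illustration: the sizes, the random transactions, and the offer-response labels are assumptions, scaled down from the talk's million customers and three million merchants.

```python
# Sparse customer-x-merchant behavior matrix, as discussed in the talk.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_customers, n_merchants = 5000, 20000
behaviors_per_customer = 30   # any one of us transacts with only a tiny fraction of merchants

rows, cols = [], []
for c in range(n_customers):
    rows.extend([c] * behaviors_per_customer)
    cols.extend(rng.choice(n_merchants, size=behaviors_per_customer, replace=False))

# One row per customer, one column per merchant, 1 = "transacted with".
X = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(n_customers, n_merchants))
y = rng.integers(0, 2, size=n_customers)   # hypothetical offer-response labels

# Linear models accept sparse input directly, which is what makes
# merchant-scale feature spaces computationally feasible.
model = LogisticRegression(max_iter=200).fit(X, y)
print(f"density = {X.nnz / (n_customers * n_merchants):.4f}")   # prints density = 0.0015
```

The density printout makes the sparsity concrete: each instance activates only 30 of 20,000 possible features, yet a linear model over this matrix trains without ever materializing the dense version.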
It also depends on the capability for modeling such data, which we can talk about some other time or you can read about in a good data science book. Thank you. Thank you.
