Speaker 1: Welcome back. So today we're going to talk about Chapter 8 of the book, Tree-Based Methods for Regression and Classification. As we'll see, these are methods for supervised learning that stratify or segment the predictor space in order to make predictions, and the stratification they produce forms what's called a decision tree. We'll talk about those methods, which actually date from around the mid-'80s, I'd say, and some of the names that are associated with them. So first of all, the software package associated with the method is called CART, for Classification and Regression Trees. And the first two authors of CART are Leo Breiman, who is a well-known statistician from Berkeley, and Jerry Friedman, who's our colleague here at Stanford. And Friedman actually was one of our teachers. Back in the '80s, yeah. Back in the '80s, when Trevor and I were graduate students here, and that's when the CART book came out. Right, and actually we hope to talk to Jerry in this course; you'll get a chance to hear him talk about the development of trees and how that happened in the '80s. So the first part of this section is on trees, and then we'll talk about bagging, boosting, and random forests, which combine trees in a more modern way.
Speaker 2: It's interesting because trees were used widely for a number of years as one of the primary learning tools. And now trees are used widely as building blocks in some of the more modern learning tools. So they've been around for a long time, and they're going to stay around.
Speaker 1: Right, so in a little more detail, the good side of trees, as you'll see, is that they're simple in a sense, especially if they're small, and hence they can be useful for interpretation. But on the other hand, when you look at them as supervised learning techniques for prediction, they're typically not competitive with the best methods that are around. They can be made competitive, though, when they're combined in ensembles of trees, and we'll talk about that in the second part of the lecture, in particular bagging, random forests, and boosting. Those are methods that were developed in the '90s and beyond, and they improved the prediction performance of trees substantially. Okay, so let's start at the beginning, with the basic idea of a decision tree. This applies to both regression and classification; we'll start with regression, and a little later we'll consider classification. Okay, so before we actually see a tree, let's think about how we would stratify some data. So here's the baseball data. The response is salary, which we've color-coded in this graph, from low salary, being blue and green, to higher salary, being yellow and red, for baseball players. And for each player we've measured two predictors: the number of years he's been in the league, and the number of hits he makes per season. So we want to predict salary from those two predictors. So if I asked you to stratify this population, trying to separate the high-salary from low-salary players, what would you do? Well, if you look at it, it looks like the higher salaries are up here, right? And the lower ones are maybe in this L shape. So one might say, well, let's split around here. This split separates the predictor space into two regions, right? The higher salaries, where we see some yellow and red up here mixed with some blue and green, and the lower ones on the left. So that does a pretty good job of picking out the low-salary players on the left. It doesn't do a great job on the right, so we might do a further split. So it looks like you've cut it at about five years there.
Speaker 2: So years is number of years in the league? Right, number of years in the league, yeah.
Speaker 1: Which makes sense, because the players that have been in the league longer can expect to have a higher salary, whereas the ones who have been in the league for less time have a lower salary. But those with more years in the league seem to be a bit mixed. Right, they're a bit mixed. So it looks like our job isn't quite done. Maybe we could do a refinement by stratifying in this way, right? And now we have three regions. We've got the high-salary players up here. And what are these players? These are ones who have been in the league more than maybe around five years and who have more than maybe 125 hits. They're the ones who have the highest salary. And then the medium category looks like it's down here, right? They've also got more than roughly five years of experience but a smaller number of hits. And then the lower is on the left. So with just two simple cuts, you've made a pretty good segmentation there. Exactly, and this is the idea of a decision tree, which we'll actually see on the next slide. When we applied a decision tree technique to that data, we got exactly this tree. And this tree is very much like what I drew on the previous slide. On the next slide, I've got the caption for this figure, and throughout the notes we have detailed captions for a lot of the figures. I'm not going to read out the captions in the course, but they're there for your reference if you want to read the details. But let's look at this tree and interpret what it's saying, and what the layout is, first of all. So this is a series of splits. At the top, we have all the data. And this Years less than 4.5 is a split: it's a partition into, on the left, the players who have been in the league for fewer than 4.5 years, and on the right, players who have been in the league for more than 4.5 years. So this is pretty much the split I made here, right? I said at 5, but that's roughly it; 4.5 is between 4 and 5. So this split at the top of the tree is a partition into the left and right regions. So this tree says we're first of all going to split players on years of experience. Those with fewer than 4.5 years are assigned to the left node, and those with more than 4.5 are assigned to the right. And what's that number at the bottom? The number at the bottom is the average log response. I think we took logs here, so it's the average log salary of the players who fell into that bin. On the right, we do a further split on hits. Among the players who have more than 4.5 years of experience, those who also have fewer than 117.5 hits are assigned to this branch, otherwise to this branch. So we end up with three bins: highest salary, medium salary, and lowest salary. And these are exactly, well, almost exactly, the three regions that I drew here just by hand. Okay. So here, in detail, is the partition that corresponds to that tree. And it's very much like the one I drew by hand, with the splits being given: there's the top split and here's the split on the left. But this was found by an algorithm, right? It's found by an algorithm. And actually, the algorithm we're going to describe is going to build a much larger tree, and then it's going to prune it from the bottom to give this three-node tree. So the automatic algorithm is going to do what we just looked at; it seems quite sensible in this example to divide into three regions. Okay. Some terminology, which I've already been using. The nodes at the bottom are called terminal nodes, because they're terminal; they're not further split.
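As a quick aside, here is a minimal sketch of how a small tree like this one could be fit in software, assuming scikit-learn in Python and a Hitters-style data file; the file name and the column names Years, Hits, and Salary are assumptions for illustration, not from the lecture.

    # Minimal sketch (not the lecture's own code): fit a small regression tree
    # to Hitters-style data with scikit-learn. The file name "Hitters.csv" and
    # the column names Years, Hits, Salary are assumptions for illustration.
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor, export_text

    hitters = pd.read_csv("Hitters.csv").dropna(subset=["Salary"])
    X = hitters[["Years", "Hits"]]
    y = np.log(hitters["Salary"])  # the lecture models log salary

    # Limit the tree to three terminal nodes, matching the example in the lecture.
    tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, y)
    print(export_text(tree, feature_names=["Years", "Hits"]))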
You notice I'm calling these trees, but they're upside down, right? The leaves are at the bottom and the root is at the top; that's just for convenience. The non-terminal nodes are called internal nodes; in this case we have two internal nodes in our tree. But usually the terminal nodes are the ones of interest, because they're the ones that describe the partition of the predictor space. So how do we interpret that tree? Well, we split first on years of experience, so that's saying the most important factor in determining salary is years of experience: those with less experience have a lower salary. On that left branch, we didn't split any further, so it looks like the number of hits isn't important in determining salary for the players with less experience. But on the right, where we have the players with more experience, hits is important. So notice what we're saying here: this is not a symmetric tree, right? We split once to give us the left bucket, but then on the right we split that bucket again into two more buckets on hits. So the point is that the number of hits seems to be important for those with more than 4.5 years of experience, but not important for those with fewer than 4.5. Now, again, this gets to the point I made earlier. It's a very simple tree, which makes it very easy to display and interpret, right? There's no equation. One thing that sometimes scares the non-statistician collaborators we have: if they don't know much math and you write down an equation for a model, it's intimidating and not very attractive to them. One nice thing about a tree is that the tree is the model, right? This picture is the summary of the model, and we don't need an equation to summarize it. So it's something that's simple to understand for people who aren't comfortable with mathematics. On the other hand, it's probably much simpler than it deserves to be, right? And for that reason, among others, the prediction error of trees is not very good compared to other methods, and we'll see we can improve it substantially if we combine ensembles of trees. So I haven't said in detail how we actually got that tree. I said that the tree we saw was actually very close to the one we got just by intuitively splitting the feature space. But how does the automatic tree-growing algorithm work? Well, the idea is we want to divide the predictor space into J non-overlapping regions, for some number J, which we'll have to pick as well. In the case of the previous example, J was 3. And having grown the tree, the prediction, as we've seen, is just the average of the response values that fall into each of the terminal nodes. But how do we actually decide on the splits, on the shape of the partition of the feature space? Well, in the most general case we could divide the feature space into regions of any shape, but we restrict ourselves to boxes, meaning that the edges of the regions are parallel to the axes. We do that for interpretation: if we had a region that was a circle, it would be really hard to describe and interpret. But even restricting ourselves to boxes, it turns out the tree-building process is difficult. So let's pose exactly the problem we might want to solve.
If we define our boxes as R1 through RJ, for some number of boxes J, then we might want to find the boxes so that, when we sum up the variation of the observations around the mean of each box, the total is as small as possible. Here Rj is the set of observations falling in the j-th box, and y-hat Rj is the average of the response of those observations.
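In symbols, the criterion just described is the residual sum of squares over the boxes,

$$\min_{R_1,\dots,R_J}\; \sum_{j=1}^{J} \sum_{i:\, x_i \in R_j} \left(y_i - \hat{y}_{R_j}\right)^2,$$

where $\hat{y}_{R_j}$ is the mean response of the training observations falling in box $R_j$.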
Speaker 2: So those are the averages at the terminal nodes in the tree, right? Exactly. So each box represents one of those terminal leaves, where you're going to represent the observations by an average.
Speaker 1: Right, and we have in this case J such terminal leaves, and we're going to choose a set of boxes so that the total variation of the observations around their mean in a box is as small as possible. That makes sense, right? Because we want the boxes to be homogeneous, with observations that are very similar within each box, while across boxes they'll be very different. So this looks like a reasonable way to pose the box-finding problem. But it turns out that's actually too hard to solve computationally. If you say, I want to find the 10 boxes that have the smallest value of this criterion, it's actually computationally infeasible. Well, 10 might be solvable, but certainly beyond 20 or 50 it gets very hard.
Speaker 2: Especially if you think about how many ways you can make boxes, the number just boggles the mind; it gets big very fast.
Speaker 1: Exactly, so trees use an approximation, a sort of obvious method: a top-down, greedy approach. It's top-down because it starts at the top with the whole set of observations and then splits them into two pieces, one split at a time at each level. It's greedy because it doesn't find the best overall partition among all possible partitions, but only the best split at the immediate place it's looking. So let's go through the details of the tree-growing process. I've got it on this slide, but it might be easier to go back to the little tree we have there and just talk through it there. So what we do is we start at the top with the full set of data and all the predictors, and we look for the predictor and the split point that produce the smallest value of the criterion we had written down, which is the sum of squares of each response around the average in its node. So we're going to make a split to produce two nodes: we look at all possible predictors and all split points, and choose the combination that produces the smallest criterion value.
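Concretely, at each step the greedy search considers, for every predictor $X_j$ and every cutpoint $s$, the pair of half-planes $R_1(j,s) = \{X \mid X_j < s\}$ and $R_2(j,s) = \{X \mid X_j \ge s\}$, and picks the $j$ and $s$ that minimize

$$\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2.$$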
Speaker 2: So that's before we had the three nodes here; there would just be a mean on the left and a mean on the right.
Speaker 1: Exactly, right. So we're starting with the full data, with actually only one node. We're going to split it into two, and the winner was this: the predictor Years, split at 4.5, producing these two branches, the left and the right. And then the process is repeated: we find the best split we can among the left and right nodes, and the winner was Hits at 117.5. So again, in each case, we're doing the best we can in a greedy sense, and that produced these three terminal nodes. So let's go back to the slides; I've said that in more detail on this slide. And then one question is, when do we stop? We could decide to stop at just a small number of nodes, maybe create three nodes, like in that tree, or we could grow a larger tree.
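To make the greedy, recursive process concrete, here is a toy sketch of the split search and tree growing just described, under the assumption that X is a 2-D NumPy array of predictors and y a 1-D array of responses; it stops at a fixed depth rather than growing a large tree and pruning, and all the names are illustrative rather than taken from the lecture.

    # Toy sketch of top-down, greedy (recursive binary) splitting for regression.
    # Illustrative only; stopping here uses a simple depth/size rule, not the
    # grow-then-prune strategy the lecture goes on to describe.
    import numpy as np

    def rss(y):
        # Residual sum of squares of y around its mean (0 for an empty node).
        return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

    def best_split(X, y):
        # Search every predictor and every candidate cutpoint for the split
        # whose two child nodes have the smallest combined RSS.
        best = None
        for j in range(X.shape[1]):
            for s in np.unique(X[:, j]):
                left = X[:, j] < s
                score = rss(y[left]) + rss(y[~left])
                if best is None or score < best[0]:
                    best = (score, j, s)
        return best  # (combined RSS, predictor index, cutpoint)

    def grow(X, y, depth=0, max_depth=2, min_node=5):
        # Grow a small tree, stopping at a fixed depth or a small node size.
        if depth == max_depth or len(y) < min_node:
            return {"prediction": float(y.mean())}      # terminal node
        _, j, s = best_split(X, y)
        left = X[:, j] < s
        if left.all() or (~left).all():                 # no useful split found
            return {"prediction": float(y.mean())}
        return {"predictor": j, "cutpoint": float(s),
                "left": grow(X[left], y[left], depth + 1, max_depth, min_node),
                "right": grow(X[~left], y[~left], depth + 1, max_depth, min_node)}

On the baseball data, this kind of search should find splits close to the ones in the tree above (Years near 4.5, then Hits near 117.5 in the right branch), though the exact cutpoints depend on how candidate split points are chosen.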