Speaker 1: Regression tree is for you and for me, StatQuest. Hello, I'm Josh Starmer and welcome to StatQuest. Today we're going to talk about regression trees, and they're going to be clearly explained. This StatQuest assumes you are already familiar with the tradeoff that plagues all of machine learning, the bias-variance tradeoff, and with the basic ideas behind decision trees and regression. If not, check out the quests. The links are in the description below.

Now, imagine we developed a new drug to cure the common cold. However, we don't know the optimal dosage to give patients, so we do a clinical trial with different dosages and measure how effective each dosage is. If the data looked like this, and, in general, the higher the dose, the more effective the drug, then we could easily fit a line to the data. And if someone told us they were taking a 27 mg dose, we could use the line to predict that a 27 mg dose should be 62% effective.

However, what if the data looked like this? Low dosages are not effective, moderate dosages work really well, somewhat higher dosages work at about 50% effectiveness, and high dosages are not effective at all. In this case, fitting a straight line to the data will not be very useful. For example, if someone told us they were taking a 20 mg dose, then we would predict that a 20 mg dose should be 45% effective, even though the observed data says it should be 100% effective. So we need to use something other than a straight line to make predictions. One option is to use a regression tree.

Regression trees are a type of decision tree. In a regression tree, each leaf represents a numeric value. In contrast, classification trees have true or false in their leaves, or some other discrete category.

With this regression tree, we start by asking if the dosage is less than 14.5. If so, then we are talking about these 6 observations in the training data, and the average drug effectiveness for these 6 observations is 4.2%. So the tree uses the average value, 4.2%, as its prediction for people with dosages less than 14.5. On the other hand, if the dosage is greater than or equal to 14.5, and greater than or equal to 29, then we are talking about these 4 observations in the training data set, and the average drug effectiveness for these 4 observations is 2.5%. So the tree uses the average value, 2.5%, as its prediction for people with dosages greater than or equal to 29. Now, if the dosage is greater than or equal to 14.5, less than 29, and greater than or equal to 23.5, then we are talking about these 5 observations in the training data set, and the average drug effectiveness for these 5 observations is 52.8%. So the tree uses the average value, 52.8%, as its prediction for people with dosages between 23.5 and 29. Lastly, if the dosage is greater than or equal to 14.5, less than 29, and less than 23.5, then we are talking about these 4 observations in the training data set, and the average drug effectiveness for these 4 observations is 100%. So the tree uses the average value, 100%, as its prediction for people with dosages between 14.5 and 23.5.

Since each leaf corresponds to the average drug effectiveness in a different cluster of observations, the tree does a better job reflecting the data than the straight line. At this point, you might be thinking, the regression tree is cool, but I can also predict drug effectiveness just by looking at the graph.
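To make that walk through the tree concrete, here is a minimal sketch of the finished tree as code, with the thresholds and leaf averages from the walkthrough hard-coded (the function name is illustrative, not StatQuest's):

```python
def predict_effectiveness(dosage):
    """Walk the regression tree; each leaf returns the average drug
    effectiveness of the training observations that ended up in it."""
    if dosage < 14.5:
        return 4.2    # average of the 6 observations with dosage < 14.5
    if dosage >= 29:
        return 2.5    # average of the 4 observations with dosage >= 29
    if dosage >= 23.5:
        return 52.8   # average of the 5 observations between 23.5 and 29
    return 100.0      # average of the 4 observations between 14.5 and 23.5

print(predict_effectiveness(27))  # 52.8
```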
For example, if someone said they were taking a 27 mg dose, then, just by looking at the graph, I can tell that the drug will be about 50% effective. So why make a big deal about the regression tree? When the data are super simple, and we are only using one predictor, dosage, to predict drug effectiveness, making predictions by eye isn't terrible. But when we have 3 or more predictors, like dosage, age, and sex, to predict drug effectiveness, drawing a graph is very difficult, if not impossible. In contrast, a regression tree easily accommodates the additional predictors.

For example, if we wanted to predict the drug effectiveness for this patient, we would start by asking if they are older than 50. Since they are not over 50, we follow the branch on the right and ask if their dosage is greater than or equal to 29. Since their dosage is not greater than or equal to 29, we follow the branch on the right and ask if they are female. And since they are female, we follow the branch on the left and predict that the drug will be 100% effective. And that's not too far off from the truth, 98%.

Okay, now that we know that regression trees can easily handle complicated data, let's go back to the original data, with just one predictor, dosage, and talk about how to build this regression tree from scratch. Since regression trees are built from the top down, the first thing we do is figure out why we start by asking if dosage is less than 14.5.

Going back to the graph of the data, let's focus on the two observations with the smallest dosages. Their average dosage is 3, and that corresponds to this dotted red line. Now we can build a very simple tree that splits the observations into two groups based on whether or not dosage is less than 3. The point on the far left is the only one with dosage less than 3, and the average drug effectiveness for that one point is 0. So we put 0 in the leaf on the left side for when dosage is less than 3. All of the other points have dosages greater than or equal to 3, and the average drug effectiveness for those points is 38.8. So we put 38.8 in the leaf on the right side for when dosage is greater than or equal to 3.

The values in each leaf are the predictions that this simple tree will make for drug effectiveness. For example, this point on the far left has dosage less than 3, and the tree predicts that the drug effectiveness will be 0. The prediction for this point, drug effectiveness equals 0, is pretty good since it is the same as the observed value. In contrast, for this point, which has dosage greater than 3, the tree predicts that the drug effectiveness will be 38.8, and that prediction is not very good since the observed drug effectiveness is 100%.

Note, we can visualize how bad the prediction is by drawing a dotted line between the observed and predicted values. In other words, the dotted line is a residual. For each point in the data, we can draw its residual, the difference between the observed and predicted values, and we can use the residuals to quantify the quality of these predictions. Starting with the only point with dosage less than 3, we calculate the difference between its observed drug effectiveness, 0, and the predicted drug effectiveness, 0, and then square the difference. In other words, this is the squared residual for the first point. Now we add the squared residuals for the remaining points with dosages greater than or equal to 3.
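As a sketch of what this passage is computing, assuming the training data sit in two parallel lists (every name here is illustrative):

```python
def leaf_means(dosages, effectiveness, threshold):
    """Average drug effectiveness on each side of a candidate split."""
    left = [e for d, e in zip(dosages, effectiveness) if d < threshold]
    right = [e for d, e in zip(dosages, effectiveness) if d >= threshold]
    return sum(left) / len(left), sum(right) / len(right)

def squared_residuals(dosages, effectiveness, threshold):
    """Squared difference between each observed value and its leaf's prediction."""
    left_mean, right_mean = leaf_means(dosages, effectiveness, threshold)
    return [(e - (left_mean if d < threshold else right_mean)) ** 2
            for d, e in zip(dosages, effectiveness)]
```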
In other words, for this point, we calculate the difference between the observed and predicted values, and square it, and then add it to the first term. Then we do the same thing for the next point, and the next point, and the rest of the points, until we have added squared residuals for every point. Thus, to evaluate the predictions made when the threshold is dosage less than 3, we add up the squared residuals for every point, and get 27,468.5.

Note, we can plot the sum of squared residuals on this graph. The y-axis corresponds to the sum of squared residuals, and the x-axis corresponds to dosage thresholds. In this case, the dosage threshold was 3. But if we focus on the next two points in the graph, and calculate their average dosage, which is 5, then we can use dosage less than 5 as a new threshold. And using dosage less than 5 gives us new predictions, and new residuals. And that means we can add a new sum of squared residuals to our graph. In this case, the new threshold, dosage less than 5, results in a smaller sum of squared residuals, and that means using dosage less than 5 as the threshold resulted in better predictions overall. BAM.

Now let's focus on the next two points. Calculate their average, which is 7, and use dosage less than 7 as a new threshold. Again, the new threshold gives us new predictions, new residuals, and a new sum of squared residuals. Now shift the threshold over to the average dosage for the next two points, and add a new sum of squared residuals to the graph. And we repeat until we have calculated the sum of squared residuals for all of the remaining thresholds. BAM.

Now we can see the sum of squared residuals for all of the thresholds, and dosage less than 14.5 has the smallest sum of squared residuals. So dosage less than 14.5 will be the root of the tree. In summary, we split the data into two groups by finding the threshold that gave us the smallest sum of squared residuals. BAM.

Now let's focus on the six observations with dosage less than 14.5 that ended up in the node to the left of the root. In theory, we could split these six observations into two smaller groups, just like we did before, by calculating the sum of squared residuals for different thresholds and choosing the threshold with the lowest sum of squared residuals. Note, this observation has dosage less than 14.5 and does not have dosage less than 11.5, so it is the only observation to end up in this node. And since we can't split a single observation into two groups, we will call this node a leaf. However, since the remaining five observations go to the other node, we can split them once more.

Now we have divided the observations with dosage less than 14.5 into three separate groups. These two leaves only contain one observation each and cannot be split into smaller groups. In contrast, this leaf contains four observations. That said, those four observations all have the same drug effectiveness, so we don't need to split them into smaller groups. So we are done splitting the observations with dosage less than 14.5 into smaller groups.

Note, the predictions that this tree makes for all observations with dosage less than 14.5 are perfect. In other words, this observation has 20% drug effectiveness and the tree predicts 20% drug effectiveness, so the observed and predicted values are the same. This observation has 5% drug effectiveness, and that's exactly what the tree predicts. These four observations all have 0% drug effectiveness, and that's exactly what the tree predicts. Is that awesome? No.
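Here is one way that threshold search might look in code, again as a sketch with illustrative names; candidate thresholds are the averages of adjacent sorted dosages, exactly as in the walkthrough:

```python
def ssr(dosages, effectiveness, threshold):
    """Sum of squared residuals for a two-leaf split at `threshold`."""
    left = [e for d, e in zip(dosages, effectiveness) if d < threshold]
    right = [e for d, e in zip(dosages, effectiveness) if d >= threshold]
    total = 0.0
    for group in (left, right):
        if group:  # skip an empty side
            mean = sum(group) / len(group)
            total += sum((e - mean) ** 2 for e in group)
    return total

def best_threshold(dosages, effectiveness):
    """Midpoint between every adjacent pair of sorted dosages with the
    smallest sum of squared residuals."""
    xs = sorted(set(dosages))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: ssr(dosages, effectiveness, t))
```

On the example data, this scan would land on 14.5, the threshold that becomes the root.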
When a model fits the training data perfectly, it probably means it is overfit and will not perform well with new data. In machine learning lingo, the model has no bias but potentially large variance. Bummer. Is there a way to prevent our tree from overfitting the training data? Yes, there are a bunch of techniques. The simplest is to only split observations when there are more than some minimum number. Typically, the minimum number of observations required to allow a split is 20. However, since this example doesn't have many observations, I set the minimum to 7.

In other words, since there are only 6 observations with dosage less than 14.5, we will not split the observations in this node. Instead, this node will become a leaf, and the output will be the average drug effectiveness for the 6 observations with dosage less than 14.5, 4.2%. Bam.

Now we need to figure out what to do with the remaining 13 observations, with dosages greater than or equal to 14.5. Since we have more than 7 observations on the right side, we can split them into two groups, and we do that by finding the threshold that gives us the smallest sum of squared residuals. Note, there are only 4 observations with dosage greater than or equal to 29, so there are only 4 observations in this node. Thus, we will make this node a leaf because it contains fewer than 7 observations, and the output will be the average drug effectiveness for these 4 observations, 2.5%.

Now we need to figure out what to do with the 9 observations with dosages between 14.5 and 29. Since we have more than 7 observations, we can split them into two groups by finding the threshold that gives us the minimum sum of squared residuals. Note, since there are fewer than 7 observations in each of these two groups, this is the last split. So we use the average drug effectiveness for the observations with dosages between 14.5 and 23.5, 100%, as the output for the leaf on the right, and we use the average drug effectiveness for observations with dosages between 23.5 and 29, 52.8%, as the output for the leaf on the left. Since no leaf has 7 or more observations in it, we're done building the tree, and each leaf corresponds to the average drug effectiveness from a different cluster of observations. Double-bam.

So far, we have built a tree using a single predictor, dosage, to predict drug effectiveness. Now let's talk about how to build a tree to predict drug effectiveness using a bunch of predictors. Just like before, we will start by using dosage to predict drug effectiveness. Thus, just like before, we will try different thresholds for dosage, calculate the sum of squared residuals at each step, and pick the threshold that gives us the minimum sum of squared residuals. The best threshold becomes a candidate for the root.

Now we focus on using age to predict drug effectiveness. Just like with dosage, we try different thresholds for age, calculate the sum of squared residuals at each step, and pick the one that gives us the minimum sum of squared residuals. The best threshold becomes another candidate for the root. Now we focus on using sex to predict drug effectiveness. With sex, there is only one threshold to try, so we use that threshold to calculate the sum of squared residuals, and that becomes another candidate for the root. Now we compare the sum of squared residuals, SSRs, for each candidate and pick the candidate with the lowest value.
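A sketch of that comparison across predictors, assuming each observation is a dict like {'dosage': 10, 'age': 25, 'sex': 0, 'effectiveness': 98}, with sex encoded as 0/1 so that its single candidate threshold falls at 0.5 (the keys and encoding are assumptions for illustration):

```python
def ssr_for_split(rows, predictor, threshold):
    """Sum of squared residuals when splitting `rows` on one predictor."""
    left = [r["effectiveness"] for r in rows if r[predictor] < threshold]
    right = [r["effectiveness"] for r in rows if r[predictor] >= threshold]
    total = 0.0
    for group in (left, right):
        if group:
            mean = sum(group) / len(group)
            total += sum((v - mean) ** 2 for v in group)
    return total

def best_candidate(rows, predictors=("dosage", "age", "sex")):
    """Best (SSR, predictor, threshold) across every predictor."""
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for t in ((a + b) / 2 for a, b in zip(values, values[1:])):
            s = ssr_for_split(rows, p, t)
            if best is None or s < best[0]:
                best = (s, p, t)
    return best
```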
Since age greater than 50 had the lowest sum of squared residuals, it becomes the root of the tree. Then we grow the tree just like before, except now we compare the lowest sum of squared residuals from each predictor. And just like before, when a node has fewer than the minimum number of observations, which is usually 20, but we are using 7, we stop trying to divide it. Triple-bam.

In summary, regression trees are a type of decision tree. In a regression tree, each leaf represents a numeric value. We determine how to divide the observations by trying different thresholds and calculating the sum of squared residuals at each step.
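Putting the pieces together, a compact recursive builder under this stopping rule might look like the following sketch; it reuses the hypothetical best_candidate helper from the sketch above, and min_samples mirrors the 7-observation minimum used in this example:

```python
def build_tree(rows, predictors=("dosage", "age", "sex"), min_samples=7):
    """Split recursively; a node with fewer than `min_samples` observations
    becomes a leaf that predicts the average effectiveness of its rows."""
    if len(rows) < min_samples:
        return {"leaf": sum(r["effectiveness"] for r in rows) / len(rows)}
    candidate = best_candidate(rows, predictors)
    if candidate is None:  # no split possible: all predictor values identical
        return {"leaf": sum(r["effectiveness"] for r in rows) / len(rows)}
    _, predictor, threshold = candidate
    left = [r for r in rows if r[predictor] < threshold]
    right = [r for r in rows if r[predictor] >= threshold]
    return {"predictor": predictor, "threshold": threshold,
            "left": build_tree(left, predictors, min_samples),
            "right": build_tree(right, predictors, min_samples)}
```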
Speaker 2: Beep boop boop boop boop. Beep boop boop boop boop boop boop.
Speaker 1: The threshold with the smallest sum of squared residuals becomes a candidate for the root of the tree. If we have more than one predictor, we find the optimal threshold for each one and pick the candidate with the smallest sum of squared residuals to be the root. When we have fewer than some minimum number of observations in a node, 7 in this example, but more commonly 20, then that node becomes a leaf. Otherwise, we repeat the process to split the remaining observations until we can no longer split the observations into smaller groups. And then we are done. Hooray. We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign. Buy one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time, quest on.