Speaker 1: Hello people from the future. Welcome to Normalize Nerd. Today I will explain the concept behind decision trees, or to be more precise, classification using decision trees. As you know me, I will obviously discuss the intuition and the underlying math behind training a decision tree, and this video will contain a lot of visualizations. So I hope you are super excited. If you want to see more videos like this and stay connected with me, please subscribe to this channel and join our Discord server; the link will be in the description. So let's get started.

First of all, we need some data, because no algorithm can learn without any previous experience. This data set contains two features, x0 and x1. I have plotted x0 along the horizontal axis and x1 along the vertical axis. There are two classes, green and red. If you look carefully, you'll see the classes are not linearly separable; that is, you can't just draw a line to separate them. This is intentional, because I want to show you the true power of decision trees.

So what is a decision tree? Generally speaking, it's a binary tree that recursively splits the data set until we are left with pure leaf nodes, that is, nodes containing only one class. Don't worry if you didn't get what I just said; by the end of this video, you will understand every bit of it.

The best way to start is to look at a trained decision tree, so here's the decision tree classifier for this data set. Notice that there are two kinds of nodes: decision nodes and leaf nodes. The former contain a condition to split the data, and the latter help us decide the class of a new data point. How? We will see later; I want to walk you through the tree first. Let's have the whole data at our root node. Now, based on the condition of the root node, we are going to split it. The condition is whether the x0 feature is less than or equal to minus 12. In the plot, the splitting condition looks like this vertical line. Every point that lies on the line or to its left satisfies the condition. We place those points in the left child of the root node, and the points that don't meet the condition go to the right. We follow this rule for the remaining nodes as well. Notice that the left child contains only red points, so it's a pure node and we don't need to split it further. Let's come to the right child. The condition is whether x0 is less than or equal to 9. Here are the splitting line and the resulting child nodes. In this case, the right child is pure, so we go on to split the left one. Following the left child, this time the condition is x1 less than or equal to 9. Obviously, here we will have a horizontal splitting line. See that in this case both child nodes are pure, so we don't need to split any further.

Now the natural question is: how can we use this tree to classify a new data point? Suppose the new data point has x0 and x1 equal to 15 and 7, respectively. First, we check the condition at the root node. 15 is greater than minus 12, so it doesn't satisfy the condition, and we move to the right child. Here it fails again, so finally we arrive at a leaf node. Luckily, all the points in this node are red, so we can easily classify our new point as red. But this might not be the case for more complex data; we might not have 100% pure leaf nodes. In those cases, we perform a majority vote and assign the majority class to the test point. Let me show you how our model has divided the feature space into regions based on the splitting criteria.
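A minimal Python sketch of the prediction walk just described, not the video's own code: the thresholds (-12, 9, 9) and the test point (15, 7) come from the example above, while the Node class, the function names, and the colors assigned to the two leaves under the x1 <= 9 split are illustrative assumptions.

```python
# A rough sketch of the prediction walk described above. The thresholds
# (-12, 9, 9) and the test point (15, 7) come from the example; the Node
# class, the function names, and the colors of the two leaves under the
# x1 <= 9 split are assumptions, since the video doesn't spell them out.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, label=None):
        self.feature = feature      # index of the feature used by a decision node
        self.threshold = threshold  # split threshold of a decision node
        self.left = left            # child for "feature <= threshold"
        self.right = right          # child for "feature > threshold"
        self.label = label          # class of a (pure) leaf node

def predict(node, x):
    """Walk from the root to a leaf and return that leaf's class."""
    while node.label is None:                  # decision nodes have no label
        if x[node.feature] <= node.threshold:  # condition satisfied -> go left
            node = node.left
        else:                                  # condition fails -> go right
            node = node.right
    return node.label

# The example tree: x0 <= -12 at the root, then x0 <= 9, then x1 <= 9.
tree = Node(feature=0, threshold=-12,
            left=Node(label="red"),                       # pure left child of the root
            right=Node(feature=0, threshold=9,
                       left=Node(feature=1, threshold=9,
                                 left=Node(label="green"),   # orientation of these two
                                 right=Node(label="red")),   # pure leaves is assumed
                       right=Node(label="red")))             # pure right child

print(predict(tree, [15, 7]))   # fails both conditions, lands in the red leaf -> "red"
```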
Okay, so that is how we use a binary decision tree for classification. At this point, some of you might be thinking: this is just a bunch of nested if-else statements, so why do we consider it machine learning? Well, yes, it is just a bunch of if-else statements, but we need to know the correct conditions, right? There are many, many possible splitting conditions; here I am showing a couple of them. Our model needs to learn which features to use, and the corresponding threshold values, to optimally split the data, and this is why it is machine learning.

But how does our model decide the optimal splits? For that, let's start at the root. At this stage, we have the whole data with us. We are going to compare two splits. The first one is x1 less than or equal to 4, and the corresponding child nodes look like this. The second split condition is x0 less than or equal to minus 12. Now, which split would you choose? Well, if you remember our goal, which is to get pure leaf nodes, we must go for the second split, right? Because in this case, we have successfully produced a left child with red points only. But how can our computer do the same? The answer lies in information theory. More precisely, the model will choose the split that maximizes the information gain.

To calculate information gain, we first need to understand the information contained in a state. Let's look at the root state. Here, the number of red points and green points is the same. Imagine we are predicting the class of a randomly picked point: only half of the time will we be correct, which means this state has the highest impurity, or uncertainty. The way to quantify this is entropy, a measure of the information contained in a state. If the entropy is high, then we are very uncertain about the randomly picked point, and we need more bits to describe the state. The formula is entropy = -sum over i of p_i * log2(p_i), where p_i denotes the probability of the i-th class. Here, both red and green have equal probability, 0.5. If we plug in the values, we find that the entropy of the root state is 1, which is, by the way, the highest possible value of entropy. Please note that I am using log base 2 here. Using this formula, we can calculate the entropy of the remaining four states. You can also notice that the state with four red points gives the minimum entropy; that's why we call it a pure node.

Now, to find the information gain corresponding to a split, we subtract the combined entropy of the child nodes from the entropy of the parent node. Also notice that I have written weights for the child nodes; a weight is nothing but the relative size of a child with respect to the parent. So, let's calculate the information gain for both splits.
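A minimal sketch of the two quantities just described, entropy and information gain, again not the video's own code: the function names and the point counts in the toy lists are made up for illustration, since the transcript does not state the exact node sizes; only the shapes of the two candidate splits (both children mixed versus one pure left child) mirror the example.

```python
# A minimal sketch of the two quantities above: entropy of a state and the
# information gain of a split. The point counts in the toy lists are made up
# for illustration; the video doesn't state the exact node sizes.
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)), where p_i is the probability of class i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["red"] * 6 + ["green"] * 6                    # equal classes -> entropy 1.0

# Split 1 (x1 <= 4): both children stay mixed.
split1 = [["red"] * 3 + ["green"] * 3, ["red"] * 3 + ["green"] * 3]
# Split 2 (x0 <= -12): the left child comes out pure.
split2 = [["red"] * 4, ["red"] * 2 + ["green"] * 6]

print(entropy(parent))                   # 1.0, the most impure a two-class state can be
print(information_gain(parent, split1))  # 0.0  -> nothing gained
print(information_gain(parent, split2))  # ~0.46 -> the better split
```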
Clearly, the second split gives a greater information gain, and that's why we would choose that one. Now, I have compared only two splits to show you how this works. In reality, the model compares every possible split and takes the one that maximizes information gain. So the model traverses every possible feature and feature value to search for the best feature and the corresponding threshold. It will become clearer when I show you the code; yes, in the next video, we will see how to code a decision tree from scratch.

Here, I want to tell you a very important point: the decision tree is a greedy algorithm. It selects the current best split, the one that maximizes information gain, and it does not backtrack and change a previous split. So all the following splits depend on the current one, and this does not guarantee that we will end up with the best possible set of splits. But greedy search makes training a lot faster, and it works really well despite its simplicity. Coming back to our tree, it now finds the best split on the right child, and so on. After this process is done, we will see our final model. Notice how at every level the impurity of the states decreases.

Okay, so by now you should have a good understanding of decision tree classifiers. In the next video, I will show you how to code such a tree from scratch that you can actually use on a real-world data set, and I will also show you another thing called the Gini index. So please subscribe and turn on the bell notification. I will see you in the next video. Thanks for watching. Bye.
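As a closing sketch, the exhaustive greedy search described above, trying every feature and every observed feature value as a threshold, keeping the split with the highest information gain, and recursing on the two halves, might look roughly like this. It reuses Node, predict, entropy, and information_gain from the earlier sketches; the stopping rule (stop when no split improves information gain, which covers pure nodes) and the tiny data set in the usage lines are assumptions rather than the video's actual code.

```python
# A rough sketch of the greedy search described above: try every feature and
# every observed value as a threshold, keep the split with the highest
# information gain, and recurse on the two halves. Reuses Node, predict,
# entropy and information_gain from the earlier sketches.
import numpy as np

def best_split(X, y):
    """Return (feature, threshold) with the highest information gain, or None."""
    best_gain, best = 0.0, None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):   # every observed value is a candidate
            mask = X[:, feature] <= threshold
            left, right = y[mask], y[~mask]
            if len(left) == 0 or len(right) == 0:    # not a real split
                continue
            gain = information_gain(list(y), [list(left), list(right)])
            if gain > best_gain:
                best_gain, best = gain, (feature, threshold)
    return best

def build_tree(X, y):
    """Greedily split until a node is pure (or no split improves information gain)."""
    split = best_split(X, y)
    if split is None:                                 # pure or unsplittable -> leaf
        values, counts = np.unique(y, return_counts=True)
        return Node(label=values[np.argmax(counts)])  # majority class
    feature, threshold = split
    mask = X[:, feature] <= threshold
    return Node(feature=feature, threshold=threshold,
                left=build_tree(X[mask], y[mask]),
                right=build_tree(X[~mask], y[~mask]))

# Tiny made-up data set (not the video's data) just to show the calls.
X = np.array([[-15.0, 2.0], [-14.0, 5.0], [3.0, 4.0], [4.0, 12.0], [15.0, 7.0], [12.0, 3.0]])
y = np.array(["red", "red", "green", "green", "red", "red"])
root = build_tree(X, y)
print(predict(root, [15, 7]))   # -> "red"
```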