Speaker 1: This is the HR data that we used in the second Python homework. You'll notice that when we bring it into JMP, it has a couple of categorical variables: sales and salary, because these are textual values, are categorical. JMP shows you how it recognized each data type with these little icons. Bar charts mark categorical variables, and when you see these little sideways rectangles, those show that JMP is treating a column as a continuous numeric variable. All it has done is look at the columns that contain numbers and make an assumption about what they are, and a number of those assumptions are wrong, because several of these columns are really categories. For example, left, which is the outcome variable, is yes or no, and it's coded 1 or 0. So we're going to go into Column Info and change it from continuous to nominal. Nominal means it's a categorical variable where each category has a name. It's fine for the data type to be numeric, but the way it relates to models is as a nominal variable, that is, a categorical variable. And we have several. We have promoted in the last five years, so let's change that to nominal, and you can see that as I select OK, the icon changes over to a bar chart. Have they had a work accident? You can also change data types over here: go right into Column Info and change it to nominal. The rest of these look continuous, and yes, they are. Okay, so now we've changed the data types over. Next we need to make our validation column, so I'm going to go to Analyze, then Predictive Modeling, then Make Validation Column. There's lots of data here, so we can do a 0.6/0.4 split. And as is my habit, I like to leave little memory joggers for myself, so I'll note that this is a 60% split for training. Then we'll pick a seed, say 5; I just picked something. We'll put our seed in here and fix the random seed. All right, now we have our training and validation data ready to go. Under the Analyze menu, we're going to select Predictive Modeling, and then Partition. Partition is JMP's name for classification and regression tree (CART) models, that is, decision tree models. We'll select that, and then put in the outcome variable, whether someone left the company or not, as the response. Our validation column goes in Validation, and everything else goes in as our X factors. When JMP comes up, it just gives us a place to start. The first thing I'm going to do is select Color Points. I'm also going to set things up in my JMP preferences, under the JMP menu, or on Windows I think it's under the File menu: go into Platforms, select the Partition platform, and make sure these options are set so it will give me my confusion matrix and my ROC curve, show the details, and include, on each split, the probabilities and counts and things like that. This is also shown in your homework. You want to set this up so you don't have to keep selecting these things every time you run a CART model. All right. We've got these, and it's showing that right now about 23% of our observations are people that left the company.
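If you want to mirror this setup in Python, here's a minimal sketch using pandas and scikit-learn. The file name HR_data.csv and the column names (sales, salary, left, and so on) are assumptions based on the HR dataset from the homework, not anything JMP produces:

```python
# A minimal sketch of the same setup in Python; the file name and the
# column names are assumed from the homework's HR dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

hr = pd.read_csv("HR_data.csv")  # hypothetical file name

# Mark numeric columns that are really categories as nominal,
# like changing Continuous -> Nominal in JMP's Column Info.
for col in ["left", "promotion_last_5years", "Work_accident", "sales", "salary"]:
    hr[col] = hr[col].astype("category")

# 60/40 training/validation split with a fixed seed, like
# Make Validation Column with a fixed random seed of 5.
train, valid = train_test_split(hr, train_size=0.6, random_state=5)

# Base rate of the outcome: roughly 23% of employees left.
print(hr["left"].value_counts(normalize=True))
```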
There's our base rate right there, which we calculated in the last video, but JMP just does it for us automatically. And you can see that it's ready for us to go. Now, there are several ways we can do this. If we hit Go, it will build the whole model, all the way down to where it starts overfitting, and then stop. But I want to walk us through it split by split, so I'm going to select Split, and it will do one split. In this very first split, it splits on satisfaction level: the most important variable for determining whether or not someone is going to stay at the company is how satisfied they are at work. You can see over here that it's done a pretty big split, and that people who are satisfied tend to stay more, and people who aren't satisfied tend to leave more. You can also see the counts on each side of the split. It also takes this satisfaction level and splits it up here, so you can see the proportions, which is very nice. We'll split again. The next split is years at company. Now we have almost perfection over here: just about 1% on this side, and almost 99% on the other. So we're approaching perfection, but we aren't there yet. And as you scroll down and look at the confusion matrix on the validation data, you can see how things are starting to shake out. As we split the branches of the tree further down, our misclassification rate gets lower and lower. Right now it's at 18%, and we're just a couple of splits in, so it's already doing a good job. It's also showing that our ROC curve is moving up, increasing the area under the curve. All right, so let's select Split again. Now it takes average monthly hours and splits at less than 217 versus greater than or equal to 217; these are people that work really, really long hours, and they tend to leave the company more. You can see the probabilities. These aren't pure, but this split in particular is doing a pretty good job. So, if you want to, you can go through one split at a time: you can see that with the addition of that split, the misclassification rate fell from about 18% to about 15%. We can continue split by split by split, or we can hit Go, and what Go does is build the model out, all the way down to the bottom. It's showing here that when it was done, it was able to get a lot of predicted zeros that really are zeros, and these are the people that left the company that we predicted would leave. We have some misclassifications, but we have about 97.5% correct classification, that is, a misclassification rate of about 2.5%, as shown here. So we're doing well. It also shows the tree in an abbreviated form, so you can see the tree and what's showing up in all of the splits, and it's done a lot of dividing. Now, one of the nice things about a decision tree is that you can actually see each split and what's on each side of it. What we have in JMP is a split history: as the number of splits increases, you can see that the validation data and the training data, which is in blue, started to diverge.
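For comparison, here's a rough scikit-learn analogue of growing the tree a split at a time and checking the validation confusion matrix and misclassification rate. It continues from the sketch above; DecisionTreeClassifier is a CART implementation, though its splits won't match JMP's exactly, and the feature names are assumptions:

```python
# Continues the sketch above. max_leaf_nodes grows the tree best-first,
# so n_splits + 1 leaves is roughly one press of JMP's Split button per step.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X_cols = ["satisfaction_level", "last_evaluation", "number_project",
          "average_monthly_hours", "years_at_company"]  # assumed column names
X_train, y_train = train[X_cols], train["left"].astype(int)
X_valid, y_valid = valid[X_cols], valid["left"].astype(int)

for n_splits in (1, 2, 3):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_splits + 1, random_state=5)
    tree.fit(X_train, y_train)
    pred = tree.predict(X_valid)
    miss = (pred != y_valid).mean()  # validation misclassification rate
    print(f"{n_splits} split(s): misclassification {miss:.3f}")
    print(confusion_matrix(y_valid, pred))
```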
So what happens is that the more splits there are, the higher the R-squared gets. Now, we usually don't use R-squared for categorical models, but JMP computes a sort of pseudo R-squared to help you know how you're doing. And you can see here that as the number of splits increased, the R-squared went up. In the training data, it just kept going up: the more splits, the higher the R-squared got. So the training model kept getting better and better, but the validation data hit a place where it stopped improving. Even though the training data is getting more accurate, the validation data is not, because overfitting means that as you throw unseen data at the model, it isn't actually getting better. You can do more splits, but they no longer improve the classification rate. So what JMP does is stop right here. All working classification and regression tree implementations have some form of algorithm that helps you know when to stop splitting, that is, when you're overfitting. And here we have 97.9% area under the curve, so this model has done very, very well: most of the area is under the curve. Later on, we'll look more at area under the curve.
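Here's a sketch of that split-history idea, continuing the Python example above: as the tree grows, training accuracy keeps climbing while validation accuracy plateaus, and roc_auc_score gives the area under the ROC curve on the validation data:

```python
# Continues the sketch above: watch train vs. validation accuracy diverge
# as splits accumulate (overfitting), and compute validation AUC.
from sklearn.metrics import roc_auc_score

for n_splits in (2, 5, 10, 20, 50, 100):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_splits + 1, random_state=5)
    tree.fit(X_train, y_train)
    acc_train = tree.score(X_train, y_train)  # keeps rising with more splits
    acc_valid = tree.score(X_valid, y_valid)  # levels off once we overfit
    auc = roc_auc_score(y_valid, tree.predict_proba(X_valid)[:, 1])
    print(f"{n_splits:>3} splits: train {acc_train:.3f}  "
          f"valid {acc_valid:.3f}  AUC {auc:.3f}")
```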