Intro to Decision Trees in R: Classifying Titanic Data with rpart and rpart.plot
Learn to classify Titanic passengers' survival using decision trees in R. This video covers data import, train-test split, and visualization with rpart and rpart.plot.
File
Decision Trees in R Rpart Library
Added on 09/29/2024

Speaker 1: Hello. In this video, I'm going to explain how decision trees can be used to classify two or more different classes in a data set. This is going to be an introductory video on implementing decision trees in R, and we will be using two packages. The first package is called rpart, and the next is rpart.plot. rpart is going to help us implement the decision tree algorithm, and rpart.plot is going to help us visualize the decision trees. Okay, so let's get started.
In this first video, I will be working on a familiar data set. We worked on this data set before: it is the Titanic data set, where we keep passenger information and their survival outcomes, whether they survived or not, depending on their properties. So let's first import the data. The data is going to be in CSV format, and therefore we have to use this first option, From Text. I keep my data here, and I open it. This is the preview. As you can see, it is separated by commas here, and the data frame output is going to look like this. So I'm happy with that. I will keep the name as titanic_data, and I'll hit import.
Okay, so in this data set, as a reminder, we have the class of our passengers: first class, second class, third class. I think we have a fourth one as well. Then the age category, then their sex, and the final, dependent variable is going to be this one, the survived column. It's going to include yeses and noes. So we are hoping that our decision tree algorithm is going to show us a really good path to predicting whether these passengers survived or not.
Okay. Before we start: the reason that we use a seed value here, by calling the set.seed function, is that whenever we run this code again, we would like to get the same randomly generated numbers in the same order.
That way, however many times we run the code, it gives us repeatability. I'm not sure whether this is clear; basically, set.seed is going to give us the random numbers generated in the same order again, and again, and again, so that next time you run this code, you will not see any surprises. Okay. The reason that we use it is that we select our train and test data based on randomly generated numbers, right? And we want these randomly generated numbers to come out in the same order next time we run it. That's why we use a seed value. The number inside it is irrelevant. You can use any number you want, and for the sake of simplicity, I used one, two, three, four.
All right, let's get started. We imported the data under the name titanic_data. We set our seed value to one, two, three, four. Now let's split the data into train and test by executing the code in lines seven, eight, and nine. What this does is, with 70% probability it generates a one, and with 30% probability it generates a two, right? A generated one indicates that this specific row is going to be in the train data set, and a two indicates that the row is going to be included in the test data set. Okay. And we do say replace equals TRUE. This is going to do the selection with replacement. So you might see the same rows popping up in your train and test data more than once, because we said so.
All right. So we generated our train data and test data. In our train data, we have 1,579 observations. I think we haven't run the test yet; it hasn't been generated yet. Now let's generate that too. And in the test data, we have 623 observations. So roughly it is 30% and 70%. All right. So at this point, we are actually ready to put together our decision tree algorithm.
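The seeding and splitting steps described above can be sketched roughly as follows. This is a reconstruction, not the code shown on screen: the video's CSV file is not available here, so the sketch expands base R's built-in Titanic table into one row per passenger as a stand-in, and the names titanic_counts and split_label are hypothetical. It assumes the split is done with sample(), drawing a 1 or a 2 for each row. Under that assumption, replace = TRUE applies to the drawn labels, so each row is assigned to exactly one of the two sets. Row counts from this stand-in need not match the 1,579 and 623 quoted in the video.

```r
# Stand-in for the video's CSV (assumption): expand the built-in Titanic
# contingency table so that each passenger becomes one row.
titanic_counts <- as.data.frame(Titanic)
titanic_data <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  c("Class", "Age", "Sex", "Survived")
]
rownames(titanic_data) <- NULL

# Fix the random number stream so the split is the same on every run.
# The value itself is arbitrary; the video uses 1234.
set.seed(1234)

# Draw one label per row: 1 with probability 0.7, 2 with probability 0.3.
split_label <- sample(2, nrow(titanic_data), replace = TRUE, prob = c(0.7, 0.3))

train_data <- titanic_data[split_label == 1, ]  # rows labelled 1
test_data  <- titanic_data[split_label == 2, ]  # rows labelled 2

nrow(train_data)
nrow(test_data)
```

Running the block twice in a fresh session should give the same two row counts, which is the repeatability that set.seed is there for.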
So we will have to install the rpart package. I already installed it. If you have not installed it, please go ahead and execute line 13: pause the video, install it, and then come back. All right. After you come back, make sure that you run line 15 to actually load this already installed library into R. Okay. Let me run this. This is just telling R, hey, I want to use this library, so please make the functions under this library ready for my use. And if you would like to call for help, or you just want to take a quick glance at what kind of functions and features the rpart library offers, you can definitely do it here.
One important thing to realize here is the method feature. If your method is class, then the rpart library understands that you would like to run a classification tree. All right. That's very important. And if your method is anova, then rpart understands that you would like to run a decision tree, but your dependent variable is a continuous number, like in a regression, a value that you would like to predict. Okay. So class indicates that you would like to do a classification; you would like to classify your outcomes. And anova means that you are actually doing some sort of regression, but using decision trees. All right. That's an important feature to be aware of.
All right. So we use the rpart function. Nothing is different. Okay. The rpart function, and then we open the parentheses and we type our formula. Remember, we are trying to predict whether a person belongs to the survived category, yes or no, right? A person might have survived, or that person may not have survived. And that depends on their class, their age, and their sex. Okay. So we represented those on the right side of the tilde, and on the left side, we have the survived column.
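To make the difference between the two method values concrete, here is a small sketch. It is only an illustration: it uses the built-in iris and mtcars data sets, which are not part of the video, and the object names class_tree and reg_tree are hypothetical. The install line is commented out on the assumption that rpart is already installed.

```r
# install.packages("rpart")  # run once if the package is not installed yet
library(rpart)

# Categorical outcome (Species is a factor): ask for a classification tree.
class_tree <- rpart(Species ~ ., data = iris, method = "class")

# Continuous outcome (mpg is numeric): ask for a regression tree.
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# Each fitted object records which method was used to grow it.
class_tree$method
reg_tree$method
```

If method is left out, rpart tries to guess it from the type of the outcome column, so spelling it out mainly documents the intent.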
And the data that we are going to be using is the train data, because we would like to train the model. We would like to get the right parameters of the decision tree. And what kind of parameters does a decision tree have? It has the size of the tree. It has the number of branches and leaves. I'm not going to go into the details of those, but make sure you are aware of the fact that those are the parameters that are being optimized here. Okay. And then, as I said, method is class, because this indicates that under the survived column, we have two different outcomes.
Okay. Let's run this. What this does is create a decision tree object here. You can take a look at this object, or you can display it here on the console. It tells you that this is a decision tree. It has 1,579 observations. It has a root node. And these are the percentages of the yes and no categories. Sorry, the no and yes categories. And this is the leaf node coming out of root one. And then here, after root three, we have other child nodes, and so on and so forth. This is not really an easy way to look at decision trees. Yeah, we are all aware of that. This bunch of numbers may not really tell us the information that we are after.
Therefore, we have a nice visualization package for decision trees, and it's called rpart.plot. At this point, I need you to go ahead and execute, this time, the line below that. There it is: install.packages, and inside, in quotation marks, rpart.plot. Make sure you install this package, pause the video, and come back and load the package by calling library rpart.plot. Okay. So after you've got your decision tree object, all you have to do is type rpart.plot and, in parentheses, the name of the object, the decision tree that you just created, and execute that. Okay. I'm glad that you executed again. As a reminder, if you're using a Mac, it is command and return.
If you're using a PC, it should be control and enter. Okay. So I'd like to zoom into this graph to tell you what's going on.
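Putting the fitting and plotting steps together, a minimal sketch might look like the following. As before, the built-in Titanic table stands in for the video's CSV file (an assumption), the column names come from that stand-in, and titanic_counts, split_label and titanic_tree are hypothetical names. The printed tree and the plot will therefore not necessarily match what is shown in the video.

```r
# install.packages(c("rpart", "rpart.plot"))  # run once if not installed
library(rpart)
library(rpart.plot)

# Stand-in data (assumption): one row per passenger from the built-in table.
titanic_counts <- as.data.frame(Titanic)
titanic_data <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  c("Class", "Age", "Sex", "Survived")
]

# Reproducible 70/30 split, as described earlier in the video.
set.seed(1234)
split_label <- sample(2, nrow(titanic_data), replace = TRUE, prob = c(0.7, 0.3))
train_data <- titanic_data[split_label == 1, ]

# Outcome on the left of the tilde, predictors on the right;
# method = "class" requests a classification tree.
titanic_tree <- rpart(Survived ~ Class + Age + Sex,
                      data = train_data, method = "class")

print(titanic_tree)       # text listing of the nodes, as shown on the console
rpart.plot(titanic_tree)  # the same tree drawn as a diagram
```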
