Visualizing Decision Trees in Scikit-Learn 0.21: New Functions and Tips

Explore two new Scikit-Learn 0.21 functions for visualizing decision trees: plot_tree, which draws the tree with matplotlib, and export_text, which needs no external libraries.

Visualize a decision tree two different ways

Added on 09/28/2024

Speaker 1: Tip number 24, this is one of my favorites so far. One of the reasons people use decision trees is their high interpretability, and to interpret them you have to visualize them, which is the point of this tip. There are two new functions in Scikit-Learn 0.21 for visualizing decision trees: plot_tree, which uses matplotlib instead of Graphviz, and export_text, which doesn't require any external libraries. So let's scroll down to the tree.

In previous versions of Scikit-Learn, if you wanted to visualize a tree, you had to use Graphviz. Graphviz was a pain to install, and even when you got it working, it was a bit of a pain to use. You would run some Python code, it would output a file, you would leave Python, go to the command line, convert it to a different file type, and then you could look at it. So plot_tree, which is what I'm using here, is better in two ways: it only uses matplotlib, and the tree appears directly in your notebook.

Let me briefly explain what you are looking at if you've never visualized a decision tree. Each box is a node. There are three internal nodes and four leaf nodes. An internal node is anywhere there was a split, so it tells you at the top the rule that was used to split that node. The way it works is: if the rule is true, you go left; if the rule is false, you go right. We see the Gini impurity before the split, the number of samples before the split, the classes of those samples before the split, and the majority class in that node. In this dataset, a sex of zero is male, so if male, you go left; if female, you go right.

Now, that split was chosen by the tree to maximize the decrease in impurity, meaning the goal of the split is to increase the node purity. So below it, in these boxes, you'll see the new Gini impurity, the new number of samples in that node, the class proportions within that node, and the majority class in that node.
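The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook from the video: it assumes the built-in iris dataset as a stand-in for the speaker's data, a tree limited to depth 2, and a non-interactive matplotlib backend so it also runs outside a notebook. It draws the tree with plot_tree, then reads the fitted tree's stored impurities to compare the root's Gini impurity with the sample-weighted impurity of its two children.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data (assumption): iris, not the dataset used in the video
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Draw the tree with matplotlib only; filled=True colors the nodes
fig, ax = plt.subplots(figsize=(8, 5))
annotations = plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
    ax=ax,
)

# Compare the root's Gini impurity with the weighted impurity of its children
tree = clf.tree_
left, right = tree.children_left[0], tree.children_right[0]
n = tree.n_node_samples
root_gini = tree.impurity[0]
weighted_child_gini = (
    n[left] * tree.impurity[left] + n[right] * tree.impurity[right]
) / n[0]
print(f"root gini: {root_gini:.3f}, weighted child gini: {weighted_child_gini:.3f}")
```

The quantity a split reduces is the sample-weighted average of the child impurities, which is why the comparison above weights each child by its number of samples.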
And you can see that the Gini impurity has decreased in both of the boxes below. Then these nodes split again, which is why they have another splitting rule to decide whether you go left or right here, and left or right here. Now, these leaf nodes are the same, except the tree has stopped growing, so there are no more splits, which is why no rules are listed in those nodes.

The color coding you're seeing in all of these nodes is based on the Gini impurity. Darker means more pure, which is ultimately what the tree is trying to achieve. This bottom right one is all white because it is evenly split between the classes. If you were to plot a regression tree, it would look very similar to this, except that you would see mean squared error instead of Gini impurity as the criterion for splitting. A final note is that this can only be used with a single tree, not an ensemble of trees like a random forest.

Finally, let's take a look at export_text at the bottom, also new in 0.21. This is the same tree as the one above, just visualized in a different way. It, of course, does not require matplotlib. It doesn't include nearly as much information, but it is much more compact.
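As a rough sketch of the text-based view, the snippet below fits a small tree and prints it with export_text. It assumes the built-in iris dataset as a stand-in for the data in the video, and the depth limit of 2 is likewise an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (assumption): iris, not the dataset used in the video
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns a plain string: one indented line per rule or leaf
report = export_text(clf, feature_names=iris.feature_names)
print(report)
```

Because the result is an ordinary string, it can be logged or written to a file without any plotting backend.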