Comprehensive Guide to Medical Data Visualization with Python Libraries

Join Landon Schlungen as he walks through creating medical data visualizations using matplotlib, seaborn, and pandas, covering key steps and functions.

Medical Data Visualizer FreeCodeCamp

Added on 09/08/2024

Speaker 1: Hello, I'm Landon Schlungen, and today we're going to go through the medical data visualizer. So just open that up, access the full project description, and start coding on repl.it. And here we go, we have it open. We have the markdown explaining the assignment. Basically, we have to use matplotlib, seaborn, and pandas to make a couple of graphs, and the graphs should look like the examples, figure 1 and figure 2. So if I go into the examples, figure 1 looks like this, where we have cardio of 0 and cardio of 1, and then there are different attributes for each: active, alcohol, cholesterol, glucose, overweight, and smoke. We have to format the data so that we can output a graph like this. And then figure 2 looks like this. There are some functions in seaborn and matplotlib that make this a lot easier to create.
So if we go into medical data visualizer, this is what we have right now. First thing we have to do is add an overweight column. Or actually, we have to import the data first, so let's do that. We import the data with read_csv, so pandas.read_csv, and we read medical_examination.csv. And that data looks like this. It has age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking, alcohol intake, physical activity, and presence or absence of cardiovascular disease.
To add the overweight column, they have a note here on how to do that. BMI is calculated by dividing the weight in kilograms by the square of the height in meters. The height is in centimeters right now, so that's why I had to divide it by 100 to make it into meters. Then we square that and divide the weight by it, and that gets the BMI. And then we have to check the value: if it's greater than 25, the person is overweight and gets a one. Otherwise they get a zero.
And to do this, we use a lambda function. We do .apply, and then lambda x, where x is the BMI value we just calculated for the overweight column. Basically, if x is greater than 25, then we do a one. Otherwise, we do a zero. And that's what this is wanting us to do. So that's great.
Next thing we have to do is normalize the data by making zero always good and one always bad. And we have to do this for cholesterol and glucose. So we do kind of the same thing for these two. We use these lambda functions: cholesterol.apply, zero if x equals one, otherwise one, and for glucose, if x equals one, then zero, otherwise one. It's actually the same for both of them.
And now we have to draw our cat plot. The cat plot is figure 1, so we have to draw something like this. But first, we have to change the data into a shape that it will like. So using pd.melt, we keep just the values from cholesterol, glucose, smoke, alcohol, active, and overweight. This is how to do that. We do pd.melt, where pd is just our pandas. We pass in the data frame as the first parameter, and id_vars equals cardio, which is the zero or one up here. And then value_vars is cholesterol, glucose, smoke, alcohol, active, and overweight, and all these values are zero or one as well. Smoking is binary, alcohol intake is binary. Glucose would have been one for normal, two for above normal, and three for well above normal, but we changed that right here, same as with cholesterol: if it equals one, make it zero. Otherwise, it's one. One is bad in this case, meaning above normal or well above normal, and zero is just normal.
Now we need to group and reformat the data to split it by cardio and show the counts of each feature. It says you will have to rename one of the columns for the cat plot to work correctly. So the way to do this is like so. I'm guessing this is the renamed column: we just add a total column that equals one, and then we group by cardio, variable, and value.
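The data preparation described so far (overweight flag, normalization, melt, and grouped counts) can be sketched roughly as follows. This is a minimal sketch, not the code shown in the video: the tiny data frame is made up so the example runs without `medical_examination.csv`, and the short column names `gluc` and `alco` are assumptions about how the dataset abbreviates glucose and alcohol.

```python
import pandas as pd

# Made-up sample rows standing in for medical_examination.csv.
# Column names "gluc" and "alco" are assumed, not confirmed by the transcript.
df = pd.DataFrame({
    "height": [170, 160, 180, 165],       # centimeters
    "weight": [80.0, 50.0, 70.0, 95.0],   # kilograms
    "cholesterol": [1, 2, 3, 1],
    "gluc": [1, 1, 2, 3],
    "smoke": [0, 1, 0, 0],
    "alco": [0, 0, 1, 0],
    "active": [1, 1, 0, 1],
    "cardio": [0, 1, 1, 0],
})

# BMI = weight in kg divided by the square of height in meters.
bmi = df["weight"] / (df["height"] / 100) ** 2
df["overweight"] = bmi.apply(lambda x: 1 if x > 25 else 0)

# Normalize so that 0 is always good and 1 is always bad.
df["cholesterol"] = df["cholesterol"].apply(lambda x: 0 if x == 1 else 1)
df["gluc"] = df["gluc"].apply(lambda x: 0 if x == 1 else 1)

# Long format: one row per (person, feature), keeping cardio as the id.
df_cat = pd.melt(
    df,
    id_vars=["cardio"],
    value_vars=["cholesterol", "gluc", "smoke", "alco", "active", "overweight"],
)

# Count how often each (cardio, variable, value) combination occurs.
df_cat["total"] = 1
df_cat = df_cat.groupby(["cardio", "variable", "value"], as_index=False).count()

print(df_cat)
```

With `as_index=False`, the three grouping keys stay as ordinary columns next to `total`, which is the shape a plotting call that refers to columns by name needs.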
We group by those because pd.melt makes a variable column and a value column in our data frame. Then it's as_index equals False, and then .count. It's kind of weird. This as_index thing, its default is True, but we changed it to False. The docs say, for aggregated output, return object with group labels as the index. Okay, so all we want is for cardio and the others to not be the index, something like that.
Now we need to draw it with sns.catplot, and sns is seaborn. If we look at seaborn.pydata.org, slash tutorial, these are all the graphs that we can make, and we're making one of these. So if we look for one, like right here, this is kind of what we're going to be doing, except a little bit different. We're just going to do sns.catplot, and then pass in our data, have an x and y, and have a hue and a kind of bar, probably. So let's take a look at what we do here. And this is what I did. We pass in the data as df_cat, the one that we made. x is variable, y is total. So if we look at this, total is on the y axis, variable is on the x axis, and then we have hue equaling value. What hue does is it changes the colors of the two bars based on the value. And zero would be always good, one would be always bad, I'm guessing, right? And then we have a kind of bar, so yep, bar chart, and col equals cardio. Cardio would be the two different graphs, so that must be what that does. I don't know a ton about this. These docs are kind of hard to read and to use to make sure everything is correct. But yeah, they just have this one example of it. Here they have a hue of class, and then it makes three different hues for the class.
We can also change the style of it. If we do sns.set_theme, we can style it to darkgrid, or we can leave it as it is. This is just the default. If it was darkgrid, then it would have a gray background. So I'm just gonna comment that out for now. And then let's try running it. Make sure there are no errors in here.
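The catplot call described above can be sketched like this. It is a minimal sketch under assumptions: the counts in `df_cat` are invented for illustration, the `Agg` backend is selected only so the example runs without a display, and `grid.fig` is used to reach the underlying matplotlib figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no window is needed

import pandas as pd
import seaborn as sns

# Invented counts in the long format produced by melt + groupby.
df_cat = pd.DataFrame({
    "cardio":   [0, 0, 0, 0, 1, 1, 1, 1],
    "variable": ["smoke", "smoke", "active", "active"] * 2,
    "value":    [0, 1, 0, 1] * 2,
    "total":    [30, 5, 10, 25, 20, 15, 18, 17],
})

# One bar chart per cardio value, bars colored by value (0 or 1).
grid = sns.catplot(
    data=df_cat,
    x="variable",
    y="total",
    hue="value",
    kind="bar",
    col="cardio",
)

# catplot returns a FacetGrid; the matplotlib figure hangs off it.
fig = grid.fig
fig.savefig("catplot.png")
```

Here `col="cardio"` is what splits the output into two side-by-side panels, one for each cardio value.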
It might take a while for it to update the package configuration. And then it should make a new file called catplot.png, and we should be able to open it and it should look like figure 1. All right, that took a few minutes, but we now have our catplot.png. And it looks like this, which is pretty much exactly what we wanted. It looks pretty much the same as figure 1. So that's cool.
Next thing we have to work on is this draw heat map. Clean the data, first of all. And to clean the data, let's see if it's in the markdown, actually. Yeah, it lists things like height is less than the 2.5th percentile. So this is what it means by clean the data: drop those rows, so the height stays within these quantiles. Same for height more than the 97.5th percentile, and weight has to be between the 2.5th and 97.5th percentile as well. So basically just take the center of the data. They kind of showed how to do this here. And we want to do this with height and weight and also with this ap_lo, ap_hi thing. So this is how we do that. df, and then we select these ones and these ones and these ones, combined using the ampersand, the and sign.
Calculate the correlation matrix. This is done with the corr function in pandas. They have an example here where they have a data frame with columns dogs and cats, and then they do a correlation, and their method is histogram intersection. And then it makes something like this. But we're actually going to use a different method. We're going to be using the Pearson method, which does a standard correlation coefficient. I don't really know what that means, but it works. So for the correlation, all we have to do is take our df_heat and then do .corr on it with method equals pearson. So that should work for our correlation. Now we have to generate a mask for the upper triangle.
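The cleaning and correlation steps above might look something like the following. This is a sketch with made-up numbers. The rule that keeps rows where `ap_lo <= ap_hi` is an assumption about what "this ap_lo, ap_hi thing" refers to, since the transcript does not spell it out.

```python
import pandas as pd

# Made-up measurements; row 4 has diastolic (ap_lo) above systolic (ap_hi).
df = pd.DataFrame({
    "height": [150, 155, 160, 165, 170, 175, 180, 185, 190, 195],
    "weight": [55, 50, 62, 70, 68, 80, 77, 90, 95, 85],
    "ap_hi":  [120, 130, 110, 140, 80, 125, 135, 115, 128, 122],
    "ap_lo":  [80, 85, 70, 90, 120, 82, 88, 75, 84, 79],
})

# Keep only the "center" of the data. Each condition is a boolean Series,
# and & combines them element-wise (the parentheses are required).
df_heat = df[
    (df["ap_lo"] <= df["ap_hi"])                       # assumed rule
    & (df["height"] >= df["height"].quantile(0.025))
    & (df["height"] <= df["height"].quantile(0.975))
    & (df["weight"] >= df["weight"].quantile(0.025))
    & (df["weight"] <= df["weight"].quantile(0.975))
]

# Pearson is the default method; it is spelled out here to match the text.
corr = df_heat.corr(method="pearson")

print(df_heat)
print(corr.round(2))
```

Pearson correlation measures how closely two columns follow a straight-line relationship, from -1 (perfectly opposite) through 0 (no linear relationship) to 1 (perfectly aligned), so every column correlates with itself at exactly 1.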
The markdown doesn't really give any hints for how to mask the upper triangle, but we actually use a NumPy function for this. It's called triu, numpy.triu, and this gives the upper triangle of an array. With tril, it would be the lower triangle of an array. Okay, so in their example the array is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12. It gets split between the zero here and the zero here and just keeps the upper triangle. And this negative 1 here must be where to start. It's usually zero, so that would be split along the main diagonal. But yeah, okay, so that's what that does. And we need to use it here. So we'll do numpy.triu, which means the triangle of the upper, and then we pass in our correlation matrix.
Now we need to set up the matplotlib figure. That's done with matplotlib, so plt. And we can actually go to matplotlib's docs as well. Just go to their tutorials on matplotlib, and they have some sample plots and stuff. But yeah, in the usage guide, this is how we set up a plot. We do plt.subplots, and it returns the figure and the axes. So this is what they want us to do. They want us to do plt.subplots. And we can even change the figure size right off the bat. So if we pass figsize, we can change it to 12, 12. So that'll be a square. It's 12 in height, 12 in width.
And now we need to draw the heat map with sns.heatmap. And this is actually kind of weird, I will admit. So if we go to sns, which is seaborn, and we look up their heat maps, this is what their heat map would look like if it had random numbers in it. But we can use some data. In this case, they take flights by month and year. So this is the seaborn heat map function, and we have to use it here. We pass in the correlation data, the matrix. We have a mask on it as well: mask equals mask. Looks like they have a pretty good example here. square is True. annot equals True. fmt equals .1f.
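The mask, figure setup, and heat map call described above can be sketched as follows. Several things here are assumptions for illustration: the small correlation matrix is invented, the mask is built from an array of `True` values rather than from the correlation matrix itself, and the `linewidths` and `shrink` values are placeholders because the transcript does not state them.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no window is needed

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# The NumPy docs example mentioned above: k=-1 also keeps the diagonal
# just below the main one, and everything under it becomes zero.
demo = np.triu(np.arange(1, 13).reshape(4, 3), k=-1)
print(demo)

# Invented correlation matrix standing in for df_heat.corr().
labels = ["a", "b", "c"]
corr = pd.DataFrame(
    [[1.0, 0.3, -0.2],
     [0.3, 1.0, 0.5],
     [-0.2, 0.5, 1.0]],
    index=labels,
    columns=labels,
)

# True marks cells to hide: the main diagonal and everything above it.
mask = np.triu(np.ones(corr.shape, dtype=bool))

fig, ax = plt.subplots(figsize=(12, 12))

sns.heatmap(
    corr,
    mask=mask,
    annot=True,                # write the value in each visible cell
    fmt=".1f",                 # one decimal place
    center=0.08,
    square=True,
    linewidths=1,              # placeholder value
    cbar_kws={"shrink": 0.5},  # placeholder value
    ax=ax,
)

fig.savefig("heatmap.png")
```

Passing `ax=ax` makes it explicit which figure seaborn draws on, so saving `fig` afterwards writes the heat map.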
The docs say fmt is the string formatting code to use when adding annotations, whatever that means. center is 0.08. And cbar_kws equals shrink. Kind of confusing, but it actually works. Go to figure 2, it should look like this. So I'm guessing fmt stands for format, and if we do .1f, a value shows up like 0.2, in that format with one decimal place. Oh, it has linewidths as well, which makes it spread out a little bit. It's like there's a line in between each cell. And then we just save the figure. Somehow it knows that we want the figure of the heat map. I don't know how it knows that, but it just does. We make a figure and then seaborn draws on that figure. And then we save the figure.
So let's run it, see what it gives us. Looks like it gave us our heat map. And it looks good. Looks pretty much the same as this one, except the colors are a little bit different, but it had all the right values in it. So it ran the four tests just fine.
Yeah, that's about it for this one. Not gonna lie, this project was pretty dang confusing, and there was a lot of stuff to go through. We had to go through seaborn, matplotlib's plots, numpy. Tons of stuff. seaborn has a ton of different plots that we can do. We can do line plots, bar plots, different color palettes, scatter plots, just a ton of stuff. numpy has a ton of different functions, and I guess we just had to use this one, which is a little weird. And with pandas, we had to use this correlation function. They didn't really give us any good hints in the markdown, so it would have been pretty hard to find on our own. But I think the purpose of this one was to just get you comfortable with using seaborn and matplotlib to plot different graphs. I was playing around with it a little bit as well. I tried writing this function, draw scatter plot.
And then I just put in my data frame, and then x and y of weight and height, a style, and a hue. And then we save the figure. So if I call this inside main and run it again, then it should give me another image, and that image should be called scatterplot.png. So yeah, there we go. Here's the scatter plot that I messed around with. We have the height and the weight, the different genders of people, and then whether they're active or not. So up here we have someone that is about 250 centimeters in height and weighs about 85 kilograms. It's pretty neat. And I'm pretty sure that was part of the data that they wanted us to clean. All these outliers are probably data that we cleaned off when we did the percentile filtering right here.
Yeah, that's it for this one. When we're done, all we have to do is take this REPL link and paste it into FreeCodeCamp. So go back to FreeCodeCamp, data analysis, medical data visualizer, and paste it in. Complete the challenge and go on. Next up, we have this time series visualizer, and it has a lot of the same concepts as the medical one did. Just a lot of making the same types of graphs. So yeah, I'll see you when we do that one. Bye.
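For reference, the scatter plot experiment mentioned above could be sketched like this. The function name `draw_scatter_plot` follows the spoken description, but the body is a guess: which column drives `hue` and which drives `style` is an assumption, as is the `gender` column name, and the rows are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no window is needed

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def draw_scatter_plot(df):
    """Plot height against weight, colored by gender, marked by activity."""
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.scatterplot(
        data=df,
        x="weight",
        y="height",
        hue="gender",    # assumed: color encodes gender
        style="active",  # assumed: marker shape encodes activity
        ax=ax,
    )
    fig.savefig("scatterplot.png")
    return fig


# Made-up rows, including one implausibly tall outlier.
df = pd.DataFrame({
    "weight": [60, 72, 85, 90, 55],
    "height": [165, 178, 250, 182, 158],
    "gender": [1, 2, 2, 2, 1],
    "active": [1, 0, 1, 1, 0],
})

fig = draw_scatter_plot(df)
```

An outlier such as the 250 cm row is the kind of point the percentile filter in the heat map step would remove.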