Mastering Data Visualization: Combining Raw Data with Summary Statistics in R

Learn to effectively combine raw data with summary statistics using R and ggplot. Enhance your data visualization skills by overlaying strip charts with box plots.

Using the ggplot2 R package to create a boxplot with individual data points overlaid (CC091)

Added on 09/08/2024

Speaker 1: If you're like me, whenever you try to visualize data, you quickly run into the problem of trying to represent way too much information in a single figure. That's why when I make visuals, I try to keep things as simple as possible while still conveying the information. But this problem of trying to convey information is a real challenge. We have more variables we'd like to show. We have more summary data that we'd like to show. Where does it end? Well, in today's episode, I'll show you a strategy for combining raw data with summary data. In the last episode, we talked about drawing a single horizontal line to indicate the median. Today, we'll show more of the distribution of the data by overlaying our strip chart with a box plot. Hey, folks, I'm Pat Schloss, and this is Code Club. As scientists, we think it's all about the data. Something I've really tried to hammer home in these episodes is that it's really about our audience. We have to have empathy for our audience. And this is something that's perhaps soft and squishy and something that isn't what we hardened scientists like to think about, just the data. No, we've got to give empathy. We have to help our audience understand what's going on in the data, what we see in the data. We have to help them to read our data in the visual. And so one of those tools is overlaying raw data with summary statistics. As I mentioned, we've seen this already with drawing a median line. And today what we'd like to do is show the interquartile range as well as some measure of what points are outliers in our distribution. We can achieve that by overlaying a box plot on top of a strip chart, also called a jitter plot. And that's exactly what we're going to talk about in today's episode. I'd love it if you could fire up RStudio alongside me as I go through today's episode. The code that I'm starting with, you can find at a link down below here in the description.
And up across the top is a video link where you can find instructions on installing R and RStudio, getting the tidyverse as well as the raw data that I'm using, as well as how to set up your project directory so that my code should work for you. Anyway, as a recap, we are loading in those different packages that come to us from the tidyverse package. Hold on to this set.seed, but we're reading in the metadata and the alpha diversity data, and then joining that all together. We're defining some colors as well as getting the number of individuals in each of three disease status groups. So the data that we're looking at comes from a study looking at the gut microbiota of individuals with and without C. difficile infections. We had a group of people that were healthy controls, a group of people that had diarrhea but not C. diff, and a group of people with diarrhea and C. diff. So to be tested for C. diff, you have to first have diarrhea. Those are the three groups. Anyway, we then made a jitter plot, which again shows columns of points for each of the three different disease status groups. And within each of those clouds, the position on the x axis is randomly determined. So that set.seed that I had up above pegs the random number generator. So when I run this multiple times, even though it's random on the x axis, the points will still fall in the same spot. We then have a bunch of other styling and we ran it. So let me go ahead and run all this to make sure it works and that I've got all my libraries and data loaded. So you can see my strip chart here. It's also called a jitter plot. I'll go back and forth. Anyway, I don't know what most people call it. Tell me down below in the comments. What do you call these plots, jitter plots or strip charts? The geom in ggplot is geom_jitter. So maybe we should just call it a jitter plot. Anyway, we also have here a horizontal line indicating where the median is for the data. That's only the 50th percentile.
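The recap above can be sketched roughly as follows. This is a minimal sketch, not the episode's script: the data are simulated stand-ins, and the column names disease_stat and invsimpson, the group labels, and the seed value are all assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in data: three groups with a skewed diversity measure
set.seed(1)  # re-running the whole script reproduces the data and the jitter
toy <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = c(rexp(30, 1 / 15), rexp(30, 1 / 10), rexp(30, 1 / 8)) + 1
)

strip_chart <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                               fill = disease_stat)) +
  # shape 21 is a filled circle with a separately colored border
  geom_jitter(shape = 21, color = "black", width = 0.25,
              show.legend = FALSE) +
  # giving y, ymin and ymax the same value collapses the crossbar
  # into a single horizontal line at the median
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", width = 0.6, show.legend = FALSE)

strip_chart
```

Because the seed is set before the plot is built, running the script again from the top should put every point in the same horizontal position.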
We'd also like to indicate the 25th and 75th percentiles, which are the hinges on the box of a box and whisker plot. All right. So the first thing I'll do is to comment out geom_jitter and stat_summary. I don't totally want to delete them because I want to work with them later in the episode. The geom we'll use to make the box plot is geom_boxplot. And I'll add that to the workflow. And so what we get then are these box plots. I've talked about this before, so I'm repeating myself, but that legend doesn't add anything because we have those x-axis labels corresponding to the color in each column. So I don't think that legend, like I said, adds anything at all. So let's remove it. Great. We have a box and whisker plot, and we have more real estate because that legend is gone. And let's break down what we're seeing here in the output. So first of all, we have this horizontal line that is thicker. That indicates the median of the observations of the inverse Simpson index for each of our three disease status groups. We then have a rectangle, the box, and the bottom is the 25th percentile. So 25 percent of the data falls at or below this line. The 75th percentile is the upper border of that box. Sometimes they're called hinges. I don't know. So 75 percent of the data falls at or below this. And of course, the median is the 50th percentile. Okay. So the distance between the 25th and the 75th percentile is the interquartile range, the IQR. This line then, the whisker, people often are like, what is that whisker? Shouldn't it go to the maximum point or the minimum point? No, the whisker extends at most 1.5 times the length of the IQR from the edge of the box, and it stops at the last data point within that distance. Right. So this whisker isn't as long as this because we had a point right here at about like four. Right. So the whiskers are not symmetrical. They only go as far as the data go.
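The whisker rule can be worked through by hand in base R. This is a sketch on made-up numbers, using the default quantile() definition, which is my understanding of what ggplot2's box plot statistic uses as well; the vector x is hypothetical.

```r
# Hypothetical values for one group; 30 is an extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

hinges <- quantile(x, c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
iqr <- diff(hinges)                                   # interquartile range

# The fences sit 1.5 * IQR beyond each hinge
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

# The whiskers stop at the most extreme observations inside the fences
inside <- x >= lower_fence & x <= upper_fence
whiskers <- range(x[inside])

# Observations outside the fences are drawn as individual points
outliers <- x[!inside]

print(hinges)    # 3.25 7.75
print(whiskers)  # 1 9
print(outliers)  # 30
```

Here the upper fence is at 14.5, but the upper whisker only reaches 9 because that is the last observation below the fence, which is why whiskers are usually not symmetrical.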
And if there are data beyond one and a half times the IQR, then we see those as individual points. So most people don't know that, I find. And it's a bit confusing what that represents. So a question for you to answer down below in the comments. What do you think or did you think that that whisker represented? And if this is something new to you, that it's 1.5 times the IQR, go ahead and hit that thumbs up button so that I know you're getting something out of this video. All right. So that's one thing to note. And again, when we're trying to have empathy for our audience, we want to make sure they know what they're looking at. These points then are kind of like outliers. They're not outliers in the sense that they're bad data or something wrong happened in the analysis. But they have extreme values and they are definitely part of the analysis that we want to keep. A problem with this depiction of the data, though, is I don't know if this is one point or two points or three points, if the points are on top of each other or not. There's no jitter. Right. We talked about jittering in the last episode. So one benefit of adding the raw data to this depiction of the data would be that we could see if that was a single point or multiple points. Another thing we could do is to turn the alpha down to say like 0.5. So let's go ahead and do that. So we can do alpha = 0.5. Run that. And then we see that we've got two points here that are a lighter gray, more transparent. And this point here is darker. Right. And this point here is also darker, indicating that there are perhaps two points on top of each other. And so because the outliers are not jittered, we see them on top of each other. Something else you may have noticed about this plot is that my blue and red got a more muted color. And that's because the fill color has also had an alpha applied to it.
So if I only want to set the alpha for the outliers, I could say outlier.alpha = 0.5. And now my fill is solid. But my outliers then have that lighter shading. I don't totally like doing that with the alpha because I'll want to see the data. Ultimately, I want to see how many points are there. So I'm going to go ahead and remove that outlier.alpha. But that's important to remember, that there are things we can do to change the way those outliers appear. So what we will do now is to add back this geom_jitter line. And so we're going to leave all the same styling that we had when we first started this episode. What we see now are our data, the raw data, stacked on top of our box plots. And so we see that again, the widths are different. We can modify that. Also, you know, I might go back and change the fill color of my boxes so that they contrast more. Right. And thank goodness I drew that black line around my circles. Otherwise, all the points in here would just kind of fade into the background of the box plot. The other thing I notice is that my outlier points for my box plot are still here. Thankfully, they're a different color. But the other thing I notice is that they're at the same y level. So this is this and this is this. Right. And so I need to get rid of those outlier points from the box plot, so I'm not kind of like double representing the data. So let's take that on first. And what we can do in our box plot would be to do outlier.shape = NA. Running that, we now get rid of those outlier points. Right. And so we no longer have that double representation of those outlier points. They're gone. OK, let's go ahead and change the fill color to not be so full. We'll make that 50 percent transparent. We'll do like alpha = 0.5. So that, again, gives it a more muted background and allows the data to really pop out more.
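Put together, the overlay at this point in the episode might look something like the sketch below. The data are simulated and the names toy, disease_stat, and invsimpson are hypothetical; only the geom_boxplot and geom_jitter arguments come from the discussion above.

```r
library(ggplot2)

# Hypothetical stand-in data, not the study's data
set.seed(1)
toy <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = c(rexp(30, 1 / 15), rexp(30, 1 / 10), rexp(30, 1 / 8)) + 1
)

overlay <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                           fill = disease_stat)) +
  # outlier.shape = NA stops the box plot from drawing its own outlier
  # points, so each observation appears only once (in the jitter layer);
  # alpha = 0.5 makes the box fill half transparent
  geom_boxplot(outlier.shape = NA, alpha = 0.5, show.legend = FALSE) +
  # the individual observations, added after the boxes so they sit on top
  geom_jitter(shape = 21, color = "black", width = 0.25,
              show.legend = FALSE)

overlay
```

Layer order matters here: adding geom_jitter after geom_boxplot keeps the points in front of the boxes, and the black border from shape 21 keeps them from fading into the fill.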
You could probably make it more muted if you want, if we did like 0.25. So that's cool. Let's make the box the same width as the data. And I'm going to guess that it's the same width here, so 0.6 is what we want to use. So we'll do width = 0.6, stealing that from our stat_summary from before. And so that looks good. That's about the same width as the actual cloud of points. It's maybe a little bit wider, but I think that looks pretty good. And again, it allows the data to really pop in front of the box. If you were to change the color aesthetic instead of the fill aesthetic, the color would change the color of the lines in the box plot. So let me show you what that looks like real quick. We're not going to use it, but we can also do color = disease_stat. And so then you can see that, again, the border colors are different. Anyway, let's get rid of that because I think I like the black border, both on my points as well as on the box plot. As I was mentioning earlier, this whisker that comes off the box is one and a half times the length of the IQR or to the last point. Right. And so, again, that causes confusion. And my sense is that most people in your audience will kind of automatically delete that line. They'll see it. They won't understand what it means. It might cause confusion. And so then they're like, I'm out. So we perhaps have three options. We could leave it and educate our audience about what it means. We could get rid of it, or we could perhaps extend it to the maximum value. Right. I think all three are reasonable choices to make depending on what you're trying to say. So how do we do that? Well, funny you should ask, because I actually know. So there's an argument we can give geom_boxplot, which is coef. And so this is the coefficient that you're multiplying by the IQR to get the length of that whisker. So the default is 1.5. So if I do 1.5, I should get the same output. And thankfully, I do.
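The three whisker options just listed map onto different values of coef. The sketch below uses made-up values and a hypothetical helper, whisker_ends, that pulls the computed whisker positions out of the built plot; the specific values 0 and 1000 are simply a small and a very large multiplier.

```r
library(ggplot2)

# Hypothetical single-group data with one extreme value
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30))

# Hypothetical helper: return the whisker ends ggplot2 computes for a coef
whisker_ends <- function(coef) {
  p <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot(coef = coef)
  unlist(layer_data(p, 1)[, c("ymin", "ymax")])
}

whisker_ends(1.5)   # default: stop at the last point within 1.5 * IQR
whisker_ends(0)     # whiskers collapse onto the edges of the box
whisker_ends(1000)  # whiskers reach the minimum and maximum
```

With coef = 0 the whiskers have zero length, so only the box is visible. With a very large coef nothing counts as an outlier, so the whiskers span the full range of the data.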
If I do coef = 0, I get rid of the whiskers. Right. And if I do coef with, let's just pick a big number like a thousand, then it should extend to the full range of our data, which it does. Right. And so I think that looks interesting. I don't know that the line extending to the min and max actually adds anything, because my eyes can see where the min and max are if I'm actually seeing the data. Right. So if you're showing the data with the box plot, I don't know that the whiskers really help. I think the box helps you to define where that, you know, 25th and 75th percentile is. I don't know that I really need the whiskers. So what I'll go ahead and do, for me, is make coef zero and get rid of those whiskers. And I think that looks pretty nice. It does look weird without the whiskers. But at the same time, if people don't know what that 1.5 IQR means, then it's not really helpful. Right. And frankly, I spend my time kind of like trying to visually multiply one and a half times the IQR to see if that actually works. And that's distracting. Right. I'm not spending my time interpreting the data the way that you, my presenter, want me to interpret the data. Anyway, so that's cool. All right. So, again, we have combined our raw data (well, it's not raw, but our individual patient data) with the summary statistics of that box plot. We've been on a kick of talking about stat_summary. And as I mentioned, at some point when I get a new tool, I love to use it wherever I can, because I'm just so excited to have a new tool. It's kind of like I've got a hammer and everything's a nail. So how might we go about creating this box plot without using geom_boxplot, but using stat_summary instead? Well, let's see. So I'm going to go ahead and for now comment out geom_boxplot, so we have that there for us to compare the syntax to. I'll go ahead and grab this stat_summary and uncomment that.
And instead of fun = median, which is what drew the median line, I'm going to do fun.data = median_hilow. And what fun.data expects is a function that will output a data frame of three values: y, the position on the y axis; ymin, the lower edge of something; and ymax, the upper edge of something like an error bar or a point range. So median_hilow will return the median as well as the 2.5th percentile and the 97.5th percentile. I actually want the 25th and 75th. So what I'll do then is fun.args = 50. And that should give me my 50th percentile, the 50 percent confidence interval. And I will then use crossbar like I was using before. That crossbar was a bit of a hack where if you give y, ymin, and ymax the same value, then you'll get a box that's just the median. OK. And let's go ahead and run that and see what this looks like. And oops, this shouldn't be 50, this should be 0.5. And sure enough, we get our rectangles, our boxes from our box and whisker plot. So one difference between this and what we got using geom_boxplot: again, the fill color had an alpha of 25 percent, I believe. We also don't need that size = 0.5. I think that's the default, as is this color black. So I will make alpha = 0.25. Give that a run and it should look the same. And sure enough, there we go. We've got our boxes and our points and we've achieved this using stat_summary. Now, which do I prefer? I'd probably prefer geom_boxplot, because if I did want to draw the whiskers, it'd be a lot easier to include the whiskers than it is here with geom_crossbar. I could add the whiskers, but it would take some more finagling. Also, I don't totally know how I would do that. And I know how to make the box plot. So the box plot works just fine. Anyway, again, we have options and flexibility. And that's nice.
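To make the fun.data contract concrete, here is a sketch that swaps in a small hand-written summary function in place of median_hilow (which, as far as I know, depends on the Hmisc package being installed). The function median_central and its coverage argument are hypothetical names, and the data are simulated; with coverage = 0.5 it should give the same 25th and 75th percentiles that median_hilow gives for a 0.5 interval.

```r
library(ggplot2)

# Hypothetical helper: the median plus the central `coverage` fraction of
# the data, returned in the y / ymin / ymax layout that fun.data expects
median_central <- function(x, coverage = 0.5) {
  probs <- c((1 - coverage) / 2, 0.5, (1 + coverage) / 2)
  q <- quantile(x, probs, names = FALSE)
  data.frame(y = q[2], ymin = q[1], ymax = q[3])
}

# Hypothetical stand-in data, not the study's data
set.seed(1)
toy <- data.frame(
  disease_stat = rep(c("healthy", "diarrhea", "case"), each = 30),
  invsimpson = c(rexp(30, 1 / 15), rexp(30, 1 / 10), rexp(30, 1 / 8)) + 1
)

box_from_summary <- ggplot(toy, aes(x = disease_stat, y = invsimpson,
                                    fill = disease_stat)) +
  # the crossbar draws a box from ymin to ymax with a line at y
  stat_summary(fun.data = median_central,
               fun.args = list(coverage = 0.5),
               geom = "crossbar", alpha = 0.25, width = 0.6,
               show.legend = FALSE) +
  geom_jitter(shape = 21, color = "black", width = 0.25,
              show.legend = FALSE)

box_from_summary
```

Passing the extra argument as a named list in fun.args keeps it clear which parameter is being set. Changing coverage to 0.95 would stretch the box to span the 2.5th to 97.5th percentiles.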
The nice thing about stat_summary, though, is that it's much easier to change the confidence interval. Right. So if I did want to go back to the 95 percent confidence interval, I could change that one number and get that expansion in the size of my box. Anyway, I'm going to stick with the 50 percent interval and we will be good to go. So looking at this new figure that we have, the difference between what we started with and what we have now is that not only do we have the median line, but we also have the top and bottom of the box to indicate the 25th and 75th percentiles. It adds more information. This adds the information of not only the central tendency of the data, the median, but also tells us something about the distribution of the data, and we can see that for healthy individuals, the IQR actually appears to be wider than it is for people with diarrhea who are C. diff negative. And, you know, that's even wider than people who have diarrhea and C. diff. Whether or not that difference in variation is meaningful, who knows? But it allows me to see in that box where 50 percent of the data lie. And I think that's good. I don't know that it's important. I'm not totally sold that I need to know the shape of the distribution to that level. I don't know that adding this ink to the amount of data I have is a proportional gain in information. But again, people see things differently and have different aesthetics and personal senses of style and how they like to present the data. You know, I would encourage you to play with this and experiment. And as I am showing you, as I create a new visual, I try to stop, think about what I like, what I don't like, see if I can make modifications to make it better. But then also realizing that there are kind of natural limitations to what we can do with these types of visuals and that there is no perfect visual. There will always be things that we can critique and wish that we could make better.
If you're not falling into that, and you're not seeing that you can always make something better, then I suspect you're probably not being critical enough of your own work. I would encourage you to always be thinking of how you can make your visual better. And also, how can you make yourself better? Anyway, go ahead and see if you, with your own data, can combine the raw data with the box and whisker plot. See if you like to have the whiskers or don't like having the whiskers. Be sure to ask your friends, maybe people at a lab meeting, ask them to tell you what your box and whisker plot is actually saying. Do they know what those whiskers actually represent? And I think it would be really illuminating. Let us know down below, if you ask people, what do they tell you those whiskers represent? I would love to know as well. I hope you've gotten a lot of value out of today's episode in how we can combine raw data, the data for the individuals, with the summary statistic representation using the box plot, using geom_boxplot, as well as using stat_summary. Please be sure to tell your friends about Code Club and the various ways that we've been working with different types of data visualization. I think this is also just hopefully making it perfectly clear that ggplot is an amazing tool to make all sorts of different visuals. It's amazingly flexible. And there's just so much there that I'm always learning more. And I hope at least you're learning as we go along too. Anyway, like I said, keep practicing and we'll see you next time for another episode.