Introduction to SAS Statistics: Descriptive Analysis and Beyond
Join King Ivy as he guides you through SAS 9.4 for statistical analysis. Learn descriptive stats, ANOVA, regression, and more in this instructional series.
File
SAS Statistics - Descriptive Statistics (Module 01)
Added on 09/08/2024

Speaker 1: Hi everyone, my name is King Ivy and this is Introduction to SAS Statistics. In these lessons I'm going to be using Base SAS 9.4, 32-bit. I also teach computer system auditing and business analytics at the University of Waterloo. These are meant to be short instructional videos on how to apply stats, or how to use different procedures to answer the statistical questions you have and perform the type of analysis you want. This isn't a stats class. I'm not going to go over the theory and the different components. I'm going to point you to a couple of different resources that you can check out, but I'm under the assumption that you at least know the basics of stats.
In these videos I'm going to be covering six different topics. The first one is descriptive statistics, which is this video, and then we're going to follow that with ANOVA, which is analysis of variance. Then we're going to follow up with linear regression, logistic regression, model selection, and predictive models and scoring. If you haven't checked out my previous playlist, Introduction to SAS, I'm going to put a little card here so you can check it out. That's a good basis to help you better understand SAS programming, so that you can better do the statistics, understand the different procedures, and set up your data.
But enough rambling, let's go ahead and get started. So here I have SAS 9.4 open, hopefully that's not a surprise to anybody. Before we dive into the different procedures that we're going to be performing, let's better understand the data set that we're working with. So I'm going to go to the SASHELP library, and it's a data set called HEART, I believe. There you go.
This is based on the Framingham Heart Study, and you can see here it has a bunch of different variables and observations, including the status, the cause of death, cholesterol, weight status, height, all those different components. When we look at this data, we may have some ideas about different statistics that we want to run. So for example, weight. Is weight correlated with cholesterol? If you have a higher weight, does that correlate with having higher cholesterol? And that's really the purpose of descriptive statistics. One is to better understand your data, and then to develop some hypotheses based on your analysis before you go in and start doing your regression, logistic regression, building your models and doing the scoring, and really not getting anywhere because you've gone down the wrong path or you've analyzed the wrong contributing variables for your resulting variable.
But enough of that, let's go ahead and get started. So here I'm going to be using PROC MEANS, because PROC MEANS is a really useful way to develop some descriptives from your data. If you don't already know the syntax, go ahead and check out my previous videos where I cover PROC MEANS. In this case, we're going to be using the data set SASHELP, which is the library, then not CARS, HEART. This is where you usually tell SAS which statistics you want to use, but in this case we're just going to leave it so I can show you what the default looks like. And this is where you usually tell SAS which categorical variable you're going to use to split the data, and which numerical variable you want to analyze. But let's leave it for now just so you can see what the default is. Sometimes the default is good to look at.
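The default run described above can be sketched as follows. This is a minimal sketch, assuming the data set is SASHELP.HEART as shipped with SAS. With no statistic keywords and no VAR or CLASS statement, PROC MEANS reports its default statistics for every numeric variable.

```sas
/* Default PROC MEANS: no statistic keywords, no CLASS, no VAR.      */
/* Reports N, mean, standard deviation, minimum and maximum for each */
/* numeric variable in the data set.                                 */
proc means data=sashelp.heart;
run;
```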
So you can look here: the number of observations, the mean, the standard deviation, the minimum and the maximum, and you can see all these different useful stats. You can check that out and get a better understanding. You can see here for cholesterol, the mean is 227 and the standard deviation is 44. The standard deviation, hopefully everyone knows, is basically the variability around the mean across the different observations.
So let's go ahead and modify our components here. Let's add a title. Here we're going to do some descriptive statistics on cholesterol based on weight status. Okay, interesting. So here I'm going to throw in a couple of different statistics. I'm going to throw in N, which is the number of observations that are going to be used in the analysis, I'm going to use MEAN, I'm going to use the standard error, and I'm going to use CLM, which is basically our confidence limits for the mean. It's going to give us our 95% confidence limits. Then here I'm going to divide the data by weight status, and I'm going to be analyzing the variable cholesterol. There you go. It's a hard word to spell.
You can see the title up here, which is good, and the different weight statuses. And you'll see N Obs, and then N. N Obs basically means how many observations fell into this category, Normal, for weight status, but then you'll see there's another N, which is the 1430. The reason for that is that there's a bunch of missing values for cholesterol. Maybe they couldn't collect the information, maybe the data wasn't available, the machine was broken, whatever it is. PROC MEANS excludes those when it actually runs this analysis. That's important to know. You really need to know what these procedures do when they handle missing data and data errors, so you can interpret your results appropriately.
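The PROC MEANS step described above might look like the sketch below. The exact title text is an assumption based on the speaker's description, and the variable names Weight_Status and Cholesterol are taken from the SASHELP.HEART data set rather than spelled out in the video.

```sas
/* Title wording is assumed from the spoken description. */
title "Descriptive Statistics for Cholesterol by Weight Status";

/* N      = non-missing observations used in the analysis */
/* MEAN   = arithmetic mean                               */
/* STDERR = standard error of the mean                    */
/* CLM    = two-sided confidence limits for the mean      */
/*          (95% by default, since ALPHA=0.05)            */
proc means data=sashelp.heart n mean stderr clm;
    class Weight_Status;   /* categorical variable that splits the data */
    var Cholesterol;       /* numeric variable to analyze               */
run;

title;  /* clear the title so it does not carry over to later steps */
```

Adding the NMISS keyword to the statistic list would report the number of missing Cholesterol values in each group directly, instead of leaving it to be inferred from the gap between N Obs and N.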
This would be different if it set each of these 42 values to, for example, zero or 100 or whatever it is, versus actually excluding them. You can see the mean here, the standard error, and the confidence limits at 95% confidence. And you can see there's a pretty tight confidence interval around the mean. You can see that because the standard error is fairly small compared to the mean. You can also see up here, if you go to cholesterol, that the min was 96 and the maximum was 568, which tells us that there's a pretty big range in this data, which gives us a clue about our kurtosis value, which is likely going to be a positive number. So this is some of the interpretation that you can get as you build out your results. That way you have this expectation, not necessarily that you think you already have the answer, but you at least have an expectation so that you can catch errors when you're doing your analysis.
And you can see here the confidence limits for the mean. If you go to Underweight and look at the upper 95% limit, which is approximately the mean plus two standard errors, you can see it's 207 plus three times two, which gives you the 213. That's actually below the lower 95% limit of the mean for Normal. And you can see the upper limit for Normal is below the lower 95% limit for Overweight, which gives us an indication that maybe there's a correlation between weight status and cholesterol after all. These are some hypotheses that we can start testing later on.
Okay, perfect. Let's go ahead and move on to the next one. In this case, I'm going to put ODS GRAPHICS ON, which is going to allow us to actually output some of the plots that we're going to be producing. And I'm going to be throwing up a title.
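The "mean plus two standard errors" shortcut used above for the Underweight group can be written out as follows. This is a sketch using the rounded figures quoted in the video (mean of about 207, standard error of about 3). The symbols are introduced here for illustration and do not appear in the video.

```latex
% Confidence limits for the mean as reported by CLM, where \bar{x} is the
% sample mean, s the sample standard deviation, and n the number of
% non-missing observations:
\[
  \bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\cdot \mathrm{SE},
  \qquad \mathrm{SE} = \frac{s}{\sqrt{n}}
\]
% With \alpha = 0.05 and a reasonably large n, the t quantile is close to 2,
% so the upper limit for Underweight is approximately
\[
  207 + 2 \times 3 = 213 .
\]
```

Because the t quantile is slightly larger for small groups, the limits SAS prints can differ a little from this back-of-the-envelope figure.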
So in this case, let's call it Histogram and Probability Plot for Cholesterol by Weight Status. Within PROC UNIVARIATE, there's a whole bunch of other statements that you could have included as well, but I'm only showing you a few so you can get a taste of it. Here we're going to be using the procedure PROC UNIVARIATE, which is very powerful for descriptive statistics. We're going to be using the same data set. And we're going to be setting our CLASS to weight status. CLASS is always going to be a categorical variable, which basically means a variable whose values are groups or labels, so we can split the data based on that category. Then we're going to be using the VAR, no, not that, cholesterol. And, based on the title, we're going to be doing a histogram. Here we need to define what our numeric variable is, so in this case it's cholesterol. And I'm going to put the INSET. INSET is basically what other stats you want to throw on top of the plot. In this case, not slowness, skewness. And the kurtosis. Then I'm going to copy this because I want basically the same thing, except here I want PROBPLOT, which is going to be the probability plot. And let's just throw on ODS GRAPHICS OFF at the end.
I like to run my ODS GRAPHICS statement by itself first and make sure that's been captured in the log, which it has. Perfect. Let's go ahead and run this. You're going to see there's a whole bunch of analysis that's going to be run, which is again super interesting and a really good way to better understand your data. So let's go all the way to the top, where you see the title. That's why it's useful to have titles. You can see here, it's going to run some descriptive statistics up front. It's going to tell you in this case, this is weight status Normal.
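The PROC UNIVARIATE step described above can be sketched roughly as below. The statement order follows the spoken description. The variable names Weight_Status and Cholesterol come from the SASHELP.HEART data set, and the exact options the speaker typed are not visible in the transcript, so treat the details as assumptions.

```sas
ods graphics on;   /* enable ODS Graphics output for the plots below */

title "Histogram and Probability Plot for Cholesterol by Weight Status";

proc univariate data=sashelp.heart;
    class Weight_Status;          /* one set of output per weight status   */
    var Cholesterol;              /* numeric analysis variable             */

    histogram Cholesterol;        /* histogram for each class level        */
    inset skewness kurtosis;      /* overlay these stats on the histogram  */

    probplot Cholesterol;         /* probability plot (normal by default)  */
    inset skewness kurtosis;      /* same inset, applied to the prob. plot */
run;

title;             /* clear the title */
ods graphics off;
```

Each INSET statement applies to the plot statement that precedes it, which is why it appears twice.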
And it's going to tell you that your variable is cholesterol. You see your mean, the sum of your observations, your variance, your kurtosis, your skewness. And here, some data around the mean, median and mode. Obviously you want to know that because in an ideal, normally distributed variable, the mean, the mode and the median would all be the same. And they're roughly the same, which tells us the data is roughly symmetric around the center.
You can see here your standard deviation, your variance, your range and your interquartile range. A lot of times people rely on range, but range can be heavily impacted by outliers. If you have one huge outlier, it can produce a really large range. So I actually prefer looking at the interquartile range, which tells us the distance between the third quartile, which is 243 in this case, and the first quartile, which is 188, which in this case is 55. The range, on the other hand, tells you the difference between the max and the min. You can see here, in this quantile table, the 99th percentile is 342 and the 100th percentile is 568, which is really high, which tells us that there are some outliers on the higher end. You can also see some of the stats here. We're going to go over those in some of the future videos. And as well, the highest amounts. This is going to be the same thing for Overweight and Underweight, so I'm going to skip that for now.
Then here we have some histograms. You'll see our inset, which is basically the stats that we asked for. You can see the skewness, which is positive. You can see that because it has some higher amounts out here rather than over here. Normally you would have this normal curve that would tell you. And you can see here, the kurtosis is positive.
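The interquartile range quoted above for the Normal weight group works out as follows. The symbols Q1 and Q3 are introduced here for illustration.

```latex
% Interquartile range from the quartiles read off the PROC UNIVARIATE output:
\[
  \mathrm{IQR} = Q_3 - Q_1 = 243 - 188 = 55
\]
% The range, by contrast, is driven entirely by the two most extreme values:
\[
  \mathrm{Range} = x_{\max} - x_{\min}
\]
```

A single extreme value such as the 568 maximum changes the range but leaves the quartiles, and therefore the IQR, essentially untouched.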
And you can really tell, and we already knew this, because we saw in our interquartile range and our standard error that there was some indication that the data is concentrated pretty heavily around the mean, more so than in a normal distribution. And that's for Normal. And that's for Overweight. And you can see here, for Underweight, it follows more of a normal distribution. And you see here the probability plot, which plots the normal percentiles against the actual cholesterol. Ideally, it would be a linear relationship. You can see up here there are some outliers, which tells me there are some other contributing factors that we need to consider. And you can see over here the one for Underweight. So again, lots of useful descriptives, which tell us that there probably is a, not relationship, a correlation between weight and cholesterol.
Let's go and do one of my favorite graphics, which is a, not histogram, box and whisker plot. So here I'm going to use PROC SGPLOT. Then here we're going to be using VBOX, which basically says that it's going to be a vertical box and whisker plot. In this case, we're again analyzing cholesterol. And we're going to be using an option called CATEGORY, and we're going to be splitting it based on weight status again. Before we run that, I'm going to throw up a title here. I'll call this Box and Whisker Plot for Cholesterol by Weight Status. Perfect. And then let's run these components. Perfect.
So you can see here, the line inside the box represents the median, and the diamond marker is the mean. The lower end of the box is the first quartile, which is the 25th percentile. The upper end is the third quartile, which is the 75th percentile. So this tells you 50% of your data lies within this range. You can see up here as well.
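The box and whisker plot described above can be sketched as below, again assuming the Cholesterol and Weight_Status variables from SASHELP.HEART. The title text follows the speaker's wording.

```sas
title "Box and Whisker Plot for Cholesterol by Weight Status";

proc sgplot data=sashelp.heart;
    /* VBOX draws a vertical box plot of the analysis variable;      */
    /* CATEGORY= produces one box per level of the grouping variable */
    vbox Cholesterol / category=Weight_Status;
run;

title;  /* clear the title */
```

Swapping VBOX for HBOX would draw the same boxes horizontally, which can be easier to read when category labels are long.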
And as well, you'll see these whiskers, which is where the name comes from. By default they extend to the most extreme observations that are within one and a half times the interquartile range of the box. A good way to look at this is to say that anything outside of the whiskers is generally an outlier. That's not always the case, because in a data set this size there are always going to be some values beyond the whiskers. But this gives you an indication that maybe there are some data errors. You can see up here that for Normal, this amount is actually higher than the highest amount in Overweight, which maybe tells you a couple of things. One, maybe you have a data entry error. Maybe this should be 200 something. Or two, there are other explanatory variables that explain cholesterol beyond just your weight and your weight status, which is actually probably the case. So those are some things that we'll need to pull in and consider as we go ahead and do our analysis.
So I'm going to leave that there. There's obviously a lot more that you can do around descriptive statistics, but I'm going to leave it there so you can go and explore and try things out on your own. If you have any comments or questions, feel free to leave them in the comment section below. And I look forward to speaking to you next time. Thank you.
