Speaker 1: Learning statistics does not need to be difficult. Instead of bombarding you with complicated formulae and statistical theory, I'm going to walk you through a way of thinking that will enable you to address the most common statistical questions. When we look at sample data, for the most part we see two things. We see differences between groups, so men are taller than women, and we see relationships between variables, like taller people weigh more than shorter people. And the big question is, are those differences and those associations or relationships real? I'm going to talk you through what we mean by the term real.

Over the next few minutes, we're going to take a look at a very simple data set and see how, by looking at various combinations of variables and variable types, we can identify very specific differences between groups and very specific relationships between variables. And I'm going to walk you through when and how to use statistical tests and how to interpret your results.

Now, let's imagine that we have a research question about the height and the weight of people living in Ireland. Of course, we can't measure the height and the weight of the entire population, so instead we take a random sample of the population and we measure the weight and the height of that sample. We also collect some additional information, like gender and age group, from each of the people in our sample, and we arrange these data in a spreadsheet or data set with the various attributes in columns. These are called variables, and they will be the object of our inquiry.

Now, most data sets that you work with will contain two types of variables: categorical and numeric. Categorical variables, like gender, contain categories, as the name suggests. Think of them as groups or buckets that the data can be arranged into, in this case males and females. Numeric variables, like height, are numbers, as the name suggests, and can be arranged on a number line.

Now, to better understand our data and to make sense of it, we summarize it and we visualize it. In the case of categorical data, we can count up the number of observations in any given category and represent them in a table and on a bar chart. To summarize numeric data, we're firstly interested in the spread or the distribution of the data, so we might describe the range of the data and the interquartile range, and we could also include the standard deviation. To get a sense of the middle of the data, we use the median, which divides the data into two equal halves, and we use the mean, which is the average. The mean is probably the most commonly used summary value to represent this kind of data. We can visualize our data using a box plot, which is a visual representation of the range, the interquartile range, and the median. And of course, we can create a histogram, which gives us the shape of the data. So I hope you can see that this process of summarizing and visualizing the data takes it from being just numbers and words on a spreadsheet and turns it into something that is meaningful to us, something that we can get our heads around, something that we can think about.

Now, in this very simple data set, we've got two categorical and two numeric variables, and things start to get interesting when we start looking at combinations of variables. So, for example, we can take a look at a categorical and a numeric variable like gender and height.
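To make that summarizing and visualizing concrete, here is a minimal sketch in base R. The `people` data frame and all of its values are made up purely for illustration, not the video's actual sample; the last line previews the grouped summary described next.

```r
# Hypothetical sample data standing in for the Irish height/weight survey
people <- data.frame(
  gender    = c("male", "female", "female", "male", "female", "male"),
  age_group = c("18-34", "18-34", "35-54", "35-54", "55+", "55+"),
  height    = c(1.80, 1.65, 1.62, 1.75, 1.58, 1.71),  # meters
  weight    = c(82, 60, 58, 79, 55, 74)               # kilograms
)

# Categorical variable: count observations per category, as a table and a bar chart
table(people$gender)
barplot(table(people$gender), main = "Count by gender")

# Numeric variable: spread and middle
range(people$height)    # minimum and maximum
IQR(people$height)      # interquartile range
sd(people$height)       # standard deviation
median(people$height)   # divides the data into two equal halves
mean(people$height)     # the average

# Visualize the distribution
boxplot(people$height, main = "Height")  # range, interquartile range, median
hist(people$height, main = "Height")     # shape of the data

# Group by the categorical variable and summarize the numeric one
aggregate(height ~ gender, data = people, FUN = mean)
```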
And so we can group the data by gender, which is the categorical variable, and create a summary of the numeric variable, in this case height, separated out into those two groups. And looking at the summary, we can see that in our sample data, men are on average taller than women. What I want you to see here is that we've looked at a combination of a categorical and a numeric variable, but as you can imagine, there are other possible combinations of variables that we could have looked at. We could have looked at height and weight, which are both numeric. We could have looked at gender and age group, both categorical. And in each case, we might see either differences between groups or relationships between variables. And in each of these cases, there are specific statistical tests that we can apply to see if what we are seeing in the sample data has implications for what we think about the wider population. Can we infer anything? Is what we are seeing statistically significant?

So let's take a quick look at the five most important combinations of data that we have. We'll look at, firstly, what we might observe in our sample data given that sort of combination of data types, and secondly, what statistical tests we might apply to determine whether or not we can infer anything about the wider population. So we might look at a single categorical variable like gender, and we could do a one-sample proportion test. For two categorical variables, we would do a chi-squared test. For a single numeric variable, we do a t-test. If we have a categorical and a numeric variable, we do a t-test, or an analysis of variance, or ANOVA, if there are more than two categories in our categorical variable. And for two numeric variables, we do a correlation test. Now, I'm going to come back to each of these scenarios and each of these tests, so don't panic. At this point, what I want you to see is how the data can be divided up. And in just a few minutes, we're going to take each of these scenarios and work through exactly what questions you can ask, how you can apply statistical tests, and, importantly, how to interpret your results.

Now, before we carry on, I just want to say a big thank you to BioMed Central, or BMC, for sponsoring this video. BMC are a publishing company that publish open access journals, and that means that the full text of all of the papers published is available for free to anyone in the world. I'm the editor-in-chief of one of the journals that they publish, called Globalization and Health. I'm genuinely impressed with them as a company. I believe that they have integrity, and I honestly believe that they are making the world a better place. They have a portfolio of over 300 journals, so check them out at Biomedcentral.com. I'll put a link in the description below.

At this point, I want to say this: it's not good science to take a data set and just randomly stab around blindly, hoping to find something that's statistically significant. Before you interrogate the data, you start off by defining your question and your hypothesis, you define your null hypothesis, you identify the alpha value that you're going to use, and then you analyze the data. So let's look at what we can do with just one categorical variable like gender. We might ask the question, is there a difference in the number of men and women in the population?
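As a minimal sketch of that workflow in base R, here is the one-categorical-variable question just posed; the sample counts below are hypothetical numbers chosen purely for illustration.

```r
# Question: is there a difference in the number of men and women in the population?
# Null hypothesis: the population proportion of men is 0.5, i.e. no difference
alpha <- 0.05   # decided before the p-value is calculated

# Hypothetical sample: 280 men out of 500 people sampled
result <- prop.test(x = 280, n = 500, p = 0.5)   # one-sample proportion test

result$p.value          # probability of a difference this large (or larger) if the null were true
result$p.value < alpha  # if TRUE, reject the null: the observed difference is statistically significant
```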
Now, we could state that as a hypothesis, which is that there is a difference in the number of men and women in the population, and we could check to see whether or not we think that is the case. And when we look at our sample data, well, we do in fact see that there's a difference in the proportion of men and women. So should we get excited? Well, no, not yet. Remember, this is just sample data. We could have, by chance, selected a sample that just happened to show a difference. So let's consider the possibility that, in actual fact, there is no difference in the number of men and women in the population, and we call that our null hypothesis. And if that were true, how likely would it be, what are the chances, what is the probability, that we would see the difference that we have observed, or a greater difference for that matter? And if we can show that that probability is low, then we can have a degree of confidence that the null hypothesis is wrong, and we can reject it. But before we calculate this probability, which we're going to call our p-value, we must be clear about how small is small enough. Below what value of p would we reject the null? We must decide on that cutoff before we calculate the p-value, and we call that cutoff the alpha value. For the rest of the examples in this video, we're going to use an alpha value of 0.05, or 5%. So we've really got two scenarios: the null hypothesis, which is that there's no difference, and the alternative hypothesis, which is that there is a difference. The next step is to apply a statistical test, in this case a one-sample proportion test, and we generate a p-value. If the p-value is less than the alpha, then we can reject the null hypothesis and state that the difference that we observe is statistically significant.

If we add another categorical variable, in this case age group, we may have a research question like, does the proportion of males and females differ across these age groups? So our hypothesis is that the number of men and women that we observe is dependent on the age category that we look at. In other words, the proportions change, or depend on, or are dependent on, the age category. Now, we can collect our sample data, we look at it, and we can see that, yes, in fact, the proportions do change across the age groups. In other words, in our sample data, the proportions are dependent on age category. Now, is that due to chance? Well, let's test the idea that the proportions are all the same, or that they are independent of age category. That's our null hypothesis. Here we can conduct a chi-squared test, and that gives us a p-value, and if the p-value is less than the alpha, we can reject the null hypothesis and state that our observation is statistically significant.

If we want to look at just one numeric variable on its own, like height, then we don't have any groups to look for differences between, and we don't have another numeric variable to look for some sort of association or relationship with. So what questions can we ask? Well, we might have some theoretical value that we want to compare our data to. For example, in the case of average height, we might have some historic data, and we might wonder if the current population is significantly different from that historic data. So our question might be, is the average height different from a previously established height? Let's imagine that the previously established height was 1.4 meters.
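Here is a minimal sketch of the chi-squared test for the two-categorical-variables case described above, again in base R; the table of counts is invented for illustration.

```r
# Hypothetical counts of men and women in each age group (illustrative numbers only)
counts <- matrix(c(40, 35,   # 18-34: men, women
                   30, 45,   # 35-54: men, women
                   20, 30),  # 55+  : men, women
                 nrow = 3, byrow = TRUE,
                 dimnames = list(age_group = c("18-34", "35-54", "55+"),
                                 gender    = c("male", "female")))

alpha  <- 0.05
result <- chisq.test(counts)   # null hypothesis: gender and age group are independent

result$p.value          # probability of proportions this uneven (or more) if they really were independent
result$p.value < alpha  # if TRUE, reject the null: the proportions depend on age category
```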
We want to know if the average height in our current population is different from that. Our hypothesis is that there is a difference. Again, we collect some sample data, and we find that the average height is indeed different from the historic height. Is that statistically significant? Well, if there were no difference, what would the chances be that we would observe the difference that we do, or a greater difference? We conduct a t-test comparing the averages, and if the p-value is less than the alpha, then we can reject the null hypothesis and state that the observed difference is statistically significant.

Now let's consider a categorical and a numeric variable, and we may ask the question, is there a difference between the average height of men and women, in this case? Our hypothesis is that there is a difference. In our sample, we do observe a difference. Let's assume that there's no difference. We conduct a t-test, which gives us a p-value. If the p-value is less than the alpha, we reject the null, and we state that the observation is statistically significant. If we had a categorical variable with more than two categories, like age group, which has three categories, then instead of doing a t-test, we would do an analysis of variance, or ANOVA.

Now let's look at the last combination of variable types in this data set: two numeric variables, height and weight. Here we might start with the question, is there a relationship between height and weight? Our hypothesis is that there is a relationship. We collect sample data, we look at it, and voila, we do see some sort of relationship. Is it real? Well, let's assume that it's not. Let's assume that there's no correlation between the two variables. And if it weren't real, then what are the chances that we'd see the relationship that we do? Here we conduct a correlation test. Now, a correlation test is going to give us two things. Firstly, it's going to give us a correlation coefficient, which tells us something about the nature of the association between the two variables, and I'm going to talk about that in just a minute. But of course, it also gives us a p-value. And again, if the p-value is less than the alpha, we can reject the null hypothesis and state that the correlation that we see is statistically significant.

The correlation that we see can be represented by a number that we call the correlation coefficient, so let's talk about that for a second. The correlation coefficient is a number between negative 1 and 1, and it describes the relationship between two numeric variables. If, as the x variable gets larger, the y variable gets smaller, we say that they are negatively correlated. If they are perfectly negatively correlated, then the correlation coefficient will be negative 1. If there's no relationship between the two variables, then the correlation coefficient will be 0. And if there's a perfectly positive correlation, as x goes up, y goes up, then the correlation coefficient will be 1. And of course, you can have any value in between. And by the way, it doesn't matter which of your variables is on the x axis and which is on the y axis; the correlation coefficient will be the same.

Now, of course, we've only just been able to scratch the surface in terms of what there is to learn about statistical analysis. If you want to learn more, then go to learnmore365.com. I've got some courses there that you're going to love.
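And here is a sketch of the remaining tests in base R: the one-sample t-test against the 1.4-meter reference, the two-sample t-test, ANOVA, and the correlation test. The data are simulated and every value below is made up purely for illustration.

```r
set.seed(1)   # simulated, purely illustrative data
n         <- 100
gender    <- sample(c("male", "female"), n, replace = TRUE)
age_group <- sample(c("18-34", "35-54", "55+"), n, replace = TRUE)
height    <- ifelse(gender == "male", 1.78, 1.64) + rnorm(n, sd = 0.07)  # meters
weight    <- 25 * height^2 + rnorm(n, sd = 6)                            # kilograms, loosely tied to height

alpha <- 0.05

# One numeric variable vs a previously established value: one-sample t-test against 1.4 m
t.test(height, mu = 1.4)$p.value < alpha   # if TRUE, the average height differs from 1.4 m

# Categorical (two groups) + numeric: two-sample t-test
t.test(height ~ gender)$p.value < alpha    # if TRUE, average height differs between men and women

# Categorical (more than two groups) + numeric: one-way analysis of variance (ANOVA)
summary(aov(height ~ factor(age_group)))   # the Pr(>F) column is the p-value

# Two numeric variables: correlation test
result <- cor.test(height, weight)
result$estimate          # correlation coefficient, between -1 and 1
result$p.value < alpha   # if TRUE, the correlation is statistically significant
```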
And if you'd like to learn about R programming, a programming language that gets used for statistical analysis, it's free, it's very powerful, it's easy to use, it's absolutely fantastic, and I have a YouTube channel that focuses specifically on that. That's R Programming 101. I'll put a link in the description below, so go and check it out. Otherwise, please subscribe to this channel and hit the bell icon if you want to be notified of future videos. Leave your comments below and share this video with anyone that you think might find it useful. Until next time, take care.