Speaker 1: Welcome to Data Demystified. I'm Jeff Gallick, and this is my series of tutorial videos on how to use SPSS to work with data. In this video, I'm going to show you how to conduct and interpret a factor analysis. As always, we'll be using the YouTube Viewing Habits survey that I created, and you can find both a link to the data file and a video walkthrough of the data below. Factor analysis is our tool for reducing the number of dimensions we're dealing with before we conduct any subsequent analysis. For social science, this translates into things like multi-item scales, where we want to see if some of those questions hang together into underlying constructs, or factors. In particular, in this video I'm going to take a look at this Big 5 personality inventory. This is a 15-item scale that I'll make sure to link to below, and I have another video discussing how to compute the averages for each of those traits, which I'll also link to below. But rather than just assuming that these items actually represent what they're supposed to, and here under Labels I tell you that N is for neuroticism, E for extroversion, O for openness, A for agreeableness, and C for conscientiousness, we can let factor analysis determine whether the questions do in fact hang together. In other words, whether these three variables are correlated with one another, which we call convergent validity, and less correlated with the other variables, which we call discriminant validity. So right now I'm assuming that these three questions measure neuroticism and only neuroticism. But that is an assumption. Factor analysis allows us to reach that conclusion computationally, without any human intervention. So how do we do this? We go up to Analyze, Dimension Reduction, Factor. Now, there are quite a few things we have to do here. The very first, of course, is to put our variables of interest into the Variables window.
So down here, if we keep scrolling, we see Big 5 underscore 1 all the way through Big 5 underscore 15. And by the way, those Rs indicate that a question is reverse coded, meaning it's worded in the opposite direction from the other questions. Conveniently, factor analysis will take that into consideration as well. So let's move those over. And there are a bunch of options we're going to have to select. Under Descriptives, we check the KMO and Bartlett's test of sphericity and the anti-image correlation matrix. I'll explain what those are when we see the results, but basically, this is going to be our test of whether factor analysis is even a viable approach given these data. Under Extraction, we select the scree plot. This is a very common way to determine the number of factors, though I have to admit I find this approach completely unintuitive, and I prefer a separate approach using the eigenvalues that I'll describe when we see the results. But I'll show you the scree plot just so you understand what everybody is talking about when they mention it. Under Rotation, we ask for the varimax rotation, which is going to take our factors and try to separate them from one another as much as possible. Under Scores, we select Save as variables, which will create new variables representing each of the factors determined by this factor analysis. And under Options, we sort by size; that's going to make reading the output table a whole lot easier. Making it even easier is suppressing small coefficients, and I like to set this to about 0.25. All that does is hide small values in our final output table, making it much easier to read. Again, we'll see what that looks like in the results. So having selected all those options, we can click OK and see what happens.
The first thing we have to do, actually, is make sure that factor analysis is appropriate in this case, and we have several measures to assess this. The first is right up here: the KMO, the Kaiser-Meyer-Olkin measure of sampling adequacy. This statistic estimates the proportion of variance among the variables that might be common variance. It can take a value between 0 and 1, and we're looking for values above 0.7. That tells us, roughly, that we have enough shared variance to allow us to conduct factor analysis. Next, we have Bartlett's test of sphericity, which tests the null hypothesis that the correlation matrix of these data is an identity matrix. That, of course, would be a problem if we want to find relationships between these variables, so what we're looking to do is reject that hypothesis. Seeing that the significance level here is below 0.05, we can reject it and conclude that we are not dealing with an identity matrix: there are some relationships among our variables. The last thing we want to do is look at the anti-image correlation matrix diagonals, those with the little a superscript. We see right here that the superscript translates to measures of sampling adequacy. What we're looking for is values above 0.5, which tells us that for that particular variable, we have enough variation and enough correlation with the other variables to include it in the factor analysis. In this example, all of those are above 0.5, so we're fine. Having satisfied all of the requirements to conduct factor analysis, we can scroll down and see what we actually get. We can skip past this communalities table, it's not all that useful here, and we come to this table of total variance explained.
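For readers who want to see the arithmetic, the two adequacy checks just described can be reproduced outside SPSS. Below is a minimal numpy sketch on simulated stand-in data (the item counts, sample size, and loadings are invented for illustration, not taken from the survey): Bartlett's chi-square statistic comes from the determinant of the correlation matrix, and the KMO compares squared correlations to squared partial correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 respondents, 6 survey items where items 0-2
# and items 3-5 each share a latent factor (a stand-in for the Big 5 items).
f1, f2 = rng.normal(size=(200, 1)), rng.normal(size=(200, 1))
X = np.hstack([f1 + 0.5 * rng.normal(size=(200, 3)),
               f2 + 0.5 * rng.normal(size=(200, 3))])

R = np.corrcoef(X, rowvar=False)
n, p = X.shape

# --- Bartlett's test of sphericity: H0 is "R is an identity matrix" ---
chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2   # SPSS reports the chi-square p-value on these df

# --- KMO measure of sampling adequacy ---
Rinv = np.linalg.inv(R)
scale = np.sqrt(np.outer(np.diag(Rinv), np.diag(Rinv)))
partial = -Rinv / scale                 # partial correlations between item pairs
np.fill_diagonal(partial, 0.0)
r2 = R ** 2
np.fill_diagonal(r2, 0.0)
kmo = r2.sum() / (r2.sum() + (partial ** 2).sum())
# Per-variable MSA: the anti-image correlation matrix diagonals from the output
msa = r2.sum(axis=0) / (r2.sum(axis=0) + (partial ** 2).sum(axis=0))
print(f"KMO = {kmo:.3f}, Bartlett chi2 = {chi2:.1f} on {df} df")
```

With a planted two-factor structure like this, the KMO comes out comfortably above the 0.5 floor and Bartlett's statistic is large, which is exactly the pattern the SPSS output showed.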
And what this table is saying is that if you allow SPSS to create some number of factors, say one factor, or two, or three, all the way up to 15 factors because we have 15 variables, how much of the variance in the original underlying data could we explain? Let's start at the top so we understand the intuition. If you allowed SPSS to pool all the variables into one single factor, we'd be able to explain 26.5% of all the underlying variation in all 15 of our variables. If you instead said, hey, factor analysis, figure out where these correlations exist, and allowed it to create two factors, we would now be able to explain an additional 13.2% of the variation, or about 39% in total. If you allowed it to make three groups based on the correlations in the data, now we get up to 52%. And as we go down this table, that's going to reach 100%, because of course if we allow SPSS to create 15 factors to explain 15 variables, we can explain all of the variation. So then the question is, where's the cutoff? How many factors should we include? A popular approach is to use this scree plot. The scree plot is just a plot of these eigenvalues right here, eigenvalue being a term from linear algebra. And the suggestion I've always seen is to find the elbow, meaning that this is an arm and somewhere along here is an elbow, and the answer is one factor to the left of that elbow. I find that to be super counterintuitive. Even just looking at this scree plot, where is the elbow? Is it right here? Or here? Or here? I have no idea how to read this, so I never use scree plots to determine the number of factors.
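If you prefer numbers to elbows, the quantities behind the scree plot and the total-variance-explained table are just the eigenvalues of the correlation matrix. Here's a sketch on hypothetical simulated data (three planted factors, nine items; none of these numbers come from the actual survey) showing how the eigenvalues, the percent-of-variance column, and the retention decision are computed:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in data: 300 respondents, 9 items built from 3 latent factors.
F = rng.normal(size=(300, 3))
load = np.kron(np.eye(3), np.ones((1, 3)))   # each latent factor drives 3 items
X = F @ load + 0.6 * rng.normal(size=(300, 9))

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # eigenvalues, largest first

# SPSS's default retention rule: keep factors with eigenvalue greater than 1
n_factors = int((eigvals > 1).sum())

# Each eigenvalue divided by the item count is that factor's share of variance
pct = eigvals / len(eigvals) * 100
cumulative = np.cumsum(pct)                      # the "Cumulative %" column
print(n_factors, round(cumulative[n_factors - 1], 1))
```

Because three factors were planted, three eigenvalues land well above 1 and the rest fall well below it, so the cutoff is unambiguous here in a way the scree plot's elbow often isn't.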
Instead, I use a very straightforward rule: once the eigenvalue, which is reported right here, drops below one, we stop including factors. Reading this, we very quickly see that there are five factors. And what's convenient is that the default SPSS uses to determine how many factors to extract is exactly the rule I just gave you. That's why these rows down here are blank: SPSS is not including those in our factor analysis. Instead it's saying we will have five factors, and five factors can explain 68% of the variation in all the underlying data. But what are those factors? To see that, we scroll down a little further. The first thing we see is this component matrix, but that's actually not that useful. Instead, we're going to use this rotated component matrix. If you recall, from our varimax rotation, this is the same as the component matrix, but with a rotation applied that differentiates the factors as much as possible. So how do we read these? These values right here are just correlations. Basically, SPSS created five new variables such that the correlations between them and the underlying questions we put into this factor analysis are whatever you see in this table, with the goal of identifying factors that are as different from one another as possible. So how do we read this? Well, factor 1, this new variable, is correlated 0.89, which is very high, with this question, worries a lot, which comes from our neuroticism subscale. It's also correlated 0.88 with this question, gets nervous easily, which also comes from our neuroticism subscale. And finally, it's negatively correlated with this question, remains calm in tense situations, which is also part of our neuroticism subscale but was reverse coded, meaning its implication runs in the opposite direction. That's why we see a negative correlation.
Now, the reason there are so many blanks here is because, if you recall from the options, I said to suppress any coefficients below 0.25. Admittedly, that's an arbitrary cutoff, but I've found it works pretty effectively to make this table a lot easier to read. If you didn't do that, you'd just see lots of little numbers all over the place, which aren't that useful. The way to think about this particular factor is that it is very related to these three questions, and these three questions are very related to one another, because they're all part of a single factor. And because I sorted the coefficients by size, it makes this nice natural waterfall where I can look to see which questions are related to which factors. So here's factor two, and if you note, these are all my extroversion questions. Here's factor three: these are all my openness questions. Here's factor four, which is mostly just these three variables right here, my conscientiousness questions. And here's factor five: these are my agreeableness questions. And remember, I didn't tell it to create these groupings. It just noticed that there were high correlations between these particular variables, and it created a new variable that was heavily correlated with those three questions and less correlated with the other ones. It's not that there's no correlation here; it's just very small. And the same is true for each of these other factors. So what we now have, if we go back to our data set, are five new variables: factors one, two, three, four, and five. And we can name these to make life a little bit easier. Factor one was neuroticism, so I'll just call that N for now. Factor two was extroversion, so I'll call that E. Factor three was openness, so I'll call that O. Factor four was conscientiousness, so I'll call that C. And factor five was agreeableness, so I'll call that A.
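For the curious, the varimax rotation itself is only a few lines of linear algebra. This is a sketch of the standard published algorithm on hypothetical simulated two-factor data, not a reproduction of SPSS's internal implementation; it also mimics the suppress-small-coefficients display option by blanking loadings below 0.25. A useful property to know: an orthogonal rotation never changes each item's communality, only how its loading is split across factors.

```python
import numpy as np

def varimax(L, tol=1e-8, max_iter=100):
    """Rotate a loadings matrix so each item loads strongly on few factors."""
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):   # criterion stopped improving
            break
        d = d_new
    return L @ R

rng = np.random.default_rng(2)
# Hypothetical 6-item, 2-factor example (a stand-in for the 15 Big 5 items).
F = rng.normal(size=(500, 2))
X = np.hstack([F[:, :1] + 0.5 * rng.normal(size=(500, 3)),
               F[:, 1:] + 0.5 * rng.normal(size=(500, 3))])
R = np.corrcoef(X, rowvar=False)
w, v = np.linalg.eigh(R)
order = np.argsort(w)[::-1][:2]
unrotated = v[:, order] * np.sqrt(w[order])   # principal-component loadings
rotated = varimax(unrotated)

# Mimic the "suppress small coefficients" option: blank anything under 0.25
display = np.where(np.abs(rotated) >= 0.25, np.round(rotated, 2), np.nan)
print(display)
```

After rotation, each simulated item shows one large loading and one near-zero loading, which is exactly the clean waterfall pattern described above.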
And to make things easier, I'm just going to get rid of these labels really quickly; we can just select them all. What I want to show you is why this is so useful in validating some of our scales. Remember, in a previous video I computed just the raw averages for these questions: the three that happen to be part of openness, the three that happen to be part of conscientiousness, and so on. And what I want to show you is the correlation between these and these. Remember, these over here were created by our factor analysis, with no human input as to which questions go with what. These were created by me, the human being, saying that there are three questions about openness, so let's pool them together. What I want to show you is how related these are: how well our scale actually works, and how good factor analysis is at identifying these subscales. So I'll use a correlation matrix, under Analyze, Correlate, Bivariate. I'll include all five of my manually computed subscales and my factor subscales, and we'll see the following. Openness is very correlated with this factor O created by the factor analysis, 0.96; it's basically the same variable. Conscientiousness is very correlated with C, 0.88, again as designed. Extroversion is very correlated with E, 0.98; again, basically the same variable. Agreeableness is very correlated with A, 0.95. And neuroticism is very correlated with N, 0.95 again. The point I'm trying to make is that factor analysis automatically does what we had done manually. It also verifies that the scale we used is valid in determining those five subscales. And the last thing I want to point out is: what are the actual values you're getting here? If I click into this, I get some numbers. Well, those are my factor scores.
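The comparison just described, saved factor scores versus hand-averaged subscales, can be sketched on simulated data. This assumes the regression method for computing scores, which is what SPSS's Save as variables option uses by default; the single-factor data and the 400-respondent sample here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical single-factor example: 400 respondents, 3 "neuroticism" items.
f = rng.normal(size=(400, 1))
X = f + 0.5 * rng.normal(size=(400, 3))

Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize the items
R = np.corrcoef(X, rowvar=False)
w, v = np.linalg.eigh(R)
L = v[:, [-1]] * np.sqrt(w[-1])                # loadings on the single factor

# Regression-method factor scores, rescaled so they read as z-scores:
scores = Z @ np.linalg.inv(R) @ L
scores = scores / scores.std(axis=0)           # mean 0, SD 1 per factor

# The hand-built subscale (a plain average of the standardized items)
manual = Z.mean(axis=1)
r = np.corrcoef(manual, scores[:, 0])[0, 1]
print(round(abs(r), 2))
```

Because all three items load on the same factor about equally, the factor score and the manual average are nearly the same variable, mirroring the 0.88 to 0.98 correlations seen in the SPSS output.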
What this is saying right here is that person number one, on the factor called N, which we know is neuroticism, is 1.46 standard deviations below the mean in terms of those questions. These are all represented as z-scores. The same person is 0.55 standard deviations above the mean in terms of extroversion, right around the mean in terms of openness, about half a standard deviation below the mean in terms of conscientiousness, and pretty close to the mean in terms of agreeableness. So for every person, we can identify the degree to which they deviate from the average on any of these new factors we created. And that's it. Factor analysis allows us to take things like a multi-item scale and reduce that dimensional space to something much more manageable and much more interesting that we can actually interpret. That's it for this video. I hope you found this useful, and if you have any questions, please comment below and I'll be sure to reply as quickly as I can. Aside from these tutorials, I'm on a mission to equip everyone with the information they need to thrive in our data-rich world. If you'd like to learn not just the mechanics of analysis, which these video tutorials focus on, but also the intuition behind the analysis you're performing, I strongly suggest you check out the other intuition-focused videos on this channel, where I take the jargon out of statistics and data science and help you build a deep, intuitive understanding of all the analysis you're performing. I'll put a link below to a playlist of the videos that focus on just this. Finally, please take a moment to like the video, subscribe to this channel, and click that little bell icon so that you don't miss out on any new content. Thanks for watching.