Mastering Hierarchical Cluster Analysis in SPSS: A Step-by-Step Guide (Full Transcript)

Learn how to conduct and interpret hierarchical cluster analysis using SPSS. Discover how to identify clusters in your data and enhance your analysis skills.

Speaker 1: Welcome to Data Demystified. I'm Jeff Galak and this is my series of tutorial videos on how to use SPSS to work with data. In this video, I'm going to show you how to conduct and interpret a hierarchical cluster analysis. As always, we'll be using the YouTube Viewing Habits survey that I created, and you can find both a link to the data file and a video tutorial of the data below.

Very often when we're dealing with data, we can compute central tendencies like means or medians. But sometimes what we want to know is whether, rather than every response being identical, there are groups of responses that are cohesive and stick together. And that's where some form of cluster analysis comes in. There are a variety of tools that we can use, and in this video, I'll focus on hierarchical cluster analysis. Now I'll admit, this is not my favorite form of cluster analysis, but it turns out to be really useful in one specific way. And that is providing us a good estimate of how many clusters our data has. And we'll see how it does that in just a moment.

So we're going to be using cluster analysis in this case to answer the following question. We have a number of dimensions right here, which describe how important each of these things is in determining whether somebody watches a YouTube video. Now it's possible that all of those are exactly the same for everybody, but more likely than not, there are groups of people who tend to respond differently on these dimensions, and we might be able to group those types of people. And we can do that with this type of cluster analysis.

And to run hierarchical cluster analysis, we go up to Analyze, Classify, Hierarchical Cluster Analysis. And there are a few things that we have to do here. Well, the very first is we have to put in the variables that we'll be analyzing, and that is all of our importance measures right here. So I'll put those into variables. Under Plots, critically, we need to select the dendrogram.
This is going to be a diagram that's going to help us understand the relationship between our responses and help us decide how many clusters there actually are in our data. So we'll click Continue.

Under Method, there are a variety of approaches to determining what clusters to use. If we click under the clustering method, we see quite a few. Some of the more common ones are between-groups linkage, as well as furthest neighbor or the centroid clustering technique. What I tend to prefer, though, is this Ward's method approach. One of the things that Ward's method does is it helps create equal-sized clusters. It's very possible that in our data, if we use another approach, we're going to find clusters where there's just a few responses in one group and lots and lots of responses in another. And practically speaking, that's not very useful. Ward's method attempts to create clusters that are more evenly sized. And so we'll select that.

Now, all of my data are coming from the same type of scale, so I don't need to worry about standardization. But if your data are coming from wildly different sources in terms of the range of data that you're dealing with, you might want to consider standardizing your data using something like z-scores. I don't need to do that, so I'll skip that option. And I'll click Continue.

Now, if this were the final step, and I was just running this cluster analysis and being done with it here, I could go to Save and save the cluster membership. And when this is all done, it'll categorize each response in our dataset as being a member of one cluster or another. Now, since I'm really going to be using this as an input to another cluster analysis technique, k-means, which is something I cover in the next video, I don't actually need to save the cluster membership. But I will just to show you what that looks like. And for this example, I'll say single solution, meaning just let's assume that there are exactly three clusters.
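
For readers following along outside SPSS, the same choices (Ward's method, optional z-scores, a saved three-cluster membership) can be sketched in Python with SciPy. This is only a rough analogue under stated assumptions: the `ratings` array is made-up stand-in data, not the survey from the video, and SciPy's merge heights are not guaranteed to match the coefficients SPSS reports.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

# Hypothetical stand-in for the survey: 60 respondents rating
# 4 importance items on a 1-7 scale (two made-up groups of 30).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[6, 6, 2, 2], scale=0.7, size=(30, 4))
group_b = rng.normal(loc=[2, 3, 6, 5], scale=0.7, size=(30, 4))
ratings = np.clip(np.vstack([group_a, group_b]), 1, 7)

# Optional: z-score each column when variables sit on very different
# scales. The video skips this because every item shares one scale;
# to use it, pass `standardized` to linkage() instead of `ratings`.
standardized = zscore(ratings, axis=0)

# Ward's method, clustering respondents (rows), not variables.
merges = linkage(ratings, method="ward")

# Analogue of "single solution, three clusters": one label per row.
membership = fcluster(merges, t=3, criterion="maxclust")
print(np.bincount(membership)[1:])  # number of respondents per cluster
```

The `membership` array plays the role of the saved cluster membership variable: one cluster number per respondent.
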
Now, I don't know that to be true just yet, but we're going to go ahead with it just to see what happens. We'll click Continue, and we'll click OK.

Now, the cluster analysis might take a moment to run, and a lot of information comes out. And what's important to note is a lot of these tables are actually going to be very large because we have 1,000 data points in our dataset. And I'm actually going to skip over some of them, including this agglomeration table and this vertical icicle chart, as I don't find them particularly useful. But what I am going to focus on is this dendrogram.

Now, one small trick in SPSS is this dendrogram is going to be as large as the dataset we have. And it's really hard to read it when it's in this format. So a quick tip for how to make this a little bit more readable is if we double click into it, we get this chart editor. And if we right-click and select the Properties window, under Chart Size, if we uncheck Maintain Aspect Ratio and set the height to something manageable like 8 inches, we can click Apply. And what that will do is basically smush our chart. You can see that here. So we can then exit out of it. And now this chart is a whole lot more readable than it was before.

And so this is how this chart works. We can start on the left over here next to what I'll call the origin. And what you'll have is every single response all the way down the list here. This is all 1,000 individuals that completed our survey. And it's a little bit hard to read because we smushed this chart together. But at the extreme, we could say that there are 1,000 clusters in our data. Now, that's not very useful because that doesn't actually group anything. But of course, we can assign each person their own cluster. At the other extreme, we can say that everyone is exactly the same. And if we were to go off this chart to the right over here, we'd just be basically computing, let's say, an average of everyone. But that's also not useful.
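
The two extremes described above can be checked numerically. The following is a minimal Python sketch using SciPy, again on made-up data (`ratings` is hypothetical, with 40 respondents rather than 1,000): the linkage matrix is roughly the counterpart of the agglomeration table, and cutting the tree at the far left or far right reproduces the "everyone alone" and "everyone together" cases.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical data: 40 respondents, 3 rating items, two made-up groups.
rng = np.random.default_rng(1)
ratings = np.vstack([
    rng.normal(loc=[6, 2, 5], scale=0.6, size=(20, 3)),
    rng.normal(loc=[2, 6, 3], scale=0.6, size=(20, 3)),
])
n = len(ratings)

# One row per merge: the two clusters joined, the height of the join,
# and the size of the newly formed cluster.
merges = linkage(ratings, method="ward")
print(merges.shape)  # (39, 4): n - 1 merges for n respondents

lowest, highest = merges[0, 2], merges[-1, 2]

# A cut to the left of the very first merge: every respondent is alone.
every_case_alone = fcluster(merges, t=lowest / 2, criterion="distance")

# A cut at the height of the final merge: everyone is in one cluster.
everyone_together = fcluster(merges, t=highest, criterion="distance")

print(len(set(every_case_alone)), len(set(everyone_together)))

# Order of respondents along the dendrogram axis, computed without drawing.
leaf_order = dendrogram(merges, no_plot=True)["leaves"]
```

Any useful solution sits between those two cuts, which is the decision the rest of the dendrogram discussion is about.
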
So what this dendrogram lets us do is decide how many clusters we're going to have. This is a hierarchical branching diagram where each of these branches denotes connections. And the closer those connections are to each other on this diagram, the more related they are to one another. So for example, right here is one cluster. And I know that because there's a branch right here that eventually then does split into smaller groups. But ultimately, this is some sort of sizable cluster right here. And this cluster is reasonably similar to this cluster here because they are positioned close to one another on this chart. That's in contrast to this cluster compared to, say, this cluster down here. They're very different from one another because they're actually quite far apart on this chart. And so as we move up this branching diagram, the clusters become larger and more heterogeneous, meaning that there's more variation in what is comprised within the cluster as we create clusters based on branches that are more to the right in our diagram.

Let me say that a bit differently. If I just pick two people that are right next to each other, they're probably very, very similar to one another. But as I expand what I consider to be a group, let's say I consider this one big cluster based on this one branch right here. Well, these people are, in fact, more similar to one another than, say, people in this group are. This group is going to be different from this group. But within this group, there's a lot more variation.

So there's a trade-off we have to make, which is to say, where do we make the cut? Do we decide to have lots of little clusters? Let's say here's a cluster, here's a cluster, here's one, another, another, another, another, and so on, where everyone in the cluster is very similar to one another, but now we're dealing with lots and lots of groups. Or do we choose really large clusters, let's say picking one right here and picking another right here?
Well, now we've got a few clusters, which is nice because it's a little bit easier for us to handle from a mental perspective, but those clusters are more heterogeneous themselves. And that's actually a subjective call.

And the way we do this is looking at how much our grouping would change if we made small deviations in where we drew a hypothetical vertical line running through this chart. So let's just pretend I draw a vertical line right here. If I move that line a little bit to the left or a little bit to the right, I still conclude that there are two general clusters. And how do I know that? Well, here's a cluster following this branching and encapsulating all of these people. And here's another cluster following this branching and encapsulating all of these people. So little deviations don't do much to change my solution, which is a good thing. On the other hand, if I drew my line, let's say right here, well, I would conclude three clusters. One cluster right here, another cluster right here, and a third cluster right here. But the problem is a slight move this way, and all of a sudden I generate many more cluster solutions. In other words, tiny variations in where I draw that line change my conclusion. And that's not great. We want solutions that are relatively stable to small variations in our judgment call.

And so if I were to look at this dendrogram, I would conclude firmly that there are two clusters in this grouping. One right here, and one right here. And if you really push me, I might say, well, maybe there's a third cluster as well. So one big one here, one here, and one here. But practically speaking, the most robust solution is to say that there's two groups of people, a two-cluster solution.
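
The "how far can I slide the vertical line before the answer changes" idea can be approximated numerically. The sketch below is one possible heuristic, not the procedure used in the video: it looks for the widest gap between consecutive merge heights, since any line drawn inside that gap gives the same number of clusters. The data and the name `stable_k` are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: 50 respondents in two made-up, well-separated groups.
rng = np.random.default_rng(2)
ratings = np.vstack([
    rng.normal(loc=[6, 6, 2], scale=0.5, size=(25, 3)),
    rng.normal(loc=[2, 2, 6], scale=0.5, size=(25, 3)),
])
n = len(ratings)

# Merge heights in ascending order (left to right on the dendrogram).
heights = linkage(ratings, method="ward")[:, 2]

# gaps[i] is the stretch between merge i and merge i + 1. A line drawn
# anywhere inside it comes after i + 1 merges, leaving n - (i + 1) clusters.
gaps = np.diff(heights)
widest = int(np.argmax(gaps))
stable_k = n - (widest + 1)

print(stable_k)
```

With two well-separated groups like these, the widest gap sits just before the final merge, so the sketch reports two clusters. On real survey data the gaps are rarely this clear-cut, and the call remains a judgment made while looking at the dendrogram.
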
So if I were then to take this to the next step and look at something like a k-means algorithm, which is going to be a little more robust in how it identifies those clusters, I would feed that algorithm the solution from this dendrogram, which is two clusters.

Now, if I didn't want to do that, and I wanted to rely solely on this particular hierarchical cluster analysis, if we go back to our data, we now have a new variable called CLU3_1. Now, what this is is a solution where there are exactly three clusters, because that's what I defined in my options. If I go to the data view, if I just double click on this, we see that each row, each response, is categorized as having been in one of three clusters. One, two, or three. And so now, if I wanted to go back and identify who those people are, I can do that using this new variable. But again, I don't find this particularly intuitive or useful. Instead, I'm going to take the next step and plug this solution of two clusters that we saw a moment ago into a k-means algorithm, which is the topic of the next video, and I'll make sure to link to that below.

That's it for this video. I hope you found this useful, and if you have any questions, please comment below, and I'll be sure to reply as quickly as I can. Aside from these tutorials, I'm on a mission to equip everyone with the information they need to thrive in our data-rich world. If you'd like to learn not just the mechanics of analysis, which these video tutorials focus on, but also learn the intuition behind the analysis you're performing, I strongly suggest you check out the other intuition-focused videos on this channel, where I take the jargon out of statistics and data science and help you build a deep, intuitive understanding behind all the analysis that you're performing. I'll put a link below to a playlist of the videos that focus on just this.
Finally, please take a moment to like the video, subscribe to this channel, and click that little bell icon so that you don't miss out on any new content that I put out. Thanks for watching.
