Speaker 1: What's going on, everybody? Welcome back to the Excel tutorial series. Today, we will be looking at how to clean data in Excel. Now, knowing how to clean data in Excel is actually extremely useful, and there are a ton of techniques to do this. I'm going to be showing you the ones that I probably use the most, and I feel like are the most helpful to kind of do the bulk or the majority of the data clean that you're going to do in Excel. Like I said, there's so many different ways and very specific things that you can do, but I'm going to highlight some of the bigger ones that I find the most useful. And some of you may be thinking, well, I'll just do my data cleaning in SQL or Python or when I get it ready to put it in Tableau. But honestly, a lot of the data cleaning, at least a lot of the big stuff, I tend to do in Excel if the data set is small enough to fit in Excel. And so I think it's actually really, really useful to know how to do this because you'll most likely be doing it more than you think. Now, before we jump into the tutorial, I want to give a shout out to the sponsor of this video and is a brand new sponsor. It is Unlocked by Z by HP. Unlocked is a movie that's actually broken up into four parts and each of them have a unique data science challenge associated with it. Now, I'm going to read this next part because it's extremely interesting. Each challenge represents a different topic. So there's data visualization, text analysis, audio signal processing, and computer vision. And you can submit your answers and your work on their website for a chance to win one of 10 ZBook Studio laptops or a free trip to the Kaggle World Championships. So I'll leave a link in the description where you can go watch the movie and then do the challenges and then submit your answers for a chance to win. You should also go check out their hackathon where you can do these projects with other people just like you who are trying to figure out these answers and submit them to win as well. So go check that out. Thank you again to the sponsor of this video, Unlocked by Z by HP. Now, without further ado, let's jump onto my screen and get started with the tutorial. All right, so let's jump right into it. I have this US president's data set. I got the base data set from Kaggle, but I added some of my own data and then I messed some stuff up as well just to kind of demonstrate some of these things that we're gonna be looking at today. This is not a full project. So, you know, we're actually gonna be using this to create new visualizations or anything like that. So, you know, all this is just for demonstration purposes. But we will be doing a full project in about two or three videos in this Excel series where we're gonna be doing from start to finish with a real data set. So, you know, if that's something that you're wanting, then we will absolutely be doing that. Now, something that you may be wondering is how do you actually identify what you need to clean in the data? What do you know to look for? Well, some of the obvious things are things like formatting and standardization. So things like, you know, this James Monroe is in all caps. That happens all the time with real data. And so, you know, you want to standardize that or this all lowercase. You want to standardize that. You want that all to be the same. There's also things like right here, where we have this wig and this wig with a bunch of random stuff after it. This happens all the time where it's not completely standardized. And you may even notice, you know, there are some spelling errors in here. And we'll kind of look through that in a little bit. And then, you know, there are things like additional spaces where there shouldn't be spaces. There are things like currencies that you need to be aware of if you were importing this into or going to be importing this into a SQL database. Things like currencies can be just a problem or be really unnecessary. It may actually cause more issues in the long run. So you may just want to, you know, take that to the base value. And then dates are always an issue. Always, always, always. So always look at your dates. Make sure they're formatted correctly. Make sure they're all the same. These are the types of things that right when I glance at this data set, these are things that I'm looking for. One other thing that is actually the first thing that we're going to start out with is you want to make sure that your data is not duplicated. Because if your data has duplicate data in it and you don't want that, it's not supposed to be there. There are some specific use cases where duplicated data is okay. You know, you want to get rid of that. And it's very easy to do in Excel. The first thing we're going to do, we're going to go to this data tab. We're going to go right over here and we're going to get, see if there's any duplicates in our data. So we're just going to go up to remove duplicates. It's going to automatically choose all of your columns to check against. So it's going to, from A all the way through I, it's going to see is the exact same data in all these rows. And if it is, it's going to get rid of it. And so we're going to click OK. And it did find one duplicate. And I'll show you that one real quick because, you know, it was right here. So Barack Obama was here twice. And then I'm going to hit control. I hit control Z to go back. I'm going to hit control Y to go forward. And it removed that row completely. Now, in this example, you may be able to spot that with your eye. But in a real data set where you have 10,000, 100,000 rows, there's absolutely no way you're going to see that. Or very, very unlikely that you are going to see that there's duplicated data in there. So just running a quick dedupe or removing of duplicates, that is really important to make sure that you have gotten rid of those things. So that's one of the first things that I do. We're going to go into a lot of these different columns. And I'm going to kind of show you different techniques or things that I do when I look at actual data. So I'm going to come right over here. I'm going to insert. And this is what I actually do. I usually create a separate column, especially when I'm working with this. Because I don't want to change this one. I don't want to go in here and say equals upper, equals proper, et cetera. There's a lot of different ways that you can change names. Or not a lot, but the main ones that you can change names. And all of them are completely okay. So for example, I'm going to hit equal upper. And I'm going to go like this and close my parentheses. So I selected this cell. I closed my parentheses. I hit enter. And I'm going to hit in the bottom right, I'm going to double click this. And it's going to apply it to all of them. It is completely okay to have your data like this. If you want it to be like that. If you want it to be all lower, you can do that. If you want it to be in proper case, you can do that. There are different uses for all of them. And honestly, as long as it's all the same, typically it's okay. But if, you know, for example, if you're selling this to like a third party company or something like that, they may have what they want for their ingestion process when they take your file in. If you send, you know, a weekly file or a monthly file, they may want it exactly how they want it. And you can change that to what they want. But as long as it's standardized for you, it's all the same for you, that is a good thing. So now we have all of these in the proper case. That's typically what I do, or I use upper. Those are the ones I use the most. I don't usually use lower. And if you go in here and you type in lower, you know, it changes it to all lower. I don't typically do that. And I'm gonna add, I'm gonna say president-fixed. And so now all of these names, all of these different uppercase and lowercase, these are all fixed. And it just makes it so much easier to read and you don't have different uppercase and lowercase issues. It's all the same. So I'm gonna keep that right there. If we move a little bit to the right, if you look at this prior, now this prior is a mess. It has stuff all over. And to be honest, this is not really something that I would probably be using. Like in a real data set, I would look at this column and I'd say, this is pretty useless. If I had a very specific use case for this data in this column, I might try to, you know, parse it out and do something. But I don't. This is a completely useless column to me. So I'm actually gonna skip this one. I'm going to go to this party one. And this party one to me, it looks pretty important because this is something that I know I can group by and I can create visualizations with and kind of break that out. And if you look right here, we're gonna add a filter. So now let's open up party and take a look. So if we look right here, we have Democratic, Democratic-Republican, Federalist, Nonpartisan, Republican, Republicans, Whig and Whig with a date and some information in the back of it and then some blanks. And it's really important when we're looking at these ones that we think we might group by that we have these properly grouped. So Republican and Republicans to me right off the bat looks like a spelling error. And so I'm just gonna deselect all. I'm gonna go to Republican, Republicans and it's literally Republican all the way down except for this last one. And to me, that's just something that I would update. So I would just go right here. I do that. If I didn't do that and then I try to create, let's say a pivot table on here, I'll have its own group of Republicans and it wouldn't be added to Republican. And maybe that's on purpose, but let's just presume that we know this data extremely well and that's not supposed to be like that, right? Again, that just comes back to knowing your data really well, understanding what it should look like. And we know that it should not be like that. So we're gonna fix that. The next thing that we're gonna fix, and as you can see, it got rid of it. Next thing we're gonna fix is this wig. That's just like an error. That's some issue on the data side. And we're just gonna fix that by updating it. And that's it. I would always be keeping a copy of this with the raw data somewhere else because this is presumably like a working document. This is not a, you know, you aren't saving over your original file. Let's just say that. And then let's take a look at these blanks real quick. Okay. So there are these rows right here that have nothing. I think we're okay. But if we see anything different, 47, 48. Okay. So yeah, it's just these ones right here that have no data in it. Anyways, it's just seeing it in the filter. So not an issue at all. So, okay, we're looking good. We've gone all the way over. We fixed this president. We skipped this one. We cleaned up this party. And I kept this one in here because I'm not exactly sure if that's a Democratic or Republican. So I'm gonna keep it its own thing. I'm not a huge history buff in that aspect. The next one right here is really easy. This is something that happens all the time, especially on actually most often it happens on numerical data. So like, you know, there'll be a number of 1,001 and then there'll be a space after it for absolutely no reason. And it happens all the time. It does happen like this as well, where you'll see this and all you gotta do is do trim and select the cell. I'm gonna close that parentheses and we're gonna apply that all the way down. What is so fantastic about the trim is that it's really intuitive and it knows basically everything it needs to do. For example, it gets rid of the spaces before. It gets rid of extra spaces in the middle and it'll get rid of extra spaces at the end, which you wouldn't be able to see, but they are there and they absolutely can cause issues. If you have spaces at the end that you cannot see, let's take this one for example. Like if I had spaces at the end, that can cause issues when you insert or put that into a database. That happens a lot with numbers. You know, when you're putting that into SQL, that can cause issues. And so you really, it is important to actually do that trim and you can do that on all of your columns or just ones that you know you're having issues with. But once you import that data into SQL, you will know if there's an issue or not when you actually try to start using it. So we're gonna say vice and we're gonna say fixed. Oops, there we go. This next one is one that you'll run into a lot. When you're working with numerical data, you will encounter so many different issues. One that I run into a lot is I've worked with a lot of cost data or pricing data. And when it's in an Excel, it sometimes comes in with these currencies, like a dollar sign, a pound sign, things like that. And when you put that into SQL, it just is a nuisance, right? You're not gonna be able to run, it's gonna go in as a text or it's gonna be like a string, right? Because it has that special character and you don't want that. You don't wanna have to then go in and then change things around. You just want to be able to start, you know, doing calculations on those numbers. So what you can do is sometimes it'll come in as a text, sometimes it'll come in as a currency, which I think this one's a currency. We are just gonna change that to be a number. And then we're gonna get rid of these, oops, and get rid of those. That, it doesn't look as pretty, but that is much more useful than actually having the currency on there with the decimals. This actually is so much easier when you wanna use it for almost anything because you're able to add and do things properly in other systems. In Excel, I think it does understand it, but, you know, that can cause issues. So there is how you do that. The next thing that we're gonna look at is these dates. And just notoriously, whenever I see a date field, I know there's gonna be an issue with it. It's very rare that I get a date field that is perfect. It just, it genuinely is a novelty when that happens. And most of the time it has to do with, let's say a date comes into Excel and it's in a text format or date comes into Excel and they're not the same. In this example, they are not the same. And we just want them to all be similar. They say date. If you look right here, it says date. It says date. It looks like it should be the same. But if we go like this, it all looks the same, right? There's no issues at all. If we were to try to use that, it may or may not be an issue. But we don't wanna leave that to chance later on if you're using this with Python or something like that. It can cause issues. Maybe not in SQL because it may see the underlying, what's in the underlying cell, not just what we see. But some systems won't. And so you want to make sure that they're all the same. And so what we were doing back here with, oops, with a party and we were looking at this filter and identifying the issues. I usually do that on date fields as well. And oftentimes, just for demonstration purposes, oftentimes I will get something like that. And then I'll come up here and I'll notice that there's this one random number. That happens all the time. All the time. And so you want to make sure that you look at these things and just do at least a quick glance. Not doing a deep dive into it. But all we're going to do is we're going to do both of these and we're going to do a short date and let's take a look and see if that fixed it. And so now they are all the same format and that is fantastic. That is exactly what we want. We're going to go back through here. We're going to get rid of these. Again, this is a working document. Oops, we need to, I'm going to do control shift down. Oops, let me go back up. Do control shift down and copy. And what I'm going to do right now is I'm actually going to copy. Let me do it right here. I'll show you. Sometimes I do this. It just depends. I'm going to go right here. I'm going to hit right click and I'm going to paste as a value, which means it's not going to take the calculation of the formula that I just did. It's going to actually paste it as that value. So we just replaced it right here. You can see up here. It says equals trim of G2. This now, now that I copied and pasted it over as a value, it got rid of that calculation and now it is actually a string. So we don't need this anymore. And I'll do the same thing over here as well. I'm going to control shift down copy and I just hit the right key. Or the left key, sorry. Now I'm going to right click and I'm going to do paste as a value. And again, it has this proper and now it doesn't have the proper. It's actually the value that was here. So that's really important to note. And we're going to get rid of that one. And so now what we have is already looking much better. Now, one of the last things I want to look at is deleting columns that we are not going to use. And this is why it's so important to keep a backup or the raw data, not in this file. Because if you start saving over this file and this is your raw file, that can mess up a lot of things. And that happens to me before and it's terrible. And then you have to request another file or you have to go back and bind it or something like that, it's terrible. So this is our working document. So we can mess with this and do whatever we want for our purposes. Now for us, I can already tell you that this prior is a bunch of nonsense and we do not need it. We're not going to use it for anything. And if we have, this is a very small data set. This only has like, let's say, one, two, three, four, five, six, seven, eight. We have like eight columns that we're kind of using that has data, eight or nine. Now that's a small data set. I've had ones with literally like hundreds and it has so many columns, so much data. And sometimes it's good to just trim it back to the things you know you're gonna use. This to me is absolutely useless. We're going to delete that. And then right over here, it's pretty redundant. It's just one number off. But if we scroll down just a little bit, it goes, it's basically just counts. It's a, you could even call it a unique identifier if you want, sure, why not? But we don't need both. So we're gonna get rid of this first one. And now we have more of the useful and relevant data rather than the stuff that we absolutely know that we are not gonna use. These date updated and date created, we may never use them. But we might. So it doesn't hurt to keep it on hand. Those other ones are ones that we are almost certain we will never use. Again, keep a backup just in case you need it. You can always go back and get it. So if you go back to what we started with and you look at what we have now, it is much cleaner. It's much more usable. And these are small, subtle changes, especially with this very small data set of only like 50 rows or 46 rows. But you're gonna be working with data sets that are thousands, tens of thousands, hundreds of thousands of rows. And you need to know how to kind of look at this data, standardize it, format it properly for what you're going to be using it for. If you're keeping it in Excel, there are different things that you may do than if you're putting it into a database or gonna be using Python to access it. So you need to kind of know your use case. But these are some things that I do all the time to kind of clean up the data before I use it for something, whether I'm creating pivot tables or I'm inserting it into, or I'm putting it into SQL. These are things I do all the time. And so hopefully that helps give you kind of an idea of some of the things that you should be looking for when you're actually cleaning data. And it's really important to understand why you're actually making these changes and the reason you're making these changes because some of the things that I did today may not be things you wanna do on a different data set that has different uses and different purposes for. So take everything that I've said and apply it with a little grain of salt to your data set because your specific needs may be different than what I wanted when I was cleaning my data set. So I hope this was helpful. I hope this gave you a small glimpse of some of the things that I'm looking for when I clean a data set or I get a new data set in and I'm kind of analyzing it, figuring out what I need to fix in it. I hope this has been helpful. With that being said, thank you so much for watching. I really appreciate it. If you like this video, be sure to like and subscribe below and I'll see you in the next video.
Generate a brief summary highlighting the main points of the transcript.
GenerateGenerate a concise and relevant title for the transcript based on the main themes and content discussed.
GenerateIdentify and highlight the key words or phrases most relevant to the content of the transcript.
GenerateAnalyze the emotional tone of the transcript to determine whether the sentiment is positive, negative, or neutral.
GenerateCreate interactive quizzes based on the content of the transcript to test comprehension or engage users.
GenerateWe’re Ready to Help
Call or Book a Meeting Now