Introduction to Data Analysis and Visualization with Python and Matplotlib

Learn why data visualization is crucial, how to use Python and Matplotlib, and follow a step-by-step guide to create line charts and analyze real datasets.

Intro to Data Analysis Visualization with Python, Matplotlib and Pandas Matplotlib Tutorial

Added on 09/08/2024

Speaker 1: Hey everyone, this is my introduction to data analysis slash data visualization with Python. In this video I'm gonna cover why you might want to use data visualization and why you might want to use Python and Matplotlib for it. And then we're gonna go over some simple examples of how to actually use these tools, and then, using these tools, we're gonna do sort of a real analysis with a real data set at the end. And in this video I'm only going to cover line charts, just to keep everything simple. I'm also going to put a more detailed version of this outline in the comment section below so you don't have to watch the whole thing if you don't want to.
Okay, so why should you use data visualization in the first place? Well, data visualization is actually often the first step of any type of data analysis work, whether it's simple data analysis or statistical analysis or machine learning analysis. And the reason for that is that visualizing data often gives you an intuitive understanding of the data, and it often helps you see patterns that are otherwise hard to see. And we're gonna see an example of that later.
Okay, and why should you use Python for this? Well, Python is not the only good choice, but I would say it's one of the best. And the reason is, first of all, it's a general purpose language that's pretty easy to use and learn. And it also has many libraries for scientific computing and data science, including Matplotlib. And if you work at a company, your company might already use Python for something else. And if that's the case, that's really nice, because then you and your team are not gonna have to learn a totally new language to do some data analysis.
And why are we using Matplotlib for this? Well, Matplotlib is not the only good visualization library for Python, but it's still one of the most popular choices. And there are actually other libraries that are based on Matplotlib.
So if you learn Matplotlib, it's going to help you learn these other libraries, for example, this one called Seaborn, later on if you want to. And Matplotlib is also pretty easy to get started with.
Anyway, let's dive into a demo. For this demo, we're going to use something called Jupyter Notebook and a few other Python libraries, and we're going to use Anaconda to install them. If you're not familiar with Jupyter Notebook and Anaconda, I have an explanation about them in my Python tutorial video, so I'm gonna leave a link to that in the description.
Anyway, to install Anaconda, just search for Anaconda Python, or directly go to anaconda.org, and there, find the button that says Download Anaconda, and select whatever OS you're using. I'm using Mac here. And click Download under the Python 3.something version instead of Python 2.something, because we're gonna use Python 3 here. Then select where you want to download this package, save it, and once it's downloaded, open up the package that you just downloaded, and then just click Continue, Continue, Continue, Continue, Agree, Install for me only or Install on a specific disk, it doesn't matter which one, and Continue, and click Install. And this process is going to take a while. After some waiting, you might see this prompt to install Microsoft VS Code. We don't need that, so let's just continue here, and then close.
And then to launch Jupyter Notebook, you can do it through this thing called Anaconda Navigator. So just launch it like you launch any other application. Just dismiss whatever comes up, and then click Launch in the Jupyter Notebook section. And then you should see a browser window show up with the Jupyter Notebook interface.
Now, if you want to follow this tutorial, the first thing you should do is create a new folder, let's say on Desktop, and let's call this one Data Visualization, and we're going to put all our data and our Jupyter Notebook file here. So let's first download our data.
To do that, just go to csdojo.io slash data, and download these two files, sample_data.csv and countries.csv, and then put these CSV files in the folder that you just created, Data Visualization. After that, go back to the Jupyter Notebook interface, and you can just navigate to Desktop, and then the folder that we just created, Data Visualization. And to create a new Jupyter Notebook file here, just find the New button on the right, and click Python 3. Right now, this notebook file has Untitled as the title, so let's change it to Data Visualization with Python. Click Rename, and you have a notebook called Data Visualization with Python. You can check it just by going to Desktop, and then to the folder that you just created, and you should see that there's a file called Data Visualization with Python.ipynb. And it's really important that this notebook file is in the same folder as the data that you just downloaded, countries.csv and the other one.
And once everything is set up, just write in the first cell, import pandas as pd. This means we wanna import a module called pandas as pd, or we wanna give it sort of a nickname, and that's gonna be pd. You can run this cell by clicking this button, and now pandas is imported as pd. And here, we're gonna use pandas for importing and using some data from our CSV files. And we need to import another module here. So for that, just write, from matplotlib import pyplot as plt. So this says, from the matplotlib package, import the pyplot module, and then call it plt. Let's run this cell, and now pyplot is imported. We're gonna use pyplot from matplotlib for making our charts.
So here, let's first take a look at a really simple example of how to use pyplot. I'm gonna write x = [1, 2, 3], it's a list of three elements, and y = [1, 4, 9].
And to plot this set of data, you can just write plt.plot(x, y), and this plots x on the x-axis and y on the y-axis, and then you can show this graph by writing plt.show(). And when you run this cell, you should see a graph like this. You see that the values of x are one, two, and three, as expected, and the values of y are one, four, and nine. If you want to add a title to this graph, you can do so by writing plt.title("test plot") right after the plot statement, before the show statement. And then you can add an x label and a y label as well, by writing plt.xlabel, let's call the x label x, and plt.ylabel, let's call the y label y here. And when you run this cell, you'll see that there's a title called test plot, an x label called x, and a y label called y.
Okay, what if you wanted to plot multiple lines here? Well, to do that, let's create another list, let's call it z, and this one is going to have 10, 5, and 0 inside. And to plot x and z on top of x and y, you can just write plt.plot(x, z) right after plt.plot(x, y). And then let's fix the y label here to y and z. And when you run this cell, you should see these two lines. So the blue line represents x and y, and the orange line represents x and z. So plt.plot(x, z) plotted x on the x-axis and z on the y-axis. But right now, it's kind of hard to tell which line represents which data, so we can fix it by adding a legend statement. Let's add that after the y-label statement by writing plt.legend(["this is y", "this is z"]). So note here that this legend function takes a list as an argument. And when you run this cell, you should see this legend. It says the blue line is this is y, and the orange line is this is z.
Okay, that's the basics of plotting. Now, let's see how to load data from a CSV file. For that, you can just write sample_data = pd, or pandas, dot read_csv.
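Written out as one notebook cell, the plotting steps described so far might look like the sketch below. The lists, title, labels, and legend text follow the walkthrough; the Agg backend line and the explicit plt.figure() call are assumptions added so the snippet also runs outside Jupyter, and both can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside Jupyter
from matplotlib import pyplot as plt

x = [1, 2, 3]
y = [1, 4, 9]
z = [10, 5, 0]

plt.figure()        # fresh figure, so this sketch does not draw onto an earlier one
plt.plot(x, y)      # first line: x on the x-axis, y on the y-axis
plt.plot(x, z)      # second line, drawn on the same axes
plt.title("test plot")
plt.xlabel("x")
plt.ylabel("y and z")
plt.legend(["this is y", "this is z"])  # labels are matched to lines in plotting order
ax = plt.gca()      # handle to the current axes, kept only for inspection
plt.show()
```

The legend labels are assigned in the order the lines were plotted, which is why "this is y" ends up next to the first (blue) line.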
By the way, I just pressed tab here to do autocomplete, and then, in parentheses and quotes, sample_data.csv. Now, before you run this cell, make sure that the notebook file, Data Visualization with Python.ipynb, is in the same folder as sample_data.csv. When you run this cell, this data, sample_data.csv, is loaded by the pandas module, which we call pd, and then it's assigned to this variable called sample_data. You can check what's inside this variable, sample_data, just by writing sample_data in this new cell, and then when you run this cell, you should see something like this. So as you can see, this data has three columns, column A, column B, and column C, and five rows. And you see a bunch of values inside this table.
If you want to check if this set of data is exactly the same as the original data, you can do so by opening up the original data file, sample_data.csv, with Excel or any other spreadsheet application. And when you open it, you should see exactly the same data. Column A, column B, column C, with five rows with a bunch of values. Okay, the only difference that you might see is that in Jupyter Notebook, you might see these numbers, 0, 1, 2, 3, and 4, and these are just indices for the rows.
And you can check what type this variable is by writing type(sample_data). And when you run this cell, it says that this is pandas.core.frame.DataFrame. So this is a DataFrame type that's defined by the pandas module. And the DataFrame type is used to contain a table-like piece of information, just like this one.
Okay, now, what if you wanted to plot data in this DataFrame? For example, the values of column A on the x-axis and column C on the y-axis. Well, to do that, you'll need to be able to retrieve a specific column, and you can do that by writing sample_data.column_c. When you run this cell, you'll see that column C is retrieved.
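A rough sketch of the loading and inspection steps follows. To keep it runnable without the downloaded file, it reads the same kind of table from an in-memory string; the column names column_a, column_b, and column_c are assumed from the way the columns are spoken in the video, and the values are the ones read out in the walkthrough. In the notebook you would call pd.read_csv("sample_data.csv") instead.

```python
import io
import pandas as pd

# Stand-in for sample_data.csv (assumed column names, values as described in the video)
csv_text = """column_a,column_b,column_c
1,1,10
2,4,8
3,9,6
4,16,4
5,25,2
"""

# In the notebook: sample_data = pd.read_csv("sample_data.csv")
sample_data = pd.read_csv(io.StringIO(csv_text))

print(sample_data)                  # table with a 0..4 row index on the left
print(type(sample_data))            # the pandas DataFrame type
print(sample_data.column_c)         # one column, retrieved by attribute access
print(type(sample_data.column_c))   # the pandas Series type
```

Attribute access such as sample_data.column_c only works when the column name is a valid Python identifier; sample_data["column_c"] is the more general form.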
It has the values 10, eight, six, four, and two. And the numbers you see on the left are just indices, zero, one, two, three, and four. Just like before, you can check what type this is by writing type(sample_data.column_c). And when you run this cell, you see that this is pandas.core.series.Series. So this is basically a Series type that's defined by the pandas module, and it's a type that's used to store a series of values. For example, these values, 10, eight, six, four, and two.
Now, what if we wanted to retrieve a specific value out of this series? Well, if you want to retrieve, for example, the second value here, eight, you can do so by writing sample_data.column_c.iloc[1], that's dot iloc, I-L-O-C, square brackets, one. And this retrieves the second value of the series, eight, and if you want to retrieve the third value, six, you can write iloc[2], and that gets the third value. And if you want to retrieve the first value, you can write iloc[0], and this should give us ten, and it does.
Okay, and using what we've just learned here, we'll be able to plot the data in this DataFrame. So let's say we want to plot column A on the x-axis and column B on the y-axis. We can do that by writing plt.plot(sample_data.column_a, sample_data.column_b), and we can show it by writing plt.show(). Let's see how it looks. We have 1, 2, 3, 4, and 5 on the x-axis, and on the y-axis, we have 1, 4, 9, 16, and 25, as expected. If you want to add column C to this plot, you can write plt.plot(sample_data.column_a, so let's use column A as the x-axis again, and sample_data.column_c). When you run the cell, you see that there are two lines here. Just like before, if you want to make this graph a little bit easier to read, you can add titles and a legend. And by the way, in this plot function, you can use the third argument to change how the plot looks.
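Here is a small sketch of position-based lookup with iloc and of plotting DataFrame columns, under the same assumptions as before (in-memory stand-in for sample_data.csv, assumed column names). The "o" format string is just one example of what the third argument to plt.plot can be; it draws circle markers with no connecting line.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside Jupyter
from matplotlib import pyplot as plt
import pandas as pd

# Stand-in for sample_data.csv (assumed column names)
csv_text = """column_a,column_b,column_c
1,1,10
2,4,8
3,9,6
4,16,4
5,25,2
"""
sample_data = pd.read_csv(io.StringIO(csv_text))

first_value = sample_data.column_c.iloc[0]    # position 0 is the first value
second_value = sample_data.column_c.iloc[1]   # position 1 is the second value

plt.figure()  # fresh figure for this sketch
plt.plot(sample_data.column_a, sample_data.column_b, "o")  # markers only
plt.plot(sample_data.column_a, sample_data.column_c)       # default solid line
ax = plt.gca()
plt.show()
```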
So for example, if you give it 'o', the letter o in a string, as the argument in the first line for column B, and you run the cell, the plot becomes dots instead of just a line. And there's a lot more you can do, and you can find more about it in the official documentation.
Anyway, let's move on and do sort of a real analysis with a real dataset. Now, for this analysis, we're going to use this data, countries.csv. It should be in the same folder as well. And when you open it, you should see this data. So we have a bunch of countries and a bunch of years, ranging from 1952 to 2007 in five-year steps, and the population for each year for that country. And you can see that there are a lot of rows in this data. So let's now import that data just like before by writing pd, or pandas, dot read_csv('countries.csv'). And by the way, this is a string, 'countries.csv' in single quotes, and in Python you can use either double quotes or single quotes to express a string. Let's assign that to a new variable called data by writing data = in front, and when you run this cell, this data is loaded into data. So once you write data in this new cell and run it, you should be able to see this data in a DataFrame.
Now let's say that the analysis we want to do here is to compare the population growth in the US and China. To do this analysis, the first thing we want to do is isolate the data for the US and China. We can do that for the US by writing us = data[data.country == 'United States'], and when you run this cell, us now only contains the data for the United States. So let's break down this statement a little bit more. Let's click Insert here and Insert Cell Below. When you write data.country == 'United States', this actually gives a series of a bunch of trues and falses. So when the row is not US, this gives us false. And when it is US, it gives us true.
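The filtering step can be sketched on a tiny table. The column names country, year, and population follow the walkthrough, but the rows below are rounded placeholder figures chosen for illustration, not the contents of countries.csv; in the notebook, data would come from pd.read_csv("countries.csv").

```python
import pandas as pd

# Tiny stand-in for countries.csv: rounded placeholder figures, not the real file
data = pd.DataFrame({
    "country": ["China", "China", "United States", "United States"],
    "year": [1952, 2007, 1952, 2007],
    "population": [550_000_000, 1_300_000_000, 160_000_000, 304_000_000],
})

# Comparing a column to a value gives a boolean Series, one True/False per row
is_us = data.country == "United States"

# Indexing the DataFrame with that Series keeps only the rows marked True
us = data[is_us]
china = data[data.country == "China"]

print(is_us)
print(us)
```

Note the double equals sign: a single = would be an assignment, while == performs the row-by-row comparison that produces the mask.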
We don't see any trues here, but there are a bunch of trues further down where the rows are for the US. And then when you write data, square brackets, and inside them this series of trues and falses, this gives us the portion of the data where the value of the series is true, and that's the data for the US, as you can see here. And then we just assign it to this variable called us. Okay, let's now do the same thing for China by writing china = data[data.country == 'China'], and when you run this cell, and then write china in a new cell and run it, you should only see the data for China. And using these two variables, us and china, we'll be able to compare their population growth.
So let's first plot the US's population here by writing plt.plot(us.year, us.population). You can show this plot with plt.show(), and when you run this cell, you should see this graph. You see that us.year is plotted on the x-axis, and us.population is plotted on the y-axis. But you see this scientific notation thing, 1e8, because the numbers are so big. So let's divide the whole population, each number in the series, by one million, or 10 to the power of six. That's 10 ** 6 in Python. And when you run this cell again, you now see the population in millions. So this is 160 million, and it goes up to, I think, more than 300 million in 2007.
And let's plot China's data on top of this plot by writing plt.plot with china.year. Actually, you could use us.year or china.year because we have exactly the same years, but for now, let's just use china.year for the x-axis and china.population for the y-axis. And we're gonna divide this by 1 million as well to make the population show in millions. And when you run the cell, you should see these two lines. Let's add a legend and titles here to make this graph easier to read. So plt.legend(['United States', 'China']). And the x label, plt.xlabel, should be just year.
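Put together, the comparison plot might look like the following sketch. It reuses the same placeholder table as a stand-in for countries.csv (rounded, illustrative figures only), and the backend and plt.figure() lines are again assumptions for running outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed inside Jupyter
from matplotlib import pyplot as plt
import pandas as pd

# Tiny stand-in for countries.csv: rounded placeholder figures, not the real file
data = pd.DataFrame({
    "country": ["China", "China", "United States", "United States"],
    "year": [1952, 2007, 1952, 2007],
    "population": [550_000_000, 1_300_000_000, 160_000_000, 304_000_000],
})
us = data[data.country == "United States"]
china = data[data.country == "China"]

plt.figure()  # fresh figure for this sketch
# Dividing a Series by a number divides every element, so the axis reads in millions
plt.plot(us.year, us.population / 10**6)
plt.plot(china.year, china.population / 10**6)
plt.legend(["United States", "China"])
plt.xlabel("year")
plt.ylabel("population")
ax = plt.gca()
plt.show()
```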
And plt.ylabel should be population. Run this cell again, and this graph is much easier to read. So you can see that China's population started out much larger than the US's in 1952, and it seems like it's growing faster as well.
Now, what if you wanted to compare, instead of the absolute amount that you see here, the percentage growth from the first year that we have in our data, 1952? Well, there are several different ways of doing this, but I'm gonna show you just one way. So to do that, let's first copy this whole block of code over here. And let's say that for each country, we want to find the percentage growth from the first year. So we want to set the first year's amount to 100, as in 100%, and show the rest of the data in percentage relative to the first year. And we can do that by dividing this whole series, for example, us.population, by the first year's population, and then multiplying everything by 100.
So to show you what I mean, let's just create a new cell here above by clicking Insert Cell Above. And here, first, I'm gonna write us.population, and you see a series of population here for each year. And the first row you see here is the first year's population, or the population in 1952, I think. And let's insert a new cell below here. Now, to retrieve the first year's population, you can just write us.population.iloc[0]. And this gives us the first year's population, which is this amount. Then we can divide the whole population, this whole series, by the first year's population just by writing us.population / us.population.iloc[0]. And this gives us this series. So as you can see, the first year is set to one and the rest of the years are shown in relative amounts. And if you multiply everything by 100, just by writing * 100 here, you'll be able to show everything in percentage amounts.
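The normalization described here amounts to computing, for each year, population / first_year_population * 100. A minimal sketch, using a short Series of rounded placeholder figures (illustrative only, not the real data) in place of us.population:

```python
import pandas as pd

# Placeholder stand-in for us.population: rounded illustrative figures only
us_population = pd.Series([160_000_000, 230_000_000, 304_000_000])

first_year = us_population.iloc[0]          # the first year's population
relative = us_population / first_year       # first year becomes 1.0
percentage = relative * 100                 # first year becomes 100 (percent)

print(percentage)
```

With this scaling, a final value of about 190 would mean the population grew by about 90% over the period. The same expression works for china.population, with its own first-year value as the divisor.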
So you can see that the first year is shown as 100%, and from 1952 to 2007, which is the last year we have, the population grew by 90%. Now, like I said earlier, this is not the only method to show the relative growth in population, but I chose this method here because it's pretty simple to implement. Anyway, let's copy this whole thing and paste it over here to replace the y-axis. And let's do the same thing for China as well. So copy the whole thing for China here, and then replace us with china. Once you do that, let's change population in the y label to population growth. And let's just write first year = 100 just for clarity here. And when you run this cell, you should see this graph. So you can see that even in percentage amounts, China's population grew much faster than that of the United States. The US's population grew by 90% from 1952 to 2007, but during the same time, China's population grew by more than 120%.
Okay, this was a pretty simple example, and it actually came from my course called Introduction to Data Visualization. If you like this video, I'd actually highly recommend it. It's a course with more videos just like this one, and I cover more realistic and complex examples and more types of data visualization techniques, not just line charts. So if you wanna check out the course, you can just go to csdojo.io slash moredata. You can actually take this course for free by signing up for Pluralsight's 10-day free trial. That's the site the course is hosted on. Anyway, as always, thanks for watching this video and I'll see you in the next one.