Master Data Analysis in 5 Minutes: A Step-by-Step Guide Using Pandas Profiling
Learn to analyze any dataset in just 5 minutes using a powerful technique with Pandas Profiling. Generate comprehensive EDA reports effortlessly.
File
Generate Data ScienceData Analysis Report of your DataSet in 5 Minutes
Added on 10/02/2024

Speaker 1: Guys, in today's video I will show you how to analyze any dataset in 5 minutes. I use this technique a lot, and I am sharing it with you today. I guarantee that after watching this video you will be able to analyze any dataset in 5 minutes: you will be able to generate a report and get all the insights from it. If you are in data science and machine learning like me, you know that working out things like the mean, median and mode, or the Pearson correlation, takes a lot of time if you do it manually. A complete EDA report of a dataset is very useful, and that is what we are going to see in this video.

Before we start, please like this video and tell me in the comments whether it is helpful, so that I can bring you more useful videos like this. I don't want to waste any more time. Watch this video carefully and remember my guarantee: you will be able to analyze any data in 5 minutes, and you will learn a very valuable skill. So let's go to the computer screen and start this ninja technique.

Alright guys, I am at my computer, and I have made a folder here. In this folder I will show you how to analyze data in 5 minutes. First of all, let's open this folder in VS Code. VS Code is a source code editor by Microsoft; you can also use Atom. I am using VS Code here. If you haven't downloaded VS Code yet, you can download it and use it. Now I am going to tell you the name of the package that lets you do data analysis in 5 minutes.
And to do data analysis we also need data. You can use any dataset; I am using this housing.csv data. housing.csv is a very famous dataset containing housing data. As you can see, housing.csv is here. You can either copy its content and paste it into a new CSV file, or download it: right-click, choose Save As, and save it to your hard disk. It is basically a CSV file, and any other data could be used instead; this data is just a sample.

So let's first look at the data we are going to analyze. This is quite a big dataset, and data is not always this simple; it can be much more complex. Let me tell you what this data is. It is house pricing data: the latitude and longitude are given, then the median housing age, then the total rooms, how many bedrooms there are in the house, the population of the area where the house is, and how many households live there on average. Then there is the median income of those households, in some scaled unit. This is data from the United States, from California if I'm not wrong, and California uses the dollar, so the money here is in dollars. These are the prices of the houses, and this is the ocean proximity: how far the ocean is from those houses.

Whenever you get a dataset like this, the first thing you have to do is EDA. EDA means Exploratory Data Analysis: you analyze the data to summarize its main characteristics. If you are in the field of data analysis or machine learning, or you want to get into it, or you have just learned Python, you should know this term; everyone should know it. Now I will unveil the mystery without wasting time. Here you run pip install
pandas-profiling. You install this pandas-profiling package, so you have to type pip install pandas-profiling here. You should have a little knowledge of Python and a little knowledge of DataFrames. It is now downloading everything it needs, and while that is going on I will make a file, main.py. As I said, knowing about DataFrames is important, so I will cover that in this video as well. If you get an error here because you cannot see the Python version, click on Select Python Interpreter and choose your Python interpreter.

While this is installing, you have to write import pandas as pd. pandas is a package that helps you with data analysis in Python. It is basically a library that helps you read and write CSV files, and there are many other things in it; it is used for data manipulation and data analysis in Python. By the way, I have made a video tutorial on pandas, but that is not the point of this video, so I will move quickly.

I will write df = and then a pandas call. pandas has a DataFrame. For now, just understand that to read an Excel-style sheet you make a DataFrame. It is not strictly necessary to make a DataFrame to read Excel in Python, but for now think of a pandas DataFrame as the counterpart of an Excel sheet. I write pd.read_csv here; if it were an Excel file, it would have been read_excel. This is a CSV, so I write read_csv and pass housing.csv: df = pd.read_csv("housing.csv").

It is still installing. It takes some time; my computer is fast, but it is still taking time. I know many of you are gamers and have a faster computer than me, but mine is faster than an average user's. Now I will just print df here. By the way, pandas also has to be installed on your computer, so run pip install pandas. I will write it here, because to install it
you have to run pip install pandas, and along with that pip install pandas-profiling. You can save it like this, but note that this line is not Python syntax for this file. See, my pandas-profiling is installed at this point. If I run this file... I hadn't saved it; I have to save it first. Now if I run this file, see, my DataFrame is printed. So a DataFrame is one of the ways to bring an Excel-style sheet into Python, and that is the way I have used here.

Now I want pandas-profiling to analyze this data and generate a report for me. This printout looks a little weird, but it will generate a report, and I will show you how. Without too many comments, I will close this and write from pandas_profiling, because I have just installed pandas_profiling, import ProfileReport. So: from pandas_profiling import ProfileReport. Now it is saying "unresolved import". Why is it saying that when I have installed pandas-profiling? I run it and there is no problem, no error. Sometimes IntelliSense reports errors like this, but it is fine. You have to import ProfileReport from pandas_profiling.

Now you have read this df, and the next step is to generate a report. To generate a report, you write profile = ProfileReport and pass it the DataFrame you want to profile. After that you write profile.to_file, because I need my report in a file, with output_file= and then housing.html. housing.html is an HTML file in which my data analysis will be saved. See, below it says "summarizing dataset"; it is summarizing the dataset, and it shows that it is getting the scatter matrix and all the things
that may not be possible for you to do yourself. It does it in a minute; 5 minutes is actually a long time. I said 5 minutes in the video because it takes time to install pandas-profiling and to read the DataFrame, and if someone doesn't have pandas installed, they might complain.

See, the analysis is done. Now you can open housing.html by double-clicking it in the folder, or you can run a live server if you want. Many people don't know much about HTML and web development, and that's totally fine; if you are a data science person, just open it by double-clicking and it will work fine. Sometimes it takes a while to load; for me it loads immediately.

See here: it has analyzed your data first and then bound it into a report. I call this binding: pandas-profiling goes inside the data, checks everything, and then makes a report.

Now look at what it reports. Number of variables: it says you have 10. If I open my data and count the columns, the count is 10, so it is telling you that you have 10 columns in your data. Then the number of observations, the missing cells, and the missing cells percentage, that is, what percentage of cells are missing. There are no duplicate rows in your data, and it gives the duplicate percentage. Just think about it: it tells you all of this in such a short time. It also gives the total size in memory, which helps if you have memory constraints. If you want to estimate against the RAM in your computer, you can multiply by 100 or by 1000 for whatever
dataset you want to load into your RAM. For that it gives you the total size in memory and the average record size in memory. It also tells you the variable types: numerical is 9 and categorical is 1. That is correct: if I put a filter on this column in Excel, you can see this one is categorical, and all the rest are numerical attributes. So it recognizes automatically that numerical attributes are numbers such as 1, 2, 3, 4, 3.1, 3.2 or -122.26, and that categorical attributes are values like inland, island or near ocean. A question like "are you married?" with a yes or no answer is categorical too. There are other kinds of variables as well, which you will learn about when you do EDA, but for now I am assuming we don't know anything and are simply depending on pandas-profiling.

Now see here: it has analyzed each and every variable. Let's talk about this longitude column. It says the correlation is very high: this field is highly correlated. Highly correlated with what? With latitude. So it is showing you this correlation in your dataset. After that it shows a histogram with bins = 10, which means it has put the data into 10 fixed-size bins. There are so many things here that I can't read them one by one. But just think: when you are analyzing a dataset you have never seen before, this makes it so easy for you.
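[Editor's sketch] The overview figures described above (variable count, missing cells, duplicate rows, memory size, variable types, a correlation, and a 10-bin histogram) can be approximated with plain pandas and NumPy. This is only a rough illustration and not how the profiling package computes its report. The tiny DataFrame and the `overview` helper are made up for the example, and the column names are assumed to resemble those in housing.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for housing.csv (illustrative values, not real data).
df = pd.DataFrame(
    {
        "longitude": [-122.26, -122.25, -121.90, -118.40, -117.10, -117.10],
        "latitude": [37.85, 37.84, 37.30, 34.05, 32.70, 32.70],
        "total_bedrooms": [129.0, np.nan, 190.0, 235.0, 280.0, 280.0],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND",
                            "NEAR OCEAN", "NEAR OCEAN", "NEAR OCEAN"],
    }
)


def overview(frame: pd.DataFrame) -> dict:
    """Collect dataset-level numbers similar to the report's overview section."""
    n_rows, n_cols = frame.shape
    missing = int(frame.isna().sum().sum())
    numeric_cols = frame.select_dtypes(include="number").columns
    return {
        "variables": n_cols,
        "observations": n_rows,
        "missing_cells": missing,
        "missing_cells_pct": 100.0 * missing / (n_rows * n_cols),
        "duplicate_rows": int(frame.duplicated().sum()),
        "memory_bytes": int(frame.memory_usage(deep=True).sum()),
        "numeric": len(numeric_cols),
        # Simplification: every non-numeric column is counted as categorical.
        "categorical": n_cols - len(numeric_cols),
    }


summary = overview(df)
print(summary)

# Pearson correlation between two numeric columns.
lon_lat_corr = df["longitude"].corr(df["latitude"])
print("longitude vs latitude:", lon_lat_corr)

# Histogram of one column using 10 equal-width bins.
counts, edges = np.histogram(df["longitude"], bins=10)
print(counts)
```

On the real file, the same helper could be applied to the DataFrame returned by pd.read_csv("housing.csv").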
Then look here: it gives the mean, median, maximum and so on. Just think how much effort it would take to calculate these yourself, and it has done it for you for free. Then here it says this column has missing values and is also highly correlated; it is telling you that the total bedrooms column has missing values. If I go to total bedrooms in Excel, there are blanks, so I can confirm where the missing values are. In this dataset the missing values have been kept intentionally, so that you can practice machine learning concepts on it.

I have made a video in which I analyze all of this data, so if you have not seen it, you are missing out; you must watch it. It is in my ML tutorials, in my machine learning series: an end-to-end Python machine learning project, a 3-hour-long video where I explain everything and we do all of these things manually. pandas-profiling can do it in one go, but that doesn't mean doing it manually is useless, or that the people who have learned machine learning and data science are useless. To use these reports you need to know the concepts. If you don't know the concepts, you may understand minimum and maximum, but you won't understand what the mean is, and you won't understand correlation.

Now see here: it shows interactions and a lot of other things. It gives different types of correlation: Spearman's rho, Pearson's r, Kendall's tau, whatever it is, everything is given. It also tells you about missing values, then it gives a sample of your data, the first rows and the last rows of your dataset, and below it says
"Report generated with pandas-profiling", and above it says "Pandas Profiling Report". Now, many people want to embed this report. Embed it where? In your own system: you can build a system that generates this for all your data with the help of pandas-profiling. See here, it has also given a lot of warnings, things you have to take care of if you are using this dataset: latitude is highly correlated with longitude, longitude is highly correlated with latitude, and there are missing values in your data. If I read all of it out it will get a little boring, so I want you to explore it yourself. You must have enjoyed seeing it; I also enjoyed it when I saw it for the first time.

So this package is pandas-profiling. You can go to GitHub and follow it to see what is going on and to get all the updates when something new comes. There are many more things in it: there is a to_notebook_iframe method for Jupyter notebooks, so there are different methods of export. I showed you one, HTML; I want you to go there and explore all these methods.

You will be able to do data analysis in 5 minutes. If I start a 5-minute stopwatch and download a new dataset, then downloading it and bringing it here takes time, and so does running the install command. So I have made a template of this on my computer. Whenever I get data, I save it under the name data.csv and run the template on that copy; in my program the input is data.csv and the output is output.html. I run it once and analyze the result. For every dataset that comes to me, I don't even open Excel first. If I know that the data is structured and the columns are labeled, I don't open the data at all. I say: grab pandas, grab pandas-profiling. First I look at what
pandas-profiling is saying. Good: these are the columns, these are the correlations, this is the unique percentage, these are the missing values, and you can see the first rows and the last rows, so there is no need to open the file. Then I make my decisions, because many machine learning decisions depend on things like how high your correlations are, or how to build a new attribute. I have discussed all of these things.

When this video gets long, people will say, "You said 5 minutes, and the video is so long." Bro, in the video I have to explain each thing step by step. I don't want to take on the challenge of keeping the video to 5 minutes and have you understand nothing. I would rather you understand, even if the video is a little long. In future, when you do an analysis and produce a report, you will be able to do it in 2 to 3 minutes instead of 5. I can produce a report for any data within a minute with the help of pandas-profiling.

Now I want to tell you a few more things. This pandas-profiling package is very good, but sometimes the report contains too much information. For that case, if you pass minimal=True, you get less information. Let me run it with minimal=True. While it generates the report, let me mention that report generation can be problematic for a big dataset. See here: with minimal it tells me much less. The last rows I showed you at the end are not shown any more; it has cut many things. So with minimal=True it is very to the point about your dataset. If you think your report has a lot of unnecessary information, you can remove it this way; and if you think the full report is right and you want to know more about your data, keep it. Personally I don't set minimal
is equal to True; I don't use it. I look at the whole report and pick out what I want myself; even a 20-page report is a small thing. I hope you have understood this. Now, for ridiculously large datasets, say 20, 30 or 100 GB, you may run into problems. What can you do? You can take a sample of the data and profile that, and I leave that to you. Take a sample, say 1,000 rows, profile it, and you will learn a lot about the data. If not 1,000, then take more: if there are 1 billion rows, take 1 million rows, or 100k rows. It depends on the context you are working in and how much compute power you have available. Some people work in big companies that have their own sandboxes with very powerful computers, with a lot of RAM in GBs, I think even in TBs. So a lot depends on what you do and what resources you have.

I hope you liked my ninja technique. Everyone, write a comment below and tell me how you liked it. If you want courses on machine learning, data science and so on, you can subscribe to the channel; a lot more is coming. Press the bell icon if you want to stay connected with me. I upload a lot on this channel: if you have seen it, you know there is C++, PHP, a web development course, and a whole data science course is coming very soon. I have covered Git, JavaScript, Python, GUI and Android Studio, and I keep making discussion videos. I have seen in the analytics that a lot of people who watch are not subscribed to the channel,
most of the viewers, even more than 50% of them. So I want you guys to subscribe to this channel and like this video. If you want more techniques like this, let me know in the comments. That's it for this video, guys. Thank you so much for watching, and I will see you next time in the next video.
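[Editor's sketch] The workflow described in the video (read the CSV, build a ProfileReport, save it as HTML, optionally with minimal=True, and sample very large data first) might look roughly like the following. The helper names `sample_for_profiling` and `write_report` are invented for this example. The file names follow the transcript. The import fallback assumes the package is installed either under its original name, pandas-profiling, or under the newer name ydata-profiling, to which the project was later renamed.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Return at most max_rows randomly chosen rows so huge datasets stay manageable."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def write_report(df: pd.DataFrame, output_file: str, minimal: bool = False) -> None:
    """Profile df and save the report as an HTML file."""
    # Imported lazily so the sampling helper works even without the package.
    try:
        from ydata_profiling import ProfileReport  # newer package name
    except ImportError:
        from pandas_profiling import ProfileReport  # name used in the video
    profile = ProfileReport(df, minimal=minimal)
    profile.to_file(output_file=output_file)


# Typical use, assuming housing.csv sits next to this script:
# df = pd.read_csv("housing.csv")
# write_report(df, "housing.html")                       # full report
# write_report(df, "housing_minimal.html", minimal=True) # shorter report
# write_report(sample_for_profiling(df, max_rows=100_000), "housing_sample.html")
```

The sample size is a judgment call, as the speaker says: it depends on how much memory and compute you have available.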
