Understanding Big Data: From Hadoop to Apache Spark's Advanced Processing
Explore the evolution of big data processing from Hadoop's distributed storage to Apache Spark's in-memory computing, enhancing speed and efficiency.
Learn Apache Spark in 10 Minutes: Step-by-Step Guide
Added on 09/29/2024

Speaker 1: Ninety percent of the world's data was generated in just the last two years. In the early 2000s, the amount of data being generated exploded. With the rise of the internet, social media, and other digital technologies, organizations found themselves facing massive volumes of data that were very hard to process. To address this challenge, the concept of Big Data emerged. Big Data refers to extremely large and complex data sets that are difficult to process using traditional methods. Organizations across the world wanted to process these massive volumes of data and derive useful insights from them.

This is where Hadoop comes into the picture. In 2006, a group of engineers at Yahoo developed a software framework called Hadoop, inspired by Google's MapReduce and Google File System technologies. Hadoop introduced a new way of data processing called distributed processing. Instead of relying on a single machine, we use multiple computers to get the final result. It works like teamwork: each machine in a cluster gets some part of the data to process, all of them work simultaneously, and at the end the outputs are combined into the final result.

There are two main components of Hadoop. The first is the Hadoop Distributed File System, HDFS, which is like a giant storage system for keeping our data safe. It divides the data into multiple chunks and stores them across different computers. The second part of Hadoop is called MapReduce, which is a way of processing all of this data in parallel: you divide your data into multiple chunks and process them together, the same way a big team of friends might solve a very large puzzle. Each person in the team gets some part of the puzzle, solves it on their own, and at the end everything is put together to get the final result.

So with Hadoop we got two things: HDFS, the Hadoop Distributed File System, for storing data across multiple computers, and MapReduce, for processing all of that data in parallel. This allowed organizations to store and process very large volumes of data.

But here's the thing: although Hadoop was very good at handling big data, it had a few limitations. One of the biggest problems was that it relied on storing data on disk, which made things much slower. Every time we run a job, it reads the data from disk, processes it, and writes the result back to disk. This made data processing a lot slower. Another issue is that Hadoop processed data only in batches, meaning we have to wait for one job to complete before we can submit another. It is like waiting for the whole group of friends to finish their puzzle pieces individually before putting anything together. So there was a need to process all of this data faster and in a more real-time manner.

This is where Apache Spark comes into the picture. In 2009, researchers at the University of California, Berkeley developed Apache Spark as a research project. The main reason behind its development was to address the limitations of Hadoop. They introduced a powerful concept called the RDD, the Resilient Distributed Dataset. RDDs are the backbone of Apache Spark: they allow data to be kept in memory, which enables faster data access and processing.
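To make the MapReduce idea more concrete, here is a toy, single-machine sketch of the split, process, and combine steps in Python. It is only an illustration of the concept, not real Hadoop code: the "chunks" are just slices of a Python list, and the function names are made up for this example.

```python
from collections import defaultdict

# Toy illustration of the MapReduce pattern described above.
# Real Hadoop distributes these phases across a cluster of machines;
# here the "chunks" are just slices of a Python list.

def map_phase(chunk):
    # Emit (word, 1) pairs for every word in this chunk of lines.
    for line in chunk:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Combine the partial counts for each word into a final result.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Big data needs big tools", "Spark and Hadoop process big data"]
chunks = [lines[:1], lines[1:]]                                  # split the input
mapped = [pair for chunk in chunks for pair in map_phase(chunk)] # map each chunk
print(reduce_phase(mapped))                                      # combine the outputs
```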
So instead of reading and writing the data from disk again and again, Spark processes the data in memory. Memory here means RAM, the random access memory inside our computer, and this in-memory processing is what makes Spark up to 100 times faster than Hadoop. Yes, you heard it right, up to 100 times faster. Not only that, Spark also lets you write code in various programming languages such as Python, Java, and Scala, so you can start writing Spark applications in your preferred language and process your data at large scale. Apache Spark became very popular because it was fast and could handle a lot of data efficiently.

Here are the different components of the Apache Spark ecosystem. The most important part is Spark Core: it handles processing data across multiple computers and makes sure everything works efficiently and smoothly. Another part is Spark SQL, which lets you write SQL queries directly on your datasets. Then there is Spark Streaming: if you want to process real-time data, like the live updates you see in Google Maps or Uber, you can do that with Spark Streaming. Finally, there is MLlib, which lets you train large-scale machine learning models on big data within Spark itself. All of these components working together make Apache Spark a really powerful tool for processing and analyzing big data. Nowadays, in almost any company that processes big data, you will find Apache Spark.

Now, here is the basic architecture behind Apache Spark. A standalone computer is fine for watching movies or playing games, but you can't process truly large data on a single machine. You need multiple computers working together on individual tasks so that you can combine the outputs at the end and get the desired result. But you can't just take ten computers and start processing your big data; you need a proper framework to coordinate work across all of these machines, and that is exactly what Apache Spark does. It manages and coordinates the execution of tasks on data across a cluster of computers.

How does Apache Spark do that? It uses something called a cluster manager. Whatever we write in Spark is called a Spark application. Whenever we run it, the request goes to the cluster manager, which grants resources to our application so that we can complete our work. A Spark application has two important kinds of processes: the driver process and the executor processes. The driver process is like the boss and the executor processes are like the workers. The driver's main job is to keep track of all the information about the Spark application and respond to commands and input from the user. Whenever we submit anything, the driver makes sure it flows through the application properly: it analyzes the work that needs to be done, divides it into smaller tasks, and assigns those tasks to the executor processes. So it is basically the boss or manager, the heart of the application, making sure everything runs smoothly and that the right resources are in place for the work we submit.
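Here is a minimal sketch of what that architecture looks like from the code side: the application (the driver) asks a cluster manager for resources when the SparkSession is created. The app name, the local[4] master, and the executor memory setting are illustrative values only, assuming PySpark is installed.

```python
from pyspark.sql import SparkSession

# A minimal sketch of how a Spark application (the driver) requests resources.
# "local[4]" runs everything on this machine with 4 worker threads; on a real
# cluster you would point .master() at YARN, Kubernetes, or a standalone
# master URL instead.
spark = (
    SparkSession.builder
    .appName("architecture-demo")            # hypothetical app name
    .master("local[4]")                      # cluster manager / deployment target
    .config("spark.executor.memory", "2g")   # example resource request for executors
    .getOrCreate()
)

print(spark.version)   # the driver is now running and can coordinate executors
spark.stop()
```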
The executor processes are the ones that actually do the work: they execute the code assigned to them by the driver and report back the progress and results of the computation.

Now, how does Apache Spark execute code in practice? The first thing we do when writing Spark code is create a SparkSession, which is how we make the connection to the cluster manager. You can create a SparkSession from Python, Scala, or Java; no matter which language you use, the first thing you create in a Spark application is the SparkSession. With it you can perform simple tasks, such as generating a range of numbers with a single line of code. That one line creates a DataFrame with one column containing 1,000 rows, with values from 0 to 999. A DataFrame is simply a representation of data as rows and columns, similar to a spreadsheet in MS Excel. The concept of a DataFrame is not new to Spark; Python and R have DataFrames too. The difference is that a Python DataFrame is stored on a single computer, whereas a Spark DataFrame is distributed across multiple computers.

To make sure the data can be processed in parallel, Spark divides it into multiple chunks. This is called partitioning. You can have a single partition or multiple partitions, and you can control this while writing your code.

The work itself is expressed as transformations. Transformations are instructions that tell Apache Spark how to modify the data to get the desired result. For example, if you want to find all the even numbers in a DataFrame, you can use the where transformation to specify that condition. But here's the thing: if we run this code, we will not get any output. In most programming languages, once you run the code you get the output right away, but Spark will not give you the output yet. The reason is lazy evaluation. Spark waits until you have written your whole chain of transformations and then builds an efficient plan from that code. This lets Spark see your entire data flow and execute it efficiently. To actually execute a chain of transformations, we use something called an action. There are multiple actions available in Apache Spark; one of them is count, which gives us the total number of records in a DataFrame. When we run an action, Spark runs the whole chain of transformations and gives us the final output.

Here is an end-to-end example to see all of these concepts in a single small project. The first thing we need to do is import SparkSession, which we can do with from pyspark.sql import SparkSession. This is the entry point for the Spark application. Once that is done, we use SparkSession.builder.getOrCreate() to create the Spark application, so that we can load a dataset and start writing queries. Details such as the Spark version and the app name are available on the session object. Now, say we have a dataset called tips. To read it, we have a very simple function called spark.read.csv: if you provide the path and set header to true, it will load the entire contents of the CSV file.
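Putting the pieces from this part together, here is a small runnable sketch of the range, transformation, and action steps just described, assuming PySpark is installed; the app name and column name are placeholders for illustration.

```python
from pyspark.sql import SparkSession

# Create the entry point for the Spark application.
spark = SparkSession.builder.appName("spark-basics").getOrCreate()

# One line creates a distributed DataFrame with 1,000 rows, values 0..999.
my_range = spark.range(1000).toDF("number")

# A transformation: nothing is computed yet, thanks to lazy evaluation.
evens = my_range.where("number % 2 = 0")

# Optionally control how the data is split into partitions.
evens = evens.repartition(4)

# An action: this triggers Spark to build a plan and actually run it.
print(evens.count())   # 500

spark.stop()
```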
As you can see, our data contains total bill, tip, sex, smoker, day, time, and size. All of this data is loaded from the CSV file, and if you print the type of this object, you will see that it is a pyspark.sql.DataFrame. Now you can create a temporary view on top of it. Using this function creates a table-like view inside Spark so that you can write SQL queries against it. So let's say we have the query select * from tips: if we pass this query to spark.sql, we can easily run the SQL query on top of our DataFrame. So what did we really do? We imported the data, registered it as a table, and then wrote SQL queries on top of it. You can also convert this Spark DataFrame into a pandas DataFrame, so if you want to apply any pandas function, you can do that from Spark as well.

Here you can also see lazy evaluation in action, where we filter sex by Female and the day by Sun. When we run that statement, Spark does not execute anything yet; it waits for an action to be performed. The action here is show. Once you call show, Spark runs the whole chain and you can see the result. This is the transformation we discussed earlier, and this is the action. Like this, you can do a lot more; you can go to the Spark documentation and study it in detail. There are many functions available, and for each one you will find a detailed explanation.

I hope you now understand Spark and how it executes all of this code. If you want to build an entire data engineering project using Apache Spark, you can watch the video linked here; it gives a complete understanding of how a data engineering project is built from start to end. That was all for this video. If you have any questions, let me know in the comments, and I'll see you in the next one. Thank you.
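Here is a sketch of the tips walkthrough as described, assuming PySpark (and pandas, for the conversion step) is installed. The file path is a placeholder, and the column names (sex, day, and so on) follow the commonly used tips dataset referenced in the video.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tips-demo").getOrCreate()

# Read the CSV; replace the path with wherever your tips file actually lives.
tips = spark.read.csv("tips.csv", header=True, inferSchema=True)
print(type(tips))              # <class 'pyspark.sql.dataframe.DataFrame'>

# Register a temporary view so we can query the DataFrame with SQL.
tips.createOrReplaceTempView("tips")
spark.sql("SELECT * FROM tips").show(5)

# Convert to a pandas DataFrame if you want to use pandas functions on it.
pdf = tips.toPandas()

# Transformation (lazy): nothing runs until the show() action below.
female_sun = tips.filter((tips.sex == "Female") & (tips.day == "Sun"))
female_sun.show()              # action: triggers execution of the whole chain

spark.stop()
```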
