Master Web Scraping with Puppeteer: Extract Data from Any Website Safely
Learn to use Puppeteer for web scraping, extract data from sites like Amazon, and leverage AI tools like GPT-4 for analysis and automation. Safe, effective methods.
Industrial-scale Web Scraping with AI Proxy Networks
Added on 09/28/2024

Speaker 1: The internet is packed with useful data, but unfortunately that data is often buried deep within a mountain of complex HTML. The term data mining is the perfect metaphor, because you literally have to dig through a bunch of useless, dirty markup to extract the precious raw data you're looking for. One of the most common ways to make money on the internet is with e-commerce and dropshipping, but it's highly competitive and you need to know what to sell and when to sell it. Don't worry, I'm not about to scam you with my own dropshipping masterclass. Instead, I'm going to teach you about web scraping with a headless browser called Puppeteer, which allows you to extract data from virtually any public-facing website, even sites like Amazon that don't offer an API. What we'll do is find trending products on websites like Amazon and eBay, build up a dataset, then bring in AI tools like GPT-4 to analyze the data, write reviews, write advertisements, and automate virtually any other task you might need. In addition, I'll teach you some tricks with ChatGPT to write your web scraping code way faster, since it's historically very annoying code to write. But first, there's a big problem. Big e-commerce sites like Amazon don't love bot traffic and will block your IP address or make you solve CAPTCHAs if they suspect you're not a human. But that's kind of racist to non-biological life. Luckily, Bright Data, the sponsor of today's video, provides a special tool called the Scraping Browser. It runs on a proxy network and provides a variety of built-in features like CAPTCHA solving, fingerprints, retries, and so on that allow you to scrape the web at an industrial scale. That being said, if you're serious about extracting data from the web, you'll very likely need a tool that does automated IP address rotation, and you can try Bright Data for free using this code. After you sign up for an account, you'll notice a product called the Web Scraper IDE. We're not going to use it in this video. However, if you're serious about web scraping, it provides a bunch of templates and additional tools that you'll likely want to take advantage of. As a developer myself, I want full control over my workflow. So for that, I'm going to use an open source tool from Google called Puppeteer, which is a headless browser that allows you to view a website like an end user and interact with it programmatically by executing JavaScript, clicking on buttons, and doing everything else a user can do. That's pretty cool, but if you use it a lot on the same website, they'll eventually flag your IP and ban you from using it. Then your mom will be pissed that she can no longer order her groceries from walmart.com. That's where the Scraping Browser comes in. It's a remote browser that uses the proxy network to avoid these problems. To get started, I'm creating a brand new Node.js project with NPM, then installing Puppeteer. Well, actually puppeteer-core, which is the automation library without the browser itself, because again, we're connecting to a remote browser. Now go ahead and create an index.js file and import Puppeteer. From there, we'll create an async function called run that declares a variable for the browser itself. Inside this try-catch block, we'll try to connect to the browser. If it throws an error, we'll make sure to console log that error. And then finally, when all of our scraping is done, we'll want to automatically close the browser. 
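For reference, a minimal sketch of that scaffold in index.js might look like the following, assuming the puppeteer-core package is installed. The actual connection call is filled in during the next step.

```js
// index.js — minimal scaffold: import puppeteer-core, define an async run()
// function, and make sure the remote browser always gets closed.
const puppeteer = require('puppeteer-core');

async function run() {
  let browser;
  try {
    // The connection to the Scraping Browser goes here (next step), e.g.:
    // browser = await puppeteer.connect({ browserWSEndpoint: '...' });
  } catch (err) {
    // Log any error thrown while connecting or scraping.
    console.error(err);
  } finally {
    // Always close the browser so a remote session isn't left open.
    if (browser) await browser.close();
  }
}

run();
```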
You don't want to leave the browser open unintentionally. Now inside of try, we're going to await a Puppeteer connection that uses a browser WebSocket endpoint. At this point, we can go to the proxy section on the Bright Data dashboard and create a new Scraping Browser instance. Once created, go to the access parameters and you'll notice a host, username, and password. Back in the code, we can use these values to create a WebSocket URL: your username and password separated by a colon, followed by an @ sign and the host. Now that we're connected to this browser, we can use Puppeteer to do virtually anything a human can do programmatically. Let's create a new page, and then set the default navigation timeout to 2 minutes. From there, we can go to any URL on the internet. Then Puppeteer has a variety of API methods that can help you parse a web page, like the $ method, which feels like jQuery and corresponds to document.querySelector in the browser. It allows you to grab any element in the DOM, then extract text content from it. Or as an alternative, you can use page.evaluate, which takes a callback function that gives you access to the browser APIs directly. Like here, we can grab the document element and get its outer HTML, just like you might do in the browser console. Let's go ahead and console log the document's outer HTML. And now we're ready to test our scraper out to make sure everything's working as expected. Open up the terminal and run the node command on your file, and you should get the HTML for that page back as a result. Congratulations, you're now ready to do industrial-scale web scraping. Now I'm going to go ahead and update the code to go to the Amazon Best Sellers page. And my first goal is to get a manageable chunk of HTML. What I'm doing is opening up the browser DevTools in Chrome to inspect the HTML directly, until we find the list of products that we want to scrape. Ideally, we'd like to get all these products and their prices as a JSON object. You'll notice all the products are wrapped in a div that has a class of a-carousel. We can use that selector as our starting point. Chrome DevTools also has a copy selector feature, which is pretty cool, but it's usually a bit of overkill. Back in the code, we can make sure that the page will wait for that selector to appear. Then we can use the $ method to grab it from the DOM, and finally evaluate it to get its inner HTML. Now let's go ahead and console log that, and run the script once again. At this point, we have a more manageable chunk of HTML, and I could analyze it myself, but the faster way to get this job done is to use a tool like ChatGPT. We can simply copy and paste this HTML into the chat, and ask it to write Puppeteer code that will grab the product title and price and return it as a JSON object. Literally, on the first try, it writes some perfect evaluation code that grabs the elements with the proper query selectors, and then formats the data we requested as a JSON object. Let's copy and paste that code into the project, and then run the node script once again. Now we're in business. We just built our own custom API for trending products on Amazon, and we could apply the same technique to any other e-commerce store, like eBay, Walmart, etc. That's pretty cool, and if we wanted to extract even more data, we could also grab the link for each one of these products, then use the Scraping Browser to navigate there and extract even more data. 
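Putting those steps together, here is a rough, self-contained sketch of the Best Sellers scraper. The WebSocket endpoint string is a placeholder built from your own Scraping Browser credentials, and the selectors inside page.evaluate (the image alt text for the title, .a-color-price for the price) are illustrative guesses rather than the exact code ChatGPT produced, since Amazon's markup changes frequently and should be verified in DevTools.

```js
// Sketch of the whole flow: connect to the remote Scraping Browser, open the
// Best Sellers page, grab the carousel, and extract title/price as JSON.
const puppeteer = require('puppeteer-core');

// Placeholder: username and password separated by a colon, then @ and the host
// from the Bright Data access parameters.
const SBR_WS_ENDPOINT = 'wss://YOUR_USERNAME:YOUR_PASSWORD@YOUR_HOST';

async function run() {
  let browser;
  try {
    browser = await puppeteer.connect({ browserWSEndpoint: SBR_WS_ENDPOINT });

    const page = await browser.newPage();
    page.setDefaultNavigationTimeout(2 * 60 * 1000); // 2-minute timeout

    await page.goto('https://www.amazon.com/gp/bestsellers');

    // Wait for the product carousel, then log a manageable chunk of HTML.
    await page.waitForSelector('.a-carousel');
    const carousel = await page.$('.a-carousel');
    console.log(await carousel.evaluate((el) => el.innerHTML));

    // Extract each product's title and price as plain JSON.
    const products = await page.evaluate(() =>
      Array.from(document.querySelectorAll('.a-carousel li')).map((item) => ({
        title: item.querySelector('img')?.alt ?? null,                           // guess: title from image alt text
        price: item.querySelector('.a-color-price')?.textContent.trim() ?? null, // guess: price element
      }))
    );
    console.log(JSON.stringify(products, null, 2));
  } catch (err) {
    console.error(err);
  } finally {
    if (browser) await browser.close();
  }
}

run();
```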
We'd loop over each product and use the goto method to navigate to its URL, just like we did before. However, when doing this, I would recommend also implementing a delay of at least 2 seconds or so between pages, just so you're not overwhelming the server with requests. Now that we have all this wonderful data, the possibilities are endless. Like, for example, we could use GPT-4 to write advertisements that target different demographics for each one of these products. Or, we might want to store millions of products in a vector database, where they could be used to build a custom AI agent of some sort, like an Auto-GPT tool that can take this data and then build you an Amazon dropshipping business plan. The bottom line is that if you want to do cool stuff with AI, you're going to need data. But in many cases, the only way to get the data you need is through web scraping, and now you know how to do it in a safe and effective way. Thanks for watching, and I will see you in the next one.
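A sketch of what that follow-the-links loop might look like, assuming page is the Puppeteer page from the sketch above and productLinks is an array of product URLs pulled from the carousel (both names are illustrative):

```js
// Visit each product page in turn, pausing ~2 seconds between requests
// so we don't flood the server.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeProductPages(page, productLinks) {
  const details = [];
  for (const link of productLinks) {
    await page.goto(link);
    // Pull whatever extra data you need here (ratings, reviews, description, ...).
    details.push({ link, html: await page.content() });
    await sleep(2000); // polite delay between pages
  }
  return details;
}
```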
