Web Scraping Steps

Posted on 30-04-2021 by admin

Web Scraping Using Python Step By Step Tutorial Written by Ashwin Joy. In this tutorial, we are going to do web scraping step by step using Python’s Beautiful Soup library. Python 3 makes web scraping quick to set up, and Beautiful Soup, a third-party library, provides a convenient framework for parsing pages. The first step of web scraping is to decide which web page, and which table on it, we want to scrape. As I am a research scientist, I would like to give an example of web scraping from a list of countries by cancer rate from Wikipedia.

Data scraping is the technique of extracting desired information from an HTML web page into a file on your local machine. Typically, that local file is a spreadsheet or a document, such as an Excel or Word file. Web scraping opens up opportunities and gives us the tools needed to actually create data sets when we can't find the data we're looking for. And since we’re using R to do the web scraping, we can simply run our code again to get an updated data set if the sites we use get updated.

A growing share of business activity and of our lives is spent online, which has led to an increase in the amount of publicly available data. Web scraping allows you to tap into this public information with the help of web scrapers.

In the first part of this guide to basics of web scraping you will learn –

What is web scraping?
Web scraping use cases
Types of web scrapers
How does a web scraper work?
Difference between a web scraper and web crawler
Is web scraping legal?

What is web scraping?

Web scraping automates the process of extracting data from a website or multiple websites. Web scraping or data extraction helps convert unstructured data from the internet into a structured format allowing companies to gain valuable insights. This scraped data can be downloaded as a CSV, JSON, or XML file.

Web scraping (also called data scraping, data extraction, or web data extraction) helps transform content on the Internet into structured data that can be consumed by other computers and applications. The scraped data can help users or businesses gather insights that would otherwise be expensive and time-consuming to obtain.

Since the basic idea of web scraping is automating a task, it can be used to create web scraping APIs and Robotic Process Automation (RPA) solutions. Web scraping APIs allow you to stream scraped website data easily into your applications. This is especially useful in cases where a website does not have an API or has a rate/volume-limited API.

Uses of Web Scraping

People use web scrapers to automate all sorts of scenarios. Web scrapers have a variety of uses in the enterprise. We have listed a few below:

Price Monitoring – Product data influences eCommerce monitoring, product development, and investing. Extracting product data such as pricing, inventory levels, reviews, and more from eCommerce websites can help you create a better product strategy.
Marketing and Lead Generation – As a business, to reach out to customers and generate sales, you need qualified leads. That means getting details of companies, addresses, contacts, and other necessary information. Publicly available information like this is valuable. Web scraping can enhance the productivity of your research methods and save you time.
Location Intelligence – The transformation of geospatial data into strategic insights can solve a variety of business challenges. By interpreting rich data sets visually, you can conceptualize the factors that affect businesses in various locations and optimize your business process, promotion, and valuation of assets.
News and Social Media – Social media and news data tell you how your viewers engage with, share, and perceive your content. When you collect this information through web scraping, you can optimize your social content, update your SEO, monitor competitor brands, and identify influential customers.
Real Estate – The real estate industry has myriad opportunities. Including web scraped data in your business can help you identify real estate opportunities, find emerging markets, and analyze your assets.

How to get started with web scraping

There are many ways to get started with web scraping. Writing code from scratch is fine for smaller data scraping needs. But beyond that, if you need to scrape a few different types of web pages and thousands of data fields, you will need a web scraping service that is able to scrape multiple websites easily on a large scale.

Custom Web Scraping Services

Many companies build their own web scraping departments, but other companies use web scraping services. While it may seem to make sense to start an in-house web scraping solution, the time and cost involved can outweigh the benefits. Hiring a custom web scraping service lets you concentrate on your projects.

Web scraping companies such as ScrapeHero have the technology and scalability to handle web scraping tasks that are complex and massive in scale (think millions of pages). You need not worry about setting up and running scrapers, handling CAPTCHAs, rotating proxies, or dealing with the other tactics websites use to block web scraping.

Web Scraping Tools and Software

Point and click web scraping tools have a visual interface, where you can annotate the data you need, and it automatically builds a web scraper with those instructions. Web Scraping tools (free or paid) and self-service applications can be a good choice if the data requirement is small, and the source websites aren’t complicated.

ScrapeHero Cloud has pre-built scrapers that, in addition to scraping search engine data, can scrape job data, real estate data, social media, and more. These scrapers are easy to use and cloud-based, so you need not worry about selecting the fields to be scraped or downloading any software. The scraper and the data can be accessed from any browser at any time, and the data can be delivered directly to Dropbox.

Scraping Data Yourself

You can build web scrapers in almost any programming language. It is easier with scripting languages such as Javascript (Node.js), PHP, Perl, Ruby, or Python. If you are a developer, open-source web scraping tools can also help you with your projects. If you are new to web scraping, tutorials and guides can help you get started.

If you don't like or want to code, ScrapeHero Cloud is just right for you!

How does a web scraper work

A web scraper is a software program or script that downloads the contents (usually text-based and formatted as HTML) of multiple web pages and then extracts data from them.

Web scrapers are more complicated than this simplistic representation. They have multiple modules that perform different functions.

What are the components of a web scraper

Web scraping is like any other Extract-Transform-Load (ETL) process. Web scrapers crawl websites, extract data from them, transform it into a usable structured format, and load it into a file or database for subsequent use.

A typical web scraper has the following components:

1. Crawl

First, we start at the data source and decide which data fields we need to extract. For that, we have web crawlers that crawl the website and visit the links that we want to extract data from (e.g., the crawler will start at https://scrapehero.com and crawl the site by following links on the home page).

The goal of a web crawler is to learn what is on each web page, so that the information can be retrieved when it is needed. A crawl can be limited to the links it finds on a particular site, or it can cover the whole web (just like the Google search engine does).
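The link-following behaviour described above can be sketched in a few lines of Javascript. This is only an illustration under stated assumptions: the `pages` object is a made-up in-memory site standing in for real HTTP downloads, and `extractLinks` and `crawl` are hypothetical helper names, not part of any library.

```javascript
// Hypothetical in-memory "website": path -> HTML. A real crawler would
// download each page with an HTTP client instead of reading from this map.
const pages = {
  '/': '<a href="/products">Products</a> <a href="/about">About</a>',
  '/products': '<a href="/products/1">Item 1</a> <a href="/">Home</a>',
  '/products/1': '<p>Item 1 costs $10</p>',
  '/about': '<a href="/">Home</a>',
};

// Pull every href value out of an HTML string. Good enough for this sketch;
// a real crawler would use an HTML parser.
function extractLinks(html) {
  return [...html.matchAll(/href="([^"]+)"/g)].map((match) => match[1]);
}

// Breadth-first crawl: visit the start page, queue the links it contains,
// and skip anything already seen so that link loops terminate.
function crawl(startPath, site) {
  const visited = new Set([startPath]);
  const queue = [startPath];
  while (queue.length > 0) {
    const path = queue.shift();
    const html = site[path];
    if (html === undefined) continue; // unknown page: nothing to follow
    for (const link of extractLinks(html)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push(link);
      }
    }
  }
  return [...visited];
}

console.log(crawl('/', pages));
// [ '/', '/products', '/about', '/products/1' ]
```

The `visited` set is what keeps the crawler from looping forever between pages that link to each other, such as the home page and the about page here.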

2. Parse and Extract

Extracting data is the process of taking the raw scraped data that is in HTML format and extracting and parsing the meaningful data elements. In some cases extracting data may be simple such as getting the product details from a web page or it can get more difficult such as retrieving the right information from complex documents.

You can use data extractors and parsers to extract the information you need. There are different kinds of parsing techniques: Regular Expression, HTML Parsing, DOM Parsing (using a headless browser), or Automatic Extraction using AI.

3. Format

Now the extracted data needs to be formatted into a human-readable form. This can be a simple data format such as CSV, JSON, or XML. You can store the data depending on the specification of your data project.

The data extracted using a parser won’t always be in the format that is suitable for immediate use. Most of the extracted datasets need some form of “cleaning” or “transformation.” Regular expressions, string manipulation, and search methods are used to perform this cleaning and transformation.
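As a rough sketch of the cleaning and formatting step, the snippet below tidies two made-up records and writes them out as JSON and CSV. The sample values and the helper names `cleanRecord`, `toCsvField`, and `toCsv` are assumptions for illustration only.

```javascript
// Hypothetical raw values as they might come out of a parser.
const rawRecords = [
  { name: '  Blue Widget \n', price: '$1,299.00' },
  { name: 'Red "Deluxe" Widget', price: ' $15.50 ' },
];

// Clean one record: collapse and trim whitespace in the name, and strip
// everything except digits and the decimal point from the price.
function cleanRecord(record) {
  return {
    name: record.name.replace(/\s+/g, ' ').trim(),
    price: Number(record.price.replace(/[^0-9.]/g, '')),
  };
}

// Quote a CSV field when it contains a comma, a double quote, or a newline.
function toCsvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build a CSV string: a header row from the keys, then one row per record.
function toCsv(records) {
  const header = Object.keys(records[0]);
  const rows = records.map((record) =>
    header.map((key) => toCsvField(record[key])).join(',')
  );
  return [header.join(','), ...rows].join('\n');
}

const cleaned = rawRecords.map(cleanRecord);
console.log(JSON.stringify(cleaned));
console.log(toCsv(cleaned));
```

Doubling the quote characters inside a quoted field is the usual CSV convention, which is why the second product name survives a round trip through a spreadsheet.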

4. Store and Serialize Data

After the data has been scraped, extracted, and formatted you can finally store and export the data. Once you get the cleaned data, it needs to be serialized according to the data models that you require. Choosing an export method largely depends on how large your data files are and what data exports are preferred within your company.

This is the final module that will output data in a standard format that can be stored in Databases using ETL tools (Check out our guide on ETL Tools), JSON/CSV files, or data delivery methods such as Amazon S3, Azure Storage, and Dropbox.

ScrapeHero crawls, parses, formats, stores and delivers the data for no additional charge.

Web Crawling vs. Web Scraping

People often use Web Scraping and Web Crawling interchangeably. Although the underlying concept is to extract data from the web, they are different.

Web Crawling mostly refers to downloading and storing the contents of a large number of websites by following links in web pages. A web crawler is a standalone bot that scans the internet, searching for and indexing content. In general, the term ‘crawler’ implies the ability to navigate pages on its own. Crawlers are the backbones of search engines like Google, Bing, Yahoo, etc.

A Web scraper is built specifically to handle the structure of a particular website. The scraper then uses this site-specific structure to extract individual data elements from the website. Unlike a web crawler, a web scraper extracts specific information such as pricing data, stock market data, business leads, etc.

Is web scraping legal?

Although web scraping is a powerful technique in collecting large data sets, it is controversial and may raise legal questions related to copyright and terms of service. Most times a web scraper is free to copy a piece of data from a web page without any copyright infringement. This is because it is difficult to prove copyright over such data since only a specific arrangement or a particular selection of the data is legally protected.

Legality depends on the legal jurisdiction (i.e., laws are country and locality specific). Gathering or scraping publicly available information is generally not illegal. If it were, search engines such as Google, which crawl public web pages at a very large scale, could not exist as companies.

Terms of Service

Although most web applications and companies include some form of TOS agreement, it lies within a gray area. For instance, the owner of a web scraper that violates the TOS may argue that he or she never saw or officially agreed to the TOS.

Some forms of web scraping can be illegal, such as scraping non-public data or disclosed data. Non-public data is something that isn’t reachable or open to the public. An example of this would be the stealing of intellectual property.

Ethical Web Scraping

If a web scraper sends requests too frequently, the website may block it. The scraper may be refused entry, and its owner may be liable for damages because the owner of the web application has a property interest. An ethical scraping tool or a professional web scraping service will avoid this issue by maintaining a reasonable request frequency. We talk in other guides about how you can make your scraper more “polite” so that it doesn’t get you into trouble.
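One simple way to keep the request frequency reasonable is to fetch pages one at a time with a pause in between. The sketch below assumes a caller-supplied `fetchPage` function (any function that takes a URL and returns a promise); `sleep` and `fetchPolitely` are hypothetical names, and the one-second default delay is an arbitrary choice rather than a recommendation from any particular site.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs sequentially, pausing `delayMs` between consecutive requests.
// `fetchPage` is supplied by the caller and must return a promise.
async function fetchPolitely(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs); // wait before every request except the first
    }
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

Because each request is awaited before the next one starts, the target server never sees more than one request from this scraper at a time.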

What’s next?

Let’s do something hands-on before we get into web page structures and XPaths. We will make a very simple scraper to scrape Reddit’s top pages and extract the title and URLs of the links shared.

Check out part 2 and 3 of this post in the link here – A beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup

Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data – Navigating and Scraping Data from Reddit

Javascript has become one of the most popular and widely used languages due to the massive improvements it has seen and the introduction of the runtime known as NodeJS. Whether it's a web or mobile application, Javascript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to efficiently scrape the web to meet most of your requirements.

Prerequisites

This post is primarily aimed at developers who have some level of experience with Javascript. However, if you have a firm understanding of Web Scraping but have no experience with Javascript, this post could still prove useful. Below are the recommended prerequisites for this article:

✅ Experience with Javascript
✅ Experience using DevTools to extract selectors of elements
✅ Some experience with ES6 Javascript (Optional)

⭐ Make sure to check out the resources at the end of this article to learn more!

Outcomes

After reading this post, you will be able to:

Have a functional understanding of NodeJS
Use multiple HTTP clients to assist in the web scraping process
Use multiple modern and battle-tested libraries to scrape the web

Understanding NodeJS: A brief introduction

Javascript is a simple and modern language that was initially created to add dynamic behavior to websites inside the browser. When a website is loaded, Javascript is run by the browser's Javascript Engine and converted into a bunch of code that the computer can understand.

For Javascript to interact with your browser, the browser provides a Runtime Environment (document, window, etc.).

This means that Javascript is not the kind of programming language that can interact with or manipulate the computer or its resources directly. Servers, on the other hand, are capable of directly interacting with the computer and its resources, which allows them to read files or store records in a database.

When introducing NodeJS, the crux of the idea was to make Javascript capable of running not only client-side but also server-side. To make this possible, Ryan Dahl, a skilled developer, took Google Chrome's V8 Javascript Engine and embedded it in a C++ program named Node.

So, NodeJS is a runtime environment that allows an application written in Javascript to be run on a server as well.

As opposed to how most languages, including C and C++, deal with concurrency, which is by employing multiple threads, NodeJS makes use of a single main thread and utilizes it to perform tasks in a non-blocking manner with the help of the Event Loop.
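A small experiment makes the non-blocking behaviour visible. In this sketch (the `events` array is just a hypothetical log for illustration), the timer and promise callbacks are handed to the Event Loop and only run after the synchronous code has finished:

```javascript
const events = [];

events.push('start');

// Timer callbacks are queued and run by the event loop after the current
// synchronous code has finished, even with a delay of 0 ms.
setTimeout(() => events.push('timer fired'), 0);

// Promise callbacks (microtasks) also wait for the synchronous code to
// finish, and they run before timer callbacks.
Promise.resolve().then(() => events.push('promise resolved'));

events.push('end');

// Print the recorded order once everything above has run.
setTimeout(() => console.log(events), 10);
// [ 'start', 'end', 'promise resolved', 'timer fired' ]
```

The main thread never waits on the timer. It keeps going, records "end", and the queued callbacks run afterwards.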

Putting up a simple web server is fairly simple as shown below:
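A minimal sketch of such a server, using only the built-in http module, might look like this. The port 3000 and the “Hello World” text follow the description in this section; the names `PORT` and `handleRequest` are chosen here for illustration.

```javascript
const http = require('http');

const PORT = 3000;

// Answer every request with a plain-text "Hello World".
function handleRequest(req, res) {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
}

const server = http.createServer(handleRequest);

// Start accepting connections and log the address once the server is ready.
server.listen(PORT, () => {
  console.log(`Server running at http://localhost:${PORT}/`);
});
```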

If you have NodeJS installed and you run the above code by typing node <YourFileNameHere>.js (without the < and >), then open your browser and navigate to localhost:3000, you will see some text saying “Hello World”. NodeJS is ideal for applications that are I/O intensive.

HTTP clients: querying the web

HTTP clients are tools capable of sending a request to a server and then receiving a response from it. Almost every tool that will be discussed in this article uses an HTTP client under the hood to query the server of the website that you will attempt to scrape.

Request

Request is one of the most widely used HTTP clients in the Javascript ecosystem. However, the author of the Request library has officially declared it deprecated. This does not mean it is unusable. Quite a lot of libraries still depend on it, and it still works.

It is fairly simple to make an HTTP request with Request:

You can find the Request library at GitHub, and installing it is as simple as running npm install request.

You can also find the deprecation notice and what this means here. If you don't feel safe about the fact that this library is deprecated, there are other options down below!

Axios

Axios is a promise-based HTTP client that runs both in the browser and NodeJS. If you use TypeScript, then Axios has you covered with built-in types.

Making an HTTP request with Axios is straightforward. It ships with promise support by default, as opposed to the callbacks used in Request:

If you fancy the async/await syntax sugar for the promise API, you can do that too. But since top-level await is available only in ES modules and not in CommonJS scripts, we will make use of an async function instead:

All you have to do is call getForum! You can find the Axios library at GitHub, and installing Axios is as simple as npm install axios.
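As a minimal sketch of the two styles described above, assuming Axios has been installed with npm install axios: `getForumWithPromises` is a hypothetical name, and the optional `client` parameter is an addition made here so that the functions can be tried with a stand-in object that has the same get() method.

```javascript
// Promise style: then() receives the response, catch() receives any error.
// `client` defaults to the axios package and is only loaded when needed.
function getForumWithPromises(url, client = require('axios')) {
  return client
    .get(url)
    .then((response) => response.data)
    .catch((error) => {
      console.error('Request failed:', error.message);
      return null;
    });
}

// async/await style: the same request written as an async function.
async function getForum(url, client = require('axios')) {
  try {
    const response = await client.get(url);
    return response.data; // the response body
  } catch (error) {
    console.error('Request failed:', error.message);
    return null;
  }
}
```

Calling getForum('https://example.com/').then(console.log) would print the response body, where the URL is only a placeholder. Both functions resolve to null instead of throwing when the request fails.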

SuperAgent

Much like Axios, SuperAgent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but SuperAgent has more dependencies and is less popular.

Regardless, making an HTTP request with SuperAgent using promises, async/await, or callbacks looks like this:

You can find the SuperAgent library at GitHub, and installing SuperAgent is as simple as npm install superagent.

For the upcoming few web scraping tools, Axios will be used as the HTTP client.

Note that there are other great HTTP clients for web scraping, like node-fetch!

Regular expressions: the hard way

The simplest way to get started with web scraping without any dependencies is to use a bunch of regular expressions on the HTML string that you fetch using an HTTP client. But there is a big tradeoff. Regular expressions aren't as flexible and both professionals and amateurs struggle with writing them correctly.

For complex web scraping, the regular expression can also get out of hand. With that said, let's give it a go. Say there's a label with some username in it, and we want the username. This is similar to what you'd have to do if you relied on regular expressions:

In Javascript, match() called with a regular expression that has no g flag returns an array whose first element is the full match, followed by one element per capture group (or null when nothing matches). In the second element (at index 1), you will find the textContent or the innerHTML of the <label> tag, which is what we want. But this result contains some unwanted text (“Username: ”), which has to be removed.
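The steps just described can be sketched as follows. The sample HTML string and the name in it are made up for illustration.

```javascript
const htmlString = '<label>Username: John Doe</label>';

// Without the g flag, match() returns [fullMatch, ...captureGroups] or null.
const result = htmlString.match(/<label>(.+)<\/label>/);

// result[1] is the first capture group: the text inside the <label> tag.
const labelText = result[1];

// Remove the "Username: " prefix to keep only the name.
const username = labelText.split(': ')[1];

console.log(labelText); // Username: John Doe
console.log(username); // John Doe
```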

As you can see, for a very simple use case the steps and the work to be done are unnecessarily high. This is why you should rely on something like an HTML parser, which we will talk about next.

Cheerio: Core jQuery for traversing the DOM

Cheerio is an efficient and light library that allows you to use the rich and powerful API of jQuery on the server-side. If you have used jQuery previously, you will feel right at home with Cheerio. It removes all of the DOM inconsistencies and browser-related features and exposes an efficient API to parse and manipulate the DOM.

As you can see, using Cheerio is similar to how you'd use jQuery.

However, it does not work the same way that a web browser works, which means it does not:

Render any of the parsed or manipulated DOM elements
Apply CSS or load any external resource
Execute Javascript

So, if the website or web application that you are trying to crawl is Javascript-heavy (for example a Single Page Application), Cheerio is not your best bet. You might have to rely on other options mentioned later in this article.

To demonstrate the power of Cheerio, we will attempt to crawl the r/programming forum in Reddit and get a list of post names.

First, install Cheerio and axios by running the following command: npm install cheerio axios.

Then create a new file called crawler.js, and copy/paste the following code:

getPostTitles() is an asynchronous function that will crawl Reddit's old r/programming forum. First, the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library. Then the HTML data is fed into Cheerio using the cheerio.load() function.

With the help of the browser Dev-Tools, you can obtain the selector that is capable of targeting all of the postcards. If you've used jQuery, the $('div > p.title > a') is probably familiar. This will get all the posts. Since you only want the title of each post individually, you have to loop through each post. This is done with the help of the each() function.

To extract the text out of each title, you must fetch the DOM element with the help of Cheerio (el refers to the current element). Then, calling text() on each element will give you the text.

Now, you can pop open a terminal and run node crawler.js. You'll then see an array of about 25 or 26 different post titles (it'll be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.

If your use case requires the execution of Javascript and loading of external sources, the following few options will be helpful.

JSDOM: the DOM for Node

JSDOM is a pure Javascript implementation of the Document Object Model to be used in NodeJS. As mentioned previously, the DOM is not available to Node, so JSDOM is the closest you can get. It more or less emulates the browser.

Once a DOM is created, it is possible to interact with the web application or website you want to crawl programmatically, so something like clicking on a button is possible. If you are familiar with manipulating the DOM, using JSDOM will be straightforward.

As you can see, JSDOM creates a DOM. Then you can manipulate this DOM with the same methods and properties you would use while manipulating the browser DOM.

To demonstrate how you could use JSDOM to interact with a website, we will get the first post of the Reddit r/programming forum and upvote it. Then, we will verify if the post has been upvoted.

Start by running the following command to install JSDOM and Axios: npm install jsdom axios

Then, make a file named crawler.js and copy/paste the following code:

upvoteFirstPost() is an asynchronous function that will obtain the first post in r/programming and upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding the HTML that was fetched earlier.

The JSDOM constructor accepts the HTML as the first argument and the options as the second. The two options that have been added perform the following functions:

runScripts: When set to “dangerously”, it allows the execution of event handlers and any Javascript code. If you do not have a clear idea of the credibility of the scripts that your application will run, it is best to set runScripts to “outside-only”, which attaches the globals defined by the Javascript specification to the window object so that you can run scripts from the outside, while scripts embedded in the page are not executed.
resources: When set to “usable”, it allows the loading of any external script declared using the <script> tag (e.g., the jQuery library fetched from a CDN).

Once the DOM has been created, you can use the same DOM methods to get the first post's upvote button and then click on it. To verify if it has been clicked, you could check the classList for a class called upmod. If this class exists in classList, a message is returned.

Now, you can pop open a terminal and run node crawler.js. You'll then see a neat string that will tell you if the post has been upvoted. While this example use case is trivial, you could build on top of it to create something powerful (for example, a bot that goes around upvoting a particular user's posts).

If you dislike the lack of expressiveness in JSDOM and your crawling relies heavily on such manipulations or if there is a need to recreate many different DOMs, the following options will be a better match.

Puppeteer: the headless browser

Puppeteer, as the name implies, allows you to manipulate the browser programmatically, just like how a puppet would be manipulated by its puppeteer. It achieves this by providing a developer with a high-level API to control a headless version of Chrome by default and can be configured to run non-headless.

Puppeteer is more useful than the aforementioned tools when you need to crawl the web as if a real person were interacting with a browser. This opens up a few possibilities that weren't there before:

You can get screenshots or generate PDFs of pages.
You can crawl a Single Page Application and generate pre-rendered content.
You can automate many different user interactions, like keyboard inputs, form submissions, navigation, etc.

It could also play a big role in many other tasks outside the scope of web crawling, like UI testing and performance optimization.

Quite often, you will probably want to take screenshots of websites or get to know a competitor's product catalog. Puppeteer can be used to do this. To start, install Puppeteer by running the following command: npm install puppeteer

This will download a bundled version of Chromium which takes up about 180 to 300 MB, depending on your operating system. If you wish to disable this and point Puppeteer to an already downloaded version of Chromium, you must set a few environment variables.

This, however, is not recommended. If you truly wish to avoid downloading Chromium and Puppeteer for this tutorial, you can rely on the Puppeteer playground.

Let's attempt to get a screenshot and PDF of the r/programming forum in Reddit. Create a new file called crawler.js, and copy/paste the following code:

getVisual() is an asynchronous function that will take a screenshot and PDF of the value assigned to the URL variable. To start, an instance of the browser is created by running puppeteer.launch(). Then, a new page is created. This page can be thought of like a tab in a regular browser. Then, by calling page.goto() with the URL as the parameter, the page that was created earlier is directed to the URL specified.

Once that is done and the page has finished loading, a screenshot and PDF are taken using page.screenshot() and page.pdf() respectively. You could also listen to the Javascript load event and then perform these actions, which is highly recommended at the production level. Finally, the browser instance is destroyed along with the page.

When you run the code by typing node crawler.js into the terminal, you will notice after a few seconds that two files named screenshot.jpg and page.pdf have been created.
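The sequence of calls described in this section can be sketched as below, assuming Puppeteer has been installed with npm install puppeteer. The optional `puppeteer` parameter and the try/finally block are additions made here: the first lets the function be exercised with a stand-in object, and the second makes sure the browser is closed even if a step fails.

```javascript
// Take a screenshot and a PDF of `url`.
// Assumption: `puppeteer` defaults to the installed puppeteer package; it is
// a parameter only so that a stub with the same methods can be passed in.
async function getVisual(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage(); // comparable to opening a new tab
    await page.goto(url); // resolves once the page has loaded
    await page.screenshot({ path: 'screenshot.jpg' });
    await page.pdf({ path: 'page.pdf' });
  } finally {
    // Destroy the browser instance (and its pages) in every case.
    await browser.close();
  }
}
```

In crawler.js you would then call getVisual() with the address of the r/programming forum, for example getVisual('https://www.reddit.com/r/programming/'), where the exact URL is an assumption.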

Also, we've written a complete guide on how to download a file with Puppeteer. You should check it out!

Nightmare: an alternative to Puppeteer

Nightmare is another high-level browser automation library like Puppeteer. It uses Electron and is said to be roughly twice as fast as its predecessor PhantomJS, as well as more modern.

If you dislike Puppeteer or feel discouraged by the size of the Chromium bundle, Nightmare is an ideal choice. To start, install the Nightmare library by running the following command: npm install nightmare

Once Nightmare has been downloaded, we will use it to find ScrapingBee's website through a Google search. To do so, create a file called crawler.js and copy/paste the following code into it:

First, a Nightmare instance is created. Then, this instance is directed to the Google search engine by calling goto(). Once the page has loaded, the search box is fetched using its selector. Then the value of the search box (an input tag) is changed to “ScrapingBee”.

After this is finished, the search form is submitted by clicking on the “Google Search” button. Then, Nightmare is told to wait until the first link has loaded. Once it has loaded, a DOM method will be used to fetch the value of the href attribute of the anchor tag that contains the link.

Finally, once everything is complete, the link is printed to the console. To run the code, type in node crawler.js to your terminal.

Summary

That was a long read! But now you understand the different ways to use NodeJS and its rich ecosystem of libraries to crawl the web in any way you want. To wrap up, you learned:

✅ NodeJS is a Javascript runtime that allows Javascript to be run server-side. It has a non-blocking nature thanks to the Event Loop.
✅ HTTP clients such as Axios, SuperAgent, Node fetch and Request are used to send HTTP requests to a server and receive a response.
✅ Cheerio abstracts the best out of jQuery for the sole purpose of running it server-side for web crawling but does not execute Javascript code.
✅ JSDOM creates a DOM per the standard Javascript specification out of an HTML string and allows you to perform DOM manipulations on it.
✅ Puppeteer and Nightmare are high-level browser automation libraries that allow you to programmatically manipulate web applications as if a real person were interacting with them.

While this article tackles the main aspects of web scraping with NodeJS, it does not talk about web scraping without getting blocked.

If you want to learn how to avoid getting blocked, read our complete guide, and if you don't want to deal with this, you can always use our web scraping API.