Introduction
A web scraper is an application that extracts data from websites. Web scraping is also known as web harvesting or web data extraction. At a high level, it works by fetching a website's content, extracting the data of interest, and storing it in local storage. It is used for a variety of purposes, such as web indexing, price monitoring, site change detection and many more. In this project, you are going to learn how to build a web scraping application for Wikipedia in Python.
Prerequisites
A basic understanding of Python, HTML and CSS is required for this project. You should be familiar with the Python syntax, the structure of an HTML page, and how CSS is used for styling.
Outcome
After completing this project, you will have:
- A solid understanding of how to use Scrapy, a Python framework for web scraping.
- The ability to scrape Wikipedia pages.
- A better understanding of Python, HTML and CSS.
- The skills to debug common issues with web scraping.
Project curriculum
Session 1: Setting up the Environment
- In this video, we are going to perform a full installation of our editor, VS Code, in a Linux environment (Lubuntu).
Session 2: Scrapy Installation
- This video will guide you through Scrapy installation with a demonstration in our Linux environment.
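As a rough sketch of what this session covers, a typical pip-based Scrapy installation on a Debian-based distribution such as Lubuntu looks like the following (the exact package names and commands are assumptions and may differ on your system):

```shell
# Update package lists and install pip for Python 3 (Debian/Ubuntu/Lubuntu)
sudo apt update
sudo apt install python3-pip

# Install Scrapy from PyPI
pip3 install scrapy

# Verify the installation by printing the installed version
scrapy version
```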
Session 3: Our first Scrapy project
- Introduction to Scrapy and how to create your first Scrapy project.
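Creating a project is done with Scrapy's `startproject` command; the project name below is just an example, and the generated layout is sketched in the comments:

```shell
# Create a new Scrapy project named "wikiscraper" (name is an example)
scrapy startproject wikiscraper
cd wikiscraper

# The generated layout looks roughly like this:
# wikiscraper/
#     scrapy.cfg        # deploy configuration
#     wikiscraper/
#         items.py      # item definitions
#         settings.py   # project settings
#         spiders/      # your spiders live here
```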
Session 4: Extracting website data
- In this video, you will scrape your first website data using a Scrapy spider.
Session 5: Scrapy shell
- Here, you will learn how to use the very powerful Scrapy shell.
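The Scrapy shell is an interactive console for trying selectors against a live page before putting them in a spider. A session might look like this sketch (the URL and outputs are illustrative):

```shell
# Start the shell against a page (URL is an example)
scrapy shell "https://quotes.toscrape.com/"

# Inside the shell, a ready-made `response` object is available:
#   >>> response.status
#   >>> response.xpath("//title/text()").get()
#   >>> fetch("https://quotes.toscrape.com/page/2/")   # load another page
```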
Session 6: About web crawling / scraping
- A short video discussing the legal aspects of web crawling and scraping.
Session 7: Creating a Scrapy spider
- In this video, we are going to create our first fully functional spider, which will scrape quotes from multiple web pages.
Session 8: Wiki Scraper Intro
- In this short video, you will learn about the main project of this tutorial: the Wikipedia Scraper.
Session 9.1: Scraping Wikipedia Part 1
- In this session, we are going to start scraping data from Wikipedia using our new Wikipedia spider.
Session 9.2: Scraping Wikipedia Part 2
- In this session, we are going to solve a few problems that came up in the previous session and test our new spider.
Session 10: Scraping Multiple pages
- We are going to take our app one step further and scrape multiple Wikipedia pages at the same time.
Session 11: Scrapy Items
- We will polish our program by changing the way we store our items, using Scrapy Items.
Session 12: Scrapy selection using CSS
- In this final part of the project, we are going to create another spider that uses CSS selectors instead of XPath to scrape our web page elements.