Project Tutorial: Learn How To Build A Web Scraper In Python

Introduction

A web scraper is an application that extracts data from a website. Web scraping is also known as web harvesting or web data extraction. At a high level, it works by fetching a page's content, copying the data of interest, and storing it locally. It is used for a variety of purposes, such as web indexing, price monitoring, and site change detection. In this project, you will learn how to build a web scraping application for Wikipedia in Python.
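
The fetch, copy, and store steps described above can be sketched with nothing but the Python standard library. This is only an illustration and not the Scrapy-based approach the project teaches: the function names, the choice to collect <h2> headings, and the sample HTML are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class HeadingParser(HTMLParser):
    """Step 2: copy the text of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <i>) still belongs to the open <h2>.
        if self._in_h2:
            self.headings[-1] += data


def extract_headings(html):
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


def store(records, path):
    """Step 3: save the extracted data locally as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


# Made-up sample page so the example runs without network access.
SAMPLE_HTML = (
    "<html><body><h1>Title</h1><h2>History</h2><p>text</p>"
    "<h2>Usage <i>notes</i></h2></body></html>"
)

if __name__ == "__main__":
    print(extract_headings(SAMPLE_HTML))
```

In a real run, fetch() would supply the HTML instead of SAMPLE_HTML, and store() would write the result to a file of your choice. Scrapy takes care of all three steps, plus scheduling and retries, which is why the project uses it.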

Prerequisites

A basic understanding of Python, HTML and CSS is required for this project. You should be familiar with the Python syntax, the structure of an HTML page, and how CSS is used for styling. 

Outcome

After completing this project, you will:

  • Have a solid understanding of how to use Scrapy, a Python framework for web scraping.
  • Know how to scrape a Wikipedia page.
  • Have a better understanding of Python, HTML and CSS.
  • Understand how to debug common issues with web scraping.

Project curriculum 

Session 1: Setting up the Environment

  • In this video, we perform a full install of our editor, VS Code, in a Linux environment (Lubuntu).

Session 2: Scrapy Installation

  • This video will guide you through Scrapy installation with a demonstration in our Linux environment.

Session 3: Our first Scrapy project

  • Introduction to Scrapy. How to create your first Scrapy project.

Session 4: Extracting website data

  • In this video, you will scrape your first website data using a Scrapy spider.

Session 5: Scrapy shell

  • Here, you will learn how to use the Scrapy shell, an interactive console for trying out selectors.

Session 6: About web crawling / scraping

  • A short video discussing the legal aspects of web crawling and scraping.

Session 7: Creating a Scrapy spider

  • In this video, we create our first fully functional spider, which scrapes quotes from multiple web pages.

Session 8: Wiki Scraper Intro

  • In this short video, you will learn about the main project of this tutorial: the Wikipedia scraper.

Session 9.1: Scraping Wikipedia Part 1

  • In this session, we start scraping data from Wikipedia using our new Wikipedia spider.

Session 9.2: Scraping Wikipedia Part 2

  • In this session, we solve a few problems that came up in the previous session and test our new spider.

Session 10: Scraping Multiple pages

  • We take our app one step further and scrape multiple Wikipedia pages in a single run.

Session 11: Scrapy Items

  • We will polish our program by changing the way we store our items using Scrapy Items

Session 12: Scrapy selection using CSS

  • In this final part of the project, we create another spider that uses CSS selectors instead of XPath to scrape our web page elements.
Education Ecosystem Staff
About author

The Education Ecosystem Staff consists of various writers who are experienced in their fields. The staff strives to educate and inform others about what's happening in the technology industry.