Project Tutorial: Learn How To Build A Web Scraper In Python
Introduction
A web scraper is an application that extracts data from a website. Web scraping is also known as web harvesting or web data extraction. At a high level, it works by fetching a website's content, copying the parts you need, and storing them locally. It is used for a variety of purposes, such as web indexing, price monitoring, and site change detection. In this project, you are going to learn how to build a web scraping application for Wikipedia in Python.
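As a minimal illustration of that fetch-and-store idea (before we switch to Scrapy), the sketch below downloads a page with Python's standard library and saves the raw HTML to a local file. The URL and output filename are placeholders, not part of the project itself.

# Minimal fetch-and-store sketch using only the Python standard library.
# The URL and filename here are placeholders for illustration.
from urllib.request import urlopen

url = "https://en.wikipedia.org/wiki/Web_scraping"

with urlopen(url) as response:
    html = response.read().decode("utf-8")

# Store the fetched content locally for later processing.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)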
Prerequisites
A basic understanding of Python, HTML and CSS is required for this project. You should be familiar with the Python syntax, the structure of an HTML page, and how CSS is used for styling.
Outcome
After completing this project, you will have:
A solid understanding of how to use Scrapy, a Python framework for web scraping.
The ability to scrape Wikipedia pages.
A better understanding of Python, HTML, and CSS.
The skills to debug common issues in web scraping.
Project curriculum
Session 1: Setting up the Environment
In this video, we are going to perform a full installation of our editor, VS Code, in a Linux environment (Lubuntu).
Session 2: Scrapy Installation
This video will guide you through installing Scrapy, with a demonstration in our Linux environment.
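If you want to follow along outside the video, a typical installation on Linux looks roughly like this; the exact commands may vary with your distribution and Python setup:

# Create and activate a virtual environment, then install Scrapy with pip.
python3 -m venv venv
source venv/bin/activate
pip install scrapy
scrapy version   # verify the installation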
Session 3: Our first Scrapy project
An introduction to Scrapy and how to create your first Scrapy project.
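For reference, a new Scrapy project is created from the command line as shown below; the project name "wikiscraper" and the spider name/domain are just examples, not necessarily the ones used in the video.

# Create a new Scrapy project and generate a spider skeleton inside it.
scrapy startproject wikiscraper
cd wikiscraper
scrapy genspider example example.com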
Session 4: Extracting website data
In this video, you will scrape your first website data using a Scrapy spider.
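As a rough sketch of what such a spider looks like (the target URL and XPath expression are placeholders, not necessarily those used in the video):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract the page title text with an XPath selector.
        yield {"title": response.xpath("//title/text()").get()}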
Session 5: Scrapy shell
Here, you will learn how to use the very powerful Scrapy shell.
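The Scrapy shell lets you try selectors interactively before writing them into a spider. A typical session looks something like this; the URL and selectors are examples only:

# Open an interactive shell against a page, then test selectors on `response`.
scrapy shell "https://quotes.toscrape.com/"
>>> response.xpath("//title/text()").get()
>>> response.css("small.author::text").getall()
>>> fetch("https://quotes.toscrape.com/page/2/")   # load a different page in the same shell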
Session 6: About web crawling / scraping
A short video discussing the legal aspects of web crawling and scraping.
Session 7: Creating a Scrapy spider
In this video, we are going to create our first fully functional spider, which will scrape quotes from multiple web pages.
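A sketch of such a quotes spider, assuming the common quotes.toscrape.com practice site and following its "next page" link (the site and selectors are assumptions for illustration):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each quote on the page sits inside a div with class "quote".
        for quote in response.xpath("//div[@class='quote']"):
            yield {
                "text": quote.xpath("span[@class='text']/text()").get(),
                "author": quote.xpath(".//small[@class='author']/text()").get(),
            }
        # Follow the link to the next page, if there is one.
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)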
Session 8: Wiki Scraper Intro
In this short video, you will learn about the main project of this tutorial: the Wikipedia scraper.
Session 9.1: Scraping Wikipedia Part 1
In this session, we are going to start scraping data from Wikipedia using our new Wikipedia spider.
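A rough sketch of what a first Wikipedia spider might look like; the target article and XPath expressions are assumptions for illustration, not necessarily the ones built in the video:

import scrapy

class WikiSpider(scrapy.Spider):
    name = "wiki"
    start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]

    def parse(self, response):
        # Extract the article title and its section headings as a first pass.
        yield {
            "title": response.xpath("//h1[@id='firstHeading']//text()").get(),
            "headings": response.xpath("//div[@id='bodyContent']//h2//text()").getall(),
        }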
Session 9.2: Scraping Wikipedia Part 2
In this session, we are going to solve a few problems that came up in the previous session and test our new spider.
Session 10: Scraping Multiple pages
We are going to take our app one step further and scrape multiple Wikipedia pages at the same time.
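One common way to reach multiple pages is to list several start URLs and follow links found on each page. The sketch below assumes example article URLs and an arbitrary limit of five followed links per page; it is an illustration, not the exact spider from the video:

import scrapy

class MultiWikiSpider(scrapy.Spider):
    name = "multiwiki"
    start_urls = [
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://en.wikipedia.org/wiki/Web_crawler",
    ]

    def parse(self, response):
        yield {
            "url": response.url,
            "title": response.xpath("//h1[@id='firstHeading']//text()").get(),
        }
        # Follow a handful of internal article links to scrape further pages.
        links = response.xpath(
            "//div[@id='bodyContent']//a[starts-with(@href, '/wiki/')]/@href"
        ).getall()
        for href in links[:5]:
            yield response.follow(href, callback=self.parse)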
Session 11: Scrapy Items
We will polish our program by changing the way we store our items, using Scrapy Items.
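For reference, a Scrapy Item is declared with Field objects and then populated in the spider. A minimal sketch, with example field names:

import scrapy

class WikiItem(scrapy.Item):
    # Declare the fields we want to store for each scraped page.
    title = scrapy.Field()
    url = scrapy.Field()

# Inside a spider's parse() method, the item is filled and yielded:
#     item = WikiItem()
#     item["title"] = response.xpath("//h1/text()").get()
#     item["url"] = response.url
#     yield item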
Session 12: Scrapy selection using CSS
In this final part of the project, we are going to create another spider that uses CSS selectors instead of XPath to scrape elements from our web page.
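The same kind of extraction written with CSS selectors instead of XPath looks roughly like this; the selectors are illustrative:

import scrapy

class CssWikiSpider(scrapy.Spider):
    name = "css_wiki"
    start_urls = ["https://en.wikipedia.org/wiki/Web_scraping"]

    def parse(self, response):
        # response.css() accepts CSS selectors; ::text extracts text nodes.
        yield {
            "title": response.css("h1#firstHeading ::text").get(),
            "headings": response.css("div#bodyContent h2 ::text").getall(),
        }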
Education Ecosystem Staff
The Education Ecosystem Staff consists of writers who are experienced in their fields. The staff strives to educate and inform others about what's happening in the technology industry.