Introduction
A web scraper is an application that extracts data from websites. Web scraping is also known as web harvesting or web data extraction. At a high level, it works by fetching a website's content, extracting the data of interest, and storing it in local storage. It is used for a variety of purposes, such as web indexing, price monitoring, site change detection and many more. In this project, you are going to learn how to build a web scraping application for Wikipedia in Python.
Prerequisites
A basic understanding of Python, HTML and CSS is required for this project. You should be familiar with the Python syntax, the structure of an HTML page, and how CSS is used for styling.
Outcome
After completing this project, you will have:
- A solid understanding of how to use Scrapy, a Python framework for web scraping.
- The ability to scrape Wikipedia pages.
- A better understanding of Python, HTML and CSS.
- The skills to debug common issues with web scraping.
Project curriculum
Session 1: Setting up the Environment
- In this video, we are going to perform a full installation of our editor, VS Code, in a Linux environment (Lubuntu).
Session 2: Scrapy Installation
- This video will guide you through Scrapy installation with a demonstration in our Linux environment.
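As a rough sketch of what this session covers, a typical pip-based Scrapy installation on a Debian-based distribution such as Lubuntu looks like the following (the exact package names and commands are assumptions and may differ on your system):

```shell
# Update package lists and install pip for Python 3 (Debian/Ubuntu/Lubuntu)
sudo apt update
sudo apt install python3-pip

# Install Scrapy from PyPI
pip3 install scrapy

# Verify the installation by printing the installed version
scrapy version
```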
Session 3: Our first Scrapy project
- Introduction to Scrapy and how to create your first Scrapy project.
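Creating a project is done with Scrapy's `startproject` command; the project name below is just an example, and the generated layout is sketched in the comments:

```shell
# Create a new Scrapy project named "wikiscraper" (name is an example)
scrapy startproject wikiscraper
cd wikiscraper

# The generated layout looks roughly like this:
# wikiscraper/
#     scrapy.cfg        # deploy configuration
#     wikiscraper/
#         items.py      # item definitions
#         settings.py   # project settings
#         spiders/      # your spiders live here
```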
Session 4: Extracting website data
- In this video, you will scrape your first website data using a Scrapy spider.
Session 5: Scrapy shell
- Here, you will learn how to use the very powerful Scrapy shell.
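The Scrapy shell is an interactive console for trying selectors against a live page before putting them in a spider. A session might look like this sketch (the URL and outputs are illustrative):

```shell
# Start the shell against a page (URL is an example)
scrapy shell "https://quotes.toscrape.com/"

# Inside the shell, a ready-made `response` object is available:
#   >>> response.status
#   >>> response.xpath("//title/text()").get()
#   >>> fetch("https://quotes.toscrape.com/page/2/")   # load another page
```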
Session 6: About web crawling / scraping
- A short video discussing the legal aspects of web crawling and scraping.
Session 7: Creating a Scrapy spider
- In this video, we are going to create our first fully functional spider, which will scrape quotes from multiple web pages.
Session 8: Wiki Scraper Intro
- In this short video, you will learn about the main project of this tutorial: the Wikipedia Scraper.
Session 9.1: Scraping Wikipedia Part 1
- In this session, we are going to start scraping data from Wikipedia using our new Wikipedia spider.
Session 9.2: Scraping Wikipedia Part 2
- In this session, we are going to solve a few problems that came up in the previous session and test our new spider.
Session 10: Scraping Multiple pages
- We are going to take our app one step further and scrape multiple Wikipedia pages at the same time.
Session 11: Scrapy Items
- We will polish our program by changing the way we store our items, using Scrapy Items.
Session 12: Scrapy selection using CSS
- In this final part of the project, we are going to create another spider that uses CSS selectors instead of XPath to scrape our web page elements.