Scrapy, recommended in this issue, is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
Framework example
Scrapy is an application framework for scraping websites and extracting structured data that can be used for a variety of useful applications such as data mining, information processing, or historical archiving.
Below is the code of a spider that scrapes quotes from http://quotes.toscrape.com, following the pagination:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Put it in a text file, name it something like quotes_spider.py and run it with the following runspider command:
scrapy runspider quotes_spider.py -o quotes.jl
When it finishes, you will have a list of quotes in JSON Lines format in the quotes.jl file, containing the text and the author, as follows:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
Spider middleware
The spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where custom functionality can be plugged in to process the responses sent to spiders as well as the requests and items that spiders generate.
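As a minimal sketch, a spider middleware is just a class with some of these hook methods; the class name, logging, and settings path below are illustrative assumptions, not part of Scrapy itself, and the middleware is enabled through the SPIDER_MIDDLEWARES setting:

import logging

logger = logging.getLogger(__name__)


class LoggingSpiderMiddleware:
    # Hypothetical middleware: logs each response handed to the spider and
    # passes through whatever the spider yields, unchanged.
    def process_spider_input(self, response, spider):
        # Called for each response going into the spider; None means continue.
        logger.debug('Response received: %s', response.url)
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the items and requests the spider callback yielded.
        for item_or_request in result:
            yield item_or_request


# settings.py (the module path and order number are examples):
# SPIDER_MIDDLEWARES = {'myproject.middlewares.LoggingSpiderMiddleware': 543}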
Architecture overview
Data flow and order of execution:
- The engine gets the initial requests to crawl from the spider.
- The engine schedules the requests in the scheduler and asks for the next requests to crawl.
- The scheduler returns the next requests to the engine.
- The engine sends the requests to the downloader, passing through the downloader middleware (a short middleware sketch follows this list).
- After the page completes downloading, the downloader generates a response (with the page) and sends it to the engine, via the downloader middleware.
- The engine receives the response from the downloader and sends it to the spider for processing, via the spider middleware.
- The spider processes the response and returns scraped items and new requests (to follow) to the engine, via the spider middleware.
- The engine sends the processed items to the item pipelines, then sends the processed requests to the scheduler and asks for possible next requests to crawl.
- This process repeats (starting with step 1) until there are no more requests from the Scheduler.
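To make the downloader middleware hooks in the flow above concrete, here is a minimal, illustrative sketch (the class name, header, and settings path are assumptions); it sits between the engine and the downloader and is enabled through the DOWNLOADER_MIDDLEWARES setting:

class CustomHeaderMiddleware:
    # Hypothetical downloader middleware: sets a default header on the way
    # to the downloader and passes responses back to the engine unchanged.
    def process_request(self, request, spider):
        request.headers.setdefault('X-Example', 'demo')
        return None  # continue through the remaining middlewares

    def process_response(self, request, response, spider):
        return response  # must return a Response (or a Request to retry)


# settings.py (the module path and order number are examples):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.CustomHeaderMiddleware': 543}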
Installation guide
Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy implementation (PyPy 7.2.0+).
Installing Scrapy
If you are using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has the latest packages for Linux, Windows, and macOS.
To install Scrapy using conda, run:
conda install -c conda-forge scrapy
Alternatively, if you are already familiar with installing Python packages, you can install Scrapy and its dependencies from PyPI using the following command:
pip install Scrapy
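Either way, you can check that the installation worked by printing the installed version (the exact output depends on the version you got):

scrapy version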
Note: Scrapy is written in pure Python and depends on a few key Python packages:
- lxml, an efficient XML and HTML parser
- parsel, an HTML/XML data extraction library written on top of lxml,
- w3lib, a multi-purpose helper for dealing with URLs and web page encodings
- Twisted, an asynchronous networking framework
- Cryptography and pyOpenSSL, which handle various network-level security needs
Core API
Crawler API
The main entry point to the Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access them and hook their functionality into Scrapy.
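For example, a minimal extension (the class name, the logged message, and the settings path are illustrative) receives the Crawler in from_crawler and uses it to read settings and connect to signals:

from scrapy import signals


class SpiderOpenedLogger:
    # Hypothetical extension: enabled through the EXTENSIONS setting.
    def __init__(self, bot_name):
        self.bot_name = bot_name

    @classmethod
    def from_crawler(cls, crawler):
        # The Crawler object exposes the core components: settings, signals,
        # stats, the engine, and so on.
        ext = cls(bot_name=crawler.settings.get('BOT_NAME'))
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('%s opened spider %s', self.bot_name, spider.name)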
Settings API
The default settings priorities used in Scrapy are defined by a dictionary that maps each priority's key name to an integer level. Each entry defines a settings entry point, giving it a code name for identification and an integer priority. When setting and retrieving values in the Settings class, greater priorities take precedence over lesser ones.
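As a small illustration of how priorities resolve (the setting name and values are only an example), a value set at the higher 'project' priority overrides one set at the lower 'default' priority:

from scrapy.settings import Settings

settings = Settings()
settings.set('CONCURRENT_REQUESTS', 16, priority='default')
settings.set('CONCURRENT_REQUESTS', 32, priority='project')
print(settings.getint('CONCURRENT_REQUESTS'))  # prints 32: 'project' outranks 'default'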
Spider loader API
This class is responsible for retrieving and processing spider classes defined across projects.
Custom spider loaders can be used by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee error-free execution.
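As a rough sketch (the subclass name, the print statement, and the module path are assumptions), one way to customize loading is to subclass the built-in loader, which already satisfies the interface, and point SPIDER_LOADER_CLASS at it:

from scrapy.spiderloader import SpiderLoader


class VerboseSpiderLoader(SpiderLoader):
    # Hypothetical loader: behaves like the default one, but reports
    # which spider class is being loaded.
    def load(self, spider_name):
        print(f'loading spider: {spider_name}')
        return super().load(spider_name)


# settings.py (the module path is an example):
# SPIDER_LOADER_CLASS = 'myproject.loaders.VerboseSpiderLoader'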
Signal API
The connect() method connects a receiver function to a signal.
The signal can be any object, although Scrapy comes with some predefined signals that are documented in the signals section.
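For example (the spider name and URL are illustrative), a spider can connect one of its own methods to the predefined spider_closed signal from its from_crawler class method:

import scrapy
from scrapy import signals


class SignalDemoSpider(scrapy.Spider):
    name = 'signal_demo'
    start_urls = ['http://quotes.toscrape.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect the receiver method to the predefined signal.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass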
Statistics collector API
There are several statistics collectors available under the scrapy.statscollectors module, all of which implement the statistics collector API defined by the StatsCollector class (they all inherit from it).
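As an illustration (the spider and the stat key 'custom/quotes_seen' are assumptions), a spider can reach the active stats collector through its crawler and record custom values:

import scrapy


class StatsDemoSpider(scrapy.Spider):
    name = 'stats_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # inc_value, set_value and get_value are part of the
            # StatsCollector API.
            self.crawler.stats.inc_value('custom/quotes_seen')
            yield {'text': quote.css('span.text::text').get()}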