An efficient Python crawler framework Scrapy


2022-09-02

Scrapy, recommended in this issue, is a fast, high-level web crawling and web scraping framework for crawling websites and extracting structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.


Framework example

Scrapy is an application framework for scraping websites and extracting structured data that can be used for a variety of useful applications such as data mining, information processing, or historical archiving.

Below is the code for a spider that scrapes quotes from http://quotes.toscrape.com, following the pagination:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the author and text from each quote block on the page.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the "next" pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Put it in a text file, name it something like quotes_spider.py and run it with the following runspider command:

scrapy runspider quotes_spider.py -o quotes.jl

When it finishes, you’ll get a list of quotes in JSON Lines format in the quotes.jl file, containing the text and the author, like this:

{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...

Spider middleware

The Spider middleware is a hook framework for the Scrapy Spider processing mechanism in which custom functionality can be inserted to handle the responses sent to the spider for processing as well as the requests and items generated from the spider.
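As a sketch of this hook mechanism, the class below filters the spider's output before it reaches the engine. The hook name process_spider_output matches Scrapy's spider middleware interface; the class name, the 'text' field, and the length threshold are illustrative assumptions, and the class is written without Scrapy imports so it runs standalone:

```python
# A minimal spider middleware sketch (illustrative, not from the Scrapy docs).
# process_spider_output receives the iterable of items and requests the spider
# produced for a response, and must yield the elements to keep.
class DropShortQuotesMiddleware:
    """Drop scraped items whose 'text' field is shorter than a threshold."""

    MIN_LENGTH = 10  # illustrative threshold, not a Scrapy setting

    def process_spider_output(self, response, result, spider):
        for element in result:
            # Items from the example spider are plain dicts; requests pass through.
            if isinstance(element, dict) and len(element.get('text', '')) < self.MIN_LENGTH:
                continue  # filter this item out
            yield element
```

In a real project the middleware would be enabled by adding its import path to the SPIDER_MIDDLEWARES setting with an order number.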

Architecture overview

Data flow:


Flow of execution

    1. The Engine gets the initial requests to crawl from the Spider.
    2. The Engine schedules the requests in the Scheduler and asks for the next requests to crawl.
    3. The Scheduler returns the next requests to the Engine.
    4. The Engine sends the requests to the Downloader, passing through the downloader middleware.
    5. Once the page finishes downloading, the Downloader generates a response (with that page) and sends it to the Engine, passing through the downloader middleware.
    6. The Engine receives the response from the Downloader and sends it to the Spider for processing, passing through the spider middleware.
    7. The Spider processes the response and returns scraped items and new requests (to follow) to the Engine, passing through the spider middleware.
    8. The Engine sends the processed items to the item pipelines, then sends the processed requests to the Scheduler and asks for possible next requests to crawl.
    9. The process repeats (from step 3) until there are no more requests from the Scheduler.
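The loop above can be sketched in a few lines. This is a deliberately simplified, single-threaded stand-in for the real components (Scrapy's engine is asynchronous and built on Twisted); the download and parse callables are placeholders you supply:

```python
from collections import deque

# A simplified sketch of the data flow: a deque plays the Scheduler, a plain
# list plays the item pipeline, and middlewares are omitted entirely.
def crawl(start_requests, download, parse):
    scheduler = deque(start_requests)   # engine schedules the initial requests
    items = []
    while scheduler:                    # repeat until the scheduler is empty
        request = scheduler.popleft()   # scheduler returns the next request
        response = download(request)    # downloader fetches the page
        for result in parse(response):  # spider processes the response
            if isinstance(result, dict):
                items.append(result)    # items go to the item pipeline
            else:
                scheduler.append(result)  # new requests go back to the scheduler
    return items
```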

Installation guide

Scrapy requires Python 3.6+, CPython implementation (default), or PyPy 7.2.0+ implementation.

Installing Scrapy

If you are using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has the latest packages for Linux, Windows, and macOS.

To install Scrapy using conda, run:

conda install -c conda-forge scrapy

Alternatively, if you are already familiar with installing Python packages, you can install Scrapy and its dependencies from PyPI using the following command:

pip install Scrapy

Note: Scrapy is written in pure Python and depends on a few key Python packages:

  • lxml, an efficient XML and HTML parser
  • parsel, an HTML/XML data extraction library written on top of lxml
  • w3lib, a multi-purpose helper for dealing with URLs and web page encodings
  • Twisted, an asynchronous networking framework
  • cryptography and pyOpenSSL, which handle various network-level security needs

Core API

Crawler API

The main entry point of the Scrapy API is the Crawler object, passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it is the only way for extensions to access those components and hook their functionality into Scrapy.
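The from_crawler pattern can be sketched as follows. The class method is Scrapy's real entry point for extensions; the extension name and the LOGSTATS_INTERVAL setting read here are illustrative, and no Scrapy import is needed because the crawler argument is just any object exposing a settings attribute:

```python
# A sketch of an extension built via the from_crawler entry point.
class ItemCountExtension:
    def __init__(self, interval):
        self.interval = interval

    @classmethod
    def from_crawler(cls, crawler):
        # Pull configuration from the crawler's settings object;
        # the setting name is an illustrative assumption.
        interval = crawler.settings.get('LOGSTATS_INTERVAL', 60)
        return cls(interval)
```

A real extension would also use crawler.signals here to subscribe to events such as spider_opened.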

Settings API

Scrapy resolves settings by priority: each setting source has a string name identifying it and an integer priority. When setting and retrieving values through the Settings class, a value set with a higher priority takes precedence over one set with a lower priority.
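This priority resolution can be sketched with a small stand-in class. The priority numbers in the usage below follow Scrapy's convention (default=0, command=10, project=20), but the class itself is illustrative, not Scrapy's Settings implementation:

```python
# A sketch of priority-based setting resolution: each value is stored with
# its priority, and a new value only replaces the old one if its priority
# is at least as high.
class PrioritySettings:
    def __init__(self):
        self._values = {}  # name -> (value, priority)

    def set(self, name, value, priority):
        if name not in self._values or priority >= self._values[name][1]:
            self._values[name] = (value, priority)

    def get(self, name, default=None):
        return self._values.get(name, (default, None))[0]
```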

Spider loader API

This class is responsible for retrieving and processing spider classes defined across projects.

Custom spider loaders can be used by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee error-free execution.
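A loader implementing that interface exposes from_settings, load, list, and find_by_request. The sketch below shows the interface shape only: it resolves spiders from an in-memory dict instead of scanning SPIDER_MODULES, and the dict-shaped request in find_by_request is an illustrative stand-in:

```python
# A minimal, illustrative spider loader with the ISpiderLoader method surface.
class DictSpiderLoader:
    def __init__(self, spiders):
        self._spiders = spiders  # spider name -> spider class

    @classmethod
    def from_settings(cls, settings):
        # Illustrative: a real loader would discover spiders via SPIDER_MODULES.
        return cls(settings.get('SPIDERS_BY_NAME', {}))

    def load(self, spider_name):
        try:
            return self._spiders[spider_name]
        except KeyError:
            raise KeyError(f'Spider not found: {spider_name}')

    def list(self):
        return list(self._spiders)

    def find_by_request(self, request):
        # Return the names of spiders that could handle the request's domain.
        return [name for name, spider_cls in self._spiders.items()
                if request.get('domain') in getattr(spider_cls, 'allowed_domains', [])]
```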

Signal API

The signals API connects a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are recorded in the signal section.
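The connect/send pattern can be sketched with a tiny dispatcher. In a real project you would call crawler.signals.connect with one of the predefined signals; the class below is a standalone stand-in that only demonstrates that any object can serve as a signal:

```python
# A tiny signal dispatcher sketch (illustrative, not Scrapy's SignalManager).
class SignalManager:
    def __init__(self):
        self._receivers = {}  # signal object -> list of receiver callables

    def connect(self, receiver, signal):
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver connected to this signal, collecting results.
        return [receiver(**kwargs) for receiver in self._receivers.get(signal, [])]
```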

Statistics collector API

There are several statistics collectors available under the scrapy.statscollectors module, all of which implement the statistics collector API defined by the StatsCollector class (they all inherit from it).
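The common method surface of that API (get_value, set_value, inc_value, max_value, get_stats) can be sketched as an in-memory collector. This mirrors the shape of scrapy.statscollectors.MemoryStatsCollector but is an illustrative stand-in, not Scrapy's class:

```python
# An in-memory stats collector sketch with the StatsCollector method surface.
class MemoryStats:
    def __init__(self):
        self._stats = {}

    def get_value(self, key, default=None):
        return self._stats.get(key, default)

    def set_value(self, key, value):
        self._stats[key] = value

    def inc_value(self, key, count=1, start=0):
        # Increment a counter, creating it at `start` if it does not exist.
        self._stats[key] = self._stats.get(key, start) + count

    def max_value(self, key, value):
        # Keep the largest value seen for this key.
        self._stats[key] = max(self._stats.get(key, value), value)

    def get_stats(self):
        return self._stats
```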


Ictcoder Free Source Code An efficient Python crawler framework Scrapy https://ictcoder.com/an-efficient-python-crawler-framework-scrapy/
