A lightweight Baidu crawler written in Python

2022-09-16 (last updated 2025-02-24)

This issue recommends a lightweight Baidu crawler written in Python – BaiduSpider.


BaiduSpider is a lightweight Baidu crawler written in Python. Built on Requests and BeautifulSoup, it provides an easy-to-use API and comprehensive type annotations for a better developer experience.

Features

  • Saves time on data extraction, which helps with building datasets and training deep-learning models
  • Extracts Baidu search results accurately and quickly, and strips ads
  • Comprehensive coverage: supports multiple search types and multiple return types
  • Provides a simple, easy-to-use API

Installation

Dependent environment:

Python 3.6+

Install with pip:

$ pip install baiduspider

Install manually from GitHub:

$ git clone git@github.com:BaiduSpider/BaiduSpider.git

# ... 

$ python setup.py install

Example

Baidu web search; it can also be used as a general-purpose search.

BaiduSpider.search_web(
    self: BaiduSpider,
    query: str,
    pn: int = 1,
    exclude: list = [],
    time: Union[str, List[datetime.datetime], None] = None,
    proxies: Union[dict, None] = None,
) -> WebResult

Parameters

  • query str: the search term (string)
  • pn int: page number to crawl; defaults to 1, optional
  • exclude list: list of sub-result types to block, optional
  • time str | List[datetime.datetime]: search time range, optional
  • proxies Union[dict, None]: proxy configuration; defaults to None, optional

Examples

Basic call: This is the most basic parameter — query. It is used to pass search terms (string type).

# Import BaiduSpider
from baiduspider import BaiduSpider
from pprint import pprint

# Instantiate BaiduSpider
spider = BaiduSpider()

# Search web pages 
pprint(spider.search_web(query="keyword to search").plain)
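Each call returns a WebResult, and its .plain attribute is a plain-Python representation of the results that you can post-process directly. A minimal sketch of such post-processing; note that the "title" and "url" keys here are assumptions about the result shape, so verify them against your actual output:

```python
def extract_titles(plain_results):
    """Pull the title field out of each result dict, skipping malformed entries."""
    return [r["title"] for r in plain_results if isinstance(r, dict) and "title" in r]

# Stand-in data shaped like a plausible .plain payload (not real crawl output)
sample = [{"title": "Result A", "url": "https://example.com/a"},
          {"title": "Result B", "url": "https://example.com/b"}]
print(extract_titles(sample))  # ['Result A', 'Result B']
```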

Specify the page number: You can change the page number obtained by BaiduSpider by setting the pn parameter.

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search the web page and pass in the page number parameter (here, the second page) 
pprint(spider.search_web(query="keywords to search", pn=2).plain)

Note: be careful when passing the page number. If it is too large, Baidu search will automatically redirect back to the first page.
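Because of that silent fallback to page 1, a paging loop should cap the page count and stop on empty pages. A minimal sketch, where the search_fn callable is a hypothetical stand-in for spider.search_web (with BaiduSpider you would pass something like lambda q, pn: spider.search_web(query=q, pn=pn).plain):

```python
def crawl_pages(search_fn, query, max_pages=5):
    """Collect results for pages 1..max_pages by calling search_fn(query, pn)."""
    results = []
    for pn in range(1, max_pages + 1):
        page = search_fn(query, pn)
        if not page:  # stop early when a page comes back empty
            break
        results.extend(page)
    return results

# Stub search function: pretends pages 1-3 have results, later pages are empty
fake = lambda q, pn: [f"{q}-result-{pn}"] if pn <= 3 else []
print(crawl_pages(fake, "keyword", max_pages=5))
# ['keyword-result-1', 'keyword-result-2', 'keyword-result-3']
```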


Block specific search results: this parameter can be very convenient. By setting the exclude list, you can block certain sub-result types in web search, which also speeds up parsing.

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search the web page and pass in the results to be blocked 
# In this example, the post bar and blog are blocked 
pprint(spider.search_web(query="keywords to search", exclude=["tieba", "blog"]).plain)

exclude values can include: ["news", "video", "baike", "tieba", "blog", "gitee", "related", "calc"], meaning: news, video, encyclopedia (Baike), forum (Tieba), blog, Gitee code repository, related searches, and calculator results, respectively. exclude can also be ["all"], which blocks all sub-results except ordinary search results. Example:

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search the web page and pass in the results to be blocked 
# In this example, all non-normal search results are blocked 
pprint(spider.search_web(query="keywords to search", exclude=["all"]).plain)

If exclude contains "all" together with other values, only "all" takes effect and the other values are ignored.
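That precedence rule can be mirrored in a small client-side validator. This is a hypothetical helper, not part of BaiduSpider:

```python
# Allowed sub-result types, as listed in the documentation above
ALLOWED_EXCLUDES = {"news", "video", "baike", "tieba", "blog", "gitee", "related", "calc"}

def normalize_exclude(exclude):
    """Validate an exclude list and apply the 'all' precedence rule."""
    if "all" in exclude:
        # 'all' overrides any other values
        return ["all"]
    unknown = set(exclude) - ALLOWED_EXCLUDES
    if unknown:
        raise ValueError(f"unknown exclude values: {sorted(unknown)}")
    return list(exclude)

print(normalize_exclude(["tieba", "blog"]))  # ['tieba', 'blog']
print(normalize_exclude(["all", "news"]))    # ['all']
```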


Filter by time: the time parameter enables more precise searches. Its value can be a string or a tuple of two datetime.datetime objects. For example, using the string form:

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search web pages, showing only the search results within the time period 
# In this example, only one week's search results are displayed after filtering 
pprint(spider.search_web(query="keywords to search", time="week").plain)

This feature uses Baidu's built-in time filter rather than filtering in the program. In this example, the value of time is "week", which limits results to the past week. The possible values of time are ["day", "week", "month", "year"], meaning within one day, one week, one month, or one year, respectively. In addition, BaiduSpider also supports custom time periods. For example:

from baiduspider import BaiduSpider
from pprint import pprint
from datetime import datetime

spider = BaiduSpider()

# In this example, only search results from 2020.1.5 to 2020.4.9 are displayed after filtering 
pprint(spider.search_web(query="keywords to search", time=(datetime(2020, 1, 5), datetime(2020, 4, 9))).plain)

In this example, the value of time is a tuple: the first element is the start time and the second is the end time. BaiduSpider converts both to time.time()-style floating-point timestamps (then truncates them to integers), so you can also pass integers instead of datetime objects.
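The two time behaviors described above, keyword windows and the datetime-to-integer conversion, can be sketched with a couple of small helpers. These are hypothetical and mirror the documented behavior, not BaiduSpider's actual internals:

```python
from datetime import datetime, timedelta

# Rolling windows implied by the documented keywords (the exact boundaries
# Baidu's own filter uses are an assumption here)
TIME_WINDOWS = {"day": 1, "week": 7, "month": 30, "year": 365}

def time_range(keyword, now=None):
    """Translate a time keyword into a (start, end) datetime pair."""
    if keyword not in TIME_WINDOWS:
        raise ValueError(f"time must be one of {sorted(TIME_WINDOWS)}")
    end = now or datetime.now()
    return end - timedelta(days=TIME_WINDOWS[keyword]), end

def to_timestamp(value):
    """Convert a datetime (or pass through an int) to a whole-second Unix timestamp."""
    if isinstance(value, datetime):
        return int(value.timestamp())  # truncate the fractional part
    return int(value)

start, end = time_range("week", now=datetime(2022, 9, 16))
print(start, end)  # 2022-09-09 00:00:00 2022-09-16 00:00:00
print(to_timestamp(end) > to_timestamp(start))  # True
```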

—END—

This project is released under the GPL-3.0 license; you can explore more of its features on your own.

Source: Ictcoder Free Source Code, https://ictcoder.com/a-lightweight-baidu-crawler-written-in-python/
