BaiduSpider is a lightweight Baidu crawler written in Python. It is built on Requests and BeautifulSoup, and it provides an easy-to-use API and comprehensive type annotations to improve the developer experience.

Features

Saves time on data extraction, which helps when building and training data models for deep learning and similar projects
Extracts Baidu search results accurately and quickly, with ads removed
Returns comprehensive search results and supports a variety of search types and return types
Provides a simple, easy-to-use API

Installation

Requirements:

Python 3.6+

Install with pip:

$ pip install baiduspider

Install manually from GitHub:

$ git clone git@github.com:BaiduSpider/BaiduSpider.git

# ... 

$ python setup.py install

Example

search_web performs a Baidu web search and can also be used as a general-purpose search.

BaiduSpider.search_web(
    self: BaiduSpider,
    query: str,
    pn: int = 1,
    exclude: list = [],
    proxies: Union[dict, None] = None,
) -> WebResult

Parameters

query str: the search query string
pn int: page number to crawl, optional. Defaults to 1
exclude list: list of subsearch result types to block, optional
time str | tuple of datetime.datetime: search time range, optional
proxies Union[dict, None]: proxy configuration, optional. Defaults to None
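The proxies argument is described only as a dict. The following is a minimal sketch, assuming the dict follows the Requests convention of mapping a URL scheme to a proxy URL. BaiduSpider is built on Requests, but the article does not state the exact format. build_proxies is a hypothetical helper, and the host and port are placeholders.

```python
from typing import Dict


def build_proxies(host: str, port: int, scheme: str = "http") -> Dict[str, str]:
    """Build a Requests-style proxies mapping (hypothetical helper).

    Assumption: BaiduSpider hands this dict to Requests unchanged.
    """
    proxy_url = f"{scheme}://{host}:{port}"
    # Requests picks the proxy by the scheme of the URL being requested.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("127.0.0.1", 8080)  # placeholder address
# It would then be passed as: spider.search_web(query="...", proxies=proxies)
```

If the assumption holds, the same dict could be reused for every call on the spider.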

Usage examples

Basic call: query is the only required parameter. It takes the search terms as a string.

# Import BaiduSpider
from baiduspider import BaiduSpider
from pprint import pprint

# Instantiate BaiduSpider
spider = BaiduSpider()

# Search web pages
pprint(spider.search_web(query="Keywords to search").plain)

Specify the page number: Set the pn parameter to choose which page of results BaiduSpider fetches.

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search web pages and pass in the page number (here, the second page)
pprint(spider.search_web(query="Keywords to search", pn=2).plain)

Note: Do not pass a page number that is too large. If you do, Baidu automatically jumps back to the first page of results.
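One way to guard against this when crawling several pages is to stop as soon as a page repeats the first one. The following is only a sketch: collect_pages and fetch_page are hypothetical names, and it assumes that a page beyond the last one comes back identical to page 1, which the article implies but does not guarantee.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], list], max_pages: int) -> List[list]:
    """Fetch pages 1..max_pages, stopping once the results wrap around.

    fetch_page is a hypothetical callable that takes a page number and returns
    that page's results, for example:
        lambda pn: spider.search_web(query="...", pn=pn).plain
    """
    pages: List[list] = []
    for pn in range(1, max_pages + 1):
        results = fetch_page(pn)
        # Assumption: an out-of-range page is served as a copy of page 1.
        if pn > 1 and results == pages[0]:
            break
        pages.append(results)
    return pages
```

Passing the fetch step in as a callable keeps the loop independent of the network, so it can be checked with a stub before it is pointed at Baidu.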

Block specific search results: By setting the exclude list, you can block certain subsearch result types in a web search, which improves parsing speed.

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search web pages and pass in the result types to block
# In this example, post bar (tieba) and blog results are blocked
pprint(spider.search_web(query="Keywords to search", exclude=["tieba", "blog"]).plain)

exclude can contain any of the following values: ["news", "video", "baike", "tieba", "blog", "gitee", "related", "calc"]. They correspond, respectively, to news, video, encyclopedia, post bar, blog, Gitee code repository, related searches, and calculation results. exclude can also be ["all"], which excludes everything except ordinary search results. Example:

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search web pages and pass in the result types to block
# In this example, everything except ordinary search results is blocked
pprint(spider.search_web(query="Keywords to search", exclude=["all"]).plain)

If exclude contains "all" together with other values, the other values are ignored and the results are filtered as if only "all" had been passed.
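The precedence of "all" can be illustrated with a small sketch. resolve_exclude is a hypothetical helper written for this article, not BaiduSpider's own code, and rejecting unknown values is a choice of the sketch rather than documented library behavior.

```python
from typing import Iterable, Set

# Subsearch types listed in this article.
SUBSEARCH_TYPES = ("news", "video", "baike", "tieba", "blog", "gitee", "related", "calc")


def resolve_exclude(exclude: Iterable[str]) -> Set[str]:
    """Return the set of subsearch types that end up blocked (illustrative only)."""
    requested = set(exclude)
    if "all" in requested:
        # "all" takes precedence: any other entries make no difference.
        return set(SUBSEARCH_TYPES)
    unknown = requested - set(SUBSEARCH_TYPES)
    if unknown:
        # Sketch-only check to catch typos before a request is made.
        raise ValueError(f"unknown exclude values: {sorted(unknown)}")
    return requested
```

Under this reading, resolve_exclude(["all", "blog"]) and resolve_exclude(["all"]) block exactly the same set of result types.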

Filter by time: The time parameter narrows the search to a time range. Its value can be a string or a tuple of datetime.datetime objects. For example, using the string form:

from baiduspider import BaiduSpider
from pprint import pprint

spider = BaiduSpider()

# Search web pages, showing only results within the given time period
# In this example, only results from the past week are shown
pprint(spider.search_web(query="Keywords to search", time="week").plain)

This feature relies on Baidu's built-in time filter rather than filtering the results in the program. In this example, time is "week", which limits the results to the past week. The accepted string values are ["day", "week", "month", "year"], meaning within the past day, week, month, and year respectively. BaiduSpider also supports custom time ranges. For example:

from baiduspider import BaiduSpider
from pprint import pprint
from datetime import datetime

spider = BaiduSpider()

# In this example, only results from 2020-01-05 to 2020-04-09 are shown
pprint(spider.search_web(query="Keywords to search", time=(datetime(2020, 1, 5), datetime(2020, 4, 9))).plain)

In this example, time is a tuple whose first element is the start time and whose second element is the end time. BaiduSpider converts both to floating-point timestamps of the kind returned by time.time() and then keeps only the integer part, so you can also pass integers in place of the datetime objects.
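The conversion described above can be sketched as follows. This is not BaiduSpider's internal code: to_int_timestamp and to_timestamp_range are hypothetical helpers, and the ordering check is an addition of the sketch. Note that datetime.timestamp() interprets naive datetimes as local time, so the example uses UTC-aware values to keep the result the same on every machine.

```python
from datetime import datetime, timezone
from typing import Tuple, Union

TimePoint = Union[datetime, int, float]


def to_int_timestamp(value: TimePoint) -> int:
    """Convert a datetime or numeric timestamp to whole seconds since the epoch."""
    if isinstance(value, datetime):
        # Naive datetimes are treated as local time by datetime.timestamp().
        value = value.timestamp()
    return int(value)  # keep only the integer part


def to_timestamp_range(start: TimePoint, end: TimePoint) -> Tuple[int, int]:
    """Normalise a (start, end) pair and check that it is in order."""
    begin, finish = to_int_timestamp(start), to_int_timestamp(end)
    if begin > finish:
        raise ValueError("start time must not be later than end time")
    return begin, finish


example = to_timestamp_range(
    datetime(2020, 1, 5, tzinfo=timezone.utc),
    datetime(2020, 4, 9, tzinfo=timezone.utc),
)
# example == (1578182400, 1586390400)
```

Since integers pass through unchanged, a tuple such as (1578182400, 1586390400) would describe the same range as the two UTC datetimes.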

This project is released under the GPL-3.0 open source license. See the project itself for its other features.

Ictcoder Free source code A lightweight Baidu crawler written in Python https://ictcoder.com/kyym/a-lightweight-baidu-crawler-written-in-python.html

lllll

Share free open-source source code

Previous article： Cloud Platform based on Spring Cloud backend service development scaffolding

Next article： NocoBase is an extremely scalable open source no-code and low-code development platform

Q&A

What is the delivery method?

1, automatic: after taking the photo, click the (download) link to download; 2. Manual: After taking the photo, contact the seller to issue it or contact the official to find the developer to ship.

View details

How long is the trading cycle?

1, the default transaction cycle of the source code: manual delivery of goods for 1-3 days, and the user payment amount will enter the platform guarantee until the completion of the transaction or 3-7 days can be issued, in case of disputes indefinitely extend the collection amount until the dispute is resolved or refunded!

View details

Matters needing attention

1. Heptalon will permanently archive the process of trading between the two parties and the snapshots of the traded goods to ensure that the transaction is true, effective and safe! 2, Seven PAWS can not guarantee such as "permanent package update", "permanent technical support" and other similar transactions after the merchant commitment, please identify the buyer; 3, in the source code at the same time there is a website demonstration and picture demonstration, and the site is inconsistent with the diagram, the default according to the diagram as the dispute evaluation basis (except for special statements or agreement); 4, in the absence of "no legitimate basis for refund", the commodity written "once sold, no support for refund" and other similar statements, shall be deemed invalid; 5, before the shooting, the transaction content agreed by the two parties on QQ can also be the basis for dispute judgment (agreement and description of the conflict, the agreement shall prevail); 6, because the chat record can be used as the basis for dispute judgment, so when the two sides contact, only communicate with the other party on the QQ and mobile phone number left on the systemhere, in case the other party does not recognize self-commitment. 7, although the probability of disputes is very small, but be sure to retain such important information as chat records, mobile phone messages, etc., in case of disputes, it is convenient for seven PAWS to intervene in rapid processing.

View details

Systemhere declaration

1. As a third-party intermediary platform, Qichou protects the security of the transaction and the rights and interests of both buyers and sellers according to the transaction contract (commodity description, content agreed before the transaction); 2, non-platform online trading projects, any consequences have nothing to do with mutual site; No matter the seller for any reason to require offline transactions, please contact the management report.

View details