This issue recommends BaiduSpider, a lightweight Baidu crawler written in Python.
BaiduSpider is built on Requests and BeautifulSoup, and provides an easy-to-use API and comprehensive type annotations to improve the developer experience.
Features
- Saves time on data extraction, which helps when building datasets and training models for deep learning and similar projects
- Extracts Baidu search results accurately and quickly, with ads removed
- Covers a broad range of search results, with support for multiple search types and multiple return types
- Provides a simple, easy-to-use API
Installation
Requirements:
Python 3.6+
Install with pip:
$ pip install baiduspider
Install manually from GitHub:
$ git clone git@github.com:BaiduSpider/BaiduSpider.git
# ...
$ python setup.py install
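To confirm that the installation worked, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name check_environment is made up for illustration and is not part of BaiduSpider.

```python
import importlib.util
import sys


def check_environment(package="baiduspider", min_version=(3, 6)):
    """Return a list of problems found; an empty list means everything looks fine."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d or newer is required" % min_version)
    # find_spec returns None when a top-level package cannot be found
    if importlib.util.find_spec(package) is None:
        problems.append("package %r is not installed" % package)
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, the interpreter is recent enough and the package can be imported.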
Example
Baidu web search, which can also be used as a general-purpose search.
BaiduSpider.search_web(
    self: BaiduSpider,
    query: str,
    pn: int = 1,
    exclude: list = [],
    proxies: Union[dict, None] = None,
) -> WebResult
Parameters
- query str: the search query string
- pn int: page number to crawl, optional. Defaults to 1
- exclude list: list of result subtypes to block, optional
- time str | Tuple[datetime.datetime]: search time range, optional
- proxies Union[dict, None]: proxy configuration, optional. Defaults to None
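Because BaiduSpider is built on Requests, a reasonable assumption is that proxies takes the same mapping Requests uses (URL scheme to proxy URL). The sketch below builds such a mapping. The helper names make_proxies and search_with_proxy are placeholders for illustration, and this article does not confirm how the value is forwarded.

```python
from typing import Dict, Optional


def make_proxies(proxy_url: Optional[str]) -> Optional[Dict[str, str]]:
    """Build a Requests-style proxy mapping, or None when no proxy is wanted."""
    if not proxy_url:
        return None  # matches the documented default of proxies=None
    # Requests-style mappings use one entry per URL scheme
    return {"http": proxy_url, "https": proxy_url}


def search_with_proxy(query: str, proxy_url: Optional[str] = None):
    """Run a web search through an optional proxy (needs baiduspider and network access)."""
    # Imported lazily so make_proxies can be used without baiduspider installed
    from baiduspider import BaiduSpider

    spider = BaiduSpider()
    return spider.search_web(query=query, proxies=make_proxies(proxy_url))
```

Calling search_with_proxy("Keywords to search", "http://127.0.0.1:8080") would then route the request through a local proxy, assuming one is listening at that address.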
Examples
Basic call: query is the most basic parameter. It passes the search terms as a string.
# Import BaiduSpider
from baiduspider import BaiduSpider
from pprint import pprint
# Instantiate BaiduSpider
spider = BaiduSpider()
# Search web pages
pprint(spider.search_web(query="Keywords to search").plain)
Specify the page number: You can change the page number obtained by BaiduSpider by setting the pn parameter.
from baiduspider import BaiduSpider
from pprint import pprint
spider = BaiduSpider()
# Search the web and pass in the page number (here, the second page)
pprint(spider.search_web(query="Keywords to search", pn=2).plain)
Note: be careful with the page number. If it is too large, Baidu search automatically jumps back to the first page.
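Since an out-of-range page number silently returns the first page again, a crawl loop can stop as soon as a later page repeats the first one. The following is a minimal sketch under that assumption. fetch_page is a hypothetical callable standing in for something like lambda pn: spider.search_web(query="Keywords to search", pn=pn).plain, the exact-equality check is a simplification (real result pages may differ slightly between requests), and the pause between requests is an arbitrary courtesy delay.

```python
import time
from typing import Any, Callable, List


def crawl_pages(fetch_page: Callable[[int], Any], max_pages: int = 10, delay: float = 1.0) -> List[Any]:
    """Collect pages 1..max_pages, stopping early if a page repeats page 1."""
    pages: List[Any] = []
    for pn in range(1, max_pages + 1):
        page = fetch_page(pn)
        # A later page identical to the first one suggests we ran past the last page
        if pn > 1 and page == pages[0]:
            break
        pages.append(page)
        if pn < max_pages:
            time.sleep(delay)  # pause between requests
    return pages
```

Passing the fetcher in as a callable keeps the loop independent of the network, so it can be tried with a stub function first.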
Block specific search results: The exclude parameter lets you skip certain sub-results of a web search. The result types listed in exclude are not parsed, which improves parsing speed.
from baiduspider import BaiduSpider
from pprint import pprint
spider = BaiduSpider()
# Search the web and pass in the result types to block
# In this example, Tieba (Baidu's forum) and blog results are blocked
pprint(spider.search_web(query="Keywords to search", exclude=["tieba", "blog"]).plain)
exclude can contain any of the following values: ["news", "video", "baike", "tieba", "blog", "gitee", "related", "calc"]. These stand for news, video, encyclopedia (Baike), Tieba, blog, Gitee code repository, related searches, and calculation results, respectively. exclude can also be ["all"], which excludes everything except ordinary search results. Example:
from baiduspider import BaiduSpider
from pprint import pprint
spider = BaiduSpider()
# Search the web and pass in the result types to block
# In this example, everything except ordinary search results is blocked
pprint(spider.search_web(query="Keywords to search", exclude=["all"]).plain)
If exclude contains "all" together with other values, the other values are ignored and the results are filtered as if only "all" had been passed.
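The exclude rules above can be checked before calling the crawler. The sketch below validates a list against the values named in this article and collapses it to ["all"] when "all" is present. The helper normalize_exclude and the constant KNOWN_EXCLUDES are illustrative names, not part of BaiduSpider, and the library itself may accept values not listed here.

```python
from typing import Iterable, List

# Values listed in this article; the library itself may accept others
KNOWN_EXCLUDES = {"news", "video", "baike", "tieba", "blog", "gitee", "related", "calc", "all"}


def normalize_exclude(values: Iterable[str]) -> List[str]:
    """Validate exclude values and collapse the list to ["all"] when "all" is present."""
    cleaned: List[str] = []
    for value in values:
        if value not in KNOWN_EXCLUDES:
            raise ValueError("unknown exclude value: %r" % value)
        if value not in cleaned:
            cleaned.append(value)  # drop duplicates, keep the original order
    return ["all"] if "all" in cleaned else cleaned
```

A typo such as "teiba" then fails early with a ValueError instead of being passed along unnoticed.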
Filter by time: The time parameter narrows the search to a time range. Its value can be a string or a tuple of datetime.datetime objects. For example, using the string form:
from baiduspider import BaiduSpider
from pprint import pprint
spider = BaiduSpider()
# Search the web, showing only results within the time range
# In this example, only results from the past week are shown
pprint(spider.search_web(query="Keywords to search", time="week").plain)
This feature uses Baidu's built-in time filter rather than filtering in the program. Here time is "week", which restricts the results to the past week. The possible string values are ["day", "week", "month", "year"], meaning within the past day, week, month, and year, respectively. BaiduSpider also supports custom time ranges. For example:
from baiduspider import BaiduSpider
from pprint import pprint
from datetime import datetime
spider = BaiduSpider()
# In this example, only results from January 5, 2020 to April 9, 2020 are shown
pprint(spider.search_web(query="Keywords to search", time=(datetime(2020, 1, 5), datetime(2020, 4, 9))).plain)
In this example, time is a tuple: the first element is the start time and the second is the end time. BaiduSpider converts both to Unix timestamps (floating-point numbers of the kind time.time() returns) and keeps only the integer part, so you can also pass integers in place of the datetime objects.
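Given that the tuple ends up as integer timestamps, a custom range such as "the last N days" can be prepared up front. This is a minimal sketch using only the standard library; the helper names to_timestamps and last_days are illustrative and not part of BaiduSpider.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def to_timestamps(start: datetime, end: datetime) -> Tuple[int, int]:
    """Convert a (start, end) pair of datetimes to whole-second Unix timestamps."""
    if start > end:
        raise ValueError("start must not be later than end")
    # datetime.timestamp() returns a float like time.time(); int() drops the fraction
    return int(start.timestamp()), int(end.timestamp())


def last_days(days: int, now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return a (start, end) range covering the given number of days up to now."""
    end = now or datetime.now()
    return end - timedelta(days=days), end
```

Either result could then be passed as time, for example time=last_days(30) or time=to_timestamps(*last_days(30)), assuming integers are accepted as described above.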
This project is released under the GPL-3.0 open source license. See the project documentation for its other features.