This issue recommends DrissionPage, an open-source, Python-based web automation tool.
Using requests to collect data from sites that require login means analyzing packets and JS source code, constructing complex requests, and frequently dealing with anti-crawling measures such as CAPTCHAs, JS obfuscation, and signed parameters, so the barrier to entry is high. If the data is generated by JS computation, that computation has to be reproduced, which makes for a poor experience and low development efficiency.
With selenium, these pitfalls can largely be bypassed, but selenium itself is not very efficient. This library therefore combines selenium and requests into one, switching to the corresponding mode as needs change, and provides a user-friendly way to improve development and runtime efficiency.
Beyond merging the two, the library also encapsulates common functions at the unit of the web page, simplifying selenium's operations and statements. When used for web page automation, it reduces the attention paid to details and lets you focus on implementing features, making it more convenient to use. Everything is kept simple: it tries to provide direct, straightforward methods and is friendly to beginners.
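As a taste of the two-mode idea, here is a minimal sketch using the MixPage object demonstrated later in this article; change_mode() is assumed from DrissionPage's MixPage API, and the profile URL is a placeholder:
from DrissionPage import MixPage
page = MixPage()  # d mode (browser-driven) by default
page.get('https://gitee.com/login')
# ... log in manually or with the script ...
page.change_mode()  # assumed API: switch from d mode to s mode, keeping cookies/login state
page.get('https://gitee.com/profile')  # placeholder URL, now fetched via requests
print(page.html[:100])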
Features
- Highly integrated code, with conciseness as the first goal.
- Page objects can switch between selenium mode and requests mode while retaining the login state.
- Extremely simple yet powerful element location syntax with support for chained operations, keeping code very concise.
- Both modes provide a consistent API and a consistent experience.
- Humanized design that integrates many practical functions, greatly reducing the development workload.
Highlights
- You can reuse an already-open browser each time the program runs: for example, manually put a page into a certain state and let the program take over, or handle the login manually and then let the program crawl the content. There is no need to start the browser from scratch on every run (see the sketch after this list).
- Common configuration is stored in an ini file and loaded automatically; convenient setter APIs are also provided, keeping you away from tedious configuration items.
- Extremely concise location syntax: elements can be located directly by text, and siblings and parents can be accessed directly.
- A powerful download tool lets you enjoy fast and reliable downloads even while operating the browser.
- The download tool supports multiple ways to handle file-name conflicts, automatically creates target paths, and retries broken connections.
- URL access has automatic retry, with configurable interval and timeout.
- Page encoding is detected automatically; no manual setting is needed.
- Connection parameters automatically generate Host and Referer headers by default.
- The browser process window can be hidden or shown at any time, without headless mode or minimizing.
- The appropriate version of chromedriver can be downloaded automatically, eliminating configuration hassle.
- Element lookup in d mode has built-in waiting; you can set a global wait time or a per-lookup wait time.
- Element clicking integrates a JS click mode; a single parameter switches the click mode.
- Clicks support retry on failure, which can be used to ensure a click succeeds or to check whether a page's mask layer has disappeared.
- Text input can automatically verify success and retry, avoiding cases where input fails or ends up empty in some situations.
- d mode supports full-featured XPath that can directly return an attribute of an element; native selenium does not have this.
- Shadow roots can be retrieved directly, and elements under them can be manipulated like normal elements.
- The contents of ::after and ::before pseudo-elements can be retrieved directly.
- CSS selectors can be used directly under an element to get its direct children, which native selenium does not support.
- d-mode pages or elements can be parsed with lxml, greatly improving the speed of crawling complex page data.
- Output data is transcoded and given basic formatting, reducing repetitive work.
- Interoperates easily with native selenium or requests code, making project migration convenient.
- Encapsulated in the POM pattern, so it can be used directly for testing and is easy to extend.
- d mode configuration can combine debugger_address with other startup parameters, which native selenium does not allow.
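To illustrate the browser-takeover highlight above, here is a hedged sketch; DriverOptions and its set_paths(debugger_address=...) parameter are assumed from DrissionPage 2.x, and the port is an example:
from DrissionPage import MixPage
from DrissionPage.config import DriverOptions
# Assumption: a Chrome instance is already running, started with --remote-debugging-port=9222
do = DriverOptions()
do.set_paths(debugger_address='127.0.0.1:9222')  # assumed parameter name
page = MixPage(driver_options=do)  # takes over the open browser instead of launching a new one
print(page.title)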
Structure
The Drission object is responsible for creating connections and sharing the login state, similar in concept to selenium's driver. The MixPage object is responsible for parsing and operating the retrieved page. DriverElement and SessionElement are element objects obtained from the page object, responsible for parsing and operating elements.
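A short sketch of how these objects fit together, assuming MixPage accepts a drission argument as in DrissionPage 2.x; the URL is a placeholder:
from DrissionPage import Drission, MixPage
drission = Drission()              # manages driver and session, shares login state between them
page = MixPage(drission=drission)  # page object for parsing and operating the page
page.get('https://gitee.com')      # placeholder URL
ele = page.ele('tag:h1')           # returns a DriverElement in d mode, a SessionElement in s mode
print(ele.text)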
Comparison with selenium code
Switch to the first tab
# Using selenium:
driver.switch_to.window(driver.window_handles[0])
# Using DrissionPage:
page.to_tab(0)
Select a drop-down list option by text
# Using selenium:
from selenium.webdriver.support.select import Select
select_element = Select(element)
select_element.select_by_visible_text('text')
# Using DrissionPage:
element.select('text')
Drag an element
# Using selenium:
from selenium.webdriver import ActionChains
ActionChains(driver).drag_and_drop(ele1, ele2).perform()
# Using DrissionPage:
ele1.drag_to(ele2)
Comparison with requests
Get element content
url = 'https://baike.baidu.com/item/python'
# Using requests:
import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}
response = requests.get(url, headers=headers)
html = etree.HTML(response.text)
element = html.xpath('//h1')[0]
title = element.text
# Using DrissionPage:
from DrissionPage import MixPage
page = MixPage('s')
page.get(url)
title = page('tag:h1').text
Download a file
url = 'https://www.baidu.com/img/flexible/logo/pc/result.png'
save_path = r'C:\download'
# Using requests:
r = requests.get(url)
with open(f'{save_path}\\img.png', 'wb') as fd:
    for chunk in r.iter_content():
        fd.write(chunk)
# Using DrissionPage:
page.download(url, save_path, 'img')  # supports renaming, handles file-name conflicts, and creates the target folder automatically
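A hedged sketch of the conflict handling noted in the comment above; the file_exists parameter name and its values are assumptions based on DrissionPage 2.x's download API:
# Assumed parameter: file_exists selects the conflict strategy ('rename', 'overwrite' or 'skip')
page.download(url, save_path, rename='img', file_exists='overwrite')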
Crawl a COVID-19 ranking list
URL: https://www.outbreak.my/zh/world. This example crawls the global COVID-19 ranking list. The site is a plain HTML page, especially suitable for crawling and parsing in s mode.
from DrissionPage import MixPage
# Create a page object in s mode
page = MixPage('s')
# Visit the page
page.get('https://www.outbreak.my/zh/world')
# Get the table header element
thead = page('tag:thead')
# Get the header columns, skipping hidden ones
title = thead.eles('tag:th@@-style:display: none;')
data = [th.text for th in title]
print(data)  # print the header
# Get the table body element
tbody = page('tag:tbody')
# Get all rows
rows = tbody.eles('tag:tr')
for row in rows:
    # Get all columns of the current row
    cols = row.eles('tag:td')
    # Build the current row's data list (skipping useless columns)
    data = [td.text for k, td in enumerate(cols) if k not in (2, 4, 6)]
    print(data)  # print the row data
Output:
['Total (205)', 'Cumulative confirmed', 'Deaths', 'Cured', 'Currently confirmed', 'Mortality rate', 'Recovery rate']
['US', '55252823', '845745', '41467660', '12939418', '1.53%', '75.05%']
['India', '34838804', '481080', '34266363', '91361', '1.38%', '98.36%']
['Brazil', '22277239', '619024', '21567845', '90370', '2.78%', '96.82%']
['UK', '12748050', '148421', '10271706', '2327923', '1.16%', '80.57%']
['Russia', '10499982', '308860', '9463919', '727203', '2.94%', '90.13%']
['France', '9740600', '123552', '8037752', '1579296', '1.27%', '82.52%']
...
Log in to gitee
URL: https://gitee.com/login. This example demonstrates how to log in to the gitee website automatically by controlling the browser.
from DrissionPage import MixPage
# Create a page object in d mode
page = MixPage()
# Go to the page
page.get('https://gitee.com/login')
# Locate the account text box and enter the account
page.ele('#user_login').input('your account')
# Locate the password text box and enter the password
page.ele('#user_password').input('your password')
# Click the login button
page.ele('@value=登 录').click()
Open-source license: BSD-3-Clause