Efficient and intelligent online crawler - spider-flow


The spider-flow project recommended in this issue is a platform for building crawlers as flowcharts; a simple crawler can be put together quickly without writing any code.

[Screenshot: spider-flow]

Project features

  • Supports CSS selectors and regular-expression extraction
  • Supports JSON/XML formats
  • Supports XPath/JsonPath extraction
  • Supports multiple data sources and SQL select/insert/update/delete
  • Supports crawling JS-rendered (dynamic) pages
  • Supports proxies
  • Supports binary formats
  • Supports saving/reading files (csv, xls, jpg, etc.)
  • Common functions for strings, dates, files, encryption/decryption, randomness, and more
  • Supports nested processes
  • Supports plug-in extensions (custom executors, custom functions, custom controllers, type extensions, etc.)
  • Supports HTTP interfaces

Installation and deployment

Prepare the environment

JDK >= 1.8
MySQL >= 5.7
Maven >= 3.0 (download: http://maven.apache.org/download.cgi)

Run the project

 

    1. Go to the project's Gitee page (https://gitee.com/ssssssss-team/spider-flow) and download the source into your working directory

    2. Configure the Maven repository in Eclipse: menu Window -> Preferences -> Maven -> User Settings, click Browse next to User Settings, select the settings.xml file from the conf directory of your Maven installation, then click Apply and OK

    3. Import the project into Eclipse: menu File -> Import, select Maven -> Existing Maven Projects, click Next >, choose the working directory, then click Finish

    4. Import the database; the base tables are in spider-flow/db/spiderflow.sql

    5. Open and run org.spiderflow.SpiderApplication.java

    6. Open a browser and visit http://localhost:8088/
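
For orientation, step 5 launches a standard Spring Boot application. A minimal sketch of such an entry point follows; only the class name org.spiderflow.SpiderApplication comes from the steps above, and the real class in the repository may carry extra annotations and configuration.

package org.spiderflow;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Minimal sketch of a Spring Boot entry point; the actual
// SpiderApplication in the repository may differ.
@SpringBootApplication
public class SpiderApplication {
    public static void main(String[] args) {
        SpringApplication.run(SpiderApplication.class, args);
    }
}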

 

 

Import plug-ins

 

    1. First download the required plug-in locally and import it into your workspace, or install it into your local Maven repository

    2. Add the plug-in dependency in spider-flow/spider-flow-web/pom.xml

 

 

<!-- Taking the MongoDB plug-in as an example -->
<dependency>
	<groupId>org.spiderflow</groupId>
	<artifactId>spider-flow-mongodb</artifactId>
</dependency>
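
Once the dependency is added, rebuild and restart the application; the nodes contributed by the plug-in should then be available in the flowchart editor. If the plug-in's version is not managed by the parent pom, a <version> element may also be needed.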

Quick Start

Crawl node

This node is used to request HTTP/HTTPS pages or interfaces.

  • Request method: GET, POST, PUT, DELETE, and other methods
  • URL: the request address
  • Delay: in milliseconds; the node waits this long before the crawl is performed
  • Timeout: the network request timeout, also in milliseconds
  • Proxy: the proxy to use for the request, in host:port format, e.g. 192.168.1.26:8888
  • Encoding: the page encoding, UTF-8 by default; change this value when garbled characters appear during parsing
  • Follow redirects: 30x redirects are followed by default; uncheck this when the behavior is not needed
  • TLS certificate verification: checked by default; when certificate errors occur, try unchecking it
  • Automatic cookie management: cookies are set automatically on each request (cookies you set manually and cookies returned by previous requests are both included)
  • Automatic deduplication: when checked, URLs are deduplicated; a repeated URL is skipped
  • Retry count: the number of retries when the request fails or the status code is not 200
  • Retry interval: the interval between retries, in milliseconds
  • Parameters: sets request parameters for GET, POST, and other methods
    • Parameter name: the parameter key
    • Parameter value: the parameter value
    • Parameter description: only describes the parameter (a remark/comment); it has no functional effect
  • Cookie: sets request cookies
    • Cookie name: the cookie key
    • Cookie value: the cookie value
    • Description: only describes the cookie (a remark/comment); it has no functional effect
  • Header: sets request headers
    • Header name: the header key
    • Header value: the header value
    • Description: only describes the header (a remark/comment); it has no functional effect
  • Body: the request body type (default none)
    • form-data (Body set to form-data)
      • Parameter name: the request parameter name
      • Parameter value: the request parameter value
      • Parameter type: text/file
      • File name: the file name required when uploading binary data
    • raw (Body set to raw)
      • Content-Type: e.g. text/plain or application/json
      • Content: the request body content (a string)
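
To relate these options to plain Java, here is a rough sketch of the kind of request the Crawl node performs, written with Jsoup as the HTTP client. This is an assumption-laden illustration: the URL, proxy, header, cookie, and parameter values are invented, and the real node adds behavior (deduplication, retries, cookie management) not shown here.

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class CrawlNodeSketch {
    public static void main(String[] args) throws Exception {
        Thread.sleep(1000);                                   // delay before crawling, in ms
        Connection.Response resp = Jsoup.connect("https://example.com/list") // hypothetical URL
                .method(Connection.Method.GET)                // request method
                .timeout(60000)                               // timeout in ms
                .proxy("192.168.1.26", 8888)                  // proxy in host:port form
                .followRedirects(true)                        // follow 30x redirects
                .header("User-Agent", "Mozilla/5.0")          // request header
                .cookie("session", "abc123")                  // request cookie
                .data("page", "1")                            // request parameter
                .ignoreContentType(true)                      // tolerate non-HTML responses
                .execute();
        System.out.println(resp.statusCode());
        System.out.println(resp.parse().title());             // parse the body as HTML
    }
}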

Define variable (Var)

This node defines variables; combined with expressions, it allows parameters to be set dynamically (for example, a paginated request URL).

  • Variable name: the name of the variable; when a name is repeated, the new value overwrites the previous one
  • Variable value: the value of the variable, which can be a constant or an expression
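
For example (the URL and variable names here are invented for illustration): define a variable page with value 1 and a variable url with value https://example.com/list?page=${page}; a Crawl node can then use ${url} as its request address, and a later Var node can set page to ${page + 1} to step through pages. The ${...} form is the same expression syntax that appears in the SQL example later in this article.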

Output node

This node is mainly used for debugging: during testing the output is printed to the page, and it can also automatically save results to a database or file.

  • Output to database: when checked, fill in the data source and table name; each output item must correspond to a column name (see the example after this list)
  • Output to CSV file: when checked, enter the CSV file path; the output items are used as the header row
  • Output all parameters: generally used for debugging; outputs all variables to the interface
  • Output item: the name of the output item
  • Output value: the output value, which can be a constant or an expression
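
As a small example (the table and item names are invented): when outputting to a database with table name article, an output item named title with value ${item.title} would be written to the title column of that table.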

Loop node

  • Times or collection: when this field has a value (a collection or a number), subsequent nodes (including this node) are executed in a loop
  • Loop variable: defaults to item; same meaning as item in for (Object item : collections)
  • Loop index: during the loop, an index (starting at 0) is generated and stored in the variable named here; it has the same meaning as i in for (int i = 0; i < array.length; i++)
  • Start position: the loop starts from this position (counting from 0)
  • End position: the loop ends at this position (-1 is the last item, -2 the second-to-last, and so on)
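
A minimal Java analogy for these settings, assuming the end position is inclusive (the collection and positions below are made-up values):

import java.util.List;

public class LoopNodeSketch {
    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e"); // the "times or collection" value
        int start = 1;                                         // start position (from 0)
        int end = -2;                                          // end position (-1 = last, -2 = second-to-last)
        int endIndex = end < 0 ? items.size() + end : end;     // resolve negative positions
        for (int i = start; i <= endIndex; i++) {              // i plays the role of the loop index
            String item = items.get(i);                        // item plays the role of the loop variable
            System.out.println(i + " -> " + item);             // subsequent nodes would see both
        }
    }
}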

Execute SQL

Mainly used to interact with the database (query/modify/insert/delete, etc.)

  • Data source: select one of the configured data sources
  • Statement type: select/selectInt/selectOne/insert/insertofPk/update/delete
  • SQL: the SQL statement to execute; parameters that must be injected dynamically are wrapped in #...#, for example #${item[index].id}# (see the example below)
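
For instance (the table and field names here are invented for illustration), an insert statement fed by a loop over crawled results might be written as:

insert into article(title, url) values(#${item.title}#, #${item.url}#)

Each #...# placeholder is evaluated as an expression and injected as a bound parameter rather than concatenated into the SQL text.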

Process execution

Process Instance 1

[Figure: Process Instance 1 flowchart]

It is easy to see that the execution order is A -> B -> C -> D. But since node A is a loop, assume A loops 3 times: the execution then becomes A,A,A -> B,B,B -> C,C,C -> D,D,D (the three A executions run together, but their order is not fixed; each execution flows straight on to the next node instead of waiting for all three As to finish). Once D,D,D have executed, there is no next node to flow to, so the whole process ends.

Loops can also be set on nodes B, C, and D. Suppose node C is additionally set to loop 2 times; the whole process then executes as A,A,A -> B,B,B -> C,C,C,C,C,C -> D,D,D,D,D,D (i.e., a nested loop is formed).
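
The multiplication of executions can be checked with a small Java analogy (purely illustrative; the real engine runs the iterations concurrently rather than in this sequential order):

public class ProcessInstance1Sketch {
    public static void main(String[] args) {
        int bRuns = 0, cRuns = 0, dRuns = 0;
        for (int a = 0; a < 3; a++) {        // node A loops 3 times
            bRuns++;                         // each A execution flows on to B
            for (int c = 0; c < 2; c++) {    // node C itself loops 2 times
                cRuns++;
                dRuns++;                     // each C execution flows on to D
            }
        }
        // Prints: B=3, C=6, D=6
        System.out.println("B=" + bRuns + ", C=" + cRuns + ", D=" + dRuns);
    }
}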

Process Instance 2

[Figure: Process Instance 2 flowchart]

  • Running sequence: A -> B -> A,C -> B -> C
    • Execute node A
    • When A finishes, execute node B
    • When B finishes, execute nodes A and C
    • In total, A executes twice, B twice, and C twice.

This forms a recursion (A <-> B). In such cases a limiting condition is usually needed, which is the "number of pages < 3" condition in the figure above.
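
In plain Java, this corresponds to mutual recursion with a guard condition. The sketch below mirrors the figure's "number of pages < 3" limit; the node bodies are invented for illustration:

public class ProcessInstance2Sketch {
    static void nodeA(int page) {
        if (page >= 3) return;               // guard condition: number of pages < 3
        System.out.println("A: request page " + page);
        nodeB(page);
    }
    static void nodeB(int page) {
        System.out.println("B: parse page " + page);
        nodeA(page + 1);                     // B flows back to A: the recursion
        nodeC(page);                         // ...and also on to C
    }
    static void nodeC(int page) {
        System.out.println("C: save page " + page);
    }
    public static void main(String[] args) {
        nodeA(1);                            // A and B each run twice, C runs twice
    }
}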

Project screenshots

[Screenshot 1]

[Screenshot 2]

Crawler test

[Screenshot: crawler test]

Debug

[Screenshot: debugging]