Efficient and intelligent online crawler: spider-flow


The project recommended in this issue, spider-flow, is a platform for building crawlers as flow charts: a simple crawler can be assembled quickly without writing any code.

[Figure: spider-flow illustration]

Project features

  • Supports CSS selector and regular-expression (RE) extraction
  • Supports JSON/XML formats
  • Supports XPath/JsonPath extraction
  • Supports multiple data sources and SQL select/insert/update/delete
  • Supports crawling JS-rendered (dynamic) pages
  • Supports proxies
  • Supports binary formats
  • Supports saving/reading files (csv, xls, jpg, etc.)
  • Common functions for strings, dates, files, encryption/decryption, randomness, etc.
  • Supports process (flow) nesting
  • Supports plug-in extensions (custom executors, custom functions, custom controllers, type extensions, etc.)
  • Supports an HTTP interface

Installation and deployment

Prepare the environment

JDK >= 1.8
MySQL >= 5.7
Maven >= 3.0 (download: http://maven.apache.org/download.cgi)

Run the project

    1. Download the source code from the Gitee repository (https://gitee.com/ssssssss-team/spider-flow) into your working directory
    2. Point Eclipse at your Maven settings: menu Window -> Preferences -> Maven -> User Settings, click Browse next to User Settings and select the settings.xml file in the conf directory of your Maven installation, then click Apply and OK
    3. Import the project into Eclipse: menu File -> Import, select Maven -> Existing Maven Projects, click Next, select the working directory, then click Finish
    4. Import the database; the base tables are in spider-flow/db/spiderflow.sql
    5. Open and run org.spiderflow.SpiderApplication
    6. Open a browser and visit http://localhost:8088/

Import plug-ins

    1. Download the required plug-in locally and import it into your workspace, or install it into your local Maven repository
    2. Add the plug-in dependency in spider-flow/spider-flow-web/pom.xml


<!-- Example: importing the mongodb plug-in -->
<dependency>
	<groupId>org.spiderflow</groupId>
	<artifactId>spider-flow-mongodb</artifactId>
</dependency>

Quick Start

Crawl node

This node is used to request HTTP/HTTPS pages or interfaces; a minimal request sketch follows the option list below.

  • Request method: GET, POST, PUT, DELETE, etc.
  • URL: the request address
  • Delay time: in milliseconds; wait this long before the request is executed
  • Timeout: the network request timeout, also in milliseconds
  • Proxy: the proxy to use for the request, in host:port format, for example 192.168.1.26:8888
  • Encoding format: the default page encoding, UTF-8; change this value when parsed text comes out garbled
  • Follow redirects: 3xx redirects are followed by default; uncheck this when the behaviour is not wanted
  • TLS certificate verification: checked by default; when a certificate-related exception occurs, try unchecking it
  • Automatic Cookie management: automatically sets cookies on each request (both cookies set manually and cookies from previous responses)
  • Automatic deduplication: when selected, requests are deduplicated by URL; a repeated URL is skipped
  • Number of retries: how many times to retry when the request throws an exception or the status code is not 200
  • Retry interval: the interval between retries, in milliseconds
  • Parameter: parameter settings for GET, POST, and other methods
    • Parameter name: the parameter key
    • Parameter value: the parameter value
    • Parameter description: describes the parameter (a remark/comment); it has no functional effect
  • Cookie: sets a request Cookie
    • Cookie name: the Cookie key
    • Cookie value: the Cookie value
    • Description: describes the Cookie (a remark/comment); it has no functional effect
  • Header: sets a request header
    • Header name: the Header key
    • Header value: the Header value
    • Description: describes the Header (a remark/comment); it has no functional effect
  • Body: request body type (none by default)
  • form-data (when Body is set to form-data)
    • Parameter name: the request parameter name
    • Parameter value: the request parameter value
    • Parameter type: text/file
    • File name: the file name required when uploading binary data
  • raw (when Body is set to raw)
    • Content-Type: text/plain, application/json
    • Content: the request body content (a string)
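
To make the options above concrete, here is a minimal sketch of an equivalent request written against the standard java.net.http client. This is not spider-flow's internal code; the URL and header values are invented, and the proxy address reuses the host:port example above.

import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class CrawlNodeSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)                          // "Follow redirects"
                .proxy(ProxySelector.of(new InetSocketAddress("192.168.1.26", 8888))) // "Proxy" (host:port)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/list?page=1"))
                .timeout(Duration.ofMillis(60000))            // "Timeout", in milliseconds
                .header("User-Agent", "spider-flow-demo")     // "Header" settings
                .GET()                                        // "Request method"
                .build();
        for (int attempt = 0; attempt <= 2; attempt++) {      // "Number of retries"
            HttpResponse<String> resp = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (resp.statusCode() == 200) {                   // retry while the status code is not 200
                System.out.println("fetched " + resp.body().length() + " chars");
                break;
            }
            Thread.sleep(1000);                               // "Retry interval", in milliseconds
        }
    }
}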

Define variable (Var)

This node defines variables which, combined with expressions, allow parameters to be set dynamically (such as a dynamic paging address); a short sketch follows the list below.

  • Variable name: the name of the variable; when a name is repeated, the new definition overwrites the previous variable
  • Variable value: the value of the variable; either a constant or an expression
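
As a rough illustration (not spider-flow's actual expression engine), defining a page variable lets a crawl node's URL be built dynamically; the URL and variable name below are hypothetical.

import java.util.Map;

public class VarNodeSketch {
    public static void main(String[] args) {
        Map<String, Object> vars = Map.of("page", 2);  // Var node: name "page", value 2
        String urlTemplate = "https://example.com/list?page=${page}";
        String url = urlTemplate.replace("${page}", String.valueOf(vars.get("page")));
        System.out.println(url);  // https://example.com/list?page=2
    }
}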

Output node

This node is mainly used for debugging: during a test run its output is printed to the page, and it can also save results automatically to a database or file (a CSV sketch follows the list below).

  • Output to database: when checked, fill in the data source and table name; the output item names must correspond to the column names
  • Output to CSV file: when checked, fill in the CSV file path; the output item names are used as the header row
  • Output all parameters: generally used for debugging; outputs all variables to the interface
  • Output item: the name of the output item
  • Output value: the output value; either a constant or an expression
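
A sketch of what the CSV output amounts to, with an invented file path and output items: the output-item names form the header row and the output values form the data rows.

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class OutputNodeSketch {
    public static void main(String[] args) throws Exception {
        try (PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("result.csv"), StandardCharsets.UTF_8))) {
            out.println("title,url");                               // output items as the CSV header
            out.println("Example article,https://example.com/a/1"); // one row of output values
        }
    }
}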

Loop node

  • Times or sets: when this field has a value (a number or a collection), subsequent nodes (including this one) are executed in a loop
  • Loop variable: defaults to item; the same meaning as item in for (Object item : collections)
  • Loop subscript: during the loop, a subscript (starting at 0) is stored in the variable with this name; the same meaning as i in for (int i = 0; i < array.length; i++) (see the sketch after this list)
  • Start position: the loop starts from this position (counting from 0)
  • End position: the loop ends at this position (-1 is the last item, -2 the second-to-last, and so on)
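
The same settings expressed as a plain Java loop; the collection and positions are made-up values.

import java.util.List;

public class LoopNodeSketch {
    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e");
        int start = 1;                         // "Start position" (counting from 0)
        int end = -2;                          // "End position" (-1 = last, -2 = second-to-last, ...)
        int endIndex = items.size() + end;     // resolve the negative end position
        for (int i = start; i <= endIndex; i++) {  // i is the "loop subscript"
            String item = items.get(i);            // "loop variable", default name item
            System.out.println(i + ": " + item);   // prints 1: b, 2: c, 3: d
        }
    }
}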

Execute SQL

This node is mainly used to interact with the database (query/modify/insert/delete, etc.)

  • Data source: select one of the configured data sources
  • Statement type: select/selectInt/selectOne/insert/insertofPk/update/delete
  • SQL: the SQL statement to execute; parameters to be injected dynamically are wrapped in ##, for example #${item[index].id}# (see the JDBC sketch below)
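
As a sketch of what such a statement amounts to, assuming (not verified against spider-flow's source) that each #${...}# expression is bound like a prepared-statement parameter; the connection details and table/column names are invented.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// SQL-node statement being modelled (hypothetical):
//   insert into article (id, title) values (#${item.id}#, #${item.title}#)
public class SqlNodeSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/spiderflow", "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                "insert into article (id, title) values (?, ?)")) {
            ps.setLong(1, 42L);             // value of ${item.id}
            ps.setString(2, "demo title");  // value of ${item.title}
            ps.executeUpdate();
        }
    }
}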

Process execution

Process Instance 1

[Figure: Process Instance 1]

As is easy to see, the execution order is A -> B -> C -> D. But since node A is a loop, assuming node A loops 3 times, the execution becomes A,A,A -> B,B,B -> C,C,C -> D,D,D (the three runs of A are executed together, but their order is not fixed; each run flows directly on to the next node rather than waiting for all three runs of A to finish). When the three runs of D are done, there is no next node to flow to, and the whole process ends.

Loops can also be set on nodes B, C and D. Assuming node C is also set to loop, 2 times, the whole process executes as A,A,A -> B,B,B -> C,C,C,C,C,C -> D,D,D,D,D,D (i.e. a nested loop). A minimal sketch of this execution model follows.
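
The sketch below is not spider-flow's actual engine; it only models how each run of a node flows straight on to the next node instead of waiting for its sibling loop iterations.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FlowModelSketch {
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    static final String[] nodes = {"A", "B", "C", "D"};
    static final CountDownLatch done = new CountDownLatch(3 * nodes.length); // 3 runs of 4 nodes

    static void run(int nodeIndex, int iteration) {
        if (nodeIndex == nodes.length) return;  // no next node: this chain ends
        pool.submit(() -> {
            System.out.println(nodes[nodeIndex] + " (loop iteration " + iteration + ")");
            done.countDown();
            run(nodeIndex + 1, iteration);      // flow straight on to the next node
        });
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) run(0, i);  // node A loops 3 times
        done.await();                           // wait for all 12 node runs
        pool.shutdown();
    }
}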

Process Instance 2

[Figure: Process Instance 2]

  • Execution sequence: A -> B -> A,C -> B -> C
    • Execute node A
    • When node A finishes, execute node B
    • When node B finishes, execute nodes A and C
    • In total, A executes twice, B twice, and C twice

This forms a recursion (A <-> B). In such cases a limiting condition is usually needed, i.e. the "number of pages < 3" condition in the figure above; a compact sketch follows.
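
A sketch of the recursive A <-> B pattern (node names from the figure; page bounds invented); without the condition the recursion would never terminate.

public class RecursionSketch {
    static void nodeA(int page) {
        if (page >= 3) return;                      // the limiting condition: pages < 3
        System.out.println("A: crawl page " + page);
        nodeB(page);
    }

    static void nodeB(int page) {
        System.out.println("B: extract next-page link from page " + page);
        nodeA(page + 1);                            // B flows back to A: recursion
    }

    public static void main(String[] args) {
        nodeA(1);  // runs A twice (pages 1 and 2) and B twice, then stops
    }
}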

Project screenshots

[Screenshot 1]

[Screenshot 2]

Crawler test

[Screenshot: crawler test]

Debugging

[Screenshot: debugging]