NEOCrawler (Chinese: Cow coffee) – crawler system

This issue recommends NEOCrawler (Chinese: 牛coffee), a crawler system implemented with Node.js, Redis, and PhantomJS. The code is completely open source, which makes it suitable for data collection in vertical domains and for secondary development of crawlers. Key features:

Implemented in Node.js: JavaScript is simple, efficient, and easy to learn, which saves a great deal of time both for developing the crawler and for users doing secondary development on it. Node.js uses Google's V8 as its runtime engine, so performance is considerable. Because Node.js is non-blocking and asynchronous, it performs very well for IO-intensive, CPU-light workloads such as running crawlers: the development effort is lower than with C/C++/Java, and the performance is higher than a multi-threaded Java implementation or an asynchronous/coroutine implementation in Python.

The dispatch center is responsible for URL scheduling, and the crawler processes run in a distributed manner: the central scheduler decides which URLs to fetch in each time slice and coordinates the work of the individual crawlers, so the failure of a single crawler node does not affect the system as a whole.

While crawling, the crawler performs structured parsing of each web page and extracts the required data fields, so what is stored is not only the page source code but also structured field data. This makes the data usable immediately after a page is crawled and also makes accurate content ranking at storage time easier.

PhantomJS is integrated. PhantomJS is a headless web browser that does not require a graphical environment and can be used to crawl pages whose content is only produced after JavaScript executes. User actions can be performed on the page via injected JS statements, so that, for example, the next page can be crawled after a form is filled in and submitted, or after a button click triggers a page transition.
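As a rough standalone illustration of this capability (not code taken from NEOCrawler itself), a PhantomJS script can open a page, inject JS to fill in and submit a form, and then read the content of the page that follows; the URL and selectors below are placeholders:

// form_submit.js – run with: phantomjs form_submit.js (illustrative sketch only)
var page = require('webpage').create();

page.open('http://example.com/search', function (status) {
    if (status !== 'success') {
        console.log('failed to load page');
        phantom.exit(1);
        return;
    }
    // Inject a JS statement into the page to fill in the form and submit it
    page.evaluate(function () {
        document.querySelector('input[name="q"]').value = 'keyword';
        document.querySelector('form').submit();
    });
    // Wait for the next page to render, then dump its HTML
    window.setTimeout(function () {
        console.log(page.content);
        phantom.exit();
    }, 3000);
});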

Proxy IP support is integrated. This feature targets sites with anti-crawling measures (per-IP limits on visit counts or traffic, or heuristic crawler detection). You supply a pool of usable proxy IPs, and the crawler independently selects a proxy IP that can still reach the source website, so the source website cannot block the crawl.
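For illustration only (NEOCrawler's ProxyRouter handles this selection automatically), sending a single request through one proxy IP from the pool might look like the following Node.js sketch, using the widely used request module; the proxy address and target URL are placeholders:

// proxy_fetch.js – illustrative sketch; npm install request
var request = require('request');

request({
    url: 'http://target-site.example/page.html',
    proxy: 'http://1.2.3.4:8080',   // a usable proxy IP chosen from the pool
    timeout: 60000                  // give up after 60 seconds
}, function (err, res, body) {
    if (err) {
        // e.g. the proxy is dead or the source site rejected it; try another proxy
        console.error('request failed:', err.message);
        return;
    }
    console.log('status:', res.statusCode, 'bytes:', body.length);
});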

Configurable items (a hypothetical rule sketch follows this list):

1). URL classes are described with regular expressions; similar web pages are grouped into one class and share the same rules. One crawler system can be configured with multiple URL classes (the items below are the configurable items for one URL class);

2). Start addresses, crawling method, storage location, page-processing method, etc.;

3). Rules for the links to be collected; CSS selectors are used to restrict the crawler to collecting only the links that appear in certain positions on the page;

4). Page extraction rules: CSS selectors and regular expressions can be used to locate the content of each field to be extracted;

5). Predefined JS statements to inject after the page is opened;

6). Cookies to set for the site;

7). Rules for judging whether this kind of page was returned normally, usually keywords that must be present on a normally returned page, so that the crawler can detect abnormal responses;

8). Rules for judging whether data extraction is complete: several essential fields among the extracted fields are chosen as the criterion for completeness;

9). The scheduling weight (priority) of the page and its refresh cycle (how long before it is recrawled and updated).
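To make these per-class items concrete, below is a hypothetical sketch of a single URL-class rule, written in the same commented-JSON style as the setting.json example later in this article. All field names here are invented for illustration; the real rules are edited through the web configuration backend and may use a different schema.

{
/* Illustrative only: field names are hypothetical, comments are for explanation */
"url_pattern": "^http://news\\.example\\.com/article/\\d+\\.html$", /* regex that groups similar pages into this class */
"seed_urls": ["http://news.example.com/"], /* start addresses */
"link_selector": "div.article-list a", /* only collect links that appear in this position on the page */
"extract_fields": {
    "title": { "css": "h1.title" },
    "pubdate": { "css": "span.date", "regex": "\\d{4}-\\d{2}-\\d{2}" },
    "body": { "css": "div.content" }
}, /* CSS selectors and regular expressions locating each field */
"inject_js": "document.querySelector('#load-more').click();", /* JS to run after the page opens */
"cookies": "uid=123; lang=en", /* cookies set for the site */
"page_valid_keyword": "article-footer", /* keyword that must exist on a normally returned page */
"required_fields": ["title", "body"], /* extraction counts as complete only if these fields are present */
"schedule_priority": 5, /* scheduling weight */
"recrawl_interval": 86400 /* seconds before the page is recrawled and updated */
}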

Architecture

(Figure: NEOCrawler crawler system architecture)

The yellow part of the figure shows the subsystems of the crawler system

SuperScheduler is the central scheduler. The Spider crawlers put the URLs they collect into the URL libraries corresponding to the various URL classes, and SuperScheduler extracts the appropriate number of URLs from each URL library according to the scheduling rules and places them into the queue to be crawled.

Spider is the distributed crawler. It takes tasks from the queue prepared by the scheduler, crawls them, puts the URLs it discovers into the URL libraries, and stores the extracted content. The crawler is divided into a core plus download, extract, and pipeline middleware (four parts in all), so that any one of these functions can easily be re-customized in a crawler instance.

ProxyRouter intelligently routes crawler requests to available proxy IPs when using proxy IPs.

Webconfig is the web backend for configuring crawler rules.
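As an illustration of this division of labor (not NEOCrawler's actual implementation), the following minimal Node.js sketch mimics the scheduler/spider handshake with Redis lists and sets; the key names (urls:<class>, queue:crawl) and the use of the ioredis client are assumptions made for the example:

// scheduler_spider_sketch.js – illustrative only; key names are hypothetical
const Redis = require('ioredis');                       // npm install ioredis
const redis = new Redis({ host: '127.0.0.1', port: 6379, db: 0 });

// Scheduler side: in each time slice, move a bounded number of URLs from every
// URL-class library into the shared crawl queue.
async function schedule(urlClasses, quotaPerClass) {
    for (const cls of urlClasses) {
        const urls = await redis.srandmember('urls:' + cls, quotaPerClass);
        if (urls.length > 0) {
            await redis.rpush('queue:crawl', ...urls);
        }
    }
}

// Spider side: pop one task from the crawl queue, fetch and extract it, then
// push any newly discovered links back into the URL library.
async function crawlOnce() {
    const url = await redis.lpop('queue:crawl');
    if (!url) return;
    console.log('would download and extract:', url);
    // await redis.sadd('urls:news', ...discoveredUrls);   // discovered links go back to the library
}

schedule(['news'], 10)
    .then(crawlOnce)
    .then(() => redis.disconnect());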

Steps to run

Prepare the runtime environment

Install the Node.js environment, clone the source code from the git repository to your machine, open a command prompt in the project folder, and run "npm install" to install the dependent modules;
Install a Redis server (both Redis and SSDB are supported; SSDB can be used to save memory, and the type can be specified in setting.json, as described below);
Set up the HBase environment: crawled web pages and the extracted data are stored in HBase. After installing HBase, enable its HTTP REST service, which will be used in the configuration later. If you want to use other database storage, you do not need to install HBase; a later section explains how to turn off the HBase function and customize your own storage. Initialize the HBase column families in the HBase shell:
create 'crawled', {NAME => 'basic', VERSIONS => 3}, {NAME => 'data', VERSIONS => 3}, {NAME => 'extra', VERSIONS => 3}
create 'crawled_bin', {NAME => 'basic', VERSIONS => 3}, {NAME => 'binary', VERSIONS => 3}

It is recommended to use HBase in REST mode. After starting HBase, run the following command in the bin subdirectory of the HBase directory to start HBase REST:

./hbase-daemon.sh start rest

The default port is 8080, which will be used in the following configuration.
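As a quick sanity check (not part of NEOCrawler), the REST service can be queried with a small Node.js script; a GET on the root path should list the tables created above, so 'crawled' and 'crawled_bin' ought to appear in the output. The host and port below assume the defaults:

// check_hbase_rest.js – minimal sketch to verify the HBase REST service is up
const http = require('http');

const options = { host: 'localhost', port: 8080, path: '/', headers: { Accept: 'application/json' } };

http.get(options, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => {
        console.log('HTTP', res.statusCode);
        console.log(body);   // should include the 'crawled' and 'crawled_bin' tables
    });
}).on('error', (err) => {
    console.error('HBase REST not reachable:', err.message);
});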

Instance configuration

In the instance directory, copy the example instance and rename the copy to a new instance name, for example abc.

Edit instance/abc/setting.json

{
/* Note: the comments here only explain the settings; the real setting.json cannot contain comments */

"driller_info_redis_db": ["127.0.0.1", 6379, 0], /* where the URL rule configuration is stored; the last number is the Redis database index */
"url_info_redis_db": ["127.0.0.1", 6379, 1], /* where URL information is stored */
"url_report_redis_db": ["127.0.0.1", 6380, 2], /* where crawl error reports are stored */
"proxy_info_redis_db": ["127.0.0.1", 6379, 3], /* where HTTP proxy IP information is stored */
"use_proxy": false, /* whether to use the proxy service */
"proxy_router": "127.0.0.1:2013", /* when the proxy service is used, the address of the proxy routing center */
"download_timeout": 60, /* download timeout in seconds; not the same as the response timeout */
"save_content_to_hbase": false, /* whether to store the scraped content in HBase; so far only tested against 0.94 */
"crawled_hbase_conf": ["localhost", 8080], /* HBase REST configuration; a TCP-mode connection is also possible, configured as {"zookeeperHosts": ["localhost:2181"], "zookeeperRoot": "/hbase"}, but that mode has an OOM bug and is not recommended */
"crawled_hbase_table": "crawled", /* HBase table in which scraped data is stored */
"crawled_hbase_bin_table": "crawled_bin", /* HBase table in which scraped binary data is stored */
"statistic_mysql_db": ["127.0.0.1", 3306, "crawling", "crawler", "123"], /* used to store analysis results of the crawl logs; requires flume and is generally not used */
"check_driller_rules_interval": 120, /* how often (seconds) to check for changed URL rules and hot-reload them into running crawlers */
"spider_concurrency": 5, /* number of concurrent page requests per crawler */
"spider_request_delay": 0, /* delay in seconds between two requests */
"schedule_interval": 60, /* interval in seconds between two scheduling runs */
"schedule_quantity_limitation": 200, /* maximum number of URLs the scheduler hands to the crawlers per run */
"download_retry": 3, /* number of retries on error */
"log_level": "DEBUG", /* log level */
"use_ssdb": false, /* whether to use SSDB */
"to_much_fail_exit": false, /* whether the crawler terminates automatically when there are too many errors */
"keep_link_relation": false /* whether the link repository stores the relationships between links */
}

Run

Configure the scraping rules in the web interface

Debug whether crawling a single URL works

Run the scheduler (only one scheduler needs to be started)

If you use proxy IPs for scraping, start the proxy router

Start the crawlers (multiple crawlers can be started, in a distributed manner)

The specific launch commands are as follows:

1. Run the web configuration backend (see the next section for how to configure the rules)

node run.js -i abc -a config -p 8888

Open http://localhost:8888 in your browser to configure scraping rules in the web interface

2. Test a single page crawl

node run.js -i abc -a test -l "http://domain/page/"

3. Run the scheduler

node run.js -i abc -a schedule
-i specifies the instance name, and -a specifies the action, in this case schedule

4. Run the proxy router (only needed if you use proxy IPs for scraping)

node run.js -i abc -a proxy -p 2013

The -p here specifies the port of the proxy router; if it runs on the local machine, the proxy_router entry in setting.json is 127.0.0.1:2013.

5. Run the crawler

node run.js -i abc -a crawl

You can view the output log debug-result.json under instance/example/logs

Redis/ssdb data structure

Understanding the data structures will help you become familiar with the whole system for secondary development. NEOCrawler uses four storage spaces: driller_info_redis_db, url_info_redis_db, url_report_redis_db, and proxy_info_redis_db, configured in setting.json under the instance directory. The key categories of the four spaces differ, so the key names will not conflict, and you can point all four spaces at a single Redis/SSDB database. However, each space grows at a different rate; if you use Redis, it is recommended to give each space its own DB and, if possible, its own Redis instance. A small inspection sketch follows the space descriptions below.

driller_info_redis_db

This space stores the crawl rules and the URLs.

url_info_redis_db

This space stores URL information; the longer the crawl runs, the larger the amount of data becomes.

url_report_redis_db

This space stores crawl reports

proxy_info_redis_db

This space stores data related to proxy IPs
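As a small illustration (not part of NEOCrawler), the four spaces can be inspected with the following Node.js sketch. The ioredis client is an assumption, and the host/port/db triples simply mirror the sample setting.json above; adjust them to match your own configuration:

// inspect_spaces.js – illustrative sketch; npm install ioredis
const Redis = require('ioredis');

const spaces = {
    driller_info_redis_db: ['127.0.0.1', 6379, 0],
    url_info_redis_db:     ['127.0.0.1', 6379, 1],
    url_report_redis_db:   ['127.0.0.1', 6380, 2],
    proxy_info_redis_db:   ['127.0.0.1', 6379, 3]
};

(async () => {
    for (const [name, [host, port, db]] of Object.entries(spaces)) {
        const redis = new Redis({ host, port, db });
        const keyCount = await redis.dbsize();   // number of keys currently in this space
        console.log(name + ' -> ' + host + ':' + port + '/' + db + ', ' + keyCount + ' keys');
        redis.disconnect();
    }
})();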
