Gecco is a lightweight web crawler developed in the Java language

This issue recommends Gecco, a lightweight, easy-to-use web crawler written in Java that integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit and redission.

Introduction to Gecco

Gecco is a lightweight, easy-to-use web crawler developed in Java. It integrates excellent frameworks such as jsoup, httpclient, fastjson, spring, htmlunit and redission, so that you only need to configure some jQuery-style selectors to quickly write a crawler. The framework also has excellent extensibility: it is designed around the open-closed principle, closed for modification and open for extension.

Main features

    • Easy to use: extract elements with jQuery-style selectors
    • Supports dynamic configuration and loading of crawl rules
    • Supports asynchronous Ajax requests in pages
    • Supports extraction of JavaScript variables from pages
    • Distributed crawling with Redis; see gecco-redis
    • Business logic development combined with Spring; see gecco-spring
    • htmlunit extension; see gecco-htmlunit
    • Plugin extension mechanism
    • Random UserAgent selection when downloading
    • Random download proxy server selection
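The random UserAgent feature can be pictured as drawing from a pool before each download. The sketch below is a self-contained illustration of that idea; the class name and the pool contents are illustrative, not gecco's internals:

```java
import java.util.List;
import java.util.Random;

class UserAgentPicker {
    // A small pool of desktop UserAgent strings; a real pool would be larger.
    private static final List<String> POOL = List.of(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
    );

    private final Random random = new Random();

    /** Returns a randomly chosen UserAgent for the next request. */
    String next() {
        return POOL.get(random.nextInt(POOL.size()));
    }
}
```

Rotating the UserAgent per request makes the traffic look less uniform; the proxy-server feature works the same way, drawing a proxy at random from a configured list.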

Framework Overview

GeccoEngine

GeccoEngine is a crawler engine, and each crawler engine is preferably an independent process. In distributed crawler scenarios, it is recommended that each crawler server (physical machine or virtual machine) run a GeccoEngine. The crawler engine consists of five main modules: Scheduler, Downloader, Spider, SpiderBeanFactory and PipelineFactory.

Downloader

Downloader obtains the requests to be downloaded from the Scheduler. By default, gecco uses httpclient 4.x as the download engine; by implementing the Downloader interface you can plug in a download engine of your own. You can also define a BeforeDownload and AfterDownload for each request to meet individual download requirements.
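The pluggable design described above can be modelled without the framework. The interfaces below are simplified stand-ins for gecco's Downloader and its before/after hooks, not gecco's real signatures:

```java
// Simplified stand-in for gecco's Downloader abstraction (hypothetical types).
interface SimpleDownloader {
    String download(String url);
}

// The default engine would wrap httpclient; here a stub returns a canned body.
class StubDownloader implements SimpleDownloader {
    @Override
    public String download(String url) {
        return "<html>stub body for " + url + "</html>";
    }
}

// A decorator adding hooks around the download, mirroring the idea of
// BeforeDownload/AfterDownload handlers attached to a request.
class HookedDownloader implements SimpleDownloader {
    private final SimpleDownloader delegate;

    HookedDownloader(SimpleDownloader delegate) {
        this.delegate = delegate;
    }

    @Override
    public String download(String url) {
        System.out.println("before download: " + url); // e.g. set headers, cookies
        String body = delegate.download(url);
        System.out.println("after download: " + url);  // e.g. validate the body
        return body;
    }
}
```

Because the engine only depends on the interface, swapping httpclient for another HTTP stack means writing one new implementation, which is exactly the open-closed design the introduction mentions.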

SpiderBeanFactory

Gecco renders the downloaded content into SpiderBeans; every JavaBean rendered by the crawler implements SpiderBean. SpiderBeans are further divided into HtmlBeans and JsonBeans, which correspond to rendering HTML pages and JSON data respectively. SpiderBeanFactory matches the corresponding SpiderBean based on the requested URL and generates its context, SpiderBeanContext. The context tells the SpiderBean which renderer it uses, which downloader it uses, and which pipelines process it after rendering.

Spider

The core of the Gecco framework is the Spider thread; one engine can run multiple Spider threads at the same time. Spider describes the basic skeleton of the framework: it obtains a request from the Scheduler, matches the SpiderBean class through SpiderBeanFactory, finds the SpiderBean's context through that class, downloads the page and renders it into the SpiderBean, and finally passes the rendered SpiderBean to the pipelines for processing.

Use

Maven

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>x.x.x</version>
</dependency>

Fast start

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
    private String user;                    //url {user} value 

    @RequestParameter("project")
    private String project;                 //url {project} value 

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2)  .social-count")
    private String star; // Extract star

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3)  .social-count")
    private String fork; // fork

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme; // Extract readme

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getStar() {
        return star;
    }

    public void setStar(String star) {
        this.star = star;
    }

    public String getFork() {
        return fork;
    }

    public void setFork(String fork) {
        this.fork = fork;
    }

    public static void main(String[] args) {
        GeccoEngine.create()
        // Package path of the project 
        .classpath("com.geccocrawler.gecco.demo")
        // Start grabbing page address 
        .start("https://github.com/xtuhcy/gecco")
        // Open several crawler threads 
        .thread(1)
        // The interval between each request taken by a single crawler 
        .interval(2000)
        // Loop grab 
        .loop(true)
        // Use pc userAgent
        .mobile(false)
        // Run in non-blocking mode 
        .start();
    }
}

Example: using the Gecco Java crawler to crawl all JD product category information

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

	private static final long serialVersionUID = 665662335318691818L;
	
	@Request
	private HttpRequest request;

	// Mobile 
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
	private List<Category> mobile;
	
	//Home Appliances 
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
	private List<Category> domestic;

	public List<Category> getMobile() {
		return mobile;
	}

	public void setMobile(List<Category> mobile) {
		this.mobile = mobile;
	}

	public List<Category> getDomestic() {
		return domestic;
	}

	public void setDomestic(List<Category> domestic) {
		this.domestic = domestic;
	}

	public HttpRequest getRequest() {
		return request;
	}

	public void setRequest(HttpRequest request) {
		this.request = request;
	}
}

The example takes the product information of two categories, mobile phones and home appliances. Each category contains several sub-categories, expressed with List<Category>. Gecco supports nesting of beans, which expresses the HTML page structure well. Category holds the sub-category content, and HrefBean is gecco's common link bean.

public class Category implements  HtmlBean {

	private static final long serialVersionUID = 3018760488621382659L;

	@Text
	@HtmlField(cssPath="dt a")
	private String parentName;
	
	@HtmlField(cssPath="dd a")
	private List<HrefBean> categorys;

	public String getParentName() {
		return parentName;
	}

	public void setParentName(String parentName) {
		this.parentName = parentName;
	}

	public List<HrefBean> getCategorys() {
		return categorys;
	}

	public void setCategorys(List<HrefBean> categorys) {
		this.categorys = categorys;
	}
	
}

Tips for obtaining page elements' cssPath

The difficulty in the two classes above lies in obtaining the cssPath. Here are some tips. Open the page you want to crawl in Chrome and press F12 to enter developer mode, then select the element you want to extract.


Right-click the element and choose Copy > Copy selector to get its cssPath:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you know jQuery selectors, and we only want the dl elements, this can be simplified to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl

After AllSort has been rendered and injected, we need to process it with business logic. Here we do not persist the category information; we only extract the category links and go on to grab the product list pages. See the code:

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

	@Override
	public void process(AllSort allSort) {
		List<Category> categorys = allSort.getMobile();
		for(Category category : categorys) {
			List<HrefBean> hrefs = category.getCategorys();
			for(HrefBean href : hrefs) {
				String url = href.getUrl()+"&delivery=1&page=1&JL=4_10_0&go=0";
				HttpRequest currRequest = allSort.getRequest();
				SchedulerContext.into(currRequest.subRequest(url));
			}
		}
	}
}
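The pipeline above appends a hard-coded "&…" suffix, which assumes the category link already carries a query string. A small plain-Java helper (the name is hypothetical, not part of gecco) makes the append safe either way:

```java
class UrlParams {
    /** Appends a query-string fragment, using '?' or '&' as appropriate. */
    static String appendQuery(String url, String query) {
        if (query == null || query.isEmpty()) {
            return url;
        }
        return url + (url.contains("?") ? "&" : "?") + query;
    }
}
```

In the pipeline this would be used as `UrlParams.appendQuery(href.getUrl(), "delivery=1&page=1&JL=4_10_0&go=0")`, producing a valid URL whether or not the link already has parameters.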

@PipelineName defines the name of the pipeline and is referenced by the @Gecco annotation on AllSort, so that gecco invokes the defined pipelines one by one after it has extracted and injected the bean. The suffix "&delivery=1&page=1&JL=4_10_0&go=0" appended to each sub-link limits the crawl to products that JD itself sells and has in stock. The SchedulerContext.into() method places the links to be grabbed in a queue for further fetching.
