Gecco is a lightweight web crawler developed in the Java language


This issue recommends Gecco, a lightweight and easy-to-use web crawler developed in Java that integrates jsoup, httpclient, fastjson, spring, htmlunit, redisson and other excellent frameworks.

Introduction to Gecco

Gecco is a lightweight, easy-to-use web crawler developed in Java. It integrates jsoup, httpclient, fastjson, spring, htmlunit, redisson and other excellent frameworks, so that you only need to configure a few jQuery-style selectors to quickly write a crawler. The Gecco framework also has excellent extensibility: it is designed around the open-closed principle, closed for modification and open for extension.

Main features

    • Easy to use: elements are extracted with jQuery-style selectors
    • Supports dynamic configuration and loading of crawl rules
    • Supports asynchronous ajax requests in pages
    • Supports extraction of javascript variables from pages
    • Distributed crawling with Redis; see gecco-redis
    • Business logic development combined with Spring; see gecco-spring
    • htmlunit extension; see gecco-htmlunit
    • Plugin extension mechanism
    • Random UserAgent selection when downloading
    • Random selection of download proxy servers

Framework Overview

GeccoEngine

GeccoEngine is the crawler engine; each engine is best run as an independent process. In distributed crawling scenarios, it is recommended that each crawler server (physical or virtual machine) run one GeccoEngine. The crawler engine consists of five main modules: Scheduler, Downloader, Spider, SpiderBeanFactory and PipelineFactory.

Downloader

Downloader is responsible for obtaining the requests to be downloaded from the Scheduler. Gecco uses HttpClient 4.x as the download engine by default; by implementing the Downloader interface you can plug in your own download engine. You can also define a BeforeDownload and AfterDownload for each request to meet individual download requirements.
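
As a rough sketch of what a custom download engine could look like, the class below uses java.net.HttpURLConnection. Note that the Downloader method signatures and the HttpRequest/HttpResponse accessors used here are assumptions based on the description above, not verified against the Gecco sources; treat this as an illustration rather than a drop-in implementation.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import com.geccocrawler.gecco.downloader.DownloadException;
import com.geccocrawler.gecco.downloader.Downloader;
import com.geccocrawler.gecco.request.HttpRequest;
import com.geccocrawler.gecco.response.HttpResponse;

public class UrlConnectionDownloader implements Downloader {

    @Override
    public HttpResponse download(HttpRequest request) throws DownloadException {
        return download(request, 10000); // assumed overload with a default 10s timeout
    }

    @Override
    public HttpResponse download(HttpRequest request, int timeout) throws DownloadException {
        try {
            // Open a plain HttpURLConnection instead of the default httpclient engine
            HttpURLConnection conn = (HttpURLConnection) new URL(request.getUrl()).openConnection();
            conn.setConnectTimeout(timeout);
            conn.setReadTimeout(timeout);
            HttpResponse response = new HttpResponse();  // assumed no-arg constructor
            response.setStatus(conn.getResponseCode());  // assumed setter
            try (InputStream in = conn.getInputStream()) {
                response.setContent(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // assumed setter
            }
            return response;
        } catch (IOException e) {
            throw new DownloadException(e); // assumed wrapping constructor
        }
    }

    @Override
    public void shutdown() {
        // nothing to release for HttpURLConnection
    }
}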

SpiderBeanFactory

Gecco renders the downloaded content into a SpiderBean; all Javabeans rendered by the crawler implement SpiderBean. SpiderBeans are further divided into HtmlBeans and JsonBeans, which correspond to the rendering of html pages and json data respectively. SpiderBeanFactory matches the corresponding SpiderBean based on the requested url and generates its context, SpiderBeanContext. The context tells the SpiderBean which renderer it uses, which downloader it uses, and which pipelines process it after rendering.
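
The examples in this article all use HtmlBean for html pages; for json data the bean implements JsonBean instead and is rendered with fastjson JSONPath expressions. A minimal hedged sketch, assuming Gecco's @JSONPath annotation; the url and path expressions are purely illustrative:

import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.JSONPath;
import com.geccocrawler.gecco.spider.JsonBean;

@Gecco(matchUrl = "https://api.example.com/items", pipelines = "consolePipeline")
public class ItemList implements JsonBean {

    private static final long serialVersionUID = 1L;

    // Extract every item title from the json payload (the path is an assumption)
    @JSONPath("$.items[*].title")
    private List<String> titles;

    public List<String> getTitles() {
        return titles;
    }

    public void setTitles(List<String> titles) {
        this.titles = titles;
    }
}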

Spider

The core class of the Gecco framework is the Spider thread; one crawler engine can run multiple Spider threads at the same time. Spider describes the basic skeleton of the framework: it first obtains a request from the Scheduler, matches the SpiderBean class through the SpiderBeanFactory, finds the SpiderBean context via that class, downloads the web page and renders it into the SpiderBean, and finally hands the rendered SpiderBean to the pipelines for processing.

Use

Maven

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>x.x.x</version>
</dependency>

Fast start

@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {

    private static final long serialVersionUID = -7127412585200687225L;

    @RequestParameter("user")
private String user; //url {user} value 

    @RequestParameter("project")
    private String project;                 //url {project} value 

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(2)  .social-count")
    private String star; // Extract star

    @Text
    @HtmlField(cssPath=".pagehead-actions li:nth-child(3)  .social-count")
    private String fork; // fork

    @Html
    @HtmlField(cssPath=".entry-content")
    private String readme; // Extract readme

    public String getReadme() {
        return readme;
    }

    public void setReadme(String readme) {
        this.readme = readme;
    }

    public String getUser() {
        return user;
    }

    public void setUser(String user) {
        this.user = user;
    }

    public String getProject() {
        return project;
    }

    public void setProject(String project) {
        this.project = project;
    }

    public String getStar() {
        return star;
    }

    public void setStar(String star) {
        this.star = star;
    }

    public String getFork() {
        return fork;
    }

    public void setFork(String fork) {
        this.fork = fork;
    }

    public static void main(String[] args) {
        GeccoEngine.create()
        // Package path of the project 
        .classpath("com.geccocrawler.gecco.demo")
        // Start grabbing page address 
        .start("https://github.com/xtuhcy/gecco")
        // Open several crawler threads 
        .thread(1)
        // The interval between each request taken by a single crawler 
        .interval(2000)
        // Loop grab 
        .loop(true)
        // Use pc userAgent
        .mobile(false)
        // Run in non-blocking mode 
        .start();
    }
}
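
The consolePipeline used above is a built-in pipeline that simply prints the rendered bean. A custom pipeline for MyGithub could look like the sketch below; the name "myGithubPipeline" is illustrative and would also have to be added to the pipelines attribute of the @Gecco annotation, and the import paths are assumed from the standard Gecco package layout. The same pattern is shown with AllSortPipeline further down.

import com.geccocrawler.gecco.annotation.PipelineName;
import com.geccocrawler.gecco.pipeline.Pipeline;

@PipelineName("myGithubPipeline")
public class MyGithubPipeline implements Pipeline<MyGithub> {

    @Override
    public void process(MyGithub bean) {
        // Consume the rendered bean, e.g. log the star and fork counts
        System.out.println(bean.getProject() + " stars=" + bean.getStar() + " forks=" + bean.getFork());
    }
}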

Example – Using the Gecco Java crawler to crawl all JD product information

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {

	private static final long serialVersionUID = 665662335318691818L;
	
	@Request
	private HttpRequest request;

	// Mobile 
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
	private List<Category> mobile;
	
	//Home Appliances 
	@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
	private List<Category> domestic;

	public List<Category> getMobile() {
		return mobile;
	}

	public void setMobile(List<Category> mobile) {
		this.mobile = mobile;
	}

	public List<Category> getDomestic() {
		return domestic;
	}

	public void setDomestic(List<Category> domestic) {
		this.domestic = domestic;
	}

	public HttpRequest getRequest() {
		return request;
	}

	public void setRequest(HttpRequest request) {
		this.request = request;
	}
}

As can be seen, taking the two categories of mobile phones and home appliances as an example, each category contains several sub-categories, represented with List<Category>. Gecco supports nesting of beans, which expresses the structure of an html page well. Category holds the sub-category information, and HrefBean is a common link bean provided by the framework.

public class Category implements  HtmlBean {

	private static final long serialVersionUID = 3018760488621382659L;

	@Text
	@HtmlField(cssPath="dt a")
	private String parentName;
	
	@HtmlField(cssPath="dd a")
	private List<HrefBean> categorys;

	public String getParentName() {
		return parentName;
	}

	public void setParentName(String parentName) {
		this.parentName = parentName;
	}

	public List<HrefBean> getCategorys() {
		return categorys;
	}

	public void setCategorys(List<HrefBean> categorys) {
		this.categorys = categorys;
	}
	
}

Tips for obtaining the cssPath of page elements

The difficulty in the two classes above lies in obtaining the cssPath, so here are some tips. Open the page you want to crawl in Chrome and press F12 to enter developer mode. Select the element you want to extract, as shown below:

[Illustration: selecting the element in the Chrome developer tools]

Right-click the element and select Copy > Copy selector to get the cssPath for that element:

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you are familiar with jQuery selectors and only want the dl elements, this can be simplified to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
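
Since Gecco renders html with jsoup, a quick way to sanity-check a cssPath before putting it into @HtmlField is to try it directly with jsoup, for example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorCheck {

    public static void main(String[] args) throws Exception {
        // Fetch the JD category page and try the simplified selector from above
        Document doc = Jsoup.connect("http://www.jd.com/allSort.aspx").get();
        Elements dls = doc.select(".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl");
        System.out.println("matched dl elements: " + dls.size());
    }
}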

After the AllSort bean has been defined and injected, we need to write its business processing. Here we do not persist the category information; we only extract the category links in order to further crawl the product list pages. See the code:

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {

	@Override
	public void process(AllSort allSort) {
		List<Category> categorys = allSort.getMobile();
		for(Category category : categorys) {
			List<HrefBean> hrefs = category.getCategorys();
			for(HrefBean href : hrefs) {
				String url = href.getUrl()+"&delivery=1&page=1&JL=4_10_0&go=0";
				HttpRequest currRequest = allSort.getRequest();
				SchedulerContext.into(currRequest.subRequest(url));
			}
		}
	}
}

@PipelineName defines the name of the pipeline, which is referenced in the @Gecco annotation on AllSort so that Gecco calls the defined pipelines one by one after the bean has been extracted and injected. The purpose of appending "&delivery=1&page=1&JL=4_10_0&go=0" to each sub-link is to grab only products that JD itself sells and has in stock. The SchedulerContext.into() method places the links to be grabbed into a queue for further fetching.
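
The sub-links queued by SchedulerContext.into() are then matched by another @Gecco bean that describes the product list page. The original article stops before that bean, so the sketch below is purely illustrative: the matchUrl pattern, the cssPath and the import paths are assumptions, shown only to indicate how the next hop of the crawl would be wired up.

import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.spider.HrefBean;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(matchUrl = "http://list.jd.com/list.html?cat={cat}", pipelines = "consolePipeline")
public class ProductList implements HtmlBean {

    private static final long serialVersionUID = 1L;

    // Links to the individual product detail pages (the selector is an assumption)
    @HtmlField(cssPath = "#plist .gl-item .p-name a")
    private List<HrefBean> details;

    public List<HrefBean> getDetails() {
        return details;
    }

    public void setDetails(List<HrefBean> details) {
        this.details = details;
    }
}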
