Gecco, recommended in this issue, is a lightweight, easy-to-use web crawler written in Java that integrates jsoup, httpclient, fastjson, spring, htmlunit, redisson and other excellent frameworks.
Introduction to Gecco
Gecco is a lightweight, easy-to-use web crawler developed in Java. It integrates jsoup, httpclient, fastjson, spring, htmlunit, redisson and other excellent frameworks, so you only need to configure a few jQuery-style selectors to write a crawler quickly. The framework is also highly extensible: it is designed around the open/closed principle, closed for modification and open for extension.
Main features
- Easy to use: extract elements with jQuery-style selectors
- Supports dynamic configuration and loading of crawl rules
- Support for asynchronous ajax requests in pages
- Support for javascript variable extraction from pages
- Implement distributed fetching with Redis, refer to gecco-redis
- Support business logic development combined with Spring, refer to gecco-spring
- Support htmlunit extension, refer to gecco-htmlunit
- Support plugin extension mechanism
- Random UserAgent selection when downloading
- Random selection of download proxy servers
Framework Overview
GeccoEngine
GeccoEngine is the crawler engine; each engine is best run as an independent process. In a distributed crawling scenario, it is recommended that each crawler server (physical machine or virtual machine) run one GeccoEngine. The crawler engine consists of five main modules: Scheduler, Downloader, Spider, SpiderBeanFactory and PipelineFactory.
Downloader
Downloader is responsible for downloading the requests obtained from the Scheduler. Gecco uses httpclient 4.x as the download engine by default. By implementing the Downloader interface you can plug in your own download engine, and you can also define a BeforeDownload and an AfterDownload for each request to meet individual download requirements.
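As a rough illustration of that extension point, here is a skeleton of a custom download engine. The method signatures below (download with and without a timeout, plus shutdown) and the way a custom downloader is referenced from a bean are assumptions based on how the bundled httpclient downloader is typically structured, so check the Downloader interface in your gecco version before relying on them. Imports are omitted as in the other listings.
public class MyDownloader implements Downloader {

    @Override
    public HttpResponse download(HttpRequest request) throws DownloadException {
        // Delegate to the timeout variant with a default timeout (assumed signature)
        return download(request, 10000);
    }

    @Override
    public HttpResponse download(HttpRequest request, int timeout) throws DownloadException {
        // Fetch request.getUrl() with any HTTP client you like, copy the body and
        // status code into an HttpResponse and return it; left unimplemented in this sketch
        throw new UnsupportedOperationException("sketch only - plug in your HTTP client here");
    }

    @Override
    public void shutdown() {
        // Release connection pools, threads, etc.
    }
}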
SpiderBeanFactory
Gecco renders the downloaded content into a SpiderBean; all JavaBeans rendered by the crawler implement SpiderBean. SpiderBeans are further divided into HtmlBeans and JsonBeans, which correspond to rendering html pages and json data respectively. SpiderBeanFactory matches the appropriate SpiderBean based on the requested url and generates its context, the SpiderBeanContext. The SpiderBeanContext tells the SpiderBean which renderer it uses, which downloader it uses, and which pipelines process it after rendering.
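The examples later in this article are all HtmlBeans. For completeness, a minimal JsonBean might look like the sketch below; the URL and JSON paths are made up for illustration, and the @JSONPath annotation and JsonBean interface should be verified against the gecco version you use. Imports are omitted as in the other listings.
@Gecco(matchUrl="http://example.com/api/comments?page={page}", pipelines="consolePipeline")
public class CommentList implements JsonBean {

    private static final long serialVersionUID = 1L;

    @JSONPath("$.totalCount")
    private int totalCount;        // a top-level number in the json response

    @JSONPath("$.comments[*].content")
    private List<String> contents; // one string per element of the comments array

    public int getTotalCount() { return totalCount; }
    public void setTotalCount(int totalCount) { this.totalCount = totalCount; }

    public List<String> getContents() { return contents; }
    public void setContents(List<String> contents) { this.contents = contents; }
}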
Spider
The core class of the Gecco framework is the Spider thread; one crawler engine can run multiple Spider threads at the same time. Spider describes the basic skeleton of the framework: it first obtains a request from the Scheduler, matches the SpiderBean class through the SpiderBeanFactory, and looks up the SpiderBeanContext for that class. It then downloads the web page, renders it into the SpiderBean, and finally hands the rendered SpiderBean to the pipelines for processing.
Usage
Maven
<dependency>
<groupId>com.geccocrawler</groupId>
<artifactId>gecco</artifactId>
<version>x.x.x</version>
</dependency>
Quick start
@Gecco(matchUrl="https://github.com/{user}/{project}", pipelines="consolePipeline")
public class MyGithub implements HtmlBean {
private static final long serialVersionUID = -7127412585200687225L;
@RequestParameter("user")
private String user; //url {user} value
@RequestParameter("project")
private String project; //url {project} value
@Text
@HtmlField(cssPath=".pagehead-actions li:nth-child(2) .social-count")
private String star; // Extract star
@Text
@HtmlField(cssPath=".pagehead-actions li:nth-child(3) .social-count")
private String fork; // fork
@Html
@HtmlField(cssPath=".entry-content")
private String readme; // Extract readme
public String getReadme() {
return readme;
}
public void setReadme(String readme) {
this.readme = readme;
}
public String getUser() {
return user;
}
public void setUser(String user) {
this.user = user;
}
public String getProject() {
return project;
}
public void setProject(String project) {
this.project = project;
}
public String getStar() {
return star;
}
public void setStar(String star) {
this.star = star;
}
public String getFork() {
return fork;
}
public void setFork(String fork) {
this.fork = fork;
}
public static void main(String[] args) {
GeccoEngine.create()
// Package path of the project
.classpath("com.geccocrawler.gecco.demo")
// Start grabbing page address
.start("https://github.com/xtuhcy/gecco")
// Open several crawler threads
.thread(1)
// The interval between each request taken by a single crawler
.interval(2000)
// Loop grab
.loop(true)
// Use pc userAgent
.mobile(false)
// Run in non-blocking mode
.start();
}
}
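The consolePipeline declared in @Gecco above is a built-in pipeline that simply prints the rendered bean. If you want to handle the extracted fields yourself, you can register your own pipeline by name, following the same pattern used in the JD example below. The name githubPipeline and the class are made up for illustration, and the class must live under the package passed to classpath() so gecco can find it.
@PipelineName("githubPipeline")
public class GithubPipeline implements Pipeline<MyGithub> {

    @Override
    public void process(MyGithub repo) {
        // The bean arrives here fully rendered; reference this pipeline from the bean with
        // @Gecco(matchUrl="...", pipelines={"consolePipeline", "githubPipeline"})
        System.out.println(repo.getProject() + " stars=" + repo.getStar() + " forks=" + repo.getFork());
    }
}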
Example – Using the Java crawler Gecco to crawl all JD product information
@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {
private static final long serialVersionUID = 665662335318691818L;
@Request
private HttpRequest request;
// Mobile
@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
private List<Category> mobile;
//Home Appliances
@HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
private List<Category> domestic;
public List<Category> getMobile() {
return mobile;
}
public void setMobile(List<Category> mobile) {
this.mobile = mobile;
}
public List<Category> getDomestic() {
return domestic;
}
public void setDomestic(List<Category> domestic) {
this.domestic = domestic;
}
public HttpRequest getRequest() {
return request;
}
public void setRequest(HttpRequest request) {
this.request = request;
}
}
As you can see, we take the product information of two categories, mobile phones and home appliances, as an example. Each category contains several sub-categories, expressed as List<Category>. Gecco supports nesting of beans, which expresses the structure of the html page well. Category holds the sub-category content, and HrefBean is the common link bean provided by gecco.
public class Category implements HtmlBean {
private static final long serialVersionUID = 3018760488621382659L;
@Text
@HtmlField(cssPath="dt a")
private String parentName;
@HtmlField(cssPath="dd a")
private List<HrefBean> categorys;
public String getParentName() {
return parentName;
}
public void setParentName(String parentName) {
this.parentName = parentName;
}
public List<HrefBean> getCategorys() {
return categorys;
}
public void setCategorys(List<HrefBean> categorys) {
this.categorys = categorys;
}
}
Tips for obtaining the cssPath of page elements
The difficulty in the two classes above lies in obtaining the cssPath, so here are a few tips. Open the page you want to crawl in Chrome and press F12 to open the developer tools. Select the element you want to extract, right-click it and choose Copy > Copy selector to get the cssPath of that element:
body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items
If you are familiar with jQuery selectors and only want the dl elements, you can simplify this to:
.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
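Since gecco uses jsoup for html parsing, a quick way to verify a selector before wiring it into @HtmlField is to run it through jsoup directly. The small standalone check below is only a sketch; JD may require a desktop User-Agent or may have changed its markup since this article was written.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the category page once and try the simplified selector on it
        Document doc = Jsoup.connect("http://www.jd.com/allSort.aspx")
                .userAgent("Mozilla/5.0")
                .get();
        String cssPath = ".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl";
        for (Element dl : doc.select(cssPath)) {
            System.out.println(dl.select("dt a").text() + " -> " + dl.select("dd a").size() + " sub-category links");
        }
    }
}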
After AllSort has been rendered and injected, we need a business-processing class for it. Here we do not persist the category information; we only extract the category links and continue crawling the product list pages. See the code:
@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {
@Override
public void process(AllSort allSort) {
List<Category> categorys = allSort.getMobile();
for(Category category : categorys) {
List<HrefBean> hrefs = category.getCategorys();
for(HrefBean href : hrefs) {
String url = href.getUrl()+"&delivery=1&page=1&JL=4_10_0&go=0";
HttpRequest currRequest = allSort.getRequest();
SchedulerContext.into(currRequest.subRequest(url));
}
}
}
}
@PipelineName defines the name of the pipeline, and that name is referenced in the @Gecco annotation on AllSort, so that after gecco has extracted and injected the bean it calls the pipelines declared in @Gecco one by one. The purpose of appending "&delivery=1&page=1&JL=4_10_0&go=0" to each sub-link is to crawl only products that JD itself sells and has in stock. SchedulerContext.into() puts the links to be crawled into the queue for further fetching.
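For the crawl to actually continue, there must be another HtmlBean whose matchUrl matches the sub-request urls pushed by SchedulerContext.into(), together with its own pipeline. The sketch below is hypothetical: the class name, matchUrl pattern and cssPath are illustrative and not taken from the original demo, so adjust them to the real product-list markup. Imports are omitted as in the other listings.
@Gecco(matchUrl="http://list.jd.com/list.html?cat={cat}&delivery={delivery}&page={page}&JL={JL}&go=0", pipelines="consolePipeline")
public class ProductList implements HtmlBean {

    private static final long serialVersionUID = 1L;

    // Every product entry in the list page, again expressed as a nested list of link beans
    @HtmlField(cssPath="#plist .gl-item .p-name a")
    private List<HrefBean> products;

    public List<HrefBean> getProducts() { return products; }
    public void setProducts(List<HrefBean> products) { this.products = products; }
}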