PiFlow is a big data pipeline system that provides a rich set of processor components along with a shell, a DSL, a web configuration interface, task scheduling, task monitoring, and other functions.
Project characteristics
Simple and easy to use
Visually configure pipelines, monitor running pipelines, view pipeline logs, set checkpoints, and schedule pipelines
Scalability
Custom development of data processing components is supported
Superior performance
It is built on the Apache Spark distributed computing engine
Powerful
It provides 100+ data processing components covering Hadoop, Spark, MLlib, Hive, Solr, Redis, MemCache, Elasticsearch, JDBC, MongoDB, HTTP, FTP, XML, CSV, JSON, and more, and integrates algorithms from the field of microbiology
Architecture diagram
Environment
JDK 1.8
Scala-2.11.8
Apache Maven 3.1.0
Spark-2.1.0 or later
Hadoop-2.6.0
Get started
Build PiFlow:
install external package
mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/spark-xml_2.11-0.4.2.jar -DgroupId=com.databricks -DartifactId=spark-xml_2.11 -Dversion=0.4.2 -Dpackaging=jar
mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/java_memcached-release_2.6.6.jar -DgroupId=com.memcached -DartifactId=java_memcached-release -Dversion=2.6.6 -Dpackaging=jar
mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/ojdbc6-11.2.0.3.jar -DgroupId=oracle -DartifactId=ojdbc6 -Dversion=11.2.0.3 -Dpackaging=jar
mvn install:install-file -Dfile=/../piflow/piflow-bundle/lib/edtftpj.jar -DgroupId=ftpClient -DartifactId=edtftp -Dversion=1.0.0 -Dpackaging=jar
mvn clean package -Dmaven.test.skip=true
[INFO] Replacing original artifact with shaded artifact.
[INFO] Reactor Summary:
[INFO]
[INFO] piflow-project ..................................... SUCCESS [  4.369 s]
[INFO] piflow-core ........................................ SUCCESS [01:23 min]
[INFO] piflow-configure ................................... SUCCESS [ 12.418 s]
[INFO] piflow-bundle ...................................... SUCCESS [02:15 min]
[INFO] piflow-server ...................................... SUCCESS [02:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:01 min
[INFO] Finished at: 2020-05-21T15:22:58+08:00
[INFO] Final Memory: 118M/691M
[INFO] ------------------------------------------------------------------------
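The four install-file commands above can also be generated by a single loop. This is a convenience sketch, not part of the project; the jar names and Maven coordinates are copied verbatim from the commands above, and `echo` keeps it a dry run (pipe the printed commands to `sh` from your piflow checkout to execute them):

```shell
# Dry-run generator for the install-file commands above; jar names and
# Maven coordinates match the README. Pipe the output to sh to execute.
LIB=/../piflow/piflow-bundle/lib
CMDS=$(while read -r jar group artifact version; do
  echo "mvn install:install-file -Dfile=$LIB/$jar -DgroupId=$group -DartifactId=$artifact -Dversion=$version -Dpackaging=jar"
done <<'EOF'
spark-xml_2.11-0.4.2.jar com.databricks spark-xml_2.11 0.4.2
java_memcached-release_2.6.6.jar com.memcached java_memcached-release 2.6.6
ojdbc6-11.2.0.3.jar oracle ojdbc6 11.2.0.3
edtftpj.jar ftpClient edtftp 1.0.0
EOF
)
echo "$CMDS"
```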
To run PiFlow Server:
Running PiFlow Server in IntelliJ IDEA:
Download piflow: git clone https://github.com/cas-bigdatalab/piflow.git
Import PiFlow into IntelliJ IDEA
Edit the config.properties configuration file
Build PiFlow jar package:
Run -> Edit Configurations -> Add New Configuration -> Maven
Name: package
Command line: clean package -Dmaven.test.skip=true -X
Run 'package' (the PiFlow jar will be built at ../piflow/piflow-server/target/piflow-server-0.9.jar)
Run HttpService:
Edit Configurations -> Add New Configuration -> Application
Name: HttpService
Main class: cn.piflow.api.Main
Environment variable: SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.6 (change the path to your Spark home)
Run 'HttpService'
Test HttpService:
Run a sample pipeline: /piflow/piflow-server/src/main/scala/cn/piflow/api/HTTPClientStartMockDataFlow.scala
You need to modify the server IP and port in the API client to match your deployment
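The same smoke test can be sketched from the command line with curl. The `/flow/start` path and the JSON skeleton below are assumptions based on the sample client's name; check HTTPClientStartMockDataFlow.scala for the exact endpoint and payload:

```shell
# Dry-run sketch of the request the sample client sends. SERVER must
# match server.port in config.properties; the flow JSON is only a
# placeholder skeleton, not a complete flow definition.
SERVER=127.0.0.1:8002
FLOW_JSON='{"flow":{"name":"MockDataFlow","stops":[],"paths":[]}}'
REQ="curl -X POST http://$SERVER/flow/start -H Content-Type:application/json -d '$FLOW_JSON'"
echo "$REQ"   # remove the echo indirection to actually send the request
```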
How to configure config.properties
#spark and yarn config
spark.master=yarn
spark.deploy.mode=cluster
#hdfs default file system
fs.defaultFS=hdfs://10.0.86.191:9000
#yarn resourcemanager.hostname
yarn.resourcemanager.hostname=10.0.86.191
#if you want to use hive, set hive metastore uris
#hive.metastore.uris=thrift://10.0.88.71:9083
#show data in log, set 0 if you do not want to show data in logs
data.show=10
#server port
server.port=8002
#h2db port
h2.port=50002
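For wrapper scripts it can be handy to read these settings programmatically. The helper below is hypothetical (not part of PiFlow) and is demonstrated against a scratch copy of the file rather than a real deployment:

```shell
# get_prop FILE KEY: print the value of KEY from a Java-style
# properties file (first match wins; values may contain '=').
get_prop() {
  grep -E "^$2=" "$1" | head -n1 | cut -d= -f2-
}

# demo against a scratch copy of the settings shown above
printf 'server.port=8002\nh2.port=50002\n' > /tmp/piflow-demo.properties
get_prop /tmp/piflow-demo.properties server.port   # prints 8002
```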
To run PiFlow Web, go to the link below; the PiFlow Server and PiFlow Web versions should correspond:
https://github.com/cas-bigdatalab/piflow-web/releases/tag/v1.0
Docker image
Pull the Docker image
docker pull registry.cn-hangzhou.aliyuncs.com/cnic_piflow/piflow:v1.1
View the information about a Docker image
docker images
If you run a container from the image ID, all PiFlow services will start automatically. Pay attention to the HOST_IP setting
docker run -h master -itd --env HOST_IP=*.*.*.* --name piflow-v1.1 -p 6001:6001 -p 6002:6002 [imageID]
Access "HOST_IP:6001"; startup may be a bit slow, so wait a few minutes
If something goes wrong, all the applications are in the /opt folder
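A small sketch for filling in HOST_IP automatically on Linux (`hostname -I` is Linux-specific, the image ID placeholder is hypothetical, and `echo` keeps this a dry run):

```shell
# Build the docker run command with HOST_IP detected via hostname -I
# (Linux only; falls back to a placeholder when detection fails).
# Remove the echo to actually start the container.
HOST_IP=$(hostname -I 2>/dev/null | awk '{print $1}')
DOCKER_CMD="docker run -h master -itd --env HOST_IP=${HOST_IP:-<your-host-ip>} --name piflow-v1.1 -p 6001:6001 -p 6002:6002 <imageID>"
echo "$DOCKER_CMD"
```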
Page display
Login
List of pipelines
Create a pipeline
Configure a pipeline
Configure a pipeline group
A list of pipeline runs
Monitor the pipeline