SeaTunnel is an easy-to-use, ultra-high-performance distributed data integration platform that supports real-time synchronization of massive data. It can stably and efficiently synchronize tens of billions of records every day, and it is used in production by nearly 100 companies.
Introduction to SeaTunnel
SeaTunnel aims to solve the problems you may encounter when synchronizing massive amounts of data:
Data loss and duplication
Task backlog and delays
Low throughput
Long lead time to reach the production environment
Lack of application health monitoring
SeaTunnel use cases
Massive data synchronization
Massive data integration
ETL for massive amounts of data
Massive data aggregation
Multi-source data processing
Features of SeaTunnel
Easy to use, flexible configuration, no development required
Real-time streaming
Offline multi-source data analysis
High-performance, massive data processing capabilities
Modular and plug-in based, easy to extend
SQL can be used for data processing and aggregation (a short example follows this list)
Spark Structured Streaming is supported
Spark 2.x is supported
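As a quick illustration of the SQL capability mentioned above, a processing step can be written directly as a query inside the job configuration. The snippet below is only a sketch modeled on community examples; the table and field names are made up, and exact plugin option names may differ between releases:
filter {
  sql {
    # aggregate rows with plain Spark SQL; table and field names here are illustrative
    sql = "select name, count(*) as cnt from access_log group by name"
  }
}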
Environmental dependencies
Java runtime environment, Java >= 8
If you want to run SeaTunnel in a clustered environment, you will need one of the following Spark cluster environments:
Spark on Yarn
Spark Standalone
If you only have a small amount of data, or only want to verify functionality, you can also start SeaTunnel in local mode without a cluster environment; stand-alone operation is supported. Note: SeaTunnel 2.0 runs on both Spark and Flink.
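Before moving on, it is worth quickly verifying the prerequisites. The checks below are a minimal sketch; the SPARK_HOME variable is only set if you exported it yourself when installing Spark:
java -version                            # should report version 1.8 or later
$SPARK_HOME/bin/spark-submit --version   # should report a Spark 2.x build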
Production use cases
Weibo
Data Platform of the Value-added Business Department
A Weibo business has hundreds of real-time stream computing tasks that use an internally customized version of SeaTunnel, and its sub-project Guardian monitors SeaTunnel-on-Yarn tasks.
Sina
Big Data O&M Analysis Platform
The Sina O&M data analysis platform uses SeaTunnel to perform real-time and offline analysis of O&M big data for Sina News, CDN and other services, and writes the results to ClickHouse.
Sogou
Sogou Singularity System
The Sogou Singularity System uses SeaTunnel as an ETL tool to help establish a real-time data warehouse system.
Get started quickly
1. SeaTunnel relies on the JDK 1.8 runtime environment.
2. SeaTunnel depends on Spark, so you need to prepare Spark before installing SeaTunnel. Download Spark first and choose a version >= 2.x.x. After downloading and decompressing it, you can submit a Spark deploy-mode = local task without any additional configuration. If you want the task to run on a Standalone cluster, or on a Yarn or Mesos cluster, please refer to the Spark configuration documentation.
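For example, a local Spark suitable for deploy-mode = local can be prepared roughly as follows; the 2.4.8 release and the Apache archive URL are only illustrative, so substitute whichever 2.x build you actually want:
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
tar -zxvf spark-2.4.8-bin-hadoop2.7.tgz
export SPARK_HOME=$(pwd)/spark-2.4.8-bin-hadoop2.7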
3. Download the SeaTunnel installation package and unzip it. Here the community version is used as an example:
wget https://github.com/InterestingLab/seatunnel/releases/download/v<version>/seatunnel-<version>.zip -O seatunnel-<version>.zip
unzip seatunnel-<version>.zip
ln -s seatunnel-<version> seatunnel
Deploy and run
Run SeaTunnel in local mode
./bin/start-seatunnel.sh --master local[4] --deploy-mode client --config ./config/application.conf
Run SeaTunnel on a Spark Standalone cluster
# client mode
./bin/start-seatunnel.sh --master spark://207.184.161.138:7077 --deploy-mode client --config ./config/application.conf
# cluster mode
./bin/start-seatunnel.sh --master spark://207.184.161.138:7077 --deploy-mode cluster --config ./config/application.conf
Run SeaTunnel on a Yarn cluster
# client mode
./bin/start-seatunnel.sh --master yarn --deploy-mode client --config ./config/application.conf
# cluster mode
./bin/start-seatunnel.sh --master yarn --deploy-mode cluster --config ./config/application.conf
Run SeaTunnel on Mesos
# cluster mode
./bin/start-seatunnel.sh --master mesos://207.184.161.138:7077 --deploy-mode cluster --config ./config/application.conf
If you want to specify the resources used by SeaTunnel at runtime, or other Spark parameters, you can set them in the configuration file passed via --config:
spark {
spark.executor.instances = 2
spark.executor.cores = 1
spark.executor.memory = "1g"
…
}
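The spark block above is only one part of application.conf; the same file also declares the input, filter and output plugins that make up the pipeline. The following is a minimal sketch modeled on the community quick-start example (fake input, split filter, stdout output); treat the plugin names and options as illustrative, since they may differ slightly between releases:
spark {
  spark.app.name = "seatunnel-example"
  spark.executor.instances = 2
  spark.executor.cores = 1
  spark.executor.memory = "1g"
}

input {
  # built-in test source that generates a few sample rows
  fake {
    content = ["Hello World, SeaTunnel"]
  }
}

filter {
  # split the raw message into two named fields on ","
  split {
    fields = ["msg", "name"]
    delimiter = ","
  }
}

output {
  # print the resulting rows to the console
  stdout {}
}
Save a file like this as ./config/application.conf and submit it with any of the start-seatunnel.sh commands shown above.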