This issue recommends Taier, an open source distributed task scheduling system for big data.
Taier is a distributed, visual DAG task scheduling system. It aims to reduce the cost of ETL development and improve the stability of the big data platform. Big data developers can write business logic directly in Taier without worrying about complex task dependencies or the underlying big data platform architecture, and so can focus on the business.
Features
Stability
- No single point of failure: decentralized distributed mode
- High availability: ZooKeeper
- Overload handling: distributed nodes + two-level storage strategy + queuing mechanism. Every node can schedule and submit tasks. When a large number of tasks arrive, they are cached in the in-memory queue first. When the number of tasks exceeds the configured maximum queue size, all tasks are stored in the database. Tasks are consumed from a queue, which asynchronously fetches executable instances from the database
- Proven in production: tested in the production environments of hundreds of enterprise customers
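The two-level storage idea in the overload-handling item can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Taier's implementation: the class name `TwoLevelTaskQueue`, the `max_in_memory` parameter, and the `overflow` table are hypothetical, an in-memory SQLite database stands in for the scheduler's database, and the refill from the database is done synchronously rather than asynchronously to keep the example short.

```python
import sqlite3
from collections import deque


class TwoLevelTaskQueue:
    """Illustrative two-level store: a bounded in-memory queue backed by a database."""

    def __init__(self, max_in_memory):
        self.max_in_memory = max_in_memory
        self.memory = deque()
        # Assumption: an in-memory SQLite table stands in for the real database.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE overflow (id INTEGER PRIMARY KEY AUTOINCREMENT, task TEXT)"
        )

    def submit(self, task):
        """Prefer the memory queue; spill to the database once it is full
        or while older tasks are still waiting in the database."""
        if len(self.memory) < self.max_in_memory and self._overflow_count() == 0:
            self.memory.append(task)
        else:
            self.db.execute("INSERT INTO overflow (task) VALUES (?)", (task,))

    def poll(self):
        """Return the next task, or None when both levels are empty."""
        if not self.memory:
            self._refill()
        return self.memory.popleft() if self.memory else None

    def _overflow_count(self):
        return self.db.execute("SELECT COUNT(*) FROM overflow").fetchone()[0]

    def _refill(self):
        """Move the oldest stored tasks back into memory (synchronously here)."""
        rows = self.db.execute(
            "SELECT id, task FROM overflow ORDER BY id LIMIT ?",
            (self.max_in_memory,),
        ).fetchall()
        for row_id, task in rows:
            self.db.execute("DELETE FROM overflow WHERE id = ?", (row_id,))
            self.memory.append(task)


queue = TwoLevelTaskQueue(max_in_memory=2)
for name in ["t1", "t2", "t3", "t4", "t5"]:
    queue.submit(name)

drained = []
while (task := queue.poll()) is not None:
    drained.append(task)
print(drained)
```

With a limit of two in-memory tasks, the first two submissions stay in memory and the remaining three spill to the database. Consumers still see the tasks in submission order, because the queue is only refilled from the oldest stored rows.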
Ease of use
- Supports big data job scheduling: Spark, Flink
- Supports many task types; currently supports Spark SQL and data synchronization
To be open-sourced later: SparkMR, PySpark, FlinkMR, Python, Shell, Jupyter, TensorFlow, PyTorch, HadoopMR, Kylin, ODPS
- SQL tasks (MySQL, PostgreSQL, Hive, Impala, Oracle, SQLServer, TiDB, Greenplum, Inceptor, Kingbase, Presto)
- Visual workflow configuration: supports encapsulating tasks into workflows, supports running a single task without wrapping it in a workflow, and supports drawing the DAG by drag and drop
- DAG monitoring interface: the operation and maintenance center shows cluster resources and how much of them remains available, supports batch-stopping tasks in the scheduling queue, and shows task status, task type, retry count, the machine a task runs on, visual variables and other key information at a glance
- Scheduling time configuration: visual configuration
- Multi-cluster connection: one scheduling system can connect to multiple Hadoop clusters
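To make the DAG idea behind the workflow features concrete, here is a minimal sketch of how a run order can be derived from task dependencies using Kahn's topological sort. It is not Taier's scheduler code: the function `execution_order`, the dictionary layout, and the task names are assumptions made up for this example.

```python
from collections import deque


def execution_order(dependencies):
    """Return a valid run order for a DAG given as {task: [upstream tasks]}.

    Raises ValueError if the graph contains a cycle.
    """
    # Collect every task, including ones that appear only as an upstream.
    tasks = set(dependencies)
    for upstream in dependencies.values():
        tasks.update(upstream)

    # pending[t] = number of upstream tasks that have not finished yet.
    pending = {t: len(set(dependencies.get(t, []))) for t in tasks}
    downstream = {t: [] for t in tasks}
    for task, upstream in dependencies.items():
        for parent in set(upstream):
            downstream[parent].append(task)

    # Sorting keeps the result deterministic.
    ready = deque(sorted(t for t in tasks if pending[t] == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(downstream[task]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("workflow contains a cycle")
    return order


# Hypothetical workflow: the report needs cleaned orders and synced users.
workflow = {
    "sync_orders": [],
    "clean_orders": ["sync_orders"],
    "daily_report": ["clean_orders", "sync_users"],
    "sync_users": [],
}
print(execution_order(workflow))
```

A task becomes runnable only after all of its upstream tasks have completed. This is the property a DAG scheduler relies on when it lets developers declare dependencies instead of ordering jobs by hand.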
Multi-version engine
- Support for multiple versions of Spark, Flink and other engines
Kerberos support
- Spark
- Flink
System parameters
- Rich system parameters: supports 3 time benchmarks and flexible output formats
Extensibility
- Designed for distributed deployment; Taier as a whole currently supports horizontal scaling
- Scheduling capability increases linearly with cluster size
Architecture Design
- DatasourceX is a data source plug-in responsible for metadata and data operations on various types of data sources. Functions such as obtaining table structure and previewing table data are implemented by DatasourceX.
- Chunjun is a Flink-based data synchronization tool that unifies batch and stream processing. It can collect both static data, such as MySQL and HDFS, and real-time changing data, such as MySQL binlog and Kafka.
Main interface

Open source license: Apache 2.0