This issue recommends Apache InLong, a one-stop framework for integrating massive data.
Apache InLong is a one-stop integration framework for massive data, donated by Tencent to the Apache community. It provides automatic, secure, reliable, and high-performance data transmission, which helps businesses build stream-based data analysis, modeling, and applications. The InLong project was formerly known as TubeMQ, which focused on high-performance, low-cost message queuing. To make fuller use of the ecosystem around TubeMQ, we upgraded the project to InLong, with the goal of building a one-stop framework for integrating massive data. Drawing on trillion-level data access and processing capabilities, Apache InLong integrates the whole process of data collection, aggregation, storage, sorting, and processing, and is easy to use, flexible to extend, stable, and reliable.
Features
Simple and easy to use: InLong is offered as a SaaS-style service. Users only need to publish and subscribe to data by topic to complete data reporting, transmission, and distribution.
Stable and reliable: the system originates from a real production system that carries ten-trillion-level high-performance traffic and hundred-billion-level highly reliable traffic.
Complete functions: supports various data access methods and integration with several types of MQ, provides real-time ETL and data sorting and landing based on configuration rules, and supports pluggable extension of system capabilities.
Service integration: supports unified system monitoring, alerting, and fine-grained presentation of data metrics. For pipeline and data operations, metrics organized around the data topic are brought together in a unified data metrics platform, and alerts are raised on anomalies according to the alert settings defined by the business.
Flexible expansion: each module along the whole chain is a pluggable, protocol-based service, so a business can replace components and extend functions as needed.
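The publish and subscribe model mentioned in the first feature can be illustrated with a small in-memory sketch. This is not InLong's API: `TopicBus`, its methods, and the topic names are hypothetical names chosen only to show how a topic decouples the side that reports data from the side that consumes it.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class TopicBus:
    """Toy in-memory bus; a hypothetical stand-in for a message queue, not an InLong API."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[bytes], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        """Register a handler that receives every payload published to the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload to each subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


received: List[bytes] = []
bus = TopicBus()
bus.subscribe("app_logs", received.append)

delivered = bus.publish("app_logs", b"user=42 action=login")
ignored = bus.publish("other_topic", b"nobody is listening")
```

The publisher only names a topic and never refers to a consumer directly, which is the property the feature description relies on.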
Structure
Modules
Apache InLong serves the whole life cycle from data collection to data landing, and provides a processing module for each stage of the data. The main modules are:
- inlong-agent: the data collection Agent. It reads regular logs from specified directories or files and reports them one by one. Capabilities such as DB collection will be added in the future.
- inlong-dataproxy: a Flume-ng-based proxy component that supports blocking on send and retransmission of data persisted to disk, and can forward received data to different MQs (message queues).
- inlong-tubemq: a message queuing service developed in-house at Tencent. It focuses on high-performance storage and transmission of massive data in big data scenarios, and its core advantages are extensive use in production at scale and low cost.
- inlong-sort: performs ETL on the data consumed from different MQ services, then aggregates the data and writes it to storage systems such as Hive, ClickHouse, HBase, and Iceberg.
- inlong-manager: provides complete data service management and control capabilities, including metadata, task flow, permissions, and OpenAPI.
- inlong-dashboard: a front-end for managing data access that simplifies use of the entire InLong management and control platform.
- inlong-audit: performs real-time audit and reconciliation of the incoming and outgoing traffic of the Agent, DataProxy, and Sort modules in the InLong system.
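The way these modules hand data to one another can be sketched in a few lines. The code below is a toy model under stated assumptions, not InLong code: the function names, the `key=value` log format, the in-memory queue, and the counter-based audit are all hypothetical, chosen only to mirror the Agent, DataProxy, MQ, Sort, and Audit roles described above.

```python
from collections import Counter, defaultdict, deque
from typing import Deque, Dict, Iterable, List

# Records counted per stage, in the spirit of inlong-audit (hypothetical metric names).
audit: Counter = Counter()


def agent_collect(lines: Iterable[str]) -> List[str]:
    """Collection stage: report non-empty log lines one by one."""
    records = [line.strip() for line in lines if line.strip()]
    audit["agent_out"] += len(records)
    return records


def dataproxy_forward(records: List[str], queues: Dict[str, Deque[str]], topic: str) -> None:
    """Proxy stage: forward received records to the queue selected by topic."""
    audit["dataproxy_in"] += len(records)
    for record in records:
        queues[topic].append(record)
        audit["dataproxy_out"] += 1


def sort_etl(queue: Deque[str], sink: List[dict]) -> None:
    """Sort stage: consume from the queue, parse key=value pairs, write rows to a sink."""
    while queue:
        record = queue.popleft()
        audit["sort_in"] += 1
        row = dict(pair.split("=", 1) for pair in record.split())
        sink.append(row)
        audit["sort_out"] += 1


def reconcile() -> bool:
    """Audit: each stage should emit exactly as many records as the next one receives."""
    return (audit["agent_out"] == audit["dataproxy_in"]
            and audit["dataproxy_out"] == audit["sort_in"]
            and audit["sort_in"] == audit["sort_out"])


queues: Dict[str, Deque[str]] = defaultdict(deque)  # stand-in for TubeMQ or another MQ
warehouse: List[dict] = []                          # stand-in for Hive, ClickHouse, etc.

raw_log = ["user=1 action=login", "", "user=2 action=logout"]
dataproxy_forward(agent_collect(raw_log), queues, "app_logs")
sort_etl(queues["app_logs"], warehouse)
```

Comparing per-stage counts is the simplest form of reconciliation: if any pair of adjacent counters disagrees, records were lost or duplicated between those two stages.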
Basic concepts
| Name | Description | Other |
| --- | --- | --- |
| Standard Architecture | Standard architecture, including InLong | Suitable for massive data and large-scale production environments |
| Lightweight Architecture | Contains only one component, InLong Sort, and can be used together with Manager/Dashboard | Simple and flexible, suitable for small-scale data |
| Group | A data flow group that contains multiple data flows; one group represents one data access | A Group has an ID, a Name, and other attributes |
| Stream | A data flow; each data flow has a specific direction | A Stream has an ID, a Name, data fields, and other attributes |
| Node | Data nodes, including Extract Node and Load Node, which represent the data source type and the data flow target type respectively | |
| InLongMsg | The InLong data format; data consumed directly from the message queue must first be parsed as InLongMsg | |
| Agent | The standard architecture uses Agent for data collection; Agent represents the different types of collection capabilities | Includes file Agent, SQL Agent, Binlog Agent, etc. |
| DataProxy | Forwards the received data to different message queues | Supports blocking and resending of data |
| Sort | Data stream sorting | Mainly the Flink-based sort-flink and sort-standalone for local sorting |
| TubeMQ | InLong's own message queue service | Also known as Tube; low cost and high performance |
| Pulsar | Apache Pulsar, a high-performance, highly consistent message queuing service | |
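
The relationship between Group, Stream, and Node in the table can be pictured as a small data model. The classes below are a sketch for illustration only: the attribute names (`group_id`, `stream_id`, `fields`, `kind`, `node_type`) and the validation rule are assumptions, not InLong's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    kind: str       # "extract" (data source side) or "load" (data target side)
    node_type: str  # hypothetical type label, e.g. "file" or "hive"


@dataclass
class Stream:
    stream_id: str
    name: str
    fields: List[str]  # data fields carried by the stream
    source: Node       # the direction of a stream runs from source...
    target: Node       # ...to target


@dataclass
class Group:
    group_id: str
    name: str
    streams: List[Stream] = field(default_factory=list)

    def add_stream(self, stream: Stream) -> None:
        """Attach a stream, checking that it runs from an Extract Node to a Load Node."""
        if stream.source.kind != "extract" or stream.target.kind != "load":
            raise ValueError("a stream flows from an Extract Node to a Load Node")
        self.streams.append(stream)


group = Group(group_id="g1", name="app_access")
group.add_stream(Stream(
    stream_id="s1",
    name="login_events",
    fields=["user", "action"],
    source=Node(name="app_log_file", kind="extract", node_type="file"),
    target=Node(name="events_table", kind="load", node_type="hive"),
))
```

A group here stands for one data access and owns any number of streams, while each stream fixes its direction by naming one Extract Node and one Load Node.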
Open source license: Apache 2.0