Arctic is a streaming lakehouse system open sourced by NetEase. Arctic adds real-time capabilities on top of Iceberg and Hive, and provides unified stream and batch processing along with out-of-the-box metadata services for DataOps, making data lakes more useful and practical.
Overview
Arctic is a Streaming LakeHouse Service built on top of the Apache Iceberg table format. Through Arctic, users can implement better-optimized CDC, streaming updates, OLAP and other functions on Flink, Spark, Trino and other engines. Combined with the efficient offline processing capabilities of the data lake, Arctic can serve more scenarios where stream and batch workloads are mixed. In addition, Arctic’s structural self-optimization, concurrent conflict resolution, and standardized lakehouse management functions can effectively reduce the user’s burden of managing and optimizing the data lake.
The Arctic service is delivered by deploying AMS, which can be considered the next-generation successor to HMS (Hive Metastore), or an HMS for Iceberg. Arctic relies on Iceberg as its basic table format, but it does not modify Iceberg’s implementation; it uses Iceberg as a library. From the point of view of computing engines such as Flink, Spark, and Trino, Arctic is first of all an independent data source with streaming lake storage. Second, an Arctic table can also be used as one or more Iceberg tables. Because Hive still has a large user base, Arctic has been designed with Hive compatibility in mind. Arctic’s open overlay architecture can help large-scale offline data lakes be upgraded to real-time data lakes quickly and in batches, without worrying about compatibility issues with the original data lakes, allowing data lakes to serve more scenarios such as real-time analysis, real-time risk control, real-time training, and feature engineering.
Arctic Features
- Efficient streaming updates based on primary keys
- Automatic bucketing of data, self-optimized structure
- Supports encapsulating data lakes and message queues into a unified table to achieve lower-latency stream-batch integration
- Provides standardized metrics, dashboards and related management tools for streaming data warehouses
- Supports reading and writing data with Spark and Flink, and querying data with Trino
- 100% compatible with Iceberg / Hive table format and syntax
- Provides transactional guarantees for streaming batch concurrent writes
Architecture and Concepts
Arctic’s components include AMS, optimizer, and dashboard, as follows:
AMS
AMS stands for Arctic Meta Service. In the Arctic architecture, AMS is defined as a new generation of HMS. AMS manages all Arctic schemas, provides metadata services and transaction APIs to the computing engines, and is responsible for triggering background structure optimization tasks.
Transaction
Arctic defines a data commit as a transaction, and guarantees transactional consistency semantics under concurrent streaming and batch writes. Unlike the ACID guarantees provided by Iceberg, Arctic must ensure data consistency based on primary keys, because it supports CDC ingestion and streaming updates.
Tablestore
Tablestore is a table-format entity stored by Arctic on the data lake. A Tablestore is similar to a clustered index in a database and represents an independent storage structure. Each Tablestore is an Iceberg table. Streaming writes and batch writes go into Arctic’s Changestore and Basestore respectively, and Arctic provides an integrated view over multiple Tablestores at query time. Future extensions such as sort keys or aggregate keys will also be implemented by extending Tablestore.
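As a rough illustration of what an integrated view over two Tablestores could look like, the sketch below replays change records from a Changestore over a Basestore snapshot, keyed by primary key. The `Change` type, the operation names and `merge_on_read` are hypothetical names chosen for this example and are not Arctic’s API; real Tablestores are Iceberg tables, not in-memory dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Change:
    """One change record (hypothetical shape, for illustration only)."""
    op: str                      # "upsert" or "delete"
    key: int                     # primary key value
    row: Optional[dict] = None   # full row for upserts, None for deletes


def merge_on_read(base: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply change records, in order, on top of a base snapshot."""
    view = dict(base)  # copy so the base snapshot is left untouched
    for change in changes:
        if change.op == "upsert":
            view[change.key] = change.row
        elif change.op == "delete":
            view.pop(change.key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {change.op}")
    return view


# Basestore: rows written by batch jobs, keyed by primary key.
base = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}

# Changestore: records written by a streaming job, in commit order.
changes = [
    Change("upsert", 2, {"id": 2, "name": "b2"}),
    Change("delete", 1),
    Change("upsert", 3, {"id": 3, "name": "c"}),
]

view = merge_on_read(base, changes)
```

Because later records for the same primary key overwrite earlier ones, the order in which changes are replayed matters. This is one reason consistency has to be defined in terms of primary keys rather than files alone.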
Optimizing
As a streaming lakehouse service, Arctic continuously performs file structure optimization in the background, and is committed to making these optimization tasks visible and measurable. The optimization operations include, but are not limited to, small file merging, data partitioning, and merging and transforming data between Tablestores.
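To make small file merging more concrete, here is a minimal sketch of how a planner might group small files into merge tasks using a first-fit-decreasing heuristic. The function name, the thresholds and the heuristic itself are assumptions made for illustration; the source does not describe how Arctic actually plans these tasks.

```python
from typing import Dict, List


def plan_small_file_merges(
    file_sizes: Dict[str, int],
    small_threshold: int,
    target_size: int,
) -> List[List[str]]:
    """Group files smaller than small_threshold into merge tasks.

    Each task's total size stays at or below target_size. Tasks that end up
    holding a single file are dropped, since there is nothing to merge.
    """
    # Largest small files first (first-fit decreasing).
    small = sorted(
        ((size, name) for name, size in file_sizes.items() if size < small_threshold),
        reverse=True,
    )
    tasks: List[list] = []  # each entry is [total_size, [file names]]
    for size, name in small:
        for task in tasks:
            if task[0] + size <= target_size:
                task[0] += size
                task[1].append(name)
                break
        else:
            # No existing task has room, so open a new one.
            tasks.append([size, [name]])
    return [names for _total, names in tasks if len(names) > 1]


# Hypothetical file sizes in MB; "c" is already large and is left alone.
files = {"a": 10, "b": 20, "c": 200, "d": 30, "e": 60}
tasks = plan_small_file_merges(files, small_threshold=64, target_size=100)
```

In this example the files `e`, `d` and `a` fit together into one 100 MB task, while `b` stays on its own and is therefore not scheduled.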
Optimizing planner
The optimizing planner determines the scheduling strategy of optimization tasks. Arctic supports setting a quota in the table properties, which affects the resources the optimizing planner can use when optimizing the structure of a single table.
Optimizer container
The optimizer container is the container in which optimizing tasks are scheduled. Currently, two kinds of scheduling are supported: standalone and YARN. Standalone tasks are scheduled locally in AMS, which is suitable for testing. Arctic also lets users extend the optimizer container with their own implementations.
Optimizer group
For resource isolation, one or more optimizer groups can be set under an optimizer container. Optimizer groups can also be used to guarantee priority. On YARN, the optimizer container corresponds to a queue.