Introduction
Cozo is a transactional, relational database that:
- is embeddable;
- uses Datalog as its query language;
- focuses on graph data and graph algorithms;
- supports time travel (“history shuttle”) queries over past states of the data;
- delivers high performance and high concurrency.
What does “embeddable” mean?
If a database can be used on a mobile phone that is not connected to the Internet, it is probably embedded. For example, SQLite is embedded, while MySQL, Postgres, Oracle, etc. are not (they are client-server (CS) databases).
If a database runs in the same process as your main program, it is an embedded database. In contrast, with a client-server database the main program accesses the database through a dedicated interface (usually a network interface), and the database may run on another machine or on a separate cluster. Embedded databases are simple to use, consume fewer resources, and can be used in a wider range of environments.
Cozo also supports running in client-server mode. Cozo is therefore embeddable rather than merely embedded. In client-server mode, Cozo can take full advantage of the server’s resources.
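To make the distinction concrete, here is a minimal sketch using Python’s built-in `sqlite3` module (SQLite being the embedded example mentioned above), not Cozo itself. The table name `route`, its columns, and its rows are made up for illustration.

```python
import sqlite3


def count_direct_destinations(origin: str) -> int:
    """Query an in-memory SQLite database that lives inside this process.

    No server, socket, or separate process is involved: the database
    engine is a library running in the same process as the caller.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE route (fr TEXT, dst TEXT)")
        conn.executemany(
            "INSERT INTO route VALUES (?, ?)",
            # Hypothetical rows, for illustration only.
            [("FRA", "YYZ"), ("FRA", "AUH"), ("AUH", "BNE")],
        )
        (n,) = conn.execute(
            "SELECT COUNT(DISTINCT dst) FROM route WHERE fr = ?", (origin,)
        ).fetchone()
        return n
    finally:
        conn.close()
```

A client-server database would need a connection string, a running server, and a network round trip for the same query.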
What is the use of “graph data”?
In essence, data is interrelated and self-related, and the mathematical representation of these relationships is a graph (also known as a network). Only by taking these relationships into account can we gain deeper insight into the logic behind the data.
Most existing graph databases require data to be stored in a property graph format. In contrast, Cozo uses the traditional relational data model, which has simple storage logic, is powerful, and handles graph data without difficulty. More importantly, gaining insight from data often means uncovering hidden relationships, and the relational data model, being an algebra (relational algebra), handles this kind of problem very well. The property graph model, by comparison, does not constitute an algebra: it can only treat explicitly stored graph relationships as graph data, and its composability is weak.
What’s so good about “Datalog”?
Datalog appeared in 1977. It can express all relational queries, and its advantage over SQL is its support for recursion. Because its execution logic is different, Datalog is usually faster at recursive queries than the equivalent SQL. Datalog also excels in composability and modularity: it lets you express the queries you need clearly, layer by layer.
Recursion is especially important for graph queries. The Datalog dialect used by Cozo is called CozoScript; it allows aggregation and recursion to be mixed under certain conditions, which further enhances the expressiveness of Datalog. Cozo also has some commonly used graph analysis algorithms (such as PageRank) built in, and they are easy to call.
Once you understand Datalog better, you will find that Datalog rules are similar to functions in programming languages. A great feature of rules is their composability: breaking a query into several progressive rules makes it clearer and easier to maintain, with no loss of efficiency. In contrast, complex SQL queries are usually written as multiple layers of nested “select-from-where”, which are hard to read and maintain.
Time travel (“history shuttle”)?
In a database, “time travel” means recording every change to the data, so that queries can be run against the data as it was at a given moment and the history can be inspected.
In a sense, this makes the database an immutable database, since no data is ever actually deleted.
Every extra feature has its price. Ideally, if you don’t use a feature, you don’t pay for it. In Cozo, not every table automatically supports time travel; this leaves it to the user to decide whether the feature is needed and whether its cost is acceptable.
This little story about time travel may suggest some scenarios where it is useful.
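The idea of keeping every change and querying “as of” a moment can be sketched in a few lines of Python. This is only an illustration of the concept under simple assumptions (integer timestamps, one value per key); the class `VersionedTable` and its methods are hypothetical and say nothing about how Cozo stores history or what its API looks like.

```python
from bisect import bisect_right


class VersionedTable:
    """Append-only key/value table that keeps every past version."""

    def __init__(self):
        # key -> list of (timestamp, value, is_deleted), oldest first
        self._history = {}

    def put(self, key, value, ts):
        self._append(key, ts, value, False)

    def delete(self, key, ts):
        # A deletion is just another version (a tombstone); nothing is erased.
        self._append(key, ts, None, True)

    def _append(self, key, ts, value, deleted):
        versions = self._history.setdefault(key, [])
        if versions and ts <= versions[-1][0]:
            raise ValueError("timestamps must be strictly increasing per key")
        versions.append((ts, value, deleted))

    def get(self, key, as_of):
        """Return the value the key had at time `as_of`, or None."""
        versions = self._history.get(key, [])
        timestamps = [v[0] for v in versions]
        i = bisect_right(timestamps, as_of)  # versions[:i] have ts <= as_of
        if i == 0:
            return None  # the key did not exist yet
        _, value, deleted = versions[i - 1]
        return None if deleted else value
```

The cost mentioned above is visible here: every write grows the history, and every read has to search it, which is why making the feature opt-in per table is a reasonable design.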
How high is “high performance, high concurrency”?
We did a performance test on a 2020 Mac Mini using the RocksDB persistent storage engine (Cozo supports multiple storage engines):
- Querying a table with 1.6 million rows: mixed transactional workloads of reads, writes, and updates can reach 100,000 queries per second (QPS), and read-only workloads can reach 250,000 QPS. During the test, the memory used by the database peaked at only 50MB.
- Backup runs at about 1 million rows per second and restore at about 400,000 rows per second. Backup and restore speed does not decrease as the table grows.
- Analytical queries: scanning a table with 1.6 million rows takes about 1 second (this varies by a factor of about 2 depending on the specific query). Query time is roughly proportional to the number of rows the query touches, while memory usage is determined mainly by the size of the returned result set.
- For a graph with 1.6 million vertices and 31 million edges, a “two-hop” graph query (such as finding the friends of someone’s friends) can be completed in 1 millisecond.
- PageRank speed: 10,000 vertices and 120,000 edges: within 50 milliseconds; 100,000 vertices and 1.7 million edges: within 1 second; 1.6 million vertices and 31 million edges: within 30 seconds.
See this article for more details.
Learn Cozo
You have to install a database to start learning, right? Not necessarily: Cozo is “embedded”, so we embed it directly into the browser via WASM! Open this page, then:
Of course, you can also go all the way in one step: skip ahead to learn how to install the native Cozo database in an environment you are familiar with, and then start learning.
Some examples
The following example can give you a feel for Cozo’s queries before you start learning.
Suppose there is a table named `*route` with two columns named `fr` and `to`, whose values are airport codes (for example, `FRA` is the code for Frankfurt Airport), and each row represents a flight route.
How many airports can you reach from `FRA` without connecting:
?[count_unique(to)] := *route{fr: 'FRA', to}
How many airports can you reach from `FRA` with exactly one transfer:
?[count_unique(to)] := *route{fr: 'FRA', to: stop},
                       *route{fr: stop, to}
How many airports can you reach from `FRA` with any number of transfers:
reachable[to] := *route{fr: 'FRA', to}
reachable[to] := reachable[stop], *route{fr: stop, to}
?[count_unique(to)] := reachable[to]
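One way to read the two `reachable` rules is as a fixed-point computation: start from the first rule, then keep applying the second until nothing new appears. The Python sketch below illustrates that reading with a hypothetical list of `(fr, to)` pairs; it is not a description of how Cozo actually evaluates the query.

```python
def reachable_from(routes, origin):
    """Return every airport reachable from `origin` via one or more flights.

    `routes` is an iterable of (fr, to) pairs.
    """
    routes = list(routes)
    # First rule: reachable[to] := *route{fr: origin, to}
    reachable = {to for fr, to in routes if fr == origin}
    frontier = set(reachable)
    # Second rule, applied until no new airports are derived:
    # reachable[to] := reachable[stop], *route{fr: stop, to}
    while frontier:
        derived = {to for fr, to in routes if fr in frontier}
        frontier = derived - reachable  # only genuinely new facts
        reachable |= frontier
    return reachable


# Hypothetical data: a small cycle plus an unrelated route.
ROUTES = [("FRA", "YYZ"), ("YYZ", "YTS"), ("YTS", "FRA"), ("AUH", "BNE")]
count = len(reachable_from(ROUTES, "FRA"))  # plays the role of count_unique(to)
```

Because the data contains a cycle, `FRA` itself ends up in the result, and the loop still terminates because only new airports are carried into the next round.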
Starting from `FRA` and taking the route with the fewest connections to each destination, which two airports require the most connections to reach:
shortest_paths[to, shortest(path)] := *route{fr: 'FRA', to},
                                      path = ['FRA', to]
shortest_paths[to, shortest(path)] := shortest_paths[stop, prev_path],
                                      *route{fr: stop, to},
                                      path = append(prev_path, to)
?[to, path, p_len] := shortest_paths[to, path], p_len = length(path)
:order -p_len
:limit 2
to | path | p_len |
---|---|---|
YPO | ["FRA","YYZ","YTS","YMO","YFA","ZKE","YAT","YPO"] | 8 |
BVI | ["FRA","AUH","BNE","ISA","BQL","BEU","BVI"] | 7 |
What is the shortest route between `FRA` and `YPO`, and how many miles does it cover:
start[] <- [['FRA']]
end[] <- [['YPO']]
?[src, dst, distance, path] <~ ShortestPathDijkstra(*route[], start[], end[])
src | dst | distance | path |
---|---|---|---|
FRA | YPO | 4544.0 | ["FRA","YUL","YVO","YKQ","YMO","YFA","ZKE","YAT","YPO"] |
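For readers unfamiliar with the algorithm behind `ShortestPathDijkstra`, here is a textbook Dijkstra sketch in Python. It is a generic illustration, not Cozo’s implementation; the function name, the `(src, dst, distance)` edge format, and the airport codes and distances in the usage note are all hypothetical.

```python
import heapq


def shortest_path(edges, start, goal):
    """Return (distance, path) for the shortest route, or None if unreachable.

    `edges` is an iterable of (src, dst, distance) with non-negative distances.
    """
    graph = {}
    for src, dst, dist in edges:
        graph.setdefault(src, []).append((dst, dist))

    best = {start: 0.0}  # best known distance to each node
    queue = [(0.0, start, [start])]
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for nxt, weight in graph.get(node, []):
            new_dist = dist + weight
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(queue, (new_dist, nxt, path + [nxt]))
    return None
```

With the made-up edges `[("AAA", "BBB", 1.0), ("BBB", "CCC", 2.0), ("AAA", "CCC", 5.0), ("CCC", "DDD", 1.0)]`, calling `shortest_path(edges, "AAA", "DDD")` returns `(4.0, ["AAA", "BBB", "CCC", "DDD"])`: the route through `BBB` beats the direct `AAA` to `CCC` leg.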
Cozo will provide clear and useful error messages when the query statement is wrong:
?[x, Y] := x = 1, y = x + 1
eval::unbound_symb_in_head

  × Symbol 'Y' in rule head is unbound
   ╭────
 1 │ ?[x, Y] := x = 1, y = x + 1
   ·      ─
   ╰────
  help: Note that symbols occurring only in negated positions are not considered bound
Install Cozo
It is recommended to learn first, then install. Of course, we do not object to the reverse.
Cozo can be installed for many languages and environments:
Language/Environment | Officially supported platforms | Storage engine |
---|---|---|
Python (China mirror) | Linux (x86_64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
NodeJS | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
Browser | Browsers that support WASM (all recent browsers do) | M |
Java (JVM) (China mirror) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
Clojure (JVM) (China mirror) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
Android (China mirror) | Android (ARM64, ARMv7, x86_64, x86) | MQ |
iOS/macOS (Swift) | iOS (ARM64, Simulator), Mac (ARM64, x86_64) | MQ |
Rust | Any platform that supports `std` (compiled from source) | MQRST |
Go (China mirror) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
C/C++/languages that support C FFI | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
Standalone HTTP service | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQRST |
The meaning of the letters in the “Storage Engine” column:
- M: memory-based non-persistent storage engine
- Q: SQLite-based storage engine
- R: RocksDB-based storage engine
- S: Sled-based storage engine
- T: TiKV-based distributed storage engine
Cozo’s Rust API documentation (in English) has some additional advice on storage choices.
You can also try to compile Cozo yourself for other platforms, languages, and engines. Some code changes may be required, but overall it is not too difficult.
Tuning the RocksDB-based storage engine
RocksDB offers many options for performance tuning. But tuning is so complicated that even the RocksDB developers cannot do it by hand, so in production they use reinforcement learning to tune automatically. For 95% of users it is not worth the effort, especially since Cozo’s out-of-the-box setup is already fast enough.
If you firmly believe you are in the remaining 5%: when you created a CozoDB instance with the RocksDB engine, you provided a directory path for storing the data. If you create a file called `options` in that directory, the RocksDB engine will interpret it as a RocksDB options file and apply the settings in it. If you use the standalone `cozoserver` program, a log message is printed when this feature is activated.
Every time the RocksDB engine starts, the file `data/OPTIONS-XXXXXX` in the storage directory records the settings currently in effect. You can copy this file and use it as the basis for your changes. If you are not a RocksDB expert, it is recommended that you change only numeric options whose meaning you at least roughly understand. Improper settings can corrupt the database.
Cozo’s Architecture
The Cozo database consists of three layered parts, each of which calls only the interface of the part below it.
(user code) |
language/environment wrapper |
query engine |
storage engine |
(operating system) |
Storage engine
In the storage engine part, Cozo defines a storage interface (a `trait` in Rust) whose job is to store binary key-value data and to scan it by key range. This interface currently has the following official implementations:
- Memory-based non-persistent storage engine
- SQLite-based storage engine
- RocksDB-based storage engine
- Sled-based storage engine
- Distributed storage engine based on TiKV
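To give a feel for how small such an interface is, here is a sketch of an in-memory store offering “store” and “range scan” over binary keys. It is written in Python for brevity; the class `MemoryStore` and its method names are hypothetical and do not mirror the actual Rust trait.

```python
from bisect import bisect_left, insort


class MemoryStore:
    """Non-persistent key-value store with ordered range scans."""

    def __init__(self):
        self._keys = []    # all keys, kept sorted in byte order
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            insort(self._keys, key)  # keep the key list sorted
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def delete(self, key: bytes) -> None:
        if key in self._values:
            del self._values[key]
            del self._keys[bisect_left(self._keys, key)]

    def range_scan(self, lower: bytes, upper: bytes):
        """Yield (key, value) pairs with lower <= key < upper, in byte order."""
        i = bisect_left(self._keys, lower)
        while i < len(self._keys) and self._keys[i] < upper:
            key = self._keys[i]
            yield key, self._values[key]
            i += 1
```

Everything above this layer only needs these few operations, which is what makes it practical to plug in engines as different as SQLite, RocksDB, and TiKV.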
Not all binary packages contain all of the above engines. Among them, the SQLite engine has a special status: Cozo uses SQLite files as its backup format, which makes it possible to exchange data between Cozo instances that use different engines. Rust users can easily implement their own engines (this does not mean that writing a storage engine is easy, only that connecting an existing one to Cozo is).
Cozo uses a row-oriented rather than a column-oriented binary storage format. In this format, a composite key is stored as a byte array using a memcomparable encoding, so that sorting these byte arrays in plain byte order yields the correct semantic ordering. This also means that querying the data stored by the SQLite engine directly with SQL will return what looks like gibberish. Implementing the storage engine interface does not itself require knowledge of this format.
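The memcomparable idea can be illustrated with a toy encoding for a composite key made of a string and a signed 64-bit integer. This is a common textbook scheme chosen for illustration, and the function names are hypothetical; Cozo’s actual byte layout is not specified here and may differ.

```python
import struct


def encode_str(s: str) -> bytes:
    """Encode a string so that byte order matches string order.

    Embedded zero bytes are escaped as 00 FF, and the field ends with 00 00,
    so a shorter string always sorts before any string it is a prefix of.
    """
    raw = s.encode("utf-8")
    return raw.replace(b"\x00", b"\x00\xff") + b"\x00\x00"


def encode_int(n: int) -> bytes:
    """Encode a signed 64-bit integer as 8 big-endian bytes.

    Adding 2**63 flips the sign bit, so negative numbers sort before
    positive ones when the bytes are compared as unsigned values.
    """
    return struct.pack(">Q", n + (1 << 63))


def encode_key(code: str, seq: int) -> bytes:
    """Encode the composite key (code, seq) as one byte string."""
    return encode_str(code) + encode_int(seq)


# Sorting by encoded bytes gives the same order as sorting the tuples.
keys = [("FRA", 2), ("FRA", -1), ("FR", 10), ("FRAX", 0), ("", 5), ("FRA", 10)]
by_bytes = sorted(keys, key=lambda k: encode_key(*k))
```

The storage engine then never needs to understand column types: comparing raw bytes is enough to keep rows in the right order for range scans.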
Query engine
The query engine part implements the following:
- Implementation of various functions, aggregation operators, and algorithms
- Definition of table structures (schemas)
- Database transactions
- Compilation of queries
- Query execution
This part contains most of the code in the Cozo project. An entire chapter of the documentation covers query execution in detail.
Cozo’s Rust API is actually the public interface to the query engine.
Language/environment wrappers
All languages and environments other than Rust use further wrappers around Cozo’s Rust API. For example, in the standalone server (cozoserver), the Rust API is wrapped as an HTTP endpoint, while in Cozo-Node, the synchronous Rust API is wrapped as an asynchronous API based on the JavaScript runtime.
Wrapping the Rust API is not difficult; if you want Cozo to run in another language, you can give it a try. Rust has ready-made libraries for interfacing with a number of other languages. If there is no such library for the language you want to use, we suggest wrapping Cozo’s C API directly. The officially supported Go library does this (via cgo).
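As a rough analogy for wrapping a synchronous API as an asynchronous one, the Python sketch below pushes a blocking call onto a worker thread so the event loop stays free. The function `run_query_sync` is a hypothetical stand-in that only echoes its input; it is not a real Cozo binding, and Cozo-Node’s actual mechanism is built on the JavaScript runtime rather than on `asyncio`.

```python
import asyncio


def run_query_sync(script: str) -> dict:
    """Hypothetical stand-in for a blocking, synchronous query call."""
    return {"ok": True, "script": script}


async def run_query(script: str) -> dict:
    # Run the blocking call in a worker thread (Python 3.9+) so other
    # coroutines can make progress while the query is executing.
    return await asyncio.to_thread(run_query_sync, script)


async def main() -> list:
    # Two queries are issued concurrently; results come back in call order.
    return await asyncio.gather(
        run_query("?[x] := x = 1"),
        run_query("?[x] := x = 2"),
    )
```

Calling `asyncio.run(main())` drives both wrapped calls to completion; the wrapper adds no query logic of its own, which matches the layering described above.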
Project status
Many of the features originally envisioned for Cozo have been implemented, but the project is still very young. Everyone is welcome to use it and share feedback.
Versions prior to Cozo 1.0 make no promises about syntax, API stability, or storage compatibility.
License and Contribution
Cozo is licensed under MPL-2.0 or later. If you are interested in contributing code to the project, please see here.