Building logger with Rust and Scylla DB - Data Layer
In this serie, we’ll build a performant log-server using Rust as a programming language. This post, the first in this serie, will describe the corner stone of our logger - the data layer.
Rust is designed to be fast and safe – thanks to its ownership model it doesn’t require a garabge collector yet providing safety guaranties for data. We will also use ScyllaDB as a database server to store logs data. Scylla is Cassandra compatible database written in C++. According to benchmarks made by Samsung it performs 10x times better than Cassandra on high-end machines and up-to 3x faster on smaller workstations.
Figure 1. Scylla vs Cassandra performance benchmark. Image is taken from Scylla official website. YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs
Starting Scylla DB
In order to start Scylla instance let’s use official Docker image provided by ScyllaDB team:
$ docker run \
-p 9042:9042/tcp \
--name some-scylla \
--hostname some-scylla \
-d scylladb/scylla
In order to persist logs we will create a data volume and mount it to a container.
$ mkdir -p /xx/yy/scylla/data \
/xx/yy/scylla/commitlog \
/xx/yy/scylla/hints \
/xx/yy/scylla/view_hints
$ docker run --name some-scylla \
--volume /xx/yy/scylla:/var/lib/scylla \
-d scylladb/scylla
where /xx/yy
is an actual path to a folder where we create scylla
folder.
In order to access cqlsh
and nodetool
utility you can run
$ docker exec -it some-scylla cqlsh
and
$ docker exec -it some-scylla nodetool status
respectively.
Data schema
Our log-server will have capabilities for storring and querying temperature timeseries data. Each measurement entry will include:
- device ID that reported a temperature
- time of a temperature measurement
- temperature value itself.
Assuming that we will query the temperature reported by a certain device for a given time interval, we can use following table schema:
CREATE TABLE IF NOT EXISTS fast_logger.temperature (
device UUID,
time timestamp,
temperature smallint,
PRIMARY KEY(device, time)
);
This primary key has device
as a partitioning key which means that all temperature records reported by a given device will be stored in the same partition (node). time
is a clustering key that’s why all records from the same device will be sorted by this field. Thus, this table schema will support an effective temperature querying:
SELECT * FROM fast_logger.temperature
WHERE device = ?
AND time > ?
AND time < ?;
where ?
will be replaced with actual values – device ID, time-from and time-to respectively.
In order to create fast_logger
keyspace, first we need to decide which replication strategy to use and what would be the replication factor.
If it comes to replication strategy, there are two of them to choose:
-
SimpleStrategy
must be used for a single data center only. -
NetworkTopologyStrategy
is a strategy you should consider when you plan to have few data centers.
As for the replication factor, it depends on how much replicas you want to have in your cluster.
In our case, we are going to build a logger that uses a single Scylla node. That’s why SimpleStrategy
with replication factor equal to 1 is absolutelly sufficient:
CREATE KEYSPACE IF NOT EXISTS fast_logger
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 1
};
Since it’s not recommended to disable durabl writes when using SimpleStrategy
we keep default value for DURABLE_WRITES
which is true
(we don’t specify this parameter in the query at all).
Rust and connection to the DB
In order to create a skeleton for an executable application such as our logger server in Rust, we have to run cargo new fast_logger
command in a terminal.
It’s assumed that you already have Rust and Cargo installed. If you don’t, we recommend you to use rustup.rs toolchain that will help you to install Rust compiler as well as other commonly used tools and crates. Each rustup.rs profile contains rust compiler (rustc
), standard library (rust-std
) and Cargo package manager.
As a database driver we are going to use CDRS Rust crate. UUID crate will be used to represed UUID of a device. Apart of CDRS itself we will use cdrs_helpers_derive
crate that provide procedural macroses that derive methods which help with converting Rust structures into Cassandra/Scylla values. time
crate is used to work with time primitives. Here is how dependency section should look like in Cargo.toml file:
[dependencies]
cdrs = "2"
cdrs_helpers_derive = "0.1.0"
uuid = "0.7"
time = "0.1"
Do not forget to add
extern crate cdrs;
#[macro_use]
extern crate cdrs_helpers_derive;
extern crate uuid;
extern crate time;
to main.rs so that you could you declared dependencies in your code.
To be able to connect to a DB server CDRS needs following information:
- authenticator – any structure that has
Authenticator
implementation and which matches an authentication strategy used by DB Server. We’re going to disable authentication soNoneAuthenticator
should be used:
use cdrs::authenticators::NoneAuthenticator;
let auth = NoneAuthenticator{};
- cluster configuration – it consists of connection configuration provided for each node in a cluster which we want to connect to:
let node = NodeTcpConfigBuilder::new("127.0.0.1:9042", auth).build();
let cluster_config = ClusterTcpConfig(vec![node]);
Since we’re going to have just a single Scylla node only one node config is provided to a cluster. NodeTcpConfigBuilder
is a builder for NodeTcpConfig
. When we will pass cluster_config
to session creator, the creator will creat a pool of re-usable DB connectoin for each node in a cluster config. NodeTcpConfigBuilder
accepts the same parameters as Rust r2d2 Pool
builder.
Having an authenticator and a cluster configurtion we can create a new CDRS session (db.rs file - it’s a module that will hold the logic for working with Scylla instance):
use cdrs::{
authenticators::NoneAuthenticator,
cluster::{
session::{
new as new_session,
// other option: new_lz4 as new_lz4_session,
// other option: new_snappy as new_snappy_session
Session,
},
ClusterTcpConfig, NodeTcpConfigBuilder, TcpConnectionPool,
},
load_balancing::SingleNode,
query::*,
Result as CDRSResult,
};
pub type CurrentSession = Session<SingleNode<TcpConnectionPool<NoneAuthenticator>>>;
pub fn create_db_session() -> CDRSResult<CurrentSession> {
let auth = NoneAuthenticator;
let node = NodeTcpConfigBuilder::new("127.0.0.1:9042", auth).build();
let cluster_config = ClusterTcpConfig(vec![node]);
new_session(&cluster_config, SingleNode::new())
}
Apart of SingleNode
load balancing strategy used here CDRS provides Random
, RoundRobin
and RoundRobinSync
strategies. Being defined during session intialization load balancing strategy cannot be changed for a given session instance which is not the case for compression. Compression can be changed at any moment of time, for example session.compression = cdrs::compression::Compression::Lz4;
.
Since we are going write temperature measurement one-by-one, also taking into account that each measurement is a fairly small amount of data, session without any compression looks most reasonable for our logger.
Now, let us add functions for creating keyspace and a table where we will store temperature measurements (add to db.rs):
// ...
static CREATE_KEYSPACE_QUERY: &'static str = r#"
CREATE KEYSPACE IF NOT EXISTS fast_logger
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 1
};
"#;
static CREATE_TEMPERATURE_TABLE_QUERY: &'static str = r#"
CREATE TABLE IF NOT EXISTS fast_logger.temperature (
device UUID,
time timestamp,
temperature smallint,
PRIMARY KEY(device, time)
);
"#;
pub fn create_keyspace(session: &mut CurrentSession) -> CDRSResult<()> {
session.query(CREATE_KEYSPACE).map(|_| (()))
}
pub fn create_temperature_table(session: &mut CurrentSession) -> CDRSResult<()> {
session.query(CREATE_TEMPERATURE_TABLE_QUERY).map(|_| (()))
}
Here, we use queries for creating a keyspace and a table which we have previously agreed on.
Write temperature measurements
Let us now create a structure that will represent a single temperature measurement that will be stored in fast_logger.temperature
table (temperature_measurement.rs file):
use time::Timespec;
use cdrs::types::prelude::*;
use cdrs::types::from_cdrs::FromCDRSByName;
use uuid::Uuid;
#[derive(Debug, TryFromRow)]
struct TemperatureMeasurement {
pub device: Uuid,
pub time: Timespec,
pub temperature: i16
}
impl TemperatureMeasurement {
pub fn into_query_values(self) -> QueryValues {
query_values!(
"device" => self.device,
"time" => self.time,
"temperature" => self.temperature
)
}
}
Apart of standard Debug
we’ve derived one extra trait - TryFromRow
provided by CDRS.
pub trait TryFromRow: Sized {
fn try_from_row(row: Row) -> error::Result<Self>;
}
The proc macro from cdrs_helpers_derive
brings try_from_row
method that will convert Scylla row received from a DB server into TemperatureMeasurement
structure.
As a part of TemperatureMeasurement
implementation we’ve provided into_query_values
method that consumes a measurement structure and returns Cassandra values that further will be used by Scylla for substituting ?
in ADD_MEASUREMENT_QUERY
(add to db.rs file):
// ...
static ADD_MEASUREMENT_QUERY: &'static str = r#"
INSERT INTO fast_logger.temperature (device, time, temperature)
VALUES (?, ?, ?);
"#;
//...
pub fn add_measurement(
session: &mut CurrentSession,
measurement: TemperatureMeasurement,
) -> CDRSResult<()> {
session
.query_with_values(ADD_MEASUREMENT_QUERY, measurement.into_query_values())
.map(|_| (()))
}
Reading measurements
As you remember, during the data scheme design phase we assumed that our logger will be cappable to return measurements made by a selected device(-s) during a given timeframe. Taking into account this scenario, for fast_logger.temperature
table, we’ve created such primary key that guarantees efficient read operations.
Now we can create a select-query and a corresponded function in db.rs module:
use time::Timespec;
use uuid::Uuid;
// ...
static SELECT_MEASUREMENTS_QUERY: &'static str = r#"
SELECT * FROM fast_logger.temperature
WHERE device = ?
AND time > ?
AND time < ?;
"#;
//...
pub fn select_measurements(
session: &mut CurrentSession,
devices: Uuid,
time_from: Timespec,
time_to: Timespec,
) -> CDRSResult<Vec<TemperatureMeasurement>> {
let values = query_values!(devices, time_from, time_to);
session
.query_with_values(SELECT_MEASUREMENTS_QUERY, values)
.and_then(|res| res.get_body())
.and_then(|body| {
body
.into_rows()
.ok_or(CDRSError::from("cannot get rows from a response body"))
})
.and_then(|rows| {
let mut measurements: Vec<TemperatureMeasurement> = Vec::with_capacity(rows.len());
for row in rows {
measurements.push(TemperatureMeasurement::try_from_row(row)?);
}
Ok(measurements)
})
}
Since the SELECT
query has all filters (device
, time_from
and time_to
) as parameters we need to provide actual values to CDRS. This is done with help of query_values!
macro provided by CDRS.
query_values!
accepts arguments in two different forms - named and non-named - and depending on that it returns either named query values or non-named ones. As you may guess, eventough named values can be provided in any order they will be used by Scylla as a values for columns that have matched names. Unlike that, non-named query values should be provided in the same order as ?
-s appear in a corresponded query.
Further improvements - abstract session
So far we’ve created all functions necessary for logging and querying measurements data - create_db_session
connects to a DB cluster (though it consists of just one node), create_keyspace
creates fast_logger
keyspace if not exits, create_temperature_table
creates fast_logger.temperature
table if not exist, add_measurement
inserts temperature measurement data into the table and, finally, select_measurements
gets all measurements made by given devices in a given timeframe.
Eventough these functions are sufficient for our needs we can try to make them more testable and refactor the process of saving measurements more efficient.
Let us analyse code of create_keyspace
, create_temperature_table
, add_measurement
and select_measurements
. From one hand side we can notice that all of them receive &mut CurrentSession
. It is an alias that we’ve created for Session<SingleNode<TcpConnectionPool<NoneAuthenticator>>>
. This long generic type means following Session
that uses SingleNode
as a load balancing strategy where each node has a pool of TCP connections that use NoneAuthenticator
as an authentication strategy. From other hand side, why we need session is to run queries via session.query(...)
or session.query_with_values
. Both of these methods are the part of QueryExecutor
implementation. It means that session: &mut CurrentSession
in an argument can be efficiently replaced with following:
pub fn create_keyspace<T, M>(session: &mut impl QueryExecutor<T, M>) -> CDRSResult<()>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
/// ...
}
pub fn create_temperature_table<T, M>(session: &mut impl QueryExecutor<T, M>) -> CDRSResult<()>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
// ...
}
pub fn add_measurement<T, M>(
session: &mut impl QueryExecutor<T, M>,
measurement: TemperatureMeasurement,
) -> CDRSResult<()>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
// ...
}
pub fn select_measurements<T, M>(
session: &mut impl QueryExecutor<T, M>,
devices: Vec<Uuid>,
time_from: Timespec,
time_to: Timespec,
) -> CDRSResult<Vec<TemperatureMeasurement>>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
// ...
}
Despite function signatures now look more complex it’s beneficial for the code to have them in this form. Thanks to that we’ve made our code more decoupled and more testable. We’ve rid of CurrentSession
that creates a real connection to the DB, now these functions accepts any structure that implements QueryExecutor<T, M>
trait. So, now we could create a mock query executor that doesn’t need a real connection to a DB cluster. This is very beneficial for unit testing of our data layer. Apart of that, instead of TCP connection now QueryExecutor
relies on general CDRSTransport
trait, so the same functions will work with sessions based both on TCP and on TLS.
Generally speaking, CDRS Session
implements following traits related to querying:
-
BatchExecutor
that contains methods for batching querying, in other words making few queries in a single request to a DB. -
ExecExecutor
that contains methods for executing prepeared queries. -
PrepareExecutor
that contains methods for preparing queries. -
QueryExecutor
that contains methods for running queries.
Further improvements - prepare once, execute many times
Another kind of improvement is related to the way our logger is going to communicate with a database. What I mean is because of the fact that the logger will always send measurements of the same shape the only one this which will differ is actuall values – query will be the same. In terms of bytes, query string is comparable with actuall values. So, instead of doubling the size of a data being sent from a logger to Scylla we can try to optimize it by leveraging query preparation and query execution concepts that are supported by Scylla (as well as Cassandra).
So, we can request Scylla to prepare itself for INSERT
queries. This is going to be happen just once - during logger initialization phase. Then we can just execute prepeared query by sending only values that should be inserted but not the query itself (add to db.rs):
pub fn prepare_add_measurement<T, M>(
session: &mut impl PrepareExecutor<T, M>,
) -> CDRSResult<PreparedQuery>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
session.prepare(ADD_MEASUREMENT_QUERY)
}
pub fn execute_add_measurement<T, M>(
session: &mut impl ExecExecutor<T, M>,
prepared_query: &PreparedQuery,
measurement: TemperatureMeasurement,
) -> CDRSResult<()>
where
T: CDRSTransport + 'static,
M: r2d2::ManageConnection<Connection = RefCell<T>, Error = CDRSError> + Sized,
{
session
.exec_with_values(prepared_query, measurement.into_query_values())
.map(|_| (()))
}
Conclusion
Thus in this chapter we’ve build a data layer of the logger. It’s cappable for establishing Scylla session, creating a keyspace and a table for measurement entries, adding measurements and retrieving them. The full code of a data layer can be found https://github.com/AlexPikalov/ps-logger (revision).
In next post we’ll continue building our logger by adding a web interface to our service.