Documentation

/

White Papers

Building a Transaction-Free, Flawless CDC (Change Data Capture) System

mason

November 22, 2024

Building a Transaction-Free, Flawless CDC (Change Data Capture) System

Introduction

With increasing regulations such as the Personal Information Protection Act, Electronic Financial Transaction Act, and Medical Records Privacy Act, service providers are legally mandated to log changes to sensitive data, including personal information. Beyond legal obligations, logging database changes is essential for real-time monitoring, threat detection, data integrity maintenance, and incident response. These are critical for ensuring service stability and security, making change capture a vital requirement for developers, database administrators, and security experts.

To meet these requirements, many service providers leverage Change Data Capture (CDC) technology—a method to identify and track data changes in real time. At QueryPie, which provides database access control capabilities, CDC is also an integral feature. This white paper discusses the approach QueryPie takes to implement a CDC system without relying on transactions.

Challenges

Challenges with Transaction-Based CDC

CDC involves logging both before and after data whenever a change occurs. One common method to implement this is by utilizing database transactions.

For instance, when a user issues an UPDATE query, the CDC system can record the data states before and after the change through the following process:

Issue to Solve
*Before and After Data States for Query Execution Diagram*

However, implementing CDC through transactions poses challenges, as it requires a rollback operation for every update query executed on the target table. Transaction rollbacks increase the load on the DBMS, and this load grows with the size of the target table. Moreover, the process involves duplicate reads of the same table to capture the data states before and after the query execution, which significantly reduces database access efficiency. Additionally, in NoSQL systems, transactions are often not supported, or if they are, they operate at a weaker level of consistency. This makes it difficult or even impossible to implement transaction-based CDC in such environments.

The impact of transaction rollbacks on performance can be assessed through a simple test scenario with the following setup:

  • MySQL 8.0 (on-premise, 8core(vcore 16), mem 256GB)

  • 100,000 records

  • A table named actor with the specified DDL schema

  • High-speed network to minimize network bottlenecks

1234567
CREATE TABLE actor (    actor_id int NOT NULL AUTO_INCREMENT,    first_name varchar(255) NOT NULL,    last_name varchar(255) NOT NULL,    last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,    PRIMARY KEY (actor_id));

Let's assume a scenario where the first_name column in all records of the table is updated to 'Christopher'. If transaction rollback functionality is utilized, the following steps and queries must be executed:

1234567891011121314151617181920
mysql> SET AUTOCOMMIT=0;Query OK, 0 rows affected (0.06 sec) mysql> SELECT * FROM actor;...100000 rows in set (0.42 sec) mysql> UPDATE actor4 SET first_name='Christopher' WHERE 1;Query OK, 100000 rows affected (1.54 sec)Rows matched: 100000  Changed: 100000  Warnings: 0 mysql> ROLLBACK;Query OK, 0 rows affected (0.85 sec) mysql> UPDATE actor4 SET first_name='Christopher' WHERE 1;Query OK, 100000 rows affected (1.58 sec)Rows matched: 100000  Changed: 100000  Warnings: 0 mysql> COMMIT;Query OK, 0 rows affected (0.04 sec)

Here’s a simplified table showing the time required for each operation. Steps added specifically for CDC are highlighted in yellow:

Operation

Execution Time

Table Query

0.42s

Table Update

1.54s

Transaction Rollback

0.85s

Table Update

1.58s

Transaction Commit

0.04s

Total Execution Time

4.43s

User Query Execution Time

2.04s

Additional Operation Time

2.39s

As observed, the transaction rollback-based CDC mechanism results in more than double the time required due to the additional operations for updates and rollbacks. This leads to a 50% performance degradation from the user's perspective when implementing CDC.

Other Challenges with CDC Implementations

Additionally, there are methods to implement CDC such as creating triggers on tables to track data changes, recording update times or timestamps like modified_at in tables and periodically querying them, or utilizing log files like MySQL's binlog. While each method has its pros and cons, they all share a common drawback: the need to modify the DBMS for CDC functionality, resulting in dependency on the DBMS. In other words, every time a new DBMS instance is added, the corresponding settings must also be configured.

So, what method can be used to implement CDC effectively without adding load to the DBMS or creating DBMS dependencies?

Goal

To minimize the load on the DBMS, CDC implementation should avoid using transactions and performing redundant queries on the same table. Additionally, to prevent DBMS dependency, the CDC process should not require modifications to the DBMS.

At QueryPie, we address these requirements by utilizing our in-house query analysis library, QSI (Query Structure Interface), to obtain post-query data without directly querying the database. But how can we retrieve post-query data without directly inspecting the table?

Solution Overview

Solution Overview
Diagram Related to *After Row* Prediction Logic for CDC in the *ActionAnalyzer* of *QSI*

QSI avoids directly querying the table. Instead, it simulates the changes that the executed query would make on the pre-change table data and delivers the post-change table results.
In other words, rather than delegating the process of checking query execution results to the database, QSI performs the query internally and provides the results. With QueryPie's query simulation technology, CDC can be implemented without triggering transactions or rollbacks in the database. Additionally, since only query execution is required, no modifications to the DBMS are necessary, making QueryPie immediately usable upon installation.