Streaming data with Postgres, Debezium and Kafka
Imagine you accidentally touch a hot plate for a second.
What happens?
You reflexively pull back your hand and you begin to feel pain. Why is that so?
Our central nervous system takes information through our senses, processes the information and triggers certain actions such as feeling pain.
I promise this is not going to be a biology class.
Let’s get started!
The digital equivalent of what our nervous system does is stream processing.
With stream processing, we’re processing data continuously and as soon as it becomes available
We are capturing the data in real-time from event sources such as databases in the form of streams of events and moving the data to target systems such as data warehouses and databases.
Sounds cool right?
Hold on! What are events?
Events are when you conceptualize the data.
What do I mean by that? Consider the statements below
“I bought coffee”
“I bought coffee from Starbites on Monday at 2 pm”
The first statement is just data but the second statement is an event.
Sounds cool right?
Let’s talk about Change Data Capture(CDC)and Debezium
The idea is simple. It’s about getting changes out of the database. CDC will enable us to capture everything that’s already in the database along with any changes after the first capture.
Let’s say you have a Postgres database. Now, the moment we insert any new record into the database, update an existing record in the database or delete a record from a database, Debezium gets notified about the changes by tapping into the database’s transaction logs. Debezium captures these changes and streams them to Kafka topics. Kafka Connect, a framework part of the Apache Kafka umbrella is responsible for sending the changes into Kafka. It acts as a bridge for streaming data in and out of Kafka. You can use it to connect your Kafka database with data sources.
Now, when the data is in Kafka, our target systems can subscribe to the Kafka topics so that the changes could be streamed to them.
One may ask, “Why are you doing this log-based CDC. Why not query-based CDC?
There are 2 types of CDC:query-based CDC and log-based CDC
With query-based CDC, you’re just writing a SQL to get the data you need. The query would include a predicate to identify what has changed. This could be based on a timestamp field.
Let’s assume we have 4 records in the database. When we run the query, we get back all the records(since the previous timestamp starts from 0 and all the ts_cols are greater than 0). Now, if a new record is inserted into the database, what happens when we run our query again. With the time stamp now set to the last time we run the query, we can extract the new record that was inserted.
This is how query-based CDC works.
Take note of this method as we’ll have to run the query as frequently as possible to get the new records.
The JDBC Connector is an example of what you will use. for a query-based CDC
However, this is not the same for log-based CDC. With log-based CDC, we’re not writing any query with a timestamp greater than something to extract the new records. Instead, we’re looking at the database transaction log and the connector has to someway somehow get access to that data. Anytime there’s an insert, it gets written to the transaction long and since we’re also keeping a tab on the transaction log, we can see them and then decode them from their internal format and write them as formatted messages in Avro/JSON.
With the log-based CDC, we don’t only see the new record, we can also see how it got there so there’s some sort of history. Also, when we make deletes, it gets appended to the log so we can see them. These are some things you won’t see when using the query-based CDC. You only see updates
All the connectors from the Debezium project are examples of log-based CDC connectors.
Streaming data from Postgres using Debezium opens up a host of possibilities that we can tap into. I hope this article served its purpose as a comprehensible introduction to setting up an event-driven system with Debezium and Kafka.
I hope you enjoyed this read. Thanks for your time.