Kafka Log Compaction | Confluent Documentation

Apache Kafka® log compaction and retention are essential features that ensure the integrity of data within a Kafka topic partition. Topic compaction guarantees that the latest value for each message key is always retained within the log of data contained in that topic, making it ideal for use cases such as restoring state after system failure or reloading caches after application restarts. Continue reading to learn about log compaction and retention in more detail and to understand how they work to preserve the accuracy of data streams.

Retention example

In the example that follows, there is a topic that contains user email addresses; every time a user updates their email address, this topic receives a message using the user ID as the primary key. Over a period of time, the following messages are sent for user ID 123. In this example, each message corresponds to an email address change (messages for other IDs are omitted):
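    123 => bill@microsoft.com
            .
            .
            .
    123 => bill@gatesfoundation.org
            .
            .
            .
    123 => bill@gmail.com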

Log compaction provides a granular retention mechanism so that at least the last update for each primary key is retained. For the example, bill@gmail.com would be retained. This guarantees that the log contains a full snapshot of the final value for every key, not just keys that changed recently. This means downstream consumers can restore their own state off this topic without requiring the retention of a complete log of all changes.

Following are some use cases where this is important:

  1. Database change subscriptions. You may have a data set in multiple data systems, and often one of these systems is a database. For example, you might have a database, a cache, a search cluster, and Hadoop. If you are handling the real-time updates you only need the recent log, but if you want to be able to reload the cache or restore a failed search node you may need a complete data set.
  2. Event sourcing. While compaction does not implement event sourcing by itself, it ensures you always know the latest state of each key, which is important for event-sourced applications.
  3. Journaling for high-availability. A process that does local computation can be made fault-tolerant by logging out changes that it makes to its local state so another process can reload these changes and carry on if it should fail. A concrete example of this is handling counts, aggregations, and other “group by”-like processing in a stream processing system. Kafka Streams uses this feature for this purpose. A sketch of this restore pattern follows this list.
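Following is a minimal sketch of the restore pattern behind use cases 1 and 3: a consumer replays a compacted topic from the beginning and folds it into an in-memory map. The topic name user-emails, the broker address, and the single-partition assumption are illustrative only, not part of the Kafka API:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CacheRestore {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Map<String, String> cache = new HashMap<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("user-emails", 0); // hypothetical topic
                consumer.assign(List.of(tp));
                consumer.seekToBeginning(List.of(tp));
                long end = consumer.endOffsets(List.of(tp)).get(tp);
                // Replay the compacted log: a later value for a key overwrites
                // the earlier one, and a null value (tombstone) removes the key.
                while (consumer.position(tp) < end) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        if (rec.value() == null) cache.remove(rec.key());
                        else cache.put(rec.key(), rec.value());
                    }
                }
            }
            System.out.println("Restored " + cache.size() + " entries");
        }
    }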

Important

Compacted topics must have records with keys in order to implement record retention.

Compaction in Kafka does not guarantee there is only one record with the same key at any one time. There may be multiple records with the same key, including tombstones, because compaction timing is non-deterministic. Compaction runs only when the topic partition satisfies certain conditions, such as exceeding the minimum dirty ratio or having records in inactive (rolled) segment files.

In each of these cases, you primarily must handle the real-time feed of changes, but occasionally, when a machine crashes or data needs to be reloaded or reprocessed, you must do a full load. Log compaction enables feeding both of these use cases off the same backing topic. This style of log usage is described in more detail in the blog post, The Log, by Jay Kreps.

Simply put, if the system had infinite log retention, and every change was logged, the state of the system at every moment from when it started would be captured. Using this log, the system could be restored to any point in time by replaying the first N records in the log. However, this hypothetical complete log is not practical for systems that update a single record many times, as the log will grow without bound. The simple log retention mechanism that discards old updates bounds space, but restoring from the beginning of the log no longer recreates the current state, as old updates may not be captured.

Log compaction is a mechanism to provide finer-grained per-record retention instead of coarser-grained time-based retention. Records with the same primary key are selectively removed when there is a more recent update. This way the log is guaranteed to have at least the last state for each key.
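To make the retention rule concrete, the following is a simplified model of what the cleaner retains. It illustrates the guarantee only and is not the broker's actual cleaner implementation; the sample records extend the earlier user ID 123 example with a hypothetical second key:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CompactionModel {
        record LogRecord(long offset, String key, String value) {}

        // For each key, only the record with the highest offset survives, and
        // surviving records keep their original relative order in the log.
        static List<LogRecord> compact(List<LogRecord> log) {
            Map<String, LogRecord> latest = new LinkedHashMap<>();
            for (LogRecord r : log) {
                latest.remove(r.key()); // discard the older record for this key
                latest.put(r.key(), r); // the newer record takes its place, in arrival order
            }
            return new ArrayList<>(latest.values());
        }

        public static void main(String[] args) {
            List<LogRecord> log = List.of(
                    new LogRecord(0, "123", "bill@microsoft.com"),
                    new LogRecord(1, "456", "sue@example.com"),
                    new LogRecord(2, "123", "bill@gatesfoundation.org"),
                    new LogRecord(3, "123", "bill@gmail.com"));
            compact(log).forEach(System.out::println); // prints offsets 1 and 3 only
        }
    }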

This retention policy can be set per-topic, so a single cluster can have some topics where retention is enforced by size or timeand other topics where retention is enforced by compaction.
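For example, the policy can be set when the topic is created. The following sketch uses the Java AdminClient; the topic name, partition count, and replication factor are placeholders:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // cleanup.policy=compact enables compaction for this topic only;
                // other topics on the cluster keep their own retention settings.
                NewTopic topic = new NewTopic("user-emails", 1, (short) 1)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(List.of(topic)).all().get();
            }
        }
    }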

Unlike most log-structured storage systems, Kafka is built for subscription and organizes data for fast linear reads and writes. Kafka acts as a source-of-truth store, so it is useful even in situations where the upstream data source would not otherwise be replayable.

Compaction in action

The following image shows the logical structure of a Kafka log, at a high level, with the offset for each message.

[Figure: Logical structure of a Kafka log, showing the compacted tail, the uncompacted head, the Delete Retention Point, and per-message offsets]

The head of the log is identical to a traditional Kafka log. It has dense, sequential offsets and retains all messages. Log compaction adds an option for handling the tail of the log.

The image shows a log with a compacted tail. However, the messages in the tail of the log retain the original offset assigned when they were first written. Also, all offsets remain valid positions in the log, even if the message with that offset has been compacted away; in this case this position is indistinguishable from the next highest offset that does appear in the log. For example, in the previous image, the offsets 36, 37, and 38 are all equivalent positions and a read beginning at any of these offsets would return a message set beginning with 38.
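The following sketch illustrates this: seeking to an offset that has been compacted away is still valid, and the fetch simply starts at the next retained offset. The consumer and partition are assumed to already be assigned, as in the earlier restore example:

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekExample {
        // Seek to an offset that may have been compacted away and poll once.
        // If offsets 36 and 37 were removed by compaction, a seek to 36, 37,
        // or 38 all return records starting at offset 38.
        static ConsumerRecords<String, String> readFrom(
                KafkaConsumer<String, String> consumer, TopicPartition tp, long offset) {
            consumer.seek(tp, offset);
            return consumer.poll(Duration.ofSeconds(1));
        }
    }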

Compaction enables deletes

Compaction also enables deletes. A message with a key and a null payload (note that a string value of "null" is not sufficient) will be treated as a delete from the log. These null payload messages are also called tombstones. Similar to when a new message with the same key arrives, this delete marker results in the deletion of the previous message with the same key. However, delete markers (tombstones) are special in that they are also cleaned out of the log after a period of time to free up space. This point in time is marked as the Delete Retention Point in the previous image, and is configured with delete.retention.ms on a topic.
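To produce a tombstone from application code, send a record whose value is null. This sketch uses the Java producer; the topic name and key come from the earlier example:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class DeleteKey {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A null value (not the string "null") is the tombstone that
                // marks key 123 for deletion once compaction runs.
                producer.send(new ProducerRecord<>("user-emails", "123", null));
            } // close() flushes the pending send
        }
    }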

View of compaction

Compaction is done in the background by periodically recopying log segments. Cleaning does not block reads and can be throttled to use no more than a configurable amount of I/O throughput to avoid impacting producers and consumers. The actual process of compacting a log segment looks something like the following:

[Figure: A log segment before and after compaction, with only the latest record for each key retained]

Compaction guarantees

Log compaction guarantees the following:

  1. Any consumer that stays caught-up to the head of the log will see every message that is written; these messages will have sequential offsets. The topic’s min.compaction.lag.ms can be used to guarantee the minimum length of time that must pass after a message is written before it could be compacted. That is, it provides a lower bound on how long each message will remain in the (uncompacted) head. The topic’s max.compaction.lag.ms can be used to guarantee the maximum delay between the time a message is written and the time the message becomes eligible for compaction.
  2. Ordering of messages is always maintained. Compaction will never reorder messages, just remove some.
  3. The offset for a message never changes. It is the permanent identifier for a position in the log.
  4. Any consumer progressing from the start of the log will see at least the final state of all records in the order they were written. Additionally, all delete markers for deleted records will be seen, provided the consumer reaches the head of the log in a time period less than the topic’s delete.retention.ms setting (the default is 24 hours). In other words: since the removal of delete markers happens concurrently with reads, it is possible for a consumer to miss delete markers if it lags by more than delete.retention.ms.

Configure compaction

Following are some important topic configuration properties for log compaction.

log.cleanup.policy
Log compaction is enabled by setting the cleanup policy, which is a broker-level setting. You can override this setting at the topic level. To enable log cleaning on a topic, add the topic-level property, either at topic creation time or using the alter command; a sketch follows this list. For more information on modifying a topic setting, see Change the retention value for a topic.
log.cleaner.min.compaction.lag.ms
The log cleaner can be configured to retain a minimum amount of the uncompacted “head” of the log. This is enabled by setting the compaction time lag. Use the min setting to prevent messages newer than a minimum message age from being subject to compaction. If not set, all log segments are eligible for compaction except for the last segment, meaning the one currently being written to. The active segment will not be compacted even if all of its messages are older than the minimum compaction time lag. The log cleaner can be configured to ensure a maximum delay after which the uncompacted “head” of the log becomes eligible for log compaction.
log.cleaner.max.compaction.lag.ms
Use this setting to prevent logs with low produce rates from remaining ineligible for compaction for an unbounded duration. If not set, logs that do not exceed min.cleanable.dirty.ratio are not compacted. Note that this compaction deadline is not a hard guarantee, since it is still subject to the availability of log cleaner threads and the actual compaction time. You will want to monitor the uncleanable-partitions-count, max-clean-time-secs, and max-compaction-delay-secs metrics. For more about monitoring logs in Kafka, see Monitor Log Metrics.
delete.retention.ms
Configures the amount of time to retain delete tombstone markers for log compacted topics. This setting also gives a bound on the time in which a consumer must complete a read starting from offset 0 to ensure that it gets a valid snapshot of the state of the topic. Use this setting to help prevent tombstones from being collected before a consumer completes its scan.
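As a sketch, these properties can be applied to an existing topic with the Java AdminClient, using the topic-level names (cleanup.policy, min.compaction.lag.ms, max.compaction.lag.ms, delete.retention.ms) rather than the broker-level log.cleaner.* names above. The topic name and the values shown are placeholders:

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class TuneCompaction {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                        new ConfigResource(ConfigResource.Type.TOPIC, "user-emails");
                // Example values: a 1-minute floor and a 1-day ceiling on the
                // uncompacted head, and 1 day of tombstone retention.
                Collection<AlterConfigOp> ops = List.of(
                        set("min.compaction.lag.ms", "60000"),
                        set("max.compaction.lag.ms", "86400000"),
                        set("delete.retention.ms", "86400000"));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }

        static AlterConfigOp set(String name, String value) {
            return new AlterConfigOp(new ConfigEntry(name, value), AlterConfigOp.OpType.SET);
        }
    }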

Confluent Tip

Read more about these topic configuration values in the Confluent Platform documentation. See:

  • log.cleanup.policy
  • log.cleaner.min.compaction.lag.ms
  • log.cleaner.max.compaction.lag.ms
  • delete.retention.ms

Note

This website includes content developed at the Apache Software Foundation under the terms of the Apache License v2.
