Documentation

/

White Papers

Efficient Storage and Management of Large-Scale Audit Logs (OVEN)

fabian

January 23, 2025

Efficient Storage and Management of Large-Scale Audit Logs (OVEN)

Introduction

Audit logs are essential for ensuring security, transparency, and regulatory compliance across industries, especially in sectors handling sensitive data such as finance, healthcare, and public institutions. These logs often need to be retained for more than five years, necessitating robust storage and management solutions. For instance, large financial institutions generate millions of audit logs daily, resulting in terabytes (TBs) of storage over several years. This underscores the need for efficient, large-scale log management systems.

Challenges

Existing log management systems encounter the following key challenges:

1. Storage Issues

  • Log data grows exponentially, making database storage expansion unavoidable. This leads to increased operational costs and complexity.

2. Querying Challenges

  • Extracting critical insights or compliance-related information from massive log data often results in delays due to inefficient querying capabilities.

3. Inefficiencies in External Integration

  • Effective analysis of large volumes of log data frequently requires additional ETL tasks to load logs into external OLAP storage, increasing development and operational burdens.

Solution

QueryPie has developed a streamlined approach to storing, querying, and managing large-scale audit logs with the following objectives:

1. Efficient Storage of Large-Scale Logs:

  • Log data is efficiently stored using integrations with object storage solutions like S3.

  • Object storage is optimized for handling vast quantities of log data.

2. Enhanced External Integration Efficiency:

  • Data stored in external S3 is formatted for seamless integration and querying with external OLAP systems.

  • Logs are stored in a format tailored to OLAP requirements for smooth ingestion and analysis.

3. Minimalistic Functionality Focused on Audit Logs:

  • Features are limited to those essential for the immutable nature of audit logs.

  • Audit logs are not modified after storage and do not require explicit deletion.

    • The solution emphasizes on storage and retrieval functionality only.

  • Logs are stored in chronological order and can be queried only as lists within specific time ranges.

Detailed Explanation

Diary Overview
Diary Overview

Core Features for Audit Log Management

Audit logs are immutable after being stored and do not require explicit deletion. Therefore, the system provides functionality for storage and retrieval only.

  • Ingestion (WRITE)

    • Logs are initially stored in a high-performance HotStore.

    • After a specific period, logs are aggregated and transferred to a ColdStore for long-term storage.

  • Retrieval (READ)

    • Logs can be queried based on time ranges.

    • The retrieval process is abstracted, ensuring consistent querying across internal storage layers.

Efficient Storage Using Blob Storage

Logs moved to ColdStore are stored in Blob format, integrated with object storage solutions like S3. This ensures:

  • WORM (Write Once Read Many) compliance for secure, immutable log storage.

  • Efficient storage and retrieval without post-ingestion modifications.

Partitioning for Querying and Integration

For audit logs, the log group name and chronological time information are the most essential details for retrieving audit logs.

  • To optimize retrieval and external integration, a path-based partitioning system—commonly used for large-scale log data storage and querying—has been implemented.

    • collection=${COLLECTION}/date=${DATE}{HOUR}.gz

      • Log group name

        • ${COLLECTION}

      • Time

        • ${DATE}: Date of log storage.

        • ${HOUR}: Hour of log storage.

  • Example

    • collection test-log

      • Logs from 2024-01-01T19:00:00 to 2024-01-02T01:00:00 (UTC) are stored as:

Collection

Date

Hour

collection=test-log/

date=2024-01-02/

01.gz

02.gz

date=2024-01-01/

23.gz

22.gz

21.gz

20.gz

19.gz

The path-based partitioning method above enables efficient querying of log data in external OLAP storage.

  • The following example demonstrates how to query and analyze Diary logs stored in S3 directly using Athena.

Athena Integration and Query Example

  • Example: Creating a Table for Diary Logs Stored in S3 Using Athena:

123456789101112131415161718192021222324252627282930
    CREATE EXTERNAL TABLE test_log (        `record_uuid` STRING,        `created_at` STRING,        `data` STRUCT<            msg: STRING,            time: STRING,            level: STRING,            request_id: STRING,            operation_id: STRING        >    )    PARTITIONED BY (        `date_created` STRING    )    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'    STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'    LOCATION 's3://${S3_BUCKET_ROOT_PATH}/collection=test-log/'    TBLPROPERTIES (        'classification' = 'json',        'compressionType' = 'gzip',        'projection.enabled' = 'true',        'projection.date_created.type' = 'date',        'projection.date_created.format' = 'yyyy-MM-dd',        'projection.date_created.interval' = '1',        'projection.date_created.interval.unit' = 'DAYS',        'projection.date_created.range' = '2024-12-01, NOW',        'storage.location.template' = 's3://${S3_BUCKET_ROOT_PATH}/collection=test-log/date=${date_created}/'    );