The Evolution of Data Storage Formats: From CSV to Delta

Hey there! Can you believe how far we’ve come from the good ol’ days of the humble CSV file? As our need for faster and more flexible data storage grew, some pretty awesome new formats hit the scene. In this article, we’re going to take a deep dive into the evolution of data storage formats, from CSV to Delta, and look at what makes each one an improvement over the last. Get ready to geek out with me!

CSV (Comma Separated Values)

CSV is a simple, widely used format for storing tabular data. Each line in a CSV file represents a record, with values separated by commas.

Why was it introduced?
CSV was created as a straightforward way to store and share tabular data, requiring minimal processing to read or write.

Limitations:

  • Limited data types and structure
  • No support for data compression
  • No schema enforcement or data validation

CSV
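
To make this concrete, here’s a minimal sketch of writing and reading a small CSV file with Python’s standard csv module; the file name and columns are made up for illustration.

```python
import csv

rows = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
]

# Write: every value ends up as plain text; there is no schema or type information.
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age", "city"])
    writer.writeheader()
    writer.writerows(rows)

# Read: everything comes back as strings, so "age" has to be cast by hand.
with open("people.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["name"], int(record["age"]))
```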

JSON (JavaScript Object Notation)

JSON is a lightweight, text-based format for data interchange that uses a subset of JavaScript’s syntax.

Why was it introduced?
JSON was designed as an alternative to XML, offering human-readable and easy-to-parse data structures.

Improvements over CSV:

  • Supports hierarchical data structures
  • More flexible in handling different data types
  • Better support for metadata

JSON Object
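
Here’s a small illustration of the same kind of record as a JSON document, using Python’s built-in json module; the field names are made up.

```python
import json

person = {
    "name": "Alice",
    "age": 30,
    "address": {"city": "New York", "zip": "10001"},  # nested structure, unlike CSV
    "languages": ["English", "Spanish"],
}

text = json.dumps(person, indent=2)   # serialize to a human-readable string
restored = json.loads(text)           # parse back into native Python objects
print(restored["address"]["city"])    # -> New York
```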

Why is JSON inefficient?
JSON is not an efficient format for serialization and deserialization, which can lead to high CPU utilization. There are a few reasons for this inefficiency (a small sketch after the list illustrates the first two):

  1. Verbosity: JSON is a text-based format, which means it requires more bytes to represent the same data compared to binary formats. The added overhead can lead to increased memory consumption and processing time, especially for large data sets.
  2. Parsing: JSON data needs to be parsed into native data structures when it’s deserialized, which can be computationally expensive.
  3. Stringification: Converting various data types to their string representation can require significant CPU cycles.
  4. Encoding and decoding: JSON uses Unicode, which can increase the processing time for encoding and decoding operations, particularly for non-English characters or special symbols.
  5. Lack of schema: JSON does not have a built-in schema or type system, which means that data validation and error checking must be done manually.
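
As a rough illustration of the first two points, the toy comparison below serializes the same records as JSON text and as a simple fixed-width binary packing. It’s a sketch, not a benchmark, and the record layout is made up.

```python
import json
import struct
import timeit

records = [(i, i * 0.5) for i in range(10_000)]

# Text: every number is rendered as characters and must be re-parsed on read.
as_json = json.dumps(records)

# Binary: each (int, float) pair becomes a fixed 12-byte struct, no string conversion.
as_binary = b"".join(struct.pack("<id", i, x) for i, x in records)

print(len(as_json.encode("utf-8")), "bytes as JSON text")
print(len(as_binary), "bytes as packed binary")
print(timeit.timeit(lambda: json.loads(as_json), number=100), "s to parse the JSON 100 times")
```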

Parquet

Parquet is a columnar storage file format optimized for use with big data processing frameworks like Apache Hadoop and Apache Spark.

Why was it introduced?
Parquet was developed to address the need for efficient storage and processing of large-scale datasets in distributed systems.

Improvements over JSON:

  • Columnar storage format, allowing for better data compression and query performance
  • Schema evolution support
  • Optimized for big data processing frameworks

The binary representation of a columnar layout, which is essentially how Parquet lays data out on disk
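
Here’s a minimal sketch of writing and reading a Parquet file with the pyarrow library (assuming it’s installed); the table contents are made up.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table; each column is stored contiguously.
table = pa.table({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 22],
    "city": ["New York", "San Francisco", "Los Angeles"],
})

# Write to Parquet; pyarrow uses Snappy compression by default.
pq.write_table(table, "people.parquet")

# Read back; the schema (column names and types) travels with the file.
print(pq.read_table("people.parquet").schema)
```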

Now, let’s get to the point: why is columnar storage in Parquet more efficient for reads? Let’s take a closer look.

  1. Compression: Columnar storage allows for better compression because values within a column are often similar in type and range. This similarity results in better compression ratios compared to row-based storage formats, where data from different columns is mixed.
    Snappy is the default compression algorithm because it strikes a good balance between compression ratio and speed (a size comparison across codecs is sketched after the figure below).

All Compression Algorithms compared
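
To see the effect of different codecs, here’s a small sketch that writes the same repetitive table with several compression settings and compares file sizes; it assumes a pyarrow build that includes these codecs, and the column names are made up.

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive column compresses very well in a columnar layout.
table = pa.table({
    "status": ["OK"] * 90_000 + ["ERROR"] * 10_000,
    "value": list(range(100_000)),
})

for codec in ["NONE", "SNAPPY", "GZIP", "ZSTD"]:
    path = f"data_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```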

2. Encoding: Parquet supports various encoding techniques, such as dictionary encoding, run-length encoding, and bit packing, which can further reduce the storage footprint and improve read performance.

All the Encoding Algorithms explained
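
The toy sketch below is not Parquet’s real on-disk encoding, but it illustrates why dictionary encoding and run-length encoding shrink a repetitive column.

```python
# A repetitive column of country codes.
column = ["US", "US", "US", "DE", "DE", "US", "US", "FR"]

# Dictionary encoding: store each distinct value once, then compact integer ids.
dictionary = sorted(set(column))              # ['DE', 'FR', 'US']
ids = [dictionary.index(v) for v in column]   # [2, 2, 2, 0, 0, 2, 2, 1]

# Run-length encoding on the ids: (id, run length) pairs instead of raw values.
runs = []
for i in ids:
    if runs and runs[-1][0] == i:
        runs[-1][1] += 1
    else:
        runs.append([i, 1])

print(dictionary, runs)  # ['DE', 'FR', 'US'] [[2, 3], [0, 2], [2, 2], [1, 1]]
```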

3. Vectorized query execution: Modern query engines can take advantage of columnar storage to perform vectorized operations, allowing them to process data in large batches rather than one row at a time. This approach improves CPU efficiency by operating on cache-friendly, contiguous column batches and cutting down per-row overhead.
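
As a small sketch of batch-at-a-time reading, pyarrow can iterate over a Parquet file in record batches and pull only the columns a query needs; this assumes the people.parquet file written in the earlier sketch.

```python
import pyarrow.parquet as pq

# Read in column-oriented record batches rather than row by row; selecting only
# the "age" column skips the other columns on disk entirely.
pf = pq.ParquetFile("people.parquet")
for batch in pf.iter_batches(batch_size=2, columns=["age"]):
    print(batch.num_rows, batch.column(0).to_pylist())
```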

ORC (Optimized Row Columnar)

ORC is another columnar storage format designed for the Hadoop ecosystem, with a focus on optimizing performance and storage efficiency.

Why was it introduced?
ORC was created to provide better compression and faster read performance compared to other columnar storage formats.

Improvements over Parquet:

  • Light-weight compression algorithms, resulting in smaller file sizes
  • Improved read performance
  • Predicate pushdown and vectorization support
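
For completeness, here’s a minimal sketch of writing and reading an ORC file with pyarrow’s ORC module (available in recent pyarrow versions); the table contents are made up.

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [30, 25, 22],
})

# Write and read an ORC file; the schema is stored in the file footer.
orc.write_table(table, "people.orc")
print(orc.read_table("people.orc").schema)
```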

Avro

Avro is a row-based storage format that supports schema evolution, making it suitable for storing long-term data.

Why was it introduced?
Avro was developed to address the need for a flexible and efficient data serialization framework that could handle schema changes over time.

Improvements over ORC:

  • Row-based storage, better suited for write-heavy, record-at-a-time workloads
  • Native support for schema evolution
  • Compact binary format

Avro

Serialized Avro data (binary format):

\x0cAlice\x1e\x0cNew York\x0cBob\x19\x0eSan Francisco\x0eCharlie\x16\x10Los Angeles
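
Here’s a minimal sketch of how records like these might be defined and serialized with the fastavro library (assuming it’s installed); the schema and ages are made up to roughly match the example above.

```python
from fastavro import parse_schema, reader, writer

# Hypothetical schema; the record and field names are assumptions for illustration.
schema = parse_schema({
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "city", "type": "string"},
    ],
})

records = [
    {"name": "Alice", "age": 30, "city": "New York"},
    {"name": "Bob", "age": 25, "city": "San Francisco"},
    {"name": "Charlie", "age": 22, "city": "Los Angeles"},
]

with open("people.avro", "wb") as out:
    writer(out, schema, records)   # the schema is embedded alongside the binary data

with open("people.avro", "rb") as inp:
    for rec in reader(inp):
        print(rec["name"], rec["city"])
```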

HDF5 (Hierarchical Data Format 5)

HDF5 is a versatile data storage format designed for high-performance computing and large-scale datasets.

Why was it introduced?
HDF5 was developed to handle complex, hierarchical data structures and support efficient data storage and retrieval in scientific applications.

Improvements over Avro:

  • Supports complex, hierarchical data structures
  • Optimized for high-performance computing environments
  • Advanced data compression and chunking capabilities

Example of HDF5
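
As a concrete sketch, here’s how a hierarchical HDF5 file might be created with the h5py library; the group, dataset, and attribute names are all made up for illustration.

```python
import h5py
import numpy as np

# Hierarchical layout: groups act like directories, datasets like typed arrays.
with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run_001")
    run.attrs["instrument"] = "sensor-A"          # metadata attached to the group
    run.create_dataset(
        "temperature",
        data=np.random.rand(1000, 1000),
        chunks=(100, 100),                        # chunked storage
        compression="gzip",                       # transparent compression
    )

with h5py.File("experiment.h5", "r") as f:
    block = f["run_001/temperature"][:100, :100]  # read only one chunk-sized slice
    print(block.shape, f["run_001"].attrs["instrument"])
```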

Delta Lake

Delta Lake is an open-source storage layer, built on top of Parquet and tightly integrated with Apache Spark, that brings ACID transactions and time-travel capabilities to big data workloads.

Why was it introduced?
Delta Lake was developed to address the challenges of data reliability and consistency in large-scale, distributed data processing systems.

Improvements over HDF5:

  • ACID transaction support, ensuring data consistency
  • Time-travel capabilities for versioning and data rollback
  • Schema enforcement and evolution
  • Integration with Apache Spark and other big data frameworks

Delta Lake Table Structuring
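
Here’s a hedged sketch of the basics using PySpark; it assumes a SparkSession named spark that has been configured with the delta-spark package, and the path and columns are made up.

```python
# Assumes `spark` is a SparkSession configured with the delta-spark package.
df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

# Write as a Delta table: Parquet data files plus a _delta_log transaction log.
df.write.format("delta").mode("overwrite").save("/tmp/people_delta")

# Append more rows; each commit becomes a new version in the log.
spark.createDataFrame([("Charlie", 22)], ["name", "age"]) \
     .write.format("delta").mode("append").save("/tmp/people_delta")

# Time travel: read the table as it looked at version 0, before the append.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/people_delta")
v0.show()
```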

Conclusion

Comparison between All the Data Storage Formats

When choosing a data storage format, consider the following factors:

  1. Compatibility: Ensure the format is compatible with the tools, languages, and platforms you are using.
  2. Compression: If storage space is a concern, choose a format that provides better compression ratios.
  3. Schema evolution: If your data schema is likely to change, choose a format that supports schema evolution, like Avro, Parquet, or Delta Lake.
  4. Query performance: If read performance is a priority, consider using columnar formats like Parquet or ORC, which are optimized for fast analytical queries.
  5. Write performance: Consider formats like HDF5 or Delta Lake that offer efficient write performance, especially if you have real-time or high-volume write requirements.