Serialization-Deserialization

Serialization converts a data structure or object state into a format that can be stored, transmitted, and reconstructed later.

Deserialization is the reverse process, where the stored or transmitted data is used to recreate the original data structure or object state.

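A typical round trip: objects in one language (Python, Scala, Rust, ...) are serialized to JSON, and the JSON is deserialized back into objects in the same or a different language.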

An analogy: translating from Spanish into English, which acts as the universal language, and then from English into German.

JSON is a lightweight format for storing and transporting data, easy for humans to read and write and for machines to parse and generate.
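The round trip described above can be sketched with Python's standard `json` module. The `Person` class and its field names are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Person:
    # Hypothetical example type; not part of any library.
    first_name: str
    last_name: str
    age: int
    email: Optional[str] = None


def serialize(person: Person) -> str:
    """Turn a Person object into a JSON string."""
    return json.dumps(asdict(person))


def deserialize(payload: str) -> Person:
    """Rebuild a Person object from a JSON string."""
    return Person(**json.loads(payload))


original = Person("Ada", "Lovelace", 36)
payload = serialize(original)    # plain text that any language can parse
restored = deserialize(payload)  # back to a Python object
```

The `payload` string is the part that gets stored or sent over the network; a Scala or Rust program could parse the same text into its own types.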

Avro is a binary serialization format designed to serialize complex data structures compactly and efficiently. It is often used in big data applications.

  1. Compact and Efficient: Avro uses binary serialization, making it more compact and efficient than text-based formats like JSON. This results in faster data processing and reduced storage needs.

  2. Schema Evolution: Avro supports schema evolution: fields can be added, removed, or changed while maintaining backward and forward compatibility. This makes it easier to evolve your data model over time without breaking existing systems.

  3. Rich Data Structures: It supports various primitive and complex data types, including nested and recursive structures. This makes it suitable for complex data representation.

  4. Fast Serialization and Deserialization: Avro's binary format allows for faster data serialization and deserialization, which is crucial for high-performance computing tasks.

  5. Integration with Big Data Tools: Avro is well-integrated with several big data technologies like Apache Hadoop, Apache Kafka, and Apache Spark, making it a popular choice for data serialization in big data ecosystems.

  6. Language Independent: Avro can be used in various programming languages, making it a versatile choice for systems that involve multiple languages.

  7. Self-Describing Format: Avro data files store the schema in the file header alongside the data, allowing any program that receives a file to read it without knowing the schema in advance. This self-describing nature facilitates easier data processing and exchange between systems.
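To make the compactness point concrete, here is a minimal hand-rolled sketch of how Avro's binary encoding lays out one record, compared with the same record as JSON text. It assumes a hypothetical person record with two strings, an int, and an optional string. It covers only the record body, not the file header that carries the schema, and it is not a substitute for a real Avro library.

```python
import json


def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode an integer (Avro's int/long encoding)."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(first_name, last_name, age, email=None) -> bytes:
    """Fields are written back to back in schema order; no field names are stored."""
    body = encode_string(first_name) + encode_string(last_name) + encode_long(age)
    if email is None:
        body += encode_long(0)  # union branch 0 = null, which has no payload
    else:
        body += encode_long(1) + encode_string(email)  # union branch 1 = string
    return body


binary = encode_person("Ada", "Lovelace", 36)
text = json.dumps(
    {"firstName": "Ada", "lastName": "Lovelace", "age": 36, "email": None}
).encode("utf-8")
```

Here `binary` is 15 bytes, while `text` is considerably longer because JSON repeats every field name and its punctuation in each record. The trade-off is that the binary form cannot be read without the schema.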

Schemas

An Avro schema defines the structure of Avro data. It is a JSON document that names each field and its type, so that even complex data structures are described precisely. The schema is required for both serialization and deserialization, because it tells systems how to interpret the data.

Example of Avro Schema

{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
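The `default` on the `email` field is what lets a reader using this schema accept older records written before that field existed. The sketch below imitates that rule on plain Python dictionaries. The `resolve` helper is hypothetical and far simpler than the schema resolution a real Avro library performs on binary data.

```python
import json

# The schema shown above, loaded as an ordinary JSON document.
READER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Person",
  "namespace": "com.example",
  "fields": [
    {"name": "firstName", "type": "string"},
    {"name": "lastName", "type": "string"},
    {"name": "age", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
""")


def resolve(record: dict, schema: dict) -> dict:
    """Fill fields missing from the record with the reader schema's defaults."""
    resolved = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved


# A record written before the email field was added to the schema.
old_record = {"firstName": "Ada", "lastName": "Lovelace", "age": 36}
person = resolve(old_record, READER_SCHEMA)
```

After resolution, `person["email"]` is `None`. A record missing `age` would be rejected, because that field declares no default.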

Avro supports the following primitive types:

  • null: no value

  • boolean: true or false

  • int: 32-bit signed integer

  • long: 64-bit signed integer

  • float: single precision (32-bit) IEEE 754 floating-point number

  • double: double precision (64-bit) IEEE 754 floating-point number

  • bytes: a sequence of 8-bit unsigned bytes

  • string: Unicode character sequence
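As a rough guide to how these types line up with Python values, the sketch below checks a value against a primitive type name. The `matches` helper and the mapping are assumptions made for illustration; Avro libraries apply their own validation rules.

```python
def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# One check per Avro primitive type.
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: _is_int(v) and -2**31 <= v < 2**31,   # 32-bit signed range
    "long": lambda v: _is_int(v) and -2**63 <= v < 2**63,  # 64-bit signed range
    # A Python float is double precision; storing it as an Avro float may lose precision.
    "float": lambda v: isinstance(v, float),
    "double": lambda v: isinstance(v, float),
    "bytes": lambda v: isinstance(v, bytes),
    "string": lambda v: isinstance(v, str),
}


def matches(avro_type: str, value) -> bool:
    """Return True if the Python value fits the named Avro primitive type."""
    return PRIMITIVE_CHECKS[avro_type](value)
```

For example, `matches("int", 2**31)` is `False` because the value overflows 32 bits, while `matches("long", 2**31)` is `True`.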

Avro supports six kinds of complex data types:

  • Records

  • Enums

  • Arrays

  • Maps

  • Unions

  • Fixed
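The sketch below builds one schema that uses all six complex types, written as a Python dictionary and dumped to the JSON form Avro expects. The `Order` record and its field names are hypothetical.

```python
import json

ORDER_SCHEMA = {
    "type": "record",  # record: a named collection of fields
    "name": "Order",
    "namespace": "com.example",
    "fields": [
        # enum: one symbol from a fixed set
        {"name": "status",
         "type": {"type": "enum", "name": "Status",
                  "symbols": ["NEW", "SHIPPED", "DELIVERED"]}},
        # array: an ordered list of items of one type
        {"name": "items", "type": {"type": "array", "items": "string"}},
        # map: string keys mapped to values of one type
        {"name": "attributes", "type": {"type": "map", "values": "long"}},
        # union: a value that may be any one of the listed types
        {"name": "note", "type": ["null", "string"], "default": None},
        # fixed: exactly `size` bytes
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
    ],
}

# The JSON text is what would be saved as the schema document.
schema_json = json.dumps(ORDER_SCHEMA, indent=2)
```

Note that a union is written as a JSON array of types rather than as an object with a `"type"` key, and that `None` becomes `null` once the dictionary is dumped to JSON.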

git clone https://github.com/gchandra10/serialization_deserialization.git
