Mastering Heterogeneous Data Formats: JSON, XML, Parquet, and Avro

In today's data-driven world, being able to handle heterogeneous data formats is essential for any data scientist. Whether you're working with lightweight JSON files or high-performance Parquet datasets, Python offers robust tools to make the job easier.

Why Learn About Different Data Formats?

Data comes in various shapes and structures, and understanding these formats is critical: each one trades off human readability, schema enforcement, file size, and read/write performance, so choosing the right format directly affects interoperability, storage cost, and query speed.

Working with JSON

JSON (JavaScript Object Notation) is widely used for its simplicity and readability. Let’s see how to work with it:

import json

# Sample JSON data
json_data = '{"name": "Alice", "age": 25}'

# Parsing JSON
data = json.loads(json_data)
print(data['name'])  # Output: Alice
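Parsing is only half the story: the same module serializes Python objects back to JSON with json.dumps, and json.dump/json.load do the same for file handles. A minimal sketch (the filename data.json is illustrative):

```python
import json

# Serialize a Python dict to a JSON string
record = {"name": "Alice", "age": 25}
json_text = json.dumps(record, indent=2)
print(json_text)

# Round-trip through a file (filename is illustrative)
with open('data.json', 'w') as f:
    json.dump(record, f)

with open('data.json') as f:
    loaded = json.load(f)

print(loaded['age'])  # Output: 25
```

The indent argument is optional; omit it for compact output in APIs or logs.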

Handling XML Data

XML (eXtensible Markup Language) is more verbose but highly structured. Use the xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

# Sample XML data
xml_data = '''<person><name>Bob</name><age>30</age></person>'''

# Parsing XML
tree = ET.ElementTree(ET.fromstring(xml_data))
root = tree.getroot()
print(root.find('name').text)  # Output: Bob
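ElementTree also works in the other direction: you can build a document programmatically and serialize it back to a string. A short sketch that reconstructs the same structure as the example above:

```python
import xml.etree.ElementTree as ET

# Build an XML tree element by element
person = ET.Element('person')
ET.SubElement(person, 'name').text = 'Bob'
ET.SubElement(person, 'age').text = '30'

# Serialize the tree back to bytes, then decode for display
xml_bytes = ET.tostring(person)
print(xml_bytes.decode())  # Output: <person><name>Bob</name><age>30</age></person>
```

Note that ET.tostring returns bytes by default; pass encoding='unicode' to get a str directly.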

Efficient Storage with Parquet

Parquet is a columnar storage format optimized for analytical queries. Use pandas, which delegates to a separately installed engine such as pyarrow:

import pandas as pd

data = {'Name': ['Charlie'], 'Age': [35]}
df = pd.DataFrame(data)

# Save to Parquet (requires pyarrow or fastparquet to be installed)
df.to_parquet('data.parquet')

# Read from Parquet
df_read = pd.read_parquet('data.parquet')
print(df_read)

Binary Efficiency with Avro

Avro is a binary format popular in big data pipelines. Use the fastavro library:

from fastavro import writer, reader, parse_schema

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}
parsed_schema = parse_schema(schema)

records = [{"name": "David", "age": 40}]

# Write Avro file
with open('users.avro', 'wb') as out:
    writer(out, parsed_schema, records)

# Read Avro file
with open('users.avro', 'rb') as fo:
    for record in reader(fo):
        print(record)  # Output: {'name': 'David', 'age': 40}

By mastering these formats, you can efficiently manage and analyze data in diverse scenarios—making your workflows faster, more reliable, and adaptable!