19 JSON Manipulation

JSON stands for JavaScript Object Notation, a lightweight, text-based format that enables structured data storage and exchange across various platforms and programming languages. Although it originated in JavaScript, JSON has become a universal standard due to its simplicity, readability, and flexibility, making it easy to use in data interchange between clients and servers. JSON documents are composed of key-value pairs, supporting data types such as strings, numbers, booleans, arrays, and nested objects, making it well-suited for representing complex data hierarchies.

A JSON uses key-value pairs and is consisted of:

Objects: {} enclosed data structure containing key-value pairs. Keys are strings, while values can be of various types.
Arrays: [] enclosed list of values, which can hold multiple JSON objects or other data types.

myfirstjson={
    "name": "Alice",
    "age": 30,
    "is_student": "false",
    "courses": ["Math", "Science", "History"],
    "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "zipcode": "12345"
    }
}
myfirstjson

{'name': 'Alice',
 'age': 30,
 'is_student': 'false',
 'courses': ['Math', 'Science', 'History'],
 'address': {'street': '123 Main St', 'city': 'Anytown', 'zipcode': '12345'}}

Common Uses of JSON are:

Data Transfer in APIs: JSON is widely used in RESTful APIs for transferring data between a server and client.
Configuration Files: Many applications use JSON files for configuration due to its easy-to-read structure.
Data Serialization: JSON can serialize data in various programming languages, making it easy to store or send complex data structures.

JSONs are not Dictionaries

A JSON and a Python dictionary are similar in structure albeit with important differences:

JSON syntax is based on JavaScript and is strict. Keys must be in double quotes (“), and only a limited set of data types is allowed (e.g., strings, numbers, booleans, arrays, and nested objects).

{
    "name": "Alice",
    "age": 30,
    "is_student": false
}

Python dictionaries are more flexible in syntax. Keys can be in single or double quotes, and a broader set of data types is supported, including tuples and other Python-specific objects.

{
    'name': 'Alice',
    'age': 30,
    'is_student': False
}

JSONs and Python dictionaries serve different purposes:

A JSON is a text-based format used for data interchange, meaning it is a standardized way of representing structured data in a human-readable form. JSON data is typically stored in a string format when transmitted over the web.
A Python dictionary (dict) is a built-in data type in Python used to store data in key-value pairs. It exists as an in-memory object that Python can work with directly.

19.1 Processing JSONs with Python

Python supports the JSON format through the built-in module named json. The json module is specifically designed for reading and writing strings formatted as JSON. That means you can conveniently convert Python data types into JSON data and the other way around.

The act of converting data into the JSON format is referred to as serialization. This process involves transforming data into a series of bytes for storage or transmission over a network. The opposite process, deserialization, involves decoding data from the JSON format back into a usable form within Python.

JSON deserialization

JSON deserialization is the process of converting JSON-formatted data (a string) into a Python data structure, typically a dictionary or list. In Python, deserialization is done using the methods json.loads() and json.load(). The former deserializes JSON data from a string whereas the latter deserializes JSON data from a file object.

import json

json_string = '{"name": "Alice", "age": 30, "is_student": false}'
python_data = json.loads(json_string)
print(f"I am a {type(python_data)} type")
python_data

I am a <class 'dict'> type

{'name': 'Alice', 'age': 30, 'is_student': False}

The following code illustrates the use of json.load() to read a JSON file and convert it into a list of dictionaries:

import urllib.request, json
with urllib.request.urlopen("https://raw.githubusercontent.com/thousandoaks/Python4DS201/main/data/iris.json") as url:
    iris_DataSet = json.load(url)

print(f"I have a {type(iris_DataSet)} type")

iris_DataSet[0:3]

I have a <class 'list'> type

[{'sepalLength': 5.1,
  'sepalWidth': 3.5,
  'petalLength': 1.4,
  'petalWidth': 0.2,
  'species': 'setosa'},
 {'sepalLength': 4.9,
  'sepalWidth': 3.0,
  'petalLength': 1.4,
  'petalWidth': 0.2,
  'species': 'setosa'},
 {'sepalLength': 4.7,
  'sepalWidth': 3.2,
  'petalLength': 1.3,
  'petalWidth': 0.2,
  'species': 'setosa'}]

print(f"each element of the list is a {type(iris_DataSet[0])}")
iris_DataSet[0]

each element of the list is a <class 'dict'>

{'sepalLength': 5.1,
 'sepalWidth': 3.5,
 'petalLength': 1.4,
 'petalWidth': 0.2,
 'species': 'setosa'}

A common use case that involves JSON deserialization is the processing of generativeAI-based services such as OpenAI GPT.

For illustration purposes let’s assume that you get the following response from OpenAI:

import json

# Sample JSON response as a string (this would usually come from an API request)
json_response = '''
{
    "id": "cmpl-6LTI9pK31sdkk8fU",
    "object": "text_completion",
    "created": 1672531200,
    "model": "text-davinci-003",
    "choices": [
        {
            "text": "Hello! How can I assist you today?",
            "index": 0,
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 10,
        "total_tokens": 20
    }
}
'''

The following code illustrates how to translate the JSON response into a dictionary and extract the text embedded in the response.

# Deserialize the JSON response
response_data = json.loads(json_response)

# Access the generated text
generated_text = response_data["choices"][0]["text"]

print(f" I have a {type(response_data)} object")
generated_text

 I have a <class 'dict'> object

'Hello! How can I assist you today?'

JSON serialization

JSON serialization in Python is the process of converting a Python object (like a dictionary or list) into a JSON-formatted string. This is useful when you need to store data in JSON files, send data to web APIs, or share data across different systems. In Python, serialization is done using the methods json.dumps() and json.dump(). The former converts a Python object to a JSON string, the latter writes a Python object directly to a JSON file.

For instance, given a Python dictionary you can cast it into a JSON as follows:

import json

# Python dictionary
python_data = {
    "name": "Alice",
    "age": 30,
    "is_student": False,
    "courses": ["Math", "Science"]
}

# Serialize to JSON string
json_string = json.dumps(python_data)
print(f"I have a {type(json_string)} type")
print(json_string)

I have a <class 'str'> type
{"name": "Alice", "age": 30, "is_student": false, "courses": ["Math", "Science"]}

The json.dump() method allows you to write the serialized JSON data directly to a file, for instance the following code creates a data.json file with the content of the python_data dictionary.

# Serialize and write to a JSON file
with open('data.json', 'w') as file:
    json.dump(python_data, file)

19.2 Processing JSONs with Pandas

The good news is that Pandas offers methods to serialize and deserialize JSONs.

JSON Deserialization

Pandas offers two methods to transform JSONs into DataFrames, Pandas.read_json() and Pandas.json_normalize().

The most straightforward way to load JSON data into a DataFrame is by using pd.read_json(), for instance:

import pandas as pd
json_data = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df = pd.read_json(json_data)
df

	name	age
0	Alice	30
1	Bob	25

df=pd.read_json("https://raw.githubusercontent.com/thousandoaks/Python4DS201/main/data/iris.json")
df.head(3)

	sepalLength	sepalWidth	petalLength	petalWidth	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sepalLength  150 non-null    float64
 1   sepalWidth   150 non-null    float64
 2   petalLength  150 non-null    float64
 3   petalWidth   150 non-null    float64
 4   species      150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Pandas.json_normalize() is useful for deserializing nested JSON objects, allowing you to flatten complex JSON structures. For instance given the following JSON string:

data = [
    {
        "state": "Florida",
        "shortname": "FL",
        "info": {"governor": "Rick Scott"},
        "county": [
            {"name": "Dade", "population": 12345},
            {"name": "Broward", "population": 40000},
            {"name": "Palm Beach", "population": 60000},
        ],
    },
    {
        "state": "Ohio",
        "shortname": "OH",
        "info": {"governor": "John Kasich"},
        "county": [
            {"name": "Summit", "population": 1234},
            {"name": "Cuyahoga", "population": 1337},
        ],
    },
]

The following code flattens the JSON structure into a Pandas DataFrame. The following arguments are used:

record_path="county": Specifies the nested list (the county field) that we want to flatten into rows. This will create rows for each county in each state.
meta=["state", "shortname", ["info", "governor"]]: Specifies which fields should be included as additional columns in the resulting DataFrame:
- "state" and "shortname": Adds these fields from each state dictionary as columns.
- ["info", "governor"]: Navigates into the nested “info” dictionary to retrieve the “governor” key for each county.

pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]])

	name	population	state	shortname	info.governor
0	Dade	12345	Florida	FL	Rick Scott
1	Broward	40000	Florida	FL	Rick Scott
2	Palm Beach	60000	Florida	FL	Rick Scott
3	Summit	1234	Ohio	OH	John Kasich
4	Cuyahoga	1337	Ohio	OH	John Kasich

Using Generative AI. Flattening complex JSONs

Flattening complex JSONs can be time consuming, you can request help from your favorite GenAI system by submitting the following prompt:

transform the following JSON into a pandas dataframe:

data = [ { “state”: “Florida”, “shortname”: “FL”, “info”: {“governor”: “Rick Scott”}, “county”: [ {“name”: “Dade”, “population”: 12345}, {“name”: “Broward”, “population”: 40000}, {“name”: “Palm Beach”, “population”: 60000}, ], }, { “state”: “Ohio”, “shortname”: “OH”, “info”: {“governor”: “John Kasich”}, “county”: [ {“name”: “Summit”, “population”: 1234}, {“name”: “Cuyahoga”, “population”: 1337}, ], },]

JSON Serialization

Pandas provides the Pandas.to_json() method, which allows you to export a DataFrame to various JSON formats. For instance:

# Sample DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)

# Serialize DataFrame to JSON string
json_data = df.to_json()
print(json_data)

{"name":{"0":"Alice","1":"Bob","2":"Charlie"},"age":{"0":25,"1":30,"2":35},"city":{"0":"New York","1":"Los Angeles","2":"Chicago"}}

Pandas.to_json() method offers several parameters to customize the JSON output format.

json_data = df.to_json(orient='records')
print(json_data)

[{"name":"Alice","age":25,"city":"New York"},{"name":"Bob","age":30,"city":"Los Angeles"},{"name":"Charlie","age":35,"city":"Chicago"}]

Some data processing pipelines prefer the so called JSON lines format (also known as NDJSON or newline-delimited JSON) in which each row of the DataFrame is serialized as a separate JSON object. For instance:

{"name":"Alice","age":25,"city":"New York"}
{"name":"Bob","age":30,"city":"Los Angeles"}
{"name":"Charlie","age":35,"city":"Chicago"}

The following code illustrates how to produce a JSON lines format out of a Pandas DataFrame:

# Save the DataFrame as JSON Lines
df.to_json('data.jsonl', orient='records', lines=True)

19.3 JSON Manipulation in Practice.

First Example (US Food database)

The following Jupyter Notebook illustrates how to process complex JSON-based information in a real setting.

19.4 Conclusion

JSON (JavaScript Object Notation) is a lightweight, text-based format widely used for data storage and exchange across platforms due to its simplicity and flexibility. JSON structures include key-value pairs, supporting data types like strings, numbers, booleans, arrays, and nested objects, making it ideal for representing complex data hierarchies. JSON’s primary use cases include data transfer in APIs, configuration files, and data serialization. While JSON is similar in structure to Python dictionaries, it adheres to stricter syntax rules, such as mandatory double quotes for keys. Python provides built-in support for JSON manipulation through the json module, which facilitates serialization (converting Python objects to JSON strings) and deserialization (converting JSON strings to Python objects).

For JSON deserialization, Python offers methods like json.loads() for strings and json.load() for file objects, allowing seamless conversion of JSON data into Python dictionaries or lists. Conversely, JSON serialization is achieved through json.dumps() for generating JSON strings and json.dump() for writing JSON data to files. Pandas enhances JSON handling with methods like pd.read_json() for basic JSON deserialization and pd.json_normalize() for flattening nested JSON structures into DataFrames. These capabilities make JSON manipulation an integral part of data science workflows, enabling efficient interaction with APIs, structured data storage, and complex data processing.

19.5 Further Readings

For those of you in need of additional, more advanced, topics please refer to the following references: