JSON stands for JavaScript Object Notation, a lightweight, text-based format that enables structured data storage and exchange across various platforms and programming languages. Although it originated in JavaScript, JSON has become a universal standard due to its simplicity, readability, and flexibility, making it easy to use in data interchange between clients and servers. JSON documents are composed of key-value pairs, supporting data types such as strings, numbers, booleans, arrays, and nested objects, making it well-suited for representing complex data hierarchies.
A JSON uses key-value pairs and is consisted of:
Objects: {} enclosed data structure containing key-value pairs. Keys are strings, while values can be of various types.
Arrays: [] enclosed list of values, which can hold multiple JSON objects or other data types.
Data Transfer in APIs: JSON is widely used in RESTful APIs for transferring data between a server and client.
Configuration Files: Many applications use JSON files for configuration due to its easy-to-read structure.
Data Serialization: JSON can serialize data in various programming languages, making it easy to store or send complex data structures.
JSONs are not Dictionaries
A JSON and a Python dictionary are similar in structure albeit with important differences:
JSON syntax is based on JavaScript and is strict. Keys must be in double quotes (“), and only a limited set of data types is allowed (e.g., strings, numbers, booleans, arrays, and nested objects).
{"name": "Alice","age": 30,"is_student": false}
Python dictionaries are more flexible in syntax. Keys can be in single or double quotes, and a broader set of data types is supported, including tuples and other Python-specific objects.
{'name': 'Alice','age': 30,'is_student': False}
JSONs and Python dictionaries serve different purposes:
A JSON is a text-based format used for data interchange, meaning it is a standardized way of representing structured data in a human-readable form. JSON data is typically stored in a string format when transmitted over the web.
A Python dictionary (dict) is a built-in data type in Python used to store data in key-value pairs. It exists as an in-memory object that Python can work with directly.
19.1 Processing JSONs with Python
Python supports the JSON format through the built-in module named json. The json module is specifically designed for reading and writing strings formatted as JSON. That means you can conveniently convert Python data types into JSON data and the other way around.
The act of converting data into the JSON format is referred to as serialization. This process involves transforming data into a series of bytes for storage or transmission over a network. The opposite process, deserialization, involves decoding data from the JSON format back into a usable form within Python.
JSON deserialization
JSON deserialization is the process of converting JSON-formatted data (a string) into a Python data structure, typically a dictionary or list. In Python, deserialization is done using the methods json.loads() and json.load(). The former deserializes JSON data from a string whereas the latter deserializes JSON data from a file object.
import jsonjson_string ='{"name": "Alice", "age": 30, "is_student": false}'python_data = json.loads(json_string)print(f"I am a {type(python_data)} type")python_data
I am a <class 'dict'> type
{'name': 'Alice', 'age': 30, 'is_student': False}
The following code illustrates the use of json.load() to read a JSON file and convert it into a list of dictionaries:
import urllib.request, jsonwith urllib.request.urlopen("https://raw.githubusercontent.com/thousandoaks/Python4DS201/main/data/iris.json") as url: iris_DataSet = json.load(url)print(f"I have a {type(iris_DataSet)} type")iris_DataSet[0:3]
A common use case that involves JSON deserialization is the processing of generativeAI-based services such as OpenAI GPT.
For illustration purposes let’s assume that you get the following response from OpenAI:
import json# Sample JSON response as a string (this would usually come from an API request)json_response ='''{ "id": "cmpl-6LTI9pK31sdkk8fU", "object": "text_completion", "created": 1672531200, "model": "text-davinci-003", "choices": [ { "text": "Hello! How can I assist you today?", "index": 0, "logprobs": null, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 10, "completion_tokens": 10, "total_tokens": 20 }}'''
The following code illustrates how to translate the JSON response into a dictionary and extract the text embedded in the response.
# Deserialize the JSON responseresponse_data = json.loads(json_response)# Access the generated textgenerated_text = response_data["choices"][0]["text"]print(f" I have a {type(response_data)} object")generated_text
I have a <class 'dict'> object
'Hello! How can I assist you today?'
JSON serialization
JSON serialization in Python is the process of converting a Python object (like a dictionary or list) into a JSON-formatted string. This is useful when you need to store data in JSON files, send data to web APIs, or share data across different systems. In Python, serialization is done using the methods json.dumps() and json.dump(). The former converts a Python object to a JSON string, the latter writes a Python object directly to a JSON file.
For instance, given a Python dictionary you can cast it into a JSON as follows:
import json# Python dictionarypython_data = {"name": "Alice","age": 30,"is_student": False,"courses": ["Math", "Science"]}# Serialize to JSON stringjson_string = json.dumps(python_data)print(f"I have a {type(json_string)} type")print(json_string)
I have a <class 'str'> type
{"name": "Alice", "age": 30, "is_student": false, "courses": ["Math", "Science"]}
The json.dump() method allows you to write the serialized JSON data directly to a file, for instance the following code creates a data.json file with the content of the python_data dictionary.
# Serialize and write to a JSON filewithopen('data.json', 'w') asfile: json.dump(python_data, file)
19.2 Processing JSONs with Pandas
The good news is that Pandas offers methods to serialize and deserialize JSONs.
JSON Deserialization
Pandas offers two methods to transform JSONs into DataFrames, Pandas.read_json() and Pandas.json_normalize().
The most straightforward way to load JSON data into a DataFrame is by using pd.read_json(), for instance:
Pandas.json_normalize() is useful for deserializing nested JSON objects, allowing you to flatten complex JSON structures. For instance given the following JSON string:
The following code flattens the JSON structure into a Pandas DataFrame. The following arguments are used:
record_path="county": Specifies the nested list (the county field) that we want to flatten into rows. This will create rows for each county in each state.
meta=["state", "shortname", ["info", "governor"]]: Specifies which fields should be included as additional columns in the resulting DataFrame:
"state" and "shortname": Adds these fields from each state dictionary as columns.
["info", "governor"]: Navigates into the nested “info” dictionary to retrieve the “governor” key for each county.
Some data processing pipelines prefer the so called JSON lines format (also known as NDJSON or newline-delimited JSON) in which each row of the DataFrame is serialized as a separate JSON object. For instance:
The following code illustrates how to produce a JSON lines format out of a Pandas DataFrame:
# Save the DataFrame as JSON Linesdf.to_json('data.jsonl', orient='records', lines=True)
19.3 JSON Manipulation in Practice.
First Example (US Food database)
The following Jupyter Notebook illustrates how to process complex JSON-based information in a real setting.
19.4 Conclusion
JSON (JavaScript Object Notation) is a lightweight, text-based format widely used for data storage and exchange across platforms due to its simplicity and flexibility. JSON structures include key-value pairs, supporting data types like strings, numbers, booleans, arrays, and nested objects, making it ideal for representing complex data hierarchies. JSON’s primary use cases include data transfer in APIs, configuration files, and data serialization. While JSON is similar in structure to Python dictionaries, it adheres to stricter syntax rules, such as mandatory double quotes for keys. Python provides built-in support for JSON manipulation through the json module, which facilitates serialization (converting Python objects to JSON strings) and deserialization (converting JSON strings to Python objects).
For JSON deserialization, Python offers methods like json.loads() for strings and json.load() for file objects, allowing seamless conversion of JSON data into Python dictionaries or lists. Conversely, JSON serialization is achieved through json.dumps() for generating JSON strings and json.dump() for writing JSON data to files. Pandas enhances JSON handling with methods like pd.read_json() for basic JSON deserialization and pd.json_normalize() for flattening nested JSON structures into DataFrames. These capabilities make JSON manipulation an integral part of data science workflows, enabling efficient interaction with APIs, structured data storage, and complex data processing.
19.5 Further Readings
For those of you in need of additional, more advanced, topics please refer to the following references: