15  Reading and Writing Data

Python offers native methods to read and write files, yet whenever possible it is advisable to rely on Pandas when reading and writing data-related files. Pandas provides a variety of functions to handle different file formats, making it easy to import and export data. To read data from a file, you can use functions like pd.read_csv(), pd.read_excel(), and others. These functions load data into a DataFrame, which is Pandas’ primary data structure.

Writing data to a file is equally straightforward with Pandas. You can use functions like df.to_csv(), df.to_excel(), and df.to_json() to export your DataFrame to various file formats. This is particularly useful for saving your processed data or sharing it with others.

15.1 Reading Data using Pandas

You already know how to read data in text format using Pandas. The following code uses pd.read_csv() to read a CSV file from the internet and load it into a new DataFrame, df.

import pandas as pd

df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
df.head(3)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0

Reading files from the outside world can become quite nuanced in practice due to data inconsistencies, missing fields and many other circumstances. Be prepared to invest a significant amount of time learning how to use pd.read_csv() and its arguments; the following table summarizes the most relevant ones.

Argument Description
filepath_or_buffer The file path or object, URL, or a file-like object containing the CSV data.
sep The delimiter to use. Default is ‘,’ (comma).
delimiter Alternative argument name for ‘sep’.
header Row number(s) to use as the column names. Default is ‘infer’.
quotechar Character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
names List of column names to use.
index_col Column(s) to use as the row index (a list creates a MultiIndex).
usecols Return a subset of the columns.
dtype Dictionary of column data types.
engine Parser engine to use: ‘c’ (default) or ‘python’.
converters Dictionary of functions for converting values in certain columns.
true_values List of values to consider as True.
false_values List of values to consider as False.
skiprows Line numbers to skip (0-indexed) or number of lines to skip (int).
nrows Number of rows of file to read.
na_values Additional strings to recognize as NA/NaN.
parse_dates Boolean or list of ints or names, to parse dates.
date_parser Function to use for parsing dates (deprecated since pandas 2.0 in favour of date_format).
thousands Thousands separator.
comment Character indicating the start of a comment; the remainder of the line is ignored.
on_bad_lines How to handle lines with too many fields: ‘error’, ‘warn’ or ‘skip’. Replaces the older error_bad_lines and warn_bad_lines arguments, which were removed in pandas 2.0.
low_memory Internally process file in chunks, resulting in lower memory use.
memory_map Flag to use memory mapping when reading file. Default is False.
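
To make these arguments concrete, here is a minimal sketch that exercises a few of them on a small in-memory sample (the sample data is hypothetical, wrapped in io.StringIO so the snippet is self-contained):

import pandas as pd
from io import StringIO

# A small hypothetical CSV held in memory
raw = StringIO(
    "id,name,score,joined\n"
    "1,Alice,9.5,2021-03-01\n"
    "2,Bob,N/A,2021-04-15\n"
    "3,Carol,7.8,2021-05-20\n"
)

df = pd.read_csv(
    raw,
    usecols=["id", "name", "score"],  # load a subset of the columns
    dtype={"id": "int64"},            # force the type of a column
    na_values=["N/A"],                # extra strings to treat as NaN
    nrows=2,                          # read only the first two data rows
)

The same arguments apply unchanged when the first argument is a file path or a URL.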

Example (Datetime Parsing)

A useful feature of pd.read_csv() is the ability to specify in advance the types of the data you are reading. The following example reads an external CSV file into a new DataFrame, casting the columns StartDate, EndDate and BirthDate to the datetime64 type.

HealthCareData = pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/HealthcareDataset_PublicRelease.csv",
                             sep=',',
                             parse_dates=['StartDate', 'EndDate', 'BirthDate'])

HealthCareData.head(3)
Id MemberName MemberID County MedicalClaim ClaimItem HospitalName HospitalType StartDate EndDate PrincipalDiagnosisDesc PrincipalDiagnosis RevenueCodeDesc RevenueCode TypeFlag BirthDate TotalExpenses
0 634363 e659f3f4 6a380a28 6f943458 c1e3436737c77899 18 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... 636.0 ER 1967-05-13 15.148
1 634364 e659f3f4 6a380a28 6f943458 c1e3436737c77899 21 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... 636.0 ER 1967-05-13 3.073
2 634387 e659f3f4 6a380a28 6f943458 c1e3436737c77899 10 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY 305.0 ER 1967-05-13 123.900
HealthCareData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52563 entries, 0 to 52562
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Id                      52563 non-null  int64         
 1   MemberName              52563 non-null  object        
 2   MemberID                52563 non-null  object        
 3   County                  52563 non-null  object        
 4   MedicalClaim            52563 non-null  object        
 5   ClaimItem               52563 non-null  int64         
 6   HospitalName            52563 non-null  object        
 7   HospitalType            52563 non-null  object        
 8   StartDate               52563 non-null  datetime64[ns]
 9   EndDate                 52563 non-null  datetime64[ns]
 10  PrincipalDiagnosisDesc  52563 non-null  object        
 11  PrincipalDiagnosis      52563 non-null  object        
 12  RevenueCodeDesc         52561 non-null  object        
 13  RevenueCode             52563 non-null  float64       
 14  TypeFlag                52563 non-null  object        
 15  BirthDate               52563 non-null  datetime64[ns]
 16  TotalExpenses           52563 non-null  float64       
dtypes: datetime64[ns](3), float64(2), int64(2), object(10)
memory usage: 6.8+ MB
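
One payoff of casting these columns up front is that the .dt accessor becomes available immediately. As a quick sketch (the new column names StartYear and StayDays are purely illustrative):

# datetime64 columns expose the .dt accessor
HealthCareData['StartYear'] = HealthCareData['StartDate'].dt.year

# subtracting two datetime64 columns yields a timedelta,
# e.g. the length of stay in days
HealthCareData['StayDays'] = (HealthCareData['EndDate'] - HealthCareData['StartDate']).dt.days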

Example (Text Parsing)

It is quite common to run into parsing errors when text fields inside the file contain the very character used to separate columns (e.g. a comma). The quotechar argument usually resolves this ambiguity.

The following example reads a compressed CSV file directly into the Articles DataFrame (Pandas decompresses the zip archive transparently). We explicitly set ‘,’ as the separator and ‘"’ as the character that delimits quoted text.

Articles = pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/articles.csv.zip",
                       sep=',',
                       quotechar='"')
Articles
article_id product_code prod_name product_type_no product_type_name product_group_name graphical_appearance_no graphical_appearance_name colour_group_code colour_group_name ... department_name index_code index_name index_group_no index_group_name section_no section_name garment_group_no garment_group_name detail_desc
0 108775015 108775 Strap top 253 Vest top Garment Upper body 1010016 Solid 9 Black ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
1 108775044 108775 Strap top 253 Vest top Garment Upper body 1010016 Solid 10 White ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
2 108775051 108775 Strap top (1) 253 Vest top Garment Upper body 1010017 Stripe 11 Off White ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
3 110065001 110065 OP T-shirt (Idro) 306 Bra Underwear 1010016 Solid 9 Black ... Clean Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Microfibre T-shirt bra with underwired, moulde...
4 110065002 110065 OP T-shirt (Idro) 306 Bra Underwear 1010016 Solid 10 White ... Clean Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Microfibre T-shirt bra with underwired, moulde...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105537 953450001 953450 5pk regular Placement1 302 Socks Socks & Tights 1010014 Placement print 9 Black ... Socks Bin F Menswear 3 Menswear 26 Men Underwear 1021 Socks and Tights Socks in a fine-knit cotton blend with a small...
105538 953763001 953763 SPORT Malaga tank 253 Vest top Garment Upper body 1010016 Solid 9 Black ... Jersey A Ladieswear 1 Ladieswear 2 H&M+ 1005 Jersey Fancy Loose-fitting sports vest top in ribbed fast-d...
105539 956217002 956217 Cartwheel dress 265 Dress Garment Full body 1010016 Solid 9 Black ... Jersey A Ladieswear 1 Ladieswear 18 Womens Trend 1005 Jersey Fancy Short, A-line dress in jersey with a round nec...
105540 957375001 957375 CLAIRE HAIR CLAW 72 Hair clip Accessories 1010016 Solid 9 Black ... Small Accessories D Divided 2 Divided 52 Divided Accessories 1019 Accessories Large plastic hair claw.
105541 959461001 959461 Lounge dress 265 Dress Garment Full body 1010016 Solid 11 Off White ... Jersey A Ladieswear 1 Ladieswear 18 Womens Trend 1005 Jersey Fancy Calf-length dress in ribbed jersey made from a...

105542 rows × 25 columns
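
To see exactly what quotechar does, consider this minimal in-memory sketch (the two-row sample is hypothetical): the comma inside "Shorts, linen" is protected by the quotes and is not treated as a column separator.

import pandas as pd
from io import StringIO

sample = StringIO('prod_name,detail_desc\n"Shorts, linen","Regular waist"\n')

pd.read_csv(sample, sep=',', quotechar='"')
#        prod_name    detail_desc
# 0  Shorts, linen  Regular waist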

Using Generative AI for coding purposes

Don’t forget that you can use your favourite GenAI-based assistant to help you identify the right arguments to process external files.

“Give me code to ignore errors in pandas.read_csv()”

This will return something similar to:

import pandas as pd

# Example CSV file path
file_path = 'example.csv'

try:
    df = pd.read_csv(file_path, error_bad_lines=False, warn_bad_lines=False)
except Exception as e:
    print(f"An error occurred: {e}")
An error occurred: read_csv() got an unexpected keyword argument 'error_bad_lines'
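
The error message above is itself instructive: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0, so assistant-generated code may target an older library version than the one you are running. Always test what you are given. The modern equivalent is the on_bad_lines argument, sketched below ('example.csv' remains a placeholder path):

import pandas as pd

# pandas >= 1.3: skip malformed lines instead of raising an error
df = pd.read_csv('example.csv', on_bad_lines='skip')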

Example (Other formats)

Pandas offers several methods to ingest external data. For instance, to open an existing Excel file we use the method pd.read_excel(). Given the Excel file “mini.xlsx” stored at the following URL:

excelDataFrame = pd.read_excel('https://github.com/thousandoaks/Python4DS103/blob/main/data/mini.xlsx?raw=true')
excelDataFrame.head()
Unnamed: 0.1 Unnamed: 0 project_title region funding_ratio badge tag language remark start_date end_date date_remark sentiment anger fear joy love optimism
0 6976 6976 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en Hi! I had not receive my rewards util today, t... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-07-02 11:00:27 NEGATIVE YES NO NO NO NO
1 6977 6977 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en I have just received mine yesterday! Yipee! An... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-12 13:33:46 POSITIVE NO NO YES NO YES
2 6978 6978 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en I haven't received mine. Winnipeg, Canada. For... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-11 16:11:59 NEUTRAL NO NO NO NO NO
3 6979 6979 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en When are you going to fix your website? 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-07 23:15:33 NEUTRAL NO NO NO NO NO
4 6980 6980 EasyTouch: Turn your world into a touch sensor CA 3.8548 Superbacker Hardware en Just received mine on May 6th. Vancouver, Canada. 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-07 01:46:24 NEUTRAL NO NO NO NO NO
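
pd.read_excel() accepts many of the same arguments as pd.read_csv(), plus Excel-specific ones such as sheet_name, which selects the worksheet to load. A brief sketch (note that reading .xlsx files requires the openpyxl package to be installed):

# sheet_name accepts a position, a name, or a list of either;
# 0 (the first worksheet) is the default
firstSheet = pd.read_excel('https://github.com/thousandoaks/Python4DS103/blob/main/data/mini.xlsx?raw=true',
                           sheet_name=0)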

The following table provides a list of methods you can use to read external files using Pandas.

Method Description
pd.read_csv() Reads a CSV file into a DataFrame.
pd.read_excel() Reads an Excel file into a DataFrame.
pd.read_sql() Reads from a SQL database into a DataFrame.
pd.read_json() Reads a JSON string or file into a DataFrame.
pd.read_html() Reads HTML tables into a list of DataFrames.
pd.read_clipboard() Reads the clipboard’s contents into a DataFrame.
pd.read_hdf() Reads from an HDF5 file into a DataFrame.
pd.read_parquet() Reads a Parquet file into a DataFrame.
pd.read_orc() Reads an ORC file into a DataFrame.
pd.read_feather() Reads a Feather-format file into a DataFrame.
pd.read_stata() Reads a Stata file into a DataFrame.
pd.read_sas() Reads SAS files into a DataFrame.
pd.read_spss() Reads SPSS files into a DataFrame.
pd.read_pickle() Reads a pickled object (serialized Python object) into a DataFrame.
pd.read_sql_table() Reads a SQL database table into a DataFrame.
pd.read_sql_query() Executes a SQL query and returns the result as a DataFrame.
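
As a quick taste of one more of these readers, the following sketch parses a small in-memory JSON document (a hypothetical sample wrapped in io.StringIO) with pd.read_json():

import pandas as pd
from io import StringIO

json_data = StringIO('[{"region": "North", "sales": 200}, {"region": "South", "sales": 150}]')
pd.read_json(json_data)
#   region  sales
# 0  North    200
# 1  South    150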

Example (Multiple files)

A frequent use case you will run into is the ingestion of several files at once, potentially dozens of them.

One option is to rely on Pandas’ native methods. For instance, given the following URLs:

file1 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"
file2 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_Second_Batch.csv"
file3 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_Third_Batch.csv"

The following code uses a list comprehension to process each URL with pd.read_csv(). The result is stored in articlelist, which is a list of DataFrames.

import pandas as pd
# Read multiple CSV files
files = [file1, file2, file3]
articlelist = [pd.read_csv(file) for file in files]
type(articlelist)
list

You can further process articlelist, for instance by concatenating all of its elements into a single DataFrame called Articles.

Articles = pd.concat(articlelist)
Articles
Unnamed: 0 article_id product_code prod_name product_type_no product_type_name product_group_name graphical_appearance_no graphical_appearance_name colour_group_code ... department_name index_code index_name index_group_no index_group_name section_no section_name garment_group_no garment_group_name detail_desc
0 52188 700817001 700817 Charlotte Wireless Abeline 306 Bra Underwear 1010021 Lace 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Soft, non-wired bra in lace with crossover str...
1 52956 702802001 702802 Buzz padded soft bra back det 306 Bra Underwear 1010020 Contrast 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Soft, non-wired bra in lace and mesh with line...
2 12555 554049001 554049 Ursula embroidery 275 Skirt Garment Lower body 1010016 Solid 9 ... Skirt A Ladieswear 1 Ladieswear 15 Womens Everyday Collection 1012 Skirts Knee-length skirt in embroidered mesh with an ...
3 98783 880793001 880793 Mille linen premium shorts 274 Shorts Garment Lower body 1010016 Solid 73 ... Shorts & Skirts A Ladieswear 1 Ladieswear 6 Womens Casual 1025 Shorts Shorts in airy linen. Regular drawstring waist...
4 7908 518318003 518318 Pat ls shaggy flanel check 259 Shirt Garment Upper body 1010004 Check 73 ... Shirt F Menswear 3 Menswear 21 Contemporary Casual 1011 Shirts Checked shirt in soft cotton flannel with a tu...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
295 39627 662108002 662108 ES Alba 252 Sweater Garment Upper body 1010001 All over pattern 7 ... Kids Girl Knitwear H Children Sizes 92-140 4 Baby/Children 76 Kids Girl 1003 Knitwear Long-sleeved jumper in soft, fine-knit cotton ...
296 32024 633813007 633813 Enrique sneaker 94 Sneakers Shoes 1010016 Solid 9 ... Sneakers C Ladies Accessories 1 Ladieswear 64 Womens Shoes 1020 Shoes Cotton twill shoes with lacing at the front an...
297 40918 666037001 666037 Nutella corset 284 Underwear corset Underwear 1010021 Lace 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Corset in lace and mesh with adjustable suspen...
298 77861 791492008 791492 SEQUINS tee TP 255 T-shirt Garment Upper body 1010015 Sequin 19 ... Kids Boy Jersey Fancy H Children Sizes 92-140 4 Baby/Children 46 Kids Boy 1005 Jersey Fancy T-shirt in soft jersey with a reversible sequi...
299 97419 874168001 874168 Walter 2p Coolmax R-neck(1) 306 Bra Underwear 1010016 Solid 10 ... Underwear Jersey F Menswear 3 Menswear 26 Men Underwear 1017 Under-, Nightwear Round-necked T-shirts in a soft, stretch cotto...

900 rows × 26 columns
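
When the files live in a local directory rather than at known URLs, the standard-library glob module pairs well with the same list-comprehension pattern. A sketch, assuming a hypothetical data/ folder of CSV files:

import glob
import pandas as pd

# collect every CSV file under a local folder ('data/' is a placeholder path)
files = glob.glob('data/*.csv')
articlelist = [pd.read_csv(file) for file in files]

# ignore_index=True renumbers the rows of the combined DataFrame
Articles = pd.concat(articlelist, ignore_index=True)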

Example (Relational Database)

A frequent use case you will surely run into is the retrieval of data from relational databases. We will devote more time to relational databases in another chapter.

Say you are granted access to a local database called ‘sales.db’. The following code (1) opens a connection to the database and then (2) reads all fields from the table sales.

import pandas as pd
import sqlite3

# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('sales.db')

# Read data from the SQL table into a DataFrame
df = pd.read_sql('SELECT * FROM sales', conn)

df.head(3)
region sales
0 North 200
1 South 150
2 East 300
Accessing Relational Databases

Don’t worry if you don’t understand the argument ‘SELECT * FROM sales’; it is a SQL statement. You will learn more about SQL in future chapters.

15.2 Reading Data using Requests

It is always a good idea to have alternatives to Pandas when it comes to accessing and storing external data. The requests library is the de facto standard for making HTTP requests in Python. It abstracts the complexities of making requests behind a simple API so that you can focus on interacting with services and consuming data in your application.

The following code issues an HTTP request to a URL and returns: (1) the status of the response, and (2) the first three lines of the response body.

import requests

url = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"

response = requests.get(url)

response.raise_for_status()

print(response)


# Get the response text
response_text = response.text

# Split the text into lines
lines = response_text.splitlines()

# Extract the first three lines
first_three_lines = lines[:3]

first_three_lines
<Response [200]>
[',article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc',
 '52188,700817001,700817,Charlotte Wireless Abeline,306,Bra,Underwear,1010021,Lace,9,Black,4,Dark,5,Black,1338,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Soft, non-wired bra in lace with crossover straps at the front and lightly padded, moulded cups that shape the bust and provide good support. Narrow, adjustable shoulder straps, a lace trim at the hem and a hook-and-eye fastening at the back."',
 '52956,702802001,702802,Buzz padded soft bra back det,306,Bra,Underwear,1010020,Contrast,9,Black,4,Dark,5,Black,1338,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Soft, non-wired bra in lace and mesh with lined cups with removable inserts for a larger bust and fuller cleavage. Narrow, adjustable shoulder straps, a lace racer back and front fastening."']
Always check the status of connections

It is good practice to always call response.raise_for_status() after requests.get() to check for HTTP errors before performing any other operation.
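
Once a response has been validated, its text can be handed to pd.read_csv() via io.StringIO. This buys you requests’ extra controls, such as timeouts, while still ending up with a DataFrame. A sketch (the timeout value is illustrative):

import requests
import pandas as pd
from io import StringIO

url = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"

# timeout prevents the call from hanging forever on an unresponsive server
response = requests.get(url, timeout=30)
response.raise_for_status()

# feed the downloaded text to Pandas as if it were a local file
df = pd.read_csv(StringIO(response.text))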

Example (Wikipedia API-based access)

The requests library is extremely handy for accessing external services. The following example performs an API-based request to Wikipedia.

import requests

# Define the URL for the Wikipedia API
url = 'https://en.wikipedia.org/w/api.php'

# Define the parameters for the API request
params = {
    'action': 'query',
    'format': 'json',
    'titles': 'Python (programming language)',
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True
}

# Make the GET request
response = requests.get(url, params=params)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    
    # Extract the page content
    page = next(iter(data['query']['pages'].values()))
    content = page['extract']
    
    # Print the content
    print(content)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine learning community.
Using Generative AI for coding purposes

The previous example was created by Microsoft Copilot; I simply submitted the following prompt:

“Example of a request python to download a wikipedia page”

Example (Multiple File processing)

The requests library can also help you discover which files exist before downloading them. The following code queries the GitHub API directory “https://api.github.com/repos/thousandoaks/Python4DS-I/contents/datasets”, extracting the file listing into the variable files.

import requests

# Define the URL for the GitHub API
url = 'https://api.github.com/repos/thousandoaks/Python4DS-I/contents/datasets'

# Make the GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    files = response.json()
    
    # Print the list of files
    for file in files:
        print("I got:",file['name'])
else:
    print(f"Failed to retrieve the files. Status code: {response.status_code}")
I got: Articles_First_Batch.csv
I got: Articles_Second_Batch.csv
I got: Articles_Third_Batch.csv
I got: HealthcareDataset_PublicRelease.csv
I got: PharmaDataSet.csv.zip
I got: articles.csv.zip
I got: customers.csv.zip
I got: transactions2020.csv.zip

Once we have the list of files to be retrieved, we iterate over files, this time processing each element with the pd.read_csv() method.

import pandas as pd
from IPython.display import display, HTML

## for brevity we display only the first two rows
## and first two columns of each file
for file in files:
    fileURL = file['download_url']
    print("I am processing the file:", fileURL)
    print('\n')
    internalfile = pd.read_csv(fileURL)
    display(HTML(internalfile.iloc[:2, :2].to_html()))
    print('\n')
I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_First_Batch.csv

Unnamed: 0 article_id
0 52188 700817001
1 52956 702802001


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_Second_Batch.csv

Unnamed: 0 article_id
0 2323 402382009
1 29553 625653029


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_Third_Batch.csv

Unnamed: 0 article_id
0 74841 778515002
1 100266 889379009


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/HealthcareDataset_PublicRelease.csv

Id MemberName
0 634363 e659f3f4
1 634364 e659f3f4


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/PharmaDataSet.csv.zip

TransactionId Date
0 172865 2019-01-01 10:47:41
1 172868 2019-01-01 10:47:41


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/articles.csv.zip

article_id product_code
0 108775015 108775
1 108775044 108775


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/customers.csv.zip

customer_id FN
0 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657 NaN
1 0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa NaN


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/transactions2020.csv.zip

t_dat customer_id
0 2020-06-01 00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec5e0fb21b12f862f73d
1 2020-06-01 000b31552d3785c79833262bbeefa484cbc43d7b612b3c8fc696260b5afaadc4

15.3 Writing Data using Pandas

Writing files using Pandas is straightforward. The following code uses DataFrame.to_csv() to write a DataFrame to a local file in CSV format.

import pandas as pd

df.to_csv("./california_housing_train.csv", sep=",")
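
Note that by default to_csv() also writes the DataFrame index as an extra, unnamed first column; this is why re-reading such files often yields an “Unnamed: 0” column, as seen in some of the outputs earlier in this chapter. Pass index=False to write the data columns only:

# write the data columns only, without the row index
df.to_csv("./california_housing_train.csv", sep=",", index=False)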

Pandas provides several methods you can use to write your datasets; refer to the following table.

Method Description File Format
DataFrame.to_csv() Write DataFrame to a comma-separated values file CSV
DataFrame.to_excel() Write DataFrame to an Excel file Excel (XLSX, XLS)
DataFrame.to_json() Write DataFrame to a JSON file JSON
DataFrame.to_html() Write DataFrame to an HTML file HTML
DataFrame.to_sql() Write DataFrame to a SQL database table SQL Database
DataFrame.to_pickle() Serialize DataFrame to a pickle file Pickle
DataFrame.to_hdf() Write DataFrame to an HDF5 file HDF5
DataFrame.to_feather() Write DataFrame to a Feather file Feather
DataFrame.to_parquet() Write DataFrame to a Parquet file Parquet
DataFrame.to_stata() Write DataFrame to a Stata file Stata (DTA)
DataFrame.to_records() Write DataFrame to a numpy record array NumPy Record Array
DataFrame.to_latex() Write DataFrame to a LaTeX tabular environment LaTeX
DataFrame.to_markdown() Write DataFrame to a markdown-formatted string Markdown

Example (Pickle Files)

The CSV format is quite common for exchanging data on the internet given its simplicity and the fact that it is human-readable. However, it is not ideal if you need to handle large volumes of data. In these situations it is advisable to use other formats that are more efficient and robust.

The following code writes an existing DataFrame df to localfile.pkl. The .pkl extension denotes the pickle format.

import pandas as pd

df.to_pickle("./localfile.pkl")
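
The file can later be restored, with all column types intact, using the matching reader. One caveat: only load pickle files from sources you trust, since unpickling can execute arbitrary code.

# read the pickled DataFrame back; dtypes such as datetime64 survive the round trip
df = pd.read_pickle("./localfile.pkl")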

Example (SQL table)

It is possible to write information to SQL-based databases using the DataFrame.to_sql() method.

Say you are granted access to a local database called ‘sales.db’. The following code (1) opens a connection to the database and then (2) writes the DataFrame df into the table sales.

import pandas as pd
import sqlite3

# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('sales.db')

# Write DataFrame to SQL table, replacing the table if it already exists
df.to_sql('sales', con=conn, if_exists='replace', index=False)
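
A quick way to confirm the write succeeded is to read the table straight back over the same connection, then close it:

# read the table back to verify the write
check = pd.read_sql('SELECT * FROM sales', conn)
print(check.head(3))

conn.close()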

15.4 Conclusion

This chapter explores reading and writing data in Python, emphasizing the efficiency and versatility of using Pandas for these operations. Reading data is made easy with methods like pd.read_csv() and pd.read_excel(), which load external files into Pandas DataFrames for analysis. The chapter discusses handling common issues such as inconsistent data, missing fields, and complex formats using arguments like sep, parse_dates, and quotechar. Writing data is equally straightforward with methods like df.to_csv() and df.to_excel(), enabling the export of processed data for storage or sharing. The chapter also introduces alternatives like the requests library for HTTP-based data retrieval and APIs for dynamic data acquisition.

In the next chapter you will learn how to efficiently process time series data using Pandas.

15.5 Further Reading

For those of you in need of additional, more advanced topics, please refer to the following references: