15  Reading and Writing Data

Python offers native methods to read and write files, yet whenever possible it is advisable to rely on Pandas when reading and writing data-related files. Pandas provides a variety of functions to handle different file formats, making it easy to import and export data. To read data from a file, you can use functions like pd.read_csv(), pd.read_excel(), and others. These functions load data into a DataFrame, which is Pandas’ primary data structure.

Writing data to a file is equally straightforward with Pandas. You can use functions like df.to_csv(), df.to_excel(), and df.to_json() to export your DataFrame to various file formats. This is particularly useful for saving your processed data or sharing it with others.

15.1 Reading Data using Pandas

You already know how to read data in text format using Pandas. The following code uses pd.read_csv() to read a CSV file from the internet and load it into a new DataFrame, df.

import pandas as pd

df = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
df.head(3)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0

Reading files from the outside world can become quite nuanced in practice due to data inconsistencies, missing fields and many other circumstances. Be prepared to invest a significant amount of time learning how to use pd.read_csv() and its arguments; the following table summarizes the most relevant ones.

Argument Description
filepath_or_buffer The file path or object, URL, or a file-like object containing the CSV data.
sep The delimiter to use. Default is ‘,’ (comma).
delimiter Alternative argument name for ‘sep’.
header Row number(s) to use as the column names. Default is ‘infer’.
quotechar Character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
names List of column names to use.
index_col Column(s) to use as the row index (a list creates a MultiIndex).
usecols Return a subset of the columns.
dtype Dictionary of column data types.
engine Parser engine to use: ‘c’ (default) or ‘python’.
converters Dictionary of functions for converting values in certain columns.
true_values List of values to consider as True.
false_values List of values to consider as False.
skiprows Line numbers to skip (0-indexed) or number of lines to skip (int).
nrows Number of rows of file to read.
na_values Additional strings to recognize as NA/NaN.
parse_dates Boolean or list of ints or names, to parse dates.
date_parser Function to use for parsing dates (deprecated since pandas 2.0 in favour of date_format).
thousands Thousands separator.
comment Character indicating the start of a comment; the remainder of the line is ignored.
on_bad_lines How to handle lines with too many fields: ‘error’, ‘warn’ or ‘skip’. Replaces the older error_bad_lines and warn_bad_lines arguments, which were removed in pandas 2.0.
low_memory Internally process file in chunks, resulting in lower memory use.
memory_map Flag to use memory mapping when reading file. Default is False.
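
To make these arguments concrete, here is a minimal sketch that exercises a few of them on a small in-memory sample (the sample data is hypothetical, wrapped in io.StringIO so the snippet is self-contained):

import pandas as pd
from io import StringIO

# A small hypothetical CSV held in memory
raw = StringIO(
    "id,name,score,joined\n"
    "1,Alice,9.5,2021-03-01\n"
    "2,Bob,N/A,2021-04-15\n"
    "3,Carol,7.8,2021-05-20\n"
)

df = pd.read_csv(
    raw,
    usecols=["id", "name", "score"],  # load a subset of the columns
    dtype={"id": "int64"},            # force the type of a column
    na_values=["N/A"],                # extra strings to treat as NaN
    nrows=2,                          # read only the first two data rows
)

The same arguments apply unchanged when the first argument is a file path or a URL.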

Example (Datetime Parsing)

A useful feature of pd.read_csv() is the ability to specify in advance the types of the data you are reading. The following example reads an external CSV file into a new DataFrame, casting the columns StartDate, EndDate and BirthDate to the datetime64 type.

HealthCareData = pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/HealthcareDataset_PublicRelease.csv",
                             sep=',',
                             parse_dates=['StartDate', 'EndDate', 'BirthDate'])

HealthCareData.head(3)
Id MemberName MemberID County MedicalClaim ClaimItem HospitalName HospitalType StartDate EndDate PrincipalDiagnosisDesc PrincipalDiagnosis RevenueCodeDesc RevenueCode TypeFlag BirthDate TotalExpenses
0 634363 e659f3f4 6a380a28 6f943458 c1e3436737c77899 18 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... 636.0 ER 1967-05-13 15.148
1 634364 e659f3f4 6a380a28 6f943458 c1e3436737c77899 21 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 DRUGS REQUIRE SPECIFIC ID: DRUGS REQUIRING DET... 636.0 ER 1967-05-13 3.073
2 634387 e659f3f4 6a380a28 6f943458 c1e3436737c77899 10 04b77561 HOSPITAL 2020-01-08 2020-01-08 Epigastric pain R10.13 LABORATORY - CLINICAL DIAGNOSTIC: HEMATOLOGY 305.0 ER 1967-05-13 123.900
HealthCareData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52563 entries, 0 to 52562
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   Id                      52563 non-null  int64         
 1   MemberName              52563 non-null  object        
 2   MemberID                52563 non-null  object        
 3   County                  52563 non-null  object        
 4   MedicalClaim            52563 non-null  object        
 5   ClaimItem               52563 non-null  int64         
 6   HospitalName            52563 non-null  object        
 7   HospitalType            52563 non-null  object        
 8   StartDate               52563 non-null  datetime64[ns]
 9   EndDate                 52563 non-null  datetime64[ns]
 10  PrincipalDiagnosisDesc  52563 non-null  object        
 11  PrincipalDiagnosis      52563 non-null  object        
 12  RevenueCodeDesc         52561 non-null  object        
 13  RevenueCode             52563 non-null  float64       
 14  TypeFlag                52563 non-null  object        
 15  BirthDate               52563 non-null  datetime64[ns]
 16  TotalExpenses           52563 non-null  float64       
dtypes: datetime64[ns](3), float64(2), int64(2), object(10)
memory usage: 6.8+ MB
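
One payoff of casting these columns up front is that the .dt accessor becomes available immediately. As a quick sketch (the new column names StartYear and StayDays are purely illustrative):

# datetime64 columns expose the .dt accessor
HealthCareData['StartYear'] = HealthCareData['StartDate'].dt.year

# subtracting two datetime64 columns yields a timedelta,
# e.g. the length of stay in days
HealthCareData['StayDays'] = (HealthCareData['EndDate'] - HealthCareData['StartDate']).dt.days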

Example (Text Parsing)

It is quite common to run into parsing errors when text fields inside the file contain the very character used to separate columns (e.g. a comma). The quotechar argument usually resolves this ambiguity.

The following example reads a compressed CSV file directly into the Articles DataFrame (Pandas decompresses the zip archive transparently). We explicitly set ‘,’ as the separator and ‘"’ as the character that delimits quoted text.

Articles = pd.read_csv("https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/articles.csv.zip",
                       sep=',',
                       quotechar='"')
Articles
article_id product_code prod_name product_type_no product_type_name product_group_name graphical_appearance_no graphical_appearance_name colour_group_code colour_group_name ... department_name index_code index_name index_group_no index_group_name section_no section_name garment_group_no garment_group_name detail_desc
0 108775015 108775 Strap top 253 Vest top Garment Upper body 1010016 Solid 9 Black ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
1 108775044 108775 Strap top 253 Vest top Garment Upper body 1010016 Solid 10 White ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
2 108775051 108775 Strap top (1) 253 Vest top Garment Upper body 1010017 Stripe 11 Off White ... Jersey Basic A Ladieswear 1 Ladieswear 16 Womens Everyday Basics 1002 Jersey Basic Jersey top with narrow shoulder straps.
3 110065001 110065 OP T-shirt (Idro) 306 Bra Underwear 1010016 Solid 9 Black ... Clean Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Microfibre T-shirt bra with underwired, moulde...
4 110065002 110065 OP T-shirt (Idro) 306 Bra Underwear 1010016 Solid 10 White ... Clean Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Microfibre T-shirt bra with underwired, moulde...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105537 953450001 953450 5pk regular Placement1 302 Socks Socks & Tights 1010014 Placement print 9 Black ... Socks Bin F Menswear 3 Menswear 26 Men Underwear 1021 Socks and Tights Socks in a fine-knit cotton blend with a small...
105538 953763001 953763 SPORT Malaga tank 253 Vest top Garment Upper body 1010016 Solid 9 Black ... Jersey A Ladieswear 1 Ladieswear 2 H&M+ 1005 Jersey Fancy Loose-fitting sports vest top in ribbed fast-d...
105539 956217002 956217 Cartwheel dress 265 Dress Garment Full body 1010016 Solid 9 Black ... Jersey A Ladieswear 1 Ladieswear 18 Womens Trend 1005 Jersey Fancy Short, A-line dress in jersey with a round nec...
105540 957375001 957375 CLAIRE HAIR CLAW 72 Hair clip Accessories 1010016 Solid 9 Black ... Small Accessories D Divided 2 Divided 52 Divided Accessories 1019 Accessories Large plastic hair claw.
105541 959461001 959461 Lounge dress 265 Dress Garment Full body 1010016 Solid 11 Off White ... Jersey A Ladieswear 1 Ladieswear 18 Womens Trend 1005 Jersey Fancy Calf-length dress in ribbed jersey made from a...

105542 rows × 25 columns
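
To see exactly what quotechar does, consider this minimal in-memory sketch (the two-row sample is hypothetical): the comma inside "Shorts, linen" is protected by the quotes and is not treated as a column separator.

import pandas as pd
from io import StringIO

sample = StringIO('prod_name,detail_desc\n"Shorts, linen","Regular waist"\n')

pd.read_csv(sample, sep=',', quotechar='"')
#        prod_name    detail_desc
# 0  Shorts, linen  Regular waist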

Using Generative AI for coding purposes

Don’t forget that you can use your favourite GenAI-based assistant to help you identify the right arguments to process external files.

“Give me code to ignore errors in pandas.read_csv()”

This will return something similar to:

import pandas as pd

# Example CSV file path
file_path = 'example.csv'

try:
    df = pd.read_csv(file_path, error_bad_lines=False, warn_bad_lines=False)
except Exception as e:
    print(f"An error occurred: {e}")
An error occurred: read_csv() got an unexpected keyword argument 'error_bad_lines'
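
The error message above is itself instructive: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in pandas 2.0, so assistant-generated code may target an older library version than the one you are running. Always test what you are given. The modern equivalent is the on_bad_lines argument, sketched below ('example.csv' remains a placeholder path):

import pandas as pd

# pandas >= 1.3: skip malformed lines instead of raising an error
df = pd.read_csv('example.csv', on_bad_lines='skip')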

Example (Other formats)

Pandas offers several methods to ingest external data. For instance, to open an existing Excel file we use the method pd.read_excel(). Given the Excel file “mini.xlsx” stored at the following URL:

excelDataFrame = pd.read_excel('https://github.com/thousandoaks/Python4DS103/blob/main/data/mini.xlsx?raw=true')
excelDataFrame.head()
Unnamed: 0.1 Unnamed: 0 project_title region funding_ratio badge tag language remark start_date end_date date_remark sentiment anger fear joy love optimism
0 6976 6976 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en Hi! I had not receive my rewards util today, t... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-07-02 11:00:27 NEGATIVE YES NO NO NO NO
1 6977 6977 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en I have just received mine yesterday! Yipee! An... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-12 13:33:46 POSITIVE NO NO YES NO YES
2 6978 6978 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en I haven't received mine. Winnipeg, Canada. For... 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-11 16:11:59 NEUTRAL NO NO NO NO NO
3 6979 6979 EasyTouch: Turn your world into a touch sensor CA 3.8548 Backer Hardware en When are you going to fix your website? 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-07 23:15:33 NEUTRAL NO NO NO NO NO
4 6980 6980 EasyTouch: Turn your world into a touch sensor CA 3.8548 Superbacker Hardware en Just received mine on May 6th. Vancouver, Canada. 2014-10-21 14:30:02 2014-11-20 15:30:12 2015-05-07 01:46:24 NEUTRAL NO NO NO NO NO
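
pd.read_excel() accepts many of the same arguments as pd.read_csv(), plus Excel-specific ones such as sheet_name, which selects the worksheet to load. A brief sketch (note that reading .xlsx files requires the openpyxl package to be installed):

# sheet_name accepts a position, a name, or a list of either;
# 0 (the first worksheet) is the default
firstSheet = pd.read_excel('https://github.com/thousandoaks/Python4DS103/blob/main/data/mini.xlsx?raw=true',
                           sheet_name=0)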

The following table provides a list of methods you can use to read external files using Pandas.

Method Description
pd.read_csv() Reads a CSV file into a DataFrame.
pd.read_excel() Reads an Excel file into a DataFrame.
pd.read_sql() Reads from a SQL database into a DataFrame.
pd.read_json() Reads a JSON string or file into a DataFrame.
pd.read_html() Reads HTML tables into a list of DataFrames.
pd.read_clipboard() Reads the clipboard’s contents into a DataFrame.
pd.read_hdf() Reads from an HDF5 file into a DataFrame.
pd.read_parquet() Reads a Parquet file into a DataFrame.
pd.read_orc() Reads an ORC file into a DataFrame.
pd.read_feather() Reads a Feather-format file into a DataFrame.
pd.read_stata() Reads a Stata file into a DataFrame.
pd.read_sas() Reads SAS files into a DataFrame.
pd.read_spss() Reads SPSS files into a DataFrame.
pd.read_pickle() Reads a pickled object (serialized Python object) into a DataFrame.
pd.read_sql_table() Reads a SQL database table into a DataFrame.
pd.read_sql_query() Executes a SQL query and returns the result as a DataFrame.
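
As a quick taste of one more of these readers, the following sketch parses a small in-memory JSON document (a hypothetical sample wrapped in io.StringIO) with pd.read_json():

import pandas as pd
from io import StringIO

json_data = StringIO('[{"region": "North", "sales": 200}, {"region": "South", "sales": 150}]')
pd.read_json(json_data)
#   region  sales
# 0  North    200
# 1  South    150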

Example (Multiple files)

A frequent use case you will run into is the ingestion of several files at once, potentially dozens of them.

One option is to rely on Pandas’ native methods. For instance, given the following URLs:

file1 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"
file2 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_Second_Batch.csv"
file3 = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_Third_Batch.csv"

The following code uses a list comprehension to process each URL with pd.read_csv(). The result is stored in articlelist, which is a list of DataFrames.

import pandas as pd
# Read multiple CSV files
files = [file1, file2, file3]
articlelist = [pd.read_csv(file) for file in files]
type(articlelist)
list

You can further process articlelist, for instance by concatenating all of its elements into a single DataFrame called Articles.

Articles = pd.concat(articlelist)
Articles
Unnamed: 0 article_id product_code prod_name product_type_no product_type_name product_group_name graphical_appearance_no graphical_appearance_name colour_group_code ... department_name index_code index_name index_group_no index_group_name section_no section_name garment_group_no garment_group_name detail_desc
0 52188 700817001 700817 Charlotte Wireless Abeline 306 Bra Underwear 1010021 Lace 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Soft, non-wired bra in lace with crossover str...
1 52956 702802001 702802 Buzz padded soft bra back det 306 Bra Underwear 1010020 Contrast 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Soft, non-wired bra in lace and mesh with line...
2 12555 554049001 554049 Ursula embroidery 275 Skirt Garment Lower body 1010016 Solid 9 ... Skirt A Ladieswear 1 Ladieswear 15 Womens Everyday Collection 1012 Skirts Knee-length skirt in embroidered mesh with an ...
3 98783 880793001 880793 Mille linen premium shorts 274 Shorts Garment Lower body 1010016 Solid 73 ... Shorts & Skirts A Ladieswear 1 Ladieswear 6 Womens Casual 1025 Shorts Shorts in airy linen. Regular drawstring waist...
4 7908 518318003 518318 Pat ls shaggy flanel check 259 Shirt Garment Upper body 1010004 Check 73 ... Shirt F Menswear 3 Menswear 21 Contemporary Casual 1011 Shirts Checked shirt in soft cotton flannel with a tu...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
295 39627 662108002 662108 ES Alba 252 Sweater Garment Upper body 1010001 All over pattern 7 ... Kids Girl Knitwear H Children Sizes 92-140 4 Baby/Children 76 Kids Girl 1003 Knitwear Long-sleeved jumper in soft, fine-knit cotton ...
296 32024 633813007 633813 Enrique sneaker 94 Sneakers Shoes 1010016 Solid 9 ... Sneakers C Ladies Accessories 1 Ladieswear 64 Womens Shoes 1020 Shoes Cotton twill shoes with lacing at the front an...
297 40918 666037001 666037 Nutella corset 284 Underwear corset Underwear 1010021 Lace 9 ... Expressive Lingerie B Lingeries/Tights 1 Ladieswear 61 Womens Lingerie 1017 Under-, Nightwear Corset in lace and mesh with adjustable suspen...
298 77861 791492008 791492 SEQUINS tee TP 255 T-shirt Garment Upper body 1010015 Sequin 19 ... Kids Boy Jersey Fancy H Children Sizes 92-140 4 Baby/Children 46 Kids Boy 1005 Jersey Fancy T-shirt in soft jersey with a reversible sequi...
299 97419 874168001 874168 Walter 2p Coolmax R-neck(1) 306 Bra Underwear 1010016 Solid 10 ... Underwear Jersey F Menswear 3 Menswear 26 Men Underwear 1017 Under-, Nightwear Round-necked T-shirts in a soft, stretch cotto...

900 rows × 26 columns
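
When the files live in a local directory rather than at known URLs, the standard-library glob module pairs well with the same list-comprehension pattern. A sketch, assuming a hypothetical data/ folder of CSV files:

import glob
import pandas as pd

# collect every CSV file under a local folder ('data/' is a placeholder path)
files = glob.glob('data/*.csv')
articlelist = [pd.read_csv(file) for file in files]

# ignore_index=True renumbers the rows of the combined DataFrame
Articles = pd.concat(articlelist, ignore_index=True)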

Example (Relational Database)

A frequent use case you will surely run into is the retrieval of data from relational databases. We will devote more time to relational databases in another chapter.

Say you are granted access to a local database called ‘sales.db’. The following code (1) opens a connection to the database and then (2) reads all fields from the table sales.

import pandas as pd
import sqlite3

# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('sales.db')

# Read data from the SQL table into a DataFrame
df = pd.read_sql('SELECT * FROM sales', conn)

df.head(3)
region sales
0 North 200
1 South 150
2 East 300
Accessing Relational Databases

Don’t worry if you don’t understand the argument ‘SELECT * FROM sales’; it is a SQL statement. You will learn more about SQL in future chapters.

15.2 Reading Data using Requests

It is always a good idea to have alternatives to Pandas when it comes to accessing and storing external data. The requests library is the de facto standard for making HTTP requests in Python. It abstracts the complexities of making requests behind a simple API so that you can focus on interacting with services and consuming data in your application.

The following code issues an HTTP request to a URL and returns: (1) the status of the response, and (2) the first three lines of the response body.

import requests

url = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"

response = requests.get(url)

response.raise_for_status()

print(response)


# Get the response text
response_text = response.text

# Split the text into lines
lines = response_text.splitlines()

# Extract the first three lines
first_three_lines = lines[:3]

first_three_lines
<Response [200]>
[',article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,perceived_colour_value_id,perceived_colour_value_name,perceived_colour_master_id,perceived_colour_master_name,department_no,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc',
 '52188,700817001,700817,Charlotte Wireless Abeline,306,Bra,Underwear,1010021,Lace,9,Black,4,Dark,5,Black,1338,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Soft, non-wired bra in lace with crossover straps at the front and lightly padded, moulded cups that shape the bust and provide good support. Narrow, adjustable shoulder straps, a lace trim at the hem and a hook-and-eye fastening at the back."',
 '52956,702802001,702802,Buzz padded soft bra back det,306,Bra,Underwear,1010020,Contrast,9,Black,4,Dark,5,Black,1338,Expressive Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Soft, non-wired bra in lace and mesh with lined cups with removable inserts for a larger bust and fuller cleavage. Narrow, adjustable shoulder straps, a lace racer back and front fastening."']
Always check the status of connections

It is good practice to always call response.raise_for_status() after requests.get() to check for HTTP errors before performing any other operation.
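
Once a response has been validated, its text can be handed to pd.read_csv() via io.StringIO. This buys you requests’ extra controls, such as timeouts, while still ending up with a DataFrame. A sketch (the timeout value is illustrative):

import requests
import pandas as pd
from io import StringIO

url = "https://github.com/thousandoaks/Python4DS-I/raw/main/datasets/Articles_First_Batch.csv"

# timeout prevents the call from hanging forever on an unresponsive server
response = requests.get(url, timeout=30)
response.raise_for_status()

# feed the downloaded text to Pandas as if it were a local file
df = pd.read_csv(StringIO(response.text))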

Example (Wikipedia API-based access)

The requests library is extremely handy for accessing external services. The following example performs an API-based request to Wikipedia.

import requests

# Define the URL for the Wikipedia API
url = 'https://en.wikipedia.org/w/api.php'

# Define the parameters for the API request
params = {
    'action': 'query',
    'format': 'json',
    'titles': 'Python (programming language)',
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True
}

# Make the GET request
response = requests.get(url, params=params)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()
    
    # Extract the page content
    page = next(iter(data['query']['pages'].values()))
    content = page['extract']
    
    # Print the content
    print(content)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and has gained widespread use in the machine learning community.
Using Generative AI for coding purposes

The previous example was created by Microsoft Copilot; I simply submitted the following prompt:

“Example of a request python to download a wikipedia page”

Example (Multiple File processing)

The requests library can also help you discover which files exist before downloading them. The following code queries the GitHub API directory “https://api.github.com/repos/thousandoaks/Python4DS-I/contents/datasets”, extracting the file listing into the variable files.

import requests

# Define the URL for the GitHub API
url = 'https://api.github.com/repos/thousandoaks/Python4DS-I/contents/datasets'

# Make the GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    files = response.json()
    
    # Print the list of files
    for file in files:
        print("I got:",file['name'])
else:
    print(f"Failed to retrieve the files. Status code: {response.status_code}")
I got: Articles_First_Batch.csv
I got: Articles_Second_Batch.csv
I got: Articles_Third_Batch.csv
I got: HealthcareDataset_PublicRelease.csv
I got: PharmaDataSet.csv.zip
I got: articles.csv.zip
I got: customers.csv.zip
I got: transactions2020.csv.zip

Once we have the list of files to be retrieved, we iterate over files, this time processing each element with the pd.read_csv() method.

import pandas as pd
from IPython.display import display, HTML

## for brevity we display only the first two rows
## and first two columns of each file
for file in files:
    fileURL = file['download_url']
    print("I am processing the file:", fileURL)
    print('\n')
    internalfile = pd.read_csv(fileURL)
    display(HTML(internalfile.iloc[:2, :2].to_html()))
    print('\n')
I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_First_Batch.csv

Unnamed: 0 article_id
0 52188 700817001
1 52956 702802001


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_Second_Batch.csv

Unnamed: 0 article_id
0 2323 402382009
1 29553 625653029


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/Articles_Third_Batch.csv

Unnamed: 0 article_id
0 74841 778515002
1 100266 889379009


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/HealthcareDataset_PublicRelease.csv

Id MemberName
0 634363 e659f3f4
1 634364 e659f3f4


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/PharmaDataSet.csv.zip

TransactionId Date
0 172865 2019-01-01 10:47:41
1 172868 2019-01-01 10:47:41


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/articles.csv.zip

article_id product_code
0 108775015 108775
1 108775044 108775


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/customers.csv.zip

customer_id FN
0 00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657 NaN
1 0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa NaN


I am processing the file: https://raw.githubusercontent.com/thousandoaks/Python4DS-I/main/datasets/transactions2020.csv.zip

t_dat customer_id
0 2020-06-01 00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec5e0fb21b12f862f73d
1 2020-06-01 000b31552d3785c79833262bbeefa484cbc43d7b612b3c8fc696260b5afaadc4

15.3 Writing Data using Pandas

Writing files using Pandas is straightforward. The following code uses DataFrame.to_csv() to write a DataFrame to a local file in CSV format.

import pandas as pd

df.to_csv("./california_housing_train.csv", sep=",")
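
Note that by default to_csv() also writes the DataFrame index as an extra, unnamed first column; this is why re-reading such files often yields an “Unnamed: 0” column, as seen in some of the outputs earlier in this chapter. Pass index=False to write the data columns only:

# write the data columns only, without the row index
df.to_csv("./california_housing_train.csv", sep=",", index=False)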

Pandas provides several methods you can use to write your datasets; refer to the following table.

Method Description File Format
DataFrame.to_csv() Write DataFrame to a comma-separated values file CSV
DataFrame.to_excel() Write DataFrame to an Excel file Excel (XLSX, XLS)
DataFrame.to_json() Write DataFrame to a JSON file JSON
DataFrame.to_html() Write DataFrame to an HTML file HTML
DataFrame.to_sql() Write DataFrame to a SQL database table SQL Database
DataFrame.to_pickle() Serialize DataFrame to a pickle file Pickle
DataFrame.to_hdf() Write DataFrame to an HDF5 file HDF5
DataFrame.to_feather() Write DataFrame to a Feather file Feather
DataFrame.to_parquet() Write DataFrame to a Parquet file Parquet
DataFrame.to_stata() Write DataFrame to a Stata file Stata (DTA)
DataFrame.to_records() Write DataFrame to a numpy record array NumPy Record Array
DataFrame.to_latex() Write DataFrame to a LaTeX tabular environment LaTeX
DataFrame.to_markdown() Write DataFrame to a markdown-formatted string Markdown

Example (Pickle Files)

The CSV format is quite common for exchanging data on the internet given its simplicity and the fact that it is human-readable. However, it is not ideal if you need to handle large volumes of data. In these situations it is advisable to use other formats that are more efficient and robust.

The following code writes an existing DataFrame df to localfile.pkl. The .pkl extension denotes the pickle format.

import pandas as pd

df.to_pickle("./localfile.pkl")
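
The file can later be restored, with all column types intact, using the matching reader. One caveat: only load pickle files from sources you trust, since unpickling can execute arbitrary code.

# read the pickled DataFrame back; dtypes such as datetime64 survive the round trip
df = pd.read_pickle("./localfile.pkl")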

Example (SQL table)

It is possible to write information to SQL-based databases using the DataFrame.to_sql() method.

Say you are granted access to a local database called ‘sales.db’. The following code (1) opens a connection to the database and then (2) writes the DataFrame df into the table sales.

import pandas as pd
import sqlite3

# Connect to SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('sales.db')

# Write DataFrame to SQL table, replacing the table if it already exists
df.to_sql('sales', con=conn, if_exists='replace', index=False)
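
A quick way to confirm the write succeeded is to read the table straight back over the same connection, then close it:

# read the table back to verify the write
check = pd.read_sql('SELECT * FROM sales', conn)
print(check.head(3))

conn.close()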

15.4 Conclusion

This chapter explores reading and writing data in Python, emphasizing the efficiency and versatility of using Pandas for these operations. Reading data is made easy with methods like pd.read_csv() and pd.read_excel(), which load external files into Pandas DataFrames for analysis. The chapter discusses handling common issues such as inconsistent data, missing fields, and complex formats using arguments like sep, parse_dates, and quotechar. Writing data is equally straightforward with methods like df.to_csv() and df.to_excel(), enabling the export of processed data for storage or sharing. The chapter also introduces alternatives like the requests library for HTTP-based data retrieval and APIs for dynamic data acquisition.

In the next chapter you will learn how to efficiently process time series data using Pandas.

15.5 Further Reading

For those of you in need of additional, more advanced topics, please refer to the following references: