Pandas Python

Accessing Data in Structured Flat-File Form

In many cases, the data you need to work with won’t come packaged within a library, as the toy datasets in the Scikit-learn library do. Real-world data usually appears in a file of some type, and a flat file presents the easiest kind of file to work with. In a flat file, the data appears as a simple list of entries that you can read one at a time, if desired, into memory. Depending on the requirements for your project, you can read all or part of the file.
A problem with using native Python techniques is that the input isn’t intelligent. For example, when a file contains a header, Python simply reads it as yet more data to process, rather than as a header. You can’t easily select a particular column of data. The pandas library used in the sections that follow makes it much easier to read and understand flat-file data. Classes and methods in the pandas library interpret (parse) the flat-file data to make it easier to manipulate.

The least formatted and therefore easiest-to-read flat-file format is the text file. However, a text file also treats all data as strings, so you often have to convert numeric data into other forms. A comma-separated value (CSV) file provides more formatting and more information, but it requires a little more effort to read. At the high end of flat-file formatting are custom data formats, such as an Excel file, which contains extensive formatting and could include multiple datasets in a single file.

The following sections describe these three levels of flat-file dataset and show how to use them. These sections assume that the file structures the data in some way. For example, the CSV file uses commas to separate data fields. A text file might rely on tabs to separate data fields. An Excel file uses a complex method to separate data fields and to provide a wealth of information about each field. You can work with unstructured data as well, but working with structured data is much easier because you know where each field begins and ends.

Reading from a text file

Text files can use a variety of storage formats. However, a common format is to have a header line that documents the purpose of each field, followed by another line for each record in the file. The file separates the fields using tabs. Refer to the “Streaming large amounts of data into memory” section, earlier in this chapter, for an example of the Colors.txt file used for the example in this section.
Native Python provides a wide variety of methods you can use to read such a file. However, it’s far easier to let someone else do the work. In this case, you can use the pandas library to perform the task. Within the pandas library, you find a set of parsers, or code used to read individual bits of data and determine the purpose of each bit according to the format of the entire file. Using the correct parser is essential if you want to make sense of file content. In this case, you use the read_table() method to accomplish the task, as shown in the following code:

import pandas as pd
color_table = pd.read_table("Colors.txt")
print(color_table)

The code imports the pandas library, uses the read_table() method to read Colors.txt into a variable named color_table, and then displays the resulting memory data onscreen using the print function. Here’s the output you can expect to see from this example.

    Color  Value
0     Red      1
1  Orange      2
2  Yellow      3
3   Green      4
4    Blue      5
5  Purple      6
6   Black      7
7   White      8

Notice that the parser correctly interprets the first row as consisting of field names. It numbers the records from 0 through 7. Using read_table() method arguments, you can adjust how the parser interprets the input file, but the default settings usually work best. You can read more about the read_table() arguments at https://pandas.pydata.org/docs/reference/api/pandas.read_table.html.
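Even though the defaults usually suffice, it can help to see the parser’s assumptions spelled out. The following sketch writes a small tab-separated sample file first (a stand-in for Colors.txt, so the example runs on its own) and then names the `sep` and `header` arguments that read_table() normally infers:

```python
import pandas as pd

# Create a small tab-separated sample standing in for Colors.txt.
with open("Colors_sample.txt", "w") as f:
    f.write("Color\tValue\n")
    f.write("Red\t1\n")
    f.write("Orange\t2\n")

# sep='\t' names the field delimiter; header=0 tells the parser
# that the first row contains field names rather than data.
color_table = pd.read_table("Colors_sample.txt", sep="\t", header=0)
print(color_table)
```

If your file used a different delimiter, such as a semicolon, changing `sep` is all it would take to parse it correctly.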

Reading CSV delimited format

A CSV file provides more formatting than a simple text file. In fact, CSV files can become quite complicated. There is a standard that defines the format of CSV files, and you can see it at https://tools.ietf.org/html/rfc4180. The CSV file used for this example is quite simple:

  • A header defines each of the fields
  • Fields are separated by commas
  • Records are separated by linefeeds
  • Strings are enclosed in double quotes
  • Integers and real numbers appear without double quotes

The figure below shows the raw format for the Titanic.csv file used for this example. You can see the raw format using any text editor.

The raw format of a CSV file is still text and quite readable.
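Put together, the rules above produce a file along these lines (illustrative rows only, not the actual contents of Titanic.csv):

```csv
"name","sex","age"
"Smith, Mr. John","male",29.0000
"Jones, Miss. Mary","female",0.9167
"Brown, Master. Tom","male",2.0000
```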

Applications such as Excel can import and format CSV files so that they become easier to read. Figure 6-3 shows the same file in Excel. Excel actually recognizes the header as a header. If you were to use features such as data sorting, you could select header columns to obtain the desired result. Fortunately, pandas also makes it possible to work with the CSV file as formatted data, as shown in the following example:

import pandas as pd
titanic = pd.read_csv("Titanic.csv")
X = titanic[['age']]
print(X)
Use an application such as Excel to create a formatted CSV presentation.

Notice that the parser of choice this time is read_csv(), which understands CSV files and provides you with new options for working with them. (You can read more about this parser at https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.)

Selecting a specific field is quite easy — you just supply the field name as shown. The output from this example looks like this (some values omitted for the sake of space):

           age
0      29.0000
1       0.9167
2       2.0000
3      30.0000
4      25.0000
...        ...
1304   14.5000
1305 9999.0000
1306   26.5000
1307   27.0000
1308   29.0000

[1309 rows x 1 columns]

Of course, a human-readable output like this one is nice when working through an example, but you may also need the output as a list. To create the output as a list, you simply change the third line of code to read X = titanic[['age']].values. Notice the addition of the values property. The output changes to something like this (some values omitted for the sake of space):

[[29.        ]
 [ 0.91670001]
 [ 2.        ]
 ...
 [26.5       ]
 [27.        ]
 [29.        ]]
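Note that the double brackets in titanic[['age']] select a one-column DataFrame, so values hands you a two-dimensional array. If you want a flat, one-dimensional array instead, select the column with single brackets. A minimal sketch (it builds a tiny stand-in for Titanic.csv so that it runs on its own):

```python
import pandas as pd

# A tiny stand-in for the age column of Titanic.csv.
with open("Ages_sample.csv", "w") as f:
    f.write("age\n29.0\n0.9167\n2.0\n")

titanic = pd.read_csv("Ages_sample.csv")

X2d = titanic[['age']].values  # double brackets: 2-D array, shape (3, 1)
X1d = titanic['age'].values    # single brackets: 1-D array, shape (3,)
print(X2d.shape, X1d.shape)
```

Which shape you want depends on the consumer: Scikit-learn estimators typically expect the 2-D form for features, while plotting functions often prefer the 1-D form.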

Reading Excel and other Microsoft Office files

Excel and other Microsoft Office applications provide highly formatted content. You can specify every aspect of the information these files contain. The Values.xls file used for this example provides a listing of sine, cosine, and tangent values for a random list of angles. You can see this file in the figure below.

An Excel file is highly formatted and might contain information of various types.

When you work with Excel or other Microsoft Office products, you begin to experience some complexity. For example, an Excel file can contain more than one worksheet, so you need to tell pandas which worksheet to process. In fact, you can choose to process multiple worksheets, if desired. When working with other Office products, you have to be specific about what to process. Just telling pandas to process something isn’t good enough. Here’s an example of working with the Values.xls file.

import pandas as pd
xls = pd.ExcelFile("Values.xls")
trig_values = xls.parse('Sheet1', index_col=None,
 na_values=['NA'])
print(trig_values)

Note that you may have to install the xlrd library to read the .xls file. The downloadable source contains a special line, !pip install xlrd, to perform this task.

The code begins by importing the pandas library as normal. It then creates a pointer to the Excel file using the ExcelFile() constructor. This pointer, xls, lets you access a worksheet, define an index column, and specify how to present empty values. The index column is the one that the worksheet uses to index the records. Using a value of None means that pandas should generate an index for you. The parse() method obtains the values you request.

You don’t absolutely have to use the two-step process of obtaining a file pointer and then parsing the content. You can also perform the task using a single step like this: trig_values = pd.read_excel("Values.xls", 'Sheet1', index_col=None, na_values=['NA']). Because Excel files are more complex, using the two-step process is often more convenient and efficient because you don’t have to reopen the file for each read of the data.
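Here’s a self-contained sketch of both approaches side by side. It writes a small stand-in workbook first so that it runs on its own; note that the .xlsx format used here requires the openpyxl package rather than xlrd:

```python
import pandas as pd

# Build a small workbook standing in for Values.xls
# (the .xlsx format requires the openpyxl package).
df = pd.DataFrame({'Angle': [0, 30, 45], 'Sine': [0.0, 0.5, 0.7071]})
df.to_excel("Values_sample.xlsx", sheet_name="Sheet1", index=False)

# Two-step process: open the file once, then parse as often as needed.
xls = pd.ExcelFile("Values_sample.xlsx")
trig_values = xls.parse("Sheet1", index_col=None, na_values=["NA"])

# Single step: open, parse, and close in one call.
trig_once = pd.read_excel("Values_sample.xlsx", sheet_name="Sheet1",
                          index_col=None, na_values=["NA"])
print(trig_values.equals(trig_once))
```

Both calls produce identical DataFrames; the two-step form simply keeps the file open so you can parse additional worksheets without reopening it.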
