1. Code
  2. Python

Pandas: The Swiss Army Knife for Your Data, Part 1

Scroll to top
5 min read
This post is part of a series called Pandas: The Swiss Army Knife for Your Data.

Pandas is an amazing data analysis toolkit for Python. It is designed to operate on relational or labeled data and gives you tools to slice and dice as you please. 

In this two-part tutorial, you'll learn about the fundamental data structures of Pandas: the series and the data frame. You'll also learn how to select data, deal with missing values, manipulate your data, merge your data, group your data, work with time series, and even plot data.

Installation

To install, just pip install pandas. It will take care of installing numpy too if you don't have it installed.

Series

Pandas series are typed and labeled 1-D arrays. This means that each element can be accessed by its label in addition to its index.

Here is a series of integers where the labels are Roman numerals. You can index and slice using the labels or integer indices. Unlike with regular Python list slicing, when using labels the last item is included!

1
>>> s = pd.Series(np.arange(1,5), ['I', 'II', 'III', 'IV', 'V'])
2
>>> s['III']
3
3
4
>>> s[0]
5
1
6
>>> s['II':'V']
7
II     2
8
III    3
9
IV     4
10
V      5
11
12
>>> s[1:5]
13
II     2
14
III    3
15
IV     4
16
V      5

If you don't provide an index then a 0-based integer index is automatically created:

1
>>> s = pd.Series((50, 7, 88, 9))
2
>>> s
3
0    50
4
1     7
5
2    88
6
3     9

Now, here is a little secret for you. Pandas series are a wrapper around Numpy's arrays.

1
>>> s.values
2
array([50,  7, 88,  9])
3
>>> type(s.values)
4
<class 'numpy.ndarray'>

Unlike Python lists or numpy arrays, operations on series align on the index. If the indexes don't match then the union of indices will be used with missing values as appropriate. Here are a few examples using dicts as data so the keys become the series index:

1
>>> s1 = pd.Series(dict(a=1, b=2, c=3))
2
>>> s2 = pd.Series(dict(a=4, b=5, c=6, d=7))
3
>>> s1 + s2
4
a    5.0
5
b    7.0
6
c    9.0
7
d    NaN
8
9
>>> s1[1:] * s2[:-1]
10
a     NaN
11
b    10.0
12
c    18.0

Data Frames

Data frames are the primary pandas data structure. They represent tables of data where each column is a series. Data frames have an index too, which serves as a row label. A data frame also has column labels. Here is how to declare a data frame using a dict. 

1
>>> df = pd.DataFrame(dict(a=[1, 2, 3], 
2
                           b=[4,5,6], 
3
                           c=pd.Timestamp('20170902'),
4
                           d=pd.Categorical(['red', 
5
                                             'green', 
6
                                             'blue'])))
7
                           
8
>>> df
9
   a  b          c      d
10
0  1  4 2017-09-02    red
11
1  2  5 2017-09-02  green
12
2  3  6 2017-09-02   blue

Note that an integer index (row label) was created automatically. You can of course provide your own index:

1
>>> df.index = ('I II III'.split())
2
>>> df
3
     a  b          c      d
4
I    1  4 2017-09-02    red
5
II   2  5 2017-09-02  green
6
III  3  6 2017-09-02   blue

Importing and Exporting Data

Data frames can be constructed from a very wide variety of sources:

  • dict of 1-D ndarrays, lists, dicts, or series
  • 2-D numpy.ndarray
  • structured or record ndarray
  • another DataFrame

You can also import or load data from many file formats and databases such as:

  • CSV
  • Excel
  • HTML
  • HDFStore 
  • SQL

Here is how to read a CSV file:

1
data.csv
2
--------
3
I,1,4,2017-09-02,red
4
II,2,5,2017-09-02,green
5
III,3,6,2017-09-02,blue
6
7
>>> pd.read_csv('data.csv')
8
     I  1  4  2017-09-02    red
9
0   II  2  5  2017-09-02  green
10
1  III  3  6  2017-09-02   blue

Here is the complete list of read_functions():

1
>>> read_functions = [a for a in dir(pd) if a.startswith('read_')]
2
>>> print('\n'.join(read_functions))
3
read_clipboard
4
read_csv
5
read_excel
6
read_feather
7
read_fwf
8
read_gbq
9
read_hdf
10
read_html
11
read_json
12
read_msgpack
13
read_pickle
14
read_sas
15
read_sql
16
read_sql_query
17
read_sql_table
18
read_stata
19
read_table

There are corresponding methods on the data frame object itself for exporting the data to many formats and databases. Here is how you export to json and msgpack:

1
>>> df.to_json()
2
'{"a":{"I":1,"II":2,"III":3},
3
  "b":{"I":4,"II":5,"III":6},
4
  "c":{"I":1504310400000,"II":1504310400000,"III":1504310400000},
5
  "d":{"I":"red","II":"green","III":"blue"}}'
6
  
7
>>> df.to_msgpack()  
8
b'\x84\xa3typ\xadblock_manager\xa5klass\xa9DataFrame\xa4axes
9
\x92\x86\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype
10
\xa6object\xa4data\x94\xa1a\xa1b\xa1c\xa1d\xa8compress\xc0\x86
11
\xa3typ\xa5index\xa5klass\xa5Index\xa4name\xc0\xa5dtype
12
\xa6object\xa4data\x93\xa1I\xa2II\xa3III\xa8compress\xc0
13
\xa6blocks\x93\x86\xa4locs\x86\xa3typ\xa7ndarray\xa5shape\x91
14
\x02\xa4ndim\x01\xa5dtype\xa5int64\xa4data\xd8\x00\x00\x00\x00
15
\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa8compress
16
\xc0\xa6values\xc70\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00
17
\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04
18
\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00
19
\x06\x00\x00\x00\x00\x00\x00\x00\xa5shape\x92\x02\x03\xa5dtype
20
\xa5int64\xa5klass\xa8IntBlock\xa8compress\xc0\x86\xa4locs\x86
21
\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4ndim\x01\xa5dtype
22
\xa5int64\xa4data\xd7\x00\x02\x00\x00\x00\x00\x00\x00\x00
23
\xa8compress\xc0\xa6values\xc7\x18\x00\x00\x00\xed\xafVb\xe0
24
\x14\x00\x00\xed\xafVb\xe0\x14\x00\x00\xed\xafVb\xe0\x14
25
\xa5shape\x92\x01\x03\xa5dtype\xaedatetime64[ns]\xa5klass
26
\xadDatetimeBlock\xa8compress\xc0\x86\xa4locs\x86\xa3typ
27
\xa7ndarray\xa5shape\x91\x01\xa4ndim\x01\xa5dtype\xa5int64
28
\xa4data\xd7\x00\x03\x00\x00\x00\x00\x00\x00\x00\xa8compress
29
\xc0\xa6values\x87\xa3typ\xa8category\xa5klass\xabCategorical
30
\xa4name\xc0\xa5codes\x86\xa3typ\xa7ndarray\xa5shape\x91\x03
31
\xa4ndim\x01\xa5dtype\xa4int8\xa4data\xc7\x03\x00\x02\x01\x00
32
\xa8compress\xc0\xaacategories\x86\xa3typ\xa5index\xa5klass
33
\xa5Index\xa4name\xc0\xa5dtype\xa6object\xa4data\x93\xa4blue
34
\xa5green\xa3red\xa8compress\xc0\xa7ordered\xc2\xa8compress
35
\xc0\xa5shape\x91\x03\xa5dtype\xa8category\xa5klass
36
\xb0CategoricalBlock\xa8compress\xc0'

Metadata and Stats

Pandas gives a lot of information about data frames. Check out these methods:

1
>>> df.index
2
Index(['I', 'II', 'III'], dtype='object')
3
>>> df.columns
4
Index(['a', 'b', 'c', 'd'], dtype='object')
5
>>> df.describe()
6
         a    b
7
count  3.0  3.0
8
mean   2.0  5.0
9
std    1.0  1.0
10
min    1.0  4.0
11
25%    1.5  4.5
12
50%    2.0  5.0
13
75%    2.5  5.5
14
max    3.0  6.

Selecting Data

Data frames let you select data. If you want to select a row by index, you need to use the loc attribute. To select columns, you simply use the column name. Here is how to select individual rows, individual columns, a slice of rows, a slice of columns, and last but not least, a rectangular section (subset of rows and subset of columns from these rows):

1
Single row
2
----------
3
>>> df.loc['II']
4
a                      2
5
b                      5
6
c    2017-09-02 00:00:00
7
d                  green
8
9
Multiple rows using integer index (no 'loc')
10
--------------------------------------------
11
>>> df[:2]
12
    a  b          c      d
13
I   1  4 2017-09-02    red
14
II  2  5 2017-09-02  green
15
16
Single column
17
-------------
18
>>> df['b']
19
I      4
20
II     5
21
III    6
22
23
24
Multiple columns
25
----------------
26
>>> df.loc[:, 'b':'c']
27
     b          c
28
I    4 2017-09-02
29
II   5 2017-09-02
30
III  6 2017-09-02
31
32
Rectangular section
33
-------------------
34
>>> df.loc[:'II', 'b':'c']
35
    b          c
36
I   4 2017-09-02
37
II  5 2017-09-02
38
39
40
Using integer index (when actual index is not integer)
41
------------------------------------------------------
42
>>> df.iloc[:2, 1:3]
43
    b          c
44
I   4 2017-09-02
45
II  5 2017-09-02

In addition to those direct addressing data selections, you can also select based on values. For example, you can select only rows with even values in column b: 

1
>>> df[df.b % 2 == 0]
2
     a  b          c     d
3
I    1  4 2017-09-02   red
4
III  3  6 2017-09-02  blue

Sorting Data

Pandas gives you sorting too. Let's sort the following data frame by index (rows) and by column. Multiple-level indexing is supported too:

1
index=['one', 'two', 'three', 'four', 'five']
2
df = pd.DataFrame(np.random.randn(5,2),
3
                  index=index, 
4
                  columns=['a','b'])
5
6
Sort by index (alphabetically and descending)
7
---------------------------------------------
8
>>> df.sort_index(ascending=False)
9
              a         b
10
two   -0.689523  1.411403
11
three  0.332707  0.307561
12
one   -0.042172  0.374922
13
four   0.426519 -0.425181
14
five  -0.161095 -0.849932
15
16
Sort by column
17
--------------
18
>>> df.sort_values(by='a')
19
              a         b
20
two   -0.689523  1.411403
21
five  -0.161095 -0.849932
22
one   -0.042172  0.374922
23
three  0.332707  0.307561
24
four   0.426519 -0.425181

Conclusion

In this part of the tutorial, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. In part two, we'll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting. Stay tuned.

In the meantime, don’t hesitate to see what we have available for sale and for study in the marketplace, and don't hesitate to ask any questions and provide your valuable feedback using the feed below.