Pandas

Pandas is a Python module that provides features somewhat akin to a spreadsheet or database and meshes very naturally with both NumPy and Matplotlib. Before we can explore it, we need to import this module.

Installation

If you enter import pandas as pd in a cell in a Jupyter notebook and Python doesn’t complain, pandas is already installed. If not, installing is easy. If you are running jupyter notebook having installed with pip, enter

pip install pandas

If you are using Anaconda, use

conda install pandas

Creating a pandas DataFrame

The easiest way to create a DataFrame is from a dictionary of arrays, which all must have the same length. Here’s a simple example.

fibo = [1, 1, 2, 3, 5, 8, 13, 21]
prime = [2, 3, 5, 7, 11, 13, 17, 19]
nifty = [0., np.euler_gamma, np.sqrt(np.pi)/2, 1., np.sqrt(np.pi), 2.0, np.e, np.pi]
df = pd.DataFrame(dict(Fibonacci=fibo, Prime=prime, Nifty=nifty))

If we ask Jupyter to display the DataFrame by submitting df, this is what we get:

In combination with Jupyter, pandas generates a nice looking table, with column heads you can use for sorting. In this case, everything is sorted in ascending order, so the sort buttons aren’t particularly useful. In general, however, they can be quite useful. Also in this case, I have not set an index array, so Pandas uses the default integer vector starting from zero. Let’s change the index of this DataFrame to use the square root of integers:

df.index = np.sqrt(range(8))
df

which generates the output

Plotting data in a pandas DataFrame

Plotting a pandas data column against the index is straightforward. Since I would like a plot of discrete points, I will add the keyword argument style and use it to specify blue squares:

df.plot(y='Prime', style='bs')

If I would rather plot one column against another, I can specify both x and y values. I’ll gussy up some other things, too.

df.plot(x='Fibonacci', y='Prime', kind='scatter', title='Ridiculous')

As usual, you can get lots more information about a command by asking for help:

help(df.plot)

Filtering data

Suppose we wanted to work only with the data in the table for which the Fibonacci number was even. We can construct a new subset table with the following command

df[df['Fibonacci'] % 2 == 0]

which yields

What’s going on here? The interior expression, df['Fibonacci'] % 2 == 0 produces a boolean array of values, one for each row in the DataFrame. This array is then used to index into df, yielding only those rows corresponding to True values in the array.

You can access individual columns either with a string index, as in the previous example, or as a data member field of the same name:

df.Fibonacci

000000     1
000000     1
414214     2
732051     3
000000     5
236068     8
449490    13
645751    21
Name: Fibonacci, dtype: int64

Example

On the Matplotlib introduction page, we developed the following figure.

Now our resonance cup runneth over in style!

I’m going to recompute the values plotted in this figure and store them in a pandas DataFrame for easy display.

# First make a dictionary of vectors
values = {z:myfunc(logx, z) for z in (0.05, 0.1, 0.2, 0.5, 1, 2)}
# Now create a pandas DataFrame to store them
df = pd.DataFrame(values, index=logx)
df

Displaying a pandas DataFrame.

You can access individual columns using their name:

   df[0.2]
   df.plot(logx=True, logy=True, xlabel="$x$", ylabel="$y$");

With the curves in a pandas DataFrame, the plot command takes a single line.

Notice how the value of \(\zeta\) doesn’t affect the frequencies either significantly below the natural frequency (\(x = 1\)) nor significantly above it, where all the curves fall off in the same way with increasing frequency. Also note how the plotting operation could be achieved with a single call, with the various properties adjusted with keyword arguments.

Practice with pandas and matplotlib

I have placed some experimental data with uncertainties at https://www.physics.hmc.edu/courses/p134/CircularMoore2004.txt. The \(x\) axis variable is the position of the detector (in mm); the \(y\) axis variable is the observed light intensity (in volts); the \(y\) uncertainty values are in the same units.

Prepare a plot of these data using discrete points for the data, error bars set according to the uncertainties, and see if you can style your graph exactly like the one shown below. Feel free to consult the matplotlib documentation as liberally as you like!

Target practice: Can you make your plot look exactly like this one?

Hints

You can load the data into a convenient data structure defined by pandas with some code like

import pandas as pd
the_data = pd.read_table('filename_or_URL')

You can then see a table of the data by entering the_data in a Jupyter cell. You can access individual columns of the data by name: the_data['x'] or the_data.x.

The command to plot with error bars is ax.errorbar(xvals, yvals, yerrors) and you can add lots of keyword arguments to make things look the way you wish. To make a pleasant-looking plot, I recommend scaling the errors in Moore’s data by a factor of 50 (yeah, he was good!). The data in the \(x\) column is the position of the detector in millimeters, while the data in the columns labeled I and I_err are the intensity and intensity errors, with values in volts.
Potentially interesting keywords:
- marker
- fmt
- markersize
- capsize