Cleaning out the Pandas

Cat or a panda? @imgur

Pandas 🐼 are notoriously dirty creatures (you just have to look at their browsing history). So when we use the python dataframe library, pandas, we find the data is a bit dirty.

Here’s some stuff we might find, with some fixes.

Whitespace in my headers

If we’re reading in a csv file, and the data looks a bit like this:

1
2
Kitten Name, Kitten Colour, Favourite Biscuit
...

It might seem we can read in the data and use it like this:

1
2
3
4
import pandas as pd

df = pd.read_csv("data/secret_kitten_biscuits.csv")
df['Favourite Biscuit']

But we can’t. Why not? Well, the read_csv function is pretty basic, and defaults to ‘,’ as the deliminator. This means our header name isn’t ‘Favourite Biscuit’, but it’s ‘ Favourite Biscuit’.
We need to strip out those leading and trailing whitespaces. There is a function for that: str.strip.

There’s a cute way to apply a function across all of the column names:

df.columns = df.columns.map(str.strip)

I don’t trust nulls

Perhaps the data you have has a different character to mean data not available. Perhaps you have a mix of data, no data and special null data. I am speaking from experience.

Here’s some data you might read in:

1
secret sauce,,,23, - ,,

Here want ‘ - ‘ to also mean null. This means we can’t redefine null at read time, but we can add an extra step to handle these special values as nulls:

1
2
3
4
import numpy as np

df = pd.read_csv("data/seckret_sauces.csv")
df = df.replace(' - ', np.nan)

My numbers are strings

I have no idea how data gets this way naturally, but it does, and let’s suppose you are reading in a file that looks like this:

"penguins","123,345,123",345,"345,321",431332

You have a weird mix of numbers in strings, non-strings and non-numbers, but you actually want all the numbers to be treated the same so you can do like calculations on them and stuff. So how do we do that?

Well you don’t want to do anything to the first column because that’s definitely not a number, and you want to yank out those commas because they are for humans and not computers, and then you want to cast everything to a number. It’s easier in python that in English:

df.iloc[:,1:].replace(',','',regex=True).apply(pd.to_numeric)

Conclusion

pandas is pretty awesome and with a bit of preprocessing and functional programming you can bash the data into the shapes you want it to be. You can do processing at the read_csv stage with the converters function if you know what columns you are getting and what you want to do with them. Sometimes though, you just know you are getting a glorious data file full of little monsters you need to handle somehow, and with a series of transformation steps you can absorb almost anything.