Adventures in Python: Transposing data

Share on:

I deal with data all the time. One of my pet peeves is Excel spreadsheets in which the person responsible for creating it has built it in such a way that you cannot pivot the information without restructuring the content. So to remedy this situation I created a quick and dirty tool to perform a specific task (transposing CSV files).

The tool is named Hydra, as every time I see a spreadsheet like this, I immediately freeze while my mind drifts off to consider how much time I will need to waste reformatting the information. Honestly, this task is really dull and I wanted a simple tool to do this, so I did not have to do it over and over again.

To express the problem, I am commonly faced with excel files that have been created to record large volumes of data. The issue is that due to the format selected to store the data in, it often makes it difficult to pivot, without first performing some Jitsu (data wrangling) on the data. For example, consider the following dataset

Surname FirstName Discipline 28-May-12 04-Jun-12 11-Jun-12
Allan Catherine Manufacturing 7 7 7
Babbcock Neil Design 2 2 2
Cook David Engineering 5 3 4
Dumas Sue IT 7 2 7
Marks Jan Manufacturing 1 3 2
Polly Paul Manufacturing 0 3 7
Salmon Christopher IT 5 0 0
Thatch Andy Engineering 2 5 0
Whedon Kelly Design 5 7 2

While the information presented make logical sense, in my case, this format is problematic. Ideally, the last three date columns should be consolidated to give just a single date, enabling me to run a quick pivot on results, with a new structure similar to the revised example table below.

Surname FirstName Discipline Date Hours
Allan Catherine Manufacturing 28-May-12 7
Allan Catherine Manufacturing 04-Jun-12 7
Allan Catherine Manufacturing 11-Jun-12 7
Babbcock Neil Design 28-May-12 2

However to achieve this outcome is a lot of cutting and pasting effort. The key difficulty here is I want to keep some of the original data and make a new consolidated column. The new column reflects the original data but in a more accessible format.

Thinking about this problem and the frequency with which I am faced with this task, lead me to the conclusion I should invest some effort on writing some quick code that could automate this process. I selected Python as it is something that leads itself to quick programming and has routines that are very good at handling data.

The Python code essentially does a number of things, but here are the key factors.

  • Read information in from a CSV file
  • Indicate the number of columns in the data read in
  • Determine which columns can remain
  • Perform a quick calculation to establish how many columns will need to be transposed
  • Loop through the data, taking each column to transpose and appending this as a new record (complete with static columns)
  • Output a transposed CSV file

Lazy nirvana achieved. The code is very much beta, meaning at the moment I do not see any need to industrialise it. However, it works and means I no longer have to turn to stone when I see a large data set in Excel.

Repository Link: Hydra