Adventures in Python: Transposing data
Introduction
I deal with data all the time. One of my pet peeves is Excel spreadsheets in which the person responsible for creating it has built it in such a way that you cannot pivot the information without restructuring the content. So to remedy this situation I created a quick and dirty tool to perform a specific task (transposing CSV files).
The tool is named Hydra, as every time I see a spreadsheet like this, I immediately freeze while my mind drifts off to consider how much time I will need to waste reformatting the information. Honestly, this task is really dull and I wanted a simple tool to do this, so I did not have to do it over and over again.
The problem
To express the problem, I am commonly faced with excel files that have been created to record large volumes of data. The issue is that due to the format selected to store the data in, it often makes it difficult to pivot, without first performing some Jitsu (data wrangling) on the data. For example, consider the following dataset
Surname | FirstName | Discipline | 28-May-12 | 04-Jun-12 | 11-Jun-12 |
---|---|---|---|---|---|
Allan | Catherine | Manufacturing | 7 | 7 | 7 |
Babbcock | Neil | Design | 2 | 2 | 2 |
Cook | David | Engineering | 5 | 3 | 4 |
Dumas | Sue | IT | 7 | 2 | 7 |
Marks | Jan | Manufacturing | 1 | 3 | 2 |
Polly | Paul | Manufacturing | 0 | 3 | 7 |
Salmon | Christopher | IT | 5 | 0 | 0 |
Thatch | Andy | Engineering | 2 | 5 | 0 |
Whedon | Kelly | Design | 5 | 7 | 2 |
While the information presented make logical sense, in my case, this format is problematic. Ideally, the last three date columns should be consolidated to give just a single date, enabling me to run a quick pivot on results, with a new structure similar to the revised example table below.
Surname | FirstName | Discipline | Date | Hours |
---|---|---|---|---|
Allan | Catherine | Manufacturing | 28-May-12 | 7 |
Allan | Catherine | Manufacturing | 04-Jun-12 | 7 |
Allan | Catherine | Manufacturing | 11-Jun-12 | 7 |
Babbcock | Neil | Design | 28-May-12 | 2 |
However to achieve this outcome is a lot of cutting and pasting effort. The key difficulty here is I want to keep some of the original data and make a new consolidated column. The new column reflects the original data but in a more accessible format.
The solution
Thinking about this problem and the frequency with which I am faced with this task, lead me to the conclusion I should invest some effort on writing some quick code that could automate this process. I selected Python as it is something that leads itself to quick programming and has routines that are very good at handling data.
The Python code essentially does a number of things, but here are the key factors.
- Read information in from a CSV file
- Indicate the number of columns in the data read in
- Determine which columns can remain
- Perform a quick calculation to establish how many columns will need to be transposed
- Loop through the data, taking each column to transpose and appending this as a new record (complete with static columns)
- Output a transposed CSV file
Lazy nirvana achieved. The code is very much beta, meaning at the moment I do not see any need to industrialise it. However, it works and means I no longer have to turn to stone when I see a large data set in Excel.
Repository Link: Hydra