Data Visualization Series: Introduction

Since I moved here, I have been thinking about starting a series of blog posts on data visualization – design, implementation, libraries, data etc. Now I’m finally ready to start the game.

The Data

Throughout the series, I will be working with my favourite data: statistics from NHL games. All data will be retrieved from’s game reports like this. There are a couple of reasons why I want to use these reports: 1) it is easy to retrieve all the URLs for a given season, and 2) it is easy to scrape whatever data I want out of them, since they are quite uniform. The table cells are not well named, however, so there is some manual cleaning and defining to do.
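To give an idea of what scraping unnamed table cells and then "defining" them by hand could look like, here is a minimal sketch in Python using only the standard library. The sample HTML, the column names in GOAL_COLUMNS and the function parse_goal_rows are all made up for illustration; they are not the layout of the actual game reports.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace so " Player  One " becomes "Player One".
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hypothetical column names, assigned by position because the cells
# themselves carry no useful names.
GOAL_COLUMNS = ["period", "time", "team", "scorer"]


def parse_goal_rows(html):
    """Return one dict per table row that has exactly the expected cell count."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [
        dict(zip(GOAL_COLUMNS, row))
        for row in parser.rows
        if len(row) == len(GOAL_COLUMNS)
    ]


# Made-up sample input, not taken from a real report.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>04:12</td><td>AAA</td><td> Player  One </td></tr>
  <tr><td>2</td><td>17:45</td><td>BBB</td><td>Player Two</td></tr>
</table>
"""

goals = parse_goal_rows(SAMPLE_HTML)
```

Because the reports are quite uniform, a positional mapping like this is probably enough to start with; rows that do not match the expected number of cells are simply skipped and can be inspected by hand.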


From each game report, we can retrieve information on goals, officials, 3 Stars awards, goalie performance and even the actual lengths of penalties instead of just the number of penalty minutes given.
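The difference between the penalty minutes given and the actual length of a penalty can be sketched like this. The Penalty class, its field names and the 'MM:SS' elapsed-time format are my assumptions for illustration only, not the format used in the reports.

```python
from dataclasses import dataclass


def clock_to_seconds(clock):
    """Convert an 'MM:SS' clock string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass
class Penalty:
    # All field names are hypothetical.
    minutes_given: int  # penalty minutes assessed
    start: str          # 'MM:SS' elapsed game time when the penalty began
    end: str            # 'MM:SS' elapsed game time when the penalty ended

    @property
    def served_seconds(self):
        """Actual length of the penalty in seconds."""
        return clock_to_seconds(self.end) - clock_to_seconds(self.start)

    @property
    def ended_early(self):
        """True if less time was served than was assessed."""
        return self.served_seconds < self.minutes_given * 60


# Made-up examples: one penalty cut short, one served in full.
short_penalty = Penalty(minutes_given=2, start="10:00", end="10:47")
full_penalty = Penalty(minutes_given=2, start="25:30", end="27:30")
```

Here short_penalty.served_seconds is 47 even though two minutes were given, which is exactly the kind of detail that gets lost if only the penalty minutes are recorded.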

The Tools

One of the motivations behind this series is to familiarize myself with different tools and techniques for scraping, manipulating and visualizing data. Therefore I will be using R, Python and JavaScript for all these steps. For visualization, I intend to use ggplot2 (R), Pygal and matplotlib (Python) and D3.js (JavaScript).

In Python, I’m not yet sure which library I’m going to use. I’ve used a forked version of Pygal which is more advanced than the original, and I have only done some exploratory data analysis with matplotlib, so this will also be a great way to learn the strengths of those libraries.

In addition to just showing what the libraries can do, I am interested in what kind of default settings they have (plotting something without touching any styles), how easy they are to customize, whether they provide any interaction options and what kind of ideas they are based on.

These blog posts will also serve as background research for my bachelor’s thesis in case I end up writing it.

The Syllabus

I have not yet fully planned this to the end, but I have some idea of how things are going to roll out.

  1. Data parsing and cleaning
    1. Python
    2. R
    3. JavaScript
    4. Conclusions
  2. Visualization
    1. Design and theory behind data visualizations
    2. ggplot2
    3. matplotlib
    4. Pygal
    5. D3.js
    6. Conclusions

All code created will be published under the MIT License on GitHub.

