PrimeTime - Reddit submission success by timing

By Joey (5th of July, 2015)

The visualisation below shows the distribution of scores for Reddit submissions made to /r/DataIsBeautiful, grouped by the hour of the day they were submitted (in UTC). The dataset used is the 1,000 newest posts in that subreddit as collected on the 5th of July, 2015.
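For anyone wanting to reproduce the grouping step, here is a minimal sketch in Python. It is not the code behind the chart. It assumes each post is a dictionary with a `created_utc` epoch timestamp and a `score` field, and the sample records are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_scores_by_hour(posts):
    """Map each UTC hour (0-23) to the list of scores submitted in that hour."""
    by_hour = defaultdict(list)
    for post in posts:
        # Interpret the epoch timestamp explicitly as UTC, not local time.
        submitted = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        by_hour[submitted.hour].append(post["score"])
    return dict(by_hour)


def utc_epoch(year, month, day, hour, minute):
    """Helper for building the hypothetical sample timestamps below."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc).timestamp()


# Hypothetical sample records, not taken from the real dataset.
sample_posts = [
    {"created_utc": utc_epoch(2015, 7, 5, 16, 5), "score": 120},
    {"created_utc": utc_epoch(2015, 7, 4, 16, 40), "score": 3},
    {"created_utc": utc_epoch(2015, 7, 5, 2, 15), "score": 0},
]

scores_by_hour = group_scores_by_hour(sample_posts)
```

Each list in `scores_by_hour` then becomes the input for one box in the plot.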

In case the visual form is unfamiliar, this is a Tukey box plot. Here's how to read the chart:

The 16:00 UTC time block probably stands out immediately. It is equivalent to 9am PDT (Pacific Daylight Time) and 2am AEST (Australian Eastern Standard Time). The box for this hour is noticeably taller than the others, which indicates that submissions made during this hour of the day have a greater likelihood of obtaining a high score (above 100) than at other times. Interestingly, it is also one of the hours with the fewest submissions. The median is also relatively low compared to other hours, which means that while your chance of getting a higher score is greater, so is your chance of getting a lower score!
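The time zone arithmetic can be double-checked with a short Python sketch. It uses fixed offsets (UTC-7 for PDT, UTC+10 for AEST) rather than a time zone database, which is a simplification that only holds for the date in question.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed valid for early July 2015 only.
PDT = timezone(timedelta(hours=-7), "PDT")
AEST = timezone(timedelta(hours=10), "AEST")

peak_utc = datetime(2015, 7, 5, 16, 0, tzinfo=timezone.utc)
peak_pdt = peak_utc.astimezone(PDT)
peak_aest = peak_utc.astimezone(AEST)

# Print the weekday and local time for each zone.
print(peak_pdt.strftime("%a %H:%M %Z"))
print(peak_aest.strftime("%a %H:%M %Z"))
```

Note that 2am AEST falls on the following calendar day, so the same UTC hour covers a morning in California and the small hours of the next day in eastern Australia.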

For the other hours, the majority of submissions only have single- or double-digit scores. Some of the boxes even sit on the x-axis with no visible bottom whisker, indicating that at least a quarter of the submissions made in those hours had a score of 0.
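To make the box and whisker terms concrete, here is a rough Python sketch of the numbers behind a single Tukey box. It follows the usual convention of whiskers reaching the most extreme data points within 1.5 times the interquartile range. The quartile method (`inclusive`) and the sample scores are assumptions for illustration, not the ones used for the chart.

```python
from statistics import quantiles


def tukey_box_stats(scores):
    """Return the values needed to draw one Tukey box for a list of scores."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(s for s in scores if s < low_fence or s > high_fence),
    }


# Hypothetical scores for one hour: mostly tiny, with one runaway hit.
hypothetical_scores = [0, 0, 0, 1, 2, 3, 5, 8, 400]
stats = tukey_box_stats(hypothetical_scores)
```

In this made-up sample the first quartile and the lower whisker are both 0, which is the situation where a box sits on the axis with no bottom whisker showing.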

As with every other data visualisation experiment in this space (so far), the box plot drawing function was coded from scratch rather than using tools like R (I like to get my hands dirty in this sort of stuff). My first cut at drawing the box plots for the Reddit data used a linear Y scale since it required fewer calculations, which produced this:

While this still showed the anomaly at 16:00, the other boxes were so small that they weren't even thick enough to escape the width of the median line markers. This is why I ended up with logarithmic scaling, so that these boxes would be a bit bigger and the subtle differences between the other hours were more obvious. I did however have to cheat by changing all zero-scored submissions to 0.1, as log(0) is undefined (it tends to negative infinity in base 10, as in any other base). This is why the boxes don't touch the x-axis (if you're a statistician feel free to come slap my wrist).
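The zero-score workaround can be sketched in Python as below. The axis bounds (0.01 to 10,000) and the 400 pixel plot height are hypothetical choices for this example, not values from the actual chart. Only the 0.1 floor comes from the description above.

```python
import math

ZERO_FLOOR = 0.1  # stand-in value for a score of 0, since log10(0) is undefined


def log_y(score, y_min=0.01, y_max=10000.0, height_px=400.0):
    """Map a score to a height above the x-axis on a base-10 log scale."""
    clamped = max(score, ZERO_FLOOR)
    span = math.log10(y_max) - math.log10(y_min)
    return (math.log10(clamped) - math.log10(y_min)) / span * height_px


# Every factor of 10 in score adds the same number of pixels.
heights = [log_y(s) for s in (0, 1, 10, 100)]
```

Because the floor of 0.1 sits above the hypothetical axis minimum of 0.01, a zero score is drawn slightly above the x-axis rather than on it.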

I also tried grouping the submissions by a different time interval: day of the week. As with hours of the day, I had an early hypothesis: people tend to be busier on weekdays than weekends, and may not share as much stuff or make as much original content for Reddit during the week as a result. For the same dataset, here's what the chart ended up looking like (still logarithmically scaled):

This was not the result I thought I would get. Monday was the most successful day of the week in terms of score distribution, but Tuesday and Thursday ended up being the days with the most submissions. However, it is worth remembering that this data uses UTC timestamps, and contributions come from all over the world. Additional metadata about the users (geography, occupation) would be required to draw any meaningful insights about the behaviour of Redditors who engage rather than simply browse.
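Counting submissions per weekday is a small variation on the hourly grouping. Here is a sketch in Python, again assuming epoch timestamps interpreted as UTC and using made-up sample values.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def submissions_per_weekday(timestamps):
    """Count submissions per day of the week, with days defined in UTC."""
    counts = Counter()
    for ts in timestamps:
        # weekday() returns 0 for Monday through 6 for Sunday.
        day_index = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()
        counts[DAY_NAMES[day_index]] += 1
    return counts


# Hypothetical timestamps: two on Sunday 5 July 2015, one on Monday 6 July 2015.
sample_timestamps = [
    datetime(2015, 7, 5, 9, 0, tzinfo=timezone.utc).timestamp(),
    datetime(2015, 7, 5, 23, 30, tzinfo=timezone.utc).timestamp(),
    datetime(2015, 7, 6, 1, 0, tzinfo=timezone.utc).timestamp(),
]

weekday_counts = submissions_per_weekday(sample_timestamps)
```

A post made at 23:30 UTC on Sunday is already Monday morning in Australia, which is one reason the UTC day boundaries blur any weekday effect.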

There are some other factors which might invalidate this visualisation. For starters, the number of data points that fell into each hour slot differed, which already makes it somewhat unfair to compare the box plots side by side. The dataset also omits removed submissions. Removal happens when a submission breaks the rules of the subreddit, or when the user chooses to remove it for whatever reason. There are probably even more improvement opportunities but my time is up this weekend.