Welcome to GraphGraph.
Read more about our site on the About page.
The winner of this month’s award for “Unexpected Achievement in the World of Graphs” is Sean Taylor of the Facebook Data Science Team. Rather than describe what has been done, I’ll just leave a link here and say it’s Super Bowl related. It’s better explained by Sean anyway.
We here at GraphGraph appreciate a good graph, but what is getting our spreadsheets all in a pivot right now is dreaming of the amount of data that the good folks* at Facebook have at their disposal.
I have one problem with their presentation, and that would be the use of grey as a color. I understand that with 32 teams, there are only so many color options, and I can’t at this moment say how I would have done it differently. Nevertheless, to my eye, grey always looks like it represents “neutral” or “no data available,” not “Patriots or Colts or maybe even Cowboys.” Oh well.
There is a series of maps that shows the support for each remaining team as this year’s postseason progressed. I immediately wished there was an animated version, so I created a gif for your internetting consumption. Enjoy.
I’d like to see this map redone with the areas weighted by population, like they do around election time. I also wouldn’t mind seeing this for other sports, like baseball and basketball and curling. Finally, I would be remiss not to mention two things:
1) Sean, Corey, and I all share the same alma mater.
2) Go Ravens.
*I sincerely hope that they are good, given all of the embarrassing pictures they have of young graph enthusiasts.
Monopoly’s a great game, isn’t it? Did you know that the properties in the US version of the game are named after actual streets in Atlantic City, NJ?
We had a thought: What would it look like if you drew lines on the ACTUAL streets of Atlantic City? Well, here you go:
We made this using the ‘Custom Maps’ feature of Google Maps, and the embedded map is below:
Some interesting notes:
I recently leased a 2012 Kia Optima Hybrid. I’ve been doing a lot of driving for work lately, so I decided to get a mid-size car that could handle a lot of highway miles plus give me decent MPGs.
Being the numbers nerd that I am, I’ve been keeping track of various stats.
The EPA estimates for the car when I bought it were 40 Highway & 35 City for an average of 37.
After the first nine fill-ups, I’ve been disappointed. I was averaging 29.9 MPGs, WAY below the EPA estimates.
Here’s a graph showing the numbers through the first nine fill-ups:
My numbers are extremely off. Why is this? There are a few options:
Let’s examine each of these:
“The brand of car doesn’t actually give the MPGs promised.”
Right after I leased the car, Kia (and parent company Hyundai) was docked by the EPA for overstating its MPG numbers. The new estimates were 39 Highway and 34 City, for an average of 36. Kia is trying to make it right, though, through partial reimbursements that you can read about here.
Even with the “new” estimates, however, I’m still way off the mark.
How do I compare with other Kia Optima Hybrid owners? My co-worker showed me an amazing site called Fuelly, which is essentially a fuel stat-tracking website. The added value of the website, however, is that you can look at all other owners of the same model and see how you compare to them.
Here’s the link to the list of other 2012 Kia Optima Hybrid owners.
Here’s some basic statistical data about the Optima, as of December 20, 2012:
MPG City: 34
MPG Highway: 39
Standard Deviation: 4.91
The mean is right at the MPG City number, and the median and mode are to the left of that.
For comparison’s sake, here’s a screenshot and some basic statistical data about the 2012 Toyota Prius from December 20, 2012.
MPG City: 51
MPG Highway: 48
Standard Deviation: 4.41
The mean, median, and mode all fall within the EPA estimates.
Perhaps my sample size is too small? My personal number of 30 MPGs is -.81 Standard Deviations off the mean, so perhaps it means that the problem is multi-faceted: The MPGs for the brand are not what was promised, AND there are problems with my particular car. Let’s examine the second half of that point next.
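As a quick check on that -.81 figure, here is a minimal sketch of the calculation. It assumes the Fuelly mean is 34 (the post only says the mean sits right at the MPG City number) and uses the 4.91 standard deviation quoted above; the function name `z_score` is just an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - mean) / std_dev


# Figures quoted in the post: roughly 30 MPG for my car,
# an assumed mean of 34, and a standard deviation of 4.91.
my_z = z_score(30, 34, 4.91)
print(round(my_z, 2))  # -0.81
```

A negative value simply means the car is below the average of the other owners.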
“My specific car doesn’t actually give the MPGs promised” and “I’m a terrible driver who drives inefficiently”
According to my rough statistics, I’m -.81 Standard Deviations off the mean. So, what’s wrong with my particular car? Is it a problem with the car or with the driver? Or both?
With regard to the car, it’s a brand-new lease, so I would hope that there is nothing wrong with it. After the latest fill-up, I decided to check the tire pressure. The tires are meant to have 44 psi, but each tire was hovering between 30 and 34. Yikes! I’ll have to see if this gives me an improvement.
Looking at various sites about more fuel-efficient driving, I stumbled upon this post specifically about the Kia Optima Hybrid, including this video about “best practices”.
So perhaps the problem is my individual car (which I will have to continue investigating), but perhaps the problem is my driving? I feel that I’ve tried to adjust to the hybrid, but perhaps I can still do better?
Instead of a tachometer, the Optima Hybrid has an “efficiency” gauge that gives instant feedback of “good”, “medium”, and “poor” driving. There’s even an “Eco Score” that gives you points for driving “efficiently”. I throw that in quotes because it’s based on what the car thinks is ideal, but from a gamification point-of-view it creates an incentive for me to try to drive better to earn virtual “points”, which should (in theory) correspond with better MPGs. I didn’t start tracking my Eco Points until my 7th fuel-up, and in a future post I’ll see if there’s a relationship between tank MPG and Eco Points.
[Image from CNET.]
Another potentially contributing factor would be where I live and the current weather. I’m doing a lot of travel between Pennsylvania and New Jersey, and it involves a lot of up and down through rolling hills. We’re also moving into winter, and as a result the car is very cold in the morning and has frost, meaning I need to burn fuel to defrost and to heat the car up when I start driving. Will other seasons be better for my MPGs?
“The gallons being dispensed at the pump are not the gallons actually being put into the car”
The last thing we’ll look at today is how much you can trust the pump’s count of the gallons you purchased. At the end of November, I had two fuel-ups at the same station in New Jersey where my MPGs seemed really low compared to the average. When I fueled up the second time there, the tank was about 1/6 full but I noticed that they dispensed nearly a full tank’s worth of fuel. Something seemed off here.
Fortunately there are consumer-protection groups such as local Weights and Measures departments. I placed a call and they’re investigating, so we’ll see if I was dealing with a crooked gas station, or perhaps my car really did take a full tank of fuel.
From an overall point of view for MPGs, since it’s a ratio of miles divided by gallons, you have to assume that the gallons being dispensed at the pump are the actual gallons being put into your tank. If not, any calculation you do will be suspect.
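To see how sensitive the ratio is to the pump reading, here is a small sketch. The mileage and gallon figures are hypothetical round numbers chosen for illustration, not values from my fuel log.

```python
def mpg(miles, gallons):
    """Miles per gallon for one tank: miles driven divided by gallons pumped."""
    if gallons <= 0:
        raise ValueError("gallons must be positive")
    return miles / gallons


# Hypothetical tank: 450 miles on a true 15 gallons.
print(mpg(450, 15))              # 30.0

# Same trip, but the pump reports 10% more fuel than was delivered.
print(round(mpg(450, 16.5), 2))  # 27.27
```

In this made-up case, a pump that over-reports by 10% drags the calculated MPG down by almost 3, with no change in the car or the driver.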
So what’s next? I see a few actions:
Got any tips for fuel-efficient driving? Leave them in the comments!
Sometimes at work we get requests to visualize data on maps. It’s a really cool feature, but the challenge is that the calculation driving the chart is generally a straight sum, so states like California, New York, and Texas always seem to have the highest values.
XKCD had a great comic the other day that I 100% agree with: the problem with geographic heat maps is that they’re often essentially just population maps.
This is why anytime you’re putting together a heat map, it’s best to normalize the data as best you can with a per-capita calculation.
Compare the following two calculations:
This very simple switch allows you to make a much more effective comparison of large states like California to small states like Rhode Island.
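A minimal sketch of the per-capita switch, assuming the data arrives as one total per state. The totals and populations below are made-up round numbers for illustration only, not real statistics, and `per_capita` is a hypothetical helper name.

```python
def per_capita(totals, populations, per=100_000):
    """Convert raw totals into a rate per `per` residents for each state."""
    return {
        state: totals[state] * per / populations[state]
        for state in totals
    }


# Made-up illustrative numbers (NOT real data).
totals = {"California": 4000, "Rhode Island": 150}
populations = {"California": 40_000_000, "Rhode Island": 1_000_000}

print(totals)                           # the raw sum: California dominates
print(per_capita(totals, populations))  # {'California': 10.0, 'Rhode Island': 15.0}
```

With the raw sum California looks far larger, but once each total is divided by population the smaller state comes out ahead in this example.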
USA Today offers a graph every morning in the bottom-left corner of the front page.
Take a look at this first one:
Why am I not surprised by these results? It’s essentially a “top 5 population” map. The only thing that seems off is that Arizona and Georgia are showing up here, but I’m bringing outside knowledge that the population is not very high there, so I can assume that they must be outliers.
However, a few weeks later I picked up the paper and was pleasantly surprised to see this map:
Much better! Now that they’ve switched to a percentage-focused view, I get a much better sense that in these states the proportions are indeed larger when compared to other states.
Map visualizations can be very powerful, but a little simple division can help you get a greater wealth of information from the same pixel space!
During the course of the 2012 U.S. presidential election you’ve no doubt seen lots of maps of the United States.
The maps that most people saw on election night (and in the weeks running up to it) had a very simple binary look: Blue for Obama, and Red for Romney. Usually, they just have one color per state because that’s what matters in the Electoral College.
However, it’s interesting to split that data out into counties as well.
This map (found on Gawker) takes it three steps beyond just the standard red/blue state map. The second map shows counties with a binary red/blue scheme. The third map shows each individual county on a red to purple to blue scale. The final map changes the transparency of any given county based on the population of that county; the brighter the county, the more people live there.
I think this gives a great visualization because it gives a truer perspective of where the votes fell in this election.
Another way to show this data is through a cartogram. Since the presidential election is decided by electoral votes, it makes sense to scale the US appropriately. This cartogram mashes up the two concepts nicely; the shape still resembles the United States, but gives you a more accurate representation of each state’s contribution to the electoral vote total.
A great deal has been made this year about election spending. This video courtesy of NPR gives fascinating insight as to where the money in this election was being spent, represented in maps.
What other interesting maps did you find in this past election?
Last year I decided the best way to have fun on Halloween was to make graphs. It was so much fun, I decided to do it again this year.
When the night was over, we had a whole lot more leftover candy than last year. Did we buy too much candy? Did not enough Trick-or-Treaters visit this year? Why didn’t we run out of candy like we did last year?
The basic premise was the same:
Last Year’s Stats – 2011
This Year’s Stats – 2012
What a difference! We bought about the same amount of Treats as the year prior, yet we had a LOT more leftover candy, even though there were more Trick-or-Treaters.
Because of this, we only ran out of two types of candy: M&M’s Peanut and Skittles, and we ran out of those types in the final 15 minutes.
Here’s a graph showing the starting and ending percentages of the different candies:
Purple marks if it was taken LESS relative to other candies.
Orange marks if it was taken MORE relative to other candies.
Let’s also group the candy types together and see if there’s a trend:
Candy that was in Bar form (for example, Hershey’s and Snickers) was less popular than candy in Bit form (for example, M&Ms and Starburst).
Sugar-based candies (Skittles and Starburst) were more popular than Chocolate and Nut candies. This is a departure from last year, when we had a bunch of Starburst left over!
In last year’s post I noted that trying to put all candies individually on a line chart would make it messy, and very hard to get any information out of the chart. This year, I decided to use a Trellis chart to help alleviate that problem.
In this chart, each brand gets its own view. However, for the candy types where I didn’t have a large starting amount, it’s hard to discern differences. If we start each brand at 100% and work downwards from there, we can see trends of how quickly (or slowly) a particular type of candy was taken, and since we counted at the mid-point, we can see which types went faster earlier or later in the evening.
Here is where this gets REALLY geeky.
Last year I put together a basic line chart highlighting the inverse relationship between the number of Trick-or-Treaters and the number of pieces they took. As it turns out, my formula for calculating that number was flawed.
Here’s last year’s chart:
What was flawed about it was the way I calculated the number of pieces per Trick-or-Treater. Last year I took a full count of the candy at specific times:
Here’s a screenshot of my Excel sheet from last year. Any cell with a gray background means an actual observation, and white cells represent a formula to approximate what the best-guess of the remaining amount was.
The problem was with the old formula. Last year I had assumed that the amount between observation points should have been:
S = Second Observation Point
F = First Observation Point
RR = Number of 15 Minute Intervals Remaining until Second Observation Point
TR = Total Number of 15 Minute Intervals between First and Second Observation Point
S + ((F-S) * RR/TR) = Candy Remaining for a Given Interval
This is pretty similar to a standard depreciation formula as you move from Date A to Date B.
However, that assumes the same amount of Trick-or-Treaters in each 15-minute interval, which was NOT the case. Given that I knew how many kids visited within each 15-minute interval, I could better refine the formula to approximate the number of pieces remaining within each time block.
S = Second Observation Point
F = First Observation Point
RTT = Number of Trick-or-Treaters remaining until Second Observation Point
TTT = Total Number of Trick-or-Treaters between First and Second Observation Point
S + ((F-S) * RTT/TTT) = Candy Remaining for a Given Interval.
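To make the difference between the two formulas concrete, here is a short sketch of both. The candy counts and visitor numbers are hypothetical values picked for illustration, not figures from my spreadsheet, and the function names are my own.

```python
def remaining_by_time(first, second, intervals_remaining, total_intervals):
    """Old formula: S + ((F - S) * RR / TR), spreading candy loss evenly over time."""
    return second + (first - second) * intervals_remaining / total_intervals


def remaining_by_visitors(first, second, visitors_remaining, total_visitors):
    """New formula: S + ((F - S) * RTT / TTT), spreading candy loss by visitor count."""
    return second + (first - second) * visitors_remaining / total_visitors


# Hypothetical example: 100 pieces at the first count, 40 at the second,
# four 15-minute intervals in between, with 30, 10, 5, and 15 visitors.
first, second = 100, 40
visitors = [30, 10, 5, 15]
total_visitors = sum(visitors)

# Estimate after the first interval: 3 of 4 intervals remain,
# but only 30 of the 60 visitors are still to come.
print(remaining_by_time(first, second, 3, 4))                        # 85.0
print(remaining_by_visitors(first, second, 30, total_visitors))      # 70.0
```

Both formulas agree at the two observation points (they return F when everything remains and S when nothing remains); they only differ in how they fill in the intervals between them.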
This leads to a much more refined formula. Here’s last year’s chart again, with the old and new formulas:
So now that we’ve established a new (and hopefully better) formula, we can compare this year to last year using the same methodology:
Interesting things to note:
This year we picked up seven different types of multi-pack bags. In every instance, we received more candy than what was promised on the bag, which was a nice plus.
Favorites and Non-Favorites
Last year we saw that sugar-based candies were the least likely to be taken by the Trick-or-Treaters. What about this year?
This is a relatively boring graph, in that there’s very little movement. That in itself tells a story, though, in the fact that for the most part, kids were taking candy in roughly equal proportions. Was this because we broke the three types out into three separate buckets? Were kids just evenly grabbing from each?
Compared to last year, the increase of sugar candies compared to the average is the most interesting. What was the difference? Did more kids take a liking to Skittles and Starburst? Did we do a better job preventing the smaller Starburst packages from falling to the bottom of the bucket? These are the mysteries of life that elude us all.
Just for kicks, here’s the graph comparing Bar candies to Bit candies:
Again, very little movement!
Intervals for Trick-or-Treating
We had 216 total Trick-or-Treaters. I was able to track two things with each group:
There were 60 total groups of Trick-or-Treaters, with an average size of 3.6.
Here’s the distribution of groups:
This makes for interesting visualizations when you decide the time interval to split it by:
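One way to experiment with different interval widths is to bin the group log by time. This sketch assumes each group was recorded as an arrival time (in minutes since the start of the evening) plus a group size; the sample log is hypothetical, not my actual tally.

```python
from collections import Counter


def bin_arrivals(arrivals, interval_minutes):
    """Total Trick-or-Treaters per time bin.

    `arrivals` is a list of (minutes_since_start, group_size) pairs.
    Keys of the result are the starting minute of each bin.
    """
    counts = Counter()
    for minute, size in arrivals:
        bin_start = (minute // interval_minutes) * interval_minutes
        counts[bin_start] += size
    return dict(sorted(counts.items()))


# Hypothetical log of five groups.
log = [(2, 3), (7, 4), (16, 2), (29, 5), (31, 4)]

print(bin_arrivals(log, 15))  # {0: 7, 15: 7, 30: 4}
print(bin_arrivals(log, 5))   # {0: 3, 5: 4, 15: 2, 25: 5, 30: 4}
```

The same log looks smooth at 15 minutes and spiky at 5, which is why the choice of interval changes the story the chart tells.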
Planning for Next Halloween
Now, who wants to help me eat this leftover candy?
We created a guide to highlight good and poor areas of resource production to better help you pick intersections to build your settlements in Settlers of Catan.
We assume that you set up Catan as per the game instructions, and that you arrange the tokens in alphabetical order, starting with the “A” token at the top of the board and working your way inward in a clockwise pattern, skipping the desert. Looking at your board, determine where the desert hex is and refer to the chart below to see your board’s map.
An Intersection Score is determined by adding up the expected number of resources you will receive from all of the hexes for a given token based on all the possible combinations from the roll of two dice. For example, an “8” is expected to be rolled 5 times for every 36 rolls, while a “12” is only expected 1 time for every 36 rolls. So, an intersection with a “5”, an “8”, and an “11” would expect to produce 4, 5, and 2 resources for 36 rolls, giving it a total score of 11.
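The scoring rule above can be sketched in a few lines. This is an illustrative version rather than the exact tool we used to build the guide, and the function names are hypothetical; it relies on the fact that two six-sided dice sum to n in 6 - |7 - n| of the 36 possible ways.

```python
def expected_rolls(token):
    """Number of ways (out of 36) that two six-sided dice sum to `token`."""
    if not 2 <= token <= 12:
        raise ValueError("token must be between 2 and 12")
    return 6 - abs(7 - token)


def intersection_score(tokens):
    """Sum the expected production per 36 rolls for the hexes touching an intersection."""
    return sum(expected_rolls(token) for token in tokens)


print(expected_rolls(8))              # 5
print(expected_rolls(12))             # 1
print(intersection_score([5, 8, 11])) # 11
```

Pass one, two, or three tokens depending on how many numbered hexes touch the intersection; a desert hex has no token, so it is simply left out of the list.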
Different Good and Poor scales are applied to the Intersections depending on how many hexes touch it. 3-Hex Intersections are statistically going to produce more resources than 1-Hex Intersections, so different ranges are used and can be seen in the chart below.
This guide is only meant to highlight the statistical probabilities of resource production at any given intersection. We understand that gameplay and the placement of resources is of course more complex given things like the types of resources on the board, the harbors you are looking to obtain, and the use of the robber.
What’s interesting is how there are definitely “good” and “poor” areas that exist on the board where there are concentrations of higher probability numbers. We couldn’t display this on any typical graph, but instead we use a modified geospatial layout of Catan to convey the information. We could have used a heat map with strong gradations of color, but by defining definite ranges with strong opposing colors, you can see basic ranges and get an idea of where to build.
Click through to after the jump to see the graphs for each of the configurations.
I’d like to share with you a chart that can help you decide what’s the best way to graph the data that you’re dealing with.
I think it’s a great baseline, and while I don’t agree 100% on everything (like the use of pie charts), I think this covers most every situation you’ll come up against in chart design.
We ran across a good article on BusinessWeek a few weeks ago that shows us how graphs can easily be misused to demonstrate trends which may not exactly be connected.
This highlights the problem that arises when people assume causation from trends in our societal data.
Click through to the article to see all the graphs.