Something to emphasize to everyone is that such statistical models do not come to fruition on their first pass. The published piece I worked on for my master's degree examined what effect electric vehicle incentives have on buying behavior in California. I just counted, for fun, how many regression models I developed over 4 months of work: 36 models before we landed on our beautiful, simple model, which showed that utility subsidies and front-end tax credits encourage the greatest EV purchasing.
Glad you've been able to make some more advancements with this regression by adding carousels, Jack, and I'm interested to see more!
Looking over the brief pictures you've shared, it would appear on the surface that roller coaster count carries a strong correlation with overall ranking. If possible, I'd heavily encourage you to run a statistical regression on the data to see what effect each factor has on the rankings - it could very well be that one factor, such as number of roller coasters or roller coaster height, trumps the other factors. There are a number of ways to control for this, such as taking the natural log of an input, which compresses large values and can put the inputs on a more even keel. (Taking the natural log is a common recommendation for factors with far outliers in a statistical regression.)
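To show what I mean by the natural log evening things out, here is a rough Python sketch. The coaster counts are made-up numbers for illustration, not Jack's data, and the function name is just something I picked.

```python
import math

def log_transform(values):
    """Take the natural log of each value (all values must be > 0)."""
    return [math.log(v) for v in values]

# Hypothetical coaster counts for six parks, with one far outlier (19).
coaster_counts = [2, 4, 5, 6, 8, 19]
logged = log_transform(coaster_counts)

# Ratio of largest to smallest input, before and after the log.
raw_ratio = max(coaster_counts) / min(coaster_counts)
log_ratio = max(logged) / min(logged)

print(f"raw spread: {raw_ratio:.2f}x, logged spread: {log_ratio:.2f}x")
```

The outlier park is still the largest after the transform, so the ordering is preserved, but it sits much closer to the rest of the pack and has less leverage over the fit.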
jackdude101 said:
I have a setup in place that considers outliers in the data when calculating the scores. If a piece of data is 5 standard deviations above or below the mean or more, it gets capped. Currently, for the height, the point at which 5 standard deviations is reached is ~299.93 feet. Kingda Ka obviously gets the max score for height, but because its score is capped at a certain point, it doesn't get the #1 overall score (it's currently #11). The one that does hold the #1 spot for roller coasters currently is Fury 325.
So your standard deviation is 59.9 ft.? Fascinating, I'd have ventured a larger deviation. How does the standard deviation look for speed, length, inversions, or other factors? Again, if you are finding that height plays too overbearing a role in the regression, it might be worth venturing other controls, such as taking the natural log or lowering the standard deviation cap. EDIT: Actually, thinking more, don't lower the cap just yet - you want to focus on the differences in ride height and how they affect park ranking; capping the roller coaster heights means you are assuming that past a certain point, more roller coaster height does not matter. This is not really the case, as it can be well argued and reasoned that a 300+ ft. roller coaster will show greater draw and popularity than a smaller roller coaster.
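A quick Python sketch of why the cap worries me. This is my own guess at how such a cap could work, not Jack's actual setup: the heights are illustrative (only 325 and 456 correspond to real rides), I'm assuming a population standard deviation, and I use k = 1.5 rather than 5 only because a sample this small can never reach 5 standard deviations.

```python
import statistics

def cap_outliers(values, k):
    """Clamp each value to within k population standard deviations of the mean.

    Returns the capped list and the upper bound that was applied.
    """
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    lower, upper = mean - k * sd, mean + k * sd
    return [min(max(v, lower), upper) for v in values], upper

# Illustrative heights in feet (hypothetical apart from 325 and 456).
heights_ft = [60, 80, 100, 120, 140, 160, 200, 325, 456]
capped, upper = cap_outliers(heights_ft, k=1.5)

print(f"upper cap: {upper:.1f} ft")
print(f"456 ft ride is scored as {capped[-1]:.1f} ft")

# A hypothetical 600 ft ride would be scored exactly the same as the 456 ft one.
print(min(600, upper) == capped[-1])
```

Everything above the cap collapses to one value, which is the "more height does not matter" assumption in action. A log transform would shrink the gap between tall rides without erasing it.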
The #1 challenge of any statistical regression is being able to quantify the qualifiable. Saying you love something is easy enough - but it needs to be put to numbers in order to be made accountable. How much do you love something? 4 out of 5 times?
Looking over our results, we know we are getting close to a good calculation, but are not quite there. For instance, Knott's Berry Farm carries a larger return than Six Flags Magic Mountain. Cedar Point is also showing a massive, outlier return. I believe it would be safe to say that this is actually not the case, so there must be some additional factors we are not currently accounting for.
And as we have discussed before, there are a number of very popular roller coasters in the world that are statistically inferior. Maverick, Intamin Mega-Lites, and RMC coasters are just a few examples of this - accounting for this popularity would make the model more robust and its output more reliable.
I would encourage you to consider using rider survey data as an easy way to account for some of these "other" factors that are not being captured. It is simple enough to scale up or down the impact rider survey data would have on the model, and with a large sample of rider survey data already available (Mitch Hawker Coaster Poll), it is worth the time to see if this data could lend further insight.
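One simple way to scale the survey's impact is a weighted blend, sketched below in Python. This is only my assumption about how it could be wired in; the function name, the 0 to 1 weight, and the example scores are all hypothetical, and both scores are assumed to already be on the same scale.

```python
def blend_scores(stat_score, survey_score, survey_weight):
    """Weighted average of a stats-based score and a survey-based score.

    survey_weight = 0 ignores the survey; survey_weight = 1 uses only the survey.
    """
    if not 0.0 <= survey_weight <= 1.0:
        raise ValueError("survey_weight must be between 0 and 1")
    return (1 - survey_weight) * stat_score + survey_weight * survey_score

# Hypothetical ride: middling on paper (60), loved by riders (90).
blended = blend_scores(60.0, 90.0, survey_weight=0.25)
print(blended)
```

Dialing survey_weight up or down is the "scale" knob: rerun the rankings at a few weights and see how much they move.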
So recap:
- Check the statistical significance of each factor to see how much influence it has in the model. If one factor, such as coaster size, is found to be vastly more significant than the others, the model will track that factor very closely, and will require more control.
- Good regression controls: take the natural log of the factors, or square the factors. I'm happy to talk more on other possible controls worth considering.
- Check the correlation of roller coaster count to park ranking. If there is a high correlation, the model is only reflecting that one variable.
- Consider using other variables to account for other factors. Using ride ratings could be one such way to easily account for other qualities that are not quantified, such as ride smoothness, aesthetics, etc.
- Remember that no model is perfect. What is perfect is the interpretation of the output, and using proper citation and logic for using the output to influence your findings.
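The correlation and significance checks from the recap can be sketched in a few lines of Python. The numbers are made up for illustration, and this only covers the one-factor case, where the t-statistic for the slope follows directly from the correlation; a full multi-factor regression would need a proper stats package.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope_t_stat(r, n):
    """t-statistic for the slope of a one-factor linear regression (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical coaster counts and overall park scores for six parks.
coaster_counts = [2, 4, 5, 6, 8, 19]
park_scores = [10, 22, 24, 31, 40, 95]

r = pearson_r(coaster_counts, park_scores)
t = slope_t_stat(r, len(coaster_counts))
print(f"r = {r:.3f}, t = {t:.1f}")
```

If r comes out near 1 for coaster count alone, the overall score is mostly restating that one variable, which is the warning sign from the third bullet.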
I'm happy to talk more about this with you offline, and to help run some of the model if you are willing to share. I do still have licenses to STATA and SAS, and would love to help out!