Playing the data dress-up game - Don't Overthink This

# Playing the data dress-up game

By:: [[Brian Heath]]
2022-09-10

Most people have heard of the Normal Distribution. It's the bell-shaped curve that seems to describe a lot of things in [[life]]. Typical examples include the height of males, the totals you get when rolling a pair of dice, and stock market returns. Essentially, it describes any situation where most things are near the average, with tails of unusual values on either side. Most males are close to the same height, but a few are very tall and a few are very short. It's common to roll a 7 with two dice, but very uncommon to roll a 2 or a 12.

The Normal Distribution gives us a mental heuristic to assess and account for the randomness we see every day. It's valuable to know that most things will be average and that extreme values are rare.

However, the Normal Distribution is just a model, and one thing we know is that [[Trust and models|all models are wrong]]. Without getting too far into the weeds, it is virtually impossible to satisfy all the mathematical conditions for the Normal Distribution to hold up in the real world. It can still be useful, but there are times when the Normal Distribution is the wrong distribution for the situation. For example, [[wealth]] is not normally distributed. Hence, mathematicians have developed many other distribution models to represent non-normal phenomena. But again, all of these distributions are models and, by their [[nature]], wrong. They exist because they are more useful for the task at hand.

With all of these available distributions, analysts typically spend significant time attempting to find the distribution that best fits the data. Once they have the best distribution, they can begin to extrapolate, infer, and simplify the system being studied based on the properties of the distribution model. The obvious issue is that the model is almost certainly wrong, so why not just use the data itself rather than a model of the data?

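
To make that question concrete, here is a minimal Python sketch that asks the same two questions of a fitted Normal model and of the raw data. Everything in it is an assumption for illustration: the data is simulated (a right-skewed sample standing in for something wealth-like), and the names `data`, `model`, and `threshold` are hypothetical rather than taken from any real analysis.

```python
import random
from statistics import NormalDist

# Hypothetical data: 10,000 simulated, right-skewed, strictly positive values.
# This is an assumed stand-in for something like wealth, not real data.
rng = random.Random(42)
data = [rng.lognormvariate(0, 1) for _ in range(10_000)]

# Approach 1: dress the data up as a Normal Distribution.
# from_samples() estimates the mean and standard deviation from the data.
model = NormalDist.from_samples(data)

# Approach 2: use the data itself (the empirical distribution),
# answering questions by counting observations.
threshold = 10.0
p_model = 1 - model.cdf(threshold)
p_empirical = sum(x > threshold for x in data) / len(data)

# The Normal model also assigns probability to values the data never takes.
p_model_negative = model.cdf(0.0)
p_empirical_negative = sum(x < 0 for x in data) / len(data)

print(f"P(value > {threshold}): model={p_model:.5f}, data={p_empirical:.5f}")
print(f"P(value < 0): model={p_model_negative:.3f}, data={p_empirical_negative:.3f}")
```

On a sample like this, the fitted bell curve puts a noticeable share of probability on negative values that never occur, and it tends to understate how often large values show up. Counting directly avoids both problems, although it cannot say anything about values more extreme than those already observed.
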
At one point in time, collecting and processing data was hard. But we are [[living]] in the "Big Data" era with supercomputers in our pockets. Without a doubt, the data itself is a truer representation of the system than an abstract model of that data. Yet we continue using hand [[tools]] when the original crafters of lore would have killed for power tools.

There are cases where distribution fitting is valuable, but there are many other approaches that come closer to representing the system. Some of these, like Non-parametric Statistics, are intuitive and easier to comprehend. Others, such as Bayesian Statistics, are often more challenging to grasp. However, both are valuable analytics tools for interpreting the world with fewer starting assumptions.

#### Related Items

[[Analytics]]
[[Statistics]]
[[Normal Distribution]]
[[Big Data]]
[[Non-Parametric Statistics]]
[[Bayesian Statistics]]
[[Empirical Distribution]]
[[Models]]