Gibson defines a histogram as a method of presenting the distribution of numerical data in graphical form. It is an estimate of the probability distribution of a continuous variable. The first step in constructing a histogram is to take the whole range of values and divide it into a series of intervals called bins. The next step is to count the number of values that fall into each bin. Each bin covers a specific interval of the variable. Bins must be adjacent and non-overlapping, and they are usually, though not necessarily, of equal width. If they are of equal width, a rectangle is drawn over each bin with its height proportional to the frequency, that is, the number of cases in the bin. If the bins are not of equal width, the rectangle is drawn so that its area is proportional to the frequency of cases in the bin. In that case the vertical axis shows density rather than frequency: the number of cases per unit of the variable on the horizontal axis.
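The construction steps can be illustrated with a minimal sketch. It assumes a small made-up sample and hand-picked bin edges (neither comes from the source), and the helper name `histogram` is hypothetical. It counts the values in each bin and then divides each count by the bin width to obtain the density, so that each bar's area equals its frequency even when the bins differ in width.

```python
from bisect import bisect_right

def histogram(values, edges):
    """Count values per bin and convert the counts to densities.

    Bins are half-open [edges[i], edges[i+1]), except the last bin,
    which also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside every bin
        # Index of the bin whose left edge is the last one <= v.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    # Density = cases per unit of the horizontal-axis variable.
    densities = [c / w for c, w in zip(counts, widths)]
    return counts, densities

# Hypothetical sample and edges, chosen only for illustration.
values = [1, 2, 2, 3, 5, 6, 9, 10, 14, 15]
edges = [0, 4, 8, 16]  # bin widths 4, 4 and 8
counts, densities = histogram(values, edges)
print(counts)
print(densities)
```

With equal-width bins the densities are simply the counts scaled by a constant, which is why height alone can stand for frequency in that case.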
The above histogram represents the number of patients diagnosed with a given number of secondary conditions. The minimum number of diagnoses is 1 and the maximum is 16. The data has been grouped (binned) into intervals of width 4, which gives four intervals. Seven patients were diagnosed with between 1 and 4 secondary conditions, 7 with between 5 and 8, 4 with between 9 and 12, and 2 with between 13 and 16. The histogram takes a “stair” shape that steps down towards the right, meaning that fewer patients are diagnosed as the number of conditions increases. Most patients (14 of 20) were diagnosed with between 1 and 8 conditions.
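As a rough check on this reading, the bin counts quoted above can be tabulated and drawn as a text histogram. This is only a sketch: the per-patient data is not available, so the counts are entered directly, and the label 9-12 for the third interval is inferred from the stated interval width of 4. The names `bins`, `render` and `share_1_to_8` are illustrative.

```python
# Patients per interval of secondary conditions, as quoted in the text.
bins = {"1-4": 7, "5-8": 7, "9-12": 4, "13-16": 2}

total = sum(bins.values())

def render(bins):
    """Return one text row per bin, with one '#' per patient."""
    return [f"{label:>5} | {'#' * count} ({count})"
            for label, count in bins.items()]

rows = render(bins)
print("\n".join(rows))

# Share of patients in the two lowest intervals (1 to 8 conditions).
share_1_to_8 = (bins["1-4"] + bins["5-8"]) / total
print(f"{share_1_to_8:.0%} of {total} patients have 1 to 8 conditions")
```

The first two rows come out the same length, which reflects the two equal bars at the top of the “stair”.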
An outlier is an observation that appears to come from a different set of observations than the rest of the data. In simple terms, an outlier is an unexpected outcome in a set of observations (Gibson). Outliers have many causes, including operational errors, faulty equipment, anomalies from overlapping series, data-entry mistakes and warm-up effects.
There are two major ways of identifying outliers in any presentation of data:
1. By examining the general shape of the graphed data for important features, including symmetry and departures from hypotheses. Each statistical study rests on pre-set assumptions, also known as hypotheses. When an observation contradicts these assumptions, it may be an outlier.
2. By examining the data for an observation that appears to belong to a different set, one that lies far from the rest of the data under scrutiny.
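The second approach, looking for observations that lie far from the rest, can be made concrete in several ways. The sketch below uses the interquartile-range (IQR) rule, a common convention that the source does not name; it is swapped in here only as one possible illustration. The sample values, the helper name `iqr_outliers` and the multiplier `k = 1.5` are all assumptions.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    This is the conventional IQR rule, used here as an assumed
    stand-in for "far from the rest of the data".
    """
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Hypothetical observations; 30 sits well apart from the others.
sample = [2, 3, 3, 4, 4, 5, 5, 6, 30]
outliers = iqr_outliers(sample)
print(outliers)
```

A rule like this needs the individual observations; it cannot be applied to bin counts alone.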
In the above histogram, the outliers are in the intervals 1-4, 5-8 and 13-16. From a logical perspective, it is not believable that 7 patients were diagnosed with up to 8 conditions. This could have resulted from an operational error. It is also not clear whether the patient counts were cumulative.
Presentation of data is as important as the outcome of statistical research. Histograms provide a pictorial view of the outcomes observed in a statistical study, so that even a layperson can easily make sense of the data. However, the picture is harder to read when two consecutive bins have the same height. In the histogram above, it is difficult to tell where the interval 1-4 ends and the interval 5-8 begins when the area of the bar is used to gauge the intensity of the observation. Owing to this weakness, other pictorial presentation techniques such as pie charts may be preferred.
Gibson, S. Andrew. Exposure and Understanding the Histogram. 2nd ed. Peachpit Press, 2014. Print.