Visualization and statistics are inseparable. Statisticians have known this for a long time, but non-statisticians in the visualization field have largely ignored the role of statistics in charts, maps, and graphics. Non-statisticians often believe that visualization follows data analysis. We aggregate, summarize, model, and then display the results. In this view, visualization is the last step in the chain and statistics is the first.
In GOG, statistics falls in the middle of the chain. The consequence of this architecture is that statistical methods are an integral part of the system. We can construct dynamic graphics, in which statistical methods can be changed (for exploratory purposes) without altering any other part of the specification and without restructuring the data. By including statistical methods in its architecture, GOG also makes plain the independence of statistical methods and geometric displays. There is no necessary connection between regression methods and curves or between confidence intervals and error bars or between histogram binning and histograms.
In GOG, the statistics component receives a varset, computes various statistics, and outputs another varset. In the simplest case, the statistical method is an identity. We do this for scatterplots: data points are input and the same data points are output. In other cases, such as histogram binning, a varset with n rows is input and a varset with k rows is output, where k is the number of bins (k < n). With smoothers (regression or interpolation), a varset with n rows is input and a varset with k rows is output, where k is the number of knots in a mesh over which smoothed values are computed. With point summaries (means, medians, ...), a varset with n rows is input and a varset with one row is output. With regions (confidence intervals, ranges, ...), a varset with n rows is input and a varset with two rows is output.
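The row counts described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of a varset as a list of `(value, [caseIDs])` rows, and the names `make_varset`, `stat_identity`, `stat_bin`, `stat_mean`, and `stat_range`, are hypothetical and not part of GOG itself. Smoothers are omitted to keep the example univariate.

```python
from statistics import mean

# Assumption: a varset is a list of rows, each pairing a value with the
# list of caseIDs that map to it. This layout is illustrative only.

def make_varset(values):
    """Wrap raw values as rows of (value, [caseID]); one case per row."""
    return [(v, [case_id]) for case_id, v in enumerate(values)]

def all_cases(varset):
    """Collect every caseID that appears in the varset."""
    return [c for _, ids in varset for c in ids]

def stat_identity(varset):
    """n rows in, the same n rows out (as for a scatterplot)."""
    return list(varset)

def stat_bin(varset, k=3):
    """n rows in, k rows out: one row per equal-width bin."""
    values = [v for v, _ in varset]
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0           # guard against zero range
    bins = [[] for _ in range(k)]
    for v, ids in varset:
        j = min(int((v - lo) / width), k - 1)   # top edge goes in last bin
        bins[j].extend(ids)
    # Each output row: (bin midpoint, caseIDs that fell in the bin).
    return [(lo + (j + 0.5) * width, ids) for j, ids in enumerate(bins)]

def stat_mean(varset):
    """n rows in, one row out (assumes one case per input row)."""
    return [(mean(v for v, _ in varset), all_cases(varset))]

def stat_range(varset):
    """n rows in, two rows out: the lower and upper ends of the range."""
    values = [v for v, _ in varset]
    ids = all_cases(varset)
    return [(min(values), ids), (max(values), ids)]

data = make_varset([1.0, 2.0, 2.5, 4.0, 7.0, 9.0])   # made-up values
row_counts = {
    stat.__name__: len(stat(data))
    for stat in (stat_identity, stat_bin, stat_mean, stat_range)
}
print(row_counts)
```

Because every method takes a varset and returns a varset, swapping one for another changes nothing else in the pipeline, which is the property the dynamic-graphics argument relies on.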
Understanding how the statistics component works reveals an important reason for mapping values to cases in a varset rather than the other way around. When the mean function is applied to a varset, the list of caseIDs it produces is contained in the one row of the output varset. We do not lose case information in this mapping, the way we do when we compute results from an ordinary SQL query on a database, when we compute a data cube for OLAP, or when we pre-summarize data to produce a simple graphic. This aspect of GOG is important for dynamic graphics systems that allow drill-down or queries regarding metadata when the user hovers over a particular graphic element.
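To make the point about retained case information concrete, here is a minimal sketch. It again assumes a hypothetical varset layout of `(value, [caseIDs])` rows; the values and the helper names `stat_mean` and `drill_down` are invented for illustration.

```python
# Hypothetical varset: each row maps a value to the caseIDs that carry it.
varset = [(3.0, [0]), (5.0, [1, 4]), (8.0, [2]), (11.0, [3])]

def stat_mean(varset):
    """Return a one-row varset: the case-weighted mean plus every caseID."""
    case_ids = sorted(c for _, ids in varset for c in ids)
    total = sum(v * len(ids) for v, ids in varset)
    return [(total / len(case_ids), case_ids)]

def drill_down(case_ids, varset):
    """Recover the (caseID, value) pairs behind a summary row."""
    wanted = set(case_ids)
    return sorted((c, v) for v, ids in varset for c in ids if c in wanted)

summary = stat_mean(varset)
mean_value, contributing = summary[0]

# A plain aggregate keeps only the number; the varset row also keeps the
# cases, so a hover or drill-down can still reach the underlying data.
plain_aggregate = mean_value
details = drill_down(contributing, varset)
print(mean_value, contributing, details)
```

The design choice worth noting is that the summary row is no smaller in information than its input: the single number is what gets drawn, while the case list travels with it.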
Figure 11.8 shows an application of a statistical method to the city data. We linearly regress 2000 population on 1980 population to see if population growth is proportional to city size. On log-log scales, the estimated values fall on a line whose slope is greater than 1, suggesting that larger cities grow faster than smaller ones. Ordinarily, we would draw a line to represent the regression and we would include the data points as well. We would also note that Lagos grew at an unusual rate (with a Studentized residual of 3.4). Nevertheless, our main point is to show that the statistical regression produces data points that are exchangeable with the raw data insofar as the entire GOG system is concerned. How we choose to represent the regressed values graphically is the subject of the next section.
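The exchangeability of regressed and raw points can be sketched as follows. The numbers here are synthetic and are not the city data behind Figure 11.8: they are generated from an exact power law with exponent 1.1 so that the fitted log-log slope is known in advance. The varset layout and the names `fit_loglog`, `stat_linear`, and `point_coordinates` are assumptions made for this illustration.

```python
import math

# Synthetic, made-up populations (NOT the city data from the text).
pop_1980 = [1e5, 3e5, 1e6, 3e6, 1e7]
pop_2000 = [2.0 * p ** 1.1 for p in pop_1980]    # exact power law

# Hypothetical varset rows: ((x, y), [caseIDs]).
raw = [((x, y), [i]) for i, (x, y) in enumerate(zip(pop_1980, pop_2000))]

def fit_loglog(varset):
    """Ordinary least squares of log10(y) on log10(x)."""
    lx = [math.log10(x) for (x, _), _ in varset]
    ly = [math.log10(y) for (_, y), _ in varset]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, my - slope * mx

def stat_linear(varset):
    """Varset in, varset out: each y is replaced by its fitted value."""
    slope, intercept = fit_loglog(varset)
    return [
        ((x, 10 ** (intercept + slope * math.log10(x))), ids)
        for (x, _), ids in varset
    ]

def point_coordinates(varset):
    """Stand-in for a point geometry: it only needs (x, y) per row."""
    return [xy for xy, _ in varset]

fitted = stat_linear(raw)
slope, intercept = fit_loglog(raw)

# The same downstream function accepts raw and regressed rows alike.
print(slope > 1, len(point_coordinates(raw)), len(point_coordinates(fitted)))
```

A slope above 1 on log-log scales means the 2000 population grows more than proportionally with the 1980 population, which is the reading given in the text.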