The statistical databases that are built by government agencies and non-profit organizations often contain confidential information such as income, credit ratings, type of disease or test scores of individuals. In corporate DWs, some strategic figures that are not related to individuals like sales for recently launched products may also be confidential. Whenever sensitive data is exchanged, it must be transmitted over a secure channel like the Secure Socket Layer (SSL), see Netscape (1996) in order to prevent unauthorized use of the system. For the purposes of this chapter, we assume that adequate measures for security and access control are in place, see Stallings (1999).
However, even if the information in the statistical database safely reaches the correct addressee, the system has to ensure that the released information does not compromise the privacy of individuals or other confidential information. Privacy breaches do not only occur as obvious disclosures of individual values in single queries. Often, the combination of multiple non-confidential query results may allow for the inference of new confidential facts that were formerly unknown.
We give an example. From Table 9.1, we take the total sales
for ''Tennis Shoes'' (
), ''Tennis Balls'' (
),
''Tennis Nets'' (
) and a fourth, new product (''Tennis
Socks'',
). We assume that sum queries for groups of products are
allowed but that single, product-specific sales values are
confidential. After querying the sum for balls and shoes (
)
and for balls and socks (
), the user can infer an interval of
[
;
] for the sales of shoes, as sales cannot be
negative. The length of the interval, which is the maximum error of
the user's estimation of the confidential shoe sales is only
of the actual value. This particular case of disclosure
is called interval inference, see Li et al. (2002). Other
types of inference include exact inference (concluding the
exact value of
for shoes sales) and statistical
inference (inferring estimates like mean
With this approach a query is either denied or responded with an exact answer as the upper sketch in Fig. 9.9 indicates.
Fellegi (1972) works by setting lower and upper bounds for the size of the query answer set based on the properties of the database and on the preferences fixed by the database administrator. If the number of returned records did not lie within these bounds, the information request would have to be rejected and the query answer is denied. As queries that are issued sequentially by one user often have a large numbers of entities in common, an improvement is the restriction of these entities to a maximum number, see Dobkin et al. (1979). Although popular, this method is not robust enough as a stand-alone solution, see Denning (1982).
It involves keeping up-to-date logs of all queries made by each user and constantly checking for possible disclosures whenever a new query is issued. One major drawback of this method is that it requires huge amounts of storage and CPU time to keep these logs updated. A well-known implementation of such an audit system is Audit Expert by Chin and Özsoyoglu (1982). It uses binary matrices, see bitmap indexes in Sect. 9.4.3, to indicate whether or not a record was involved in a query.
See Cox (1980) is an important method for categorical databases when
information is published in tabular form. Especially Census Bureaus
often make use of tabular data and publish counts of individuals based
on different categories. One of the main privacy objectives is to
avoid answers of small size. For example, if a snooper knows
somebody's residence, age and employer, he can issue a query for
(ZIP = 10,178, Age = 57, Employer = 'ABC'). If the answer is
one entity, the snooper could go on and query for (ZIP =
10,178, Age = 57, Employer = 'ABC', Diagnosis = 'Depression'). If the
answer is one again, the database is compromised and the person with
the diagnosis identified. The cells should have to be
suppressed. A common criterion to decide whether or not to suppress
a cell is the N-k rule where a cell is suppressed if the
top
respondents contribute at least
% of the cell
total.
and
are parameters that are fixed by the database
administrator, i.e. the Census Bureau. In the exemplary case of
and
, a cell which indicates aggregated income
($10M) of
individuals would have to be suppressed if the
top two earners' aggregate income exceeded $1M.
In the query restriction approach, either exact data is delivered from the original database or the query is denied. As depicted in the lower sketch of Fig. 9.9, an alternative is to perturb the original values such that confidential, individual data become useless for a snooper while the statistical properties of the attribute are preserved. The manipulated data is stored in a second database and is then freely accessible for the users.
If in Table 9.1, we permute the sales of tennis balls, tennis nets and tennis shoes, individual sales data is not correct anymore. But the arithmetic average and the standard deviation of the attribute sales stay the same. This procedure is called data swapping, see Denning (1982).
Noise addition for numerical attributes, see Traub et
al. (1984), means adding a disturbing term to each value:
, where
is the original value and
adheres to a given probability distribution with mean zero. As for
every value
value, the perturbation
is fixed,
conducting multiple queries does not refine the snooper's search for
confidential single values.
A hybrid approach are random-sample queries, Denning (1982),
where a sample is drawn from the query set in such a way that each
entity of the complete set is included in the sample with
probability
. If, for example, the sample of a COUNT query has
entities, then the size of the not perturbed query set can be
estimated as
. If
is large, there should be a set-size
restriction to avoid small query sets where all entities are included.
All methods presented in the preceding sections aim at lowering the
disclosure risk for data that is private and confidential. But at the
same time, each of these methods reduces, in some way, the utility of
the data for the legitimate data user. Duncan and Keller-McNulty
(2001) present a formal framework to measure this trade-off between
disclosure risk and data utility, the Risk-Utility (R-U)
map. There are numberless measures for disclosure risk, see
Domingo-Ferrer et al. (2002) for an excellent overview. We already
gave an intuitive measure for interval inference. The sales for tennis
shoes were predicted with an error of only
, see
Sect. 9.7.1.
However, it is far more difficult to measure data utility because it strongly depends on the varying preferences of the data user. Especially for this reason, classifying statistical disclosure control methods as presented here on an absolute scale is almost an impossible task.