## 1 Introduction

*Density estimation* or *distribution learning* refers to the problem of estimating the unknown probability density function of a common source of independent sample observations. In any interesting case, it is known that the unknown source density comes from a given class. In the *parametric* case, each density in this class can be specified using a bounded number of real parameters; consider, for example, the class of all Gaussian densities with arbitrary mean and variance. The remaining cases are called *nonparametric*. Examples of nonparametric classes include bounded monotone densities on , -Lipschitz densities for a given constant , and log-concave densities, to name a few. By *minimax estimation*, we mean density estimation in the minimax sense: we are interested in the existence of a density estimate which minimizes its approximation error, even in the worst case over the class.

There is a long line of work in the statistics literature about density estimation, and a growing interest coming from the theoretical computer science and machine learning communities; for a selection of new and old books on this topic, see [10, 11, 13, 18, 26, 27]. The study of nonparametric density estimation began as early as the 1950s, when Grenander [16] described and studied properties of the maximum likelihood estimate of an unknown density taken from the class of bounded monotone densities on . This estimator and this class received much further treatment over the years, in particular by Prakasa Rao [23], Groeneboom [17], and Birgé [6, 7, 8], who identified the optimal -error minimax rate, up to a constant factor. Since then, countless more nonparametric classes have been studied, and many different all-purpose methods have been developed to obtain minimax results about these classes: for the construction of density estimates, see the maximum likelihood estimate, skeleton estimates, kernel estimates, and wavelet estimates, to name a few; and for minimax rate lower bounds, see the methods of Assouad, Fano, and Le Cam [10, 11, 13, 29].

One very popular style of density estimate is the *histogram*, in
which the support of the random data is partitioned into bins, where
each bin receives a weight proportional to the number of data points
contained within, and such that the estimate is constant with the
given weight along each bin. Then, the selection of the bins
themselves becomes critical in the construction of a good histogram
estimate. Birgé [7] showed how histograms with
carefully chosen exponentially increasing bin sizes will have
-error within a constant factor of the optimal minimax rate for
the class of bounded non-increasing densities on . In general,
the right choice of an underlying partition for a histogram estimate
is not obvious.

In this work, we devise a recursive data-based approach for
determining the partition of the support for a histogram estimate of
discrete non-increasing densities. We also use a similar approach to
build a piecewise-linear estimator for discrete non-increasing convex
densities—see Anevski [1],
Jongbloed [20], and Groeneboom, Jongbloed, and
Wellner [19] for works concerning the maximum
likelihood and minimax estimation of continuous non-increasing convex
densities. Both of our estimators are *minimax-optimal*, their minimax -error is within a constant factor of the optimal
rate. Recursive data-based partitioning schemes have been extremely
popular in density estimation since the 1970’s with
Gessaman [15], Chen and Zhao [9], Lugosi and
Nobel [22], and countless others, with great interest coming from the machine learning and pattern recognition communities [12]. Still, it seems that most of the literature involving recursive data-based partitions is not especially concerned with the rate of convergence of density estimates, but rather with other properties, such as consistency under different recursive schemes. Moreover, most of the density estimation literature is concerned with the estimation of continuous probability distributions. In discrete density estimation, not all of the constructions or methods used to develop arguments for analogous continuous classes apply neatly, and in some cases, there are discrete phenomena that call for a different approach.

## 2 Preliminaries and summary

Let be a given class of probability densities with respect to a base measure on the measurable space , and let . If is a random variable taking values in , we write to mean that

The notation means that for each , and that are mutually independent.

Typically in density estimation, either , is the
Borel -algebra, and is the Lebesgue measure, or is
countable, , and is the counting measure. The
former case is referred to as the *continuous setting*, and the
latter case as the *discrete setting*, where is more often
called a *probability mass function* in the
literature. Throughout this paper, we will only be concerned with the
discrete setting, and even so, we still refer to as a class of
densities, and as a density itself. Plainly, in this case,
signifies that

Let be unknown. Given the samples
, our goal is to create a *density
estimate*

such that the probability measures corresponding to and
are close in *total variation (TV)
distance*, where for any probability measures , their
TV-distance is defined as

$$\mathrm{TV}(\mu, \nu) \;=\; \sup_{A \in \mathcal{F}} \bigl| \mu(A) - \nu(A) \bigr|. \tag{1}$$

The TV-distance has several equivalent definitions; importantly, if and are probability measures with corresponding densities and , then

$$\mathrm{TV}(\mu, \nu) \;=\; \frac{1}{2} \, \| f - g \|_1 \tag{2}$$

$$\mathrm{TV}(\mu, \nu) \;=\; \inf \bigl\{ \Pr\{X \neq Y\} \,:\, X \sim \mu, \; Y \sim \nu \bigr\}, \tag{3}$$

where for any function $h$, we define the $L_1$-norm of $h$ as $\|h\|_1 = \sum_x |h(x)|$. (In the continuous case, this sum is simply replaced with an integral.) In view of the relation between TV-distance and $L_1$-norm in (2), we will abuse notation and write
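As a concrete illustration, the standard identity $\mathrm{TV}(\mu,\nu) = \tfrac{1}{2}\|f-g\|_1$ can be evaluated directly for discrete densities; the function and variable names below are generic stand-ins, since the paper's own symbols are elided in this extraction:

```python
# Total variation distance between two discrete densities given as
# probability vectors p and q over the same finite support, computed via
# the identity TV(p, q) = (1/2) * sum_x |p(x) - q(x)|.
def tv_distance(p, q):
    assert len(p) == len(q)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Example: two densities on a 3-point support.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
d = tv_distance(p, q)
```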

There are various possible measures of dissimilarity between probability distributions which can be considered in density estimation: the Hellinger distance, Wasserstein distance, -distance, -divergence, Kullback-Leibler divergence, or any number of other divergences; see Sason and Verdú [25] for a survey on many such functions and the relations between them. Here, we focus on the TV-distance due to its several appealing properties, such as being a metric, enjoying the natural probabilistic interpretation of (1), and having the coupling characterization (3).

If is a density estimate, we define the *risk* of the estimator with respect to the class as

where the expectation is over the i.i.d. samples from , and
possible randomization of the estimator. From now on we will omit the
dependence of on unless it is not
obvious. The *minimax risk* or *minimax rate* for is
the smallest risk over all possible density estimates,
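In generic notation (the paper's own symbols are elided in this extraction), the risk of an estimate and the minimax rate would take the standard form:

```latex
% Risk of an estimate \hat f_n over the class \mathcal{F}, and the
% minimax rate: the smallest worst-case risk over all estimates.
\mathcal{R}_n\bigl(\hat f_n, \mathcal{F}\bigr)
  \;=\; \sup_{f \in \mathcal{F}} \mathbb{E}\,\bigl\| \hat f_n - f \bigr\|_1,
\qquad
\mathcal{R}_n(\mathcal{F})
  \;=\; \inf_{\hat f_n} \, \mathcal{R}_n\bigl(\hat f_n, \mathcal{F}\bigr).
```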

We can now state our results precisely. Let and let be the class of non-increasing densities on , that is, the set of all probability vectors for which

(4)

monotone-result For sufficiently large ,

Let be the class of all non-increasing convex densities on , so each satisfies (4) and

convex-result For sufficiently large ,

We emphasize here that the above results give upper and lower bounds on the minimax rates and which are within universal constant factors of one another, for the entire range of and .

Our upper bounds will crucially rely on the next results, which allow
us to relate the minimax rate of a class to an old and well-studied
combinatorial quantity called the *Vapnik-Chervonenkis (VC)
dimension* [28]: For a family of
subsets of , the VC-dimension of , denoted by , is
the size of the largest set such that for every
, there exists such that . See the book of Devroye and Lugosi [13] for
examples and applications of the VC-dimension in the study of density
estimation.

[Devroye, Lugosi [13]]minimum-distance-risk
Let be a class of densities supported on , and let
be a class
of density estimates satisfying
for every
. Let be the *Yatracos class*
of ,

For , let be the probability measure corresponding to . Let also be the empirical measure based on , where

Then, there is an estimate for which

The estimate in minimum-distance-risk is called the
*minimum distance estimate* in [13]—we omit the
details of its construction, though we emphasize that if computing
takes one unit of computation for any
and , then selecting takes time polynomial in the size of
, which is often exponential in the quantities of interest.
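To make this cost concrete, here is a minimal Python sketch of selecting among an explicit finite list of candidate densities by comparing candidate and empirical measures on the pairwise Yatracos sets; the function name and the brute-force enumeration are our own simplifications, not the exact construction of [13]:

```python
from itertools import permutations

def minimum_distance_select(candidates, sample):
    """Select from `candidates` (discrete densities on {0, ..., k-1}, given
    as probability vectors) the one whose measure best matches the empirical
    measure on every Yatracos set A = {x : g(x) > h(x)} for ordered pairs
    (g, h) of candidates.  A sketch of a minimum-distance-style selection;
    quadratic in the number of candidates."""
    n = len(sample)
    k = len(candidates[0])
    emp = [sum(1 for s in sample if s == x) / n for x in range(k)]
    best, best_score = None, float("inf")
    for f in candidates:
        score = 0.0
        for g, h in permutations(candidates, 2):
            A = [x for x in range(k) if g[x] > h[x]]
            diff = abs(sum(f[x] for x in A) - sum(emp[x] for x in A))
            score = max(score, diff)
        if score < best_score:
            best, best_score = f, score
    return best
```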

[Devroye, Lugosi [13]]tv-emp Let be as in minimum-distance-risk, and let . Then, there is a universal constant for which

The quantity in tv-emp is precisely equal to if is the Borel -algebra on .

minimum-distance-risk Let be as in minimum-distance-risk. Then, there is a universal constant for which

## 3 Non-increasing densities

monotone

This section is devoted to presenting a proof of the upper bound of monotone-result. The lower bound is proved in monotone-lower using a careful but standard application of Assouad’s Lemma [2]. Part of our analysis in proving monotone-result will involve the development of an explicit efficient estimator for a density in .

### 3.1 A greedy tree-based estimator

Suppose that is a power of two. This assumption can only, at worst, smudge some constant factors in the final minimax rate. Using the samples , we will recursively construct a rooted ordered binary tree which determines a partition of the interval , from which we can build a histogram estimate for . Specifically, let be the root of , where . We say that covers the interval . Then, for every node in covering the interval

we first check if , and if so we make a leaf in . Otherwise, if

are the first and second halves of , we verify the condition

(5)

where are the number of samples which fall into the intervals ,

The inequality (5) is referred to as the
*greedy splitting rule*. If (5) is
satisfied, then we create nodes covering and
respectively, and add them to as left and right children
of . If not, we make a leaf in .

After applying this procedure, one obtains a (random) tree with leaves , and the set forms a partition of the support . Let be the histogram estimate based on this partition,

The density estimate will be called the *greedy
tree-based estimator*. See greedy for a typical plot of
, and a visualization of the tree .
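A minimal Python sketch of this recursive construction follows. Since the exact threshold in the greedy splitting rule is elided above, we use a stand-in threshold at the standard-deviation scale suggested by the intuition discussed below; the constant `c` and the base case are our own assumptions:

```python
import math

def greedy_histogram(sample, k, c=1.0):
    """Recursive dyadic histogram for data on {0, ..., k-1}, k a power of
    two.  Stand-in splitting rule: split an interval when the count in its
    left half exceeds the count in its right half by more than
    c * sqrt(total count) -- the standard-deviation scale; the paper's
    exact rule (5) is not reproduced here."""
    n = len(sample)

    def counts(lo, hi):
        return sum(1 for s in sample if lo <= s < hi)

    def build(lo, hi, parts):
        if hi - lo == 1:              # singleton interval: always a leaf
            parts.append((lo, hi))
            return
        mid = (lo + hi) // 2
        yl, yr = counts(lo, mid), counts(mid, hi)
        if yl - yr > c * math.sqrt(yl + yr):   # stand-in splitting rule
            build(lo, mid, parts)
            build(mid, hi, parts)
        else:
            parts.append((lo, hi))

    parts = []
    build(0, k, parts)
    # Histogram: constant on each part, weight proportional to its count.
    estimate = [0.0] * k
    for lo, hi in parts:
        w = counts(lo, hi) / (n * (hi - lo))
        for x in range(lo, hi):
            estimate[x] = w
    return parts, estimate
```

Note that the returned values form a valid probability vector: the weights on each part sum to the empirical frequency of that part.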

Intuitively, we justify the rule (5) as follows: we expect that is at least as large as by monotonicity of the density , and the larger the difference , the finer a partition around and should be to minimize the error of a piecewise constant estimate of . However, even if and were equal in expectation, we expect that with positive probability may deviate from by the order of a standard deviation, on the order of , and this determines the threshold for splitting.

One could argue that any good estimate of a non-increasing density should itself be non-increasing, and the estimate does not have this property. This can be rectified using a method of Birgé [7], who described a transformation of piecewise-constant density estimates which does not increase the risk with respect to non-increasing densities. Specifically, suppose that the estimate is not non-increasing. Then, there are consecutive intervals , such that has constant value on and on , and . Let the transformed estimate be constant on , with value

the average value of on . Iterate the above transformation until a non-increasing estimate is obtained. It can be proven that this results in a unique estimate , regardless of the order of merged intervals, and that
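The merging transformation just described can be sketched as a pool-adjacent-violators style pass over the constant pieces; the representation of the estimate as interval/value lists is our own:

```python
def monotonize(parts, values):
    """Birgé-style transformation: while some pair of consecutive pieces
    violates monotonicity (value on the left piece smaller than on the
    right), merge them into one piece carrying the average value of the
    estimate there.  `parts` is a list of (lo, hi) intervals and `values`
    the constant value of the estimate on each piece."""
    stack = []  # merged pieces so far, as [lo, hi, value]
    for (lo, hi), v in zip(parts, values):
        stack.append([lo, hi, v])
        # Merge while the estimate increases from one piece to the next.
        while len(stack) >= 2 and stack[-2][2] < stack[-1][2]:
            l2, h2, v2 = stack.pop()
            l1, h1, v1 = stack.pop()
            avg = (v1 * (h1 - l1) + v2 * (h2 - l2)) / (h2 - l1)
            stack.append([l1, h2, avg])
    return [(l, h) for l, h, _ in stack], [v for _, _, v in stack]
```

A single left-to-right pass with a stack suffices because each merge can only create a new violation with the piece immediately to its left, which the inner loop resolves before moving on.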

### 3.2 An idealized tree-based estimator

ideal

Instead of analyzing the greedy tree-based estimator of the preceding section, we will fully analyze an idealized version. Indeed, in (LABEL:greedy-splitting-rule), the quantities are distributed as for , where we define

If we replace the quantities in (5) with
their expectations, we obtain the *idealized splitting rule*

(6)

where we note that , since is non-increasing. Using the same procedure as in the preceding section, replacing the splitting rule with (6), we obtain a deterministic tree with leaves , and we set

is constant and equal to the average value of on each
interval for . We call the *idealized
tree-based estimate*. See ideal for a visualization of
and . It may be instructive to compare ideal to
greedy.

Of course, and both depend intimately upon knowledge of the density ; in practice, we only have access to the samples , and the density itself is unknown. In particular, we cannot practically use as an estimate for unknown . Importantly, as we will soon show, we can still use along with minimum-distance-risk to get a minimax rate upper bound for .

id-tv

###### Proof.

Writing out the TV-distance explicitly, we have

Let , and define . If , then , so assume that . In this case, let and be the left and right halves of the interval . In order to compute , we refer to biasarea. We view as the positive area between the curve and the line ; in biasarea, this is the patterned area. Let and be the average value of on and respectively. Then, is the positive area between and on , which is represented as the gray area on in biasarea, and is the positive area between and on , the gray area on in biasarea. The points are the unique points for which

Since the gray area on to the left of is equal to the gray area on to the right of , and the gray area on to the left of is equal to the gray area on to the right of , then

Then

where this last line follows from the splitting rule (6), since and . So,

where the last line follows from the Cauchy-Schwarz inequality. ∎

id-num-leaves If and , then

###### Proof.

Note that has height at most . Let be the set of nodes at depth in which have at least one leaf as a child, for , and label the children of the nodes in in order of appearance from right to left in as . Since none of the nodes in are themselves leaves, then by (6),

and in particular since , then so that . In general,

and this recurrence relation can be solved to obtain that

(7)

Let be the set of leaves at level in . The leaves at level form a subsequence of . Write for the total probability mass of held in the leaves ,

By (7),

so that

Summing over all leaves except on the last level, and using the facts that and ,

By Hölder’s inequality,

so finally

###### Proof of the upper bound in monotone-result.

The case is trivial, and follows simply because the TV-distance is always upper bounded by 1.

Suppose next that . In this regime, we can use a histogram estimator for with bins of size for each element of . It is well known that the risk of this estimator is on the order of [13].

Finally, suppose that . Let be the class of all piecewise-constant probability densities on which have parts; in particular, . Let be the Yatracos class of ,

Then, , where is the class of all unions of intervals in . It is well known that , so . By minimum-distance-risk and id-tv, there are universal constants for which

By id-num-leaves, we see that for sufficiently large , there is a universal constant such that

## 4 Non-increasing convex densities

convex

Recall that is the class of non-increasing convex densities supported on . Then, forms a subclass of , which we considered in monotone. This section is devoted to extending the techniques of monotone in order to obtain a minimax rate upper bound on . Again, the lower bound is proved using standard techniques in convex-lower.
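For concreteness, membership in this discrete class can be checked directly from the successive differences of the probability vector; this is a generic sketch (normalization to total mass 1 is not checked, and the vector entries are our own stand-in representation):

```python
def is_nonincreasing_convex(p):
    """Check the shape constraints of the class: p is non-increasing and
    convex on its discrete support, i.e. the successive differences
    p[x+1] - p[x] are non-positive (monotonicity) and non-decreasing
    (convexity).  Normalization is deliberately not checked."""
    diffs = [b - a for a, b in zip(p, p[1:])]
    return all(d <= 0 for d in diffs) and all(
        d2 >= d1 for d1, d2 in zip(diffs, diffs[1:])
    )
```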

In this section, we assume that is a power of three. In order to prove the upper bound of convex-result, we construct a ternary tree just as in monotone, now with a ternary splitting rule, where if a node has children in order from left to right, we split and recurse if

(8)

Here we obtain a tree with leaves . If has children from left to right, let be the midpoint of for . Let the estimate on be composed of the secant line passing through the points and . The estimate is again called the *idealized tree-based estimate* for .
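Since the exact pairing of midpoints in the secant construction is elided above, the following sketch only illustrates the basic building block: a secant line through two given midpoint/average-value pairs.

```python
def secant_line(p1, p2):
    """Linear function through p1 = (x1, y1) and p2 = (x2, y2); a building
    block for a piecewise-linear estimate whose pieces are secant lines
    through midpoint/average-value pairs.  Which pairs are used on which
    child interval is elided in the text, so this is illustrative only."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)
```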
conv-id-tv

Before proving this, we first note that by convexity of , the slope of the secant line passing through and is at least the slope of the secant line passing through and . Equivalently,

###### Proof of conv-id-tv.

As in the proof of id-tv, we have

for . We refer to convexarea for a visualization of the quantity , which is depicted as the patterned area.

Let , , and denote the gray area on , , and respectively. It can be shown that

and

whence

The result then follows from the splitting rule (8) and the Cauchy-Schwarz inequality. ∎

If and , then

###### Proof.

The tree has height at most . Let be the set of nodes at depth in with at least one leaf as a child, for , labelled in order of appearance from right to left in as . By the convex splitting rule (8), and since is non-increasing,

so in particular, , and . In general,

(9)

We claim now that , which we prove by induction; the base case is shown above, and by the induction hypothesis,

for all , while the cases can be manually verified. Then, by monotonicity of ,

(10)

Let now be the set of leaves at level in . The leaves at level form a subsequence of . Let be the total probability mass of held in the leaves . By (10),

so that

Summing over all leaves except on the last level,
