\magnification=\magstep1
\hsize=16truecm
\input amstex
\TagsOnRight
\parindent=20pt
\parskip=2pt plus 1.7pt
\font\script =cmcsc10
\centerline{\bf ON THE ESTIMATION OF MULTIPLE RANDOM INTEGRALS}
\vskip2pt
\centerline{\bf AND $U$-STATISTICS}
\smallskip
\centerline{\it P\'eter Major}
\centerline{\it Alfr\'ed R\'enyi Mathematical Institute of the
Hungarian Academy of Sciences}
\beginsection 1. Introduction.
First I briefly describe the main subject of this work.
Fix a positive integer $n$, consider $n$ independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ on a measurable
space $(X,{\Cal X})$ with some distribution $\mu$ and take their
empirical distribution $\mu_n$ together with its normalization
$\sqrt n(\mu_n-\mu)$. Beside this, take a function $f(x_1,\dots,x_k)$
of $k$ variables on the $k$-fold product $(X^k,{\Cal X}^k)$ of the
space $(X,{\Cal X})$, introduce the $k$-th power of the normalized
empirical measure $\sqrt n(\mu_n-\mu)$ on $(X^k,{\Cal X}^k)$ and
define the integral of the function $f$ with respect to this
signed product measure. This integral is a random variable, and we
want to give a good estimate on its tail distribution. More precisely,
we take the integrals not on the whole space: the diagonals
$x_s=x_{s'}$, $1\le s,s'\le k$, $s\neq s'$, of the space $X^k$ are
omitted from the domain of integration. Such a modification of the
integral seems to be natural.
We shall also be interested in the following generalized version of
the above problem. Let us have a nice class of functions ${\Cal F}$
of $k$ variables on the product space $(X^k,{\Cal X}^k)$, and consider
the integrals of all functions in this class with respect to the
$k$-fold direct product of our normalized empirical measure. Give a
good estimate on the tail distribution of the supremum of these
integrals.
It may be asked why the above problems deserve a closer study. I
found them important because they may help in solving some essential
problems in probability theory and mathematical statistics. I met
such problems when I tried to adapt the method of the proof of the
Gaussian limit behaviour of the maximum likelihood estimate to some
similar but more difficult questions. In the original problem the
asymptotic behaviour of the solution of the so-called maximum
likelihood equation has to be investigated. The study of this
problem is hard in its original form. But by applying an appropriate
Taylor expansion of the function that appears in this equation and
throwing away its higher order terms we get an approximation whose
behaviour can be simply understood. So to describe the limit
behaviour of the maximum likelihood estimate it suffices to show
that this approximation causes only a negligible error.
One would try to apply a similar method in the study of more
difficult questions. I met some non-parametric maximum likelihood
problems, for instance the description of the limit behaviour of
the so-called Kaplan--Meier product limit estimate, where such an
approach could be applied. But in these problems it was harder
to show that the simplifying approximation causes only a
negligible error. In this case the solution of the above
mentioned problems was needed. In the non-parametric maximum
likelihood estimate problems I met, the estimation of multiple
(random) integrals played a role similar to the estimation of
the coefficients in the Taylor expansion in the study of maximum
likelihood estimates. Although I could apply this approach only
in some special cases, I believe that it works in very general
situations. But it demands some further work to show this.
The above formulated problems about random integrals are interesting
and non-trivial even in the special case $k=1$. Their solution
leads to an interesting and non-trivial generalization
of the fundamental theorem of mathematical statistics about
the difference between the empirical and the true distribution of a
large sample.
These problems have a natural counterpart about the behaviour of
so-called $U$-statistics, a fairly popular subject in probability
theory. The investigations of multiple random integrals and of
$U$-statistics are closely related, and it has turned out to be
useful to consider them simultaneously.
Let us try to get some feeling about what kind of results can be
expected in these problems. For a large sample size $n$ the
normalized empirical measure $\sqrt n(\mu_n-\mu)$ behaves similarly
to a Gaussian random measure. This suggests that in the problems we
are interested in, results similar to those about multiple Gaussian
integrals (called Wiener--It\^o integrals in the literature) should
hold. We may expect that the tail behaviour of the
distribution of a $k$-fold random integral with respect to a
normalized empirical measure is similar to that of the $k$-th
power of a Gaussian random variable with expectation zero and an
appropriate variance. Beside this, if we consider the supremum of
multiple random integrals of a class of functions with respect to
a normalized empirical measure or with respect to a Gaussian
random measure, then we expect that under not too restrictive
conditions this supremum is not much larger than the `worst'
random integral with the largest variance taking part in this
supremum. We may also hope that the methods of the theory of
multiple Gaussian integrals can be adapted to the investigation
of our problems.
The above presented heuristic considerations supply a fairly good
description of the situation, but they do not take into account a
very essential difference between the behaviour of multiple
Gaussian integrals and multiple integrals with respect to a
normalized empirical measure. If the variance of a multiple
integral with respect to a normalized empirical measure is very
small, which turns out to be equivalent to a very small $L_2$-norm
of the function we are integrating, then the behaviour of this
integral is different from that of a multiple Gaussian integral
with the same kernel function. In this case the effect of some
irregularities of the normalized empirical distribution turns
out to be non-negligible, and no good Gaussian approximation
holds any longer. This case must be better understood, and some
new methods have to be worked out to handle it.
The precise formulation of the results will be given in the
main part of the work. Besides their proofs I also tried to explain
the main ideas behind them and the notions introduced in their
investigation. This work contains some new results, and also the
proof of some already rather classical theorems is presented.
The results about Gaussian random variables and their non-linear
functionals, in particular multiple integrals with respect to a
Gaussian field, have a most important role in the study of the
present work. Hence they will be discussed in detail together
with some of their counterparts about multiple random integrals
with respect to a normalized empirical measure and some results
about $U$-statistics.
The proofs apply results from different parts of probability
theory. Papers investigating similar results refer to works dealing
with quite different subjects, and this makes their reading rather
hard. To overcome this difficulty I tried to work out the details
and to present a self-contained discussion even at the price of a
longer text. Thus I wrote down (in the main text or in the Appendix)
the proof of many interesting and basic results, like results about
Vapnik--\v{C}ervonenkis classes, about $U$-statistics and their
decomposition into sums of so-called degenerate $U$-statistics, about
so-called decoupled $U$-statistics and their relation to ordinary
$U$-statistics, the diagram formula about the product of
Wiener--It\^o integrals, their counterpart about the product of
degenerate $U$-statistics, etc. I tried to give an exposition in
which the different parts of the problem are explained independently
of each other, so that each can be understood on its own.
An earlier version of this work was presented at the probability
seminar of the University of Debrecen (Hungary).
\beginsection 2. Motivation of the investigation. Discussion of
some problems.
In this section I try to show by means of some examples why the
solution of the problems mentioned in the introduction may be
useful in the study of some important problems of probability
theory. I try to give a good picture of the main ideas, but I
do not work out all details. Actually, the elaboration of some
details omitted from this discussion would demand hard work. But
as the present section is quite independent of the rest of the
paper, these omissions cause no problem in understanding the
subsequent part.
I start with a short discussion of the maximum likelihood
estimate in the simplest case. The following problem is considered.
Let us have a class of density functions $f(x,\vartheta)$ on the
real line depending on a parameter $\vartheta\in R^1$, and
observe a sequence of independent random variables
$\xi_1(\omega),\dots,\xi_n(\omega)$ with a density function
$f(x,\vartheta_0)$, where $\vartheta_0$ is an unknown parameter
we want to estimate with the help of the above sequence of random
variables.
The maximum likelihood method suggests the following approach. Choose
as the estimate of the parameter $\vartheta_0$ that value
$\hat\vartheta_n =\hat\vartheta_n(\xi_1,\dots,\xi_n)$ at which the
density function of the random vector $(\xi_1,\dots,\xi_n)$, i.e.\ the
product
$$
\prod_{k=1}^n f(\xi_k,\vartheta)=\exp\left\{\sum_{k=1}^n\log
f(\xi_k,\vartheta)\right\}
$$
takes its maximum. This point can be found as the solution of the
so-called maximum likelihood equation
$$
\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\vartheta)=0. \tag2.1
$$
We are interested in the asymptotic behaviour of the random variable
$\hat\vartheta_n-\vartheta_0$, where $\hat\vartheta_n$ is the
(appropriate) solution of the equation~(2.1).
The direct study of this equation is rather hard, but a Taylor
expansion of the expression at the left-hand side of~(2.1) around
the (unknown) point $\vartheta_0$ yields a good and simple
approximation of $\hat\vartheta_n$, and it enables us to describe
the asymptotic behaviour of $\hat\vartheta_n-\vartheta_0$.
This Taylor expansion yields that
$$ \allowdisplaybreaks
\align
&\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\hat\vartheta_n)=
\sum_{k=1}^n\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}\\
&\qquad\qquad+(\hat\vartheta_n-\vartheta_0)
\left(\sum_{k=1}^n\left(\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{\left(\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)\right)^2}
{f^2(\xi_k,\vartheta_0)} \right)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right) \\
&=\qquad \sum_{k=1}^n
\left(\eta_k+\zeta_k(\hat\vartheta_n-\vartheta_0)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right), \tag2.2
\endalign
$$
where
$$
\eta_k=\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}\quad \text{and}\quad
\zeta_k=
\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{ \left( \frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)\right)^2}{f^2(\xi_k,\vartheta_0)}
$$
for $k=1,\dots,n$. We want to understand the asymptotic behaviour
of the (random) expression on the right-hand side of~(2.2). The
relation
$$
E\eta_k=\int\frac{\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)}{f(x,\vartheta_0)}f(x,\vartheta_0)\,dx
=\frac{\partial}{\partial\vartheta}\int
f(x,\vartheta_0)\,dx=0
$$
holds, since $\int f(x,\vartheta)\,dx=1$ for all $\vartheta$,
and a differentiation of this relation gives the last identity.
Similarly,
$E\eta^2_k=-E\zeta_k
=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$, \
$k=1,\dots,n$. Hence by the central limit theorem
$\chi_n=\frac{1}{\sqrt n}\sum\limits_{k=1}^n\eta_k$
is asymptotically normal with expectation zero and variance
$I^2=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$.
In the statistics literature this number $I$ is called the Fisher
information. By the law of large numbers
$\frac{1}{n}\sum\limits_{k=1}^n\zeta_k\sim -I^2$.
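The identity $E\zeta_k=-I^2$ used here can be checked the same way as the relation $E\eta_k=0$: differentiating the identity $\int f(x,\vartheta)\,dx=1$ twice with respect to $\vartheta$ (assuming, as usual, that the differentiation may be carried out under the integral sign) yields

```latex
$$
E\zeta_k=\int\frac{\frac{\partial^2}{\partial\vartheta^2}
f(x,\vartheta_0)}{f(x,\vartheta_0)}f(x,\vartheta_0)\,dx
-\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx
=\frac{\partial^2}{\partial\vartheta^2}\int f(x,\vartheta_0)\,dx
-I^2=-I^2.
$$
```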
Thus relation (2.2) suggests the approximation $\tilde\vartheta_n=
-\frac{\sum\limits_{k=1}^n\eta_k}{\sum\limits_{k=1}^n\zeta_k}$ of the
maximum-likelihood estimate $\hat\vartheta_n$, and $\sqrt
n(\tilde\vartheta_n-\vartheta_0)$ is asymptotically normal with
expectation zero and variance~$\frac1{I^2}$. The random variable
$\tilde\vartheta_n$ is not a solution of the equation (2.1);
the value of the expression on the left-hand side at this point is
of order $O(n(\tilde\vartheta_n-\vartheta_0)^2)=O(1)$. On
the other hand, the derivative of the function on the left-hand
side is large at this point: it is greater than
$\text{const.}\, n$ with some $\text{const.}>0$. This implies
that the maximum-likelihood equation has a solution
$\hat\vartheta_n$ such that
$\hat\vartheta_n-\tilde\vartheta_n=O\left(\frac1n\right)$. Hence
$\sqrt n(\hat\vartheta_n-\vartheta_0)$ and
$\sqrt n(\tilde\vartheta_n-\vartheta_0)$ have the same asymptotic
limit behaviour.
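As a numerical illustration of this asymptotic statement (a sketch only; the exponential model and all parameter values below are my choices for concreteness and do not come from the discussion above), one can check by simulation that $\sqrt n(\hat\vartheta_n-\vartheta_0)$ is approximately normal with variance $1/I^2$. For the density $f(x,\vartheta)=\vartheta e^{-\vartheta x}$, $x>0$, the maximum likelihood equation (2.1) has the explicit solution $\hat\vartheta_n=1/\bar\xi_n$, and the Fisher information is $I^2=1/\vartheta^2$:

```python
import math
import random
import statistics

random.seed(0)
theta0 = 2.0   # true parameter (illustrative choice)
n = 2000       # sample size
reps = 400     # number of repeated experiments

# For f(x, theta) = theta * exp(-theta * x) the maximum likelihood
# equation has the explicit solution theta_hat = 1 / (sample mean),
# and the Fisher information is I^2 = 1 / theta^2.  Hence
# sqrt(n) * (theta_hat - theta0) should be roughly N(0, theta0^2).
normalized = []
for _ in range(reps):
    sample = [random.expovariate(theta0) for _ in range(n)]
    theta_hat = 1.0 / statistics.mean(sample)
    normalized.append(math.sqrt(n) * (theta_hat - theta0))

print(round(statistics.mean(normalized), 2))   # should be close to 0
print(round(statistics.stdev(normalized), 2))  # should be close to theta0
```

The empirical standard deviation of the normalized estimates is close to $1/I=\vartheta_0$, in accordance with the limit theorem.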
The previous method can be summarized in the following way:
Take a simpler linearized version of the expression we want to
estimate by means of an appropriate Taylor expansion, describe the
limit distribution of this linearized version and show that the
linearization causes only a negligible error.
We want to show that such a method also works in more difficult
situations. But in some cases it is harder to show that the error
committed by a replacement of the original expression by a simpler
linearized version is negligible, and to show this the solution of
the problems mentioned in the introduction is needed. The following
problem, the Kaplan--Meier method for the estimation of a
distribution function with the help of censored data, provides such
an example.
The following problem is considered. Let $(X_i,Z_i)$, $i=1,\dots,n$,
be a sequence of independent, identically distributed random vectors
such that the components $X_i$ and $Z_i$ are also independent with
some unknown distribution functions $F(x)$ and $G(x)$. We want to
estimate the distribution function $F$ of the random variables $X_i$,
but we cannot observe the variables $X_i$, only the random variables
$Y_i=\min(X_i,Z_i)$ and $\delta_i=I(X_i\leq Z_i)$. In other words, we
want to solve the following problem. There are certain objects whose
lifetimes $X_i$ are independent and $F$ distributed. But we cannot
always observe the lifetime $X_i$, because after a time $Z_i$ the
observation must be stopped. We also know whether the real lifetime
$X_i$ or the censoring variable $Z_i$ was observed. We make $n$
independent experiments and want to estimate with their help the
distribution function~$F$.
Kaplan and Meier, on the basis of some maximum-likelihood estimation
type considerations, proposed the following so-called product limit
estimator $S_n(u)$ to estimate the unknown survival function
$S(u)=1-F(u)$:
$$
1-F_n(u)=S_n(u)=\left\{
\alignedat2
&\prod_{i=1}^n\left(\frac{N(Y_i)}{N(Y_i)+1}\right)^{I(Y_i\leq u,
\delta_i=1)} && \text{ if }u\leq\max(Y_1,\dots,Y_n),\\
&0&& \text{ if } u>\max(Y_1,\dots,Y_n),\;\delta_n =1,\\
&\text{undefined} &&\text{ if }u>\max(Y_1,\dots,Y_n),\;\delta_n
=0, \endalignedat\right. \tag2.3
$$
where
$$
N(t)=\#\{i:\;Y_i>t,\;1\le i \le n\}=\sum_{i=1}^n I(Y_i>t).
$$
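Formula (2.3) is easy to evaluate on simulated data. The following sketch (an illustration only; the exponential lifetime and censoring distributions are my choices, not part of the text) computes $S_n(u)$ for a value $u$ below the largest observation and compares it with the true survival function:

```python
import math
import random

def product_limit(y, delta, u):
    """Kaplan--Meier product limit estimate S_n(u) of formula (2.3),
    for u <= max(y): the product over the uncensored observations
    Y_i <= u of N(Y_i) / (N(Y_i) + 1), where N(t) = #{j : Y_j > t}."""
    s = 1.0
    for y_i, d_i in zip(y, delta):
        if d_i == 1 and y_i <= u:
            n_t = sum(1 for y_j in y if y_j > y_i)
            s *= n_t / (n_t + 1)
    return s

random.seed(1)
n = 2000
x = [random.expovariate(1.0) for _ in range(n)]  # lifetimes, S(u) = e^{-u}
z = [random.expovariate(0.5) for _ in range(n)]  # censoring times
y = [min(a, b) for a, b in zip(x, z)]
delta = [1 if a <= b else 0 for a, b in zip(x, z)]

# the estimate should be close to the true survival function e^{-u}
print(round(product_limit(y, delta, 1.0), 2), round(math.exp(-1.0), 2))
```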
We want to show that the above estimate (2.3) is really good.
For this goal we shall approximate the random variables $S_n(u)$ by
some appropriate random variables. To do this, first we introduce some
notation.
Put
$$
\aligned
H(u) &=P(Y_i\leq u)=1-\bar H(u),\\
\tilde H(u) &=P(Y_i\leq u,\,\delta_i=1),\quad
\tilde{\tilde H}(u)=P(Y_i\leq u,\,\delta_i =0)
\endaligned \tag2.4
$$
and
$$ \allowdisplaybreaks
\align
H_n(u) &=\frac{1}{n}\sum_{i=1}^n I(Y_i\leq u)\\
\tilde H_n(u) &=\frac1n \sum_{i=1}^n I(Y_i\leq u,\,\delta_i=1),
\quad \tilde{\tilde H}_n(u)
=\frac{1}{n}\sum_{i=1}^n I(Y_i\leq u,\,\delta_i=0). \tag2.5
\endalign
$$
Clearly $H(u)=\tilde H(u)+\tilde{\tilde H}(u)$ and
$ H_n(u)=\tilde H_n(u)+\tilde{\tilde H}_n(u)$.
We shall estimate $F_n(u)-F(u)$ for $u\in(-\infty, T]$ if
$$
1-H(T)>\delta \quad \text{with some fixed } \delta>0. \tag2.6
$$
Condition (2.6) implies that there are more than $\frac\delta2n$
sample points $Y_j$ larger than~$T$ with probability almost 1. The
complementary event has only an exponentially small probability.
This observation helps to show in the subsequent calculations that
some events have negligibly small probability.
We introduce the so-called cumulative hazard function and its
empirical version
$$
\Lambda(u)=-\log(1-F(u)), \quad \Lambda_n(u)=-\log(1-F_n(u)). \tag2.7
$$
Since $F_n(u)-F(u)=\exp(-\Lambda(u))
\left(1-\exp(\Lambda(u)-\Lambda_n(u))\right)$,
a simple Taylor expansion yields
$$
F_n(u)-F(u)=(1-F(u))\left(\Lambda_n(u)-\Lambda(u)\right)+R_1(u),\tag2.8
$$
and it is easy to see that
$R_1(u)=O\left((\Lambda(u)-\Lambda_n(u))^2\right)$.
It follows from the subsequent estimations that
$\Lambda(u)-\Lambda_n(u)=O(n^{-1/2})$, thus $nR_1(u)=O(1)$. Hence it
is enough to investigate the term $\Lambda_n(u)$. We shall show that
$\Lambda_n(u)$ has an expansion with $\Lambda(u)$ as the main term
plus $n^{-1/2}$ times a term which is a linear functional of an
appropriate normalized empirical distribution function plus an error
term of order $O(n^{-1})$.
From~(2.3) it is obvious that
$$
\Lambda_n(u)=-\sum_{i=1}^n I(Y_i\leq u, \, \delta_i=1)
\log\left(1-\frac{1}{1+N(Y_i)}\right).
$$
It is not difficult to get rid of the unpleasant logarithmic function
in this formula by means of the relation $-\log(1-x)=x+O(x^2)$ for
small~$x$. It yields that
$$
\Lambda_n(u)=\sum_{i=1}^n \frac{I(Y_i\leq u, \,\delta_i=1)}{N(Y_i)}
+R_2(u)=\tilde{\Lambda}_n(u)+R_2(u) \tag2.9
$$
with an error term $R_2(u)$ such that $nR_2(u)$ is smaller than a
constant with probability almost one. (The probability of the
exceptional set is exponentially small.)
The expression $\tilde{\Lambda}_n(u)$ is still inappropriate for our
purposes. Since the denominators $N(Y_i)=\sum\limits_{j=1}^n I(Y_j>Y_i)$
are dependent for different indices~$i$, we cannot directly see the
limit behaviour of $\tilde{\Lambda}_n(u)$.
We try to approximate $\tilde{\Lambda}_n(u)$ by a simpler
expression. A natural approach would be to approximate the terms
$N(Y_i)$ in it by their conditional expectation $(n-1)\bar
H(Y_i)=(n-1)(1-H(Y_i))=E(N(Y_i)|Y_i)$ with respect to the
$\sigma$-algebra generated by the random variable~$Y_i$.
This is too rough a `first order' approximation, but the following
`second order' approximation will be sufficient for our goals. Put
$$
N(Y_i)=\sum_{j=1}^n I(Y_j>Y_i)=n\bar H(Y_i) \left(1+
\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}\right)
$$
and express the terms $\frac1{N(Y_i)}$ in the sum defining
$\tilde \Lambda_n$ (with $\tilde\Lambda_n$ introduced in~(2.9))
by means of the relation
$\frac1{1+z}=\sum\limits_{k=0}^\infty (-1)^kz^k=1-z+\varepsilon(z)$
with the choice
$z=\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}$. As
$|\varepsilon(z)|<2z^2$ for $|z|<\frac{1}{2}$ we get that
$$
\align
\tilde{\Lambda}_n(u)&
=\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1+\sum_{k=1}^\infty\left(- \frac{\sum\limits_{j=1}^n
I(Y_j>Y_i)-n\bar H(Y_i)} {n\bar H(Y_i)}\right)^k\right)\\
&=\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1-\frac{\sum\limits_{j=1}^n I(Y_j>Y_i)-n\bar H(Y_i)}
{n\bar H(Y_i)}\right)+R_3(u) \tag2.10 \\
&=2A(u)-B(u)+R_3(u),
\endalign
$$
where
$$
A(u)=A(n,u)=\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}{n\bar H(Y_i)}
$$
and
$$
B(u)=B(n,u)=\sum_{i=1}^n \sum_{j=1}^n\frac
{I(Y_i\leq u,\,\delta_i=1)I(Y_j>Y_i)}{n^2\bar H^2(Y_i)}.
$$
It can be proved by means of standard methods that $nR_3(u)$ is
bounded by a constant with probability almost one; the probability of
the exceptional set is exponentially small. Thus relations~(2.9)
and~(2.10) yield that
$$
\Lambda_n(u)=2A(u)-B(u)+\text{negligible error.}\tag2.11
$$
This means that to solve our problem the asymptotic behaviour of the
random variables $A(u)$ and $B(u)$ has to be given. We can get a
better insight into this problem by rewriting the sum $A(u)$ as an
integral and the double sum $B(u)$ as a two-fold integral with
respect to empirical measures. Then these integrals can be rewritten
as sums of random integrals with respect to normalized empirical
measures and deterministic measures. Such an approach yields a
representation of $\Lambda_n(u)$ in the form of a sum whose terms
can be well understood.
Let us write
$$
\align
A(u)&=\int_{-\infty}^{+\infty}\frac{I(y\leq u)}{1-H(y)}\,d\tilde
H_n(y),\\
B(u) &=\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}
\frac{I(y\leq u)I(x>y)}{\left(1-H(y)\right)^2}\,dH_n(x) d\tilde H_n(y).
\endalign
$$
We rewrite the terms $A(u)$ and $B(u)$ in a form better for our
purposes. We express these terms as a sum of integrals with respect
to $dH(u)$, $d\tilde H(u)$ and the normalized empirical processes
$d\sqrt n(H_n(x)-H(x))$ and $d\sqrt n(\tilde H_n(y)-\tilde H(y))$.
For this goal observe that
$$
\align
H_n(x)\tilde H_n(y)&=H(x)\tilde H(y)+H(x)(\tilde H_n(y)-\tilde H(y))
+(H_n(x)-H(x))\tilde H(y)\\
&\qquad+(H_n(x)-H(x))(\tilde H_n(y)-\tilde H(y)).
\endalign
$$
Hence it can be written that
$B(u)=B_1(u)+B_2(u)+B_3(u)+B_4(u)$, where
$$
\align
B_1(u)&=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\tilde H(y)\;,\\
B_2(u) &=\frac{1}{\sqrt n}\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\left(\sqrt n
(\tilde H_n(y)-\tilde H(y))\right),\\
B_3(u)&=\frac1{\sqrt n}\int_{-\infty}^u
\int_{-\infty}^{+\infty}\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\,d\tilde H(y)\;,\\
B_4(u)&=\frac 1n \int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\,
d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right).
\endalign
$$
In the above decomposition of $B(u)$ the term $B_1$ is a
deterministic function, $B_2$, $B_3$ are linear functionals of
normalized empirical processes and $B_4$ is a nonlinear functional
of normalized empirical processes. The deterministic term $B_1(u)$
can be calculated explicitly. Indeed,
$$
B_1(u)=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x) d\tilde H(y)=
\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}.
$$
Then the relations
$\tilde H(u)=\int_{-\infty}^u\left(1-G(t)\right)\,dF(t)$ and
$1-H = (1-F)(1-G)$ imply that
$$
B_1(u)=\int_{-\infty}^u\frac{dF(y)}{1-F(y)}=
-\log(1-F(u))=\Lambda(u).\tag2.12
$$
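The step leading to (2.12) can be spelled out in one line: since $d\tilde H(y)=(1-G(y))\,dF(y)$ and $1-H(y)=(1-F(y))(1-G(y))$,

```latex
$$
\frac{d\tilde H(y)}{1-H(y)}
=\frac{(1-G(y))\,dF(y)}{(1-F(y))(1-G(y))}
=\frac{dF(y)}{1-F(y)}.
$$
```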
Observe that
$$
\aligned
A(u) &=\int_{-\infty}^u\frac{d\,\tilde H_n(y)}{1-H(y)}\\
&=\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}+
\frac1{\sqrt n}\int_{-\infty}^u
\frac{d \left(\sqrt n (\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}\\
&=B_1(u)+B_2(u).
\endaligned\tag2.13
$$
From relations~(2.11), (2.12) and~(2.13) it follows that
$$
\Lambda_n(u)-\Lambda(u)=B_2(u)-B_3(u)-B_4(u)+\text{negligible error.}
\tag2.14
$$
Integration of $B_2$ and $B_3$ with respect to the variable $x$
and then integration by parts in the expression $B_2$ yields that
$$
\align
B_2(u)&=\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}\\
&=\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{\sqrt n(1-H(u))}-\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y)\\
B_3(u)&=\frac1{\sqrt n}\int_{-\infty}^u
\frac{\sqrt n\left(H(y)-H_n(y)\right)}
{\left(1-H(y)\right)^2}\,d\tilde H(y).
\endalign
$$
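The integration by parts behind the formula for $B_2(u)$ can be made explicit (formally, ignoring the care needed at common jump points). Writing $G_n(y)=\sqrt n\left(\tilde H_n(y)-\tilde H(y)\right)$ and using $d\frac1{1-H(y)}=\frac{dH(y)}{(1-H(y))^2}$ we get

```latex
$$
\int_{-\infty}^u\frac{dG_n(y)}{1-H(y)}
=\frac{G_n(u)}{1-H(u)}
-\int_{-\infty}^u G_n(y)\,d\frac1{1-H(y)}
=\frac{G_n(u)}{1-H(u)}
-\int_{-\infty}^u\frac{G_n(y)}{\left(1-H(y)\right)^2}\,dH(y),
$$
```

and multiplication by $\frac1{\sqrt n}$ gives the expression for $B_2(u)$ above.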
With the help of the above expressions for $B_2$ and $B_3$,~(2.14)
can be rewritten as
$$
\aligned
\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)
&=\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{1-H(u)}-\int_{-\infty}^u
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y)\\
&\qquad+\int_{-\infty}^u\frac{\sqrt n\left(H_n(y)-H(y)\right)}
{\left(1-H(y)\right)^2} \,d\tilde H(y)\\
&\qquad-\sqrt n B_4(u)+\text{negligible error.}
\endaligned\tag2.15
$$
Formula (2.15) (together with formula~(2.8)) almost agrees with
the statement we wanted to prove. Here the normalized error
$\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)$ is
expressed as a sum of linear functionals of normalized empirical
measures plus some negligible error terms plus the error term
$\sqrt nB_4(u)$. So to get a complete proof it
is enough to show that $\sqrt nB_4(u)$ also yields a negligible
error. But $B_4(u)$ is a double integral of a bounded function (here
we apply again formula (2.6)) with respect to a normalized empirical
measure. Hence to bound this term we need a good estimate of
multiple stochastic integrals (with multiplicity~2), and this is just
the problem formulated in the introduction. The estimate we need
here follows from Theorem~8.1 of the present work. Let us remark
that the problem discussed here corresponds to the estimation of the
coefficient of the second term in the Taylor expansion considered
in the study of the maximum likelihood estimation. One may worry a
little bit about how to bound $B_4(u)$ with the help of estimates of
double stochastic integrals, since in the definition of $B_4(u)$
integration is taken with respect to different normalized empirical
processes in the two coordinates. But this is a minor technical
problem. It can be simply overcome for instance by
rewriting the integral as a double integral with respect to the
empirical process
$\left(\sqrt n\left(H_n(x)-H(x)\right),
\sqrt n\left(\tilde H_n(y)-\tilde H(y)\right)\right)$
in the space $R^2$.
By working out the details of the above calculation we get that
the linear functional $B_2(u)-B_3(u)$ of normalized empirical
processes yields a good estimate on the expression $\sqrt
n(\Lambda_n(u)-\Lambda(u))$ for a fixed parameter~$u$. But we want
to prove somewhat more: we want to get an estimate uniform in the
parameter~$u$, i.e.\ to show that even the random variable
$\sup\limits_{u\le T}\left|
\sqrt n(\Lambda_n(u)-\Lambda(u))-B_2(u)+B_3(u)\right|$
is small. This can be done by making estimates uniform in the
parameter~$u$ in all steps of the above calculation. There appears
only one difficulty when trying to carry out this program. Namely,
we need an estimate on $\sup\limits_u |B_4(u)|$, i.e. we have to
bound the supremum of multiple random integrals with respect to a
normalized random measure for a nice class of kernel functions.
This can be done, but at this point the second problem mentioned
in the introduction appears. This difficulty can be overcome by
means of Theorem~8.2 of this work.
Thus the limit behaviour of the Kaplan--Meier estimate can be
described by means of an appropriate expansion. The steps of the
calculation leading to such an expansion are fairly standard; the
only hard part is the solution of the problems mentioned in the
introduction. It can be expected that such a method also works in
a much more general situation.
I finish this section with a remark Richard Gill made in a
personal conversation after my talk on this subject at a
conference. He said that this approach gave a complete proof
of the limit behaviour of this estimate, but it exploited the
explicit formula defining the Kaplan--Meier estimate. He missed the
application of an argument based on the non-parametric maximum
likelihood character of this estimate. This was a completely
justified remark, since if we do not restrict our attention to
this problem, but
try to generalize it to general non-parametric maximum likelihood
estimates, then we have to understand how the maximum likelihood
character can be exploited. I believe that this can be done, but it
demands further studies.
\beginsection 3. Some estimates about sums of independent random
variables.
We need some results about the distribution of sums of independent
random variables bounded by a constant with probability one. Later
only the results about sums of independent and identically
distributed variables will be interesting for us. But since they
can be generalized without any effort to sums of not
necessarily identically distributed random variables, the condition
about the identical distribution of the summands will be dropped.
We are interested in the question of when these
estimates give as good a bound as the central limit theorem
suggests, and what can be said otherwise.
More explicitly, the following problem will be considered: Let
$X_1,\dots,X_n$ be independent random variables, $EX_j=0$,
$\text{Var}\, X_j=\sigma_j^2$, $1\le j\le n$, and take the random sum
$S_n=\sum\limits_{j=1}^nX_j$ and its variance
$\text{Var}\, S_n=V_n^2=\sum\limits_{j=1}^n\sigma_j^2$.
We want to get a good
bound on the probability $P(S_n>u V_n)$. The central limit theorem
suggests that under general conditions an upper bound of the
order $1-\Phi(u)$ should hold for this probability, where $\Phi(u)$
denotes the standard normal distribution function. Since the
standard normal distribution function satisfies the inequality
$\left(\frac1u-\frac1{u^3}\right)
\frac{e^{-u^2/2}}{\sqrt{2\pi}} <1-\Phi(u)<
\frac1u\frac{e^{-u^2/2}}{\sqrt{2\pi}}$ for all $u>0$ it is natural
to ask when the probability $P(S_n >uV_n)$ is comparable with the
value $e^{-u^2/2}$. More generally, we shall call an upper bound of
the form $P(S_n>uV_n)\le e^{-Cu^2}$ with some constant $C>0$ a
Gaussian type estimate.
First I formulate Bernstein's inequality, which tells us for which
values of $u$ the probability $P(S_n>uV_n)$ has a Gaussian type
estimate. It supplies such an estimate if $u\le \text{const.}\, V_n$.
On the other hand, if $u$ is much larger than $V_n$, it yields a much
weaker estimate. I also present
an example which shows that in this case only a very weak improvement
of Bernstein's inequality is possible. I also discuss another result,
called Bennett's inequality, which shows that such an improvement is
possible. The main difficulties we meet in this work are closely
related to the weakness of the estimates we have for the probability
$P(S_n>uV_n)$ if $u\gg V_n$.
In the usual formulation of Bernstein's inequality a
real number~$M$ is introduced, and it is assumed that the terms in
the sum we investigate are bounded by this number. But since the
problem can be simply reduced to the special case $M=1$ I shall
consider only this special case.
\medskip\noindent
{\bf Theorem 3.1. (Bernstein's inequality).} {\it Let
$X_1,\dots,X_n$ be independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\text{Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
$$
P\left(S_n>uV_n\right)\le\exp\left\{-\frac{u^2}{2\left(1+\frac13
\frac u{V_n}\right)} \right\} \quad\text{for all }u>0. \tag3.1
$$
}
\medskip\noindent
{\it Proof of Theorem 3.1.} Let us give a good bound on the
exponential moments $Ee^{tS_n}$ for appropriate parameters
$t>0$. Since $EX_j=0$ and $E|X_j^{k+2}|\le\sigma_j^2$ for $k\ge0$ we can
write $Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k
\le 1+\frac{t^2\sigma_j^2}2\left(1+\sum\limits_{k=1}^\infty
\frac {2t^{k}}{(k+2)!}\right) \le 1+\frac{t^2\sigma_j^2}2
\left(1+\sum\limits_{k=1}^\infty 3^{-k}t^{k}\right)
= 1+\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}
\le\exp\left\{\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}\right\}$
if $0\le t<3$. Hence $Ee^{tS_n}=\prod\limits_{j=1}^n Ee^{tX_j}\le
\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3}\right\}$ for $0\le t<3$.
The above relation implies that
$$
P\left(S_n>uV_n\right)=P(e^{tS_n}>e^{tuV_n})\le
Ee^{tS_n}e^{-tuV_n}\le \exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac
t3}-tuV_n\right\}
$$
if $0\le t<3$. Choose the number $t$ in this inequality as the
solution of the equation $t^2V_n^2\frac1{1-\frac t3}=tuV_n$, i.e.
put $t=\frac u{V_n+\frac u3}$. Then $0\le t<3$, and we get that
$P(S_n>uV_n)\le e^{-tuV_n/2}=
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}\right\}$.
\medskip
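As a numerical illustration (not part of the proof), the following Python sketch estimates the tail probabilities $P(S_n>uV_n)$ by simulation for sums of independent random variables uniform on $[-1,1]$, and compares them with the right-hand side of (3.1); the sample size, the number of trials and the levels $u$ are arbitrary illustrative choices.

```python
import math
import random

# X_j uniform on [-1, 1]: P(|X_j| <= 1) = 1, E X_j = 0, sigma_j^2 = 1/3.
# n = 200 and the levels u = 1, 2, 3 are illustrative choices.
random.seed(1)
n = 200
V_n = math.sqrt(n / 3.0)           # V_n^2 = Var S_n = n / 3
trials = 10000

def bernstein_bound(u, V):
    """Right-hand side of (3.1)."""
    return math.exp(-u * u / (2.0 * (1.0 + u / (3.0 * V))))

levels = (1.0, 2.0, 3.0)
counts = {u: 0 for u in levels}
for _ in range(trials):
    S = sum(random.uniform(-1.0, 1.0) for _ in range(n))
    for u in levels:
        if S > u * V_n:
            counts[u] += 1

empirical = {u: counts[u] / trials for u in levels}
bounds = {u: bernstein_bound(u, V_n) for u in levels}
```

In this experiment the empirical frequencies stay below the bound at every level, with a wide margin for small $u$, in accordance with the Gaussian-type behaviour discussed below.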
If the random variables $X_1,\dots,X_n$ satisfy the conditions of
Bernstein's inequality, then also the random variables
$-X_1,\dots,-X_n$ satisfy them. By applying the above result in both
cases we get that
$P(|S_n|>uV_n)\le2
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}
\right\}$ under the conditions of Bernstein's inequality.
\medskip
By Bernstein's inequality, for all $\varepsilon>0$ there is some
number $\alpha(\varepsilon)>0$ such that in the case
$\frac u{V_n}<\alpha(\varepsilon)$ the inequality
$P(S_n>uV_n)\le e^{-(1-\varepsilon)u^2/2}$ holds. Beside this, for all
fixed numbers $A>0$ there is some constant $C=C(A)>0$ such that in
the case $\frac u{V_n}<A$ the inequality $P(S_n>uV_n)\le e^{-Cu^2}$
holds. This can be interpreted as a Gaussian type estimate for the
probability $P(S_n>uV_n)$ if $u\le \text{const.}\, V_n$.
On the other hand, if $\frac u{V_n}$ is very large, then Bernstein's
inequality yields a much worse estimate. The question arises whether
in this case Bernstein's inequality can be replaced by a better, more
useful result. Next we present Theorem~3.2, the so-called Bennett's
inequality which provides a slight improvement of Bernstein's
inequality. But if $\frac u{V_n}$ is very large, then also
Bennett's inequality provides a much weaker estimate on the
probability $P(S_n>uV_n)$ than the bound suggested by a Gaussian
comparison. On the other hand, we shall give an example that shows
that (without imposing some additional conditions) no real
improvement of this estimate is possible.
\medskip\noindent
{\bf Theorem 3.2. (Bennett's inequality).} {\it Let $X_1,\dots,X_n$ be
independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\text{Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
$$
P(S_n>u)\le\exp\left\{-V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac u{V^2_n}\right]\right\}
\quad\text{for all
}u>0. \tag3.2
$$
As a consequence, for all $\varepsilon>0$ there exists some
$B=B(\varepsilon)>0$ such
that
$$
P\left(S_n>u\right)\le\exp\left\{-(1-\varepsilon)u\log \frac u{V^2_n}
\right\}\quad\text{if } u>BV_n^2, \tag3.3
$$
and there exists some positive constant $K>0$ such that
$$
P\left(S_n>u\right)\le\exp\left\{-Ku\log \frac u{V^2_n}
\right\}\quad\text{if }u>2V_n^2. \tag3.4
$$
}
\medskip\noindent
{\it Proof of Theorem 3.2.}\/ We have
$$
Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k\le
1+\sigma_j^2\sum\limits_{k=2}^\infty\frac {t^k}{k!}=1+\sigma_j^2
\left(e^t-1-t\right)\le e^{\sigma_j^2(e^t-1-t)}, \quad 1\le j\le n,
$$
and $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$ for all $t\ge0$. Hence
$P(S_n>u)\le e^{-tu}Ee^{tS_n}\le e^{-tu+V_n^2(e^t-1-t)}$ for all
$t\ge0$. We get relation (3.2) from this inequality with the choice
$t=\log\left(1+\frac u{V^2_n}\right)$. (This is the place of
minimum of the
function $-tu+V_n^2(e^t-1-t)$ for fixed $u$ in the parameter~$t$.)
Relation (3.2) and the observation
$\lim\limits_{v\to\infty}\frac{(v+1)\log(v+1)-v}{v\log v}=1$ with the
choice $v=\frac u{V_n^2}$ imply formula~(3.3). Because of relation
(3.3) to prove formula (3.4) it is enough to check it for $2\le
\frac u{V_n^2}\le B$ with some sufficiently large constant $B>0$.
In this case relation (3.4) follows directly from formula (3.2).
This can be seen for instance by observing that the expression
$\frac{V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac
u{V^2_n}\right]}{u\log\frac u{V^2_n}}$ is a continuous and positive
function of the variable $\frac u{V_n^2}$ in the interval $2\le
\frac u{V_n^2}\le B$, hence its minimum in this interval is strictly
positive.
\medskip
Let us make a short comparison between Bernstein's and Bennett's
inequality. Both results yield an estimate on the probability
$P(S_n>u)$, and their proofs are very similar. They are based on
an estimate of the moment generating functions $R_j(t)=Ee^{tX_j}$
of the summands~$X_j$, but Bennett's inequality yields a better
estimate. It may be worth mentioning that the estimate given for
$R_j(t)=Ee^{tX_j}$ in the proof of
Bennett's inequality agrees with the moment generating function
$Ee^{t(Y_j-EY_j)}$ of the normalization $Y_j-EY_j$ of a Poissonian
random variable $Y_j$ with parameter $\text{Var}\, X_j$. As a
consequence,
we get, by using the standard method of estimating tail-distributions
by means of the moment generating functions such an estimate for the
probability $P(S_n>u)$ which is comparable with the probability
$P(T_n-ET_n>u)$, where $T_n$ is a Poissonian random variable with
parameter $V_n=\text{Var}\, S_n$. We can say that Bernstein's
inequality yields a Gaussian and Bennett's inequality a Poissonian
type estimate for the sums of independent, bounded random variables.
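This comparison can also be made numerically. The following Python sketch evaluates the (negative) exponents of the bounds in Theorems 3.1 and 3.2 for the probability $P(S_n>x)$ on the common scale $x=uV_n$; the choice $V_n^2=4$ and the grid of $x$ values are arbitrary illustrative ones.

```python
import math

# Exponents (with sign reversed) of the upper bounds in Theorems 3.1 and 3.2
# for P(S_n > x) on the common scale x = u V_n.

def bernstein_exp(x, Vn2):
    # from (3.1) with u = x / V_n: x^2 / (2 (V_n^2 + x/3))
    return x * x / (2.0 * (Vn2 + x / 3.0))

def bennett_exp(x, Vn2):
    # from (3.2): V_n^2 [(1 + x/V_n^2) log(1 + x/V_n^2) - x/V_n^2]
    v = x / Vn2
    return Vn2 * ((1.0 + v) * math.log(1.0 + v) - v)

Vn2 = 4.0                                 # arbitrary illustrative variance
xs = [0.5 * k for k in range(1, 81)]      # x from 0.5 to 40
diffs = [bennett_exp(x, Vn2) - bernstein_exp(x, Vn2) for x in xs]
```

Bennett's exponent dominates Bernstein's for all $x>0$, and the gap grows as $\frac x{V_n^2}$ increases, in accordance with the discussion above.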
\medskip\noindent
{\it Remark.}\/ Bennett's inequality yields a sharper estimate for
the probability $P(S_n>u)$ than Bernstein's inequality for all
numbers $u>0$. To prove this it is enough to show that for all
$0\le t<3$ the inequality $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$
appearing in the proof of Bennett's inequality is a sharper
estimate than the corresponding inequality
$Ee^{tS_n}\le\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3} \right\}$
appearing in the proof of Bernstein's inequality. (Recall how we
estimate the probability $P(S_n>u)$ in these proofs with the help
of the exponential moment $Ee^{tS_n}$.) But to prove this it is
enough to check that $e^t-1-t\le \frac{t^2}2\frac1{1-\frac t3}$
for all $0\le t<3$. This inequality clearly holds, since
$e^t-1-t=\sum\limits_{k=2}^\infty\frac{t^k}{k!}$, and
$\frac{t^2}2\frac1{1-\frac t3}=\sum\limits_{k=2}^\infty
\frac12(\frac13)^{k-2}t^k$.
\medskip
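The two series comparisons used in this remark can be verified mechanically. The following Python sketch checks the inequality $e^t-1-t\le\frac{t^2}2\frac1{1-t/3}$ on a grid of points $0<t<3$, together with the equivalent coefficientwise inequality $k!\ge2\cdot3^{k-2}$, $k\ge2$; the grid and the range of $k$ are illustrative choices.

```python
import math

# Check of e^t - 1 - t <= t^2 / (2 (1 - t/3)) on a grid 0 < t < 3, together
# with the coefficientwise form k! >= 2 * 3^(k-2) for k >= 2.

ts = [0.01 * k for k in range(1, 300)]
ok_grid = all(
    math.exp(t) - 1.0 - t <= t * t / (2.0 * (1.0 - t / 3.0)) + 1e-12
    for t in ts
)
ok_coeff = all(math.factorial(k) >= 2 * 3 ** (k - 2) for k in range(2, 40))
```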
Next we present Example~3.3 which shows that Bennett's inequality
yields a sharp estimate also in the case $u\gg V_n^2$ when
Bernstein's inequality yields a weak bound. But Bennett's inequality
provides only a small improvement which has only a limited
importance. This may be the reason why Bernstein's inequality
which yields a more transparent estimate is more popular.
\medskip\noindent
{\bf Example 3.3. (Sums of independent random variables with bad
tail distribution for large values).} {\it Let us fix some
positive integer $n$, real numbers $u$ and $\sigma^2$ such that
$0<\sigma^2\le\frac18$, $n>4u\ge6$ and $u>4n\sigma^2$. Let
$\bar\sigma^2$ be that solution of the equation $x^2-x+\sigma^2=0$
which is smaller than~$\frac12$. Take a sequence of independent
and identically distributed random variables
$\bar X_1,\dots,\bar X_n$ such that $P(\bar X_j=1)=\bar\sigma^2$,
$P(\bar X_j=0)=1-\bar\sigma^2$ for all $1\le j\le n$. Put
$X_j=\bar X_j-E\bar X_j=\bar X_j-\bar\sigma^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and $V_n^2=n\sigma^2$.
Then $P(|X_1|\le1)=1$, $EX_1=0$, $\text{Var}\, X_1=\sigma^2$, hence
$ES_n=0$, and $\text{Var}\, S_n=V_n^2$. Beside this
$$
P(S_n\ge u)>\exp\left\{-Bu\log \frac u{V^2_n}\right\}
$$
with some appropriate constant $B>0$ not depending on~$n$,
$\sigma$ and~$u$.}
\medskip\noindent
{\it Proof of Example 3.3.}\/ Simple calculation shows that $EX_j=0$,
$\text{Var}\, X_j=\bar\sigma^2-\bar\sigma^4=\sigma^2$,
$P(|X_j|\le1)=1$, and
also the inequality $\sigma^2\le\bar\sigma^2\le\frac32\sigma^2$ holds.
To see the upper bound in the last inequality observe that
$\bar\sigma^2\le\frac13$, i.e. $1-\bar\sigma^2\ge\frac23$, hence
$\sigma^2=\bar\sigma^2(1-\bar\sigma^2)\ge\frac23\bar\sigma^2$. In
the proof of the inequality of Example~3.3 we can restrict our
attention to the case when $u$ is an integer, because in the
general case we can apply the inequality with $\bar u=[u]+1$
instead of~$u$, where $[u]$ denotes the integer part of~$u$, and
since $u\le\bar u\le 2u$, the application of the result in this
case supplies the desired inequality with a possibly worse
constant~$B>0$.
Put $\bar S_n=\sum\limits_{j=1}^n\bar X_j$. We can write
$P(S_n\ge u)=P(\bar S_n\ge u+n\bar\sigma^2)\ge P(\bar S_n\ge2u)
\ge P(\bar S_n=2u)=\binom n{2u}\bar\sigma^{4u}(1-\bar\sigma^2)^{(n-2u)}
\ge(\frac {n\bar\sigma^2}{2u})^{2u}(1-\bar\sigma^2)^{(n-2u)}$,
since $u\ge n\bar\sigma^2$, and $n\ge2u$. On the other hand
$(1-\bar\sigma^2)^{(n-2u)}\ge e^{-2\bar\sigma^2(n-2u)}
\ge e^{-2n\bar\sigma^2}\ge e^{-u}$, hence
$$
\align
P(S_n\ge u)
&\ge\exp\left\{-2u\log\left(\frac u{n\bar\sigma^2}\right)
-2u\log2-u\right\}\\
&\ge\exp\left\{-2u\log\left(\frac u{n\sigma^2}\right)
-2u\log\frac{\bar\sigma^2}{\sigma^2}-2u\log2-u\right\}\\
&\ge\exp\left\{-100u\log\left(\frac u{V_n^2}\right)\right\}.
\endalign
$$
Example~3.3 is proved.
\medskip
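The quantities appearing in Example 3.3 can be computed exactly for concrete parameters. The following Python sketch takes the (arbitrary) admissible choice $n=2000$, $\sigma^2=0.001$, $u=10$, computes the exact tail probability $P(S_n\ge u)$ from the binomial distribution of $\bar S_n$, and checks that it lies between the lower bound $\left(\frac{n\bar\sigma^2}{2u}\right)^{2u}e^{-u}$ obtained in the proof and Bennett's upper bound (3.2).

```python
import math

n, sigma2, u = 2000, 0.001, 10     # sigma2 <= 1/8, n > 4u >= 6, u > 4 n sigma2
bar_sigma2 = (1.0 - math.sqrt(1.0 - 4.0 * sigma2)) / 2.0
# bar_sigma2 is the smaller root of x^2 - x + sigma2 = 0
Vn2 = n * sigma2                   # V_n^2 = Var S_n

# exact tail: P(S_n >= u) = P(bar S_n >= u + n bar_sigma2),
# where bar S_n is Binomial(n, bar_sigma2)
p = bar_sigma2
k0 = math.ceil(u + n * p)
term = math.comb(n, k0) * p ** k0 * (1.0 - p) ** (n - k0)   # P(bar S_n = k0)
exact = 0.0
for k in range(k0, n + 1):
    exact += term
    if term < 1e-300:
        break
    term *= (n - k) * p / ((k + 1) * (1.0 - p))             # pmf recursion

# lower bound from the proof of Example 3.3
lower = (n * p / (2.0 * u)) ** (2 * u) * math.exp(-u)

# Bennett's upper bound (3.2)
v = u / Vn2
bennett = math.exp(-Vn2 * ((1.0 + v) * math.log(1.0 + v) - v))
```

For these parameters the exact tail is of order $10^{-7}$, well above the lower bound of the proof and below Bennett's bound.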
In the case $u>4V_n^2$ Bernstein's inequality yields the estimate
$P(S_n>u)\le e^{-\alpha u}$ with some universal constant $\alpha>0$,
and the above example shows that at most an additional logarithmic
factor $K\log\frac u{V_n^2}$ can be expected in the exponent of
the upper bound in an improvement of this estimate. Bennett's
inequality shows that such an improvement is really possible.
\medskip
I finish this section with another estimate due to Hoeffding
which will be later useful in some symmetrization arguments.
\medskip\noindent
{\bf Theorem 3.4. (Hoeffding's inequality).} {\it Let
$\varepsilon_1,\dots,\varepsilon_n$
be independent random variables,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, and let $a_1,\dots,a_n$ be arbitrary real numbers. Put
$V=\sum\limits_{j=1}^na_j\varepsilon_j$. Then
$$
P(V>u)\le\exp\left\{-\frac{u^2}{2\sum_{j=1}^na_j^2 }\right\}\quad
\text{for all }u>0. \tag3.5
$$
}
\medskip\noindent
{\it Remark 1:}\/ Clearly $EV=0$ and
$\text{Var}\, V=\sum\limits_{j=1}^n a_j^2$,
hence Hoeffding's inequality yields such an estimate for $P(V>u)$
which the central limit theorem suggests. This estimate holds for
all real numbers $a_1,\dots,a_n$ and $u>0$.
\medskip\noindent
{\it Remark 2:}\/ The Rademacher functions $r_k(x)$, $k=1,2,\dots$,
defined by the formulas $r_k(x)=1$ if $(2j-1)2^{-k}\le x<2j2^{-k}$
and $r_k(x)=-1$ if $2(j-1)2^{-k}\le x<(2j-1)2^{-k}$,
$1\le j\le 2^{k-1}$, for all $k=1,2,\dots$, can be considered as
random variables on the probability space $\Omega=[0,1]$ with the
Borel $\sigma$-algebra and the Lebesgue measure as probability
measure on the interval $[0,1]$. They are independent random
variables with the same distribution as the random variables
$\varepsilon_1,\dots,\varepsilon_n$ considered in Theorem~3.4.
Therefore results
about such sequences of random variables whose distributions agree
with those in~Theorem~3.4 are also called sometimes results about
Rademacher functions in the literature. At some points we will
also apply this terminology.
\medskip\noindent
{\it Proof of Theorem 3.4.} Let us give a good bound on the
exponential moment $Ee^{tV}$ for all $t>0$. The identity
$Ee^{tV}=\prod\limits_{j=1}^nEe^{ta_j\varepsilon_j}=
\prod\limits_{j=1}^n\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2$ holds, and
$\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2=\sum\limits_{k=0}^\infty
\frac{a_{j}^{2k}} {(2k)!}t^{2k}\le \sum\limits_{k=0}^\infty \frac
{(a_jt)^{2k}}{2^{k}k!}=e^{a_j^2t^2/2}$, since $(2k)!\ge 2^k k!$
for all $k\ge0$. This implies that $Ee^{tV}\le
\exp\left\{\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$. Hence
$P(V>u)\le\exp\left\{-tu+\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$,
and we get relation (3.5) with the choice $t=u\left(\sum\limits_{j=1}^n
a_j^2\right)^{-1}$.
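The estimate (3.5) can be checked by simulation. In the following Python sketch the coefficients $a_j$, the sample size and the levels are arbitrary illustrative choices.

```python
import math
import random

# Monte Carlo check of Hoeffding's inequality (3.5).
random.seed(2)
n = 100
a = [random.uniform(-1.0, 1.0) for _ in range(n)]   # arbitrary coefficients
A2 = sum(x * x for x in a)                          # sum a_j^2 = Var V
trials = 10000

def hoeffding_bound(u):
    """Right-hand side of (3.5)."""
    return math.exp(-u * u / (2.0 * A2))

levels = (1.0, 2.0)                                 # u = c * sqrt(A2)
counts = {c: 0 for c in levels}
for _ in range(trials):
    V = sum(x if random.random() < 0.5 else -x for x in a)
    for c in levels:
        if V > c * math.sqrt(A2):
            counts[c] += 1

empirical = {c: counts[c] / trials for c in levels}
bounds = {c: hoeffding_bound(c * math.sqrt(A2)) for c in levels}
```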
\beginsection 4. On the supremum of a nice class of partial sums.
This section contains an estimate about the supremum of a nice
class of normalized sums of independent and identically
distributed random variables together with an analogous result
about the supremum of an appropriate class of random one-fold
integrals with respect to a normalized empirical measure. The
second result deals with a one-variate version of the problem
about the estimation of multiple integrals with respect to a
normalized empirical measure. This problem was mentioned in
the introduction. Some natural questions related to these
results will be also discussed. It will be examined how
restrictive their conditions are. In particular, we are
interested in the question how the condition about the
countable cardinality of the class of random variables can be
weakened. A natural Gaussian counterpart of the supremum
problems about random one-fold integrals will be also
considered. Most proofs will be postponed to later sections.
To formulate these results first a notion will be
introduced that plays a most important role in the sequel.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions.} {\it Let a
measurable space $(Y,{\Cal Y})$ be given together with a class
${\Cal G}$ of ${\Cal Y}$ measurable real valued functions on this
space. The class of functions ${\Cal G}$ is called an $L_p$-dense
class of functions, $1\le p<\infty$, with parameter $D$ and
exponent $L$ if for all numbers $0<\varepsilon\le1$ and
probability measures $\nu$ on the space $(Y,{\Cal Y})$ there
exists a finite $\varepsilon$-dense subset
${\Cal G}_{\varepsilon,\nu}=\{g_1,\dots,g_m\}\subset{\Cal G}$
in the space $L_p(Y,{\Cal Y},\nu)$ with
$m\le D\varepsilon^{-L}$ elements, i.e. there exists
such a set ${\Cal G}_{\varepsilon,\nu}
\subset {\Cal G}$ with $m\le D\varepsilon^{-L}$
elements for which
$\inf\limits_{g_j\in {\Cal G}_{\varepsilon,\nu}}\int |g-g_j|^p\,d\nu
<\varepsilon^p$ for all
functions $g\in {\Cal G}$. (Here the set
${\Cal G}_{\varepsilon,\nu}$ may depend
on the measure $\nu$, but its cardinality is bounded by a number
depending only on $\varepsilon$.)}
\medskip
In most results of this work the above defined $L_p$-dense classes
will be considered only for the parameter $p=2$. But at some
points it will be useful to work also with $L_p$-dense classes with
a different parameter~$p$. Hence to avoid some repetitions I
introduced the above definition for a general parameter~$p$.
The following estimate will be proved.
\medskip\noindent
{\bf Theorem 4.1. (Estimate on the supremum of a class of partial
sums).} {\it Let us consider a sequence of independent and
identically distributed random variables $\xi_1,\dots,\xi_n$,
$n\ge2$, with values in a measurable space $(X,{\Cal X})$ and with
some distribution~$\mu$. Beside this, let a countable and
$L_2$-dense class of functions ${\Cal F}$ with some parameter $D\ge1$
and exponent $L\ge1$ be given on the space $(X,{\Cal X})$ which
satisfies the conditions
$$
\align
\|f\|_\infty&=\sup_{x\in X}|f(x)|\le 1, \qquad \text{for all }
f\in{\Cal F} \tag4.1 \\
\|f\|_2^2&=\int f^2(x) \mu(\,dx)\le \sigma^2 \qquad \text{for all }
f\in {\Cal F} \tag4.2
\endalign
$$
with some constant $0<\sigma\le1$, and
$$
\int f(x)\mu(\,dx)=0 \quad \text{for all } f\in{\Cal F}. \tag4.3
$$
Define the normalized partial sums $S_n(f)=\frac1{\sqrt n}
\sum\limits_{k=1}^n f(\xi_k)$ for all $f\in {\Cal F}$.
There exist some universal constants $C>0$, $\alpha>0$ and $M>0$
such that the supremum of the normalized random sums $S_n(f)$,
$f\in {\Cal F}$, satisfies the inequality
$$
\aligned
P\left(\sup_{f\in{\Cal F}}|S_n(f)|\ge u\right)
\le C&\exp\left\{-\alpha\left(\frac u{\sigma}\right)^2\right\}
\quad \text{ for those numbers } u \\
\text{for which } &\sqrt n\sigma^2\ge
u\ge M\sigma(L^{3/4}\log^{1/2}\tfrac2\sigma +(\log D)^{3/4}),
\endaligned \tag4.4
$$
where the numbers~$D$ and $L$ in formula~(4.4) agree with the
parameter and exponent of the $L_2$-dense class~${\Cal F}$.}
\medskip\noindent
{\it Remark.}\/ Here and also in the subsequent part of this work
we consider random variables which take their values in a general
measurable space $(X,{\Cal X})$. The only restriction we impose
on these spaces is that all sets consisting of one point are
measurable, i.e. $\{x\}\in{\Cal X}$ for all $x\in X$.
\medskip
The condition $\sqrt n\sigma^2\ge u\ge
M\sigma(L^{3/4}\log^{1/2}\frac2\sigma +(\log D)^{3/4})$ about the number~$u$
in formula~(4.4) is natural. I discuss this after the formulation of
Theorem~4.2 which can be considered as the Gaussian counterpart of
Theorem~4.1. I also formulate a result in Example~4.3 which can be
considered as part of this discussion.
\medskip
The condition about the countable cardinality of ${\Cal F}$ can be
weakened with the help of the notion of countable approximability
introduced below. For the sake of later applications I define it
in a more general form than needed in this section.
\medskip\noindent
{\bf Definition of countably approximable classes of random
variables.} {\it Let us have a class of random variables $U(f)$,
$f\in {\Cal F}$, indexed by a class of functions $f\in{\Cal F}$
on a measurable space $(Y,{\Cal Y})$. This class of random variables
is called countably approximable if there is a countable subset
${\Cal F}'\subset {\Cal F}$ such that for all numbers $u>0$ the sets
$A(u)=\{\omega\colon\;\sup\limits_{f\in {\Cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\;\sup\limits_{f\in {\Cal F}'} |U(f)(\omega)|\ge u\}$
satisfy the identity $P(A(u)\setminus B(u))=0$.}
\medskip
Clearly, $B(u)\subset A(u)$. In the above definition it was demanded
that for all $u>0$ the set $B(u)$ should be almost as large as
$A(u)$. The following corollary of Theorem~4.1 holds.
\medskip\noindent
{\bf Corollary of Theorem~4.1.} {\it Let a class of functions
${\Cal F}$ satisfy the conditions of Theorem~4.1 with the only
exception that instead of the condition about the countable
cardinality of ${\Cal F}$ it is assumed that the class of random
variables $S_n(f)$, $f\in{\Cal F}$, is countably approximable. Then
the random variables $S_n(f)$, $f\in{\Cal F}$, satisfy
relation~(4.4).}
\medskip
This corollary can be simply proved, only Theorem~4.1 has to be
applied for the class ${\Cal F}'$. To do this it has to be checked
that if ${\Cal F}$ is an $L_2$-dense class with some parameter $D$
and exponent $L$, and ${\Cal F}'\subset {\Cal F}$, then ${\Cal F}'$ is
also an $L_2$-dense class with the same exponent $L$, only with a
possibly different parameter~$D'$.
To prove this statement let us choose for all numbers
$0<\varepsilon\le1$ and probability measures $\nu$ on
$(Y,{\Cal Y})$ some functions
$f_1,\dots,f_m\in {\Cal F}$,
$m\le D\left(\frac\varepsilon2\right)^{-L}$, such that
the sets ${\Cal D}_j=\left\{f\colon\;\int |f-f_j|^2\,d\nu\le
\left(\frac\varepsilon2\right)^2\right\}$ satisfy the relation
$\bigcup\limits_{j=1}^m {\Cal D}_j\supset{\Cal F}$. For all sets
${\Cal D}_j$ for which ${\Cal D}_j\cap {\Cal F}'$ is
non-empty choose a function $f'_j\in {\Cal D}_j\cap {\Cal F}'$. In
such a way we get a collection of functions $f'_j$ from the class
${\Cal F}'$ containing at most $2^LD\varepsilon^{-L}$ elements
which satisfies
the condition imposed for $L_2$-dense classes with exponent $L$ and
parameter $2^LD$ for this number $\varepsilon$ and measure $\nu$.
\medskip
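The construction described above can be carried out explicitly in a toy example. In the following Python sketch the class ${\Cal F}$ consists of indicator functions of intervals $[0,t)$, the measure $\nu$ is a discrete approximation of the Lebesgue measure, and the grids, the subclass ${\Cal F}'$ and the value of $\varepsilon$ are all arbitrary illustrative choices; the chosen representatives of the subclass form an $\varepsilon$-net of it, as the argument above predicts.

```python
import math

# F consists of the indicators f_t = 1_{[0,t)}, t on a fine grid. With respect
# to a discrete probability measure nu on [0, 1] the L_2(nu) distance is
# dist(f_s, f_t) = nu([min(s,t), max(s,t)))^(1/2).

N = 400
points = [(i + 0.5) / N for i in range(N)]      # support of nu, weights 1/N
w = 1.0 / N

def dist(s, t):
    lo, hi = min(s, t), max(s, t)
    return math.sqrt(sum(w for x in points if lo <= x < hi))

F = [i / 200.0 for i in range(201)]             # index set of the full class
F_sub = F[::3]                                  # an arbitrary subclass F'
eps = 0.2
net_F = [i / 100.0 for i in range(101)]         # an (eps/2)-net of F

# the construction above: pick one representative of F' from each
# (eps/2)-ball around a net point that meets F'
net_sub = []
for fj in net_F:
    cand = [g for g in F_sub if dist(fj, g) <= eps / 2.0]
    if cand:
        net_sub.append(cand[0])

# by the triangle inequality every g in F' lies within eps of net_sub
max_gap = max(min(dist(g, h) for h in net_sub) for g in F_sub)
```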
Next I formulate in Theorem~$4.1'$ a result about the supremum of
the integral of a class of functions with respect to a normalized
empirical distribution. It can be considered as a simple version
of Theorem~4.1. I formulated this result, because Theorems~4.1
and~$4.1'$ are special cases of their multivariate counterparts
about the supremum of so-called $U$-statistics and multiple
integrals with respect to a normalized empirical distribution
function discussed in Section~8. These results are also closely
related, but the explanation of their relation demands some work.
Given a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ taking values in $(X,{\Cal X})$ let us introduce
their empirical distribution on $(X,{\Cal X})$ as
$$
\mu_n(A)(\omega)
=\frac1n \#\left\{j\colon\; 1\le j\le n,\; \xi_j(\omega)\in
A\right\}, \quad A\in {\Cal X}, \tag4.5
$$
and define for all measurable and $\mu$~integrable functions~$f$
the (random) integral
$$
J_n(f)=J_{n,1}(f)=\sqrt n\int f(x)(\mu_n(\,dx)-\mu(\,dx)). \tag4.6
$$
Clearly $J_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n (f(\xi_j)-Ef(\xi_j))
=S_n(\bar f)$ with $\bar f(x)=f(x)-\int f(x)\mu(\,dx)$. It is not
difficult to see that $\sup\limits_{x\in X}|\bar f(x)|\le2$ if
$\sup\limits_{x\in X}|f(x)|\le1$, $\int \bar f(x)\mu(\,dx)=0$,
$\int \bar f^2(x)\mu(\,dx)\le\int f^2(x)\mu(\,dx)$, and if
${\Cal F}$ is an $L_2$-dense class of functions with parameter~$D$
and exponent~$L$, then the class of functions $\bar{\Cal F}$
consisting of the functions
$\bar f(x)=f(x)-\int f(x)\mu(\,dx)$, $f\in{\Cal F}$, is an $L_2$-dense
class of functions with parameter $2^LD$ and exponent $L$, since
$\int(\bar f-\bar g)^2\,d\mu\le\varepsilon^2$ if $f,g\in{\Cal F}$, and
$\int(f-g)^2\,d\mu\le\left(\frac\varepsilon2\right)^2$. Hence
Theorem~4.1 implies the following result.
\medskip\noindent
{\bf Theorem 4.1$'$. (Estimate on the supremum of random integrals
with respect to a normalized empirical measure).} {\it Let us have
a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge2$, with distribution~$\mu$ on
a measurable space $(X,{\Cal X})$ together with some class of
functions ${\Cal F}$ on this space which satisfies the conditions of
Theorem~4.1 with the possible exception of condition~(4.3). The
estimate (4.4) remains valid if the random sums $S_n(f)$ are replaced
in it by the random integrals $J_n(f)$ defined in~(4.6). Moreover,
similarly to the corollary of Theorem~4.1, the condition about the
countable cardinality of the set ${\Cal F}$ can be replaced by the
condition that the class of random variables $J_n(f)$, $f\in{\Cal F}$,
is countably approximable.}
\medskip
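The identity $J_n(f)=S_n(\bar f)$ behind this reduction is easy to confirm numerically. The following Python sketch does this for the illustrative choice of the uniform distribution $\mu$ on $[0,1]$ and $f(x)=x$, so that $\int f\,d\mu=\frac12$.

```python
import random

# mu is the uniform distribution on [0, 1] and f(x) = x; both are
# illustrative choices.
random.seed(3)
n = 1000
xi = [random.random() for _ in range(n)]

def f(x):
    return x

mean_f = 0.5                                   # int f dmu

# J_n(f) = sqrt(n) int f d(mu_n - mu) = n^{-1/2} (sum f(xi_j) - n int f dmu)
J_n = (sum(f(x) for x in xi) - n * mean_f) / n ** 0.5
# S_n(bar f) with bar f(x) = f(x) - int f dmu
S_n_bar = sum(f(x) - mean_f for x in xi) / n ** 0.5
```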
All finite dimensional distributions of the set of random variables
$S_n(f)$, $f\in{\Cal F}$, considered in Theorem~4.1 converge to those
of a Gaussian random field $Z(f)$, $f\in{\Cal F}$, with expectation
$EZ(f)=0$ and correlation $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\Cal F}$, as $n\to\infty$. Here, and in the subsequent part
of the paper a collection of random variables indexed by some set
of parameters will be called a Gaussian random field if for all
finite subsets of these parameters the random variables indexed by
this finite set are jointly Gaussian. We shall also define
so-called linear Gaussian random fields. They consist of jointly
Gaussian random variables $Z(f)$, $f\in{\Cal G}$, indexed by the
elements of a linear space ${\Cal G}$, which satisfy the relation
$Z(af+bg)=aZ(f)+bZ(g)$
with probability~1 for all real numbers $a$ and $b$ and $f,g\in{\Cal G}$.
Let us consider a linear Gaussian random field $Z(f)$, $f\in{\Cal G}$,
where the set of indices~${\Cal G}={\Cal G}_\mu$ consists of the
functions~$f$ square integrable with respect to a $\sigma$-finite
measure~$\mu$, and take an appropriate restriction of this field to
some parameter set ${\Cal F}\subset {\Cal G}$. In the next Theorem~4.2
we shall present a natural Gaussian counterpart of Theorem~4.1 by
means of an appropriate choice of~${\Cal F}$. Let me also remark that
in Section~10 multiple Wiener--It\^o integrals of functions of
$k$~variables with respect to a white noise will be defined for all
$k\ge1$. In the special case $k=1$ the Wiener--It\^o integrals for
an appropriate class of functions $f\in{\Cal F}$ yield a model for
which Theorem~4.2 is applicable. Before formulating this result let
us introduce the following definition which is a version of the
definition of $L_p$-dense classes of functions.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions with respect to
a measure~$\mu$.} {\it Let a measurable space $(X,{\Cal X})$ be given
together with a measure $\mu$ on the $\sigma$-algebra ${\Cal X}$ and
a set ${\Cal F}$ of ${\Cal X}$ measurable real valued functions on
this space. The set of functions ${\Cal F}$ is called an $L_p$-dense
class of functions, $1\le p<\infty$, with respect to the
measure~$\mu$ with parameter $D$ and exponent $L$ if for all
numbers $0<\varepsilon\le1$ there exists a finite $\varepsilon$-dense
subset ${\Cal F}_\varepsilon=\{f_1,\dots,f_m\}\subset{\Cal F}$
in the space
$L_p(X,{\Cal X},\mu)$ with $m\le D\varepsilon^{-L}$ elements, i.e.
such a set ${\Cal F}_\varepsilon\subset {\Cal F}$ with
$m\le D\varepsilon^{-L}$ elements for which
$\inf\limits_{f_j\in {\Cal F}_\varepsilon}\int |f-f_j|^p\,d\mu
$<\varepsilon^p$ for all functions $f\in{\Cal F}$.}
\medskip\noindent
{\bf Theorem 4.2. (Estimate on the supremum of a class of Gaussian
random variables).} {\it Let a probability measure $\mu$ be given
on a measurable space $(X,{\Cal X})$ together with a linear Gaussian
random field $Z(f)$, $f\in{\Cal G}$, such that $EZ(f)=0$,
$EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$, $f,g\in{\Cal G}$, where ${\Cal G}$
is the space of square integrable functions with respect to this
measure~$\mu$. Let ${\Cal F}\subset{\Cal G}$ be a countable and
$L_2$-dense class of functions with respect to the measure~$\mu$
with some exponent~$L\ge1$ and parameter~$D\ge1$ which also
satisfies condition~(4.2) with some $0<\sigma\le1$.
Then there exist some universal constants $C>0$ and $M>0$ (for
instance $C=4$ and $M=16$ is a good choice) such that the inequality
$$
P\left(\sup\limits_{f\in{\Cal F}}|Z(f)|
\ge u\right)\le C(D+1) \exp\left\{-\frac1{256}
\left(\frac u{\sigma}\right)^2\right\}
\quad \text{if }u\ge ML^{1/2}\sigma \log^{1/2}\frac2\sigma \tag4.7
$$
holds with the parameter $D$ and exponent $L$ introduced in this
theorem.}
\medskip
The exponent at the right-hand side of inequality~(4.7) does not
contain the best possible universal constant. One could choose the
coefficient $\frac{1-\varepsilon}2$ with arbitrary small
$\varepsilon>0$ instead of the
coefficient $\frac1{256}$ in the exponent at the right-hand side
of~(4.7) if the universal constants $C>0$ and $M>0$ are chosen
sufficiently large in this inequality. Actually, later in Theorem~8.6
such an estimate will be proved which can be considered as the
multivariate generalization of Theorem~4.2 with the expression
$-\frac{(1-\varepsilon)u^2}{2\sigma^2}$ in the exponent.
The condition about the countable cardinality of the set ${\Cal F}$
in Theorem~4.2 could be weakened similarly to Theorem~4.1. But
I omit the discussion of this question, since Theorem~4.2 was
only introduced for the sake of a comparison between the
Gaussian and non-Gaussian case. An essential difference between
Theorems~4.1 and~4.2 is that the class of functions~${\Cal F}$
considered in Theorem~4.1 had to be $L_2$-dense, while in
Theorem~4.2 a weaker version of this property was needed. In
Theorem~4.2 it was demanded that there exists a subset of
${\Cal F}$ of relatively small cardinality which is dense in the
$L_2(\mu)$ norm. In the $L_2$-density property imposed in
Theorem~4.1 a similar property was demanded for all probability
measures~$\nu$. The appearance of such a property may be unexpected.
But as we shall see, the proof of Theorem~4.1 contains a
conditioning argument where a lot of new conditional measures
appear, and the $L_2$-density property is needed to work with all
of them. One would also like to know some results that enable us
to check when this condition holds. In the next section a notion
popular in probability theory, the Vapnik--\v{C}ervonenkis classes
will be introduced, and it will be shown that a
Vapnik--\v{C}ervonenkis class of functions bounded by~1 is
$L_2$-dense.
Another difference between Theorems~4.1 and~4.2 is that the
conditions of formula~(4.4) contain the upper bound
$\sqrt n\sigma^2>u$, and no such condition was imposed in
formula~(4.7). The appearance of this condition in Theorem~4.1
can be explained by comparing this result with those of Section~3.
As we have seen, we do not lose much information if we restrict
our attention to the case
$u\le\text{const.}\, V_n^2=\text{const.}\, n\sigma^2$ in
Bernstein's inequality (if sums of independent and identically
distributed random variables are considered). Theorem~4.1 gives
an almost as good estimate for the supremum of normalized partial
sums under appropriate conditions for the class ${\Cal F}$ of
functions we consider in this theorem as Bernstein's inequality
yields for the normalized partial sums of independent and
identically distributed random variables with variance bounded
by~$\sigma^2$. But we could prove the estimate of Theorem~4.1 only
under the condition $\sqrt n\sigma^2>u$. We shall show in
Example~4.3 discussed below that in the case $u\gg\sqrt n\sigma^2$
only a weaker estimate holds. There is also a natural reason why
condition~(4.1) about the supremum of the functions $f\in {\Cal F}$
appeared in Theorems 4.1 and~$4.1'$, while no such condition was
needed in Theorem~4.2.
The lower bounds for the level~$u$ were imposed in formulas~(4.4)
and~(4.7) because of a similar reason. To understand why such a
condition is needed in formula (4.7) let us consider the
following example. Take a Wiener process $W(t)$, $0\le t\le1$,
define for all pairs $(s,t)$, $0\le s<t\le1$, the indicator function
$f_{s,t}(x)$ of the interval $[s,t)$ together with the random
variables $Z(f_{s,t})=W(t)-W(s)$, and introduce for all $\sigma>0$
the following class of functions ${\Cal F}_\sigma$:
${\Cal F}_\sigma=\{f_{s,t}\colon\; 0\le s<t\le1,\;t-s\le\sigma^2\}$.
These random variables satisfy the conditions of Theorem~4.2 with
the Lebesgue measure on $[0,1]$ in the role of the measure~$\mu$ and
this number~$\sigma$, hence formula~(4.7) yields that
$P\left(\sup\limits_{f\in{\Cal F}_\sigma}Z(f)>u\right)
\le e^{-\text{const.}\,(u/\sigma)^2}$ if
$u\ge ML^{1/2}\sigma\log^{1/2}\frac2\sigma$.
However, this relation does not hold if
$u=u(\sigma)<(1-\varepsilon)\sqrt2\sigma\log^{1/2}\frac1\sigma$
with some $\varepsilon>0$. In such cases
$P\left(\sup\limits_{f\in{\Cal F}_\sigma}Z(f) >u\right)\to1$,
as $\sigma\to0$. This can be proved relatively simply with the help
of the estimate
$P(Z(f_{s,t})>u(\sigma))\ge\text{const.}\, \sigma^{1-\varepsilon}$ if
$|t-s|=\sigma^2$ and the independence of the random integrals
$Z(f_{s,t})$ if the functions $f_{s,t}$ are indexed by such pairs
$(s,t)$ for which the intervals $(s,t)$ are disjoint. This means
that in this example formula~(4.7) holds only under the condition
$u\ge M\sigma\log^{1/2}\frac1\sigma$ with $M=\sqrt2$.
There is a classical result about the modulus of continuity of
Wiener processes, and actually this result helped us to find the
previous example. It is also worth mentioning that there are some
concentration inequalities, see Ledoux~[28] and Talagrand~[51],
which state that under very general conditions the distribution
of the supremum of a class of partial sums of independent random
variables or of the elements of a Gaussian random field is
strongly concentrated around the expected value of this supremum.
(Talagrand's result in this direction is also formulated in
Theorem~18.1 of this lecture note.) These results imply that the
problems discussed in Theorems~4.1 and~4.2 can be reduced to a
good estimate of the expected value
$E\sup\limits_{f\in{\Cal F}}|S_n(f)|$ and
$E\sup\limits_{f\in{\Cal F}}|Z(f)|$ of the supremum considered in
these results. However, the estimation of the expected value of
these suprema is not much simpler than the original problem.
Theorem~4.2 implies that under its conditions
$E\sup\limits_{f\in{\Cal F}}|Z(f)|
\le\text{const.}\, \sigma\log^{1/2}\frac2\sigma$
with an appropriate multiplying constant depending on the
parameter~$D$ and exponent~$L$ of the class of functions~${\Cal F}$.
In the case of Theorem~4.1 a similar estimate holds, but under more
restrictive conditions. We also have to impose that
$\sqrt n\sigma^2\ge\text{const.}\,\sigma\log^{1/2}\frac2\sigma$ with
a sufficiently large constant. This condition is needed to guarantee
that the set of numbers~$u$ satisfying condition~(4.4) is not empty.
If this condition is violated, then Theorem~4.1 supplies a weaker
estimate which we get by replacing $\sigma$ by an
appropriate~$\bar\sigma>\sigma$, and by applying Theorem~4.1 with
this number~$\bar\sigma$.
One may ask whether the above estimate about the expected value of
supremum of normalized partial sums may hold without the condition
$\sqrt n\sigma^2\ge\text{const.}\,\sigma\log^{1/2}\frac2\sigma$.
We show an example which gives a negative answer to this question.
Since here we discuss a rather particular problem which is outside
of our main interest in this work I give a rather sketchy
explanation of this example. I present this example together with
a Poissonian counterpart of it which may help explain why such a
result holds.
\medskip\noindent
{\bf Example 4.3. (Supremum of partial sums with bad tail behaviour).}
{\it Let $\xi_1,\dots,\xi_n$ be a sequence of independent random
variables with uniform distribution in the interval~$[0,1]$. Choose
a sequence of real numbers, $\varepsilon_n$, $n=3,4,\dots$, such that
$\varepsilon_n\to0$ as $n\to\infty$, and
$\frac12\ge\varepsilon_n\ge n^{-\delta}$ with a
sufficiently small number $\delta>0$. Put
$\sigma_n=\varepsilon_n\sqrt{\frac{\log n}n}$, and define the set of
functions $\bar f_{j,n}(\cdot)$ and $f_{j,n}(\cdot)$
on the interval $[0,1]$ by the formulas
$\bar f_{j,n}(x)=1$ if $(j-1)\sigma^2_n\le x0$. Then
$$
\lim_{n\to\infty}P\left(\sup_{f\in{\Cal F_n}}S_n(f)>u_n\right)=1.
$$
}
\medskip
This example has the following Poissonian counterpart.
\medskip\noindent
{\bf Example 4.3$'$. (A Poissonian counterpart of Example 4.3).}
{\it Let $\bar P_n(x)$ be a Poisson process on the interval~$[0,1]$
with parameter~$n$ and $P_n(x)=\frac1{\sqrt n}[\bar P_n(x)-nx]$,
$0\le x\le 1$. Consider the same sequences of numbers~$\varepsilon_n$,
$\sigma_n$ and~$u_n$ as in Example~4.3, and define the random
variables $Z_{n,j}=P_n(j\sigma^2_n)-P_n((j-1)\sigma^2_n)$ for all
$n=3,4,\dots$ and $1\le j\le \frac1{\sigma^2_n}$. Then
$$
\lim_{n\to\infty}P\left(\sup_{1\le j\le \frac1{\sigma_n^2}}
Z_{n,j}>u_n\right)=1.
$$
}
\medskip
The classes of functions ${\Cal F}_n$ in Example~4.3 are $L_2$-dense
classes of functions with some exponent~$L$ and parameter~$D$
not depending on the parameter~$n$ and the choice of the
numbers~$\sigma_n$. It can be seen that even the class of functions
${\Cal F}=\{f\colon\; f(x)=1\text{ if }s\le x<t,\;
f(x)=0\text{ otherwise},\;0\le s<t\le1\}$ of indicator functions of
subintervals of~$[0,1]$ is an $L_2$-dense class. It is not difficult
to deduce from Example~4.3 that
$E\sup\limits_{f\in{\Cal F}_n}S_n(f)\ge\text{const.}\,u_n
=\text{const.}\,\frac A{\sqrt n}\,\frac{\log n}{\log\frac1{\varepsilon_n}}$
for sufficiently large~$n$, while the natural choice of the
number~$\bar\sigma_n$ with
$\sup\limits_{f\in{\Cal F}_n}\int f^2\,dx\le\bar\sigma_n^2$ is
$\bar\sigma_n\sim\sigma_n$ in this case. As
$\varepsilon_n\log\frac1{\varepsilon_n}\to0$ as $n\to\infty$,
this means that the
expected value of the supremum of the random sums considered in
Example~4.3 does not satisfy the estimate
$\limsup\limits_{n\to\infty}
\frac1{\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}}
E\sup\limits_{f\in{\Cal F}_n}S_n(f)<\infty$ suggested by
Theorem~4.1. Observe that $\sqrt n\bar\sigma^2_n
\sim\text{const.}\, \varepsilon_n\bar\sigma_n\log^{1/2}
\frac2{\bar\sigma_n}$
in this case, since
$\sqrt n\bar\sigma^2_n\sim\varepsilon_n^2\frac{\log n}{\sqrt n}$,
and $\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}
\sim \text{const.}\,\varepsilon_n\frac{\log n}{\sqrt n}$.
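These asymptotic relations can be checked directly from the
definition of~$\sigma_n$; with $\bar\sigma_n\sim\sigma_n
=\varepsilon_n\sqrt{\frac{\log n}n}$ we have
$$
\sqrt n\,\sigma_n^2=\varepsilon_n^2\frac{\log n}{\sqrt n},\qquad
\log\frac2{\sigma_n}=\log2+\log\frac1{\varepsilon_n}
+\frac12\log\frac n{\log n}\sim\text{const.}\,\log n,
$$
where the last relation holds because of the condition
$n^{-\delta}\le\varepsilon_n\le\frac12$.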
\medskip\noindent
{\it The proof of Examples~4.3 and~$4.3'$.} First we prove the
statement of Example~$4.3'$. For a fixed index~$n$ the number of
random variables $Z_{n,j}$ equals
$\frac1{\sigma_n^2}\ge\frac1{\varepsilon_n^2}\frac n{\log n}
\ge\frac n{\log n}$, and they are independent. Hence it is enough
to show that $P(Z_{n,j}>u_n)\ge n^{-1/2}$ if first $A>0$ and then
$\delta>0$ (appearing in the condition
$\varepsilon_n>n^{-\delta}$) are chosen sufficiently small, and
$n\ge n_0$ with some threshold index $n_0=n_0(A,\delta)$.
Put $\bar u_n=[\sqrt nu_n+n\sigma^2_n]+1$, where $[\cdot]$ denotes
integer part. Then
$P(Z_{n,j}>u_n)\ge P(\bar P_n(\sigma^2_n)\ge\bar u_n)
\ge P(\bar P_n(\sigma^2_n)=\bar u_n)
=\frac{(n\sigma_n^2)^{\bar u_n}}{\bar u_n!}e^{-n\sigma_n^2}
\ge \left(\frac{n\sigma_n^2}{\bar u_n}\right)^{\bar u_n}e^{-n\sigma_n^2}$.
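In the last step we used the trivial bound
$\bar u_n!\le\bar u_n^{\bar u_n}$, which yields
$$
\frac{(n\sigma_n^2)^{\bar u_n}}{\bar u_n!}
\ge\frac{(n\sigma_n^2)^{\bar u_n}}{\bar u_n^{\bar u_n}}
=\left(\frac{n\sigma_n^2}{\bar u_n}\right)^{\bar u_n}.
$$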
Some calculation shows that
$\bar u_n\le\frac{A \log n}{\log \frac1{\varepsilon_n}}
+\varepsilon_n^2\log n+1
\le\frac{2A \log n}{\log \frac1{\varepsilon_n}}$,
$\frac{n\sigma_n^2}{\bar u_n}
\ge\frac{\varepsilon_n^2\log\frac1{\varepsilon_n}}{2A}$,
and $\log \frac{n\sigma_n^2}{\bar u_n}\ge- 2\log\frac1{\varepsilon_n}$
if the constants $A>0$, $\delta>0$ and threshold index $n_0$ are
appropriately chosen. Hence
$P(Z_{n,j}>u_n)\ge e^{-2\bar u_n\log(1/\varepsilon_n)-n\sigma^2}
\ge e^{-2A\log n-\varepsilon_n^2\log n}\ge n^{-1/2}$ if~$A_0>0$ is
sufficiently small.
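This estimate really implies the statement of Example~$4.3'$, since
the random variables $Z_{n,j}$, $1\le j\le\frac1{\sigma_n^2}$, are
independent, and their number is at least $\frac n{\log n}$, hence
$$
P\left(\sup_{1\le j\le 1/\sigma_n^2}Z_{n,j}\le u_n\right)
\le\left(1-n^{-1/2}\right)^{n/\log n}
\le\exp\left\{-\frac{\sqrt n}{\log n}\right\}\to0
\quad\text{as }n\to\infty.
$$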
The statement of Example~4.3 can be deduced from~Example~$4.3'$
by applying Poissonian approximation. Let us apply the result of
Example~$4.3'$ for a Poisson process $\bar P_{n/2}$ with
parameter~$\frac n2$ and with a number~$\bar\varepsilon_{n/2}$
chosen in such a way that the corresponding value of $\sigma_{n/2}$
equals the previously defined~$\sigma_n$. Then
$\bar\varepsilon_{n/2}\sim\frac{\varepsilon_n}{\sqrt 2}$,
and the number of sample points of $\bar P_{n/2}$ is less
than~$n$ with probability almost~1. Attaching additional sample
points to get exactly $n$ sample points we can get the result of
Example~4.3. I omit the details.
\medskip
The conditions formulated in formulas~(4.4) and~(4.7) for the
validity of Theorems~4.1 and~4.2 contain a large multiplying
constant, $ML^{3/4}$ and $ML^{1/2}$ respectively, in front of
$\sigma\log^{1/2}\frac2\sigma$ in the lower bound for the
number~$u$ if we deal with an $L_2$-dense class of functions
${\Cal F}$ which has a large exponent~$L$. At a heuristic level
it is clear that in such a case a large multiplying constant
must appear. On the other hand, I did not try to find the best
possible coefficients in the lower bounds in relations~(4.4) and~(4.7).
\medskip
In Theorem~4.1 (and in its version 4.1$'$) it was demanded that
the class of functions ${\Cal F}$ should be countable. Later this
condition was replaced by a weaker one about countable
approximability. By restricting our attention to countable or
countably approximable classes we could avoid some unpleasant
measure theoretical problems which would have arisen if we had
worked with the supremum of a non-countable number of random
variables, since such a supremum may be non-measurable.
There are some papers where possibly non-measurable models
are also considered with the help of some rather deep results
of analysis and measure theory. Actually, the problem we met
here is the natural analog of an important problem in the theory
of the stochastic processes about the smoothness property of the
trajectories of an appropriate version of a stochastic process
which we can get by exploiting our freedom to change all random
variables on a set of probability zero.
The study of the problem in this work is simpler in one respect.
Here the set of random variables $S_n(f)(\omega)$ or $J_n(f)(\omega)$,
$f\in{\Cal F}$, are constructed directly with the help of the
underlying random variables $\xi_1(\omega),\dots,\xi_n(\omega)$ for all
$\omega\in\Omega$ separately. We are interested in when the sets of
random variables constructed in this way are countably approximable,
i.e.\ we are not looking for a possibly different, better version of
them with the same finite dimensional distributions. The next
simple Lemma~4.4 yields a sufficient condition for countable
approximability. Its condition can be interpreted as a smoothness
type condition for the trajectories of a
stochastic process indexed by the functions $f\in{\Cal F}$.
\medskip\noindent
{\bf Lemma 4.4.} {\it Let a class of random variables $U(f)$,
$f\in{\Cal F}$, indexed by some set ${\Cal F}$ of functions be given
on a space $(Y,{\Cal Y})$. If there exists a countable subset
${\Cal F}'\subset {\Cal F}$ of the set ${\Cal F}$ such that the sets
$A(u)=\{\omega\colon\;\sup\limits_{f\in {\Cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\;\sup\limits_{f\in {\Cal F}'} |U(f)(\omega)|\ge u\}$
introduced
for all $u>0$ in the definition of countable approximability satisfy
the relation $A(u)\subset B(u-\varepsilon)$ for all $u>\varepsilon>0$,
then the class
of random variables $U(f)$, $f\in{\Cal F}$, is countably approximable.
The above property holds if for all $f\in{\Cal F}$, $\varepsilon>0$
and $\omega\in\Omega$ there exists a function
$\bar f=\bar f(f,\varepsilon,\omega)\in{\Cal F}'$
such that $|U(\bar f)(\omega)|\ge|U(f)(\omega)|-\varepsilon$.}
\medskip\noindent
{\it Proof of Lemma 4.4.}\/ If $A(u)\subset B(u-\varepsilon)$ for
all $\varepsilon>0$, then
$P^*(A(u)\setminus B(u))\le \lim\limits_{\varepsilon\to0}
P(B(u-\varepsilon)\setminus B(u))=0$, where $P^*(X)$ denotes the
outer measure
of a not necessarily measurable set $X\subset\Omega$, since
$\bigcap\limits_{\varepsilon>0}B(u-\varepsilon)=B(u)$, and this is
what we had to prove.
If $\omega\in A(u)$, then for all $\varepsilon>0$ there exists some
$f=f(\omega)\in{\Cal F}$ such that $|U(f)(\omega)|>u-\frac\varepsilon2$.
If there
exists some $\bar f=\bar f(f,\frac\varepsilon2,\omega)$,
$\bar f\in{\Cal F}'$ such that
$|U(\bar f)(\omega)| \ge |U(f)(\omega)|-\frac\varepsilon2$,
then $|U(\bar f)(\omega)|
>u-\varepsilon$, and $\omega\in B(u-\varepsilon)$. This
means that $A(u)\subset B(u-\varepsilon)$.
\medskip
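Let us illustrate the application of Lemma~4.4 with a simple
example. Take $X=[0,1]$ with the uniform distribution~$\mu$, the
class ${\Cal F}=\{f_v\colon\; f_v(x)=1\text{ if }x<v,\;
f_v(x)=0\text{ if }x\ge v,\;0\le v\le1\}$, its countable subclass
${\Cal F}'$ consisting of the functions $f_v$ with rational~$v$,
and put $U(f_v)=\frac1{\sqrt n}\sum\limits_{l=1}^n(f_v(\xi_l)-v)$.
For fixed $f_v$, $\varepsilon>0$ and $\omega$ choose a rational
number $\bar v\le v$ such that $v-\bar v\le\frac\varepsilon{\sqrt n}$,
and no sample point $\xi_l(\omega)$ falls into the interval
$[\bar v,v)$; this is possible, since there are only finitely many
sample points. Then
$$
|U(f_v)(\omega)-U(f_{\bar v})(\omega)|=\sqrt n\,(v-\bar v)
\le\varepsilon,
$$
hence the condition formulated in the second part of Lemma~4.4 is
satisfied with $\bar f=f_{\bar v}$, and this class of random
variables is countably approximable.
\medskip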
The question about countable approximability also appears in the
case of multiple random integrals with respect to a normalized
empirical measure. To avoid some repetition we prove a result which
also covers such cases. For this goal first we introduce the notion
of multiple integrals with respect to a normalized empirical measure.
Given a measurable function $f(x_1,\dots,x_k)$ on the $k$-fold
product space $(X^k,{\Cal X}^k)$ and a sequence of independent random
variables $\xi_1,\dots,\xi_n$ with some distribution $\mu$ on the
space $(X,{\Cal X})$ we define the integral $J_{n,k}(f)$ of the
function $f$ with respect to the $k$-fold product of the normalized
version of the empirical measure $\mu_n$ introduced in (4.5) by the
formula
$$
\align
J_{n,k}(f)&=\frac{n^{k/2}}{k!} \int'
f(x_1,\dots,x_k)(\mu_n(\,dx_1)-\mu(\,dx_1))\dots
(\mu_n(\,dx_k)-\mu(\,dx_k)),\\
&\qquad\text{where the prime in $\tsize\int'$ means that the
diagonals } x_j=x_l,\; 1\le j<l\le k, \text{ are omitted}\\
&\qquad\text{from the domain of integration.}
\endalign
$$
We write $J_n(f)=J_{n,1}(f)$ in the case $k=1$. Let us consider,
for $k=1$ and $X=R^1$, the class of functions
${\Cal F}=\{f_v\colon\; f_v(x)=1\text{ if }x<v,\;
f_v(x)=0\text{ if }x\ge v,\;v\in R^1\}$, and let us estimate the
probability
$P\left(n^{-1/2}\sup\limits_{f\in{\Cal F}}|J_n(f)|>u\right)$.
Observe that $n^{-1/2}J_n(f_v)=F_n(v)-F(v)$, where $F_n$ denotes
the empirical distribution function of the sample
$\xi_1,\dots,\xi_n$, and $F$ the distribution function of the
measure~$\mu$. We have seen that
the above class of functions ${\Cal F}$ is countably approximable. The
results of the next section imply that this class of functions is
also $L_2$-dense. Besides, it is not difficult to check this
property directly. Hence we can apply Theorem~$4.1'$ to the above
defined class of functions with $\sigma=1$, and it yields that
$P\left(n^{-1/2}\sup\limits_{f\in {\Cal F}}|J_n(f)|>u\right)
\le e^{-Cnu^2}$
if $1\ge u\ge\bar Cn^{-1/2}$ with some universal constants $C>0$ and
$\bar C>0$. (The condition $1\ge u$ can actually be dropped.) The
application of this estimate for fixed numbers $u=\varepsilon>0$
together with the Borel--Cantelli lemma implies the fundamental
theorem of mathematical statistics (the Glivenko--Cantelli theorem).
In short, the results of this section yield more information about
the closeness of the empirical distribution function $F_n$ to the
distribution function $F$ than the fundamental theorem of
mathematical statistics. Moreover, since these results can also be
applied for other classes of functions, they yield useful
information about the closeness of the probability measure $\mu$
to the empirical measure~$\mu_n$.
\beginsection 5. Vapnik--\v{C}ervonenkis classes and $L_2$-dense
classes of functions.
In this section the most important notions and results will be
presented about Vapnik--\v{C}ervonenkis classes, and it will be
explained how they help to show in some important cases that
certain classes of functions are $L_2$-dense. $L_2$-dense classes
of functions played an important role in the study of the previous
section. The results of this section may help to find
interesting classes of functions with this property. Some of the
results of this section will be proved in Appendix~A.
First I recall the following notions.
\medskip\noindent
{\bf Definition of Vapnik--\v{C}ervonenkis classes of sets and
functions.} {\it Let a set $X$ be given, and let us select a class
${\Cal D}$ of subsets of this set $X$. We call
${\Cal D}$ a Vapnik--\v{C}ervonenkis class if there exist two real
numbers $B$ and $K$ such that for all positive integers $n$ and
subsets $S(n)=\{x_1,\dots,x_n\}\subset X$ of cardinality $n$
of the set $X$ the collection of sets of the form $S(n)\cap D$,
$D\in{\Cal D}$, contains no more than $Bn^K$ subsets of~$S(n)$.
We shall call $B$ the parameter and $K$ the exponent of this
Vapnik--\v{C}ervonenkis class.
A class of real valued functions ${\Cal F}$ on a space $(Y,{\Cal Y})$
is called a Vapnik--\v{C}ervonenkis class if the collection of
graphs of these functions is a Vapnik--\v{C}ervonenkis class, i.e.\
if the sets $A(f)=\{(y,t)\colon\; y\in Y,\;\min(0,f(y))\le t\le
\max(0,f(y))\}$, $f\in {\Cal F}$, constitute a
Vapnik--\v{C}er\-vo\-nen\-kis class of subsets of the product space
$X=Y\times R^1$.}
\medskip
The following result which was first proved by Sauer plays a fundamental
role in the theory of Vapnik--\v{C}er\-vo\-nen\-kis classes.
This result provides a relatively simple condition for a class
${\Cal D}$ of subsets of a set~$X$ to be a
Vapnik--\v{C}ervonenkis class. Its proof is given in Appendix~A.
Before its formulation I introduce some terminology which seems to
be wide spread and generally accepted in the literature.
\medskip\noindent
{\bf Definition of shattering of a set.} {\it Let a set $S$ and a
class ${\Cal E}$ of subsets of $S$ be given. A finite set
$F\subset S$ is called shattered by the class ${\Cal E}$ if all
its subsets $H\subset F$ can be written in the form $H=E\cap F$
with some element $E\in{\Cal E}$ of the class of sets of ${\Cal E}$.}
\medskip\noindent
{\bf Theorem 5.1. (Sauer's lemma).} {\it Let a finite set $S=S(n)$
consisting of $n$ elements be given together with a class ${\Cal E}$
of subsets of $S$. If ${\Cal E}$ shatters no subset of $S$ of
cardinality~$k$, then ${\Cal E}$ contains at most
$\binom n0+\binom n1+\cdots+\binom n{k-1}$ subsets of $S$.}
\medskip
The estimate of Sauer's lemma is sharp. Indeed, if ${\Cal E}$ contains
all subsets of $S$ of cardinality less than or equal to $k-1$, then
it shatters no set $F\subset S$ of cardinality $k$ (such a set $F$
itself cannot be written in the form $E\cap F$,
$E\in {\Cal E}$, since $|E|\le k-1$), and ${\Cal E}$ contains exactly
$\binom n0+\binom n1+\cdots+\binom n{k-1}$ subsets of $S$.
Sauer's lemma states that this is an extremal case. Any class of
subsets ${\Cal E}$ of $S$ with cardinality greater than
$\binom n0+\binom n1+\cdots+\binom n{k-1}$ shatters at least one
subset of~$S$ with cardinality~$k$.
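For instance, if $S=\{1,2,3\}$, $k=2$, and ${\Cal E}$ consists of
the empty set and the three singletons, then
$$
|{\Cal E}|=4=\binom30+\binom31,
$$
and no two-point set $F\subset S$ is shattered, since the equation
$F=E\cap F$ has no solution with $E\in{\Cal E}$.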
Let us have a set $X$ and a class of subsets ${\Cal D}$ of it. One may
be interested in when ${\Cal D}$ is a Vapnik--\v{C}ervonenkis class.
Sauer's lemma gives a useful condition for it. Namely, it implies
that if there exists a positive integer $k$ such that the class
${\Cal D}$ shatters no subset of $X$ of cardinality~$k$,
then ${\Cal D}$
is a Vapnik--\v{C}ervonenkis class. Indeed, let us take some number
$n\ge k$, fix an arbitrary set $S(n)=\{x_1,\dots,x_n\}\subset X$ of
cardinality~$n$, and introduce the class of subsets
${\Cal E}={\Cal E}(S(n))=\{S(n)\cap D\colon\; D\in{\Cal D}\}$. If
${\Cal D}$ shatters no subset of $X$ of cardinality~$k$, then ${\Cal E}$
shatters no subset of $S(n)$ of cardinality~$k$. Hence by
Sauer's lemma the class ${\Cal E}$ contains at most
$\binom n0+\binom n1+\cdots+\binom n{k-1}$ elements. Let me remark
that it is also proved that
$\binom n0+\binom n1+\cdots+\binom n{k-1}\le1.5\frac{n^{k-1}}{(k-1)!}$
if $n\ge k+1$. This estimate gives a bound on the parameter and
exponent of a Vapnik--\v{C}ervonenkis class which satisfies the
above condition.
Moreover, Theorem~5.1 also has the following consequence. Take
an (infinite) set $X$ and a class of its subsets ${\Cal D}$.
There are two possibilities. Either for all integers $n$ there is
some set $S(n)\subset X$ of cardinality $n$ such
that ${\Cal E}(S(n))$ contains all subsets
of $S(n)$, i.e.\ ${\Cal D}$ shatters this set, or
$\sup\limits_{S\colon\;S\subset X,\,|S|=n}|{\Cal E}(S)|$
tends to infinity at most at a polynomial rate as
$n\to\infty$, where $|S|$ and $|{\Cal E}(S)|$
denote the cardinality of $S$ and of ${\Cal E}(S)$.
\medskip
The following Theorem~5.2, an important result of Richard Dudley,
states that a Vapnik--\v{C}er\-vo\-nen\-kis class of functions
bounded by~1 is an $L_1$-dense class of functions.
\medskip\noindent
{\bf Theorem 5.2. (A relation between the $L_1$-dense class and
Vapnik--\v{C}er\-vo\-nen\-kis class property).} {\it Let $f(y)$,
$f\in {\Cal F}$, be a Vapnik--\v{C}ervonenkis class of real valued
functions on some measurable space $(Y,{\Cal Y})$ such that
$\sup\limits_{y\in Y}|f(y)|\le1$ for all $f\in{\Cal F}$.
Then ${\Cal F}$ is an
$L_1$-dense class of functions on $(Y,{\Cal Y})$. More explicitly, if
${\Cal F}$ is a Vapnik--\v{C}ervonenkis class with parameter $B\ge1$
and exponent $K>0$, then it is an $L_1$-dense class with exponent
$L=2K$ and parameter $D=CB^2 (4K)^{2K}$ with some universal
constant~$C>0$.}
\medskip\noindent
{\it Proof of Theorem 5.2.}\/ Let us fix some probability
measure $\nu$ on $(Y,{\Cal Y})$ and a real number
$0<\varepsilon\le1$. We are going to show that any finite set
${\Cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\Cal F}$
such that $\int|f_j-f_k|\,d\nu\ge\varepsilon$ if $j\neq k$,
$f_j,f_k\in{\Cal D}(\varepsilon,\nu)$ has cardinality
$M\le D\varepsilon^{-L}$ with some $D>0$ and $L>0$. This
implies that ${\Cal F}$ is an $L_1$-dense class with
parameter~$D$ and exponent~$L$. Indeed, let us take a maximal
subset
$\bar{\Cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\Cal F}$
such that the $L_1(\nu)$ distance of any two functions in this
subset is at least~$\varepsilon$. Maximality means in this context
that no function $f_{M+1}\in{\Cal F}$ can be attached to
$\bar{\Cal D}(\varepsilon,\nu)$ without violating this condition.
Thus the inequality $M\le D\varepsilon^{-L}$ means that
$\bar{\Cal D}(\varepsilon,\nu)$ is an $\varepsilon$-dense subset
of~${\Cal F}$ in the space $L_1(Y,{\Cal Y},\nu)$
with no more than $D\varepsilon^{-L}$ elements.
In the estimation of the cardinality $M$ of a (finite) set
${\Cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}$ with the property
$\int|f_j-f_k|\,d\nu\ge\varepsilon$ if $j\neq k$ the
Vapnik--\v{C}ervonenkis class property of ${\Cal F}$ is exploited in
the following way. Let us choose a relatively small number $p$ of
points $(y_l,t_l)$, $y_l\in Y$, $-1\le t_l\le1$, $1\le l\le p$, in the
space $(Y\times [-1,1])$ in such a way that the set
$S_0(p)=\{(y_l,t_l),\,1\le l\le p\}$ and graphs
$A(f_j)=\{(y,t)\colon\; y\in Y,\;\min(0,f_j(y))\le t\le\max(0,f_j(y))\}$,
$f_j\in{\Cal D}(\varepsilon,\nu)\subset{\Cal F}$ have
the property that all
sets $A(f_j)\cap S_0(p)$, $1\le j\le M$, are different. Then the
Vapnik--\v{C}ervonenkis class property of ${\Cal F}$ implies that
$M\le Bp^K$. Hence if there exists a set $S_0(p)$ with the above
property and with a relatively small number $p$, then this yields a
useful estimate on $M$. Such a set $S_0(p)$ will be given by means of
the following random construction.
Let us choose the $p$ points $(y_l,t_l)$, $1\le l\le p$, of the
(random) set $S_0(p)$ independently of each other in such a way that
the coordinate $y_l$ is chosen with distribution $\nu$ on
$(Y,{\Cal Y})$ and the coordinate $t_l$ with uniform distribution on
the interval $[-1,1]$ independently of $y_l$. (The number~$p$ will be
chosen later.) Let us fix some indices $1\le j,k\le M$, and estimate
the probability that the sets $A(f_j)\cap S_0(p)$ and $A(f_k)\cap
S_0(p)$ agree, where $A(f)$ denotes the graph of the function~$f$.
Consider the symmetric difference $A(f_j)\Delta A(f_k)$
of the sets $A(f_j)$ and $A(f_k)$. The sets
$A(f_j)\cap S_0(p)$ and $A(f_k)\cap S_0(p)$ agree if and only if
$(y_l,t_l)\notin A(f_j)\Delta A(f_k)$ for all $(y_l,t_l)\in S_0(p)$.
Let us observe that for a fixed
$l$ the estimate $P((y_l,t_l)\in A(f_j)\Delta A(f_k))
=\frac12(\nu\times\lambda)(A(f_j)\Delta A(f_k))
=\frac12\int |f_j-f_k|\,d\nu\ge\frac\varepsilon2$ holds, where
$\lambda$ denotes the Lebesgue measure. This implies that the
probability that the (random) sets $A(f_j)\cap S_0(p)$ and
$A(f_k)\cap S_0(p)$ agree can be bounded from above by
$\left(1-\frac\varepsilon2\right)^p\le e^{-p\varepsilon/2}$.
Hence the probability that all sets $A(f_j)\cap S_0(p)$ are
different is greater than
$1-\binom M2 e^{-p\varepsilon/2}\ge1-\frac{M^2}2e^{-p\varepsilon/2}$.
Choose $p$ such that
$\frac74e^{p\varepsilon/2}>e^{(p+1)\varepsilon/2}>M^2\ge e^{p\varepsilon/2}$.
Then the above probability is greater than $\frac18$, and there exists
some set $S_0(p)$ with the desired property.
The inequalities $M\le Bp^K$ and $M^2\ge e^{p\varepsilon/2}$ imply
that $M\ge e^{\varepsilon M^{1/K}/4B^{1/K}}$, i.e.\
$\frac{\log M^{1/K}}{M^{1/K}}\ge \frac\varepsilon{4KB^{1/K}}$. As
$\frac{\log M^{1/K}}{M^{1/K}}\le CM^{-1/2K}$
for $M\ge1$ with some universal constant $C>0$, this estimate
implies that Theorem 5.2 holds with the exponent~$L$ and
parameter~$D$ given in its formulation.
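The last step can be carried out explicitly. As $\log x\le\sqrt x$
for all $x\ge1$, we may choose e.g.\ $C=1$ in the inequality
$\frac{\log M^{1/K}}{M^{1/K}}\le CM^{-1/2K}$, and then
$$
M^{-1/2K}\ge\frac\varepsilon{4KB^{1/K}},
\quad\text{i.e.}\quad
M\le\left(\frac{4KB^{1/K}}\varepsilon\right)^{2K}
=(4K)^{2K}B^2\varepsilon^{-2K},
$$
in accordance with the exponent $L=2K$ and parameter
$D=CB^2(4K)^{2K}$ given in the formulation of Theorem~5.2.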
\medskip
Let us observe that if ${\Cal F}$ is an $L_1$-dense class of
functions on a measure space $(Y,{\Cal Y})$ with some
exponent~$L$ and parameter~$D$, and also the inequality
$\sup\limits_{y\in Y}|f(y)|\le1$ holds for all $f\in{\Cal F}$,
then ${\Cal F}$ is an $L_2$-dense class of functions
with exponent $2L$ and parameter $D2^L$. Indeed, if we fix some
probability measure $\nu$ on $(Y,{\Cal Y})$ together with a number
$0<\varepsilon\le1$, and
${\Cal D}(\varepsilon,\nu)=\{f_1,\dots, f_M\}$ is an
$\frac{\varepsilon^2}2$-dense subset of ${\Cal F}$ in the
space $L_1(Y,{\Cal Y},\nu)$ with
$M\le2^L D \varepsilon^{-2L}$ elements, then for every function
$f\in {\Cal F}$ some function $f_j\in{\Cal D}(\varepsilon,\nu)$ can
be chosen in such a way that
$\int(f-f_j)^2\,d\nu\le2\int|f-f_j|\,d\nu\le\varepsilon^2$. This
implies that ${\Cal F}$ is an $L_2$-dense class with the given
exponent and parameter.
It is not easy to check whether a collection of subsets ${\Cal D}$
of a set $X$ is a Vapnik--\v{C}ervonenkis class even with the help
of Theorem~5.1. Therefore the following Theorem~5.3 which enables
us to construct many non-trivial Vapnik--\v{C}ervonenkis classes
is of special interest. Its proof is given in Appendix~A.
\medskip\noindent
{\bf Theorem 5.3. (A way to construct Vapnik--\v{C}ervonenkis classes).}
{\it Let us consider a $k$-dimensional subspace ${\Cal G}_k$ of the
linear space of real valued functions defined on a set $X$, and define
the level-set $A(g)=\{x\colon\; x\in X,\,g(x)\ge0\}$ for all functions
$g\in{\Cal G}_k$. Take the class of subsets
${\Cal D}=\{A(g)\colon\; g\in {\Cal G}_k\}$ of the set $X$ consisting of
the above introduced level sets. No subset $S=S(k+1)\subset X$ of
cardinality $k+1$ is shattered by ${\Cal D}$. Hence by Theorem~5.1
${\Cal D}$ is a Vapnik--\v{C}ervonenkis class of subsets of~$X$.}
\medskip
Theorem~5.3 enables us to construct many interesting
Vapnik--\v{C}ervonenkis classes. Thus for instance the class of
all half-spaces in a Euclidean space, the class of all ellipses in
the plane, or more generally the level sets of $k$-order algebraic
functions with a fixed number $k$ constitute a
Vapnik--\v{C}ervonenkis
class. It can be proved that if ${\Cal C}$ and ${\Cal D}$ are
Vapnik--\v{C}ervonenkis classes of subsets of a set $S$, then also
their intersection
${\Cal C}\cap {\Cal D}=\{C\cap D\colon\; C\in{\Cal C},\,D\in{\Cal D}\}$,
their union ${\Cal C}\cup{\Cal D}
=\{C\cup D\colon\; C\in{\Cal C},\,D\in{\Cal D}\}$
and complementary sets ${\Cal C}^c
=\{S\setminus C\colon\; C\in{\Cal C}\}$
are Vapnik--\v{C}ervonenkis classes. These results are less
important for us, and their proofs will be omitted. We are
interested in Vapnik--\v{C}ervonenkis classes not for their own
sake. We are going to find $L_2$-dense classes of functions, and
Vapnik--\v{C}ervonenkis classes help us in finding such
classes. Indeed, Theorem 5.2 implies that if ${\Cal D}$ is a
Vapnik--\v{C}ervonenkis class of subsets of a set $S$, then their
indicator functions constitute an $L_1$-dense, hence also an
$L_2$-dense class of functions. Then the results of Lemma~5.4
formulated below enable us to construct new $L_2$-dense classes of
functions.
\medskip\noindent
{\bf Lemma 5.4. (Some useful properties of $L_2$-dense classes).}
{\it Let ${\Cal G}$ be an $L_2$-dense class of functions
on some space $(Y,{\Cal Y})$ whose absolute values are bounded
by one, and let $f$ be a function on $(Y,{\Cal Y})$ also with
absolute value bounded by one. Then
$f\cdot{\Cal G}=\{f\cdot g\colon\; g\in {\Cal G}\}$ is also an
$L_2$-dense class of functions. Let ${\Cal G}_1$ and
${\Cal G}_2$ be two $L_2$-dense classes of functions on some
space $(Y,{\Cal Y})$ whose absolute values are
bounded by one. Then the classes of functions
${\Cal G}_1+{\Cal G}_2=\{g_1+g_2\colon\;
g_1\in{\Cal G}_1,\,g_2\in{\Cal G}_2\}$,
${\Cal G}_1\cdot{\Cal G}_2
=\{g_1g_2\colon\; g_1\in{\Cal G}_1,\,g_2\in{\Cal G}_2\}$,
$\min({\Cal G}_1,{\Cal G}_2)
=\{\min(g_1,g_2)\colon\; g_1\in{\Cal G}_1,\,g_2\in
{\Cal G}_2\}$, $\max({\Cal G}_1,{\Cal G}_2)
=\{\max(g_1,g_2)\colon\; g_1\in
{\Cal G}_1,\,g_2\in{\Cal G}_2\}$ are also $L_2$-dense.
If ${\Cal G}$ is an
$L_2$-dense class of functions, and ${\Cal G}'\subset{\Cal G}$,
then ${\Cal G}'$ is also an $L_2$-dense class.}
\medskip\noindent
The proof of Lemma 5.4 is rather straightforward. One has to observe
for instance that if $g_1,\bar g_1\in{\Cal G}_1$,
$g_2,\bar g_2\in{\Cal G}_2$, then $|\min(g_1,g_2)-\min(\bar g_1,\bar g_2)|
\le |g_1-\bar g_1|+|g_2-\bar g_2|$, hence if
$g_{1,1},\dots,g_{1,M_1}$ is an $\frac\varepsilon2$-dense
subset of ${\Cal G}_1$
and $g_{2,1},\dots,g_{2,M_2}$ is an $\frac\varepsilon2$-dense
subset of ${\Cal G}_2$ in the space $L_2(Y,{\Cal Y},\nu)$ with
some probability measure
$\nu$, then the functions $\min(g_{1,j},g_{2,k})$, $1\le j\le M_1$,
$1\le k\le M_2$ constitute an $\varepsilon$-dense subset of
$\min({\Cal G}_1,{\Cal G}_2)$ in $L_2(Y,{\Cal Y},\nu)$. The last
statement of Lemma~5.4 was proved after the Corollary of
Theorem~4.1. The details are left to the reader.
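A similar pointwise bound works for the other operations appearing
in Lemma~5.4. In the case of the class of products
${\Cal G}_1\cdot{\Cal G}_2$, for instance, one can exploit that all
functions are bounded by one, hence
$$
|g_1g_2-\bar g_1\bar g_2|
\le|g_1|\,|g_2-\bar g_2|+|\bar g_2|\,|g_1-\bar g_1|
\le|g_1-\bar g_1|+|g_2-\bar g_2|.
$$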
\medskip
The above results enable us to construct some $L_2$-dense classes of
functions. We give an example in the following Example~5.5,
which is a consequence of Theorem~5.2 and Lemma~5.4.
\medskip\noindent
{\bf Example 5.5.} {\it Take $m$ measurable functions $f_j(x)$,
$1\le j\le m$, on a measurable space $(X,{\Cal X})$ which
have the property $\sup\limits_{x\in X}|f_j(x)|\le1$ for all
$1\le j\le m$. Let ${\Cal D}$ be a Vapnik--\v{C}ervonenkis class
consisting of measurable subsets of the set $X$. Define for all
pairs $f_j$, $1\le j\le m$, and $D\in{\Cal D}$ the function
$f_{j,D}(\cdot)$ as $f_{j,D}(x)=f_j(x)$ if $x\in D$, and
$f_{j,D}(x)=0$ if $x\notin D$, i.e. $f_{j,D}(\cdot)$ is the
restriction of the function $f_j(\cdot)$ to the set~$D$. The set
of functions $f_{j,D}$, $1\le j\le m$, $D\in{\Cal D}$, is an
$L_2$-dense class of functions.}
\medskip
Beside this, Theorem~5.3 helps us to construct
Vapnik--\v{C}ervonenkis classes of sets. Let me also remark that it
follows from the result of this section that the random variables
considered in Lemma~4.5 are not only countably approximable, but
the class of functions $f_{u_1,\dots,u_k,v_1,\dots,v_k}$
appearing in their definition is $L_2$-dense.
\beginsection 6. The proof of Theorems 4.1 and 4.2 on the
supremum of random sums.
In this section we prove Theorem~4.2, an estimate about the tail
distribution of the supremum of an appropriate class of Gaussian
random variables with the help of a method, called the chaining
argument. We also investigate the proof of Theorem~4.1 which can
be considered as a version of Theorem~4.2 about the supremum of
partial sums of independent and identically distributed random
variables. The chaining argument is not a strong enough method
to prove Theorem~4.1, but it enables us to prove a weakened form
of it formulated in Proposition~6.1. This result turned out to
be useful in the proof of Theorem~4.1. It enables us to reduce
the proof of Theorem~4.1 to a simpler statement formulated in
Proposition~6.2. In this section we prove Proposition~6.1,
formulate Proposition~6.2, and reduce the proof of Theorem~4.1
with the help of Proposition~6.1 to this result. The proof of
Proposition~6.2 which demands different arguments is postponed
to the next section. Before presenting the proofs of this section
I briefly describe the chaining argument.
Let us consider a countable class of functions ${\Cal F}$ on a
probability space $(X,{\Cal X},\mu)$ which is $L_2$-dense with
respect to the probability measure~$\mu$. Let us have either a
class of Gaussian random variables $Z(f)$ with zero
expectation such that $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\Cal F}$, or a set of normalized partial sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in{\Cal F}$,
where $\xi_1,\dots,\xi_n$ is a sequence of independent $\mu$
distributed random variables with values in the space
$(X,{\Cal X})$, and assume that $Ef(\xi_j)=0$ for all
$f\in{\Cal F}$. We want to get a good estimate on the
probability $P\left(\sup\limits_{f\in{\Cal F}}Z(f)>u\right)$ or
$P\left(\sup\limits_{f\in{\Cal F}}S_n(f)>u\right)$ if the class of
functions~${\Cal F}$ has some nice properties. The chaining
argument suggests to prove such an estimate in the following way.
Let us try to find an appropriate sequence of subsets
${\Cal F}_1\subset{\Cal F}_2\subset\cdots\subset{\Cal F}$ such that
$\bigcup\limits_{N=1}^\infty{\Cal F}_N={\Cal F}$, each ${\Cal F}_N$ is
a set of functions from ${\Cal F}$ with relatively few
elements for which
$\inf\limits_{f\in{\Cal F}_N}\int (f-\bar f)^2\,d\mu\le\delta_N^2$
with an appropriately chosen number $\delta_N$ for all functions
$\bar f\in{\Cal F}$, and let us give a good estimate on the
probability $P\left(\sup\limits_{f\in{\Cal F}_N}Z(f)>u_N\right)$ or
$P\left(\sup\limits_{f\in{\Cal F}_N}S_n(f)>u_N\right)$
for all $N=1,2,\dots$
with an appropriately chosen monotone increasing sequence $u_N$
such that $\lim\limits_{N\to\infty} u_N=u$.
We can get a relatively good estimate under appropriate conditions
for the class of functions~${\Cal F}$ by choosing the classes of
functions ${\Cal F}_N$ and numbers $\delta_N$ and $u_N$ in an
appropriate way. We try to bound the difference of the probabilities
$$
P\left(\sup_{f\in{\Cal F}_{N+1}}Z(f)>u_{N+1}\right)
-P\left(\sup_{f\in{\Cal F}_N}Z(f)>u_N\right)
$$
or of the analogous difference if $Z(f)$ is replaced by $S_n(f)$.
For the sake of completeness define this difference also in the
case $N=1$ with the choice ${\Cal F}_0=\emptyset$, when the
second probability in this difference equals zero.
This probability can be estimated in a natural way by taking for
all functions $f_{j_{N+1}}\in{\Cal F}_{N+1}$ a function
$f_{j_N}\in{\Cal F}_N$ which is close to it, more explicitly
$\int (f_{j_{N+1}}-f_{j_N})^2\,d\mu\le\delta_N^2$, and
calculating the probability that the difference of the random
variables corresponding to these two functions is greater than
$u_{N+1}-u_N$. We can estimate these probabilities with the help
of some results which give a relatively good bound on the tail
distribution of $Z(g)$ or $S_n(g)$ if $\int g^2\,d\mu$ is small.
The sum of all such probabilities gives an upper bound for the
above considered difference of probabilities. Then we get an
estimate for the probability
$P\left(\sup\limits_{f\in{\Cal F}_N}Z(f)>u_N\right)$
for all $N=1,2,\dots$,
by summing up the above estimate, and we get a bound on the
probability we are interested in by taking the limit
$N\to\infty$. This method is called the chaining argument. It
got this name, because we estimate the contribution of a random
variable corresponding to a function
$f_{j_{N+1}}\in{\Cal F}_{N+1}$ to the bound of the probability we
investigate by taking the random variable corresponding to a
function $f_{j_N}\in{\Cal F}_N$ close to it, then we choose
another random variable corresponding to a function
$f_{j_{N-1}}\in{\Cal F}_{N-1}$ close to this function, and so on
we take a chain of subsequent functions and the random variables
corresponding to them.
First we show how this method supplies the proof of Theorem~4.2.
Then we turn to the investigation of Theorem~4.1. In the study of
this problem the above method does not work well, because if two
functions are very close to each other in the $L_2(\mu)$-norm,
then the Bernstein inequality (or an improvement of it) supplies
a much weaker estimate for the difference of the partial sums
corresponding to these two functions than the bound suggested
by the central limit theorem. On the other hand, we shall prove
a weaker version of Theorem~4.1 in Proposition~6.1 with the help
of the chaining argument. This result will also be useful for us.
\medskip\noindent
{\it Proof of Theorem 4.2.}\/ Let us list the elements of ${\Cal F}$
as $\{f_0,f_1,\dots\}={\Cal F}$, and choose for all $p=0,1,2,\dots$
a set of functions
${\Cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset{\Cal F}$
with $m_p\le (D+1)\,2^{2pL}\sigma^{-L}$ elements in such a way that
$\inf\limits_{1\le j\le m_p}
\int (f-f_{a(j,p)})^2\,d\mu\le 2^{-4p}\sigma^2$
for all $f\in{\Cal F}$, and let us also put the function $f_p$ into
${\Cal F}_p$; this guarantees that $\bigcup\limits_{p=0}^\infty
{\Cal F}_p={\Cal F}$. For all indices
$a(j,p)$ of the functions in ${\Cal F}_p$, \ $p=1,2,\dots$, define a
predecessor $a(j',p-1)$ from the indices of the set of functions
${\Cal F}_{p-1}$ in such a way that the functions $f_{a(j,p)}$ and
$f_{a(j',p-1)}$ satisfy the relation
$\int(f_{a(j,p)}-f_{a(j',p-1)})^2\,d\mu\le2^{-4(p-1)}\sigma^2$.
With the help of the behaviour of the standard normal distribution
function we can write the estimates
$$
\align
P(A(j,p))&=P\left(|Z(f_{a(j,p)})-Z(f_{a(j',p-1)})|
\ge 2^{-(1+p)}u\right)
\le 2\exp\left\{-\frac{2^{-2(p+1)}u^2}{2\cdot 2^{-4(p-1)}\sigma^2}
\right\}\\
&=2\exp\left\{-\frac{2^{2p}u^2}{128\sigma^2}\right\}, \quad 1\le j\le
m_p,\; p=1,2,\dots,
\endalign
$$
and
$$
P(B(j))=P\left(|Z(f_{a(j,0)})|\ge \frac u2\right)\le
2\exp\left\{-\frac {u^2}{8\sigma^2}\right\},
\quad 1\le j\le m_0.
$$
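The passage between the two forms of the exponent in the bound for $P(A(j,p))$ is elementary arithmetic, worth spelling out; it uses the variance bound $\int(f_{a(j,p)}-f_{a(j',p-1)})^2\,d\mu\le2^{-4(p-1)}\sigma^2$ coming from the choice of the predecessors:

```latex
$$
\frac{2^{-2(p+1)}u^2}{2\cdot 2^{-4(p-1)}\sigma^2}
=\frac{2^{-2p-2}}{2^{5-4p}}\cdot\frac{u^2}{\sigma^2}
=2^{2p-7}\,\frac{u^2}{\sigma^2}
=\frac{2^{2p}u^2}{128\sigma^2}.
$$
```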
The above estimates together with the relation
$\bigcup\limits_{p=0}^\infty
{\Cal F}_p={\Cal F}$, which implies that \hfill\break
$\{|Z(f)|\ge u\}\subset\bigcup\limits_{p=1}^\infty
\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$ for all $f\in{\Cal F}$, yield that
$$ \allowdisplaybreaks
\align
&P\left(\sup_{f\in{\Cal F}} |Z(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \\
&\qquad \le \sum_{p=1}^{\infty} 2(D+1)2^{2pL}
\sigma^{-L} \exp\left\{-\frac{2^{2p}u^2}{128\sigma^2} \right\}
+2(D+1)\sigma^{-L} \exp\left\{-\frac {u^2}{8\sigma^2}\right\}.
\endalign
$$
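The event inclusion used in the first step above rests on a telescoping decomposition along the chain of predecessors: if $f=f_{a(j_P,P)}\in{\Cal F}_P$ for some $P$, and $a(j_{P-1},P-1),\dots,a(j_0,0)$ denote the successive predecessors of its index, then

```latex
$$
Z(f)=Z(f_{a(j_0,0)})
+\sum_{p=1}^{P}\left(Z(f_{a(j_p,p)})-Z(f_{a(j_{p-1},p-1)})\right),
\qquad
\frac u2+\sum_{p=1}^{P}2^{-(1+p)}u<u,
$$
```

so on the complement of all the events $A(j,p)$ and $B(s)$ the inequality $|Z(f)|<u$ holds.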
If $u\ge ML^{1/2}\sigma\log^{1/2}\frac2\sigma$ with $M\ge16$ (and
$L\ge1$ and $0<\sigma\le1$), then
$$
2^{2pL}\sigma^{-L}\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2}
\right\}\le2^{2pL}\sigma^{-L}\left(\frac\sigma
2\right)^{2^{2p}M^2L/256}\le 2^{-pL}\le2^{-p}
$$
for all $p=0,1,\dots$, hence the previous inequality implies that
$$
P\left(\sup_{f\in{\Cal F}}|Z(f)|\ge u\right)
\le 2(D+1)\sum\limits_{p=0}^\infty 2^{-p}
\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2} \right\}\le4(D+1)
\exp\left\{-\frac{u^2}{256\sigma^2} \right\}.
$$
Theorem 4.2 is proved.
\medskip
With an appropriate choice of the bound of the integrals in the
definition of the sets ${\Cal F}_p$ in the proof of Theorem~4.2 and
some additional calculation it can be proved that the coefficient
$\frac1{256}$ in the exponent on the right-hand side of (4.7) can be
replaced by $\frac{1-\varepsilon}2$ with an arbitrarily small
$\varepsilon>0$ if the
remaining (universal) constants in this estimate are chosen
sufficiently large.
The proof of Theorem 4.2 was based on a sufficiently good estimate on
the probabilities $P(|Z(f)-Z(g)|>u)$ for pairs of functions
$f,g\in{\Cal F}$ and numbers $u>0$. In the case of Theorem~4.1 only a
weaker bound can be given for the corresponding probabilities. There
is no good estimate on the tail distribution of the difference
$S_n(f)-S_n(g)$ if its variance is small. As a consequence, the
chaining argument supplies only a weaker result in this case. This
result, where the tail distribution of the supremum of the normalized
random sums $S_n(f)$ is estimated on a relatively dense subset of the
class of functions $f\in{\Cal F}$ in the $L_2(\mu)$ norm, will
be given in Proposition~6.1. Another result will be formulated in
Proposition~6.2 whose proof is postponed to the next section. It will
be shown that Theorem~4.1 follows from Propositions~6.1 and~6.2.
Before the formulation of Proposition~6.1 I recall an estimate which
is a simple consequence of Bernstein's inequality. If
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$ is the
normalized sum of
independent, identically distributed random variables, $P(|f(\xi_1)|\le1)=1$,
$Ef(\xi_1)=0$, $Ef(\xi_1)^2\le\sigma^2$, then there exists some
constant $\alpha>0$ such that
$$
P(|S_n(f)|>u)\le 2e^{-\alpha u^2/\sigma^2}
\quad \text{if}\quad 0<u\le\sqrt n\,\sigma^2. \tag6.1
$$
In Proposition~6.1 we shall estimate the probability
$P\left(\sup\limits_{f\in{\Cal F}_{\bar\sigma}}|S_n(f)|
>\frac u{\bar A}\right)$
with some parameter~$\bar A>1$ where ${\Cal F}_{\bar\sigma}$ is an
appropriate finite subset of a set of functions~${\Cal F}$ satisfying
the conditions of Theorem~4.1. We can give a good estimate for the
above probability not for all $u>0$, but only for such numbers~$u$
which are in an appropriate interval depending on the
parameter~$\sigma$ appearing in condition~(4.2) of Theorem~4.1
and the parameter~$\bar A$ we chose in Proposition~6.1. This fact
is closely related to the condition imposed on the number~$u$ in
formula~(4.4) of Theorem~4.1. The choice of the set of functions
${\Cal F}_{\bar\sigma}\subset{\Cal F}$ depends of the number~$u$
appearing in the probability we want to estimate. It is such a subset
of relatively small cardinality of ${\Cal F}$ whose $L_2(\mu)$-norm
distance from all elements of ${\Cal F}$ is less than
$\bar\sigma=\bar\sigma(u)$ with an appropriately defined number
$\bar\sigma(u)$. To reduce the proof of Theorem~4.1 to that of
Proposition~6.2 which will be formulated later we still need some
upper and lower bounds on the value of $\bar\sigma(u)$.
Proposition~6.1 also contains such estimates.
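To indicate where an estimate of type (6.1) comes from, here is a sketch assuming the classical form of Bernstein's inequality for independent, mean zero random variables bounded by 1 with variance at most $\sigma^2$ (the precise value of $\alpha$ plays no role in the sequel):

```latex
$$
P(|S_n(f)|>u)
=P\left(\left|\sum_{j=1}^n f(\xi_j)\right|>\sqrt n\,u\right)
\le 2\exp\left\{-\frac{nu^2}
{2\left(n\sigma^2+\frac{\sqrt n\,u}3\right)}\right\}
\le 2e^{-3u^2/8\sigma^2},
$$
```

since in the range $0<u\le\sqrt n\,\sigma^2$ we have $\sqrt n\,u\le n\sigma^2$, hence the denominator is at most $\frac83 n\sigma^2$; thus for instance $\alpha=\frac38$ suffices.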
\medskip\noindent
{\bf Proposition 6.1.} {\it Let us have a countable $L_2$-dense
class of functions ${\Cal F}$ with parameter $D\ge1$ and
exponent~$L\ge1$ with respect to some probability measure~$\mu$ on
a measurable space $(X,{\Cal X})$ whose elements
satisfy relations~(4.1), (4.2) and~(4.3) with this probability
measure $\mu$ and real number $0<\sigma\le1$. Take
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and define the normalized random
sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{l=1}^nf(\xi_l)$, for all
$f\in {\Cal F}$. Let us fix some number $\bar A\ge1$. There exists
some number $M=M(\bar A)$ such that with these parameters~$\bar A$
and~$M=M(\bar A)\ge1$ the following relations hold.
For all numbers $u>0$ such that
$n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M(L\log\frac2\sigma+\log D)$ a number
$\bar\sigma=\bar\sigma(u)$,
$0\le\bar\sigma\le \sigma\le1$, and a collection of functions
${\Cal F}_{\bar\sigma}=\{f_1,\dots,f_m\}\subset {\Cal F}$ with
$m\le D\bar\sigma^{-L}$ elements can be chosen in such a way that
the sets ${\Cal D}_j=\{f\colon\; f\in {\Cal F},\int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, satisfy the relation
$\bigcup\limits_{j=1}^m{\Cal D}_j={\Cal F}$, and the normalized random
sums $S_n(f)$, $f\in{\Cal F}_{\bar\sigma}$, $n\ge2$, satisfy
the inequality
$$
P\left(\sup_{f\in{\Cal F}_{\bar\sigma}} |S_n(f)|\ge\frac u{\bar A}\right)
\le 4\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}
\quad \text{if } n\sigma^2\ge(\tfrac u\sigma)^2
\ge M(L\log\tfrac2\sigma+\log D)
\tag6.2
$$
with the constants $\alpha$ in formula~(6.1) and the exponent $L$
and parameter $D$ of the $L_2$-dense class ${\Cal F}$, and also the
inequality $\frac1{16}(\frac u{\bar A\bar\sigma})^2\ge n\bar\sigma^2
\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^2$ holds with the
number~$\bar\sigma=\bar\sigma(u)$. If the number~$u$ also satisfies
the inequality
$$
n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M\left(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}\right) \tag6.3
$$
with a sufficiently large number $M=M(\bar A)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too. (Formula~(6.3) is a
stronger restriction than the previous condition imposed on the
number~$(\frac u\sigma)^2$, since it contains constants $L^{3/2}$
and $(\log D)^{3/2}$ instead of~ $L$ and~$\log D$, and the constant
$M=M(\bar A)$ can be chosen larger in it.)}
\medskip
Proposition~6.1 helps to reduce the proof of Theorem~4.1 to the
case when the $L_2$-norm of the functions in the class ${\Cal F}$
is bounded by a relatively small number $\bar\sigma$. In more
detail, the proof of Theorem~4.1 can be reduced to a good
estimate on the distribution of the supremum of random variables
$\sup\limits_{f\in {\Cal D}_j}|S_n(f-f_j)|$ for all classes ${\Cal D}_j$,
$1\le j\le m$, by means of Proposition~6.1. We also have to know
that the number~$m$ of the classes ${\Cal D}_j$ is not too large.
Beside this, we need some estimates on the number $\bar\sigma$
which is the upper bound of the $L_2$-norm of the functions
$f-f_j$, $f\in{\Cal D}_j$. To get such bounds for $\bar\sigma$ that
we need in the applications of Proposition~6.1 we introduced a
large parameter~$\bar A$ in the formulation of Proposition~6.1
and imposed a condition with a sufficiently large
number~$M=M(\bar A)$ in formula~(6.3). This condition reappears
in Theorem~4.1 in the conditions of the estimate~(4.4).
Let me remark that one of the inequalities the number
$\bar\sigma$ introduced in Proposition~6.1 satisfies has the
consequence $u>\text{const.}\,\sqrt n\bar\sigma^2$ with an
appropriate
constant, and we want to estimate the probability
$P\left(\sup\limits_{f\in{\Cal F}} |S_n(f)|>u\right)$ with
this number~$u$ and a
class of functions~${\Cal F}$ whose $L_2$ norm is bounded
by~$\bar\sigma$. Formula~(6.1), which will be applied in the
proof of Proposition~6.1, holds under the condition
$u\le\sqrt n\,\sigma^2$, which is an inequality in the opposite
direction. Hence to complete the proof of Theorem~4.1 with the
help of Proposition~6.1 we need a result whose proof demands an
essentially different method. Proposition~6.2 formulated below
is such a result. I shall show that Theorem~4.1 is a consequence
of Propositions~6.1 and~6.2. Proposition~6.1 is proved at the
end of this section, while the proof of Proposition~6.2 is
postponed to the next section.
\medskip\noindent
{\bf Proposition 6.2.} {\it Let us have a probability measure $\mu$
on a measurable space $(X,{\Cal X})$ together with a sequence of
independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and a countable, $L_2$-dense class of
functions $f=f(x)$ on $(X,{\Cal X})$ with some parameter $D\ge1$ and
exponent $L\ge1$ which satisfies conditions~(4.1), (4.2) and~(4.3)
with some $0<\sigma\le1$ such that the inequality
$n\sigma^2>L\log n+\log D$ holds. Then there exists
a threshold index $A_0\ge5$ such that the normalized random sums
$S_n(f)$, $f\in {\Cal F}$, introduced in Theorem~4.1 satisfy the
inequality
$$
P\left(\sup_{f\in{\Cal F}}|S_n(f)|\ge A n^{1/2}\sigma^2\right)\le
e^{-A^{1/2}n\sigma^2/2}\quad \text{if } A\ge A_0. \tag6.4
$$
}
\medskip
I did not try to find optimal parameters in formula (6.4). Even the
coefficient $-A^{1/2}$ in the exponent on its right-hand side could
be improved. The result of Proposition~6.2 is similar to that of
Theorem~4.1. Both of them give an estimate on a probability of the
form $P\left(\sup\limits_{f\in{\Cal F}}|S_n(f)|\ge u\right)$ with
some class of functions~${\Cal F}$. The essential difference
between them is that in Theorem~4.1 this probability is considered
for $u\le n^{1/2}\sigma^2$ while in Proposition~6.2 the case
$u=A n^{1/2}\sigma^2$ with $A\ge A_0$ is taken, where $A_0$ is a
sufficiently large positive number. Let us observe that in this
case no good Gaussian type estimate can be given for the
probabilities $P(S_n(f)\ge u)$, $f\in{\Cal F}$. In this case
Bernstein's inequality yields for the probability
$P(S_n(f)>An^{1/2}\sigma^2)=
P\left(\sum\limits_{l=1}^nf(\xi_l)>An\sigma^2\right)$ only a bound
whose exponent is of order $An\sigma^2$, i.e.\ linear and not
quadratic in the parameter~$A$.
\medskip\noindent
{\it Proof of Proposition 6.1.}\/ Choose the sets of functions
${\Cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset{\Cal F}$ with
$m_p\le D\,2^{2pL}\sigma^{-L}$ elements together with the
predecessors of the indices of their elements in the same way as in
the proof of Theorem~4.2, and define the events
$A(j,p)=\left\{|S_n(f_{a(j,p)})-S_n(f_{a(j',p-1)})|
\ge2^{-(1+p)}\frac u{\bar A}\right\}$, $1\le j\le m_p$, and
$B(s)=\left\{|S_n(f_{a(s,0)})|\ge\frac u{2\bar A}\right\}$,
$1\le s\le m_0$, whose probabilities can be estimated by means of
formula~(6.1). Choose $R$ as the largest number~$p$ for which
$n\sigma^2\ge2^{6p}\left(\frac{u}{16\bar A\sigma}\right)^2$, define
$\bar\sigma^2=2^{-4R}\sigma^2$ and
${\Cal F}_{\bar\sigma}={\Cal F}_R$.
(As $n\sigma^2\ge\left(\frac u\sigma\right)^2$ and $\bar A\ge1$ by our
conditions, there exists such a number $R\ge1$. The number~$R$
was chosen as the largest number~$p$ for which the second
relation of formula~(6.7) holds.) Then the cardinality~$m$ of
the set ${\Cal F}_{\bar\sigma}$ equals $m_R\le D2^{2RL}\sigma^{-L}
=D\bar\sigma^{-L}$, and the sets ${\Cal D}_j$ are
${\Cal D}_j=\{f\colon\; f\in{\Cal F},\int (f_{a(j,R)}-f)^2\,d\mu\le
2^{-4R}\sigma^2\}$, $1\le j\le m_R$, hence $\bigcup\limits_{j=1}^m
{\Cal D}_j={\Cal F}$. Beside this, with our choice of the number $R$
inequalities (6.7) and (6.8) can be applied for $1\le p\le R$.
Hence the definition of the predecessor of an index $(j,p)$ implies
that
$\left\{\omega\colon\;\sup\limits_{f\in{\Cal F}_{\bar\sigma}}
|S_n(f)(\omega)|\ge
\frac u{\bar A}\right\}\subset
\bigcup\limits_{p=1}^R\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$, and
$$
\align
&P\left(\sup_{f\in{\Cal F}_{\bar\sigma}} |S_n(f)|\ge
\frac u{\bar A}\right)
\le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&\qquad \le
\sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))+\sum_{s=1}^{m_0}P(B(s))
\le \sum_{p=1}^{\infty} 2D\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}\\
&\qquad\qquad +2D\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^2\right\}.
\endalign
$$
If the relation $(\frac u\sigma)^2\ge M(L\log\frac2\sigma+\log D)$
holds with a sufficiently large constant~$M$ (depending on $\bar A$),
and $\sigma\le1$, then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp
\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^2 \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^2\right\}
\le\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}.
$$
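To check, e.g., the second of these inequalities, observe that it is equivalent to the bound

```latex
$$
D\sigma^{-L}\le\exp\left\{\alpha\left(\frac14-\frac1{100}\right)
\left(\frac u{\bar A\sigma}\right)^2\right\}
=\exp\left\{\frac{6\alpha}{25}
\left(\frac u{\bar A\sigma}\right)^2\right\},
$$
```

and since $\log(D\sigma^{-L})\le L\log\frac2\sigma+\log D$, this holds under the condition $(\frac u\sigma)^2\ge M(L\log\frac2\sigma+\log D)$ if $M=M(\bar A)$ is chosen sufficiently large.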
Hence the previous estimate implies that
$$
\align
&P\left(\sup_{f\in{\Cal F}_{\bar\sigma}}
|S_n(f)|\ge \frac u{\bar A}\right)
\le\sum_{p=1}^{\infty}2\cdot 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^2
\right\}\\
&\qquad +2\exp\left\{-\alpha
\left(\frac u{10\bar A \sigma}\right)^2\right\}
\le 4 \exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^2\right\},
\endalign
$$
and relation (6.2) holds.
As $\sigma^2=2^{4R}\bar\sigma^2$ the inequality
$$
2^{-4R}\cdot\frac{2^{6R}}{256}\left(\frac{u}{\bar A\sigma}\right)^2\le
n\bar\sigma^2=2^{-4R} n\sigma^2\le
2^{-4R}\cdot\frac{2^{6(R+1)}}{256}
\left(\frac{u}{\bar A\sigma}\right)^2=
\frac14\cdot 2^{-2R}\left(\frac{u}{\bar A\bar \sigma}\right)^2
$$
holds, and this implies (together with the relation $R\ge1$) that
$$
\frac1{64}\left(\frac u{\bar A\sigma}\right)^2\le n\bar\sigma^2
\le\frac1{16}\left(\frac{u}{\bar A \bar\sigma}\right)^2,
$$
as we have claimed. It remains to show that under
condition~(6.3) the inequality $n\bar\sigma^2\ge L\log n+\log D$ holds.
This inequality clearly holds under the conditions of Proposition~6.1
if $\sigma\le n^{-1/3}$, since in this case $\log\frac2\sigma\ge
\frac{\log n}3$, and
$n\bar\sigma^2\ge\frac1{64}(\frac u {\bar A\sigma})^2
\ge\frac1{64\bar A^2} M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})\ge
\frac1{192\bar A^2} M(L^{3/2}\log n+(\log D)^{3/2})\ge L\log n+\log D$
if $M\ge M_0(\bar A)$ with a sufficiently large number $M_0(\bar A)$.
If $\sigma\ge n^{-1/3}$, we can exploit that the inequality
$2^{6R}\left(\frac u{\bar A\sigma}\right)^2 \le256n\sigma^2$ holds
because of the definition of the number~$R$. It can be rewritten as
$2^{-4R}\ge 2^{-16/3}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^2}
{n\sigma^2}\right]^{2/3}$. Hence
$n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1/3}\left(\frac u\sigma\right)^{4/3}$. As
$\log\frac2\sigma\ge\log2>\frac12$ the inequalities
$n\sigma^2\ge n^{1/3}$ and $(\frac u\sigma)^2\ge
M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold. They yield that
$$
\align
n\bar\sigma^2&\ge\frac{\bar A^{-4/3}}{50} (n\sigma^2)^{1/3}\left(\frac
u\sigma\right)^{4/3}
\ge\frac{\bar A^{-4/3}}{50}n^{1/9}\left(\frac M2\right)^{2/3}
(L^{3/2}+(\log D)^{3/2})^{2/3}\\
&\ge\frac{M^{2/3}n^{1/9}(L+\log D)}{100\bar A^{4/3}}\ge L\log n+\log D
\endalign
$$
if $M=M(\bar A)$ is chosen sufficiently large.
\beginsection 7. The completion of the proof of Theorem 4.1.
This section contains the proof of Proposition 6.2 with the help of
a symmetrization argument; this completes the proof of Theorem~4.1.
By symmetrization argument I mean the reduction of the investigation
of sums of the form $\sum f(\xi_j)$ to sums of the form
$\sum\varepsilon_jf(x_j)$, where $\varepsilon_j$ are
independent random variables,
independent also of the random variables $\xi_j$, and
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$.
First a symmetrization lemma is proved, and then with the help of
this result and a conditioning argument the proof of Proposition~6.2
is reduced to the estimation of a probability which can be bounded by
means of the Hoeff\-ding inequality formulated in Theorem 3.4. Such
an approach makes it possible to prove Proposition~6.2.
First I formulate the symmetrization lemma we shall apply.
\medskip\noindent
{\bf Lemma 7.1. (Symmetrization Lemma).} {\it Let $Z_n$ and $\bar
Z_n$, $n=1,2,\dots$, be two sequences of random variables
independent of each other, and let the random variables $\bar Z_n$,
$n=1,2,\dots$, satisfy the inequality
$$
P(|\bar Z_n|\le\alpha)\ge\beta\quad \text{for all } n=1,2,\dots \tag7.1
$$
with some numbers $\alpha>0$ and $\beta>0$. Then
$$
P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
\le\frac1\beta P\left(\sup\limits_{1\le
n<\infty}|Z_n-\bar Z_n|>u\right)\quad \text{for all } u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma 7.1.}\/ Put $\tau=\min\{n\colon\; |Z_n|>u+\alpha\}$
if there exists such an index $n$, and $\tau=0$ otherwise. Then the
event $\{\tau=n\}$ is independent of the sequence of random variables
$\bar Z_1,\bar Z_2,\dots$ for all $n=1,2,\dots$, and because of this
independence
$$
P(\{\tau=n\})\le\frac1\beta P(\{\tau=n\}\cap\{|\bar Z_n|\le\alpha\})
\le \frac1\beta P(\{\tau=n\}\cap\{|Z_n-\bar Z_n|>u\})
$$
for all $n=1,2,\dots$. Hence
$$
\align
&P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
=\sum_{l=1}^\infty P(\tau=l)
\le \frac1\beta
\sum_{l=1}^\infty P(\{\tau=l\}\cap\{|Z_l-\bar Z_l|>u\}) \\
&\qquad \le \frac1\beta \sum_{l=1}^\infty
P(\{\tau=l\}\cap\{\sup_{1\le n<\infty}|Z_n-\bar Z_n|>u\})
\le\frac1\beta P\left(\sup\limits_{1\le n<\infty}
|Z_n-\bar Z_n|>u\right).
\endalign
$$
Lemma 7.1 is proved.
\medskip
We shall apply the following Lemma~7.2 which is a consequence of the
symmetrization lemma.
\medskip\noindent
{\bf Lemma 7.2.} {\it Let us fix a countable class of functions
${\Cal F}$ on a measurable space $(X,{\Cal X})$ together with a real
number $0<\sigma<1$. Consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(X,{\Cal X})$ such
that $Ef(\xi_1)=0$, $Ef^2(\xi_1)\le\sigma^2$ for all $f\in{\Cal F}$
together with another sequence $\varepsilon_1,\dots,\varepsilon_n$
of independent
random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, independent also of the random sequence
$\xi_1,\dots,\xi_n$. Then
$$
\aligned
&P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\Cal F}}\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A
n^{1/2}\sigma^{2}\right) \\
&\qquad \le 4P\left(\frac1{\sqrt n}\sup\limits_{f\in{\Cal F}}
\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\quad\text{if } A\ge \frac{3\sqrt2}{\sqrt n\sigma}.
\endaligned \tag7.2
$$
}
\medskip\noindent
{\it Proof of Lemma 7.2.}\/ Let us construct an independent copy
$\bar\xi_1,\dots,\bar\xi_n$ of the sequence $\xi_1,\dots,\xi_n$ in
such a way that all three sequences $\xi_1,\dots,\xi_n$, \
$\bar\xi_1,\dots,\bar\xi_n$ and $\varepsilon_1,\dots,\varepsilon_n$
are independent.
Define the random variables $S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n
f(\xi_j)$ and $\bar S_n(f)=\frac1{\sqrt n}
\sum\limits_{j=1}^n f(\bar\xi_j)$
for all $f\in{\Cal F}$. The inequality
$$
P\left(\sup_{f\in{\Cal F}}|S_n(f)|> A\sqrt n\sigma^2\right)\le
2P\left(\sup_{f\in{\Cal F}}|S_n(f)-\bar S_n(f)|> \frac23 A\sqrt
n\sigma^2\right). \tag7.3
$$
follows from Lemma~7.1 if it is applied for the countable set of
random variables $Z_n(f)=S_n(f)$ and $\bar Z_n(f)=\bar S_n(f)$,
$f\in{\Cal F}$, and the numbers $u=\frac23 A\sqrt n\sigma^2$ and
$\alpha=\frac13A\sqrt n\sigma^2$, since the random fields $S_n(f)$
and $\bar S_n(f)$ are independent, and
$P(|\bar S_n(f)|\le\alpha)>\frac12$ for all $f\in{\Cal F}$. Indeed,
$\alpha=\frac13 A\sqrt n\sigma^2\ge\sqrt2\sigma$, $E\bar S_n(f)^2
\le\sigma^2$, thus Chebyshev's inequality implies that
$P(|\bar S_n(f)|\le\alpha)\ge P(|\bar S_n(f)|\le\sqrt2\sigma)
\ge\frac12$ for all $f\in{\Cal F}$.
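The Chebyshev step spelled out: as $E\bar S_n(f)=0$ and $E\bar S_n(f)^2\le\sigma^2$,

```latex
$$
P\left(|\bar S_n(f)|>\sqrt2\,\sigma\right)
\le\frac{E\bar S_n(f)^2}{2\sigma^2}
\le\frac{\sigma^2}{2\sigma^2}=\frac12,
$$
```

which gives $P(|\bar S_n(f)|\le\sqrt2\,\sigma)\ge\frac12$.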
Let us observe that the random field
$$
S_n(f)-\bar S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\Cal F}, \tag7.4
$$
and its randomization
$$
\frac1{\sqrt n}\sum_{j=1}^n \varepsilon_j \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\Cal F}, \tag$7.4'$
$$
have the same distribution. Indeed, even the conditional distribution
of ($7.4'$) under the condition that the values of the
$\varepsilon_j$-s are
prescribed agrees with the distribution of (7.4) for all possible
values of the $\varepsilon_j$-s. This follows from the observation
that the
distribution of the random field~(7.4) does not change if we exchange
the random variables $\xi_j$ and $\bar\xi_j$ for those indices $j$
for which $\varepsilon_j=-1$ and do not change them for those
indices~$j$ for which $\varepsilon_j=1$. On the other hand, the
distribution of the random
field obtained in such a way agrees with the conditional distribution
of the random field defined in ($7.4'$) under the condition that the
values of the random variables $\varepsilon_j$ are prescribed.
The above relation together with formula (7.3) imply that
$$ \allowdisplaybreaks
\align
&P\left(\frac1{\sqrt n}\sup_{f\in{\Cal F}}\left|\sum_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right)\\
&\qquad \le 2P\left(\frac1{\sqrt n}
\sup_{f\in{\Cal F}}\left|\sum_{j=1}^n
\varepsilon_j\left[f(\xi_j)-f(\bar\xi_j)\right]\right| \ge\frac23 A
n^{1/2}\sigma^{2}\right) \\
&\qquad\le 2P\left(\frac1{\sqrt n}\sup_{f\in{\Cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3 n^{1/2}\sigma^{2}\right) \\
&\qquad\qquad+ 2P\left(\frac1{\sqrt n}\sup_{f\in{\Cal F}}\left|
\sum_{j=1}^n \varepsilon_jf(\bar\xi_j)\right|
\ge\frac A3n^{1/2}\sigma^{2}\right) \\
&\qquad=4P\left(\frac1{\sqrt n}\sup_{f\in{\Cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3n^{1/2}\sigma^{2}\right).
\endalign
$$
Lemma~7.2 is proved.
\medskip
First I try to explain the method of proof of Proposition~6.2.
A probability of the form
$P\left(n^{-1/2}\sup\limits_{f\in{\Cal F}}
\left|\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
has to be estimated. Lemma~7.2 enables us to replace this problem
by the estimation of the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\Cal F}}\left| \sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right|>\frac u3\right)$ with some
independent random variables $\varepsilon_j$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $j=1,\dots,n$,
which are also independent of the random variables $\xi_j$. We
shall bound the conditional probability of the event appearing in
this modified problem under the condition that each
random variable $\xi_j$ takes a prescribed value. This can be done
with the help of Hoeffding's inequality formulated in Theorem~3.4
and the $L_2$-density property of the class of functions ${\Cal F}$
we consider. In this way we hope to get a sharp estimate, similar
to the result we obtained in the study of the Gaussian counterpart
of this problem, because Hoeffding's inequality always yields a
Gaussian type upper bound for the tail distribution of the random
sum we are studying.
Nevertheless, there appears a problem when we try to apply such an
approach. To get a good estimate on the conditional tail distribution
of the supremum of the random sums we are studying with the help of
Hoeffding's inequality we need a good estimate on the supremum of
the conditional variances of the random sums we are studying, i.e.
on the tail distribution of
$\sup\limits_{f\in{\Cal F}}\frac1n\sum\limits_{j=1}^n f^2(\xi_j)$.
This problem is similar to the original one, and it is not simpler.
But a more detailed study shows that our approach to get a good
estimate with the help of Hoeffding's inequality works. In
comparing our original problem with the new, complementary problem
we have to understand at which level we need a good estimate on the
tail distribution of the supremum in the complementary problem to
get a good tail distribution estimate at level~$u$ in the original
problem. A detailed study shows that to bound the probability in
the original problem with parameter~$u$ we have to estimate the
probability
$P\left(n^{-1/2}\sup\limits_{f\in{\Cal F}'}\left|
\sum\limits_{j=1}^n f(\xi_j)\right|>u^{1+\alpha}\right)$ with
some new nice, appropriately defined $L_2$-dense class of
bounded functions ${\Cal F}'$ and some
number $\alpha>0$. We shall exploit that the number~$u$ is
replaced by a larger number $u^{1+\alpha}$ in the new problem. Let
us also observe that if the sum of bounded random variables is
considered, then for very large values~$u$ the probability we
investigate equals zero. On the basis of these observations an
appropriate backward induction procedure can be worked out. In its
$n$-th step we give a good upper bound on the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\Cal F}}\left|
\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
if $u\ge T_n$ with an appropriately chosen number~$T_n$, and try
to diminish the number~$T_n$ in each step of this induction
procedure. We can prove Proposition~6.2 as a consequence of the
result we get by means of this backward induction procedure. To
work out the details we introduce the following notion.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of normalized
random sums.}
{\it Let us have some measurable space $(X,{\Cal X})$ and a
probability measure $\mu$ on it together with some integer $n\ge2$
and real number $\sigma>0$. Consider some class ${\Cal F}$ of
functions $f(x)$ on the space $(X,{\Cal X})$, and take a sequence of
independent and $\mu$-distributed random variables $\xi_1,\dots,\xi_n$
with values in the space $(X,{\Cal X})$. Define the normalized random
sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in {\Cal F}$.
Given some real number $T>0$ we say that the set of normalized
random sums $S_n(f)$, $f\in{\Cal F}$,
has a good tail behaviour at level~$T$ (with parameters $n$ and
$\sigma^2$ which will be fixed in the sequel) if the inequality
$$
P\left(\sup_{f\in{\Cal F}}|S_n(f)|\ge A \sqrt n\sigma^2\right) \le
\exp\left\{-A^{1/2}n\sigma^2 \right\} \tag7.5
$$
holds for all numbers $A>T$.}
\medskip
Now I formulate Proposition 7.3 and show that Proposition 6.2
follows from it.
\medskip\noindent
{\bf Proposition 7.3.} {\it Let us fix a positive integer~$n\ge2$,
a real number $0<\sigma\le1$ and a probability measure $\mu$ on a
measurable space $(X,{\Cal X})$ together with some numbers $L\ge1$
and $D\ge1$ such that $n\sigma^2\ge L\log n+\log D$. Let us
consider those countable $L_2$-dense classes ${\Cal F}$ of functions
$f=f(x)$ on the space $(X,{\Cal X})$ with exponent~$L$ and
parameter~$D$ for which all functions $f\in{\Cal F}$ satisfy the
conditions
$\sup\limits_{x\in X}|f(x)|\le\frac14$, $\int f(x)\mu(\,dx)=0$
and $\int f^2(x)\mu(\,dx)\le\sigma^2$.
Let a number $T>1$ be such that for all classes of functions
${\Cal F}$ which satisfy the above conditions the set of normalized
random sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$,
$f\in{\Cal F}$, defined with the help of a sequence of independent
$\mu$-distributed random variables $\xi_1,\dots,\xi_n$ has a good
tail behaviour at level~$T^{4/3}$. There is a universal
constant~$\bar A_0$ such that if $T\ge\bar A_0$, then the sets of the
above defined normalized sums $S_n(f)$, $f\in{\Cal F}$, have a good
tail behaviour for all such classes of functions ${\Cal F}$ not
only at level $T^{4/3}$ but also at level~$T$.}
\medskip
Proposition~6.2 simply follows from Proposition~7.3. To show this
let us first observe that a class of normalized random sums
$S_n(f)$, $f\in{\Cal F}$, has a good tail behaviour at level
$T_0=\frac1{4\sigma^2}$ if this class of functions ${\Cal F}$
satisfies the conditions of Proposition~7.3. Indeed, in this
case
$P\left(\sup\limits_{f\in{\Cal F}}|S_n(f)|\ge A\sqrt n\sigma^2\right)
\le P\left(\sup\limits_{f\in{\Cal F}}|S_n(f)|>\frac{\sqrt n}4\right)=0$
for all
$A>T_0$. Then the repeated application of Proposition~7.3 yields
that a class of random sums $S_n(f)$, $f\in{\Cal F}$, has a good tail
behaviour at all levels $T\ge T_0^{(3/4)^j}$ with an index~$j$ such
that $T_0^{(3/4)^j}\ge\bar A_0$ if the class of functions ${\Cal F}$
satisfies the conditions of Proposition~7.3. Hence it has a good
tail behaviour for $T=\bar A_0^{4/3}$. If a class of functions
${\Cal F}$ satisfies the conditions of Proposition~6.2, then
the class of functions $\bar{\Cal F}=\left\{\bar f=\frac f4\colon\;
f\in{\Cal F}\right\}$ satisfies the conditions of Proposition~7.3
with the same parameters~$\sigma$, $L$ and~$D$. (Actually some of
the inequalities that must hold for the elements of a class of
functions~${\Cal F}$ satisfying the conditions of Proposition~7.3
are valid with smaller parameters. But we did not change these
parameters to satisfy also the condition
$n\sigma^2\ge L\log n+\log D$.) Hence the class of random sums
$S_n(\bar f)$, $\bar f\in \bar{\Cal F}$, has a good tail
behaviour at level $T=\bar A_0^{4/3}$. This implies that the
original class of functions ${\Cal F}$ satisfies formula~(6.4) in
Proposition~6.2, and this is what we had to show.
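The repeated application of Proposition~7.3 in this argument can be summarized by the recursion of levels

```latex
$$
T_0=\frac1{4\sigma^2},\qquad T_{j+1}=T_j^{3/4},
\qquad\text{so that}\qquad
T_{j+1}^{4/3}=T_j \quad\text{and}\quad T_j=T_0^{(3/4)^j};
$$
```

good tail behaviour at level $T_j$ yields good tail behaviour at level $T_{j+1}$ as long as $T_{j+1}\ge\bar A_0$, and the iteration stops at a level below $\bar A_0^{4/3}$.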
\medskip\noindent
{\it Proof of Proposition 7.3.}\/ Fix a class of functions
${\Cal F}$ which satisfies the conditions of Proposition~7.3
together with two independent sequences $\xi_1,\dots,\xi_n$ and
$\varepsilon_1,\dots,\varepsilon_n$ of independent random variables,
where $\xi_j$ is $\mu$-distributed,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le j\le n$,
and investigate the conditional probability
$$
P(f,A|\xi_1,\dots,\xi_n)=P\left(\left.\frac1{\sqrt n}\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge
\frac A6\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all functions $f\in{\Cal F}$, $A> T$ and values
$(\xi_1,\dots,\xi_n)$ in the condition. By the Hoeffding inequality
formulated in Theorem~3.4
$$
P(f,A|\xi_1,\dots,\xi_n)\le 2\exp\left\{-\frac{\frac 1{36}
A^2 n\sigma^4}{2\bar S^2(f,\xi_1,\dots,\xi_n)}\right\} \tag7.6
$$
with
$$
\bar S^2(f,x_1,\dots,x_n)=\frac1n\sum_{j=1}^n f^2(x_j),
\quad f\in {\Cal F}.
$$
Let us introduce the set
$$
H=H(A)=\left\{(x_1,\dots,x_n)\colon\; \sup_{f\in{\Cal F}}
\bar S^2(f,x_1,\dots,x_n)
\ge \left(1+A^{4/3}\right)\sigma^2\right\}. \tag7.7
$$
I claim that
$$
P((\xi_1,\dots,\xi_n)\in H)\le e^{-A^{2/3} n\sigma^2}\quad\text{ if }
A>T. \tag7.8
$$
(The set $H$ plays the role of the small exceptional set, where we
cannot provide a good estimate for $P(f,A|\xi_1,\dots,\xi_n)$ for some
$f\in{\Cal F}$.)
To prove relation (7.8) let us consider the functions $\bar f=\bar
f(f)$, $\bar f(x)=f^2(x)-\int f^2(x)\mu(\,dx)$, and introduce the
class of functions $\bar{\Cal F}=\{\bar f(f)\colon\; f\in{\Cal F}\}$.
Let us show that the class of functions $\bar{\Cal F}$ satisfies the
conditions of Proposition~7.3, hence the estimate (7.5) holds for
the class of functions $\bar{\Cal F}$ if $A> T^{4/3}$.
The relation $\int \bar f(x)\mu(\,dx)=0$ clearly holds. The condition
$\sup| \bar f(x)|\le\frac 18<\frac14$ also holds if $\sup |f(x)|\le
\frac14$, and $\int \bar f^2(x)\mu(\,dx)\le \int f^4(x)\mu(\,dx)\le
\frac 1{16}\int f^2(x)\,\mu(\,dx)\le\frac{\sigma^2}{16}<\sigma^2$
if $f\in{\Cal F}$. It remained to show that $\bar{\Cal F}$ is an
$L_2$-dense class with exponent $L$ and parameter $D$. For this goal
we need a good estimate on $\int(\bar f(x)-\bar g(x))^2\rho(\,dx)$,
where $\bar f,\,\bar g\in\bar{\Cal F}$, and $\rho$ is an arbitrary
probability measure.
Observe that $\int (\bar f(x)-\bar g(x))^2\rho(\,dx)\le
2\int(f^2(x)-g^2(x))^2\rho(\,dx)+
2\int(f^2(x)-g^2(x))^2\mu(\,dx)\le2\left(\sup\limits_x
(|f(x)|+|g(x)|)\right)^2
\int (f(x)-g(x))^2(\rho(\,dx)+\mu(\,dx))\le \int
(f(x)-g(x))^2\bar\rho(\,dx)$ for all $f, g\in{\Cal F}$, $\bar f=\bar
f(f)$, $\bar g=\bar g(g)$ and probability measure $\rho$, where
$\bar\rho=\frac{\rho+\mu}2$. This means that if $\{f_1,\dots,f_m\}$
is an $\varepsilon$-dense subset of ${\Cal F}$ in the space
$L_2(X,{\Cal X},\bar\rho)$, then
$\{\bar f_1,\dots,\bar f_m\}$ is an $\varepsilon$-dense
subset of $\bar{\Cal F}$ in the space $L_2(X,{\Cal X},\rho)$, and
not only ${\Cal F}$, but also $\bar{\Cal F}$ is an $L_2$-dense class
with exponent $L$ and parameter $D$.
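It may be worth spelling out the first step of this chain of
inequalities. Since $\bar f(x)-\bar g(x)=(f^2(x)-g^2(x))
-\int(f^2(u)-g^2(u))\mu(\,du)$, the elementary inequality
$(a-b)^2\le2a^2+2b^2$ together with the Schwarz inequality
$\left(\int(f^2(u)-g^2(u))\mu(\,du)\right)^2\le
\int(f^2(u)-g^2(u))^2\mu(\,du)$ implies that
$$
(\bar f(x)-\bar g(x))^2\le 2(f^2(x)-g^2(x))^2
+2\int(f^2(u)-g^2(u))^2\mu(\,du),
$$
and an integration with respect to the measure $\rho$ yields the
first inequality of this chain.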
Because of the conditions of Proposition 7.3 we can write for the
number $A^{4/3}> T^{4/3}$ and the class of functions $\bar{\Cal F}$
that
$$
\align
P((\xi_1,\dots,\xi_n)\in H)&=P\left(\sup_{f\in{\Cal F}}
\left(\frac1n \sum_{j=1}^n
\bar f(f)(\xi_j) +\frac1n \sum_{j=1}^n E f^2(\xi_j)\right)
\ge \left(1+A^{4/3}\right)\sigma^2\right)\\
&\le P\left(\sup_{\bar f\in\bar {\Cal F}}
\frac1{\sqrt n} \sum_{j=1}^n
\bar f(\xi_j) \ge A^{4/3}n^{1/2}\sigma^2\right)
\le e^{-A^{2/3} n\sigma^2},
\endalign
$$
i.e. relation (7.8) holds.
By formula (7.6) and the definition of the set $H$ given in (7.7)
the estimate
$$
P(f,A|\xi_1,\dots,\xi_n)\le 2e^{- A^{2/3} n\sigma^2/144} \quad
\text{if }(\xi_1,\dots,\xi_n)\notin H
\tag7.9
$$
holds for all $f\in {\Cal F}$ and $A>T\ge1$. (Here we used the
estimate $1+A^{4/3}\le2A^{4/3}$.) Let us introduce the conditional
probability
$$
P({\Cal F},A|\xi_1,\dots,\xi_n)=
P\left(\left.\sup_{f\in {\Cal F}} \frac1{\sqrt n}\left|
\sum\limits_{j=1}^n \varepsilon_jf(\xi_j)\right| \ge
\frac A3\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all $(\xi_1,\dots,\xi_n)$ and $A>T$. We shall
estimate this conditional probability with the help of relation~(7.9)
if $(\xi_1,\dots,\xi_n) \notin H$.
Given a vector $x^{(n)}=(x_1,\dots,x_n)\in X^n$, let us introduce
the measure $\nu=\nu(x_1,\dots,x_n)=\nu(x^{(n)})$ on $(X,{\Cal X})$
which is concentrated in the coordinates of the vector
$x^{(n)}=(x_1,\dots,x_n)$, and $\nu(\{x_j\})=\frac1n$ for all
points~$x_j$, $j=1,\dots,n$. If $\int f^2(u)\nu(\,du)\le\delta^2$
for a function $f$, then
$\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)\right|
\le n^{1/2}\int|f(u)|\nu(\,du)\le n^{1/2}\delta$. As a
consequence, we can write that
$$
\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)-
\frac1{\sqrt n}\sum\limits_{j=1}^n \varepsilon_jg(x_j)\right|
\le\frac A6 \sqrt n\sigma^2 \quad\text{if }
\int (f(u)-g(u))^2\,d\nu(u)\le\left(\frac {A\sigma^2}6\right)^2.
\tag7.10
$$
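Relation (7.10) is a consequence of the previous observation
applied to the function $f-g$ with $\delta=\frac{A\sigma^2}6$.
Indeed, by the Schwarz inequality
$$
\left|\frac1{\sqrt n}\sum_{j=1}^n\varepsilon_j(f(x_j)-g(x_j))\right|
\le n^{1/2}\int|f(u)-g(u)|\nu(\,du)
\le n^{1/2}\left(\int(f(u)-g(u))^2\nu(\,du)\right)^{1/2}
\le\frac A6\sqrt n\sigma^2
$$
under the condition at the right-hand side of (7.10).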
Let us list the elements of the (countable) set ${\Cal F}$ as
${\Cal F}=\{f_1,f_2,\dots\}$, fix a number $\delta=\frac{A\sigma^2}6$,
and choose for all vectors $x^{(n)}=(x_1,\dots,x_n)\in X^n$ a
sequence of indices $p_1(x^{(n)}),\dots,p_m(x^{(n)})$ taking
positive integer values with
$m=\max(1, D\delta^{-L})=\max(1,D(\frac6{A\sigma^2})^L)$
elements in such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\,d\nu(x^{(n)})(u)\le\delta^2$
for all $f\in{\Cal F}$ and $x^{(n)}\in X^n$ with the above defined
measure $\nu(x^{(n)})$ on the space $(X,{\Cal X})$. This is possible
because of the $L_2$-dense property of the class of
functions~${\Cal F}$. (This is the point where the $L_2$-dense
property of the class of functions ${\Cal F}$ is exploited in its
full strength.) In a complete proof of Proposition~7.3 we still have
to show that we can choose the indices $p_j(x^{(n)})$,
$1\le j\le m$, as measurable functions of their argument~$x^{(n)}$
on the space $(X^n,{\Cal X}^n)$. We shall show this in Lemma~7.4 at
the end of the proof.
Put $\xi^{(n)}(\omega)=(\xi_1(\omega),\dots,\xi_n(\omega))$.
Because of relation~(7.10), the choice of the number $\delta$ and
the property of the functions $f_{p_l(x^{(n)})}(\cdot)$ we have
$$
\align
&\left\{\omega\colon\;\sup_{f\in{\Cal F}}
\frac1{\sqrt n}\left|\sum\limits_{j=1}^n
\varepsilon_j(\omega)f(\xi_j(\omega))\right|
\ge\frac A3\sqrt n\sigma^2\right\} \\
&\qquad \subset\bigcup_{l=1}^m\left\{\omega\colon\;\frac1{\sqrt n}
\left|\sum\limits_{j=1}^n \varepsilon_j(\omega)f_{p_l(\xi^{(n)}(\omega))}
(\xi_j(\omega))\right|\ge\frac A6\sqrt n\sigma^2\right\}.
\endalign
$$
This relation together with inequality~(7.9) yields that
$$
\align
P({\Cal F},A|\xi_1,\dots,\xi_n)&
\le\sum\limits_{l=1}^m P(f_{p_l(\xi^{(n)})},A|\xi_1,\dots,\xi_n)\\
&\le 2\max\left(1,D\left(\frac 6{A\sigma^2}\right)^L\right)
e^{-A^{2/3} n\sigma^2/144} \\
&\qquad \text{if }(\xi_1,\dots,\xi_n)\notin H \text{ and } A> T.
\endalign
$$
If $A\ge\bar A_0$ with a sufficiently large constant~$\bar A_0$,
then this inequality together with Lemma~7.2 and the estimate~(7.8)
imply that
$$
\aligned
&P\left(\frac1{\sqrt n}\sup\limits_{f\in{\Cal F}}
\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right)
\le 4P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\Cal F}}\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)\\
&\qquad \le\max\left(4, 8D\left(\frac6{A\sigma^2}\right)^L
\right)e^{-A^{2/3}n\sigma^2/144}
+4e^{-A^{2/3}n\sigma^2} \quad \text{if } A>T.
\endaligned \tag7.11
$$
By the conditions of Proposition~7.3 the inequality
$n\sigma^2\ge L\log n +\log D$ holds with some $L\ge1$, $D\ge1$
and $n\ge2$. This implies that $n\sigma^2\ge L\log2\ge\frac12$,
$(\frac6{A\sigma^2})^L
\le(\frac n{2n\sigma^2})^L\le n^L=e^{L\log n}
\le e^{n\sigma^2}$ if $A\ge\bar A_0$ with some sufficiently large
constant $\bar A_0>0$, and $2D=e^{\log2+\log D}\le e^{3n\sigma^2}$.
Hence the first term at the right-hand side of (7.11) can be
bounded by
$$
\max\left(4,8D\left(\frac6{A\sigma^2}\right)^L\right)
e^{-A^{2/3}n\sigma^2/144}
\le e^{-A^{2/3}n\sigma^2/144}\cdot 4e^{4n\sigma^2}
\le \frac12e^{-A^{1/2}n\sigma^2}
$$
if $A\ge\bar A_0$ with a sufficiently large~$\bar A_0$. The
second term at the right-hand side of~(7.11) can also be bounded
as $4e^{-A^{2/3}n\sigma^2}\le \frac12e^{-A^{1/2}n\sigma^2}$
with an appropriate choice of the number~$\bar A_0$.
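Let us indicate the elementary computation behind these two bounds.
Since $n\sigma^2\ge\frac12$, the inequality
$$
4e^{4n\sigma^2}e^{-A^{2/3}n\sigma^2/144}\le\frac12e^{-A^{1/2}n\sigma^2}
$$
holds if $\left(\frac{A^{2/3}}{144}-A^{1/2}-4\right)n\sigma^2\ge\log8$,
and as $A^{2/3}=A^{1/6}A^{1/2}$, the expression in the parentheses
tends to infinity as $A\to\infty$. The bound on the second term can
be checked similarly.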
By the above calculation, formula (7.11) yields the inequality
$$
P\left(\frac1{\sqrt n}\sup\limits_{f\in{\Cal F}}\left|
\sum\limits_{j=1}^n f(\xi_j)\right| \ge An^{1/2}\sigma^{2}\right)
\le e^{-A^{1/2}n\sigma^2}
$$
if $A>T$, and the constant~$\bar A_0$ is chosen sufficiently large.
\medskip
To complete the proof of Proposition~7.3 we still have to prove the
following Lemma~7.4 together with some of its generalizations
needed in the proof of Propositions~15.3 and~15.4. The latter
results are those multivariate versions of Proposition~7.3 that
we need in the proof of the multivariate version of Proposition~6.2.
We formulated them not in their most general possible form,
but in the way as we need them in this work.
\medskip\noindent
{\bf Lemma~7.4.} {\it Let ${\Cal F}=\{f_1,f_2,\dots\}$ be a countable
and $L_2$-dense class of functions with some exponent $L>0$ and
parameter~$D\ge1$ on a measurable space $(X,{\Cal X})$. Fix some
positive integer~$n$, and define for all
$x^{(n)}=(x_1,\dots,x_n)\in X^n$ the probability measure
$\nu(x^{(n)})=\nu(x_1,\dots,x_n)$ on the space $(X,{\Cal X})$ by
the formula $\nu(x^{(n)})(\{x_j\})=\frac1n$, $1\le j\le n$. For a
number $0<\varepsilon\le 1$ put
$m=m(\varepsilon)=[D\varepsilon^{-L}]$, where $[\cdot]$
denotes the integer part. For all $0<\varepsilon\le 1$ there
exist $m=m(\varepsilon)$
measurable functions $p_l(x^{(n)})$, $1\le l\le m$, on the
measurable space $(X^n,{\Cal X}^n)$ with positive integer values in
such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\nu(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^n$ and $f\in{\Cal F}$.}
\medskip
In the proof of Proposition~15.3 we need the following result.
\medskip\noindent
{\bf Lemma~7.4A.} {\it Let ${\Cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some exponent
$L>0$ and parameter~$D\ge1$ on the $k$-fold product
$(X^k,{\Cal X}^k)$ of a measurable space $(X,{\Cal X})$ with some
$k\ge1$. Fix some positive integer~$n$, and define for all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$,
where $x^{(j)}_l\in X$ for all $j$ and $l$, the probability measure
$\rho(x^{(n)})$ on the space $(X^k,{\Cal X}^k)$ by the formula
$\rho(x^{(n)})(\{(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})\})
=\frac1{n^k}$ for all sequences
$(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})$, $1\le l_j\le n$,
$1\le j\le k$, whose coordinates are taken from the coordinates of
the vector
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$. For all
$0<\varepsilon\le 1$
there exist $m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable
functions $p_r(x^{(n)})$,
$1\le r\le m$, on the measurable space $(X^{kn},{\Cal X}^{kn})$ with
positive integer values in such a way that $\inf\limits_{1\le r\le m}
\int(f(u)-f_{p_r(x^{(n)})}(u))^2\rho(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^{kn}$ and $f\in{\Cal F}$.}
\medskip
In the proof of Proposition~15.4 we need the following result.
\medskip\noindent
{\bf Lemma~7.4B.} {\it
Let ${\Cal F}=\{f_1,f_2,\dots\}$ be a countable and $L_2$-dense class
of functions with some exponent $L>0$ and parameter~$D\ge1$ on the
product space $(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ with some
measurable spaces $(X,{\Cal X})$ and $(Y,{\Cal Y})$ and
integer~$k\ge1$. Fix some positive integer~$n$, and define for
all vectors $x^{(n)}=(x_l^{(j,1)},x_l^{(j,-1)},\,1\le l\le n,
\,1\le j\le k)\in X^{2kn}$,
where $x^{(j,\pm1)}_l\in X$ for all $j$ and $l$, a probability
measure $\alpha(x^{(n)})$ in the space
$(X^k\times Y,{\Cal X}^k\times{\Cal Y})$
in the following way. Fix some probability measure
$\rho$ on the space $(Y,{\Cal Y})$ and two $\pm1$ sequences
$\varepsilon^{(k)}_1=(\varepsilon_{1,1},\dots,\varepsilon_{k,1})$ and
$\varepsilon^{(k)}_2=(\varepsilon_{1,2},\dots,\varepsilon_{k,2})$ of
length~$k$. Define with
their help first the following probability measures
$\alpha_1(x^{(n)})=\alpha_1(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
and $\alpha_2(x^{(n)})=\alpha_2(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
on $(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ for all
$x^{(n)}\in X^{2kn}$. Let
$\alpha_1(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,1})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,1})}\}\times B)=\frac{\rho(B)}{n^k}$
and
$\alpha_2(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,2})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,2})}\}\times B)=\frac{\rho(B)}{n^k}$
with $1\le l_j\le n$ for all $1\le j\le k$ and $B\in{\Cal Y}$ if
$x_{l_j}^{(j,\varepsilon_{j,1})}$ and
$x_{l_j}^{(j,\varepsilon_{j,2})}$ are the appropriate coordinates
of the vector $x^{(n)}\in X^{2kn}$. Put
$\alpha(x^{(n)})=\frac{\alpha_1(x^{(n)})+\alpha_2(x^{(n)})}2$.
For all $0<\varepsilon\le 1$ there exist
$m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable
functions $p_r(x^{(n)})$, $1\le r\le m$, on the measurable space
$(X^{2kn},{\Cal X}^{2kn})$ with positive integer values in
such a way that
$\inf\limits_{1\le r\le m}\int(f(u)-f_{p_r(x^{(n)})}(u))^2
\alpha(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^{2kn}$ and $f\in{\Cal F}$.}
\medskip\noindent
{\it Proof of Lemma 7.4.}\/ Fix some $0<\varepsilon\le 1$, take
the number $m=m(\varepsilon)$ introduced in the lemma, and let
us list the set of all vectors $(j_1,\dots,j_m)$ of length~$m$
with positive integer coordinates in some way. Define for all of
these vectors $(j_1,\dots,j_m)$ the set
$B(j_1,\dots,j_m)\subset X^n$ in the following way. We have
$x^{(n)}=(x_1,\dots,x_n)\in B(j_1,\dots,j_m)$
if and only if $\inf\limits_{1\le r\le m}
\int (f(u)-f_{j_r}(u))^2\,d\nu(x^{(n)})(u)\le\varepsilon^2$ for all
$f\in{\Cal F}$. Then all sets $B(j_1,\dots,j_m)$ are measurable, and
$\bigcup\limits_{(j_1,\dots,j_m)}B(j_1,\dots,j_m)=X^n$
because ${\Cal F}$ is an $L_2$-dense class of functions with
exponent~$L$ and parameter~$D$. Given a point
$x^{(n)}=(x_1,\dots,x_n)$ let us choose the first vector
$(j_1,\dots,j_m)=(j_1(x^{(n)}),\dots,j_m(x^{(n)}))$ in our list
of vectors for which $x^{(n)}\in B(j_1,\dots,j_m)$, and define
$p_l(x^{(n)})=j_l(x^{(n)})$ for all $1\le l\le m$ with this
vector $(j_1,\dots,j_m)$. Then the functions $p_l(x^{(n)})$ are
measurable, and the functions $f_{p_l(x^{(n)})}$, $1\le l\le m$,
defined with their help together with the probability measures
$\nu(x^{(n)})$ satisfy the inequality demanded in Lemma~7.4.
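Let us also remark why the sets $B(j_1,\dots,j_m)$ are measurable.
By the definition of the measure $\nu(x^{(n)})$
$$
\int(f(u)-f_{j_r}(u))^2\nu(x^{(n)})(\,du)
=\frac1n\sum_{l=1}^n(f(x_l)-f_{j_r}(x_l))^2,
$$
which is a measurable function of the argument
$x^{(n)}=(x_1,\dots,x_n)$, and the set $B(j_1,\dots,j_m)$ is the
intersection of countably many (one for each $f\in{\Cal F}$)
measurable level sets of such functions.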
\medskip
The proofs of Lemmas~7.4A and~7.4B are almost the same. We
only have to modify the definition of the sets $B(j_1,\dots,j_m)$
in a natural way. The argument $x^{(n)}$ runs through the spaces
$X^{kn}$ and $X^{2kn}$ in these two cases, and we have to integrate
with respect to the measures $\rho(x^{(n)})$ in the space $X^k$ and
with respect to the measures $\alpha(x^{(n)})$ in the space
$X^k\times Y$ respectively. The sets $B(j_1,\dots,j_m)$ are
measurable also in these cases, and the rest of the proof can be
applied without any change.
\beginsection 8. Formulation of the main results of this work.
Former sections of this work contain estimates about the tail
distribution of normalized sums of independent, identically
distributed random variables and about the tail distribution of
the supremum of appropriate classes of such random sums. These
results were considered together with some estimates about the
tail distribution of the integral of a (deterministic) function
and of the supremum of such integrals. These two kinds of problems
are closely related, and to understand them better it is useful
to investigate them together with their natural Gaussian
counterpart.
In this section we formulate the natural multivariate versions of
the above mentioned results. They will be proved in the subsequent
part of this work. To formulate them we have to introduce some new
notions. In the subsequent sections I discuss some new problems
whose solution helps to work out some methods that enable us to
prove the results of this section. I finish this section with a
short overview about the content of the remaining part of this
work. I shall also briefly indicate why it helps us to prove the
results formulated in this section.
I start this section with the formulation of two results,
Theorems~8.1 and~8.2 together with some of their simple
consequences which yield a sharp estimate about the tail
distribution of a multiple random integral with respect to a
normalized empirical distribution and about the analogous
problem when the tail distribution of the supremum of such
integrals is considered. These results are the natural
versions of the corresponding one-variate results about the tail
behaviour of an integral or of the supremum of a class of
integrals with respect to a normalized empirical distribution.
They can be formulated with the help of the notions introduced
before, in particular with the help of the notion of multiple
random integrals with respect to a normalized empirical
distribution function introduced in formula~(4.8).
To formulate the following two results, Theorems~8.3 and~8.4 and
their consequences, which are the natural multivariate versions
of the results about the tail distribution of partial sums of
independent random variables, and of the supremum of such sums
we have to make some preparation. First we introduce the
so-called $U$-statistics which can be considered as the natural
multivariate generalizations of the sum of independent and
identically distributed random variables. Moreover, we had a good
estimate on the tail distribution of sums of independent
random variables only if the summands had expectation zero, and
we have to find the natural multivariate version of this property.
Hence we define the so-called degenerate $U$-statistics which can
be considered as the natural multivariate counterpart of sums of
independent and identically distributed random variables with
zero expectation. Theorems~8.3 and~8.4 contain estimates about
the tail-distribution of degenerate $U$-statistics and of the
supremum of such expressions.
In Theorems~8.5 and~8.6 we formulate the Gaussian counterparts of
the above results. They deal with multiple Wiener-It\^o integrals
with respect to a so-called white noise. The notion of multiple
Wiener--It\^o integrals and their properties needed to have a good
understanding of these results will be explained in a later
section. Two more results are discussed in this section. They are
Examples~8.7 and~8.8, which state that the estimates of Theorems~8.5
and~8.3 are in a certain sense sharp.
\medskip
To formulate the first two results of this section let us consider
a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in a measurable space
$(X,{\Cal X})$. Let $\mu$ denote the distribution
of the random variables $\xi_j$, and introduce the empirical
distribution of the sequence $\xi_1,\dots,\xi_n$ defined in~(4.5).
Given a measurable function $f(x_1,\dots,x_k)$ on the
$k$-fold product space $(X^k,{\Cal X}^k)$ consider its integral
$J_{n,k}(f)$ with respect to the $k$-fold product of the normalized
empirical measure $\sqrt n(\mu_n-\mu)$ defined in formula (4.8). In
the definition of this integral the diagonals $x_j=x_l$,
$1\le j<l\le k$, are omitted from the domain of integration. We
shall assume that the function $f$ satisfies the conditions
$$
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}
|f(x_1,\dots,x_k)|\le 1 \tag8.1
$$
and
$$
\|f\|_2^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
\le\sigma^2 \tag8.2
$$
with some constant $0<\sigma\le1$. The first result of this section
is the following one.
\medskip\noindent
{\bf Theorem 8.1. (Estimate on the tail distribution of a multiple
random integral with respect to a normalized empirical
distribution).} {\it Let a non-atomic probability measure $\mu$ be
given on a measurable space $(X,{\Cal X})$ together with a
measurable function $f(x_1,\dots,x_k)$ on the product space
$(X^k,{\Cal X}^k)$ which satisfies conditions (8.1) and (8.2) with
some $0<\sigma\le1$. Then there exist some constants $C=C_k>0$ and
$\alpha=\alpha_k>0$, such that the random integral
$J_{n,k}(f)$ defined by formulas (4.5) and (4.8) satisfies the
inequality
$$
P(|J_{n,k}(f)|>u)\le C \max\left(e^{-\alpha(u/\sigma)^{2/k}},
e^{-\alpha(nu^2)^{1/(k+1)}} \right) \tag8.3
$$
for all $u>0$. The constants $C=C_k>0$ and
$\alpha=\alpha_k>0$ in formula~(8.3) depend only on the
parameter~$k$.}
\medskip
Theorem 8.1 can be reformulated in the following equivalent form.
\medskip\noindent
{\bf Theorem 8.1$'$.} {\it Under the conditions of Theorem 8.1
$$
P(|J_{n,k}(f)|>u)\le C e^{-\alpha(u/\sigma)^{2/k}}
\quad \text{for all } 0<u\le n^{k/2}\sigma^{k+1}
$$
with some constants $C=C_k>0$,
$\alpha=\alpha_k>0$, depending only on the multiplicity~$k$ of the
integral $J_{n,k}(f)$.}
\medskip
Theorem 8.1 clearly implies Theorem~$8.1'$, since in the case
$u\le n^{k/2}\sigma^{k+1}$ the first term is larger than the second
one in the maximum at the right-hand side of formula~(8.3). On
the other hand, Theorem~$8.1'$ implies Theorem~8.1 also if
$u>n^{k/2}\sigma^{k+1}$. Indeed, in this case Theorem~$8.1'$ can be
applied with
$\bar\sigma=\left(u n^{-k/2}\right)^{1/(k+1)}\ge \sigma$ if
$u\le n^{k/2}$, hence also condition $0<\bar\sigma\le1$ is satisfied.
This yields that
$P\left(|J_{n,k}(f)|>u\right)\le C\exp\left\{-\alpha
\left(\frac u{\bar\sigma}\right)^{2/k}\right\}=C \exp\left\{-\alpha
(nu^2)^{1/(k+1)}\right\}$ if $n^{k/2}\ge u>n^{k/2}\sigma^{k+1}$,
and relation~(8.3) holds in this case. If $u>n^{k/2}$, then
$P(|J_{n,k}(f)|>u)=0$, and relation~(8.3) holds again.
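The identity exploited in the last step can be checked directly.
With the choice $\bar\sigma=\left(un^{-k/2}\right)^{1/(k+1)}$ we have
$\frac u{\bar\sigma}=u^{k/(k+1)}n^{k/(2(k+1))}$, hence
$$
\left(\frac u{\bar\sigma}\right)^{2/k}
=u^{2/(k+1)}n^{1/(k+1)}=(nu^2)^{1/(k+1)}.
$$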
Theorem~8.1, or equivalently Theorem~$8.1'$, states that the tail
distribution
$P(|J_{n,k}(f)|>u)$ of the $k$-fold random integral $J_{n,k}(f)$ can
be bounded similarly to the probability
$P(|\text{const.}\,\sigma\eta^k|>u)$,
where $\eta$ is a random variable with standard normal distribution
and the number $0\le\sigma\le1$ satisfies relation (8.2), provided
that the level~$u$ we consider is less than $n^{k/2}\sigma^{k+1}$.
As we shall see later (see Corollary~1 of Theorem~9.4), the value of
the number $\sigma^2$ in formula (8.2) is closely related to the
variance of $J_{n,k}(f)$. At the end of this section an example is
given which shows that the condition $u\le n^{k/2}\sigma^{k+1}$ is
really needed in Theorem~$8.1'$.
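The Gaussian comparison behind this statement is the elementary
bound for a standard normal random variable~$\eta$:
$$
P(|\sigma\eta^k|>u)
=P\left(|\eta|>\left(\frac u\sigma\right)^{1/k}\right)
\le e^{-(u/\sigma)^{2/k}/2} \quad\text{for all } u>0,
$$
which has the same form as the first term of the maximum at the
right-hand side of (8.3).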
The next result, Theorem 8.2, is the generalization of Theorem~$4.1'$
for multiple random integrals with respect to a normalized empirical
measure. In its formulation the notions of $L_2$-dense classes and
countably approximability introduced in Section~4 are applied.
\medskip\noindent
{\bf Theorem 8.2. (Estimate on the supremum of multiple random
integrals with respect to an empirical distribution).}
{\it Let us have a non-atomic probability measure
$\mu$ on a measurable space $(X,{\Cal X})$ together with a countable
and $L_2$-dense class ${\Cal F}$ of functions $f=f(x_1,\dots,x_k)$ of
$k$ variables with some parameter $D\ge1$ and exponent $L\ge1$ on
the product space $(X^k,{\Cal X}^k)$ which satisfies the conditions
$$
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}|f(x_1,\dots,x_k)|\le 1,
\qquad \text{for all } f\in {\Cal F} \tag8.4
$$
and
$$
\|f\|_2^2=Ef^2(\xi_1,\dots,\xi_k)=\int f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(\,dx_k)\le \sigma^2 \qquad \text{for all }
f\in {\Cal F} \tag8.5
$$
with some constant $0<\sigma\le1$. There exist some constants
$C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$ depending only on
the parameter $k$ such that the supremum of the random integrals
$J_{n,k}(f)$, $f\in {\Cal F}$, defined by formula (4.8) satisfies
the inequality
$$
\aligned
P\left(\sup_{f\in{\Cal F}}|J_{n,k}(f)|\ge u\right)
\le C &\exp\left\{-\alpha
\left(\frac u{\sigma}\right)^{2/k}\right\}
\quad \text{for those numbers } u\\
\text{for which } &n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}),
\endaligned \tag8.6
$$
where the numbers $D$ and $L$ agree with the parameter and exponent
of the $L_2$-dense class~${\Cal F}$.
The condition about the countable cardinality of the class ${\Cal F}$
can be replaced by the weaker condition that the class of random
variables $J_{n,k}(f)$, $f\in{\Cal F}$, is countably approximable.}
\medskip
The condition given for the number~$u$ in formula~(8.6) appears in
Theorem~8.2 for a similar reason as the analogous condition
formulated in~(4.4) in its one-variate counterpart, Theorem~4.1.
The lower bound is needed, since we have a good estimate in
formula~(8.6) only for $u\ge E\sup\limits_{f\in{\Cal F}}|J_{n,k}(f)|$.
The upper bound appears, since we have a good estimate in
Theorem~$8.1'$ only for $0<u\le n^{k/2}\sigma^{k+1}$.
Next we turn to the tail behaviour of $U$-statistics. Given a
measurable function $f(x_1,\dots,x_k)$ on the $k$-fold product
$(X^k,{\Cal X}^k)$ of a measurable space $(X,{\Cal X})$ with some
probability measure~$\mu$ and a sequence of independent,
$\mu$-distributed random variables $\xi_1,\dots,\xi_n$, $n\ge k$,
define the $U$-statistic of order~$k$ with kernel function~$f$ as
$$
I_{n,k}(f)=\frac1{k!}\sum\Sb 1\le l_j\le n,\; j=1,\dots,k \\
l_j\neq l_{j'} \text{ if } j\neq j' \endSb
f(\xi_{l_1},\dots,\xi_{l_k}).
$$
We call a function $f$ canonical with respect to the measure~$\mu$
if
$$
\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k)\mu(\,du)=0
\quad\text{for all } 1\le j\le k \text{ and } x_s\in X,\; s\neq j,
$$
and we call a $U$-statistic with a canonical kernel function
degenerate. With this terminology the following result holds.
\medskip\noindent
{\bf Theorem 8.3. (Estimate on the tail distribution of a degenerate
$U$-statistic).} {\it Let a degenerate $U$-statistic $I_{n,k}(f)$ of
order~$k$ with sample size $n\ge k$ be given whose canonical kernel
function~$f$ satisfies conditions (8.1) and (8.2) with some
$0<\sigma\le1$. Then there exist some constants $A=A(k)>0$ and
$B=B(k)>0$ depending only
on the order $k$ of the $U$-statistic $I_{n,k}(f)$ such that
$$
P(k!n^{-k/2}|I_{n,k}(f)|>u)
\le A\exp\left\{-\frac{u^{2/k}}{2\sigma^{2/k}
\left(1+B\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}
\right)}\right\} \tag8.10
$$
for all $0\le u\le n^{k/2}\sigma^{k+1}$.}
\medskip
Let us also formulate the following simple corollary of Theorem~8.3.
\medskip\noindent
{\bf Corollary of Theorem 8.3.} {\it Under the conditions
of Theorem~8.3 there exist some universal constants
$C=C(k)>0$ and $\alpha=\alpha(k)>0$ such
that
$$
P(k!n^{-k/2}|I_{n,k}(f)|>u)
\le C\exp\left\{-\alpha\left(\frac u\sigma\right)^{2/k}
\right\} \quad \text{for all } 0\le u\le n^{k/2}\sigma^{k+1}.
\tag$8.10'$
$$
}
\medskip
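Indeed, the Corollary follows from Theorem~8.3, since under its
condition $0\le u\le n^{k/2}\sigma^{k+1}$ the denominator of the
exponent in (8.10) satisfies the inequality
$$
2\sigma^{2/k}\left(1+B\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}
\right)\le2(1+B)\sigma^{2/k},
$$
hence relation ($8.10'$) holds with the choice $C=A$ and
$\alpha=\frac1{2(1+B)}$.
\medskip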
The following estimate holds about the supremum of degenerate
$U$-statistics.
\medskip\noindent
{\bf Theorem 8.4. (Estimate on the supremum of degenerate
$U$-statistics).} {\it Let us have a probability measure $\mu$ on
a measurable space $(X,{\Cal X})$ together with a countable and
$L_2$-dense class ${\Cal F}$ of functions $f=f(x_1,\dots,x_k)$ of $k$
variables with some parameter $D\ge1$ and exponent~$L\ge1$ on the
product space $(X^k,{\Cal X}^k)$ which satisfies conditions (8.4) and
(8.5) with some constant $0<\sigma\le1$. Let us take a sequence of
independent $\mu$ distributed random variables $\xi_1,\dots,\xi_n$,
$n\ge k$, and consider the $U$-statistics $I_{n,k}(f)$ with these
random variables and kernel functions $f\in{\Cal F}$. Let us assume
that all these $U$-statistics $I_{n,k}(f)$, $f\in{\Cal F}$, are
degenerate, or in an equivalent form, all functions $f\in {\Cal F}$
are canonical with respect to the measure~$\mu$. Then there exist
some constants $C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$
depending only on the parameter $k$ such that the inequality
$$
\aligned
P\left(\sup_{f\in{\Cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\le C
&\exp\left\{-\alpha \left(\frac u{\sigma}\right)^{2/k}\right\} \quad
\text{holds for those numbers } u \\
\text{for which } n\sigma^2\ge
&\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}),
\endaligned \tag8.11
$$
where the numbers $D$ and $L$ agree with the parameter and
exponent of the $L_2$-dense class~${\Cal F}$.
The condition about the countable cardinality of the class ${\Cal F}$
can be replaced by the weaker condition that the class of random
variables $n^{-k/2}I_{n,k}(f)$, $f\in{\Cal F}$, is countably
approximable.}
\medskip
Next I formulate a Gaussian counterpart of the above results. To do
this I need some notions that will be introduced in Section~10. In
that section the white noise with a reference measure $\mu$ will
be defined. It is an appropriate set of jointly Gaussian random
variables indexed by those measurable sets $A\in {\Cal X}$ of a
measure space $(X,{\Cal X},\mu)$ with a $\sigma$-finite
measure~$\mu$ for which $\mu(A)<\infty$. Its distribution depends
on the measure~$\mu$ which will be called the reference measure of
the white noise.
In Section~10 it will also be shown that given a white noise $\mu_W$
with a non-atomic $\sigma$-additive reference measure $\mu$ on a
measurable space $(X,{\Cal X})$ and a measurable function
$f(x_1,\dots,x_k)$ of $k$ variables on the product space
$(X^k,{\Cal X}^k)$ such that
$$
\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2<\infty
\tag8.12
$$
a $k$-fold Wiener-It\^o integral of the function $f$ with respect
to the white noise~$\mu_W$
$$
Z_{\mu,k}(f)=\frac1{k!}\int f(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots \mu_W(\,dx_k) \tag8.13
$$
can be defined, and the main properties of this integral will be
proved there. It will be seen that Wiener-It\^o integrals have a
similar relation to degenerate $U$-statistics and multiple
integrals with respect to normalized empirical measures as
normally distributed random variables have to partial sums of
independent random variables. Hence it is useful to find the
analogs of the previous estimates of this section about the
tail distribution of Wiener-It\^o integrals. The subsequent
Theorems~8.5 and~8.6 contain such results.
\medskip\noindent
{\bf Theorem 8.5. (Estimate on the tail distribution of a multiple
Wiener--It\^o integral).} {\it Let us fix a measurable space
$(X,{\Cal X})$ together with a $\sigma$-finite non-atomic
measure~$\mu$ on it, and let $\mu_W$ be a white noise with reference
measure $\mu$ on $(X,{\Cal X})$. If $f(x_1,\dots,x_k)$ is a measurable
function on $(X^k,{\Cal X}^k)$ which satisfies relation~(8.12) with
some $0<\sigma<\infty$, then
$$
P(k!|Z_{\mu,k}(f)|>u)\le C \exp\left\{-\frac12\left(\frac
u\sigma\right)^{2/k}\right\} \tag8.14
$$
for all $u>0$ with some constant $C=C(k)$ depending only on~$k$.}
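\medskip
A simple sanity check of Theorem~8.5 is the case $k=1$. Then
$Z_{\mu,1}(f)=\int f(x)\mu_W(\,dx)$ is a normally distributed random
variable with expectation zero and variance
$\int f^2(x)\mu(\,dx)\le\sigma^2$, and relation (8.14) reduces to
the usual Gaussian tail estimate
$$
P(|Z_{\mu,1}(f)|>u)\le Ce^{-u^2/2\sigma^2}\quad
\text{for all } u>0.
$$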
\medskip\noindent
{\bf Theorem 8.6. (Estimate on the supremum of Wiener--It\^o
integrals).} {\it Let ${\Cal F}$ be a countable class of functions
of $k$ variables defined on the $k$-fold product $(X^k,{\Cal X}^k)$
of a measurable space $(X,{\Cal X})$ such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots \mu(\,dx_k)\le\sigma^2
\quad \text{\rm with some } 0<\sigma\le1 \text{ \rm for all }
f\in {\Cal F}
$$
with some non-atomic $\sigma$-additive measure~$\mu$ on $(X,{\Cal X})$.
Let us also assume that ${\Cal F}$ is an $L_2$-dense class of functions
in the space $(X^k,{\Cal X}^k)$ with respect to the measure~$\mu^k$
with some exponent~$L\ge1$ and parameter~$D\ge1$, where $\mu^k$ is
the $k$-fold product of the measure~$\mu$. (The classes of
$L_2$-dense classes with respect to a measure were defined in
Section~4.)
Take a white noise $\mu_W$ on $(X,{\Cal X})$ with reference measure
$\mu$, and define the Wiener--It\^o integrals $Z_{\mu,k}(f)$ for
all $f\in{\Cal F}$. Fix some $0<\varepsilon\le1$. The inequality
$$
P\left(\sup_{f\in {\Cal F}} k!|Z_{\mu,k}(f)|>u\right)\le CD
\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k}\right\}\tag8.15
$$
holds with some universal constants $C=C(k)>0$, $M=M(k)>0$
for those numbers~$u$ for which $u\ge ML^{k/2}\frac1\varepsilon
\log^{k/2}\frac2\varepsilon \cdot \sigma\log^{k/2}\frac2\sigma$.}
\medskip
With the choice of a small $\varepsilon>0$ formula (8.15) yields
an almost as good estimate for the supremum of Wiener--It\^o
integrals as formula (8.14) yields for a single Wiener--It\^o
integral. But the lower
bound imposed on the number $u$ in the estimate (8.15) depends
on $\varepsilon$, and for a small number $\varepsilon>0$ it is large.
The subsequent result presented in Example~8.7 may help to
understand why Theorems~8.3 and~8.5 are sharp. Its proof and
the discussion of the question about the sharpness
of Theorems~8.3 and~8.5 will be postponed to Section~13.
\medskip\noindent
{\bf Example 8.7. (A converse estimate to Theorem 8.5).} {\it Let
us have a $\sigma$-finite measure $\mu$ on some measure space
$(X,{\Cal X})$ together with a white noise $\mu_W$ on $(X,{\Cal X})$
with reference measure~$\mu$. Let $f_0(x)$ be a real valued function
on $(X,{\Cal X})$ such that $\int f_0(x)^2\mu(\,dx)=1$, and take the
function $f(x_1,\dots,x_k)=\sigma f_0(x_1)\cdots f_0(x_k)$ with
some number $\sigma>0$ together with the Wiener--It\^o integral
$Z_{\mu,k}(f)$ introduced in formula (8.13).
Then the relation
$\int f(x_1,\dots,x_k)^2\,\mu(\,dx_1)\dots\,\mu(\,dx_k)=\sigma^2$
holds, and the Wiener--It\^o integral $Z_{\mu,k}(f)$ satisfies the
inequality
$$
P(k!|Z_{\mu,k}(f)|>u)
\ge \frac{\bar C}{\left(\frac u\sigma\right)^{1/k}+1}
\exp\left\{-\frac12\left(\frac u\sigma\right)^{2/k}\right\}\quad
\text{for all } u>0 \tag8.16
$$
with some constant $\bar C>0$.}
\medskip
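In the special case $k=1$ the content of Example~8.7 is the
classical lower bound for the Gaussian tail distribution. Indeed,
in this case $Z_{\mu,1}(f)$ is a normally distributed random
variable with expectation zero and variance $\sigma^2$, and
$$
P(|Z_{\mu,1}(f)|>u)\ge\frac{\bar C}{\frac u\sigma+1}
e^{-u^2/2\sigma^2}\quad\text{for all } u>0,
$$
which agrees with relation (8.16) for $k=1$.
\medskip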
The above results show that multiple integrals with respect to a
normalized empirical measure or degenerate $U$-statistics satisfy
some estimates similar to those about multiple Wiener--It\^o
integrals, but they hold under more restrictive conditions. The
difference between the estimates in these problems is similar to
the difference between the corresponding results in Section~4 whose
reason was explained there. Hence this will be only briefly
discussed here. The estimates of Theorems~8.1 and~8.3 are similar to
that of Theorem~8.5. Moreover, for
$0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a small number
$\varepsilon>0$ Theorem~8.3 yields an almost as good
estimate about degenerate $U$-statistics as Theorem~8.5 yields
for a Wiener--It\^o integral with the same kernel function $f$ and
underlying measure $\mu$. Example~8.7 shows that the constant in
the exponent of formula (8.14) cannot be improved, at least there
is no possibility of an improvement if only the $L_2$-norm
of the kernel function $f$ is known. Some results discussed later
indicate that the estimate of Theorem~8.3 cannot be improved either.
The main difference between Theorem~8.5 and the results of
Theorem~8.1 or~8.3 is that in the latter case the kernel
function~$f$ must satisfy not only an $L_2$ but also an $L_\infty$
norm type condition, and the estimates of these results are
formulated under the additional condition
$u\le n^{k/2}\sigma^{k+1}$. It can be shown that the condition about
the $L_\infty$ norm of the kernel function cannot be dropped from
the conditions of these theorems, and a version of Example~3.3 will
be presented in Example~8.8 which shows that in the case
$u\gg n^{k/2}\sigma^{k+1}$ the left-hand side of~(8.10) may satisfy
only a much weaker estimate. This estimate will be given only for
$k=2$, but with some work it can be generalized for general
indices~$k$.
Theorems~8.2, 8.4 and~8.6 show that for the tail distribution of the
supremum of a not too large class of degenerate $U$-statistics or
multiple integrals a similar upper bound can be given as for the tail
distribution of a single degenerate $U$-statistic or multiple integral,
only the universal constants may be worse in the new estimates.
However, they hold only under the additional condition that the level
at which the tail distribution of the supremum is estimated is not too
low. A similar phenomenon appeared already in the results of Section~4.
Moreover, such a restriction had to be imposed in the formulation of
the results here and in Section~4 for the same reason.
In Theorems~8.2 and~8.4 an $L_2$-dense class of kernel functions was
considered, and this meant that the class of random integrals or
$U$-statistics we consider in these results is not too large. In
Theorem~8.6 a similar, but weaker condition was imposed on the class
of kernel functions. They had to satisfy a similar condition, but
only for the reference measure $\mu$ of the white noise appearing in
the Wiener--It\^o integral. A similar difference appears in the
comparison of Theorems~4.1 or~$4.1'$ with Theorem~4.2, and this
difference has the same reason in the two cases.
I also present the proof of the following Example~8.8
which is a multivariate version of Example~3.3. For the sake of
simplicity I restrict my attention to the case $k=2$.
\medskip\noindent
{\bf Example 8.8. (A converse estimate to Theorem 8.3).} {\it Let us
take a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in the plane $X=R^2$ such
that $\xi_j=(\eta_{j,1},\eta_{j,2})$, where $\eta_{j,1}$ and $\eta_{j,2}$
are independent random variables with the following distributions.
The distribution of $\eta_{j,1}$ is defined with the help of a
parameter $\sigma^2$, $0<\sigma^2\le\frac18$, in the same way as
the distribution of the random variables $X_j$ in Example~3.3, i.e.
$\eta_{j,1}=\bar\eta_{j,1}-E\bar\eta_{j,1}$ with
$P(\bar\eta_{j,1}=1)=\bar\sigma^2$,
$P(\bar\eta_{j,1}=0)=1-\bar\sigma^2$, where $\bar\sigma^2$ is that
solution of the equation $x^2-x+\sigma^2=0$, which is smaller
than~$\frac12$. The distribution of the random variables $\eta_{j,2}$ is given
by the formula $P(\eta_{j,2}=1)=P(\eta_{j,2}=-1)=\frac12$ for all
$1\le j\le n$. Introduce the function
$f(x,y)=f((x_1,x_2),(y_1,y_2))=x_1y_2+x_2y_1$,
$x=(x_1,x_2)\in R^2$, $y=(y_1,y_2)\in R^2$ if $(x,y)$ is in the
support of the distribution of the random vector $(\xi_1,\xi_2)$,
i.e. if $x_1$ and $y_1$ take the values $1-\bar\sigma^2$ or
$-\bar\sigma^2$ and $x_2$ and $y_2$ take the values $\pm1$. Put
$f(x,y)=0$ otherwise. Define the $U$-statistic
$$
I_{n,2}(f)=\frac12\sum_{1\le j,k\le n,\,j\neq k} f(\xi_j,\xi_k)
=\frac12\sum_{1\le j,k\le n,\,j\neq k}
(\eta_{j,1}\eta_{k,2}+\eta_{k,1}\eta_{j,2})
$$
of order 2 with the above kernel function $f$ and sequence of
independent random variables $\xi_1,\dots,\xi_n$. Then $I_{n,2}(f)$
is a degenerate $U$-statistic, $\sup|f(x,y)|\le 2$ and
$Ef^2(\xi_1,\xi_2)=2\sigma^2$.
If $u\ge B_1n\sigma^3$ with some appropriate constant $B_1>2$,
$\bar B_2^{-1}n\ge u\ge \bar B_2 n^{-1/2}$ with a sufficiently
large fixed number $\bar B_2>0$ and
$\frac14\ge\sigma^2\ge\frac1{n^2}$, and $n$ is a sufficiently
large number, then the estimate
$$
P(n^{-1}I_{n,2}(f)>u)\ge \exp\left\{-Bn^{1/3}u^{2/3}\log
\left(\frac u{n\sigma^3}\right)\right\} \tag8.17
$$
holds with some $B>0$.}
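The algebraic identity behind this example can be checked numerically. The following Python sketch is our own illustration (the function name and parameter values are ours, not part of the text); it simulates the random variables of Example~8.8 and verifies that the pairwise-sum definition of $I_{n,2}(f)$ agrees with the closed form $\left(\sum_j\eta_{j,1}\right)\left(\sum_j\eta_{j,2}\right)-\sum_j\eta_{j,1}\eta_{j,2}$, the identity that also lies behind inequality (8.18) in the proof.

```python
import random

def example_kernel_ustatistic(n, sigma2, seed=0):
    """Simulate I_{n,2}(f) from Example 8.8 and return both the
    pairwise-sum definition and the closed form
    (sum eta_{j,1})(sum eta_{j,2}) - sum eta_{j,1} eta_{j,2}."""
    rng = random.Random(seed)
    # bar_sigma2 is the root of x^2 - x + sigma2 = 0 smaller than 1/2
    bar_sigma2 = (1 - (1 - 4 * sigma2) ** 0.5) / 2
    eta1 = [(1 if rng.random() < bar_sigma2 else 0) - bar_sigma2
            for _ in range(n)]
    eta2 = [1 if rng.random() < 0.5 else -1 for _ in range(n)]
    # I_{n,2}(f) = (1/2) sum_{j != k} (eta_{j,1}eta_{k,2} + eta_{k,1}eta_{j,2})
    pairwise = 0.5 * sum(eta1[j] * eta2[k] + eta1[k] * eta2[j]
                         for j in range(n) for k in range(n) if j != k)
    closed = sum(eta1) * sum(eta2) - sum(a * b for a, b in zip(eta1, eta2))
    return pairwise, closed
```

The closed form shows at a glance why the bulk of $I_{n,2}(f)$ comes from the product of the two independent partial sums, with the diagonal term $\sum_j\eta_{j,1}\eta_{j,2}$ as a small correction.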
\medskip\noindent
{\it Remark:}\/ In Theorem~8.3 we got the estimate
$P(n^{-1}I_{n,2}(f)>u)\le e^{-\alpha u/\sigma}$ for the above
defined degenerate $U$-statistic $I_{n,2}(f)$ if
$0\le u\le n\sigma^3$. In the particular case $u=n\sigma^3$
we have the estimate
$P(n^{-1}I_{n,2}(f)>n\sigma^3)\le e^{-\alpha n\sigma^2}$. On the
other hand, the above example shows that in the case
$u\gg n\sigma^3$
we can get only a weaker estimate. It is worth looking at the
estimate~(8.17) with fixed parameters $n$ and $u$ and observing
the dependence of the upper bound on the variance $\sigma^2$ of
$I_{n,2}(f)$. In the case $\sigma^2=u^{2/3}n^{-2/3}$ we have the
upper bound $e^{-\alpha n^{1/3}u^{2/3}}$. Example~8.8 shows
that in the case $\sigma^2\ll u^{2/3}n^{-2/3}$ we can get only a
relatively small improvement of this estimate. A picture similar
to that in Example~3.3 in the case $k=1$ appears here.
\medskip
It is simple to check that the $U$-statistic introduced in the
above example is degenerate because of the independence of the
random variables $\eta_{j,1}$ and $\eta_{j,2}$ and the identity
$E\eta_{j,1}=E\eta_{j,2}=0$. Beside this,
$Ef^2(\xi_1,\xi_2)=2\sigma^2$. In the proof of the estimate~(8.17)
the results of Section~3, in particular Example~3.3 can be applied
for the sequence $\eta_{j,1}$, $j=1,2,\dots,n$. Beside this, the
following result known from the theory of large deviations will
be applied. If $X_1,\dots,X_n$ are independent and identically
distributed random variables, $P(X_1=1)=P(X_1=-1)=\frac12$, then
for any number $0\le \alpha<1$ there exist numbers
$C_1=C_1(\alpha)>0$ and $C_2=C_2(\alpha)>0$ such that
$P\left(\sum\limits_{j=1}^nX_j >u\right)\ge C_1e^{-C_2u^2/n}$ for all
$0\le u\le \alpha n$.
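This lower bound can be checked against the exact binomial formula for the tail of such a sum. The following Python sketch is our own illustration (the function name is ours); it computes the tail probability exactly and compares it with the bound for concrete values of $n$, $u$ and, say, $C_2=3$.

```python
from math import comb, exp

def rademacher_tail(n, u):
    """Exact P(X_1 + ... + X_n > u) for i.i.d. P(X_j = 1) = P(X_j = -1) = 1/2.
    The sum equals 2k - n when k of the variables are +1, so the event
    {sum > u} is the event {k > (n + u) / 2}."""
    k_min = int((n + u) / 2) + 1  # smallest number of +1's with sum > u
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

# For n = 100 and u = 10 the exact tail lies comfortably above the lower
# bound exp(-C_2 u^2 / n) with C_2 = 3, which is approximately 0.050.
p = rademacher_tail(100, 10)
assert p > exp(-3 * 10 ** 2 / 100)
```

The exact tail decays like $e^{-u^2/2n}$ in this range, in accordance with the central limit theorem, which is why a bound of the form $C_1e^{-C_2u^2/n}$ holds for $u\le\alpha n$.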
\medskip\noindent
{\it Proof of Example 8.8.}\/ The inequality
$$
P(n^{-1}I_{n,2}(f)>u)\ge P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right)
-P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right) \tag8.18
$$
holds. Because of the independence of the random variables
$\eta_{j,1}$ and $\eta_{j,2}$ the first probability on the
right-hand side of (8.18) can be bounded from below by bounding
the two multiplicative terms in it from below at the levels
$v_1=4n^{1/3}u^{2/3}$ and $v_2=n^{2/3}u^{1/3}$ (note that
$v_1v_2=4nu$). The first term will be estimated by means
of Example 3.3. This estimate can be applied with the choice
$y=v_1$, since the relation $v_1\ge 4n\sigma^2$ holds if
$u\ge B_1n\sigma^3$ with $B_1>1$, and the remaining conditions
$0\le \sigma^2\le\frac18$ and $n\ge4v_1\ge6$ also hold under the
conditions of Example~8.8. The second term can be bounded with
the help of the large-deviation result mentioned after the
remark, since $v_2\le \frac12 n$ if $u\le \bar B_2^{-1}n$ with
a sufficiently large $\bar B_2>0$. In such a way we get the
estimate
$$
\align
&P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right)\ge
P\left(\sum_{j=1}^n\eta_{j,1} >v_1\right)
P\left(\sum_{j=1}^n\eta_{j,2}>v_2\right) \\
&\qquad \ge C\exp\left\{-B_1v_1\log
\left(\frac{v_1}{n\sigma^2}\right)-B_2\frac{v_2^2}{n}\right\}
\ge C\exp\left\{-B_3n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\}
\endalign
$$
with appropriate constants $B_1>1$, $B_2>0$ and $B_3>0$. On the
other hand, by applying Bennett's inequality, more precisely its
consequence given in formula (3.4) for the sum of the random
variables $X_j=\eta_{j,1}\eta_{j,2}$ at level $nu$ instead of
level~$u$, we get the following upper bound for the second term on
the right-hand side of~(8.18):
$$
\align
P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right)&\le
\exp\left\{-Knu\log \frac u{\sigma^2}\right\} \\
&\le \exp\left\{-2B_4n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\},
\endalign
$$
since $E\eta_{j,1}\eta_{j,2}=0$, $E\eta^2_{j,1}\eta^2_{j,2}=\sigma^2$,
$nu\ge B_1n^2\sigma^3\ge 2n\sigma^2$ because of the conditions $B_1>2$
and $n\sigma\ge1$. Hence the estimate~(3.4) (with parameter $nu$)
can be applied in this case. Beside this, the constant $B_4$ can be
chosen sufficiently large in the last inequality if the number~$n$
or the bound~$\bar B_2$ in Example~8.8 is chosen sufficiently large.
This means that this term is negligibly small. The above estimates
imply the statement of Example~8.8.
\medskip
Let me remark that under some mild additional restrictions the
estimate (8.17) can be slightly sharpened, the term $\log$ can be
replaced by $\log^{2/3}$ in the exponent of the right-hand side
of~(8.17). To get such an estimate some additional calculation
is needed where the numbers $v_1$ and $v_2$ are replaced by
$\bar v_1=4n^{1/3}u^{2/3}\log^{-1/3}\left(\frac u{n\sigma^3}\right)$
and
$\bar v_2=n^{2/3}u^{1/3}\log^{1/3}\left(\frac u{n\sigma^3}\right)$.
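The particular choice of the levels $v_1$ and $v_2$ in the proof above can be motivated by a short optimization; the following display is our own summary of the computation and is not part of the original argument. The constraint $v_1v_2=4nu$ is forced by the event $\left\{\left(\sum\eta_{j,1}\right)\left(\sum\eta_{j,2}\right)>4nu\right\}$, and the two exponents are balanced:

```latex
$$
\align
v_1v_2&=4n^{1/3}u^{2/3}\cdot n^{2/3}u^{1/3}=4nu,\\
\frac{v_2^2}n&=\frac{\left(n^{2/3}u^{1/3}\right)^2}n
=n^{1/3}u^{2/3}=\frac{v_1}4.
\endalign
$$
```

Hence both terms in the exponent are of order $n^{1/3}u^{2/3}$, up to the logarithmic factor coming from the estimate of $P\left(\sum\eta_{j,1}>v_1\right)$. The refined choice of $\bar v_1$ and $\bar v_2$ (which still satisfy $\bar v_1\bar v_2=4nu$) redistributes this logarithmic factor between the two terms, and this is what yields the improved exponent of order $n^{1/3}u^{2/3}\log^{2/3}\left(\frac u{n\sigma^3}\right)$.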
\medskip
At the end of this section I present a short overview of the
contents of the remaining part of this work.
In our proofs we needed some results about $U$-statistics, and this
is the main topic of Section~9. One of the results discussed here
is the so-called Hoeffding decomposition of $U$-statistics into a
linear combination of degenerate $U$-statistics of different order.
We also needed some additional results which explain how certain
properties (e.g.\ a bound on the $L_2$ and $L_\infty$ norms of a
kernel function, or the $L_2$-density property of a class~${\Cal F}$ of
kernel functions) are inherited when we turn from the original
$U$-statistics to the degenerate $U$-statistics appearing in
their Hoeffding decomposition. Section~9 contains some results
in this direction. Another important result in it is Theorem~9.4,
which yields a decomposition of multiple integrals with respect
to a normalized empirical distribution into a linear combination
of degenerate $U$-statistics. This result is very similar to the
Hoeffding decomposition of $U$-statistics. The main difference
between them is that in the decomposition of multiple integrals
much smaller coefficients appear. Theorem~9.4 makes it possible to
reduce the proof of Theorems~8.1 and~8.2 to the corresponding
results in Theorems~8.3 and~8.4 about degenerate $U$-statistics.
The definition and the main properties of Wiener--It\^o integrals
needed in the proof of Theorems~8.5 and~8.6 are presented in
Section~10. It also contains a result, called the diagram formula
for Wiener--It\^o integrals which plays an important role in our
considerations. Beside this we proved a limit theorem, where we
expressed the limit of normalized degenerate $U$-statistics with
the help of multiple Wiener--It\^o integrals. This result may
explain why it is natural to consider Theorem~8.5 as the
natural Gaussian counterpart of Theorem~8.3, and Theorem~8.6 as
the natural Gaussian counterpart of Theorem~8.4.
We could prove Bernstein's and Bennett's inequality by means of a
good estimation of the exponential moments of the partial sums we
were investigating. In the proof of their multivariate versions,
in Theorems~8.3 and~8.5 this method does not work, because the
exponential moments we have to bound in these cases may be
infinite. On the other hand, we could prove these results by means
of a good estimate on the high moments of the random variables
whose tail distribution we wanted to estimate. In the proof of
Theorem~8.5 the moments of multiple Wiener--It\^o integrals
have to be bounded, and this can be done with the help of the
diagram formula for Wiener--It\^o integrals. In Sections~11
and~12 we proved that there is a version of the diagram formula
for degenerate $U$-statistics, and this enables us to estimate
the moments needed in the proof of Theorem~8.3. In Section~13
we proved Theorems~8.3, 8.5 and a multivariate version of the
Hoeffding inequality. At the end of this section we also
discussed some results which state that in certain cases, when
beside the upper bounds on the $L_2$ and $L_\infty$ norms we have
some additional information about the behaviour of the kernel
function~$f$ in Theorems~8.3 or~8.5, these results can be
improved.
Section~14 contains the natural multivariate versions of the
results in Section~6. In Section~6 Theorem~4.2 is proved about
the supremum of Gaussian random variables and in Section~14
its multivariate version, Theorem~8.6. Both results are proved
with the help of the chaining argument. On the other hand, the
chaining argument is not strong enough to prove Theorem~4.1.
But as it is shown in Section~6, it enables us to prove a result
formulated in Proposition~6.1, and to reduce the proof of
Theorem~4.1 with its help to a simpler result formulated in
Proposition~6.2. One of the results of Section~14,
Proposition~14.1 is a multivariate version of Proposition~6.1.
We showed that the proof of Theorem~8.4 can be reduced with its
help to the proof of a result formulated in Proposition~14.2,
which can be considered a multivariate version of Proposition~6.2.
Section~14 contains yet another result. It turned out that
it is simpler to work with the so-called decoupled $U$-statistics
introduced in this section than with the usual $U$-statistics,
because they have more independence properties. In
Proposition~$14.2'$ a version of Proposition~14.2 is formulated
about decoupled $U$-statistics, and it is shown with the help
of a result of de la Pe\~na and Montgomery-Smith that the proof
of Proposition~14.2, and thus of Theorem~8.4, can be reduced to
the proof of Proposition~$14.2'$.
Proposition~$14.2'$ is proved similarly to its one-variate
version, Proposition~6.2. The strategy of the proof is explained
in Section~15. The main difference between the proof of the two
propositions is that since the independence properties exploited
in the proof of Proposition~6.2 hold only in a weaker form in the
present case, we have to apply a more refined and more difficult
argument. In particular, we have to apply instead of the
symmetrization lemma, Lemma~7.1, a more general version of it,
Lemma~15.2. It is hard to check its conditions when we try to
apply this result in the problems arising in the proof of
Proposition~$14.2'$. This is the reason why we had to prove
Proposition~$14.2'$ with the help of two inductive propositions,
formulated in Propositions~15.3 and~15.4, while in the proof of
Proposition~6.2 it was enough to prove one such result, presented
in Proposition~7.3. We discuss the details of the problems and
the strategy of the proof in Section~15. The proof of
Propositions~15.3 and~15.4 is given in Sections~16 and~17.
Section~16 contains the symmetrization arguments needed for us,
and the proof is completed with its help in Section~17.
Finally in Section~18 we give an overview of this work, and
explain its relation to some related research. The proof of
some results is given in the Appendix.
\beginsection 9. Some results about $U$-statistics.
This section contains the proof of the Hoeffding decomposition
theorem, an important result about $U$-statistics. It states that
all $U$-statistics can be represented as a sum of degenerate
$U$-statistics of different order. This representation can be
considered
as the natural multivariate version of the decomposition of a
random variable into the sum of a random variable with expectation
zero plus a constant (which can be interpreted as a random variable
of zero variance). Some important properties of the Hoeffding
decomposition will also be proved. The properties of the kernel
function of a $U$-statistic will be compared to those of the kernel
functions of the $U$-statistics in its Hoeffding decomposition.
If the Hoeffding decomposition of a $U$-statistic is taken, then
the $L_2$ and $L_\infty$-norms of the kernel functions appearing
in the $U$-statistics of the Hoeffding decomposition will be
bounded
by means of the corresponding norm of the kernel function of the
original $U$-statistic. It will be also shown that if we take a
class of $U$-statistics with an $L_2$-dense class of kernel
functions (and the same sequence of independent and identically
distributed random variables in the definition of each
$U$-statistic), and we make the Hoeffding decomposition of all
$U$-statistics in this class, then the kernel functions of the
degenerate $U$-statistics appearing in these Hoeffding
decompositions also constitute an $L_2$-dense class. Another
important result of this section is Theorem~9.4. It yields a
decomposition of a $k$-fold random integral with respect to a
normalized empirical measure to the linear combination of
degenerate $U$-statistics. This result makes it possible to derive
Theorem~8.1 from Theorem 8.3 and Theorem~8.2 from Theorem~8.4,
and it is also useful in the proof of Theorems~8.3 and~8.4.
Let us first consider Hoeffding's decomposition. In the
special case $k=1$ it states that the sum
$S_n=\sum\limits_{j=1}^n\xi_j$ of independent and identically
distributed random variables can be rewritten as
$S_n=\sum\limits_{j=1}^n(\xi_j-E\xi_j)
+\left(\sum\limits_{j=1}^nE\xi_j\right)$, i.e.\
as the sum of independent random variables with zero expectation
plus a constant. We introduced the convention that a constant is
the kernel function of a degenerate $U$-statistic of order zero,
and $I_{n,0}(c)=c$ for a $U$-statistic of order zero. I wrote
down the above trivial formula, because Hoeffding's decomposition
is actually its adaptation to a more general situation. To
understand this let us first see how to adapt the above
construction to the case $k=2$.
In this case a sum of the form
$2I_{n,2}(f)=\sum\limits_{1\le j,k\le n,j\neq k} f(\xi_j,\xi_k)$
has to be considered. Write
$f(\xi_j,\xi_k)=[f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)]+
E(f(\xi_j,\xi_k)|\xi_k)=f_1(\xi_j,\xi_k)+\bar f_1(\xi_k)$ with
$f_1(\xi_j,\xi_k)=f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)$, and
$\bar f_1(\xi_k)=E(f(\xi_j,\xi_k)|\xi_k)$ to make the conditional
expectation of $f_1(\xi_j,\xi_k)$ with respect to $\xi_k$ equal
zero. Repeating this procedure for the first coordinate we define
$f_2(\xi_j,\xi_k)=f_1(\xi_j,\xi_k)-E(f_1(\xi_j,\xi_k)|\xi_j)$ and
$\bar f_2(\xi_j)=E(f_1(\xi_j,\xi_k)|\xi_j)$.
Let us also write $\bar f_1(\xi_k)=
[\bar f_1(\xi_k)-E\bar f_1(\xi_k)]+E\bar f_1(\xi_k)$ and
$\bar f_2(\xi_j)=[\bar f_2(\xi_j)-E\bar f_2(\xi_j)]
+E\bar f_2(\xi_j)$.
Simple calculation shows that $2I_{n,2}(f_2)$ is a degenerate
$U$-statistic of order 2, and the identity
$2I_{n,2}(f)=2I_{n,2}(f_2)+I_{n,1}((n-1)(\bar f_1-E\bar f_1))+
I_{n,1}((n-1)(\bar f_2-E\bar f_2))+n(n-1)E(\bar f_1+\bar f_2)$
yields the decomposition of $I_{n,2}(f)$ into a sum of degenerate
$U$-statistics of different orders.
Hoeffding's decomposition can be obtained by working out the details
of the above argument in the general case. But it is simpler to
calculate the appropriate conditional expectations with the help of
the kernel functions of the $U$-statistics. To carry out such a
program we introduce the following notations.
Let us consider the $k$-fold product $(X^k,{\Cal X}^k,\mu^k)$ of a
measure space $(X,{\Cal X},\mu)$ with some probability measure $\mu$,
and define for all integrable functions $f(x_1,\dots,x_k)$ and indices
$1\le j\le k$ the projection~$P_jf$ of the function $f$ to its $j$-th
coordinate as
$$
(P_jf)(x_1,\dots,x_{j-1},x_{j+1},\dots,x_k)=\int
f(x_1,\dots,x_k)\mu(\,dx_j), \quad 1\le j\le k. \tag9.1
$$
Let us also define the operators $Q_j=I-P_j$, i.e.\ $Q_jf=f-P_jf$, for
all integrable functions $f$ on the space $(X^k,{\Cal X}^k,\mu^k)$,
$1\le j\le k$. In the definition (9.1) $P_jf$ is a function not
depending on the coordinate $x_j$, but in the definition of $Q_j$
we introduce the fictive coordinate $x_j$ to make the expression
$Q_jf=f-P_jf$ meaningful. The following result holds.
\medskip\noindent
{\bf Theorem 9.1. (The Hoeffding decomposition).} {\it
Let $f(x_1,\dots,x_k)$ be an integrable function on the $k$-fold
product space $(X^k,{\Cal X}^k,\mu^k)$ of a space $(X,{\Cal X},\mu)$
with a probability measure $\mu$. It admits the decomposition
$$
f=\sum\limits_{V\subset\{1,\dots,k\}} f_V, \quad \text{with} \quad
f_V(x_j,\,j\in V)=\left(\prod_{j\in\{1,\dots,k\}\setminus V}P_j
\prod_{j\in V}Q_j\right)f(x_1,\dots,x_k) \tag9.2
$$
for which all functions $f_V$, $V\subset \{1,\dots,k\}$, in (9.2)
are canonical with respect to the probability measure $\mu$, and
the function $f_V$ depends on the arguments $x_j$, $j\in V$.
Let $\xi_1,\dots,\xi_n$ be a sequence of
independent, $\mu$-distributed random variables, and consider the
$U$-statistics $I_{n,k}(f)$ and $I_{n,|V|}(f_V)$ corresponding to
the kernel functions $f$, $f_V$ defined in (9.2) and random variables
$\xi_1,\dots,\xi_n$. Then
$$
k!I_{n,k}(f)=\sum_{V\subset\{1,\dots,k\}}
(n-|V|)(n-|V|-1)\cdots(n-k+1)|V|! I_{n,|V|}(f_V) \tag9.3
$$
is a representation of $I_{n,k}(f)$ as a sum of degenerate
$U$-statistics, where $|V|$ denotes the cardinality of the set $V$.
(The product $(n-|V|)(n-|V|-1)\cdots(n-k+1)$ is defined as 1 for
$V=\{1,\dots,k\}$, i.e. if $|V|=k$.) This representation is called
the Hoeffding decomposition of $I_{n,k}(f)$.}
\medskip\noindent
{\it Proof of Theorem 9.1.}\/ Write $f=\prod\limits_{j=1}^k(P_j+Q_j)f$.
By carrying out the multiplications in this identity and applying the
commutativity of the operators $P_j$ and $Q_j$ for different indices
$j$ we get formula (9.2). To show that the functions $f_V$ in formula
(9.2) are canonical, let us observe that this property can be rewritten
in the form $P_jf_V\equiv0$ for all $j\in V$ (as an identity in the
coordinates $x_s$, $s\in V\setminus\{j\}$).
Since $P_j=P_j^2$, and the identity $P_jQ_j=P_j-P_j^2=0$ holds for all
$j\in\{1,\dots,k\}$ this relation follows from the above mentioned
commutativity of the operators $P_j$ and $Q_j$, as $P_jf_V=
\left(\prod\limits_{s\in\{1,\dots,k\}\setminus V}
P_s\prod\limits_{s\in V\setminus\{j\}}Q_s\right)P_jQ_jf=0$.
By applying identity (9.2) for all terms
$f(\xi_{j_1},\dots,\xi_{j_k})$ in the sum defining the $U$-statistic
$I_{n,k}(f)$ and then summing them up we get relation (9.3).
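The mechanism of this proof can be illustrated on a small finite space. The following Python sketch is our own illustration (all names and numerical values are ours); it implements the operators $P_j$ and $Q_j$ of (9.1) for $k=2$ on a three-point space and checks that the four functions $f_V$ of (9.2) sum back to $f$ and are canonical.

```python
# Toy check of decomposition (9.2) for k = 2 on a three-point space X.
# A function on X^2 is a 3x3 table; P_j integrates out coordinate j against
# mu but keeps a fictive j-th coordinate, as in the text, and Q_j = I - P_j.
X, mu = [0, 1, 2], [0.2, 0.3, 0.5]

def P1(f):
    col = [sum(mu[a] * f[a][j] for a in X) for j in X]
    return [[col[j] for j in X] for _ in X]

def P2(f):
    row = [sum(mu[b] * f[i][b] for b in X) for i in X]
    return [[row[i] for _ in X] for i in X]

def Q1(f):
    g = P1(f)
    return [[f[i][j] - g[i][j] for j in X] for i in X]

def Q2(f):
    g = P2(f)
    return [[f[i][j] - g[i][j] for j in X] for i in X]

f = [[1.0, -2.0, 0.5], [3.0, 0.0, -1.0], [2.0, 4.0, -3.0]]
pieces = {(): P1(P2(f)), (1,): Q1(P2(f)), (2,): P1(Q2(f)), (1, 2): Q1(Q2(f))}

# The four pieces f_V sum back to f, since f = (P_1 + Q_1)(P_2 + Q_2)f ...
for i in X:
    for j in X:
        assert abs(sum(g[i][j] for g in pieces.values()) - f[i][j]) < 1e-9
# ... and each f_V is canonical: P_j annihilates f_V for every j in V.
for V, g in pieces.items():
    if 1 in V:
        assert all(abs(v) < 1e-9 for r in P1(g) for v in r)
    if 2 in V:
        assert all(abs(v) < 1e-9 for r in P2(g) for v in r)
```

The canonical property is exactly the identity $P_jQ_j=0$ exploited in the proof: applying $P_j$ to a piece containing the factor $Q_j$ gives the zero function.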
\medskip
In the Hoeffding decomposition we rewrote a general $U$-statistic in
the form of a linear combination of degenerate $U$-statistics. In
many applications of this result we still have to know how the
properties of the kernel function $f$ of the original $U$-statistic
are reflected in the properties of the kernel functions $f_V$ of
the degenerate $U$-statistics taking part in the Hoeffding
decomposition. In particular, we need a good estimate on
the $L_2$ and $L_\infty$ norm of the functions $f_V$ by means of
the corresponding norm of the function~$f$. Moreover, if we want to
prove estimates on the tail distribution of the supremum of
$U$-statistics $I_{n,k}(f)$ for a nice class of kernel functions
$f\in {\Cal F}$ which is an $L_2$-dense class of functions with some
exponent $L$ and parameter $D$, then we may need a similar estimate
on the class of kernel functions $f_V$, $f\in{\Cal F}$, with some
$V\subset\{1,\dots,k\}$ appearing in the Hoeffding decomposition of
these functions. We have to show that this class of functions is
also $L_2$-dense, and we also need a good bound on the exponent and
parameter of this $L_2$-dense class. The next result formulates
such a statement.
\medskip\noindent
{\bf Theorem 9.2. (Some properties of the Hoeffding decomposition).}
{\it Let us consider a square integrable function $f(x_1,\dots,x_k)$
on the $k$-fold product space $(X^k,{\Cal X}^k,\mu^k)$ and take its
decomposition defined in formula (9.2). The inequalities
$$
\int f_V^2(x_j,\,j\in V)
\prod\limits_{j\in V}\mu(\,dx_j)\le \int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k) \tag9.4
$$
and
$$
\sup_{x_j,\, j\in V} |f_V(x_j,\,j\in V)|\le2^{|V|}\sup_{x_j,\,1\le
j\le k}|f(x_1,\dots,x_k)| \tag$9.4'$
$$
hold for all $V\subset\{1,\dots,k\}$. (In particular, $f_\emptyset^2\le
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$ for $V=\emptyset$.)
Let us consider an $L_2$-dense class ${\Cal F}$ of functions with some
parameter $D\ge1$ and exponent $L\ge0$ on the space $(X^k,{\Cal X}^k)$,
take the decomposition (9.2) of all functions $f\in {\Cal F}$, and
define the classes of functions
${\Cal F}_V=\{2^{-|V|}f_V\colon\; f\in {\Cal F}\}$ for all
$V\subset\{1,\dots,k\}$ with the functions
$f_V$ taking part in this decomposition. These classes of functions
${\Cal F}_V$ are also $L_2$-dense with the same parameter~$D$ and
exponent~$L$ for all $V\subset\{1,\dots,k\}$.}
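The norm inequalities (9.4) and (9.4$'$) can be checked on a toy example. The following Python sketch is our own illustration (names and values are ours); it takes the simplest case $k=1$, where the decomposition consists of $f_\emptyset=P f$ (a constant) and $f_{\{1\}}=Qf=f-Pf$.

```python
import random

# Toy check of (9.4) and (9.4') for k = 1 on a three-point space.
rng = random.Random(2)
X, mu = [0, 1, 2], [0.25, 0.25, 0.5]
f = [rng.uniform(-3.0, 3.0) for _ in X]

Pf = sum(mu[x] * f[x] for x in X)   # f_emptyset, a constant
Qf = [f[x] - Pf for x in X]         # f_{1}, canonical with respect to mu

l2_f = sum(mu[x] * f[x] ** 2 for x in X)
# (9.4) for V = {} : the squared constant is at most the L_2 norm of f.
assert Pf ** 2 <= l2_f + 1e-12
# (9.4) for V = {1} : Qf is an L_2 contraction of f.
assert sum(mu[x] * Qf[x] ** 2 for x in X) <= l2_f + 1e-12
# (9.4') : the sup norm of Qf grows at most by the factor 2^{|V|} = 2.
assert max(abs(v) for v in Qf) <= 2 * max(abs(v) for v in f)
```

The first two assertions reflect the $L_2$ contraction property of $P$ and $Q=I-P$ established in Proposition~9.3 below, and the last one the trivial bound $\|Q\|_{L_\infty\to L_\infty}\le2$.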
\medskip
Theorem 9.2 will be proved as a consequence of Proposition~9.3
presented below. To formulate it first some notations will be
introduced:
Let us consider the product $(Y\times Z,{\Cal Y}\times{\Cal Z})$ of two
measurable spaces $(Y,{\Cal Y})$ and $(Z,{\Cal Z})$ together with a
probability measure $\mu$ on $(Z,{\Cal Z})$ and the operator
$$
Pf(y)=P_\mu f(y)=\int f(y,z)\mu(\,dz),\quad y\in Y,\; z\in Z \tag9.5
$$
defined for those $y\in Y$ for which the above integral is finite.
Let $I$ denote the identity operator on the space of functions on
$Y\times Z$, i.e. let $If(y,z)=f(y,z)$, and introduce the operator
$Q=Q_\mu=I-P=I-P_\mu$
$$
Q_\mu f(y,z)=(I-P_\mu)f(y,z)=f(y,z)-P_\mu f(y,z)
=f(y,z)-\int f(y,z)\mu(\,dz), \tag9.6
$$
defined for those points $(y,z)\in Y\times Z$ whose first
coordinate~$y$
is such that the expression~$P_\mu f(y)$ is meaningful.
(Here, and in the sequel a function $g(y)$ defined on the space
$(Y,{\Cal Y})$ will be sometimes identified with the function
$\bar g(y,z)=g(y)$ on the space
$(Y\times Z,{\Cal Y}\times {\Cal Z})$
which actually does not depend on the coordinate $z$.) The
following result holds:
\medskip\noindent
{\bf Proposition 9.3.} {\it Let us consider the direct product
$(Y\times Z,{\Cal Y}\times{\Cal Z})$ of two measure spaces
$(Y,{\Cal Y})$ and $(Z,{\Cal Z})$ together with a probability
measure $\mu$ on the
space $(Z,{\Cal Z})$. Take the transformations $P_\mu$ and $Q_\mu$
defined in formulas (9.5) and (9.6). Given any probability measure
$\rho$ on the space $(Y,{\Cal Y})$ consider the product measure
$\rho\times\mu$ on $(Y\times Z,{\Cal Y}\times{\Cal Z})$. Then the
transformations $P_\mu$ and $Q_\mu$, as maps from the space
$L_2(Y\times Z,{\Cal Y}\times{\Cal Z},\rho\times \mu)$ to
$L_2(Y,{\Cal Y},\rho)$ and
$L_2(Y\times Z,{\Cal Y}\times{\Cal Z},\rho\times\mu)$
respectively, have a norm less than or equal to 1, i.e.
$$
\int P_\mu f(y)^2\rho(\,dy)\le\int f(y,z)^2\rho(\,dy)\mu(\,dz),
\tag9.7
$$
and
$$
\int Q_\mu f(y,z)^2\rho(\,dy)\mu(\,dz)\le\int f(y,z)^2
\rho(\,dy)\mu(\,dz) \tag9.8
$$
for all functions
$f\in L_2(Y\times Z,{\Cal Y}\times{\Cal Z},\rho\times \mu)$.
If ${\Cal F}$ is an $L_2$-dense class of functions
$f(y,z)$ in the product space $(Y\times Z,{\Cal Y}\times{\Cal Z})$,
with some parameter $D\ge1$ and exponent $L\ge0$, then also the
classes ${\Cal F}_\mu=\{P_\mu f,\; f\in {\Cal F}\}$ and ${\Cal G}_\mu
=\{\frac12Q_\mu f=\frac12(f-P_\mu f),\; f\in{\Cal F}\}$ are
$L_2$-dense classes with the same exponent $L$ and parameter~$D$
in the spaces $(Y,{\Cal Y})$ and $(Y\times Z,{\Cal Y}\times{\Cal Z})$
respectively.}
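The two contraction inequalities are easy to verify numerically. The following Python sketch is our own illustration (all names and values are ours); it checks (9.7) and (9.8) on finite spaces, together with the orthogonality of $P_\mu f$ and $Q_\mu f$ used in the proof below.

```python
import random

# Finite model: Y has 3 points and Z has 4; rho is a probability measure
# on Y, mu a probability measure on Z, and f a function on Y x Z.
rng = random.Random(1)
Y, Z = range(3), range(4)
rho = [0.2, 0.5, 0.3]
mu = [0.1, 0.2, 0.3, 0.4]
f = [[rng.uniform(-2.0, 2.0) for _ in Z] for _ in Y]

Pf = [sum(mu[z] * f[y][z] for z in Z) for y in Y]   # P_mu f, formula (9.5)
Qf = [[f[y][z] - Pf[y] for z in Z] for y in Y]      # Q_mu f, formula (9.6)

norm_f = sum(rho[y] * mu[z] * f[y][z] ** 2 for y in Y for z in Z)
norm_Pf = sum(rho[y] * Pf[y] ** 2 for y in Y)
norm_Qf = sum(rho[y] * mu[z] * Qf[y][z] ** 2 for y in Y for z in Z)

assert norm_Pf <= norm_f + 1e-12    # inequality (9.7)
assert norm_Qf <= norm_f + 1e-12    # inequality (9.8)
# P_mu f and Q_mu f are orthogonal in L_2(rho x mu), so the two squared
# norms actually add up to the squared norm of f.
assert abs(norm_f - (norm_Pf + norm_Qf)) < 1e-9
```

The last assertion is the Pythagorean identity behind the proof of (9.8): since $\int Q_\mu f(y,z)\mu(\,dz)=0$ for every $y$, the cross term vanishes.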
\medskip
The following corollary of Proposition 9.3 is formally more general,
but it is a simple consequence of that result. It is actually this
corollary that we shall need.
\medskip\noindent
{\bf Corollary of Proposition 9.3.} {\it Let us consider the
product
$(Y_1\times Z\times Y_2,{\Cal Y}_1\times{\Cal Z}\times{\Cal Y}_2)$
of three measurable spaces $(Y_1,{\Cal Y}_1)$, $(Z,{\Cal Z})$
and $(Y_2,{\Cal Y}_2)$ with a probability measure $\mu$ on the
space $(Z,{\Cal Z})$ and a probability measure $\rho$ on
$(Y_1\times Y_2,{\Cal Y}_1\times{\Cal Y}_2)$,
and define the transformations
$$
P_\mu f(y_1,y_2)=\int f(y_1,z,y_2)\mu(\,dz),\quad y_1\in Y_1,\;
z\in Z,\; y_2\in Y_2 \tag$9.5'$
$$
and
$$
\aligned
Q_\mu f(y_1,z,y_2)&=(I-P_\mu)f(y_1,z,y_2)=f(y_1,z,y_2)
-P_\mu f(y_1,z,y_2) \\
&=f(y_1,z,y_2)-\int f(y_1,z,y_2)\mu(\,dz),
\quad y_1\in Y_1,\; z\in Z, \;y_2\in Y_2
\endaligned \tag$9.6'$
$$
for the measurable functions $f$ on the space $Y_1\times Z\times Y_2$
integrable with respect to the measure $\rho\times\mu$.
Then
$$
\int P_\mu f(y_1,y_2)^2\rho(\,dy_1,\,dy_2) \le\int
f(y_1,z,y_2)^2(\rho\times \mu)(\,dy_1,\,dz,\,dy_2) \tag$9.7'$
$$
for all probability measures $\rho$ on
$(Y_1\times Y_2,{\Cal Y}_1\times{\Cal Y}_2)$, where
$\rho\times\mu$ is the product of the probability measure $\rho$
on $(Y_1\times Y_2,{\Cal Y}_1\times{\Cal Y}_2)$ and $\mu$ is a
probability measure on $(Z,{\Cal Z})$. Also the inequality
$$
\int Q_\mu f(y_1,z,y_2)^2\rho(\,dy_1,\,dy_2)\mu(\,dz)\le\int
f(y_1,z,y_2)^2\rho(\,dy_1,\,dy_2)\mu(\,dz) \tag$9.8'$
$$
holds for all functions $f\in L_2(Y_1\times Z\times Y_2,
{\Cal Y}_1\times{\Cal Z}\times{\Cal Y}_2,\rho\times\mu)$.
If ${\Cal F}$ is an $L_2$-dense class of functions $f(y_1,z,y_2)$ in
the product space $(Y_1\times Z\times Y_2,{\Cal Y}_1\times{\Cal Z}
\times{\Cal Y}_2)$, with some parameter $D\ge1$ and exponent $L\ge0$,
then also the classes ${\Cal F}_\mu=\{P_\mu f,\; f\in {\Cal F}\}$ and
${\Cal G}_\mu=\{\frac12Q_\mu f=\frac12(f-P_\mu f),\; f\in{\Cal F}\}$
are $L_2$-dense classes with exponent $L$ and parameter~$D$ in the
spaces $(Y_1\times Y_2,{\Cal Y}_1\times {\Cal Y}_2)$ and $(Y_1\times
Z\times Y_2,{\Cal Y}_1\times{\Cal Z}\times{\Cal Y}_2)$ respectively.}
\medskip
This corollary is a simple consequence of Proposition~9.3 if we
apply it with $(Y,{\Cal Y})=(Y_1\times Y_2,{\Cal Y}_1\times{\Cal Y}_2)$
and take the natural mapping $f((y_1,y_2),z)\to f(y_1,z,y_2)$ of a
function from the space $(Y\times Z,{\Cal Y}\times {\Cal Z})$ to a
function on $(Y_1\times Z\times Y_2,{\Cal Y}_1\times{\Cal Z}\times
{\Cal Y}_2)$. Beside this, we apply that measure on
$(Y_1\times Z\times Y_2,{\Cal Y}_1\times {\Cal Z}\times {\Cal Y}_2)$
which is the image of the product measure $\rho\times\mu$ with
respect to the map induced by the above transformation on the space
of measures.
Proposition 9.3, more precisely its corollary, implies Theorem 9.2,
since it implies that the operators $P_s$, $Q_s$, $1\le s\le k$,
applied in Theorem~9.2 do not increase the $L_2(\mu)$ norm of a
function $f$. It is also clear that the norm of $P_s$ is bounded
by 1 and the norm of $Q_s=I-P_s$ is bounded by 2 as operators
from $L_\infty$ to $L_\infty$. The corollary of
Proposition~9.3 also implies that if ${\Cal F}$ is an $L_2$-dense
class of functions with parameter $D$ and exponent~$L$, then the
same property holds for the classes of functions
${\Cal F}_{P_s}=\{P_sf\colon\; f\in {\Cal F}\}$ and
${\Cal F}_{Q_s}=\{\frac12Q_sf\colon\; f\in {\Cal F}\}$,
$1\le s\le k$. These relations together with the identity
$f_V=\left(\prod\limits_{s\in\{1,\dots,k\}\setminus V}P_s
\prod\limits_{s\in V}Q_s\right)f$
imply Theorem~9.2.
\medskip\noindent
{\it Proof of Proposition 9.3.}\/ The Schwarz inequality yields that
$P_\mu f(y)^2\le\int f(y,z)^2\mu(\,dz)$, and integrating this
inequality with respect to the probability measure $\rho(\,dy)$ we
get inequality~(9.7). Also the inequality
$$
\align
\int Q_\mu f(y,z)^2\rho(\,dy)\mu(\,dz)&=\int [f(y,z)-P_\mu
f(y,z)]^2\rho(\,dy)\mu(\,dz) \\
&\le\int f(y,z)^2\rho(\,dy)\mu(\,dz)
\endalign
$$
holds, and this is relation (9.8). The inequality follows for instance from
the observation that the functions $f(y,z)-P_\mu f(y,z)$ and
$P_\mu f(y,z)$ are orthogonal in the space
$L_2(Y\times Z,{\Cal Y}\times{\Cal Z},\rho\times\mu)$.
Let us consider an arbitrary probability measure $\rho$ on the space
$(Y,{\Cal Y})$. To prove that ${\Cal F}_\mu$ is an $L_2$-dense class with
parameter~$D$ and exponent~$L$ if the same relation holds for
${\Cal F}$ we have to find for all $0<\varepsilon\le1$ a set
$\{f_1,\dots,f_m\}\subset {\Cal F}_\mu$ with
$m\le D \varepsilon^{-L}$ elements, such that
$\inf\limits_{1\le j\le m}\int(f_j-f)^2\,d\rho\le \varepsilon^2$ for all
$f\in {\Cal F}_\mu$. But a similar property holds for ${\Cal F}$ in
the space $Y\times Z$ with the probability measure $\rho\times\mu$.
This property together with the $L_2$ contraction property of
$P_\mu$ formulated in (9.7) imply that ${\Cal F}_\mu$ is an
$L_2$-dense class.
To prove that ${\Cal G}_\mu$ is also $L_2$-dense with
parameter~$D$ and exponent~$L$ under the same condition we have
to find for all numbers $0<\varepsilon\le1$ and probability
measures $\rho$ on $Y\times Z$ a subset
$\{g_1,\dots,g_m\}\subset{\Cal G}_\mu$ with
$m\le D\varepsilon^{-L}$ elements such that
$\inf\limits_{1\le j\le m}\int (g_j-g)^2\,d\rho\le \varepsilon^2$
for all $g\in{\Cal G}_\mu$.
To show this let us consider the probability measure
$\tilde\rho=\frac12(\rho+\bar\rho\times\mu)$ on $(Y\times Z,\Cal
Y\times{\Cal Z})$, where $\bar\rho$ is the projection of the measure
$\rho$ to $(Y,{\Cal Y})$, i.e. $\bar\rho(A)=\rho(A\times Z)$ for all
$A\in{\Cal Y}$, take a class of functions
${\Cal F}_0(\varepsilon,\tilde \rho)
=\{f_1,\dots,f_m\}\subset{\Cal F}$ with $m\le D\varepsilon^{-L}$
elements such that
$\inf\limits_{1\le j\le m}\int (f_j-f)^2\,d\tilde\rho\le \varepsilon^2$
for all $f\in{\Cal F}$, and put
$\{g_1,\dots,g_m\}=\{\frac12Q_\mu f_1,\dots,\frac12Q_\mu f_m\}$.
All functions $g\in{\Cal G}_\mu$ can be written in the form
$g=\frac12Q_\mu f$ with some $f\in {\Cal F}$, and there exists some
function $f_j\in{\Cal F}_0(\varepsilon,\tilde\rho)$ such that
$\int (f-f_j)^2\,d\tilde\rho\le\varepsilon^2$. Hence to complete
the proof
of Proposition~9.3 it is enough to show that $\int\frac14(Q_\mu f
-Q_\mu\bar f)^2\,d\rho\le\int(f-\bar f)^2\,d\tilde\rho$ for all
pairs $f,\bar f\in{\Cal F}$. This inequality holds, since
$\int\frac14(Q_\mu f-Q_\mu\bar f)^2\,d\rho\le\int\frac12(f-\bar
f)^2\,d\rho+\int\frac12(P_\mu f-P_\mu\bar f)^2\,d\rho$, and
$\int(P_\mu f-P_\mu\bar f)^2\,d\rho=\int(P_\mu f-P_\mu\bar
f)^2\,d\bar\rho\le\int(f-\bar f)^2\,d(\bar\rho\times\mu)$ by
formula (9.7). The above relations imply that $\int\frac14(Q_\mu
f-Q_\mu\bar f)^2\,d\rho\le \frac12\int(f-\bar f)^2
\,d(\rho+\bar\rho\times\mu)=\int(f-\bar f)^2\,d\tilde\rho$ as we
have claimed.
\medskip
Now we shall discuss the relation between Theorem~$8.1'$ and
Theorem~8.3 and between Theorem 8.2 and Theorem 8.4. First we
show that Theorem~8.1 (or Theorem~$8.1'$) is equivalent
to the estimate~$(8.10')$ in the corollary of Theorem~8.3 which
is slightly weaker than the estimate~(8.10) of Theorem~8.3. We
also claim that Theorems~8.2 and~8.4 are equivalent. Both in
Theorem~8.2 and in Theorem~8.4 we can restrict our attention to
the case when the class of functions ${\Cal F}$ is countable,
since the case of countably approximable classes can be simply
reduced to this situation. Let us remark that integration with
respect to the measure $\mu_n-\mu$ in the definition~(4.8) of
the integral $J_{n,k}(f)$ yields some kind of normalization
which is missing in the definition of the $U$-statistics
$I_{n,k}(f)$. This is why degenerate $U$-statistics
had to be considered in Theorems~8.3 and~8.4. The deduction of
the corollary of Theorem~8.3 from Theorem~$8.1'$ or of
Theorem~8.4 from Theorem~8.2 is fairly simple if the underlying
probability measure $\mu$ is non-atomic, since in this case the
identity $I_{n,k}(f)=J_{n,k}(f)$ holds for a canonical function
with respect to the measure $\mu$. Let us remark that the
non-atomic property of the measure $\mu$ is needed in this
argument not only because of the conditions of Theorems~$8.1'$
and~8.2, but since in the proof of the above identity we need
the identity $\int f(x_1,\dots,x_k)\mu(\,dx_j)\equiv0$ in the
case when the domain of integration is not the whole space~$X$
but the set $X\setminus\{x_1,\dots,x_{j-1},x_{j+1},\dots,x_k\}$.
The case of possibly atomic measures $\mu$ can be simply reduced
to the case of non-atomic measures by means of the following
enlargement of the space $(X,{\Cal X},\mu)$. Let us introduce the
product space $(\bar X,\bar{\Cal X},\bar\mu)=(X,{\Cal X},\mu)
\times([0,1],{\Cal B},\lambda)$, where ${\Cal B}$ is the
$\sigma$-algebra and $\lambda$ is the Lebesgue measure on
$[0,1]$. Define the function
$\bar f((x_1,u_1),\dots,(x_k,u_k))=f(x_1,\dots,x_k)$ in this
enlarged space. Then $I_{n,k}(f)=I_{n,k}(\bar f)$, the measure
$\bar\mu=\mu\times\lambda$ is non-atomic, and $\bar f$ is canonical
with respect to~$\bar\mu$ if $f$ is canonical with respect to~$\mu$.
Hence the corollary of Theorem~8.3 and Theorem~8.4 can be derived
from Theorems~$8.1'$ and~8.2 respectively by proving them first for
their counterpart in the above constructed enlarged space with the
above defined functions.
Also Theorems~$8.1'$ and~8.2 can be derived from Theorems~8.3
and~8.4 respectively, but this is a much harder problem. To do
this let us observe that a random integral $J_{n,k}(f)$ can
be written as a sum of $U$-statistics of different order, and it
can also be expressed as a sum of degenerate $U$-statistics if
Hoeffding's decomposition is applied for each $U$-statistic in
this sum. Moreover, we shall show that the multiple integral of
a function~$f$ of $k$~variables with respect to a normalized
empirical distribution can be decomposed to the linear
combination of degenerate $U$-statistics with the same kernel
functions~$f_V$ which appeared in~Theorem~9.1 with relatively
small coefficients. This is the content of the following
Theorem~9.4. For the sake of a better understanding I shall
reformulate it in a more explicit form in the special case $k=2$
in Corollary~2 of Theorem~9.4 at the end of this section.
\medskip\noindent
{\bf Theorem 9.4. (Decomposition of a multiple random integral with
respect to a normalized empirical measure to a linear combination of
degenerate $U$-statistics).} {\it Let a non-atomic measure~$\mu$ be
given on a measurable space $(X,{\Cal X})$ together with a sequence of
independent, $\mu$-distributed random variables $\xi_1,\dots,\xi_n$.
Take a function $f(x_1,\dots,x_k)$ of $k$ variables integrable with
respect to the product measure~$\mu^k$ on the product space
$(X^k,{\Cal X}^k)$, and consider the empirical distribution $\mu_n$
of the sequence $\xi_1,\dots,\xi_n$ introduced in~(4.5) together
with the $k$-fold random integral $J_{n,k}(f)$ of the function~$f$
defined in~(4.8). The identity
$$
k!J_{n,k}(f)=\sum_{V\subset\{1,\dots,k\}}C(n,k,|V|)n^{-|V|/2}
|V|!I_{n,|V|}(f_V) \tag9.9
$$
holds with the set of (canonical) functions $f_V(x_j,\;j\in V)$
(with respect to the measure $\mu$) defined in formula (9.2)
together with some appropriate real numbers $C(n,k,p)$,
$0\le p\le k$, where $I_{n,|V|}(f_V)$ denotes the (degenerate)
$U$-statistic of order $|V|$ with the random variables
$\xi_1,\dots,\xi_n$ and kernel function $f_V$. The constants
$C(n,k,p)$ in formula (9.9) satisfy the inequality
$|C(n,k,p)|\le C(k)$ for all $n\ge k$ and $0\le p\le k$ with some
constant $C(k)<\infty$ depending only on the order $k$ of the
integral $J_{n,k}(f)$. The relations
$\lim\limits_{n\to\infty}C(n,k,p)=C(k,p)$ hold with some appropriate
constant $C(k,p)$ for all $1\le p\le k$, and $C(n,k,k)=1$.}
\medskip\noindent
{\it Remark.} As the proof of Theorem 9.4 will show, the constant
$C(n,k,p)$ in formula~(9.9) is a polynomial of order~$k-1$ in the
variable $n^{-1/2}$ with some coefficients depending on the
parameters~$k$ and~$p$. As a consequence, $C(k,p)$ equals the
constant term of this polynomial.
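The content of formula~(9.9) can be made tangible by a numerical check of its simplest non-trivial special case. The Python sketch below is an illustration, not part of the proof; it assumes the conventions $J_{n,2}(f)=\frac n{2!}\int{}' f\,d(\mu_n-\mu)^2$ (off-diagonal integral, as in (4.8)) and $I_{n,2}(f)=\frac1{2!}\sum_{j_1\ne j_2}f(\xi_{j_1},\xi_{j_2})$ (as in (8.7)). For a canonical kernel function $f$ of two variables all proper-subset terms of (9.9) vanish and $C(n,2,2)=1$, so the identity reduces to $J_{n,2}(f)=n^{-1}I_{n,2}(f)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
xi = rng.random(n)                  # i.i.d. sample from mu = uniform on [0, 1]

def f(x, y):
    # A canonical kernel: both marginal mu-integrals vanish identically.
    return (x - 0.5) * (y - 0.5)

# Degenerate U-statistic I_{n,2}(f): sum over ordered distinct pairs / 2!.
F = f(xi[:, None], xi[None, :])
np.fill_diagonal(F, 0.0)
I = F.sum() / 2.0

# J_{n,2}(f) = (n / 2!) * (double integral of f over the off-diagonal part
# of (mu_n - mu) x (mu_n - mu)); the mu-integrals are computed by the
# midpoint rule, which is exact for this kernel.
grid = (np.arange(10_000) + 0.5) / 10_000
S_nn = F.sum() / n**2                                # empirical x empirical
S_nmu = np.mean(xi - 0.5) * np.mean(grid - 0.5)      # empirical x mu  (= 0)
S_mumu = np.mean(grid - 0.5) ** 2                    # mu x mu         (= 0)
J = (n / 2.0) * (S_nn - 2.0 * S_nmu + S_mumu)

# Special case of (9.9) for canonical f and k = 2.
assert abs(J - I / n) < 1e-8
```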
\medskip
Theorems~$8.1'$ and~8.2 can be simply derived from Theorems~8.3
and~8.4 respectively with the help of Theorem~9.4. Indeed, to
get Theorem~$8.1'$ observe that formula~(9.9) implies the inequality
$$
P(|J_{n,k}(f)|>u)\le \sum_{V\subset\{1,\dots,k\}}
P\left(n^{-|V|/2}|I_{n,|V|}(f_V)|>\frac u{2^kC(k)}\right) \tag9.10
$$
with a constant $C(k)$ satisfying the inequality $p!|C(n,k,p)|\le
k!C(k)$ for all coefficients $C(n,k,p)$, $1\le p\le k$, in~(9.9). Hence
Theorem~$8.1'$ follows from Theorem~8.3 and relations~(9.4)
and~$(9.4')$ in Theorem~9.2 by which the $L_2$-norm of the
functions $f_V$ is bounded by the $L_2$-norm of the function~$f$
and the $L_\infty$-norm of $f_V$ is bounded by $2^{|V|}$ times
the $L_\infty$-norm of~$f$. It is enough to estimate each term at
the right-hand side of (9.10) by means of Theorem~8.3. It can be
assumed that $2^kC(k)>1$. Let us first assume that also the
inequality $\frac u{2^kC(k) \sigma}\ge1$ holds. In this case
formula $(8.3')$ in Theorem~$8.1'$ can be obtained by means of the
estimation of each term at the right-hand side of (9.10). Observe
that
$\exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}\right)^{2/s}
\right\}\le \exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}
\right)^{2/k}\right\}$ for all
$s\le k$ if $\frac u{2^kC(k)\sigma}\ge1$. In the other case, when
$\frac u{2^kC(k)\sigma}\le1$, formula~$(8.3')$ holds again with a
sufficiently large $C>0$, because in this case the right-hand side
of~$(8.3')$ is greater than~1.
Theorem~8.2 can be similarly derived from Theorem~8.4 by observing
that relation~(9.10) remains valid if $|J_{n,k}(f)|$ is replaced
by $\sup\limits_{f\in{\Cal F}}|J_{n,k}(f)|$ and $|I_{n,|V|}(f_V)|$
by $\sup\limits_{f_V\in{\Cal F}_V} |I_{n,|V|}(f_V)|$ in it, and we
have the right to choose the constant~$M$ in formula~(8.6) of
Theorem~8.2 sufficiently large. The only difference in the argument
is that beside formulas~(9.4) and~$(9.4')$ the last statement of
Theorem~9.2 also has to be applied in this case. It tells that if
${\Cal F}$ is an $L_2$-dense class of functions on a space
$(X^k,{\Cal X}^k)$, then the classes of functions
${\Cal F}_V=\{2^{-|V|}f_V\colon\, f\in{\Cal F}\}$ are also
$L_2$-dense classes of functions for all
$V\subset\{1,\dots,k\}$ with the same exponent and parameter.
\medskip
I make some comments about the content of Theorem~9.4. The
expression $J_{n,k}(f)$ was defined as a $k$-fold random integral
with respect to the signed measure $\mu_n-\mu$, where the diagonals
were omitted from the domain of integration. Formula~(9.9) expresses
the random integral $J_{n,k}(f)$ as a linear combination of
degenerate $U$-statistics of different order. This is similar to
the Hoeffding decomposition of the $U$-statistic $I_{n,k}(f)$ to the
linear combination of degenerate $U$-statistics defined with the
same kernel functions~$f_V$. The main difference between these two
formulas is that in the expansion~(9.9) of
$J_{n,k}(f)$ the terms $I_{n,|V|}(f_V)$ appear with small
coefficients $C(n,k,|V|)|V|!n^{-|V|/2}$. As we shall see, the
second moment
$E\left(C(n,k,|V|)|V|!\,n^{-|V|/2}I_{n,|V|}(f_V)\right)^2$ of such a
term is small if $|V|<k$, hence the dominant contribution to this
expansion comes from the term with $V=\{1,\dots,k\}$, whose
coefficient satisfies $C(n,k,k)=1$.

Let us recall the definition of non-atomic measures applied in this
work. A $\sigma$-finite measure~$\mu$ on a measurable space
$(X,{\Cal X})$ is non-atomic if for all measurable sets~$A$ with
$\mu(A)<\infty$ and all $\varepsilon>0$ there is a finite partition
$A=\bigcup\limits_{s=1}^N B_s$ of the set~$A$ with the property
$\mu(B_s)<\varepsilon$ for all $1\le s\le N$. There is a formally
weaker definition of non-atomic measures by which a
$\sigma$-finite measure~$\mu$ is non-atomic if for all measurable
sets $A$ such that $0<\mu(A)<\infty$ there exists a $B\subset A$
with the property $0<\mu(B)<\mu(A)$. But these two definitions of
non-atomic measures are actually equivalent, although this
equivalence is far from trivial. I do not discuss this problem
here, since it lies somewhat outside the direction of the present
work. In our further considerations we shall work with
the first definition of non-atomic measures.
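As a concrete instance of the first definition, the Lebesgue measure on $[0,1]$ is non-atomic: any interval can be cut into finitely many disjoint pieces, each of measure less than a prescribed $\varepsilon$. A minimal Python sketch of such a partition:

```python
import math

def partition(a, b, eps):
    """Split the interval [a, b] into finitely many disjoint intervals,
    each of Lebesgue measure strictly less than eps."""
    n = math.floor((b - a) / eps) + 1
    pts = [a + (b - a) * k / n for k in range(n + 1)]
    return [(pts[k], pts[k + 1]) for k in range(n)]

parts = partition(0.0, 1.0, 0.3)           # 4 pieces of length 0.25 each
assert all(q - p < 0.3 for p, q in parts)                  # all pieces small
assert abs(sum(q - p for p, q in parts) - 1.0) < 1e-12     # they cover [0, 1]
```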
The $k$-fold Wiener-It\^o integrals of the functions
$f\in{\Cal H}_{\mu,k}$ with respect to the white noise~$\mu_W$ will
be defined in a rather standard way. First they will be defined for
some simple functions, called elementary functions, then it will be
shown that the integral of these elementary functions has an $L_2$
contraction property which makes it possible to extend it to the
class of functions in ${\Cal H}_{\mu,k}$.
Let us first introduce the following class of elementary
functions $\bar{\Cal H}_{\mu,k}$ of $k$ variables. A function
$f(x_1,\dots,x_k)$ on $(X^k,{\Cal X}^k)$ belongs to
$\bar{\Cal H}_{\mu,k}$ if there exist finitely many disjoint
measurable subsets $A_1,\dots,A_M$, $1\le M<\infty$, of the
set~$X$ (i.e. $A_j\cap A_{j'}=\emptyset$ if $j\neq j'$) such that
$\mu(A_j)<\infty$ for all $1\le j\le M$, and the function $f$ has
the form
$$
f(x_1,\dots,x_k)=\left\{
\aligned
&c(j_1,\dots,j_k)\quad\text{if } (x_1,\dots,x_k) \in
A_{j_1}\times\cdots \times A_{j_k} \\
&\qquad \text{with some indices } (j_1,\dots,j_k),
\quad 1\le j_s\le M,\; 1\le s\le k,\\
&\qquad \text{ such that all numbers } j_1,\dots,j_k
\text{ are different} \\
&0 \quad\text{if }(x_1,\dots,x_k)\notin\bigcup\limits\Sb
(j_1,\dots,j_k)\colon \; 1\le j_s\le M, \; 1\le s\le k,\\
\text{ and all } j_1,\dots,j_k\text { are different.}\endSb
A_{j_1}\times\cdots \times A_{j_k}
\endaligned \right. \tag10.2
$$
with some real numbers $c(j_1,\dots,j_k)$, $1\le j_s\le M$, $1\le
s\le k$, if all $j_1,\dots,j_k$ are different numbers. This means
that the function $f$ is constant on all $k$-dimensional
rectangles $A_{j_1}\times\dots\times A_{j_k}$ with different,
non-intersecting edges, and it equals zero on the complementary
set of the union of these rectangles. The property that the support
of the function~$f$ is on the union of rectangles with
non-intersecting edges is sometimes interpreted so that the
diagonals are omitted from the domain of integration of
Wiener--It\^o integrals.
The Wiener-It\^o integral of an elementary function
$f(x_1,\dots,x_k)$ of the form (10.2) with respect to a white
noise $\mu_W$ with the (non-atomic) reference measure $\mu$
is defined by the formula
$$
\aligned
&\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k)\\
&\qquad=\sum\Sb 1\le j_s\le M,\;1\le s\le k \\
\text{all } j_1,\dots,j_k \text{ are different}\endSb
c(j_1,\dots,j_k) \mu_W(A_{j_1})\cdots\mu_W(A_{j_k}).
\endaligned \tag10.3
$$
(The representation of the function $f$ in (10.2) is not unique,
the sets $A_j$ can be divided into smaller disjoint sets, but its
Wiener--It\^o integral defined in (10.3) does not depend on its
representation.) The notation
$$
Z_{\mu,k}(f)=\frac1{k!}
\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k), \tag10.4
$$
will be used in the sequel, and the expression $Z_{\mu,k}(f)$
will be called the normalized Wiener--It\^o integral of the
function~$f$. Such a terminology will be applied also for the
Wiener--It\^o integrals of all functions $f\in{\Cal H}_{\mu,k}$ to
be defined later.
If $f$ is an elementary function in $\bar{\Cal H}_{\mu,k}$ defined
in (10.2), then its normalized Wiener--It\^o integral defined
in~(10.3) and~(10.4) satisfies the relations
$$
\aligned
Ek!Z_{\mu,k}(f)&=0, \\
E(k!Z_{\mu,k}(f))^2&=\sum\Sb (j_1,\dots,j_k)\colon\;
1\le j_s\le M,\; 1\le s\le k, \\
\text{and all } j_1,\dots,j_k\text{ are different.}\endSb
\sum_{\pi\in \Pi_k}
c(j_1,\dots,j_k)c(j_{\pi(1)},\dots,j_{\pi(k)}) \\
&\qquad\qquad\qquad E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j_{\pi(1)}})\cdots\mu_W(A_{j_{\pi(k)}}) \\
&=k!\int \text{Sym\,} f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(\,dx_k)\\
&\le k!\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k),
\endaligned \tag10.5
$$
with $\text{Sym\,}f(x_1,\dots,x_k)=
\frac1{k!}\sum\limits_{\pi\in\Pi_k}f(x_{\pi(1)},\dots,x_{\pi(k)})$,
where $\Pi_k$ denotes the set of all permutations
$\pi=\{\pi(1),\dots,\pi(k)\}$ of the set $\{1,\dots,k\}$.
The identities written down in (10.5) can be simply checked. The
first relation follows from the identity
$E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})=0$ for disjoint sets
$A_{j_1},\dots,A_{j_k}$, which holds since it is the expectation
of a product of independent random variables, each with zero
expectation. The second identity follows similarly from the identity
$$
\align
&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})=0\\
&\qquad \text{ if the sets of indices }
\{j_1,\dots,j_k\} \text { and }
\{j'_1,\dots,j'_k\} \text{ are different,} \\
&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})
=\mu(A_{j_1})\cdots\mu(A_{j_k})\\
&\qquad \text{ if } \{j_1,\dots,j_k\}=
\{j'_1,\dots,j'_k\} \text{ i.e. if }
j'_1=j_{\pi(1)},\dots,j'_k=j_{\pi(k)} \\
&\qquad \text{ with some permutation } \pi\in\Pi_k,
\endalign
$$
which holds because the values of $\mu_W$ on disjoint sets are
independent random variables with expectation zero, and
$E\mu_W(A)^2=\mu(A)$. The remaining relations in (10.5) can be
simply checked.
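The second identity of (10.5) can also be verified by a small exact computation in the case $k=2$. The Python sketch below is an illustration; it encodes only the moment rules quoted above (values of $\mu_W$ on disjoint sets are independent with zero expectation, and $E\mu_W(A)^2=\mu(A)$), and compares the resulting second moment of $2!Z_{\mu,2}(f)$ with $2!\int(\text{Sym}\,f)^2\,d\mu^2$ for random coefficients $c(j_1,j_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5
mu = rng.random(M)                 # mu(A_1), ..., mu(A_M) for disjoint sets
c = rng.standard_normal((M, M))    # c(j1, j2); only off-diagonal entries used

# E (2! Z)^2 from the moment rules of white noise: for j1 != j2, j1' != j2'
# the fourth moment E w_{j1} w_{j2} w_{j1'} w_{j2'} equals mu_{j1} mu_{j2}
# if {j1, j2} = {j1', j2'} and 0 otherwise.
lhs = sum(c[j1, j2] * (c[j1, j2] + c[j2, j1]) * mu[j1] * mu[j2]
          for j1 in range(M) for j2 in range(M) if j1 != j2)

# 2! * integral of (Sym f)^2 with respect to mu x mu.
rhs = 2 * sum((0.5 * (c[j1, j2] + c[j2, j1])) ** 2 * mu[j1] * mu[j2]
              for j1 in range(M) for j2 in range(M) if j1 != j2)

assert abs(lhs - rhs) < 1e-10
```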
It is not difficult to check that
$$
EZ_{\mu,k}(f)Z_{\mu,k'}(g)=0 \tag10.6
$$
for all functions $f\in \bar{\Cal H}_{\mu,k}$ and
$g\in \bar{\Cal H}_{\mu,k'}$ if $k\neq k'$, and
$$
Z_{\mu,k}(f)=Z_{\mu,k}(\text{Sym}\, f) \tag10.7
$$
for all functions $f\in \bar{\Cal H}_{\mu,k}$.
The definition of Wiener--It\^o integrals can be extended to
general functions $f\in{\Cal H}_{\mu,k}$ with the help of the
estimate~(10.5). But to carry out this extension we still have
to know that the class of functions $\bar{\Cal H}_{\mu,k}$ is
a dense subset of the class ${\Cal H}_{\mu,k}$ in the Hilbert
space $L_2(X^k,{\Cal X}^k,\mu^k)$, where $\mu^k$ is the $k$-th power
of the reference measure $\mu$ of the white noise~$\mu_W$. I
briefly explain how this property of $\bar{\Cal H}_{\mu,k}$ can be
proved. The non-atomic property of the measure~$\mu$ is exploited
at this point.
To prove this statement it is enough to show that the indicator
function of any product set $A_1\times\cdots\times A_k$
with $\mu(A_j)<\infty$, $1\le j\le k$ (the sets
$A_1,\dots,A_k$ may be non-disjoint), is in the $L_2(\mu^k)$
closure of $\bar{\Cal H}_{\mu,k}$. In the proof of this
statement it will be exploited that since $\mu$ is a non-atomic
measure, the sets $A_j$ can be represented for all
$\varepsilon>0$ and $1\le j\le k$ as a finite union
$A_j=\bigcup\limits_s B_{j,s}$ of disjoint sets $B_{j,s}$
with the property $\mu(B_{j,s})<\varepsilon$.
By means of these relations the
product $A_1\times\cdots\times A_k$ can be written in the form
$$
A_1\times\cdots\times A_k=\bigcup_{s_1,\dots,s_k}
B_{1,s_1}\times\cdots\times B_{k,s_k} \tag10.8
$$
with some sets $B_{j,s_j}$ such that $\mu(B_{j,s_j})<\varepsilon$
for all sets in this union. Moreover, we may assume, by refining
the partitions of the sets $A_j$ if necessary, that any
two sets $B_{j,s_j}$ and $B_{j',s'_{j'}}$ in this representation
are either disjoint, or they agree. Take such a representation of
$A_1\times\cdots\times A_k$, and consider the set we obtain by
omitting those products $B_{1,s_1}\times\cdots\times B_{k,s_k}$
from the union at the right-hand side of (10.8) for which
$B_{i,s_i}=B_{j,s_j}$
for some $1\le i<j\le k$.

Let us divide the sets $D_j$ into the union of small disjoint sets
$D_j^{(m)}$, $1\le j\le k$, $1\le m\le M$, in such a way that
$\mu(D_j^{(m)})\le\varepsilon$ with some fixed $\varepsilon>0$, and
the sets $B_j$ into the union of small disjoint sets $F_j^{(m)}$,
$1\le j\le l$, with some fixed number $1\le m\le M$, in such a way
that $\mu(F_j^{(m)})\le \varepsilon$ with some fixed
$\varepsilon>0$. Beside this, we
also require that two sets $D_j^{(m)}$ and $F_{j'}^{(m')}$ should
be either disjoint or they should agree. (The sets $D_j^{(m)}$ are
disjoint for different indices, and the same relation holds for the
sets $F_{j'}^{(m')}$.)
Then the identities
$$
k!Z_{\mu,k}(f)
=\prod\limits_{j=1}^k\left(\sum\limits_{m=1}^M\mu_W(D_j^{(m)})\right)
\quad \text{and} \quad l!Z_{\mu,l}(g)
=\prod\limits_{j'=1}^l
\left(\sum\limits_{m'=1}^M\mu_W(F_{j'}^{(m')})\right)
$$
hold, and the product of these two Wiener--It\^o integrals can be
written in the form of a sum by means of a term by term
multiplication. Let us divide the terms of the sum we get in such a
way into classes indexed by the diagrams $\gamma\in\Gamma(k,l)$
in the following way: Each term in this sum is a product of the form
$\prod\limits_{j=1}^k\mu_W(D_j^{(m_j)})
\prod\limits_{j'=1}^l\mu_W(F_{j'}^{(m'_{j'})})$. Let it belong to the
class indexed by the diagram $\gamma$ with edges
$((1,j_1),(2,j'_1))$,\dots, $((1,j_s),(2,j'_s))$ if the elements
in the pairs $(D_{j_1}^{(m_{j_1})},F_{j'_1}^{(m'_{j'_1})})$,\dots,
$(D_{j_s}^{(m_{j_s})},F_{j'_s}^{(m'_{j'_s})})$ agree, and otherwise all
terms are different. Then letting $\varepsilon\to0$ (and taking
partitions of the sets $D_j$ and $F_{j'}$ corresponding to the
parameter $\varepsilon$) the
sums of the terms in each class turn to integrals, and our
calculation suggests the identity
$$
(k!Z_{\mu,k}(f))(l!Z_{\mu,l}(g))
=\sum\limits_{\gamma\in\Gamma(k,l)}\bar Z_\gamma \tag10.13
$$
with
$$
\aligned
\bar Z_\gamma&=\int
f(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k)})
g(x_{(2,1)},\dots,x_{(2,l)}) \\
&\qquad \mu_W(\,dx_{\alpha_\gamma(1,1)})\dots
\mu_W(\,dx_{\alpha_\gamma(1,k)})
\mu_W(\,dx_{(2,1)})\dots\mu_W(\,dx_{(2,l)})
\endaligned \tag10.13a
$$
with the function $\alpha_\gamma(\cdot)$ introduced before formula
(10.9). The indices $\alpha_\gamma(1,j)$ of the arguments in (10.13a) mean
that in the case $\alpha_\gamma(1,j)=(2,j')$ the argument
$x_{(1,j)}$ has to be replaced by $x_{(2,j')}$. In particular,
$\mu_W(\,dx_{\alpha_\gamma(1,j)})\mu_W(\,dx_{(2,j')})
=\mu_W(\,dx_{(2,j')})^2=\mu(\,dx_{(2,j')})$ in this case because
of the `identity' $(\mu_W(\,dx))^2=\mu(\,dx)$. Hence the above
informal calculation yields the identity
$\bar Z_\gamma=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma)$. Hence
relations~(10.13) and~(10.13a) imply formula~(10.12).
A similar heuristic argument can be applied to get formulas for the
product of integrals of normalized empirical distributions or
(normalized) Poisson fields. Only the starting formula
$(\mu_W(\,dx))^2=\mu(\,dx)$ changes in these cases; some additional
terms appear which modify the final result. I return to this
question in the next section.
\medskip
It is not difficult to generalize Theorem~10.2A with the help of
some additional notations to a diagram formula about the product of
finitely many Wiener--It\^o integrals.
Let us consider $m\ge2$ Wiener--It\^o integrals $k_p!Z_{\mu,k_p}(f_p)$,
of functions $f_p(x_1,\dots,x_{k_p})\in{\Cal H}_{\mu,k_p}$, of order
$k_p\ge1$, $1\le p\le m$, and define a class of diagrams
$\Gamma=\Gamma(k_1,\dots,k_m)$ in the following way.
The diagrams $\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ have
vertices of the form $(p,r)$, $1\le p\le m$, $1\le r\le k_p$. The
set of vertices $\{(p,r)\colon\; 1\le r\le k_p\}$ with a fixed number
$p$ will be called the $p$-th row of the diagram $\gamma$. A diagram
$\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ may have some edges. All
edges of a diagram connect vertices from different rows, and from
each vertex there starts at most one edge. All diagrams satisfying
these properties belong to $\Gamma(k_1,\dots,k_m)$. If a diagram
$\gamma$ contains an edge of the form $((p_1,r_1),(p_2,r_2))$ with
$p_1<p_2$.

Given a coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ and a
number $2\le p\le m$, let us introduce two classes of chains of the
diagram~$\gamma$:
${\Cal B}_1(\gamma,p)$ consists of those chains $\beta\in\gamma$
which have colour~$1$, all their vertices are in the first~$p$
rows of the diagram, and contain a vertex in the~$p$-th row, while
${\Cal B}_2(\gamma,p)$ consists of those chains $\beta\in\gamma$
which have either colour~$-1$, and all their vertices are in the
first~$p$ rows of the diagram, or they have (with an arbitrary
colour) a vertex both in the first~$p$ rows and in the remaining
rows of the diagram. Put $B_1(\gamma,p)=|{\Cal B}_1(\gamma,p)|$ and
$B_2(\gamma,p)=|{\Cal B}_2(\gamma,p)|$.
With the help of these numbers we define
$$
J_n(\gamma,p)=\left\{
\aligned
\prod_{j=1}^{B_1(\gamma,p)}
&\left(\frac{n-B_1(\gamma,p)-B_2(\gamma,p)+j}n\right)
\quad\text{if } B_1(\gamma,p)\ge1\\
& \!\!\!\!\!\!\!\!\! 1\quad \text{if } B_1(\gamma,p)=0
\endaligned \right. \tag11.14
$$
for all $2\le p\le m$ and diagrams $\gamma\in\Gamma(k_1,\dots,k_m)$.
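Formula (11.14) depends on the diagram $\gamma$ and the row index $p$ only through the two counts $B_1(\gamma,p)$ and $B_2(\gamma,p)$, so it can be computed directly from them. A small Python sketch (an illustration of the formula, with illustrative parameter values):

```python
from math import prod

def J_n(n, B1, B2):
    """The coefficient J_n(gamma, p) of (11.14), expressed through
    B1 = B_1(gamma, p) and B2 = B_2(gamma, p)."""
    if B1 == 0:
        return 1.0
    return prod((n - B1 - B2 + j) / n for j in range(1, B1 + 1))

# 0 <= J_n(gamma, p) <= 1 whenever B1 + B2 <= n, and
# J_n(gamma, p) -> 1 as n -> infinity for fixed B1, B2.
assert J_n(10, 3, 2) == (6 / 10) * (7 / 10) * (8 / 10)
assert abs(J_n(10**6, 3, 2) - 1.0) < 1e-4
```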
Theorem 11.2 will be formulated with the help of the above notations.
\medskip\noindent
{\bf Theorem 11.2. (The diagram formula for the product of several
degenerate $U$-statistics).} {\it Let a sequence of independent and
identically distributed random variables $\xi_1,\xi_2,\dots$ be
given with some distribution $\mu$ on a measurable space
$(X,{\Cal X})$ together with $m\ge2$ bounded functions
$f_p(x_1,\dots,x_{k_p})$ on the spaces $(X^{k_p},{\Cal X}^{k_p})$,
$1\le p\le m$, canonical with respect to the probability
measure~$\mu$. Let us consider the class of coloured diagrams
$\Gamma(k_1,\dots,k_m)$ together with the functions
$F_\gamma=F_{\gamma}(f_1,\dots,f_m)$,
$\gamma\in\Gamma(k_1,\dots,k_m)$, defined in formulas
(11.9)--(11.12) and the constants $W(\gamma)$ and $J_n(\gamma,p)$,
$1\le p\le m$, given in formulas~(11.13) and~(11.14).
The functions $F_\gamma(f_1,\dots,f_m)$ are canonical with respect
to the measure $\mu$ with $|O(\gamma)|$ variables, and the product
of the degenerate $U$-statistics $I_{n,k_p}(f_p)$, $1\le p\le m$,
$n\ge \max\limits_{1\le p\le m} k_p$, defined in~(8.7) can be
expressed as
$$
\aligned
\prod_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)
&={\sum_{\gamma\in\Gamma(k_1,\dots,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\prime(n,\,m)}
\left(\prod_{p=2}^m J_n(\gamma,p)\right) n^{-W(\gamma)/2} \\
&\qquad\qquad n^{-|O(\gamma)|/2} |O(\gamma)|!
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m)),
\endaligned\tag11.15
$$
where $\sum^{\prime(n,\,m)}$ means that summation is taken
for those $\gamma\in\Gamma(k_1,\dots,k_m)$ which satisfy the
relation $B_1(\gamma,p)+B_2(\gamma,p)\le n$ for all $2\le p\le m$
with the quantities $B_1(\gamma,p)$ and $B_2(\gamma,p)$
introduced before the definition of $J_n(\gamma,p)$ in~(11.14), and
the expression $W(\gamma)$ was defined in~(11.13).
The terms $I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))$ at the
right-hand side of formula (11.15) can be replaced by
$I_{n,|O(\gamma)|}(\text{\rm Sym}\,F_\gamma(f_1,\dots,f_m))$.}
\medskip
In Theorem 11.2 the product of such degenerate $U$-statistics was
considered whose kernel functions are bounded. This also implies
that all functions $F_\gamma$ appearing at the right-hand side of
(11.15) are well-defined (i.e. the integrals appearing in their
definition are convergent) and bounded. In the applications of
Theorem~11.2 it is useful to have more information about the
behaviour of the functions $F_\gamma$. We shall need some good
bound on their $L_2$-norm. Such a result is formulated in
the following
\medskip\noindent
{\bf Lemma 11.3. (Estimate about the $L_2$-norm of the kernel
functions of the $U$-statistics appearing in the diagram formula).}
{\it Let $m$ functions $f_p(x_1,\dots,x_{k_p})$
be given on the products $(X^{k_p},{\Cal X}^{k_p})$ of some measurable
space $(X,{\Cal X})$, $1\le p\le m$, with a probability measure $\mu$
on it, which satisfy inequalities (8.1) and (8.2) (if the index
$k$ is replaced by the index $k_p$ in them), but these
functions need not be canonical. Let us take a coloured diagram
$\gamma\in\Gamma(k_1,\dots,k_m)$, and consider the function
$F_\gamma(f_1,\dots,f_m)$ defined by formulas
(11.9)--(11.12). The $L_2$-norm of the function
$F_\gamma(f_1,\dots,f_m)$ (with respect to the appropriate power
of the measure~$\mu$ on the space where $F_\gamma(f_1,\dots,f_m)$ is
defined) satisfies the inequality
$$
\|F_\gamma(f_1,\dots,f_m)\|_2
\le2^{W(\gamma)}\prod_{p\in U(\gamma)} \|f_p\|_2,
$$
where $W(\gamma)$ is given in~(11.13), and the set
$U(\gamma)\subset\{1,\dots,m\}$ is defined in the following way.
Let us define for a coloured chain
$\beta=\{(l_1,r_1),(l_2,r_2),\dots,(l_s,r_s)\}\in\gamma$ with
$1\le l_1<\cdots<l_s\le m$ its first row $u(\beta)=l_1$, and let
$U(\gamma)$ denote the set of those indices $p\in\{1,\dots,m\}$
which appear as the first row $u(\beta)$ of some chain
$\beta\in\gamma$.}

\medskip
Let us remark that new types of diagrams appear in the case $m>2$.
These are those diagrams $\gamma\in\bar\Gamma(k_1,\dots,k_m)$
for which $W(\gamma)>1$. The estimate~(11.18) given for the terms
$F_\gamma$ corresponding to such diagrams is
weaker than the estimate given for the terms $F_\gamma$ with
$W(\gamma)=0$, since $|U(\gamma)|$ may be smaller than~$m$ for such
diagrams.

Observe that
$$
I_{n,s(\bar\gamma)}\left((\overline{f_1\circ f_2})_{\bar\gamma}\right)
=0 \quad\text{if } s(\bar\gamma)>n \tag12.4
$$
for all $\bar\gamma\in\bar\Gamma(k_1,k_2)$, since in this case the
sum defining this $U$-statistic is empty.
Since $I_{n,|O(\gamma)|}\left(F_\gamma(f_1,f_2)\right)
=I_{n,|O(\gamma)|}\left((f_1\circ f_2)_\gamma\right)$,
relation~(12.3) follows from relation~(12.2) just as
formula~(9.3) follows from formula~(9.2) in the proof of the
Hoeffding decomposition. Let us understand why the coefficient
$n^{|C(\gamma)|}J_n(\gamma)$ appears at the right-hand
side of~(12.3).
This coefficient can be calculated in the following way. Take a
general term
$(f_1\circ f_2)_\gamma(\xi_{j_{l_u}},\,l_u\in O(\gamma))$ in the
$U$-statistic $|O(\gamma)|!I_{n,|O(\gamma)|}((f_1\circ f_2)_\gamma)$,
and calculate the number of terms
$(\overline{f_1\circ f_2})_{\bar\gamma}
(\xi_{j'_1},\xi_{j'_2},\dots,\xi_{j'_{s(\bar\gamma)}})$ in the
$U$-statistic $s(\bar\gamma)!\, I_{n,s(\bar\gamma)}
((\overline{f_1\circ f_2})_{\bar\gamma})$ for which the sequence
of indices $(j'_1,\dots,j'_{s(\bar\gamma)})$ satisfies the relation
$j'_{l_u}=j_{l_u}$ for all $l_u\in O(\gamma)$. I claim that it
equals $n^{|C(\gamma)|}J_n(\gamma)$. It can be seen that
this number $n^{|C(\gamma)|}J_n(\gamma)$ appears as the coefficient
at the right-hand side of~(12.3).
Indeed, we have to calculate the number of such sequences
$j'_1,j'_2,\dots,j'_{s(\bar\gamma)}$ for which the value
$j'_{l_u}=j_{l_u}$ is prescribed for the indices
$l_u\in O(\gamma)$, and the other elements of the
sequence can take arbitrary integer values between~1 and~$n$
with the only restriction that all elements of the sequence
$j'_1,j'_2,\dots,j'_{s(\bar\gamma)}$ must be different.
The number of such sequences equals
$(n-|O(\gamma)|)(n-|O(\gamma)|-1)\cdots(n-|C(\gamma)|-|O(\gamma)|+1)
=J_n(\gamma)n^{|C(\gamma)|}$. (In this calculation we exploited the
fact that $|O(\gamma)|+|C(\gamma)|=s(\gamma)$.)
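The counting step above can be checked by brute force for small parameters. The Python sketch below (positions and values chosen for illustration only) enumerates all sequences of distinct values in $\{1,\dots,n\}$ with prescribed entries at the positions corresponding to $O(\gamma)$, and compares their number with the falling factorial $(n-|O(\gamma)|)(n-|O(\gamma)|-1)\cdots(n-|C(\gamma)|-|O(\gamma)|+1)$:

```python
from itertools import product

n, size_O, size_C = 6, 2, 3          # s = |O| + |C| = 5 positions in total
fixed = {0: 1, 1: 2}                 # positions in O(gamma) carry fixed values

# Brute-force count of the admissible index sequences.
count = 0
for seq in product(range(1, n + 1), repeat=size_O + size_C):
    if len(set(seq)) == len(seq) and all(seq[p] == v for p, v in fixed.items()):
        count += 1

# The falling factorial (n-|O|)(n-|O|-1)...(n-|C|-|O|+1).
falling = 1
for j in range(size_C):
    falling *= n - size_O - j
assert count == falling              # here: 4 * 3 * 2 = 24
```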
Let us observe that $k_1+k_2-2|C(\gamma)|=|O(\gamma)|+W(\gamma)$
with the number $W(\gamma)$ introduced in the formulation of
Theorem~11.1. Hence
$$
n^{-(k_1+k_2)/2}n^{|C(\gamma)|}=n^{-W(\gamma)/2}n^{-|O(\gamma)|/2}.
$$
Let us replace the left-hand side of the last identity by its
right-hand side in~(12.3), and let us sum up the identity
we get in such a way for all $\bar\gamma\in\bar\Gamma(k_1,k_2)$
such that $s(\bar\gamma)\le n$. The identity we get in
such a way together with formulas~(12.1) and~(12.4) imply the
identity~(11.6). Clearly, $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))=
I_{n,|O(\gamma)|}(\text{\rm Sym}\,F_\gamma(f_1,f_2))$, hence the
term $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))$ can be replaced by
$I_{n,|O(\gamma)|}(\text{\rm Sym}\,F_\gamma(f_1,f_2))$ in
formula~(11.6). We still have to prove inequalities~(11.7)
and~(11.8).
\medskip
Inequality (11.7), the estimate of the $L_2$-norm of the function
$(f_1\circ f_2)_\gamma$, follows from the Schwarz inequality, and
actually it agrees with inequality (10.11), proved at the start
of Appendix~B. Hence its proof is omitted here.
To prove inequality (11.8) let us introduce, similarly to
formula (11.2), the operators
$$
\tilde Q_{u_j}h(x_{u_1},\dots,x_{u_r})=h(x_{u_1},\dots,x_{u_r})+
\int h(x_{u_1},\dots,x_{u_r})\mu(\,dx_{u_j}), \quad 1\le j\le r,
\tag 12.5
$$
in the space of functions $h(x_{u_1},\dots,x_{u_r})$ whose arguments
take their values in the space $(X,{\Cal X})$. (The indices
$u_1,\dots,u_r$ are all
different.) Observe that both the operators $\tilde Q_{u_j}$ and the
operators $P_{u_j}$ defined in (11.1) are positive, i.e. these
operators map a non-negative function to a non-negative function.
Beside this, $Q_{u_j}\le\tilde Q_{u_j}$, and the norms of the
operators $\frac{\tilde Q_{u_j}}2$ and $P_{u_j}$ are bounded by 1
both in the $L_1(\mu)$, the $L_2(\mu)$ and the supremum norm.
Let us define the function
$$
(\widetilde{f_1\circ f_2})_\gamma(x_j,\;j\in O(\gamma))
=\left(\prod_{j\in C(\gamma)}P_j
\prod_{j\in O_2(\gamma) } \tilde Q_j\right)
\overline{(f_1\circ f_2)}_\gamma(x_j,\;j\in C(\gamma)\cup O(\gamma))
\tag12.6
$$
with the notation of Section~11. The function
$(\widetilde{f_1\circ f_2})_\gamma$ was defined with the help of
$\overline{(f_1\circ f_2)}_\gamma$ similarly to
$(f_1\circ f_2)_\gamma$ defined in~(11.4), only the operators $Q_j$
were replaced by $\tilde Q_j$ in its definition.
In the proof of (11.8) it may be assumed that $\|f_1\|_2\le\|f_2\|_2$.
The properties of the operators $P_{u_j}$ and $\tilde Q_{u_j}$
listed above together with the condition
$\sup|f_2(x_1,\dots,x_k)|\le1$ imply that
$$
|(f_1\circ f_2)_\gamma|\le (\widetilde{|f_1|\circ |f_2|})_\gamma\le
(\widetilde{|f_1|\circ 1})_\gamma, \tag12.7
$$
where `$\le$' means that the function at the right-hand side is
greater than or equal to the function at the left-hand side at all
points, and the term~1 in~(12.7) denotes the function which equals
identically~1. Because of the identity
$\|F_\gamma(f_1,f_2)\|_2=\|(f_1\circ f_2)_\gamma\|_2$ and
relation~(12.7) it is enough to show that
$$
\aligned
\|(\widetilde{|f_1|\circ1})_\gamma\|_2&=
\left\|\left(\prod_{j\in C(\gamma)} P_j
\prod_{j\in O_2(\gamma)} \tilde Q_j\right)
|f_1(x_{\alpha_\gamma(1,1)},
\dots,x_{\alpha_\gamma(1,k_1)})|\right\|_2 \\
&\le 2^{W(\gamma)}\|f_1\|_2
\endaligned \tag12.8
$$
to prove relation~(11.8). But this inequality trivially holds, since
the norm of all operators $P_j$ in formula (12.8) is bounded
by~1, the norm of all operators $\tilde Q_j$ is bounded
by~2 in the $L_2(\mu)$ norm, and $|O_2(\gamma)|=W(\gamma)$.
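The norm bounds used in this last step can be illustrated on a one-variable discrete model. The sketch below takes a discrete probability measure $\mu$ and checks $\|P h\|_2\le\|h\|_2$ and $\|\tilde Q h\|_2\le2\|h\|_2$ for the assumed one-variable forms $Ph=\int h\,d\mu$ and $\tilde Qh=h+\int h\,d\mu$ of the operators:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
mu = rng.random(N); mu /= mu.sum()         # a discrete probability measure
h = rng.standard_normal(N)                 # a function of one variable

P_h = np.full(N, (h * mu).sum())           # P h = integral of h (a constant)
Q_tilde_h = h + (h * mu).sum()             # tilde Q h = h + integral of h

l2 = lambda g: np.sqrt((g**2 * mu).sum())  # the L_2(mu) norm
assert l2(P_h) <= l2(h) + 1e-12            # ||P|| <= 1  (Schwarz inequality)
assert l2(Q_tilde_h) <= 2 * l2(h) + 1e-12  # ||tilde Q|| <= 2
```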
\medskip\noindent
{\it Proof of Theorem 11.2.} Theorem~11.2 will be proved with the
help of Theorem~11.1 by induction with respect to the number of
degenerate $U$-statistics $k_p!I_{n,k_p}(f_p)$, $1\le p\le m$.
Formula~(11.15) holds for $m=2$ by Theorem~11.1. To prove it for
a general parameter~$m$ let us first fix a coloured diagram
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ and consider the set of
diagrams of~$m$ rows which are its `continuation', i.e. let
$$
\Gamma(\bar\gamma)=\{\gamma\colon\; \gamma\in\Gamma(k_1,\dots,k_m),\,
\gamma_{pr}=\bar\gamma\}.
$$
(Here we work with the diagrams $\gamma_{pr}$ and $\gamma_{cl}$
introduced for a diagram~$\gamma\in\Gamma(k_1,\dots,k_m)$ in the
previous section.) I claim that
$$
\aligned
&n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!I_{n,|O(\bar\gamma)|}
(F_{\bar\gamma}(f_1,\dots,f_{m-1}))
\cdot n^{-k_m/2}k_m!I_{n,k_m}(f_m)\\
&\qquad=
{\sum_{\gamma\in\Gamma(\bar\gamma)}}^{\!\!\!\!\prime(n)}
\prod_{j=1}^{|C(\gamma_{cl})|}\left(\frac{n-s(\gamma_{cl})+j}n\right)
n^{-W(\gamma_{cl})/2} \\
&\qquad\qquad\qquad n^{-|O(\gamma)|/2}|O(\gamma)|!
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m)),
\endaligned\tag12.9
$$
where
${\sum\limits_{\gamma\in\Gamma(\bar\gamma)}}^{\!\!\!\!\!\prime(n)}$
means that summation is taken for such
$\gamma\in\Gamma(\bar\gamma)$ for which $s(\gamma_{cl})\le n$, and
$\prod_{j=1}^{|C(\gamma_{cl})|}$ equals~1 if $|C(\gamma_{cl})|=0$.
Relation~(12.9) can be checked by applying Theorem~11.1 for
the pair of $U$-statistics with kernel functions
$F_{\bar\gamma}(f_1,\dots,f_{m-1})$ and~$f_m$. To get it first we
show that there is a one-to-one correspondence between the coloured
diagrams $\gamma'\in\Gamma(|O(\bar\gamma)|,k_m)$ and the class of
diagrams $\{\gamma_{cl}\colon\,\gamma\in\Gamma(\bar\gamma)\}$
in such a way that two diagrams $\gamma\in\Gamma(\bar\gamma)$ and
$\gamma'\in\Gamma(|O(\bar\gamma)|,k_m)$ correspond to each other if
and only if $\gamma'=\gamma_{cl}$. We shall fix an enumeration
of the chains of the diagram~$\bar\gamma$, and we shall take
such an enumeration of the chains in all diagrams
$\gamma\in\Gamma(\bar\gamma)$ for which the enumeration of
the chains of $\bar\gamma$ and $\gamma_{pr}$ agree. The
correspondence between the above mentioned two classes of
diagrams depends on the enumeration of the chains
of~$\bar\gamma$, but this will cause no problem. To establish
the correspondence observe that for each
$\gamma\in\Gamma(\bar\gamma)$ there is
a diagram $\gamma'=\gamma_{cl}\in\Gamma(|O(\bar\gamma)|,k_m)$.
On the other hand, I claim that for all diagrams
$\gamma'\in \Gamma(|O(\bar\gamma)|,k_m)$ such a diagram
$\gamma(\gamma')\in\Gamma(\bar\gamma)$ can be found for which
$\gamma(\gamma')_{cl}=\gamma'$.
This diagram $\gamma(\gamma')\in\Gamma(\bar\gamma)$ will be
defined in the following way. Let
$\bar l_1,\bar l_2,\dots,\bar l_{|O(\bar\gamma)|}$ be the
indices of the chains of the diagram~$\bar\gamma$ with colour~$-1$.
The diagram $\gamma(\gamma')$ will be defined so that the chains
of colour~$1$ of $\bar\gamma$ will be chains of colour~1 of
$\gamma(\gamma')$, too. If the vertex $(1,j)$ of the
diagram~$\gamma'$ is contained in a chain of length~1, then the
diagram~$\gamma(\gamma')$ contains the chain $\beta(\bar l_j)$
with colour~$-1$. If this vertex is contained in a chain
$\{(1,j),(2,r_j)\}\in\gamma'$ of length~2, then $\gamma(\gamma')$
contains the chain $\beta(\bar l_j)\cup\{(m,r_j)\}$ with the
same colour as the chain $\{(1,j),(2,r_j)\}$ has in $\gamma'$.
Finally, if the vertex $(2,r)$ is contained in the chain $\{(2,r)\}$
of length~1 in $\gamma'$, then $\{(m,r)\}$ will be a chain of
length~1 of~$\gamma(\gamma')$ with colour~$-1$. In such a way we
get such a diagram $\gamma(\gamma')\in\Gamma(\bar\gamma)$ for which
$\gamma(\gamma')_{cl}=\gamma'$.
We get relation~(12.9) by applying Theorem~11.1 for the product
$$
n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!I_{n,|O(\bar\gamma)|}
(F_{\bar\gamma}(f_1,\dots,f_{m-1}))\cdot n^{-k_m/2}k_m!I_{n,k_m}(f_m)
$$
and writing all diagrams~$\gamma'\in\Gamma(|O(\bar\gamma)|,k_m)$
in the form $\gamma'=\gamma(\gamma')_{cl}$, where
$\gamma(\gamma')_{cl}$ is the closing diagram of the diagram
$\gamma(\gamma')\in\Gamma(\bar\gamma)$ defined in the previous
paragraph.
Relation~(11.15) for the parameter~$m$ can be proved with the help
of relation~(12.9) and the inductive assumption by which it holds
for~$m-1$. Indeed, let us multiply formula~(12.9) by
$\prod\limits_{p=2}^{m-1}J_n(\bar\gamma,p)n^{-W(\bar\gamma)/2}$,
and sum up this identity for all such diagrams
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ for which
$B_1(\bar\gamma,p)+B_2(\bar\gamma,p)\le n$ for all $2\le p\le m-1$.
Then the sum of the terms at the left-hand side equals the
left-hand side of formula~(11.15) for parameter~$m$.
I claim that the sum of the terms at the right-hand side equals
the right-hand side of formula~(11.15) for parameter~$m$. To see
this it is enough to check that for all
$\gamma\in\Gamma(\bar\gamma)$ we have $W(\bar\gamma)+W(\gamma_{cl})
=W(\gamma_{pr})+W(\gamma_{cl})=W(\gamma)$,
$\prod\limits_{p=2}^{m-1} J_n(\gamma_{pr},p)
\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)
=\prod\limits_{p=2}^m J_n(\gamma,p)$, where
$\prod\limits_{j=1}^{|C(\gamma_{cl})|}=1$ if $|C(\gamma_{cl})|=0$,
and the relation $B_1(\gamma,p)+B_2(\gamma,p)\le n$ holds for all
$2\le p\le m$ if and only if
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$
for all $2\le p\le m-1$, and $s(\gamma_{cl})\le n$. But these
relations can be simply checked. The identity about the
function~$W(\cdot)$ can be checked by taking into account the
definition of the diagrams~$\gamma_{pr}$ and~$\gamma_{cl}$, in
particular the colouring of the chains in these diagrams. The
remaining relations can be proved with the help of the
observation that for a diagram $\gamma\in\Gamma(k_1,\dots,k_m)$
$B_1(\gamma_{pr},p)=B_1(\gamma,p)$ and
$B_2(\gamma_{pr},p)=B_2(\gamma,p)$ for all $2\le p\le m-1$.
Beside this $|C(\gamma_{cl})|=B_1(\gamma,m)$ and
$|O(\gamma_{cl})|=B_2(\gamma,m)$. Theorem~11.2 is proved.
\medskip\noindent
{\it Proof of Lemma 11.3.} The proof is similar to that of
formula (11.8) at the end of the proof of Theorem~11.1. Let us
define the
functions $\tilde F_{\gamma}(f_1,\dots,f_p)$,
$\gamma\in\Gamma(k_1,\dots,k_p)$, recursively for all
$2\le p\le m$ similarly to the definition of the functions
$F_\gamma(f_1,\dots,f_p)$ with the difference that
the operator $Q_{u_j}=I-P_{u_j}$ is replaced by
$\tilde Q_{u_j}=I+P_{u_j}$ in the new definition. Then we have
$|F_\gamma(f_1,\dots,f_m)|\le\tilde F_\gamma(|f_1|,\dots,|f_m|)$
at all points. Hence $\|F_\gamma(f_1,\dots,f_m)\|_2\le
\|\tilde F_\gamma(|f_1|,\dots,|f_m|)\|_2$, and to prove Lemma~11.3
it is enough to show that
$$
\|\tilde F_\gamma(|f_1|,\dots,|f_m|)\|_2\le 2^{W(\gamma)}
\prod_{p\in U(\gamma)}\|f_p\|_2 \quad\text{if }
\gamma\in\Gamma(k_1,\dots,k_m) \tag12.10
$$
with the same number $W(\gamma)$ and set $U(\gamma)$ which
were considered in Lemma~11.3. Relation~(12.10) will be proved by
induction with respect to~$m$.
Relation~(12.10) holds for $m=2$. Indeed, if $W(\gamma)=0$, then
$U(\gamma)=\{1,2\}$, we have $\tilde F_\gamma=F_\gamma$, and
formula~(11.7) supplies the estimate. If $W(\gamma)\ge1$, then
$U(\gamma)=\{1\}$, and this relation was actually established
in the proof of relation~(11.8).
In the case $m>2$ this inequality will be proved by induction with
the help of the identity (with the notation of formula~(11.3))
$$
\aligned
&\|\tilde F_\gamma(|f_1|,\dots,|f_m|)\|_2
=\biggl\|\left(\prod_{p\in C(\gamma_{cl})}
P_{p}\prod_{p\in O_2(\gamma_{cl})} \tilde Q_p\right)\\
&\qquad\qquad\qquad \overline{(\tilde F_{\gamma_{pr}}(|f_1|,\dots,
|f_{m-1}|)\circ |f_m|)}_{\gamma_{cl}}
(x_p,\;p\in O(\gamma_{cl})\cup C(\gamma_{cl})) \biggr\|_2.
\endaligned \tag12.11
$$
In the case $W(\gamma_{cl})=0$, i.e.\ if $\gamma_{cl}$
contains no open chain of length~2, we have
$U(\gamma)=U(\gamma_{pr})\cup\{m\}$, $W(\gamma)=W(\gamma_{pr})$,
and formula~(12.11) contains no operator~$\tilde Q_p$. In this
case inequality~(12.10) follows from the representation of
$\|\tilde F_\gamma(|f_1|,\dots,|f_m|)\|_2$ given in~(12.11),
relation~(11.7) and from the inductive hypothesis by which
inequality~(12.10) holds for
$\|\tilde F_{\gamma_{pr}}(|f_1|,\dots,|f_{m-1}|)\|_2$.
In the case $W(\gamma_{cl})>0$ we have $U(\gamma)=U(\gamma_{pr})$,
$W(\gamma)=W(\gamma_{pr})+W(\gamma_{cl})$, and inequality~(12.10)
can be proved similarly to the case $W(\gamma_{cl})=0$ with the
only difference that in this case instead of~(11.7) we have to apply
that strengthened version of~(11.8) which is contained in
formula~(12.10) in the special case $m=2$. Lemma~11.3 is proved.
\medskip
The corollary of Theorem 11.2 is a simple consequence of
Theorem~11.2 and Lem\-ma~11.3.
\medskip\noindent
{\it Proof of the corollary of Theorem 11.2.}\/ Observe that
$F_\gamma$ is a function of $|O(\gamma)|$ arguments. Hence a
coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ is in the
class of closed diagrams, i.e.
$\gamma\in\bar\Gamma(k_1,\dots,k_m)$ if and only if
$F_\gamma(f_1,\dots,f_m)$ is a constant. Thus formula~(11.16)
is a simple consequence of relation (11.15) and the observation
that $EI_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))=0$ if
$|O(\gamma)|\ge1$, i.e. if
$\gamma\notin\bar\Gamma(k_1,\dots,k_m)$, and
$$
\align
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))&=
I_{n,0}(F_\gamma(f_1,\dots,f_m))=F_\gamma(f_1,\dots,f_m) \\
&\qquad\qquad \text{ if } \gamma\in\bar\Gamma(k_1,\dots,k_m).
\endalign
$$
Relations~(11.17) and~(11.18) follow from relation~(11.16) and
Lemma~11.3.
\beginsection 13. The proof of Theorems 8.3, 8.5 and Example 8.7.
This section contains the proof of the estimates on the distribution
of a multiple Wiener--It\^o integral or degenerate $U$-statistic
formulated in Theorems~8.5 and~8.3 together with the proof of
Example~8.7. Beside this, also a multivariate version of Hoeffding's
inequality (Theorem~3.4) will be proved here. The latter result
is useful in the estimation of the supremum of degenerate
$U$-statistics. The estimate on the distribution of a multiple
random integral with respect to a normalized empirical
distribution given in Theorem~8.1 is omitted, because, as it was
shown in Section~9, this result follows from the estimate of
Theorem~8.3 on degenerate $U$-statistics. This section ends
with a separate part, Section~13~B, where the results proved
here are discussed together with the method of their proofs and
some recent related results.
The proof of Theorems~8.5 and~8.3 is based on a good estimate on
high moments of Wiener--It\^o integrals and degenerate
$U$-statistics. These estimates follow from the corollaries of
Theorems~10.2 and~11.2. Such an approach slightly differs from the
classical proof in the one-variate case. The natural one-variate
version of the problems discussed here is an estimate about the
tail distribution of a sum of independent random variables. This
estimate is generally proved with the help of a good bound on
the moment generating function of the sum. Such a method may
not work in the multivariate case, because, as later
calculations will show, there is no good estimate on the
moment-generating function of $U$-statistics or multiple
Wiener--It\^o integrals of order $k\ge3$. Actually, the
moment-generating function of a Wiener--It\^o integral of order
$k\ge3$ is always divergent, because the tail behaviour of such
a random integral is similar to that of the $k$-th power of a
Gaussian random variable. On the other hand, good bounds on the
moments $EZ^{2M}$ of a random variable~$Z$ for all positive
integers~$M$ (or at least for a sufficiently rich class of
parameters~$M$) together with the application of the Markov
inequality for $Z^{2M}$ and an appropriate choice of the
parameter~$M$ yield a good estimate on the distribution of~$Z$.
Propositions~13.1 and~13.2 give estimates on the moments of
Wiener--It\^o integrals and degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.1. (Estimate of the moments of Wiener--It\^o
integrals).} {\it Let $f(x_1,\dots,x_k)$ be a function of $k$
variables on some measurable space $(X,{\Cal X})$ that satisfies
formula (8.12) with some $\sigma$-finite measure $\mu$. Take
the $k$-fold Wiener--It\^o integral $Z_{\mu,k}(f)$ of this function
with respect to a white noise $\mu_W$ with reference measure~$\mu$.
The inequality
$$
E\left(k!|Z_{\mu,k}(f)|\right)^{2M}\le 1\cdot3\cdot5\cdots
(2kM-1)\sigma^{2M}\quad\text {for all }M=1,2,\dots \tag13.1
$$
holds.}
\medskip
By Stirling's formula Proposition~13.1 implies that
$$
E(k!|Z_{\mu,k}(f)|)^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!}\sigma^{2M}
\le A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M} \tag13.2
$$
for any $A>\sqrt2$ if $M\ge M_0=M_0(A)$. Formula (13.2) can be
considered as a simpler, more easily applicable version of
Proposition~13.1. It can be compared more directly with the moment
estimate on degenerate $U$-statistics given in~(13.3).
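Both the identity $1\cdot3\cdots(2kM-1)=\frac{(2kM)!}{2^{kM}(kM)!}$ and the Stirling-type bound in (13.2) are easy to check numerically for moderate parameters. In the Python sketch below $m$ stands for $kM$, and $A=1.5$ is one admissible constant exceeding $\sqrt2$; its admissibility already for small $m$ is a numerical observation, not a claim of the text:

```python
import math

def odd_double_factorial(m):
    # 1*3*5*...*(2m-1)
    result = 1
    for i in range(1, 2 * m, 2):
        result *= i
    return result

A = 1.5  # any A > sqrt(2) works for large m; 1.5 happens to work on this range
for m in range(1, 13):
    dfac = odd_double_factorial(m)
    # the exact identity used in (13.2)
    assert dfac == math.factorial(2 * m) // (2 ** m * math.factorial(m))
    # the Stirling-type bound of (13.2), with m in the role of kM
    assert dfac <= A * (2 / math.e) ** m * m ** m
```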
Proposition~13.2 provides a similar, but weaker inequality for the
moments of normalized degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.2. (Estimate on the moments of degenerate
$U$-statistics).} {\it Let us consider a degenerate $U$-statistic
$I_{n,k}(f)$ of order $k$ with sample size $n$ and with a kernel
function $f$ satisfying relations (8.1) and (8.2) with some
$0<\sigma^2\le1$. Fix a positive number $\eta>0$. There exist some
universal constants $A=A(k)>\sqrt2$, $C=C(k)>0$ and $M_0=M_0(k)\ge1$
depending only on the order of the $U$-statistic $I_{n,k}(f)$ such
that
$$
\aligned
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}&
\le A\left(1+C\sqrt\eta\right)^{2kM}
\left(\frac2e\right)^{kM}\left(kM\right)^{kM}\sigma^{2M} \\
&\qquad \text{for all integers } M \text{ such that }
kM_0\le kM\le \eta n\sigma^2.
\endaligned \tag13.3
$$
In formula (13.3) such a constant $C=C(k)$ can be chosen which does
not depend on the order $k$ of the $U$-statistic $I_{n,k}(f)$.
For instance $C=4$ is an appropriate choice.}
\medskip
Proposition~13.2 yields a good estimate on
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ with a fixed
exponent~$2M$ with
the choice $\eta=\frac{kM}{n\sigma^2}$. With such a choice of the
number $\eta$ formula~(13.3) yields an estimate on the moments
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ comparable with the
estimate on the corresponding Wiener--It\^o integral if
$M\le n\sigma^2$, while
it yields a much weaker estimate if $M\gg n\sigma^2$.
Now I turn to the proof of these propositions.
\medskip\noindent
{\it Proof of Proposition 13.1.}\/ Proposition 13.1 can be simply
proved by means of the Corollary of Theorem 10.2 with the choice
$m=2M$, and $f_p=f$ for all $1\le p\le 2M$. Formulas~(10.18)
and~(10.19) yield that
$$
E\left(k!Z_{\mu,k}(f)\right)^{2M}\le\left(\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\right)^M
|\Gamma_{2M}(k)|\le |\Gamma_{2M}(k)|\sigma^{2M},
$$
where $|\Gamma_{2M}(k)|$ denotes the number of closed diagrams
$\gamma$ in the class
$\bar\Gamma(\underbrace{k,\dots,k}_{2M\text{ times}})$
introduced in the corollary of Theorem~10.2. Thus to complete the
proof of Proposition~13.1 it is enough to show that
$|\Gamma_{2M}(k)|\le 1\cdot3\cdot5\cdots(2kM-1)$. But this can
easily be seen with the help of the following observation. Let
$\bar\Gamma_{2M}(k)$ denote the class of all graphs with vertices
$(l,j)$, $1\le l\le 2M$, $1\le j\le k$, such that from all vertices
$(l,j)$ exactly one edge starts, all edges connect different
vertices, but edges connecting vertices $(l,j)$ and $(l,j')$ with
the same first coordinate~$l$ are also allowed. Let
$|\bar\Gamma_{2M}(k)|$ denote the number of graphs in
$\bar\Gamma_{2M}(k)$. Then clearly
$|\Gamma_{2M}(k)|\le|\bar\Gamma_{2M}(k)|$. On the other hand,
$|\bar\Gamma_{2M}(k)|=1\cdot3\cdot5\cdots(2kM-1)$. Indeed, let us
list the vertices of the graphs from $\bar\Gamma_{2M}(k)$ in an
arbitrary way. Then the first vertex can be paired with another
vertex in $2kM-1$ ways; after this the first vertex from which no
edge starts can be paired with one of the $2kM-3$ vertices from
which no edge starts. Continuing this procedure, the next edge can
be chosen in $2kM-5$ ways, and so on, which yields the desired
formula.
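The counting argument just given, pair the first free vertex in $2kM-1$ ways, the next in $2kM-3$ ways, and so on, is the standard count of perfect pairings of $2kM$ labelled vertices. A small Python sketch (an illustration, not part of the proof) confirms it by brute force, with $m$ in the role of $kM$:

```python
def count_pairings(vertices):
    # Count the partitions of `vertices` into unordered pairs by the
    # same argument as above: pair the first vertex with any of the
    # remaining ones, then recurse on the rest.
    if not vertices:
        return 1
    rest = vertices[1:]
    return sum(count_pairings(rest[:i] + rest[i + 1:])
               for i in range(len(rest)))

def odd_double_factorial(m):
    # 1*3*5*...*(2m-1)
    result = 1
    for i in range(1, 2 * m, 2):
        result *= i
    return result

# the number of perfect pairings of 2m labelled vertices is (2m-1)!!
for m in range(1, 6):
    assert count_pairings(tuple(range(2 * m))) == odd_double_factorial(m)
```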
\medskip\noindent
{\it Proof of Proposition 13.2.}\/ Relation (13.3) will be proved by
means of relations (11.17) and (11.18) in the Corollary of
Theorem~11.2 with the choice $m=2M$ and $f_p=f$ for all
$1\le p\le 2M$. Let us take the class of closed coloured diagrams
$\Gamma(k,M)=\bar\Gamma(\underbrace{k,\dots,k}_{2M\text{ times}})$.
This will be partitioned into subclasses
$\Gamma(k,M,r)$, $0\le r\le kM$, where $\Gamma(k,M,r)$ contains
those closed diagrams $\gamma\in\Gamma(k,M)$ for which
$W(\gamma)=2r$. Let us recall that $W(\gamma)$ was defined
in~(11.13), and in the case of closed diagrams
$W(\gamma)=\sum\limits_{\beta\in\gamma}(\ell(\beta)-2)$. For a diagram
$\gamma\in\Gamma(k,M)$, $W(\gamma)$ is an even number, since
$W(\gamma)+2s(\gamma)=2kM$, where $s(\gamma)$ denotes the
number of chains in~$\gamma$.
First we prove an estimate on the cardinality of~$\Gamma(k,M,r)$.
We claim that there exist some constant $A=A(k)>0$ and threshold
index $M_0=M_0(k)$ depending only on the order~$k$ of the
$U$-statistic $I_{n,k}(f)$ for which
$$
|\Gamma(k,M,r)|\le A\binom{2kM}{2r}
\left(\frac2e\right)^{kM}(kM)^{kM+r}2^{2r}
\quad\text{for all }0\le r\le kM \tag13.4
$$
if $A\ge A_0(k)$ and $M\ge M_0(k)$.
To prove formula~(13.4) we define a map
$T\colon\;\gamma\to T(\gamma)$
from the set of diagrams $\gamma\in\Gamma(k,M,r)$ to the set of
paired diagrams in such a way that $T(\gamma)\neq T(\gamma')$ if
$\gamma\neq\gamma'$, and give a good bound on the number of
paired diagrams~$T(\gamma)$, $\gamma\in\Gamma(k,M,r)$, obtained
in such a way. (We shall call a diagram $\gamma$ a paired diagram,
if all of its chains have length~2, i.e. they have the form
$\beta=\{(p,r),(p',r')\}\in\gamma$, with~$p\neq p'$. We shall work
with paired diagrams consisting of $2M$ rows, but we do not fix
the length of the rows.) To define the paired diagrams we shall
work with, we first introduce, for all $\gamma\in\Gamma(k,M,r)$,
the set
${\Cal W}(\gamma)=\bigcup\limits_{\beta\in\gamma}
\{(p_2(\beta),q_2(\beta)),\dots,(p_{s-1}(\beta),q_{s-1}(\beta))\}$,
where
$\beta=\{(p_1(\beta),q_1(\beta)),\dots,(p_s(\beta),q_s(\beta))\}$
with $1\le p_1(\beta)<p_2(\beta)<\dots<p_s(\beta)\le 2M$.
By Stirling's formula
$1\cdot3\cdot5\cdots(2(kM+r)-1)
=\frac{(2(kM+r))!}{2^{kM+r}(kM+r)!}
\le A\left(\frac2e\right)^{kM+r}(kM+r)^{kM+r}$
for any $A>\sqrt2$ if $M\ge M_0$ with some $M_0=M_0(A)$.
Since $r\le kM$ we can write
$(kM+r)^{kM+r}\le (kM)^{kM}\left(1+\frac r{kM}\right)^{kM}
(2kM)^r\le (kM)^{kM+r}e^r2^r$. The above calculation
together with~(13.5) implies inequality~(13.4).
\medskip
For a diagram $\gamma\in\Gamma(k,M,r)$ we have $W(\gamma)=2r$,
and beside this the cardinality of the set~$U(\gamma)$ defined
in the formulation of Lemma~11.3 satisfies the inequality
$|U(\gamma)|\ge 2M-W(\gamma)=2M-2r$. Hence by relation~(11.18)
$n^{-W(\gamma)/2}|F_\gamma|\le 2^{2r} n^{-r}\sigma^{|U(\gamma)|}
\le 2^{2r} \left(n\sigma^2\right)^{-r}\sigma^{2M}
\le\eta^{r}2^{2r}(kM)^{-r}
\sigma^{2M}$ for $\gamma\in\Gamma(k,M,r)$ if
$kM\le \eta n\sigma^2$ and $\sigma^2\le1$.
This estimate together with relation~(11.17) implies that for
$kM\le\eta n\sigma^2$
$$
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
\le\sum_{\gamma\in\Gamma(k,M)}
n^{-W(\gamma)/2}\cdot |F_\gamma|\le\sum_{r=0}^{kM}|\Gamma(k,M,r)|
\eta^{r}2^{2r}(kM)^{-r}\sigma^{2M}.
$$
Hence by formula (13.4)
$$
\align
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}&\le
A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\sum_{r=0}^{kM}\binom{2kM}{2r}
\left(4\sqrt{\eta}\right)^{2r}\\
&\le A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\left(1+4\sqrt{\eta}\right)^{2kM}
\endalign
$$
if $kM_0\le kM\le\eta n\sigma^2$. Thus we have proved
Proposition~13.2 with $C=4$.
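Two elementary steps of the proof, the bound $(kM+r)^{kM+r}\le(kM)^{kM+r}e^r2^r$ and the summation $\sum_r\binom{2kM}{2r}x^{2r}\le(1+x)^{2kM}$ applied with $x=4\sqrt\eta$, can be verified numerically. A Python sketch with illustrative parameter pairs $(k,M)$ and a sample value of $x$ (both chosen for the illustration only):

```python
import math

def check_proof_steps(k, M, x=0.7):
    kM = k * M
    for r in range(0, kM + 1):
        # (kM+r)^(kM+r) <= (kM)^(kM+r) * e^r * 2^r, compared on a log scale
        lhs = (kM + r) * math.log(kM + r)
        rhs = (kM + r) * math.log(kM) + r * (1 + math.log(2))
        assert lhs <= rhs + 1e-9
    # the sum of the even binomial terms is at most the full expansion (1+x)^(2kM)
    even_sum = sum(math.comb(2 * kM, 2 * r) * x ** (2 * r)
                   for r in range(kM + 1))
    assert even_sum <= (1 + x) ** (2 * kM)
    return True

for k, M in ((1, 3), (2, 2), (3, 4)):
    assert check_proof_steps(k, M)
```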
\medskip
It is not difficult to prove Theorem 8.5 with the help of
Proposition~13.1.
\medskip\noindent
{\it Proof of Theorem 8.5.}\/
By formula (13.2), which is a consequence of Proposition~13.1, and
the Markov inequality
$$
P\left(|k!Z_{\mu,k}(f)|>u\right)\le
\frac{E\left(k!Z_{\mu,k}(f)\right)^{2M}}{u^{2M}}
\le A\left(\frac {2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM} \tag13.6
$$
with some constant $A>\sqrt2$ if $M\ge M_0$ with some constant
$M_0=M_0(A)$, and $M$ is an integer.
Put $\bar M=\bar M(u)=\frac1{2k}\left(\frac u\sigma\right)^{2/k}$,
and $M=M(u)=[\bar M]$, where $[x]$ denotes the integer part of
a real number $x$. Choose some number $u_0$ such that
$\frac1{2k}\left(\frac {u_0}\sigma\right)^{2/k}\ge M_0+1$. Then
relation (13.6) can be applied with $M=M(u)$ for $u\ge u_0$, and
this yields that
$$
\aligned
P\left(|k!Z_{\mu,k}(f)|>u\right) &\le A\left(\frac
{2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM}
\le Ae^{-kM}\le Ae^{k}e^{-k\bar M}\\
&=Ae^k\exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\} \quad\text{if } u\ge u_0.
\endaligned \tag13.7
$$
Relation (13.7) means that relation (8.14) holds for $u\ge u_0$ with
the pre-exponential coefficient $Ae^k$. By enlarging this
coefficient if needed it can be guaranteed that
relation~(8.14) holds for all $u>0$. Theorem~8.5 is proved.
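The optimisation in the proof above can be reproduced numerically: feed the moment bound (13.2) into the Markov inequality and take $M=[\bar M]$. The Python sketch below does this with illustrative values $k=3$, $A=1.5$, $\sigma=1$ (these parameters are assumptions for the illustration, not from the text):

```python
import math

def markov_tail_bound(u, sigma=1.0, k=3, A=1.5):
    # Markov bound (13.6) at M = [Mbar], where k*Mbar = (u/sigma)^(2/k)/2;
    # for M < 1 no moment is available and we fall back to the trivial bound 1
    M = math.floor((u / sigma) ** (2 / k) / (2 * k))
    if M < 1:
        return 1.0
    return A * (2 * k * M * sigma ** (2 / k) / (math.e * u ** (2 / k))) ** (k * M)

# the optimised bound is dominated by A*e^k*exp{-(u/sigma)^(2/k)/2}, as in (13.7)
k, A, sigma = 3, 1.5, 1.0
for u in (5.0, 10.0, 50.0, 200.0, 1000.0):
    assert markov_tail_bound(u, sigma, k, A) \
        <= A * math.e ** k * math.exp(-0.5 * (u / sigma) ** (2 / k))
```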
\medskip
Theorem 8.3 can be proved similarly by means of Proposition~13.2.
Nevertheless, the proof is technically more complicated, since
in this case the optimal choice of the parameter in the Markov
inequality cannot be given in such a direct form as in the proof of
Theorem~8.5. In this case the Markov inequality is applied with a
choice of the parameter~$M$ that is only almost optimal.
\medskip\noindent
{\it Proof of Theorem 8.3.}\/ The Markov inequality and
relation~(13.3) with $\eta=\frac{kM}{n\sigma^2}$ imply that
$$
\aligned
P(k!n^{-k/2}|I_{n,k}(f)|>u)&\le
\frac{E\left(k!n^{-k/2}I_{n,k}(f)\right)^{2M}}{u^{2M}}\\
&\le A\left(\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}\right)^{kM}
\endaligned \tag13.8
$$
for all integers $M\ge M_0$ with some $M_0=M_0(A)$.
Relation (8.10) will be proved with the help of estimate (13.8)
first in the case $D\le\frac u\sigma \le n^{k/2}\sigma^k$ with a
sufficiently large constant $D=D(k,C)>0$ depending on $k$ and the
constant $C$ in (13.8). To this end let us introduce the number
$\bar M$ by means of the formula
$$
k\bar M=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+B\frac{ \left(\frac u\sigma\right)^{1/k}}{\sqrt n\sigma}}
=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}
$$
with a sufficiently large number $B=B(C)>0$ and $M=[\bar M]$,
where $[x]$ means the integer part of the number $x$.
Observe that $\sqrt{k\bar M}\le\left(\frac u\sigma\right)^{1/k}$,
$\frac{\sqrt{k\bar M}}{\sqrt n\sigma}
\le\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\le1$,
and
$$
\left(1+C\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\right)^2\le
1+B\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\le 1+B\left(u n^{-k/2}
\sigma^{-(k+1)}\right)^{1/k}
$$
with a sufficiently large $B=B(C)>0$ if
$\frac u\sigma\le n^{k/2}\sigma^k$. Hence
$$
\aligned
\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}&\le
\frac1e\cdot 2k\bar M\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k} \\
&=\frac1e\cdot\frac{\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2}
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}\le\frac1e
\endaligned \tag13.9
$$
if $\frac u\sigma\le n^{k/2}\sigma^k$. If the inequality
$D\le\frac u\sigma$ also holds with a sufficiently large
$D=D(B,k)>0$, then $M\ge M_0$, and the conditions of
inequality~(13.8) hold. This inequality together with
inequality~(13.9) yield that
$$
P(k!n^{-k/2}|I_{n,k}(f)|>u)\le A e^{-kM}\le Ae^k e^{-k\bar M}
$$
if
$D\le\frac u\sigma\le n^{k/2}\sigma^k$, i.e.\ inequality (8.10)
holds in this case with a pre-exponential constant $Ae^k$. In the
case $\frac u\sigma\le D$ the right-hand side of~(8.10) is larger
than~1 if we choose the pre-exponential term~$A$ sufficiently
large. Hence inequality~(8.10) holds for all
$0\le\frac u\sigma\le n^{k/2}\sigma^k$ with a sufficiently large
pre-exponential term~$A$. Theorem~8.3 is proved.
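The role of the constant $B=B(C)$ can also be examined numerically: with $B=2C+C^2$ the inequality $(1+Cx)^2\le1+Bx$ holds for $0\le x\le1$, and then the bracket in (13.8), evaluated at $\bar M$, stays below $1/e$ in the whole range $0<u\le n^{k/2}\sigma^{k+1}$. A Python sketch with illustrative parameters ($k=2$, $C=4$, $n=10^4$, $\sigma=0.1$; all of them assumptions for the illustration):

```python
import math

def kMbar(u, sigma, n, k, B):
    # the quantity k*Mbar defined in the proof of Theorem 8.3
    y = (u * n ** (-k / 2) * sigma ** (-(k + 1))) ** (1 / k)
    return 0.5 * (u / sigma) ** (2 / k) / (1 + B * y)

k, C = 2, 4.0
B = 2 * C + C * C          # guarantees (1 + C*x)^2 <= 1 + B*x for 0 <= x <= 1
n, sigma = 10_000, 0.1
u_max = n ** (k / 2) * sigma ** (k + 1)
for t in range(1, 100):
    u = u_max * t / 100
    kM = kMbar(u, sigma, n, k, B)
    bracket = (2 * kM / math.e) * (sigma / u) ** (2 / k) \
              * (1 + C * math.sqrt(kM) / (math.sqrt(n) * sigma)) ** 2
    assert bracket <= 1 / math.e + 1e-12   # the key inequality (13.9)
```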
\medskip
Example 8.7 is a relatively simple consequence of It\^o's formula
for multiple Wiener--It\^o integrals.
\medskip\noindent
{\it Proof of Example 8.7.}\/ We may restrict our attention to the
case $k\ge2$. It\^o's formula for multiple Wiener--It\^o integrals,
more explicitly relation~(10.21), implies that the random
variable $k!Z_{\mu,k}(f)$ can be expressed as $k!Z_{\mu,k}(f)
=\sigma H_k\left(\int f_0(x)\mu_W(\,dx)\right)=\sigma H_k(\eta)$,
where $H_k(x)$ is the $k$-th Hermite polynomial with leading
coefficient~1, and $\eta=\int f_0(x)\mu_W(\,dx)$ is a standard
normal random variable. Hence we get by exploiting that the
coefficient of $x^{k-1}$ in the polynomial $H_k(x)$ is zero that
$P(k!|Z_{\mu,k}(f)|>u)=P(|H_k(\eta)|\ge\frac u\sigma)\ge
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)$ with a
sufficiently large constant $D>0$ if $\frac u\sigma>1$. There
exist such positive constants $A$ and $B$ that
$$
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)
\ge P\left(|\eta^k|>\frac u\sigma+A
\left(\frac u\sigma\right)^{(k-2)/k}\right)\quad
\text{if } \frac u\sigma>B.
$$
Hence
$$
P(k!|Z_{\mu,k}(f)|>u)\ge
P\left(|\eta|>\left(\frac u\sigma\right)^{1/k}
\left(1+A\left(\frac u\sigma\right)^{-2/k}\right)\right)
\ge\frac{\bar C \exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\}}
{\left(\frac u\sigma\right)^{1/k}+1}
$$
with an appropriate $\bar C>0$ if $\frac u\sigma>B$. Since
$P(k!|Z_{\mu,k}(f)|>B\sigma)>0$, the above inequality also holds
for $0\le \frac u\sigma\le B$ if the constant $\bar C>0$ is chosen
sufficiently small. This means that relation (8.16) holds.
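The last step rests on the standard lower bound $P(|\eta|>t)\ge\bar C e^{-t^2/2}/(t+1)$ for a standard normal random variable $\eta$, which can be checked via the complementary error function. A Python sketch (the value $\bar C=0.5$ is an assumption that happens to work on this grid):

```python
import math

# check the Gaussian tail lower bound used above:
# P(|eta| > t) = erfc(t / sqrt(2)) >= Cbar * exp(-t^2/2) / (t + 1)
Cbar = 0.5
for i in range(0, 101):
    t = 0.1 * i
    tail = math.erfc(t / math.sqrt(2))
    assert tail >= Cbar * math.exp(-t * t / 2) / (t + 1)
```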
\medskip
Next we prove a multivariate version of Hoeffding's inequality.
Before its formulation some notations will be introduced.
Let us fix two positive integers~$k$ and~$n$ and some
real numbers $a(j_1,\dots,j_k)$ for all sequences of arguments
$\{j_1,\dots,j_k\}$ such that $1\le j_l\le n$, $1\le l\le k$, and
$j_l\neq j_{l'}$ if $l\neq l'$.
With the help of the above real numbers $a(\cdot)$ and a
sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, the random
variable
$$
V=\sum\Sb (j_1,\dots, j_k)\colon\; 1\le j_l\le n \text{ for all }
1\le l\le k,\\
j_l\neq j_{l'} \text{ if }l\neq l' \endSb a(j_1,\dots, j_k)
\varepsilon_{j_1}\cdots \varepsilon_{j_k} \tag13.10
$$
and number
$$
S^2=\sum\Sb (j_1,\dots, j_k)\colon\; 1\le j_l\le n \text{ for all }
1\le l\le k,\\
j_l\neq j_{l'} \text{ if }l\neq l' \endSb
a^2(j_1,\dots, j_k) \tag13.11
$$
will be introduced.
With the help of the above notations the following result can be
formulated.
\medskip\noindent
{\bf Theorem 13.3. (The multivariate version of Hoeffding's
inequality).} {\it The random variable $V$ defined in
formula~(13.10) satisfies the inequality
$$
P(|V|>u)\le C \exp\left\{-\frac12\left(\frac uS\right)^{2/k}\right\}
\quad\text{for all }u\ge 0 \tag13.12
$$
with the constant $S$ defined in (13.11) and some constant $C>0$
depending only on the parameter $k$ in the expression $V$.}
\medskip
Theorem~13.3 will be proved by means of two simple lemmas. Before
their formulation the random variable
$$
Z=\sum\Sb (j_1,\dots, j_k)\colon\; 1\le j_l\le n \text{ for all }
1\le l\le k,\\
j_l\neq j_{l'} \text{ if }l\neq l'\endSb |a(j_1,\dots,j_k)|
\eta_{j_1}\cdots \eta_{j_k} \tag13.13
$$
will be introduced, where $\eta_1,\dots,\eta_n$ are independent
random variables with standard normal distribution, and the numbers
$a(j_1,\dots,j_k)$ agree with those in formula (13.10). The
following lemmas will be proved.
\medskip\noindent
{\bf Lemma 13.4.} {\it The random variables $V$ and $Z$ introduced
in (13.10) and (13.13) satisfy the inequality
$$
EV^{2M}\le EZ^{2M}\quad\text{for all }M=1,2,\dots. \tag13.14
$$
}
\medskip\noindent
{\bf Lemma 13.5.} {\it The random variable $Z$ defined in formula
(13.13) satisfies the inequality
$$
EZ^{2M}\le 1\cdot3\cdot5\cdots(2kM-1)S^{2M}\quad\text{for all }
M=1,2,\dots \tag13.15
$$
with the constant $S$ defined in formula (13.11).}
\medskip\noindent
{\it Proof of Lemma 13.4.}\/ We can write, by carrying out the
multiplications in the expressions $EV^{2M}$ and $EZ^{2M}$,
by exploiting the additive and multiplicative properties of the
expectation for sums and products of independent random variables
together with the identities
$E\varepsilon_j^{2l+1}=0$ and $E\eta_j^{2l+1}=0$
for all $l=0,1,\dots$ that
$$
EV^{2M}= \!\!\!\!\!\!\!\!\!
\sum\Sb (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM\endSb \!\!\!\!\!\!\!\!\!
A(j_1,\dots,j_l,m_1,\dots,m_l)
E\varepsilon_{j_1}^{2m_1}\cdots E\varepsilon_{j_l}^{2m_l}
\tag13.16
$$
and
$$
EZ^{2M}= \!\!\!\!\!\!\!\!\!
\sum \Sb (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM\endSb \!\!\!\!\!\!\!\!\!
B(j_1,\dots,j_l,m_1,\dots,m_l) E\eta_{j_1}^{2m_1}\cdots
E\eta_{j_l}^{2m_l} \tag13.17
$$
with some coefficients $A(j_1,\dots,j_l,m_1,\dots,m_l)$ and
$B(j_1,\dots,j_l,m_1,\dots,m_l)$ such that
$$
|A(j_1,\dots,j_l,m_1,\dots,m_l)|\le
B(j_1,\dots,j_l,m_1,\dots,m_l). \tag13.18
$$
The coefficients $A(\cdot,\cdot,\cdot)$ and $B(\cdot,\cdot,\cdot)$
could be expressed explicitly, but we do not need such a formula.
What is important for us is that $A(\cdot,\cdot,\cdot)$ can be
expressed as the sum of certain terms, and $B(\cdot,\cdot,\cdot)$
as the sum of the absolute value of the same terms. Hence
relation~(13.18) holds. Since $E\varepsilon_j^{2m}\le E\eta_j^{2m}$
for all parameters $j$ and $m$, formulas~(13.16), (13.17)
and~(13.18) imply
Lemma~13.4.
\medskip\noindent
{\it Proof of Lemma~13.5.} Let us consider a white noise $W(\cdot)$
on the unit interval $[0,1]$ with the Lebesgue measure $\lambda$ on
$[0,1]$ as its reference measure, i.e.\ let us take a set of
Gaussian random variables $W(A)$ indexed by the measurable sets
$A\subset [0,1]$ such that $EW(A)=0$, $EW(A)W(B)=\lambda(A\cap B)$
with the Lebesgue measure $\lambda$ for all measurable subsets of
the interval $[0,1]$. Let us introduce $n$ orthonormal functions
$\varphi_1(x),\dots,\varphi_n(x)$ with respect to the Lebesgue
measure on the interval $[0,1]$, and define the random variables
$\eta_j=\int \varphi_j(x)W(\,dx)$, $1\le j\le n$. Then
$\eta_1,\dots,\eta_n$ are independent random variables with standard
normal distribution, hence we may assume that they appear in the
definition of the random variable~$Z$ in formula (13.13). Beside
this, the identity $\eta_{j_1}\cdots\eta_{j_k}=\int \varphi_{j_1}(x_1)
\cdots\varphi_{j_k}(x_k)W(\,dx_1)\dots W(\,dx_k)$ holds for all
$k$-tuples $(j_1,\dots,j_k)$ such that $1\le j_s\le n$ for all
$1\le s\le k$, and the indices $j_1,\dots,j_k$ are different.
This identity follows from It\^o's formula for multiple Wiener--It\^o
integrals formulated in formula~(10.20) of Theorem~10.3.
Hence the random variable $Z$ defined in~(13.13) can be written in
the form
$$
Z=\int f(x_1,\dots,x_k)W(\,dx_1)\dots W(\,dx_k)
$$
with the function
$$
f(x_1,\dots,x_k)=
\sum\Sb (j_1,\dots, j_k)\colon\; 1\le j_l\le n \text{ for all } 1\le
l\le k,\\
j_l\neq j_{l'} \text{ if }l\neq l'\endSb |a(j_1,\dots,j_k)|
\varphi_{j_1}(x_1)\cdots \varphi_{j_k}(x_k).
$$
Because of the orthogonality of the functions $\varphi_j(x)$
$$
S^2=\int_{[0,1]^k} f^2(x_1,\dots,x_k)\,dx_1\dots\,dx_k.
$$
Lemma~13.5 is a straightforward consequence of the above relations
and formula~(13.1) in Proposition~13.1.
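Combining Lemmas 13.4 and 13.5 gives $EV^{2M}\le1\cdot3\cdots(2kM-1)S^{2M}$, and for small parameters this can be verified exactly by enumerating all $2^n$ sign vectors. In the Python sketch below $n=4$, $k=2$, and the coefficients $a(j_1,j_2)$ are an arbitrary illustrative choice, not taken from the text:

```python
import itertools

n, k = 4, 2
# illustrative coefficients a(j1, j2), j1 != j2 (an arbitrary choice)
a = {(j1, j2): 0.1 * (j1 + 2 * j2 + 1)
     for j1, j2 in itertools.permutations(range(n), k)}
S2 = sum(c * c for c in a.values())

def moment_V(two_M):
    # E V^{2M} computed exactly: average over all 2^n sign vectors
    total = 0.0
    for eps in itertools.product((-1, 1), repeat=n):
        V = sum(c * eps[j1] * eps[j2] for (j1, j2), c in a.items())
        total += V ** two_M
    return total / 2 ** n

def odd_double_factorial(m):
    # 1*3*5*...*(2m-1)
    result = 1
    for i in range(1, 2 * m, 2):
        result *= i
    return result

# Lemmas 13.4 and 13.5 combined: E V^{2M} <= 1*3*...*(2kM-1) * S^{2M}
for M in (1, 2, 3):
    assert moment_V(2 * M) <= odd_double_factorial(k * M) * S2 ** M
```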
\medskip\noindent
{\it Proof of Theorem~13.3.}\/ The proof of Theorem~13.3 with the
help of Lemmas~13.4 and~13.5 is an almost word for word repetition
of the proof of Theorem~8.5. By Lemma~13.4 inequality~(13.15)
remains valid if the random variable $Z$ is replaced by the random
variable~$V$ at its left-hand side. Hence the Stirling formula
yields that
$$
EV^{2M}\le EZ^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!} S^{2M}\le C
\left(\frac2e\right)^{kM}(kM)^{kM}S^{2M}
$$
for any $C>\sqrt2$ if $M\ge M_0(C)$. As a consequence, by the
Markov inequality the estimate
$$
P(|V|>u)\le\frac{EV^{2M}}{u^{2M}}\le C\left(\frac{2kM}e\left(\frac
Su\right)^{2/k}\right)^{kM} \tag13.19
$$
holds for all $C>\sqrt 2$ if $M\ge M_0(C)$. Put $k\bar M=k\bar
M(u)=\frac12\left(\frac uS\right)^{2/k}$ and $M=M(u)=[\bar M]$, where
$[x]$ denotes the integer part of the number~$x$. Let us choose
a threshold number $u_0$ by the identity
$\frac1{2k}\left(\frac{u_0}S\right)^{2/k}=M_0(C)+1$.
Formula~(13.19) can be applied with $M=M(u)$ for $u\ge u_0$, and
it yields that
$$
P(|V|>u)\le Ce^{-kM}\le Ce^ke^{-k\bar M}=Ce^k\exp\left\{-\frac12
\left(\frac uS\right)^{2/k}\right\}\quad\text{if } u\ge u_0.
$$
The last inequality means that relation (13.12) holds for $u\ge u_0$
if the constant $C$ is replaced by $Ce^k$ in it. With the choice of
a sufficiently large constant~$C$ relation (13.12) holds
for all $u\ge0$. Theorem~13.3 is proved.
\vfill\eject
\medskip\noindent
{\script 13. B) A short discussion about the methods and results.}
\medskip\noindent
A comparison of Theorem 8.5 and Example 8.7 shows that the estimate
(8.15) is sharp. At least no essential improvement of this estimate
is possible which holds for {\it all}\/ Wiener--It\^o integrals
with a kernel function $f$ satisfying the conditions of Theorem~8.5.
This fact also indicates that the bounds~(13.1) and~(13.2) on high
moments of Wiener--It\^o integrals are sharp. It is worthwhile
to compare formula (13.2) with the estimate of
Proposition~13.2 on moments of degenerate $U$-statistics.
Let us consider a normalized $k$-fold degenerate $U$-statistic
$n^{-k/2}k!I_{n,k}(f)$ with some kernel function $f$ and a
$\mu$-distributed sample of size~$n$. Let us compare its moments
with those of a $k$-fold Wiener--It\^o integral $k!Z_{\mu,k}(f)$
with the same kernel function~$f$ with respect to a white noise
$\mu_W$ with reference measure~$\mu$. Let $\sigma$ denote the
$L_2$-norm of the kernel function~$f$. If
$M\le\varepsilon n\sigma^2$ with a small number $\varepsilon>0$,
then Proposition~13.2 (with an appropriate
choice of the parameter~$\eta$ which is small in this case)
provides an almost as good bound on the $2M$-th moment of the
normalized $U$-statistic as Proposition~13.1 provides on the
$2M$-th moment of the corresponding Wiener--It\^o integral. In
the case $M\le Cn\sigma^2$ with some fixed (not necessarily small)
number $C>0$ the $2M$-th moment of the normalized $U$-statistic
can be bounded by $C(k)^M$ times the natural estimate on the
$2M$-th moment of the Wiener--It\^o integral with some
constant~$C(k)>0$ depending only on $k$ and the number~$C$. This can
be interpreted as saying that in this case the estimate on the moments
of the normalized $U$-statistic is weaker than the estimate on the moments
of the Wiener--It\^o integral, but they are still comparable.
Finally, in the case $M\gg n\sigma^2$ the estimate on the $2M$-th
moment of the normalized $U$-statistic is much worse than the
estimate on the $2M$-th moment of the Wiener--It\^o integral.
A similar picture arises if the distribution of the normalized
degenerate $U$-statistic
$$
F_n(u)=P(n^{-k/2}k!|I_{n,k}(f)|>u)
$$
is compared to the distribution of the Wiener--It\^o integral
$$
G(u)=P(k!|Z_{\mu,k}(f)|>u).
$$
A comparison of Theorems~8.3 and~8.5 shows that for
$0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a small $\varepsilon>0$
an almost as good estimate holds for $F_n(u)$ as for $G(u)$. In the
case $0\le u\le n^{k/2}\sigma^{k+1}$ the behaviour of $F_n(u)$
and $G(u)$ is similar, only in the exponent of the estimate on
$F_n(u)$ in formula (8.10) a worse constant appears. Finally, if
$u\gg n^{k/2}\sigma^{k+1}$, then --- as Example~8.8 shows, at
least in the case $k=2$, --- the (tail) distribution function
$F_n(u)$ satisfies a much worse estimate than the function
$G(u)$. Thus a similar picture arises as in the case when the
estimate on the tail-distribution of normalized sums of independent
random variables, discussed in Section~3, is compared to the
behaviour of the standard normal distribution in the neighbourhood
of infinity. To understand this similarity better it is useful to
recall Theorem~10.4, the limit theorem about normalized degenerate
$U$-statistics. Theorems 8.3 and~8.5 enable us to compare the tail
behaviour of normalized degenerate $U$-statistics with their limit
presented in the form of multiple Wiener--It\^o integrals, while
the one-variate versions of these results compare the distribution
of sums of independent random variables with their Gaussian limit.
The above results show that good bounds on the moments of degenerate
$U$-statistics and multiple Wiener--It\^o integrals also provide a good
estimate on their distribution. To understand the behaviour of high
moments of degenerate $U$-statistics it is useful to have a closer
look at the simplest case $k=1$, when the moments of sums of
independent random variables with expectation zero are considered.
Let us consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with expectation
zero, take their sum $S_n=\sum\limits_{j=1}^n\xi_j$, and let us try
to give a good estimate on the moments $ES_n^{2M}$ for all
$M=1,2,\dots$. Because of the independence of the random variables
$\xi_j$ and the condition $E\xi_j=0$ the identity
$$
ES_n^{2M}=\sum\Sb (j_1,\dots,j_s,l_1,\dots,l_s)\\
j_1+\dots+j_s=2M,\,j_u\ge 2,\text{ for all }1\le u\le s \\
l_u\neq l_{u'} \text { if }u\neq u'\endSb
E\xi_{l_1}^{j_1}\cdots E\xi_{l_s}^{j_s}
\tag13.20
$$
holds. Simple combinatorial considerations show that a dominating
number of terms at the right-hand side of (13.20) are indexed by a
vector $(j_1,\dots,j_M;\,l_1,\dots,l_M)$ such that $j_u=2$ for all
$1\le u\le M$, and the number of such vectors is equal to $\binom nM
\frac{(2M)!}{2^M}\sim n^M\frac{(2M)!}{2^MM!}$. The last asymptotic
relation holds if the number $n$ of terms in the random sum~$S_n$
is sufficiently large. The above considerations suggest that under
not too restrictive conditions $ES_n^{2M}\sim
\left(n\sigma^2\right)^M\frac{(2M)!}{2^MM!}=E\eta_{n\sigma^2}^{2M}$,
where $\sigma^2=E\xi^2$ is the variance of the terms in the sum
$S_n$, and $\eta_u$ denotes a random variable with normal distribution
with expectation zero and variance~$u$. The question arises when
the above heuristic argument gives the right estimate.
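The pairing heuristic can be tested directly in the simplest case of Rademacher summands, where $E\xi_{l_1}\cdots\xi_{l_{2M}}=1$ exactly when every index occurs an even number of times among $l_1,\dots,l_{2M}$, and $0$ otherwise. The following brute-force sketch (the toy values $n=3$, $M=2$ are illustrative choices, not taken from the text) checks both the count $\binom nM\frac{(2M)!}{2^M}$ of the paired terms and their dominant share in $ES_n^{2M}$:

```python
from itertools import product
from math import comb, factorial

# Brute-force check of the pairing heuristic for E S_n^{2M} with
# Rademacher summands (P(xi = 1) = P(xi = -1) = 1/2, so sigma^2 = 1).
# The expectation E xi_{l_1}...xi_{l_{2M}} equals 1 iff every index
# occurs an even number of times, and 0 otherwise.
n, M = 3, 2
moment = 0       # exact value of E S_n^{2M}
paired = 0       # terms in which each index occurs exactly twice
for idx in product(range(n), repeat=2 * M):
    counts = [idx.count(i) for i in set(idx)]
    if all(c % 2 == 0 for c in counts):
        moment += 1
        if all(c == 2 for c in counts):
            paired += 1

# The paired terms are counted by binom(n, M) * (2M)! / 2^M.
assert paired == comb(n, M) * factorial(2 * M) // 2 ** M
assert moment == 21    # paired terms (18 of them) dominate E S_3^4 = 21

# The heuristic Gaussian value n^M * (2M)!/(2^M M!) equals 27 here;
# for growing n the ratio moment / gauss tends to one.
gauss = n ** M * factorial(2 * M) // (2 ** M * factorial(M))
```

For $n=3$, $M=2$ the paired terms contribute $18$ of the $21$ nonzero terms; as $n$ grows their share approaches one, in accordance with the heuristic argument above.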
For the sake of simplicity let us restrict our attention to the
case when the absolute value of the random variables $\xi_j$ is
bounded by~1. Let us observe that even in this case the above
heuristic argument holds only under the condition that the variance
$\sigma^2$ of the random variables $\xi_j$ is not too small.
Indeed, let us consider such random variables $\xi_j$, for which
$P(\xi_j=1)=P(\xi_j=-1)=\frac{\sigma^2}2$, $P(\xi_j=0)=1-\sigma^2$.
Then these random variables $\xi_j$ have variance $\sigma^2$, and
the contribution of the terms $E\xi_j^{2M}$, $1\le j\le n$, to the
sum in (13.20) equals $n\sigma^2$. If $\sigma^2$ is very small,
then it may happen that $n\sigma^2\gg\left(n\sigma^2\right)^M
\frac{(2M)!}{2^MM!}$, and the approximation given for $ES_n^{2M}$
in the previous paragraph does not hold any longer. Hence the
asymptotic relation for a very high moment $ES_n^{2M}$ suggested
by the above heuristic argument may only hold if the variance
$\sigma^2$ of the summands satisfies an appropriate lower bound.
In the proof of Proposition~13.2 a similar picture appears in a
hidden way. In the calculation of the moments of a degenerate
$U$-statistic the contribution of certain (closed) diagrams,
more precisely of some integrals defined with their help, has to
be estimated. Some of these diagrams (those in which all chains
have length~2) appear also in the calculation of the moments of
multiple Wiener--It\^o integrals. In the calculation of the
moments of sums of independent random variables the terms
consisting of products of second moments play such a role in
the sum in formula~(13.20) as the `nice' diagrams consisting of
chains of length~2 play in the calculation of the moments of
degenerate $U$-statistics in formula~(11.17). In nice cases the
remaining diagrams do not give a much greater contribution than
these `nice' diagrams, and we get an almost as good bound for
the moments of a normalized degenerate $U$-statistic as for the
moments of the corresponding multiple Wiener--It\^o integral.
The proof of Proposition~13.2 shows that such a situation
appears under very general conditions.
Let me also remark that there is an essential difference
between the tail behaviour of Wiener--It\^o integrals and
normalized degenerate $U$-statistics. A good estimate can be
given on the tail distribution of Wiener--It\^o integrals which
depends only on the $L_2$-norm of the kernel function, while in
the case of normalized degenerate $U$-statistics the
corresponding estimate depends not only on the $L_2$-norm but
also on the $L_\infty$ norm of the kernel function. In
Theorem~8.3 such an estimate is proved.
For $k\ge2$ the distributions of $k$-fold Wiener--It\^o integrals are
not determined by the $L_2$-norms of their kernel functions. This is
an essential difference between Wiener--It\^o integrals of order
$k\ge2$ and $k=1$. In the case $k=1$ a Wiener--It\^o integral is
a Gaussian random variable with expectation zero, and its variance
equals the square of the $L_2$-norm of its kernel function. Hence
its distribution is completely determined by the $L_2$-norm of its
kernel function. On the other hand, the distribution of a
Wiener--It\^o integral of order $k\ge2$ is not determined by its
variance. Theorem~8.5 yields a `worst case' estimate on the
distribution of Wiener--It\^o integrals if we have a bound on their
variance. In the statistical problems which were the main
motivation for this work such estimates are needed, but it may be
interesting to know what kind of estimates are known about the
distribution of a multiple Wiener--It\^o integral or degenerate
$U$-statistic if we have some additional information about its
kernel function. Some results will be mentioned in this direction,
but most technical details will be omitted from their discussion.
H.~P. McKean proved the following lower bound on the distribution
of multiple Wiener--It\^o integrals. (See [29] or [42].)
\medskip\noindent
{\bf Theorem 13.6. (Lower bound on the distribution of
Wiener--It\^o integrals).} {\it All $k$-fold Wiener--It\^o integrals
$Z_{\mu,k}(f)$ satisfy the inequality
$$
P(|Z_{\mu,k}(f)|>u)>Ke^{-Au^{2/k}} \tag13.21
$$
with some numbers $K=K(f,\mu)>0$ and $A=A(f,\mu)>0$.}
\medskip\noindent
The constant $A$ in the exponent $Au^{2/k}$ of formula~(13.21) is
always finite, but McKean's proof yields no explicit upper
bound on it. The following example shows that in certain cases
if we fix the constant~$K$ in relation~(13.21), then this
inequality holds only with a very large constant $A>0$ even
if the variance of the Wiener--It\^o integral equals~1.
Take a probability measure $\mu$ and a white noise $\mu_W$ with
reference measure $\mu$ on a measurable space $(X,{\Cal X})$, and let
$\varphi_1,\varphi_2,\dots$ be a sequence of orthonormal functions
on $(X,{\Cal X})$ with respect to this measure $\mu$. Define for all
$L=1,2,\dots$, the function
$$
f(x_1,\dots,x_k)=f_L(x_1,\dots,x_k)=(k!)^{1/2}L^{-1/2}
\sum\limits_{j=1}^L \varphi_j(x_1)\cdots\varphi_j(x_k) \tag13.22
$$
and the Wiener--It\^o integral
$$
Z_{\mu,k}(f)=Z_{\mu,k}(f_L)=\frac1{k!}\int f_L(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots\mu_W(\,dx_k).
$$
Then $EZ_{\mu,k}^2(f)=1$, and the high moments of $Z_{\mu,k}(f)$ can
be well estimated. For a large parameter~$L$ these moments are much
smaller than the quantities suggested by Proposition~13.1. (The
calculation leading to the estimation of the moments of
$Z_{\mu,k}(f)$ will be omitted.) These moment estimates also imply
that if the parameter~$L$ is large, then for not too large
numbers~$u$ the probability $P(|Z_{\mu,k}(f)|>u)$ has a much better
estimate than that given in Theorem~8.5. As a consequence,
for a large number $L$ and fixed number~$K$ relation~(13.21) may
hold only with a very big number $A>0$.
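For $k=2$ the omitted moment calculation can at least be checked numerically. By It\^o's formula (Theorem~10.3), $2!\,Z_{\mu,2}(f_L)$ has the same distribution as $\sqrt{2/L}\sum_{j=1}^L(\eta_j^2-1)$ with i.i.d.\ standard normal $\eta_j$; its fourth moment equals $48+192/L$, which for large $L$ is close to the value $3\cdot4^2=48$ of a Gaussian random variable with the same variance~$4$. The Monte Carlo sketch below (sample sizes and the seed are arbitrary choices) confirms this:

```python
import random
random.seed(0)

# For k = 2 the integral with kernel (13.22) can be written, via Ito's
# formula, as 2! Z = sqrt(2/L) * sum_j (eta_j^2 - 1) with i.i.d. standard
# normal eta_j.  Its exact fourth moment is 48 + 192/L, while a
# variance-4 Gaussian random variable has fourth moment 3 * 4^2 = 48:
# for large L the example is "almost Gaussian", hence much
# lighter-tailed than the worst case allowed by Theorem 8.5.
L, n_samples = 100, 50_000
est = 0.0
for _ in range(n_samples):
    z = (2 / L) ** 0.5 * sum(random.gauss(0, 1) ** 2 - 1 for _ in range(L))
    est += z ** 4
est /= n_samples
exact = 48 + 192 / L       # = 49.92 for L = 100
assert abs(est - exact) < 5.0
```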
We can expect that if we take a Gaussian random
polynomial~$P(\xi_1,\dots,\xi_n)$ whose arguments are Gaussian
random variables $\xi_1,\dots,\xi_n$, and which is the sum of
many small almost independent terms, then
a similar picture arises as in the case of a Wiener--It\^o
integral with kernel function~(13.22) with a large parameter~$L$.
Such a random polynomial has an almost Gaussian distribution by
the central limit theorem, and we can also expect that its not
too high moments behave like the corresponding moments of a
Gaussian random variable with expectation zero and the same
variance as the Gaussian random polynomial we consider. Such a
bound on the moments has the consequence that the estimate on
the probability $P(P(\xi_1,\dots,\xi_n)>u)$ given in Theorem~8.5
can be improved if the number~$u$ is not too large. A similar
picture arises if we consider Wiener--It\^o integrals whose
kernel function satisfies some `almost independence' properties.
The problem is to find the right properties under which we can
get a good estimate that exploits the almost independence
property of a Gaussian random polynomial or of a Wiener--It\^o
integral. The main result of R.~Lata{\l}a's paper~[26] can be
considered as a response to this question. I describe this
result below.
To formulate Lata{\l}a's result some new notions have to be
introduced. Given a finite set $A$ let ${\Cal P}(A)$ denote the
set of all its partitions. If a partition
$P=\{B_1,\dots,B_s\}\in{\Cal P}(A)$ consists of $s$ elements, then we
say that this partition has order~$s$, and write $|P|=s$. In the
special case $A=\{1,\dots,k\}$ the notation ${\Cal P}(A)={\Cal P}_k$
will be used. Given a measurable space $(X,{\Cal X})$ with a
probability measure $\mu$ on it together with a finite set
$B=\{b_1,\dots,b_j\}$ let us introduce the following notations. Take
$j$ different copies $(X_{b_r},{\Cal X}_{b_r})$ and $\mu_{b_r}$,
$1\le r\le j$, of this measurable space and probability measure indexed
by the elements of the set $B$, and define their product
$(X^{(B)},{\Cal X}^{(B)},\mu^{(B)})=\left(\prod\limits_{r=1}^j X_{b_r},
\prod\limits_{r=1}^j{\Cal X}_{b_r},
\prod\limits_{r=1}^j\mu_{b_r}\right)$. The points
$(x_{b_1},\dots,x_{b_j})\in X^{(B)}$ will be denoted by
$x^{(B)}\in X^{(B)}$ in the sequel. With the help of the above
notations I introduce the quantities needed in the formulation of the
following Theorem~13.7.
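Before stating the theorem, the partition notation can be made concrete: for $k=3$ the class ${\Cal P}_3$ contains five partitions, one of order~1, three of order~2 and one of order~3. The following short sketch (a recursive enumeration written for this discussion, not taken from the text) checks this:

```python
# Enumerating the partitions P of a finite set and their orders |P|
# for the set {1, 2, 3}; there are 5 partitions (the Bell number B_3).
def partitions(elems):
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for p in partitions(rest):
        # put `first` into each existing block in turn...
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        # ...or into a new block of its own
        yield [[first]] + p

ps = list(partitions([1, 2, 3]))
assert len(ps) == 5
orders = sorted(len(p) for p in ps)     # the orders |P|
assert orders == [1, 2, 2, 2, 3]
```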
Let a function $f=f(x_1,\dots,x_k)$ be given on the $k$-fold product
$(X^k,{\Cal X}^k,\mu^k)$ of a measurable space $(X,{\Cal X})$ with a
probability measure $\mu$. For all partitions
$P=\{B_1,\dots,B_s\}\in{\Cal P}_k$ of the set $\{1,\dots,k\}$ consider
the functions $g_r\left(x^{(B_r)}\right)$ on the space $X^{(B_r)}$,
$1\le r\le s$, and define with their help the quantities
$$
\aligned
\alpha(P)&=\alpha(P,f,\mu)
=\sup_{g_1,\dots,g_s} \int f(x_1,\dots,x_k)
g_1\left(x^{(B_1)}\right)\cdots g_s\left(x^{(B_s)}\right)\mu(dx_1)
\dots\mu(dx_k); \\
&\qquad \text{where the supremum is taken over such functions }
g_1,\dots,g_s,\quad g_r\colon\; X^{(B_r)}\to R^1 \\
&\qquad\qquad \text{for which }
\int g_r^2\left(x^{(B_r)}\right)\mu^{(B_r)}\left(\,dx^{(B_r)}\right)\le1
\quad \text{for all } 1\le r\le s,
\endaligned \tag13.23
$$
and put
$$
\alpha_s=\max_{P\in{\Cal P}_k,\,|P|=s}\alpha(P),
\quad 1\le s\le k. \tag13.24
$$
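The quantities $\alpha(P)$ have a transparent meaning in a small discrete example (the two-point space and the kernel below are illustrative assumptions, not from the text). For $k=2$, writing $B[x][y]=\sqrt{\mu(x)\mu(y)}\,f(x,y)$, the partition $\{\{1,2\}\}$ yields the Frobenius norm of~$B$, i.e.\ the $L_2$-norm $\sigma$ of~$f$, while $\{\{1\},\{2\}\}$ yields the largest singular value of~$B$; in particular the order-2 quantity never exceeds $\alpha_1$:

```python
# Toy example (assumed): X = {0, 1}, mu uniform, k = 2.  With
# B[x][y] = sqrt(mu(x) mu(y)) * f(x, y), the partition {{1,2}} gives
# alpha = Frobenius norm of B (the L2-norm sigma of f), and {{1},{2}}
# gives the largest singular value of B.
mu = [0.5, 0.5]
f = [[3.0, 0.0],
     [0.0, 4.0]]
B = [[(mu[x] * mu[y]) ** 0.5 * f[x][y] for y in range(2)] for x in range(2)]

alpha_order1 = sum(B[x][y] ** 2 for x in range(2) for y in range(2)) ** 0.5
# B is diagonal here, so its singular values can be read off directly.
alpha_order2 = max(abs(B[0][0]), abs(B[1][1]))
assert alpha_order2 <= alpha_order1     # 2.0 <= 2.5
```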
In Lata{\l}a's estimation of Wiener--It\^o integrals of order~$k$
the quantities $\alpha_s$, $1\le s\le k$, play a similar role as
the number $\sigma^2$ in Theorem~8.5. Observe that in the case
$|P|=1$, i.e.\ if $P=\{\{1,\dots,k\}\}$, the identity
$\alpha^2(P)=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$
holds, which means that $\alpha_1=\sigma$. The following estimate
is valid for Wiener--It\^o integrals of general order.
\medskip\noindent
{\bf Theorem 13.7. (Lata{\l}a's estimate about the tail-distribution
of Wiener--It\^o integrals).} {\it Let a $k$-fold Wiener--It\^o integral
$Z_{\mu,k}(f)$, $k\ge1$, be defined with the help of a white noise
$\mu_W$ with a non-atomic reference measure~$\mu$ and a kernel
function $f$ of $k$ variables such that
$\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty$. There is
some universal constant $C(k)<\infty$ depending only on the order~$k$
of the random integral such that the inequalities
$$
E(Z_{\mu,k}(f))^{2M}\le
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
\tag13.25
$$
and
$$
P(|Z_{\mu,k}(f)|>u)\le C(k)\exp\left\{-\frac1{C(k)}\min_{1\le s\le k}
\left(\frac u{\alpha_s}\right)^{2/s}\right\} \tag13.26
$$
hold for all $M=1,2,\dots$ and $u>0$ with the quantities $\alpha_s$,
defined in formulas~(13.23) and~(13.24).}
\medskip
Inequality~(13.26) is a simple consequence of~(13.25). In the special
case when $\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$,
inequality~(13.25) says that the moment $EZ_{\mu,k}(f)^{2M}$ has
the same magnitude as the $2M$-th moment of a standard Gaussian
random variable multiplied by a constant, and it implies the good
estimate on $P(|Z_{\mu,k}(f)|>u)$ given in~(13.26). Actually the
result of Theorem~13.7 can be reduced to the special case when
$\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$. Thus it can be
interpreted as saying that if the quantities~$\alpha_s$ of a $k$-fold
Wiener--It\^o integral are sufficiently small, then these `almost
independence' conditions imply that the $2M$-th moment of this
integral behaves like that of a one-fold Wiener--It\^o integral with
the same variance.
Actually Lata{\l}a formulated his result in a different form, and
he proved a slightly weaker result. He considered Gaussian
polynomials of the following form:
$$
P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k)=\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)\xi^{(1)}_{j_1}\cdots\xi^{(k)}_{j_k}, \tag13.27
$$
where $\xi_j^{(s)}$, $1\le j\le n$ and $1\le s\le k$, are independent
standard normal random variables. Lata{\l}a gave an estimate about
the moments and tail-distribution of such random polynomials.
The problem about the behaviour of such random polynomials can be
reformulated as a problem about the behaviour of Wiener--It\^o
integrals in the following way: Take a measurable space $(X,{\Cal X})$
with a non-atomic measure~$\mu$ on it. Let $Z_\mu$ be a white noise
with reference measure~$\mu$, let us choose a set of orthonormal
functions $h^{(s)}_j(x)$, $1\le j\le n$, $1\le s\le k$, on the
space $(X,{\Cal X})$ with respect to the measure~$\mu$, and define
the function
$$
f(x_1,\dots,x_k)=\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)h^{(1)}_{j_1}(x_1)\cdots h^{(k)}_{j_k}(x_k)
\tag13.28
$$
together with the Wiener--It\^o integral $Z_{\mu,k}(f)$. Since
the random integrals $\bar\xi_j^{(s)}=\int h_j^{(s)}(x)Z_\mu(\,dx)$,
$1\le j\le n$, $1\le s\le k$, are independent, standard Gaussian
random variables, it is not difficult to see with the help of
It\^o's formula (Theorem~10.3 in this work) that the distributions
of the random polynomial
$P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k)$ and $Z_{\mu,k}(f)$
agree. Here we reformulated Lata{\l}a's estimates about random
polynomials of the form~(13.27) to estimates about Wiener--It\^o
integrals with kernel function of the form~(13.28).
These estimates are equivalent to Lata{\l}a's result if we restrict
our attention to the special class of Wiener--It\^o integrals with
kernel functions of the form~(13.28). But we have formulated our
result for Wiener--It\^o integrals with a general kernel function.
Lata{\l}a's proof heavily exploits the special structure of the
random polynomials given in~(13.27), the independence of the
random variables~$\xi_j^{(s)}$ for different parameters~$s$ in
it. (It would be interesting to find a proof which does not
exploit this property.) On the other hand, this result can
be generalized to the case discussed in Theorem~13.7. This
generalization can be proved by exploiting the theorem of
de la Pe{\~n}a and Montgomery--Smith about the comparison of
$U$-statistics and decoupled $U$-statistics (formulated in
Theorem~14.3 of this work) and the properties of the
Wiener--It\^o integrals. I omit the details of the proof.
Lata{\l}a also proved a converse estimate in~[26] about Gaussian
random polynomials which shows that the
estimates of Theorem~13.7 are sharp. We formulate it in its
original form, i.e. we restrict our attention to the case of
Wiener--It\^o integrals with kernel functions of the form~(13.28).
\medskip\noindent
{\bf Theorem 13.8. (A lower bound about the tail distribution of
Wiener--It\^o integrals).} {\it A random integral $Z_{\mu,k}(f)$
with a kernel function of the form~(13.28) satisfies the
inequalities
$$
E(Z_{\mu,k}(f))^{2M}\ge
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
$$
and
$$
P(|Z_{\mu,k}(f)|>u)\ge \frac1{C(k)}\exp\left\{-C(k)
\min_{1\le s\le k}\left(\frac u{\alpha_s} \right)^{2/s}\right\}
$$
for all $M=1,2,\dots$ and $u>0$ with some universal constant
$C(k)>0$ depending only on the order~$k$ of the integral, where the
quantities $\alpha_s$ are defined in formulas~(13.23) and~(13.24).}
\medskip
Let me finally remark that there is a counterpart of Theorem~13.7
about degenerate $U$-statistics. Adamczak's paper~[1] contains
such a result. Here we do not discuss it, because this result is
far from the main topic of this work. We only remark that some new
quantities have to be introduced to formulate it. The appearance of
these quantities is related to the fact that in an estimate about
the tail-behaviour of a degenerate $U$-statistic we need a bound
not only on the $L_2$-norm but also on the supremum norm of the
kernel function. In a sharp estimate the bound about the supremum
of the kernel function has to be replaced by a more complex system
of conditions, just as the condition about the $L_2$-norm of the
kernel function was replaced by a condition about the quantities
$\alpha_s$, $1\le s\le k$, defined in formulas~(13.23) and~(13.24)
in Theorem~13.7.
\beginsection 14. Reduction of the main result in this work.
The main result of this work is Theorem 8.4 or its multiple integral
version Theorem~8.2. It was shown in Section~9 that Theorem 8.2
follows from Theorem~8.4. Hence it is enough to prove Theorem~8.4.
It may be useful to study this problem together with its multiple
Wiener--It\^o integral version, Theorem~8.6.
Theorems~8.6 and~8.4 will be proved similarly to their one-variate
versions, Theorems~4.2 and~4.1. Theorem~8.6 will be proved with
the help of~Theorem~8.5 about the estimation of the tail
distribution of multiple Wiener--It\^o integrals. A natural
modification of the chaining argument applied in the proof of
Theorem~4.2 works also in this case. No new difficulties arise. On
the other hand, in the proof of Theorem~8.4 several new
difficulties have to be overcome. I start with the proof of
Theorem~8.6.
\medskip\noindent
{\it Proof of Theorem 8.6.}\/ Fix a number $0<\varepsilon<1$, and
let us list the elements of the countable set
${\Cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$ let us
choose by exploiting the conditions of Theorem~8.6 a set
${\Cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}
\subset{\Cal F}$ of functions with
$m_p\le2D\,2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}$ elements in
such a way that
$\inf\limits_{1\le j\le m_p}\int(f-f_{a(j,p)})^2\,d\mu
\le 2^{-4p-8}\varepsilon^2\sigma^2$ for all $f\in{\Cal F}$, and
beside this let us also put the $p$-th function of the list into this
set, i.e.\ let $f_p\in{\Cal F}_p$. For all indices $a(j,p)$, $p=1,2,\dots$,
$1\le j\le m_p$, choose a predecessor $a(j',p-1)$, $j'=j'(j,p)$,
$1\le j'\le m_{p-1}$, in such a way that the functions
$f_{a(j,p)}$ and
$f_{a(j',p-1)}$ satisfy the relation
$\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu
\le\varepsilon^2\sigma^22^{-4(p+1)}$.
Theorem~8.5 with the choice
$\bar u=\bar u(p)=2^{-(p+1)}\varepsilon u$ and
$\bar\sigma=\bar\sigma(p)=2^{-2p-2}\varepsilon\sigma$ yields
the estimates
$$
\aligned
P(A(j,p))&=P\left(k!|Z_{\mu,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
2^{-(1+p)}\varepsilon u\right)\\
&\le C \exp\left\{-\frac12
\left(\frac{2^{p+1}u}\sigma\right)^{2/k}\right\},
\qquad 1\le j\le m_p,
\endaligned \tag14.1
$$
for all $p=1,2,\dots$, and
$$
\aligned
P(B(s))&=P\left(k!|Z_{\mu,k}(f_{a(s,0)})|
\ge \left(1-\frac \varepsilon2\right)u\right)\le
C\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\},\\
&\qquad 1\le s\le m_0.
\endaligned \tag14.2
$$
Since each $f\in{\Cal F}$ is an element of at least one set
${\Cal F}_p$, $p=0,1,2,\dots$ (we made a construction where
$f_p\in {\Cal F}_p$), the definition of the predecessor of an
index $a(j,p)$ and of the events $A(j,p)$ and $B(s)$ in
formulas~(14.1) and~(14.2) together with the previous estimates
imply that
$$ \allowdisplaybreaks
\align
&P\left(\sup_{f\in{\Cal F}}k!|Z_{\mu,k}(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \tag14.3 \\
&\qquad \le \sum_{p=1}^{\infty} 2CD2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}
\sigma\right)^{2/k} \right\}\\
&\qquad\qquad +2^{1+4L}CD\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}.
\endalign
$$
Standard calculation shows that if
$u\ge ML^{k/2}\frac1\varepsilon\log^{k/2}\frac2\varepsilon
\cdot\sigma\log^{k/2}\frac2\sigma$
with a sufficiently large constant~$M$, then the inequalities
$$
2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}\sigma\right)^{2/k}
\right\}\le
2^{-p}\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
2^{4L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}\le
\exp\left\{-\frac12\left(\frac{\left(1-\varepsilon\right)u}
\sigma\right)^{2/k}\right\}.
$$
These inequalities together with relation (14.3) imply
relation (8.15). Theorem~8.6 is proved.
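The `standard calculation' invoked above can be spot-checked numerically. The sketch below (with arbitrary illustrative parameters $k$, $L$, $\varepsilon$, $\sigma$ and a value of~$u$ large enough for these choices) verifies that each level-$p$ term is indeed dominated by $2^{-p}$ times the leading exponential:

```python
from math import exp

# Spot-check (assumed sample parameters) of the inequality used to sum
# the series in (14.3): for u large enough the level-p term
#   2^{(2p+4)L} eps^{-L} sigma^{-L} exp{-(1/2)(2^{p+1} u / sigma)^{2/k}}
# is at most 2^{-p} exp{-(1/2)((1-eps) u / sigma)^{2/k}}.
k, L, eps, sigma, u = 2, 2.0, 0.5, 0.1, 25.0
ok = all(
    2 ** ((2 * p + 4) * L) * eps ** -L * sigma ** -L
    * exp(-0.5 * (2 ** (p + 1) * u / sigma) ** (2 / k))
    <= 2 ** -p * exp(-0.5 * ((1 - eps) * u / sigma) ** (2 / k))
    for p in range(1, 40)
)
assert ok
```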
\medskip
The proof of Theorem~8.4 is harder. In this case the chaining
argument in itself does not supply the proof, since Theorem~8.3
gives a good estimate about the distribution of a degenerate
$U$-statistic only if it has a not too small variance. The same
difficulty appeared in the proof of Theorem~4.1, and the method
applied in that case will be adapted to the present situation.
A multivariate version of Proposition~6.1 will be proved in
Proposition~14.1, and another result which can be considered as
a multidimensional version of Proposition~6.2 will be formulated
in Proposition~14.2. It will be shown that Theorem~8.4 follows
from Propositions~14.1 and~14.2. Most steps of these proofs can
be considered as a simple repetition of the corresponding
arguments in the proof of the results in Section~6. Nevertheless,
I wrote them down for the sake of completeness.
\medskip
The result formulated in Proposition~14.1 can be proved in almost
the same way as its one-variate version, Proposition~6.1. The only
essential difference is that now we apply a multivariate version
of the Bernstein inequality given in the Corollary of Theorem~8.3.
In the calculations of the
proof of Proposition~14.1 the term $(\frac u\sigma)^{2/k}$ shows a
behaviour similar to the term $(\frac u\sigma)^2$ in Proposition~6.1.
Proposition~14.1 contains the information we can get by applying
Theorem~8.3 together with the chaining argument. Its main content,
inequality~(14.4), yields a good estimate on the supremum of
degenerate $U$-statistics if it is taken for an
appropriate finite subclass ${\Cal F}_{\bar\sigma}$ of the original
class of kernel functions~${\Cal F}$. The class of kernel functions
${\Cal F}_{\bar\sigma}$ is a relatively dense subclass of ${\Cal F}$ in
the $L_2$ norm. Proposition~14.1 also provides some useful estimates
on the value of the parameter~$\bar\sigma$ which describes how
dense the class of functions ${\Cal F}_{\bar\sigma}$ is in ${\Cal F}$.
\medskip\noindent
{\bf Proposition 14.1.} {\it Let the $k$-fold power
$(X^k,{\Cal X}^k)$ of a measurable space $(X,{\Cal X})$ be given
together with some probability measure $\mu$ on $(X,{\Cal X})$
and a countable, $L_2$-dense class ${\Cal F}$ of functions
$f(x_1,\dots,x_k)$ of~$k$ variables with some exponent~$L\ge1$
and parameter~$D\ge1$ with respect to the measure~$\mu$ on the
product space $(X^k,{\Cal X}^k)$ which has the following
properties. All functions $f\in{\Cal F}$ are canonical with
respect to the measure~$\mu$, and they satisfy
conditions~(8.4) and~(8.5) with some real number
$0<\sigma\le1$. Take a sequence of independent, $\mu$-distributed
random variables $\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and
consider the (degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in {\Cal F}$, defined in formula~(8.7), and fix some
number $\bar A=\bar A_k\ge2^k$.
There is a number $M=M(\bar A,k)$ such that for all
numbers~$u>0$ for which the inequality
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}
\ge M(L\log\frac2\sigma+\log D)$ holds, a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$,
and a collection of functions
${\Cal F}_{\bar\sigma}={\Cal F}_{\bar\sigma(u)}
=\{f_1,\dots,f_m\}\subset{\Cal F}$ with $m\le D\bar\sigma^{-L}$
elements can be chosen in such a way that the sets
${\Cal D}_j=\{f\colon\, f\in {\Cal F},\; \int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, satisfy the relation
${\Cal F}=\bigcup\limits_{j=1}^m{\Cal D}_j$, and for the
(degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in{\Cal F}_{\bar\sigma(u)}$, the inequality
$$ \allowdisplaybreaks
\align
P&\left(\sup_{f\in{\Cal F}_{\bar\sigma(u)}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar A}\right)\le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\} \\
&\qquad \qquad \text{if}\quad n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M\left(L\log\frac2\sigma+\log D\right) \tag14.4
\endalign
$$
holds with the constants $\alpha=\alpha(k)$, $C=C(k)$ appearing in
formula~($8.10'$) of the Corollary of Theorem~8.3 and the exponent
$L$ and parameter $D$ of the $L_2$-dense class ${\Cal F}$. Beside
this, also the inequality
$4\left(\frac u{\bar A\bar\sigma}\right)^{2/k}\ge
n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$
holds for this number $\bar\sigma=\bar\sigma(u)$. If the
number~$u$ satisfies also the inequality
$$
n\sigma^2\ge \left(\frac u\sigma\right)^{2/k}\ge
M(L^{3/2}\log\frac2\sigma +(\log D)^{3/2}) \tag14.5
$$
with a sufficiently large number $M=M(\bar A,k)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too.}
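Before turning to the proof, the choice of the parameters $R$ and $\bar\sigma$ made in it can be illustrated numerically. The sketch below picks arbitrary admissible values of $n$, $\sigma$, $u$, $\bar A$ for $k=2$ (assumed toy data, not from the text), selects $R$ as the largest integer satisfying the two-sided inequality appearing in the proof, and checks the claimed two-sided bound on $n\bar\sigma^2$:

```python
# Sketch with assumed sample parameters: choose the integer R as in the
# proof of Proposition 14.1 and verify the stated bounds
#   (1/64) (u / (bar_A sigma))^{2/k} <= n bar_sigma^2
#                                    <= 4 (u / (bar_A bar_sigma))^{2/k}.
k, n, sigma, u, bar_A = 2, 10_000, 0.5, 100.0, 4.0
T = (u / (bar_A * sigma)) ** (2 / k)
# admissibility assumed in the proposition:
assert n * sigma ** 2 >= (u / sigma) ** (2 / k) and bar_A >= 2 ** k

# the largest R with 2^{(4+2/k)R} * T <= 2^{2+6/k} * n * sigma^2
R = 1
while 2 ** ((4 + 2 / k) * (R + 1)) * T <= 2 ** (2 + 6 / k) * n * sigma ** 2:
    R += 1
bar_sigma2 = 2 ** (-4 * R) * sigma ** 2

assert n * bar_sigma2 >= T / 64
assert n * bar_sigma2 <= 4 * (u / (bar_A * bar_sigma2 ** 0.5)) ** (2 / k)
```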
\medskip\noindent
{\it Proof of Proposition 14.1.} Let us list the elements of the
countable set ${\Cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$
let us choose, by exploiting the $L_2$-density property of the class
${\Cal F}$, a set ${\Cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset
{\Cal F}$ with $m_p\le D\,2^{2pL}\sigma^{-L}$ elements in such a way
that $\inf\limits_{1\le j\le m_p}\int (f-f_{a(j,p)})^2\,d\mu\le
2^{-4p}\sigma^2$ for all $f\in{\Cal F}$.
For all indices $a(j,p)$, $p=1,2,\dots$, $1\le j\le m_p$, choose a
predecessor $a(j',p-1)$, $j'=j'(j,p)$, $1\le j'\le m_{p-1}$, in such a
way that the functions $f_{a(j,p)}$ and $f_{a(j',p-1)}$ satisfy the
relation $\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu\le \sigma^2
2^{-4(p-1)}$. Then the inequalities
$\int\left(\frac{f_{a(j,p)}-f_{a(j',p-1)}}2\right)^2\,d\mu
\le4\sigma^2 2^{-4p}$
and $\sup\limits_{x_j\in X,\,1\le j\le k}\left|
\frac{f_{a(j,p)}(x_1,\dots,x_k)-f_{a(j',p-1)}(x_1,\dots,x_k)}2\right|
\le 1$ hold. The Corollary of Theorem~8.3 yields that
$$
\aligned
P(A(j,p))&=P\left(n^{-k/2}|I_{n,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
\frac{2^{-(1+p)}u}{\bar A}\right)\\
&\le C \exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A
\sigma}\right)^{2/k} \right\}
\quad \text {if}\quad 4n\sigma^2 2^{-4p}\ge\left(\frac{2^{p}u}
{8\bar A\sigma}\right)^{2/k}, \\
&\qquad\qquad 1\le j\le m_p,\; p=1,2,\dots,
\endaligned \tag14.6
$$
and
$$
\align
P(B(s))&=P\left(n^{-k/2}|I_{n,k}(f_{a(s,0)})|
\ge \frac u{2\bar A}\right)\le
C\exp\left\{-\alpha\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\},
\quad 1\le s\le m_0, \\
&\qquad\qquad\qquad\text{if} \quad n\sigma^2\ge \left(\frac u{2\bar
A\sigma}\right)^{2/k}. \tag14.7
\endalign
$$
Introduce an integer $R=R(u)$, $R>0$, which satisfies the relations
$$
2^{(4+{2/k})(R+1)}\left(\frac{u}{\bar A\sigma}\right)^{2/k} \ge
2^{2+6/k}n\sigma^2\ge 2^{(4+2/k)R}
\left(\frac{u}{\bar A\sigma}\right)^{2/k},
$$
and define $\bar\sigma^2=2^{-4R}\sigma^2$ and $\Cal
F_{\bar\sigma}={\Cal F}_R$ (this is the class of functions
${\Cal F}_p$ introduced at the start of the proof with $p=R$).
We defined the number~$R$, analogously to the proof of
Proposition~6.1, as the largest number~$p$ for which the condition
formulated in~(14.6) holds. (As
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}$ and $\bar A\ge2^k$ by
our conditions, there exists such a positive integer $R$.) The
cardinality~$m$ of the set ${\Cal F}_{\bar\sigma}$ is clearly not
greater than $D\bar\sigma^{-L}$, and
$\bigcup\limits_{j=1}^m \Cal D_j={\Cal F}$. Beside this, the number
$R$ was chosen in such a way that the inequalities
(14.6) and (14.7) hold for $1\le p\le R$. Hence the
definition of the predecessor of an index $a(j,p)$ implies that
$$
\align
&P\left(\sup_{f\in{\Cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar
A}\right) \le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&\qquad \le \sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s))
\le \sum_{p=1}^{\infty} CD\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A\sigma}\right)^{2/k}
\right\}\\
&\qquad\qquad +CD\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^{2/k}\right\}.
\endalign
$$
If the condition $\left(\frac u\sigma\right)^{2/k}\ge
M(L\log\frac2\sigma+\log D)$ holds with a sufficiently large
constant $M$ (depending on $\bar A$), then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar
A\sigma}\right)^{2/k} \right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\}\le
\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}.
$$
Hence the previous estimate implies that
$$
\align
&P\left(\sup_{f\in{\Cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|\ge
\frac u{\bar A}\right) \le\sum_{p=1}^{\infty}C 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^{2/k}
\right\}\\
&\qquad +C\exp\left\{-\alpha\left(\frac u{10\bar A
\sigma}\right)^{2/k}\right\} \le 2C\exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^{2/k}\right\},
\endalign
$$
and relation (14.4) holds.
The estimates
$$
\align
\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
&\le2^{-2-6/k}2^{2R/k}\left(\frac u{\bar A\sigma}\right)^{2/k}
=2^{-4R}\cdot2^{(4+2/k)R-2-6/k}\left(\frac{u}
{\bar A\sigma}\right)^{2/k}\\
&\le n\bar\sigma^2=2^{-4R} n\sigma^2\le
2^{-4R}\cdot2^{(4+2/k)(R+1)-2-6/k}
\left(\frac{u}{\bar A\sigma}\right)^{2/k}\\
&=2^{2-4/k}\cdot 2^{2R/k}\left(\frac{u}{\bar A \sigma}\right)^{2/k}
=2^{2-4/k}\cdot2^{-2R/k} \left(\frac{u}
{\bar A\bar\sigma}\right)^{2/k}
\le 4 \left(\frac{u}{\bar A\bar\sigma}\right)^{2/k}
\endalign
$$
hold because of the relation~$R\ge1$. This means that
$n\bar\sigma^2$
has the upper and lower bounds formulated in Proposition~14.1.
It remains to show that $n\bar\sigma^2\ge
L\log n+\log D$ if relation~(14.5) holds.
This inequality clearly holds under the conditions of
Proposition~14.1
if $\sigma\le n^{-1/3}$, since in this case $\log\frac2\sigma\ge
\frac{\log n}3$, and
$n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
\ge\frac1{64}\bar A^{-2/k}
M(L^{3/2}\log\frac2\sigma +(\log D)^{3/2})\ge
\frac1{192}\bar A^{-2/k} M(L^{3/2}\log n+(\log D)^{3/2})
\ge L\log n+\log D$ if $M= M(\bar A,k)$ is sufficiently large.
If $\sigma\ge n^{-1/3}$, then the inequality $2^{(4+2/k)R}
\left(\frac u{\bar A\sigma}\right)^{2/k} \le2^{2+6/k} n\sigma^2$
can be applied. This
implies that $2^{-4R}\ge2^{-4(2+6/k)/(4+2/k)}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^{2/k}}
{n\sigma^2}\right]^{4/(4+2/k)}$, and
$$
n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma
\text{ with } \gamma=\frac4{4+\frac2k}\ge\frac23.
$$
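Let us note that the identity
$$
1-\gamma=1-\frac4{4+\frac2k}=\frac{\frac2k}{4+\frac2k}
=\frac2{4k+2}=\frac1{2k+1}
$$
holds for the exponent $\gamma$ introduced above; this explains the
appearance of the power $(n\sigma^2)^{1/(2k+1)}$ in the subsequent
calculation.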
The inequalities $n\sigma^2\ge n^{1/3}$ and
$n\sigma^2\ge(\frac u\sigma)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold
(since $\log\frac2\sigma\ge\frac12$). They yield that for
sufficiently large $M=M(\bar A,k)$
$(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma\ge
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3}=
(n\sigma^2)^{1/(2k+1)}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3}$, and
$$
\align
n\bar\sigma^2&\ge \frac{\bar A^{-4/3}}{50}
(n\sigma^2)^{1/(2k+1)}\left[\left(\frac
u\sigma\right)^{2/k}\right]^{2/3}\\
&\ge \frac{\bar A^{-4/3}}{50}n^{1/(3(2k+1))}
\left(\frac M2\right)^{2/3} (L^{3/2}+(\log D)^{3/2})^{2/3}
\ge L\log n+\log D.
\endalign
$$
\medskip
A multivariate analog of Proposition~6.2 is formulated in
Proposition~14.2, and it will be shown that Propositions~14.1
and~14.2 imply Theorem~8.4.
\medskip\noindent
{\bf Proposition 14.2.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\Cal X})$ together with a sequence
of independent and $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ and a countable $L_2$-dense class ${\Cal F}$ of
canonical (with respect to the measure~$\mu$) kernel functions
$f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$ and exponent
$L\ge1$ on the product space $(X^k,{\Cal X}^k)$. Let all functions
$f\in{\Cal F}$ satisfy conditions~(8.1) and~(8.2) with some
$0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$. Let us consider
the (degenerate) $U$-statistics $I_{n,k}(f)$ with the random
sequence $\xi_1,\dots,\xi_n$, $n\ge\max(2,k)$, and kernel
functions $f\in{\Cal F}$. There exists a threshold index
$A_0=A_0(k)>0$ and two numbers $\bar C=\bar C(k)>0$ and
$\gamma=\gamma(k)>0$ depending only on the order $k$ of the
$U$-statistics such that the degenerate $U$-statistics
$I_{n,k}(f)$, $f\in{\Cal F}$, satisfy the inequality
$$
P\left(\sup_{f\in{\Cal F}}|n^{-k/2}I_{n,k}(f)|
\ge A n^{k/2}\sigma^{k+1}\right)
\le \bar C e^{-\gamma A^{1/2k}n\sigma^2}\quad \text{if } A\ge A_0.
\tag14.8
$$
}
\medskip
Proposition~14.2 yields an estimate for the tail distribution of
the supremum of degenerate $U$-statistics at level
$u\ge A_0n^{k/2}\sigma^{k+1}$, i.e. in the case when Theorem~8.3
does not give a good estimate on the tail distribution of the
individual degenerate $U$-statistics taking part in the supremum on
the left-hand side of~(14.8).
Formula (8.11) will be proved by means of Proposition~14.1 with an
appropriate choice of the parameter $\bar A$ in it and
Proposition~14.2 with the choice $\sigma=\bar\sigma=\bar\sigma(u)$
and the classes of functions
${\Cal F}_j=\left\{\frac{g-f_j}2\colon\; g\in\Cal D_j\right\}$
with the number $\bar\sigma$, functions~$f_j$ and sets of
functions~$\Cal D_j$, $1\le j\le m$, introduced in Proposition~14.1.
Clearly,
$$
\aligned
&P\left(\sup\limits_{f\in{\Cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\le
P\left(\sup_{f\in{\Cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar A}\right) \\
&\qquad\qquad +\sum_{j=1}^m P\left(\sup_{g\in\Cal D_j} n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar A}\right)u\right),
\endaligned \tag14.9
$$
where $m$ is the cardinality of the set of functions $\Cal
F_{\bar\sigma}$ appearing in Proposition~14.1.
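Let me indicate why relation~(14.9) holds. If $f\in{\Cal F}$, then
$f\in\Cal D_j$ for some $1\le j\le m$, and the linearity of the
$U$-statistic $I_{n,k}(f)$ in its kernel function~$f$ yields that
$$
n^{-k/2}|I_{n,k}(f)|\le n^{-k/2}|I_{n,k}(f_j)|
+2n^{-k/2}\left|I_{n,k}\left(\frac{f_j-f}2\right)\right|,
$$
where $f_j\in{\Cal F}_{\bar\sigma}$ is the function corresponding to
the set $\Cal D_j$. Since $\frac u{\bar A}
+2\left(\frac12-\frac1{2\bar A}\right)u=u$, the event at the
left-hand side of~(14.9) can occur only if one of the events at its
right-hand side occurs.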
We shall estimate the two terms of the sum at the right-hand side
of~(14.9) by means of Propositions~14.1 and~14.2 with a good choice
of the parameters $\bar A$ and the corresponding $M=M(\bar A)$ in
Proposition~14.1 together with a parameter $A\ge A_0$ in
Proposition~14.2.
We shall choose the parameter~$A\ge A_0$ in the application of
Proposition~14.2 so that it satisfies also the relation
$\gamma A^{1/2k}\ge2$ with the
number~$\gamma$ appearing in Proposition~14.2, hence we put
$A=\max(A_0,(\frac2\gamma)^{2k})$. After this choice we want to
define the parameter $\bar A$ in Proposition~14.1 in such a way
that the numbers~$u$ satisfying the conditions of Proposition~14.1
also satisfy the relation
$(\frac12-\frac1{2\bar A})u\ge An^{k/2}\bar\sigma^{k+1}$ with
the already fixed number~$A$. This inequality can be rewritten
in the form $A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}
(\frac u{\bar\sigma})^{2/k}\ge n\bar\sigma^2$. On the other hand,
under the conditions of Proposition~14.1 the inequality
$4(\frac u{\bar A\bar\sigma})^{2/k}\ge n\bar\sigma^2$ holds.
Hence the desired inequality holds if
$A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}\ge 4{\bar A}^{-2/k}$.
Thus the number $\bar A=2^{k+1}A+1$ is an appropriate choice.
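Indeed, with the choice $\bar A=2^{k+1}A+1$ we have
$\bar A-1=2^{k+1}A$, and
$$
A^{-2/k}\left(\frac12-\frac1{2\bar A}\right)^{2/k}
=A^{-2/k}\left(\frac{\bar A-1}{2\bar A}\right)^{2/k}
=A^{-2/k}\left(\frac{2^kA}{\bar A}\right)^{2/k}
=4\bar A^{-2/k},
$$
i.e.\ the desired inequality holds even with equality.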
With such a choice of $\bar A$ (together with the corresponding
$M=M(\bar A,k)$) and $A$ we can write
$$
\align
&P\left(\sup_{g\in\Cal D_j} n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar
A}\right)u\right)\\
&\qquad\le P\left(\sup_{g\in\Cal D_j}n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge An^{k/2}\bar\sigma^{k+1}\right)
\le \bar Ce^{-\gamma A^{1/2k}n\bar\sigma^2}
\endalign
$$
for all $1\le j\le m$.
(Observe that the set of functions $\frac{f_j-g}2,\;g\in\Cal D_j$, is
an $L_2$-dense class with parameter $D$ and exponent $L$.) Hence
Proposition~14.1 (relation (14.4) together with the inequality $m\le
D\bar \sigma^{-L}$) and formula (14.8) with our $A\ge A_0$ and
relation~(14.9) imply that
$$
P\left(\sup\limits_{f\in{\Cal F}} n^{-k/2}|I_{n,k}(f)|\ge u\right)
\le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}
+\bar CD\bar\sigma^{-L} e^{-\gamma A^{1/2k}n\bar\sigma^2}.
\tag14.10
$$
We show by repeating an argument given in Section~6 that
$D\bar\sigma^{-L}\le e^{n\bar\sigma^2}$. Indeed, we have to show
that $\log D+L\log\frac1{\bar\sigma}\le n\bar\sigma^2$. But, as we
have seen, the relation $n\bar\sigma^2\ge L\log n+\log D$ with
$L\ge1$ and $D\ge1$ implies that $n\bar\sigma^2\ge\log n$, hence
$\log\frac1{\bar\sigma}\le\log n$, and
$\log D+L\log\frac1{\bar\sigma}\le \log D+L\log n\le n\bar\sigma^2$.
On the other hand, $\gamma A^{1/2k}\ge2$ by the definition of
the number~$A$, and by the estimates of Proposition~14.1
$n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$.
The above relations imply that
$D\bar\sigma^{-L} e^{-\gamma A^{1/2k}n \bar\sigma^2}
\le e^{-\gamma A^{1/2k}n\bar\sigma^2/2}
\le \exp\left\{-\frac\gamma{128} A^{1/2k} \bar A^{-2/k}
\left(\frac u\sigma\right)^{2/k}\right\}$.
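The first of these inequalities can be seen by combining the bound
$D\bar\sigma^{-L}\le e^{n\bar\sigma^2}$ with the relation
$\gamma A^{1/2k}\ge2$, since
$$
D\bar\sigma^{-L} e^{-\gamma A^{1/2k}n\bar\sigma^2}
\le e^{\left(1-\gamma A^{1/2k}\right)n\bar\sigma^2}
\le e^{-\gamma A^{1/2k}n\bar\sigma^2/2},
$$
where the last step exploits that $\gamma A^{1/2k}-1\ge
\frac{\gamma A^{1/2k}}2$ if $\gamma A^{1/2k}\ge2$.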
Hence relation (14.10) yields that
$$
\align
&P\left(\sup\limits_{f\in{\Cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\\
&\qquad\le 2C\exp \left\{-\frac\alpha{(10\bar A)^{2/k}}\left(\frac
u\sigma\right)^{2/k}\right\} +\bar C\exp\left\{-\frac\gamma{128}
A^{1/2k} \bar A^{-2/k}
\left(\frac u\sigma\right)^{2/k}\right\},
\endalign
$$
and this estimate implies Theorem~8.4.
\medskip
To complete the proof of Theorem~8.4 we have to prove
Proposition~14.2. It will be proved, similarly to its one-variate
version Proposition~6.2, by means of a symmetrization argument.
We want to find its right formulation. It would be natural to
formulate it as a result about the supremum of degenerate
$U$-statistics. However, we shall choose a slightly different
approach. There is a notion called decoupled $U$-statistic.
Decoupled $U$-statistics behave similarly to $U$-statistics, but
it is simpler to work with them, because they have more
independence properties. It turned out to be useful to introduce
this notion and to apply a result of de la Pe\~na and
Montgomery--Smith which enables us to reduce the estimation of
$U$-statistics to the estimation of decoupled $U$-statistics,
and to work out the symmetrization argument for decoupled
$U$-statistics.
Next we introduce the notion of decoupled $U$-statistics
together with their randomized version. We also formulate a
result of de la Pe\~na and Montgomery--Smith in Theorem~14.3
which enables us to reduce Proposition~14.2 to a version of it,
presented in Proposition~$14.2'$. It states a result similar
to Proposition~14.2 about decoupled $U$-statistics. The proof of
Proposition~$14.2'$ is the hardest part of the problem. In
Sections~15, 16 and~17 we deal essentially with this problem.
The result of de la Pe\~na and Montgomery--Smith will be
proved in Appendix~D.
Now we introduce the following notions.
\medskip\noindent
{\bf The definition of decoupled and randomized
decoupled $U$-statistics.} {\it Let us have $k$ independent
copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of a sequence
$\xi_1,\dots,\xi_n$ of independent and identically distributed
random variables taking their values in a measurable space
$(X,{\Cal X})$ together with a measurable function $f(x_1,\dots,x_k)$
on the product space $(X^k,{\Cal X}^k)$ with values in a separable
Banach space. The decoupled $U$-statistic $\bar I_{n,k}(f)$
determined by the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, and kernel function $f$ is defined by the formula
$$
\bar I_{n,k}(f)=\frac1{k!}\sum\limits\Sb (l_1,\dots,l_k)\colon \;
1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
f\left(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)}\right). \tag14.11
$$
Let us have beside the sequences
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and function
$f(x_1,\dots,x_k)$ a sequence of independent random variables
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is independent also of the sequences of
random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$.
The randomized decoupled $U$-statistic $\bar I_{n,k}(f,\varepsilon)$
(depending on the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function $f$ and the randomizing sequence
$\varepsilon_1,\dots,\varepsilon_n$) is defined by the formula
$$
\bar I^\varepsilon_{n,k}(f)=\frac1{k!}\sum\limits\Sb
(l_1,\dots,l_k)\colon \; 1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f\left(\xi_{l_1}^{(1)},
\dots,\xi_{l_k}^{(k)}\right).
\tag14.12
$$
}
\medskip
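To illustrate this definition, let us write down formula~(14.11) in
the special case $k=2$. Then
$$
\bar I_{n,2}(f)=\frac12\sum\limits\Sb (l_1,l_2)\colon\;
1\le l_1,l_2\le n,\\
l_1\neq l_2\endSb f\left(\xi_{l_1}^{(1)},\xi_{l_2}^{(2)}\right),
$$
i.e.\ the decoupled $U$-statistic differs from the ordinary
$U$-statistic $I_{n,2}(f)$ only in that the first and second
arguments of the kernel function are taken from two independent
copies of the underlying random sequence.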
A decoupled or randomized decoupled $U$-statistic (with a real
valued kernel function) will be called degenerate if its kernel
function is canonical. This terminology is in full accordance with
the definition of (usual) degenerate $U$-statistics.
A result of de la Pe\~na and Montgomery--Smith will be formulated
below. It gives an upper bound for the tail distribution of a
$U$-statistic by means of the tail distribution of an appropriate
decoupled $U$-statistic. It also has a generalization, where the
supremum of $U$-statistics is bounded by the supremum of decoupled
$U$-statistics. It enables us to reduce Proposition~14.2 to a
version formulated in Proposition~$14.2'$, which gives a bound on the
tail distribution of the supremum of decoupled $U$-statistics.
It is simpler to prove this result than the original one.
Before the formulation of the theorem of de la Pe\~na and
Montgomery--Smith I make some remarks about it. It considers
more general $U$-statistics with kernel functions taking values
in a separable Banach space, and it compares the norm of
Banach space valued $U$-statistics and decoupled $U$-statistics.
(Decoupled $U$-statistics were defined with general Banach space
valued kernel functions, and the definition of $U$-statistics can
also be generalized to separable Banach space valued kernel
functions in a natural way.) This result was formulated in such
a general form for a special reason. This helped to derive
formula~(14.14) of the subsequent theorem from formula~(14.13).
In the proof of formula~(14.14) one can exploit that the
constants in the estimate~(14.13) do not depend on the Banach
space where the kernel function~$f$ takes its values.
\medskip\noindent
{\bf Theorem 14.3. (Theorem of de la Pe\~na and Montgomery--Smith
about the comparison of $U$-statistics and decoupled
$U$-statistics).} {\it Let us consider a sequence of independent
and identically distributed random variables $\xi_1,\dots,\xi_n$
with values in a measurable space $(X,{\Cal X})$ together with $k$
independent copies $\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
of this sequence. Let us also have a function $f(x_1,\dots,x_k)$ on
the $k$-fold product space $(X^k,{\Cal X}^k)$ which takes its values
in a separable Banach space~$B$. Let us take the $U$-statistic and
decoupled $U$-statistic $I_{n,k}(f)$ and $\bar I_{n,k}(f)$ with
the help of the above random sequences $\xi_1,\dots,\xi_n$,
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and kernel
function~$f$. There exist some constants $\bar C=\bar C(k)>0$
and $\gamma=\gamma(k)>0$ depending only on the order~$k$ of the
$U$-statistic such that
$$
P\left(\|I_{n,k}(f)\|>u\right)
\le\bar CP\left(\|\bar I_{n,k}(f)\|>\gamma u\right)
\tag14.13
$$
for all $u>0$. Here $\|\cdot\|$ denotes the norm in the Banach
space~$B$ where the function~$f$ takes its values.
More generally, if we have a countable sequence of functions
$f_s$, $s=1,2,\dots$, taking their values in the same separable
Banach space, then
$$
P\left(\sup_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|>u\right)\le
\bar CP\left(\sup_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|
>\gamma u\right). \tag14.14
$$
}
\medskip
Now I formulate the following version of Proposition~14.2.
\medskip\noindent
{\bf Proposition $14.2'$.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\Cal X})$ together with a sequence
of independent and $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and a countable $L_2$-dense
class ${\Cal F}$ of canonical (with respect to the measure~$\mu$)
kernel functions $f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$
and exponent $L\ge1$ on the product space $(X^k,{\Cal X}^k)$. Let
all functions $f\in{\Cal F}$ satisfy conditions~(8.1) and~(8.2)
with some $0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$.
Let us take $k$ independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, of the random sequence $\xi_1,\dots,\xi_n$, and
consider the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\Cal F}$, defined with their help in formula~(14.11).
There exists a threshold index $A_0=A_0(k)>0$ depending only on
the order $k$ of the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\Cal F}$, such that the (degenerate) decoupled
$U$-statistics $\bar I_{n,k}(f)$, $f\in{\Cal F}$, satisfy the
following version of inequality (14.8):
$$
P\left(\sup_{f\in{\Cal F}}n^{-k/2}|\bar I_{n,k}(f)|
\ge An^{k/2}\sigma^{k+1}\right)
\le e^{-2^{-(1/2+1/2k)} A^{1/2k}n\sigma^2}\quad \text{if } A\ge A_0.
\tag14.15
$$
}
\medskip
It is clear that Proposition~$14.2'$ and Theorem~14.3, more
explicitly formula~(14.14) in it, imply Proposition 14.2. Hence the
proof of Theorem~8.4 was reduced to Proposition~$14.2'$ in this
section. The proof of Proposition~$14.2'$ is based on a symmetrization
argument. Its main ideas will be explained in the next section.
\beginsection 15. The strategy of the proof for the main result of
this work.
In the previous section the proof of Theorem~8.4 was reduced to
that of Proposition~$14.2'$. Proposition~$14.2'$ is a multivariate
version of Proposition~6.2, and its proof is based on similar
ideas. An important step in the proof of Proposition~6.2 was a
symmetrization argument in which we reduced the estimation of
the probability $P\left(\sup\limits_{f\in{\Cal F}}
\sum\limits_{j=1}^nf(\xi_j)>u\right)$
to the estimation of the probability
$P\left(\sup\limits_{f\in{\Cal F}}
\sum\limits_{j=1}^n\varepsilon_jf(\xi_j)>\frac u3\right)$, where
$\xi_1,\dots,\xi_n$ is a sequence of independent and identically
distributed random variables, and $\varepsilon_j$, $1\le j\le n$, is a
sequence of independent random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, independent of the
sequence~$\xi_j$. Let us understand how to formulate the
corresponding symmetrization argument in the proof of
Proposition~$14.2'$ and how to prove it.
The symmetrization argument applied in the proof of Proposition~6.2
was carried out in two steps. We took a copy $\xi_1',\dots,\xi'_n$
of the sequence $\xi_1,\dots,\xi_n$, i.e. a sequence of independent
random variables which is independent also of the original
sequence $\xi_1,\dots,\xi_n$, and has the same distribution. In the
first step we compared the tail distribution of the expression
$\sup\limits_{f\in{\Cal F}}\sum\limits_{j=1}^n[f(\xi_j)-f(\xi'_j)]$
with that of $\sup\limits_{f\in{\Cal F}}\sum\limits_{j=1}^nf(\xi_j)$.
This was done with the help of Lemma~7.1. In the second step, in
Lemma~7.2, we proved a `randomization argument' which stated that
the distribution of the random fields
$\sum\limits_{j=1}^n[f(\xi_j)-f(\xi_j')]$ and
$\sum\limits_{j=1}^n\varepsilon_j[f(\xi_j)-f(\xi_j')]$,
$f\in{\Cal F}$, agree. The symmetrization argument was proved
with the help of these two observations.
In the proof of Proposition~$14.2'$ we would like to reduce the
estimation of the tail distribution of the supremum of decoupled
$U$-statistics $\sup\limits_{f\in{\Cal F}}\bar I_{n,k}(f)$ defined
in formula~(14.11) to the estimation of the tail distribution of
the supremum of randomized decoupled $U$-statistics
$\sup\limits_{f\in{\Cal F}}\bar I_{n,k}^\varepsilon(f)$ defined
in formula~(14.12)
by means of a similar argument. To do this first we have to
understand what kind of random fields should be introduced
instead of $\sum\limits_{j=1}^n[f(\xi_j)-f(\xi'_j)]$, $f\in{\Cal F}$,
in the new case. In formula~(15.1) we shall define such a random
field. Its definition is somewhat reminiscent of the definition of
Stieltjes measures. In Lemma~15.1 we will show that a version
of the `randomization argument' of Lemma~7.2 can be applied when
we are working with this random field.
The adaptation of the first step of the symmetrization argument
in the proof of Proposition~6.2 to the present case is much harder.
The proof of Proposition~6.2 was based on the symmetrization lemma,
Lemma~7.1, which does not work in the present case. Hence we shall
prove a generalization of this result in Lemma~15.2. The proof of
the symmetrization argument is difficult even with the help of this
result. The hardest part of our problem appears at this point. I
return to this point after the formulation of Lemma~15.2.
To formulate Lemma~15.1 needed in our proof we introduce some
notations.
Let ${\Cal V}_k$ denote the set of all sequences $(v(1),\dots,v(k))$
of length~$k$ such that $v(j)=+1$ or $v(j)=-1$ for all $1\le j\le k$.
Let $m(v)$, $v=(v(1),\dots,v(k))\in{\Cal V}_k$, denote the number of
digits $-1$ in the sequence $v$. Let a (real valued) function
$f(x_1,\dots,x_k)$ of $k$ variables be given on a measurable space
$(X,{\Cal X})$ together with a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(X,{\Cal X})$ and $2k$ independent copies
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, $1\le j\le k$, of this
sequence. Let us have beside them another sequence
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, of
independent random variables, also independent of all previously
introduced random variables. With the help of the above
quantities we introduce the random variables
$$
\tilde I_{n,k}(f)=\frac1{k!}\sum_{v\in {\Cal V}_k}
(-1)^{m(v)} \sum\limits\Sb (l_1,\dots,l_k)\colon\; 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'} \text{ if } r\neq r'\endSb
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right) \tag15.1
$$
and
$$
\tilde I^\varepsilon_{n,k}(f)=\frac1{k!}\sum_{v\in {\Cal V}_k}
(-1)^{m(v)} \sum\limits\Sb (l_1,\dots,l_k)\colon\; 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'}
\text{ if } r\neq r'\endSb \varepsilon_{l_1}\cdots\varepsilon_{l_k}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right). \tag15.2
$$
The number $m(v)$ in the above formulas denotes the number of
digits $-1$ in the $\pm1$ sequence $v$ of length~$k$, hence it
counts how many random variables $\xi_{l_j}^{(j,1)}$, $1\le j\le k$,
were replaced by the `secondary copy' $\xi_{l_j}^{(j,-1)}$ for a
$v\in{\Cal V}_k$ in the inner sum in formulas~(15.1) or~(15.2).
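It may be illuminating to consider the simplest case $k=1$, when
${\Cal V}_1=\{(1),(-1)\}$, and formulas~(15.1) and~(15.2) reduce to
$$
\tilde I_{n,1}(f)=\sum_{l=1}^n\left[f\left(\xi_l^{(1,1)}\right)
-f\left(\xi_l^{(1,-1)}\right)\right] \quad\text{and}\quad
\tilde I^\varepsilon_{n,1}(f)=\sum_{l=1}^n\varepsilon_l
\left[f\left(\xi_l^{(1,1)}\right)-f\left(\xi_l^{(1,-1)}\right)\right],
$$
i.e.\ we get back the random sums
$\sum\limits_{j=1}^n[f(\xi_j)-f(\xi_j')]$ and
$\sum\limits_{j=1}^n\varepsilon_j[f(\xi_j)-f(\xi_j')]$ appearing in
the symmetrization argument of the proof of Proposition~6.2.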
The following result holds.
\medskip\noindent
{\bf Lemma 15.1.} {\it Let us consider a (non-empty) class of
functions ${\Cal F}$ of $k$ variables $f(x_1,\dots,x_k)$ on the
space $(X^k,{\Cal X}^k)$ together with the random variables
$\tilde I_{n,k}(f)$ and $\tilde I^\varepsilon_{n,k}(f)$ defined in
formulas~(15.1) and (15.2) for all $f\in {\Cal F}$. The
distributions of the random fields $\tilde I_{n,k}(f)$,
$f\in{\Cal F}$, and $\tilde I^\varepsilon_{n,k}(f)$, $f\in {\Cal F}$,
agree.}
\medskip
Let me recall that we say that the distribution of two random
fields $X(f)$, $f\in{\Cal F}$, and $Y(f)$, $f\in{\Cal F}$,
agree if for any finite set $\{f_1,\dots,f_p\}\subset{\Cal F}$ the
distribution of the random vectors $(X(f_1),\dots,X(f_p))$ and
$(Y(f_1),\dots,Y(f_p))$ agree.
\medskip\noindent
{\it Proof of Lemma 15.1.}\/ I even claim that for any fixed
sequence $u=(u(1),\dots,u(n))$, $u(l)=\pm1$, $1\le l\le n$,
of length~$n$ the conditional distribution of the field
$\tilde I^\varepsilon_{n,k}(f)$, $f\in {\Cal F}$, under the
condition $(\varepsilon_1,\dots,\varepsilon_n)=u=(u(1),\dots,u(n))$
agrees with the distribution of the field of $\tilde I_{n,k}(f)$,
$f\in{\Cal F}$.
Indeed, the random variables $\tilde I_{n,k}(f)$, $f\in{\Cal F}$,
defined in (15.1) are functions of a random vector with coordinates
$(\xi_l^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$,
$1\le l\le n$, $1\le j\le k$, and the distribution of this random
vector does not change if the coordinates
$(\xi_{l}^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
with such indices $(l,j)$ for which $u(l)=-1$ (and the index~$j$
is arbitrary) are replaced by
$(\bar\xi_l^{(j)},\xi_l^{(j)})=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$,
and the coordinates $(\xi_{l}^{(j)},\bar\xi_l^{(j)})$ with such
indices $(l,j)$ for which $u(l)=1$ are not changed. As a
consequence, the random field
$\tilde I_{n,k}(f|u)$, $f\in{\Cal F}$, which we get by replacing the
original vector $(\xi_l^{(j)},\bar\xi_l^{(j)})$, $1\le l\le n$,
$1\le j\le k$, in the definition of the expression
$\tilde I_{n,k}(f)$ in~(15.1) for all $f\in {\Cal F}$ by this
modified vector depending on~$u$, has the same distribution as the
random field $\tilde I_{n,k}(f)$, $f\in{\Cal F}$. On the other hand,
I claim that the distribution of the random field
$\tilde I_{n,k}(f|u)$, $f\in{\Cal F}$, agrees with the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\Cal F}$, defined in~(15.2) under the condition that
$(\varepsilon_1,\dots,\varepsilon_n)=u$ with $u=(u(1),\dots,u(n))$.
To prove the last statement let us observe that the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\Cal F}$, under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$ is the same
as the distribution of the random field we obtain by putting
$\varepsilon_l=u(l)$, $1\le l\le n$, in all coordinates
$\varepsilon_l$ of the random variables
$\tilde I^\varepsilon_{n,k}(f)$. On the other hand, the
random variables we get in such a way agree with the random
variables appearing in the sum defining $\tilde I_{n,k}(f|u)$,
only the terms in this sum are listed in a different order.
Lemma~15.1 is proved.
\medskip
Next we prove the following generalization of Lemma~7.1.
\medskip\noindent
{\bf Lemma 15.2. (Generalized version of the Symmetrization Lemma).}
{\it Let $Z_p$ and $\bar Z_p$, $p=1,2,\dots$, be two sequences of
random variables on a probability space $(\Omega,{\Cal A},P)$. Let a
$\sigma$-algebra ${\Cal B}\subset {\Cal A}$ be given on the probability
space $(\Omega,{\Cal A},P)$ together with a ${\Cal B}$-measurable set
$B$ and two numbers $\alpha>0$ and $\beta>0$ such that the random
variables $Z_p$, $p=1,2,\dots$, are ${\Cal B}$ measurable, and the
inequality
$$
P(|\bar Z_p|\le\alpha|{\Cal B})(\omega)\ge\beta\quad \text{for all }
\,p=1,2,\dots \text{ if }\,\omega\in B \tag15.3
$$
holds.
Then
$$
P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)
\le\frac1\beta P\left(\sup\limits_{1\le
p<\infty}|Z_p-\bar Z_p|>u\right)+(1-P(B))\quad\text{for all } u>0.
\tag15.4
$$
}
\medskip\noindent
{\it Proof of Lemma 15.2.}\/ Put $\tau=\min\{p\colon\; |Z_p|>\alpha+u\}$
if there exists such an index $p\ge1$, and put $\tau=0$ otherwise. Then
$$
\align
P(\{\tau=p\}\cap B)&\le\int_{\{\tau=p\}\cap B}\frac1\beta
P(|\bar Z_p|\le \alpha|{\Cal B})\,dP
=\frac1\beta P(\{\tau=p\}\cap\{|\bar Z_p|\le\alpha\}\cap B)\\
&\le \frac1\beta P(\{\tau=p\}\cap\{|Z_p-\bar Z_p|>u\})
\quad \text{for all } p=1,2,\dots.
\endalign
$$
Hence
$$ \allowdisplaybreaks
\align
&P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)-(1-P(B))\le
P\left(\left\{\sup_{1\le p<\infty}|Z_p|>
\alpha+u\right\}\cap B\right) \\
&\qquad=\sum_{p=1}^\infty P(\{\tau=p\}\cap B)
\le \frac1\beta \sum_{p=1}^\infty P(\{\tau=p\}\cap\{|Z_p-\bar
Z_p|>u\}) \\
&\qquad \le\frac1\beta
P\left(\sup_{1\le p<\infty}|Z_p-\bar Z_p|>u\right).
\endalign
$$
Lemma~15.2 is proved.
\medskip
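Let me illustrate with a simple special case how condition~(15.3)
can be checked. If, for instance, the random variables $\bar Z_p$
are independent of the $\sigma$-algebra ${\Cal B}$, and
$E\bar Z_p=0$, $E\bar Z_p^2\le\frac{\alpha^2}2$ for all
$p=1,2,\dots$, then the Chebyshev inequality yields that
$$
P(|\bar Z_p|\le\alpha|{\Cal B})=P(|\bar Z_p|\le\alpha)
\ge1-\frac{E\bar Z_p^2}{\alpha^2}\ge\frac12,
$$
hence relation~(15.4) holds with the choice $B=\Omega$ and
$\beta=\frac12$.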
To find a symmetrization argument useful in the proof of
Proposition~$14.2'$ we want to bound the probability
$P\left(\sup\limits_{f\in{\Cal F}}|\bar I_{n,k}(f)|>u\right)$ by
$C\cdot P\left(\sup\limits_{f\in{\Cal F}}|\tilde I_{n,k}(f)|>c u\right)$
plus a negligible error term with some appropriate numbers
$C<\infty$ and $c>0$. The following notion will be useful in this
argument.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of decoupled
$U$-statistics.} {\it Choose some real number $T>0$. We say that the
set of decoupled $U$-statistics determined by the class of
functions ${\Cal F}$ has a good tail behaviour at level~$T$ (with
parameters $n$ and $\sigma^2$ which are fixed in the sequel) if
$$
P\left(\sup_{f\in{\Cal F}}|n^{-k/2}\bar I_{n,k}(f)|\ge A
n^{k/2}\sigma^{k+1}\right)
\le \exp\left\{-A^{1/2k}n\sigma^2 \right\}
\quad \text{for all } A>T. \tag15.5
$$
}
\medskip\noindent
{\bf Definition of good tail behaviour for a class of integrals of
decoupled $U$-statistics.} {\it Let us have a product space
$(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ with some product measure
$\mu^k\times\rho$, where $(X^k,{\Cal X}^k,\mu^k)$ is the $k$-fold
product of some probability space $(X,{\Cal X},\mu)$, and $(Y,\Cal
Y,\rho)$ is some other probability space. Fix some positive
integer~$n\ge k$ and a positive number $0<\sigma\le1$, and consider
some countable class ${\Cal F}$ of functions $f(x_1,\dots,x_k,y)$ on
the product space
$(X^k\times Y,{\Cal X}^k\times{\Cal Y},\mu^k\times\rho)$. Take $k$
independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. For all $f\in{\Cal F}$ and $y\in Y$ let us define
the decoupled $U$-statistics $\bar I_{n,k}(f,y)=\bar I_{n,k}(f_y)$
by means of these random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function
$f_y(x_1,\dots,x_k)=f(x_1,\dots,x_k,y)$ and formula~(14.11). Define
with the help of these $U$-statistics $\bar I_{n,k}(f,y)$ the random
integrals
$$
H_{n,k}(f)=\int \bar I_{n,k}(f,y)^2\rho(\,dy), \quad f\in{\Cal F}.
\tag15.6
$$
Choose some real number $T>0$. We say that the set of random
integrals $H_{n,k}(f)$, $f\in{\Cal F}$, has a good tail behaviour at
level $T$ (with parameters $n$ and $\sigma^2$ which we fix in the
sequel) if
$$
P\left(\sup_{f\in{\Cal F}} n^{-k}H_{n,k}(f)
\ge A^2 n^k\sigma^{2k+2}\right)
\le \exp\left\{-A^{1/(2k+1)}n\sigma^2 \right\}
\quad \text{for all } A> T. \tag15.7
$$
}
\medskip
Propositions~15.3 and~15.4 will be formulated with the help of the
above notions.
\medskip\noindent
{\bf Proposition 15.3.} {\it Let us fix a positive
integer~$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a
probability measure $\mu$ on a measurable space $(X,{\Cal X})$
together with two real numbers $L\ge1$ and $D\ge1$ such that
$n\sigma^2\ge L\log n+\log D$. Let us consider those countable
$L_2$-dense classes ${\Cal F}$ of canonical kernel functions
$f=f(x_1,\dots,x_k)$ (with respect to the measure~$\mu$) on the
$k$-fold product space $(X^k,{\Cal X}^k)$ with exponent~$L$
and parameter~$D$ for which all functions $f\in{\Cal F}$ satisfy the
inequalities $\sup\limits_{x_j\in X, 1\le j\le k}
|f(x_1,\dots,x_k)|\le 2^{-(k+1)}$ and $\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$.
There is some real number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\Cal F}$ which satisfy the above conditions
the sets of decoupled $U$-statistics $\bar I_{n,k}(f)$, $
f\in{\Cal F}$, have a good tail behaviour at level~$T^{4/3}$ for
some $T\ge A_0$, then they also have a good tail behaviour at
level~$T$.}
\medskip\noindent
{\bf Proposition 15.4.} {\it Fix some positive integer
$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a product
space $(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ with some product
measure $\mu^k\times\rho$, where $(X^k,{\Cal X}^k,\mu^k)$ is the
$k$-fold product of some probability space $(X,{\Cal X},\mu)$, and
$(Y,{\Cal Y},\rho)$ is some other probability space together with
two real numbers $L\ge1$ and $D\ge1$ such that
$n\sigma^2>L\log n+\log D$ hold.
Let us consider those countable $L_2$-dense classes ${\Cal F}$
consisting of canonical functions $f(x_1,\dots,x_k,y)$ on the
product space $(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ with
exponent $L\ge1$ and parameter $D\ge1$ whose elements
$f\in{\Cal F}$ satisfy the inequalities
$$
\sup\limits_{x_j\in X, 1\le j\le k, y\in Y}|f(x_1,\dots,x_k,y)|\le
2^{-(k+1)} \tag15.8
$$
and
$$
\int f^2(x_1,\dots,x_k,y)\mu(\,dx_1)\dots\mu(\,dx_k)\rho(\,dy)
\le\sigma^2 \quad \text{for all } f\in {\Cal F}. \tag15.9
$$
There exists some number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\Cal F}$ which satisfy the above conditions
the random integrals $H_{n,k}(f)$, $f\in{\Cal F}$, defined
in~(15.6) have a good tail behaviour at level $T^{(2k+1)/2k}$
with some $T\ge A_0$, then they also have a good tail behaviour
at level~$T$.}
\medskip\noindent
{\it Remark:}\/ To complete the formulation of Proposition~15.4 we
still have to clarify when we call a function $f(x_1,\dots,x_k,y)$
defined on the product space
$(X^k\times Y,{\Cal X}^k\times{\Cal Y},\mu^k\times\rho)$ canonical.
Here the definition differs slightly from that given
in formula~(8.8).
We say that a function
$f(x_1,\dots,x_k,y)$ on the product space $(X^k\times Y,\Cal
X^k\times{\Cal Y},\mu^k\times\rho)$ is canonical if
$$
\align
&\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k,y)\mu(\,du)=0\\
&\qquad \qquad \text{for all } 1\le j\le k,\; x_s\in X,
\;s\neq j \text{ and }y\in Y.
\endalign
$$
In this definition we do not require the analogous identity if we
integrate with respect to the variable $y\in Y$ with fixed arguments
$x_j\in X$, $1\le j\le k$.
\medskip
Let me also remark that the estimate (15.7) we have formulated in
the definition of the property `good tail behaviour for a class of
integrals of $U$-statistics' is fairly natural. We have applied the
natural normalization, and with such a normalization it is natural to
expect that the tail behaviour of the distribution of
$\sup\limits_{f\in{\Cal F}}n^{-k}H_{n,k}(f)$ is similar to that of
$\text{const}\,\left(\sigma\eta^k\right)^2$, where $\eta$ is a standard
normal random variable. Formula~(15.7) expresses such a behaviour,
only the power of the number~$A$ in the exponent at the right-hand
side was chosen in a non-optimal way. Formula~(15.5) in the
formulation of the property `good tail behaviour for a class of
decoupled $U$-statistics' has a similar interpretation. It says that
$\sup\limits_{f\in{\Cal F}}|n^{-k/2}\bar I_{n,k}(f)|$ behaves similarly to
$\text{const}\,\sigma|\eta^k|$ with a standard normal random
variable $\eta$.
We wanted to prove the property of good tail behaviour for a class
of integrals of decoupled $U$-statistics under appropriate, not too
restrictive conditions. Let me remark that in Proposition~15.4 we
have imposed beside formula (15.8) a fairly weak condition (15.9)
about the $L_2$-norm of the function~$f$. Most difficulties appear
in the proof, because we did not want to impose more restrictive
conditions.
It is not difficult to derive Proposition~$14.2'$ from
Proposition~15.3. Indeed, let us observe that the set of decoupled
$U$-statistics determined by a class of functions ${\Cal F}$
satisfying the conditions of Proposition~15.3 has a good
tail-behaviour at level $T_0=\sigma^{-(k+1)}$, since under the
conditions of this Proposition the probability at the left-hand
side of (15.5) equals zero for $A>\sigma^{-(k+1)}$. Then we get
from Proposition~15.3 by induction with respect to the number $j$,
that this set of decoupled $U$-statistics has a good tail-behaviour
also for all $T=T_j=T_0^{(3/4)^j}=\sigma^{-(k+1)(3/4)^j}$,
$j=0,1,2,\dots$, with such indices~$j$ for which
$T_j=\sigma^{-(k+1)(3/4)^j}\ge A_0$. This implies that if a class of
functions ${\Cal F}$ satisfies the conditions of Proposition~15.3,
then the set of decoupled $U$-statistics determined by this class
of functions has a good tail-behaviour at level $T=A_0^{4/3}$,
i.e. at a level which depends only on the order~$k$ of the
decoupled $U$-statistics. This result implies Proposition~$14.2'$,
only it has to be applied for the class of functions
${\Cal F}'=\{2^{-(k+1)}f,\; f\in{\Cal F}\}$ instead of the original
class of functions ${\Cal F}$ which appears in Proposition~$14.2'$
with the same parameters~$\sigma$, $L$ and~$D$.
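The recursion behind this inductive argument can be made explicit.
One application of Proposition~15.3 improves the level from
$T^{4/3}$ to~$T$, hence starting from the level
$T_0=\sigma^{-(k+1)}$ we obtain a good tail behaviour at the levels
$$
T_{j+1}=T_j^{3/4}, \qquad \text{i.e. }\;
T_j=T_0^{(3/4)^j}=\sigma^{-(k+1)(3/4)^j}, \qquad j=0,1,2,\dots,
$$
as long as $T_j\ge A_0$. If $J$ denotes the largest such index,
then $T_J^{3/4}<A_0$, hence $A_0\le T_J<A_0^{4/3}$, and a good tail
behaviour at level $T_J$ implies a good tail behaviour at the
larger level $A_0^{4/3}$.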
Similarly to the above argument, an inductive procedure yields the
corollary of Proposition~15.4 formulated below. Actually, it is
this corollary of Proposition~15.4 that we shall need.
\medskip\noindent
{\bf Corollary of Proposition 15.4.} {\it If the class of functions
${\Cal F}$ satisfies the conditions of Proposition~15.4, then there
exists a constant $\bar A_0=\bar A_0(k)>0$ depending only on $k$
such that the class of integrals $H_{n,k}(f)$, $f\in {\Cal F}$,
defined in formula (15.6) has a good tail behaviour at level
$\bar A_0$.}
\medskip
The main difficulty in the proof of Proposition 15.3 arises in the
application of the symmetrization procedure corresponding to
Lemma~7.2 in the one-variate case. This difficulty can be overcome
by means of Proposition~15.4, more precisely by means of its
corollary. It helps us to estimate the conditional variances of
the decoupled $U$-statistics we have to handle in the proof of
Proposition~15.3. The proofs of Propositions~15.3 and~15.4 apply
similar arguments, and the two results will be proved
simultaneously. The
following inductive procedure will be applied in their proof.
First Proposition~15.3 and then Proposition~15.4 is proved for
$k=1$. If Propositions~15.3 and~15.4 are already proved
for all $k'<k$, then they are proved for parameter~$k$. First the
result of the symmetrization argument applied in the proof of
Proposition~15.3 is formulated.
\medskip\noindent
{\bf Lemma 16.1A. (Randomization argument in the proof of
Proposition~15.3).} {\it Let a class of functions ${\Cal F}$
satisfy the conditions of Proposition~15.3, and consider the
decoupled $U$-statistics $\bar I_{n,k}(f)$ and their randomized
versions $\bar I_{n,k}^{\varepsilon}(f)$, $f\in{\Cal F}$,
introduced earlier. Then there exist some constants $A_0=A_0(k)>0$
and $\gamma=\gamma_k>0$
such that the inequality
$$
\aligned
&P\left(\sup_{f\in{\Cal F}} n^{-k/2}\left|\bar
I_{n,k}(f)\right|>An^{k/2}\sigma^{k+1}\right) \\
& \qquad<
2^{k+1}P\left(\sup_{f\in{\Cal F}}n^{-k/2}
\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^{k/2}\sigma^{k+1}\right)
+2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\endaligned \tag16.1
$$
holds for all $A\ge A_0$.}
\medskip
It may be worth remarking that the second term at the right-hand side
of formula~(16.1) yields a small contribution to the upper bound in
this relation because of the condition $n\sigma^2\ge L\log n+\log D$.
To formulate Lemma~16.1B first some new quantities have to be
introduced. Some of them will be used somewhat later. The quantities
$\bar I_{n,k}^V(f,y)$ introduced in the subsequent formula~(16.2)
depend on the sets $V\subset\{1,\dots,k\}$, and they are the
natural modifications of the inner sum terms in formula (15.1).
Such expressions are needed in the formulation of the
symmetrization result applied in the proof of Proposition~15.4.
Their randomized versions $\bar I_{n,k}^{(V,\varepsilon)}(f,y)$,
introduced in formula (16.5), correspond to the inner sum terms in
formula~(15.2). The integrals of these expressions will be also
introduced in formulas~(16.3) and~(16.6).
Let us consider a class ${\Cal F}$ of functions
$f(x_1,\dots,x_k,y)\in {\Cal F}$ on a space $(X^k\times Y, {\Cal X}^k
\times {\Cal Y},\mu^k\times\rho)$ which satisfies the conditions of
Proposition~15.4. Let us take $2k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$\bar\xi_1^{(j)},\dots,\bar\xi_n^{(j)}$, $1\le j\le k$,
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables
$(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous random sequences.
Let us introduce the notation $\xi_l^{(j,1)}=\xi_l^{(j)}$
and $\xi_l^{(j,-1)}=\bar\xi_l^{(j)}$, $1\le l\le n$, $1\le j\le k$.
For all subsets $V\subset\{1,\dots,k\}$ of the set
$\{1,\dots,k\}$ let $|V|$ denote the cardinality of this set,
and define for all functions $f(x_1,\dots,x_k,y)\in {\Cal F}$ and
$V\subset\{1,\dots,k\}$ the decoupled $U$-statistics
$$
\bar I_{n,k}^V(f,y)=\frac1{k!}\sum\limits\Sb (l_1,\dots,l_k)\colon\;
1\le l_j\le n,\;j=1,\dots,k\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,\xi_{l_k}
^{(k,\delta_k(V))},y\right),
\tag16.2
$$
where $\delta_j(V)=\pm1$, $1\le j\le k$, $\delta_j(V)=1$ if $j\in V$,
and $\delta_j(V)=-1$ if $j\notin V$, together with the random
variables
$$
H_{n,k}^V(f)=\int \bar I_{n,k}^V(f,y)^2\rho(\,dy), \quad f\in{\Cal F}.
\tag16.3
$$
We shall consider $\bar I_{n,k}^V(f,y)$ defined in~(16.2) as a random
variable with values in the space $L_2(Y,{\Cal Y},\rho)$.
Put
$$
\bar I_{n,k}(f,y)=\bar I_{n,k}^{\{1,\dots,k\}}(f,y),\quad
H_{n,k}(f)=H_{n,k}^{\{1,\dots,k\}}(f), \tag16.4
$$
i.e. $\bar I_{n,k}(f,y)$ and $H_{n,k}(f)$ are the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ with
$V=\{1,\dots,k\}$, which means that these expressions are defined
with the help of the random variables $\xi^{(j)}_l=\xi_l^{(j,1)}$,
$1\le j\le k$, $1\le l\le n$.
Let us also define the `randomized version' of the random variables
$\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ as
$$
\bar I_{n,k}^{(V,\varepsilon)}(f,y)=\frac1{k!}
\sum\Sb (l_1,\dots,l_k)\colon\;
1\le l_j\le n,\;
j=1,\dots, k\\ l_j\neq l_{j'} \text{ if } j\neq j'\endSb \!\!\!\!\!
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))},y\right),\quad f\in{\Cal F}, \tag16.5
$$
and
$$
H_{n,k}^{(V,\varepsilon)}(f)=\int
\bar I_{n,k}^{(V,\varepsilon)}(f,y)^2\rho(\,dy)
,\quad f\in{\Cal F}, \tag16.6
$$
where $\delta_j(V)=1$ if $j\in V$, and $\delta_j(V)=-1$ if
$j\in\{1,\dots,k\}\setminus V$.
Similarly to formula~(16.2), we shall consider
$\bar I_{n,k}^{(V,\varepsilon)}(f,y)$ defined in~(16.5) as a random
variable with values in the space $L_2(Y,{\Cal Y},\rho)$.
Let us also introduce the random variables
$$
\bar W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}
\bar I_{n,k}^{(V,\varepsilon)}(f,y)\right]^2\rho(\,dy),
\quad f\in{\Cal F}. \tag16.7
$$
With the help of the above notations Lemma~16.1B can be formulated
in the following way.
\medskip\noindent
{\bf Lemma 16.1B. (Randomization argument in the proof of
Proposition~15.4).} {\it Let ${\Cal F}$ be a set of functions on
$(X^k\times Y,{\Cal X}^k\times{\Cal Y})$ which satisfies the conditions
of Proposition~15.4 with some probability measure $\mu^k\times\rho$.
Let us have $2k$ independent copies
$\xi_{1}^{(j,\pm1)},\dots,\xi_{n}^{(j,\pm1)}$, $1\le j\le k$, of a
sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, which is independent also of the previously considered
sequences.
Then there exist some constants $A_0=A_0(k)>0$ and
$\gamma=\gamma_k>0$ such that if the integrals $H_{n,k}(f)$,
$f\in{\Cal F}$, determined by this class of functions ${\Cal F}$ have
a good tail behaviour at level $T^{(2k+1)/2k}$ for some $T\ge A_0$,
(this property was defined in Section~15 in the definition of good
tail behaviour for a class of integrals of decoupled $U$-statistics
before the formulation of Propositions~15.3 and~15.4), then the
inequality
$$
\aligned
P\left(\sup_{f\in{\Cal F}} \left|H_{n,k}(f)\right|
>A^2n^{2k}\sigma^{2(k+1)}\right)
&<2P\left(\sup_{f\in{\Cal F}} \left|\bar W(f)\right|
>\frac{A^2}2 n^{2k}\sigma^{2(k+1)}\right)\\
&\qquad+2^{2k+1}n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\endaligned \tag16.8
$$
holds for all $A\ge T$ with the random variables $H_{n,k}(f)$
introduced in the second identity of relation (16.4) and with
$\bar W(f)$ defined in formula (16.7), provided that $\gamma_k>0$
is a sufficiently small positive number.}
\medskip
A corollary of Lemma~16.1B will be formulated which can be
better applied than the original lemma. Lemma~16.1B is a little bit
inconvenient, because the expression at the right-hand side of
formula (16.8) contains a probability depending on
$\sup\limits_{f\in{\Cal F}}|\bar W(f)|$, and $\bar W(f)$ is a
rather complicated expression. Some new formulas~(16.9) and~(16.10)
will be introduced which enable us to rewrite $\bar W(f)$ in a
slightly simpler form. These formulas yield such a corollary of
Lemma~16.1B
which is more appropriate for our purposes. To work out the details
first some diagrams will be introduced.
Let ${\Cal G}={\Cal G}(k)$ denote the set of all diagrams
consisting of two rows, such that both rows of these diagrams are
the set $\{1,\dots,k\}$, and these diagrams contain some edges
$\{(j_1,j_1'),\dots,(j_s,j_s')\}$, $0\le s\le k$, each connecting a
point (vertex) of the first row with a point (vertex) of the
second row. The vertices $j_1,\dots,j_s$ which are end points of
some edge in the first row are all different, and the same relation
holds also for the vertices $j_1',\dots,j_s'$ in the second row.
Given some diagram $G\in{\Cal G}$
let $e(G)=\{(j_1,j_1'),\dots,(j_s,j_s')\}$ denote the set of its
edges, and let $v_1(G)=\{j_1,\dots,j_s\}$ be the set of those
vertices in the first row and $v_2(G)=\{j_1',\dots,j_s'\}$ the
set of those vertices in the second row of the diagram~$G$ from
which an edge of~$G$ starts.
Given some diagram $G\in {\Cal G}$ and two sets
$V_1,V_2\subset\{1,\dots,k\}$, we define the following random
variables $H_{n,k}(f|G,V_1,V_2)$ with the help of the random
variables $\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$,
$\xi_{1}^{(j,-1)},\dots,\xi_{n}^{(j,-1)}$, $1\le j\le k$, and
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ taking part
in the definition of the random
variables $\bar W(f)$:
$$ \allowdisplaybreaks
\align
H_{n,k}(f|G,V_1,V_2)&=\sum\Sb
(l_1,\dots,l_k,\,l'_1,\dots,l'_k)\colon \\
1\le l_j\le n,\, l_j\neq l_{j'}
\text{ if }j\neq j',\,1\le j,j'\le k,\\
1\le l'_j\le n,\, l'_j\neq l'_{j'}\text { if }
j\neq j',\,1\le j,j'\le
k,\\ l_j=l'_{j'} \text { if } (j,j')\in e(G),\; l_j\neq l'_{j'}
\text { if } (j,j')\notin e(G)\endSb
\!\!\!\!\!\!\!\!\!\!\!\! \prod_{j\in\{1,\dots,k\}
\setminus v_1(G)} \!\!\!\!
\varepsilon_{l_j} \prod_{j\in\{1,\dots,k\}
\setminus v_2(G)} \!\!\!\! \varepsilon_{l'_j} \\
&\qquad\frac1{k!^2} \int
f(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y)\\
& \qquad\qquad f(\xi_{l'_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l'_k}^{(k,\delta_k(V_2))},y)
\rho(\,dy), \tag16.9
\endalign
$$
where $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$. (Let us observe that if the
graph $G$ contains $s$ edges, then the product of the
$\varepsilon$-s in (16.9)
contains $2(k-s)$ terms, and the number of terms in the sum (16.9) is
less than $n^{2k-s}$.) As the Corollary of Lemma~16.1B will indicate,
in the proof of Proposition~15.4 we shall need a good estimate on the
tail distribution of the random variables $H_{n,k}(f|G,V_1,V_2)$
for all $f\in{\Cal F}$ and $G\in{\Cal G}$, $V_1,V_2\subset\{1,\dots,k\}$.
Such an estimate can be obtained by means of Theorem 13.3, the
multivariate version of Hoeffding's inequality. But the estimate we
get in such a way will be rewritten in a form more appropriate for our
inductive procedure. This will be done in the next section.
The identity
$$
\bar W(f)=\sum_{G\in {\Cal G},\, V_1,V_2\subset\{1,\dots,k\}}
(-1)^{|V_1|+|V_2|} H_{n,k}(f|G,V_1,V_2) \tag16.10
$$
will be proved.
To prove this identity let us write first
$$
\bar W(f)=\sum_{V_1,V_2\subset \{1,\dots,k\}} (-1)^{|V_1|+|V_2|}
\int\bar I_{n,k}^{(V_1,\varepsilon)}
(f,y)\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy).
$$
Let us express the products
$\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)\bar I_{n,k}^{(V_2,
\varepsilon)}(f,y)$ by means of formula (16.5).
Let us rewrite
this product as a sum of products of the form
$\frac1{k!^2}\prod\limits_{j=1}^k\varepsilon_{l_j}f(\cdots)
\prod\limits_{j=1}^k\varepsilon_{l_j'}f(\cdots)$ and let us
define the following
partition of the terms in this sum. The elements of this partition
are indexed by the diagrams $G\in {\Cal G}$, and if we take a diagram
$G\in{\Cal G}$ with the set of edges $e(G)=
\{(j_1,j_1'),\dots,(j_s,j_s')\}$, then the term of this sum
determined by the indices $l_1,\dots,l_k,l'_1,\dots,l'_k$
belongs to the element of the partition indexed by this diagram
$G$ if and only if $l_{j_u}=l_{j_u'}'$ for all $1\le u\le s$, and
no other pairs among the indices $l_1,\dots,l_k,l_1',\dots,l'_k$
coincide. Since $\varepsilon_{l_{j_u}}\varepsilon_{l'_{j_u'}}=1$
for all $1\le u\le s$, the set of indices of the remaining random
variables $\varepsilon_{l_j}$ is
$\{l_j\colon\;j\in\{1,\dots,k\}\setminus v_1(G)\}$,
and the set of indices of the remaining random variables
$\varepsilon_{l'_{j'}}$
is $\{l'_{j'}\colon\;j'\in\{1,\dots,k\}\setminus v_2(G)\}$, we get
by integrating the product
$\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)$
with respect to the measure $\rho$ that
$$
\int\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy)
=\sum_{G\in {\Cal G}} H_{n,k}(f|G,V_1,V_2)
$$
for all $V_1,V_2\subset\{1,\dots,k\}$. The last two identities imply
formula (16.10).
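Let us also count the number of terms at the right-hand side
of~(16.10). A diagram $G\in{\Cal G}(k)$ with $s$ edges can be
chosen in $\binom ks^2s!$ ways, since the end points of the edges
can be selected in $\binom ks$ ways in both rows, and they can be
paired in $s!$ ways. Hence
$$
|{\Cal G}(k)|=\sum_{s=0}^k\binom ks^2s!
\le k!\sum_{s=0}^k\binom ks^2=k!\binom{2k}k\le 2^{2k}k!,
$$
and together with the $2^k\cdot2^k$ possible choices of the pair
$(V_1,V_2)$ this gives less than $2^{4k}k!$ terms.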
Since the number of terms in the sum of formula (16.10) is less than
$2^{4k}k!$, this relation implies that Lemma~16.1B has the following
corollary:
\medskip\noindent
{\bf Corollary of Lemma 16.1B. (A simplified version of the
randomization argument of Lemma~16.1B).} {\it Let a set of
functions ${\Cal F}$ satisfy the conditions of Proposition~15.4. Then
there exist some constants $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that if the integrals $H_{n,k}(f)$, $f\in{\Cal F}$, determined
by this class of functions ${\Cal F}$ have a good tail behaviour at
level $T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
$$
\aligned
&P\left(\sup_{f\in{\Cal F}} |H_{n,k}(f)|>A^2n^{2k}
\sigma^{2(k+1)}\right)\\
&\qquad\qquad\le 2\sum_{G\in {\Cal G},\, V_1,V_2\subset\{1,\dots,k\}}
P\left(\sup_{f\in{\Cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right) \\
&\qquad\qquad\qquad +2^{2k+1}n^{k-1}
e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\endaligned \tag16.11
$$
holds with the random variables $H_{n,k}(f)$ and
$H_{n,k}(f|G,V_1,V_2)$
defined in formulas (16.4) and (16.9) for all $A\ge T$.}
\medskip\noindent
In the proof of Lemmas 16.1A and 16.1B the results of the
following Lemmas~16.2A and~16.2B will be applied.
\medskip\noindent
{\bf Lemma 16.2A.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)} \quad \text{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous sequences.
Let ${\Cal F}$ be a class of functions which satisfies the
conditions of Proposition 15.3. Introduce with the help of the above
random variables for all sets $V\subset\{1,\dots,k\}$ and functions
$f\in {\Cal F}$ the decoupled $U$-statistic
$$
\bar I_{n,k}^V(f)=\frac1{k!}\sum\limits\Sb (l_1,\dots,l_k)\colon \;
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,\xi_{l_k}
^{(k,\delta_k(V))}\right) \tag16.12
$$
and its `randomized version'
$$
\bar I_{n,k}^{(V,\varepsilon)}(f)=\frac1{k!}
\sum\Sb (l_1,\dots,l_k)\colon\;
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))}\right), \quad f\in{\Cal F}, \tag$16.12'$
$$
where $\delta_j(V)=\pm1$, and $\delta_j(V)=1$ if $j\in V$, and
$\delta_j(V)=-1$ if $j\in\{1,\dots,k\}\setminus V$.
Then the sets of random variables
$$
S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar I_{n,k}^V(f),
\quad f\in{\Cal F}, \tag16.13
$$
and
$$
\bar S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar
I_{n,k}^{(V,\varepsilon)}(f), \quad f\in{\Cal F}, \tag$16.13'$
$$
have the same joint distribution.}
\medskip\noindent
{\bf Lemma 16.2B.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}\quad \text{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent, $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is also independent of the previous sequences.
Let us consider a class ${\Cal F}$ of functions
$f(x_1,\dots,x_k,y)\in {\Cal F}$ on a space $(X^k\times Y, {\Cal X}^k
\times {\Cal Y},\mu^k\times\rho)$ which satisfies the conditions of
Proposition~15.4. For all functions $f\in {\Cal F}$
and $V\subset\{1,\dots,k\}$ consider the decoupled $U$-statistics
$\bar I_{n,k}^V(f,y)$ defined by formula (16.2) with the help of
the random variables $\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, and define with their help
the random variables
$$
W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar
I_{n,k}^V(f,y)\right]^2\rho(\,dy), \quad f\in{\Cal F}. \tag16.14
$$
Then the random vectors $\{W(f)\colon\; f\in {\Cal F}\}$ defined
in~(16.14) and $\{\bar W(f)\colon\; f\in {\Cal F}\}$ defined
in~(16.7) have the same distribution.}
\medskip\noindent
{\it Proof of Lemmas 16.2A and 16.2B.} Lemma~16.2A actually agrees
with the already proved Lemma~15.1, only the notation is
different. The proof of Lemma~16.2B is very similar to the proof of
Lemma~15.1. It can be shown that even the following stronger statement
holds. For any $\pm1$ sequence $u=(u_1,\dots,u_n)$ of length~$n$ the
conditional distribution of the random field $\bar W(f)$, $f\in{\Cal F}$,
under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u=(u_1,\dots,u_n)$ agrees with
the distribution of the random field $W(f)$, $f\in{\Cal F}$.
To see this relation let us first observe that the conditional
distribution of the field $\bar W(f)$ under this condition agrees
with the distribution of the random field we get by replacing the
random variables $\varepsilon_l$ by $u_l$ for all $1\le l\le n$ in
formulas~(16.5), (16.6) and~(16.7). Beside this, define the vector
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
by the formula
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$
for those indices $(j,l)$ for which $u_l=-1$, and
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
for those indices $(j,l)$ for which $u_l=1$ (independently of the
value of the parameter $j$).
Then the joint distribution of the vectors
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
and $(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
agree. Hence the joint distribution of the random vectors
$\bar I_{n,k}^V(f,y)$, $f\in{\Cal F}$, $V\subset \{1,\dots,k\}$ defined
in~(16.2) and of the random vectors $W(f)$, $f\in{\Cal F}$, defined
in~(16.14) do not change if we replace in their definition the random
variables $\xi^{(j,1)}_l$ and $\xi^{(j,-1)}_l$ by $\xi(u)^{(j,1)}_l$
and $\xi(u)^{(j,-1)}_l$. But the set of random variables $W(f)$,
$f\in{\Cal F}$, obtained in this way agrees with the set of random
variables we introduced to get a set of random variables with the
same distribution as the conditional distribution of $\bar W(f)$,
$f\in {\Cal F}$ under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$. (These
random variables are defined as the square integral of the same sum,
only the terms of this sum are listed in a different order in the
two cases.) These facts imply Lemma~16.2B.
\medskip
In the next step we prove the following Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3A.} {\it Let us consider a class of functions
${\Cal F}$ satisfying the conditions of Proposition 15.3 with
parameter~$k$ together with $2k$ independent copies
$\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n$ and
$\xi^{(j,-1)}_1,\dots,\xi^{(j,-1)}_n$, $1\le j\le k$, of a sequence
of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables $\bar I_{n,k}^V(f)$,
$f\in{\Cal F}$, $V\subset\{1,\dots,k\}$, defined with the help of
these quantities in formula (16.12). Let
${\Cal B}={\Cal B}(\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)};\;1\le j\le k)$
denote the $\sigma$-algebra generated by the random variables
$\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$, $1\le j\le k$, i.e.\ by
the random variables with upper indices of the form $(j,1)$,
$1\le j\le k$. There exists a number $A_0=A_0(k)>0$ such that
for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, the
inequality
$$
P\left(\sup_{f\in{\Cal F}}
\left.E\left(\bar I_{n,k}^V(f)^2\right|{\Cal B}\right)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right)<
n^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\tag16.15
$$
holds with a sufficiently small $\gamma_k>0$ if $A\ge A_0$.}
\medskip\noindent
{\it Proof of Lemma 16.3A.}\/ Let us first consider the case
$V=\emptyset$. In this case the estimate $\left.E\left(\bar
I_{n,k}^\emptyset(f)^2\right|{\Cal B}\right)
=E\left(\bar I_{n,k}^\emptyset(f)^2\right)
\le\frac{n^k}{k!}\sigma^2\le 2^kn^{2k}\sigma^{2k+2}$ holds for all
$f\in{\Cal F}$. In the above calculation it was exploited that the
functions $f\in{\Cal F}$ are canonical, which implies certain
orthogonalities, and beside this the inequality $n\sigma^2\ge\frac12$
holds, because of the relation $n\sigma^2\ge L\log n+\log D$.
The above relations imply that for $V=\emptyset$ the probability at
the left-hand side of (16.15) equals zero if the number $A_0$ is
chosen sufficiently large. Hence inequality~(16.15) holds in this
case.
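The orthogonality exploited in the case $V=\emptyset$ can be
written out explicitly. Since the function $f$ is canonical,
$$
Ef\left(\xi^{(1,-1)}_{l_1},\dots,\xi^{(k,-1)}_{l_k}\right)
f\left(\xi^{(1,-1)}_{l'_1},\dots,\xi^{(k,-1)}_{l'_k}\right)=0
\quad\text{if } l_j\neq l'_j \text{ for some } 1\le j\le k,
$$
because the integration with respect to the variable
$\xi^{(j,-1)}_{l_j}$, which appears only in the first factor,
yields zero. Hence
$$
E\left(\bar I^{\emptyset}_{n,k}(f)^2\right)
=\frac1{(k!)^2}\sum\Sb(l_1,\dots,l_k)\colon\;1\le l_j\le n,\\
l_j\neq l_{j'}\text{ if } j\neq j'\endSb
Ef^2\left(\xi^{(1,-1)}_{l_1},\dots,\xi^{(k,-1)}_{l_k}\right)
\le\frac{n^k\sigma^2}{(k!)^2}\le\frac{n^k}{k!}\sigma^2.
$$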
To avoid some complications in the notation let us first restrict our
attention to sets of the form $V=\{1,\dots,u\}$ with some
$1\le u\le k-1$. In this case define for all vectors
$(l_{u+1},\dots,l_k)$, $1\le l_j\le n$, $u+1\le j\le k$, with
distinct coordinates the random sums
$$
\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)=\frac1{k!}
\sum\limits\Sb (l_1,\dots,l_u)\colon\; 1\le l_j\le n,\;
j=1,\dots,u,\\ l_j\neq l_{j'} \text{ if } j\neq j',\;
1\le j\le u,\;1\le j'\le k\endSb
f\left(\xi_{l_1}^{(1,1)},\dots,\xi_{l_u}^{(u,1)},
\xi_{l_{u+1}}^{(u+1,-1)},\dots,\xi_{l_k}^{(k,-1)}\right).
\tag16.16
$$
Then $\bar I_{n,k}^V(f)$ equals the sum of the random variables
$\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)$ for all vectors
$(l_{u+1},\dots,l_k)$ with distinct coordinates, and since the
functions $f\in{\Cal F}$ are canonical, the products of different
terms in this sum have zero conditional expectation with respect
to~${\Cal B}$. Hence
$$
\left.E\left(\bar I_{n,k}^V(f)^2\right|{\Cal B}\right)
=\sum\Sb (l_{u+1},\dots,l_k)\colon\; 1\le l_j\le n,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
\left.E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|
{\Cal B}\right),
$$
and as the number of terms at the right-hand side is less than
$n^{k-u}$, we have
$$
\aligned
&\left\{\omega\colon\; \sup_{f\in{\Cal F}}\left.
E\left(\bar I_{n,k}^V(f)^2\right|{\Cal B}\right)(\omega)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2} \right\}\\
&\qquad \subset \bigcup \Sb (l_{u+1},\dots,l_k)\colon \\
1\le l_j\le n,\; j=u+1,\dots,k.\\
l_j\neq l_{j'} \text { if } j\neq j'\endSb
\left\{\omega\colon\; \sup_{f\in{\Cal F}}\left. E\left(\bar
I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|\Cal
B\right)(\omega)
>\frac{A^2n^{2k}\sigma^{2k+2}}{2^{(3k+3)}n^{k-u}} \right\}.
\endaligned
\tag16.17
$$
The probability of the events in the union at the right-hand side
of (16.17) can be estimated with the help of the Corollary of
Proposition~15.4 with parameter $u<k$ instead of~$k$. We claim that
$$
P\left(\sup_{f\in{\Cal F}}\left.
E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|{\Cal B}\right)
>\frac {A^2n^{k+u}\sigma^{2k+2}} {2^{(3k+3)}}\right)\le
e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2} \tag16.18
$$
with an appropriate $\gamma_k>0$ for all sequences
$(l_{u+1},\dots,l_k)$, $1\le l_j\le n$, $u+1\le j\le k$, and such
that $l_j\neq l_{j'}$ if $j\neq j'$.
Let us show that if a class of functions ${\Cal F}$ satisfies
the conditions of Proposition~15.3, then it also satisfies
relation~(16.18).
For this goal introduce the space $(Y,{\Cal Y},\rho)=(X^{k-u},
{\Cal X}^{k-u},\mu^{k-u})$, the $(k-u)$-fold power of the measure
space $(X, {\Cal X},\mu)$, and for the sake of simpler notations
write $y=(x_{u+1},\dots,x_k)$ for a point $y\in Y$. Let us also
introduce the class $\bar{\Cal F}$ of those functions in the
space $(X^u\times Y,{\Cal X}^u\times{\Cal Y},\mu^u\times\rho)$
consisting of functions $\bar f$ of the form
$\bar f(x_1,\dots,x_u,y)=f(x_1,\dots,x_k)$ with
$y=(x_{u+1},\dots,x_k)$ and some function
$f(x_1,\dots,x_k)\in{\Cal F}$.
If the class of functions ${\Cal F}$ satisfies the conditions of
Proposition~15.3 (with parameter~$k$), then the class of functions
$\bar{\Cal F}$ satisfies the conditions of Proposition~15.4 with
parameter $u<k$ instead of~$k$ and with the same parameters
$\sigma$, $L$ and~$D$, hence the Corollary of Proposition~15.4 can
be applied for it. For a fixed vector $(l_{u+1},\dots,l_k)$ let
$H^{l(u)}_{n+u-k,u}(\bar f)$ denote the random integral of the
form~(16.3) of order~$u$ and sample size $n+u-k$, defined with the
help of the function $\bar f\in\bar{\Cal F}$ and of those random
variables $\xi^{(j,1)}_l$, $1\le j\le u$, whose indices~$l$ differ
from the prescribed indices $l_{u+1},\dots,l_k$. The Corollary of
Proposition~15.4 yields a good estimate on its tail distribution
with a sufficiently small $\gamma_k>0$. We have
$$
\align
&P\left(\sup_{\bar f\in\bar{\Cal F}}
E(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2|{\Cal B})
\ge \left(\frac{k!}{u!}\right)^2 \gamma_k^{2/(2u+1)}
A^2 (n+u-k)^{2u}\sigma^{2u+2}\right) \\
&\qquad=P\left(\sup_{\bar f\in\bar{\Cal F}} (n+u-k)^{-u}
H^{l(u)}_{n+u-k,u}(\bar f)\ge \gamma_k^{2/(2u+1)}
A^2(n+u-k)^u\sigma^{2u+2}\right) \\
&\qquad\le e^{-\gamma_kA^{1/(2u+1)}(n+k-u)\sigma^2}
\quad \text{for } A>A_0(u)\gamma_k^{-2/(2u+1)}.
\tag16.20
\endalign
$$
It is not difficult to derive formula (16.18) from relation (16.20).
It is enough to check that the level
$\frac{A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ in the
probability at the left-hand side of (16.18) can be replaced by
$\gamma_k^{2/(2u+1)}
A^2\left(\frac{k!}{u!}\right)^2(n+u-k)^{2u}\sigma^{2u+2}$
if $\gamma_k>0$ is chosen sufficiently small. This statement holds,
since
$\gamma_k^{2/(2u+1)}
A^2\left(\frac{k!}{u!}\right)^2(n+u-k)^{2u}\sigma^{2u+2}<
\gamma_k^{2/(2k+1)}A^2\left(\frac{k!}{u!}\right)^2n^{2u}\sigma^{2u+2}
\le\frac {A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ if the constant
$\gamma_k>0$ is chosen sufficiently small, since
$n\sigma^2\ge L\log n+\log D\ge \frac12$ by the conditions of
Proposition~15.3.
Relations (16.17) and (16.18) imply that
$$
P\left(\sup_{f\in{\Cal F}}\left. E\left(\bar I_{n,k}^V(f)^2\right|
{\Cal B}\right)(\omega) > 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2} \right)
\le n^{k-u}e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2}.
$$
Since $e^{-\gamma_k A^{1/(2u+1)}(n+u-k)\sigma^2}
\le e^{-\gamma_k A^{1/(2k-1)}n\sigma^2/k}$
if $u\le k-1$, $n\ge k$ and $A>A_0$ with a sufficiently large
number~$A_0$, inequality (16.15) holds for all sets $V$ of the form
$V=\{1,\dots,u\}$, $1\le u\le k-1$. The case of a general set
$V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, can be handled in
the same way after an appropriate reindexation of the variables.
Lemma~16.3A is proved.
\medskip\noindent
{\it Proof of Lemma 16.1A.}\/ First we show that
$$
\aligned
P\left(\sup_{f\in{\Cal F}} n^{-k/2}\left|\bar I_{n,k}(f)\right|
>An^{k/2}\sigma^{k+1}\right)&<
2P\left(\sup_{f\in{\Cal F}} |S(f)|>\frac A2n^k\sigma^{k+1}\right)\\
&\qquad +2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\endaligned \tag16.21
$$
with the function $S(f)$ defined in (16.13). To prove relation (16.21)
introduce the random variables $Z(f)=\bar I_{n,k}^{\{1,\dots,k\}}(f)$
and
$\bar Z(f)=-\sum\limits_{V\subset \{1,\dots,k\},\,V\neq\{1,\dots,k\}}
(-1)^{k-|V|}\bar I_{n,k}^V(f)$ for all $f\in{\Cal F}$, the
$\sigma$-algebra ${\Cal B}$ considered in Lemma~16.3A and the set
$$
B=\bigcap\Sb V\subset\{1,\dots,k\}\\V\neq\{1,\dots,k\}\endSb
\left\{\omega\colon\;
\sup_{f\in{\Cal F}}\left.E\left(\bar I_{n,k}^V(f)^2\right|
{\Cal B}\right)(\omega) \le 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right\}.
$$
Observe that $S(f)=Z(f)-\bar Z(f)$, $f\in{\Cal F}$, $B\in{\Cal B}$,
and by Lemma~16.3A the inequality
$1-P(B)\le2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)}
n\sigma^2/k}$ holds. To prove relation (16.21) apply Lemma~15.2 with
the above introduced random variables $Z(f)$ and $\bar Z(f)$,
$f\in{\Cal F}$, (both here and in the subsequent proof of Lemma~16.1B
we work with random variables $Z(\cdot)$ and $\bar Z(\cdot)$ indexed
by the countable set of functions $f\in{\Cal F}$, hence the functions
$f\in{\Cal F}$ play the role of the parameters~$p$ when Lemma~15.2 is
applied) random set $B$ and $\alpha=\frac A2n^k\sigma^{k+1}$,
$u=\frac A2n^k\sigma^{k+1}$. It is enough to show that
$$
P\left(|\bar Z(f)|
>\frac A2n^k\sigma^{k+1}|{\Cal B}\right)(\omega)\le\frac12
\quad \text{ for all }f\in{\Cal F} \quad \text {if } \omega\in B.
\tag16.22
$$
But $P\left(|\bar I_{n,k}^{V}(f)|>2^{-(k+1)} An^k\sigma^{k+1}|
{\Cal B}\right)(\omega)
\le\frac{2^{2(k+1)}E(\bar I^{V}_{n,k}(f)^2|{\Cal B})(\omega)}
{A^2n^{2k}\sigma^{2(k+1)}}\le 2^{-(k+1)}$ for all functions
$f\in {\Cal F}$ and sets
$V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, if $\omega\in B$
by the `conditional Chebyshev inequality', hence relations (16.22)
and~(16.21) hold.
Lemma 16.1A follows from relation (16.21), Lemma~16.2A and the
observation that the random variables
$\bar I_{n,k}^{(V,\varepsilon)}(f)$,
$f\in{\Cal F}$, defined in $(16.12')$ have the same distribution for
all $V\subset\{1,\dots,k\}$ as the random variables
$\bar I_{n,k}^{\varepsilon}(f)$, defined in formula~(14.12).
Hence Lemma~16.2A and the definition ($16.13'$) of the random
variables $\bar S(f)$, $f\in{\Cal F}$, imply the inequality
$$
\align
P\left(\sup_{f\in{\Cal F}} |S(f)|>\frac A2n^k\sigma^{k+1}\right)
&=P\left(\sup_{f\in{\Cal F}} |\bar S(f)|
>\frac A2n^k\sigma^{k+1}\right)\\
&\le 2^kP\left(\sup_{f\in{\Cal F}}
\left|\bar I_{n,k}^\varepsilon(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right).
\endalign
$$
Lemma 16.1A is proved.
\medskip
Lemma~16.1B will be proved with the help of the following
Lemma~16.3B, which is a version of Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3B.} {\it Let us consider a class of functions ${\Cal F}$
satisfying the conditions of Proposition~15.4 together with $2k$
independent copies $\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n$ and
$\xi^{(j,-1)}_1,\dots,\xi^{(j,-1)}_n$, $1\le j\le k$, of a
sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables
$\bar I_{n,k}^V(f,y)$ and $H^V_{n,k}(f)$, $f\in{\Cal F}$,
$V\subset\{1,\dots,k\}$, defined in formulas~(16.2) and~(16.3) with
the help of these quantities. Let
${\Cal B}={\Cal B}(\xi_1^{(j,1)},\dots, \xi_n^{(j,1)};\;1\le j\le k)$
denote the $\sigma$-algebra generated by the random variables
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$, $1\le j\le k$, i.e. by those
random variables which appear in the definition of the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ introduced in
formulas (16.2) and (16.3), and have second argument~1 in their
upper index.
\medskip
\item{a)}
There exist some numbers $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$,
the inequality
$$
P\left(\sup_{f\in{\Cal F}} E(H^{V}_{n,k}(f)|{\Cal B})
>2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right)<
n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\tag16.23
$$
holds if $A\ge A_0$.
\medskip
\item{b)} Given two subsets $V_1,V_2\subset\{1,\dots,k\}$ of the
set $\{1,\dots,k\}$, define the integrals (of random kernel functions)
$$
H_{n,k}^{(V_1,V_2)}(f)=\int |\bar I_{n,k}^{V_1}(f,y)
\bar I_{n,k}^{V_2}(f,y)| \rho(\,dy),
\quad f\in{\Cal F}, \tag16.24
$$
with the help of the functions $\bar I_{n,k}^V(f,y)$ defined in~(16.2).
There exist some numbers $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$ such that
if the integrals $H_{n,k}(f)$, $f\in{\Cal F}$, determined by
this class of functions ${\Cal F}$ have a good tail behaviour at level
$T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
$$
P\left(\sup_{f\in{\Cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\Cal B})
>2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}\right)
<2n^{k-1}e^{-\gamma_k A^{1/2k}n\sigma^2/k}
\tag16.25
$$
holds for any pairs of subsets $V_1,V_2\subset\{1,\dots,k\}$ with
the property that at least one of them does not equal the set
$\{1,\dots,k\}$ if the number~$A$ satisfies the condition $A>T$.}
\medskip\noindent
{\it Proof of Lemma 16.3B.}\/ Part a) of Lemma 16.3B can be proved
in almost the same way as Lemma 16.3A. Hence I only briefly
explain the main step of the proof. In the case $V=\emptyset$ the
identity $E(H^{V}_{n,k}(f)|{\Cal B})=E(H^{V}_{n,k}(f))$ holds, hence it
is enough to show that $E(H^{V}_{n,k}(f))\le\frac{n^k\sigma^2}{k!}
\le2^k\frac{n^{2k}\sigma^{2k+2}}{k!}$ for all $f\in{\Cal F}$ under the
conditions of Proposition~15.4. (This relation holds, because
the functions of the class ${\Cal F}$ are canonical.) The case of a
general set $V$, $V\neq\emptyset$ and $V\neq\{1,\dots,k\}$, can be
reduced to the case $V=\{1,\dots,u\}$ with some $1\le u<k$. In this
case it is enough to show that
$$
\align
&P\left(\sup_{f\in{\Cal F}} E(H^{V}_{n,k}(f)|{\Cal B})
>\frac {A^{(2k-1)/k}n^{k+u}\sigma^{2k+2}}{2^{(4k+4)}}\right)\\
&\qquad \le e^{-\gamma_kA^{(2k-1)/2k(2u+1)}(n+u-k)\sigma^2}
\endalign
$$
with a sufficiently small $\gamma_k>0$. This inequality can be
proved similarly to relation (16.18) in the proof of Lemma 16.3A,
with the help of the Corollary of Proposition~15.4. Only here we
have to work in the space $(X^u\times \bar Y,\Cal
X^u\times\bar{\Cal Y}, \mu^u\times\bar \rho)$ where $\bar
Y=X^{k-u}\times Y$, $\bar{\Cal Y}={\Cal X}^{k-u}\times{\Cal Y}$,
$\bar\rho=\mu^{k-u}\times\rho$ with the class of functions
$\bar{\Cal F}$ consisting of the functions~$\bar f$ defined
by the formula $\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)$
with some $f(x_1,\dots,x_k,y)\in {\Cal F}$, where
$\bar y=(x_{u+1},\dots,x_k,y)$. Here we apply the following version
of formula~(16.19).
$$
E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k,y)^2|{\Cal B}\right)
=\left(\frac{u!}{k!}\right)^2\int \bar I^{l(u)}_{n+u-k,u}
(\bar f,\bar y)^2\bar\rho(\,d\bar y)
=\left(\frac{u!}{k!}\right)^2H_{n+u-k,u}(\bar f)
$$
with the function $\bar f\in\bar{\Cal F}$ for which the identity
$\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)$ holds with
$\bar y=(x_{u+1},\dots,x_k,y)$ and the random variables
$\bar I^{l(u)}_{n+u-k,u}(\bar f,\bar y)$ and $H_{n+u-k,u}(\bar f)$
defined similarly as the corresponding terms after formula~(16.19),
only $y$ is replaced by $\bar y$, the measure $\rho$ by $\bar\rho$,
and the functions $\bar f\in\bar{\Cal F}$ defined above are
considered. I omit the details.
\medskip\noindent
Part b) of Lemma 16.3B will be proved with the help of Part a) and
the inequality
$$
\sup_{f\in{\Cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\Cal B}) \le
\left(\sup_{f\in{\Cal F}} E(H^{V_1}_{n,k}(f)|{\Cal B})\right)^{1/2}
\left(\sup_{f\in{\Cal F}} E(H^{V_2}_{n,k}(f)|{\Cal B})\right)^{1/2}
$$
which follows from the Schwarz inequality applied for integrals with
respect to conditional distributions. Let us assume that
$V_1\neq\{1,\dots,k\}$. The last inequality implies that
$$
\aligned
&P\left(\sup_{f\in{\Cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\Cal B})
>2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}\right)\\
&\qquad \le P\left(\sup_{f\in{\Cal F}} E(H^{V_1}_{n,k}(f)|{\Cal B})
>2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right) \\
&\qquad\qquad+P\left(\sup_{f\in{\Cal F}} E(H^{V_2}_{n,k}(f)|{\Cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\endaligned
$$
Hence if we also know that the inequality
$$
P\left(\sup_{f\in{\Cal F}} E(H^{V_2}_{n,k}(f)|{\Cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\le n^{k-1} e^{-\gamma_k A^{1/2k}n\sigma^2} \tag16.26
$$
holds, then we can deduce relation~(16.25) from the estimate~(16.23)
and the last inequality. Relation~(16.26) follows from Part a) of
Lemma~16.3B if $V_2\neq\{1,\dots,k\}$ and $A\ge1$, since in this
case the level $A^{(2k+1)/k} n^{2k}\sigma^{2k+2}$ can be replaced
by the smaller number $2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}$
in the probability of formula (16.26). In the case
$V_2=\{1,\dots,k\}$ it follows from the conditions of Part~b) of
Lemma~16.3B if the number $\gamma_k$ is chosen so that
$\gamma_k\le1$. Indeed, since $A^{(2k+1)/2k}>T^{(2k+1)/2k}$, by
the conditions of Proposition~15.4 the estimate~(15.7) holds if
the number $A$ is replaced in it by $A^{(2k+1)/2k}$ (at both sides
of the inequality), and this relation implies inequality~(16.26)
in this case.
\medskip
Now we turn to the proof of Lemma~16.1B.
\medskip\noindent
{\it Proof of Lemma 16.1B.}\/ By Lemma~16.2B it is enough to
prove that relation (16.8) holds if the random variables $\bar W(f)$
are replaced in it by the random variables $W(f)$ defined in
formula~(16.14). We shall prove this by applying the generalized
form of the symmetrization lemma, Lemma~15.2, with the choice of
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)$, $\bar V=\{1,\dots,k\}$,
$\bar Z(f)=Z(f)-W(f)$, $f\in{\Cal F}$,
${\Cal B}={\Cal B}(\xi_1^{(j,1)},\dots,\xi_n^{(j,1)};\;1\le j\le k)$,
$\alpha=\frac{A^2}2n^{2k}\sigma^{2k+2}$,
$u=\frac{A^2}2n^{2k}\sigma^{2k+2}$ and the set
$$
B=\bigcap\Sb (V_1,V_2)\colon\; V_j\subset \{1,\dots,k\},\;j=1,2,\\
V_1\neq\{1,\dots,k\} \text { or } V_2\neq\{1,\dots,k\} \endSb
\left\{\omega\colon
\sup_{f\in{\Cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\Cal B})(\omega)
\le 2^{-(2k+2)} A^{2} n^{2k}\sigma^{2k+2}\right\}.
$$
By part~b) of Lemma 16.3B the inequality $1-P(B)\le2^{2k+1}n^{k-1}
e^{-\gamma_k A^{1/2k}n\sigma^2/k}$ holds. Observe that
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)=H_{n,k}(f)$ for all $f\in{\Cal F}$.
Hence to prove Lemma 16.1B with the
help of Lemma~15.2 it is enough to show that
$$
P\left(\left.|\bar Z(f)|>\frac{A^2}2 n^{2k}\sigma^{2k+2}\right|
{\Cal B}\right)(\omega)\le\frac12 \quad \text{for all }f\in{\Cal F}
\text{ if } \omega\in B. \tag16.27
$$
To prove this relation observe that because of the definition of the
set~$B$
$$
E (|\bar Z(f)| |{\Cal B})(\omega)\le \sum\limits \Sb
(V_1,V_2)\colon\; V_j\subset \{1,\dots,k\},\;j=1,2,\\
V_1\neq\{1,\dots,k\} \text { or } V_2\neq\{1,\dots,k\} \endSb
E(H^{(V_1,V_2)}_{n,k}(f)|{\Cal B})(\omega)
\le\frac{A^2}4n^{2k}\sigma^{2k+2}
$$
for all $f\in {\Cal F}$ if $\omega\in B$. Hence the `conditional
Markov inequality' implies that
$P\left(\left.|\bar Z(f)|>\frac{A^2}2 n^{2k}\sigma^{2(k+1)}\right|
{\Cal B}\right)(\omega)\le\frac
{2E(|\bar Z(f)| |{\Cal B})(\omega)}{A^2n^{2k}\sigma^{2k+2}}\le\frac12$
if $\omega\in B$, and inequality~(16.27) holds. Lemma~16.1B is proved.
\beginsection 17. The proof of the main result.
This section contains the proof of Proposition 15.3 together with
Proposition~15.4. They complete the proof of Theorem~8.4, the
main result of this work.
\medskip\noindent
{\script A.) The proof of Proposition 15.3.}
\medskip\noindent
The proof of Proposition 15.3 is similar to that of Proposition~7.3.
It applies an induction procedure with respect to the parameter $k$.
In the proof of Proposition~15.3 for parameter~$k$ we may assume that
Propositions~15.3 and~15.4 hold for $u<k$. Hence it is enough to
give a good bound on the probability
$$
P\left(\sup_{f\in{\Cal F}}
\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right)
$$
appearing at the right-hand side of the estimate (16.1) in
Lemma~16.1A. To estimate this probability we introduce (using
the notation of Proposition~15.3) the functions
$$
S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)=\frac1{k!}\sum\Sb
(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; j=1,\dots, k,\\ l_j\neq l_{j'} \text{ if } j\neq
j'\endSb f^2\left(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)}\right),
\quad f\in{\Cal F},
\tag17.1
$$
with $x_l^{(j)}\in X$, $1\le l\le n$, $1\le j\le k$.
We define with the help of this function the following set
$H=H(A)\subset X^{kn}$ for all $A>T$ similarly to the set defined
in formula~(7.7).
$$
\aligned
H=H(A)&=\biggl\{(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\colon \\
&\qquad \sup_{f\in{\Cal F}} S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,
1\le j\le k)>2^kA^{4/3}n^k\sigma^2\biggr\}.
\endaligned \tag17.2
$$
We want to show that
$$
P(\{\omega\colon \; (\xi_l^{(j)}(\omega),
\,1\le l\le n,\,1\le j\le k)\in H\})
\le 2^k e^{-A^{2/3k}n\sigma^2} \quad\text{if }A\ge T. \tag17.3
$$
To prove relation (17.3) we take the Hoeffding decomposition of the
$U$-statistics with kernel functions $f^2(x_1,\dots,x_k)$,
$f\in{\Cal F}$, given in Theorem~9.1, i.e. we write
$$
f^2(x_1,\dots,x_k)
=\sum\limits_{V\subset\{1,\dots,k\}} f_V(x_j,j\in V),
\quad f\in{\Cal F}, \tag17.4
$$
with
$f_V(x_j,j\in V)=\prod\limits_{j\notin V}P_j\prod\limits_{j\in V}Q_j
f^2(x_1,\dots,x_k)$, where $P_j$ is the projection defined in formula
(9.1), and $Q_j=I-P_j$ agrees with the operator $Q_j$ defined in
formula~(9.2).
The functions $f_V$ appearing in formula (17.4) are canonical (with
respect to the measure $\mu$), and the identity
$S^2_{n,k}(f)(\xi_l^{(j)},\,1\le l\le n,\,1\le j \le k)=\bar I_{n,k}(f^2)$
holds for all $f\in {\Cal F}$ with the expression $\bar I_{n,k}(\cdot)$
defined in~(14.11). By applying the Hoeff\-ding decomposition~(17.4)
for each term $f^2(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)})$ in the
expression $S^2_{n,k}(f)$ we get that
$$
\aligned
&P\left(\sup_{f\in{\Cal F}}S^2_{n,k}(f)(\xi_l^{(j)},
\,1\le l\le n,\,1\le j\le
k) >2^kA^{4/3}n^k\sigma^2\right)\\
&\qquad \le\sum\limits_{V\subset\{1,\dots,k\}} P\left(\frac{|V|!}{k!}
\sup_{f\in{\Cal F}}
n^{k-|V|}|\bar I_{n,|V|}(f_V)|>A^{4/3}n^k\sigma^2\right)
\endaligned \tag17.5
$$
with the functions $f_V$ appearing in formula~(17.4). We want to give
a good estimate for each term in the sum at the right-hand side
in~(17.5). For this goal first we show that the classes of functions
$\{f_V\colon\;f\in {\Cal F}\}$ in the expansion~(17.4) satisfy the
conditions of Proposition~15.3 for all $V\subset\{1,\dots,k\}$.
The functions $f_V$ are canonical for all $V\subset\{1,\dots,k\}$.
It follows from the conditions of Proposition~15.3 that
$|f^2(x_1,\dots,x_k)|\le 2^{-2(k+1)}$ and
$$
\int f^4(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le
2^{-(k+1)}\sigma^2.
$$
Hence relations (9.4) and $(9.4')$ of Theorem~9.2 imply that
$\left|\sup\limits_{x_j\in X,j\in V}f_V(x_j,j\in V)\right|\le
2^{-(k+2)}\le2^{-(k+1)}$ for all $V\subset\{1,\dots,k\}$ and
$\int f^2_V(x_j,j\in V)\prod\limits_{j\in V}\mu(\,dx_j)
\le 2^{-(k+1)} \sigma^2\le\sigma^2$ for all
$V\subset\{1,\dots,k\}$. Finally, to check that the class of
functions ${\Cal F}_V=\{f_V\colon\; f\in{\Cal F}\}$
is $L_2$-dense with exponent $L$ and parameter $D$ observe
that for all probability measures $\rho$ on $(X^k,{\Cal X}^k)$
and pairs of functions $f,g\in {\Cal F}$ the inequality
$\int(f^2-g^2)^2\,d\rho\le 2^{-2k}\int(f-g)^2\,d\rho$ holds.
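Indeed, since $\sup\limits_{x\in X^k}|f(x)|\le2^{-(k+1)}$ for all
$f\in{\Cal F}$ by the conditions of Proposition~15.3, we have
$|f(x)+g(x)|\le2^{-k}$, hence
$$
\int(f^2-g^2)^2\,d\rho=\int(f-g)^2(f+g)^2\,d\rho
\le2^{-2k}\int(f-g)^2\,d\rho.
$$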
This implies that if $\{f_1,\dots,f_m\}$,
$m\le D\varepsilon^{-L}$, is an
$\varepsilon$-dense subset of ${\Cal F}$ in the space
$L_2(X^k,{\Cal X}^k,\rho)$,
then the set of functions $\{2^kf_1^2,\dots,2^kf_m^2\}$ is an
$\varepsilon$-dense subset of the class of functions
${\Cal F}'=\{2^kf^2\colon\;
f\in {\Cal F}\}$, hence ${\Cal F}'$ is also an $L_2$-dense class
of functions with exponent~$L$ and parameter~$D$. Then by
Theorem~9.2 the class of functions ${\Cal F}_V$ is also
$L_2$-dense with exponent $L$ and
parameter~$D$ for all sets $V\subset\{1,\dots,k\}$.
For $V=\emptyset$, the function $f_V$ is constant,
$f_V=\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$
holds, and $\bar I_{n,|V|}(f_V)=f_V\le\sigma^2$. Therefore
the term corresponding to $V=\emptyset$ in the sum of probabilities
at the right-hand side of (17.5) equals zero under the conditions
of Proposition~15.3 with the choice of some $A_0\ge1$. I claim that
the remaining terms in the sum at the right-hand side of~(17.5)
satisfy the inequality
$$ \allowdisplaybreaks
\align
&P\left(\frac{|V|!}{k!}n^{k-|V|}\sup_{f\in{\Cal F}}
|\bar I_{n,|V|}(f_V)|>A^{4/3}n^{k}\sigma^2\right)\\
&\qquad \le P\left(\sup_{f\in{\Cal F}}
|\bar I_{n,|V|}(f_V)|>A^{4/3}\frac{k!}
{|V|!}n^{|V|}\sigma^{|V|+1}\right)
\le e^{-A^{2/3k}n\sigma^2}
\quad\text{if } 1\le|V|\le k. \tag17.6
\endalign
$$
The first inequality in (17.6) holds, since
$\sigma^{|V|+1}\le\sigma^2$
for $|V|\ge1$, and $n\ge k\ge|V|$. The second inequality follows from
the inductive hypothesis if $|V|<k$, and it also holds in the case
$|V|=k$ if the number $A_0$ is chosen sufficiently large. Relations
(17.5) and (17.6) imply relation~(17.3).

By conditioning the probability
$P\left(\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^{k}\sigma^{k+1}\right)$ with respect to the
random variables $\xi_l^{(j)}$, $1\le l\le n$, $1\le j\le k$ we get
with the help of the multivariate version of Hoeff\-ding's inequality
(Theorem~13.3) that
$$
\align
&P\left(\left.\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^k\sigma^{k+1}\right|\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\right) \\
&\qquad
\le C\exp\left\{-\frac12
\left(\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{2k+4}
S^2_{n,k}(f)(x_l^{(j)},1\le l\le n,\,1\le j\le k)/k!}
\right)^{1/k}\right\} \tag17.7 \\
&\qquad \le Ce^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2} \quad
\text{for all }f\in{\Cal F}\quad \text{if } (x_l^{(j)},\,
1\le l\le n,\,1\le j\le k) \notin H
\endalign
$$
with some appropriate constant $C=C(k)>0$.
Define for all $1\le j\le k$ and sets of points $x_l^{(j)}\in X$,
$1\le l\le n$, the probability measures
$\rho_j=\rho_{j,\,(x_l^{(j)},\,
1\le l\le n)}$, $1\le j\le k$, uniformly distributed on the set of
points $\{x_l^{(j)},\; 1\le l\le n\}$, i.e. let
$\rho_j(x_l^{(j)})=\frac1n$ for all $1\le l\le n$. Let us also
define the product $\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)
=\rho_1\times\cdots\times\rho_k$ of these measures on the space
$(X^k,{\Cal X}^k)$. If $f$ is a function on $(X^k,{\Cal X}^k)$ such
that $\int f^2\,d\rho\le\delta^2$ with some $\delta>0$, then
$$
\align
\sup_{\varepsilon_1,\dots,\varepsilon_n} |\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)|
&\le\frac{n^k}{k!}\int
|f(u_1,\dots,u_k)|\rho(\,du_1,\dots,\,du_k)\\
&\le\frac{n^k}{k!}
\left(\int f^2\,d\rho\right)^{1/2}\le\frac{n^k}{k!}\delta,
\endalign
$$
$u_j\in X$, $1\le j\le k$, and as a consequence
$$
\align
&\sup_{\varepsilon_1,\dots,\varepsilon_n}|\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)-\bar I_{n,k}^\varepsilon(g)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)|\\
&\qquad \le2^{-(k+2)}An^k\sigma^{k+1} \quad\text{if }
\int (f-g)^2\,d\rho\le (2^{-(k+2)}k!A\sigma^{k+1})^2, \tag17.8
\endalign
$$
where
$\bar I_{n,k}^\varepsilon(f)(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)$
equals the expression $\bar I_{n,k}^\varepsilon(f)$ defined
in~(14.12) if we replace $\xi_{l_j}^{(j)}$ by $x_{l_j}^{(j)}$
for all $1\le j\le k$, and $1\le l_j\le n$ in it, and $\rho$ is
the measure
$\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$ defined above.
Let us fix the number $\delta=2^{-(k+2)}k!A\sigma^{k+1}$, and let us
list the elements of the set ${\Cal F}$ as ${\Cal F}=\{f_1,f_2,\dots\}$.
Put
$$
m=m(\delta)=\max(1,D\delta^{-L})
=\max(1,D(2^{(k+2)}(k!)^{-1}A^{-1}\sigma^{-(k+1)})^L),
$$
and choose for all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$ such a
sequence of positive integers $p_1(x^{(n)}),\dots,p_m(x^{(n)})$
for which
$$
\inf\limits_{1\le l\le m}\int (f(u)-f_{p_l(x^{(n)})}(u))^2
\,d\rho(x^{(n)})\le\delta^2\quad\text{for all } f\in{\Cal F}.
$$
(Here we apply the notation
$\rho(x^{(n)})=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$.)
This is possible, since ${\Cal F}$ is an $L_2$-dense
class with exponent~$L$ and parameter~$D$, and we can choose
$m=D\delta^{-L}$ if $\delta<1$. Beside this, we can choose $m=1$ if
$\delta\ge1$, since
$\int |f-g|^2\,d\rho\le \sup|f(x)-g(x)|^2\le2^{-2k}\le1$ for all
$f,g\in{\Cal F}$. Moreover, it follows from Lemma~7.4A that the
functions $p_l(x^{(n)})$, $1\le l\le m$, can be chosen as measurable
functions of the argument $x^{(n)}\in X^{kn}$.
Let us introduce the random vector
$\xi^{(n)}(\omega)=(\xi^{(j)}_l(\omega),\,1\le l\le n,\,1\le j\le k)$.
By arguing similarly as we did in the proof of Proposition~7.3 we
get with the help of relation~(17.8) and the property of the
functions $f_{p_l(x^{(n)})}(\cdot)$ constructed above that
$$
\align
&\left\{\omega\colon\;\sup_{f\in{\Cal F}}
|\bar I_{n,k}^\varepsilon(f)(\omega)|
\ge2^{-(k+1)}An^k\sigma^{k+1}\right\} \\
&\qquad \subset\bigcup\limits_{l=1}^m\left\{\omega\colon\;
|\bar I_{n,k}^\varepsilon(f_{p_l(\xi^{(n)}(\omega))})(\omega)|
\ge2^{-(k+2)}An^k\sigma^{k+1} \right\}.
\endalign
$$
The above relation and formula (17.7) imply that
$$ \allowdisplaybreaks
\align
&P\left(\sup_{f\in{\Cal F}}\left.
\left|\bar I_{n,k}^{\varepsilon}(f)(\omega)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right|\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\right) \\
&\qquad \le \sum_{l=1}^m
P\left(\left.\left|
\bar I_{n,k}^{\varepsilon}(f_{p_l(\xi^{(n)}(\omega))})(\omega)\right|
>\frac{A n^k\sigma^{k+1}}{2^{k+2}}\right|\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\right) \\
&\qquad \le C m(\delta) e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2} \\
&\qquad \le C (1+D(2^{k+2} A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2} \tag17.9 \\
&\qquad\qquad \text{if }
\{x_l^{(j)},\, 1\le l\le n,\,1\le j\le k\}\notin H.
\endalign
$$
Relations (17.3) and (17.9) imply that
$$
\aligned
&P\left(\sup_{f\in{\Cal F}}\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right)
\le C (1+D(2^{k+2}A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L)\\
&\qquad\qquad e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2}
+2^k e^{-A^{2/3k}n\sigma^2} \quad\text{if }A> T.
\endaligned \tag17.10
$$
Proposition 15.3 follows from the estimates (16.1), (17.10) and the
condition $n\sigma^2\ge L\log n+\log D$, $L,D\ge 1$, if $A\ge A_0$
with a sufficiently large number~$A_0$. Indeed, in this case
$n\sigma^2\ge\frac12$,
$(2^{k+2}A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L\le
(\frac{n^{(k+1)/2}}{(2n\sigma^2)^{(k+1)/2}})^L\le n^{L(k+1)/2}=
e^{L\log n\cdot (k+1)/2}\le e^{(k+1)n\sigma^2/2}$,
$D=e^{\log D}\le e^{n\sigma^2}$, and
$$
C (1+D(2^{k+2} A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2}
\le\frac13 e^{-A^{1/2k}n\sigma^2}.
$$
The estimation of the remaining terms in the upper bound of the
estimates~(16.1) and~(17.10) leading to the proof of
relation~(15.5) is simpler. We can exploit that
$e^{-A^{2/3k}n\sigma^2}\ll e^{-A^{1/2k}n\sigma^2}$, and since
$n^{k-1}\le e^{(k-1)n\sigma^2}$,
$$
2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}\le
2^ke^{(k-1)n\sigma^2}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\ll e^{-A^{1/2k}n\sigma^2}
$$
for a large number~$A$.
Now we turn to the proof of Proposition~15.4.
\medskip\noindent
{\script B.) The proof of Proposition 15.4.}
\medskip\noindent
Because of formula (16.11) in the Corollary of Lemma~16.1B, to prove
Proposition~15.4, i.e.\ inequality~(15.7), it is enough to choose a
sufficiently large parameter $A_0$ and to show that with such a choice
the random variables $H_{n,k}(f|G,V_1,V_2)$ defined in formula (16.9)
satisfy the inequality
$$
\aligned
&P\left(\sup_{f\in{\Cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right) \le
2^{k+1} e^{-A^{1/2k}n\sigma^2}\\
&\qquad\text{ for all } G\in {\Cal G}\quad \text{and }
\;V_1,V_2\subset\{1,\dots,k\} \quad\text{if } A>T\ge A_0
\endaligned \tag17.11
$$
under the conditions of Proposition~15.4.
Let us first prove formula (17.11) in the case $|e(G)|=k$, i.e.\
when all vertices of the diagram $G$ are end-points of some edge,
and the expression $H_{n,k}(f|G,V_1,V_2)$ contains no
`symmetrizing term' $\varepsilon_j$. In this case we apply a
special argument to prove relation~(17.11).
It can be seen with the help of the Schwarz inequality that for a
diagram $G$ such that $|e(G)|=k$
$$ \allowdisplaybreaks
\align
|H_{n,k}(f|G,V_1,V_2)|&\le\frac1{k!}
\left(\sum\Sb (l_1,\dots,l_k)\colon \\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb \int
f^2(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y)
\rho(\,dy)\right)^{1/2}\\
&\qquad \frac1{k!}\left(\sum\Sb (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \text{ if }j\neq j'\endSb
\int f^2(\xi_{l_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l_k}^{(k,\delta_k(V_2))},y) \rho(\,dy)\right)^{1/2} \\
&\tag17.12
\endalign
$$
with $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$.
Relation (17.12) can be proved for instance by bounding first each
integral in formula (16.9) by means of the Schwarz inequality, and
then by bounding the sum appearing in such a way by means of the
inequality
$\sum |a_jb_j|\le \left(\sum a_j^2\right)^{1/2}
\left(\sum b_j^2\right)^{1/2}$.
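Written out in detail, for each fixed pair of index sequences the
Schwarz inequality yields
$$
\int |f(x_1,\dots,x_k,y)f(x_1',\dots,x_k',y)|\,\rho(\,dy)
\le\left(\int f^2(x_1,\dots,x_k,y)\rho(\,dy)\right)^{1/2}
\left(\int f^2(x_1',\dots,x_k',y)\rho(\,dy)\right)^{1/2}
$$
with the arguments $x_p=\xi_{l_p}^{(p,\delta_p(V_1))}$ and
$x_p'=\xi_{l_p'}^{(p,\delta_p(V_2))}$, $1\le p\le k$, and the sum of
the products at the right-hand side is bounded by means of the above
inequality for sums.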
Observe that in the case $|e(G)|=k$ the summation in~(16.9) is
taken for such vectors $(l_1,\dots,l_k,l_1',\dots,l_k')$ for
which $(l_1',\dots,l_k')$ is a permutation of the sequence
$(l_1,\dots,l_k)$ determined by the diagram~$G$. Hence the sum we
get after applying the Schwarz inequality for each integral
in~(16.9) has the form $\sum a_jb_j$ where the set of
indices~$j$ in this sum agrees with
the set of vectors $(l_1,\dots,l_k)$ such that $1\le l_p\le n$
for all $1\le p\le k$, and $l_p\neq l_{p'}$ if $p\neq p'$.
By formula (17.12)
$$ \allowdisplaybreaks
\align
&\left\{\omega\colon\;\sup_{f\in{\Cal F}}
\left |H_{n,k}(f|G,V_1,V_2)(\omega)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right\} \\
&\qquad \subset
\biggl\{\omega\colon\; \sup_{f\in{\Cal F}} \!\!\!\! \sum\Sb
(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb \!\!\!\!\! \int
f^2(\xi_{l_1}^{(1,\delta_1(V_1))}(\omega),
\dots,\xi_{l_k}^{(k,\delta_k(V_1))}
(\omega),y) \rho(\,dy) \\
&\hskip9truecm >\frac {A^2n^{2k}\sigma^{2(k+1)}k!}
{2^{4k+1}} \biggr\}\\
&\qquad\qquad \cup \biggl\{\omega\colon\; \sup_{f\in{\Cal F}} \!\!\!\!
\sum\Sb (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \; 1\le j\le k, \\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb \!\!\!\!\! \int
f^2(\xi_{l_1}^{(1,\delta_1(V_2))}(\omega),\dots,
\xi_{l_k}^{(k,\delta_k(V_2))}
(\omega),y)\rho(\,dy) \\
&\hskip9truecm >\frac{A^2n^{2k}\sigma^{2(k+1)}k!}{2^{4k+1}}\biggr\},
\endalign
$$
hence
$$ \allowdisplaybreaks
\align
&P\left(\sup_{f\in{\Cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right) \tag17.13 \\
&\qquad \le 2P\left(\sup_{f\in{\Cal F}}\frac1{k!}
\sum\Sb (l_1,\dots,l_k)\colon\;
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
h_f(\xi_{l_1}^{(1,1)},\dots,\xi_{l_k}^{(k,1)})
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right)
\endalign
$$
with the functions $h_f(x_1,\dots,x_k)=\int
f^2(x_1,\dots,x_k,y)\rho(\,dy)$, $f\in{\Cal F}$. (In this upper bound
we could get rid of the terms $\delta_j(V_1)$ and $\delta_j(V_2)$,
i.e.\ of the dependence of the expression $H_{n,k}(f|G,V_1,V_2)$ on
the sets $V_1$ and $V_2$, since the probabilities of the events in
the previous formula do not depend on them.)
I claim that
$$
P\left(\sup\limits_{f\in{\Cal F}} |\bar I_{n,k}(h_f)|
\ge2^k An^k \sigma^2\right)\le
2^k e^{-A^{1/2k}n\sigma^2} \quad \text{for }A\ge A_0 \tag17.14
$$
if the constant $A_0=A_0(k)$ is chosen sufficiently large in
Proposition~15.4. Relation (17.14) together with the relation
$A^2\frac{n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\ge2^kA n^k\sigma^2$ (if
$A>A_0$ with a sufficiently large~$A_0$) imply that the probability
at the right-hand side of (17.13) can be bounded by
$2^{k+1}e^{-A^{1/2k}n\sigma^2}$, and the estimate~(17.11)
holds in the case $|e(G)|=k$.
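Let us remark that the relation
$A^2\frac{n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\ge2^kAn^k\sigma^2$
applied here is equivalent to the inequality
$$
A(n\sigma^2)^k\ge2^{5k+1},
$$
and since $(n\sigma^2)^k\ge2^{-k}$, it holds for instance with the
choice $A_0=2^{6k+1}$.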
Relation (17.14) is similar to relation (17.3) (together with the
definition of the random set~$H$ in formula~(17.2)), and a
modification of the proof of the latter estimate yields the proof
also in this case. Indeed, it follows from the conditions of
Proposition~15.4 that
$0\le\int h_f(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$
for all $f\in{\Cal F}$, and it is not difficult to check that
$\sup|h_f(x_1,\dots,x_k)|\le2^{-2(k+1)}$, and the class of functions
${\Cal H}=\{2^kh_f,\; f\in{\Cal F}\}$ is an $L_2$-dense class with
exponent $L$ and parameter $D$. Hence by applying the Hoeff\-ding
decomposition of the functions $h_f$, $f\in {\Cal F}$, similarly to
formula (17.4) we get for all $V\subset \{1,\dots,k\}$ such a set
of functions $\{(h_f)_V,\,f\in{\Cal F}\}$, which satisfies the
conditions of Proposition~15.3. Hence a natural adaptation of the
estimate given for the expression at the right-hand side of (17.5)
(with the help of~(17.6) and the investigation of
$\bar I_{|V|}(f_V)$ for $V=\emptyset$) yields the proof of
formula (17.14). We only have to replace
$S^2_{n,k}(f)$ by $\bar I_{n,k}(h_f)$, then $\bar I_{n,|V|}(f_V)$ by
$\bar I_{n,|V|}((h_f)_V)$ and the levels $2^kA^{4/3}n^k\sigma^2$
and $A^{4/3}n^k\sigma^2$ by $2^kAn^k\sigma^2$ and $An^k\sigma^2$.
Let us observe that each term of the upper bound we get in such
a way can be directly bounded, since during the proof of
Proposition~15.4 for parameter~$k$ we may assume that the result
of Proposition~15.3 holds also for this parameter~$k$.
In the case $|e(G)|<k$ we write the random variable
$H_{n,k}(f|G,V_1,V_2)$ in the form~(17.15), which contains the
`symmetrizing terms' $\varepsilon_j$, together with the random
variable $S^2({\Cal F}|G,V_1,V_2)$ appearing in this representation.
First we show that
$$
P\left(S^2({\Cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2} \quad\text{if }A\ge A_0\text{ and }
|e(G)|<k. \tag17.16
$$
It can be seen that
$$
P\left(S^2({\Cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4\right) \le
2P\left(\sup\limits_{f\in{\Cal F}}
\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right)
$$
with $h_f(x_1,\dots,x_k)=\int f^2(x_1,\dots,x_k,y)\rho(\,dy)$.
(Here we exploited that in the last formula $S^2({\Cal F}|G,V_1,V_2)$
is bounded by the product of two random variables whose distributions
do not depend on the sets $V_1$ and $V_2$.) Thus to prove inequality
(17.16) it is enough to show that
$$
2P\left(\sup\limits_{f\in{\Cal F}}
\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2} \quad \text{if } A\ge A_0. \tag17.21
$$
Actually formula (17.21) follows from the already proven formula
(17.14), only the parameter $A$ has to be replaced by $A^{4/3}$
in it.
With the help of relation (17.16) the proof of Proposition~15.4
can be completed similarly to Proposition~15.3. The following
version of inequality~(17.7) can be proved with the help of the
multivariate version of Hoeff\-ding's inequality, Theorem~13.3, and
the representation of the random variable $H_{n,k}(f|G,V_1,V_2)$ in
the form~(17.15).
$$
\aligned
&P\left(\left.|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+2}k!} n^{2k}\sigma^{2(k+1)}
\right| \xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\right)(\omega)\\
&\qquad \le Ce^{-2^{-(6+2/k)} A^{2/3k}n\sigma^2} \quad
\text{if}\quad S^2({\Cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \text{ and }A\ge A_0
\endaligned\tag17.22
$$
with an appropriate constant $C=C(k)>0$ for all $f\in{\Cal F}$ and
$G\in {\Cal G}$ such that $|e(G)|<k$. Actually Theorem~13.3 yields
such an upper bound with the exponent $A^{2/3j}$ in place of
$A^{2/3k}$ and with some constant $C=C(j)>0$, where $2j=2k-2|e(G)|$, and
$0\le |e(G)|\le k-1$. Since $j\le k$, $n\sigma^2\ge\frac12$,
and also $\frac{A^{4/3}}{2^{10k+4}}\ge2$ if $A_0$ is chosen
sufficiently large we can write in the above upper bound for
the left-hand side of~(17.22) $j=k$, and in such a way we get
inequality~(17.22).
The next inequality in which we estimate
$\sup\limits_{f\in{\Cal F}}H_{n,k}(f|G,V_1,V_2)$ is a natural
version of formula~(17.9) in the proof of Proposition~15.3.
$$
\align
&P\left(\left.\sup_{f\in{\Cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+1}k!} n^{2k}\sigma^{2(k+1)}
\right| \xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\right)
(\omega)\\
&\qquad \le C \left(1+D\left(\frac{2^{4k+3}k!}
{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2} \\
&\qquad \text{if } S^2({\Cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \text{ and } A\ge A_0 \tag17.23
\endalign
$$
for all $G\in{\Cal G}$ such that $|e(G)|<k$. Relation~(17.23) can
be proved similarly to formula~(17.9) in the proof of
Proposition~15.3. With an appropriate choice of the number $\delta$
and of the functions $f_{p_l(x^{(n)})}$, $1\le l\le m$, constructed
as in that proof we can write
$$
\align
&P\left(\left.\sup_{f\in{\Cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}
\right| \xi^{(j,\pm1)}_{l},
\,1\le l\le n,\,1\le j\le k\right)(\omega)\\
&\qquad\le \sum_{l=1}^m
P\biggl(\left. |H_{n,k}(f_{p_l(\xi^{(n)}(\omega))}|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right| \\
&\qquad\qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)
\endalign
$$
for almost all~$\omega$. The last inequality together with~(17.22)
and the inequality $m=\max(1,D\delta^{-L})\le 1+D
\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L$ imply
relation~(17.23).
It follows from relations (17.16) and (17.23) that
$$
\align
&P\left(\sup_{f\in{\Cal F}}|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2}\\
&\qquad + C
\left(1+D\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2}
\quad\text{if }A\ge A_0
\endalign
$$
for all $V_1,V_2\subset\{1,\dots,k\}$ and diagram $G\in{\Cal G}$
such that $|e(G)|\le k-1$. This inequality implies that
relation~(17.11) holds also in the case $|e(G)|\le k-1$ if the
constant $A_0$ is chosen sufficiently large in Proposition~15.4,
and this completes the proof of Proposition~15.4. Indeed, to prove
relation~(17.11) in the case $|e(G)|\le k-1$ we still have to
show that
$D(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}})^L
\le e^{\text{const.}\, n\sigma^2}$
if $A>A_0$ with a sufficiently large~$A_0$, since this implies that
the second term at the right-hand side of our last estimate is not too
large.
This follows from the inequality $n\sigma^2\ge L\log n+\log D$
which implies that
$$
\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L\le
\left(\frac{n^{(k+1)}}{(2n\sigma^2)^{(k+1)}}\right)^L
\le e^{(k+1)L\log n}\le e^{{(k+1)}n\sigma^2}
$$
if $A_0$ is sufficiently large, and $D=e^{\log D}\le e^{n\sigma^2}$.
\beginsection 18. An overview of the results in this work.
I discuss briefly the problems investigated in this work,
recall some basic results related to them, and also give some
references. I also write about the background of these problems
which may explain the motivation for their study.
I met the main problem considered in this work when I tried to
adapt the method of proof of the central limit theorem for
maximum-likelihood estimates to some more difficult questions about
so-called non-parametric maximum likelihood estimate problems.
The Kaplan--Meyer estimate for the empirical distribution function
with the help of censored data investigated in the second section
is such a problem. It is not a maximum-likelihood estimate in the
classical sense, but it can be considered as a non-parametric
maximum likelihood estimate. In the estimation of the empirical
distribution function with the help of censored data we cannot
apply the classical maximum likelihood method, since in the
solution of this problem we have to choose our estimate from too
large a class of distribution functions. The main problem is
that there is no dominating measure with respect to which all
candidates which may appear as our estimate have a density
function. A natural way to overcome this difficulty is to choose
an appropriate smaller class of distribution functions, to compare
the probability of the appearance of the sample we observed with
respect to all distribution functions of this class and to
choose that distribution function as our estimate for which this
probability takes its maximum.
The Kaplan--Meier estimate can be found on the basis of the above
principle in the following way. Let us estimate the distribution
function $F(x)$ of the censored data simultaneously with the
distribution function $G(x)$ of the censoring data. (We have a
sample of size $n$ and know which sample elements are censored and
which are censoring data.) Let us consider the class of pairs of
estimates $(F_n(x),G_n(x))$ of the pair $(F(x),G(x))$ for which
the distribution function $F_n(x)$ is concentrated on the censored
sample points and the distribution function $G_n(x)$ is
concentrated on the censoring sample points; more precisely, let us
also assume that if the largest sample point is a censored point,
then the distribution function $G_n(x)$ of the censoring data takes
still another value, larger than any sample point, and if it is a
censoring point, then the distribution function $F_n(x)$ of the
censored data takes still another value larger than any sample
point. (This modification at the end of the definition is needed,
since if the largest sample point comes from the class of censored
data, then the distribution function $G(x)$ of the censoring data
must be strictly less than~1 at this point, and if it comes from
the class of censoring data, then the value of the distribution
function $F(x)$ of the censored data must be strictly less than~1
at this point.) Let us take this class of pairs of distribution
functions $(F_n(x),G_n(x))$, and let us choose as the
(non-parametric maximum likelihood) estimate that pair of the class
with respect to which our observation has the greatest probability.
The above extremal problem for the pairs of distribution functions
$(F_n(x),G_n(x))$ can be solved explicitly (see~[25]), and it
yields the estimate $F_n(x)$ written down in formula~(2.3).
(The function $G_n(x)$ satisfies a similar relation, only the roles
of the random variables~$X_j$ and~$Y_j$ and of the events
$\delta_j=1$ and $\delta_j=0$ have to be interchanged in it.)
Then, as I have indicated,
a natural analogue of the linearization procedure applied in the
proof of the central limit theorem for the classical maximum
likelihood estimate works also in this case, and there is only
one really hard part of the proof: we have to show that the
linearization procedure gives a small error. The estimation of
this error led to the problem of finding a good estimate on the
tail distribution of the integral of an appropriate function of two
variables with respect to the product of a normalized empirical
measure with itself. Moreover, as a more detailed investigation
showed, we actually need the solution of a more general problem,
where we have to bound the tail distribution of the supremum of
a class of such integrals. The main subject of this work is to
solve the above problems in a more general setting, to estimate
not only two-fold but also $k$-fold random integrals and the
supremum of such integrals for an appropriate class of kernel
functions with respect to a normalized empirical distribution
for all~$k\ge1$.
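In formula, the central object of this work can be written (up to
the precise normalization, which is fixed only in the main text) as
the random integral
$$
J_{n,k}(f)=\int f(x_1,\dots,x_k)\,
\sqrt n(\mu_n-\mu)\,(dx_1)\dots\sqrt n(\mu_n-\mu)\,(dx_k),
$$
where the domain of integration is the set of those points of $X^k$
for which $x_s\neq x_{s'}$ if $s\neq s'$, i.e.\ the diagonals are
omitted. (The notation $J_{n,k}(f)$ is introduced here only for
the purposes of this introduction.)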
The proof of the limit theorem for the Kaplan--Meier estimate
explained in this work applies the explicit form of this estimate.
It would be interesting to find a modification of this proof
that exploits only the fact that the Kaplan--Meier estimate is the
solution of an appropriate extremal problem. We may expect that
such a proof could be extended to a result about the limit
behaviour of a wide class of non-parametric maximum likelihood
estimates.
Such a consideration was behind the remark of Richard Gill I quoted
at the end of Section~2.
A detailed proof, together with a sharp estimate on the speed of
convergence for the limit behaviour of the Kaplan--Meier estimate,
based on the ideas presented in Section~2, is given in paper~[38].
Paper~[39] explains more about its background, and it also
discusses the solution of some other non-parametric maximum
likelihood problems. The results about multiple integrals with
respect to a normalized empirical distribution function needed in
these works were proved in~[30]. These results were satisfactory
for the study in~[38], but they also have some drawbacks. They do
not show that if the random integrals we are considering have
small variances, then they satisfy better estimates. Besides this,
if we consider the supremum of random integrals of an appropriate
class of functions, then these results can be applied only in
very special cases. Moreover, the method of proof of~[30] did not
allow a real generalization of these results, hence I had to find
a different approach when I tried to generalize them.
I do not know of other works where the distribution of multiple
random integrals with respect to a normalized empirical distribution
is studied. On the other hand, there are some works where the
distribution of (degenerate) $U$-statistics is investigated. The
most important results obtained in this field are contained in the
book of de la Pe\~na and Gin\'e {\it Decoupling, From Dependence to
Independence}\/~[7]. The problems about the behaviour of degenerate
$U$-statistics and multiple integrals with respect to a normalized
empirical distribution function are closely related, but the
explanation of their relation is far from trivial. The main
difference between them is that integration with respect to
$\mu_n-\mu$ instead of the empirical distribution $\mu_n$ means
some sort of normalization, while this normalization is missing
in the definition of $U$-statistics. I return to this question
later.
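Let us also recall, in a standard form, the other notion appearing
here. Given the above iid random variables $\xi_1,\dots,\xi_n$
with distribution $\mu$ and a kernel function $f$ of $k$ variables,
the corresponding $U$-statistic is
$$
I_{n,k}(f)=\sum f(\xi_{j_1},\dots,\xi_{j_k}),
$$
where the summation goes over all $k$-tuples of different indices
$1\le j_s\le n$, $1\le s\le k$, and the $U$-statistic is called
degenerate if its kernel function satisfies the relation
$$
\int f(x_1,\dots,x_k)\,\mu(\,dx_s)=0
\qquad\text{for all $1\le s\le k$}
$$
and almost all values of the remaining arguments. (Possible
normalizing factors in the definition of $I_{n,k}(f)$ are
suppressed in this informal recollection; the precise definitions
are given in the main text.)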
The main part of this work starts with Section~3. A general
overview of the results, without the hard technical details, can
be found in~[33].
First the estimation of sums of independent random variables,
or one-fold random integrals with respect to a normalized empirical
distribution, and of the supremum of such expressions is
investigated in Sections~3 and~4. This question has a fairly large
literature. I would mention first of all the books {\it A course on
empirical processes}\/~[11], {\it Real Analysis and
Probability}\/~[12] and {\it Uniform Central Limit
Theorems}\/~[13] of R.~M.~Dudley. These books contain a much more
detailed description of empirical processes than the present work,
together with many interesting results.
Section~3 deals with the tail behaviour of sums of independent and
bounded random variables with expectation zero. The proofs of two
by now classical results, Bernstein's and Bennett's inequalities,
are given there. (Their proofs can be found e.g. in Theorem~1.3.2
of~[13] and in~[5].) We are also interested in the question of when
they give the kind of estimate that the central limit theorem
suggests. Actually, as explained in Section~3, Bennett's inequality
gives a bound suggested by a Poissonian approximation of partial
sums of independent random variables. Bernstein's inequality
provides an estimate suggested by the central limit theorem if the
variance of the sum we consider is not too small. (The results in
Section~3 state this more explicitly.) If the variance
of the sum is too small, then Bennett's inequality provides a
slight improvement of Bernstein's inequality. Moreover, as
Example~3.3 shows, Bennett's inequality is essentially sharp in
this case.
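For orientation let us recall one standard formulation of these
inequalities; the version proved in Section~3 may differ in its
normalization and constants. If $\xi_1,\dots,\xi_n$ are independent
random variables with $E\xi_j=0$ and $|\xi_j|\le b$, $1\le j\le n$,
and $V=\sum\limits_{j=1}^n E\xi_j^2$, then Bernstein's inequality
states that
$$
P\left(\sum_{j=1}^n\xi_j>u\right)
\le\exp\left\{-\frac{u^2}{2\left(V+\frac{bu}3\right)}\right\}
\qquad\text{for all $u>0$,}
$$
while Bennett's inequality states that
$$
P\left(\sum_{j=1}^n\xi_j>u\right)
\le\exp\left\{-\frac V{b^2}\,h\left(\frac{bu}V\right)\right\},
\qquad h(x)=(1+x)\log(1+x)-x.
$$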
The estimate on the tail distribution of a sum of independent random
variables is weak if this sum has a small variance. This means that
in this case the probability that the sum is larger than a given
value may be much larger than the (rather small) value suggested by
the central limit theorem. Such a behaviour may occur because the
contribution of some unpleasant irregularities to this probability
may be non-negligible in the case of a small variance.
In the study of the supremum of sums of independent random variables
a good control is needed on the tail distribution of the (supremum
of) sums of independent random variables even if they have small
variance. The solution of this problem (and of its natural
multivariate version) turned out to be the hardest part of this
work. The results based on the similar behaviour of partial sums
and of their Gaussian counterparts are not sufficient in this case;
some new ideas have to be applied. In the proof of sharp estimates
in this case we also use some kind of symmetrization argument.
The last result of Section~3, Hoeff\-ding's inequality, presented
in Theorem~3.4, is an important ingredient of these symmetrization
arguments. It is also a classical result, whose proof can be found
for instance in~[23].
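In its classical form this inequality states: if
$\varepsilon_1,\dots,\varepsilon_n$ are independent random
variables with
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, and
$a_1,\dots,a_n$ are arbitrary real numbers, then
$$
P\left(\sum_{j=1}^n a_j\varepsilon_j>u\right)
\le\exp\left\{-\frac{u^2}{2\sum_{j=1}^n a_j^2}\right\}
\qquad\text{for all $u>0$.}
$$
(Whether Theorem~3.4 is formulated exactly in this form or in a
slightly more general one does not matter for this overview.)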
Section~4 contains the one-variate version of our main result
about the supremum of the integrals of a class ${\Cal F}$ of
functions with respect to a normalized empirical measure, together
with an equivalent statement about the tail distribution of the
supremum of a class of random sums defined with the help of a
sequence of independent and identically distributed random
variables and a class of functions ${\Cal F}$ with some nice
properties. These results are formulated in Theorems~4.1 and~$4.1'$.
A Gaussian version of them, about the distribution of the supremum
of a Gaussian random field with some appropriate properties, is
also presented in Theorem~4.2. These results can be interpreted as
saying that if we take the supremum of random integrals or of
random sums determined by a nice class of functions ${\Cal F}$
in the way described in Section~4, then the tail distribution of
this supremum satisfies almost as good an estimate as the `worst
element' of the random variables taking part in this supremum. I
also discuss a result in Example~4.3 which shows that some rather
technical conditions of Theorem~4.1 cannot be omitted.
The most important condition in Theorem~4.1 is that the class of
functions ${\Cal F}$ considered in it is $L_2$-dense. This
property was introduced before the formulation of this result.
One may ask whether one can prove a better version of this result,
in which a similar bound holds for a different, possibly larger
class of functions~${\Cal F}$. It is worth mentioning that
Talagrand proved results similar to Theorem~4.1 for different
classes of functions~${\Cal F}$ in his book~[52]. These classes
of functions are very different from ours, and Talagrand's results
seem to be incomparable with ours. I return to this question later.
In the above mentioned results we imposed the condition that
the class of functions~${\Cal F}$, or, what is equivalent, the set
of random variables whose supremum we estimate, is countable.
This condition is really exploited in the proofs. On the other
hand, in some important applications we also need results about
the supremum of a possibly uncountable set of random variables.
To handle such cases I introduced the notion of countably
approximable classes of random variables and proved that in the
results of this work the condition of countability can be
replaced by the weaker condition that the supremum of a countably
approximable class is taken. R.~M.~Dudley worked out a different
method to handle the supremum of possibly uncountably many
random variables, and his method is generally applied in the
literature. The relation between these two methods deserves
some discussion.
Let us first recall that if we take a class of random variables $S_t$,
$t\in T$, indexed by some index set $T$ and consider a set $A$,
measurable with respect to the $\sigma$-algebra generated by the
random variables $S_t$, $t\in T$, then there exists a countable
subset $T'=T'(A)\subset T$ such that the set $A$ is measurable also
with respect to the smaller $\sigma$-algebra generated by the random
variables $S_t$, $t\in T'$. Besides this, if the finite dimensional
distributions of the random variables $S_t$, $t\in T$, are given,
then by the results of classical measure theory the probability
of each event measurable with respect to the $\sigma$-algebra
generated by these random variables $S_t$, $t\in T$, is also
determined. But we cannot get the probability of all events we
are interested in this way. In particular, if $T$ is an
uncountable set, then the events
$\left\{\omega\colon\;\sup\limits_{t\in T}S_t(\omega)>u\right\}$ are
non-measurable with respect to the above $\sigma$-algebra, and
generally we cannot speak of their probabilities. To overcome
this difficulty Dudley worked out a theory which enabled him to
work also with outer measures. His theory is based on some
rather deep results of analysis. It can be found for
instance in his book~[13].
I restricted my attention to cases when after the completion of
the probability measure $P$ we can also speak of the real (and not
only outer) probabilities $P\left(\sup\limits_{t\in T}S_t>u\right)$.
I tried to
find appropriate conditions under which these probabilities really
exist. More explicitly, we are interested in the case when for all
$u>0$ there exists some set $A=A_u$, measurable with respect to the
$\sigma$-algebra generated by the random variables $S_t$, $t\in T$,
such that the symmetric difference of the sets $A_u$ and
$\left\{\omega\colon\;\sup\limits_{t\in T}S_t(\omega)>u\right\}$
is contained
in a set, measurable with respect to the $\sigma$-algebra generated
by the random variables $S_t$, $t\in T$, which has probability
zero. In such a case the probability
$P\left(\sup\limits_{t\in T}S_t>u\right)$
can be defined as $P(A_u)$. This approach led me to the definition
of countably approximable classes of random variables. If this
property holds, then we can speak about the probability of the
event that the supremum of the random variables we are interested
in is larger than some fixed value. I proved a simple but
useful result in Lemma~4.4 which provides a condition for the
validity of this property. In Lemma~4.5 I proved with its help
that an important class of functions is countably approximable. It
seems that this property can be proved for many other interesting
classes of functions with the help of Lemma~4.4, but I did not
investigate this question in more detail.
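In formula, the above requirement reads as follows: for every
$u>0$ there should exist sets $A_u$ and $B_u$, both measurable
with respect to the $\sigma$-algebra generated by the random
variables $S_t$, $t\in T$, such that $P(B_u)=0$ and
$$
A_u\,\triangle\,
\left\{\omega\colon\;\sup_{t\in T}S_t(\omega)>u\right\}
\subset B_u,
$$
where $\triangle$ denotes symmetric difference; then we may put
$P\left(\sup\limits_{t\in T}S_t>u\right)=P(A_u)$.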
The problem we met here is not an abstract, technical difficulty.
Indeed, the distribution of such a supremum may become different
if we modify each random variable on a set of probability zero,
although the finite dimensional distributions of the random
variables we consider remain the same after such an operation.
Hence, if we are interested in the probability of the supremum
of an uncountable set of random variables with prescribed finite
dimensional distributions, we have to describe more explicitly
which version of this set of random variables we consider. It
is natural to look for an appropriate version of the
random field $S_t$, $t\in T$, whose `trajectories' $S_t(\omega)$,
$t\in T$, have nice properties for all elementary events
$\omega\in\Omega$. Lemma~4.4 can be interpreted as a result in
this spirit. The condition given for the countable
approximability of a class of random variables at the end of
this lemma can be considered as a smoothness type condition on
the `trajectories' of the random field we consider. This
approach shows some analogy with important problems in the
theory of stochastic processes, where a regular version of a
stochastic process is considered and the smoothness properties
of its trajectories are investigated.
In our problems the version of the set of random variables $S_t$,
$t\in T$, we shall work with appears in a simple and natural
way. In these problems we have finitely many random variables
$\xi_1,\dots,\xi_n$ at the start, and all random variables
$S_t(\omega)$, $t\in T$, we are considering can be defined
individually for each $\omega$ as a function of these random
variables $\xi_1(\omega),\dots,\xi_n(\omega)$. We take the
version of the random field $S_t(\omega)$, $t\in T$, we get in
this way and want to show that it is countably approximable.
In Section~4 this property is proved in an important model,
probably the most important model in the possible applications
we are interested in. In more complicated situations, when our
random variables are not defined as a function of finitely
many sample points, for instance when we define our set of
random variables by means of integrals with respect to a
Gaussian random field, it is harder to find the right regular
version of our set of random variables. In this case the
integrals we consider are defined only with probability~1, and it
demands some extra work to find their right version. But in
the problems we study in this work such an approach is satisfactory
for our purposes, and it is simpler than that of Dudley; we do not
have to follow his rather difficult technique. On the other hand,
I must admit that I do not know the precise relation between the
approach of this work and that of Dudley.
In Section~4 the notion of $L_p$-dense classes, $1\le p<\infty$,
was also introduced. The notion of $L_2$-dense classes
appeared in the formulation of Theorems~4.1 and~$4.1'$. It can be
considered as a version of the $\varepsilon$-entropy, discussed
at many places in the literature. On the other hand, there
seems to be no standard definition of the
$\varepsilon$-entropy. The notion of $L_2$-dense
classes seemed to be the appropriate one to work with in
this work. To apply the results related to $L_2$-dense classes we
also need some knowledge about how to check this property in
concrete models. For this goal I discuss here
Vapnik--\v{C}ervonenkis classes, a popular and important notion of
modern probability theory. Several books and papers (see e.g. the
books~[13], [44],~[53] and the references in them) deal with this
subject. An important result in this field is Sauer's lemma
(Lemma~5.1), which together with some other results, like
Lemma~5.3, implies that several interesting classes of sets or
functions are Vapnik--\v{C}ervonenkis classes.
I put the proofs of these results in the Appendix, partly because
they can be found in the literature, partly because in this work
Vapnik--\v{C}ervonenkis classes play a different and less important
role than at other places. Here Vapnik--\v{C}ervonenkis classes are
applied to show that certain classes of functions are $L_2$-dense.
A result of Dudley formulated in Lemma~5.2 implies that a
Vapnik--\v{C}ervonenkis class of functions with absolute values
bounded by a fixed constant is an $L_1$-dense and, as a
consequence, also an $L_2$-dense class of functions. The proof of
this important result, which seems to be less known than it
deserves, even among experts of this subject, is contained in the
main text. Dudley's original result was formulated in the special
case when the functions we consider are indicator functions of
some sets, but its proof contains all the important ideas needed
in the proof of Lemma~5.2.
Theorem 4.2, which is the Gaussian counterpart of Theorems~4.1
and~$4.1'$, is proved in Section~6 by means of a natural and
important technique called the chaining argument. This means the
application of an inductive procedure, in which an appropriate
sequence of finite subsets of the original set of random variables
is introduced, and a good estimate is given on the supremum of
the random variables in these subsets at each step. The subsets
become denser subsets of the original set of random variables
at each step of this procedure. The chaining argument is a popular
method in certain investigations. It is hard to say to whom it
should be attributed; its introduction may be connected to some
works of R.~M.~Dudley. It is worth mentioning that Talagrand~[52]
worked out a sharpened version of it which yields a sharper and
more useful estimate in the study of certain problems. But it
seems to me that in the study of the problems of this work this
improvement has a limited importance; it turns out to be useful
in the study of different problems.
Theorem 4.2 can be proved by means of the chaining argument, but
this method is not strong enough to supply a proof of Theorem~4.1.
The chaining argument provides only a weak estimate in this case,
because there is no good estimate on the probability that a sum of
independent random variables is greater than a prescribed value if
these random variables have too small variances. As a consequence,
the chaining argument supplies a much weaker estimate than the result
we want to prove under the conditions of Theorem~4.1. Lemma~6.1
contains the result the chaining argument yields under these
conditions. In Section~6 still another result, Lemma~6.2, is
formulated. It can be considered as a special case of Theorem~4.1
where only the supremum of partial sums with small variances is
estimated. It is also shown that Lemmas~6.1 and~6.2 together imply
Theorem~4.1. The proof is not difficult, despite some
unattractive details. It has to be checked that the parameters
in Lemmas~6.1 and~6.2 can be fitted to each other.
Lemma~6.2 is proved in Section~7. It is based on a symmetrization
argument. This proof applies the ideas of a paper of Kenneth
Alexander~[2], and although its presentation is different from
Alexander's approach, it can be considered as a version of his
proof.
A similar problem should also be mentioned at this place.
M.~Talagrand wrote a series of papers about concentration
inequalities (see e.g. [50] or [51]), and his research was also
continued by some other authors. I would mention the works of
M.~Ledoux~[27] and P.~Massart~[41]. Concentration inequalities
give a bound on the difference between the supremum of a set of
appropriately defined random variables and the expected value of
this supremum. They express how strongly this supremum is
concentrated around its expected value. Such results are closely
related to Theorem~4.1, and the discussion of their relation
deserves some attention. A typical concentration inequality is
the following result of Talagrand~[51].
\medskip\noindent
{\bf Theorem 18.1. (Theorem of Talagrand).} {\it Consider $n$
independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ with values in some measurable space
$(X,{\Cal X})$. Let ${\Cal F}$ be some countable family of
real-valued measurable functions on $(X,{\Cal X})$ such that
$\|f\|_\infty\le b<\infty$ for every $f\in{\Cal F}$. Let
$Z=\sup\limits_{f\in{\Cal F}}\sum\limits_{i=1}^n f(\xi_i)$ and
$v=E\left(\sup\limits_{f\in{\Cal F}}\sum\limits_{i=1}^n f^2(\xi_i)\right)$.
Then for every positive number~$x$,
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac 1{K'}\frac
xb\log\left(1+\frac{xb}v\right)\right\}
$$
and
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
$$
where $K$, $K'$, $c_1$ and $c_2$ are universal positive constants.
Moreover, the same inequalities hold when replacing $Z$ by $-Z$.}
\medskip
Theorem~18.1 yields, similarly to Theorem~4.1, an estimate on
the distribution of the supremum for a class of sums of independent
random variables. It can be considered as a generalization of
Bernstein's and Bennett's inequalities to the case when the
distribution of the supremum of partial sums (and not only the
distribution of one partial sum) is estimated. A remarkable feature
of this result is that it assumes no condition about the structure
of the class of functions ${\Cal F}$ (like the $L_2$-density of
the class ${\Cal F}$ imposed in Theorem~4.1). On the
other hand, the estimates in Theorem~18.1 contain the quantity
$EZ=E\left(\sup\limits_{f\in{\Cal F}}
\sum\limits_{i=1}^n f(\xi_i)\right)$. Such an
expectation of some supremum appears in all concentration
inequalities. As a consequence, they are useful only if we can
bound the expected value of the supremum we want to estimate. This
is a hard question in the general case. There is a paper~[16] which
provides a useful estimate on the expected value of the
supremum of random sums under the conditions of Theorem~4.1.
But I preferred a direct proof of this result. Let me remark
that because of the above mentioned concentration inequality the
condition $u\ge\text{const.}\,\sigma\log^{1/2}\frac2\sigma$
(with some appropriate constant), which cannot be dropped from
Theorem~4.1, can be interpreted as saying that under the conditions
of Theorem~4.1 the quantity
$\text{const.}\,\sigma\log^{1/2}\frac2\sigma$ is an upper bound
for the expected value of the supremum we are studying.
It is also worth mentioning Talagrand's work~[52], which
contains several interesting results similar to Theorem~4.1.
But despite their formal similarity, they are essentially
different from the results of this work. This difference
deserves some special discussion.
Talagrand proved in~[52], by working out a more refined, better
version of the chaining argument, a sharp upper bound for the
expected value $E\sup\limits_{t\in T}\xi_t$ of the supremum of
countably many (jointly) Gaussian random variables with zero
expectation. This result is sharp. Indeed, Talagrand also proved
a lower bound for this expected value, and the ratio of his
upper and lower bounds is bounded by a universal constant.
By applying similar arguments he also gave an upper bound for
$E\sup\limits_{f\in{\Cal F}}\sum\limits_{k=1}^N f(\xi_k)$
in Proposition~2.7.2 of his book, where $\xi_1,\dots,\xi_N$ is a
sequence of independent, identically distributed random variables
with some known distribution~$\mu$, and ${\Cal F}$ is a class of
functions with some nice properties. Then in Chapter~3 of his book
he proved with the help of this result some estimates for certain
models which solved some problems that could not be solved with
the help of the original version of the chaining argument.
Let us make a short comparison between the results of this
work and those of Talagrand.
Talagrand investigated in his book~[52] the expected value of the
supremum of partial sums, while we gave an estimate on its tail
distribution. But this is not a great difference. Talagrand's
results also give an estimate on the tail distribution of the
supremum by means of concentration inequalities, and actually his
proofs also provide a direct estimate for the tail distribution
we are interested in without the application of these results.
The main difference between the two works is that Talagrand's
method gives a sharp estimate for different classes of
functions~${\Cal F}$.
Talagrand could prove sharp results in cases when the class
of functions ${\Cal F}$ for which the supremum is taken consists of
smooth functions. An example of such a class of functions, which he
thoroughly investigated, is the class of Lipschitz~1 functions. On
the other hand, we can give sharp results in cases when
${\Cal F}$ consists of non-smooth functions. (See Example~5.5.)
This difference in the conditions of the results in these two
books is not a small technical detail. Talagrand heavily exploited
in his proof that he worked with classes of functions~${\Cal F}$
from which he could select a subclass of functions of
relatively small cardinality which is dense in ${\Cal F}$ not
only in the $L_2(\mu)$-norm with the probability measure~$\mu$ he
was working with, but also in the
supremum norm. He needed this property, because it enabled
him to get sharp estimates on the tail distribution of the
differences of the functions he had to work with by means of
Bernstein's inequality. He needed such estimates to apply (a
refined version of) the chaining argument. On the other hand,
we considered classes of functions ${\Cal F}$ which may have
no small subclasses that are dense in ${\Cal F}$ in the supremum
norm. I would characterize the difference between the results
of the two works in the following way: Talagrand proved the
sharpest possible estimates which can be obtained by a
refinement of the chaining argument, while our main problem
was to get sharp estimates also in cases when the
chaining argument does not work.
\medskip
The main results of this work are presented in Section~8. A weaker
version of Theorem~8.3, an estimate on the distribution of a
degenerate $U$-statistic, was first proved in a paper of Arcones and
Gin\'e~[3]. The result of Theorem~8.3 in the present form is
proved in my paper~[36]. Its version about multiple integrals with
respect to a normalized empirical measure, formulated in
Theorem~8.1, is proved in~[32]. This paper contains a direct proof.
On the other hand, Theorem~8.1 can be derived from Theorem~8.3 by
means of Theorem~9.4 of this paper. Theorem~8.5 is the natural
Gaussian counterpart of Theorem~8.3. The limit theorem about
degenerate $U$-statistics, Theorem~10.4 (and its version about
limit theorems for multiple integrals with respect to normalized
empirical measures, presented in Theorem~$10.4'$ of Appendix~C),
was discussed in this work to explain better the relation between
degenerate $U$-statistics (or multiple integrals with respect
to normalized empirical measures) and multiple Wiener--It\^o
integrals. A proof of this result based on ideas similar to those
discussed here can be found in~[14]. Theorem~6.6 of my lecture
note~[29] contains a weakened version of Theorem~8.5 which
does not take into account the variance of the random integral.
Example~8.7 is a natural supplement of Theorem~8.5. It shows
that the estimate of Theorem~8.5 is sharp if only the variance
of a Wiener--It\^o integral is known. At the end of Section~13
I also mention, without proof, the results of papers~[1]
and~[26], which also have some relation to this problem. I discuss
mainly the content of~[26] and explain its relation to some
results discussed in this work. The proofs in these papers
apply a method different from those of this work. It would be
interesting to prove them with the methods discussed here.
These papers contain refinements of Theorems~8.5 and~8.3
respectively, whose estimates depend on some other, rather
complicated quantities. In some cases they supply a better
estimate. On the other hand, in the problems discussed here
they have a limited importance, because their conditions are
hard to check.
Theorems~8.2 and~8.4 yield an estimate on the supremum of
(degenerate) $U$-statistics or of multiple random integrals with
respect to a normalized empirical measure when the class of kernel
functions in these $U$-statistics or random integrals satisfies
some conditions. They were proved in my paper~[34]. Earlier Arcones
and Gin\'e proved a weaker form of this result in paper~[4], but
their work did not help in the proof of the results of this note,
which were based on an adaptation of Alexander's method to the
multivariate case. Theorem~8.6 contains the natural Gaussian
counterpart of Theorems~8.2 and~8.4.
Example~8.8 in Section~8 shows that the condition
$u\le\text{const.}\, n\sigma^3$ imposed in Theorem~8.3 in
the case $k=2$ cannot be dropped. The paper of Arcones and
Gin\'e~[3] contains another example, explained by Talagrand to
the authors of that paper, which also has a similar consequence.
But that example does not provide such an explicit comparison
of the upper and lower bounds on the probability investigated
in Theorem~8.3 as Example~8.8. Similar examples could be
constructed for all $k\ge1$.
Example 8.8 shows that at high levels only a very weak (and from
a practical point of view not really important) improvement of the
estimate on the tail distribution of degenerate $U$-statistics
is possible. But probably there exists a multivariate version of
Bennett's inequality, i.e. of Theorem~3.2, which provides such
an estimate. Moreover, there is some hope of getting a similarly
strengthened form of Theorems~8.2 and~8.4 (or of Theorem~4.2 in
the one-dimensional case). This question is not investigated in
the present work.
Section 9 deals with the properties of $U$-statistics. Its first
result, Theorem~9.1, is a rather classical result, the
so-called Hoeffding decomposition of $U$-statistics into a sum of
degenerate $U$-statistics. Its proof first appeared in the
paper~[22], but it can be found at many places. The exposition of
this work contains some ideas similar to those in~[49]. I tried to
explain that Hoeffding's decomposition is the natural multivariate
version of the (trivial) decomposition of a sum of independent
random variables into the sum of independent random variables
{\it with expectation zero}\/ plus the sum of the expectations of
the original random variables. Moreover, even the proof of
Hoeffding's decomposition shows some similarity to this simple
decomposition.
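In formula, this trivial one-variate decomposition is
$$
\sum_{j=1}^n \eta_j
=\sum_{j=1}^n(\eta_j-E\eta_j)+\sum_{j=1}^n E\eta_j
$$
for arbitrary random variables $\eta_1,\dots,\eta_n$ with finite
expectation, and Hoeffding's decomposition carries out an analogous
centering in each variable of the kernel function in the
multivariate case. (The notation $\eta_j$ is used only in this
illustration.)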
Theorem~9.2 and Proposition~9.3 can be considered as a continuation
of the investigation of Hoeffding's decomposition in Theorem~9.1.
They tell how the properties of the kernel function of the original
$U$-statistic are inherited by the kernel functions of the
degenerate $U$-statistics taking part in its Hoeffding
decomposition. In several applications of Hoeffding's
decomposition we need such results.
The last result of Section~9, Theorem~9.4, enables us to reduce the
estimation of multiple random integrals with respect to normalized
empirical measures to the estimation of degenerate
$U$-sta\-tis\-tics. This result is a version of Hoeffding's
decomposition, in which multiple integrals with respect to a
normalized empirical distribution are decomposed into a sum of
degenerate $U$-statistics. Multiple random integrals with respect
to a normalized empirical measure can simply be written as sums of
$U$-statistics, and by applying the Hoeffding decomposition to
each term of these sums we get the desired decomposition.
Theorem~9.4 yields the result we get in this way. This formula is
very similar to the original Hoeffding decomposition. The main
difference between them is that the coefficients of the degenerate
$U$-statistics in the decomposition of Theorem~9.4 are relatively
small. The cancellation effect caused by integration with respect
to a {\it normalized}\/ empirical measure is reflected in the
appearance of small coefficients in the decomposition given in
Theorem~9.4. Theorem~9.4 was proved in~[34]. The proof given in
this note is essentially different from that of~[34].
Theorem~8.1 can be derived from Theorem~8.3 and Theorem~8.2 from
Theorem~8.4 by means of Theorem~9.4. The proof of the latter
results is simpler. Sections 10--12 contain the
results needed in the proof of Theorem~8.3 and its Gaussian
counterparts, Theorems~8.5 and~8.7. The proof of these results is
based on good estimates of high moments of degenerate
$U$-statistics and multiple Wiener--It\^o integrals. The
classical proof of the one-variate counterparts of these results is
based on a good estimate of the moment generating function. This
method was replaced by the estimate of high moments, because the
moment generating function of a $k$-fold Wiener--It\^o integral is
divergent for $k\ge3$, and this property is also reflected in the
behaviour of degenerate $U$-statistics. On the other hand, good
estimates on high moments can replace the estimate of the moment
generating function. A good estimate can be given for all moments
of a Wiener--It\^o integral, while we have a good estimate only on
not too high moments of degenerate $U$-statistics. This has the
consequence that we can give a good estimate on the tail
distribution of a degenerate $U$-statistic only at not too large
values. We met a similar situation in Section~3 in the study of
Bernstein's and Bennett's inequalities.
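The basic step behind replacing the moment generating function by high moments is Markov's inequality applied to an even moment. In a schematic, hedged form (the exponent $2M$ and the shape of the moment bound below are illustrative, not the precise estimates of this work):

```latex
% Markov's inequality with a high even moment: for any random
% variable Z, any number u>0 and any positive integer M
$$
P(|Z|>u)\le \frac{E Z^{2M}}{u^{2M}}.
$$
% If a bound on E Z^{2M} is available only for M below some
% threshold (as happens for degenerate U-statistics), then
% optimizing over the admissible values of M yields a good tail
% bound only for not too large values of u.
```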
I know of two deep methods to study high moments of multiple
Wiener--It\^o integrals. Both of them can be adapted to the study
of the moments of degenerate $U$-statistics. They deserve a more
detailed discussion.
The first one is Nelson's inequality, named after
Edward Nelson, who published it in his paper~[43]. This inequality
simply implies Theorem~8.5 about multiple Wiener--It\^o integrals,
although with worse constants. Later Leonhard Gross discovered a
deep and useful generalization of this result which he
published in the work {\it Logarithmic Sobolev inequalities}\/~[19].
In that paper Gross compared two Markov processes with the same
infinitesimal operator but with possibly different initial
distribution, where the second Markov process had stationary
distribution. He could give a sharp bound on the Radon--Nikodym
derivative of the distribution of the first Markov process with
respect to the (stationary) distribution of the second Markov
process at any time~$T$ on the basis of the properties of the
infinitesimal operator of the Markov processes. With the help of
this result he could prove a more general form of Nelson's
inequality. In particular, his result may help to prove (a weaker
version of) Theorem~8.3 (with worse universal constants). Let me
also remark that Gross' method works not only in the study of
these problems, but also in several hard problems of probability
theory. (See e.g.~[20] or~[27].) Nevertheless, in the present note
I applied a different method, because it seemed more
appropriate here.
I applied a method related to the names of Kiyoshi It\^o and Roland
L'vovich Dobrushin: the theory of multiple Wiener--It\^o
integrals with respect to a white noise. This integral was
introduced in the paper~[24]. It is useful, because every random
variable with finite second moment that is measurable with respect
to the $\sigma$-algebra generated by the Gaussian random variables
of the underlying white noise can be written as a sum of
Wiener--It\^o integrals of different order. Moreover, if only Wiener--It\^o
integrals of symmetric kernel functions are taken, then this
representation is unique. An important result, the so-called
diagram formula, formulated in Theorem~10.2, expresses products
of Wiener--It\^o integrals as a sum of such integrals. This result,
which shows some similarity to the Feynman diagrams applied in
statistical physics, was proved in~[9]. Actually this paper discussed
a modified version of Wiener--It\^o integrals which is more
appropriate to study the action of shift operators for non-linear
functionals of a stationary Gaussian field. But these modified
Wiener--It\^o integrals can be investigated in almost the same way
as the original ones. The diagram formula has a simple consequence
formulated in the Corollary of Theorem~10.2 of this note. It enables
us to calculate the expectation of products of Wiener--It\^o
integrals; in particular it yields an explicit formula for the
moments of a Wiener--It\^o integral. This result was applied in the
proof of Theorem~8.5, i.e.\ in the estimation of the tail
distribution of Wiener--It\^o integrals. It\^o's formula for
multiple Wiener--It\^o integrals (Theorem~10.3) was proved in~[24].
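In its simplest special case It\^o's formula connects a $k$-fold integral of a product kernel with a Hermite polynomial; the display below is a hedged sketch, and the constants should be checked against the normalization of the Wiener--It\^o integral $Z_{\mu,k}$ used in Section~10.

```latex
% Simplest case of Ito's formula: if f_0 is a real function with
% \int f_0^2 d\mu = 1, and H_k denotes the k-th Hermite polynomial
% (probabilists' convention), then
$$
k!\,Z_{\mu,k}\bigl(f_0(x_1)\cdots f_0(x_k)\bigr)
=H_k\Bigl(\int f_0(x)\,\mu_W(\,dx)\Bigr),
$$
% where the random variable \int f_0 d\mu_W is standard normal.
```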
The diagram formula has a natural and useful analog both for
degenerate $U$-statistics and multiple integrals with respect to a
normalized empirical measure. They enable us to express the product
of degenerate $U$-statistics and multiple integrals as the sum of
such expressions. These results enable us to adapt several useful
methods in the study of non-linear functionals of a Gaussian
random field to the study of non-linear functionals of normalized
empirical measures. A version of the diagram formula was proved
for degenerate $U$-statistics in~[36] and for multiple random
integrals with respect to a normalized empirical measure in~[32].
Let me remark that in the formulation of the result in the
work~[36] a different notation was applied than in the present
note. In that paper I wanted to formulate a version of the diagram
formula for $U$-statistics with the help of the diagrams which
appear in the classical form of the diagram formula presented for
Wiener--It\^o integrals. I could do this only in a somewhat
artificial way. In this work I formulated this result by first
introducing more general diagrams which may contain some
chains. The formulation of the result with the help of such more
general diagrams seems to be more natural. Let me also remark that
the study of results similar to the diagram formula for
Wiener--It\^o integrals did not get as much attention in the
literature as, in my opinion, it would deserve. I know only of one
work where such questions were investigated: the paper of
Surgailis~[46], where a version of the diagram formula is proved
for Poissonian integrals. The Corollary of Theorem~11.2 is of
special interest for us, because it enables us to prove moment
estimates that are useful in the proof of Theorem~8.3.
It is worth mentioning that the problems about Wiener--It\^o
integrals are closely related to the study of Hermite polynomials
and their multivariate version, the so-called Wick polynomials.
(See e.g.~[29] or~[40] for the definition of Wick polynomials.)
Appendix~C contains the most important properties of Hermite
polynomials needed in the study of Wiener--It\^o integrals. In
particular, it contains the proof of Proposition~C2 which states
that the set of all Hermite polynomials is a complete orthogonal
system in the Hilbert space of the functions square integrable
with respect to the standard Gaussian distribution. This result
can be found for instance in Theorem~5.2.7 of~[48]. In the
present proof I wanted to show that this result is closely
related to the so-called moment problem, i.e.\ to the question
of when a distribution is uniquely determined by its moments. This
method, with some refinement, can be applied to prove some
generalizations of Proposition~C2 about the completeness of
orthogonal polynomials with respect to more general weight
functions.
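For orientation, the first few Hermite polynomials in the probabilists' normalization are listed below; the convention of Appendix~C may differ from this one by constant factors.

```latex
% First Hermite polynomials (probabilists' convention), orthogonal
% with respect to the standard Gaussian distribution:
$$
H_0(x)=1,\quad H_1(x)=x,\quad H_2(x)=x^2-1,\quad H_3(x)=x^3-3x,
$$
% with the orthogonality relation E H_n(\xi)H_m(\xi)=n!\,\delta_{n,m}
% for a standard normal random variable \xi.
```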
It\^o's formula creates a relation between Wiener--It\^o integrals
and Hermite polynomials. The results about multiple Wiener--It\^o
integrals have their analogs for Wick polynomials. Thus for
instance there is a diagram formula for the product of Wick
polynomials which also has some interesting generalizations.
Such questions are studied both in probability theory and
statistical physics, see~[40] and~[45]. The relation between
Wiener--It\^o integrals and Hermite polynomials also has a
natural counterpart in the study of other multiple random
integrals. The so-called Appell polynomials (see~[47])
appeared in this way.
Theorems~8.3,~8.5 and~8.7 were proved on the basis of the results
in Sections 10--12 and in Section~13. Section~13 also contains
the proof of a multivariate version of Hoeffding's inequality,
formulated in Theorem~13.3. This result is needed in the
symmetrization argument applied in the proof of Theorem~8.4. A
weaker version of it (an estimate with a worse constant in the
exponent), which would be satisfactory for our purposes, would
simply follow from a classical result called Borell's inequality.
But since this result is not discussed in this note, and I was
interested in a proof yielding the best constant in the exponent,
I have chosen another proof, given in~[35], which is based on the
results of Sections~10--12. Later I learned that this estimate is
also contained in an implicit form in the paper~[6] of A.~Bonami.
Sections 14--17 are devoted to the proof of Theorems~8.4 and~8.6.
They are based on arguments similar to those of their one-variate
counterparts, Theorems~4.1 and~4.2. The proof of Theorem~8.6
about the supremum of Wiener--It\^o integrals is based, similarly
to the proof of Theorem~4.2, on a chaining argument. In the
proof of Theorem~8.4 the chaining argument yields only a weaker
result, formulated in Proposition~14.1, which helps to reduce
Theorem~8.4 to the proof of Proposition~14.2. In the one-variate
case a similar approach was applied. In that case the proof of
Theorem~4.1 was reduced to that of Proposition~6.2 by means of
Proposition~6.1. The next step in the proof of Theorem~8.4 has
no one-variate counterpart. The notion of so-called decoupled
$U$-statistics was introduced, and Proposition~14.2 was reduced
to a similar result about decoupled $U$-statistics formulated
in Proposition~$14.2'$.
The adjective `decoupled' in the expression decoupled $U$-statistic
refers to the fact that it is a version of a $U$-statistic
where independent copies of a sequence of independent and
identically distributed random variables are put into the different
coordinates of the kernel function. Their study is a popular
subject among some mathematicians. In particular, the main subject of
the book~[7] is a comparison of the properties of $U$-statistics
and decoupled $U$-statistics. A result of de la Pe\~na and
Montgomery--Smith~[8], formulated in Theorem~14.3, helps in reducing
some problems about $U$-statistics to similar problems about
decoupled $U$-statistics. In this lecture note the proof of
Theorem~14.3 is given in Appendix~D. It follows the argument of
the original proof, but several steps where the authors gave only
a very short explanation are worked out in detail. Paper~[8]
also contains some converse results to~Theorem~14.3, but
as they are not needed in the present work, I omitted their
discussion.
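In a schematic form (the precise normalization used in Section~14 may differ from the one below), a decoupled $U$-statistic of order $k$ looks as follows.

```latex
% A decoupled U-statistic of order k: k independent copies
% \xi_j^{(1)},...,\xi_j^{(k)}, 1\le j\le n, of the original i.i.d.
% sequence are put into the different coordinates of the kernel f:
$$
\bar I_{n,k}(f)=\sum_{\substack{1\le j_s\le n,\ 1\le s\le k\\
j_s\neq j_{s'}\ \text{if}\ s\neq s'}}
f\bigl(\xi^{(1)}_{j_1},\dots,\xi^{(k)}_{j_k}\bigr),
$$
% while the ordinary U-statistic is obtained by substituting the
% same copy of the sequence into each coordinate.
```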
Decoupled $U$-statistics behave similarly to the original
$U$-statistics. Beside this, some symmetrization arguments
become considerably simpler if we are working with decoupled
$U$-statistics instead of the original $U$-statistics. This can
be exploited in some investigations. For example the proof of
Proposition~$14.2'$ is simpler than a direct proof of
Proposition~14.2. On the other hand, Theorem~14.3 enables us
to reduce the proof of Proposition~14.2 to that of
Proposition~$14.2'$, and we have exploited this possibility.
The proof of Theorem~8.4 was reduced to that of Proposition~$14.2'$
in Section~14. Sections 15--17 deal with the proof of this result.
It was proved in my paper~[34]. The proof is similar to that of
its one-variate version, Proposition~6.2, but some additional
difficulties have to be overcome. The main difficulty appears when
we want to find the multivariate analog of the symmetrization
argument which could be carried out by means of the Symmetrization
Lemma, Lemma~7.1 and Lemma~7.2 in the one-variate case.
In the multivariate case Lemma~7.1 is not sufficient for us. We
work instead with a generalized version of this result,
formulated in Lemma~15.2. The proof of Lemma~15.2 is not hard.
The real difficulty arises when we want to apply it in the proof
of Proposition~$14.2'$. We have to check its condition given in
formula~(15.3), and in this case this requires a non-trivial
estimation of some complicated conditional probabilities. This
is the hardest part of the proof of Proposition~$14.2'$.
Proposition $14.2'$ was proved by means of an inductive procedure
formulated in Proposition 15.3, which is the multivariate analog
of Proposition~7.3. A basic ingredient of both proofs was a
symmetrization argument. But while this symmetrization argument
could be simply carried out in the one-variate case, its
adaptation to the multivariate case in the proof of
Proposition~15.3 was a much more serious problem. To overcome this
difficulty another result was formulated in Proposition~15.4. Propositions~15.3
and~15.4 were proved simultaneously by means of an appropriate
inductive procedure. Their proofs were based on a refinement of
the arguments in the proof of Proposition~7.3. We also had to
apply Theorem~13.3, a multivariate version of Hoeff\-ding's
inequality, and some properties of the Hoeff\-ding decomposition
of $U$-statistics proved in Section~9.
\beginsection Appendix A.
{\it The proof of some results about Vapnik--\v{C}ervonenkis classes.}
\medskip\noindent
{\it Proof of Theorem 5.1. (Sauer's lemma).}\/ This result has several
different proofs. Here I write down a relatively simple proof
of P. Frankl and J. Pach which appeared in~[15]. It is based on some
linear algebraic arguments.
The following equivalent reformulation of Sauer's lemma will be
proved. Let us take a set $S=S(n)$ consisting of $n$ elements and
a class $\Cal E$ of subsets of $S$ consisting of $m$ elements
$E_1,\dots,E_m\subset S$. Assume that $m\ge m_0+1$ with
$m_0=m_0(n,k)=\binom n0+\binom n1+\cdots+\binom n{k-1}$. Then there
exists a set $F\subset S$ of cardinality $k$ which is shattered by
the class of sets $\Cal E$. Actually, it is enough to show that
there exists a set $F$ of cardinality greater than or equal to~$k$
which is shattered by the class of sets $\Cal E$, because if a set
has this property, then all of its subsets have it. This latter
statement will be proved.
To prove this statement let us first list the subsets
$X_1,\dots,X_{m_0}$ of the set $S$ of cardinality less than or equal
to $k-1$, and assign to each set $E_i\in\Cal E$ the vector
$e_i=(e_{i,1},\dots,e_{i,m_0})$, $1\le i\le m$, with coordinates
$$
e_{i,j}=\left\{\aligned 1&\quad\text{if }X_j\subseteq E_i \\
0&\quad\text{if }X_j\not\subseteq E_i
\endaligned \right. \qquad 1\le i\le m, \text{ and } 1\le j\le m_0.
$$
Since $m>m_0$, the vectors $e_1,\dots,e_m$ are linearly dependent.
Because of the definition of the vectors $e_i$, $1\le i\le m$, this can
be expressed in the following way: There is a non-zero vector
$(f(E_1),\dots,f(E_m))$ such that
$$
\sum_{E_i\colon\; E_i\supseteq X_j} f(E_i)=0 \quad \text{for all }
1\le j\le m_0. \tag A1
$$
Let $F$, $F\subset S$, be a {\it minimal}\/ set with the property
$$
\sum_{E_i\colon\; E_i\supseteq F} f(E_i)=\alpha\neq0. \tag A2
$$
Such a set $F$ really exists, since every maximal element of the
family $\{E_i\colon\; 1\le i\le m,\, f(E_i)\neq0\}$ satisfies
relation (A2). The requirement that $F$ should be a minimal set means
that if $F$ is replaced by some $H\subset F$, $H\neq F$, at the
left-hand side of~(A2), then this expression equals zero. The
inequality $|F|\ge k$ holds because of relation (A1) and the
definition of the sets $X_j$: if we had $|F|\le k-1$, then $F$
would agree with some set $X_j$, and the sum in~(A2) would
vanish by~(A1).
Introduce the quantities
$$
Z_F(H)=\sum_{E_i\colon\; E_i\cap F=H} f(E_i)
$$
for all $H\subseteq F$.
Then $Z_F(F)=\alpha$, and for any set of the form $H=F\setminus\{x\}$,
$x\in F$,
$$
Z_F(H)=\sum_{E_i\colon\; E_i\cap F=H} f(E_i)
=\sum_{E_i\colon\; E_i\supseteq H}f(E_i)
-\sum_{E_i\colon\; E_i\supseteq F}f(E_i)=0-\alpha=-\alpha
$$
because of the minimality property of the set $F$.
Moreover, the identity
$$
Z_F(H)=(-1)^p\alpha \quad\text{for all } H\subseteq F
\text{ such that } |H|=|F|-p, \; 0\le p\le |F|. \tag A3
$$
holds. To show relation (A3) observe that
$$
Z_F(H)=\sum_{E_i\colon\; E_i\cap F=H} f(E_i)=\sum_{j=0}^p
(-1)^j\sum_{G\colon\;H\subset G\subset F,\;|G|=|H|+j}
\sum_{E_i\colon\; E_i\supseteq G}f(E_i) \tag A4
$$
for all sets $H\subset F$ with cardinality $|H|=|F|-p$.
Identity~(A4) holds, since the term $f(E_i)$ is counted at the
right-hand side of~(A4)
$\sum\limits_{j=0}^l (-1)^j\binom lj=(1-1)^l=0$ times if
$E_i\cap F=G$ with some $H\subset G\subseteq F$ with $|G|=|H|+l$
elements, $1\le l\le p$, while in the case $E_i\cap F=H$ it is
counted once. Relation~(A4) together with~(A2) and the minimality
property of the set~$F$ imply relation~(A3).
It follows from relation (A3) and the definition of the function
$Z_F(H)$ that for all sets $H\subseteq F$ there exists some
set $E_i$ such that $H=E_i\cap F$, i.e. $F$ is shattered by
$\Cal E$. Since $|F|\ge k$, this implies Theorem~5.1.
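The combinatorial content of Sauer's lemma can be checked by brute force in the smallest non-trivial case. The following sketch (in Python, with helper names of my own choosing; it is an illustration, not part of the argument above) verifies the reformulation proved here for $n=4$, $k=2$, where $m_0=\binom40+\binom41=5$: every family of $6$ distinct subsets of a $4$-element set shatters some pair.

```python
# Brute-force check of Sauer's lemma (Theorem 5.1) for n=4, k=2.
from itertools import combinations
from math import comb

def shatters(family, F):
    """Does the family cut out all 2**|F| subsets of F as traces E & F?"""
    return len({E & frozenset(F) for E in family}) == 2 ** len(F)

n, k = 4, 2
m0 = sum(comb(n, i) for i in range(k))           # m0 = C(4,0)+C(4,1) = 5
subsets = [frozenset(s) for r in range(n + 1)
           for s in combinations(range(n), r)]    # all 16 subsets of S

# Sauer's lemma: every family of m0+1 = 6 distinct subsets of a
# 4-element set shatters some set of cardinality k = 2.
all_ok = all(any(shatters(fam, F) for F in combinations(range(n), k))
             for fam in combinations(subsets, m0 + 1))
```

Running this check over all $\binom{16}{6}=8008$ families confirms the statement in this case.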
\medskip\noindent
{\it Proof of Theorem 5.3.}\/ Let us fix an arbitrary subset
$F=\{x_1,\dots,x_{k+1}\}$ of the set $X$, and consider the set of
vectors
${\Cal G}_k(F)=\{(g(x_1),\dots,g(x_{k+1}))\colon\; g\in{\Cal G}_k\}$
in the $(k+1)$-dimensional space $R^{k+1}$. By the conditions of
Theorem~5.3 ${\Cal G}_k(F)$ is an at most $k$-dimensional subspace of
$R^{k+1}$. Hence there exists a non-zero vector
$a=(a_1,\dots,a_{k+1})$ such that
$\sum\limits_{j=1}^{k+1} a_jg(x_j)=0$ for all $g\in{\Cal G}_k$. We
may assume that the set $A=A(a)=\{j\colon\; a_j<0,\, 1\le j\le k+1\}$
is non-empty, by multiplying the vector $a$ by $-1$ if it is necessary.
Thus the identity
$$
\sum_{j\in A} a_jg(x_j)=\sum_{j\in \{1,\dots,k+1\}\setminus A}
(-a_j)g(x_j),\qquad \text{for all }g\in{\Cal G}_k \tag A5
$$
holds. Put $B=\{x_j\colon\; j\in A\}$. Then $B\subset F$, and
$F\setminus B\neq\{x\colon\; g(x)\ge0\}\cap F$ for all
$g\in{\Cal G}_k$. Indeed, if there were some $g\in {\Cal G}_k$
such that $F\setminus B=\{x\colon\; g(x)\ge0\}\cap F$, then
the left-hand side of the equation (A5) would be strictly
positive (as $a_j<0$, $g(x_j)<0$ if $j\in A$, and
$A\neq\emptyset$), while its right-hand side would be non-positive
for this $g\in{\Cal G}_k$, and this is a contradiction.
The property proved above means that $\Cal D$ shatters no set
$F\subset X$ of cardinality~$k+1$. Hence Theorem~5.1
implies that $\Cal D$ is a Vapnik--\v{C}ervonenkis class.
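Let me illustrate this proof with a simple example (my own illustration, not part of the original argument): half-planes in the plane.

```latex
% Illustration of Theorem 5.3: take the 3-dimensional space G_3 of
% affine functions g(x)=a_1x^{(1)}+a_2x^{(2)}+b on R^2. Theorem 5.3
% (with k=3) states that the half-planes {x: g(x)>=0}, g in G_3,
% shatter no set of cardinality 4. For F the vertices of the unit
% square the identity
$$
g(0,0)+g(1,1)=g(1,0)+g(0,1)\qquad\text{for all affine }g
$$
% supplies the vector a=(1,-1,1,-1) of the proof, and with
% B=\{(1,0),(0,1)\} no half-plane cuts out the diagonal pair
% F\setminus B=\{(0,0),(1,1)\} from F.
```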
\beginsection Appendix B. The proof of the diagram formula for
Wiener--It\^o integrals.
We start the proof of Theorem~10.2A (the diagram formula for the
product of two Wiener--It\^o integrals) with the proof of
inequality~(10.11). To show that this relation holds let us observe
that the Cauchy inequality yields the following bound on the function
$F_\gamma$ defined in~(10.10) (with the notation introduced there):
$$
\aligned
&F^2_\gamma(x_{(1,j)},x_{(2,j')},\,\;(1,j)\in
V_1(\gamma),\, (2,j')\in V_2(\gamma))\\
&\qquad\le
\int f^2(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)} \mu(\,dx_{(2,j)})\\
&\qquad\qquad
\int g^2(x_{(2,1)},\dots,x_{(2,l)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu(\,dx_{(2,j)}).
\endaligned \tag B1
$$
The expression at the right-hand side of inequality~(B1) is the
product of two functions with different arguments. The first
function has arguments $x_{(1,j)}$ with $(1,j)\in V_1(\gamma)$ and
the second one $x_{(2,j')}$ with $(2,j')\in V_2(\gamma)$.
By integrating both sides of inequality~(B1) with respect to these
arguments we get inequality~(10.11).
Relation (10.12) will be proved first for the product of the
Wiener--It\^o integrals of two elementary functions. Let us consider
two (elementary) functions $f(x_1,\dots,x_k)$ and $g(x_1,\dots,x_l)$
given in the following form: Let some disjoint sets $A_1,\dots,A_M$,
$\mu(A_s)<\infty$, $1\le s\le M$, be given together with some real
numbers $c(s_1,\dots,s_k)$ indexed with such $k$-tuples
$(s_1,\dots,s_k)$, $1\le s_j\le M$, $1\le j\le k$, for which the
numbers $s_1,\dots,s_k$ in a $k$-tuple are all different. Put
$f(x_1,\dots,x_k)=c(s_1,\dots,s_k)$ on the rectangles
$A_{s_1}\times\cdots\times A_{s_k}$ with edges $A_s$, indexed
with the above $k$-tuples, and let $f(x_1,\dots,x_k)=0$ outside of
these rectangles. Take similarly some disjoint sets
$B_1,\dots,B_{M'}$, $\mu(B_t)<\infty$, $1\le t\le M'$, and some
real numbers $d(t_1,\dots,t_l)$, indexed with such $l$-tuples
$(t_1,\dots,t_l)$, $1\le t_{j'}\le M'$, $1\le j'\le l$, for which
the numbers $t_1,\dots,t_l$ in an $l$-tuple are different. Put
$g(x_1,\dots,x_l)=d(t_1,\dots,t_l)$ on the rectangles
$B_{t_1}\times\cdots\times B_{t_l}$ with edges $B_t$,
indexed with the above introduced $l$-tuples, and let
$g(x_1,\dots,x_l)=0$ outside of these rectangles.
Let us take some small number $\varepsilon>0$ and rewrite the above
introduced functions $f(x_1,\dots,x_k)$ and $g(x_1,\dots,x_l)$
with the help of this number $\varepsilon>0$ in the following way.
Divide the sets $A_1,\dots,A_M$ into smaller sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{s=1}^{M(\varepsilon)} A_s^\varepsilon=
\bigcup\limits_{s=1}^{M} A_s$, in such a way that all sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$ are disjoint,
and $\mu(A^\varepsilon_s)\le\varepsilon$,
$1\le s\le M(\varepsilon)$. Similarly, take sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{t=1}^{M'(\varepsilon)} B_t^\varepsilon
=\bigcup\limits_{t=1}^{M'} B_t$, in such a way that all sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$ are
disjoint, and $\mu(B^\varepsilon_t)\le\varepsilon$,
$1\le t\le M'(\varepsilon)$. Beside this, let us also demand
that two sets $A_s^\varepsilon$ and $B_t^\varepsilon$,
$1\le s\le M(\varepsilon)$, $1\le t\le M'(\varepsilon)$,
are either disjoint or they agree. Such a partition exists because
of the non-atomic property of measure $\mu$. The above defined
functions $f(x_1,\dots,x_k)$ and $g(x_1,\dots,x_l)$ can be
rewritten by means of these new sets $A^\varepsilon_s$ and
$B^\varepsilon_t$. Namely, let
$f(x_1,\dots,x_k)=c^\varepsilon(s_1,\dots,s_k)$ on the rectangles
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}$
with $1\le s_j\le M(\varepsilon)$,
$1\le j\le k$, with different indices $s_1,\dots,s_k$, where
$c^\varepsilon(s_1,\dots,s_k)=c(p_1,\dots,p_k)$ with those indices
$(p_1,\dots,p_k)$ for which
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}\subset
A_{p_1}\times\cdots\times A_{p_k}$. The function $f$ disappears
outside of these rectangles. The function $g(x_1,\dots,x_l)$ can be
written similarly in the form $g(x_1,\dots,x_l)
=d^\varepsilon(t_1,\dots,t_l)$ on the rectangles
$B^\varepsilon_{t_1}\times\cdots\times B^\varepsilon_{t_l}$ with
$1\le t_{j'}\le M'(\varepsilon)$, $1\le j'\le l$, and different
indices, $t_1,\dots,t_l$. Beside this, the function $g$
disappears outside of these rectangles.
The above representation of the functions $f$ and $g$ through a
parameter $\varepsilon$ is useful, since it enables us to give a good
asymptotic formula for the product $k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)$
which yields the diagram formula for the product of Wiener--It\^o
integrals of elementary functions with the help of a limiting
procedure $\varepsilon\to0$.
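Before turning to the general computation it may be instructive to record the simplest case $k=l=1$ of relation~(10.12); the display below is schematic, and the factorial constants should be checked against the normalization of Section~10.

```latex
% Diagram formula for k=l=1: the class Gamma(1,1) contains two
% diagrams, gamma_0 with no edge and gamma_1 with the single edge
% ((1,1),(2,1)), and relation (10.12) reduces to
$$
1!\,Z_{\mu,1}(f)\cdot1!\,Z_{\mu,1}(g)
=2!\,Z_{\mu,2}(F_{\gamma_0})+0!\,Z_{\mu,0}(F_{\gamma_1}),
$$
% where F_{gamma_0}(x_1,x_2)=f(x_1)g(x_2) and the constant
% F_{gamma_1}=\int f(x)g(x)\mu(dx).
```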
Fix a small number $\varepsilon>0$, take the representation of
the functions $f$ and $g$ with its help, and write
$$
k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)
=\sum_{\gamma\in \Gamma(k,l)} Z_\gamma(\varepsilon)
\tag B2
$$
with
$$
Z_\gamma(\varepsilon)={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l)
\mu_W(A^\varepsilon_{s_1})\dots\mu_W(A^\varepsilon_{s_k})
\mu_W(B^\varepsilon_{t_1})\dots\mu_W(B^\varepsilon_{t_l}), \tag B3
$$
where $\Gamma(k,l)$ denotes the class of diagrams introduced before
the formulation of Theorem~10.2A, and $\sum^\gamma$ denotes
summation for such $(k+l)$-tuples $(s_1,\dots,s_k,t_1,\dots,t_l)$,
$1\le s_j\le M(\varepsilon)$, $1\le j\le k$, and
$1\le t_{j'}\le M'(\varepsilon)$,
$1\le j'\le l$, for which
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ if it is an edge of $\gamma$,
and otherwise all sets $A^\varepsilon_{s_j}$ and
$B^\varepsilon_{t_{j'}}$ are
disjoint. (This sum also depends on $\varepsilon$.) In the
case of an empty sum $Z_\gamma(\varepsilon)$ equals zero.
For all $\gamma\in\Gamma(k,l)$ the expression
$Z_\gamma(\varepsilon)$ will be written in the form
$$
Z_\gamma(\varepsilon)=Z_\gamma^{(1)}(\varepsilon)
+Z_\gamma^{(2)}(\varepsilon),
\quad \gamma\in\Gamma(k,l), \tag B4
$$
with
$$
\aligned
Z^{(1)}_\gamma(\varepsilon)
&={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l)\\
&\qquad\prod_{j\colon\; (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}})\\
&\qquad \prod_{j\colon\; (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})
\endaligned \tag B5
$$
and
$$ \allowdisplaybreaks
\align
Z^{(2)}_\gamma(\varepsilon)&={\sum}^\gamma
c^\varepsilon(s_1,\dots,s_k) d^\varepsilon(t_1,\dots,t_l)\\
&\qquad \prod_{j\colon\; (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \\
&\qquad \biggl[\prod_{j\colon\; (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \\
&\qquad\qquad -\prod_{j\colon\; (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr], \tag B6
\endalign
$$
where $V_1(\gamma)$ and $V_2(\gamma)$ (introduced before
formula~(10.9) during the preparation to the formulation of
Theorem~10.2A) are the sets of vertices in the first and second
row of the diagram $\gamma$ from which no edge starts.
I claim that there is some constant $C>0$ not depending on
$\varepsilon$ such that
$$
E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma)-
Z^{(1)}_\gamma(\varepsilon)\right)^2\le C\varepsilon
\quad \text{for all } \gamma\in\Gamma(k,l) \tag B7
$$
with the Wiener--It\^o integral with the kernel function
$F_\gamma$ defined in (10.9), (10.9a) and (10.10), and
$$
E\left(Z^{(2)}_\gamma(\varepsilon)\right)^2
\le C\varepsilon\quad\text{for all } \gamma\in\Gamma(k,l). \tag B8
$$
Relations (B7) and (B8) imply relation (10.12)
if $f$ and $g$ are elementary functions. Indeed, they imply that
$$
\lim_{\varepsilon\to0}\left\|\,|\gamma|!Z_{\mu,|\gamma|}(F_\gamma)
-Z_\gamma(\varepsilon)\right\|_2=0
\quad\text{for all }\gamma\in\Gamma(k,l),
$$
and this relation together with (B2) yield relation (10.12) with
the help of a limiting procedure $\varepsilon\to0$.
To prove relation (B7) let us introduce the function
$$
\align
&F^\varepsilon_\gamma(x_{(1,j)},x_{(2,j')},\; (1,j)\in
V_1(\gamma),\; (2,j')\in V_2(\gamma))\\
&\qquad=
F_\gamma(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))\\
&\qquad\qquad\text{if } x_{(1,j)}\in A^\varepsilon_{s_j},
\text{ for all } (1,j)\in V_1(\gamma),\\
&\qquad\qquad\text{ } x_{(2,j')}\in B^\varepsilon_{t_{j'}},
\text{ for all } (2,j')\in V_2(\gamma), \quad\text{and}\\
& \qquad\qquad\text{ all sets }
A^\varepsilon_{s_j},\; (1,j)\in V_1(\gamma),
\text{ and } B^\varepsilon_{t_{j'}}, \; (2,j')\in V_2(\gamma)
\text{ are different.}
\endalign
$$
with the function~$F_\gamma$ defined in~(10.9a) and~(10.10), and
put
$$
F^\varepsilon_\gamma(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))=0 \quad
\text{otherwise.}
$$
The function $F_\gamma^\varepsilon$ is elementary, and a comparison
of its definition with relation~(B5) and the definition of the
function $F_\gamma$ yield that
$$
Z_\gamma^{(1)}(\varepsilon)=|\gamma|!
Z_{\mu,|\gamma|}(F_\gamma^\varepsilon). \tag B9
$$
The function $F^\varepsilon_\gamma$ slightly differs from $F_\gamma$,
since the function $F_\gamma$ may not disappear in such points
$(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\,
(2,j')\in V_2(\gamma))$ for which there is some pair $(j,j')$ with
the property $x_{(1,j)}\in A^\varepsilon_{s_j}$ and
$x_{(2,j')}\in B^\varepsilon_{t_{j'}}$ with some sets
$A^\varepsilon_{s_j}$ and
$B^\varepsilon_{t_{j'}}$ such that
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$, while
$F_\gamma^\varepsilon$ must be zero at such points. On the other
hand, in the case $|\gamma|=\max(k,l)-\min(k,l)$, i.e. if one
of the sets $V_1(\gamma)$ or $V_2(\gamma)$ is empty,
$F_\gamma=F^\varepsilon_\gamma$, \
$Z_\gamma^{(1)}=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma)$, and
relation~(B7) clearly holds for such diagrams $\gamma$.
In the case $|\gamma|>\max(k,l)-\min(k,l)$ such an estimate
will be proved for the measure of the set where
$F_\gamma\neq F_\gamma^\varepsilon$, which implies relation~(B7).
Let us define the sets $A=\bigcup\limits_{s=1}^{M(\varepsilon)}
A^\varepsilon_s$ and
$B=\bigcup\limits_{t=1}^{M'(\varepsilon)} B^\varepsilon_t$.
These sets $A$ and $B$ do
not depend on the parameter $\varepsilon$. Beside this
$\mu(A)<\infty$, and $\mu(B)<\infty$. Define for all pairs
$(j_0,j_0')$ such that $(1,j_0)\in V_1(\gamma)$,
$(2,j_0')\in V_2(\gamma)$ the set
$$
\align
D(j_0,j'_0)&=\{(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma), \,
(2,j')\in V_2(\gamma)) \colon\;\\
&\qquad x_{(1,j_0)}\in A^\varepsilon_{s_{j_0}}, \;
x_{(2,j'_0)}\in B^\varepsilon_{t_{j'_0}}
\quad \text{for some } s_{j_0}
\text{ and } t_{j'_0} \text{ such that }
A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}}, \\
&\qquad x_{(1,j)}\in A\text{ for all } (1,j)\in V_1(\gamma),
\text{ and }x_{(2,j')}\in B\text{ for all }(2,j')\in V_2(\gamma)\}.
\endalign
$$
Introduce the notation $x^\gamma=(x_{(1,j)},x_{(2,j')},\;
(1,j)\in V_1(\gamma),\,(2,j')\in V_2(\gamma))$ and put
$D_\gamma
=\{x^\gamma\colon\;
F^\varepsilon_\gamma(x^\gamma)\neq F_\gamma(x^\gamma)\}$.
The relation
$D_\gamma\subset\bigcup\limits_{(1,j_0)\in V_1(\gamma)}\,
\bigcup\limits_{(2,j'_0)\in V_2(\gamma)} D(j_0,j'_0)$
holds, since if
$F^\varepsilon_\gamma(x^\gamma)\neq F_\gamma(x^\gamma)$ for some
vector~$x^\gamma$, then it has some coordinates
$(1,j_0)\in V_1(\gamma)$ and $(2,j'_0)\in V_2(\gamma)$ such that
$x_{(1,j_0)}\in A^\varepsilon_{s_{j_0}}$ and
$x_{(2,j'_0)}\in B^\varepsilon_{t_{j'_0}}$ with some sets
$A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}}$, and the
relation in the last
line of the definition of $D(j_0,j'_0)$ must also hold for this
vector $x^\gamma$, since otherwise
$F_\gamma(x^\gamma)=0=F^\varepsilon_\gamma(x^\gamma)$.
I claim that there is some constant $C_1$ such that
$$
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D(j_0,j'_0))\le C_1\varepsilon
\quad\text{for all sets } D(j_0,j'_0),
$$
where $\mu^{|V_1(\gamma)|+|V_2(\gamma)|}$
denotes the direct product of the measure $\mu$ on some copies of
the original space $(X,{\Cal X})$ indexed by $(1,j)\in V_1(\gamma)$
and $(2,j')\in V_2(\gamma)$. To see this relation one has to
observe that
$\sum\limits_{A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}}}
\mu(A^\varepsilon_{s_{j_0}})\mu(B^\varepsilon_{t_{j'_0}})\le
\sum\limits_{s_{j_0}}\varepsilon\, \mu(A^\varepsilon_{s_{j_0}})
=\varepsilon\mu(A)$.
Thus the set $D(j_0,j'_0)$ can be covered by the direct product
of a set whose $\mu$ measure is not greater than
$\varepsilon\mu(A)$ and of a rectangle whose edges are
either the set $A$ or the set $B$.
The above relations imply that
$$
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C_2\varepsilon \tag B10
$$
with some constant $C_2>0$.
Relation (B9), estimate (B10), the property c) formulated in
Theorem~10.1 for Wiener--It\^o integrals and the observation that
the function $F_\gamma=F_\gamma(f,g)$ is bounded in supremum norm
if $f$ and $g$ are elementary functions imply the inequality
$$
\align
E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma)-
Z^{(1)}_\gamma(\varepsilon)\right)^2&=
(|\gamma|!)^2E\left( Z_{\mu,|\gamma|}
(F_\gamma-F_\gamma^\varepsilon)\right)^2
\le |\gamma|!\| F_\gamma-F_\gamma^\varepsilon\|_2^2 \\
&\le K\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C\varepsilon.
\endalign
$$
This means that relation (B7) holds.
To prove relation (B8) write
$E\left(Z^{(2)}_\gamma(\varepsilon)\right)^2$
in the following form:
$$
\aligned
E\left(Z^{(2)}_\gamma(\varepsilon)\right)^2&={\sum}^\gamma
{\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l)
c^\varepsilon(\bar s_1,\dots,\bar s_k)
d^\varepsilon(\bar t_1, \dots,\bar t_l)\\
&\qquad\qquad EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)
\endaligned \tag B11
$$
with
$$ \allowdisplaybreaks
\align
&U(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)\\
&\qquad =\prod_{j\colon\; (1,j)
\in V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \\
&\qquad\qquad
\prod_{\bar\jmath\colon\; (1,\bar\jmath)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}})
\prod_{\bar\jmath'\colon\; (2,\bar\jmath')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{\bar t_{\bar\jmath'}}) \\
&\qquad\qquad \biggl[\prod_{j\colon\; (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \\
&\qquad\qquad\qquad
-\prod_{j\colon\; (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr]\\
&\qquad\qquad
\biggl[\prod_{\bar\jmath\colon\; (1,\bar\jmath)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}})
\prod_{\bar\jmath'\colon\;
(2,\bar\jmath')\in \{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)}
\mu_W(B^\varepsilon_{\bar t_{\bar \jmath'}}) \\
&\qquad\qquad\qquad
-\prod_{\bar\jmath\colon\; (1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}
\mu(A^\varepsilon_{\bar s_{\bar\jmath}})\biggr]. \tag B12
\endalign
$$
The double sum $\sum^\gamma\sum^\gamma$ in (B11) has to be
understood in the following way. The first summation is taken for
the vectors $(s_1,\dots,s_k,t_1,\dots,t_l)$, which run through the
values prescribed in the definition of $\sum^\gamma$ in
formula (B3). The second summation is taken for the vectors
$(\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)$, again with
the values prescribed in the summation $\sum^\gamma$.
Relation~(B8) will be proved by means of some estimates about
the expectation of the above defined random variable $U(\cdot)$
which will be presented in the following Lemma~B. Before its
formulation I introduce the following Properties~A and~B.
\medskip\noindent
{\bf Property A.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar\jmath}\le M(\varepsilon)$, for
$1\le j,\bar\jmath\le k$, and
$1\le t_{j'},\bar t_{\bar\jmath'}\le M'(\varepsilon)$ for
$1\le j',\bar\jmath'\le l$,
satisfies Property~A (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequences of sets
$\{A^\varepsilon_{s_j},B^\varepsilon_{t_{j'}},(1,j)\in V_1(\gamma),
(2,j')\in V_2(\gamma)\}$
and
$\{A^\varepsilon_{\bar s_{\bar\jmath}},
B^\varepsilon_{\bar t_{\bar\jmath'}},
(1,\bar\jmath)\in V_1(\gamma),(2,\bar\jmath')\in V_2(\gamma)\}$
agree.
(Here we say that two sequences agree if they contain the same
elements in a possibly different order.)}
\medskip\noindent
{\bf Property B.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar\jmath}\le M(\varepsilon)$, for
$1\le j,\bar\jmath\le k$, and
$1\le t_{j'},\bar t_{\bar\jmath'}\le M'(\varepsilon)$ for
$1\le j',\bar\jmath'\le l$,
satisfies Property~B (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequences of sets
$$
\{A^\varepsilon_{s_j},B^\varepsilon_{t_{j'}},(1,j)
\in\{(1,1),\dots,(1,k)\}\setminus
V_1(\gamma),\;(2,j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)\}
$$
and
$$
\{A^\varepsilon_{\bar s_{\bar\jmath}},
B^\varepsilon_{\bar t_{\bar\jmath'}},
(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma),\;
(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)\}
$$
have at least one common element.}
\medskip
(In the above definitions two sets $A^\varepsilon_s$ and
$B^\varepsilon_t$ are
identified if $A^\varepsilon_s=B^\varepsilon_t$.)
Now I formulate the following
\medskip\noindent
{\bf Lemma B.} {\it Let us consider the function $U(\cdot)$
introduced in formula (B12). Assume that its arguments
$s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$ are chosen
in such a way that the function $U(\cdot)$ with these
arguments appears in the double sum $\sum^\gamma\sum^\gamma$
in formula~(B11), i.e.\
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, otherwise all sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$ are disjoint,
and an analogous statement holds if
the coordinates $s_1,\dots,s_k,t_1,\dots,t_l$ are replaced by
$\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$. Then
$$
EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)=0 \tag B13
$$
if the sequence of the arguments in $U(\cdot)$ fails to satisfy
Property~A or Property~B.
If the sequence of the arguments in $U(\cdot)$ satisfies both
Property~A and Property~B, then
$$
\aligned
&|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)|\\
&\qquad \le C\varepsilon \prod{\vphantom\prod}'
\mu(A^\varepsilon_{s_j})\mu(A^\varepsilon_{\bar s_{\bar\jmath}})
\mu(B^\varepsilon_{t_{j'}})\mu(B^\varepsilon_{\bar t_{\bar\jmath'}})
\endaligned
\tag B14
$$
with some appropriate constant $C=C(k,l)>0$ depending only on
the number of variables $k$ and $l$ of the functions $f$ and $g$.
The prime in the product $\prod'$ at the right-hand side of~(B14)
means that in this product the measure $\mu$ of those sets
$A^\varepsilon_{s_j}$, $A^\varepsilon_{\bar s_{\bar\jmath}}$,
$B^\varepsilon_{t_{j'}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$ are considered,
whose indices are
listed among the arguments
$s_j,\bar s_{\bar\jmath},t_{j'}$ or $\bar t_{\bar\jmath'}$ of
$U(\cdot)$, and the measure~$\mu$ of each such set appears
exactly once. (This means e.g. that if
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ or
$A^\varepsilon_{s_j}=B^\varepsilon_{\bar t_{\bar\jmath'}}$
for some indices $j$ and
$j'$ or $\bar\jmath'$, then only one of the factors
$\mu(A^\varepsilon_{s_j})$ and $\mu(B^\varepsilon_{t_{j'}})$ or
$\mu(B^\varepsilon_{\bar t_{\bar\jmath'}})$ is kept in the
product. For the sake of definiteness let us preserve the factor
$\mu(A^\varepsilon_{s_j})$ in such a case.)}
\medskip\noindent
{\it Remark.\/} The content of Lemma~B is that most terms in the
double sum in formula~(B11) equal zero, and even the non-zero
terms are small.
\medskip\noindent
{\it Proof of Lemma B.\/} Let us first prove relation (B13)
in the case when Property~A does not hold. It will be exploited
that for disjoint sets the random variables $\mu_W(A_s)$ and
$\mu_W(B_t)$ are independent, and this provides a good
factorization of the expectation of certain products. Let us
carry out the multiplications in the definition of $U(\cdot)$
in formula~(B12), and show that each product obtained in such a
way has zero expectation. If Property~A does not hold for the
arguments of $U(\cdot)$, and beside this the arguments
$s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$ satisfy the
remaining conditions of Lemma~B, then each
product we consider contains a factor $\mu_W(A^\varepsilon_{s_{j_0}})$,
$(1,j_0)\in V_1(\gamma)$, which is independent of all those terms
in this product which are in the following list:
$\mu_W(A^\varepsilon_{s_j})$ with some $j\neq j_0$, $1\le j\le k$,
or $\mu_W(B^\varepsilon_{t_{j'}})$ with some $1\le j'\le l$, or
$\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}})$ with
$(1,\bar\jmath)\in V_1(\gamma)$, or
$\mu_W(B^\varepsilon_{\bar t_{\bar\jmath'}})$ with
$(2,\bar\jmath')\in V_2(\gamma)$. We will show with the help of
this property that the expectation of each term has a factorization
with a factor either of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})=0$ or
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$, hence it equals zero.
Indeed, although the above properties do not exclude the
appearance in the product of a pair of arguments
$A^\varepsilon_{\bar s_{\bar\jmath}}$,
$(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$, and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$,
$(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$,
for which
$A^\varepsilon_{\bar s_{\bar\jmath}}
=B^\varepsilon_{\bar t_{\bar\jmath'}}=A^\varepsilon_{s_{j_0}}$.
In such a case no factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})$ appears in the factorization,
but then the product contains a factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$. Hence an
appropriate factorization of each term of $EU(\cdot)$
contains either a factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})=0$ or
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$ if
$U(\cdot)$ does not satisfy Property~A.
To finish the proof of relation (B13) it is enough to consider the
case when the arguments of $U(\cdot)$ satisfy Property~A, but they
do not satisfy Property~B. The validity of Property~A implies that
the sets
$\{A^\varepsilon_{s_j},\,(1,j)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{t_{j'}},\,(2,j')\in V_2(\gamma)\}$
and
$\{A^\varepsilon_{\bar s_{\bar\jmath}},\,(1,\bar\jmath)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{\bar t_{\bar\jmath'}},\,(2,\bar\jmath')\in V_2(\gamma)\}$
agree. The conditions of Lemma~B also imply that the elements
of these classes are disjoint from the sets
$A^\varepsilon_{s_j}$,
$B^\varepsilon_{t_{j'}}$, $A^\varepsilon_{\bar s_{\bar\jmath}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$ with indices
$(1,j),(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$
and
$(2,j'),(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$.
If Property~B does not hold, then the latter class of sets can be
divided into two subclasses in such a way that the elements in
different subclasses are disjoint. The first subclass consists of the
sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$, and the second
one of the sets $A^\varepsilon_{\bar s_{\bar\jmath}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$
with indices such that
$(1,j),(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)$ and
$(2,j'),(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$.
These facts imply that $EU(\cdot)$ has a factorization,
which contains the term
$$ \allowdisplaybreaks
\align
&E\biggl[\prod_{j\colon\; (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\; (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \\
&\qquad\qquad -\prod_{j\colon\; (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr]=0,
\endalign
$$
hence relation (B13) holds also in this case. The last expression
has zero expectation, since if we form the pairs
$A^\varepsilon_{s_j},B^\varepsilon_{t_{j'}}$ from the sets
appearing in it with indices
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ with vertices
connected by an edge of $\gamma$, then
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$
within each pair, and sets in different pairs are disjoint. This
observation allows a factorization in the product whose
expectation is taken, and then the identity
$E\mu_W(A^\varepsilon_{s_j})\mu_W(B^\varepsilon_{t_{j'}})
=\mu(A^\varepsilon_{s_j})$ implies the desired identity.
To prove relation (B14) if the arguments of the function~$U(\cdot)$
satisfy both Properties~A and~B consider the expression (B12) which
defines $U(\cdot)$, carry out the term by term multiplication
between the two differences at the end of this formula, take
expectation for each term of the sum obtained in such a way and
factorize them. Since $E\mu_W(A)^2=\mu(A)$,
$E\mu_W(A)^4=3\mu(A)^2$ for all sets $A\in{\Cal X}$, $\mu(A)<\infty$,
some calculation shows that each term can be expressed as constant
times a product whose elements are those probabilities
$\mu(A_s^\varepsilon)$ and $\mu(B_t^\varepsilon)$ or their square
which appear at the right-hand
side of (B14). Moreover, since the arguments of $U(\cdot)$ satisfy
Property~B, there will be at least one term of the form
$\mu(A_s^\varepsilon)^2$ in this product. Since
$\mu(A_s^\varepsilon)^2\le \varepsilon\mu(A_s^\varepsilon)$,
these calculations provide
formula~(B14). Lemma~B is proved.
\medskip
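The Gaussian moment identities applied repeatedly in the proof of
Lemma~B can be collected in one display. This is only a side
remark; it relies on the fact that the white noise value
$\mu_W(A)$ is a normally distributed random variable with
expectation zero and variance $\mu(A)$:

```latex
$$
E\mu_W(A)=E\mu_W(A)^3=0,\qquad
E\mu_W(A)^{2m}=(2m-1)!!\,\mu(A)^m,\quad m=1,2,\dots,
$$
```

in particular $E\mu_W(A)^2=\mu(A)$ and $E\mu_W(A)^4=3\mu(A)^2$,
the identities cited in the proof of formula (B14).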
Relation (B11) implies that
$$
E\left(Z^{(2)}_\gamma(\varepsilon)\right)^2
\le K\sum{\vphantom{\sum}}^\gamma
\sum{\vphantom{\sum}}^\gamma
|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)| \tag B15
$$
with some appropriate $K>0$. By Lemma~B it is enough to sum up
only for such terms $U(\cdot)$ in (B15) whose arguments satisfy
both Properties~A and~B. Moreover, each such term can be bounded
by means of inequality (B14). Let us list the sets
$A^\varepsilon_{s_j},A^\varepsilon_{\bar s_{\bar\jmath}},
B^\varepsilon_{t_{j'}},
B^\varepsilon_{\bar t_{\bar\jmath'}}$ appearing in the
upper bound at the right-hand side of (B14) for
all functions $U(\cdot)$ taking part in the sum at the right-hand
side of (B15). Since each fixed sequence of the sets
$A^\varepsilon_s$ and
$B^\varepsilon_t$ appears fewer than $C(k,l)$ times with
an appropriate
constant $C(k,l)$ depending only on the orders $k$ and $l$ of the
integrals we are considering, and
$\sum\limits_{s=1}^{M(\varepsilon)}
\mu(A^\varepsilon_s)+\sum\limits_{t=1}^{M'(\varepsilon)}
\mu(B^\varepsilon_t)
=\mu(A)+\mu(B)<\infty$, the above relations imply that
$$
E\left(Z^{(2)}_\gamma(\varepsilon)\right)^2
\le C_1\varepsilon\sum_{j=1}^{k+l}(\mu(A)+\mu(B))^j
\le C\varepsilon.
$$
Hence relation (B8) holds.
\medskip
To prove Theorem 10.2A in the general case take for all pairs of
functions $f\in{\Cal H}_{\mu,k}$ and $g\in{\Cal H}_{\mu,l}$ two
sequences of elementary functions $f_n\in\bar{\Cal H}_{\mu,k}$
and $g_n\in\bar{\Cal H}_{\mu,l}$, $n=1,2,\dots$, such that
$\|f_n-f\|_2\to0$ and $\|g_n-g\|_2\to0$ as $n\to\infty$. Let us
introduce the notation $F_\gamma(f,g)=F_\gamma$ if the function
$F_\gamma$ is defined in formulas~(10.9a) and~(10.10) with the
help of the functions $f$ and~$g$. It is enough to show that
$$
E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-k!Z_{\mu,k}(f_n)
l!Z_{\mu,l}(g_n)|\to0\quad \text{as }n\to\infty,
\tag B16
$$
and
$$
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\to0\quad \text{as }
n\to\infty \quad\text{for all } \gamma\in\Gamma(k,l), \tag B17
$$
since then a simple limiting procedure $n\to\infty$, and the
already proved part of the theorem for Wiener--It\^o integrals of
elementary functions imply Theorem~10.2A.
To prove relation (B16) write
$$
\align
&E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-
k!Z_{\mu,k}(f_n)l!Z_{\mu,l}(g_n)|\\
&\qquad\le k!l!\left(E|Z_{\mu,k}(f)Z_{\mu,l}(g-g_n)|
+E|Z_{\mu,k}(f-f_n)Z_{\mu,l}(g_n)|\right) \\
&\qquad\le k!l!
\left(\left(EZ^2_{\mu,k}(f)\right)^{1/2}
\left(EZ^2_{\mu,l}(g-g_n)\right)^{1/2}+
\left(EZ^2_{\mu,k}(f-f_n)\right)^{1/2}
\left(EZ^2_{\mu,l}(g_n)\right)^{1/2}\right)\\
&\qquad\le (k!l!)^{1/2}\left(\|f\|_2\|g-g_n\|_2
+\|f-f_n\|_2\|g_n\|_2\right).
\endalign
$$
Relation (B16) follows from this inequality with a limiting
procedure $n\to\infty$.
To prove relation (B17) write
$$
\align
&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&\qquad\le
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g-g_n))\right|+
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f-f_n,g_n))\right|\\
&\qquad\le
|\gamma|!\left(EZ^2_{\mu,|\gamma|}(F_\gamma(f,g-g_n))\right)^{1/2}+
|\gamma|!\left(EZ^2_{\mu,|\gamma|}(F_\gamma(f-f_n,g_n))\right)^{1/2}\\
&\qquad\le (|\gamma|!)^{1/2}\left(\|F_\gamma(f,g-g_n)\|_2+
\|F_\gamma(f-f_n,g_n)\|_2\right),
\endalign
$$
and observe that by relation (10.11)
$\|F_\gamma(f,g-g_n)\|_2\le \|f\|_2\|g-g_n\|_2$, and
$\|F_\gamma(f-f_n,g_n)\|_2\le \|f-f_n\|_2\|g_n\|_2$. Hence
$$
\align
&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&\qquad\le(|\gamma|!)^{1/2}
\left(\|f\|_2\|g-g_n\|_2+\|f-f_n\|_2\|g_n\|_2\right).
\endalign
$$
The last inequality implies relation (B17) with a limiting procedure
$n\to\infty$. Theorem 10.2A is proved.
\beginsection Appendix C. The proof of some results about
Wiener--It\^o integrals.
First I prove It\^o's formula about multiple Wiener--It\^o integrals
(Theorem~10.3). The proof is based on the diagram formula for
Wiener--It\^o integrals and a recursive formula about Hermite
polynomials proved in Proposition~C. In Proposition~C2 I present the
proof of another important property of Hermite polynomials. This
result states that the class of all Hermite polynomials is a
{\it complete} orthogonal system in an appropriate Hilbert space. It
is needed in the proof of Theorem 10.5 about the isomorphism of Fock
spaces to the Hilbert space generated by Wiener--It\^o
integrals. At the end of Appendix~C the proof of Theorem~10.4, a
limit theorem about degenerate $U$-statistics, is given
together with a version of this result about the limiting
behaviour of multiple integrals with respect to a normalized
empirical distribution.
\medskip\noindent
{\bf Proposition C about some properties of Hermite polynomials.}
{\it The functions
$$
H_k(x)=(-1)^k e^{x^2/2}\frac {d^k}{dx^k}e^{-x^2/2},
\quad k=0,1,2,\dots \tag C1
$$
are the Hermite polynomials with leading coefficient 1, i.e.\ $H_k(x)$
is a polynomial of order $k$ with leading coefficient 1 such that
$$
\int_{-\infty}^\infty H_k(x)H_l(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \text{if } k\neq l. \tag C2
$$
Beside this,
$$
\int_{-\infty}^\infty H^2_k(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=k!
\quad \text{for all } k=0,1,2,\dots. \tag C$2'$
$$
The recursive relation
$$
H_k(x)=x H_{k-1}(x)-(k-1)H_{k-2}(x) \tag C3
$$
holds for all $k=1,2,\dots$.}
\medskip\noindent
{\it Remark.} It is convenient to regard relation (C3) as valid
also in the case $k=1$. In this case $H_1(x)=x$, $H_0(x)=1$, and
the relation holds with an arbitrary choice of the function
$H_{-1}(x)$.
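As a quick check of formulas (C1), (C$2'$) and (C3), the first few
Hermite polynomials with leading coefficient 1 are

```latex
$$
H_0(x)=1,\qquad H_1(x)=x,\qquad H_2(x)=x^2-1,\qquad H_3(x)=x^3-3x,
$$
```

and indeed (C3) with $k=3$ gives
$H_3(x)=xH_2(x)-2H_1(x)=x^3-x-2x=x^3-3x$, while (C$2'$) with $k=2$
reads $E(\xi^2-1)^2=E\xi^4-2E\xi^2+1=3-2+1=2=2!$ for a standard
normal random variable~$\xi$.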
\medskip\noindent
{\it Proof of Proposition C.} It is clear from formula (C1) that
$H_k(x)$ is a polynomial of order $k$ with leading coefficient 1.
Take $l\ge k$, and write by means of integration by parts
$$
\align
\int_{-\infty}^\infty H_k(x)H_l(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx&=
\int_{-\infty}^\infty\frac1{\sqrt{2\pi}} H_k(x)(-1)^l\frac{d^l}{dx^l}
e^{-x^2/2}\,dx\\
&=\int_{-\infty}^\infty\frac1{\sqrt{2\pi}} \frac d{dx} H_k(x)
(-1)^{l-1}\frac{d^{l-1}}{dx^{l-1}}e^{-x^2/2}\,dx.
\endalign
$$
Successive partial integration together with the identity
$\frac{d^k}{dx^k}H_k(x)=k!$ yield that
$$
\int_{-\infty}^\infty H_k(x)H_l(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=
k!\int_{-\infty}^\infty\frac1{\sqrt{2\pi}}
(-1)^{l-k}\frac{d^{l-k}}{dx^{l-k}}e^{-x^2/2}\,dx.
$$
The last relation supplies formulas (C2) and (C$2'$).
To prove relation (C3) observe that $H_k(x)-xH_{k-1}(x)$ is a
polynomial of degree at most $k-2$. (The terms $x^k$ cancel, since
both polynomials have leading coefficient 1, and the term
$x^{k-1}$ is also missing from this expression. Indeed, if $k$ is
an even number, then the polynomial $H_k(x)-xH_{k-1}(x)$ is an
even function, and it does not contain the term $x^{k-1}$ with an
odd exponent $k-1$. A similar argument applies if the number $k$
is odd.) Beside this, it is orthogonal
(with respect to the standard normal distribution) to all Hermite
polynomials $H_l(x)$ with $0\le l\le k-3$. Hence
$H_k(x)-xH_{k-1}(x)=CH_{k-2}(x)$ with some constant $C$ to be
determined.
Multiply both sides of the last identity with $H_{k-2}(x)$
and integrate them with respect to the standard normal
distribution. Apply the orthogonality of the polynomials
$H_k(x)$ and $H_{k-2}(x)$, and observe that the identity
$$
\int H_{k-1}(x)xH_{k-2}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=
\int H^2_{k-1}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=(k-1)!
$$
holds. (In this calculation we have exploited that $H_{k-1}(x)$
is orthogonal to $H_{k-1}(x)-xH_{k-2}(x)$, because the order of
the latter polynomial is less than $k-1$.) In such a way we get
the identity $-(k-1)!=C(k-2)!$ for the constant~$C$ in the last
identity, i.e. $C=-(k-1)$, and this implies relation (C3).
\medskip\noindent
{\it Proof of It\^o's formula for multiple Wiener--It\^o
integrals.}\/ Let $K=\sum\limits_{p=1}^m k_p$, the sum of the
orders of the Hermite polynomials, denote the order of the
expression in relation (10.20). Formula (10.20) clearly holds
for expressions of order $K=1$. It will be proved in the general
case by means of induction with respect to the order~$K$.
In the proof the functions $f(x_1)=\varphi_1(x_1)$ and
$$
g(x_1,\dots,x_{K_m-1})=\prod_{j=1}^{K_1-1}\varphi_1(x_j)
\cdot \prod_{p=2}^m \prod_{j=K_{p-1}}^{K_p-1}\varphi_p(x_j),
$$
will be introduced and the product
$Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)$ will be calculated by means
of the diagram formula. (The same notation is applied as in
Theorem 10.3. In particular, $K=K_m$, and in the case $K_1=1$ the
convention $\prod\limits_{j=1}^{K_1-1}\varphi_1(x_j)=1$ is applied.)
In the application of the diagram formula diagrams with two rows
appear. The first row of these diagrams contains the
vertex $(1,1)$ and the second row contains the vertices
$(2,1),\dots,(2,K_m-1)$. It is useful to divide the diagrams into
three disjoint classes. The first class, $\Gamma_0$, contains
only the diagram $\gamma_0$ without any edges. The second
class $\Gamma_1$ consists of those diagrams which have an edge
of the form $((1,1),(2,j))$ with some $1\le j\le k_1-1$, and
the third class $\Gamma_2$ is the set of those diagrams which
have an edge of the form $((1,1),(2,j))$ with some
$k_1\le j\le K_m-1$. Because of the orthogonality of the
functions $\varphi_s$ for different indices~$s$,
$F_\gamma\equiv0$ and $Z_{\mu,K_m-2}(F_\gamma)=0$ for
$\gamma\in\Gamma_2$. The class $\Gamma_1$ contains $k_1-1$
diagrams. Let us consider a diagram $\gamma$ from this class
with an edge $((1,1),(2,j_0))$, $1\le j_0\le k_1-1$. We have for
such a diagram
$F_\gamma=\prod\limits_{j\in\{1,\dots,K_1-1\}
\setminus \{j_0\}}\varphi_1(x_{(2,j)}) \prod\limits_{p=2}^m
\prod\limits_{j=K_{p-1}}^{K_p-1}\varphi_p(x_{(2,j)})$, and
by our inductive hypothesis $(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)=
H_{k_1-2}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p)$. Finally
$$
K_m!Z_{\mu,K_m}(F_{\gamma_0})=
K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m \left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)
$$
for the diagram $\gamma_0\in\Gamma_0$ without any edge.
Our inductive hypothesis also implies the following identity
for the expression we wanted to calculate with the help of
the diagram formula.
$$
Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)=\eta_1
H_{k_1-1}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p).
$$
The above calculations together with the observation
$|\Gamma_1|=k_1-1$ yield the identity
$$ \allowdisplaybreaks
\align
&K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m \left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)=K_m!Z_{\mu,K_m}(F_{\gamma_0})\\
&\qquad=Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)-
\sum_{\gamma\in\Gamma_1}(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)\\
&\qquad=\eta_1 H_{k_1-1}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)
-(k_1-1) H_{k_1-2}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)\\
&\qquad=\left[\eta_1H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)\right]\prod_{p=2}^m H_{k_p}(\eta_p).
\tag C4
\endalign
$$
On the other hand, $\eta_1 H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)=H_{k_1}(\eta_1)$ by formula (C3).
These relations imply formula (10.20), i.e. It\^o's formula.
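To illustrate It\^o's formula, in the simplest non-trivial case
$m=1$, $k_1=2$ (assuming, as in Theorem~10.3, that
$\|\varphi_1\|_2=1$, so that $\eta_1=Z_{\mu,1}(\varphi_1)$ is a
standard normal random variable) formula (10.20) reduces to the
classical identity

```latex
$$
2!\,Z_{\mu,2}\bigl(\varphi_1(x_1)\varphi_1(x_2)\bigr)
=H_2(\eta_1)=\eta_1^2-1.
$$
```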
\medskip
I present the proof of another important property of the Hermite
polynomials in the following Proposition~C2.
\medskip\noindent
{\bf Proposition~C2 on the completeness of the orthogonal system
of Hermite polynomials.} {\it The Hermite polynomials $H_k(x)$,
$k=0,1,2,\dots$, defined in formula (C1) constitute a complete
orthogonal system in the $L_2$-space of the functions square
integrable with respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ on the real line.}
\medskip\noindent
{\it Proof of Proposition C2.} Let us consider the orthogonal
complement of the subspace generated by the Hermite polynomials
in the space of the square integrable functions with respect
to the measure $\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$. It is enough
to prove that this orthogonal complement contains only the
identically zero function. Since the orthogonality of a function to
all polynomials of the form $x^k$, $k=0,1,2,\dots$ is equivalent
to the orthogonality of this function to all Hermite polynomials
$H_k(x)$, $k=0,1,2,\dots$, Proposition~C2 can be reformulated in
the following form:
If a function $g(x)$ on the real line is such that
$$
\int_{-\infty}^\infty x^k g(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \text{for all }k=0,1,2,\dots \tag C5
$$
and
$$
\int_{-\infty}^\infty g^2(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx<\infty,
\tag C6
$$
then $g(x)=0$ for almost all $x$.
Given a function $g(x)$ on the real line whose absolute value is
integrable with respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ define the finite (possibly
signed) measure $\nu_g$,
$$
\nu_g(A)=\int_A g(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
$$
on the measurable sets of the real line together with its Fourier
transform $\tilde\nu_g(t)=\int_{-\infty}^\infty e^{itx}\nu_g(\,dx)$.
(This measure $\nu_g$ and its Fourier transform can
be defined for all functions~$g$ satisfying relation (C6), because
their absolute value is integrable with respect to the Gaussian
measure.) First I show that Proposition~C2 can be reduced to the
following statement: If a function $g$ satisfies both (C5) and (C6)
then $\tilde\nu_g(t)=0$ for all $-\infty<t<\infty$.
$$
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le A(k)
P\left(\|\bar I_{n,k}(f(\ell))\|>\gamma(k)u\right)
\tag14.13d
$$
with some constants $A(k)>0$ and $\gamma(k)>0$ depending only
on the order~$k$ of these generalized $U$-statistics.
We concentrate mainly on the proof of the generalization (14.13d) of
relation (14.13). Formula~(14.14) is a relatively simple
consequence of it. Formula~(14.13d) will be proved by means of an
inductive procedure which works only in this more general setting.
It will be derived from the following statement.
Let us take two independent copies $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$
and $\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ of our original sequence of
random variables $\xi_1,\dots,\xi_n$, and introduce for all sets
$V\subset \{1,\dots,k\}$ the function $\alpha_V(j)$, $1\le j\le k$,
defined as $\alpha_V(j)=1$ if $j\in V$ and $\alpha_V(j)=2$ if
$j\notin V$. Let us define with their help the following
version of decoupled $U$-statistics
$$
\align
I_{n,k,V}(f(\ell))&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon \;
1\le l_j\le n,\; j=1,\dots,k}
\!\!\!\!
f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\alpha_V(1))},\dots,\xi_{l_k}^{(\alpha_V(k))}\right)\\
&\qquad\qquad\qquad\qquad \text{for all }V\subset \{1,\dots,k\}.
\tag D3
\endalign
$$
The following inequality will be proved: There are some constants
$C_k>0$ and $D_k>0$ depending only on the order $k$ of the
generalized $U$-statistic $I_{n,k}(f(\ell))$ such that for all
numbers $u>0$
$$
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le
\sum_{V\subset\{1,\dots,k\},\,1\le|V|\le k-1} C_kP\left(D_k\|
I_{n,k,V}(f(\ell))\|>u\right). \tag D4
$$
Here $|V|$ denotes the cardinality of the set $V$, and the condition
$1\le |V|\le k-1$ in the summation of formula (D4) means that the
sets $V=\emptyset$ and $V=\{1,\dots,k\}$ are omitted from the
summation, i.e. the terms where either $\alpha_V(j)=1$
or $\alpha_V(j)=2$ for all $1\le j\le k$ are not considered.
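It may be instructive to spell out formula (D4) in the simplest
case $k=2$, where the only admissible sets are $V=\{1\}$ and
$V=\{2\}$:

```latex
$$
P\left(\|I_{n,2}(f(\ell))\|>u\right)\le
C_2P\left(D_2\|I_{n,2,\{1\}}(f(\ell))\|>u\right)
+C_2P\left(D_2\|I_{n,2,\{2\}}(f(\ell))\|>u\right),
$$
```

where by formula (D3)
$I_{n,2,\{1\}}(f(\ell))=\frac1{2!}\sum\limits_{1\le l_1,l_2\le n}
f_{l_1,l_2}\left(\xi_{l_1}^{(1)},\xi_{l_2}^{(2)}\right)$, i.e.\ the
two coordinates are evaluated on the two independent copies of the
sample.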
Formula (14.13d) can be derived from formula~(D4) by means
of an inductive argument. The hard part of the
problem is to prove formula~(D4). To do this first the following
simple lemma will be proved.
\medskip\noindent
{\bf Lemma D1.} {\it Let $\xi$ and $\eta$ be two independent and
identically distributed random variables taking values in a
separable Banach space~$B$. Then
$$
3P\left(|\xi+\eta|>\frac 23u\right)\ge P(|\xi|>u)
\quad \text{for all }u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma D1.\/} Let $\xi$, $\eta$ and $\zeta$ be
three independent, identically distributed random variables taking
values in~$B$. Then
$$
\align
3P\left(|\xi+\eta|>\frac23 u\right)
&=P\left(|\xi+\eta|>\frac23 u\right)+
P\left(|\xi+\zeta|>\frac23 u\right)
+P\left(|-(\eta+\zeta)|>\frac23 u\right)\\
&\ge P(|\xi+\eta+\xi+\zeta-\eta-\zeta|>2u)=P(|\xi|>u).
\endalign
$$
\medskip
To prove formula (D4) we introduce the random variable
$$
T_{n,k}(f(\ell))=\frac1{k!}
\sum\limits\Sb (l_1,\dots,l_k),\; (s_1,\dots,s_k) \colon \\
1\le l_j\le n,\, s_j=1 \text{ or }s_j=2,\; j=1,\dots, k,\endSb
\!\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\right)
= \!\!\! \sum_{V\subset\{1,\dots,k\}}\!\!\!\!\!
I_{n,k,V}(f(\ell)). \tag D5
$$
Observe that the random variables $I_{n,k}(f(\ell))$,
$I_{n,k,\emptyset}(f(\ell))$ and $I_{n,k,\{1,\dots,k\}}(f(\ell))$
are identically distributed, and the last two random variables are
independent of each other. Hence Lemma~D1 yields that
$$ \allowdisplaybreaks
\align
P(\|I_{n,k}(f(\ell))\|>u)
&\le3P\left(\|I_{n,k,\emptyset}(f(\ell))
+I_{n,k,\{1,\dots,k\}}(f(\ell))\|>\frac23u\right)\\
&=3P\left(\left\|T_{n,k}(f(\ell))-\!\!\!\!\!\!
\sum_{V\colon\; V\subset\{1,\dots,k\},\,
1\le|V|\le k-1} I_{n,k,V}(f(\ell))\right\|>\frac23u\right) \!\!\!\!\!\!
\\
&\le 3P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u) \tag D6 \\
&\qquad+
\!\!\!\!\!\!\!\!\!
\sum\limits_{V\colon\; V\subset\{1,\dots,k\},\, 1\le|V|\le k-1}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
3P(3\cdot2^{k-1}\|I_{n,k,V}(f(\ell))\|>u).
\endalign
$$
To derive relation (D4) from relation (D6) a good estimate is needed
on the probability $P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u)$. To get
such an estimate the tail distribution of $\|T_{n,k}(f(\ell))\|$
will be compared with that of $\|I_{n,k,V}(f(\ell))\|$ for an
arbitrary set $V\subset\{1,\dots,k\}$. This will be done with the
help of Lemmas~D2 and~D4 formulated below.
In Lemma~D2 a random variable $\bar I_{n,k,V}(f(\ell))$ will be
constructed whose norm has the same distribution as
$\|I_{n,k,V}(f(\ell))\|$. The expression $\bar I_{n,k,V}(f(\ell))$,
whose norm will be investigated, will be defined in formulas~(D7)
and~(D8). It is a random polynomial of some Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$. The coefficients of
this polynomial are random variables, independent of the
Rademacher functions $\varepsilon_1,\dots,\varepsilon_n$. Beside
this, the constant term of this
polynomial equals $T_{n,k}(f(\ell))$. These properties of the
polynomial $\bar I_{n,k,V}(f(\ell))$ together with Lemma~D4
formulated below enable us to prove an estimate on the
distribution of $\|T_{n,k}(f(\ell))\|$ which, together with
formula~(D6), implies relation~(D4). Let us formulate these lemmas.
\medskip\noindent
{\bf Lemma D2.} {\it Let us consider a sequence of independent
random variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is also independent of the random variables
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ appearing in the definition of
the modified decoupled $U$-statistics $I_{n,k,V}(f(\ell))$ given
in formula (D3). Let us define with their help the sequences of
random variables $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ whose elements
$(\eta_l^{(1)},\eta_l^{(2)})
=(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))$,
$1\le l\le n$, are defined by the formula
$$
(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))
=\left(\frac{1+\varepsilon_l}2\xi_l^{(1)}+
\frac{1-\varepsilon_l}2\xi_l^{(2)},\frac{1-\varepsilon_l}2\xi_l^{(1)}+
\frac{1+\varepsilon_l}2\xi_l^{(2)}\right),
$$
i.e. let $(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(1)},\xi_l^{(2)})$ if $\varepsilon_l=1$, and
$(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(2)},\xi_l^{(1)})$ if $\varepsilon_l=-1$, $1\le l\le n$.
Then the joint distribution of the pair of sequences of random
variables $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ agrees with that of the pair of
sequences $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$, which is also independent of the
sequence $\varepsilon_1,\dots,\varepsilon_n$.
Let us fix some $V\subset\{1,\dots,k\}$, and introduce the random
variable
$$
\bar I_{n,k,V}(f(\ell))=\frac1{k!}\sum_{(l_1,\dots,l_k) \colon\;
1\le l_j\le n,\; j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right), \tag D7
$$
where, similarly to formula (D3), $\alpha_V(j)=1$ if $j\in V$, and
$\alpha_V(j)=2$ if $j\notin V$. Then the identity
$$
\align
&2^k\bar I_{n,k,V}(f(\ell)) \tag D8 \\
&\qquad=\frac1{k!}\sum\limits\Sb (l_1,\dots,l_k),
\;(s_1,\dots,s_k)\colon \\
1\le l_j\le n,\; s_j=1 \text{ or }s_j=2,\;j=1,\dots, k, \endSb
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right)
\endalign
$$
holds, where $\kappa^{(j)}_{1,V}=1$ and $\kappa^{(j)}_{2,V}=-1$ if
$j\in V$, and $\kappa^{(j)}_{1,V}=-1$ and $\kappa^{(j)}_{2,V}=1$ if
$j\notin V$, i.e. $\kappa_{1,V}^{(j)}=3-2\alpha_V(j)$ and
$\kappa_{2,V}^{(j)}=-\kappa_{1,V}^{(j)}$.}
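The swap rule defining the pairs $(\eta_l^{(1)},\eta_l^{(2)})$ lends itself to a small numerical illustration. The following Python sketch (a hypothetical check with arbitrarily chosen values, not part of the argument) verifies both the pointwise swap rule stated in the lemma and the invariance of the joint law for a small discrete distribution.

```python
# Illustration of the coupling in Lemma D2: (eta^{(1)}_l, eta^{(2)}_l)
# keeps (xi^{(1)}_l, xi^{(2)}_l) when eps_l = 1 and swaps the two
# coordinates when eps_l = -1, hence the joint law of an i.i.d. pair
# is preserved. The values below are arbitrary illustrative choices.
from collections import Counter
from itertools import product

def eta_pair(eps, xi1, xi2):
    # the defining formula before (D7); integer arithmetic keeps types exact
    eta1 = (1 + eps) // 2 * xi1 + (1 - eps) // 2 * xi2
    eta2 = (1 - eps) // 2 * xi1 + (1 + eps) // 2 * xi2
    return eta1, eta2

assert eta_pair(1, 3, 7) == (3, 7)    # eps = 1: the pair is unchanged
assert eta_pair(-1, 3, 7) == (7, 3)   # eps = -1: the coordinates are swapped

# For xi^{(1)}, xi^{(2)} i.i.d. uniform on a small set and an independent
# symmetric eps, the law of (eta^{(1)}, eta^{(2)}) coincides with that
# of (xi^{(1)}, xi^{(2)}).
vals = [0, 1, 2]
law_xi = Counter((x1, x2) for x1 in vals for x2 in vals for e in (1, -1))
law_eta = Counter(eta_pair(e, x1, x2)
                  for x1 in vals for x2 in vals for e in (1, -1))
assert law_xi == law_eta
```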
\medskip
Before the formulation of Lemma~D4 another lemma, Lemma~D3, will be
presented, which will be applied in the proof of Lemma~D4.
\medskip\noindent
{\bf Lemma D3.} {\it Let $Z$ be a random variable taking values in
a separable Banach space $B$ with expectation zero, i.e. let
$E\kappa(Z)=0$ for all $\kappa\in B'$, where $B'$ denotes the
(Banach) space of all bounded linear functionals on $B$. Then
$P(\|v+Z\|\ge\|v\|)\ge \inf\limits_{\kappa\in B'}
\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$ for all $v\in B$.}
\medskip\noindent
{\bf Lemma D4.} {\it Let us consider a positive integer $n$ and
a sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le l\le n$.
Besides this,
fix some positive integer $k$, take a separable Banach space~$B$ and
choose some elements $a_s(l_1,\dots,l_s)$ of this Banach space $B$,
$1\le s\le k$, $1\le l_j\le n$, $l_j\neq l_{j'}$ if $j\neq j'$,
$1\le j,j'\le s$. With the above notations the inequality
$$
P\left(\left\|v+\sum_{s=1}^k\sum \Sb (l_1,\dots,l_s)\colon \;
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s}\right\|
\ge\|v\|\right)\ge c_k \tag D9
$$
holds for all $v\in B$ with some constant $c_k>0$ which depends
only on the parameter $k$. In particular, it does not depend on
the norm in the separable Banach space~$B$.}
\medskip\noindent
{\it Proof of Lemma D2.}\/ Let us consider the conditional
joint distribution of the sequences of random variables
$\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ under the condition that the
random vector $\varepsilon_1,\dots,\varepsilon_n$ takes
the value of some prescribed
$\pm1$ series of length~$n$. Observe that this conditional
distribution agrees with the joint distribution of the sequences
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ for all possible conditions.
This fact implies the statement about the joint distribution of
the sequences $(\eta_l^{(1)},\eta_l^{(2)})$, $1\le l\le n$ and their
independence of the sequence $\varepsilon_1,\dots,\varepsilon_n$.
To prove identity~(D8) let us fix a set $M\subset\{1,\dots,n\}$,
and consider the case when $\varepsilon_l=1$ if $l\in M$ and
$\varepsilon_l=-1$ if
$l\notin M$. Put $\beta_{V,M}(j,l)=1$ if $j\in V$ and $l\in M$
or $j\notin V$ and $l \notin M$, and let $\beta_{V,M}(j,l)=2$
otherwise. Then we have for all $(l_1,\dots,l_k)$, $1\le l_j\le n$,
$1\le j\le k$, and our fixed set $V$
$$
\aligned
&\sum\limits_{(s_1,\dots,s_k)\colon\; s_j=1
\text{ or }s_j=2,\;j=1,\dots, k,}
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right)\\
&\qquad\qquad \qquad=2^k f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right),
\endaligned \tag D10
$$
since the product
$(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})$
equals either zero or $2^k$, and it equals $2^k$ for that sequence
$(s_1,\dots,s_k)$ for which
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ for all
$1\le j\le k$, and the relation
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ is
equivalent to $\beta_{V,M}(j,l_j)=s_j$ for all $1\le j\le k$. (In
relation~(D10) it is sufficient to consider only such products for
which $l_j\neq l_{j'}$ if $j\neq j'$ because of the properties of
the functions $f_{l_1,\dots,l_k}$.)
Besides this, $\xi_l^{(\beta_{V,M}(j,l))}=\eta_l^{(\alpha_V(j))}$
for all $1\le l\le n$ and $1\le j\le k$, and as a consequence
$$f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right)=
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right).
$$
Summing up the identities (D10) for all $1\le l_1,\dots,l_k\le n$
and applying the last
identity we get relation~(D8), since the identity obtained in this
way holds for all sets $M\subset\{1,\dots,n\}$, i.e. for all values
of the sequence $\varepsilon_1,\dots,\varepsilon_n$.
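The combinatorial core of this argument, namely that for fixed signs the product $(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})$ vanishes for all but exactly one tuple $(s_1,\dots,s_k)$, where it equals $2^k$, can be checked numerically. The following minimal Python sketch (with arbitrarily chosen $k$ and $V$, purely illustrative) enumerates all cases.

```python
# Numerical check of the selection property behind formula (D10):
# each factor 1 + kappa * eps lies in {0, 2}, and for each coordinate j
# exactly one choice of s_j makes kappa * eps = 1; hence exactly one
# tuple (s_1, ..., s_k) survives, with product 2^k.
from itertools import product

def kappa(j, s, V):
    # kappa^{(j)}_{1,V} = 3 - 2*alpha_V(j), kappa^{(j)}_{2,V} = -kappa^{(j)}_{1,V}
    k1 = 1 if j in V else -1
    return k1 if s == 1 else -k1

k = 3
V = {1, 3}                                  # an arbitrary subset of {1,...,k}
for eps in product((1, -1), repeat=k):      # eps[j] plays the role of eps_{l_{j+1}}
    prods = []
    for s in product((1, 2), repeat=k):
        p = 1
        for j in range(k):
            p *= 1 + kappa(j + 1, s[j], V) * eps[j]
        prods.append(p)
    assert max(prods) == 2 ** k                    # one surviving term of size 2^k
    assert sum(1 for p in prods if p != 0) == 1    # all other terms vanish
```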
\medskip\noindent
{\it Proof of Lemma D3.}\/ Let us first observe that if $\xi$
is a real valued random variable with zero expectation, then
$P(\xi\ge0) \ge \frac{(E|\xi|)^2}{4E\xi^2}$, since $E\xi=0$ implies
$E|\xi|=2E(\xi I(\{\xi\ge0\}))$, and hence $(E|\xi|)^2
=4\left(E(\xi I(\{\xi\ge0\}))\right)^2\le 4P(\xi\ge0)E\xi^2$ by the
Schwarz inequality, where $I(A)$ denotes the indicator function of
the set~$A$. (In the above calculation and in the subsequent proofs
I apply the convention $\frac00=1$. We need this convention if
$E\xi^2=0$. In this case, because of the condition $E\xi=0$, we have
$P(\xi=0)=1$, hence the above proved inequality holds in this
case, too.)
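This scalar inequality can be confirmed exactly on a concrete example. The following Python sketch (with an arbitrarily chosen zero-mean two-point distribution, purely illustrative) computes all moments as exact fractions.

```python
# Exact check of P(xi >= 0) >= (E|xi|)^2 / (4 E xi^2) for a zero-mean
# variable: here xi = 2 with probability 1/3 and xi = -1 with
# probability 2/3 (an arbitrary illustrative choice).
from fractions import Fraction

dist = [(2, Fraction(1, 3)), (-1, Fraction(2, 3))]

mean     = sum(p * v for v, p in dist)            # E xi
abs_mean = sum(p * abs(v) for v, p in dist)       # E|xi| = 4/3
second   = sum(p * v * v for v, p in dist)        # E xi^2 = 2
p_nonneg = sum(p for v, p in dist if v >= 0)      # P(xi >= 0) = 1/3

assert mean == 0
assert p_nonneg >= abs_mean ** 2 / (4 * second)   # 1/3 >= 2/9
```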
Given some $v\in B$, let us choose a linear functional $\kappa$ such
that $\|\kappa\|=1$ and $\kappa(v)=\|v\|$. Such a functional exists
by the Hahn--Banach theorem. Observe that
$\{\omega\colon\;\|v+Z(\omega)\|
\ge\|v\|\} \supset\; \{\omega\colon\;
\kappa(v+Z(\omega))\ge\kappa(v)\}
=\{\omega\colon\; \kappa(Z(\omega))\ge0\}$. Besides this,
$E\kappa(Z)=0$. Hence we can apply the above proved inequality
for $\xi=\kappa(Z)$, and it yields that
$P(\|v+Z\|\ge\|v\|)\ge P(\kappa(Z)\ge0)
\ge\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$. Lemma~D3 is proved.
\medskip\noindent
{\it Proof of Lemma D4.}\/
Take the class of random polynomials
$$
Y=\sum_{s=1}^k\sum \Sb (l_1,\dots,l_s)\colon\;
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
b_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
where $\varepsilon_l$, $1\le l\le n$, are independent random
variables with $P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
and the coefficients
$b_s(l_1,\dots,l_s)$, $1\le s\le k$, are arbitrary real numbers.
The proof of Lemma~D4 can be reduced to the statement that there
exists a constant $c_k>0$ depending only on the order~$k$ of these
polynomials such that the inequality
$$
(E|Y|)^2\ge 4c_k EY^2 \tag D11
$$
holds for all such polynomials~$Y$. Indeed, consider the polynomial
$$
Z=\sum_{s=1}^k\sum \Sb (l_1,\dots,l_s)\colon\;
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
and observe that $E\kappa(Z)=0$ for all linear functionals $\kappa$
on the space $B$. Hence Lemma~D3 implies that the left-hand side
expression in~(D9) is bounded from below by
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$.
On the other hand, relation~(D11), applied to the real valued
polynomials $\kappa(Z)$ with the coefficients
$b_s(l_1,\dots,l_s)=\kappa(a_s(l_1,\dots,l_s))$, implies that
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}
{4E\kappa(Z)^2}\ge c_k$.
To prove relation (D11) first we compare the moments $EY^2$ and
$EY^4$. Let us introduce the random variables
$$
Y_s=\sum \Sb (l_1,\dots,l_s)\colon \;
1\le l_j\le n,\; j=1,\dots, s,\\ l_j\neq l_{j'} \text{ if }
j\neq j'\endSb b_s(l_1,\dots,l_s)
\varepsilon_{l_1}\cdots\varepsilon_{l_s},
\quad 1\le s\le k.
$$
We shall show that the estimates of Section~13 imply that
$$
EY_s^4\le 2^{4s} \left(EY_s^2\right)^2 \tag D12
$$
for these random variables $Y_s$.
Relation (D12) together with the uncorrelatedness of the random
variables $Y_s$, $1\le s\le k$, implies that
$$
\align
EY^4&=E\left(\sum_{s=1}^k Y_s\right)^4\le k^3\sum_{s=1}^k EY_s^4\le
k^3 2^{4k} \sum_{s=1}^k (EY_s^2)^2\\
&\le k^3 2^{4k}\left(\sum_{s=1}^k EY_s^2\right)^2=k^3 2^{4k}(EY^2)^2.
\endalign
$$
This estimate together with the H\"older inequality with $p=3$ and
$q=\frac32$ yields that $EY^2=E\left(|Y|^{4/3}\cdot|Y|^{2/3}\right)\le
(EY^4)^{1/3}(E|Y|)^{2/3}\le k\,2^{4k/3}(EY^2)^{2/3}(E|Y|)^{2/3}$,
i.e. $EY^2\le k^32^{4k}(E|Y|)^2$, and relation (D11) holds with
$4c_k=k^{-3}2^{-4k}$. Hence to complete the proof of Lemma~D4
it is enough to check relation~(D12).
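For small parameters the moments of such a Rademacher polynomial can be computed exactly by enumerating all sign patterns. The following Python sketch (with $n=4$, $k=2$ and the arbitrary illustrative choice $b_1(l)=b_2(l_1,l_2)=1$, all hypothetical parameters) confirms the bound $EY^2\le k^32^{4k}(E|Y|)^2$ obtained above in this particular case.

```python
# Exact moments of a small Rademacher polynomial of order k = 2 by full
# enumeration of the 2^n sign patterns; coefficients chosen arbitrarily.
from fractions import Fraction
from itertools import permutations, product

n, k = 4, 2

def Y(eps):
    y = sum(eps)                                       # s = 1 part, b_1(l) = 1
    y += sum(eps[l1] * eps[l2]
             for l1, l2 in permutations(range(n), 2))  # s = 2 part, l1 != l2
    return y

values = [Y(eps) for eps in product((1, -1), repeat=n)]
e_abs = Fraction(sum(abs(y) for y in values), 2 ** n)   # E|Y|
e_sq  = Fraction(sum(y * y for y in values), 2 ** n)    # E Y^2

assert (e_abs, e_sq) == (4, 28)                         # exact for this choice
assert e_sq <= k ** 3 * 2 ** (4 * k) * e_abs ** 2       # 28 <= 2048 * 16
```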
In the proof of relation (D12) it can be assumed that the
coefficients $b_s(l_1,\dots,l_s)$ of the random variable $Y_s$ are
symmetric functions of the arguments
$l_1,\dots,l_s$, since a symmetrization of these coefficients does
not change the value of $Y_s$. Put
$$
B^2_s=\sum \Sb (l_1,\dots,l_s)\colon\;
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \text{ if } j\neq j'\endSb b_s^2(l_1,\dots,l_s),
\quad 1\le s\le k.
$$
Then
$$
EY_s^2=s! B_s^2,
$$
and
$$
EY_s^4\le 1\cdot3\cdot5\cdots(4s-1)B_s^4
=\frac{(4s)!}{2^{2s}(2s)!}B_s^4
$$
by Lemmas 13.4 and 13.5 with the choice $M=2$ and $k=s$.
Inequality~(D12) follows from the last two relations. Indeed, to
prove formula~(D12) by means of these relations it is enough to
check that $\frac{(4s)!}{2^{2s}(2s)!(s!)^2}\le 2^{4s}$. This
inequality is easy to check by induction with respect to $s$.
(Actually, there is a well-known inequality in the literature,
known under the name Borell's inequality, which implies
inequality~(D12) with a better coefficient on the right-hand side
of this estimate.) We have proved Lemma~D4.
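The two ingredients of this last step are easy to confirm numerically. The following Python sketch (illustrative only) checks the double factorial identity $1\cdot3\cdot5\cdots(4s-1)=(4s)!/(2^{2s}(2s)!)$ and the inequality $(4s)!/(2^{2s}(2s)!(s!)^2)\le 2^{4s}$ for small values of $s$.

```python
# Numerical verification of the identity and the inequality used in the
# final step of the proof of (D12), for s = 1, ..., 12.
from fractions import Fraction
from math import factorial

for s in range(1, 13):
    odd_prod = 1
    for m in range(1, 4 * s, 2):      # 1 * 3 * 5 * ... * (4s - 1)
        odd_prod *= m
    # (4s - 1)!! = (4s)! / (2^{2s} (2s)!)  -- an exact integer identity
    assert odd_prod == factorial(4 * s) // (2 ** (2 * s) * factorial(2 * s))
    # the ratio bounded in the proof of (D12); use Fraction, it need not
    # be an integer (e.g. 26.25 for s = 2)
    ratio = Fraction(factorial(4 * s),
                     2 ** (2 * s) * factorial(2 * s) * factorial(s) ** 2)
    assert ratio <= 2 ** (4 * s)
```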
\medskip
Let us turn back to the estimation of the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u)$. Let us introduce the
$\sigma$-algebra ${\Cal F}={\Cal B}(\xi_l^{(1)},\xi_l^{(2)},\,1\le
l\le n)$ generated by the random variables $\xi_l^{(1)},\xi_l^{(2)}$,
$1\le l\le n$, and fix some set $V\subset\{1,\dots,k\}$.
I show with the help of Lemma~D4 and formula~(D8) that there
exists some constant $c_k>0$ such that the random
variables $T_{n,k}(f(\ell))$ defined in formula~(D5) and
$\bar I_{n,k,V}(f(\ell))$ defined in formula~(D7) satisfy
the inequality
$$
P\left(\|2^k\bar I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\Cal F}\right)
\ge c_k \quad \text{ with probability 1.} \tag D13
$$
In the proof of~(D13) I shall exploit that in formula~(D8)
$2^k\bar I_{n,k,V}(f(\ell))$ is represented by a polynomial of the
Rademacher functions $\varepsilon_1,\dots,\varepsilon_n$ whose
constant term is
$T_{n,k}(f(\ell))$. The coefficients of this polynomial are
functions of the random variables $\xi^{(1)}_l$ and $\xi^{(2)}_l$,
$1\le l\le n$. The independence of these random variables from
$\varepsilon_{l}$, $1\le l\le n$, and the definition of the
$\sigma$-algebra ${\Cal F}$ yield that
$$
\align
&P\left(\|2^k\bar I_{n,k,V}(f(\ell))\|>
\|T_{n,k}(f(\ell))\||{\Cal F}\right)\\
&\qquad=P_{\varepsilon_V}\biggl(\biggl\|\frac1{k!} \!\!\!\!\!\!\!\!\!\!\!
\sum\limits\Sb (l_1,\dots,l_k),\; (s_1,\dots,s_k)\colon\\
1\le l_j\le n, s_j=1 \text{ or }s_j=2,\; j=1,\dots, k,\endSb
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right)\biggr\|\\
&\qquad \qquad\qquad\qquad\qquad\qquad
>\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|\biggr),
\tag D14
\endalign
$$
where $P_{\varepsilon_V}$ means that the values of the random variables
$\xi_l^{(1)}$, $\xi_l^{(2)}$, $1\le l\le n$, are fixed (their values
depend on the atom of the $\sigma$-algebra ${\Cal F}$ we are
considering), and the probability is taken with respect to the
remaining random variables $\varepsilon_l$, $1\le l\le n$.
On the right-hand side of (D14) we consider the probability
of the event that the norm of a polynomial of
order~$k$ of the random variables
$\varepsilon_1,\dots,\varepsilon_n$ is larger than
$\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|$.
Besides this, the constant term of this polynomial
equals~$T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)$.
Hence this probability can be bounded from below by
means of Lemma~D4, and this bound yields relation~(D13).
As the distributions of $I_{n,k,V}(f(\ell))$ and
$\bar I_{n,k,V}(f(\ell))$ agree by the first statement of Lemma~D2
and a comparison of formulas~(D3) and~(D7), relation (D13)
implies that
$$
\align
&P\left(\|2^k I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right)
=P\left(\|2^k\bar I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right) \\
&\qquad
\ge P\left(\|2^k\bar I_{n,k,V}(f(\ell))\|\ge\|T_{n,k}(f(\ell))\|,\;
\|T_{n,k}(f(\ell))\|\ge\frac13\cdot2^{1-k} u\right)\\
&\qquad=\int_{\{\omega\colon\; \|T_{n,k}(f(\ell))(\omega)\|
\ge\frac13\cdot2^{1-k} u\}}
P\left(\|2^k\bar I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\Cal F}\right)\,dP\\
&\qquad \ge c_k P(3\cdot2^{k-1} \|T_{n,k}(f(\ell))\|\ge u).
\endalign
$$
The last inequality, applied for all sets $V\subset\{1,\dots,k\}$,
$1\le |V|\le k-1$, together with relation~(D6) implies formula~(D4).
Relation (14.13d) will be proved together with another inductive
hypothesis with the help of relation~(D4) by means of an induction
procedure with respect to the order $k$ of the $U$-statistic. To
formulate the other inductive hypothesis some new quantities will
be introduced. Let $\Cal W=\Cal W(k)$
denote the set of all partitions of the set $\{1,\dots,k\}$. Let
us fix $k$ independent copies $\xi_{1}^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, of the sequence of random variables
$\xi_{1},\dots,\xi_n$. Given a partition
$W=(U_1,\dots,U_s)\in\Cal W(k)$ let us introduce the function
$s_W(j)$, $1\le j\le k$, which tells for each argument $j$ the index
of that element of the partition~$W$ which contains the point $j$,
i.e. the value of $s_W(j)$ is defined by the relation
$j\in U_{s_W(j)}$. Let us introduce the
expression
$$
\align
I_{n,k,W}(f(\ell))&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon \;
1\le l_j\le n,\;j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_W(1))},
\dots,\xi_{l_k}^{(s_W(k))}\right)\\
&\qquad\qquad\qquad\qquad \text{for all }W\in\Cal W(k).
\endalign
$$
An expression of the form $I_{n,k,W}(f(\ell))$, $W\in\Cal W(k)$,
will be called a decoupled $U$-statistic with generalized
decoupling. Given a partition $W=(U_1,\dots,U_s)\in\Cal W(k)$
let us call the number $s$ of the elements of
this partition the rank both of the partition $W$ and of the
decoupled $U$-statistic $I_{n,k,W}(f(\ell))$ with generalized
decoupling.
Now I formulate the following hypothesis. For all $k\ge2$ and
$2\le j\le k$ there exist some constants $C(k,j)>0$ and
$\delta(k,j)>0$ such that for all $W\in\Cal W(k)$ a decoupled
$U$-statistic $I_{n,k,W}(f(\ell))$ with generalized decoupling
satisfies the inequality
$$
\aligned
&P(\|I_{n,k,W}(f(\ell))\|>u)\le C(k,j)P\left(\|\bar
I_{n,k}(f(\ell))\|>\delta(k,j) u\right) \\
&\qquad\qquad\qquad\text{for all }2
\le j\le k \text{ if the rank of } W \text{ equals }j.
\endaligned \tag D15
$$
It will be proved by induction with respect to $k$ that both
relations~(14.13d) and~(D15) hold for $U$-statistics of order~$k$.
Let us observe that for $k=2$ relation~(14.13d) follows from~(D4).
Relation~(D15) also holds for $k=2$, since in this case we have to
consider only the case $j=k=2$, and relation (D15) clearly holds
in this case with $C(2,2)=1$ and
$\delta(2,2)=1$. Hence we can start our inductive proof
with $k=3$. First I prove relation~(D15).
In relation (D15) the tail distribution of decoupled
$U$-sta\-tis\-tics with generalized decoupling is compared
with that of the decoupled $U$-statistic
$\bar I_{n,k}(f(\ell))$ introduced in~(D2). Given the order $k$
of these $U$-statistics, relation~(D15) will be proved
by means of a backward induction with
respect to the rank $j$ of the decoupled $U$-statistics
$I_{n,k,W}(f(\ell))$ with generalized decoupling.
Relation (D15) clearly holds for $j=k$ with $C(k,k)=1$ and
$\delta(k,k)=1$. To prove it for decoupled $U$-statistics
with generalized decoupling of rank $2\le j\le k-1$ it will be shown
that for each partition $W\in\Cal W(k)$ whose rank $j$ satisfies
$2\le j\le k-1$ an appropriate partition $\bar W\in\Cal W(k)$ of
strictly larger rank can be chosen in such a way that the inequality
$$
P(\|I_{n,k,W}(f(\ell))\|>u)\le \bar A(k) P\left(\|I_{n,k,\bar W}
(f(\ell))\|>\bar \gamma(k) u\right) \tag D16
$$
holds with $\bar A(k)=\sup\limits_{2\le p\le k-1}A(p)$ and
$\bar\gamma(k)=\inf\limits_{2\le p\le k-1}\gamma(p)$, where the
constants $A(p)$ and $\gamma(p)$ agree with the corresponding
coefficients in formula~(14.13d). Here it can be assumed, by
reindexing the coordinates if necessary, that the element $U_j$ of
the partition $W$ is of the form $U_j=\{t,\dots,k\}$.
To prove relation~(D16) (in the case $U_j=\{t,\dots,k\}$) let
us define the $\sigma$-algebra ${\Cal F}$ generated by the random
variables appearing in the first $t-1$ coordinates of these
$U$-statistics, i.e. by the random variables $\xi^{s_W(j)}_{l}$,
$1\le j\le t-1$, $1\le l\le n$. We have
$2\le t\le k-1$. By our inductive hypothesis relation~(14.13d) holds
for $U$-statistics of order $p=k-t+1$, since $2\le p\le k-1$. I
claim that this implies that
$$
P(\|I_{n,k,W}(f(\ell))\|>u|{\Cal F})\le A(k-t+1)
P\left(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u|{\Cal F}\right) \tag D17
$$
with probability~1. Indeed, by the independence properties of
the random variables $\xi_l^{s_W(j)}$
(and $\xi_l^{s_{\bar W}(j)}$),
$1\le j\le k$, $1\le l\le n$,
$$
P(\|I_{n,k,W}(f(\ell))\|>u|{\Cal F})
=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,W}(f(\ell))\|>u)
$$
and
$$
P\left(\|I_{n,k,\bar W}(f(\ell))\|>\gamma(k-t+1)u|{\Cal F}\right)
=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u),
$$
where $P_{\xi_l^{s_W(j)}, 1\le j\le t-1}$ denotes that the
values of the
random variables $\xi_l^{s_W(j)}(\omega)$, $1\le j\le t-1$,
$1\le l\le n$, are fixed, and we consider the probability that
the appropriate functions of these fixed values and of the
remaining random variables
$\xi_l^{s_W(j)}$ and $\xi_l^{s_{\bar W}(j)}$, $t\le j\le k$,
satisfy the desired relation. These identities and the relation
between the partitions $W$ and $\bar W$ imply that relation (D17) is
equivalent to relation (14.13d) for the generalized
$U$-statistics of order $2\le k-t+1\le k-1$ with kernel functions
$$
\align
&f_{l_t,\dots,l_k}(x_t,\dots,x_k)\\
&\qquad=\sum_{(l_1,\dots,l_{t-1})
\colon\, 1\le l_j\le n,\;1\le j\le t-1}
\!\!\!\!\!
f_{l_1,\dots,l_k}(\xi_{l_1}^{s_W(1)}(\omega),\dots,
\xi_{l_{t-1}}^{s_W(t-1)}(\omega),x_t,\dots,x_k).
\endalign
$$
Relation~(D16) follows from inequality~(D17) by taking expectation
on both sides. As the rank of $\bar W$ is strictly greater
than the rank of $W$, relation~(D16) together with our backward
inductive assumption imply relation~(D15) for all $2\le j\le k$.
Relation~(D15) implies in particular (with the application of
partitions of order~$k$ and rank~2) that the terms in the sum on
the right-hand side of~(D4) satisfy the inequality
$P\left(D_k\|I_{n,k,V}(f(\ell))\|>u\right)\le \bar C_k
P\left(\|\bar I_{n,k}(f(\ell))\|>\bar D_k u\right)$ with some
appropriate $\bar C_k>0$ and $\bar D_k>0$ for all
$V\subset\{1,\dots,k\}$, $1\le|V|\le k-1$. This inequality together
with relation~(D4) imply that inequality~(14.13d) also holds for
the parameter~$k$.
\medskip
In such a way we get the proof of relation (14.13d) and of its
special case, relation~(14.13). Let us prove
formula~(14.14) with its help first in the simpler case when the
supremum of finitely many functions is taken. If $M<\infty$
functions $f_1,\dots,f_M$ are considered, then relation (14.14)
for the supremum of the $U$-statistics and decoupled
$U$-statistics with these kernel functions can be derived from
formula (14.13) if it is applied for the function
$f=(f_1,\dots,f_M)$ with values in the separable Banach space
$B_M$ which consists of the vectors
$(v_1,\dots,v_M)$, $v_j\in B$, $1\le j\le M$, and the norm
$\|(v_1,\dots,v_M)\|=\sup\limits_{1\le j\le M}\|v_j\|$ is
introduced in it. The application of formula (14.13) with this
choice yields formula~(14.14) for this supremum. Let us emphasize
that the constants appearing in this estimate do not depend on the
number~$M$. (We took only $M<\infty$ kernel functions, because
with such a choice the Banach space $B_M$ defined above
is also separable.) Since the distributions of the random
variables $\sup\limits_{1\le s\le M}\left\|I_{n,k}(f_s)\right\|$
converge to the distribution of
$\sup\limits_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|$, and
the distributions of the random variables $\sup\limits_{1\le s\le M}
\left\| \bar I_{n,k}(f_s)\right\|$ converge to the distribution of
$\sup\limits_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|$ as
$M\to\infty$, relation (14.14) in the general case follows from its
already proved special case and a limiting procedure $M\to\infty$.
\medskip\noindent
{\it Remark.} The above proved formula (14.13d) can be slightly
generalized. It also holds if the expressions $I_{n,k}(f(\ell))$
and $\bar I_{n,k}(f(\ell))$ appearing in this inequality are defined
in a more general way. Namely, they are the random functions
introduced in formulas (D1) and (D2), but the sequences
$\xi_1,\dots,\xi_n$ and their independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$ in these formulas are independent
random variables which may also be non-identically distributed.
Such a generalization can be proved without any essential change in
the original proof.
\parskip=1pt plus 0.5pt
\beginsection References:
\item{1.)} Adamczak, R. (2006) Moment inequalities for
$U$-statistics. {\it Annals of Probability} {\bf34}, 2288--2314
\item{2.)} Alexander, K. (1987) The central limit theorem for
empirical processes over Vapnik--\v{C}ervonenkis classes. {\it Annals
of Probability} {\bf 15}, 178--203
\item{3.)} Arcones, M. A. and Gin\'e, E. (1993) Limit theorems for
$U$-processes. {\it Annals of Probability}, {\bf 21}, 1494--1542
\item{4.)} Arcones, M. A. and Gin\'e, E. (1994) $U$-processes
indexed by Vapnik--\v{C}ervonenkis classes of functions with
application to asymptotics and bootstrap of $U$-statistics with
estimated parameters. {\it Stoch. Proc. Appl.} {\bf 52}, 17--38
\item{5.)} Bennett, G. (1962) Probability inequalities for the sum of
independent random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 57}, 33--45
\item{6.)} Bonami, A. (1970) \'Etude des coefficients de Fourier des
fonctions de $L^p(G)$. {\it Ann. Inst. Fourier (Grenoble)\/} {\bf 20}
335--402
\item{7.)} de la Pe\~na, V. H. and Gin\'e, E. (1999) {\it
Decoupling. From dependence to independence.}\/ Springer series in
statistics. Probability and its application. Springer Verlag,
New York, Berlin, Heidelberg
\item{8.)} de la Pe\~na, V. H. and Montgomery--Smith, S. (1995)
Decoupling inequalities for the tail-probabilities of multivariate
$U$-statistics. {\it Ann. Probab.}, {\bf 23}, 806--816
\item{9.)} Dobrushin, R. L. (1979) Gaussian and their subordinated
self-similar random generalized fields. {\it Annals of Probability}\/
{\bf 7}, 1--28
\item{10.)} Dudley, R. M. (1978) Central limit theorems for empirical
measures. {\it Annals of Probability}\/ {\bf 6}, 899--929
\item{11.)} Dudley, R. M. (1984) A course on empirical processes.
{\it Lecture Notes in Mathematics}\/ {\bf 1097}, 1--142. Springer
Verlag, New York
\item{12.)} Dudley, R. M. (1989) {\it Real Analysis and
Probability.}\/ Wadsworth \& Brooks, Pacific Grove, California
\item{13.)} Dudley, R. M. (1998) {\it Uniform Central Limit
Theorems.}\/ Cambridge University Press, Cambridge U.K.
\item{14.)} Dynkin, E. B. and Mandelbaum, A. (1983) Symmetric
statistics, Poisson processes and multiple Wiener integrals. {\it
Annals of Statistics\/} {\bf 11}, 739--745
\item{15.)} Frankl, P. and Pach, J. (1983) On the number of sets in
a null $t$-design. {\it European J. Combinatorics} {\bf 4}, 21--23
\item{16.)} Gin\'e, E. and Guillou, A. (2001) On consistency of
kernel density estimators for randomly censored data: Rates holding
uniformly over adaptive intervals. {\it Ann. Inst. Henri
Poincar\'e PR\/} {\bf 37} 503--522
\item{17.)} Gin\'e, E., Kwapie\'n, S., Lata\l{}a, R. and Zinn, J.
(2001) The LIL for canonical $U$-statistics of order~2.
{\it Annals of Probability} {\bf 29}, 520--527
\item{18.)} Gin\'e, E., Lata\l{}a, R. and Zinn, J. (2000)
Exponential and moment inequalities for $U$-statistics in {\it High
dimensional probability II.} Progress in Probability 47. 13--38.
Birkh\"auser Boston, Boston, MA.
\item{19.)} Gross, L. (1975) Logarithmic Sobolev inequalities.
{\it Amer. J. Math.} {\bf 97}, 1061--1083
\item{20.)} Guionnet, A. and Zegarlinski, B. (2003) Lectures on
Logarithmic Sobolev inequalities. {\it Lecture Notes in Mathematics}
{\bf 1801}, 1--134. Springer Verlag, New York
\item{21.)} Hanson, D. L. and Wright, F. T. (1971) A bound on the
tail probabilities for quadratic forms in independent random
variables. {\it Ann. Math. Statist.} {\bf 42} 52--61
\item{22.)} Hoeffding, W. (1948) A class of statistics with
asymptotically normal distribution. {\it Ann. Math. Statist.}
{\bf 19} 293--325
\item{23.)} Hoeffding, W. (1963) Probability inequalities for sums
of bounded random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 58}, 13--30
\item{24.)} It\^o, K. (1951) Multiple Wiener integral. {\it J. Math.
Soc. Japan}\/ {\bf3}, 157--164
\item{25.)} Kaplan, E. L. and Meier, P. (1958) Nonparametric
estimation from incomplete data. {\it Journal of the American
Statistical Association} {\bf 53}, 457--481
\item{26.)} Lata\l{a}, R. (2006) Estimates of moments and tails of
Gaussian chaoses. {\it Annals of Probability} {\bf34} 2315--2331
\item{27.)} Ledoux, M. (1996) On Talagrand deviation inequalities
for product measures. {\it ESAIM: Probab. Statist.}\/ {\bf 1},
63--87. Available at http://www.emath.fr/ps/.
\item{28.)} Ledoux, M. (2001) The concentration of measure phenomenon.
{\it Mathematical Surveys and Monographs}\/ {\bf 89} American Mathematical
Society, Providence, RI.
\item{29.)} Major, P. (1981) Multiple Wiener--It\^o integrals. {\it
Lecture Notes in Mathematics\/} {\bf 849}, Springer Verlag, Berlin,
Heidelberg, New York
\item{30.)} Major, P. (1988) On the tail behaviour of the
distribution function of multiple stochastic integrals. {\it
Probability Theory and Related Fields}, {\bf 78}, 419--435
\item{31.)} Major, P. (1994) Asymptotic distributions for weighted
$U$-statistics. {\it The Annals of Probability}, {\bf 22} 1514--1535
\item{32.)} Major, P. (2005) An estimate about multiple stochastic
integrals with respect to a normalized empirical measure.
{\it Studia Scientiarum Mathematicarum Hungarica.} 295--341
\item{33.)} Major, P. (2005) Tail behaviour of multiple random
integrals and $U$-sta\-tis\-tics. {\it Probability Reviews.} 448--505
\item{34.)} Major, P. (2006) An estimate on the maximum of a nice
class of stochastic integrals. {\it Probability Theory
and Related Fields.} {\bf 134}, 489--537
\item{35.)} Major, P. (2006) A multivariate generalization of
Hoeffding's inequality. {\it Electronic Communications in
Probability} {\bf 2}, 220--229
\item{36.)} Major, P. (2007) On a multivariate version of
Bernstein's inequality. {\it Electronic Journal of
Probability} {\bf12}, 966--988
\item{37.)} Major, P. (2005) On the tail behaviour of multiple
random integrals and degenerate $U$-statistics. (First version of
this lecture note) http://www.renyi.hu/\~{}major
\item{38.)} Major, P. and Rejt\H{o}, L. (1988) Strong embedding of
the distribution function under random censorship. {\it Annals of
Statistics} {\bf 16}, 1113--1132
\item{39.)} Major, P. and Rejt\H{o}, L. (1998) A note on
nonparametric estimations. In the conference volume for the 65th
birthday of Mikl\'os Cs\"org\H{o}. 759--774
\item{40.)} Malyshev, V. A. and Minlos, R. A. (1991) Gibbs Random
Fields. Method of cluster expansion. Kluwer Academic Publishers,
Dordrecht
\item{41.)} Massart, P. (2000) About the constants in Talagrand's
concentration inequalities for empirical processes. {\it Annals of
Probability}\/ {\bf 28}, 863--884
\item{42.)} McKean, H. P. (1973) Wiener's theory of non-linear
noise. In {\it Stochastic Differential Equations.}
SIAM--AMS Proc. {\bf 6}, 197--209
\item{43.)} Nelson, E. (1973) The free Markov field. {\it J. Functional
Analysis} {\bf 12}, 211--227
\item{44.)} Pollard, D. (1984) {\it Convergence of Stochastic
Processes.}\/ Springer Verlag, New York
\item{45.)} Rota, G.-C. and Wallstrom, T. C. (1997) Stochastic
integrals: a combinatorial approach. {\it Annals of Probability}
{\bf 25}, 1257--1283
\item{46.)} Surgailis, D. (1984) On multiple Poisson stochastic
integrals and associated Markov semigroups. {\it Probab. Math.
Statist.} {\bf 3}, no. 2, 217--239
\item{47.)} Surgailis, D. (2000) Long-range dependence and Appell
rank. {\it Annals of Probability} {\bf 28} 478--497
%\item{41.)} Surgailis, D. (2000) CLTs for polynomials of linear
%sequences: Diagram formulae with illustrations. in {\it Long Range
%Dependence} 111--128 Birkh\"auser, Boston, Boston, MA.
\item{48.)} Szeg\H{o}, G. (1967) {\it Orthogonal Polynomials.}
American Mathematical Society Colloquium Publications. Vol. 23
\item{49.)} Takemura, A. (1983) Tensor Analysis of ANOVA
decomposition. {\it J. Amer. Statist. Assoc.} {\bf 78}, 894--900
\item{50.)} Talagrand, M. (1994) Sharper bounds for Gaussian and
empirical processes. {\it Annals of Probability} {\bf 22}, 28--76
\item{51.)} Talagrand, M. (1996) New concentration inequalities in
product spaces. {\it Invent. Math.} {\bf 126}, 505--563
\item{52.)} Talagrand, M. (2005) {\it The Generic Chaining.}
Springer Monographs in Mathematics. Springer Verlag, Berlin
Heidelberg New York
\item{53.)} Vapnik, V. N. (1995) {\it The Nature of Statistical
Learning Theory.} Springer Verlag, New York
\vfill\eject
\centerline {\script CONTENTS}
$$
\vbox{\halign{\hfill # \ &\vtop{\hsize=12truecm\parindent=0pt #
\vskip3pt} \quad &\vtop{\hsize=0.5truecm\parindent=0pt #
\vskip3pt} \cr
1. & Introduction. \dotfill &\rightline{1}\cr
2. & Motivation of the investigation. Discussion of some problems.
\dotfill & \rightline{3}\cr
3. & Some estimates about sums of independent random variables.
\dotfill & \rightline{11}\cr
4. & On the supremum of a nice class of partial sums.
\dotfill & \rightline{16}\cr
5. & Vapnik--\v{C}ervonenkis classes and $L_2$-dense classes of
functions. \dotfill & \rightline{26}\cr
6. & The proof of Theorems 4.1 and 4.2 on the supremum of random sums.
\dotfill & \vskip5pt \rightline{31} \cr
7. & The completion of the proof of Theorem 4.1.
\dotfill & \rightline{40}\cr
8. & Formulation of the main results of this work.
\dotfill & \rightline{49}\cr
9. & Some results about $U$-statistics.
\dotfill & \rightline{61}\cr
10. & Multiple Wiener--It\^o integrals and their properties.
\dotfill & \rightline{76}\cr
11. & The diagram formula for products of degenerate $U$-statistics.
\dotfill & \rightline{93}\cr
12. & The proof of the diagram formula for $U$-statistics. \dotfill &
\rightline{105}\cr
13. & The proof of Theorems 8.3, 8.5 and Example 8.7. \dotfill &
\rightline{112}\cr
14. & Reduction of the main result in this work. \dotfill &
\rightline{127}\cr
15. & The strategy of the proof for the main result of this work.
\dotfill & \rightline{137}\cr
16. & A symmetrization argument. \dotfill & \rightline{144}\cr
17. & The proof of the main result. \dotfill & \rightline{159}\cr
18. & An overview of the results in this work.
\dotfill & \rightline{173}\cr
\noalign{\medskip}
&Appendix A.
The proof of some results about Vapnik--\v{C}ervonenkis classes.
\dotfill & \vskip5pt \rightline{187}\cr
& Appendix B.
The proof of the diagram formula for Wiener--It\^o integrals.
\dotfill &\vskip5pt \rightline{189}\cr
&Appendix C.
The proof of some results about Wiener--It\^o integrals.
\dotfill &\vskip5pt \rightline{197}\cr
& Appendix D.
The proof of Theorem 14.3. (A result about the comparison of
$U$-statistics and decoupled $U$-statistics.)
\dotfill &\vskip5pt \rightline{205}\cr
\noalign{\medskip}
&References. \dotfill & \rightline{215}\cr
&Contents. \dotfill & \rightline{218}\cr}}
$$
\bye__