\documentclass[graybox,envcountchap,sectrefs]{svmono}
\usepackage{amssymb,amsmath}
\usepackage{amsfonts}
%\usepackage{mathptmx}
\usepackage{helvet}
\usepackage{courier}
\usepackage{type1cm}
\usepackage{makeidx}
\usepackage{multicol}
\makeindex
\font\script=cmcsc10
\makeatletter
\renewcommand{\theenumi}{\alph{enumi}}
\renewcommand{\labelenumi}{\theenumi)}
\makeatother
\begin{document}
\author{P\'eter Major}
\title{On the estimation of multiple random
integrals and $U$-statistics}
\subtitle{-- Lecture Note --}
%\subtitle{-- Monograph --}
\maketitle
\frontmatter
\tableofcontents
\mainmatter
\chapter{Introduction}
First I briefly describe the main subject of this work.
Fix a positive integer $n$, consider $n$ independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ on a measurable
space $(X,{\cal X})$ with some distribution $\mu$ and take their
empirical distribution $\mu_n$ together with its normalization
$\sqrt n(\mu_n-\mu)$. Besides this, take a function $f(x_1,\dots,x_k)$
of $k$ variables on the $k$-fold product $(X^k,{\cal X}^k)$ of the
space $(X,{\cal X})$, introduce the $k$-th power of the normalized
empirical measure $\sqrt n(\mu_n-\mu)$ on $(X^k,{\cal X}^k)$ and
define the integral of the function $f$ with respect to this
signed product measure. This integral is a random variable, and we
want to give a good estimate on its tail distribution. More precisely,
we do not take the integrals on the whole space: the diagonals
$x_s=x_{s'}$, $1\le s,s'\le k$, $s\neq s'$, of the space $X^k$ are
omitted from the domain of integration. Such a modification of the
integral seems natural.

We shall also be interested in the following generalized version of
the above problem. Let us have a nice class of functions ${\cal F}$
of $k$ variables on the product space $(X^k,{\cal X}^k)$, and consider
the integrals of all functions in this class with respect to the
$k$-fold direct product of our normalized empirical measure. Give a
good estimate on the tail distribution of the supremum of these
integrals.

It may be asked why the above problems deserve a closer study. I
found them important because they may help in solving some essential
problems in probability theory and mathematical statistics. I met
such problems when I tried to adapt the method of proof about the
Gaussian limit behaviour of the maximum likelihood estimate to some
similar but more difficult questions. In the original problem the
asymptotic behaviour of the solution of the so-called maximum
likelihood equation has to be investigated. The study of this
problem is hard in its original form. But by applying an appropriate
Taylor expansion of the function that appears in this equation and
throwing away its higher order terms we get an approximation whose
behaviour can be simply understood. So to describe the limit
behaviour of the maximum likelihood estimate it suffices to show
that this approximation causes only a negligible error.

One would try to apply a similar method in the study of more
difficult questions. I met some non-parametric maximum likelihood
problems, for instance the description of the limit behaviour of
the so-called Kaplan--Meier product limit estimate when such an
approach could be applied. But in these problems it was harder
to show that the simplifying approximation causes only a
negligible error. In this case the solution of the above
mentioned problems was needed. In the non-parametric maximum
likelihood estimate problems I met, the estimation of multiple
(random) integrals played a role similar to the estimation of
the coefficients in the Taylor expansion in the study of maximum
likelihood estimates. Although I could apply this approach only
in some special cases, I believe that it works in very general
situations. But it demands some further work to show this.

The problems about random integrals formulated above are interesting
and non-trivial even in the special case $k=1$. Their solution
leads to an interesting and non-trivial generalization
of the fundamental theorem of mathematical statistics about
the difference between the empirical and the true distribution of a
large sample.

These problems have a natural counterpart about the behaviour of
so-called $U$-statistics, a fairly popular subject in probability
theory. The investigations of multiple random integrals and of
$U$-statistics are closely related, and it turned out that it is
useful to consider them simultaneously.

Let us try to get some feeling about what kind of results can be
expected in these problems. For a large sample size $n$ the
normalized empirical measure $\sqrt n(\mu_n-\mu)$ behaves similarly
to a Gaussian random measure. This suggests that results similar
to those about multiple Gaussian integrals, called Wiener--It\^o
integrals in the literature, should hold in the problems we are
interested in. We may expect that the tail behaviour of the
distribution of a $k$-fold random integral with respect to a
normalized empirical measure is similar to that of the $k$-th
power of a Gaussian random variable with expectation zero and an
appropriate variance. Besides this, if we consider the supremum
of multiple random integrals of a class of functions with
respect to a normalized empirical measure or with respect to a
Gaussian random measure, then we expect that under not too
restrictive conditions this supremum is not much larger than
the `worst' random integral with the largest variance taking
part in this supremum. We may also hope that the methods of the
theory of multiple Gaussian integrals can be adapted to the
investigation of our problems.

The heuristic considerations presented above supply a fairly good
description of the situation, but they do not take into account a
very essential difference between the behaviour of multiple
Gaussian integrals and multiple integrals with respect to a
normalized empirical measure. If the variance of a multiple
integral with respect to a normalized empirical measure is very
small, which turns out to be equivalent to a very small $L_2$-norm
of the function we are integrating, then the behaviour of this
integral is different from that of a multiple Gaussian integral
with the same kernel function. In this case the effect of some
irregularities of the normalized empirical distribution turns
out to be non-negligible, and no good Gaussian approximation
holds any longer. This case must be better understood, and some
new methods have to be worked out to handle it.

The precise formulation of the results will be given in the
main part of the work. Besides their proofs I also tried to explain
the main ideas behind them and the notions introduced in their
investigation. This work contains some new results, and the proofs
of some by now classical theorems are also presented.

The results about Gaussian random variables and their non-linear
functionals, in particular multiple integrals with respect to a
Gaussian field, play a most important role in the
present work. Hence they will be discussed in detail together
with some of their counterparts about multiple random integrals
with respect to a normalized empirical measure and some results
about $U$-statistics.

The proofs apply results from different parts of probability
theory. Papers investigating similar results refer to works dealing
with quite different subjects, and this makes their reading rather
hard. To overcome this difficulty I tried to work out the details
and to present a self-contained discussion even at the price of a
longer text. Thus I wrote down (in the main text or in the Appendix)
the proof of many interesting and basic results, like results about
Vapnik--\v{C}ervonenkis classes, about $U$-statistics and their
decomposition into sums of so-called degenerate $U$-statistics, about
so-called decoupled $U$-statistics and their relation to ordinary
$U$-statistics, the diagram formula about the product of
Wiener--It\^o integrals, their counterpart about the product of
degenerate $U$-statistics, etc. I tried to give such an exposition
where different parts of the problem are explained independently of
each other, and they can be understood in themselves. I wrote
about the history of the problems discussed in this work and their
relation to some other questions in the last section of this Lecture
Note before the Appendix, in Section~18.

An earlier version of this work was explained at the probability
seminar of the University of Debrecen (Hungary).
\chapter{Motivation of the investigation. Discussion of
some problems}
In this section I try to show by means of some examples why the
solution of the problems mentioned in the introduction may be
useful in the study of some important problems of probability
theory. I try to give a good picture of the main ideas, but I
do not work out all details. Actually, the elaboration of some
details omitted from this discussion would demand hard work.
But as the present section is quite independent of the rest of
the paper, these omissions cause no problem in understanding
the subsequent part.

I start with a short discussion of the maximum likelihood
estimate in the simplest case. The following problem is considered.
Let us have a class of density functions $f(x,\vartheta)$ on the
real line depending on a parameter $\vartheta\in R^1$, and
observe a sequence of independent random variables
$\xi_1(\omega),\dots,\xi_n(\omega)$ with a density function
$f(x,\vartheta_0)$, where $\vartheta_0$ is an unknown parameter
we want to estimate with the help of the above sequence of random
variables.

The maximum likelihood method suggests the following approach. Choose
that value $\hat\vartheta_n =\hat\vartheta_n(\xi_1,\dots,\xi_n)$ as
the estimate of the parameter $\vartheta_0$ where the density function
of the random vector $(\xi_1,\dots,\xi_n)$, i.e.\ the product
$$
\prod_{k=1}^n f(\xi_k,\vartheta)=\exp\left\{\sum_{k=1}^n\log
f(\xi_k,\vartheta)\right\}
$$
takes its maximum. This point can be found as the solution of the
so-called maximum likelihood equation\index{maximum likelihood equation}
\begin{equation}
\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\vartheta)=0. \label{(2.1)}
\end{equation}
We are interested in the asymptotic behaviour of the random variable
$\hat\vartheta_n-\vartheta_0$, where $\hat\vartheta_n$ is the
(appropriate) solution of the equation~(\ref{(2.1)}).
The direct study of this equation is rather hard, but a Taylor
expansion of the expression at the left-hand side of~(\ref{(2.1)})
around the (unknown) point $\vartheta_0$ yields a good and simple
approximation of $\hat\vartheta_n$, and it enables us to describe
the asymptotic behaviour of $\hat\vartheta_n-\vartheta_0$.
This Taylor expansion yields that
\begin{eqnarray}
&&\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\hat\vartheta_n)=
\sum_{k=1}^n\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)} \nonumber \\
&&+(\hat\vartheta_n-\vartheta_0)
\left(\sum_{k=1}^n\left(\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{\left(\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)\right)^2}
{f^2(\xi_k,\vartheta_0)} \right)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right) \nonumber \\
&&= \sum_{k=1}^n
\left(\eta_k+\zeta_k(\hat\vartheta_n-\vartheta_0)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right),
\label{(2.2)}
\end{eqnarray}
where
$$
\eta_k=\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}\quad \textrm{and}\quad
\zeta_k=
\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{ \left( \frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)\right)^2}{f^2(\xi_k,\vartheta_0)}
$$
for $k=1,\dots,n$. We want to understand the asymptotic behaviour
of the (random) expression on the right-hand side of~(\ref{(2.2)}).
The relation
$$
E\eta_k=\int\frac{\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)}{f(x,\vartheta_0)}f(x,\vartheta_0)\,dx
=\frac{\partial}{\partial\vartheta}\int
f(x,\vartheta_0)\,dx=0
$$
holds, since $\int f(x,\vartheta)\,dx=1$ for all $\vartheta$,
and a differentiation of this relation gives the last identity.
Similarly,
$E\eta^2_k=-E\zeta_k
=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$, \
$k=1,\dots,n$. Hence by the central limit theorem
$\chi_n=\frac{1}{\sqrt n}\sum\limits_{k=1}^n\eta_k$
is asymptotically normal with expectation zero and variance
$I^2=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$.
In the statistics literature this number $I$ is called the Fisher
information. By the law of large numbers
$\frac{1}{n}\sum\limits_{k=1}^n\zeta_k\sim -I^2$.
Thus relation (\ref{(2.2)}) suggests the approximation of the
maximum-likelihood estimate $\hat\vartheta_n$ by the random variable
$\tilde\vartheta_n$ given by the identity $\tilde\vartheta_n-\vartheta_0=
-\frac{\sum\limits_{k=1}^n\eta_k}{\sum\limits_{k=1}^n\zeta_k}$, and
the previous calculations imply that
$\sqrt n(\tilde\vartheta_n-\vartheta_0)$
is asymptotically normal with
expectation zero and variance~$\frac1{I^2}$. The random variable
$\tilde\vartheta_n$ is not a solution of the equation (\ref{(2.1)});
the value of the expression at the left-hand side is of order
$O(n(\tilde\vartheta_n-\vartheta_0)^2)=O(1)$ at this point. On
the other hand, some calculations show that the derivative of the
function at the left-hand side is large at this point: it is greater
than $\textrm{const.}\, n$ with some $\textrm{const.}>0$. This implies
that the maximum-likelihood equation has a solution
$\hat\vartheta_n$ such that
$\hat\vartheta_n-\tilde\vartheta_n=O\left(\frac1n\right)$. Hence
$\sqrt n(\hat\vartheta_n-\vartheta_0)$ and
$\sqrt n(\tilde\vartheta_n-\vartheta_0)$ have the same asymptotic
limit behaviour.
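These asymptotics are easy to check by simulation. The following Python sketch (an illustration added here, with names of my own choosing; it is not part of the mathematical text) uses the normal location family $f(x,\vartheta)=\varphi(x-\vartheta)$, for which $\eta_k=\xi_k-\vartheta_0$, $\zeta_k=-1$ and $I^2=1$, so $\sqrt n(\tilde\vartheta_n-\vartheta_0)$ should be approximately standard normal.

```python
import random
import statistics

def one_step_estimate(sample, theta0):
    # Normal location family f(x, theta) = phi(x - theta):
    # eta_k = xi_k - theta0 and zeta_k = -1, so
    # tilde_theta - theta0 = -sum(eta_k) / sum(zeta_k).
    eta = [x - theta0 for x in sample]
    zeta = [-1.0] * len(sample)
    return theta0 - sum(eta) / sum(zeta)

random.seed(0)
n, theta0, trials = 400, 1.0, 2000
normalized = []
for _ in range(trials):
    sample = [random.gauss(theta0, 1.0) for _ in range(n)]
    normalized.append(n ** 0.5 * (one_step_estimate(sample, theta0) - theta0))

# sqrt(n)(tilde_theta_n - theta0) should be roughly N(0, 1/I^2) with I^2 = 1
print(round(statistics.mean(normalized), 2), round(statistics.stdev(normalized), 2))
```

The printed mean and standard deviation should be close to $0$ and $1/I=1$, in agreement with the discussion above.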

The previous method can be summarized in the following way:
Take a simpler linearized version of the expression we want to
estimate by means of an appropriate Taylor expansion, describe the
limit distribution of this linearized version and show that the
linearization causes only a negligible error.

We want to show that such a method also works in more difficult
situations. But in some cases it is harder to show that the error
committed by a replacement of the original expression by a simpler
linearized version is negligible, and to show this the solution of
the problems mentioned in the introduction is needed. The discussion
of the following problem, called the Kaplan--Meier method for the
estimation of the distribution function with the help of censored
data, provides such an example.

The following problem is considered. Let $(X_i,Z_i)$, $i=1,\dots,n$,
be a sequence of independent, identically distributed random vectors
such that the components $X_i$ and $Z_i$ are also independent with
some unknown, continuous distribution functions $F(x)$ and $G(x)$.
We want to estimate the distribution function $F$ of the random
variables $X_i$, but we cannot observe the variables $X_i$, only
the random variables $Y_i=\min(X_i,Z_i)$ and
$\delta_i=I(X_i\leq Z_i)$. In other words, we want to solve the
following problem. There are certain objects whose lifetimes $X_i$
are independent and $F$-distributed. But we cannot observe the
lifetime $X_i$, because after a time $Z_i$ the observation must
be stopped. We also know whether the real lifetime $X_i$ or the
censoring variable $Z_i$ was observed. We make $n$ independent
experiments and want to estimate with their help the distribution
function~$F$.

Kaplan and Meier, on the basis of some maximum-likelihood estimation
type considerations, proposed the following so-called product limit
estimator\index{product limit estimator (Kaplan--Meier method)}
$S_n(u)$ to estimate the unknown survival function
$S(u)=1-F(u)$:
\begin{equation}
1-F_n(u)=S_n(u)=\left\{
\begin{array}{l}
\prod\limits_{i=1}^n\left(\frac{N(Y_i)}{N(Y_i)+1}\right)^{I(Y_i\leq u,
\delta_i=1)} \textrm{ if }u\leq\max(Y_1,\dots,Y_n)\\
0 \textrm{ if } u\geq\max(Y_1,\dots,Y_n),\textrm{ and } \delta_n =1,\\
\textrm{undefined if }u\geq\max(Y_1,\dots,Y_n),\textrm{ and } \delta_n=0,
\end{array}
\right.
\label{(2.3)}
\end{equation}
where
$$
N(t)=\#\{Y_i,\;\;Y_i>t,\;1\le i \le n\}=\sum_{i=1}^n I(Y_i>t).
$$
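Formula (\ref{(2.3)}) is straightforward to compute. The following Python sketch (my own illustration; the distributions, parameter values and function name are chosen only for this example) simulates censored exponential lifetimes and compares $S_n(u)$ with the true survival function $1-F(u)$.

```python
import math
import random

def kaplan_meier(y, delta, u):
    """Product-limit estimate S_n(u) following formula (2.3),
    with N(t) = #{i : Y_i > t}; O(n^2), for illustration only."""
    s = 1.0
    for yi, di in zip(y, delta):
        if di == 1 and yi <= u:
            big_n = sum(1 for yj in y if yj > yi)   # N(Y_i)
            s *= big_n / (big_n + 1.0)
    return s

random.seed(1)
n, lam, mu = 2000, 1.0, 0.5     # lifetimes X ~ Exp(lam), censoring Z ~ Exp(mu)
x = [random.expovariate(lam) for _ in range(n)]
z = [random.expovariate(mu) for _ in range(n)]
y = [min(a, b) for a, b in zip(x, z)]
delta = [1 if a <= b else 0 for a, b in zip(x, z)]

u = 0.7
estimate = kaplan_meier(y, delta, u)    # S_n(u)
truth = math.exp(-lam * u)              # true survival function 1 - F(u)
print(round(estimate, 3), round(truth, 3))
```

The two printed values should be close, which is exactly the statement we now set out to prove.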

We want to show that the above estimate (\ref{(2.3)}) is really good.
For this goal we shall approximate the random variables $S_n(u)$ by
some appropriate random variables. To do this, first we introduce
some notation.

Put
\begin{eqnarray}
H(u) &=&P(Y_i\leq u)=1-\bar H(u), \nonumber \\
\tilde H(u) &=&P(Y_i\leq u,\,\delta_i=1),\quad
\tilde{\tilde H}(u)=P(Y_i\leq u,\,\delta_i =0)
\label{(2.4)}
\end{eqnarray}
and
\begin{eqnarray}
H_n(u) &=&\frac{1}{n} \sum_{i=1}^n I( Y_i \leq u)\label{(2.5)} \\
\tilde H_n(u) &=&\frac1n \sum_{i=1}^n I(Y_i\leq u,\, \delta_i
=1), \quad \tilde{\tilde H}_n(u)=\frac{1}{n}\sum_{i=1}^n I( Y_i
\leq u, \, \delta_i=0). \nonumber
\end{eqnarray}
Clearly $H(u)=\tilde H(u)+\tilde{\tilde H}(u)$ and
$ H_n(u)=\tilde H_n(u)+\tilde{\tilde H}_n(u)$.
We shall estimate $F_n(u)-F(u)$ for $u\in(-\infty, T]$ if
\begin{equation}
1-H(T)>\delta \quad \textrm{with some fixed } \delta>0.
\label{(2.6)}
\end{equation}

Condition (\ref{(2.6)}) implies that there are more than
$\frac\delta2n$
sample points $Y_j$ larger than~$T$ with probability almost 1. The
complementary event has only an exponentially small probability.
This observation helps to show in the subsequent calculations that
some events have negligibly small probability.
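The exponential smallness claimed here is a plain binomial tail bound: the number of sample points $Y_j$ above $T$ is binomially distributed with success probability $1-H(T)>\delta$. The following Python sketch (with illustrative values of $n$ and $\delta$ of my own choosing) compares the exact lower tail with a Hoeffding-type bound.

```python
import math

def binom_lower_tail(n, p, k):
    """Exact P(Bin(n, p) <= k)."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k + 1))

delta = 0.4             # assumed lower bound on 1 - H(T)
n = 200                 # sample size (illustrative)
k = int(n * delta / 2)  # event: at most (delta/2) n points above T
exact = binom_lower_tail(n, delta, k)               # worst case p = delta
hoeffding = math.exp(-2.0 * n * (delta - delta / 2) ** 2)
print(exact < hoeffding, exact)
```

Already at $n=200$ the exceptional probability is below $10^{-6}$, and it decreases exponentially fast in~$n$.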

We introduce the so-called cumulative hazard function and its
empirical version
\begin{equation}
\Lambda(u)=-\log(1-F(u)), \quad \Lambda_n(u)=-\log(1-F_n(u)).
\label{(2.7)}
\end{equation}
Since $F_n(u)-F(u)=\exp(-\Lambda(u))
\left(1-\exp(\Lambda(u)-\Lambda_n(u))\right)$,
a simple Taylor expansion yields
\begin{equation}
F_n(u)-F(u)=(1-F(u))\left(\Lambda_n(u)-\Lambda(u)\right)+R_1(u),
\label{(2.8)}
\end{equation}
and it is easy to see that
$R_1(u)=O\left((\Lambda(u)-\Lambda_n(u))^2\right)$.
It follows from the subsequent estimations that
$\Lambda(u)-\Lambda_n(u)=O(n^{-1/2})$, thus $nR_1(u)=O(1)$. Hence it
is enough to investigate the term $\Lambda_n(u)$. We shall show that
$\Lambda_n(u)$ has an expansion with $\Lambda(u)$ as the main term
plus $n^{-1/2}$ times a term which is a linear functional of an
appropriate normalized empirical distribution function plus an error
term of order $O(n^{-1})$.

From~(\ref{(2.3)}) it is obvious that
$$
\Lambda_n(u)=-\sum_{i=1}^n I(Y_i\leq u, \, \delta_i=1)
\log\left(1-\frac{1}{1+N(Y_i)}\right).
$$
It is not difficult to get rid of the unpleasant logarithmic function
in this formula by means of the relation $-\log(1-x)=x+O(x^2)$ for
small~$x$. This yields that
\begin{equation}
\Lambda_n(u)=\sum_{i=1}^n \frac{I(Y_i\leq u, \,\delta_i=1)}{N(Y_i)}
+R_2(u)=\tilde{\Lambda}_n(u)+R_2(u) \label{(2.9)}
\end{equation}
with an error term $R_2(u)$ such that $nR_2(u)$ is smaller than a
constant with probability almost one. (The probability of the
exceptional set is exponentially small.)

The expression $\tilde{\Lambda}_n(u)$ is still inappropriate for our
purposes. Since the denominators $N(Y_i)=\sum\limits_{j=1}^n I(Y_j>Y_i)$
are dependent for different indices~$i$ we cannot see directly the
limit behaviour of $\tilde{\Lambda}_n(u)$.

We try to approximate $\tilde{\Lambda}_n(u)$ by a simpler
expression. A natural approach would be to approximate the terms
$N(Y_i)$ in it by their conditional expectation $(n-1)\bar
H(Y_i)=(n-1)(1-H(Y_i))=E(N(Y_i)|Y_i)$ with respect to the
$\sigma$-algebra generated by the random variable~$Y_i$. This
`first order' approximation is too rough, but the following `second
order' approximation will be sufficient for our goals. Put
$$
N(Y_i)=\sum_{j=1}^n I(Y_j>Y_i)=n\bar H(Y_i) \left(1+
\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}\right)
$$
and express the terms $\frac1{N(Y_i)}$ in the sum defining
$\tilde \Lambda_n$ (with $\tilde\Lambda_n$ introduced in~(\ref{(2.9)}))
by means of the relation
$\frac1{1+z}=\sum\limits_{k=0}^\infty (-1)^kz^k=1-z+\varepsilon(z)$
with the choice
$z=\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}$. As
$|\varepsilon(z)|<2z^2$ for $|z|<\frac{1}{2}$ we get that
\begin{eqnarray}
\tilde{\Lambda}_n(u)
&=&\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1+\sum_{k=1}^\infty\left(- \frac{\sum\limits_{j=1}^n
I(Y_j>Y_i)-n\bar H(Y_i)} {n\bar H(Y_i)}\right)^k\right)\nonumber \\
&=&\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1-\frac{\sum\limits_{j=1}^n I(Y_j>Y_i)-n\bar H(Y_i)}
{n\bar H(Y_i)}\right)+R_3(u) \nonumber \\
&=&2A(u)-B(u)+R_3(u), \label{(2.10)}
\end{eqnarray}
where
$$
A(u)=A(n,u)=\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}{n\bar H(Y_i)}
$$
and
$$
B(u)=B(n,u)=\sum_{i=1}^n \sum_{j=1}^n\frac
{I(Y_i\leq u,\,\delta_i=1)I(Y_j>Y_i)}{n^2\bar H^2(Y_i)}.
$$
It can be proved by means of standard methods that $nR_3(u)$ is
exponentially small. Thus relations~(\ref{(2.9)})
and~(\ref{(2.10)}) yield that
\begin{equation}
\Lambda_n(u)=2A(u)-B(u)+\textrm{negligible error.}
\label{(2.11)}
\end{equation}
This means that to solve our problem the asymptotic behaviour of the
random variables $A(u)$ and $B(u)$ has to be given. We can get a
better insight into this problem by rewriting the sum $A(u)$ as an
integral and the double sum $B(u)$ as a two-fold integral with
respect to empirical measures. Then these integrals can be rewritten
as sums of random integrals with respect to normalized empirical
measures and deterministic measures. Such an approach yields a
representation of $\Lambda_n(u)$ in the form of a sum whose terms
can be well understood.

Let us write
\begin{eqnarray*}
A(u)&=&\int_{-\infty}^{+\infty}\frac{I(y\leq u)}{1-H(y)}\,d\tilde
H_n(y),\\
B(u) &=&\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}
\frac{I(y\leq u)I(x>y)}{\left(1-H(y)\right)^2}\,dH_n(x) d\tilde H_n(y).
\end{eqnarray*}
We rewrite the terms $A(u)$ and $B(u)$ in a form better for our
purposes. We express these terms as a sum of integrals with respect
to $dH(u)$, $d\tilde H(u)$ and the normalized empirical processes
$d\sqrt n(H_n(x)-H(x))$ and $d\sqrt n(\tilde H_n(y)-\tilde H(y))$.
For this goal observe that
\begin{eqnarray*}
H_n(x)\tilde H_n(y)&&=H(x)\tilde H(y)+H(x)(\tilde H_n(y)-\tilde H(y))
+(H_n(x)-H(x))\tilde H(y)\\
&&\qquad+(H_n(x)-H(x))(\tilde H_n(y)-\tilde H(y)).
\end{eqnarray*}
Hence it can be written that
$B(u)=B_1(u)+B_2(u)+B_3(u)+B_4(u)$, where
\begin{eqnarray*}
B_1(u)&&=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\tilde H(y)\;,\\
B_2(u) &&=\frac{1}{\sqrt n}\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\left(\sqrt n
(\tilde H_n(y)-\tilde H(y))\right),\\
B_3(u)&&=\frac1{\sqrt n}\int_{-\infty}^u
\int_{-\infty}^{+\infty}\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\,d\tilde H(y)\;,\\
B_4(u)&&=\frac 1n \int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\, \\
&& \qquad \qquad\qquad\qquad
d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right).
\end{eqnarray*}

In the above decomposition of $B(u)$ the term $B_1$ is a
deterministic function, $B_2$, $B_3$ are linear functionals of
normalized empirical processes and $B_4$ is a nonlinear functional
of normalized empirical processes. The deterministic term $B_1(u)$
can be calculated explicitly. Indeed,
$$
B_1(u)=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x) d\tilde H(y)=
\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}.
$$
Then the relations
$\tilde H(u)=\int_{-\infty}^u\left(1-G(t)\right)\,dF(t)$ and
$1-H = (1-F)(1-G)$ imply that
\begin{equation}
B_1(u)=\int_{-\infty}^u\frac{dF(y)}{1-F(y)}=
-\log(1-F(u))=\Lambda(u). \label{(2.12)}
\end{equation}
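Identity (\ref{(2.12)}) can also be checked numerically for a concrete choice of the distributions. In the following Python sketch (my own illustration with parameters chosen for this example) $F$ is the uniform distribution on $[0,1]$ and $G$ is exponential with rate $\mu$, so that $1-H(y)=(1-y)e^{-\mu y}$, $d\tilde H(y)=e^{-\mu y}\,dy$ and $\Lambda(u)=-\log(1-u)$.

```python
import math

mu = 0.7       # assumed censoring rate: G = Exp(mu)
u = 0.5        # F is taken uniform on [0, 1]: F(t) = t

# B_1(u) = int_{-inf}^u d tilde H(y) / (1 - H(y)), where
# d tilde H(y) = (1 - G(y)) dF(y) = exp(-mu y) dy and
# 1 - H(y) = (1 - F(y))(1 - G(y)) = (1 - y) exp(-mu y).
steps = 20000
h = u / steps
b1 = 0.0
for i in range(steps):
    y = (i + 0.5) * h                      # midpoint rule
    d_tilde_h = math.exp(-mu * y) * h
    one_minus_h = (1.0 - y) * math.exp(-mu * y)
    b1 += d_tilde_h / one_minus_h

cumulative_hazard = -math.log(1.0 - u)     # Lambda(u) = -log(1 - F(u))
print(round(b1, 4), round(cumulative_hazard, 4))
```

The censoring rate $\mu$ cancels, as it must: $B_1(u)$ depends only on~$F$.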
Observe that
\begin{eqnarray}
A(u)&=&\int_{-\infty}^u\frac{d\,\tilde H_n(y)}{1-H(y)}\nonumber \\
&=&\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}+
\frac1{\sqrt n}\int_{-\infty}^u
\frac{d \left(\sqrt n (\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}
\nonumber \\
&=& B_1(u)+B_2(u). \label{(2.13)}
\end{eqnarray}
From relations~(\ref{(2.11)}), (\ref{(2.12)}) and~(\ref{(2.13)})
it follows that
\begin{equation}
\Lambda_n(u)-\Lambda(u)=B_2(u)-B_3(u)-B_4(u)+\textrm{negligible error.}
\label{(2.14)}
\end{equation}

Integration of $B_2$ and $B_3$ with respect to the variable $x$
and then integration by parts in the expression $B_2$ yields that
\begin{eqnarray*}
B_2(u)&=&\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}\\
&=&\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{\sqrt n(1-H(u))}-\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y),\\
B_3(u)&=&\frac1{\sqrt n}\int_{-\infty}^u
\frac{\sqrt n\left(H(y)-H_n(y)\right)}
{\left(1-H(y)\right)^2}\,d\tilde H(y).
\end{eqnarray*}
With the help of the above expressions for $B_2$ and $B_3$
(\ref{(2.14)}) can be rewritten as
\begin{eqnarray}
\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)
&=&\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{1-H(u)}-\int_{-\infty}^u
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y)\nonumber \\
&&\qquad+\int_{-\infty}^u\frac{\sqrt n\left(H_n(y)-H(y)\right)}
{\left(1-H(y)\right)^2} \,d\tilde H(y)\nonumber \\
&&\qquad-\sqrt n B_4(u)+\textrm{negligible error.}
\label{(2.15)}
\end{eqnarray}

Formula (\ref{(2.15)}) (together with formula~(\ref{(2.8)}))
almost agrees with the statement we wanted to prove. Here the
normalized error $\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)$
is expressed as a sum of linear functionals of normalized
empirical measures plus some negligible error terms plus the
error term $\sqrt nB_4(u)$. So to get a complete proof it
is enough to show that $\sqrt nB_4(u)$ also yields a negligible
error. But $nB_4(u)$ is a double integral of a bounded function (here
we apply again formula (\ref{(2.6)})) with respect to a normalized
empirical measure. Hence to bound this term we need a good estimate of
multiple stochastic integrals (with multiplicity~2), and this is just
the problem formulated in the introduction. The estimate we need
here follows from Theorem~8.1 of the present work. Let us remark
that the problem discussed here corresponds to the estimation of the
coefficient of the second term in the Taylor expansion considered
in the study of the maximum likelihood estimation. One may worry a
little bit about how to bound $nB_4(u)$ with the help of estimates of
double stochastic integrals, since in the definition of $B_4(u)$
integration is taken with respect to different normalized empirical
processes in the two coordinates.
But this problem can be overcome, e.g., by rewriting the integral
as a double integral with respect to the empirical process
$\left(\sqrt n\left(H_n(x)-H(x)\right),
\sqrt n\left(\tilde H_n(y)-\tilde H(y)\right)\right)$
in~$R^2$.

By working out the details of the above calculation we get that
the linear functional $B_2(u)-B_3(u)$ of normalized empirical
processes yields a good estimate on the expression
$\sqrt n(\Lambda_n(u)-\Lambda(u))$ for a fixed parameter~$u$.
But we want to prove somewhat more: we want to get an estimate
uniform in the parameter~$u$, i.e.\ to show that even the random
variable $\sup\limits_{u\le T}\left|
\sqrt n(\Lambda_n(u)-\Lambda(u))-B_2(u)+B_3(u)\right|$
is small. This can be done by making estimates uniform in the
parameter~$u$ in all steps of the above calculation. There appears
only one difficulty when trying to carry out this program. Namely,
we need an estimate on $\sup\limits_{u\le T} |nB_4(u)|$, i.e. we
have to bound the supremum of multiple random integrals with respect
to a normalized random measure for a nice class of kernel functions.
This can be done, but at this point the second problem mentioned
in the introduction appears. This difficulty can be overcome by
means of Theorem~8.2 of this work.

Thus we could describe the limit behaviour of the Kaplan--Meier
estimate by means of an appropriate expansion. But to do it we
needed the solution of the problems mentioned in the introduction.
It can be expected that such a method also works in a much more
general situation.

I finish this section with a remark Richard Gill made in a
personal conversation after my talk on this subject at a conference.
While he accepted my proof, he missed an argument in it about the
maximum likelihood character of the Kaplan--Meier estimate. This
was a completely justified remark, since if we do not restrict our
attention to this problem, but try to generalize it to general
non-parametric maximum likelihood estimates, then we have to
understand how the maximum likelihood character of the estimate
can be exploited. I believe that this can be done, but it demands
further studies.
\chapter{Some estimates about sums of independent random
variables}
We shall need a good bound on the tail distribution of sums of
independent random variables bounded by a constant with probability
one. Later only the results about sums of independent and identically
distributed variables will be interesting for us. But since they
can be generalized without any effort to sums of not
necessarily identically distributed random variables the condition
about identical distribution of the summands will be dropped.
We are interested in the question of when these
estimates give as good a bound as the central limit theorem
suggests, and what can be said otherwise.

More explicitly, the following problem will be considered: Let
$X_1,\dots,X_n$ be independent random variables, $EX_j=0$,
$\textrm{Var}\, X_j=\sigma_j^2$, $1\le j\le n$, and take the random sum
$S_n=\sum\limits_{j=1}^nX_j$ and its variance
$\textrm{Var}\, S_n=V_n^2=\sum\limits_{j=1}^n\sigma_j^2$.
We want to get a good
bound on the probability $P(S_n>u V_n)$. The central limit theorem
suggests that under general conditions an upper bound of the
order $1-\Phi(u)$ should hold for this probability, where $\Phi(u)$
denotes the standard normal distribution function. Since the
standard normal distribution function satisfies the inequality
$\left(\frac1u-\frac1{u^3}\right)
\frac{e^{-u^2/2}}{\sqrt{2\pi}} <1-\Phi(u)<
\frac1u\frac{e^{-u^2/2}}{\sqrt{2\pi}}$ for all $u>0$ it is natural
to ask when the probability $P(S_n >uV_n)$ is comparable with the
value $e^{-u^2/2}$. More generally, we shall call an upper bound of
the form $P(S_n>uV_n)\le e^{-Cu^2}$ with some constant $C>0$ a
Gaussian type estimate.
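The quoted two-sided bound on $1-\Phi(u)$ is easy to confirm numerically. The following small Python check (added for illustration) uses the identity $1-\Phi(u)=\frac12\,\mathrm{erfc}(u/\sqrt2)$.

```python
import math

def normal_tail(u):
    """1 - Phi(u), computed via the complementary error function."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

for u in (0.5, 1.0, 2.0, 4.0):
    phi_u = math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    lower = (1.0 / u - 1.0 / u ** 3) * phi_u   # lower bound on 1 - Phi(u)
    upper = phi_u / u                          # upper bound on 1 - Phi(u)
    assert lower < normal_tail(u) < upper
print("bounds hold for all tested u")
```

For large $u$ the two bounds are close, so $1-\Phi(u)$ is indeed comparable with $e^{-u^2/2}$.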

First I formulate Bernstein's inequality, which tells for which
values of $u$ the probability $P(S_n>uV_n)$ has a Gaussian type
estimate. It supplies such an estimate if
$u\le \textrm{const.}\, V_n$. On
the other hand, for $u\ge\textrm{const.}\, V_n$ it yields a much
weaker estimate. I shall formulate another result, called Bennett's
inequality, which is a slight improvement of Bernstein's inequality.
It helps us to tell what can be expected if Bernstein's
inequality does not provide a Gaussian type estimate. I shall also
present an example which shows that Bennett's inequality is in some
sense sharp. The main difficulties we meet in this work are closely
related to the weakness of the estimates we have for the probability
$P(S_n>uV_n)$ if it does not satisfy a Gaussian type estimate. As we
shall see this happens if $u\gg \textrm{const.}\, V_n$.

In the usual formulation of Bernstein's inequality a real
number~$M$ is introduced, and it is assumed that the terms in
the sum we investigate are bounded by this number. But since the
problem can be simply reduced to the case $M=1$ I shall consider
only this special case.
\medskip\noindent
{\bf Theorem 3.1 (Bernstein's
inequality).}\index{Bernstein's inequality} {\it Let
$X_1,\dots,X_n$ be independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\textrm{\rm Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P\left(S_n>uV_n\right)\le\exp\left\{-\frac{u^2}{2\left(1+\frac13
\frac u{V_n}\right)} \right\} \quad\textrm{for all }u>0.
\label{(3.1)}
\end{equation}
}
\medskip\noindent
{\it Proof of Theorem 3.1.} Let us give a good bound on the
exponential moments $Ee^{tS_n}$ for appropriate parameters
$t>0$. Since $EX_j=0$ and $E|X_j|^{k+2}\le\sigma^2_j$ for $k\ge0$ we can
write $Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k
\le 1+\frac{t^2\sigma_j^2}2\left(1+\sum\limits_{k=1}^\infty
\frac {2t^{k}}{(k+2)!}\right) \le 1+\frac{t^2\sigma_j^2}2
\left(1+\sum\limits_{k=1}^\infty 3^{-k}t^{k}\right)
= 1+\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}
\le\exp\left\{\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}\right\}$
if $0\le t<3$. Hence $Ee^{tS_n}=\prod\limits_{j=1}^n Ee^{tX_j}\le
\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3}\right\}$ for $0\le t<3$.
The above relation implies that
\begin{eqnarray*}
P\left(S_n>uV_n\right)=P(e^{tS_n}>e^{tuV_n})&\le&
Ee^{tS_n}e^{-tuV_n} \\
&\le& \exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac
t3}-tuV_n\right\}
\end{eqnarray*}
if $0\le t<3$. Choose the number $t$ in this inequality as the
solution of the equation $t^2V_n^2\frac1{1-\frac t3}=tuV_n$, i.e.
put $t=\frac u{V_n+\frac u3}$. Then $0\le t<3$, and we get that
$P(S_n>uV_n)\le e^{-tuV_n/2}=
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}\right\}$.
\medskip
If the random variables $X_1,\dots,X_n$ satisfy the conditions of
Bernstein's inequality, then also the random variables
$-X_1,\dots,-X_n$ satisfy them. By applying the above result in both
cases we get that
$P(|S_n|>uV_n)\le2
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}
\right\}$ under the conditions of Bernstein's inequality.
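Bernstein's inequality can also be illustrated by a small simulation.
The following Python sketch (the uniform summands, the sample size and
the threshold are my illustrative choices, not taken from the text)
compares the empirical tail probability of a normalized sum of bounded,
centered random variables with the bound of formula~(\ref{(3.1)}).

```python
import math
import random

# Monte Carlo sanity check of Bernstein's inequality (Theorem 3.1)
# for bounded, centered summands X_j = U_j - 1/2 with U_j uniform on [0,1].
# All numerical choices (n, u, trial count) are illustrative only.
random.seed(0)

n, trials = 200, 20000
sigma2 = 1.0 / 12.0            # Var(U - 1/2) for U uniform on [0, 1]
V_n = math.sqrt(n * sigma2)

u = 2.5
exceed = 0
for _ in range(trials):
    S_n = sum(random.random() - 0.5 for _ in range(n))
    if S_n > u * V_n:
        exceed += 1

empirical = exceed / trials
bernstein = math.exp(-u**2 / (2.0 * (1.0 + u / (3.0 * V_n))))
print(empirical, bernstein)
assert empirical <= bernstein
```

The empirical frequency stays well below the bound, as the central
limit theorem suggests for this moderate value of $u/V_n$.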
\medskip
By Bernstein's inequality for all $\varepsilon>0$ there is some
number $\alpha(\varepsilon)>0$ such that in the case
$\frac u{V_n}<\alpha(\varepsilon)$ the inequality
$P(S_n>uV_n)\le e^{-(1-\varepsilon)u^2/2}$ holds. Beside this, for all
fixed numbers $A>0$ there is some constant $C=C(A)>0$ such that in
the case $\frac u{V_n}\le A$ the inequality $P(S_n>uV_n)\le e^{-Cu^2}$
holds. This can be interpreted as a Gaussian type estimate for the
probability $P(S_n>uV_n)$ if $u\le \textrm{const.}\, V_n$.
On the other hand, if $\frac u{V_n}$ is very large, then Bernstein's
inequality yields a much worse estimate. The question arises whether
in this case Bernstein's inequality can be replaced by a better, more
useful result. Next we present Theorem~3.2, the so-called Bennett's
inequality which provides a slight improvement of Bernstein's
inequality. But if $\frac u{V_n}$ is very large, then also
Bennett's inequality provides a much weaker estimate on the
probability $P(S_n>uV_n)$ than the bound suggested by a Gaussian
comparison. On the other hand, we shall give an example that shows
that (without imposing some additional conditions) no real
improvement of this estimate is possible.
\medskip\noindent
{\bf Theorem 3.2 (Bennett's inequality).}\index{Bennett's inequality}
{\it Let $X_1,\dots,X_n$ be
independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\textrm{\rm Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P(S_n>u)\le\exp\left\{-V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac u{V^2_n}\right]\right\}
\quad\textrm{for all } u>0. \label{(3.2)}
\end{equation}
As a consequence, for all $\varepsilon>0$ there exists some
$B=B(\varepsilon)>0$ such
that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-(1-\varepsilon)u\log \frac u{V^2_n}
\right\}\quad\textrm{if } u>BV_n^2, \label{(3.3)}
\end{equation}
and there exists some positive constant $K>0$ such that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-Ku\log \frac u{V^2_n}
\right\}\quad\textrm{if }u>2V_n^2. \label{(3.4)}
\end{equation}
}
\medskip\noindent
{\it Proof of Theorem 3.2.}\/ We have
\begin{eqnarray*}
Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k\le
1+\sigma_j^2\sum\limits_{k=2}^\infty\frac {t^k}{k!}&&=1+\sigma_j^2
\left(e^t-1-t\right)\le e^{\sigma_j^2(e^t-1-t)}, \\
&& \qquad\quad 1\le j\le n,
\end{eqnarray*}
and $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$ for all $t\ge0$. Hence
$P(S_n>u)\le e^{-tu}Ee^{tS_n}\le e^{-tu+V_n^2(e^t-1-t)}$ for all
$t\ge0$. We get relation (\ref{(3.2)}) from this inequality
with the choice $t=\log\left(1+\frac u{V^2_n}\right)$. (This is
the place of minimum of the
function $-tu+V_n^2(e^t-1-t)$ for fixed $u$ in the parameter~$t$.)
Relation (\ref{(3.2)}) and the observation
$\lim\limits_{v\to\infty}\frac{(v+1)\log(v+1)-v}{v\log v}=1$
with the choice $v=\frac u{V_n^2}$ imply formula~(\ref{(3.3)}).
Because of relation (\ref{(3.3)}) to prove formula (\ref{(3.4)})
it is enough to check it for $2\le\frac u{V_n^2}\le B$
with some sufficiently large constant $B>0$.
In this case relation (\ref{(3.4)}) follows directly from formula
(\ref{(3.2)}). This can be seen for instance by observing that
the expression $\frac{V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac
u{V^2_n}\right]}{u\log\frac u{V^2_n}}$ is a continuous and positive
function of the variable $\frac u{V_n^2}$ in the interval $2\le
\frac u{V_n^2}\le B$, hence its minimum in this interval is strictly
positive.
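The strict positivity of this ratio can also be checked numerically.
The following Python sketch (the grid step and the cut-off $B=100$ are
illustrative choices) evaluates the ratio on the interval in question.

```python
import math

# Numerical check of the last step in the proof of Theorem 3.2: the ratio
#   g(v) = ((1+v)log(1+v) - v) / (v log v),  v = u / V_n^2,
# is positive on [2, B]; its infimum there yields a valid constant K
# in formula (3.4).  The grid and the choice B = 100 are illustrative.
def g(v):
    return ((1 + v) * math.log(1 + v) - v) / (v * math.log(v))

values = [g(2 + k * 0.01) for k in range(9801)]   # v runs over [2, 100]
K = min(values)
print(K)
assert K > 0.5   # strictly positive, bounded away from zero
```

On this grid the minimum is about $0.7$, attained near $v\approx7$,
and the ratio tends to $1$ as $v\to\infty$, in accordance with
formula~(\ref{(3.3)}).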
\medskip
Let us make a short comparison between Bernstein's and Bennett's
inequalities. Both results yield an estimate on the probability
$P(S_n>u)$, and their proofs are very similar. They are based on
an estimate of the moment generating functions $R_j(t)=Ee^{tX_j}$
of the summands~$X_j$, but Bennett's inequality yields a better
estimate. It may be worth mentioning that the estimate given for
$R_j(t)=Ee^{tX_j}$ in the proof of
Bennett's inequality agrees with the moment generating function
$Ee^{t(Y_j-EY_j)}$ of the normalization $Y_j-EY_j$ of a Poissonian
random variable $Y_j$ with parameter $\textrm{Var}\, X_j$. As a
consequence,
we get, by using the standard method of estimating tail-distributions
by means of the moment generating functions such an estimate for the
probability $P(S_n>u)$ which is comparable with the probability
$P(T_n-ET_n>u)$, where $T_n$ is a Poissonian random variable with
parameter $V_n^2=\textrm{Var}\, S_n$. We can say that Bernstein's
inequality yields a Gaussian and Bennett's inequality a Poissonian
type estimate for the sums of independent, bounded random variables.
\medskip\noindent
{\it Remark.}\/ Bennett's inequality yields a sharper estimate for
the probability $P(S_n>u)$ than Bernstein's inequality for all
numbers $u>0$. To prove this it is enough to show that for all
$0\le t<3$ the inequality $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$
appearing in the proof of Bennett's inequality is a sharper
estimate than the corresponding inequality
$Ee^{tS_n}\le\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3} \right\}$
appearing in the proof of Bernstein's inequality. (Recall how we
estimated the probability $P(S_n>u)$ in these proofs with the help of
the exponential moment $Ee^{tS_n}$.) But to prove this
it is enough to check that $e^t-1-t\le \frac{t^2}2\frac1{1-\frac t3}$
for all $0\le t<3$. This inequality clearly holds, since
$e^t-1-t=\sum\limits_{k=2}^\infty\frac{t^k}{k!}$, and
$\frac{t^2}2\frac1{1-\frac t3}=\sum\limits_{k=2}^\infty
\frac12(\frac13)^{k-2}t^k$.
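The inequality $e^t-1-t\le \frac{t^2}2\frac1{1-\frac t3}$ proved above
by comparing the series term by term can also be confirmed numerically
on a grid of points (an illustrative check, not a substitute for the
series argument):

```python
import math

# Check of the inequality e^t - 1 - t <= (t^2/2) / (1 - t/3) on [0, 3),
# which makes the exponential-moment bound in the proof of Bennett's
# inequality sharper than that in the proof of Bernstein's inequality.
# The grid below is an illustrative choice.
for k in range(2991):
    t = k * 0.001          # t runs over [0, 2.99]
    lhs = math.exp(t) - 1.0 - t
    rhs = t * t / 2.0 / (1.0 - t / 3.0)
    assert lhs <= rhs + 1e-12
print("inequality verified on the grid")
```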
\medskip
Next we present Example~3.3 which shows that Bennett's inequality
yields a sharp estimate also in the case $u\gg V_n^2$ when
Bernstein's inequality yields a weak bound. But Bennett's inequality
provides only a small improvement which has only a limited
importance. This may be the reason why Bernstein's inequality
which yields a more transparent estimate is more popular.
\medskip\noindent
{\bf Example 3.3 (Sums of independent random variables with bad
tail distribution for large values).} {\it Let us fix some
positive integer $n$, real numbers $u$ and $\sigma^2$ such that
$0<\sigma^2\le\frac18$, $n>4u\ge6$ and $u>4n\sigma^2$. Let
$\bar\sigma^2$ be that solution of the equation $x^2-x+\sigma^2=0$
which is smaller than~$\frac12$. Take a sequence of independent
and identically distributed random variables
$\bar X_1,\dots,\bar X_n$ such that $P(\bar X_j=1)=\bar\sigma^2$,
$P(\bar X_j=0)=1-\bar\sigma^2$ for all $1\le j\le n$. Put
$X_j=\bar X_j-E\bar X_j=\bar X_j-\bar\sigma^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and $V_n^2=n\sigma^2$.
Then $P(|X_1|\le1)=1$, $EX_1=0$, $\textrm{\rm Var}\, X_1=\sigma^2$,
hence $ES_n=0$, and $\textrm{\rm Var}\, S_n=V_n^2$. Beside this
$$
P(S_n\ge u)>\exp\left\{-Bu\log \frac u{V^2_n}\right\}
$$
with some appropriate constant $B>0$ not depending on~$n$,
$\sigma$ and~$u$.}
\medskip\noindent
{\it Proof of Example 3.3.}\/ Simple calculation shows that $EX_j=0$,
$\textrm{Var}\, X_j=\bar\sigma^2-\bar\sigma^4=\sigma^2$,
$P(|X_j|\le1)=1$, and
also the inequality $\sigma^2\le\bar\sigma^2\le\frac32\sigma^2$ holds.
To see the upper bound in the last inequality observe that
$\bar\sigma^2=\frac{1-\sqrt{1-4\sigma^2}}2\le\frac{1-\sqrt{1/2}}2
<\frac13$ because of the condition $\sigma^2\le\frac18$, i.e.
$1-\bar\sigma^2\ge\frac23$, hence
$\sigma^2=\bar\sigma^2(1-\bar\sigma^2)\ge\frac23\bar\sigma^2$. In
the proof of the inequality of Example~3.3 we can restrict our
attention to the case when $u$ is an integer, because in the
general case we can apply the inequality with $\bar u=[u]+1$
instead of~$u$, where $[u]$ denotes the integer part of~$u$, and
since $u\le\bar u\le 2u$, the application of the result in this
case supplies the desired inequality with a possibly worse
constant~$B>0$.
Put $\bar S_n=\sum\limits_{j=1}^n\bar X_j$. We can write
$P(S_n\ge u)=P(\bar S_n\ge u+n\bar\sigma^2)\ge P(\bar S_n\ge2u)
\ge P(\bar S_n=2u)={n\choose{2u}}
\bar\sigma^{4u}(1-\bar\sigma^2)^{(n-2u)}
\ge(\frac {n\bar\sigma^2}{2u})^{2u}(1-\bar\sigma^2)^{(n-2u)}$,
since $u\ge n\bar\sigma^2$, and $n\ge2u$. On the other hand
$(1-\bar\sigma^2)^{(n-2u)}\ge e^{-2\bar\sigma^2(n-2u)}
\ge e^{-2n\bar\sigma^2}\ge e^{-u}$, hence
\begin{eqnarray*}
P(S_n\ge u)
&\ge&\exp\left\{-2u\log\left(\frac u{n\bar\sigma^2}\right)
-2u\log2-u\right\}\\
&=&\exp\left\{-2u\log\left(\frac u{n\sigma^2}\right)
-2u\log\frac{\bar\sigma^2}{\sigma^2}-2u\log2-u\right\}\\
&\ge&\exp\left\{-100u\log\left(\frac u{V_n^2}\right)\right\}.
\end{eqnarray*}
Example~3.3 is proved.
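The elementary bounds applied in this proof can be checked exactly for
particular parameter values. In the following Python sketch the
parameters $n$, $u$ and $\sigma^2$ are an illustrative choice
satisfying the conditions of Example~3.3.

```python
import math

# Exact check of the key lower bounds in the proof of Example 3.3 for one
# illustrative choice of parameters (n, u, sigma^2 satisfying n > 4u >= 6,
# u > 4 n sigma^2 and sigma^2 <= 1/8).
n, u, sigma2 = 100, 6, 0.01
assert n > 4 * u >= 6 and u > 4 * n * sigma2 and sigma2 <= 0.125

# bar_sigma^2: the root of x^2 - x + sigma^2 = 0 smaller than 1/2
bs2 = (1.0 - math.sqrt(1.0 - 4.0 * sigma2)) / 2.0
assert sigma2 <= bs2 <= 1.5 * sigma2

# P(bar_S_n = 2u) for a binomial(n, bar_sigma^2) random variable, and the
# elementary lower bound (n bs2 / 2u)^{2u} (1 - bs2)^{n-2u} used for it
point_mass = math.comb(n, 2 * u) * bs2**(2 * u) * (1.0 - bs2)**(n - 2 * u)
lower = (n * bs2 / (2 * u))**(2 * u) * (1.0 - bs2)**(n - 2 * u)
print(point_mass, lower)
assert point_mass >= lower
```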
\medskip
In the case $u>4V_n^2$ Bernstein's inequality yields the estimate
$P(S_n>u)\le e^{-\alpha u}$ with some universal constant $\alpha>0$,
and the above example shows that at most an additional logarithmic
factor $K\log\frac u{V_n^2}$ can be expected in the exponent of
the upper bound in an improvement of this estimate. Bennett's
inequality shows that such an improvement is really possible.
\medskip
I finish this section with another estimate due to Hoeffding
which will be later useful in some symmetrization arguments.
\medskip\noindent
{\bf Theorem 3.4 (Hoeffding's inequality).}\index{Hoeffding's
inequality} {\it Let $\varepsilon_1,\dots,\varepsilon_n$
be independent random variables,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, and let $a_1,\dots,a_n$ be arbitrary real numbers. Put
$V=\sum\limits_{j=1}^na_j\varepsilon_j$. Then
\begin{equation}
P(V>u)\le\exp\left\{-\frac{u^2}{2\sum_{j=1}^na_j^2 }\right\}\quad
\textrm{for all }u>0. \label{(3.5)}
\end{equation}
}
\medskip\noindent
{\it Remark 1:}\/ Clearly $EV=0$ and
$\textrm{Var}\, V=\sum\limits_{j=1}^n a_j^2$,
hence Hoeffding's inequality yields such an estimate for $P(V>u)$
which the central limit theorem suggests. This estimate holds for
all real numbers $a_1,\dots,a_n$ and $u>0$.
\medskip\noindent
{\it Remark 2:}\/ The Rademacher
functions\index{Rademacher functions} $r_k(x)$, $k=1,2,\dots$,
defined by the formulas $r_k(x)=1$ if $(2j-1)2^{-k}\le x<2j2^{-k}$
and $r_k(x)=-1$ if $2(j-1)2^{-k}\le x<(2j-1)2^{-k}$,
$1\le j\le 2^{k-1}$, for all $k=1,2,\dots$, can be considered as
random variables on the probability space $\Omega=[0,1]$ with the
Borel $\sigma$-algebra and the Lebesgue measure as probability
measure on the interval $[0,1]$. They are independent random
variables with the same distribution as the random variables
$\varepsilon_1,\dots,\varepsilon_n$ considered in Theorem~3.4.
Therefore results
about such sequences of random variables whose distributions agree
with those in~Theorem~3.4 are also sometimes called results about
Rademacher functions in the literature. At some points we will
also apply this terminology.
\medskip\noindent
{\it Proof of Theorem 3.4.} Let us give a good bound on the
exponential moment $Ee^{tV}$ for all $t>0$. The identity
$Ee^{tV}=\prod\limits_{j=1}^nEe^{ta_j\varepsilon_j}=
\prod\limits_{j=1}^n\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2$ holds,
and
$\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2=\sum\limits_{k=0}^\infty
\frac{a_{j}^{2k}} {(2k)!}t^{2k}\le \sum\limits_{k=0}^\infty \frac
{(a_jt)^{2k}}{2^{k}k!}=e^{a_j^2t^2/2}$, since $(2k)!\ge 2^k k!$
for all $k\ge0$. This implies that $Ee^{tV}\le
\exp\left\{\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$. Hence
$P(V>u)\le\exp\left\{-tu+\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$,
and we get relation (\ref{(3.5)}) with the choice
$t=u\left(\sum\limits_{j=1}^n a_j^2\right)^{-1}$.
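Since the sum $V$ takes only finitely many values, Hoeffding's
inequality can be checked exactly for small $n$ by enumerating all
sign patterns. The coefficients $a_j$ in the following Python sketch
are illustrative.

```python
import math
from itertools import product

# Exact check of Hoeffding's inequality (Theorem 3.4) by enumerating all
# 2^n equally likely sign patterns; the coefficients a_j are illustrative.
a = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1]
n = len(a)
sum_sq = sum(x * x for x in a)

u = 2.0
count = sum(1 for eps in product([-1, 1], repeat=n)
            if sum(x * e for x, e in zip(a, eps)) > u)
exact = count / 2**n        # exact tail probability P(V > u)
bound = math.exp(-u * u / (2.0 * sum_sq))
print(exact, bound)
assert exact <= bound
```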
\chapter{On the supremum of a nice class of partial sums}
This section contains an estimate about the supremum of a nice
class of normalized sums of independent and identically
distributed random variables together with an analogous result
about the supremum of an appropriate class of one-fold random
integrals with respect to a normalized empirical measure. The
second result deals with a one-variate version of the problem
about the estimation of multiple integrals with respect to a
normalized empirical measure. This problem was mentioned in
the introduction. Some natural questions related to these
results will be also discussed. It will be examined how
restrictive their conditions are. In particular, we are
interested in the question of how the condition about the
countable cardinality of the class of random variables can be
weakened. A natural Gaussian counterpart of the supremum
problems about random one-fold integrals will be also
considered. Most proofs will be postponed to later sections.
To formulate these results first a notion will be
introduced that plays a most important role in the sequel.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions.}\index{L${}_p$-dense,
(in particular $L_2$-dense classes) of functions}
{\it Let a measurable space $(Y,{\cal Y})$ be given together with
a class ${\cal G}$ of ${\cal Y}$ measurable real valued functions
on this space. The class of functions ${\cal G}$ is called an
$L_p$-dense class of functions, $1\le p<\infty$, with parameter~$D$
and exponent~$L$ if for all numbers $0<\varepsilon\le1$ and
probability measures $\nu$ on the space $(Y,{\cal Y})$ there
exists a finite $\varepsilon$-dense subset
${\cal G}_{\varepsilon,\nu}=\{g_1,\dots,g_m\}\subset {\cal G}$
in the space $L_p(Y,{\cal Y},\nu)$ with $m\le D\varepsilon^{-L}$
elements, i.e. there exists such a set ${\cal G}_{\varepsilon,\nu}
\subset {\cal G}$ with $m\le D\varepsilon^{-L}$ elements for which
$\inf\limits_{g_j\in{\cal G}_{\varepsilon,\nu}}\int |g-g_j|^p\,d\nu
<\varepsilon^p$
for all functions $g\in {\cal G}$. (Here the set
${\cal G}_{\varepsilon,\nu}$ may depend
on the measure $\nu$, but its cardinality is bounded by a number
depending only on $\varepsilon$.)}
\medskip
In most results of this work the above defined $L_p$-dense classes
will be considered only for the parameter $p=2$. But at some points
it will be useful to work also with $L_p$-dense classes with a
different parameter~$p$. Hence to avoid some repetitions I introduced
the above definition for a general parameter~$p$. When working with
$L_p$-dense classes we shall consider only such classes of functions
${\cal G}$ whose elements are functions with bounded absolute value.
Hence all integrals appearing in the definition of $L_p$-dense
classes of functions are finite.
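As a simple illustration of this notion (my own toy example, not taken
from the text) consider the class of indicator functions $I([0,t))$,
$t\in[0,1]$. For any probability measure~$\nu$ the squared $L_2(\nu)$
distance of two such indicators equals the $\nu$-measure of the
interval between their endpoints, hence an $\varepsilon$-dense subset
can be built from quantile points of~$\nu$ with at most
$\varepsilon^{-2}+2$ elements, i.e. this class is $L_2$-dense with
parameter $D=2$ and exponent $L=2$. A Python sketch of this
construction for a discrete measure:

```python
# Illustrative sketch (not from the text): the class G = { 1_[0,t) : t in
# [0,1] } is L_2-dense.  For a probability measure nu one has
# || 1_[0,s) - 1_[0,t) ||_{L_2(nu)}^2 = nu([s,t)), so an eps-net is obtained
# by cutting [0,1] at points where the accumulated nu-mass reaches eps^2.
import random

def quantile_net(points, weights, eps):
    """Cut points t_j such that the nu-mass strictly between consecutive
    cut points is smaller than eps^2."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    net, acc = [0.0], 0.0
    for i in order:
        acc += weights[i]
        if acc >= eps * eps:
            net.append(points[i])
            acc = 0.0
    net.append(1.0)
    return net

def dist2(s, t, points, weights):
    """Squared L_2(nu) distance between 1_[0,s) and 1_[0,t)."""
    lo, hi = min(s, t), max(s, t)
    return sum(w for p, w in zip(points, weights) if lo <= p < hi)

random.seed(1)
pts = [random.random() for _ in range(1000)]
wts = [1.0 / 1000.0] * 1000          # uniform discrete measure nu
eps = 0.2
net = quantile_net(pts, wts, eps)
assert len(net) <= 2 * eps**-2 + 2   # net of size const * eps^{-2}
for t in [random.random() for _ in range(50)]:
    assert min(dist2(t, s, pts, wts) for s in net) < eps * eps
print(len(net))
```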
The following estimate will be proved.
\medskip\noindent
{\bf Theorem 4.1 (Estimate on the supremum of a class of partial
sums).}\index{estimate on the supremum of a class of partial sums}
{\it Let us consider a sequence of independent and
identically distributed random variables $\xi_1,\dots,\xi_n$,
$n\ge2$, with values in a measurable space $(X,{\cal X})$ and with
some distribution~$\mu$. Beside this, let a countable and
$L_2$-dense class of functions ${\cal F}$ with some parameter $D\ge1$
and exponent $L\ge1$ be given on the space $(X,{\cal X)}$ which
satisfies the conditions
\begin{eqnarray}
\|f\|_\infty&=&\sup_{x\in X}|f(x)|\le 1, \qquad \textrm{for all }
f\in{\cal F} \label{(4.1)} \\
\|f\|_2^2&=&\int f^2(x) \mu(\,dx)\le \sigma^2 \qquad \textrm{for all }
f\in {\cal F} \label{(4.2)}
\end{eqnarray}
with some constant $0<\sigma\le1$, and
\begin{equation}
\int f(x)\mu(\,dx)=0 \quad \textrm{for all } f\in{\cal F}. \label{(4.3)}
\end{equation}
Define the normalized partial sums $S_n(f)=\frac1{\sqrt n}
\sum\limits_{k=1}^n f(\xi_k)$ for all $f\in {\cal F}$.
There exist some universal constants $C>0$, $\alpha>0$ and $M>0$
such that the supremum of the normalized random sums $S_n(f)$,
$f\in {\cal F}$, satisfies the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge u\right)
\le C\exp\left\{-\alpha\left(\frac u{\sigma}\right)^2\right\}
\quad \textrm{for those numbers } u \nonumber \\
&&\qquad \textrm{for which }\sqrt n\sigma^2\ge
u\ge M\sigma(L^{3/4}\log^{1/2}\tfrac2\sigma +(\log D)^{3/4}),
\label{(4.4)}
\end{eqnarray}
where the numbers~$D$ and $L$ in formula~(\ref{(4.4)}) agree with
the parameter and exponent of the $L_2$-dense class~${\cal F}$.}
\medskip\noindent
{\it Remark.}\/ Here and also in the subsequent part of this work
we consider random variables which take their values in a general
measurable space $(X,{\cal X})$. The only restriction we impose
on these spaces is that all sets consisting of one point are
measurable, i.e. $\{x\}\in{\cal X}$ for all $x\in X$.
\medskip
The condition $\sqrt n\sigma^2\ge u\ge
M\sigma(L^{3/4}\log^{1/2}\frac2\sigma +(\log D)^{3/4})$ about the
number~$u$ in formula~(\ref{(4.4)}) is natural. I discuss this after
the formulation of
Theorem~4.2 which can be considered as the Gaussian counterpart of
Theorem~4.1. I also formulate a result in Example~4.3 which can be
considered as part of this discussion.
\medskip
The condition about the countable cardinality of ${\cal F}$ can be
weakened with the help of the notion of countable approximability
introduced below. For the sake of later applications I define it
in a more general form than needed in this section. In the subsequent
part of this work I shall assume that the probability measure I work
with is complete, i.e. for all such pairs of sets~$A$ and~$B$ in the
probability space $(\Omega,{\cal A},P)$ for which $A\in{\cal A}$,
$P(A)=0$ and $B\subset A$ we have $B\in{\cal A}$ and $P(B)=0$.
\medskip\noindent
{\bf Definition of countably approximable classes of random
variables.} \index{countably approximable classes of random variables}
{\it Let us have a class of random variables $U(f)$,
$f\in {\cal F}$, indexed by a class of functions $f\in{\cal F}$
on a measurable space $(Y,{\cal Y})$. This class of random variables
is called countably approximable if there is a countable subset
${\cal F}'\subset {\cal F}$ such that for all numbers $u>0$ the sets
$A(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}'} |U(f)(\omega)|\ge u\}$
satisfy the identity $P(A(u)\setminus B(u))=0$.}
\medskip
Clearly, $B(u)\subset A(u)$. In the above definition it was demanded
that for all $u>0$ the set $B(u)$ should be almost as large as
$A(u)$. The following corollary of Theorem~4.1 holds.
\medskip\noindent
{\bf Corollary of Theorem~4.1.} {\it Let a class of functions
${\cal F}$ satisfy the conditions of Theorem~4.1 with the only
exception that instead of the condition about the countable
cardinality of ${\cal F}$ it is assumed that the class of random
variables $S_n(f)$, $f\in{\cal F}$, is countably approximable. Then
the random variables $S_n(f)$, $f\in{\cal F}$, satisfy
relation~(\ref{(4.4)}).}
\medskip
This corollary can be simply proved, only Theorem~4.1 has to be
applied for the class ${\cal F}'$. To do this it has to be checked
that if ${\cal F}$ is an $L_2$-dense class with some parameter $D$
and exponent $L$, and ${\cal F}'\subset {\cal F}$, then ${\cal F}'$ is
also an $L_2$-dense class with the same exponent $L$, only with a
possibly different parameter~$D'$.
To prove this statement let us choose for all numbers
$0<\varepsilon\le1$ and probability measures $\nu$ on
$(Y,{\cal Y})$ some functions
$f_1,\dots,f_m\in {\cal F}$ with
$m\le D\left(\frac\varepsilon2\right)^{-L}$ elements, such that
the sets ${\cal D}_j=\left\{f\colon\,\int |f-f_j|^2\,d\nu\le
\left(\frac\varepsilon2\right)^2\right\}$ satisfy the relation
$\bigcup\limits_{j=1}^m {\cal D}_j\supset{\cal F}$. For all sets
${\cal D}_j$ for which ${\cal D}_j\cap {\cal F}'$ is
non-empty choose a function $f'_j\in {\cal D}_j\cap {\cal F}'$. In
such a way we get a collection of functions $f'_j$ from the class
${\cal F}'$ containing at most $2^LD\varepsilon^{-L}$ elements
which satisfies
the condition imposed for $L_2$-dense classes with exponent $L$ and
parameter $2^LD$ for this number $\varepsilon$ and measure $\nu$.
\medskip
Next I formulate in Theorem~$4.1'$ a result about the supremum of
the integral of a class of functions with respect to a normalized
empirical distribution. It can be considered as a simple version
of Theorem~4.1. I formulated this result, because Theorems~4.1
and~$4.1'$ are special cases of their multivariate counterparts
about the supremum of so-called $U$-statistics and multiple
integrals with respect to a normalized empirical distribution
function discussed in Section~8. These results are also closely
related, but the explanation of their relation demands some work.
Given a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ taking values in $(X,{\cal X})$ let us introduce
their empirical distribution on $(X,{\cal X})$ as
\begin{equation}
\mu_n(A)(\omega)
=\frac1n \#\left\{j\colon\, 1\le j\le n,\; \xi_j(\omega)\in
A\right\}, \quad A\in {\cal X}, \label{(4.5)}
\end{equation}
and define for all measurable and $\mu$~integrable functions~$f$
the (random) integral
\begin{equation}
J_n(f)=J_{n,1}(f)=\sqrt n\int f(x)(\mu_n(\,dx)-\mu(\,dx)). \label{(4.6)}
\end{equation}
Clearly $J_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n (f(\xi_j)-Ef(\xi_j))
=S_n(\hat f)$ with $\hat f(x)=f(x)-\int f(x)\mu(\,dx)$. It is not
difficult to see that $\sup\limits_{x\in X}|\hat f(x)|\le2$ if
$\sup\limits_{x\in X}|f(x)|\le1$, $\int \hat f(x)\mu(\,dx)=0$,
$\int \hat f^2(x)\mu(\,dx)\le\int f^2(x)\mu(\,dx)$, and if
${\cal F}$ is an $L_2$-dense class of functions with parameter~$D$
and exponent~$L$, then the class of functions $\bar{\cal F}$
consisting of the functions
$\bar f(x)=\frac12\left(f(x)-\int f(x)\mu(\,dx)\right)$,
$f\in{\cal F}$, is an $L_2$-dense class of functions with parameter
$D$ and exponent $L$, since $\int(\bar f-\bar g)^2\,d\mu
\le\left(\frac\varepsilon2\right)^2\le\varepsilon^2$
if $f,g\in{\cal F}$, and $\int(f-g)^2\,d\mu\le\varepsilon^2$. Hence
Theorem~4.1 implies the following result.
\medskip\noindent
{\bf Theorem 4.1$'$ (Estimate on the supremum of random integrals
with respect to a normalized empirical distribution).}\index{estimate
on the supremum of random integrals with respect to a normalized
empirical distribution} {\it Let us
have a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge2$, with distribution~$\mu$ on
a measurable space $(X,{\cal X})$ together with some class of
functions ${\cal F}$ on this space which satisfies the
conditions of Theorem~4.1 with the possible exception of
condition~(\ref{(4.3)}). The estimate~(\ref{(4.4)}) remains valid
if the random sums $S_n(f)$ are replaced in it by the random
integrals $J_n(f)$ defined in~(\ref{(4.6)}). Moreover,
similarly to the corollary of Theorem~4.1, the condition about the
countable cardinality of the set ${\cal F}$ can be replaced by the
condition that the class of random variables $J_n(f)$,
$f\in{\cal F}$, is countably approximable.}
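The identity $J_n(f)=S_n(\hat f)$ underlying this reduction can be
illustrated numerically. In the following Python sketch the discrete
distribution~$\mu$ and the function~$f$ are illustrative choices.

```python
import math
import random

# Numerical illustration of the identity J_n(f) = S_n(f - int f dmu) used
# to derive Theorem 4.1' from Theorem 4.1, with an illustrative discrete
# distribution mu on X = {0, 1, 2} and the illustrative function f(x) = x^2.
random.seed(2)
support, probs = [0, 1, 2], [0.5, 0.3, 0.2]

def f(x):
    return float(x * x)

mean_f = sum(f(x) * p for x, p in zip(support, probs))   # int f dmu

n = 500
sample = random.choices(support, weights=probs, k=n)

# J_n(f) = sqrt(n) int f d(mu_n - mu), computed from the empirical measure
mu_n = {x: sample.count(x) / n for x in support}
J_n = math.sqrt(n) * sum(f(x) * (mu_n[x] - p) for x, p in zip(support, probs))

# S_n(f_hat) with f_hat = f - int f dmu
S_n_hat = sum(f(x) - mean_f for x in sample) / math.sqrt(n)

print(J_n, S_n_hat)
assert abs(J_n - S_n_hat) < 1e-9
```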
\medskip
All finite dimensional distributions of the set of random variables
$S_n(f)$, $f\in{\cal F}$, considered in Theorem~4.1 converge to those
of a Gaussian random field $Z(f)$, $f\in{\cal F}$, with expectation
$EZ(f)=0$ and correlation $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\cal F}$ as $n\to\infty$. Here, and in the subsequent part
of the paper a collection of random variables indexed by some set
of parameters will be called a Gaussian random field if for all
finite subsets of these parameters the random variables indexed by
this finite set are jointly Gaussian. We shall also define
so-called linear Gaussian random fields.\index{linear Gaussian random
fields} They consist of jointly Gaussian random variables $Z(f)$,
$f\in{\cal G}$, indexed by a linear space $f\in{\cal G}$ which
satisfy the relation $Z(af+bg)=aZ(f)+bZ(g)$ with probability~1 for
all real numbers $a$ and $b$ and $f,g\in{\cal G}$.
(Let us observe that a set of Gaussian random variables $Z(f)$,
indexed by a linear space $f\in{\cal G}$ such that $EZ(f)=0$, and
$EZ(f)Z(g)=\int f(x)g(x)\,\mu(\,dx)$ for all $f,g\in{\cal G}$ is a
linear Gaussian random field. This can be seen by checking the
identity $E[Z(af+bg)-(aZ(f)+bZ(g))]^2=0$ for all real numbers $a$
and $b$ and $f,g\in{\cal G}$ in this case.)
Let us consider a linear Gaussian random field $Z(f)$, $f\in{\cal G}$,
where the set of indices~${\cal G}={\cal G}_\mu$ consists of the
functions~$f$ square integrable with respect to a $\sigma$-finite
measure~$\mu$, and take an appropriate restriction of this field to
some parameter set ${\cal F}\subset {\cal G}$. In the next Theorem~4.2
we shall present a natural Gaussian counterpart of Theorem~4.1 by
means of an appropriate choice of~${\cal F}$. Let me also remark that
in Section~10 multiple Wiener--It\^o integrals of functions of
$k$~variables with respect to a white noise will be defined for all
$k\ge1$. In the special case $k=1$ the Wiener--It\^o integrals for
an appropriate class of functions $f\in{\cal F}$ yield a model for
which Theorem~4.2 is applicable. Before formulating this result let
us introduce the following definition which is a version of the
definition of $L_p$-dense classes of functions.
\medskip\noindent
{\bf Definition of
$L_p$-dense classes of functions with respect to a
measure~$\mu$.}\index{L${}_p$-dense classes of functions with
respect to a measure~$\mu$}
{\it Let a measurable space $(X,{\cal X})$ be given
together with a measure $\mu$ on the $\sigma$-algebra ${\cal X}$ and
a set ${\cal F}$ of ${\cal X}$ measurable real valued functions on
this space. The set of functions ${\cal F}$ is called an $L_p$-dense
class of functions, $1\le p<\infty$, with respect to the
measure~$\mu$ with parameter $D$ and exponent $L$ if for all
numbers $0<\varepsilon\le1$ there exists a finite $\varepsilon$-dense
subset ${\cal F}_\varepsilon=\{f_1,\dots,f_m\}\subset{\cal F}$
in the space
$L_p(X,{\cal X},\mu)$ with $m\le D\varepsilon^{-L}$ elements, i.e.
such a set ${\cal F}_\varepsilon\subset {\cal F}$ with
$m\le D\varepsilon^{-L}$ elements for which
$\inf\limits_{f_j\in {\cal F}_\varepsilon}\int |f-f_j|^p\,d\mu
$<\varepsilon^p$ for all functions $f\in {\cal F}$.}
\medskip\noindent
{\bf Theorem 4.2 (Estimate on the supremum of a class of Gaussian
random variables).} \index{estimate on the supremum of a class of
Gaussian random variables} {\it Let a probability measure $\mu$ be given
on a measurable space $(X,{\cal X})$ together with a linear Gaussian
random field $Z(f)$, $f\in{\cal G}$, such that $EZ(f)=0$,
$EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$, $f,g\in{\cal G}$, where ${\cal G}$
is the space of square integrable functions with respect to this
measure~$\mu$. Let ${\cal F}\subset{\cal G}$ be a countable and
$L_2$-dense class of functions with respect to the measure~$\mu$
with some exponent~$L\ge1$ and parameter~$D\ge1$ which also
satisfies condition~(\ref{(4.2)}) with some $0<\sigma\le1$.
Then there exist some universal constants $C>0$ and $M>0$ (for
instance $C=4$ and $M=16$ is a good choice) such that the inequality
\begin{eqnarray}
P\left(\sup\limits_{f\in{\cal F}}|Z(f)|
\ge u\right)&&\le C(D+1) \exp\left\{-\frac1{256}
\left(\frac u{\sigma}\right)^2\right\} \nonumber \\
&&\qquad \textrm{if }u\ge ML^{1/2}\sigma \log^{1/2}\frac2\sigma
\label{(4.7)}
\end{eqnarray}
holds with the parameter $D$ and exponent $L$ introduced in this
theorem.}
\medskip
The exponent at the right-hand side of inequality~(\ref{(4.7)})
does not contain the best possible universal constant. One could
choose the coefficient $\frac{1-\varepsilon}2$ with arbitrarily small
$\varepsilon>0$ instead of the coefficient $\frac1{256}$ in the
exponent at the right-hand side of~(\ref{(4.7)}) if the universal
constants $C>0$ and $M>0$ are chosen sufficiently large in this
inequality. Actually, later in Theorem~8.6 such an estimate will
be proved which can be considered as the multivariate
generalization of Theorem~4.2 with the expression
$-\frac{(1-\varepsilon)u^2}{2\sigma^2}$ in the exponent.
The condition about the countable cardinality of the set ${\cal F}$
in Theorem~4.2 could be weakened similarly to Theorem~4.1. But
I omit the discussion of this question, since Theorem~4.2 was
only introduced for the sake of a comparison between the
Gaussian and non-Gaussian case. An essential difference between
Theorems~4.1 and~4.2 is that the class of functions~${\cal F}$
considered in Theorem~4.1 had to be $L_2$-dense, while in
Theorem~4.2 a weaker version of this property was needed. In
Theorem~4.2 it was demanded that there exists a subset of
${\cal F}$ of relatively small cardinality which is dense in the
$L_2(\mu)$ norm. In the $L_2$-density property imposed in
Theorem~4.1 a similar property was demanded for all probability
measures~$\nu$. The appearance of such a condition may be unexpected.
But as we shall see, the proof of Theorem~4.1 contains a
conditioning argument where a lot of new conditional measures
appear, and the $L_2$-density property is needed to work with all
of them. One would also like to know some results that enable us
to check when this condition holds. In the next section a notion
popular in probability theory, the notion of Vapnik--\v{C}ervonenkis
classes will be introduced, and it will be shown that a
Vapnik--\v{C}ervonenkis class of functions bounded by~1 is
$L_2$-dense.
Another difference between Theorems~4.1 and~4.2 is that the
conditions of formula~(\ref{(4.4)}) contain the upper bound
$\sqrt n\sigma^2>u$, and no such condition was imposed in
formula~(\ref{(4.7)}). The appearance of this condition in
Theorem~4.1 can be explained by comparing this result with those
of Section~3. As we have seen, we do not lose much information
if we restrict our attention to the case
$u\le\textrm{const.}\, V_n^2=\textrm{const.}\, n\sigma^2$ in
Bernstein's inequality (if sums of independent and identically
distributed random variables are considered). Theorem~4.1 gives
an almost as good estimate for the supremum of normalized partial
sums under appropriate conditions for the class ${\cal F}$ of
functions we consider in this theorem as Bernstein's inequality
yields for the normalized partial sums of independent and
identically distributed random variables with variance bounded
by~$\sigma^2$. But we could prove the estimate of Theorem~4.1 only
under the condition $\sqrt n\sigma^2>u$. We shall show in
Example~4.3 discussed below that in the case $u\gg\sqrt n\sigma^2$
only a weaker estimate holds. There is also a natural reason why
condition~(\ref{(4.1)}) about the supremum of the functions
$f\in {\cal F}$ appeared in Theorems 4.1 and~$4.1'$, and no such
condition was needed in Theorem~4.2.
The lower bounds for the level~$u$ were imposed in
formulas~(\ref{(4.4)}) and~(\ref{(4.7)}) for a similar
reason. To understand why such a condition is needed in
formula~(\ref{(4.7)}) let us consider the
following example. Take a Wiener process $W(t)$, $0\le t\le1$,
define for all $0\le s<t\le1$ the function $f_{s,t}(x)=1$ if
$s\le x<t$, $f_{s,t}(x)=0$ otherwise, together with the random
integral $Z(f_{s,t})=\int f_{s,t}(x)\,W(\,dx)=W(t)-W(s)$, and
introduce for all $\sigma>0$ the following class of functions
${\cal F}_\sigma$:
${\cal F}_\sigma=\{f_{s,t}\colon\, 0\le s<t\le1,\;t-s\le\sigma^2\}$.
Then formula~(\ref{(4.7)}) would suggest the estimate
$P\left(\sup\limits_{f\in{\cal F}_\sigma}Z(f)>u\right)
\le e^{-\textrm{const.}\,(u/\sigma)^2}$.
However, this relation does not hold if
$u=u(\sigma)<2(1-\varepsilon)\sigma\log^{1/2}\frac1\sigma$
with some $\varepsilon>0$. In such cases
$P\left(\sup\limits_{f\in{\cal F}_\sigma}Z(f) >u\right)\to1$,
as $\sigma\to0$. This can be proved relatively simply with the help
of the estimate
$P(Z(f_{s,t})>u(\sigma))\ge\textrm{const.}\, \sigma^{2(1-\varepsilon)^2}$
if $|t-s|=\sigma^2$ and the independence of the random integrals
$Z(f_{s,t})$ if the functions $f_{s,t}$ are indexed by such pairs
$(s,t)$ for which the intervals $(s,t)$ are disjoint. This means
that in this example formula~(\ref{(4.7)}) holds only under the
condition $u\ge M\sigma\log^{1/2}\frac1\sigma$ with $M=2$.
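The threshold $2\sigma\log^{1/2}\frac1\sigma$ appearing here can be
traced through the following short calculation, a sketch with
unoptimized constants that only records the argument indicated above:

```latex
% If |t-s| = \sigma^2, then Z(f_{s,t}) = W(t)-W(s) is normally distributed
% with expectation zero and variance \sigma^2, and the Gaussian tail formula
% yields the estimate quoted above:
P\bigl(Z(f_{s,t})>u(\sigma)\bigr)\ge\textrm{const.}\,\sigma^{2(1-\varepsilon)^2}
 \quad\textrm{if } u(\sigma)<2(1-\varepsilon)\sigma\log^{1/2}\tfrac1\sigma.
% The interval [0,1] contains \sigma^{-2} disjoint intervals of length
% \sigma^2, and the corresponding random integrals Z(f_{s,t}) are
% independent, hence
P\Bigl(\sup_{f\in{\cal F}_\sigma}Z(f)\le u(\sigma)\Bigr)
 \le\Bigl(1-\textrm{const.}\,\sigma^{2(1-\varepsilon)^2}\Bigr)^{\sigma^{-2}}
 \le\exp\Bigl(-\textrm{const.}\,\sigma^{-2\varepsilon(2-\varepsilon)}\Bigr)
 \to0\qquad\textrm{as }\sigma\to0.
```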
There is a classical result about the modulus of continuity of
Wiener processes, and actually this result helped us to find the
previous example. It is also worth mentioning that there are some
concentration inequalities, \index{concentration inequalities}
see Ledoux~\cite{r29} and Talagrand~\cite{r52},
which state that under very general conditions the distribution
of the supremum of a class of partial sums of independent random
variables or of the elements of a Gaussian random field is
strongly concentrated around the expected value of this supremum.
(Talagrand's result in this direction is also formulated in
Theorem~18.1 of this lecture note.) These results imply that the
problems discussed in Theorems~4.1 and~4.2 can be reduced to a
good estimate of the expected value
$E\sup\limits_{f\in{\cal F}}|S_n(f)|$ and
$E\sup\limits_{f\in{\cal F}}|Z(f)|$ of the supremum considered in
these results. However, the estimation of the expected value of
these suprema is not much simpler than the original problem.
Theorem~4.2 implies that under its conditions
$$E
\sup\limits_{f\in{\cal F}}|Z(f)|
\le\textrm{const.}\, \sigma\log^{1/2}\frac2\sigma
$$
with an appropriate multiplying constant depending on the
parameter~$D$ and exponent~$L$ of the class of functions~${\cal F}$.
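This bound can be obtained from the tail estimate of Theorem~4.2 by
integrating the tail distribution of the supremum; the following
sketch assumes that the estimate may be applied to
$\sup\limits_{f\in{\cal F}}|Z(f)|$ at all levels $u\ge u_0$:

```latex
% Put \eta = \sup_{f\in{\cal F}}|Z(f)|, and let
% u_0 = M L^{1/2}\sigma\log^{1/2}\frac2\sigma denote the smallest level u
% for which the estimate of Theorem 4.2 is applicable. Then
E\eta=\int_0^\infty P(\eta>u)\,du
 \le u_0+\int_{u_0}^\infty e^{-\textrm{const.}\,(u/\sigma)^2}\,du
 \le u_0+\textrm{const.}\,\sigma
 \le\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma.
```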
In the case of Theorem~4.1 a similar estimate holds, but under more
restrictive conditions. We also have to impose that
$\sqrt n\sigma^2\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$ with
a sufficiently large constant. This condition is needed to guarantee
that the set of numbers~$u$ satisfying condition~(\ref{(4.4)}) is
not empty. If this condition is violated, then Theorem~4.1 supplies
a weaker estimate which we get by replacing $\sigma$ by an
appropriate~$\bar\sigma>\sigma$, and by applying Theorem~4.1 with
this number~$\bar\sigma$.
One may ask whether the above estimate about the expected value of
the supremum of normalized partial sums holds without the condition
$\sqrt n\sigma^2\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$.
We show an example which gives a negative answer to this question.
Since here we discuss a rather particular problem which is outside
of our main interest in this work, I give a rather sketchy
explanation of this example. I present this example together with
a Poissonian counterpart of it which may help to explain its
background.
\medskip\noindent
{\bf Example 4.3 (Supremum of partial sums with bad tail behaviour).}
{\it Let $\xi_1,\dots,\xi_n$ be a sequence of independent random
variables with uniform distribution in the interval~$[0,1]$. Choose
a sequence of real numbers, $\varepsilon_n$, $n=3,4,\dots$, such that
$\varepsilon_n\to0$ as $n\to\infty$, and
$\frac12\ge\varepsilon_n\ge n^{-\delta}$ with a
sufficiently small number $\delta>0$. Put
$\sigma_n=\varepsilon_n\sqrt{\frac{\log n}n}$, and define the set of
functions $\bar f_{j,n}(\cdot)$ and $f_{j,n}(\cdot)$
on the interval $[0,1]$ by the formulas
$\bar f_{j,n}(x)=1$ if $(j-1)\sigma^2_n\le x<j\sigma^2_n$, and
$\bar f_{j,n}(x)=0$ otherwise,
$f_{j,n}(x)=\bar f_{j,n}(x)-\sigma_n^2$,
$1\le j\le\frac1{\sigma^2_n}$. (For the sake of simplicity let us
assume that $\frac1{\sigma^2_n}$ is an integer.) Introduce the
classes of functions
${\cal F}_n=\left\{f_{j,n}\colon\,1\le j\le\frac1{\sigma^2_n}\right\}$,
the normalized random sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{l=1}^n f(\xi_l)$,
$f\in{\cal F}_n$, and put
$u_n=\frac{A\log n}{\sqrt n\log\frac1{\varepsilon_n}}$
with some number $A>0$. Then
$$
\lim_{n\to\infty}P\left(\sup_{f\in{\cal F}_n}S_n(f)>u_n\right)=1.
$$
}
\medskip
This example has the following Poissonian counterpart.
\medskip\noindent
{\bf Example 4.3$'$ (A Poissonian counterpart of Example 4.3).}
{\it Let $\bar P_n(x)$ be a Poisson process on the interval~$[0,1]$
with parameter~$n$ and $P_n(x)=\frac1{\sqrt n}[\bar P_n(x)-nx]$,
$0\le x\le 1$. Consider the same sequences of numbers~$\varepsilon_n$,
$\sigma_n$ and~$u_n$ as in Example~4.3, and define the random
variables $Z_{n,j}=P_n(j\sigma^2_n)-P_n((j-1)\sigma^2_n)$ for all
$n=3,4,\dots$ and $1\le j\le \frac1{\sigma^2_n}$. Then
$$
\lim_{n\to\infty}P\left(\sup_{1\le j\le \frac1{\sigma^2_n}}
Z_{n,j}>u_n\right)=1.
$$
}
\medskip
The classes of functions ${\cal F}_n$ in Example~4.3 are $L_2$-dense
classes of functions with some exponent~$L$ and parameter~$D$
not depending on the parameter~$n$ and the choice of the
numbers~$\sigma_n$. It can be seen that even the class of functions
${\cal F}=\{f_{s,t}\colon\, f_{s,t}(x)=1,\textrm{ if }s\le x<t,\
f_{s,t}(x)=0\textrm{ otherwise},\ 0\le s<t\le1\}$ is an $L_2$-dense
class. Beside this, the result of Example~4.3 implies that
$\liminf\limits_{n\to\infty}
u_n^{-1}E\sup\limits_{f\in{\cal F}_n}S_n(f)>0$ in this case. As
$\varepsilon_n\log\frac1{\varepsilon_n}\to0$ as $n\to\infty$,
this means that the
expected value of the supremum of the random sums considered in
Example~4.3 does not satisfy the estimate
$\limsup\limits_{n\to\infty}
\frac1{\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}}
E\sup\limits_{f\in{\cal F}_n}S_n(f)<\infty$ suggested by
Theorem~4.1. Observe that $\sqrt n\bar\sigma^2_n
\sim\textrm{const.}\, \varepsilon_n\bar\sigma_n\log^{1/2}
\frac2{\bar\sigma_n}$
in this case, since
$\sqrt n\bar\sigma^2_n\sim\varepsilon_n^2\frac{\log n}{\sqrt n}$,
and $\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}
\sim \textrm{const.}\,\varepsilon_n\frac{\log n}{\sqrt n}$.
\medskip\noindent
{\it The proof of Examples~4.3 and~$4.3'$.} First we prove the
statement of Example~$4.3'$. For a fixed index~$n$ the number of
random variables $Z_{n,j}$ equals
$\frac1{\sigma_n^2}\ge\frac1{\varepsilon_n^2}\frac n{\log n}
\ge\frac n{\log n}$, and they are independent. Hence it is enough
to show that $P(Z_{n,j}>u_n)\ge n^{-1/2}$ if first $A>0$ and then
$\delta>0$ (appearing in the condition
$\varepsilon_n>n^{-\delta}$) are chosen sufficiently small, and
$n\ge n_0$ with some threshold index $n_0=n_0(A,\delta)$.
Put $\bar u_n=[\sqrt nu_n+n\sigma^2_n]+1$, where $[\cdot]$ denotes
integer part. Then
$P(Z_{n,j}>u_n)\ge P(\bar P_n(\sigma^2_n)\ge\bar u_n)
\ge P(\bar P_n(\sigma^2_n)=\bar u_n)
=\frac{(n\sigma_n^2)^{\bar u_n}}{\bar u_n!}e^{-n\sigma_n^2}
\ge \left(\frac{n\sigma_n^2}{\bar u_n}\right)^{\bar u_n}e^{-n\sigma_n^2}$.
Some calculation shows that
$\bar u_n\le\frac{A \log n}{\log \frac1{\varepsilon_n}}
+\varepsilon_n^2\log n+1
\le\frac{2A \log n}{\log \frac1{\varepsilon_n}}$,
$\frac{n\sigma_n^2}{\bar u_n}
\ge\frac{\varepsilon_n^2\log\frac1{\varepsilon_n}}{2A}$,
and $\log\frac{n\sigma_n^2}{\bar u_n}\ge- 2\log\frac1{\varepsilon_n}$
if the constants $A>0$, $\delta>0$ and threshold index $n_0$ are
appropriately chosen. Hence
$P(Z_{n,j}>u_n)\ge e^{-2\bar u_n\log(1/\varepsilon_n)-n\sigma_n^2}
\ge e^{-4A\log n-\varepsilon_n^2\log n}\ge\frac1{\sqrt n}$
if~$A>0$ is small enough.
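This lower bound can also be checked numerically for concrete
parameters. The following sketch (the helper name and the choices
$n=10^6$, $\varepsilon_n=0.2$, $A=0.05$ are illustrative, not taken
from the text) evaluates the Poisson probability appearing above:

```python
from math import exp, factorial, log, sqrt

def poisson_tail_check(n, eps_n, A):
    """Check P(Poisson(n*sigma_n^2) >= bar_u_n) >= n^{-1/2} numerically.

    sigma_n, u_n and bar_u_n are formed as in Example 4.3'.
    """
    sigma2 = eps_n ** 2 * log(n) / n           # sigma_n^2
    u_n = A * log(n) / (sqrt(n) * log(1 / eps_n))
    bar_u = int(sqrt(n) * u_n + n * sigma2) + 1
    lam = n * sigma2                           # parameter of the Poisson increment
    # P(Poisson(lam) = bar_u), itself a lower bound for the tail probability
    pmf = lam ** bar_u / factorial(bar_u) * exp(-lam)
    return pmf, 1 / sqrt(n)

pmf, bound = poisson_tail_check(n=10**6, eps_n=0.2, A=0.05)
```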
The statement of Example~4.3 can be deduced from~Example~$4.3'$
by applying Poissonian approximation. Let us apply the result of
Example~$4.3'$ for a Poisson process $\bar P_{n/2}$ with
parameter~$\frac n2$ and with such a number~$\bar\varepsilon_{n/2}$
with which the value of $\sigma_{n/2}$ equals the previously
defined~$\sigma_n$. Then
$\bar\varepsilon_{n/2}\sim\frac{\varepsilon_n}{\sqrt 2}$,
and the number of sample points of $\bar P_{n/2}$ is less
than~$n$ with probability close to~1. Attaching additional sample
points to get exactly $n$ sample points we can get the result of
Example~4.3. I omit the details.
\medskip
In formulas~(\ref{(4.4)}) and~(\ref{(4.7)}) the conditions for the
validity of Theorem~4.1 and Theorem~4.2 contain the large
multiplying constants $ML^{3/4}$ and $ML^{1/2}$ of
$\sigma\log^{1/2}\frac2\sigma$ in the lower bound for the
number~$u$ if we deal with an $L_2$-dense class of functions
${\cal F}$ which has a large exponent~$L$. At a heuristic level
it is clear that in such a case a large multiplying constant
appears. On the other hand, I did not try to find the best
possible coefficients in the lower bound in
relations~(\ref{(4.4)}) and~(\ref{(4.7)}).
\medskip
In Theorem~4.1 (and in its version 4.1$'$) it was demanded that
the class of functions ${\cal F}$ should be countable. Later this
condition was replaced by a weaker one about countable
approximability. By restricting our attention to countable or
countably approximable classes we could avoid some unpleasant
measure theoretical problems which would have arisen if we had
worked with the supremum of an uncountable family of random
variables, which may be non-measurable.
There are some papers where possibly non-measurable models
are also considered with the help of some rather deep results
of analysis and measure theory. Actually, the problem we met
here is the natural analogue of an important problem in the theory
of the stochastic processes about the smoothness property of the
trajectories of an appropriate version of a stochastic process
which we can get by exploiting our freedom to change all random
variables on a set of probability zero.
The study of the problem in this work is simpler in one respect.
Here the random variables $S_n(f)(\omega)$ or $J_n(f)(\omega)$,
$f\in{\cal F}$, are constructed directly with the help of the
underlying random variables $\xi_1(\omega),\dots,\xi_n(\omega)$ for all
$\omega\in\Omega$ separately. We are interested in the question when
the sets of random variables constructed in this way are countably
approximable, i.e.\ we are not looking for a possibly different,
better version of them with the same finite dimensional distributions.
The next simple Lemma~4.4 yields a sufficient condition for countable
approximability. Its condition can be interpreted as a smoothness
type condition for the trajectories of a
stochastic process indexed by the functions $f\in{\cal F}$.
\medskip\noindent
{\bf Lemma 4.4.} {\it Let a class of random variables $U(f)$,
$f\in{\cal F}$, indexed by some set ${\cal F}$ of functions be given
on a space $(Y,{\cal Y})$. If there exists a countable subset
${\cal F}'\subset {\cal F}$ of the set ${\cal F}$ such that the sets
$A(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}'}
|U(f)(\omega)|\ge u\}$
introduced
for all $u>0$ in the definition of countable approximability satisfy
the relation $A(u)\subset B(u-\varepsilon)$ for all $u>\varepsilon>0$,
then the class
of random variables $U(f)$, $f\in{\cal F}$, is countably approximable.
The above property holds if for all $f\in{\cal F}$, $\varepsilon>0$
and $\omega\in\Omega$ there exists a function
$\bar f=\bar f(f,\varepsilon,\omega)\in{\cal F}'$
such that $|U(\bar f)(\omega)|\ge|U(f)(\omega)|-\varepsilon$.}
\medskip\noindent
{\it Proof of Lemma 4.4.}\/ If $A(u)\subset B(u-\varepsilon)$ for
all $\varepsilon>0$, then
$P^*(A(u)\setminus B(u))\le \lim\limits_{\varepsilon\to0}
P(B(u-\varepsilon)\setminus B(u))=0$, where $P^*(X)$ denotes the
outer measure
of a not necessarily measurable set $X\subset\Omega$, since
$\bigcap\limits_{\varepsilon>0}B(u-\varepsilon)=B(u)$, and this is
what we had to prove.
If $\omega\in A(u)$, then for all $\varepsilon>0$ there exists some
$f=f(\omega)\in{\cal F}$ such that $|U(f)(\omega)|>u-\frac\varepsilon2$.
If there
exists some $\bar f=\bar f(f,\frac\varepsilon2,\omega)$,
$\bar f\in{\cal F}'$ such that
$|U(\bar f)(\omega)| \ge |U(f)(\omega)|-\frac\varepsilon2$,
then $|U(\bar f)(\omega)|
>u-\varepsilon$, and $\omega\in B(u-\varepsilon)$. This
means that $A(u)\subset B(u-\varepsilon)$.
\medskip
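For instance, if $U(f)=S_n(f)$ for the class of centered indicator
functions $f_{s,t}(x)={\rm I}(s\le x<t)-(t-s)$ of the intervals
$[s,t)\subset[0,1]$, a hypothetical illustrative choice, then the
sufficient condition of Lemma~4.4 can be checked with the subclass
${\cal F}'$ of functions with rational endpoints:

```latex
% Fix f_{s,t}, \varepsilon>0 and \omega. The sample
% \xi_1(\omega),\dots,\xi_n(\omega) consists of finitely many points,
% hence there are rational numbers s'\le s and t'\le t with
% s-s'<\delta, t-t'<\delta such that no sample point lies in [s',s) or
% in [t',t). Then the two intervals contain the same sample points, and
% the centering terms differ by less than 2\sqrt n\,\delta, so
|S_n(f_{s',t'})(\omega)|\ge|S_n(f_{s,t})(\omega)|-2\sqrt n\,\delta
 \ge|S_n(f_{s,t})(\omega)|-\varepsilon
% if \delta=\delta(\varepsilon,n)>0 is chosen small enough. Thus the
% sufficient condition of Lemma 4.4 holds with this choice of {\cal F}'.
```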
The question about countable approximability also appears in the
case of multiple random integrals with respect to a normalized
empirical measure. To avoid some repetition we prove a result which
also covers such cases. For this goal first we introduce the notion
of multiple integrals with respect to a normalized empirical
measure.\index{multiple integrals with respect to a normalized
empirical measure}
Given a measurable function $f(x_1,\dots,x_k)$ on the $k$-fold
product space $(X^k,{\cal X}^k)$ and a sequence of independent random
variables $\xi_1,\dots,\xi_n$ with some distribution $\mu$ on the
space $(X,{\cal X})$ we define the integral $J_{n,k}(f)$ of the
function $f$ with respect to the $k$-fold product of the normalized
version of the empirical measure $\mu_n$ introduced in (\ref{(4.5)})
by the formula
\begin{eqnarray}
J_{n,k}(f)&&=\frac{n^{k/2}}{k!} \int'
f(x_1,\dots,x_k)(\mu_n(dx_1)-\mu(dx_1))\dots
(\mu_n(dx_k)-\mu(dx_k)), \nonumber \\
&&\quad\textrm{where the prime in $\int'$ means that the
diagonals } x_j=x_l,\; \nonumber\\
&&\quad 1\le j<l\le k, \textrm{ are omitted from the domain of
integration.} \nonumber
\end{eqnarray}
To see an application of the results of this section, take
independent random variables $\xi_1,\dots,\xi_n$ on the real line
with distribution function~$F$, and consider the class of functions
${\cal F}=\{f_u\colon\, u\in R^1\}$ with $f_u(x)=1$ if $x<u$, and
$f_u(x)=0$ otherwise. With this choice, and with the notation
$J_n(f)=J_{n,1}(f)$,
$n^{-1/2}\sup\limits_{f\in{\cal F}}|J_n(f)|
=\sup\limits_{u\in R^1}|F_n(u)-F(u)|$, where $F_n$ denotes the
empirical distribution function of the sample $\xi_1,\dots,\xi_n$,
and we want to estimate the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}|J_n(f)|>u\right)$. We have
seen that the above class of functions ${\cal F}$ is countably
approximable. The results of the next section imply that this
class of functions is also $L_2$-dense. Let me remark that
actually it is not difficult to check this property directly.
Hence we can apply Theorem~$4.1'$ to the above defined class of
functions with $\sigma=1$, and it yields that
$P\left(n^{-1/2}\sup\limits_{f\in {\cal F}}|J_n(f)|>u\right)
\le e^{-Cnu^2}$
if $1\ge u\ge\bar Cn^{-1/2}$ with some universal constants $C>0$ and
$\bar C>0$. (The condition $1\ge u$ can actually be dropped.) The
application of this estimate for each fixed number $\varepsilon>0$
together with the Borel--Cantelli lemma implies the fundamental
theorem of mathematical statistics, the Glivenko--Cantelli theorem.
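The closeness of $F_n$ and $F$ expressed by this estimate is easy
to observe numerically; the following sketch (the helper name and
the parameters are illustrative, and the simulation is no
substitute for the proof) computes $\sup_u|F_n(u)-F(u)|$ exactly
for a uniform sample:

```python
import random

def sup_deviation(n, seed=0):
    """Simulate sup_u |F_n(u) - F(u)| for a uniform sample on [0,1].

    For the uniform distribution F(u) = u, and the supremum of |F_n - F|
    is attained next to a sample point, so it can be computed exactly.
    """
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    # Just before the i-th order statistic F_n equals i/n, at it (i+1)/n.
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

dev = sup_deviation(n=10_000)
```

For $n=10000$ the computed deviation is of the order $n^{-1/2}$, in
agreement with the exponential tail bound above.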
In short, the results of this section yield more information about
the closeness of the empirical distribution function $F_n$ and the
distribution function $F$ than the fundamental theorem of
mathematical statistics. Moreover, since these results can also be
applied for other classes of functions, they yield useful
information about the closeness of the probability measure $\mu$
to the empirical measure~$\mu_n$.
\chapter{Vapnik--\v{C}ervonenkis classes and $L_2$-dense
classes of functions}
In this section the most important notions and results will be
presented about Vapnik--\v{C}ervonenkis classes, and it will be
explained how they help to show in some important cases that
certain classes of functions are $L_2$-dense. $L_2$-dense classes
played an important role in the study of the
previous section. The results of this section may help to find
interesting classes of functions with this property. Some of the
results of this section will be proved in Appendix~A.
First I recall the following notions.
\medskip\noindent
{\bf Definition of Vapnik--\v{C}ervonenkis classes of sets and
functions.}\index{Vapnik--\v{C}ervonenkis classes of sets and functions}
{\it Let a set $X$ be given, and let us select a class
${\cal D}$ of subsets of this set $X$. We call
${\cal D}$ a Vapnik--\v{C}ervonenkis class if there exist two real
numbers $B$ and $K$ such that for all positive integers $n$ and
subsets $S(n)=\{x_1,\dots,x_n\}\subset X$ of cardinality $n$
of the set $X$ the collection of sets of the form $S(n)\cap D$,
$D\in{\cal D}$, contains no more than $Bn^K$ subsets of~$S(n)$.
We shall call $B$ the parameter and $K$ the exponent of this
Vapnik--\v{C}ervonenkis class.
A class of real valued functions ${\cal F}$ on a space
$(Y,{\cal Y})$ is called a Vapnik--\v{C}ervonenkis class if
the collection of graphs of these functions is a
Vapnik--\v{C}ervonenkis class, i.e.\ if the sets
$A(f)=\{(y,t)\colon\, y\in Y,\;\min(0,f(y))\le t\le\max(0,f(y))\}$,
$f\in {\cal F}$, constitute a Vapnik--\v{C}er\-vo\-nen\-kis class
of subsets of the product space $X=Y\times R^1$.}
\medskip
The following result which was first proved by Sauer plays a
fundamental role in the theory of Vapnik--\v{C}er\-vo\-nen\-kis
classes. This result provides a relatively simple condition for
a class ${\cal D}$ of subsets of a set~$X$ to be a
Vapnik--\v{C}ervonenkis class. Its proof is given in Appendix~A.
Before its formulation I introduce some terminology which seems to
be wide spread and generally accepted in the literature.
\medskip\noindent
{\bf Definition of shattering of a set.}\index{shattering of a set}
{\it Let a set $S$ and a class ${\cal E}$ of subsets of $S$ be
given. A finite set $F\subset S$ is called shattered by the
class ${\cal E}$ if all its subsets $H\subset F$ can be written
in the form $H=E\cap F$ with some element $E\in{\cal E}$ of the
class of sets ${\cal E}$.}
\medskip\noindent
{\bf Theorem 5.1 (Sauer's lemma).}\index{Sauer's lemma}
{\it Let a finite set $S=S(n)$ consisting of $n$ elements be given
together with a class ${\cal E}$ of subsets of $S$. If ${\cal E}$
shatters no subset of $S$ of cardinality~$k$, then ${\cal E}$
contains at most ${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$
subsets of $S$.}
\medskip
The estimate of Sauer's lemma is sharp. Indeed, if ${\cal E}$ contains
all subsets of $S$ of cardinality less than or equal to $k-1$, then
it shatters no set $F\subset S$ of cardinality~$k$ (the set $F$
itself cannot be written in the form $E\cap F$ with
$E\in {\cal E}$, since $|E|\le k-1$), and ${\cal E}$ contains
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ subsets of $S$.
Sauer's lemma states that this is an extreme case. Any class of
subsets ${\cal E}$ of $S$ with cardinality greater than
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ shatters at
least one subset of~$S$ with cardinality~$k$.
Let us have a set $X$ and a class of subsets ${\cal D}$ of it. One may
be interested in when ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.
Sauer's lemma gives a useful condition for it. Namely, it implies
that if there exists a positive integer $k$ such that the class
${\cal D}$ shatters no subset of $X$ of cardinality~$k$,
then ${\cal D}$
is a Vapnik--\v{C}ervonenkis class. Indeed, let us take some number
$n\ge k$, fix an arbitrary set $S(n)=\{x_1,\dots,x_n\}\subset X$ of
cardinality~$n$, and introduce the class of subsets
${\cal E}={\cal E}(S(n))=\{S(n)\cap D\colon\, D\in{\cal D}\}$. If
${\cal D}$ shatters no subset of $X$ of cardinality~$k$, then ${\cal E}$
shatters no subset of $S(n)$ of cardinality~$k$. Hence by
Sauer's lemma the class ${\cal E}$ contains at most
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ elements.
Let me remark that it is also proved that
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}\le1.5\frac{n^{k-1}}{(k-1)!}$
if $n\ge k+1$. This estimate gives a bound on the parameter and
exponent of a Vapnik--\v{C}ervonenkis class which satisfies the
above condition.
Moreover, Theorem~5.1 also has the following consequence. Take
an (infinite) set $X$ and a class of its subsets ${\cal D}$.
There are two possibilities. Either there is some set
$S(n)\subset X$ of cardinality $n$ for all integers $n$ such
that ${\cal E}(S(n))$ contains all subsets
of $S(n)$, i.e. ${\cal D}$ shatters this set, or
$\sup\limits_{S\colon\,S\subset X,\,|S|=n}|{\cal E}(S)|$
grows at most at a polynomial rate as
$n\to\infty$, where $|S|$ and $|{\cal E}(S)|$
denote the cardinality of $S$ and ${\cal E}(S)$.
To understand why the Sauer lemma plays an important role in the
theory of Vapnik--\v{C}ervonenkis classes let us formulate the
following consequence of the above considerations.
\medskip\noindent
{\bf Corollary of Sauer's lemma.} \index{Sauer's lemma}
{\it Let a set $X$ be given together with a class
${\cal D}$ of subsets of this set $X$. This class of sets
${\cal D}$ is a Vapnik--\v{C}ervonenkis class if there exists a positive
integer~$k$ such that ${\cal D}$ shatters no subset $F\subset X$ of
cardinality~$k$. In other words if each set
$F=\{x_1,\dots,x_k\}\subset X$ of cardinality~$k$ has a subset $G\subset F$
which cannot be written in the form $G=D\cap F$ with some $D\in{\cal D}$,
then ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.}
\medskip
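This corollary can be checked by brute force in a concrete case;
the sketch below (a standard illustration, with hypothetical helper
names) takes for ${\cal D}$ the class of all intervals $[a,b)$ of
the real line, restricted to an $n$-point set:

```python
from itertools import combinations

def interval_traces(n):
    """Traces of the class of intervals on the point set {0, ..., n-1}.

    Every trace is a contiguous run of points or the empty set.
    """
    traces = {frozenset()}
    for i in range(n):
        for j in range(i, n):
            traces.add(frozenset(range(i, j + 1)))
    return traces

def shatters(traces, subset):
    """Return True if every subset of `subset` is a trace on `subset`."""
    subset = frozenset(subset)
    needed = {frozenset(c) for r in range(len(subset) + 1)
              for c in combinations(subset, r)}
    return needed <= {t & subset for t in traces}

n = 8
traces = interval_traces(n)
# No 3-point set is shattered: an interval cannot pick the two outer
# points of a triple while omitting the middle one.
no_three_shattered = not any(shatters(traces, c)
                             for c in combinations(range(n), 3))
# Sauer's lemma with k = 3 bounds the number of traces by
# C(n,0) + C(n,1) + C(n,2); for intervals this bound is attained.
sauer_bound = 1 + n + n * (n - 1) // 2
```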
The following Theorem~5.2, an important result of Richard Dudley,
states that a Vapnik--\v{C}er\-vo\-nen\-kis class of functions
bounded by~1 is an $L_1$-dense class of functions.
\medskip\noindent
{\bf Theorem 5.2 (A relation between the $L_1$-dense class and
Vapnik--\v{C}er\-vo\-nen\-kis class property).}\index{relation
between $L_1$-dense and Vapnik--\v{C}ervonenkis classes}
{\it Let $f(y)$,
$f\in {\cal F}$, be a Vapnik--\v{C}ervonenkis class of real valued
functions on some measurable space $(Y,{\cal Y})$ such that
$\sup\limits_{y\in Y}|f(y)|\le1$ for all $f\in{\cal F}$.
Then ${\cal F}$ is an
$L_1$-dense class of functions on $(Y,{\cal Y})$. More explicitly, if
${\cal F}$ is a Vapnik--\v{C}ervonenkis class with parameter $B\ge1$
and exponent $K>0$, then it is an $L_1$-dense class with exponent
$L=2K$ and parameter $D=CB^2 (4K)^{2K}$ with some universal
constant~$C>0$.}
\medskip\noindent
{\it Proof of Theorem 5.2.}\/ Let us fix some probability
measure $\nu$ on $(Y,{\cal Y})$ and a real number
$0<\varepsilon\le1$. We are going to show that any finite set
${\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\cal F}$
such that $\int|f_j-f_k|\,d\nu\ge\varepsilon$ if $j\neq k$,
$f_j,f_k\in{\cal D}(\varepsilon,\nu)$ has cardinality
$M\le D\varepsilon^{-L}$ with some $D>0$ and $L>0$. This
implies that ${\cal F}$ is an $L_1$-dense class with
parameter~$D$ and exponent~$L$. Indeed, let us take a maximal
subset
$\bar{\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\cal F}$
such that the $L_1(\nu)$ distance of any two functions in this
subset is at least~$\varepsilon$. Maximality means in this context
that no function $f_{M+1}\in{\cal F}$ can be attached to
$\bar{\cal D}(\varepsilon,\nu)$ without violating this condition.
Thus the inequality $M\le D\varepsilon^{-L}$ means that
$\bar{\cal D}(\varepsilon,\nu)$ is an $\varepsilon$-dense subset
of~${\cal F}$ in the space $L_1(Y,{\cal Y},\nu)$
with no more than $D\varepsilon^{-L}$ elements.
In the estimation of the cardinality $M$ of a set
${\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset {\cal F}$
with the property $\int|f_j-f_k|\,d\nu\ge\varepsilon$ if
$j\neq k$ we exploit the Vapnik--\v{C}ervonenkis class
property of ${\cal F}$ in the following way.
Let us choose relatively few $p=p(M,\varepsilon)$ points
$(y_l,t_l)$, $y_l\in Y$, $-1\le t_l\le1$, $1\le l\le p$, in the
space $(Y\times [-1,1])$ in such a way that the set
$S_0(p)=\{(y_l,t_l),\,1\le l\le p\}$ and graphs
$A(f_j)=\{(y,t)\colon\, y\in Y,\;\min(0,f_j(y))
\le t\le\max(0,f_j(y))\}$,
$f_j\in{\cal D}(\varepsilon,\nu)\subset{\cal F}$ have
the property that all
sets $A(f_j)\cap S_0(p)$, $1\le j\le M$, are different. Then the
Vapnik--\v{C}ervonenkis class property of ${\cal F}$ implies that
$M\le Bp^K$. Hence if there exists a set $S_0(p)$ with the above
property and with a relatively small number $p$, then this yields a
useful estimate on $M$. Such a set $S_0(p)$ will be given by means of
the following random construction.
Let us choose the $p$ points $(y_l,t_l)$, $1\le l\le p$, of the
(random) set $S_0(p)$ independently of each other in such a way that
the coordinate $y_l$ is chosen with distribution $\nu$ on
$(Y,{\cal Y})$ and the coordinate $t_l$ with uniform distribution on
the interval $[-1,1]$ independently of $y_l$. (The number~$p$ will be
chosen later.) Let us fix some indices $1\le j,k\le M$, and estimate
from above the probability that the sets $A(f_j)\cap S_0(p)$ and
$A(f_k)\cap S_0(p)$ agree, where $A(f)$ denotes the graph of the
function~$f$. Consider the symmetric difference $A(f_j)\Delta A(f_k)$
of the sets $A(f_j)$ and $A(f_k)$. The sets
$A(f_j)\cap S_0(p)$ and $A(f_k)\cap S_0(p)$ agree if and only if
$(y_l,t_l)\notin A(f_j)\Delta A(f_k)$ for all $(y_l,t_l)\in S_0(p)$.
Let us observe that for a fixed
$l$ the estimate $P((y_l,t_l)\in A(f_j)\Delta A(f_k))
=\frac12(\nu\times\lambda)(A(f_j)\Delta A(f_k))
=\frac12\int |f_j-f_k|\,d\nu\ge\frac\varepsilon2$ holds, where
$\lambda$ denotes the Lebesgue measure. This implies that the
probability that the (random) sets $A(f_j)\cap S_0(p)$ and
$A(f_k)\cap S_0(p)$ agree can be bounded from above by
$\left(1-\frac\varepsilon2\right)^p\le e^{-p\varepsilon/2}$.
Hence the probability that all sets $A(f_j)\cap S_0(p)$ are
different is greater than
$1-{M\choose2} e^{-p\varepsilon/2}\ge1-\frac{M^2}2e^{-p\varepsilon/2}$.
Choose $p$ such that
$\frac74e^{p\varepsilon/2}>e^{(p+1)\varepsilon/2}>M^2\ge e^{p\varepsilon/2}$.
Then the above probability is greater than $\frac18$, and there exists
some set $S_0(p)$ with the desired property.
The inequalities $M\le Bp^K$ and $M^2\ge e^{p\varepsilon/2}$ imply
that $M\ge e^{\varepsilon M^{1/K}/4B^{1/K}}$, i.e.\
$\frac{\log M^{1/K}}{M^{1/K}}\ge \frac\varepsilon{4KB^{1/K}}$. As
$\frac{\log M^{1/K}}{M^{1/K}}\le CM^{-1/2K}$
for $M\ge1$ with some universal constant $C>0$, this estimate
implies that Theorem 5.2 holds with the exponent~$L$ and
parameter~$D$ given in its formulation.
\medskip
Let us observe that if ${\cal F}$ is an $L_1$-dense class of
functions on a measure space $(Y,{\cal Y})$ with some
exponent~$L$ and parameter~$D$, and also the inequality
$\sup\limits_{y\in Y}|f(y)|\le1$ holds for all $f\in{\cal F}$,
then ${\cal F}$ is an $L_2$-dense class of functions
with exponent $2L$ and parameter $D2^L$. Indeed, if we fix some
probability measure $\nu$ on $(Y,{\cal Y})$ together with a number
$0<\varepsilon\le1$, and
${\cal D}(\varepsilon,\nu)=\{f_1,\dots, f_M\}$ is an
$\frac{\varepsilon^2}2$-dense set of ${\cal F}$ in the
space $L_1(Y,{\cal Y},\nu)$,
$M\le2^L D \varepsilon^{-2L}$, then for every function
$f\in {\cal F}$ some function $f_j\in{\cal D}(\varepsilon,\nu)$ can
be chosen in such a way that
$\int(f-f_j)^2\,d\nu\le2\int|f-f_j|\,d\nu\le\varepsilon^2$. This
implies that ${\cal F}$ is an $L_2$-dense class with the given
exponent and parameter.
It is not easy to check whether a collection of subsets ${\cal D}$
of a set $X$ is a Vapnik--\v{C}ervonenkis class even with the help
of Theorem~5.1. Therefore the following Theorem~5.3 which enables
us to construct many non-trivial Vapnik--\v{C}ervonenkis classes
is of special interest. Its proof is given in Appendix~A.
\medskip\noindent
{\bf Theorem 5.3 (A way to construct Vapnik--\v{C}ervonenkis classes).}
{\it Let us consider a $k$-dimensional subspace ${\cal G}_k$ of the
linear space of real valued functions defined on a set $X$, and
define the level-set $A(g)=\{x\colon\, x\in X,\,g(x)\ge0\}$ for
all functions $g\in{\cal G}_k$. Take the class of subsets
${\cal D}=\{A(g)\colon\, g\in {\cal G}_k\}$ of the set $X$ consisting
of the above introduced level sets. No subset $S=S(k+1)\subset X$ of
cardinality $k+1$ is shattered by ${\cal D}$. Hence by Theorem~5.1
${\cal D}$ is a Vapnik--\v{C}ervonenkis class of subsets of~$X$.}
\medskip
Theorem~5.3 enables us to construct many interesting
Vapnik--\v{C}ervonenkis classes. Thus for instance the class of
all half-spaces in a Euclidean space, the class of all ellipses in
the plane, or more generally the level sets of algebraic
functions of $p$~variables of degree at most a fixed number~$k$
constitute a
Vapnik--\v{C}ervonenkis class in the $p$-dimensional Euclidean
space~$R^p$. It can be proved that if ${\cal C}$ and ${\cal D}$ are
Vapnik--\v{C}ervonenkis classes of subsets of a set $S$, then also
their intersection
${\cal C}\cap{\cal D}=\{C\cap D\colon\,C\in{\cal C},\,D\in{\cal D}\}$,
their union
${\cal C}\cup {\cal D}=\{C\cup D\colon\, C\in{\cal C},\,D\in{\cal D}\}$
and complementary sets ${\cal C}^c
=\{S\setminus C\colon\, C\in{\cal C}\}$
are Vapnik--\v{C}ervonenkis classes. These results are less
important for us, and their proofs will be omitted. We are
interested in Vapnik--\v{C}ervonenkis classes not for their own
sake. We are going to find $L_2$-dense classes of functions, and
Vapnik--\v{C}ervonenkis classes help us in this. Indeed, Theorem 5.2
implies that if ${\cal D}$ is a Vapnik--\v{C}ervonenkis class of
subsets of a set $S$, then their indicator functions constitute a
Vapnik--\v{C}ervonenkis class of functions, and as a consequence
an $L_1$-dense, hence also an $L_2$-dense class of functions.
Then the results of Lemma~5.4 formulated below enable us to
construct new $L_2$-dense classes of functions.
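The dimension count behind the examples mentioned above can be made
explicit; for instance, for the class of all ellipses in the plane:

```latex
% The quadratic polynomials in two variables form the 6-dimensional
% linear space
{\cal G}_6=\{\,g(x,y)=a_1+a_2x+a_3y+a_4x^2+a_5xy+a_6y^2\,\},
% and the interior of an ellipse is a level set
% A(g)=\{(x,y)\colon\, g(x,y)\ge0\} with an appropriate g\in{\cal G}_6.
% Hence, by Theorem 5.3, no set of 7 points of the plane is shattered by
% the level sets of {\cal G}_6, and the class of all ellipses in the
% plane is a Vapnik--\v{C}ervonenkis class.
```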
\medskip\noindent
{\bf Lemma 5.4 (Some useful properties of $L_2$-dense classes).}
{\it Let ${\cal G}$ be an $L_2$-dense class of functions
on some space $(Y,{\cal Y})$ whose absolute values are bounded
by one, and let $f$ be a function on $(Y,{\cal Y})$ also with
absolute value bounded by one. Then
$f\cdot{\cal G}=\{f\cdot g\colon\, g\in{\cal G}\}$ is also an
$L_2$-dense class of functions. Let ${\cal G}_1$ and
${\cal G}_2$ be two $L_2$-dense classes of functions on some
space $(Y,{\cal Y})$ whose absolute values are
bounded by one. Then the classes of functions
${\cal G}_1+{\cal G}_2=\{g_1+g_2\colon\,
g_1\in{\cal G}_1,\,g_2\in{\cal G}_2\}$,
${\cal G}_1\cdot{\cal G}_2
=\{g_1g_2\colon\, g_1\in{\cal G}_1,\,g_2\in{\cal G}_2\}$,
$\min({\cal G}_1,{\cal G}_2)
=\{\min(g_1,g_2)\colon\, g_1\in{\cal G}_1,\,g_2\in
{\cal G}_2\}$, $\max({\cal G}_1,{\cal G}_2)
=\{\max(g_1,g_2)\colon\, g_1\in
{\cal G}_1,\,g_2\in{\cal G}_2\}$ are also $L_2$-dense.
If ${\cal G}$ is an
$L_2$-dense class of functions, and ${\cal G}'\subset{\cal G}$,
then ${\cal G}'$ is also an $L_2$-dense class.}
\medskip\noindent
The proof of Lemma 5.4 is rather straightforward. One has to observe
for instance that if $g_1,\bar g_1\in{\cal G}_1$,
$g_2,\bar g_2\in{\cal G}_2$ then $|\min(g_1,g_2)-\min(\bar g_1,\bar g_2)|
\le |g_1-\bar g_1|+|g_2-\bar g_2|$, hence if
$g_{1,1},\dots,g_{1,M_1}$ is an $\frac\varepsilon2$-dense
subset of ${\cal G}_1$
and $g_{2,1},\dots,g_{2,M_2}$ is an $\frac\varepsilon2$-dense
subset of ${\cal G}_2$ in the space $L_2(Y,{\cal Y},\nu)$ with
some probability measure
$\nu$, then the functions $\min(g_{1,j},g_{2,k})$, $1\le j\le M_1$,
$1\le k\le M_2$ constitute an $\varepsilon$-dense subset of
$\min({\cal G}_1,{\cal G}_2)$ in $L_2(Y,{\cal Y},\nu)$. The last
statement of Lemma~5.4 was proved after the Corollary of
Theorem~4.1. The details are left to the reader.
\medskip
The above results enable us to construct some $L_2$-dense classes of
functions. We give an example in the following Example~5.5,
which is a consequence of Theorem~5.2 and Lemma~5.4.
\medskip\noindent
{\bf Example 5.5.} {\it Take $m$ measurable functions $f_j(x)$,
$1\le j\le m$, on a measurable space $(X,{\cal X})$ which
have the property $\sup\limits_{x\in X}|f_j(x)|\le1$ for all
$1\le j\le m$. Let ${\cal D}$ be a Vapnik--\v{C}ervonenkis class
consisting of measurable subsets of the set $X$. Define for all
pairs $(f_j,D)$, $f_j$, $1\le j\le m$, and $D\in{\cal D}$ the
function $f_{j,D}(\cdot)$ as $f_{j,D}(x)=f_j(x)$ if $x\in D$, and
$f_{j,D}(x)=0$ if $x\notin D$, i.e. $f_{j,D}(\cdot)$ is the
restriction of the function $f_j(\cdot)$ to the set~$D$. Then
$f_{j,D}$, $1\le j\le m$, $D\in{\cal D}$, is an $L_2$-dense class
of functions.}
\medskip
Beside this, Theorem~5.3 helps us to construct
Vapnik--\v{C}ervonenkis classes of sets. Let me also remark that it
follows from the result of this section that the random variables
considered in Lemma~4.5 are not only countably approximable, but
the class of functions $f_{u_1,\dots,u_k,v_1,\dots,v_k}$
appearing in their definition is $L_2$-dense.
\chapter{The proof of Theorems 4.1 and 4.2 on the
supremum of random sums}
In this section we prove Theorem~4.2, an estimate about the tail
distribution of the supremum of an appropriate class of Gaussian
random variables, with the help of a method called the chaining
argument. We also investigate the proof of Theorem~4.1 which can
be considered as a version of Theorem~4.2 about the supremum of
partial sums of independent and identically distributed random
variables. The chaining argument is not a strong enough method
to prove Theorem~4.1, but it enables us to prove a weakened form
of it formulated in Proposition~6.1. This result turned out to
be useful in the proof of Theorem~4.1. It enables us to reduce
the proof of Theorem~4.1 to a simpler statement formulated in
Proposition~6.2. In this section we prove Proposition~6.1,
formulate Proposition~6.2, and reduce the proof of Theorem~4.1
with the help of Proposition~6.1 to this result. The proof of
Proposition~6.2 which demands different arguments is postponed
to the next section. Before presenting the proofs of this section
I briefly describe the chaining argument.\index{chaining argument}
Let us consider a countable class of functions ${\cal F}$ on a
probability space $(X,{\cal X},\mu)$ which is $L_2$-dense with
respect to the probability measure~$\mu$. Let us have either a
class of Gaussian random variables $Z(f)$ with zero
expectation such that $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\cal F}$, or a set of normalized partial sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in{\cal F}$,
where $\xi_1,\dots,\xi_n$ is a sequence of independent $\mu$
distributed random variables with values in the space
$(X,{\cal X})$, and assume that $Ef(\xi_j)=0$ for all
$f\in{\cal F}$. We want to get a good estimate on the
probability $P\left(\sup\limits_{f\in{\cal F}}Z(f)>u\right)$ or
$P\left(\sup\limits_{f\in{\cal F}}S_n(f)>u\right)$ if the class of
functions~${\cal F}$ has some nice properties. The chaining
argument suggests proving such an estimate in the following way.
Let us try to find an appropriate sequence of subsets
${\cal F}_1\subset{\cal F}_2\subset\cdots\subset{\cal F}$ such that
$\bigcup\limits_{N=1}^\infty{\cal F}_N={\cal F}$, where each
${\cal F}_N$ is a set of functions from ${\cal F}$ with relatively
few elements for which
$\inf\limits_{f\in{\cal F}_N}\int (f-\bar f)^2\,d\mu\le\delta_N^2$
for all functions $\bar f\in{\cal F}$ with an appropriately chosen
number $\delta_N$, and let us give a good estimate on the
probability $P\left(\sup\limits_{f\in{\cal F}_N}Z(f)>u_N\right)$ or
$P\left(\sup\limits_{f\in{\cal F}_N}S_n(f)>u_N\right)$
for all $N=1,2,\dots$
with an appropriately chosen monotone increasing sequence $u_N$
such that $\lim\limits_{N\to\infty} u_N=u$.
We can get a relatively good estimate under appropriate conditions
for the class of functions~${\cal F}$ by choosing the classes of
functions ${\cal F}_N$ and numbers $\delta_N$ and $u_N$ in an
appropriate way. We try to bound the difference of the probabilities
$$
P\left(\sup_{f\in{\cal F}_{N+1}}Z(f)>u_{N+1}\right)
-P\left(\sup_{f\in{\cal F}_N}Z(f)>u_N\right)
$$
or of the analogous difference if $Z(f)$ is replaced by $S_n(f)$.
For the sake of completeness define this difference also in the
case $N=1$ with the choice ${\cal F}_0=\emptyset$, when the
second probability in this difference equals zero.
The above mentioned difference of probabilities can be estimated
in a natural way by taking for all functions
$f_{j_{N+1}}\in{\cal F}_{N+1}$ a function
$f_{j_N}\in{\cal F}_N$ which is close to it, more explicitly
$\int (f_{j_{N+1}}-f_{j_N})^2\,d\mu\le\delta_N^2$, and
calculating the probability that the difference of the random
variables corresponding to these two functions is greater than
$u_{N+1}-u_N$. We can estimate these probabilities with the help
of some results which give a relatively good bound on the tail
distribution of $Z(g)$ or $S_n(g)$ if $\int g^2\,d\mu$ is small.
The sum of all such probabilities gives an upper bound for the
above considered difference of probabilities. Then we get an
estimate for the probability
$P\left(\sup\limits_{f\in{\cal F}_N}Z(f)>u_N\right)$
for all $N=1,2,\dots$,
by summing up the above estimate, and we get a bound on the
probability we are interested in by taking the limit
$N\to\infty$. This method is called the chaining argument. It
got this name, because we estimate the contribution of a random
variable corresponding to a function
$f_{j_{N+1}}\in{\cal F}_{N+1}$ to the bound of the probability we
investigate by taking the random variable corresponding to a
function $f_{j_N}\in{\cal F}_N$ close to it, then we choose
another random variable corresponding to a function
$f_{j_{N-1}}\in{\cal F}_{N-1}$ close to this function, and so on
we take a chain of subsequent functions and the random variables
corresponding to them.
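Schematically, the chain described above corresponds to the
telescoping decomposition
$$
Z(f_{j_{N+1}})=Z(f_{j_0})
+\sum_{p=0}^{N}\left(Z(f_{j_{p+1}})-Z(f_{j_p})\right),
\qquad f_{j_p}\in{\cal F}_p \textrm{ for all } 0\le p\le N+1,
$$
where subsequent members of the chain are close to each other in
the $L_2(\mu)$-norm, and the same identity holds with $S_n(\cdot)$
in place of $Z(\cdot)$.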
First we show how this method supplies the proof of Theorem~4.2.
Then we turn to the investigation of Theorem~4.1. In the study of
this problem the above method does not work well, because if two
functions are very close to each other in the $L_2(\mu)$-norm,
then the Bernstein inequality (or an improvement of it) supplies
a much weaker estimate for the difference of the partial sums
corresponding to these two functions than the bound suggested
by the central limit theorem. On the other hand, we shall prove
a weaker version of Theorem~4.1 in Proposition~6.1 with the help
of the chaining argument. This result will also be useful for us.
\medskip\noindent
{\it Proof of Theorem 4.2.}\/\index{estimate on the supremum of
a class of Gaussian random variables} Let us list the elements
of ${\cal F}$ as $\{f_0,f_1,\dots\}={\cal F}$, and choose for all
$p=0,1,2,\dots$ a set of functions
${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset{\cal F}$
with $m_p\le (D+1)\,2^{2pL}\sigma^{-L}$ elements in such a way that
$\inf\limits_{1\le j\le m_p}
\int (f-f_{a(j,p)})^2\,d\mu\le 2^{-4p}\sigma^2$
for all $f\in{\cal F}$, and let the set ${\cal F}_p$ contain also
the function~$f_p$. (We imposed the condition $f_p\in{\cal F}_p$
to guarantee that the relation $f\in{\cal F}_p$ holds with some
index~$p$ for all $f\in{\cal F}$. We could do this by slightly
enlarging the upper bound we can give for the number~$m_p$ by
replacing the factor~$D$ by~$D+1$ in it.) For all indices
$a(j,p)$ of the functions in ${\cal F}_p$, \ $p=1,2,\dots$, define a
predecessor $a(j',p-1)$ from the indices of the set of functions
${\cal F}_{p-1}$ in such a way that the functions $f_{a(j,p)}$ and
$f_{a(j',p-1)}$ satisfy the relation
$\int(f_{a(j,p)}-f_{a(j',p-1)})^2\,d\mu\le2^{-4(p-1)}\sigma^2$.
With the help of the behaviour of the standard normal distribution
function we can write the estimates
\begin{eqnarray*}
P(A(j,p))&&=P\left(|Z(f_{a(j,p)})-Z(f_{a(j',p-1)})|
\ge 2^{-(1+p)}u\right)\\
&&\le 2\exp\left\{-\frac{2^{-2(p+1)}u^2}{2\cdot 2^{-4(p-1)}\sigma^2}
\right\}
=2\exp\left\{-\frac{2^{2p}u^2}{128\sigma^2}\right\}\\
&&\qquad 1\le j\le m_p,\; p=1,2,\dots,
\end{eqnarray*}
and
$$
P(B(j))=P\left(|Z(f_{a(j,0)})|\ge \frac u2\right)\le
\exp\left\{-\frac {u^2}{8\sigma^2}\right\},
\quad 1\le j\le m_0.
$$
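Both estimates rely on the following well-known property of the
standard normal distribution function~$\Phi$: if $\eta$ is a
Gaussian random variable with $E\eta=0$ and $E\eta^2\le V$, then
$$
P(|\eta|\ge x)\le2\left(1-\Phi\left(\frac x{\sqrt V}\right)\right)
\le e^{-x^2/2V}, \quad x\ge0.
$$
It was applied with
$V=E\left(Z(f_{a(j,p)})-Z(f_{a(j',p-1)})\right)^2
=\int\left(f_{a(j,p)}-f_{a(j',p-1)}\right)^2\,d\mu
\le2^{-4(p-1)}\sigma^2$ in the first case (where even an
additional factor~2 was allowed in the bound) and with
$V=\sigma^2$ in the second case.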
The above estimates together with the relation
$\bigcup\limits_{p=0}^\infty
{\cal F}_p={\cal F}$ which implies that \hfill\break
$\{|Z(f)|\ge u\}\subset\bigcup\limits_{p=1}^\infty
\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$ for all $f\in{\cal F}$ yield that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}} |Z(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \\
&&\qquad \le \sum_{p=1}^{\infty} 2(D+1)2^{2pL}
\sigma^{-L} \exp\left\{-\frac{2^{2p}u^2}{128\sigma^2} \right\}
+2(D+1)\sigma^{-L} \exp\left\{-\frac {u^2}{8\sigma^2}\right\}.
\end{eqnarray*}
If $u\ge ML^{1/2}\sigma\log^{1/2}\frac2\sigma$ with $M\ge16$ (and
$L\ge1$ and $0<\sigma\le1$), then
$$
2^{2pL}\sigma^{-L}\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2}
\right\}\le2^{2pL}\sigma^{-L}\left(\frac\sigma
2\right)^{2^{2p}M^2L/256}\le 2^{-pL}\le2^{-p}
$$
for all $p=0,1,\dots$, hence the previous inequality implies that
\begin{eqnarray*}
P\left(\sup_{f\in{\cal F}}|Z(f)|\ge u\right)
&\le& 2(D+1)\sum\limits_{p=0}^\infty 2^{-p}
\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2} \right\} \\
&\le&4(D+1) \exp\left\{-\frac{u^2}{256\sigma^2} \right\}.
\end{eqnarray*}
Theorem 4.2 is proved.
\medskip
With an appropriate choice of the bound of the integrals in the
definition of the sets ${\cal F}_p$ in the proof of Theorem~4.2 and
some additional calculation it can be proved that the coefficient
$\frac1{256}$ in the exponent on the right-hand side of~(\ref{(4.7)})
can be replaced by $\frac{1-\varepsilon}2$ with arbitrarily small
$\varepsilon>0$ if the remaining (universal) constants in this
estimate are chosen sufficiently large.
The proof of Theorem 4.2 was based on a sufficiently good estimate on
the probabilities $P(|Z(f)-Z(g)|>u)$ for pairs of functions
$f,g\in{\cal F}$ and numbers $u>0$. In the case of Theorem~4.1 only a
weaker bound can be given for the corresponding probabilities. There
is no good estimate on the tail distribution of the difference
$S_n(f)-S_n(g)$ if its variance is small. As a consequence, the
chaining argument supplies only a weaker result in this case. This
result, where the tail distribution of the supremum of the normalized
random sums $S_n(f)$ is estimated on a relatively dense subset of the
class of functions $f\in{\cal F}$ in the $L_2(\mu)$ norm will
be given in Proposition~6.1. Another result will be formulated in
Proposition~6.2 whose proof is postponed to the next section. It will
be shown that Theorem~4.1 follows from Propositions~6.1 and~6.2.
Before the formulation of Proposition~6.1 I recall an estimate which
is a simple consequence of Bernstein's inequality. If
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$ is the
normalized sum of
independent, identically distributed random variables, $P(|f(\xi_1)|\le1)=1$,
$Ef(\xi_1)=0$, $Ef(\xi_1)^2\le\sigma^2$, then there exists some
constant $\alpha>0$ such that
\begin{equation}
P(|S_n(f)|>u)\le 2e^{-\alpha u^2/\sigma^2}
\quad \textrm{if}\quad 0<u\le\sqrt n\sigma^2. \label{(6.1)}
\end{equation}
Proposition~6.1 provides an estimate on the probability
$P\left(\sup\limits_{f\in{\cal F}_{\bar\sigma}}|S_n(f)|
\ge\frac u{\bar A}\right)$
with some parameter~$\bar A>1$, where ${\cal F}_{\bar\sigma}$ is
an appropriate finite subset of a set of functions~${\cal F}$
satisfying the conditions of Theorem~4.1. We cannot give a good
estimate for the above probability for all $u>0$, we can do this
only for such numbers~$u$ which are in an appropriate interval
depending on the parameter~$\sigma$ appearing in
condition~(\ref{(4.2)}) of Theorem~4.1 and the
parameter~$\bar A$ we chose in Proposition~6.1. This fact may
explain why we could prove the estimate of Theorem~4.1 only
for such numbers~$u$ which satisfy the condition imposed in
formula~(\ref{(4.4)}). The choice of the set of functions
${\cal F}_{\bar\sigma}\subset{\cal F}$ depends on the number~$u$
appearing in the probability we want to estimate. It is such a
subset of relatively small cardinality of ${\cal F}$ whose
$L_2(\mu)$-norm distance from all elements of ${\cal F}$ is less
than $\bar\sigma=\bar\sigma(u)$ with an appropriately defined
number $\bar\sigma(u)$. With the help of Proposition~6.1 we want
to reduce the proof of Theorem~4.1 to a result formulated in the
subsequent Proposition~6.2. To do this we still need an upper
bound on the cardinality of ${\cal F}_{\bar\sigma}$ and some upper
and lower bounds on the value of $\bar\sigma(u)$ which will be
also contained in Proposition~6.1.
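Let me also indicate how a Bernstein type estimate of the above
form can be deduced. (The value $\alpha=\frac38$ obtained in this
sketch is only an illustration; I do not try to find the best
constant.) Bernstein's inequality states in its classical form
that if $X_1,\dots,X_n$ are independent random variables with
$P(|X_j|\le1)=1$, $EX_j=0$ and $EX_j^2\le\sigma^2$, then
$$
P\left(\sum_{j=1}^nX_j>x\right)
\le\exp\left\{-\frac{x^2}{2(n\sigma^2+x/3)}\right\},
\quad x>0.
$$
Applying it with $X_j=f(\xi_j)$ and $x=\sqrt n\,u$, and observing
that the condition $u\le\sqrt n\sigma^2$ implies
$\sqrt n\,u/3\le n\sigma^2/3$, we get
$$
P(S_n(f)>u)\le\exp\left\{-\frac{nu^2}{2(n\sigma^2+n\sigma^2/3)}
\right\}=\exp\left\{-\frac{3u^2}{8\sigma^2}\right\},
$$
and the same bound for $-S_n(f)$ yields the estimate with
$\alpha=\frac38$.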
\index{estimate on the supremum of a class of partial sums}
\medskip\noindent
{\bf Proposition 6.1.} {\it Let us have a countable, $L_2$-dense
class of functions ${\cal F}$ with parameter $D\ge1$ and
exponent~$L\ge1$ with respect to some probability measure~$\mu$
on a measurable space $(X,{\cal X})$ whose elements satisfy
relations~(\ref{(4.1)}), (\ref{(4.2)})
and~(\ref{(4.3)}) with this probability measure $\mu$ on
$(X,{\cal X})$ and some real number $0<\sigma\le1$. Take
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and define the normalized random
sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{l=1}^nf(\xi_l)$, for all
$f\in {\cal F}$. Let us fix some number $\bar A\ge1$. There exists
some number $M=M(\bar A)$ such that with these parameters~$\bar A$
and~$M=M(\bar A)\ge1$ the following relations hold.
For all numbers $u>0$ such that
$n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M(L\log\frac2\sigma+\log D)$ a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$, and a
collection of functions
${\cal F}_{\bar\sigma}=\{f_1,\dots,f_m\}\subset{\cal F}$ with
$m\le D\bar\sigma^{-L}$ elements can be chosen in such a way that
the sets ${\cal D}_j=\{f\colon\, f\in {\cal F},\int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, satisfy the relation
$\bigcup\limits_{j=1}^m{\cal D}_j={\cal F}$, and the normalized
random sums $S_n(f)$, $f\in{\cal F}_{\bar\sigma}$, $n\ge2$, satisfy
the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}} |S_n(f)|
\ge\frac u{\bar A}\right)
\le 4\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}
\nonumber \\
&&\qquad \textrm{under our condition } n\sigma^2\ge(\tfrac u\sigma)^2
\ge M(L\log\tfrac2\sigma+\log D) \label{(6.2)}
\end{eqnarray}
where $\alpha$ is the constant in formula~(\ref{(6.1)}), and $L$ and
$D$ are the exponent and parameter of the $L_2$-dense class ${\cal F}$. The
inequality $\frac1{16}(\frac u{\bar A\bar\sigma})^2\ge n\bar\sigma^2
\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^2$ also holds with the
number~$\bar\sigma=\bar\sigma(u)$. If the number~$u$ satisfies
also the inequality
\begin{equation}
n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M\left(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}\right) \label{(6.3)}
\end{equation}
with a sufficiently large number $M=M(\bar A)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too. (In formula~(\ref{(6.3)})
we have imposed a stronger condition on the number~$(\frac u\sigma)^2$
than in~(\ref{(6.2)}), since here we wrote $L^{3/2}$ and
$(\log D)^{3/2}$ instead of~ $L$ and~$\log D$, and also
the constant $M=M(\bar A)$ can be chosen larger in it.)}
\medskip
Proposition~6.1 helps to reduce the proof of Theorem~4.1 to the
case when the $L_2$-norm of the functions in the class ${\cal F}$
is bounded by a relatively small number $\bar\sigma$. In more
detail, the proof of Theorem~4.1 can be reduced to a good
estimate on the distribution of the supremum of random variables
$\sup\limits_{f\in {\cal D}_j}|S_n(f-f_j)|$ for all classes ${\cal D}_j$,
$1\le j\le m$, by means of Proposition~6.1. To carry out such a
reduction we also need the inequality $n\bar\sigma^2\ge L\log n+\log D$
(or a slightly weaker version of it). This is the reason why we have
finished Proposition~6.1 with the statement that it holds under
the condition~(\ref{(6.3)}). We also have to know that the
number~$m$ of the classes ${\cal D}_j$ is not too large.
Beside this, we need some estimates on the number
$\bar\sigma=\bar\sigma(u)$
which is the upper bound of the $L_2$-norm of the functions
$f-f_j$, $f\in{\cal D}_j$. To get such bounds for $\bar\sigma$ that
we need in the applications of Proposition~6.1 we introduced a
large parameter~$\bar A$ in the formulation of Proposition~6.1
and imposed a condition with a sufficiently large
number~$M=M(\bar A)$ in formula~(\ref{(6.3)}). This condition
reappears in Theorem~4.1 in the conditions of the
estimate~(\ref{(4.4)}).
Let me remark that one of the inequalities satisfied by the
number $\bar\sigma$ introduced in Proposition~6.1 has the
consequence $u>\textrm{const.}\,\sqrt n\bar\sigma^2$ with an
appropriate constant, and we want to estimate the probability
$P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|>u\right)$ with this
number~$u$ and a class of functions~${\cal F}$ whose $L_2$-norm
is bounded by~$\bar\sigma$. Formula~(\ref{(6.1)}), which will be
applied in the proof of Proposition~6.1, holds under the condition
$u<\sqrt n\sigma^2$, which is an inequality in the opposite
direction. Hence to complete the proof of Theorem~4.1 with the
help of Proposition~6.1 we need a result whose proof demands an
essentially different method. Proposition~6.2 formulated below
is such a result. I shall show that Theorem~4.1 is a consequence
of Propositions~6.1 and~6.2. Proposition~6.1 is proved at the
end of this section, while the proof of Proposition~6.2 is
postponed to the next section.
\medskip\noindent
{\bf Proposition 6.2.}\index{estimate on the supremum of a class
of partial sums} {\it Let us have a probability measure $\mu$
on a measurable space $(X,{\cal X})$ together with a sequence of
independent and $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and a countable, $L_2$-dense class
${\cal F}$ of functions $f=f(x)$ on $(X,{\cal X})$ with some parameter $D\ge1$ and
exponent $L\ge1$ which satisfies conditions~(\ref{(4.1)}),
(\ref{(4.2)}) and~(\ref{(4.3)})
with some $0<\sigma\le1$ such that the inequality
$n\sigma^2>L\log n+\log D$ holds. Then there exists
a threshold index $A_0\ge5$ such that the normalized random sums
$S_n(f)$, $f\in {\cal F}$, introduced in Theorem~4.1 satisfy the
inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge A n^{1/2}\sigma^2\right)\le
e^{-A^{1/2}n\sigma^2/2}\quad \textrm{if } A\ge A_0. \label{(6.4)}
\end{equation}
}
\medskip
I did not try to find optimal parameters in formula~(\ref{(6.4)}).
Even the coefficient $-A^{1/2}$ in the exponent on its right-hand
side could be improved. The result of Proposition~6.2 is similar
to that of Theorem~4.1. Both of them give an estimate on a
probability of the form
$P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|\ge u\right)$ with
some class of functions~${\cal F}$. The essential difference
between them is that in Theorem~4.1 this probability is considered
for $u\le n^{1/2}\sigma^2$ while in Proposition~6.2 the case
$u=A n^{1/2}\sigma^2$ with $A\ge A_0$ is taken, where $A_0$ is a
sufficiently large positive number. Let us observe that in this
case no good Gaussian type estimate can be given for the
probabilities $P(S_n(f)\ge u)$, $f\in{\cal F}$. In this case
Bernstein's inequality yields a much weaker bound for the
probability $P(S_n(f)>An^{1/2}\sigma^2)=
P\left(\sum\limits_{l=1}^nf(\xi_l)>An\sigma^2\right)$ than the
Gaussian type estimate suggested by the central limit theorem.
\medskip\noindent
{\it Proof of Proposition 6.1.}\/ We follow the chaining argument
of the proof of Theorem~4.2. Choose sets of functions
${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset{\cal F}$ with
$m_p\le D2^{2pL}\sigma^{-L}$ elements in such a way that
$\inf\limits_{1\le j\le m_p}\int(f-f_{a(j,p)})^2\,d\mu
\le2^{-4p}\sigma^2$ for all $f\in{\cal F}$, define predecessors
$a(j',p-1)$ as in that proof, and let $A(j,p)$ and $B(s)$ denote
the events defined there with the normalized sums $S_n(\cdot)$ in
place of the Gaussian random variables $Z(\cdot)$ and with the
levels $2^{-(1+p)}\frac u{\bar A}$ and $\frac u{2\bar A}$; their
probabilities are estimated by means of the Bernstein type
bound~(\ref{(6.1)}). Choose the largest number~$R$ for which
$$
n\sigma^2
\ge2^{6R}\left(\frac{u}{16\bar A\sigma}\right)^2,
$$
define $\bar\sigma^2=2^{-4R}\sigma^2$ and
${\cal F}_{\bar\sigma}={\cal F}_R$.
(As $n\sigma^2\ge\left(\frac u\sigma\right)^2$ and $\bar A\ge1$
by our conditions, there exists such a number $R\ge1$. The
number~$R$ was chosen as the largest number~$p$ for which the
second relation of formula~(\ref{(6.7)}) holds.) Then the
cardinality~$m$ of the set ${\cal F}_{\bar\sigma}$ equals
$m_R\le D2^{2RL}\sigma^{-L}
=D\bar\sigma^{-L}$, and the sets ${\cal D}_j$ are
${\cal D}_j=\{f\colon\, f\in{\cal F},\int (f_{a(j,R)}-f)^2\,d\mu\le
2^{-4R}\sigma^2\}$, $1\le j\le m_R$, hence $\bigcup\limits_{j=1}^m
{\cal D}_j={\cal F}$. Beside this, with our choice of the number $R$
inequalities~(\ref{(6.7)}) and~(\ref{(6.8)}) can be applied
for $1\le p\le R$.
Hence the definition of the predecessor of an index $(j,p)$ implies
that
$\left\{\omega\colon\,\sup\limits_{f\in{\cal F}_{\bar\sigma}}
|S_n(f)(\omega)|\ge
\frac u{\bar A}\right\}\subset
\bigcup\limits_{p=1}^R\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$, and
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}} |S_n(f)|\ge
\frac u{\bar A}\right)
\le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le
\sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))+\sum_{s=1}^{m_0}P(B(s)) \\
&&\qquad\le\sum_{p=1}^{\infty} 2D\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}
+2D\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^2\right\}.
\end{eqnarray*}
If the relation $(\frac u\sigma)^2\ge M(L\log\frac2\sigma+\log D)$
holds with a sufficiently large constant~$M$ (depending on $\bar A$),
and $\sigma\le1$, then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp
\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^2 \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^2\right\}
\le\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}.
$$
Hence the previous estimate implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}
|S_n(f)|\ge \frac u{\bar A}\right)
\le\sum_{p=1}^{\infty}2\cdot 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^2
\right\}\\
&&\qquad +2\exp\left\{-\alpha
\left(\frac u{10\bar A \sigma}\right)^2\right\}
\le 4 \exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^2\right\},
\end{eqnarray*}
and relation~(\ref{(6.2)}) holds.
As $\sigma^2=2^{4R}\bar\sigma^2$ the inequality
\begin{eqnarray*}
2^{-4R}\cdot\frac{2^{6R}}{256}\left(\frac{u}{\bar A\sigma}\right)^2
&\le& n\bar\sigma^2=2^{-4R} n\sigma^2\\
&\le& 2^{-4R}\cdot\frac{2^{6(R+1)}}{256}
\left(\frac{u}{\bar A\sigma}\right)^2
=\frac14\cdot 2^{-2R}\left(\frac{u}{\bar A\bar \sigma}\right)^2
\end{eqnarray*}
holds, and this implies (together with the relation $R\ge1$) that
$$
\frac1{64}\left(\frac u{\bar A\sigma}\right)^2\le n\bar\sigma^2
\le\frac1{16}\left(\frac{u}{\bar A \bar\sigma}\right)^2,
$$
as we have claimed. It remains to show that under the
condition~(\ref{(6.3)}) the inequality $n\bar\sigma^2\ge L\log n+\log D$ holds.
This inequality clearly holds under the conditions of Proposition~6.1
if $\sigma\le n^{-1/3}$, since in this case $\log\frac2\sigma\ge
\frac{\log n}3$, and
$n\bar\sigma^2\ge\frac1{64}(\frac u {\bar A\sigma})^2
\ge\frac1{64\bar A^2} M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})\ge
\frac1{192\bar A^2} M(L^{3/2}\log n+(\log D)^{3/2})\ge L\log n+\log D$
if $M\ge M_0(\bar A)$ with a sufficiently large number $M_0(\bar A)$.
If $\sigma\ge n^{-1/3}$, we can exploit that the inequality
$2^{6R}\left(\frac u{\bar A\sigma}\right)^2 \le256n\sigma^2$ holds
because of the definition of the number~$R$. It can be rewritten as
$2^{-4R}\ge 2^{-16/3}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^2}
{n\sigma^2}\right]^{2/3}$. Hence
$n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1/3}\left(\frac u\sigma\right)^{4/3}$. As
$\log\frac2\sigma\ge\log2>\frac12$ the inequalities
$n\sigma^2\ge n^{1/3}$ and $(\frac u\sigma)^2\ge
M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold. They yield that
\begin{eqnarray*}
n\bar\sigma^2&\ge&\frac{\bar A^{-4/3}}{50} (n\sigma^2)^{1/3}\left(\frac
u\sigma\right)^{4/3}
\ge\frac{\bar A^{-4/3}}{50}n^{1/9}\left(\frac M2\right)^{2/3}
(L^{3/2}+(\log D)^{3/2})^{2/3}\\
&\ge&\frac{M^{2/3}n^{1/9}(L+\log D)}{100\bar A^{4/3}}\ge L\log n+\log D
\end{eqnarray*}
if $M=M(\bar A)$ is chosen sufficiently large.
\chapter{The completion of the proof of Theorem 4.1}
This section contains the proof of Proposition 6.2 with the help of
a symmetrization argument which completes the proof of Theorem~4.1.
By symmetrization argument I mean the reduction of the investigation
of sums of the form $\sum f(\xi_j)$ to sums of the form
$\sum\varepsilon_jf(\xi_j)$, where $\varepsilon_j$ are
independent random variables,
independent also of the random variables $\xi_j$, and
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$.
First a symmetrization lemma is proved, and then with the help of
this result and a conditioning argument the proof of Proposition~6.2
is reduced to the estimation of a probability which can be bounded by
means of the Hoeff\-ding inequality formulated in Theorem 3.4. Such
an approach makes it possible to prove Proposition~6.2.
First I formulate the symmetrization lemma we shall apply.
\medskip\noindent
{\bf Lemma 7.1 (Symmetrization Lemma).}\index{symmetrization lemma}
{\it Let $Z_n$ and $\bar
Z_n$, $n=1,2,\dots$, be two sequences of random variables
independent of each other, and let the random variables $\bar Z_n$,
$n=1,2,\dots$, satisfy the inequality
\begin{equation}
P(|\bar Z_n|\le\alpha)\ge\beta\quad \textrm{for all } n=1,2,\dots
\label{(7.1)}
\end{equation}
with some numbers $\alpha>0$ and $\beta>0$. Then
$$
P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
\le\frac1\beta P\left(\sup\limits_{1\le
n<\infty}|Z_n-\bar Z_n|>u\right)\quad \textrm{for all } u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma 7.1.}\/ Put $\tau=\min\{n\colon\, |Z_n|>u+\alpha\}$
if there exists such an index $n$, and $\tau=0$ otherwise. Then the
event $\{\tau=n\}$ is independent of the sequence of random variables
$\bar Z_1,\bar Z_2,\dots$ for all $n=1,2,\dots$, and because of this
independence
$$
P(\{\tau=n\})\le\frac1\beta P(\{\tau=n\}\cap\{|\bar Z_n|\le\alpha\})
\le \frac1\beta P(\{\tau=n\}\cap\{|Z_n-\bar Z_n|>u\})
$$
for all $n=1,2,\dots$. Hence
\begin{eqnarray*}
&&P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
=\sum_{l=1}^\infty P(\tau=l)\\
&&\qquad \le \frac1\beta
\sum_{l=1}^\infty P(\{\tau=l\}\cap\{|Z_l-\bar Z_l|>u\}) \\
&&\qquad \le \frac1\beta \sum_{l=1}^\infty
P(\{\tau=l\}\cap\{\sup_{1\le n<\infty}|Z_n-\bar Z_n|>u\}) \\
&&\qquad \le\frac1\beta P\left(\sup\limits_{1\le n<\infty}
|Z_n-\bar Z_n|>u\right).
\end{eqnarray*}
Lemma 7.1 is proved.
\medskip
We shall apply the following Lemma~7.2 which is a consequence of the
symmetrization lemma~7.1.
\medskip\noindent
{\bf Lemma 7.2.} {\it Let us fix a countable class of functions
${\cal F}$ on a measurable space $(X,{\cal X})$ together with a real
number $0<\sigma<1$. Consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(X,{\cal X})$ such
that $Ef(\xi_1)=0$, $Ef^2(\xi_1)\le\sigma^2$ for all $f\in{\cal F}$
together with another sequence $\varepsilon_1,\dots,\varepsilon_n$
of independent
random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, independent also of the random sequence
$\xi_1,\dots,\xi_n$. Then
\begin{eqnarray}
&&P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\cal F}}\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A
n^{1/2}\sigma^{2}\right) \nonumber \\
&&\qquad \le 4P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\quad\textrm{if } A\ge \frac{3\sqrt2}{\sqrt n\sigma}.
\label{(7.2)}
\end{eqnarray}
}
\medskip\noindent
{\it Proof of Lemma 7.2.}\/ Let us construct an independent copy
$\bar\xi_1,\dots,\bar\xi_n$ of the sequence $\xi_1,\dots,\xi_n$ in
such a way that all three sequences $\xi_1,\dots,\xi_n$, \
$\bar\xi_1,\dots,\bar\xi_n$ and $\varepsilon_1,\dots,\varepsilon_n$
are independent.
Define the random variables
$$
S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n f(\xi_j) \quad \textrm{and}\quad
\bar S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n f(\bar\xi_j)
$$
for all $f\in{\cal F}$. The inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|> A\sqrt n\sigma^2\right)\le
2P\left(\sup_{f\in{\cal F}}|S_n(f)-\bar S_n(f)|> \frac23 A\sqrt
n\sigma^2\right). \label{(7.3)}
\end{equation}
follows from Lemma~7.1 if it is applied for the countable set of
random variables $Z_n(f)=S_n(f)$ and $\bar Z_n(f)=\bar S_n(f)$,
$f\in{\cal F}$, and the numbers $u=\frac23 A\sqrt n\sigma^2$ and
$\alpha=\frac13A\sqrt n\sigma^2$, since the random fields $S_n(f)$
and $\bar S_n(f)$ are independent, and
$P(|\bar S_n(f)|\le\alpha)>\frac12$ for all $f\in{\cal F}$. Indeed,
$\alpha=\frac13 A\sqrt n\sigma^2\ge\sqrt2\sigma$, $E\bar S_n(f)^2
\le\sigma^2$, thus Chebyshev's inequality implies that
$P(|\bar S_n(f)|\le\alpha)\ge P(|\bar S_n(f)|\le\sqrt2\sigma)
\ge\frac12$ for all $f\in{\cal F}$.
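Spelled out, the last step is the Chebyshev bound
$$
P(|\bar S_n(f)|>\sqrt2\sigma)
\le\frac{E\bar S_n(f)^2}{2\sigma^2}\le\frac{\sigma^2}{2\sigma^2}
=\frac12,
$$
together with the observation that the condition
$A\ge\frac{3\sqrt2}{\sqrt n\sigma}$ in formula~(\ref{(7.2)}) is
exactly the inequality $\alpha=\frac13A\sqrt n\sigma^2\ge\sqrt2\sigma$.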
Let us observe that the random field
\begin{equation}
S_n(f)-\bar S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\cal F}, \label{(7.4)}
\end{equation}
and its randomization
\begin{equation}
\frac1{\sqrt n}\sum_{j=1}^n \varepsilon_j \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\cal F}, \label{($7.4'$)}
\end{equation}
have the same distribution. Indeed, even the conditional distribution
of~(\ref{($7.4'$)}) under the condition that the values of the
$\varepsilon_j$-s are
prescribed agrees with the distribution of~(\ref{(7.4)}) for
all possible values of the $\varepsilon_j$-s. This follows from
the observation that the distribution of the random
field~(\ref{(7.4)}) does not change if we exchange
the random variables $\xi_j$ and $\bar\xi_j$ for those indices $j$
for which $\varepsilon_j=-1$ and do not change them for those
indices~$j$ for which $\varepsilon_j=1$. On the other hand, the
distribution of the random
field obtained in such a way agrees with the conditional distribution
of the random field defined in~(\ref{($7.4'$)}) under the
condition that the values of the random variables $\varepsilon_j$
are prescribed.
The above relation together with formula~(\ref{(7.3)}) implies that
\begin{eqnarray*}
&&P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}\left|\sum_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right)\\
&&\qquad \le 2P\left(\frac1{\sqrt n}
\sup_{f\in{\cal F}}\left|\sum_{j=1}^n
\varepsilon_j\left[f(\xi_j)-f(\bar\xi_j)\right]\right| \ge\frac23 A
n^{1/2}\sigma^{2}\right) \\
&&\qquad\le 2P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3 n^{1/2}\sigma^{2}\right) \\
&&\qquad\qquad+ 2P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}\left|
\sum_{j=1}^n \varepsilon_jf(\bar\xi_j)\right|
\ge\frac A3n^{1/2}\sigma^{2}\right) \\
&&\qquad=4P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3n^{1/2}\sigma^{2}\right).
\end{eqnarray*}
Lemma~7.2 is proved.
\medskip
First I try to explain briefly the method of proof of
Proposition~6.2. A probability of the form
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
has to be estimated. Lemma~7.2 enables us to replace this problem
by the estimation of the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}\left| \sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right|>\frac u3\right)$ with some
independent random variables $\varepsilon_j$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $j=1,\dots,n$,
which are also independent of the random variables $\xi_j$. We
shall bound the conditional probability of the event appearing in
this modified problem under the condition that each random
variable $\xi_j$ has a prescribed value. This can be done
with the help of Hoeffding's inequality formulated in Theorem~3.4
and the $L_2$-density property of the class of functions ${\cal F}$
we consider. In this way we hope to get a sharp estimate,
similar to the result we got in the study of the Gaussian
counterpart of this problem, because Hoeffding's inequality
always yields a Gaussian type upper bound for the tail
distribution of the random sum we are studying.
Nevertheless, a problem arises when we try to apply such an
approach. To get a good estimate on the conditional tail distribution
of the supremum of the random sums we are studying with the help of
Hoeffding's inequality we need a good estimate on the supremum of
the conditional variances of the random sums we are studying, i.e.
on the tail distribution of
$\sup\limits_{f\in{\cal F}}\frac1n\sum\limits_{j=1}^n f^2(\xi_j)$.
This problem is similar to the original one, and it is not simpler.
But a more detailed study shows that our approach to get a good
estimate with the help of Hoeffding's inequality works. In
comparing our original problem with the new, complementary problem
we have to understand at which level we need a good estimate on the
tail distribution of the supremum in the complementary problem to
get a good tail distribution estimate at level~$u$ in the original
problem. A detailed study shows that to bound the probability in
the original problem with parameter~$u$ we have to estimate the
probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}'}\left|
\sum\limits_{j=1}^n f(\xi_j)\right|>u^{1+\alpha}\right)$ with
some new nice, appropriately defined $L_2$-dense class of
bounded functions ${\cal F}'$ and some
number $\alpha>0$. We shall exploit that the number~$u$ is
replaced by a larger number $u^{1+\alpha}$ in the new problem. Let
us also observe that if the sum of bounded random variables is
considered, then for very large numbers~$u$ the probability we
investigate equals zero. On the basis of these observations an
appropriate backward induction procedure can be worked out. In its
$n$-th step we give a good upper bound on the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}\left|
\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
if $u\ge T_n$ with an appropriately chosen number~$T_n$, and try
to diminish the number~$T_n$ in each step of this induction
procedure. We can prove Proposition~6.2 as a consequence of the
result we get by means of this backward induction procedure. To
work out the details we introduce the following notion.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of normalized
random sums.}
\index{good tail behaviour for a class of normalized random sums}
{\it Let us have some measurable space $(X,{\cal X})$ and a
probability measure $\mu$ on it together with some integer $n\ge2$
and real number $\sigma>0$. Consider some class ${\cal F}$ of
functions $f(x)$ on the space $(X,{\cal X})$, and take a sequence of
independent and $\mu$ distributed random variables $\xi_1,\dots,\xi_n$
with values in the space $(X,{\cal X})$. Define the normalized random
sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in {\cal F}$.
Given some real number $T>0$ we say that the set of normalized
random sums $S_n(f)$, $f\in{\cal F}$,
has a good tail behaviour at level~$T$ (with parameters $n$ and
$\sigma^2$ which will be fixed in the sequel) if the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge A \sqrt n\sigma^2\right) \le
\exp\left\{-A^{1/2}n\sigma^2 \right\} \label{(7.5)}
\end{equation}
holds for all numbers $A>T$.}
\medskip
Now I formulate Proposition 7.3 and show that Proposition 6.2
follows from it.
\medskip\noindent
{\bf Proposition 7.3.} {\it Let us fix a positive integer~$n\ge2$,
a real number $0<\sigma\le1$ and a probability measure $\mu$ on a
measurable space $(X,{\cal X})$ together with some numbers $L\ge1$
and $D\ge1$ such that $n\sigma^2\ge L\log n+\log D$. Let us
consider those countable $L_2$-dense classes ${\cal F}$ of functions
$f=f(x)$ on the space $(X,{\cal X})$ with exponent~$L$ and
parameter~$D$ for which all functions $f\in{\cal F}$ satisfy the
conditions
$\sup\limits_{x\in X}|f(x)|\le\frac14$, $\int f(x)\mu(\,dx)=0$
and $\int f^2(x)\mu(\,dx)\le\sigma^2$.
Let a number $T>1$ be such that for all classes of functions
${\cal F}$ which satisfy the above conditions the set of normalized
random sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$,
$f\in{\cal F}$, defined with the help of a sequence of independent
$\mu$ distributed random variables $\xi_1,\dots,\xi_n$ has a good
tail behaviour at level~$T^{4/3}$. There is a universal
constant~$\bar A_0$ such that if $T\ge\bar A_0$, then the set of the
above defined normalized sums $S_n(f)$, $f\in{\cal F}$, has a good
tail behaviour for all such classes of functions ${\cal F}$ not
only at level $T^{4/3}$ but also at level~$T$.}
\medskip
Proposition~6.2 simply follows from Proposition~7.3. To show this
let us first observe that a class of normalized random sums
$S_n(f)$, $f\in{\cal F}$, has a good tail behaviour at level
$T_0=\frac1{4\sigma^2}$ if this class of functions ${\cal F}$
satisfies the conditions of Proposition~7.3. Indeed, in this
case
$P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|\ge A\sqrt n\sigma^2\right)
\le P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|>\frac{\sqrt n}4\right)=0$
for all $A>T_0$, since $A\sqrt n\sigma^2>T_0\sqrt n\sigma^2
=\frac{\sqrt n}4$, while $|S_n(f)|\le\sqrt n\sup\limits_{x\in X}|f(x)|
\le\frac{\sqrt n}4$ for all $f\in{\cal F}$. Then the repetitive
application of Proposition~7.3 yields
that a class of random sums $S_n(f)$, $f\in{\cal F}$, has a good tail
behaviour at all levels $T\ge T_0^{(3/4)^j}$ with an index~$j$ such
that $T_0^{(3/4)^j}\ge\bar A_0$ if the class of functions ${\cal F}$
satisfies the conditions of Proposition~7.3. Hence it has a good
tail behaviour for $T=\bar A_0^{4/3}$ with the number~$\bar A_0$
appearing in Proposition~7.3. If a class of functions
$f\in{\cal F}$ satisfies the conditions of Proposition~6.2, then
the class of functions $\bar{\cal F}=\left\{\bar f=\frac f4\colon\,
f\in{\cal F}\right\}$ satisfies the conditions of Proposition~7.3,
with the same parameters~$\sigma$, $L$ and~$D$. (Actually some of
the inequalities that must hold for the elements of a class of
functions~${\cal F}$ satisfying the conditions of Proposition~7.3
are valid with smaller parameters. But we did not change these
parameters to satisfy also the condition
$n\sigma^2\ge L\log n+\log D$.) Hence the class of functions
$S_n(\bar f)$, $\bar f\in \bar{\cal F}$, has a good tail
behaviour at level $T=\bar A_0^{4/3}$. This implies that the
original class of functions ${\cal F}$ satisfies
formula~(\ref{(6.4)}) in Proposition~6.2, and this is what we
had to show.\index{estimate on the supremum of a class of
partial sums}
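To illustrate the speed of the backward induction applied in this
argument, take some hypothetical values of the parameters: if, say,
$\sigma^2=10^{-2}$ and $\bar A_0=10$, then $T_0=\frac1{4\sigma^2}=25$,
and the first application of Proposition~7.3 already yields a good
tail behaviour at all levels $T\ge T_0^{3/4}\approx11.2$; the
procedure stops here, since $T_0^{(3/4)^2}\approx6.1<\bar A_0$, and
the level $T=\bar A_0^{4/3}\approx21.5$ is covered by the first step.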
\medskip\noindent
{\it Proof of Proposition 7.3.}\/ Fix a class of functions
${\cal F}$ which satisfies the conditions of Proposition~7.3
together with two independent sequences $\xi_1,\dots,\xi_n$ and
$\varepsilon_1,\dots,\varepsilon_n$ of independent random variables,
where $\xi_j$ is $\mu$-distributed,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le j\le n$,
and investigate the conditional probability
$$
P(f,A|\xi_1,\dots,\xi_n)=P\left(\left.\frac1{\sqrt n}\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge
\frac A6\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all functions $f\in{\cal F}$, $A> T$ and values
$(\xi_1,\dots,\xi_n)$ in the condition. By the Hoeffding inequality
formulated in Theorem~3.4
\begin{equation}
P(f,A|\xi_1,\dots,\xi_n)\le 2\exp\left\{-\frac{\frac 1{36}
A^2 n\sigma^4}{2\bar S^2(f,\xi_1,\dots,\xi_n)}\right\} \label{(7.6)}
\end{equation}
with
$$
\bar S^2(f,x_1,\dots,x_n)=\frac1n\sum_{j=1}^n f^2(x_j),
\quad f\in {\cal F}.
$$
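To see how relation~(\ref{(7.6)}) follows from Theorem~3.4, observe
that the event in the definition of $P(f,A|\xi_1,\dots,\xi_n)$ can be
rewritten as $\left|\sum\limits_{j=1}^n\varepsilon_jf(\xi_j)\right|
\ge\frac A6 n\sigma^2$, and an application of Hoeffding's inequality
with the (conditionally fixed) coefficients $a_j=f(x_j)$ yields
$$
P\left(\left|\sum_{j=1}^n\varepsilon_ja_j\right|\ge
\frac A6 n\sigma^2\right)
\le2\exp\left\{-\frac{\left(\frac A6 n\sigma^2\right)^2}
{2\sum\limits_{j=1}^n a_j^2}\right\}
=2\exp\left\{-\frac{\frac1{36}A^2n\sigma^4}
{2\bar S^2(f,x_1,\dots,x_n)}\right\},
$$
since $\sum\limits_{j=1}^n a_j^2=n\bar S^2(f,x_1,\dots,x_n)$.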
Let us introduce the set
\begin{equation}
H=H(A)=\left\{(x_1,\dots,x_n)\colon\, \sup_{f\in{\cal F}}
\bar S^2(f,x_1,\dots,x_n)
\ge \left(1+A^{4/3}\right)\sigma^2\right\}. \label{(7.7)}
\end{equation}
I claim that
\begin{equation}
P((\xi_1,\dots,\xi_n)\in H)\le e^{-A^{2/3} n\sigma^2}\quad\textrm{ if }
A>T. \label{(7.8)}
\end{equation}
(The set $H$ is the small exceptional set of those points
$(x_1,\dots,x_n)$ for which we cannot give a good estimate for
$P(f,A|\xi_1(\omega),\dots,\xi_n(\omega))$ with
$\xi_1(\omega)=x_1$,\dots, $\xi_n(\omega)=x_n$ for some $f\in{\cal F}$.)
To prove relation~(\ref{(7.8)}) let us consider the functions
$\bar f=\bar f(f)$, $\bar f(x)=f^2(x)-\int f^2(x)\mu(\,dx)$, and
introduce the
class of functions $\bar{\cal F}=\{\bar f(f)\colon\, f\in{\cal F}\}$.
Let us show that the class of functions $\bar{\cal F}$ satisfies the
conditions of Proposition~7.3, hence the estimate~(\ref{(7.5)}) holds
for the class of functions $\bar{\cal F}$ if $A> T^{4/3}$.
The relation $\int \bar f(x)\mu(\,dx)=0$ clearly holds. The condition
$\sup| \bar f(x)|\le\frac 18<\frac14$ also holds if $\sup |f(x)|\le
\frac14$, and $\int \bar f^2(x)\mu(\,dx)\le \int f^4(x)\mu(\,dx)\le
\frac 1{16}\int f^2(x)\,\mu(\,dx)\le\frac{\sigma^2}{16}<\sigma^2$
if $f\in{\cal F}$. It remained to show that $\bar{\cal F}$ is an
$L_2$-dense class with exponent $L$ and parameter $D$. For this goal
we need a good estimate on $\int(\bar f(x)-\bar g(x))^2\rho(\,dx)$,
where $\bar f,\,\bar g\in\bar{\cal F}$, and $\rho$ is an arbitrary
probability measure.
Observe that
\begin{eqnarray*}
&&\int (\bar f(x)-\bar g(x))^2\rho(\,dx) \\
&&\qquad \le 2\int(f^2(x)-g^2(x))^2\rho(\,dx)+
2\int(f^2(x)-g^2(x))^2\mu(\,dx) \\
&&\qquad \le2 \left(\sup\limits_{x\in X} (|f(x)|+|g(x)|)\right)^2
\left(\int (f(x)-g(x))^2(\rho(\,dx)+\mu(\,dx))\right) \\
&&\qquad \le \int (f(x)-g(x))^2\bar\rho(\,dx)
\end{eqnarray*}
for all $f, g\in{\cal F}$, $\bar f=\bar
f(f)$, $\bar g=\bar g(g)$ and probability measure $\rho$, where
$\bar\rho=\frac{\rho+\mu}2$. This means that if $\{f_1,\dots,f_m\}$
is an $\varepsilon$-dense subset of ${\cal F}$ in the space
$L_2(X,{\cal X},\bar\rho)$, then
$\{\bar f_1,\dots,\bar f_m\}$ is an $\varepsilon$-dense
subset of $\bar{\cal F}$ in the space $L_2(X,{\cal X},\rho)$, and
not only ${\cal F}$, but also $\bar{\cal F}$ is an $L_2$-dense class
with exponent $L$ and parameter $D$.
Because of the conditions of Proposition 7.3 we can write
for the number $A^{4/3}> T^{4/3}$ and the class of functions
$\bar{\cal F}$ that
\begin{eqnarray*}
&&P((\xi_1,\dots,\xi_n)\in H) \\
&&=P\left(\sup_{f\in{\cal F}}
\left(\frac1n \sum_{j=1}^n
\bar f(f)(\xi_j) +\frac1n \sum_{j=1}^n E f^2(\xi_j)\right)
\ge \left(1+A^{4/3}\right)\sigma^2\right)\\
&&\le P\left(\sup_{\bar f\in\bar {\cal F}}
\frac1{\sqrt n} \sum_{j=1}^n
\bar f(\xi_j) \ge A^{4/3}n^{1/2}\sigma^2\right)
\le e^{-A^{2/3} n\sigma^2},
\end{eqnarray*}
i.e. relation~(\ref{(7.8)}) holds.
By formula~(\ref{(7.6)}) and the definition of the set $H$
given in~(\ref{(7.7)}) the estimate
\begin{equation}
P(f,A|\xi_1,\dots,\xi_n)\le 2e^{- A^{2/3} n\sigma^2/144} \quad
\textrm{if }(\xi_1,\dots,\xi_n)\notin H
\label{(7.9)}
\end{equation}
holds for all $f\in {\cal F}$ and $A>T\ge1$. (Here we used the
estimate $1+A^{4/3}\le2A^{4/3}$.) Let us introduce the conditional
probability
$$
P({\cal F},A|\xi_1,\dots,\xi_n)=
P\left(\left.\sup_{f\in {\cal F}} \frac1{\sqrt n}\left|
\sum\limits_{j=1}^n \varepsilon_jf(\xi_j)\right| \ge
\frac A3\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all $(\xi_1,\dots,\xi_n)$ and $A>T$. We shall
estimate this conditional probability with the help of
relation~(\ref{(7.9)}) if $(\xi_1,\dots,\xi_n) \notin H$.
Given a vector $x^{(n)}=(x_1,\dots,x_n)\in X^n$, let us introduce
the probability measure $\nu=\nu(x_1,\dots,x_n)=\nu(x^{(n)})$ on
$(X,{\cal X})$ which is concentrated in the coordinates of the
vector $x^{(n)}=(x_1,\dots,x_n)$, and $\nu(\{x_j\})=\frac1n$ for all
points~$x_j$, $j=1,\dots,n$. If $\int f^2(u)\nu(\,du)\le\delta^2$
for a function $f$, then, by the triangle and Schwarz inequalities,
$\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)\right|
\le n^{1/2}\int|f(u)|\nu(\,du)\le n^{1/2}\delta$. As a
consequence, we can write that
\begin{eqnarray}
&&\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)-
\frac1{\sqrt n}\sum\limits_{j=1}^n \varepsilon_jg(x_j)\right|
\le\frac A6 \sqrt n\sigma^2 \nonumber \\
&&\qquad\textrm{if }
\int (f(u)-g(u))^2\,d\nu(u)\le\left(\frac {A\sigma^2}6\right)^2.
\label{(7.10)}
\end{eqnarray}
Let us list the elements of the (countable) set ${\cal F}$ as
${\cal F}=\{f_1,f_2,\dots\}$, fix the number $\delta=\frac{A\sigma^2}6$,
and choose for all vectors $x^{(n)}=(x_1,\dots,x_n)\in X^n$ a
sequence of indices $p_1(x^{(n)}),\dots,p_m(x^{(n)})$ taking
positive integer values with
$m=\max(1, D\delta^{-L})=\max(1,D(\frac6{A\sigma^2})^L)$
elements in such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\,d\nu(x^{(n)})(u)\le\delta^2$
for all $f\in{\cal F}$ and $x^{(n)}\in X^n$ with the above defined
measure $\nu(x^{(n)})$ on the space $(X,{\cal X})$. This is possible
because of the $L_2$-dense property of the class of
functions~${\cal F}$. (This is the point where the $L_2$-dense
property of the class of functions ${\cal F}$ is exploited in its
full strength.) In a complete proof of Proposition~7.3 we still have
to show that we can choose the indices $p_j(x^{(n)})$,
$1\le j\le m$, as measurable functions of their argument~$x^{(n)}$
on the space $(X^n,{\cal X}^n)$. We shall show this in Lemma~7.4 at
the end of the proof.
Put $\xi^{(n)}(\omega)=(\xi_1(\omega),\dots,\xi_n(\omega))$. Because
of relation~(\ref{(7.10)}), the choice of the number $\delta$ and
the property of the functions $f_{p_l(x^{(n)})}(\cdot)$ we have
\begin{eqnarray}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
\frac1{\sqrt n}\left|\sum\limits_{j=1}^n
\varepsilon_j(\omega)f(\xi_j(\omega))\right|
\ge\frac A3\sqrt n\sigma^2\right\} \label{(7.11)} \\
&&\qquad \subset\bigcup_{l=1}^m\left\{\omega\colon\,\frac1{\sqrt n}
\left|\sum\limits_{j=1}^n \varepsilon_j(\omega)f_{p_l(\xi^{(n)}(\omega))}
(\xi_j(\omega))\right|\ge\frac A6\sqrt n\sigma^2\right\}. \nonumber
\end{eqnarray}
We can estimate the conditional probability at the left-hand side
of~(\ref{(7.11)}) under the condition that the vector
$(\xi_1(\omega),\dots,\xi_n(\omega))$ takes a prescribed value. We
get with the help of~(\ref{(7.11)}) and inequality~(\ref{(7.9)}) that
\begin{eqnarray}
P({\cal F},A|\xi_1,\dots,\xi_n)
&&\le\sum\limits_{l=1}^m P(f_{p_l(\xi^{(n)})},A|\xi_1,\dots,\xi_n)
\nonumber\\
&&\le 2\max\left(1,D\left(\frac 6{A\sigma^2}\right)^L\right)
e^{-A^{2/3} n\sigma^2/144} \nonumber \\
&&\qquad \textrm{if }(\xi_1,\dots,\xi_n)\notin H \textrm{ and } A> T.
\label{(7.12)}
\end{eqnarray}
If $A\ge\bar A_0$ with a sufficiently large constant~$\bar A_0$,
then this inequality together with Lemma~7.2 and the
estimate~(\ref{(7.8)}) imply that
\begin{eqnarray}
&&P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right) \nonumber \\
&&\qquad \le 4P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\cal F}}\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\label{(7.13)} \\
&&\qquad \le\max\left(4, 8D\left(\frac6{A\sigma^2}\right)^L
\right)e^{-A^{2/3}n\sigma^2/144}
+4e^{-A^{2/3}n\sigma^2} \quad \textrm{if } A>T. \nonumber
\end{eqnarray}
(We may apply Lemma~7.2 if $A\ge A_0$ with a sufficiently large~$A_0$,
since $n\sigma^2\ge L\log n+\log D\ge\log 2$, hence
$\sqrt n\sigma\ge\sqrt{\log 2}$, and the condition
$A\ge \frac{3\sqrt2}{\sqrt n\sigma}$ demanded in relation~(\ref{(7.2)})
is satisfied.)
By the conditions of Proposition~7.3 the inequalities
$n\sigma^2\ge L\log n +\log D$ hold with some $L\ge1$, $D\ge1$
and $n\ge2$. This implies that $n\sigma^2\ge L\log2\ge\frac12$,
$(\frac6{A\sigma^2})^L
\le(\frac n{2n\sigma^2})^L\le n^L=e^{L\log n}
\le e^{n\sigma^2}$ if $A\ge\bar A_0$ with some sufficiently large
constant $\bar A_0>0$, and $2D=e^{\log2+\log D}\le e^{3n\sigma^2}$.
Hence the first term at the right-hand side of~(\ref{(7.13)}) can be
bounded by
$$
\max\left(4,8D\left(\frac6{A\sigma^2}\right)^L\right)
e^{-A^{2/3}n\sigma^2/144}
\le e^{-A^{2/3}n\sigma^2/144}\cdot 4e^{4n\sigma^2}
\le \frac12e^{-A^{1/2}n\sigma^2}
$$
if $A\ge\bar A_0$ with a sufficiently large~$\bar A_0$. The
second term at the right-hand side of~(\ref{(7.13)}) can also be
bounded as $4e^{-A^{2/3}n\sigma^2}\le \frac12e^{-A^{1/2}n\sigma^2}$
with an appropriate choice of the number~$\bar A_0$.
By the above calculation formula~(\ref{(7.13)}) yields the inequality
$$
P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}\left|
\sum\limits_{j=1}^n f(\xi_j)\right| \ge An^{1/2}\sigma^{2}\right)
\le e^{-A^{1/2}n\sigma^2}
$$
if $A>T$, and the constant~$\bar A_0$ is chosen sufficiently large.
\medskip
To complete the proof of Proposition~7.3 we still show in the
following Lemma~7.4 that the functions $p_l(x^{(n)})$,
$1\le l\le m$, we have introduced in the above argument can be
chosen as measurable functions in the space $(X^n,{\cal X}^n)$.
This implies that the expressions
$f_{p_l(\xi^{(n)}(\omega))}(\xi_j(\omega))$ in formula~(\ref{(7.11)})
are $\sigma(\xi_1,\dots,\xi_n)$ measurable random variables. Hence
the formulation of~(\ref{(7.12)}) is legitimate, and no measurability
problem arises. We shall present Lemma~7.4 together with some
generalizations that we shall apply later in the proof of
Propositions~15.3 and~15.4 which are multivariate versions of
Proposition~7.3. We shall need these results in the proof of the
multivariate version of Proposition~6.2. We have formulated these
results not in their most general possible form, but in the form in
which we shall apply them.
\medskip\noindent
{\bf Lemma~7.4.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some
exponent~$L>0$ and parameter~$D\ge1$ on a measurable space
$(X,{\cal X})$. Fix some positive integer~$n$, and define for
all $x^{(n)}=(x_1,\dots,x_n)\in X^n$ the probability measure
$\nu(x^{(n)})=\nu(x_1,\dots,x_n)$ on the space $(X,{\cal X})$
by the formula $\nu(x^{(n)})(\{x_j\})=\frac1n$, $1\le j\le n$.
For a number $0<\varepsilon\le 1$ put
$m=m(\varepsilon)=[D\varepsilon^{-L}]$, where $[\cdot]$
denotes integer part. For all $0<\varepsilon\le 1$ there
exist $m=m(\varepsilon)$
measurable functions $p_l(x^{(n)})$, $1\le l\le m$, on the
measurable space $(X^n,{\cal X}^n)$ with positive integer values in
such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\nu(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^n$ and $f\in{\cal F}$.}
\medskip
In the proof of Proposition~15.3 we need the following result.
\medskip\noindent
{\bf Lemma 7.4A.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some
exponent $L>0$ and parameter~$D\ge1$ on the $k$-fold product
$(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$ with
some $k\ge1$. Fix some positive integer~$n$, and define for
all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$,
where $x^{(j)}_l\in X$ for all $j$ and $l$, the probability
measure $\rho(x^{(n)})$ on the space $(X^k,{\cal X}^k)$ by
the formula
$\rho(x^{(n)})(\{(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})\})
=\frac1{n^k}$ for all sequences
$(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})$, $1\le l_j\le n$,
$1\le j\le k$, with coordinates of the vector
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$. For all
$0<\varepsilon\le 1$ there exist
$m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable functions
$p_r(x^{(n)})$, $1\le r\le m$, on the measurable space
$(X^{kn},{\cal X}^{kn})$ with positive integer values in
such a way that $\inf\limits_{1\le r\le m}
\int(f(u)-f_{p_r(x^{(n)})}(u))^2\rho(x^{(n)})(\,du)
\le\varepsilon^2$
for all $x^{(n)}\in X^{kn}$ and $f\in{\cal F}$.}
\medskip
In the proof of Proposition~15.4 the following result will be needed.
\medskip\noindent
{\bf Lemma 7.4B.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some exponent
$L>0$ and parameter~$D\ge1$ on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some measurable
spaces $(X,{\cal X})$ and $(Y,{\cal Y})$ and integer~$k\ge1$.
Fix some positive integer~$n$, and define for all vectors
$x^{(n)}=(x_l^{(j,1)},x_l^{(j,-1)},\,1\le l\le n,\,1\le j\le k)
\in X^{2kn}$, where $x^{(j,\pm1)}_l\in X$ for all $j$ and $l$,
a probability measure $\alpha(x^{(n)})$ on the space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$
in the following way. Fix some probability measure $\rho$ in
the space $(Y,{\cal Y})$ and two $\pm1$ sequences
$\varepsilon^{(k)}_1=(\varepsilon_{1,1},\dots,\varepsilon_{k,1})$
and
$\varepsilon^{(k)}_2=(\varepsilon_{1,2},\dots,\varepsilon_{k,2})$
of length~$k$. Define with their help first the following
probability measures
$\alpha_1(x^{(n)})=\alpha_1(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
and $\alpha_2(x^{(n)})=\alpha_2(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
in the space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ for all
$x^{(n)}\in X^{2kn}$. Let
$\alpha_1(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,1})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,1})}\}\times B)=\frac{\rho(B)}{n^k}$
and
$\alpha_2(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,2})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,2})}\}\times B)=\frac{\rho(B)}{n^k}$
with $1\le l_j\le n$ for all $1\le j\le k$ and $B\in{\cal Y}$ if
$x_{l_j}^{(j,\varepsilon_{j,1})}$ and
$x_{l_j}^{(j,\varepsilon_{j,2})}$ are the appropriate coordinates
of the vector $x^{(n)}\in X^{2kn}$. Put
$\alpha(x^{(n)})=\frac{\alpha_1(x^{(n)})+\alpha_2(x^{(n)})}2$.
For all $0<\varepsilon\le 1$ there exist
$m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable
functions $p_r(x^{(n)})$, $1\le r\le m$, on the measurable space
$(X^{2kn},{\cal X}^{2kn})$ with positive integer values in
such a way that
$\inf\limits_{1\le r\le m}\int(f(u)-f_{p_r(x^{(n)})}(u))^2
\alpha(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^{2kn}$ and $f\in{\cal F}$.}
\medskip\noindent
{\it Proof of Lemma 7.4.}\/ Fix some $0<\varepsilon\le 1$, take
the number $m=m(\varepsilon)$ introduced in the lemma, and let
us list the set of all vectors $(j_1,\dots,j_m)$ of length~$m$
with positive integer coordinates in some way. Define for all of
these vectors $(j_1,\dots,j_m)$ the set
$B(j_1,\dots,j_m)\subset X^n$ in the following way. We have
$x^{(n)}=(x_1,\dots,x_n)\in B(j_1,\dots,j_m)$
if and only if $\inf\limits_{1\le r\le m}
\int (f(u)-f_{j_r}(u))^2\,d\nu(x^{(n)})(u)\le\varepsilon^2$ for all
$f\in{\cal F}$. Then all sets $B(j_1,\dots,j_m)$ are measurable, and
$\bigcup\limits_{(j_1,\dots,j_m)}B(j_1,\dots,j_m)=X^n$
because ${\cal F}$ is an $L_2$-dense class of functions with
exponent~$L$ and parameter~$D$. Given a point
$x^{(n)}=(x_1,\dots,x_n)$ let us choose
the first vector $(j_1,\dots,j_m)=(j_1(x^{(n)}),\dots,j_m(x^{(n)}))$
in our list of vectors for which $x^{(n)}\in B(j_1,\dots,j_m)$, and
define $p_l(x^{(n)})=j_l(x^{(n)})$ for all $1\le l\le m$ with this
vector $(j_1,\dots,j_m)$. Then the functions $p_l(x^{(n)})$ are
measurable, and the functions $f_{p_l(x^{(n)})}$, $1\le l\le m$,
defined with their help together with the probability measures
$\nu(x^{(n)})$ satisfy the inequality demanded in Lemma~7.4.
\medskip
The proof of Lemmas~7.4A and~7.4B is almost the same. We only
have to modify the definition of the sets $B(j_1,\dots,j_m)$
in a natural way. The space of the arguments $x^{(n)}$ is $X^{kn}$
in Lemma~7.4A and $X^{2kn}$ in Lemma~7.4B, and we have to integrate
with respect to the measures $\rho(x^{(n)})$ in the space $X^k$
and with respect to the measures $\alpha(x^{(n)})$ in the space
$X^k\times Y$ respectively. The sets $B(j_1,\dots,j_m)$ are
measurable also in these cases, and the rest of the proof can be
applied without any change.
\chapter{Formulation of the main results of this work}
The previous sections of this work contain estimates on the tail
distribution of normalized sums of independent, identically
distributed random variables and of the supremum of appropriate
classes of such random sums. They were considered together with
some estimates about the tail distribution of the integral of a
(deterministic) function with respect to a normalized empirical
distribution and of the supremum of such integrals. These two
kinds of problems are closely related, and to understand them better it
is useful to investigate them together with their natural Gaussian
counterpart.
In this section I formulate the natural multivariate versions of
the above mentioned results. They will be proved in the subsequent
sections. To formulate them we have to introduce some new notions.
I shall also discuss some new problems whose solution helps in the
proof of our results. I finish this section with a short overview
about the content of the remaining part of this work.
I start this section with the formulation of two results,
Theorems~8.1 and~8.2 together with some of their simple
consequences. They yield a sharp estimate about the tail
distribution of a multiple random integral with respect to a
normalized empirical distribution and about the analogous
problem when the tail distribution of the supremum of such
integrals is considered. These results are the natural
versions of the corresponding one-variate results about the tail
behaviour of an integral or of the supremum of a class of
integrals with respect to a normalized empirical distribution.
They can be formulated with the help of the notions introduced
before, in particular with the help of the notion of multiple
random integrals with respect to a normalized empirical
distribution introduced in formula~(\ref{(4.8)}).
To formulate the following two results, Theorems~8.3 and~8.4 and
their consequences, which are the natural multivariate versions
of the results about the tail distribution of partial sums of
independent random variables, and of the supremum of such sums
we have to make some preparations. First we introduce the
so-called $U$-statistics which can be considered the natural
multivariate generalizations of the sum of independent and
identically distributed random variables. Besides, observe
that we had a good estimate on the tail distribution of sums
of independent random variables only if the summands had expectation
zero, and we have to find the natural multivariate version of this
property. Hence we define the so-called degenerate $U$-statistics
which can be considered as the natural multivariate counterpart of
sums of independent and identically distributed random variables
with zero expectation. Theorems~8.3 and~8.4 contain estimates about
the tail-distribution of degenerate $U$-statistics and of the
supremum of such expressions.
In Theorems~8.5 and~8.6 we formulate the Gaussian counterparts of
the above results. They deal with multiple Wiener--It\^o integrals
with respect to a so-called white noise. The notion of multiple
Wiener--It\^o integrals and their properties needed to have a good
understanding of these results will be explained in a later
section. Two more results are discussed in this section. They are
Examples~8.7 and~8.8, which state that the estimates of Theorems~8.5
and~8.3 are in a certain sense sharp.
\medskip
To formulate the first two results of this section let us consider
a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in a measurable space
$(X,{\cal X})$. Let $\mu$ denote the distribution
of the random variables $\xi_j$, and introduce the empirical
distribution of the sequence $\xi_1,\dots,\xi_n$ defined
in~(\ref{(4.5)}).
Given a measurable function $f(x_1,\dots,x_k)$ on the
$k$-fold product space $(X^k,{\cal X}^k)$ consider its integral
$J_{n,k}(f)$ with respect to the $k$-fold product of the normalized
empirical measure $\sqrt n(\mu_n-\mu)$ defined in
formula~(\ref{(4.8)}). In
the definition of this integral the diagonals $x_j=x_l$,
$1\le j<l\le k$, are omitted from the domain of integration.
\medskip\noindent
{\bf Theorem 8.1 (Estimate on the tail distribution of a multiple
random integral with respect to a normalized empirical
distribution).}\index{estimate on the tail distribution of a
multiple random integral with respect to a normalized empirical
distribution}
{\it Let the probability measure $\mu$ be non-atomic, and let the
measurable function $f(x_1,\dots,x_k)$ on the product space
$(X^k,{\cal X}^k)$ satisfy the conditions
\begin{equation}
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}|f(x_1,\dots,x_k)|
\le 1 \label{(8.1)}
\end{equation}
and
\begin{equation}
\|f\|_2^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
\le\sigma^2 \label{(8.2)}
\end{equation}
with some number $0<\sigma\le1$. Then there exist some constants
$C=C_k>0$ and $\alpha=\alpha_k>0$ such that the random integral
$J_{n,k}(f)$ defined by formulas~(\ref{(4.5)})
and~(\ref{(4.8)}) satisfies the
inequality
\begin{equation}
P(|J_{n,k}(f)|>u)\le C \max\left(e^{-\alpha(u/\sigma)^{2/k}},
e^{-\alpha(nu^2)^{1/(k+1)}} \right) \label{(8.3)}
\end{equation}
for all $u>0$. The constants $C=C_k>0$ and
$\alpha=\alpha_k>0$ in formula~(\ref{(8.3)}) depend only on the
parameter~$k$.}
\medskip
Theorem 8.1 can be reformulated in the following equivalent form.
\medskip\noindent
{\bf Theorem 8.1$'$.} {\it Under the conditions of Theorem 8.1
\begin{equation}
P(|J_{n,k}(f)|>u)\le C e^{-\alpha(u/\sigma)^{2/k}}
\quad \textrm{for all } 0<u\le n^{k/2}\sigma^{k+1}
\label{($8.3'$)}
\end{equation}
with some constants $C=C_k>0$,
$\alpha=\alpha_k>0$, depending only on the multiplicity~$k$ of the
integral $J_{n,k}(f)$.}
\medskip
Theorem 8.1 clearly implies Theorem~$8.1'$, since in the case
$u\le n^{k/2}\sigma^{k+1}$ the first term is larger than the second
one in the maximum at the right-hand side of formula~(\ref{(8.3)}).
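Indeed, at the threshold $u=n^{k/2}\sigma^{k+1}$ the two exponents
in formula~(\ref{(8.3)}) agree, since
$$
\left(\frac u\sigma\right)^{2/k}=\left(n^{k/2}\sigma^{k}\right)^{2/k}
=n\sigma^2 \quad\textrm{and}\quad
(nu^2)^{1/(k+1)}=\left(n^{k+1}\sigma^{2(k+1)}\right)^{1/(k+1)}
=n\sigma^2,
$$
and for $u\le n^{k/2}\sigma^{k+1}$ the first exponent is the
smaller one.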
On the other hand, Theorem~$8.1'$ implies Theorem~8.1 also if
$u>n^{k/2}\sigma^{k+1}$. Indeed, in this case Theorem~$8.1'$ can be
applied with
$\bar\sigma=\left(u n^{-k/2}\right)^{1/(k+1)}\ge \sigma$ if
$u\le n^{k/2}$, hence also condition $0<\bar\sigma\le1$ is satisfied.
This yields that
$P\left(|J_{n,k}(f)|>u\right)\le C\exp\left\{-\alpha
\left(\frac u{\bar\sigma}\right)^{2/k}\right\}=C \exp\left\{-\alpha
(nu^2)^{1/(k+1)}\right\}$ if $n^{k/2}\ge u>n^{k/2}\sigma^{k+1}$,
and relation~(\ref{(8.3)}) holds in this case. If $u>2^kn^{k/2}$,
then $P(|J_{n,k}(f)|>u)=0$, and if $n^{k/2}\le u<2^kn^{k/2}$,
then
\begin{eqnarray*}
&&P(|J_{n,k}(f)|>u)\le P(|J_{n,k}(f)|>n^{k/2}) \\
&&\qquad \le C \exp\left\{-\alpha\left(n\cdot
\left(n^{k/2}\right)^2\right)^{1/(k+1)}\right\}
\le C \exp\left\{-2^{-k}\alpha(nu^2)^{1/(k+1)}\right\}.
\end{eqnarray*}
Hence relation~(\ref{(8.3)}) holds (with a possibly different
parameter~$\alpha$) in these cases, too.
Theorems~8.1 and~$8.1'$ state that the tail distribution
$P(|J_{n,k}(f)|>u)$ of the $k$-fold random integral $J_{n,k}(f)$ can
be bounded similarly to the probability
$P(|\textrm{const.}\,\sigma\eta^k|>u)$,
where $\eta$ is a random variable with standard normal distribution
and the number $0<\sigma\le1$ satisfies relation~(\ref{(8.2)}),
provided that the level~$u$ we consider is less than
$n^{k/2}\sigma^{k+1}$. As we shall see later (see Corollary~1 of
Theorem~9.4), the value of the number $\sigma^2$ in
formula~(\ref{(8.2)}) is closely related to the
variance of $J_{n,k}(f)$. At the end of this section an example is
given which shows that the condition $u\le n^{k/2}\sigma^{k+1}$ is
really needed in Theorem~$8.1'$.
The next result, Theorem 8.2, is the generalization of Theorem~$4.1'$
for multiple random integrals with respect to a normalized empirical
measure. In its formulation the notions of $L_2$-dense classes and
countably approximability introduced in Section~4 are applied.
\medskip\noindent
{\bf Theorem 8.2 (Estimate on the supremum of multiple random
integrals with respect to an empirical
distribution).}\index{estimate on the supremum of multiple
random integrals with respect to an empirical distribution}
{\it Let us have a non-atomic probability measure
$\mu$ on a measurable space $(X,{\cal X})$ together with a countable
and $L_2$-dense class ${\cal F}$ of functions $f=f(x_1,\dots,x_k)$ of
$k$ variables with some parameter $D\ge1$ and exponent $L\ge1$ on
the product space $(X^k,{\cal X}^k)$ which satisfies the conditions
\begin{equation}
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}|f(x_1,\dots,x_k)|\le 1,
\qquad \textrm{for all } f\in {\cal F} \label{(8.4)}
\end{equation}
and
\begin{eqnarray}
\|f\|_2^2=Ef^2(\xi_1,\dots,\xi_k)&&=\int f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(\,dx_k)\le \sigma^2 \nonumber \\
&&\qquad\qquad\qquad \textrm{for all } f\in {\cal F} \label{(8.5)}
\end{eqnarray}
with some constant $0<\sigma\le1$. There exist some constants
$C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$ depending only on
the parameter $k$ such that the supremum of the random integrals
$J_{n,k}(f)$, $f\in {\cal F}$, defined by formula~(\ref{(4.8)})
satisfies the inequality
\begin{eqnarray}
P\left(\sup_{f\in{\cal F}}|J_{n,k}(f)|\ge u\right)
&&\le C \exp\left\{-\alpha
\left(\frac u{\sigma}\right)^{2/k}\right\}
\quad \textrm{for those numbers } u\nonumber \\
\textrm{for which } n\sigma^2&&\ge
\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}),
\label{(8.6)}
\end{eqnarray}
where the numbers $D$ and $L$ agree with the parameter and exponent
of the $L_2$-dense class~${\cal F}$.
The condition about the countable cardinality of the class ${\cal F}$
can be replaced by the weaker condition that the class of random
variables $J_{n,k}(f)$, $f\in{\cal F}$, is countably approximable.}
\medskip
The condition given for the number~$u$ in formula~(\ref{(8.6)})
appears in Theorem~8.2 for a similar reason as the analogous
condition formulated in~(\ref{(4.4)}) in its one-variate counterpart,
Theorem~4.1. The lower bound is needed, since we have a good
estimate in formula~(\ref{(8.6)}) only for
$u\ge E\sup\limits_{f\in{\cal F}}|J_{n,k}(f)|$.
The upper bound appears, since we have a good estimate in
Theorem~$8.1'$ only for $0<u\le n^{k/2}\sigma^{k+1}$.
\medskip\noindent
{\bf Theorem 8.3 (Estimate on the tail distribution of a degenerate
$U$-statistic).}\index{estimate on the tail distribution of a
degenerate $U$-statistic}
{\it Let us consider a degenerate $U$-statistic $I_{n,k}(f)$ of
order~$k$, defined with the help of a sequence of independent
$\mu$ distributed random variables $\xi_1,\dots,\xi_n$, $n\ge k$,
and a canonical kernel function $f=f(x_1,\dots,x_k)$ which
satisfies relations~(\ref{(8.4)}) and~(\ref{(8.5)}) with some
constant $0<\sigma\le1$. Then there exist some constants
$A=A(k)>0$ and $B=B(k)>0$ depending only
on the order $k$ of the $U$-statistic $I_{n,k}(f)$ such that
\begin{equation}
P(n^{-k/2}k!|I_{n,k}(f)|>u)
\le A\exp\left\{-\frac{u^{2/k}}{2\sigma^{2/k}
\left(1+B\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}
\right)}\right\} \label{(8.10)}
\end{equation}
for all $0\le u\le n^{k/2}\sigma^{k+1}$.}
\medskip
Let us also formulate the following simple corollary of Theorem~8.3.
\medskip\noindent
{\bf Corollary of Theorem 8.3.} {\it Under the conditions
of Theorem~8.3 there exist some universal constants
$C=C(k)>0$ and $\alpha=\alpha(k)>0$ such
that
\begin{equation}
P(n^{-k/2}k!|I_{n,k}(f)|>u)
\le C\exp\left\{-\alpha\left(\frac u\sigma\right)^{2/k}
\right\} \quad \textrm{for all } 0\le u\le n^{k/2}\sigma^{k+1}.
\label{($8.10'$)}
\end{equation}
}
\medskip
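The Corollary indeed follows from Theorem~8.3: for
$0\le u\le n^{k/2}\sigma^{k+1}$ we have
$\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\le1$, hence the
absolute value of the exponent in formula~(\ref{(8.10)}) can be
bounded from below by $\frac{u^{2/k}}{2(1+B)\sigma^{2/k}}$, i.e.\
relation~(\ref{($8.10'$)}) holds with the choice $C=A$ and
$\alpha=\frac1{2(1+B)}$.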
The following estimate holds about the supremum of degenerate
$U$-statistics.
\medskip\noindent
{\bf Theorem 8.4 (Estimate on the supremum of degenerate
$U$-sta\-tis\-tics).}\index{estimate on the supremum of
degenerate $U$-statistics}
{\it Let us have a probability
measure~$\mu$ on a measurable space $(X,{\cal X})$ together
with a countable and $L_2$-dense class ${\cal F}$ of functions
$f=f(x_1,\dots,x_k)$ of $k$ variables with some parameter
$D\ge1$ and exponent~$L\ge1$ on the product space
$(X^k,{\cal X}^k)$ which satisfies conditions~(\ref{(8.4)})
and~(\ref{(8.5)}) with some constant $0<\sigma\le1$. Let us
take a sequence of independent $\mu$ distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge k$, and consider the
$U$-statistics $I_{n,k}(f)$ with these random variables and
kernel functions $f\in{\cal F}$. Let us assume that all these
$U$-statistics $I_{n,k}(f)$, $f\in{\cal F}$, are degenerate,
or in an equivalent form, all functions $f\in {\cal F}$
are canonical with respect to the measure~$\mu$. Then there exist
some constants $C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$
depending only on the parameter $k$ such that the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\le C
\exp\left\{-\alpha \left(\frac u{\sigma}\right)^{2/k}\right\} \quad
\textrm{holds for those } \nonumber \\
&&\qquad \textrm{numbers } u \textrm{ for which } n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}), \label{(8.11)}
\end{eqnarray}
where the numbers $D$ and $L$ agree with the parameter and
exponent of the $L_2$-dense class~${\cal F}$.
The condition about the countable cardinality of the class ${\cal F}$
can be replaced by the weaker condition that the class of random
variables $n^{-k/2}I_{n,k}(f)$, $f\in{\cal F}$, is countably
approximable.}
\medskip
Next I formulate a Gaussian counterpart of the above results. To do
this I need some notions that will be introduced in Section~10. In
that section the white noise with a reference measure $\mu$ will
be defined. It is an appropriate set of jointly Gaussian random
variables indexed by those measurable sets $A\in {\cal X}$ of a
measure space $(X,{\cal X},\mu)$ with a $\sigma$-finite
measure~$\mu$ for which $\mu(A)<\infty$. Its distribution depends
on the measure~$\mu$ which will be called the reference measure of
the white noise.
In Section~10 it will be also shown that given a white noise $\mu_W$
with a non-atomic $\sigma$-additive reference measure $\mu$ on a
measurable space $(X,{\cal X})$ and a measurable function
$f(x_1,\dots,x_k)$ of $k$ variables on the product space
$(X^k,{\cal X}^k)$ such that
\begin{equation}
\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2<\infty
\label{(8.12)}
\end{equation}
a $k$-fold Wiener-It\^o integral of the function $f$ with respect
to the white noise~$\mu_W$
\begin{equation}
Z_{\mu,k}(f)=\frac1{k!}\int f(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots \mu_W(\,dx_k) \label{(8.13)}
\end{equation}
can be defined, and the main properties of this integral will be
proved there. It will be seen that Wiener-It\^o integrals have a
similar relation to degenerate $U$-statistics and multiple
integrals with respect to normalized empirical measures as
normally distributed random variables have to partial sums of
independent random variables. Hence it is useful to find the
analogues of the previous estimates of this section about the
tail distribution of Wiener-It\^o integrals. This will be done in
Theorems~8.5 and~8.6.
\medskip\noindent
{\bf Theorem 8.5 (Estimate on the tail distribution of a multiple
Wiener--It\^o integral).}\index{estimate on the tail distribution
of a multiple Wiener--It\^o integral}
{\it Let us fix a measurable space
$(X,{\cal X})$ together with a $\sigma$-finite non-atomic
measure~$\mu$ on it, and let $\mu_W$ be a white noise with reference
measure $\mu$ on $(X,{\cal X})$. If $f(x_1,\dots,x_k)$ is a measurable
function on $(X^k,{\cal X}^k)$ which satisfies relation~(\ref{(8.12)})
with some $0<\sigma<\infty$, then
\begin{equation}
P(k!|Z_{\mu,k}(f)|>u)\le C \exp\left\{-\frac12\left(\frac
u\sigma\right)^{2/k}\right\} \label{(8.14)}
\end{equation}
for all $u>0$ with some constant $C=C(k)$ depending only on~$k$.}
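\medskip
It may be worth observing that in the special case $k=1$ Theorem~8.5
reduces to a classical fact. The integral
$Z_{\mu,1}(f)=\int f(x)\mu_W(\,dx)$ is a Gaussian random variable
with expectation zero and variance $\int f^2(x)\mu(\,dx)\le\sigma^2$,
hence
$$
P(|Z_{\mu,1}(f)|>u)\le 2e^{-u^2/2\sigma^2}\quad
\textrm{for all } u>0,
$$
i.e.\ relation~(\ref{(8.14)}) holds with $C=2$ in this case.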
\medskip\noindent
{\bf Theorem 8.6 (Estimate on the supremum of Wiener--It\^o
integrals).}\index{estimate on the supremum of Wiener--It\^o integrals}
{\it Let ${\cal F}$ be a countable class of functions
of $k$ variables defined on the $k$-fold product $(X^k,{\cal X}^k)$
of a measurable space $(X,{\cal X})$ such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots \mu(\,dx_k)\le\sigma^2
\quad \textrm{with some } 0<\sigma\le1 \textrm{ for all }
f\in {\cal F}
$$
with some non-atomic $\sigma$-additive measure~$\mu$ on $(X,{\cal X})$.
Let us also assume that ${\cal F}$ is an $L_2$-dense class of functions
in the space $(X^k,{\cal X}^k)$ with respect to the measure~$\mu^k$
with some exponent~$L\ge1$ and parameter~$D\ge1$, where $\mu^k$ is
the $k$-fold product of the measure~$\mu$. (The notion of an
$L_2$-dense class with respect to a measure was defined in
Section~4.)
Take a white noise $\mu_W$ on $(X,{\cal X})$ with reference measure
$\mu$, and define the Wiener--It\^o integrals $Z_{\mu,k}(f)$ for
all $f\in{\cal F}$. Fix some $0<\varepsilon\le1$. The inequality
\begin{equation}
P\left(\sup_{f\in {\cal F}} k!|Z_{\mu,k}(f)|>u\right)\le CD
\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k}\right\}\label{(8.15)}
\end{equation}
holds with some universal constants $C=C(k)>0$, $M=M(k)>0$
for those numbers~$u$ for which $u\ge ML^{k/2}\frac1\varepsilon
\log^{k/2}\frac2\varepsilon \cdot \sigma\log^{k/2}\frac2\sigma$.}
\medskip
Formula~(\ref{(8.15)}) yields an almost as good estimate for the
supremum of Wiener--It\^o integrals with the choice of a small
$\varepsilon>0$ as formula~(\ref{(8.14)}) for a single
Wiener--It\^o integral. But the lower bound imposed on the
number~$u$ in the estimate~(\ref{(8.15)}) depends on $\varepsilon$,
and for a small number $\varepsilon>0$ it is large.
The subsequent result presented in Example~8.7 may help to
understand why Theorems~8.3 and~8.5 are sharp. Its proof and
the discussion of the question about the sharpness
of Theorems~8.3 and~8.5 will be postponed to Section~13.
\medskip\noindent
{\bf Example 8.7 (A converse estimate to Theorem 8.5).} {\it Let
us have a $\sigma$-finite measure $\mu$ on some measure space
$(X,{\cal X})$ together with a white noise $\mu_W$ on $(X,{\cal X})$
with reference measure~$\mu$. Let $f_0(x)$ be a real valued function
on $(X,{\cal X})$ such that $\int f_0(x)^2\mu(\,dx)=1$, and take the
function $f(x_1,\dots,x_k)=\sigma f_0(x_1)\cdots f_0(x_k)$ with
some number $\sigma>0$ together with the Wiener--It\^o integral
$Z_{\mu,k}(f)$ introduced in formula~(\ref{(8.13)}).
Then the relation
$\int f(x_1,\dots,x_k)^2\,\mu(\,dx_1)\dots\,\mu(\,dx_k)=\sigma^2$
holds, and the Wiener--It\^o integral $Z_{\mu,k}(f)$ satisfies the
inequality
\begin{equation}
P(k!|Z_{\mu,k}(f)|>u)
\ge \frac{\bar C}{\left(\frac u\sigma\right)^{1/k}+1}
\exp\left\{-\frac12\left(\frac u\sigma\right)^{2/k}\right\}\quad
\textrm{for all } u>0 \label{(8.16)}
\end{equation}
with some constant $\bar C>0$.}
\medskip
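The structure of this example becomes clear with the help of
It\^o's formula for Wiener--It\^o integrals, a result to be
discussed later. It states in particular that for the product
kernel function of Example~8.7
$$
k!Z_{\mu,k}(f)=\sigma H_k(\eta),\qquad
\eta=\int f_0(x)\mu_W(\,dx),
$$
where $\eta$ is a standard normal random variable, and $H_k$ denotes
the $k$-th Hermite polynomial with leading coefficient~1. Since
$H_k(t)$ behaves like $t^k$ for large $|t|$, the probability
$P(k!|Z_{\mu,k}(f)|>u)$ is comparable with
$P\left(|\eta|>\left(\frac u\sigma\right)^{1/k}\right)$, and the
lower bound~(\ref{(8.16)}) corresponds to the classical estimate
$P(\eta>t)\ge\frac{\textrm{const.}\,t}{t^2+1}\,e^{-t^2/2}$.
\medskip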
The above results show that multiple integrals with respect to a
normalized empirical measure or degenerate $U$-statistics satisfy
some estimates similar to those about multiple Wiener--It\^o
integrals, but they hold under more restrictive conditions. The
difference between the estimates in these problems is similar to
the difference between the corresponding results in Section~4 whose
reason was explained there. Hence this will be only briefly
discussed here. The estimates of Theorems~8.1 and~8.3 are similar to
that of Theorem~8.5. Moreover, for
$0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a small number
$\varepsilon>0$ Theorem~8.3 yields an almost as good
estimate about degenerate $U$-statistics as Theorem~8.5 yields
for a Wiener--It\^o integral with the same kernel function $f$ and
underlying measure $\mu$. Example~8.7 shows that the constant in
the exponent of formula~(\ref{(8.14)}) cannot be improved, at
least there is no possibility of an improvement if only the
$L_2$-norm of the kernel function $f$ is known. Some results
discussed later indicate that the estimate of Theorem~8.3 cannot
be improved either.
The main difference between Theorem~8.5 and the results of
Theorem~8.1 or~8.3 is that in the latter case the kernel
function~$f$ must satisfy not only an $L_2$ but also an $L_\infty$
norm type condition, and the estimates of these results are
formulated under the additional condition
$u\le n^{k/2}\sigma^{k+1}$. It can be shown that the condition about
the $L_\infty$ norm of the kernel function cannot be dropped from
the conditions of these theorems, and a version of Example~3.3 will
be presented in Example~8.8 which shows that in the case
$u\gg n^{k/2}\sigma^{k+1}$ the left-hand side of~(\ref{(8.10)})
may satisfy only a much weaker estimate. This estimate will be
given only for $k=2$, but with some work it can be generalized
for general indices~$k$.
Theorems~8.2, 8.4 and~8.6 show that for the tail distribution of the
supremum of a not too large class of degenerate $U$-statistics or
multiple integrals a similar upper bound can be given as for the tail
distribution of a single degenerate $U$-statistic or multiple integral,
only the universal constants may be worse in the new estimates.
However, they hold only under the additional condition that the level
at which the tail distribution of the supremum is estimated is not too
low. A similar phenomenon appeared already in the results of Section~4.
Moreover, such a restriction had to be imposed in the formulation of
the results here and in Section~4 for the same reason.
In Theorems~8.2 and~8.4 an $L_2$-dense class of kernel functions was
considered, and this meant that the class of random integrals or
$U$-statistics we consider in these results is not too large. In
Theorem~8.6 a similar, but weaker condition was imposed on the class
of kernel functions. They had to satisfy a similar condition, but
only for the reference measure $\mu$ of the white noise appearing in
the Wiener--It\^o integral. A similar difference appears in the
comparison of Theorems~4.1 or~$4.1'$ with Theorem~4.2, and this
difference has the same reason in the two cases.
I still present in this section the proof of the following Example~8.8
which is a multivariate version of Example~3.3. For the sake of
simplicity I restrict my attention to the case $k=2$.
\medskip\noindent
{\bf Example 8.8 (A converse estimate to Theorem 8.3).} {\it Let us
take a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in the plane $X=R^2$ such
that $\xi_j=(\eta_{j,1},\eta_{j,2})$, where $\eta_{j,1}$ and $\eta_{j,2}$
are independent random variables with the following distributions.
The distribution of $\eta_{j,1}$ is defined with the help of a
parameter $\sigma^2$, $0<\sigma^2\le\frac18$, in the same way as
the distribution of the random variables $X_j$ in Example~3.3, i.e.
$\eta_{j,1}=\bar\eta_{j,1}-E\bar\eta_{j,1}$ with
$P(\bar\eta_{j,1}=1)=\bar\sigma^2$,
$P(\bar\eta_{j,1}=0)=1-\bar\sigma^2$, where $\bar\sigma^2$ is the
solution of the equation $x^2-x+\sigma^2=0$ which is smaller
than~$\frac12$. The distribution of the random variables
$\eta_{j,2}$ is given
by the formula $P(\eta_{j,2}=1)=P(\eta_{j,2}=-1)=\frac12$ for all
$1\le j\le n$. Introduce the function
$f(x,y)=f((x_1,x_2),(y_1,y_2))=x_1y_2+x_2y_1$,
$x=(x_1,x_2)\in R^2$, $y=(y_1,y_2)\in R^2$ if $(x,y)$ is in the
support of the distribution of the random vector $(\xi_1,\xi_2)$,
i.e. if $x_1$ and $y_1$ take the values $1-\bar\sigma^2$ or
$-\bar\sigma^2$ and $x_2$ and $y_2$ take the values $\pm1$. Put
$f(x,y)=0$ otherwise. Define the $U$-statistic
$$
I_{n,2}(f)=\frac12\sum_{1\le j,k\le n,\,j\neq k} f(\xi_j,\xi_k)
=\frac12\sum_{1\le j,k\le n,\,j\neq k}
(\eta_{j,1}\eta_{k,2}+\eta_{k,1}\eta_{j,2})
$$
of order 2 with the above kernel function $f$ and sequence of
independent random variables $\xi_1,\dots,\xi_n$. Then $I_{n,2}(f)$
is a degenerate $U$-statistic such that $\sup |f(x,y)|\le 1$ and
$Ef^2(\xi_j,\xi_j)=\sigma^2$.
If $u\ge B_1n\sigma^3$ with some appropriate constant $B_1>2$,
$\bar B_2^{-1}n\ge u\ge \bar B_2 n^{-1/2}$ with a sufficiently
large fixed number $\bar B_2>0$ and
$\frac14\ge\sigma^2\ge\frac1{n^2}$, and $n$ is a sufficiently
large number, then the estimate
\begin{equation}
P(n^{-1}I_{n,2}(f)>u)\ge \exp\left\{-Bn^{1/3}u^{2/3}\log
\left(\frac u{n\sigma^3}\right)\right\} \label{(8.17)}
\end{equation}
holds with some $B>0$.}
\medskip\noindent
{\it Remark:}\/ In Theorem~8.3 we got the estimate
$P(n^{-1}I_{n,2}(f)>u)\le e^{-\alpha u/\sigma}$ for the above
defined degenerate $U$-statistic $I_{n,2}(f)$ if
$0\le u\le n\sigma^3$. In the particular case $u=n\sigma^3$
we have the estimate
$P(n^{-1}I_{n,2}(f)>n\sigma^3)\le e^{-\alpha n\sigma^2}$. On the
other hand, the above example shows that in the case
$u\gg n\sigma^3$
we can get only a weaker estimate. It is worth looking at the
estimate~(\ref{(8.17)}) with fixed parameters $n$ and $u$ and
to observe the dependence of the upper bound on the variance
$\sigma^2$ of $I_{n,2}(f)$. In the case $\sigma^2=u^{2/3}n^{-2/3}$
we have the upper bound $e^{-\alpha n^{1/3}u^{2/3}}$. Example~8.8
shows that in the case $\sigma^2\ll u^{2/3}n^{-2/3}$ we can get
only a relatively small improvement of this estimate. A similar
picture appears as in Example~3.3 in the case $k=1$.
\medskip
It is simple to check that the $U$-statistic introduced in the
above example is degenerate because of the independence of the
random variables $\eta_{j,1}$ and $\eta_{j,2}$ and the identity
$E\eta_{j,1}=E\eta_{j,2}=0$. Beside this,
$Ef(\xi_j,\xi_j)^2=\sigma^2$. In the proof of the
estimate~(\ref{(8.17)})
the results of Section~3, in particular Example~3.3 can be applied
for the sequence $\eta_{j,1}$, $j=1,2,\dots,n$. Beside this, the
following result known from the theory of large deviations will
be applied. If $X_1,\dots,X_n$ are independent and identically
distributed random variables, $P(X_1=1)=P(X_1=-1)=\frac12$, then
for any number $0\le \alpha<1$ there exist some numbers
$C_1=C_1(\alpha)>0$ and $C_2=C_2(\alpha)>0$ such that
$P\left(\sum\limits_{j=1}^nX_j >u\right)\ge C_1e^{-C_2u^2/n}$ for all
$0\le u\le \alpha n$.
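Let us briefly indicate how this lower bound can be obtained. Write
$\sum\limits_{j=1}^nX_j=2S_n-n$, where $S_n$ has binomial
distribution with parameters $n$ and $\frac12$. Then
$$
P\left(\sum_{j=1}^nX_j>u\right)\ge
\sum_{m\colon\, u<2m-n\le u+\sqrt n}2^{-n}\binom nm,
$$
and an application of Stirling's formula shows that if
$0\le u\le\alpha n$, then each of the roughly $\frac{\sqrt n}2$ terms
of this sum is bounded from below by
$\bar C_1n^{-1/2}e^{-\bar C_2u^2/n}$ with some constants
$\bar C_1=\bar C_1(\alpha)>0$ and $\bar C_2=\bar C_2(\alpha)>0$.
Summing up these bounds we get the desired inequality with
appropriate constants $C_1$ and $C_2$.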
\medskip\noindent
{\it Proof of Example 8.8.}\/ The inequality
\begin{eqnarray}
&&P(n^{-1}I_{n,2}(f)>u) \label{(8.18)} \\
&&\qquad \ge P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right)
-P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right) \nonumber
\end{eqnarray}
holds. Because of the independence of the random variables
$\eta_{j,1}$ and $\eta_{j,2}$ the first probability at the
right-hand side of (\ref{(8.18)}) can be bounded from below
by bounding
the multiplicative terms in it with $v_1=4n^{1/3}u^{2/3}$ and
$v_2=n^{2/3}u^{1/3}$. The first term will be estimated by means
of Example 3.3. This estimate can be applied with the choice
$y=v_1$, since the relation $v_1\ge 4n\sigma^2$ holds if
$u\ge B_1n\sigma^3$ with $B_1>1$, and the remaining conditions
$0\le \sigma^2\le\frac18$ and $n\ge4v_1\ge6$ also hold under the
conditions of Example~8.8. The second term can be bounded with
the help of the large-deviation result mentioned after the
remark, since $v_2\le \frac12 n$ if $u\le \bar B_2^{-1}n$ with
a sufficiently large $\bar B_2>0$. In such a way we get the
estimate
\begin{eqnarray*}
&&P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right) \\
&&\qquad \ge P\left(\sum_{j=1}^n\eta_{j,1} >v_1\right)
P\left(\sum_{j=1}^n\eta_{j,2}>v_2\right) \\
&&\qquad \ge C\exp\left\{-B_1v_1\log
\left(\frac{v_1}{n\sigma^2}\right)-B_2\frac{v_2^2}{n}\right\} \\
&&\qquad \ge C\exp\left\{-B_3n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\}
\end{eqnarray*}
with appropriate constants $B_1>1$, $B_2>0$ and $B_3>0$. On the
other hand, by applying Bennett's inequality, more precisely its
consequence given in formula~(\ref{(3.4)}) for the sum of the random
variables $X_j=\eta_{j,1}\eta_{j,2}$ at level $nu$ instead of
level~$u$ we get the following upper bound for the second term at
the right-hand side of~(\ref{(8.18)}).
\begin{eqnarray*}
P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right)
&\le& \exp\left\{-Knu\log \frac u{\sigma^2}\right\} \\
&\le& \exp\left\{-2B_4n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\},
\end{eqnarray*}
since $E\eta_{j,1}\eta_{j,2}=0$,
$E\eta^2_{j,1}\eta^2_{j,2}=\sigma^2$,
$nu\ge B_1n^2\sigma^3\ge 2n\sigma^2$ because of the
conditions $B_1>2$ and $n\sigma\ge1$. Hence the
estimate~(\ref{(3.4)})
(with parameter $nu$) can be applied in this case. Beside this,
the constant $B_4$ can be chosen sufficiently large in the last
inequality if the number~$n$ or the bound~$\bar B_2$ in
Example~8.8 is chosen sufficiently large. This means that this
term is negligibly small. The above estimates imply the
statement of Example~8.8.
\medskip
Let me remark that under some mild additional restrictions the
estimate (\ref{(8.17)}) can be slightly sharpened: the term
$\log$ can be replaced by $\log^{2/3}$ in the exponent of the
right-hand side of~(\ref{(8.17)}). To get such an estimate
some additional calculation is needed where the numbers
$v_1$ and $v_2$ are replaced by
$\bar v_1=4n^{1/3}u^{2/3}\log^{-1/3}\left(\frac u{n\sigma^3}\right)$
and
$\bar v_2=n^{2/3}u^{1/3}\log^{1/3}\left(\frac u{n\sigma^3}\right)$.
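Let us indicate why this choice improves the exponent. The relation
$\bar v_1\bar v_2=4nu$ remains valid, hence the argument of the above
proof can be repeated, and now the two terms appearing in the
exponent of the lower bound are of the same order, since
$$
\bar v_1\log\left(\frac{\bar v_1}{n\sigma^2}\right)\sim
\textrm{const.}\; n^{1/3}u^{2/3}
\log^{2/3}\left(\frac u{n\sigma^3}\right)
\quad\textrm{and}\quad
\frac{\bar v_2^2}n=n^{1/3}u^{2/3}
\log^{2/3}\left(\frac u{n\sigma^3}\right),
$$
up to universal constant factors.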
\medskip
I finish this section with a short overview about the
remaining part of this work.
In our proofs we needed some results about $U$-statistics, and this
is the main topic of Section~9. One of the results discussed here
is the so-called Hoeffding decomposition of $U$-statistics to the
linear combination of degenerate $U$-statistics of different order.
We also needed some additional results which explain how some
properties (e.g.\ a bound on the $L_2$ and $L_\infty$ norm of a
kernel function, the $L_2$-density property of a class~${\cal F}$ of
kernel functions) are inherited if we turn from the original
$U$-statistics to the degenerate $U$-statistics appearing in
their Hoeffding decomposition. Section~9 contains some results
in this direction. Another important result in it is Theorem~9.4
which yields a decomposition of multiple integrals with respect
to a normalized empirical distribution to the linear combination
of degenerate $U$-statistics. This result is very similar to the
Hoeffding decomposition of $U$-statistics. The main difference
between them is that in the decomposition of multiple integrals
much smaller coefficients appear. Theorem~9.4 makes it possible to
reduce the proof of Theorems~8.1 and~8.2 to the corresponding
results in Theorems~8.3 and~8.4 about degenerate $U$-statistics.
The definition and the main properties of Wiener--It\^o integrals
needed in the proof of Theorems~8.5 and~8.6 are presented in
Section~10. It also contains a result, called the diagram formula
for Wiener--It\^o integrals which plays an important role in our
considerations. Beside this we proved a limit theorem, where we
expressed the limit of normalized degenerate $U$-statistics with
the help of multiple Wiener--It\^o integrals. This result may
explain why it is natural to consider Theorem~8.5 as the
natural Gaussian counterpart of Theorem~8.3, and Theorem~8.6 as
the natural Gaussian counterpart of Theorem~8.4.
We could prove Bernstein's and Bennett's inequality by means of a
good estimation of the exponential moments of the partial sums we
were investigating. In the proof of their multivariate versions,
in Theorems~8.3 and~8.5 this method does not work, because the
exponential moments we have to bound in these cases may be
infinite. On the other hand, we could prove these results by means
of a good estimate on the high moments of the random variables
whose tail distribution we wanted to bound. In the proof of
Theorem~8.5 the moments of multiple Wiener--It\^o integrals
have to be bounded, and this can be done with the help of the
diagram formula for Wiener--It\^o integrals. In Sections~11
and~12 we proved that there is a version of the diagram formula
for degenerate $U$-statistics, and this enables us to estimate
the moments needed in the proof of Theorem~8.3. In Section~13
we proved Theorems~8.3, 8.5 and a multivariate version of the
Hoeffding inequality. At the end of this section we still
discussed some results which state that in certain cases when
we have, beside the upper bound of their $L_2$ and $L_\infty$
norm some additional information about the behaviour of the
kernel function~$f$ in Theorems~8.3 or~8.5 these results can
be improved.
Section~14 contains the natural multivariate versions of the
results in Section~6. In Section~6 Theorem~4.2 is proved about
the supremum of Gaussian random variables and in Section~14
its multivariate version, Theorem~8.6. Both results are proved
with the help of the chaining argument. On the other hand, the
chaining argument is not strong enough to prove Theorem~4.1.
But as it is shown in Section~6, it enables us to prove a result
formulated in Proposition~6.1, and to reduce the proof of
Theorem~4.1 with its help to a simpler result formulated in
Proposition~6.2. One of the results in Section~14,
Proposition~14.1, is a multivariate version of Proposition~6.1.
We showed that the proof of Theorem~8.4 can be reduced with its
help to the proof of a result formulated in Proposition~14.2,
which can be considered a multivariate version of Proposition~6.2.
Section~14 contains still another result. It turned out that
it is simpler to work with so-called decoupled $U$-statistics
introduced in this section than with usual $U$-statistics,
because they have more independence properties. In
Proposition~$14.2'$ a version of Proposition~14.2 is formulated
about degenerate $U$-statistics, and it is shown with the help
of a result of de la Pe\~na and Montgomery--Smith that the proof
of Proposition~14.2, and thus of Theorem~8.4 can be reduced to
the proof of Proposition~$14.2'$.
Proposition~$14.2'$ is proved similarly to its one-variate
version, Proposition~6.2. The strategy of the proof is explained
in Section~15. The main difference between the proof of the two
propositions is that since the independence properties exploited
in the proof of Proposition~6.2 hold only in a weaker form in the
present case, we have to apply a more refined and more difficult
argument. In particular, we have to apply instead of the
symmetrization lemma, Lemma~7.1, a more general version of it,
Lemma~15.2. It is hard to check its conditions when we try to
apply this result in the problems arising in the proof of
Proposition~$14.2'$. This is the reason why we had to prove
Proposition~$14.2'$ with the help of two inductive propositions,
formulated in Propositions~15.3 and~15.4, while in the proof of
Proposition~6.2 it was enough to prove a single result, presented
in Proposition~7.3. We discuss the details of the problems and
the strategy of the proof in Section~15. The proof of
Propositions~15.3 and~15.4 is given in Sections~16 and~17.
Section~16 contains the symmetrization arguments needed for us,
and the proof is completed with its help in Section~17.
Finally in Section~18 we give an overview of this work, and
explain its relation to some related research. The proof of
some results is given in the Appendix.
\chapter{Some results about $U$-statistics}
This section contains the proof of the Hoeffding decomposition
theorem, an important result about $U$-statistics. It states that
all $U$-statistics can be represented as a sum of degenerate
$U$-statistics of different order. This representation can be
considered as the natural multivariate version of the
decomposition of a sum of independent random variables into the sum
of independent random variables with expectation zero plus a
constant (which can be interpreted as a random variable with zero
variance). Some important properties of the Hoeffding
decomposition will also be proved. The properties of the kernel
function of a $U$-statistic will be compared to those of the kernel
functions of the $U$-statistics in its Hoeffding decomposition.
If the Hoeffding decomposition of a $U$-statistic is taken, then
the $L_2$ and $L_\infty$-norms of the kernel functions appearing
in the $U$-statistics of the Hoeffding decomposition will be
bounded by means of the corresponding norm of the kernel function
of the original $U$-statistic. It will be also shown that if we
take a class of $U$-statistics with an $L_2$-dense class of kernel
functions (and the same sequence of independent and identically
distributed random variables in the definition of each
$U$-statistic) and consider the Hoeffding decomposition of all
$U$-statistics in this class, then the kernel functions of the
degenerate $U$-statistics appearing in these Hoeffding
decompositions also constitute an $L_2$-dense class. Another
important result of this section is Theorem~9.4. It yields a
decomposition of a $k$-fold random integral with respect to a
normalized empirical distribution to the linear combination of
degenerate $U$-statistics. This result enables us to derive
Theorem~8.1 from Theorem 8.3 and Theorem~8.2 from Theorem~8.4,
and it is also useful in the proof of Theorems~8.3 and~8.4.
Let us first consider Hoeffding's decomposition. In the
special case $k=1$ it states that the sum
$S_n=\sum\limits_{j=1}^n\xi_j$ of independent and identically
distributed random variables can be rewritten as
$S_n=\sum\limits_{j=1}^n(\xi_j-E\xi_j)
+\left(\sum\limits_{j=1}^nE\xi_j\right)$, i.e.\
as the sum of independent random variables with zero expectation
plus a constant. We introduced the convention that a constant is
the kernel function of a degenerate $U$-statistic of order zero,
and $I_{n,0}(c)=c$ for a $U$-statistic of order zero. I wrote
down the above trivial formula, because Hoeffding's decomposition
is actually its adaptation to a more general situation. To
understand this let us first see how to adapt the above
construction to the case $k=2$.
In this case a sum of the form
$2I_{n,2}(f)=\sum\limits_{1\le j,k\le n,j\neq k} f(\xi_j,\xi_k)$
has to be considered. Write
$f(\xi_j,\xi_k)=[f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)]+
E(f(\xi_j,\xi_k)|\xi_k)=f_1(\xi_j,\xi_k)+\bar f_1(\xi_k)$ with
$f_1(\xi_j,\xi_k)=f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)$, and
$\bar f_1(\xi_k)=E(f(\xi_j,\xi_k)|\xi_k)$ to make the conditional
expectation of $f_1(\xi_j,\xi_k)$ with respect to $\xi_k$ equal
zero. Repeating this procedure for the first coordinate we define
$f_2(\xi_j,\xi_k)=f_1(\xi_j,\xi_k)-E(f_1(\xi_j,\xi_k)|\xi_j)$ and
$\bar f_2(\xi_j)=E(f_1(\xi_j,\xi_k)|\xi_j)$.
Let us also write $\bar f_1(\xi_k)=
[\bar f_1(\xi_k)-E\bar f_1(\xi_k)]+E\bar f_1(\xi_k)$ and
$\bar f_2(\xi_j)=[\bar f_2(\xi_j)-E\bar f_2(\xi_j)]
+E\bar f_2(\xi_j)$.
Simple calculation shows that $2I_{n,2}(f_2)$ is a degenerate
$U$-statistic of order 2, and the identity
$2I_{n,2}(f)=2I_{n,2}(f_2)+I_{n,1}((n-1)(\bar f_1-E\bar f_1))+
I_{n,1}((n-1)(\bar f_2-E\bar f_2))+n(n-1)E(\bar f_1+\bar f_2)$
yields the decomposition of $I_{n,2}(f)$ into a sum of degenerate
$U$-statistics of different orders.
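Let us check that $f_2$ is indeed a canonical function. By its
construction
$$
E(f_2(\xi_j,\xi_k)|\xi_j)=E(f_1(\xi_j,\xi_k)|\xi_j)-\bar f_2(\xi_j)=0,
$$
and, since $\bar f_2(\xi_j)$ is independent of $\xi_k$ and
$E\bar f_2(\xi_j)=Ef_1(\xi_j,\xi_k)=0$, also
$$
E(f_2(\xi_j,\xi_k)|\xi_k)=E(f_1(\xi_j,\xi_k)|\xi_k)
-E\bar f_2(\xi_j)=0
$$
holds.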
Hoeffding's decomposition can be obtained by working out the details
of the above argument in the general case. But it is simpler to
calculate the appropriate conditional expectations by working with
the kernel functions of the $U$-statistics. To carry out such
a program we introduce the following notations.
Let us consider the $k$-fold product $(X^k,{\cal X}^k,\mu^k)$ of a
measure space $(X,{\cal X},\mu)$ with some probability measure $\mu$,
and define for all integrable functions $f(x_1,\dots,x_k)$ and indices
$1\le j\le k$ the projection~$P_jf$ of the function $f$ to its $j$-th
coordinate, i.e.\ integration of the function~$f$ with respect to its
$j$-th coordinate.
For the sake of simpler notations in our future considerations we
shall define the operator $P_j$ in a slightly more general setting.
Let us consider a set $A=\{p_1,\dots,p_s\}\subset\{1,\dots,k\}$, put
$X^A=X_{p_1}\times X_{p_2}\times\cdots\times X_{p_s}$, ${\cal X}^A
={\cal X}_{p_1}\times {\cal X}_{p_2}\times\cdots\times{\cal X}_{p_s}$,
$\mu^A=\mu_{p_1}\times \mu_{p_2}\times\cdots\times \mu_{p_s}$, take
the product space $(X^A,{\cal X}^A,\mu^A)$ and if $j\in A$, then
define the operator $P_j$ on this product space by the formula
\begin{equation}
P_jf(x_{p_1},\dots,x_{p_{r-1}},x_{p_{r+1}},\dots,x_{p_s})
=\int
f(x_{p_1},\dots,x_{p_s})\mu(\,dx_j), \quad\text{if } j=p_r. \label{(9.1)}
\end{equation}
Let us also define the (orthogonal projection) operators
$Q_j=I-P_j$ as $Q_jf=f-P_jf$ for all integrable functions $f$ on
the space $(X^A,{\cal X}^A,\mu^A)$, and $j\in A$, i.e. put
\begin{eqnarray}
Q_jf(x_{p_1},\dots,x_{p_s})&=&(I-P_j)f(x_{p_1},\dots,x_{p_s})
\nonumber\\
&=&f(x_{p_1},\dots,x_{p_s})-\int f(x_{p_1},\dots,x_{p_s})\mu(\,dx_j).
\label{(9.1a)}
\end{eqnarray}
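Let us note a simple consequence of these definitions which lies
behind the Hoeffding decomposition. The operators $P_j$ and $Q_j$
belonging to different indices~$j$ commute with each other, and
$P_j+Q_j=I$ for all $1\le j\le k$. Hence every integrable function
$f$ on $(X^k,{\cal X}^k,\mu^k)$ satisfies the identity
$$
f=\left(\prod_{j=1}^k(P_j+Q_j)\right)f
=\sum_{V\subset\{1,\dots,k\}}\left(\prod_{j\in\{1,\dots,k\}
\setminus V}P_j\prod_{j\in V}Q_j\right)f.
$$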
In the definition~(\ref{(9.1)}) $P_jf$ is a function not
depending on the coordinate $x_j$, but in the definition of $Q_j$
we introduce the fictive coordinate $x_j$ to make the expression
$Q_jf=f-P_jf$ meaningful. The following result holds.
\medskip\noindent
{\bf Theorem 9.1 (The Hoeffding decomposition of
$U$-statistics).}\index{Hoeffding decomposition of $U$-statistics}
{\it Let $f(x_1,\dots,x_k)$ be an integrable function on the $k$-fold
product $(X^k,{\cal X}^k,\mu^k)$ of a space $(X,{\cal X},\mu)$
with a probability measure $\mu$. It has a decomposition of the form
\begin{eqnarray}
&&f(x_1,\dots,x_k)=\sum\limits_{V\subset\{1,\dots,k\}}
f_V(x_{j_1},\dots,x_{j_{|V|}}),
\label{(9.2)} \\
&& \qquad\quad \textrm{with} \quad
f_V(x_{j_1},\dots,x_{j_{|V|}})
=\left(\prod_{j\in\{1,\dots,k\}\setminus V}P_j
\prod_{j\in V}Q_j\right)f(x_1,\dots,x_k) \nonumber
\end{eqnarray}
with $V=\{j_1,\dots,j_{|V|}\}$, $j_1<j_2<\cdots<j_{|V|}$.}
\medskip
The decomposition~(\ref{(9.9)}) of $J_{n,k}(f)$ yields the inequality
\begin{equation}
P(|J_{n,k}(f)|>u)\le \sum_{V\subset\{1,\dots,k\}}
P\left(n^{-|V|/2}|I_{n,|V|}(f_V)|>\frac u{2^kC(k)}\right)
\label{(9.10)}
\end{equation}
with a constant $C(k)$ satisfying the inequality $p!C(n,k,p)\le
k!C(k)$ for all coefficients $C(n,k,p)$, $1\le p\le k$,
in~(\ref{(9.9)}). Hence Theorem~$8.1'$ follows from Theorem~8.3
and relations~(\ref{(9.4)}) and~(\ref{($9.4'$)}) in Theorem~9.2 by
which the $L_2$-norm of the functions $f_V$ is bounded by the
$L_2$-norm of the function~$f$ and the $L_\infty$-norm of $f_V$
is bounded by $2^{|V|}$-times the $L_\infty$-norm of $f$. It is
enough to estimate each term at the right-hand side of~(\ref{(9.10)})
by means of Theorem~8.3. It can be assumed that $2^kC(k)>1$. Let us
first assume that also the inequality $\frac u{2^kC(k) \sigma}\ge1$
holds. In this case formula~(\ref{($8.3'$)}) in Theorem~$8.1'$
can be obtained by means of the estimation of each term at the
right-hand side of~(\ref{(9.10)}). Observe that
$\exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}\right)^{2/s}
\right\}\le \exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}
\right)^{2/k}\right\}$ for all
$s\le k$ if $\frac u{2^kC(k)\sigma}\ge1$. In the other case, when
$\frac u{2^kC(k)\sigma}\le1$, formula~(\ref{($8.3'$)}) holds again
with a sufficiently large $C>0$, because in this case the
right-hand side of~(\ref{($8.3'$)}) is greater than~1.
Theorem~8.2 can be similarly derived from Theorem~8.4 by observing
that relation~(\ref{(9.10)}) remains valid if $|J_{n,k}(f)|$ is
replaced by
$\sup\limits_{f\in{\cal F}}|J_{n,k}(f)|$ and $|I_{n,|V|}(f_V)|$
by $\sup\limits_{f_V\in{\cal F}_V} |I_{n,|V|}(f_V)|$ in it, and we
have the right to choose the constant~$M$ in formula~(\ref{(8.6)}) of
Theorem~8.2 sufficiently large. The only difference in the argument
is that beside formulas~(\ref{(9.4)}) and~(\ref{($9.4'$)}) the last
statement of Theorem~9.2 also has to be applied in this case. It
tells that if ${\cal F}$ is an $L_2$-dense class of functions on a
space $(X^k,{\cal X}^k)$, then the classes of functions
${\cal F}_V=\{2^{-|V|}f_V\colon\, f\in{\cal F}\}$ are also
$L_2$-dense classes of functions for all $V\subset\{1,\dots,k\}$
with the same exponent and parameter.\index{estimate on the
supremum of multiple random integrals with respect to an empirical
distribution}
\medskip
I make some comments about the content of Theorem~9.4. The
expression $J_{n,k}(f)$ was defined as a $k$-fold random integral
with respect to the signed measure $\mu_n-\mu$, where the diagonals
were omitted from the domain of integration. Formula~(\ref{(9.9)})
expresses the random integral $J_{n,k}(f)$ as a linear combination of
degenerate $U$-statistics of different order. This is similar to
the Hoeffding decomposition of the $U$-statistic $I_{n,k}(f)$ to the
linear combination of degenerate $U$-statistics defined with the
same kernel functions~$f_V$. The main difference between these two
formulas is that in the expansion~(\ref{(9.9)}) of
$J_{n,k}(f)$ the terms $I_{n,|V|}(f_V)$ appear with small
coefficients $C(n,k,|V|)|V|!\frac1{n^{|V|/2}}$. As we shall see,
$E\left(C(n,k,|V|)\,|V|!\,\frac1{n^{|V|/2}}I_{n,|V|}(f_V)\right)^2$
is small.

A $\sigma$-finite measure~$\mu$ is called non-atomic if for all
measurable sets $A$ with $\mu(A)<\infty$ and for all
$\varepsilon>0$ there is a finite partition
$A=\bigcup\limits_{s=1}^N B_s$ of the set~$A$ with the property
$\mu(B_s)<\varepsilon$ for all $1\le s\le N$. There is a formally
weaker definition of non-atomic measures, by which a
$\sigma$-finite measure~$\mu$ is non-atomic if for all measurable
sets $A$ such that $0<\mu(A)<\infty$ there is a measurable set
$B\subset A$ with the property $0<\mu(B)<\mu(A)$. But these two
definitions of non-atomic measures are actually equivalent,
although this equivalence is not trivial. I do not discuss this
problem here, since it lies somewhat outside the direction
of the present work. In our further considerations we shall work
with the first definition of non-atomic measures.
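\medskip
As an example, the Lebesgue measure $\lambda$ on the real line is
non-atomic in the first, stronger sense. Indeed, for a measurable
set $A$ with $\lambda(A)<\infty$ the function
$G(t)=\lambda(A\cap(-\infty,t])$ is continuous, hence for any
$\varepsilon>0$ we can choose finitely many points
$t_1<t_2<\cdots<t_{N-1}$ with $G(t_s)=s\varepsilon/2$, and the sets
$B_s=A\cap(t_{s-1},t_s]$, $1\le s\le N$ (with $t_0=-\infty$,
$t_N=\infty$), constitute a finite partition of the set~$A$ with
$\lambda(B_s)\le\varepsilon/2<\varepsilon$. On the other hand, a
measure with an atom, i.e.\ with a point $x_0$ such that
$\mu(\{x_0\})>0$, is not non-atomic, since the one-point set
$\{x_0\}$ has no partition into sets of smaller measure.
\medskip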
The $k$-fold Wiener-It\^o integrals\index{Wiener--It\^o integrals}
of the functions $f\in{\cal H}_{\mu,k}$ with respect to the white
noise~$\mu_W$ will be defined in a rather standard way. First they
will be defined for some simple functions, called elementary
functions, then it will be shown that the integrals of these
elementary functions have an $L_2$ contraction property which
makes it possible to extend the integral to the class of functions in
${\cal H}_{\mu,k}$.
Let us first introduce the following class of elementary
functions $\bar{\cal H}_{\mu,k}$ of $k$ variables.\index{elementary
functions of $k$ variables} A function $f(x_1,\dots,x_k)$ on
$(X^k,{\cal X}^k)$ belongs to $\bar{\cal H}_{\mu,k}$ if there
exist finitely many disjoint measurable subsets $A_1,\dots,A_M$,
$1\le M<\infty$, of the set~$X$ (i.e. $A_j\cap A_{j'}=\emptyset$
if $j\neq j'$) such that $\mu(A_j)<\infty$ for all $1\le j\le M$,
and the function $f$ has the form
\begin{equation}
f(x_1,\dots,x_k)=\left\{
\begin{array}{l}
c(j_1,\dots,j_k)\quad\textrm{if } (x_1,\dots,x_k) \in
A_{j_1}\times\cdots \times A_{j_k} \textrm{ with} \\
\qquad \textrm{some indices } (j_1,\dots,j_k),
\quad 1\le j_s\le M,\; 1\le s\le k,\\
\qquad \textrm{ such that all numbers } j_1,\dots,j_k
\textrm{ are different} \\
0 \quad\textrm{if }(x_1,\dots,x_k)\notin \!\!\!
\bigcup\limits_{\substack
{(j_1,\dots,j_k)\colon\, 1\le j_s\le M, \; 1\le s\le k,\\
\textrm{ and all } j_1,\dots,j_k\textrm { are different.} }}\! \!\!
A_{j_1}\times\cdots \times A_{j_k}
\end{array}
\right. \label{(10.2)}
\end{equation}
with some real numbers $c(j_1,\dots,j_k)$, $1\le j_s\le M$, $1\le
s\le k$, if all $j_1,\dots,j_k$ are different numbers. This means
that the function $f$ is constant on all $k$-dimensional
rectangles $A_{j_1}\times\dots\times A_{j_k}$ with different,
non-intersecting edges, and it equals zero on the complementary
set of the union of these rectangles. The property that the support
of the function~$f$ is on the union of rectangles with
non-intersecting edges is sometimes interpreted so that the
diagonals are omitted from the domain of integration of
Wiener--It\^o integrals.
The Wiener-It\^o integral of an elementary function
$f(x_1,\dots,x_k)$ of the form~(\ref{(10.2)}) with respect to a white
noise $\mu_W$ with the (non-atomic) reference measure $\mu$
is defined by the formula
\begin{eqnarray}
&&\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k) \nonumber\\
&&\qquad=\sum_{\substack{1\le j_s\le M,\;1\le s\le k \\
\textrm{all } j_1,\dots,j_k \textrm{ are different} }}
c(j_1,\dots,j_k) \mu_W(A_{j_1})\cdots\mu_W(A_{j_k}). \label{(10.3)}
\end{eqnarray}
(The representation of the function $f$ in~(\ref{(10.2)}) is not unique;
the sets $A_j$ can be divided into smaller disjoint sets, but its
Wiener--It\^o integral defined in~(\ref{(10.3)}) does not depend on its
representation. This can be seen with the help of the additivity
property $\mu_W(A\cup B)=\mu_W(A)+\mu_W(B)$ if $A\cap B=\emptyset$
of the white noise~$\mu_W$.) The notation
\begin{equation}
Z_{\mu,k}(f)=\frac1{k!}
\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k), \label{(10.4)}
\end{equation}
will be used in the sequel, and the expression $Z_{\mu,k}(f)$
will be called the normalized Wiener--It\^o integral of the
function~$f$. Such a terminology will be applied also for the
Wiener--It\^o integrals of all functions $f\in{\cal H}_{\mu,k}$ to
be defined later.
If $f$ is an elementary function in $\bar{\cal H}_{\mu,k}$ defined
in~(\ref{(10.2)}), then its normalized Wiener--It\^o integral defined
in~(\ref{(10.3)}) and~(\ref{(10.4)}) satisfies the relations
\begin{eqnarray}
Ek!Z_{\mu,k}(f)&&=0, \nonumber \\
E(k!Z_{\mu,k}(f))^2&&= \!\!
\sum_{\substack{(j_1,\dots,j_k)\colon\,
1\le j_s\le M,\; 1\le s\le k, \\
\textrm{and all } j_1,\dots,j_k\textrm{ are different.} }}
\sum_{\pi\in \Pi_k}
c(j_1,\dots,j_k)c(j_{\pi(1)},\dots,j_{\pi(k)}) \nonumber \\
&&\qquad\qquad E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j_{\pi(1)}})\cdots\mu_W(A_{j_{\pi(k)}}) \nonumber \\
&&=k!\int (\textrm{Sym\,} f(x_1,\dots,x_k))^2
\mu(\,dx_1)\dots\mu(\,dx_k) \nonumber \\
&&\le k!\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k),
\label{(10.5)}
\end{eqnarray}
with $\textrm{Sym\,}f(x_1,\dots,x_k)=
\frac1{k!}\sum\limits_{\pi\in\Pi_k}f(x_{\pi(1)},\dots,x_{\pi(k)})$,
where $\Pi_k$ denotes the set of all permutations
$\pi=\{\pi(1),\dots,\pi(k)\}$ of the set $\{1,\dots,k\}$.
The identities written down in~(\ref{(10.5)}) can be simply
checked. The first relation follows from the identity
$E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})=0$ for disjoint sets
$A_{j_1},\dots,A_{j_k}$, which holds since we take the expectation
of a product of independent random variables with zero
expectation. The second identity follows similarly from the identity
\begin{eqnarray*}
&&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})=0\\
&&\qquad \textrm{ if the sets of indices }
\{j_1,\dots,j_k\} \textrm { and }
\{j'_1,\dots,j'_k\} \textrm{ are different,} \\
&&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})
=\mu(A_{j_1})\cdots\mu(A_{j_k})\\
&&\qquad \textrm{ if } \{j_1,\dots,j_k\}=
\{j'_1,\dots,j'_k\} \textrm{ i.e. if }
j'_1=j_{\pi(1)},\dots,j'_k=j_{\pi(k)} \\
&&\qquad \textrm{ with some permutation } \pi\in\Pi_k,
\end{eqnarray*}
which holds because the values of the white noise $\mu_W$ on
disjoint sets are independent random variables with expectation
zero, and $E\mu_W(A)^2=\mu(A)$. The remaining relations
in~(\ref{(10.5)}) follow by a direct calculation.
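In the special case $k=1$ the relations in~(\ref{(10.5)}) can be
seen directly. If $f(x)=c(j)$ for $x\in A_j$, $1\le j\le M$, and
$f(x)=0$ otherwise, then the independence of the random variables
$\mu_W(A_j)$, $1\le j\le M$, and the identity
$E\mu_W(A_j)^2=\mu(A_j)$ imply that
$$
EZ_{\mu,1}(f)^2=\sum_{j=1}^M c(j)^2\mu(A_j)=\int f^2(x)\mu(\,dx),
$$
and the inequality in~(\ref{(10.5)}) becomes an equality in this
case, since $\textrm{Sym}\,f=f$ for functions of one variable.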
It is not difficult to check that
\begin{equation}
EZ_{\mu,k}(f)Z_{\mu,k'}(g)=0 \label{(10.6)}
\end{equation}
for all functions $f\in \bar{\cal H}_{\mu,k}$ and
$g\in \bar{\cal H}_{\mu,k'}$ if $k\neq k'$, and
\begin{equation}
Z_{\mu,k}(f)=Z_{\mu,k}(\textrm{Sym}\, f) \label{(10.7)}
\end{equation}
for all functions $f\in \bar{\cal H}_{\mu,k}$.
The definition of Wiener--It\^o integrals can be extended to
general functions $f\in{\cal H}_{\mu,k}$ with the help of the
estimate~(\ref{(10.5)}). To carry out this extension we still have
to know that the class of functions $\bar{\cal H}_{\mu,k}$ is
a dense subset of the class ${\cal H}_{\mu,k}$ in the Hilbert
space $L_2(X^k,{\cal X}^k,\mu^k)$, where $\mu^k$ is the $k$-th power
of the reference measure $\mu$ of the white noise~$\mu_W$. I
briefly explain how this property of $\bar{\cal H}_{\mu,k}$ can be
proved. The non-atomic property of the measure~$\mu$ is exploited
at this point.
To prove this statement it is enough to show that the indicator
function of any product set $A_1\times\cdots\times A_k$
with $\mu(A_j)<\infty$, $1\le j\le k$ (the sets
$A_1,\dots,A_k$ need not be disjoint), belongs to the $L_2(\mu^k)$
closure of $\bar{\cal H}_{\mu,k}$. In the proof of this
statement it will be exploited that since $\mu$ is a non-atomic
measure, the sets $A_j$ can be represented for all
$\varepsilon>0$ and $1\le j\le k$ as a finite union
$A_j=\bigcup\limits_s B_{j,s}$ of disjoint sets $B_{j,s}$
with the property $\mu(B_{j,s})<\varepsilon$.
By means of these relations the
product $A_1\times\cdots\times A_k$ can be written in the form
\begin{equation}
A_1\times\cdots\times A_k=\bigcup_{s_1,\dots,s_k}
B_{1,s_1}\times\cdots\times B_{k,s_k} \label{(10.8)}
\end{equation}
with some sets $B_{j,s_j}$ such that $\mu(B_{j,s_j})<\varepsilon$
for all sets in this union. Moreover, we may assume, by refining
the partitions of the sets $A_j$ if necessary, that any
two sets $B_{j,s_j}$ and $B_{j',s'_{j'}}$ in this representation
are either disjoint, or they agree. Take such a representation of
$A_1\times\cdots\times A_k$, and consider the set we obtain by
omitting those products $B_{1,s_1}\times\cdots\times B_{k,s_k}$
from the union at the right-hand side of~(\ref{(10.8)}) for which
$B_{i,s_i}=B_{j,s_j}$
for some $1\le i<j\le k$.

Let us divide the sets $D_j$, $1\le j\le k$, into the union of
small disjoint sets $D_j^{(m)}$, $1\le m\le M$, and the sets $B_j$
into the union of small disjoint sets $F_j^{(m)}$, $1\le j\le l$,
$1\le m\le M$, with some fixed number~$M$, in such a way that
$\mu(D_j^{(m)})\le\varepsilon$ and
$\mu(F_j^{(m)})\le \varepsilon$ with some fixed $\varepsilon>0$.
Beside this, we also require that two sets
$D_j^{(m)}$ and $F_{j'}^{(m')}$ should be either disjoint or
they should agree. (The sets $D_j^{(m)}$ are disjoint for
different indices, and the same relation holds for the
sets $F_{j'}^{(m')}$.)
Then the identities
$$
k!Z_{\mu,k}(f)=\prod_{j=1}^k
\left(\sum_{m=1}^M\mu_W(D_j^{(m)})\right)
$$
and
$$
l!Z_{\mu,l}(g)=\prod_{j'=1}^l
\left(\sum_{m'=1}^M\mu_W(F_{j'}^{(m')})\right),
$$
hold, and the product of these two Wiener--It\^o integrals can be
written in the form of a sum by means of a term by term
multiplication. Let us divide the terms of the sum we get in such a
way into classes indexed by the diagrams $\gamma\in\Gamma(k,l)$
in the following way: Each term in this sum is a product of the form
$\prod\limits_{j=1}^k\mu_W(D_j^{(m_j)})
\prod\limits_{j'=1}^l\mu_W(F_{j'}^{(m_{j'})})$. Let it belong to the
class indexed by the diagram $\gamma$ with edges
$((1,j_1),(2,j_1'))$,\dots, and $((1,j_s),(2,j'_s))$ if the elements
in the pairs $(D_{j_1}^{(m_{j_1})},F_{j'_1}^{(m_{j'_1})})$,\dots,
$(D_{j_s}^{(m_{j_s})},F_{j'_s}^{(m_{j'_s})})$ agree, while all
other factors are different. Then letting $\varepsilon\to0$ (and taking
partitions of the sets $D_j$ and $F_{j'}$ corresponding to the
parameter $\varepsilon$) the
sums of the terms in each class turn to integrals, and our
calculation suggests the identity
\begin{equation}
(k!Z_{\mu,k}(f))(l!Z_{\mu,l}(g))
=\sum_{\gamma\in\Gamma(k,l)}\bar Z_\gamma(f,g) \label{(10.13)}
\end{equation}
with
\begin{eqnarray}
\bar Z_\gamma(f,g)&&=\int
f(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k)})
g(x_{(2,1)},\dots,x_{(2,l)}) \label{(10.13a)} \\
&&\qquad \mu_W(\,dx_{\alpha_\gamma(1,1)})\dots
\mu_W(\,dx_{\alpha_\gamma(1,k)})
\mu_W(\,dx_{(2,1)})\dots\mu_W(\,dx_{(2,l)}) \nonumber
\end{eqnarray}
with the function $\alpha_\gamma(\cdot)$ introduced before
formula~(\ref{(10.9)}). The indices $\alpha_\gamma(1,j)$ of the
arguments in~(\ref{(10.13a)}) mean
that in the case $\alpha_\gamma(1,j)=(2,j')$ the argument
$x_{(1,j)}$ has to be replaced by $x_{(2,j')}$. In particular,
$$
\mu_W(\,dx_{\alpha_\gamma(1,j)})\mu_W(\,dx_{(2,j')})
=(\mu_W(\,dx_{(2,j')}))^2=\mu(\,dx_{(2,j')})
$$
in this case because of the `identity'
$(\mu_W(\,dx))^2=\mu(\,dx)$. Hence the above informal
calculation yields the identity
$\bar Z_\gamma(f,g)=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))$,
and relations~(\ref{(10.13)}) and~(\ref{(10.13a)}) imply
formula~(\ref{(10.12)}).
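\medskip
In the simplest case $k=l=1$ formula~(\ref{(10.13)}) reduces to the
classical identity
$$
\int f(x)\mu_W(\,dx)\int g(y)\mu_W(\,dy)
=\int f(x)g(y)\mu_W(\,dx)\mu_W(\,dy)+\int f(x)g(x)\mu(\,dx),
$$
since $\Gamma(1,1)$ contains only two diagrams, the diagram with no
edge and the diagram with the single edge $((1,1),(2,1))$, and in
the latter case the `identity' $(\mu_W(\,dx))^2=\mu(\,dx)$ replaces
the white noise by the reference measure~$\mu$.
\medskip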
A similar heuristic argument can be applied to get formulas for
the product of integrals of normalized empirical distributions or
(normalized) Poisson fields, only the starting `identity'
$(\mu_W(\,dx))^2=\mu(\,dx)$ changes in these cases, some additional
terms appear in it, which modify the final result. I return to
this question in the next section.
\medskip
It is not difficult to generalize Theorem~10.2A with the help of
some additional notations to a diagram formula about the product
of finitely many Wiener--It\^o integrals. Let us consider $m\ge2$
Wiener--It\^o integrals $k_p!Z_{\mu,k_p}(f_p)$ of functions
$f_p(x_1,\dots,x_{k_p})\in{\cal H}_{\mu,k_p}$ of order
$k_p\ge1$, $1\le p\le m$, and define a class of diagrams
$\Gamma=\Gamma(k_1,\dots,k_m)$ in the following way.
The diagrams $\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ have
vertices of the form $(p,r)$, $1\le p\le m$, $1\le r\le k_p$. The
set of vertices $\{(p,r)\colon\, 1\le r\le k_p\}$ with a fixed number
$p$ will be called the $p$-th row of the diagram $\gamma$. A diagram
$\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ may have some edges. All
edges of a diagram connect vertices from different rows, and from
each vertex there starts at most one edge. All diagrams satisfying
these properties belong to $\Gamma(k_1,\dots,k_m)$. If a diagram
$\gamma$ contains an edge of the form $((p_1,r_1),(p_2,r_2))$, we
always write it with $p_1<p_2$. Given a diagram
$\gamma\in\Gamma(k_1,\dots,k_m)$ we shall enumerate its open
chains $\beta\in O(\gamma)$ with the numbers
$1,\dots,|O(\gamma)|$. The subsequent definition of the functions
$F_\gamma(f_1,\dots,f_m)$ will depend on the previously fixed
enumeration of the chains of diagrams with two rows and of the open
chains of diagrams with arbitrary many rows. But the results
formulated with the help of these functions are valid for an
arbitrary enumeration of these chains. Hence the non-uniqueness
in the definition of $F_\gamma(f_1,\dots,f_m)$ will cause no
problem.
To define the functions~$F_\gamma(f_1,\dots,f_m)$ we
introduce the following operators. Given a function
$h(x_{l_1},\dots,x_{l_r})$ with coordinates in the space
$(X,{\cal X})$ (the indices $l_1,\dots,l_r$ are all different,
otherwise they can be chosen in an arbitrary way) and a probability
measure~$\mu$ on the space $(X,{\cal X})$ we introduce
the transformations $P_{l_p}h$ and $Q_{l_p}h$, $1\le p\le r$, by
the formulas
\begin{eqnarray}
(P_{l_p}h)(x_{l_1},\dots,x_{l_{p-1}},x_{l_{p+1}},\dots,x_{l_r})
&&=\int h(x_{l_1},\dots,x_{l_r})\mu(\,dx_{l_p}), \nonumber \\
&&\qquad\qquad 1\le p\le r, \label{(11.1)}
\end{eqnarray}
and
\begin{eqnarray}
(Q_{l_p}h)(x_{l_1},\dots,x_{l_r})&&=h(x_{l_1},\dots,x_{l_r})-
\int h(x_{l_1},\dots,x_{l_r})\mu(\,dx_{l_p}), \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad 1\le p\le r. \label{(11.2)}
\end{eqnarray}
(These formulas actually agree with the definition of the operators
$P_j$ and $Q_j$ in formulas~(\ref{(9.1)}) and~(\ref{(9.1a)}), only
the notation is slightly different.)
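Let us observe some simple but useful properties of these
operators which can be checked directly. The function $P_{l_p}h$
does not depend on the variable $x_{l_p}$, and
$Q_{l_p}h=h-P_{l_p}h$ if $P_{l_p}h$ is regarded as a function
which is constant in the variable $x_{l_p}$. Beside this,
$$
\int (Q_{l_p}h)(x_{l_1},\dots,x_{l_r})\mu(\,dx_{l_p})=0,
$$
i.e. the operator $Q_{l_p}$ makes a function canonical in its
variable $x_{l_p}$. Moreover, since $\mu$ is a probability
measure, $P_{l_p}$ and $Q_{l_p}$ are projections,
$P_{l_p}^2=P_{l_p}$, $Q_{l_p}^2=Q_{l_p}$, and
$P_{l_p}Q_{l_p}=Q_{l_p}P_{l_p}=0$.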
First we formulate the diagram formula for the product of two
degenerate $U$-statistics, i.e. we consider the case $m=2$. Let
us have a measurable space $(X,{\cal X})$ with a probability
measure~$\mu$ on it together with two measurable functions
$f_1(x_{1},\dots,x_{k_1})$ and $f_2(x_1,\dots,x_{k_2})$ of
$k_1$ and $k_2$ variables on this space which are canonical
with respect to the measure $\mu$. Let $\xi_1,\xi_2,\dots$ be
a sequence of $(X,{\cal X})$ valued, independent and identically
distributed random variables with distribution~$\mu$. We want
to express the product
$n^{-k_1/2}k_1!I_{n,k_1}(f_1)n^{-k_2/2}k_2!I_{n,k_2}(f_2)$
of normalized degenerate $U$-statistics defined with the help
of the above random variables and kernel functions~$f_1$
and~$f_2$ as a sum of normalized degenerate $U$-statistics. For
this goal we define some functions $F_\gamma(f_1,f_2)$ for all
$\gamma\in\Gamma(k_1,k_2)$.
We shall define the function $F_\gamma(f_1,f_2)$ with the help
of the previously fixed enumeration of the chains of the
diagram~$\gamma$. We shall introduce with the help of this
enumeration also an enumeration of the vertices $(1,p)$, $(2,q)$,
$1\le p\le k_1$, $1\le q\le k_2$, of the diagram~$\gamma$. We put
$\alpha_\gamma(p,r)=l_s$ if $(p,r)\in\beta(l_s)$.
Let us have two functions $f_1(x_1,\dots,x_{k_1})$ and
$f_2(x_1,\dots,x_{k_2})$ together with a coloured diagram
$\gamma\in\Gamma(k_1,k_2)$. We define the function
$F_\gamma(f_1,f_2)$ in two steps. First we define the function
\begin{eqnarray}
&&(f_1\circ f_2)_\gamma(x_{l_1},\dots,x_{l_s}) \nonumber \\
&&\qquad= f_1(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k_1)})
f_2(x_{\alpha_\gamma(2,1)},\dots,x_{\alpha_\gamma(2,k_2)}),
\label{(11.3)}
\end{eqnarray}
where $l_1,\dots,l_s$, $l_1<\cdots<l_s$, are the indices of the
chains of the diagram~$\gamma$ in the previously fixed
enumeration. Put
\begin{equation}
W(\gamma)=\sum_{\beta\in O(\gamma)}(\ell(\beta)-1)
+\sum_{\beta\in C(\gamma)}(\ell(\beta)-2),\quad
\gamma\in\Gamma(k_1,\dots,k_m), \label{(11.9)}
\end{equation}
where $\ell(\beta)$ denotes the length of the chain~$\beta$.
To define the next quantity let us first introduce the following
notation. Given a chain $\beta=\{(p_1,r_1),\dots,(p_l,r_l)\}$,
$1\le p_1<\cdots<p_l\le m$, put $u(\beta)=p_1$ and $d(\beta)=p_l$.
For $2\le p\le m$ define the classes of chains
${\cal B}_1(\gamma,p)=\{\beta\in\gamma\colon\, c_\gamma(\beta)=1,\;
d(\beta)=p\}$ and
${\cal B}_2(\gamma,p)=\{\beta\in\gamma\colon\, c_\gamma(\beta)=-1,\;
d(\beta)\le p\}\cup
\{\beta\in\gamma\colon\, u(\beta)\le p<d(\beta)\}$,
i.e. ${\cal B}_1(\gamma,p)$ consists of those chains $\beta\in\gamma$
which have colour~$1$, all their vertices are in the first~$p$
rows of the diagram, and contain a vertex in the~$p$-th row, while
${\cal B}_2(\gamma,p)$ consists of those chains $\beta\in\gamma$
which have either colour~$-1$, and all their vertices are in the
first~$p$ rows of the diagram, or they have (with an arbitrary
colour) a vertex both in the first~$p$ rows and in the remaining
rows of the diagram. Put $B_1(\gamma,p)=|{\cal B}_1(\gamma,p)|$ and
$B_2(\gamma,p)=|{\cal B}_2(\gamma,p)|$.
With the help of these numbers we define
\begin{equation}
J_n(\gamma,p)=\left\{
\begin{array}{l}
\prod\limits_{j=1}^{B_1(\gamma,p)}
\left(\frac{n-B_1(\gamma,p)-B_2(\gamma,p)+j}n\right)
\quad\textrm{if } B_1(\gamma,p)\ge1\\
\quad 1\quad \textrm{if } B_1(\gamma,p)=0
\end{array}
\right. \label{(11.10)}
\end{equation}
for all $2\le p\le m$ and diagrams $\gamma\in\Gamma(k_1,\dots,k_m)$.
Theorem 11.2 will be formulated with the help of the above notations.
\medskip\noindent
{\bf Theorem 11.2 (The diagram formula for the product of several
degenerate $U$-statistics).}\index{diagram formula for the product
of degenerate $U$-statistics} {\it Let a sequence of independent
and identically distributed random variables $\xi_1,\xi_2,\dots$
be given with some distribution $\mu$ on a measurable space
$(X,{\cal X})$ together with $m\ge2$ bounded functions
$f_p(x_1,\dots,x_{k_p})$ on the spaces $(X^{k_p},{\cal X}^{k_p})$,
$1\le p\le m$, canonical with respect to the probability
measure~$\mu$. Let us consider the class of coloured diagrams
$\Gamma(k_1,\dots,k_m)$ together with the functions
$F_\gamma=F_{\gamma}(f_1,\dots,f_m)$,
$\gamma\in\Gamma(k_1,\dots,k_m)$, defined in formulas
(\ref{(11.8)}) and the constants $W(\gamma)$
and $J_n(\gamma,p)$, $1\le p\le m$, given in
formulas~(\ref{(11.9)}) and~(\ref{(11.10)}).
The functions $F_\gamma(f_1,\dots,f_m)$ are canonical with
respect to the measure $\mu$ with $|O(\gamma)|$ variables,
and the product of the degenerate $U$-statistics
$I_{n,k_p}(f_p)$, $1\le p\le m$,
$n\ge \max\limits_{1\le p\le m} k_p$, defined in~(\ref{(8.7)})
can be written in the form
\begin{eqnarray}
&&\prod_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)
={\sum_{\gamma\in\Gamma(k_1,\dots,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n,\,m)}
\left(\prod_{p=2}^m J_n(\gamma,p)\right) \nonumber \\
&&\qquad n^{-W(\gamma)/2}\cdot n^{-|O(\gamma)|/2} |O(\gamma)|!
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m)),
\label{(11.11)}
\end{eqnarray}
where $\sum^{\prime(n,\,m)}$ means that summation is taken
for those $\gamma\in\Gamma(k_1,\dots,k_m)$ which satisfy the
relation $B_1(\gamma,p)+B_2(\gamma,p)\le n$ for all
$2\le p\le m$ with the quantities $B_1(\gamma,p)$ and
$B_2(\gamma,p)$ introduced before the definition of
$J_n(\gamma,p)$ in~(\ref{(11.10)}), and the expression
$W(\gamma)$ was defined in~(\ref{(11.9)}). The terms
$I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))$ at the
right-hand side of formula (\ref{(11.11)}) can be replaced by
$I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,\dots,f_m))$.}
\medskip
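To illuminate the structure of formula~(\ref{(11.11)}) let us write
it up in the simplest case $m=2$, $k_1=k_2=1$, when
$I_{n,1}(f_p)=\sum\limits_{j=1}^n f_p(\xi_j)$, $p=1,2$. Separating
the diagonal $j=j'$ in the product and centering the diagonal term
we get the elementary identity
\begin{eqnarray*}
&&\left(n^{-1/2}\sum_{j=1}^n f_1(\xi_j)\right)
\left(n^{-1/2}\sum_{j'=1}^n f_2(\xi_{j'})\right)
=\frac1n\sum_{j\neq j'} f_1(\xi_j)f_2(\xi_{j'}) \\
&&\qquad+\frac1n\sum_{j=1}^n\left(f_1(\xi_j)f_2(\xi_j)
-\int f_1f_2\,d\mu\right)+\int f_1f_2\,d\mu,
\end{eqnarray*}
i.e. the product is the sum of normalized degenerate
$U$-statistics of order 2, 1 and~0. The three terms correspond to
the diagrams of $\Gamma(1,1)$ without an edge, with a chain of
length~2 and colour~$-1$, and with a chain of length~2 and
colour~1, respectively, and the second term carries the factor
$n^{-W(\gamma)/2}=n^{-1/2}$ beside the normalization $n^{-1/2}$
of the $U$-statistic of order~1.
\medskip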
In Theorem 11.2 the product of such degenerate $U$-statistics was
considered whose kernel functions are bounded. This also implies
that all functions $F_\gamma$ appearing at the right-hand side of
(\ref{(11.11)}) are well-defined (i.e. the integrals appearing in
their definition are convergent) and bounded. In the applications
of Theorem~11.2 it is useful to have more information about the
behaviour of the functions $F_\gamma(f_1,\dots,f_m)$. We shall
need some good bound on their $L_2$-norm. Such a result is
formulated in the following
\medskip\noindent
{\bf Lemma 11.3 (Estimate about the $L_2$-norm of the kernel
functions of the $U$-statistics appearing in the diagram
formula).}\index{bound on the kernel functions in the diagram
formula for $U$-statistics}
{\it Let $m$ functions $f_p(x_1,\dots,x_{k_p})$ be given on the
products $(X^{k_p},{\cal X}^{k_p},\mu^{k_p})$ of some measure
space $(X,{\cal X},\mu)$, $1\le p\le m$, with a probability
measure $\mu$, which satisfy inequalities~(\ref{(8.1)}) (if the
index $k$ is replaced by the index $k_p$ in them).
Let us take a coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$,
and consider the function $F_\gamma(f_1,\dots,f_m)$ defined by
formulas (\ref{(11.8)}). The $L_2$-norm of
the function $F_\gamma(f_1,\dots,f_m)$ (with respect to the
appropriate power of the measure~$\mu$ on the space where
$F_\gamma(f_1,\dots,f_m)$ is defined) satisfies the inequality
$$
\|F_\gamma(f_1,\dots,f_m)\|_2
\le2^{W(\gamma)}\prod_{p\in U(\gamma)} \|f_p\|_2,
$$
where $W(\gamma)$ is given in~(\ref{(11.9)}), and the set
$U(\gamma)\subset\{1,\dots,m\}$ is defined as
\begin{eqnarray}
U(\gamma)&&=\{p\colon\; 1\le p\le m,\quad\textrm{for all vertices }
(p,r),\; 1\le r\le k_p \textrm{ the chain }\beta\in\gamma
\nonumber \\
&&\qquad \textrm{ for which } (p,r)\in\beta \textrm{ has the
property that either } u(\beta)=p \nonumber \\
&&\qquad\textrm{ or } d(\beta)=p\textrm{ and } c_\gamma(\beta)=1\}.
\label{(11.12)}
\end{eqnarray}
(If the point $(p,r)$ is such that $\beta=\{(p,r)\}\in\gamma$ is
the chain containing it, then $u(\beta)=d(\beta)=p$, and
$c_\gamma(\beta)=1$. In this case the vertex $(p,r)$ satisfies the
condition which all vertices $(p,r)$, $1\le r\le k_p$, must
satisfy to guarantee the property $p\in U(\gamma)$.)}
\medskip
The last result of this section is a corollary of Theorem~11.2. In
this corollary we give an estimate on the expected value of a product
of degenerate $U$-statistics. To formulate this result we introduce
the following terminology. Let us call a (coloured) diagram
$\gamma\in\Gamma(k_1,\dots,k_m)$ closed if $c_\gamma(\beta)=1$ for
all chains $\beta\in\gamma$. Let us denote the set of all closed
diagrams by $\bar\Gamma(k_1,\dots,k_m)$. Observe that
$F_\gamma(f_1,\dots,f_m)$ is constant (a function of zero variables)
if and only if $\gamma$ is a closed diagram, i.e.
$\gamma\in\bar\Gamma(k_1,\dots,k_m)$, and
$$
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))
=I_{n,0}(F_\gamma(f_1,\dots,f_m))
=F_\gamma(f_1,\dots,f_m)
$$
in this case. Now we formulate the following result.
\medskip\noindent
{\bf Corollary of Theorem 11.2 about the expectation of a product
of degenerate $U$-statistics.}\index{calculation of the expectation
of a product of degenerate $U$-statistics}
{\it Let a finite sequence of functions $f_p(x_1,\dots,x_{k_p})$,
$1\le p\le m$, be given on the products $(X^{k_p},{\cal X}^{k_p})$ of
some measurable space $(X,{\cal X})$ together with a sequence of
independent and identically distributed random variables with
value in the space $(X,{\cal X})$ and some distribution~$\mu$
which satisfy the conditions of Theorem 11.2.
Let us apply the notation of Theorem~11.2 together with the notion
of the above introduced class of closed diagrams
$\bar\Gamma(k_1,\dots,k_m)$. The identity
\begin{eqnarray}
&&E\left(\prod_{p=1}^m k_p! n^{-k_p/2}I_{n,k_p}(f_p)\right)
\label{(11.13)} \\
&&\qquad = {\sum_{\gamma\in\bar\Gamma(k_1,\dots,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\prime(n,m)}
\left(\prod_{p=1}^m
J_n(\gamma,p)\right) n^{-W(\gamma)/2}\cdot F_\gamma(f_1,\dots,f_m)
\nonumber
\end{eqnarray}
holds. This identity has the consequence
\begin{equation}
\left|E\left(\prod_{p=1}^m k_p! n^{-k_p/2}
I_{n,k_p}(f_p)\right)\right|
\le \sum_{\gamma\in\bar\Gamma(k_1,\dots,k_m)}
n^{-W(\gamma)/2}|F_\gamma(f_1,\dots,f_m)|.
\label{(11.14)}
\end{equation}
Beside this, if the functions~$f_p$, $1\le p\le m$, satisfy
conditions~(\ref{(8.1)}) and~(\ref{(8.2)}) (with indices~$k_p$
instead of~$k$ in them), then the numbers
$|F_\gamma(f_1,\dots,f_m)|$ at the right-hand
side of~(\ref{(11.14)}) satisfy the inequality
\begin{eqnarray}
|F_\gamma(f_1,\dots,f_m)|\le2^{W(\gamma)}\sigma^{|U(\gamma)|} \quad
\textrm{for all } \gamma\in\bar\Gamma(k_1,\dots,k_m).
\label{(11.15)}
\end{eqnarray}
In formula~(\ref{(11.15)}) the same number~$W(\gamma)$ and
set $U(\gamma)$ appear as in Lemma 11.3. The only difference is
that in the present case the definition of $U(\gamma)$ becomes a bit
simpler, since $c_\gamma(\beta)=1$ for all chains $\beta\in\gamma$.}
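\medskip
As an illustration consider again the case $m=2$, $k_1=k_2=1$. The
closed diagram $\gamma_0$ consisting of the single chain
$\{(1,1),(2,1)\}$ with colour~1 satisfies $W(\gamma_0)=0$,
$J_n(\gamma_0,2)=1$ and
$F_{\gamma_0}(f_1,f_2)=\int f_1f_2\,d\mu$, while a closed diagram
containing a chain of length~1 yields a zero contribution, since
the functions $f_1$ and~$f_2$ are canonical. Hence
formula~(\ref{(11.13)}) states in this case that
$$
E\left(\frac1n I_{n,1}(f_1)I_{n,1}(f_2)\right)=\int f_1f_2\,d\mu,
$$
which can also be checked directly, because
$Ef_1(\xi_j)f_2(\xi_{j'})=Ef_1(\xi_j)Ef_2(\xi_{j'})=0$ for
$j\neq j'$.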
\medskip\noindent
{\it Remark:}\/ We have applied a different terminology for
diagrams in this section and in Section~10, where the theory
of Wiener--It\^o integrals was discussed. But there is a simple
relation between the terminology of these sections. If we take
only those diagrams from the diagrams considered in this section
which contain only chains of length~1 or~2, and beside this the
chains of length~1 have colour~$-1$, and the chains of length~2
have colour~1, then we get the diagrams considered in the previous
section. Moreover, the functions
$F_\gamma=F_\gamma(f_1,\dots,f_m)$ are the same in the two cases.
Hence formula~(\ref{(10.18)}) in the Corollary of Theorem~10.2 and
formula~(\ref{(11.14)}) in the Corollary of Theorem~11.2 make
possible to compare the moments of Wiener--It\^o integrals and
degenerate $U$-statistics.
The main difference between these estimates is that
formula~(\ref{(11.14)})
contains some additional terms. They are the contributions of
those diagrams $\gamma\in\bar\Gamma(k_1,\dots,k_m)$ which
contain chains $\beta\in\gamma$ with length $\ell(\beta)>2$.
These are those diagrams $\gamma\in\bar\Gamma(k_1,\dots,k_m)$
for which $W(\gamma)\ge1$. The estimate~(\ref{(11.15)}) given
for the terms $F_\gamma$ corresponding to such diagrams is
weaker than the estimate given for the terms $F_\gamma$ with
$W(\gamma)=0$, since $|U(\gamma)|<m$ for the diagrams with
$W(\gamma)\ge1$.

Relation~(\ref{(12.4)}) follows from relation~(\ref{(12.2)})
in the same way as formula~(\ref{(9.3)}) follows from
formula~(\ref{(9.2)}) in the proof of the Hoeffding
decomposition. Let us understand why the coefficient
$n^{|C(\gamma)|}J_n(\gamma)$ appears at the right-hand
side of~(\ref{(12.4)}).
This coefficient can be calculated in the following way.
Let us write up the identity
\begin{eqnarray*}
&&n^{-(k_1+k_2)/2}|O(\bar\gamma)|
\left((f_1\circ f_2)_{\bar\gamma}\right)
(\xi_{j_1},\dots,\xi_{j_{s(\bar\gamma)}})\\
&&\qquad =\sum_{\gamma\in\Gamma(\bar\gamma)}
n^{-(k_1+k_2)/2}\bar F_\gamma(f_1,f_2)
(\xi_{j_{l_1}},\dots,\xi_{j_{l_{|O(\gamma)|}}})
\end{eqnarray*}
with the help of~(\ref{(12.2)}), and let us sum up these identities
for all such sets of arguments $(j_1,\dots,j_{s(\bar\gamma)})$
for which all $j_p$, $1\le p\le s(\bar\gamma)$, are different,
and $1\le j_p\le n$. Then we get at the left-hand side of the
identity the $U$-statistic
$n^{-(k_1+k_2)/2}|O(\bar\gamma)|! I_{n,\bar s(\bar\gamma)}
\left((f_1\circ f_2)_{\bar\gamma}\right)$. We still have to
check that a term of the form
$n^{-(k_1+k_2)/2}\bar F_\gamma(f_1,f_2)
(\xi_{j_{l_1}},\dots,\xi_{j_{l_{|O(\gamma)|}}})$ appears
with multiplicity $n^{|C(\gamma)|}J_n(\gamma)$ at the right-hand
side of this identity. Indeed, such a term appears for such
vectors $(j_1,\dots, j_{s(\bar\gamma)})$ for which the value of
$|O(\gamma)|$ arguments are fixed, and the remaining arguments can
take arbitrary values between~1 and~$n$ with the only restriction
that all coordinates must be different.
There are $n^{|C(\gamma)|}J_n(\gamma)$ such vectors. The above
observations imply identity~(\ref{(12.4)}).
Let us observe that $k_1+k_2-2|C(\gamma)|=|O(\gamma)|+W(\gamma)$
with the number $W(\gamma)$ introduced in the formulation of
Theorem~11.1. Hence
$$
n^{-(k_1+k_2)/2}n^{|C(\gamma)|}=n^{-W(\gamma)/2}n^{-|O(\gamma)|/2}.
$$
Let us replace the left-hand side of the last identity by its
right-hand side in~(\ref{(12.4)}), and let us sum up the identity
we get in such a way for all $\bar\gamma\in\bar\Gamma(k_1,k_2)$
such that $s(\bar\gamma)\le n$. The identity we get in such a way
together with formulas~(\ref{(12.1)}) and~(\ref{(12.5)}) imply
such a version of identity~(\ref{(11.5)}) where the kernel functions
$F_\gamma(f_1,f_2)$ of the $U$-statistics at the right-hand side
of the equation are replaced by the kernel functions
$\bar F_\gamma(f_1,f_2)$ defined in~(\ref{(12.3)}). But we can
get the function $F_\gamma(f_1,f_2)$ by reindexing the arguments
of the function $\bar F_\gamma(f_1,f_2)$. This has the consequence
that $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))
=I_{n,|O(\gamma)|}(\bar F_\gamma(f_1,f_2))$, and identity~(\ref{(11.5)})
holds in its original form.
Clearly,
$$
I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))=
I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,f_2)),
$$
hence $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))$ can be
replaced by
$I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,f_2))$ in
formula~(\ref{(11.5)}). Beside this, we have shown that the
functions $F_\gamma(f_1,f_2)$ are canonical, and it can be simply
shown that they are bounded, if the functions~$f_1$ and~$f_2$ are
bounded. We still have to prove inequalities~(\ref{(11.6)})
and~(\ref{(11.7)}).
\medskip
Inequality (\ref{(11.6)}), the estimate of the $L_2$-norm of the
function $F_\gamma(f_1,f_2)$ follows from the Schwarz
inequality, and actually it agrees with
inequality~(\ref{(10.11)}), proved at the start
of Appendix~B. Hence its proof is omitted here.
To prove inequality (\ref{(11.7)}) let us introduce, similarly to
formula (\ref{(11.2)}), the operators
\begin{eqnarray*}
\tilde Q_{l_p}h(x_{l_1},\dots,x_{l_r})
&&=h(x_{l_1},\dots,x_{l_r})+
\int h(x_{l_1},\dots,x_{l_r})\mu(\,dx_{l_p}), \\
&&\qquad 1\le p\le r,
\end{eqnarray*}
in the space of functions $h(x_{l_1},\dots,x_{l_r})$ with coordinates
in the space $(X,{\cal X})$. (The indices $l_1,\dots,l_r$ are all
different.) Observe that both the operators $\tilde Q_{l_p}$ and
the operators $P_{l_p}$ defined in (\ref{(11.1)}) are positive,
i.e. they map a non-negative function to a non-negative function.
Beside this, $Q_{l_p}\le\tilde Q_{l_p}$, and the norms of the
operators $\frac{\tilde Q_{l_p}}2$ and $P_{l_p}$ are bounded by 1
both in the $L_1(\mu)$, the $L_2(\mu)$ and the supremum norm.
Let us define the function
\begin{eqnarray*}
&&(\tilde F_\gamma(f_1,f_2))(x_{l_1},\dots,x_{l_{|O(\gamma)|}})
\\
&&\qquad=\left(\prod_{p\colon \beta(l_p)\in C(\gamma)}P_{l_p}
\prod_{p\colon \beta(l_p)\in O_2(\gamma)} \tilde Q_{l_p}\right)
(f_1\circ f_2)_\gamma(x_{l_1},\dots,x_{l_s})
\end{eqnarray*}
with the notation of Section~11, where
$s=s(\gamma)=|C(\gamma)|+|O(\gamma)|$. The function
$\tilde F_\gamma(f_1, f_2)$ was defined similarly to the function
$F_\gamma(f_1,f_2)$ introduced in~(\ref{(11.4)}) with the help of
$(f_1\circ f_2)_\gamma$; only the operators $Q_{l_p}$
were replaced by $\tilde Q_{l_p}$ in its definition.
The properties of the operators $P_{l_p}$ and $\tilde Q_{l_p}$
listed above together with the condition
$\sup|f_2(x_1,\dots,x_{k_2})|\le1$ imply that
\begin{equation}
|F_\gamma(f_1,f_2)|\le \tilde F_\gamma(|f_1|,|f_2|)
\le \tilde F_\gamma(|f_1|,1), \label{(12.6)}
\end{equation}
where `$\le$' means that the function at the right-hand side is
greater than or equal to the function at the left-hand side in
all points, and the term~1 in~(\ref{(12.6)}) denotes the function
which equals identically~1. Because of
the relation~(\ref{(12.6)}) to prove relation~(\ref{(11.7)})
it is enough to show that
\begin{eqnarray}
&&\|\tilde F_\gamma(|f_1|,1)\|_2 \nonumber \\
&&\qquad=\left\|\left(\prod_{p\colon \beta(l_p)\in C(\gamma)} P_{l_p}
\prod_{p\colon \beta(l_p)\in O_2(\gamma)} \tilde Q_{l_p}\right)
|f_1(x_{\alpha_\gamma(1,1)},
\dots,x_{\alpha_\gamma(1,k_1)})|\right\|_2 \nonumber \\
&&\qquad\le 2^{|O_2(\gamma)|}\|f_1\|_2=2^{W(\gamma)}\|f_1\|_2.
\label{(12.7)}
\end{eqnarray}
But this inequality trivially holds, since the norm of each
operator $P_{l_p}$ in formula (\ref{(12.7)}) is bounded
by~1 and the norm of each operator $\tilde Q_{l_p}$ is bounded
by~2 in the $L_2(\mu)$ norm, and $|O_2(\gamma)|=W(\gamma)$.
\medskip\noindent
{\it Proof of Theorem 11.2.} Theorem~11.2 will be proved with the
help of Theorem~11.1 by induction with respect to the number~$m$
of the terms in the product of the degenerate $U$-statistics
$k_p!I_{n,k_p}(f_p)$, $1\le p\le m$. For $m=2$ Theorem~11.2
follows from Theorem~11.1, since
formula~(\ref{(11.5)}) agrees with formula~(\ref{(11.11)})
for~$m=2$. To prove Theorem~11.2 for $m\ge3$ first we express
with the help of our inductive hypothesis the product of the
first $m-1$ terms in the product of degenerate $U$-statistics
as a sum of degenerate $U$-statistics, and then we calculate the
product of each term in this sum with the last term of the
product with the help of Theorem~11.1. To show
that we get the identity~(\ref{(11.11)}) formulated in
Theorem~11.2 we have to observe some properties of the
decomposition of the diagrams $\gamma\in\Gamma(k_1,\dots,k_m)$
to a pair of diagrams $\gamma_{pr}\in\Gamma(k_1,\dots,k_{m-1})$
and $\gamma_{cl}\in\Gamma(|O(\gamma_{pr})|,k_m)$. Let me recall
that during the definition of the functions
$F_\gamma(f_1,\dots,f_m)$ we have fixed an enumeration of
the open chains of the diagrams
$\gamma\in\Gamma(k_1,\dots,k_m)$ for all $m=2,3,\dots$, hence
also of the open chains of the diagrams
$\gamma_{pr}\in\Gamma(k_1,\dots,k_{m-1})$.
Let us observe that the pair $(\gamma_{pr},\gamma_{cl})$ uniquely
determines the diagram $\gamma\in\Gamma(k_1,\dots,k_m)$, i.e. if
$\gamma,\gamma'\in\Gamma(k_1,\dots,k_m)$, and if
$\gamma\neq\gamma'$, then either $\gamma_{pr}\neq\gamma'_{pr}$ or
$\gamma_{cl}\neq\gamma'_{cl}$. Hence we can identify each
diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ with the pair
$(\gamma_{pr},\gamma_{cl})$ we defined with its help. Beside
this, the pairs of diagrams $(\gamma_{pr},\gamma_{cl})$
satisfy the relation $\gamma_{cl}\in\Gamma(|O(\gamma_{pr})|,k_m)$.
Moreover, the class of pairs of diagrams
$(\gamma_{pr},\gamma_{cl})$, $\gamma\in\Gamma(k_1,\dots,k_m)$,
have the following characterization. A one-to-one
correspondence can be given between the pairs of diagrams
$(\bar\gamma,\hat\gamma)$ such that
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ and
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$ and the diagrams
$\gamma\in\Gamma(k_1,\dots,k_m)$ in such a way that
$\bar\gamma=\gamma_{pr}$ and $\hat\gamma=\gamma_{cl}$. (This
correspondence depends on the enumeration of the open chains
of the diagrams $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ that
we have previously fixed.) The proof of the above statements
is not difficult, and I leave it to the reader.
Because of our inductive hypothesis we can write by applying
relation~(\ref{(11.11)}) of Theorem~11.2 with parameter~$m-1$
the identity
\begin{eqnarray}
&&\prod_{p=1}^{m-1} n^{-k_p/2}k_p!I_{n,k_p}(f_p)
={\sum_{\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n,\,m-1)}
\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right) \nonumber \\
&&\qquad n^{-W(\bar\gamma)/2}
\cdot n^{-|O(\bar\gamma)|/2} |O(\bar\gamma)|!
I_{n,|O(\bar\gamma)|}(F_{\bar\gamma}(f_1,\dots,f_{m-1})).
\label{(12.8)}
\end{eqnarray}
(Here we use the notations of Section~11.)
We get by multiplying the identity~(\ref{(11.5)}) of
Theorem~11.1 with an appropriate constant that the identity
\begin{eqnarray}
&&\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2}n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!
I_{n,|O(\bar\gamma)|}(F_{\bar\gamma}(f_1,\dots,f_{m-1})) \nonumber\\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\cdot n^{-k_m/2}k_m! I_{n,k_m}(f_m) \nonumber \\
&&\qquad=\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2} \!\!\!
{\sum_{\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\prime(n)}
\,\, \prod_{j=1}^{|C(\hat\gamma)|}
\left(\frac{n-s(\hat\gamma)+j}n\right)
n^{-W(\hat\gamma)/2}\nonumber \\
&&\qquad\qquad n^{-|O(\hat\gamma)|/2}|O(\hat\gamma)|!
I_{n,|O(\hat\gamma)|}
(F_{\hat\gamma}(F_{\bar\gamma}(f_1,\dots,f_{m-1}),f_m))
\label{(12.9)}
\end{eqnarray}
holds for all $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$,
where ${\sum\limits_{\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\prime(n)}$
means that summation is taken for such diagrams
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$ for which
$s(\hat\gamma)=|O(\hat\gamma)|+|C(\hat\gamma)|\le n$, and
$\prod\limits_{j=1}^{|C(\hat\gamma)|}$ equals~1, if
$|C(\hat\gamma)|=0$.
We get~(\ref{(12.9)}) by applying the identity~(\ref{(11.5)})
of Theorem~11.1 for the product
$$
n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!I_{n,|O(\bar\gamma)|}
(F_{\bar\gamma}(f_1,\dots,f_{m-1}))\cdot n^{-k_m/2}k_m!I_{n,k_m}(f_m),
$$
and by multiplying it with
$\left(\prod\limits_{p=2}^{m-1}J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2}$.
We shall prove relation~(\ref{(11.11)}) for the parameter~$m$
with the help of relations~(\ref{(12.8)}) and~(\ref{(12.9)}).
Let us sum up formula~(\ref{(12.9)}) for all such diagrams
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ for which
$B_1(\bar\gamma,p)+B_2(\bar\gamma,p)\le n$ for all $2\le p\le m-1$.
The numbers $B_1(\cdot)$ and $B_2(\cdot)$ in these inequalities
are the numbers introduced before formula~(\ref{(11.10)}), only
in this case the diagram~$\gamma$ is replaced by~$\bar\gamma$.
We imposed those conditions on the terms~$\bar\gamma$ in this
summation which appear in the conditions of the summation in
${\sum}^{\prime(n,m-1)}$ at the right-hand side of
formula~(\ref{(12.8)}) when it is applied with parameter~$m-1$.
Hence formula~(\ref{(12.8)}) implies that the sum of the
terms at the left-hand side of these identities equals
$\prod\limits_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)$, i.e.
the left-hand side of~(\ref{(11.11)}) for parameter~$m$. To
prove formula~(\ref{(11.11)}) for the parameter~$m$ it is
enough to show that the sum of the right-hand sides of the above
identities equals the right-hand side of~(\ref{(11.11)}).
In the proof of this relation we shall apply the properties of
the pairs of diagrams $(\gamma_{pr},\gamma_{cl})$ coming from a
diagram~$\gamma\in\Gamma(k_1,\dots,k_m)$ mentioned before.
Namely, we shall exploit that there is a one-to-one
correspondence between the diagrams
$\gamma\in\Gamma(k_1,\dots,k_m)$ and pairs of diagrams
$(\bar\gamma,\hat\gamma)$, $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$,
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$ in such a way that
$\gamma$ and the pair $(\bar\gamma,\hat\gamma)$ correspond to each
other if and only if $\bar\gamma=\gamma_{pr}$ and
$\hat\gamma=\gamma_{cl}$. This correspondence enables us to
reformulate the statement we have to prove in the following way.
Let us rewrite formula~(\ref{(12.9)}) by replacing $\bar\gamma$
with $\gamma_{pr}$ and $\hat\gamma$ with $\gamma_{cl}$, where
$\gamma\in\Gamma(k_1,\dots,k_m)$ is the diagram for which
$\bar\gamma=\gamma_{pr}$ and $\hat\gamma=\gamma_{cl}$. It is
enough to show that if
we take those modified versions of~(\ref{(12.9)}) which we
get by replacing the pairs $(\bar\gamma,\hat\gamma)$ by the
pairs $(\gamma_{pr},\gamma_{cl})$ with some
$\gamma\in\Gamma(k_1,\dots,k_m)$,
and sum up them for those~$\gamma$ for which
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, then the sum of the right-hand side
expressions in these identities equals the right-hand
side of~(\ref{(11.11)}).
We shall prove the above identity with the help of the
following statements. For all $\gamma\in\Gamma(k_1,\dots,k_m)$
the identities $W(\gamma_{pr})+W(\gamma_{cl})=W(\gamma)$ and
$$
\prod\limits_{p=2}^{m-1} J_n(\gamma_{pr},p)
\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)
=\prod\limits_{p=2}^m J_n(\gamma,p),
$$
hold, where $\prod\limits_{j=1}^{|C(\gamma_{cl})|}=1$ if
$|C(\gamma_{cl})|=0$. The inequalities
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ hold simultaneously for
all $2\le p\le m$ for a diagram~$\gamma$ if and only if
the inequalities $B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$
for all $2\le p\le m-1$, together with the inequality
$s(\gamma_{cl})\le n$ hold simultaneously for this~$\gamma$.
To show the validity of the above identity with the help of
the above relations let us first check that we sum up for
the same set of $\gamma\in\Gamma(k_1,\dots,k_m)$ if we take
the sum of modified versions of~(\ref{(12.9)}) for all $\gamma$
such that $B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, and if we take the ${\sum}^{\prime(n,m)}$
at the right-hand side of~(\ref{(11.11)}). Indeed, in the
second case we have to take those diagrams $\gamma$ for which
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ for all $2\le p\le m$, while
in the first case we take those diagrams~$\gamma$ for which
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, and $s(\gamma_{cl})\le n$. The last
condition is contained in a slightly hidden form in the
summation ${\sum}^{\prime(n)}$ of formula~(\ref{(12.9)}).
Hence the above mentioned relations imply that we have to sum up
for the same diagrams~$\gamma$ in the two cases.
Beside this, it follows from~(\ref{(11.8)}) that the same
$U$-statistics appear for a
diagram~$\gamma\in\Gamma(k_1,\dots,k_m)$ in~(\ref{(11.11)}) and
in the modified version of~(\ref{(12.9)}). We still have to
check that they have the same coefficients in the two cases.
But this holds, because the previously formulated identities
imply that
\begin{eqnarray*}
n^{-W(\gamma_{pr})/2}n^{-W(\gamma_{cl})/2}&=&n^{-W(\gamma)/2},\\
\prod\limits_{p=2}^{m-1} J_n(\gamma_{pr},p)
\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)
&=& \prod\limits_{p=2}^m J_n(\gamma,p)
\end{eqnarray*}
and $n^{-|O(\gamma_{cl})|/2}|O(\gamma_{cl})|!
=n^{-|O(\gamma)|/2}|O(\gamma)|!$, since
$|O(\gamma)|=|O(\gamma_{cl})|$, as we have seen before.
Let us prove the relations we applied in the previous argument.
We start with the proof of the identity
$W(\gamma_{pr})+W(\gamma_{cl})=W(\gamma)$ for the
function~$W(\cdot)$ defined in~(\ref{(11.9)}).
Let us first remark that $W(\gamma_{cl})=|O_2(\gamma_{cl})|$,
where $O_2(\gamma_{cl})$ is the set of open chains
in~$\gamma_{cl}$ with length~2. Beside this if
$\beta\in\gamma$ is such that
$\beta\cap\{(m,1),\dots,(m,k_m)\}=\emptyset$, i.e. if the
chain~$\beta$ contains no vertex from the last row of the
diagram~$\gamma$, then $\ell(\beta)=\ell(\beta_{pr})$, and
$c_\gamma(\beta)=c_{\gamma_{pr}}(\beta_{pr})$. If
$\beta\cap\{(m,1),\dots,(m,k_m)\}\neq\emptyset$, then either
$c_\gamma(\beta)=1$, $\ell(\beta_{pr})=\ell(\beta)-1$, and
$c_{\gamma_{pr}}(\beta_{pr})=-1$, or $c_\gamma(\beta)=-1$ and one
of the following cases appears. Either $\ell(\beta)=1$, and
the chain $\beta_{pr}$ does not exist, or $\ell(\beta)>1$,
$\ell(\beta_{pr})=\ell(\beta)-1$, and
$c_{\gamma_{pr}}(\beta_{pr})=-1$. We get by calculating
$W(\gamma)$ with the help of the above relations that
$W(\gamma)=W(\gamma_{pr})+|{\cal V}(\gamma)|$, where
${\cal V}(\gamma)=\{\beta\colon\; \beta\in\gamma,\,
\beta\cap\{(m,1),\dots,(m,k_m)\}\neq\emptyset,\, \ell(\beta)>1,\,
c_\gamma(\beta)=-1\}$. Since
$|{\cal V}(\gamma)|=|O_2(\gamma_{cl})|$, the above
relations imply the desired identity.
To prove the remaining relations first we observe that for
each diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ and number
$2\le p\le m-1$ we have $B_1(\gamma_{pr},p)=B_1(\gamma,p)$
and $B_2(\gamma_{pr},p)=B_2(\gamma,p)$. Beside this,
$|C(\gamma_{cl})|=B_1(\gamma,m)$
and $|O(\gamma_{cl})|=B_2(\gamma,m)$. The identity about
$|C(\gamma_{cl})|$ simply follows from the definition
of~$\gamma_{cl}$ and $B_1(\gamma,m)$. To prove the
identity about $|O(\gamma_{cl})|$ observe that
$|O(\gamma_{cl})|=|O(\gamma)|$, and
$|O(\gamma)|=B_2(\gamma,m)$. (Observe that in the case
$p=m$ the definition of the set ${\cal B}_2(\gamma,m)$
becomes simpler, because there is no chain
$\beta\in\gamma$ for which $d(\beta)>m$.)
The remaining relations can be deduced from these relations.
Indeed, they imply that $J_n(\gamma_{pr},p)=J_n(\gamma,p)$
for all $2\le p\le m-1$. Beside this, we have
$\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)=J_n(\gamma,m)$
because of the relations $|C(\gamma_{cl})|=B_1(\gamma,m)$,
$|O(\gamma_{cl})|=B_2(\gamma,m)$,
$s(\gamma_{cl})=|C(\gamma_{cl})|+|O(\gamma_{cl})|$ and the
definition of $J_n(\gamma,m)$. Hence the identity about the
product of the terms $J_n(\gamma,p)$ holds. It can be seen
similarly that the relations
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ hold for all
$2\le p\le m-1$ if and only if
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, and $B_1(\gamma,m)+B_2(\gamma,m)\le n$ if
and only if $s(\gamma_{cl})\le n$.
Thus we have proved identity~(\ref{(11.11)}). To complete the
proof of Theorem~11.2 we still have to show that under its
conditions $F_{\gamma}(f_1,\dots,f_m)$ is a bounded, canonical
function. But this follows from Theorem~11.1 and
relation~(\ref{(11.8)}) by a simple induction argument.
\medskip\noindent
{\it Proof of Lemma 11.3.} Lemma~11.3 will be proved by induction
with respect to the number~$m$ of the terms in the product of
$U$-statistics with the help of inequalities~(\ref{(11.6)})
and~(\ref{(11.7)}). These relations imply the desired inequality
for $m=2$. In the case $m>2$ we apply the identity~(\ref{(11.8)})
$F_{\gamma}(f_1,\dots,f_m)=
F_{\gamma_{cl}}(F_{\gamma_{pr}}(f_1,\dots,f_{m-1}),f_m)$. We have
seen that $W(\gamma)=W(\gamma_{pr})+W(\gamma_{cl})$, and it is not
difficult to show that $U(\gamma)=U(\gamma_{pr})+U(\gamma_{cl})$.
Hence if $U(\gamma_{cl})=0$, i.e. if $\gamma_{cl}$ contains a
chain of length~2 with colour~$-1$, then $U(\gamma)=U(\gamma_{pr})$,
and an application of~(\ref{(11.8)}) and~(\ref{(11.7)}) for the
diagram~$\gamma_{cl}$ implies Lemma~11.3 in this case.
If $U(\gamma_{cl})=1$, then $W(\gamma_{cl})=0$,
$U(\gamma)=U(\gamma_{pr})+1$, $W(\gamma)=W(\gamma_{pr})$, and
the application of~(\ref{(11.8)}) and~(\ref{(11.6)}) for the
diagram~$\gamma_{cl}$ implies Lemma~11.3 in this case.
\medskip
The corollary of Theorem 11.2 is a simple consequence of
Theorem~11.2 and Lem\-ma~11.3.
\medskip\noindent
{\it Proof of the corollary of Theorem 11.2.}\/ Observe that
$F_\gamma$ is a function of $|O(\gamma)|$ arguments. Hence a
coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ is in the
class of closed diagrams, i.e.
$\gamma\in\bar\Gamma(k_1,\dots,k_m)$ if and only if
$F_\gamma(f_1,\dots,f_m)$ is a constant. Thus
formula~(\ref{(11.13)}) is a simple consequence of
relation~(\ref{(11.11)}) and the observation
that $EI_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))=0$ if
$|O(\gamma)|\ge1$, i.e. if
$\gamma\notin\bar\Gamma(k_1,\dots,k_m)$, and
\begin{eqnarray}
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))
&&=I_{n,0}(F_\gamma(f_1,\dots,f_m))=F_\gamma(f_1,\dots,f_m)
\nonumber \\
&&\qquad\qquad\qquad\textrm{ if }\gamma\in\bar\Gamma(k_1,\dots,k_m).
\nonumber
\end{eqnarray}
Relations~(\ref{(11.14)}) and~(\ref{(11.15)}) follow from
relation~(\ref{(11.13)}) and Lemma~11.3.
\chapter{The proof of Theorems 8.3, 8.5 and Example 8.7}
In this section we prove the estimates on the distribution of
a multiple Wiener--It\^o integral or degenerate $U$-statistic
formulated in Theorems~8.5 and~8.3 and also present the proof of
Example~8.7. Beside this, we prove a multivariate version
of Hoeffding's inequality~(Theorem~3.4). The latter result is
useful in the estimation of the supremum of a class of degenerate
$U$-statistics. The estimate on the distribution of a multiple
random integral with respect to a normalized empirical
distribution given in Theorem~8.1 is omitted, because, as it was
shown in Section~9, this result follows from the estimate of
Theorem~8.3 on degenerate $U$-statistics. We finish this section
with a separate part, Section~13~B, where the results proved in
this section are discussed together with the method of their
proofs and some recent results. These new results state that in
certain cases the estimates on the tail distribution of
Wiener--It\^o integrals and $U$-statistics considered in this
section can be improved if we have some additional information
on the kernel function of these Wiener--It\^o integrals or
$U$-statistics.
The proof of Theorems~8.5 and~8.3 is based on a good estimate
on high moments of Wiener--It\^o integrals and degenerate
$U$-statistics. They can be deduced from the corollaries of
Theorems~10.2 and~11.2. Such an approach slightly differs from
the classical proof in the one-variate case. The one-variate
version of the problems discussed here is an estimate about
the tail distribution of a sum of independent random variables.
This can be proved with the help of a good bound on the moment
generating function of the sum. Such a method does not work in
the multivariate case, because, as later calculations will show,
there is no good estimate on the moment-generating function
of $U$-statistics or multiple Wiener--It\^o integrals of order
$k\ge3$. Actually, the moment-generating function of a
Wiener--It\^o integral of order $k\ge3$ is always divergent,
because the tail distribution behaviour of such a random
integral is similar to that of the $k$-th power of a Gaussian
random variable. On the other hand, good bounds on the moments
$EZ^{2M}$ of a random variable~$Z$ for all positive integers~$M$
(or at least for a sufficiently rich class of parameters~$M$)
together with the application of the Markov inequality for
$Z^{2M}$ and an appropriate choice of the parameter~$M$ yield
a good estimate on the tail distribution of~$Z$.
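The moment method described above can be illustrated on the simplest example, a standard normal random variable $Z$ with $EZ^{2M}=1\cdot3\cdots(2M-1)$. The following Python sketch (an illustration only, not part of the proofs; the function names are ad hoc) minimizes the Markov bound $EZ^{2M}/u^{2M}$ over the integer parameter~$M$ and compares the result with the true tail probability.

```python
import math

def normal_even_moment(m):
    """E[Z^(2m)] = 1*3*...*(2m-1) for a standard normal Z."""
    return math.prod(range(1, 2 * m, 2))

def moment_tail_bound(u, max_m=50):
    """Markov's inequality applied to Z^(2M): P(|Z|>u) <= E[Z^(2M)]/u^(2M),
    minimized over the integer parameter M."""
    return min(normal_even_moment(m) / u ** (2 * m) for m in range(1, max_m))

u = 4.0
bound = moment_tail_bound(u)
true_tail = math.erfc(u / math.sqrt(2))  # exact value of P(|Z| > u)
# The optimized moment bound lies between the true tail probability and
# a small multiple of exp(-u^2/2), the Gaussian tail behaviour.
assert true_tail <= bound <= 10 * math.exp(-u * u / 2)
```

The optimal $M$ is of order $u^2/2$, which is the same phenomenon exploited below in the proofs of Theorems~8.5 and~8.3.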
Propositions~13.1 and~13.2 contain some estimates on the moments
of Wiener--It\^o integrals and degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.1 (Estimate of the moments of Wiener--It\^o
integrals).}\index{estimate of the moments of Wiener--It\^o
integrals} {\it Let $f(x_1,\dots,x_k)$ be a function of $k$
variables on some measurable space $(X,{\cal X})$ that satisfies
formula~(\ref{(8.12)}) with some $\sigma$-finite non-atomic
measure $\mu$. Take the $k$-fold Wiener--It\^o integral
$Z_{\mu,k}(f)$ of this function with respect to a white noise
$\mu_W$ with reference measure~$\mu$. The inequality
\begin{equation}
E\left(k!|Z_{\mu,k}(f)|\right)^{2M}\le 1\cdot3\cdot5\cdots
(2kM-1)\sigma^{2M}\quad\textrm {for all }M=1,2,\dots
\label{(13.1)}
\end{equation}
holds.}
\medskip
By Stirling's formula Proposition~13.1 implies that
\begin{equation}
E(k!|Z_{\mu,k}(f)|)^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!}\sigma^{2M}
\le A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\label{(13.2)}
\end{equation}
for any $A>\sqrt2$ if $M\ge M_0=M_0(A)$. Formula~(\ref{(13.2)}) can be
considered as a simpler, better applicable version of
Proposition~13.1. It can be better compared with the moment estimate
on~degenerate $U$-statistics given in~(\ref{(13.3)}).
Proposition~13.2 provides a similar, but weaker inequality for the
moments of normalized degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.2 (Estimate on the moments of degenerate
$U$-statistics).}\index{estimate on the moments of degenerate
$U$-statistics} {\it Let us consider a degenerate $U$-statistic
$I_{n,k}(f)$ of order $k$ with sample size $n$ and with a kernel
function $f$ satisfying relations~(\ref{(8.1)}) and~(\ref{(8.2)})
with some $0<\sigma^2\le1$. Fix a positive number $\eta>0$.
There exist some universal constants $A<\infty$ and $C<\infty$
such that
\begin{eqnarray}
&&E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
\le A\left(1+C\sqrt\eta\right)^{2kM}
\left(\frac2e\right)^{kM}\left(kM\right)^{kM}\sigma^{2M}\nonumber \\
&&\qquad\qquad \textrm{for all integers } M \textrm{ such that }
0\le kM\le \eta n\sigma^2. \label{(13.3)}
\end{eqnarray}
In formula~(\ref{(13.3)}) the constant $C$ can be chosen as $C=\sqrt2$.}
\medskip
Proposition~13.2 yields a good estimate on
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ with a fixed
exponent~$2M$ with
the choice $\eta=\frac{kM}{n\sigma^2}$. With such a choice of the
number $\eta$ formula~(\ref{(13.3)}) yields an estimate on the moments
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ comparable with the
estimate on the corresponding Wiener--It\^o integral if
$M\le n\sigma^2$, while
it yields a much weaker estimate if $M\gg n\sigma^2$.
Now I turn to the proof of these propositions.
\medskip\noindent
{\it Proof of Proposition 13.1.}\/ Proposition 13.1 can be simply
proved by means of the Corollary of Theorem~10.2 with the choice
$m=2M$, and $f_p=f$ for all $1\le p\le 2M$. Formulas~(\ref{(10.18)})
and~(\ref{(10.19)}) yield that
\begin{eqnarray*}
E\left(k!Z_{\mu,k}(f)\right)^{2M}&\le&\left( \int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\right)^M
|\Gamma_{2M}(k)| \\
&\le& |\Gamma_{2M}(k)|\sigma^{2M},
\end{eqnarray*}
where $|\Gamma_{2M}(k)|$ denotes the number of closed diagrams
$\gamma$ in the class
$\bar\Gamma(\underbrace{k,\dots,k}_{2M\textrm{ times}})$
introduced in the corollary of Theorem~10.2. Thus to complete the
proof of Proposition~13.1 it is enough to show that
$|\Gamma_{2M}(k)|\le 1\cdot3\cdot5\cdots(2kM-1)$. But this can
easily be seen with the help of the following observation. Let
$\bar\Gamma_{2M}(k)$ denote the class of all graphs with vertices
$(l,j)$, $1\le l\le 2M$, $1\le j\le k$, such that from each vertex
$(l,j)$ exactly one edge starts, all edges connect different
vertices, but edges connecting vertices $(l,j)$ and $(l,j')$ with
the same first coordinate~$l$ are also allowed. Let
$|\bar\Gamma_{2M}(k)|$ denote the number of graphs in
$\bar\Gamma_{2M}(k)$. Then clearly
$|\Gamma_{2M}(k)|\le|\bar\Gamma_{2M}(k)|$. On the other hand,
$|\bar\Gamma_{2M}(k)|=1\cdot3\cdot5\cdots(2kM-1)$. Indeed, let us
list the vertices of the graphs from $\bar\Gamma_{2M}(k)$ in an
arbitrary way. Then the first vertex can be paired with another
vertex in $2kM-1$ ways, after this the first vertex from which no
edge starts can be paired in $2kM-3$ ways with a vertex from which
no edge starts. By following this procedure the next edge can be
chosen in $2kM-5$ ways, and by continuing this calculation we get
the desired formula.
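The pairing argument above can be verified by brute force for small parameters. The following Python sketch (illustrative only; the function names are ad hoc) counts the partitions of $2m$ labelled vertices into pairs recursively, exactly as in the counting argument, and checks that the result is $1\cdot3\cdots(2m-1)$.

```python
def count_perfect_matchings(vertices):
    """Count the graphs in which exactly one edge starts from each vertex,
    i.e. the partitions of `vertices` into unordered pairs."""
    if not vertices:
        return 1
    rest = vertices[1:]
    # Pair the first vertex with each of the remaining vertices in turn,
    # as in the counting argument of the proof.
    return sum(count_perfect_matchings(rest[:i] + rest[i + 1:])
               for i in range(len(rest)))

def double_factorial(n):
    """1*3*5*...*n for odd n >= 1."""
    return 1 if n <= 0 else n * double_factorial(n - 2)

# For 2m vertices: 2m-1 choices for the first pair, then 2m-3, 2m-5, ...
for m in range(1, 6):
    assert count_perfect_matchings(tuple(range(2 * m))) \
        == double_factorial(2 * m - 1)
```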
\medskip\noindent
{\it Proof of Proposition 13.2.}\/ Relation~(\ref{(13.3)}) will
be proved by
means of relations (\ref{(11.14)}) and (\ref{(11.15)}) in the
Corollary of Theorem~11.2 with the choice $m=2M$ and $f_p=f$
for all $1\le p\le 2M$. Let us take the class of closed
coloured diagrams
$\Gamma(k,M)=\bar\Gamma(\underbrace{k,\dots,k}_{2M\textrm{ times}})$.
This will be partitioned into subclasses
$\Gamma(k,M,r)$, $1\le r\le kM$, where $\Gamma(k,M,r)$ contains
those closed diagrams $\gamma\in\Gamma(k,M)$ for which
$W(\gamma)=2r$. Let us recall that $W(\gamma)$ was defined
in~(\ref{(11.9)}), and in the case of closed diagrams
$W(\gamma)=\sum\limits_{\beta\in\gamma}(\ell(\beta)-2)$. For a
diagram $\gamma\in\Gamma(k,M)$, $W(\gamma)$ is an even number,
since $W(\gamma)+2s(\gamma)=2kM$, i.e. $W(\gamma)=2r$ with $r=kM-s$,
where $s=s(\gamma)$ denotes the number of chains in~$\gamma$.
First we prove an estimate about the cardinality of~$\Gamma(k,M,r)$.
We claim that
\begin{eqnarray}
|\Gamma(k,M,r)|&\le& {{2kM}\choose{2r}} 1\cdot3\cdot5\cdots(2kM-2r-1)
(kM-r)^{2r} \label{(13.4)} \\
&\le& A\left(\frac2e\right)^{kM} {{2kM}\choose{2r}}
2^{-r}(kM)^{kM+r} \quad\textrm{for all } 0\le r\le kM \nonumber
\end{eqnarray}
with some universal constant $A<\infty$.
To prove formula~(\ref{(13.4)}) let us first observe that
$|\Gamma(k,M,r)|$ can be bounded from above by the number of
such partitions of a set with $2kM$ points which consist of
$s=kM-r$~sets containing at least two points. Indeed,
for each $\gamma\in\Gamma(k,M,r)$ the chains of the
diagram~$\gamma$ yield a partition of the set
$\{(p,q)\colon\;1\le p\le 2M,\,1\le q\le k\}$ consisting
of~$s=kM-r$ sets such that each of them contains at least two points.
Moreover, the partition given in such a way determines the
chains of~$\gamma$, because the vertices of a chain are listed
in a prescribed order. Namely, the indices of the rows which
contain them follow each other in increasing order. This
implies that we can assign to each diagram
$\gamma\in\Gamma(k,M,r)$ a different partition of a set of
$2kM$ elements with the prescribed properties.
The number of the partitions with the above properties can be
bounded from above in the following way. Let us calculate the
number of possibilities for choosing $s=kM-r$ disjoint subsets
of cardinality two from a set of cardinality~$2kM$, and multiply
this number by the number of ways of attaching each of the
remaining $2r$ points of the original set to one of these sets of
cardinality~2.
We can choose these sets of cardinality~2 in
${{2kM}\choose{2r}}1\cdot3\cdot5\cdots(2kM-2r-1)$ ways, since we can
choose the union of these sets, which consists of $2kM-2r$
points in ${{2kM}\choose{2kM-2r}}={{2kM}\choose{2r}}$ ways, and
then we can choose the pair of the first element in~$2kM-2r-1$ ways,
then the pair of the first still not chosen element in
$2kM-2r-3$ ways, and continuing this procedure we get the above
formula for the number of choices for these sets of cardinality~2.
Then the remaining $2r$ points of the original set can be put
in~$(kM-r)^{2r}$ ways in one of these $kM-r$ sets of
cardinality~2. The above relations imply the first inequality of
formula~(\ref{(13.4)}).
To get the second inequality observe that by the Stirling formula
$1\cdot3\cdot5\cdots(2kM-2r-1)=\frac{(2kM-2r)!}{2^{kM-r}(kM-r)!}
\le A\left(\frac2e\right)^{kM-r}(kM-r)^{kM-r}$ with some universal
constant~$A<\infty$. Beside this, we can write
$(kM-r)^{kM+r}\le (kM)^r(kM-r)^{kM}
=(kM)^{kM+r}(1-\frac r{kM})^{kM}\le e^{-r}(kM)^{kM+r}$. These
estimates imply the second inequality in~(\ref{(13.4)}).
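Both ingredients of this Stirling step can be checked numerically. The following Python sketch (an illustration with ad hoc names, not part of the proof) verifies the identity $1\cdot3\cdots(2n-1)=(2n)!/(2^n\,n!)$ and the elementary bound $(kM-r)^{kM+r}\le e^{-r}(kM)^{kM+r}$ on a grid of sample parameters.

```python
import math

# The identity 1*3*...*(2n-1) = (2n)! / (2^n * n!) used in the Stirling step.
for n in range(1, 12):
    assert math.prod(range(1, 2 * n, 2)) == \
        math.factorial(2 * n) // (2 ** n * math.factorial(n))

# The elementary bound (kM-r)^(kM+r) <= e^(-r) * (kM)^(kM+r), checked in
# logarithmic form on a grid of sample parameters with 0 <= r < kM.
for km in (5, 20, 100):
    for r in range(km):
        lhs = (km + r) * math.log(km - r)
        rhs = -r + (km + r) * math.log(km)
        assert lhs <= rhs + 1e-9
```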
We prove the estimate~(\ref{(13.3)}) with the help of the
relations~(\ref{(11.14)}), (\ref{(11.15)}) and~(\ref{(13.4)}).
First we estimate the term $n^{-W(\gamma)/2}|F_\gamma|$ for a
diagram $\gamma\in\Gamma(k,M,r)$ under the conditions
$kM\le\eta n\sigma^2$ and $\sigma^2\le1$ with the help of
relation~(\ref{(11.15)}).
In this case we can write $|U(\gamma)|\ge 2M-W(\gamma)=2M-2r$ for
the function~$U(\gamma)$ defined in~(\ref{(11.12)}). Hence by
relation~(\ref{(11.15)})
$n^{-W(\gamma)/2}|F_\gamma|\le 2^{2r} n^{-r}\sigma^{|U(\gamma)|}
\le 2^{2r} \left(n\sigma^2\right)^{-r}\sigma^{2M}
\le\eta^{r}2^{2r}(kM)^{-r}
\sigma^{2M}$ for $\gamma\in\Gamma(k,M,r)$ because of the conditions
$kM\le \eta n\sigma^2$ and $\sigma^2\le1$.
This estimate together with relation~(\ref{(11.14)}) imply
that for $kM\le\eta n\sigma^2$
\begin{eqnarray*}
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
&\le&\sum_{\gamma\in\Gamma(k,M)}
n^{-W(\gamma)/2}\cdot |F_\gamma| \\
&\le& \sum_{r=0}^{kM}|\Gamma(k,M,r)|
\eta^{r}2^{2r}(kM)^{-r}\sigma^{2M}.
\end{eqnarray*}
Hence by formula~(\ref{(13.4)})
\begin{eqnarray*}
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
&\le& A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\sum_{r=0}^{kM}{{2kM}\choose{2r}}
\left(\sqrt{2\eta}\right)^{2r}\\
&\le& A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\left(1+\sqrt2\sqrt{\eta}\right)^{2kM}
\end{eqnarray*}
if $0\le kM\le\eta n\sigma^2$. Thus we have proved
Proposition~13.2 with $C=\sqrt2$.
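The last step of the proof bounds the sum $\sum_r{2kM\choose 2r}\bigl(\sqrt{2\eta}\bigr)^{2r}$ by $\bigl(1+\sqrt{2\eta}\bigr)^{2kM}$. This is the even-index part of a binomial expansion, which the following Python sketch checks numerically (an illustration with ad hoc names); in fact the even-index part equals $\frac12\left((1+x)^n+(1-x)^n\right)$.

```python
import math

def even_binomial_sum(n, x):
    """Sum of C(n, 2r) * x^(2r) over all 2r <= n."""
    return sum(math.comb(n, 2 * r) * x ** (2 * r) for r in range(n // 2 + 1))

for n in (4, 10, 31):
    for x in (0.1, 0.7, 2.0):
        s = even_binomial_sum(n, x)
        # The even-index part equals ((1+x)^n + (1-x)^n)/2, hence it is
        # bounded by the full expansion (1+x)^n for x >= 0.
        assert abs(s - ((1 + x) ** n + (1 - x) ** n) / 2) < 1e-8 * s
        assert s <= (1 + x) ** n
```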
\medskip
It is not difficult to prove Theorem 8.5 with the help of
Proposition~13.1.\index{estimate on the tail distribution
of a multiple Wiener--It\^o integral}
\medskip\noindent
{\it Proof of Theorem 8.5.}\/
By formula~(\ref{(13.2)}) which is a consequence of
Proposition~13.1 and the Markov inequality
\begin{equation}
P\left(|k!Z_{\mu,k}(f)|>u\right)\le
\frac{E\left(k!Z_{\mu,k}(f)\right)^{2M}}{u^{2M}}
\le A\left(\frac {2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM}
\label{(13.5)}
\end{equation}
with some constant $A>\sqrt2$ if $M\ge M_0$ with some constant
$M_0=M_0(A)$, and $M$ is an integer.
Put $\bar M=\bar M(u)=\frac1{2k}\left(\frac u\sigma\right)^{2/k}$,
and $M=M(u)=[\bar M]$, where $[x]$ denotes the integer part of
a real number $x$. Choose some number $u_0$ such that
$\frac1{2k}\left(\frac {u_0}\sigma\right)^{2/k}\ge M_0+1$. Then
relation~(\ref{(13.5)}) can be applied with $M=M(u)$ for
$u\ge u_0$, and this yields that
\begin{eqnarray}
P\left(|k!Z_{\mu,k}(f)|>u\right)
&\le& A\left(\frac {2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM}
\le Ae^{-kM}\le Ae^{k}e^{-k\bar M} \nonumber \\
&=&Ae^k\exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\} \quad\textrm{if } u\ge u_0.
\label{(13.6)}
\end{eqnarray}
Relation~(\ref{(13.6)}) means that relation~(\ref{(8.14)}) holds
for $u\ge u_0$ with
the pre-exponential coefficient $Ae^k$. By enlarging this
coefficient if needed it can be guaranteed that
relation~(\ref{(8.14)}) holds for all $u>0$. Theorem~8.5 is proved.
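The choice of the parameter $M=[\bar M]$ in the above proof can be checked numerically: with this choice the base of the $kM$-th power in~(13.5) is at most $e^{-1}$. The Python sketch below (an illustration with ad hoc names and arbitrarily chosen sample parameters) verifies this together with the step $e^{-kM}\le e^k e^{-k\bar M}$.

```python
import math

def bracket_factor(k, sigma, u, M):
    """The base 2*k*M*sigma^(2/k) / (e*u^(2/k)) of the kM-th power in (13.5)."""
    return 2 * k * M * sigma ** (2 / k) / (math.e * u ** (2 / k))

k, sigma = 3, 0.5  # arbitrary sample parameters for the check
for u in (5.0, 20.0, 100.0):
    M_bar = (u / sigma) ** (2 / k) / (2 * k)  # the choice made in the proof
    M = math.floor(M_bar)
    # With M = [M_bar] the base is at most 1/e, so the Markov bound is at
    # most A*exp(-k*M) <= A*e^k*exp(-k*M_bar), which gives (13.6).
    assert bracket_factor(k, sigma, u, M) <= 1 / math.e + 1e-12
    assert math.exp(-k * M) <= math.exp(k) * math.exp(-k * M_bar)
```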
\medskip
Theorem 8.3 can be proved similarly by means of Proposition~13.2.
Nevertheless, the proof is technically more complicated, since
in this case the optimal choice of the parameter in the Markov
inequality cannot be given in such a direct form as in the proof of
Theorem~8.5. In this case the Markov inequality is applied with an
only almost optimal choice of the parameter~$M$.\index{estimate on
the tail distribution of a degenerate $U$-statistic}
\medskip\noindent
{\it Proof of Theorem 8.3.}\/ The Markov inequality and
relation~(\ref{(13.3)}) with $\eta=\frac{kM}{n\sigma^2}$ imply that
\begin{eqnarray}
P(n^{-k/2}k!|I_{n,k}(f)|>u)
&\le& \frac{E\left(k!n^{-k/2}I_{n,k}(f)\right)^{2M}}{u^{2M}}
\label{(13.7)} \\
&\le& A\left(\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}\right)^{kM} \nonumber
\end{eqnarray}
for all integers $M\ge0$.
Relation~(\ref{(8.10)}) will be proved with the help of
estimate~(\ref{(13.7)}) under the condition
$0\le\frac u\sigma\le n^{k/2}\sigma^k$. To this end let us
introduce the number $\bar M$ by means of the formula
$$
k\bar M=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+B\frac{ \left(\frac u\sigma\right)^{1/k}}{\sqrt n\sigma}}
=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}
$$
with a sufficiently large number $B=B(C)>0$ and $M=[\bar M]$,
where $[x]$ means the integer part of the number $x$.
Observe that $\sqrt{k\bar M}\le\left(\frac u\sigma\right)^{1/k}$,
$\frac{\sqrt{k\bar M}}{\sqrt n\sigma}
\le\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\le1$,
and
$$
\left(1+C\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\right)^2\le
1+B\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\le 1+B\left(u n^{-k/2}
\sigma^{-(k+1)}\right)^{1/k}
$$
with a sufficiently large $B=B(C)>0$ if
$\frac u\sigma\le n^{k/2}\sigma^k$. Hence
\begin{eqnarray}
&&\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}
\le \frac1e\cdot 2k\bar M\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k} \nonumber \\
&&\qquad =\frac1e\cdot\frac{\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2}
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}\le\frac1e
\label{(13.8)}
\end{eqnarray}
if $\frac u\sigma\le n^{k/2}\sigma^k$. Inequalities~(\ref{(13.7)})
and~(\ref{(13.8)}) together yield that
$$
P(n^{-k/2}k!|I_{n,k}(f)|>u)\le A e^{-kM}\le Ae^k e^{-k\bar M}
$$
if $0\le\frac u\sigma\le n^{k/2}\sigma^k$. Hence the choice of
the number~$\bar M$ implies that inequality~(\ref{(8.10)}) holds
with the pre-exponential constant $Ae^k$ and the sufficiently
large but fixed number~$B>0$. Theorem~8.3 is proved.
\medskip
Example 8.7 is a relatively simple consequence of It\^o's formula
for multiple Wiener--It\^o integrals.
\medskip\noindent
{\it Proof of Example 8.7.}\/ We may restrict our attention to the
case $k\ge2$. It\^o's formula for multiple Wiener--It\^o integrals,
more explicitly relation~(\ref{(10.21)}), implies that the random
variable $k!Z_{\mu,k}(f)$ can be expressed as $k!Z_{\mu,k}(f)
=\sigma H_k\left(\int f_0(x)\mu_W(\,dx)\right)=\sigma H_k(\eta)$,
where $H_k(x)$ is the $k$-th Hermite polynomial with leading
coefficient~1, and $\eta=\int f_0(x)\mu_W(\,dx)$ is a standard
normal random variable. Hence we get by exploiting that the
coefficient of $x^{k-1}$ in the polynomial $H_k(x)$ is zero that
$P(k!|Z_{\mu,k}(f)|>u)=P(|H_k(\eta)|\ge\frac u\sigma)\ge
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)$ with a
sufficiently large constant $D>0$ if $\frac u\sigma>1$. There
exist such positive constants $A$ and $B$ for which
$$
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)
\ge P\left(|\eta^k|>\frac u\sigma+A
\left(\frac u\sigma\right)^{(k-2)/k}\right)\quad
\textrm{if } \frac u\sigma>B.
$$
Hence
\begin{eqnarray*}
P(k!|Z_{\mu,k}(f)|>u)&\ge&
P\left(|\eta|>\left(\frac u\sigma\right)^{1/k}
\left(1+A\left(\frac u\sigma\right)^{-2/k}\right)\right) \\
&\ge&\frac{\bar C \exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\}}
{\left(\frac u\sigma\right)^{1/k}+1}
\end{eqnarray*}
with an appropriate $\bar C>0$ if $\frac u\sigma>B$. Since
$P(k!|Z_{\mu,k}(f)|>0)>0$, the above inequality also holds
for $0\le \frac u\sigma\le B$ if the constant $\bar C>0$ is chosen
sufficiently small. This means that relation~(\ref{(8.16)}) holds.
\medskip
Next we prove a multivariate version of Hoeffding's inequality.
Before its formulation some notations will be introduced.
Let us fix two positive integers~$k$ and~$n$ and some
real numbers $a(j_1,\dots,j_k)$ for all sequences of arguments
$\{j_1,\dots,j_k\}$ such that $1\le j_l\le n$, $1\le l\le k$, and
$j_l\neq j_{l'}$ if $l\neq l'$.
With the help of the above real numbers $a(\cdot)$ and a
sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, the random
variable
\begin{equation}
V=\sum_{\substack {(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
a(j_1,\dots, j_k)
\varepsilon_{j_1}\cdots \varepsilon_{j_k} \label{(13.9)}
\end{equation}
and the number
\begin{equation}
S^2=\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
a^2(j_1,\dots, j_k) \label{(13.10)}
\end{equation}
will be introduced.
With the help of the above notations the following result can be
formulated.
\medskip\noindent
{\bf Theorem 13.3 (The multivariate version of Hoeffding's
inequality).}\index{multivariate version of Hoeffding's
inequality} {\it The random variable $V$ defined in
formula~(\ref{(13.9)}) satisfies the inequality
\begin{equation}
P(|V|>u)\le C
\exp\left\{-\frac12\left(\frac uS\right)^{2/k}\right\}
\quad\textrm{for all }u\ge 0 \label{(13.11)}
\end{equation}
with the constant $S$ defined in~(\ref{(13.10)}) and some
constant $C>0$ depending only on the parameter $k$ in the
expression~$V$.}
\medskip
Theorem~13.3 will be proved by means of two simple lemmas. Before
their formulation the random variable
\begin{equation}
Z=\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
|a(j_1,\dots,j_k)|\eta_{j_1}\cdots \eta_{j_k} \label{(13.12)}
\end{equation}
will be introduced, where $\eta_1,\dots,\eta_n$ are independent
random variables with standard normal distribution, and the numbers
$a(j_1,\dots,j_k)$ agree with those in formula~(\ref{(13.9)}). The
following lemmas will be proved.
\medskip\noindent
{\bf Lemma 13.4.} {\it The random variables $V$ and $Z$ introduced
in~(\ref{(13.9)}) and (\ref{(13.12)}) satisfy the inequality
$$
EV^{2M}\le EZ^{2M}\quad\textrm{for all }M=1,2,\dots.
$$
}
\medskip\noindent
{\bf Lemma 13.5.} {\it The random variable $Z$ defined in
formula~(\ref{(13.12)}) satisfies the inequality
\begin{equation}
EZ^{2M}\le 1\cdot3\cdot5\cdots(2kM-1)S^{2M}\quad\textrm{for all }
M=1,2,\dots \label{(13.13)}
\end{equation}
with the constant $S$ defined in formula~(\ref{(13.10)}).}
\medskip\noindent
{\it Proof of Lemma 13.4.}\/ We can write, by carrying out the
multiplications in the expressions $EV^{2M}$ and $EZ^{2M}$,
by exploiting the additive and multiplicative properties of the
expectation for sums and products of independent random variables
together with the identities
$E\varepsilon_j^{2m+1}=0$ and $E\eta_j^{2m+1}=0$
for all $m=0,1,\dots$ that
\begin{equation}
EV^{2M}= \!\!\!\!\!\!\!\!\!\!\!
\sum_{\substack{ (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM}}
\!\!\!\!\!\!\!\!\!\!\!
A(j_1,\dots,j_l,m_1,\dots,m_l)
E\varepsilon_{j_1}^{2m_1}\cdots E\varepsilon_{j_l}^{2m_l}
\label{(13.14)}
\end{equation}
and
\begin{equation}
EZ^{2M}= \!\!\!\!\!\!\!\!\!\!\!\!\!
\sum_{\substack{ (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM}}
\!\!\!\!\!\!\!\!\!\!\!\!\!
B(j_1,\dots,j_l,m_1,\dots,m_l) E\eta_{j_1}^{2m_1}\cdots
E\eta_{j_l}^{2m_l} \label{(13.15)}
\end{equation}
with some coefficients $A(j_1,\dots,j_l,m_1,\dots,m_l)$ and
$B(j_1,\dots,j_l,m_1,\dots,m_l)$ such that
\begin{equation}
|A(j_1,\dots,j_l,m_1,\dots,m_l)|\le
B(j_1,\dots,j_l,m_1,\dots,m_l). \label{(13.16)}
\end{equation}
The coefficients $A(\cdot,\cdot,\cdot)$ and $B(\cdot,\cdot,\cdot)$
could be expressed explicitly, but we do not need such a formula.
What is important for us is that $A(\cdot,\cdot,\cdot)$ can be
expressed as the sum of certain terms, and $B(\cdot,\cdot,\cdot)$
as the sum of the absolute value of the same terms. Hence
relation~(\ref{(13.16)}) holds. Since
$E\varepsilon_j^{2m}\le E\eta_j^{2m}$
for all parameters $j$ and $m$, formulas~(\ref{(13.14)}),
(\ref{(13.15)}) and~(\ref{(13.16)}) imply
Lemma~13.4.
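The single moment inequality driving this proof can be checked directly: $E\varepsilon_j^{2m}=1$ for a Rademacher variable, while $E\eta_j^{2m}=(2m-1)!!=1\cdot3\cdots(2m-1)$ for a standard normal one. A short numerical sketch, added here for illustration only:

```python
# Illustration of the moment comparison behind Lemma 13.4:
# E eps^{2m} = 1 for a Rademacher variable eps, while
# E eta^{2m} = (2m-1)!! = 1*3*5*...*(2m-1) for a standard normal eta.
def double_factorial(n: int) -> int:
    """Product n*(n-2)*(n-4)*... down to 1 (or 2)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

for m in range(1, 7):
    rademacher_moment = 1                 # (+1)^{2m} and (-1)^{2m} are both 1
    gaussian_moment = double_factorial(2 * m - 1)
    assert rademacher_moment <= gaussian_moment
    print(m, rademacher_moment, gaussian_moment)
```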
\medskip\noindent
{\it Proof of Lemma~13.5.} Let us consider a white noise $W(\cdot)$
on the unit interval $[0,1]$ with the Lebesgue measure $\lambda$ on
$[0,1]$ as its reference measure, i.e.\ let us take a set of
Gaussian random variables $W(A)$ indexed by the measurable sets
$A\subset [0,1]$ such that $EW(A)=0$, $EW(A)W(B)=\lambda(A\cap B)$
with the Lebesgue measure $\lambda$ for all measurable subsets of
the interval $[0,1]$. Let us introduce $n$ orthonormal functions
$\varphi_1(x),\dots,\varphi_n(x)$ with respect to the Lebesgue
measure on the interval $[0,1]$, and define the random variables
$\eta_j=\int \varphi_j(x)W(\,dx)$, $1\le j\le n$. Then
$\eta_1,\dots,\eta_n$ are independent random variables with standard
normal distribution, hence we may assume that they appear in the
definition of the random variable~$Z$ in formula~(\ref{(13.12)}). Beside
this, the identity $\eta_{j_1}\cdots\eta_{j_k}=\int \varphi_{j_1}(x_1)
\cdots\varphi_{j_k}(x_k)W(\,dx_1)\dots W(\,dx_k)$ holds for all
$k$-tuples $(j_1,\dots,j_k)$ such that $1\le j_s\le n$ for all
$1\le s\le k$, and the indices $j_1,\dots, j_k$ are different.
This identity follows from It\^o's formula for multiple Wiener--It\^o
integrals formulated in formula~(\ref{(10.20)}) of Theorem~10.3.
Hence the random variable $Z$ defined in~(\ref{(13.12)}) can be
written in the form
$$
Z=\int f(x_1,\dots,x_k)W(\,dx_1)\dots W(\,dx_k)
$$
with the function
$$
f(x_1,\dots,x_k)=
\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
|a(j_1,\dots,j_k)| \varphi_{j_1}(x_1)\cdots \varphi_{j_k}(x_k).
$$
Because of the orthonormality of the functions $\varphi_j(x)$,
$$
S^2=\int_{[0,1]^k} f^2(x_1,\dots,x_k)\,dx_1\dots\,dx_k.
$$
Lemma~13.5 is a straightforward consequence of the above relations
and formula~(\ref{(13.1)}) in Proposition~13.1.
\medskip\noindent
{\it Proof of Theorem~13.3.}\/ The proof of Theorem~13.3 with the
help of Lemmas~13.4 and~13.5 is an almost word for word repetition
of the proof of Theorem~8.5. By Lemma~13.4 inequality~(\ref{(13.13)})
remains valid if the random variable $Z$ is replaced by the random
variable~$V$ at its left-hand side. Hence the Stirling formula
yields that
$$
EV^{2M}\le EZ^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!} S^{2M}\le C
\left(\frac2e\right)^{kM}(kM)^{kM}S^{2M}
$$
for any $C\ge\sqrt2$ if $M\ge M_0(C)$. As a consequence, by the
Markov inequality the estimate
\begin{equation}
P(|V|>u)\le\frac{EV^{2M}}{u^{2M}}\le C\left(\frac{2kM}e\left(\frac
Su\right)^{2/k}\right)^{kM} \label{(13.17)}
\end{equation}
holds for all $C\ge\sqrt 2$ if $M\ge M_0(C)$. Put $k\bar M=k\bar
M(u)=\frac12\left(\frac uS\right)^{2/k}$ and $M=M(u)=[\bar M]$, where
$[x]$ denotes the integer part of the number~$x$. Let us choose
a threshold number $u_0$ by the identity
$\frac1{2k}\left(\frac{u_0}S\right)^{2/k}=M_0(C)+1$.
Formula~(\ref{(13.17)}) can be applied with $M=M(u)$ for
$u\ge u_0$, and it yields that
$$
P(|V|>u)\le Ce^{-kM}\le Ce^ke^{-k\bar M}=Ce^k\exp\left\{-\frac12
\left(\frac uS\right)^{2/k}\right\}\ \quad\textrm{if } u\ge u_0.
$$
The last inequality means that relation~(\ref{(13.11)})
holds for $u\ge u_0$
if the constant $C$ is replaced by $Ce^k$ in it. With the choice of
a sufficiently large constant~$C$ relation~(\ref{(13.11)}) holds
for all $u\ge0$. Theorem~13.3 is proved.
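The estimate of Theorem~13.3 can also be illustrated numerically. The sketch below is an added illustration; the choices $k=2$, $n=40$ and Gaussian coefficients $a(j_1,j_2)$ are assumptions made for the experiment. It simulates the Rademacher chaos $V$ of formula~(\ref{(13.9)}) and compares its empirical tail with $\exp\{-\frac12(u/S)^{2/k}\}$.

```python
import numpy as np

# Simulation sketch for Theorem 13.3 with k = 2 (illustration only;
# the coefficients a(j1, j2) are drawn at random for the experiment).
rng = np.random.default_rng(1)
n = 40
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)            # the diagonal j_1 = j_2 is omitted

S = np.sqrt((A ** 2).sum())         # the quantity S of formula (13.10)

trials = 200_000
eps = rng.choice([-1.0, 1.0], size=(trials, n))
V = ((eps @ A) * eps).sum(axis=1)   # V = sum_{j1 != j2} a(j1,j2) eps_{j1} eps_{j2}

for c in (2.0, 4.0, 6.0):
    u = c * S
    tail = np.mean(np.abs(V) > u)
    rate = np.exp(-0.5 * (u / S))   # exponent (u/S)^{2/k} with k = 2
    print(f"u = {c:.0f}S  P(|V|>u) = {tail:.4e}  rate = {rate:.4e}")
```

In this range the empirical tail stays below the rate $\exp\{-\frac12(u/S)\}$, consistent with inequality~(\ref{(13.11)}) with a moderate constant $C$.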
\medskip\noindent
{\script 13. B) A short discussion about the methods and results.}
\medskip\noindent
A comparison of Theorem 8.5 and Example 8.7 shows that the
estimate~(\ref{(8.15)})
is sharp. At least no essential improvement of this estimate
is possible which holds for {\it all}\/ Wiener--It\^o integrals
with a kernel function $f$ satisfying the conditions of Theorem~8.5.
This fact also indicates that the bounds~(\ref{(13.1)})
and~(\ref{(13.2)}) on high
moments of Wiener--It\^o integrals are sharp. It is worth
while comparing formula~(\ref{(13.2)}) with the estimate of
Proposition~13.2 on moments of degenerate $U$-statistics.
Let us consider a normalized $k$-fold degenerate $U$-statistic
$n^{-k/2}k!I_{n,k}(f)$ with some kernel function $f$ and a
$\mu$-distributed sample of size~$n$. Let us compare its moments
with those of a $k$-fold Wiener--It\^o integral $k!Z_{\mu,k}(f)$
with the same kernel function~$f$ with respect to a white noise
$\mu_W$ with reference measure~$\mu$. Let $\sigma$ denote the
$L_2$-norm of the kernel function~$f$. If
$M\le\varepsilon n\sigma^2$ with a small number $\varepsilon>0$,
then Proposition~13.2 (with an appropriate
choice of the parameter~$\eta$ which is small in this case)
provides an almost as good bound on the $2M$-th moment of the
normalized $U$-statistic as Proposition~13.1 does on the
$2M$-th moment of the corresponding Wiener--It\^o integral. In
the case $M\le Cn\sigma^2$ with some fixed (not necessarily small)
number $C>0$ the $2M$-th moment of the normalized $U$-statistic
can be bounded by $C(k)^M$ times the natural estimate on the
$2M$-th moment of the Wiener--It\^o integral with some
constant~$C(k)>0$ depending only on $k$ and the number~$C$. This can
be interpreted as saying that in this case the estimate on the moments of the
normalized $U$-statistic is weaker than the estimate on the moments
of the Wiener--It\^o integral, but they are still comparable.
Finally, in the case $M\gg n\sigma^2$ the estimate on the $2M$-th
moment of the normalized $U$-statistic is much worse than the
estimate on the $2M$-th moment of the Wiener--It\^o integral.
A similar picture arises if the distribution of the normalized
degenerate $U$-statistic
$$
F_n(u)=P(n^{-k/2}k!|I_{n,k}(f)|>u)
$$
is compared to the distribution of the Wiener--It\^o integral
$$
G(u)=P(k!|Z_{\mu,k}(f)|>u).
$$
A comparison of Theorems~8.3 and~8.5 shows that for
$0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a small $\varepsilon>0$
an almost as good estimate holds for $F_n(u)$ as for $G(u)$. In the
case $0\le u\le n^{k/2}\sigma^{k+1}$ the behaviour of $F_n(u)$
and $G(u)$ is similar, only in the exponent of the estimate on
$F_n(u)$ in formula~(\ref{(8.10)}) a worse constant appears. Finally, if
$u\gg n^{k/2}\sigma^{k+1}$, then --- as Example~8.8 shows, at
least in the case $k=2$, --- the (tail) distribution function
$F_n(u)$ satisfies a much worse estimate than the function
$G(u)$. Thus a similar picture arises as in the case when the
estimate on the tail-distribution of normalized sums of independent
random variables, discussed in Section~3, is compared to the
behaviour of the standard normal distribution in the neighbourhood
of infinity. To understand this similarity better it is useful to
recall Theorem~10.4, the limit theorem about normalized degenerate
$U$-statistics. Theorems 8.3 and~8.5 enable us to compare the tail
behaviour of normalized degenerate $U$-statistics with their limit
presented in the form of multiple Wiener--It\^o integrals, while
the one-variate versions of these results compare the distribution
of sums of independent random variables with their Gaussian limit.
The above results show that good bounds on the moments of degenerate
$U$-statistics and multiple Wiener--It\^o integrals also provide a good
estimate on their distribution. To understand the behaviour of high
moments of degenerate $U$-statistics it is useful to have a closer
look at the simplest case $k=1$, when the moments of sums of
independent random variables with expectation zero are considered.
Let us consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with expectation
zero, take their sum $S_n=\sum\limits_{j=1}^n\xi_j$, and let us try
to give a good estimate on the moments $ES_n^{2M}$ for all
$M=1,2,\dots$. Because of the independence of the random variables
$\xi_j$ and the condition $E\xi_j=0$ the identity
\begin{equation}
ES_n^{2M}=\sum_{\substack{(j_1,\dots,j_s,l_1,\dots,l_s)\colon\\
j_1+\dots+j_s=2M,\;j_u\ge 2\textrm{ for all }1\le u\le s, \\
1\le l_u\le n,\; l_u\neq l_{u'} \textrm { if }u\neq u'}}
E\xi_{l_1}^{j_1}\cdots E\xi_{l_s}^{j_s}
\label{(13.18)}
\end{equation}
holds. Simple combinatorial considerations show that a dominating
number of terms at the right-hand side of~(\ref{(13.18)})
are indexed by a
vector $(j_1,\dots,j_M;\,l_1,\dots,l_M)$ such that $j_u=2$ for all
$1\le u\le M$, and the number of such vectors is equal to
${n\choose M}\frac{(2M)!}{2^M}\sim n^M\frac{(2M)!}{2^MM!}$. The
last asymptotic relation holds if the number $n$ of terms in the
random sum~$S_n$ is sufficiently large. The above considerations
suggest that under not too restrictive conditions $ES_n^{2M}\sim
\left(n\sigma^2\right)^M\frac{(2M)!}{2^MM!}=E\eta_{n\sigma^2}^{2M}$,
where $\sigma^2=E\xi_1^2$ is the variance of the terms in the sum
$S_n$, and $\eta_u$ denotes a random variable with normal
distribution with expectation zero and variance~$u$. The question
arises when the above heuristic argument gives a right estimate.
For the sake of simplicity let us restrict our attention to the
case when the absolute value of the random variables $\xi_j$ is
bounded by~1. Let us observe that even in this case the above
heuristic argument holds only under the condition that the variance
$\sigma^2$ of the random variables $\xi_j$ is not too small.
Indeed, let us consider such random variables $\xi_j$, for which
$P(\xi_j=1)=P(\xi_j=-1)=\frac{\sigma^2}2$, $P(\xi_j=0)=1-\sigma^2$.
Then these random variables $\xi_j$ have variance $\sigma^2$, and
the contribution of the terms $E\xi_j^{2M}$, $1\le j\le n$, to the
sum in~(\ref{(13.18)}) equals $n\sigma^2$. If $\sigma^2$ is very small,
then it may happen that $n\sigma^2\gg\left(n\sigma^2\right)^M
\frac{(2M)!}{2^MM!}$, and the approximation given for $ES_n^{2M}$
in the previous paragraph does not hold any longer. Hence the
asymptotic relation for a very high moment $ES_n^{2M}$ suggested
by the above heuristic argument may only hold if the variance
$\sigma^2$ of the summands satisfies an appropriate lower bound.
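The break-down of the heuristic argument for small $\sigma^2$ can be seen in a small numerical comparison, added here for illustration; the values $n=10^6$ and $M=5$ are arbitrary choices. For the three-valued variables above, $E\xi_j^{2M}=\sigma^2$ for every $M$, so these terms contribute $n\sigma^2$ to the sum in~(\ref{(13.18)}):

```python
from math import factorial

def gaussian_moment(v: float, M: int) -> float:
    """E eta_v^{2M} = v^M (2M)! / (2^M M!) for eta_v ~ N(0, v)."""
    return v ** M * factorial(2 * M) / (2 ** M * factorial(M))

# Illustration only: n = 10^6 and M = 5 are arbitrary choices.
n, M = 10 ** 6, 5
for sigma2 in (1e-1, 1e-7):
    heuristic = gaussian_moment(n * sigma2, M)  # (n sigma^2)^M (2M)!/(2^M M!)
    diagonal = n * sigma2                       # sum of the terms E xi_j^{2M}
    print(f"sigma^2={sigma2:.0e}: heuristic={heuristic:.3e}, "
          f"diagonal terms={diagonal:.3e}")
```

For $\sigma^2=10^{-1}$ the heuristic Gaussian value dominates, while for $\sigma^2=10^{-7}$ the contribution $n\sigma^2$ of the terms $E\xi_j^{2M}$ is already larger, so the Gaussian approximation of the $2M$-th moment fails.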
In the proof of Proposition~13.2 a similar picture appears in a
hidden way. In the calculation of the moments of a degenerate
$U$-statistic the contribution of certain (closed) diagrams,
more precisely of some integrals defined with their help, has to
be estimated. Some of these diagrams (those in which all chains
have length~2) appear also in the calculation of the moments of
multiple Wiener--It\^o integrals. In the calculation of the
moments of sums of independent random variables the terms
consisting of products of second moments play such a role in
the sum in formula~(\ref{(13.18)}) as the `nice' diagrams consisting of
chains of length~2 play in the calculation of the moments of
degenerate $U$-statistics in formula~(\ref{(11.14)}). In nice cases the
remaining diagrams do not give a much greater contribution than
these `nice' diagrams, and we get an almost as good bound for
the moments of a normalized degenerate $U$-statistic as for the
moments of the corresponding multiple Wiener--It\^o integral.
The proof of Proposition~13.2 shows that such a situation
appears under very general conditions.
Let me also remark that there is an essential difference
between the tail behaviour of Wiener--It\^o integrals and
normalized degenerate $U$-statistics. A good estimate can be
given on the tail distribution of Wiener--It\^o integrals which
depends only on the $L_2$-norm of the kernel function, while in
the case of normalized degenerate $U$-statistics the
corresponding estimate depends not only on the $L_2$-norm but
also on the $L_\infty$ norm of the kernel function. In
Theorem~8.3 such an estimate is proved.
For $k\ge2$ the distribution of a $k$-fold Wiener--It\^o integral is
not determined by the $L_2$-norm of its kernel function. This is
an essential difference between Wiener--It\^o integrals of order
$k\ge2$ and $k=1$. In the case $k=1$ a Wiener--It\^o integral is
a Gaussian random variable with expectation zero, and its variance
equals the square of the $L_2$-norm of its kernel function. Hence
its distribution is completely determined by the $L_2$-norm of its
kernel function. On the other hand, the distribution of a
Wiener--It\^o integral of order $k\ge2$ is not determined by its
variance. Theorem~8.5 yields a `worst case' estimate on the
distribution of Wiener--It\^o integrals if we have a bound on their
variance. In the statistical problems which were the main
motivation for this work we need such estimates, but it may be
interesting to know what kind of estimates are known about the
distribution of a multiple Wiener--It\^o integral or degenerate
$U$-statistic if we have some additional information about its
kernel function. Some results will be mentioned in this direction,
but most technical details will be omitted from their discussion.
H.~P. McKean proved the following lower bound on the distribution
of multiple Wiener--It\^o integrals. (See \cite{r30} or \cite{r43}.)
\medskip\noindent
{\bf Theorem 13.6 (Lower bound on the tail distribution of
Wiener--It\^o integrals).}\index{lower bound on the tail
distribution of Wiener--It\^o integrals (result of H.~P. McKean)}
{\it All $k$-fold Wiener--It\^o integrals $Z_{\mu,k}(f)$ satisfy
the inequality
\begin{equation}
P(|Z_{\mu,k}(f)|>u)>Ke^{-Au^{2/k}} \label{(13.19)}
\end{equation}
with some numbers $K=K(f,\mu)>0$ and $A=A(f,\mu)>0$.}
\medskip\noindent
The constant $A$ in the exponent $Au^{2/k}$ of
formula~(\ref{(13.19)}) is
always finite, but McKean's proof yields no explicit upper
bound on it. The following example shows that in certain cases
if we fix the constant~$K$ in relation~(\ref{(13.19)}), then this
inequality holds only with a very large constant $A>0$ even
if the variance of the Wiener--It\^o integral equals~1.
Take a probability measure $\mu$ and a white noise $\mu_W$ with
reference measure $\mu$ on a measurable space $(X,{\cal X})$, and let
$\varphi_1,\varphi_2,\dots$ be a sequence of orthonormal functions
on $(X,{\cal X})$ with respect to this measure $\mu$. Define for all
$L=1,2,\dots$, the function
\begin{equation}
f(x_1,\dots,x_k)=f_L(x_1,\dots,x_k)=(k!)^{1/2}L^{-1/2}
\sum\limits_{j=1}^L \varphi_j(x_1)\cdots\varphi_j(x_k)
\label{(13.20)}
\end{equation}
and the Wiener--It\^o integral
$$
Z_{\mu,k}(f)=Z_{\mu,k}(f_L)=\frac1{k!}\int f_L(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots\mu_W(\,dx_k).
$$
Then $EZ_{\mu,k}^2(f)=1$, and the high moments of $Z_{\mu,k}(f)$ can
be well estimated. For a large parameter~$L$ these moments are much
smaller than the bound given in Proposition~13.1. (The
calculation leading to the estimation of the moments of
$Z_{\mu,k}(f)$ will be omitted.) These moment estimates also imply
that if the parameter~$L$ is large, then for not too large
numbers~$u$ the probability $P(|Z_{\mu,k}(f)|>u)$ has a much better
estimate than that given in Theorem~8.5. As a consequence,
for a large number $L$ and fixed number~$K$
relation~(\ref{(13.19)}) may
hold only with a very big number $A>0$.
We can expect that if we take a Gaussian random
polynomial~$P(\xi_1,\dots,\xi_n)$ whose arguments are Gaussian
random variables $\xi_1,\dots,\xi_n$, and which is the sum of
many small almost independent terms, then
a similar picture arises as in the case of a Wiener--It\^o
integral with kernel function~(\ref{(13.20)}) with a
large parameter~$L$.
Such a random polynomial has an almost Gaussian distribution by
the central limit theorem, and we can also expect that its not
too high moments behave like the corresponding moments of a
Gaussian random variable with expectation zero and the same
variance as the Gaussian random polynomial we consider. Such a
bound on the moments has the consequence that the estimate on
the probability $P(P(\xi_1,\dots,\xi_n)>u)$ given in Theorem~8.5
can be improved if the number~$u$ is not too large. A similar
picture arises if we consider Wiener--It\^o integrals whose
kernel function satisfies some `almost independence' properties.
The problem is to find the right properties under which we can
get a good estimate that exploits the almost independence
property of a Gaussian random polynomial or of a Wiener--It\^o
integral. The main result of R.~Lata{\l}a's paper~\cite{r27} can be
considered as a response to this question. I describe this
result below.
To formulate Lata{\l}a's result some new notions have to be
introduced. Given a finite set $A$ let ${\cal P}(A)$ denote the
set of all its partitions. If a partition
$P=\{B_1,\dots,B_s\}\in{\cal P}(A)$ consists of $s$ elements then we
say that this partition has order~$s$, and write $|P|=s$. In the
special case $A=\{1,\dots,k\}$ the notation ${\cal P}(A)={\cal P}_k$
will be used. Given a measurable space $(X,{\cal X})$ with a
probability measure $\mu$ on it together with a finite set
$B=\{b_1,\dots,b_j\}$ let us introduce the following notations. Take
$j$ different copies $(X_{b_r},{\cal X}_{b_r})$ and $\mu_{b_r}$,
$1\le r\le j$, of this measurable space and probability measure indexed
by the elements of the set $B$, and define their product
$(X^{(B)},{\cal X}^{(B)},\mu^{(B)})=\left(\prod\limits_{r=1}^j X_{b_r},
\prod\limits_{r=1}^j{\cal X}_{b_r},
\prod\limits_{r=1}^j\mu_{b_r}\right)$. The points
$(x_{b_1},\dots,x_{b_j})\in X^{(B)}$ will be denoted by
$x^{(B)}\in X^{(B)}$ in the sequel. With the help of the above
notations I introduce the quantities needed in the formulation of the
following Theorem~13.7.
Let $f=f(x_1,\dots,x_k)$ be a function on the $k$-fold product
$(X^k,{\cal X}^k,\mu^k)$ of a measure space $(X,{\cal X},\mu)$
with a probability measure $\mu$. For all partitions
$P=\{B_1,\dots,B_s\}\in{\cal P}_k$ of the set $\{1,\dots,k\}$ consider
the functions $g_r\left(x^{(B_r)}\right)$ on the space $X^{(B_r)}$,
$1\le r\le s$, and define with their help the quantities
\begin{eqnarray}
\alpha(P)
&&=\alpha(P,f,\mu) \nonumber \\
&&=\sup_{g_1,\dots,g_s} \int f(x_1,\dots,x_k)
g_1\left(x^{(B_1)}\right)\cdots g_s\left(x^{(B_s)}\right)\mu(dx_1)
\dots\mu(dx_k); \nonumber \\
&&\qquad\quad \textrm{where the supremum is taken over such functions}
\nonumber \\
&&\qquad \quad g_1,\dots,g_s,\quad g_r\colon\,
X^{(B_r)}\to R^1 \textrm{ for which} \nonumber \\
&&\qquad\quad
\int g_r^2\left(x^{(B_r)}\right)\mu^{(B_r)}\left(\,dx^{(B_r)}\right)\le1
\quad \textrm{for all } 1\le r\le s, \label{(13.21)}
\end{eqnarray}
and put
\begin{equation}
\alpha_s=\max_{P\in{\cal P}_k,\,|P|=s}\alpha(P),
\quad 1\le s\le k. \label{(13.22)}
\end{equation}
In Lata{\l}a's estimation of Wiener--It\^o integrals of order~$k$
the quantities $\alpha_s$, $1\le s\le k$, play a similar role as
the number $\sigma^2$ in Theorem~8.5. Observe that in the case
$|P|=1$, i.e.\ if $P=\{1,\dots,k\}$ the identity
$\alpha^2(P)=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$
holds, which means that $\alpha_1=\sigma$. The following estimate
is valid for Wiener--It\^o integrals of general order.
\medskip\noindent
{\bf Theorem 13.7 (Lata{\l}a's estimate about the tail-distribution
of Wiener--It\^o integrals).}\index{Lata{\l}a's estimate about
the tail-distribution of Wiener--It\^o integrals}
{\it Let a $k$-fold Wiener--It\^o integral $Z_{\mu,k}(f)$,
$k\ge1$, be defined with the help of a white noise $\mu_W$ with
a non-atomic reference measure~$\mu$ and a kernel function~$f$
of $k$~variables such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty.
$$
There is some universal constant $C(k)<\infty$ depending only on
the order~$k$ of the random integral such that the inequalities
\begin{equation}
E(Z_{\mu,k}(f))^{2M}\le
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
\label{(13.23)}
\end{equation}
and
\begin{equation}
P(|Z_{\mu,k}(f)|>u)\le C(k)\exp\left\{-\frac1{C(k)}\min_{1\le s\le k}
\left(\frac u{\alpha_s}\right)^{2/s}\right\}
\label{(13.24)}
\end{equation}
hold for all $M=1,2,\dots$ and $u>0$ with the quantities $\alpha_s$,
defined in formulas~(\ref{(13.21)}) and~(\ref{(13.22)}).}
\medskip
Inequality~(\ref{(13.24)}) is a simple consequence
of~(\ref{(13.23)}). In the special case when
$\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$,
inequality~(\ref{(13.23)}) says that the moment
$E(Z_{\mu,k}(f))^{2M}$ has
the same magnitude as the $2M$-th moment of a standard Gaussian
random variable multiplied by a constant, and it implies a good
estimate on $P(|Z_{\mu,k}(f)|>u)$ given in~(\ref{(13.24)}). Actually the
result of Theorem~13.7 can be reduced to the special case when
$\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$. Thus it can be
interpreted so that if the quantities~$\alpha_s$ of a $k$-fold
Wiener--It\^o integral are sufficiently small, then these `almost
independence' conditions imply that the $2M$-th moment of this
integral behaves like the $2M$-th moment of a one-fold
Wiener--It\^o integral with the same variance.
Actually Lata{\l}a formulated his result in a different form, and
he proved a slightly weaker result. He considered Gaussian
polynomials of the following form:
\begin{eqnarray}
&&P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k) \nonumber \\
&&\qquad =\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)\xi^{(1)}_{j_1}\cdots\xi^{(k)}_{j_k},
\label{(13.25)}
\end{eqnarray}
where $\xi_j^{(s)}$, $1\le j\le n$ and $1\le s\le k$, are independent
standard normal random variables. Lata{\l}a gave an estimate about
the moments and tail-distribution of such random polynomials.
The problem about the behaviour of such random polynomials can be
reformulated as a problem about the behaviour of Wiener--It\^o
integrals in the following way: Take a measurable space $(X,{\cal X})$
with a non-atomic measure~$\mu$ on it. Let $Z_\mu$ be a white noise
with reference measure~$\mu$, let us choose a set of orthonormal
functions $h^{(s)}_j(x)$, $1\le j\le n$, $1\le s\le k$, on the
space $(X,{\cal X})$ with respect to the measure~$\mu$, and define
the function
\begin{equation}
f(x_1,\dots,x_k)=\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)h^{(1)}_{j_1}(x_1)\cdots h^{(k)}_{j_k}(x_k)
\label{(13.26)}
\end{equation}
together with the Wiener--It\^o integral $Z_{\mu,k}(f)$. Since
the random integrals $\bar\xi_j^{(s)}=\int h_j^{(s)}(x)Z_\mu(\,dx)$,
$1\le j\le n$, $1\le s\le k$, are independent, standard Gaussian
random variables, it is not difficult to see with the help of
It\^o's formula (Theorem~10.3 in this work) that the distributions
of the random polynomial
$P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k)$ and $Z_{\mu,k}(f)$
agree. Here we reformulated Lata{\l}a's estimates about random
polynomials of the form~(\ref{(13.25)}) to estimates about
Wiener--It\^o integrals with kernel function of the
form~(\ref{(13.26)}).
These estimates are equivalent to Lata{\l}a's result if we restrict
our attention to the special class of Wiener--It\^o integrals
with kernel functions of the form~(\ref{(13.26)}). But we have
formulated our result for Wiener--It\^o integrals with a general
kernel function. Lata{\l}a's proof heavily exploits the special
structure of the random polynomials given in~(\ref{(13.25)}),
the independence of the
random variables~$\xi_j^{(s)}$ for different parameters~$s$ in
it. (It would be interesting to find a proof which does not
exploit this property.) On the other hand, this result can
be generalized to the case discussed in Theorem~13.7. This
generalization can be proved by exploiting the theorem of
de la Pe{\~n}a and Montgomery--Smith about the comparison of
$U$-statistics and decoupled $U$-statistics (formulated in
Theorem~14.3 of this work) and the properties of the
Wiener--It\^o integrals. I omit the details of the proof.
Lata{\l}a also proved a converse estimate in~\cite{r27} about
Gaussian random polynomials which shows that the
estimates of Theorem~13.7 are sharp. We formulate it in its
original form, i.e. we restrict our attention to the case of
Wiener--It\^o integrals with kernel functions of the
form~(\ref{(13.26)}).
\medskip\noindent
{\bf Theorem 13.8 (A lower bound about the tail distribution of
Wiener--It\^o integrals).} {\it A random integral $Z_{\mu,k}(f)$
with a kernel function of the form~(\ref{(13.26)}) satisfies the
inequalities
$$
E(Z_{\mu,k}(f))^{2M}\ge
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
$$
and
$$
P(|Z_{\mu,k}(f)|>u)\ge \frac1{C(k)}\exp\left\{-C(k)
\min_{1\le s\le k}\left(\frac u{\alpha_s} \right)^{2/s}\right\}
$$
for all $M=1,2,\dots$ and $u>0$ with some universal constant
$C(k)>0$ depending only on the order~$k$ of the integral, and with
the quantities $\alpha_s$ defined in formulas~(\ref{(13.21)})
and~(\ref{(13.22)}).}
\medskip
Let me finally remark that there is a counterpart of Theorem~13.7
about degenerate $U$-statistics. Adamczak's paper~\cite{r1} contains
such a result. Here we do not discuss it, because this result is
far from the main topic of this work. We only remark that some new
quantities have to be introduced to formulate it. The appearance of
these conditions is related to the fact that in an estimate about
the tail-behaviour of a degenerate $U$-statistic we need a bound
not only on the $L_2$-norm but also on the supremum norm of the
kernel function. In a sharp estimate the bound about the supremum
of the kernel function has to be replaced by a more complex system
of conditions, just as the condition about the $L_2$-norm of the
kernel function was replaced by a condition about the quantities
$\alpha_s$, $1\le s\le k$, defined in formulas~(\ref{(13.21)})
and~(\ref{(13.22)}) in Theorem~13.7.
\chapter{Reduction of the main result in this work}
The main result of this work is Theorem 8.4 or its multiple integral
version Theorem~8.2. It was shown in Section~9 that Theorem 8.2
follows from Theorem~8.4. Hence it is enough to prove Theorem~8.4.
It may be useful to study this problem together with its multiple
Wiener--It\^o integral version, Theorem~8.6.
Theorems~8.6 and~8.4 will be proved similarly to their one-variate
versions, Theorems~4.2 and~4.1. Theorem~8.6 will be proved with
the help of~Theorem~8.5 about the estimation of the tail
distribution of multiple Wiener--It\^o integrals. A natural
modification of the chaining argument applied in the proof of
Theorem~4.2 works also in this case. No new difficulties arise. On
the other hand, in the proof of Theorem~8.4 several new
difficulties have to be overcome. I start with the proof of
Theorem~8.6.\index{estimate on the supremum of Wiener--It\^o
integrals}
\medskip\noindent
{\it Proof of Theorem 8.6.}\/ Fix a number $0<\varepsilon<1$, and
let us list the elements of the countable set
${\cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$ let us
choose by exploiting the conditions of Theorem~8.6 a set
${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}
\subset{\cal F}$ of functions with
$m_p\le2D\,2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}$ elements in
such a way that
$\inf\limits_{1\le j\le m_p}\int(f-f_{a(j,p)})^2\,d\mu
\le 2^{-4p-8}\varepsilon^2\sigma^2$ for all $f\in{\cal F}$, and
beside this let
$f_p\in{\cal F}_p$. For all indices $a(j,p)$, $p=1,2,\dots$,
$1\le j\le m_p$, choose a predecessor $a(j',p-1)$, $j'=j'(j,p)$,
$1\le j'\le m_{p-1}$, in such a way that the functions
$f_{a(j,p)}$ and
$f_{a(j',p-1)}$ satisfy the relation
$\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu
\le\varepsilon^2\sigma^22^{-4(p+1)}$.
Theorem~8.5 with the choice
$\bar u=\bar u(p)=2^{-(p+1)}\varepsilon u$ and
$\bar\sigma=\bar\sigma(p)=2^{-2p-2}\varepsilon\sigma$ yields
the estimates
\begin{eqnarray}
P(A(j,p))&=& P\left(k!|Z_{\mu,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
2^{-(1+p)}\varepsilon u\right)\nonumber \\
&\le& C \exp\left\{-\frac12
\left(\frac{2^{p+1}u}\sigma\right)^{2/k}\right\},
\qquad 1\le j\le m_p,
\label{(14.1)}
\end{eqnarray}
for all $p=1,2,\dots$, and
\begin{eqnarray}
P(B(s))&=&P\left(k!|Z_{\mu,k}(f_{a(s,0)})|
\ge \left(1-\frac \varepsilon2\right)u\right) \nonumber \\
&\le& C\exp\left\{-\frac12
\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}, \quad 1\le s\le m_0. \label{(14.2)}
\end{eqnarray}
Since every $f\in{\cal F}$ is an element of at least one set
${\cal F}_p$, $p=0,1,2,\dots$ (we made a construction where
$f_p\in {\cal F}_p$), the definition of the predecessor of an
index $a(j,p)$ and of the events $A(j,p)$ and
$B(s)$ in formulas~(\ref{(14.1)}) and (\ref{(14.2)}) together
with the previous estimates imply that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}k!|Z_{\mu,k}(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \nonumber \\
&&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \nonumber \\
&&\qquad \le \sum_{p=1}^{\infty} 2CD2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}
\sigma\right)^{2/k} \right\}\nonumber \\
&&\qquad\qquad +2^{1+4L}CD\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}. \label{(14.3)}
\end{eqnarray}
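The union bound in the first inequality of~(\ref{(14.3)}) exploits
that the levels appearing in the definitions of the events $B(s)$
and $A(j,p)$ sum to at most~$u$ along any chain of predecessors,
since
$$
\left(1-\frac\varepsilon2\right)u
+\sum_{p=1}^\infty2^{-(p+1)}\varepsilon u
=\left(1-\frac\varepsilon2\right)u+\frac\varepsilon2 u=u.
$$
Hence, if $k!|Z_{\mu,k}(f)|\ge u$ for some $f\in{\cal F}_p$, then at
least one of the events $A(j,q)$, $1\le q\le p$, $1\le j\le m_q$, or
$B(s)$, $1\le s\le m_0$, must occur.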
Standard calculation shows that if
$u\ge ML^{k/2}\frac1\varepsilon\log^{k/2}\frac2
\varepsilon\cdot\sigma\log^{k/2}\frac2\sigma$
with a sufficiently large constant~$M$, then the inequalities
$$
2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}\sigma\right)^{2/k}
\right\}\le
2^{-p}\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
2^{4L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}\le
\exp\left\{-\frac12\left(\frac{\left(1-\varepsilon\right)u}
\sigma\right)^{2/k}\right\}.
$$
These inequalities together with relation~(\ref{(14.3)}) imply
relation~(\ref{(8.15)}). Theorem~8.6 is proved.
\medskip
The proof of Theorem~8.4 is harder. In this case the chaining
argument in itself does not supply the proof, since Theorem~8.3
gives a good estimate about the distribution of a degenerate
$U$-statistic only if it has a not too small variance. The same
difficulty appeared in the proof of Theorem~4.1, and the method
applied in that case will be adapted to the present situation.
A multivariate version of Proposition~6.1 will be proved in
Proposition~14.1, and another result which can be considered as
a multidimensional version of Proposition~6.2 will be formulated
in Proposition~14.2. It will be shown that Theorem~8.4 follows
from Propositions~14.1 and~14.2. Most steps of these proofs can
be considered as a simple repetition of the corresponding
arguments in the proof of the results in Section~6. Nevertheless,
I wrote them down for the sake of completeness.
\medskip
The result formulated in Proposition~14.1 can be proved in almost
the same way as its one-variate version, Proposition~6.1. The only
essential difference is that now we apply a multivariate version
of Bernstein's inequality given in the Corollary of Theorem~8.3.
In the calculations of the proof of Proposition~14.1 the term
$(\frac u\sigma)^{2/k}$ shows a behaviour similar to the term
$(\frac u\sigma)^2$ in Proposition~6.1. Proposition~14.1 contains the
information we can get by applying Theorem~8.3 together with the
chaining argument. Its main content, inequality~(\ref{(14.4)}),
yields a good estimate on the supremum of degenerate
$U$-statistics if it is taken for an appropriate finite subclass
${\cal F}_{\bar\sigma}$ of the original class of kernel
functions~${\cal F}$. The class of kernel functions
${\cal F}_{\bar\sigma}$ is a relatively dense subclass of
${\cal F}$ in the $L_2$ norm. Proposition~14.1 also provides some
useful estimates on the value of the parameter~$\bar\sigma$ which
describes how dense the class of functions ${\cal F}_{\bar\sigma}$
is in ${\cal F}$.
\medskip\noindent
{\bf Proposition 14.1.} {\it Let the $k$-fold power
$(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$ be given
together with some probability measure $\mu$ on $(X,{\cal X})$
and a countable, $L_2$-dense class ${\cal F}$ of functions
$f(x_1,\dots,x_k)$ of~$k$ variables with some exponent~$L\ge1$
and parameter~$D\ge1$ with respect to the measure~$\mu$ on the
product space $(X^k,{\cal X}^k)$ which has the following
properties. All functions $f\in{\cal F}$ are canonical with
respect to the measure~$\mu$, and they satisfy
conditions~(\ref{(8.4)}) and~(\ref{(8.5)}) with some real number
$0<\sigma\le1$. Take a sequence of independent, $\mu$-distributed
random variables $\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and
consider the (degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in {\cal F}$, defined in formula~(\ref{(8.7)}), and fix some
number $\bar A=\bar A_k\ge2^k$.
There is a number $M=M(\bar A,k)$ such that for all
numbers~$u>0$ for which the inequality
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}
\ge M(L\log\frac2\sigma+\log D)$ holds, a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$,
and a collection of functions
${\cal F}_{\bar\sigma}={\cal F}_{\bar\sigma(u)}
=\{f_1,\dots,f_m\}\subset{\cal F}$ with $m\le D\bar\sigma^{-L}$
elements can be chosen in such a way that the sets
${\cal D}_j=\{f\colon\, f\in {\cal F},\; \int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, satisfy the relation
${\cal F}=\bigcup\limits_{j=1}^m{\cal D}_j$, and for the
(degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in{\cal F}_{\bar\sigma(u)}$, the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma(u)}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar A}\right)\le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\} \nonumber \\
&&\qquad \qquad \textrm{if}\quad n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M\left(L\log\frac2\sigma+\log D\right) \label{(14.4)}
\end{eqnarray}
holds with the constants $\alpha=\alpha(k)$, $C=C(k)$ appearing in
formula~(\ref{($8.10'$)}) of the Corollary of Theorem~8.3 and the
exponent $L$ and parameter $D$ of the $L_2$-dense class ${\cal F}$.
Beside this, also the inequality
$4\left(\frac u{\bar A\bar\sigma}\right)^{2/k}\ge
n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$
holds for this number $\bar\sigma=\bar\sigma(u)$. If the
number~$u$ satisfies also the inequality
\begin{equation}
n\sigma^2\ge \left(\frac u\sigma\right)^{2/k}\ge
M(L^{3/2}\log\frac2\sigma +(\log D)^{3/2}) \label{(14.5)}
\end{equation}
with a sufficiently large number $M=M(\bar A,k)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too.}
\medskip\noindent
{\it Proof of Proposition 14.1.} Let us list the elements of the
countable set ${\cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$
let us choose, by exploiting the $L_2$-density property of the class
${\cal F}$, a set ${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset
{\cal F}$ with $m_p\le D\,2^{2pL}\sigma^{-L}$ elements in such a way
that $\inf\limits_{1\le j\le m_p}\int (f-f_{a(j,p)})^2\,d\mu\le
2^{-4p}\sigma^2$ for all $f\in{\cal F}$.
For all indices $a(j,p)$, $p=1,2,\dots$, $1\le j\le m_p$, choose a
predecessor $a(j',p-1)$, $j'=j'(j,p)$, $1\le j'\le m_{p-1}$, in such a
way that the functions $f_{a(j,p)}$ and $f_{a(j',p-1)}$ satisfy the
relation $\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu\le \sigma^2
2^{-4(p-1)}$. Then the inequalities
$\int\left(\frac{f_{a(j,p)}-f_{a(j',p-1)}}2\right)^2\,d\mu
\le4\sigma^2 2^{-4p}$
and $\sup\limits_{x_j\in X,\,1\le j\le k}\left|
\frac{f_{a(j,p)}(x_1,\dots,x_k)-f_{a(j',p-1)}(x_1,\dots,x_k)}2\right|
\le 1$ hold. The Corollary of Theorem~8.3 yields that
\begin{eqnarray}
P(A(j,p))&&=P\left(n^{-k/2}|I_{n,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
\frac{2^{-(1+p)}u}{\bar A}\right)\nonumber \\
&&\le C \exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A
\sigma}\right)^{2/k} \right\}
\quad \textrm {if}\quad 4n\sigma^2 2^{-4p}\ge\left(\frac{2^{p}u}
{8\bar A\sigma}\right)^{2/k}, \nonumber \\
&&\qquad\qquad 1\le j\le m_p,\; p=1,2,\dots,
\label{(14.6)}
\end{eqnarray}
and
\begin{eqnarray}
P(B(s))&&=P\left(n^{-k/2}|I_{n,k}(f_{a(s,0)})|
\ge \frac u{2\bar A}\right)\le
C\exp\left\{-\alpha\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\},
\nonumber \\
&& \qquad 1\le s\le m_0, \quad \textrm{ if }
n\sigma^2\ge \left(\frac u{2\bar A\sigma}\right)^{2/k}. \label{(14.7)}
\end{eqnarray}
Introduce an integer $R=R(u)$, $R>0$, which satisfies the relations
$$
2^{(4+{2/k})(R+1)}\left(\frac{u}{\bar A\sigma}\right)^{2/k} \ge
2^{2+6/k}n\sigma^2\ge 2^{(4+2/k)R}
\left(\frac{u}{\bar A\sigma}\right)^{2/k},
$$
and define $\bar\sigma^2=2^{-4R}\sigma^2$ and
${\cal F}_{\bar\sigma}={\cal F}_R$ (this is the class of functions
${\cal F}_p$ introduced at the start of the proof with $p=R$).
(We defined the number~$R$, analogously to the proof of
Proposition~6.1, as the largest number~$p$ for which the condition
formulated in~(\ref{(14.6)}) holds. Such a positive integer $R$
exists, since $\bar A\ge2^k$ implies $\bar A^{-2/k}\le\frac14$,
hence $2^{4+2/k}\left(\frac u{\bar A\sigma}\right)^{2/k}
\le2^{2+2/k}\left(\frac u\sigma\right)^{2/k}\le2^{2+6/k}n\sigma^2$
by the condition $n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}$.) The
cardinality~$m$ of the set ${\cal F}_{\bar\sigma}$ is clearly not
greater than $D\bar\sigma^{-L}$, and
$\bigcup\limits_{j=1}^m {\cal D}_j={\cal F}$. Beside this, the number
$R$ was chosen in such a way that the inequalities
(\ref{(14.6)}) and (\ref{(14.7)}) hold for $1\le p\le R$. Hence the
definition of the predecessor of an index $a(j,p)$ implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar
A}\right) \le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le \sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \\
&&\le \sum_{p=1}^{\infty} CD\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A\sigma}\right)^{2/k}
\right\}
+CD\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^{2/k}\right\}.
\end{eqnarray*}
If the condition $\left(\frac u\sigma\right)^{2/k}\ge
M(L\log\frac2\sigma+\log D)$ holds with a sufficiently large
constant $M$ (depending on $\bar A$), then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar
A\sigma}\right)^{2/k} \right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\}\le
\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}.
$$
Hence the previous estimate implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|\ge
\frac u{\bar A}\right) \le\sum_{p=1}^{\infty}C 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^{2/k}
\right\}\\
&&\qquad +C\exp\left\{-\alpha\left(\frac u{10\bar A
\sigma}\right)^{2/k}\right\} \le 2C\exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^{2/k}\right\},
\end{eqnarray*}
and relation~(\ref{(14.4)}) holds.
The estimates
\begin{eqnarray*}
\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
&\le&2^{-2-6/k}2^{2R/k}\left(\frac u{\bar A\sigma}\right)^{2/k}
=2^{-4R}\cdot2^{(4+2/k)R-2-6/k}\left(\frac{u}
{\bar A\sigma}\right)^{2/k}\\
&\le& n\bar\sigma^2=2^{-4R} n\sigma^2\le
2^{-4R}\cdot2^{(4+2/k)(R+1)-2-6/k}
\left(\frac{u}{\bar A\sigma}\right)^{2/k}\\
&=&2^{2-4/k}\cdot 2^{2R/k}\left(\frac{u}{\bar A \sigma}\right)^{2/k}
=2^{2-4/k}\cdot2^{-2R/k} \left(\frac{u}
{\bar A\bar\sigma}\right)^{2/k} \\
&\le& 4 \left(\frac{u}{\bar A\bar\sigma}\right)^{2/k}
\end{eqnarray*}
hold because of the relation~$R\ge1$. This means that
$n\bar\sigma^2$
has the upper and lower bound formulated in Proposition~14.1.
It remains to show that $n\bar\sigma^2\ge
L\log n+\log D$ if relation~(\ref{(14.5)}) holds.
This inequality clearly holds under the conditions of
Proposition~14.1
if $\sigma\le n^{-1/3}$, since in this case
$\log\frac2\sigma\ge\frac{\log n}3$, and
\begin{eqnarray*}
n\bar\sigma^2&\ge&\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
\ge\frac1{64}\bar A^{-2/k}
M(L^{3/2}\log\frac2\sigma +(\log D)^{3/2}) \\
&\ge&\frac1{192}\bar A^{-2/k} M(L^{3/2}\log n+(\log D)^{3/2})
\ge L\log n+\log D
\end{eqnarray*}
if $M= M(\bar A,k)$ is sufficiently large.
If $\sigma\ge n^{-1/3}$, then the inequality $2^{(4+2/k)R}
\left(\frac u{\bar A\sigma}\right)^{2/k} \le2^{2+6/k} n\sigma^2$
can be applied. This
implies that $2^{-4R}\ge2^{-4(2+6/k)/(4+2/k)}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^{2/k}}
{n\sigma^2}\right]^{4/(4+2/k)}$, and
$$
n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma
\textrm{ with } \gamma=\frac4{4+\frac2k}\ge\frac23.
$$
The inequalities $n\sigma^2\ge n^{1/3}$ and
$n\sigma^2\ge(\frac u\sigma)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold,
(since $\log\frac2\sigma\ge\frac12$). They yield that for
sufficiently large $M=M(\bar A,k)$
$(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma\ge
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3}=
(n\sigma^2)^{1/(2k+1)}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3}$, and
\begin{eqnarray*}
n\bar\sigma^2
&\ge& \frac{\bar A^{-4/3}}{50}
(n\sigma^2)^{1/(2k+1)}\left[\left(\frac
u\sigma\right)^{2/k}\right]^{2/3}\\
&\ge& \frac{\bar A^{-4/3}}{50}n^{1/(3(2k+1))}
\left(\frac M2\right)^{2/3} (L^{3/2}+(\log D)^{3/2})^{2/3}
\ge L\log n+\log D.
\end{eqnarray*}
Proposition~14.1 is proved.
\medskip
A multivariate analogue of Proposition~6.2 is formulated in
Proposition~14.2, and it will be shown that Propositions~14.1
and~14.2 imply Theorem~8.4.\index{estimate on the supremum of
degenerate $U$-statistics}
\medskip\noindent
{\bf Proposition 14.2.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\cal X})$ together with a sequence
of independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$ and a countable $L_2$-dense class ${\cal F}$ of
canonical (with respect to the measure~$\mu$) kernel functions
$f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$ and exponent
$L\ge1$ on the product space $(X^k,{\cal X}^k)$. Let all functions
$f\in{\cal F}$ satisfy conditions~(\ref{(8.1)})
and~(\ref{(8.2)}) with some
$0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$. Let us consider
the (degenerate) $U$-statistics $I_{n,k}(f)$ with the random
sequence $\xi_1,\dots,\xi_n$, $n\ge\max(2,k)$, and kernel
functions $f\in{\cal F}$. There exists a threshold index
$A_0=A_0(k)>0$ and two numbers $\bar C=\bar C(k)>0$ and
$\gamma=\gamma(k)>0$ depending only on the order $k$ of the
$U$-statistics such that the degenerate $U$-statistics
$I_{n,k}(f)$, $f\in{\cal F}$, satisfy the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|n^{-k/2}I_{n,k}(f)|
\ge A n^{k/2}\sigma^{k+1}\right)
\le \bar C e^{-\gamma A^{1/2k}n\sigma^2}\quad \textrm{if } A\ge A_0.
\label{(14.8)}
\end{equation}
}
\medskip
Proposition~14.2 yields an estimate for the tail distribution of
the supremum of degenerate $U$-statistics at level
$u\ge A_0n^{k/2}\sigma^{k+1}$, i.e. in the case when Theorem~8.3
does not give a good estimate on the tail distribution of the
individual degenerate $U$-statistics taking part in the supremum
on the left-hand side of~(\ref{(14.8)}).
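The appearance of the level $An^{k/2}\sigma^{k+1}$ in
formula~(\ref{(14.8)}) can be understood from the simple equivalence
$$
\left(\frac u\sigma\right)^{2/k}\le n\sigma^2
\quad\Longleftrightarrow\quad u\le n^{k/2}\sigma^{k+1},
$$
which shows that levels $u$ of order larger than
$n^{k/2}\sigma^{k+1}$ are exactly those which violate the condition
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}$ appearing in
Proposition~14.1.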
Formula~(\ref{(8.11)}) will be proved by means of
Proposition~14.1 with an
appropriate choice of the parameter $\bar A$ in it and
Proposition~14.2 with the choice $\sigma=\bar\sigma=\bar\sigma(u)$
and the classes of functions
${\cal F}_j=\left\{\frac{g-f_j}2\colon\, g\in{\cal D}_j\right\}$
with the number $\bar\sigma$, functions~$f_j$ and sets of
functions~${\cal D}_j$, $1\le j\le m$, introduced in Proposition~14.1.
Clearly,
\begin{eqnarray}
&&P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\le
P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|I_{n,k}(f)|
\ge \frac u{\bar A}\right) \nonumber \\
&&\qquad+\sum_{j=1}^m P\left(\sup_{g\in{\cal D}_j} n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar A}\right)u\right),
\label{(14.9)}
\end{eqnarray}
where $m$ is the cardinality of the set of functions
${\cal F}_{\bar\sigma}$ appearing in Proposition~14.1.
We shall estimate the two terms of the sum on the right-hand side
of~(\ref{(14.9)}) by means of Propositions~14.1 and~14.2 with a good
choice of the parameters $\bar A$ and the corresponding $M=M(\bar A)$
in Proposition~14.1 together with a parameter $A\ge A_0$ in
Proposition~14.2.
We shall choose the parameter~$A\ge A_0$ in the application of
Proposition~14.2 so that it satisfies also the relation
$\gamma A^{1/2k}\ge2$ with the
number~$\gamma$ appearing in Proposition~14.2, hence we put
$A=\max(A_0,(\frac2\gamma)^{2k})$. After this choice we want to
define the parameter $\bar A$ in Proposition~14.1 in such a way
that the numbers~$u$ satisfying the conditions of Proposition~14.1
also satisfy the relation
$(\frac12-\frac1{2\bar A})u\ge An^{k/2}\bar\sigma^{k+1}$ with
the already fixed number~$A$. This inequality can be rewritten
in the form $A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}
(\frac u{\bar\sigma})^{2/k}\ge n\bar\sigma^2$. On the other hand,
under the conditions of Proposition~14.1 the inequality
$4(\frac u{\bar A\bar\sigma})^{2/k}\ge n\bar\sigma^2$ holds.
Hence the desired inequality holds if
$A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}\ge 4{\bar A}^{-2/k}$.
Thus the number $\bar A=2^{k+1}A+1$ is an appropriate choice.
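Indeed, multiplying both sides of the desired inequality
$A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}\ge4\bar A^{-2/k}$ by
$(A\bar A)^{2/k}$ and raising them to the power $k/2$ shows that it
is equivalent to the relation
$$
\frac{\bar A-1}2\ge2^kA,\qquad\textrm{that is, to}\qquad
\bar A\ge2^{k+1}A+1.
$$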
With such a choice of $\bar A$ (together with the corresponding
$M=M(\bar A,k)$) and $A$ we can write
\begin{eqnarray*}
&&P\left(\sup_{g\in{\cal D}_j} n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar
A}\right)u\right) \\
&&\qquad\le P\left(\sup_{g\in{\cal D}_j}n^{-k/2}
\left|I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge A n^{k/2}\bar\sigma^{k+1}\right)
\le \bar Ce^{-\gamma A^{1/2k}n\bar\sigma^2}
\end{eqnarray*}
for all $1\le j\le m$.
(Observe that the set of functions $\frac{f_j-g}2,\;g\in{\cal D}_j$, is
an $L_2$-dense class with parameter $D$ and exponent $L$.) Hence
Proposition~14.1 (relation (\ref{(14.4)}) together with
the inequality $m\le
D\bar \sigma^{-L}$) and formula (\ref{(14.8)}) with our
$A\ge A_0$ and relation~(\ref{(14.9)}) imply that
\begin{eqnarray}
&&P\left(\sup\limits_{f\in{\cal F}} n^{-k/2}|I_{n,k}(f)|\ge u\right)
\nonumber \\
&&\qquad \le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}
+\bar CD\bar\sigma^{-L} e^{-\gamma A^{1/2k}n\bar\sigma^2}.
\label{(14.10)}
\end{eqnarray}
We show by repeating an argument given in Section~6 that
$D\bar\sigma^{-L}\le e^{n\bar\sigma^2}$. Indeed, we have to show
that $\log D+L\log\frac1{\bar\sigma}\le n\bar\sigma^2$. But, as we
have seen, the relation $n\bar\sigma^2\ge L\log n+\log D$ with
$L\ge1$ and $D\ge1$ implies that $n\bar\sigma^2\ge\log n$, hence
$\log\frac1{\bar\sigma}\le\log n$, and
$\log D+L\log\frac1{\bar\sigma}\le \log D+L\log n\le n\bar\sigma^2$.
On the other hand, $\gamma A^{1/2k}\ge2$ by the definition of
the number~$A$, and by the estimates of Proposition~14.1
$n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$.
The above relations imply that
$D\bar\sigma^{-L} e^{-\gamma A^{1/2k}n \bar\sigma^2}
\le e^{-\gamma A^{1/2k}n\bar\sigma^2/2}
\le \exp\left\{-\frac\gamma{128} A^{1/2k} \bar A^{-2/k}
\left(\frac u\sigma\right)^{2/k}\right\}$.
Hence relation~(\ref{(14.10)}) yields that
\begin{eqnarray*}
&&P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)\\
&&\qquad\le 2C\exp \left\{-\frac\alpha{(10\bar A)^{2/k}}\left(\frac
u\sigma\right)^{2/k}\right\} +\bar C\exp\left\{-\frac\gamma{128}
A^{1/2k} \bar A^{-2/k} \left(\frac u\sigma\right)^{2/k}\right\},
\end{eqnarray*}
and this estimate implies Theorem~8.4.
\medskip
To complete the proof of Theorem~8.4 we have to prove
Proposition~14.2. It will be proved, similarly to its one-variate
version Proposition~6.2, by means of a symmetrization argument.
We have to find its right formulation. It would be natural to
formulate it as a result about the supremum of degenerate
$U$-statistics. However, we shall choose a slightly different
approach, based on the notion of the decoupled $U$-statistic.
Decoupled $U$-statistics behave similarly to $U$-statistics, but
it is simpler to work with them, because they have more
independence properties. It turned out to be useful to introduce
this notion, to apply a result of de la Pe\~na and
Montgomery--Smith which enables us to reduce the estimation of
$U$-statistics to the estimation of decoupled $U$-statistics,
and to work out the symmetrization argument for decoupled
$U$-statistics.
Next we introduce the notion of decoupled $U$-statistics
together with their randomized version. We also formulate a
result of de la Pe\~na and Montgomery--Smith in Theorem~14.3
which enables us to reduce Proposition~14.2 to a version of it,
presented in Proposition~$14.2'$. It states a result similar
to Proposition~14.2 about decoupled $U$-statistics. The proof of
Proposition~$14.2'$ is the hardest part of the problem. In
Sections~15, 16 and~17 we deal essentially with this problem.
The result of de la Pe\~na and Montgomery--Smith will be
proved in Appendix~D.
Now we introduce the following notions.
\medskip\noindent
{\bf Definition of decoupled and randomized decoupled
$U$-statistics.}\index{decoupled $U$-statistics}\index{randomized
decoupled $U$-statistics} {\it Let us have $k$ independent
copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of a sequence
$\xi_1,\dots,\xi_n$ of independent and identically distributed
random variables taking their values in a measurable space
$(X,{\cal X})$ together with a measurable function $f(x_1,\dots,x_k)$
on the product space $(X^k,{\cal X}^k)$ with values in a separable
Banach space. The decoupled $U$-statistic $\bar I_{n,k}(f)$
determined by the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, and kernel function $f$ is defined by the formula
\begin{equation}
\bar I_{n,k}(f)=\frac1{k!}\sum_{\substack{(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)}\right).
\label{(14.11)}
\end{equation}
Let us have beside the sequences
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and function
$f(x_1,\dots,x_k)$ a sequence of independent random variables
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is independent also of the sequences of
random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$.
The randomized decoupled $U$-statistic $\bar I_{n,k}(f,\varepsilon)$
(depending on the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function $f$ and the randomizing sequence
$\varepsilon_1,\dots,\varepsilon_n$) is defined by the formula
\begin{equation}
\bar I^\varepsilon_{n,k}(f)=\frac1{k!}\sum_{\substack
{(l_1,\dots,l_k)\colon\, 1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f\left(\xi_{l_1}^{(1)},
\dots,\xi_{l_k}^{(k)}\right).
\label{(14.12)}
\end{equation}
}
\medskip
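To illustrate formula~(\ref{(14.11)}), in the simplest case $k=2$
the decoupled $U$-statistic takes the form
$$
\bar I_{n,2}(f)=\frac12\sum_{\substack{(l_1,l_2)\colon\,
1\le l_1,l_2\le n,\\ l_1\neq l_2}}
f\left(\xi_{l_1}^{(1)},\xi_{l_2}^{(2)}\right),
$$
i.e.\ the first argument of the kernel function is always taken
from the first copy and the second argument from the second,
independent copy of the underlying sequence. The (usual)
$U$-statistic $I_{n,2}(f)$ is obtained by substituting the same
sequence $\xi_1,\dots,\xi_n$ for both copies.
\medskip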
A decoupled or randomized decoupled $U$-statistic (with real
valued kernel function) will be called degenerate if its kernel
function is canonical. This terminology is in full accordance with
the definition of (usual) degenerate $U$-statistics.
A result of de la Pe\~na and Montgomery--Smith will be formulated
below. It gives an upper bound for the tail distribution of a
$U$-statistic by means of the tail distribution of an appropriate
decoupled $U$-statistic. It also has a generalization, where the
supremum of $U$-statistics is bounded by the supremum of decoupled
$U$-statistics. It enables us to reduce Proposition~14.2 to a
version formulated in Proposition~$14.2'$, which gives a bound on the
tail distribution of the supremum of decoupled $U$-statistics.
It is simpler to prove this result than the original one.
Before the formulation of the theorem of de la Pe\~na and
Montgomery--Smith I make some remarks about it. It considers
more general $U$-statistics with kernel functions taking values
in a separable Banach space, and it compares the norm of
Banach space valued $U$-statistics and decoupled $U$-statistics.
(Decoupled $U$-statistics were defined with general Banach space
valued kernel functions, and the definition of $U$-statistics can
also be generalized to separable Banach space valued kernel
functions in a natural way.) This result was formulated in such
a general form for a special reason. This helped to derive
formula~(\ref{(14.14)}) of the subsequent theorem from
formula~(\ref{(14.13)}).
In the proof of formula~(\ref{(14.14)}) it can be exploited that the
constants in the estimate~(\ref{(14.13)}) do not depend on the Banach
space where the kernel function~$f$ takes its values.
\medskip\noindent
{\bf Theorem 14.3 (Theorem of de la Pe\~na and Montgomery--Smith
about the comparison of $U$-statistics and decoupled
$U$-statistics).}\index{comparison of the tail distribution of
$U$-statistics and decoupled $U$-statistics (result of de la
Pe\~na and Montgomery--Smith)}
{\it Let us consider a sequence of independent
and identically distributed random variables $\xi_1,\dots,\xi_n$
with values in a measurable space $(X,{\cal X})$ together with $k$
independent copies $\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
of this sequence. Let us also have a function $f(x_1,\dots,x_k)$ on
the $k$-fold product space $(X^k,{\cal X}^k)$ which takes its values
in a separable Banach space~$B$. Let us take the $U$-statistic and
decoupled $U$-statistic $I_{n,k}(f)$ and $\bar I_{n,k}(f)$ with
the help of the above random sequences $\xi_1,\dots,\xi_n$,
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and kernel
function~$f$. There exist some constants $\bar C=\bar C(k)>0$
and $\gamma=\gamma(k)>0$ depending only on the order~$k$ of the
$U$-statistic such that
\begin{equation}
P\left(\|I_{n,k}(f)\|>u\right)
\le\bar CP\left(\|\bar I_{n,k}(f)\|>\gamma u\right)
\label{(14.13)}
\end{equation}
for all $u>0$. Here $\|\cdot\|$ denotes the norm in the Banach
space~$B$ where the function~$f$ takes its values.
More generally, if we have a countable sequence of functions
$f_s$, $s=1,2,\dots$, taking their values in the same separable
Banach-space, then
\begin{equation}
P\left(\sup_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|>u\right)\le
\bar CP\left(\sup_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|
>\gamma u\right). \label{(14.14)}
\end{equation}
}
\medskip
Now I formulate the following version of Proposition~14.2.
\medskip\noindent
{\bf Proposition 14.2$'$.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\cal X})$ together with a sequence
of independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and a countable $L_2$-dense
class ${\cal F}$ of canonical (with respect to the measure~$\mu$)
kernel functions $f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$
and exponent $L\ge1$ on the product space $(X^k,{\cal X}^k)$. Let
all functions $f\in{\cal F}$ satisfy conditions~(\ref{(8.1)})
and~(\ref{(8.2)})
with some $0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$.
Let us take $k$ independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, of the random sequence $\xi_1,\dots,\xi_n$, and
consider the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\cal F}$, defined with their help in formula~(\ref{(14.11)}).
There exists a threshold index $A_0=A_0(k)>0$ depending only on
the order $k$ of the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\cal F}$, such that the (degenerate) decoupled
$U$-statistics $\bar I_{n,k}(f)$, $f\in{\cal F}$, satisfy the
following version of inequality (\ref{(14.8)}):
\begin{equation}
P\left(\sup_{f\in{\cal F}}n^{-k/2}|\bar I_{n,k}(f)|
\ge An^{k/2}\sigma^{k+1}\right)
\le e^{-2^{-(1/2+1/2k)} A^{1/2k}n\sigma^2}\quad \textrm{if } A\ge A_0.
\label{(14.15)}
\end{equation}
}
\medskip
It is clear that Proposition~$14.2'$ and Theorem~14.3, more
explicitly formula~(\ref{(14.14)}) in it, imply
Proposition~14.2. Hence the
proof of Theorem~8.4 was reduced to Proposition~$14.2'$ in this
section. The proof of Proposition~$14.2'$ is based on a
symmetrization argument. Its main ideas will be explained in the
next section.
\chapter{The strategy of the proof for the main result of
this work}
In the previous section the proof of Theorem~8.4 was reduced to
that of Proposition~$14.2'$. Proposition~$14.2'$ is a multivariate
version of Proposition~6.2, and its proof is based on similar
ideas. An important step in the proof of Proposition~6.2 was a
symmetrization argument in which we reduced the estimation of
the probability $P\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{j=1}^nf(\xi_j)>u\right)$
to the estimation of the probability
$P\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{j=1}^n\varepsilon_jf(\xi_j)>\frac u3\right)$, where
$\xi_1,\dots,\xi_n$ is a sequence of independent and identically
distributed random variables, and $\varepsilon_j$, $1\le j\le n$,
is a sequence of independent random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, independent of
the sequence~$\xi_j$. Let us understand how to formulate the
corresponding symmetrization argument in the proof of
Proposition~$14.2'$ and how to prove it.\index{estimate on the
supremum of degenerate $U$-statistics}
The symmetrization argument applied in the proof of Proposition~6.2
was carried out in two steps. We took a copy $\xi_1',\dots,\xi'_n$
of the sequence $\xi_1,\dots,\xi_n$, i.e. a sequence of independent
random variables which is independent also of the original
sequence $\xi_1,\dots,\xi_n$, and has the same distribution. In the
first step we compared the tail distribution of the expression
$\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^n[f(\xi_j)-f(\xi'_j)]$
with that of $\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^nf(\xi_j)$.
This was done with the help of Lemma~7.1. In the second step, in
Lemma~7.2, we proved a `randomization argument' which stated that
the distribution of the random fields
$\sum\limits_{j=1}^n[f(\xi_j)-f(\xi_j')]$ and
$\sum\limits_{j=1}^n\varepsilon_j[f(\xi_j)-f(\xi_j')]$,
$f\in{\cal F}$, agree. The symmetrization argument was proved
with the help of these two observations.
In the proof of Proposition~$14.2'$ we would like to reduce the
estimation of the tail distribution of the supremum of decoupled
$U$-statistics $\sup\limits_{f\in{\cal F}}\bar I_{n,k}(f)$ defined
in formula~(\ref{(14.11)}) to the estimation of the tail distribution of
the supremum of randomized decoupled $U$-statistics
$\sup\limits_{f\in{\cal F}}\bar I_{n,k}^\varepsilon(f)$ defined
in formula~(\ref{(14.12)})
by means of a similar argument. To do this first we have to
understand what kind of random fields should be introduced
instead of $\sum\limits_{j=1}^n[f(\xi_j)-f(\xi'_j)]$, $f\in{\cal F}$,
in the new case. In formula~(\ref{(15.1)}) we shall define such a random
field. Its definition is somewhat reminiscent of the definition of
Stieltjes measures. In Lemma~15.1 we will show that a version
of the `randomization argument' of Lemma~7.2 can be applied when
we are working with this random field.
The adaptation of the first step of the symmetrization argument
in the proof of Proposition~6.2 to the present case is much harder.
The proof of Proposition~6.2 was based on the symmetrization lemma,
Lemma~7.1, which does not work in the present case. Hence we shall
prove a generalization of this result in Lemma~15.2. The proof of
the symmetrization argument is difficult even with the help of this
result. The hardest part of our problem appears at this point. I
return to this point after the formulation of Lemma~15.2.
To formulate Lemma~15.1, which is needed in our proof, we
introduce some notation.
Let ${\cal V}_k$ denote the set of all sequences $(v(1),\dots,v(k))$
of length~$k$ such that $v(j)=+1$ or $v(j)=-1$ for all $1\le j\le k$.
Let $m(v)$, $v=(v(1),\dots,v(k))\in{\cal V}_k$, denote the number of
digits $-1$ in the sequence $v$. Let a (real valued) function
$f(x_1,\dots,x_k)$ of $k$ variables be given on a measurable space
$(X,{\cal X})$ together with a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(X,{\cal X})$ and $2k$ independent copies
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, $1\le j\le k$, of this
sequence. Let us have beside them another sequence
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, of
independent random variables, also independent of all previously
introduced random variables. With the help of the above
quantities we introduce the random variables
\begin{equation}
\tilde I_{n,k}(f)=\frac1{k!}\sum_{v\in {\cal V}_k}
(-1)^{m(v)} \sum_{\substack{(l_1,\dots,l_k)\colon\, 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'} \textrm{ if } r\neq r'}}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right)
\label{(15.1)}
\end{equation}
and
\begin{eqnarray}
\tilde I^\varepsilon_{n,k}(f)
&&=\frac1{k!}\sum_{v\in {\cal V}_k}
(-1)^{m(v)} \label{(15.2)} \\
&&\qquad \sum_{\substack{ (l_1,\dots,l_k)\colon\, 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'}
\textrm{ if } r\neq r'}} \varepsilon_{l_1}\cdots\varepsilon_{l_k}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right)
\nonumber
\end{eqnarray}
The number $m(v)$ in the above formulas denotes the number of the
digits $-1$ in the $\pm1$ sequence $v$ of length~$k$, hence it
counts how many random variables $\xi_{l_j}^{(j,1)}$, $1\le j\le k$,
were replaced by the `secondary copy' $\xi_{l_j}^{(j,-1)}$ for a
$v\in{\cal V}_k$ in the inner sum in formulas~(\ref{(15.1)})
or~(\ref{(15.2)}).
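An elementary identity behind this sign convention: since each digit $-1$ of $v$ contributes one factor $-1$, the number $(-1)^{m(v)}$ equals the product of the digits of $v$. The following small sketch (Python; purely illustrative, not part of the original text) checks this for small values of~$k$:

```python
from itertools import product

def m(v):
    """Number of -1 digits of a +-1 sequence v, as in formulas (15.1)-(15.2)."""
    return sum(1 for d in v if d == -1)

# (-1)^{m(v)} coincides with the product of the digits of v
for k in (1, 2, 3, 4):
    for v in product([1, -1], repeat=k):
        sign = 1
        for d in v:
            sign *= d
        assert (-1) ** m(v) == sign
```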
The following result holds.
\medskip\noindent
{\bf Lemma 15.1.} {\it Let us consider a (non-empty) class of
functions ${\cal F}$ of $k$ variables $f(x_1,\dots,x_k)$ on the
space $(X^k,{\cal X}^k)$ together with the random variables
$\tilde I_{n,k}(f)$ and $\tilde I^\varepsilon_{n,k}(f)$ defined in
formulas~(\ref{(15.1)}) and~(\ref{(15.2)}) for all $f\in {\cal F}$.
The distributions of the random fields $\tilde I_{n,k}(f)$,
$f\in{\cal F}$, and $\tilde I^\varepsilon_{n,k}(f)$, $f\in {\cal F}$,
agree.}
\medskip
Let me recall that we say that the distributions of two random
fields $X(f)$, $f\in{\cal F}$, and $Y(f)$, $f\in{\cal F}$,
agree if for any finite set $\{f_1,\dots,f_p\}\subset{\cal F}$ the
distributions of the random vectors $(X(f_1),\dots,X(f_p))$ and
$(Y(f_1),\dots,Y(f_p))$ agree.
\medskip\noindent
{\it Proof of Lemma 15.1.}\/ I even claim that for any fixed
sequence
$$
u=(u(1),\dots,u(n)), \quad u(l)=\pm1, \;\; 1\le l\le n,
$$
of length~$n$ the conditional distribution of the field
$\tilde I^\varepsilon_{n,k}(f)$, $f\in {\cal F}$, under the
condition $(\varepsilon_1,\dots,\varepsilon_n)=u=(u(1),\dots,u(n))$
agrees with the distribution of the field of $\tilde I_{n,k}(f)$,
$f\in{\cal F}$.
Indeed, the random variables $\tilde I_{n,k}(f)$, $f\in{\cal F}$,
defined in (\ref{(15.1)}) are functions of a random vector
with coordinates
$(\xi_l^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$,
$1\le l\le n$, $1\le j\le k$, and the distribution of this random
vector does not change if the coordinates
$(\xi_{l}^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
with such indices $(l,j)$ for which $u(l)=-1$ (and the index~$j$
is arbitrary) are replaced by
$(\bar\xi_l^{(j)},\xi_l^{(j)})=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$,
and the coordinates $(\xi_{l}^{(j)},\bar\xi_l^{(j)})$ with such
indices $(l,j)$ for which $u(l)=1$ are not changed. As a
consequence, the random field
$\tilde I_{n,k}(f|u)$, $f\in{\cal F}$, that we get by replacing the
original vector $(\xi_l^{(j)},\bar\xi_l^{(j)})$, $1\le l\le n$,
$1\le j\le k$, in the definition of the expression
$\tilde I_{n,k}(f)$ in~(\ref{(15.1)}) for all $f\in {\cal F}$ by this
modified vector depending on~$u$ has the same distribution as the
random field $\tilde I_{n,k}(f)$, $f\in{\cal F}$. On the other hand,
I claim that the distribution of the random field
$\tilde I_{n,k}(f|u)$, $f\in{\cal F}$, agrees with the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\cal F}$, defined in~(\ref{(15.2)}) under the condition that
$(\varepsilon_1,\dots,\varepsilon_n)=u$ with $u=(u(1),\dots,u(n))$.
To prove the last statement let us observe that the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\cal F}$, under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$ is the same
as the distribution of the random field we obtain by substituting
$u(l)$ for $\varepsilon_l$, $1\le l\le n$, in all coordinates
$\varepsilon_l$ of the random
variables $\tilde I^\varepsilon_{n,k}(f)$. On the other hand, the
random variables we get in such a way agree with the random
variables appearing in the sum defining $\tilde I_{n,k}(f|u)$,
only the terms in this sum are listed in a different order.
Lemma~15.1 is proved.
\medskip
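The key step of this proof, the exact coincidence of the conditioned randomized sum with the plain sum taken over the swapped sample, can be checked numerically in a small case. The sketch below (Python; all names and the particular test kernel are our own illustrative choices) fixes a sign sequence $u$, evaluates the sum of formula~(\ref{(15.2)}) with $\varepsilon=u$, and compares it with the sum of formula~(\ref{(15.1)}) computed on the sample in which the pairs $(\xi_l^{(j,1)},\xi_l^{(j,-1)})$ with $u(l)=-1$ are interchanged; the two numbers agree exactly.

```python
import math
import random
from itertools import permutations, product

def tilde_I(f, xi, n, k, eps=None):
    """Formula (15.1), and formula (15.2) when a sign sequence eps is given;
    xi[j][s][l] plays the role of xi_l^{(j+1, s)} with s in {+1, -1}."""
    total = 0.0
    for v in product([1, -1], repeat=k):
        sign = (-1) ** sum(1 for d in v if d == -1)   # (-1)^{m(v)}
        for ls in permutations(range(n), k):          # distinct l_1,...,l_k
            w = sign
            if eps is not None:
                for l in ls:
                    w *= eps[l]                       # eps_{l_1} ... eps_{l_k}
            total += w * f(*[xi[j][v[j]][ls[j]] for j in range(k)])
    return total / math.factorial(k)

rng = random.Random(0)
n, k = 4, 2
f = lambda x1, x2: x1 * x2 + x1 ** 2                  # arbitrary test kernel
xi = [{s: [rng.gauss(0, 1) for _ in range(n)] for s in (1, -1)}
      for _ in range(k)]
u = [1, -1, -1, 1]                                    # a fixed +-1 sequence

# (15.2) evaluated under the condition eps = u ...
left = tilde_I(f, xi, n, k, eps=u)
# ... equals (15.1) on the sample whose pairs with u(l) = -1 are swapped
xi_u = [{s: [xi[j][s * u[l]][l] for l in range(n)] for s in (1, -1)}
        for j in range(k)]
right = tilde_I(f, xi_u, n, k)
assert abs(left - right) < 1e-9
```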
Next we prove the following generalization of Lemma~7.1.
\medskip\noindent
{\bf Lemma 15.2 (Generalized version of the Symmetrization
Lemma).}\index{symmetrization lemma}
{\it Let $Z_p$ and $\bar Z_p$, $p=1,2,\dots$, be two sequences of
random variables on a probability space $(\Omega,{\cal A},P)$. Let a
$\sigma$-algebra ${\cal B}\subset {\cal A}$ be given on the probability
space $(\Omega,{\cal A},P)$ together with a ${\cal B}$-measurable set
$B$ and two numbers $\alpha>0$ and $\beta>0$ such that the random
variables $Z_p$, $p=1,2,\dots$, are ${\cal B}$ measurable, and the
inequality
\begin{equation}
P(|\bar Z_p|\le\alpha|{\cal B})(\omega)\ge\beta\quad \textrm{for all }
\,p=1,2,\dots \textrm{ if }\,\omega\in B \label{(15.3)}
\end{equation}
holds.
Then
\begin{equation}
P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)
\le\frac1\beta P\left(\sup\limits_{1\le
p<\infty}|Z_p-\bar Z_p|>u\right)+(1-P(B))
\label{(15.4)}
\end{equation}
for all $u>0$.}
\medskip\noindent
{\it Proof of Lemma 15.2.}\/ Put $\tau=\min\{p\colon\, |Z_p|>\alpha+u\}$
if there exists such an index $p\ge1$, and put $\tau=0$ otherwise. Then
\begin{eqnarray*}
P(\{\tau=p\}\cap B)
&\le&\int_{\{\tau=p\}\cap B}\frac1\beta
P(|\bar Z_p|\le \alpha|{\cal B})\,dP \\
&=&\frac1\beta P(\{\tau=p\}\cap\{|\bar Z_p|\le\alpha\}\cap B)\\
&\le& \frac1\beta P(\{\tau=p\}\cap\{|Z_p-\bar Z_p|>u\})
\quad \textrm{for all } p=1,2,\dots.
\end{eqnarray*}
Hence
\begin{eqnarray*}
&&P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)-(1-P(B))\le
P\left(\left\{\sup_{1\le p<\infty}|Z_p|>
\alpha+u\right\}\cap B\right) \\
&&\qquad=\sum_{p=1}^\infty P(\{\tau=p\}\cap B)
\le \frac1\beta \sum_{p=1}^\infty P(\{\tau=p\}\cap\{|Z_p-\bar
Z_p|>u\}) \\
&&\qquad \le\frac1\beta
P\left(\sup_{1\le p<\infty}|Z_p-\bar Z_p|>u\right).
\end{eqnarray*}
Lemma~15.2 is proved.
\medskip
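Inequality~(\ref{(15.4)}) can be illustrated by a quick Monte Carlo computation in the simplest possible setting (a single index $p$, $Z$ and $\bar Z$ independent standard normal random variables, $B$ the whole space and ${\cal B}=\sigma(Z)$, so that $P(|\bar Z|\le\alpha|{\cal B})=P(|\bar Z|\le\alpha)=\beta$ and condition~(\ref{(15.3)}) holds). The sketch below (Python; the parameter choices are ours) is an illustration only, not a proof:

```python
import random

# Monte Carlo check of (15.4) with one index p, B the whole space,
# Z and Z-bar independent N(0,1), so 1 - P(B) = 0 and
# beta = P(|Z-bar| <= alpha) is about 0.68 for alpha = 1.
rng = random.Random(1)
alpha, u, N = 1.0, 1.0, 20000
z = [rng.gauss(0, 1) for _ in range(N)]
zb = [rng.gauss(0, 1) for _ in range(N)]
beta = sum(1 for x in zb if abs(x) <= alpha) / N
lhs = sum(1 for x in z if abs(x) > alpha + u) / N        # P(|Z| > alpha + u)
rhs = sum(1 for a, b in zip(z, zb) if abs(a - b) > u) / (N * beta)
assert lhs <= rhs                                        # inequality (15.4)
```

With these parameters the left-hand side is about $0.05$ while the right-hand side is about $0.7$, so the inequality holds with a wide margin.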
To find a symmetrization argument useful in the proof of
Proposition~$14.2'$ we want to bound the probability
$P\left(\sup\limits_{f\in{\cal F}}|\bar I_{n,k}(f)|>u\right)$ by
$$
C\cdot P\left(\sup\limits_{f\in{\cal F}}|\tilde I_{n,k}(f)|>c u\right)
+\textrm{ a negligible error term}
$$
with some appropriate numbers $C<\infty$ and $c>0$.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of decoupled
$U$-statistics.}\index{good tail behaviour for a class of
decoupled $U$-statistics}
{\it Choose some real number $T>0$. We say that the
set of decoupled $U$-statistics determined by the class of
functions ${\cal F}$ has a good tail behaviour at level~$T$ (with
parameters $n$ and $\sigma^2$ which are fixed in the sequel) if
\begin{equation}
P\left(\sup_{f\in{\cal F}}|n^{-k/2}\bar I_{n,k}(f)|\ge A
n^{k/2}\sigma^{k+1}\right)
\le \exp\left\{-A^{1/2k}n\sigma^2 \right\}
\quad \textrm{for all } A>T. \label{(15.5)}
\end{equation}
}
\medskip\noindent
{\bf Definition of good tail behaviour for a class of integrals of
decoupled $U$-statistics.}\index{good tail behaviour for a class
of integrals of decoupled $U$-statistics.}
{\it Let us have a product space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some product measure
$\mu^k\times\rho$, where $(X^k,{\cal X}^k,\mu^k)$ is the $k$-fold
product of some probability space $(X,{\cal X},\mu)$, and
$(Y,{\cal Y},\rho)$ is some other probability space. Fix some positive
integer~$n\ge k$ and a positive number $0<\sigma\le1$, and consider
some countable class ${\cal F}$ of functions $f(x_1,\dots,x_k,y)$ on
the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$. Take $k$
independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. For all $f\in{\cal F}$ and $y\in Y$ let us define
the decoupled $U$-statistics $\bar I_{n,k}(f,y)=\bar I_{n,k}(f_y)$
by means of these random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function
$f_y(x_1,\dots,x_k)=f(x_1,\dots,x_k,y)$ and formula~(\ref{(14.11)}).
Define
with the help of these $U$-statistics $\bar I_{n,k}(f,y)$ the random
integrals
\begin{equation}
H_{n,k}(f)=\int \bar I_{n,k}(f,y)^2\rho(\,dy), \quad f\in{\cal F}.
\label{(15.6)}
\end{equation}
Choose some real number $T>0$. We say that the set of random
integrals $H_{n,k}(f)$, $f\in{\cal F}$, has a good tail behaviour at
level $T$ (with parameters $n$ and $\sigma^2$ which we fix in the
sequel) if
\begin{eqnarray}
P\left(\sup_{f\in{\cal F}} n^{-k}H_{n,k}(f)
\ge A^2 n^k\sigma^{2k+2}\right)
&&\le \exp\left\{-A^{1/(2k+1)}n\sigma^2 \right\} \nonumber \\
&&\qquad \textrm{for all } A> T.
\label{(15.7)}
\end{eqnarray}
}
\medskip
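The quantities of the above definition can be computed explicitly in a toy case. The sketch below (Python; illustrative only, and since formula~(\ref{(14.11)}) is not reproduced in this section, we use the form of the inner sum of formula~(\ref{(16.2)}) with $V=\{1,\dots,k\}$ for the decoupled $U$-statistic, all function names being ours) evaluates $\bar I_{n,k}(f,y)$ and the random integral $H_{n,k}(f)$ of~(\ref{(15.6)}) with a discrete mixing measure $\rho$, and checks a hand-computable case:

```python
import math
from itertools import permutations

def I_bar(f, xi, n, k, y):
    """Decoupled U-statistic I-bar_{n,k}(f, y); the inner-sum form of (16.2)
    with V = {1,...,k} is used, since formula (14.11) is not shown here."""
    return sum(f(*[xi[j][ls[j]] for j in range(k)], y)
               for ls in permutations(range(n), k)) / math.factorial(k)

def H(f, xi, n, k, ys, rho):
    """The random integral (15.6) for a discrete measure rho on points ys."""
    return sum(w * I_bar(f, xi, n, k, y) ** 2 for y, w in zip(ys, rho))

# hand-checkable case: k = 1, f(x, y) = x * y, rho uniform on {1, 2};
# here I_bar_{3,1}(f, y) = (xi_1 + xi_2 + xi_3) * y
xi = [[0.5, -0.2, 1.0]]                      # one copy of the sample, n = 3
s = sum(xi[0])
val = H(lambda x, y: x * y, xi, 3, 1, [1.0, 2.0], [0.5, 0.5])
assert abs(val - (0.5 * s ** 2 + 0.5 * (2 * s) ** 2)) < 1e-12
```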
Propositions~15.3 and~15.4 will be formulated with the help of the
above notions.
\medskip\noindent
{\bf Proposition 15.3.} {\it Let us fix a positive
integer~$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a
probability measure $\mu$ on a measurable space $(X,{\cal X})$
together with two real numbers $L\ge1$ and $D\ge1$ such that
$n\sigma^2\ge L\log n+\log D$. Let us consider those countable
$L_2$-dense classes ${\cal F}$ of canonical kernel functions
$f=f(x_1,\dots,x_k)$ (with respect to the measure~$\mu$) on the
$k$-fold product space $(X^k,{\cal X}^k)$ with exponent~$L$
and parameter~$D$ for which all functions $f\in{\cal F}$ satisfy the
inequalities $\sup\limits_{x_j\in X, 1\le j\le k}
|f(x_1,\dots,x_k)|\le 2^{-(k+1)}$ and $\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$.
There is some real number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\cal F}$ which satisfy the above conditions
the sets of decoupled $U$-statistics $\bar I_{n,k}(f)$, $
f\in{\cal F}$, have a good tail behaviour at level~$T^{4/3}$ for
some $T\ge A_0$, then they also have a good tail behaviour at
level~$T$.}
\medskip\noindent
{\bf Proposition 15.4.} {\it Fix some positive integer
$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a product
space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some product
measure $\mu^k\times\rho$, where $(X^k,{\cal X}^k,\mu^k)$ is the
$k$-fold product of some probability space $(X,{\cal X},\mu)$, and
$(Y,{\cal Y},\rho)$ is some other probability space together with
two real numbers $L\ge1$ and $D\ge1$ such that
$n\sigma^2>L\log n+\log D$ hold.
Let us consider those countable $L_2$-dense classes ${\cal F}$
consisting of canonical functions $f(x_1,\dots,x_k,y)$ on the
product space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ with
exponent $L\ge1$ and parameter $D\ge1$ whose elements
$f\in{\cal F}$ satisfy the inequalities
\begin{equation}
\sup\limits_{x_j\in X, 1\le j\le k, y\in Y}|f(x_1,\dots,x_k,y)|\le
2^{-(k+1)} \label{(15.8)}
\end{equation}
and
\begin{equation}
\int f^2(x_1,\dots,x_k,y)\mu(\,dx_1)\dots\mu(\,dx_k)\rho(\,dy)
\le\sigma^2 \quad \textrm{for all } f\in {\cal F}.
\label{(15.9)}
\end{equation}
There exists some number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\cal F}$ which satisfy the above conditions
the random integrals $H_{n,k}(f)$, $f\in{\cal F}$, defined
in~(\ref{(15.6)}) have a good tail behaviour at level $T^{(2k+1)/2k}$
with some $T\ge A_0$, then they also have a good tail behaviour
at level~$T$.}
\medskip\noindent
{\it Remark:}\/ To complete the formulation of Proposition~15.4 we
still have to clarify when we call a function $f(x_1,\dots,x_k,y)$
defined on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$
canonical.\index{canonical function}
Here we apply a definition which slightly differs from that given
in formula~(\ref{(8.8)}).
We say that a function
$f(x_1,\dots,x_k,y)$ on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$
is canonical if
\begin{eqnarray*}
&&\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k,y)\mu(\,du)=0\\
&&\qquad \qquad \textrm{for all } 1\le j\le k,\; x_s\in X,
\;s\neq j \textrm{ and }y\in Y.
\end{eqnarray*}
In this definition we do not require the analogous identity if
we integrate with respect to the variable $Y$ with fixed
arguments $x_j\in X$, $1\le j\le k$.
\medskip
Let me also remark that the estimate (\ref{(15.7)}) we have
formulated in the definition of the property `good tail behaviour
for a class of integrals of decoupled $U$-statistics' is fairly natural. We
have applied the natural normalization, and with such a
normalization it is natural to expect that the tail behaviour of
the distribution of $\sup\limits_{f\in{\cal F}}n^{-k}H_{n,k}(f)$
is similar to that of $\textrm{const}\,\left(\sigma\eta^k\right)^2$,
where $\eta$ is a standard normal random variable.
Formula~(\ref{(15.7)}) expresses such a behaviour, only the power
of the number~$A$ in the exponent at the right-hand side was
chosen in a non-optimal way. Formula~(\ref{(15.5)}) in the
formulation of the property `good tail behaviour for a class of
decoupled $U$-statistics' has a similar interpretation. It says
that $\sup\limits_{f\in{\cal F}}|n^{-k/2}\bar I_{n,k}(f)|$ behaves
similarly to $\textrm{const}\,\sigma|\eta^k|$ with a standard
normal random variable $\eta$.
We wanted to prove the property of good tail behaviour for a class
of integrals of decoupled $U$-statistics under appropriate, not too
restrictive conditions. Let me remark that in Proposition~15.4 we
have imposed beside formula (\ref{(15.8)}) a fairly weak
condition (\ref{(15.9)})
about the $L_2$-norm of the function~$f$. Most difficulties appear
in the proof, because we did not want to impose more restrictive
conditions.
It is not difficult to derive Proposition~$14.2'$ from
Proposition~15.3. Indeed, let us observe that the set of decoupled
$U$-statistics determined by a class of functions ${\cal F}$
satisfying the conditions of Proposition~15.3 has a good
tail behaviour at level $T_0=\sigma^{-(k+1)}$, since under the
conditions of this Proposition the probability at the left-hand
side of~(\ref{(15.5)}) equals zero for $A>\sigma^{-(k+1)}$. Then we get
from Proposition~15.3 by induction with respect to the number $j$,
that this set of decoupled $U$-statistics has a good tail behaviour
also for all $T=T_j=T_0^{(3/4)^j}=\sigma^{-(k+1)(3/4)^j}$,
$j=0,1,2,\dots$, with such indices~$j$ for which
$T_j=\sigma^{-(k+1)(3/4)^j}\ge A_0$. This implies that if a class of
functions ${\cal F}$ satisfies the conditions of Proposition~15.3,
then the set of decoupled $U$-statistics determined by this class
of functions has a good tail behaviour at level $T=A_0^{4/3}$,
i.e. at a level which depends only on the order~$k$ of the
decoupled $U$-statistics. This result implies Proposition~$14.2'$,
only it has to be applied for the class of functions
${\cal F}'=\{2^{-(k+1)}f,\; f\in{\cal F}\}$ instead of the original
class of functions ${\cal F}$ which appears in Proposition~$14.2'$
with the same parameters~$\sigma$, $L$ and~$D$.
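The levels $T_j=T_0^{(3/4)^j}$ appearing in this inductive argument can be made concrete in a few lines. The sketch below (Python; the numerical values of $k$ and $\sigma$ are our own choices, made only for illustration) computes the levels, checks the recursion $T_{j+1}=T_j^{3/4}$, and checks that they decrease monotonically towards~$1$, so that they eventually drop below any fixed threshold $A_0>1$:

```python
# The levels T_j = T_0^{(3/4)^j} = sigma^{-(k+1)(3/4)^j} of the induction:
# each application of Proposition 15.3 lowers the level from T^{4/3} to T,
# i.e. T_{j+1} = T_j^{3/4}, and T_j decreases monotonically towards 1.
k, sigma = 2, 0.1                      # illustrative parameter choices
T = [sigma ** (-(k + 1) * (3.0 / 4.0) ** j) for j in range(40)]
assert all(a > b > 1.0 for a, b in zip(T, T[1:]))            # decreasing to 1
assert all(abs(a ** 0.75 - b) < 1e-6 * b for a, b in zip(T, T[1:]))
```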
Similarly to the above argument, an inductive procedure yields a
corollary of Proposition~15.4 formulated below. It is actually
this corollary that we shall need.
\medskip\noindent
{\bf Corollary of Proposition 15.4.} {\it If the class of functions
${\cal F}$ satisfies the conditions of Proposition~15.4, then there
exists a constant $\bar A_0=\bar A_0(k)>0$ depending only on $k$
such that the class of integrals $H_{n,k}(f)$, $f\in {\cal F}$,
defined in formula~(\ref{(15.6)}) has a good tail behaviour at level
$\bar A_0$.}
\medskip
The main difficulty in the proof of Proposition 15.3 arises in the
application of the symmetrization procedure corresponding to
Lemma~7.2 in the one-variate case. This difficulty can be overcome
by means of Proposition~15.4, more precisely by means of its
corollary. It helps us to estimate the conditional variances of
the decoupled $U$-statistics we have to handle in the proof of
Proposition~15.3. The proofs of Propositions~15.3 and~15.4 apply
similar arguments, and the two propositions will be proved
simultaneously. The
following inductive procedure will be applied in their proof.
First Proposition~15.3 and then Proposition~15.4 is proved for
$k=1$. If Propositions~15.3 and~15.4 are already proved
for all $k'<k$, then first Proposition~15.3 and then
Proposition~15.4 is proved for the parameter~$k$. The
symmetrization argument needed in the proof of Proposition~15.3
is formulated in the following lemma.
\medskip\noindent
{\bf Lemma 16.1A (Symmetrization argument in the proof of
Proposition~15.3).} {\it Let a class of functions ${\cal F}$
satisfy the conditions of Proposition~15.3. Then there exist
some constants $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} n^{-k/2}\left|\bar
I_{n,k}(f)\right|>An^{k/2}\sigma^{k+1}\right) \nonumber \\
&&\qquad <2^{k+1}P\left(\sup_{f\in{\cal F}}
\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right) \nonumber \\
&&\qquad\qquad +2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\label{(16.1)}
\end{eqnarray}
holds for all $A\ge A_0$.}
\medskip
It may be worth remarking that the second term at the right-hand side
of formula~(\ref{(16.1)}) yields a small contribution to
the upper bound in
this relation because of the condition $n\sigma^2\ge L\log n+\log D$.
To formulate Lemma~16.1B first some new quantities have to be
introduced. Some of them will be used somewhat later. The quantities
$\bar I_{n,k}^V(f,y)$ introduced in the subsequent
formula~(\ref{(16.2)}) depend on the sets $V\subset\{1,\dots,k\}$,
and they are the natural modifications of the inner sum terms in
formula (\ref{(15.1)}). Such expressions are needed in the
formulation of the symmetrization result applied in the proof of
Proposition~15.4. Their randomized versions
$\bar I_{n,k}^{(V,\varepsilon)}(f,y)$, introduced in
formula~(\ref{(16.5)}), correspond to the inner sum terms in
formula~(\ref{(15.2)}). The integrals of these expressions will
be also introduced in formulas~(\ref{(16.3)}) and~(\ref{(16.6)}).
Let us consider a class ${\cal F}$ of functions
$f(x_1,\dots,x_k,y)\in {\cal F}$ on a space $(X^k\times Y, {\cal X}^k
\times {\cal Y},\mu^k\times\rho)$ which satisfies the conditions of
Proposition~15.4. Let us take $2k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$\bar\xi_1^{(j)},\dots,\bar\xi_n^{(j)}$, $1\le j\le k$,
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables
$(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous random sequences.
Let us introduce the notation $\xi_l^{(j,1)}=\xi_l^{(j)}$
and $\xi_l^{(j,-1)}=\bar\xi_l^{(j)}$, $1\le l\le n$, $1\le j\le k$.
For all subsets $V\subset\{1,\dots,k\}$ of the set
$\{1,\dots,k\}$ let $|V|$ denote the cardinality of this set,
and define for all functions $f(x_1,\dots,x_k,y)\in {\cal F}$ and
$V\subset\{1,\dots,k\}$ the decoupled $U$-statistics
\begin{equation}
\bar I_{n,k}^V(f,y)=\frac1{k!}\sum_{\substack {(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,\xi_{l_k}
^{(k,\delta_k(V))},y\right),
\label{(16.2)}
\end{equation}
where $\delta_j(V)=\pm1$, $1\le j\le k$, $\delta_j(V)=1$ if $j\in V$,
and $\delta_j(V)=-1$ if $j\notin V$, together with the random
variables
\begin{equation}
H_{n,k}^V(f)=\int \bar I_{n,k}^V(f,y)^2\rho(\,dy), \quad f\in{\cal F}.
\label{(16.3)}
\end{equation}
We shall consider $\bar I_{n,k}^V(f,y)$ defined
in~(\ref{(16.2)}) as a random
variable with values in the space $L_2(Y,{\cal Y},\rho)$.
Put
\begin{equation}
\bar I_{n,k}(f,y)=\bar I_{n,k}^{\{1,\dots,k\}}(f,y),\quad
H_{n,k}(f)=H_{n,k}^{\{1,\dots,k\}}(f), \label{(16.4)}
\end{equation}
i.e. $\bar I_{n,k}(f,y)$ and $H_{n,k}(f)$ are the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ with
$V=\{1,\dots,k\}$, which means that these expressions are defined
with the help of the random variables $\xi^{(j)}_l=\xi_l^{(j,1)}$,
$1\le j\le k$, $1\le l\le n$.
Let us also define the `randomized version' of the random variables
$\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ as
\begin{eqnarray}
\bar I_{n,k}^{(V,\varepsilon)}(f,y)&&=\frac1{k!} \!\!
\sum_{\substack{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\!\!\!\!\!\!\!\!\!
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))},y\right), \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{if } f\in{\cal F}, \label{(16.5)}
\end{eqnarray}
and
\begin{equation}
H_{n,k}^{(V,\varepsilon)}(f)=\int
\bar I_{n,k}^{(V,\varepsilon)}(f,y)^2\rho(\,dy)
,\quad f\in{\cal F}, \label{(16.6)}
\end{equation}
where $\delta_j(V)=1$ if $j\in V$, and $\delta_j(V)=-1$ if
$j\in\{1,\dots,k\}\setminus V$.
Similarly to formula~(\ref{(16.2)}), we shall consider
$\bar I_{n,k}^{(V,\varepsilon)}(f,y)$ defined in~(\ref{(16.5)}) as a random
variable with values in the space $L_2(Y,{\cal Y},\rho)$.
Let us also introduce the random variables
\begin{equation}
\bar W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}
\bar I_{n,k}^{(V,\varepsilon)}(f,y)\right]^2\rho(\,dy),
\quad f\in{\cal F}. \label{(16.7)}
\end{equation}
With the help of the above notations Lemma~16.1B can be formulated
in the following way.
\medskip\noindent
{\bf Lemma 16.1B (Randomization argument in the proof of
Proposition~15.4).} {\it Let ${\cal F}$ be a set of functions on
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ which satisfies the
conditions of Proposition~15.4 with some probability measure
$\mu^k\times\rho$. Let us have $2k$ independent copies
$\xi_{1}^{(j,\pm1)},\dots,\xi_{n}^{(j,\pm1)}$, $1\le j\le k$, of a
sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, which is independent also of the previously considered
sequences.
Then there exist some constants $A_0=A_0(k)>0$ and
$\gamma=\gamma_k$ such that if the integrals $H_{n,k}(f)$,
$f\in{\cal F}$, determined by this class of functions ${\cal F}$ have
a good tail behaviour at level $T^{(2k+1)/2k}$ for some $T\ge A_0$
(this property was defined in Section~15 in the definition of good
tail behaviour for a class of integrals of decoupled $U$-statistics
before the formulation of Propositions~15.3 and~15.4), then the
inequality
\begin{eqnarray}
P\left(\sup_{f\in{\cal F}} \left|H_{n,k}(f)\right|
>A^2n^{2k}\sigma^{2(k+1)}\right)
&&<2P\left(\sup_{f\in{\cal F}} \left|\bar W(f)\right|
>\frac{A^2}2 n^{2k}\sigma^{2(k+1)}\right)\nonumber \\
&&\qquad+2^{2k+1}n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.8)}
\end{eqnarray}
holds for all $A\ge T$ with the random variables $H_{n,k}(f)$
introduced in the second identity of relation (\ref{(16.4)}) and
with $\bar W(f)$ defined in formula~(\ref{(16.7)}), provided that
$\gamma_k>0$ is a sufficiently small positive number.}
\medskip
A corollary of Lemma~16.1B will be formulated which can be
applied better than the original lemma. Lemma~16.1B is a little bit
inconvenient, because the expression at the right-hand side of
formula~(\ref{(16.8)}) contains a probability depending on
$\sup\limits_{f\in{\cal F}}|\bar W(f)|$, and $\bar W(f)$ is too
complicated an expression. Some new formulas~(\ref{(16.9)})
and~(\ref{(16.10)}) will
be introduced which enable us to rewrite $\bar W(f)$ in a slightly
simpler form. These formulas yield a corollary of Lemma~16.1B
which is more appropriate for our purposes. To work out the details
first some diagrams will be introduced.
Let ${\cal G}={\cal G}(k)$ denote the set of all diagrams
consisting of two rows, such that both rows of these diagrams are
the set $\{1,\dots,k\}$, and these diagrams contain some edges
$\{(j_1,j_1'),\dots,(j_s,j_s')\}$, $0\le s\le k$, connecting a
point (vertex) of the first row with a point (vertex) of the
second row. The vertices $j_1,\dots,j_s$ which are end points of
some edge in the first row are all different, and the same relation
holds also for the vertices $j_1',\dots,j_s'$ in the second row.
Given some diagram $G\in{\cal G}$
let $e(G)=\{(j_1,j_1'),\dots,(j_s,j_s')\}$ denote the set of its
edges, and let $v_1(G)=\{j_1,\dots,j_s\}$ be the set of those
vertices in the first row and $v_2(G)=\{j_1',\dots,j_s'\}$ the
set of those vertices in the second row of the diagram~$G$ from
which an edge of~$G$ starts.
Given some diagram $G\in {\cal G}$ and two sets
$V_1,V_2\subset\{1,\dots,k\}$, we define the following random
variables $H_{n,k}(f|G,V_1,V_2)$ with the help of the random
variables $\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$,
$\xi_{1}^{(j,-1)},\dots,\xi_{n}^{(j,-1)}$, $1\le j\le k$, and
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ taking part
in the definition of the random
variables $\bar W(f)$:
\begin{eqnarray}
&& H_{n,k}(f|G,V_1,V_2) \nonumber \\
&&\qquad =\sum_{\substack
{(l_1,\dots,l_k,\,l'_1,\dots,l'_k)\colon\\
1\le l_j\le n,\, l_j\neq l_{j'}
\textrm{ if }j\neq j',\,1\le j,j'\le k,\\
1\le l'_j\le n,\, l'_j\neq l'_{j'}\textrm { if }
j\neq j',\,1\le j,j'\le
k,\\ l_j=l'_{j'} \textrm { if } (j,j')\in e(G),\; l_j\neq l'_{j'}
\textrm { if } (j,j')\notin e(G)}}
\!\!\!\!\!\!\!\!\!\!\!\! \prod_{j\in\{1,\dots,k\}
\setminus v_1(G)} \!\!\!\!
\varepsilon_{l_j} \prod_{j\in\{1,\dots,k\}
\setminus v_2(G)} \!\!\!\! \varepsilon_{l'_j} \nonumber \\
&&\qquad\qquad \frac1{k!^2} \int
f(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y) \nonumber \\
&& \qquad\qquad\qquad f(\xi_{l'_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l'_k}^{(k,\delta_k(V_2))},y)
\rho(\,dy), \label{(16.9)}
\end{eqnarray}
where $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$. (Let us observe that if the
graph $G$ contains $s$ edges, then the product of the
$\varepsilon$-s in (\ref{(16.9)})
contains $2(k-s)$ terms, and the number of terms in the
sum~(\ref{(16.9)}) is
less than $n^{2k-s}$.) As the Corollary of Lemma~16.1B will indicate,
in the proof of Proposition~15.4 we shall need a good estimate on the
tail distribution of the random variables $H_{n,k}(f|G,V_1,V_2)$
for all $f\in{\cal F}$ and $G\in{\cal G}$, $V_1,V_2\subset\{1,\dots,k\}$.
Such an estimate can be obtained by means of Theorem 13.3, the
multivariate version of Hoeffding's inequality. But the estimate we
get in such a way will be rewritten in a form more appropriate for our
inductive procedure. This will be done in the next section.
The identity
\begin{equation}
\bar W(f)=\sum_{G\in {\cal G},\, V_1,V_2\subset\{1,\dots,k\}}
(-1)^{|V_1|+|V_2|} H_{n,k}(f|G,V_1,V_2) \label{(16.10)}
\end{equation}
will be proved.
To prove this identity let us write first
$$
\bar W(f)=\sum_{V_1,V_2\subset \{1,\dots,k\}} (-1)^{|V_1|+|V_2|}
\int\bar I_{n,k}^{(V_1,\varepsilon)}
(f,y)\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy).
$$
Let us express the products
$\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)\bar I_{n,k}^{(V_2,
\varepsilon)}(f,y)$ by means of formula (\ref{(16.5)}).
Let us rewrite
this product as a sum of products of the form
$$
\frac1{k!^2}\prod\limits_{j=1}^k\varepsilon_{l_j}f(\cdots)
\prod_{j=1}^k\varepsilon_{l_j'}f(\cdots)
$$
and let us define the following partition of the terms in this
sum. The elements of this partition
are indexed by the diagrams $G\in {\cal G}$, and if we
take a diagram $G\in{\cal G}$ with the set of edges $e(G)=
\{(j_1,j_1'),\dots,(j_s,j_s')\}$, then the term of this sum
determined by the indices $l_1,\dots,l_k,l'_1,\dots,l'_k$
belongs to the element of the partition indexed by this diagram
$G$ if and only if $l_{j_u}=l'_{j_u'}$ for all $1\le u\le s$, and
no other indices among $l_1,\dots,l_k,l'_1,\dots,l'_k$
coincide. Since $\varepsilon_{l_{j_u}}\varepsilon_{l'_{j_u'}}=1$
for all $1\le u\le s$, the set of indices of the remaining
random variables $\varepsilon_{l_j}$ is
$\{l_j\colon\,j\in\{1,\dots,k\}\setminus v_1(G)\}$,
and the set of indices of the remaining random variables
$\varepsilon_{l'_{j'}}$
is $\{l'_{j'}\colon\,j'\in\{1,\dots,k\}\setminus v_2(G)\}$, we get
by integrating the product
$\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)$
with respect to the measure $\rho$ that
$$
\int\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy)
=\sum_{G\in {\cal G}} H_{n,k}(f|G,V_1,V_2)
$$
for all $V_1,V_2\subset\{1,\dots,k\}$. The last two identities imply
formula~(\ref{(16.10)}).
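The identity~(\ref{(16.10)}) can also be verified numerically in a small case. The sketch below (Python; the kernel, the sample sizes and all function names are our own illustrative choices) computes $\bar W(f)$ once directly from formula~(\ref{(16.7)}) and once through the right-hand side of~(\ref{(16.10)}), with the diagrams $G$ realized as partial matchings between the two rows $\{1,\dots,k\}$, and checks that the two numbers agree:

```python
import math
import random
from itertools import combinations, permutations

rng = random.Random(2)
n, k = 4, 2
ys, rho = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]      # discrete mixing measure rho
f = lambda x1, x2, y: x1 * x2 + 0.3 * x1 * y    # arbitrary test kernel
# xi[j][s][l] stands for xi_l^{(j+1, s)}, s = +-1; eps are the random signs
xi = [{s: [rng.gauss(0, 1) for _ in range(n)] for s in (1, -1)}
      for _ in range(k)]
eps = [rng.choice([1, -1]) for _ in range(n)]
subsets = [set(c) for r in range(k + 1) for c in combinations(range(k), r)]
delta = lambda V: [1 if j in V else -1 for j in range(k)]

def I_V_eps(V, y):
    """Randomized decoupled U-statistic of formula (16.5)."""
    d = delta(V)
    return sum(math.prod(eps[l] for l in ls) *
               f(*[xi[j][d[j]][ls[j]] for j in range(k)], y)
               for ls in permutations(range(n), k)) / math.factorial(k)

# left-hand side of (16.10): W-bar(f) computed directly from (16.7)
W_bar = sum(w * sum((-1) ** (k - len(V)) * I_V_eps(V, y)
                    for V in subsets) ** 2
            for y, w in zip(ys, rho))

def H_G(edges, V1, V2):
    """The random variable H_{n,k}(f | G, V1, V2) of formula (16.9)."""
    e, d1, d2 = set(edges), delta(V1), delta(V2)
    v1, v2 = {j for j, _ in edges}, {jp for _, jp in edges}
    total = 0.0
    for L in permutations(range(n), k):
        for Lp in permutations(range(n), k):
            if any((L[j] == Lp[jp]) != ((j, jp) in e)
                   for j in range(k) for jp in range(k)):
                continue   # equality pattern of (L, L') must match e(G)
            w = (math.prod(eps[L[j]] for j in range(k) if j not in v1) *
                 math.prod(eps[Lp[j]] for j in range(k) if j not in v2))
            total += w * sum(r * f(*[xi[j][d1[j]][L[j]] for j in range(k)], y)
                               * f(*[xi[j][d2[j]][Lp[j]] for j in range(k)], y)
                             for y, r in zip(ys, rho)) / math.factorial(k) ** 2
    return total

# right-hand side of (16.10): diagrams G are partial matchings of two rows
diagrams = [list(zip(sorted(A), B))
            for A in subsets for B in permutations(range(k), len(A))]
rhs = sum((-1) ** (len(V1) + len(V2)) * H_G(G, V1, V2)
          for G in diagrams for V1 in subsets for V2 in subsets)
assert abs(W_bar - rhs) < 1e-8
```

For $k=2$ the list of diagrams has the expected $7$ elements (one with no edge, four with one edge, two perfect matchings).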
Since the number of terms in the sum of formula (\ref{(16.10)})
is less than
$2^{4k}k!$, this relation implies that Lemma~16.1B has the following
corollary:
\medskip\noindent
{\bf Corollary of Lemma 16.1B (A simplified version of the
randomization argument of Lemma~16.1B).} {\it Let a set of
functions ${\cal F}$ satisfy the conditions of Proposition~15.4. Then
there exist some constants $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that if the integrals $H_{n,k}(f)$, $f\in{\cal F}$, determined
by this class of functions ${\cal F}$ have a good tail behaviour at
level $T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} |H_{n,k}(f)|>A^2n^{2k}
\sigma^{2(k+1)}\right) \nonumber \\
&&\qquad\le 2\sum_{G\in {\cal G},\, V_1,V_2\subset\{1,\dots,k\}}
P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right) \nonumber \\
&&\qquad\qquad\qquad\qquad
+2^{2k+1}n^{k-1} e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.11)}
\end{eqnarray}
holds with the random variables $H_{n,k}(f)$ and
$H_{n,k}(f|G,V_1,V_2)$
defined in formulas (\ref{(16.4)}) and (\ref{(16.9)}) for all $A\ge T$.}
\medskip\noindent
In the proof of Lemmas 16.1A and 16.1B the results of the
following Lemmas~16.2A and~16.2B will be applied.
\medskip\noindent
{\bf Lemma 16.2A.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)} \quad \textrm{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous sequences.
Let ${\cal F}$ be a class of functions which satisfies the
conditions of Proposition 15.3. Introduce with the help of the above
random variables for all sets $V\subset\{1,\dots,k\}$ and functions
$f\in {\cal F}$ the decoupled $U$-statistic
\begin{equation}
\bar I_{n,k}^V(f)=\frac1{k!}\sum_{\substack {(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))}\right) \label{(16.12)}
\end{equation}
and its `randomized version'
\begin{eqnarray}
\bar I_{n,k}^{(V,\varepsilon)}(f)&&=\frac1{k!}
\sum_{\substack{(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))}\right), \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
f\in{\cal F}, \label{($16.12'$)}
\end{eqnarray}
where $\delta_j(V)=\pm1$, and we have $\delta_j(V)=1$ if $j\in V$,
and $\delta_j(V)=-1$ if $j\in\{1,\dots,k\}\setminus V$.
Then the sets of random variables
\begin{equation}
S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar I_{n,k}^V(f),
\quad f\in{\cal F}, \label{(16.13)}
\end{equation}
and
\begin{equation}
\bar S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar
I_{n,k}^{(V,\varepsilon)}(f), \quad f\in{\cal F},
\label{($16.13'$)}
\end{equation}
have the same joint distribution.}
\medskip\noindent
{\bf Lemma 16.2B.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}\quad \textrm{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent, $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is also independent of the previous sequences.
Let us consider a class ${\cal F}$ of functions
$f(x_1,\dots,x_k,y)$ on a space
$(X^k\times Y, {\cal X}^k\times {\cal Y},\mu^k\times\rho)$ which
satisfies the conditions of
Proposition~15.4. For all functions $f\in {\cal F}$
and $V\subset\{1,\dots,k\}$ consider the decoupled $U$-statistics
$\bar I_{n,k}^V(f,y)$ defined by formula (\ref{(16.2)}) with
the help of the random variables
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, and define with their help
the random variables
\begin{equation}
W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}\bar
I_{n,k}^V(f,y)\right]^2\rho(\,dy), \quad f\in{\cal F}.
\label{(16.14)}
\end{equation}
Then the random vectors $\{W(f)\colon\, f\in {\cal F}\}$ defined
in~(\ref{(16.14)}) and $\{\bar W(f)\colon\, f\in {\cal F}\}$ defined
in~(\ref{(16.7)}) have the same distribution.}
\medskip\noindent
{\it Proof of Lemmas 16.2A and 16.2B.} Lemma~16.2A actually agrees
with the already proved Lemma~15.1, only the notation is
different. The proof of Lemma~16.2B is very similar to the proof of
Lemma~15.1. It can be shown that even the following stronger
statement holds. For any $\pm1$ sequence $u=(u_1,\dots,u_n)$ of
length~$n$ the conditional distribution of the random field
$\bar W(f)$, $f\in{\cal F}$, under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u=(u_1,\dots,u_n)$ agrees with
the distribution of the random field $W(f)$, $f\in{\cal F}$.
To see this relation let us first observe that the conditional
distribution of the field $\bar W(f)$ under this condition agrees
with the distribution of the random field we get by replacing the
random variables $\varepsilon_l$ by $u_l$ for all $1\le l\le n$ in
formulas~(\ref{(16.5)}), (\ref{(16.6)}) and~(\ref{(16.7)}).
Beside this, define the vector
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
by the formula
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$
for those indices $(j,l)$ for which $u_l=-1$, and
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
for those indices $(j,l)$ for which $u_l=1$ (independently of the
value of the parameter~$j$).
Then the joint distribution of the vectors
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
and $(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
agree. Hence the joint distribution of the random vectors
$\bar I_{n,k}^V(f,y)$, $f\in{\cal F}$, $V\subset \{1,\dots,k\}$ defined
in~(\ref{(16.2)}) and of the random vectors $W(f)$,
$f\in{\cal F}$, defined
in~(\ref{(16.14)}) does not change if we replace in their
definition the random
variables $\xi^{(j,1)}_l$ and $\xi^{(j,-1)}_l$ by $\xi(u)^{(j,1)}_l$
and $\xi(u)^{(j,-1)}_l$. But the set of random variables $W(f)$,
$f\in{\cal F}$, obtained in this way agrees with the set of random
variables we introduced to get a set of random variables with the
same distribution as the conditional distribution of $\bar W(f)$,
$f\in {\cal F}$ under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$. (These
random variables are defined as the integral of the square of the
same sum,
only the terms of this sum are listed in a different order in the
two cases.) These facts imply Lemma~16.2B.
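\medskip\noindent
Let us illustrate the argument of these proofs in the simplest case
$k=1$ of Lemma~16.2A. In this case
$$
S(f)=\sum_{l=1}^n \left[f\left(\xi_l^{(1,1)}\right)
-f\left(\xi_l^{(1,-1)}\right)\right] \quad\textrm{and}\quad
\bar S(f)=\sum_{l=1}^n \varepsilon_l\left[f\left(\xi_l^{(1,1)}\right)
-f\left(\xi_l^{(1,-1)}\right)\right],
$$
and the multiplication of the $l$-th term by $\varepsilon_l=-1$ has
the same effect as the exchange of the two members of the pair
$\left(\xi_l^{(1,1)},\xi_l^{(1,-1)}\right)$, which does not change
the joint distribution of these pairs.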
\medskip
In the next step we prove the following Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3A.} {\it Let us consider a class of functions
${\cal F}$ satisfying the conditions of Proposition 15.3 with
parameter~$k$ together with $2k$ independent copies
$\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n$ and
$\xi^{(j,-1)}_1,\dots,\xi^{(j,-1)}_n$, $1\le j\le k$, of a sequence
of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables $\bar I_{n,k}^V(f)$,
$f\in{\cal F}$, $V\subset\{1,\dots,k\}$, defined with the help of
these quantities in formula (\ref{(16.12)}). Let
${\cal B}={\cal B}(\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)};\;1\le j\le k)$
denote the $\sigma$-algebra generated by the random variables
$\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$, $1\le j\le k$, i.e.\ by
the random variables with upper indices of the form $(j,1)$,
$1\le j\le k$. There exists a number $A_0=A_0(k)>0$ such that
for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, the
inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}
\left.E\left(\bar I_{n,k}^V(f)^2\right|{\cal B}\right)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right)<
n^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\label{(16.15)}
\end{equation}
holds with a sufficiently small $\gamma_k>0$ if $A\ge A_0$.}
\medskip\noindent
{\it Proof of Lemma 16.3A.}\/ Let us first consider the case
$V=\emptyset$. In this case the estimate $\left.E\left(\bar
I_{n,k}^\emptyset(f)^2\right|{\cal B}\right)
=E\left(\bar I_{n,k}^\emptyset(f)^2\right)
\le\frac{n^k}{k!}\sigma^2\le 2^kn^{2k}\sigma^{2k+2}$ holds for all
$f\in{\cal F}$. In the above calculation it was exploited that the
functions $f\in{\cal F}$ are canonical, which implies certain
orthogonalities, and beside this the inequality $n\sigma^2\ge\frac12$
holds, because of the relation $n\sigma^2\ge L\log n+\log D$.
The above relations imply that for $V=\emptyset$ the probability at
the left-hand side of (\ref{(16.15)}) equals zero if the
number $A_0$ is chosen sufficiently large. Hence
inequality~(\ref{(16.15)}) holds in this case.
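Let us briefly detail this calculation. Since the function $f$ is
canonical, different terms in the sum defining
$\bar I_{n,k}^\emptyset(f)$ are orthogonal, hence
$$
E\left(\bar I_{n,k}^\emptyset(f)^2\right)
=\frac1{(k!)^2}\sum_{\substack {(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
Ef^2\left(\xi_{l_1}^{(1,-1)},\dots,\xi_{l_k}^{(k,-1)}\right)
\le\frac{n^k}{(k!)^2}\sigma^2\le\frac{n^k}{k!}\sigma^2,
$$
and $\frac{n^k}{k!}\sigma^2\le2^kn^{2k}\sigma^{2k+2}$, since
$2^kn^k\sigma^{2k}=(2n\sigma^2)^k\ge1$ by the inequality
$n\sigma^2\ge\frac12$.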
To avoid some complications in the notation let us first restrict our
attention to sets of the form $V=\{1,\dots,u\}$ with some $1\le u<k$.
For all indices $(l_{u+1},\dots,l_k)$, $1\le l_j\le n$,
$u+1\le j\le k$, such that $l_j\neq l_{j'}$ if $j\neq j'$, introduce
the random sum
\begin{equation}
\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)
=\frac1{k!}\sum_{\substack {(l_1,\dots,l_u)\colon\, 1\le l_j\le n,\;
l_j\notin\{l_{u+1},\dots,l_k\},\; j=1,\dots,u,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1,1)},\dots,\xi_{l_u}^{(u,1)},
\xi_{l_{u+1}}^{(u+1,-1)},\dots,\xi_{l_k}^{(k,-1)}\right),
\label{(16.16)}
\end{equation}
i.e.\ the sum of those terms in~(\ref{(16.12)}) in which the last
$k-u$ summation indices take the values $l_{u+1},\dots,l_k$. Then
$\bar I_{n,k}^V(f)$ is the sum of the expressions
$\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)$, and since the functions
$f\in{\cal F}$ are canonical, the cross terms in the conditional
second moment of this sum vanish, i.e.\
$\left.E\left(\bar I_{n,k}^V(f)^2\right|{\cal B}\right)$ equals the
sum of the at most $n^{k-u}$ terms
$\left.E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|
{\cal B}\right)$. Hence
\begin{eqnarray}
&&\left\{\omega\colon\, \sup_{f\in{\cal F}}
\left.E\left(\bar I_{n,k}^V(f)^2\right|{\cal B}\right)(\omega)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}
\right\} \label{(16.17)} \\
&&\qquad \subset \bigcup_{\substack{(l_{u+1},\dots,l_k)\colon\\
1\le l_j\le n,\; j=u+1,\dots,k.\\
l_j\neq l_{j'} \textrm { if } j\neq j'}} \nonumber \\
&&\qquad \qquad \left\{\omega\colon\, \sup_{f\in{\cal F}}
\left. E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|
{\cal B}\right)(\omega)
>\frac{A^2n^{2k}\sigma^{2k+2}}{2^{(3k+3)}n^{k-u}} \right\}.
\nonumber
\end{eqnarray}
The probability of the events in the union at the right-hand side
of~(\ref{(16.17)}) can be estimated with the help of the
Corollary of Proposition~15.4 with parameter $u$ instead of~$k$.
We claim that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}
\left.E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|
{\cal B}\right)
>\frac {A^2n^{k+u}\sigma^{2k+2}} {2^{(3k+3)}}\right)
\nonumber \\
&&\qquad \le e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2} \label{(16.18)}
\end{eqnarray}
with an appropriate $\gamma_k>0$ for all sequences
$(l_{u+1},\dots,l_k)$, $1\le l_j\le n$, $u+1\le j\le k$, and such
that $l_j\neq l_{j'}$ if $j\neq j'$.
Let us show that if a class of functions $f\in {\cal F}$
satisfies the conditions of Proposition~15.3, then it also
satisfies relation~(\ref{(16.18)}).
For this goal introduce the space $(Y,{\cal Y},\rho)=(X^{k-u},
{\cal X}^{k-u},\mu^{k-u})$, the $k-u$-fold power of the measure
space $(X, {\cal X},\mu)$, and for the sake of simpler notations
write $y=(x_{u+1},\dots,x_k)$ for a point $y\in Y$. Let us also
introduce the class of functions $\bar{\cal F}$ on the
space $(X^u\times Y,{\cal X}^u\times{\cal Y},\mu^u\times\rho)$
consisting of functions $\bar f$ of the form
$\bar f(x_1,\dots,x_u,y)=f(x_1,\dots,x_k)$ with
$y=(x_{u+1},\dots,x_k)$ and some function
$f(x_1,\dots,x_k)\in{\cal F}$.
If the class of functions ${\cal F}$ satisfies the conditions of
Proposition~15.3 (with parameter~$k$), then the class of functions
$\bar{\cal F}$ satisfies the conditions of Proposition~15.4 with
parameter $u$ instead of~$k$. Beside this, the identity
\begin{equation}
E\left(\left.\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2\right|
{\cal B}\right)
=\left(\frac{u!}{k!}\right)^2\int
\bar I^{l(u)}_{n+u-k,u}(\bar f,y)^2\rho(\,dy)
=\left(\frac{u!}{k!}\right)^2 H^{l(u)}_{n+u-k,u}(\bar f)
\label{(16.19)}
\end{equation}
holds, where $\bar I^{l(u)}_{n+u-k,u}(\bar f,y)$ denotes the
decoupled $U$-statistic of order~$u$ with kernel function
$\bar f(\cdot,y)$, defined with the help of the random variables
$\xi_l^{(j,1)}$, $1\le j\le u$,
$l\in\{1,\dots,n\}\setminus\{l_{u+1},\dots,l_k\}$, and
$H^{l(u)}_{n+u-k,u}(\bar f)
=\int \bar I^{l(u)}_{n+u-k,u}(\bar f,y)^2\rho(\,dy)$. This makes it
possible to apply the Corollary of Proposition~15.4 for the random
variables $(n+u-k)^{-u}H^{l(u)}_{n+u-k,u}(\bar f)$ with a
sufficiently small $\gamma_k>0$. We have
\begin{eqnarray}
&& P\biggl(\sup_{\bar f\in\bar{\cal F}}
E(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)^2|{\cal B}) \nonumber \\
&& \qquad\qquad\qquad\qquad
\ge \left(\frac{k!}{u!}\right)^2 \gamma_k^{2/(2u+1)}
A^2 (n+u-k)^{2u}\sigma^{2u+2}\biggr) \nonumber \\
&&
\quad
\le P\left(\sup_{\bar f\in\bar{\cal F}} (n+u-k)^{-u}
H^{l(u)}_{n+u-k,u}(\bar f)\ge \gamma_k^{2/(2u+1)}
A^2(n+u-k)^u\sigma^{2u+2}\right) \nonumber \\
&&\qquad\le e^{-\gamma_kA^{1/(2u+1)}(n+k-u)\sigma^2}
\quad \textrm{for } A>A_0(u)\gamma_k^{-2/(2u+1)}.
\label{(16.20)}
\end{eqnarray}
It is not difficult to derive formula~(\ref{(16.18)})
from relation~(\ref{(16.20)}).
It is enough to check that the level
$\frac{A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ in the
probability at the left-hand side of~(\ref{(16.18)}) can be
replaced by
$\gamma_k^{2/(2u+1)}
A^2\left(\frac{k!}{u!}\right)^2(n+u-k)^{2u}\sigma^{2u+2}$
if $\gamma_k>0$ is chosen sufficiently small. This
statement holds, since
$\gamma_k^{2/(2u+1)}
A^2\left(\frac{k!}{u!}\right)^2(n+u-k)^{2u}\sigma^{2u+2}<
\gamma_k^{2/(2k+1)}A^2\left(\frac{k!}{u!}\right)^2n^{2u}\sigma^{2u+2}
\le\frac {A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ if the constant
$\gamma_k>0$ is chosen sufficiently small, since
$n\sigma^2\ge L\log n+\log D\ge\frac12$ by the conditions of
Proposition~15.3.
Relations (\ref{(16.17)}) and (\ref{(16.18)}) imply, since the
union in~(\ref{(16.17)}) contains at most $n^{k-u}$ events, that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}\left. E\left(\bar I_{n,k}^V(f)^2\right|
{\cal B}\right)(\omega)
>2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2} \right) \\
&& \qquad \le n^{k-u}e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2}.
\end{eqnarray*}
Since $e^{-\gamma_k A^{1/(2u+1)}(n+u-k)\sigma^2}
\le e^{-\gamma_k A^{1/(2k-1)}n\sigma^2/k}$
if $u\le k-1$, $n\ge k$ and $A>A_0$ with a sufficiently large
number~$A_0$, inequality (\ref{(16.15)}) holds for all
sets $V$ of the form $V=\{1,\dots,u\}$, $1\le u<k$. The case of a
general set $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, can be
handled in the same way, only the notation becomes more complicated.
Lemma~16.3A is proved.
\medskip\noindent
{\it Proof of Lemma 16.1A.}\/ First we show that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} n^{-k/2}\left|\bar I_{n,k}(f)\right|
>An^{k/2}\sigma^{k+1}\right) \nonumber \\
&&\qquad <2P\left(\sup_{f\in{\cal F}} |S(f)|
>\frac A2n^k\sigma^{k+1}\right) \nonumber \\
&&\qquad\qquad +2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\label{(16.21)}
\end{eqnarray}
with the function $S(f)$ defined in (\ref{(16.13)}). To prove
relation (\ref{(16.21)}) introduce the random variables
$Z(f)=\bar I_{n,k}^{\{1,\dots,k\}}(f)$ and
$$
\bar Z(f)=-\sum_{V\subset \{1,\dots,k\},\,V\neq\{1,\dots,k\}}
(-1)^{k-|V|}\bar I_{n,k}^V(f)
$$
for all $f\in{\cal F}$, the
$\sigma$-algebra ${\cal B}$ considered in Lemma~16.3A and the set
$$
B=\bigcap_{\substack{V\subset\{1,\dots,k\}\\
V\neq\{1,\dots,k\}}} \left\{\omega\colon\,
\sup_{f\in{\cal F}}\left.E\left(\bar I_{n,k}^V(f)^2\right|
{\cal B}\right)(\omega) \le 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right\}.
$$
Observe that $S(f)=Z(f)-\bar Z(f)$, $f\in{\cal F}$, $B\in{\cal B}$,
and by Lemma~16.3A the inequality
$1-P(B)\le2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)}
n\sigma^2/k}$ holds. To prove relation~(\ref{(16.21)}) apply
Lemma~15.2 with
the above introduced random variables $Z(f)$ and $\bar Z(f)$,
$f\in{\cal F}$, (both here and in the subsequent proof of Lemma~16.1B
we work with random variables $Z(\cdot)$ and $\bar Z(\cdot)$ indexed
by the countable set of functions $f\in{\cal F}$, hence the functions
$f\in{\cal F}$ play the role of the parameters~$p$ when Lemma~15.2 is
applied), the random set $B$ and the numbers
$\alpha=\frac A2n^k\sigma^{k+1}$,
$u=\frac A2n^k\sigma^{k+1}$. It is enough to show that
\begin{equation}
P\left(|\bar Z(f)|
>\frac A2n^k\sigma^{k+1}|{\cal B}\right)(\omega)\le\frac12
\quad \textrm{ for all }f\in{\cal F} \quad
\textrm {if } \omega\in B.
\label{(16.22)}
\end{equation}
But
\begin{eqnarray*}
&&P\left(|\bar I_{n,k}^{V}(f)|>2^{-(k+1)} An^k\sigma^{k+1}|
{\cal B}\right)(\omega) \\
&& \qquad \le\frac{2^{2(k+1)}E(\bar I^{V}_{n,k}(f)^2|{\cal B})(\omega)}
{A^2n^{2k}\sigma^{2(k+1)}}\le 2^{-(k+1)}
\end{eqnarray*}
for all functions $f\in {\cal F}$ and sets
$V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, if $\omega\in B$
by the `conditional Chebyshev inequality', hence
relations~(\ref{(16.22)}) and~(\ref{(16.21)}) hold.
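In the last step it was also exploited that $\bar Z(f)$ is the sum
of at most $2^k-1$ terms $\pm\bar I_{n,k}^V(f)$,
$V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, hence
$$
P\left(\left.|\bar Z(f)|>\frac A2n^k\sigma^{k+1}\right|
{\cal B}\right)(\omega)
\le\sum_{\substack{V\subset\{1,\dots,k\}\\ V\neq\{1,\dots,k\}}}
P\left(\left.|\bar I_{n,k}^{V}(f)|
>2^{-(k+1)}An^k\sigma^{k+1}\right|{\cal B}\right)(\omega)
\le2^k\cdot2^{-(k+1)}=\frac12
$$
if $\omega\in B$, since
$(2^k-1)\cdot2^{-(k+1)}An^k\sigma^{k+1}<\frac A2n^k\sigma^{k+1}$.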
Lemma 16.1A follows from relation~(\ref{(16.21)}), Lemma~16.2A
and the observation that the random variables
$\bar I_{n,k}^{(V,\varepsilon)}(f)$,
$f\in{\cal F}$, defined in~(\ref{($16.12'$)}) have the same
distribution for
all $V\subset\{1,\dots,k\}$ as the random variables
$\bar I_{n,k}^{\varepsilon}(f)$, defined in
formula~(\ref{(14.12)}). Hence Lemma~16.2A and the
definition~(\ref{($16.13'$)}) of the random variables
$\bar S(f)$, $f\in{\cal F}$, imply the inequality
\begin{eqnarray*}
P\left(\sup_{f\in{\cal F}} |S(f)|>\frac A2n^k\sigma^{k+1}\right)
&=&P\left(\sup_{f\in{\cal F}} |\bar S(f)|
>\frac A2n^k\sigma^{k+1}\right)\\
&\le& 2^kP\left(\sup_{f\in{\cal F}}
\left|\bar I_{n,k}^\varepsilon(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right).
\end{eqnarray*}
Lemma 16.1A is proved.
\medskip
Lemma~16.1B will be proved with the help of the following
Lemma~16.3B, which is a version of Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3B.} {\it Let us consider a class of functions
${\cal F}$ satisfying the conditions of Proposition~15.4
together with $2k$ independent copies
$\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n$ and
$\xi^{(j,-1)}_1$,\dots, $\xi^{(j,-1)}_n$, $1\le j\le k$, of a
sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables
$\bar I_{n,k}^V(f,y)$ and $H^V_{n,k}(f)$, $f\in{\cal F}$,
$V\subset\{1,\dots,k\}$, defined in formulas~(\ref{(16.2)})
and~(\ref{(16.3)}) with
the help of these quantities. Let
${\cal B}={\cal B}(\xi_1^{(j,1)},\dots, \xi_n^{(j,1)};\;1\le j\le k)$
denote the $\sigma$-algebra generated by the random variables
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$, $1\le j\le k$, i.e. by those
random variables which appear in the definition of the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ introduced in
formulas (\ref{(16.2)}) and~(\ref{(16.3)}), and have
second argument~1 in their upper index.
\begin{enumerate}
\item
There exist some numbers $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$,
the inequality
\begin{eqnarray}
&& P\left(\sup_{f\in{\cal F}} E(H^{V}_{n,k}(f)|{\cal B})
>2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right) \nonumber \\
&& \qquad <n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.23)}
\end{eqnarray}
holds if $A\ge A_0$.
\item
There exist some numbers $A_0=A_0(k)>0$ and $\gamma=\gamma_k$ such that
if the integrals $H_{n,k}(f)$, $f\in{\cal F}$, determined by
this class of functions ${\cal F}$ have a good tail behaviour at level
$T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})
>2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}\right)
<2n^{k-1}e^{-\gamma_k A^{1/2k}n\sigma^2/k}
\label{(16.25)}
\end{equation}
holds for any pairs of subsets $V_1,V_2\subset\{1,\dots,k\}$ with
the property that at least one of them does not equal the set
$\{1,\dots,k\}$ if the number~$A$ satisfies the condition $A>T$.
\end{enumerate}
}
\medskip\noindent
{\it Proof of Lemma 16.3B.}\/ Part a) of Lemma 16.3B can be proved
in almost the same way as Lemma 16.3A. Hence I only briefly
explain the main step of the proof. In the case $V=\emptyset$ the
identity $E(H^{V}_{n,k}(f)|{\cal B})=E(H^{V}_{n,k}(f))$ holds, hence it
is enough to show that $E(H^{V}_{n,k}(f))\le\frac{n^k\sigma^2}{k!}
\le2^k\frac{n^{2k}\sigma^{2k+2}}{k!}$ for all $f\in{\cal F}$ under the
conditions of Proposition~15.4. (This relation holds, because
the functions of the class ${\cal F}$ are canonical.) The case of a
general set $V$, $V\neq\emptyset$ and $V\neq\{1,\dots,k\}$, can be
reduced to the case $V=\{1,\dots,u\}$ with some $1\le u<k$. Let
$\bar I_{n,k}^V(f,l_{u+1},\dots,l_k,y)$ denote the sum of those
terms in the sum defining $\bar I_{n,k}^V(f,y)$ in
formula~(\ref{(16.2)}) in which the last $k-u$ summation indices
take the values $l_{u+1},\dots,l_k$. It is enough to show that for
all such fixed indices
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}\int E\left(\left.\bar I_{n,k}^V
(f,l_{u+1},\dots,l_k,y)^2\right|{\cal B}\right)\rho(\,dy)
>\frac {A^{(2k-1)/k}n^{k+u}\sigma^{2k+2}}{2^{(4k+4)}}\right)\\
&&\qquad \le e^{-\gamma_kA^{(2k-1)/2k(2u+1)}(n+u-k)\sigma^2}
\end{eqnarray*}
with a sufficiently small $\gamma_k>0$. This inequality can be
proved, similarly to relation~(\ref{(16.18)}) in the proof of
Lemma~16.3A
with the help of the Corollary of Proposition~15.4. Only here we
have to work in the space $(X^u\times \bar Y,
{\cal X}^u\times\bar{\cal Y}, \mu^u\times\bar \rho)$, where $\bar
Y=X^{k-u}\times Y$, $\bar{\cal Y}={\cal X}^{k-u}\times{\cal Y}$,
$\bar\rho=\mu^{k-u}\times\rho$, with the class of functions
$\bar f\in\bar{\cal F}$ consisting of the functions~$\bar f$
defined by the formula
$\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)$
with some $f(x_1,\dots,x_k,y)\in {\cal F}$, where
$\bar y=(x_{u+1},\dots,x_k,y)$. Here we apply the following
version of formula~(\ref{(16.19)}).
\begin{eqnarray*}
&& E\left(\bar I_{n,k}^V(f,l_{u+1},\dots,l_k,y)^2|{\cal B}\right)
=\left(\frac{u!}{k!}\right)^2\int \bar I^{l(u)}_{n+u-k,u}
(\bar f,\bar y)^2\bar\rho(\,d\bar y) \\
&&\qquad =\left(\frac{u!}{k!}\right)^2H_{n+u-k,u}(\bar f)
\end{eqnarray*}
with the function $\bar f\in\bar{\cal F}$ for which the identity
$$
\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)
$$
holds with $\bar y=(x_{u+1},\dots,x_k,y)$ and the random variables
$\bar I^{l(u)}_{n+u-k,u}(\bar f,\bar y)$ and $H_{n+u-k,u}(\bar f)$
defined similarly as the corresponding terms after
formula~(\ref{(16.19)}),
only $y$ is replaced by $\bar y$, the measure $\rho$ by $\bar\rho$,
and the presently defined $\bar f\in\bar{\cal F}$ are considered in
the present case. I omit the details.
\medskip\noindent
Part b) of Lemma 16.3B will be proved with the help of Part a) and
the inequality
$$
\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B}) \le
\left(\sup_{f\in{\cal F}} E(H^{V_1}_{n,k}(f)|{\cal B})\right)^{1/2}
\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})\right)^{1/2}
$$
which follows from the Schwarz inequality applied for integrals with
respect to conditional distributions. Let us assume that
$V_1\neq\{1,\dots,k\}$. The last inequality implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})
>2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}\right)\\
&&\qquad \le P\left(\sup_{f\in{\cal F}} E(H^{V_1}_{n,k}(f)|{\cal B})
>2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right) \\
&&\qquad\qquad+P\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\end{eqnarray*}
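Here we exploited the identity
$$
\left(2^{-(4k+4)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\cdot
A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)^{1/2}
=2^{-(2k+2)}A^{2}n^{2k}\sigma^{2k+2},
$$
i.e.\ if both suprema at the right-hand side are bounded by the
levels appearing there, then by the previous inequality
$\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})
\le2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}$.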
Hence if we know that also the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\le n^{k-1} e^{-\gamma_k A^{1/2k}n\sigma^2} \label{(16.26)}
\end{equation}
holds, then we can deduce relation~(\ref{(16.25)}) from the
estimate~(\ref{(16.23)})
and the last inequality. Relation~(\ref{(16.26)}) follows
from Part~a) of Lemma~16.3B if $V_2\neq\{1,\dots,k\}$ and $A\ge1$,
since in this case the level
$A^{(2k+1)/k} n^{2k}\sigma^{2k+2}$ can be replaced
by the smaller number $2^{-(4k+2)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}$
in the probability of formula (\ref{(16.26)}). In the case
$V_2=\{1,\dots,k\}$ it follows from the conditions of Part~b) of
Lemma~16.3B if the number $\gamma_k$ is chosen so that
$\gamma_k\le1$. Indeed, since $A^{(2k+1)/2k}>T^{(2k+1)/2k}$, by
the conditions of Proposition~15.4 the estimate~(\ref{(15.7)})
holds if the number $A$ is replaced in it by
$A^{(2k+1)/2k}$ (at both sides of the inequality), and this
relation implies inequality~(\ref{(16.26)}) in this case.
\medskip
Now we turn to the proof of Lemma~16.1B.
\medskip\noindent
{\it Proof of Lemma 16.1B.}\/ By Lemma~16.2B it is enough to
prove that relation (\ref{(16.8)}) holds if the random
variables $\bar W(f)$ are replaced in it by the random
variables $W(f)$ defined in formula~(\ref{(16.14)}). We shall
prove this by applying the generalized form of the
symmetrization lemma, Lemma~15.2, with the choice of
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)$, $\bar V=\{1,\dots,k\}$,
$\bar Z(f)=Z(f)-W(f)$, $f\in{\cal F}$,
${\cal B}={\cal B}(\xi_1^{(j,1)},\dots,\xi_n^{(j,1)};\;1\le j\le k)$,
$\alpha=\frac{A^2}2n^{2k}\sigma^{2k+2}$,
$u=\frac{A^2}2n^{2k}\sigma^{2k+2}$ and the set
\begin{eqnarray*}
B&&=\bigcap_{\substack{(V_1,V_2)\colon\, V_j\subset \{1,\dots,k\},
\;j=1,2,\\
V_1\neq\{1,\dots,k\} \textrm { or } V_2\neq\{1,\dots,k\} }} \\
&&\qquad\qquad \left\{\omega\colon
\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})(\omega)
\le 2^{-(2k+2)} A^{2} n^{2k}\sigma^{2k+2}\right\}.
\end{eqnarray*}
By part~b) of Lemma 16.3B the inequality
$$
1-P(B)\le2^{2k+1}n^{k-1}
e^{-\gamma_k A^{1/2k}n\sigma^2/k}
$$
holds. Observe that
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)=H_{n,k}(f)$ for all $f\in{\cal F}$.
Hence to prove Lemma 16.1B with the
help of Lemma~15.2 it is enough to show that
\begin{equation}
P\left(\left.|\bar Z(f)|>\frac{A^2}2 n^{2k}\sigma^{2k+2}\right|
{\cal B}\right)(\omega)\le\frac12 \quad \textrm{ for all }f\in{\cal F}
\textrm{ if } \omega\in B. \label{(16.27)}
\end{equation}
To prove this relation observe that because of the definition of the
set~$B$
\begin{eqnarray*}
&& E (|\bar Z(f)| |{\cal B})(\omega) \\
&&\qquad \le \sum_{\substack
{(V_1,V_2)\colon\, V_j\subset \{1,\dots,k\},\;j=1,2,\\
V_1\neq\{1,\dots,k\} \textrm { or } V_2\neq\{1,\dots,k\} }}
E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})(\omega)
\le\frac{A^2}4n^{2k}\sigma^{2k+2}
\end{eqnarray*}
if $\omega\in B$ for all $f\in {\cal F}$. Hence the `conditional
Markov inequality' implies that
$P\left(\left.|\bar Z(f)|>\frac{A^2}2 n^{2k}\sigma^{2(k+1)}\right|
{\cal B}\right)(\omega)\le\frac
{2E(|\bar Z(f)| |{\cal B})(\omega)}{A^2n^{2k}\sigma^{2k+2}}\le\frac12$
if $\omega\in B$, and inequality~(\ref{(16.27)}) holds.
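In this calculation it was also exploited that the number of pairs
$(V_1,V_2)$ appearing in the above sum is $2^{2k}-1$, hence
$$
\sum_{\substack{(V_1,V_2)\colon\, V_j\subset\{1,\dots,k\},\;j=1,2,\\
V_1\neq\{1,\dots,k\} \textrm{ or } V_2\neq\{1,\dots,k\}}}
2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}
\le2^{2k}\cdot2^{-(2k+2)}A^2n^{2k}\sigma^{2k+2}
=\frac{A^2}4n^{2k}\sigma^{2k+2}.
$$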
Lemma~16.1B is proved.
\chapter{The proof of the main result}
This chapter contains the proof of Proposition~15.3 together with
Proposition~15.4. They complete the proof of Theorem~8.4, the
main result of this work.
\medskip\noindent
{\script A.) The proof of Proposition 15.3.}
\medskip\noindent
The proof of Proposition 15.3 is similar to that of Proposition~7.3.
It applies an induction procedure with respect to the parameter $k$.
In the proof of Proposition~15.3 for parameter~$k$ we may assume that
Propositions~15.3 and~15.4 hold for all parameters $u<k$. By
Lemma~16.1A it is enough to give a good estimate on the probability
$$
P\left(\sup_{f\in{\cal F}}\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right)
$$
appearing at the right-hand side of the estimate (\ref{(16.1)})
in Lemma~16.1A. To estimate this probability we introduce (using
the notation of Proposition~15.3) the functions
\begin{eqnarray}
&&S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,1\le j\le k) \nonumber \\
&&\qquad =\frac1{k!}\sum_{\substack {(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; j=1,\dots, k,\\ l_j\neq l_{j'}
\textrm{ if } j\neq j'}}
f^2\left(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)}\right),
\quad f\in{\cal F}, \label{(17.1)}
\end{eqnarray}
with $x_l^{(j)}\in X$, $1\le l\le n$, $1\le j\le k$.
We define with the help of these functions the following set
$H=H(A)\subset X^{kn}$ for all $A>T$, similarly to the set defined
in formula~(\ref{(7.7)}).
\begin{eqnarray}
&&H=H(A)=\biggl\{(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\colon
\nonumber \\
&&\qquad \sup_{f\in{\cal F}} S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,
1\le j\le k)>2^kA^{4/3}n^k\sigma^2\biggr\}. \label{(17.2)}
\end{eqnarray}
We want to show that
\begin{equation}
P(\{\omega\colon\, (\xi_l^{(j)}(\omega),
\,1\le l\le n,\,1\le j\le k)\in H\})
\le 2^k e^{-A^{2/3k}n\sigma^2} \quad\textrm{if }A\ge T.
\label{(17.3)}
\end{equation}
To prove relation (\ref{(17.3)}) we take the Hoeffding
decomposition of the
$U$-statistics with kernel functions $f^2(x_1,\dots,x_k)$,
$f\in{\cal F}$, given in Theorem~9.1, i.e. we write
\begin{equation}
f^2(x_1,\dots,x_k)
=\sum\limits_{V\subset\{1,\dots,k\}} f_V(x_j,j\in V),
\quad f\in{\cal F}, \label{(17.4)}
\end{equation}
with
$f_V(x_j,j\in V)=\prod\limits_{j\notin V}P_j\prod\limits_{j\in V}Q_j
f^2(x_1,\dots,x_k)$, where $P_j$ and $Q_j$ are the operators defined
in formulas (\ref{(9.1)}) and~(\ref{(9.1a)}).
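E.g.\ in the simplest case $k=1$ the decomposition~(\ref{(17.4)})
takes the form
$$
f^2(x)=\int f^2(u)\,\mu(\,du)
+\left(f^2(x)-\int f^2(u)\,\mu(\,du)\right),
$$
where the first term corresponds to the set $V=\emptyset$ and the
second, canonical term to the set $V=\{1\}$.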
The functions $f_V$ appearing in formula (\ref{(17.4)}) are
canonical (with respect to the measure $\mu$), and the identity
$S^2_{n,k}(f)(\xi_l^{(j)},\,1\le l\le n,\,1\le j \le k)=\bar I_{n,k}(f^2)$
holds for all $f\in {\cal F}$ with the expression $\bar I_{n,k}(\cdot)$
defined in~(\ref{(14.11)}). By applying the Hoeff\-ding
decomposition~(\ref{(17.4)})
for each term $f^2(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)})$ in the
expression $S^2_{n,k}(f)$ we get that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}S^2_{n,k}(f)(\xi_l^{(j)},
\,1\le l\le n,\,1\le j\le
k) >2^kA^{4/3}n^k\sigma^2\right) \nonumber \\
&&\qquad \le \!\!\! \sum_{V\subset\{1,\dots,k\}} \!\!\!
P\left(\frac{|V|!}{k!}
\sup_{f\in{\cal F}}
n^{k-|V|}|\bar I_{n,|V|}(f_V)|>A^{4/3}n^k\sigma^2\right)
\label{(17.5)}
\end{eqnarray}
with the functions $f_V$ appearing in formula~(\ref{(17.4)}).
We want to give
a good estimate for each term in the sum at the right-hand side
in~(\ref{(17.5)}). For this goal first we show that the
classes of functions
$\{f_V\colon\,f\in {\cal F}\}$ in the expansion~(\ref{(17.4)})
satisfy the
conditions of Proposition~15.3 for all $V\subset\{1,\dots,k\}$.
The functions $f_V$ are canonical for all $V\subset\{1,\dots,k\}$.
It follows from the conditions of Proposition~15.3 that
$|f^2(x_1,\dots,x_k)|\le 2^{-2(k+1)}$ and
$$
\int f^4(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le
2^{-(k+1)}\sigma^2.
$$
Hence relations (\ref{(9.4)}) and~(\ref{($9.4'$)}) of
Theorem~9.2 imply that
$$
\left|\sup_{x_j\in X,j\in V}f_V(x_j,j\in V)\right|
\le 2^{-(k+2)}\le2^{-(k+1)}
$$
and $\int f^2_V(x_j,j\in V)\prod\limits_{j\in V}\mu(\,dx_j)
\le 2^{-(k+1)} \sigma^2\le\sigma^2$ for all
$V\subset\{1,\dots,k\}$. Finally, to check that the class of
functions ${\cal F}_V=\{f_V\colon\, f\in{\cal F}\}$
is $L_2$-dense with exponent $L$ and parameter $D$ observe
that for all probability measures $\rho$ on $(X^k,{\cal X}^k)$
and pairs of functions $f,g\in {\cal F}$ the inequality
$\int(f^2-g^2)^2\,d\rho\le 2^{-2k}\int(f-g)^2\,d\rho$ holds.
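Indeed, $f^2-g^2=(f-g)(f+g)$, and
$\sup|f+g|\le2\cdot2^{-(k+1)}=2^{-k}$, hence
$$
\int(f^2-g^2)^2\,d\rho\le\sup|f+g|^2\int(f-g)^2\,d\rho
\le2^{-2k}\int(f-g)^2\,d\rho.
$$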
This implies that if $\{f_1,\dots,f_m\}$,
$m\le D\varepsilon^{-L}$, is an
$\varepsilon$-dense subset of ${\cal F}$ in the space
$L_2(X^k,{\cal X}^k,\rho)$,
then the set of functions $\{2^kf_1^2,\dots,2^kf_m^2\}$ is an
$\varepsilon$-dense subset of the class of functions
${\cal F}'=\{2^kf^2\colon\,
f\in {\cal F}\}$, hence ${\cal F}'$ is also an $L_2$-dense class
of functions with exponent~$L$ and parameter~$D$. Then by
Theorem~9.2 the class of functions ${\cal F}_V$ is also
$L_2$-dense with exponent $L$ and
parameter~$D$ for all sets $V\subset\{1,\dots,k\}$.
For $V=\emptyset$, the function $f_V$ is constant, the relation
$$
f_V=\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2
$$
holds, and $|\bar I_{n,|V|}(f_V)|=f_V\le\sigma^2$. Therefore
the term corresponding to $V=\emptyset$ in the sum of
probabilities at the right-hand side of (\ref{(17.5)}) equals
zero under the conditions of Proposition~15.3 with the choice
of some $A_0\ge1$. I claim that the remaining terms in the sum
at the right-hand side of~(\ref{(17.5)}) satisfy the inequality
\begin{eqnarray}
&&P\left(\frac{|V|!}{k!}n^{k-|V|}\sup_{f\in{\cal F}}
|\bar I_{n,|V|}(f_V)|>A^{4/3}n^{k}\sigma^2\right)\nonumber \\
&&\qquad \le P\left(\sup_{f\in{\cal F}}
|\bar I_{n,|V|}(f_V)|>A^{4/3}\frac{k!}
{|V|!}n^{|V|}\sigma^{|V|+1}\right)
\le e^{-A^{2/3k}n\sigma^2} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{if } 1\le|V|\le k. \label{(17.6)}
\end{eqnarray}
The first inequality in (\ref{(17.6)}) holds, since
$\sigma^{|V|+1}\le\sigma^2$
for $|V|\ge1$, and $n\ge k\ge|V|$. The second inequality
follows from the inductive hypothesis if $|V|<k$, and from the
conditions of Proposition~15.3 if $|V|=k$. Relations~(\ref{(17.5)})
and~(\ref{(17.6)}) imply inequality~(\ref{(17.3)}).

By estimating the conditional probability
$P\left(n^{-k/2}\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^{k/2}\sigma^{k+1}\right)$ with
respect to the random variables $\xi_l^{(j)}$,
$1\le l\le n$, $1\le j\le k$ we get with the help of
the multivariate version of Hoeff\-ding's inequality
(Theorem~13.3) that
\begin{eqnarray}
&&P\left(\left.\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^k\sigma^{k+1}\right|\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\right) \nonumber \\
&&\qquad \le C\exp\left\{-\frac12
\left(\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{2k+4}
S^2_{n,k}(x_l^{(j)},1\le l\le n,\,1\le j\le k)/k!}
\right)^{1/k}\right\} \nonumber \\
&&\qquad \le Ce^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2} \quad
\textrm{for all }f\in{\cal F} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad \textrm{if } (x_l^{(j)},\,
1\le l\le n,\,1\le j\le k) \notin H \label{(17.7)}
\end{eqnarray}
with some appropriate constant $C=C(k)>0$.
Define for all $1\le j\le k$ and sets of points $x_l^{(j)}\in X$,
$1\le l\le n$, the probability measures
$\rho_j=\rho_{j,\,(x_l^{(j)},\,
1\le l\le n)}$, $1\le j\le k$, uniformly distributed on the set of
points $\{x_l^{(j)},\; 1\le l\le n\}$, i.e. let
$\rho_j(x_l^{(j)})=\frac1n$ for all $1\le l\le n$. Let us also
define the product $\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)
=\rho_1\times\cdots\times\rho_k$ of these measures on the space
$(X^k,{\cal X}^k)$. If $f$ is a function on $(X^k,{\cal X}^k)$ such
that $\int f^2\,d\rho\le\delta^2$ with some $\delta>0$, then
\begin{eqnarray*}
&&\sup_{\varepsilon_1,\dots,\varepsilon_n} |\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)| \\
&&\qquad \le\frac{n^k}{k!}\int
|f(u_1,\dots,u_k)|\rho(\,du_1,\dots,\,du_k)
\le\frac{n^k}{k!}
\left(\int f^2\,d\rho\right)^{1/2}\le\frac{n^k}{k!}\delta,
\end{eqnarray*}
$u_j\in X$, $1\le j\le k$, and as a consequence
\begin{eqnarray}
&&\sup_{\varepsilon_1,\dots,\varepsilon_n}|\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k) \label{(17.8)} \\
&&\qquad\qquad\qquad -\bar I_{n,k}^\varepsilon(g)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)| \nonumber \\
&&\qquad \le2^{-(k+2)}An^k\sigma^{k+1} \quad\textrm{if }
\int (f-g)^2\,d\rho\le (2^{-(k+2)}k!A\sigma^{k+1})^2,
\nonumber
\end{eqnarray}
where
$\bar I_{n,k}^\varepsilon(f)(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)$
equals the expression $\bar I_{n,k}^\varepsilon(f)$ defined
in~(\ref{(14.12)}) if we replace $\xi_{l_j}^{(j)}$ by $x_{l_j}^{(j)}$
for all $1\le j\le k$, and $1\le l_j\le n$ in it, and $\rho$ is
the measure
$\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$ defined above.
Let us fix the number $\delta=2^{-(k+2)}k!A\sigma^{k+1}$,
and let us list the elements of the set ${\cal F}$ as
${\cal F}=\{f_1,f_2,\dots\}$.
Put
$$
m=m(\delta)=\max(1,D\delta^{-L})
=\max(1,D(2^{(k+2)}(k!)^{-1}A^{-1}\sigma^{-(k+1)})^L),
$$
and choose for all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$ such a
sequence of positive integers $p_1(x^{(n)}),\dots,p_m(x^{(n)})$
for which
$$
\inf\limits_{1\le l\le m}\int (f(u)-f_{p_l(x^{(n)})}(u))^2
\,d\rho(x^{(n)})\le\delta^2\quad\textrm{for all } f\in{\cal F}.
$$
(Here we apply the notation
$\rho(x^{(n)})=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$.)
This is possible, since ${\cal F}$ is an $L_2$-dense
class with exponent~$L$ and parameter~$D$, and we can choose
$m\le D\delta^{-L}$ if $\delta<1$. Beside this, we can choose $m=1$
if $\delta\ge1$, since
$\int |f-g|^2\,d\rho\le\sup|f(x)-g(x)|^2\le2^{-2k}\le1$ for all
$f,g\in{\cal F}$. Moreover, it follows from Lemma~7.4A that the
functions $p_l(x^{(n)})$, $1\le l\le m$, can be chosen as
measurable functions of the argument $x^{(n)}\in X^{kn}$.
Let us introduce the random vector
$\xi^{(n)}(\omega)=(\xi^{(j)}_l(\omega),\,1\le l\le n,\,1\le j\le k)$.
By arguing similarly as we did in the proof of Proposition~7.3 we
get with the help of relation~(\ref{(17.8)}) and the property of the
functions $f_{p_l(x^{(n)})}(\cdot)$ constructed above that
\begin{eqnarray*}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
|\bar I_{n,k}^\varepsilon(f)(\omega)|
\ge2^{-(k+1)}An^k\sigma^{k+1}\right\} \\
&&\qquad \subset\bigcup\limits_{l=1}^m\left\{\omega\colon\,
|\bar I_{n,k}^\varepsilon(f_{p_l(\xi^{(n)}(\omega))})(\omega)|
\ge2^{-(k+2)}An^k\sigma^{(k+1)} \right\}.
\end{eqnarray*}
The above relation and formula (\ref{(17.7)}) imply that
\begin{eqnarray}
&&P \left.\biggl(\sup_{f\in{\cal F}}
\left|\bar I_{n,k}^{\varepsilon}(f)(\omega)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right| \nonumber \\
&&\qquad\qquad\qquad \qquad\qquad\qquad \xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\biggr) \nonumber \\
&&\qquad \le \sum_{l=1}^m P\left.\biggl(|
\bar I_{n,k}^{\varepsilon}(f_{p_l(\xi^{(n)}(\omega))})(\omega)|
>\frac{A n^k\sigma^{k+1}}{2^{k+2}}\right|
\nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad \xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\biggr) \nonumber \\
&&\qquad \le C m(\delta) e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2}
\nonumber \\
&&\qquad \le C (1+D(2^{k+2} A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad \textrm{if }
\{x_l^{(j)},\, 1\le l\le n,\,1\le j\le k\}\notin H. \label{(17.9)}
\end{eqnarray}
Relations~(\ref{(17.3)}) and~(\ref{(17.9)}) imply that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}
\left|\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right) \nonumber \\
&&\qquad \le C (1+D(2^{k+2}A^{-1} (k!)^{-1}
\sigma^{-(k+1)})^L)\nonumber \\
&&\qquad\quad e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2}
+2^k e^{-A^{2/3k}n\sigma^2} \quad\textrm{if }A> T.
\label{(17.10)}
\end{eqnarray}
Proposition 15.3 follows from the estimates~(\ref{(16.1)}),
(\ref{(17.10)}) and the
condition $n\sigma^2\ge L\log n+\log D$, $L,D\ge 1$, if $A\ge A_0$
with a sufficiently large number~$A_0$. Indeed, in this case
$n\sigma^2\ge\frac12$,
$(2^{k+2}A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L\le
(\frac{n^{(k+1)/2}}{(2n\sigma^2)^{(k+1)/2}})^L\le n^{L(k+1)/2}=
e^{L\log n\cdot (k+1)/2}\le e^{(k+1)n\sigma^2/2}$,
$D=e^{\log D}\le e^{n\sigma^2}$, and
$$
C (1+D(2^{k+2} A^{-1} (k!)^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}(k!)^{1/k}n\sigma^2}
\le\frac13 e^{-A^{1/2k}n\sigma^2}.
$$
The estimation of the remaining terms in the upper bound of the
estimates~(\ref{(16.1)}) and~(\ref{(17.10)}) leading to the proof of
relation~(\ref{(15.5)}) is simpler. We can exploit that
$e^{-A^{2/3k}n\sigma^2}\ll e^{-A^{1/2k}n\sigma^2}$, and since
$n^{k-1}\le e^{(k-1)n\sigma^2}$,
$$
2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}\le
2^ke^{(k-1)n\sigma^2}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\ll e^{-A^{1/2k}n\sigma^2}
$$
for a large number~$A$.
Now we turn to the proof of Proposition~15.4.
\medskip\noindent
{\script B.) The proof of Proposition 15.4.}
\medskip\noindent
Because of formula~(\ref{(16.11)}) in the Corollary of
Lemma~16.1B, to prove Proposition 15.4, i.e.\
inequality~(\ref{(15.7)}), it is enough to choose a
sufficiently large parameter $A_0$ and to show that with such
a choice the random variables $H_{n,k}(f|G,V_1,V_2)$ defined in
formula~(\ref{(16.9)}) satisfy the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right) \le
2^{k+1} e^{-A^{1/2k}n\sigma^2}\nonumber \\
&&\qquad\textrm{ for all } G\in {\cal G}\quad \textrm{and }
\;V_1,V_2\in\{1,\dots,k\} \quad\textrm{if } A>T\ge A_0
\label{(17.11)}
\end{eqnarray}
under the conditions of Proposition~15.4.
Let us first prove formula (\ref{(17.11)}) in the case $|e(G)|=k$,
i.e.\ when all vertices of the diagram $G$ are end-points of some
edge, and the expression $H_{n,k}(f|G,V_1,V_2)$ contains no
`symmetrizing term' $\varepsilon_j$. In this case we apply a
special argument to prove relation~(\ref{(17.11)}).
It can be seen with the help of the Schwarz inequality that for a
diagram $G$ such that $|e(G)|=k$
\begin{eqnarray}
&&|H_{n,k}(f|G,V_1,V_2)| \label{(17.12)} \\
&&\qquad \le\frac1{k!}
\left(\sum_{\substack{ (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\int f^2(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y)
\rho(\,dy)\right)^{1/2} \nonumber \\
&& \qquad\qquad
\frac1{k!}\left(\sum_{\substack{ (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \textrm{ if }j\neq j'}}
\int f^2(\xi_{l_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l_k}^{(k,\delta_k(V_2))},y) \rho(\,dy)\right)^{1/2} \nonumber
\end{eqnarray}
with $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$.
Relation (\ref{(17.12)}) can be proved for instance by
bounding first each integral in formula (\ref{(16.9)}) by
means of the Schwarz inequality, and then by bounding the
sum appearing in such a way by means of the inequality
$\sum |a_jb_j|\le \left(\sum a_j^2\right)^{1/2}
\left(\sum b_j^2\right)^{1/2}$. Observe that in the case
$|e(G)|=k$ the summation in~(\ref{(16.9)}) is
taken for such vectors $(l_1,\dots,l_k,l_1',\dots,l_k')$ for
which $(l_1',\dots,l_k')$ is a permutation of the sequence
$(l_1,\dots,l_k)$ determined by the diagram~$G$. Hence the
sum we get after applying the Schwarz inequality for each
integral in~(\ref{(16.9)}) has the form $\sum a_jb_j$ where
the set of indices~$j$ in this sum agrees with
the set of vectors $(l_1,\dots,l_k)$ such that $1\le l_p\le n$
for all $1\le p\le k$, and $l_p\neq l_{p'}$ if $p\neq p'$.
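Schematically, the two steps described above can be summarized as
follows. Let $g_j$ and $h_j$, indexed by the vectors
$j=(l_1,\dots,l_k)$, denote the two kernel functions appearing in
the integrals of formula~(\ref{(16.9)}) (this notation is
introduced only for the present illustration). Then
$$
\sum_j\left|\int g_jh_j\,d\rho\right|\le
\sum_j\left(\int g_j^2\,d\rho\right)^{1/2}
\left(\int h_j^2\,d\rho\right)^{1/2}\le
\left(\sum_j\int g_j^2\,d\rho\right)^{1/2}
\left(\sum_j\int h_j^2\,d\rho\right)^{1/2},
$$
where the first inequality is the Schwarz inequality in
$L_2(\rho)$ applied to each integral separately, and the second
one is the Cauchy inequality for the sums.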
By formula (\ref{(17.12)})
\begin{eqnarray*}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
\left |H_{n,k}(f|G,V_1,V_2)(\omega)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!} \right\} \\
&&\qquad \subset
\biggl\{\omega\colon\, \sup_{f\in{\cal F}} \!\!\!\!
\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!\!
\int f^2(\xi_{l_1}^{(1,\delta_1(V_1))}(\omega),
\dots,\xi_{l_k}^{(k,\delta_k(V_1))}
(\omega),y) \rho(\,dy) \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
>\frac {A^2n^{2k}\sigma^{2(k+1)}k!}{2^{4k+1}} \biggr\} \\
&&\qquad\quad \cup \biggl\{\omega\colon\, \sup_{f\in{\cal F}} \!\!\!\!
\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n, \; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!\!
\int f^2(\xi_{l_1}^{(1,\delta_1(V_2))}(\omega),\dots,
\xi_{l_k}^{(k,\delta_k(V_2))}
(\omega),y)\rho(\,dy) \\
&&\qquad\qquad \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
>\frac{A^2n^{2k}\sigma^{2(k+1)}k!}{2^{4k+1}}\biggr\},
\end{eqnarray*}
hence
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right) \label{(17.13)} \\
&&\qquad \le 2P\left(\sup_{f\in{\cal F}}\frac1{k!}
\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
h_f(\xi_{l_1}^{(1,1)},\dots,\xi_{l_k}^{(k,1)})
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right) \nonumber
\end{eqnarray}
with the functions $h_f(x_1,\dots,x_k)=\int
f^2(x_1,\dots,x_k,y)\rho(\,dy)$, $f\in{\cal F}$. (In this
upper bound we could get rid of the terms $\delta_j(V_1)$
and $\delta_j(V_2)$, i.e.\ of the dependence of the
expression $H_{n,k}(f|G,V_1,V_2)$ on the sets $V_1$ and
$V_2$, since the probabilities of the events in the
previous formula do not depend on them.)
I claim that
\begin{equation}
P\left(\sup\limits_{f\in{\cal F}} |\bar I_{n,k}(h_f)|
\ge2^k An^k \sigma^2\right)\le
2^k e^{-A^{1/2k}n\sigma^2} \quad \textrm{for }A\ge A_0
\label{(17.14)}
\end{equation}
if the constant $A_0=A_0(k)$ is chosen sufficiently large in
Proposition~15.4. Relation (\ref{(17.14)}) together with the
relation
$A^2\frac{n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\ge2^kA n^k\sigma^2$
(if $A>A_0$ with a sufficiently large~$A_0$) imply that the
probability at the right-hand side of (\ref{(17.13)}) can be
bounded by $2^{k+1}e^{-A^{1/2k}n\sigma^2}$, and the
estimate~(\ref{(17.11)}) holds in the case $|e(G)|=k$.
Relation (\ref{(17.14)}) is similar to relation~(\ref{(17.3)})
(together with the definition of the random set~$H$ in
formula~(\ref{(17.2)})), and a modification of the proof of
the latter estimate yields the proof also in this case.
Indeed, it follows from the conditions of
Proposition~15.4 that
$0\le\int h_f(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$
for all $f\in{\cal F}$, and it is not difficult to check that
$\sup|h_f(x_1,\dots,x_k)|\le2^{-2(k+1)}$, and the class of
functions ${\cal H}=\{2^kh_f,\; f\in{\cal F}\}$ is an
$L_2$-dense class with exponent $L$ and parameter $D$. Hence
by applying the Hoeff\-ding decomposition of the functions
$h_f$, $f\in {\cal F}$, similarly to formula~(\ref{(17.4)}) we
get for all $V\subset \{1,\dots,k\}$ such a set of functions
$\{(h_f)_V,\,f\in{\cal F}\}$, which satisfies the conditions
of Proposition~15.3. Hence a natural adaptation of the
estimate given for the expression at the right-hand side
of~(\ref{(17.5)}) (with the help of~(\ref{(17.6)}) and the
investigation of $\bar I_{|V|}(f_V)$ for $V=\emptyset$) yields
the proof of formula (\ref{(17.14)}). We only have to replace
$S_{n,k}(f)$ by $\bar I_{n,k}(h_f)$, then $\bar I_{n,|V|}(f_V)$ by
$\bar I_{n,|V|}((h_f)_V)$ and the levels
$2^kA^{4/3}n^k\sigma^2$ and $A^{4/3}n^k\sigma^2$ by
$2^kAn^k\sigma^2$ and $An^k\sigma^2$. Let us observe that
each term of the upper bound we get in such a way can be
directly bounded, since during the proof of Proposition~15.4
for parameter~$k$ we may assume that the result
of Proposition~15.3 holds also for this parameter~$k$.
In the case $|e(G)|<k$ we first prove the estimate
\begin{eqnarray}
&&P\left(S^2({\cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4\right)
\le 2^{k+1}e^{-A^{2/3k}n\sigma^2} \nonumber \\
&&\qquad\qquad\qquad\qquad \qquad\qquad
\textrm{if }A\ge A_0\textrm{ and } |e(G)|<k.
\label{(17.16)}
\end{eqnarray}
Observe that
$$
P\left(S^2({\cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4\right) \le
2P\left(\sup\limits_{f\in{\cal F}}
\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right)
$$
with
$h_f(x_1,\dots,x_k)=\int f^2(x_1,\dots,x_k,y)\rho(\,dy)$.
(Here we exploited that in the last formula
$S^2({\cal F}|G,V_1,V_2)$ is bounded by the product of two
random variables whose distributions do not depend on the
sets $V_1$ and $V_2$.) Thus to prove inequality
(\ref{(17.16)}) it is enough to show that
\begin{equation}
2P\left(\sup\limits_{f\in{\cal F}}
\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2} \quad \textrm{if } A\ge A_0.
\label{(17.21)}
\end{equation}
Actually formula (\ref{(17.21)}) follows from the already
proven formula~(\ref{(17.14)}), only the parameter $A$ has
to be replaced by $A^{4/3}$ in it.
With the help of relation (\ref{(17.16)}) the proof of
Proposition~15.4 can be completed similarly to
Proposition~15.3. The following version of
inequality~(\ref{(17.7)}) can be proved with the help
of the multivariate version of Hoeff\-ding's inequality,
Theorem~13.3, and the representation of the random variable
$H_{n,k}(f|G,V_1,V_2)$ in the form~(\ref{(17.15)}).
\begin{eqnarray}
&&P\biggl(\left.|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+2}k!} n^{2k}\sigma^{2(k+1)}
\right| \nonumber \\
&& \qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)
\le Ce^{-2^{-(6+2/k)} A^{2/3k}n\sigma^2} \nonumber \\
&&\qquad \qquad \textrm{if}\quad S^2({\cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \textrm{ and }A\ge A_0
\label{(17.22)}
\end{eqnarray}
with an appropriate constant $C=C(k)>0$ for all $f\in{\cal F}$
and $G\in {\cal G}$ such that $|e(G)|<k$. Indeed, Theorem~13.3
provides for the conditional probability at the left-hand side
of~(\ref{(17.22)}) an upper bound of the form
$C\exp\{-\textrm{const.}\,(u^2/S^2({\cal F}|G,V_1,V_2))^{1/2j}\}$
with $u=\frac{A^2}{2^{4k+2}k!}n^{2k}\sigma^{2(k+1)}$,
where $2j=2k-2|e(G)|$, and
$0\le |e(G)|\le k-1$. Since $j\le k$, $n\sigma^2\ge\frac12$,
and also $\frac{A^{4/3}}{2^{10k+4}}\ge2$ if $A_0$ is chosen
sufficiently large we can write in the above upper bound for
the left-hand side of~(\ref{(17.22)}) $j=k$, and in such a way
we get inequality~(\ref{(17.22)}).
The next inequality in which we estimate
$\sup\limits_{f\in{\cal F}}H_{n,k}(f|G,V_1,V_2)$ is a natural
version of formula~(\ref{(17.9)}) in the proof of Proposition~15.3.
\begin{eqnarray}
&&P\biggl(\left.\sup_{f\in{\cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+1}k!} n^{2k}\sigma^{2(k+1)}
\right| \nonumber \\
&& \qquad\qquad\qquad\qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)
(\omega)\nonumber \\
&&\qquad \le C \left(1+D\left(\frac{2^{4k+3}k!}
{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2} \nonumber \\
&& \qquad \textrm{if } S^2({\cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \textrm{ and } A\ge A_0
\label{(17.23)}
\end{eqnarray}
for all $G\in{\cal G}$ such that $|e(G)|<k$ and all
$V_1,V_2\subset\{1,\dots,k\}$. Indeed, choosing the functions
$f_{p_l(x^{(n)})}$, $1\le l\le m$, as in the proof of
relation~(\ref{(17.9)}) we get that
\begin{eqnarray*}
&&P\biggl(\left.\sup_{f\in{\cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right| \\
&& \qquad\qquad\qquad\qquad\qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)\\
&&\qquad\le \sum_{l=1}^m
P\biggl(\left. |H_{n,k}(f_{p_l(\xi^{(n)}(\omega))}|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right| \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)
\end{eqnarray*}
for almost all~$\omega$. The last inequality together
with~(\ref{(17.22)})
and the inequality $m=\max(1,D\delta^{-L})\le 1+D
\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L$ imply
relation~(\ref{(17.23)}).
It follows from relations (\ref{(17.16)}) and (\ref{(17.23)}) that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}k!}\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2}\\
&&\qquad + C
\left(1+D\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2}
\quad\textrm{if }A\ge A_0
\end{eqnarray*}
for all $V_1,V_2\subset\{1,\dots,k\}$ and diagram
$G\in{\cal G}$ such that $|e(G)|\le k-1$. This inequality
implies that relation~(\ref{(17.11)}) holds also in the
case $|e(G)|\le k-1$ if the constant $A_0$ is chosen
sufficiently large in Proposition~15.4, and this completes
the proof of Proposition~15.4. To prove relation~(\ref{(17.11)})
in the case $|e(G)|\le k-1$ we still have to show that
$D(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}})^L
\le e^{\textrm{const.}\, n\sigma^2}$
if $A>A_0$ with a sufficiently large~$A_0$, since this
implies that the second term at the right-hand side of our last
estimate is not too large.
This follows from the inequality $n\sigma^2\ge L\log n+\log D$
which implies that
$$
\left(\frac{2^{4k+3}k!}{A^2\sigma^{2(k+1)}}\right)^L\le
\left(\frac{n^{(k+1)}}{(2n\sigma^2)^{(k+1)}}\right)^L
\le e^{(k+1)L\log n}\le e^{{(k+1)}n\sigma^2}
$$
if $A_0$ is sufficiently large, and
$D=e^{\log D}\le e^{n\sigma^2}$.
\chapter{An overview of the results in this work}
I discuss briefly the problems investigated in this work,
recall some basic results related to them, and also give some
references. I also write about the background of these problems
which may explain the motivation for their study.
I met the main problem considered in this work when I tried to
adapt the method of proof of the central limit theorem for
maximum-likelihood estimates to some more difficult questions about
so-called non-parametric maximum likelihood estimate problems.
The Kaplan--Meyer estimate for the empirical distribution function
with the help of censored data investigated in the second section
is such a problem. It is not a maximum-likelihood estimate in the
classical sense, but it can be considered as a non-parametric
maximum likelihood estimate. In the estimation of the empirical
distribution function with the help of censored data we cannot
apply the classical maximum likelihood method, since in this
problem we have to choose our estimate from a too large class
of distribution functions. The main problem is that there is no
dominating measure with respect to which all candidates which
may appear as our estimate have a density function. A natural
way to overcome this difficulty is to choose an appropriate
smaller class of distribution functions, to compare the
probability of the appearance of the sample we observed with
respect to all distribution functions of this class and to
choose that distribution function as our estimate for which this
probability takes its maximum.
The Kaplan--Meyer estimate can be found on the basis of the above
principle in the following way: Let us estimate the distribution
function $F(x)$ of the censored data simultaneously together with
the distribution function $G(x)$ of the censoring data. (We have a
sample of size $n$ and know which sample elements are censored and
which are censoring data.) Let us consider the class of such pairs
of estimates $(F_n(x),G_n(x))$ of the pair $(F(x),G(x))$ for which
the distribution function $F_n(x)$ is concentrated in the censored
sample points and the distribution function $G_n(x)$ is
concentrated in the censoring sample points; more precisely, let us
also assume that if the largest sample point is a censored point,
then the distribution function $G_n(x)$ of the censoring data takes
still another value which is larger than any sample point, and if
it is a censoring point then the distribution function $F_n(x)$ of
the censored data takes still another value larger than any sample
point. (This modification at the end of the definition is needed,
since if the largest sample point is from the class of censored
data, then the distribution $G(x)$ of the censoring data in this
point must be strictly less than~1, and if it is from the class of
censoring data, then the value of the distribution function $F(x)$
of the censored data must be strictly less than~1 in this point.)
Let us take this class of pairs of distribution functions
$(F_n(x),G_n(x))$, and let us choose that pair of distribution
functions of this class as the (non-parametric maximum likelihood)
estimate with respect to which our observation has the greatest
probability.\index{product limit estimator (Kaplan--Meyer method)}
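The extremal problem described above leads to a simple recursive
computation. The following sketch is a hypothetical illustration
(the function name and the conventions are chosen for this example
only, and it follows the standard statistical form of the
product-limit estimator rather than the exact notation of
formula~(2.3)): it computes the estimate $F_n$ of the distribution
function of the censored data from a right-censored sample.

```python
# Hypothetical sketch of the product-limit (Kaplan--Meyer) estimate
# of the distribution function F of the censored data; names and
# conventions are chosen for this illustration only.

def kaplan_meier(times, events):
    """Product-limit estimate of F(t) = P(X <= t).

    times  -- the observed values min(X_j, Y_j)
    events -- 1 if the j-th observation is a value of the censored
              variable X_j, 0 if it is a value of the censoring
              variable Y_j
    Returns the list of pairs (t, F_n(t)) at the uncensored
    observation times.
    """
    n = len(times)
    # Sort the sample; at tied times uncensored observations come first.
    order = sorted(range(n), key=lambda j: (times[j], -events[j]))
    survival = 1.0  # current value of 1 - F_n
    at_risk = n     # number of observations not yet passed
    estimate = []
    for j in order:
        if events[j] == 1:
            # each uncensored point multiplies the survival
            # probability by (1 - 1/number at risk)
            survival *= 1.0 - 1.0 / at_risk
            estimate.append((times[j], 1.0 - survival))
        at_risk -= 1
    return estimate
```

Without censoring (all entries of `events` equal to 1) the
procedure reproduces the ordinary empirical distribution function,
which is a useful sanity check.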
The above extremal problem about the pairs of distribution functions
$(F_n(x),G_n(x))$ can be solved explicitly (see~\cite{r26}), and it
yields the estimate of $F_n(x)$ written down in formula~(2.3).
(The function $G_n(x)$ satisfies a similar relation, only the
random variables~$X_j$ and~$Y_j$ and the events $\delta_j=1$ and
$\delta_j=0$ have to be replaced in it.) Then, as I have indicated,
a natural analogue of the linearization procedure in the proof of
the central limit theorem for the classical maximum likelihood
estimate works also in this case, and there is only
one really hard part of the proof. We have to show that the
linearization procedure gives a small error. The estimation of
this error led to the problem about a good estimate on the tail
distribution of the integral of an appropriate function of two
variables with respect to the product of a normalized empirical
measure with itself. Moreover, as a more detailed investigation
showed, we actually need the solution of a more general problem
where we have to bound the tail distribution of the supremum of
a class of such integrals. The main subject of this work is to
solve the above problems in a more general setting, to estimate
not only two-fold, but also $k$-fold random integrals and the
supremum of such integrals for an appropriate class of kernel
functions with respect to a normalized empirical distribution
for all~$k\ge1$.
The proof of the limit theorem for the Kaplan--Meyer estimate
explained in this work applied the explicit form of this estimate.
It would be interesting to find such a modification of this proof
which only exploits that the Kaplan--Meyer estimate is the solution
of an appropriate extremal problem. We may expect that such a proof
can be generalized to a general result about the limit behaviour
for a wide class of non-parametric maximum likelihood estimates.
Such a consideration was behind the remark of Richard Gill I quoted
at the end of Section~2.
A detailed proof together with a sharp estimate on the speed of
convergence for the limit behaviour of the Kaplan--Meyer
estimate based on the ideas presented in Section~2 is given
in paper~\cite{r39}. Paper~\cite{r40} explains more about its
background, and it also discusses the solution of some other
non-parametric maximum likelihood problems. The results about
multiple integrals with respect to a normalized empirical
distribution function needed in these works were proved
in~\cite{r31}. These results were satisfactory for the study
in~\cite{r39}, but they also have some drawbacks. They do
not show that if the random integrals we are considering have
small variances, then they satisfy better estimates. Beside this,
if we consider the supremum of random integrals of an appropriate
class of functions, then these results can be applied only in
very special cases. Moreover, the method of proof of~\cite{r31}
did not allow a real generalization of these results, hence I
had to find a different approach when I tried to generalize them.
I do not know of other works where the distribution of multiple
random integrals with respect to a normalized empirical distribution
is studied. On the other hand, there are some works where the
distribution of (degenerate) $U$-statistics is investigated. The
most important results obtained in this field are contained in the
book of de la Pe\~na and Gin\'e {\it Decoupling, From Dependence to
Independence}\/~\cite{r8}. The problems about the behaviour of
degenerate $U$-statistics and multiple integrals with respect to
a normalized empirical distribution function are closely related,
but the explanation of their relation is far from trivial. The main
difference between them is that integration with respect to
$\mu_n-\mu$ instead of the empirical distribution $\mu_n$ means
some sort of normalization, while this normalization is missing
in the definition of $U$-statistics. I return to this question
later.
The main part of this work starts at Section~3. A general overview
of the results without the hard technical details can be found
in~\cite{r34}.
First the estimation of sums of independent random variables
or one-fold random integrals with respect to a normalized empirical
distribution and the supremum of such expressions is investigated
in Sections~3 and~4. This question has a fairly big literature. I
would mention first of all the books {\it A course on empirical
processes}\/~\cite{r12},
{\it Real Analysis and Probability}\/~\cite{r13} and
{\it Uniform Central Limit Theorems}\/~\cite{r14} of R.~M.~Dudley.
These books contain a much more detailed description of
empirical processes than the present work, together with a lot of
interesting results.
Section~3 deals with the tail behaviour of sums of independent and
bounded random variables with expectation zero. The proof of two
already classical results, Bernstein's and Bennett's inequalities
is given there. (Their proofs can be found e.g. in Theorem~1.3.2
of~\cite{r14} and~\cite{r6}). We are also interested in the
question of when they
give the kind of estimate that the central limit theorem suggests.
Actually, as it is explained in Section~3, Bennett's inequality
gives a bound suggested by a Poissonian approximation of partial
sums of independent random variables. Bernstein's inequality
provides an estimate suggested by the central limit theorem if the
variance of the sum we consider is not too small. (The results in
Section~3 explain this statement more explicitly.) If the variance
of the sum is too small, then Bennett's inequality provides a
slight improvement of Bernstein's inequality. Moreover, as
Example~3.3 shows, Bennett's inequality is essentially sharp in
this case.
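For orientation, let us recall these inequalities in the
normalization most often found in the literature (which may differ
slightly from the formulation of Section~3). If $X_1,\dots,X_n$
are independent random variables with $EX_j=0$, $|X_j|\le M$,
$1\le j\le n$, and $V=\sum_{j=1}^n EX_j^2$, then for all $u>0$
$$
P\left(\sum_{j=1}^n X_j>u\right)\le
\exp\left\{-\frac{u^2}{2(V+Mu/3)}\right\}
\qquad\textrm{(Bernstein's inequality),}
$$
and
$$
P\left(\sum_{j=1}^n X_j>u\right)\le
\exp\left\{-\frac V{M^2}\,h\left(\frac{Mu}V\right)\right\},
\qquad h(x)=(1+x)\log(1+x)-x
$$
(Bennett's inequality). A Taylor expansion of the function $h$
shows that the two bounds agree in first approximation when
$Mu\ll V$, while for $Mu\gg V$ Bennett's inequality is sharper.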
The estimate on the tail distribution of a sum of independent random
variables is weak if this sum has a small variance. This means that
in this case the probability that the sum is larger than a given
value may be much larger than the (rather small) value suggested by
the central limit theorem. Such a behaviour may occur, because the
contribution of some unpleasant irregularities to this probability
may be non-negligible in the case of a small variance.
In the study of the supremum of sums of independent random variables
a good control is needed on the tail distribution of the (supremum
of) sums of independent random variables even if they have small
variance. The solution of this problem (and of its natural
multivariate version) turned out to be the hardest part of this
work. The results based on the similar behaviour of partial sums
and their Gaussian counterparts are not sufficient in this case;
some new ideas have to be applied. In the proof of sharp estimates
in this case we also use some kind of symmetrization arguments.
The last result of Section~3, Hoeff\-ding's inequality presented
in Theorem~3.4 is an important ingredient of these symmetrization
arguments. It is also a classical result whose proof can be found
for instance in~\cite{r24}.
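Let us recall the form of Hoeff\-ding's inequality applied in such
symmetrization arguments (the formulation of Theorem~3.4 may
differ slightly in normalization). If
$\varepsilon_1,\dots,\varepsilon_n$ are independent random
variables with
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, and
$a_1,\dots,a_n$ are arbitrary real numbers, then
$$
P\left(\sum_{j=1}^n a_j\varepsilon_j>u\right)\le
\exp\left\{-\frac{u^2}{2\sum_{j=1}^n a_j^2}\right\}
\qquad\textrm{for all }u>0,
$$
i.e.\ such a random sum satisfies the same kind of Gaussian tail
estimate as a normal random variable with variance
$\sum\limits_{j=1}^n a_j^2$.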
Section~4 contains the one-variate version of our main result
about the supremum of the integrals of a class ${\cal F}$ of
functions with respect to a normalized empirical measure together
with an equivalent statement about the tail distribution of the
supremum of a class of random sums defined with the help of a
sequence of independent and identically distributed random
variables and a class of functions ${\cal F}$ with some nice
properties. These results are formulated in Theorems~4.1 and~$4.1'$.
Also a Gaussian version of them is presented in Theorem~4.2 about
the distribution of the supremum of a Gaussian random field with
some appropriate properties. A deeper version of Theorem~4.2 is
studied in paper~\cite{r11}. These results can be interpreted
as follows: if we take the supremum of random integrals
or of random sums determined by a nice class of functions ${\cal F}$
in the way described in Section~4, then the tail distribution of
this supremum satisfies almost as good an estimate as the `worst
element' of the random variables taking part in this supremum. But
such a result holds only if we consider the value of this tail
distribution at a sufficiently large level, since, as some
concentration inequalities imply, the supremum of these random sums
is larger than the expected value of this supremum with probability
almost~1. I also discussed a result in Example~4.3 which shows that
some rather technical conditions of Theorem~4.1 cannot be omitted.
The most important condition in Theorem~4.1 was that the class of
functions ${\cal F}$ we considered in it is $L_2$-dense. This
property was introduced before the formulation of this result.
One may ask whether one can prove a better version of this result,
where we prove a similar bound with a different, possibly larger
class of functions~${\cal F}$. It is worth mentioning that
Talagrand proved results similar to Theorem~4.1 for different
classes of functions~${\cal F}$ in his book~\cite{r53}.
These classes of functions are very different from
ours, and Talagrand's results seem to be incomparable with ours.
I return to this question later.
In the above mentioned results we have imposed the condition that
the class of functions~${\cal F}$, or what is equivalent, the set of
random variables whose supremum we estimate is countable. In the
proofs this condition is really exploited. On the other hand, in
some important applications we also need results about the
supremum of a possibly non-countable set of random variables.
To handle such cases I introduced the notion of countably
approximable classes of random variables and proved that in the
results of this work the condition about countability can be
replaced by the weaker condition that the supremum of countably
approximable classes is taken. R.~M.~Dudley worked out a different
method to handle the supremum of possibly non-countably many
random variables, and generally his method is applied in the
literature. The relation between these two methods deserves
some discussion.\index{countably approximable classes of random
variables}
Let us first recall that if we take a class of random variables $S_t$,
$t\in T$, indexed by some index set $T$, and consider a set $A$,
measurable with respect to the $\sigma$-algebra generated by the
random variables $S_t$, $t\in T$, then there exists a countable
subset $T'=T'(A)\subset T$ such that the set $A$ is measurable also
with respect to the smaller $\sigma$-algebra generated by the random
variable $S_t$, $t\in T'$. Beside this, if the finite dimensional
distributions of the random variables $S_t$, $t\in T$, are given,
then by the results of classical measure theory the probability
of the events measurable with respect to the $\sigma$-algebra
generated by these random variables $S_t$, $t\in T$, is also
determined. But we cannot get the probability of all events we
are interested in such a way. In particular, if $T$ is a
non-countable set, then the events
$\left\{\omega\colon\,\sup\limits_{t\in T}S_t(\omega)>u\right\}$ are
non-measurable with respect to the above $\sigma$-algebra, and
generally we cannot speak of their probabilities. To overcome
this difficulty Dudley worked out a theory which enabled him to
work also with outer measures. His theory is based on some
rather deep results of analysis. It can be found for
instance in his book~\cite{r14}.
I restricted my attention to such cases when after the completion of
the probability measure $P$ we can also speak of the real (and not
only outer) probabilities $P\left(\sup\limits_{t\in T}S_t>u\right)$.
I tried to
find appropriate conditions under which these probabilities really
exist. More explicitly, we are interested in the case when for all
$u>0$ there exists some set $A=A_u$ measurable with respect to the
$\sigma$-algebra generated by the random variables $S_t$, $t\in T$,
such that the symmetric difference of the sets $A_u$ and
$\left\{\omega\colon\,\sup\limits_{t\in T}S_t(\omega)>u\right\}$
is contained
in a set measurable with respect to the $\sigma$-algebra generated
by the random variables $S_t$, $t\in T$, which has probability
zero. In such a case the probability
$P\left(\sup\limits_{t\in T}S_t>u\right)$
can be defined as $P(A_u)$. This approach led me to the definition
of countably approximable classes of random variables. If this
property holds, then we can speak about the probability of the
event that the supremum of the random variables we are interested
in is larger than some fixed value. I proved a simple but
useful result in Lemma~4.4 which provides a condition for the
validity of this property. In Lemma~4.5 I proved with its help
that an important class of functions is countably approximable. It
seems that this property can be proved for many other interesting
classes of functions with the help of Lemma~4.4, but I did not
investigate this question in more detail.
The problem we met here is not an abstract, technical difficulty.
Indeed, the distribution of such a supremum can become different
if we modify each random variable on a set of probability zero,
although the finite dimensional distributions of the random
variables we consider remain the same after such an operation.
Hence, if we are interested in the probability of the supremum
of a non-countable set of random variables with prescribed finite
dimensional distributions, we have to specify more explicitly
which version of this set of random variables we consider. It
is natural to look for such an appropriate version of the
random field $S_t$, $t\in T$, whose `trajectories' $S_t(\omega)$,
$t\in T$, have nice properties for all elementary events
$\omega\in\Omega$. Lemma~4.4 can be interpreted as a result in
this spirit. The condition given for the countable
approximability of a class of random variables at the end of
this lemma can be considered as a smoothness type condition about
the `trajectories' of the random field we consider. This
approach shows some analogy to some important problems in the
theory of stochastic processes when a regular version of a
stochastic process is considered and the smoothness properties
are investigated for the trajectories of this version.
In our problems the version of the set of random variables $S_t$,
$t\in T$, we shall work with appears in a simple and natural
way. In these problems we have finitely many random variables
$\xi_1,\dots,\xi_n$ at the start, and all random variables
$S_t(\omega)$, $t\in T$, we are considering can be defined
individually for each $\omega$ as a function of these random
variables $\xi_1(\omega),\dots,\xi_n(\omega)$. We take the
version of the random field $S_t(\omega)$, $t\in T$, we get in
such a way and want to show that it is countably approximable.
In Section~4 this property is proved in an important model,
probably in the most important model in possible applications
we are interested in. In more complicated situations when our
random variables are defined not as a function of finitely
many sample points, for instance in the case when we define
our set of random variables by means of integrals with respect
to a Gaussian random field it is harder to find the right
regular version of our sets of random variables. In this case the
integrals we consider are defined only with probability~1, and it
demands some extra work to find their right version. But in
the problems we study in this work such an approach is satisfactory
for our purposes, and it is simpler than that of Dudley; we do not
have to follow his rather difficult technique. On the other hand,
I must admit that I do not know the precise relation between the
approach of this work and that of Dudley.
In Section~4 the notion of $L_p$-dense classes, $1\le p<\infty$,
was also introduced. The notion of $L_2$-dense classes
appeared in the formulation of Theorems~4.1 and~$4.1'$. It can be
considered as a version of the $\varepsilon$-entropy, discussed
at many places in the literature. On the other hand, there
seems to be no standard definition of the
$\varepsilon$-entropy. The notion of $L_2$-dense
classes seemed to be the appropriate object to work with in
this work. To apply the results related to $L_2$-dense classes we
also need some knowledge about how to check this property in
concrete models. To this end I discussed here
Vapnik--\v{C}ervonenkis classes, a popular and important notion of
modern probability theory. Several books and papers (see e.g. the
books~\cite{r14}, \cite{r45},~\cite{r54} and the references in
them) deal with this
subject. An important result in this field is Sauer's lemma
(Lemma~5.1), which together with some other results, like Lemma~5.3,
implies that several interesting classes of sets or functions are
Vapnik--\v{C}ervonenkis classes.\index{Vapnik-\v{C}ervonenkis
classes of sets and functions}
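For orientation, let me recall here in a slightly informal way what
the $L_2$-dense property means; the precise definition, together with
the parameter $D$ and the exponent $L$ of the class, is given in
Section~4. A class of functions ${\cal F}$ is $L_2$-dense if there
exist constants $D>0$ and $L>0$ such that for every probability
measure $\nu$ and every number $0<\varepsilon\le1$ one can choose
$m\le D\varepsilon^{-L}$ functions $f_1,\dots,f_m\in{\cal F}$ with
the property
$$
\min_{1\le j\le m}\int|f-f_j|^2\,d\nu\le\varepsilon^2
\qquad\textrm{for all } f\in{\cal F}.
$$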
I put the proof of these results in the Appendix, partly because
they can be found in the literature, partly because in this work
Vapnik--\v{C}ervonenkis classes play a different and less important
role than at other places. Here Vapnik--\v{C}ervonenkis classes are
applied to show that certain classes of functions are $L_2$-dense.
A result of Dudley formulated in Lemma~5.2 implies that a
Vapnik--\v{C}ervonenkis class of functions with absolute value
bounded by a fixed constant is an $L_1$-dense and, as a consequence,
also an $L_2$-dense class of functions. The proof of this important
result, which seems to be less well known even among experts of this
subject than it would deserve, is contained in the main text.
Dudley's original result was formulated in the special case when
the functions we consider are indicator functions of some sets.
But its proof contains all important ideas needed in the proof of
Lemma~5.2.
Theorem 4.2, which is the Gaussian counterpart of Theorems~4.1
and~$4.1'$, is proved in Section~6 by means of a natural and
important technique, called the chaining argument.\index{chaining
argument} This means the application of an inductive procedure,
in which an appropriate sequence of finite subsets of the original
set of random variables is introduced, and a good estimate is
given on the supremum of the random variables in these subsets.
The subsets become denser
subsets of the original set of random variables at each
step of this procedure. This chaining argument is a popular
method in certain investigations. It is hard to say to whom it
should be attributed. Its introduction may be connected to some works of
R.~M.~Dudley. It is worth mentioning that Talagrand~\cite{r53}
worked out a sharpened version of it which, in the study
of certain problems, yields a sharper and more useful estimate. But it
seems to me that in the study of the problems of this work this
improvement has a limited importance; it turns out to be useful
in the study of different problems.
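Schematically, and with a notation introduced only for the sake of
this illustration, the chaining argument rests on a telescoping
decomposition. If $T_0\subset T_1\subset T_2\subset\cdots$ are finite
subsets of the parameter set $T$ which become denser and denser, and
$\pi_j(t)\in T_j$ denotes a point of $T_j$ close to~$t$, then
$$
S_t=S_{\pi_0(t)}+\sum_{j\ge1}\left(S_{\pi_j(t)}-S_{\pi_{j-1}(t)}\right),
$$
and at the $j$-th step of the inductive procedure only the finitely
many differences $S_{\pi_j(t)}-S_{\pi_{j-1}(t)}$, which are random
variables with small variance, have to be bounded.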
Theorem 4.2 can be proved by means of the chaining argument, but
this method is not strong enough to supply a proof of Theorem~4.1.
The chaining argument provides only a weak estimate in this case,
because there is no good estimate on the probability that a sum of
independent random variables is greater than a prescribed value if
these random variables have too small variances. As a consequence
the chaining argument supplies a much weaker estimate than the result
we want to prove under the conditions of Theorem~4.1. Lemma~6.1
contains the result the chaining argument yields under these
conditions. In Section~6 still another result, Lemma~6.2 is
formulated. It can be considered as a special case of Theorem~4.1
where only the supremum of partial sums with small variances is
estimated. We also show in this section that Lemmas~6.1 and~6.2
together imply Theorem~4.1. The proof is not difficult, despite
some unattractive details. It has to be checked that the parameters
in Lemmas~6.1 and~6.2 can be fitted to each other.
Lemma~6.2 is proved in Section~7. It is based on a symmetrization
argument. This proof applies the ideas of a paper of Kenneth
Alexander~\cite{r3}, and although its presentation is different from
Alexander's approach, it can be considered as a version of his proof.
It may be worth mentioning that the symmetrization arguments were
first applied in the theory of Vapnik--\v{C}ervonenkis classes
to get some useful estimates (see e.g.~\cite{r45}). But it turned
out that an appropriate refinement of this method supplies sharper
results if we are working with $L_2$-dense classes instead of
Vapnik--\v{C}ervonenkis classes of functions.
A similar problem should also be mentioned at this place.
M.~Talagrand wrote a series of papers about concentration
inequalities (see e.g. \cite{r51} or \cite{r52}), and his
research was also continued by some other authors. I would
mention the works of M.~Ledoux~\cite{r28} and
P.~Massart~\cite{r42}. Concentration inequalities give a
bound on the difference between the supremum of a set of
appropriately defined random variables and the expected value
of this supremum. They express how strongly this supremum is
concentrated around its expected value. Such results are closely
related to Theorem~4.1, and the discussion of their relation
deserves some attention. A typical concentration inequality is
the following result of Talagrand~\cite{r52}.\index{concentration
inequalities}
\medskip\noindent
{\bf Theorem 18.1 (Theorem of Talagrand).} {\it Consider $n$
independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ with values in some measurable space
$(X,{\cal X})$. Let ${\cal F}$ be some countable family of
real-valued measurable functions on $(X,{\cal X})$ such that
$\|f\|_\infty\le b<\infty$ for every $f\in{\cal F}$. Let
$Z=\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f(\xi_i)$ and
$v=E\left(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i)\right)$.
Then for every positive number~$x$,
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac 1{K'}\frac
xb\log\left(1+\frac{xb}v\right)\right\}
$$
and
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
$$
where $K$, $K'$, $c_1$ and $c_2$ are universal positive constants.
Moreover, the same inequalities hold when replacing $Z$ by $-Z$.}
\medskip
Theorem~18.1 yields, similarly to Theorem~4.1, an estimate about
the distribution of the supremum for a class of sums of independent
random variables. (The paper of P.~Massart~\cite{r42} contains a
similar estimate which is better for our purposes. The main
difference between these two estimates is that the bound given by
Massart depends on $\sigma^2=\sup\limits_{f\in{\cal F}}
\sum\limits_{i=1}^n \textrm{Var}\,f(\xi_i)$ instead of
$v=E\left(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i)\right)$.)
Theorem~18.1 can be considered as a generalization of
Bernstein's and Bennett's inequalities when the distribution of the
supremum of partial sums (and not only the distribution of one
partial sum) is estimated. A remarkable feature of this
result is that it assumes no condition about the structure of the
class of functions ${\cal F}$ (like the condition of $L_2$-dense
property of the class ${\cal F}$ imposed in Theorem~4.1). On the
other hand, the estimates in Theorem~18.1 contain the quantity
$EZ=E\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{i=1}^n f(\xi_i)\right)$. Such an
expectation of some supremum appears in all concentration
inequalities. As a consequence, they are useful only if we can
bound the expected value of the supremum we want to estimate.
This is a hard question in the general case. Paper~\cite{r17}
provides a useful estimate about the expected value of the
supremum of random sums under the conditions of Theorem~4.1.
But I preferred a direct proof of this result. Let me remark
that, because of the above-mentioned concentration inequality, the
condition $u\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$
with some appropriate constant, which cannot be dropped from
Theorem~4.1, can be interpreted as saying that under the conditions of
Theorem~4.1 the quantity $\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$
is an upper bound for the expected value of the supremum
investigated in this result. Example~4.3 implies that if the
conditions of Theorem~4.1 are violated then the expected value
of the above supremum may be larger.
It is also worth mentioning Talagrand's work~\cite{r53} which
contains several interesting results similar to Theorem~4.1.
But despite their formal similarity, they are essentially
different from the results of this work. This difference
deserves a special discussion.
Talagrand proved in~\cite{r53}, by working out a more refined,
better version of the chaining argument, a sharp upper bound for the
expected value $E\sup\limits_{t\in T}\xi_t$ of the supremum of
countably many (jointly) Gaussian random variables with zero
expectation. This result is sharp. Indeed, Talagrand also proved
a lower bound for this expected value, and the ratio of his
upper and lower bounds is bounded by a universal constant.
By applying similar arguments he also gave an upper bound for
$E\sup\limits_{f\in{\cal F}}\sum\limits_{k=1}^N f(\xi_k)$
in Proposition~2.7.2 of his book, where $\xi_1,\dots,\xi_N$ is a
sequence of independent, identically distributed random variables
with some known distribution~$\mu$, and ${\cal F}$ is a class of
functions with some nice properties. Then he proved in Chapter~3
of this book some estimates with the help of this result for
certain models which solved some problems that could not be
solved with the help of the original version of the chaining
argument.
Let us make a short comparison between our Theorem~4.1 and
Talagrand's result. Talagrand investigated in his book~\cite{r53}
the expected value of the supremum of partial sums, while we
gave an estimate on its tail distribution. But this is not an
essential difference. Talagrand's results also give an estimate
on the tail distribution of the supremum by means of
concentration inequalities, and actually his proofs also provide
a direct estimate for the tail distribution we are interested in
without the application of these results. The main difference
between the two works is that Talagrand's method gives a sharp
estimate for different classes of functions~${\cal F}$.
Talagrand could prove sharp results in such cases when the class
of functions ${\cal F}$ for which the supremum is taken consists of
smooth functions. An example of such a class of functions which he
thoroughly investigated is the class of Lipschitz~1 functions. In
particular, in Chapter~3 of his book~\cite{r53} he proved that if
$\xi_1,\dots,\xi_n$ is a sequence of independent random variables,
uniformly distributed in the unit square $D=[0,1]\times[0,1]$, and
${\cal F}$ is the class of Lipschitz~1 functions on the unit
square~$D$ such that $\int_D f\,d\lambda=0$ for all $f\in{\cal F}$,
where $\lambda$ denotes the Lebesgue measure on~$D$, then
$E\sup\limits_{f\in{\cal F}}\sum\limits_{l=1}^n f(\xi_l)
\le L\sqrt{n\log n}$ with a universal constant~$L$. He was
interested in this result because it is equivalent to a theorem
of Ajtai--Koml\'os--Tusn\'ady~\cite{r2}.
(See Chapter~3 of~\cite{r53} for details.) On the other hand, we
can give sharp results in such cases when ${\cal F}$ consists of
non-smooth functions (see Example~5.5), and Talagrand's method
does not work in the study of such problems.
This difference in the conditions of the results in these two
books is not a small technical detail. Talagrand heavily
exploited in his proof that he worked with such classes of
functions~${\cal F}$ from which he could select a subclass of
functions of ${\cal F}$ of relatively small cardinality which is
dense in ${\cal F}$ not only in the $L_2(\mu)$-norm with the
probability measure~$\mu$ he was working with, but also in the
supremum norm. He needed this property, because this enabled
him to get sharp estimates on the tail distribution of the
differences of functions he had to work with by means of
Bernstein's inequality. He needed such estimates to apply (a
refined version of) the chaining argument. On the other hand,
we considered such classes of functions ${\cal F}$ which may
have no small subclasses which are dense in ${\cal F}$ in the
supremum norm. I would characterize the difference between the
results of the two works in the following way. Talagrand
proved the sharpest possible estimates which can be obtained
by a refinement of the chaining argument, while our main
problem was to get sharp estimates also in such cases when the
chaining argument does not work. Let me remark that we could
prove such results (see Theorem~4.1) for such classes of functions
${\cal F}$ which are $L_2$-dense. In the Gaussian counterpart of
this result, in Theorem~4.2, it was enough to impose that
${\cal F}$ is an $L_2$-dense class with respect to a fixed
probability measure~$\mu$. This extra condition enabled us
to prove sharp results about the tail distribution of the supremum
of partial sums when the chaining argument does not work.
\medskip
The main results of this work are presented in Section~8. A weaker
version of Theorem~8.3 about an estimate of the distribution of
a degenerate $U$-statistic was first proved in a paper of Arcones
and Gin\'e~\cite{r4}. The result of Theorem~8.3 in the present
form is proved in my paper~\cite{r37}. Its version about multiple
integrals with respect to a normalized empirical measure
formulated in Theorem~8.1 is proved in~\cite{r33}. This paper
contains a direct proof. On the other hand, Theorem 8.1 can be
derived from Theorem~8.3 by means of Theorem~9.4 of this work.
Theorem 8.5 is the natural Gaussian counterpart of Theorem~8.3.
The limit theorem about degenerate $U$-statistics, Theorem~10.4
(and its version about limit theorems for multiple integrals with
respect to normalized empirical measures, presented in
Theorem~$10.4'$ of Appendix~C) was discussed in this work to
explain better the relation between degenerate $U$-statistics
(or multiple integrals with respect to normalized empirical
measures) and multiple Wiener--It\^o integrals. A proof of this
result based on similar ideas as that discussed here can be found
in~\cite{r15}. Theorem~6.6 of my lecture note~\cite{r30}
contains a weakened version of Theorem~8.5 which does not
take into account the variance of the random integral.
Example~8.7 is a natural supplement of Theorem~8.5. It shows
that the estimate of Theorem~8.5 is sharp if only the variance
of a Wiener--It\^o integral is known. At the end of Section~13
I also mentioned without proof the results of papers~\cite{r1}
and~\cite{r27}, which also have some relation to this problem. I
discussed mainly the content of~\cite{r27} and explained its
relation to some results discussed in this work. The proofs in
these papers apply a method different from those of this work. It
would be interesting to prove them with the methods discussed
here. These papers contain refinements of Theorems~8.5
and~8.3, respectively, whose estimates depend on some other rather
complicated quantities. In some cases they supply a better
estimate. On the other hand, in the problems discussed here
they have a limited importance because their conditions are
hard to check.
Theorems~8.2 and~8.4, which are the natural multivariate
counterparts of Theorems~4.1 and~$4.1'$, yield an estimate about
the supremum of (degenerate) $U$-statistics or of multiple random
integrals with respect to a normalized empirical measure when
the class of kernel functions in these $U$-statistics or random
integrals satisfies some conditions. They were proved in my
paper~\cite{r35}. Earlier Arcones and Gin\'e proved a weaker
form of this result in paper~\cite{r5}, but their work did not
help in the proof of the results of this note. Their results were
based on an adaptation of Alexander's method~\cite{r3} to the
multivariate case. Theorem~8.6 contains the natural Gaussian
counterpart of Theorems~8.2 and~8.4.
Example~8.8 in Section~8 shows that the condition
$u\le\textrm{const.}\, n\sigma^3$ imposed in Theorem~8.3 in
the case $k=2$ cannot be dropped. The paper of Arcones and
Gin\'e~\cite{r4} contains another example explained by Talagrand
to the authors of that paper which also has a similar consequence.
But that example does not provide such an explicit comparison
of the upper and lower bound on the probability investigated
in Theorem~8.3 as Example~8.8. Similar examples could be
constructed for all $k\ge1$.
Example 8.8 shows that at high levels only a very weak (and from
a practical point of view not really important) improvement of the
estimate on the tail distribution of degenerate $U$-statistics
is possible. But probably there exists a multivariate version of
Bennett's inequality, i.e.\ of Theorem~3.2, which provides such
an estimate. Moreover, there is some hope to get a similar
strengthened form of Theorems~8.2 and~8.4 (or of Theorem~4.2 in
the one-dimensional case). This question is not investigated in
the present work.
Section 9 deals with the properties of $U$-statistics. Its
first result, Theorem~9.1, is a rather classical result. It
is the so-called Hoeffding decomposition of $U$-statistics into
a sum of degenerate $U$-statistics. Its proof first appeared
in the paper~\cite{r23}, but it can be found at many places.
The explanation of this work contains some ideas similar
to~\cite{r50}. I tried to explain that Hoeffding's decomposition
is the natural multivariate version of the (trivial)
decomposition of sums of independent random variables into sums of
independent random variables {\it with expectation zero}\/ plus
the sum of the expectations of the original random variables.
Moreover, even the proof of Hoeffding's decomposition shows
some similarity to this simple decomposition.
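To illustrate this analogy in the simplest non-trivial case $k=2$
(with an ad hoc notation, different from that of Section~9): a
bounded kernel function $f(x,y)$ can be written in the form
$$
f(x,y)=f_0+f_1(x)+f_2(y)+f_{1,2}(x,y)
$$
with $f_0=\int f\,d\mu\,d\mu$, $f_1(x)=\int f(x,y)\,\mu(\,dy)-f_0$,
$f_2(y)=\int f(x,y)\,\mu(\,dx)-f_0$ and $f_{1,2}=f-f_0-f_1-f_2$. The
functions $f_1$, $f_2$ and $f_{1,2}$ integrate to zero with respect
to~$\mu$ in each of their variables, i.e.\ the $U$-statistics defined
by them are degenerate, just as the terms $\eta_l-E\eta_l$ in the
above simple decomposition have expectation zero.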
Theorem~9.2 and Proposition~9.3 can be considered as a continuation
of the investigation of Hoeffding's decomposition in Theorem~9.1.
They tell how the properties of the kernel function of the original
$U$-statistic are inherited by the kernel
functions of the degenerate $U$-statistics taking part in its
Hoeffding decomposition. In several applications of Hoeffding's
decomposition we need such results.
The last result of Section~9, Theorem~9.4, enables us to reduce the
estimation of multiple random integrals with respect to normalized
empirical measures to the estimation of degenerate $U$-sta\-tis\-tics.
This result is a version of Hoeffding's decomposition, where
multiple integrals with respect to a normalized empirical
distribution are decomposed into the sum of degenerate
$U$-statistics. The main difference between them is that the
coefficients of the degenerate $U$-statistics in the decomposition
of Theorem~9.4 are relatively small. The cancellation effect
caused by integration with respect to a {\it normalized}\/
empirical measure is reflected in the appearance of small
coefficients in the decomposition given in Theorem~9.4.
Theorem~9.4 was proved in~\cite{r35}. The proof given in this
note is essentially different from that of~\cite{r35}.
Theorem~8.1 can be derived from Theorem~8.3 and Theorem~8.2 from
Theorem~8.4 by means of Theorem~9.4. The proof of the latter
results is simpler. Sections 10--12 contain the
results needed in the proof of Theorem~8.3 and its Gaussian
counterparts, Theorems~8.5 and~8.7. The proof of these results is
based on good estimates of high moments of degenerate
$U$-statistics and multiple Wiener--It\^o integrals. The
classical proof of the one-variate counterparts of these results is
based on a good estimate of the moment generating function. This
method was replaced by the estimate of high moments, because the
moment generating function of a $k$-fold Wiener--It\^o integral is
divergent for $k\ge3$, and this property is also reflected in the
behaviour of degenerate $U$-statistics. On the other hand, good
estimates on high moments can replace the estimate of the moment
generating function. A good estimate can be given for all moments
of a Wiener--It\^o integral, while we have a good estimate only on
not too high moments of degenerate $U$-statistics. This has the
consequence that we can give a good estimate on the tail
distribution of degenerate $U$-statistics only for not too large
values. We met a similar situation in Section~3 in the study of
Bernstein's and Bennett's inequalities.
I know of two deep methods to study high moments of multiple
Wiener--It\^o integrals. Both of them can be adapted to the study
of the moments of degenerate $U$-statistics. They deserve a more
detailed discussion.
The first one is called Nelson's inequality, named after Edward
Nelson, who published it in his paper~\cite{r44}. This inequality
simply implies Theorem~8.5 about multiple Wiener--It\^o integrals,
although with worse constants. Later Leonard Gross discovered a
deep and useful generalization of this result which he
published in the work {\it Logarithmic Sobolev
inequalities}\/~\cite{r20}.
In that paper Gross compared two Markov processes with the same
infinitesimal operator but with possibly different initial
distributions, where the second Markov process had stationary
initial distribution. He could give a sharp bound on the Radon--Nikodym
derivative of the distribution of the first Markov process with
respect to the (stationary) distribution of the second Markov
process at any time~$T$ on the basis of the properties of the
infinitesimal operator of the Markov processes. With the help of
this result he could prove a more general form of Nelson's
inequality. In particular, his result may help to prove (a weaker
version of) Theorem~8.3 (with worse universal constants). Let me
also remark that Gross' method works not only in the study of
these problems, but in several hard problems of the probability
theory. (See e.g.~\cite{r21} or~\cite{r28}.) Nevertheless, in the
present note I applied a different method, because this seemed
to be more appropriate here.
I applied a method related to the names of Kiyoshi It\^o and Roland
L'vovich Dobrushin. This is the theory of multiple Wiener--It\^o
integrals with respect to a white noise. This integral was
introduced in the paper~\cite{r25}. It is useful, because every
random variable with finite second moment that is measurable with
respect to the $\sigma$-algebra generated by the Gaussian random
variables of the underlying white noise can be written as the sum
of Wiener--It\^o integrals of different order. Moreover, if only Wiener--It\^o
integrals of symmetric kernel functions are taken, then this
representation is unique. An important result, the so-called
diagram formula, formulated in Theorem~10.2, expresses products
of Wiener--It\^o integrals as a sum of such integrals. This result
which shows some similarity to the Feynman diagrams applied in
statistical physics was proved in~\cite{r10}. Actually this paper
discussed a modified version of Wiener--It\^o integrals which is more
appropriate to study the action of shift operators for non-linear
functionals of a stationary Gaussian field. But these modified
Wiener--It\^o integrals can be investigated in almost the same way
as the original ones. The diagram formula has a simple consequence
formulated in the Corollary of Theorem~10.2 of this note. It enables us
to calculate the expectation of products of Wiener--It\^o integrals,
in particular it yields an explicit formula about the moments of
a Wiener--It\^o integral. This result was applied in the proof of
Theorem~8.5, i.e.\ in the estimation of the tail distribution of
Wiener--It\^o integrals. It\^o's formula for multiple Wiener--It\^o
integrals (Theorem~10.3) was proved in~\cite{r25}.
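The simplest instance of such a moment calculation, which is in
accordance with the classical It\^o isometry, may be worth recalling.
If $f$ is a symmetric kernel function of $k$ variables and $I_k(f)$
denotes its $k$-fold Wiener--It\^o integral (with the normalization
for which the identity below holds in this form), then
$$
EI_k(f)=0\qquad\textrm{and}\qquad EI_k(f)^2=k!\,\|f\|_2^2,
$$
since in the expectation of the product of two such integrals only
the complete diagrams, which pair each variable of the first kernel
function with a variable of the second one, give a non-zero
contribution, and there are $k!$ such diagrams.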
The diagram formula has a natural and useful analogue both for
degenerate $U$-statistics and multiple integrals with respect to a
normalized empirical measure. They enable us to express the product
of degenerate $U$-statistics and multiple integrals as the sum of
such expressions. These results enable us to adapt several useful
methods in the study of non-linear functionals of a Gaussian
random field to the study of non-linear functionals of normalized
empirical measures. A version of the diagram formula was proved
for degenerate $U$-statistics in~\cite{r37} and for multiple random
integrals with respect to normalized empirical measures in~\cite{r33}.
Let me remark that in the formulation of the result in the
work~\cite{r37} a different notation was applied than in the present
note. In that paper I wanted to formulate such a version of the
diagram formula for $U$-statistics where we have to work with
diagrams similar to those introduced in the study of Wiener--It\^o
integrals. I could do this only in a somewhat artificial way. In
this work I formulated the diagram formula for $U$-statistics with
the help of diagrams having a more general form. I introduced the
notion of chains and coloured chains, and defined diagrams
with their help. The formulation of the results with the help of
such more general diagrams seems to me more natural. Let me also
remark that the study of results similar to the diagram formula
for Wiener--It\^o integrals did not get as much attention in the
literature as, in my opinion, it would deserve. I know only of one
work where such questions were investigated. It is the paper of
Surgailis~\cite{r47}, where a version of the diagram formula is
proved for Poissonian integrals. The Corollary of Theorem~11.2 is
of special interest for us, because it enables us to prove such
moment estimates which are useful in the proof of Theorem~8.3.
It is worth mentioning that the problems about Wiener--It\^o
integrals are closely related to the study of Hermite polynomials
or of their multivariate version, the so-called Wick polynomials.
(See e.g.~\cite{r30} or~\cite{r41} for the definition of Wick
polynomials.) Appendix~C contains the most important properties
of Hermite polynomials needed in the study of Wiener--It\^o
integrals. In particular, it contains the proof of Proposition~C2
which states that the set of all Hermite polynomials is a complete
orthogonal system in the Hilbert space of the functions square
integrable with respect to the standard Gaussian distribution.
This result can be found for instance in Theorem~5.2.7
of~\cite{r49}. In the present proof I wanted to show that this
result is closely related to the so-called moment problem, i.e.\
to the question when a distribution is determined by its moments
uniquely. This method, with some refinement, can be applied to
prove some generalizations of Proposition~C2 about the
completeness of orthogonal polynomials with respect to more
general weight functions.
It\^o's formula creates a relation between Wiener--It\^o integrals
and Hermite polynomials. The results about multiple Wiener--It\^o
integrals have their analogues for Wick polynomials. Thus for
instance there is a diagram formula for the product of Wick
polynomials which also has some interesting generalizations.
Such questions are studied both in probability theory and
statistical physics, see~\cite{r41} and~\cite{r46}. The relation
between Wiener--It\^o integrals and Hermite polynomials also has
a natural counterpart in the study of other multiple random
integrals. The so-called Appell polynomials (see~\cite{r48})
appeared in such a way.
Theorems~8.3,~8.5 and~8.7 were proved on the basis of the results
in Sections 10--12 and in Section~13. Section~13 also contains
the proof of a multivariate version of Hoeffding's inequality,
formulated in Theorem~13.3. This result is needed in the
symmetrization argument applied in the proof of Theorem~8.4. A
weaker version of it (an estimate with a worse constant in the
exponent), which would be satisfactory for our purposes, would
simply follow from a classical result called Borell's inequality.
But since this result is not discussed in this note, and since I was
interested in a proof which yields the best constant in the
exponent of this estimate, I have chosen another proof, given
in~\cite{r36}, which is based on the results of Sections~10--12.
Later I learned that this estimate is also contained in an
implicit form in the paper~\cite{r7} of A.~Bonami.
Sections 14--17 are devoted to the proof of Theorems~8.4 and~8.6.
They are based on arguments similar to those for their one-variate
counterparts, Theorems~4.1 and~4.2. The proof of Theorem~8.6
about the supremum of Wiener--It\^o integrals is based, similarly
to the proof of Theorem~4.2, on the chaining argument. In the
proof of Theorem~8.4 the chaining argument yields only a weaker
result formulated in Proposition~14.1 which helps to reduce
Theorem~8.4 to the proof of Proposition~14.2. In the one-variate
case a similar approach was applied. In that case the proof of
Theorem~4.1 was reduced to that of Lemma~6.2 by means of
Lemma~6.1. The next step in the proof of Theorem~8.4 has
no one-variate counterpart. The notion of so-called decoupled
$U$-statistics was introduced, and Proposition~14.2 was reduced
to a similar result about decoupled $U$-statistics formulated
in Proposition~$14.2'$.
The adjective `decoupled' in the expression decoupled $U$-statistic
refers to the fact that it is such a version of a $U$-statistic
where independent copies of a sequence of independent and
identically distributed random variables are put into different
coordinates of the kernel function. Their study is a popular
research subject. In particular, the main subject of
the book~\cite{r8} is a comparison of the properties of $U$-statistics
and decoupled $U$-statistics. A result of de la Pe\~na and
Montgomery-Smith~\cite{r9}, formulated in Theorem~14.3, helps in
reducing some problems about $U$-statistics to a similar problem
about decoupled $U$-statistics. In this lecture note the proof of
Theorem~14.3 is given in Appendix~D. It follows the argument of
the original proof, but several steps are worked out in detail
where the authors gave only a very short explanation. Paper~\cite{r9}
also contains some kind of converse results to~Theorem~14.3, but
as they are not needed in the present work, I omitted their
discussion.
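In a formula, again with an ad hoc notation: if the $U$-statistic
with kernel function $f$ is
$I_{n,k}(f)=\sum f(\xi_{l_1},\dots,\xi_{l_k})$, where the summation
goes over all $k$-tuples of different indices $1\le l_j\le n$, then
the corresponding decoupled $U$-statistic is
$$
\bar I_{n,k}(f)=\sum_{\substack{1\le l_j\le n,\;1\le j\le k\\
l_j\neq l_{j'}\;\textrm{if}\;j\neq j'}}
f\left(\xi^{(1)}_{l_1},\dots,\xi^{(k)}_{l_k}\right),
$$
where the sequences $\xi^{(j)}_1,\dots,\xi^{(j)}_n$, $1\le j\le k$,
are independent copies of the original sequence
$\xi_1,\dots,\xi_n$, and the $j$-th copy is substituted into the
$j$-th coordinate of the kernel function.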
Decoupled $U$-statistics behave similarly to the original
$U$-statistics. Beside this, some symmetrization arguments
become considerably simpler if we are working with decoupled
$U$-statistics instead of the original ones. This can
be exploited in some investigations. For example the proof of
Proposition~$14.2'$ is simpler than a direct proof of
Proposition~14.2. On the other hand, Theorem~14.3 enables us
to reduce the proof of Proposition~14.2 to that of
Proposition~$14.2'$, and we have exploited this possibility.
The proof of Theorem~8.4 was reduced to that of Proposition~$14.2'$
in Section~14. Sections 15--17 deal with the proof of this result.
It was proved in my paper~\cite{r35}. The proof is similar to that
of its one-variate version, Proposition~6.2, but some additional
difficulties have to be overcome. The main difficulty appears when
we want to find the multivariate analogue of the symmetrization
argument which could be carried out by means of Lemmas~7.1
and~7.2 in the one-variate case.
In the multivariate case Lemma~7.1 is not sufficient for us. We
work instead with a generalized version of this result,
formulated in Lemma~15.2. The proof of Lemma~15.2 is not hard.
The real difficulty arises when we want to apply it in the proof
of Proposition~$14.2'$. We have to check its condition given in
formula~(\ref{(15.3)}), and this means in this case a non-trivial
estimation of some complicated conditional probabilities. This is
the hardest part in the proof of Proposition~$14.2'$.
Proposition~$14.2'$ was proved by means of an inductive procedure
formulated in Proposition~15.3, which is the multivariate analogue
of Proposition~7.3. A basic ingredient of both proofs was a
symmetrization argument. But while this symmetrization argument
could be carried out simply in the one-variate case, its
adaptation to the multivariate case in the proof of
Proposition~15.3 posed a much more serious problem. To overcome
this difficulty another
result was formulated in Proposition~15.4. Propositions~15.3
and~15.4 were proved simultaneously by means of an appropriate
inductive procedure. Their proofs were based on a refinement of
the arguments in the proof of Proposition~7.3. We also had to
apply Theorem~13.3, a multivariate version of Hoeff\-ding's
inequality, and some properties of the Hoeff\-ding decomposition
of $U$-statistics proved in Section~9.
\appendix
\chapter{The proof of some results about Vapnik--\v{C}ervonenkis classes}
\label{introA}
\medskip\noindent
{\it Proof of Theorem 5.1 (Sauer's lemma).\/}\index{Sauer's lemma}
This result has several different proofs. Here I write down a
relatively simple proof due to P.~Frankl and J.~Pach which appeared
in~\cite{r16}. It is based on some linear algebraic arguments.
The following equivalent reformulation of Sauer's lemma will be
proved. Let us take a set $S=S(n)$ consisting of $n$ elements and
a class ${\cal E}=\{E_1,\dots,E_m\}$ consisting of $m$ subsets
of $S$. Assume that $m\ge m_0+1$ with
$m_0=m_0(n,k)={n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$.
Then there exists a set $F\subset S$ of cardinality $k$ which
is shattered by the class of sets ${\cal E}$. Actually, it is
enough to show that there exists a set $F$ of cardinality
greater than or equal to~$k$ which is shattered by the class
of sets ${\cal E}$, because if a set has this property, then
all of its subsets have it. This latter statement will be proved.
To prove this statement let us first list the subsets
$X_1,\dots,X_{m_0}$ of the set $S$ of cardinality less than or equal
to $k-1$, and assign to each set $E_i\in{\cal E}$ the vector
$e_i=(e_{i,1},\dots,e_{i,m_0})$, $1\le i\le m$, with elements
$$
e_{i,j}=\left\{
\begin{array}{l}
1\quad\textrm{if }X_j\subseteq E_i \\
0\quad\textrm{if }X_j\not\subseteq E_i
\end{array}
\right. \qquad 1\le i\le m, \textrm{ and } 1\le j\le m_0.
$$
Since $m>m_0$, the vectors $e_1,\dots,e_m$ are linearly dependent.
Because of the definition of the vectors $e_i$, $1\le i\le m$,
this can be expressed in the following way: There is a non-zero
vector $(f(E_1),\dots,f(E_m))$ such that
\begin{equation}
\sum_{E_i\colon\, E_i\supseteq X_j} f(E_i)=0 \quad \textrm{for all }
1\le j\le m_0. \label{(A1)}
\end{equation}
Let $F$, $F\subset S$, be a {\it minimal}\/ set with the property
\begin{equation}
\sum_{E_i\colon\, E_i\supseteq F} f(E_i)=\alpha\neq0. \label{(A2)}
\end{equation}
Such a set $F$ really exists, since every maximal element of the
family $\{E_i\colon\, 1\le i\le m,\, f(E_i)\neq0\}$ satisfies
relation (\ref{(A2)}). The requirement that $F$ should be a
minimal set means
that if $F$ is replaced by some $H\subset F$, $H\neq F$, on the
left-hand side of~(\ref{(A2)}), then this expression equals zero. The
inequality $|F|\ge k$ holds because of relation (\ref{(A1)}) and the
definition of the sets $X_j$: if we had $|F|\le k-1$, then $F$ would
be one of the sets $X_j$, and relation~(\ref{(A1)}) would imply
$\alpha=0$.
Introduce the quantities
$$
Z_F(H)=\sum_{E_i\colon\, E_i\cap F=H} f(E_i)
$$
for all $H\subseteq F$.
Then $Z_F(F)=\alpha$, and for any set of the form $H=F\setminus\{x\}$,
$x\in F$,
$$
Z_F(H)=\sum_{E_i\colon\, E_i\cap F=H} f(E_i)
=\sum_{E_i\colon\, E_i\supseteq H}f(E_i)
-\sum_{E_i\colon\, E_i\supseteq F}f(E_i)=0-\alpha=-\alpha
$$
because of the minimality property of the set $F$.
Moreover, the identity
\begin{equation}
Z_F(H)=(-1)^p\alpha \quad\textrm{for all } H\subseteq F
\textrm{ such that } |H|=|F|-p, \; 0\le p\le |F| \label{(A3)}
\end{equation}
holds. To show relation (\ref{(A3)}) observe that
\begin{equation}
Z_F(H)=\sum_{E_i\colon\, E_i\cap F=H} f(E_i)=\sum_{j=0}^p
(-1)^j\sum_{G\colon\,H\subset G\subset F,\;|G|=|H|+j}
\sum_{E_i\colon\, E_i\supseteq G}f(E_i) \label{(A4)}
\end{equation}
for all sets $H\subset F$ with cardinality $|H|=|F|-p$.
Identity~(\ref{(A4)}) holds, since the term $f(E_i)$ is
counted on the right-hand side of~(\ref{(A4)})
$\sum\limits_{j=0}^l (-1)^j{l\choose j}=(1-1)^l=0$ times if
$E_i\cap F=G$ with some $H\subset G\subseteq F$ with $|G|=|H|+l$
elements, $1\le l\le p$, while in the case $E_i\cap F=H$ it is
counted once. Relation~(\ref{(A4)}) together with~(\ref{(A2)})
and the minimality
property of the set~$F$ imply relation~(\ref{(A3)}).
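For instance, in the case $p=2$ relation~(\ref{(A4)}) takes the form
$$
Z_F(H)=\sum_{E_i\colon\, E_i\supseteq H}f(E_i)
-\sum_{G\colon\, H\subset G\subset F,\;|G|=|H|+1}\;
\sum_{E_i\colon\, E_i\supseteq G}f(E_i)
+\sum_{E_i\colon\, E_i\supseteq F}f(E_i),
$$
and the minimality of the set~$F$ makes all terms on the right-hand
side vanish except the last one, which equals
$\alpha=(-1)^2\alpha$ by~(\ref{(A2)}), in accordance
with~(\ref{(A3)}).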
It follows from relation~(\ref{(A3)}) and the definition of
the function $Z_F(H)$ that $Z_F(H)\neq0$, hence for all sets
$H\subseteq F$ there
exists some set $E_i$ such that $H=E_i\cap F$, i.e.\ $F$ is
shattered by ${\cal E}$. Since $|F|\ge k$, this implies
Theorem~5.1.
\medskip\noindent
{\it Proof of Theorem 5.3.}\/ Let us fix an arbitrary subset
$F=\{x_1,\dots,x_{k+1}\}$ of the set $X$, and consider the set of
vectors
${\cal G}_k(F)=\{(g(x_1),\dots,g(x_{k+1}))\colon\, g\in{\cal G}_k\}$
of the $(k+1)$-dimensional space $R^{k+1}$. By the conditions of
Theorem~5.3 ${\cal G}_k(F)$ is an at most $k$-dimensional subspace of
$R^{k+1}$. Hence there exists a non-zero vector
$a=(a_1,\dots,a_{k+1})$ such that
$\sum\limits_{j=1}^{k+1} a_jg(x_j)=0$ for all $g\in{\cal G}_k$. We
may assume that the set $A=A(a)=\{j\colon\, a_j<0,\, 1\le j\le k+1\}$
is non-empty, multiplying the vector $a$ by $-1$ if necessary.
Thus the identity
\begin{equation}
\sum_{j\in A} a_jg(x_j)=\sum_{j\in \{1,\dots,k+1\}\setminus A}
(-a_j)g(x_j),\qquad \textrm{for all }g\in{\cal G}_k \label{(A5)}
\end{equation}
holds. Put $B=\{x_j\colon\, j\in A\}$. Then $B\subset F$, and
$F\setminus B\neq\{x\colon\, g(x)\ge0\}\cap F$ for all
$g\in{\cal G}_k$. Indeed, if there were some $g\in {\cal G}_k$
such that $F\setminus B=\{x\colon\, g(x)\ge0\}\cap F$, then
the left-hand side of equation (\ref{(A5)}) would be strictly
positive (as $a_j<0$ and $g(x_j)<0$ if $j\in A$, and
$A\neq\emptyset$), while its right-hand side would be non-positive
for this $g\in{\cal G}_k$, and this is a contradiction.
The property proved above means that ${\cal D}$ shatters no set
$F\subset X$ of cardinality~$k+1$. Hence Theorem~5.1
implies that ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.
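A classical example may illustrate Theorem~5.3 (it is not needed in
the sequel): let $X=R^d$, and let ${\cal G}_k$ be the
$(d+1)$-dimensional linear space of the affine functions
$g(x)=\langle a,x\rangle+b$, $a\in R^d$, $b\in R^1$. Then
Theorem~5.3 states that the class of half-spaces
$\{x\colon\, \langle a,x\rangle+b\ge0\}$ shatters no subset of
$R^d$ of cardinality $d+2$, i.e.\ it is a
Vapnik--\v{C}ervonenkis class.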
\chapter{The proof of the diagram formula for
Wiener--It\^o integrals}
\label{introB}
We start the proof of Theorem~10.2A (the diagram formula for
the product of two Wiener--It\^o integrals) with the proof of
inequality (\ref{(10.11)}).\index{diagram formula for Wiener--It\^o
integrals} To show that this relation holds
let us observe that the Cauchy inequality yields
the following bound on the function $F_\gamma(f,g)$ defined
in~(\ref{(10.10)}) (with the notation introduced there):
\begin{eqnarray}
&&F^2_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\,\;(1,j)\in
V_1(\gamma),\, (2,j')\in V_2(\gamma)) \nonumber \\
&&\qquad\le
\int f^2(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)} \mu(\,dx_{(2,j)}) \nonumber \\
&&\qquad\qquad
\int g^2(x_{(2,1)},\dots,x_{(2,l)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu(\,dx_{(2,j)}).
\label{(B1)}
\end{eqnarray}
The expression on the right-hand side of inequality~(\ref{(B1)})
is the product of two functions with different arguments. The first
function has arguments $x_{(1,j)}$ with $(1,j)\in V_1(\gamma)$, and
the second one has arguments $x_{(2,j')}$ with $(2,j')\in V_2(\gamma)$.
By integrating both sides of inequality~(\ref{(B1)}) with respect
to these arguments we get inequality~(\ref{(10.11)}).
Relation (\ref{(10.12)}) will be proved first for the product
of the Wiener--It\^o integrals of two elementary functions.
Let us consider two (elementary) functions $f(x_1,\dots,x_k)$
and $g(x_1,\dots,x_l)$ given in the following form: Let some
disjoint sets $A_1,\dots,A_M$, $\mu(A_s)<\infty$, $1\le s\le M$,
be given together with some real numbers $c(s_1,\dots,s_k)$
indexed with such $k$-tuples $(s_1,\dots,s_k)$, $1\le s_j\le M$,
$1\le j\le k$, for which the numbers $s_1,\dots,s_k$ in a
$k$-tuple are all different. Put
$f(x_1,\dots,x_k)=c(s_1,\dots,s_k)$ on the rectangles
$A_{s_1}\times\cdots\times A_{s_k}$ with edges $A_s$,
indexed with the above $k$-tuples, and let
$f(x_1,\dots,x_k)=0$ outside of these rectangles. Take
similarly some disjoint sets $B_1,\dots,B_{M'}$,
$\mu(B_t)<\infty$, $1\le t\le M'$, and some real numbers
$d(t_1,\dots,t_l)$, indexed with such $l$-tuples
$(t_1,\dots,t_l)$, $1\le t_{j'}\le M'$, $1\le j'\le l$, for
which the numbers $t_1,\dots,t_l$ in an $l$-tuple are
different. Put $g(x_1,\dots,x_l)=d(t_1,\dots,t_l)$ on the
rectangles $B_{t_1}\times\cdots\times B_{t_l}$ with edges
indexed with the above introduced $l$-tuples, and let
$g(x_1,\dots,x_l)=0$ outside of these rectangles.
Let us take some small number $\varepsilon>0$ and rewrite
the above introduced functions $f(x_1,\dots,x_k)$ and
$g(x_1,\dots,x_l)$ with the help of this number
$\varepsilon>0$ in the following way. Divide the sets
$A_1,\dots,A_M$ into smaller sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{s=1}^{M(\varepsilon)} A_s^\varepsilon=
\bigcup\limits_{s=1}^{M} A_s$, in such a way that all sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$ are
disjoint, and $\mu(A^\varepsilon_s)\le\varepsilon$,
$1\le s\le M(\varepsilon)$. Similarly, take sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{t=1}^{M'(\varepsilon)} B_t^\varepsilon
=\bigcup\limits_{t=1}^{M'} B_t$, in such a way that all
sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$
are disjoint, and $\mu(B^\varepsilon_t)\le\varepsilon$,
$1\le t\le M'(\varepsilon)$. Beside this, let us also
demand that any two sets $A_s^\varepsilon$ and
$B_t^\varepsilon$, $1\le s\le M(\varepsilon)$,
$1\le t\le M'(\varepsilon)$, are either disjoint or
agree. Such a partition exists because of the
non-atomic property of the measure $\mu$. The above defined
functions $f(x_1,\dots,x_k)$ and $g(x_1,\dots,x_l)$ can be
rewritten by means of these new sets $A^\varepsilon_s$ and
$B^\varepsilon_t$. Namely, let
$f(x_1,\dots,x_k)=c^\varepsilon(s_1,\dots,s_k)$ on the
rectangles
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}$
with $1\le s_j\le M(\varepsilon)$, $1\le j\le k$, with
different indices $s_1,\dots,s_k$, where
$c^\varepsilon(s_1,\dots,s_k)=c(p_1,\dots,p_k)$ with
those indices $(p_1,\dots,p_k)$ for which
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}\subset
A_{p_1}\times\cdots\times A_{p_k}$.
The function $f$ disappears outside of these rectangles.
The function $g(x_1,\dots,x_l)$ can be written similarly
in the form $g(x_1,\dots,x_l)=d^\varepsilon(t_1,\dots,t_l)$
on the rectangles
$B^\varepsilon_{t_1}\times\cdots\times B^\varepsilon_{t_l}$
with $1\le t_{j'}\le M'(\varepsilon)$, $1\le j'\le l$, and
different indices, $t_1,\dots,t_l$. Beside this, the
function~$g$ disappears outside of these rectangles.
The above representation of the functions $f$ and $g$
through a parameter $\varepsilon$ is useful, since it
enables us to give a good asymptotic formula for the
product $k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)$ which yields the
diagram formula for the product of Wiener--It\^o integrals
of elementary functions with the help of a limiting
procedure $\varepsilon\to0$.
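It may be worth recalling what the diagram formula yields in the
simplest case $k=l=1$. The class $\Gamma(1,1)$ contains two
diagrams, the diagram $\gamma_0$ with no edge and the diagram
$\gamma_1$ whose only edge connects $(1,1)$ and $(2,1)$, and
relation~(\ref{(10.12)}) reduces to the classical identity
$$
Z_{\mu,1}(f)Z_{\mu,1}(g)=2Z_{\mu,2}(F_{\gamma_0}(f,g))
+\int f(x)g(x)\mu(\,dx)
$$
with $F_{\gamma_0}(f,g)(x_1,x_2)=f(x_1)g(x_2)$.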
Fix a small number $\varepsilon>0$, take the
representation of the functions $f$ and $g$ with
its help, and write
\begin{equation}
k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)
=\sum_{\gamma\in \Gamma(k,l)} Z_\gamma(f,g,\varepsilon)
\label{(B2)}
\end{equation}
with
\begin{eqnarray}
&&Z_\gamma(f,g,\varepsilon)={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l) \nonumber \\
&&\qquad\qquad\quad
\mu_W(A^\varepsilon_{s_1})\dots\mu_W(A^\varepsilon_{s_k})
\mu_W(B^\varepsilon_{t_1})\dots\mu_W(B^\varepsilon_{t_l}),
\label{(B3)}
\end{eqnarray}
where $\Gamma(k,l)$ denotes the class of diagrams introduced before
the formulation of Theorem~10.2A, and $\sum^\gamma$ denotes
summation for such $(k+l)$-tuples $(s_1,\dots,s_k,t_1,\dots,t_l)$,
$1\le s_j\le M(\varepsilon)$, $1\le j\le k$, and
$1\le t_{j'}\le M'(\varepsilon)$,
$1\le j'\le l$, for which
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ if it is an edge of $\gamma$,
and otherwise all sets $A^\varepsilon_{s_j}$ and
$B^\varepsilon_{t_{j'}}$ are
disjoint. (This sum also depends on $\varepsilon$.) In the
case of an empty sum $Z_\gamma(f,g,\varepsilon)$ equals zero.
For all $\gamma\in\Gamma(k,l)$ the expression
$Z_\gamma(f,g,\varepsilon)$ will be written in the form
\begin{equation}
Z_\gamma(f,g,\varepsilon)=Z_\gamma^{(1)}(f,g,\varepsilon)
+Z_\gamma^{(2)}(f,g,\varepsilon),
\quad \gamma\in\Gamma(k,l), \label{(B4)}
\end{equation}
with
\begin{eqnarray}
Z^{(1)}_\gamma(f,g,\varepsilon)
&&={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l) \nonumber \\
&&\qquad\prod_{j\colon\, (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad \prod_{j\colon\, (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})
\label{(B5)}
\end{eqnarray}
and
\begin{eqnarray}
Z^{(2)}_\gamma(f,g,\varepsilon)
&&={\sum}^\gamma
c^\varepsilon(s_1,\dots,s_k) d^\varepsilon(t_1,\dots,t_l)
\nonumber \\
&&\qquad \prod_{j\colon\, (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad \biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j}) \nonumber \\
&& \qquad\qquad\qquad
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad\qquad -\prod_{j\colon\, (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr], \label{(B6)}
\end{eqnarray}
where $V_1(\gamma)$ and $V_2(\gamma)$ (introduced before
formula~(\ref{(10.9)}) during the preparation to the formulation of
Theorem~10.2A) are the sets of vertices in the first and second
row of the diagram $\gamma$ from which no edge starts.
I claim that there is some constant $C>0$ not depending on
$\varepsilon$ such that
\begin{equation}
E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z^{(1)}_\gamma(f,g,\varepsilon)\right)^2\le C\varepsilon
\quad \textrm{for all } \gamma\in\Gamma(k,l) \label{(B7)}
\end{equation}
with the Wiener--It\^o integral with the kernel function
$F_\gamma(f,g)$ defined in (\ref{(10.9)}), (\ref{(10.9a)})
and (\ref{(10.10)}), and
\begin{equation}
E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
\le C\varepsilon\quad\textrm{for all } \gamma\in\Gamma(k,l).
\label{(B8)}
\end{equation}
Relations~(\ref{(B7)}) and~(\ref{(B8)}) imply relation~(\ref{(10.12)})
if $f$ and $g$ are elementary functions. Indeed, they imply that
$$
\lim_{\varepsilon\to0}\left\|\,|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))
-Z_\gamma(f,g,\varepsilon)\right\|_2=0
\quad\textrm{for all }\gamma\in\Gamma(k,l),
$$
and this relation together with (\ref{(B2)}) yields
relation (\ref{(10.12)}) with
the help of the limiting procedure $\varepsilon\to0$.
To prove relation (\ref{(B7)}) let us introduce the function
\begin{eqnarray*}
&&F^\varepsilon_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in
V_1(\gamma),\; (2,j')\in V_2(\gamma))\\
&&\qquad=
F_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))\\
&&\qquad\quad\textrm{if } x_{(1,j)}\in A^\varepsilon_{s_j},
\textrm{ for all } (1,j)\in V_1(\gamma),\\
&&\qquad\quad\textrm{ } x_{(2,j')}\in B^\varepsilon_{t_{j'}},
\textrm{ for all } (2,j')\in V_2(\gamma), \quad\textrm{and}\\
&& \qquad\quad\textrm{ all sets }
A^\varepsilon_{s_j},\; (1,j)\in V_1(\gamma),
\textrm{ and } B^\varepsilon_{t_{j'}}, \; (2,j')\in V_2(\gamma)
\textrm{ are different.}
\end{eqnarray*}
with the function~$F_\gamma(f,g)$ defined in~(\ref{(10.9a)})
and~(\ref{(10.10)}), and
put
$$
F^\varepsilon_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))=0 \quad
\textrm{otherwise.}
$$
The function $F_\gamma^\varepsilon(f,g)$ is elementary, and
a comparison of its definition with relation~(\ref{(B5)})
and the definition of the function $F_\gamma(f,g)$ yields that
\begin{equation}
Z_\gamma^{(1)}(f,g,\varepsilon)=|\gamma|!
Z_{\mu,|\gamma|}(F_\gamma^\varepsilon(f,g)). \label{(B9)}
\end{equation}
The function $F^\varepsilon_\gamma(f,g)$ slightly differs
from $F_\gamma(f,g)$, since the function $F_\gamma(f,g)$ may not
disappear at such points
$(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\,
(2,j')\in V_2(\gamma))$ for which there is some pair $(j,j')$
with the property $x_{(1,j)}\in A^\varepsilon_{s_j}$ and
$x_{(2,j')}\in B^\varepsilon_{t_{j'}}$ with some sets
$A^\varepsilon_{s_j}$ and
$B^\varepsilon_{t_{j'}}$ such that
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$, while
$F_\gamma^\varepsilon(f,g)$ must be zero at such points. On the other
hand, in the case $|\gamma|=\max(k,l)-\min(k,l)$, i.e. if one
of the sets $V_1(\gamma)$ or $V_2(\gamma)$ is empty,
$F_\gamma(f,g)=F^\varepsilon_\gamma(f,g)$, \
$Z_\gamma^{(1)}(f,g,\varepsilon)
=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))$, and
relation~(\ref{(B7)}) clearly holds for such diagrams $\gamma$.
In the case $|\gamma|>\max(k,l)-\min(k,l)$ an appropriate estimate
will be proved for the measure of the set where
$F_\gamma\neq F_\gamma^\varepsilon$, which implies
relation~(\ref{(B7)}).
Let us define the sets $A=\bigcup\limits_{s=1}^{M(\varepsilon)}
A^\varepsilon_s$ and
$B=\bigcup\limits_{t=1}^{M'(\varepsilon)} B^\varepsilon_t$.
These sets $A$ and $B$ do
not depend on the parameter $\varepsilon$. Beside this
$\mu(A)<\infty$, and $\mu(B)<\infty$. Define for all pairs
$(j_0,j_0')$ such that $(1,j_0)\in V_1(\gamma)$,
$(2,j_0')\in V_2(\gamma)$ the set
\begin{eqnarray*}
D(j_0,j'_0)
&&=\{(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma), \,
(2,j')\in V_2(\gamma)) \colon\\
&&\qquad x_{(1,j_0)}\in A^\varepsilon_{s_{j_0}}, \;
x_{(2,j'_0)}\in B^\varepsilon_{t_{j'_0}}
\quad \textrm{for some } s_{j_0} \textrm{ and } t_{j'_0} \\
&&\qquad\qquad \textrm{such that }
A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}},
\quad x_{(1,j)}\in A\textrm{ for all } (1,j)\in V_1(\gamma), \\
&&\qquad\qquad \textrm{ and }x_{(2,j')}\in B
\textrm{ for all }(2,j')\in V_2(\gamma)\}.
\end{eqnarray*}
Introduce the notation $x^\gamma=(x_{(1,j)},x_{(2,j')},\;
(1,j)\in V_1(\gamma),\,(2,j')\in V_2(\gamma))$, and consider
only such vectors $x^\gamma$ whose coordinates satisfy the
conditions $x_{(1,j)}\in A$ for all $(1,j)\in V_1(\gamma)$
and $x_{(2,j')}\in B$ for all $(2,j')\in V_2(\gamma)$.
Put $D_\gamma=\{x^\gamma\colon\,
F^\varepsilon_\gamma(f,g)(x^\gamma)\neq F_\gamma(f,g)(x^\gamma)\}$.
The relation $D_\gamma\subset\bigcup\limits_{j_0\colon\,
(1,j_0)\in V_1(\gamma)}\;\bigcup\limits_{j'_0\colon\,
(2,j'_0)\in V_2(\gamma)} D(j_0,j'_0)$ holds, since if
$F^\varepsilon_\gamma(f,g)(x^\gamma)\neq F_\gamma(f,g)(x^\gamma)$
for some vector~$x^\gamma$, then it has some coordinates
$(1,j_0)\in V_1(\gamma)$ and $(2,j'_0)\in V_2(\gamma)$ such that
$x_{(1,j_0)}\in A^\varepsilon_{s_{j_0}}$ and
$x_{(2,j'_0)}\in B^\varepsilon_{t_{j'_0}}$ with some sets
$A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}}$, and the
relation in the last
line of the definition of $D(j_0,j'_0)$ must also hold for this
vector $x^\gamma$, since otherwise
$F_\gamma(f,g)(x^\gamma)=0=F^\varepsilon_\gamma(f,g)(x^\gamma)$.
I claim that there is some constant $C_1$ such that
$$
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D(j_0,j'_0))\le C_1\varepsilon
\quad\textrm{for all sets } D(j_0,j'_0),
$$
where $\mu^{|V_1(\gamma)|+|V_2(\gamma)|}$
denotes the direct product of the measure $\mu$ on some copies of
the original space $(X,{\cal X})$ indexed by $(1,j)\in V_1(\gamma)$
and $(2,j')\in V_2(\gamma)$. To see this relation one has to
observe that
$\sum\limits_{(s_{j_0},t_{j'_0})\colon\,
A^\varepsilon_{s_{j_0}}=B^\varepsilon_{t_{j'_0}}}
\mu(A^\varepsilon_{s_{j_0}})\mu(B^\varepsilon_{t_{j'_0}})\le
\sum\limits_{s_{j_0}=1}^{M(\varepsilon)}
\varepsilon\mu(A^\varepsilon_{s_{j_0}})
=\varepsilon\mu(A)$.
Thus the set $D(j_0,j'_0)$ can be covered by the direct product
of a set whose $\mu$ measure is not greater than
$\varepsilon\mu(A)$ and of a rectangle whose edges are
either the set $A$ or the set $B$.
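Spelling out this covering argument (note that both
$|V_1(\gamma)|\ge1$ and $|V_2(\gamma)|\ge1$ here):
$$
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D(j_0,j'_0))\le
\varepsilon\mu(A)\,\mu(A)^{|V_1(\gamma)|-1}\mu(B)^{|V_2(\gamma)|-1},
$$
i.e.\ the claimed estimate holds with
$C_1=\mu(A)^{|V_1(\gamma)|}\mu(B)^{|V_2(\gamma)|-1}$.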
The above relations imply that
\begin{equation}
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C_2\varepsilon
\label{(B10)}
\end{equation}
with some constant $C_2>0$.
Relation (\ref{(B9)}), estimate (\ref{(B10)}), the
property~c) formulated in
Theorem~10.1 for Wiener--It\^o integrals and the observation that
the function $F_\gamma(f,g)$ is bounded in supremum norm
if $f$ and $g$ are elementary functions imply the inequality
\begin{eqnarray*}
&&E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z^{(1)}_\gamma(f,g,\varepsilon)\right)^2 \\
&&\qquad =(|\gamma|!)^2E\left( Z_{\mu,|\gamma|}
(F_\gamma(f,g)-F_\gamma^\varepsilon(f,g))\right)^2
\le |\gamma|!\| F_\gamma(f,g)-F_\gamma^\varepsilon(f,g)\|_2^2 \\
&&\qquad\le K\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C\varepsilon.
\end{eqnarray*}
This means that relation~(\ref{(B7)}) holds.
To prove relation (\ref{(B8)}) write
$E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2$
in the following form:
\begin{eqnarray}
&&E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
={\sum}^\gamma
{\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l)
c^\varepsilon(\bar s_1,\dots,\bar s_k) \nonumber \\
&&\qquad\qquad
d^\varepsilon(\bar t_1, \dots,\bar t_l)
EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)
\label{(B11)}
\end{eqnarray}
with
\begin{eqnarray}
&&U(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l) \nonumber \\
&&\qquad =\prod_{j\colon\, (1,j)
\in V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\,(2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&&\qquad\qquad
\prod_{\bar\jmath\colon\, (1,\bar\jmath)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}})
\prod_{\bar\jmath'\colon\, (2,\bar\jmath')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{\bar t_{\bar\jmath'}}) \nonumber \\
&&\qquad\qquad \biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\!\!\!\!\!\!\!\!\!\!
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)} \!\!\!\!\!\!\!\!\!\!\!\!
\!\!\!\!\!\!
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&&\qquad\qquad\qquad
-\prod_{j\colon\, (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr]
\nonumber \\
&&\qquad\biggl[\prod_{\bar\jmath\colon\, (1,\bar\jmath)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)} \!\!\!\!
\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}}) \!\!\!\!\!
\prod_{\bar\jmath'\colon\,
(2,\bar\jmath')\in \{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)} \!\!\!\!\!\!
\mu_W(B^\varepsilon_{\bar t_{\bar \jmath'}}) \nonumber \\
&&\qquad\qquad\qquad
-\prod_{\bar\jmath\colon\, (1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}
\mu(A^\varepsilon_{\bar s_{\bar\jmath}})\biggr]. \label{(B12)}
\end{eqnarray}
The double sum $\sum^\gamma\sum^\gamma$ in (\ref{(B11)}) has to be
understood in the following way. The first summation is taken over
vectors $(s_1,\dots,s_k,t_1,\dots,t_l)$ taking those values
which were admitted in the summation $\sum^\gamma$ in
formula (\ref{(B3)}).
The second summation is taken over vectors
$(\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)$,
again with the values admitted in the summation $\sum^\gamma$.
Relation~(\ref{(B8)}) will be proved by means of some
estimates about the expectation of the above defined random
variable $U(\cdot)$ which will be presented in the following
Lemma~B. Before its formulation I introduce the following
Properties~A and~B.
\medskip\noindent
{\bf Property A.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar\jmath}\le M(\varepsilon)$, for
$1\le j,\bar\jmath\le k$, and
$1\le t_{j'},\bar t_{\bar\jmath'}\le M'(\varepsilon)$ for
$1\le j',\bar\jmath'\le l$,
satisfies Property~A (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequences of sets
$A^\varepsilon_{s_j}$, $(1,j)\in V_1(\gamma)$,
$B^\varepsilon_{t_{j'}}$, $(2,j')\in V_2(\gamma)$,
and
$A^\varepsilon_{\bar s_{\bar\jmath}}$, $(1,\bar\jmath)\in V_1(\gamma)$,
$B^\varepsilon_{\bar t_{\bar\jmath'}}$, $(2,\bar\jmath')\in V_2(\gamma)$,
agree. (Here we say that two sequences agree if they contain
the same elements in a possibly different order.)}
\medskip
(In the formulation of Property~A we considered the sets
$A^\varepsilon_{s_j}$ only for such indices~$j$ for which
$(1,j)\in V_1(\gamma)$, the sets $B^\varepsilon_{t_{j'}}$ only
for such indices~$j'$ for which $(2,j')\in V_2(\gamma)$. The
sets $A^\varepsilon_{\bar s_{\bar\jmath}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$ were selected in a similar
way. A similar convention will be applied in the definition of
Property~B.)
\medskip\noindent
{\bf Property B.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar\jmath}\le M(\varepsilon)$, for
$1\le j,\bar\jmath\le k$, and
$1\le t_{j'},\bar t_{\bar\jmath'}\le M'(\varepsilon)$ for
$1\le j',\bar\jmath'\le l$,
satisfies Property~B (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequences of sets
\begin{eqnarray*}
&&A^\varepsilon_{s_j},\;
(1,j)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma), \\
&&\qquad\qquad B^\varepsilon_{t_{j'}}, \;
(2,j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma),
\end{eqnarray*}
and
\begin{eqnarray*}
&&A^\varepsilon_{\bar s_{\bar\jmath}},
(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma), \\
&&\qquad\qquad B^\varepsilon_{\bar t_{\bar\jmath'}}, \;
(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma),
\end{eqnarray*}
have at least one common element.}
\medskip
(In the above definitions two sets $A^\varepsilon_s$ and
$B^\varepsilon_t$ are
identified if $A^\varepsilon_s=B^\varepsilon_t$.)
Now I formulate the following
\medskip\noindent
{\bf Lemma B.} {\it Let us consider the function $U(\cdot)$
introduced in formula~(\ref{(B12)}). Assume that its arguments
$s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$ are chosen
in such a way that the function $U(\cdot)$ with these
arguments appears in the double sum $\sum^\gamma\sum^\gamma$
in formula~(\ref{(B11)}), i.e.\
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, otherwise all sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$ are disjoint,
and an analogous statement holds if
the coordinates $s_1,\dots,s_k,t_1,\dots,t_l$ are replaced by
$\bar s_1,\dots,\bar s_k$ and $\bar t_1,\dots,\bar t_l$. Then
\begin{equation}
EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)=0
\label{(B13)}
\end{equation}
if the sequence of the arguments in $U(\cdot)$ fails to satisfy
Property~A or fails to satisfy Property~B.
If the sequence of the arguments in $U(\cdot)$ satisfies both
Property~A and Property~B, then
\begin{eqnarray}
&&|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)|
\nonumber \\
&&\qquad \le C\varepsilon \prod{\vphantom\prod}'
\mu(A^\varepsilon_{s_j})\mu(A^\varepsilon_{\bar s_{\bar\jmath}})
\mu(B^\varepsilon_{t_{j'}})\mu(B^\varepsilon_{\bar t_{\bar\jmath'}})
\label{(B14)}
\end{eqnarray}
with some appropriate constant $C=C(k,l)>0$ depending only on
the number of variables $k$ and $l$ of the functions $f$ and $g$.
The prime in the product $\prod'$ on the right-hand side
of~(\ref{(B14)})
means that in this product the measures $\mu$ of those sets
$A^\varepsilon_{s_j}$, $A^\varepsilon_{\bar s_{\bar\jmath}}$,
$B^\varepsilon_{t_{j'}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$ are taken
whose indices are
listed among the arguments
$s_j,\bar s_{\bar\jmath},t_{j'}$ or $\bar t_{\bar\jmath'}$ of
$U(\cdot)$, and the measure~$\mu$ of each such set appears
exactly once. (This means e.g.\ that if
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ or
$A^\varepsilon_{s_j}=B^\varepsilon_{\bar t_{\bar\jmath'}}$
for some indices $j$ and
$j'$ or $\bar\jmath'$, then one of the factors
$\mu(A^\varepsilon_{s_j})$ and $\mu(B^\varepsilon_{t_{j'}})$ or
$\mu(B^\varepsilon_{\bar t_{\bar\jmath'}})$ is omitted from the
product. For the sake of definiteness let us preserve the factor
$\mu(A^\varepsilon_{s_j})$ in such a case.)}
\medskip\noindent
{\it Remark.}\/ The content of Lemma~B is that most terms
in the double sum in formula~(\ref{(B11)}) equal zero, and
even the non-zero terms are small.
\medskip\noindent
{\it The proof of Lemma B.}\/ Let us prove first
relation~(\ref{(B13)})
in the case when Property~A does not hold. It will be exploited
that for disjoint sets the random variables $\mu_W(A_s)$ and
$\mu_W(B_t)$ are independent, and this provides a good
factorization of the expectation of certain products. Let us
carry out the multiplications in the definition of $U(\cdot)$
in formula~(\ref{(B12)}), and show that each product obtained
in such a way has zero expectation. If Property~A does not hold
for the arguments of $U(\cdot)$, and beside this the arguments
$s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$ satisfy the
remaining conditions of Lemma~B, then each
product we consider contains a factor
$\mu_W(A^\varepsilon_{s_{j_0}})$, $(1,j_0)\in V_1(\gamma)$,
which is independent of all those terms in this product which
are in the following list:
$\mu_W(A^\varepsilon_{s_j})$ with some $j\neq j_0$, $1\le j\le k$,
or $\mu_W(B^\varepsilon_{t_{j'}})$ with $1\le j'\le l$, or
$\mu_W(A^\varepsilon_{\bar s_{\bar\jmath}})$ with
$(1,\bar\jmath)\in V_1(\gamma)$, or
$\mu_W(B^\varepsilon_{\bar t_{\bar\jmath'}})$ with
$(2,\bar\jmath')\in V_2(\gamma)$. We will show with the help of
this property that the expectation of each term has a factorization
with a factor either of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})=0$ or
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$, hence it equals zero.
Indeed, although the above properties do not exclude the
appearance of such a pair of arguments
$A^\varepsilon_{\bar s_{\bar\jmath}}$,
$(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$,
and $B^\varepsilon_{\bar t_{\bar\jmath'}}$,
$(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)$, in the product for which
$A^\varepsilon_{\bar s_{\bar\jmath}}
=B^\varepsilon_{\bar t_{\bar\jmath'}}=A^\varepsilon_{s_{j_0}}$, and in
such a case a term of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})$
will not appear in the product, but if this happens,
then the product contains a factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$. Hence an
appropriate factorization of each term of $EU(\cdot)$
contains either a factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})=0$ or
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$ if
$U(\cdot)$ does not satisfy Property~A.
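The moment identities exploited here follow from the fact that
$\mu_W(A^\varepsilon_{s_{j_0}})$ is a Gaussian random variable with
expectation zero and variance $\mu(A^\varepsilon_{s_{j_0}})$, hence
all of its odd moments vanish:
$$
E\mu_W(A^\varepsilon_{s_{j_0}})^{2m+1}=0\quad
\textrm{for all } m=0,1,2,\dots.
$$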
To finish the proof of relation (\ref{(B13)}) it is enough to
consider the case when the arguments of $U(\cdot)$ satisfy
Property~A, but they do not satisfy Property~B. The validity
of Property~A implies that the sets
$\{A^\varepsilon_{s_j}\colon\,(1,j)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{t_{j'}}\colon\,(2,j')\in V_2(\gamma)\}$
and
$\{A^\varepsilon_{\bar s_{\bar\jmath}}\colon\,
(1,\bar\jmath)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{\bar t_{\bar\jmath'}}\colon\,
(2,\bar\jmath')\in V_2(\gamma)\}$
agree. The conditions of Lemma~B also imply that the
elements of these classes are disjoint from the sets
$A^\varepsilon_{s_j}$,
$B^\varepsilon_{t_{j'}}$, $A^\varepsilon_{\bar s_{\bar\jmath}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$ with indices
$(1,j),(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$
and
$(2,j'),(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$.
If Property~B does not hold, then the latter class of sets can be
divided into two subclasses in such a way that the elements in
different subclasses are disjoint. The first subclass consists
of the sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$, and the second
one of the sets $A^\varepsilon_{\bar s_{\bar\jmath}}$ and
$B^\varepsilon_{\bar t_{\bar\jmath'}}$
with indices such that
$(1,j),(1,\bar\jmath)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)$ and
$(2,j'),(2,\bar\jmath')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$.
These facts imply that $EU(\cdot)$ has a factorization,
which contains the term
\begin{eqnarray*}
&&E\biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \\
&&\qquad\qquad -\prod_{j\colon\, (1,j)\in\{(1,1),\dots,(1,k)\}
\setminus V_1(\gamma)}\mu(A^\varepsilon_{s_j})\biggr]=0,
\end{eqnarray*}
hence relation (\ref{(B13)}) holds also in this case. The
last expression has zero expectation, since if we take
those pairs $A^\varepsilon_{s_j},B^\varepsilon_{t_{j'}}$ among
the sets appearing in it for which
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ for which these vertices are
connected by an edge of $\gamma$, then
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ in each pair, and
elements in different pairs are disjoint. This
observation allows a factorization in the product whose
expectation is taken, and then the identity
$E\mu_W(A^\varepsilon_{s_j})\mu_W(B^\varepsilon_{t_{j'}})
=\mu(A^\varepsilon_{s_j})$ implies the desired identity.
To prove relation (\ref{(B14)}) when the arguments of the
function~$U(\cdot)$ satisfy both Properties~A and~B, consider
the expression (\ref{(B12)}) which defines $U(\cdot)$, carry
out the term by term multiplication between the two
differences at the end of this formula, take the expectation
of each term of the sum obtained in this way and factorize
them. Since $E\mu_W(A)^2=\mu(A)$, $E\mu_W(A)^4=3\mu(A)^2$
for all sets $A\in{\cal X}$, $\mu(A)<\infty$, some
calculation shows that each term can be expressed as
constant times a product whose elements are those
probabilities $\mu(A_s^\varepsilon)$ and
$\mu(B_t^\varepsilon)$ or their square which appear at the
right-hand side of (\ref{(B14)}). Moreover, since the
arguments of $U(\cdot)$ satisfy Property~B, there will be
at least one term of the form $\mu(A_s^\varepsilon)^2$ in
this product. Since
$\mu(A_s^\varepsilon)^2\le \varepsilon\mu(A_s^\varepsilon)$,
these calculations provide
formula~(\ref{(B14)}). Lemma~B is proved.
\medskip
Relation (\ref{(B11)}) implies that
\begin{eqnarray}
&&E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
\label{(B15)} \\
&&\qquad\le K\sum{\vphantom{\sum}}^\gamma
\sum{\vphantom{\sum}}^\gamma
|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)| \nonumber
\end{eqnarray}
with some appropriate $K>0$. By Lemma~B it is enough to sum up
only for such terms $U(\cdot)$ in (\ref{(B15)}) whose
arguments satisfy
both Properties~A and~B. Moreover, each such term can be bounded
by means of inequality (\ref{(B14)}). Let us list the sets
$A^\varepsilon_{s_j},A^\varepsilon_{\bar s_{\bar\jmath}},
B^\varepsilon_{t_{j'}},
B^\varepsilon_{\bar t_{\bar\jmath'}}$ appearing in the
upper bound at the right-hand side of (\ref{(B14)}) for
all functions $U(\cdot)$ taking part in the sum at the
right-hand side of (\ref{(B15)}). Since all fixed sequences
of the sets $A^\varepsilon_s$ and $B^\varepsilon_t$ appear
less than $C(k,l)$ times with an appropriate
constant $C(k,l)$ depending only on the orders $k$ and $l$
of the integrals we are considering, and
$\sum\limits_{s=1}^{M(\varepsilon)}
\mu(A^\varepsilon_s)+\sum\limits_{t=1}^{M'(\varepsilon)}
\mu(B^\varepsilon_t)
=\mu(A)+\mu(B)<\infty$, the above relations imply that
$$
E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
\le C_1\varepsilon\sum_{j=1}^{k+l}(\mu(A)+\mu(B))^j
\le C\varepsilon.
$$
Hence relation (\ref{(B8)}) holds.
\medskip
To prove Theorem 10.2A in the general case take for all pairs of
functions $f\in{\cal H}_{\mu,k}$ and $g\in{\cal H}_{\mu,l}$ two
sequences of elementary functions $f_n\in\bar{\cal H}_{\mu,k}$
and $g_n\in\bar{\cal H}_{\mu,l}$, $n=1,2,\dots$, such that
$\|f_n-f\|_2\to0$ and $\|g_n-g\|_2\to0$ as $n\to\infty$.
It is enough to show that
\begin{equation}
E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-k!Z_{\mu,k}(f_n)
l!Z_{\mu,l}(g_n)|\to0\quad \textrm{as }n\to\infty,
\label{(B16)}
\end{equation}
and
\begin{eqnarray}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\to0
\textrm{ as } n\to\infty \nonumber \\
&&\qquad\qquad\qquad \qquad\qquad\qquad\qquad
\textrm{for all } \gamma\in\Gamma(k,l),
\label{(B17)}
\end{eqnarray}
since then a simple limiting procedure $n\to\infty$, and the
already proved part of the theorem for Wiener--It\^o integrals of
elementary functions imply Theorem~10.2A.
To prove relation (\ref{(B16)}) write
\begin{eqnarray*}
&&E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-
k!Z_{\mu,k}(f_n)l!Z_{\mu,l}(g_n)|\\
&&\qquad\le k!l!\left(E|Z_{\mu,k}(f)Z_{\mu,l}(g-g_n)|
+E|Z_{\mu,k}(f-f_n)Z_{\mu,l}(g_n)|\right) \\
&&\qquad\le k!l!
\left(\left(EZ^2_{\mu,k}(f)\right)^{1/2}
\left(EZ^2_{\mu,l}(g-g_n)\right)^{1/2} \right. \\
&&\qquad\qquad \left. +\left(EZ^2_{\mu,k}(f-f_n)\right)^{1/2}
\left(EZ^2_{\mu,l}(g_n)\right)^{1/2}\right)\\
&&\qquad\le (k!l!)^{1/2}\left(\|f\|_2\|g-g_n\|_2
+\|f-f_n\|_2\|g_n\|_2\right).
\end{eqnarray*}
Relation (\ref{(B16)}) follows from this inequality with a limiting
procedure $n\to\infty$.
To prove relation (\ref{(B17)}) write
\begin{eqnarray*}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&&\qquad\le
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g-g_n))\right|+
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f-f_n,g_n))\right|\\
&&\qquad\le
|\gamma|!\left(EZ^2_{\mu,|\gamma|}
(F_\gamma(f,g-g_n))\right)^{1/2}+
|\gamma|!\left(EZ^2_{\mu,|\gamma|}
(F_\gamma(f-f_n,g_n))\right)^{1/2}\\
&&\qquad\le (|\gamma|!)^{1/2}\left(\|F_\gamma(f,g-g_n)\|_2+
\|F_\gamma(f-f_n,g_n)\|_2\right),
\end{eqnarray*}
and observe that by relation (\ref{(10.11)})
$\|F_\gamma(f,g-g_n)\|_2\le \|f\|_2\|g-g_n\|_2$, and
\hfill\break
$\|F_\gamma(f-f_n,g_n)\|_2\le \|f-f_n\|_2\|g_n\|_2$. Hence
\begin{eqnarray*}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&&\qquad\le(|\gamma|!)^{1/2}
\left(\|f\|_2\|g-g_n\|_2+\|f-f_n\|_2\|g_n\|_2\right).
\end{eqnarray*}
The last inequality implies relation (\ref{(B17)})
with a limiting procedure
$n\to\infty$. Theorem 10.2A is proved.
\chapter{The proof of some results about Wiener--It\^o integrals}
\label{introC}
First I prove It\^o's formula about multiple
Wiener--It\^o integrals (Theorem~10.3). The proof is based
on the diagram formula for Wiener--It\^o integrals and a
recursive formula about Hermite polynomials proved in
Proposition~C. In Proposition~C2 I present the proof of
another important property of Hermite polynomials. This
result states that the class of all Hermite polynomials is a
{\it complete}\/ orthogonal system in an appropriate
Hilbert space. It is needed in the proof of Theorem 10.5
which provides an isomorphism between a Fock space and the
Hilbert space generated by Wiener--It\^o integrals with respect
to a white noise with an appropriate reference measure. At the
end of Appendix~C the proof of Theorem~10.4, a limit theorem
about degenerate $U$-statistics is given together with a
version of this result about the limit behaviour of multiple
integrals with respect to a normalized empirical distribution.
\medskip\noindent
{\bf Proposition C about some properties of Hermite
polynomials.}\index{Hermite polynomials} {\it The functions
\begin{equation}
H_k(x)=(-1)^k e^{x^2/2}\frac {d^k}{dx^k}e^{-x^2/2},
\quad k=0,1,2,\dots \label{(C1)}
\end{equation}
are the Hermite polynomials with leading
coefficient 1, i.e.\ $H_k(x)$ is a polynomial of
order $k$ with leading coefficient 1 such that
\begin{equation}
\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \textrm{if } k\neq l. \label{(C2)}
\end{equation}
Beside this,
\begin{equation}
\int_{-\infty}^\infty H^2_k(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=k!
\quad \textrm{for all } k=0,1,2,\dots. \label{(C$2'$)}
\end{equation}
The recursive relation
\begin{equation}
H_k(x)=x H_{k-1}(x)-(k-1)H_{k-2}(x) \label{(C3)}
\end{equation}
holds for all $k=1,2,\dots$.}
\medskip\noindent
{\it Remark.} It is more convenient to consider
relation~(\ref{(C3)}) valid also in the case $k=1$. In this
case $H_1(x)=x$, $H_0(x)=1$, and the relation holds with an
arbitrary function $H_{-1}(x)$, since its coefficient $k-1$ vanishes.
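The statements of Proposition~C can be checked for small indices by direct computation. The following short Python sketch (my illustration, not part of the original text) builds the polynomials $H_k$ from the recursion (\ref{(C3)}) and evaluates the integrals in (\ref{(C2)}) and (\ref{(C$2'$)}) exactly, using the standard normal moments $E\xi^n=(n-1)!!$ for even~$n$ and $0$ for odd~$n$.

```python
def hermite(k):
    # Coefficient list (ascending powers) of H_k built from recursion (C3):
    # H_k(x) = x*H_{k-1}(x) - (k-1)*H_{k-2}(x), with H_0(x) = 1, H_1(x) = x.
    h_prev2, h_prev = [1], [0, 1]
    if k == 0:
        return h_prev2
    for j in range(2, k + 1):
        shifted = [0] + h_prev                       # x * H_{j-1}(x)
        tail = [-(j - 1) * c for c in h_prev2]       # -(j-1) * H_{j-2}(x)
        tail += [0] * (len(shifted) - len(tail))
        h_prev2, h_prev = h_prev, [a + b for a, b in zip(shifted, tail)]
    return h_prev

def normal_moment(n):
    # E xi^n for a standard normal xi: (n-1)!! for even n, 0 for odd n.
    if n % 2:
        return 0
    m = 1
    for i in range(1, n, 2):
        m *= i
    return m

def inner(p, q):
    # Exact value of int p(x) q(x) (2 pi)^{-1/2} exp(-x^2/2) dx,
    # computed term by term from the normal moments.
    return sum(a * b * normal_moment(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))
```

For instance, `hermite(3)` yields `[0, -3, 0, 1]`, i.e.\ $H_3(x)=x^3-3x$, and `inner(hermite(k), hermite(l))` returns $0$ for $k\neq l$ and $k!$ for $k=l$, in accordance with (\ref{(C2)}) and (\ref{(C$2'$)}).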
\medskip\noindent
{\it Proof of Proposition C.} It is clear from
formula~(\ref{(C1)}) that $H_k(x)$ is a polynomial of
order $k$ with leading coefficient 1. Take $l\ge k$, and
write by means of integration by parts
\begin{eqnarray*}
&&\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=\int_{-\infty}^\infty\frac1{\sqrt{2\pi}}
H_k(x)(-1)^l\frac{d^l}{dx^l} e^{-x^2/2}\,dx\\
&&\qquad\qquad
=\int_{-\infty}^\infty\frac1{\sqrt{2\pi}} \frac d{dx} H_k(x)
(-1)^{l-1}\frac{d^{l-1}}{dx^{l-1}}e^{-x^2/2}\,dx.
\end{eqnarray*}
Successive partial integration together with the identity
$\frac{d^k}{dx^k}H_k(x)=k!$ yield that
$$
\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=k!\int_{-\infty}^\infty\frac1{\sqrt{2\pi}}
(-1)^{l-k}\frac{d^{l-k}}{dx^{l-k}}e^{-x^2/2}\,dx.
$$
The last relation supplies formulas (\ref{(C2)})
and~(\ref{(C$2'$)}).
To prove relation (\ref{(C3)}) observe that
$H_k(x)-xH_{k-1}(x)$ is a polynomial of order $k-2$. (The term
$x^{k-1}$ is missing from this expression. Indeed, if $k$ is
an even number, then the polynomial $H_k(x)-xH_{k-1}(x)$ is
an even function, and it does not contain the term $x^{k-1}$
with an odd exponent $k-1$. A similar argument applies if the
number $k$ is odd.) Beside this, it is orthogonal (with
respect to the standard normal distribution) to all Hermite
polynomials $H_l(x)$ with $0\le l\le k-3$. Hence
$H_k(x)-xH_{k-1}(x)=CH_{k-2}(x)$ with some constant $C$ to be
determined.
Multiply both sides of the last identity by $H_{k-2}(x)$
and integrate with respect to the standard normal
distribution. Apply the orthogonality of the polynomials
$H_k(x)$ and $H_{k-2}(x)$, and observe that the identity
$$
\int H_{k-1}(x)xH_{k-2}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=\int H^2_{k-1}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=(k-1)!
$$
holds. (In this calculation we have exploited that $H_{k-1}(x)$
is orthogonal to $H_{k-1}(x)-xH_{k-2}(x)$, because the order of
the latter polynomial is less than $k-1$.) In such a way we get
the identity $-(k-1)!=C(k-2)!$ for the constant~$C$ in the last
identity, i.e. $C=-(k-1)$, and this implies relation (\ref{(C3)}).
\medskip\noindent
{\it Proof of It\^o's formula for multiple Wiener--It\^o
integrals.}\/\index{It\^o's formula for multiple Wiener--It\^o
integrals} Let $K=\sum\limits_{p=1}^m k_p$, the sum of the
orders of the Hermite polynomials, denote the order of the
expression in relation (\ref{(10.20)}).
Formula~(\ref{(10.20)}) clearly holds
for expressions of order $K=1$. It will be proved in the
general case by means of induction with respect to the
order~$K$.
In the proof the functions $f(x_1)=\varphi_1(x_1)$ and
$$
g(x_1,\dots,x_{K_m-1})=\prod_{j=1}^{K_1-1}\varphi_1(x_j)
\cdot \prod_{p=2}^m \prod_{j=K_{p-1}}^{K_p-1}\varphi_p(x_j),
$$
will be introduced and the product
$Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)$ will be calculated
by means of the diagram formula. (The same notation is
applied as in Theorem 10.3. In particular, $K=K_m$, and in
the case $K_1=1$ the convention
$\prod\limits_{j=1}^{K_1-1}\varphi_1(x_j)=1$ is applied.)
In the application of the diagram formula diagrams with
two rows appear. The first row of these diagrams contains the
vertex $(1,1)$ and the second row contains the vertices
$(2,1),\dots,(2,K_m-1)$. It is useful to divide the diagrams to
three disjoint classes. The first class, $\Gamma_0$ contains
only the diagram $\gamma_0$ without any edges. The second class
$\Gamma_1$ consists of those diagrams which have an edge of the
form $((1,1),(2,j))$ with some $1\le j\le k_1-1$, and the third
class $\Gamma_2$ is the set of those diagrams which have an
edge of the form $((1,1),(2,j))$ with some $k_1\le j\le K_m-1$.
Because of the orthogonality of the functions $\varphi_s$ for
different indices~$s$, $F_\gamma\equiv0$ and
$Z_{\mu,K_m-2}(F_\gamma)=0$ for $\gamma\in\Gamma_2$.
The class $\Gamma_1$ contains $k_1-1$ diagrams. Let us consider a
diagram $\gamma$ from this class with an edge $((1,1),(2,j_0))$,
$1\le j_0\le k_1-1$. We have for such a diagram $F_\gamma=
\prod\limits_{j\in\{1,\dots,K_1-1\}
\setminus \{j_0\}}\varphi_1(x_{(2,j)})
\prod\limits_{p=2}^m
\prod\limits_{j=K_{p-1}}^{K_p-1}\varphi_p(x_{(2,j)})$, and
by our inductive hypothesis $(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)=
H_{k_1-2}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p)$. Finally
$$
K_m!Z_{\mu,K_m}(F_{\gamma_0})=
K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m
\left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)
$$
for the diagram $\gamma_0\in\Gamma_0$ without any edge.
Our inductive hypothesis also implies the following identity for
the expression we wanted to calculate with the help of the diagram
formula.
$$
Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)=\eta_1
H_{k_1-1}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p).
$$
The above calculations together with the observation
$|\Gamma_1|=k_1-1$ yield the identity
\begin{eqnarray}
&&K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m \left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)=K_m!Z_{\mu,K_m}(F_{\gamma_0})
\nonumber \\
&&\qquad=Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)-
\sum_{\gamma\in\Gamma_1}(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)
\nonumber \\
&&\qquad=\eta_1 H_{k_1-1}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)
-(k_1-1) H_{k_1-2}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)
\nonumber \\
&&\qquad=\left[\eta_1H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)\right]\prod_{p=2}^m H_{k_p}(\eta_p).
\label{(C4)}
\end{eqnarray}
On the other hand, $\eta_1 H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)=H_{k_1}(\eta_1)$ by
formula (\ref{(C3)}). These relations imply
formula~(\ref{(10.20)}), i.e. It\^o's formula.
\medskip
I present the proof of another important property of the Hermite
polynomials in the following Proposition~C2.
\medskip\noindent
{\bf Proposition~C2 on the completeness of the orthogonal system
of Hermite polynomials.}\index{Hermite polynomials} {\it The
Hermite polynomials $H_k(x)$, $k=0,1,2,\dots$, defined in
formula~(\ref{(C1)}) constitute a complete orthogonal system
in the $L_2$-space of the functions square integrable with
respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ on the real line.}
\medskip\noindent
{\it Proof of Proposition C2.} Let us consider the orthogonal
complement of the subspace generated by the Hermite polynomials
in the space of the square integrable functions with respect
to the measure $\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$. It is enough
to prove that this orthogonal complement contains only the
identically zero function. Since the orthogonality of a function to
all polynomials of the form $x^k$, $k=0,1,2,\dots$ is equivalent
to the orthogonality of this function to all Hermite polynomials
$H_k(x)$, $k=0,1,2,\dots$, Proposition~C2 can be reformulated in
the following form:
If a function $g(x)$ on the real line is such that
\begin{equation}
\int_{-\infty}^\infty x^k g(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \textrm{for all }k=0,1,2,\dots \label{(C5)}
\end{equation}
and
\begin{equation}
\int_{-\infty}^\infty g^2(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx<\infty,
\label{(C6)}
\end{equation}
then $g(x)=0$ for almost all $x$.
Given a function $g(x)$ on the real line whose absolute value is
integrable with respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ define the (finite)
measure $\nu_g$,
$$
\nu_g(A)=\int_A g(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
$$
on the measurable sets of the real line together with its Fourier
transform $\tilde\nu_g(t)=\int_{-\infty}^\infty e^{itx}\nu_g(\,dx)$.
(This measure $\nu_g$ and its Fourier transform can
be defined for all functions~$g$ satisfying relation (\ref{(C6)}),
because their absolute value is integrable with respect to
the Gaussian measure.) First I show that Proposition~C2 can be
reduced to the following statement: If a function $g$ satisfies
both (\ref{(C5)}) and (\ref{(C6)})
then $\tilde\nu_g(t)=0$ for all $-\infty<t<\infty$.
\begin{equation}
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le A(k)
P\left(\|\bar I_{n,k}(f(\ell))\|>\gamma(k)u\right)
\label{(14.13d)}
\end{equation}
with some constants $A(k)>0$ and $\gamma(k)>0$ depending only
on the order~$k$ of these generalized $U$-statistics.
We concentrate mainly on the proof of the
generalization (\ref{(14.13d)}) of relation (\ref{(14.13)}).
Formula~(\ref{(14.14)}) is a relatively simple consequence of
it. Formula~(\ref{(14.13d)}) will be proved by means of an
inductive procedure which works only in this more general
setting. It will be derived from the following statement.
Let us take two independent copies $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$
and $\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ of our original sequence of
random variables $\xi_1,\dots,\xi_n$, and introduce for all sets
$V\subset \{1,\dots,k\}$ the function $\alpha_V(j)$, $1\le j\le k$,
defined as $\alpha_V(j)=1$ if $j\in V$ and $\alpha_V(j)=2$ if
$j\notin V$. Let us define with their help the following
version of decoupled $U$-statistics
\begin{eqnarray}
I_{n,k,V}(f(\ell))
&&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots,k}
\!\!\!\!
f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\alpha_V(1))},\dots,
\xi_{l_k}^{(\alpha_V(k))}\right) \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\quad \textrm{for all }
V\subset \{1,\dots,k\}.
\label{(D3)}
\end{eqnarray}
The following inequality will be proved: There are some constants
$C_k>0$ and $D_k>0$ depending only on the order $k$ of the
generalized $U$-statistic $I_{n,k}(f(\ell))$ such that for all
numbers $u>0$
\begin{equation}
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le
\sum_{V\subset\{1,\dots,k\},\,1\le|V|\le k-1} C_kP\left(D_k\|
I_{n,k,V}(f(\ell))\|>u\right). \label{(D4)}
\end{equation}
Here $|V|$ denotes the cardinality of the set $V$, and the
condition $1\le |V|\le k-1$ in the summation of
formula~(\ref{(D4)}) means that the
sets $V=\emptyset$ and $V=\{1,\dots,k\}$ are omitted from the
summation, i.e. the terms where either $\alpha_V(j)=1$
or $\alpha_V(j)=2$ for all $1\le j\le k$ are not considered.
Formula (\ref{(14.13d)}) can be derived from
formula~(\ref{(D4)}) by means of an inductive argument. The
hard part of the problem is to prove formula~(\ref{(D4)}).
To do this first we prove the following simple lemma.
\medskip\noindent
{\bf Lemma D1.} {\it Let $\xi$ and $\eta$ be two independent and
identically distributed random variables taking values in a
separable Banach space~$B$. Then
$$
3P\left(|\xi+\eta|>\frac 23u\right)\ge P(|\xi|>u)
\quad \textrm{for all }u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma D1.}\/ Let $\xi$, $\eta$ and
$\zeta$ be three independent, identically distributed
random variables taking values in~$B$. Then
\begin{eqnarray*}
3P\left(|\xi+\eta|>\frac23 u\right)
&&=P\left(|\xi+\eta|>\frac23 u\right)
+P\left(|\xi+\zeta|>\frac23 u\right) \\
&&\qquad +P\left(|-(\eta+\zeta)|>\frac23 u\right)\\
&&\ge P(|\xi+\eta+\xi+\zeta-\eta-\zeta|>2u)=P(|\xi|>u).
\end{eqnarray*}
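Lemma~D1 can be sanity-checked by exact enumeration when $\xi$ takes only finitely many values. The sketch below (my illustration with hypothetical discrete distributions, not part of the proof) computes both sides of the inequality with rational arithmetic.

```python
from fractions import Fraction
from itertools import product

def lemma_d1_holds(dist, u):
    # dist: list of (value, probability) pairs describing xi;
    # eta is an independent copy of xi.
    # Checks P(|xi| > u) <= 3 * P(|xi + eta| > 2u/3) exactly.
    lhs = sum(p for v, p in dist if abs(v) > u)
    rhs = 3 * sum(p * q for (v, p), (w, q) in product(dist, dist)
                  if abs(v + w) > Fraction(2, 3) * u)
    return lhs <= rhs
```

For example, for a Rademacher variable and $u=9/10$ the left-hand side is $1$ while the right-hand side is $3/2$, so the bound holds with room to spare.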
\medskip
To prove formula (\ref{(D4)}) we introduce the random variable
\begin{eqnarray}
T_{n,k}(f(\ell))&=&\frac1{k!}
\sum_{\substack {(l_1,\dots,l_k),\; (s_1,\dots,s_k) \colon\\
1\le l_j\le n,\, s_j=1 \textrm{ or }s_j=2,\; j=1,\dots, k,}}
\!\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\right)
\nonumber \\
&=& \sum_{V\subset\{1,\dots,k\}}\!\!\!\!\!
I_{n,k,V}(f(\ell)). \label{(D5)}
\end{eqnarray}
The random variables $I_{n,k}(f(\ell))$,
$I_{n,k,\emptyset}(f(\ell))$ and $I_{n,k,\{1,\dots,k\}}(f(\ell))$
are identically distributed, and the last two random variables are
independent of each other. Hence Lemma~D1 yields that
\begin{eqnarray}
&&P(\|I_{n,k}(f(\ell))\|>u)
\le3P\left(\|I_{n,k,\emptyset}(f(\ell))
+I_{n,k,\{1,\dots,k\}}(f(\ell))\|>\frac23u\right) \nonumber\\
&&\qquad =3P\left(\left\|T_{n,k}(f(\ell))-\!\!\!\!\!\!
\sum_{V\colon\, V\subset\{1,\dots,k\},\,
1\le|V|\le k-1} I_{n,k,V}(f(\ell))\right\|>\frac23u\right)
\!\!\!\!\!\! \nonumber \\
&&\qquad \le 3P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u) \nonumber\\
&&\qquad\qquad\qquad+
\!\!\!\!\!\!\!\!\!
\sum_{V\colon\, V\subset\{1,\dots,k\},\, 1\le|V|\le k-1}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
3P(3\cdot2^{k-1}\|I_{n,k,V}(f(\ell))\|>u). \label{(D6)}
\end{eqnarray}
To derive relation (\ref{(D4)}) from relation (\ref{(D6)}) we
need a good upper bound on the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u)$. To get such an estimate
we shall compare the tail distribution of $\|T_{n,k}(f(\ell))\|$
with that of $\|I_{n,k,V}(f(\ell))\|$ for an arbitrary set
$V\subset\{1,\dots,k\}$. This will be done with the
help of Lemmas~D2 and~D4 formulated below.
In Lemma~D2 a random variable $\|\hat I_{n,k,V}(f(\ell))\|$
will be constructed whose distribution agrees with the
distribution of $\|I_{n,k,V}(f(\ell))\|$. The expression
$\hat I_{n,k,V}(f(\ell))$, whose norm will be investigated,
will be defined in formulas~(\ref{(D7)}) and~(\ref{(D8)}).
It is a random polynomial of some Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$. The coefficients of
this polynomial are random variables, independent of the
Rademacher functions $\varepsilon_1,\dots,\varepsilon_n$.
Beside this, the constant term of this polynomial equals
$T_{n,k}(f(\ell))$. These properties of the polynomial
$\hat I_{n,k,V}(f(\ell))$ together with Lemma~D4 formulated
below enable us to prove such an estimate on the distribution
of $\|T_{n,k}(f(\ell))\|$ which together with
formula~(\ref{(D6)}) implies relation~(\ref{(D4)}). Let us
formulate these lemmas.
\medskip\noindent
{\bf Lemma D2.} {\it Let us consider a sequence of independent
random variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is also independent of the random variables
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ appearing in the definition of
the modified decoupled $U$-statistics $I_{n,k,V}(f(\ell))$ given
in formula (\ref{(D3)}). Let us define with their help the
sequences of random variables $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$
and $\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ whose elements
$(\eta_l^{(1)},\eta_l^{(2)})
=(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))$,
$1\le l\le n$, are defined by the formula
$$
(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))
=\left(\frac{1+\varepsilon_l}2\xi_l^{(1)}+
\frac{1-\varepsilon_l}2\xi_l^{(2)},\frac{1-\varepsilon_l}2\xi_l^{(1)}+
\frac{1+\varepsilon_l}2\xi_l^{(2)}\right),
$$
i.e. let $(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(1)},\xi_l^{(2)})$ if $\varepsilon_l=1$, and
$(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(2)},\xi_l^{(1)})$ if $\varepsilon_l=-1$, $1\le l\le n$.
Then the joint distribution of the pair of sequences of random
variables $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ agrees with that of the pair of
sequences $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$, which is also independent of the
sequence $\varepsilon_1,\dots,\varepsilon_n$.
Let us fix some $V\subset\{1,\dots,k\}$, and introduce the random
variable
\begin{equation}
\hat I_{n,k,V}(f(\ell))=\frac1{k!}\sum_{(l_1,\dots,l_k) \colon\,
1\le l_j\le n,\; j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right), \label{(D7)}
\end{equation}
where similarly to formula (\ref{(D3)}) $\alpha_V(j)=1$ if
$j\in V$, and $\alpha_V(j)=2$ if $j\notin V$. Then the identity
\begin{eqnarray}
&&2^k\hat I_{n,k,V}(f(\ell)) \label{(D8)} \\
&&\quad=\frac1{k!}
\!\!\sum_{\substack {(l_1,\dots,l_k),
\;(s_1,\dots,s_k)\colon\\
1\le l_j\le n,\; s_j=1 \textrm{ or }s_j=2, \\
\;j=1,\dots, k,}}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right) \nonumber
\end{eqnarray}
holds, where $\kappa^{(j)}_{1,V}=1$ and $\kappa^{(j)}_{2,V}=-1$ if
$j\in V$, and $\kappa^{(j)}_{1,V}=-1$ and $\kappa^{(j)}_{2,V}=1$ if
$j\notin V$, i.e. $\kappa_{1,V}^{(j)}=3-2\alpha_V(j)$ and
$\kappa_{2,V}^{(j)}=-\kappa_{1,V}^{(j)}$.}
\medskip
Before the formulation of Lemma~D4 another Lemma~D3 will be
presented which will be applied in its proof.
\medskip\noindent
{\bf Lemma D3.} {\it Let $Z$ be a random variable taking values in
a separable Banach space $B$ with expectation zero, i.e. let
$E\kappa(Z)=0$ for all $\kappa\in B'$, where $B'$ denotes the
(Banach) space of all (bounded) linear transformations of $B$ to
the real line. Then $P(\|v+Z\|\ge\|v\|)\ge \inf\limits_{\kappa\in B'}
\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$ for all $v\in B$.}
\medskip\noindent
{\bf Lemma D4.} {\it Let us consider a positive integer $n$ and
a sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le l\le n$.
Beside this,
fix some positive integer $k$, take a separable Banach space~$B$ and
choose some elements $a_s(l_1,\dots,l_s)$ of this Banach space $B$,
$1\le s\le k$, $1\le l_j\le n$, $l_j\neq l_{j'}$ if $j\neq j'$,
$1\le j,j'\le s$. With the above notations the inequality
\begin{equation}
P\left(\left\|v+\sum_{s=1}^k\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s}\right\|
\ge\|v\|\right)\ge c_k \label{(D9)}
\end{equation}
holds for all $v\in B$ with some constant $c_k>0$ which depends
only on the parameter $k$. In particular, it does not depend on
the norm in the separable Banach space~$B$.}
\medskip\noindent
{\it Proof of Lemma D2.}\/ Let us consider the conditional
joint distribution of the sequences of random variables
$\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ under the condition that the
random vector $\varepsilon_1,\dots,\varepsilon_n$ takes
the value of some prescribed
$\pm1$ series of length~$n$. Observe that this conditional
distribution agrees with the joint distribution of the sequences
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ for all possible conditions.
This fact implies the statement about the joint distribution of
the sequences $(\eta_l^{(1)},\eta_l^{(2)})$, $1\le l\le n$ and their
independence of the sequence $\varepsilon_1,\dots,\varepsilon_n$.
To prove identity~(\ref{(D8)}) let us fix a set
$M\subset\{1,\dots,n\}$, and consider the case when
$\varepsilon_l=1$ if $l\in M$ and $\varepsilon_l=-1$ if
$l\notin M$. Put $\beta_{V,M}(j,l)=1$ if $j\in V$ and $l\in M$
or $j\notin V$ and $l \notin M$, and let $\beta_{V,M}(j,l)=2$
otherwise. Then we have for all $(l_1,\dots,l_k)$,
$1\le l_j\le n$, $1\le j\le k$, and our fixed set $V$
\begin{eqnarray}
&&\sum_{\substack{(s_1,\dots,s_k)\colon\\
s_j=1 \textrm{ or }s_j=2,\;j=1,\dots, k}}
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right) \nonumber \\
&&\qquad\qquad\qquad =2^k f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right),
\label{(D10)}
\end{eqnarray}
since the product $(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})$
equals either zero or $2^k$, and it equals $2^k$ for that
sequence $(s_1,\dots,s_k)$ for which
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ for all
$1\le j\le k$, and the relation
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ is
equivalent to $\beta_{V,M}(j,l_j)=s_j$ for all $1\le j\le k$.
(In relation~(\ref{(D10)}) it is sufficient to consider only
such products for which $l_j\neq l_{j'}$ if $j\neq j'$
because of the properties of the functions $f_{l_1,\dots,l_k}$.)
Beside this, $\xi_l^{(\beta_{V,M}(j,l))}=\eta_l^{(\alpha_V(j))}$
for all $1\le l\le n$ and $1\le j\le k$, and as a consequence
$$f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right)=
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right).
$$
Summing up the identities (\ref{(D10)}) for all
$1\le l_1,\dots,l_k\le n$ and applying the last identity we
get relation~(\ref{(D8)}), since the identity obtained in such
a way holds for all $M\subset\{1,\dots,n\}$.
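The pivotal step of this proof, namely that the factor $(1+\kappa^{(j)}_{s_j,V}\varepsilon_{l_j})$ equals $2$ exactly when $s_j=\beta_{V,M}(j,l_j)$ and $0$ otherwise, can be confirmed by checking all cases. The following sketch (my illustration, encoding the memberships $j\in V$ and $l\in M$ as booleans) enumerates them.

```python
from itertools import product

def kappa(j_in_V, s):
    # kappa^{(j)}_{1,V} = 1 if j in V, -1 otherwise; kappa^{(j)}_{2,V} is its negative.
    base = 1 if j_in_V else -1
    return base if s == 1 else -base

def beta(j_in_V, l_in_M):
    # beta_{V,M}(j,l) = 1 if j in V and l in M, or j not in V and l not in M; else 2.
    return 1 if j_in_V == l_in_M else 2

def check_selection():
    # epsilon_l = 1 if l in M, -1 otherwise; the factor 1 + kappa * epsilon
    # must equal 2 exactly for s = beta_{V,M}(j,l) and 0 for the other value of s.
    for j_in_V, l_in_M, s in product([True, False], [True, False], [1, 2]):
        eps = 1 if l_in_M else -1
        factor = 1 + kappa(j_in_V, s) * eps
        if factor != (2 if s == beta(j_in_V, l_in_M) else 0):
            return False
    return True
```

Since the selection rule acts coordinatewise, checking the eight cases of one coordinate verifies the whole product identity~(\ref{(D10)}).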
\medskip\noindent
{\it Proof of Lemma D3.}\/ Let us first observe that if $\xi$
is a real valued random variable with zero expectation, then
$P(\xi\ge0) \ge \frac{(E|\xi|)^2}{4E\xi^2}$ since $(E|\xi|)^2
=4(E(\xi I(\{\xi\ge0\}))^2\le 4P(\xi\ge0)E\xi^2$ by the Schwarz
inequality, where $I(A)$ denotes the indicator function of
the set~$A$. (In the above calculation and in the subsequent proofs
I apply the convention $\frac00=1$. We need this convention if
$E\xi^2=0$. In this case we have the identities $P(\xi=0)=1$ and
$E|\xi|=0$, hence the above proved inequality holds in this
case, too.)
Given some $v\in B$, let us choose a linear operator $\kappa$ such
that $\|\kappa\|=1$, and $\kappa(v)=\|v\|$. Such an operator exists
by the Hahn--Banach theorem. Observe that
$\{\omega\colon\,\|v+Z(\omega)\|
\ge\|v\|\} \supset\; \{\omega\colon\,
\kappa(v+Z(\omega))\ge\kappa(v)\}
=\{\omega\colon\, \kappa(Z(\omega))\ge0\}$. Beside this,
$E\kappa(Z)=0$. Hence we can apply the above proved inequality
for $\xi=\kappa(Z)$, and it yields that
$P(\|v+Z\|\ge\|v\|)\ge P(\kappa(Z)\ge0)
\ge\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$. Lemma~D3 is proved.
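The scalar inequality $P(\xi\ge0)\ge(E|\xi|)^2/4E\xi^2$ proved above can be illustrated on a discrete mean-zero random variable. The sketch below (the distribution is a hypothetical example of mine) computes both sides exactly with rational arithmetic, applying the convention $\frac00=1$ from the proof.

```python
from fractions import Fraction

def check_scalar_bound(dist):
    # dist: list of (value, probability) pairs of a random variable with E xi = 0.
    # Verifies P(xi >= 0) >= (E|xi|)^2 / (4 E xi^2), with the convention 0/0 = 1.
    assert sum(v * p for v, p in dist) == 0     # zero expectation required
    p_nonneg = sum(p for v, p in dist if v >= 0)
    e_abs = sum(abs(v) * p for v, p in dist)
    e_sq = sum(v * v * p for v, p in dist)
    bound = Fraction(1) if e_sq == 0 else e_abs ** 2 / (4 * e_sq)
    return p_nonneg >= bound
```

For the variable taking the value $3$ with probability $\frac14$ and $-1$ with probability $\frac34$ the bound gives $\frac3{16}$ against $P(\xi\ge0)=\frac14$.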
\medskip\noindent
{\it Proof of Lemma D4.}\/
Take the class of random polynomials
$$
Y=\sum_{s=1}^k\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
b_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
where $\varepsilon_l$, $1\le l\le n$, are independent random
variables with $P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
and the coefficients
$b_s(l_1,\dots,l_s)$, $1\le s\le k$, are arbitrary real numbers.
The proof of Lemma~D4 can be reduced to the statement that there
exists a constant $c_k>0$ depending only on the order~$k$ of these
polynomials such that the inequality
\begin{equation}
(E|Y|)^2\ge 4c_k EY^2 \label{(D11)}
\end{equation}
holds for all such polynomials~$Y$. Indeed, consider the polynomial
$$
Z=\sum_{s=1}^k\sum_{\substack {(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
and observe that $E\kappa(Z)=0$ for all linear functionals $\kappa$
on the space $B$. Hence Lemma~D3 implies that the left-hand side
expression in~(\ref{(D9)}) is bounded from below by
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$.
On the other hand, relation~(\ref{(D11)}) implies that
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}
{4E\kappa(Z)^2}\ge c_k$.
To prove relation (\ref{(D11)}) first we compare the moments
$EY^2$ and $EY^4$. Let us introduce the random variables
$$
Y_s=\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\ l_j\neq l_{j'} \textrm{ if }
j\neq j'}}
b_s(l_1,\dots,l_s) \varepsilon_{l_1}\cdots\varepsilon_{l_s}
\quad 1\le s\le k.
$$
We shall show that the estimates of Section~13 imply that
\begin{equation}
EY_s^4\le 2^{4s} \left(EY_s^2\right)^2 \label{(D12)}
\end{equation}
for these random variables $Y_s$.
Relation (\ref{(D12)}) together with the uncorrelatedness of
the random variables $Y_s$, $1\le s\le k$, imply that
\begin{eqnarray*}
EY^4
&=&E\left(\sum_{s=1}^k Y_s\right)^4\le k^3\sum_{s=1}^k EY_s^4\le
k^3 2^{4k} \sum_{s=1}^k (EY_s^2)^2\\
&\le& k^3 2^{4k}\left(\sum_{s=1}^k EY_s^2\right)^2=k^3 2^{4k}(EY^2)^2.
\end{eqnarray*}
This estimate together with the H\"older inequality with $p=3$ and
$q=\frac32$ yield that
$$
EY^2=E\left(|Y|^{4/3}\cdot|Y|^{2/3}\right)\le
(EY^4)^{1/3}(E|Y|)^{2/3}\le k2^{4k/3}(EY^2)^{2/3}(E|Y|)^{2/3},
$$
i.e. $EY^2\le k^32^{4k}(E|Y|)^2$, and relation (\ref{(D11)}) holds
with $4c_k=k^{-3}2^{-4k}$. Hence to complete the proof of Lemma~D4
it is enough to check relation~(\ref{(D12)}).
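The resulting inequality $EY^2\le k^32^{4k}(E|Y|)^2$ can be checked exactly on small examples by enumerating all $2^n$ sign patterns. The following sketch (my illustration with hypothetical coefficients, not part of the proof) does this for a second order polynomial of three Rademacher variables.

```python
from fractions import Fraction
from itertools import product

def moments(linear, quadratic, n):
    # Exact E|Y| and EY^2 for
    # Y = sum_l linear[l]*eps_l + sum_{(l,m)} quadratic[(l,m)]*eps_l*eps_m
    # (keys (l, m) with l != m), averaged over all 2^n sign patterns.
    e_abs = e_sq = Fraction(0)
    p = Fraction(1, 2 ** n)
    for eps in product([1, -1], repeat=n):
        y = sum(c * eps[l] for l, c in enumerate(linear))
        y += sum(c * eps[l] * eps[m] for (l, m), c in quadratic.items())
        e_abs += p * abs(y)
        e_sq += p * y * y
    return e_abs, e_sq
```

For $Y=\varepsilon_1-2\varepsilon_2+\varepsilon_3+\varepsilon_1\varepsilon_2 -\varepsilon_1\varepsilon_3+2\varepsilon_2\varepsilon_3$ one gets $E|Y|=\frac52$ and $EY^2=12$, comfortably below $k^32^{4k}(E|Y|)^2=8\cdot256\cdot\frac{25}4$ with $k=2$.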
In the proof of relation (\ref{(D12)}) it can be assumed that the
coefficients $b_s(l_1,\dots,l_s)$ of the random variable $Y_s$ are
symmetric functions of the arguments
$l_1,\dots,l_s$, since a symmetrization of these coefficients does
not change the value of $Y$. Put
$$
B^2_s=\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
b_s^2(l_1,\dots,l_s), \quad 1\le s\le k.
$$
Then
$$
EY_s^2=s! B_s^2,
$$
and
$$
EY_s^4\le 1\cdot3\cdot5\cdots(4s-1)B_s^4
=\frac{(4s)!}{2^{2s}(2s)!}B_s^4
$$
by Lemmas 13.4 and 13.5 with the choice $M=2$ and $k=s$.
Inequality~(\ref{(D12)}) follows from the last two relations.
Indeed, to prove formula~(\ref{(D12)}) by means of these
relations it is enough to check that
$\frac{(4s)!}{2^{2s}(2s)!(s!)^2}\le 2^{4s}$. But it is easy to
check this inequality with induction with respect to $s$.
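This induction can be carried out explicitly. Put
$Q_s=\frac{(4s)!}{2^{2s}(2s)!(s!)^2}$. Then
$Q_1=\frac{4!}{4\cdot2}=3\le2^4$, and
$$
\frac{Q_{s+1}}{Q_s}
=\frac{(4s+1)(4s+2)(4s+3)(4s+4)}{4\,(2s+1)(2s+2)(s+1)^2}
=\frac{(4s+1)(4s+3)}{(s+1)^2}\le 16=2^4,
$$
since $(4s+1)(4s+3)=16s^2+16s+3\le16(s+1)^2$. Hence
$Q_s\le2^{4s}$ for all $s\ge1$.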
(Actually, there is a well-known inequality in the literature,
known under the name Borell's inequality, which implies
inequality~(\ref{(D12)}) with a better coefficient at the right
hand side of this estimate.) We have proved Lemma~D4.
\medskip
Let us turn back to the estimation of the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f)\|>u)$. Let us introduce the
$\sigma$-algebra ${\cal F}={\cal B}(\xi_l^{(1)},\xi_l^{(2)},\,1\le
l\le n)$ generated by the random variables $\xi_l^{(1)},\xi_l^{(2)}$,
$1\le l\le n$, and fix some set $V\subset\{1,\dots,k\}$.
I show with the help of Lemma~D4 and formula~(\ref{(D8)}) that
there exists some constant $c_k>0$ such that the random
variables $T_{n,k}(f(\ell))$ defined in formula~(\ref{(D5)}) and
$\hat I_{n,k,V}(f(\ell))$ defined in formula~(\ref{(D7)}) satisfy
the inequality
\begin{equation}
P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\cal F}\right)
\ge c_k \quad \textrm{ with probability 1.} \label{(D13)}
\end{equation}
In the proof of~(\ref{(D13)}) we shall exploit that in
formula~(\ref{(D8)}) $2^k\hat I_{n,k,V}(f(\ell))$ is represented
by a polynomial of the Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$ whose constant term is
$T_{n,k}(f(\ell))$. The coefficients of this polynomial are
functions of the random variables $\xi^{(1)}_l$ and $\xi^{(2)}_l$,
$1\le l\le n$. The independence of these random variables from
$\varepsilon_{l}$, $1\le l\le n$, and the definition of the
$\sigma$-algebra ${\cal F}$ yield that
\begin{eqnarray}
&&P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>
\|T_{n,k}(f(\ell))\||{\cal F}\right) \label{(D14)} \\
&&\qquad=P_{\varepsilon_V}\biggl(\biggl\|\frac1{k!}
\sum_{\substack{(l_1,\dots,l_k),\; (s_1,\dots,s_k)\colon\\
1\le l_j\le n, s_j=1 \textrm{ or }s_j=2,\\
j=1,\dots, k,}}
\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k}) \nonumber \\
&&\qquad\qquad \qquad\qquad\qquad\qquad\qquad
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right)\biggr\| \nonumber \\
&&\qquad \qquad\qquad\qquad\qquad\qquad
>\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|\biggr),
\nonumber
\end{eqnarray}
where $P_{\varepsilon_V}$ means that the values of the
random variables $\xi_l^{(1)}$, $\xi_l^{(2)}$, $1\le l\le n$,
are fixed (their values depend on the atom of the
$\sigma$-algebra ${\cal F}$ we are considering), and the
probability is taken with respect to the remaining random
variables $\varepsilon_l$, $1\le l\le n$. On the right-hand
side of (\ref{(D14)}) we consider the probability of the
event that the norm of a polynomial of order~$k$ of the
random variables $\varepsilon_1,\dots,\varepsilon_n$ is larger
than
$\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|$.
Besides this, the constant term of this polynomial
equals~$T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)$.
Hence this probability can be bounded by means of Lemma~D4,
and this result yields relation~(\ref{(D13)}).
As the distributions of $I_{n,k,V}(f(\ell))$ and
$\hat I_{n,k,V}(f(\ell))$ agree by the first statement of Lemma~D2
and a comparison of formulas~(\ref{(D3)}) and~(\ref{(D7)}),
relation (\ref{(D13)})
implies that
\begin{eqnarray*}
&&P\left(\|2^k I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right)
=P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right) \\
&&\qquad
\ge P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|\ge\|T_{n,k}(f(\ell))\|,\;
\|T_{n,k}(f(\ell))\|\ge\frac13\cdot2^{1-k} u\right)\\
&&\qquad=\int_{\{\omega\colon\, \|T_{n,k}(f(\ell))(\omega)\|
\ge\frac13\cdot2^{1-k} u\}}
\!\!\!\!\!\!\!
P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\cal F}\right)\,dP\\
&&\qquad \ge c_k P(3\cdot2^{k-1} \|T_{n,k}(f(\ell))\|\ge u).
\end{eqnarray*}
The last inequality, applied with an arbitrary set
$V\subset\{1,\dots,k\}$, $1\le |V|\le k-1$, together with
relation~(\ref{(D6)}) implies formula~(\ref{(D4)}).
Relation (\ref{(14.13d)}) will be proved together with another
inductive hypothesis with the help of relation~(\ref{(D4)}) by
means of an induction procedure with respect to the order $k$
of the $U$-statistic. To formulate this new inductive
hypothesis some new quantities will be introduced. Let
${\cal W}={\cal W}(k)$ denote the set of all partitions of
the set $\{1,\dots,k\}$. Let us fix $k$ independent copies
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of the
sequence of random variables $\xi_{1},\dots,\xi_n$. Given a
partition $W=(U_1,\dots,U_s)\in{\cal W}(k)$ let us introduce
the function $s_W(j)$, $1\le j\le k$, which tells for each
argument $j$ the index of that element of the partition~$W$
which contains the point $j$, i.e.\ the value of the function
$s_W(j)$ at a point $j$ is defined by the
relation $j\in U_{s_W(j)}$. Let us introduce the expression
\begin{eqnarray*}
I_{n,k,W}(f(\ell))
&&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_W(1))},
\dots,\xi_{l_k}^{(s_W(k))}\right)\\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{for all }W\in{\cal W}(k).
\end{eqnarray*}
An expression of the form $I_{n,k,W}(f(\ell))$, $W\in{\cal W}(k)$,
will be called a decoupled $U$-statistic with generalized
decoupling. Given a partition $W=(U_1,\dots,U_s)\in{\cal W}(k)$
let us call the number $s$, i.e.\ the number of elements of
this partition, the rank both of the partition $W$ and of the
decoupled $U$-statistic $I_{n,k,W}(f(\ell))$ with generalized
decoupling.
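To illustrate these notions, take $k=3$ and the partition
$W=(\{1,3\},\{2\})\in{\cal W}(3)$ of rank~2. Then
$s_W(1)=s_W(3)=1$, $s_W(2)=2$, and
$$
I_{n,3,W}(f(\ell))=\frac1{3!}\sum_{(l_1,l_2,l_3)\colon\,
1\le l_j\le n,\;j=1,2,3}
f_{l_1,l_2,l_3}\left(\xi_{l_1}^{(1)},\xi_{l_2}^{(2)},
\xi_{l_3}^{(1)}\right),
$$
i.e.\ coordinates belonging to the same element of the partition
are evaluated on the same copy of the sequence
$\xi_1,\dots,\xi_n$. The partition into singletons, which has
rank~$k$, corresponds to complete decoupling, while the trivial
partition of rank~1 uses a single copy of the sequence.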
Now I formulate the following hypothesis. For all $k\ge2$ and
$2\le j\le k$ there exist some constants $C(k,j)>0$ and
$\delta(k,j)>0$ such that for all $W\in{\cal W}(k)$ a decoupled
$U$-statistic $I_{n,k,W}(f(\ell))$ with generalized decoupling
satisfies the inequality
\begin{eqnarray}
&&P(\|I_{n,k,W}(f(\ell))\|>u)\le C(k,j)P\left(\|\bar
I_{n,k}(f(\ell))\|>\delta(k,j) u\right) \nonumber \\
&&\qquad\qquad\qquad\textrm{for all }2
\le j\le k \textrm{ if the rank of } W \textrm{ equals }j.
\label{(D15)}
\end{eqnarray}
It will be proved by induction with respect to $k$ that both
relations~(\ref{(14.13d)}) and~(\ref{(D15)}) hold for
$U$-statistics of order~$k$.
Let us observe that for $k=2$ relation~(\ref{(14.13d)})
follows from~(\ref{(D4)}).
Relation~(\ref{(D15)}) also holds for $k=2$, since in
this case we have to consider only the case $j=k=2$, and
relation (\ref{(D15)}) clearly holds in this case with
$C(2,2)=1$ and $\delta(2,2)=1$. Hence we can start our
inductive proof with $k=3$. First I prove
relation~(\ref{(D15)}).
In relation (\ref{(D15)}) the tail-distribution of decoupled
$U$-sta\-tis\-tics with generalized decoupling is compared
with that of the decoupled $U$-statistic
$\bar I_{n,k}(f(\ell))$ introduced in~(\ref{(D2)}). Given
the order $k$ of these $U$-statistics, relation~(\ref{(D15)})
will be proved by means of a backward induction with
respect to the rank $j$ of the decoupled $U$-statistics
$I_{n,k,W}(f(\ell))$ with generalized decoupling.
Relation~(\ref{(D15)}) clearly holds for $j=k$ with $C(k,k)=1$
and $\delta(k,k)=1$. To prove it for decoupled $U$-statistics
with generalized decoupling of rank $j$, $2\le j\le k-1$, it
will be shown that for each partition $W\in{\cal W}(k)$ of
rank $j$ there exists a partition $\bar W\in{\cal W}(k)$ of
strictly larger rank such that
\begin{equation}
P(\|I_{n,k,W}(f(\ell))\|>u)\le \bar A(k) P\left(\|I_{n,k,\bar W}
(f(\ell))\|>\bar \gamma(k) u\right) \label{(D16)}
\end{equation}
with $\bar A(k)=\sup\limits_{2\le p\le k-1}A(p)$,
$\bar\gamma(k)=\inf\limits_{2\le p\le k-1}\gamma(p)$ if the
rank $j$ of $W$ is such that $2\le j\le k-1$, where the
constants $A(p)$ and $\gamma(p)$ agree with the corresponding
coefficients in formula~(\ref{(14.13d)}).
To prove relation~(\ref{(D16)}) (in the case
$U_j=\{t,\dots,k\}$) let us define the $\sigma$-algebra
${\cal F}$ generated by the random variables appearing in
the first $t-1$ coordinates of these $U$-statistics, i.e.\
by the random variables $\xi^{s_W(j)}_{l_j}$,
$1\le j\le t-1$, $1\le l_j\le n$. We have
$2\le t\le k-1$. By our inductive
hypothesis relation~(\ref{(14.13d)}) holds for
$U$-statistics of order $p=k-t+1$, since $2\le p\le k-1$. I
claim that this implies that
\begin{equation}
P(\|I_{n,k,W}(f(\ell))\|>u|{\cal F})\le A(k-t+1)
P\left(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u|{\cal F}\right) \label{(D17)}
\end{equation}
with probability~1. Indeed, by the independence properties of
the random variables $\xi_l^{s_W(j)}$
(and $\xi_l^{s_{\bar W}(j)}$),
$1\le j\le k$, $1\le l\le n$,
$$
P(\|I_{n,k,W}(f(\ell))\|>u|{\cal F})
=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,W}(f(\ell))\|>u)
$$
and
\begin{eqnarray*}
&&P\left(\|I_{n,k,\bar W}(f(\ell))\|>\gamma(k-t+1)u|{\cal F}\right)\\
&&\qquad=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u),
\end{eqnarray*}
where $P_{\xi_l^{s_W(j)}, 1\le j\le t-1}$ denotes that the
values of the
random variables $\xi_l^{s_W(j)}(\omega)$, $1\le j\le t-1$,
$1\le l\le n$, are fixed, and we consider the probability that
the appropriate functions of these fixed values and of the
remaining random variables
$\xi_l^{s_W(j)}$ and $\xi_l^{s_{\bar W}(j)}$, $t\le j\le k$,
$1\le l\le n$,
satisfy the desired relation. These identities and the relation
between the partitions $W$ and $\bar W$ imply that
relation~(\ref{(D17)}) is equivalent to
inequality~(\ref{(14.13d)}) for the generalized
$U$-statistics of order $2\le k-t+1\le k-1$ with kernel functions
\begin{eqnarray*}
&&f_{l_t,\dots,l_k}(x_t,\dots,x_k)\\
&&\qquad= \!\!\!\!\!
\sum_{(l_1,\dots,l_{t-1})\colon\, 1\le l_j\le n,\;1\le j\le t-1}
\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}(\xi_{l_1}^{s_W(1)}(\omega),\dots,
\xi_{l_{t-1}}^{s_W(t-1)}(\omega),x_t,\dots,x_k).
\end{eqnarray*}
Relation~(\ref{(D16)}) follows from inequality~(\ref{(D17)}) if
expectation is taken at both sides. As the rank of $\bar W$ is
strictly greater than the rank of $W$, relation~(\ref{(D16)})
together with our backward inductive assumption imply
relation~(\ref{(D15)}) for all $2\le j\le k$.
Relation~(\ref{(D15)}) implies in particular (with the
application of partitions of order~$k$ and rank~2) that the
terms in the sum on the right-hand side of~(\ref{(D4)})
satisfy the inequality
$$
P\left(D_k\|I_{n,k,V}(f(\ell))\|>u\right)\le \bar C_k
P\left(\|\bar I_{n,k}(f(\ell))\|>\bar D_k u\right)
$$
with some appropriate $\bar C_k>0$ and $\bar D_k>0$ for all
$V\subset\{1,\dots,k\}$, $1\le|V|\le k-1$. This inequality
together with relation~(\ref{(D4)}) imply that
inequality~(\ref{(14.13d)}) also holds for
the parameter~$k$.
\medskip
In this way we obtain the proof of relation (\ref{(14.13d)}) and
of its special case, relation~(\ref{(14.13)}). Let us now prove
formula~(\ref{(14.14)}) with its help, first in the simpler case
when the supremum of finitely many functions is taken. If
$M<\infty$ functions $f_1,\dots,f_M$ are considered, then
relation~(\ref{(14.14)}) for the supremum of the $U$-statistics
and decoupled $U$-statistics with these kernel functions can be
derived from formula (\ref{(14.13)}) if it is applied for the
function $f=(f_1,\dots,f_M)$ with values in the separable
Banach space $B_M$ which consists of the vectors
$(v_1,\dots,v_M)$, $v_j\in B$, $1\le j\le M$, and the norm
$\|(v_1,\dots,v_M)\|=\sup\limits_{1\le j\le M}\|v_j\|$ is
introduced in it. The application of formula (\ref{(14.13)})
with this choice yields formula~(\ref{(14.14)}) for this
supremum. Let us emphasize that the constants appearing in
this estimate do not depend on the number~$M$. (We took only
$M<\infty$ kernel functions, because with such a choice the
Banach space $B_M$ defined above is also separable.) Since the
distributions of the random variables
$\sup\limits_{1\le s\le M}\left\|I_{n,k}(f_s)\right\|$
converge to that of
$\sup\limits_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|$, and
the distributions of the random variables $\sup\limits_{1\le s\le M}
\left\| \bar I_{n,k}(f_s)\right\|$ converge to that of
$\sup\limits_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|$ as
$M\to\infty$, relation (\ref{(14.14)}) in the general case
follows from its already proved special case and a limiting
procedure $M\to\infty$.
\medskip\noindent
{\it Remark.} Formula (\ref{(14.13d)}) proved above can be
slightly generalized. It also holds if the expressions
$I_{n,k}(f(\ell))$ and $\bar I_{n,k}(f(\ell))$ appearing in
this inequality are defined in a more general way. Namely,
they are the random functions introduced in
formulas~(\ref{(D1)}) and (\ref{(D2)}), but the sequences
$\xi_1,\dots,\xi_n$ and their independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$ in these formulas are independent
random variables which may also be non-identically distributed.
Such a generalization can be proved without any essential change
in the original proof.
\begin{thebibliography}{99}
\bibitem{r1}
Adamczak, R. (2006) Moment inequalities for
$U$-statistics. {\it Annals of Probability} {\bf34}, 2288--2314
\bibitem{r2}
Ajtai, M., Koml\'os, J. and Tusn\'ady, G. (1984) On optimal matchings.
{\it Combinatorica}\/ {\bf 4} no. 4, 259--264
\bibitem{r3}
Alexander, K. (1987) The central limit theorem for
empirical processes over Vapnik--\v{C}ervonenkis classes. {\it Annals
of Probability} {\bf 15}, 178--203
\bibitem{r4}
Arcones, M. A. and Gin\'e, E. (1993) Limit theorems for
$U$-processes. {\it Annals of Probability}, {\bf 21}, 1494--1542
\bibitem{r5}
Arcones, M. A. and Gin\'e, E. (1994) $U$-processes
indexed by Vapnik--\v{C}ervonenkis classes of functions with
application to asymptotics and bootstrap of $U$-statistics with
estimated parameters. {\it Stoch. Proc. Appl.} {\bf 52}, 17--38
\bibitem{r6}
Bennett, G. (1962) Probability inequalities for the sum of
independent random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 57}, 33--45
\bibitem{r7}
Bonami, A. (1970) \'Etude des coefficients de Fourier des
fonctions de $L^p(G)$. {\it Ann. Inst. Fourier (Grenoble)\/} {\bf 20}
335--402
\bibitem{r8}
de la Pe\~na, V. H. and Gin\'e, E. (1999) {\it
Decoupling. From dependence to independence.}\/ Springer series in
statistics. Probability and its application. Springer Verlag,
New York, Berlin, Heidelberg
\bibitem{r9}
de la Pe\~na, V. H. and Montgomery--Smith, S. (1995)
Decoupling inequalities for the tail-probabilities of multivariate
$U$-statistics. {\it Ann. Probab.}, {\bf 23}, 806--816
\bibitem{r10}
Dobrushin, R. L. (1979) Gaussian and their subordinated
self-similar random generalized fields. {\it Annals of
Probability}\/ {\bf 7}, 1--28
\bibitem{r11}
Dudley, R. M. (1978) Central limit theorems for empirical
measures. {\it Annals of Probability}\/ {\bf 6}, 899--929
\bibitem{r12}
Dudley, R. M. (1984) A course on empirical processes.
{\it Lecture Notes in Mathematics}\/ {\bf 1097}, 1--142 Springer
Verlag, New York
\bibitem{r13}
Dudley, R. M. (1989) {\it Real Analysis and
Probability.}\/ Wadsworth \& Brooks, Pacific Grove, California
\bibitem{r14}
Dudley, R. M. (1998) {\it Uniform Central Limit
Theorems.}\/ Cambridge University Press, Cambridge U.K.
\bibitem{r15}
Dynkin, E. B. and Mandelbaum, A. (1983) Symmetric
statistics, Poisson processes and multiple Wiener integrals. {\it
Annals of Statistics\/} {\bf 11}, 739--745
\bibitem{r16}
Frankl, P. and Pach J. (1983) On the number of sets in
null-$t$-design. {\it European J. Combinatorics} {\bf 4} 21--23
\bibitem{r17}
Gin\'e, E. and Guillou, A. (2001) On consistency of
kernel density estimators for randomly censored data: Rates holding
uniformly over adaptive intervals. {\it Ann. Inst. Henri
Poincar\'e PR\/} {\bf 37} 503--522
\bibitem{r18}
Gin\'e, E., Kwapie\'n, S., Lata\l{}a, R. and Zinn, J.
(2001) The LIL for canonical $U$-statistics of order~2.
{\it Annals of Probability} {\bf 29}, 520--527
\bibitem{r19}
Gin\'e, E., Lata\l{}a, R. and Zinn, J. (2000)
Exponential and moment inequalities for $U$-statistics in {\it High
dimensional probability II.} Progress in Probability 47. 13--38.
Birkh\"auser Boston, Boston, MA.
\bibitem{r20}
Gross, L. (1975) Logarithmic Sobolev inequalities.
{\it Amer. J. Math.}\/ {\bf 97}, 1061--1083
\bibitem{r21}
Guionnet, A. and Zegarlinski, B. (2003) Lectures on
Logarithmic Sobolev inequalities. {\it Lecture Notes in Mathematics}
{\bf 1801}, 1--134. Springer Verlag, New York
\bibitem{r22}
Hanson, D. L. and Wright, F. T. (1971) A bound on the
tail probabilities for quadratic forms in independent random
variables. {\it Ann. Math. Statist.} {\bf 42} 52--61
\bibitem{r23}
Hoeffding, W. (1948) A class of statistics with
asymptotically normal distribution. {\it Ann. Math. Statist.}
{\bf 19} 293--325
\bibitem{r24}
Hoeffding, W. (1963) Probability inequalities for sums
of bounded random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 58}, 13--30
\bibitem{r25}
It\^o, K. (1951) Multiple Wiener integral. {\it J. Math.
Soc. Japan}\/ {\bf 3}, 157--164
\bibitem{r26}
Kaplan, E.L. and Meier P. (1958) Nonparametric
estimation from incomplete data, {\it Journal of American
Statistical Association}, {\bf 53}, 457--481.
\bibitem{r27}
Lata\l{a}, R. (2006) Estimates of moments and tails of
Gaussian chaoses. {\it Annals of Probability} {\bf34} 2315--2331
\bibitem{r28}
Ledoux, M. (1996) On Talagrand deviation inequalities
for product measures. {\it ESAIM: Probab. Statist.}\/ {\bf 1.}
63--87. Available at http://www.emath.fr/ps/.
\bibitem{r29}
Ledoux, M. (2001) The concentration of measure phenomenon.
{\it Mathematical Surveys and Monographs}\/ {\bf 89} American
Mathematical Society, Providence, RI.
\bibitem{r30}
Major, P. (1981) Multiple Wiener--It\^o integrals. {\it
Lecture Notes in Mathematics\/} {\bf 849}, Springer Verlag, Berlin,
Heidelberg, New York
\bibitem{r31}
Major, P. (1988) On the tail behaviour of the
distribution function of multiple stochastic integrals. {\it
Probability Theory and Related Fields}, {\bf 78}, 419--435
\bibitem{r32}
Major, P. (1994) Asymptotic distributions for weighted
$U$-statistics. {\it The Annals of Probability}, {\bf 22} 1514--1535
\bibitem{r33}
Major, P. (2005) An estimate about multiple stochastic
integrals with respect to a normalized empirical measure.
{\it Studia Scientiarum Mathematicarum Hungarica.} 295--341
\bibitem{r34}
Major, P. (2005) Tail behaviour of multiple random integrals
and $U$-sta\-tis\-tics. {\it Probability Surveys.} 448--505
\bibitem{r35}
Major, P. (2006) An estimate on the maximum of a nice
class of stochastic integrals. {\it Probability Theory
and Related Fields.} {\bf 134}, 489--537
\bibitem{r36}
Major, P. (2006) A multivariate generalization of
Hoeffding's inequality. {\it Electronic Communications in
Probability} {\bf 2}, 220--229
\bibitem{r37}
Major, P. (2007) On a multivariate version of
Bernstein's inequality. {\it Electronic Journal of
Probability} {\bf12} 966--988
%\bibitem{r38}
%Major, P. (2005) On the tail behaviour of multiple
%random integrals and degenerate $U$-statistics. (First version of
%xthis lecture note) http://www.renyi.hu/\~{}major
\bibitem{r39}
Major, P. and Rejt\H{o}, L. (1988) Strong embedding of
the distribution function under random censorship. {\it Annals of
Statistics} {\bf 16}, 1113--1132
\bibitem{r40}
Major, P. and Rejt\H{o}, L. (1998) A note on
nonparametric estimations. In the conference volume to the 65.
birthday of Mikl\'os Cs\"org\H{o}. 759--774
\bibitem{r41}
Malyshev, V. A. and Minlos, R. A. (1991) Gibbs Random
Fields. Method of cluster expansion. Kluwer, Academic Publishers,
Dordrecht
\bibitem{r42}
Massart, P. (2000) About the constants in Talagrand's
concentration inequalities for empirical processes.
{\it Annals of Probability}\/ {\bf 28}, 863--884
\bibitem{r43}
McKean, H. P. (1973) Wiener's theory of non-linear
noise. In {\it Stochastic Differential Equations}
SIAM--AMS Proc. {\bf 6}, 197--209
\bibitem{r44}
Nelson, E. (1973) The free Markov field. {\it J. Functional
Analysis}\/ {\bf 12}, 211--227
\bibitem{r45}
Pollard, D. (1984) {\it Convergence of Stochastic
Processes.}\/ Springer Verlag, New York
\bibitem{r46}
Rota, G.-C. and Wallstrom, T. C. (1997) Stochastic
integrals: a combinatorial approach. {\it Annals of Probability}
{\bf 25} (3) 1257--1283
\bibitem{r47}
Surgailis, D. (1984) On multiple Poisson stochastic
integrals and associated Markov semigroups. {\it Probab. Math.
Statist.} {\bf 3}, no. 2, 217--239
\bibitem{r48}
Surgailis, D. (2000) Long-range dependence and Appell
rank. {\it Annals of Probability} {\bf 28} 478--497
%\bibitem{r41}
%Surgailis, D. (2000) CLTs for polynomials of linear
%sequences: Diagram formulae with illustrations. in {\it Long Range
%Dependence} 111--128 Birkh\"auser, Boston, Boston, MA.
\bibitem{r49}
Szeg\H{o}, G. (1967) {\it Orthogonal Polynomials.}
American Mathematical Society Colloquium Publications. Vol. 23
\bibitem{r50}
Takemura, A. (1983) Tensor Analysis of ANOVA
decomposition. {\it J. Amer. Statist. Assoc.} {\bf 78}, 894--900
\bibitem{r51}
Talagrand, M. (1994) Sharper bounds for Gaussian and
empirical processes. {\it Annals of Probability} {\bf 22}, 28--76
\bibitem{r52}
Talagrand, M. (1996) New concentration inequalities in
product spaces. {\it Invent. Math.} {\bf 126}, 505--563
\bibitem{r53}
Talagrand, M. (2005) {\it The Generic Chaining.}
Springer Monographs in Mathematics. Springer Verlag, Berlin
Heidelberg New York
\bibitem{r54}
Vapnik, V. N. (1995) {\it The Nature of Statistical
Learning Theory.} Springer Verlag, New York
\end{thebibliography}
\backmatter
%\begin{theindex}
\printindex
%\end{theindex}
\end{document}