\documentclass[graybox,envcountchap,sectrefs]{svmono}
\usepackage{amssymb,amsmath}
\usepackage{amsfonts}
\usepackage{epsfig,wrapfig}
\usepackage{mathptmx}
\usepackage{helvet}
\usepackage{courier}
\usepackage{type1cm}
\usepackage{makeidx}
\usepackage{multicol}
\makeindex
\font\script=cmcsc10
\makeatletter
\renewcommand{\theenumi}{\alph{enumi}}
\renewcommand{\labelenumi}{\theenumi)}
\makeatother
\begin{document}
\author{P\'eter Major}
\title{On the estimation of multiple random
integrals and $U$-statistics}
\subtitle{-- Lecture Note --}
%\subtitle{-- Monograph --}
\maketitle
\frontmatter
\tableofcontents
\preface
This lecture note has a fairly long history. Its starting point
was an attempt to solve some limit problems about the behaviour
of non-linear functionals of a sequence of independent random
variables. These problems could not be solved by means of
classical probabilistic methods. I tried to solve them with the
help of some sort of Taylor expansion. The idea was to represent
the functional we are investigating as a sum with a leading term
whose asymptotic behaviour can be well described by means of
classical results of probability theory and with some error
terms whose effect is negligible. This approach worked well,
but to bound the error terms I needed some non-trivial
estimates. The proof of these estimates was interesting in
itself; it was a problem worthy of closer study in its own
right. So I tried to work out the details and to present the
most important and most interesting results I met during this
research. This lecture note is the result of these efforts.

To solve the problems I met I had to give a good estimate on
the tail distribution of the integral of a function of several
variables with respect to the appropriate power of a normalized
empirical distribution. Besides this I also had to consider a
generalized version of this problem, in which the tail distribution
of the supremum of such integrals has to be bounded. The
difficulties in these problems concentrate around two points.
\medskip
\begin{enumerate}
\item
We consider non-linear functionals of independent random
variables, and we have to work out some techniques to deal with
such problems.
\item
The idea behind several arguments is the observation that independent
random variables behave in many respects almost as if they were
Gaussian. But we have to understand how strong this similarity is,
when we can apply the techniques worked out for Gaussian random
variables. Besides this we have to find methods to deal with our
problems also in those cases when the techniques related to
Gaussian and almost Gaussian random variables do not work.
\end{enumerate}
\medskip
To deal with problem a) I have discussed the theory of multiple random
integrals and their most important properties together with the
properties of so-called (degenerate) $U$-statistics. I considered
the Wiener--It\^o integrals, which are multiple Gaussian type integrals
and provide a useful tool to handle non-linear functionals of Gaussian
sequences. I also proved some results about a good representation of
the product of Wiener--It\^o integrals or degenerate $U$-statistics
as a sum of Wiener--It\^o integrals or degenerate $U$-statistics.
A comparison of these results indicates some similarity between the
behaviour of Wiener--It\^o integrals and degenerate $U$-statistics. I
tried to present a fairly detailed discussion of Wiener--It\^o
integrals and degenerate $U$-statistics which contains their most
important properties.

Problem b) appeared in particular in the study of the supremum of a
class of random integrals. It may be worth mentioning that there is
a deep theory worked out mainly by Michel Talagrand which gives good
estimates in such problems, at least in the case when only one-fold
integrals are considered. It turned out, however, that the results
and methods of this theory are not appropriate to prove the
estimates I needed in this work. Roughly speaking, the problems I met
have a different character than those investigated in Talagrand's
theory. This point is discussed in more detail in the main text of
this work, in particular in Chapter~18, which gives an overview of
the problems investigated in this work together with their history.
The problems get even harder if the supremum not only of one-fold
but also of multiple random integrals has to be estimated. Here
some new methods are needed which we can find by refining some
symmetrization arguments appearing in the theory of so-called
Vapnik--\v{C}ervonenkis classes.

I have also considered an example in Chapter~2 which shows how
to apply the estimates proved in this work in the study of some
limit theorem problems in mathematical statistics. Actually
this was the starting point of the research described in this
work. I discussed only one example, but I consider it more than
just an example. My goal was to explain a method that can help
in solving some non-trivial limit problems and to show why the
results of this lecture note are useful in their investigation.
I think that this approach works in a very general setting, but
this is the task of future research. Let me also remark that to
understand how this method works and how to apply it one does
not have to learn the whole material of this lecture note. It
is enough to understand the content of the results in Chapter~8
together with some results of Chapter~9 about the properties of
$U$-statistics.

I had two kinds of readers in mind when writing this lecture
note. The first kind would like to learn more about problems
in which relatively little independence is available, and
as a consequence the methods of classical probability theory do
not work in their study. They would like to acquire some results
and methods useful in such cases, too. The second kind of readers
would not like to go into the details of complicated, unpleasant
arguments. They would restrict their attention to some useful
methods which may help them in solving the limit theorem
problems of probability theory they meet, even in cases
when the standard methods do not work. This lecture note can be
considered as an attempt to satisfy the wishes of both kinds
of readers.
\medskip\medskip\noindent
Budapest, January 2013
\rightline{P\'eter Major}
\mainmatter
\chapter{Introduction}
First I briefly describe the main subject of this work.
Fix a positive integer $n$, consider $n$ independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ on a measurable
space $(X,{\cal X})$ with some distribution $\mu$ and take their
empirical distribution $\mu_n$ together with its normalization
$\sqrt n(\mu_n-\mu)$. Besides this, take a function $f(x_1,\dots,x_k)$
of $k$ variables on the $k$-fold product $(X^k,{\cal X}^k)$ of the
space $(X,{\cal X})$, introduce the $k$-th power of the normalized
empirical measure $\sqrt n(\mu_n-\mu)$ on $(X^k,{\cal X}^k)$ and
define the integral of the function $f$ with respect to this
signed product measure. This integral is a random variable, and we
want to give a good estimate on its tail distribution. More precisely,
we take the integrals not on the whole space: the diagonals
$x_s=x_{s'}$, $1\le s,s'\le k$, $s\neq s'$, of the space $X^k$ are
omitted from the domain of integration. Such a modification of the
integral seems to be natural.

We shall also be interested in the following generalized version of
the above problem. Let us have a nice class of functions ${\cal F}$
of $k$ variables on the product space $(X^k,{\cal X}^k)$, and consider
the integrals of all functions in this class with respect to the
$k$-fold direct product of our normalized empirical measure. Give a
good estimate on the tail distribution of the supremum of these
integrals.

One may ask why the above problems deserve a closer study. I found
them important, because they may help in solving some essential
problems in probability theory and mathematical statistics. I met
such problems when I tried to adapt the method of proof about the
Gaussian limit behaviour of the maximum likelihood estimate to some
similar but more difficult questions. In the original problem the
asymptotic behaviour of the solution of the so-called maximum
likelihood equation has to be investigated. The study of this
problem is hard in its original form. But by applying an appropriate
Taylor expansion of the function that appears in this equation and
throwing out its higher order terms we get an approximation whose
behaviour can be well understood. So to describe the limit
behaviour of the maximum likelihood estimate it suffices to show
that this approximation causes only a negligible error.

One would try to apply a similar method in the study of more
difficult questions. I met some non-parametric maximum likelihood
problems, for instance the description of the limit behaviour of
the so-called Kaplan--Meyer product limit estimate, where such an
approach could be applied. But in these problems it was harder
to show that the simplifying approximation causes only a
negligible error. In this case the solution of the above
mentioned problems was needed. In the non-parametric maximum
likelihood estimate problems I met, the estimation of multiple
(random) integrals played a role similar to the estimation of
the coefficients in the Taylor expansion in the study of maximum
likelihood estimates. Although I could apply this approach only
in some special cases, I believe that it works in very general
situations. But it demands some further work to show this.

The above formulated problems about random integrals are interesting
and non-trivial even in the special case $k=1$. Their solution
leads to an interesting and non-trivial generalization
of the fundamental theorem of mathematical statistics about
the difference between the empirical and the true distribution
of a large sample.

These problems have a natural counterpart about the behaviour of
so-called $U$-statistics, which is a fairly popular subject in
probability theory. The investigations of multiple random integrals
and of $U$-statistics are closely related, and it turned out to be
useful to consider them simultaneously.

Let us try to get some feeling about what kind of results can be
expected in these problems. For a large sample size $n$ the
normalized empirical measure $\sqrt n(\mu_n-\mu)$ behaves similarly
to a Gaussian random measure.
This suggests that in the problems we are interested in, results
similar to those in the analogous problems about multiple Gaussian
integrals should hold. The behaviour of multiple Gaussian integrals,
called Wiener--It\^o integrals in the literature, is fairly well
understood, and it suggests that the tail distribution of a
$k$-fold random integral with respect to a normalized empirical
measure should satisfy estimates similar to those for the tail
distribution of the $k$-th power of a Gaussian random variable with
expectation zero and appropriate variance. Besides this, if we consider the
supremum of multiple random integrals of a class of functions
with respect to a normalized empirical measure or with respect
to a Gaussian random measure, then we expect that under not too
restrictive conditions this supremum is not much larger than
the `worst' random integral with the largest variance taking
part in this supremum. We may also hope that the methods of the
theory of multiple Gaussian integrals can be adapted to the
investigation of our problems.
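
The first step of this heuristic, the approximately Gaussian behaviour of the normalized empirical measure evaluated on a fixed set, is easy to check numerically. The following minimal sketch is my illustration only, not part of the argument; the uniform distribution, the set $A=[0,0.3)$ and the sample sizes are arbitrary choices. It simulates $\sqrt n(\mu_n(A)-\mu(A))$ and compares its variance with the limit value $\mu(A)(1-\mu(A))$.

```python
import numpy as np

# Sketch (illustration only): distribution of sqrt(n)(mu_n(A) - mu(A))
# for mu = uniform distribution on [0,1] and the set A = [0, p).
rng = np.random.default_rng(0)
n, reps, p = 500, 4000, 0.3

samples = rng.uniform(size=(reps, n))
values = np.sqrt(n) * ((samples < p).mean(axis=1) - p)

# n * mu_n(A) is binomial, so the variance of the normalized mass is
# exactly p(1 - p) for every n; the central limit theorem adds that
# the law is approximately N(0, p(1 - p)) for large n.
print(round(values.mean(), 3), round(values.var(), 3))
```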
The above presented heuristic considerations supply a fairly good
description of the situation, but they do not take into account a
very essential difference between the behaviour of multiple
Gaussian integrals and multiple integrals with respect to a
normalized empirical measure. If the variance of a multiple
integral with respect to a normalized empirical measure is very
small, which turns out to be equivalent to a very small $L_2$-norm
of the function we are integrating, then the behaviour of this
integral is different from that of a multiple Gaussian integral
with the same kernel function. In this case the effect of some
irregularities of the normalized empirical distribution turns
out to be non-negligible, and no good Gaussian approximation
holds any longer. This case must be better understood, and some
new methods have to be worked out to handle it. The hardest
problems discussed in this work are related to this phenomenon.

The precise formulation of the results will be given in the
main part of the work. Besides their proofs I also tried to explain
the main ideas behind them and the notions introduced in their
investigation. This work contains some new results, and also the
proof of some already rather classical theorems is presented.
The results about Gaussian random variables and their non-linear
functionals, in particular multiple integrals with respect to a
Gaussian field, play a most important role in the
present work. Hence they are discussed in detail together
with some of their counterparts about multiple random integrals
with respect to a normalized empirical measure and some results
about $U$-statistics.

The proofs apply results from different parts of probability
theory. Papers investigating similar results refer to works dealing
with quite different subjects, and this makes their reading rather
hard. To overcome this difficulty I tried to work out the details
and to present a self-contained discussion even at the price of a
longer text. Thus I wrote down (in the main text or in the Appendix)
the proof of many interesting and basic results, like results about
Vapnik--\v{C}ervonenkis classes, about $U$-statistics and their
decomposition into sums of so-called degenerate $U$-statistics, about
so-called decoupled $U$-statistics and their relation to ordinary
$U$-statistics, the diagram formula about the product of
Wiener--It\^o integrals, their counterpart about the product of
degenerate $U$-statistics, etc. I tried to give such an exposition
where different parts of the problem are explained independently of
each other, and they can be understood in themselves.

As all the topics treated in the individual chapters relate to
each other it seemed natural to me to tell the history of how the
various results were reached in one last chapter. This last chapter,
Chapter~18, just before the Appendix, also contains the complete
reference list. I tried to give satisfactory references for all
essential problems discussed, to concentrate on explaining the main
ideas behind the proofs, and to indicate where they were published. I
did not attempt to provide an exhaustive literature list for fear
that more would be less. As a consequence the reference list
reflects my subjective preferences, my way of thinking.
\chapter{Motivation of the investigation. Discussion of
some problems}
In this chapter I try to show by means of an example why the
solution of the problems mentioned in the introduction may be
useful in the study of some important problems of probability
theory. I try to give a good picture about the main ideas, but I
do not work out all details. Actually the elaboration of some
details omitted from this discussion would demand hard work.
But as the present chapter is quite independent of the rest of
the work, these omissions cause no problem in understanding
the subsequent part.

I start with a short discussion of the maximum likelihood
estimate in the simplest case. The following problem is considered.
Let us have a class of density functions $f(x,\vartheta)$ on the
real line depending on a parameter $\vartheta\in R^1$, and
observe a sequence of independent random variables
$\xi_1(\omega),\dots,\xi_n(\omega)$ with a density function
$f(x,\vartheta_0)$, where $\vartheta_0$ is an unknown parameter
we want to estimate with the help of the above sequence of random
variables.

The maximum likelihood method suggests the following approach. Choose
as the estimate of the parameter $\vartheta_0$ the value
$\hat\vartheta_n =\hat\vartheta_n(\xi_1,\dots,\xi_n)$ at which the
density function of the random vector $(\xi_1,\dots,\xi_n)$,
i.e.\ the product
$$
\prod_{k=1}^n f(\xi_k,\vartheta)=\exp\left\{\sum_{k=1}^n\log
f(\xi_k,\vartheta)\right\}
$$
takes its maximum. This point can be found as the solution of the
so-called maximum likelihood equation\index{maximum likelihood equation}
\begin{equation}
\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\vartheta)=0. \label{(2.1)}
\end{equation}
We are interested in the asymptotic behaviour of the random variable
$\hat\vartheta_n-\vartheta_0$, where $\hat\vartheta_n$ is the
(appropriate) solution of the equation~(\ref{(2.1)}).
The direct study of this equation is rather hard, but a Taylor
expansion of the expression at the left-hand side of~(\ref{(2.1)})
around the (unknown) point $\vartheta_0$ yields a good and simple
approximation of $\hat\vartheta_n$, and it enables us to describe
the asymptotic behaviour of $\hat\vartheta_n-\vartheta_0$.
This Taylor expansion yields that
\begin{eqnarray}
&&\sum_{k=1}^n\frac{\partial}{\partial\vartheta}\log
f(\xi_k,\hat\vartheta_n)=
\sum_{k=1}^n\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)} \nonumber \\
&&+(\hat\vartheta_n-\vartheta_0)
\left(\sum_{k=1}^n\left(\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{\left(\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)\right)^2}
{f^2(\xi_k,\vartheta_0)} \right)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right) \nonumber \\
&&= \sum_{k=1}^n
\left(\eta_k+\zeta_k(\hat\vartheta_n-\vartheta_0)\right)
+O\left(n(\hat\vartheta_n-\vartheta_0)^2\right),
\label{(2.2)}
\end{eqnarray}
where
$$
\eta_k=\frac{\frac{\partial}{\partial\vartheta}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}\quad \textrm{and}\quad
\zeta_k=
\frac{\frac{\partial^2}{\partial\vartheta^2}
f(\xi_k,\vartheta_0)}{f(\xi_k,\vartheta_0)}-
\frac{ \left( \frac{\partial}{\partial\vartheta}
f( \xi_k,\vartheta_0)\right)^2}{f^2(\xi_k,\vartheta_0)}
$$
for $k=1,\dots,n$. We want to understand the asymptotic behaviour
of the (random) expression on the right-hand side of~(\ref{(2.2)}).
The relation
$$
E\eta_k=\int\frac{\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)}{f(x,\vartheta_0)}f(x,\vartheta_0)\,dx
=\frac{\partial}{\partial\vartheta}\int
f(x,\vartheta_0)\,dx=0
$$
holds, since $\int f(x,\vartheta)\,dx=1$ for all $\vartheta$,
and a differentiation of this relation gives the last identity.
Similarly,
$E\eta^2_k=-E\zeta_k
=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$, \
$k=1,\dots,n$. Hence by the central limit theorem
$\chi_n=\frac{1}{\sqrt n}\sum\limits_{k=1}^n\eta_k$
is asymptotically normal with expectation zero and variance
$I^2=\int\frac{\left(\frac{\partial}{\partial\vartheta}
f(x,\vartheta_0)\right)^2}{f(x,\vartheta_0)}\,dx>0$.
In the statistics literature the number $I^2$ is called the Fisher
information. By the law of large numbers
$\frac{1}{n}\sum\limits_{k=1}^n\zeta_k\sim -I^2$.
Thus relation (\ref{(2.2)}) suggests the approximation of the
maximum-likelihood estimate $\hat\vartheta_n$ by the random variable
$\tilde\vartheta_n$ given by the identity $\tilde\vartheta_n-\vartheta_0=
-\frac{\sum\limits_{k=1}^n\eta_k}{\sum\limits_{k=1}^n\zeta_k}$, and
the previous calculations imply that
$\sqrt n(\tilde\vartheta_n-\vartheta_0)$
is asymptotically normal with
expectation zero and variance~$\frac1{I^2}$. The random variable
$\tilde\vartheta_n$ is not a solution of the equation (\ref{(2.1)});
the value of the expression on the left-hand side is of order
$O(n(\tilde\vartheta_n-\vartheta_0)^2)=O(1)$ at this point. On
the other hand, some calculations show that the derivative of the
function on the left-hand side is large at this point: it is greater
than $\textrm{const.}\, n$ with some $\textrm{const.}>0$. This implies
that the maximum-likelihood equation has a solution
$\hat\vartheta_n$ such that
$\hat\vartheta_n-\tilde\vartheta_n=O\left(\frac1n\right)$. Hence
$\sqrt n(\hat\vartheta_n-\vartheta_0)$ and
$\sqrt n(\tilde\vartheta_n-\vartheta_0)$ have the same asymptotic
limit behaviour.
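
For a concrete family the whole scheme can be checked by simulation. The sketch below is my illustration only: it uses the exponential densities $f(x,\vartheta)=\vartheta e^{-\vartheta x}$, $x>0$, for which the maximum likelihood equation has the explicit solution $\hat\vartheta_n=1/\bar x$ and the Fisher information is $I^2=1/\vartheta_0^2$, and it checks that $\sqrt n(\hat\vartheta_n-\vartheta_0)$ has variance close to the predicted value $1/I^2=\vartheta_0^2$.

```python
import numpy as np

theta0 = 2.0            # true parameter of the exponential density
n, reps = 400, 2000
rng = np.random.default_rng(1)

# MLE for f(x, theta) = theta * exp(-theta * x): the likelihood
# equation sum_k (1/theta - xi_k) = 0 gives theta_hat = 1/(sample mean).
samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)
errors = np.sqrt(n) * (theta_hat - theta0)

# Fisher information I^2 = 1/theta0^2, so the predicted limit law of
# sqrt(n)(theta_hat - theta0) is N(0, 1/I^2) = N(0, theta0^2).
predicted_var = theta0 ** 2
print(round(errors.var(), 2), predicted_var)
```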
The previous method can be summarized in the following way:
Take a simpler linearized version of the expression we want to
estimate by means of an appropriate Taylor expansion, describe the
limit distribution of this linearized version and show that the
linearization causes only a negligible error.

We want to show that such a method also works in more difficult
situations. But in some cases it is harder to show that the error
committed by a replacement of the original expression by a simpler
linearized version is negligible, and to show this the solution of
the problems mentioned in the introduction is needed. The discussion
of the following problem, called the Kaplan--Meyer method for the
estimation of the distribution function with the help of
censored data, shows such an example.

The following problem is considered. Let $(X_i,Z_i)$, $i=1,\dots,n$,
be a sequence of independent, identically distributed random vectors
such that the components $X_i$ and $Z_i$ are also independent with
some unknown, continuous distribution functions $F(x)$ and $G(x)$.
We want to estimate the distribution function $F$ of the random
variables $X_i$, but we cannot observe the variables $X_i$, only
the random variables $Y_i=\min(X_i,Z_i)$ and
$\delta_i=I(X_i\leq Z_i)$. In other words, we want to solve the
following problem. There are certain objects whose lifetimes $X_i$
are independent and $F$-distributed. But we cannot observe these
lifetimes $X_i$, because after a time $Z_i$ the observation must
be stopped. We also know whether the real lifetime $X_i$ or the
censoring variable $Z_i$ was observed. We make $n$ independent
experiments and want to estimate with their help the distribution
function~$F$.

Kaplan and Meyer, on the basis of some maximum-likelihood estimation
type considerations, proposed the following so-called product limit
estimator\index{product limit estimator (Kaplan--Meyer method)}
$S_n(u)$ to estimate the unknown survival function $S(u)=1-F(u)$:
\begin{equation}
1-F_n(u)=S_n(u)=\left\{
\begin{array}{l}
\prod\limits_{i=1}^n\left(\frac{N(Y_i)}{N(Y_i)+1}\right)^{I(Y_i\leq u,
\delta_i=1)} \textrm{ if }u\leq\max(Y_1,\dots,Y_n)\\
0 \textrm{ if } u\geq\max(Y_1,\dots,Y_n),\textrm{ and }\delta_n =1,\\
\textrm{undefined if }u\geq\max(Y_1,\dots,Y_n),\textrm{ and }\delta_n=0,
\end{array}
\right.
\label{(2.3)}
\end{equation}
where
$$
N(t)=\#\{Y_i,\;\;Y_i>t,\;1\le i \le n\}=\sum_{i=1}^n I(Y_i>t).
$$
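
Formula (\ref{(2.3)}) is straightforward to turn into a small computation. The sketch below is my illustration; the function and variable names are mine. It evaluates $S_n(u)$ directly from the definition. A useful sanity check: when no observation is censored (all $\delta_i=1$) and the $Y_i$ are distinct, the product telescopes and $S_n(u)$ reduces to one minus the ordinary empirical distribution function.

```python
def product_limit(Y, delta, u):
    """Kaplan--Meyer estimate S_n(u) of the survival function,
    computed directly from formula (2.3); requires u <= max(Y)."""
    assert u <= max(Y)
    S = 1.0
    for Yi, di in zip(Y, delta):
        if Yi <= u and di == 1:
            N = sum(Yj > Yi for Yj in Y)   # N(Y_i) = #{j: Y_j > Y_i}
            S *= N / (N + 1)
    return S

# Uncensored, distinct data: the product telescopes to
# (n - m)/n = 1 - (empirical CDF at u), where m = #{i: Y_i <= u}.
Y = [1.0, 2.0, 3.0, 4.0, 5.0]
delta = [1, 1, 1, 1, 1]
print(product_limit(Y, delta, 2.5))   # 3 of the 5 points exceed 2.5
```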
We want to show that the above estimate (\ref{(2.3)}) is really good.
For this goal we shall approximate the random variables $S_n(u)$ by
some appropriate random variables. To do this, we first introduce
some notation.
Put
\begin{eqnarray}
H(u) &=&P(Y_i\leq u)=1-\bar H(u), \nonumber \\
\tilde H(u) &=&P(Y_i\leq u,\,\delta_i=1),\quad
\tilde{\tilde H}(u)=P(Y_i\leq u,\,\delta_i =0)
\label{(2.4)}
\end{eqnarray}
and
\begin{eqnarray}
H_n(u) &=&\frac{1}{n} \sum_{i=1}^n I( Y_i \leq u)\label{(2.5)} \\
\tilde H_n(u) &=&\frac1n \sum_{i=1}^n I(Y_i\leq u,\, \delta_i
=1), \quad \tilde{\tilde H}_n(u)=\frac{1}{n}\sum_{i=1}^n I( Y_i
\leq u, \, \delta_i=0). \nonumber
\end{eqnarray}
Clearly $H(u)=\tilde H(u)+\tilde{\tilde H}(u)$ and
$ H_n(u)=\tilde H_n(u)+\tilde{\tilde H}_n(u)$.
We shall estimate $F_n(u)-F(u)$ for $u\in(-\infty, T]$ if
\begin{equation}
1-H(T)>\delta \quad \textrm{with some fixed } \delta>0.
\label{(2.6)}
\end{equation}
Condition (\ref{(2.6)}) implies that there are more than
$\frac{\delta}{2}n$
sample points $Y_j$ larger than~$T$ with probability close to~1. The
complementary event has only an exponentially small probability.
This observation helps to show in the subsequent calculations that
some events have negligibly small probability.
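
The exponential bound implicit in this remark can be made explicit with a standard concentration inequality. The following form is my addition, one of several possible bounds: it applies Hoeffding's inequality to the binomial random variable $\#\{j\colon Y_j>T\}$, whose success probability $1-H(T)$ exceeds $\delta$ by condition~(\ref{(2.6)}).

```latex
$$
P\left(\#\{j\colon Y_j>T\}\le\frac{\delta}{2}n\right)
\le P\left(\textrm{Bin}(n,\delta)\le\frac{\delta}{2}n\right)
\le\exp\left(-2n\left(\frac{\delta}{2}\right)^2\right)
=e^{-n\delta^2/2}.
$$
```

The first inequality holds because a binomial random variable is stochastically increasing in its success probability; the second is Hoeffding's inequality for a deviation of $\frac{\delta}{2}n$ below the mean $\delta n$.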
We introduce the so-called cumulative hazard function and its
empirical version
\begin{equation}
\Lambda(u)=-\log(1-F(u)), \quad \Lambda_n(u)=-\log(1-F_n(u)).
\label{(2.7)}
\end{equation}
Since $F_n(u)-F(u)=\exp(-\Lambda(u))
\left(1-\exp(\Lambda(u)-\Lambda_n(u))\right)$,
a simple Taylor expansion yields
\begin{equation}
F_n(u)-F(u)=(1-F(u))\left(\Lambda_n(u)-\Lambda(u)\right)+R_1(u),
\label{(2.8)}
\end{equation}
and it is easy to see that
$R_1(u)=O\left((\Lambda(u)-\Lambda_n(u))^2\right)$.
It follows from the subsequent estimates that
$\Lambda(u)-\Lambda_n(u)=O(n^{-1/2})$, thus $R_1(u)=O(\frac1n)$. Hence it
is enough to investigate the term $\Lambda_n(u)$. We shall show that
$\Lambda_n(u)$ has an expansion with $\Lambda(u)$ as the main term
plus $n^{-1/2}$ times a term which is a linear functional of an
appropriate normalized empirical distribution function plus an error
term of order $O(n^{-1})$.

From~(\ref{(2.3)}) it is obvious that
$$
\Lambda_n(u)=-\sum_{i=1}^n I(Y_i\leq u, \, \delta_i=1)
\log\left(1-\frac{1}{1+N(Y_i)}\right).
$$
It is not difficult to get rid of the unpleasant logarithmic function
in this formula by means of the relation $-\log(1-x)=x+O(x^2)$ for
small~$x$. It yields that
\begin{equation}
\Lambda_n(u)=\sum_{i=1}^n \frac{I(Y_i\leq u, \,\delta_i=1)}{N(Y_i)}
+R_2(u)=\tilde{\Lambda}_n(u)+R_2(u) \label{(2.9)}
\end{equation}
with an error term $R_2(u)$ such that $nR_2(u)$ is smaller than a
constant with probability close to one. (The probability of the
exceptional set is exponentially small.)

The expression $\tilde{\Lambda}_n(u)$ is still inappropriate for our
purposes. Since the denominators $N(Y_i)=\sum\limits_{j=1}^n I(Y_j>Y_i)$
are dependent for different indices~$i$ we cannot see directly the
limit behaviour of $\tilde{\Lambda}_n(u)$.
We try to approximate $\tilde{\Lambda}_n(u)$ by a simpler
expression. A natural approach would be to approximate the terms
$N(Y_i)$ in it by their conditional expectation $(n-1)\bar
H(Y_i)=(n-1)(1-H(Y_i))=E(N(Y_i)|Y_i)$ with respect to the
$\sigma$-algebra generated by the random variable~$Y_i$. This is
too rough a `first order' approximation, but the following `second
order' approximation will be sufficient for our goals. Put
$$
N(Y_i)=\sum_{j=1}^n I(Y_j>Y_i)=n\bar H(Y_i) \left(1+
\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}\right)
$$
and express the terms $\frac1{N(Y_i)}$ in the sum defining
$\tilde \Lambda_n$ (with $\tilde\Lambda_n$ introduced in~(\ref{(2.9)}))
by means of the relation
$\frac1{1+z}=\sum\limits_{k=0}^\infty (-1)^kz^k=1-z+\varepsilon(z)$
with the choice
$z=\frac{\sum\limits_{j=1}^nI(Y_j>Y_i)-n\bar H(Y_i)}{n\bar H(Y_i)}$. As
$|\varepsilon(z)|<2z^2$ for $|z|<\frac{1}{2}$ we get that
\begin{eqnarray}
\tilde{\Lambda}_n(u)
&=&\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1+\sum_{k=1}^\infty\left(- \frac{\sum\limits_{j=1}^n
I(Y_j>Y_i)-n\bar H(Y_i)} {n\bar H(Y_i)}\right)^k\right)\nonumber \\
&=&\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}
{n\bar H(Y_i)}\left(1-\frac{\sum\limits_{j=1}^n I(Y_j>Y_i)-n\bar H(Y_i)}
{n\bar H(Y_i)}\right)+R_3(u) \nonumber \\
&=&2A(u)-B(u)+R_3(u), \label{(2.10)}
\end{eqnarray}
where
$$
A(u)=A(n,u)=\sum_{i=1}^n\frac{I(Y_i\leq u,\,\delta_i=1)}{n\bar H(Y_i)}
$$
and
$$
B(u)=B(n,u)=\sum_{i=1}^n \sum_{j=1}^n\frac
{I(Y_i\leq u,\,\delta_i=1)I(Y_j>Y_i)}{n^2\bar H^2(Y_i)}.
$$
It can be proved by means of standard methods that $nR_3(u)$ is
exponentially small. Thus relations~(\ref{(2.9)})
and~(\ref{(2.10)}) yield that
\begin{equation}
\Lambda_n(u)=2A(u)-B(u)+\textrm{negligible error.}
\label{(2.11)}
\end{equation}
This means that to solve our problem the asymptotic behaviour of the
random variables $A(u)$ and $B(u)$ has to be given. We can get a
better insight into this problem by rewriting the sum $A(u)$ as an
integral and the double sum $B(u)$ as a two-fold integral with
respect to empirical measures. Then these integrals can be rewritten
as sums of random integrals with respect to normalized empirical
measures and deterministic measures. Such an approach yields a
representation of $\Lambda_n(u)$ in the form of a sum whose terms
can be well understood.

Let us write
\begin{eqnarray*}
A(u)&=&\int_{-\infty}^{+\infty}\frac{I(y\leq u)}{1-H(y)}\,d\tilde
H_n(y),\\
B(u) &=&\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}
\frac{I(y\leq u)I(x>y)}{\left(1-H(y)\right)^2}\,dH_n(x) d\tilde H_n(y).
\end{eqnarray*}

We rewrite the terms $A(u)$ and $B(u)$ in a form better suited to
our purposes. We express these terms as a sum of integrals with respect
to $dH(x)$, $d\tilde H(y)$ and the normalized empirical processes
$d\sqrt n(H_n(x)-H(x))$ and $d\sqrt n(\tilde H_n(y)-\tilde H(y))$.
For this goal observe that
\begin{eqnarray*}
H_n(x)\tilde H_n(y)&&=H(x)\tilde H(y)+H(x)(\tilde H_n(y)-\tilde H(y))
+(H_n(x)-H(x))\tilde H(y)\\
&&\qquad+(H_n(x)-H(x))(\tilde H_n(y)-\tilde H(y)).
\end{eqnarray*}
Hence we can write that
$B(u)=B_1(u)+B_2(u)+B_3(u)+B_4(u)$, where
\begin{eqnarray*}
B_1(u)&&=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\tilde H(y)\;,\\
B_2(u) &&=\frac{1}{\sqrt n}\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x)\,d\left(\sqrt n
(\tilde H_n(y)-\tilde H(y))\right),\\
B_3(u)&&=\frac1{\sqrt n}\int_{-\infty}^u
\int_{-\infty}^{+\infty}\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\,d\tilde H(y)\;,\\
B_4(u)&&=\frac 1n \int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}
\,d\left(\sqrt n\left(H_n(x)-H(x)\right)\right)\, \\
&& \qquad \qquad\qquad\qquad
d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right).
\end{eqnarray*}

In the above decomposition of $B(u)$ the term $B_1$ is a
deterministic function, $B_2$, $B_3$ are linear functionals of
normalized empirical processes and $B_4$ is a nonlinear functional
of normalized empirical processes. The deterministic term $B_1(u)$
can be calculated explicitly. Indeed,
$$
B_1(u)=\int_{-\infty}^u\int_{-\infty}^{+\infty}
\frac{I(x>y)}{\left(1-H(y)\right)^2}\,dH(x) d\tilde H(y)=
\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}.
$$
Then the relations
$\tilde H(u)=\int_{-\infty}^u\left(1-G(t)\right)\,dF(t)$ and
$1-H = (1-F)(1-G)$ imply that
\begin{equation}
B_1(u)=\int_{-\infty}^u\frac{dF(y)}{1-F(y)}=
-\log(1-F(u))=\Lambda(u). \label{(2.12)}
\end{equation}
Observe that
\begin{eqnarray}
A(u)&=&\int_{-\infty}^u\frac{d\,\tilde H_n(y)}{1-H(y)}\nonumber \\
&=&\int_{-\infty}^u\frac{d\tilde H(y)}{1-H(y)}+
\frac1{\sqrt n}\int_{-\infty}^u
\frac{d \left(\sqrt n (\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}
\nonumber \\
&=& B_1(u)+B_2(u). \label{(2.13)}
\end{eqnarray}
From relations~(\ref{(2.11)}), (\ref{(2.12)}) and~(\ref{(2.13)})
it follows that
\begin{equation}
\Lambda_n(u)-\Lambda(u)=B_2(u)-B_3(u)-B_4(u)+\textrm{negligible error.}
\label{(2.14)}
\end{equation}

Integration of $B_2$ and $B_3$ with respect to the variable $x$
and then integration by parts in the expression $B_2$ yields that
\begin{eqnarray*}
B_2(u)&=&\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{d\left(\sqrt n(\tilde H_n(y)-\tilde H(y))\right)}{1-H(y)}\\
&=&\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{\sqrt n(1-H(u))}-\frac1{\sqrt n}\int_{-\infty}^{u}
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y),\\
B_3(u)&=&\frac1{\sqrt n}\int_{-\infty}^u
\frac{\sqrt n\left(H(y)-H_n(y)\right)}
{\left(1-H(y)\right)^2}\,d\tilde H(y).
\end{eqnarray*}
With the help of the above expressions for $B_2$ and $B_3$
(\ref{(2.14)}) can be rewritten as
\begin{eqnarray}
\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)
&=&\frac{\sqrt n\left(\tilde H_n(u)-\tilde H(u)\right)}
{1-H(u)}-\int_{-\infty}^u
\frac{\sqrt n(\tilde H_n(y)-\tilde H(y))}
{\left(1-H(y)\right)^2}\,dH(y)\nonumber \\
&&\qquad+\int_{-\infty}^u\frac{\sqrt n\left(H_n(y)-H(y)\right)}
{\left(1-H(y)\right)^2} \,d\tilde H(y)\nonumber \\
&&\qquad-\sqrt n B_4(u)+\textrm{negligible error.}
\label{(2.15)}
\end{eqnarray}

Formula (\ref{(2.15)}) (together with formula~(\ref{(2.8)}))
almost agrees with the statement we wanted to prove. Here the
random variable $\sqrt n\left(\Lambda_n(u)-\Lambda(u)\right)$
is expressed as a sum of linear functionals of normalized
empirical distributions plus some negligible error terms plus
the error term $\sqrt nB_4(u)$. So to get a complete proof it
is enough to show that $\sqrt nB_4(u)$ also yields a negligible
error. But $nB_4(u)$ is a double integral of a bounded function
(here we apply again formula (\ref{(2.6)})) with respect to a
normalized empirical distribution. Hence to bound this term we
need a good estimate of multiple stochastic integrals (with
multiplicity~2), and this is just the problem formulated in
the introduction. The estimate we need here follows from
Theorem~8.1 of the present work. Let us remark that the problem
discussed here corresponds to the estimation of the coefficient
of the second term in the Taylor expansion considered in the
study of the maximum likelihood estimation. One may worry a
little about how to bound $nB_4(u)$ with the help of estimates
of double stochastic integrals, since in the definition of
$B_4(u)$ the integration is taken with respect to different
normalized empirical processes in the two coordinates. But
this is not a difficult technical problem. It can be
overcome, for instance, by rewriting the integral as
a double integral with respect to the empirical process
$\left(\sqrt n\left(H_n(x)-H(x)\right),
\sqrt n\left(\tilde H_n(y)-\tilde H(y)\right)\right)$
in the space $R^2$.
By working out the details of the above calculation we get
that the linear functional $B_2(u)-B_3(u)$ of normalized
empirical processes yields a good estimate on the expression
$\sqrt n(\Lambda_n(u)-\Lambda(u))$ for a fixed parameter~$u$.
But we want to prove somewhat more: we want to get an estimate
uniform in the parameter~$u$, i.e.\ to show that even the random
variable $\sup\limits_{u\le T}\left|
\sqrt n(\Lambda_n(u)-\Lambda(u))-B_2(u)+B_3(u)\right|$
is small. This can be done by making estimates uniform in the
parameter~$u$ in all steps of the above calculation. There appears
only one difficulty when trying to carry out this program. Namely,
we need an estimate on $\sup\limits_{u\le T} |nB_4(u)|$, i.e. we
have to bound the supremum of multiple random integrals with respect
to a normalized random measure for a nice class of kernel functions.
This can be done, but at this point the second problem mentioned
in the introduction appears. This difficulty can be overcome by
means of Theorem~8.2 of this work.
Thus the limit behaviour of the Kaplan--Meier estimate can be
described by means of an appropriate expansion. The steps of the
calculation leading to such an expansion are fairly standard, the
only hard part is the solution of the problems mentioned in the
introduction. It can be expected that such a method also works in
a much more general situation.
I finish this chapter with a remark that Richard Gill made in a
personal conversation after my talk on this subject at a conference.
While he accepted my proof, he missed an argument in it about the
maximum likelihood character of the Kaplan--Meier estimate. This
was a completely justified remark, since if we do not restrict our
attention to this problem, but try to generalize it to general
non-parametric maximum likelihood estimates, then we have to
understand how the maximum likelihood character of the estimate
can be exploited. I believe that this can be done, but only with
the help of some further studies.
\chapter{Some estimates about sums of independent random
variables}
We shall need a good bound on the tail distribution of sums
of independent random variables bounded by a constant with
probability one. Later only the results about sums of independent
and identically distributed variables will be interesting for us.
But since they can be generalized without any effort to sums of not
necessarily identically distributed random variables, the condition
about the identical distribution of the summands will be dropped.
We are interested in the question of when these
estimates give as good a bound as the central limit theorem
suggests, and what can be said otherwise.
More explicitly, the following problem will be considered: Let
$X_1,\dots,X_n$ be independent random variables, $EX_j=0$,
$\textrm{Var}\, X_j=\sigma_j^2$, $1\le j\le n$, and take the random sum
$S_n=\sum\limits_{j=1}^nX_j$ and its variance
$\textrm{Var}\, S_n=V_n^2=\sum\limits_{j=1}^n\sigma_j^2$.
We want to get a good
bound on the probability $P(S_n>u V_n)$. The central limit theorem
suggests that under general conditions an upper bound of the
order $1-\Phi(u)$ should hold for this probability, where $\Phi(u)$
denotes the standard normal distribution function. Since the
standard normal distribution function satisfies the inequality
$\left(\frac1u-\frac1{u^3}\right)
\frac{e^{-u^2/2}}{\sqrt{2\pi}} <1-\Phi(u)<
\frac1u\frac{e^{-u^2/2}}{\sqrt{2\pi}}$ for all $u>0$ it is natural
to ask when the probability $P(S_n >uV_n)$ is comparable with the
value $e^{-u^2/2}$. More generally, we shall call an upper bound of
the form $P(S_n>uV_n)\le e^{-Cu^2}$ with some constant $C>0$ a
Gaussian type estimate.
First I formulate Bernstein's inequality, which tells us for which
values of $u$ the probability $P(S_n>uV_n)$ has a Gaussian type estimate.
It supplies such an estimate if $u\le \textrm{const.}\, V_n$. On
the other hand, for $u\ge\textrm{const.}\, V_n$ it yields a much
weaker bound. I shall formulate another result, called Bennett's
inequality, which is a slight improvement of Bernstein's inequality.
It helps us to tell what can be expected if Bernstein's inequality
does not provide a Gaussian type estimate. I shall also
present an example which shows that Bennett's inequality is in some
sense sharp. The main difficulties we meet in this work are closely
related to the weakness of the estimates we have for the probability
$P(S_n>uV_n)$ if it does not satisfy a Gaussian type estimate. As we
shall see this happens if $u\gg \textrm{const.}\, V_n$.
In the usual formulation of Bernstein's inequality a
real number~$M$ is introduced, and it is assumed that the terms in
the sum we investigate are bounded by this number. But since the
problem can be simply reduced to the case $M=1$ I shall
consider only this special case.
\medskip\noindent
{\bf Theorem 3.1 (Bernstein's
inequality).}\index{Bernstein's inequality} {\it Let
$X_1,\dots,X_n$ be independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\textrm{\rm Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P\left(S_n>uV_n\right)\le\exp\left\{-\frac{u^2}{2\left(1+\frac13
\frac u{V_n}\right)} \right\} \quad\textrm{for all }u>0.
\label{(3.1)}
\end{equation}
}
\medskip\noindent
{\it Proof of Theorem 3.1.} Let us give a good bound on the
exponential moments $Ee^{tS_n}$ for appropriate parameters
$t>0$. Since $EX_j=0$ and $E|X_j|^{k+2}\le\sigma^2_j$ for $k\ge0$ we can
write $Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k
\le 1+\frac{t^2\sigma_j^2}2\left(1+\sum\limits_{k=1}^\infty
\frac {2t^{k}}{(k+2)!}\right) \le 1+\frac{t^2\sigma_j^2}2
\left(1+\sum\limits_{k=1}^\infty 3^{-k}t^{k}\right)
= 1+\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}
\le\exp\left\{\frac{t^2\sigma_j^2}2\frac1{1-\frac t3}\right\}$
if $0\le t<3$. Hence
$$
Ee^{tS_n}=\prod\limits_{j=1}^n Ee^{tX_j}\le
\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3}\right\}
\quad\textrm{for } 0\le t<3.
$$
The above relation implies that
$$
P\left(S_n>uV_n\right)=P(e^{tS_n}>e^{tuV_n})\le Ee^{tS_n}e^{-tuV_n}
\le\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3}-tuV_n\right\}
$$
if $0\le t<3$. Choose the number $t$ in this inequality as the
solution of the equation $t^2V_n^2\frac1{1-\frac t3}=tuV_n$, i.e.
put $t=\frac u{V_n+\frac u3}$. Then $0\le t<3$, and we get that
$P(S_n>uV_n)\le e^{-tuV_n/2}=
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}\right\}$.
\hfill $\qed$
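\medskip
The exponential-moment argument above is easy to check numerically. The
following sketch (not part of the text; the uniform distribution and all
numerical values are my own illustrative choices) simulates sums of
independent variables bounded by~1 and compares the empirical tail
frequency with the right-hand side of~(\ref{(3.1)}):

```python
import math
import random

def bernstein_bound(u, V_n):
    # right-hand side of Bernstein's inequality (3.1)
    return math.exp(-u * u / (2.0 * (1.0 + u / (3.0 * V_n))))

random.seed(0)
n, trials = 200, 20000
sigma2 = 1.0 / 3.0                 # variance of the Uniform(-1, 1) distribution
V_n = math.sqrt(n * sigma2)
u = 2.5
exceed = sum(
    1 for _ in range(trials)
    if sum(random.uniform(-1.0, 1.0) for _ in range(n)) > u * V_n
)
empirical = exceed / trials
print(empirical, "<=", bernstein_bound(u, V_n))
```

For $u=2.5$ the bound evaluates to about $0.059$, comfortably above the
empirical frequency, which stays close to the normal tail
$1-\Phi(2.5)\approx0.006$, as the Gaussian comparison suggests.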
\medskip
If the random variables $X_1,\dots,X_n$ satisfy the conditions of
Bernstein's inequality, then also the random variables
$-X_1,\dots,-X_n$ satisfy them. By applying the above result in both
cases we get that
$P(|S_n|>uV_n)\le2
\exp\left\{-\frac{u^2}{2\left(1+\frac13\frac u{V_n}\right)}
\right\}$ under the conditions of Bernstein's inequality.
\medskip
By Bernstein's inequality for all $\varepsilon>0$ there is some
number $\alpha(\varepsilon)>0$ such that in the case
$\frac u{V_n}<\alpha(\varepsilon)$ the inequality
$P(S_n>uV_n)\le e^{-(1-\varepsilon)u^2/2}$ holds. Beside this,
for all fixed numbers $A>0$ there is some constant $C=C(A)>0$
such that if $\frac u{V_n}<A$, then $P(S_n>uV_n)\le e^{-Cu^2}$.
This can be interpreted as a Gaussian type estimate for the
probability $P(S_n>uV_n)$ if $u\le \textrm{const.}\, V_n$.
On the other hand, if $\frac u{V_n}$ is very large, then Bernstein's
inequality yields a much worse estimate. The question arises whether
in this case Bernstein's inequality can be replaced by a better, more
useful result. Next I present Theorem~3.2, the so-called Bennett's
inequality which provides a slight improvement of Bernstein's
inequality. But if $\frac u{V_n}$ is very large, then also
Bennett's inequality provides a much weaker estimate on the
probability $P(S_n>uV_n)$ than the bound suggested by a Gaussian
comparison. On the other hand, I shall present an example that shows
that (without imposing some additional conditions) no real
improvement of this estimate is possible.
\medskip\noindent
{\bf Theorem 3.2 (Bennett's inequality).}\index{Bennett's inequality}
{\it Let $X_1,\dots,X_n$ be
independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and
$V_n^2=\textrm{\rm Var}\, S_n=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P(S_n>u)\le\exp\left\{-V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac u{V^2_n}\right]\right\}
\quad\textrm{for all } u>0. \label{(3.2)}
\end{equation}
As a consequence, for all $\varepsilon>0$ there exists some
$B=B(\varepsilon)>0$ such
that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-(1-\varepsilon)u\log \frac u{V^2_n}
\right\}\quad\textrm{if } u>BV_n^2, \label{(3.3)}
\end{equation}
and there exists some positive constant $K>0$ such that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-Ku\log \frac u{V^2_n}
\right\}\quad\textrm{if }u>2V_n^2. \label{(3.4)}
\end{equation}
}
\medskip\noindent
{\it Proof of Theorem 3.2.}\/ We have
\begin{eqnarray*}
Ee^{tX_j}=\sum\limits_{k=0}^\infty\frac{t^k}{k!}EX_j^k\le
1+\sigma_j^2\sum\limits_{k=2}^\infty\frac {t^k}{k!}&&=1+\sigma_j^2
\left(e^t-1-t\right)\le e^{\sigma_j^2(e^t-1-t)}, \\
&& \qquad\quad 1\le j\le n,
\end{eqnarray*}
and $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$ for all $t\ge0$. Hence
$P(S_n>u)\le e^{-tu}Ee^{tS_n}\le e^{-tu+V_n^2(e^t-1-t)}$ for all
$t\ge0$. We get relation (\ref{(3.2)}) from this inequality
with the choice $t=\log\left(1+\frac u{V^2_n}\right)$. (This is
the place of minimum of the
function $-tu+V_n^2(e^t-1-t)$ for fixed $u$ in the parameter~$t$.)
Relation (\ref{(3.2)}) and the observation
$\lim\limits_{v\to\infty}\frac{(v+1)\log(v+1)-v}{v\log v}=1$
with the choice $v=\frac u{V_n^2}$ imply formula~(\ref{(3.3)}).
Because of relation (\ref{(3.3)}), to prove formula (\ref{(3.4)})
it is enough to check it for $2\le\frac u{V_n^2}\le B$
with some sufficiently large constant $B>0$.
In this case relation (\ref{(3.4)}) follows directly from formula
(\ref{(3.2)}). This can be seen for instance by observing that
the expression $\frac{V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac
u{V^2_n}\right]}{u\log\frac u{V^2_n}}$ is a continuous and positive
function of the variable $\frac u{V_n^2}$ in the interval $2\le
\frac u{V_n^2}\le B$, hence its minimum in this interval is strictly
positive.
\hfill$\qed$
\medskip
Let us make a short comparison between Bernstein's and Bennett's
inequalities. Both results yield an estimate on the probability
$P(S_n>u)$, and their proofs are very similar. They are based on
an estimate of the moment generating functions $R_j(t)=Ee^{tX_j}$
of the summands~$X_j$, but Bennett's inequality yields a better
estimate. It may be worth mentioning that the estimate given for
$R_j(t)=Ee^{tX_j}$ in the proof of
Bennett's inequality agrees with the moment generating function
$Ee^{t(Y_j-EY_j)}$ of the normalization $Y_j-EY_j$ of a Poissonian
random variable $Y_j$ with parameter $\textrm{Var}\, X_j$. As a
consequence,
we get, by using the standard method of estimating tail distributions
by means of moment generating functions, an estimate for the
probability $P(S_n>u)$ which is comparable with the probability
$P(T_n-ET_n>u)$, where $T_n$ is a Poissonian random variable with
parameter $V_n^2=\textrm{Var}\, S_n$. We can say that Bernstein's
inequality yields a Gaussian and Bennett's inequality a Poissonian
type estimate for the sums of independent, bounded random variables.
\medskip\noindent
{\it Remark.}\/ Bennett's inequality yields a sharper estimate for
the probability $P(S_n>u)$ than Bernstein's inequality for all
numbers $u>0$. To prove this it is enough to show that for all
$0\le t<3$ the inequality $Ee^{tS_n}\le e^{V_n^2(e^t-1-t)}$
appearing in the proof of Bennett's inequality is a sharper
estimate than the corresponding inequality
$Ee^{tS_n}\le\exp\left\{\frac{t^2V_n^2}2\frac1{1-\frac t3} \right\}$
appearing in the proof of Bernstein's inequality. (Recall how we
estimated the probability $P(S_n>u)$ in these proofs with the help of
the exponential moment $Ee^{tS_n}$.) But to prove this
it is enough to check that $e^t-1-t\le \frac{t^2}2\frac1{1-\frac t3}$
for all $0\le t<3$. This inequality clearly holds, since
$e^t-1-t=\sum\limits_{k=2}^\infty\frac{t^k}{k!}$, and
$\frac{t^2}2\frac1{1-\frac t3}=\sum\limits_{k=2}^\infty
\frac12(\frac13)^{k-2}t^k$.
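\medskip
The termwise comparison of the two series can also be checked
numerically on a grid of the interval $[0,3)$ (an illustration only):

```python
import math

# e^t - 1 - t  versus  t^2/2 * 1/(1 - t/3)  on a grid of [0, 3)
for i in range(300):
    t = i / 100.0
    bennett_term = math.exp(t) - 1.0 - t
    bernstein_term = t * t / 2.0 / (1.0 - t / 3.0)
    assert bennett_term <= bernstein_term + 1e-12
print("inequality holds on the grid")
```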
\medskip
Next I present Example~3.3, which shows that Bennett's inequality
yields a sharp estimate also in the case $u\gg V_n^2$, when
Bernstein's inequality yields a weak bound. But Bennett's
inequality provides only a small improvement, one of limited
importance. This may be the reason why Bernstein's inequality,
which yields a more transparent estimate, is more popular.
\medskip\noindent
{\bf Example 3.3 (Sums of independent random variables with bad
tail distribution for large values).} {\it Let us fix some
positive integer $n$, real numbers $u$ and $\sigma^2$ such that
$0<\sigma^2\le\frac18$, $n>4u\ge6$ and $u>4n\sigma^2$. Let
$\bar\sigma^2$ be that solution of the equation $x^2-x+\sigma^2=0$
which is smaller than~$\frac12$. Take a sequence of independent
and identically distributed random variables
$\bar X_1,\dots,\bar X_n$ such that $P(\bar X_j=1)=\bar\sigma^2$,
$P(\bar X_j=0)=1-\bar\sigma^2$ for all $1\le j\le n$. Put
$X_j=\bar X_j-E\bar X_j=\bar X_j-\bar\sigma^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and $V_n^2=n\sigma^2$.
Then $P(|X_1|\le1)=1$, $EX_1=0$, $\textrm{\rm Var}\, X_1=\sigma^2$,
hence $ES_n=0$, and $\textrm{\rm Var}\, S_n=V_n^2$. Beside this
$$
P(S_n\ge u)>\exp\left\{-Bu\log \frac u{V^2_n}\right\}
$$
with some appropriate constant $B>0$ not depending on~$n$,
$\sigma$ and~$u$.}
\medskip\noindent
{\it Proof of Example 3.3.}\/ Simple calculation shows that $EX_j=0$,
$\textrm{Var}\, X_j=\bar\sigma^2-\bar\sigma^4=\sigma^2$,
$P(|X_j|\le1)=1$, and
also the inequality $\sigma^2\le\bar\sigma^2\le\frac32\sigma^2$ holds.
To see the upper bound in the last inequality observe that
$\bar\sigma^2\le\frac13$, i.e. $1-\bar\sigma^2\ge\frac23$, hence
$\sigma^2=\bar\sigma^2(1-\bar\sigma^2)\ge\frac23\bar\sigma^2$. In
the proof of the inequality of Example~3.3 we can restrict our
attention to the case when $u$ is an integer, because in the
general case we can apply the inequality with $\bar u=[u]+1$
instead of~$u$, where $[u]$ denotes the integer part of~$u$, and
since $u\le\bar u\le 2u$, the application of the result in this
case supplies the desired inequality with a possibly worse
constant~$B>0$.
Put $\bar S_n=\sum\limits_{j=1}^n\bar X_j$. We can write
$P(S_n\ge u)=P(\bar S_n\ge u+n\bar\sigma^2)\ge P(\bar S_n\ge2u)
\ge P(\bar S_n=2u)={n\choose{2u}}
\bar\sigma^{4u}(1-\bar\sigma^2)^{(n-2u)}
\ge(\frac {n\bar\sigma^2}{2u})^{2u}(1-\bar\sigma^2)^{(n-2u)}$,
since $u\ge n\bar\sigma^2$, and $n\ge2u$. On the other hand
$(1-\bar\sigma^2)^{(n-2u)}\ge e^{-2\bar\sigma^2(n-2u)}
\ge e^{-2n\bar\sigma^2}\ge e^{-u}$, hence
\begin{eqnarray*}
P(S_n\ge u)
&\ge&\exp\left\{-2u\log\left(\frac u{n\bar\sigma^2}\right)
-2u\log2-u\right\}\\
&=&\exp\left\{-2u\log\left(\frac u{n\sigma^2}\right)
+2u\log\frac{\bar\sigma^2}{\sigma^2}-2u\log2-u\right\}\\
&\ge&\exp\left\{-100u\log\left(\frac u{V_n^2}\right)\right\}.
\end{eqnarray*}
Example~3.3 is proved.
\hfill$\qed$
\medskip
In the case $u>4V_n^2$ Bernstein's inequality yields the estimate
$P(S_n>u)\le e^{-\alpha u}$ with some universal constant $\alpha>0$,
and the above example shows that at most an additional logarithmic
factor $K\log\frac u{V_n^2}$ can be expected in the exponent of
the upper bound in an improvement of this estimate. Bennett's
inequality shows that such an improvement is really possible.
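\medskip
The two-sided picture can be illustrated on a concrete binomial example
(the parameter values $n=100$, $\sigma^2=0.01$, $u=6$ satisfy the
conditions of Example~3.3 and are my own choice; the constant $100$ is
the one appearing in the proof of Example~3.3): the exact tail
probability must lie between the lower bound of Example~3.3 and
Bennett's upper bound~(\ref{(3.2)}).

```python
import math

n, sigma2, u = 100, 0.01, 6
# smaller root of x^2 - x + sigma^2 = 0, as in Example 3.3
bar2 = (1.0 - math.sqrt(1.0 - 4.0 * sigma2)) / 2.0
V2 = n * sigma2
# exact tail P(S_n >= u) = P(Binomial(n, bar2) >= u + n*bar2)
k0 = math.ceil(u + n * bar2)
tail = sum(
    math.comb(n, k) * bar2 ** k * (1 - bar2) ** (n - k) for k in range(k0, n + 1)
)
bennett = math.exp(-V2 * ((1 + u / V2) * math.log(1 + u / V2) - u / V2))
lower = math.exp(-100 * u * math.log(u / V2))
print(lower, "<=", tail, "<=", bennett)
```

The lower bound is of course extremely crude here, but the sandwich
shows that both estimates are consistent with the exact binomial tail.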
\medskip
I finish this chapter with another estimate due to Hoeffding
which will be later useful in some symmetrization arguments.
\medskip\noindent
{\bf Theorem 3.4 (Hoeffding's inequality).}\index{Hoeffding's
inequality} {\it Let $\varepsilon_1,\dots,\varepsilon_n$
be independent random variables,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, and let $a_1,\dots,a_n$ be arbitrary real numbers. Put
$V=\sum\limits_{j=1}^na_j\varepsilon_j$. Then
\begin{equation}
P(V>u)\le\exp\left\{-\frac{u^2}{2\sum_{j=1}^na_j^2 }\right\}\quad
\textrm{for all }u>0. \label{(3.5)}
\end{equation}
}
\medskip\noindent
{\it Remark 1:}\/ Clearly $EV=0$ and
$\textrm{Var}\, V=\sum\limits_{j=1}^n a_j^2$,
hence Hoeffding's inequality yields the kind of estimate for $P(V>u)$
that the central limit theorem suggests. This estimate holds for
all real numbers $a_1,\dots,a_n$ and $u>0$.
\medskip\noindent
{\it Remark 2:}\/ The Rademacher
functions\index{Rademacher functions} $r_k(x)$, $k=1,2,\dots$,
defined by the formulas $r_k(x)=1$ if $(2j-1)2^{-k}\le x<2j2^{-k}$
and $r_k(x)=-1$ if $2(j-1)2^{-k}\le x<(2j-1)2^{-k}$,
$1\le j\le 2^{k-1}$, for all $k=1,2,\dots$, can be considered as
random variables on the probability space $\Omega=[0,1]$ with the
Borel $\sigma$-algebra and the Lebesgue measure as probability
measure on the interval $[0,1]$. They are independent random
variables with the same distribution as the random variables
$\varepsilon_1,\dots,\varepsilon_n$ considered in Theorem~3.4.
Therefore, results
about such sequences of random variables whose distributions agree
with those in Theorem~3.4 are sometimes also called results about
Rademacher functions in the literature. At some points we will
also apply this terminology.
\medskip\noindent
{\it Proof of Theorem 3.4.} Let us give a good bound on the
exponential moment $Ee^{tV}$ for all $t>0$. The identity
$Ee^{tV}=\prod\limits_{j=1}^nEe^{ta_j\varepsilon_j}=
\prod\limits_{j=1}^n\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2$ holds,
and
$\frac{\left(e^{a_jt}+e^{-a_jt}\right)}2=\sum\limits_{k=0}^\infty
\frac{a_{j}^{2k}} {(2k)!}t^{2k}\le \sum\limits_{k=0}^\infty \frac
{(a_jt)^{2k}}{2^{k}k!}=e^{a_j^2t^2/2}$, since $(2k)!\ge 2^k k!$
for all $k\ge0$. This implies that $Ee^{tV}\le
\exp\left\{\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$. Hence
$P(V>u)\le\exp\left\{-tu+\frac{t^2}2\sum\limits_{j=1}^n a_j^2\right\}$,
and we get relation (\ref{(3.5)}) with the choice
$t=u\left(\sum\limits_{j=1}^n a_j^2\right)^{-1}$.
\hfill$\qed$
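\medskip
Since the sum $V$ takes only finitely many values, Hoeffding's
inequality can be verified exactly for small~$n$ by enumerating all
$2^n$ equally likely sign patterns (the coefficients below are
arbitrary illustrative choices of mine):

```python
import math
from itertools import product

def hoeffding_bound(u, a):
    # right-hand side of Hoeffding's inequality (3.5)
    return math.exp(-u * u / (2.0 * sum(x * x for x in a)))

a = [1.0, 0.5, 2.0, 1.5, 0.7, 1.2]
u = 3.0
# exact tail probability: average over all 2^n equally likely sign vectors
count = sum(
    1 for eps in product((-1, 1), repeat=len(a))
    if sum(e * x for e, x in zip(eps, a)) > u
)
exact = count / 2 ** len(a)
print(exact, "<=", hoeffding_bound(u, a))
```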
\chapter{On the supremum of a nice class of partial sums}
This chapter contains an estimate about the supremum of a nice
class of normalized sums of independent and identically
distributed random variables together with an analogous result
about the supremum of an appropriate class of one-fold random
integrals with respect to a normalized empirical distribution.
The second result deals with a one-variate version of the
problem about the estimation of multiple integrals with respect
to a normalized empirical distribution. This problem was
mentioned in the introduction. Some natural questions related
to these results will be also discussed. It will be examined
how restrictive their conditions are. In particular, we are
interested in the question how the condition about the
countable cardinality of the class of random variables can be
weakened. A natural Gaussian counterpart of the supremum
problems about random one-fold integrals will be also
considered. Most proofs will be postponed to later chapters.
To formulate these results first a notion will be
introduced that plays a most important role in the sequel.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions.}
\index{L${}_p$-dense, (in particular $L_2$-dense classes) of functions}
{\it Let a measurable space $(Y,{\cal Y})$ be given together with
a class ${\cal G}$ of ${\cal Y}$ measurable real valued functions
on this space. The class of functions ${\cal G}$ is called an
$L_p$-dense class of functions, $1\le p<\infty$, with parameter~$D$
and exponent~$L$ if for all numbers $0<\varepsilon\le1$ and
probability measures $\nu$ on the space $(Y,{\cal Y})$ there
exists a finite $\varepsilon$-dense subset
${\cal G}_{\varepsilon,\nu}=\{g_1,\dots,g_m\}\subset {\cal G}$
in the space $L_p(Y,{\cal Y},\nu)$ with $m\le D\varepsilon^{-L}$
elements, i.e. there exists such a set ${\cal G}_{\varepsilon,\nu}
\subset {\cal G}$ with $m\le D\varepsilon^{-L}$ elements for which
$\inf\limits_{g_j\in{\cal G}_{\varepsilon,\nu}}\int |g-g_j|^p\,d\nu
<\varepsilon^p$
for all functions $g\in {\cal G}$. (Here the set
${\cal G}_{\varepsilon,\nu}$ may depend
on the measure $\nu$, but its cardinality is bounded by a number
depending only on $\varepsilon$.)}
\medskip
In most results of this work the above defined $L_p$-dense classes
will be considered only for the parameter $p=2$. But at some
points it will be useful to work also with $L_p$-dense classes with
a different parameter~$p$. Hence to avoid some repetitions I
introduced the above definition for a general parameter~$p$. When
working with $L_p$-dense classes we shall consider only such
classes of functions ${\cal G}$ whose elements are functions with
bounded absolute value. Hence all integrals appearing in the
definition of $L_p$-dense classes of functions are finite.
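\medskip
As a concrete illustration (not taken from the text): the class of
half-line indicator functions $\{f_u(x)=1 \textrm{ if } x\le u,\;
f_u(x)=0 \textrm{ otherwise}\}$ on the real line is an $L_2$-dense
class with parameter $D=2$ and exponent $L=2$, since
$\int|f_u-f_v|^2\,d\nu=\nu((u,v])$ for $u<v$, and for any probability
measure~$\nu$ one can cut at quantile levels of $\nu$-mass
$\varepsilon^2$. The sketch below builds such an $\varepsilon$-net for
a discrete measure; the function name and the data layout are my own.

```python
def l2_net(points, weights, eps):
    """Thresholds u_1 < ... < u_m such that every indicator 1(x <= u) is
    within eps of some 1(x <= u_j) in L_2(nu), with m <= 2 * eps**-2.
    Here nu is the discrete measure putting weight weights[i] on points[i]."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    net, mass = [], 0.0
    for i in order:
        mass += weights[i]
        if mass >= eps * eps:      # accumulated nu-mass reached eps^2:
            net.append(points[i])  # cut here and restart the counter
            mass = 0.0
    net.append(points[order[-1]])  # cover the right tail
    return net

# tiny demonstration on a uniform discrete measure (my own choice of data)
pts = [i / 50.0 for i in range(50)]
w = [1.0 / 50] * 50
eps = 0.3
net = l2_net(pts, w, eps)
print(len(net), "thresholds; allowed:", 2 / eps ** 2)
```

Each cut consumes at least $\varepsilon^2$ of the unit total mass, which
is what keeps the cardinality of the net below $2\varepsilon^{-2}$.
\medskip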
The following estimate will be proved.
\medskip\noindent
{\bf Theorem 4.1 (Estimate on the supremum of a class of partial
sums).}\index{estimate on the supremum of a class of partial sums}
{\it Let us consider a sequence of independent and
identically distributed random variables $\xi_1,\dots,\xi_n$,
$n\ge2$, with values in a measurable space $(X,{\cal X})$ and with
some distribution~$\mu$. Beside this, let a countable and
$L_2$-dense class of functions ${\cal F}$ with some parameter $D\ge1$
and exponent $L\ge1$ be given on the space $(X,{\cal X)}$ which
satisfies the conditions
\begin{eqnarray}
\|f\|_\infty&=&\sup_{x\in X}|f(x)|\le 1,
\qquad \textrm{for all }
f\in{\cal F} \label{(4.1)} \\
\|f\|_2^2&=&\int f^2(x) \mu(\,dx)\le \sigma^2
\qquad \textrm{for all }
f\in {\cal F} \label{(4.2)}
\end{eqnarray}
with some constant $0<\sigma\le1$, and
\begin{equation}
\int f(x)\mu(\,dx)=0 \quad \textrm{for all } f\in{\cal F}.
\label{(4.3)}
\end{equation}
Define the normalized partial sums $S_n(f)=\frac1{\sqrt n}
\sum\limits_{k=1}^n f(\xi_k)$ for all $f\in {\cal F}$.
There exist some universal constants $C>0$, $\alpha>0$ and $M>0$
such that the supremum of the normalized random sums $S_n(f)$,
$f\in {\cal F}$, satisfies the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge u\right)
\le C\exp\left\{-\alpha\left(\frac u{\sigma}\right)^2\right\}
\quad \textrm{for those numbers } u \nonumber \\
&&\qquad \textrm{for which }\sqrt n\sigma^2\ge
u\ge M\sigma(L^{3/4}\log^{1/2}\tfrac2\sigma +(\log D)^{3/4}),
\label{(4.4)}
\end{eqnarray}
where the numbers~$D$ and $L$ in formula~(\ref{(4.4)}) agree with
the parameter and exponent of the $L_2$-dense class~${\cal F}$.}
\medskip\noindent
{\it Remark.}\/ Here and also in the subsequent part of this work
we consider random variables which take their values in a general
measurable space $(X,{\cal X})$. The only restriction we impose
on these spaces is that all sets consisting of one point are
measurable, i.e. $\{x\}\in{\cal X}$ for all $x\in X$.
\medskip
The condition $\sqrt n\sigma^2\ge u\ge
M\sigma(L^{3/4}\log^{1/2}\frac2\sigma +(\log D)^{3/4})$ for the
numbers~$u$ for which inequality~(\ref{(4.4)}) holds is natural.
I discuss this after the
formulation of Theorem~4.2 which can be considered as the Gaussian
counterpart of Theorem~4.1. I also formulate a result in Example~4.3
which can be considered as part of this discussion.
\medskip
The condition about the countable cardinality of ${\cal F}$ can be
weakened with the help of the notion of countable approximability
introduced below. For the sake of later applications I define it
in a more general form than needed in this chapter. In the subsequent
part of this work I shall assume that the probability measure I work
with is complete, i.e. for all such pairs of sets~$A$ and~$B$ in the
probability space $(\Omega,{\cal A},P)$ for which $A\in{\cal A}$,
$P(A)=0$ and $B\subset A$ we have $B\in{\cal A}$ and $P(B)=0$.
\medskip\noindent
{\bf Definition of countably approximable classes of random
variables.} \index{countably approximable classes of random variables}
{\it Let us have a class of random variables $U(f)$,
$f\in {\cal F}$, indexed by a class of functions $f\in{\cal F}$
on a measurable space $(Y,{\cal Y})$. This class of random variables
is called countably approximable if there is a countable subset
${\cal F}'\subset {\cal F}$ such that for all numbers $u>0$ the sets
$A(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}'} |U(f)(\omega)|\ge u\}$
satisfy the identity $P(A(u)\setminus B(u))=0$.}
\medskip
Clearly, $B(u)\subset A(u)$. In the above definition it was demanded
that for all $u>0$ the set $B(u)$ should be almost as large as
$A(u)$. The following corollary of Theorem~4.1 holds.
\medskip\noindent
{\bf Corollary of Theorem~4.1.} {\it Let a class of functions
${\cal F}$ satisfy the conditions of Theorem~4.1 with the only
exception that instead of the condition about the countable
cardinality of ${\cal F}$ it is assumed that the class of random
variables $S_n(f)$, $f\in{\cal F}$, is countably approximable. Then
the random variables $S_n(f)$, $f\in{\cal F}$, satisfy
relation~(\ref{(4.4)}).}
\medskip
This corollary can be simply proved, only Theorem~4.1 has to be
applied for the class ${\cal F}'$. To do this it has to be checked
that if ${\cal F}$ is an $L_2$-dense class with some parameter $D$
and exponent $L$, and ${\cal F}'\subset {\cal F}$, then ${\cal F}'$ is
also an $L_2$-dense class with the same exponent $L$, only with a
possibly different parameter~$D'$.
To prove this statement let us choose for all numbers
$0<\varepsilon\le1$ and probability measures $\nu$ on
$(Y,{\cal Y})$ some functions
$f_1,\dots,f_m\in {\cal F}$ with
$m\le D\left(\frac\varepsilon2\right)^{-L}$ elements, such that
the sets ${\cal D}_j=\left\{f\colon\,\int |f-f_j|^2\,d\nu\le
\left(\frac\varepsilon2\right)^2\right\}$ satisfy the relation
$\bigcup\limits_{j=1}^m {\cal D}_j\supset{\cal F}$. For all sets
${\cal D}_j$ for which ${\cal D}_j\cap {\cal F}'$ is
non-empty choose a function $f'_j\in {\cal D}_j\cap {\cal F}'$. In
such a way we get a collection of functions $f'_j$ from the class
${\cal F}'$ containing at most $2^LD\varepsilon^{-L}$ elements
which satisfies
the condition imposed for $L_2$-dense classes with exponent $L$ and
parameter $2^LD$ for this number $\varepsilon$ and measure $\nu$.
\medskip
Next I formulate in Theorem~$4.1'$ a result about the supremum of
the integral of a class of functions with respect to a normalized
empirical distribution. It can be considered as a simple version
of Theorem~4.1. I formulated this result, because Theorems~4.1
and~$4.1'$ are special cases of their multivariate counterparts
about the supremum of so-called $U$-statistics and multiple
integrals with respect to a normalized empirical distribution
discussed in Chapter~8. These results are also closely related,
but the explanation of their relation demands some work.
Given a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ taking values in $(X,{\cal X})$ let us introduce
their empirical distribution on $(X,{\cal X})$ as
\begin{equation}
\mu_n(A)(\omega)
=\frac1n \#\left\{j\colon\, 1\le j\le n,\; \xi_j(\omega)\in
A\right\} \quad \textrm{for all } A\in {\cal X}, \label{(4.5)}
\end{equation}
and define for all measurable and $\mu$~integrable functions~$f$
the (random) integral
\begin{equation}
J_n(f)=J_{n,1}(f)=\sqrt n\int f(x)(\mu_n(\,dx)-\mu(\,dx)). \label{(4.6)}
\end{equation}
Clearly
$$
J_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n (f(\xi_j)-Ef(\xi_j))
=S_n(\hat f)
$$
with $\hat f(x)=f(x)-\int f(x)\mu(\,dx)$. It is not
difficult to see that $\sup\limits_{x\in X}|\hat f(x)|\le2$ if
$\sup\limits_{x\in X}|f(x)|\le1$, $\int \hat f(x)\mu(\,dx)=0$,
$\int \hat f^2(x)\mu(\,dx)\le\int f^2(x)\mu(\,dx)$, and if
${\cal F}$ is an $L_2$-dense class of functions with parameter~$D$
and exponent~$L$, then the class of functions $\bar{\cal F}$
consisting of the functions
$\bar f(x)=\frac12\left(f(x)-\int f(x)\mu(\,dx)\right)$,
$f\in{\cal F}$, is an $L_2$-dense class of functions with
parameter $D$ and exponent $L$. Indeed,
$\int(\bar f-\bar g)^2\,d\nu\le\frac12\int(f-g)^2\,d\nu
+\frac12\int(f-g)^2\,d\mu=\int(f-g)^2\,\frac{d(\mu+\nu)}2$, hence
$\{\bar f_1,\dots,\bar f_m\}$ is an $\varepsilon$-dense set of
$\bar{\cal F}$ in the $L_2(\nu)$-norm if $\{f_1,\dots,f_m\}$ is
an $\varepsilon$-dense set of ${\cal F}$ in the
$L_2(\frac{\mu+\nu}2)$-norm. Hence Theorem~4.1 implies the
following result.
\medskip\noindent
{\bf Theorem 4.1$'$ (Estimate on the supremum of random integrals
with respect to a normalized empirical distribution).}\index{estimate
on the supremum of random integrals with respect to a normalized
empirical distribution} {\it Let us
have a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge2$, with distribution~$\mu$ on
a measurable space $(X,{\cal X})$ together with some class of
functions ${\cal F}$ on this space which satisfies the
conditions of Theorem~4.1 with the possible exception of
condition~(\ref{(4.3)}). The estimate~(\ref{(4.4)}) remains valid
if the random sums $S_n(f)$ are replaced in it by the random
integrals $J_n(f)$ defined in~(\ref{(4.6)}). Moreover,
similarly to the corollary of Theorem~4.1, the condition about the
countable cardinality of the set ${\cal F}$ can be replaced by the
condition that the class of random variables $J_n(f)$,
$f\in{\cal F}$, is countably approximable.}
\medskip
All finite dimensional distributions of the set of random variables
$S_n(f)$, $f\in{\cal F}$, considered in Theorem~4.1 converge to those
of a Gaussian random field $Z(f)$, $f\in{\cal F}$, with expectation
$EZ(f)=0$ and covariance $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\cal F}$, as $n\to\infty$. Here, and in the subsequent part
of the paper a collection of random variables indexed by some set
of parameters will be called a Gaussian random field if for all
finite subsets of these parameters the random variables indexed by
this finite set are jointly Gaussian. We shall also define
so-called linear Gaussian random fields.\index{linear Gaussian random
fields} They consist of jointly Gaussian random variables $Z(f)$,
$f\in{\cal G}$, indexed by the elements of a linear space
$f\in{\cal G}$ which satisfy the relation $Z(af+bg)=aZ(f)+bZ(g)$
with probability~1 for all real numbers $a$ and $b$ and
$f,g\in{\cal G}$. (Let us observe that a set of Gaussian random
variables $Z(f)$, indexed by the elements of a linear space
$f\in{\cal G}$ such that $EZ(f)=0$, and
$EZ(f)Z(g)=\int f(x)g(x)\,\mu(\,dx)$ for all $f,g\in{\cal G}$ is a
linear Gaussian random field. This can be seen by checking the
identity $E[Z(af+bg)-(aZ(f)+bZ(g))]^2=0$ for all real numbers $a$
and $b$ and $f,g\in{\cal G}$ in this case.)
Let us consider a linear Gaussian random field $Z(f)$, $f\in{\cal G}$,
where the set of indices~${\cal G}={\cal G}_\mu$ consists of the
functions~$f$ square integrable with respect to a $\sigma$-finite
measure~$\mu$, and take an appropriate restriction of this field to
some parameter set ${\cal F}\subset {\cal G}$. In the next
Theorem~4.2 I present a natural Gaussian counterpart of Theorem~4.1
by means of an appropriate choice of~${\cal F}$. Let me also remark
that in Chapter~10 the multiple Wiener--It\^o integrals of functions
of $k$~variables with respect to a white noise will be defined for
all $k\ge1$. In the special case $k=1$ the Wiener--It\^o integrals
for an appropriate class of functions $f\in{\cal F}$ yield a model
for which Theorem~4.2 is applicable. Before formulating this result
let us introduce the following definition which is a version of the
definition of $L_p$-dense functions.
\medskip\noindent
{\bf Definition of
$L_p$-dense classes of functions with respect to a
measure~$\mu$.}\index{L${}_p$-dense classes of functions with
respect to a measure~$\mu$}
{\it Let a measurable space $(X,{\cal X})$ be given
together with a measure $\mu$ on the $\sigma$-algebra ${\cal X}$ and
a set ${\cal F}$ of ${\cal X}$ measurable real valued functions on
this space. The set of functions ${\cal F}$ is called an $L_p$-dense
class of functions, $1\le p<\infty$, with respect to the
measure~$\mu$ with parameter $D$ and exponent $L$ if for all
numbers $0<\varepsilon\le1$ there exists a finite $\varepsilon$-dense
subset ${\cal F}_\varepsilon=\{f_1,\dots,f_m\}\subset{\cal F}$
in the space
$L_p(X,{\cal X},\mu)$ with $m\le D\varepsilon^{-L}$ elements, i.e.
such a set ${\cal F}_\varepsilon\subset {\cal F}$ with
$m\le D\varepsilon^{-L}$ elements for which
$\inf\limits_{f_j\in {\cal F}_\varepsilon}\int |f-f_j|^p\,d\mu
<\varepsilon^p$ for all functions $f\in{\cal F}$.}
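\medskip
The following simple example may illuminate this definition. Let
$X=[0,1]$ with a non-atomic probability measure~$\mu$ on it, and
let ${\cal F}=\{f_t\colon\,0\le t\le1\}$ with $f_t(x)=1$ if $x<t$,
and $f_t(x)=0$ otherwise. For a number $0<\varepsilon\le1$ choose
points $0=t_0\le t_1\le\dots\le t_m=1$ with
$\mu([t_{j-1},t_j))\le\frac{\varepsilon^p}2$ for all $1\le j\le m$
and $m\le2\varepsilon^{-p}+1$, which is possible by taking the
quantiles of~$\mu$. If $t_{j-1}\le t\le t_j$, then
$\int|f_t-f_{t_j}|^p\,d\mu=\mu([t,t_j))<\varepsilon^p$, hence the
sets ${\cal F}_\varepsilon=\{f_{t_1},\dots,f_{t_m}\}$ show that
${\cal F}$ is an $L_p$-dense class of functions with respect
to~$\mu$ with parameter $D=3$ and exponent $L=p$.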
\medskip\noindent
{\bf Theorem 4.2 (Estimate on the supremum of a class of Gaussian
random variables).} \index{estimate on the supremum of a class of
Gaussian random variables} {\it Let a probability measure $\mu$ be given
on a measurable space $(X,{\cal X})$ together with a linear Gaussian
random field $Z(f)$, $f\in{\cal G}$, such that $EZ(f)=0$,
$EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$, $f,g\in{\cal G}$, where ${\cal G}$
is the space of square integrable functions with respect to this
measure~$\mu$. Let ${\cal F}\subset{\cal G}$ be a countable and
$L_2$-dense class of functions with respect to the measure~$\mu$
with some exponent~$L\ge1$ and parameter~$D\ge1$ which also
satisfies condition~(\ref{(4.2)}) with some $0<\sigma\le1$.
Then there exist some universal constants $C>0$ and $M>0$ (for
instance $C=4$ and $M=16$ is a good choice) such that the inequality
\begin{eqnarray}
P\left(\sup\limits_{f\in{\cal F}}|Z(f)|
\ge u\right)&&\le C(D+1) \exp\left\{-\frac1{256}
\left(\frac u{\sigma}\right)^2\right\} \nonumber \\
&&\qquad \textrm{if }u\ge ML^{1/2}\sigma \log^{1/2}\frac2\sigma
\label{(4.7)}
\end{eqnarray}
holds with the parameter $D$ and exponent $L$ introduced in this
theorem.}
\medskip\noindent
{\it Remark.} In formulas~(\ref{(4.4)}) of Theorem~4.1 and
in~(\ref{(4.7)}) of Theorem~4.2 we had a slightly different lower
bound on the numbers~$u$ for which these results give an estimate
on the probability that the supremum of certain random variables
is larger than~$u$. Nevertheless in the most interesting cases
when the exponent~$L$ and the parameter~$D$ of the $L_2$-dense class
of functions we consider in these theorems are separated both from
zero and infinity these bounds behave similarly. In such cases they
have the magnitude $\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$.
In~(\ref{(4.7)}) the lower bound on the number~$u$ did not depend
on the parameter~$D$, since the dependence on this parameter
appeared in the coefficient at the right-hand side of the inequality
in this relation. The formula providing a lower bound on the
number~$u$ had a coefficient~$L^{3/4}$ in~(\ref{(4.4)}) and not a
coefficient $L^{1/2}$ as in~(\ref{(4.7)}). This is a weak bound if
$L$ is very large, and it could be improved. But we did not work on
this problem, because we were mainly interested in a good bound in
the case when the exponent~$L$ is separated from infinity.
\medskip
The exponent at the right-hand side of inequality~(\ref{(4.7)})
does not contain the best possible universal constant. One could
choose the coefficient $\frac{1-\varepsilon}2$ with arbitrary small
$\varepsilon>0$ instead of the coefficient $\frac1{256}$ in the
exponent at the right-hand side of~(\ref{(4.7)}) if the universal
constants $C>0$ and $M>0$ are chosen sufficiently large in this
inequality. Actually, later in Theorem~8.6 such an estimate will
be proved which can be considered as the multivariate
generalization of Theorem~4.2 with the expression
$-\frac{(1-\varepsilon)u^2}{2\sigma^2}$ in the exponent.
The condition about the countable cardinality of the set ${\cal F}$
in Theorem~4.2 could be weakened similarly to Theorem~4.1. But
I omit the discussion of this question, since Theorem~4.2 was
only introduced for the sake of a comparison between the
Gaussian and non-Gaussian case. An essential difference between
Theorems~4.1 and~4.2 is that the class of functions~${\cal F}$
considered in Theorem~4.1 had to be $L_2$-dense, while in
Theorem~4.2 a weaker version of this property was needed. In
Theorem~4.2 it was demanded that there exists a finite subset of
${\cal F}$ of relatively small cardinality which is dense in the
$L_2(\mu)$ norm. In the $L_2$-density property imposed in
Theorem~4.1 a similar property was demanded for all probability
measures~$\nu$. The appearance of such a condition may be
unexpected. It is not clear why we demand this property for
such probability measures~$\nu$ which have nothing to do with
our problem. But as we shall see, the proof of Theorem~4.1
contains a conditioning argument where a lot of new conditional
measures appear, and the $L_2$-density property is needed to
work with all of them. One would also like to know some results
that enable us to check when this condition holds. In the next
chapter a notion popular in probability theory, the notion of
Vapnik--\v{C}ervonenkis classes will be introduced, and it
will be shown that a Vapnik--\v{C}ervonenkis class of
functions bounded by~1 is $L_2$-dense.
Another difference between Theorems~4.1 and~4.2 is that the
conditions of formula~(\ref{(4.4)}) contain the upper bound
$\sqrt n\sigma^2>u$, and no similar condition was imposed in
formula~(\ref{(4.7)}). The appearance of this condition in
Theorem~4.1 can be explained by comparing this result with those
of Chapter~3. As we have seen, we do not lose much information
if we restrict our attention to the case
$u\le\textrm{const.}\, V_n^2=\textrm{const.}\, n\sigma^2$ in
Bernstein's inequality (if sums of independent and identically
distributed random variables are considered). Theorem~4.1 gives
an almost as good estimate for the supremum of normalized partial
sums under appropriate conditions for the class ${\cal F}$ of
functions we consider in this theorem as Bernstein's inequality
yields for the normalized partial sums of independent and
identically distributed random variables with variance bounded
by~$\sigma^2$. But we could prove the estimate of Theorem~4.1
only under the condition $\sqrt n\sigma^2>u$. (Actually we could
slightly improve this result. We could impose the condition
$B\sqrt n\sigma^2>u$ with an arbitrary constant $B>0$
in~(\ref{(4.4)}) if the remaining constants in this formula are
appropriately chosen in dependence of~$B$.) There is also a
natural reason why condition~(\ref{(4.1)}) about the supremum
of the functions $f\in {\cal F}$ appeared in Theorems 4.1
and~$4.1'$, while no such condition was needed in Theorem~4.2.
The lower bounds for the level~$u$ were imposed in
formulas~(\ref{(4.4)}) and~(\ref{(4.7)}) because of a similar
reason. To understand why such a condition is needed in
formula~(\ref{(4.7)}) let us consider the
following example.
Take a Wiener process $W(t)$, $0\le t\le1$,
define for all $0\le s<t\le1$ the function $f_{s,t}(x)=1$ if
$s\le x<t$, and $f_{s,t}(x)=0$ otherwise, together with the
random integrals $Z(f_{s,t})=\int f_{s,t}(x)\,W(\,dx)=W(t)-W(s)$,
and introduce for all $0<\sigma\le1$ the following class of
functions ${\cal F}_\sigma$:
${\cal F}_\sigma=\{f_{s,t}\colon\,0\le s<t\le1,\;t-s\le\sigma^2\}$.
Then $EZ(f_{s,t})=0$ and $EZ(f_{s,t})^2=t-s\le\sigma^2$ for all
$f_{s,t}\in{\cal F}_\sigma$, and formula~(\ref{(4.7)}) suggests an
estimate of the form
$P\left(\sup\limits_{f\in{\cal F}_\sigma}|Z(f)|>u\right)
\le e^{-\textrm{const.}\,(u/\sigma)^2}$.
However, this relation does not hold if
$u=u(\sigma)<2(1-\varepsilon)\sigma\log^{1/2}\frac1\sigma$
with some $\varepsilon>0$. In such cases
$P\left(\sup\limits_{f\in{\cal F}_\sigma}Z(f) >u\right)\to1$,
as $\sigma\to0$. This can be proved relatively simply with the help
of the estimate
$P(Z(f_{s,t})>u(\sigma))\ge\textrm{const.}\, \sigma^{2(1-\varepsilon)^2}$
if $|t-s|=\sigma^2$ and the independence of the random integrals
$Z(f_{s,t})$ if the functions $f_{s,t}$ are indexed by such pairs
$(s,t)$ for which the intervals $(s,t)$ are disjoint. This means
that in this example formula~(\ref{(4.7)}) holds only under the
condition $u\ge M\sigma\log^{1/2}\frac1\sigma$ with $M=2$.
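Let me sketch how the last two mentioned facts imply this
statement. Take the $N=\lfloor\sigma^{-2}\rfloor$ pairs
$(s,t)=((j-1)\sigma^2,j\sigma^2)$, $1\le j\le N$. The intervals
$(s,t)$ are disjoint, hence the corresponding random variables
$Z(f_{s,t})$ are independent, and
$$
P\left(\sup\limits_{f\in{\cal F}_\sigma}Z(f)\le u(\sigma)\right)
\le\left(1-\textrm{const.}\,\sigma^{2(1-\varepsilon)^2}\right)^N
\le\exp\left\{-\textrm{const.}\,
\sigma^{-(2-2(1-\varepsilon)^2)}\right\}\to0
\quad\textrm{as }\sigma\to0,
$$
since $2-2(1-\varepsilon)^2=2\varepsilon(2-\varepsilon)>0$.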
There is a classical result about the modulus of continuity of
Wiener processes, and actually this result helped us to find the
previous example. It is also worth mentioning that there are some
concentration inequalities, \index{concentration inequalities}
see Ledoux~\cite{r29} and Talagrand~\cite{r52},
which state that under very general conditions the distribution
of the supremum of a class of partial sums of independent random
variables or of the elements of a Gaussian random field is
strongly concentrated around the expected value of this supremum.
(Talagrand's result in this direction is also formulated in
Theorem~18.1 of this lecture note.) These results imply that the
problems discussed in Theorems~4.1 and~4.2 can be reduced to a
good estimate of the expected value
$E\sup\limits_{f\in{\cal F}}|S_n(f)|$ and
$E\sup\limits_{f\in{\cal F}}|Z(f)|$ of the supremum considered in
these results. However, the estimation of the expected value of
these suprema is not much simpler than the original problem.
Theorem~4.2 implies that under its conditions
$$E
\sup\limits_{f\in{\cal F}}|Z(f)|
\le\textrm{const.}\, \sigma\log^{1/2}\frac2\sigma
$$
with an appropriate multiplying constant depending on the
parameter~$D$ and exponent~$L$ of the class of functions~${\cal F}$.
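This bound can be obtained from~(\ref{(4.7)}) by integrating the
tail probability. Put $u_0=ML^{1/2}\sigma\log^{1/2}\frac2\sigma$.
Then
\begin{eqnarray*}
E\sup\limits_{f\in{\cal F}}|Z(f)|
&=&\int_0^\infty
P\left(\sup\limits_{f\in{\cal F}}|Z(f)|\ge u\right)\,du
\le u_0+C(D+1)\int_{u_0}^\infty
e^{-\frac1{256}(u/\sigma)^2}\,du\\
&\le&u_0+\textrm{const.}\,(D+1)\,\sigma
\le\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma.
\end{eqnarray*}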
In the case of Theorem~4.1 a similar estimate holds, but under more
restrictive conditions. We also have to impose that
$\sqrt n\sigma^2\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$ with
a sufficiently large constant. This condition is needed to guarantee
that the set of numbers~$u$ satisfying condition~(\ref{(4.4)}) is
not empty. If this condition is violated, then Theorem~4.1 supplies
a weaker estimate which we get by replacing $\sigma$ by an
appropriate~$\bar\sigma>\sigma$, and by applying Theorem~4.1 with
this number~$\bar\sigma$.
One may ask whether the above estimate on the expected value of
the supremum of normalized partial sums holds without the condition
$\sqrt n\sigma^2\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$.
We show an example which gives a negative answer to this question.
Since here we discuss a rather particular problem which is outside
of our main interest in this work I give a rather sketchy
explanation of this example. I present this example together with
a Poissonian counterpart of it which may help to explain its
background.
\medskip\noindent
{\bf Example 4.3 (Supremum of partial sums with bad tail behaviour).}
{\it Let $\xi_1,\dots,\xi_n$ be a sequence of independent random
variables with uniform distribution in the interval~$[0,1]$. Choose
a sequence of real numbers, $\varepsilon_n$, $n=3,4,\dots$, such that
$\varepsilon_n\to0$ as $n\to\infty$, and
$\frac12\ge\varepsilon_n\ge n^{-\delta}$ with a
sufficiently small number $\delta>0$. Put
$\sigma_n=\varepsilon_n\sqrt{\frac{\log n}n}$, and define the set of
functions $\bar f_{j,n}(\cdot)$ and $f_{j,n}(\cdot)$
on the interval $[0,1]$ by the formulas
$\bar f_{j,n}(x)=1$ if $(j-1)\sigma^2_n\le x<j\sigma^2_n$, and
$\bar f_{j,n}(x)=0$ otherwise,
$f_{j,n}(x)=\bar f_{j,n}(x)-\sigma^2_n$,
$1\le j\le\frac1{\sigma^2_n}$, and put
${\cal F}_n=\left\{f_{j,n}\colon\,1\le j\le\frac1{\sigma^2_n}\right\}$.
Consider the normalized random sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{l=1}^n f(\xi_l)$,
$f\in{\cal F}_n$, and the numbers
$u_n=\frac{A\log n}{\sqrt n\log\frac1{\varepsilon_n}}$
with some sufficiently small number $A>0$. Then
$$
\lim_{n\to\infty}P\left(\sup_{f\in{\cal F}_n}S_n(f)>u_n\right)=1.
$$
}
\medskip
This example has the following Poissonian counterpart.
\medskip\noindent
{\bf Example 4.3$'$ (A Poissonian counterpart of Example 4.3).}
{\it Let $\bar P_n(x)$ be a Poisson process on the interval~$[0,1]$
with parameter~$n$ and $P_n(x)=\frac1{\sqrt n}[\bar P_n(x)-nx]$,
$0\le x\le 1$. Consider the same sequences of numbers~$\varepsilon_n$,
$\sigma_n$ and~$u_n$ as in Example~4.3, and define the random
variables $Z_{n,j}=P_n(j\sigma^2_n)-P_n((j-1)\sigma^2_n)$ for all
$n=3,4,\dots$ and $1\le j\le \frac1{\sigma^2_n}$. Then
$$
\lim_{n\to\infty}P\left(\sup_{1\le j\le \frac1{\sigma^2_n}}
Z_{n,j}>u_n\right)=1.
$$
}
\medskip
The classes of functions ${\cal F}_n$ in Example~4.3 are $L_2$-dense
classes of functions with some exponent~$L$ and parameter~$D$
not depending on the parameter~$n$ and the choice of the
numbers~$\sigma_n$. It can be seen that even the class of functions
${\cal F}=\{f_{s,t}\colon\, f_{s,t}(x)=1\textrm{ if }s\le x<t,\;
f_{s,t}(x)=0\textrm{ otherwise},\;0\le s<t\le1\}$ is an
$L_2$-dense class of functions. Put
$\bar\sigma_n^2=\sup\limits_{f\in{\cal F}_n}\int f^2\,d\mu
=\sigma_n^2(1-\sigma_n^2)$. On the other hand, Example~4.3 implies
that $E\sup\limits_{f\in{\cal F}_n}S_n(f)\ge\textrm{const.}\,u_n$
with $u_n=\frac{A\log n}{\sqrt n\log\frac1{\varepsilon_n}}$,
$A>0$, in this case. As
$\varepsilon_n\log\frac1{\varepsilon_n}\to0$ as $n\to\infty$,
this means that the
expected value of the supremum of the random sums considered in
Example~4.3 does not satisfy the estimate
$\limsup\limits_{n\to\infty}
\frac1{\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}}
E\sup\limits_{f\in{\cal F}_n}S_n(f)<\infty$ suggested by
Theorem~4.1. Observe that $\sqrt n\bar\sigma^2_n
\sim\textrm{const.}\, \varepsilon_n\bar\sigma_n\log^{1/2}
\frac2{\bar\sigma_n}$
in this case, since
$\sqrt n\bar\sigma^2_n\sim\varepsilon_n^2\frac{\log n}{\sqrt n}$,
and $\bar\sigma_n\log^{1/2}\frac2{\bar\sigma_n}
\sim \textrm{const.}\,\varepsilon_n\frac{\log n}{\sqrt n}$.
\medskip\noindent
{\it The proof of Examples~4.3 and~$4.3'$.} First we prove the
statement of Example~$4.3'$. For a fixed index~$n$ the number of
random variables $Z_{n,j}$ equals
$\frac1{\sigma_n^2}=\frac1{\varepsilon_n^2}\frac n{\log n}
\ge\frac n{\log n}$, and they are independent. Hence it is enough
to show that $P(Z_{n,j}>u_n)\ge n^{-1/2}$ if first $A>0$ and then
$\delta>0$ (appearing in the condition
$\varepsilon_n>n^{-\delta}$) are chosen sufficiently small, and
$n\ge n_0$ with some threshold index $n_0=n_0(A,\delta)$.
Put $\bar u_n=[\sqrt nu_n+n\sigma^2_n]+1$, where $[\cdot]$ denotes
integer part. Then
$P(Z_{n,j}>u_n)\ge P(\bar P_n(\sigma^2_n)\ge\bar u_n)
\ge P(\bar P_n(\sigma^2_n)=\bar u_n)
=\frac{(n\sigma_n^2)^{\bar u_n}}{\bar u_n!}e^{-n\sigma_n^2}
\ge \left(\frac{n\sigma_n^2}{\bar u_n}\right)^{\bar u_n}e^{-n\sigma_n^2}$.
Some calculation shows that
$\bar u_n\le\frac{A \log n}{\log \frac1{\varepsilon_n}}
+\varepsilon_n^2\log n+1
\le\frac{2A \log n}{\log \frac1{\varepsilon_n}}$,
$\frac{n\sigma_n^2}{\bar u_n}
\ge\frac{\varepsilon_n^2\log\frac1{\varepsilon_n}}{2A}$,
and $\log\frac{n\sigma_n^2}{\bar u_n}\ge- 2\log\frac1{\varepsilon_n}$
if the constants $A>0$, $\delta>0$ and threshold index $n_0$ are
appropriately chosen. Hence
$P(Z_{n,j}>u_n)\ge e^{-2\bar u_n\log(1/\varepsilon_n)-n\sigma_n^2}
\ge e^{-2A\log n-\varepsilon_n^2\log n}\ge\frac1{\sqrt n}$
if~$A>0$ is small enough.
The statement of Example~4.3 can be deduced from~Example~$4.3'$
by applying Poissonian approximation. Let us apply the result of
Example~$4.3'$ for a Poisson process $\bar P_{n/2}$ with
parameter~$\frac n2$ and with such a number~$\bar\varepsilon_{n/2}$
with which the value of $\sigma_{n/2}$ equals the previously
defined~$\sigma_n$. Then
$\bar\varepsilon_{n/2}\sim\frac{\varepsilon_n}{\sqrt 2}$,
and the number of sample points of $\bar P_{n/2}$ is less
than~$n$ with a probability very close to~1. Attaching additional
sample points to get exactly $n$ sample points we can get the
result of Example~4.3. I omit the details. \hfill$\qed$
\medskip
In formulas~(\ref{(4.4)}) and~(\ref{(4.7)}) we formulated a
condition for the validity of Theorems~4.1 and~4.2 whose lower
bound for the number~$u$ contains the large multiplying constants
$ML^{3/4}$ and $ML^{1/2}$, respectively, of
$\sigma\log^{1/2}\frac2\sigma$ if we deal with an $L_2$-dense
class of functions ${\cal F}$ which has a large exponent~$L$. At
a heuristic level
it is clear that in such a case a large multiplying constant
appears. On the other hand, I did not try to find the best
possible coefficients in the lower bound in
relations~(\ref{(4.4)}) and~(\ref{(4.7)}).
\medskip
In Theorem~4.1 (and in its version, Theorem~$4.1'$) it was
demanded that the class of functions ${\cal F}$ should be countable.
Later this condition was replaced by a weaker one about countable
approximability. By restricting our attention to countable or
countably approximable classes we could avoid some unpleasant
measure theoretical problems which would have arisen if we had
worked with the supremum of non-countably many random
variables which may be non-measurable. There are some papers
where possibly non-measurable models are also considered with
the help of some rather deep results of analysis and measure
theory. Here I chose a different approach. I proved a simple
result in the following Lemma~4.4 which enables us to show that
in many interesting problems we can restrict our attention to
countably approximable classes of random variables. In Chapter~18,
in the discussion of the content of Chapter~4 I write more about
the relation of this approach to the results of other works.
\medskip\noindent
{\bf Lemma 4.4.} {\it Let a class of random variables $U(f)$,
$f\in{\cal F}$, indexed by some set ${\cal F}$ of functions be given
on a space $(Y,{\cal Y})$. If there exists a countable subset
${\cal F}'\subset {\cal F}$ of the set ${\cal F}$ such that the sets
$A(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}}|U(f)(\omega)|\ge u\}$
and
$B(u)=\{\omega\colon\,\sup\limits_{f\in {\cal F}'}
|U(f)(\omega)|\ge u\}$
introduced
for all $u>0$ in the definition of countable approximability satisfy
the relation $A(u)\subset B(u-\varepsilon)$ for all $u>\varepsilon>0$,
then the class
of random variables $U(f)$, $f\in{\cal F}$, is countably approximable.
The above property holds if for all $f\in{\cal F}$, $\varepsilon>0$
and $\omega\in\Omega$ there exists a function
$\bar f=\bar f(f,\varepsilon,\omega)\in{\cal F}'$
such that $|U(\bar f)(\omega)|\ge|U(f)(\omega)|-\varepsilon$.}
\medskip\noindent
{\it Proof of Lemma 4.4.}\/ If $A(u)\subset B(u-\varepsilon)$ for
all $\varepsilon>0$, then
$P^*(A(u)\setminus B(u))\le \lim\limits_{\varepsilon\to0}
P(B(u-\varepsilon)\setminus B(u))=0$, where $P^*(X)$ denotes the
outer measure
of a not necessarily measurable set $X\subset\Omega$, since
$\bigcap\limits_{\varepsilon>0}B(u-\varepsilon)=B(u)$, and this is
what we had to prove.
If $\omega\in A(u)$, then for all $\varepsilon>0$ there exists some
$f=f(\omega)\in{\cal F}$ such that $|U(f)(\omega)|>u-\frac\varepsilon2$.
If there
exists some $\bar f=\bar f(f,\frac\varepsilon2,\omega)$,
$\bar f\in{\cal F}'$ such that
$|U(\bar f)(\omega)| \ge |U(f)(\omega)|-\frac\varepsilon2$,
then $|U(\bar f)(\omega)|
>u-\varepsilon$, and $\omega\in B(u-\varepsilon)$. This
means that $A(u)\subset B(u-\varepsilon)$.
\hfill$\qed$
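\medskip
To see how the criterion formulated in the second part of
Lemma~4.4 can be applied, consider the following simple example.
Let $F_n$ denote the empirical distribution function of a sample
of size~$n$ from a distribution with distribution function~$F$,
put $U(f_t)=\sqrt n\,(F_n(t)-F(t))$ with the functions
$f_t(x)=1$ if $x<t$, and $f_t(x)=0$ otherwise,
${\cal F}=\{f_t\colon\,t\in R^1\}$, and choose
${\cal F}'=\{f_t\colon\,t\textrm{ is a rational number}\}$. Since
both $F_n$ and~$F$ are continuous from the left, for all $t$,
$\varepsilon>0$ and~$\omega$ there is a rational number $\bar t<t$
such that
$|U(f_{\bar t})(\omega)|\ge|U(f_t)(\omega)|-\varepsilon$. Hence
the class of random variables $U(f_t)$, $f_t\in{\cal F}$, is
countably approximable by Lemma~4.4.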
\medskip
The question about countable approximability also appears in the
case of multiple random integrals with respect to a normalized
empirical measure. To avoid some repetition we prove a result which
also covers such cases. For this goal first we introduce the notion
of multiple integrals with respect to a normalized empirical
distribution.\index{multiple integrals with respect to a normalized
empirical distribution}
Given a measurable function $f(x_1,\dots,x_k)$ on the $k$-fold
product space $(X^k,{\cal X}^k)$ and a sequence of independent random
variables $\xi_1,\dots,\xi_n$ with some distribution $\mu$ on the
space $(X,{\cal X})$ we define the integral $J_{n,k}(f)$ of the
function $f$ with respect to the $k$-fold product of the normalized
version of the empirical distribution $\mu_n$ introduced in (\ref{(4.5)})
by the formula
\begin{eqnarray}
J_{n,k}(f)&&=\frac{n^{k/2}}{k!} \int'
f(x_1,\dots,x_k)(\mu_n(dx_1)-\mu(dx_1))\dots
(\mu_n(dx_k)-\mu(dx_k)), \nonumber \\
&&\quad\textrm{where the prime in $\int'$ means that the
diagonals } x_j=x_l,\; \nonumber\\
&&\quad 1\le j<l\le k, \textrm{ are omitted from the domain of
integration.} \label{(4.8)}
\end{eqnarray}
In the special case $k=1$ this definition yields the random
integrals $J_n(f)=J_{n,1}(f)$ defined in~(\ref{(4.6)}).
To see what kind of results the estimates of this chapter supply,
consider the following example. Let $\xi_1,\dots,\xi_n$ be
independent, identically distributed real valued random variables
with distribution function $F(x)$, and let $F_n(x)$ denote their
empirical distribution function. The fundamental theorem of
mathematical statistics states that
$\sup\limits_x|F_n(x)-F(x)|\to0$ with probability~1 as
$n\to\infty$. Take the class of functions
${\cal F}=\{f_t\colon\,-\infty<t<\infty\}$ with $f_t(x)=1$ if
$x<t$, and $f_t(x)=0$ otherwise. Then
$J_n(f_t)=\sqrt n\,(F_n(t)-F(t))$, and we want to give a good
bound on the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}|J_n(f)|>u\right)$
for $u>0$. We have
seen that the above class of functions ${\cal F}$ is countably
approximable. The results of the next chapter imply that this
class of functions is also $L_2$-dense. Let me remark that
actually it is not difficult to check this property directly.
Hence we can apply Theorem~$4.1'$ to the above defined class of
functions with $\sigma=1$, and it yields that
$P\left(n^{-1/2}\sup\limits_{f\in {\cal F}}|J_n(f)|>u\right)
\le e^{-Cnu^2}$
if $1\ge u\ge\bar Cn^{-1/2}$ with some universal constants $C>0$ and
$\bar C>0$. (The condition $1\ge u$ can actually be dropped.) An
application of this estimate with a fixed number
$u=\varepsilon>0$ for all~$n$ together with the Borel--Cantelli
lemma implies the fundamental theorem of mathematical statistics.
In short, the results of this chapter yield more information about
the closeness of the empirical distribution function $F_n$ to the
distribution function $F$ than the fundamental theorem of
mathematical statistics does. Moreover, since these results can
also be
applied for other classes of functions, they yield useful
information about the closeness of the probability measure $\mu$
to the empirical distribution~$\mu_n$.
\chapter{Vapnik--\v{C}ervonenkis classes and $L_2$-dense
classes of functions}
In this chapter the most important notions and results will be
presented about Vapnik--\v{C}ervonenkis classes, and it will be
explained how they help to show in some important cases that
certain classes of functions are $L_2$-dense. $L_2$-dense classes
of functions played an important role in the previous chapter,
and the results of this chapter may help to find interesting
classes of functions with this property. Some of the results of
this chapter will be proved in Appendix~A.
First I recall the definition of the following notion.
\medskip\noindent
{\bf Definition of Vapnik--\v{C}ervonenkis classes of sets and
functions.}\index{Vapnik--\v{C}ervonenkis classes of sets and functions}
{\it Let a set $X$ be given, and let us select a class
${\cal D}$ of subsets of this set $X$. We call
${\cal D}$ a Vapnik--\v{C}ervonenkis class if there exist two real
numbers $B$ and $K$ such that for all positive integers $n$ and
subsets $S(n)=\{x_1,\dots,x_n\}\subset X$ of cardinality $n$
of the set $X$ the collection of sets of the form $S(n)\cap D$,
$D\in{\cal D}$, contains no more than $Bn^K$ subsets of~$S(n)$.
We shall call $B$ the parameter and $K$ the exponent of this
Vapnik--\v{C}ervonenkis class.
A class of real valued functions ${\cal F}$ on a space
$(Y,{\cal Y})$ is called a Vapnik--\v{C}ervonenkis class if
the collection of graphs of these functions is a
Vapnik--\v{C}er\-vo\-nen\-kis class, i.e.\ if the sets
$A(f)=\{(y,t)\colon\, y\in Y,\;\min(0,f(y))\le t\le\max(0,f(y))\}$,
$f\in {\cal F}$, constitute a Vapnik--\v{C}er\-vo\-nen\-kis class
of subsets of the product space $X=Y\times R^1$.}
\medskip
The following result which was first proved by Sauer plays a
fundamental role in the theory of Vapnik--\v{C}er\-vo\-nen\-kis
classes. This result provides a relatively simple condition for
a class ${\cal D}$ of subsets of a set~$X$ to be a
Vapnik--\v{C}ervonenkis class. Its proof is given in Appendix~A.
Before its formulation I introduce some terminology which is
often applied in the literature.
\medskip\noindent
{\bf Definition of shattering of a set.}\index{shattering of a set}
{\it Let a set $S$ and a class ${\cal E}$ of subsets of $S$ be
given. A finite set $F\subset S$ is called shattered by the
class ${\cal E}$ if all its subsets $H\subset F$ can be written
in the form $H=E\cap F$ with some element $E\in{\cal E}$ of the
class of sets of ${\cal E}$.}
\medskip\noindent
{\bf Theorem 5.1 (Sauer's lemma).}\index{Sauer's lemma}
{\it Let a finite set $S=S(n)$ consisting of $n$ elements be given
together with a class ${\cal E}$ of subsets of $S$. If ${\cal E}$
shatters no subset of $S$ of cardinality~$k$, then ${\cal E}$
contains at most ${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$
subsets of $S$.}
\medskip
The estimate of Sauer's lemma is sharp. Indeed, if ${\cal E}$ contains
all subsets of $S$ of cardinality less than or equal to $k-1$, then
it shatters no subset $F\subset S$ of cardinality~$k$ (such a set
$F$ itself cannot be written in the form $E\cap F$ with
$E\in {\cal E}$), and ${\cal E}$ contains
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ subsets of $S$.
Sauer's lemma states that this is an extreme case. Any class of
subsets ${\cal E}$ of $S$ with cardinality greater than
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ shatters at
least one subset of~$S$ with cardinality~$k$.
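For instance, if $n=3$, $k=2$, $S=\{x_1,x_2,x_3\}$, and ${\cal E}$
consists of the empty set and the three one-point subsets of~$S$,
then ${\cal E}$ has ${3\choose0}+{3\choose1}=4$ elements, and it
shatters no two-point subset $F\subset S$, since $F$ itself cannot
be written in the form $E\cap F$ with $E\in{\cal E}$. On the other
hand, by Sauer's lemma any class of five subsets of~$S$ already
shatters some two-point subset of~$S$.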
Let us have a set $X$ and a class of subsets ${\cal D}$ of it. One may
be interested in when ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.
Sauer's lemma gives a useful condition for it. Namely, it implies
that if there exists a positive integer $k$ such that the class
${\cal D}$ shatters no subset of $X$ of cardinality~$k$,
then ${\cal D}$
is a Vapnik--\v{C}ervonenkis class. Indeed, let us take some number
$n\ge k$, fix an arbitrary set $S(n)=\{x_1,\dots,x_n\}\subset X$ of
cardinality~$n$, and introduce the class of subsets
${\cal E}={\cal E}(S(n))=\{S(n)\cap D\colon\, D\in{\cal D}\}$. If
${\cal D}$ shatters no subset of $X$ of cardinality~$k$, then ${\cal E}$
shatters no subset of $S(n)$ of cardinality~$k$. Hence by
Sauer's lemma the class ${\cal E}$ contains at most
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$ elements.
Let me remark that it is also proved that
${n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}\le1.5\frac{n^{k-1}}{(k-1)!}$
if $n\ge k+1$. This estimate gives a bound on the parameter and
exponent of a Vapnik--\v{C}ervonenkis class which satisfies the
above condition.
Moreover, Theorem~5.1 also has the following consequence. Take
an (infinite) set $X$ and a class of its subsets ${\cal D}$.
There are two possibilities. Either there is some set
$S(n)\subset X$ of cardinality $n$ for all integers $n$ such
that ${\cal E}(S(n))$ contains all subsets
of $S(n)$, i.e. ${\cal D}$ shatters this set, or
$\sup\limits_{S\colon\,S\subset X,\,|S|=n}|{\cal E}(S)|$
tends to infinity at most in a polynomial order as
$n\to\infty$, where $|S|$ and $|{\cal E}(S)|$
denote the cardinality of $S$ and ${\cal E}(S)$.
To understand why Sauer's lemma plays an important role in the
theory of Vapnik--\v{C}ervonenkis classes let us formulate the
following consequence of the above considerations.
\medskip\noindent
{\bf Corollary of Sauer's lemma.} \index{Sauer's lemma}
{\it Let a set $X$ be given together with a class
${\cal D}$ of subsets of this set $X$. This class of sets
${\cal D}$ is a Vapnik--\v{C}ervonenkis class if there exists a positive
integer~$k$ such that ${\cal D}$ shatters no subset $F\subset X$ of
cardinality~$k$. In other words if each set
$F=\{x_1,\dots,x_k\}\subset X$ of cardinality~$k$ has a subset $G\subset F$
which cannot be written in the form $G=D\cap F$ with some $D\in{\cal D}$,
then ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.}
\medskip
The following Theorem~5.2, an important result of Richard Dudley,
states that a Vapnik--\v{C}er\-vo\-nen\-kis class of functions
bounded by~1 is an $L_1$-dense class of functions.
\medskip\noindent
{\bf Theorem 5.2 (A relation between the $L_1$-dense class and
Vapnik--\v{C}er\-vo\-nen\-kis class property).}\index{relation
between $L_1$-dense and Vapnik--\v{C}ervonenkis classes}
{\it Let $f(y)$,
$f\in {\cal F}$, be a Vapnik--\v{C}ervonenkis class of real valued
functions on some measurable space $(Y,{\cal Y})$ such that
$\sup\limits_{y\in Y}|f(y)|\le1$ for all $f\in{\cal F}$.
Then ${\cal F}$ is an
$L_1$-dense class of functions on $(Y,{\cal Y})$. More explicitly, if
${\cal F}$ is a Vapnik--\v{C}ervonenkis class with parameter $B\ge1$
and exponent $K>0$, then it is an $L_1$-dense class with exponent
$L=2K$ and parameter $D=CB^2 (4K)^{2K}$ with some universal
constant~$C>0$.}
\medskip\noindent
{\it Proof of Theorem 5.2.}\/ Let us fix some probability
measure $\nu$ on $(Y,{\cal Y})$ and a real number
$0<\varepsilon\le1$. We are going to show that any finite set
${\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\cal F}$
such that $\int|f_j-f_k|\,d\nu\ge\varepsilon$ if $j\neq k$,
$f_j,f_k\in{\cal D}(\varepsilon,\nu)$ has cardinality
$M\le D\varepsilon^{-L}$ with some $D>0$ and $L>0$. This
implies that ${\cal F}$ is an $L_1$-dense class with
parameter~$D$ and exponent~$L$. Indeed, let us take a maximal
subset
$\bar{\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset{\cal F}$
such that the $L_1(\nu)$ distance of any two functions in this
subset is at least~$\varepsilon$. Maximality means in this context
that no function $f_{M+1}\in{\cal F}$ can be attached to
$\bar{\cal D}(\varepsilon,\nu)$ without violating this condition.
Thus the inequality $M\le D\varepsilon^{-L}$ means that
$\bar{\cal D}(\varepsilon,\nu)$ is an $\varepsilon$-dense subset
of~${\cal F}$ in the space $L_1(Y,{\cal Y},\nu)$
with no more than $D\varepsilon^{-L}$ elements.
In the estimation of the cardinality $M$ of a set
${\cal D}(\varepsilon,\nu)=\{f_1,\dots,f_M\}\subset {\cal F}$
with the property $\int|f_j-f_k|\,d\nu\ge\varepsilon$ if
$j\neq k$ we exploit the Vapnik--\v{C}ervonenkis class
property of ${\cal F}$ in the following way.
Let us choose relatively few $p=p(M,\varepsilon)$ points
$(y_l,t_l)$, $y_l\in Y$, $-1\le t_l\le1$, $1\le l\le p$, in the
space $Y\times [-1,1]$ in such a way that the set
$S_0(p)=\{(y_l,t_l),\,1\le l\le p\}$ and graphs
$A(f_j)=\{(y,t)\colon\, y\in Y,\;\min(0,f_j(y))
\le t\le\max(0,f_j(y))\}$,
$f_j\in{\cal D}(\varepsilon,\nu)\subset{\cal F}$ have
the property that all
sets $A(f_j)\cap S_0(p)$, $1\le j\le M$, are different. Then the
Vapnik--\v{C}ervonenkis class property of ${\cal F}$ implies that
$M\le Bp^K$. Hence if there exists a set $S_0(p)$ with the above
property and with a relatively small number $p$, then this yields a
useful estimate on $M$. Such a set $S_0(p)$ will be given by means of
the following random construction.
Let us choose the $p$ points $(y_l,t_l)$, $1\le l\le p$, of the
(random) set $S_0(p)$ independently of each other in such a way that
the coordinate $y_l$ is chosen with distribution $\nu$ on
$(Y,{\cal Y})$ and the coordinate $t_l$ with uniform distribution on
the interval $[-1,1]$ independently of $y_l$. (The number~$p$ will be
chosen later.) Let us fix some indices $1\le j,k\le M$, and estimate
from above the probability that the sets $A(f_j)\cap S_0(p)$ and
$A(f_k)\cap S_0(p)$ agree, where $A(f)$ denotes the graph of the
function~$f$. Consider the symmetric difference $A(f_j)\Delta A(f_k)$
of the sets $A(f_j)$ and $A(f_k)$. The sets
$A(f_j)\cap S_0(p)$ and $A(f_k)\cap S_0(p)$ agree if and only if
$(y_l,t_l)\notin A(f_j)\Delta A(f_k)$ for all $(y_l,t_l)\in S_0(p)$.
Let us observe that for a fixed
$l$ the estimate $P((y_l,t_l)\in A(f_j)\Delta A(f_k))
=\frac12(\nu\times\lambda)(A(f_j)\Delta A(f_k))
=\frac12\int |f_j-f_k|\,d\nu\ge\frac\varepsilon2$ holds, where
$\lambda$ denotes the Lebesgue measure. This implies that the
probability that the (random) sets $A(f_j)\cap S_0(p)$ and
$A(f_k)\cap S_0(p)$ agree can be bounded from above by
$\left(1-\frac\varepsilon2\right)^p\le e^{-p\varepsilon/2}$.
Hence the probability that all sets $A(f_j)\cap S_0(p)$ are
different is greater than
$1-{M\choose2} e^{-p\varepsilon/2}\ge1-\frac{M^2}2e^{-p\varepsilon/2}$.
Choose $p$ such that
$\frac74e^{p\varepsilon/2}>e^{(p+1)\varepsilon/2}>M^2\ge e^{p\varepsilon/2}$.
(We may assume that $M>1$, since we want to give an upper bound
on~$M$. In this case such a number $p\ge1$ exists, and the
estimate we shall give on~$M$ satisfies this inequality.) Then
the above probability is greater
than $\frac18$, and there exists some set $S_0(p)$ with
the desired property.
The inequalities $M\le Bp^K$ and $M^2\ge e^{p\varepsilon/2}$ imply
that $M\ge e^{p\varepsilon/4}\ge e^{\varepsilon M^{1/K}/4B^{1/K}}$,
i.e.\ $\frac{\log M^{1/K}}{M^{1/K}}\ge \frac\varepsilon{4KB^{1/K}}$.
As $\frac{\log M^{1/K}}{M^{1/K}}\le CM^{-1/(2K)}$
for $M\ge1$ with some universal constant $C>0$, this estimate
implies that Theorem 5.2 holds with the exponent~$L$ and
parameter~$D$ given in its formulation.
\hfill$\qed$
\medskip
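The choice of the number~$p$ in the above proof can also be illustrated numerically. The following sketch (with hypothetical values of $M$ and $\varepsilon$, chosen only for the illustration) computes the largest $p$ with $e^{p\varepsilon/2}\le M^2$ and checks that the resulting bound $\frac{M^2}2e^{-p\varepsilon/2}$ on the probability of failure is indeed smaller than~$\frac78$:

```python
import math

# Sketch with hypothetical parameters: p is the largest integer with
# exp(p*eps/2) <= M^2, i.e. p = floor(4*log(M)/eps).  With this p the
# union bound (M^2/2)*exp(-p*eps/2) on the probability that two sets
# A(f_j) cap S_0(p) coincide is smaller than 7/8, hence a good set
# S_0(p) exists with probability greater than 1/8.
def choose_p(M, eps):
    p = math.floor(4 * math.log(M) / eps)
    failure_bound = (M * M / 2) * math.exp(-p * eps / 2)
    return p, failure_bound

p, bound = choose_p(M=1000, eps=0.1)
print(p, bound)
```

The existence of a set $S_0(p)$ with the desired property follows whenever the printed bound is below $\frac78$.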
Let us observe that if ${\cal F}$ is an $L_1$-dense class of
functions on a measure space $(Y,{\cal Y})$ with some
exponent~$L$ and parameter~$D$, and also the inequality
$\sup\limits_{y\in Y}|f(y)|\le1$ holds for all $f\in{\cal F}$,
then ${\cal F}$ is an $L_2$-dense class of functions
with exponent $2L$ and parameter $D2^L$. Indeed, if we fix some
probability measure $\nu$ on $(Y,{\cal Y})$ together with a number
$0<\varepsilon\le1$, and
${\cal D}(\varepsilon,\nu)=\{f_1,\dots, f_M\}$ is an
$\frac{\varepsilon^2}2$-dense subset of ${\cal F}$ in the
space $L_1(Y,{\cal Y},\nu)$ with
$M\le2^L D \varepsilon^{-2L}$ elements, then for each function
$f\in {\cal F}$ some function $f_j\in{\cal D}(\varepsilon,\nu)$ can
be chosen in such a way that
$\int(f-f_j)^2\,d\nu\le2\int|f-f_j|\,d\nu\le\varepsilon^2$. This
implies that ${\cal F}$ is an $L_2$-dense class with the given
exponent and parameter.
It is not easy to check whether a collection of subsets ${\cal D}$
of a set $X$ is a Vapnik--\v{C}ervonenkis class even with the help
of Theorem~5.1. Therefore the following Theorem~5.3 which enables
us to construct many non-trivial Vapnik--\v{C}ervonenkis classes
is of special interest. Its proof is given in Appendix~A.
\medskip\noindent
{\bf Theorem 5.3 (A way to construct Vapnik--\v{C}ervonenkis classes).}
{\it Let us consider a $k$-dimensional subspace ${\cal G}_k$ of the
linear space of real valued functions defined on a set $X$, and
define the level-set $A(g)=\{x\colon\, x\in X,\,g(x)\ge0\}$ for
all functions $g\in{\cal G}_k$. Take the class of subsets
${\cal D}=\{A(g)\colon\, g\in {\cal G}_k\}$ of the set $X$ consisting
of the above introduced level sets. No subset $S=S(k+1)\subset X$ of
cardinality $k+1$ is shattered by ${\cal D}$. Hence by Theorem~5.1
${\cal D}$ is a Vapnik--\v{C}ervonenkis class of subsets of~$X$.}
\medskip
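The idea of the proof of Theorem~5.3, which is worked out in detail in Appendix~A, can be sketched in a few lines (the argument below is only an outline):

```latex
Take $k+1$ points $x_1,\dots,x_{k+1}\in X$. The vectors
$(g(x_1),\dots,g(x_{k+1}))$, $g\in{\cal G}_k$, span a subspace of
$R^{k+1}$ of dimension at most~$k$, hence there exists a vector
$\gamma=(\gamma_1,\dots,\gamma_{k+1})\ne0$ with
$$
\sum_{i=1}^{k+1}\gamma_i\,g(x_i)=0
\quad\textrm{for all } g\in{\cal G}_k,
$$
and we may assume that $P=\{i\colon\,\gamma_i>0\}\ne\emptyset$.
If a set $A(g)$ satisfied $A(g)\cap S=\{x_i\colon\, i\in P\}$, then
every term $\gamma_i\,g(x_i)$ of this sum would be non-negative, and
strictly positive for each index~$i$ with $\gamma_i<0$. Hence
$\gamma_i\ge0$ for all~$i$, and then a function $g\in{\cal G}_k$
with $A(g)\cap S=\emptyset$ would make the sum strictly negative.
In both cases the identity above is violated, so the set $S$ is not
shattered by~${\cal D}$.
```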
Theorem~5.3 enables us to construct interesting
Vapnik--\v{C}ervonenkis classes. Thus for instance the class of
all half-spaces in a Euclidean space, the class of all ellipses in
the plane, or more generally the level sets of algebraic
functions of $p$ variables of order~$k$, with a fixed number~$k$, constitute a
Vapnik--\v{C}ervonenkis class in the $p$-dimensional Euclidean
space~$R^p$. It can be proved that if ${\cal C}$ and ${\cal D}$
are Vapnik--\v{C}ervonenkis classes of subsets of a set $S$, then
also their intersection
${\cal C}\cap{\cal D}=\{C\cap D\colon\,C\in{\cal C},\,D\in{\cal D}\}$,
their union
${\cal C}\cup {\cal D}=\{C\cup D\colon\, C\in{\cal C},\,D\in{\cal D}\}$
and the class of complements ${\cal C}^c
=\{S\setminus C\colon\, C\in{\cal C}\}$
are Vapnik--\v{C}ervonenkis classes. These results are less
important for us, and their proofs will be omitted. We are
interested in Vapnik--\v{C}ervonenkis classes not for their own
sake. We are going to find $L_2$-dense classes of functions, and
Vapnik--\v{C}ervonenkis classes help us in this. Indeed, Theorem~5.2
implies that if ${\cal D}$ is a Vapnik--\v{C}ervonenkis class of
subsets of a set $S$, then the indicator functions of its elements
constitute a Vapnik--\v{C}ervonenkis class of functions, and as a
consequence an $L_1$-dense, hence also an $L_2$-dense class of
functions. Then
the results of Lemma~5.4 formulated below enable us to construct
new $L_2$-dense classes of functions.
\medskip\noindent
{\bf Lemma 5.4 (Some useful properties of $L_2$-dense classes).}
{\it Let ${\cal G}$ be an $L_2$-dense class of functions
on some space $(Y,{\cal Y})$ whose absolute values are bounded
by one, and let $f$ be a function on $(Y,{\cal Y})$ also with
absolute value bounded by one. Then
$f\cdot{\cal G}=\{f\cdot g\colon\, g\in{\cal G}\}$ is also an
$L_2$-dense class of functions. Let ${\cal G}_1$ and
${\cal G}_2$ be two $L_2$-dense classes of functions on some
space $(Y,{\cal Y})$ whose absolute values are
bounded by one. Then the classes of functions
${\cal G}_1+{\cal G}_2=\{g_1+g_2\colon\,
g_1\in{\cal G}_1,\,g_2\in{\cal G}_2\}$,
${\cal G}_1\cdot{\cal G}_2
=\{g_1g_2\colon\, g_1\in{\cal G}_1,\,g_2\in{\cal G}_2\}$,
$\min({\cal G}_1,{\cal G}_2)
=\{\min(g_1,g_2)\colon\, g_1\in{\cal G}_1,\,g_2\in
{\cal G}_2\}$, $\max({\cal G}_1,{\cal G}_2)
=\{\max(g_1,g_2)\colon\, g_1\in
{\cal G}_1,\,g_2\in{\cal G}_2\}$ are also $L_2$-dense.
If ${\cal G}$ is an
$L_2$-dense class of functions, and ${\cal G}'\subset{\cal G}$,
then ${\cal G}'$ is also an $L_2$-dense class.}
\medskip\noindent
The proof of Lemma 5.4 is rather straightforward. One has to observe
for instance that if $g_1,\bar g_1\in{\cal G}_1$,
$g_2,\bar g_2\in{\cal G}_2$ then $|\min(g_1,g_2)-\min(\bar g_1,\bar g_2)|
\le |g_1-\bar g_1|+|g_2-\bar g_2|$, hence if
$g_{1,1},\dots,g_{1,M_1}$ is an $\frac\varepsilon2$-dense
subset of ${\cal G}_1$
and $g_{2,1},\dots,g_{2,M_2}$ is an $\frac\varepsilon2$-dense
subset of ${\cal G}_2$ in the space $L_2(Y,{\cal Y},\nu)$ with
some probability measure
$\nu$, then the functions $\min(g_{1,j},g_{2,k})$, $1\le j\le M_1$,
$1\le k\le M_2$ constitute an $\varepsilon$-dense subset of
$\min({\cal G}_1,{\cal G}_2)$ in $L_2(Y,{\cal Y},\nu)$. The last
statement of Lemma~5.4 was proved after the Corollary of
Theorem~4.1. The details are left to the reader.
\hfill $\qed$
\medskip
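The case of the class ${\cal G}_1\cdot{\cal G}_2$ can be handled with the same kind of bookkeeping. The following computation, in which $L_1,L_2$ and $D_1,D_2$ denote the exponents and parameters of ${\cal G}_1$ and ${\cal G}_2$ (this notation is introduced only for the sake of this sketch), even yields explicit constants:

```latex
$$
|g_1g_2-\bar g_1\bar g_2|
\le|g_1-\bar g_1|\,|g_2|+|\bar g_1|\,|g_2-\bar g_2|
\le|g_1-\bar g_1|+|g_2-\bar g_2|,
$$
hence by Minkowski's inequality the products $g_{1,j}g_{2,k}$ of the
elements of an $\frac\varepsilon2$-dense subset of ${\cal G}_1$ and
of an $\frac\varepsilon2$-dense subset of ${\cal G}_2$ in
$L_2(Y,{\cal Y},\nu)$ constitute an $\varepsilon$-dense subset of
${\cal G}_1\cdot{\cal G}_2$ with
$$
M_1M_2\le D_1\left(\frac\varepsilon2\right)^{-L_1}
D_2\left(\frac\varepsilon2\right)^{-L_2}
\le 2^{L_1+L_2}D_1D_2\,\varepsilon^{-(L_1+L_2)}
$$
elements, i.e.\ ${\cal G}_1\cdot{\cal G}_2$ is an $L_2$-dense class
with exponent $L_1+L_2$ and parameter $2^{L_1+L_2}D_1D_2$.
```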
The above results enable us to construct some $L_2$-dense classes of
functions. We give an example in the following Example~5.5,
which is a consequence of Theorem~5.2 and Lemma~5.4.
\medskip\noindent
{\bf Example 5.5.} {\it Take $m$ measurable functions $f_j(x)$,
$1\le j\le m$, on a measurable space $(X,{\cal X})$ which
have the property $\sup\limits_{x\in X}|f_j(x)|\le1$ for all
$1\le j\le m$. Let ${\cal D}$ be a Vapnik--\v{C}ervonenkis class
consisting of measurable subsets of the set $X$. Define for all
pairs $(f_j,D)$, $1\le j\le m$, $D\in{\cal D}$, the
function $f_{j,D}(\cdot)$ as $f_{j,D}(x)=f_j(x)$ if $x\in D$, and
$f_{j,D}(x)=0$ if $x\notin D$, i.e.\ $f_{j,D}(\cdot)$ is the
restriction of the function $f_j(\cdot)$ to the set~$D$. Then the
set of functions
${\cal F}=\{f_{j,D}\colon\; 1\le j\le m,\; D\in{\cal D}\}$
is $L_2$-dense.}
\medskip
Beside this, Theorem~5.3 helps us to construct
Vapnik--\v{C}ervonenkis classes of sets. Let me also remark that it
follows from the result of this chapter that the random variables
considered in Lemma~4.5 are not only countably approximable, but
the class of functions $f_{u_1,\dots,u_k,v_1,\dots,v_k}$
appearing in their definition is $L_2$-dense.
\chapter{The proof of Theorems 4.1 and 4.2 on the
supremum of random sums}
In this chapter we prove Theorem~4.2, an estimate about the tail
distribution of the supremum of an appropriate class of Gaussian
random variables with the help of a method, called the chaining
argument. We also investigate the proof of Theorem~4.1 which can
be considered as a version of Theorem~4.2 about the supremum of
partial sums of independent and identically distributed random
variables. The chaining argument is not a strong enough method
to prove Theorem~4.1, but it enables us to prove a weakened form
of it formulated in Proposition~6.1. This result turned out to
be useful in the proof of Theorem~4.1. It enables us to reduce
the proof of Theorem~4.1 to a simpler statement formulated in
Proposition~6.2. In this chapter we prove Proposition~6.1,
formulate Proposition~6.2, and reduce the proof of Theorem~4.1
with the help of Proposition~6.1 to this result. The proof of
Proposition~6.2 which demands different arguments is postponed
to the next chapter. Before presenting the proofs I briefly
describe the chaining argument.\index{chaining argument}
Let us consider a countable class of functions ${\cal F}$ on a
probability space $(X,{\cal X},\mu)$ which is $L_2$-dense with
respect to the probability measure~$\mu$. Let us have either a
class of Gaussian random variables $Z(f)$ with zero
expectation such that $EZ(f)Z(g)=\int f(x)g(x)\mu(\,dx)$,
$f,g\in{\cal F}$, or a set of normalized partial sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in{\cal F}$,
where $\xi_1,\dots,\xi_n$ is a sequence of independent $\mu$
distributed random variables with values in the space
$(X,{\cal X})$, and assume that $Ef(\xi_j)=0$ for all
$f\in{\cal F}$. We want to get a good estimate on the
probability $P\left(\sup\limits_{f\in{\cal F}}Z(f)>u\right)$ or
$P\left(\sup\limits_{f\in{\cal F}}S_n(f)>u\right)$ if the class of
functions~${\cal F}$ has some nice properties. The chaining
argument suggests to prove such an estimate in the following way.
Let us try to find an appropriate sequence of subsets
${\cal F}_1\subset{\cal F}_2\subset\cdots\subset{\cal F}$ such that
$\bigcup\limits_{N=1}^\infty{\cal F}_N={\cal F}$, ${\cal F}_N$ is
such a set of functions from ${\cal F}$ with relatively few
elements for which
$\inf\limits_{f\in{\cal F}_N}\int (f-\bar f)^2\,d\mu\le\delta_N^2$
with an appropriately chosen number $\delta_N$ for all functions
$\bar f\in{\cal F}$, and let us give a good estimate on the
probability $P\left(\sup\limits_{f\in{\cal F}_N}Z(f)>u_N\right)$ or
$P\left(\sup\limits_{f\in{\cal F}_N}S_n(f)>u_N\right)$
for all $N=1,2,\dots$
with an appropriately chosen monotone increasing sequence $u_N$
such that $\lim\limits_{N\to\infty} u_N=u$.
We can get a relatively good estimate under appropriate conditions
for the class of functions~${\cal F}$ by choosing the classes of
functions ${\cal F}_N$ and numbers $\delta_N$ and $u_N$ in an
appropriate way. We try to bound the difference of the probabilities
$$
P\left(\sup_{f\in{\cal F}_{N+1}}Z(f)>u_{N+1}\right)
-P\left(\sup_{f\in{\cal F}_N}Z(f)>u_N\right)
$$
or of the analogous difference if $Z(f)$ is replaced by $S_n(f)$.
For the sake of completeness define this difference also in the
case $N+1=1$, i.e.\ $N=0$, with the choice ${\cal F}_0=\emptyset$,
when the second probability in this difference equals zero.
The above mentioned difference of probabilities can be estimated
in a natural way by taking for all functions
$f_{j_{N+1}}\in{\cal F}_{N+1}$ a function
$f_{j_N}\in{\cal F}_N$ which is close to it, more explicitly
$\int (f_{j_{N+1}}-f_{j_N})^2\,d\mu\le\delta_N^2$, and
calculating the probability that the difference of the random
variables corresponding to these two functions is greater than
$u_{N+1}-u_N$. We can estimate these probabilities with the help
of some results which give a relatively good bound on the tail
distribution of $Z(g)$ or $S_n(g)$ if $\int g^2\,d\mu$ is small.
The sum of all such probabilities gives an upper bound for the
above considered difference of probabilities. Then we get an
estimate for the probability
$P\left(\sup\limits_{f\in{\cal F}_N}Z(f)>u_N\right)$
for all $N=1,2,\dots$,
by summing up the above estimate, and we get a bound on the
probability we are interested in by taking the limit
$N\to\infty$. This method is called the chaining argument. It
got this name, because we estimate the contribution of a random
variable corresponding to a function
$f_{j_{N+1}}\in{\cal F}_{N+1}$ to the bound of the probability we
investigate by taking the random variable corresponding to a
function $f_{j_N}\in{\cal F}_N$ close to it, then we choose
another random variable corresponding to a function
$f_{j_{N-1}}\in{\cal F}_{N-1}$ close to this function, and by
continuing this procedure we take a chain of subsequent functions
and the random variables corresponding to them.
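In formulas the chain looks as follows. If $f=f_{j_{N^*}}\in{\cal F}_{N^*}$ for some index $N^*$, and $f_{j_N}\in{\cal F}_N$ denotes the function chosen for $f_{j_{N+1}}$, then (schematically)

```latex
$$
Z(f)=Z(f_{j_1})
+\sum_{N=1}^{N^*-1}\left[Z(f_{j_{N+1}})-Z(f_{j_N})\right],
$$
hence, because of the identity
$u_{N^*}=u_1+\sum\limits_{N=1}^{N^*-1}(u_{N+1}-u_N)$,
$$
\{Z(f)>u_{N^*}\}\subset\{Z(f_{j_1})>u_1\}
\cup\bigcup_{N=1}^{N^*-1}
\{Z(f_{j_{N+1}})-Z(f_{j_N})>u_{N+1}-u_N\},
$$
and the probability of the event on the left-hand side can be
bounded by the sum of the probabilities of the events on the
right-hand side.
```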
First we show how this method supplies the proof of Theorem~4.2.
Then we turn to the investigation of Theorem~4.1. In the study of
this problem the above method does not work well, because if two
functions are very close to each other in the $L_2(\mu)$-norm,
then the Bernstein inequality (or an improvement of it) supplies
a much weaker estimate for the difference of the partial sums
corresponding to these two functions than the bound suggested
by the central limit theorem. On the other hand, we shall prove
a weaker version of Theorem~4.1 in Proposition~6.1 with the help
of the chaining argument. This result will be also useful for us.
\medskip\noindent
{\it Proof of Theorem 4.2.}\/\index{estimate on the supremum of
a class of Gaussian random variables} Let us list the elements
of ${\cal F}$ as $\{f_0,f_1,\dots\}={\cal F}$, and choose for all
$p=0,1,2,\dots$ a set of functions
${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset{\cal F}$
with $m_p\le (D+1)\,2^{2pL}\sigma^{-L}$ elements in such a way that
$\inf\limits_{1\le j\le m_p}
\int (f-f_{a(j,p)})^2\,d\mu\le 2^{-4p}\sigma^2$
for all $f\in{\cal F}$, and let the set ${\cal F}_p$ contain also
the function~$f_p$. (We imposed the condition $f_p\in{\cal F}_p$
to guarantee that the relation $f\in{\cal F}_p$ holds with some
index~$p$ for all $f\in{\cal F}$. We could do this by slightly
enlarging the upper bound we can give for the number~$m_p$ by
replacing the factor~$D$ by~$D+1$ in it.) For all indices
$a(j,p)$ of the functions in ${\cal F}_p$, \ $p=1,2,\dots$, define a
predecessor $a(j',p-1)$ from the indices of the set of functions
${\cal F}_{p-1}$ in such a way that the functions $f_{a(j,p)}$ and
$f_{a(j',p-1)}$ satisfy the relation
$\int(f_{a(j,p)}-f_{a(j',p-1)})^2\,d\mu\le2^{-4(p-1)}\sigma^2$.
With the help of the behaviour of the standard normal distribution
function we can write the estimates
\begin{eqnarray*}
P(A(j,p))&&=P\left(|Z(f_{a(j,p)})-Z(f_{a(j',p-1)})|
\ge 2^{-(1+p)}u\right)\\
&&\le 2\exp\left\{-\frac{2^{-2(p+1)}u^2}{2\cdot 2^{-4(p-1)}\sigma^2}
\right\}
=2\exp\left\{-\frac{2^{2p}u^2}{128\sigma^2}\right\}\\
&&\qquad 1\le j\le m_p,\; p=1,2,\dots,
\end{eqnarray*}
and
$$
P(B(j))=P\left(|Z(f_{a(j,0)})|\ge \frac u2\right)\le
2\exp\left\{-\frac {u^2}{8\sigma^2}\right\},
\quad 1\le j\le m_0.
$$
The above estimates together with the relation
$\bigcup\limits_{p=0}^\infty
{\cal F}_p={\cal F}$ which implies that \hfill\break
$\{|Z(f)|\ge u\}\subset\bigcup\limits_{p=1}^\infty
\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$ for all $f\in{\cal F}$ yield that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}} |Z(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \\
&&\qquad \le \sum_{p=1}^{\infty} 2(D+1)2^{2pL}
\sigma^{-L} \exp\left\{-\frac{2^{2p}u^2}{128\sigma^2} \right\}
+2(D+1)\sigma^{-L} \exp\left\{-\frac {u^2}{8\sigma^2}\right\}.
\end{eqnarray*}
If $u\ge ML^{1/2}\sigma\log^{1/2}\frac2\sigma$ with $M\ge16$ (and
$L\ge1$ and $0<\sigma\le1$), then
$$
2^{2pL}\sigma^{-L}\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2}
\right\}\le2^{2pL}\sigma^{-L}\left(\frac\sigma
2\right)^{2^{2p}M^2L/256}\le 2^{-pL}\le2^{-p}
$$
for all $p=0,1,\dots$, hence the previous inequality implies that
\begin{eqnarray*}
P\left(\sup_{f\in{\cal F}}|Z(f)|\ge u\right)
&\le& 2(D+1)\sum\limits_{p=0}^\infty 2^{-p}
\exp\left\{-\frac{2^{2p}u^2}{256\sigma^2} \right\} \\
&\le&4(D+1) \exp\left\{-\frac{u^2}{256\sigma^2} \right\}.
\end{eqnarray*}
Theorem 4.2 is proved.
\hfill$\qed$
\medskip
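The last summation step in the proof only used that $e^{-2^{2p}x}\le e^{-x}$ for $x>0$ and that $\sum\limits_{p=0}^\infty 2^{-p}=2$, applied with $x=\frac{u^2}{256\sigma^2}$. A small numerical sketch (illustrative only, with a truncated series):

```python
import math

# Numerical illustration of the bound
#   sum_{p>=0} 2^{-p} * exp(-4^p * x) <= 2 * exp(-x)   for x > 0,
# used at the end of the proof of Theorem 4.2 with x = u^2/(256 sigma^2).
def chain_sum(x, terms=60):
    # the tail of the truncated series is negligible for x > 0
    return sum(2.0 ** -p * math.exp(-4.0 ** p * x) for p in range(terms))

for x in (0.01, 0.5, 3.0):
    print(x, chain_sum(x), 2.0 * math.exp(-x))
```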
With an appropriate choice of the bound of the integrals in the
definition of the sets ${\cal F}_p$ in the proof of Theorem~4.2 and
some additional calculation it can be proved that the coefficient
$\frac1{256}$ in the exponent on the right-hand side of~(\ref{(4.7)})
can be replaced by $\frac{1-\varepsilon}2$ with arbitrarily small
$\varepsilon>0$ if the remaining (universal) constants in this
estimate are chosen sufficiently large.
The proof of Theorem 4.2 was based on a sufficiently good estimate on
the probabilities $P(|Z(f)-Z(g)|>u)$ for pairs of functions
$f,g\in{\cal F}$ and numbers $u>0$. In the case of Theorem~4.1 only a
weaker bound can be given for the corresponding probabilities. There
is no good estimate on the tail distribution of the difference
$S_n(f)-S_n(g)$ if its variance is small. As a consequence, the
chaining argument supplies only a weaker result in this case. This
result, where the tail distribution of the supremum of the normalized
random sums $S_n(f)$ is estimated on a relatively dense subset of the
class of functions $f\in{\cal F}$ in the $L_2(\mu)$ norm will
be given in Proposition~6.1. Another result will be formulated in
Proposition~6.2 whose proof is postponed to the next chapter. It will
be shown that Theorem~4.1 follows from Propositions~6.1 and~6.2.
Before the formulation of Proposition~6.1 I recall an estimate which
is a simple consequence of Bernstein's inequality. If
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$ is the
normalized sum of
independent, identically distributed random variables, $P(|f(\xi_1)|\le1)=1$,
$Ef(\xi_1)=0$, $Ef(\xi_1)^2\le\sigma^2$, then there exists some
constant $\alpha>0$ such that
\begin{equation}
P(|S_n(f)|>u)\le 2e^{-\alpha u^2/\sigma^2}
\quad \textrm{if}\quad 0<u\le\sqrt n\,\sigma^2. \label{(6.1)}
\end{equation}
In Proposition~6.1 we give an estimate on the probability
$P\left(\sup\limits_{f\in{\cal F}_{\bar\sigma}}|S_n(f)|
>\frac u{\bar A}\right)$
with some parameter~$\bar A>1$, where ${\cal F}_{\bar\sigma}$ is
an appropriate finite subset of a set of functions~${\cal F}$
satisfying the conditions of Theorem~4.1. (We introduced the
number~$\bar A$ because of some technical reasons. We can
formulate with its help such a result which simplifies the
reduction of the proof of Theorem~4.1 to the proof of another
result formulated in Proposition~6.2.) We cannot give a good
estimate for the above probability for all $u>0$, we can do
this only for such numbers~$u$ which are in an appropriate
interval depending on the parameter~$\sigma$ appearing in
condition~(\ref{(4.2)}) of Theorem~4.1 and the
parameter~$\bar A$ we chose in Proposition~6.1. This fact may
explain why we could prove the estimate of Theorem~4.1 only
for such numbers~$u$ which satisfy the condition imposed in
formula~(\ref{(4.4)}). The choice of the set of functions
${\cal F}_{\bar\sigma}\subset{\cal F}$ depends on the number~$u$
appearing in the probability we want to estimate. It is such a
subset of relatively small cardinality of ${\cal F}$ whose
$L_2(\mu)$-norm distance from all elements of ${\cal F}$ is less
than $\bar\sigma=\bar\sigma(u)$ with an appropriately defined
number $\bar\sigma(u)$. With the help of Proposition~6.1 we want
to reduce the proof of Theorem~4.1 to a result formulated in the
subsequent Proposition~6.2. To do this we still need
an upper bound on the cardinality of ${\cal F}_{\bar\sigma}$ and
some upper and lower bounds on the value of $\bar\sigma(u)$.
In Proposition~6.1 we shall formulate such results, too.
\index{estimate on the supremum of a class of partial sums}
\medskip\noindent
{\bf Proposition 6.1.} {\it Let us have a countable, $L_2$-dense
class of functions ${\cal F}$ with parameter $D\ge1$ and
exponent~$L\ge1$ with respect to some probability measure~$\mu$
on a measurable space $(X,{\cal X})$ whose elements satisfy
relations~(\ref{(4.1)}), (\ref{(4.2)})
and~(\ref{(4.3)}) with this probability measure $\mu$ on
$(X,{\cal X})$ and some real number $0<\sigma\le1$. Take
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and define the normalized random
sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{l=1}^nf(\xi_l)$, for all
$f\in {\cal F}$. Let us fix some number $\bar A\ge1$. There exists
some number $M=M(\bar A)$ such that with these parameters~$\bar A$
and~$M=M(\bar A)\ge1$ the following relations hold.
For all numbers $u>0$ such that
$n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M(L\log\frac2\sigma+\log D)$ a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$, and a
collection of functions
${\cal F}_{\bar\sigma}=\{f_1,\dots,f_m\}\subset{\cal F}$ with
$m\le D\bar\sigma^{-L}$ elements can be chosen in such a way that
the union of the sets
${\cal D}_j=\{f\colon\, f\in {\cal F},\int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, covers the set of functions
${\cal F}$, i.e. $\bigcup\limits_{j=1}^m{\cal D}_j={\cal F}$, and
the normalized random sums $S_n(f)$, $f\in{\cal F}_{\bar\sigma}$,
$n\ge2$, satisfy the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}} |S_n(f)|
\ge\frac u{\bar A}\right)
\le 4\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}
\nonumber \\
&&\qquad \textrm{under the condition } n\sigma^2\ge(\tfrac u\sigma)^2
\ge M(L\log\tfrac2\sigma+\log D) \label{(6.2)}
\end{eqnarray}
with the constant $\alpha$ appearing in formula~(\ref{(6.1)}) and the
exponent~$L$ and parameter $D$ of the $L_2$-dense class ${\cal F}$.
The inequality $\frac1{16}(\frac u{\bar A\bar\sigma})^2\ge n\bar\sigma^2
\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^2$ also holds with
the number~$\bar\sigma=\bar\sigma(u)$. If the number~$u$ satisfies
also the inequality
\begin{equation}
n\sigma^2\ge \left(\frac u\sigma\right)^2
\ge M\left(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}\right) \label{(6.3)}
\end{equation}
with a sufficiently large number $M=M(\bar A)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too.}
\medskip\noindent
{\it Remark.}\/ Under the conditions $L\ge1$ and $D\ge1$ of
Proposition~6.1 the condition formulated in relation~(\ref{(6.3)})
(with a sufficiently large number $M=M(\bar A)$) is stronger than
the condition $(\frac u\sigma)^2\ge M(L\log\frac2\sigma+\log D)$
imposed in formula~(\ref{(6.2)}). To see this observe that
$(\log D)^{3/2}\le\log D$ if $\log D\le1$, but this effect can be
compensated by choosing a sufficiently large parameter~$M$ in
formula~(\ref{(6.3)}) and exploiting that $L\log\frac2\sigma\ge\log 2$.
\medskip
Proposition~6.1 helps to reduce the proof of Theorem~4.1 to the
case when such classes of functions ${\cal F}$ are considered
whose elements have $L_2$-norm bounded by
a relatively small number $\bar\sigma$. In more detail, the
proof of Theorem~4.1 can be reduced to a good estimate on the
distribution of the supremum of random variables
$\sup\limits_{f\in {\cal D}_j}|S_n(f-f_j)|$ for all classes ${\cal D}_j$,
$1\le j\le m$, by means of Proposition~6.1. To carry out such a
reduction we also need the inequality $n\bar\sigma^2\ge L\log n+\log D$
(or a slightly weaker version of it). This is the reason why we
have finished Proposition~6.1 with the statement that this
inequality holds under the condition~(\ref{(6.3)}). We also
have to know that the number~$m$ of the classes ${\cal D}_j$
is not too large. Beside this, we need some estimates on the
number $\bar\sigma=\bar\sigma(u)$ which is an upper bound for
the $L_2$-norm of the functions $f-f_j$, $f\in{\cal D}_j$. To
get such bounds for $\bar\sigma$ that we need in the
applications of Proposition~6.1 we introduced a large
parameter~$\bar A$ in the formulation of Proposition~6.1
and imposed a condition with a sufficiently large
number~$M=M(\bar A)$ in formula~(\ref{(6.3)}). This condition
reappears in Theorem~4.1 in the conditions of the
estimate~(\ref{(4.4)}).
Let me remark that one of the inequalities the number
$\bar\sigma$ introduced in Proposition~6.1 satisfies has
the consequence $u>\textrm{const.}\,\sqrt n\bar\sigma^2$
with an appropriate constant. Hence to complete the proof
of Theorem~4.1 we have to estimate the probability
$P\left(\sup\limits_{f\in{\cal F}} |S_n(f)|>u\right)$ also
in such cases when the $L_2$ norm of the functions
in~${\cal F}$ is bounded with such a number~$\bar\sigma$
for which $u>\textrm{const.}\,\sqrt n\bar\sigma^2$. On the
other hand, we got an estimate in Proposition~6.1 if
$u\le\sqrt n\sigma^2$ (see formula~(\ref{(6.2)})), and this
is an inequality in the opposite direction. Hence to
complete the proof of Theorem~4.1 with the help of
Proposition~6.1 we need a result whose proof demands an
essentially different method. Proposition~6.2 formulated below
is such a result. I shall show that Theorem~4.1 is a consequence
of Propositions~6.1 and~6.2. Proposition~6.1 is proved at the
end of this chapter, while the proof of Proposition~6.2 is
postponed to the next chapter.
\medskip\noindent
{\bf Proposition 6.2.}\index{estimate on the supremum of a class
of partial sums} {\it Let us have a probability measure $\mu$
on a measurable space $(X,{\cal X})$ together with a sequence of
independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge2$, and a countable, $L_2$-dense class of
functions $f=f(x)$ on $(X,{\cal X})$ with some parameter $D\ge1$ and
exponent $L\ge1$ which satisfies conditions~(\ref{(4.1)}),
(\ref{(4.2)}) and~(\ref{(4.3)})
with some $0<\sigma\le1$ such that the inequality
$n\sigma^2>L\log n+\log D$ holds. Then there exists
a threshold index $A_0\ge5$ such that the normalized random sums
$S_n(f)$, $f\in {\cal F}$, introduced in Theorem~4.1 satisfy the
inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge A n^{1/2}\sigma^2\right)\le
e^{-A^{1/2}n\sigma^2/2}\quad \textrm{if } A\ge A_0. \label{(6.4)}
\end{equation}
}
\medskip
I did not try to find optimal parameters in formula~(\ref{(6.4)}).
Even the coefficient $-A^{1/2}$ in the exponent at its right-hand
side could be improved. The result of Proposition~6.2 is similar
to that of Theorem~4.1. Both of them give an estimate on a
probability of the form
$P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|\ge u\right)$ with
some class of functions~${\cal F}$. The essential difference
between them is that in Theorem~4.1 this probability is considered
for $u\le n^{1/2}\sigma^2$ while in Proposition~6.2 the case
$u=A n^{1/2}\sigma^2$ with $A\ge A_0$ is taken, where $A_0$ is a
sufficiently large positive number. Let us observe that in this
case no good Gaussian type estimate can be given for the
probabilities $P(S_n(f)\ge u)$, $f\in{\cal F}$. In this case
Bernstein's inequality yields the bound
$P(S_n(f)>An^{1/2}\sigma^2)=
P\left(\sum\limits_{l=1}^nf(\xi_l)>An\sigma^2\right)
\le\exp\left\{-\frac{A^2n\sigma^2}{2(1+A/3)}\right\}$, which is
only of order $e^{-\textrm{const.}\,An\sigma^2}$ for large~$A$, and
not a Gaussian type bound of order
$e^{-\textrm{const.}\,A^2n\sigma^2}$.
\medskip\noindent
{\it Proof of Proposition 6.1.}\/ Let us adapt the chaining
argument of the proof of Theorem~4.2 to the normalized random sums
$S_n(f)$, $f\in{\cal F}$, with the sets of functions
${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}$, the predecessors
$a(j',p-1)$ and the events $A(j,p)$, $B(s)$ defined as there, with
$\frac u{\bar A}$ in the role of~$u$ and the Bernstein type
estimate~(\ref{(6.1)}) in the role of the Gaussian bound. Let us
choose the largest number $R\ge1$ for which
$$
n\sigma^2\ge2^{6R}\left(\frac{u}{16\bar A\sigma}\right)^2,
$$
define $\bar\sigma^2=2^{-4R}\sigma^2$ and
${\cal F}_{\bar\sigma}={\cal F}_R$.
(As $n\sigma^2\ge\left(\frac u\sigma\right)^2$ and $\bar A\ge1$
by our conditions, there exists such a number $R\ge1$. The
number~$R$ was chosen as the largest number~$p$ for which this
relation holds.) Then the
cardinality~$m$ of the set ${\cal F}_{\bar\sigma}$ equals
$m_R\le D2^{2RL}\sigma^{-L}
=D\bar\sigma^{-L}$, and the sets ${\cal D}_j$ are
${\cal D}_j=\{f\colon\, f\in{\cal F},\int (f_{a(j,R)}-f)^2\,d\mu\le
2^{-4R}\sigma^2\}$, $1\le j\le m_R$, hence $\bigcup\limits_{j=1}^m
{\cal D}_j={\cal F}$. Beside this, with our choice of the number $R$
the estimates on the probabilities $P(A(j,p))$ and $P(B(s))$ can be
applied for $1\le p\le R$.
Hence the definition of the predecessor of an index $(j,p)$ implies
that
$\left\{\omega\colon\,\sup\limits_{f\in{\cal F}_{\bar\sigma}}
|S_n(f)(\omega)|\ge
\frac u{\bar A}\right\}\subset
\bigcup\limits_{p=1}^R\bigcup\limits_{j=1}^{m_p}A(j,p)
\cup\bigcup\limits_{s=1}^{m_0}B(s)$, and
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}} |S_n(f)|\ge
\frac u{\bar A}\right)
\le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le
\sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))+\sum_{s=1}^{m_0}P(B(s)) \\
&&\qquad\le\sum_{p=1}^{\infty} 2D\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}
+2D\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^2\right\}.
\end{eqnarray*}
If the relation $(\frac u\sigma)^2\ge M(L\log\frac2\sigma+\log D)$
holds with a sufficiently large constant~$M$ (depending on $\bar A$),
and $\sigma\le1$, then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp
\left\{-\alpha\left(\frac{2^pu}{8\bar A\sigma}\right)^2
\right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^2 \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^2\right\}
\le\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^2\right\}.
$$
Hence the previous estimate implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}
|S_n(f)|\ge \frac u{\bar A}\right)
\le\sum_{p=1}^{\infty}2\cdot 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^2
\right\}\\
&&\qquad +2\exp\left\{-\alpha
\left(\frac u{10\bar A \sigma}\right)^2\right\}
\le 4 \exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^2\right\},
\end{eqnarray*}
and relation~(\ref{(6.2)}) holds.
As $\sigma^2=2^{4R}\bar\sigma^2$ the inequality
\begin{eqnarray*}
2^{-4R}\cdot\frac{2^{6R}}{256}\left(\frac{u}{\bar A\sigma}\right)^2
&\le& n\bar\sigma^2=2^{-4R} n\sigma^2\\
&\le& 2^{-4R}\cdot\frac{2^{6(R+1)}}{256}
\left(\frac{u}{\bar A\sigma}\right)^2
=\frac14\cdot 2^{-2R}\left(\frac{u}{\bar A\bar \sigma}\right)^2
\end{eqnarray*}
holds, and this implies (together with the relation $R\ge1$) that
$$
\frac1{64}\left(\frac u{\bar A\sigma}\right)^2\le n\bar\sigma^2
\le\frac1{16}\left(\frac{u}{\bar A \bar\sigma}\right)^2,
$$
as we have claimed. It remains to show that under the
condition~(\ref{(6.3)}) $n\bar\sigma^2\ge L\log n+\log D$.
This inequality clearly holds under the conditions of Proposition~6.1
if $\sigma\le n^{-1/3}$, since in this case $\log\frac2\sigma\ge
\frac{\log n}3$, and
$n\bar\sigma^2\ge\frac1{64}(\frac u {\bar A\sigma})^2
\ge\frac1{64\bar A^2} M(L\log\frac2\sigma+\log D)\ge
\frac1{192\bar A^2} M(L\log n+\log D)\ge L\log n+\log D$
if $M\ge M_0(\bar A)$ with a sufficiently large number $M_0(\bar A)$.
If $\sigma\ge n^{-1/3}$, we can exploit that the inequality
$2^{6R}\left(\frac u{\bar A\sigma}\right)^2 \le256n\sigma^2$ holds
because of the definition of the number~$R$. It can be rewritten as
$$
2^{-4R}\ge 2^{-16/3}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^2}
{n\sigma^2}\right]^{2/3}.
$$
Hence $n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1/3}\left(\frac u\sigma\right)^{4/3}$. As
$\log\frac2\sigma\ge\log2>\frac12$ the inequalities
$n\sigma^2\ge n^{1/3}$ and $(\frac u\sigma)^2\ge
M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold. They yield that
\begin{eqnarray*}
n\bar\sigma^2&\ge&\frac{\bar A^{-4/3}}{50} (n\sigma^2)^{1/3}\left(\frac
u\sigma\right)^{4/3}
\ge\frac{\bar A^{-4/3}}{50}n^{1/9}\left(\frac M2\right)^{2/3}
(L^{3/2}+(\log D)^{3/2})^{2/3}\\
&\ge&\frac{M^{2/3}n^{1/9}(L+\log D)}{100\bar A^{4/3}}\ge L\log n+\log D
\end{eqnarray*}
if $M=M(\bar A)$ is chosen sufficiently large.
\hfill$\qed$
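One small step used in the last chain of inequalities, $(L^{3/2}+(\log D)^{3/2})^{2/3}\ge\frac12(L+\log D)$, follows from the power mean inequality $a^{3/2}+b^{3/2}\ge(a+b)^{3/2}/\sqrt2$ for $a,b\ge0$. A quick numerical sketch (illustrative only):

```python
# Illustration of the elementary inequality
#   (a^{3/2} + b^{3/2})^{2/3} >= (a + b)/2   for a, b >= 0,
# a consequence of the power mean inequality
#   a^{3/2} + b^{3/2} >= (a + b)^{3/2} / sqrt(2).
def lhs(a, b):
    return (a ** 1.5 + b ** 1.5) ** (2.0 / 3.0)

checks = [(a, b, lhs(a, b) >= (a + b) / 2 - 1e-12)
          for a in (0.0, 0.5, 1.0, 3.0, 10.0)
          for b in (0.0, 0.7, 2.0, 8.0)]
print(all(ok for _, _, ok in checks))
```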
\chapter{The completion of the proof of Theorem 4.1}
This chapter contains the proof of Proposition~6.2
with the help of a symmetrization argument, and this
completes the proof of Theorem~4.1. By symmetrization
argument I mean the reduction of the investigation
of sums of the form $\sum_j f(\xi_j)$ to
sums of the form $\sum_j\varepsilon_jf(\xi_j)$, where
$\varepsilon_j$ are independent random variables,
independent also of the random variables $\xi_j$, and
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$. First a
symmetrization lemma is proved, and then an inductive
statement is formulated in Proposition~7.3 which implies
Proposition~6.2. Proposition~7.3 will be proved with the help
of the symmetrization lemma and a conditioning argument. To
carry out such a program we shall need some estimates which
follow from Hoeffding's inequality formulated in Theorem~3.4.
First I formulate the symmetrization lemma we shall apply.
\medskip\noindent
{\bf Lemma 7.1 (Symmetrization Lemma).}\index{symmetrization lemma}
{\it Let $Z_n$ and $\bar
Z_n$, $n=1,2,\dots$, be two sequences of random variables
independent of each other, and let the random variables $\bar Z_n$,
$n=1,2,\dots$, satisfy the inequality
\begin{equation}
P(|\bar Z_n|\le\alpha)\ge\beta\quad \textrm{for all } n=1,2,\dots
\label{(7.1)}
\end{equation}
with some numbers $\alpha>0$ and $\beta>0$. Then
$$
P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
\le\frac1\beta P\left(\sup\limits_{1\le
n<\infty}|Z_n-\bar Z_n|>u\right)\quad \textrm{for all } u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma 7.1.}\/ Put $\tau=\min\{n\colon\, |Z_n|>u+\alpha\}$
if there exists such an index $n$, and $\tau=0$ otherwise. Then the
event $\{\tau=n\}$ is independent of the sequence of random variables
$\bar Z_1,\bar Z_2,\dots$ for all $n=1,2,\dots$, and because of this
independence
$$
P(\{\tau=n\})\le\frac1\beta P(\{\tau=n\}\cap\{|\bar Z_n|\le\alpha\})
\le \frac1\beta P(\{\tau=n\}\cap\{|Z_n-\bar Z_n|>u\})
$$
for all $n=1,2,\dots$. Hence
\begin{eqnarray*}
&&P\left(\sup_{1\le n<\infty}|Z_n|>u+\alpha\right)
=\sum_{l=1}^\infty P(\tau=l)\\
&&\qquad \le \frac1\beta
\sum_{l=1}^\infty P(\{\tau=l\}\cap\{|Z_l-\bar Z_l|>u\}) \\
&&\qquad \le \frac1\beta \sum_{l=1}^\infty
P(\{\tau=l\}\cap\{\sup_{1\le n<\infty}|Z_n-\bar Z_n|>u\}) \\
&&\qquad \le\frac1\beta P\left(\sup\limits_{1\le n<\infty}
|Z_n-\bar Z_n|>u\right).
\end{eqnarray*}
Lemma 7.1 is proved.
\hfill$\qed$
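Lemma~7.1 can also be checked numerically. The following Monte Carlo
sketch is only an illustration, not part of the formal argument; the
standard normal sequences, the finite number of indices and the values
of $u$, $\alpha$ and $\beta$ are arbitrary choices for which
condition~(7.1) holds, since $P(|N(0,1)|\le1)\approx0.6827\ge0.68$.

```python
import random

# Monte Carlo sanity check of Lemma 7.1 (the Symmetrization Lemma):
#   P(sup_n |Z_n| > u + alpha) <= (1/beta) P(sup_n |Z_n - Zbar_n| > u)
# whenever P(|Zbar_n| <= alpha) >= beta for every n.  The standard
# normal sequences and the parameters below are illustrative choices.
random.seed(0)
N, trials = 20, 20_000
alpha, u = 1.0, 1.5
beta = 0.68                     # P(|N(0,1)| <= 1) ~ 0.6827 >= 0.68

lhs_hits = rhs_hits = 0
for _ in range(trials):
    Z = [random.gauss(0.0, 1.0) for _ in range(N)]
    Zbar = [random.gauss(0.0, 1.0) for _ in range(N)]  # independent copy
    if max(abs(z) for z in Z) > u + alpha:
        lhs_hits += 1
    if max(abs(z - zb) for z, zb in zip(Z, Zbar)) > u:
        rhs_hits += 1

lhs = lhs_hits / trials
rhs = rhs_hits / trials / beta
assert lhs <= rhs               # the inequality of Lemma 7.1
```

Here the right-hand side is in fact much larger than the left-hand side;
the lemma only claims an inequality, not a sharp comparison.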
\medskip
We shall apply the following Lemma~7.2 which is a consequence of the
Symmetrization Lemma~7.1.
\medskip\noindent
{\bf Lemma 7.2.} {\it Let us fix a countable class of functions
${\cal F}$ on a measurable space $(X,{\cal X})$ together with a real
number $0<\sigma<1$. Consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(X,{\cal X})$ such
that $Ef(\xi_1)=0$, $Ef^2(\xi_1)\le\sigma^2$ for all $f\in{\cal F}$
together with another sequence $\varepsilon_1,\dots,\varepsilon_n$
of independent
random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, independent also of the random sequence
$\xi_1,\dots,\xi_n$. Then
\begin{eqnarray}
&&P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\cal F}}\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A
n^{1/2}\sigma^{2}\right) \nonumber \\
&&\qquad \le 4P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\quad\textrm{if } A\ge \frac{3\sqrt2}{\sqrt n\sigma}.
\label{(7.2)}
\end{eqnarray}
}
\medskip\noindent
{\it Proof of Lemma 7.2.}\/ Let us construct an independent copy
$\bar\xi_1,\dots,\bar\xi_n$ of the sequence $\xi_1,\dots,\xi_n$ in
such a way that all three sequences $\xi_1,\dots,\xi_n$, \
$\bar\xi_1,\dots,\bar\xi_n$ and $\varepsilon_1,\dots,\varepsilon_n$
are independent.
Define the random variables
$$
S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n f(\xi_j) \quad \textrm{and}\quad
\bar S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n f(\bar\xi_j)
$$
for all $f\in{\cal F}$. The inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|> A\sqrt n\sigma^2\right)\le
2P\left(\sup_{f\in{\cal F}}|S_n(f)-\bar S_n(f)|> \frac23 A\sqrt
n\sigma^2\right) \label{(7.3)}
\end{equation}
follows from Lemma~7.1 if it is applied to the countable set of
random variables $Z_n(f)=S_n(f)$ and $\bar Z_n(f)=\bar S_n(f)$,
$f\in{\cal F}$, and the numbers $u=\frac23 A\sqrt n\sigma^2$ and
$\alpha=\frac13A\sqrt n\sigma^2$, since the random fields $S_n(f)$
and $\bar S_n(f)$ are independent, and
$P(|\bar S_n(f)|\le\alpha)>\frac12$ for all $f\in{\cal F}$. Indeed,
$\alpha=\frac13 A\sqrt n\sigma^2\ge\sqrt2\sigma$, $E\bar S_n(f)^2
\le\sigma^2$, thus Chebyshev's inequality implies that
$P(|\bar S_n(f)|\le\alpha)\ge P(|\bar S_n(f)|\le\sqrt2\sigma)
\ge\frac12$ for all $f\in{\cal F}$.
Let us observe that the random field
\begin{equation}
S_n(f)-\bar S_n(f)=\frac1{\sqrt n}\sum_{j=1}^n \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\cal F}, \label{(7.4)}
\end{equation}
and its randomized version
\begin{equation}
\frac1{\sqrt n}\sum_{j=1}^n \varepsilon_j \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\cal F}, \label{($7.4'$)}
\end{equation}
have the same distribution. Indeed, even the conditional
distribution of~(\ref{($7.4'$)}) under the condition that the
values of the $\varepsilon_j$-s are prescribed agrees with
the distribution of~(\ref{(7.4)}) for all possible values of
the $\varepsilon_j$-s. This follows from the observation that
the distribution of the random field~(\ref{(7.4)}) does not
change if we exchange the random variables $\xi_j$ and
$\bar\xi_j$ for those indices $j$ for which $\varepsilon_j=-1$
and do not change them for those indices~$j$ for which
$\varepsilon_j=1$. On the other hand, the distribution of
the random field obtained with such an exchange of its
variables agrees with the conditional distribution of the
random field defined in~(\ref{($7.4'$)}) under the condition
that the random variables $\varepsilon_j$ take these
prescribed values.
The above relation together with formula~(\ref{(7.3)}) imply that
\begin{eqnarray*}
&&P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}\left|\sum_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right)\\
&&\qquad \le 2P\left(\frac1{\sqrt n}
\sup_{f\in{\cal F}}\left|\sum_{j=1}^n
\varepsilon_j\left[f(\xi_j)-f(\bar\xi_j)\right]\right| \ge\frac23 A
n^{1/2}\sigma^{2}\right) \\
&&\qquad\le 2P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3 n^{1/2}\sigma^{2}\right) \\
&&\qquad\qquad+ 2P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}\left|
\sum_{j=1}^n \varepsilon_jf(\bar\xi_j)\right|
\ge\frac A3n^{1/2}\sigma^{2}\right) \\
&&\qquad=4P\left(\frac1{\sqrt n}\sup_{f\in{\cal F}}
\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge\frac A3n^{1/2}\sigma^{2}\right).
\end{eqnarray*}
Lemma~7.2 is proved.
\hfill$\qed$
\medskip
First I try to explain briefly the method of proof of
Proposition~6.2. A probability of the form
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
has to be estimated. Lemma~7.2 enables us to replace this problem
by the estimation of the probability
$$
P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}\left| \sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right|>\frac u3\right)
$$
with some independent random variables $\varepsilon_j$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $j=1,\dots,n$,
which are also independent of the random variables $\xi_j$. We
shall bound the conditional probability of the event appearing in
this modified problem under the condition that each random
variable $\xi_j$ takes a prescribed value. This can be done
with the help of Hoeffding's inequality formulated in Theorem~3.4
and the $L_2$-density property of the class of functions ${\cal F}$
we consider. We hope to get a sharp estimate in such a way which
is similar to the result we got in the study of the Gaussian
counterpart of this problem, because Hoeffding's inequality yields
always a Gaussian type upper bound for the tail distribution of
the random sum we are studying.
Nevertheless, there appears a problem when we try to apply such an
approach. To get a good estimate on the conditional tail distribution
of the supremum of the random sums we are studying with the help of
Hoeffding's inequality we need a good estimate on the supremum of
the conditional variances of the random sums we are studying, i.e.
on the tail distribution of
$\sup\limits_{f\in{\cal F}}\frac1n\sum\limits_{j=1}^n f^2(\xi_j)$.
This problem is similar to the original one, and it is not simpler.
But a more detailed study shows that our approach to get a good
estimate with the help of Hoeffding's inequality works. In
comparing our original problem with the new, complementary problem
we have to understand at which level we need a good estimate on the
tail distribution of the supremum in the complementary problem to
get a good tail distribution estimate at level~$u$ in the original
problem. A detailed study shows that to bound the probability in
the original problem with parameter~$u$ we have to estimate the
probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}'}\left|
\sum\limits_{j=1}^n f(\xi_j)\right|>u^{1+\alpha}\right)$ with
some new nice, appropriately defined $L_2$-dense class of
bounded functions ${\cal F}'$ and some
number $\alpha>0$. We shall exploit that the number~$u$ is
replaced by a larger number $u^{1+\alpha}$ in the new problem. Let
us also observe that if the sum of bounded random variables is
considered, then for very large numbers~$u$ the probability we
investigate equals zero. On the basis of these observations an
appropriate backward induction procedure can be worked out. In its
$n$-th step we give a good upper bound on the probability
$P\left(n^{-1/2}\sup\limits_{f\in{\cal F}}\left|
\sum\limits_{j=1}^nf(\xi_j)\right|>u\right)$
if $u\ge T_n$ with an appropriately chosen number~$T_n$, and try
to diminish the number~$T_n$ in each step of this induction
procedure. We can prove Proposition~6.2 as a consequence of the
result we get by means of this backward induction procedure. To
work out the details we introduce the following notion.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of normalized
random sums.}
\index{good tail behaviour for a class of normalized random sums}
{\it Let us have some measurable space $(X,{\cal X})$ and a
probability measure $\mu$ on it together with some integer $n\ge2$
and real number $\sigma>0$. Consider some class ${\cal F}$ of
functions $f(x)$ on the space $(X,{\cal X})$, and take a sequence of
independent and $\mu$-distributed random variables $\xi_1,\dots,\xi_n$
with values in the space $(X,{\cal X})$. Define the normalized random
sums
$S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^nf(\xi_j)$, $f\in {\cal F}$.
Given some real number $T>0$ we say that the set of normalized
random sums $S_n(f)$, $f\in{\cal F}$,
has a good tail behaviour at level~$T$ (with parameters $n$ and
$\sigma^2$ which will be fixed in the sequel) if the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|S_n(f)|\ge A \sqrt n\sigma^2\right) \le
\exp\left\{-A^{1/2}n\sigma^2 \right\} \label{(7.5)}
\end{equation}
holds for all numbers $A>T$.}
\medskip
Now I formulate Proposition 7.3 and show that Proposition 6.2
follows from it.
\medskip\noindent
{\bf Proposition 7.3.} {\it Let us fix a positive integer~$n\ge2$,
a real number $0<\sigma\le1$ and a probability measure $\mu$ on a
measurable space $(X,{\cal X})$ together with some numbers $L\ge1$
and $D\ge1$ such that $n\sigma^2\ge L\log n+\log D$. Let us
consider those countable $L_2$-dense classes ${\cal F}$ of functions
$f=f(x)$ on the space $(X,{\cal X})$ with exponent~$L$ and
parameter~$D$ for which all functions $f\in{\cal F}$ satisfy the
conditions
$\sup\limits_{x\in X}|f(x)|\le\frac14$, $\int f(x)\mu(\,dx)=0$
and $\int f^2(x)\mu(\,dx)\le\sigma^2$.
Let a number $T>1$ be such that for all classes of functions
${\cal F}$ which satisfy the above conditions the set of normalized
random sums $S_n(f)=\frac1{\sqrt n}\sum\limits_{j=1}^n f(\xi_j)$,
$f\in{\cal F}$, defined with the help of a sequence of independent
$\mu$-distributed random variables $\xi_1,\dots,\xi_n$ has a good
tail behaviour at level~$T^{4/3}$. There is a universal
constant~$\bar A_0$ such that if \ $T\ge\bar A_0$, then the set of the
above defined normalized sums, $S_n(f)$, $f\in{\cal F}$, has a good
tail behaviour for all such classes of functions ${\cal F}$ not
only at level $T^{4/3}$ but also at level~$T$.}
\medskip
Proposition~6.2 simply follows from Proposition~7.3. To show this
let us first observe that a class of normalized random sums
$S_n(f)$, $f\in{\cal F}$, has a good tail behaviour at level
$T_0=\frac1{4\sigma^2}$ if this class of functions ${\cal F}$
satisfies the conditions of Proposition~7.3. Indeed, in this
case
$$
P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|\ge A\sqrt n\sigma^2\right)
\le P\left(\sup\limits_{f\in{\cal F}}|S_n(f)|>\frac{\sqrt n}4\right)=0
$$
for all $A>T_0$. Then the repeated application of Proposition~7.3
yields that a class of random sums $S_n(f)$, $f\in{\cal F}$, has a
good tail behaviour at all levels $T\ge T_0^{(3/4)^j}$ with an
index~$j$ such that $T_0^{(3/4)^j}\ge\bar A_0$ if the class of
functions ${\cal F}$ satisfies the conditions of Proposition~7.3.
Hence it has a good tail behaviour for $T=\bar A_0^{4/3}$ with the
number~$\bar A_0$ appearing in Proposition~7.3. If a class of
functions $f\in{\cal F}$ satisfies the conditions of
Proposition~6.2, then the class of functions
$\bar{\cal F}=\left\{\bar f=\frac f4\colon\,f\in{\cal F}\right\}$
satisfies the conditions of Proposition~7.3, with the same
parameters~$\sigma$, $L$ and~$D$. (Actually some of the
inequalities that must hold for the elements of a class of
functions~${\cal F}$ satisfying the conditions of Proposition~7.3
are valid with smaller parameters. But we did not change these
parameters to satisfy also the condition
$n\sigma^2\ge L\log n+\log D$.) Hence the class of random sums
$S_n(\bar f)$, $\bar f\in \bar{\cal F}$, has a good tail
behaviour at level $T=\bar A_0^{4/3}$. This implies that the
original class of functions ${\cal F}$ satisfies
formula~(\ref{(6.4)}) in Proposition~6.2, and this is what we
had to show.\index{estimate on the supremum of a class of
partial sums}
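The levels visited by this repeated application are easy to compute
explicitly. The following sketch is only a numerical illustration; the
value of $\sigma$ and the stand-in value for the universal constant
$\bar A_0$ are arbitrary, since the text only asserts that such a
constant exists.

```python
# The backward induction of Proposition 7.3 improves good tail behaviour
# at level T^{4/3} to good tail behaviour at level T.  Starting from
# T_0 = 1/(4 sigma^2), where the probability in question vanishes, the
# attainable levels are T_0, T_0^{3/4}, T_0^{(3/4)^2}, ... as long as
# they stay above the universal constant (the stand-in value 20.0 is
# arbitrary; the text only asserts the existence of \bar A_0).
def induction_levels(sigma, a0=20.0):
    t = 1.0 / (4.0 * sigma ** 2)
    levels = [t]
    while t ** 0.75 >= a0:        # one application of Proposition 7.3
        t = t ** 0.75
        levels.append(t)
    return levels

levels = induction_levels(sigma=0.01)   # starts near T_0 = 2500
assert levels == sorted(levels, reverse=True)   # levels decrease
```

Since each step replaces $T$ by $T^{3/4}$, the number of induction steps
grows only doubly logarithmically in $T_0$.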
\medskip\noindent
{\it Proof of Proposition 7.3.}\/ Fix a class of functions
${\cal F}$ which satisfies the conditions of Proposition~7.3
together with two independent sequences $\xi_1,\dots,\xi_n$ and
$\varepsilon_1,\dots,\varepsilon_n$ of independent random variables,
where $\xi_j$ is $\mu$-distributed,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le j\le n$,
and investigate the conditional probability
$$
P(f,A|\xi_1,\dots,\xi_n)=P\left(\left.\frac1{\sqrt n}\left|\sum_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge
\frac A6\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all functions $f\in{\cal F}$, $A> T$ and values
$(\xi_1,\dots,\xi_n)$ in the condition. By the Hoeffding inequality
formulated in Theorem~3.4
\begin{equation}
P(f,A|\xi_1,\dots,\xi_n)\le 2\exp\left\{-\frac{\frac 1{36}
A^2 n\sigma^4}{2\bar S^2(f,\xi_1,\dots,\xi_n)}\right\} \label{(7.6)}
\end{equation}
with
$$
\bar S^2(f,x_1,\dots,x_n)=\frac1n\sum_{j=1}^n f^2(x_j),
\quad f\in {\cal F}.
$$
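The bound~(7.6) is Hoeffding's inequality (Theorem~3.4) applied,
conditionally on $\xi_1,\dots,\xi_n$, to the weighted Rademacher sum
$\sum_j\varepsilon_jf(\xi_j)$ with weights $a_j=f(\xi_j)$. A Monte
Carlo sanity check of this conditional bound may be instructive; the
weights (mimicking the condition $|f|\le\frac14$) and the threshold
below are arbitrary illustrative choices.

```python
import math, random

# Hoeffding's inequality for a weighted Rademacher sum:
#   P(|sum_j eps_j a_j| >= t) <= 2 exp(-t^2 / (2 sum_j a_j^2)),
# which is the source of the conditional bound (7.6) with a_j = f(xi_j).
# The weights a_j (uniform on [-1/4, 1/4], mimicking |f| <= 1/4) and the
# threshold t are arbitrary illustrative choices.
random.seed(1)
n, t = 100, 2.0
a = [random.uniform(-0.25, 0.25) for _ in range(n)]
bound = 2.0 * math.exp(-t * t / (2.0 * sum(x * x for x in a)))

trials = 20_000
hits = sum(
    1 for _ in range(trials)
    if abs(sum(random.choice((-1.0, 1.0)) * x for x in a)) >= t
)
assert hits / trials <= bound   # the Hoeffding bound holds
```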
Let us introduce the set
\begin{equation}
H=H(A)=\left\{(x_1,\dots,x_n)\colon\, \sup_{f\in{\cal F}}
\bar S^2(f,x_1,\dots,x_n)
\ge \left(1+A^{4/3}\right)\sigma^2\right\}. \label{(7.7)}
\end{equation}
I claim that
\begin{equation}
P((\xi_1,\dots,\xi_n)\in H)\le e^{-A^{2/3} n\sigma^2}\quad\textrm{ if }
A>T. \label{(7.8)}
\end{equation}
(The set $H$ is the small exceptional set of those points
$(x_1,\dots,x_n)$ for which we cannot give a good estimate for
$P(f,A|\xi_1(\omega),\dots,\xi_n(\omega))$ with
$\xi_1(\omega)=x_1$,\dots, $\xi_n(\omega)=x_n$ for some $f\in{\cal F}$.)
To prove relation~(\ref{(7.8)}) let us consider the functions
$\bar f=\bar f(f)$, $\bar f(x)=f^2(x)-\int f^2(x)\mu(\,dx)$, and
introduce the
class of functions $\bar{\cal F}=\{\bar f(f)\colon\, f\in{\cal F}\}$.
Let us show that the class of functions $\bar{\cal F}$ satisfies the
conditions of Proposition~7.3, hence the estimate~(\ref{(7.5)}) holds
for the class of functions $\bar{\cal F}$ if $A> T^{4/3}$.
The relation $\int \bar f(x)\mu(\,dx)=0$ clearly holds. The condition
$\sup| \bar f(x)|\le\frac 18<\frac14$ also holds if $\sup |f(x)|\le
\frac14$, and $\int \bar f^2(x)\mu(\,dx)\le \int f^4(x)\mu(\,dx)\le
\frac 1{16}\int f^2(x)\,\mu(\,dx)\le\frac{\sigma^2}{16}<\sigma^2$
if $f\in{\cal F}$. It remains to show that $\bar{\cal F}$ is an
$L_2$-dense class with exponent $L$ and parameter $D$. For this goal
we need a good estimate on $\int(\bar f(x)-\bar g(x))^2\rho(\,dx)$,
where $\bar f,\,\bar g\in\bar{\cal F}$, and $\rho$ is an arbitrary
probability measure.
Observe that
\begin{eqnarray*}
&&\int (\bar f(x)-\bar g(x))^2\rho(\,dx) \\
&&\qquad \le 2\int(f^2(x)-g^2(x))^2\rho(\,dx)+
2\int(f^2(x)-g^2(x))^2\mu(\,dx) \\
&&\qquad \le2 \left(\sup\limits_x (|f(x)|+|g(x)|)\right)^2
\left(\int (f(x)-g(x))^2(\rho(\,dx)+\mu(\,dx))\right) \\
&&\qquad \le \int (f(x)-g(x))^2\bar\rho(\,dx)
\end{eqnarray*}
for all $f, g\in{\cal F}$, $\bar f=\bar
f(f)$, $\bar g=\bar g(g)$ and probability measure $\rho$, where
$\bar\rho=\frac{\rho+\mu}2$. This means that if $\{f_1,\dots,f_m\}$
is an $\varepsilon$-dense subset of ${\cal F}$ in the space
$L_2(X,{\cal X},\bar\rho)$, then
$\{\bar f_1,\dots,\bar f_m\}$ is an $\varepsilon$-dense
subset of $\bar{\cal F}$ in the space $L_2(X,{\cal X},\rho)$, and
not only ${\cal F}$, but also $\bar{\cal F}$ is an $L_2$-dense class
with exponent $L$ and parameter $D$.
Because of the conditions of Proposition 7.3 we can write
for the number $A^{4/3}> T^{4/3}$ and the class of functions
$\bar{\cal F}$ that
\begin{eqnarray*}
&&P((\xi_1,\dots,\xi_n)\in H) \\
&&\qquad=P\left(\sup_{f\in{\cal F}}
\left(\frac1n \sum_{j=1}^n
\bar f(f)(\xi_j) +\frac1n \sum_{j=1}^n E f^2(\xi_j)\right)
\ge \left(1+A^{4/3}\right)\sigma^2\right)\\
&&\qquad\le P\left(\sup_{\bar f\in\bar {\cal F}}
\frac1{\sqrt n} \sum_{j=1}^n
\bar f(\xi_j) \ge A^{4/3}n^{1/2}\sigma^2\right)
\le e^{-A^{2/3} n\sigma^2},
\end{eqnarray*}
i.e. relation~(\ref{(7.8)}) holds.
By formula~(\ref{(7.6)}) and the definition of the set $H$
given in~(\ref{(7.7)}) the estimate
\begin{equation}
P(f,A|\xi_1,\dots,\xi_n)\le 2e^{- A^{2/3} n\sigma^2/144} \quad
\textrm{if }(\xi_1,\dots,\xi_n)\notin H
\label{(7.9)}
\end{equation}
holds for all $f\in {\cal F}$ and $A>T\ge1$. (Here we used the
estimate $1+A^{4/3}\le2A^{4/3}$.) Let us introduce the conditional
probability
$$
P({\cal F},A|\xi_1,\dots,\xi_n)=
P\left(\left.\sup_{f\in {\cal F}} \frac1{\sqrt n}\left|
\sum\limits_{j=1}^n \varepsilon_jf(\xi_j)\right| \ge
\frac A3\sqrt n\sigma^2\right|\xi_1,\dots,\xi_n\right)
$$
for all $(\xi_1,\dots,\xi_n)$ and $A>T$. We shall
estimate this conditional probability with the help of
relation~(\ref{(7.9)}) if $(\xi_1,\dots,\xi_n) \notin H$.
Given a vector $x^{(n)}=(x_1,\dots,x_n)\in X^n$, let us introduce
the probability measure
$$
\nu=\nu(x_1,\dots,x_n)=\nu(x^{(n)})\quad \textrm{on } (X,{\cal X})
$$
which is concentrated in the coordinates of the vector
$x^{(n)}=(x_1,\dots,x_n)$, and $\nu(\{x_j\})=\frac1n$ for all
points~$x_j$, $j=1,\dots,n$. If $\int f^2(u)\nu(\,du)\le\delta^2$
for a function $f$, then
$\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)\right|
\le n^{1/2}\int|f(u)|\nu(\,du)\le n^{1/2}\delta$ by the Schwarz
inequality. As a
consequence, we can write that
\begin{eqnarray}
&&\left|\frac1{\sqrt n}\sum\limits_{j=1}^n\varepsilon_jf(x_j)-
\frac1{\sqrt n}\sum\limits_{j=1}^n \varepsilon_jg(x_j)\right|
\le\frac A6 \sqrt n\sigma^2 \nonumber \\
&&\qquad\textrm{if }
\int (f(u)-g(u))^2\,d\nu(u)\le\left(\frac {A\sigma^2}6\right)^2.
\label{(7.10)}
\end{eqnarray}
\medskip\noindent
{\it Remark.} We may assume in our proof that the distribution of
the random variables $\xi_j$, $1\le j\le n$, is non-atomic, and
as a consequence we can restrict our attention to such measures
$\nu(x^{(n)})$ for which all coordinates of the vector $x^{(n)}$
are different. Otherwise we can define independent and uniformly
distributed random variables on the interval $[0,1]$,
$\eta_1,\dots,\eta_n$, which are also independent of the random
variables $\xi_j$, $1\le j\le n$. With the help of these random
variables $\eta_j$ we can introduce the random variables
$\tilde\xi_j=(\xi_j,\eta_j)$, $1\le j\le n$, and the class of
functions $\tilde{\cal F}$ on the space $X\times[0,1]$ consisting
of functions $\tilde f(x,y)=f(x)$, $f\in{\cal F}$, with $x\in X$
and $0\le y\le 1$. It is not difficult to see that the random
variables $\tilde\xi_j$ and the class of functions
$\tilde{\cal F}$ satisfy the conditions of Proposition~7.3, and
the distribution of the random variables $\tilde\xi_j$ is
non-atomic. Hence we can apply Proposition~7.3 with such a choice,
and this provides the statement of Proposition~7.3 in the original
case, too.
\medskip
Let us list the elements of the (countable) set ${\cal F}$ as
${\cal F}=\{f_1,f_2,\dots\}$, fix the number $\delta=\frac{A\sigma^2}6$,
and choose for all vectors $x^{(n)}=(x_1,\dots,x_n)\in X^n$ a
sequence of indices $p_1(x^{(n)}),\dots,p_m(x^{(n)})$ taking
positive integer values with
$m=\max(1, D\delta^{-L})=\max(1,D(\frac6{A\sigma^2})^L)$
elements in such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\,d\nu(x^{(n)})(u)\le\delta^2$
for all $f\in{\cal F}$ and $x^{(n)}\in X^n$ with the above defined
measure $\nu(x^{(n)})$ on the space $(X,{\cal X})$. This is possible
because of the $L_2$-dense property of the class of
functions~${\cal F}$. (This is the point where the $L_2$-dense
property of the class of functions ${\cal F}$ is exploited in its
full strength.) In a complete proof of Proposition~7.3 we still have
to show that we can choose the indices $p_j(x^{(n)})$,
$1\le j\le m$, as measurable functions of their argument~$x^{(n)}$
on the space $(X^n,{\cal X}^n)$. We shall show this in Lemma~7.4 at
the end of the proof.
Put $\xi^{(n)}(\omega)=(\xi_1(\omega),\dots,\xi_n(\omega))$. Because
of relation~(\ref{(7.10)}), the choice of the number $\delta$ and
the property of the functions $f_{p_l(x^{(n)})}(\cdot)$ we have
\begin{eqnarray}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
\frac1{\sqrt n}\left|\sum\limits_{j=1}^n
\varepsilon_j(\omega)f(\xi_j(\omega))\right|
\ge\frac A3\sqrt n\sigma^2\right\} \label{(7.11)} \\
&&\qquad \subset\bigcup_{l=1}^m\left\{\omega\colon\,\frac1{\sqrt n}
\left|\sum\limits_{j=1}^n \varepsilon_j(\omega)f_{p_l(\xi^{(n)}(\omega))}
(\xi_j(\omega))\right|\ge\frac A6\sqrt n\sigma^2\right\}. \nonumber
\end{eqnarray}
We can estimate the conditional probability of the event on the
right-hand side of~(\ref{(7.11)}) under the condition that the vector
$(\xi_1(\omega),\dots,\xi_n(\omega))$ takes such a prescribed value
for which $(\xi_1(\omega),\dots,\xi_n(\omega))\notin H$. We
get with the help of~(\ref{(7.11)}), inequality~(\ref{(7.9)})
and the definition of the quantity $P(f,A|\xi_1,\dots,\xi_n)$
before formula~(\ref{(7.6)}) that
\begin{eqnarray}
P({\cal F},A|\xi_1,\dots,\xi_n)
&&\le\sum\limits_{l=1}^m P(f_{p_l(\xi^{(n)})},A|\xi_1,\dots,\xi_n)
\nonumber\\
&&\le 2\max\left(1,D\left(\frac 6{A\sigma^2}\right)^L\right)
e^{-A^{2/3} n\sigma^2/144} \nonumber \\
&&\qquad \textrm{if }(\xi_1,\dots,\xi_n)\notin H \textrm{ and } A> T.
\label{(7.12)}
\end{eqnarray}
If $A\ge\bar A_0$ with a sufficiently large constant~$\bar A_0$,
then this inequality together with Lemma~7.2 and the
estimate~(\ref{(7.8)}) imply that
\begin{eqnarray}
&&P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^n
f(\xi_j)\right| \ge A n^{1/2}\sigma^{2}\right) \nonumber \\
&&\qquad \le 4P\left(\frac1{\sqrt n}
\sup\limits_{f\in{\cal F}}\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\label{(7.13)} \\
&&\qquad \le\max\left(4, 8D\left(\frac6{A\sigma^2}\right)^L
\right)e^{-A^{2/3}n\sigma^2/144}
+4e^{-A^{2/3}n\sigma^2} \quad \textrm{if } A>T. \nonumber
\end{eqnarray}
(We may apply Lemma~7.2 if $A\ge\bar A_0$ with a sufficiently
large~$\bar A_0$,
since $n\sigma^2\ge L\log n+\log D\ge\log 2$, hence
$\sqrt n\sigma\ge\sqrt{\log 2}$, and the condition
$A\ge \frac{3\sqrt2}{\sqrt n\sigma}$ demanded in relation~(\ref{(7.2)})
is satisfied.)
By the conditions of Proposition~7.3 the inequalities
$n\sigma^2\ge L\log n +\log D$ hold with some $L\ge1$, $D\ge1$
and $n\ge2$. This implies that $n\sigma^2\ge L\log2\ge\frac12$,
$(\frac6{A\sigma^2})^L
\le(\frac n{2n\sigma^2})^L\le n^L=e^{L\log n}
\le e^{n\sigma^2}$ if $A\ge\bar A_0$ with some sufficiently large
constant $\bar A_0>0$, and $2D=e^{\log2+\log D}\le e^{3n\sigma^2}$.
Hence the first term at the right-hand side of~(\ref{(7.13)}) can be
bounded by
$$
\max\left(4,8D\left(\frac6{A\sigma^2}\right)^L\right)
e^{-A^{2/3}n\sigma^2/144}
\le e^{-A^{2/3}n\sigma^2/144}\cdot 4e^{4n\sigma^2}
\le \frac12e^{-A^{1/2}n\sigma^2}
$$
if $A\ge\bar A_0$ with a sufficiently large~$\bar A_0$. The
second term at the right-hand side of~(\ref{(7.13)}) can also be
bounded as $4e^{-A^{2/3}n\sigma^2}\le \frac12e^{-A^{1/2}n\sigma^2}$
with an appropriate choice of the number~$\bar A_0$.
By the above calculation formula~(\ref{(7.13)}) yields the inequality
$$
P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}\left|
\sum\limits_{j=1}^n f(\xi_j)\right| \ge An^{1/2}\sigma^{2}\right)
\le e^{-A^{1/2}n\sigma^2}
$$
if $A>T$, and the constant~$\bar A_0$ is chosen sufficiently large.
\hfill$\qed$
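The closing bookkeeping can be verified in log-scale for concrete
parameter values. In the following sketch the values of $n$, $\sigma$,
$L$, $D$ and the level $A$ are arbitrary admissible stand-ins; note that
already the inequality $A^{2/3}/144\ge A^{1/2}$ forces $A\ge144^6$,
which indicates how large the universal constant $\bar A_0$ has to be.

```python
import math

# Log-scale check of the closing estimates: both terms on the right-hand
# side of (7.13) are dominated by (1/2) exp(-A^{1/2} n sigma^2) once A
# is large enough.  The admissible values of n, sigma, L, D and A below
# are arbitrary stand-ins, not constants appearing in the text.
n, sigma, L, D = 10_000, 0.1, 1.0, 1.0
ns2 = n * sigma ** 2                       # n sigma^2 = 100
assert ns2 >= L * math.log(n) + math.log(D)   # condition of Prop. 7.3

A = 1.0e14                                 # comfortably above 144^6
log_half_target = -math.sqrt(A) * ns2 + math.log(0.5)

log_term1 = (math.log(max(4.0, 8.0 * D * (6.0 / (A * sigma ** 2)) ** L))
             - A ** (2.0 / 3.0) * ns2 / 144.0)
log_term2 = math.log(4.0) - A ** (2.0 / 3.0) * ns2

# each term is at most (1/2) exp(-A^{1/2} n sigma^2)
assert log_term1 <= log_half_target and log_term2 <= log_half_target
```

Working with logarithms avoids the floating-point underflow that the
doubly exponentially small probabilities would otherwise cause.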
\medskip
To complete the proof of Proposition~7.3 we still show in the
following Lemma 7.4 that the functions $p_l(x^{(n)})$,
$1\le l\le m$, we have introduced in the above argument can be
chosen as measurable functions in the space $(X^n,{\cal X}^n)$.
This implies that the expressions
$f_{p_l(\xi^{(n)}(\omega))}(\xi_j(\omega))$ in formula~(\ref{(7.11)})
are ${\cal F}(\xi_1,\dots,\xi_n)$ measurable random variables. Hence
the formulation of~(\ref{(7.12)}) is legitimate, and no measurability
problem arises. We shall present Lemma~7.4 together with some
generalizations in Lemma~7.4A and Lemma~7.4B that we shall apply
later in the proof of Propositions~15.3 and~15.4 which are
multivariate versions of Proposition~7.3. We shall need these
results in the proof of the multivariate version of
Proposition~6.2. We have formulated them not in their most
general possible form, but in the form in which we shall need them.
\medskip\noindent
{\bf Lemma~7.4.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some
exponent~$L>0$ and parameter~$D\ge1$ on a measurable space
$(X,{\cal X})$. Fix some positive integer~$n$, and define for
all $x^{(n)}=(x_1,\dots,x_n)\in X^n$ the probability measure
$\nu(x^{(n)})=\nu(x_1,\dots,x_n)$ on the space $(X,{\cal X})$
by the formula $\nu(x^{(n)})(\{x_j\})=\frac1n$, $1\le j\le n$.
For a number $0<\varepsilon\le 1$ put
$m=m(\varepsilon)=[D\varepsilon^{-L}]$, where $[\cdot]$
denotes integer part. For all $0<\varepsilon\le 1$ there
exist $m=m(\varepsilon)$
measurable functions $p_l(x^{(n)})$, $1\le l\le m$, on the
measurable space $(X^n,{\cal X}^n)$ with positive integer values in
such a way that $\inf\limits_{1\le l\le m}
\int(f(u)-f_{p_l(x^{(n)})}(u))^2\nu(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^n$ and $f\in{\cal F}$.}
\medskip
In the proof of Proposition~15.3 we need the following result.
\medskip\noindent
{\bf Lemma 7.4A.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some
exponent $L>0$ and parameter~$D\ge1$ on the $k$-fold product
$(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$ with
some $k\ge1$. Fix some positive integer~$n$, and define for
all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$,
where $x^{(j)}_l\in X$ for all $j$ and $l$, the probability
measure $\rho(x^{(n)})$ on the space $(X^k,{\cal X}^k)$ by
the formula
$\rho(x^{(n)})(\{(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})\})
=\frac1{n^k}$ for all sequences
$(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)})$, $1\le l_j\le n$,
$1\le j\le k$, with coordinates of the vector
$x^{(n)}=(x_l^{(j)},1\le l\le n,\,1\le j\le k)$. For all
$0<\varepsilon\le 1$ there exist
$m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable functions
$p_r(x^{(n)})$, $1\le r\le m$, on the measurable space
$(X^{kn},{\cal X}^{kn})$ with positive integer values in
such a way that $\inf\limits_{1\le r\le m}
\int(f(u)-f_{p_r(x^{(n)})}(u))^2\rho(x^{(n)})(\,du)
\le\varepsilon^2$
for all $x^{(n)}\in X^{kn}$ and $f\in{\cal F}$.}
\medskip
In the proof of Proposition~15.4 the following result will be needed.
\medskip\noindent
{\bf Lemma 7.4B.} {\it Let ${\cal F}=\{f_1,f_2,\dots\}$ be a
countable and $L_2$-dense class of functions with some exponent
$L>0$ and parameter~$D\ge1$ on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some measurable
spaces $(X,{\cal X})$ and $(Y,{\cal Y})$ and integer~$k\ge1$.
Fix some positive integer~$n$, and define for all vectors
$x^{(n)}=(x_l^{(j,1)},x_l^{(j,-1)},\,1\le l\le n,\,1\le j\le k)
\in X^{2kn}$, where $x^{(j,\pm1)}_l\in X$ for all $j$ and $l$,
a probability measure $\alpha(x^{(n)})$ on the space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$
in the following way. Fix some probability measure $\rho$ in
the space $(Y,{\cal Y})$ and two $\pm1$ sequences
$\varepsilon^{(k)}_1=(\varepsilon_{1,1},\dots,\varepsilon_{k,1})$
and
$\varepsilon^{(k)}_2=(\varepsilon_{1,2},\dots,\varepsilon_{k,2})$
of length~$k$. Define with their help first the following
probability measures
$\alpha_1(x^{(n)})=\alpha_1(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
and $\alpha_2(x^{(n)})=\alpha_2(x^{(n)},\varepsilon^{(k)}_1,
\varepsilon^{(k)}_2,\rho)$
in the space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ for all
$x^{(n)}\in X^{2kn}$. Let
$\alpha_1(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,1})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,1})}\}\times B)=\frac{\rho(B)}{n^k}$
and
$\alpha_2(x^{(n)})(\{x_{l_1}^{(1,\varepsilon_{1,2})}\}
\times\cdots\times
\{x_{l_k}^{(k,\varepsilon_{k,2})}\}\times B)=\frac{\rho(B)}{n^k}$
with $1\le l_j\le n$ for all $1\le j\le k$ and $B\in{\cal Y}$ if
$x_{l_j}^{(j,\varepsilon_{j,1})}$ and
$x_{l_j}^{(j,\varepsilon_{j,2})}$ are the appropriate coordinates
of the vector $x^{(n)}\in X^{2kn}$. Put
$\alpha(x^{(n)})=\frac{\alpha_1(x^{(n)})+\alpha_2(x^{(n)})}2$.
For all $0<\varepsilon\le 1$ there exist
$m=m(\varepsilon)=[D\varepsilon^{-L}]$ measurable
functions $p_r(x^{(n)})$, $1\le r\le m$, on the measurable space
$(X^{2kn},{\cal X}^{2kn})$ with positive integer values in
such a way that
$\inf\limits_{1\le r\le m}\int(f(u)-f_{p_r(x^{(n)})}(u))^2
\alpha(x^{(n)})(\,du)\le\varepsilon^2$
for all $x^{(n)}\in X^{2kn}$ and $f\in{\cal F}$.}
\medskip\noindent
{\it Proof of Lemma 7.4.}\/ Fix some $0<\varepsilon\le 1$, take
the number $m=m(\varepsilon)$ introduced in the lemma, and let
us list the set of all vectors $(j_1,\dots,j_m)$ of length~$m$
with positive integer coordinates in some way. Define for all of
these vectors $(j_1,\dots,j_m)$ the set
$B(j_1,\dots,j_m)\subset X^n$ in the following way. The relation
$x^{(n)}=(x_1,\dots,x_n)\in B(j_1,\dots,j_m)$ holds
if and only if $\inf\limits_{1\le r\le m}
\int (f(u)-f_{j_r}(u))^2\,d\nu(x^{(n)})(u)\le\varepsilon^2$ for all
$f\in{\cal F}$. Then all sets $B(j_1,\dots,j_m)$ are measurable, and
$\bigcup\limits_{(j_1,\dots,j_m)}B(j_1,\dots,j_m)=X^n$
because ${\cal F}$ is an $L_2$-dense class of functions with
exponent~$L$ and parameter~$D$. Given a point
$x^{(n)}=(x_1,\dots,x_n)$ let us choose
the first vector $(j_1,\dots,j_m)=(j_1(x^{(n)}),\dots,j_m(x^{(n)}))$
in our list of vectors for which $x^{(n)}\in B(j_1,\dots,j_m)$, and
define $p_l(x^{(n)})=j_l(x^{(n)})$ for all $1\le l\le m$ with this
vector $(j_1,\dots,j_m)$. Then the functions $p_l(x^{(n)})$ are
measurable, and the functions $f_{p_l(x^{(n)})}$, $1\le l\le m$,
defined with their help together with the probability measures
$\nu(x^{(n)})$ satisfy the inequality demanded in Lemma~7.4.
\hfill$\qed$
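The selection rule in this proof (take the first index vector in a
fixed enumeration that works) can be mimicked in a finite toy setting.
In the sketch below the function class and the values of $m$ and
$\varepsilon$ are illustrative stand-ins; in particular $m$ is not
computed here from the $L_2$-dense parameters $D$ and $L$.

```python
import itertools

# A finite toy version of the selection in the proof of Lemma 7.4:
# enumerate the candidate index vectors (j_1,...,j_m) in a fixed order
# and assign to each data vector x^(n) the first vector whose functions
# form an eps-net of F in L_2(nu(x^(n))), nu being the empirical measure
# of the coordinates.  The class F, m and eps are illustrative stand-ins.
F = [lambda x: x, lambda x: -x, lambda x: x * x, lambda x: 1.0 - x]
m, eps = 3, 0.5

def dist2(f, g, xs):
    # squared L_2(nu) distance for the empirical measure nu of xs
    return sum((f(x) - g(x)) ** 2 for x in xs) / len(xs)

def select(xs):
    # first (j_1,...,j_m) whose functions form an eps-net of F
    for js in itertools.product(range(len(F)), repeat=m):
        if all(min(dist2(f, F[j], xs) for j in js) <= eps ** 2 for f in F):
            return js
    return None        # cannot happen if m is large enough for F and eps

p = select([0.1, 0.4, 0.9])
assert p is not None
```

The measurability of the functions $p_l$ corresponds exactly to this
"first matching vector" rule: the set of data vectors assigned to a
given index vector is $B(j_1,\dots,j_m)$ minus the union of the earlier
sets in the enumeration, a measurable set.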
\medskip
The proof of Lemmas~7.4A and~7.4B is almost the same. We only
have to modify the definition of the sets $B(j_1,\dots,j_m)$
in a natural way. The spaces of the arguments $x^{(n)}$ are
$X^{kn}$ and $X^{2kn}$ in these lemmas, and we have to integrate
with respect to the measures $\rho(x^{(n)})$ on the space $X^k$
and with respect to the measures $\alpha(x^{(n)})$ on the space
$X^k\times Y$, respectively. The sets $B(j_1,\dots,j_m)$ are
measurable also in these cases, and the rest of the proof can be
applied without any change.
\chapter{Formulation of the main results of this work}
Former chapters of this work contain estimates about the tail
distribution of normalized sums of independent, identically
distributed random variables and of the supremum of appropriate
classes of such random sums. They were considered together with
some estimates about the tail distribution of the integral of a
(deterministic) function with respect to a normalized empirical
distribution and of the supremum of such integrals. These two kinds
of problems are closely related, and to understand them better it
is useful to investigate them together with their natural Gaussian
counterpart.
In this chapter I formulate the natural multivariate versions of
these results. They will be proved in the subsequent chapters.
To formulate them we have to introduce some new notions. I shall
also discuss some new problems whose solutions help in their
proof. I finish this chapter with a short overview of the
content of the remaining part of this work.
I start this chapter with the formulation of two results,
Theorems~8.1 and~8.2 together with some simple
consequences. They yield a sharp estimate about the tail
distribution of a multiple random integral with respect to a
normalized empirical distribution and about the analogous
problem when the tail distribution of the supremum of such
integrals is considered. These results are the natural
versions of the corresponding one-variate results about the tail
behaviour of an integral or of the supremum of a class of
integrals with respect to a normalized empirical distribution.
They can be formulated with the help of the notions introduced
before, in particular with the help of the notion of multiple
random integrals with respect to a normalized empirical
distribution introduced in formula~(\ref{(4.8)}).
To formulate the following two results, Theorems~8.3 and~8.4 and
their consequences, which are the natural multivariate versions
of the results about the tail distribution of partial sums of
independent random variables, and of the supremum of such sums
we have to make some preparations. First we introduce the
so-called $U$-statistics which can be considered the natural
multivariate generalizations of the sum of independent and
identically distributed random variables. Beside this, observe
that in the one-variate case we had a good estimate of the
tail distribution of sums of independent random variables only
if the summands had expectation zero. We have to find the
natural multivariate version of this property. Hence we define
the so-called degenerate $U$-statistics which can be considered
as the natural multivariate counterparts of sums of independent
and identically distributed random variables with zero
expectation. Theorems~8.3 and~8.4 contain estimates about the
tail distribution of degenerate $U$-statistics and of the
supremum of such expressions.
In Theorems~8.5 and~8.6 I formulate the Gaussian counterparts of
the above results. They deal with multiple Wiener--It\^o integrals
with respect to a so-called white noise. The notion of white noise,
that of multiple Wiener--It\^o integrals with respect to it, and
those of their properties needed for a good understanding of these
results will be explained in Chapter~10. Two further results are
discussed in this chapter. They are Examples~8.7 and~8.8, which
state that the estimates of Theorems~8.5 and~8.3 are in a
certain sense sharp.
\medskip
To formulate the first two results of this chapter let us consider
a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in a measurable
space $(X,{\cal X})$. Let $\mu$ denote the distribution
of the random variables $\xi_j$, and introduce the empirical
distribution of the sequence $\xi_1,\dots,\xi_n$ defined
in~(\ref{(4.5)}). Given a measurable function $f(x_1,\dots,x_k)$
on the $k$-fold product space $(X^k,{\cal X}^k)$ consider its
integral $J_{n,k}(f)$ with respect to the $k$-fold product of
the normalized empirical distribution $\sqrt n(\mu_n-\mu)$
defined in formula~(\ref{(4.8)}). In the definition of this
integral the diagonals $x_j=x_l$, $1\le j<l\le k$, are omitted from
the domain of integration.
\medskip\noindent
{\bf Theorem 8.1 (Estimate on the tail distribution of a multiple
random integral with respect to a normalized empirical
distribution).}\index{estimate on the tail distribution of a
multiple random integral with respect to a normalized empirical
distribution}
{\it Let us have a non-atomic probability measure $\mu$ on a
measurable space $(X,{\cal X})$ together with a sequence of
independent and $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$. Let us consider a measurable function
$f(x_1,\dots,x_k)$ on the product space $(X^k,{\cal X}^k)$ which
satisfies the conditions
\begin{equation}
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}
|f(x_1,\dots,x_k)|\le 1 \label{(8.1)}
\end{equation}
and
\begin{equation}
\|f\|_2^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
\le\sigma^2 \label{(8.2)}
\end{equation}
with some $0<\sigma\le1$. Then there exist some constants
$C=C_k>0$ and $\alpha=\alpha_k>0$ such that the random integral
$J_{n,k}(f)$ defined in formulas~(\ref{(4.5)})
and~(\ref{(4.8)}) satisfies the
inequality
\begin{equation}
P(|k!J_{n,k}(f)|>u)\le C \max\left(e^{-\alpha(u/\sigma)^{2/k}},
e^{-\alpha(nu^2)^{1/(k+1)}} \right) \label{(8.3)}
\end{equation}
for all $u>0$. The constants $C=C_k>0$ and
$\alpha=\alpha_k>0$ in formula~(\ref{(8.3)}) depend only on the
parameter~$k$.}
\medskip
Theorem 8.1 can be reformulated in the following equivalent form.
\medskip\noindent
{\bf Theorem 8.1$'$.} {\it Under the conditions of Theorem 8.1
\begin{equation}
P(|k!J_{n,k}(f)|>u)\le C e^{-\alpha(u/\sigma)^{2/k}}
\quad \textrm{for all } 0<u\le n^{k/2}\sigma^{k+1} \label{($8.3'$)}
\end{equation}
with some constants $C=C_k>0$,
$\alpha=\alpha_k>0$, depending only on the multiplicity~$k$ of the
integral $J_{n,k}(f)$.}
\medskip
Theorem 8.1 clearly implies Theorem~$8.1'$, since in the case
$u\le n^{k/2}\sigma^{k+1}$ the first term is larger than the second
one in the maximum at the right-hand side of formula~(\ref{(8.3)}).
On the other hand, Theorem~$8.1'$ implies Theorem~8.1 also if
$u>n^{k/2}\sigma^{k+1}$. Indeed, in this case Theorem~$8.1'$ can be
applied with $\bar\sigma=\left(u n^{-k/2}\right)^{1/(k+1)}\ge \sigma$
if $u\le n^{k/2}$, since the condition $0<\bar\sigma\le1$ is satisfied.
This yields that
$P\left(|k!J_{n,k}(f)|>u\right)\le C\exp\left\{-\alpha
\left(\frac u{\bar\sigma}\right)^{2/k}\right\}=C\exp\left\{-\alpha
(nu^2)^{1/(k+1)}\right\}$ if $n^{k/2}\ge u>n^{k/2}\sigma^{k+1}$,
and relation~(\ref{(8.3)}) holds in this case. If $u>2^kn^{k/2}$,
then $P(k!|J_{n,k}(f)|>u)=0$, and if $n^{k/2}\le u<2^kn^{k/2}$,
then
\begin{eqnarray*}
&&P(|k!J_{n,k}(f)|>u)\le P(|k!J_{n,k}(f)|>n^{k/2}) \\
&& \qquad \le C \exp\left\{-\alpha(n\cdot (n^{k/2})^2)^{1/(k+1)}\right\}
\le C \exp\left\{-2^{-k}\alpha(nu^2)^{1/(k+1)}\right\}.
\end{eqnarray*}
Hence relation~(\ref{(8.3)}) holds (with a possibly different
parameter~$\alpha$) in these cases, too.
Theorems~8.1 and~$8.1'$ state that the tail distribution
$P(k!|J_{n,k}(f)|>u)$ of the $k$-fold random integral
$k!J_{n,k}(f)$ can be bounded similarly to the probability
$P(|\textrm{const.}\,\sigma\eta^k|>u)$, where $\eta$ is a random
variable with standard normal distribution, and the number
$0\le\sigma\le1$ satisfies relation (\ref{(8.2)}), provided that
the level~$u$ we consider is less than $n^{k/2}\sigma^{k+1}$. As
we shall see later (see Corollary~1 of Theorem~9.4), the value
of the number $\sigma^2$ in formula~(\ref{(8.2)}) is closely
related to the variance of $k!J_{n,k}(f)$. At the end of this
chapter an example is given which shows that the condition
$u\le n^{k/2}\sigma^{k+1}$ is really needed in Theorem~$8.1'$.
The next result, Theorem 8.2, is the generalization of Theorem~$4.1'$
for multiple random integrals with respect to a normalized empirical
measure. In its formulation the notions of $L_2$-dense classes and
countable approximability introduced in Chapter~4 are applied.
\medskip\noindent
{\bf Theorem 8.2 (Estimate on the supremum of multiple random
integrals with respect to an empirical
distribution).}\index{estimate on the supremum of multiple
random integrals with respect to an empirical distribution}
{\it Let us have a non-atomic probability measure
$\mu$ on a measurable space $(X,{\cal X})$ together with a countable
and $L_2$-dense class ${\cal F}$ of functions $f=f(x_1,\dots,x_k)$ of
$k$ variables with some parameter $D\ge2$ and exponent $L\ge1$ on
the product space $(X^k,{\cal X}^k)$ which satisfies the conditions
\begin{equation}
\|f\|_\infty=\sup\limits_{x_j\in X,\;1\le j\le k}|f(x_1,\dots,x_k)|\le 1,
\qquad \textrm{for all } f\in {\cal F} \label{(8.4)}
\end{equation}
and
\begin{eqnarray}
\|f\|_2^2=Ef^2(\xi_1,\dots,\xi_k)&&=\int f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(\,dx_k)\le \sigma^2 \nonumber \\
&&\qquad\qquad\qquad \textrm{for all } f\in {\cal F} \label{(8.5)}
\end{eqnarray}
with some constant $0<\sigma\le1$. There exist some constants
$C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$ depending only on
the parameter $k$ such that the supremum of the random integrals
$k!J_{n,k}(f)$, $f\in {\cal F}$, defined by formula~(\ref{(4.8)})
satisfies the inequality
\begin{eqnarray}
P\left(\sup_{f\in{\cal F}}|k!J_{n,k}(f)|\ge u\right)
&&\le C \exp\left\{-\alpha
\left(\frac u{\sigma}\right)^{2/k}\right\}
\quad \textrm{for those numbers } u\nonumber \\
\textrm{for which } n\sigma^2&&\ge
\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}),
\label{(8.6)}
\end{eqnarray}
where the numbers $D$ and $L$ agree with the parameter and exponent
of the $L_2$-dense class~${\cal F}$.
The condition about the countable cardinality of the class ${\cal F}$
can be replaced by the weaker condition that the class of random
variables $k!J_{n,k}(f)$, $f\in{\cal F}$, is countably approximable.}
\medskip
The condition given for the number~$u$ in formula~(\ref{(8.6)})
appears in Theorem~8.2 for a similar reason as the analogous
condition formulated in~(\ref{(4.4)}) in its one-variate counterpart,
Theorem~4.1. The lower bound is needed, since we have a good
estimate in formula~(\ref{(8.6)}) only for
$u\ge E\sup\limits_{f\in{\cal F}}|k!J_{n,k}(f)|$.
The upper bound appears, since we have a good estimate in
Theorem~$8.1'$ only for $0<u\le n^{k/2}\sigma^{k+1}$.
Before formulating the next result let us recall the definition of
$U$-statistics. Let $\xi_1,\dots,\xi_n$, $n\ge k$, be a sequence of
independent and identically distributed random variables with values
in a measurable space $(X,{\cal X})$ and with some distribution~$\mu$,
and let $f(x_1,\dots,x_k)$ be a measurable function on the product
space $(X^k,{\cal X}^k)$. The $U$-statistic of order~$k$ with this
kernel function~$f$ is defined as
\begin{equation}
I_{n,k}(f)=\frac1{k!}\sum_{\substack{1\le l_j\le n,\; j=1,\dots,k\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} f(\xi_{l_1},\dots,\xi_{l_k}).
\label{(8.7)}
\end{equation}
We call a function $f(x_1,\dots,x_k)$ canonical with respect to the
measure~$\mu$ if
$$
\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k)\mu(\,du)=0
\quad \textrm{for all } 1\le j\le k \textrm{ and } x_l\in X,\ l\neq j,
$$
and a $U$-statistic $I_{n,k}(f)$ with a canonical kernel function~$f$
is called a degenerate $U$-statistic.
\medskip\noindent
{\bf Theorem 8.3 (Estimate on the tail distribution of a degenerate
$U$-statistic).}\index{estimate on the tail distribution of a
degenerate $U$-statistic}
{\it Let $I_{n,k}(f)$ be a degenerate $U$-statistic of order~$k$
with sample size $n\ge k$ whose canonical (with respect to the
distribution~$\mu$ of the underlying random variables
$\xi_1,\dots,\xi_n$) kernel function~$f$ satisfies the conditions
$\|f\|_\infty\le1$ and $\|f\|_2^2=Ef^2(\xi_1,\dots,\xi_k)\le\sigma^2$
with some $0<\sigma\le1$. Then there exist some constants $A=A(k)>0$
and $B=B(k)>0$ depending only
on the order $k$ of the $U$-statistic $I_{n,k}(f)$ such that
\begin{equation}
P(n^{-k/2}|k!I_{n,k}(f)|>u)
\le A\exp\left\{-\frac{u^{2/k}}{2\sigma^{2/k}
\left(1+B\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}
\right)}\right\} \label{(8.10)}
\end{equation}
for all $0\le u\le n^{k/2}\sigma^{k+1}$.}
\medskip
Let us also formulate the following simple corollary of Theorem~8.3.
\medskip\noindent
{\bf Corollary of Theorem 8.3.} {\it Under the conditions
of Theorem~8.3 there exist some universal constants
$C=C(k)>0$ and $\alpha=\alpha(k)>0$ such
that
\begin{equation}
P(n^{-k/2}|k!I_{n,k}(f)|>u)
\le C\exp\left\{-\alpha\left(\frac u\sigma\right)^{2/k}
\right\} \quad \textrm{for all } 0\le u\le n^{k/2}\sigma^{k+1}.
\label{($8.10'$)}
\end{equation}
}
\medskip
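The corollary follows directly from Theorem~8.3. Indeed, for
$0\le u\le n^{k/2}\sigma^{k+1}$ we have
$\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\le1$, hence the
right-hand side of~(\ref{(8.10)}) can be bounded by

```latex
$$
A\exp\left\{-\frac{u^{2/k}}{2(1+B)\sigma^{2/k}}\right\}
=A\exp\left\{-\frac1{2(1+B)}\left(\frac u\sigma\right)^{2/k}\right\},
$$
```

and relation~(\ref{($8.10'$)}) holds with the choice $C=A$ and
$\alpha=\frac1{2(1+B)}$.
\medskip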
The following estimate holds about the supremum of degenerate
$U$-statistics.
\medskip\noindent
{\bf Theorem 8.4 (Estimate on the supremum of degenerate
$U$-sta\-tis\-tics).}\index{estimate on the supremum of
degenerate $U$-statistics}
{\it Let us have a probability
measure~$\mu$ on a measurable space $(X,{\cal X})$ together
with a countable and $L_2$-dense class ${\cal F}$ of functions
$f=f(x_1,\dots,x_k)$ of $k$ variables with some parameter
$D\ge2$ and exponent~$L\ge1$ on the product space
$(X^k,{\cal X}^k)$ which satisfies conditions~(\ref{(8.4)})
and~(\ref{(8.5)}) with some constant $0<\sigma\le1$. Let us
take a sequence of independent $\mu$ distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge k$, and consider the
$U$-statistics $I_{n,k}(f)$ with these random variables and
kernel functions $f\in{\cal F}$. Let us assume that all these
$U$-statistics $I_{n,k}(f)$, $f\in{\cal F}$, are degenerate,
or in an equivalent form, all functions $f\in {\cal F}$
are canonical with respect to the measure~$\mu$. Then there exist
some constants $C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$
depending only on the parameter $k$ such that the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}n^{-k/2}|k!I_{n,k}(f)|\ge u\right)\le C
\exp\left\{-\alpha \left(\frac u{\sigma}\right)^{2/k}\right\} \quad
\textrm{holds for those } \nonumber \\
&&\qquad \textrm{numbers } u \textrm{ for which } n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2}),
\label{(8.11)}
\end{eqnarray}
where the numbers $D$ and $L$ agree with the parameter and
exponent of the $L_2$-dense class~${\cal F}$.
The condition about the countable cardinality of the class ${\cal F}$
can be replaced by the weaker condition that the class of random
variables $n^{-k/2}I_{n,k}(f)$, $f\in{\cal F}$, is countably
approximable.}
\medskip
Next I formulate a Gaussian counterpart of the above results. To do
this I need some notions that will be introduced in Chapter~10. In
that chapter the white noise with a reference measure $\mu$ will
be defined. It is an appropriate set of jointly Gaussian random
variables indexed by those measurable sets $A\in {\cal X}$ of a
measure space $(X,{\cal X},\mu)$ with a $\sigma$-finite
measure~$\mu$ for which $\mu(A)<\infty$. Its distribution depends
on the measure~$\mu$ which will be called the reference measure of
the white noise.
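Although the precise definition is postponed to Chapter~10, it may
help orientation to record already here the two properties
characterizing a white noise $\mu_W$ with reference measure~$\mu$:
the random variables $\mu_W(A)$ are jointly Gaussian with

```latex
$$
E\mu_W(A)=0, \qquad E\mu_W(A)\mu_W(B)=\mu(A\cap B)
\quad \textrm{for all } A,B\in{\cal X} \textrm{ with }
\mu(A)<\infty,\; \mu(B)<\infty.
$$
```

In particular, $\mu_W(A)$ and $\mu_W(B)$ are independent if
$A\cap B=\emptyset$, and in this case
$\mu_W(A\cup B)=\mu_W(A)+\mu_W(B)$ with probability~1.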
In Chapter~10 it will also be shown that given a white noise $\mu_W$
with a non-atomic $\sigma$-additive reference measure $\mu$ on a
measurable space $(X,{\cal X})$ and a measurable function
$f(x_1,\dots,x_k)$ of $k$ variables on the product space
$(X^k,{\cal X}^k)$ such that
\begin{equation}
\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2<\infty
\label{(8.12)}
\end{equation}
a $k$-fold Wiener--It\^o integral of the function $f$ with respect
to the white noise~$\mu_W$
\begin{equation}
Z_{\mu,k}(f)=\frac1{k!}\int f(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots \mu_W(\,dx_k) \label{(8.13)}
\end{equation}
can be defined, and the main properties of this integral will be
proved there. It will be seen that Wiener--It\^o integrals have a
similar relation to degenerate $U$-statistics and multiple
integrals with respect to normalized empirical measures as
normally distributed random variables have to partial sums of
independent random variables. Hence it is useful to find the
analogues of the previous results for the
tail distribution of Wiener--It\^o integrals. This will be done in
Theorems~8.5 and~8.6.
\medskip\noindent
{\bf Theorem 8.5 (Estimate on the tail distribution of a multiple
Wiener--It\^o integral).}\index{estimate on the tail distribution
of a multiple Wiener--It\^o integral}
{\it Let us fix a measurable space
$(X,{\cal X})$ together with a $\sigma$-finite non-atomic
measure~$\mu$ on it, and let $\mu_W$ be a white noise with reference
measure $\mu$ on $(X,{\cal X})$. If $f(x_1,\dots,x_k)$ is a measurable
function on $(X^k,{\cal X}^k)$ which satisfies relation~(\ref{(8.12)})
with some $0<\sigma<\infty$, then
\begin{equation}
P(|k!Z_{\mu,k}(f)|>u)\le C \exp\left\{-\frac12\left(\frac
u\sigma\right)^{2/k}\right\} \label{(8.14)}
\end{equation}
for all $u>0$ with some constant $C=C(k)$ depending only on~$k$.}
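\medskip
For $k=1$ the integral $Z_{\mu,1}(f)=\int f(x)\mu_W(\,dx)$ is a
normally distributed random variable with expectation zero and
variance $\|f\|_2^2\le\sigma^2$, hence the estimate~(\ref{(8.14)})
reduces in this case to the usual Gaussian tail bound

```latex
$$
P(|Z_{\mu,1}(f)|>u)\le C e^{-u^2/2\sigma^2}
\quad \textrm{for all } u>0.
$$
```

The exponent $2/k$ for general~$k$ corresponds to the tail behaviour
of the $k$-th power of a standard normal random variable.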
\medskip\noindent
{\bf Theorem 8.6 (Estimate on the supremum of Wiener--It\^o
integrals).}\index{estimate on the supremum of Wiener--It\^o integrals}
{\it Let ${\cal F}$ be a countable class of functions
of $k$ variables defined on the $k$-fold product $(X^k,{\cal X}^k)$
of a measurable space $(X,{\cal X})$ such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots \mu(\,dx_k)\le\sigma^2
\quad \textrm{\rm with some } 0<\sigma\le1 \textrm { \rm for all }
f\in {\cal F}
$$
with some non-atomic $\sigma$-additive measure~$\mu$ on $(X,{\cal X})$.
Let us also assume that ${\cal F}$ is an $L_2$-dense class of functions
in the space $(X^k,{\cal X}^k)$ with respect to the measure~$\mu^k$
with some exponent~$L\ge1$ and parameter~$D\ge1$, where $\mu^k$ is
the $k$-fold product of the measure~$\mu$. (The notion of
$L_2$-dense classes with respect to a measure was defined in
Chapter~4.)
Take a white noise $\mu_W$ on $(X,{\cal X})$ with reference measure
$\mu$, and define the Wiener--It\^o integrals $Z_{\mu,k}(f)$ for
all $f\in{\cal F}$. Fix some $0<\varepsilon\le1$. The inequality
\begin{equation}
P\left(\sup_{f\in {\cal F}} |k!Z_{\mu,k}(f)|>u\right)\le CD
\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k}\right\}\label{(8.15)}
\end{equation}
holds for those numbers~$u$ which satisfy the inequality
$u\ge ML^{k/2}\sigma\frac1\varepsilon
(\log^{k/2}\frac2\varepsilon+\log^{k/2}\frac2\sigma)$.
Here $C=C(k)>0$, $M=M(k)>0$ are some universal constants
depending only on the multiplicity~$k$ of the integrals.}
\medskip\noindent
{\it Remark:}\/ Theorem 8.6 is the multivariate version of
Theorem~4.2 about the tail distribution of the supremum of
Gaussian random variables. In Theorem~4.2 we could get good
estimates for those levels~$u$ which satisfy the inequality
$u\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$ with an appropriate
constant, while in Theorem~8.6 we have a similar estimate under
the condition $u\ge\textrm{const.}\,\sigma\log^{k/2}\frac2\sigma$
with an appropriate constant. In Chapter~4 we presented an example
which shows that the above condition on the level~$u$ in
Theorem~4.2 cannot be dropped. A similar example can be given
about the necessity of the analogous condition in Theorem~8.6
with the help of the subsequent Example~8.7.
Put $f_{s,t}(u_1,\dots,u_k)=\prod\limits_{j=1}^k f^0_{s,t}(u_j)$,
where $f^0_{s,t}(u)$ denotes the indicator function of the
interval~$[s,t]$. Take the class of functions
$$
{\cal F}={\cal F}_\sigma=\{f_{s,t}\colon\;0\le s<t\le1,\;
(t-s)^k\le\sigma^2\}
$$
with some $0<\sigma\le1$, and take a white noise $\mu_W$ with the
Lebesgue measure on the unit interval as its reference measure
together with the integrals $Z(f_{s,t})=k!Z_{\mu,k}(f_{s,t})$,
$f_{s,t}\in{\cal F}_\sigma$. It can be shown that there exists a
constant $c>0$ such that
$P\left(\sup\limits_{f_{s,t}\in{\cal F}_\sigma} Z(f_{s,t})>
c\sigma\log^{k/2}\frac2\sigma\right)\to1$ as $\sigma\to0$. Beside
this, it can be seen that ${\cal F}$ is an $L_2$-dense class
with respect to the Lebesgue measure. This implies that the lower
bound imposed on~$u$ in Theorem~8.6 cannot be dropped. I omit the
details of the proof.
\medskip
Formula~(\ref{(8.15)}) yields an almost as good estimate for the
supremum of Wiener--It\^o integrals with the choice of a small
$\varepsilon>0$ as formula~(\ref{(8.14)}) for a single
Wiener--It\^o integral. But the lower bound imposed on the
number~$u$ in the estimate~(\ref{(8.15)}) depends on $\varepsilon$,
and for a small number $\varepsilon>0$ it is large.
The subsequent result presented in Example~8.7 may help to
understand why Theorems~8.3 and~8.5 are sharp. Its proof and
the discussion of the question about the sharpness
of Theorems~8.3 and~8.5 will be postponed to Chapter~13.
\medskip\noindent
{\bf Example 8.7 (A converse estimate to Theorem 8.5).} {\it Let
us have a $\sigma$-finite measure $\mu$ on some measure space
$(X,{\cal X})$ together with a white noise $\mu_W$ on $(X,{\cal X})$
with reference measure~$\mu$. Let $f_0(x)$ be a real valued function
on $(X,{\cal X})$ such that $\int f_0(x)^2\mu(\,dx)=1$, and take the
function $f(x_1,\dots,x_k)=\sigma f_0(x_1)\cdots f_0(x_k)$ with
some number $\sigma>0$ together with the Wiener--It\^o integral
$Z_{\mu,k}(f)$ introduced in formula~(\ref{(8.13)}).
Then the relation
$\int f(x_1,\dots,x_k)^2\,\mu(\,dx_1)\dots\,\mu(\,dx_k)=\sigma^2$
holds, and the Wiener--It\^o integral $Z_{\mu,k}(f)$ satisfies the
inequality
\begin{equation}
P(|k!Z_{\mu,k}(f)|>u)
\ge \frac{\bar C}{\left(\frac u\sigma\right)^{1/k}+1}
\exp\left\{-\frac12\left(\frac u\sigma\right)^{2/k}\right\}\quad
\textrm{for all } u>0 \label{(8.16)}
\end{equation}
with some constant $\bar C>0$.}
\medskip
The above results show that multiple integrals with respect to a
normalized empirical distribution or degenerate $U$-statistics
satisfy some estimates similar to those about multiple Wiener--It\^o
integrals, but they hold under more restrictive conditions. The
difference between the estimates in these problems is similar to
the difference between the corresponding results in Chapter~4 whose
reason was explained there. Hence this will be only briefly
discussed here.
The estimates of
Theorems~8.1 and~8.3 are similar to that of Theorem~8.5. Moreover,
for $0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a small
number $\varepsilon>0$ Theorem~8.3 yields an almost as good
estimate about degenerate $U$-statistics as Theorem~8.5 yields
for a Wiener--It\^o integral with the same kernel function $f$ and
underlying measure $\mu$. Example~8.7 shows that the constant in
the exponent of formula~(\ref{(8.14)}) cannot be improved, at
least there is no possibility of an improvement if only the
$L_2$-norm of the kernel function $f$ is known. Some results
discussed later indicate that the estimate of Theorem~8.3
cannot be improved either.
The main difference between Theorem~8.5 and the results of
Theorem~8.1 or~8.3 is that in the latter case the kernel
function~$f$ must satisfy not only an $L_2$ but also an $L_\infty$
norm type condition, and the estimates of these results are
formulated under the additional condition
$u\le n^{k/2}\sigma^{k+1}$. It can be shown that the condition about
the $L_\infty$ norm of the kernel function cannot be dropped from
the conditions of these theorems, and a version of Example~3.3 will
be presented in Example~8.8 which shows that in the case
$u\gg n^{k/2}\sigma^{k+1}$ the left-hand side of~(\ref{(8.10)})
may satisfy only a much weaker estimate. This estimate will be
given only for $k=2$, but with some work it can be generalized
for general indices~$k$.
Theorems~8.2, 8.4 and~8.6 show that for the tail distribution of the
supremum of a not too large class of degenerate $U$-statistics or
multiple integrals a similar upper bound can be given as for the tail
distribution of a single degenerate $U$-statistic or multiple integral,
only the universal constants may be worse in the new estimates.
However, they hold only under the additional condition that the level
at which the tail distribution of the supremum is estimated is not too
low. A similar phenomenon appeared already in the results of Chapter~4.
Moreover, such a restriction had to be imposed in the formulation of
the results here and in Chapter~4 for the same reason.
In Theorems~8.2 and~8.4 an $L_2$-dense class of kernel functions was
considered, and this meant that the class of random integrals or
$U$-statistics we consider in these results is not too large. In
Theorem~8.6 a similar, but weaker condition was imposed on the class
of kernel functions. They had to satisfy a similar condition, but
only for the reference measure $\mu$ of the white noise appearing in
the Wiener--It\^o integral. A similar difference appears in the
comparison of Theorems~4.1 or~$4.1'$ with Theorem~4.2, and this
difference has the same reason in the two cases.
Next I present the proof of the following Example~8.8 which is a
multivariate version of Example~3.3. For the sake of simplicity
I restrict my attention to the case $k=2$.
\medskip\noindent
{\bf Example 8.8 (A converse estimate to Theorem 8.3).} {\it Let us
take a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ with values in the plane $X=R^2$ such
that $\xi_j=(\eta_{j,1},\eta_{j,2})$, where $\eta_{j,1}$ and
$\eta_{j,2}$ are independent random variables with the following
distributions.
The distribution of $\eta_{j,1}$ is defined with the help of a
parameter $\sigma^2$, $0<\sigma^2\le\frac18$, in the same way as
the distribution of the random variables $X_j$ in Example~3.3, i.e.
$\eta_{j,1}=\bar\eta_{j,1}-E\bar\eta_{j,1}$ with
$P(\bar\eta_{j,1}=1)=\bar\sigma^2$,
$P(\bar\eta_{j,1}=0)=1-\bar\sigma^2$, where $\bar\sigma^2$ is that
solution of the equation $x^2-x+\sigma^2=0$, which is smaller
than~$\frac12$. The distribution of the random variables
$\eta_{j,2}$ is given by the formula
$P(\eta_{j,2}=1)=P(\eta_{j,2}=-1)=\frac12$ for all $1\le j\le n$.
Introduce the function $f(x,y)=f((x_1,x_2),(y_1,y_2))=x_1y_2+x_2y_1$,
$x=(x_1,x_2)\in R^2$, $y=(y_1,y_2)\in R^2$ if $(x,y)$ is in the
support of the distribution of the random vector $(\xi_1,\xi_2)$,
i.e. if $x_1$ and $y_1$ take the values $1-\bar\sigma^2$ or
$-\bar\sigma^2$ and $x_2$ and $y_2$ take the values $\pm1$. Put
$f(x,y)=0$ otherwise. Define the $U$-statistic
$$
I_{n,2}(f)=\frac12\sum_{1\le j,k\le n,\,j\neq k} f(\xi_j,\xi_k)
=\frac12\sum_{1\le j,k\le n,\,j\neq k}
(\eta_{j,1}\eta_{k,2}+\eta_{k,1}\eta_{j,2})
$$
of order 2 with the above kernel function $f$ and sequence of
independent random variables $\xi_1,\dots,\xi_n$. Then $I_{n,2}(f)$
is a degenerate $U$-statistic such that $\sup |f(x,y)|\le 1$ and
$Ef^2(\xi_j,\xi_j)=\sigma^2$.
If $u\ge B_1n\sigma^3$ with some appropriate constant $B_1>2$,
$\bar B_2^{-1}n\ge u\ge \bar B_2 n^{-1/2}$ with a sufficiently
large fixed number $\bar B_2>0$ and
$\frac14\ge\sigma^2\ge\frac1{n^2}$, and $n$ is a sufficiently
large number, then the estimate
\begin{equation}
P(n^{-1}I_{n,2}(f)>u)\ge \exp\left\{-Bn^{1/3}u^{2/3}\log
\left(\frac u{n\sigma^3}\right)\right\} \label{(8.17)}
\end{equation}
holds with some $B>0$.}
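\medskip
The construction of this example can also be examined numerically.
The following simulation sketch (an illustration only: the function
names and parameter values are ad hoc, and no such computation
appears in the text) generates the vectors
$\xi_j=(\eta_{j,1},\eta_{j,2})$ and evaluates $I_{n,2}(f)$ through
the identity
$I_{n,2}(f)=\left(\sum_j \eta_{j,1}\right)
\left(\sum_j \eta_{j,2}\right)-\sum_j \eta_{j,1}\eta_{j,2}$,
which also underlies inequality~(\ref{(8.18)}) in the proof below.

```python
import math
import random

def bar_sigma2(sigma2):
    # The solution of x^2 - x + sigma2 = 0 that is smaller than 1/2
    # (requires sigma2 <= 1/4; Example 8.8 assumes sigma2 <= 1/8).
    return (1.0 - math.sqrt(1.0 - 4.0 * sigma2)) / 2.0

def sample_xi(n, sigma2, rng):
    # Draw xi_j = (eta_{j,1}, eta_{j,2}): eta_{j,1} is a centred
    # Bernoulli(bar_sigma2) variable, eta_{j,2} a Rademacher variable.
    b = bar_sigma2(sigma2)
    return [((1 if rng.random() < b else 0) - b,
             1 if rng.random() < 0.5 else -1)
            for _ in range(n)]

def I_n2(xs):
    # I_{n,2}(f) = (1/2) sum_{j != k} (eta_{j,1} eta_{k,2}
    #                                  + eta_{k,1} eta_{j,2})
    #            = (sum_j eta_{j,1}) (sum_j eta_{j,2})
    #              - sum_j eta_{j,1} eta_{j,2}
    s1 = sum(x[0] for x in xs)
    s2 = sum(x[1] for x in xs)
    diag = sum(x[0] * x[1] for x in xs)
    return s1 * s2 - diag
```

Since both coordinates are centred and independent,
$EI_{n,2}(f)=0$, and averaging many independent realizations of
`I_n2(sample_xi(n, sigma2, rng))` indeed gives values close to
zero, in accordance with the degeneracy of the statistic.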
\medskip\noindent
{\it Remark:}\/ In Theorem~8.3 we got the estimate
$P(n^{-1}I_{n,2}(f)>u)\le e^{-\alpha u/\sigma}$ for the above
defined degenerate $U$-statistic $I_{n,2}(f)$ if
$0\le u\le n\sigma^3$. In the particular case $u=n\sigma^3$
we have the estimate
$P(n^{-1}I_{n,2}(f)>n\sigma^3)\le e^{-\alpha n\sigma^2}$. On the
other hand, the above example shows that in the case
$u\gg n\sigma^3$
we can get only a weaker estimate. It is worth looking at the
estimate~(\ref{(8.17)}) with fixed parameters $n$ and $u$ and
to observe the dependence of the upper bound on the variance
$\sigma^2$ of $I_{n,2}(f)$. In the case $\sigma^2=u^{2/3}n^{-2/3}$
we have the upper bound $e^{-\alpha n^{1/3}u^{2/3}}$. Example~8.8
shows that in the case $\sigma^2\ll u^{2/3}n^{-2/3}$ we can get
only a relatively small improvement of this estimate. A similar
picture appears as in Example~3.3 in the case $k=1$.
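Indeed, at the boundary $\sigma^2=u^{2/3}n^{-2/3}$ the two exponents
agree, since in this case

```latex
$$
\frac u\sigma=\frac u{u^{1/3}n^{-1/3}}=n^{1/3}u^{2/3},
$$
```

and for $\sigma^2\ll u^{2/3}n^{-2/3}$ the lower bound~(\ref{(8.17)})
shows that the tail probability may be smaller than
$e^{-\alpha n^{1/3}u^{2/3}}$ only by a logarithmic factor in the
exponent.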
\medskip
It is simple to check that the $U$-statistic introduced in the
above example is degenerate because of the independence of the
random variables $\eta_{j,1}$ and $\eta_{j,2}$ and the identity
$E\eta_{j,1}=E\eta_{j,2}=0$. Beside this,
$Ef(\xi_j,\xi_j)^2=\sigma^2$. In the proof of the
estimate~(\ref{(8.17)})
the results of Chapter~3, in particular Example~3.3 can be applied
for the sequence $\eta_{j,1}$, $j=1,2,\dots,n$. Beside this, the
following result, known from the theory of large deviations, will
be applied. If $X_1,\dots,X_n$ are independent and identically
distributed random variables, $P(X_1=1)=P(X_1=-1)=\frac12$, then
for any number $0\le \alpha<1$ there exist some numbers
$C_1=C_1(\alpha)>0$ and $C_2=C_2(\alpha)>0$ such that
$P\left(\sum\limits_{j=1}^nX_j >u\right)\ge C_1e^{-C_2u^2/n}$ for all
$0\le u\le \alpha n$.
\medskip\noindent
{\it Proof of Example 8.8.}\/ The inequality
\begin{eqnarray}
&&P(n^{-1}I_{n,2}(f)>u) \label{(8.18)} \\
&&\qquad \ge P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right)
-P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right) \nonumber
\end{eqnarray}
holds. Because of the independence of the random variables
$\eta_{j,1}$ and $\eta_{j,2}$ the first probability at the
right-hand side of (\ref{(8.18)}) can be bounded from below
by bounding
the multiplicative terms in it with $v_1=4n^{1/3}u^{2/3}$ and
$v_2=n^{2/3}u^{1/3}$. The first term will be estimated by means
of Example 3.3. This estimate can be applied with the choice
$y=v_1$, since the relation $v_1\ge 4n\sigma^2$ holds if
$u\ge B_1n\sigma^3$ with $B_1>1$, and the remaining conditions
$0\le \sigma^2\le\frac18$ and $n\ge4v_1\ge6$ also hold under the
conditions of Example~8.8. The second term can be bounded with
the help of the large-deviation result mentioned after the
remark, since $v_2\le \frac12 n$ if $u\le \bar B_2^{-1}n$ with
a sufficiently large $\bar B_2>0$. In such a way we get the
estimate
\begin{eqnarray*}
&&P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>4nu\right) \\
&&\qquad \ge P\left(\sum_{j=1}^n\eta_{j,1} >v_1\right)
P\left(\sum_{j=1}^n\eta_{j,2}>v_2\right) \\
&&\qquad \ge C\exp\left\{-B_1v_1\log
\left(\frac{v_1}{n\sigma^2}\right)-B_2\frac{v_2^2}{n}\right\} \\
&&\qquad \ge C\exp\left\{-B_3n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\}
\end{eqnarray*}
with appropriate constants $B_1>1$, $B_2>0$ and $B_3>0$. On the
other hand, by applying Bennett's inequality, more precisely its
consequence given in formula~(\ref{(3.4)}) for the sum of the random
variables $X_j=\eta_{j,1}\eta_{j,2}$ at level $nu$ instead of
level~$u$ we get the following upper bound for the second term at
the right-hand side of~(\ref{(8.18)}).
\begin{eqnarray*}
P\left(\sum_{j=1}^n\eta_{j,1}\eta_{j,2}>2nu\right)
&\le& \exp\left\{-Knu\log \frac u{\sigma^2}\right\} \\
&\le& \exp\left\{-2B_4n^{1/3}u^{2/3}
\log\left(\frac u{n\sigma^3}\right)\right\},
\end{eqnarray*}
since $E\eta_{j,1}\eta_{j,2}=0$,
$E\eta^2_{j,1}\eta^2_{j,2}=\sigma^2$,
$nu\ge B_1n^2\sigma^3\ge 2n\sigma^2$ because of the
conditions $B_1>2$ and $n\sigma\ge1$. Hence the
estimate~(\ref{(3.4)})
(with parameter $nu$) can be applied in this case. Beside this,
the constant $B_4$ can be chosen sufficiently large in the last
inequality if the number~$n$ or the bound~$\bar B_2$ in
Example~8.8 is chosen sufficiently large. This means that this
term is negligibly small. The above estimates imply the
statement of Example~8.8.
\hfill$\qed$
\medskip
Let me remark that under some mild additional restrictions the
estimate (\ref{(8.17)}) can be slightly sharpened, the term
$\log$ can be replaced by $\log^{2/3}$ in the exponent of the
right-hand side of~(\ref{(8.17)}). To get such an estimate
some additional calculation is needed where the numbers
$v_1$ and $v_2$ are replaced by
$\bar v_1=4n^{1/3}u^{2/3}\log^{-1/3}\left(\frac u{n\sigma^3}\right)$
and
$\bar v_2=n^{2/3}u^{1/3}\log^{1/3}\left(\frac u{n\sigma^3}\right)$.
\medskip
I finish this chapter with a short overview about the remaining
part of this work.
In our proofs we needed some results about $U$-statistics, and this
is the main topic of Chapter~9. One of the results discussed there
is the so-called Hoeffding decomposition of $U$-statistics into a
linear combination of degenerate $U$-statistics of different order.
We also needed some additional results which explain how certain
properties (e.g.\ a bound on the $L_2$ and $L_\infty$ norms of a
kernel function, or the $L_2$-density property of a class~${\cal F}$
of kernel functions) are inherited when we turn from the original
$U$-statistics to the degenerate $U$-statistics appearing in
their Hoeffding decomposition. Chapter~9 contains some results
in this direction. Another important result in it is Theorem~9.4
which yields a decomposition of multiple integrals with respect
to a normalized empirical distribution into a linear combination
of degenerate $U$-statistics. This result is very similar to the
Hoeffding decomposition of $U$-statistics. The main difference
between them is that in the decomposition of multiple integrals
much smaller coefficients appear. Theorem~9.4 makes it possible to
reduce the proof of Theorems~8.1 and~8.2 to the corresponding
results in Theorems~8.3 and~8.4 about degenerate $U$-statistics.
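To indicate the shape of the Hoeffding decomposition mentioned
above (this is only a preview; the precise statement, together with
the definition of the canonical functions appearing in it, is given
in Chapter~9): for a symmetric kernel function $f$ of $k$ variables
it yields a representation

```latex
$$
f(x_1,\dots,x_k)=\sum_{V\subset\{1,\dots,k\}} f_{|V|}(x_j,\;j\in V),
\qquad
I_{n,k}(f)=\sum_{v=0}^{k}\binom{n-v}{k-v}\, I_{n,v}(f_v)
$$
```

with canonical functions $f_v$ of $v$ variables, where
$I_{n,0}(f_0)=f_0=Ef(\xi_1,\dots,\xi_k)$ is a constant, so that the
$U$-statistics $I_{n,v}(f_v)$, $1\le v\le k$, are degenerate.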
The definition and the main properties of Wiener--It\^o integrals
needed in the proof of Theorems~8.5 and~8.6 are presented in
Chapter~10. It also contains a result, called the diagram formula
for Wiener--It\^o integrals which plays an important role in our
considerations. Beside this, a limit theorem is proved there which
expresses the limit of normalized degenerate $U$-statistics with
the help of multiple Wiener--It\^o integrals. This result may
explain why it is natural to consider Theorem~8.5 as the
natural Gaussian counterpart of Theorem~8.3, and Theorem~8.6 as
the natural Gaussian counterpart of Theorem~8.4.
We could prove Bernstein's and Bennett's inequality by means of a
good estimation of the exponential moments of the partial sums we
were investigating. In the proof of their multivariate versions,
Theorems~8.3 and~8.5, this method does not work, because the
exponential moments we have to bound in these cases may be
infinite. On the other hand, we could prove these results by means
of a good estimate on the high moments of the random variables
whose tail distribution we wanted to bound. In the proof of
Theorem~8.5 the moments of multiple Wiener--It\^o integrals
have to be bounded, and this can be done with the help of the
diagram formula for Wiener--It\^o integrals. In Chapters~11
and~12 we proved that there is a version of the diagram formula
for degenerate $U$-statistics, and this enables us to estimate
the moments needed in the proof of Theorem~8.3. In Chapter~13
we proved Theorems~8.3, 8.5 and a multivariate version of the
Hoeffding inequality. At the end of that chapter some further
results are discussed which state that in certain cases, when
we have some useful additional information about the behaviour
of the kernel function~$f$ beside the upper bound on its
$L_2$ and $L_\infty$ norms, the estimates of Theorems~8.3 or~8.5
can be improved.
Chapter~14 contains the natural multivariate versions of the
results in Chapter~6. In Chapter~6 Theorem~4.2 about the supremum
of Gaussian random variables was proved, and Chapter~14 contains
the proof of its multivariate version, Theorem~8.6. Both results
are proved
with the help of the chaining argument. On the other hand, the
chaining argument is not strong enough to prove Theorem~4.1.
But as it is shown in Chapter~6, it enables us to prove a result
formulated in Proposition~6.1, and to reduce the proof of
Theorem~4.1 with its help to a simpler result formulated in
Proposition~6.2. One of the results in Chapter~14,
Proposition~14.1, is a multivariate version of Proposition~6.1.
We showed that the proof of Theorem~8.4 can be reduced with its
help to the proof of a result formulated in Proposition~14.2,
which can be considered a multivariate version of Proposition~6.2.
Chapter~14 also contains another result. It turned out that
it is simpler to work with so-called decoupled $U$-statistics
introduced in this chapter than with usual $U$-statistics,
because they have more independence properties. In
Proposition~$14.2'$ a version of Proposition~14.2 is formulated
about degenerate $U$-statistics, and it is shown with the help
of a result of de la Pe\~na and Montgomery--Smith that the proof
of Proposition~14.2, and thus of Theorem~8.4 can be reduced to
the proof of Proposition~$14.2'$.
Proposition~$14.2'$ is proved similarly to its one-variate
version, Proposition~6.2. The strategy of the proof is explained
in Chapter~15. The main difference between the proof of the two
propositions is that since the independence properties exploited
in the proof of Proposition~6.2 hold only in a weaker form in
this case, we have to apply a more refined and more difficult
argument. In particular, instead of the symmetrization lemma,
Lemma~7.1, we have to apply a more general version of it.
We presented an appropriate version of this result in Lemma~15.2.
It is hard to check the conditions of Lemma~15.2 when we try to
apply it in the problems arising in the proof of
Proposition~$14.2'$. This is the reason why we had to prove
Proposition~$14.2'$ with the help of two inductive propositions,
formulated in Propositions~15.3 and~15.4, while in the proof of
Proposition~6.2 it was enough to prove a single result, presented
in Proposition~7.3. We discuss the details of the problems and
the strategy of the proof in Chapter~15. The proof of
Propositions~15.3 and~15.4 is given in Chapters~16 and~17.
Chapter~16 contains the symmetrization arguments we need,
and the proof is completed with their help in Chapter~17.

Finally, in Chapter~18 we give an overview of this work and
explain its relation to some related research. The proofs of
some results are given in the Appendix.
\chapter{Some results about $U$-statistics}
This chapter contains the proof of the Hoeffding decomposition
theorem, an important result about $U$-statistics. It states that
all $U$-statistics can be represented as a sum of degenerate
$U$-statistics of different order. This representation can be
considered as the natural multivariate version of the
decomposition of a sum of independent random variables into the
sum of independent random variables with expectation zero plus
a constant (which can be interpreted as a degenerate
$U$-statistic of order zero). Some important properties of the Hoeffding
decomposition will also be proved. In particular, it will be
investigated how some properties of the kernel function of a
$U$-statistic are inherited by the kernel
functions of the $U$-statistics in its Hoeffding decomposition.
If the Hoeffding decomposition of a $U$-statistic is taken, then
the $L_2$ and $L_\infty$-norms of the kernel functions appearing
in the $U$-statistics of the Hoeffding decomposition will be
bounded by means of the corresponding norm of the kernel function
of the original $U$-statistic. It will also be shown that if we
take a class of $U$-statistics with an $L_2$-dense class of kernel
functions (and the same sequence of independent and identically
distributed random variables in the definition of each
$U$-statistic) and consider the Hoeffding decomposition of all
$U$-statistics in this class, then the kernel functions of the
degenerate $U$-statistics appearing in these Hoeffding
decompositions also constitute an $L_2$-dense class. Another
important result of this chapter is Theorem~9.4. It yields a
decomposition of a $k$-fold random integral with respect to a
normalized empirical distribution into a linear combination of
degenerate $U$-statistics. This result enables us to derive
Theorem~8.1 from Theorem~8.3 and Theorem~8.2 from Theorem~8.4,
and it is also useful in the proof of Theorems~8.3 and~8.4.

Let us first consider Hoeffding's decomposition. In the
special case $k=1$ it states that the sum
$S_n=\sum\limits_{j=1}^n\xi_j$ of independent and identically
distributed random variables can be rewritten as
$S_n=\sum\limits_{j=1}^n(\xi_j-E\xi_j)
+\left(\sum\limits_{j=1}^nE\xi_j\right)$, i.e.\
as the sum of independent random variables with zero expectation
plus a constant. We introduced the convention that a constant is
the kernel function of a degenerate $U$-statistic of order zero,
and $I_{n,0}(c)=c$ for a $U$-statistic of order zero. I wrote
down the above trivial formula, because Hoeffding's decomposition
is actually its adaptation to a more general situation. To
understand this let us first see how to adapt the above
construction to the case $k=2$.
In this case a sum of the form
$2I_{n,2}(f)=\sum\limits_{1\le j,k\le n,j\neq k} f(\xi_j,\xi_k)$
has to be considered. Write
$f(\xi_j,\xi_k)=[f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)]+
E(f(\xi_j,\xi_k)|\xi_k)=f_1(\xi_j,\xi_k)+\bar f_1(\xi_k)$ with
$f_1(\xi_j,\xi_k)=f(\xi_j,\xi_k)-E(f(\xi_j,\xi_k)|\xi_k)$, and
$\bar f_1(\xi_k)=E(f(\xi_j,\xi_k)|\xi_k)$ to make the conditional
expectation of $f_1(\xi_j,\xi_k)$ with respect to $\xi_k$ equal
zero. Repeating this procedure for the first coordinate we define
$f_2(\xi_j,\xi_k)=f_1(\xi_j,\xi_k)-E(f_1(\xi_j,\xi_k)|\xi_j)$ and
$\bar f_2(\xi_j)=E(f_1(\xi_j,\xi_k)|\xi_j)$.
Let us also write $\bar f_1(\xi_k)=
[\bar f_1(\xi_k)-E\bar f_1(\xi_k)]+E\bar f_1(\xi_k)$ and
$\bar f_2(\xi_j)=[\bar f_2(\xi_j)-E\bar f_2(\xi_j)]
+E\bar f_2(\xi_j)$.
A simple calculation shows that $2I_{n,2}(f_2)$ is a degenerate
$U$-statistic of order 2, and the identity
$2I_{n,2}(f)=2I_{n,2}(f_2)+I_{n,1}((n-1)(\bar f_1-E\bar f_1))+
I_{n,1}((n-1)(\bar f_2-E\bar f_2))+n(n-1)E(\bar f_1+\bar f_2)$
yields the decomposition of $I_{n,2}(f)$ into a sum of degenerate
$U$-statistics of different orders.
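The $k=2$ construction above can be checked numerically on a finite
state space. The following sketch is only an illustration: the
distribution $p$ and the kernel $f$ are arbitrary choices, not objects
taken from the text. It verifies that $f_2$ is canonical (integrating
out either variable gives zero) and that $f=f_2+\bar f_2+\bar f_1$
pointwise.

```python
# Finite-state check of the k = 2 construction; the distribution p and
# the kernel f are illustrative choices.
p = [0.5, 0.3, 0.2]              # distribution of each xi_j on {0, 1, 2}
f = [[1.0, 2.0, 0.5],
     [0.0, 3.0, 1.5],
     [2.5, 1.0, 4.0]]            # an arbitrary kernel f(x, y)
m = len(p)

# bar_f1(y) = E f(xi, y): integrate out the first coordinate
bar_f1 = [sum(p[x] * f[x][y] for x in range(m)) for y in range(m)]
f1 = [[f[x][y] - bar_f1[y] for y in range(m)] for x in range(m)]
# bar_f2(x) = E f1(x, xi): integrate out the second coordinate
bar_f2 = [sum(p[y] * f1[x][y] for y in range(m)) for x in range(m)]
f2 = [[f1[x][y] - bar_f2[x] for y in range(m)] for x in range(m)]

# f2 is canonical: integrating out either variable gives zero
for y in range(m):
    assert abs(sum(p[x] * f2[x][y] for x in range(m))) < 1e-12
for x in range(m):
    assert abs(sum(p[y] * f2[x][y] for y in range(m))) < 1e-12
# the pointwise identity f = f2 + bar_f2(x) + bar_f1(y)
for x in range(m):
    for y in range(m):
        assert abs(f[x][y] - (f2[x][y] + bar_f2[x] + bar_f1[y])) < 1e-12
```

The centered terms $\bar f_1-E\bar f_1$ and $\bar f_2-E\bar f_2$ are
then exactly the kernels of the order-one $U$-statistics in the
identity above.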
Hoeffding's decomposition can be obtained by working out the details
of the above argument in the general case. But it is simpler to
calculate the appropriate conditional expectations by working with
the kernel functions of the $U$-statistics. To carry out such
a program we introduce the following notations.
Let us consider the $k$-fold product $(X^k,{\cal X}^k,\mu^k)$ of a
measure space $(X,{\cal X},\mu)$ with some probability measure $\mu$,
and define for all integrable functions $f(x_1,\dots,x_k)$ and indices
$1\le j\le k$ the projection~$P_jf$ of the function $f$ along its $j$-th
coordinate, i.e.\ the integral of the function~$f$ with respect to its
$j$-th coordinate.
For the sake of simpler notations in our later considerations we
shall define the operator $P_j$ in a slightly more general setting.
Let us consider a set $A=\{p_1,\dots,p_s\}\subset\{1,\dots,k\}$, put
$X^A=X_{p_1}\times X_{p_2}\times\cdots\times X_{p_s}$, ${\cal X}^A
={\cal X}_{p_1}\times {\cal X}_{p_2}\times\cdots\times{\cal X}_{p_s}$,
$\mu^A=\mu_{p_1}\times \mu_{p_2}\times\cdots\times \mu_{p_s}$, take
the product space $(X^A,{\cal X}^A,\mu^A)$ and if $j\in A$, then
define the operator $P_j$ as mapping a function on this product
space to a function on the product space
$(X^{A\setminus\{j\}},{\cal X}^{A\setminus\{j\}})$ by the formula
\begin{equation}
(P_jf)(x_{p_1},\dots,x_{p_{r-1}},x_{p_{r+1}},\dots,x_{p_s})
=\int f(x_{p_1},\dots,x_{p_s})\mu(\,dx_j), \quad\text{if } j=p_r.
\label{(9.1)}
\end{equation}
Let us also define the (orthogonal projection) operators
$Q_j=I-P_j$ as $Q_jf=f-P_jf$ for all integrable functions $f$ on
the space $(X^A,{\cal X}^A,\mu^A)$ and all $j\in A$, i.e.\ put
\begin{eqnarray}
(Q_jf)(x_{p_1},\dots,x_{p_s})&=&(I-P_j)f(x_{p_1},\dots,x_{p_s})
\nonumber\\
&=&f(x_{p_1},\dots,x_{p_s})-\int f(x_{p_1},\dots,x_{p_s})\mu(\,dx_j).
\label{(9.1a)}
\end{eqnarray}
In the definition~(\ref{(9.1)}) $P_jf$ is a function not
depending on the coordinate $x_j$, but in the definition of $Q_j$
we introduce the fictive coordinate $x_j$ to make the expression
$Q_jf=f-P_jf$ meaningful.
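On a finite space the operators $P_j$ and $Q_j=I-P_j$ can be written
out explicitly. The sketch below, with an illustrative measure $\mu$
and function $f$ and with the fictive-coordinate convention of the
text (so $P_jf$ keeps $x_j$ as a dummy argument), checks that $P_j$ is
a projection, that $P_jQ_j=0$, that $Q_1Q_2f$ is canonical, and that
summing the products of $P$'s and $Q$'s over all subsets of
coordinates recovers $f$, as in the decomposition of Theorem~9.1 below.

```python
from itertools import product

# Discrete sketch of the operators P_j and Q_j = I - P_j; the measure
# mu and the function f are illustrative choices.
k, m = 2, 3
mu = [0.5, 0.3, 0.2]                       # a probability measure on {0, 1, 2}
f = {xs: (xs[0] + 1.0) * (2.0 * xs[1] - 1.0) ** 2
     for xs in product(range(m), repeat=k)}

def P(j, g):
    # integrate out the j-th coordinate; x_j is kept as a fictive
    # coordinate, as in the text, so that Q_j g = g - P_j g makes sense
    return {xs: sum(mu[y] * g[xs[:j] + (y,) + xs[j + 1:]] for y in range(m))
            for xs in g}

def Q(j, g):
    Pg = P(j, g)
    return {xs: g[xs] - Pg[xs] for xs in g}

Pf = P(0, f)
# P_j is a projection, and P_j Q_j = 0
assert all(abs(P(0, Pf)[xs] - Pf[xs]) < 1e-12 for xs in f)
assert all(abs(P(0, Q(0, f))[xs]) < 1e-12 for xs in f)
# Q_1 Q_2 f is canonical: integrating out either coordinate gives zero
g = Q(0, Q(1, f))
assert all(abs(P(0, g)[xs]) < 1e-12 for xs in f)
assert all(abs(P(1, g)[xs]) < 1e-12 for xs in f)
# the decomposition f = sum over V of (prod P_j prod Q_j) f for k = 2
parts = [P(0, P(1, f)), P(0, Q(1, f)), Q(0, P(1, f)), Q(0, Q(1, f))]
assert all(abs(sum(pt[xs] for pt in parts) - f[xs]) < 1e-12 for xs in f)
```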
\medskip\noindent
{\it Remark.} I shall use the following notation.
$(P_jf)(x_{p_1},\dots,x_{p_{r-1}},x_{p_{r+1}},\dots,x_{p_s})$
will denote the value of the function $P_jf$ in the point
(x_{p_1},\dots,x_{p_{r-1}},x_{p_{r+1}},\dots,x_{p_s})$. On the other
hand, I sometimes write $P_jf(x_{p_1},\dots,x_{p_s})$ (without
parentheses) instead of $P_jf$ when it is natural to write the
function~$f$ together with its arguments. The same notation will
be applied for the operator $Q_j$.
\medskip
The following result holds.
\medskip\noindent
{\bf Theorem 9.1 (The Hoeffding decomposition of
$U$-statistics).}\index{Hoeffding decomposition of $U$-statistics}
{\it Let $f(x_1,\dots,x_k)$ be an integrable function on the $k$-fold
product $(X^k,{\cal X}^k,\mu^k)$ of a space $(X,{\cal X},\mu)$
with a probability measure $\mu$. It has a decomposition of the form
\begin{eqnarray}
&&f(x_1,\dots,x_k)=\sum\limits_{V\subset\{1,\dots,k\}}
f_V(x_{j_1},\dots,x_{j_{|V|}})
\label{(9.2)} \\
&& \qquad\quad \textrm{with} \quad
f_V(x_{j_1},\dots,x_{j_{|V|}})
=\left(\prod_{j\in\{1,\dots,k\}\setminus V}P_j
\prod_{j'\in V}Q_{j'}\right)f(x_1,\dots,x_k), \nonumber
\end{eqnarray}
with $V=\{j_1,\dots,j_{|V|}\}$, $j_1<j_2<\cdots<j_{|V|}$.
\begin{equation}
P(|J_{n,k}(f)|>u)\le \sum_{V\subset\{1,\dots,k\}}
P\left(n^{-|V|/2} \left||V|!I_{n,|V|}(f_V)\right|>\frac u{2^kC(k)}\right)
\label{(9.10)}
\end{equation}
with a constant $C(k)$ satisfying the inequality $p!C(n,k,p)\le
k!C(k)$ for all coefficients $C(n,k,p)$, $1\le p\le k$,
in~(\ref{(9.9)}). Hence Theorem~$8.1'$ follows from Theorem~8.3
and relations~(\ref{(9.4)}) and~(\ref{($9.4'$)}) in Theorem~9.2 by
which the $L_2$-norm of the functions $f_V$ is bounded by the
$L_2$-norm of the function~$f$, and the $L_\infty$-norm of $f_V$
is bounded by $2^{|V|}$ times the $L_\infty$-norm of $f$. It is
enough to estimate each term at the right-hand side of~(\ref{(9.10)})
by means of Theorem~8.3. It can be assumed that $2^kC(k)>1$. Let us
first assume that also the inequality $\frac u{2^kC(k) \sigma}\ge1$
holds. In this case formula~(\ref{($8.3'$)}) in Theorem~$8.1'$
can be obtained by means of the estimation of each term at the
right-hand side of~(\ref{(9.10)}). Observe that
$\exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}\right)^{2/s}
\right\}\le \exp\left\{-\alpha\left(\frac u{2^kC(k)\sigma}
\right)^{2/k}\right\}$ for all
$s\le k$ if $\frac u{2^kC(k)\sigma}\ge1$. In the other case, when
$\frac u{2^kC(k)\sigma}\le1$, formula~(\ref{($8.3'$)}) holds again
with a sufficiently large $C>0$, because in this case the right-hand
side of~(\ref{($8.3'$)}) is greater than~1.

Theorem~8.2 can be similarly derived from Theorem~8.4 by observing
that relation~(\ref{(9.10)}) remains valid if $|J_{n,k}(f)|$ is
replaced by
$\sup\limits_{f\in{\cal F}}|J_{n,k}(f)|$ and $|I_{n,|V|}(f_V)|$
by $\sup\limits_{f_V\in{\cal F}_V} |I_{n,|V|}(f_V)|$ in it, and we
have the right to choose the constant~$M$ in formula~(\ref{(8.6)}) of
Theorem~8.2 sufficiently large. The only difference in the argument
is that beside formulas~(\ref{(9.4)}) and~(\ref{($9.4'$)}) the last
statement of Theorem~9.2 also has to be applied in this case. It
states that if ${\cal F}$ is an $L_2$-dense class of functions on a
space $(X^k,{\cal X}^k)$, then the classes of functions
${\cal F}_V=\{2^{-|V|}f_V\colon\, f\in{\cal F}\}$ are also
$L_2$-dense classes of functions for all $V\subset\{1,\dots,k\}$
with the same exponent and parameter.\index{estimate on the
supremum of multiple random integrals with respect to an empirical
distribution}
\medskip
Before its proof I make some comments about the content of
Theorem~9.4. The expression $J_{n,k}(f)$ was defined as a $k$-fold
random integral with respect to the signed measure $\mu_n-\mu$,
where the diagonals were omitted from the domain of integration.
Formula~(\ref{(9.9)}) expresses the random integral $J_{n,k}(f)$
as a linear combination of degenerate $U$-statistics of different
order. This is similar to the Hoeffding decomposition of the
$U$-statistic $I_{n,k}(f)$ into the linear combination of degenerate
$U$-statistics defined with the same kernel functions~$f_V$. The
main difference between these two formulas is that in the
expansion~(\ref{(9.9)}) of $J_{n,k}(f)$ the terms $I_{n,|V|}(f_V)$
appear with small coefficients $C(n,k,|V|)|V|!\frac1{n^{|V|/2}}$.
As we shall see, the expectation
$E\left(C(n,k,|V|)|V|!\frac1{n^{|V|/2}}I_{n,|V|}(f_V)\right)^2$
of the square of each term is small.

We call a $\sigma$-finite measure~$\mu$ non-atomic if for every
measurable set~$A$ with $\mu(A)<\infty$ and for all
$\varepsilon>0$ there is a finite partition
$A=\bigcup\limits_{s=1}^N B_s$ of the set~$A$ with the property
$\mu(B_s)<\varepsilon$ for all $1\le s\le N$. There is a formally
weaker definition of non-atomic measures, by which a
$\sigma$-finite measure~$\mu$ is non-atomic if for all measurable
sets $A$ such that $0<\mu(A)<\infty$ there is a measurable set
$B\subset A$ with the property $0<\mu(B)<\mu(A)$. But these two
definitions of non-atomic measures are actually equivalent,
although this equivalence is not trivial. I do not discuss this
problem here, since it lies somewhat outside the direction
of the present work. In our further considerations we shall work
with the first definition of non-atomic measures.
I would also remark that non-atomic measures do not behave
entirely as our first heuristic impression would suggest. It is true that
if $\mu$ is a non-atomic measure, then $\mu(\{a\})=0$ for all
one-point sets~$\{a\}$. But the reverse statement does not hold.
There are (in some sense degenerate) measures $\mu$ for which each
one-point set has zero~$\mu$ measure, and which are nevertheless
not non-atomic. I omit the discussion of this question.

The $k$-fold Wiener--It\^o integrals\index{Wiener--It\^o integrals}
of the functions $f\in{\cal H}_{\mu,k}$ with respect to the white
noise~$\mu_W$ will be defined in a rather standard way. First
they will be defined for some simple functions, called elementary
functions, then it will be shown that the integral for these
elementary functions has an $L_2$ contraction property which
makes it possible to extend it to the class of all functions in
${\cal H}_{\mu,k}$.

Let us first introduce the following class of elementary
functions $\bar{\cal H}_{\mu,k}$ of $k$ variables.\index{elementary
functions of $k$ variables} A function $f(x_1,\dots,x_k)$ on
$(X^k,{\cal X}^k)$ belongs to $\bar{\cal H}_{\mu,k}$ if there
exist finitely many disjoint measurable subsets $A_1,\dots,A_M$,
$1\le M<\infty$, of the set~$X$ with finite $\mu$-measure (i.e.
$A_j\cap A_{j'}=\emptyset$ if $j\neq j'$, and $\mu(A_j)<\infty$ for
all $1\le j\le M$) such that the function $f$ has the form
\begin{equation}
f(x_1,\dots,x_k)=\left\{
\begin{array}{l}
c(j_1,\dots,j_k)\quad\textrm{if } (x_1,\dots,x_k) \in
A_{j_1}\times\cdots \times A_{j_k} \textrm{ with} \\
\qquad \textrm{some indices } (j_1,\dots,j_k),
\quad 1\le j_s\le M,\; 1\le s\le k,\\
\qquad \textrm{ such that all numbers } j_1,\dots,j_k
\textrm{ are different} \\
0 \quad\textrm{if }(x_1,\dots,x_k)\notin \!\!\!
\bigcup\limits_{\substack
{(j_1,\dots,j_k)\colon\, 1\le j_s\le M, \; 1\le s\le k,\\
\textrm{ and all } j_1,\dots,j_k\textrm { are different.} }}\! \!\!
A_{j_1}\times\cdots \times A_{j_k}
\end{array}
\right. \label{(10.2)}
\end{equation}
with some real numbers $c(j_1,\dots,j_k)$, $1\le j_s\le M$, $1\le
s\le k$, defined for such arguments for which $j_1,\dots,j_k$
are different numbers. This means
that the function $f$ is constant on all $k$-dimensional
rectangles $A_{j_1}\times\dots\times A_{j_k}$ with different,
non-intersecting edges, and it equals zero on the complementary
set of the union of these rectangles. The property that the support
of the function~$f$ is the union of rectangles with
non-intersecting edges is sometimes expressed by saying that the
diagonals are omitted from the domain of integration of
Wiener--It\^o integrals.
The Wiener--It\^o integral of an elementary function
$f(x_1,\dots,x_k)$ of the form~(\ref{(10.2)}) with respect to a white
noise $\mu_W$ with the (non-atomic) reference measure $\mu$
is defined by the formula
\begin{eqnarray}
&&\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k) \nonumber\\
&&\qquad=\sum_{\substack{1\le j_s\le M,\;1\le s\le k \\
\textrm{all } j_1,\dots,j_k \textrm{ are different} }}
c(j_1,\dots,j_k) \mu_W(A_{j_1})\cdots\mu_W(A_{j_k}). \label{(10.3)}
\end{eqnarray}
(The representation of the function $f$ in~(\ref{(10.2)}) is not
unique; the sets $A_j$ can be divided into smaller disjoint sets,
but the Wiener--It\^o integral defined in~(\ref{(10.3)}) does not
depend on the representation of the function~$f$. This can be
seen with the help of the additivity property
$\mu_W(A\cup B)=\mu_W(A)+\mu_W(B)$ if $A\cap B=\emptyset$
of the white noise~$\mu_W$.) The notation
\begin{equation}
Z_{\mu,k}(f)=\frac1{k!}
\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k), \label{(10.4)}
\end{equation}
will be used in the sequel, and the expression $Z_{\mu,k}(f)$
will be called the normalized Wiener--It\^o integral of the
function~$f$. Such a terminology will be applied also for the
Wiener--It\^o integrals of all functions $f\in{\cal H}_{\mu,k}$ to
be defined later.
If $f$ is an elementary function in $\bar{\cal H}_{\mu,k}$ defined
in~(\ref{(10.2)}), then its normalized Wiener--It\^o integral defined
in~(\ref{(10.3)}) and~(\ref{(10.4)}) satisfies the relations
\begin{eqnarray}
Ek!Z_{\mu,k}(f)&&=0, \nonumber \\
E(k!Z_{\mu,k}(f))^2&&= \!\!
\sum_{\substack{(j_1,\dots,j_k)\colon\,
1\le j_s\le M,\; 1\le s\le k, \\
\textrm{and all } j_1,\dots,j_k\textrm{ are different} }}
\sum_{\pi\in \Pi_k}
c(j_1,\dots,j_k)c(j_{\pi(1)},\dots,j_{\pi(k)}) \nonumber \\
&&\qquad\qquad E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j_{\pi(1)}})\cdots\mu_W(A_{j_{\pi(k)}}) \nonumber \\
&&=k!\int (\textrm{Sym\,} f(x_1,\dots,x_k))^2
\mu(\,dx_1)\dots\mu(\,dx_k) \nonumber \\
&&\le k!\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k),
\label{(10.5)}
\end{eqnarray}
with $\textrm{Sym\,}f(x_1,\dots,x_k)=
\frac1{k!}\sum\limits_{\pi\in\Pi_k}f(x_{\pi(1)},\dots,x_{\pi(k)})$,
where $\Pi_k$ denotes the set of all permutations
$\pi=\{\pi(1),\dots,\pi(k)\}$ of the set $\{1,\dots,k\}$.
The identities written down in~(\ref{(10.5)}) can be simply
checked. The first relation follows from the identity
$E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})=0$ for disjoint sets
$A_{j_1},\dots,A_{j_k}$, which holds, since we take the expectation
of a product of independent random variables with zero expectation.
The second identity follows similarly from the identity
\begin{eqnarray*}
&&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})=0\\
&&\qquad \textrm{ if the sets of indices }
\{j_1,\dots,j_k\} \textrm { and }
\{j'_1,\dots,j'_k\} \textrm{ are different,} \\
&&E\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})
\mu_W(A_{j'_1})\cdots\mu_W(A_{j'_k})
=\mu(A_{j_1})\cdots\mu(A_{j_k})\\
&&\qquad \textrm{ if } \{j_1,\dots,j_k\}=
\{j'_1,\dots,j'_k\} \textrm{ i.e. if }
j'_1=j_{\pi(1)},\dots,j'_k=j_{\pi(k)} \\
&&\qquad \textrm{ with some permutation } \pi\in\Pi_k,
\end{eqnarray*}
which holds because the $\mu_W$ measures of
disjoint sets are independent with expectation zero, and
$E\mu_W(A)^2=\mu(A)$. The remaining relations in~(\ref{(10.5)})
can be simply checked.
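The middle identity in~(\ref{(10.5)}) can also be verified by direct
enumeration for small parameters. The sketch below takes $k=2$ and
$M=3$; the measures $\mu(A_j)$ and the numbers $c(j_1,j_2)$ are
illustrative choices. It computes $E(k!Z_{\mu,k}(f))^2$ from the
independence of the values of $\mu_W$ on disjoint sets and compares it
with $k!\int(\textrm{Sym}\,f)^2\,d\mu^k$ and with the bound
$k!\int f^2\,d\mu^k$.

```python
from itertools import permutations, product

# Enumeration check of the second identity in (10.5) for k = 2, M = 3;
# the measures mu(A_j) and the numbers c(j_1, j_2) are illustrative.
k, M = 2, 3
mu = [0.5, 0.3, 0.2]                 # mu(A_1), mu(A_2), mu(A_3)
c = {(j1, j2): 1.0 + j1 - 2.0 * j2   # c(j_1, j_2), defined for j_1 != j_2
     for j1, j2 in product(range(M), repeat=2) if j1 != j2}

# E (k! Z)^2: only index vectors that are permutations of each other
# contribute, each contribution being mu(A_{j_1}) mu(A_{j_2})
lhs = sum(c[j] * c[(j[pi[0]], j[pi[1]])] * mu[j[0]] * mu[j[1]]
          for j in c for pi in permutations(range(k)))

# k! times the integral of (Sym f)^2 with respect to mu^k
sym = {j: (c[j] + c[(j[1], j[0])]) / 2.0 for j in c}
rhs = 2.0 * sum(sym[j] ** 2 * mu[j[0]] * mu[j[1]] for j in c)
assert abs(lhs - rhs) < 1e-12

# and the contraction bound E (k! Z)^2 <= k! * integral of f^2
bound = 2.0 * sum(c[j] ** 2 * mu[j[0]] * mu[j[1]] for j in c)
assert lhs <= bound + 1e-12
```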
It is not difficult to check that
\begin{equation}
EZ_{\mu,k}(f)Z_{\mu,k'}(g)=0 \label{(10.6)}
\end{equation}
for all functions $f\in \bar{\cal H}_{\mu,k}$ and
$g\in \bar{\cal H}_{\mu,k'}$ if $k\neq k'$, and
\begin{equation}
Z_{\mu,k}(f)=Z_{\mu,k}(\textrm{Sym}\, f) \label{(10.7)}
\end{equation}
for all functions $f\in \bar{\cal H}_{\mu,k}$.
The definition of Wiener--It\^o integrals can be extended to
general functions $f\in{\cal H}_{\mu,k}$ with the help of
formula~(\ref{(10.5)}). To carry out this extension we still have
to know that the class of functions $\bar{\cal H}_{\mu,k}$ is
a dense subset of the class ${\cal H}_{\mu,k}$ in the Hilbert
space $L_2(X^k,{\cal X}^k,\mu^k)$, where $\mu^k$ is the $k$-th power
of the reference measure $\mu$ of the white noise~$\mu_W$. I
briefly explain how this property of $\bar{\cal H}_{\mu,k}$ can be
proved. The non-atomic property of the measure~$\mu$ is exploited
at this point.
To prove this statement it is enough to show that the indicator
function of any product set $A_1\times\cdots\times A_k$
such that $\mu(A_j)<\infty$, $1\le j\le k$ (where the sets
$A_1,\dots,A_k$ may be non-disjoint), is in the $L_2(\mu^k)$
closure of $\bar{\cal H}_{\mu,k}$. In the proof of this
statement it will be exploited that since $\mu$ is a non-atomic
measure, the sets $A_j$ can be represented for all
$\varepsilon>0$ and $1\le j\le k$ as a finite union
$A_j=\bigcup\limits_s B_{j,s}$ of disjoint sets $B_{j,s}$
with the property $\mu(B_{j,s})<\varepsilon$.
By means of these relations the
product $A_1\times\cdots\times A_k$ can be written in the form
\begin{equation}
A_1\times\cdots\times A_k=\bigcup_{s_1,\dots,s_k}
B_{1,s_1}\times\cdots\times B_{k,s_k} \label{(10.8)}
\end{equation}
with some sets $B_{j,s_j}$ such that $\mu(B_{j,s_j})<\varepsilon$
for all sets in this union. Moreover, we may assume, by refining
the partitions of the sets $A_j$ if necessary, that any
two sets $B_{j,s_j}$ and $B_{j',s'_{j'}}$ in this representation
are either disjoint or they agree. Take such a representation of
$A_1\times\cdots\times A_k$, and consider the set we obtain by
omitting those products $B_{1,s_1}\times\cdots\times B_{k,s_k}$
from the union at the right-hand side of~(\ref{(10.8)}) for which
$B_{i,s_i}=B_{j,s_j}$
for some $1\le i<j\le k$.
Let us divide the sets $D_j$ into the union of small disjoint
sets $D_j^{(m)}$, $1\le j\le k$, and the sets $B_j$ into the
union of small disjoint sets $F_j^{(m)}$, $1\le j\le l$, with
some fixed number $1\le m\le M$, in such a way that
$\mu(D_j^{(m)})\le\varepsilon$ and
$\mu(F_j^{(m)})\le \varepsilon$ with some fixed $\varepsilon>0$.
Beside this, we also require that two sets
$D_j^{(m)}$ and $F_{j'}^{(m')}$ should be either disjoint or
they should agree. (The sets $D_j^{(m)}$ are disjoint for
different indices, and the same relation holds for the
sets $F_{j'}^{(m')}$.)
Then the identities
$$
k!Z_{\mu,k}(f)=\prod_{j=1}^k
\left(\sum_{m=1}^M\mu_W(D_j^{(m)})\right)
$$
and
$$
l!Z_{\mu,l}(g)=\prod_{j'=1}^l
\left(\sum_{m'=1}^M\mu_W(F_{j'}^{(m')})\right),
$$
hold, and the product of these two Wiener--It\^o integrals can be
written in the form of a sum by means of a term by term
multiplication. Let us divide the terms of the sum obtained in this
way into classes indexed by the diagrams $\gamma\in\Gamma(k,l)$
in the following way: Each term in this sum is a product of the form
$\prod\limits_{j=1}^k\mu_W(D_j^{(m_j)})
\prod\limits_{j'=1}^l\mu_W(F_{j'}^{(m_{j'})})$. Let it belong to the
class indexed by the diagram $\gamma$ with edges
$((1,j_1),(2,j_1'))$,\dots, and $((1,j_s),(2,j'_s))$ if the elements
in the pairs $(D_{j_1}^{(m_{j_1})},F_{j'_1}^{(m_{j'_1})})$,\dots,
$(D_{j_s}^{(m_{j_s})},F_{j'_s}^{(m_{j'_s})})$ agree, while all the
remaining factors are different. Then letting $\varepsilon\to0$ (and taking
partitions of the sets $D_j$ and $F_{j'}$ corresponding to the
parameter $\varepsilon$) the
sums of the terms in each class turn to integrals, and our
calculation suggests the identity
\begin{equation}
(k!Z_{\mu,k}(f))(l!Z_{\mu,l}(g))
=\sum_{\gamma\in\Gamma(k,l)}\bar Z_\gamma(f,g) \label{(10.13)}
\end{equation}
with
\begin{eqnarray}
\bar Z_\gamma(f,g)&&=\int
f(x_{\alpha_\gamma((1,1))},\dots,x_{\alpha_\gamma((1,k))})
g(x_{(2,1)},\dots,x_{(2,l)}) \label{(10.13a)} \\
&&\qquad \mu_W(\,dx_{\alpha_\gamma((1,1))})\dots
\mu_W(\,dx_{\alpha_\gamma((1,k))})
\mu_W(\,dx_{(2,1)})\dots\mu_W(\,dx_{(2,l)}) \nonumber
\end{eqnarray}
with the function $\alpha_\gamma(\cdot)$ introduced before
formula~(\ref{(10.9)}). The indices $\alpha_\gamma((1,j))$ of the
arguments in~(\ref{(10.13a)}) mean
that in the case $\alpha_\gamma((1,j))=(2,j')$ the argument
$x_{(1,j)}$ has to be replaced by $x_{(2,j')}$. In particular,
$$
\mu_W(\,dx_{\alpha_\gamma((1,j))})\mu_W(\,dx_{(2,j')})
=(\mu_W(\,dx_{(2,j')}))^2=\mu(\,dx_{(2,j')})
$$
in this case because of the `identity'
$(\mu_W(\,dx))^2=\mu(\,dx)$. Hence the above informal
calculation yields the identity
$\bar Z_\gamma(f,g)=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))$,
and relations~(\ref{(10.13)}) and~(\ref{(10.13a)}) imply
formula~(\ref{(10.12)}).
A similar heuristic argument can be applied to get formulas for
the product of integrals of normalized empirical distributions or
(normalized) Poisson fields; only the starting `identity'
$(\mu_W(\,dx))^2=\mu(\,dx)$ changes in these cases: some additional
terms appear in it, which modify the final result. I return to
this question in the next chapter.
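The `identity' $(\mu_W(\,dx))^2=\mu(\,dx)$ behind the above
calculation can be illustrated by a small simulation; the choice of
the Lebesgue measure on $[0,1]$ as reference measure and the partition
size are arbitrary. The sum of the squared increments of the white
noise over a fine partition concentrates around the total measure.

```python
import random

# Simulation behind the `identity' (mu_W(dx))^2 = mu(dx): white noise
# with Lebesgue reference measure on [0,1]; partition size is arbitrary.
random.seed(0)
n = 20000
# mu_W(B_j) ~ N(0, 1/n) independently over a partition into n intervals
total = sum(random.gauss(0.0, (1.0 / n) ** 0.5) ** 2 for _ in range(n))
# E total = 1 and Var total = 2/n, so total should be close to 1
assert abs(total - 1.0) < 0.1
```

This is only a heuristic: for empirical distributions and Poisson
fields the analogous `identity' contains additional terms, as
mentioned above.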
\medskip
It is not difficult to generalize Theorem~10.2A with the help of
some additional notations to a diagram formula about the product
of finitely many Wiener--It\^o integrals. We shall do this in
Theorem~10.2. Then to understand this result better I present
an example which shows how to calculate the terms in the sum
expressing the product of three Wiener--It\^o integrals as a
sum of Wiener--It\^o integrals.
We consider the product of the Wiener--It\^o integrals
$k_p!Z_{\mu,k_p}(f_p)$, $1\le p\le m$, of $m\ge2$ functions
$f_p(x_1,\dots,x_{k_p})\in{\cal H}_{\mu,k_p}$, of order
$k_p\ge1$, $1\le p\le m$, and define a class of diagrams
$\Gamma=\Gamma(k_1,\dots,k_m)$ in the following way.
The diagrams $\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ have
vertices of the form $(p,r)$, $1\le p\le m$, $1\le r\le k_p$. The
set of vertices $\{(p,r)\colon\, 1\le r\le k_p\}$ with a fixed number
$p$ will be called the $p$-th row of the diagram $\gamma$. A diagram
$\gamma\in\Gamma=\Gamma(k_1,\dots,k_m)$ may have some edges. All
edges of a diagram connect vertices from different rows, and from
each vertex there starts at most one edge. All diagrams satisfying
these properties belong to $\Gamma(k_1,\dots,k_m)$. An edge
of a diagram $\gamma$ has the form $((p_1,r_1),(p_2,r_2))$ with
$p_1<p_2$. Put
\begin{equation}
W(\gamma)=\sum_{\beta\in O(\gamma)}(\ell(\beta)-1)
+\sum_{\beta\in C(\gamma)}(\ell(\beta)-2),\quad
\gamma\in\Gamma(k_1,\dots,k_m), \label{(11.9)}
\end{equation}
where $\ell(\beta)$ denotes the length of the chain~$\beta$.
To define the next quantity let us introduce some notations.
We consider the chains of the form
$\beta=\{(p_1,r_1),\dots,(p_l,r_l)\}$,
$1\le p_1<p_2<\cdots<p_l\le m$, of a coloured diagram
$\gamma\in\Gamma(k_1,\dots,k_m)$, and define for all
$2\le p\le m$ two classes of chains, ${\cal B}_1(\gamma,p)$
and ${\cal B}_2(\gamma,p)$.
In words, ${\cal B}_1(\gamma,p)$ consists of those chains
$\beta\in\gamma$ which have colour~$1$, all their vertices
are in the first~$p$ rows of the diagram, and contain a
vertex in the~$p$-th row. The set ${\cal B}_2(\gamma,p)$
consists of those chains $\beta\in\gamma$ which have either
colour~$-1$, and all their vertices are in the first~$p$
rows of the diagram, or they have (with an arbitrary colour) a
vertex both in the first~$p$ rows and in the remaining rows
of the diagram. Put $B_1(\gamma,p)=|{\cal B}_1(\gamma,p)|$
and $B_2(\gamma,p)=|{\cal B}_2(\gamma,p)|$. With the help
of these numbers we define
\begin{equation}
J_n(\gamma,p)=\left\{
\begin{array}{l}
\prod\limits_{j=1}^{B_1(\gamma,p)}
\left(\frac{n-B_1(\gamma,p)-B_2(\gamma,p)+j}n\right)
\quad\textrm{if } B_1(\gamma,p)\ge1\\
\quad 1\quad \textrm{if } B_1(\gamma,p)=0
\end{array}
\right. \label{(11.10)}
\end{equation}
for all $2\le p\le m$ and diagrams $\gamma\in\Gamma(k_1,\dots,k_m)$.
Theorem 11.2 will be formulated with the help of the above notations.
\medskip\noindent
{\bf Theorem 11.2 (The diagram formula for the product of several
degenerate $U$-statistics).}\index{diagram formula for the product
of degenerate $U$-statistics} {\it Let a sequence of independent
and identically distributed random variables $\xi_1,\xi_2,\dots$
be given with some distribution $\mu$ on a measurable space
$(X,{\cal X})$ together with $m\ge2$ bounded functions
$f_p(x_1,\dots,x_{k_p})$ on the spaces $(X^{k_p},{\cal X}^{k_p})$,
$1\le p\le m$, canonical with respect to the probability
measure~$\mu$. Let us consider the class of coloured diagrams
$\Gamma(k_1,\dots,k_m)$ together with the functions
$F_\gamma=F_{\gamma}(f_1,\dots,f_m)$,
$\gamma\in\Gamma(k_1,\dots,k_m)$, defined in formulas
(\ref{(11.8)}) and the constants $W(\gamma)$
and $J_n(\gamma,p)$, $1\le p\le m$, given in
formulas~(\ref{(11.9)}) and~(\ref{(11.10)}).
The functions $F_\gamma(f_1,\dots,f_m)$ are bounded canonical
functions of $|O(\gamma)|$ variables with respect to the measure~$\mu$,
and the product of the normalized degenerate $U$-statistics
$n^{-k_p/2}k_p!I_{n,k_p}(f_p)$, $1\le p\le m$,
$n\ge \max\limits_{1\le p\le m} k_p$, defined in~(\ref{(8.7)})
can be written in the form
\begin{eqnarray}
&&\prod_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)
={\sum_{\gamma\in\Gamma(k_1,\dots,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n,\,m)}
\left(\prod_{p=2}^m J_n(\gamma,p)\right) \nonumber \\
&&\qquad n^{-W(\gamma)/2}\cdot n^{-|O(\gamma)|/2} |O(\gamma)|!
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m)),
\label{(11.11)}
\end{eqnarray}
where $\sum^{\prime(n,\,m)}$ means that summation is taken
for those $\gamma\in\Gamma(k_1,\dots,k_m)$ which satisfy the
relation $B_1(\gamma,p)+B_2(\gamma,p)\le n$ for all
$2\le p\le m$ with the quantities $B_1(\gamma,p)$ and
$B_2(\gamma,p)$ introduced before the definition of
$J_n(\gamma,p)$ in~(\ref{(11.10)}), and the expression
$W(\gamma)$ was defined in~(\ref{(11.9)}). The terms
$I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))$ at the
right-hand side of formula (\ref{(11.11)}) can be replaced by
$I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,\dots,f_m))$.}
\medskip
To understand better the formulation of Theorem~11.2 let us consider
the following example.
\medskip\noindent
Take three normalized degenerate $U$-statistics $n^{-3/2}3!I_{n,3}(f_1)$,
$n^{-2}4!I_{n,4}(f_2)$ and $n^{-3/2}3!I_{n,3}(f_3)$ with canonical
kernel functions $f_1(x_1,x_2,x_3)$, $f_2(x_1,x_2,x_3,x_4)$ and
$f_3(x_1,x_2,x_3)$, and let us see how to calculate a term from
the sum at the right-hand side of formula~(\ref{(11.11)}) which expresses
the product
$$
n^{-3/2}3!I_{n,3}(f_1)n^{-2}4!I_{n,4}(f_2)n^{-3/2}3!I_{n,3}(f_3)
$$
in the form of a linear combination of degenerate $U$-statistics.
In this case we have to consider coloured diagrams with rows of
vertices (1,1), (1,2), (1,3), then (2,1), (2,2), (2,3), (2,4), and
finally (3,1), (3,2), (3,3). We have to consider all coloured
diagrams with these rows, and to calculate their contribution to
the sum at the right-hand side of~(\ref{(11.11)}). Let us consider
for instance the diagram containing two closed chains (with colour~1)
$((1,3),(2,4),(3,3))$ of length~3, $((1,1),(2,2))$ of length~2,
an open chain (with colour~$-1$) $((2,1),(3,1))$ of length~2, and
the remaining vertices (1,2), (2,3), (3,2) are chains of length~1
which are consequently open (with colour~$-1$). (See picture.)
\vskip3mm
\begin{figure}[ht]
\begin{center}
\epsfig{file=diag11c.eps,width=4cm}
\centerline{Our diagram $\gamma$}
\end{center}
\end{figure}
\vfill\eject
We want to calculate $F_\gamma(f_1,f_2,f_3)$. For this goal first
we have to determine the coloured diagrams $\gamma_{pr}\in\Gamma(3,4)$
and $\gamma_{cl}\in\Gamma(4,3)$ (here the first parameter~4 in the
definition of the class of diagrams to which $\gamma_{cl}$ belongs
is the number of open chains in $\gamma_{pr}$, which, as we will
see, equals~4), and the kernel function $F_{\gamma_{pr}}(f_1,f_2)$.
(See the picture of the diagram $\gamma_{pr}$ together with a
labelling of its chains and the diagram~$\gamma_{cl}$
to which we also attached a labelling.)
\vskip3mm
\bigskip
% \noindent
~\begin{tabular}{ccc}
\epsfig{file=diag11d.eps,height=30mm}&~~~~~&
\phantom{~}\hskip-1mm\epsfig{file=diag11e.eps,height=30mm}\\
\begin{minipage}{42mm} The diagram $\gamma_{pr}$ corresponding to
$\gamma$ together with the enu\-me\-ra\-tion of its open chains
\end{minipage}
&&
\begin{minipage}{42mm}
The diagram $\gamma_{cl}$ constructed with the help of
$\gamma_{pr}$ and of the enu\-me\-ra\-tion of its open chains
\end{minipage}\\
\end{tabular}
\vskip3mm
In our example $\gamma_{pr}$ is a diagram with two rows (1,1),
(1,2), (1,3) and (2,1), (2,2), (2,3), (2,4). It contains a closed
chain $((1,1),(2,2))$ and an open chain $((1,3),(2,4))$ of length~2
(the latter is the restriction of a chain of length~3), and open
chains of length~1, which are the vertices (1,2), (2,1), (2,3).
This is the same diagram which we considered in the example after
Theorem~11.1. In that example we have fixed an enumeration of the
chains of this diagram. We also made the convention that the
enumeration of the chains of a diagram fixed at the start cannot
be modified later. Hence we have the following enumeration of the
open chains of this diagram: (1,2)--label 1, (2,1)--label~2,
$((1,3),(2,4))$--label~3, and (2,3)--label~4.
We define the coloured diagram~$\gamma_{cl}$ with the help of the
diagram~$\gamma_{pr}$ and the enumeration of its open chains.
It has two rows. The vertices of the first row $(1,1)$, $(1,2)$,
$(1,3)$ and $(1,4)$ correspond to the open chains of the
diagram~$\gamma_{pr}$ with labels~1, 2, 3 and~4 respectively. The
vertices of the second row, $(2,1)$, $(2,2)$ and~$(2,3)$
correspond to the vertices $(3,1)$, $(3,2)$ and~$(3,3)$ of the
last row of the original diagram~$\gamma$. The diagram~$\gamma_{cl}$
has an open chain $((1,2),(2,1))$ of length~2 (here the open
chain (2,1) of $\gamma_{pr}$, labelled by~2, is connected to the
vertex~(3,1) with second index~1), a closed chain
$((1,3),(2,3))$ of length~2 (here the open chain of $\gamma_{pr}$
labelled by~3 is connected with the vertex~(3,3)), and the
remaining open chains of $\gamma_{cl}$ of length~1 are (1,1),
(1,4) (the open chains (1,2) and (2,3) of $\gamma_{pr}$ with
labels~1 and~4), and (2,2).
Actually we have already calculated the function
$F_{\gamma_{pr}}(f_1,f_2)$ in formula~(\ref{(11.7)}). We can
calculate similarly the function
$F_\gamma(f_1,f_2,f_3)=F_{\gamma_{cl}}(F_{\gamma_{pr}}(f_1,f_2),f_3)$.
First we fix a labelling of the chains of the diagram~$\gamma_{cl}$,
say (1,1)--label~1, $((1,2),(2,1))$--label~2, (1,4)--label~3,
(2,2)--label~4, and $((1,3),(2,3))$--label~5. (I have indicated
this labelling in the corresponding picture.) With such a labelling
\begin{eqnarray*}
&&F_\gamma(f_1,f_2,f_3)(x_1,x_2,x_3,x_4)=Q_2P_5
(F_{\gamma_{pr}}(x_1,x_2,x_5,x_3)f_3(x_2,x_4,x_5)) \\
&&\qquad=\int F_{\gamma_{pr}}(x_1,x_2,x_5,x_3)f_3(x_2,x_4,x_5)\mu(\,dx_5)\\
&&\qquad\qquad
-\int F_{\gamma_{pr}}(x_1,x_2,x_5,x_3)f_3(x_2,x_4,x_5)
\mu(\,dx_2)\mu(\,dx_5).
\end{eqnarray*}
The normalized degenerate $U$-statistic corresponding to~$\gamma\,$
is
$$
n^{-2}4!I_{n,4}(F_{\gamma}(f_1,f_2,f_3)),
$$
and the term corresponding
to~$F_\gamma$ in formula (\ref{(11.11)}) is
$$
\left(\frac{n-4}n\right)^2\cdot n^{-1}\cdot n^{-2}4!I_{n,4}
(F_{\gamma}(f_1,f_2,f_3))
$$
if $n\ge5$. In the case $n\le4$ this term disappears.
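The exponent of~$n$ in the last coefficient can also be checked
directly in this example. Using the identities
$W(\gamma)=W(\gamma_{pr})+W(\gamma_{cl})$ and
$|O(\gamma)|=|O(\gamma_{cl})|$, which will be verified in the proof
of Theorem~11.2, and the fact that for a diagram with two rows the
quantity $W(\cdot)$ equals the number of its open chains of
length~2, we get
$$
W(\gamma)=W(\gamma_{pr})+W(\gamma_{cl})=1+1=2,\qquad
|O(\gamma)|=|O(\gamma_{cl})|=4,
$$
since $((1,3),(2,4))$ is the only open chain of length~2
in~$\gamma_{pr}$, and $((1,2),(2,1))$ is the only one
in~$\gamma_{cl}$. Hence the coefficient contains the factor
$n^{-W(\gamma)/2}n^{-|O(\gamma)|/2}=n^{-1}\cdot n^{-2}$, in
accordance with the above formula.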
\medskip\medskip
In Theorem 11.2 products of such degenerate $U$-statistics were
considered whose kernel functions are bounded. This also implies
that all functions $F_\gamma$ appearing at the right-hand side of
(\ref{(11.11)}) are well-defined (i.e. the integrals appearing in
their definition are convergent) and bounded. In the applications
of Theorem~11.2 it is useful to have a good bound on the $L_2$-norm
of the functions $F_\gamma(f_1,\dots,f_m)$. Such a result is
formulated in the following lemma.
\medskip\noindent
{\bf Lemma 11.3 (Estimate about the $L_2$-norm of the kernel
functions of the $U$-statistics appearing in the diagram
formula).}\index{bound on the kernel functions in the diagram
formula for $U$-statistics}
{\it Let $m$ functions $f_p(x_1,\dots,x_{k_p})$, $1\le p\le m$,
be given on the product spaces $(X^{k_p},{\cal X}^{k_p},\mu^{k_p})$
of some measure space $(X,{\cal X},\mu)$ with a probability
measure~$\mu$, and let them satisfy inequality~(\ref{(8.1)})
(with the index $k$ replaced by the index~$k_p$ in formula~(\ref{(8.1)})).
Let us take a coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$,
and consider the function $F_\gamma(f_1,\dots,f_m)$ defined
inductively by means of formula (\ref{(11.8)}). The $L_2$-norm of
the function $F_\gamma(f_1,\dots,f_m)$ (with respect to the
product measure~$\mu\times\cdots\times\mu$ on the space where
$F_\gamma(f_1,\dots,f_m)$ is defined) satisfies the inequality
$$
\|F_\gamma(f_1,\dots,f_m)\|_2
\le2^{W(\gamma)}\prod_{p\in U(\gamma)} \|f_p\|_2,
$$
where $W(\gamma)$ is given in~(\ref{(11.9)}), and the set
$U(\gamma)\subset\{1,\dots,m\}$ is defined as
\begin{eqnarray}
U(\gamma)&&=\{p\colon\; 1\le p\le m,\quad\textrm{for all vertices }
(p,r),\; 1\le r\le k_p \textrm{ the chain }\beta\in\gamma
\nonumber \\
&&\qquad \text{ for which } (p,r)\in\beta \textrm{ has the
property that either } u(\beta)=p \nonumber \\
&&\qquad\textrm{ or } d(\beta)=p\textrm{ and } c_\gamma(\beta)=1\}.
\label{(11.12)}
\end{eqnarray}
(If the point $(p,r)$ is contained in a chain $\beta=\{(p,r)\}\in\gamma$
of length~1, then $u(\beta)=d(\beta)=p$, and $c_\gamma(\beta)=-1$.
In this case the vertex $(p,r)$ satisfies the condition which
all vertices $(p,r)$, $1\le r\le k_p$, must satisfy to guarantee
the property $p\in U(\gamma)$.)}
\medskip\noindent
{\it Remark.}\/ Let us give a less formal definition of the set
$U(\gamma)$ in formula (\ref{(11.12)}). It contains the indices
of those rows of the diagram~$\gamma$ whose vertices behave in
a sense nicely. This nice behaviour means the following. Each
vertex is contained in a chain $\beta$ of the diagram~$\gamma$.
We say that a vertex behaves nicely if it is either at the
highest or at the lowest level of the chain~$\beta\in\gamma$
containing it. Moreover, if it is at the lowest level, then we
also demand that $\beta$ must be closed, i.e.\ $c_\gamma(\beta)=1$.
If a vertex is contained in a chain containing no other vertex,
then it is both at the highest and at the lowest level of this
chain. In this case we say that the vertex behaves nicely.
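To illustrate the definition of $U(\gamma)$, let us determine it,
as a small side-computation, for the diagram~$\gamma$ of the
example considered earlier in this chapter. The chains of~$\gamma$
are the closed chains $((1,1),(2,2))$ and $((1,3),(2,4),(3,3))$,
the open chain $((2,1),(3,1))$ of length~2, and the open chains
$\{(1,2)\}$, $\{(2,3)\}$, $\{(3,2)\}$ of length~1. Every vertex of
the first row is at the highest level of its chain, hence
$1\in U(\gamma)$. The vertex $(2,4)$ is neither at the highest nor
at the lowest level of the chain $((1,3),(2,4),(3,3))$, hence
$2\notin U(\gamma)$, and the vertex $(3,1)$ is at the lowest level
of the open chain $((2,1),(3,1))$, hence $3\notin U(\gamma)$. Thus
$$
U(\gamma)=\{1\}.
$$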
\medskip
The last result of this chapter is a corollary of Theorem~11.2.
In this corollary we give an estimate on the expected value of a
product of degenerate $U$-statistics. To formulate this result
we introduce the following terminology. We call a (coloured)
diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ closed if
$c_\gamma(\beta)=1$ for all chains $\beta\in\gamma$, and denote
the set of all closed diagrams by $\bar\Gamma(k_1,\dots,k_m)$.
Observe that $F_\gamma(f_1,\dots,f_m)$ is constant (a function
of zero variables) if and only if $\gamma$ is a closed diagram,
i.e. $\gamma\in\bar\Gamma(k_1,\dots,k_m)$, and
$$
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))
=I_{n,0}(F_\gamma(f_1,\dots,f_m))
=F_\gamma(f_1,\dots,f_m)
$$
in this case. Now we formulate the following result.
\medskip\noindent
{\bf Corollary of Theorem 11.2 about the expectation of a product
of degenerate $U$-statistics.}\index{calculation of the expectation
of a product of degenerate $U$-statistics}
{\it Let a finite sequence of functions $f_p(x_1,\dots,x_{k_p})$,
$1\le p\le m$, be given on the products $(X^{k_p},{\cal X}^{k_p})$ of
some measurable space $(X,{\cal X})$ together with a sequence of
independent and identically distributed random variables with
values in the space $(X,{\cal X})$ and some distribution~$\mu$
which satisfy the conditions of Theorem 11.2.
Let us apply the notation of Theorem~11.2 together with the notion
of the above introduced class of closed diagrams
$\bar\Gamma(k_1,\dots,k_m)$. The identity
\begin{eqnarray}
&&E\left(\prod_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)\right)
\label{(11.13)} \\
&&\qquad = {\sum_{\gamma\in\bar\Gamma(k_1,\dots,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\prime(n,m)}
\left(\prod_{p=1}^m
J_n(\gamma,p)\right) n^{-W(\gamma)/2}\cdot F_\gamma(f_1,\dots,f_m)
\nonumber
\end{eqnarray}
holds. This identity has the consequence
\begin{equation}
\left|E\left(\prod_{p=1}^m n^{-k_p/2} k_p!
I_{n,k_p}(f_p)\right)\right|
\le \sum_{\gamma\in\bar\Gamma(k_1,\dots,k_m)}
n^{-W(\gamma)/2}|F_\gamma(f_1,\dots,f_m)|.
\label{(11.14)}
\end{equation}
Beside this, if the functions~$f_p$, $1\le p\le m$, satisfy
conditions~(\ref{(8.1)}) and~(\ref{(8.2)}) (with indices~$k_p$
instead of~$k$ in them), then the numbers
$|F_\gamma(f_1,\dots,f_m)|$ at the right-hand
side of~(\ref{(11.14)}) satisfy the inequality
\begin{eqnarray}
|F_\gamma(f_1,\dots,f_m)|\le2^{W(\gamma)}\sigma^{|U(\gamma)|} \quad
\textrm{for all } \gamma\in\bar\Gamma(k_1,\dots,k_m).
\label{(11.15)}
\end{eqnarray}
In formula~(\ref{(11.15)}) the same number~$W(\gamma)$ and
set $U(\gamma)$ appear as in Lemma 11.3. The only difference is
that in the present case the definition of $U(\gamma)$ becomes a bit
simpler, since $c_\gamma(\beta)=1$ for all chains $\beta\in\gamma$.}
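\medskip
It may be instructive to spell out formula~(\ref{(11.13)}) in the
simplest case $m=2$; this is only an illustration, it is not needed
in the sequel. For $m=2$ all chains have length at most~2, and
since chains of length~1 are open, a closed diagram is a pairing
of the vertices of the two rows. Hence
$\bar\Gamma(k_1,k_2)=\emptyset$ unless $k_1=k_2=k$, every closed
diagram $\gamma\in\bar\Gamma(k,k)$ can be identified with a
permutation~$\pi$ of the set $\{1,\dots,k\}$, and $W(\gamma)=0$
for all of them. Writing out the coefficients of
formula~(\ref{(11.13)}) explicitly in this case we get
$$
E\left(n^{-k}k!I_{n,k}(f_1)\,k!I_{n,k}(f_2)\right)
=\sum_\pi\prod_{j=1}^k\left(\frac{n-k+j}n\right)
\int f_1(x_1,\dots,x_k)f_2(x_{\pi(1)},\dots,x_{\pi(k)})
\,\mu(\,dx_1)\dots\mu(\,dx_k)
$$
if $n\ge k$, where the sum runs over all permutations~$\pi$ of the
set $\{1,\dots,k\}$.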
\medskip\noindent
{\it Remark:}\/ We have applied a different terminology for
diagrams in this chapter and in Chapter~10, where the theory
of Wiener--It\^o integrals was discussed. But there is a
simple relation between the two terminologies. If we take only
those diagrams considered in this chapter which contain only
chains of length~1 or~2, and beside this the chains of length~1
have colour~$-1$, and the chains of length~2 have colour~1,
then we get the diagrams considered in the previous chapter.
Moreover, the functions $F_\gamma=F_\gamma(f_1,\dots,f_m)$
are the same in the two cases. Hence formula~(\ref{(10.18)})
in the Corollary of Theorem~10.2 and formula~(\ref{(11.14)})
in the Corollary of Theorem~11.2 make it possible to compare
the moments of Wiener--It\^o integrals and degenerate
$U$-statistics.
The main difference between the estimates of this chapter and
those given in the Gaussian case is that formula~(\ref{(11.14)})
contains some additional terms. They are the contributions of
those diagrams $\gamma\in\bar\Gamma(k_1,\dots,k_m)$ which
contain chains $\beta\in\gamma$ with length $\ell(\beta)>2$.
These are exactly the diagrams $\gamma\in\bar\Gamma(k_1,\dots,k_m)$
for which $W(\gamma)\ge1$. The estimate~(\ref{(11.15)}) given
for the terms $F_\gamma$ corresponding to such diagrams is
weaker than the estimate given for the terms $F_\gamma$ with
$W(\gamma)=0$, since $|U(\gamma)|<m$ in this case.
Relation~(\ref{(12.4)}) follows from relation~(\ref{(12.2)})
in the same way as formula~(\ref{(9.3)}) follows from
formula~(\ref{(9.2)}) in the proof of the Hoeffding
decomposition. Let us understand why the coefficient
$n^{|C(\gamma)|}J_n(\gamma)$ appears at the right-hand
side of~(\ref{(12.4)}).
This coefficient can be calculated in the following way.
Let us write up the identity
\begin{eqnarray*}
&&n^{-(k_1+k_2)/2}
(f_1\circ f_2)_{\bar\gamma}
(\xi_{j_1},\dots,\xi_{j_{s(\bar\gamma)}})\\
&&\qquad =\sum_{\gamma\in\Gamma(\bar\gamma)}
n^{-(k_1+k_2)/2}\bar F_\gamma(f_1,f_2)
(\xi_{j_{l_1}},\dots,\xi_{j_{l_{|O(\gamma)|}}})
\end{eqnarray*}
with the help of~(\ref{(12.2)}) for all sequences
$\xi_{j_1},\dots,\xi_{j_{s(\bar\gamma)}}$, and let us sum it up
for all such sets of arguments $(j_1,\dots,j_{s(\bar\gamma)})$
for which all indices $j_p$, $1\le p\le s(\bar\gamma)$, are
different, and $1\le j_p\le n$. Then we get at the left-hand
side of the identity the $U$-statistic
$$
n^{-(k_1+k_2)/2}s(\bar\gamma)! I_{n,s(\bar\gamma)}
\left((f_1\circ f_2)_{\bar\gamma}\right).
$$
We still have to check that at the right-hand side of this
identity we get a sum, where a term of the form
$n^{-(k_1+k_2)/2}\bar F_\gamma(f_1,f_2)
(\xi_{j_{l_1}},\dots,\xi_{j_{l_{|O(\gamma)|}}})$ appears
with multiplicity $n^{|C(\gamma)|}J_n(\gamma)$. Indeed, such a
term appears for such vectors $(j_1,\dots, j_{s(\bar\gamma)})$
for which the values of $|O(\gamma)|$ arguments are fixed, while the
remaining arguments can take arbitrary values between~1 and~$n$
with the only restriction that all coordinates must be
different. (The operators $P_v$ are applied for these remaining
coordinates.) There are $n^{|C(\gamma)|}J_n(\gamma)$ such
vectors. The above observations imply identity~(\ref{(12.4)}).
Let us observe that $k_1+k_2-2|C(\gamma)|=|O(\gamma)|+W(\gamma)$
with the number $W(\gamma)$ introduced in the formulation of
Theorem~11.1. Hence
$$
n^{-(k_1+k_2)/2}n^{|C(\gamma)|}=n^{-W(\gamma)/2}n^{-|O(\gamma)|/2}.
$$
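Indeed, the last identity is obtained by substituting
$k_1+k_2=|O(\gamma)|+W(\gamma)+2|C(\gamma)|$ into the exponent:
$$
n^{-(k_1+k_2)/2}n^{|C(\gamma)|}
=n^{-(|O(\gamma)|+W(\gamma)+2|C(\gamma)|)/2+|C(\gamma)|}
=n^{-W(\gamma)/2}n^{-|O(\gamma)|/2}.
$$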
Let us replace the left-hand side of the last identity by its
right-hand side in~(\ref{(12.4)}), and let us sum up the identity
we get in such a way for all $\bar\gamma\in\bar\Gamma(k_1,k_2)$
such that $s(\bar\gamma)\le n$. The identity we get in such a way
together with formulas~(\ref{(12.1)}) and~(\ref{(12.5)}) imply
such a version of identity~(\ref{(11.4)}) where the kernel
functions $F_\gamma(f_1,f_2)$ of the $U$-statistics at the
right-hand side of the equation are replaced by the kernel functions
$\bar F_\gamma(f_1,f_2)$ defined in~(\ref{(12.3)}). But we can
get the function $F_\gamma(f_1,f_2)$ by reindexing the arguments
of the function $\bar F_\gamma(f_1,f_2)$. This can be seen by
taking the original indexation of the chains of~$\gamma$ and
looking at the indexation of the vertices it implies. On the
other hand, we know that the reindexation of the variables of
the kernel function does not change the value of the $U$-statistic.
Hence $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))
=I_{n,|O(\gamma)|}(\bar F_\gamma(f_1,f_2))$, and
identity~(\ref{(11.4)}) holds in its original form.
Clearly, $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))=
I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,f_2))$,
hence $I_{n,|O(\gamma)|}(F_\gamma(f_1,f_2))$ can be replaced by
$I_{n,|O(\gamma)|}(\textrm{\rm Sym}\,F_\gamma(f_1,f_2))$ in
formula~(\ref{(11.4)}). Beside this, we have shown that the
functions $F_\gamma(f_1,f_2)$ are canonical, and it can be simply
shown that they are bounded, if the functions~$f_1$ and~$f_2$ have
this property. We still have to prove inequalities~(\ref{(11.5)})
and~(\ref{(11.6)}).
\medskip
Inequality (\ref{(11.5)}), the estimate of the $L_2$-norm of the
function $F_\gamma(f_1,f_2)$ follows from the Schwarz
inequality, and actually it agrees with
inequality~(\ref{(10.11)}), proved at the start
of Appendix~B. Hence its proof is omitted here.
To prove inequality (\ref{(11.6)}) let us introduce, similarly to
formula (\ref{(9.1a)}), the operators
$$
(\tilde Q_{j}h)(x_{1},\dots,x_r)=h(x_1,\dots,x_r)+
\int h(x_1,\dots,x_r)\mu(\,dx_j),\quad 1\le j\le r,
$$
in the space of functions $h(x_1,\dots,x_r)$ whose coordinates
take values in the space $(X,{\cal X})$. Observe that both the operators
$\tilde Q_j$ and the operators $P_j$ defined in (\ref{(9.1)}) are
positive, i.e. they map a non-negative function to a non-negative
function. Beside this, $Q_j\le\tilde Q_j$, and the norms of the
operators $\frac{\tilde Q_j}2$ and $P_j$ are bounded by~1
in the $L_1(\mu)$ norm, the $L_2(\mu)$ norm and the supremum norm alike.
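The bound on the norm of $\tilde Q_j$ in the $L_2(\mu)$ norm, which
will be used below, can be seen for instance in the following way.
The operator $h\mapsto\int h(x_1,\dots,x_r)\,\mu(\,dx_j)$ is a
contraction in $L_2(\mu)$ by the Schwarz inequality, hence
$$
\|\tilde Q_jh\|_2\le\|h\|_2
+\left\|\int h(x_1,\dots,x_r)\,\mu(\,dx_j)\right\|_2
\le2\|h\|_2.
$$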
Let us define the function
\begin{eqnarray*}
&&\tilde F_\gamma(f_1,f_2)(x_1,\dots,x_{|O(\gamma)|}) \\
&&\qquad=\left(\prod_{j\colon \beta(j)\in C(\gamma)}P_j
\prod_{j'\colon \beta(j')\in O_2(\gamma) } \tilde Q_{j'}\right)
(f_1\circ f_2)_\gamma(x_1,\dots,x_{|O(\gamma)|+|C(\gamma)|})
\end{eqnarray*}
with the notation of Chapter~11.
The function
$\tilde F_\gamma(f_1, f_2)$ is defined similarly to
$F_\gamma(f_1,f_2)$ introduced in~(\ref{(11.3)}) with the help of
$(f_1\circ f_2)_\gamma$; the only difference is that the operators
$Q_j$ are replaced by $\tilde Q_j$ in its definition.
The properties of the operators $P_j$ and $\tilde Q_j$
listed above together with the condition
$\sup|f_2(x_1,\dots,x_k)|\le1$ imply that
\begin{equation}
|F_\gamma(f_1,f_2)|\le \tilde F_\gamma(|f_1|,|f_2|)
\le \tilde F_\gamma(|f_1|,1), \label{(12.6)}
\end{equation}
where `$\le$' means that the function at the right-hand side is
greater than or equal to the function at the left-hand side in
all points, and the term~1 in~(\ref{(12.6)}) denotes the function
which equals identically~1. Because of
the relation~(\ref{(12.6)}) to prove relation~(\ref{(11.6)})
it is enough to show that
\begin{eqnarray}
&&\|\tilde F_\gamma(|f_1|,1)\|_2 \nonumber \\
&&\qquad=\left\|\left(\prod_{j\colon \beta(j)\in C(\gamma)} P_j
\prod_{j'\colon \beta(j')\in O_2(\gamma)} \tilde Q_{j'}\right)
|f_1(x_{\alpha_\gamma((1,1))},
\dots,x_{\alpha_\gamma((1,k_1))})|\right\|_2 \nonumber \\
&&\qquad\le 2^{|O_2(\gamma)|}\|f_1\|_2=2^{W(\gamma)}\|f_1\|_2.
\label{(12.7)}
\end{eqnarray}
But this inequality trivially holds, since the norm of all
operators $P_j$ in formula (\ref{(12.7)}) is bounded
by~1, the norm of all operators $\tilde Q_j$ is bounded
by~2 in the $L_2(\mu)$ norm, and $|O_2(\gamma)|=W(\gamma)$.
\hfill$\qed$
\medskip\noindent
{\it Proof of Theorem 11.2.} Theorem~11.2 will be proved
with the help of Theorem~11.1 by induction with respect to
the number~$m$ of the terms in the product of the degenerate
$U$-statistics $k_p!I_{n,k_p}(f_p)$, $1\le p\le m$. It is
not difficult to check with the help of Theorem~11.1 and
the recursive definition of the functions~$F_\gamma$ by
applying induction with respect to~$m$ that the functions
$F_\gamma(f_1,\dots,f_m)$ are bounded and canonical if the
functions~$f_1,\dots,f_m$ satisfy the same properties.
We still have to prove the identity~(\ref{(11.11)}). This
will also be proved by induction with respect to~$m$ with
the help of Theorem~11.1.
For $m=2$ formula~(\ref{(11.11)}) follows from Theorem~11.1,
since in this case it agrees with relation~(\ref{(11.4)}).
To prove this formula for $m\ge3$ first we express with the
help of our inductive hypothesis the product of the first
$m-1$ terms in the product of degenerate $U$-statistics
as a sum of degenerate $U$-statistics.
Then we express the product of each term in this sum with
the last $U$-statistic of the product as a sum of
$U$-statistics with the help of Theorem~11.1, and sum up
these identities. In such a way we express the product
of~$m$ degenerate $U$-statistics in the form of a sum of
degenerate $U$-statistics. We have to show that in such a
way we get formula~(\ref{(11.11)}). In the proof of this
statement we shall exploit that in the calculation of the
product of the first $m-1$ $U$-statistics we have to
work with the diagrams~$\gamma_{pr}$, and if we calculate the
product of these terms with the $m$-th $U$-statistic,
then we calculate with the diagrams~$\gamma_{cl}$.
To carry out the above program first we observe that a
diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ is uniquely determined
by the pair $(\gamma_{pr},\gamma_{cl})$ defined with the help
of~$\gamma$, i.e. if $\gamma,\gamma'\in\Gamma(k_1,\dots,k_m)$,
and $\gamma\neq\gamma'$, then either $\gamma_{pr}\neq\gamma'_{pr}$
or $\gamma_{cl}\neq\gamma'_{cl}$. Hence we can identify each
diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ with the pair
$(\gamma_{pr},\gamma_{cl})$ we defined with its help. Beside
this, the pairs of diagrams $(\gamma_{pr},\gamma_{cl})$
satisfy the relations $\gamma_{pr}\in\Gamma(k_1,\dots,k_{m-1})$
and $\gamma_{cl}\in\Gamma(|O(\gamma_{pr})|,k_m)$.
Moreover, the class of pairs of diagrams
$(\gamma_{pr},\gamma_{cl})$, $\gamma\in\Gamma(k_1,\dots,k_m)$,
have the following characterization. Take all such pairs of
diagrams $(\bar\gamma,\hat\gamma)$ for which
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ and
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$. There is a
one-to-one correspondence between the pairs of diagrams
$(\bar\gamma,\hat\gamma)$ with this property and the diagrams
$\gamma\in\Gamma(k_1,\dots,k_m)$ in such a way that
$\bar\gamma=\gamma_{pr}$ and $\hat\gamma=\gamma_{cl}$. (This
correspondence depends on the labelling of the open chains
of the diagrams $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ that
we have previously fixed.) It is not difficult to check the
above statements, and I leave this to the reader.
Because of our inductive hypothesis we can write by applying
relation~(\ref{(11.11)}) of Theorem~11.2 with parameter~$m-1$
the identity
\begin{eqnarray}
&&\prod_{p=1}^{m-1} n^{-k_p/2}k_p!I_{n,k_p}(f_p)
={\sum_{\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n,\,m-1)}
\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right) \nonumber \\
&&\qquad n^{-W(\bar\gamma)/2}
\cdot n^{-|O(\bar\gamma)|/2} |O(\bar\gamma)|!
I_{n,|O(\bar\gamma)|}(F_{\bar\gamma}(f_1,\dots,f_{m-1})).
\label{(12.8)}
\end{eqnarray}
(Here we use the notations of Chapter~11.)
We get by applying the identity~(\ref{(11.4)}) of Theorem~11.1
for the product
$$
n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!I_{n,|O(\bar\gamma)|}
(F_{\bar\gamma}(f_1,\dots,f_{m-1}))\cdot n^{-k_m/2}k_m!I_{n,k_m}(f_m),
$$
and by multiplying it with
$\left(\prod\limits_{p=2}^{m-1}J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2}$ that the identity
\begin{eqnarray}
&&\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2}n^{-|O(\bar\gamma)|/2}|O(\bar\gamma)|!
I_{n,|O(\bar\gamma)|}(F_{\bar\gamma}(f_1,\dots,f_{m-1}))\nonumber\\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\cdot n^{-k_m/2}k_m! I_{n,k_m}(f_m) \nonumber \\
&&\qquad=\left(\prod_{p=2}^{m-1} J_n(\bar\gamma,p)\right)
n^{-W(\bar\gamma)/2} \!\!\!
{\sum_{\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n)}
\,\,\,\, \prod_{j=1}^{|C(\hat\gamma)|}
\left(\frac{n-s(\hat\gamma)+j}n\right) \label{(12.9)} \\
&&\qquad\qquad n^{-W(\hat\gamma)/2} \cdot
n^{-|O(\hat\gamma)|/2}|O(\hat\gamma)|!
I_{n,|O(\hat\gamma)|}
(F_{\hat\gamma}(F_{\bar\gamma}(f_1,\dots,f_{m-1}),f_m))
\nonumber
\end{eqnarray}
holds for all $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$,
where ${\sum\limits_{\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)}}
^{\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\! \prime(n)}$
means that summation is taken for such diagrams
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$ for which
$s(\hat\gamma)=|O(\hat\gamma)|+|C(\hat\gamma)|\le n$, and
$\prod\limits_{j=1}^{|C(\hat\gamma)|}$ equals~1, if
$|C(\hat\gamma)|=0$.
We shall prove relation~(\ref{(11.11)}) for the parameter~$m$
with the help of relations~(\ref{(12.8)}) and~(\ref{(12.9)}).
Let us sum up formula~(\ref{(12.9)}) for all such diagrams
$\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$ for which
$B_1(\bar\gamma,p)+B_2(\bar\gamma,p)\le n$ for all $2\le p\le m-1$.
The numbers $B_1(\cdot)$ and $B_2(\cdot)$ in these inequalities
are the numbers introduced before formula~(\ref{(11.10)}), only
in this case the diagram~$\gamma$ is replaced by~$\bar\gamma$.
We imposed those conditions on the terms~$\bar\gamma$ in this
summation which appear in the conditions of the summation in
${\sum}^{\prime(n,m-1)}$ at the right-hand side of
formula~(\ref{(12.8)}) when it is applied with parameter~$m-1$.
Hence formula~(\ref{(12.8)}) implies that the sum of the
terms at the left-hand side of these identities equals
$\prod\limits_{p=1}^m n^{-k_p/2}k_p!I_{n,k_p}(f_p)$, i.e.
the left-hand side of~(\ref{(11.11)}) for parameter~$m$. To
prove formula~(\ref{(11.11)}) for the parameter~$m$ it is
enough to show that the sum of the right-hand side terms of the above
identities equals the right-hand side of~(\ref{(11.11)}).
In the proof of this relation we shall apply the properties of
the pairs of diagrams $(\gamma_{pr},\gamma_{cl})$ coming from a
diagram~$\gamma\in\Gamma(k_1,\dots,k_m)$ mentioned before.
Namely, we shall exploit that there is a one-to-one
correspondence between the diagrams
$\gamma\in\Gamma(k_1,\dots,k_m)$ and pairs of diagrams
$(\bar\gamma,\hat\gamma)$, $\bar\gamma\in\Gamma(k_1,\dots,k_{m-1})$,
$\hat\gamma\in\Gamma(|O(\bar\gamma)|,k_m)$ in such a way that
$\gamma$ and the pair $(\bar\gamma,\hat\gamma)$ correspond to each
other if and only if $\bar\gamma=\gamma_{pr}$ and
$\hat\gamma=\gamma_{cl}$. This correspondence enables us to
reformulate the statement we have to prove in the following way. Let
us rewrite formula~(\ref{(12.9)}) by replacing $\bar\gamma$ with
$\gamma_{pr}$ and $\hat\gamma$ by $\gamma_{cl}$, with that diagram
$\gamma\in\Gamma(k_1,\dots,k_m)$ for which $\bar\gamma=\gamma_{pr}$
and $\hat\gamma=\gamma_{cl}$. It is enough to show that if
we take those modified versions of~(\ref{(12.9)}) which we
get by replacing the pairs $(\bar\gamma,\hat\gamma)$ by the
pairs $(\gamma_{pr},\gamma_{cl})$ with some
$\gamma\in\Gamma(k_1,\dots,k_m)$
and sum them up for those~$\gamma$ for which
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, then the sum of the right-hand side
expressions in these identities equals the right-hand
side of~(\ref{(11.11)}).
We shall prove the above identity with the help of the
following statements to be verified later.
For all $\gamma\in\Gamma(k_1,\dots,k_m)$ the identities
$W(\gamma_{pr})+W(\gamma_{cl})=W(\gamma)$ and
$$
\prod\limits_{p=2}^{m-1} J_n(\gamma_{pr},p)
\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)
=\prod\limits_{p=2}^m J_n(\gamma,p),
$$
hold, where $\prod\limits_{j=1}^{|C(\gamma_{cl})|}=1$ if
$|C(\gamma_{cl})|=0$. The inequalities
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ hold simultaneously for
all $2\le p\le m$ for a diagram~$\gamma$ if and only if
the inequalities $B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$
for all $2\le p\le m-1$ and $s(\gamma_{cl})\le n$ hold
simultaneously for this~$\gamma$.
To prove the identity we claimed to hold with the help of
the above relations let us first check that we sum up for
the same set of $\gamma\in\Gamma(k_1,\dots,k_m)$ if we take
the sum of modified versions of~(\ref{(12.9)}) for all $\gamma$
such that $B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$ and if we take the ${\sum}^{\prime(n,m)}$
at the right-hand side of~(\ref{(11.11)}). Indeed, in the
second case we have to take those diagrams $\gamma$ for which
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ for all $2\le p\le m$, while
in the first case we take those diagrams~$\gamma$ for which
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, and $s(\gamma_{cl})\le n$. The last
condition is contained in a slightly hidden form in the
summation ${\sum}^{\prime(n)}$ of formula~(\ref{(12.9)}).
Hence the above-mentioned relations imply that we have to sum
over the same diagrams~$\gamma$ in the two cases.
Beside this, it follows from~(\ref{(11.8)}) that the same
$U$-statistics appear for a
diagram~$\gamma\in\Gamma(k_1,\dots,k_m)$ in~(\ref{(11.11)}) and
in the modified version of~(\ref{(12.9)}). We still have to
check that they have the same coefficients in the two cases.
But this holds, because the previously formulated identities
imply that
\begin{eqnarray*}
n^{-W(\gamma_{pr})/2}n^{-W(\gamma_{cl})/2}&=&n^{-W(\gamma)/2},\\
\prod\limits_{p=2}^{m-1} J_n(\gamma_{pr},p)
\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)
&=& \prod\limits_{p=2}^m J_n(\gamma,p)
\end{eqnarray*}
and $n^{-|O(\gamma_{cl})|/2}|O(\gamma_{cl})|!
=n^{-|O(\gamma)|/2}|O(\gamma)|!$, since
$|O(\gamma)|=|O(\gamma_{cl})|$, as we have seen before.
To complete the proof of the identity it remains to check the
relations we applied in the previous argument. We start with
the proof of the identity
$W(\gamma_{pr})+W(\gamma_{cl})=W(\gamma)$ for the
function~$W(\cdot)$ defined in~(\ref{(11.9)}).
Let us first remark that $W(\gamma_{cl})=|O_2(\gamma_{cl})|$,
where $O_2(\gamma_{cl})$ is the set of open chains
in~$\gamma_{cl}$ with length~2. Beside this if
$\beta\in\gamma$ is such that
$\beta\cap\{(m,1),\dots,(m,k_m)\}=\emptyset$, i.e. if the
chain~$\beta$ contains no vertex from the last row of the
diagram~$\gamma$, then $\ell(\beta)=\ell(\beta_{pr})$, and
$c_\gamma(\beta)=c_{\gamma_{pr}}(\beta_{pr})$. If
$\beta\cap\{(m,1),\dots,(m,k_m)\}\neq\emptyset$, then either
$c_\gamma(\beta)=1$, $\ell(\beta_{pr})=\ell(\beta)-1$, and
$c_{\gamma_{pr}}(\beta_{pr})=-1$, or $c_\gamma(\beta)=-1$ and one
of the following cases occurs. Either $\ell(\beta)=1$, and
the chain $\beta_{pr}$ does not exist, or $\ell(\beta)>1$,
and $\ell(\beta_{pr})=\ell(\beta)-1$,
$c_{\gamma_{pr}}(\beta_{pr})=-1$. We get by calculating
$W(\gamma)$ with the help of the above relations that
$W(\gamma)=W(\gamma_{pr})+|{\cal V}(\gamma)|$, where
${\cal V}(\gamma)=\{\beta\colon\; \beta\in\gamma,\,
\beta\cap\{(m,1),\dots,(m,k_m)\}\neq\emptyset,\, \ell(\beta)>1,\,
c_\gamma(\beta)=-1\}$. Since
$|{\cal V}(\gamma)|=|O_2(\gamma_{cl})|$, the above
relations imply the desired identity.
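These relations can also be checked directly, as a small
side-computation, in the three-row example considered in
Chapter~11. There the only chain of~$\gamma$ which meets the last
row, has length greater than~1 and colour~$-1$ is $((2,1),(3,1))$,
hence $|{\cal V}(\gamma)|=1$. The chain $((1,3),(2,4))$ is the only
open chain of length~2 in~$\gamma_{pr}$, hence $W(\gamma_{pr})=1$,
and
$$
W(\gamma)=W(\gamma_{pr})+|{\cal V}(\gamma)|=1+1=2,
$$
in accordance with the relation $W(\gamma_{cl})=|O_2(\gamma_{cl})|=1$.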
To prove the remaining relations first we observe that for
each diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ and number
$2\le p\le m-1$ the identities
$B_1(\gamma_{pr},p)=B_1(\gamma,p)$ and
$B_2(\gamma_{pr},p)=B_2(\gamma,p)$ hold. Beside this,
$|C(\gamma_{cl})|=B_1(\gamma,m)$
and $|O(\gamma_{cl})|=B_2(\gamma,m)$. The identity about
$|C(\gamma_{cl})|$ simply follows from the definition
of~$\gamma_{cl}$ and $B_1(\gamma,m)$. To prove the
identity about $|O(\gamma_{cl})|$ observe that
$|O(\gamma_{cl})|=|O(\gamma)|$, and
$|O(\gamma)|=B_2(\gamma,m)$. (Observe that in the case
$p=m$ the definition of the set ${\cal B}_2(\gamma,m)$
becomes simpler, because there is no chain
$\beta\in\gamma$ for which $d(\beta)>m$.)
The remaining relations can be deduced from these facts.
Indeed, they imply that $J_n(\gamma_{pr},p)=J_n(\gamma,p)$
for all $2\le p\le m-1$. Beside this,
$\prod\limits_{j=1}^{|C(\gamma_{cl})|}
\left(\frac{n-s(\gamma_{cl})+j}n\right)=J_n(\gamma,m)$
because of the relations $|C(\gamma_{cl})|=B_1(\gamma,m)$,
$|O(\gamma_{cl})|=B_2(\gamma,m)$,
$s(\gamma_{cl})=|C(\gamma_{cl})|+|O(\gamma_{cl})|$ and the
definition of $J_n(\gamma,m)$. Hence the identity about the
product of the terms $J_n(\gamma,p)$ holds. It can be seen
similarly that the relations
$B_1(\gamma,p)+B_2(\gamma,p)\le n$ hold for all
$2\le p\le m-1$ if and only if
$B_1(\gamma_{pr},p)+B_2(\gamma_{pr},p)\le n$ for all
$2\le p\le m-1$, and $B_1(\gamma,m)+B_2(\gamma,m)\le n$ if
and only if $s(\gamma_{cl})\le n$.
Thus we have proved identity~(\ref{(11.11)}). To complete the
proof of Theorem~11.2 we still have to show that under its
conditions $F_{\gamma}(f_1,\dots,f_m)$ is a bounded, canonical
function. But this follows from Theorem~11.1 and
relation~(\ref{(11.8)}) by a simple induction argument.
\hfill$\qed$
\medskip\noindent
{\it Proof of Lemma 11.3.} Lemma~11.3 will be proved by induction
with respect to the number~$m$ of the terms in the product of
$U$-statistics with the help of inequalities~(\ref{(11.5)})
and~(\ref{(11.6)}). These relations imply the desired inequality
for $m=2$. In the case $m>2$ we apply the identity~(\ref{(11.8)})
$F_{\gamma}(f_1,\dots,f_m)=
F_{\gamma_{cl}}(F_{\gamma_{pr}}(f_1,\dots,f_{m-1}),f_m)$. We have
seen that $W(\gamma)=W(\gamma_{pr})+W(\gamma_{cl})$, and it is not
difficult to show that $U(\gamma)=U(\gamma_{pr})+U(\gamma_{cl})$.
Hence if $U(\gamma_{cl})=0$, i.e. if $\gamma_{cl}$ contains a
chain of length~2 with colour~$-1$, then $U(\gamma)=U(\gamma_{pr})$,
and an application of~(\ref{(11.8)}) and~(\ref{(11.6)}) for the
diagram~$\gamma_{cl}$ implies Lemma~11.3 in this case.
If $U(\gamma_{cl})=1$, then $W(\gamma_{cl})=0$,
$U(\gamma)=U(\gamma_{pr})+1$, $W(\gamma)=W(\gamma_{pr})$, and
the application of~(\ref{(11.8)}) and~(\ref{(11.5)}) for the
diagram~$\gamma_{cl}$ implies Lemma~11.3 in this case.
\hfill$\qed$
\medskip
The corollary of Theorem 11.2 is a simple consequence of
Theorem~11.2 and Lem\-ma~11.3.
\medskip\noindent
{\it Proof of the corollary of Theorem 11.2.}\/ Observe that
$F_\gamma$ is a function of $|O(\gamma)|$ arguments. Hence a
coloured diagram $\gamma\in\Gamma(k_1,\dots,k_m)$ is in the
class of closed diagrams, i.e.
$\gamma\in\bar\Gamma(k_1,\dots,k_m)$ if and only if
$F_\gamma(f_1,\dots,f_m)$ is a constant. Thus
formula~(\ref{(11.13)}) is a simple consequence of
relation~(\ref{(11.11)}) and the observation
that $EI_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))=0$ if
$|O(\gamma)|\ge1$, i.e. if
$\gamma\notin\bar\Gamma(k_1,\dots,k_m)$, and
\begin{eqnarray*}
I_{n,|O(\gamma)|}(F_\gamma(f_1,\dots,f_m))
&&=I_{n,0}(F_\gamma(f_1,\dots,f_m))=F_\gamma(f_1,\dots,f_m) \\
&&\qquad\qquad\qquad\qquad\qquad
\textrm{ if }\gamma\in\bar\Gamma(k_1,\dots,k_m).
\end{eqnarray*}
Relations~(\ref{(11.14)}) and~(\ref{(11.15)}) follow from
relation~(\ref{(11.13)}) and Lemma~11.3.
\hfill$\qed$
\chapter{The proof of Theorems 8.3, 8.5 and Example 8.7}
In this chapter we prove the estimates on the distribution of
a multiple Wiener--It\^o integral or degenerate $U$-statistic
formulated in Theorems~8.5 and~8.3, and also present the proof of
Example~8.7. Beside this, we prove a multivariate version
of Hoeffding's inequality~(Theorem~3.4). The latter result is
useful in the estimation of the supremum of a class of degenerate
$U$-statistics. The estimate on the distribution of a multiple
random integral with respect to a normalized empirical
distribution given in Theorem~8.1 is omitted, because, as was
shown in Chapter~9, this result follows from the estimate of
Theorem~8.3 on degenerate $U$-statistics. We finish this chapter
with a separate part, Chapter~13~B, where the results proved in
this chapter are discussed together with the method of their
proofs and some recent results. These new results state that in
certain cases the estimates on the tail distribution of
Wiener--It\^o integrals and $U$-statistics considered in this
chapter can be improved if we have some additional information
on the kernel function of these Wiener--It\^o integrals or
$U$-statistics.
The proof of Theorems~8.5 and~8.3 is based on a good estimate
on high moments of Wiener--It\^o integrals and degenerate
$U$-statistics. Such estimates can be proved with the help of
the corollaries of Theorems~10.2 and~11.2. This approach
slightly differs from the classical proof in the one-variate
case. The one-variate version of the above problems is an
estimate about the tail distribution of a sum of independent
random variables. Such an estimate can be obtained with the
help of a good bound on the moment generating function of the
sum. This method does not work in the multivariate case,
because, as later calculations will show, there is no good
estimate on the moment-generating function of $U$-statistics
or multiple Wiener--It\^o integrals of order $k\ge3$.
Actually, the moment-generating function of a Wiener--It\^o
integral of order $k\ge3$ is always divergent, because the
tail distribution behaviour of such a random integral is
similar to that of the $k$-th power of a Gaussian random
variable. On the other hand, good bounds on the moments
$EZ^{2M}$ of a random variable~$Z$ for all positive
integers~$M$ (or at least for a sufficiently rich class of
parameters~$M$) together with the application of the Markov
inequality for $Z^{2M}$ and an appropriate choice of the
parameter~$M$ yield a good estimate on the tail distribution
of~$Z$.
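The mechanism just described, namely bounding the even moments of a random variable and then optimizing the Markov inequality over the integer parameter~$M$, can be illustrated by a minimal numeric sketch (Python, with illustrative parameters not tied to any concrete integral; the moment bound is taken in the simplified form $E|Z|^{2M}\le\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}$ that will appear in formula~(\ref{(13.2)})):

```python
import math

# Illustrative parameters, not tied to any concrete integral in the text:
k, sigma = 2, 1.0

def log_moment_bound(M):
    # log of the simplified moment bound
    # E|Z|^{2M} <= (2/e)^{kM} (kM)^{kM} sigma^{2M}
    return k * M * (math.log(2.0) - 1.0) \
        + k * M * math.log(k * M) + 2 * M * math.log(sigma)

def log_tail_bound(u, M_max=200):
    # Markov inequality for Z^{2M}: P(|Z| > u) <= E|Z|^{2M} / u^{2M},
    # optimized over the integer moment parameter M.
    return min(log_moment_bound(M) - 2 * M * math.log(u)
               for M in range(1, M_max))

u = 20.0
k_M_bar = 0.5 * (u / sigma) ** (2.0 / k)  # continuous optimum of kM
b_log = log_tail_bound(u)
# Rounding M to an integer costs at most a factor e^k, which is exactly
# the pre-exponential loss appearing in the proofs below.
assert -k_M_bar - 1e-9 <= b_log <= -k_M_bar + k
```

With these parameters the optimized bound reproduces the exponent $-\frac12\left(\frac u\sigma\right)^{2/k}$ of Theorem~8.5.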
Propositions~13.1 and~13.2 contain some estimates on the moments
of Wiener--It\^o integrals and degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.1 (Estimate on the moments of Wiener--It\^o
integrals).}\index{estimate on the moments of Wiener--It\^o
integrals} {\it Let us consider a function $f(x_1,\dots,x_k)$
of $k$ variables on some measurable space $(X,{\cal X})$ which
satisfies formula~(\ref{(8.12)}) with some $\sigma$-finite
non-atomic measure $\mu$. Take the $k$-fold Wiener--It\^o
integral $Z_{\mu,k}(f)$ of this function with respect to a
white noise $\mu_W$ with reference measure~$\mu$. The
inequality
\begin{equation}
E\left(|k!Z_{\mu,k}(f)|\right)^{2M}\le 1\cdot3\cdot5\cdots
(2kM-1)\sigma^{2M}\quad\textrm {for all }M=1,2,\dots
\label{(13.1)}
\end{equation}
holds.}
\medskip
By Stirling's formula Proposition~13.1 implies that
\begin{equation}
E(|k!Z_{\mu,k}(f)|)^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!}\sigma^{2M}
\le A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\label{(13.2)}
\end{equation}
for any $A>\sqrt2$ if $M\ge M_0=M_0(A)$. Formula~(\ref{(13.2)}) can be
considered as a simpler, more easily applicable version of
Proposition~13.1, and it is easier to compare with the moment
estimate on degenerate $U$-statistics given in formula~(\ref{(13.3)}).
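The identity $1\cdot3\cdot5\cdots(2m-1)=\frac{(2m)!}{2^m m!}$ behind formula~(\ref{(13.2)}) and the size of the constant appearing there can be checked directly; by Stirling's formula the ratio of the two sides increases to $\sqrt2$, so for instance the constant $1.5>\sqrt2$ already suffices in the range tested below (an illustrative Python sketch):

```python
import math

def double_factorial_odd(m):
    # 1*3*5*...*(2m-1)
    p = 1
    for j in range(1, 2 * m, 2):
        p *= j
    return p

# The identity 1*3*...*(2m-1) = (2m)! / (2^m m!) used in (13.2):
for m in range(1, 30):
    assert double_factorial_odd(m) == \
        math.factorial(2 * m) // (2 ** m * math.factorial(m))

# Stirling comparison: the ratio (2m-1)!! / ((2/e)^m m^m) increases
# towards sqrt(2), so the constant 1.5 > sqrt(2) works here:
for m in range(1, 60):
    log_dd = sum(math.log(j) for j in range(1, 2 * m, 2))
    log_bound = math.log(1.5) + m * (math.log(2.0) - 1.0) + m * math.log(m)
    assert log_dd <= log_bound
```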
Proposition~13.2 provides a similar, but weaker inequality for the
moments of normalized degenerate $U$-statistics.
\medskip\noindent
{\bf Proposition 13.2 (Estimate on the moments of degenerate
$U$-statistics).}\index{estimate on the moments of degenerate
$U$-statistics} {\it Let us consider a degenerate $U$-statistic
$I_{n,k}(f)$ of order $k$ with sample size $n$ and with a kernel
function $f$ satisfying relations~(\ref{(8.1)}) and~(\ref{(8.2)})
with some $0<\sigma^2\le1$. Fix a positive number $\eta>0$.
There exist some universal constants $A<\infty$ and $C<\infty$
such that
\begin{eqnarray}
&&E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
\le A\left(1+C\sqrt\eta\right)^{2kM}
\left(\frac2e\right)^{kM}\left(kM\right)^{kM}\sigma^{2M}\nonumber \\
&&\qquad\qquad \textrm{for all integers } M \textrm{ such that }
0\le kM\le \eta n\sigma^2. \label{(13.3)}
\end{eqnarray}
In formula~(\ref{(13.3)}) the constant $C$ can be chosen as $C=\sqrt2$.}
\medskip
Proposition~13.2 yields a good estimate on
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ with a fixed
exponent~$2M$ with
the choice $\eta=\frac{kM}{n\sigma^2}$. With such a choice of the
number $\eta$ formula~(\ref{(13.3)}) yields an estimate on the moments
$E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}$ comparable with the
estimate on the corresponding Wiener--It\^o integral if
$M\le n\sigma^2$, while
it yields a much weaker estimate if $M\gg n\sigma^2$.
Now I turn to the proof of these propositions.
\medskip\noindent
{\it Proof of Proposition 13.1.}\/ Proposition 13.1 can be simply
proved by means of the Corollary of Theorem~10.2 with the choice
$m=2M$, and $f_p=f$ for all $1\le p\le 2M$. Formulas~(\ref{(10.18)})
and~(\ref{(10.19)}) yield that
\begin{eqnarray*}
E\left(k!Z_{\mu,k}(f)\right)^{2M}&\le&\left( \int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\right)^M|
\Gamma_{2M}(k)| \\
&\le& |\Gamma_{2M}(k)|\sigma^{2M},
\end{eqnarray*}
where $|\Gamma_{2M}(k)|$ denotes the number of closed diagrams
$\gamma$ in the class
$\bar\Gamma(\underbrace{k,\dots,k}_{2M\textrm{ times}})$
introduced in the corollary of Theorem~10.2. Thus to complete the
proof of Proposition~13.1 it is enough to show that
$|\Gamma_{2M}(k)|\le 1\cdot3\cdot5\cdots(2kM-1)$. But this can
easily be seen with the help of the following observation. Let
$\bar\Gamma_{2M}(k)$ denote the class of all graphs with vertices
$(l,j)$, $1\le l\le 2M$, $1\le j\le k$, such that from all vertices
$(l,j)$ exactly one edge starts, all edges connect different
vertices, but edges connecting vertices $(l,j)$ and $(l,j')$ with
the same first coordinate~$l$ are also allowed. Let
$|\bar\Gamma_{2M}(k)|$ denote the number of graphs in
$\bar\Gamma_{2M}(k)$. Then clearly
$|\Gamma_{2M}(k)|\le|\bar\Gamma_{2M}(k)|$. On the other hand,
$|\bar\Gamma_{2M}(k)|=1\cdot3\cdot5\cdots(2kM-1)$. Indeed, let us
list the vertices of the graphs from $\bar\Gamma_{2M}(k)$ in an
arbitrary way. Then the first vertex can be paired with another
vertex in $2kM-1$ ways, and after this the first vertex from which
no edge starts can be paired with one of the $2kM-3$ vertices from
which no edge starts. Following this procedure the next edge can
be chosen in $2kM-5$ ways, and by continuing this calculation we
get the desired formula.
\hfill$\qed$
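The pairing count used in this proof, $|\bar\Gamma_{2M}(k)|=1\cdot3\cdot5\cdots(2kM-1)$, is the classical formula for the number of perfect matchings of $2kM$ labelled points; a brute-force check for small parameters (an illustrative Python sketch):

```python
def matchings(vertices):
    # Count the perfect matchings of a list of distinct vertices by
    # pairing the first vertex with every remaining one, exactly as in
    # the enumeration argument of the proof above.
    if not vertices:
        return 1
    rest = vertices[1:]
    return sum(matchings(rest[:i] + rest[i + 1:]) for i in range(len(rest)))

def double_factorial_odd(m):
    # 1*3*5*...*(2m-1)
    p = 1
    for j in range(1, 2 * m, 2):
        p *= j
    return p

# A set of 2m points admits exactly 1*3*5*...*(2m-1) pairings:
for m in range(1, 6):
    assert matchings(list(range(2 * m))) == double_factorial_odd(m)
```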
\medskip\noindent
{\it Proof of Proposition 13.2.}\/ Relation~(\ref{(13.3)}) will
be proved by
means of relations (\ref{(11.14)}) and (\ref{(11.15)}) in the
Corollary of Theorem~11.2 with the choice $m=2M$ and $f_p=f$
for all $1\le p\le 2M$. Let us take the class of closed
coloured diagrams
$\Gamma(k,M)=\bar\Gamma(\underbrace{k,\dots,k}_{2M\textrm{ times}})$.
This will be partitioned into subclasses
$\Gamma(k,M,r)$, $0\le r\le kM$, where $\Gamma(k,M,r)$ contains
those closed diagrams $\gamma\in\Gamma(k,M)$ for which
$W(\gamma)=2r$. Let us recall that $W(\gamma)$ was defined
in~(\ref{(11.9)}), and in the case of closed diagrams
$W(\gamma)=\sum\limits_{\beta\in\gamma}(\ell(\beta)-2)$. For a
diagram $\gamma\in\Gamma(k,M)$, $W(\gamma)$ is an even number,
since $W(\gamma)+2s(\gamma)=2kM$, i.e. $W(\gamma)=2r$ with $r=kM-s$,
where $s=s(\gamma)$ denotes the number of chains in~$\gamma$.
First we prove an estimate on the cardinality of~$\Gamma(k,M,r)$.
We claim that
\begin{eqnarray}
|\Gamma(k,M,r)|&\le& {{2kM}\choose{2r}} 1\cdot3\cdot5\cdots(2kM-2r-1)
(kM-r)^{2r} \label{(13.4)} \\
&\le& A\left(\frac2e\right)^{kM} {{2kM}\choose{2r}}
2^{-r}(kM)^{kM+r} \quad\textrm{for all } 0\le r\le kM \nonumber
\end{eqnarray}
with some universal constant $A<\infty$.
To prove formula~(\ref{(13.4)}) let us first observe that
$|\Gamma(k,M,r)|$ can be bounded from above by the number of
those partitions of a set with $2kM$ points which consist of
$s=kM-r$~sets containing at least two points. Indeed,
for each $\gamma\in\Gamma(k,M,r)$ the chains of the
diagram~$\gamma$ yield a partition of the set
$\{(p,q)\colon\;1\le p\le 2M,\,1\le q\le k\}$ consisting
of~$s=kM-r$ sets such that each of them contains at least two
points.
Moreover, the partition given in such a way determines the
chains of~$\gamma$, because the vertices of a chain are listed
in a prescribed order. Namely, the indices of the rows which
contain them follow each other in increasing order. This
implies that we can assign to each diagram
$\gamma\in\Gamma(k,M,r)$ a different partition of a set of
$2kM$ elements with the prescribed properties.
The number of the partitions with the above properties can be
bounded from above in the following way. Let us calculate the
number of possibilities for choosing $s=kM-r$ disjoint subsets
of cardinality two from a set of cardinality~$2kM$, and multiply
this number by the number of ways of attaching each of the
remaining $2r$ points of the original set to one of these sets of
cardinality~2.
We can choose these sets of cardinality~2 in
${{2kM}\choose{2r}}1\cdot3\cdot5\cdots(2kM-2r-1)$ ways, since we can
choose the union of these sets, which consists of $2kM-2r$
points in ${{2kM}\choose{2kM-2r}}={{2kM}\choose{2r}}$ ways, and
then we can choose the pair of the first element in~$2kM-2r-1$ ways,
then the pair of the first still not chosen element in
$2kM-2r-3$ ways, and continuing this procedure we get the above
formula for the number of choices for these sets of cardinality~2.
Then the remaining $2r$ points of the original set can be put
in~$(kM-r)^{2r}$ ways in one of these $kM-r$ sets of
cardinality~2. The above relations imply the first inequality of
formula~(\ref{(13.4)}).
To get the second inequality observe that by the Stirling formula
$1\cdot3\cdot5\cdots(2kM-2r-1)=\frac{(2kM-2r)!}{2^{kM-r}(kM-r)!}
\le A\left(\frac2e\right)^{kM-r}(kM-r)^{kM-r}$ with some universal
constant~$A<\infty$. Beside this, we can write
$(kM-r)^{kM+r}\le (kM)^r(kM-r)^{kM}
=(kM)^{kM+r}(1-\frac r{kM})^{kM}\le e^{-r}(kM)^{kM+r}$. These
estimates imply the second inequality in~(\ref{(13.4)}).
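The elementary inequality chain applied in this last step can be verified numerically for illustrative values of $kM$ (a small Python check, working with logarithms to avoid huge numbers):

```python
import math

# Check the chain used for the second inequality of (13.4):
# (kM-r)^{kM+r} <= (kM)^r (kM-r)^{kM}
#              = (kM)^{kM+r} (1 - r/kM)^{kM} <= e^{-r} (kM)^{kM+r}
# for 0 <= r < kM (at r = kM the left-hand side vanishes).
for kM in (5, 20, 100):
    for r in range(kM):
        lhs = (kM + r) * math.log(kM - r)
        rhs = -r + (kM + r) * math.log(kM)
        assert lhs <= rhs + 1e-9
```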
We prove the estimate~(\ref{(13.3)}) with the help of the
relations~(\ref{(11.14)}), (\ref{(11.15)}) and~(\ref{(13.4)}).
First we estimate the term $n^{-W(\gamma)/2}|F_\gamma|$ for a
diagram $\gamma\in\Gamma(k,M,r)$ under the conditions
$kM\le\eta n\sigma^2$ and $\sigma^2\le1$ with the help of
relation~(\ref{(11.15)}).
In this case we can write $|U(\gamma)|\ge 2M-W(\gamma)=2M-2r$ for
the function~$U(\gamma)$ defined in~(\ref{(11.12)}). Hence by
relation~(\ref{(11.15)})
$$
n^{-W(\gamma)/2}|F_\gamma|\le 2^{2r} n^{-r}\sigma^{|U(\gamma)|}
\le 2^{2r} \left(n\sigma^2\right)^{-r}\sigma^{2M}
\le\eta^{r}2^{2r}(kM)^{-r}\sigma^{2M}
$$
for $\gamma\in\Gamma(k,M,r)$ because of the conditions
$kM\le \eta n\sigma^2$ and $\sigma^2\le1$.
This estimate together with relation~(\ref{(11.14)}) imply
that for $kM\le\eta n\sigma^2$
\begin{eqnarray*}
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
&\le&\sum_{\gamma\in\Gamma(k,M)}
n^{-W(\gamma)/2}\cdot |F_\gamma| \\
&\le& \sum_{r=0}^{kM}|\Gamma(k,M,r)|
\eta^{r}2^{2r}(kM)^{-r}\sigma^{2M}.
\end{eqnarray*}
Hence by formula~(\ref{(13.4)})
\begin{eqnarray*}
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}
&\le& A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\sum_{r=0}^{kM}{{2kM}\choose{2r}}
\left(\sqrt{2\eta}\right)^{2r}\\
&\le& A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M}
\left(1+\sqrt2\sqrt{\eta}\right)^{2kM}
\end{eqnarray*}
if $0\le kM\le\eta n\sigma^2$. Thus we have proved
Proposition~13.2 with $C=\sqrt2$.
\hfill$\qed$
\medskip
It is not difficult to prove Theorem 8.5 with the help of
Proposition~13.1.\index{estimate on the tail distribution
of a multiple Wiener--It\^o integral}
\medskip\noindent
{\it Proof of Theorem 8.5.}\/
By formula~(\ref{(13.2)}), which is a consequence of
Proposition~13.1, and the Markov inequality
\begin{equation}
P\left(|k!Z_{\mu,k}(f)|>u\right)\le
\frac{E\left(k!Z_{\mu,k}(f)\right)^{2M}}{u^{2M}}
\le A\left(\frac {2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM}
\label{(13.5)}
\end{equation}
with some constant $A>\sqrt2$ if $M$ is an integer satisfying
$M\ge M_0$ with some threshold $M_0=M_0(A)$.
Put $\bar M=\bar M(u)=\frac1{2k}\left(\frac u\sigma\right)^{2/k}$,
and $M=M(u)=[\bar M]$, where $[x]$ denotes the integer part of
a real number $x$. Choose some number $u_0$ such that
$\frac1{2k}\left(\frac {u_0}\sigma\right)^{2/k}\ge M_0+1$. Then
relation~(\ref{(13.5)}) can be applied with $M=M(u)$ for
$u\ge u_0$, and this yields that
\begin{eqnarray}
P\left(|k!Z_{\mu,k}(f)|>u\right)
&\le& A\left(\frac {2kM\sigma^{2/k}}{eu^{2/k}}\right)^{kM}
\le Ae^{-kM}\le Ae^{k}e^{-k\bar M} \nonumber \\
&=&Ae^k\exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\} \quad\textrm{if } u\ge u_0.
\label{(13.6)}
\end{eqnarray}
Relation~(\ref{(13.6)}) means that relation~(\ref{(8.14)}) holds
for $u\ge u_0$ with
the pre-exponential coefficient $Ae^k$. Beside this
$u_0\le\textrm{const.}$, hence by enlarging this coefficient if
needed it can be guaranteed that relation~(\ref{(8.14)}) holds
for all $u>0$. Theorem~8.5 is proved.
\hfill$\qed$
\medskip
Theorem 8.3 can be proved similarly by means of Proposition~13.2.
Nevertheless, the proof is technically more complicated, since
in this case the optimal choice of the parameter in the Markov
inequality cannot be given in such a direct form as in the proof of
Theorem~8.5. In this case the Markov inequality is applied with an
almost optimal choice of the parameter~$M$.\index{estimate on the
tail distribution of a degenerate $U$-statistic}
\medskip\noindent
{\it Proof of Theorem 8.3.}\/ The Markov inequality and
relation~(\ref{(13.3)}) with $\eta=\frac{kM}{n\sigma^2}$ imply that
\begin{eqnarray}
P(n^{-k/2}|k!I_{n,k}(f)|>u)
&\le& \frac{E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M}}{u^{2M}}
\label{(13.7)} \\
&\le& A\left(\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}\right)^{kM} \nonumber
\end{eqnarray}
for all integers $M\ge0$.
Relation~(\ref{(8.10)}) will be proved with the help of
estimate~(\ref{(13.7)}) under the condition
$0\le\frac u\sigma\le n^{k/2}\sigma^k$. To this end let us
introduce the number $\bar M$ by means of the formula
$$
k\bar M=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+\frac B{\sqrt n\sigma}
\left(\frac u\sigma\right)^{1/k}}
=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}
$$
with a sufficiently large number $B=B(C)>0$ and $M=[\bar M]$,
where $[x]$ means the integer part of the number $x$.
Observe that $\sqrt{k\bar M}\le\left(\frac u\sigma\right)^{1/k}$,
$\frac{\sqrt{k\bar M}}{\sqrt n\sigma}
\le\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\le1$,
and
$$
\left(1+C\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\right)^2\le
1+B\frac{\sqrt{k\bar M}}{\sqrt n\sigma}\le 1+B\left(u n^{-k/2}
\sigma^{-(k+1)}\right)^{1/k}
$$
with a sufficiently large $B=B(C)>0$ if
$\frac u\sigma\le n^{k/2}\sigma^k$. Hence
\begin{eqnarray}
&&\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM}}{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}
\le \frac1e\cdot 2k\bar M\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k} \nonumber \\
&&\qquad =\frac1e\cdot\frac{\left(1+C\frac{\sqrt{k\bar M}}
{\sqrt n\sigma}\right)^2}
{1+B\left(u n^{-k/2}\sigma^{-(k+1)}\right)^{1/k}}\le\frac1e
\label{(13.8)}
\end{eqnarray}
if $\frac u\sigma\le n^{k/2}\sigma^k$. Inequalities~(\ref{(13.7)})
and~(\ref{(13.8)}) together yield that
$$
P(n^{-k/2}k!|I_{n,k}(f)|>u)\le A e^{-kM}\le Ae^k e^{-k\bar M}
$$
if $0\le\frac u\sigma\le n^{k/2}\sigma^k$. Hence the choice of
the number~$\bar M$ implies that inequality~(\ref{(8.10)}) holds
with the pre-exponential constant $Ae^k$ and the sufficiently
large but fixed number~$B>0$. Theorem~8.3 is proved.
\hfill$\qed$
\medskip\noindent
{\it Remark.}\/ One would like to understand why the introduction
of the quantities~$\bar M$ and~$M$ in the proof of Theorem~8.3 was
a good choice. The natural choice for~$M$ would
have been that number where the right-hand side expression
in~(\ref{(13.7)}) takes its minimum. But we cannot calculate this
number in a simple way. Hence we chose instead a sufficiently good
and simple approximation for it. We get a first order approximation
of this quantity if we consider the minimum of the simplified
expression we get by dropping the factor
$\left(1+C\frac{\sqrt{kM}}{\sqrt n\sigma}\right)^2$ from the formula
at the right-hand side of~(\ref{(13.7)}). We get in such a way
the approximation $M_0=\frac1{2k}(\frac u\sigma)^{2/k}$, but this
is not a sufficiently good choice of the number~$M$ for our purposes.
We get a better approximation by determining the place of minimum of
the expression we get by replacing the number~$M$ with the
number~$M_0$ in the factor we omitted in the previous approximation,
i.e. we look for the place of minimum of
\begin{eqnarray*}
&&A\left(\frac1e\cdot 2kM\left(1+C\frac{\sqrt{kM_0}}
{\sqrt n\sigma}\right)^2
\left(\frac\sigma u\right)^{2/k}\right)^{kM} \\
&&\qquad =A\left(\frac1e\cdot 2kM
\left(1+\frac C{\sqrt{2n}\sigma}\left(\frac u\sigma\right)^{1/k}\right)^2
\left(\frac\sigma u\right)^{2/k}\right)^{kM}.
\end{eqnarray*}
This suggests the approximation
$M_1=\frac1{2k}\left(\frac u\sigma\right)^{2/k}\frac1
{\left(1+\frac C{\sqrt{2n}\sigma} (\frac u\sigma)^{1/k}\right)^2}$
for the place of minimum we are looking for. We can choose a similar
expression for the parameter~$M$ which is almost as good as this
number, but which is simpler to work with. To find it observe that
under the conditions of Theorem~8.3 we commit a small error by
replacing the term
$(1+\frac C{\sqrt{2n}\sigma} (\frac u\sigma)^{1/k})^2$
in the denominator of the formula defining~$M_1$ by
$1+\frac{2C}{\sqrt{2n}\sigma} (\frac u\sigma)^{1/k}$. To see this
observe that the condition $\frac u\sigma\le n^{k/2}\sigma^k$ of
Theorem~8.3 implies that
$\frac1{\sqrt n\sigma}(\frac u\sigma)^{1/k}\le1$. Moreover, in the
really interesting cases this expression is very close to zero. This
suggests to expand the above square, and make an approximation by
omitting the quadratic term. We can try to choose the number~$M$
obtained in such a way in the proof of Theorem~8.3. Moreover, it
turned out to be useful to replace the parameter~$C$ with another,
sufficiently large coefficient that is easier to work with. This
led to
the introduction of the quantity
$k\bar M=\frac12\left(\frac u\sigma\right)^{2/k}\frac1
{1+\frac B{\sqrt n\sigma}\left(\frac u\sigma\right)^{1/k}}$
with a sufficiently large (but fixed) number~$B$ in the proof
of Theorem~8.3.
\medskip
Example 8.7 is a relatively simple consequence of It\^o's formula
for multiple Wiener--It\^o integrals.
\medskip\noindent
{\it Proof of Example 8.7.}\/ We may restrict our attention to the
case $k\ge2$. It\^o's formula for multiple Wiener--It\^o integrals,
more explicitly relation~(\ref{(10.21)}), implies that the random
variable $k!Z_{\mu,k}(f)$ can be expressed as $k!Z_{\mu,k}(f)
=\sigma H_k\left(\int f_0(x)\mu_W(\,dx)\right)=\sigma H_k(\eta)$,
where $H_k(x)$ is the $k$-th Hermite polynomial with leading
coefficient~1, and $\eta=\int f_0(x)\mu_W(\,dx)$ is a standard
normal random variable. Hence we get by exploiting that the
coefficient of $x^{k-1}$ in the polynomial $H_k(x)$ is zero that
$P(k!|Z_{\mu,k}(f)|>u)=P(|H_k(\eta)|\ge\frac u\sigma)\ge
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)$ with a
sufficiently large constant $D>0$ if $\frac u\sigma>1$. There
exist such positive constants $A$ and $B$ for which
$$
P\left(|\eta^k|-D|\eta^{k-2}|>\frac u\sigma\right)
\ge P\left(|\eta^k|>\frac u\sigma+A
\left(\frac u\sigma\right)^{(k-2)/k}\right)\quad
\textrm{if } \frac u\sigma>B.
$$
Hence
\begin{eqnarray*}
P(k!|Z_{\mu,k}(f)|>u)&\ge&
P\left(|\eta|>\left(\frac u\sigma\right)^{1/k}
\left(1+A\left(\frac u\sigma\right)^{-2/k}\right)\right) \\
&\ge&\frac{\bar C \exp\left\{-\frac12
\left(\frac u\sigma\right)^{2/k}\right\}}
{\left(\frac u\sigma\right)^{1/k}+1}
\end{eqnarray*}
with an appropriate $\bar C>0$ if $\frac u\sigma>B$. Since
$P(k!|Z_{\mu,k}(f)|>0)>0$, the above inequality also holds
for $0\le \frac u\sigma\le B$ if the constant $\bar C>0$ is chosen
sufficiently small. This means that relation~(\ref{(8.16)}) holds.
\hfill$\qed$
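The property of the Hermite polynomials exploited in this proof, namely that the polynomial $H_k(x)$ with leading coefficient~1 has vanishing $x^{k-1}$ coefficient, follows from the standard recursion $H_{m+1}(x)=xH_m(x)-mH_{m-1}(x)$; a small illustrative Python check:

```python
def hermite_coeffs(k):
    # Coefficient list (index = power of x) of the k-th Hermite
    # polynomial with leading coefficient 1, computed from the
    # standard recursion H_{m+1}(x) = x H_m(x) - m H_{m-1}(x).
    if k == 0:
        return [1]
    H_prev, H = [1], [0, 1]  # H_0 = 1, H_1 = x
    for m in range(1, k):
        shifted = [0] + H                              # x * H_m
        padded = H_prev + [0] * (len(shifted) - len(H_prev))
        H_prev, H = H, [a - m * b for a, b in zip(shifted, padded)]
    return H

# H_k is monic and its x^{k-1} coefficient vanishes, the property
# used to bound |H_k(eta)| from below in the proof above:
for k in range(2, 10):
    c = hermite_coeffs(k)
    assert c[k] == 1 and c[k - 1] == 0
```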
\medskip
Next we prove a multivariate version of Hoeffding's inequality.
Before its formulation some notations will be introduced.
Let us fix two positive integers~$k$ and~$n$ and some
real numbers $a(j_1,\dots,j_k)$ for all sequences of arguments
$\{j_1,\dots,j_k\}$ such that $1\le j_l\le n$, $1\le l\le k$, and
$j_l\neq j_{l'}$ if $l\neq l'$.
With the help of the above real numbers $a(\cdot)$ and a
sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, the random
variable
\begin{equation}
V=\sum_{\substack {(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
a(j_1,\dots, j_k)
\varepsilon_{j_1}\cdots \varepsilon_{j_k} \label{(13.9)}
\end{equation}
and the number
\begin{equation}
S^2=\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
a^2(j_1,\dots, j_k) \label{(13.10)}
\end{equation}
will be introduced.
With the help of the above notations the following result can be
formulated.
\medskip\noindent
{\bf Theorem 13.3 (The multivariate version of Hoeffding's
inequality).}\index{multivariate version of Hoeffding's
inequality} {\it The random variable $V$ defined in
formula~(\ref{(13.9)}) satisfies the inequality
\begin{equation}
P(|V|>u)\le C
\exp\left\{-\frac12\left(\frac uS\right)^{2/k}\right\}
\quad\textrm{for all }u\ge 0 \label{(13.11)}
\end{equation}
with the constant $S$ defined in~(\ref{(13.10)}) and some
constant $C>0$ depending only on the parameter $k$ in the
expression~$V$.}
\medskip
Theorem~13.3 will be proved by means of two simple lemmas. Before
their formulation the random variable
\begin{equation}
Z=\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
|a(j_1,\dots,j_k)|\eta_{j_1}\cdots \eta_{j_k} \label{(13.12)}
\end{equation}
will be introduced, where $\eta_1,\dots,\eta_n$ are independent
random variables with standard normal distribution, and the numbers
$a(j_1,\dots,j_k)$ agree with those in formula~(\ref{(13.9)}). The
following lemmas will be proved.
\medskip\noindent
{\bf Lemma 13.4.} {\it The random variables $V$ and $Z$ introduced
in~(\ref{(13.9)}) and (\ref{(13.12)}) satisfy the inequality
$$
EV^{2M}\le EZ^{2M}\quad\textrm{for all }M=1,2,\dots.
$$
}
\medskip\noindent
{\bf Lemma 13.5.} {\it The random variable $Z$ defined in
formula~(\ref{(13.12)}) satisfies the inequality
\begin{equation}
EZ^{2M}\le 1\cdot3\cdot5\cdots(2kM-1)S^{2M}\quad\textrm{for all }
M=1,2,\dots \label{(13.13)}
\end{equation}
with the constant $S$ defined in formula~(\ref{(13.10)}).}
\medskip\noindent
{\it Proof of Lemma 13.4.}\/ We can write, by carrying out the
multiplications in the expressions $EV^{2M}$ and $EZ^{2M}$,
by exploiting the additive and multiplicative properties of the
expectation for sums and products of independent random variables
together with the identities
$E\varepsilon_j^{2k+1}=0$ and $E\eta_j^{2k+1}=0$
for all $k=0,1,\dots$ that
\begin{equation}
EV^{2M}= \!\!\!\!\!\!\!\!\!\!\!
\sum_{\substack{ (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM}}
\!\!\!\!\!\!\!\!\!\!\!
A(j_1,\dots,j_l,m_1,\dots,m_l)
E\varepsilon_{j_1}^{2m_1}\cdots E\varepsilon_{j_l}^{2m_l}
\label{(13.14)}
\end{equation}
and
\begin{equation}
EZ^{2M}= \!\!\!\!\!\!\!\!\!\!\!\!\!
\sum_{\substack{ (j_1,\dots, j_l,\, m_1,\dots, m_l)\colon \\
1\le j_s\le n,\;
m_s\ge1,\; 1\le s\le l,\; m_1+\dots+m_l=kM}}
\!\!\!\!\!\!\!\!\!\!\!\!\!
B(j_1,\dots,j_l,m_1,\dots,m_l) E\eta_{j_1}^{2m_1}\cdots
E\eta_{j_l}^{2m_l} \label{(13.15)}
\end{equation}
with some coefficients $A(j_1,\dots,j_l,m_1,\dots,m_l)$ and
$B(j_1,\dots,j_l,m_1,\dots,m_l)$ such that
\begin{equation}
|A(j_1,\dots,j_l,m_1,\dots,m_l)|\le
B(j_1,\dots,j_l,m_1,\dots,m_l). \label{(13.16)}
\end{equation}
The coefficients $A(\cdot,\cdot,\cdot)$ and $B(\cdot,\cdot,\cdot)$
could be expressed explicitly, but we do not need such a formula.
What is important for us is that $A(\cdot,\cdot,\cdot)$ can be
expressed as the sum of certain terms, and $B(\cdot,\cdot,\cdot)$
as the sum of the absolute value of the same terms. Hence
relation~(\ref{(13.16)}) holds. Since
$E\varepsilon_j^{2m}\le E\eta_j^{2m}$
for all parameters $j$ and $m$, formulas~(\ref{(13.14)}),
(\ref{(13.15)}) and~(\ref{(13.16)}) imply
Lemma~13.4.
\hfill$\qed$
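The moment comparison $E\varepsilon_j^{2m}=1\le1\cdot3\cdots(2m-1)=E\eta_j^{2m}$ behind this proof can be checked exactly in the simplest case $k=1$, where $Z$ is Gaussian with variance $S^2$ (a Python sketch with hypothetical coefficients $a(j)$ chosen only for illustration):

```python
from itertools import product

# Lemma 13.4 in the simplest case k = 1 with hypothetical
# coefficients a(j): here Z = sum |a(j)| eta_j is Gaussian with
# variance S^2, hence E Z^{2M} = (2M-1)!! S^{2M}, while E V^{2M}
# is computed by exact enumeration over the 2^n sign patterns.
a = [0.7, -1.2, 0.4]
S2 = sum(x * x for x in a)

def moment_V(two_M):
    n = len(a)
    return sum(sum(x * e for x, e in zip(a, eps)) ** two_M
               for eps in product((-1, 1), repeat=n)) / 2 ** n

def moment_Z(two_M):
    dd = 1
    for j in range(1, two_M, 2):
        dd *= j                       # (2M-1)!!
    return dd * S2 ** (two_M // 2)

for two_M in (2, 4, 6, 8):
    assert moment_V(two_M) <= moment_Z(two_M) + 1e-9
```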
\medskip\noindent
{\it Proof of Lemma~13.5.} Let us consider a white noise $W(\cdot)$
on the unit interval $[0,1]$ with the Lebesgue measure $\lambda$ on
$[0,1]$ as its reference measure, i.e.\ let us take a set of
Gaussian random variables $W(A)$ indexed by the measurable sets
$A\subset [0,1]$ such that $EW(A)=0$, $EW(A)W(B)=\lambda(A\cap B)$
with the Lebesgue measure $\lambda$ for all measurable subsets of
the interval $[0,1]$. Let us introduce $n$ orthonormal functions
$\varphi_1(x),\dots,\varphi_n(x)$ with respect to the Lebesgue
measure on the interval $[0,1]$, and define the random variables
$\eta_j=\int \varphi_j(x)W(\,dx)$, $1\le j\le n$. Then
$\eta_1,\dots,\eta_n$ are independent random variables with standard
normal distribution, hence we may assume that they appear in the
definition of the random variable~$Z$ in formula~(\ref{(13.12)}). Beside
this, the identity $\eta_{j_1}\cdots\eta_{j_k}=\int \varphi_{j_1}(x_1)
\cdots\varphi_{j_k}(x_k)W(\,dx_1)\dots W(\,dx_k)$ holds for all
$k$-tuples $(j_1,\dots,j_k)$ such that $1\le j_s\le n$ for all
$1\le s\le k$, and the indices $j_1,\dots,j_k$ are different.
This identity follows from It\^o's formula for multiple Wiener--It\^o
integrals formulated in formula~(\ref{(10.20)}) of Theorem~10.3.
Hence the random variable $Z$ defined in~(\ref{(13.12)}) can be
written in the form
$$
Z=\int f(x_1,\dots,x_k)W(\,dx_1)\dots W(\,dx_k)
$$
with the function
$$
f(x_1,\dots,x_k)=
\sum_{\substack{(j_1,\dots, j_k)\colon\, 1\le j_l\le n
\textrm{ for all } 1\le l\le k,\\
j_l\neq j_{l'} \textrm{ if }l\neq l'}}
|a(j_1,\dots,j_k)| \varphi_{j_1}(x_1)\cdots \varphi_{j_k}(x_k).
$$
Because of the orthogonality of the functions $\varphi_j(x)$
$$
S^2=\int_{[0,1]^k} f^2(x_1,\dots,x_k)\,dx_1\dots\,dx_k.
$$
Lemma~13.5 is a straightforward consequence of the above relations
and formula~(\ref{(13.1)}) in Proposition~13.1.
\hfill$\qed$
\medskip\noindent
{\it Proof of Theorem~13.3.}\/ The proof of Theorem~13.3 with the
help of Lemmas~13.4 and~13.5 is an almost word for word repetition
of the proof of Theorem~8.5. By Lemma~13.4 inequality~(\ref{(13.13)})
remains valid if the random variable $Z$ is replaced by the random
variable~$V$ at its left-hand side. Hence the Stirling formula
yields that
$$
EV^{2M}\le EZ^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!} S^{2M}\le C
\left(\frac2e\right)^{kM}(kM)^{kM}S^{2M}
$$
for any $C>\sqrt2$ if $M\ge M_0(C)$. As a consequence, by the
Markov inequality the estimate
\begin{equation}
P(|V|>u)\le\frac{EV^{2M}}{u^{2M}}\le C\left(\frac{2kM}e\left(\frac
Su\right)^{2/k}\right)^{kM} \label{(13.17)}
\end{equation}
holds for all $C>\sqrt 2$ if $M\ge M_0(C)$. Put $k\bar M=k\bar
M(u)=\frac12\left(\frac uS\right)^{2/k}$ and $M=M(u)=[\bar M]$, where
$[x]$ denotes the integer part of the number~$x$. Let us choose
a threshold number $u_0$ by the identity
$\frac1{2k}\left(\frac{u_0}S\right)^{2/k}=M_0(C)+1$.
Formula~(\ref{(13.17)}) can be applied with $M=M(u)$ for
$u\ge u_0$, and it yields that
$$
P(|V|>u)\le Ce^{-kM}\le Ce^ke^{-k\bar M}=Ce^k\exp\left\{-\frac12
\left(\frac uS\right)^{2/k}\right\}\ \quad\textrm{if } u\ge u_0.
$$
The last inequality means that relation~(\ref{(13.11)})
holds for $u\ge u_0$
if the constant $C$ is replaced by $Ce^k$ in it. With the choice of
a sufficiently large constant~$C$ relation~(\ref{(13.11)}) holds
for all $u\ge0$. Theorem~13.3 is proved.
\hfill$\qed$
\medskip
\medskip\noindent
{\script 13. B) A short discussion about the methods and results.}
\medskip\noindent
A comparison of Theorem 8.5 and Example 8.7 shows that the
estimate~(\ref{(8.14)}) is sharp. At least, no essential
improvement of this estimate is possible that would hold for
{\it all}\/ Wiener--It\^o integrals with a kernel function $f$
satisfying the conditions of Theorem~8.5.
This fact also indicates that the bounds~(\ref{(13.1)})
and~(\ref{(13.2)}) on high
moments of Wiener--It\^o integrals are sharp. It is worth
comparing formula~(\ref{(13.2)}) with the estimate of
Proposition~13.2 on moments of degenerate $U$-statistics.
Let us consider a normalized $k$-fold degenerate $U$-statistic
$n^{-k/2}k!I_{n,k}(f)$ with some kernel function $f$ and a
$\mu$-distributed sample of size~$n$. Let us compare its moments
with those of a $k$-fold Wiener--It\^o integral $k!Z_{\mu,k}(f)$
with the same kernel function~$f$ with respect to a white noise
$\mu_W$ with reference measure~$\mu$. Let $\sigma$ denote the
$L_2$-norm of the kernel function~$f$. If
$M\le\varepsilon n\sigma^2$ with a small number $\varepsilon>0$,
then Proposition~13.2 (with an appropriate
choice of the parameter~$\eta$ which is small in this case)
provides an almost as good bound on the $2M$-th moment of the
normalized $U$-statistic as Proposition~13.1 does on the
$2M$-th moment of the corresponding Wiener--It\^o integral. In
the case $M\le Cn\sigma^2$ with some fixed (not necessarily small)
number $C>0$ the $2M$-th moment of the normalized $U$-statistic
can be bounded by $C(k)^M$ times the natural estimate on the
$2M$-th moment of the Wiener--It\^o integral with some
constant~$C(k)>0$ depending only on the number~$C$. This can be
interpreted as saying that in this case the estimate on the moments
of the normalized $U$-statistic is weaker than the estimate on the
moments of the Wiener--It\^o integral, but they are still
comparable.
Finally, in the case $M\gg n\sigma^2$ the estimate on the $2M$-th
moment of the normalized $U$-statistic is much worse than the
estimate on the $2M$-th moment of the Wiener--It\^o integral.
A similar picture arises if the distribution of the normalized
degenerate $U$-statistic
$$
F_n(u)=P(n^{-k/2}|k!I_{n,k}(f)|>u)
$$
is compared to the distribution of the Wiener--It\^o integral
$$
G(u)=P(|k!Z_{\mu,k}(f)|>u).
$$
In the case $0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with a
small $\varepsilon>0$ Theorem~8.3 yields an almost as good
estimate for the probability $F_n(u)$ as Theorem~8.5 yields for
$G(u)$. In the case $0\le u\le n^{k/2}\sigma^{k+1}$ these
results yield similar bounds for $F_n(u)$ and $G(u)$, only in the
exponent of the estimate on $F_n(u)$ in formula~(\ref{(8.10)})
a worse constant appears. Finally, if $u\gg n^{k/2}\sigma^{k+1}$,
then --- as Example~8.8 shows, at least in the case $k=2$ ---
the (tail) distribution function $F_n(u)$ satisfies a much
worse estimate than the function $G(u)$.
A similar picture arose in the one-variate version of this
problem discussed in Chapter~3, where the normalized sums
of independent random variables were investigated, and their
tail-distributions were compared to that of a normally
distributed random variable. To understand this similarity
better it is useful to recall Theorem~10.4, i.e. the limit
theorem about normalized degenerate $U$-statistics.
Theorems~8.3 and~8.5 enable us to compare the tail behaviour
of normalized degenerate $U$-statistics with their limit
presented in the form of multiple Wiener--It\^o integrals,
while the one-variate versions of these results compare the
distribution of sums of independent random variables with
their Gaussian limit.
The proofs of the above results show that good bounds on the
moments of degenerate $U$-statistics and multiple Wiener--It\^o
integrals provide a good estimate on their distribution. To
understand the
behaviour of high moments of degenerate $U$-statistics better it
is useful to have a closer look at the simplest case $k=1$,
when the moments of sums of independent random variables with
expectation zero are considered.
Let us consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with expectation
zero, take their sum $S_n=\sum\limits_{j=1}^n\xi_j$, and let us try
to give a good estimate on the moments $ES_n^{2M}$ for all
$M=1,2,\dots$. Because of the independence of the random variables
$\xi_j$ and the condition $E\xi_j=0$ the identity
$$
ES_n^{2M}=\sum_{\substack{(j_1,\dots,j_s,l_1,\dots,l_s)\colon\,
j_1+\dots+j_s=2M,\;j_u\ge 2\textrm{ for all }1\le u\le s,\\
1\le l_1<l_2<\dots<l_s\le n}}
\frac{(2M)!}{j_1!\cdots j_s!}\,
E\xi_{l_1}^{j_1}\cdots E\xi_{l_s}^{j_s}
$$
holds. (The terms containing a random variable $\xi_l$ with
exponent~1 vanish because of the condition $E\xi_l=0$, hence only
exponents $j_u\ge2$ appear in this sum.)

\medskip\noindent
On the other hand, McKean proved the following lower bound for the
tail distribution of Wiener--It\^o integrals.

\medskip\noindent
{\it A $k$-fold Wiener--It\^o integral $Z_{\mu,k}(f)$ with a kernel
function~$f$ such that $\int f^2\,d\mu^k>0$ satisfies the inequality
\begin{equation}
P(|Z_{\mu,k}(f)|>u)>Ke^{-Au^{2/k}} \label{(13.19)}
\end{equation}
with some numbers $K=K(f,\mu)>0$ and $A=A(f,\mu)>0$.}
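\medskip
To see what the above identity for $ES_n^{2M}$ yields in the
simplest cases take $M=1$ and $M=2$. For $M=1$ only the term with
$s=1$, $j_1=2$ appears, and $ES_n^2=nE\xi_1^2$. For $M=2$ the
solutions of $j_1+\dots+j_s=4$ with $j_u\ge2$ are $(j_1)=(4)$ and
$(j_1,j_2)=(2,2)$, hence
$$
ES_n^4=nE\xi_1^4+\frac{4!}{2!\,2!}\binom n2\left(E\xi_1^2\right)^2
=nE\xi_1^4+3n(n-1)\left(E\xi_1^2\right)^2.
$$
For large~$n$ the second term dominates, and
$ES_n^4\sim3\left(nE\xi_1^2\right)^2$, which agrees with the fourth
moment of a Gaussian random variable with expectation zero and
variance $nE\xi_1^2$. This indicates the general picture: for not
too large~$M$ the main contribution to $ES_n^{2M}$ comes from the
terms with $j_1=\dots=j_s=2$, i.e.\ with $s=M$, and these terms
reproduce the moments of the Gaussian limit.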
\medskip\noindent
The constant $A$ in the exponent $Au^{2/k}$ of
formula~(\ref{(13.19)}) is
always finite, but McKean's proof yields no explicit upper
bound on it. The following example shows that in certain cases
if we fix the constant~$K$ in relation~(\ref{(13.19)}), then this
inequality holds only with a very large constant $A>0$ even
if the variance of the Wiener--It\^o integral equals~1.
Take a probability measure $\mu$ and a white noise $\mu_W$ with
reference measure $\mu$ on a measurable space $(X,{\cal X})$, and let
$\varphi_1,\varphi_2,\dots$ be a sequence of orthonormal functions
on $(X,{\cal X})$ with respect to this measure $\mu$. Define for all
$L=1,2,\dots$, the function
\begin{equation}
f(x_1,\dots,x_k)=f_L(x_1,\dots,x_k)=(k!)^{1/2}L^{-1/2}
\sum\limits_{j=1}^L \varphi_j(x_1)\cdots\varphi_j(x_k)
\label{(13.20)}
\end{equation}
and the Wiener--It\^o integral
$$
Z_{\mu,k}(f)=Z_{\mu,k}(f_L)=\frac1{k!}\int f_L(x_1,\dots,x_k)
\mu_W(\,dx_1)\dots\mu_W(\,dx_k).
$$
Then $EZ_{\mu,k}^2(f)=1$, and the high moments of $Z_{\mu,k}(f)$ can
be well estimated. For a large parameter~$L$ these moments are much
smaller than the bound given in Proposition~13.1. (The
calculation leading to the estimation of the moments of
$Z_{\mu,k}(f)$ will be omitted.) These moment estimates also imply
that if the parameter~$L$ is large, then for not too large
numbers~$u$ the probability $P(|Z_{\mu,k}(f)|>u)$ has a much better
estimate than that given in Theorem~8.5. As a consequence,
for a large number $L$ and fixed number~$K$
relation~(\ref{(13.19)}) may hold only with a very big number $A>0$.
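\medskip
Let me indicate briefly why such an improvement appears. Since the
functions $\varphi_j$ are orthonormal with respect to~$\mu$,
$$
\int f_L^2\,d\mu^k=k!\,L^{-1}\sum_{i,j=1}^L\prod_{r=1}^k
\int\varphi_i\varphi_j\,d\mu=k!\,L^{-1}\cdot L=k!,
$$
and since $f_L$ is a symmetric function, the It\^o isometry yields
$EZ_{\mu,k}^2(f_L)=\frac1{k!}\int f_L^2\,d\mu^k=1$. Beside this,
It\^o's formula (Theorem~10.3) implies the representation
$$
Z_{\mu,k}(f_L)=(k!\,L)^{-1/2}\sum_{j=1}^L H_k(\eta_j)
$$
with the independent standard Gaussian random variables
$\eta_j=\int\varphi_j(x)\mu_W(\,dx)$ and the $k$-th Hermite
polynomial $H_k$ with leading coefficient~1. Thus $Z_{\mu,k}(f_L)$
is a normalized sum of $L$ independent and identically distributed
random variables with expectation zero and variance~1, and for a
large parameter~$L$ its distribution, together with its not too
high moments, is close to that of a standard Gaussian random
variable by the central limit theorem.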
We can expect that if we take a Gaussian random
polynomial~$P(\xi_1,\dots,\xi_n)$ whose arguments are Gaussian
random variables $\xi_1,\dots,\xi_n$, and which is the sum of
many small almost independent terms with expectation zero, then
a similar picture arises as in the case of a Wiener--It\^o
integral with kernel function~(\ref{(13.20)}) with a
large parameter~$L$.
Such a random polynomial has an almost Gaussian distribution by
the central limit theorem, and we can also expect that its not
too high moments behave like the corresponding moments of a
Gaussian random variable with expectation zero and the same
variance as the Gaussian random polynomial we consider. Such a
bound on the moments has the consequence that the estimate on
the probability of the event
$\{\omega\colon\; P(\xi_1(\omega),\dots,\xi_n(\omega))>u\}$
given in Theorem~8.5 can be improved if the number~$u$ is not
too large. A similar picture arises if we consider
Wiener--It\^o integrals whose kernel function satisfies some
`almost independence' properties. The problem is to find the
right properties under which we can get a good estimate that
exploits the almost independence property of a Gaussian random
polynomial or of a Wiener--It\^o integral. The main result of
R.~Lata{\l}a's paper~\cite{r27} can be considered as a response
to this question. This paper has some precedents, see~\cite{r19}
and~\cite{r22}, or paper~\cite{r18} where such a result was applied.
I describe the result of this paper below.
\medskip
To formulate Lata{\l}a's result some new notions have to be
introduced. Given a finite set $A$ let ${\cal P}(A)$ denote the
set of all its partitions. If a partition
$P=\{B_1,\dots,B_s\}\in{\cal P}(A)$ consists of $s$ elements then we
say that this partition has order~$s$, and write $|P|=s$. In the
special case $A=\{1,\dots,k\}$ the notation ${\cal P}(A)={\cal P}_k$
will be used. Given a measurable space $(X,{\cal X})$ with a
probability measure $\mu$ on it together with a finite set
$B=\{b_1,\dots,b_j\}$ let us introduce the following notations. Take
$j$ different copies $(X_{b_r},{\cal X}_{b_r})$ and $\mu_{b_r}$,
$1\le r\le j$, of this measurable space and probability measure
indexed by the elements of the set $B$, and define their product
$(X^{(B)},{\cal X}^{(B)},\mu^{(B)})=\left(\prod\limits_{r=1}^j X_{b_r},
\prod\limits_{r=1}^j{\cal X}_{b_r},
\prod\limits_{r=1}^j\mu_{b_r}\right)$. The points
$(x_{b_1},\dots,x_{b_j})\in X^{(B)}$ will be denoted by
$x^{(B)}\in X^{(B)}$ in the sequel. With the help of the above
notations I introduce the quantities needed in the formulation
of the following Theorem~13.7.
Let $f=f(x_1,\dots,x_k)$ be a function on the $k$-fold product
$(X^k,{\cal X}^k,\mu^k)$ of a measure space $(X,{\cal X},\mu)$
with a probability measure $\mu$. For all partitions
$P=\{B_1,\dots,B_s\}\in{\cal P}_k$ of the set $\{1,\dots,k\}$ consider
the functions $g_r\left(x^{(B_r)}\right)$ on the space $X^{(B_r)}$,
$1\le r\le s$, and define with their help the quantities
\begin{eqnarray}
\alpha(P)
&&=\alpha(P,f,\mu) \nonumber \\
&&=\sup_{g_1,\dots,g_s} \int f(x_1,\dots,x_k)
g_1\left(x^{(B_1)}\right)\cdots g_s\left(x^{(B_s)}\right)\mu(dx_1)
\dots\mu(dx_k); \nonumber \\
&&\qquad\quad \textrm{where the supremum is taken over all functions}
\nonumber \\
&&\qquad \quad g_1,\dots,g_s,\quad g_r\colon\,
X^{(B_r)}\to R^1, \textrm{ for which} \nonumber \\
&&\qquad\quad
\int g_r^2\left(x^{(B_r)}\right)\mu^{(B_r)}\left(\,dx^{(B_r)}\right)\le1
\quad \textrm{for all } 1\le r\le s, \label{(13.21)}
\end{eqnarray}
and put
\begin{equation}
\alpha_s=\max_{P\in{\cal P}_k,\,|P|=s}\alpha(P),
\quad 1\le s\le k. \label{(13.22)}
\end{equation}
In Lata{\l}a's estimation of Wiener--It\^o integrals of order~$k$
the quantities $\alpha_s$, $1\le s\le k$, play a similar role as
the number $\sigma^2$ in Theorem~8.5. Observe that in the case
$|P|=1$, i.e.\ if $P=\{1,\dots,k\}$ the identity
$\alpha^2(P)=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$
holds, which means that $\alpha_1=\sigma$. The following estimate
is valid for Wiener--It\^o integrals of general order.
\medskip\noindent
{\bf Theorem 13.7 (Lata{\l}a's estimate about the tail-distribution
of Wiener--It\^o integrals).}\index{Lata{\l}a's estimate about
the tail-distribution of Wiener--It\^o integrals}
{\it Let a $k$-fold Wiener--It\^o integral $Z_{\mu,k}(f)$,
$k\ge1$, be defined with the help of a white noise $\mu_W$ with
a non-atomic reference measure~$\mu$ and a kernel function~$f$
of $k$~variables such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty.
$$
There is some universal constant $C(k)<\infty$ depending only on
the order~$k$ of the random integral such that the inequalities
\begin{equation}
E(Z_{\mu,k}(f))^{2M}\le
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
\label{(13.23)}
\end{equation}
and
\begin{equation}
P(|Z_{\mu,k}(f)|>u)\le C(k)\exp\left\{-\frac1{C(k)}\min_{1\le s\le k}
\left(\frac u{\alpha_s}\right)^{2/s}\right\}
\label{(13.24)}
\end{equation}
hold for all $M=1,2,\dots$ and $u>0$ with the quantities $\alpha_s$,
defined in formulas~(\ref{(13.21)}) and~(\ref{(13.22)}).}
\medskip
Inequality~(\ref{(13.24)}) is a simple consequence
of~(\ref{(13.23)}). In the special case when
$\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$, then
inequality~(\ref{(13.23)}) yields an estimate on the moment
$EZ_{\mu,k}(f)^{2M}$ which has
the same magnitude as the $2M$-th moment of a standard Gaussian
random variable multiplied by a constant, and (\ref{(13.24)})
yields a good estimate on the probability $P(|Z_{\mu,k}(f)|>u)$.
Actually the result of Theorem~13.7 can be reduced to the special
case when $\alpha_s\le M^{-(s-1)/2}$ for all $1\le s\le k$. Thus
it can be interpreted as saying that if the quantities~$\alpha_s$ of a
$k$-fold Wiener--It\^o integral are sufficiently small, then
these `almost independence' conditions imply that the $2M$-th
moment of this integral behaves similarly to that of a one-fold
Wiener--It\^o integral with the same variance.
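\medskip
It may be instructive to compute the quantities $\alpha_s$ for the
kernel function~$f_L$ defined in formula~(\ref{(13.20)}) in the
case $k=2$. Clearly
$\alpha_1=\left(\int f_L^2\,d\mu^2\right)^{1/2}=\sqrt2$, while for
the partition $P=\{\{1\},\{2\}\}$ the Schwarz and Bessel
inequalities yield
$$
\alpha(P)=\sqrt2\,L^{-1/2}\sup_{g_1,g_2}\sum_{j=1}^L
\int\varphi_jg_1\,d\mu\int\varphi_jg_2\,d\mu=\sqrt2\,L^{-1/2},
$$
i.e.\ $\alpha_2=\sqrt2\,L^{-1/2}$. Hence the exponent
in~(\ref{(13.24)}) equals
$\frac1{C(2)}\min\left(\frac{u^2}2,\frac{uL^{1/2}}{\sqrt2}\right)$,
and for $0<u\le\sqrt{2L}$ Theorem~13.7 supplies a Gaussian type
bound of order $e^{-\textrm{const.}\,u^2}$ instead of the bound of
order $e^{-\textrm{const.}\,u}$ that Theorem~8.5 yields in the case
$k=2$. This agrees with the behaviour of the example given after
formula~(\ref{(13.20)}).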
Actually Lata{\l}a formulated his result in a different form, and
he proved a slightly weaker result. He considered Gaussian
polynomials of the following form:
\begin{eqnarray}
&&P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k) \nonumber \\
&&\qquad =\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)\xi^{(1)}_{j_1}\cdots\xi^{(k)}_{j_k},
\label{(13.25)}
\end{eqnarray}
where $\xi_j^{(s)}$, $1\le j\le n$ and $1\le s\le k$, are independent
standard normal random variables. Lata{\l}a gave an estimate about
the moments and tail-distribution of such random polynomials.
The problem about the behaviour of such random polynomials can be
reformulated as a problem about the behaviour of Wiener--It\^o
integrals in the following way: Take a measurable space $(X,{\cal X})$
with a non-atomic measure~$\mu$ on it. Let $Z_\mu$ be a white noise
with reference measure~$\mu$, let us choose a set of orthonormal
functions $h^{(s)}_j(x)$, $1\le j\le n$, $1\le s\le k$, on the
space $(X,{\cal X})$ with respect to the measure~$\mu$, and define
the function
\begin{equation}
f(x_1,\dots,x_k)=\frac1{k!}
\sum_{ (j_1,\dots,j_k)\colon\,1\le j_s\le n,\,1\le s\le k}
a(j_1,\dots,j_k)h^{(1)}_{j_1}(x_1)\cdots h^{(k)}_{j_k}(x_k)
\label{(13.26)}
\end{equation}
together with the Wiener--It\^o integral $Z_{\mu,k}(f)$. Since
the random integrals $\bar\xi_j^{(s)}=\int h_j^{(s)}(x)Z_\mu(\,dx)$,
$1\le j\le n$, $1\le s\le k$, are independent, standard Gaussian
random variables, it is not difficult to see with the help of
It\^o's formula (Theorem~10.3 in this work) that the distributions
of the random polynomial
$P(\xi_j^{(s)},\;1\le j\le n,\,1\le s\le k)$ and $Z_{\mu,k}(f)$
agree. Here we reformulated Lata{\l}a's estimates about random
polynomials of the form~(\ref{(13.25)}) to estimates about
Wiener--It\^o integrals with kernel function of the
form~(\ref{(13.26)}).
These estimates are equivalent to Lata{\l}a's result if we restrict
our attention to the special class of Wiener--It\^o integrals
with kernel functions of the form~(\ref{(13.26)}). But we have
formulated our result for Wiener--It\^o integrals with a general
kernel function. Lata{\l}a's proof heavily exploits the special
structure of the random polynomials given in~(\ref{(13.25)}),
the independence of the
random variables~$\xi_j^{(s)}$ for different parameters~$s$ in
it. (It would be interesting to find a proof which does not
exploit this property.) On the other hand, this result can
be generalized to the case discussed in Theorem~13.7. This
generalization can be proved by exploiting the theorem of
de la Pe{\~n}a and Montgomery--Smith about the comparison of
$U$-statistics and decoupled $U$-statistics (formulated in
Theorem~14.3 of this work) and the properties of the
Wiener--It\^o integrals. I omit the details of the proof.
Lata{\l}a also proved a converse estimate in~\cite{r27} about
Gaussian random polynomials which shows that the
estimates of Theorem~13.7 are sharp. We formulate it in its
original form, i.e. we restrict our attention to the case of
Wiener--It\^o integrals with kernel functions of the
form~(\ref{(13.26)}).
\medskip\noindent
{\bf Theorem 13.8 (A lower bound about the tail distribution of
Wiener--It\^o integrals).} {\it A random integral $Z_{\mu,k}(f)$
with a kernel function of the form~(\ref{(13.26)}) satisfies the
inequalities
$$
E(Z_{\mu,k}(f))^{2M}\ge
\left(C(k)\max_{1\le s\le k}(M^{s/2}\alpha_s)\right)^{2M},
$$
and
$$
P(|Z_{\mu,k}(f)|>u)\ge \frac1{C(k)}\exp\left\{-C(k)
\min_{1\le s\le k}\left(\frac u{\alpha_s} \right)^{2/s}\right\}
$$
for all $M=1,2,\dots$ and $u>0$ with some universal constant
$C(k)>0$ depending only on the order~$k$ of the integral and the
quantities $\alpha_s$, defined in formula~(\ref{(13.21)})
and~(\ref{(13.22)}).}
\medskip
Let me finally remark that there is a counterpart of Theorem~13.7
about degenerate $U$-statistics. Adamczak's paper~\cite{r1} contains
such a result. Here we do not discuss it, because this result is
far from the main topic of this work. We only remark that some new
quantities have to be introduced to formulate it. The appearance of
these quantities is related to the fact that in an estimate about
the tail-behaviour of a degenerate $U$-statistic we need a bound
not only on the $L_2$-norm but also on the supremum norm of the
kernel function. In a sharp estimate the bound about the supremum
of the kernel function has to be replaced by a more complex system
of conditions, just as the condition about the $L_2$-norm of the
kernel function was replaced by a condition about the quantities
$\alpha_s$, $1\le s\le k$, defined in formulas~(\ref{(13.21)})
and~(\ref{(13.22)}) in Theorem~13.7.
\chapter{Reduction of the main result in this work}
The main result of this work is Theorem~8.4 or its multiple integral
version, Theorem~8.2. It was shown in Chapter~9 that Theorem~8.2
follows from Theorem~8.4. Hence it is enough to prove Theorem~8.4.
It may be useful to study this problem together with its multiple
Wiener--It\^o integral version, Theorem~8.6.
Theorems~8.6 and~8.4 will be proved similarly to their one-variate
versions, Theorems~4.2 and~4.1. Theorem~8.6 will be proved with
the help of~Theorem~8.5 about the estimation of the tail
distribution of multiple Wiener--It\^o integrals. A natural
modification of the chaining argument applied in the proof of
Theorem~4.2 works also in this case. No new difficulties arise. On
the other hand, in the proof of Theorem~8.4 several new
difficulties have to be overcome. I start with the proof of
Theorem~8.6.\index{estimate on the supremum of Wiener--It\^o
integrals}
\medskip\noindent
{\it Proof of Theorem 8.6.}\/ Fix a number $0<\varepsilon<1$, and
let us list the elements of the countable set
${\cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$ let us
choose by exploiting the conditions of Theorem~8.6 a set of
functions ${\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}
\subset{\cal F}$ with
$m_p\le2D\,2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}$ elements in
such a way that
$\inf\limits_{1\le j\le m_p}\int(f-f_{a(j,p)})^2\,d\mu
\le 2^{-4p-8}\varepsilon^2\sigma^2$ for all $f\in{\cal F}$, and
beside this let
$f_p\in{\cal F}_p$. For all indices $a(j,p)$, $p=1,2,\dots$,
$1\le j\le m_p$, choose a predecessor $a(j',p-1)$, $j'=j'(j,p)$,
$1\le j'\le m_{p-1}$, in such a way that the functions
$f_{a(j,p)}$ and $f_{a(j',p-1)}$ satisfy the relation
$\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu
\le\varepsilon^2\sigma^22^{-4(p+1)}$. (Such a predecessor exists,
since $f_{a(j,p)}\in{\cal F}$, and hence
$\inf\limits_{1\le j'\le m_{p-1}}
\int(f_{a(j,p)}-f_{a(j',p-1)})^2\,d\mu
\le2^{-4(p-1)-8}\varepsilon^2\sigma^2
=2^{-4(p+1)}\varepsilon^2\sigma^2$.)
Theorem~8.5 with the choice
$\bar u=\bar u(p)=2^{-(p+1)}\varepsilon u$ and
$\bar\sigma=\bar\sigma(p)=2^{-2p-2}\varepsilon\sigma$ yields
the estimates
\begin{eqnarray}
P(A(j,p))&=& P\left(|k!Z_{\mu,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
2^{-(1+p)}\varepsilon u\right)\nonumber \\
&\le& C \exp\left\{-\frac12
\left(\frac{2^{p+1}u}\sigma\right)^{2/k}\right\},
\qquad 1\le j\le m_p,
\label{(14.1)}
\end{eqnarray}
for all $p=1,2,\dots$, and
\begin{eqnarray}
P(B(s))&=&P\left(|k!Z_{\mu,k}(f_{a(s,0)})|
\ge \left(1-\frac \varepsilon2\right)u\right) \nonumber \\
&\le& C\exp\left\{-\frac12
\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}, \quad 1\le s\le m_0. \label{(14.2)}
\end{eqnarray}
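\medskip\noindent
Observe that the levels in the definitions of the events $A(j,p)$
and $B(s)$ were chosen in such a way that they add up to~$u$ along
a chain of predecessors:
$$
\left(1-\frac\varepsilon2\right)u
+\sum_{p=1}^\infty2^{-(1+p)}\varepsilon u
=\left(1-\frac\varepsilon2\right)u+\frac\varepsilon2u=u.
$$
Hence, if $f\in{\cal F}_p$, and none of these events occurs, then
the representation of~$f$ as the sum of a function from ${\cal F}_0$
and of the increments along the chain of predecessors implies that
$|k!Z_{\mu,k}(f)|<u$.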
Since each $f\in{\cal F}$ is an element of at least one set
${\cal F}_p$, $p=0,1,2,\dots$ (we made a construction where
$f_p\in {\cal F}_p$), the definition of the predecessor of an
index $a(j,p)$ and of the events $A(j,p)$ and
$B(s)$ in formulas~(\ref{(14.1)}) and (\ref{(14.2)}) together
with the previous estimates imply that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}|k!Z_{\mu,k}(f)|\ge u\right)
\le P\left(\bigcup_{p=1}^\infty\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \nonumber \\
&&\qquad \le \sum_{p=1}^\infty\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \nonumber \\
&&\qquad \le \sum_{p=1}^{\infty} 2CD2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}
\sigma\right)^{2/k} \right\}\nonumber \\
&&\qquad\qquad +2^{1+4L}CD\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}. \label{(14.3)}
\end{eqnarray}
Some calculations show that if
$u\ge ML^{k/2}\sigma\frac1\varepsilon(\log^{k/2}\frac2
\varepsilon+\log^{k/2}\frac2\sigma)$
with a sufficiently large constant~$M=M(k)$, then the inequalities
$$
2^{(2p+4)L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{2^{p+1}u}\sigma\right)^{2/k}
\right\}\le
2^{-p}\exp\left\{-\frac12\left(\frac{(1-\varepsilon)u}
\sigma\right)^{2/k} \right\}
$$
hold for all $p=1,2\dots$, and
$$
2^{4L}\varepsilon^{-L}\sigma^{-L}
\exp\left\{-\frac12\left(\frac{\left(1-\frac\varepsilon2\right)u}
\sigma\right)^{2/k}\right\}\le
\exp\left\{-\frac12\left(\frac{\left(1-\varepsilon\right)u}
\sigma\right)^{2/k}\right\}.
$$
These inequalities together with relation~(\ref{(14.3)}) imply
relation~(\ref{(8.15)}). Theorem~8.6 is proved.
\hfill$\qed$
\medskip
The proof of Theorem~8.4 is harder. In this case the chaining
argument in itself does not supply the proof, since Theorem~8.3
gives a good estimate about the distribution of a degenerate
$U$-statistic only if it has a not too small variance. The same
difficulty appeared in the proof of Theorem~4.1, and the method
applied in that case will be adapted to the present situation.
A multivariate version of Proposition~6.1 will be proved in
Proposition~14.1, and another result which can be considered as
a multivariate version of Proposition~6.2 will be formulated
in Proposition~14.2. It will be shown that Theorem~8.4 follows
from Propositions~14.1 and~14.2. Most steps of these proofs can
be considered as a simple repetition of the corresponding
arguments in the proof of the results in Chapter~6. Nevertheless,
I wrote them down for the sake of completeness.
\medskip
The result formulated in Proposition~14.1 can be proved in almost
the same way as its one-variate version, Proposition~6.1. The only
essential difference is that now we apply a multivariate version
of Bernstein's inequality given in the Corollary of Theorem~8.3.
In the calculations of the proof of Proposition~14.1 the term
$(\frac u\sigma)^{2/k}$ shows a behaviour similar to the term
$(\frac u\sigma)^2$ in Proposition~6.1. Proposition~14.1 contains the
information we can get by applying Theorem~8.3 together with the
chaining argument. Its main content, inequality~(\ref{(14.4)}),
yields a good estimate on the supremum of degenerate
$U$-statistics if it is taken for an appropriate finite subclass
${\cal F}_{\bar\sigma}$ of the original class of kernel
functions~${\cal F}$. The class of kernel functions
${\cal F}_{\bar\sigma}$ is a relatively dense subclass of
${\cal F}$ in the $L_2$ norm. Proposition~14.1 also provides some
useful estimates on the value of the parameter~$\bar\sigma$ which
describes how dense the class of functions ${\cal F}_{\bar\sigma}$
is in ${\cal F}$.
\medskip\noindent
{\bf Proposition 14.1.} {\it Let the $k$-fold power
$(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$ be given
together with some probability measure $\mu$ on $(X,{\cal X})$
and a countable, $L_2$-dense class ${\cal F}$ of functions
$f(x_1,\dots,x_k)$ of~$k$ variables with some exponent~$L\ge1$
and parameter~$D\ge1$ with respect to the measure~$\mu^k$ on the
product space $(X^k,{\cal X}^k)$ which also has the following
properties. All functions $f\in{\cal F}$ are canonical with
respect to the measure~$\mu$, and they satisfy
conditions~(\ref{(8.4)}) and~(\ref{(8.5)}) with some real number
$0<\sigma\le1$. Take a sequence of independent, $\mu$-distributed
random variables $\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and
consider the (degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in {\cal F}$, defined in formula~(\ref{(8.7)}), and fix some
number $\bar A=\bar A_k\ge2^k$.
There is a number $M=M(\bar A,k)$ such that for all
numbers~$u>0$ for which the inequality
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}
\ge M(L\log\frac2\sigma+\log D)$ holds, a number
$\bar\sigma=\bar\sigma(u)$, $0\le\bar\sigma\le \sigma\le1$,
and a collection of functions
${\cal F}_{\bar\sigma}={\cal F}_{\bar\sigma(u)}
=\{f_1,\dots,f_m\}\subset{\cal F}$ with $m\le D\bar\sigma^{-L}$
elements can be chosen in such a way that the union of the sets
${\cal D}_j=\{f\colon\, f\in {\cal F},\; \int|f-f_j|^2\,d\mu
\le\bar\sigma^2\}$, $1\le j\le m$, covers the set ${\cal F}$, i.e.
${\cal F}=\bigcup\limits_{j=1}^m{\cal D}_j$, and the
(degenerate) $U$-statistics $I_{n,k}(f)$,
$f\in{\cal F}_{\bar\sigma(u)}$, satisfy the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma(u)}}n^{-k/2}|k!I_{n,k}(f)|
\ge \frac u{\bar A}\right)\le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\} \nonumber \\
&&\qquad \qquad \textrm{if}\quad n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k}
\ge M\left(L\log\frac2\sigma+\log D\right) \label{(14.4)}
\end{eqnarray}
with the constants $\alpha=\alpha(k)$, $C=C(k)$ appearing in
formula~(\ref{($8.10'$)}) of the Corollary of Theorem~8.3 and the
exponent $L$ and parameter $D$ of the $L_2$-dense class ${\cal F}$.
Beside this, also the inequality
$4\left(\frac u{\bar A\bar\sigma}\right)^{2/k}\ge
n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$
holds for this number $\bar\sigma=\bar\sigma(u)$. If the
number~$u$ satisfies also the inequality
\begin{equation}
n\sigma^2\ge \left(\frac u\sigma\right)^{2/k}\ge
M(L^{3/2}\log\frac2\sigma +(\log D)^{3/2}) \label{(14.5)}
\end{equation}
with a sufficiently large number $M=M(\bar A,k)$, then the relation
$n\bar\sigma^2\ge L\log n+\log D$ holds, too.}
\medskip\noindent
{\it Proof of Proposition 14.1.} Let us list the elements of the
countable set ${\cal F}$ as $f_1,f_2,\dots$. For all $p=0,1,2,\dots$
let us choose, by exploiting the $L_2$-density property of the class
${\cal F}$ with respect to the product measure $\mu^k$,
a set
$$
{\cal F}_p=\{f_{a(1,p)},\dots,f_{a(m_p,p)}\}\subset
{\cal F}
$$
with $m_p\le D\,2^{2pL}\sigma^{-L}$ elements in such a way
that $\inf\limits_{1\le j\le m_p}\int (f-f_{a(j,p)})^2\,d\mu\le
2^{-4p}\sigma^2$ for all $f\in{\cal F}$.
For all indices $a(j,p)$, $p=1,2,\dots$, $1\le j\le m_p$, choose a
predecessor $a(j',p-1)$, $j'=j'(j,p)$, $1\le j'\le m_{p-1}$, in such a
way that the functions $f_{a(j,p)}$ and $f_{a(j',p-1)}$ satisfy the
relation $\int|f_{a(j,p)}-f_{a(j',p-1)}|^2\,d\mu\le \sigma^2
2^{-4(p-1)}$. Then the inequalities
$\int\left(\frac{f_{a(j,p)}-f_{a(j',p-1)}}2\right)^2\,d\mu
\le4\sigma^2 2^{-4p}$
and
$$
\sup\limits_{x_j\in X,\,1\le j\le k}\left|
\frac{f_{a(j,p)}(x_1,\dots,x_k)-f_{a(j',p-1)}(x_1,\dots,x_k)}2\right|
\le 1
$$
hold. The Corollary of Theorem~8.3 yields that
\begin{eqnarray}
P(A(j,p))&&=P\left(n^{-k/2}|k!I_{n,k}(f_{a(j,p)}-f_{a(j',p-1)})|\ge
\frac{2^{-(1+p)}u}{\bar A}\right)\nonumber \\
&&\le C \exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A
\sigma}\right)^{2/k} \right\}
\quad \textrm {if}\quad 4n\sigma^2 2^{-4p}\ge\left(\frac{2^{p}u}
{8\bar A\sigma}\right)^{2/k}, \nonumber \\
&&\qquad\qquad 1\le j\le m_p,\; p=1,2,\dots,
\label{(14.6)}
\end{eqnarray}
and
\begin{eqnarray}
P(B(s))&&=P\left(n^{-k/2}|k!I_{n,k}(f_{a(s,0)})|
\ge \frac u{2\bar A}\right)\le
C\exp\left\{-\alpha\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\},
\nonumber \\
&& \qquad 1\le s\le m_0, \quad \textrm{ if }
n\sigma^2\ge \left(\frac u{2\bar A\sigma}\right)^{2/k}. \label{(14.7)}
\end{eqnarray}
Introduce an integer $R=R(u)$, $R>0$, which satisfies the relations
$$
2^{(4+{2/k})(R+1)}\left(\frac{u}{\bar A\sigma}\right)^{2/k} \ge
2^{2+6/k}n\sigma^2\ge 2^{(4+2/k)R}
\left(\frac{u}{\bar A\sigma}\right)^{2/k},
$$
and define $\bar\sigma^2=2^{-4R}\sigma^2$ and
${\cal F}_{\bar\sigma}={\cal F}_R$ (this is the class of functions
${\cal F}_p$ introduced at the start of the proof with $p=R$).
(We defined the number~$R$, analogously to the proof of Proposition~6.1,
as the largest number~$p$ for which the condition formulated
in~(\ref{(14.6)}) holds. As
$n\sigma^2\ge\left(\frac u\sigma\right)^{2/k}$,
and $\bar A\ge2^k$ by our
conditions, there exists such a positive integer $R$.) The
cardinality~$m$ of the set ${\cal F}_{\bar\sigma}$ is clearly not
greater than $D\bar\sigma^{-L}$, and
$\bigcup\limits_{j=1}^m {\cal D}_j={\cal F}$. Beside this, the number
$R$ was chosen in such a way that the inequalities
(\ref{(14.6)}) and (\ref{(14.7)}) hold for $1\le p\le R$. Hence the
definition of the predecessor of an index $a(j,p)$ implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|k!I_{n,k}(f)|
\ge \frac u{\bar
A}\right) \le P\left(\bigcup_{p=1}^R\bigcup_{j=1}^{m_p}A(j,p)
\cup\bigcup_{s=1}^{m_0}B(s)\right) \\
&&\qquad \le \sum_{p=1}^R\sum_{j=1}^{m_p}P(A(j,p))
+\sum_{s=1}^{m_0}P(B(s)) \\
&&\le \sum_{p=1}^{\infty} CD\,2^{2pL}\sigma^{-L}
\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar A\sigma}\right)^{2/k}
\right\}
+CD\sigma^{-L}\exp\left\{-\alpha\left(\frac
u{2\bar A\sigma}\right)^{2/k}\right\}.
\end{eqnarray*}
If the condition $\left(\frac u\sigma\right)^{2/k}\ge
M(L\log\frac2\sigma+\log D)$ holds with a sufficiently large
constant $M$ (depending on $\bar A$), then the inequalities
$$
D2^{2pL}\sigma^{-L}\exp\left\{-\alpha\left(\frac{2^{p}u}{8\bar
A\sigma}\right)^{2/k} \right\}
\le 2^{-p}\exp\left\{-\alpha\left(\frac{2^{p}u}
{10\bar A \sigma}\right)^{2/k} \right\}
$$
hold for all $p=1,2,\dots$, and
$$
D\sigma^{-L}\exp\left\{-\alpha
\left(\frac u{2\bar A\sigma}\right)^{2/k}\right\}\le
\exp\left\{-\alpha\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}.
$$
Hence the previous estimate implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|k!I_{n,k}(f)|\ge
\frac u{\bar A}\right) \le\sum_{p=1}^{\infty}C 2^{-p}
\exp\left\{-\alpha\left(\frac{2^{p}u}{10\bar A \sigma}\right)^{2/k}
\right\}\\
&&\qquad +C\exp\left\{-\alpha\left(\frac u{10\bar A
\sigma}\right)^{2/k}\right\} \le 2C\exp\left\{-\alpha
\left(\frac u{10 \bar A\sigma}\right)^{2/k}\right\},
\end{eqnarray*}
and relation~(\ref{(14.4)}) holds.
The estimates
\begin{eqnarray*}
\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
&\le&2^{-2-6/k}2^{2R/k}\left(\frac u{\bar A\sigma}\right)^{2/k}
=2^{-4R}\cdot2^{(4+2/k)R-2-6/k}\left(\frac{u}
{\bar A\sigma}\right)^{2/k}\\
&\le& n\bar\sigma^2=2^{-4R} n\sigma^2\le
2^{-4R}\cdot2^{(4+2/k)(R+1)-2-6/k}
\left(\frac{u}{\bar A\sigma}\right)^{2/k}\\
&=&2^{2-4/k}\cdot 2^{2R/k}\left(\frac{u}{\bar A \sigma}\right)^{2/k}
=2^{2-4/k}\cdot2^{-2R/k} \left(\frac{u}
{\bar A\bar\sigma}\right)^{2/k}
\le4\left(\frac{u}{\bar A\bar\sigma}\right)^{2/k}
\end{eqnarray*}
hold because of the relation~$R\ge1$. This means that
$n\bar\sigma^2$
has the upper and lower bound formulated in Proposition~14.1.
It remains to show that $n\bar\sigma^2\ge
L\log n+\log D$ if relation~(\ref{(14.5)}) holds.
This inequality clearly holds under the conditions of
Proposition~14.1
if $\sigma\le n^{-1/3}$, since in this case
$\log\frac2\sigma\ge\frac{\log n}3$, and
\begin{eqnarray*}
n\bar\sigma^2&\ge&\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}
\ge\frac1{64}\bar A^{-2/k}
M\left(L\log\frac2\sigma +\log D\right) \\
&\ge&\frac1{192}\bar A^{-2/k} M(L\log n+\log D)\ge L\log n+\log D
\end{eqnarray*}
if $M= M(\bar A,k)$ is sufficiently large.
If $\sigma\ge n^{-1/3}$, then the inequality $2^{(4+2/k)R}
\left(\frac u{\bar A\sigma}\right)^{2/k} \le2^{2+6/k} n\sigma^2$
can be applied. This
implies that $2^{-4R}\ge2^{-4(2+6/k)/(4+2/k)}
\left[\dfrac{\left(\frac u{\bar A\sigma}\right)^{2/k}}
{n\sigma^2}\right]^{4/(4+2/k)}$, and
$$
n\bar\sigma^2=2^{-4R}n\sigma^2\ge\frac{2^{-16/3}}{\bar A^{4/3}}
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma
\textrm{ with } \gamma=\frac4{4+\frac2k}\ge\frac23.
$$
The inequalities $n\sigma^2\ge n^{1/3}$ and
$n\sigma^2\ge(\frac u\sigma)^{2/k}
\ge M(L^{3/2}\log\frac2\sigma+(\log D)^{3/2})
\ge\frac M2(L^{3/2}+(\log D)^{3/2})$ hold
(since $\log\frac2\sigma\ge\frac12$). They yield that for
sufficiently large $M=M(\bar A,k)$
$$
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^\gamma\ge
(n\sigma^2)^{1-\gamma}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3}=
(n\sigma^2)^{1/(2k+1)}
\left[\left(\frac u\sigma\right)^{2/k}\right]^{2/3},
$$
and
\begin{eqnarray*}
n\bar\sigma^2
&\ge& \frac{\bar A^{-4/3}}{50}
(n\sigma^2)^{1/(2k+1)}\left[\left(\frac
u\sigma\right)^{2/k}\right]^{2/3}\\
&\ge& \frac{\bar A^{-4/3}}{50}n^{1/(3(2k+1))}
\left(\frac M2\right)^{2/3} (L^{3/2}+(\log D)^{3/2})^{2/3}
\ge L\log n+\log D.
\end{eqnarray*}
\hfill$\qed$
\medskip
A multivariate analogue of Proposition~6.2 is formulated in
Proposition~14.2, and it will be shown that Propositions~14.1
and~14.2 imply Theorem~8.4.\index{estimate on the supremum of
degenerate $U$-statistics}
\medskip\noindent
{\bf Proposition 14.2.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\cal X})$ together with a sequence
of independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$ and a countable $L_2$-dense class ${\cal F}$ of
canonical (with respect to the measure~$\mu$) kernel functions
$f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$ and exponent
$L\ge1$ on the product space $(X^k,{\cal X}^k)$. Let all functions
$f\in{\cal F}$ satisfy conditions~(\ref{(8.1)})
and~(\ref{(8.2)}) with some
$0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$. Let us consider
the (degenerate) $U$-statistics $I_{n,k}(f)$ with the random
sequence $\xi_1,\dots,\xi_n$, $n\ge\max(2,k)$, and kernel
functions $f\in{\cal F}$. There exists a threshold index
$A_0=A_0(k)>0$ and two numbers $\bar C=\bar C(k)>0$ and
$\gamma=\gamma(k)>0$ depending only on the order $k$ of the
$U$-statistics such that the degenerate $U$-statistics
$I_{n,k}(f)$, $f\in{\cal F}$, satisfy the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}n^{-k/2}|k!I_{n,k}(f)|
\ge A n^{k/2}\sigma^{k+1}\right)
\le \bar C e^{-\gamma A^{1/2k}n\sigma^2}\quad \textrm{if } A\ge A_0.
\label{(14.8)}
\end{equation}
}
\medskip
Proposition~14.2 yields an estimate for the tail distribution of
the supremum of degenerate $U$-statistics at level
$u\ge A_0n^{k/2}\sigma^{k+1}$, i.e. in the case when Theorem~8.3
does not give a good estimate on the tail distribution of the single
degenerate $U$-statistics taking part in the supremum at
the left-hand side of~(\ref{(14.8)}). (Observe that the condition
$\left(\frac u\sigma\right)^{2/k}\le n\sigma^2$ of Theorem~8.3
holds exactly for $u\le n^{k/2}\sigma^{k+1}$.)
Formula~(\ref{(8.11)}) will be proved by means of
Proposition~14.1 with an
appropriate choice of the parameter $\bar A$ in it and
Proposition~14.2 with the choice $\sigma=\bar\sigma=\bar\sigma(u)$
and the classes of functions
${\cal F}_j=\left\{\frac{g-f_j}2\colon\, g\in{\cal D}_j\right\}$
with the number $\bar\sigma$, functions~$f_j$ and sets of
functions~${\cal D}_j$, $1\le j\le m$, introduced in Proposition~14.1.
Clearly,
\begin{eqnarray}
&&P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}|k!I_{n,k}(f)|\ge u\right)\le
P\left(\sup_{f\in{\cal F}_{\bar\sigma}}n^{-k/2}|k!I_{n,k}(f)|
\ge \frac u{\bar A}\right) \nonumber \\
&&\qquad+\sum_{j=1}^m P\left(\sup_{g\in{\cal D}_j} n^{-k/2}
\left|k!I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar A}\right)u\right),
\label{(14.9)}
\end{eqnarray}
where $m$ is the cardinality of the set of functions
${\cal F}_{\bar\sigma}$ appearing in Proposition~14.1.
We shall estimate the two terms of the sum at the right-hand side
of~(\ref{(14.9)}) by means of Propositions~14.1 and~14.2 with a good
choice of the parameters $\bar A$ and the corresponding $M=M(\bar A)$
in Proposition~14.1 together with a parameter $A\ge A_0$ in
Proposition~14.2.
We shall choose the parameter~$A\ge A_0$ in the application of
Proposition~14.2 so that it satisfies also the relation
$\gamma A^{1/2k}\ge2$ with the
number~$\gamma$ appearing in relation~(\ref{(14.8)}), hence we put
$A=\max(A_0,(\frac2\gamma)^{2k})$. After this choice we want to
define the parameter $\bar A$ in Proposition~14.1 in such a way
that the numbers~$u$ satisfying the conditions of Proposition~14.1
also satisfy the relation
$(\frac12-\frac1{2\bar A})u\ge An^{k/2}\bar\sigma^{k+1}$ with
the already fixed number~$A$ and the number
$\bar\sigma=\bar\sigma(u)$ defined in the proof of
Proposition~14.1. This inequality can be rewritten
in the form $A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}
(\frac u{\bar\sigma})^{2/k}\ge n\bar\sigma^2$. On the other hand,
under the conditions of Proposition~14.1 the inequality
$4(\frac u{\bar A\bar\sigma})^{2/k}\ge n\bar\sigma^2$ holds.
Hence the desired inequality holds if
$A^{-2/k}(\frac12-\frac1{2\bar A})^{2/k}\ge 4{\bar A}^{-2/k}$.
Thus the number $\bar A=2^{k+1}A+1$ is an appropriate choice.
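Indeed, the sufficiency of this choice can be seen from the following
equivalent transformations of the last inequality:
\begin{eqnarray*}
A^{-2/k}\left(\frac12-\frac1{2\bar A}\right)^{2/k}\ge4{\bar A}^{-2/k}
&\Longleftrightarrow&
\left(\frac{\bar A-1}{2\bar A}\right)^{2/k}{\bar A}^{2/k}\ge4A^{2/k}\\
&\Longleftrightarrow&
\frac{\bar A-1}2\ge4^{k/2}A=2^kA
\;\Longleftrightarrow\;
\bar A\ge2^{k+1}A+1.
\end{eqnarray*}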
With such a choice of $\bar A$ (together with the corresponding
$M=M(\bar A,k)$) and $A$ we can write
\begin{eqnarray*}
&&P\left(\sup_{g\in{\cal D}_j} n^{-k/2}
\left|k!I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge\left(\frac12-\frac1{2\bar
A}\right)u\right) \\
&&\qquad\le P\left(\sup_{g\in{\cal D}_j}n^{-k/2}
\left|k!I_{n,k}\left(\frac{f_j-g}2\right)\right|
\ge A n^{k/2}\bar\sigma^{k+1}\right)
\le \bar Ce^{-\gamma A^{1/2k}n\bar\sigma^2}
\end{eqnarray*}
for all $1\le j\le m$.
(Observe that the set of functions $\frac{f_j-g}2,\;g\in{\cal D}_j$, is
an $L_2$-dense class with parameter $D$ and exponent $L$.) Hence
Proposition~14.1 (relation (\ref{(14.4)}) together with
the inequality $m\le
D\bar \sigma^{-L}$) and formula (\ref{(14.8)}) with our
$A\ge A_0$ and relation~(\ref{(14.9)}) imply that
\begin{equation}
P\left(\sup\limits_{f\in{\cal F}} n^{-k/2}|k!I_{n,k}(f)|\ge u\right)
\le 2C\exp\left\{-\alpha
\left(\frac u{10\bar A\sigma}\right)^{2/k}\right\}
+\bar CD\bar\sigma^{-L} e^{-\gamma A^{1/2k}n\bar\sigma^2}.
\label{(14.10)}
\end{equation}
We show by repeating an argument given in Chapter~6 that
$D\bar\sigma^{-L}\le e^{n\bar\sigma^2}$. Indeed, we have to show
that $\log D+L\log\frac1{\bar\sigma}\le n\bar\sigma^2$. But, as we
have seen, the relation $n\bar\sigma^2\ge L\log n+\log D$ with
$L\ge1$ and $D\ge1$ implies that $n\bar\sigma^2\ge\log n$, hence
$\log\frac1{\bar\sigma}\le\log n$, and
$\log D+L\log\frac1{\bar\sigma}\le \log D+L\log n\le n\bar\sigma^2$.
On the other hand, $\gamma A^{1/2k}\ge2$ by the definition of
the number~$A$, and by the estimates of Proposition~14.1
$n\bar\sigma^2\ge\frac1{64}\left(\frac u{\bar A\sigma}\right)^{2/k}$.
The above relations imply that
$D\bar\sigma^{-L} e^{-\gamma A^{1/2k}n \bar\sigma^2}
\le e^{-\gamma A^{1/2k}n\bar\sigma^2/2}
\le \exp\left\{-\frac\gamma{128} A^{1/2k} \bar A^{-2/k}
\left(\frac u\sigma\right)^{2/k}\right\}$.
Hence relation~(\ref{(14.10)}) yields that
\begin{eqnarray*}
&&P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}|k!I_{n,k}(f)|\ge u\right)\\
&&\qquad\le 2C\exp \left\{-\frac\alpha{(10\bar A)^2}\left(\frac
u\sigma\right)^{2/k}\right\} +\bar C\exp\left\{-\frac\gamma{128}
A^{1/2k} \bar A^{-2/k} \left(\frac u\sigma\right)^{2/k}\right\},
\end{eqnarray*}
and this estimate implies Theorem~8.4.
\hfill$\qed$
\medskip
To complete the proof of Theorem~8.4 we have to prove
Proposition~14.2. It will be proved, similarly to its one-variate
version Proposition~6.2, by means of a symmetrization argument.
We want to find its right formulation. It would be natural to
formulate it as a result about the supremum of degenerate
$U$-statistics. However, we shall choose a slightly different
approach. There is a notion, called decoupled $U$-statistic.
Decoupled $U$-statistics behave similarly to $U$-statistics, but
it is simpler to work with them, because they have more
independence properties. It turned out to be useful to introduce
them and to apply a result of de la Pe\~na and
Montgomery--Smith which enables us to reduce the estimation of
$U$-statistics to the estimation of decoupled $U$-statistics,
and to work out the symmetrization argument for decoupled
$U$-statistics.
Next we introduce the notion of decoupled $U$-statistics
together with their randomized version. We also formulate a
result of de la Pe\~na and Montgomery--Smith in Theorem~14.3
which enables us to reduce Proposition~14.2 to a version of it,
presented in Proposition~$14.2'$. It states a result similar
to Proposition~14.2 about decoupled $U$-statistics. The proof of
Proposition~$14.2'$ is the hardest part of the problem. In
Chapters~15, 16 and~17 we deal essentially with this problem.
The result of de la Pe\~na and Montgomery--Smith will be
proved in Appendix~D.
\medskip\noindent
{\bf Definition of decoupled and randomized decoupled
$U$-statistics.}\index{decoupled $U$-statistics}\index{randomized
decoupled $U$-statistics} {\it Let us have $k$ independent
copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of a sequence
$\xi_1,\dots,\xi_n$ of independent and identically distributed
random variables taking their values in a measurable space
$(X,{\cal X})$ together with a measurable function $f(x_1,\dots,x_k)$
on the product space $(X^k,{\cal X}^k)$ with values in a separable
Banach space. The decoupled $U$-statistic $\bar I_{n,k}(f)$
determined by the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, and kernel function $f$ is defined by the formula
\begin{equation}
\bar I_{n,k}(f)=\frac1{k!}\sum_{\substack{(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)}\right).
\label{(14.11)}
\end{equation}
Let us have beside the sequences of random variables
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and function
$f(x_1,\dots,x_k)$ a sequence of independent random variables
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is independent also of the sequences of
random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$.
The randomized decoupled $U$-statistic $\bar I^\varepsilon_{n,k}(f)$
(depending on the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function $f$ and the randomizing sequence
$\varepsilon_1,\dots,\varepsilon_n$) is defined by the formula
\begin{equation}
\bar I^\varepsilon_{n,k}(f)=\frac1{k!}\sum_{\substack
{(l_1,\dots,l_k)\colon\, 1\le l_j\le n,\;j=1,\dots,k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f\left(\xi_{l_1}^{(1)},
\dots,\xi_{l_k}^{(k)}\right).
\label{(14.12)}
\end{equation}
}
\medskip
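For small $n$ and $k$ formulas~(\ref{(14.11)}) and~(\ref{(14.12)}) can be
evaluated directly by summing over all $k$-tuples of distinct indices. The
following sketch (an illustration of ours, not part of the mathematical
text; the function name and the sample kernels are our own choices)
computes a decoupled $U$-statistic and its randomized version by brute
force for a real-valued kernel:

```python
import itertools
import math

def decoupled_u_statistic(samples, f, eps=None):
    # samples[j][l] plays the role of xi_{l+1}^{(j+1)}: row j is the
    # j-th independent copy of the sample, as in formula (14.11).
    # If a +/-1 sequence eps is supplied, the product
    # eps_{l_1} ... eps_{l_k} is inserted, as in formula (14.12).
    k, n = len(samples), len(samples[0])
    total = 0.0
    # sum over all index tuples (l_1, ..., l_k) with distinct entries
    for idx in itertools.permutations(range(n), k):
        term = f(*[samples[j][idx[j]] for j in range(k)])
        if eps is not None:
            sign = 1
            for l in idx:
                sign *= eps[l]
            term *= sign
        total += term
    return total / math.factorial(k)
```

For instance, with $k=2$, $n=2$ and the kernel $f(x_1,x_2)=x_1x_2$ the
sum in~(\ref{(14.11)}) consists of the two terms corresponding to the
index pairs $(1,2)$ and $(2,1)$.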
A decoupled or randomized decoupled $U$-statistic (with a real-valued
kernel function) will be called degenerate if its kernel
function is canonical. This terminology is in full accordance with
the definition of (usual) degenerate $U$-statistics.
A result of de la Pe\~na and Montgomery--Smith will be formulated
below. It gives an upper bound for the tail distribution of a
$U$-statistic by means of the tail distribution of an appropriate
decoupled $U$-statistic. It also has a generalization, where the
supremum of $U$-statistics is bounded by the supremum of decoupled
$U$-statistics. It enables us to reduce Proposition~14.2 to a
version of it formulated in Proposition~$14.2'$, which gives a bound
on the tail distribution of the supremum of decoupled $U$-statistics.
It is simpler to prove this result than the original one.
Before the formulation of the theorem of de la Pe\~na and
Montgomery--Smith I make some remarks about it. In this result we
consider more general $U$-statistics with kernel functions taking
values in a separable Banach space, and we compare the norm of
Banach space valued $U$-statistics and decoupled $U$-statistics.
(Decoupled $U$-statistics were defined with general Banach space
valued kernel functions, and the definition of $U$-statistics can
also be generalized to separable Banach space valued kernel
functions in a natural way.) This result was formulated in such
a general form for a special reason. This helped us to derive
formula~(\ref{(14.14)}) of the subsequent theorem from
formula~(\ref{(14.13)}). It can be exploited in the proof of
formula~(\ref{(14.14)}) that the constants in the
estimate~(\ref{(14.13)}) do not depend on the Banach
space where the kernel function~$f$ takes its values.
\medskip\noindent
{\bf Theorem 14.3 (Theorem of de la Pe\~na and Montgomery--Smith
about the comparison of $U$-statistics and decoupled
$U$-statistics).}\index{comparison of the tail distribution of
$U$-statistics and decoupled $U$-statistics (result of de la
Pe\~na and Montgomery--Smith)}
{\it Let us consider a sequence of independent
and identically distributed random variables $\xi_1,\dots,\xi_n$
with values in a measurable space $(X,{\cal X})$ together with $k$
independent copies $\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
of this sequence. Let us also have a function $f(x_1,\dots,x_k)$ on
the $k$-fold product space $(X^k,{\cal X}^k)$ which takes its values
in a separable Banach space~$B$. Let us take the $U$-statistic and
decoupled $U$-statistic $I_{n,k}(f)$ and $\bar I_{n,k}(f)$ with
the help of the above random sequences $\xi_1,\dots,\xi_n$,
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and kernel
function~$f$. There exist some constants $\bar C=\bar C(k)>0$
and $\gamma=\gamma(k)>0$ depending only on the order~$k$ of the
$U$-statistic such that
\begin{equation}
P\left(\|k!I_{n,k}(f)\|>u\right)
\le\bar CP\left(\|k!\bar I_{n,k}(f)\|>\gamma u\right)
\label{(14.13)}
\end{equation}
for all $u>0$. Here $\|\cdot\|$ denotes the norm in the Banach
space~$B$ where the function~$f$ takes its values.
More generally, if we have a countable sequence of functions
$f_s$, $s=1,2,\dots$, taking their values in the same separable
Banach space, then
\begin{equation}
P\left(\sup_{1\le s<\infty} \left\|k! I_{n,k}(f_s)\right\|>u\right)\le
\bar CP\left(\sup_{1\le s<\infty}\left\|k!\bar I_{n,k}(f_s)\right\|
>\gamma u\right). \label{(14.14)}
\end{equation}
}
\medskip
Now I formulate the following version of Proposition~14.2.
\medskip\noindent
{\bf Proposition 14.2$'$.} {\it Let a probability measure $\mu$ be
given on a measurable space $(X,{\cal X})$ together with a sequence
of independent and $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$, $n\ge\max(k,2)$, and a countable $L_2$-dense
class ${\cal F}$ of canonical (with respect to the measure~$\mu$)
kernel functions $f=f(x_1,\dots,x_k)$ with some parameter $D\ge1$
and exponent $L\ge1$ on the product space $(X^k,{\cal X}^k)$. Let
all functions $f\in{\cal F}$ satisfy conditions~(\ref{(8.1)})
and~(\ref{(8.2)})
with some $0<\sigma\le1$ such that $n\sigma^2>L\log n+\log D$.
Let us take $k$ independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, of the random sequence $\xi_1,\dots,\xi_n$, and
consider the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\cal F}$, defined with their help in formula~(\ref{(14.11)}).
There exists a threshold index $A_0=A_0(k)>0$ depending only on
the order $k$ of the decoupled $U$-statistics $\bar I_{n,k}(f)$,
$f\in{\cal F}$, such that the (degenerate) decoupled
$U$-statistics $\bar I_{n,k}(f)$, $f\in{\cal F}$, satisfy the
following version of inequality (\ref{(14.8)}):
\begin{equation}
P\left(\sup_{f\in{\cal F}}n^{-k/2}|k!\bar I_{n,k}(f)|
\ge An^{k/2}\sigma^{k+1}\right)
\le e^{-2^{-(1/2+1/2k)} A^{1/2k}n\sigma^2}\quad \textrm{if } A\ge A_0.
\label{(14.15)}
\end{equation}
}
\medskip
It is clear that Proposition~$14.2'$ and Theorem~14.3, more explicitly
formula (\ref{(14.14)}) in it, imply Proposition~14.2. Hence
the proof of Theorem~8.4 was reduced to Proposition~$14.2'$ in
this chapter. The proof of Proposition~$14.2'$ is based on a
symmetrization argument. Its main ideas will be explained in the
next chapter.
\chapter{The strategy of the proof for the main result of
this work}
In the previous chapter the proof of Theorem~8.4 was reduced to
that of Proposition~$14.2'$. Proposition~$14.2'$ is a multivariate
version of Proposition~6.2, and its proof is based on similar
ideas. An important step in the proof of Proposition~6.2 was a
symmetrization argument in which we reduced the estimation of
the probability $P\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{j=1}^nf(\xi_j)>u\right)$
to that of the probability
$P\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{j=1}^n\varepsilon_jf(\xi_j)>\frac u3\right)$, where
$\xi_1,\dots,\xi_n$ is a sequence of independent and identically
distributed random variables, and $\varepsilon_j$, $1\le j\le n$,
is a sequence of independent random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, independent of
the sequence~$\xi_j$. We want to prove a similar symmetrization
argument which helps to prove Proposition~$14.2'$.\index{estimate
on the supremum of degenerate $U$-statistics}
The symmetrization argument applied in the proof of Proposition~6.2
was carried out in two steps. We took a copy $\xi_1',\dots,\xi'_n$
of the sequence $\xi_1,\dots,\xi_n$, i.e. a sequence of independent
random variables which is independent also of the original
sequence $\xi_1,\dots,\xi_n$, and has the same distribution. In the
first step we compared the tail distribution of the expression
$\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^n[f(\xi_j)-f(\xi'_j)]$
with that of $\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^nf(\xi_j)$
with the help of Lemma~7.1. In the second step, in the proof of
Lemma~7.2, we applied a `randomization argument' which stated
that the distribution of the random fields
$\sum\limits_{j=1}^n[f(\xi_j)-f(\xi_j')]$ and
$\sum\limits_{j=1}^n\varepsilon_j[f(\xi_j)-f(\xi_j')]$,
$f\in{\cal F}$, agree. The symmetrization argument was proved
with the help of these two observations.
In the proof of Proposition~$14.2'$ we would like to reduce the
estimation of the tail distribution of the supremum of decoupled
$U$-statistics $\sup\limits_{f\in{\cal F}}\bar I_{n,k}(f)$
defined in formula~(\ref{(14.11)}) to the estimation of the
tail distribution of the supremum of the randomized decoupled
$U$-statistics
$\sup\limits_{f\in{\cal F}}\bar I_{n,k}^\varepsilon(f)$ defined
in formula~(\ref{(14.12)}) in a similar way. To do this we have
to find the multivariate version of the `randomization argument'
in the proof of Lemma~7.2. This will be done in the subsequent
Lemma~15.1. In Lemma~7.2 this randomization argument was
formulated with the help of some random variables introduced in
formulas~(\ref{(7.4)}) and~(\ref{($7.4'$)}). We shall define
their multivariate versions in formulas~(\ref{(15.1)})
and~(\ref{(15.2)}), and they will play a similar role in the
formulation of Lemma~15.1.
The adaptation of the first step of the symmetrization argument
of the proof of Proposition~6.2 is much harder. The proof of
Proposition~6.2 was based on a symmetrization lemma formulated
in Lemma~7.1. This result does not work in the present case.
Hence we shall generalize it in Lemma~15.2. The proof of the
symmetrization argument needed in the proof of
Proposition~$14.2'$ is difficult even with the help of this
result. The hardest part of our problem appears at this point.
I return to it after the formulation of Lemma~15.2.
To formulate Lemma~15.1 we introduce the following notations.
Let ${\cal V}_k= \{(v(1),\dots,v(k))\colon\; v(j)=\pm1,
\textrm{ for all }1\le j\le k\}$
denote the set of all $\pm1$ sequences of length~$k$. Let $m(v)$
denote the number of $-1$ digits in a sequence
$v=(v(1),\dots,v(k))\in{\cal V}_k$. Let a (real valued) function
$f(x_1,\dots,x_k)$ of $k$ variables be given on a measurable
space $(X,{\cal X})$ together with a sequence of independent and
identically distributed random variables $\xi_1,\dots,\xi_n$ with
values in the space $(X,{\cal X})$. Take $2k$ independent copies
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, $1\le j\le k$, of the
sequence $\xi_1,\dots,\xi_n$. Let us have beside them another sequence
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, of
independent random variables, also independent of all previously
introduced random variables. With the help of the above
quantities we introduce the random variables
\begin{equation}
\tilde I_{n,k}(f)=\frac1{k!}\sum_{v\in {\cal V}_k}
(-1)^{m(v)} \sum_{\substack{(l_1,\dots,l_k)\colon\, 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'} \textrm{ if } r\neq r'}}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right)
\label{(15.1)}
\end{equation}
and
\begin{eqnarray}
\tilde I^\varepsilon_{n,k}(f)
&&=\frac1{k!}\sum_{v\in {\cal V}_k}
(-1)^{m(v)} \label{(15.2)} \\
&&\qquad \sum_{\substack{ (l_1,\dots,l_k)\colon\, 1\le l_r\le n,\;
r=1,\dots, k,\\
l_r\neq l_{r'}
\textrm{ if } r\neq r'}} \varepsilon_{l_1}\cdots\varepsilon_{l_k}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right).
\nonumber
\end{eqnarray}
The number $m(v)$ in the above formulas denotes the number of the
digits $-1$ in the $\pm1$ sequence $v$ of length~$k$, hence it
counts how many random variables $\xi_{l_j}^{(j,1)}$, $1\le j\le k$,
were replaced by the `secondary copy' $\xi_{l_j}^{(j,-1)}$ for a
$v\in{\cal V}_k$ in the inner sum in formulas~(\ref{(15.1)})
or~(\ref{(15.2)}).
\medskip\noindent
{\it Remark.}\/ The definition of the linear combination of
decoupled $U$-statistics $\tilde I_{n,k}^\varepsilon(f)$ defined
in~(\ref{(15.2)}) shows some similarity to the definition of a
Stieltjes measure determined by a function $f(x_1,\dots,x_k)$.
One can argue that there is a deeper cause of this resemblance.
\medskip
The following result holds.
\medskip\noindent
{\bf Lemma 15.1.} {\it Let us consider a (non-empty) class of
functions ${\cal F}$ of $k$ variables $f(x_1,\dots,x_k)$ on the
space $(X^k,{\cal X}^k)$ together with the random variables
$\tilde I_{n,k}(f)$ and $\tilde I^\varepsilon_{n,k}(f)$ defined in
formulas~(\ref{(15.1)}) and~(\ref{(15.2)}) for all $f\in {\cal F}$.
The distributions of the random fields $\tilde I_{n,k}(f)$,
$f\in{\cal F}$, and $\tilde I^\varepsilon_{n,k}(f)$, $f\in {\cal F}$,
agree.}
\medskip
Let me recall that we say that the distributions of two random
fields $X(f)$, $f\in{\cal F}$, and $Y(f)$, $f\in{\cal F}$,
agree if for any finite set $\{f_1,\dots,f_p\}\subset{\cal F}$ the
distributions of the random vectors $(X(f_1),\dots,X(f_p))$ and
$(Y(f_1),\dots,Y(f_p))$ agree.
\medskip\noindent
{\it Proof of Lemma 15.1.}\/ I even claim that for any fixed
sequence
$$
u=(u(1),\dots,u(n)), \quad u(l)=\pm1, \;\; 1\le l\le n,
$$
of length~$n$ the conditional distribution of the field
$\tilde I^\varepsilon_{n,k}(f)$, $f\in {\cal F}$, under the
condition $(\varepsilon_1,\dots,\varepsilon_n)=u=(u(1),\dots,u(n))$
agrees with the distribution of the field of $\tilde I_{n,k}(f)$,
$f\in{\cal F}$.
Indeed, the random variables $\tilde I_{n,k}(f)$, $f\in{\cal F}$,
defined in (\ref{(15.1)}) are functions of a random vector
with coordinates
$(\xi_l^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$,
$1\le l\le n$, $1\le j\le k$, and the distribution of this random
vector remains the same if the coordinates
$(\xi_{l}^{(j)},\bar\xi_l^{(j)})=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
are replaced by
$(\bar\xi_l^{(j)},\xi_l^{(j)})=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$
for such pairs of indices $(l,j)$ for which $u(l)=-1$ (and the
index~$j$ is arbitrary), and the coordinates
$(\xi_{l}^{(j)},\bar\xi_l^{(j)})$ with such pairs of indices
$(l,j)$ for which $u(l)=1$ are not modified. As a consequence,
the random field
$\tilde I_{n,k}(f|u)$, $f\in{\cal F}$, which we get by replacing the
original vector $(\xi_l^{(j)},\bar\xi_l^{(j)})$, $1\le l\le n$,
$1\le j\le k$, in the definition of the expression
$\tilde I_{n,k}(f)$ in~(\ref{(15.1)}) for all $f\in {\cal F}$ by this
modified vector depending on~$u$, has the same distribution as the
random field $\tilde I_{n,k}(f)$, $f\in{\cal F}$. On the other hand,
I claim that the distribution of the random field
$\tilde I_{n,k}(f|u)$, $f\in{\cal F}$, agrees with the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\cal F}$, defined in~(\ref{(15.2)}) under the condition that
$(\varepsilon_1,\dots,\varepsilon_n)=u$ with $u=(u(1),\dots,u(n))$.
To prove the last statement let us observe that the conditional
distribution of the random field $\tilde I^\varepsilon_{n,k}(f)$,
$f\in{\cal F}$, under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$ is the same
as the distribution of the random field we obtain by putting
$\varepsilon_l=u(l)$, $1\le l\le n$, in all coordinates
$\varepsilon_l$ of the random
variables $\tilde I^\varepsilon_{n,k}(f)$. On the other hand, the
random variables we get in such a way agree with the random
variables appearing in the sum defining $\tilde I_{n,k}(f|u)$,
only the terms in this sum are listed in a different order.
Lemma~15.1 is proved.
\hfill $\qed$
\medskip
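Lemma~15.1 can also be checked by exact enumeration in the simplest case
$k=1$, $n=2$, where
$\tilde I_{2,1}(f)=\sum_l[f(\xi_l^{(1,1)})-f(\xi_l^{(1,-1)})]$ and
$\tilde I^\varepsilon_{2,1}(f)=
\sum_l\varepsilon_l[f(\xi_l^{(1,1)})-f(\xi_l^{(1,-1)})]$. The sketch
below (our own illustration, with $\xi$ uniform on $\{0,1\}$ and the
kernel $f(x)=x$ chosen only for concreteness) lists both distributions
exactly and confirms that they coincide:

```python
import itertools
from collections import Counter
from fractions import Fraction

def f(x):
    return x  # illustrative kernel, k = 1

def dist_tilde():
    # exact distribution of tilde I_{2,1}(f) from (15.1):
    # sum_l [f(xi_l^{(1,1)}) - f(xi_l^{(1,-1)})], xi's uniform on {0,1}
    c = Counter()
    for xi in itertools.product([0, 1], repeat=2):
        for xibar in itertools.product([0, 1], repeat=2):
            val = sum(f(a) - f(b) for a, b in zip(xi, xibar))
            c[val] += Fraction(1, 16)
    return c

def dist_tilde_eps():
    # exact distribution of tilde I^eps_{2,1}(f) from (15.2), with the
    # extra independent +/-1 signs eps_l
    c = Counter()
    for xi in itertools.product([0, 1], repeat=2):
        for xibar in itertools.product([0, 1], repeat=2):
            for eps in itertools.product([1, -1], repeat=2):
                val = sum(e * (f(a) - f(b))
                          for e, a, b in zip(eps, xi, xibar))
                c[val] += Fraction(1, 64)
    return c

# the two distributions coincide, as Lemma 15.1 asserts
assert dist_tilde() == dist_tilde_eps()
```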
Next I prove the following generalized version of Lemma~7.1.
\medskip\noindent
{\bf Lemma 15.2 (Generalized version of the Symmetrization
Lemma).}\index{symmetrization lemma}
{\it Let $Z_p$ and $\bar Z_p$, $p=1,2,\dots$, be two sequences of
random variables on a probability space $(\Omega,{\cal A},P)$. Let a
$\sigma$-algebra ${\cal B}\subset {\cal A}$ be given on the probability
space $(\Omega,{\cal A},P)$ together with a ${\cal B}$-measurable set
$B$ and two numbers $\alpha>0$ and $\beta>0$ such that the random
variables $Z_p$, $p=1,2,\dots$, are ${\cal B}$ measurable, and the
inequality
\begin{equation}
P(|\bar Z_p|\le\alpha|{\cal B})(\omega)\ge\beta\quad \textrm{for all }
\,p=1,2,\dots \textrm{ if }\,\omega\in B \label{(15.3)}
\end{equation}
holds.
Then
\begin{equation}
P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)
\le\frac1\beta P\left(\sup\limits_{1\le
p<\infty}|Z_p-\bar Z_p|>u\right)+(1-P(B))
\label{(15.4)}
\end{equation}
for all $u>0$.}
\medskip\noindent
{\it Proof of Lemma 15.2.}\/ Put $\tau=\min\{p\colon\, |Z_p|>\alpha+u\}$
if there exists such an index $p\ge1$, and put $\tau=0$ otherwise. Then
we have, as $\{\tau=p\}\cap B\in{\cal B}$,
\begin{eqnarray*}
P(\{\tau=p\}\cap B)
&=&\int_{\{\tau=p\}\cap B} 1\cdot\,dP
\le\int_{\{\tau=p\}\cap B}\frac1\beta
P(|\bar Z_p|\le \alpha|{\cal B})\,dP \\
&=&\frac1\beta P(\{\tau=p\}\cap\{|\bar Z_p|\le\alpha\}\cap B)\\
&\le& \frac1\beta P(\{\tau=p\}\cap\{|Z_p-\bar Z_p|>u\})
\quad \textrm{for all } p=1,2,\dots.
\end{eqnarray*}
In the last step we used that the inequalities $|Z_p|>\alpha+u$ and
$|\bar Z_p|\le\alpha$ imply $|Z_p-\bar Z_p|>u$. Hence
\begin{eqnarray*}
&&P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)-(1-P(B))\le
P\left(\left\{\sup_{1\le p<\infty}|Z_p|>
\alpha+u\right\}\cap B\right) \\
&&\qquad=\sum_{p=1}^\infty P(\{\tau=p\}\cap B)
\le \frac1\beta \sum_{p=1}^\infty P(\{\tau=p\}\cap\{|Z_p-\bar
Z_p|>u\}) \\
&&\qquad \le\frac1\beta
P\left(\sup_{1\le p<\infty}|Z_p-\bar Z_p|>u\right).
\end{eqnarray*}
Lemma~15.2 is proved.
\hfill $\qed$
\medskip
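On a finite probability space the inequality~(\ref{(15.4)}) can be
tested directly. In the sketch below (entirely a toy example of ours,
with illustrative numerical values) $\omega=(\omega_1,\omega_2)$ is
uniform on a $3\times3$ product set, $Z$ depends only on $\omega_1$, so
it is measurable with respect to the $\sigma$-algebra
${\cal B}=\sigma(\omega_1)$, $\bar Z$ depends only on $\omega_2$, and we
take $B=\Omega$, so that $P(B)=1$ and condition~(\ref{(15.3)}) holds
with $\beta=P(|\bar Z|\le\alpha)$:

```python
# Toy check of the symmetrization inequality (15.4): omega = (w1, w2)
# is uniform on a 3 x 3 product set, Z depends only on w1 (hence it is
# B-measurable for B = sigma(w1)), Zbar depends only on w2, B = Omega.
O1, O2 = [0, 1, 2], [0, 1, 2]
Z = {0: 0.0, 1: 2.0, 2: 5.0}      # values Z(w1); illustrative numbers
Zbar = {0: 0.5, 1: -0.5, 2: 3.0}  # values Zbar(w2)
alpha, u = 1.0, 1.5

def prob(event):
    # probability of an event under the uniform measure on O1 x O2
    return sum(event(w1, w2) for w1 in O1 for w2 in O2) / (len(O1) * len(O2))

# condition (15.3): P(|Zbar| <= alpha | B) = P(|Zbar| <= alpha) = beta,
# since Zbar is independent of B
beta = sum(abs(Zbar[w2]) <= alpha for w2 in O2) / len(O2)

lhs = prob(lambda w1, w2: abs(Z[w1]) > alpha + u)
rhs = prob(lambda w1, w2: abs(Z[w1] - Zbar[w2]) > u) / beta  # 1 - P(B) = 0
assert lhs <= rhs  # inequality (15.4)
```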
Next I give a short explanation about the difficulties we meet in the
proof of Proposition~$14.2'$ and the approach applied in this work to
overcome them with the help of some symmetrization type arguments.
To find a symmetrization argument useful in the proof of
Proposition~$14.2'$ we want to bound the probability
$P\left(n^{-k/2}\sup\limits_{f\in{\cal F}}|k!\bar I_{n,k}(f)|>u\right)$ by
$$
C\cdot P\left(n^{-k/2}
\sup\limits_{f\in{\cal F}}|k!\tilde I_{n,k}(f)|>c u\right)
+\textrm{ a negligible error term}
$$
with some appropriate numbers $C<\infty$ and $0<c<\infty$.
\medskip\noindent
{\bf Definition of good tail behaviour for a class of decoupled
$U$-statistics.}\index{good tail behaviour for a class of decoupled
$U$-statistics}
{\it Let us have a probability measure $\mu$ on a measurable space
$(X,{\cal X})$, and fix some positive integer~$n\ge k$ and a positive
number $0<\sigma\le1$. Take $k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of a sequence of
independent $\mu$-distributed random variables $\xi_1,\dots,\xi_n$,
and consider a countable class ${\cal F}$ of functions
$f(x_1,\dots,x_k)$ on the product space $(X^k,{\cal X}^k)$ together
with the decoupled $U$-statistics $\bar I_{n,k}(f)$, $f\in{\cal F}$,
defined in formula~(\ref{(14.11)}). Given some real number $T>0$ we
say that the
set of decoupled $U$-statistics determined by the class of
functions ${\cal F}$ has a good tail behaviour at level~$T$ (with
parameters $n$ and $\sigma^2$ which are fixed in the sequel) if
\begin{equation}
P\left(\sup_{f\in{\cal F}}|n^{-k/2}k!\bar I_{n,k}(f)|\ge A
n^{k/2}\sigma^{k+1}\right)
\le \exp\left\{-A^{1/2k}n\sigma^2 \right\}
\quad \textrm{for all } A>T. \label{(15.5)}
\end{equation}
}
\medskip\noindent
{\bf Definition of good tail behaviour for a class of integrals of
decoupled $U$-statistics.}\index{good tail behaviour for a class
of integrals of decoupled $U$-statistics}
{\it Let us have a product space
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some product measure
$\mu^k\times\rho$, where $(X^k,{\cal X}^k,\mu^k)$ is the $k$-fold
product of some measurable space $(X,{\cal X},\mu)$ with a
probability measure~$\mu$, and $(Y,{\cal Y},\rho)$ is some other
measurable space with a probability measure~$\rho$. Fix some positive
integer~$n\ge k$ and a positive number $0<\sigma\le1$, and consider
a countable class ${\cal F}$ of functions $f(x_1,\dots,x_k,y)$ on
the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$. Take $k$
independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of
a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. For all $f\in{\cal F}$ and $y\in Y$ let us define
the decoupled $U$-statistics $\bar I_{n,k}(f,y)=\bar I_{n,k}(f_y)$
by means of these random variables $\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$1\le j\le k$, the kernel function
$f_y(x_1,\dots,x_k)=f(x_1,\dots,x_k,y)$ and formula~(\ref{(14.11)}).
Define
with the help of these $U$-statistics $\bar I_{n,k}(f,y)$ the random
integrals
\begin{equation}
H_{n,k}(f)=\int [k!\bar I_{n,k}(f,y)]^2\rho(\,dy), \quad f\in{\cal F}.
\label{(15.6)}
\end{equation}
Choose some real number $T>0$. We say that the set of random
integrals $H_{n,k}(f)$, $f\in{\cal F}$, has a good tail behaviour at
level $T$ (with parameters $n$ and $\sigma^2$ which we fix in the
sequel) if
\begin{equation}
P\left(\sup_{f\in{\cal F}} n^{-k}H_{n,k}(f)
\ge A^2 n^k\sigma^{2k+2}\right)
\le \exp\left\{-A^{1/(2k+1)}n\sigma^2 \right\}
\quad \textrm{for all } A> T.
\label{(15.7)}
\end{equation}
}
\medskip
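When $\rho$ is a discrete measure concentrated on finitely many points,
the integral in~(\ref{(15.6)}) becomes a finite weighted sum, and
$H_{n,k}(f)$ can be computed by brute force for small $n$ and $k$. The
following sketch (the function names and the discrete choice of $\rho$
are our own illustration, not part of the text) does this:

```python
import itertools
import math

def decoupled_u(samples, f):
    # decoupled U-statistic (14.11) with a real-valued kernel f;
    # samples[j][l] plays the role of xi_{l+1}^{(j+1)}
    k, n = len(samples), len(samples[0])
    s = sum(f(*[samples[j][idx[j]] for j in range(k)])
            for idx in itertools.permutations(range(n), k))
    return s / math.factorial(k)

def h_nk(samples, f, ys, weights):
    # H_{n,k}(f) of (15.6) for a measure rho putting mass weights[i]
    # on the point ys[i]; here f = f(x_1, ..., x_k, y)
    k = len(samples)
    fact = math.factorial(k)
    return sum(w * (fact * decoupled_u(samples, lambda *xs: f(*xs, y))) ** 2
               for y, w in zip(ys, weights))
```

For example, with $k=1$, the sample $(1,2)$ and the kernel
$f(x,y)=xy$, the inner statistic is $\bar I_{2,1}(f_y)=3y$, so a
measure $\rho$ with equal mass $\frac12$ on $y=1$ and $y=2$ gives
$H_{2,1}(f)=\frac12\cdot9+\frac12\cdot36=22.5$.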
Propositions~15.3 and~15.4 will be formulated with the help of the
above notions.
\medskip\noindent
{\bf Proposition 15.3.} {\it Let us fix a positive
integer~$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a
probability measure $\mu$ on a measurable space $(X,{\cal X})$
together with two real numbers $L\ge1$ and $D\ge1$ such that
$n\sigma^2\ge L\log n+\log D$. Let us consider those countable
$L_2$-dense classes ${\cal F}$ of canonical kernel functions
$f=f(x_1,\dots,x_k)$ (with respect to the measure~$\mu$) on the
$k$-fold product space $(X^k,{\cal X}^k)$ with exponent~$L$
and parameter~$D$ for which all functions $f\in{\cal F}$ satisfy the
inequalities $\sup\limits_{x_j\in X, 1\le j\le k}
|f(x_1,\dots,x_k)|\le 2^{-(k+1)}$ and $\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$.
There is a real number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\cal F}$ which satisfy the above conditions
the sets of decoupled $U$-statistics $\bar I_{n,k}(f)$, $
f\in{\cal F}$, have a good tail behaviour at level~$T^{4/3}$ for
some $T\ge A_0$, then they also have a good tail behaviour at
level~$T$.}
\medskip\noindent
{\bf Proposition 15.4.} {\it Fix a positive integer
$n\ge\max(k,2)$, a real number $0<\sigma\le2^{-(k+1)}$, a product
space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ with some product
measure $\mu^k\times\rho$, where $(X^k,{\cal X}^k,\mu^k)$ is the
$k$-fold product of some probability space $(X,{\cal X},\mu)$, and
$(Y,{\cal Y},\rho)$ is some other probability space together with
two real numbers $L\ge1$ and $D\ge1$ such that the inequality
$n\sigma^2>L\log n+\log D$ holds.
Let us consider those countable $L_2$-dense classes ${\cal F}$
consisting of canonical functions $f(x_1,\dots,x_k,y)$ on the
product space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ with
exponent $L\ge1$ and parameter $D\ge1$ whose elements
$f\in{\cal F}$ satisfy the inequalities
\begin{equation}
\sup\limits_{x_j\in X, 1\le j\le k, y\in Y}|f(x_1,\dots,x_k,y)|\le
2^{-(k+1)} \label{(15.8)}
\end{equation}
and
\begin{equation}
\int f^2(x_1,\dots,x_k,y)\mu(\,dx_1)\dots\mu(\,dx_k)\rho(\,dy)
\le\sigma^2 \quad \textrm{for all } f\in {\cal F}.
\label{(15.9)}
\end{equation}
There exists some number $A_0=A_0(k)>1$ such that if for all
classes of functions ${\cal F}$ which satisfy the above conditions
the random integrals $H_{n,k}(f)$, $f\in{\cal F}$, defined
in~(\ref{(15.6)}) have a good tail behaviour at level $T^{(2k+1)/2k}$
with some $T\ge A_0$, then they also have a good tail behaviour
at level~$T$.}
\medskip\noindent
{\it Remark.}\/ To complete the formulation of Proposition~15.4 we
still have to clarify when we call a function $f(x_1,\dots,x_k,y)$
defined on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$
canonical.\index{canonical function}
Here we apply a definition which slightly differs from that given
in formula~(\ref{(8.8)}).
We say that a function
$f(x_1,\dots,x_k,y)$ on the product space
$(X^k\times Y,{\cal X}^k\times{\cal Y},\mu^k\times\rho)$
is canonical if
\begin{eqnarray*}
&&\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k,y)\mu(\,du)=0\\
&&\qquad \qquad \textrm{for all } 1\le j\le k,\; x_s\in X,
\;s\neq j \textrm{ and }y\in Y.
\end{eqnarray*}
In this definition we do not require the analogous identity if
we integrate with respect to the variable $y$ with fixed
arguments $x_j\in X$, $1\le j\le k$.
\medskip
Let me also remark that the estimate (\ref{(15.7)}) we have
formulated in the definition of the property `good tail behaviour
for a class of integrals of decoupled $U$-statistics' is fairly
natural. We
have applied the natural normalization, and with such a
normalization it is natural to expect that the tail behaviour of
the distribution of $\sup\limits_{f\in{\cal F}}n^{-k}H_{n,k}(f)$
is similar to that of $\textrm{const.}\,\left(\sigma\eta^k\right)^2$,
where $\eta$ is a standard normal random variable.
Formula~(\ref{(15.7)}) expresses such a behaviour, only the power
of the number~$A$ in the exponent at the right-hand side was
chosen in a non-optimal way. Formula~(\ref{(15.5)}) in the
formulation of the property `good tail behaviour for a class of
decoupled $U$-statistics' has a similar interpretation. It says
that
$\sup\limits_{f\in{\cal F}}|n^{-k/2}k!\bar I_{n,k}(f)|$
behaves
similarly to $\textrm{const.}\,\sigma|\eta^k|$ with a standard
normal random variable $\eta$.
\medskip
We wanted to prove the property of good tail behaviour for a class
of integrals of decoupled $U$-statistics under appropriate, not too
restrictive conditions. Let me remark that in Proposition~15.4 we
have imposed beside formula (\ref{(15.8)}) a fairly weak
condition (\ref{(15.9)})
about the $L_2$-norm of the function~$f$. Most difficulties appear
in the proof, because we did not want to impose more restrictive
conditions.
It is not difficult to derive Proposition~$14.2'$ from
Proposition~15.3. Indeed, let us observe that the set of decoupled
$U$-statistics determined by a class of functions ${\cal F}$
satisfying the conditions of Proposition~15.3 has a good
tail behaviour at level $T_0=\sigma^{-(k+1)}$, since under the
conditions of this Proposition the probability at the left-hand
side of~(\ref{(15.5)}) equals zero for $A>\sigma^{-(k+1)}$. Then we get
from Proposition~15.3 by induction with respect to the number $j$,
that this set of decoupled $U$-statistics has a good tail behaviour
also for all $T=T_j=T_0^{(3/4)^j}=\sigma^{-(k+1)(3/4)^j}$,
$j=0,1,2,\dots$, with such indices~$j$ for which
$T_j=\sigma^{-(k+1)(3/4)^j}\ge A_0$. This implies that if a class of
functions ${\cal F}$ satisfies the conditions of Proposition~15.3,
then the set of decoupled $U$-statistics determined by this class
of functions has a good tail behaviour at level $T=A_0^{4/3}$,
i.e. at a level which depends only on the order~$k$ of the
decoupled $U$-statistics. This result implies Proposition~$14.2'$,
only it has to be applied for the class of functions
${\cal F}'=\{2^{-(k+1)}f,\; f\in{\cal F}\}$ instead of the original
class of functions ${\cal F}$ which appears in Proposition~$14.2'$
with the same parameters~$\sigma$, $L$ and~$D$.
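The sequence of levels $T_j=T_0^{(3/4)^j}=\sigma^{-(k+1)(3/4)^j}$
appearing in this induction decreases to~1, so it drops below any fixed
$A_0>1$ after finitely many steps. The following sketch (an
illustration of ours, with parameter values of our own choosing)
generates the levels as long as they stay at least~$A_0$:

```python
def good_tail_levels(sigma, k, A0):
    # T_0 = sigma^{-(k+1)}; Proposition 15.3 turns good tail behaviour
    # at level T^{4/3} into good tail behaviour at level T, i.e. the
    # level is lowered by T -> T^{3/4} while it remains >= A0.
    T = sigma ** (-(k + 1))
    levels = [T]
    while T ** 0.75 >= A0:
        T = T ** 0.75
        levels.append(T)
    return levels
```

With $\sigma=0.1$, $k=1$ and $A_0=2$, for instance, the recursion
starts from $T_0=100$ and can be applied six more times before the
level would fall below~$A_0$.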
Similarly to the above argument, an inductive procedure yields the
corollary of Proposition~15.4 formulated below. Actually, it is this
corollary of Proposition~15.4 that we shall need.
\medskip\noindent
{\bf Corollary of Proposition 15.4.} {\it If the class of functions
${\cal F}$ satisfies the conditions of Proposition~15.4, then there
exists a constant $\bar A_0=\bar A_0(k)>0$ depending only on $k$
such that the class of integrals $H_{n,k}(f)$, $f\in {\cal F}$,
defined in formula~(\ref{(15.6)}) has a good tail behaviour at level
$\bar A_0$.}
\medskip
Proposition~15.3 will be proved by means of a symmetrization
argument which applies Lemma~15.2. The main difficulty arises
when we want to check condition~(\ref{(15.3)}) with the
quantities we are working with in Proposition~15.3. This
difficulty can be overcome by means of Proposition~15.4, more
precisely by means of its corollary. It helps us to estimate
the conditional variances of the decoupled $U$-statistics we
have to handle in the proof of Proposition~15.3. The proof of
Propositions~15.3 and~15.4 apply similar arguments, and they
will be proved simultaneously. The following inductive procedure
will be applied in their proof. First Proposition~15.3 and then
Proposition~15.4 will be proved for $k=1$. If Propositions~15.3
and~15.4 are already proved for all $k'<k$, then they will be
proved for the parameter~$k$ with the help of these results. This
inductive step is based on the following Lemmas~16.1A and~16.1B.
\medskip\noindent
{\bf Lemma 16.1A (Randomization argument in the proof of
Proposition~15.3).} {\it Let a class of functions ${\cal F}$ satisfy
the conditions of Proposition~15.3. Then there exist some constants
$A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} n^{-k/2}\left|k!\bar I_{n,k}(f)\right|
>An^{k/2}\sigma^{k+1}\right) \nonumber \\
&&\qquad <2^{k+1}P\left(\sup_{f\in{\cal F}}
\left|k!\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right) \nonumber \\
&&\qquad\qquad +2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\label{(16.1)}
\end{eqnarray}
holds for all $A\ge A_0$.}
\medskip
It may be worth remarking that the second term at the right-hand side
of formula~(\ref{(16.1)}) yields a small contribution to
the upper bound in
this relation because of the condition $n\sigma^2\ge L\log n+\log D$.
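Indeed, writing $c=\gamma_k A^{1/(2k-1)}/k$, the condition
$n\sigma^2\ge L\log n+\log D$ implies that
$$
n^{k-1}e^{-\gamma_k A^{1/(2k-1)}n\sigma^2/k}
\le n^{k-1}n^{-cL}D^{-c},
$$
and this expression is small if the number $A_0$ is chosen so large
that $cL>k-1$ for all $A\ge A_0$.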
To formulate Lemma~16.1B first some new quantities have to be
introduced. Some of them will be used somewhat later. The quantities
$\bar I_{n,k}^V(f,y)$ introduced in the subsequent
formula~(\ref{(16.2)}) depend on the sets $V\subset\{1,\dots,k\}$,
and they are the natural modifications of the inner sum terms in
formula (\ref{(15.1)}). Such expressions are needed in the
formulation of the symmetrization result applied in the proof of
Proposition~15.4. Their randomized versions
$\bar I_{n,k}^{(V,\varepsilon)}(f,y)$, introduced in
formula~(\ref{(16.5)}), correspond to the inner sum terms in
formula~(\ref{(15.2)}). The integrals of these expressions will
be also introduced in formulas~(\ref{(16.3)}) and~(\ref{(16.6)}).
Let us consider a class ${\cal F}$ of functions
$f(x_1,\dots,x_k,y)\in {\cal F}$ on a space $(X^k\times Y, {\cal X}^k
\times {\cal Y},\mu^k\times\rho)$ which satisfies the conditions of
Proposition~15.4. Let us take $2k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$,
$\bar\xi_1^{(j)},\dots,\bar\xi_n^{(j)}$, $1\le j\le k$,
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables
$(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous random sequences.
Let us introduce the notation $\xi_l^{(j,1)}=\xi_l^{(j)}$
and $\xi_l^{(j,-1)}=\bar\xi_l^{(j)}$, $1\le l\le n$, $1\le j\le k$.
For all subsets $V\subset\{1,\dots,k\}$ of the set
$\{1,\dots,k\}$ let $|V|$ denote the cardinality of this set,
and define for all functions $f(x_1,\dots,x_k,y)\in {\cal F}$ and
sets $V\subset\{1,\dots,k\}$ the decoupled $U$-statistics
\begin{equation}
\bar I_{n,k}^V(f,y)=\frac1{k!}\sum_{\substack {(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,\xi_{l_k}
^{(k,\delta_k(V))},y\right),
\label{(16.2)}
\end{equation}
where $\delta_j(V)=\pm1$, $1\le j\le k$, is defined as
$\delta_j(V)=1$ if $j\in V$, and $\delta_j(V)=-1$ if $j\notin V$,
together with the random variables
\begin{equation}
H_{n,k}^V(f)=\int [k!\bar I_{n,k}^V(f,y)]^2\rho(\,dy), \quad f\in{\cal F}.
\label{(16.3)}
\end{equation}
We shall consider $\bar I_{n,k}^V(f,y)$ defined
in~(\ref{(16.2)}) as a random
variable with values in the space $L_2(Y,{\cal Y},\rho)$.
Put
\begin{equation}
\bar I_{n,k}(f,y)=\bar I_{n,k}^{\{1,\dots,k\}}(f,y),\quad
H_{n,k}(f)=H_{n,k}^{\{1,\dots,k\}}(f), \label{(16.4)}
\end{equation}
i.e. $\bar I_{n,k}(f,y)$ and $H_{n,k}(f)$ are the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ with
$V=\{1,\dots,k\}$, which means that these expressions are defined
with the help of the random variables $\xi^{(j)}_l=\xi_l^{(j,1)}$,
$1\le j\le k$, $1\le l\le n$.
Let us also define the `randomized version' of the random variables
$\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ as
\begin{eqnarray}
\bar I_{n,k}^{(V,\varepsilon)}(f,y)&&=\frac1{k!} \!\!
\sum_{\substack{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\!\!\!\!\!\!\!\!\!
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))},y\right), \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{if } f\in{\cal F}, \label{(16.5)}
\end{eqnarray}
and
\begin{equation}
H_{n,k}^{(V,\varepsilon)}(f)=\int
[k!\bar I_{n,k}^{(V,\varepsilon)}(f,y)]^2\rho(\,dy)
,\quad f\in{\cal F}, \label{(16.6)}
\end{equation}
where $\delta_j(V)=1$ if $j\in V$, and $\delta_j(V)=-1$ if
$j\in\{1,\dots,k\}\setminus V$.
Similarly to formula~(\ref{(16.2)}), we shall consider
$\bar I_{n,k}^{(V,\varepsilon)}(f,y)$ defined in~(\ref{(16.5)}) as a random
variable with values in the space $L_2(Y,{\cal Y},\rho)$.
Let us also introduce the random variables
\begin{equation}
\bar W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}
k!\bar I_{n,k}^{(V,\varepsilon)}(f,y)\right]^2\rho(\,dy),
\quad f\in{\cal F}. \label{(16.7)}
\end{equation}
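It may be worth looking at the simplest case $k=1$ to interpret
this notation. In this case the only subsets of $\{1\}$ are
$V=\{1\}$ and $V=\emptyset$, and formulas~(\ref{(16.5)})
and~(\ref{(16.7)}) yield that
$$
\bar W(f)=\int\left[\sum_{l=1}^n \varepsilon_l
\left(f(\xi_l^{(1,1)},y)-f(\xi_l^{(1,-1)},y)\right)\right]^2
\rho(\,dy),
$$
i.e.\ $\bar W(f)$ is the integral of the square of a randomized,
symmetrized sum.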
With the help of the above notations Lemma~16.1B can be formulated
in the following way.
\medskip\noindent
{\bf Lemma 16.1B (Randomization argument in the proof of
Proposition~15.4).} {\it Let ${\cal F}$ be a set of functions on
$(X^k\times Y,{\cal X}^k\times{\cal Y})$ which satisfies the
conditions of Proposition~15.4 with some probability measure
$\mu^k\times\rho$. Let us have $2k$ independent copies
$\xi_{1}^{(j,\pm1)},\dots,\xi_{n}^{(j,\pm1)}$, $1\le j\le k$, of a
sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le
j\le n$, which is independent also of the previously considered
random sequences.
Then there exist some constants $A_0=A_0(k)>0$ and
$\gamma=\gamma_k$ such that if the integrals $H_{n,k}(f)$,
$f\in{\cal F}$, determined by this class of functions ${\cal F}$ have
a good tail behaviour at level $T^{(2k+1)/2k}$ for some $T\ge A_0$,
(this property was defined in Chapter~15 in the definition of good
tail behaviour for a class of integrals of decoupled $U$-statistics
before the formulation of Propositions~15.3 and~15.4), then the
inequality
\begin{eqnarray}
P\left(\sup_{f\in{\cal F}} \left|H_{n,k}(f)\right|
>A^2n^{2k}\sigma^{2(k+1)}\right)
&&<2P\left(\sup_{f\in{\cal F}} \left|\bar W(f)\right|
>\frac{A^2k!}2 n^{2k}\sigma^{2(k+1)}\right)\nonumber \\
&&\qquad+2^{2k+1}n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.8)}
\end{eqnarray}
holds for all $A\ge T$ with the random variables $H_{n,k}(f)$
introduced in the second identity of relation (\ref{(16.4)}) and
with $\bar W(f)$ defined in formula~(\ref{(16.7)}).}
\medskip
A corollary of Lemma~16.1B will be formulated which can be applied
better than the original lemma. Lemma~16.1B is a little
inconvenient, because the expression at the right-hand side of
formula~(\ref{(16.8)}) contains a probability depending on
$\sup\limits_{f\in{\cal F}}|\bar W(f)|$, and $\bar W(f)$ is a rather
complicated expression. Some new formulas~(\ref{(16.9)})
and~(\ref{(16.10)}) will
be introduced which enable us to rewrite $\bar W(f)$ in a slightly
simpler form. These formulas yield a corollary of Lemma~16.1B
which is more appropriate for our purposes. To work out the details
first some diagrams will be introduced.
Let ${\cal G}={\cal G}(k)$ denote the set of all diagrams
consisting of two rows, such that both rows of these diagrams are
the set $\{1,\dots,k\}$, and these diagrams contain some edges
$\{(j_1,j_1'),\dots,(j_s,j_s')\}$, $0\le s\le k$, connecting a
point (vertex) of the first row with a point (vertex) of the
second row. The vertices $j_1,\dots,j_s$ which are end points of
some edge in the first row are all different, and the same relation
holds also for the vertices $j_1',\dots,j_s'$ in the second row.
Given a diagram $G\in{\cal G}$
let $e(G)=\{(j_1,j_1'),\dots,(j_s,j_s')\}$ denote the set of its
edges, and let $v_1(G)=\{j_1,\dots,j_s\}$ be the set of those
vertices in the first row and $v_2(G)=\{j_1',\dots,j_s'\}$ the
set of those vertices in the second row of the diagram~$G$ from
which an edge of~$G$ starts.
Given a diagram $G\in {\cal G}$, two sets
$V_1,V_2\subset\{1,\dots,k\}$, a function $f$ defined on the
space $(X^k\times Y,{\cal X}^k\times{\cal Y})$ and a probability
measure $\rho$ on $(Y,{\cal Y})$ we define the following
random variables $H_{n,k}(f|G,V_1,V_2)$ with the help of
the random variables $\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$,
$\xi_{1}^{(j,-1)},\dots,\xi_{n}^{(j,-1)}$, $1\le j\le k$, and
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$ taking part
in the definition of the random
variables $\bar W(f)$:
\begin{eqnarray}
&& H_{n,k}(f|G,V_1,V_2) \nonumber \\
&&\qquad =\sum_{\substack
{(l_1,\dots,l_k,\,l'_1,\dots,l'_k)\colon\\
1\le l_j\le n,\, l_j\neq l_{j'}
\textrm{ if }j\neq j',\,1\le j,j'\le k,\\
1\le l'_j\le n,\, l'_j\neq l'_{j'}\textrm { if }
j\neq j',\,1\le j,j'\le
k,\\ l_j=l'_{j'} \textrm { if } (j,j')\in e(G),\; l_j\neq l'_{j'}
\textrm { if } (j,j')\notin e(G)}}
\!\!\!\!\!\!\!\!\!\!\!\! \prod_{j\in\{1,\dots,k\}
\setminus v_1(G)} \!\!\!\!
\varepsilon_{l_j} \prod_{j\in\{1,\dots,k\}
\setminus v_2(G)} \!\!\!\! \varepsilon_{l'_j} \nonumber \\
&&\qquad\qquad \int f(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y) \nonumber \\
&& \qquad\qquad\qquad f(\xi_{l'_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l'_k}^{(k,\delta_k(V_2))},y)
\rho(\,dy), \label{(16.9)}
\end{eqnarray}
where $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$. (Let us observe that if the
graph $G$ contains $s$ edges, then the product of the
$\varepsilon$-s in (\ref{(16.9)})
contains $2(k-s)$ terms, and the number of terms in the
sum~(\ref{(16.9)}) is
less than $n^{2k-s}$.) As the Corollary of Lemma~16.1B will indicate,
in the proof of Proposition~15.4 we shall need a good estimate on the
tail distribution of the random variables $H_{n,k}(f|G,V_1,V_2)$
for all $f\in{\cal F}$ and $G\in{\cal G}$, $V_1,V_2\subset\{1,\dots,k\}$.
Such an estimate can be obtained by means of Theorem 13.3, the
multivariate version of Hoeffding's inequality. But the estimate we
get in such a way will be rewritten in a form more appropriate for our
inductive procedure. This will be done in the next chapter.
The identity
\begin{equation}
\bar W(f)=\sum_{G\in {\cal G},\, V_1,V_2\subset\{1,\dots,k\}}
(-1)^{|V_1|+|V_2|} H_{n,k}(f|G,V_1,V_2) \label{(16.10)}
\end{equation}
will be proved.
To prove this identity let us write first
$$
\bar W(f)=\sum_{V_1,V_2\subset \{1,\dots,k\}} (-1)^{|V_1|+|V_2|}
\int k!\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
k!\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy).
$$
Let us express the product
$k!\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)\,k!\bar I_{n,k}^{(V_2,
\varepsilon)}(f,y)$ by means of formula (\ref{(16.5)}), and
let us rewrite it as a sum of products of the form
$$
\prod\limits_{j=1}^k\varepsilon_{l_j}f(\cdots)
\prod_{j=1}^k\varepsilon_{l_j'}f(\cdots),
$$
and let us define the following partition of the terms in this
sum. The elements of this partition
are indexed by the diagrams $G\in {\cal G}$, and if we
take a diagram $G\in{\cal G}$ with the set of edges $e(G)=
\{(j_1,j_1'),\dots,(j_s,j_s')\}$, then the term of this sum
determined by the indices $l_1,\dots,l_k,l'_1,\dots,l'_k$
belongs to the element of the partition indexed by this diagram
$G$ if and only if $l_{j_u}=l_{j_u'}'$ for all $1\le u\le s$, and
no other pair among the indices $l_1,\dots,l_k,l_1',\dots,l'_k$
coincides. Since $\varepsilon_{l_{j_u}}\varepsilon_{l'_{j_u'}}=1$
for all $1\le u\le s$ and the set of indices of the remaining
random variables $\varepsilon_{l_j}$ is
$\{l_j\colon\,j\in\{1,\dots,k\}\setminus v_1(G)\}$,
the set of indices of the remaining random variables
$\varepsilon_{l_j'}$
is $\{l'_j\colon\,j\in\{1,\dots,k\}\setminus v_2(G)\}$, we get
by integrating the product
$k!\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)
k!\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)$
with respect to the measure $\rho$ that
$$
\int k!\bar I_{n,k}^{(V_1,\varepsilon)}(f,y)\,
k!\bar I_{n,k}^{(V_2,\varepsilon)}(f,y)\rho(\,dy)
=\sum_{G\in {\cal G}} H_{n,k}(f|G,V_1,V_2)
$$
for all $V_1,V_2\subset\{1,\dots,k\}$. The last two identities imply
formula~(\ref{(16.10)}).
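It may be worth checking the number of terms in the sum at the
right-hand side of~(\ref{(16.10)}). A diagram $G\in{\cal G}$ with $s$
edges can be chosen in $\binom ks^2s!$ ways, hence
$$
|{\cal G}|=\sum_{s=0}^k\binom ks^2s!\le k!\sum_{s=0}^k\binom ks^2
=k!\binom{2k}k\le 2^{2k}k!,
$$
and together with the $2^k\cdot 2^k=2^{2k}$ possible choices of the
pair $(V_1,V_2)$ this yields less than $2^{4k}k!$ terms.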
Since the number of terms in the sum of formula (\ref{(16.10)})
is less than
$2^{4k}k!$, this relation implies that Lemma~16.1B has the following
corollary:
\medskip\noindent
{\bf Corollary of Lemma 16.1B (A simplified version of the
randomization argument of Lemma~16.1B).} {\it Let a set of
functions ${\cal F}$ satisfy the conditions of Proposition~15.4. Then
there exist some constants $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that if the integrals $H_{n,k}(f)$, $f\in{\cal F}$, determined
by this class of functions ${\cal F}$ have a good tail behaviour at
level $T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} |H_{n,k}(f)|>A^2n^{2k}
\sigma^{2(k+1)}\right) \nonumber \\
&&\qquad\le 2\sum_{G\in {\cal G},\, V_1,V_2\subset\{1,\dots,k\}}
P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}} \right) \nonumber \\
&&\qquad\qquad\qquad\qquad
+2^{2k+1}n^{k-1} e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.11)}
\end{eqnarray}
holds for all $A\ge T$ with the random variables $H_{n,k}(f)$
and $H_{n,k}(f|G,V_1,V_2)$ defined in formulas (\ref{(16.4)})
and (\ref{(16.9)}).}
\medskip\noindent
In the proof of Lemmas 16.1A and 16.1B the results of the
following Lemmas~16.2A and~16.2B will be applied.
\medskip\noindent
{\bf Lemma 16.2A.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)} \quad \textrm{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le
l\le n$, which is also independent of the previous sequences.
Let ${\cal F}$ be a class of functions which satisfies the
conditions of Proposition 15.3. Introduce with the help of the above
random variables for all sets $V\subset\{1,\dots,k\}$ and functions
$f\in {\cal F}$ the decoupled $U$-statistic
\begin{equation}
\bar I_{n,k}^V(f)=\frac1{k!}\sum_{\substack {(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
f\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))}\right) \label{(16.12)}
\end{equation}
and its `randomized version'
\begin{eqnarray}
\bar I_{n,k}^{(V,\varepsilon)}(f)&&=\frac1{k!}
\sum_{\substack{(l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots, k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f
\left(\xi_{l_1}^{(1,\delta_1(V))},\dots,
\xi_{l_k}^{(k,\delta_k(V))}\right), \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
f\in{\cal F}, \label{($16.12'$)}
\end{eqnarray}
where $\delta_j(V)=\pm1$, and we have $\delta_j(V)=1$ if $j\in V$,
and $\delta_j(V)=-1$ if $j\in\{1,\dots,k\}\setminus V$.
Then the sets of random variables
\begin{equation}
S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}k!\bar I_{n,k}^V(f),
\quad f\in{\cal F}, \label{(16.13)}
\end{equation}
and
\begin{equation}
\bar S(f)=\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}
k!\bar I_{n,k}^{(V,\varepsilon)}(f), \quad f\in{\cal F},
\label{($16.13'$)}
\end{equation}
have the same joint distribution.}
\medskip\noindent
{\bf Lemma 16.2B.} {\it Let us take $2k$ independent copies
$$
\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}\quad \textrm{and} \quad
\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}, \quad 1\le j\le k,
$$
of a sequence of independent, $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ together with a sequence of independent random
variables $(\varepsilon_1,\dots,\varepsilon_n)$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is independent also of the previous sequences.
Let us consider a class ${\cal F}$ of functions
$f(x_1,\dots,x_k,y)\in {\cal F}$ on a space
$(X^k\times Y, {\cal X}^k\times {\cal Y},\mu^k\times\rho)$ which
satisfies the conditions of
Proposition~15.4. For all functions $f\in {\cal F}$
and $V\subset\{1,\dots,k\}$ consider the decoupled $U$-statistics
$\bar I_{n,k}^V(f,y)$ defined by formula (\ref{(16.2)}) with
the help of the random variables
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, and define with their help
the random variables
\begin{equation}
W(f)=\int\left[\sum_{V\subset \{1,\dots,k\}} (-1)^{(k-|V|)}
k!\bar I_{n,k}^V(f,y)\right]^2\rho(\,dy), \quad f\in{\cal F}.
\label{(16.14)}
\end{equation}
Then the random vectors $\{W(f)\colon\, f\in {\cal F}\}$ defined
in~(\ref{(16.14)}) and $\{\bar W(f)\colon\, f\in {\cal F}\}$ defined
in~(\ref{(16.7)}) have the same distribution.}
\medskip\noindent
{\it Proof of Lemmas 16.2A and 16.2B.} Lemma~16.2A actually agrees
with the already proved Lemma~15.1, only the notation is
different. The proof of Lemma~16.2B is also very similar to that
of Lemma~15.1. It can be shown that even the following stronger
statement holds. For any $\pm1$ sequence $u=(u_1,\dots,u_n)$ of
length~$n$ the conditional distribution of the random field
$\bar W(f)$, $f\in{\cal F}$, under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u=(u_1,\dots,u_n)$ agrees with
the distribution of the random field $W(f)$, $f\in{\cal F}$.
To see this relation let us first observe that the conditional
distribution of the field $\bar W(f)$ under this condition agrees
with the distribution of the random field we get by replacing the
random variables $\varepsilon_l$ by $u_l$ for all $1\le l\le n$ in
formulas~(\ref{(16.5)}), (\ref{(16.6)}) and~(\ref{(16.7)}).
Beside this, define the vector
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
by the formula
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,-1)}_l,\xi^{(j,1)}_l)$
for those indices $(j,l)$ for which $u_l=-1$, and
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)=(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$
for those indices $(j,l)$ for which $u_l=1$ (in both cases
independently of the value of the parameter $j$).
Then the joint distribution of the vectors
$(\xi(u)^{(j,1)}_l,\xi(u)^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
and $(\xi^{(j,1)}_l,\xi^{(j,-1)}_l)$, $1\le j\le k$, $1\le l\le n$,
agree. Hence the joint distribution of the random vectors
$\bar I_{n,k}^V(f,y)$, $f\in{\cal F}$, $V\subset \{1,\dots,k\}$ defined
in~(\ref{(16.2)}) and of the random vectors $W(f)$,
$f\in{\cal F}$, defined
in~(\ref{(16.14)}) do not change if we replace in their
definition the random
variables $\xi^{(j,1)}_l$ and $\xi^{(j,-1)}_l$ by $\xi(u)^{(j,1)}_l$
and $\xi(u)^{(j,-1)}_l$. But the set of random variables $W(f)$,
$f\in{\cal F}$, obtained in this way agrees with the set of random
variables we introduced to get a set of random variables with the
same distribution as the conditional distribution of $\bar W(f)$,
$f\in {\cal F}$ under the condition
$(\varepsilon_1,\dots,\varepsilon_n)=u$. (These
random variables are defined as the integral of the square of the same sum,
only the terms of this sum are listed in a different order in the
two cases.) These facts imply Lemma~16.2B.
\hfill$\qed$
\medskip
In the next step we prove the following Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3A.} {\it Let us consider a class of functions
${\cal F}$ satisfying the conditions of Proposition 15.3 with
parameter~$k$ together with $2k$ independent copies
$\xi^{(j,1)}_1$,\dots, $\xi^{(j,1)}_n$ and
$\xi^{(j,-1)}_1,\dots,\xi^{(j,-1)}_n$, $1\le j\le k$, of a sequence
of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables $\bar I_{n,k}^V(f)$,
defined for $f\in{\cal F}$ and $V\subset\{1,\dots,k\}$ in
formula~(\ref{(16.12)}). Let
$$
{\cal B}={\cal B}(\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)},\; 1\le j\le k)
$$
denote the $\sigma$-algebra generated by the random variables
$\xi_{1}^{(j,1)},\dots,\xi_{n}^{(j,1)}$, $1\le j\le k$, i.e.\ by
the random variables with upper indices of the form $(j,1)$,
$1\le j\le k$. There exists a number $A_0=A_0(k)>0$ such that
for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, the
inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}
\left.E\left([k!\bar I_{n,k}^V(f)]^2\right|{\cal B}\right)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right)<
n^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\label{(16.15)}
\end{equation}
holds with a sufficiently small $\gamma_k>0$ if $A\ge A_0$.}
\medskip\noindent
{\it Proof of Lemma 16.3A.}\/ Let us first consider the case
$V=\emptyset$. In this case the estimate $\left.E\left((k!\bar
I_{n,k}^\emptyset(f))^2\right|{\cal B}\right)
=E\left((k!\bar I_{n,k}^\emptyset(f))^2\right)
\le k!n^k\sigma^2\le 2^kk!n^{2k}\sigma^{2k+2}$ holds for all
$f\in{\cal F}$. In the above calculation it was exploited that the
functions $f\in{\cal F}$ are canonical, which implies certain
orthogonalities, and beside this the inequality $n\sigma^2\ge\frac12$
holds, because of the relation $n\sigma^2\ge L\log n+\log D$.
The above relations imply that for $V=\emptyset$ the probability at
the left-hand side of (\ref{(16.15)}) equals zero if the
number $A_0$ is chosen sufficiently large. Hence
inequality~(\ref{(16.15)}) holds in this case.
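The orthogonality argument applied here can be spelled out in the
following way:
$$
E\left((k!\bar I_{n,k}^\emptyset(f))^2\right)
=\sum_{\substack{(l_1,\dots,l_k)\colon\, 1\le l_j\le n,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
Ef^2\left(\xi_{l_1}^{(1,-1)},\dots,\xi_{l_k}^{(k,-1)}\right)
\le n^k\sigma^2\le k!\,n^k\sigma^2,
$$
since for two different index sequences there is a coordinate~$j$
with $l_j\neq l'_j$, and the expectation of the corresponding mixed
term vanishes if we integrate with respect to the random variable
$\xi^{(j,-1)}_{l_j}$, because the function $f$ is canonical. Here we
exploited the bound $Ef^2(\cdot)\le\sigma^2$, which is contained in
the conditions of Proposition~15.3.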
To avoid some complications in the notation let us first restrict our
attention to sets of the form $V=\{1,\dots,u\}$ with some
$1\le u\le k-1$. For all sequences $(l_{u+1},\dots,l_k)$,
$1\le l_j\le n$, $j=u+1,\dots,k$, with $l_j\neq l_{j'}$ if $j\neq j'$,
define the decoupled $U$-statistics
\begin{equation}
\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)=\frac1{k!}
\sum_{\substack{(l_1,\dots,l_u)\colon\, 1\le l_j\le n,\;j=1,\dots,u,\\
l_j\neq l_{j'} \textrm{ if } j\neq j',\;
l_j\notin\{l_{u+1},\dots,l_k\}}}
f\left(\xi_{l_1}^{(1,1)},\dots,\xi_{l_u}^{(u,1)},
\xi_{l_{u+1}}^{(u+1,-1)},\dots,\xi_{l_k}^{(k,-1)}\right),
\label{(16.16)}
\end{equation}
i.e.\ the sum of those terms of $\bar I_{n,k}^V(f)$ in
formula~(\ref{(16.12)}) whose last $k-u$ indices take the prescribed
values $l_{u+1},\dots,l_k$. Then
$\bar I_{n,k}^V(f)=\sum\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)$, where
the sum is taken for all such sequences $(l_{u+1},\dots,l_k)$. Since
the functions $f\in{\cal F}$ are canonical, the terms of this sum
corresponding to different sequences $(l_{u+1},\dots,l_k)$ are
orthogonal with respect to the conditional expectation given the
$\sigma$-algebra ${\cal B}$. As the number of these sequences is less
than $n^{k-u}$, this implies the relation
\begin{eqnarray}
&&\left\{\omega\colon\, \sup_{f\in{\cal F}}
\left. E\left([k!\bar I_{n,k}^V(f)]^2\right|{\cal B}\right)(\omega)
> 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}
\right\} \label{(16.17)} \\
&&\qquad \subset \bigcup_{\substack{(l_{u+1},\dots,l_k)\colon\\
1\le l_j\le n,\; j=u+1,\dots,k.\\
l_j\neq l_{j'} \textrm { if } j\neq j'}} \nonumber \\
&&\qquad \qquad \left\{\omega\colon\, \sup_{f\in{\cal F}}
\left. E\left([k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)]^2\right|
{\cal B}\right)(\omega)
>\frac{A^2n^{2k}\sigma^{2k+2}}{2^{(3k+3)}n^{k-u}} \right\}.
\nonumber
\end{eqnarray}
The probability of the events in the union at the right-hand side
of~(\ref{(16.17)}) can be estimated with the help of the
Corollary of Proposition~15.4 with parameter $u<k$. We shall show that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}
\left.E\left([k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)]^2\right|
{\cal B}\right)
>\frac {A^2n^{k+u}\sigma^{2k+2}} {2^{(3k+3)}}\right)
\nonumber \\
&&\qquad \le e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2} \label{(16.18)}
\end{eqnarray}
with an appropriate $\gamma_k>0$ for all sequences
$(l_{u+1},\dots,l_k)$, $1\le l_j\le n$, $u+1\le j\le k$, such
that $l_j\neq l_{j'}$ if $j\neq j'$.
Let us show that if a class of functions ${\cal F}$
satisfies the conditions of Proposition~15.3, then it also
satisfies relation~(\ref{(16.18)}).
For this goal introduce the space $(Y,{\cal Y},\rho)=(X^{k-u},
{\cal X}^{k-u},\mu^{k-u})$, the $(k-u)$-fold power of the measure
space $(X, {\cal X},\mu)$, and for the sake of simpler notation
write $y=(x_{u+1},\dots,x_k)$ for a point $y\in Y$. Let us also
introduce the class $\bar{\cal F}$ of functions in the
space $(X^u\times Y,{\cal X}^u\times{\cal Y},\mu^u\times\rho)$
consisting of the functions $\bar f$ of the form
$\bar f(x_1,\dots,x_u,y)=f(x_1,\dots,x_k)$ with
$y=(x_{u+1},\dots,x_k)$ and some function
$f(x_1,\dots,x_k)\in{\cal F}$.
If the class of functions ${\cal F}$ satisfies the conditions of
Proposition~15.3 (with parameter~$k$), then the class of functions
$\bar{\cal F}$ satisfies the conditions of Proposition~15.4 with
parameter $u<k$ instead of~$k$. Beside this, the identity
\begin{equation}
\left.E\left([k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)]^2\right|
{\cal B}\right)
=\int [u!\bar I^{l(u)}_{n+u-k,u}(\bar f,y)]^2\rho(\,dy)
=H^{l(u)}_{n+u-k,u}(\bar f)
\label{(16.19)}
\end{equation}
holds, where $\bar I^{l(u)}_{n+u-k,u}(\bar f,y)$,
$l(u)=(l_{u+1},\dots,l_k)$, denotes the decoupled $U$-statistic of
order~$u$ defined with the help of the function
$\bar f\in\bar{\cal F}$ and the random variables $\xi^{(j,1)}_l$,
$1\le j\le u$, whose indices~$l$ take their values in the set
$\{1,\dots,n\}\setminus\{l_{u+1},\dots,l_k\}$ of cardinality $n+u-k$,
and $H^{l(u)}_{n+u-k,u}(\bar f)$ is defined with its help similarly
to formula~(\ref{(16.3)}). Since Propositions~15.3 and~15.4 are
already proved for the parameter $u<k$ by our inductive hypothesis,
the Corollary of Proposition~15.4 can be applied for the class of
functions $\bar{\cal F}$ with some constants $A_0(u)>0$ and
$\gamma_k>0$. Then we have
\begin{eqnarray}
&& P\biggl(\sup_{\bar f\in\bar{\cal F}}
E([k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k)]^2|{\cal B})
\ge \gamma_k^{(4u+2)}
A^2 (n+u-k)^{2u}\sigma^{2u+2}\biggr) \nonumber \\
&& \quad=P\left(\sup_{\bar f\in\bar{\cal F}} (n+u-k)^{-u}
H^{l(u)}_{n+u-k,u}(\bar f)\ge \gamma_k^{(4u+2)}
A^2(n+u-k)^u\sigma^{2u+2}\right) \nonumber \\
&&\qquad\le e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2}
\quad \textrm{for } A>A_0(u)\gamma_k^{-(4u+2)}.
\label{(16.20)}
\end{eqnarray}
It is not difficult to derive formula~(\ref{(16.18)})
from relation~(\ref{(16.20)}).
It is enough to check that the level
$\frac{A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ in the
probability at the left-hand side of~(\ref{(16.18)}) can be
replaced by $\gamma_k^{(4u+2)} A^2(n+u-k)^{2u}\sigma^{2u+2}$
if $\gamma_k>0$ is chosen sufficiently small. This statement
holds, since
$\gamma_k^{(4u+2)}
A^2(n+u-k)^{2u}\sigma^{2u+2}<
\gamma_k^{(4u+2)}A^2n^{2u}\sigma^{2u+2}
\le\frac {A^2n^{k+u}\sigma^{2k+2}}{2^{(3k+3)}}$ if the constant
$\gamma_k>0$ is chosen sufficiently small, since
$n\sigma^2\ge\frac12$ by the condition $n\sigma^2\ge L\log n+\log D$
of Proposition~15.3.
Relations (\ref{(16.17)}) and (\ref{(16.18)}) imply that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}\left. E\left([k!\bar I_{n,k}^V(f)]^2\right|
{\cal B}\right)(\omega)
>2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2} \right) \\
&& \qquad \le n^{k-u}e^{-\gamma_kA^{1/(2u+1)}(n+u-k)\sigma^2}.
\end{eqnarray*}
Since $e^{-\gamma_k A^{1/(2u+1)}(n+u-k)\sigma^2}
\le e^{-\gamma_k A^{1/(2k-1)}n\sigma^2/k}$
if $u\le k-1$, $n\ge k$ and $A>A_0$ with a sufficiently large
number~$A_0$, inequality (\ref{(16.15)}) holds for all
sets $V$ of the form $V=\{1,\dots,u\}$, $1\le u\le k-1$. The case of
a general set $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, can be
handled in the same way, only the notation becomes more complicated.
Lemma~16.3A is proved. \hfill$\qed$
\medskip\noindent
{\it Proof of Lemma 16.1A.}\/ First we show that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} n^{-k/2}\left|k!\bar I_{n,k}(f)\right|
>An^{k/2}\sigma^{k+1}\right)
\label{(16.21)} \\
&&\qquad <2P\left(\sup_{f\in{\cal F}} |S(f)|
>\frac A2n^k\sigma^{k+1}\right)
+2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}
\nonumber
\end{eqnarray}
with the function $S(f)$ defined in (\ref{(16.13)}). To prove
relation (\ref{(16.21)}) introduce the random variables
$Z(f)=k!\bar I_{n,k}^{\{1,\dots,k\}}(f)$ and
$$
\bar Z(f)=-\sum_{V\subset \{1,\dots,k\},\,V\neq\{1,\dots,k\}}
(-1)^{k-|V|}k!\bar I_{n,k}^V(f)
$$
for all $f\in{\cal F}$, the
$\sigma$-algebra ${\cal B}$ considered in Lemma~16.3A and the set
$$
B=\bigcap_{\substack{V\subset\{1,\dots,k\}\\
V\neq\{1,\dots,k\}}} \left\{\omega\colon\,
\sup_{f\in{\cal F}}\left.E\left([k!\bar I_{n,k}^V(f)]^2\right|
{\cal B}\right)(\omega) \le 2^{-(3k+3)}A^2n^{2k}\sigma^{2k+2}\right\}.
$$
Observe that $S(f)=Z(f)-\bar Z(f)$, $f\in{\cal F}$, $B\in{\cal B}$,
and by Lemma~16.3A the inequality
$1-P(B)\le2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)}
n\sigma^2/k}$ holds. To prove relation~(\ref{(16.21)}) apply
Lemma~15.2 with the above introduced random variables $Z(f)$ and
$\bar Z(f)$, $f\in{\cal F}$, (both here and in the subsequent
proof of Lemma~16.1B we work with random variables $Z(\cdot)$
and $\bar Z(\cdot)$ indexed by the countable set of functions
$f\in{\cal F}$, hence the functions $f\in{\cal F}$ play the role
of the parameters~$p$ when Lemma~15.2 is applied) random set $B$
and $\alpha=\frac A2n^k\sigma^{k+1}$, $u=\frac A2n^k\sigma^{k+1}$.
(At the left-hand side of~(\ref{(16.21)}) we can replace
$k!\bar I_{n,k}(f)$ with $Z(f)$, $f\in{\cal F}$, because they
have the same joint distribution.) It is enough to show that
\begin{equation}
P\left(|\bar Z(f)|
>\frac A2n^k\sigma^{k+1}|{\cal B}\right)(\omega)\le\frac12
\quad \textrm{ for all }f\in{\cal F} \quad
\textrm {if } \omega\in B.
\label{(16.22)}
\end{equation}
But
\begin{eqnarray*}
&&P\left(k!|\bar I_{n,k}^{V}(f)|>2^{-(k+1)} An^k\sigma^{k+1}|
{\cal B}\right)(\omega) \\
&& \qquad \le\frac{2^{2(k+1)}E([k!\bar I^{V}_{n,k}(f)]^2|{\cal B})(\omega)}
{A^2n^{2k}\sigma^{2(k+1)}}\le 2^{-(k+1)}
\end{eqnarray*}
for all functions $f\in {\cal F}$ and sets
$V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$, if $\omega\in B$
by the `conditional Chebyshev inequality', hence
relations~(\ref{(16.22)}) and~(\ref{(16.21)}) hold.
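Indeed, the probability in~(\ref{(16.22)}) can be bounded by summing
up the above estimate for the $2^k-1$ sets $V\subset\{1,\dots,k\}$,
$V\neq\{1,\dots,k\}$: the event
$\{|\bar Z(f)|>\frac A2n^k\sigma^{k+1}\}$ implies that
$k!|\bar I_{n,k}^V(f)|>2^{-(k+1)}An^k\sigma^{k+1}$ for at least one
of these sets~$V$, since $(2^k-1)2^{-(k+1)}\le\frac12$, and hence
$$
P\left(|\bar Z(f)|>\frac A2n^k\sigma^{k+1}\Big|{\cal B}\right)(\omega)
\le (2^k-1)2^{-(k+1)}<\frac12
\quad\textrm{if }\omega\in B.
$$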
Lemma 16.1A follows from relation~(\ref{(16.21)}), Lemma~16.2A
and the observation that the random variables
$\bar I_{n,k}^{(V,\varepsilon)}(f)$,
$f\in{\cal F}$, defined in~(\ref{($16.12'$)}) have the same
distribution for
all $V\subset\{1,\dots,k\}$ as the random variables
$\bar I_{n,k}^{\varepsilon}(f)$, defined in
formula~(\ref{(14.12)}). Hence Lemma~16.2A and the
definition~(\ref{($16.13'$)}) of the random variables
$\bar S(f)$, $f\in{\cal F}$, imply the inequality
\begin{eqnarray*}
P\left(\sup_{f\in{\cal F}} |S(f)|>\frac A2n^k\sigma^{k+1}\right)
&=&P\left(\sup_{f\in{\cal F}} |\bar S(f)|
>\frac A2n^k\sigma^{k+1}\right)\\
&\le& 2^kP\left(\sup_{f\in{\cal F}}
\left|k!\bar I_{n,k}^\varepsilon(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right).
\end{eqnarray*}
Lemma 16.1A is proved.
\hfill$\qed$
\medskip
Lemma~16.1B will be proved with the help of the following
Lemma~16.3B, which is a version of Lemma~16.3A.
\medskip\noindent
{\bf Lemma 16.3B.} {\it Let us consider a class of functions
${\cal F}$ satisfying the conditions of Proposition~15.4
together with $2k$ independent copies
$$
\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n,\textrm{ and }
\; \xi^{(j,-1)}_1,\dots,\xi^{(j,-1)}_n,\;\; 1\le j\le k,
$$
of a sequence of independent, $\mu$-distributed random variables
$\xi_1,\dots,\xi_n$. Take the random variables
$\bar I_{n,k}^V(f,y)$ and $H^V_{n,k}(f)$, $f\in{\cal F}$,
$V\subset\{1,\dots,k\}$, defined in formulas~(\ref{(16.2)})
and~(\ref{(16.3)}) with
the help of these quantities. Let
$$
{\cal B}={\cal B}(\xi_1^{(j,1)},\dots, \xi_n^{(j,1)},\; 1\le j\le k)
$$
denote the $\sigma$-algebra generated by the random variables
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$, $1\le j\le k$, i.e. by those
random variables which appear in the definition of the random
variables $\bar I_{n,k}^V(f,y)$ and $H_{n,k}^V(f)$ introduced in
formulas (\ref{(16.2)}) and~(\ref{(16.3)}), and have
second argument~1 in their upper index.
\begin{enumerate}
\item
There exist some numbers $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$
such that for all $V\subset\{1,\dots,k\}$, $V\neq\{1,\dots,k\}$,
the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}} E(H^{V}_{n,k}(f)|{\cal B})
>\frac{2^{-(4k+4)}}{(k!)^2}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right)
<n^{k-1}e^{-\gamma_k A^{1/2k} n\sigma^2/k}
\label{(16.23)}
\end{equation}
holds if $A\ge A_0$.
\item
There exist some constants $A_0=A_0(k)>0$ and $\gamma=\gamma_k>0$ such
that if the integrals $H_{n,k}(f)$, $f\in{\cal F}$, determined by
this class of functions ${\cal F}$ have a good tail behaviour at
level $T^{(2k+1)/2k}$ for some $T\ge A_0$, then the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})
>\frac{2^{-(2k+2)}}{k!}A^2n^{2k}\sigma^{2k+2}\right)
<2n^{k-1}e^{-\gamma_k A^{1/2k}n\sigma^2/k}
\label{(16.25)}
\end{equation}
holds for any pairs of subsets $V_1,V_2\subset\{1,\dots,k\}$ with
the property that at least one of them does not equal the set
$\{1,\dots,k\}$ if the number~$A$ satisfies the condition $A>T$.
\end{enumerate}
}
\medskip\noindent
{\it Proof of Lemma 16.3B.}\/ Part a) of Lemma 16.3B can be proved
in almost the same way as Lemma 16.3A. Hence I only briefly
explain the main step of the proof. In the case $V=\emptyset$ the
identity $E(H^{V}_{n,k}(f)|{\cal B})=E(H^{V}_{n,k}(f))$ holds, hence it
is enough to show that $E(H^{V}_{n,k}(f))\le k!n^k\sigma^2
\le2^k k!n^{2k}\sigma^{2k+2}$ for all $f\in{\cal F}$ under the
conditions of Proposition~15.4. (This relation holds, because
the functions of the class ${\cal F}$ are canonical.) The case of a
general set $V$, $V\neq\emptyset$ and $V\neq\{1,\dots,k\}$, can be
reduced to the case $V=\{1,\dots,u\}$ with some $1\le u\le k-1$. In
this case it is enough to show that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}
\left.E\left(\int[k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k,y)]^2
\rho(\,dy)\right|{\cal B}\right)
>\frac {A^{(2k-1)/k}n^{k+u}\sigma^{2k+2}}{2^{(4k+4)}(k!)^2}\right)\\
&&\qquad \le e^{-\gamma_kA^{(2k-1)/(2k(2u+1))}(n+u-k)\sigma^2}
\end{eqnarray*}
with a sufficiently small $\gamma_k>0$. This inequality can be
proved, similarly to relation~(\ref{(16.18)}) in the proof of
Lemma~16.3A,
with the help of the Corollary of Proposition~15.4. Only here we
have to work in the space $(X^u\times \bar Y,{\cal X}^u
\times\bar{\cal Y}, \mu^u\times\bar \rho)$ where $\bar
Y=X^{k-u}\times Y$, $\bar{\cal Y}={\cal X}^{k-u}\times{\cal Y}$,
$\bar\rho=\mu^{k-u}\times\rho$ with the class of functions
$\bar{\cal F}$ consisting of the functions~$\bar f$
defined by the formula
$\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)$
with some $f(x_1,\dots,x_k,y)\in {\cal F}$, where
$\bar y=(x_{u+1},\dots,x_k,y)$. Here we apply the following
version of formula~(\ref{(16.19)}):
$$
E\left([k!\bar I_{n,k}^V(f,l_{u+1},\dots,l_k,y)]^2|{\cal B}\right)
=\int [u!\bar I^{l(u)}_{n+u-k,u}(\bar f,\bar y)]^2\bar\rho(\,d\bar y)
=H^{l(u)}_{n+u-k,u}(\bar f)
$$
with the function $\bar f\in\bar{\cal F}$ for which the identity
$$
\bar f(x_1,\dots,x_u,\bar y)=f(x_1,\dots,x_k,y)
$$
holds with $\bar y=(x_{u+1},\dots,x_k,y)$, and we define the random
variables $\bar I^{l(u)}_{n+u-k,u}(\bar f,\bar y)$ and
$H^{l(u)}_{n+u-k,u}(\bar f)$ similarly to the corresponding terms after
formula~(\ref{(16.19)}),
only $y$ is replaced by $\bar y$, the measure $\rho$ by $\bar\rho$,
and the presently defined functions $\bar f\in\bar{\cal F}$ are
considered. I omit the details.
\medskip\noindent
Part b) of Lemma 16.3B will be proved with the help of Part a) and
the inequality
\begin{equation}
\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B}) \le
\left(\sup_{f\in{\cal F}} E(H^{V_1}_{n,k}(f)|{\cal B})\right)^{1/2}
\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})\right)^{1/2}.
\label{(16.26)}
\end{equation}
To prove inequality~(\ref{(16.26)}) observe that the random variables
$H^{(V_1,V_2)}_{n,k}(f)$, $H^{V_1}_{n,k}(f)$ and $H^{V_2}_{n,k}(f)$
can be expressed as functions of the random variables $\xi_l^{(j,1)}$,
$\xi^{(j,-1)}_l$, $1\le j\le k$, $1\le l\le n$ which are independent
of each other, and the random variables $\xi_l^{(j,1)}$ are
${\cal B}$ measurable, while the random variables $\xi_l^{(j,-1)}$
are independent of this $\sigma$-algebra. Hence we can calculate
the conditional expectations
$E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})$, $E(H^{V_1}_{n,k}(f)|{\cal B})$
and $E(H^{V_2}_{n,k}(f)|{\cal B})$ by putting the value of the
random variables $\xi^{(j,1)}(\omega)$ in the appropriate coordinate
of the functions expressing these random variables and integrating
by the remaining coordinates with respect to the distribution of the
random variables $\xi^{(j,-1)}_l$. Writing the above conditional
expectations in this form and applying the Schwarz inequality to
them we get the inequality
$$
E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B}) \le
\left(E(H^{V_1}_{n,k}(f)|{\cal B})\right)^{1/2}
\left(E(H^{V_2}_{n,k}(f)|{\cal B})\right)^{1/2} \quad\textrm{for all }
f\in{\cal F}.
$$
It is not difficult to deduce relation~(\ref{(16.26)}) from this
inequality by showing that it remains valid if the suprema over
$f\in{\cal F}$ are inserted in the way done in~(\ref{(16.26)}).
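\medskip\noindent
{\it Remark.}\/ Indeed, the last inequality yields for every fixed
$f\in{\cal F}$ that
$$
E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})\le
\left(\sup_{g\in{\cal F}} E(H^{V_1}_{n,k}(g)|{\cal B})\right)^{1/2}
\left(\sup_{g\in{\cal F}} E(H^{V_2}_{n,k}(g)|{\cal B})\right)^{1/2},
$$
and by taking the supremum in $f\in{\cal F}$ at the left-hand side
we obtain relation~(\ref{(16.26)}).
\medskip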
In the proof of Part~b) of Lemma~16.3B we may assume that
$V_1\neq\{1,\dots,k\}$. Inequality~(\ref{(16.26)}) implies that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})
>\frac{2^{-(2k+2)}}{k!}A^2n^{2k}\sigma^{2k+2}\right)\\
&&\qquad \le P\left(\sup_{f\in{\cal F}} E(H^{V_1}_{n,k}(f)|{\cal B})
>\frac{2^{-(4k+4)}}{(k!)^2}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}\right) \\
&&\qquad\qquad+P\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\end{eqnarray*}
Hence if we know that also the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}} E(H^{V_2}_{n,k}(f)|{\cal B})
>A^{(2k+1)/k} n^{2k}\sigma^{2k+2}\right)
\le n^{k-1} e^{-\gamma_k A^{1/2k}n\sigma^2/k} \label{(16.27)}
\end{equation}
holds, then we can deduce relation~(\ref{(16.25)}) from the
estimate~(\ref{(16.23)}) and~(\ref{(16.27)}).
Relation~(\ref{(16.27)}) follows from Part~a) of Lemma~16.3B if
$V_2\neq\{1,\dots,k\}$ and $A\ge1$, since in this case the level
$A^{(2k+1)/k} n^{2k}\sigma^{2k+2}$ can be replaced
by the smaller number $2^{-(4k+2)}A^{(2k-1)/k} n^{2k}\sigma^{2k+2}$
in the probability of formula (\ref{(16.27)}). In the case
$V_2=\{1,\dots,k\}$ it follows from the conditions of Part~b) of
Lemma~16.3B if the number $\gamma_k$ is chosen so that
$\gamma_k\le1$. Indeed, since $A^{(2k+1)/2k}>T^{(2k+1)/2k}$, and
by the conditions of Proposition~15.4 (and as a consequence of
Lemma~16.3B) inequality~(\ref{(15.7)}) holds for all
$\bar A\ge T^{(2k+1)/2k}$, we can apply this relation for the
parameter~$A^{(2k+1)/2k}$. In such a way we get
inequality~(\ref{(16.27)}) also for $V_2=\{1,\dots,k\}$.
\hfill$\qed$
\medskip
Now we turn to the proof of Lemma~16.1B.
\medskip\noindent
{\it Proof of Lemma 16.1B.}\/ By Lemma~16.2B it is enough to
prove that relation (\ref{(16.8)}) holds if the random
variables $\bar W(f)$ are replaced in it by the random
variables $W(f)$ defined in formula~(\ref{(16.14)}). We shall
prove this by applying the generalized form of the
symmetrization lemma, Lemma~15.2, with the choice of
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)$, $\bar V=\{1,\dots,k\}$,
$\bar Z(f)=Z(f)-W(f)$, $f\in{\cal F}$,
${\cal B}={\cal B}(\xi_1^{(j,1)},\dots,\xi_n^{(j,1)};\;1\le j\le k)$,
$\alpha=\frac{A^2}2n^{2k}\sigma^{2k+2}$,
$u=\frac{A^2}2n^{2k}\sigma^{2k+2}$ and the set
\begin{eqnarray*}
B&&=\bigcap_{\substack{(V_1,V_2)\colon\, V_j\subset \{1,\dots,k\},
\;j=1,2,\\
V_1\neq\{1,\dots,k\} \textrm { or } V_2\neq\{1,\dots,k\} }} \\
&&\qquad\qquad \left\{\omega\colon
\sup_{f\in{\cal F}} E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})(\omega)
\le \frac{2^{-(2k+2)}}{k!} A^{2} n^{2k}\sigma^{2k+2}\right\}.
\end{eqnarray*}
By part~b) of Lemma 16.3B the inequality
$$
1-P(B)\le2^{2k+1}n^{k-1}
e^{-\gamma_k A^{1/2k}n\sigma^2/k}
$$
holds. Observe that
$Z(f)=H_{n,k}^{(\bar V,\bar V)}(f)=H_{n,k}(f)$ for all $f\in{\cal F}$.
Hence to prove Lemma 16.1B with the
help of Lemma~15.2 it is enough to show that
\begin{equation}
P\left(\left.|\bar Z(f)|>\frac{A^2}{2k!} n^{2k}\sigma^{2k+2}\right|
{\cal B}\right)(\omega)\le\frac12 \quad \textrm{ for all }f\in{\cal F}
\textrm{ if } \omega\in B. \label{(16.28)}
\end{equation}
To prove this relation observe that because of the definition of the
set~$B$
\begin{eqnarray*}
&& E (|\bar Z(f)| |{\cal B})(\omega) \\
&&\qquad \le \sum_{\substack
{(V_1,V_2)\colon\, V_j\subset \{1,\dots,k\},\;j=1,2,\\
V_1\neq\{1,\dots,k\} \textrm { or } V_2\neq\{1,\dots,k\} }}
E(H^{(V_1,V_2)}_{n,k}(f)|{\cal B})(\omega)
\le\frac{A^2}{4k!}n^{2k}\sigma^{2k+2}
\end{eqnarray*}
if $\omega\in B$ for all $f\in {\cal F}$. Hence the `conditional
Markov inequality' implies that
$P\left(\left.|\bar Z(f)|>\frac{A^2}{2k!} n^{2k}\sigma^{2(k+1)}\right|
{\cal B}\right)(\omega)\le\frac
{2k!E(|\bar Z(f)| |{\cal B})(\omega)}{A^2n^{2k}\sigma^{2k+2}}\le\frac12$
if $\omega\in B$, and inequality~(\ref{(16.28)}) holds.
Lemma~16.1B is proved.
\hfill$\qed$
\chapter{The proof of the main result}
In this chapter Propositions~15.3 and~15.4 are proved with the help of
Lemmas~16.1A and~16.1B. They complete the proof of Theorem~8.4,
the main result of this work.
\medskip\noindent
{\script A.) The proof of Proposition 15.3.}
\medskip\noindent
The proof of Proposition 15.3 is similar to that of Proposition~7.3.
It applies an induction procedure with respect to the order~$k$ of
the $U$-statistics. In the proof of Proposition~15.3 for
parameter~$k$ we may assume that Propositions~15.3 and~15.4 hold
for $u<k$. The main step in the proof of Proposition~15.3 for
parameter~$k$ is a good bound on the probability
$$
P\left(\sup_{f\in{\cal F}}
\left|k!\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right)
$$
appearing at the right-hand side of the estimate (\ref{(16.1)})
in Lemma~16.1A. To estimate this probability we introduce (using
the notation of Proposition~15.3) the functions
\begin{eqnarray}
&&S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,1\le j\le k) \nonumber \\
&&\qquad =\sum_{\substack {(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; j=1,\dots, k,\\ l_j\neq l_{j'}
\textrm{ if } j\neq j'}}
f^2\left(x_{l_1}^{(1)},\dots,x_{l_k}^{(k)}\right),
\quad f\in{\cal F}, \label{(17.1)}
\end{eqnarray}
with $x_l^{(j)}\in X$, $1\le l\le n$, $1\le j\le k$.
We define with the help of this function the following set
$H=H(A)\subset X^{kn}$ for all $A>T$ similarly to
formula~(\ref{(7.7)}) in the proof of Proposition~7.3:
\begin{eqnarray}
&&H=H(A)=\biggl\{\left(x_l^{(j)},\,1\le l\le n,\,1\le j\le k\right)\colon
\nonumber \\
&&\qquad \sup_{f\in{\cal F}} S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,
1\le j\le k)>2^kA^{4/3}n^k\sigma^2\biggr\}. \label{(17.2)}
\end{eqnarray}
First we want to show that
\begin{equation}
P(\{\omega\colon\, (\xi_l^{(j)}(\omega),
\,1\le l\le n,\,1\le j\le k)\in H\})
\le 2^k e^{-A^{2/3k}n\sigma^2} \quad\textrm{if }A\ge T.
\label{(17.3)}
\end{equation}
To prove relation (\ref{(17.3)}) we take the Hoeffding
decomposition of the
$U$-statistics with kernel functions $f^2(x_1,\dots,x_k)$,
$f\in{\cal F}$, given in Theorem~9.1, i.e. we write
\begin{equation}
f^2(x_1,\dots,x_k)
=\sum\limits_{V\subset\{1,\dots,k\}} f_V(x_j,j\in V),
\quad f\in{\cal F}, \label{(17.4)}
\end{equation}
with
$f_V(x_j,j\in V)=\prod\limits_{j\notin V}P_j\prod\limits_{j\in V}Q_j
f^2(x_1,\dots,x_k)$, where $P_j$ and $Q_j$ are the operators defined
in formulas (\ref{(9.1)}) and~(\ref{(9.1a)}).
The functions $f_V$ appearing in formula (\ref{(17.4)}) are
canonical (with respect to the measure $\mu$), and the identity
$S^2_{n,k}(f)(\xi_l^{(j)},\,1\le l\le n,\,1\le j \le k)=k!\bar I_{n,k}(f^2)$
holds for all $f\in {\cal F}$ with the expression $\bar I_{n,k}(\cdot)$
defined in~(\ref{(14.11)}). By applying the Hoeff\-ding
decomposition~(\ref{(17.4)})
for each term $f^2(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)})$ in the
expression $S^2_{n,k}(f)$ we get that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}S^2_{n,k}(f)(\xi_l^{(j)},
\,1\le l\le n,\,1\le j\le
k) >2^kA^{4/3}n^k\sigma^2\right) \nonumber \\
&&\qquad \le \!\!\! \sum_{V\subset\{1,\dots,k\}} \!\!\!
P\left(\sup_{f\in{\cal F}}
n^{k-|V|}||V|!\,\bar I_{n,|V|}(f_V)|>A^{4/3}n^k\sigma^2\right)
\label{(17.5)}
\end{eqnarray}
with the functions $f_V$ appearing in formula~(\ref{(17.4)}).
We want to give
a good estimate for each term in the sum at the right-hand side
in~(\ref{(17.5)}). For this goal first we show that the
classes of functions
$\{f_V\colon\,f\in {\cal F}\}$ in the expansion~(\ref{(17.4)})
satisfy the
conditions of Proposition~15.3 for all $V\subset\{1,\dots,k\}$.
The functions $f_V$ are canonical for all $V\subset\{1,\dots,k\}$.
It follows from the conditions of Proposition~15.3 that
$|f^2(x_1,\dots,x_k)|\le 2^{-2(k+1)}$ and
$$
\int f^4(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le
2^{-(k+1)}\sigma^2.
$$
Hence relations (\ref{(9.4)}) and~(\ref{($9.4'$)}) of
Theorem~9.2 imply that
$$
\sup_{x_j\in X,j\in V}\left|f_V(x_j,j\in V)\right|
\le 2^{-(k+2)}\le2^{-(k+1)}
$$
and $\int f^2_V(x_j,j\in V)\prod\limits_{j\in V}\mu(\,dx_j)
\le 2^{-(k+1)} \sigma^2\le\sigma^2$ for all
$V\subset\{1,\dots,k\}$. Finally, to check that the class of
functions ${\cal F}_V=\{f_V\colon\, f\in{\cal F}\}$
is $L_2$-dense with exponent $L$ and parameter $D$ observe
that for all probability measures $\rho$ on $(X^k,{\cal X}^k)$
and pairs of functions $f,g\in {\cal F}$ the inequality
$\int(f^2-g^2)^2\,d\rho\le 2^{-2k}\int(f-g)^2\,d\rho$ holds.
This implies that if $\{f_1,\dots,f_m\}$,
$m\le D\varepsilon^{-L}$, is an
$\varepsilon$-dense subset of ${\cal F}$ in the space
$L_2(X^k,{\cal X}^k,\rho)$,
then the set of functions $\{2^kf_1^2,\dots,2^kf_m^2\}$ is an
$\varepsilon$-dense subset of the class of functions
${\cal F}'=\{2^kf^2\colon\,
f\in {\cal F}\}$, hence ${\cal F}'$ is also an $L_2$-dense class
of functions with exponent~$L$ and parameter~$D$. Then by
Theorem~9.2 the class of functions ${\cal F}_V$ is also
$L_2$-dense with exponent $L$ and
parameter~$D$ for all sets $V\subset\{1,\dots,k\}$.
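\medskip\noindent
{\it Remark.}\/ The inequality
$\int(f^2-g^2)^2\,d\rho\le 2^{-2k}\int(f-g)^2\,d\rho$ applied above
is a consequence of the identity $f^2-g^2=(f-g)(f+g)$ and of the
bound $|f(x)+g(x)|\le2\cdot2^{-(k+1)}=2^{-k}$ valid for all
$f,g\in{\cal F}$, since they imply that
$$
\int(f^2-g^2)^2\,d\rho=\int(f-g)^2(f+g)^2\,d\rho
\le2^{-2k}\int(f-g)^2\,d\rho.
$$
\medskip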
For $V=\emptyset$, the function $f_V$ is constant, the relation
$$
f_V=\int f^2(x_1,\dots,x_k) \mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2
$$
holds, and $|\bar I_{n,|V|}(f_V)|=f_V\le\sigma^2$. Therefore
the term corresponding to $V=\emptyset$ in the sum of
probabilities at the right-hand side of (\ref{(17.5)}) equals
zero under the conditions of Proposition~15.3 if $A_0\ge1$ (and
hence $A\ge1$). I claim that the remaining terms in the sum
at the right-hand side of~(\ref{(17.5)}) satisfy the inequality
\begin{eqnarray}
&&P\left(n^{k-|V|}\sup_{f\in{\cal F}}
||V|!\,\bar I_{n,|V|}(f_V)|>A^{4/3}n^{k}\sigma^2\right)\nonumber \\
&&\qquad \le P\left(\sup_{f\in{\cal F}}
||V|!\,\bar I_{n,|V|}(f_V)|>A^{4/3}
n^{|V|}\sigma^{|V|+1}\right)
\le e^{-A^{2/3k}n\sigma^2} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{if } 1\le|V|\le k. \label{(17.6)}
\end{eqnarray}
The first inequality in (\ref{(17.6)}) holds, since
$\sigma^{|V|+1}\le\sigma^2$
for $|V|\ge1$, and $n\ge k\ge|V|$. The second inequality
follows from the inductive hypothesis if $|V|<k$.

By estimating the conditional probability
$P\left(\left|k!\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^{k}\sigma^{k+1}\right)$ with
respect to the random variables $\xi_l^{(j)}$,
$1\le l\le n$, $1\le j\le k$ we get with the help of
the multivariate version of Hoeff\-ding's inequality
(Theorem~13.3) that
\begin{eqnarray}
&&P\left(\left.\left|k!\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+2)}A n^k\sigma^{k+1}\right|\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\right) \nonumber \\
&&\qquad \le C\exp\left\{-\frac12
\left(\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{2k+4}
S^2_{n,k}(f)(x_l^{(j)},1\le l\le n,\,1\le j\le k)}
\right)^{1/k}\right\} \nonumber \\
&&\qquad \le Ce^{-2^{-4-4/k}A^{2/3k}n\sigma^2} \quad
\textrm{for all }f\in{\cal F} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad \textrm{if } (x_l^{(j)},\,
1\le l\le n,\,1\le j\le k) \notin H \label{(17.7)}
\end{eqnarray}
with some appropriate constant $C=C(k)>0$.
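\medskip\noindent
{\it Remark.}\/ The second inequality in~(\ref{(17.7)}) is a direct
calculation: outside the set~$H$ we have
$S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)
\le2^kA^{4/3}n^k\sigma^2$, hence
$$
\frac12\left(\frac{A^2n^{2k}\sigma^{2(k+1)}}
{2^{2k+4}S^2_{n,k}(f)(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)}
\right)^{1/k}
\ge\frac12\left(\frac{A^{2/3}(n\sigma^2)^k}{2^{3k+4}}\right)^{1/k}
=2^{-4-4/k}A^{2/3k}n\sigma^2.
$$
\medskip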
Define for all $1\le j\le k$ and sets of points $x_l^{(j)}\in X$,
$1\le l\le n$, the probability measures
$\rho_j=\rho_{j,\,(x_l^{(j)},\,
1\le l\le n)}$, $1\le j\le k$ on $X$, uniformly distributed on
the set of points $\{x_l^{(j)},\; 1\le l\le n\}$, i.e. let
$\rho_j(x_l^{(j)})=\frac1n$ for all $1\le l\le n$. Let us also
define the product $\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)
=\rho_1\times\cdots\times\rho_k$ of these measures on the space
$(X^k,{\cal X}^k)$. If $f$ is a function on $(X^k,{\cal X}^k)$ such
that $\int f^2\,d\rho\le\delta^2$ with some $\delta>0$, then
\begin{eqnarray*}
&&\sup_{\varepsilon_1,\dots,\varepsilon_n} |k!\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)| \\
&&\qquad \le n^k\int
|f(u_1,\dots,u_k)|\rho(\,du_1,\dots,\,du_k)
\le n^k \left(\int f^2\,d\rho\right)^{1/2} \le n^k\delta,
\end{eqnarray*}
$u_j\in X$, $1\le j\le k$, and as a consequence
\begin{eqnarray}
&&\sup_{\varepsilon_1,\dots,\varepsilon_n}|k!\bar I_{n,k}^\varepsilon(f)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k) \label{(17.8)} \\
&&\qquad\qquad\qquad -k!\bar I_{n,k}^\varepsilon(g)
(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)| \nonumber \\
&&\qquad \le2^{-(k+2)}An^k\sigma^{k+1} \quad\textrm{if }
\int (f-g)^2\,d\rho\le (2^{-(k+2)}A\sigma^{k+1})^2,
\nonumber
\end{eqnarray}
where
$\bar I_{n,k}^\varepsilon(f)(x_l^{(j)},\, 1\le l\le n,\,1\le j\le k)$
equals the expression $\bar I_{n,k}^\varepsilon(f)$ defined
in~(\ref{(14.12)}) if we replace $\xi_{l_j}^{(j)}$ by $x_{l_j}^{(j)}$
for all $1\le j\le k$, and $1\le l_j\le n$ in it, and $\rho$ is
the measure
$\rho=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$ defined above.
\medskip\noindent
{\it Remark.}\/ Similarly to the remark made in the proof of
Proposition~7.3 we may restrict our attention to the case when
the random variables $\xi^{(j)}_l$ are non-atomic. A similar
statement holds also in the proof of Proposition~15.4.
\medskip
Let us fix the number $\delta=2^{-(k+2)}A\sigma^{k+1}$,
and let us list the elements of the set ${\cal F}$ as
${\cal F}=\{f_1,f_2,\dots\}$.
Put
$$
m=m(\delta)=\max(1,D\delta^{-L})
=\max(1,D(2^{k+2}A^{-1}\sigma^{-(k+1)})^L),
$$
and choose for all vectors
$x^{(n)}=(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)\in X^{kn}$ such a
sequence of positive integers $p_1(x^{(n)}),\dots,p_m(x^{(n)})$
for which
$$
\inf\limits_{1\le l\le m}\int (f(u)-f_{p_l(x^{(n)})}(u))^2
\rho(x^{(n)})(\,du)\le\delta^2\quad\textrm{for all } f\in{\cal F}
\textrm{ and } x^{(n)}\in X^{kn}.
$$
(Here we apply the notation
$\rho(x^{(n)})=\rho(x_l^{(j)},\,1\le l\le n,\,1\le j\le k)$, which is
a probability measure on $X^k$ depending on $x^{(n)}$.)
This is possible, since ${\cal F}$ is an $L_2$-dense
class with exponent~$L$ and parameter~$D$, and we can choose
$m=D\delta^{-L}$ if $\delta<1$. Besides this, we can choose $m=1$
if $\delta\ge1$, since
$\int |f-g|^2\,d\rho\le\sup|f(x)-g(x)|^2\le2^{-2k}\le1$ for all
$f,g\in{\cal F}$. Moreover, we have shown in Lemma~7.4A that the
functions $p_l(x^{(n)})$, $1\le l\le m$, can be chosen as
measurable functions of the argument $x^{(n)}\in X^{kn}$.
Let us consider the random vector
$\xi^{(n)}(\omega)=(\xi^{(j)}_l(\omega),\,1\le l\le n,\,1\le j\le k)$.
By arguing similarly as we did in the proof of Proposition~7.3 we
get with the help of relation~(\ref{(17.8)}) and the property of the
functions $f_{p_l(x^{(n)})}(\cdot)$ constructed above that
\begin{eqnarray*}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
|k!\bar I_{n,k}^\varepsilon(f)(\omega)|
\ge2^{-(k+1)}An^k\sigma^{k+1}\right\} \\
&&\qquad \subset\bigcup\limits_{l=1}^m\left\{\omega\colon\,
|k!\bar I_{n,k}^\varepsilon(f_{p_l(\xi^{(n)}(\omega))})(\omega)|
\ge2^{-(k+2)}An^k\sigma^{k+1} \right\}.
\end{eqnarray*}
The above relation and formula (\ref{(17.7)}) imply that
\begin{eqnarray}
&&P \left.\biggl(\sup_{f\in{\cal F}}
\left|k!\bar I_{n,k}^{\varepsilon}(f)(\omega)\right|
>\frac{A n^k\sigma^{k+1}}{2^{(k+1)}}\right|
\xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\biggr) \nonumber \\
&&\qquad \le \sum_{l=1}^m P\left.\biggl(|
k!\bar I_{n,k}^{\varepsilon}(f_{p_l(\xi^{(n)}(\omega))})(\omega)|
>\frac{A n^k\sigma^{k+1}}{2^{k+2}}\right|
\nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\qquad \xi_l^{(j)}(\omega)=x_l^{(j)},
1\le l\le n,1\le j\le k\biggr) \nonumber \\
&&\qquad \le C m(\delta) e^{-2^{-4-4/k}A^{2/3k}n\sigma^2}
\le C (1+D(2^{k+2} A^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}n\sigma^2} \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad \textrm{if }
\{x_l^{(j)},\, 1\le l\le n,\,1\le j\le k\}\notin H. \label{(17.9)}
\end{eqnarray}
Relations~(\ref{(17.3)}) and~(\ref{(17.9)}) imply that
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}}
\left|k!\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right) \label{(17.10)} \\
&&\qquad \le C (1+D(2^{k+2}A^{-1}
\sigma^{-(k+1)})^L) e^{-2^{-4-4/k}A^{2/3k}n\sigma^2}
+2^k e^{-A^{2/3k}n\sigma^2} \quad\textrm{if } A>T.
\nonumber
\end{eqnarray}
Proposition 15.3 follows from the estimates~(\ref{(16.1)}),
(\ref{(17.10)}) and the condition $n\sigma^2\ge L\log n+\log D$,
$L,D\ge 1$, if $A\ge A_0$ with a sufficiently large number~$A_0$.
Indeed, in this case $n\sigma^2\ge\frac12$,
$(2^{k+2}A^{-1}\sigma^{-(k+1)})^L
\le(\frac{n^{(k+1)/2}}{(2n\sigma^2)^{(k+1)/2}})^L\le n^{L(k+1)/2}=
e^{L\log n\cdot (k+1)/2}\le e^{(k+1)n\sigma^2/2}$,
$D=e^{\log D}\le e^{n\sigma^2}$, and
$$
C (1+D(2^{k+2} A^{-1}\sigma^{-(k+1)})^L)
e^{-2^{-4-4/k}A^{2/3k}n\sigma^2}
\le\frac13 e^{-A^{1/2k}n\sigma^2}.
$$
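\medskip\noindent
{\it Remark.}\/ Indeed, by the above estimates
$1+D(2^{k+2}A^{-1}\sigma^{-(k+1)})^L\le2e^{(k+3)n\sigma^2/2}$, hence
the left-hand side of the last inequality is at most
$$
2Ce^{(k+3)n\sigma^2/2}e^{-2^{-4-4/k}A^{2/3k}n\sigma^2}
=2Ce^{-(2^{-4-4/k}A^{2/3k}-(k+3)/2)n\sigma^2},
$$
and since $n\sigma^2\ge\frac12$ and
$2^{-4-4/k}A^{2/3k}-A^{1/2k}\to\infty$ as $A\to\infty$, this
expression is smaller than $\frac13 e^{-A^{1/2k}n\sigma^2}$ if
$A\ge A_0$ with a sufficiently large number~$A_0$.
\medskip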
The estimation of the remaining terms in the upper bound of the
estimates~(\ref{(16.1)}) and~(\ref{(17.10)}) leading to the proof of
relation~(\ref{(15.5)}) is simpler. We can exploit that
$e^{-A^{2/3k}n\sigma^2}\ll e^{-A^{1/2k}n\sigma^2}$ and that
$n^{k-1}\le e^{(k-1)n\sigma^2}$, hence
$2^k e^{-A^{2/3k}n\sigma^2}\le\frac13 e^{-A^{1/2k}n\sigma^2}$, and
$2^kn^{k-1}e^{-\gamma_k A^{1/(2k-1)} n\sigma^2/k}\le
2^ke^{(k-1)n\sigma^2}e^{-\gamma_k A^{1/(2k-1)}
n\sigma^2/k}\ll e^{-A^{1/2k}n\sigma^2}$
for a large number~$A$.
\hfill$\qed$
\medskip
Now we turn to the proof of Proposition~15.4.
\medskip\noindent
{\script B.) The proof of Proposition 15.4.}
\medskip\noindent
Because of formula~(\ref{(16.11)}) in the Corollary of
Lemma~16.1B, to prove Proposition~15.4, i.e.\
inequality~(\ref{(15.7)}), it is enough to choose a
sufficiently large parameter $A_0$ and to show that with such
a choice the random variables $H_{n,k}(f|G,V_1,V_2)$ defined in
formula~(\ref{(16.9)}) satisfy the inequality
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}} \right) \le
2^{k+1} e^{-A^{1/2k}n\sigma^2}\nonumber \\
&&\qquad\textrm{ for all } G\in {\cal G}\quad \textrm{and }
\;V_1,V_2\subset\{1,\dots,k\} \quad\textrm{if } A>T\ge A_0
\label{(17.11)}
\end{eqnarray}
under the conditions of Proposition~15.4.
Let us first prove formula (\ref{(17.11)}) in the case $|e(G)|=k$,
i.e.\ when all vertices of the diagram $G$ are end-points of some
edge, and the expression $H_{n,k}(f|G,V_1,V_2)$ contains no
`symmetrizing term' $\varepsilon_j$. In this case we apply a
special argument to prove relation~(\ref{(17.11)}).
We will show with the help of the Schwarz inequality that for a
diagram $G$ such that $|e(G)|=k$
\begin{eqnarray}
&&|H_{n,k}(f|G,V_1,V_2)| \label{(17.12)} \\
&&\qquad \le \left(\sum_{\substack{ (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
\int f^2(\xi_{l_1}^{(1,\delta_1(V_1))},
\dots,\xi_{l_k}^{(k,\delta_k(V_1))},y)
\rho(\,dy)\right)^{1/2} \nonumber \\
&& \qquad\qquad
\left(\sum_{\substack{ (l_1,\dots,l_k)\colon\\
1\le l_j\le n, \;1\le j\le k,\\
l_j\neq l_{j'} \textrm{ if }j\neq j'}}
\int f^2(\xi_{l_1}^{(1,\delta_1(V_2))},\dots,
\xi_{l_k}^{(k,\delta_k(V_2))},y) \rho(\,dy)\right)^{1/2} \nonumber
\end{eqnarray}
with $\delta_j(V_1)=1$ if $j\in V_1$, $\delta_j(V_1)=-1$ if
$j\notin V_1$, and $\delta_j(V_2)=1$ if $j\in V_2$,
$\delta_j(V_2)=-1$ if $j\notin V_2$.
Relation (\ref{(17.12)}) can be proved for instance by
bounding first the absolute value of each integral in
formula~(\ref{(16.9)}) by means of the Schwarz inequality, and
then by bounding the sum appearing in such a way by means of
the inequality $\sum |a_jb_j|\le \left(\sum a_j^2\right)^{1/2}
\left(\sum b_j^2\right)^{1/2}$. Observe that in the case
$|e(G)|=k$ the summation in~(\ref{(16.9)}) is
taken for such vectors $(l_1,\dots,l_k,l_1',\dots,l_k')$ for
which $(l_1',\dots,l_k')$ is a permutation of the sequence
$(l_1,\dots,l_k)$ determined by the diagram~$G$. Hence the
sum we get after applying the Schwarz inequality for each
integral in~(\ref{(16.9)}) has the form $\sum a_jb_j$ where
the set of indices~$j$ in this sum agrees with
the set of vectors $(l_1,\dots,l_k)$ such that $1\le l_p\le n$
for all $1\le p\le k$, and $l_p\neq l_{p'}$ if $p\neq p'$.
By formula (\ref{(17.12)})
\begin{eqnarray*}
&&\left\{\omega\colon\,\sup_{f\in{\cal F}}
\left |H_{n,k}(f|G,V_1,V_2)(\omega)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}} \right\} \\
&&\qquad \subset
\biggl\{\omega\colon\, \sup_{f\in{\cal F}} \!\!\!\!
\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!\!
\int f^2(\xi_{l_1}^{(1,\delta_1(V_1))}(\omega),
\dots,\xi_{l_k}^{(k,\delta_k(V_1))}
(\omega),y) \rho(\,dy) \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
>\frac {A^2n^{2k}\sigma^{2(k+1)}k!}{2^{4k+1}} \biggr\} \\
&&\qquad\quad \cup \biggl\{\omega\colon\, \sup_{f\in{\cal F}} \!\!\!\!
\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n, \; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}} \!\!\!\!\!
\int f^2(\xi_{l_1}^{(1,\delta_1(V_2))}(\omega),\dots,
\xi_{l_k}^{(k,\delta_k(V_2))}
(\omega),y)\rho(\,dy) \\
&&\qquad\qquad \qquad\qquad\qquad\qquad\qquad\qquad\qquad\qquad
>\frac{A^2n^{2k}\sigma^{2(k+1)}k!}{2^{4k+1}}\biggr\},
\end{eqnarray*}
hence
\begin{eqnarray}
&&P\left(\sup_{f\in{\cal F}} \left |H_{n,k}(f|G,V_1,V_2)\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right) \nonumber \\
&&\qquad \le 2P\left(\sup_{f\in{\cal F}}
\left|\sum_{\substack{(l_1,\dots,l_k)\colon\\
1\le l_j\le n,\; 1\le j\le k, \\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
h_f(\xi_{l_1}^{(1,1)},\dots,\xi_{l_k}^{(k,1)})\right|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right) \nonumber \\
&&\qquad =2P\left(\sup_{f\in{\cal F}}|k!\bar I_{n,k}(h_f)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right), \label{(17.13)}
\end{eqnarray}
where $\bar I_{n,k}(h_f)$, $f\in{\cal F}$, are the decoupled
$U$-statistics defined in~(\ref{(14.11)}) with the kernel
functions $h_f(x_1,\dots,x_k)=\int f^2(x_1,\dots,x_k,y)\rho(\,dy)$
and the random variables $\xi^{(j,1)}_l$, $1\le j\le k$,
$1\le l\le n$. (In this upper bound we could get rid of the
terms $\delta_j(V_1)$ and $\delta_j(V_2)$, i.e. of the
dependence of the expression $H_{n,k}(f|G,V_1,V_2)$ on the
sets $V_1$ and $V_2$, since the probability of the events
in the previous formula do not depend on them.)
I claim that
\begin{equation}
P\left(\sup\limits_{f\in{\cal F}} |k!\bar I_{n,k}(h_f)|
\ge2^k An^k \sigma^2\right)\le
2^k e^{-A^{1/2k}n\sigma^2} \quad \textrm{for }A\ge A_0
\label{(17.14)}
\end{equation}
if the constant $A_0=A_0(k)$ is chosen sufficiently large in
Proposition~15.4. Relation (\ref{(17.14)}) together with the
relation
$\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\ge2^kA n^k\sigma^2$
(if $A>A_0$ with a sufficiently large~$A_0$) imply that the
probability at the right-hand side of (\ref{(17.13)}) can be
bounded by $2^{k+1}e^{-A^{1/2k}n\sigma^2}$, and the
estimate~(\ref{(17.11)}) holds in the case $|e(G)|=k$.
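\medskip\noindent
{\it Remark.}\/ The relation between the two levels applied above
can be checked directly, since
$$
\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\ge2^kAn^k\sigma^2
\quad\Longleftrightarrow\quad A(n\sigma^2)^k\ge2^{5k+1},
$$
and the inequality at the right-hand side holds if
$n\sigma^2\ge\frac12$ and $A>A_0\ge2^{6k+1}$.
\medskip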
Relation (\ref{(17.14)}) is similar to relation~(\ref{(17.3)})
(together with the definition of the random set~$H$ in
formula~(\ref{(17.2)})), and a modification of the proof of
the latter estimate yields the proof also in this case.
Indeed, it follows from the conditions of
Proposition~15.4 that
$0\le\int h_f(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)\le\sigma^2$
for all $f\in{\cal F}$, and it is not difficult to check that
$\sup|h_f(x_1,\dots,x_k)|\le2^{-2(k+1)}$, and the class of
functions ${\cal H}=\{2^kh_f,\; f\in{\cal F}\}$ is an
$L_2$-dense class with exponent~$L$ and parameter~$D$. Hence
by applying the Hoeff\-ding decomposition of the functions
$h_f$, $f\in {\cal F}$, similarly to formula~(\ref{(17.4)}) we
get for all $V\subset \{1,\dots,k\}$ such a set of functions
$\{(h_f)_V,\,f\in{\cal F}\}$ which satisfies the conditions
of Proposition~15.3. Hence a natural adaptation of the
estimate given for the expression at the right-hand side
of~(\ref{(17.5)}) (with the help of~(\ref{(17.6)}) and the
investigation of $|V|!\,\bar I_{n,|V|}(f_V)$ for $V=\emptyset$)
yields the proof of formula (\ref{(17.14)}). We only have to
replace $S^2_{n,k}(f)$ by $k!\bar I_{n,k}(h_f)$, then
$|V|!\,\bar I_{n,|V|}(f_V)$ by $|V|!\,\bar I_{n,|V|}((h_f)_V)$
and the levels $2^kA^{4/3}n^k\sigma^2$ in~(\ref{(17.3)}) and
$A^{4/3}n^k\sigma^2$ in~(\ref{(17.5)}) by $2^kAn^k\sigma^2$
and $An^k\sigma^2$ respectively. Let us observe that
each term of the upper bound we get in such a way can be
directly bounded, since during the proof of Proposition~15.4
for parameter~$k$ we may assume that the result
of Proposition~15.3 holds also for this parameter~$k$.
\medskip
In the case of a diagram $G\in{\cal G}$ such that $e(G)<k$ first
we show that
\begin{equation}
P\left(S^2({\cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4\right)
\le 2^{k+1}e^{-A^{2/3k}n\sigma^2} \quad
\textrm{if }A\ge A_0\textrm{ and } e(G)<k. \label{(17.16)}
\end{equation}
Indeed, similarly to the argument leading to
formula~(\ref{(17.13)}) we have
$$
P(S^2({\cal F}|G,V_1,V_2)>2^{2k}A^{8/3}n^{2k}\sigma^4) \le
2P\left(\sup\limits_{f\in{\cal F}}
k!\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right),
$$
where $\bar I_{n,k}(h_f)$, $f\in{\cal F}$, are the decoupled
$U$-statistics defined in~(\ref{(14.11)}) with the kernel
functions $h_f(x_1,\dots,x_k)=\int f^2(x_1,\dots,x_k,y)\rho(\,dy)$
and the random variables $\xi^{(j,1)}_l$, $1\le j\le k$,
$1\le l\le n$. (Here we exploited that in the last formula
$S^2({\cal F}|G,V_1,V_2)$ is bounded by the product of two
random variables whose distributions do not depend on the
sets $V_1$ and $V_2$.) Thus to prove inequality
(\ref{(17.16)}) it is enough to show that
\begin{equation}
2P\left(\sup\limits_{f\in{\cal F}}
k!\bar I_{n,k}(h_f)>2^kA^{4/3}n^k\sigma^2\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2} \quad \textrm{if } A\ge A_0.
\label{(17.21)}
\end{equation}
Actually formula (\ref{(17.21)}) follows from the already
proven formula~(\ref{(17.14)}), only the parameter $A$ has
to be replaced by $A^{4/3}$ in it.
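\medskip\noindent
{\it Remark.}\/ Indeed, the replacement of $A$ by $A^{4/3}$
in~(\ref{(17.14)}) yields that
$$
P\left(\sup\limits_{f\in{\cal F}} |k!\bar I_{n,k}(h_f)|
\ge2^kA^{4/3}n^k\sigma^2\right)\le
2^ke^{-(A^{4/3})^{1/2k}n\sigma^2}=2^ke^{-A^{2/3k}n\sigma^2}
\quad\textrm{if } A^{4/3}\ge A_0,
$$
which implies formula~(\ref{(17.21)}).
\medskip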
With the help of relation (\ref{(17.16)}) the proof of
Proposition~15.4 can be completed similarly to
Proposition~15.3. The following version of
inequality~(\ref{(17.7)}) can be proved with the help
of the multivariate version of Hoeff\-ding's inequality
(Theorem~13.3) and the representation of the random variable
$H_{n,k}(f|G,V_1,V_2)$ in the form~(\ref{(17.15)}).
\begin{eqnarray}
&&P\biggl(\left.|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+2}} n^{2k}\sigma^{2(k+1)}\right|
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)
\nonumber \\
&& \qquad \qquad\le Ce^{-2^{-(6+2/k)} A^{2/3k}n\sigma^2}
\label{(17.22)} \\
&&\qquad\qquad \qquad \textrm{if}\quad S^2({\cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \textrm{ and }A\ge A_0
\nonumber
\end{eqnarray}
with an appropriate constant $C=C(k)>0$ for all $f\in{\cal F}$
and $G\in {\cal G}$ such that $|e(G)|<k$. Indeed, in this case the
multivariate version of Hoeff\-ding's inequality yields an upper
bound of the form
$Ce^{-\frac12\left(2^{-(10k+4)}A^{4/3}\right)^{1/j}n\sigma^2}$ with
some constant $C=C(k)>0$, where $2j=2k-2|e(G)|$, and
$0\le |e(G)|\le k-1$. Since $j\le k$, $n\sigma^2\ge\frac12$,
and also $\frac{A^{4/3}}{2^{10k+4}}\ge2$ if $A_0$ is chosen
sufficiently large we can write in the above upper bound for
the left-hand side of~(\ref{(17.22)}) $j=k$, and in such a way
we get inequality~(\ref{(17.22)}).
The next inequality, in which we estimate
$\sup\limits_{f\in{\cal F}}H_{n,k}(f|G,V_1,V_2)$, is a natural
version of formula~(\ref{(17.9)}) in the proof of Proposition~15.3.
We shall show that
\begin{eqnarray}
&&P\biggl(\left.\sup_{f\in{\cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2}{2^{4k+1}} n^{2k}\sigma^{2(k+1)}\right|
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)
(\omega)\nonumber \\
&&\qquad\qquad \le C \left(1+D\left(\frac{2^{4k+3}}
{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2} \nonumber \\
&& \qquad\qquad\qquad \textrm{if } S^2({\cal F}|G,V_1,V_2)(\omega)
\le2^{2k} A^{8/3}n^{2k}\sigma^4 \textrm{ and } A\ge A_0
\label{(17.23)}
\end{eqnarray}
for all $G\in{\cal G}$ such that $|e(G)|<k$.
Relation~(\ref{(17.23)}) can be proved similarly to
formula~(\ref{(17.9)}): choosing the number
$\delta=2^{-(4k+3)}A^2\sigma^{2(k+1)}$ and measurable functions
$p_l(x^{(n)})$, $1\le l\le m$, $m=\max(1,D\delta^{-L})$, as in the
proof of Proposition~15.3 we get that
\begin{eqnarray*}
&&P\biggl(\left.\sup_{f\in{\cal F}} |H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right|
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)\\
&&\qquad\le \sum_{l=1}^m
P\biggl(\left. |H_{n,k}(f_{p_l(\xi^{(n)}(\omega))}|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+2}}\right| \\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\xi^{(j,\pm1)}_{l},\,1\le l\le n,\,1\le j\le k\biggr)(\omega)
\end{eqnarray*}
for almost all~$\omega$. The last inequality together
with~(\ref{(17.22)}) and the inequality $m=\max(1,D\delta^{-L})
\le 1+D\left(\frac{2^{4k+3}}{A^2\sigma^{2(k+1)}}\right)^L$ imply
relation~(\ref{(17.23)}).
It follows from relations (\ref{(17.16)}) and (\ref{(17.23)}) that
\begin{eqnarray*}
&&P\left(\sup_{f\in{\cal F}}|H_{n,k}(f|G,V_1,V_2)|
>\frac{A^2n^{2k}\sigma^{2(k+1)}}{2^{4k+1}}\right)\le
2^{k+1}e^{-A^{2/3k}n\sigma^2}\\
&&\qquad + C
\left(1+D\left(\frac{2^{4k+3}}{A^2\sigma^{2(k+1)}}\right)^L\right)
e^{-2^{-(6+2/k)}A^{2/3k}n\sigma^2}
\quad\textrm{if }A\ge A_0
\end{eqnarray*}
for all $V_1,V_2\subset\{1,\dots,k\}$ and diagrams
$G\in{\cal G}$ such that $|e(G)|\le k-1$. This inequality
implies that relation~(\ref{(17.11)}) holds also in the
case $|e(G)|\le k-1$ if the constant $A_0$ is chosen
sufficiently large in Proposition~15.4, and this completes
the proof of Proposition~15.4. To prove relation~(\ref{(17.11)})
in the case $|e(G)|\le k-1$ with the help of the last inequality
it is enough to show that
$D(\frac{2^{4k+3}}{A^2\sigma^{2(k+1)}})^L
\le e^{\textrm{const.}\,n\sigma^2}$
if $A>A_0$ with a sufficiently large~$A_0$, since this implies
that the second term at the right-hand side of our last
estimate is not too large.
This relation follows from the inequality
$n\sigma^2\ge L\log n+\log D$ which implies that
$$
\left(\frac{2^{4k+3}}{A^2\sigma^{2(k+1)}}\right)^L\le
\left(\frac{n^{(k+1)}}{(2n\sigma^2)^{(k+1)}}\right)^L
\le n^{(k+1)L}=e^{(k+1)L\log n}\le e^{{(k+1)}n\sigma^2}
$$
if $A_0$ is sufficiently large, and
$D=e^{\log D}\le e^{n\sigma^2}$.
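\medskip\noindent
{\it Remark.}\/ The first inequality in the last chain of estimates
holds, because
$\sigma^{-2(k+1)}=\left(\frac n{n\sigma^2}\right)^{k+1}$, hence
$$
\frac{2^{4k+3}}{A^2\sigma^{2(k+1)}}
=\frac{2^{5k+4}}{A^2}\cdot\frac{n^{k+1}}{(2n\sigma^2)^{k+1}}
\le\frac{n^{k+1}}{(2n\sigma^2)^{k+1}}
\quad\textrm{if } A^2\ge2^{5k+4}.
$$
\medskip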
\hfill$\qed$
\chapter{An overview of the results and a discussion of the
literature}
I discuss briefly the problems investigated in this work,
recall some basic results related to them, and also give some
references. I also write about the background of these problems
which may explain the motivation for their study. My remarks
follow the order of the subsequent chapters of this work. Chapter~1
is an introductory text; the real work starts with Chapter~2.
\medskip\noindent
{\script Chapter 2}
\medskip\noindent
I met the main problem considered in this work when I tried to
adapt the method of proof of the central limit theorem for
maximum-likelihood estimates to some more difficult questions about
so-called non-parametric maximum likelihood estimate problems.
The Kaplan--Meier estimate for the empirical distribution function
with the help of censored data investigated in the second chapter
is an example of such problems. It is not a maximum-likelihood
estimate in the classical sense, but it can be considered as a
non-parametric version of it. In the estimation of the
distribution function with the help of censored data we cannot
apply the classical maximum likelihood method, since in this
problem we have to choose our estimate from a too large class of
distribution functions. The main problem is that there is no
dominating measure with respect to which all candidates which
may appear as our estimate have a density function. A natural
way to overcome this difficulty is to choose an appropriate
smaller class of distribution functions, to compare the
probability of the appearance of the sample we observed with
respect to all distribution functions of this class and to
choose that distribution function as our estimate for which
this probability takes its maximum.
The Kaplan--Meier estimate can be obtained on the basis of the above
principle in the following way: Let us estimate the distribution
function $F(x)$ of the censored data simultaneously together with
the distribution function $G(x)$ of the censoring data. (We have a
sample of size $n$ and know which sample elements are censored and
which are censoring data.) Let us consider the class of such pairs
of estimates $(F_n(x),G_n(x))$ of the pair $(F(x),G(x))$ for which
the distribution function $F_n(x)$ is concentrated in the censored
sample points and the distribution function $G_n(x)$ is
concentrated in the censoring sample points; more precisely, let us
also assume that if the largest sample point is a censored point,
then the distribution function $G_n(x)$ of the censoring data takes
one more value, larger than any sample point, and if
it is a censoring point, then the distribution function $F_n(x)$ of
the censored data takes one more value, larger than any sample
point. (This modification at the end of the definition is needed,
since if the largest sample point is from the class of censored
data, then the distribution $G(x)$ of the censoring data in this
point must be strictly less than~1, and if it is from the class of
censoring data, then the value of the distribution function $F(x)$
of the censored data must be strictly less than~1 in this point.)
Let us take this class of pairs of distribution functions
$(F_n(x),G_n(x))$, and let us choose that pair of distribution
functions of this class as the (non-parametric maximum likelihood)
estimate with respect to which our observation has the greatest
probability.\index{product limit estimator (Kaplan--Meier method)}
The above extremum problem about a pair of distribution functions
$(F_n(x),G_n(x))$ can be solved explicitly (see~\cite{r26}), and
it yields the estimate of $F_n(x)$ written down in formula~(2.3).
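For orientation, in one standard notation (which may differ from that
of formula~(2.3)) the product-limit estimate has the form
$$
F_n(x)=1-\prod_{j:\;X_{(j)}\le x}
\left(\frac{n-j}{n-j+1}\right)^{\delta_{(j)}},
$$
where $X_{(1)}\le\dots\le X_{(n)}$ denote the ordered sample points,
and $\delta_{(j)}=1$ if the $j$-th of them comes from the censored
data.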
(The function $G_n(x)$ satisfies a similar relation, only the
random variables~$X_j$ and~$Y_j$ and the events $\delta_j=1$ and
$\delta_j=0$ have to be interchanged in it.) If we want to prove that
the estimate of the distribution function we found in such a way
satisfies the central limit theorem, then we can do this with
the help of a good adaptation of the method applied in the
study of maximum likelihood estimates. We apply an appropriate
linearization procedure, and there is only one really hard part
of the proof. We have to show that this linearization procedure
gives a small error. This problem led to the study of a good
estimate on the tail distribution of the integral of an
appropriate function of two variables with respect to the
product of a normalized empirical measure with itself. Moreover,
as a more detailed investigation showed, we actually need the
solution of a more general problem where we have to bound the
tail distribution of the supremum of a class of such integrals.
The main subject of this work is to solve the above problems in
a more general setting, to estimate not only two-fold, but also
$k$-fold random integrals and the supremum of such integrals
for an appropriate class of kernel functions with respect to a
normalized empirical distribution for all~$k\ge1$.
The proof of the limit theorem for the Kaplan--Meier estimate
explained in this work applied the explicit form of this estimate.
It would be interesting to find such a modification of this proof
which only exploits that the Kaplan--Meyer estimate is the solution
of an appropriate extremum problem. We may expect that such a proof
can be generalized to a general result about the limit behaviour
for a wide class of non-parametric maximum likelihood estimates.
Such a consideration was behind the remark of Richard Gill I quoted
at the end of Chapter~2.
A detailed proof together with a sharp estimate on the speed of
convergence for the limit behaviour of the Kaplan--Meier
estimate based on the ideas presented in Chapter~2 is given
in paper~\cite{r39}. Paper~\cite{r40} explains more about its
background, and it also discusses the solution of some other
non-parametric maximum likelihood problems. The results about
multiple integrals with respect to a normalized empirical
distribution function needed in these works were proved
in~\cite{r31}. These results were satisfactory for the study
in~\cite{r39}, but they also have some drawbacks. They do
not show that if the random integrals we are considering have
small variances, then they satisfy better estimates. Besides this,
if we consider the supremum of random integrals of an appropriate
class of functions, then these results can be applied only in
very special cases. Moreover, the method of proof of~\cite{r31}
did not allow a real generalization of these results. Hence I
had to find a different approach when I tried to generalize them.
I do not know of other works where the distribution of multiple
random integrals with respect to a normalized empirical distribution
is studied. On the other hand, there are some works where a similar
problem is investigated about the distribution of (degenerate)
$U$-statistics. The most important results obtained in this
field are contained in the book of de la Pe\~na and Gin\'e
{\it Decoupling, From Dependence to Independence}\/~\cite{r8}.
The problems about the behaviour of degenerate $U$-statistics
and multiple integrals with respect to a normalized empirical
distribution function are closely related, but the explanation
of their relation is far from trivial. The main difference
between them is that integration with respect to $\mu_n-\mu$
instead of the empirical distribution $\mu_n$ means some
sort of normalization, while this normalization is missing in
the definition of $U$-statistics. I return to this question
later.
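The difference can be indicated by a formula. If $\xi_1,\dots,\xi_n$
is a sequence of independent random variables with distribution~$\mu$
and empirical distribution~$\mu_n$, then, disregarding the
normalization conventions of the main text (which may differ), a
$k$-fold random integral and a $U$-statistic with a kernel
function~$f$ have the form
$$
J_{n,k}(f)=\int' f(u_1,\dots,u_k)\,
(\mu_n-\mu)(\,du_1)\dots(\mu_n-\mu)(\,du_k),\qquad
I_{n,k}(f)=\sum_{\substack{1\le j_l\le n,\; l=1,\dots,k\\
j_l\ne j_{l'}\textrm{ if } l\ne l'}}
f(\xi_{j_1},\dots,\xi_{j_k}),
$$
where the prime in the integral indicates that the diagonals are
excluded. In the integral $J_{n,k}(f)$ the centering of $\mu_n$
by~$\mu$ provides the normalization mentioned above, while the sum
$I_{n,k}(f)$ contains the non-centered terms
$f(\xi_{j_1},\dots,\xi_{j_k})$.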
Let me finish my discussion about Chapter~2 with some personal
remarks. Here I investigated a special problem. But in my
opinion the method applied in this chapter works well in
several similar problems about the limit behaviour of a
non-linear functional of independent identically distributed
random variables. In the study of such problems we express the
non-linear functional we are investigating as an integral with
respect to the normalized empirical distribution determined by
the random variables we are working with plus some negligibly
small error terms. Then we have to describe the limit behaviour
of the random integral we got, and this can be done with the
help of some classical results of probability theory. Besides
this we have to show that the remaining error terms are really
small. This can be done, but at this point the results discussed
in this work play a crucial role. I believe that a similar
picture arises in many cases. In certain problems it may happen
that the main term is not a one-fold, but a multiple integral
with respect to the normalized empirical distribution. But the
limit distribution of such functionals can also be described.
This is the content of Theorem~$10.4'$ proved in Appendix~C.
\medskip\noindent
{\script Chapter 3}
\medskip\noindent
The main part of this work starts with Chapter~3. A general overview
of the results without the hard technical details can be found
in~\cite{r34}.
First the estimation of sums of independent random variables
or of one-fold random integrals with respect to a normalized empirical
distribution and the supremum of such expressions is investigated
in Chapters~3 and~4. This question has a fairly large literature. I
would mention first of all the books {\it A course on empirical
processes}\/~\cite{r12},
{\it Real Analysis and Probability}\/~\cite{r13} and
{\it Uniform Central Limit Theorems}\/~\cite{r14} of R.~M.~Dudley.
These books contain a much more detailed description of the
empirical processes than the present work together with a lot of
interesting results.
In Chapter~3 I presented the proof of some classical results
about the tail behaviour of sums of independent and bounded random
variables with expectation zero. They are Bernstein's and Bennett's
inequalities. Their proofs can be found in many places, e.g. in
Theorem~1.3.2 of~\cite{r14} and in~\cite{r6}. We are also interested
in the question when these results give the kind of estimate that the
central limit theorem suggests. Actually, as explained in
Chapter~3, Bennett's inequality gives the bound that the
Poissonian approximation of partial sums of independent random
variables suggests. Bernstein's inequality provides an estimate
suggested by the central limit theorem if the variance of the sum
we consider is not too small. The results in Chapter~3 explain
these statements more explicitly. If the variance of the sum is
too small, then Bennett's inequality provides a slight improvement
of Bernstein's inequality. Moreover, as Example~3.3 shows,
Bennett's inequality is essentially sharp in this case. But these
results are much weaker than the estimates suggested by a normal
comparison.
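For the reader's orientation these inequalities can be stated in the
following standard form (the formulation in Chapter~3 may slightly
differ in the constants). Let $\xi_1,\dots,\xi_n$ be independent
random variables with $E\xi_j=0$, $|\xi_j|\le M$, $1\le j\le n$, and
put $V_n=\sum_{j=1}^n E\xi_j^2$. Bernstein's inequality states that
$$
P\left(\sum_{j=1}^n\xi_j>u\right)\le
\exp\left\{-\frac{u^2}{2(V_n+Mu/3)}\right\}
\quad\textrm{for all } u>0,
$$
and Bennett's inequality states that
$$
P\left(\sum_{j=1}^n\xi_j>u\right)\le
\exp\left\{-\frac{V_n}{M^2}\,
h\left(\frac{Mu}{V_n}\right)\right\}
\quad\textrm{with } h(x)=(1+x)\log(1+x)-x.
$$
For $u$ of smaller order than $V_n/M$ the two bounds are of the same
order, and they agree with the bound a Gaussian comparison suggests,
while for $u\gg V_n/M$, i.e.\ for sums with small variance, Bennett's
inequality is slightly stronger.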
%The estimate on the tail distribution of a sum of independent random
%variables is weak if this sum has a small variance. This means that
%in this case the probability that the sum is larger than a given
%value may be much larger than the (rather small) value suggested by
%the central limit theorem. Such a situation may happen if the
%contribution of some unpleasant irregularities to this probability
%is non-negligible.
The relative weakness of Bernstein's and Bennett's inequalities for
random sums with small variance had deep consequences for our
investigation of the supremum (of appropriate classes) of sums
of independent random variables. Because of the weakness of these
estimates in certain cases we had to find a new method. We could
overcome the difficulty we met with the help of a symmetrization
argument which is explained in Chapter~7. But to apply this method
we needed another result, known under the name Hoeff\-ding's
inequality. It yields an estimate about the tail behaviour of
linear combinations of independent Rademacher functions. This
result always provides as good a bound as the central limit
theorem suggests. This is the reason why I discuss this inequality
at the end of Chapter~3, in Theorem~3.4. It is also a classical
result whose proof can be found for instance in~\cite{r24}.
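In one standard form (Theorem~3.4 may be formulated slightly
differently) Hoeffding's inequality states that if
$\varepsilon_1,\dots,\varepsilon_n$ are independent random variables
with $P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, and
$a_1,\dots,a_n$ are arbitrary real numbers, then
$$
P\left(\sum_{j=1}^n a_j\varepsilon_j>u\right)\le
\exp\left\{-\frac{u^2}{2\sum_{j=1}^n a_j^2}\right\}
\quad\textrm{for all } u>0.
$$
Since $\sum_{j=1}^n a_j^2$ is the variance of this sum, this is
exactly the bound a comparison with the normal distribution
suggests, without any restriction on the level~$u$.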
The content of Chapter~3 can be found in the literature, e.g. in
\cite{r12}. The main difference between my discussion and that
of earlier works is that I put more emphasis on the investigation
of the question when the estimates on the tail distribution of
partial sums of independent random variables are similar to
their Gaussian counterpart. I had a good reason to discuss this
question in more detail. I was also interested in the estimation
of the tail distribution of the supremum of partial sums of
independent random variables, and in the study of this problem
we have to understand when the classical methods related to
Gaussian random variables can be applied and when we have to
look for a new approach.
\medskip\noindent
{\script Chapter 4}
\medskip\noindent
Chapter~4 contains the one-variate version of our main result
about the supremum of the integrals of a class ${\cal F}$ of
functions with respect to a normalized empirical measure together
with an equivalent statement about the tail distribution of the
supremum of a class of random sums defined with the help of a
sequence of independent and identically distributed random
variables and a class of functions ${\cal F}$ with some nice
properties. These results are formulated in Theorems~4.1 and~$4.1'$.
They appeared in~\cite{r31}. A Gaussian version of them is also
presented in Theorem~4.2 about the distribution of the supremum
of a Gaussian random field with some appropriate properties. A
deeper version of Theorem~4.2 is studied in paper~\cite{r11}.
These results can be interpreted as saying that if we
take the supremum of random integrals or of random sums
determined by a nice class of functions ${\cal F}$ in the way
described in Chapter~4, then the tail distribution of this
supremum satisfies almost as good an estimate as the `worst
element' of the random variables taking part in this supremum. But
such a result holds only if we consider the value of this tail
distribution at a sufficiently large level, since --- as some
concentration inequalities imply --- the supremum of these
random sums is larger than the expected value of this supremum
with probability almost~one. I also discussed a result in
Example~4.3 which shows that some rather technical conditions
of Theorem~4.1 cannot be omitted.
The most important condition in Theorem~4.1 was that the class of
functions ${\cal F}$ we considered in it is $L_2$-dense. This
property was introduced before the formulation of Theorem~4.1.
One may ask whether one can prove a better version of this result,
which states a similar bound for a different, possibly larger
class of functions~${\cal F}$. It is worth mentioning that
Talagrand proved results similar to Theorem~4.1 for different
classes of functions~${\cal F}$ in his book~\cite{r53}.
These classes of functions are very different from ours, and
Talagrand's results seem to be incomparable with ours. I return
to this question later in the discussion of Chapters~6 and~7,
which deal with the proof of the results of Chapter~4.
In the remaining part of the discussion of Chapter~4 I write
about the notion of countably approximable classes of random
variables and its role in the present work.
In the first formulation of our results we have imposed the
condition that the class of functions~${\cal F}$ is countable,
i.e. we take the supremum of countably many random variables. In
the proofs this condition was heavily exploited. On the other hand,
in some important applications we also need results about the
supremum of a possibly non-countable set of random variables.
To handle such cases I introduced the notion of countably
approximable classes of random variables and proved that in the
results of this work the condition about countability can be
replaced by the weaker condition that the supremum of countably
approximable classes is taken. R.~M.~Dudley worked out a different
method to handle the supremum of possibly non-countably many
random variables, and generally his method is applied in the
literature. The relation between these two methods deserves
some discussion.\index{countably approximable classes of random
variables}
To understand the problem we are discussing let us first recall
that if we take a class of random variables $S_t$, $t\in T$,
indexed by some index set $T$, then for all sets $A$ measurable
with respect to the $\sigma$-algebra generated by the random
variables $S_t$, $t\in T$, there exists a countable subset
$T'=T'(A)\subset T$ such that the set $A$ is measurable also with
respect to the smaller $\sigma$-algebra generated by the random
variables $S_t$, $t\in T'$. Besides this, if the finite dimensional
distributions of the random variables $S_t$, $t\in T$, are given,
then by the results of classical measure theory the probability
of all events measurable with respect to the $\sigma$-algebra
generated by these random variables $S_t$, $t\in T$, is also
determined. But it may happen that we want to deal with such
events whose probability cannot be defined in such a way. In
particular, if $T$ is a non-countable set, then the events
$\left\{\omega\colon\,\sup\limits_{t\in T}S_t(\omega)>u\right\}$
are non-measurable with respect to the above $\sigma$-algebra,
and generally we cannot speak of their probabilities. To overcome
this difficulty Dudley worked out a theory which enabled him to
work also with outer measures. His theory is based on some
rather deep results of analysis. It can be found for instance
in his book~\cite{r14}.
I restricted my attention to such cases when after the completion of
the probability measure $P$ we can also speak of the real (and not
only outer) probabilities $P\left(\sup\limits_{t\in T}S_t>u\right)$.
I tried to find appropriate conditions under which these
probabilities really exist. More explicitly, I was interested in
the case when for all $u>0$ there exists some set $A=A_u$
measurable with respect to the $\sigma$-algebra generated by the
random variables $S_t$, $t\in T$, such that the symmetric
difference of the sets $A_u$ and
$\left\{\omega\colon\,\sup\limits_{t\in T}S_t(\omega)>u\right\}$
is contained in a set which is measurable with respect to the
$\sigma$-algebra generated by the random variables $S_t$, $t\in T$,
and it has probability zero. In such a case the probability
$P\left(\sup\limits_{t\in T}S_t>u\right)$ can be defined as
$P(A_u)$. This approach led me to the definition of countably
approximable classes of random variables. If this property holds,
then we can speak about the probability of the event that the
supremum of the random variables we are interested in is larger
than some fixed value. I proved a simple but
useful result in Lemma~4.4 which provides a condition for the
validity of this property. In Lemma~4.5 I proved with its help
that an important class of functions is countably approximable. It
seems that this property can be proved for many other interesting
classes of functions with the help of Lemma~4.4, but I did not
investigate this question in more detail.
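The definition sketched above can be summarized in the following way
(the precise formulation in Chapter~4 may differ in some details). A
class of random variables $S_t$, $t\in T$, is countably approximable
if there exists a countable subset $T'\subset T$ such that for all
$u>0$ the symmetric difference
$$
\left\{\omega\colon\,\sup_{t\in T}S_t(\omega)>u\right\}
\,\triangle\,
\left\{\omega\colon\,\sup_{t\in T'}S_t(\omega)>u\right\}
$$
is contained in a measurable set of probability zero. Since the
second event is measurable, the probability of the first one can be
defined as the probability of the second one.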
The problem we met here is not an abstract, technical difficulty.
Indeed, the distribution of the supremum of uncountably many
random variables can become different if we modify each random
variable on a set of probability zero, although their finite
dimensional distributions remain the same after such an operation.
Hence, if we are interested in the probability of the supremum
of a non-countable set of random variables with prescribed finite
dimensional distributions, we have to specify more explicitly
which version of this set of random variables we consider. It
is natural to look for such an appropriate version of the
random field $S_t$, $t\in T$, whose `trajectories' $S_t(\omega)$,
$t\in T$, have nice properties for all elementary events
$\omega\in\Omega$. Lemma~4.4 can be interpreted as a result in
this spirit. The condition given for the countable
approximability of a class of random variables at the end of
this lemma can be considered as a smoothness type condition about
the `trajectories' of the random field we consider. This
approach shows some analogy to some important problems in the
theory of stochastic processes when a regular version of a
stochastic process is considered, and the smoothness properties
are investigated for the trajectories of this version.
In our problems the version of the set of random variables
$S_t$, $t\in T$, we work with appears in a simple and natural
way. In these problems we have finitely many random variables
$\xi_1,\dots,\xi_n$ at the start, and all random variables
$S_t(\omega)$, $t\in T$, we are considering can be defined
individually for each $\omega$ as a function of these random
variables $\xi_1(\omega),\dots,\xi_n(\omega)$. We take the
version of the random field $S_t(\omega)$, $t\in T$, we get in
such a way and want to show that it is countably approximable.
In Chapter~4 this property is proved in an important model,
probably the most important model for the possible applications
we are interested in. In more complicated situations, when our
random variables are defined not as a function of finitely
many sample points, for instance when we define
our set of random variables by means of integrals with respect
to a Gaussian random field, it is harder to find the right
regular version of our sets of random variables. In this case the
integrals we consider are defined only with probability~1, and it
demands some extra work to find their right version. But in
the problems studied in this work the above sketched approach is
satisfactory for our purposes, and it is simpler than that of
Dudley; we do not have to follow his rather difficult technique.
On the other hand, I must admit that I do not know the precise
relation between the approach of this work and that of Dudley.
\medskip\noindent
{\script Chapter 5}
\medskip\noindent
In Chapter~4 the notion of $L_p$-dense classes, $1\le p<\infty$,
has also been introduced. The notion of $L_2$-dense classes
appeared in the formulation of Theorems~4.1 and~$4.1'$. It can be
considered as a version of the $\varepsilon$-entropy, discussed
at many places in the literature. (See e.g.~\cite{r12}
or~\cite{r13}.) On the other hand, there seems to be no standard
definition of the $\varepsilon$-entropy. The notion of $L_2$-dense
classes seemed to be the appropriate one to work with in this
lecture note. To apply the results related to $L_2$-dense classes
we also need some knowledge about how to check this property in
concrete models. For this goal I discussed here
Vapnik--\v{C}ervonenkis classes, a popular and important notion
of modern probability theory. Several books and papers (see e.g.
the books~\cite{r14}, \cite{r45},~\cite{r54} and the references
in them) deal with this subject. An important result in this
field is Sauer's lemma (Theorem~5.1), which together with some
other results, like Theorem~5.3, implies that several interesting
classes of sets or functions are Vapnik--\v{C}ervonenkis
classes.\index{Vapnik-\v{C}ervonenkis classes of sets and
functions}
I put the proof of these results in the Appendix, partly because
they can be found in the literature, partly because in this work
Vapnik--\v{C}ervonenkis classes play a different and less important
role than at other places. Here Vapnik--\v{C}ervonenkis classes are
applied to show that certain classes of functions are $L_2$-dense.
At this point a result of Dudley formulated in Theorem~5.2 plays an
important role. It implies that a Vapnik--\v{C}ervonenkis class of
functions with absolute value bounded by a fixed constant is an
$L_1$-dense, and as a consequence also an $L_2$-dense, class of functions.
The proof of this important result, which seems to be less known
even among experts of this subject than it deserves, is
contained in the main text. Dudley's original result was
formulated in the special case when the functions we consider are
indicator functions of some sets. But its proof contains all
important ideas needed in the proof of Theorem~5.2. A proof of the
result in the form formulated in this work can be found
in~\cite{r45}. This book also contains the other results of this
chapter about Vapnik--\v{C}ervonenkis classes.
%\vfill\eject
\medskip\noindent
{\script Chapters 6 and 7}
\medskip\noindent
Theorem 4.2, which is the Gaussian counterpart of Theorems~4.1
and~$4.1'$, is proved in Chapter~6 by means of a natural and
important technique, called the chaining argument.\index{chaining
argument} This means the application of an inductive procedure,
in which an appropriate sequence of finite subsets of the original
set of random variables is introduced, and a good estimate is
given on the supremum of the random variables in these subsets.
These subsets become denser in the original set of random
variables at each step of the procedure. This chaining argument is a popular
method in certain investigations. It is hard to say to whom it
should be attributed. Its introduction may be connected to some works of
R.~M.~Dudley. It is worth mentioning that Talagrand~\cite{r53}
worked out a sharpened version of it which yields a sharper and
more useful estimate in the study of certain problems. But it
seems to me that in the study of the problems of this work this
improvement is of limited importance; it turns out to be useful
in the study of different problems.
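The idea behind the chaining argument can be sketched in the
following schematic form (the actual procedure of Chapter~6 is more
refined). Let $T_0\subset T_1\subset T_2\subset\cdots$ be finite
subsets of the index set~$T$, and for each $t\in T$ let $\pi_k(t)$
denote a point of $T_k$ `close' to~$t$. If
$S_{\pi_k(t)}\to S_t$ in an appropriate sense as $k\to\infty$, then
$$
\sup_{t\in T}S_t\le
\sup_{t\in T}S_{\pi_0(t)}+
\sum_{k=1}^\infty\sup_{t\in T}
\left(S_{\pi_k(t)}-S_{\pi_{k-1}(t)}\right),
$$
and each supremum on the right-hand side is taken over finitely many
random variables, hence it can be bounded by means of an estimate on
the tail distribution of the individual terms together with the
union bound.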
Theorem 4.2 can be proved by means of the chaining argument, but
this method is not strong enough to supply a proof of Theorem~4.1.
It provides only a weak estimate in this case, because there is
no good estimate on the probability that a sum of independent
random variables is greater than a prescribed value if these
random variables have too small variances. As a consequence, the
chaining argument supplies a much weaker estimate than the result
we want to prove under the conditions of Theorem~4.1. Proposition~6.1
contains the result the chaining argument yields under these
conditions. In Chapter~6 still another result, Proposition~6.2, is
formulated. It can be considered as a special case of Theorem~4.1
where only the supremum of partial sums with small variances is
estimated. We also show in this chapter that Propositions~6.1 and~6.2
together imply Theorem~4.1. The proof is not difficult, despite
some unattractive details. It has to be checked that the
parameters in Propositions~6.1 and~6.2 can be fitted to each other.
Proposition~6.2 is proved in Chapter~7. It is based on a symmetrization
argument. This proof applies the ideas of a paper of Kenneth
Alexander~\cite{r3}, and although its presentation is different from
Alexander's approach, it can be considered as a version of his proof.
It may be worth mentioning that the symmetrization arguments were
first applied in the theory of Vapnik--\v{C}ervonenkis classes
to get some useful estimates (see e.g.~\cite{r45}). But it turned
out that an appropriate refinement of this method supplies sharper
results if we are working with $L_2$-dense classes instead of
Vapnik--\v{C}ervonenkis classes of functions.
A similar problem should also be mentioned at this place.
M.~Talagrand wrote a series of papers about concentration
inequalities (see e.g. \cite{r51} or \cite{r52}), and his
research was also continued by some other authors. I would
mention the works of M.~Ledoux~\cite{r28} and
P.~Massart~\cite{r42}. Concentration inequalities give a
bound about the difference between the supremum of a set of
appropriately defined random variables and the expected value
of this supremum. They express how strongly this supremum is
concentrated around its expected value. Such results are closely
related to Theorem~4.1, and the discussion of their relation
deserves some attention. A typical concentration inequality is
the following result of Talagrand~\cite{r52}.\index{concentration
inequalities}
\medskip\noindent
{\bf Theorem 18.1 (Theorem of Talagrand).} {\it Consider $n$
independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ with values in some measurable space
$(X,{\cal X})$. Let ${\cal F}$ be some countable family of
real-valued measurable functions on $(X,{\cal X})$ such that
$\|f\|_\infty\le b<\infty$ for every $f\in{\cal F}$. Let
$Z=\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f(\xi_i)$ and
$v=E\left(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i)\right)$.
Then for every positive number~$x$
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac 1{K'}\frac
xb\log\left(1+\frac{xb}v\right)\right\}
$$
and
$$
P(Z\ge EZ+x)\le K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
$$
where $K$, $K'$, $c_1$ and $c_2$ are universal positive constants.
Moreover, the same inequalities hold when replacing $Z$ by $-Z$.}
\medskip
Theorem~18.1 yields, similarly to Theorem~4.1, an estimate about
the distribution of the supremum for a class of sums of independent
random variables. (The paper of P.~Massart~\cite{r42} contains a
similar estimate which is better for our purposes. The main
difference between these two estimates is that the bound given by
Massart depends on $\sigma^2=\sup\limits_{f\in{\cal F}}
\sum\limits_{i=1}^n \textrm{Var}\,f(\xi_i)$ instead of
$v=E\left(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i)\right)$.)
Theorem~18.1 can be considered as a generalization of
Bernstein's and Bennett's inequalities when the distribution of the
supremum of partial sums (and not only the distribution of one
partial sum) is estimated. A remarkable feature of this
result is that it assumes no condition about the structure of the
class of functions ${\cal F}$ (like the condition of $L_2$-dense
property of the class ${\cal F}$ imposed in Theorem~4.1). On the
other hand, the estimates in Theorem~18.1 contain the quantity
$EZ=E\left(\sup\limits_{f\in{\cal F}}
\sum\limits_{i=1}^n f(\xi_i)\right)$. Such an
expectation of some supremum appears in all concentration
inequalities. As a consequence, they are useful only if we can
bound the expected value of the supremum we want to estimate.
It is difficult to find a good bound on this expected value in the
general case. Paper~\cite{r17} provides a useful estimate on it if
the expected value of the supremum of random sums is considered
under the conditions of Theorem~4.1. But I preferred a direct
proof of this result.
Let me remark that, because of the above mentioned concentration
inequalities, the condition
$u\ge\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$
with some appropriate constant, which cannot be dropped from
Theorem~4.1, can be interpreted as saying that under the conditions
of Theorem~4.1 the quantity
$\textrm{const.}\,\sigma\log^{1/2}\frac2\sigma$
is an upper bound for the expected value of the supremum
investigated in this result. Example~4.3 implies that if the
conditions of Theorem~4.1 are violated, then the expected value
of the above supremum may be larger.
It is also worth mentioning Talagrand's work~\cite{r53} which
contains several interesting results similar to Theorem~4.1.
But despite their formal similarity, they are essentially
different from the results of this work. This difference
deserves a special discussion.
By working out a more refined, better version of the chaining
argument, Talagrand proved in~\cite{r53} a sharp upper bound for the
expected value $E\sup\limits_{t\in T}\xi_t$ of the supremum of
countably many (jointly) Gaussian random variables with zero
expectation. This bound is sharp. Indeed, Talagrand also proved
a lower bound for this expected value, and the quotient of his
upper and lower bound is bounded by a universal constant.
By applying similar arguments he also gave an upper bound for
$E\sup\limits_{f\in{\cal F}}\sum\limits_{k=1}^N f(\xi_k)$
in Proposition~2.7.2 of his book if $\xi_1,\dots,\xi_N$ is a
sequence of independent, identically distributed random variables
with some known distribution~$\mu$, and ${\cal F}$ is a class of
functions with some nice properties. Then he proved in Chapter~3
of this book some estimates with the help of this result for
certain models which solved some problems that could not be
solved with the help of the original version of the chaining
argument.
Let me make a short comparison between our Theorem~4.1 and
Talagrand's result. Talagrand investigated in his book~\cite{r53}
the expected value of the supremum of partial sums, while we
gave an estimate on its tail distribution. But this is not an
essential difference. Talagrand's results also give an estimate
on the tail distribution of the supremum by means of
concentration inequalities, and actually his proofs also provide
a direct estimate for the tail distribution we are interested in
without the application of these results. The main difference
between the two works is that Talagrand's method gives a sharp
estimate for different classes of functions~${\cal F}$.
Talagrand could prove sharp results in such cases when the class
of functions ${\cal F}$ for which the supremum is taken consists of
smooth functions. An example of such classes of functions which he
thoroughly investigated is the class of Lipschitz~1 functions. In
particular, in Chapter~3 of his book~\cite{r53} he proved that if
$\xi_1,\dots,\xi_n$ is a sequence of independent random variables,
uniformly distributed in the unit square $D=[0,1]\times[0,1]$, and
${\cal F}$ is the class of Lipschitz~1 functions on the unit
square~$D$ such that $\int_D f\,d\lambda=0$ for all $f\in{\cal F}$,
where $\lambda$ denotes the Lebesgue measure on~$D$, then
$E\sup\limits_{f\in{\cal F}}\sum\limits_{l=1}^n f(\xi_l)
\le L\sqrt{n\log n}$ with a universal constant~$L$. He was
interested in this result, because it is equivalent to a theorem
of Ajtai--Koml\'os--Tusn\'ady~\cite{r2}.
(See Chapter~3 of~\cite{r53} for details.) On the other hand, we
can give sharp results in cases when ${\cal F}$ consists of
non-smooth functions (see Example~5.5), and Talagrand's method
does not work in the study of such problems.
This difference in the conditions of the results in these two
books is not a small technical detail. Talagrand heavily
exploited in his proof that he worked with such classes of
functions~${\cal F}$ from which he could select a subclass of
functions of ${\cal F}$ of relatively small cardinality which is
dense in ${\cal F}$ not only in the $L_2(\mu)$-norm with the
probability measure~$\mu$ he was working with, but also in the
supremum norm. He needed this property, because this enabled
him to get sharp estimates on the tail distribution of the
differences of functions he had to work with by means of
Bernstein's inequality. The smallness of the supremum norm of
these random variables was useful, since it implied that
Bernstein's inequality provides a sharp estimate in a large
domain. Talagrand needed such sharp estimates to apply (a
refined version of) the chaining argument. On the other hand,
we considered such classes of functions ${\cal F}$ which may
have no small subclasses which are dense in ${\cal F}$ in the
supremum norm.
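To make this point more transparent, let me recall Bernstein's
inequality in one of its usual formulations (the constants may
slightly differ from those in Theorem~3.1): if $\eta_1,\dots,\eta_n$
are independent random variables with $E\eta_j=0$, $|\eta_j|\le M$,
$1\le j\le n$, and $\sum\limits_{j=1}^n E\eta_j^2\le\sigma^2$, then
$$
P\left(\left|\sum_{j=1}^n\eta_j\right|>u\right)
\le2\exp\left(-\frac{u^2}{2\sigma^2+\frac23Mu}\right)
\qquad\textrm{for all }u>0.
$$
If the bound $M$ on the supremum norm is small, then the term
$\frac23Mu$ in the denominator is negligible for a wide range
of~$u$, and the above bound stays close to the Gaussian-type bound
$\exp\left(-\frac{u^2}{2\sigma^2}\right)$ in a large domain. This
is the property exploited in the chaining argument.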
I would characterize the difference between the results of the
two works in the following way. Talagrand proved the sharpest
possible estimates which can be obtained by a refinement of
the chaining argument, while our main problem was to get sharp
estimates also in such cases when the chaining argument does
not work. Let me remark that we could prove our results only
for classes of functions ${\cal F}$ which are $L_2$-dense.
(See Theorem~4.1.) In the Gaussian counterpart of this result,
in Theorem~4.2, it was enough to impose that ${\cal F}$ is
an $L_2$-dense class with respect to a fixed probability
measure~$\mu$. We needed the extra condition of the $L_2$-density
property to prove sharp results about the tail distribution of the
supremum of partial sums when the chaining argument does not work.
\medskip\noindent
{\script Chapter 8}
\medskip\noindent
The main results of this work are presented in Chapter~8. One of
them is Theorem~8.3 which is a multivariate version of Bernstein's
inequality (Theorem~3.1) about degenerate $U$-statistics. A weaker
version of this result was first proved in a paper of Arcones
and Gin\'e in~\cite{r4}. In the present form it was proved in my
paper~\cite{r37}. Its version about multiple integrals with respect
to a normalized empirical measure formulated in Theorem~8.1 is
proved in~\cite{r33}. That paper contains a direct proof. On the
other hand, Theorem~8.1 can also be derived from Theorem~8.3 by
means of Theorem~9.4 of the present note. Theorem~8.5 is the natural Gaussian
counterpart of Theorem~8.3. The limit theorem about degenerate
$U$-statistics, Theorem~10.4 (and its version about limit theorems
for multiple integrals with respect to normalized empirical measures,
presented in Theorem~$10.4'$ of Appendix~C) was discussed in this
work to explain better the relation between degenerate $U$-statistics
(or multiple integrals with respect to normalized empirical
measures) and multiple Wiener--It\^o integrals. A proof of this
result, based on ideas similar to those discussed here, can be found
in~\cite{r15}. Theorem~6.6 of my lecture note~\cite{r30}
contains a weaker version of Theorem~8.5 which does not
take into account the variance of the random integral we are
considering.
Example~8.7 is a natural supplement of Theorem~8.5. It shows
that the estimate of Theorem~8.5 is sharp if only the variance
of a Wiener--It\^o integral is known. At the end of Chapter~13
I also mentioned, without proof, the results of the papers~\cite{r1}
and~\cite{r27}, which also have some relation to this problem. I
discussed mainly the content of~\cite{r27}, and explained its
relation to some results discussed in this work. The proofs in
these papers apply a method different from those in this work. I
make some comments about them in the discussion of Chapter~13.
Theorems~8.2 and~8.4, which are the natural multivariate
counterparts of Theorems~4.1 and~$4.1'$, yield an estimate about
the supremum of (degenerate) $U$-statistics or of multiple random
integrals with respect to a normalized empirical measure when
the class of kernel functions in these $U$-statistics or random
integrals satisfies some conditions. They were proved in my
paper~\cite{r35}. Actually I consider these theorems the hardest
and most important results of this lecture note. Earlier Arcones
and Gin\'e proved a weaker version of this result in paper~\cite{r5},
but their work did not help in the proof of the results of this
note. The proofs of the present note were based on an adaptation
of Alexander's method~\cite{r3} to the multivariate case.
Theorem~8.6 is the natural Gaussian counterpart of Theorems~8.2
and~8.4.
Example~8.8 in Chapter~8 shows that the condition
$u\le\textrm{const.}\, n\sigma^3$ imposed in Theorem~8.3 in
the case $k=2$ cannot be dropped. The paper of Arcones and
Gin\'e~\cite{r4} contains another example explained by Talagrand
to the authors of that paper which also has a similar consequence.
But that example does not provide such an explicit comparison
of the upper and lower bound on the probability investigated
in Theorem~8.3 as Example~8.8. Similar examples could be
constructed for all $k\ge1$.
Example~8.8 shows that at high levels only a very weak (and from
a practical point of view not really important) improvement of the
estimation on the tail distribution of degenerate $U$-statistics
is possible. But probably there exists a multivariate version of
Bennett's inequality, i.e.\ of Theorem~3.2, which would provide such
an estimate. Moreover, there is some hope to get a similar
strengthened form of Theorems~8.2 and~8.4 (or of Theorem~4.2 in
the one-dimensional case). This question is not investigated in
the present work.
\medskip\noindent
{\script Chapter 9}
\medskip\noindent
Chapter~9 deals with the properties of $U$-statistics. Its
first result, Theorem~9.1, is a classical result. It
is the so-called Hoeffding decomposition of $U$-statistics into
the sum of degenerate $U$-statistics. Its proof first appeared
in the paper~\cite{r23}, but it can be found at many places.
The explanation of this work contains some ideas similar
to~\cite{r50}. I tried to explain that Hoeffding's decomposition
is the natural multivariate version of the (trivial)
decomposition of a sum of independent random variables into the sum
of independent random variables {\it with expectation zero}\/ plus
the sum of the expectations of the original random variables.
Moreover, even the proof of Hoeffding's decomposition shows
some similarity to this simple decomposition.
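To make this analogy explicit, observe that in the one-variate case
the decomposition mentioned above simply reads
$$
\sum_{j=1}^n f(\xi_j)=\sum_{j=1}^n\left(f(\xi_j)-Ef(\xi_j)\right)
+nEf(\xi_1)
$$
for independent and identically distributed random variables
$\xi_1,\dots,\xi_n$. Roughly speaking, Hoeffding's decomposition
replaces the centering $f\mapsto f-Ef$ by an iterated centering in
each coordinate of a kernel function of $k$ variables, and in such
a way it represents a $U$-statistic of order~$k$ as a linear
combination of degenerate $U$-statistics of orders $0,1,\dots,k$.
(The precise formulation is given in Theorem~9.1.)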
Theorem~9.2 and Proposition~9.3 can be considered as a continuation
of the investigation about the Hoeffding decomposition. They tell
us how some properties of the kernel function of the original
$U$-statistic are inherited in the properties of the kernel
functions of the degenerate $U$-statistics taking part in its
Hoeffding decomposition. In several applications of Hoeffding's
decomposition we need such results.
The last result of Chapter~9, Theorem~9.4, enables us to reduce the
estimation of multiple random integrals with respect to normalized
empirical measures to the estimation of degenerate $U$-statistics.
This result is a version of Hoeffding's decomposition, where
instead of $U$-statistics multiple integrals with respect to a
normalized empirical distribution are decomposed to the sum of
{\it degenerate}\/ $U$-statistics. In these two decompositions
the same degenerate $U$-statistics appear. The main difference
between them is that in the decomposition of the random integrals
in Theorem~9.4 the coefficients of the degenerate $U$-statistics
are relatively small. The appearance of small coefficients in
this decomposition is due to the cancellation effect caused by
integration with respect to a {\it normalized}\/ empirical
measure $\sqrt n(\mu_n-\mu)$. Theorem~9.4 was proved
in~\cite{r35}. The proof in this note is essentially
different from the original proof in~\cite{r35}, and it is simpler.
\medskip\noindent
{\script Some remarks related to Chapters 10, 11 and 12}
\medskip\noindent
Theorem~8.1 can be derived from Theorem~8.3 and Theorem~8.2 from
Theorem~8.4 by means of Theorem~9.4. The proof of the latter
results is simpler. Chapters~10--12 contain the results needed
in the proof of Theorem~8.3 and of its Gaussian counterpart
Theorems~8.5 and~8.6. They are proved by means of good
estimates on the high moments of degenerate $U$-statistics and
multiple Wiener--It\^o integrals. The classical proof of the
one-variate counterparts of these results is based on a good
estimate of the moment generating function. This method had to
be replaced by the estimation of high moments, because the
moment generating function of a $k$-fold Wiener--It\^o integral
is divergent for all non-zero parameters if $k\ge3$ (this is
a consequence of Theorem~13.6), and this property of
Wiener--It\^o integrals is also reflected in the behaviour of
degenerate $U$-statistics. On the other hand, we can give good
estimates on the tail distribution of a random variable if we
have good estimates on its high moments. The results of
Chapters~10, 11 and~12 enable us to prove good moment estimates.
I know of two deep and interesting methods to study high moments
of multiple Wiener--It\^o integrals. The first of them is called
Nelson's inequality, named after Edward Nelson, who published it in
his paper~\cite{r44}. This inequality simply implies Theorem~8.5
about multiple Wiener--It\^o integrals, although with worse
constants. Later Leonhard Gross discovered a deep and useful
generalization of this result which he published in the work
{\it Logarithmic Sobolev inequalities}\/~\cite{r20}. Gross
considered in his paper a {\it stationary}\/ Markov process $X(t)$,
$t\ge0$, and gave a good bound on the $L_p$-norm of functions
of the form $U_t(f)(x)=E(f(X(t))|X(0)=x)$, where the $L_p$-norm
is taken with respect to the distribution of the random variable
$X(0)$. The proof of this $L_p$-norm estimate is based on the
study of the infinitesimal operator of the Markov
process. Gross' results provide Nelson's inequality if they
are applied to the Ornstein--Uhlenbeck process.
Gross' investigation in~\cite{r20} revealed very much about
the behaviour of Markov processes. The book \cite{r44b} is
partly based on this method. Gross' approach turned out to
be very fruitful in the study of several hard problems of
probability theory and statistical physics. (See e.g.~\cite{r21}
or~\cite{r28}.) It also provides a good estimate for the high
moments of Wiener--It\^o integrals.
There is another useful method to study Wiener--It\^o integrals
due to Kiyoshi It\^o and Roland L'vovich Dobrushin. This seemed
to me more useful if we want to estimate the high moments not only
of Wiener--It\^o integrals but also of degenerate $U$-statistics.
I applied this method in Chapters~10, 11 and~12. I showed
how we can get with its help results that enable us to prove
good moment estimates both for Wiener--It\^o integrals and
degenerate $U$-statistics. The main step in this approach is
the proof of a so-called diagram formula which makes it possible
to rewrite a product of Wiener--It\^o integrals as a sum of
Wiener--It\^o integrals. Moreover, this result also has a
natural counterpart for the products of degenerate
$U$-statistics.
\medskip\noindent
{\script Chapter 10}
\medskip\noindent
In Chapter~10 I discuss a method due to Kiyoshi It\^o and Roland
L'vovich Dobrushin. This is the theory of multiple Wiener--It\^o
integrals with respect to a white noise. This integral was
introduced in paper~\cite{r25}. It is useful, because every random
variable which is measurable with respect to the $\sigma$-algebra
generated by the Gaussian random variables of the underlying white
noise and has finite second moment can be written as the sum of
Wiener--It\^o integrals of different order. Moreover, if only
Wiener--It\^o integrals of symmetric kernel functions are taken,
then this representation is unique. Actually this result was
originally proved by Norbert Wiener~\cite{r55}. This representation
also appeared in physics under the name Fock space. It plays an
important role in quantum physics. Let me briefly explain the
reason for the name white noise
\index{white noise with some reference measure $\mu$}
for the appropriate notion introduced in Chapter~10.
The notion of white noise was originally introduced at a heuristic
level as the derivative of the trajectories of a Wiener process. But
as these trajectories are non-differentiable the introduction of
this notion demands a better explanation. A natural way to overcome
the difficulties is to consider the derivative of a
trajectory of a Wiener process as a generalized random function,
and to take its integral on all measurable sets. In such a way
we get a collection of Gaussian random variables $\xi(A)$ with
expectation zero, indexed by the measurable sets~$A$. These random
variables have correlation function
$E\xi(A)\xi(B)=\lambda(A\cap B)$, where $\lambda(\cdot)$ denotes
the Lebesgue measure. In such a way we get a correct definition of
the white noise which preserves the heuristic content of the
original approach. In the definition of general white noise we
allow to work with an arbitrary measure~$\mu$ and not only with
the Lebesgue measure~$\lambda$. If we have a white noise we would
like to have a tool that enables us to study not only the Gaussian
random variables measurable with respect to the $\sigma$-algebra
generated by the random variables of the white noise but all
random variables measurable with respect to this $\sigma$-algebra.
The Wiener--It\^o integrals were defined with such a goal.
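Let me also recall the defining properties of this notion in a
concise form. A white noise with reference measure~$\mu$ is a
collection of jointly Gaussian random variables $\xi(A)$, indexed
by the measurable sets $A$ with $\mu(A)<\infty$, such that
$$
E\xi(A)=0,\qquad E\xi(A)\xi(B)=\mu(A\cap B).
$$
The additivity of a white noise is a consequence of this
correlation structure: if $A\cap B=\emptyset$, then
$$
E\left(\xi(A\cup B)-\xi(A)-\xi(B)\right)^2
=\mu(A\cup B)-\mu(A)-\mu(B)=0,
$$
hence $\xi(A\cup B)=\xi(A)+\xi(B)$ with probability~1.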
An important result of the theory of Wiener--It\^o integrals, the
so-called diagram formula, formulated in Theorem~10.2, expresses
products of Wiener--It\^o integrals as a sum of such integrals.
This result, which shows some similarity to the Feynman diagrams
applied in statistical physics, was proved in~\cite{r10}.
Actually this paper discussed a modified version of Wiener--It\^o
integrals which is more appropriate to study the action of shift
operators for non-linear functionals of a stationary Gaussian
field. But these modified Wiener--It\^o integrals can be
investigated in almost the same way as the original ones. The
diagram formula has a simple consequence formulated in the
Corollary of Theorem~10.2 of this note. It enables us to calculate the
expectation of products of Wiener--It\^o integrals. It yields
an explicit formula for them. This result was applied in the
proof of Theorem~8.5, i.e.\ in the estimation of the
tail distribution of Wiener--It\^o integrals. It\^o's formula
for multiple Wiener--It\^o integrals (Theorem~10.3) was proved
in~\cite{r25}.
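Let me indicate the shape of these results in the simplest setting,
with a frequently applied normalization of the integrals. (The
normalization chosen in Chapter~10 may introduce different constant
factors.) If $I_k(f)$ denotes the $k$-fold Wiener--It\^o integral of
a symmetric kernel function $f\in L_2(\mu^k)$, then
$$
EI_k(f)I_l(g)=0\quad\textrm{if }k\neq l,\qquad
EI_k(f)I_k(g)=k!\int fg\,d\mu^k,
$$
and It\^o's formula states in its simplest special case that
$$
I_k(f^{\otimes k})=H_k(I_1(f))\quad\textrm{if }\int f^2\,d\mu=1,
$$
where $H_k$ denotes the $k$-th Hermite polynomial with leading
coefficient~1.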
Actually the above results about Wiener--It\^o integrals would
have been sufficient for our purposes. But I also presented some
other results for the sake of completeness. In particular, I
discussed some results about Hermite polynomials. Wiener--It\^o
integrals are closely related to Hermite polynomials or to their
multivariate version, to the so-called Wick polynomials. (See
e.g.~\cite{r30} or~\cite{r41} for the definition of Wick
polynomials.) Appendix~C contains the most important properties
of Hermite polynomials needed in the study of Wiener--It\^o
integrals. In particular, it contains the proof of Proposition~C2
about the completeness of the Hermite polynomials in the Hilbert
space of the functions square integrable with respect to the
standard Gaussian distribution. This result can be found for
instance in Theorem~5.2.7 of~\cite{r49}. In the present proof I
wanted to show that this result is closely related to the
so-called moment problem, i.e.\ to the question when a
distribution is determined by its moments uniquely. The method
of proof described in this note can be applied with some
refinement to prove some generalizations of Proposition~C2 about
the completeness of orthogonal polynomials with respect to more
general weight functions.
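For the reader's convenience let me recall the basic formulas
behind these properties in the standard probabilistic
normalization, with Hermite polynomials of leading coefficient~1:
$$
H_k(x)=(-1)^ke^{x^2/2}\frac{d^k}{dx^k}e^{-x^2/2},\qquad
\frac1{\sqrt{2\pi}}\int_{-\infty}^\infty
H_k(x)H_l(x)e^{-x^2/2}\,dx=k!\,\delta_{k,l},
$$
so in particular $H_0(x)=1$, $H_1(x)=x$, $H_2(x)=x^2-1$ and
$H_3(x)=x^3-3x$.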
On the other hand, I did not try to give a complete picture
about Wiener--It\^o integrals. The reader interested in it
may consult the book of S.~Janson~\cite{r25a}.
There are also other interesting and important topics related
to Wiener--It\^o integrals not discussed in this work. In some
investigations of probability theory and statistical physics
it is useful to study not only moments but also cumulants
(called also semiinvariants in the literature) of
Wiener--It\^o integrals. It is also useful to study the
moments and cumulants of polynomials and Wick polynomials
of Gaussian random vectors. The book of
V.~A.~Malyshev and R.~A.~Minlos~\cite{r41} contains many
interesting results about this subject.
Another interesting and popular subject not discussed in this
work is the problem of limit theorems for Wiener--It\^o integrals.
In particular, one is interested in the question when a sequence
of such random integrals satisfies the central limit theorem.
The study of such problems heavily exploits the diagram
formula, or more precisely its consequence about the calculation
of moments and cumulants. In some works, see~e.g.~\cite{r44b}
or~\cite{r44d}, this subject is worked out in detail. Moreover, a
popular subject of recent research is the study of the speed of
convergence in the central limit theorem. In such investigations
the so-called Stein method turned out to be very useful. In its
application the integrals of sufficiently smooth test functions
with respect to the distribution we are investigating are
estimated together with the integrals of their derivatives (with
respect to the same distribution). In a somewhat surprising way
it turned out that if we are studying the central limit theorem
for Wiener--It\^o integrals with the help of the Stein method,
then the role of the derivative of a function is taken by the
so-called Malliavin derivative. (See~\cite{r44b}.) So the theory
of Malliavin calculus, see~\cite{r44c}, became very important in
such research. But this problem is a bit far from the main
subject of this work, hence I do not go into the details.
%\vfill\eject
\medskip\noindent
{\script Chapters 11 and 12}
\medskip\noindent
The diagram formula has a natural and useful analogue both for
degenerate $U$-statistics and multiple integrals with respect to
a normalized empirical measure. They enable us to rewrite the
product of degenerate $U$-statistics and multiple integrals as
the sum of such expressions. Actually the proof of these results
is simpler than the proof of the original diagram formula for
Wiener--It\^o integrals. They make it possible to adapt several
useful methods of the study of non-linear functionals of
Gaussian random fields to the study of non-linear functionals of
normalized empirical measures. But to apply them we also need
some good estimate on the $L_2$-norm of the kernel functions of
the random integrals or $U$-statistics appearing in the diagram
formula. Hence we also proved such results.
A version of the diagram formula was proved for degenerate
$U$-statistics in~\cite{r37} and for multiple random integrals
with respect to a normalized empirical measure in~\cite{r33}.
Let me remark that in the formulation of the result in the
work~\cite{r37} a different notation was applied than in the
present note. In that paper I wanted to formulate such a version
of the diagram formula for $U$-statistics where we work with
diagrams similar to those introduced in the study of Wiener--It\^o
integrals. I could do this only in a somewhat artificial way. In
this work I formulated the diagram formula for $U$-statistics with
the help of diagrams of a more general form. I introduced the
notion of chains and coloured chains, and defined (coloured)
diagrams with their help. The formulation of the results with
the help of such more general diagrams seems to be more natural.
I met some works where similar diagrams were introduced, see
e.g.~\cite{r44d}, but I did not meet works where also the
coloured diagrams introduced in this work were applied. It is
possible that this is so because I do not know the
literature well enough, but it may also have a different
cause.
In the work~\cite{r44d} the diagram formula was applied for the
calculation of moments and cumulants, and if we are working only
with them, then the results of this work can also be formulated
with the help of so-called closed diagrams, and no coloured
diagrams are needed. They are needed if we want to express the
product of $U$-statistics as a sum of $U$-statistics. It may
also be interesting that the results considered in~\cite{r44d}
are based on some combinatorial arguments worked out
in~\cite{r46}.
There are some works like~\cite{r44d}, where diagram formulas
are considered for other models too, e.g. in models where
we integrate with respect to a normalized Poisson process.
Nevertheless, in my opinion the results about the diagram
formula for the products of Wiener--It\^o integrals and
in particular their modified versions for the products of
integrals with respect to normalized Poisson processes,
normalized empirical distribution or for the product of
$U$-statistics did not get the attention in the
literature that they deserve. An interesting paper in
this direction is that of Surgailis~\cite{r47}, where a
version of the diagram formula is proved for Poissonian
integrals. It may be worth mentioning that the diagram
formula for Poisson integrals shows a very strong similarity
to the diagram formula for the product of integrals with
respect to normalized empirical distributions. (Integrals
with respect to normalized empirical distribution were
discussed only at an informal level in this work.)
The Hermite polynomials and their multivariate versions, the
Wick polynomials have their counterparts when instead of
Wiener--It\^o integrals we consider more general classes of
random integrals. It\^o's formula creates a relation between
Wiener--It\^o integrals and Hermite polynomials or their
multivariate versions, the Wick polynomials. The relation
between Wiener--It\^o integrals and Hermite polynomials has
a natural counterpart in the study of other multiple random
integrals. In such a way a new notion, the Appell polynomials
appeared in the literature. (See e.g.~\cite{r48}.)
\medskip\noindent
{\script Chapter 13}
\medskip\noindent
Theorems~8.3,~8.5 and~8.7 were proved on the basis of the
results of Chapters~10--12 in Chapter~13. These proofs are slight
modifications of those given in~\cite{r37}. An earlier proof
of a result similar to Theorem~8.3 based on a different method
was given by Arcones and Gin\'e in~\cite{r4}. Theorem~8.3 is a
slightly stronger estimate than that of Arcones and Gin\'e.
It provides at not too high levels an estimate with almost as
good constants in the exponent as the corresponding estimate
about Wiener--It\^o integrals in Theorem~8.5. Chapter~13 also
contains the proof of a multivariate version of Hoeffding's
inequality formulated in Theorem~13.3. This result is needed in
the symmetrization argument applied in the proof of Theorem~8.4.
A weaker version of it (an estimate with a worse constant in the
exponent) which would be satisfactory for our purposes simply
follows from a classical result, called Borell's inequality,
which was proved in~\cite{r7a}. But since the methods needed to
prove this result are not discussed in this note, and I was
interested in a proof which yields an estimate with the best
possible constant in the exponent, I chose another proof, given
in~\cite{r36}. It is based on the results of Chapters~10--12.
Later I have learned that this estimate is contained in an
implicit form also in the paper~\cite{r7} of Aline Bonami.
In Part~B of Chapter~13 I discussed some results related to the
problems considered in this work. I would like to make some
comments about the result of R.~Lata{\l}a presented in Theorem~13.7.
The estimates of this result depend on such quantities which are
hard to calculate. Hence they have a limited importance in the
problems I had in mind when working on this lecture note. On
the other hand, such results and the methods behind them may be
interesting in the study of some problems of statistical physics,
e.g. in the problems discussed in~\cite{r52a}. I would like to
remark that Lata{\l}a's proof works only for decoupled and not
for usual $U$-statistics. Formally, this is not a restriction,
because the results of de la Pe\~na and Montgomery-Smith
(see~\cite{r9}) enable us to extend their validity also to
usual $U$-statistics. Nevertheless, the lack of a direct
proof of this estimate for $U$-statistics disturbs me a bit,
because this means for me that we do not really understand
this result. I have some ideas how to get the desired proof,
but it demands some time and energy to work out the details.
\medskip\noindent
{\script Chapter 14}
\medskip\noindent
Chapters~14--17 are devoted to the proof of Theorems~8.4 and~8.6.
They are based on an argument similar to that of their one-variate
counterparts, Theorems~4.1 and~4.2. The proof of Theorem~8.6
about the supremum of Wiener--It\^o integrals is based, similarly
to the proof of Theorem~4.2, on the chaining argument. In the
proof of Theorem~8.4 the chaining argument yields only a weaker
result formulated in Proposition~14.1 which helps to reduce
Theorem~8.4 to the proof of Proposition~14.2. In the one-variate
case a similar approach was applied. In that case the proof of
Theorem~4.1 was reduced to that of Proposition~6.2 by means of
Proposition~6.1. The next step in the proof of Theorem~8.4 has
no one-variate counterpart. The notion of so-called decoupled
$U$-statistics was introduced, and Proposition~14.2 was reduced
to a similar result about decoupled $U$-statistics formulated
in Proposition~$14.2'$.
The adjective `decoupled' in the expression decoupled $U$-statistic
refers to the fact that it is a version of a $U$-statistic
in which independent copies of a sequence of independent and
identically distributed random variables are substituted into the
different coordinates of the kernel function. Their study is a popular
subject of some mathematical schools. In particular, the main
topic of the book~\cite{r8} is a comparison of the properties
of $U$-statistics and decoupled $U$-statistics. A result of
de la Pe\~na and Montgomery-Smith~\cite{r9} formulated in
Theorem~14.3 helps in reducing some problems about
$U$-statistics to a similar problem about decoupled
$U$-statistics. In this lecture note the proof of Theorem~14.3
is given in Appendix~D. It follows the argument of the original
proof, but several steps are worked out in detail where the
authors gave only a very short explanation. Paper~\cite{r9}
also contains some kind of converse results to~Theorem~14.3,
but as they are not needed in the present work, I omitted
their discussion.
Decoupled $U$-statistics behave similarly to the original
$U$-statistics. Beside this, some symmetrization arguments
become considerably simpler if we are working with decoupled
$U$-statistics instead of the original ones, because decoupled
$U$-statistics have stronger independence properties. This can
be exploited in some investigations. For example the proof of
Proposition~$14.2'$ is simpler than a direct proof of
Proposition~14.2. On the other hand, Theorem~14.3 enables us
to reduce the proof of Proposition~14.2 to that of
Proposition~$14.2'$, and we have exploited this possibility.
Let me finally remark that although our proofs could be
simplified with the help of decoupled $U$-statistics, they
could have been carried out without them. But this would
have demanded a much more complicated notation that would have
made the proofs much less transparent. Hence I decided
to introduce decoupled $U$-statistics and to work with them.
\medskip\noindent
{\script Chapters 15, 16 and 17}
\medskip\noindent
The proof of Theorem~8.4 was reduced to that of Proposition~$14.2'$
in Chapter~14. Chapters~15--17 deal with the proof of this result.
The original proof was given in my paper~\cite{r35}. It is similar
to that of its one-variate version, Proposition~6.2, but some
additional difficulties have to be overcome. The main difficulty
appears when we want to find the multivariate analogue of the
symmetrization argument which could be carried out in the
one-variate case by means of Lemmas~7.1 and~7.2.
In the multivariate case Lemma~7.1 is not sufficient for our
purposes. So we work instead with a generalized version of
this result, formulated in Lemma~15.2. The proof of Lemma~15.2
is not hard. It is a simple and natural modification of the proof
of Lemma~7.1. The real difficulty arises when we want to apply it
in the proof of Proposition~$14.2'$. When we applied the
symmetrization argument of Lemma~7.1 in the proof of Proposition~6.2
we worked with two independent sequences of random variables
$Z_n$ and $\bar Z_n$. In the analogous symmetrization argument
of Lemma~15.2, applied in the proof of Proposition~$14.2'$, we had
to work with two not necessarily independent sequences of random
variables $Z_p$ and $\bar Z_p$. This has the consequence that it
is much harder to check condition~(\ref{(15.3)}) needed in the
application of Lemma~15.2 than the analogous
condition~(\ref{(7.1)}) in Lemma~7.1. The hardest problems in
the proof of Proposition~$14.2'$ appear at this point.
Proposition $14.2'$ was proved by means of an inductive procedure
formulated in Proposition 15.3, which is the multivariate analogue
of Proposition~7.3. A basic ingredient of both proofs was a
symmetrization argument. But while this symmetrization argument
could be simply carried out in the one-variate case, its
adaptation to the multivariate case was a much harder problem.
To overcome this difficulty another inductive statement was
formulated in Proposition~15.4. Propositions~15.3 and~15.4 could
be proved simultaneously by means of an appropriate inductive
procedure. Their proofs were based on a refinement of the
arguments in the proof of Proposition~7.3. But some new
difficulties arose. In the proof of Proposition~7.3 we could
simply apply Lemma~7.2, and it provided the necessary
symmetrization argument. On the other hand, the verification
of the corresponding symmetrization argument in the proof of
Propositions~15.3 and~15.4 was much harder. Actually this
was the subject of Chapter~16. After this we could prove
Propositions~15.3 and~15.4 in Chapter~17 similarly to
Proposition~7.3, although some additional technical
difficulties arose also at this point. Here we needed the
multivariate version of Hoeff\-ding's inequality, formulated
in Theorem~13.3 and some properties of the Hoeff\-ding
decomposition of $U$-statistics proved in Chapter~9.
\appendix
\chapter{The proof of some results about
Vapnik--\v{C}ervonenkis classes}
\label{introA}
\medskip\noindent
{\it Proof of Theorem 5.1. (Sauer's lemma).}\/\index{Sauer's lemma}
This result has several different proofs. Here I write down a
relatively simple proof of P. Frankl and J. Pach which appeared
in~\cite{r16}. It is based on some linear algebraic arguments.
The following equivalent reformulation of Sauer's lemma will be
proved. Let us take a set $S=S(n)$ consisting of $n$ elements and
a class ${\cal E}$ of subsets of $S$ consisting of $m$ subsets
$E_1,\dots,E_m\subset S$. Assume that $m\ge m_0+1$ with
$m_0=m_0(n,k)={n\choose0}+{n\choose1}+\cdots+{n\choose{k-1}}$.
Then there exists a set $F\subset S$ of cardinality $k$ which
is shattered by the class of sets ${\cal E}$. Actually, it is
enough to show that there exists a set $F$ of cardinality
greater than or equal to~$k$ which is shattered by the class
of sets ${\cal E}$, because if a set has this property, then
all of its subsets have it. This latter statement will be proved.
To prove this statement let us first list the subsets
$X_1,\dots,X_{m_0}$ of the set $S$ of cardinality less than or equal
to $k-1$, and assign to each set $E_i\in{\cal E}$ the vector
$e_i=(e_{i,1},\dots,e_{i,m_0})$, $1\le i\le m$, with coordinates
$$
e_{i,j}=\left\{
\begin{array}{l}
1\quad\textrm{if }X_j\subseteq E_i \\
0\quad\textrm{if }X_j\not\subseteq E_i
\end{array}
\right. \qquad 1\le i\le m, \textrm{ and } 1\le j\le m_0.
$$
Since $m>m_0$, the vectors $e_1,\dots,e_m$ are linearly dependent.
Because of the definition of the vectors $e_i$, $1\le i\le m$,
this can be expressed in the following way: There is a non-zero
vector $(f(E_1),\dots,f(E_m))$ such that
\begin{equation}
\sum_{E_i\colon\, E_i\supseteq X_j} f(E_i)=0 \quad \textrm{for all }
1\le j\le m_0. \label{(A1)}
\end{equation}
Let $F$, $F\subset S$, be a {\it minimal}\/ set with the property
\begin{equation}
\sum_{E_i\colon\, E_i\supseteq F} f(E_i)=\alpha\neq0. \label{(A2)}
\end{equation}
Such a set $F$ indeed exists, since every maximal element of the
family $\{E_i\colon\, 1\le i\le m,\, f(E_i)\neq0\}$ satisfies
relation (\ref{(A2)}). The requirement that $F$ should be a
minimal set means
that if $F$ is replaced by some $H\subset F$, $H\neq F$, on the
left-hand side of~(\ref{(A2)}), then this expression equals zero. The
inequality $|F|\ge k$ holds because of relation (\ref{(A1)}) and the
definition of the sets $X_j$. Indeed, if we had $|F|\le k-1$, then
$F$ would be one of the sets $X_j$, and relation~(\ref{(A1)}) would
contradict relation~(\ref{(A2)}).
Introduce the quantities
$$
Z_F(H)=\sum_{E_i\colon\, E_i\cap F=H} f(E_i)
$$
for all $H\subseteq F$.
Then $Z_F(F)=\alpha$, and for any set of the form $H=F\setminus\{x\}$,
$x\in F$,
$$
Z_F(H)=\sum_{E_i\colon\, E_i\cap F=H} f(E_i)
=\sum_{E_i\colon\, E_i\supseteq H}f(E_i)
-\sum_{E_i\colon\, E_i\supseteq F}f(E_i)=0-\alpha=-\alpha
$$
because of the minimality property of the set $F$.
Moreover, the identity
\begin{equation}
Z_F(H)=(-1)^p\alpha \quad\textrm{for all } H\subseteq F
\textrm{ such that } |H|=|F|-p, \; 0\le p\le |F| \label{(A3)}
\end{equation}
holds. To show relation (\ref{(A3)}) observe that
\begin{equation}
Z_F(H)= \!\sum_{E_i\colon\, E_i\cap F=H} \! f(E_i)=\sum_{j=0}^p
(-1)^j\! \sum_{G\colon\,H\subseteq G\subseteq F,\;|G|=|H|+j} \,\,
\sum_{E_i\colon\, E_i\supseteq G}\! f(E_i) \label{(A4)}
\end{equation}
for all sets $H\subseteq F$ with cardinality $|H|=|F|-p$.
Identity~(\ref{(A4)}) holds, since the term $f(E_i)$ is
counted on the right-hand side of~(\ref{(A4)})
$\sum\limits_{j=0}^l (-1)^j{l\choose j}=(1-1)^l=0$ times if
$E_i\cap F=G$ with some $H\subset G\subseteq F$ with $|G|=|H|+l$
elements, $1\le l\le p$, while in the case $E_i\cap F=H$ it is
counted exactly once. Relation~(\ref{(A4)}) together with~(\ref{(A2)})
and the minimality
property of the set~$F$ imply relation~(\ref{(A3)}).
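The binomial counting behind identity~(\ref{(A4)}) can also be checked mechanically. A minimal Python sketch (the helper `signed_count` is our own naming, not from the text) computes the signed number of intermediate sets $G'$ with $H\subseteq G'\subseteq G$ when $|G\setminus H|=l$:

```python
from itertools import combinations

# Signed count of the sets G' with H ⊆ G' ⊆ G when |G \ H| = l:
# choosing G' amounts to choosing j of the l free elements of G \ H,
# so the total is sum_j (-1)^j * C(l, j) = (1 - 1)^l.
def signed_count(l):
    free = range(l)  # stand-ins for the l elements of G \ H
    return sum((-1) ** j * sum(1 for _ in combinations(free, j))
               for j in range(l + 1))

print(signed_count(0))                         # 1: the case E_i ∩ F = H
print([signed_count(l) for l in range(1, 6)])  # [0, 0, 0, 0, 0]: larger G
```

This is exactly the dichotomy used in the proof: a term $f(E_i)$ with $E_i\cap F=H$ is counted once, every other term zero times.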
It follows from relation~(\ref{(A3)}) and the definition of
the function $Z_F(H)$ that for all sets $H\subseteq F$ there
exists some set $E_i$ such that $H=E_i\cap F$, i.e. $F$ is
shattered by ${\cal E}$. Since $|F|\ge k$, this implies
Theorem~5.1.
\hfill$\qed$
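The reformulation of Sauer's lemma proved above can be verified exhaustively for small parameters. The following Python sketch (all function names are our own) checks, for $n=4$ and $k=2$, that every family of $m_0+1=6$ distinct subsets of a four-element set shatters some two-element set:

```python
from itertools import combinations
from math import comb

def shatters(family, F):
    """Does the family trace out all 2^|F| subsets of F?"""
    traces = {frozenset(E) & frozenset(F) for E in family}
    return len(traces) == 2 ** len(F)

def has_shattered_set(family, n, k):
    """Is some k-element subset of {0,...,n-1} shattered by the family?"""
    return any(shatters(family, F) for F in combinations(range(n), k))

n, k = 4, 2
m0 = sum(comb(n, i) for i in range(k))  # C(4,0) + C(4,1) = 5
all_subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(n), r)]
# every family of m0 + 1 = 6 distinct subsets (C(16,6) = 8008 families)
assert all(has_shattered_set(fam, n, k)
           for fam in combinations(all_subsets, m0 + 1))
print("every family of", m0 + 1, "subsets of a 4-element set shatters a pair")
```

The bound is sharp: the family consisting of the empty set and the four singletons has $m_0=5$ elements and shatters no pair.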
\medskip\noindent
{\it Proof of Theorem 5.3.}\/ Let us fix an arbitrary subset
$F=\{x_1,\dots,x_{k+1}\}$ of the set $X$, and consider the set of
vectors
${\cal G}_k(F)=\{(g(x_1),\dots,g(x_{k+1}))\colon\, g\in{\cal G}_k\}$
of the $(k+1)$-dimensional space $R^{k+1}$. By the conditions of
Theorem~5.3, ${\cal G}_k(F)$ is an at most $k$-dimensional subspace of
$R^{k+1}$. Hence there exists a non-zero vector
$a=(a_1,\dots,a_{k+1})$ such that
$\sum\limits_{j=1}^{k+1} a_jg(x_j)=0$ for all $g\in{\cal G}_k$. We
may assume that the set $A=A(a)=\{j\colon\, a_j<0,\, 1\le j\le k+1\}$
is non-empty, by multiplying the vector $a$ by $-1$ if necessary.
Thus the identity
\begin{equation}
\sum_{j\in A} a_jg(x_j)=\sum_{j\in \{1,\dots,k+1\}\setminus A}
(-a_j)g(x_j),\qquad \textrm{for all }g\in{\cal G}_k \label{(A5)}
\end{equation}
holds. Put $B=\{x_j\colon\, j\in A\}$. Then $B\subset F$, and
$F\setminus B\neq\{x\colon\, g(x)\ge0\}\cap F$ for all
$g\in{\cal G}_k$. Indeed, if there were some $g\in {\cal G}_k$
such that $F\setminus B=\{x\colon\, g(x)\ge0\}\cap F$, then
the left-hand side of equation (\ref{(A5)}) would be strictly
positive (as $a_j<0$ and $g(x_j)<0$ if $j\in A$, and
$A\neq\emptyset$), while its right-hand side would be non-positive for
this $g\in{\cal G}_k$, and this is a contradiction.
The property proved above means that ${\cal D}$ shatters no set
$F\subset X$ of cardinality~$k+1$. Hence Theorem~5.1
implies that ${\cal D}$ is a Vapnik--\v{C}ervonenkis class.
\hfill$\qed$
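Theorem~5.3 can be illustrated on the simplest example, which is our own addition and not taken from the text: the one-dimensional function space $g_c(x)=cx$ on the real line yields the class of half-lines $\{x\colon\,cx\ge0\}$, and in accordance with the theorem (with $k=1$) this class shatters no two-point set. A small Python check:

```python
from itertools import combinations

# G is the 1-dimensional function space g_c(x) = c*x; the class D consists
# of the sets {x : g_c(x) >= 0}.  The trace of D on a pair of points depends
# only on the sign of c, so one slope per sign class suffices for the check.
points = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
slopes = [-1.0, 0.0, 1.0]

for F in combinations(points, 2):
    traces = {tuple(c * x >= 0 for x in F) for c in slopes}
    assert len(traces) < 4, F  # never all four subsets of the pair
print("no 2-point set is shattered by the half-lines {x : c*x >= 0}")
```

By contrast, the two-dimensional space of affine functions $a+bx$ does shatter pairs, so the dimension bound in the theorem is used in an essential way.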
\chapter{The proof of the diagram formula for
Wiener--It\^o integrals}
\label{introB}
We start the proof of Theorem~10.2A (the diagram formula for
the product of two Wiener--It\^o integrals) with the proof of
inequality (\ref{(10.11)}).\index{diagram formula for Wiener--It\^o
integrals} To show that this relation holds
let us observe that the Cauchy inequality yields
the following bound on the function $F_\gamma(f,g)$ defined
in~(\ref{(10.10)}) (with the notation introduced there):
\begin{eqnarray}
&&F^2_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\,\;(1,j)\in
V_1(\gamma),\, (2,j')\in V_2(\gamma)) \nonumber \\
&&\qquad\le
\int f^2(x_{\alpha_\gamma(1,1)},\dots,x_{\alpha_\gamma(1,k)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)} \mu(\,dx_{(2,j)}) \nonumber \\
&&\qquad\qquad
\int g^2(x_{(2,1)},\dots,x_{(2,l)})
\prod_{(2,j)\in\{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu(\,dx_{(2,j)}).
\label{(B1)}
\end{eqnarray}
The expression on the right-hand side of inequality~(\ref{(B1)})
is the product of two functions with different arguments. The first
function depends on the arguments $x_{(1,j)}$ with $(1,j)\in V_1(\gamma)$,
and the second one on $x_{(2,j')}$ with $(2,j')\in V_2(\gamma)$.
By integrating both sides of inequality~(\ref{(B1)}) with respect
to these arguments we get inequality~(\ref{(10.11)}).
Relation (\ref{(10.12)}) will be proved first for the product
of the Wiener--It\^o integrals of two elementary functions.
Let us consider two (elementary) functions $f(x_1,\dots,x_k)$
and $g(x_1,\dots,x_l)$ given in the following form: Let some
disjoint sets $A_1,\dots,A_M$, $\mu(A_s)<\infty$, $1\le s\le M$,
be given together with some real numbers $c(s_1,\dots,s_k)$
indexed with such $k$-tuples $(s_1,\dots,s_k)$, $1\le s_j\le M$,
$1\le j\le k$, for which the numbers $s_1,\dots,s_k$ in a
$k$-tuple are all different. Put
$f(x_1,\dots,x_k)=c(s_1,\dots,s_k)$ if
$(x_1,\dots,x_k)\in A_{s_1}\times\cdots\times A_{s_k}$ with
some vector $(s_1,\dots,s_k)$ with different coordinates,
and let $f(x_1,\dots,x_k)=0$ if $(x_1,\dots,x_k)$ is outside
of these rectangles. Take similarly some disjoint sets
$B_1,\dots,B_{M'}$, $\mu(B_t)<\infty$, $1\le t\le M'$, and
some real numbers $d(t_1,\dots,t_l)$, indexed with such
$l$-tuples $(t_1,\dots,t_l)$, $1\le t_{j'}\le M'$,
$1\le j'\le l$, for which the numbers $t_1,\dots,t_l$ in an
$l$-tuple are different. Put $g(x_1,\dots,x_l)=d(t_1,\dots,t_l)$
if $(x_1,\dots,x_l)\in B_{t_1}\times\cdots\times B_{t_l}$ with
some of the above introduced $l$-tuples $(t_1,\dots,t_l)$,
and let $g(x_1,\dots,x_l)=0$ otherwise.
Let us take some small number $\varepsilon>0$ and rewrite
the above introduced functions $f(x_1,\dots,x_k)$ and
$g(x_1,\dots,x_l)$ with the help of this number
$\varepsilon>0$ in the following way. Divide the sets
$A_1,\dots,A_M$ into smaller sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{s=1}^{M(\varepsilon)} A_s^\varepsilon=
\bigcup\limits_{s=1}^{M} A_s$, in such a way that all sets
$A_1^\varepsilon,\dots,A_{M(\varepsilon)}^\varepsilon$ are
disjoint, and $\mu(A^\varepsilon_s)\le\varepsilon$,
$1\le s\le M(\varepsilon)$. Similarly, take sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$,
$\bigcup\limits_{t=1}^{M'(\varepsilon)} B_t^\varepsilon
=\bigcup\limits_{t=1}^{M'} B_t$, in such a way that all
sets
$B_1^\varepsilon,\dots,B_{M'(\varepsilon)}^\varepsilon$
are disjoint, and $\mu(B^\varepsilon_t)\le\varepsilon$,
$1\le t\le M'(\varepsilon)$. Besides this, let us also
demand that any two sets $A_s^\varepsilon$ and
$B_t^\varepsilon$, $1\le s\le M(\varepsilon)$,
$1\le t\le M'(\varepsilon)$, are either disjoint or
they coincide. Such a partition exists because of the
non-atomic property of measure $\mu$. The above defined
functions $f(x_1,\dots,x_k)$ and $g(x_1,\dots,x_l)$ can be
rewritten by means of these new sets $A^\varepsilon_s$ and
$B^\varepsilon_t$. Namely, let
$f(x_1,\dots,x_k)=c^\varepsilon(s_1,\dots,s_k)$ on the
rectangles
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}$
with $1\le s_j\le M(\varepsilon)$, $1\le j\le k$, with
different indices $s_1,\dots,s_k$, where
$c^\varepsilon(s_1,\dots,s_k)=c(p_1,\dots,p_k)$ with
those indices $(p_1,\dots,p_k)$ for which
$A^\varepsilon_{s_1}\times\cdots\times A^\varepsilon_{s_k}\subset
A_{p_1}\times\cdots\times A_{p_k}$.
The function $f$ vanishes outside these rectangles.
The function $g(x_1,\dots,x_l)$ can be written similarly
in the form $g(x_1,\dots,x_l)=d^\varepsilon(t_1,\dots,t_l)$
on the rectangles
$B^\varepsilon_{t_1}\times\cdots\times B^\varepsilon_{t_l}$
with $1\le t_{j'}\le M'(\varepsilon)$, $1\le j'\le l$, and
different indices $t_1,\dots,t_l$. Besides this, the
function~$g$ vanishes outside these rectangles.
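The partition step above can be illustrated in the simplest setting, which is our own addition: $\mu$ is Lebesgue measure on an interval and the sets $A_s$, $B_t$ are subintervals (non-atomicity is then automatic). Splitting at all breakpoints of both partitions and cutting each piece to length at most $\varepsilon$ produces pieces that are pairwise non-overlapping or identical, exactly as demanded. A minimal sketch (the function `common_refinement` is our own naming):

```python
import math

def common_refinement(breaks_a, breaks_b, eps):
    """Refine two interval partitions of the same interval: split at every
    breakpoint of both partitions, then cut each piece into subintervals of
    length at most eps.  Each original interval becomes a union of the
    returned pieces, and distinct pieces overlap at most in endpoints."""
    pts = sorted(set(breaks_a) | set(breaks_b))
    pieces = []
    for lo, hi in zip(pts, pts[1:]):
        n = max(1, math.ceil((hi - lo) / eps))
        pieces += [(lo + i * (hi - lo) / n, lo + (i + 1) * (hi - lo) / n)
                   for i in range(n)]
    return pieces

pieces = common_refinement([0.0, 0.4, 1.0], [0.0, 0.25, 0.7, 1.0], eps=0.1)
# every piece has measure at most eps (up to floating-point rounding)
assert all(b - a <= 0.1 + 1e-12 for a, b in pieces)
# the breakpoints of both partitions survive as piece endpoints, so each
# original interval is a union of pieces
assert {0.25, 0.4, 0.7} <= {p for ab in pieces for p in ab}
print(len(pieces), "pieces, all of length at most 0.1")
```

For a general non-atomic finite measure the same two-stage argument applies: first take the common refinement of the two families, then split each piece into parts of measure at most $\varepsilon$.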
The above representation of the functions $f$ and $g$
through a parameter $\varepsilon$ is useful, since it
enables us to give a good asymptotic formula for the
product $k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)$ which yields the
diagram formula for the product of Wiener--It\^o integrals
of elementary functions with the help of a limiting
procedure $\varepsilon\to0$.
Fix a small number $\varepsilon>0$, take the
representation of the functions $f$ and $g$ with
its help, and write
\begin{equation}
k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)
=\sum_{\gamma\in \Gamma(k,l)} Z_\gamma(f,g,\varepsilon)
\label{(B2)}
\end{equation}
with
\begin{eqnarray}
Z_\gamma(f,g,\varepsilon)&&={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l) \nonumber \\
&&\qquad
\mu_W(A^\varepsilon_{s_1})\dots\mu_W(A^\varepsilon_{s_k})
\mu_W(B^\varepsilon_{t_1})\dots\mu_W(B^\varepsilon_{t_l}),
\label{(B3)}
\end{eqnarray}
where $\Gamma(k,l)$ denotes the class of diagrams introduced before
the formulation of Theorem~10.2A, and $\sum^\gamma$ denotes
summation for $(k+l)$-tuples $(s_1,\dots,s_k,t_1,\dots,t_l)$ such that
$1\le s_j\le M(\varepsilon)$, $1\le j\le k$,
$1\le t_{j'}\le M'(\varepsilon)$,
$1\le j'\le l$, and
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ if it is an edge of $\gamma$,
and otherwise all sets $A^\varepsilon_{s_j}$ and
$B^\varepsilon_{t_{j'}}$ are
disjoint. (This sum also depends on $\varepsilon$.) In the
case of an empty sum $Z_\gamma(f,g,\varepsilon)$ equals zero.
We write the expression $Z_\gamma(f,g,\varepsilon)$ for all
$\gamma\in\Gamma(k,l)$ in the form
\begin{equation}
Z_\gamma(f,g,\varepsilon)=Z_\gamma^{(1)}(f,g,\varepsilon)
+Z_\gamma^{(2)}(f,g,\varepsilon),
\quad \gamma\in\Gamma(k,l), \label{(B4)}
\end{equation}
with
\begin{eqnarray}
Z^{(1)}_\gamma(f,g,\varepsilon)
&&={\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l) \nonumber \\
&&\qquad\prod_{j\colon\, (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad \prod_{j'\colon\, (2,j')\in\{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)}\mu(B^\varepsilon_{t_{j'}})
\label{(B5)}
\end{eqnarray}
and
\begin{eqnarray}
Z^{(2)}_\gamma(f,g,\varepsilon)
&&={\sum}^\gamma
c^\varepsilon(s_1,\dots,s_k) d^\varepsilon(t_1,\dots,t_l)
\nonumber \\
&&\qquad \prod_{j\colon\, (1,j)\in V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad \biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\mu_W(A^\varepsilon_{s_j}) \nonumber \\
&& \qquad\qquad\qquad
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&& \qquad\qquad -\prod_{j'\colon\, (2,j')\in\{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)}\mu(B^\varepsilon_{t_{j'}})\biggr], \label{(B6)}
\end{eqnarray}
where $V_1(\gamma)$ and $V_2(\gamma)$ (introduced before
formula~(\ref{(10.9)}) during the preparation to the formulation of
Theorem~10.2A) are the sets of vertices in the first and second
row of the diagram $\gamma$ from which no edge starts.
I claim that there is some constant $C>0$ not depending on
$\varepsilon$ such that
\begin{equation}
E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z^{(1)}_\gamma(f,g,\varepsilon)\right)^2\le C\varepsilon
\quad \textrm{for all } \gamma\in\Gamma(k,l) \label{(B7)}
\end{equation}
with the Wiener--It\^o integral with the kernel function
$F_\gamma(f,g)$ defined in (\ref{(10.9)}), (\ref{(10.9a)})
and (\ref{(10.10)}), and
\begin{equation}
E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
\le C\varepsilon\quad\textrm{for all } \gamma\in\Gamma(k,l).
\label{(B8)}
\end{equation}
Relations~(\ref{(B2)}), (\ref{(B4)}), (\ref{(B7)}) and~(\ref{(B8)})
imply relation~(\ref{(10.12)}) if $f$ and $g$ are elementary
functions. Indeed, (\ref{(B4)}), (\ref{(B7)}) and~(\ref{(B8)})
imply that
$$
\lim_{\varepsilon\to0}\left\|\,|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))
-Z_\gamma(f,g,\varepsilon)\right\|_2\to0
\quad\textrm{for all }\gamma\in\Gamma(k,l),
$$
and this relation together with (\ref{(B2)}) yield
relation (\ref{(10.12)}) with
the help of a limiting procedure $\varepsilon\to0$.
To prove relation (\ref{(B7)}) let us introduce the function
\begin{eqnarray*}
&&F^\varepsilon_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in
V_1(\gamma),\; (2,j')\in V_2(\gamma))\\
&&\qquad=
F_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))\\
&&\qquad\qquad\quad\textrm{if } x_{(1,j)}\in A^\varepsilon_{s_j},
\textrm{ for all } (1,j)\in V_1(\gamma),\\
&&\qquad\qquad\quad\textrm{ } x_{(2,j')}\in B^\varepsilon_{t_{j'}},
\textrm{ for all } (2,j')\in V_2(\gamma), \quad\textrm{and}\\
&& \qquad\qquad\quad\textrm{ all sets }
A^\varepsilon_{s_j},\; (1,j)\in V_1(\gamma),
\textrm{ and } B^\varepsilon_{t_{j'}}, \; (2,j')\in V_2(\gamma)
\textrm{ are different.}
\end{eqnarray*}
with the function~$F_\gamma(f,g)$ defined in~(\ref{(10.9a)})
and~(\ref{(10.10)}), and put
$$
F^\varepsilon_\gamma(f,g)(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\;
(2,j')\in V_2(\gamma))=0 \quad
\textrm{otherwise.}
$$
The function $F_\gamma^\varepsilon(f,g)$ is elementary, and
a comparison of its definition with relation~(\ref{(B5)})
and the definition of the function $F_\gamma(f,g)$ yields that
\begin{equation}
Z_\gamma^{(1)}(f,g,\varepsilon)=|\gamma|!
Z_{\mu,|\gamma|}(F_\gamma^\varepsilon(f,g)). \label{(B9)}
\end{equation}
The function $F^\varepsilon_\gamma(f,g)$ slightly differs
from $F_\gamma(f,g)$, since the function $F_\gamma(f,g)$ may not
vanish at such points
$(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma),\,
(2,j')\in V_2(\gamma))$ for which there is some pair $(j,j')$
with the property $x_{(1,j)}\in A^\varepsilon_{s_j}$ and
$x_{(2,j')}\in B^\varepsilon_{t_{j'}}$ with some sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$ such that
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$, while
$F_\gamma^\varepsilon(f,g)$ must be zero at such points. On the other
hand, in the case $|\gamma|=\max(k,l)-\min(k,l)$, i.e. if one
of the sets $V_1(\gamma)$ or $V_2(\gamma)$ is empty,
$F_\gamma(f,g)=F^\varepsilon_\gamma(f,g)$, \
$Z_\gamma^{(1)}(f,g,\varepsilon)
=|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))$, and
relation~(\ref{(B7)}) clearly holds for such diagrams $\gamma$.
In the case $|\gamma|>\max(k,l)-\min(k,l)$, i.e.\ if both sets
$V_1(\gamma)$ and $V_2(\gamma)$ are non-empty, we prove a good estimate
on the measure of the set where $F_\gamma(f,g)\neq F_\gamma^\varepsilon(f,g)$
with respect to an appropriate power of the measure~$\mu$.
Relation~(\ref{(B7)}) will be proved with the help of this estimate
and formula~(\ref{(B9)}).
Let us define the sets $A=\bigcup\limits_{s=1}^{M(\varepsilon)}
A^\varepsilon_s$ and
$B=\bigcup\limits_{t=1}^{M'(\varepsilon)} B^\varepsilon_t$.
These sets $A$ and $B$ do
not depend on the parameter $\varepsilon$. Besides this,
$\mu(A)<\infty$ and $\mu(B)<\infty$. Define for all pairs
$(j_0,j_0')$ such that $(1,j_0)\in V_1(\gamma)$,
$(2,j_0')\in V_2(\gamma)$ the set
\begin{eqnarray*}
D(j_0,j'_0)
&&=\{(x_{(1,j)},x_{(2,j')},\; (1,j)\in V_1(\gamma), \,
(2,j')\in V_2(\gamma)) \colon\\
&&\quad x_{(1,j_0)}\in A^\varepsilon_s, \;
x_{(2,j'_0)}\in B^\varepsilon_t
\; \textrm{ with some } 1\le s\le M(\varepsilon) \textrm{ and }
1\le t\le M'(\varepsilon) \\
&&\qquad\qquad\textrm{such that }
A^\varepsilon_s=B^\varepsilon_t,\quad\textrm{and }
\quad x_{(1,j)}\in A\textrm{ for all } (1,j)\in V_1(\gamma), \\
&&\qquad\qquad \textrm{ and }x_{(2,j')}\in B
\textrm{ for all }(2,j')\in V_2(\gamma)\}.
\end{eqnarray*}
Introduce the notation $x^\gamma=(x_{(1,j)},x_{(2,j')},\;
(1,j)\in V_1(\gamma),\,(2,j')\in V_2(\gamma))$, and consider
only such vectors $x^\gamma$ whose coordinates satisfy the
conditions $x_{(1,j)}\in A$ for all $(1,j)\in V_1(\gamma)$
and $x_{(2,j')}\in B$ for all $(2,j')\in V_2(\gamma)$. Put
$$
D_\gamma=\{x^\gamma\colon\,
F^\varepsilon_\gamma(f,g)(x^\gamma)\neq F_\gamma(f,g)(x^\gamma)\}.
$$
The relation $D_\gamma\subset\bigcup\limits_{j_0\colon\,(1,j_0)\in
V_1(\gamma)}\;\bigcup\limits_{j'_0\colon\,(2,j'_0)\in V_2(\gamma)}
D(j_0,j'_0)$ holds. Indeed, if
$F^\varepsilon_\gamma(f,g)(x^\gamma)\neq F_\gamma(f,g)(x^\gamma)$
for some vector~$x^\gamma$, then there are indices
$(1,j_0)\in V_1(\gamma)$ and $(2,j'_0)\in V_2(\gamma)$ such that
$x_{(1,j_0)}\in A^\varepsilon_s$ and
$x_{(2,j'_0)}\in B^\varepsilon_t$ with some sets
$A^\varepsilon_s=B^\varepsilon_t$, and the relation in the last
line of the definition of $D(j_0,j'_0)$ must also hold for
such a vector $x^\gamma$, since otherwise
$F_\gamma(f,g)(x^\gamma)=0=F^\varepsilon_\gamma(f,g)(x^\gamma)$.
I claim that there is some constant $C_1$ such that
$$
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D(j_0,j'_0))\le C_1\varepsilon
\quad\textrm{for all sets } D(j_0,j'_0),
$$
where $\mu^{|V_1(\gamma)|+|V_2(\gamma)|}$
denotes the direct product of the measure $\mu$ on some copies of
the original space $(X,{\cal X})$ indexed by $(1,j)\in V_1(\gamma)$
and $(2,j')\in V_2(\gamma)$. To see this relation one has to
observe that
$\sum\limits_{(s,t)\colon\,A^\varepsilon_s=B^\varepsilon_t}
\mu(A^\varepsilon_s)\mu(B^\varepsilon_t)\le
\varepsilon\sum\limits_{s=1}^{M(\varepsilon)}
\mu(A^\varepsilon_s)=\varepsilon\mu(A)$.
Thus the set $D(j_0,j'_0)$ can be covered by the direct product
of a set whose $\mu$ measure is not greater than
$\varepsilon\mu(A)$ and of a rectangle whose edges are
either the set $A$ or the set $B$.
The above relations imply that
\begin{equation}
\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C_2\varepsilon
\label{(B10)}
\end{equation}
with some constant $C_2>0$.
Relation (\ref{(B9)}), estimate (\ref{(B10)}), the
property~c) formulated in
Theorem~10.1 for Wiener--It\^o integrals and the observation that
the function $F_\gamma(f,g)$ is bounded in supremum norm
if $f$ and $g$ are elementary functions imply the inequality
\begin{eqnarray*}
&&E\left(|\gamma|!Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z^{(1)}_\gamma(f,g,\varepsilon)\right)^2 \\
&&\qquad =(|\gamma|!)^2E\left( Z_{\mu,|\gamma|}
(F_\gamma(f,g)-F_\gamma^\varepsilon(f,g))\right)^2
\le |\gamma|!\| F_\gamma(f,g)-F_\gamma^\varepsilon(f,g)\|_2^2 \\
&&\qquad\le K\mu^{|V_1(\gamma)|+|V_2(\gamma)|}(D_\gamma)\le C\varepsilon.
\end{eqnarray*}
Hence relation~(\ref{(B7)}) holds.
To prove relation (\ref{(B8)}) we rewrite
$E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2$
in the following form:
\begin{eqnarray}
&&E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
={\sum}^\gamma
{\sum}^\gamma c^\varepsilon(s_1,\dots,s_k)
d^\varepsilon(t_1,\dots,t_l)
c^\varepsilon(\bar s_1,\dots,\bar s_k) \nonumber \\
&&\qquad\qquad
d^\varepsilon(\bar t_1, \dots,\bar t_l)
EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l) \label{(B11)}
\end{eqnarray}
with
\begin{eqnarray}
&&U(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l) \nonumber \\
&&\qquad =\prod_{j\colon\, (1,j)
\in V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\,(2,j')\in V_2(\gamma)}
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&&\qquad\qquad
\prod_{\bar j\colon\, (1,\bar j)\in V_1(\gamma)} %
\mu_W(A^\varepsilon_{\bar s_{\bar j}}) %
\prod_{\bar j'\colon\, (2,\bar j')\in V_2(\gamma)} %
\mu_W(B^\varepsilon_{\bar t_{\bar j'}}) \nonumber \\ %
&&\qquad\qquad \biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}
\!\!\!\!\!\!\!\!\!\!
\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)} \!\!\!\!\!\!\!\!\!\!\!\!
\!\!\!\!\!\!
\mu_W(B^\varepsilon_{t_{j'}}) \nonumber \\
&&\qquad\qquad\qquad
-\prod_{j'\colon\, (2,j')\in\{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)}\mu(B^\varepsilon_{t_{j'}})\biggr]
\nonumber \\
&&\qquad\biggl[\prod_{\bar j\colon\, (1,\bar j)\in %
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)} \!\!\!\!
\mu_W(A^\varepsilon_{\bar s_{\bar j}}) \!\!\!\!\! %
\prod_{\bar j'\colon\, %
(2,\bar j')\in \{(2,1),\dots,(2,l)\} %
\setminus V_2(\gamma)} \!\!\!\!\!\!
\mu_W(B^\varepsilon_{\bar t_{\bar j'}}) \nonumber \\ %
&&\qquad\qquad\qquad
-\prod_{\bar j'\colon\, (2,\bar j')\in\{(2,1),\dots,(2,l)\} %
\setminus V_2(\gamma)}
\mu(B^\varepsilon_{\bar t_{\bar j'}})\biggr]. \label{(B12)} %
\end{eqnarray}
The double sum $\sum^\gamma\sum^\gamma$ in (\ref{(B11)}) has to be
understood in the following way. The first summation is taken for
vectors $(s_1,\dots,s_k,t_1,\dots,t_l)$, and $\sum^\gamma$ is defined
in the same way as in~formula (\ref{(B3)}). The second summation
is taken for vectors
$(\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)$, and
again the summation $\sum^\gamma$ is taken as in~(\ref{(B3)}),
only here $\bar s_j$ plays the role of~$s_j$ and $\bar t_{j'}$
plays the role of~$t_{j'}$.
Relation~(\ref{(B8)}) will be proved by means of some
estimates about the expectation of the above defined random
variable $U(\cdot)$ which will be presented in the following
Lemma~B. To formulate this result I introduce the following
Properties~A and~B.
\medskip\noindent
{\bf Property A.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar j}\le M(\varepsilon)$, for %
$1\le j,\bar j\le k$, and %
$1\le t_{j'},\bar t_{\bar j'}\le M'(\varepsilon)$ for %
$1\le j',\bar j'\le l$, %
satisfies Property~A (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequence of sets
$A^\varepsilon_{s_j}$, $(1,j)\in V_1(\gamma)$,
$B^\varepsilon_{t_{j'}}$, $(2,j')\in V_2(\gamma)$,
and the sequence of sets
$A^\varepsilon_{\bar s_{\bar j}}$, $(1,\bar j)\in V_1(\gamma)$, %
$B^\varepsilon_{\bar t_{\bar j'}}$, $(2,\bar j')\in V_2(\gamma)$, %
agree. (Here we say that two sequences agree if they contain
the same elements in a possibly different order.)}
\medskip\noindent
{\bf Property B.\/} {\it A sequence $s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$, with elements
$1\le s_j,\bar s_{\bar j}\le M(\varepsilon)$, for %
$1\le j,\bar j\le k$, and %
$1\le t_{j'},\bar t_{\bar j'}\le M'(\varepsilon)$ for %
$1\le j',\bar j'\le l$, %
satisfies Property~B (depending on a fixed diagram~$\gamma$ and
number~$\varepsilon>0$) if the sequences of sets
$$
A^\varepsilon_{s_j},\;
(1,j)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma), \;\;\;
B^\varepsilon_{t_{j'}}, \;
(2,j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma),
$$
and
$$
A^\varepsilon_{\bar s_{\bar j}}, %
(1,\bar j)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma), %
\;\;\; B^\varepsilon_{\bar t_{\bar j'}}, \; %
(2,\bar j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma), %
$$
have at least one common element.}
\medskip
(In the above definitions two sets $A^\varepsilon_s$ and
$B^\varepsilon_t$ are
identified if $A^\varepsilon_s=B^\varepsilon_t$.)
Now I formulate the following
\medskip\noindent
{\bf Lemma B.} {\it Let us consider the function $U(\cdot)$
introduced in formula~(\ref{(B12)}). Assume that its arguments
$s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l$ are chosen
in such a way that the function $U(\cdot)$ with these
arguments appears in the double sum $\sum^\gamma\sum^\gamma$
in formula~(\ref{(B11)}), i.e.\
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ if
$((1,j),(2,j'))\in E(\gamma)$, otherwise all sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$ are disjoint,
and an analogous statement holds if
the coordinates $s_1,\dots,s_k,t_1,\dots,t_l$ are replaced by
$\bar s_1,\dots,\bar s_k$ and $\bar t_1,\dots,\bar t_l$.
If the sequence of the arguments of $U(\cdot)$ does not satisfy
Property~A or does not satisfy Property~B, then
\begin{equation}
EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)=0.
\label{(B13)}
\end{equation}
If the sequence of the arguments in $U(\cdot)$ satisfies both
Property~A and Property~B, then
\begin{equation}
|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)|
\le C\varepsilon \prod{\vphantom\prod}'
\mu(A^\varepsilon_{\bar s_{\bar j}})\mu(B^\varepsilon_{\bar t_{\bar j'}}) %
\label{(B14)}
\end{equation}
with some appropriate constant $C=C(k,l)>0$ depending only on
the number of variables $k$ and $l$ of the functions $f$ and $g$.
The prime in the product $\prod'$ on the right-hand side
of~(\ref{(B14)}) means that this product contains the
measure $\mu$ of those sets $A^\varepsilon_{\bar s_{\bar j}}$
and $B^\varepsilon_{\bar t_{\bar j'}}$
whose indices are listed among the arguments
$\bar s_{\bar j}$ or $\bar t_{\bar j'}$ of
$U(\cdot)$, and the measure~$\mu$ of each such set appears
exactly once. (This means that if
$A^\varepsilon_{\bar s_{\bar j}}=B^\varepsilon_{\bar t_{\bar j'}}$,
then one of the two factors
$\mu(A^\varepsilon_{\bar s_{\bar j}})$ and
$\mu(B^\varepsilon_{\bar t_{\bar j'}})$
is omitted from the product. For the sake of definiteness let
us keep the factor $\mu(A^\varepsilon_{\bar s_{\bar j}})$ in such a case.)}
\medskip\noindent
{\it Remark.}\/ The content of Lemma~B is that most terms
in the double sum in formula~(\ref{(B11)}) equal zero, and
even the non-zero terms are small.
\medskip\noindent
{\it The proof of Lemma B.}\/ Let us prove first
relation~(\ref{(B13)})
in the case when Property~A does not hold. It will be exploited
that for disjoint sets the random variables $\mu_W(A_s)$ and
$\mu_W(B_t)$ are independent, and this provides a good
factorization of the expectation of certain products.
Let us carry out the multiplications in the expression
$U(\cdot)$ defined in~(\ref{(B12)}). We get a sum consisting
of four terms. We show that each of them has zero expectation.
Indeed, if a sequence
$s_1,\dots,s_k,t_1,\dots,t_l,\bar s_1,\dots,\bar s_k,
\bar t_1,\dots,\bar t_l$
does not satisfy Property~A, but it satisfies the
remaining conditions of Lemma~B, then each term in the sum
expressing $U(\cdot)$ with these arguments is a product
which contains a factor $\mu_W(A^\varepsilon_{s_{j_0}})$,
$(1,j_0)\in V_1(\gamma)$, with the following property. It is
independent of all those factors of this product which are in
the following list: $\mu_W(A^\varepsilon_{s_j})$ with some
$j\neq j_0$, $1\le j\le k$, or $\mu_W(B^\varepsilon_{t_{j'}})$,
$1\le j'\le l$, or $\mu_W(A^\varepsilon_{\bar s_{\bar j}})$ with
$(1,\bar j)\in V_1(\gamma)$, or %
$\mu_W(B^\varepsilon_{\bar t_{\bar j'}})$ with %
$(2,\bar j')\in V_2(\gamma)$. We will show with the help of %
this property that the expectation of the terms we consider
can be written in the form of a product either with a factor
of the form $E\mu_W(A^\varepsilon_{s_{j_0}})=0$ or with a
factor of the form $E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$.
Hence this expectation equals zero.
Indeed, although the above properties do not exclude the
existence of a set $A^\varepsilon_{\bar s_{\bar j}}$,
$(1,\bar j)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$,
or $B^\varepsilon_{\bar t_{\bar j'}}$,
$(2,\bar j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$,
such that $\mu_W(A^\varepsilon_{\bar s_{\bar j}})$ or
$\mu_W(B^\varepsilon_{\bar t_{\bar j'}})$
is not independent of $\mu_W(A^\varepsilon_{s_{j_0}})$,
this can only happen if $A^\varepsilon_{\bar s_{\bar j}}
=B^\varepsilon_{\bar t_{\bar j'}}=A^\varepsilon_{s_{j_0}}$
with $((1,\bar j),(2,\bar j'))\in E(\gamma)$. This
implies that if in such a case our term does not contain a
factor of the form $E\mu_W(A^\varepsilon_{s_{j_0}})$, then
it contains a factor of the form
$E\mu_W(A^\varepsilon_{s_{j_0}})^3=0$. Hence $EU(\cdot)=0$
if the arguments of $U(\cdot)$ do not satisfy Property~A.
To finish the proof of relation (\ref{(B13)}) it is enough to
consider the case when the arguments of $U(\cdot)$ satisfy
Property~A, but they do not satisfy Property~B. The validity
of Property~A implies that the sets
$\{A^\varepsilon_{s_j}\colon\,(1,j)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{t_{j'}}\colon\,(2,j')\in V_2(\gamma)\}$
and
$\{A^\varepsilon_{\bar s_{\bar j}}\colon\,(1,\bar j)\in V_1(\gamma)\}
\cup\{B^\varepsilon_{\bar t_{\bar j'}}\colon\,(2,\bar j')\in V_2(\gamma)\}$
agree. The conditions of Lemma~B also imply that the elements
of these sets are disjoint from the sets $A^\varepsilon_{s_j}$,
$B^\varepsilon_{t_{j'}}$, $A^\varepsilon_{\bar s_{\bar j}}$ and %
$B^\varepsilon_{\bar t_{\bar j'}}$ with indices %
$(1,j),(1,\bar j)\in\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)$ %
and
$(2,j'),(2,\bar j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$. %
If Property~B does not hold, then we can divide the latter class
of sets into two disjoint subclasses in an appropriate way. The
first subclass consists of the sets
$A^\varepsilon_{s_j}$ and $B^\varepsilon_{t_{j'}}$, and the second
one of the sets $A^\varepsilon_{\bar s_{\bar j}}$ and %
$B^\varepsilon_{\bar t_{\bar j'}}$ %
with indices such that
$(1,j),(1,\bar j)\in\{(1,1),\dots,(1,k)\} %
\setminus V_1(\gamma)$ and
$(2,j'),(2,\bar j')\in\{(2,1),\dots,(2,l)\}\setminus V_2(\gamma)$. %
These facts imply that $EU(\cdot)$ has a factorization,
which contains the term
\begin{eqnarray*}
&&E\biggl[\prod_{j\colon\, (1,j)\in
\{(1,1),\dots,(1,k)\}\setminus V_1(\gamma)}\mu_W(A^\varepsilon_{s_j})
\prod_{j'\colon\, (2,j')\in \{(2,1),\dots,(2,l)\}\setminus
V_2(\gamma)}\mu_W(B^\varepsilon_{t_{j'}}) \\
&&\qquad\qquad -\prod_{j'\colon\, (2,j')\in\{(2,1),\dots,(2,l)\}
\setminus V_2(\gamma)}\mu(B^\varepsilon_{t_{j'}})\biggr]=0,
\end{eqnarray*}
hence relation (\ref{(B13)}) holds also in this case. The
last expression has zero expectation, since if we take
those pairs $A^\varepsilon_{s_j},B^\varepsilon_{t_{j'}}$ among
the sets appearing in it for which
$((1,j),(2,j'))\in E(\gamma)$, i.e.\ whose vertices are
connected by an edge of $\gamma$, then
$A^\varepsilon_{s_j}=B^\varepsilon_{t_{j'}}$ within each pair, and
the elements of different pairs are disjoint. This
observation allows a factorization of the product whose
expectation is taken, and then the identity
$E\mu_W(A^\varepsilon_{s_j})\mu_W(B^\varepsilon_{t_{j'}})
=\mu(A^\varepsilon_{s_j})$ implies the desired identity.
To prove relation (\ref{(B14)}) if the arguments of the
function~$U(\cdot)$ satisfy both Properties~A and~B consider
the expression (\ref{(B12)}) which defines $U(\cdot)$, carry
out the term by term multiplication between the two
differences at the end of this formula, take expectation for
each term of the sum obtained in such a way and factorize
them. Since $E\mu_W(A)^2=\mu(A)$, $E\mu_W(A)^4=3\mu(A)^2$
for all sets $A\in{\cal X}$, $\mu(A)<\infty$, some
calculation shows that each term can be expressed as a
constant times a product whose factors are those
measures $\mu(A_{\bar s_{\bar j}}^\varepsilon)$ and
$\mu(B_{\bar t_{\bar j'}}^\varepsilon)$, or their squares, which
appear on the right-hand side of (\ref{(B14)}). Moreover,
since the arguments of $U(\cdot)$ satisfy Property~B, there
will be at least one term of the form $\mu(A_s^\varepsilon)^2$
in this product. Since
$\mu(A_s^\varepsilon)^2\le \varepsilon\mu(A_s^\varepsilon)$,
these calculations provide
formula~(\ref{(B14)}). Lemma~B is proved.
\hfill$\qed$
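The moment identities used in this proof, $E\mu_W(A)=E\mu_W(A)^3=0$, $E\mu_W(A)^2=\mu(A)$ and $E\mu_W(A)^4=3\mu(A)^2$, are simply the moments of an $N(0,\mu(A))$ distributed random variable, and can be confirmed by numerical integration. A small Python sketch (the helper `gauss_moment` and the value $\mu(A)=0.3$ are our own illustrative choices):

```python
import math

def gauss_moment(p, var, half_width=12.0, n=200001):
    """p-th moment of the N(0, var) distribution, computed with the
    trapezoidal rule on [-half_width*sd, half_width*sd]; the tails
    outside this interval are negligible."""
    sd = math.sqrt(var)
    a = -half_width * sd
    h = 2 * half_width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        x = a + i * h
        w = h if 0 < i < n - 1 else h / 2  # trapezoid endpoint weights
        total += w * x ** p * math.exp(-x * x / (2 * var)) \
                 / (sd * math.sqrt(2 * math.pi))
    return total

mu_A = 0.3  # hypothetical value of mu(A); mu_W(A) is N(0, mu(A)) distributed
assert abs(gauss_moment(1, mu_A)) < 1e-8       # E mu_W(A)   = 0
assert abs(gauss_moment(3, mu_A)) < 1e-8       # E mu_W(A)^3 = 0
print(round(gauss_moment(2, mu_A), 6))         # 0.3  = mu(A)
print(round(gauss_moment(4, mu_A), 6))         # 0.27 = 3 * mu(A)**2
```

The vanishing odd moments are what make most terms in the double sum (\ref{(B11)}) disappear, while the second and fourth moments produce the factors $\mu(A^\varepsilon_s)$ and $\mu(A^\varepsilon_s)^2$ in the surviving ones.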
\medskip
Relation (\ref{(B11)}) implies that
\begin{equation}
E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
\le K\sum{\vphantom{\sum}}^\gamma \sum{\vphantom{\sum}}^\gamma
|EU(s_1,\dots,s_k,t_1,\dots,t_l,
\bar s_1,\dots,\bar s_k,\bar t_1,\dots,\bar t_l)| \label{(B15)}
\end{equation}
with some appropriate $K>0$. By Lemma~B it is enough to sum up
only for such terms $U(\cdot)$ in (\ref{(B15)}) whose
arguments satisfy
both Properties~A and~B. Moreover, each such term can be bounded
by means of inequality (\ref{(B14)}). Let us write up the upper
bound we get on $E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2$
in such a way. We get a sum consisting of terms of the form
$\mu(A^\varepsilon_{s_1})\cdots\mu(A^\varepsilon_{s_p})
\mu(B^\varepsilon_{t_1})\cdots\mu(B^\varepsilon_{t_q})$
multiplied by constant~times~$\varepsilon$. The sets
$A^\varepsilon_s$ and $B^\varepsilon_t$ whose measure~$\mu$ appears in
such a term are disjoint. Beside this, $1\le p\le k$, and
$1\le q\le l$.
In the above indicated estimation of
$E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2$ with the help of
formula~(\ref{(B15)}) and Lemma~B we have exploited the following
fact. A term
$$
\mu(A^\varepsilon_{s_1})\cdots\mu(A^\varepsilon_{s_p})
\mu(B^\varepsilon_{t_1})\cdots\mu(B^\varepsilon_{t_q})
$$
with prescribed indices $s_1,\dots,s_p$ and $t_1,\dots,t_q$ came
up in the sum at the right-hand side of our bound as a contribution
of only finitely many expressions $|EU(\cdots)|$. Hence we get
this term in the upper bound with a multiplying coefficient bounded by
constant~times~$\varepsilon$.
We also have $\sum\limits_{s=1}^{M(\varepsilon)}
\mu(A^\varepsilon_s)+\sum\limits_{t=1}^{M'(\varepsilon)}
\mu(B^\varepsilon_t)=\mu(A)+\mu(B)<\infty$.
The above relations imply that
\begin{eqnarray*}
E\left(Z^{(2)}_\gamma(f,g,\varepsilon)\right)^2
&\le& C_1\varepsilon\sum_{\substack{1\le p\le k \\ 1\le q\le l}}
\sum_{\substack{1\le s_j\le M(\varepsilon)\\ 1\le j\le p}}
\sum_{\substack{1\le t_j\le M'(\varepsilon)\\ 1\le j\le q}}
\mu(A^\varepsilon_{s_1})\cdots\mu(A^\varepsilon_{s_p})
\mu(B^\varepsilon_{t_1})\cdots\mu(B^\varepsilon_{t_q}) \\
&\le& C_2\varepsilon\sum_{j=1}^{k+l}(\mu(A)+\mu(B))^j
\le C\varepsilon.
\end{eqnarray*}
Hence relation (\ref{(B8)}) holds.
\medskip
To prove Theorem 10.2A in the general case take for all pairs of
functions $f\in{\cal H}_{\mu,k}$ and $g\in{\cal H}_{\mu,l}$ two
sequences of elementary functions $f_n\in\bar{\cal H}_{\mu,k}$
and $g_n\in\bar{\cal H}_{\mu,l}$, $n=1,2,\dots$, such that
$\|f_n-f\|_2\to0$ and $\|g_n-g\|_2\to0$ as $n\to\infty$.
It is enough to show that
\begin{equation}
E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-k!Z_{\mu,k}(f_n)
l!Z_{\mu,l}(g_n)|\to0\quad \textrm{as }n\to\infty,
\label{(B16)}
\end{equation}
and
\begin{eqnarray}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\to0
\textrm{ as } n\to\infty \nonumber \\
&&\qquad\qquad\qquad \qquad\qquad\qquad\qquad
\textrm{for all } \gamma\in\Gamma(k,l),
\label{(B17)}
\end{eqnarray}
since then a simple limiting procedure $n\to\infty$, and the
already proved part of the theorem for Wiener--It\^o integrals of
elementary functions imply Theorem~10.2A.
To prove relation (\ref{(B16)}) write with the help of Property~c)
in Theorem~10.1
\begin{eqnarray*}
&&E|k!Z_{\mu,k}(f)l!Z_{\mu,l}(g)-
k!Z_{\mu,k}(f_n)l!Z_{\mu,l}(g_n)|\\
&&\qquad\le k!l!\left(E|Z_{\mu,k}(f)Z_{\mu,l}(g-g_n)|
+E|Z_{\mu,k}(f-f_n)Z_{\mu,l}(g_n)|\right) \\
&&\qquad\le k!l!
\left(\left(EZ^2_{\mu,k}(f)\right)^{1/2}
\left(EZ^2_{\mu,l}(g-g_n)\right)^{1/2} \right. \\
&&\qquad\qquad \left. +\left(EZ^2_{\mu,k}(f-f_n)\right)^{1/2}
\left(EZ^2_{\mu,l}(g_n)\right)^{1/2}\right)\\
&&\qquad\le (k!l!)^{1/2}\left(\|f\|_2\|g-g_n\|_2
+\|f-f_n\|_2\|g_n\|_2\right).
\end{eqnarray*}
Relation (\ref{(B16)}) follows from this inequality with a limiting
procedure $n\to\infty$.
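The last step of this chain uses the $L_2$-bound on Wiener--It\^o
integrals provided by Property~c) of Theorem~10.1, by which
$EZ^2_{\mu,k}(h)\le\frac{\|h\|_2^2}{k!}$ for all
$h\in{\cal H}_{\mu,k}$. Indeed,

```latex
k!\,l!\left(EZ^2_{\mu,k}(f)\right)^{1/2}
\left(EZ^2_{\mu,l}(g-g_n)\right)^{1/2}
\le k!\,l!\,\frac{\|f\|_2}{(k!)^{1/2}}\,
\frac{\|g-g_n\|_2}{(l!)^{1/2}}
=(k!\,l!)^{1/2}\|f\|_2\|g-g_n\|_2,
```

and the term containing $f-f_n$ and $g_n$ admits the same bound.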
To prove relation (\ref{(B17)}) write
\begin{eqnarray*}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&&\qquad\le
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g-g_n))\right|+
|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f-f_n,g_n))\right|\\
&&\qquad\le
|\gamma|!\left(EZ^2_{\mu,|\gamma|}
(F_\gamma(f,g-g_n))\right)^{1/2}+
|\gamma|!\left(EZ^2_{\mu,|\gamma|}
(F_\gamma(f-f_n,g_n))\right)^{1/2}\\
&&\qquad\le (|\gamma|!)^{1/2}\left(\|F_\gamma(f,g-g_n)\|_2+
\|F_\gamma(f-f_n,g_n)\|_2\right),
\end{eqnarray*}
and observe that by relation (\ref{(10.11)})
$\|F_\gamma(f,g-g_n)\|_2\le \|f\|_2\|g-g_n\|_2$, and
\hfill\break
$\|F_\gamma(f-f_n,g_n)\|_2\le \|f-f_n\|_2\|g_n\|_2$. Hence
\begin{eqnarray*}
&&|\gamma|!E\left|Z_{\mu,|\gamma|}(F_\gamma(f,g))-
Z_{\mu,|\gamma|}(F_\gamma(f_n,g_n))\right|\\
&&\qquad\le(|\gamma|!)^{1/2}
\left(\|f\|_2\|g-g_n\|_2+\|f-f_n\|_2\|g_n\|_2\right).
\end{eqnarray*}
The last inequality implies relation (\ref{(B17)})
with a limiting procedure
$n\to\infty$. Theorem 10.2A is proved.
\hfill$\qed$
\chapter{The proof of some results about
Wiener--It\^o integrals}
\label{introC}
First I prove It\^o's formula about multiple
Wiener--It\^o integrals (Theorem~10.3). The proof is based
on the diagram formula for Wiener--It\^o integrals and a
recursive formula about Hermite polynomials proved in
Proposition~C. In Proposition~C2 I present the proof of
another important property of Hermite polynomials. This
result states that the class of all Hermite polynomials is a
{\it complete}\/ orthogonal system in an appropriate
Hilbert space. It is needed in the proof of Theorem 10.5
which provides an isomorphism between a Fock space and the
Hilbert space generated by Wiener--It\^o integrals with respect
to a white noise with an appropriate reference measure. At the
end of Appendix~C the proof of Theorem~10.4, a limit theorem
about degenerate $U$-statistics is given together with a
version of this result about the limit behaviour of multiple
integrals with respect to a normalized empirical distribution.
\medskip\noindent
{\bf Proposition C about some properties of Hermite
polynomials.}\index{Hermite polynomials} {\it The functions
\begin{equation}
H_k(x)=(-1)^k e^{x^2/2}\frac {d^k}{dx^k}e^{-x^2/2},
\quad k=0,1,2,\dots \label{(C1)}
\end{equation}
are the Hermite polynomials with leading
coefficient 1, i.e.\ $H_k(x)$ is a polynomial of
order $k$ with leading coefficient 1 such that
\begin{equation}
\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \textrm{if } k\neq l. \label{(C2)}
\end{equation}
Beside this,
\begin{equation}
\int_{-\infty}^\infty H^2_k(x) \frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=k!
\quad \textrm{for all } k=0,1,2,\dots. \label{(C$2'$)}
\end{equation}
The recursive relation
\begin{equation}
H_k(x)=x H_{k-1}(x)-(k-1)H_{k-2}(x) \label{(C3)}
\end{equation}
holds for all $k=2,3,\dots$.}
\medskip\noindent
{\it Remark.} It is convenient to regard
relation~(\ref{(C3)}) as valid also in the case $k=1$. In this
case $H_1(x)=x$, $H_0(x)=1$, and the relation holds with an
arbitrary choice of the function $H_{-1}(x)$.
\medskip\noindent
{\it Proof of Proposition C.} It is clear from
formula~(\ref{(C1)}) that $H_k(x)$ is a polynomial of
order $k$ with leading coefficient 1. Take $l\ge k$, and
write by means of integration by parts
\begin{eqnarray*}
&&\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=\int_{-\infty}^\infty\frac1{\sqrt{2\pi}}
H_k(x)(-1)^l\frac{d^l}{dx^l} e^{-x^2/2}\,dx\\
&&\qquad\qquad
=\int_{-\infty}^\infty\frac1{\sqrt{2\pi}} \frac d{dx} H_k(x)
(-1)^{l-1}\frac{d^{l-1}}{dx^{l-1}}e^{-x^2/2}\,dx.
\end{eqnarray*}
Successive partial integration together with the identity
$\frac{d^k}{dx^k}H_k(x)=k!$ yield that
$$
\int_{-\infty}^\infty H_k(x)H_l(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=k!\int_{-\infty}^\infty\frac1{\sqrt{2\pi}}
(-1)^{l-k}\frac{d^{l-k}}{dx^{l-k}}e^{-x^2/2}\,dx.
$$
The last relation supplies formulas (\ref{(C2)})
and~(\ref{(C$2'$)}).
To prove relation (\ref{(C3)}) observe that
$H_k(x)-xH_{k-1}(x)$ is a polynomial of order $k-2$. (The term
$x^{k-1}$ is missing from this expression. Indeed, if $k$ is
an even number, then the polynomial $H_k(x)-xH_{k-1}(x)$ is
an even function, and it does not contain the term $x^{k-1}$
with an odd exponent $k-1$. A similar argument applies if the
number $k$ is odd.) Beside this, it is orthogonal (with
respect to the standard normal distribution) to all Hermite
polynomials $H_l(x)$ with $0\le l\le k-3$. Hence
$H_k(x)-xH_{k-1}(x)=CH_{k-2}(x)$ with some constant $C$ to be
determined.
Multiply both sides of the last identity with $H_{k-2}(x)$
and integrate them with respect to the standard normal
distribution. Apply the orthogonality of the polynomials
$H_k(x)$ and $H_{k-2}(x)$, and observe that the identity
$$
\int H_{k-1}(x)xH_{k-2}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
=\int H^2_{k-1}(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=(k-1)!
$$
holds. (In this calculation we have exploited that $H_{k-1}(x)$
is orthogonal to $H_{k-1}(x)-xH_{k-2}(x)$, because the order of
the latter polynomial is less than $k-1$.) In such a way we get
the identity $-(k-1)!=C(k-2)!$ for the constant~$C$ in the last
identity, i.e. $C=-(k-1)$, and this implies relation (\ref{(C3)}).
\hfill$\qed$
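\medskip\noindent
{\it Remark.} Relations (\ref{(C2)}), (\ref{(C$2'$)}) and the
recursion (\ref{(C3)}) can be verified numerically. The Python
sketch below (assuming NumPy is available; the function names are
mine) builds $H_k$ from the recursion and evaluates the integrals
by Gauss quadrature with weight $e^{-x^2/2}$:

```python
import numpy as np
from numpy.polynomial import Polynomial as P
from numpy.polynomial.hermite_e import hermegauss

def hermite(k):
    # H_0(x) = 1, H_1(x) = x, and H_k(x) = x H_{k-1}(x) - (k-1) H_{k-2}(x),
    # the recursion (C3)
    h_prev, h = P([1.0]), P([0.0, 1.0])
    if k == 0:
        return h_prev
    for j in range(2, k + 1):
        h_prev, h = h, P([0.0, 1.0]) * h - (j - 1) * h_prev
    return h

# Gauss quadrature nodes and weights for the weight function exp(-x^2/2)
x, w = hermegauss(40)

def inner(k, l):
    # approximates the integral of H_k H_l against the standard normal density
    return float(np.sum(w * hermite(k)(x) * hermite(l)(x)) / np.sqrt(2 * np.pi))

# inner(k, l) equals k! for k = l and vanishes for k != l,
# in accordance with (C2) and (C2')
```

The quadrature is exact for polynomial integrands of this degree, so
the orthogonality relations are reproduced up to rounding errors.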
\medskip\noindent
{\it Proof of It\^o's formula for multiple Wiener--It\^o
integrals.}\/\index{It\^o's formula for multiple Wiener--It\^o
integrals} Let $K=\sum\limits_{p=1}^m k_p$, the sum of the
orders of the Hermite polynomials, denote the order of the
expression in relation (\ref{(10.20)}).
Formula~(\ref{(10.20)}) clearly holds
for expressions of order $K=1$. It will be proved in the
general case by means of induction with respect to the
order~$K$.
In the proof the functions $f(x_1)=\varphi_1(x_1)$ and
$$
g(x_1,\dots,x_{K_m-1})=\prod_{j=1}^{K_1-1}\varphi_1(x_j)
\cdot \prod_{p=2}^m \prod_{j=K_{p-1}}^{K_p-1}\varphi_p(x_j),
$$
will be introduced and the product
$Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)$ will be calculated
by means of the diagram formula. (The same notation is
applied as in Theorem 10.3. In particular, $K=K_m$, and in
the case $K_1=1$ the convention
$\prod\limits_{j=1}^{K_1-1}\varphi_1(x_j)=1$ is applied.)
In the application of the diagram formula diagrams with
two rows appear. The first row of these diagrams contains the
vertex $(1,1)$ and the second row contains the vertices
$(2,1),\dots,(2,K_m-1)$. It is useful to divide the diagrams to
three disjoint classes. The first class, $\Gamma_0$ contains
only the diagram $\gamma_0$ without any edges. The second class
$\Gamma_1$ consists of those diagrams which have an edge of the
form $((1,1),(2,j))$ with some $1\le j\le k_1-1$, and the third
class $\Gamma_2$ is the set of those diagrams which have an
edge of the form $((1,1),(2,j))$ with some $k_1\le j\le K_m-1$.
Because of the orthogonality of the functions $\varphi_s$ for
different indices~$s$ we have $F_\gamma\equiv0$, and hence
$Z_{\mu,K_m-2}(F_\gamma)=0$ for $\gamma\in\Gamma_2$.
The class $\Gamma_1$ contains $k_1-1$ diagrams. Let us consider a
diagram $\gamma$ from this class with an edge $((1,1),(2,j_0))$,
$1\le j_0\le k_1-1$. We have for such a diagram $F_\gamma=
\prod\limits_{j\in\{1,\dots,K_1-1\}
\setminus \{j_0\}}\varphi_1(x_{(2,j)})
\prod\limits_{p=2}^m
\prod\limits_{j=K_{p-1}}^{K_p-1}\varphi_p(x_{(2,j)})$, and
by our inductive hypothesis $(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)=
H_{k_1-2}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p)$. Finally
$$
K_m!Z_{\mu,K_m}(F_{\gamma_0})=
K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m
\left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)
$$
for the diagram $\gamma_0\in\Gamma_0$ without any edge.
Our inductive hypothesis also implies the following identity for
the expression we wanted to calculate with the help of the diagram
formula.
$$
Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)=\eta_1
H_{k_1-1}(\eta_1)\prod\limits_{p=2}^m H_{k_p}(\eta_p).
$$
The above calculations together with the observation
$|\Gamma_1|=k_1-1$ yield the identity
\begin{eqnarray}
&&K_m!Z_{\mu,K_m}\left(\prod_{p=1}^m \left(\prod_{j=K_{p-1}+1}^{K_p}
\varphi_p(x_j)\right)\right)=K_m!Z_{\mu,K_m}(F_{\gamma_0})
\nonumber \\
&&\qquad=Z_{\mu,1}(f)(K_m-1)!Z_{\mu,K_m-1}(g)-
\sum_{\gamma\in\Gamma_1}(K_m-2)!Z_{\mu,K_m-2}(F_\gamma)
\nonumber \\
&&\qquad=\eta_1 H_{k_1-1}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)
-(k_1-1) H_{k_1-2}(\eta_1)\prod_{p=2}^m H_{k_p}(\eta_p)
\nonumber \\
&&\qquad=\left[\eta_1H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)\right]\prod_{p=2}^m H_{k_p}(\eta_p).
\label{(C4)}
\end{eqnarray}
On the other hand, $\eta_1 H_{k_1-1}(\eta_1)
-(k_1-1) H_{k_1-2}(\eta_1)=H_{k_1}(\eta_1)$ by
formula (\ref{(C3)}). These relations imply
formula~(\ref{(10.20)}), i.e. It\^o's formula.
\hfill$\qed$
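\medskip\noindent
{\it Example.} It may be instructive to spell out the simplest
non-trivial case of the above induction, $m=1$, $k_1=K=2$, with
$f=g=\varphi_1$ and $\eta_1=Z_{\mu,1}(\varphi_1)$. The diagram
formula applied to the product
$Z_{\mu,1}(\varphi_1)\,1!\,Z_{\mu,1}(\varphi_1)$ yields two
diagrams: the diagram without edges contributes
$2!\,Z_{\mu,2}(\varphi_1(x_1)\varphi_1(x_2))$, while the diagram
with one edge contributes the constant $\int\varphi_1^2\,d\mu=1$
(with the convention that a Wiener--It\^o integral of order zero
equals its constant integrand). Hence

```latex
\eta_1^2=Z_{\mu,1}(\varphi_1)\,1!\,Z_{\mu,1}(\varphi_1)
=2!\,Z_{\mu,2}(\varphi_1(x_1)\varphi_1(x_2))+1,
\quad\hbox{i.e.}\quad
2!\,Z_{\mu,2}(\varphi_1(x_1)\varphi_1(x_2))
=\eta_1^2-1=H_2(\eta_1),
```

in accordance with formula (\ref{(10.20)}).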
\medskip
I present the proof of another important property of the Hermite
polynomials in the following Proposition~C2.
\medskip\noindent
{\bf Proposition~C2 on the completeness of the orthogonal system
of Hermite polynomials.}\index{Hermite polynomials} {\it The
Hermite polynomials $H_k(x)$, $k=0,1,2,\dots$, defined in
formula~(\ref{(C1)}) constitute a complete orthogonal system
in the $L_2$-space of the functions square integrable with
respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ on the real line.}
\medskip\noindent
{\it Proof of Proposition C2.} Let us consider the orthogonal
complement of the subspace generated by the Hermite polynomials
in the space of the square integrable functions with respect
to the measure $\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$. It is enough
to prove that this orthogonal complement contains only the
identically zero function. Since the orthogonality of a function to
all polynomials of the form $x^k$, $k=0,1,2,\dots$ is equivalent
to the orthogonality of this function to all Hermite polynomials
$H_k(x)$, $k=0,1,2,\dots$, Proposition~C2 can be reformulated in
the following form:
If a function $g(x)$ on the real line is such that
\begin{equation}
\int_{-\infty}^\infty x^k g(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx=0
\quad \textrm{for all }k=0,1,2,\dots \label{(C5)}
\end{equation}
and
\begin{equation}
\int_{-\infty}^\infty g^2(x)
\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx<\infty,
\label{(C6)}
\end{equation}
then $g(x)=0$ for almost all $x$.
Given a function $g(x)$ on the real line whose absolute value is
integrable with respect to the Gaussian measure
$\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx$ define the (finite)
measure $\nu_g$,
$$
\nu_g(A)=\int_A g(x)\frac1{\sqrt{2\pi}}e^{-x^2/2}\,dx
$$
on the measurable sets of the real line together with its Fourier
transform $\tilde\nu_g(t)=\int_{-\infty}^\infty e^{itx}\nu_g(\,dx)$.
(This measure $\nu_g$ and its Fourier transform can
be defined for all functions~$g$ satisfying relation (\ref{(C6)}),
because their absolute value is integrable with respect to
the Gaussian measure.) First I show that Proposition~C2 can be
reduced to the following statement: If a function $g$ satisfies
both (\ref{(C5)}) and (\ref{(C6)})
then $\tilde\nu_g(t)=0$ for all $-\infty<t<\infty$.
\begin{equation}
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le A(k)
P\left(\|\bar I_{n,k}(f(\ell))\|>\gamma(k)u\right)
\label{(14.13d)}
\end{equation}
with some constants $A(k)>0$ and $\gamma(k)>0$ depending only
on the order~$k$ of these generalized $U$-statistics.
The sign $\|\cdot\|$ in~(\ref{(14.13d)}) denotes the norm in
the Banach space we are working in.
We concentrate mainly on the proof of the
generalization (\ref{(14.13d)}) of relation (\ref{(14.13)}).
Formula~(\ref{(14.14)}) is a relatively simple consequence of
it. Formula~(\ref{(14.13d)}) will be proved by means of an
inductive procedure which works only in this more general
setting. It will be derived from the following statement.
Let us take two independent copies $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$
and $\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ of our original sequence of
random variables $\xi_1,\dots,\xi_n$, and introduce for all sets
$V\subset \{1,\dots,k\}$ the function $\alpha_V(j)$, $1\le j\le k$,
defined as $\alpha_V(j)=1$ if $j\in V$ and $\alpha_V(j)=2$ if
$j\notin V$. Let us define with their help the following
version of decoupled $U$-statistics:
\begin{eqnarray}
I_{n,k,V}(f(\ell))
&&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\; j=1,\dots,k}
\!\!\!\!
f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\alpha_V(1))},\dots,
\xi_{l_k}^{(\alpha_V(k))}\right) \nonumber \\
&&\qquad\qquad\qquad\qquad\qquad\quad \textrm{for all }
V\subset \{1,\dots,k\}.
\label{(D3)}
\end{eqnarray}
The following inequality will be proved: There are some constants
$C_k>0$ and $D_k>0$ depending only on the order $k$ of the
generalized $U$-statistic $I_{n,k}(f(\ell))$ such that for all
numbers $u>0$
\begin{equation}
P\left(\|I_{n,k}(f(\ell))\|>u\right)\le
\sum_{V\subset\{1,\dots,k\},\,1\le|V|\le k-1} C_kP\left(D_k\|
I_{n,k,V}(f(\ell))\|>u\right). \label{(D4)}
\end{equation}
Here $|V|$ denotes the cardinality of the set $V$, and the
condition $1\le |V|\le k-1$ in the summation of
formula~(\ref{(D4)}) means that the
sets $V=\emptyset$ and $V=\{1,\dots,k\}$ are omitted from the
summation, i.e. the terms where either $\alpha_V(j)=1$
or $\alpha_V(j)=2$ for all $1\le j\le k$ are not considered.
Formula (\ref{(14.13d)}) can be derived from
formula~(\ref{(D4)}) by means of an inductive argument. The
hard part of the problem is to prove formula~(\ref{(D4)}).
To do this first we prove the following simple lemma.
\medskip\noindent
{\bf Lemma D1.} {\it Let $\xi$ and $\eta$ be two independent and
identically distributed random variables taking values in a
separable Banach space~$B$. Then
$$
3P\left(|\xi+\eta|>\frac 23u\right)\ge P(|\xi|>u)
\quad \textrm{for all }u>0.
$$
}
\medskip\noindent
{\it Proof of Lemma D1.}\/ Let $\xi$, $\eta$ and
$\zeta$ be three independent, identically distributed
random variables taking values in~$B$. Then
\begin{eqnarray*}
3P\left(|\xi+\eta|>\frac23 u\right)
&&=P\left(|\xi+\eta|>\frac23 u\right)
+P\left(|\xi+\zeta|>\frac23 u\right) \\
&&\qquad +P\left(|-(\eta+\zeta)|>\frac23 u\right)\\
&&\ge P(|\xi+\eta+\xi+\zeta-\eta-\zeta|>2u)=P(|\xi|>u).
\end{eqnarray*}
\hfill$\qed$
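\medskip\noindent
{\it Remark.} The inequality of Lemma~D1 can be checked on a
concrete example by exact enumeration. In the Python sketch below
the discrete distribution is an arbitrary choice of mine, and both
tail probabilities are computed exactly over the product space:

```python
from itertools import product

# an arbitrary discrete real-valued distribution: (value, probability) pairs
vals = [(-2.0, 0.25), (0.5, 0.5), (1.0, 0.25)]

def p_exceeds(pairs, u):
    # P(|X| > u) for a discrete distribution given as (value, prob) pairs
    return sum(p for v, p in pairs if abs(v) > u)

def lemma_d1_holds(u):
    # exact distribution of xi + eta for two independent copies
    sum_dist = [(a + b, pa * pb) for (a, pa), (b, pb) in product(vals, vals)]
    return 3 * p_exceeds(sum_dist, 2 * u / 3) >= p_exceeds(vals, u)

all(lemma_d1_holds(u) for u in [0.1, 0.5, 1.0, 1.5, 2.5])  # → True
```

The same enumeration can be repeated with any other finite
distribution; the lemma itself of course holds in full generality.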
\medskip
To prove formula (\ref{(D4)}) we introduce the random variable
\begin{eqnarray}
T_{n,k}(f(\ell))&=&\frac1{k!}
\sum_{\substack {(l_1,\dots,l_k),\; (s_1,\dots,s_k) \colon\\
1\le l_j\le n,\, s_j=1 \textrm{ or }s_j=2,\; j=1,\dots, k,}}
\!\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},\dots,\xi_{l_k}^{(s_k)}\right)
\nonumber \\
&=& \sum_{V\subset\{1,\dots,k\}}\!\!\!\!\!
I_{n,k,V}(f(\ell)). \label{(D5)}
\end{eqnarray}
The random variables $I_{n,k}(f(\ell))$,
$I_{n,k,\emptyset}(f(\ell))$ and $I_{n,k,\{1,\dots,k\}}(f(\ell))$
are identically distributed, and the last two random variables are
independent of each other. Hence Lemma~D1 yields that
\begin{eqnarray}
&&P(\|I_{n,k}(f(\ell))\|>u)
\le3P\left(\|I_{n,k,\emptyset}(f(\ell))
+I_{n,k,\{1,\dots,k\}}(f(\ell))\|>\frac23u\right) \nonumber\\
&&\qquad =3P\left(\left\|T_{n,k}(f(\ell))-\!\!\!\!\!\!
\sum_{V\colon\, V\subset\{1,\dots,k\},\,
1\le|V|\le k-1} I_{n,k,V}(f(\ell))\right\|>\frac23u\right)
\!\!\!\!\!\! \nonumber \\
&&\qquad \le 3P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u) \nonumber\\
&&\qquad\qquad\qquad+
\!\!\!\!\!\!\!\!\!
\sum_{V\colon\, V\subset\{1,\dots,k\},\, 1\le|V|\le k-1}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
3P(3\cdot2^{k-1}\|I_{n,k,V}(f(\ell))\|>u). \label{(D6)}
\end{eqnarray}
To derive relation (\ref{(D4)}) from relation (\ref{(D6)}) we
need a good upper bound on the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f(\ell))\|>u)$. To get such an estimate
we shall compare the tail distribution of $\|T_{n,k}(f(\ell))\|$
with that of $\|I_{n,k,V}(f(\ell))\|$ for an arbitrary set
$V\subset\{1,\dots,k\}$. This will be done with the
help of Lemmas~D2 and~D4 formulated below.
In Lemma~D2 such a random variable $\|\hat I_{n,k,V}(f(\ell))\|$
will be constructed whose distribution agrees with that of
$\|I_{n,k,V}(f(\ell))\|$. The expression
$\hat I_{n,k,V}(f(\ell))$, whose norm will be investigated,
will be defined in formulas~(\ref{(D7)}) and~(\ref{(D8)}).
It is a random polynomial of some Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$. The coefficients of
this polynomial are random variables, independent of the
Rademacher functions $\varepsilon_1,\dots,\varepsilon_n$.
Beside this, the constant term of this polynomial equals
$T_{n,k}(f(\ell))$. These properties of the polynomial
$\hat I_{n,k,V}(f(\ell))$ together with Lemma~D4 formulated
below enable us to prove an estimate on the distribution
of $\|T_{n,k}(f(\ell))\|$ which together with
formula~(\ref{(D6)}) implies relation~(\ref{(D4)}). Let us
formulate these lemmas.
\medskip\noindent
{\bf Lemma D2.} {\it Let us consider a sequence of independent
random variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
$1\le l\le n$, which is also independent of the random variables
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ appearing in the definition of
the modified decoupled $U$-statistics $I_{n,k,V}(f(\ell))$ given
in formula (\ref{(D3)}). Let us define with their help the
sequences of random variables $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$
and $\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ whose elements
$(\eta_l^{(1)},\eta_l^{(2)})
=(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))$,
$1\le l\le n$, are defined by the formula
$$
(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))
=\left(\frac{1+\varepsilon_l}2\xi_l^{(1)}+
\frac{1-\varepsilon_l}2\xi_l^{(2)},\frac{1-\varepsilon_l}2\xi_l^{(1)}+
\frac{1+\varepsilon_l}2\xi_l^{(2)}\right),
$$
i.e. let
$$
(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(1)},\xi_l^{(2)})\quad\textrm{if } \varepsilon_l=1,
$$
and
$$
(\eta_l^{(1)}(\varepsilon_l),\eta_l^{(2)}(\varepsilon_l))=
(\xi_l^{(2)},\xi_l^{(1)})\quad\textrm{if } \varepsilon_l=-1,
\quad 1\le l\le n.
$$
Then the joint distribution of the pair of sequences of random
variables $\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ agrees with that of the pair of
sequences $\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$, which is also independent of the
sequence $\varepsilon_1,\dots,\varepsilon_n$.
Let us fix some $V\subset\{1,\dots,k\}$, and introduce the random
variable
\begin{equation}
\hat I_{n,k,V}(f(\ell))=\frac1{k!}\sum_{(l_1,\dots,l_k) \colon\,
1\le l_j\le n,\; j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right), \label{(D7)}
\end{equation}
where similarly to formula (\ref{(D3)}) $\alpha_V(j)=1$ if
$j\in V$, and $\alpha_V(j)=2$ if $j\notin V$. Then the identity
\begin{eqnarray}
&&2^k\hat I_{n,k,V}(f(\ell)) \label{(D8)} \\
&&\quad=\frac1{k!}
\!\!\sum_{\substack {(l_1,\dots,l_k),
\;(s_1,\dots,s_k)\colon\\
1\le l_j\le n,\; s_j=1 \textrm{ or }s_j=2, \\
\;j=1,\dots, k,}}
\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right) \nonumber
\end{eqnarray}
holds, where $\kappa^{(j)}_{1,V}=1$ and $\kappa^{(j)}_{2,V}=-1$ if
$j\in V$, and $\kappa^{(j)}_{1,V}=-1$ and $\kappa^{(j)}_{2,V}=1$ if
$j\notin V$, i.e. $\kappa_{1,V}^{(j)}=3-2\alpha_V(j)$ and
$\kappa_{2,V}^{(j)}=-\kappa_{1,V}^{(j)}$.}
\medskip
Before the formulation of Lemma~D4 another Lemma~D3 will be
presented which will be applied in its proof.
\medskip\noindent
{\bf Lemma D3.} {\it Let $Z$ be a random variable taking values in
a separable Banach space $B$ with expectation zero, i.e. let
$E\kappa(Z)=0$ for all $\kappa\in B'$, where $B'$ denotes the
(Banach) space of all (bounded) linear transformations of $B$ to
the real line. Then $P(\|v+Z\|\ge\|v\|)\ge \inf\limits_{\kappa\in B'}
\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$ for all $v\in B$.}
\medskip\noindent
{\bf Lemma D4.} {\it Let us consider a positive integer $n$ and
a sequence of independent random variables
$\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$, $1\le l\le n$.
Beside this,
fix some positive integer $k$, take a separable Banach space~$B$ and
choose some elements $a_s(l_1,\dots,l_s)$ of this Banach space $B$,
$1\le s\le k$, $1\le l_j\le n$, $l_j\neq l_{j'}$ if $j\neq j'$,
$1\le j,j'\le s$. With the above notations the inequality
\begin{equation}
P\left(\left\|v+\sum_{s=1}^k \,\, \sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s}\right\|
\ge\|v\|\right)\ge c_k \label{(D9)}
\end{equation}
holds for all $v\in B$ with some constant $c_k>0$ which depends
only on the parameter $k$. In particular, it does not depend on
the norm in the separable Banach space~$B$.}
\medskip\noindent
{\it Proof of Lemma D2.}\/ Let us consider the conditional
joint distribution of the sequences of random variables
$\eta_{1}^{(1)},\dots,\eta_n^{(1)}$ and
$\eta_{1}^{(2)},\dots,\eta_n^{(2)}$ under the condition that the
random vector $(\varepsilon_1,\dots,\varepsilon_n)$ takes
the value of some prescribed
$\pm1$ sequence of length~$n$. Observe that this conditional
distribution agrees with the joint distribution of the sequences
$\xi_{1}^{(1)},\dots,\xi_n^{(1)}$ and
$\xi_{1}^{(2)},\dots,\xi_n^{(2)}$ for all possible conditions.
This fact implies the statement about the joint distribution of
the sequences $(\eta_l^{(1)},\eta_l^{(2)})$, $1\le l\le n$ and their
independence of the sequence $\varepsilon_1,\dots,\varepsilon_n$.
To prove identity~(\ref{(D8)}) let us fix a set
$M\subset\{1,\dots,n\}$, and consider the case when
$\varepsilon_l=1$ if $l\in M$ and $\varepsilon_l=-1$ if
$l\notin M$. Put $\beta_{V,M}(j,l)=1$ if $j\in V$ and $l\in M$
or $j\notin V$ and $l \notin M$, and let $\beta_{V,M}(j,l)=2$
otherwise. Then we have for all $(l_1,\dots,l_k)$,
$1\le l_j\le n$, $1\le j\le k$, and our fixed set $V$
\begin{eqnarray}
&&\sum_{\substack{(s_1,\dots,s_k)\colon\\
s_j=1 \textrm{ or }s_j=2,\;j=1,\dots, k}}
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right) \nonumber \\
&&\qquad\qquad\qquad =2^k f_{l_1,\dots,l_k}
\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right),
\label{(D10)}
\end{eqnarray}
since the product $(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})$
equals either zero or $2^k$, and it equals $2^k$ for that
sequence $(s_1,\dots,s_k)$ for which
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ for all
$1\le j\le k$, and the relation
$\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ is
equivalent to $\beta_{V,M}(j,l_j)=s_j$ for all $1\le j\le k$.
(In relation~(\ref{(D10)}) it is sufficient to consider only
such products for which $l_j\neq l_{j'}$ if $j\neq j'$
because of the properties of the functions $f_{l_1,\dots,l_k}$.)
Beside this, $\xi_l^{(\beta_{V,M}(j,l))}=\eta_l^{(\alpha_V(j))}$
for all $1\le l\le n$ and $1\le j\le k$, and as a consequence
$$f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(\beta_{V,M}(1,l_1))},\dots,
\xi_{l_k}^{(\beta_{V,M}(k,l_k))}\right)=
f_{l_1,\dots,l_k}\left(\eta_{l_1}^{(\alpha_V(1))},\dots,
\eta_{l_k}^{(\alpha_V(k))}\right).
$$
Summing up the identities (\ref{(D10)}) for all
$1\le l_1,\dots,l_k\le n$ and applying the last identity we
get relation~(\ref{(D8)}), since the identity obtained in such
a way holds for all $M\subset\{1,\dots,n\}$.
\hfill$\qed$
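\medskip\noindent
{\it Remark.} The combinatorial heart of the above proof is that
for every fixed sign vector $(\varepsilon_{l_1},\dots,
\varepsilon_{l_k})$ exactly one of the $2^k$ products
$(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})\cdots
(1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k})$ equals $2^k$ while all
the others vanish. This can be checked by enumeration; in the
Python sketch below $k=3$ and the subset $V$ is an arbitrary
choice of mine:

```python
from itertools import product

k = 3
V = {1, 3}  # a hypothetical choice of the subset V of {1,...,k}

def kappa(j, s):
    # kappa^{(j)}_{s,V} = 3 - 2*alpha_V(j) for s = 1, and its negative for s = 2
    base = 1 if j in V else -1
    return base if s == 1 else -base

def products(eps):
    # all 2^k products (1 + kappa^{(1)}_{s_1,V} eps_1)...(1 + kappa^{(k)}_{s_k,V} eps_k)
    out = []
    for s in product([1, 2], repeat=k):
        t = 1
        for j in range(1, k + 1):
            t *= 1 + kappa(j, s[j - 1]) * eps[j - 1]
        out.append(t)
    return out

# for every sign vector exactly one product equals 2^k, the others vanish
all(sorted(products(eps)) == [0] * (2 ** k - 1) + [2 ** k]
    for eps in product([-1, 1], repeat=k))  # → True
```

Each factor equals $0$ or $2$, and all $k$ factors equal $2$
exactly when $\kappa^{(j)}_{s_j,V}\varepsilon_{l_j}=1$ for every
$j$, which singles out one sequence $(s_1,\dots,s_k)$.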
\medskip\noindent
{\it Proof of Lemma D3.}\/ Let us first observe that if $\xi$
is a real valued random variable with zero expectation, then
$P(\xi\ge0) \ge \frac{(E|\xi|)^2}{4E\xi^2}$. Indeed, $E|\xi|
=2E(\xi I(\{\xi\ge0\}))$ because of $E\xi=0$, hence $(E|\xi|)^2
=4\left(E(\xi I(\{\xi\ge0\}))\right)^2\le4P(\xi\ge0)E\xi^2$ by the Schwarz
inequality, where $I(A)$ denotes the indicator function of
the set~$A$. (In the above calculation and in the subsequent proofs
I apply the convention $\frac00=1$. We need this convention if
$E\xi^2=0$. In this case we have the identities $P(\xi=0)=1$ and
$E|\xi|=0$, hence the above proved inequality holds in this
case, too.)
Given some $v\in B$, let us choose a linear operator $\kappa$ such
that $\|\kappa\|=1$, and $\kappa(v)=\|v\|$. Such an operator exists
by the Hahn--Banach theorem. Observe that
$\{\omega\colon\,\|v+Z(\omega)\|
\ge\|v\|\} \supset\; \{\omega\colon\,
\kappa(v+Z(\omega))\ge\kappa(v)\}
=\{\omega\colon\, \kappa(Z(\omega))\ge0\}$. Beside this,
$E\kappa(Z)=0$. Hence we can apply the above proved inequality
for $\xi=\kappa(Z)$, and it yields that
$P(\|v+Z\|\ge\|v\|)\ge P(\kappa(Z)\ge0)
\ge\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$. Lemma~D3 is proved.
\hfill$\qed$
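\medskip\noindent
{\it Remark.} The scalar inequality
$P(\xi\ge0)\ge\frac{(E|\xi|)^2}{4E\xi^2}$ at the heart of this
proof can be illustrated on a concrete zero-mean discrete
distribution; in the Python sketch below the distribution is an
arbitrary choice of mine, and all moments are computed exactly:

```python
# an arbitrary zero-mean discrete distribution: (value, probability) pairs
vals = [(-3.0, 0.2), (0.0, 0.3), (1.2, 0.5)]

mean = sum(v * p for v, p in vals)            # zero up to rounding
e_abs = sum(abs(v) * p for v, p in vals)      # E|xi|
e_sq = sum(v * v * p for v, p in vals)        # E xi^2
p_nonneg = sum(p for v, p in vals if v >= 0)  # P(xi >= 0)

p_nonneg >= e_abs ** 2 / (4 * e_sq)  # → True
```

Here $P(\xi\ge0)=0.8$ while the lower bound is roughly $0.14$; the
bound is rather crude, but, crucially for Lemma~D4, it does not
depend on the distribution beyond the two moments.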
\medskip\noindent
{\it Proof of Lemma D4.}\/
Take the class of random polynomials
$$
Y=\sum_{s=1}^k\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
b_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
where $\varepsilon_l$, $1\le l\le n$, are independent random
variables with $P(\varepsilon_l=1)=P(\varepsilon_l=-1)=\frac12$,
and the coefficients
$b_s(l_1,\dots,l_s)$, $1\le s\le k$, are arbitrary real numbers.
The proof of Lemma~D4 can be reduced to the statement that there
exists a constant $c_k>0$ depending only on the order~$k$ of these
polynomials such that the inequality
\begin{equation}
(E|Y|)^2\ge 4c_k EY^2 \label{(D11)}
\end{equation}
holds for all such polynomials~$Y$. Indeed, consider the polynomial
$$
Z=\sum_{s=1}^k\sum_{\substack {(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
a_s(l_1,\dots,l_s)\varepsilon_{l_1}\cdots\varepsilon_{l_s},
$$
and observe that $E\kappa(Z)=0$ for all linear functionals $\kappa$
on the space $B$. Hence Lemma~D3 implies that the left-hand side
expression in~(\ref{(D9)}) is bounded from below by
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}{4E\kappa(Z)^2}$.
On the other hand, relation~(\ref{(D11)}) implies that
$\inf\limits_{\kappa\in B'}\frac{(E|\kappa(Z)|)^2}
{4E\kappa(Z)^2}\ge c_k$.
To prove relation (\ref{(D11)}) first we compare the moments
$EY^2$ and $EY^4$. Let us introduce the random variables
$$
Y_s=\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\ l_j\neq l_{j'} \textrm{ if }
j\neq j'}}
b_s(l_1,\dots,l_s) \varepsilon_{l_1}\cdots\varepsilon_{l_s},
\quad 1\le s\le k.
$$
We shall show that the estimates of Chapter~13 imply that
\begin{equation}
EY_s^4\le 2^{4s} \left(EY_s^2\right)^2 \label{(D12)}
\end{equation}
for these random variables $Y_s$.
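Before turning to the proof, relation (\ref{(D12)}) can be
confirmed in a small case by exact enumeration over all $2^n$ sign
vectors. In the Python sketch below the order is $s=2$, and the
symmetric coefficients $b_s(l_1,l_2)=1/(1+l_1+l_2)$ are an
arbitrary choice of mine:

```python
from itertools import product

n, s = 5, 2
# symmetric coefficients b(l1, l2) for l1 != l2 (an arbitrary example)
b = {(i, j): 1.0 / (1 + i + j) for i in range(n) for j in range(n) if i != j}

m2 = m4 = 0.0
for eps in product([-1, 1], repeat=n):
    y = sum(b[(i, j)] * eps[i] * eps[j]
            for i in range(n) for j in range(n) if i != j)
    m2 += y * y
    m4 += y ** 4
m2 /= 2 ** n   # E Y_s^2 by exact enumeration
m4 /= 2 ** n   # E Y_s^4 by exact enumeration

m4 <= 2 ** (4 * s) * m2 ** 2  # → True
```

\medskip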
Relation (\ref{(D12)}) together with the uncorrelatedness of
the random variables $Y_s$, $1\le s\le k$, imply that
\begin{eqnarray*}
EY^4
&=&E\left(\sum_{s=1}^k Y_s\right)^4\le k^3\sum_{s=1}^k EY_s^4\le
k^3 2^{4k} \sum_{s=1}^k (EY_s^2)^2\\
&\le& k^3 2^{4k}\left(\sum_{s=1}^k EY_s^2\right)^2=k^3 2^{4k}(EY^2)^2.
\end{eqnarray*}
This estimate together with the H\"older inequality with $p=3$ and
$q=\frac32$ yield that
$$
EY^2=E\left(|Y|^{4/3}\cdot|Y|^{2/3}\right)\le
(EY^4)^{1/3}(E|Y|)^{2/3}\le k2^{4k/3}(EY^2)^{2/3}(E|Y|)^{2/3},
$$
i.e. $EY^2\le k^32^{4k}(E|Y|)^2$, and relation (\ref{(D11)}) holds
with $4c_k=k^{-3}2^{-4k}$. Hence to complete the proof of Lemma~D4
it is enough to check relation~(\ref{(D12)}).
In the proof of relation (\ref{(D12)}) we may assume that the
coefficients $b_s(l_1,\dots,l_s)$ of the random variable $Y_s$ are
symmetric functions of the arguments
$l_1,\dots,l_s$, since a symmetrization of these coefficients does
not change the value of $Y_s$. Put
$$
B^2_s=\sum_{\substack{(l_1,\dots,l_s)\colon\,
1\le l_j\le n,\; j=1,\dots, s,\\
l_j\neq l_{j'} \textrm{ if } j\neq j'}}
b_s^2(l_1,\dots,l_s), \quad 1\le s\le k.
$$
Then
$$
EY_s^2=s! B_s^2,
$$
and
$$
EY_s^4\le 1\cdot3\cdot5\cdots(4s-1)B_s^4
=\frac{(4s)!}{2^{2s}(2s)!}B_s^4
$$
by Lemmas 13.4 and 13.5 with the choice $M=2$ and $k=s$.
Inequality~(\ref{(D12)}) follows from the last two relations.
Indeed, to prove formula~(\ref{(D12)}) by means of these
relations it is enough to check that
$\frac{(4s)!}{2^{2s}(2s)!(s!)^2}\le 2^{4s}$. But it is easy to
check this inequality with induction with respect to $s$.
(Actually there is a well-known inequality in the literature,
known under the name Borell's inequality, which implies
inequality~(\ref{(D12)}) with a better coefficient at the right
hand side of this estimate.) We have proved Lemma~D4.
\hfill$\qed$
\medskip
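For completeness, the induction left to the reader in the proof of
Lemma~D4 can be sketched as follows. Writing
$a_s=\frac{(4s)!}{2^{2s}(2s)!(s!)^2}$, we have $a_1=3\le2^4$, and
$$
\frac{a_{s+1}}{a_s}
=\frac{(4s+1)(4s+2)(4s+3)(4s+4)}{2^2(2s+1)(2s+2)(s+1)^2}
=\frac{(4s+1)(4s+3)}{(s+1)^2}\le 16=2^4,
$$
since $(4s+1)(4s+3)=16s^2+16s+3\le16(s+1)^2$. Hence $a_s\le2^{4s}$
for all $s\ge1$.
\medskip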
Let us turn back to the estimation of the probability
$P(3\cdot2^{k-1}\|T_{n,k}(f)\|>u)$. Let us introduce the
$\sigma$-algebra ${\cal F}={\cal B}(\xi_l^{(1)},\xi_l^{(2)},\,1\le
l\le n)$ generated by the random variables $\xi_l^{(1)},\xi_l^{(2)}$,
$1\le l\le n$, and fix some set $V\subset\{1,\dots,k\}$.
I show with the help of Lemma~D4 and formula~(\ref{(D8)}) that
there exists some constant $c_k>0$ such that the random
variables $T_{n,k}(f(\ell))$ defined in formula~(\ref{(D5)}) and
$\hat I_{n,k,V}(f(\ell))$ defined in formula~(\ref{(D7)}) satisfy
the inequality
\begin{equation}
P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\cal F}\right)
\ge c_k \quad \textrm{ with probability 1.} \label{(D13)}
\end{equation}
In the proof of~(\ref{(D13)}) we shall exploit that in
formula~(\ref{(D8)}) $2^k\hat I_{n,k,V}(f(\ell))$ is represented
by a polynomial of the Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$ whose constant term is
$T_{n,k}(f(\ell))$. The coefficients of this polynomial are
functions of the random variables $\xi^{(1)}_l$ and $\xi^{(2)}_l$,
$1\le l\le n$. The independence of these random variables from
$\varepsilon_{l}$, $1\le l\le n$, and the definition of the
$\sigma$-algebra ${\cal F}$ yield that
\begin{eqnarray}
&&P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>
\|T_{n,k}(f(\ell))\||{\cal F}\right) \label{(D14)} \\
&&\qquad=P_{\varepsilon_V}\biggl(\biggl\|\frac1{k!}
\sum_{\substack{(l_1,\dots,l_k),\; (s_1,\dots,s_k)\colon\\
1\le l_j\le n, s_j=1 \textrm{ or }s_j=2,\\
j=1,\dots, k,}}
\!\!\!
(1+\kappa^{(1)}_{s_1,V}\varepsilon_{l_1})
\cdots (1+\kappa^{(k)}_{s_k,V}\varepsilon_{l_k}) \nonumber \\
&&\qquad\qquad \qquad\qquad\qquad\qquad\qquad
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_1)},
\dots,\xi_{l_k}^{(s_k)}\right)\biggr\| \nonumber \\
&&\qquad \qquad\qquad\qquad\qquad\qquad
>\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|\biggr),
\nonumber
\end{eqnarray}
where $P_{\varepsilon_V}$ means that the values of the
random variables $\xi_l^{(1)}$, $\xi_l^{(2)}$, $1\le l\le n$,
are fixed (their values depend on the atom of the
$\sigma$-algebra ${\cal F}$ we are considering), and the
probability is taken with respect to the remaining random
variables $\varepsilon_l$, $1\le l\le n$. On the right-hand
side of (\ref{(D14)}) we consider the probability that the
norm of a polynomial of order~$k$ of the random variables
$\varepsilon_1,\dots,\varepsilon_n$ is larger than
$\|T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)\|$.
Beside this, the constant term of this polynomial
equals~$T_{n,k}(f(\ell))(\xi_l^{(j)},\,1\le l\le n,\, j=1,2)$.
Hence this probability can be bounded by means of Lemma~D4,
and this result yields relation~(\ref{(D13)}).
The distributions of $I_{n,k,V}(f(\ell))$ and
$\hat I_{n,k,V}(f(\ell))$ agree by the first statement of Lemma~D2
and a comparison of formulas~(\ref{(D3)}) and~(\ref{(D7)}). Hence
relation (\ref{(D13)}) implies that
\begin{eqnarray*}
&&P\left(\|2^k I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right)
=P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|
\ge\frac13\cdot2^{1-k} u\right) \\
&&\qquad
\ge P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|\ge\|T_{n,k}(f(\ell))\|,\;
\|T_{n,k}(f(\ell))\|\ge\frac13\cdot2^{1-k} u\right)\\
&&\qquad=\int_{\{\omega\colon\, \|T_{n,k}(f(\ell))(\omega)\|
\ge\frac13\cdot2^{1-k} u\}}
\!\!\!\!\!
P\left(\|2^k\hat I_{n,k,V}(f(\ell))\|>\|T_{n,k}(f(\ell))\|
|{\cal F}\right)\,dP\\
&&\qquad \ge c_k P(3\cdot2^{k-1} \|T_{n,k}(f(\ell))\|\ge u).
\end{eqnarray*}
The last inequality with the choice of any set
$V\subset\{1,\dots,k\}$, $1\le |V|\le k-1$, together with
relation~(\ref{(D6)}) imply formula~(\ref{(D4)}).
We shall formulate an inductive hypothesis, and relation
(\ref{(14.13d)}) will be proved together with it by means of
an induction procedure with respect to the order $k$ of the
$U$-statistic. In the proof of this inductive procedure
we shall apply the already proved relation~(\ref{(D4)}). To
formulate it some new quantities will be introduced.
Let ${\cal W}={\cal W}(k)$ denote the set of all partitions
of the set $\{1,\dots,k\}$. Let us fix $k$ independent copies
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of the
sequence of random variables $\xi_{1},\dots,\xi_n$. Given a
partition $W=(U_1,\dots,U_s)\in{\cal W}(k)$ let us introduce
the function $s_W(j)$, $1\le j\le k$, which tells for each
argument $j$ the index of the element of the partition~$W$
that contains the point $j$, i.e. the value of the function
$s_W(j)$ in a point $j$ is defined by the
relation $j\in U_{s_W(j)}$. Let us introduce the expression
\begin{eqnarray*}
I_{n,k,W}(f(\ell))
&&=\frac1{k!}\sum_{ (l_1,\dots,l_k)\colon\,
1\le l_j\le n,\;j=1,\dots,k}
f_{l_1,\dots,l_k}\left(\xi_{l_1}^{(s_W(1))},
\dots,\xi_{l_k}^{(s_W(k))}\right)\\
&&\qquad\qquad\qquad\qquad\qquad\qquad\qquad
\textrm{for all }W\in{\cal W}(k).
\end{eqnarray*}
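As an illustration (not part of the text), the quantity $I_{n,k,W}(f(\ell))$ just defined can be computed mechanically for toy data. The kernel, sample size and partition below are hypothetical choices; the function `s_W` is implemented exactly as defined above, with 0-based indices.

```python
import itertools
import random
from math import factorial

# Toy sketch: a decoupled U-statistic with generalized decoupling
# I_{n,k,W}(f) for a partition W of {1,...,k}.
random.seed(1)
n, k = 4, 3
W = [{1, 2}, {3}]                 # a partition of {1, 2, 3}; its rank is 2

def s_W(j):
    # index (0-based here) of the element of the partition W containing j
    return next(i for i, U in enumerate(W) if j in U)

# k independent copies xi[i][l] of a sample xi_1, ..., xi_n
xi = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

def f(x1, x2, x3):
    # a hypothetical kernel function of k = 3 variables
    return x1 * x2 * x3

# sum over all (l_1,...,l_k), 1 <= l_j <= n, of
# f(xi^{(s_W(1))}_{l_1}, ..., xi^{(s_W(k))}_{l_k}), divided by k!
total = sum(
    f(*(xi[s_W(j + 1)][ls[j]] for j in range(k)))
    for ls in itertools.product(range(n), repeat=k)
)
I_nkW = total / factorial(k)
```

Here coordinates 1 and 2 draw from the same copy of the sample (they lie in the same element of $W$), while coordinate 3 draws from an independent copy.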
An expression of the form $I_{n,k,W}(f(\ell))$, $W\in{\cal W}(k)$,
will be called a decoupled $U$-statistic with generalized
decoupling. Given a partition $W=(U_1,\dots,U_s)\in{\cal W}(k)$
let us call the number $s$, i.e.\ the number of elements of
this partition, the rank both of the partition $W$ and of the
decoupled $U$-statistic $I_{n,k,W}(f(\ell))$ with generalized
decoupling.
Now I formulate the following hypothesis. For all $k\ge2$ and
$2\le j\le k$ there exist some constants $C(k,j)>0$ and
$\delta(k,j)>0$ such that for all $W\in{\cal W}(k)$ a decoupled
$U$-statistic $I_{n,k,W}(f(\ell))$ with generalized decoupling
satisfies the inequality
\begin{eqnarray}
&&P(\|I_{n,k,W}(f(\ell))\|>u)\le C(k,j)P\left(\|\bar
I_{n,k}(f(\ell))\|>\delta(k,j) u\right) \nonumber \\
&&\qquad\qquad\qquad\textrm{for all }2
\le j\le k \textrm{ if the rank of } W \textrm{ equals }j.
\label{(D15)}
\end{eqnarray}
It will be proved by induction with respect to $k$ that both
relations~(\ref{(14.13d)}) and~(\ref{(D15)}) hold for
$U$-statistics of order~$k$.
Let us observe that for $k=2$ relation~(\ref{(14.13d)})
follows from~(\ref{(D4)}).
Relation~(\ref{(D15)}) also holds for $k=2$, since in
this case we have to consider only the case $j=k=2$, and
then it holds with $C(2,2)=1$ and $\delta(2,2)=1$. Hence
we can start our inductive proof with $k=3$. First I prove
relation~(\ref{(D15)}).
In relation (\ref{(D15)}) the tail-distribution of decoupled
$U$-sta\-tis\-tics with generalized decoupling is compared
with that of the decoupled $U$-statistic
$\bar I_{n,k}(f(\ell))$ introduced in~(\ref{(D2)}). Given
the order $k$ of these $U$-statistics, relation~(\ref{(D15)})
will be proved by means of a backward induction with
respect to the rank~$j$ of the decoupled $U$-statistics
$I_{n,k,W}(f(\ell))$ with generalized decoupling.
Relation~(\ref{(D15)}) clearly holds for $j=k$ with $C(k,k)=1$
and $\delta(k,k)=1$. If we already know that these relations
hold up to $k-1$, then we first prove relation~(\ref{(D15)})
for $U$-statistics of order~$k$ with generalized decoupling
by backward induction with respect to the rank $2\le j\le k-1$.
For this goal we shall show that for each partition
$W\in{\cal W}(k)$ of rank $j$, $2\le j\le k-1$, there exists a
partition $\bar W\in{\cal W}(k)$ whose rank is larger than the
rank of $W$ such that
\begin{equation}
P(\|I_{n,k,W}(f(\ell))\|>u)\le \bar A(k) P\left(\|I_{n,k,\bar W}
(f(\ell))\|>\bar \gamma(k) u\right) \label{(D16)}
\end{equation}
with $\bar A(k)=\sup\limits_{2\le p\le k-1}A(p)$,
$\bar\gamma(k)=\inf\limits_{2\le p\le k-1}\gamma(p)$ if the
rank $j$ of $W$ is such that $2\le j\le k-1$, where the
constants $A(p)$ and $\gamma(p)$ agree with the corresponding
coefficients in formula~(\ref{(14.13d)}).
To prove relation~(\ref{(D16)}) (where $U_j=\{t,\dots,k\}$
is the last element of the partition~$W$) let us define
the $\sigma$-algebra ${\cal F}$ generated by the random
variables appearing in the first $t-1$ coordinates
of these $U$-statistics, i.e. by the random variables
$\xi^{s_W(j)}_{l_j}$, $1\le j\le t-1$, $1\le l_j\le n$.
We have $2\le t\le k-1$.
By our inductive hypothesis relation~(\ref{(14.13d)})
holds for $U$-statistics of order $p=k-t+1$,
since $2\le p\le k-1$. I claim that this implies that
\begin{equation}
P(\|I_{n,k,W}(f(\ell))\|>u|{\cal F})\le A(k-t+1)
P\left(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u|{\cal F}\right) \label{(D17)}
\end{equation}
with probability~1. Indeed, by the independence properties of
the random variables $\xi_l^{s_W(j)}$
(and $\xi_l^{s_{\bar W}(j)}$),
$1\le j\le k$, $1\le l\le n$,
$$
P(\|I_{n,k,W}(f(\ell))\|>u|{\cal F})
=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,W}(f(\ell))\|>u)
$$
and
\begin{eqnarray*}
&&P\left(\|I_{n,k,\bar W}(f(\ell))\|>\gamma(k-t+1)u|{\cal F}\right)\\
&&\qquad=P_{\xi_l^{s_W(j)},1\le j\le t-1}(\|I_{n,k,\bar W}(f(\ell))\|
>\gamma(k-t+1)u),
\end{eqnarray*}
where $P_{\xi_l^{s_W(j)}, 1\le j\le t-1}$ denotes that the
values of the
random variables $\xi_l^{s_W(j)}(\omega)$, $1\le j\le t-1$,
$1\le l\le n$, are fixed, and we consider the probability that
the appropriate functions of these fixed values and of the
remaining random variables
$\xi_l^{s_W(j)}$ and $\xi_l^{s_{\bar W}(j)}$, $t\le j\le k$, $1\le l\le n$,
satisfy the desired relation. These identities and the relation
between the sets $W$ and $\bar W$ imply that relation~(\ref{(D17)})
is equivalent to the identity~(\ref{(14.13d)}) for the generalized
$U$-statistics of order $2\le k-t+1\le k-1$ with kernel functions
\begin{eqnarray*}
&&f_{l_t,\dots,l_k}(x_t,\dots,x_k)\\
&&\qquad= \!\!\!\!\!
\sum_{(l_1,\dots,l_{t-1})\colon\, 1\le l_j\le n,\;1\le j\le t-1}
\!\!\!\!\!\!\!\!\!\!
f_{l_1,\dots,l_k}(\xi_{l_1}^{s_W(1)}(\omega),\dots,
\xi_{l_{t-1}}^{s_W(t-1)}(\omega),x_t,\dots,x_k).
\end{eqnarray*}
Relation~(\ref{(D16)}) follows from inequality~(\ref{(D17)}) if
expectation is taken at both sides. As the rank of $\bar W$ is
strictly greater than the rank of $W$, relation~(\ref{(D16)})
together with our backward inductive assumption imply
relation~(\ref{(D15)}) for all $2\le j\le k$.
Relation~(\ref{(D15)}) implies in particular (with the
application of partitions of order~$k$ and rank~2) that the
terms in the sum at the right-hand side of~(\ref{(D4)})
satisfy the inequality
$$
P\left(D_k\|I_{n,k,V}(f(\ell))\|>u\right)\le \bar C_k
P\left(\|\bar I_{n,k}(f(\ell))\|>\bar D_k u\right)
$$
with some appropriate $\bar C_k>0$ and $\bar D_k>0$ for all
$V\subset\{1,\dots,k\}$, $1\le|V|\le k-1$. This inequality
together with relation~(\ref{(D4)}) imply that
inequality~(\ref{(14.13d)}) also holds for
the parameter~$k$.
\medskip
In such a way we get the proof of relation (\ref{(14.13d)}) and
its special case, relation~(\ref{(14.13)}). Let us prove
formula~(\ref{(14.14)}) with its help first in the simpler case
when the supremum of finitely many functions is taken. If
$M<\infty$ functions $f_1,\dots,f_M$ are considered, then
relation~(\ref{(14.14)}) for the supremum of the $U$-statistics
and decoupled $U$-statistics with these kernel functions can be
derived from formula (\ref{(14.13)}) if it is applied for the
function $f=(f_1,\dots,f_M)$ with values in the separable
Banach space $B_M$ which consists of the vectors
$(v_1,\dots,v_M)$, $v_j\in B$, $1\le j\le M$, and the norm
$\|(v_1,\dots,v_M)\|=\sup\limits_{1\le j\le M}\|v_j\|$ is
introduced in it. The application of formula (\ref{(14.13)})
with this choice yields formula~(\ref{(14.14)}) for this
supremum. Let us emphasize that the constants appearing in
this estimate do not depend on the number~$M$. (We took only
$M<\infty$ kernel functions, because with such a choice the
Banach space $B_M$ defined above is also separable.)
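The product-space device above can be mimicked in a toy computation (all numerical values below are hypothetical stand-ins for the norms $\|I_{n,k}(f_s)\|$, not data from the text): the norm on $B_M$ is the maximum of the coordinate norms, and the finite suprema increase monotonically to the countable supremum, which is what the limiting argument below exploits.

```python
# Toy illustration of the B_M device: the norm of (v_1,...,v_M) is the
# supremum of the coordinate norms, and the sup over 1 <= s <= M
# increases monotonically to the supremum over all s as M grows.
norms = [abs(((-1) ** s) / (1 + 0.5 * s)) for s in range(50)]

def B_M_norm(vs):
    # norm on the product space B_M: sup of coordinate norms
    return max(vs)

finite_sups = [B_M_norm(norms[: M + 1]) for M in range(len(norms))]

# the finite suprema are monotone non-decreasing in M ...
assert all(a <= b for a, b in zip(finite_sups, finite_sups[1:]))
# ... and reach the overall supremum
assert finite_sups[-1] == max(norms)
```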
Since the distributions of the random variables
$\sup\limits_{1\le s\le M}\left\|I_{n,k}(f_s)\right\|$
converge to that of
$\sup\limits_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|$, and
the distributions of the random variables $\sup\limits_{1\le s\le M}
\left\| \bar I_{n,k}(f_s)\right\|$ converge to that of
$\sup\limits_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|$ as
$M\to\infty$, relation (\ref{(14.14)}) in the general case
follows from its already proved special case and a limiting
procedure $M\to\infty$.
\hfill$\qed$
\medskip\noindent
{\it Remark.} Formula (\ref{(14.13d)}) proved above can be
slightly generalized. It also holds if the expressions
$I_{n,k}(f(\ell))$ and $\bar I_{n,k}(f(\ell))$ appearing in
this inequality are defined in a more general way. Namely,
they are the random functions introduced in
formulas~(\ref{(D1)}) and (\ref{(D2)}), but the sequences
$\xi_1,\dots,\xi_n$ and their independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$ in these formulas are independent
random variables which may also be non-identically distributed.
Such a generalization can be proved without any essential change
in the original proof.
\begin{thebibliography}{99}
\bibitem{r1}
Adamczak, R. (2006) Moment inequalities for
$U$-statistics. {\it Annals of Probability} {\bf34}, 2288--2314
\bibitem{r2}
Ajtai, M., Koml\'os, J. and Tusn\'ady, G. (1984) On optimal matchings.
{\it Combinatorica}\/ {\bf 4} no. 4, 259--264
\bibitem{r3}
Alexander, K. (1987) The central limit theorem for
empirical processes over Vapnik--\v{C}ervonenkis classes. {\it Annals
of Probability} {\bf 15}, 178--203
\bibitem{r4}
Arcones, M. A. and Gin\'e, E. (1993) Limit theorems for
$U$-processes. {\it Annals of Probability}, {\bf 21}, 1494--1542
\bibitem{r5}
Arcones, M. A. and Gin\'e, E. (1994) $U$-processes
indexed by Vapnik--\v{C}ervonenkis classes of functions with
application to asymptotics and bootstrap of $U$-statistics with
estimated parameters. {\it Stoch. Proc. Appl.} {\bf 52}, 17--38
\bibitem{r6}
Bennett, G. (1962) Probability inequality for the sum of
independent random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 57}, 33--45
\bibitem{r7}
Bonami, A. (1970) \'Etude des coefficients de Fourier des
fonctions de $L^p(G)$. {\it Ann. Inst. Fourier (Grenoble)\/} {\bf 20}
335--402
\bibitem{r7a}
Borell, C. (1979) On the integrability of Banach space valued Walsh
polynomials. {\it S\'eminaire de Probabilit\'es XIII. Lecture Notes
in Math.} 721, 1--3 Springer--Verlag, Berlin
\bibitem{r8}
de la Pe\~na, V. H. and Gin\'e, E. (1999) {\it
Decoupling. From dependence to independence.}\/ Springer series in
statistics. Probability and its application. Springer--Verlag,
New York, Berlin, Heidelberg
\bibitem{r9}
de la Pe\~na, V. H. and Montgomery--Smith, S. (1995)
Decoupling inequalities for the tail-probabilities of multivariate
$U$-statistics. {\it Ann. Probab.}, {\bf 23}, 806--816
\bibitem{r10}
Dobrushin, R. L. (1979) Gaussian and their subordinated
self-similar random generalized fields. {\it Annals of
Probability}\/ {\bf 7}, 1--28
\bibitem{r11}
Dudley, R. M. (1978) Central limit theorems for empirical
measures. {\it Annals of Probability}\/ {\bf 6}, 899--929
\bibitem{r12}
Dudley, R. M. (1984) A course on empirical processes.
{\it Lecture Notes in Mathematics}\/ {\bf 1097}, 1--142
Springer--Verlag, New York
\bibitem{r13}
Dudley, R. M. (1989) {\it Real Analysis and
Probability.}\/ Wadsworth \& Brooks, Pacific Grove, California
\bibitem{r14}
Dudley, R. M. (1998) {\it Uniform Central Limit
Theorems.}\/ Cambridge University Press, Cambridge U.K.
\bibitem{r15}
Dynkin, E. B. and Mandelbaum, A. (1983) Symmetric
statistics, Poisson processes and multiple Wiener integrals. {\it
Annals of Statistics\/} {\bf 11}, 739--745
\bibitem{r16}
Frankl, P. and Pach J. (1983) On the number of sets in
null-$t$-design. {\it European J. Combinatorics} {\bf 4} 21--23
\bibitem{r17}
Gin\'e, E. and Guillou, A. (2001) On consistency of
kernel density estimators for randomly censored data: Rates holding
uniformly over adaptive intervals. {\it Ann. Inst. Henri
Poincar\'e PR\/} {\bf 37} 503--522
\bibitem{r18}
Gin\'e, E., Kwapie\'n, S., Lata\l{}a, R. and Zinn, J.
(2001) The LIL for canonical $U$-statistics of order~2.
{\it Annals of Probability} {\bf 29} 520--527
\bibitem{r19}
Gin\'e, E., Lata\l{}a, R. and Zinn, J. (2000)
Exponential and moment inequalities for $U$-statistics in {\it High
dimensional probability II.} Progress in Probability 47. 13--38.
Birkh\"auser Boston, Boston, MA.
\bibitem{r20}
Gross, L. (1975) Logarithmic Sobolev inequalities.
{\it Amer. J. Math.} {\bf 97}, 1061--1083
\bibitem{r21}
Guionnet, A. and Zegarlinski, B. (2003) Lectures on
Logarithmic Sobolev inequalities. {\it Lecture Notes in Mathematics}
{\bf 1801} 1--134. Springer--Verlag, New York
\bibitem{r22}
Hanson, D. L. and Wright, F. T. (1971) A bound on the
tail probabilities for quadratic forms in independent random
variables. {\it Ann. Math. Statist.} {\bf 42} 52--61
\bibitem{r23}
Hoeffding, W. (1948) A class of statistics with
asymptotically normal distribution. {\it Ann. Math. Statist.}
{\bf 19} 293--325
\bibitem{r24}
Hoeffding, W. (1963) Probability inequalities for sums
of bounded random variables. {\it J. Amer. Statist. Assoc.}\/
{\bf 58}, 13--30
\bibitem{r25}
It\^o K. (1951) Multiple Wiener integral. {\it J. Math.
Soc. Japan}\/ {\bf3}. 157--164
\bibitem{r25a}
Janson, S. (1997) {\it Gaussian Hilbert Spaces.}
Cambridge University Press, Cambridge
\bibitem{r26}
Kaplan, E.L. and Meier P. (1958) Nonparametric
estimation from incomplete data, {\it Journal of American
Statistical Association}, {\bf 53}, 457--481.
\bibitem{r27}
Lata\l{a}, R. (2006) Estimates of moments and tails of
Gaussian chaoses. {\it Annals of Probability} {\bf34} 2315--2331
\bibitem{r28}
Ledoux, M. (1996) On Talagrand deviation inequalities
for product measures. {\it ESAIM: Probab. Statist.}\/ {\bf 1.}
63--87. Available at http://www.emath.fr/ps/.
\bibitem{r29}
Ledoux, M. (2001) The concentration of measure phenomenon.
{\it Mathematical Surveys and Monographs}\/ {\bf 89} American
Mathematical Society, Providence, RI.
\bibitem{r30}
Major, P. (1981) Multiple Wiener--It\^o integrals. {\it
Lecture Notes in Mathematics\/} {\bf 849}, Springer--Verlag, Berlin,
Heidelberg, New York
\bibitem{r31}
Major, P. (1988) On the tail behaviour of the
distribution function of multiple stochastic integrals. {\it
Probability Theory and Related Fields}, {\bf 78}, 419--435
\bibitem{r32}
Major, P. (1994) Asymptotic distributions for weighted
$U$-statistics. {\it The Annals of Probability}, {\bf 22} 1514--1535
\bibitem{r33}
Major, P. (2005) An estimate about multiple stochastic
integrals with respect to a normalized empirical measure.
{\it Studia Scientarum Mathematicarum Hungarica.} {\bf 42}(3) 295--341 %
\bibitem{r34}
Major, P. (2005) Tail behaviour of multiple random integrals
and $U$-sta\-tis\-tics. {\it Probability Surveys} {\bf2} 448--505
\bibitem{r35}
Major, P. (2006) An estimate on the maximum of a nice
class of stochastic integrals. {\it Probability Theory
and Related Fields.} {\bf 134}, 489--537
\bibitem{r36}
Major, P. (2006) A multivariate generalization of
Hoeffding's inequality. {\it Electronic Communications in
Probability} {\bf 11} 220--229
\bibitem{r37}
Major, P. (2007) On a multivariate version of
Bernstein's inequality. {\it Electronic Journal of
Probability} {\bf12} 966--988
%\bibitem{r38}
%Major, P. (2005) On the tail behaviour of multiple
%random integrals and degenerate $U$-statistics. (First version of
%this lecture note) http://www.renyi.hu/\~{}major
\bibitem{r39}
Major, P. and Rejt\H{o}, L. (1988) Strong embedding of
the distribution function under random censorship. {\it Annals of
Statistics} {\bf 16}, 1113--1132
\bibitem{r40}
Major, P. and Rejt\H{o}, L. (1998) A note on
nonparametric estimations. In the conference volume to the 65.
birthday of Mikl\'os Cs\"org\H{o}. 759--774
\bibitem{r41}
Malyshev, V. A. and Minlos, R. A. (1991) {\it Gibbs Random
Fields. Method of cluster expansion.} Kluwer Academic Publishers,
Dordrecht
\bibitem{r42}
Massart, P. (2000) About the constants in Talagrand's
concentration inequalities for empirical processes.
{\it Annals of Probability}\/ {\bf 28}, 863--884
\bibitem{r43}
McKean, H. P. (1973) Wiener's theory of non-linear
noise. In {\it Stochastic Differential Equations}
SIAM--AMS Proc. 6 197--209
\bibitem{r44}
Nelson, E. (1973) The free Markov field. {\it J. Functional
Analysis} {\bf 12}, 211--227
\bibitem{r44b}
Nourdin, I. and Peccati, G. (2012) {\it Normal approximations with
Malliavin calculus: from Stein's method to Universality.} Cambridge
Tracts in Mathematics, 192 Cambridge University Press, Cambridge
\bibitem{r44c}
Nualart, D. (2006) {\it The Malliavin calculus and related
topics.} Probability and its Applications. 2nd edition,
Springer--Verlag, Berlin
\bibitem{r44d}
Peccati, G. and Taqqu, M. S. (2010) {\it Wiener chaos: moments, cumulants
and diagrams.} Springer--Verlag, New York
\bibitem{r45}
Pollard, D. (1984) {\it Convergence of Stochastic
Processes.}\/ Springer--Verlag, New York
\bibitem{r46}
Rota, G.-C. and Wallstrom, C. (1997) Stochastic
integrals: a combinatorial approach. {\it Annals of Probability}
{\bf 25} (3) 1257--1283
\bibitem{r47}
Surgailis, D. (1984) On multiple Poisson stochastic
integrals and associated Markov semigroups. {\it Probab. Math.
Statist.} {\bf 3} no. 2, 217--239
\bibitem{r48}
Surgailis, D. (2000) Long-range dependence and Appell
rank. {\it Annals of Probability} {\bf 28} 478--497
%\bibitem{r41}
%Surgailis, D. (2000) CLTs for polynomials of linear
%sequences: Diagram formulae with illustrations. in {\it Long Range
%Dependence} 111--128 Birkh\"auser, Boston, Boston, MA.
\bibitem{r49}
Szeg\H{o}, G. (1967) {\it Orthogonal Polynomials.}
American Mathematical Society Colloquium Publications. Vol. 23,
American Mathematical Society, Providence, R.I.
\bibitem{r50}
Takemura, A. (1983) Tensor Analysis of ANOVA
decomposition. {\it J. Amer. Statist. Assoc.} {\bf 78}, 894--900
\bibitem{r51}
Talagrand, M. (1994) Sharper bounds for Gaussian and
empirical processes. {\it Annals of Probability} {\bf 22}, 28--76
\bibitem{r52}
Talagrand, M. (1996) New concentration inequalities in
product spaces. {\it Invent. Math.} {\bf 126}, 505--563
\bibitem{r52a}
Talagrand M. (2003) {\it Spin Glasses: A challenge for mathematicians.}
Springer--Verlag, Berlin
\bibitem{r53}
Talagrand, M. (2005) {\it The Generic Chaining.}
Springer Monographs in Mathematics. Springer--Verlag, Berlin
Heidelberg New York
\bibitem{r54}
Vapnik, V. N. (1995) {\it The Nature of Statistical
Learning Theory.} Springer--Verlag, New York
\bibitem{r55}
Wiener, N. (1938) The homogeneous chaos. {\it Amer. J. Math.} {\bf 60}
879--936
\end{thebibliography}
\backmatter
\printindex
\extrachap{Acronyms}
\begin{description}
\item[$\Phi(u)$] {Standard normal distribution function. page~13}
\item[${\cal F}$]{Generally denotes a class of functions with
some nice property. See e.g. page~20}
\item[$S_n(f)$] {The normalized sum
$\frac1{\sqrt n}\sum\limits_{k=1}^nf(\xi_k)$ of independent
identically distributed random variables with some test function $f$.
page~20}
\item[$\mu_n(A)$] {The value of the empirical distribution on the
set $A$. page~21}
\item[$J_n(f)$]{One-fold random integral with respect
to a normalized empirical distribution. page~21}
\item[$J_{n,k}(f)$] {$k$-fold random integral with respect to a
normalized empirical distribution. page~28}
\item[$\int'$] {The prime in the integral means that the diagonals
are omitted from the domain of integration of a multiple
integral. page~28}
\item[$|S|$] {The cardinality of a (finite) set $S$. page~32}
\item[$I_{n,k}(f)$] {$U$-statistic of order $k$ with $n$ sample
points and kernel function~$f$. page~64}
\item[$I_{n,0}(c)$] {$U$-statistic of order zero, where $c$ is a
constant. page~65}
\item[$\textrm{Sym}\, f$] {Symmetrization of the function $f$. page~95}
\item[$\mu_W$] {White noise with reference measure $\mu$. pages~67 and 92}
\item[$Z_{\mu,k}(f)$] {$k$-fold Wiener--It\^o integral with respect
to a white noise with reference measure~$\mu$. pages~67 and 94}
\item[$P_jf$] {The projection of the function $f$ defined in the
Euclidean space $R^k$ to the subspace consisting of the functions
not depending on the $j$-th coordinate. page~76}
\item[$Q_jf$] {The projection orthogonal to the projection $P_j$ in
the space of functions on $R^k$. page~76}
\item[$f_V(x_{j_1},\dots,x_{j_{|V|}})$] {The canonical function depending
on the arguments indexed by the set $V$ which appears in the Hoeffding
decomposition of the $U$-statistic $I_{n,k}(f)$. page~77}
\item[${\cal H}_{\mu,k}$] {The class of functions which can be chosen as
the kernel function of a $k$-fold Wiener--It\^o integral with respect to
a white noise with reference measure~$\mu$. page~93}
\item[$\Gamma(k,l)$] {The class of diagrams in the diagram formula
for the product of a $k$-fold and an $l$-fold Wiener--It\^o
integral. page~98}
\item[$F_\gamma(f,g)$] {The kernel function of the Wiener--It\^o integral
corresponding to the diagram~$\gamma$ in the diagram formula for
the product of two Wiener--It\^o integrals. page~99 \hfill\break
The kernel function $F_\gamma(f_1,f_2)$ corresponding to the coloured diagram
$\gamma$ in the diagram formula for the product of two degenerate
$U$-statistics appears at page~119}
\item[$\Gamma(k_1,\dots,k_m)$] {The class of diagrams in the diagram
formula for the product of Wiener--It\^o integrals of order $k_1$, $k_2$,
\dots $k_m$. page~104 \hfill\break
The same notation is applied for the class of coloured diagrams in the
diagram formula for the product of degenerate $U$-statistics. page~117}
\item[$F_\gamma(f_1,\dots,f_m)$] {The kernel function of the
Wiener--It\^o integral in the general form of the diagram formula
corresponding to the diagram~$\gamma$. page~105 \hfill\break
The same notation is applied for the kernel function corresponding
to a coloured diagram~$\gamma$ in the diagram formula for the
product of degenerate $U$-statistics. page~126}
\item[$\bar\Gamma(k_1,\dots,k_m)$] {The class of closed diagrams
in the diagram formula. page~108 \hfill\break
The same notation for the class of closed coloured diagrams. page~130}
\item[$H_k(u)$] {The $k$-th Hermite polynomial with leading
coefficient~1. page~109}
\item[$\textrm{Exp}\,({\cal H}_\mu)$] {The Fock space. page~110}
\item[$\ell(\beta)$] {The length of a chain $\beta$ in a (coloured)
diagram. page~117}
\item[$c(\beta)$] {The colour of a chain $\beta$ in a
(coloured) diagram. page~117}
\item[$O(\gamma)$ and $C(\gamma)$] {The open and closed chains
of a coloured diagram~$\gamma$. page~117}
\item[$O_2(\gamma)$] {The set of open chains of length~2 in a
coloured diagram with two rows. page~119}
\item[$W(\gamma)$] {An appropriate function of a coloured
diagram~$\gamma$ appearing in the diagram formula for the product
of degenerate $U$-statistics. It is defined in the case of the
product of two degenerate $U$-statistics at page~120, and in the
general case at page~126}
\item[$\bar I_{n,k}(f)$] {Decoupled $U$-statistic of order~$k$
with $n$ sample points. page~169}
\item[$\bar I_{n,k}^\varepsilon(f)$] {Randomized decoupled
$U$-statistic of order~$k$ with $n$ sample points. \hfill\break
page~169}
\item[$\tilde I_{n,k}(f)$ and $\tilde I_{n,k}^\varepsilon(f)$]
{Some linear combinations of decoupled $U$-statistics and randomized
decoupled $U$-statistics applied in the symmetrization argument of
Chapter~15. page~174}
\item[$H_{n,k}(f)$] {A random variable appearing in the definition of
good tail behaviour for a class of integrals of decoupled
$U$-statistics in Chapter~15. page~178}
\item[$\cal G$] {A class of diagrams defined in Chapter~16, applied
in the proof of the main result. page~184}
\item[$H_{n,k}(f|G,V_1,V_2)$] {A random variable playing a central role
in the proofs of Chapters~16 and~17. It depends on a function of
$k$ variables, a diagram~$G$ and two subsets $V_1$ and $V_2$ of
the set $\{1,\dots,k\}$. page~185}
\item[$I_{n,k}(f(\ell))$] {Generalized $U$-statistics.
page~257}
\item[$\bar I_{n,k}(f(\ell))$] {Generalized decoupled $U$-statistics.
page~257}
\end{description}
\end{document}