%BeginFileInfo
%%Publisher=MATTSON
%%Project=PS
%%Manuscript=PS-2005-50
%%Stage=
%%TID=henrikas
%%Format=latex006
%%Distribution=live4
%%Destination=PS
%%PS.Maker=vtex_tex_ps
%%PDF.Maker=vtex_tex_pdf
%EndFileInfo
\documentclass{article}
%\usepackage{nextcount}
\RequirePackage[OT1]{fontenc}
\RequirePackage[ps,amsthm,amsmath,noinfoline]{imsart}
\RequirePackage[dvips]{hyperref}
%\RequirePackage[pdftex]{hyperref}
% will be filled by editor:
\doi{10.1214/154957805100000186}
\pubyear{2005}
\volume{2}
\firstpage{448}
\lastpage{505}
\begin{document}
\begin{frontmatter}
\title{Tail behaviour of multiple random integrals and
$U$-statistics\thanksref{t1}}
\thankstext{t1}{This is an original survey paper}
\runtitle{Random integrals and $U$-statistics}
\begin{aug}
\author{\fnms{P\'eter} \snm {Major}\corref{}
\ead[label=e1]{major@renyi.hu}}
%\ead[label=e2,url]{www.renyi.hu/$\sim$major}}
\address{Alfr\'ed R\'enyi Mathematical Institute of the Hungarian
Academy of Sciences\\
\printead{e1}\\
url: \href{http://www.renyi.hu/~major}{http://www.renyi.hu/$\sim$major}}
%\printead{e2}}
\end{aug}
\runauthor{P\'eter Major}
\begin{abstract}
This paper contains sharp estimates about the distribution of multiple
random integrals of functions of several variables with respect to a
normalized empirical measure, about the distribution of $U$-statistics
and multiple Wiener--It\^o integrals with respect to a white noise.
It also contains good estimates about the supremum of appropriate
classes of such integrals or $U$-statistics. The proofs of most
results are omitted; I have concentrated on explaining their
content and the picture behind them. I have also tried to explain the
reason for investigating such questions. My goal was to give a
presentation of the results that a non-expert can also understand,
and not only at a formal level.
\end{abstract}
\begin{keyword}[class=AMS]
\kwd[Primary ] {60F10}
\kwd[; secondary ]{60G50}
\end{keyword}
\begin{keyword}
\kwd{multiple Wiener--It\^o integrals}
\kwd{(degenerate) $U$-statistics}
\kwd{large-deviation type estimates}
\kwd{diagram formula}
\kwd{symmetrization}
\kwd{decoupling method}
\kwd{$L_p$-dense classes of functions}
\end{keyword}
% history:
\received{\smonth{1} \syear{2005}}
\end{frontmatter}
\newtheorem{thm}{Theorem}[section]
\newtheorem{exmp}[thm]{Example}
\newtheorem{lem}[thm]{Lemma}
\newtheorem{prop}[thm]{Proposition}
\section{Formulation of the main problems}\label{s1}
To formulate the main problems discussed in this paper, first I
introduce some notation. Let us have a sequence of independent and
identically distributed random variables $\xi_1,\dots,\xi_n$ on a
measurable space $(X,{\cal X})$ with distribution $\mu$, and introduce
their empirical distribution
\begin{equation}
\mu_n(A)=\frac1n\#\{j\colon\,\xi_j\in A,\;1\le j\le n\},\quad A\in{\cal X}.
\label{1.1}
\end{equation}
Given a measurable function $f(x_1,\dots,x_k)$ on the product space
$(X^k,{\cal X}^k)$ let us consider the integral of this function with
respect to the $k$-fold direct product of the normalized version
$\sqrt n(\mu_n-\mu)$ of the empirical measure $\mu_n$, i.e. take
the integral
\begin{eqnarray}
J_{n,k}(f) \!\!\!\!\! \!\!\!\!\! &&=\frac{n^{k/2}}{k!} \int'
f(x_1,\dots,x_k)(\mu_n(\,dx_1)-\mu(\,dx_1))\dots
(\mu_n(\,dx_k)-\mu(\,dx_k)), \nonumber \\
&&\quad\textrm{where the prime in $\textstyle\int'$ means that the
diagonals } x_j=x_l, \nonumber \\
&&\quad 1\le j<l\le k, \textrm{ are omitted from the domain of
integration.} \label{1.2}
\end{eqnarray}
We shall be interested in the following problem:
\medskip\noindent
{\it Problem a).}
Give a good estimate of the probabilities
$P(J_{n,k}(f)>u)$ under some appropriate (and not too restrictive)
conditions on the function $f$.
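For $k=2$ the defining integral in (\ref{1.2}) reduces to a finite sum,
since $\mu_n$ is purely atomic and, for a non-atomic $\mu$, the omitted
diagonal only affects the $\mu_n\times\mu_n$ part of the expanded
product. The following sketch (my illustration, with the hypothetical
choices $f(x,y)=xy$ and $\mu$ uniform on $[0,1]$) checks the
term-by-term expansion against a closed form:

```python
# Numerical sketch of J_{n,k}(f) of (1.2) for k = 2, f(x, y) = x*y and
# mu = uniform distribution on [0, 1] (assumptions made only for this
# illustration).  Since mu_n is atomic and mu is non-atomic, omitting
# the diagonal x1 = x2 removes only the i = j terms of the
# mu_n x mu_n part of the expanded product measure.
import random

def J_n2(xs):
    """J_{n,2}(f) for f(x, y) = x*y, computed term by term from (1.2)."""
    n = len(xs)
    # mu_n x mu_n part off the diagonal: sum over i != j of x_i * x_j
    s1 = sum(x * y for x in xs for y in xs) - sum(x * x for x in xs)
    # int f(x, y) mu(dy) = x/2 for uniform mu; int int f dmu dmu = 1/4
    s2 = sum(x * 0.5 for x in xs)
    return (n / 2.0) * (s1 / n**2 - 2.0 * s2 / n + 0.25)

random.seed(0)
xs = [random.random() for _ in range(200)]
n = len(xs)

# Closed form obtained by expanding the same product:
# J_{n,2}(f) = (n/2) * ((xbar - 1/2)^2 - sum(x_i^2)/n^2)
xbar = sum(xs) / n
closed = (n / 2.0) * ((xbar - 0.5) ** 2 - sum(x * x for x in xs) / n**2)

assert abs(J_n2(xs) - closed) < 1e-8
```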
\medskip
It seems natural to omit the diagonals $x_j=x_l$, $j\neq l$,
from the domain of integration in the definition of the random
integrals $J_{n,k}(f)$ in~(\ref{1.2}); in the applications I have met,
it was the estimation of such a version of the integrals that was needed.
I shall also discuss the following more general problem:
\medskip\noindent
{\it Problem b).}
Let ${\cal F}$ be a nice class of functions on the space
$(X^k,{\cal X}^k)$. Give a good estimate of the
probabilities $P\left(\sup\limits_{f\in{\cal F}}J_{n,k}(f)>u\right)$
where $J_{n,k}(f)$ denotes again the random integral of a function~$f$
defined in (\ref{1.2}).
\medskip
I met the problems formulated above when I tried to adapt the method
of investigation about the limit behaviour of maximum likelihood
estimates to more difficult problems, to so-called non-parametric
maximum likelihood estimates. An important step in the investigation
of maximum likelihood estimates consists of a good approximation of
the maximum-likelihood function whose root we are looking for.
The Taylor expansion of this function yields a good approximation
if its higher order terms are dropped. In
an adaptation of this method to more complicated situations the
solution of the above mentioned problems~a) and~b) appear in a
natural way. They play a role similar to the estimation of the
coefficients of the Taylor expansion in the study of maximum
likelihood estimates. Here I do not discuss the details of this
approach to non-parametric maximum-likelihood problems. The
interested reader may find some further information about it in
papers~\cite{r23} and~\cite{r24}, where such a question is
investigated in detail in a special case.
In the above mentioned papers the so-called Kaplan--Meier method is
investigated for the estimation of a distribution function by means
of censored data. The solution of problem~a) is needed to bound
the error of the Kaplan--Meier estimate for a single argument of
the distribution function, and the solution of problem b) helps to
bound the difference of this estimate and the real distribution
function in the supremum norm. Let me remark that the approach in
papers~\cite{r23} and~\cite{r24} seems to be applicable under much
more general circumstances, but this requires the solution of some
hard problems.
I do not know of other authors who dealt directly with the study
of random integrals similar to that defined in (\ref{1.2}). On the other
hand, several authors investigated the behaviour of $U$-statistics,
and discussed the next two problems that I describe
under the name problem~a$')$ and problem~b$')$.
To formulate them first I recall the notion of $U$-statistics.
If a sequence of independent and identically distributed random
variables $\xi_1,\dots,\xi_n$ is given on a measurable
space $(X,{\cal X})$ together with a function of $k$ variables
$f(x_1,\dots,x_k)$
on the space $(X^k,{\cal X}^k)$, $n\ge k$, then the expression
\begin{equation}
I_{n,k}(f)=\frac1{k!}\sum_{1\le j_s\le n,\; s=1,\dots, k\atop
j_s\neq j_{s'} \textrm{ \scriptsize if } s\neq s'}
f\left(\xi_{j_1},\dots,\xi_{j_k}\right) \label{1.3}
\end{equation}
is called a $U$-statistic of order $k$ with kernel function $f$. Now I
formulate the following two problems.
\medskip\noindent
{\it Problem a$'$).}
Give a good estimate of the probabilities
$P(n^{-k/2}I_{n,k}(f)>u)$ under some appropriate (and not too
restrictive) conditions on the function $f$.
\medskip\noindent
{\it Problem b$'$).}
Let ${\cal F}$ be a nice class of functions on the space
$(X^k,{\cal X}^k)$. Give a good estimate of the probabilities
$P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}I_{n,k}(f)>u\right)$
where $I_{n,k}(f)$ denotes again the $U$-statistic with kernel
function~$f$ defined in~(\ref{1.3}).
\medskip
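Definition (\ref{1.3}) can be checked on a small example; for the
hypothetical kernel $f(x,y)=xy$, summing over ordered pairs of distinct
indices and dividing by $k!=2$ gives
$I_{n,2}(f)=\frac12\bigl[(\sum\xi_j)^2-\sum\xi_j^2\bigr]$:

```python
# Brute-force evaluation of (1.3) (illustration only): I_{n,k}(f) is
# the sum of f over ordered k-tuples of distinct indices, divided by k!.
# For f(x, y) = x*y this equals ((sum x_j)^2 - sum x_j^2) / 2.
from itertools import permutations
from math import factorial

def u_stat(xs, k, f):
    """I_{n,k}(f) of (1.3): sum over ordered k-tuples of distinct indices / k!."""
    return sum(f(*(xs[j] for j in idx))
               for idx in permutations(range(len(xs)), k)) / factorial(k)

xs = [0.3, -1.2, 2.5, 0.7, -0.4]
direct = u_stat(xs, 2, lambda x, y: x * y)
closed = (sum(xs) ** 2 - sum(x * x for x in xs)) / 2.0
assert abs(direct - closed) < 1e-10
```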
Problems a) and b) are closely related to problems a$'$) and b$'$),
but the detailed description of their relation demands some hard work.
The main difference between these two pairs of problems is that
integration with respect to a power of the measure $\mu_n-\mu$
in formula (\ref{1.2}) means some kind of normalization, while the
definition of the $U$-statistics in (\ref{1.3}) contains no normalization.
Moreover, there is no simple way to introduce some good normalization
in $U$-statistics. This has the consequence that in problems~a)
and~b) a good estimate can be given for a much larger class of
functions than in problems~a$'$) and~b$'$). Hence the original pair
of problems seems to be more useful in several possible applications.
Both the integrals $J_{n,k}(f)$ defined in (\ref{1.2}) and the
$U$-statistics $I_{n,k}(f)$ defined in (\ref{1.3}) are non-linear
functionals of independent random variables, and the main difficulty
arises in their study because of this non-linearity.
On the other hand, the normalized empirical measure
$\sqrt n(\mu_n-\mu)$ is close to a Gaussian field for a large
sample size~$n$. Moreover, as we shall see, $U$-statistics with a
large sample size behave similarly to multiple Gaussian
integrals. This suggests that the study of multiple Gaussian
integrals may help a lot in the solution of our problems. To
investigate them first I recall the definition of white
noise that we shall need later.
\medskip\noindent
{\bf Definition of a white noise with some reference measure.}
{\it Let us have a $\sigma$-finite measure $\mu$ on a measurable
space $(X,{\cal X})$. A white noise with reference measure $\mu$ is
a Gaussian field $\mu_W=\{\mu_W(A)\colon A\in{\cal X},\,\mu(A)<\infty\}$,
i.e. a set of jointly Gaussian random variables indexed by the above
sets~$A$, which satisfies the relations $E\mu_W(A)=0$ and
$E\mu_W(A)\mu_W(B)=\mu(A\cap B)$.}
\medskip\noindent
{\it Remark:}\/ In the definition of a white noise one also mentions
the property $\mu_W(A\cup B)=\mu_W(A)+\mu_W(B)$ with probability~1
if $A\cap B=\emptyset$, and $\mu(A)<\infty$, $\mu(B)<\infty$. This could
be omitted from the definition, because it follows from the remaining
properties of white noises. Indeed, simple
calculation shows that $E(\mu_W(A\cup B) -\mu_W(A)-\mu_W(B))^2=0$
if $A\cap B=\emptyset$, hence $\mu_W(A\cup B)-\mu_W(A)-\mu_W(B)=0$
with probability~1 in this case. It also can be observed that if some
sets $A_1,\dots,A_k\in {\cal X}$, $\mu(A_j)<\infty$, $1\le j\le k$,
are disjoint, then the random variables $\mu_W(A_j)$, $1\le j\le k$,
are independent because of the uncorrelatedness of these jointly
Gaussian random variables.
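On a finite space a white noise can be realized directly: assign each
atom $x$ an independent $N(0,\mu(\{x\}))$ random variable and sum over
the atoms contained in a set. The sketch below (my illustration, with
an arbitrarily chosen three-atom measure) checks the covariance
property by Monte Carlo and the additivity of the Remark exactly:

```python
# Minimal sketch (not from the paper) of a white noise with reference
# measure mu on a finite set: each atom x carries an independent
# N(0, mu({x})) variable, and mu_W(A) is the sum of the atom variables
# in A.  Then E mu_W(A) mu_W(B) = mu(A & B), and additivity on disjoint
# sets holds by construction.
import random

random.seed(1)
atoms = {"a": 0.5, "b": 1.0, "c": 2.0}      # mu({x}) for each atom x

def sample_white_noise():
    return {x: random.gauss(0.0, m ** 0.5) for x, m in atoms.items()}

def mu_W(noise, A):
    return sum(noise[x] for x in A)

A, B = {"a", "b"}, {"b", "c"}               # mu(A & B) = mu({b}) = 1.0
N = 100_000
cov = sum(mu_W(w, A) * mu_W(w, B)
          for w in (sample_white_noise() for _ in range(N))) / N
assert abs(cov - 1.0) < 0.05                # Monte Carlo estimate of mu(A & B)

# Additivity on disjoint sets holds exactly, by construction:
w = sample_white_noise()
assert abs(mu_W(w, {"a", "c"}) - (mu_W(w, {"a"}) + mu_W(w, {"c"}))) < 1e-12
```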
\medskip
It is not difficult to see that for an arbitrary reference
measure~$\mu$ on a space $(X,{\cal X})$ a white noise $\mu_W$ with
this reference measure really exists. This follows simply from
Kolmogorov's fundamental theorem, by which if the finite dimensional
distributions of a random field are prescribed in a
consistent way, then there exists a random field with these finite
dimensional distributions.
If a white noise $\mu_W$ with a $\sigma$-finite reference measure
$\mu$ is given on some measurable space $(X,{\cal X})$ together with a
function $f(x_1,\dots,x_k)$ on $(X^k,{\cal X}^k)$ such that
\begin{equation}
\sigma^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty,
\label{1.4}
\end{equation}
then the multiple Wiener--It\^o integral of the function $f$ with respect
to a white noise $\mu_W$ with reference measure $\mu$ can be defined,
(see e.g. \cite{r14} or \cite{r17}). It will be denoted by
\begin{equation}
Z_{\mu,k}(f)=\int f(x_1,\dots,x_k)\mu_W(\,dx_1)\dots\mu_W(\,dx_k).
\label{1.5}
\end{equation}
Here we shall not need a detailed discussion of Wiener--It\^o
integrals; it will be enough to recall the idea of their definition.
Let us have a measurable space $(X,{\cal X})$ together with a non-atomic
$\sigma$-finite measure $\mu$ on it. (Wiener--It\^o
integrals are defined only with respect to a white noise $\mu_W$ with a
non-atomic reference measure $\mu$.) We call a function~$f$ on
$(X^k,{\cal X}^k)$ elementary if there exists a finite partition
$A_1,\dots,A_M$, $1\le M<\infty$, of the set $X$ (i.e. $A_j\cap
A_{j'}=\emptyset$ if $j\neq j'$ and $\bigcup\limits_{j=1}^M A_j=X$) such
that $\mu(A_j)<\infty$ for all $1\le j\le M$, and the function
$f$ satisfies the properties
\begin{eqnarray}
&&f(x_1,\dots,x_k)=c(j_1,\dots,j_k)\quad\textrm{if }x_1\in
A_{j_1},\dots, x_k\in A_{j_k}, \nonumber\\
&&\qquad \hskip5truecm 1\le j_s\le M, \; \;1\le s\le k,
\nonumber \\
&&\qquad\textrm{and }c(j_1,\dots,j_k)=0\quad\textrm{if }j_s=j_{s'}\;\;
\textrm{for some }1\le s<s'\le k. \label{1.6}
\end{eqnarray}
The Wiener--It\^o integral of an elementary function of this form is
defined as the sum
$Z_{\mu,k}(f)=\sum c(j_1,\dots,j_k)\mu_W(A_{j_1})\cdots\mu_W(A_{j_k})$,
and the definition is extended to general functions satisfying
(\ref{1.4}) through an $L_2$-limiting procedure. Now I formulate the
analogues of the previous problems for Wiener--It\^o integrals.
\medskip\noindent
{\it Problem a$''$).}
Give a good estimate of the probabilities
$P(Z_{\mu,k}(f)>u)$ under some appropriate (and not too restrictive)
conditions on the function $f$ and measure $\mu$.
\medskip\noindent
{\it Problem b$''$).}
Let ${\cal F}$ be a nice class of functions on the space
$(X^k,{\cal X}^k)$. Give a good estimate of the probabilities
$P\left(\sup\limits_{f\in{\cal F}}Z_{\mu,k}(f)>u\right)$ where
$Z_{\mu,k}(f)$ denotes again a Wiener--It\^o integral with function~$f$
and white noise with reference measure~$\mu$.
\medskip
In this paper the above problems will be discussed. Estimates
will be presented which depend on some basic
characteristics of the random expressions $J_{n,k}(f)$,
$I_{n,k}(f)$ or $Z_{\mu,k}(f)$, mainly on the
$L_2$- and $L_\infty$-norms of the function~$f$ taking part in the
definition of the above quantities. (The $L_2$-norm of the
function~$f$ is closely related to the variance of the random
variables we consider.) The proof of the estimates is
related to some other problems interesting in themselves. My main
goal was to explain the results and ideas behind them. I put
emphasis on the explanation of the picture that can help
understanding them, and the details of almost all proofs are
omitted. A detailed explanation together with the proofs can be
found in my Lecture Note~\cite{r22}.
This paper consists of 9 sections. The first four sections contain
the results about problems~a),~a$'$) and~a$''$) together with
some other statements which may explain better their background.
Section~\ref{s5} contains the main ideas of their proof. In Section~\ref{s6}
problems b),~b$'$) and~b$''$) are discussed together with some
related questions. The main ideas of the proofs of the results
in Section~\ref{s6}, which contain many unpleasant technical details, are
discussed in Sections~\ref{s7} and~\ref{s8}. In Section~\ref{s9}
Talagrand's theory
about concentration inequalities is considered together with some
new results and open questions.
%\vfill\eject
\section{The discussion of some large deviation results}\label{s2}
First we restrict our attention to problems a),~a$'$) and~a$''$),
i.e.\ to the case when the distribution of the random integral or
$U$-statistic of one function is estimated. These problems
are much simpler in the special case $k=1$. But they are not
trivial even in this case. A discussion of some large deviation
results may help to understand them better. I recall some large
deviation results, but not in their most general form. These
results will not actually be needed later; they serve only
for orientation.
\begin{thm}[Large deviation theorem about partial sums of
independent and identically distributed random variables]\label{t2.1}
Let $\xi_1,\xi_2,\dots$, be a sequence of independent and identically
distributed random variables such that $E\xi_1=0$, $Ee^{t\xi_1}
<\infty$ with some $t>0$. Let us define the partial sums
$S_n=\sum\limits_{j=1}^n\xi_j$, $n=1,2,\dots$. Then the relation
\begin{equation}
\lim_{n\to\infty}\frac1n\log P(S_n\ge nu)=-\rho(u) \qquad
\textrm{for all } u>0 \label{2.1}
\end{equation}
holds with the function $\rho(u)$ defined by the formula
$\rho(u)=\sup\limits_t\left(tu-\log Ee^{t\xi_1}\right)$.
The function $\rho(\cdot)$
in formula (\ref{2.1}) has the following properties: $\rho(u)>0$ for all
$u>0$, it is monotone increasing, and there is some number
$0<u_0\le\infty$ such that $\rho(u)<\infty$ for $u<u_0$ and
$\rho(u)=\infty$ for $u>u_0$. Beside this,
$\rho(u)=\frac{u^2}{2\sigma^2}+O(u^3)$ for small $u>0$, where
$\sigma^2=E\xi_1^2$ is the variance of $\xi_1$.
\end{thm}
\smallskip
\noindent
The above theorem states that for all $\varepsilon>0$ the inequality
$P(S_n>nu)\le e^{-n(\rho(u)-\varepsilon)}$ holds if
$n\ge n(u,\varepsilon)$, and this estimate is essentially sharp.
Actually, in nice cases, when the supremum in $\rho(u)
=\sup\limits_t\left(tu-\log Ee^{t\xi_1}\right)$ is attained at some
finite~$t$, the above inequality also holds with $\varepsilon=0$ for all
$n\ge1$. The function $\rho(u)$ in the exponent of the above large
deviation estimate strongly depends on the distribution of $\xi_1$.
It is the so-called Legendre transform of $\log Ee^{t\xi_1}$,
of the logarithm of the moment generating function of $\xi_1$, and
its values in an arbitrary interval determine the distribution of
$\xi_1$. On the other hand, the estimate (\ref{2.1}) for small
$u>0$ shows some resemblance to the bound suggested by the central
limit theorem. Indeed, for small $u>0$ it yields the upper bound
$e^{-nu^2/2\sigma^2+nO(u^3)}$, while the central limit theorem would
suggest the estimate $e^{-nu^2/2\sigma^2}$. (Let us recall that the
standard normal distribution function $\Phi(u)$ satisfies the
inequality $\left(\frac1u-\frac1{u^3}\right)\frac{e^{-u^2/2}}{\sqrt{2\pi}}
<1-\Phi(u)<\frac1u\frac{e^{-u^2/2}}{\sqrt{2\pi}}$ for all $u>0$,
hence for large $u$ it is natural to bound it by $e^{-u^2/2}$.)
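The normal tail bounds quoted in parentheses are easy to confirm
numerically; below $1-\Phi(u)$ is computed through the standard
identity $1-\Phi(u)=\frac12\,\mathrm{erfc}(u/\sqrt2)$:

```python
# Numerical check of the standard normal tail bounds quoted above:
# (1/u - 1/u^3) * phi(u) < 1 - Phi(u) < phi(u)/u for u > 0,
# where phi is the standard normal density.
from math import erfc, exp, pi, sqrt

def phi(u):
    return exp(-u * u / 2.0) / sqrt(2.0 * pi)

for u in (0.5, 1.0, 2.0, 4.0, 8.0):
    tail = erfc(u / sqrt(2.0)) / 2.0        # 1 - Phi(u)
    assert (1.0 / u - 1.0 / u**3) * phi(u) < tail < phi(u) / u
```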
The next result I mention, Bernstein's inequality (see e.g.~\cite{r5},
1.3.2~Bernstein's inequality), has a closer relation to the problems
discussed in this paper. It gives a good upper bound on the
distribution of sums of independent, bounded random variables with
expectation zero. It is important that this estimate is universal,
the constants it contains do not depend on the properties of the
random variables we consider.
\begin{thm}[Bernstein's inequality]\label{t2.2}
Let $X_1,\dots,X_n$ be independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and $V_n^2=\textrm{Var}\, S_n
=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P(S_n>u)\le\exp\left\{-\frac{u^2}{2V_n^2\left(1+\frac u{3V_n^2}
\right)}\right\}
\quad\textrm{for all }u>0. \label{2.2}
\end{equation}
\end{thm}
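As a sanity check of Theorem~\ref{t2.2} (my illustration, not part of
the text): for $X_j=\pm1$ with probability $\frac12$ each, the tail of
$S_n$ is an explicit binomial sum, and it stays below the Bernstein
bound:

```python
# Bernstein's inequality (2.2) checked against an exact tail
# (illustration only): X_j = +/-1 with probability 1/2, so V_n^2 = n.
from math import comb, exp

def exact_tail(n, u):
    # S_n = 2*Bin(n, 1/2) - n, so S_n > u  iff  Bin(n, 1/2) > (n + u)/2
    return sum(comb(n, k) for k in range(int((n + u) / 2) + 1, n + 1)) / 2.0**n

def bernstein(u, Vn2):
    return exp(-u * u / (2.0 * Vn2 * (1.0 + u / (3.0 * Vn2))))

n = 100
for u in (5, 10, 20, 30):
    assert exact_tail(n, u) <= bernstein(u, n)
```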
Let us take a closer look at the content of Theorem~\ref{t2.2}. Estimate
(\ref{2.2}) yields bounds of different form depending on whether the
first or the second term dominates the sum $1+\frac u{3V_n^2}$ in the
denominator of the fraction in this
expression. If we fix some constant $C>0$, then formula (\ref{2.2}) yields that
$P(S_n>u)\le e^{-Bu^2/2V_n^2}$ with some constant $B=B(C)$ for
$0\le u\le CV_n^2$. If, moreover $0\le u\le\varepsilon V_n^2$ with
some small $\varepsilon>0$, then the estimate $P(S_n>u)
\le e^{-(1-K\varepsilon)u^2/2V_n^2}$ holds
with a universal constant $K>0$. This means that in the case
$0\le u\le CV_n^2$ the probability $P(S_n>u)$
can be bounded by the probability $G(u)=P(\textrm{const.}\, V_n\eta>u)$
where $\eta$ is a standard normal random variable, and $V_n^2$ is
the variance of the partial sum $S_n$. If $0\le u\le \varepsilon V_n^2$
with a small $\varepsilon>0$, then it also can be bounded by
$P\left((1-K\varepsilon)^{-1/2}V_n\eta>u\right)$ with some universal constant $K>0$.
In the case $u\gg V_n^2$ formula (\ref{2.2}) yields a different type
of estimate. In this case we get that $P(S_n>u)\le
e^{-\textrm{const.}\,u}$ with some universal constant, and this seems
to be a rather weak estimate.
In particular, it does not depend on the variance $V_n^2$ of $S_n$. In
the degenerate case $V_n=0$ when $P(S_n>u)=0$, estimate (\ref{2.2}) yields
a strictly positive upper bound for $P(S_n>u)$. One would like to get
such an improvement of Bernstein's inequality which gives a better
bound in the case $u\gg V_n^2$. Bennett's inequality (see e.g.~\cite{r28},
Appendix~B, 4~Bennett's inequality) satisfies this requirement.
\begin{thm}[Bennett's inequality]\label{t2.3}
%{\bf Theorem 2.3 (Bennett's inequality).}
Let $X_1,\dots,X_n$
be independent random variables, $P(|X_j|\le1)=1$,
$EX_j=0$, $1\le j\le n$. Put $\sigma_j^2=EX_j^2$, $1\le j\le n$,
$S_n=\sum\limits_{j=1}^n X_j$ and $V_n^2=\textrm{Var}\, S_n
=\sum\limits_{j=1}^n\sigma_j^2$.
Then
\begin{equation}
P(S_n>u)\le\exp\left\{-V^2_n\left[\left(1+\frac u{V^2_n}\right)
\log\left(1+\frac u{V^2_n}\right)-\frac u{V^2_n}\right]\right\}
\quad\textrm{for all
}u>0. \label{2.3}
\end{equation}
As a consequence, for all $\varepsilon>0$ there exists some
$B=B(\varepsilon)>0$ such that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-(1-\varepsilon)u\log \frac u{V^2_n}
\right\}\quad\textrm{if } u>BV_n^2, \label{2.4}
\end{equation}
and there exists some positive constant $K>0$ such that
\begin{equation}
P\left(S_n>u\right)\le\exp\left\{-Ku\log \frac u{V^2_n}
\right\}\quad\textrm{if }u>2V_n^2. \label{2.5}
\end{equation}
\end{thm}
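The relation between the two inequalities can also be seen numerically:
with $x=u/V_n^2$, the Bennett exponent $V_n^2[(1+x)\log(1+x)-x]$
dominates the Bernstein exponent $u^2/(2V_n^2(1+x/3))$ for all $x\ge0$
(a standard elementary inequality), so bound (\ref{2.3}) is never
weaker than (\ref{2.2}), and the gap grows for $u\gg V_n^2$:

```python
# Comparison of the Bennett bound (2.3) with the Bernstein bound (2.2)
# (illustration only; V^2 = 4 is an arbitrary choice).  Bennett's bound
# is never larger, and is far smaller once u >> V^2.
from math import exp, log

def bernstein(u, V2):
    return exp(-u * u / (2.0 * V2 * (1.0 + u / (3.0 * V2))))

def bennett(u, V2):
    x = u / V2
    return exp(-V2 * ((1.0 + x) * log(1.0 + x) - x))

V2 = 4.0
for u in (1.0, 4.0, 16.0, 64.0, 256.0):
    assert bennett(u, V2) <= bernstein(u, V2)
```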
Estimates (\ref{2.4}) and (\ref{2.5}) yield a slight improvement of
Bernstein's inequality in the case $u\ge K V_n^2$ with a sufficiently
large $K>0$. On the other hand, even this estimate is much weaker than
the estimate suggested by a formal application of the central limit
theorem. The question arises whether they are sharp or
can be improved. The next example shows that inequalities~(\ref{2.4})
or~(\ref{2.5}) in Bennett's inequality are essentially sharp. If no
additional restrictions are imposed, then at most the universal
constants can be improved in them. Even a sum of
independent, bounded and identically distributed random variables
can be constructed which satisfies a lower bound similar to the
upper bounds in formulas~(\ref{2.4}) and~(\ref{2.5}), only with possibly
different constants.
\smallskip
\begin{exmp}\label{e2.4}
%{\bf Example 2.4.}
{Let us fix some positive integer $n$,
real numbers $u$ and $\sigma^2$ such that $0<\sigma^2\le
\frac18$, $n>3u\ge6$ and $u>4n\sigma^2$. Put $V_n^2=n\sigma^2$ and
take a sequence of independent, identically distributed random
variables $X_1,\dots,X_n$ such that
$P(X_j=1)=P(X_j=-1)=\frac{\sigma^2}2$,
and $P(X_j=0)=1-\sigma^2$. Put $S_n=\sum\limits_{j=1}^n X_j$. Then
$ES_n=0$, $\textrm{Var}\, S_n=V_n^2$, and
$$
P(S_n\ge u)>\exp\left\{-Bu\log \frac u{V^2_n}\right\}
$$
with some appropriate (universal) constant $B>0$.}
\end{exmp}
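The lower bound of Example~\ref{e2.4} can be checked numerically for
concrete parameters (chosen here only for illustration; the constant
$B=5$ below is an ad hoc choice, not an optimal one):

```python
# Numerical illustration of Example 2.4 (parameters chosen here, not in
# the paper): n = 1000, sigma^2 = 0.01, u = 50, so V_n^2 = 10 and the
# constraints sigma^2 <= 1/8, n > 3u >= 6, u > 4 n sigma^2 all hold.
# P(S_n >= u) >= P(A), where A = {exactly 2u variables equal +1, u equal
# -1, the rest 0}; we compare log P(A) with -B u log(u / V_n^2) for B = 5.
from math import log, lgamma

n, sigma2, u = 1000, 0.01, 50
Vn2 = n * sigma2                        # = 10.0
assert sigma2 <= 1 / 8 and n > 3 * u >= 6 and u > 4 * n * sigma2

def log_multinomial(n, *ks):
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in ks)

# log P(A): choose the positions of the 2u (+1)'s and u (-1)'s; each
# +/-1 value has probability sigma2/2, each 0 has probability 1 - sigma2.
logPA = (log_multinomial(n, 2 * u, u, n - 3 * u)
         + 3 * u * log(sigma2 / 2.0) + (n - 3 * u) * log(1.0 - sigma2))

B = 5.0
# P(S_n >= u) >= P(A) >= exp(-B u log(u / V_n^2)) for these parameters
assert logPA >= -B * u * log(u / Vn2)
```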
\smallskip
\noindent
{\it Remark:}\/ The estimate of Example~\ref{e2.4} or of
relations~(\ref{2.4}) and~(\ref{2.5}) compares well with the tail
distribution of a Poisson distributed random variable with parameter
$\lambda=\textrm{const.}\, n\sigma^2\ge1$ at level $u\ge2\lambda$. Some
calculation shows that a Poisson distributed random variable
$\zeta_\lambda$ with parameter $\lambda>1$ satisfies the
inequality $e^{-C_1 u\log(u/\lambda)}\le
P(\zeta_\lambda-E\zeta_\lambda>u)\le
P(\zeta_\lambda>u)\le P(\zeta_\lambda-E\zeta_\lambda>\frac u2)\le
e^{-C_2u\log(u/\lambda)}$ with some appropriate constants
$0<C_2<C_1<\infty$ if $u>2\lambda$, and
$E\zeta_\lambda=\textrm{Var}\,\zeta_\lambda=\lambda$. This estimate is
similar to the above mentioned relations.
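The Poisson tail estimate of this remark can also be checked
numerically; the constants $C_1=2$ and $C_2=0.3$ below are ad hoc
choices for one set of parameters, not the constants of the remark:

```python
# Check of the Poisson tail estimate (illustration only): lambda = 5,
# u = 20 > 2*lambda, and ad hoc constants C1 = 2, C2 = 0.3 in
# exp(-C1 u log(u/lam)) <= P(zeta - lam > u) <= exp(-C2 u log(u/lam)).
from math import exp, log, lgamma

def poisson_tail_above(lam, m, terms=200):
    """P(zeta_lam > m): sum the pmf from m+1 on (truncated far in the tail)."""
    return sum(exp(-lam + j * log(lam) - lgamma(j + 1))
               for j in range(m + 1, m + 1 + terms))

lam, u = 5.0, 20
tail = poisson_tail_above(lam, int(lam + u))    # P(zeta - lambda > u)
C1, C2 = 2.0, 0.3
assert exp(-C1 * u * log(u / lam)) <= tail <= exp(-C2 * u * log(u / lam))
```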
\medskip
Example~\ref{e2.4} is proved in Example~3.2 of my
Lecture Note~\cite{r22}, but here I present a simpler proof.
\medskip\noindent
{\it Proof of the statement of Example~\ref{e2.4}.}\/ Let us fix an
integer $u$ such that $n>3u$ and $u>4n\sigma^2$. Let $B=B(u)$
denote the event that among the random variables $X_j$, $1\le
j\le n$, there are exactly $3u$ terms with values~$+1$ or~$-1$,
and all other random variables $X_j$ equal zero. Let us
also define the event $A=A(u)\subset B(u)$ which holds if $2u$
random variables $X_j$ are equal to 1, $u$ random variables $X_j$
are equal to $-1$, and all remaining random variables $X_j$,
$1\le j\le n$, are equal to zero. Clearly, $P(S_n\ge u)\ge
P(A)=P(B)P(A|B)$. On the other hand, $P(B)={n \choose 3u}
\left(\sigma^2\right)^{3u}\left(1-\sigma^2\right)^{n-3u}\ge
\left(\frac n{3u}\right)^{3u}\left(\sigma^2\right)^{3u}e^{-4n\sigma^2}
=e^{-3u\log(3u/n\sigma^2)-4n\sigma^2}$. Here we exploited that
because of the condition
$\sigma^2\le\frac18$ we have $1-\sigma^2\ge e^{-4\sigma^2}$.
Beside this, $u\ge 4n\sigma^2$, and $P(B)\ge
e^{-3u\log(3u/n\sigma^2)-u}\ge e^{-B_1u\log(u/n\sigma^2)}$ with
some appropriate $B_1>0$ under our assumptions.
Let us consider a set of $3u$ elements, and choose a random
subset of it by taking all elements of this set with probability
$1/2$ to this random subset independently of each other. I claim
that the conditional probability $P(A|B)$ equals the probability
that this random subset has $2u$ elements. Indeed, even the
conditional probability of the event $A$ under the condition
that for a prescribed set of indices~$J\subset\{1,\dots,n\}$ with
exactly $3u$ elements we have $X_j=\pm1$ if $j\in J$ and $X_j=0$ if
$j\notin J$ equals the probability of the event that the above
defined random subset has $2u$ elements. This is so, because under
this condition the random variables $X_j$ take the value $+1$ with
probability $1/2$ for all $j\in J$ independently of each other.
Hence $P(A|B)={3u \choose 2u}2^{-3u}\ge e^{-Cu}\ge
e^{-B_2u\log(u/n\sigma^2)}$ with some appropriate constants $C>0$
and $B_2>0$ under our conditions, since $\frac u{n\sigma^2}\ge4$
in this case. The estimates given for $P(B)$ and $P(A|B)$ imply
the statement of Example~\ref{e2.4}.
\medskip
Bernstein's inequality provides a solution to problems~a)
and a$'$) in the case $k=1$ under some conditions. Because of the
normalization (multiplication by $n^{-1/2}$ in these problems)
it yields an estimate with the choice
$\bar u=\sqrt nu$. Observe that $J_{n,1}(f) =\frac1{\sqrt
n}\sum\limits_{j=1}^n (f(\xi_j)-Ef(\xi_j))$ for $k=1$ in the
definition~(\ref{1.2}). In problem~a) it gives a
good bound on $P(J_{n,1}(f)>u)$ for a function~$f$ such that
$|f(x)|\le\frac12$ for all $x\in X$ with the choice
$X_j=f(\xi_j)-Ef(\xi_j)$, $1\le j\le n$, and $\bar u=\sqrt n u$.
In problem~a$'$) it gives a good bound on $P(n^{-1/2}I_{n,1}(f)>u)$
under the condition $|f(x)|\le1$ for all $x\in X$, and $Ef(\xi_1)=0$
with the choice $X_j=f(\xi_j)$, $1\le j\le n$, and $\bar u=\sqrt nu$.
This means that in the case $0\le u\le C\sqrt n\sigma^2$ the bounds
$P(J_{n,1}(f)>u)\le e^{-Ku^2/2\sigma^2}$ and
$P(n^{-1/2}I_{n,1}(f)>u)\le e^{-Ku^2/2\sigma^2}$ hold with
$\sigma^2=\textrm{Var}\, f(\xi_1)$ and some constant $K=K(C)$ depending
on the number~$C$ if the above conditions are imposed in
problem~a) or~a$'$). If $0\le u\le\varepsilon\sqrt n\sigma^2$ with some
small $\varepsilon>0$, then the above constant $K$ can be chosen very
close to the number~1.
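A Monte Carlo sanity check of the $k=1$ case (my illustration, with the
hypothetical choice $f(x)=x-\frac12$ and $\xi_j$ uniform on $[0,1]$, so
that $|f|\le\frac12$ and $\sigma^2=\frac1{12}$):

```python
# Monte Carlo check (illustration only) of the k = 1 bound: with
# f(x) = x - 1/2 and xi_j uniform on [0, 1], J_{n,1}(f) equals
# n^{-1/2} sum f(xi_j).  Bernstein's inequality with X_j = f(xi_j) and
# u_bar = sqrt(n) u gives
# P(J_{n,1}(f) > u) <= exp(-u_bar^2 / (2 V_n^2 (1 + u_bar/(3 V_n^2)))).
import random
from math import exp, sqrt

random.seed(2)
n, u, trials = 100, 1.0, 20_000
sigma2 = 1.0 / 12.0                     # Var f(xi_1) for uniform xi_1
Vn2 = n * sigma2
u_bar = sqrt(n) * u

hits = 0
for _ in range(trials):
    J = sum(random.random() - 0.5 for _ in range(n)) / sqrt(n)
    hits += J > u
empirical = hits / trials

bound = exp(-u_bar**2 / (2.0 * Vn2 * (1.0 + u_bar / (3.0 * Vn2))))
assert empirical <= bound
```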
The above results can be interpreted as saying that in the case $0\le
u\le\textrm{const.}\,\sqrt n\sigma^2$, and for a bounded function $f$,
an estimate suggested by the central limit theorem holds for problem~a);
only an additional constant multiplier may appear in the exponent.
A similar statement holds in problem~a$'$), only here
the additional condition $Ef(\xi_j)=0$ has to be imposed. On the
other hand, the situation is quite different if $u\gg \sqrt n\sigma^2$.
In this case Bernstein's inequality yields only a very weak estimate.
Bennett's inequality gives a slight improvement. It yields the
inequality $P(J_{n,1}(f)>u)\le e^{-Bu\sqrt n\log(u/\sqrt n\sigma^2)}$
with an appropriate constant $B>0$ if $|f(x)|\le\frac12$ for all
$x\in X$, $u\ge 2\sqrt n\sigma^2$, and $\sigma^2=\textrm{Var}\, f(\xi_1)$.
The estimate $P(n^{-1/2}I_{n,1}(f)>u)\le e^{-Bu\sqrt n\log
(u/\sqrt n\sigma^2)}$ holds with an appropriate $B>0$ if $|f(x)|\le1$
for all $x\in X$, $Ef(\xi_1)=0$, $\textrm{Var}\, f(\xi_1)=\sigma^2$, and
$u\ge2\sqrt n\sigma^2$. These estimates are much weaker than
the bound suggested by a formal application of the central limit
theorem. On the other hand, as Example~\ref{e2.4} shows, no better estimate
can be expected in this case. Moreover, the proof of this example
gives some insight why a different type of estimate appears in the cases
$u\le \sqrt n\sigma^2$ and $u\gg\sqrt n\sigma^2$ for problems~a)
and~a$'$).
In the proof of Example~\ref{e2.4} a `bad' irregular event~$A$ was defined
such that if it holds, then the sum of the random variables considered
in this example is sufficiently large. Generally, the probability of
such an event is very small, but if the variance of the random
variables is very small, (in problems~a) and~a$'$) this is the case
if $\sigma^2\ll un^{-1/2}$) then such `bad' irregular events can be
defined whose probabilities are not negligible.
Problems~a) and~a$'$) will also be considered for $k\ge2$, and this
will be called the multivariate case. The results we get for
the solution of problems~a) and~a$'$) in the multivariate case are
very similar to the results described above. To understand them
first some problems have to be discussed. In particular, the answer
to the following two questions has to be understood:
\medskip\noindent
{\it Question a).}\/ In the solution of
problem a$'$) in the case $k=1$ the condition
$Ef(\xi_1)=0$ was imposed, and this means some kind of normalization.
What condition corresponds to it in the multivariate case?
This question leads to the definition of degenerate
$U$-statistics and to the so-called Hoeffding's decomposition of
$U$-statistics into a sum of degenerate $U$-statistics.
\medskip\noindent
{\it Question b).}\/ The discussion of problems a) and~a$'$)
was based on the central limit theorem. What kind of limit
theorems can take its place in the multivariate case? What kind
of limit theorems hold for $U$-statistics $I_{n,k}(f)$ or
multiple random integrals $J_{n,k}(f)$ defined in (\ref{1.2})? The limit
appearing in these problems can be expressed by means of multiple
Wiener--It\^o integrals in a natural way.
\medskip
In the next section the two above questions will be discussed.
\section{On some problems about $U$-statistics and random
integrals}\label{s3}
\subsection{ The normalization of $U$-statistics}\label{s3.1}
In the case $k=1$ problem a$'$) means the estimation of sums of
independent and identically distributed random variables. In this case
a good estimate was obtained under the condition $Ef(\xi_1)=0$.
In the multivariate case $k\ge2$ a stronger normalization property
has to be imposed to get good estimates about the distribution of
$U$-statistics. In this case it has to be assumed that the conditional
expectation of each term $f(\xi_{j_1},\dots,\xi_{j_k})$ of the
$U$-statistic, given that all but one of its arguments take
prescribed values, equals zero. This property is
formulated in a more explicit way in the following definition
of degenerate $U$-statistics.
\medskip\noindent
{\bf Definition of degenerate $U$-statistics.} {\it Let us consider
the $U$-statistic $I_{n,k}(f)$ of order~$k$ defined in formula (\ref{1.3})
with kernel function $f(x_1,\dots,x_k)$ and a sequence of independent
and identically distributed random variables $\xi_1,\dots,\xi_n$. It
is a degenerate $U$-statistic if its kernel function satisfies the
relation
\begin{eqnarray}
&&E(f(\xi_1,\dots,\xi_k)|\xi_1=x_1,\dots,\xi_{j-1}=x_{j-1},
\xi_{j+1}=x_{j+1},\dots,\xi_k=x_k)=0 \nonumber \\
&&\qquad\qquad \textrm{for all } 1\le j\le k \textrm { and } x_s\in X, \;
s\in\{1,\dots,k\}\setminus\{j\}.
\label{3.1}
\end{eqnarray} }
\medskip
The definition of degenerate $U$-statistics is closely related to the
notion of canonical functions described below.
\medskip \noindent
{\bf Definition of canonical functions.} {\it A function
$f(x_1,\dots,x_k)$ defined on the $k$-fold product
$(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$ is called
canonical with respect to a probability measure $\mu$ on
$(X,{\cal X})$ if
\begin{eqnarray}
&&\int f(x_1,\dots,x_{j-1},u,x_{j+1},\dots,x_k)\mu(\,du)=0 \nonumber \\
&&\qquad \textrm{for all \ } 1\le j\le k \textrm{ \ and \ } x_s\in X, \;
s\in\{1,\dots,k\}\setminus\{j\}.
\label{3.2}
\end{eqnarray} }
\medskip
It is clear that a $U$-statistic $I_{n,k}(f)$ is degenerate if and
only if its kernel function $f$ is canonical with respect to the
distribution $\mu$ of the random variables $\xi_1,\dots,\xi_n$
appearing in the definition of the $U$-statistic.
Given a function~$f$ and a probability measure $\mu$, this function
can be written as a sum of canonical functions (with different
sets of arguments) with respect to the measure~$\mu$, and this enables
us to decompose a $U$-statistic as a linear combination of degenerate
$U$-statistics. This is the content of Hoeffding's decomposition of
$U$-statistics described below. To formulate it first I introduce
some notations.
Consider the $k$-fold product $(X^k,{\cal X}^k,\mu^k)$ of a
measure space $(X,{\cal X},\mu)$ with some probability measure $\mu$,
and define for all integrable functions $f(x_1,\dots,x_k)$ and indices
$1\le j\le k$ the projection~$P_jf$ of the function $f$ to its $j$-th
coordinate as
\begin{equation}
P_jf(x_1,\dots,x_{j-1},x_{j+1},\dots,x_k)=\int
f(x_1,\dots,x_k)\mu(\,dx_j), \quad 1\le j\le k. \label{3.3}
\end{equation}
In some investigations it may be useful to rewrite formula (\ref{3.3}) by
means of conditional expectations in an equivalent form as
\begin{eqnarray*}
&&P_jf(x_1,\dots,x_{j-1},x_{j+1},\dots,x_k)\\
&&\qquad=E(f(\xi_1,\dots,\xi_k)|\xi_1=x_1,\dots,\xi_{j-1}=x_{j-1},
\xi_{j+1}=x_{j+1},\dots,\xi_k=x_k),
\end{eqnarray*}
where $\xi_1,\dots,\xi_k$ are independent random variables with
distribution~$\mu$.
Let us also define the operators $Q_j=I-P_j$ as $Q_jf=f-P_jf$ on the
space of integrable functions on $(X^k,{\cal X}^k,\mu^k)$, $1\le j\le k$.
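To illustrate the notation, in the case $k=2$ the expansion
$f=(P_1+Q_1)(P_2+Q_2)f$ yields the four terms
$$
f=\underbrace{P_1P_2f}_{f_\emptyset}+\underbrace{Q_1P_2f}_{f_{\{1\}}}
+\underbrace{P_1Q_2f}_{f_{\{2\}}}+\underbrace{Q_1Q_2f}_{f_{\{1,2\}}},
$$
where for instance $f_{\{1\}}(x_1)=\int f(x_1,x_2)\mu(\,dx_2)
-\int\!\!\int f(x_1,x_2)\mu(\,dx_1)\mu(\,dx_2)$. Each term is canonical
in its own arguments, since $P_j$ is idempotent, and hence
$P_jQ_jf=P_jf-P_jP_jf=0$.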
In the definition (\ref{3.3}) $P_jf$ is a function not depending on the
coordinate $x_j$, but in the definition of $Q_j$ we introduce the
fictitious coordinate $x_j$ to make the expression $Q_jf=f-P_jf$
meaningful. The following result holds.
\begin{thm}[Hoeffding's decomposition of $U$-statistics]\label{t3.1}
%{\bf Theorem 3.1 (Hoeffding's decomposition of $U$-statistics).}
{Let an integrable function $f(x_1,\dots,x_k)$ be given on the $k$-fold
product space $(X^k,{\cal X}^k,\mu^k)$ of a space $(X,{\cal X},\mu)$
with a probability measure $\mu$. It has the decomposition
\begin{eqnarray}
&&f=\sum\limits_{V\subset\{1,\dots,k\}} f_V, \quad \textrm{with}
\nonumber \\
\qquad
&&f_V(x_j,\,j\in V)=\left(\prod_{j\in\{1,\dots,k\}\setminus V}P_j
\prod_{j\in V}Q_j\right)f(x_1,\dots,x_k) \label{3.4}
\end{eqnarray}
such that all functions $f_V$, $V\subset \{1,\dots,k\}$, in (\ref{3.4})
are canonical with respect to the probability measure $\mu$, and they
depend on the $|V|$ arguments $x_j$, $j\in V$.
Let $\xi_1,\dots,\xi_n$ be a sequence of independent, $\mu$ distributed
random variables, and consider the
$U$-statistics $I_{n,k}(f)$ and $I_{n,|V|}(f_V)$ corresponding to
the kernel functions $f$, $f_V$ defined in (\ref{3.4}) and random variables
$\xi_1,\dots,\xi_n$. Then
\begin{equation}
I_{n,k}(f)=\sum\limits_{V\subset\{1,\dots,k\}}
(n-|V|)(n-|V|-1)\cdots(n-k+1)\frac{|V|!}{k!}
I_{n,|V|}(f_V) \label{3.5}
\end{equation}
is a representation of $I_{n,k}(f)$ as a sum of degenerate
$U$-statistics, where $|V|$ denotes the cardinality of the set $V$.
(The product $(n-|V|)(n-|V|-1)\cdots(n-k+1)$ is defined as 1 if
$V=\{1,\dots,k\}$, i.e. $|V|=k$.) This representation is called the
Hoeffding decomposition of~$I_{n,k}(f)$.}
\end{thm}
Hoeffding's decomposition was originally proved in paper~\cite{r13}. It may
also be interesting to mention its generalization in~\cite{r32}.
I omit the proof of Theorem~\ref{t3.1}, although it is fairly simple.
I only try
to briefly explain that the construction of Hoeffding's decomposition
is natural. Let me recall that a random variable can be decomposed
as a sum of a random variable with expectation zero plus a constant,
where the random variable with expectation zero in this decomposition is
obtained by subtracting from the original random variable its expectation.
To construct a transformation which makes not only the expectation of
the transformed random variable zero, but also its conditional
expectation with respect to some condition, it is natural to subtract
from the original random variable its conditional expectation. Since
the operators $P_j$ defined in~(\ref{3.3}) are closely related to the
conditional expectations appearing in the definition of degenerate
$U$-statistics, the above consideration makes it natural to write the
identity
$f=\prod\limits_{j=1}^k(P_j+Q_j)f=\sum\limits_{V\subset\{1,\dots,k\}}f_V$
with the functions defined in~(\ref{3.4}). (In the justification of the
last formula some properties of the operators~$P_j$
and~$Q_j$ have to be exploited.)
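Hoeffding's decomposition can also be checked numerically in small
cases. The following sketch (the three-point space, the measure $\mu$
and the kernel $f$ are arbitrary illustrative choices, not taken from
the paper) computes the functions $f_V$ of formula (\ref{3.4}) for
$k=2$ and verifies the identity $f=\sum_V f_V$ together with the
canonical property.

```python
# Numerical sanity check of Hoeffding's decomposition for k = 2 on a
# small finite space X = {0, 1, 2}; mu and f are arbitrary choices.
import numpy as np

mu = np.array([0.2, 0.5, 0.3])              # probability measure on X
f = np.array([[1.0, -2.0, 0.5],
              [0.3,  4.0, -1.0],
              [2.0,  0.0,  1.5]])           # kernel f(x1, x2)

# Projections (3.3): P1 integrates out x1, P2 integrates out x2.  The
# result is broadcast back to a 3 x 3 array, i.e. it is kept as a
# function of the "fictitious" coordinate, so that Q_j f = f - P_j f
# is well defined.
P1f = np.tensordot(mu, f, axes=([0], [0]))[None, :] * np.ones((3, 1))
P2f = np.tensordot(f, mu, axes=([1], [0]))[:, None] * np.ones((1, 3))

f_empty = mu @ f @ mu * np.ones((3, 3))     # P1 P2 f, a constant
f_1 = P2f - f_empty                         # Q1 P2 f, depends on x1 only
f_2 = P1f - f_empty                         # P1 Q2 f, depends on x2 only
f_12 = f - P1f - P2f + f_empty              # Q1 Q2 f, canonical in both

# The decomposition f = sum_V f_V holds exactly:
assert np.allclose(f, f_empty + f_1 + f_2 + f_12)
# f_12 is canonical: integrating out either coordinate gives zero,
# and f_1 integrates to zero in its only argument x1.
assert np.allclose(mu @ f_12, 0.0) and np.allclose(f_12 @ mu, 0.0)
assert np.allclose(mu @ f_1, 0.0)
```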
It is clear that $EI_{n,k}(f)=0$ for a degenerate $U$-statistic. Also
the inequality
\begin{equation}
E\left(I_{n,k}(f)\right)^2\le\frac {n^k}{k!}\sigma^2 \quad \textrm{with }
\sigma^2=\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k) \label{3.6}
\end{equation}
holds if $I_{n,k}(f)$ is a degenerate $U$-statistic. The measure
$\mu$ in (\ref{3.6}) is the distribution of the random variables
$\xi_j$ taking part in the definition of the $U$-statistic. Moreover,
$\lim\limits_{n\to\infty}n^{-k}E\left(I_{n,k}(f)\right)^2
=\frac{\sigma^2}{k!}$
if the kernel function $f$ is a symmetric function of its arguments,
i.e. $f(x_1,\dots,x_k)=f(x_{\pi(1)},\dots,x_{\pi(k)})$ for all
permutations $\pi=(\pi(1),\dots,\pi(k))$ of the set $\{1,\dots,k\}$.
Relation (\ref{3.6}) can be proved by means of the observation that
$$
Ef(\xi_{j_1},\dots,\xi_{j_k})f(\xi_{j_1'},\dots,\xi_{j_k'})=0
$$
if $\{j_1,\dots,j_k\}\neq \{j_1',\dots,j_k'\}$, and $f$ is a canonical
function with respect to the distribution $\mu$ of the random
variables $\xi_j$. On the other hand,
$$
|Ef(\xi_{j_1},\dots,\xi_{j_k})f(\xi_{j_1'},\dots,\xi_{j_k'})|\le
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
$$
by the Schwarz inequality if $\{j_1,\dots,j_k\}=\{j_1',\dots,j_k'\}$,
i.e. if the sequence of indices $j_1',\dots,j_k'$ is a permutation of
the sequence of indices $j_1,\dots,j_k$, and there is an identity in
this relation if the function $f$ is symmetric. The last formula
enables us to check the asymptotic relation given for
$E\left(I_{n,k}(f)\right)^2$ after relation~(\ref{3.6}).
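The orthogonality relation above can be verified by exact enumeration
on a two-point space. In the sketch below the canonical kernel
$f(x,y)=g(x)g(y)$ and the uniform measure are illustrative choices,
not taken from the paper.

```python
# Exact enumeration check of the orthogonality relation behind (3.6):
# for a canonical kernel f, E f(xi_1, xi_2) f(xi_1, xi_3) = 0 because
# the index sets {1, 2} and {1, 3} differ.  Here X = {0, 1}, mu is the
# uniform measure, and f(x, y) = g(x) g(y) with g(0) = -1, g(1) = 1,
# so f is canonical with respect to mu.
from itertools import product

g = {0: -1.0, 1: 1.0}
f = lambda x, y: g[x] * g[y]
p = 0.5  # mu({0}) = mu({1}) = 1/2

# canonical: integrating out one variable gives zero
assert all(abs(sum(p * f(x, y) for x in (0, 1))) < 1e-12 for y in (0, 1))

# E f(xi1, xi2) f(xi1, xi3) over independent xi1, xi2, xi3 vanishes
exp_diff = sum(p**3 * f(a, b) * f(a, c)
               for a, b, c in product((0, 1), repeat=3))
assert abs(exp_diff) < 1e-12

# with equal index sets the expectation equals the squared L2 norm
exp_same = sum(p**2 * f(a, b) ** 2 for a, b in product((0, 1), repeat=2))
assert abs(exp_same - 1.0) < 1e-12
```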
Relation (\ref{3.6}) suggests restricting our attention in the
investigation of problem a$'$) to degenerate
$U$-statistics, and it also explains why the normalization
$n^{-k/2}$ was chosen in it. For degenerate $U$-statistics with
this normalization an upper bound can be expected in problem~a$'$)
which does not depend on the sample size~$n$. The estimation of the distribution
of a general $U$-statistic can be reduced to the degenerate case by
means of Hoeffding's decomposition (Theorem~\ref{t3.1}).
The random integrals $J_{n,k}(f)$ are defined in (\ref{1.2}) by means of
integration with respect to the signed measure $\mu_n-\mu$, and this
means some sort of normalization. This normalization has the
consequence that the distributions of these integrals satisfy a good
estimate for rather general kernel functions~$f$. Besides this, a random
integral $J_{n,k}(f)$ can be written as a sum of $U$-statistics to
which the Hoeffding decomposition can be applied. Hence it can be
rewritten as a linear combination of degenerate $U$-statistics. In
the next result I describe the representation of
$J_{n,k}(f)$ we get in such a way. It shows that the implicit
normalization caused by integration with respect to $\mu_n-\mu$ has
a serious cancellation effect. This enables us to get a good
solution for problem~a) or~b) if we have a good solution for
problem~a$'$) or~b$'$). Unfortunately, the proof of this result
demands rather unpleasant calculations. Hence here I omit the
proof. It can be found in~\cite{r19} or in Theorem~9.4 of~\cite{r22}.
\begin{thm}\label{t3.2}
%{\bf Theorem 3.2.}
{Let us have a non-atomic measure $\mu$
on a measurable space $(X,{\cal X})$ together with a sequence of
independent, $\mu$-distributed random variables $\xi_1,\dots,\xi_n$,
and take a function $f(x_1,\dots,x_k)$ of $k$ variables on the
space $(X^k,{\cal X}^k)$ such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty.
$$
Let us consider the empirical distribution function $\mu_n$ of the
sequence $\xi_1,\dots,\xi_n$ introduced in (\ref{1.1}) together with the
$k$-fold random integral $J_{n,k}(f)$ of the function $f$ defined in
(\ref{1.2}). The identity
\begin{equation}
J_{n,k}(f)=\sum_{V\subset\{1,\dots,k\}}C(n,k,V)n^{-|V|/2}
I_{n,|V|}(f_V), \label{3.7}
\end{equation}
holds with the canonical (with respect to the measure $\mu$)
functions $f_V(x_j,\;j\in V)$ defined in (\ref{3.4}) and appropriate
real numbers $C(n,k,V)$, $V\subset\{1,\dots,k\}$, where
$I_{n,|V|}(f_V)$ is the (degenerate) $U$-statistic with kernel
function $f_V$ and random sequence $\xi_1,\dots,\xi_n$ defined in
(\ref{1.3}). The constants $C(n,k,V)$ in~(\ref{3.7}) satisfy the relations
$|C(n,k,V)|\le C(k)$ with some constant $C(k)$ depending only on
the order $k$ of the integral $J_{n,k}(f)$,
$\lim\limits_{n\to\infty}C(n,k,V)=C(k,V)$ with some constant
$C(k,V)<\infty$ for all $V\subset\{1,\dots,k\}$, and
$C(n,k,\{1,\dots,k\})=1$ for $V=\{1,\dots,k\}$. }
\end{thm}
Let us also remark that the functions $f_V$ defined in (\ref{3.4}) satisfy
the inequalities
\begin{equation}
\int f_V^2(x_j,\,j\in V)
\prod\limits_{j\in V}\mu(\,dx_j)\le \int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k) \label{3.8}
\end{equation}
and
\begin{equation}
\sup_{x_j,\, j\in V} |f_V(x_j,\,j\in V)|\le2^{|V|}\sup_{x_j,\,1\le
j\le k}|f(x_1,\dots,x_k)| \label{3.9}
\end{equation}
for all $V\subset\{1,\dots,k\}$.
The decomposition of the random integral $J_{n,k}(f)$ in
formula~(\ref{3.7}) is similar to the Hoeffding decomposition of general
$U$-statistics presented in Theorem~\ref{t3.1}. The main difference between
them is that the coefficients of the normalized
degenerate $U$-statistics $n^{-|V|/2}I_{n,|V|}(f_V)$ at the
right-hand side of formula (\ref{3.7}) can be bounded by a universal
constant depending neither on the sample size $n$, nor on the kernel
function $f$ of the random integral. This fact has important
consequences.
Theorem~\ref{t3.2} enables us to get good estimates for problem a) if we
have such estimates for problem a$'$). In particular, formulas (\ref{3.6}),
(\ref{3.7}) and (\ref{3.8}) yield good bounds on the expectation and
variance of the random integral $J_{n,k}(f)$. The inequalities
\begin{eqnarray}
&&E\left(J_{n,k}(f)\right)^2\le C\sigma^2 \quad \textrm{and} \quad
|EJ_{n,k}(f)|\le C\sigma, \nonumber \\
&&\qquad \textrm{with}\quad \sigma^2=\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
\label{3.10}
\end{eqnarray}
hold with some universal constant $C>0$ depending only on the order of
the random integral~$J_{n,k}(f)$.
Relation (\ref{3.10}) yields the estimate we expect for the second
moment of $J_{n,k}(f)$. On the other hand, although it gives
a sufficiently good bound on its first moment, it does not state
that the expectation of $J_{n,k}(f)$ equals zero. Indeed,
formula~(\ref{3.7}) only gives that
$|EJ_{n,k}(f)|=|C(n,k,\emptyset)f_\emptyset|\le C|f_\emptyset|
=C\left|\int f(x_1,\dots,x_k)\mu(\,dx_1) \dots\mu(\,dx_k)\right|
\le C\sigma$ with some appropriate constant~$C>0$. The following
example shows that $EJ_{n,k}(f)$ need not always be zero. (To
understand better why such a situation may appear, observe that the
random measures $(\mu_n-\mu)(B_1)$ and $(\mu_n-\mu)(B_2)$ are not
independent for disjoint sets~$B_1$ and~$B_2$.)
Let us consider a random integral $J_{n,2}(f)$ of order~2 with an
appropriate kernel function~$f$. Besides this, choose a sequence of
independent random variables $\xi_1,\dots,\xi_n$ with uniform
distribution on the unit interval $[0,1]$ and denote its empirical
distribution by $\mu_n$. We shall consider the example where the
kernel function $f=f(x,y)$ is the indicator function of the unit
square, i.e.\ $f(x,y)=1$ if $0\le x,y\le1$, and $f(x,y)=0$ otherwise.
The random integral
$J_{n,2}(f)=n\int_{x\neq y}f(x,y)(\mu_n(\,dx)-\,dx)(\mu_n(\,dy)-\,dy)$
will be taken, and its expected value $EJ_{n,2}(f)$ will be
calculated. By adding the
diagonal $x=y$ to the domain of integration and subtracting the
contribution obtained in this way we get that $EJ_{n,2}(f)
=nE\left(\int_0^1\left(\mu_n(\,dx)-\mu(\,dx)\right)\right)^2 -n^2
\cdot\frac1{n^2}=-1$,
i.e. the expected value of this random integral is not equal
to zero. (The last term is the integral of the function $f(x,y)$ on
the diagonal $x=y$ with respect to the product measure
$\mu_n\times\mu_n$ which equals $(\mu_n-\mu)\times(\mu_n-\mu)$
on the diagonal.)
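The value $-1$ can be reproduced by expanding $J_{n,2}(f)$ into four
product integrals. The script below is an elementary bookkeeping
exercise (assuming the sample points are distinct, which happens with
probability one) carried out with exact rational arithmetic.

```python
# Check that J_{n,2}(f) = -1 for the indicator kernel of the unit
# square.  Expanding
#   J_{n,2}(f) = n * int_{x != y} (mu_n - mu)(dx) (mu_n - mu)(dy)
# into the four product integrals and using that the diagonal is a
# mu-null set, while mu_n x mu_n puts mass n * (1/n)^2 on it:
from fractions import Fraction

def J_n2_indicator(n):
    n = Fraction(n)
    mn_mn = n * (n - 1) / n**2   # mu_n x mu_n off the diagonal
    mn_mu = Fraction(1)          # int_{x != y} mu_n(dx) mu(dy) = 1
    mu_mu = Fraction(1)          # int_{x != y} mu(dx) mu(dy)  = 1
    return n * (mn_mn - 2 * mn_mu + mu_mu)

# the result is -1 for every sample size n
assert all(J_n2_indicator(n) == -1 for n in (2, 5, 100))
```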
Now I turn to the second problem discussed in this section.
\medskip\noindent
\subsection{Limit theorems for $U$-statistics and random integrals}\label{s3.2}
The following limit theorem about normalized degenerate
$U$-statistics will be interesting for us.
\begin{thm}[Limit theorem about normalized degenerate
$U$-sta\-tis\-tics]\label{t3.3}
%{\bf Theorem 3.3.}
Let us consider a sequence of degenerate $U$-statistics
$I_{n,k}(f)$ of order~$k$, $n=k,k+1,\dots$, defined in (\ref{1.3}) with
the help of a kernel function $f(x_1,\dots,x_k)$ on the $k$-fold
product $(X^k,{\cal X}^k)$ of a measurable space $(X,{\cal X})$,
canonical with respect to some non-atomic probability measure~$\mu$
on $(X,{\cal X})$ and such that $\int f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(dx_k)<\infty$ together with a sequence of
independent and identically distributed random variables
$\xi_1,\xi_2,\dots$ with distribution $\mu$ on $(X,{\cal X})$. The
sequence of normalized $U$-statistics $n^{-k/2}I_{n,k}(f)$
converges in distribution, as $n\to\infty$, to the $k$-fold
Wiener--It\^o integral
$$
\frac1{k!}Z_{\mu,k}(f)=\frac1{k!}\int
f(x_1,\dots,x_k)\mu_W(dx_1)\dots\mu_W(dx_k)
$$
with kernel function $f(x_1,\dots,x_k)$ and a white noise $\mu_W$
with reference measure~$\mu$.
\end{thm}
The proof of Theorem~\ref{t3.3} can be found for instance in~\cite{r6}. Here I
present a heuristic explanation which can be considered
as a sketch of the proof.
To understand Theorem~\ref{t3.3} it is useful to rewrite the normalized
degenerate $U$-statistics considered in it in the form of multiple
random integrals with respect to a normalized empirical measure. The
identity
\begin{eqnarray}
\!\!\!\!\! &&n^{-k/2}I_{n,k}(f)=n^{k/2}\int'
f(x_1,\dots,x_k)\mu_n(\,dx_1)\dots\mu_n(\,dx_k) \label{3.11} \\
\!\!\!\!\! &&\qquad =n^{k/2}\int' f(x_1,\dots,x_k)(\mu_n(\,dx_1)-\mu(\,dx_1))
\dots(\mu_n(\,dx_k)-\mu(\,dx_k)) \nonumber
\end{eqnarray}
holds, where $\mu_n$ is the empirical distribution function of the
sequence $\xi_1,\dots,\xi_n$ defined in~(\ref{1.1}), and the prime in
$\int'$ denotes that the diagonals, i.e. the points
$x=(x_1,\dots,x_k)$ such that $x_j=x_{j'}$ for some pairs
of indices $1\le j,j'\le k$, $j\neq j'$ are omitted from the
domain of integration. The last identity of formula (\ref{3.11}) holds,
because in the case of a function $f(x_1,\dots,x_k)$ canonical with
respect to a non-atomic measure $\mu$ we get the same result by
integrating with respect to $\mu_n(\,dx_j)$ and with respect to
$\mu_n(\,dx_j)-\mu(\,dx_j)$. (The non-atomic property of the measure
$\mu$ is needed to guarantee that the integrals with respect
to the measure $\mu$ considered in this formula remain zero if the
diagonals are omitted from the domain of integration.)
Formula (\ref{3.11}) may help to understand Theorem~\ref{t3.3}, because
the random fields $n^{1/2}(\mu_n(A)-\mu(A))$, $A\in {\cal X}$, converge
to a Gaussian field $\nu(A)$, $A\in{\cal X}$, as $n\to\infty$, and
this suggests a limit similar to the result of Theorem~\ref{t3.3}. But
it is not so simple to carry out a limiting procedure leading to
the proof of Theorem~\ref{t3.3} with the help of formula (\ref{3.11}). Some
problems arise, because the fields $n^{1/2}(\mu_n-\mu)$ converge
to a Gaussian field which is not of white noise type. The limit we get is
similar to a Wiener bridge on the real line. Hence a relation
between Wiener processes and Wiener bridges suggests writing the
following version of formula (\ref{3.11}). Let $\eta$ be a standard
Gaussian random variable, independent of the random sequence
$\xi_1,\xi_2,\dots$. We can write, by exploiting again the
canonical property of the function~$f$, the identity
\begin{eqnarray}
n^{-k/2}I_{n,k}(f)=n^{k/2}\int'
&&f(x_1,\dots,x_k)(\mu_n(\,dx_1)-\mu(\,dx_1)+\eta\mu(\,dx_1))\nonumber \\
&&\qquad \dots(\mu_n(\,dx_k)-\mu(\,dx_k)+\eta\mu(\,dx_k)).
\label{3.12}
\end{eqnarray}
The random measures $n^{1/2}(\mu_n-\mu+\eta\mu)$ converge to
a white noise with reference measure $\mu$, hence a limiting
procedure in formula (\ref{3.12}) yields Theorem~\ref{t3.3}. Moreover, in the
case of elementary functions~$f$ the central limit theorem and
formula (\ref{3.12}) imply the statement of Theorem~\ref{t3.3} directly.
(Elementary functions are defined in formula (\ref{1.6}).) After this,
Theorem~\ref{t3.3} can be proved in the general case with the help of the
investigation of the $L_2$-contraction property of some operators.
I omit the details.
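The statement of Theorem~\ref{t3.3} can also be illustrated by
simulation. In the sketch below the canonical kernel
$f(x,y)=f_0(x)f_0(y)$ with $f_0(x)=\sqrt3\,(2x-1)$ and the uniform
measure on $[0,1]$ are illustrative choices; for this kernel the limit
$\frac1{2!}Z_{\mu,2}(f)$ has the law of $(\eta^2-1)/2$, with mean $0$
and variance $1/2$, and the empirical moments of $n^{-1}I_{n,2}(f)$
should be close to these values.

```python
# Monte Carlo illustration (not part of the proof) of Theorem 3.3 for
# k = 2.  f0(x) = sqrt(3)(2x - 1) satisfies int f0 dmu = 0 and
# int f0^2 dmu = 1 for the uniform measure mu on [0, 1], so the kernel
# f(x, y) = f0(x) f0(y) is canonical.  The limit of n^{-1} I_{n,2}(f)
# is (eta^2 - 1)/2 with a standard normal eta.
import math
import random

random.seed(7)
n, reps = 400, 4000
samples = []
for _ in range(reps):
    s = [math.sqrt(3.0) * (2.0 * random.random() - 1.0) for _ in range(n)]
    # (1/2)[(sum f0)^2 - sum f0^2] is the sum over ordered pairs j != k
    u_stat = 0.5 * (sum(s) ** 2 - sum(v * v for v in s))
    samples.append(u_stat / n)

mean = sum(samples) / reps
var = sum((v - mean) ** 2 for v in samples) / reps
assert abs(mean) < 0.06          # the limit (eta^2 - 1)/2 has mean 0
assert abs(var - 0.5) < 0.15     # and variance 1/2
```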
A similar limit theorem holds for random integrals $J_{n,k}(f)$.
It can be proved by means of Theorem~\ref{t3.2} and an adaptation of the
argument sketched above for the proof of Theorem~\ref{t3.3}. It yields
the following result.
\begin{thm}[Limit theorem about multiple random integrals
$J_{n,k}(f)$]\label{t3.4}
{Let us have a sequence of independent and
identically distributed random variables $\xi_1,\xi_2,\dots$ with
some non-atomic distribution $\mu$ on a measurable space
$(X,{\cal X})$ and a function $f(x_1,\dots,x_k)$ on the $k$-fold
product $(X^k,{\cal X}^k)$ of the space $(X,{\cal X})$ such
that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty.
$$
Let us consider for all $n=1,2,\dots$ the random integrals
$J_{n,k}(f)$ of order~$k$ defined in formulas (\ref{1.1}) and
(\ref{1.2}) with
the help of the empirical distribution $\mu_n$ of the sequence
$\xi_1,\dots,\xi_n$ and the function~$f$. The random integrals
$J_{n,k}(f)$ converge in distribution, as $n\to\infty$, to the
following sum $U(f)$ of multiple Wiener--It\^o integrals:
\begin{eqnarray*}
U(f)&=& \sum_{V\subset\{1,\dots,k\}}
\frac{C(k,V)}{|V|!}Z_{\mu,|V|}(f_V)\\
&=& \sum_{V\subset\{1,\dots,k\}} \!\!\! \frac{C(k,V)}{|V|!}
\int f_V(x_j,\,j\in V)\prod_{j\in V}\mu_W(dx_j),
\end{eqnarray*}
where the functions $f_V(x_j,\,j\in V)$, $V\subset\{1,\dots,k\}$,
are those functions defined in formula (\ref{3.4}) which appear in the
Hoeffding decomposition of the function $f(x_1,\dots,x_k)$, the
constants $C(k,V)$ are the limits appearing in the limit relation
$\lim\limits_{n\to\infty}C(n,k,V)=C(k,V)$ satisfied by the quantities
$C(n,k,V)$ in formula (\ref{3.7}), and $\mu_W$ is a white
noise with reference measure~$\mu$.}
\end{thm}
The results of this section suggest that to understand what kind of
results can be expected for the solution of problems~a) and~a$'$) it
is useful to study first their simpler counterpart, problem~a$''$)
about multiple Wiener--It\^o integrals. They also show that
problem~a$'$) is interesting in the case when degenerate
$U$-statistics are investigated. The next section contains some
results about these problems.
\section{Estimates on the distribution of random integrals
and $U$-statistics}\label{s4}
First I formulate the results about the solution of problem a$''$),
about the tail-behaviour of multiple Wiener--It\^o integrals.
\begin{thm}\label{t4.1}
%{\bf Theorem 4.1.}
{Let us consider a $\sigma$-finite measure
$\mu$ on a measurable space $(X,{\cal X})$ together with a white noise
$\mu_W$ with reference measure $\mu$. Let us have a real-valued
function $f(x_1,\dots,x_k)$ on the space $(X^k,{\cal X}^k)$ which
satisfies relation (\ref{1.4}) with some $\sigma^2<\infty$. Take the
random integral $Z_{\mu,k}(f)$ introduced in formula (\ref{1.5}). It
satisfies the inequality
\begin{equation}
P(|Z_{\mu,k}(f)|>u)\le C \exp\left\{-\frac12\left(\frac
u\sigma\right)^{2/k}\right\}\quad \textrm{for all } u>0 \label{4.1}
\end{equation}
with an appropriate constant $C=C(k)>0$ depending only on the
multiplicity $k$ of the integral.}
\end{thm}
%\medskip
The proof of Theorem~\ref{t4.1} can be found in my paper~\cite{r20}
together with
the following example which shows that it gives a sharp estimate.
\begin{exmp}\label{e4.2}
%{\bf Example 4.2.}
{Let us have a $\sigma$-finite measure $\mu$
on some measurable space $(X,{\cal X})$ together with a white noise
$\mu_W$ on $(X,{\cal X})$ with reference measure~$\mu$. Let $f_0(x)$
be a real valued function on $(X,{\cal X})$ such that $\int
f_0(x)^2\mu(\,dx)=1$, and take the function $f(x_1,\dots,x_k)=
\sigma f_0(x_1)\cdots f_0(x_k)$ with some number $\sigma>0$ and the
Wiener--It\^o integral $Z_{\mu,k}(f)$ introduced in formula (\ref{1.5}).
Then the relation
$$
\int f(x_1,\dots,x_k)^2\,\mu(\,dx_1)\dots\,\mu(\,dx_k)=\sigma^2
$$
holds, and the Wiener--It\^o integral $Z_{\mu,k}(f)$ satisfies the
inequality
\begin{equation}
P(|Z_{\mu,k}(f)|>u)\ge \frac{\bar C}{\left(\frac u\sigma\right)
^{1/k}+1}\exp\left\{-\frac12\left(\frac
u\sigma\right)^{2/k}\right\}\quad \textrm{for all } u>0 \label{4.2}
\end{equation}
with some constant $\bar C>0$.}
\end{exmp}
Let us also remark that a Wiener--It\^o integral $Z_{\mu,k}(f)$
defined in (\ref{1.5}) with a kernel function $f$ satisfying
relation~(\ref{1.4}) also satisfies the relations $EZ_{\mu,k}(f)=0$ and
$EZ_{\mu,k}(f)^2\le k!\sigma^2$ with the number $\sigma^2$ in~(\ref{1.4}).
If the function~$f$ is symmetric, i.e.\ if
$f(x_1,\dots,x_k)=f(x_{\pi(1)},\dots,x_{\pi(k)})$ for all permutations
$\pi$ of the set $\{1,\dots,k\}$, then in the last relation identity
can be written instead of inequality. Besides this,
$Z_{\mu,k}(f)=Z_{\mu,k}(\textrm{Sym}\,f)$, where $\textrm{Sym}\,f$
denotes the symmetrization of the function $f$, and this means that we
can restrict our attention to the Wiener--It\^o integrals of symmetric
functions without loss of generality. Hence Theorem~\ref{t4.1} can be
interpreted in the following way. The random integral $Z_{\mu,k}(f)$
has expectation zero, its variance is less than or equal to
$k!\sigma^2$ under the conditions of this result, and there is
identity in this relation if $f$ is a symmetric function. Besides this,
the distribution of $Z_{\mu,k}(f)$ satisfies an estimate similar to that
of $\sigma\eta^k$, where $\eta$ is a standard normal random variable.
The estimate (\ref{4.1}) in Theorem~\ref{t4.1} is not always sharp, but
Example~\ref{e4.2} shows that there are cases when the expression in its
exponent cannot be improved.
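The comparison with $\sigma\eta^k$ can be made concrete for
$\sigma=1$. The short check below assumes nothing beyond the standard
normal tail bound $\textrm{erfc}(x)\le e^{-x^2}$ for $x\ge0$; it is an
illustration added here, not an argument from the paper.

```python
# With sigma = 1 the tail of |eta|^k is
#   P(|eta|^k > u) = P(|eta| > u^{1/k}) = erfc(u^{1/k} / sqrt(2)),
# and erfc(x) <= exp(-x^2) for x >= 0 gives the bound
# exp(-u^{2/k}/2), which is exactly the expression in (4.1).
import math

for k in (1, 2, 3, 4):
    for u in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0):
        tail = math.erfc(u ** (1.0 / k) / math.sqrt(2.0))
        bound = math.exp(-0.5 * u ** (2.0 / k))
        assert tail <= bound
```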
Let me also remark that the above statement can be formulated in a
slightly nicer form if the distribution of $Z_{\mu,k}(f)$ is compared
not with that of $\sigma\eta^k$, but with that of $\sigma H_k(\eta)$,
where $H_k(x)$ is the $k$-th Hermite polynomial with leading
coefficient~1. The identities $EH_k(\eta)=0$, $EH_k(\eta)^2=k!$ hold.
This means that not only the tail distributions of $Z_{\mu,k}(f)$
and $\sigma H_k(\eta)$ are similar, but in the case of a symmetric
function $f$ also their first two moments agree.
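The identities $EH_k(\eta)=0$ and $EH_k(\eta)^2=k!$ can be checked
mechanically from the moments of the standard normal distribution. The
sketch below (an illustration, using the recursion
$H_{k+1}(x)=xH_k(x)-kH_{k-1}(x)$ for the Hermite polynomials with
leading coefficient 1) does this for small $k$ in exact integer
arithmetic.

```python
# Exact check of E H_k(eta) = 0 and E H_k(eta)^2 = k! using the normal
# moments E eta^m (0 for odd m, (m-1)!! for even m).
from math import factorial

def hermite_coeffs(k):
    # coefficient lists, index = power of x; H_{k+1} = x H_k - k H_{k-1}
    H = [[1], [0, 1]]
    for j in range(1, k):
        nxt = [0] + H[j]                      # x * H_j
        for i, c in enumerate(H[j - 1]):
            nxt[i] -= j * c                   # minus j * H_{j-1}
        H.append(nxt)
    return H[k]

def normal_moment(m):
    if m % 2 == 1:
        return 0
    return factorial(m) // (2 ** (m // 2) * factorial(m // 2))  # (m-1)!!

for k in range(1, 7):
    h = hermite_coeffs(k)
    mean = sum(c * normal_moment(i) for i, c in enumerate(h))
    second = sum(h[i] * h[j] * normal_moment(i + j)
                 for i in range(len(h)) for j in range(len(h)))
    assert mean == 0 and second == factorial(k)
```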
In problems a) and~a$'$) a slightly weaker but similar estimate holds.
In the case of problem~a$'$) the following result is valid (see~\cite{r20}).
\begin{thm}\label{t4.3}
%{\bf Theorem 4.3.}
{Let $\xi_1,\dots,\xi_n$ be a sequence of
independent and identically distributed random variables on a space
$(X,{\cal X})$ with some distribution~$\mu$. Let us consider a function
$f(x_1,\dots,x_k)$ on the space $(X^k,{\cal X}^k)$, canonical with
respect to the measure~$\mu$, which satisfies the conditions
\begin{eqnarray}
\|f\|_\infty&=&\sup_{x_j\in X, \,1\le j\le k}|f(x_1,\dots,x_k)|\le 1,
\label{4.3} \\
\|f\|^2_2&=&\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)
\le\sigma^2, \label{4.4}
\end{eqnarray}
with some $0<\sigma^2\le1$ together with the degenerate $U$-statistic
$I_{n,k}(f)$ defined in formula (\ref{1.3}) with this kernel function~$f$.
There exist some constants $A=A(k)>0$ and $B=B(k)>0$ depending only
on the order $k$ of the $U$-statistic $I_{n,k}(f)$ such that
\begin{equation}
P(k!n^{-k/2}|I_{n,k}(f)|>u)\le A\exp\left\{-\frac{u^{2/k}}{2\sigma^{2/k}
\left(1+B\left(un^{-k/2}\sigma^{-(k+1)}\right)^{1/k}\right)}\right\}
\label{4.5}
\end{equation}
for all $0\le u\le n^{k/2}\sigma^{k+1}$.}
\end{thm}
\smallskip
\noindent
{\it Remark:} Actually, the universal constant $B>0$ can be chosen
independently of the order $k$ of the degenerate $U$-statistic
$I_{n,k}(f)$ in inequality (\ref{4.5}).
\medskip
Theorem~\ref{t4.3} can be considered as a generalization of Bernstein's
inequality (Theorem~\ref{t2.2}) to the multivariate case, in the slightly
weaker form where only sums of independent and identically
distributed random variables are considered. Its statement, inequality
(\ref{4.5}) does not contain an explicit value for the constants~$A$
and~$B$, which are equal to $A=2$ and $B=\frac13$ in the case of
Bernstein's inequality. (The constant $A=2$ appears because of the
absolute value in the probability at the left-hand side of (\ref{4.5}).)
There is a formal difference between formula (\ref{2.2}) and the
statement of formula (\ref{4.5}) in the case $k=1$, because in
formula~(\ref{4.5}) the $U$-statistic $I_{n,k}(f)$ of order $k$ is
multiplied by $n^{-k/2}$. Another difference between them is that
inequality (\ref{4.5}) in Theorem~\ref{t4.3} is stated under the condition
$0\le u\le n^{k/2}\sigma^{k+1}$, and this restriction has no
counterpart in Bernstein's inequality. But, as I shall show,
Theorem~\ref{t4.3} also contains an estimate for $u\ge n^{k/2}\sigma^{k+1}$
in an implicit way, and it can be considered as the multivariate
version of Bernstein's inequality.
Bernstein's inequality gives a good estimate only if
$0\le u\le K\sqrt n\sigma^2$ with some $K>0$ (with the normalization of
Theorem~\ref{t4.3}, i.e. if the probability
$$
P\left(n^{-1/2}\sum\limits_{k=1}^nX_k>u\right)
$$
is considered). In the multivariate case a similar
picture appears. We get a good estimate for problem~a$'$)
suggested by Theorem~\ref{t4.1} only under the condition
$0\le u\le\textrm{const.}\, n^{k/2}\sigma^{k+1}$.
If $0\le u\le\varepsilon n^{k/2}\sigma^{k+1}$ with some $\varepsilon>0$,
then Theorem~\ref{t4.3} implies the inequality
$P(k!n^{-k/2}|I_{n,k}(f)|>u)\le A\exp\left\{-\frac
{1-C\varepsilon^{1/k}}2 \left(\frac u\sigma\right)^{2/k}\right\}$
with some universal constants $A>0$ and $C>0$ depending only on the
order~$k$ of the $U$-statistic $I_{n,k}(f)$. This means that in this
case Theorem~\ref{t4.3} yields an almost as good estimate as Theorem~\ref{t4.1}
about the distribution of multiple Wiener--It\^o integrals.
We have seen that Bernstein's inequality has a similar property if
the estimate (\ref{2.2}) is compared with the central limit theorem
in the case $0\le u\le\varepsilon\sqrt n\,\sigma^2$ with some small
$\varepsilon>0$.
To see what kind of estimate Theorem~\ref{t4.3} yields in the case $u\ge
n^{k/2}\sigma^{k+1}$ let us observe that in condition (\ref{4.4})
we have an inequality and not an identity. Hence in the case
$n^{k/2}\ge u>n^{k/2}\sigma^{k+1}$ relation~(\ref{4.5}) holds with
$\sigma$ replaced by
$\bar\sigma=\left(u{n^{-k/2}}\right)^{1/{(k+1)}}$, and this yields that
\begin{eqnarray*}
P(k!n^{-k/2}|I_{n,k}(f)|>u)&\le& A\exp\left\{-\frac1{2(1+B)^{1/k}}
\left(\frac u{\bar\sigma}\right)^{2/k}\right\} \\
&=&Ae^{-(u^2n)^{1/(k+1)}/2(1+B)^{1/k}}.
\end{eqnarray*}
(The inequality $n^{k/2}\ge u$
was imposed to satisfy the condition $0\le \bar\sigma^2\le1$.)
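The last equality can be checked by substituting the value of
$\bar\sigma$:
$$
\left(\frac u{\bar\sigma}\right)^{2/k}
=u^{2/k}\left(un^{-k/2}\right)^{-2/(k(k+1))}
=u^{2/k-2/(k(k+1))}\,n^{1/(k+1)}
=\left(u^2n\right)^{1/(k+1)},
$$
since $\frac2k-\frac2{k(k+1)}=\frac2{k+1}$.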
If $u>n^{k/2}$, then the probability at the left-hand side
of~(\ref{4.5}) equals zero because of condition~(\ref{4.3}).
It is not difficult to see by means of the above calculation that
Theorem~\ref{t4.3} implies the inequality
\begin{eqnarray}
\!\!\!\!\!\! &&P\left(k!n^{-k/2}|I_{n,k}(f)|>u\right)\label{4.6} \\
\!\!\!\!\!\! &&\qquad \le
c_1\exp\left\{-\frac{c_2u^{2/k}}{\sigma^{2/k}\left(1+c_3
\left(un^{-k/2}\sigma^{-(k+1)}\right)^{2/k(k+1)}\right)} \right\}
\quad \textrm{for all }u>0 \nonumber
\end{eqnarray}
with some universal constants $c_1$, $c_2$ and $c_3$ depending
only on the order $k$ of the $U$-statistic $I_{n,k}(f)$, if the
conditions of Theorem~\ref{t4.3} hold. Inequality~(\ref{4.6}) holds for all
$u\ge0$. Arcones and Gin\'e formulated and proved this estimate
in a slightly different but equivalent form in paper~\cite{r3} under
the name generalized Bernstein's inequality. This result is
weaker than Theorem~\ref{t4.3}, since it does not give a good value
for the constant~$c_2$. The method of paper~\cite{r3} is based on a
symmetrization argument. Symmetrization arguments can be very
useful in the study of problems~b) and~b$'$) formulated in the
Introduction, but they cannot supply a proof of Theorem~\ref{t4.3} with
good constants for some fundamental reasons.
The following result which can be considered as a solution of
problem~a) is a fairly simple consequence of Theorem~\ref{t4.3},
Theorem~\ref{t3.2} and formulas~(\ref{3.8}) and~(\ref{3.9}).
%\medskip\noindent
\begin{thm}\label{t4.4}
%{\bf Theorem 4.4.}
{Let us take a sequence
of independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ on a measurable space $(X,{\cal X})$ with
a non-atomic distribution~$\mu$ on it together with a
measurable function $f(x_1,\dots,x_k)$ on the $k$-fold product
$(X^k,{\cal X}^k)$ of the space $(X,{\cal X})$ with some $k\ge1$ which
satisfies conditions (\ref{4.3}) and (\ref{4.4}) with some constant
$0<\sigma\le1$. Then there exist some constants $C=C_k>0$ and
$\alpha=\alpha_k>0$ such that the random integral $J_{n,k}(f)$ defined
by formulas (\ref{1.1}) and (\ref{1.2}) with this sequence of random
variables $\xi_1,\dots,\xi_n$
and function $f$ satisfies the inequality
\begin{equation}
P\left(|J_{n,k}(f)|>u\right)\le C \exp\left\{-\alpha
\left(\frac u\sigma\right)^{2/k}\right\} \quad \textrm{for all }
0\le u\le n^{k/2}\sigma^{k+1}. \label{4.7}
\end{equation} }
\end{thm}
Inequality (\ref{4.7}) is stated only in the range
$0\le u\le n^{k/2}\sigma^{k+1}$, and not for all
$u>0$. On the other hand, this estimate is sharp
in the sense that, disregarding the value of the universal
constant~$\alpha$, it cannot be improved. It seems to be appropriate in
the solution of the problems about non-parametric maximum likelihood
estimates mentioned in the Introduction.
The estimate (\ref{4.7}) on the probability
$P\left(|J_{n,k}(f)|>u\right)$ can be rewritten, similarly
to relation~(\ref{4.6}), in such a form which holds for all $u>0$.
On the other hand,
both Theorem~\ref{t4.3} and Theorem~\ref{t4.4} yield a very weak estimate
if $u\gg
n^{k/2}\sigma^{k+1}$. We met a similar situation in Section~\ref{s2} when
these problems were investigated in the case $k=1$. It is natural
to expect that a generalization of Bennett's inequality holds in
the multivariate case $k\ge2$, and it gives an improvement of
estimates (\ref{4.5}) and (\ref{4.7}) in the case
$u\gg n^{k/2}\sigma^{k+1}$ for all $k\ge1$. I can prove only partial
results in this direction which are not sharp in the general case. On the
other hand, there is a possibility to give such a generalization of
Example~\ref{e2.4} which shows that the inequalities implied by
Theorem~\ref{t4.3}
or~\ref{t4.4} in the case $u\ge n^{k/2}\sigma^{k+1}$, $k\ge2$, admit only
a slight improvement.
The results of Theorems~\ref{t4.3} and~\ref{t4.4} imply that in the case $u\le
n^{k/2}\sigma^{k+1}$, under the conditions of these results, the
probabilities $P(n^{-k/2}|I_{n,k}(f)|>u)$ and $P(|J_{n,k}(f)|>u)$ can
be bounded by $P(C\sigma|\eta|^k>u)$ with an appropriate universal
constant $C=C(k)>0$ depending only on the order~$k$ of the
degenerate $U$-statistic $I_{n,k}(f)$ or of the multiple random
integral $J_{n,k}(f)$, where the random variable $\eta$ has standard
normal distribution, and
$$
\sigma^2=\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k).
$$
A generalization of
Example~\ref{e2.4} can be given which shows for all $k\ge1$ that in the
case $u\gg n^{k/2}\sigma^{k+1}$ we can have only a much weaker
estimate. I shall present such an example only for $k=2$, but it
can be generalized for all $k\ge1$. This example is taken from my
Lecture Note~\cite{r22} (Example~8.6). Here I present it without
a detailed proof. The proof, which exploits the properties of
Example~\ref{e2.4}, is not long. But I
found it more instructive to explain the idea behind this example.
%\medskip\noindent
\begin{exmp}\label{e4.5}
%{\bf Example 4.5.}
{Let $\xi_1,\dots,\xi_n$ be a sequence of
independent, identically distributed random variables taking
values in the plane, i.e.\ in $X=R^2$, such that
$\xi_j=(\eta_{j,1},\eta_{j,2})$, $\eta_{j,1}$
and $\eta_{j,2}$ are independent, $P(\eta_{j,1}=1)=P(\eta_{j,1}=-1)
=\frac{\sigma^2}8$, $P(\eta_{j,1}=0)=1-\frac{\sigma^2}4$,
$P(\eta_{j,2}=1)=P(\eta_{j,2}=-1)=\frac12$ for all $1\le j\le n$.
Let us introduce the function
$f(x,y)=f((x_1,x_2),(y_1,y_2))=x_1y_2+x_2y_1$,
$x=(x_1,x_2)\in R^2$, $y=(y_1,y_2)\in R^2$ on $X^2$, and define the
$U$-statistic
\begin{equation}
I_{n,2}(f)=\sum_{1\le j,k\le n,\,j\neq k}
(\eta_{j,1}\eta_{k,2}+\eta_{k,1}\eta_{j,2}) \label{4.8}
\end{equation}
of order 2 with the above kernel function $f$ and the sequence of
independent random variables $\xi_1,\dots,\xi_n$. Then $I_{n,2}(f)$
is a degenerate $U$-statistic. If $u\ge B_1n\sigma^3$ with some
appropriate constant $B_1>0$, $B_2^{-1}n\ge u\ge B_2n^{-2}$ with a
sufficiently large fixed number $B_2>0$, and
$1\ge\sigma\ge\frac1n$, then the estimate
\begin{equation}
P(n^{-1}I_{n,2}(f)>u)\ge \exp\left\{-Bn^{1/3}u^{2/3}\log
\left(\frac u{n\sigma^3}\right)\right\} \label{4.9}
\end{equation}
holds with some constant $B>0$ depending neither on~$n$ nor
on~$\sigma$.}
\end{exmp}
%\medskip
It is not difficult to see that the $U$-statistic $I_{n,2}(f)$
introduced in Example~\ref{e4.5} is a degenerate $U$-statistic of order two
with a kernel function $f$ such that $\sup |f(x,y)|\le1$ and
$\sigma^2=\int f^2(x,y)\mu(\,dx)\mu(\,dy)
=E(2\eta_{j,1}\eta_{j,2})^2=\sigma^2$. Example~\ref{e4.5} means that in the
case $u\gg n\sigma^3$ (i.e.\ if $u\gg n^{k/2}\sigma^{k+1}$ with
$k=2$) a much weaker estimate holds than in the case $u\le
n\sigma^3$. Let us fix the numbers $u$ and $n$, and consider the
dependence of our estimate on $\sigma$. The estimate
$P(n^{-1}|I_{n,2}(f)|>u)\le e^{-Ku/\sigma}=e^{-Ku^{2/3}n^{1/3}}$ holds
if $\sigma=u^{1/3}n^{-1/3}$, and Example~\ref{e4.5} shows that a rather weak
improvement appears if $\sigma\ll u^{1/3}n^{-1/3}$.
To understand why the statement of Example~\ref{e4.5} holds, observe that
a small error is made if the condition $j\neq k$ is omitted from
the summation in formula (\ref{4.8}), and this suggests that the
approximation
$$
\frac1n I_{n,2}(f)\sim\frac2n
\left(\sum\limits_{j=1}^n\eta_{j,1}\right)
\left(\sum\limits_{j=1}^n\eta_{j,2}\right)
$$
causes a negligible error.
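As a side remark not found in the original text: the omitted diagonal can be made explicit through the purely algebraic identity $\sum_{j\neq k}(a_jb_k+a_kb_j)=2\bigl(\sum_j a_j\bigr)\bigl(\sum_j b_j\bigr)-2\sum_j a_jb_j$, whose last term collects the omitted diagonal. A minimal numerical check of this identity (with arbitrary test sequences in place of the $\eta_{j,i}$'s):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
a = rng.standard_normal(n)  # plays the role of the sequence eta_{j,1}
b = rng.standard_normal(n)  # plays the role of the sequence eta_{j,2}

# the double sum over j != k, exactly as in the U-statistic (4.8)
direct = sum(a[j] * b[k] + a[k] * b[j]
             for j in range(n) for k in range(n) if j != k)
# product of the coordinate sums minus twice the diagonal term
fast = 2 * (a.sum() * b.sum() - np.dot(a, b))
assert np.isclose(direct, fast)
```

The diagonal term $2\sum_j\eta_{j,1}\eta_{j,2}$ is of smaller order than the product of the two full sums, which is why dropping the condition $j\neq k$ causes only a negligible error.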
This fact together with the independence of the sequences $\eta_{j,1}$,
$1\le j\le n$, and $\eta_{j,2}$, $1\le j\le n$, implies that
\begin{eqnarray}
P(n^{-1}I_{n,2}(f)>u)&\sim&
P\left(\left(\sum_{j=1}^n\eta_{j,1}\right)
\left(\sum_{j=1}^n\eta_{j,2}\right)>\frac{nu}2\right)\nonumber \\
&\ge& P\left(\sum_{j=1}^n\eta_{j,1}>v_1\right)
P\left(\sum_{j=1}^n\eta_{j,2}>v_2\right)
\label{4.10}
\end{eqnarray}
with such a choice of numbers $v_1$ and $v_2$ for which
$v_1v_2=\frac{nu}2$.
The first probability on the right-hand side of (\ref{4.10}) can be
bounded from below, by the result of Example~\ref{e2.4}, as
$P\left(\sum\limits_{j=1}^n\eta_{j,1}>v_1\right)\ge
e^{-Bv_1\log(4v_1/n\sigma^2)}$
if $v_1\ge 4n\sigma^2$, and the second probability
as $P\left(\sum\limits_{j=1}^n\eta_{j,2}>v_2\right)\ge Ce^{-Kv_2^2/n}$
with some appropriate $C>0$ and $K>0$ if $0\le v_2\le n$. The proof of
Example~\ref{e4.5} can be obtained by means of an appropriate choice of
the numbers $v_1$ and $v_2$.
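To indicate how the ``appropriate choice'' of $v_1$ and $v_2$ produces the exponent in (\ref{4.9}), one may balance the two lower bounds; the following back-of-the-envelope computation, which ignores constant factors, is only a sketch of the argument in~\cite{r22}, not the proof itself.

```latex
% choose v_1=(nu^2)^{1/3} and v_2=\tfrac12(n^2u)^{1/3}, so that v_1v_2=\tfrac{nu}2; then
v_1\log\frac{4v_1}{n\sigma^2}
  =n^{1/3}u^{2/3}\log\frac{4u^{2/3}}{n^{2/3}\sigma^2}
  \sim\mathrm{const.}\;n^{1/3}u^{2/3}\log\frac u{n\sigma^3},
\qquad
\frac{v_2^2}n=\frac{n^{1/3}u^{2/3}}4.
```

The first exponent dominates in the regime $u\gg n\sigma^3$, and the conditions $v_1\ge4n\sigma^2$ and $0\le v_2\le n$ translate into lower and upper bounds on $u$ of the type appearing in Example~\ref{e4.5}; multiplying the two probability bounds then yields an estimate of the form~(\ref{4.9}).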
\medskip
In Theorem~\ref{t4.1} the distribution of a $k$-fold Wiener--It\^o integral
$Z_{\mu,k}(f)$ was bounded by the distribution of $\sigma\eta^k$
with a standard normal random variable $\eta$ and an appropriate
constant $\sigma$. By Theorems~\ref{t4.3} and~\ref{t4.4} a similar, but weaker
estimate holds for the distribution of a degenerate $U$-statistic
$I_{n,k}(f)$ or random integral $J_{n,k}(f)$. In the next section I
briefly explain why such results hold.
There is a method to get good estimates on the moments of the
random variables considered in the above theorems, and these moment
estimates also yield good bounds on the distribution of the random
integrals and $U$-statistics appearing in these theorems. The
moments of a $k$-fold Wiener--It\^o integral can be bounded by the
moments of $\sigma\eta^k$ with an appropriate $\sigma>0$, and this
estimate implies Theorem~\ref{t4.1}. Theorems~\ref{t4.3} and~\ref{t4.4}
can be proved in
a similar way. But we can give a good estimate only on not too high
moments of the random variables $I_{n,k}(f)$ and $J_{n,k}(f)$, and
this is the reason why we get only a weaker result for their
distribution.
\medskip\noindent
{\it Remark:} My goal was to obtain a good estimate in Problems~a)
and a$'$) when we have a bound on the $L_2$ and $L_\infty$ norms of the
kernel function~$f$ appearing in them. A similar problem was considered
in Problem~a$''$) about Wiener--It\^o integrals, with the difference that
in this case only an $L_2$ bound on the function~$f$ is needed.
Theorems~\ref{t4.1}, \ref{t4.3} and~\ref{t4.4} provided such a bound,
and as Example~\ref{e4.2}
shows these estimates are sharp. On the other hand, if we have
some additional information about the kernel function~$f$, then more
precise estimates can be given which in certain cases yield an
essential improvement. Such results were known for
$U$-statistics and Wiener--It\^o integrals of order~$k=2$,
(see~\cite{r9} and~\cite{r12}) and quite recently (after the
submission of the first version of this work) they were generalized
in~\cite{r1} and~\cite{r15}
to general $k\ge2$. Moreover, these improvements are useful in the
study of some problems. Hence a referee suggested explaining
them in the present work. I try to follow
this advice by inserting their discussion at the end, in the open
problems part of the paper.
\section{On the proof of the results in Section \ref{s4}}\label{s5}
Theorem~\ref{t4.1} can be proved by means of the following
%\medskip\noindent
\begin{prop}\label{p5.1}
%{\bf Proposition 5.1.}
{Let the conditions of Theorem~\ref{t4.1} be
satisfied for a multiple Wiener--It\^o integral $Z_{\mu,k}(f)$ of
order~$k$. Then, with the notations of Theorem~\ref{t4.1}, the inequality
\begin{equation}
E\left(|Z_{\mu,k}(f)|\right)^{2M}\le 1\cdot3\cdot5\cdots(2kM-1)
\sigma^{2M}
\label{5.1}
\end{equation}
holds for all $M=1,2,\dots$.}
\end{prop}
%\medskip
By the Stirling formula Proposition~\ref{p5.1} implies that
\begin{equation}
E(|Z_{\mu,k}(f)|)^{2M}\le \frac{(2kM)!}{2^{kM}(kM)!}\sigma^{2M}\le
A\left(\frac2e\right)^{kM}(kM)^{kM}\sigma^{2M} \label{5.2}
\end{equation}
for any $A>\sqrt2$ if $M\ge M_0=M_0(A)$, and this estimate is sharp.
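For the reader's convenience, here is the short computation behind (\ref{5.2}); it uses only the identity $1\cdot3\cdots(2n-1)=(2n)!/(2^nn!)$ and Stirling's formula.

```latex
% with n=kM:
1\cdot3\cdot5\cdots(2n-1)=\frac{(2n)!}{2^n\,n!}
  =\frac{\sqrt{4\pi n}\,(2n/e)^{2n}}{2^n\sqrt{2\pi n}\,(n/e)^n}\,
   \bigl(1+o(1)\bigr)
  =\sqrt2\left(\frac2e\right)^n n^n\bigl(1+o(1)\bigr).
```

Hence any constant $A>\sqrt2$ works in (\ref{5.2}) for $M\ge M_0(A)$, while no constant $A<\sqrt2$ does, which is the sense in which the estimate is sharp.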
The following Proposition~\ref{p5.2}, which can be applied in the proof of
Theorem~\ref{t4.3}, states a similar but weaker inequality for the moments
of normalized degenerate $U$-statistics.
%\medskip\noindent
\begin{prop}\label{p5.2}
%{\bf Proposition 5.2.}
{Let us consider a degenerate $U$-statistic
$I_{n,k}(f)$ of order $k$ with sample size $n$ and with a kernel
function $f$ satisfying relations (\ref{4.3}) and (\ref{4.4}) with some
$0<\sigma^2\le1$. Fix a positive number $\alpha>0$. There exist some
universal constants $A=A(k)>\sqrt2$, $C=C(k)>0$ and $M_0=M_0(k)\ge1$
depending only on the order of the $U$-statistic $I_{n,k}(f)$ such that
\begin{eqnarray}
E\left(n^{-k/2}k!I_{n,k}(f)\right)^{2M} \!\!\!\!\!\!\!\! &&\le A
\left(1+C\sqrt\alpha\right)^{2kM}
\left(\frac2e\right)^{kM}\left(kM\right)^{kM}\sigma^{2M}\label{5.3} \\
&&\quad \textrm{for all integers } M \textrm{ such that } kM_0\le kM
\le\alpha n\sigma^2. \nonumber
\end{eqnarray}
The constant $C=C(k)$ in formula (\ref{5.3}) can be chosen e.g.\ as
$C=2\sqrt2$, which does not depend on the order $k$ of the
$U$-statistic $I_{n,k}(f)$.}
\end{prop}
%\medskip
Formula (\ref{5.1}) can be reformulated as
$E(|Z_{\mu,k}(f)|)^{2M}\le E(\sigma\eta^k)^{2M}$, where $\eta$
is a standard normal random variable. Theorem~\ref{t4.1} states that
the tail distribution of $k!|Z_{\mu,k}(f)|$ satisfies an estimate
similar to that of $\sigma|\eta|^k$. This can be deduced relatively
simply from Proposition~\ref{p5.1} and the Markov inequality
$P(k!|Z_{\mu,k}(f)|>u)\le \frac{E(k!|Z_{\mu,k}(f)|)^{2M}}{u^{2M}}$
with an appropriate choice of the parameter~$M$.
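The way the moment bound turns into the tail estimate can be sketched as follows; this is a heuristic computation, ignoring the factor $k!$ and the precise constants. By (\ref{5.2}) and the Markov inequality,

```latex
P\bigl(|Z_{\mu,k}(f)|>u\bigr)
  \le\frac{E\bigl(|Z_{\mu,k}(f)|\bigr)^{2M}}{u^{2M}}
  \le A\left[\frac{2kM}e\left(\frac\sigma u\right)^{2/k}\right]^{kM};
% choosing kM close to \frac12(u/\sigma)^{2/k} makes the bracket equal e^{-1}, so
P\bigl(|Z_{\mu,k}(f)|>u\bigr)
  \lesssim A\exp\left\{-\frac12\left(\frac u\sigma\right)^{2/k}\right\},
```

which is of the form $\exp\{-\alpha(u/\sigma)^{2/k}\}$ appearing in Theorem~\ref{t4.1}.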
Proposition~\ref{p5.2} gives a bound on the moments of $k!n^{-k/2}I_{n,k}(f)$
similar to the estimate (\ref{5.2}) on the moments of $Z_{\mu,k}(f)$.
The difference between them is that estimate~(\ref{5.3}) in
Proposition~\ref{p5.2} contains a factor $\left(1+C\sqrt\alpha\right)^{2kM}$
on its right-hand side, and it holds only for such moments
$E\left(k!n^{-k/2}I_{n,k}(f)\right)^{2M}$ for which $kM_0\le kM\le\alpha
n\sigma^2$ with some constant~$M_0$. The parameter $\alpha>0$ in
relation~(\ref{5.3}) can be chosen in an arbitrary way, but it yields a
really useful estimate only for not too large values. Theorem~\ref{t4.3}
can be proved by means of the estimate in Proposition~\ref{p5.2} and the
Markov inequality. But because of the relatively weak estimate of
Proposition~\ref{p5.2} only the estimate of Theorem~\ref{t4.3} can be
proved for
degenerate $U$-statistics. The main step in the proof of both
Theorems~\ref{t4.1} and~\ref{t4.3} is to get good moment estimates.
An important result of probability theory, the so-called
diagram formula for multiple Wiener--It\^o integrals, can be
applied in the proof of Proposition~\ref{p5.1}.
This result can be found e.g. in~\cite{r17}. It enables us to
rewrite the product of Wiener--It\^o integrals as a sum of
Wiener--It\^o integrals of different order. It got the name
`diagram formula' because the kernel functions of the Wiener--It\^o
integrals appearing in the sum representation of the product of
Wiener--It\^o integrals are defined with the help of certain
diagrams. As the expectation of a Wiener--It\^o integral of
order~$k$ equals zero for all $k\ge1$, the expectation of the
product equals the sum of the constant terms (i.e.\ of the
integrals of order zero) in the diagram formula. The sum of the
constant terms in the diagram formula can be bounded, and such a
calculation leads to the proof of Proposition~\ref{p5.1}.
A version of the diagram formula can be proved both for the product
of multiple random integrals $J_{n,k}(f)$ defined in formula~(\ref{1.2})
(see~\cite{r18}) and for degenerate $U$-statistics (see~\cite{r20}) which
expresses the product of multiple random integrals or degenerate
$U$-statistics as a sum of multiple random integrals or degenerate
$U$-statistics of different order. The main difference between
these new and the original diagram formula about Wiener--It\^o
integrals is that in the case of random (non-Gaussian) integrals
or degenerate $U$-statistics some new diagrams appear, and they
give an additional contribution in the sum representation of the
product of random integrals $J_{n,k}(f)$ or of degenerate
$U$-statistics~$I_{n,k}(f)$.
Proposition~\ref{p5.2} can be proved by means of the diagram
formula for the product of degenerate $U$-statistics and a good
bound on the contribution of all integrals corresponding to the
diagrams. Theorem~\ref{t4.4} can be proved similarly by means of the
diagram formula for the product of multiple random
integrals~$J_{n,k}(f)$ (see~\cite{r18}). The main difficulty of such
an approach arises because the expected value of a $k$-fold random
integral $J_{n,k}(f)$ (unlike that of a Wiener--It\^o integral or
degenerate $U$-statistic) may be non-zero also in the case $k\ge1$.
The expectation of all these integrals is small, but since the
diagram formula contains a large number of such terms, it cannot
supply as sharp an estimate for the moments of the random integrals
$J_{n,k}(f)$ as we have for degenerate $U$-statistics $I_{n,k}(f)$.
On the other hand, Theorem~\ref{t4.4} can be deduced from
Theorems~\ref{t4.3},~\ref{t3.2},
and formulas~(\ref{3.8}) and~(\ref{3.9}).
\medskip\noindent
{\it Remark:}\/ The diagram formula is an important tool both in
probability theory and in statistical physics. The
second chapter of the book~\cite{r25} contains a detailed discussion
of this formula. Paper~\cite{r28} explains the combinatorial
picture behind it, and it contains some interesting generalizations.
Paper~\cite{r31} is interesting for a different reason. It
shows how to prove central limit theorems for stationary processes
in some non-trivial cases by means of the diagram formula. In this
paper it is proved that the moments of the normalized
partial sums have the right limit as the number of terms in them
tends to infinity. Actually, the limit of the semi-invariants is
investigated, but this can be considered as an appropriate
reformulation of the study of the moments. The approach in
paper~\cite{r31} and the proof of the results mentioned in this work
show some similarity, but there is also an essential difference
between them. In paper~\cite{r31} the limit of fixed moments is
investigated, while e.g. in Problem~a$'$) we want to get good
asymptotics for such moments of $U$-statistics $I_{n,k}(f)$ whose
order may depend on the sample size~$n$ of the $U$-statistic. The
reason behind this difference is that we want to get a good estimate
of the probabilities defined in Problem~a$'$) also for large
numbers~$u$, and this gives the problem a large deviation
character.
\medskip
The statement of Example~\ref{e4.2} follows relatively simply from another
important result about multiple Wiener--It\^o integrals, from the
so-called It\^o formula for multiple Wiener--It\^o integrals
(see e.g.~\cite{r14} or~\cite{r17}) which enables us
to express the random integrals considered in Example~\ref{e4.2} as the
Hermite polynomial of an appropriately defined standard normal
random variable.
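The Hermite polynomials with leading coefficient~1 appearing here are the probabilists' polynomials $\mathrm{He}_k$, which are orthogonal with respect to the standard normal density and satisfy $EH_k^2(\eta)=k!$. As an illustrative aside (not part of the original argument), this orthogonality relation can be checked numerically with Gauss quadrature:

```python
import numpy as np
from math import factorial, sqrt, pi
from numpy.polynomial import hermite_e as He  # probabilists' Hermite He_k

# quadrature for the weight e^{-x^2/2}; 40 nodes are exact for all
# polynomials of degree < 80, far beyond the He_k^2 used below
nodes, weights = He.hermegauss(40)
for k in range(1, 6):
    vals = He.hermeval(nodes, [0] * k + [1])        # values of He_k
    second_moment = np.sum(weights * vals**2) / sqrt(2 * pi)
    assert np.isclose(second_moment, factorial(k))  # E He_k(eta)^2 = k!
```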
Here I did not formulate the diagram formula, hence I cannot explain
the details of the proof of Propositions~\ref{p5.1} and~\ref{p5.2}.
Instead, I briefly discuss an analogous but simpler problem,
which may help in capturing the ideas behind the proofs
outlined above.
Let us consider a sequence of independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with expectation
zero, take their sum $S_n=\sum\limits_{j=1}^n\xi_j$, and let us try to
give a good estimate on the moments $ES_n^{2M}$ for all
$M=1,2,\dots$. Because of the independence of the random variables
$\xi_j$ and the condition $E\xi_j=0$ we can write
\begin{equation}
ES_n^{2M}=\sum_{ \scriptstyle{
\begin{array}{c}(j_1,\dots,j_s,l_1,\dots,l_s)\\
j_1+\dots+j_s=2M,\,
j_u\ge 2\textrm{ for all } 1\le u\le s, \\
l_u\neq l_{u'} \textrm{ if }
u\neq u'\end{array}}}
E\xi_{l_1}^{j_1}\cdots E\xi_{l_s}^{j_s}. \label{5.4}
\end{equation}
Simple combinatorial considerations show that a dominating number
of terms on the right-hand side of (\ref{5.4}) are indexed by a vector
$(j_1,\dots,j_M,l_1,\dots,l_M)$ such that $j_u=2$ for all $1\le
u\le M$, and the number of such vectors is equal to ${ n\choose M}
\frac{(2M)!}{2^M}\sim n^M\frac{(2M)!}{2^MM!}$. The last asymptotic
relation holds if the number $n$ of terms in the random sum~$S_n$
is sufficiently large. The above considerations suggest that under
not too restrictive conditions $ES_n^{2M}\sim
\left(n\sigma^2\right)^M\frac{(2M)!}{2^MM!}=E\eta_{n\sigma^2}^{2M}$,
where $\sigma^2=E\xi^2$ is the variance of the terms in the sum $S_n$,
and $\eta_u$ is a random variable with normal distribution with
expectation zero and variance~$u$. The question arises when the above
heuristic argument gives the right estimate.
For the sake of simplicity let us restrict our attention to the
case when the absolute value of the random variables $\xi_j$ is
bounded by~1. Let us observe that even in this case we have to
impose a condition that the variance $\sigma^2$ of the random
variables $\xi_j$ is not too small. Indeed, let us consider such
random variables $\xi_j$, for which
$P(\xi_j=1)=P(\xi_j=-1)=\frac{\sigma^2}2$,
$P(\xi_j=0)=1-\sigma^2$. These random variables $\xi_j$ have
variance $\sigma^2$, and the contribution of the terms $E\xi_j^{2M}$,
$1\le j\le n$, to the sum in (\ref{5.4}) equals $n\sigma^2$. If $\sigma^2$
is very small, then it may occur that $n\sigma^2\gg
\left(n\sigma^2\right)^M \frac{(2M)!}{2^MM!}$, and the approximation
given for $ES_n^{2M}$ in the previous paragraph does not hold any
longer. Let us observe
that for larger moments $ES_n^{2M}$ already a moderately small
variance $\sigma^2$ suffices to violate the asymptotic
relation obtained by this approximation.
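The heuristic of the last two paragraphs is easy to test numerically, since for the $\pm1,0$-valued variables above the distribution of $S_n$ can be computed exactly by repeated convolution. The following sketch (an illustration added here, not part of the original text) compares $ES_n^{2M}$ with the Gaussian moment $(n\sigma^2)^M\frac{(2M)!}{2^MM!}$:

```python
import numpy as np

def pmf_sum(n, sigma2):
    # exact distribution of S_n = xi_1 + ... + xi_n on {-n, ..., n},
    # where P(xi = 1) = P(xi = -1) = sigma2/2 and P(xi = 0) = 1 - sigma2
    step = np.array([sigma2 / 2, 1.0 - sigma2, sigma2 / 2])
    p = np.array([1.0])                 # S_0 = 0 with probability 1
    for _ in range(n):
        p = np.convolve(p, step)
    return p                            # entry i is P(S_n = i - n)

def moment(n, sigma2, order):
    p = pmf_sum(n, sigma2)
    vals = np.arange(-n, n + 1).astype(float)
    return float(np.sum(p * vals**order))

n, sigma2 = 50, 0.2
for M in (1, 2, 3):
    exact = moment(n, sigma2, 2 * M)
    gauss = (n * sigma2)**M * np.prod(np.arange(1, 2 * M, 2))  # (2M-1)!!
    print(M, exact / gauss)
```

For fixed $n$ the ratio moves away from~1 faster when $\sigma^2$ is chosen smaller, in accordance with the condition $kM\le\alpha n\sigma^2$ of Proposition~\ref{p5.2}.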
A similar picture arises in Proposition~\ref{p5.2}. If the variance of
the random variable $I_{n,k}(f)$ is not too small, then those
terms give the essential contribution to the moments of
$I_{n,k}(f)$ which correspond to such diagrams which appear also
in the diagram formula for Wiener--It\^o integrals. The higher
the moment we estimate, the stronger the condition we have to impose
on the variance of $I_{n,k}(f)$ to preserve this property and to
get a good bound on the moment we consider.
In the next section problems b), b$'$) and b$''$) will be discussed,
where the distribution of the supremum of
multiple random integrals $J_{n,k}(f)$, degenerate $U$-statistics
$I_{n,k}(f)$ and multiple Wiener--It\^o integrals $Z_{\mu,k}(f)$ will
be estimated for an appropriate class of functions~$f\in{\cal F}$.
Under some appropriate conditions for the class of functions~${\cal F}$
a similar estimate can be proved in these problems as in their
natural counterpart when only one function is taken. The only
difference is that worse universal constants may appear in the new
estimates. The conditions we had to impose in the results about
problems~a) and~a$'$) appear in their counterparts problems~b)
and~b$'$) in a natural way. But these conditions also have some hidden,
more surprising consequences in the study of the new problems.
\section{On the supremum of random integrals and $U$-statistics}\label{s6}
To formulate the results of this section I first introduce some
notions which appear in their formulation. These properties
express that a class of functions has relatively small and, in
some sense, dense finite subsets.
First I introduce the following definition.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions with respect
to some measure.} {\it Let us have a measurable space
$(Y,{\cal Y})$ together with a $\sigma$-finite measure $\nu$ and a set
${\cal G}$ of ${\cal Y}$ measurable real valued functions on this space.
For all $1\le p<\infty$, we say that ${\cal G}$ is an $L_p$-dense
class with respect to $\nu$ and with parameter $D$ and exponent
$L$ if for all numbers $1\ge\varepsilon>0$ there exists a finite
$\varepsilon$-dense subset ${\cal G}_{\varepsilon}=\{g_1,\dots,g_m\}
\subset {\cal G}$ in the space
$L_p(Y,{\cal Y},\nu)$ consisting of $m\le D\varepsilon^{-L}$ elements,
i.e.\ there exists a set ${\cal G}_{\varepsilon}\subset {\cal G}$ with
$m\le D\varepsilon^{-L}$ elements such that
$\inf\limits_{g_j\in {\cal G}_\varepsilon}\int |g-g_j|^p\,d\nu
<\varepsilon^p$ for all functions~$g\in {\cal G}$.}
\medskip
The following notion will also be needed.
\medskip\noindent
{\bf Definition of $L_p$-dense classes of functions.} {\it Let us
have a measurable space $(Y,{\cal Y})$ and a set ${\cal G}$ of
${\cal Y}$ measurable real valued functions on this space. We call
${\cal G}$ an $L_p$-dense class of functions, $1\le p<\infty$, with
parameter $D$ and exponent $L$ if it is $L_p$-dense with parameter $D$
and exponent $L$ with respect to all probability measures $\nu$ on
$(Y,{\cal Y})$.}
\medskip
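To make the first of these definitions concrete, here is a small numerical illustration (added as an aside; the specific class is chosen only for simplicity): indicator functions of subintervals $[a,b]\subset[0,1]$ form an $L_2$-dense class with respect to the Lebesgue measure, because rounding the endpoints to a grid of mesh at most $\varepsilon^2$ changes the indicator by at most $\varepsilon^2$ in the measure of the symmetric difference, i.e.\ by at most $\varepsilon$ in the $L_2$-norm, while the grid yields only polynomially many (in $\varepsilon^{-1}$) intervals.

```python
import numpy as np

def symm_diff(i1, i2):
    # Lebesgue measure of the symmetric difference of two intervals;
    # note ||1_A - 1_B||_2^2 = lambda(A symmdiff B)
    (a, b), (c, d) = i1, i2
    inter = max(0.0, min(b, d) - max(a, c))
    return (b - a) + (d - c) - 2.0 * inter

def eps_net(eps):
    # intervals with grid endpoints of mesh <= eps^2: only polynomially
    # many elements, so the class is L_2-dense with exponent L = 4
    grid = np.linspace(0.0, 1.0, int(np.ceil(1.0 / eps**2)) + 1)
    return [(a, b) for a in grid for b in grid if a < b]

eps = 0.3
net = eps_net(eps)
rng = np.random.default_rng(2)
for _ in range(100):
    a, b = np.sort(rng.uniform(0.0, 1.0, 2))
    err = min(symm_diff((a, b), g) for g in net)   # squared L_2 distance
    assert err < eps**2
```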
The properties introduced above can be considered as possible
versions of the so-called $\varepsilon$-entropy frequently applied in
the literature. Nevertheless, no unanimously accepted version
of this notion seems to exist. Generally the above
definitions will be applied with the choice $p=2$, but because
of some arguments in this paper it was more natural to
introduce them in a more general form. The first result I present
can be considered as a solution of problem~b$''$).
%\medskip\noindent
\begin{thm}\label{t6.1}
%{\bf Theorem 6.1.}
{Let us consider a measurable space
$(X,{\cal X})$ together with a $\sigma$-finite non-atomic
measure~$\mu$ on it, and let $\mu_W$ be a white noise with reference
measure $\mu$ on $(X,{\cal X})$. Let ${\cal F}$ be a countable and
$L_2$-dense class of functions $f(x_1,\dots,x_k)$ on $(X^k,{\cal X}^k)$
with some parameter $D$ and exponent $L$ with respect to the product
measure $\mu^k$ such that
$$
\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots \mu(\,dx_k)\le\sigma^2
\quad \textrm{\rm with some } 0<\sigma\le1 \textrm { \rm for all }
f\in {\cal F}.
$$
Let us consider the multiple Wiener integrals $Z_{\mu,k}(f)$
introduced in formula (\ref{1.5}) for all~$f\in{\cal F}$. The inequality
\begin{equation}
P\left(\sup_{f\in {\cal F}}|Z_{\mu,k}(f)|>u\right)\le C(D+1)
\exp\left\{-\alpha\left(\frac u\sigma\right)^{2/k}\right\}
\quad \textrm{if }\left(\frac u\sigma\right)^{2/k} \!\! \ge ML
\log\frac2\sigma \label{6.1}
\end{equation}
holds with some universal constants $C=C(k)>0$, $M=M(k)>0$ and
$\alpha=\alpha(k)>0$.}
\end{thm}
The next two results can be considered as a solution of problems~b)
and~b$'$).
%\medskip\noindent
\begin{thm}\label{t6.2}
%{\bf Theorem 6.2.}
{Let a probability measure $\mu$ be given
on a measurable space $(X,{\cal X})$ together with a countable
and $L_2$-dense class ${\cal F}$ of functions $f(x_1,\dots,x_k)$ of
$k$ variables with some parameter $D$ and exponent $L$, $L\ge1$, on
the product space $(X^k,{\cal X}^k)$ which satisfies the conditions
\begin{equation}
\|f\|_\infty=\sup_{x_j\in X,\;1\le j\le k}|f(x_1,\dots,x_k)|\le 1,
\qquad \textrm{for all } f\in {\cal F} \label{6.2}
\end{equation}
and
\begin{eqnarray}
\|f\|_2^2=Ef^2(\xi_1,\dots,\xi_k)
\!\!\!\!\!\!\!\! &&=\int f^2(x_1,\dots,x_k)
\mu(\,dx_1)\dots\mu(\,dx_k)\le \sigma^2 \label{6.3} \\
&&\qquad\qquad \textrm{for all } f\in {\cal F} \nonumber
\end{eqnarray}
with some constant $0<\sigma\le1$. Then there exist some constants
$C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$ depending only on
the parameter $k$ such that the supremum of the random integrals
$J_{n,k}(f)$, $f\in {\cal F}$, defined by formula (\ref{1.2}) satisfies
the inequality
\begin{eqnarray}
P\left(\sup\limits_{f\in{\cal F}}|J_{n,k}(f)|\ge u\right)
\!\!\!\!\!\!\!\! &&\le CD
\exp\left\{-\alpha \left(\frac u{\sigma}\right)^{2/k}\right\}\label{6.4} \\
&&\qquad \textrm{if}\quad n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k} \ge M(L+\beta)^{3/2}\log\frac2\sigma,
\nonumber
\end{eqnarray}
where $\beta=\max\left(\frac{\log D}{\log n},0\right)$ and the numbers
$D$ and $L$ agree with the parameter and exponent of the $L_2$-dense
class~${\cal F}$.}
\end{thm}
\begin{thm}\label{t6.3}
%{\bf Theorem 6.3.}
{Let a probability measure $\mu$ be given on
a measurable space $(X,{\cal X})$ together with a countable and
$L_2$-dense class ${\cal F}$ of functions $f(x_1,\dots,x_k)$ of $k$
variables with some parameter $D$ and exponent $L$, $L\ge1$, on the
product space $(X^k,{\cal X}^k)$ which satisfies conditions (\ref{6.2}) and
(\ref{6.3}) with some constant $0<\sigma\le1$. Beside these conditions
let us also assume that the $U$-statistics $I_{n,k}(f)$ defined with the
help of a sequence of independent $\mu$ distributed random
variables $\xi_1,\dots,\xi_n$ are degenerate for all $f\in{\cal F}$,
or in an equivalent form, all functions $f\in {\cal F}$ are canonical
with respect to the measure~$\mu$. Then there exist some constants
$C=C(k)>0$, $\alpha=\alpha(k)>0$ and $M=M(k)>0$ depending only on
the parameter $k$ such that the inequality
\begin{eqnarray}
P\left(\sup\limits_{f\in{\cal F}}n^{-k/2}|I_{n,k}(f)|\ge u\right)
\!\!\!\!\!\!\!\! &&\le CD
\exp\left\{-\alpha \left(\frac u{\sigma}\right)^{2/k}\right\}
\label{6.5} \\
&&\qquad \textrm{if}\quad n\sigma^2\ge
\left(\frac u\sigma\right)^{2/k} \ge M(L+\beta)^{3/2}\log\frac2\sigma,
\nonumber
\end{eqnarray}
holds, where $\beta=\max\left(\frac{\log D}{\log n},0\right)$ and the
number $D$ and $L$ agree with the parameter and exponent of the
$L_2$-dense class~${\cal F}$.}
\end{thm}
%\medskip
The above theorems, whose proofs can be found in~\cite{r19}
or, in a more detailed version, in~\cite{r22}, say that
under some conditions on the class of
functions ${\cal F}$ an almost as good estimate holds for
problems~b),~b$'$) and~b$''$) as for the analogous
problems~a),~a$'$) and~a$''$), where similar problems were
investigated, but only one function~$f$ was considered. An essential
restriction in the results of Theorems~\ref{t6.1},~\ref{t6.2}
and~\ref{t6.3} is that the
condition $\left(\frac u\sigma\right)^{2/k}\ge M(L,D,k)\log\frac2\sigma$
is imposed in them with some constant $M(L,D,k)$ depending on the
exponent~$L$ and parameter~$D$ of the $L_2$-dense class ${\cal F}$. In
Theorem~\ref{t6.1} $M(L,D,k)=ML$ was chosen, in Theorems~\ref{t6.2}
and~\ref{t6.3}
$M(L,D,k)=M(L+\beta)^{3/2}$ with an appropriate
universal constant $M=M(k)$ and $\beta=\max\left(0,\frac{\log D}{\log
n}\right)$. We are not so much interested in a good choice
of the quantity $M(L,D,k)$ in these results. Actually, it could
have been chosen in a better way. We would rather like to understand
why such conditions have to be imposed in these results.
I shall also discuss some other questions related to the above
theorems. Besides the role of the lower bound on $\left(\frac
u\sigma\right)^{2/k}$ one would also like to understand why we have
imposed the condition of $L_2$-dense property for the class of
functions~${\cal F}$ in Theorems~\ref{t6.2} and~\ref{t6.3}. This is a stronger
restriction than the condition about the $L_2$-dense property of
the class ${\cal F}$ with respect to the measure~$\mu^k$ imposed in
Theorem~\ref{t6.1}. It may be a bit mysterious why
Theorems~\ref{t6.2} and~\ref{t6.3} need a condition by which this
class of functions is $L_2(\nu)$-dense also with respect to
probability measures~$\nu$ which seem to have no relation to our
problems. I can
give only a partial answer to this question. In the next section I
present a very brief sketch of the proofs which shows that in the
proof of Theorems~\ref{t6.2} and~\ref{t6.3} the $L_2$-dense property
of the class
of functions ${\cal F}$ is applied in the form in which it was imposed. I
shall discuss another question which also naturally arises in this
context. One would like to know some results which enable us to
check the $L_2$-dense property and show that it holds in many
interesting cases.
I shall discuss still another problem related to the above results.
One would like to weaken the condition by which the classes of
functions ${\cal F}$ must be countable. Let me recall that in the
Introduction I mentioned that our results can be applied in the
study of some non-parametric maximum likelihood problems. In these
applications such cases may occur where we have to work with the
supremum of an uncountable family of random integrals. I shall
discuss this question separately at the end of this section.
I present an example which shows that the condition $\left(\frac
u\sigma\right)^{2/k}\ge M(L,D,k)\log\frac2\sigma$ with some appropriate
constant $M(L,D,k)>0$ cannot be omitted from Theorem~\ref{t6.1}. In this
example $([0,1],{\cal B})$, i.e. the interval $[0,1]$
together with the Borel $\sigma$-algebra is taken as the measurable
space $(X,{\cal X})$, and the Lebesgue measure $\lambda$ is considered
on $[0,1]$ together with the usual white noise $\lambda_W$ with the
Lebesgue measure as its reference measure. Fix some number $\sigma>0$,
and define the class of functions of $k$ variables
${\cal F}={\cal F}_\sigma$ on $([0,1]^k,{\cal B}^k)$ as the indicator
functions of the $k$-dimensional rectangles
$\prod\limits_{j=1}^k[a_j,b_j]\subset [0,1]^k$ such that all numbers
$a_j$ and $b_j$, $1\le j\le k$, are rational, and the volume of these
rectangles satisfies the condition $\prod\limits_{j=1}^k(b_j-a_j)
\le\sigma^2$. It can be seen that this countable class of functions
${\cal F}$ is
$L_2$-dense with respect to the measure $\lambda$ (moreover, it is
$L_2$-dense in the general sense), hence Theorem~\ref{t6.1} can be applied
to the supremum of the Wiener--It\^o integrals $Z_{\lambda,k}(f)$
with the above class of functions $f\in{\cal F}$.
Let the above chosen number $\sigma>0$ be sufficiently small and
such that $\sigma^{2/k}$ is a rational number. Let us define
$N=[\sigma^{-2/k}]$ functions $f_j\in{\cal F}$ (where $[x]$ denotes
the integer part of the number $x$) in the following way: The function
$f_j$ is the indicator function of the $k$-dimensional cube we get by
taking the $k$-fold direct product of the interval
$[(j-1)\sigma^{2/k},j\sigma^{2/k}]$ with itself, $1\le j\le N$.
Then all functions $f_j$ are elements of the above defined class of
functions ${\cal F}={\cal F}_\sigma$, and the Wiener--It\^o integrals
$Z_{\lambda,k}(f_j)$, $1\le j\le N$, are independent random variables.
Hence
\begin{equation}
P\left(\sup_{f\in{\cal F}}|Z_{\lambda,k}(f)|>u\right)\ge
P\left(\sup_{1\le j\le N}|Z_{\lambda,k}(f_j)|>u\right)
=1-P(|Z_{\lambda,k}(f_1)|\le u)^N \label{6.6}
\end{equation}
for all numbers $u>0$. I will show with the help of relation~(\ref{6.6})
that for a small $\sigma>0$ and such a number $u$ for which
$\left(\frac u\sigma\right)^{2/k}=a\log\frac2\sigma$ with some
$a<\frac4k$ the probability $P\left(\sup\limits_{f\in{\cal F}}
|Z_{\lambda,k}(f)|>u\right)$ is very close to~1.
By the It\^o formula for multiple Wiener--It\^o integrals
(see e.g.~\cite{r14}) the identity
$Z_{\lambda,k}(f_j)=\sigma H_k(\eta_j)$ holds, where $H_k(\cdot)$
is the $k$-th Hermite polynomial with leading coefficient 1, and
$\eta_j=\sigma^{-1/k}\int_{(j-1)\sigma^{2/k}}^{j\sigma^{2/k}}d\lambda_W$,
which is a standard normal random variable. With the help of this
relation it can be shown that for all $0<\gamma<1$ there exists some
$\sigma_0=\sigma_0(\gamma)$ such that $P(|Z_{\lambda,k}(f_1)|\le u)
\le1-e^{-\gamma(u/\sigma)^{2/k}/2}=1-\left(\frac\sigma2\right)
^{\gamma a/2}$ if $0<\sigma<\sigma_0$. Hence relation~(\ref{6.6}) and the
inequality $N\ge \sigma^{-2/k}-1$ imply that
$P\left(\sup\limits_{f\in{\cal F}} |Z_{\lambda,k}(f)|>u\right)\ge
1-\left(1-\left(\frac\sigma2\right)^{\gamma a/2}\right)
^{\sigma^{-2/k}-1}$. By choosing $\gamma$ sufficiently close to~1 it
can be shown with the help of the above relation that with a
sufficiently small $\sigma>0$ and the above choice of the number~$u$
the probability
$P\left(\sup\limits_{f\in{\cal F}}|Z_{\lambda,k}(f)|>u\right)$ is
almost~1.
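The threshold $a<\frac4k$ in this construction can be guessed from the following rough computation (constants and the factor $\gamma$ are ignored): since $\sigma|H_k(\eta_1)|>u$ essentially requires $|\eta_1|>(u/\sigma)^{1/k}$,

```latex
P\bigl(|Z_{\lambda,k}(f_1)|>u\bigr)
  \approx\exp\left\{-\frac12\left(\frac u\sigma\right)^{2/k}\right\}
  =\left(\frac\sigma2\right)^{a/2},
\qquad
1-\Bigl(1-\bigl(\tfrac\sigma2\bigr)^{a/2}\Bigr)^N
  \approx1-\exp\Bigl\{-N\bigl(\tfrac\sigma2\bigr)^{a/2}\Bigr\},
```

and since $N\approx\sigma^{-2/k}$, the exponent $N(\sigma/2)^{a/2}\asymp\sigma^{a/2-2/k}$ tends to infinity as $\sigma\to0$ precisely when $a<\frac4k$, which makes the probability in (\ref{6.6}) close to~1.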
The above calculation shows that a condition of the type
$$
\left(\frac
u\sigma\right)^{2/k}\ge M(L,D,k)\log\frac2\sigma
$$
is really needed in
Theorem~\ref{t6.1}. With some extra work a similar example can be constructed
in the case of Theorem~\ref{t6.2}. In this example the same
space $(X,{\cal X})$ and the same class of functions
${\cal F}={\cal F}_\sigma$ can be chosen, only the white noise has to
be replaced, for instance, by a sequence of independent random variables
$\xi_1,\dots,\xi_n$ with uniform distribution on the unit interval
and with a sufficiently large sample size~$n$. (The lower bound on
the sample size should depend also on~$\sigma$.) Also in the case of
Theorem~\ref{t6.3} a similar example can be constructed. I omit the details.
\medskip
The theory of Vapnik--\v{C}ervonenkis classes is a fairly popular
and important subject in probability theory. I shall show
that this theory is also useful in the study of our problems. It
provides a useful sufficient condition for the $L_2$-dense property
of a class of functions, a property which played an important role
in Theorems~\ref{t6.2} and~\ref{t6.3}. To formulate the result of
interest to us, I first recall the notion of Vapnik--\v{C}ervonenkis
classes.
\medskip\noindent
{\bf Definition of Vapnik--\v{C}ervonenkis classes of sets and
functions.} {\it Let a set $S$ be given, and let us select a class
${\cal D}$ consisting of certain subsets of this set~$S$. We call
${\cal D}$ a Vapnik--\v{C}ervonenkis class if there exist two real
numbers $B$ and $K$ such that for all positive integers~$n$ and
subsets $S_0(n)=\{x_1,\dots,x_n\}\subset S$ of cardinality~$n$
of the set $S$ the collection of sets of the form $S_0(n)\cap D$,
$D\in{\cal D}$, contains no more than $Bn^K$ subsets of~$S_0(n)$.
We shall call $B$ the parameter and $K$ the exponent of this
Vapnik--\v{C}ervonenkis class.
A class of real valued functions ${\cal F}$ on a space $(Y,{\cal Y})$
is called a Vapnik--\v{C}ervonenkis class if the collection of
graphs of these functions is a Vapnik--\v{C}ervonenkis class,
i.e.\ if the sets $A(f)=\{(y,t)\colon y\in Y,\;\min(0,f(y))\le t\le
\max(0,f(y))\}$, $f\in {\cal F}$, constitute a
Vapnik--\v{C}er\-vo\-nen\-kis class of subsets of the product
space $S=Y\times R^1$.}
\medskip
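As a toy illustration of this definition (an aside not taken from the text): the half-lines $(-\infty,t]$ on the real line form a Vapnik--\v{C}ervonenkis class, since they cut out at most $n+1$ subsets from any $n$-point set, so one may take e.g.\ $B=2$ and $K=1$. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 25
points = np.sort(rng.standard_normal(n))   # n distinct real points

# thresholds below, between and above the points realize every possible
# trace of the class {(-inf, t]} on the point set
thresholds = np.concatenate(
    ([points[0] - 1.0], (points[:-1] + points[1:]) / 2.0, [points[-1] + 1.0]))
traces = {tuple(points <= t) for t in thresholds}
assert len(traces) == n + 1                # n + 1 subsets, far below 2^n
```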
The theory of Vapnik--\v{C}ervonenkis classes has generated a huge
literature. Many sufficient conditions have been stated which ensure
that certain classes of sets or functions are
Vapnik--\v{C}ervonenkis classes. Here I do not discuss them.
I only present an important result of Richard Dudley, which
states that a Vapnik--\v{C}ervonenkis class of functions bounded
by~1 is an $L_1$-dense class of functions.
%\medskip\noindent
\begin{thm}\label{t6.4}
%{\bf Theorem 6.4.}
{Let $f(y)$, $f\in {\cal F}$, be a
Vapnik--\v{C}ervonenkis class of real valued functions on some
measurable space $(Y,{\cal Y})$ such that
$\sup\limits_{y\in Y}|f(y)|\le1$
for all $f\in{\cal F}$. Then ${\cal F}$ is an $L_1$-dense class of
functions on $(Y,{\cal Y})$. More explicitly, if ${\cal F}$ is a
Vapnik--\v{C}ervonenkis class with parameter $B\ge1$ and exponent
$K>0$, then it is an $L_1$-dense class with exponent $L=2K$ and
parameter $D=CB^2 (4K)^{2K}$ with some universal constant $C>0$.}
\end{thm}
The proof of this result can be found in \cite{r28} (25~Approximation
Lemma) or in my Lecture Note~\cite{r22}. Formally, Theorem~\ref{t6.4}
gives a sufficient condition for a class of functions to be an
$L_1$-dense class. But it is fairly simple to show that a class of
functions satisfying the conditions of Theorem~\ref{t6.4} is not only an
$L_1$, but also an $L_2$-dense class. Indeed, an $L_1$-dense class
of functions whose absolute values are bounded by~1 in the supremum
norm is also an $L_2$-dense class, only with a possibly different
exponent and parameter. I finish this section by discussing the
problem of how to replace the condition of countable cardinality of
the class of functions in Theorems~\ref{t6.2} and~\ref{t6.3} by a useful weaker
condition.
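The passage from the $L_1$-dense to the $L_2$-dense property mentioned above can be made explicit by the following routine computation (spelled out here for convenience; it is not reproduced in the paper). If all functions in ${\cal F}$ are bounded by~1, then $|f-g|\le2$, hence for every probability measure~$\nu$

```latex
\int (f-g)^2\,d\nu \le 2\int |f-g|\,d\nu .
```

Thus an $\frac{\varepsilon^2}2$-net of ${\cal F}$ in the $L_1(\nu)$-norm is an $\varepsilon$-net in the $L_2(\nu)$-norm. If ${\cal F}$ is $L_1$-dense with exponent $L$ and parameter $D$, such a net of cardinality at most $D(\varepsilon^2/2)^{-L}=2^LD\varepsilon^{-2L}$ exists, so ${\cal F}$ is $L_2$-dense with exponent $2L$ and parameter $2^LD$.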
\subsection {On the supremum of non-countable classes of random
integrals and $U$-statistics}\label{s6.1}
First I introduce the following notion.
\medskip\noindent
{\bf Definition of countably approximable classes of random
variables.} {\it Let a class of random variables $U(f)$,
$f\in {\cal F}$, indexed by a class of functions on a measurable space
$(Y,{\cal Y})$ be given. We say that this class of random variables
$U(f)$, $f\in{\cal F}$, is countably approximable if there is a
countable subset ${\cal F}'\subset {\cal F}$ such that for all numbers
$u>0$ the sets
$A(u)=\{\omega\colon\sup\limits_{f\in{\cal F}}|U(f)(\omega)|\ge u\}$ and
$B(u)=\{\omega\colon\sup\limits_{f\in{\cal F}'} |U(f)(\omega)|\ge u\}$
satisfy the identity $P(A(u)\setminus B(u))=0$.}
\medskip
It is fairly simple to see that in Theorems~\ref{t6.1},~\ref{t6.2}
and~\ref{t6.3} the
condition about the countable cardinality of the class of
functions~${\cal F}$ can be replaced by the weaker condition that
the class of random variables $Z_{\mu,k}(f)$, $J_{n,k}(f)$ or
$I_{n,k}(f)$, $f\in{\cal F}$, is a countably approximable class of
random variables. One would like to have some results that enable us to
check this property. The following simple lemma~(see Lemma~4.3
in~\cite{r22}) may be useful for this.
%\medskip\noindent
\begin{lem}\label{l6.5}
%{\bf Lemma 6.5.}
{Let a class of random variables $U(f)$,
$f\in{\cal F}$, indexed by some set ${\cal F}$ of functions on a
space $(Y,{\cal Y})$ be given. If there exists a countable subset
${\cal F}'\subset {\cal F}$ of the set ${\cal F}$ such that the sets
$A(u)=\{\omega\colon\sup\limits_{f\in {\cal F}}|U(f)(\omega)|\ge u\}$
and $B(u)=\{\omega\colon\sup\limits_{f\in {\cal F}'} |U(f)(\omega)|
\ge u\}$ introduced for all $u>0$ in the definition of countable
approximability satisfy the relation $A(u)\subset B(u-\varepsilon)$
for all $u>\varepsilon>0$, then the class of random variables $U(f)$,
$f\in{\cal F}$, is countably approximable.
The above property holds if for all $f\in{\cal F}$, $\varepsilon>0$
and $\omega\in\Omega$ there exists a function
$\bar f=\bar f(f,\varepsilon,\omega) \in{\cal F}'$ such that
$|U(\bar f)(\omega)|\ge|U(f)(\omega)|-\varepsilon$.}
\end{lem}
Thus to prove the countable approximability property of a class of
random variables $U(f)$, $f\in{\cal F}$, it is enough to check the
condition formulated in the second paragraph of Lemma~\ref{l6.5}. I present
an example in which this condition can be checked. This example is
particularly interesting, since in the study of non-parametric
maximum likelihood problems such examples have to be considered.
Let us fix a function $f(x_1,\dots,x_k)$,
$\sup|f(x_1,\dots,x_k)|\le1$, on the space $(X^k,{\cal X}^k)
=(R^{ks},{\cal B}^{ks})$ with some $s\ge1$, where ${\cal B}^t$
denotes the Borel $\sigma$-algebra on the Euclidean space $R^t$,
together with some probability measure $\mu$ on $(R^s,{\cal B}^s)$.
For all vectors $(u_1,\dots,u_k)$, $(v_1,\dots,v_k)$ such that
$u_j,v_j\in R^s$ and $u_j\le v_j$, $1\le j\le k$, (i.e. all
coordinates of $u_j$ are smaller than or equal to the
corresponding coordinate of $v_j$) let us define the function
$f_{u_1,\dots,u_k,v_1,\dots,v_k}$ which equals the
function~$f$ on the rectangle $[u_1,v_1]\times\cdots\times[u_k,v_k]$
and is zero outside of this rectangle.
Let us consider a sequence of i.i.d. random variables
$\xi_1,\dots,\xi_n$ taking values in the space $(R^s,{\cal B}^s)$
with some distribution $\mu$, and define the empirical measure
$\mu_n$ and random integrals
$J_{n,k}(f_{u_1,\dots,u_k,v_1,\dots,v_k})$ by formulas~(\ref{1.1})
and~(\ref{1.2}) for all vectors $(u_1,\dots,u_k)$,
$(v_1,\dots,v_k)$, $u_j\le v_j$ for all $1\le j\le k$, with the
above defined functions $f_{u_1,\dots,u_k,v_1,\dots,v_k}$. The
following result holds (see Lemma~4.4 in~\cite{r22}).
%\medskip\noindent
\begin{lem}\label{l6.6}
%{\bf Lemma 6.6.}
{Let us take $n$ independent and identically
distributed random variables $\xi_1,\dots,\xi_n$ with values in the
space $(R^s,{\cal B}^s)$. Let us define with the help of their
distribution $\mu$ and the empirical distribution $\mu_n$ determined
by them the class of random variables
$J_{n,k}(f_{u_1,\dots,u_k,v_1,\dots,v_k})$ introduced in
formula~(\ref{1.2}), where the class of kernel functions ${\cal F}$
in these integrals consists of all functions
$f_{u_1,\dots,u_k,v_1,\dots,v_k}$ on $(R^{sk},{\cal B}^{sk})$,
$u_j,v_j\in R^s$, $u_j\le v_j$, $1\le j\le k$, introduced in the last
but one paragraph. This class of random variables $J_{n,k}(f)$,
$f\in{\cal F}$, is countably approximable.}
\end{lem}
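In the case $k=s=1$ the condition of Lemma~\ref{l6.5} can be checked by a completely elementary argument: given an interval $[u,v]$ and a realization of the sample, the endpoints can be moved to nearby rational values, past no sample point, without changing which sample points lie in the interval, hence without changing the value of the statistic at all (even the $\varepsilon$-loss allowed by Lemma~\ref{l6.5} is not needed). The sketch below (an illustration under these toy assumptions, with ad hoc names; it is not the paper's proof) verifies this numerically.

```python
import math
import random
from fractions import Fraction

random.seed(1)
xi = sorted(random.random() for _ in range(10))   # the sample; k = 1, s = 1

def stat(u, v, f=lambda x: x - 0.5):
    # the statistic for the kernel f restricted to the interval [u, v]
    return sum(f(x) for x in xi if u <= x <= v)

def rational_in(lo, hi):
    """Some dyadic rational r with lo < r <= hi (exists whenever lo < hi)."""
    q = 1
    while True:
        k = math.floor(hi * q)
        if k / q > lo:
            return Fraction(k, q)
        q *= 2

# moving the endpoints to nearby rationals, past no sample point,
# leaves the value of the statistic unchanged
for _ in range(100):
    u, v = sorted((random.random(), random.random()))
    below = max([x for x in xi if x < u], default=u - 1.0)
    above = min([x for x in xi if x > v], default=v + 1.0)
    ub = rational_in(below, u)                 # below < ub <= u
    vb = rational_in(v, (v + above) / 2)       # v < vb < above
    assert stat(float(ub), float(vb)) == stat(u, v)
```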
Let me also remark that the class of functions
$f_{u_1,\dots,u_k,v_1,\dots,v_k}$ is also an $L_2$-dense class of
functions; in fact, it is even a Vapnik--\v{C}ervonenkis class of
functions. As a consequence, Theorem~\ref{t6.2} can be applied to this
class of functions.
To clarify the background of the above results I
make the following remark. The class of random variables
$Z_{\mu,k}(f)$, $J_{n,k}(f)$ or $I_{n,k}(f)$, $f\in{\cal F}$, can be
considered as a stochastic process indexed by the functions
$f\in{\cal F}$, and we estimate the supremum of this
stochastic process. In the study of a stochastic process with a
large parameter set one introduces some smoothness-type property of
the trajectories that the process can be shown to satisfy. Here we
followed a very similar approach. The condition formulated in the
second paragraph of Lemma~\ref{l6.5} can be considered as the
smoothness-type property needed in our problem.
In the study of a general stochastic process one has to make
special efforts to find its right version with sufficiently smooth
trajectories. In the case of the random processes $J_{n,k}(f)$ or
$I_{n,k}(f)$, $f\in {\cal F}$, this right version can be constructed
in a natural, simple way. A finite sequence of random variables
$\xi_1(\omega),\dots,\xi_n(\omega)$ is given at the start, and
the random integrals $J_{n,k}(f)(\omega)$ or $U$-statistics
$I_{n,k}(f)(\omega)$, $f\in{\cal F}$, can be constructed separately
for all $\omega\in\Omega$ on the probability space
$(\Omega,{\cal A},P)$ where the random variables
$\xi_1(\omega),\dots,\xi_n(\omega)$ are living. It has to be
checked whether the `trajectories' of this random process have the
`smoothness properties' necessary for us. The case of a class of
Wiener--It\^o integrals $Z_{\mu,k}(f)$, $f\in{\cal F}$, is different.
Wiener--It\^o integrals are defined with the help of some
$L_2$-limit procedure. Hence each random integral $Z_{\mu,k}(f)$ is
defined only with probability~1, and in the case of a non-countable
set of functions ${\cal F}$ the right version $Z_{\mu,k}(f)$,
$f\in{\cal F}$, of the Wiener--It\^o integrals has to be found to get
a countably approximable class of random variables.
R.~M. Dudley (see e.g.~\cite{r5}) worked out a rather deep
theory to overcome the measurability difficulties appearing in the
case of a non-countable set of random variables by working with
analytic sets, Suslin property, outer probability, and so on. I must
admit that I do not know the precise relation between this theory
and our method. At any rate, in the problems discussed here our
elementary approach seems to be satisfactory.
\medskip
In the next two sections I discuss the idea of the proof of
Theorems~\ref{t6.1}, \ref{t6.2} and~\ref{t6.3}. A simple and natural
approach, the
so-called chaining argument suffices to prove Theorem~\ref{t6.1}. In the
case of Theorems~\ref{t6.2} and~\ref{t6.3} this chaining argument can only help
to reduce the proof to a slightly weaker statement, and we apply an
essentially different method based on some randomization arguments
to complete the proof. Since in the multivariate case $k\ge2$ some
essential additional difficulties appear, it seemed to be more
natural to discuss it in a separate section.
\section {The method of proof of Theorems \ref{t6.1}, \ref{t6.2} and
\ref{t6.3}}\label{s7}
There is a simple but useful method, called the chaining argument,
which helps to prove Theorem~\ref{t6.1}. It suggests taking
an appropriate increasing sequence ${\cal F}_j$, $j=0,1,\dots$, of
$L_2$-dense subsets of the class of functions ${\cal F}$ and to
estimate the supremum of the Wiener--It\^o integrals
$Z_{\mu,k}(f)$, $f\in{\cal F}_j$, for all $j=0,1,\dots$.
In the application of this method first we define a sequence
of subclasses ${\cal G}_j$ of ${\cal F}$, $j=0,1,2,\dots$, such that
${\cal G}_j=\{g_{j,1},\dots,g_{j,m_j}\}\subset{\cal F}$ is a
$2^{-jk}\sigma$-dense subset of ${\cal F}$ in the $L_2(\mu^k)$-norm,
i.e.\ they satisfy the relation
\begin{eqnarray}
&&\hskip-1truecm \inf\limits_{1\le l\le m_j} \rho(f,g_{j,l})^2 \nonumber\\
&&\hskip-1truecm\qquad=
\inf\limits_{1\le l\le m_j} \int
(f(x_1,\dots,x_k)-g_{j,l}(x_1,\dots,x_k))^2\mu(\,dx_1)\dots\mu(\,dx_k)
\nonumber \\
&&\hskip-1truecm \qquad \le 2^{-2jk}\sigma^2
\label{7.1}
\end{eqnarray}
for all $f\in{\cal F}$, and also the inequality $m_j\le D2^{jkL}
\sigma^{-L}$ holds. Such sets ${\cal G}_j$ exist because of the
conditions of Theorem~\ref{t6.1}. Let us also define the classes of
functions ${\cal F}_j=\bigcup\limits_{p=0}^j {\cal G}_p$, and sets
$$
B_j=B_j(u)=\left\{\omega\colon\sup\limits_{f\in {\cal F}_j}
|Z_{\mu,k}(f)(\omega)|\ge u\left(1-2^{-jk/2}\right)\right\},\quad
j=0,1,2,\dots.
$$
Given a function $f_{j+1,l}\in{\cal G}_{j+1}$ let us
choose a function $f_{j,l'}\in{\cal F}_j$, with some $l'=l'(l)$, for
which $\rho(f_{j,l'},f_{j+1,l})\le 2^{-jk}\sigma$, where $\rho(f,g)$
denotes the distance defined in formula (\ref{7.1}). Then
\begin{equation}
P(B_{j+1})\le P(B_j)+\sum\limits_{l=1}^{m_{j+1}}
P\left(|Z_{\mu,k}(f_{j+1,l}-f_{j,l'})|>u2^{-k(j+1)/2}\right). \label{7.2}
\end{equation}
Theorem~\ref{t4.1} yields a good estimate of the terms in the sum at the
right-hand side of (\ref{7.2}), and it also provides a good bound of the
probability $P(B_0)$. With the help of some small modification of
the construction it can be achieved that also the relation
$\bigcup\limits_{j=0}^\infty {\cal F}_j={\cal F}$ holds. The proof
of Theorem~\ref{t6.1} follows from the estimates obtained in such a way.
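The construction of the nets ${\cal G}_j$ can be made concrete in a toy case (my own illustration, not the paper's setting): take $k=1$, the class of indicator functions $f_t=1_{[0,t]}$, $t\in[0,1]$, and $\mu$ the uniform distribution on $[0,1]$, so that $\rho(f_s,f_t)^2=|s-t|$. A grid of spacing $2^{-2j}\sigma^2$ then satisfies relation~(\ref{7.1}), and its cardinality obeys a bound of the form $m_j\le D2^{jkL}\sigma^{-L}$ with $L=2$ and a small constant~$D$.

```python
# Toy check of relation (7.1) for k = 1: kernels f_t = indicator of [0, t]
# on [0, 1], mu = the uniform distribution, so rho(f_s, f_t)^2 = |s - t|.
sigma = 0.25

def net(j):
    """Grid G_j with spacing 2**(-2j) * sigma**2: every t lies within
    2**(-2j) * sigma**2 of the grid, hence inf rho^2 <= 2**(-2j) sigma^2."""
    h = 2 ** (-2 * j) * sigma ** 2
    m = int(1 / h) + 1
    return [min(l * h, 1.0) for l in range(m + 1)]

for j in range(4):
    grid = net(j)
    bound = 2 ** (-2 * j) * sigma ** 2
    for t in [0.0, 0.123, 0.5, 0.777, 1.0]:
        rho2 = min(abs(t - g) for g in grid)   # = inf_l rho(f_t, g_{j,l})^2
        assert rho2 <= bound                   # this is relation (7.1)
    # the size of the net grows like D * 2**(j*L) * sigma**(-L) with L = 2
    assert len(grid) <= 2 * 2 ** (2 * j) * sigma ** (-2) + 2
```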
Theorem~\ref{t6.2} can be deduced from Theorem~\ref{t6.3} relatively
simply with
the help of Theorem~\ref{t3.3}, since Theorem~\ref{t6.3} enables us to give a
good bound on all terms in the sum at the right-hand side of
formula~(\ref{3.7}). The only non-trivial step in this argument is
to show that the set of functions $f_V$, $f\in {\cal F}$, appearing
in formula~(\ref{3.7}) satisfy the estimates needed in the
application of Theorem~\ref{t6.3}. Relations~(\ref{3.8}) and~(\ref{3.9})
are part of the needed estimates. Besides this, it has to be shown
that if ${\cal F}$ is an $L_2$-dense class of functions, then the
same relation holds for the classes of functions
${\cal F}_V=\{f_V\colon f\in {\cal F}\}$ for all sets
$V\subset\{1,\dots,k\}$. This relation can also be shown
with the help of a not too difficult proof (see~\cite{r19}
or~\cite{r22}), but this question will not be discussed here.
One may try to prove Theorem~\ref{t6.3}, similarly to
Theorem~\ref{t6.1}, with the
help of the chaining argument. But this method does not work well
in this case. The reason for its weakness is that the tail
distribution of a degenerate $U$-statistic with a small variance
$\sigma^2$ does not satisfy such a good estimate as the tail
distribution of a multiple Wiener--It\^o integral. At this point
the condition $u\le n^{k/2}\sigma^{k+1}$ in Theorem~\ref{t4.2} plays
an important role. Let us recall that, as Example~\ref{e4.5} shows, the
tail distribution of the normalized degenerate $U$-statistics
$n^{-k/2}I_{n,k}(f)$ satisfies only a relatively weak estimate at
level~$u$ if $u\gg n^{k/2}\sigma^{k+1}$. We may try to work with an
estimate analogous to relation~(\ref{7.2}) in the proof of Theorem~\ref{t6.3}.
But the probabilities appearing at the right-hand side of such an
estimate cannot be well estimated for large indices~$j$.
Thus we can start the procedure of the chaining argument, but
after finitely many steps we have to stop it. In such a way we
can find a relatively dense subset ${\cal F}_0\subset{\cal F}$ (in
the $L_2(\mu^k)$-norm) such that a good estimate can be given for the
distribution of the supremum $\sup\limits_{f\in{\cal F}_0}I_{n,k}(f)$.
This result enables us to reduce Theorem~\ref{t6.3} to a slightly
weaker statement formulated in Proposition~\ref{p7.1} below, but it
yields no more help. Nevertheless, such a reduction is useful.
%\medskip\noindent
\begin{prop}\label{p7.1}
%{\bf Proposition 7.1.}
{Let us have a probability measure $\mu$
on a measurable space $(X,{\cal X})$ together with a sequence of
independent and $\mu$-distributed random variables $\xi_1,\dots,\xi_n$
and a countable $L_2$-dense class ${\cal F}$ of canonical kernel
functions $f=f(x_1,\dots,x_k)$ (with respect to the measure~$\mu$)
with some parameter $D$ and exponent $L$ on the product space
$(X^k,{\cal X}^k)$ such that all functions $f\in{\cal F}$ satisfy
conditions (\ref{6.2}) and (\ref{6.3}) with some $0<\sigma\le1$. Let us
consider
the (degenerate) $U$-statistics $I_{n,k}(f)$ with the random sequence
$\xi_1,\dots,\xi_n$ and kernel functions $f\in{\cal F}$. There
exists a sufficiently large constant $K=K(k)$ together with some
numbers $\bar C=\bar C(k)>0$, $\gamma=\gamma(k)>0$ and threshold
index $A_0=A_0(k)>0$ depending only on the order $k$ of the
$U$-statistics such that if $n\sigma^2>K(L+\beta)\log n$ with
$\beta=\max\left(\frac{\log D}{\log n},0\right)$,
then the degenerate $U$-statistics $I_{n,k}(f)$, $f\in{\cal F}$,
satisfy the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}|n^{-k/2}I_{n,k}(f)|
\ge A n^{k/2}\sigma^{k+1}\right)
\le \bar C e^{-\gamma A^{1/2k}n\sigma^2}\quad \textrm{if } A\ge A_0.
\label{7.3}
\end{equation} }
\end{prop}
The statement of Proposition~\ref{p7.1} is similar to that of
Theorem~\ref{t6.3}.
The essential difference between them is that Proposition~\ref{p7.1}
yields an estimate only for $u\ge A_0 n^{k/2}\sigma^{k+1}$ with a
sufficiently large constant $A_0$, i.e.\ for relatively large
numbers~$u$. In the case $u\gg n^{k/2}\sigma^{k+1}$ it yields a
weaker estimate than formula~(\ref{6.5}) in Theorem~\ref{t6.3}, but
actually we need this estimate only in the case of the number $A$ in
formula~(\ref{7.3}) being bounded away both from zero and infinity.
The proof of Proposition~\ref{p7.1}, briefly explained
below, is based on an inductive procedure carried out by
means of a symmetrization argument. In each step of this induction
we diminish the number~$A_0$ for which we show that
inequality~(\ref{7.3}) holds for all numbers $An^{k/2}\sigma^{k+1}$
with $A\ge A_0$. This diminishing of the number~$A_0$ is done as
long as it is possible. It has to be stopped at such a number
$A_0$ for which the probability $P(|n^{-k/2}I_{n,k}(f)|\ge
A_0n^{k/2}\sigma^{k+1})$ can be well estimated by Theorem~\ref{t4.3} for all
functions $f\in{\cal F}$. This has the consequence that Proposition~\ref{p7.1}
yields precisely the strong estimate needed to reduce the
proof of Theorem~\ref{t6.3} to a statement that can be proved by means of
the chaining argument.
In the symmetrization argument applied in the proof of
Proposition~\ref{p7.1} several additional difficulties arise if the
multivariate case $k\ge2$ is considered. Hence in this section
only the case $k=1$ is discussed. A degenerate $U$-statistic
$I_{n,1}(f)$ of order~1 is the sum of independent, identically
distributed random variables with expectation zero. In this paper
the proof of Proposition~\ref{p7.1} will be only briefly explained. A
detailed proof can be found in~\cite{r19} or~\cite{r22}. Let me also
remark that
the method of these works was taken from Alexander's paper~\cite{r2},
where all ideas appeared in a different context.
We shall bound the probability appearing at the left-hand side
of~(\ref{7.3}) (if~$k=1$) from above by the probability of the event
that the supremum of appropriate randomized sums is larger than
some number. We apply a symmetrization method which means that we
estimate the expression we want to bound by means of a randomized
(symmetrized) expression. Lemma~\ref{l7.2}, formulated below, has such a
character.
%\medskip\noindent
\begin{lem}\label{l7.2}
%{\bf Lemma 7.2.}
{Let a countable class of functions ${\cal F}$
on a measurable space $(X,{\cal X})$ and a real number $0<\sigma<1$
be given. Consider a sequence of independent, identically
distributed $X$-valued random variables $\xi_1,\dots,\xi_n$ such
that $Ef(\xi_1)=0$, $Ef^2(\xi_1)\le\sigma^2$ for all $f\in{\cal F}$
together with another sequence $\varepsilon_1,\dots,\varepsilon_n$ of
independent random variables with distribution
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, independent also of the random sequence
$\xi_1,\dots,\xi_n$. Then
\begin{eqnarray}
&&\hskip-2cm P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}\left|
\sum\limits_{j=1}^n f(\xi_j)\right| \ge An^{1/2}\sigma^{2}\right)
\nonumber \\
&&\hskip-2cm \qquad \le 4P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}
\left|\sum\limits_{j=1}^n
\varepsilon_jf(\xi_j)\right| \ge \frac A3 n^{1/2}\sigma^{2}\right)
\quad\textrm{if } A\ge \frac{3\sqrt2}{\sqrt n\sigma}.
\label{7.4}
\end{eqnarray} }
\end{lem}
Let us first understand why Lemma~\ref{l7.2} can help in the proof of
Proposition~\ref{p7.1}. It enables us to reduce the estimation of the
probability at the left-hand side of formula~(\ref{7.4}) to that at its
right-hand side. This reduction turned out to be useful for the
following reason. At the right-hand side of formula~(\ref{7.4})
the probability of such an event appears which depends on the random
variables $\xi_1,\dots,\xi_n$ and some randomizing terms
$\varepsilon_1,\dots,\varepsilon_n$. Let us estimate the probability
of this event by bounding first its conditional probability under the
condition that the values of the random variables $\xi_1,\dots,\xi_n$
are prescribed. These conditional probabilities can be well estimated
by means of Hoeffding's inequality formulated below, and the
estimates we get for them also yield a good bound on the expression
at the right-hand side of~(\ref{7.4}).
Hoeffding's inequality (see e.g.\ \cite{r28}, pp.~191--192), more
precisely the special case of it that we need here, states that linear
combinations of independent random variables $\varepsilon_j$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le j\le n$,
behave as the central limit theorem suggests. More explicitly, the
following inequality holds.
%\medskip\noindent
\begin{thm}[Hoeffding's inequality]\label{t7.3}
%Theorem 7.3.
Let $\varepsilon_1,
\dots,\varepsilon_n$ be independent random variables,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$,
$1\le j\le n$, and let $a_1,\dots,a_n$ be arbitrary real numbers.
Put $V=\sum\limits_{j=1}^na_j\varepsilon_j$. Then
\begin{equation}
P(V>y)\le\exp\left\{-\frac{y^2}{2\sum_{j=1}^na_j^2 }\right\}\quad
\textrm{for all }y>0. \label{7.5}
\end{equation}
\end{thm}
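For small $n$ the inequality can be verified exactly by enumerating all $2^n$ sign vectors. The sketch below (with arbitrarily chosen coefficients $a_j$; an illustration, not part of the proof) compares the exact tail probability with the bound of formula~(\ref{7.5}).

```python
import math
from itertools import product

a = [1.0, 0.5, 2.0, 1.5, 0.7]            # arbitrary coefficients a_j
s2 = sum(x * x for x in a)               # sum of a_j^2

def tail(y):
    """Exact P(V > y) for V = sum a_j * eps_j, over all 2^n sign vectors."""
    n = len(a)
    hits = sum(1 for eps in product([-1, 1], repeat=n)
               if sum(e * x for e, x in zip(eps, a)) > y)
    return hits / 2 ** n

# the exact tail never exceeds the Hoeffding bound exp(-y^2 / (2 sum a_j^2))
for y in [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]:
    assert tail(y) <= math.exp(-y * y / (2 * s2))
```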
As we shall see, the application of Lemma~\ref{l7.2} together with the
above mentioned conditioning argument and Hoeffding's inequality
enable us to reduce the estimation of the distribution of
$\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^nf(\xi_j)$ to that
of $\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^nf^2(\xi_j)=
\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^n[f^2(\xi_j)-Ef^2(\xi_j)]+
$n\sup\limits_{f\in{\cal F}}Ef^2(\xi_1)$. At first sight it may seem
that we did not gain very much by applying this approach.
The estimation of the supremum of a class of sums of independent and
identically distributed random variables was replaced by the
estimation of a similar supremum. But a closer look shows that this
method can help us in finding a proof of Proposition~\ref{p7.1}. We have
to keep track of the level at which we wanted to bound the
distribution of the supremum in the original problem, and of the
level we have to choose in the modified problem to get a good
estimate in the problem we are interested in. It turns out that in
the second problem we need a good estimate about the distribution of
the supremum of a class of sums of independent and identically
distributed random variables
at a considerably higher level. This observation enables us to work
out an inductive procedure which leads to the proof of
Proposition~\ref{p7.1}.
Indeed, in Proposition~\ref{p7.1} estimate (\ref{7.3}) has to be proved for
all numbers $A\ge A_0$ with some appropriate number $A_0$.
This estimate trivially holds if $A_0>\sigma^{-2}$,
because in this case condition~(\ref{6.2}) about the
functions $f\in{\cal F}$ implies that the probability at the left-hand
side of~(\ref{7.3}) equals zero. The argument of the previous paragraph
suggests the following statement: If relation (\ref{7.3})
holds for some constant~$A_0$, then it also holds for a smaller~$A_0$.
Hence Proposition~\ref{p7.1} can be proved by means of an inductive procedure
in which the number~$A_0$ is diminished at each step.
The actual proof consists of an elaboration of the details in the
above heuristic approach. An inductive procedure is applied in
which it is shown that if relation (\ref{7.3}) holds with some number
$A_0$ for a class of functions ${\cal F}$ satisfying the conditions
of Proposition~\ref{p7.1}, then this relation also holds for it if $A_0$
is replaced by $A_0^{3/4}$, provided that $A_0$ is larger than some
fixed universal constant. I would like to emphasize that we prove
this statement not only for the class of functions ${\cal F}$ we are
interested in, but simultaneously for all classes of functions
which satisfy the conditions of Proposition~\ref{p7.1}. When we want to
prove the inductive statement for a class of functions ${\cal F}$,
we apply our previous information not to this class, but to
another appropriately defined class of functions
${\cal F}'={\cal F}'({\cal F})$ which also satisfies the conditions of
Proposition~\ref{p7.1}. I omit the details of the proof; I only discuss
one point which deserves special attention.
Hoeffding's inequality, applied in the justification of the
inductive procedure leading to the proof of Proposition~\ref{p7.1},
gives an estimate for the distribution of a single sum, while we need
a good estimate on the supremum of a class of sums. The question may
arise whether this causes a problem in the proof. I try
to briefly explain that the reason to introduce the condition about
the $L_2$-dense property of the class~${\cal F}$ was to overcome this
difficulty.
In the inductive procedure we want to prove that relation~(\ref{7.3})
holds for all $A\ge A_0^{3/4}$ if it holds for all $A\ge A_0$. It
can be shown, by means of the inductive assumption (which states that
relation~(\ref{7.3}) holds for $A\ge A_0$) and Hoeffding's inequality
(Theorem~\ref{t7.3}), that there is a set $D\subset\Omega$ such that the
conditional probabilities
\begin{equation}
P\left(\left.\frac1{\sqrt n}\left|\sum\limits_{j=1}^n
\varepsilon_j f(\xi_j)\right| \ge\frac{
An^{1/2}\sigma^{2}}6\right|\xi_1(\omega),\dots,\xi_n(\omega)\right)
\label{7.6}
\end{equation}
are very small for all $f\in{\cal F}$, and the probability of the
set $\Omega\setminus D$ is negligibly small. Let me emphasize that
at this step of the proof we can give a good estimate about the
conditional probability in~(\ref{7.6}) for all functions $f\in{\cal F}$ if
$\omega\in D$, but we cannot work with their supremum, which we would
need in order to apply formula (\ref{7.4}). This difficulty can be
overcome with the
help of the following argument.
Let us introduce the (random) probability measure $\nu=\nu(\omega)$
uniformly distributed on the points $\xi_1(\omega),\dots,\xi_n(\omega)$
for all $\omega\in D$. Let us observe that the (random) measure $\nu$
has a support consisting of~$n$ points, and the $\nu$-measure of
all points in the support of~$\nu$ equals $\frac1n$. This implies
that the supremum of a function defined on the support of the measure
$\nu$ can be bounded by means of the $L_2(\nu)$-norm of this function.
This property together with the $L_2$-dense property of the class
of functions ${\cal F}$ imposed in the conditions of Proposition~\ref{p7.1}
imply that a finite set $\{f_1,\dots,f_m\}\subset {\cal F}$ can be
chosen with relatively few elements~$m$ in such a way that for
all $f\in{\cal F}$ there is some function $f_l$, $1\le l\le m$,
whose distance from the function $f$ in the $L_2(\nu)$ norm is less
than $A\sigma^2/6$, hence $\inf\limits_{1\le l\le m}n^{-1/2}
\left|\sum\limits_{j=1}^n \varepsilon_j(f(\xi_j)-f_l(\xi_j))\right|\le
n^{1/2}\int|f-f_l|d\nu\le\frac{An^{1/2}\sigma^{2}}6$. The condition
that ${\cal F}$ is $L_2$-dense with exponent~$L$ and parameter~$D$
enables us to give a good upper bound on the number~$m$. This is
the point, where the condition that the class of functions ${\cal F}$
is $L_2$-dense was exploited in its full strength. Since we can
give a good bound on the conditional probability in~(\ref{7.6}) for all
functions $f=f_l$, $1\le l\le m$, we can bound the probability at
the right-hand side of (\ref{7.4}). It turns out that the estimate we
get in such a way is sufficiently sharp, and the inductive statement,
hence also Proposition~\ref{p7.1} can be proved by working out the details.
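The chain of inequalities behind the above bound is worth spelling out (a routine computation, stated here for convenience). Writing $g=f-f_l$ and using that each of the $n$ support points of~$\nu$ has $\nu$-measure~$\frac1n$,

```latex
\frac1{\sqrt n}\Bigl|\sum_{j=1}^n \varepsilon_j g(\xi_j)\Bigr|
 \le \frac1{\sqrt n}\sum_{j=1}^n |g(\xi_j)|
 = n^{1/2}\int |g|\,d\nu
 \le n^{1/2}\Bigl(\int g^2\,d\nu\Bigr)^{1/2},
```

where the last step is the Schwarz inequality for the probability measure~$\nu$. The choice $\|f-f_l\|_{L_2(\nu)}\le A\sigma^2/6$ therefore yields the threshold $An^{1/2}\sigma^2/6$ appearing in~(\ref{7.6}).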
I briefly explain the proof of Lemma~\ref{l7.2}. The randomizing
terms~$\varepsilon_j$, $1\le j\le n$, in it can be introduced with the
help of the following simple lemma.
%\medskip\noindent
\begin{lem}\label{l7.4}
%{\bf Lemma 7.4.}
{Let $\xi_1,\dots,\xi_n$ and
$\bar\xi_1,\dots,\bar\xi_n$ be two sequences of independent and
identically distributed random variables with the same distribution
$\mu$ on some measurable space $(X,{\cal X})$, independent of each
other. Let $\varepsilon_1,\dots,\varepsilon_n$ be a sequence of
independent random variables $P(\varepsilon_j=1)=P(\varepsilon_j=-1)
=\frac12$, $1\le j\le n$, which is
independent of the random sequences $\xi_1,\dots,\xi_n$ and
$\bar\xi_1,\dots,\bar\xi_n$. Take a countable set of functions
${\cal F}$ on the space $(X,{\cal X})$. Then the set of random variables
$$
\frac1{\sqrt n}\sum_{j=1}^n \left(f(\xi_j)-f(\bar\xi_j)\right),
\quad f\in {\cal F},
$$
and its randomized version
$$
\frac1{\sqrt n}\sum_{j=1}^n \varepsilon_j \left(f(\xi_j)
-f(\bar\xi_j)\right), \quad f\in {\cal F},
$$
have the same joint distribution.}
\end{lem}
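Lemma~\ref{l7.4} can be checked exactly in a tiny discrete case. The sketch below (an illustration with ad hoc choices: $n=2$, a single function $f(x)=x$, and $\xi_j$, $\bar\xi_j$ uniform on $\{0,1\}$; it is of course not the proof) enumerates all outcomes with exact arithmetic and compares the two joint distributions.

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Tiny exact instance of the lemma: n = 2, one kernel f(x) = x,
# xi_j and xibar_j i.i.d. uniform on {0, 1}, eps_j = +/-1 with prob. 1/2.
def law(randomized):
    """Exact joint law of (f(xi_j) - f(xibar_j))_j, optionally randomized."""
    dist = Counter()
    for xi, xibar, eps in product(product([0, 1], repeat=2),
                                  product([0, 1], repeat=2),
                                  product([-1, 1], repeat=2)):
        p = Fraction(1, 2 ** 6)                # all 2^6 outcomes equally likely
        key = tuple((e if randomized else 1) * (x - y)
                    for x, y, e in zip(xi, xibar, eps))
        dist[key] += p
    return dist

# the randomized and the non-randomized differences have the same joint law
assert law(randomized=True) == law(randomized=False)
```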
Lemma~\ref{l7.2} can be proved by means of Lemma~\ref{l7.4} and
some calculations.
There is one harder step in the calculations. A probability of the type
$$
P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^n
f(\xi_j)>u\right)
$$
has to be bounded from above by means of a probability of the type
$$
P\left(\frac1{\sqrt n}\sup\limits_{f\in{\cal F}}\sum\limits_{j=1}^n
\left(f(\xi_j)-f(\bar\xi_j)\right)>u-K\right)
$$
with some number $K>0$. (Here the notation of Lemma~\ref{l7.4} is applied.)
At this point the following symmetrization lemma may be useful.
%\medskip\noindent
\begin{lem}[Symmetrization Lemma]\label{l7.5}
%{\bf Lemma 7.5 (Symmetrization Lemma).}
{Let $Z_p$ and $\bar
Z_p$, $p=1,2,\dots$, be two sequences of random variables
independent of each other, and let the random variables $\bar Z_p$,
$p=1,2,\dots$, satisfy the inequality
\begin{equation}
P(|\bar Z_p|\le\alpha)\ge\beta\quad \textrm{for all } p=1,2,\dots
\label{7.7}
\end{equation}
with some numbers $\alpha\ge0$ and $\beta>0$. Then
$$
P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)\le\frac1\beta
P\left(\sup\limits_{1\le p<\infty}|Z_p-\bar Z_p|>u\right)\quad
\textrm{for all } u>0.
$$ }
\end{lem}
The proof of Lemma~\ref{l7.5} can be found for instance in~\cite{r28}
(8~Symmetrization Lemma) or in \cite{r22}~Lemma~7.1.
Let us list the elements of the countable class of functions
${\cal F}$ in Lemma~\ref{l7.2} in the form ${\cal F}=\{f_1,f_2,\dots\}$.
Then Lemma~\ref{l7.2} can be proved by means of Lemmas~\ref{l7.4}
and~\ref{l7.5} with the choice of the random variables
\begin{equation}
Z_p=\frac1{\sqrt n}\sum\limits_{j=1}^n f_p(\xi_j)\quad\textrm{and} \quad
\bar Z_p=\frac1{\sqrt n}\sum\limits_{j=1}^n f_p(\bar\xi_j),
\quad p=1,2,\dots. \label{7.8}
\end{equation}
I omit the details.
One may try to generalize the above sketched proof
of Theorem~\ref{t6.3} to the multivariate case~$k\ge2$. Here the question
arises of how to generalize Lemma~\ref{l7.2} to the multivariate case and how
to prove this generalization. These are highly non-trivial problems.
This will be the main subject of the next section.
\section {On the proof of Theorem~\ref{t6.3} in the multivariate
case}\label{s8}
Here we are mainly interested in the question of how to carry over the
symmetrization procedure in the proof of Proposition~\ref{p7.1} to the
multivariate case $k\ge2$. It turned out that it is possible to
reduce this problem to the investigation of modified $U$-statistics,
where $k$ independent copies of the original random sequence are
taken and put into the $k$ different arguments of the kernel
function of the $U$-statistic of order~$k$. Such modified versions
of $U$-statistics are called decoupled $U$-statistics in the
literature, and they can be better studied by means of the
symmetrization argument we are going to apply. To give a precise
meaning to the above statements some definitions have to be
introduced and some results have to be formulated. I introduce the
following notions.
\medskip\noindent
{\bf The definition of decoupled and randomized decoupled
$U$-statistics.} {\it Let us have $k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of a sequence
$\xi_1,\dots,\xi_n$ of independent and identically distributed
random variables taking their values in a measurable space
$(X,{\cal X})$ together with a measurable function $f(x_1,\dots,x_k)$
on the product space $(X^k,{\cal X}^k)$ with values in a separable
Banach space. Then the decoupled $U$-statistic determined by
the random sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
and kernel function $f$ is defined by the formula
\begin{equation}
\bar I_{n,k}(f)=\frac1{k!}\sum\limits_{ 1\le l_j\le n,\; j=1,\dots, k
\atop l_j\neq l_{j'} \textrm{ if } j\neq j'}
f\left(\xi_{l_1}^{(1)},\dots,\xi_{l_k}^{(k)}\right). \label{8.1}
\end{equation}
Let us have, beside the sequences
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and the function
$f(x_1,\dots,x_k)$, a sequence of independent random variables
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$, $P(\varepsilon_l=1)
=P(\varepsilon_l=-1)=\frac12$, $1\le l\le n$,
which is independent also of the sequences of random variables
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$. We define the
randomized decoupled $U$-statistic determined by the random
sequences $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, the kernel
function $f$ and the randomizing sequence $\varepsilon_1,\dots,
\varepsilon_n$ by the formula
\begin{equation}
\bar I_{n,k}^\varepsilon(f)=\frac1{k!}\sum\limits_{ 1\le l_j\le n,\;
j=1,\dots, k\atop
l_j\neq l_{j'} \textrm{ if } j\neq j'}
\varepsilon_{l_1}\cdots\varepsilon_{l_k}f\left(\xi_{l_1}^{(1)},\dots,
\xi_{l_k}^{(k)}\right).
\label{8.2}
\end{equation} }
\medskip
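To see what definition~(\ref{8.1}) yields in the simplest non-trivial
case, let us write it out for $k=2$:
$$
\bar I_{n,2}(f)=\frac12\sum_{1\le l_1,l_2\le n,\; l_1\neq l_2}
f\left(\xi_{l_1}^{(1)},\xi_{l_2}^{(2)}\right),
$$
i.e.\ the two arguments of the kernel function are filled from two
independent copies of the original sample. Observe that, unlike in the
$U$-statistic $I_{n,2}(f)$ defined in formula~(\ref{1.3}), the
restriction $l_1\neq l_2$ now concerns indices of two different,
independent sequences.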
Our first goal is to reduce the study of inequality (\ref{7.3}) in
Proposition~\ref{p7.1} to an analogous problem about the supremum of
decoupled $U$-statistics defined above. Then we want to show that
a symmetrization argument enables us to reduce this problem to
the study of randomized decoupled $U$-statistics introduced in
formula~(\ref{8.2}). A result of de la Pe\~na and Montgomery--Smith
formulated below helps to carry out such a program. Let me remark
that both in the definition of decoupled $U$-statistics and in the
result of de la Pe\~na and Montgomery--Smith functions~$f$ taking
their values in a separable Banach space were considered, i.e.\
we did not restrict our attention to real-valued functions.
This choice was motivated by the fact that in such a
general setting we can get a simpler proof of inequality~(\ref{8.4})
presented below. (The definition of $U$-statistics given in
formula~(\ref{1.3}) is also meaningful in the case of Banach-space
valued functions~$f$.)
%\medskip\noindent
\begin{thm}[de la Pe\~na and Montgomery--Smith]\label{t8.1}
%Theorem 8.1.
Let us consider a sequence of independent and identically distributed
random variables $\xi_1,\dots,\xi_n$ on a measurable space
$(X,{\cal X})$ together with $k$ independent copies
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$. Let us also have
a function $f(x_1,\dots,x_k)$ on the $k$-fold product space
$(X^k,{\cal X}^k)$ which takes its values in a separable Banach
space~$B$. Define the $U$-statistic and decoupled
$U$-statistic $I_{n,k}(f)$ and $\bar I_{n,k}(f)$ with the help of
the above random sequences $\xi_1,\dots,\xi_n$,
$\xi_{1}^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, and kernel
function~$f$. There exist some constants $\bar C=\bar C(k)>0$
and $\gamma=\gamma(k)>0$ depending only on the order~$k$ of the
$U$-statistic such that
\begin{equation}
P\left(\|I_{n,k}(f)\|>u\right)\le \bar CP\left(\|\bar I_{n,k}(f)\|>
\gamma u\right)
\label{8.3}
\end{equation}
for all $u>0$. Here $\|\cdot\|$ denotes the norm in the Banach
space~$B$ where the function~$f$ takes its values.
More generally, if we have a countable sequence of functions~$f_s$,
$s=1,2,\dots$, taking their values in the same separable
Banach-space, then
\begin{equation}
P\left(\sup_{1\le s<\infty} \left\| I_{n,k}(f_s)\right\|>u\right)\le
\bar CP\left(\sup_{1\le s<\infty}\left\|\bar I_{n,k}(f_s)\right\|>
\gamma u\right). \label{8.4}
\end{equation}
\end{thm}
The proof of Theorem~\ref{t8.1} can be found in~\cite{r4} or in Appendix~B
of my Lecture Note~\cite{r22}. Actually~\cite{r4} contains only the proof
of inequality~(\ref{8.3}), but~(\ref{8.4}) can be deduced from it
simply by introducing appropriate separable Banach spaces and by
exploiting that the universal constants in formula~(\ref{8.3}) do not
depend on the Banach space where the random variables live.
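Let me indicate this deduction. For each fixed~$m$ we may apply
inequality~(\ref{8.3}) to the function $F_m=(f_1,\dots,f_m)$ taking
its values in the separable Banach space $B^m$ equipped with the norm
$\|(b_1,\dots,b_m)\|=\max\limits_{1\le s\le m}\|b_s\|$; since
$I_{n,k}(F_m)=(I_{n,k}(f_1),\dots,I_{n,k}(f_m))$, and similarly for
the decoupled version, this yields relation~(\ref{8.4}) with
$\sup\limits_{1\le s\le m}$ instead of $\sup\limits_{1\le s<\infty}$,
with constants $\bar C$ and $\gamma$ not depending on~$m$. The limit
$m\to\infty$ then supplies~(\ref{8.4}).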
Theorem~\ref{t8.1} is useful for us, because it shows that
Proposition~\ref{p7.1}
simply follows from its version presented in Proposition~\ref{p8.2} below,
where $U$-statistics are replaced by decoupled $U$-statistics. The
distribution of a decoupled $U$-statistic does not change if
the sequences of random variables put in some coordinates of
its kernel function are replaced by an independent copy, and this
is a very useful property in the application of symmetrization
arguments. Beside this, the usual arguments applied in calculations
with ordinary $U$-statistics can be adapted to the study of decoupled
$U$-statistics. Now I formulate the following version of
Proposition~\ref{p7.1}.
%\medskip\noindent
\begin{prop}\label{p8.2}
%{\bf Proposition 8.2.}
{Consider a class of functions
$f\in{\cal F}$ on the $k$-fold product $(X^k,{\cal X}^k)$ of a
measurable space $(X,{\cal X})$, a probability measure $\mu$ on
$(X,{\cal X})$
together with a sequence of independent and $\mu$ distributed
random variables $\xi_1,\dots,\xi_n$ which satisfy the conditions
of Proposition~\ref{p7.1}. Let us take $k$ independent copies
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of the random
sequence $\xi_1,\dots,\xi_n$, and consider the decoupled
$U$-statistics $\bar I_{n,k}(f)$, $f\in {\cal F}$, defined with
their help by formula (\ref{8.1}). There exists a sufficiently
large constant $K=K(k)$ together with some number
$\gamma=\gamma(k)>0$ and threshold index $A_0=A_0(k)>0$
depending only on the order $k$ of the decoupled $U$-statistics
we consider such that if $n\sigma^2>K(L+\beta) \log n$ with
$\beta=\max\left(\frac{\log D}{\log n},0\right)$, then the (degenerate)
decoupled $U$-statistics $\bar I_{n,k}(f)$, $f\in{\cal F}$,
satisfy the following version of inequality (\ref{7.3}):
\begin{equation}
P\left(\sup_{f\in{\cal F}}|n^{-k/2}\bar I_{n,k}(f)|\ge A
n^{k/2}\sigma^{k+1}\right) \le e^{-\gamma A^{1/(2k)}n\sigma^2}
\quad \textrm{if } A\ge A_0 .\label{8.5}
\end{equation} }
\end{prop}
%\medskip
Proposition~\ref{p8.2} and Theorem~\ref{t8.1} imply
Proposition~\ref{p7.1}. Hence it is enough to concentrate on the proof of
Proposition~\ref{p8.2}. It is natural to try to adapt the method applied
in the proof of Proposition~\ref{p7.1} in the case $k=1$. I try to explain
what kind of new problems appear in the multivariate case and how
to overcome them.
The proof of Proposition~\ref{p7.1} was based on a symmetrization type
result formulated in Lemma~\ref{l7.2} and on Hoeffding's inequality,
formulated in Theorem~\ref{t7.3}. We have to find the multivariate
versions of these results. It is not difficult to find the multivariate
version of Hoeffding's inequality. Such a result can be found
in Theorem~12.3 of~\cite{r22}, and~\cite{r21} contains an improved
version with an optimal constant in the exponent. Here I do not formulate this
result, I only explain its main content. Let us consider a homogeneous
polynomial of Rademacher functions of order~$k$. The multivariate
version of Hoeffding's inequality states that its tail distribution
can be bounded by that of $K\sigma\eta^k$ with some constant
$K=K(k)$ depending only on the order $k$ of the homogeneous
polynomial, where $\eta$ is a standard normal random variable, and
$\sigma^2$ is the variance of the random homogeneous polynomial.
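In a formula, the content of this statement is the following: if
$$
Z=\sum_{1\le l_1<\dots<l_k\le n} a(l_1,\dots,l_k)\,
\varepsilon_{l_1}\cdots\varepsilon_{l_k}
$$
is a homogeneous polynomial of order~$k$ of the Rademacher functions
$\varepsilon_1,\dots,\varepsilon_n$ with variance
$\sigma^2=EZ^2=\sum a^2(l_1,\dots,l_k)$, then
$$
P(|Z|>u)\le P\left(K\sigma|\eta|^k>u\right)\le
2\exp\left\{-\frac12\left(\frac u{K\sigma}\right)^{2/k}\right\}
\qquad\textrm{for all } u>0
$$
with some constant $K=K(k)$ depending only on~$k$; the second
inequality follows from the standard normal tail bound
$P(|\eta|>t)\le2e^{-t^2/2}$.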
The problem about the multivariate generalization of Lemma~\ref{l7.2} is
much harder. We want to prove the following multivariate
version of this result.
%\medskip\noindent
\begin{lem}\label{l8.3}
%{\bf Lemma 8.3.}
{Let ${\cal F}$ be a class of functions on the
space $(X^k,{\cal X}^k)$ which satisfies the conditions of
Proposition~\ref{p7.1} with some probability measure $\mu$. Let us have $k$
independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$, of
a sequence of independent $\mu$ distributed random variables
$\xi_1,\dots,\xi_n$ and a sequence of independent random variables
$\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)$, $P(\varepsilon_l=1)
=P(\varepsilon_l=-1)=\frac12$, $1\le l\le n$,
which is independent also of the random sequences
$\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$. Consider the
decoupled $U$-statistics $\bar I_{n,k}(f)$ defined
with the help of these random variables by formula~(\ref{8.1}) together
with their randomized version $\bar I_{n,k}^{\varepsilon}(f)$ defined
in~(\ref{8.2}) for all $f\in{\cal F}$. There exists some constant
$A_0=A_0(k)>0$ such that the inequality
\begin{eqnarray}
&&\hskip-1truecm P\left(\sup_{f\in{\cal F}} n^{-k/2}\left|\bar
I_{n,k}(f)\right|>An^{k/2}\sigma^{k+1}\right)\label{8.6} \\
&&\hskip-1truecm \qquad<2^{k+1}P\left(\sup_{f\in{\cal F}} \left|
\bar I_{n,k}^{\varepsilon}(f)\right|
>2^{-(k+1)}A n^k\sigma^{k+1}\right)
+Bn^{k-1}e^{-A^{1/(2k-1)} n\sigma^2/k} \nonumber
\end{eqnarray}
holds for all $A\ge A_0$ with some appropriate constant $B=B(k)$. One
can choose for instance $B=2^k$ in this result.}
\end{lem}
The estimate (\ref{8.6}) in Lemma~\ref{l8.3} is similar to formula (\ref{7.4})
in Lemma~\ref{l7.2}. There is a slight difference between them, because the
right-hand side of~(\ref{8.6}) contains an additional constant term.
But this term is sufficiently small, and its presence causes no problem
as we try to prove Proposition~\ref{p8.2} by means of Lemma~\ref{l8.3}.
In this
proof we want to estimate the distribution of the supremum of the
decoupled $U$-statistics $\bar I_{n,k}(f)$, $f\in{\cal F}$, defined
in formula (\ref{8.1}), and Lemma~\ref{l8.3} helps us in reducing this
problem to an analogous one, where these decoupled $U$-statistics are
replaced by the randomized decoupled $U$-statistics
$\bar I_{n,k}^\varepsilon(f)$, defined in formula (\ref{8.2}). This
reduced problem can be studied by taking the conditional probability
of the event whose probability is considered on the right-hand side
of~(\ref{8.6}) under the condition that all random variables
$\xi^{(j)}_l$, $1\le j\le k$, $1\le l\le n$, take a prescribed value.
These conditional probabilities can be estimated by means of the
multivariate version of the Hoeffding inequality, and then an
adaptation of the method described in the previous section supplies
the proof of Proposition~\ref{p8.2}. The proof is harder in this new case,
but no new principal difficulty arises.
Lemma~\ref{l7.2} was proved by means of a simple result formulated in
Lemma~\ref{l7.4} which enabled us to introduce the randomizing
terms~$\varepsilon_j$, $1\le j\le n$. In this result we have taken
beside the original sequence $\xi_1,\dots,\xi_n$ an independent copy
$\bar\xi_1,\dots,\bar\xi_n$. In the next Lemma~\ref{l8.4} I formulate a
multivariate version of Lemma~\ref{l7.4} which may help in the proof of
Lemma~\ref{l8.3}. In its formulation I introduce beside the $k$
independent copies $\xi_1^{(j)},\dots,\xi_n^{(j)}$, $1\le j\le k$,
of the original sequence of independent, identically distributed
random variables $\xi_1,\dots,\xi_n$ appearing in the definition of
a decoupled $U$-statistic of order~$k$ another $k$ independent
copies $\bar\xi_1^{(j)},\dots,\bar\xi_n^{(j)}$, $1\le j\le k$, of
this sequence. For notational convenience I reindex them,
and I shall deal in Lemma~\ref{l8.4} with $2k$ independent copies
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, $1\le j\le k$, of the
original sequence $\xi_1,\dots,\xi_n$.
Now I formulate Lemma~\ref{l8.4}.
%\medskip\noindent
\begin{lem}\label{l8.4}
%{\bf Lemma 8.4.}
{Let us have a (non-empty) class of functions
${\cal F}$ of $k$ variables $f(x_1,\dots,x_k)$ on a measurable space
$(X^k,{\cal X}^k)$ together with $2k$ independent copies
$\xi_1^{(j,1)},\dots,\xi_n^{(j,1)}$ and
$\xi_1^{(j,-1)},\dots,\xi_n^{(j,-1)}$, $1\le j\le k$, of a sequence
of independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ on $(X,{\cal X})$ and another sequence of
independent random variables $\varepsilon_1,\dots,\varepsilon_n$,
$P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, $1\le j\le n$,
independent of all
previously considered random sequences. Let us denote the class of
sequences of length $k$ consisting of $\pm1$ digits by $V_k$,
and let $m(v)$ denote the number of digits $-1$ in a sequence
$v=(v(1),\dots,v(k))\in V_k$. Let us introduce with the help of
the above notations the random variables $\tilde I_{n,k}(f)$
and $\tilde I_{n,k}(f,\varepsilon)$ as
\begin{equation}
\tilde I_{n,k}(f)=\frac1{k!}\sum_{v\in V_k}
(-1)^{m(v)} \sum\limits_{ 1\le l_r\le n,\; r=1,\dots, k\atop
l_r\neq l_{r'} \textrm{ \scriptsize if } r\neq r'}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right)
\label{8.7}
\end{equation}
and
\begin{equation}
\tilde I_{n,k}(f,\varepsilon)=\frac1{k!}\sum_{v\in V_k}
(-1)^{m(v)} \sum\limits_{ 1\le l_r\le n,\; r=1,\dots, k\atop
l_r\neq l_{r'} \textrm{\scriptsize if } r\neq r'}
\varepsilon_{l_1}\cdots \varepsilon_{l_k}
f\left(\xi_{l_1}^{(1,v(1))},\dots,\xi_{l_k}^{(k,v(k))}\right)
\label{8.8}
\end{equation}
for all $f\in {\cal F}$. The joint distributions of the random
variables $\{\tilde I_{n,k}(f);\, f\in{\cal F}\}$ and
$\{\tilde I_{n,k}(f,\varepsilon);\, f\in {\cal F}\}$ defined in
formulas~(\ref{8.7}) and~(\ref{8.8}) agree.}
\end{lem}
The proof of Lemma~\ref{l8.4} can be found as Lemma~11.5 in~\cite{r22}.
Actually, this proof is not difficult. Let us observe that the inner
sum in formula~(\ref{8.7}) is a decoupled $U$-statistic, and in
formula~(\ref{8.8}) it is a randomized decoupled $U$-statistic.
(Actually they are multiplied by $k!$.) In formulas~(\ref{8.7})
and~(\ref{8.8}) that linear combination of these expressions was
taken which resembles the inclusion--exclusion formula appearing in
the definition of Stieltjes measures.
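In the case $k=2$, for instance, the linear combination taken
in~(\ref{8.7}) is, for fixed indices $l_1\neq l_2$,
$$
f\left(\xi_{l_1}^{(1,1)},\xi_{l_2}^{(2,1)}\right)
-f\left(\xi_{l_1}^{(1,1)},\xi_{l_2}^{(2,-1)}\right)
-f\left(\xi_{l_1}^{(1,-1)},\xi_{l_2}^{(2,1)}\right)
+f\left(\xi_{l_1}^{(1,-1)},\xi_{l_2}^{(2,-1)}\right),
$$
which has the same structure as the expression
$F(b_1,b_2)-F(a_1,b_2)-F(b_1,a_2)+F(a_1,a_2)$ yielding the Stieltjes
measure of the rectangle $[a_1,b_1]\times[a_2,b_2]$.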
Let us list the functions in the class of functions ${\cal F}$ in
Lemma~\ref{l8.3} in the form $\{f_1,f_2,\dots\}={\cal F}$, and introduce
the quantities
\begin{equation}
Z_p=\frac{n^{-k/2}}{k!} \sum\limits_{ 1\le l_r\le n,\; r=1,\dots, k
\atop l_r\neq l_{r'} \textrm{ \scriptsize if } r\neq r'}
f_p\left(\xi_{l_1}^{(1,1)},\dots,\xi_{l_k}^{(k,1)}\right),\quad
p=1,2,\dots, \label{8.9}
\end{equation}
and
\begin{equation}
\bar Z_p=Z_p-n^{-k/2}\tilde I_{n,k}(f_p), \quad p=1,2,\dots,
\label{8.10}
\end{equation}
with the random variables $\tilde I_{n,k}(f)$ introduced in (\ref{8.7})
with the function $f=f_p$. We would like to prove Lemma~\ref{l8.3} with the
help of Lemma~\ref{l8.4}. This can be done with the help of some calculations,
but it requires overcoming some very hard problems. We would like to
bound a probability of the form $P\left(\sup\limits_{1\le p<\infty}Z_p>
u\right)$ from above with the help of a probability of the form
$P\left(\sup\limits_{1\le p<\infty}(Z_p-\bar Z_p)>\frac u2\right)$ for
all sufficiently large numbers~$u$. The question arises how to prove
such an estimate. This problem is the most difficult part of the proof.
In the case $k=1$ considered in the previous section the analogous
problem could be simply solved by means of a Symmetrization Lemma
formulated in Lemma~\ref{l7.5}. This lemma cannot be applied in the present
case, because it has an important condition: it demands that the
sequences of random variables $Z_p$, $p=1,2,\dots$, and $\bar Z_p$,
$p=1,2,\dots$, should be independent. In the problem of Section~\ref{s7}
we could work with such sequences which satisfy this condition. On
the other hand, the sequences $Z_p$ and $\bar Z_p$, $p=1,2,\dots$,
defined in formulas~(\ref{8.9}) and~(\ref{8.10}) we have to work with
now are not independent in the case $k\ge2$. They satisfy some weak
sort of independence, and the problem is how to exploit this to get
the estimates we need.
Let us first formulate such a version of the Symmetrization Lemma
which can be applied also in the problem investigated now. This is
done in the next Lemma~\ref{l8.5}.
%\medskip\noindent
\begin{lem}[Generalized version of the Symmetrization Lemma]\label{l8.5}
%{\bf Lemma 8.5 (Generalized version of the Symmetrization Lemma.)}
{Let $Z_p$ and $\bar Z_p$, $p=1,2,\dots$, be two sequences of
random variables on a probability space $(\Omega,{\cal A},P)$. Let a
$\sigma$-algebra ${\cal B}\subset {\cal A}$ be given on the probability
space $(\Omega,{\cal A},P)$ together with a ${\cal B}$-measurable set
$B$ and two numbers $\alpha>0$ and $\beta>0$ such that the random
variables $Z_p$, $p=1,2,\dots$, are ${\cal B}$ measurable, and the
inequality
\begin{equation}
P(|\bar Z_p|\le\alpha|{\cal B})(\omega)\ge\beta\quad \textrm{for all }
p=1,2,\dots \textrm{ if } \omega\in B \label{8.11}
\end{equation}
holds.
Then
\begin{equation}
P\left(\sup_{1\le p<\infty}|Z_p|>\alpha+u\right)\le\frac1\beta
P\left(\sup\limits_{1\le
p<\infty}|Z_p-\bar Z_p|>u\right)+(1-P(B))
\label{8.12}
\end{equation}
for all $u>0$.}
\end{lem}
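Let me indicate the simple idea behind the estimate~(\ref{8.12}). If
$\sup\limits_{1\le p<\infty}|Z_p|>\alpha+u$, then there is a first
(random) index $p=p(\omega)$ with $|Z_p|>\alpha+u$, and this index is
${\cal B}$-measurable, since so are the random variables $Z_p$.
Condition~(\ref{8.11}) guarantees that
$P(|\bar Z_{p(\omega)}|\le\alpha|{\cal B})\ge\beta$ on the set~$B$,
and the inequalities $|Z_p|>\alpha+u$ and $|\bar Z_p|\le\alpha$
together imply that $|Z_p-\bar Z_p|>u$. Hence
$$
P\left(\sup_{1\le p<\infty}|Z_p-\bar Z_p|>u\right)\ge
\beta P\left(\left\{\sup_{1\le p<\infty}|Z_p|>\alpha+u\right\}\cap B\right),
$$
and a rearrangement of this relation yields inequality~(\ref{8.12}).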
Lemma~\ref{l8.5} is contained together with its proof
in~\cite{r22} under the name Lemma~13.1, and the proof is not hard.
It consists of a natural adaptation of the proof of the original
Symmetrization Lemma, presented in Lemma~\ref{l7.5}. The hard problem is to
check the condition in formula~(\ref{8.11}) in concrete applications.
In our case we would like to apply this lemma to the random variables
$Z_p$ and $\bar Z_p$, $p=1,2,\dots$, defined in formulas~(\ref{8.9})
and~(\ref{8.10}) together with the $\sigma$-algebra ${\cal B}={\cal B}
(\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n,\,1\le j\le k)$
generated by the random variables $\xi^{(j,1)}_1,\dots,\xi^{(j,1)}_n$,
$1\le j\le k$. We would like to show that relation~(\ref{8.11}) holds
with this choice on a set $B$ of probability almost~1. (Let me
emphasize that in~(\ref{8.11}) a set of inequalities must hold for all
$p=1,2,\dots$ simultaneously if $\omega\in B$.)
In the analogous problem considered in Section~\ref{s7} condition (\ref{7.7})
had to be checked with some appropriate constants $\alpha>0$ and
$\beta>0$ for the random variables $\bar Z_p$, $p=1,2,\dots$,
defined in formula~(\ref{7.8}). This could be done fairly simply by
the calculation of the variance of the random variables $\bar Z_p$,
$p=1,2,\dots$. A natural adaptation of this approach is to bound from
above the supremum $\sup\limits_{1\le p<\infty}E\left(\bar Z_p^2|
{\cal B}\right)$ of the conditional second moments of the random
variables $\bar Z_p$, $1\le p<\infty$, defined in (\ref{8.10}) with
respect to the $\sigma$-algebra ${\cal B}$ and to show that this
expression is small with large probability. I have followed this
approach in~\cite{r19} and~\cite{r22}. One can get the desired
estimates, but many unpleasant technical details have to be tackled
in the proof. I do not discuss all details here, I only briefly
explain what kind of problems we meet when we try to apply this method
in the special case $k=2$ and give some indications how they can be
overcome.
In the case $k=2$ the definition of $\bar Z_p$ is very similar to
that of $n^{-k/2}\tilde I_{n,2}(f_p)$ defined in (\ref{8.7}) with the
function $f=f_p$. The only difference is that in the definition of
$\bar Z_p$ we have to take only the values $v=(1,-1)$, $v=(-1,1)$ and
$v=(-1,-1)$ in the outer sum, i.e.\ the term $v=(1,1)$ is dropped,
and we multiply by $(-1)^{m(v)+1}$ instead of $(-1)^{m(v)}$. We
can get the desired estimate on the conditional supremum of second
moments if we can prove a good estimate on the conditional second
moments of the supremum of the inner sums in $\tilde I_{n,2}(f_p)$,
$1\le p<\infty$, in the case of each index $v=(1,-1)$, $v=(-1,1)$
and $v=(-1,-1)$. If we can get a good estimate in the case
$v=(1,-1)$, then we can get it in the remaining cases, too. So we
have to give a good bound on the expression
\begin{equation}
\sup_{1\le p<\infty}E\left(\left.
\frac1n\left(\sum\limits_{ 1\le l_r\le n,\; r=1,2,\;
l_1 \neq l_2} f_p\left(\xi_{l_1}^{(1,1)},\xi_{l_2}^{(2,-1)}
\right)\right)^2\right|
{\cal B}\right). \label{8.13}
\end{equation}
Moreover, since the sequence of random variables $\xi_l^{(2,-1)}$,
$1\le l\le n$, is independent of the $\sigma$-algebra ${\cal B}$, and
the canonical property of the functions $f_p$ implies some
orthogonalities, the estimation of the expression in (\ref{8.13}) can
be simplified. A detailed calculation shows that it is enough to prove
the following inequality:
Let us have a countable class ${\cal F}$ of canonical functions
$f(x,y)$ with respect to a probability measure $\mu$ on the second
power $(X^2,{\cal X}^2)$ of a measurable space $(X,{\cal X})$, which
is $L_2$-dense with some exponent $L$ and parameter $D$ (the
probability measure $\mu$ lives on the space $(X,{\cal X})$),
together with a sequence of independent and $\mu$-distributed random
variables $\xi_1,\dots,\xi_n$, $n\ge2$, on $(X,{\cal X})$, and let the
relations
$$
\int f(x,y)^2\mu(\,dx)\mu(\,dy)\le \sigma^2, \quad \sup\limits |f(x,y)|
\le1 \qquad \textrm{for all }
f\in{\cal F}
$$
hold with some number $0<\sigma^2\le1$ which satisfies the relation
$n\sigma^2\ge K(L+\beta)\log n$ with $\beta=\max\left(\frac{\log
D}{\log n},0\right)$ and a sufficiently large fixed constant $K>0$.
Then the inequality
\begin{equation}
P\left(\sup_{f\in{\cal F}}\frac1n \int\left(\sum\limits_{l=1}^n
f(\xi_l,y)\right)^2\mu(\,dy)
\ge A^2 n\sigma^4\right) \le \exp\left\{-A^{1/3}n\sigma^2 \right\}
\label{8.15}
\end{equation}
holds if $A\ge A_0$ with some sufficiently large fixed constant $A_0$.
Inequality (\ref{8.15}) is similar to relation (\ref{7.3}) in
Proposition~\ref{p7.1}
in the case $k=1$, but it does not follow from it. (It follows
from~(\ref{7.3}) in the special case when the function~$f$ does not
depend on the argument~$y$ with respect to which we integrate.) On
the other hand, inequality~(\ref{8.15}) can be proved by working out a
similar, although somewhat more complicated symmetrization argument
and induction procedure as it was done in the proof of
Proposition~\ref{p7.1} in the case $k=1$. After this, inequality~(\ref{8.15})
enables us to work out the symmetrization argument we need to prove
Proposition~\ref{p7.1} for $k=2$. This procedure can be continued for all
$k=2,3,\dots$. If we have already proved Proposition~\ref{p7.1} for
some~$k$, then an inequality can be formulated and proved with the
help of the already known results which enable us to carry out that
symmetrization procedure which is needed in the proof of
Proposition~\ref{p7.1} in the case $k+1$. This is a rather cumbersome
method with a lot of technical details, hence its detailed
explanation had to be omitted from this overview paper. In the
work~\cite{r22} Sections~13, 14 and~15 deal only with the proof of
Proposition~\ref{p7.1}. Section~13 contains the proof of some preparatory
results and the formulation of the inductive statements we have to
prove to get the result of Proposition~\ref{p7.1}, Section~14 contains the
proof of the symmetrization arguments we need, and finally the
proof is completed with their help in~Section~15.
\medskip
There is an interesting theory of Talagrand about so-called
concentration inequalities. This theory has some relation to
the questions discussed in this paper. In the last section this
relation will be discussed together with some other results and open
problems.
\section {Relation with other results and some open problems}\label{s9}
Talagrand worked out a deep theory about so-called concentration
inequalities. (See his overview in paper~\cite{r33} about this subject.)
His results are closely related to the supremum estimates described
in this paper. First I discuss this relation.
\subsection {On Talagrand's concentration inequalities}\label{s9.1}
Talagrand considered a sequence of independent random variables
$\xi_1,\dots,\xi_n$, a class of functions ${\cal F}$, took the partial
sums $\sum\limits_{j=1}^n f(\xi_j)$ for all functions $f\in{\cal F}$,
and investigated their supremum. He proved estimates which state
that this supremum is very close to its expected value (it is
concentrated around it). The following theorem in paper~\cite{r34} is a
typical result in this direction.
\begin{thm}[Theorem of Talagrand]\label{t9.1}
%Theorem 9.1.
Consider $n$ independent and identically distributed random variables
$\xi_1,\dots,\xi_n$ with values in some measurable space
$(X,{\cal X})$. Let ${\cal F}$ be some countable family of
real-valued measurable functions on $(X,{\cal X})$ such that
$\|f\|_\infty\le b<\infty$ for every $f\in{\cal F}$. Let
$Z=\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f(\xi_i)$
and $v=E(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i))$.
Then for every positive number~$x$,
\begin{equation}
P(Z\ge EZ+x)\le K\exp\left\{-\frac 1{K'}\frac
xb\log\left(1+\frac{xb}v\right)\right\} \label{9.1}
\end{equation}
and
\begin{equation}
P(Z\ge EZ+x)\le K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
\label{9.2}
\end{equation}
where $K$, $K'$, $c_1$ and $c_2$ are universal positive constants.
Moreover, the same inequalities hold when replacing $Z$ by $-Z$.
\end{thm}
Inequality (\ref{9.1}) can be considered as a generalization of Bennett's
inequality, inequality~(\ref{9.2}) as a generalization of Bernstein's
inequality. In these estimates the distribution of the supremum of
possibly infinitely many partial sums of functions of independent and
identically distributed random variables is considered. A remarkable feature of
Theorem~\ref{t9.1} is that it imposes no condition about the structure of
the class of functions ${\cal F}$. In this respect it differs from
Theorems~\ref{t6.2} and~\ref{t6.3} in this paper, where such a class
of functions
${\cal F}$ is considered which satisfies a so-called $L_2$-density
property.
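It may be instructive to observe that in the special case when the
class ${\cal F}$ consists of a single function~$f$ with
$Ef(\xi_i)=0$ we have $Z=\sum\limits_{i=1}^n f(\xi_i)$, $EZ=0$ and
$v=\sum\limits_{i=1}^n Ef^2(\xi_i)$, and inequality~(\ref{9.2}) takes
the form
$$
P\left(\sum_{i=1}^n f(\xi_i)\ge x\right)\le
K\exp\left\{-\frac{x^2}{2(c_1v+c_2bx)}\right\},
$$
i.e.\ it reduces, apart from the values of the universal constants,
to the classical Bernstein inequality.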
Talagrand's study was also continued by other authors who got
interesting results. In particular, the works of M.~Ledoux~\cite{r16}
and P.~Massart~\cite{r26} are worth mentioning. In these works the above
mentioned result was improved. Such a version was proved which also
holds for the supremum of appropriate classes of sums of independent
but not necessarily identically distributed random variables. (On
the other hand, I do not know of such a generalization in which
$U$-statistics of higher order are considered.) The improvements of
these works consist for instance in a version of Theorem~\ref{t9.1} where
the quantity
$v=E(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n f^2(\xi_i))$
is replaced by $\sigma^2=\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n
\textrm{Var}\, (f(\xi_i))$,
i.e.\ the supremum of the expectations of the individual partial sums
$\sum\limits_{i=1}^nf^2(\xi_i)$ is considered instead of the
expectation of the supremum of these partial sums. (The statement
that $\sigma^2$ equals the supremum of the expected values of the
partial sums $\sum\limits_{i=1}^nf^2(\xi_i)$ holds if $Ef(\xi_i)=0$
for all random variables $\xi_i$ and functions~$f$.)
On the other hand, the estimates in Theorem~\ref{t9.1} contain the
expected value $EZ=E\left(\sup\limits_{f\in{\cal F}}\sum\limits_{i=1}^n
f(\xi_i)\right)$, and this quantity appears in all concentration type
inequalities. This fact has deep consequences which deserve a more
detailed discussion.
Let us consider Theorem~\ref{t9.1} or one of its improvements and try
to understand what kind of solution they provide for problem~b)
or~b$'$) formulated in Section~\ref{s1} in the case $k=1$. They supply
a good estimate on the probabilities we consider for the numbers
$u\ge n^{-1/2}EZ=n^{-1/2}E(\sup\limits_{f\in{\cal F}}
\sum\limits_{i=1}^n f(\xi_i))$. But to apply these results we need a
good estimate on the expectation $EZ$ of the supremum of the partial
sums we consider, and the proof of such an estimate is a highly
non-trivial problem.
Let us consider problem b$'$) (in the case $k=1$) for such
a class of functions ${\cal F}$ which satisfies the conditions of
Theorem~\ref{t6.3}. The considerations in Section~\ref{s6} show that
there are such classes of functions ${\cal F}$ which satisfy the
conditions of Theorem~\ref{t6.3}, and for which the probability
$P(\sup\limits_{f\in{\cal F}} n^{-1/2}\sum\limits_{i=1}^n f(\xi_i)>
\alpha\sigma \log \frac2\sigma)$ is almost 1 with an appropriate small
number $\alpha>0$ for all large enough sample sizes~$n$. (Here the
number $\sigma$ is the same as in Theorem~\ref{t6.3}.) This means that
$En^{-1/2}Z\ge(\alpha-\varepsilon)\sigma\log\frac2\sigma$ for all
$\varepsilon>0$ if the sample size~$n$ of the sequence
$\xi_1,\dots,\xi_n$ is greater than $n_0=n_0(\varepsilon,\sigma)$.
Some calculation also shows that under the conditions of Theorem~\ref{t6.3}
$En^{-1/2}Z\le K\sigma\log\frac2\sigma$
with an appropriate number $K>0$. (In this calculation some
difficulty may arise, because Theorem~\ref{t6.3} for $k=1$ does not yield
a good estimate if $u\ge\sqrt n\sigma^2$. But we can write
$P(\sup\limits_{f\in{\cal F}} n^{-1/2}\sum\limits_{i=1}^n f(\xi_i)>u)
\le e^{-\alpha(u/\bar\sigma)^2}=e^{-\alpha u\sqrt n}$ with
$\bar\sigma^2=un^{-1/2}$ if $u\ge \sqrt n\sigma^2$, and this
estimate is sufficient for us. We get the upper bound we formulated
for $n^{-1/2}EZ$ from Theorem~\ref{t6.3} only under the condition
$n\sigma^2\ge\textrm{const.}\,\log\frac2\sigma$ with some appropriate
constant. It can be seen that this condition is really needed; it
did not appear because of a weakness of our method. I omit the details
of the calculation.) Then the concentration inequality of Theorem~\ref{t9.1}, or
more precisely its improvement, Theorem~3 in paper~\cite{r26}, which gives
a similar inequality but with the quantity $\sigma^2$ instead
of~$v$, implies Theorem~\ref{t6.3} in the case $k=1$. This means that
Theorem~\ref{t6.3} can be deduced from concentration type inequalities in
the case $k=1$ if we can show that under its conditions
$En^{-1/2}Z\le K\sigma\log\frac2\sigma$ with some appropriate $K>0$
depending only on the exponent and parameter of the $L_2$-dense
class ${\cal F}$. Such an estimate can be proved (see the proof in~\cite{r8}
on the basis of paper~\cite{r33}), but it requires rather long and
non-trivial considerations. I prefer a direct proof of Theorem~\ref{t6.3}.
Finally I discuss a refinement of Theorems~\ref{t4.1} and~\ref{t4.3}
promised in
a remark at the end of Section~\ref{s4} together with some open problems.
\subsection {Some refinements of the estimate in Theorems~\ref{t4.1}
and~\ref{t4.3}}\label{s9.2}
If we have a bound on the $L_2$ and $L_\infty$ norms of the kernel
function~$f$ of a $U$-statistic $I_{n,k}(f)$, but we have no
additional information about the behaviour of~$f$ (and such a situation
is quite common in mathematical statistics problems), then the estimate
of Theorem~\ref{t4.3} about the distribution of $U$-statistics cannot be
considerably improved. On the other hand, one would like to prove such
a multivariate version of the large deviation theorem about
partial sums of independent random variables which gives a good
asymptotic formula for the probability $P(n^{-k/2}I_{n,k}(f)>u)$ for
large values~$u$. Such an estimate should depend on the function~$f$. A
similar question can be posed about the distribution of multiple
Wiener--It\^o integrals $Z_{\mu,k}(f)$ if $k\ge2$, because the distribution
of such random integrals (unlike in the case $k=1$) is not
determined by their variance.
Such large deviation problems are very hard, and I know of no result in
this direction. On the other hand, some quantities can be introduced
whose knowledge enables us to give a better estimate on the
distribution of Wiener--It\^o integrals or $U$-statistics. Such
results were known earlier for Wiener--It\^o integrals
$Z_{\mu,2}(f)$ and $U$-statistics $I_{n,2}(f)$ of order~2, and
quite recently they were generalized to all $k\ge2$. I describe them
and show that they are useful in the solution of some problems. My
formulation will differ a little bit from the previous ones. In
particular, I shall speak about Wiener--It\^o integrals where previous
authors considered only polynomials of Gaussian random vectors. But the
Wiener--It\^o integral presentation of these results seems more
natural to me. First I formulate the estimate about Wiener--It\^o
integrals of order~2 proved in~\cite{r12} by Hanson and Wright.
%\medskip\noindent
\begin{thm}\label{t9.2}
%{\bf Theorem 9.2.}
{Let a two-fold Wiener--It\^o integral
$$
Z_{\mu,2}(f)=\int f(x,y)\mu_W(\,dx)\mu_W(\,dy)
$$
be given, where $\mu_W$ is a white noise with a non-atomic reference
measure $\mu$, and the function $f$ satisfies the inequalities
\begin{equation}
\int f(x,y)^2\mu(\,dx)\mu(\,dy)\le \sigma^2 \label{9.4}
\end{equation}
and
\begin{equation}
\int f(x,y)g_1(x)g_2(y)\mu(\,dx)\mu(\,dy)\le D \label{9.4a}
\end{equation}
with some number $D>0$ for all functions $g_1$ and $g_2$ such that
$\int g_j^2(x)\mu(\,dx)\le1$, $j=1,2$.
There exists a universal constant $K>0$ such that the inequality
\begin{equation}
P(|Z_{\mu,2}(f)|>u)\le K\exp\left\{-\frac1K\min
\left(\frac{u^2}{\sigma^2},\frac uD\right)\right\} \label{9.5}
\end{equation}
holds for all $u>0$.}
\end{thm}
As was remarked in Section~\ref{s4}, we can assume without loss of
generality that the function $f$ in the definition of Wiener--It\^o
integrals is symmetric. In this case Theorem~\ref{t9.2} can be reformulated
as a simpler statement.
To do this let us define with the help of the (symmetric) function $f$
the following so-called Hilbert--Schmidt operator $A_f$ in the
$L_2(\mu)$ space of square integrable functions with respect to the
measure~$\mu$: $A_fv(x)=\int f(x,y)v(y)\mu(\,dy)$ for all functions
$v\in L_2(\mu)$. It is known that $A_f$ is a compact,
self-adjoint operator, hence it has a discrete spectrum. Let
$\lambda_1,\lambda_2,\dots$ denote the eigenvalues of the operator
$A_f$. It follows from the theory of Hilbert--Schmidt operators and the
It\^o formula for multiple Wiener--It\^o integrals that the identity
$Z_{\mu,2}(f)=\sum\limits_{j=1}^\infty\lambda_j(\eta_j^2-1)$ holds with
some appropriately defined independent standard normal random
variables $\eta_1,\eta_2,\dots$. Besides this,
$\sum\limits_{j=1}^\infty\lambda_j^2=\int f^2(x,y)\mu(\,dx)\mu(\,dy)$.
Hence condition (\ref{9.4}) can be reformulated as
$\sum\limits_{j=1}^\infty\lambda_j^2\le\sigma^2$, and
condition~(\ref{9.4a})
is equivalent to the statement that $\sup\limits_j |\lambda_j|\le D$.
In such a way Theorem~\ref{t9.2} can be reduced to another statement whose
proof is simpler.
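The role of the two terms in the exponent of (\ref{9.5}) may be
illuminated by two simple examples. If the operator $A_f$ has a single
non-zero eigenvalue $\lambda_1=\sigma$, then $D=\sigma$,
$Z_{\mu,2}(f)=\sigma(\eta_1^2-1)$, and
$$
P(Z_{\mu,2}(f)>u)\sim\textrm{const.}\,u^{-1/2}e^{-u/(2\sigma)}
\quad\textrm{as } u\to\infty,
$$
hence the term $\frac uD$ in the exponent of (\ref{9.5}) cannot be
dropped. If, on the other hand, $A_f$ has $N$ non-zero eigenvalues
$\lambda_j=\frac\sigma{\sqrt N}$, $1\le j\le N$, then
$D=\frac\sigma{\sqrt N}$, and
$Z_{\mu,2}(f)=\frac\sigma{\sqrt N}\sum\limits_{j=1}^N(\eta_j^2-1)$ is
asymptotically normal with variance $2\sigma^2$ as $N\to\infty$. In
this case the Gaussian term $\frac{u^2}{\sigma^2}$ describes the tail
behaviour up to levels $u\sim\frac{\sigma^2}D=\sigma\sqrt N$.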
Theorem~\ref{t9.2} yields a useful estimate if $D^2\ll\sigma^2$. In this case
it states that for large numbers $u$ the bound $P(Z_{\mu,2}(f)>u)
\le\textrm{const.}\,e^{-u/(2\sigma)}$ supplied by Theorem~\ref{t4.1} can be
improved to the bound $P(Z_{\mu,2}(f)>u)\le\textrm{const.}\,e^{-u/(KD)}$.
The correction term $\frac{u^2}{\sigma^2}$ at the right-hand side
of (\ref{9.5}) is needed to get an estimate which holds for all $u>0$.
It may be worthwhile to recall the following result (see~\cite{r27}
or~\cite{r17}, Theorem~6.6). All $k$-fold Wiener--It\^o
integrals $Z_{\mu,k}(f)$
satisfy the inequality $P(|Z_{\mu,k}(f)|>u)>Ke^{-Au^{2/k}}$
with some $K=K(f,\mu)>0$ and $A=A(f,\mu)>0$. The number $A=A(f,\mu)$
in the exponent of the last relation is strictly positive,
but the proof of~\cite{r27} yields no explicit bound on it.
There is a similar estimate about the distribution
of degenerate $U$-statistics of order~2. This is the content of
the following Theorem~\ref{t9.3}.
%\medskip\noindent
\begin{thm}\label{t9.3}
%{\bf Theorem 9.3.}
{Let a sequence $\xi_1,\dots,\xi_n$ of
independent $\mu$ distributed random variables be given together with
a function $f(x,y)$ canonical with respect to the measure~$\mu$, and
consider the (degenerate) $U$-statistic $I_{n,2}(f)$ defined in (\ref{1.3})
with the help of the above quantities. Let us assume that the
function~$f$ satisfies conditions~(\ref{9.4}) and~(\ref{9.4a}) with some
$\sigma>0$ and $D>0$, and also the relations
\begin{equation}
\sup_x\int f^2(x,y)\mu(\,dy)\le A_1,\quad
\sup_y\int f^2(x,y)\mu(\,dx)\le A_2,\quad
\sup_{x,y}|f(x,y)|\le B \label{9.6}
\end{equation}
hold with some appropriate constants $A_1>0$, $A_2>0$ and $B>0$. Then
there exists a universal constant $K>0$ such that the inequality
\begin{equation}
P\left(n^{-1}|I_{n,2}(f)|>u\right)\le K\exp\left\{-\frac1K\min
\left(\frac{u^2}{\sigma^2}, \frac uD,\frac{n^{1/3}u^{2/3}}
{(A_1+A_2)^{1/3}},\frac{n^{1/2}u^{1/2}}{B^{1/2}}\right)
\right\} \label{9.7}
\end{equation}
is valid for all $u>0$.}
\end{thm}
Theorem~\ref{t9.3} was proved in \cite{r9}. The estimate of
Theorem~\ref{t9.3} is
similar to that of Theorem~\ref{t9.2}, the difference between them is that in
formula~(\ref{9.7}) some additional correction terms had to be inserted
to make it valid for all $u>0$. But the proof of Theorem~\ref{t9.3} is much
harder. It can be shown that the estimate (\ref{9.7}) implies that of
Theorem~\ref{t4.3} in the special case $k=2$ if we disregard the appearance
of the not explicitly defined universal constant~$K$ in it.
To see this observe that Theorem~\ref{t4.3} contains the conditions
$u\le n^{1/2}\sigma^2$ and $B\le1$, and in this case also
$A_j\le B^2\le1$, $j=1,2$. These relations imply that
$\frac{n^{1/3}u^{2/3}}{(A_1+A_2)^{1/3}}\ge\frac1{\sqrt2}
\left(\frac u{\sigma^2}\right)^{2/3}u^{2/3}=\frac1{\sqrt2}
\left(\frac u\sigma\right)^{4/3}$,
and $\frac{n^{1/2}u^{1/2}}{B^{1/2}}\ge
\frac u{\sigma^2}u^{1/2}=\sigma^{-1/2}\left(\frac u\sigma\right)^{3/2}
\ge\left(\frac u\sigma\right)^{3/2}$, since $\sigma\le1$ in this case.
Besides this, $\frac uD\ge\frac u\sigma$, since by the Schwarz
inequality the number $D$ in condition~(\ref{9.4a}) can be chosen so
that $D\le\sigma$. The above relations imply that
in the case $u\ge\sigma$ the estimate (\ref{9.7}) is weakened if the
expression in its exponent is replaced by
$\frac1{\sqrt2K}\frac u\sigma$.
Theorem~\ref{t4.3} trivially holds if $0\le u\le\sigma$.
Theorem~\ref{t9.3} is useful in problems where a refinement of the
estimate in Theorem~\ref{t4.3} is needed, one which better exploits the
properties of the kernel function~$f$ of a degenerate $U$-statistic of
order~2.
Such a situation appears in paper~\cite{r10}, where the law of the
iterated logarithm is investigated for degenerate $U$-statistics of
order~2.
Let us consider an infinite sequence $\xi_1,\xi_2,\dots$ of independent
$\mu$ distributed random variables together with a function $f$
canonical with respect to the measure~$\mu$, and define the degenerate
$U$-statistic $I_{n,2}(f)$ with their help for all $n=1,2,\dots$. In
paper~\cite{r10} the necessary and sufficient condition of the law of
the iterated logarithm is given for such a sequence. More explicitly,
it is proved
that
$$
\limsup_{n\to\infty}\frac{|I_{n,2}(f)|}{n\log \log n}<\infty \quad
\textrm{with probability 1}
$$
if and only if the following two conditions are satisfied:
\makeatletter
\renewcommand{\theenumi}{\alph{enumi}}
\renewcommand{\labelenumi}{\theenumi)}
\makeatother
%\advance \baselineskip1pt
\medskip
\begin{enumerate}
\item $\int_{\{(x,y)\colon |f(x,y)|\le u\}}
f^2(x,y)\mu(\,dx)\mu(\,dy)\le C\log\log u$
with some $C<\infty$ for all $u\ge10$.
%\advance \baselineskip-1pt
\medskip
\item $\int f(x,y)g(x)h(y)\mu(\,dx)\mu(\,dy)\le C$
with some appropriate $C<\infty$ for all such pairs of functions
$g$ and $h$ which satisfy the relations $\int g^2(x)\mu(\,dx)\le1$,
$\int h^2(x)\mu(\,dx)\le1$, $\sup\limits_x|g(x)|<\infty$,
$\sup\limits_x|h(x)|<\infty$.
\end{enumerate}
\medskip
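Let us remark that in the case $Ef^2(\xi_1,\xi_2)<\infty$ both
conditions are clearly satisfied. Condition~a) holds, since the
integrals at its left-hand side are bounded by
$\int f^2(x,y)\mu(\,dx)\mu(\,dy)$, and $\log\log u$ is bounded away
from zero for $u\ge10$. Condition~b) follows from the Schwarz
inequality, because
$$
\int f(x,y)g(x)h(y)\mu(\,dx)\mu(\,dy)\le
\left(\int f^2(x,y)\mu(\,dx)\mu(\,dy)\right)^{1/2}
\left(\int g^2\,d\mu\right)^{1/2}\left(\int h^2\,d\mu\right)^{1/2}.
$$
\medskip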
The above result is proved by means of a clever truncation of the
terms in the $U$-statistics and an application of the estimate of
Theorem~\ref{t9.3} to these truncated $U$-statistics. It has the form one
would expect by analogy with the classical law of the iterated logarithm
for sums of independent, identically distributed random variables with
expectation zero, but it also has an interesting, unexpected feature.
The classical law of the iterated logarithm for sums of i.i.d.\ random
variables holds if and only if the terms in the sum have finite
variance. (The ``only if'' part is proved in~\cite{r7}
and~\cite{r30}.) The above
formulated law of the iterated logarithm for degenerate $U$-statistics
also holds in the case of a finite second moment, i.e.\ if
$Ef^2(\xi_1,\xi_2)<\infty$, but, as the authors of~\cite{r10} show by
an example, there are also cases when it holds although
$Ef^2(\xi_1,\xi_2)=\infty$. Paper~\cite{r11} is another example where
Theorem~\ref{t9.3} can be successfully applied to solve certain problems.
\medskip
To formulate the generalization of Theorems~\ref{t9.2} and~\ref{t9.3}
for general
$k\ge2$ some notations have to be introduced. Given a finite set $A$
let ${\cal P}(A)$ denote the set of all its partitions. If a partition
$P=\{B_1,\dots,B_s\}\in{\cal P}(A)$ consists of $s$ elements then we
say that this partition has order~$s$, and write $|P|=s$. In the
special case $A=\{1,\dots,k\}$ the notation ${\cal P}(A)={\cal P}_k$
will be used. Given a measurable space $(X,{\cal X})$ with a
probability measure $\mu$ on it together with a finite set
$B=\{b_1,\dots,b_j\}$ let us introduce the following notations. Take
$j$ different copies $(X_{b_r},{\cal X}_{b_r})$ and $\mu_{b_r}$,
$1\le r\le j$, of this measurable space and probability measure indexed
by the elements of the set $B$, and define their product
$(X^{(B)},{\cal X}^{(B)},\mu^{(B)})=\left(\prod\limits_{r=1}^j X_{b_r},
\prod\limits_{r=1}^j{\cal X}_{b_r},
\prod\limits_{r=1}^j\mu_{b_r}\right)$. The points
$(x_{b_1},\dots,x_{b_j})\in X^{(B)}$ will be denoted by
$x^{(B)}\in X^{(B)}$ in the sequel. With the help of the above
notations I introduce the quantities needed in the formulation of the
generalization of Theorems~\ref{t9.2} and~\ref{t9.3}.
Let a function $f=f(x_1,\dots,x_k)$ be given on the $k$-fold product
$(X^k,{\cal X}^k,\mu^k)$ of a measurable space $(X,{\cal X})$ with a
probability measure $\mu$. For all partitions
$P=\{B_1,\dots,B_s\}\in{\cal P}_k$ of the set $\{1,\dots,k\}$ consider
the functions $g_r\left(x^{(B_r)}\right)$ on the space $X^{(B_r)}$,
$1\le r\le s$, and define with their help the quantity
\begin{eqnarray}
\alpha(P)&&\hskip-0.6truecm =\alpha(P,f,\mu) \label{9.8}\\
&&\hskip-0.6truecm =\sup_{g_1,\dots,g_s}
\biggl\{\int f(x_1,\dots,x_k)
g_1\left(x^{(B_1)}\right)\cdots g_s\left(x^{(B_s)}\right)\mu(\,dx_1)
\dots\mu(\,dx_k)\colon \nonumber \\
&&\qquad\qquad \int g_r^2\left(x^{(B_r)}\right)\mu^{(B_r)}
\left(\,dx^{(B_r)}\right)\le1
\quad \textrm{for all } 1\le r\le s\biggr\}. \nonumber
\end{eqnarray}
In the estimation of Wiener--It\^o integrals of order~$k$ the
quantities $\alpha(P)$, $P\in{\cal P}_k$, play the same role as the
numbers $\sigma^2$ and $D$ introduced in formulas~(\ref{9.4})
and~(\ref{9.4a}) in Theorem~\ref{t9.2}. Observe that in the case $|P|=1$,
i.e. if $P=\{1,\dots,k\}$ the identity $\alpha^2(P)=\int
f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)$ holds. The following
estimate is valid for Wiener--It\^o integrals of general order
(see~\cite{r15}).
%\medskip\noindent
\begin{thm}\label{t9.4}
%{\bf Theorem 9.4.}
{Let a $k$-fold Wiener--It\^o integral
$Z_{\mu,k}(f)$, $k\ge1$, be defined with the help of a white noise
$\mu_W$ with a non-atomic reference measure~$\mu$ and a kernel
function $f$ of $k$ variables such that
$\int f^2(x_1,\dots,x_k)\mu(\,dx_1)\dots\mu(\,dx_k)<\infty$. There is
some universal constant $C(k)<\infty$ depending only on the order~$k$
of the random integral such that the inequality
\begin{equation}
P(|Z_{\mu,k}(f)|>u)\le C(k)\exp\left\{-\frac1{C(k)}\min_{1\le s\le k}
\min_{P\in{\cal P}_k,\,|P|=s} \left(\frac u{\alpha(P)}
\right)^{2/s}\right\}
\label{9.9}
\end{equation}
holds for all $u>0$ with the quantities $\alpha(P)$, $P\in{\cal P}_k$,
defined in formula~(\ref{9.8}).}
\end{thm}
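It may be worth observing how Theorem~\ref{t9.4} contains
Theorem~\ref{t9.2} in the special case $k=2$. The set $\{1,2\}$ has two
partitions, $P_1=\{\{1,2\}\}$ and $P_2=\{\{1\},\{2\}\}$, and
$$
\alpha(P_1)=\left(\int f^2(x,y)\mu(\,dx)\mu(\,dy)\right)^{1/2},
$$
while $\alpha(P_2)$ is the smallest number $D$ for which
inequality~(\ref{9.4a}) holds. Hence the exponent in (\ref{9.9})
equals $-\frac1{C(2)}\min\left(\frac{u^2}{\alpha^2(P_1)},
\frac u{\alpha(P_2)}\right)$, which is the exponent appearing in
formula~(\ref{9.5}).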
The following converse estimate also holds, and it shows that the
above estimate is sharp. (See again paper~\cite{r15}.) This estimate
also yields an improvement of the result in~\cite{r27} mentioned in
this subsection.
%%%%%%%%%%%%%%%%%%
%\nextcount{9.4$'$}
\def\thethm{9.4$'$}
\begin{thm}\label{t9.4'}
%{\bf Theorem $9.4'$.}
{The random integral $Z_{\mu,k}(f)$
considered in Theorem~\ref{t9.4} also satisfies the inequality
$$
P(|Z_{\mu,k}(f)|>u)\ge \frac1{C(k)}\exp\left\{-C(k)\min_{1\le s\le k}
\min_{P\in{\cal P}_k,\,|P|=s} \left(\frac u{\alpha(P)}
\right)^{2/s}\right\}
$$
for all $u>0$ with some universal constant $C(k)>0$ depending only on
the order~$k$ of the integral and the quantities $\alpha(P)$,
$P\in{\cal P}_k$, defined in formula~(\ref{9.8}).}
\end{thm}
\def\thethm{\arabic{section}.\arabic{thm}}
\setcounter{thm}{4}
To formulate the result about the distribution of degenerate
$U$-statistics for all $k\ge2$ an analog of the expression $\alpha(P)$
defined in (\ref{9.8}) has to be introduced. Let us consider a set
$A\subset \{1,\dots,k\}$ with $|A|=k-r$ elements, $0\le r\le k$,
together with a partition $P=\{B_1,\dots,B_s\}\in{\cal P}(A)$ of this
set~$A$. With their help certain quantities
$\alpha(A,P)=\alpha(A,P,f,\mu)$, the natural analogs of the expression
$\alpha(P)$ introduced in (\ref{9.8}), are defined in formulas
(\ref{9.10}), (\ref{9.11}) and (\ref{9.12}); in the case $r=0$, i.e.\
if $A=\{1,\dots,k\}$, they coincide with the quantities $\alpha(P)$,
$P\in{\cal P}_k$. The following estimate holds with their help.
%\medskip\noindent
\begin{thm}\label{t9.5}
%{\bf Theorem 9.5.}
{Let a sequence $\xi_1,\dots,\xi_n$ of independent $\mu$ distributed
random variables be given together with a bounded function
$f(x_1,\dots,x_k)$ canonical with respect to the measure~$\mu$, and
consider the degenerate $U$-statistic $I_{n,k}(f)$, $k\ge2$, defined
in (\ref{1.3}). There exists a constant $C=C(k)>0$ depending only on
the order~$k$ such that the inequality
\begin{eqnarray}
&&\hskip-1truecm P\left(n^{-k/2}|I_{n,k}(f)|>u\right)\nonumber\\
&&\hskip-1truecm \qquad \le C\exp\left\{-\frac1{C}
\min_{\{(A,P)\colon A\subset\{1,\dots,k\},\,P\in{\cal P}(A)\}}
\left(\frac{n^{r/2}u}{\alpha(A,P)}\right)^{2/(2r+s)}\right\}
\nonumber
\end{eqnarray}
holds for all $u>0$, where $r=k-|A|$ and $s=|P|$, with the above
constant $C$ and the quantities
$\alpha(A,P)$ defined in (\ref{9.10}), (\ref{9.11}) and (\ref{9.12}).}
\end{thm}
It can be seen with the help of some calculation that Theorem~\ref{t9.5}
implies Theorem~\ref{t4.3} for all orders $k\ge2$ if we disregard the
presence of the unspecified universal constant~$C$. (It has to be
exploited that under the conditions of Theorem~\ref{t4.3}
$\alpha^2(A,P)\le\sigma^2$ if $|A|=k-r$ with $r=0$, $\alpha(A,P)\le1$
if $|A|=k-r$ with $r\ge1$, $\sigma^2\le1$, and $n^{k/2}\sigma^{k+1}\ge u$.)
The proof of Theorems~\ref{t9.4} and~\ref{t9.5} is based, similarly
to the proof of Theorems~\ref{t4.1} and~\ref{t4.3}, on a good estimate
of the (possibly high) moments of Wiener--It\^o integrals and degenerate
$U$-statistics. The proofs
of these estimates in~\cite{r1} and~\cite{r15} are based on many deep
and hard inequalities of different authors. One may ask whether the
diagram formula propagated in this work, which gives an explicit
formula for these moments, could be applied in the proof of these
results. I believe the answer to this question is affirmative, and I
even have some ideas about how to carry out such a program. But at the
time of writing this work I did not have enough time to work out the
details.
A natural open problem is to find the large deviation estimates about
the tail distribution of multiple Wiener--It\^o integrals and
$U$-statistics mentioned at the start of this subsection. Such results
may better explain why the quantities $\alpha(P)$ and $\alpha(A,P)$
appear in the estimates of Theorems~\ref{t9.4} and~\ref{t9.5}. It would be
interesting to find the true value of the universal constants in these
estimates or to get at least some partial results in this direction
which would help in solving the following problem:
\medskip\noindent
{\bf Problem.} {\it Consider a $k$-fold multiple Wiener--It\^o integral
$Z_{\mu,k}(f)$. Show that its distribution satisfies the relation
$$
\lim_{u\to\infty}u^{-2/k}\log P(|Z_{\mu,k}(f)|>u)=-K(\mu,f)
$$
with some number $K(\mu,f)>0$, and determine its value.}
\medskip
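Let us remark that in the simplest case $k=1$ the answer to this
problem is classical. The integral $Z_{\mu,1}(f)$ is a normally
distributed random variable with expectation zero and variance
$\sigma^2=\int f^2(x)\mu(\,dx)$, hence
$$
\lim_{u\to\infty}u^{-2}\log P(|Z_{\mu,1}(f)|>u)=-\frac1{2\sigma^2}.
$$
The problem is open for $k\ge2$.
\medskip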
Some other natural problems arise in connection with the above
results. For instance, it was assumed in all estimates about
$U$-statistics discussed in this work that their kernel functions are
bounded. A closer study of this condition deserves some attention. It
was explained in this paper that its role is to exclude the appearance
of certain irregular events of relatively large probability, whose
presence would imply that only weak estimates hold in some cases
interesting for us.
One may ask whether this condition could be replaced by a weaker and
more appropriate one in certain problems.
Finally, I mention the following problem.
\medskip\noindent
{\bf Problem.} {\it Prove an estimate analogous to the result of
Theorem~\ref{t9.5} about the supremum of appropriate classes of
$U$-statistics.}
\medskip
To solve the above problem one has to overcome some difficulties. In
particular, to adapt the method of proof of the previous results one
has to prove such a generalization of the multivariate version of
Hoeffding's inequality (see~\cite{r21}) about the distribution of
homogeneous polynomials of Rademacher functions in which the bound
depends not only on the variance of these random polynomials, but also
on some quantities analogous to the expression $\alpha(P)$ introduced
in~(\ref{9.8}).
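To put this problem into perspective let us recall Hoeffding's
classical inequality in the simplest case $k=1$. If
$\varepsilon_1,\dots,\varepsilon_n$ are independent random variables
with $P(\varepsilon_j=1)=P(\varepsilon_j=-1)=\frac12$, and
$a_1,\dots,a_n$ are real numbers, then
$$
P\left(\left|\sum_{j=1}^n a_j\varepsilon_j\right|>u\right)\le
2\exp\left\{-\frac{u^2}{2\sum\limits_{j=1}^n a_j^2}\right\}
\quad\textrm{for all } u>0,
$$
i.e.\ in this case the bound depends only on the variance
$\sum\limits_{j=1}^n a_j^2$ of the sum.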
\begin{thebibliography}{9}
\bibitem{r1}
\textsc{Adamczak, R.} (2005) Moment inequalities for $U$-statistics.
\hfill\break
Available at \url{http://www.arxiv.org/abs/math.PR/0506026}
\bibitem{r2}
\textsc{Alexander, K.} (1984) Probability inequalities for
empirical processes and a law of the iterated logarithm.
\textit{ Annals of Probability} \textbf{12}, 1041--1067
\MR{757769}
\bibitem{r3}
\textsc{Arcones, M. A.} and \textsc{Gin\'e, E.} (1993) Limit theorems
for $U$-processes. \textit{ Annals of Probability} \textbf{21}, 1494--1542
\MR{1235426}
\bibitem{r4}
\textsc{de la Pe\~na, V. H.} and \textsc{Montgomery-Smith, S.} (1995)
Decoupling inequalities for the tail-probabilities of multivariate
$U$-statistics. \textit{Annals of Probability}, \textbf{23}, 806--816
\MR{1334173}
\bibitem{r5}
\textsc{Dudley, R. M.} (1998) \textit{Uniform Central Limit
Theorems.}\/ Cambridge University Press, Cambridge U.K.
\MR{1720712}
\bibitem{r6}
\textsc{Dynkin, E. B.} and \textsc{Mandelbaum, A.} (1983) Symmetric
statistics, Poisson processes and multiple Wiener integrals.
\textit{Annals of Statistics\/} \textbf{11}, 739--745
\MR{707925}
\bibitem{r7}
\textsc{Feller, W.} (1968) An extension of the law of the iterated
logarithm to variables without variance. \textit{Journal of Mathematics
and Mechanics} \textbf{18} 343--355
\MR{233399}
\bibitem{r8}
\textsc{Gin\'e, E.} and \textsc{Guillou, A.} (2001) On consistency of kernel
density estimators for randomly censored data: Rates holding uniformly
over adaptive intervals. \textit{Ann. Inst. Henri Poincar\'e PR\/}
\textbf{37} 503--522
\MR{1876841}
\bibitem{r9}
\textsc{Gin\'e, E.}, \textsc{Lata\l{}a, R.} and
\textsc{Zinn, J.} (2000) Exponential and
moment inequalities for $U$-statistics in \textit{High dimensional
probability II.} Progress in Probability 47. 13--38. Birkh\"auser
Boston, Boston, MA.
\MR{1857312}
\bibitem{r10}
\textsc{Gin\'e, E.}, \textsc{Kwapie\'n, S.}, \textsc{Lata\l{}a, R.}
and \textsc{Zinn, J.}
(2001) The LIL for canonical $U$-statistics of order~2.
\textit{Annals of Probability} \textbf{29} 520--527
\MR{1825163}
\bibitem{r11}
\textsc{Gin\'e, E.} and \textsc{Mason, D. M.} (2004) The law of the iterated
logarithm for the integrated squared deviation of a kernel density
estimator. \textit{Bernoulli} \textbf{10} 721--752
\MR{2076071}
\bibitem{r12}
\textsc{Hanson, D. L.} and \textsc{Wright, F. T.} (1971) A bound on the tail
probabilities for quadratic forms in independent random variables.
\textit{Ann. Math. Statist.} \textbf{42} 1079--1083
\MR{279864}
\bibitem{r13}
\textsc{Hoeffding, W.} (1948) A class of statistics with
asymptotically normal distribution. \textit{Ann. Math. Statist.}
\textbf{19} 293--325
\MR{26294}
\bibitem{r14}
\textsc{It\^o, K.} (1951) Multiple Wiener integral. \textit{J. Math.
Soc. Japan}\/ \textbf{3}, 157--164
\MR{44064}
\bibitem{r15}
\textsc{Lata\l{a}, R.} (2005) Estimates of moments and tails of
Gaussian chaoses. \hfill\break
Available at \url{http://www.arxiv.org/abs/math.PR/0505313}
\bibitem{r16}
\textsc{Ledoux, M.} (1996) On Talagrand deviation inequalities for
product measures. \textit{ESAIM: Probab. Statist.}\/ \textbf{1.}
63--87. \hfill\break
Available at \url{http://www.emath.fr/ps/}.
\MR{1399224}
\bibitem{r17}
\textsc{Major, P.} (1981) Multiple Wiener--It\^o integrals.
\textit{Lecture Notes in Mathematics\/} \textbf{849}, Springer Verlag,
Berlin, Heidelberg, New York,
\MR{611334}
\bibitem{r18}
\textsc{Major, P.} (2005) An estimate about multiple stochastic
integrals with respect to a normalized empirical measure.
\textit{Studia Scientarum Mathematicarum Hungarica.}
\textbf{42} (3) 295--341
\bibitem{r19}
\textsc{Major, P.} (2006) An estimate on the maximum of a nice
class of stochastic integrals. \textit{Probability
Theory and Related Fields.} \textbf{134} (3) 489--537
%\hfill\break available at the homepage
%\url{http://dx.doi.org/10.1007/s00440-005-0440-9}
%or http://www.renyi.hu/\~{}major
\bibitem{r20}
\textsc{Major, P.}
(2005) On a multivariate version of
Bernstein's inequality. Submitted to \textit{Annals of Probability}.
\hfill\break
Available at the homepage
\url{http://www.renyi.hu/~major}
\bibitem{r21}
\textsc{Major, P.} (2005) A multivariate generalization of
Hoeffding's inequality. Submitted to \textit{Annals of Probability}.
\hfill\break
Available at the homepage
\url{http://www.renyi.hu/~major}
\bibitem{r22}
\textsc{Major, P.} (2005) On the tail behaviour of multiple
random integrals and degenerate $U$-statistics. (manuscript for a
future Lecture Note) \hfill\break
Available at the homepage
\url{http://www.renyi.hu/~major}
\bibitem{r23}
\textsc{Major, P.} and \textsc{Rejt\H{o}, L.} (1988) Strong embedding of
the distribution function under random censorship.
\textit{Annals of Statistics}, \textbf{16}, 1113--1132
\MR{959190}
\bibitem{r24}
\textsc{Major, P.} and \textsc{Rejt\H{o}, L.} (1998) A note on
nonparametric estimations. \textit{A volume in Honour of
Mikl\'os Cs\"org\H{o}.} North Holland 759--774
\MR{1661516}
\bibitem{r25}
\textsc{Malyshev, V. A.} and \textsc{Minlos, R. A.} (1991)
\textit{Gibbs Random
Fields. Method of cluster expansion.} Kluwer, Academic Publishers,
Dordrecht
\MR{1191166}
\bibitem{r26}
\textsc{Massart, P.} (2000) About the constants in Talagrand's
concentration inequalities for empirical processes.
\textit{ Annals of Probability}\/ \textbf{28}, 863--884
\MR{1782276}
\bibitem{r27}
\textsc{McKean, H. P.} (1973) Wiener's theory of non-linear noise.
in \textit{ Stochastic Differential Equations} SIAM--AMS Proc. 6 197--209
\MR{353471}
\bibitem{r28}
\textsc{Pollard, D.} (1984) \textit{Convergence of Stochastic
Processes.}\/ Springer Verlag, New York
\MR{762984}
\bibitem{r29}
\textsc{Rota, G.-C.} and \textsc{Wallstrom, T. C.} (1997) Stochastic integrals:
a combinatorial approach. \textit{Annals of Probability} \textbf{25} (3)
1257--1283
\MR{1457619}
\bibitem{r30}
\textsc{Strassen, V.} (1966) A converse to the law of the iterated
logarithm. \textit{Z. Wahrscheinlichkeitstheorie} \textbf{4} 265--268
\MR{200965}
\bibitem{r31}
\textsc{Surgailis, D.} (2003) CLTs for polynomials of linear
sequences: Diagram formula with illustrations. in \textit{Long Range
Dependence} 111--127. Birkh\"auser Boston, Boston, MA.
\MR{1956046}
\bibitem{r32}
\textsc{Takemura, A.} (1983) Tensor Analysis of ANOVA decomposition.
\textit{J. Amer. Statist. Assoc.} \textbf{78}, 894--900
\MR{727575}
\bibitem{r33}
\textsc{Talagrand, M.} (1994) Sharper bounds for Gaussian and
empirical processes. \textit{Ann. Probab.} \textbf{22}, 28--76
\MR{1258865}
\bibitem{r34}
\textsc{Talagrand, M.} (1996) New concentration inequalities
in product spaces. \textit{Invent. Math.} \textbf{126}, 505--563
\MR{1419006}
\end{thebibliography}
\end{document}